D3D12-CUDA RGB Pipeline Design - PROJECT STATUS ✅

Project: D3D12-CUDA NV12 Interop Alternative Solutions
Status: ✅ SUPERSEDED BY CUDA SURFACE OBJECTS
Date Started: 2025-10-05
Date Completed: 2025-10-06 (via alternative solution)
Related: CUDA_Surface_Object_Refactoring_Completed.md

Executive Summary

Original Problem

D3D12-CUDA NV12 interop failed due to UV plane copy errors caused by tiled texture memory layouts.

Planned Solution (This Document)

NVIDIA NPP-based RGB Pipeline:

NVDEC → CUDA NV12 → NPP NV12→RGB conversion → D3D12 RGB texture (ROW_MAJOR)
Rationale: D3D12 supports ROW_MAJOR for RGB textures, enabling linear CUDA access
Expected benefits: Zero-copy GPU pipeline, simpler D3D12 renderer (no YUV→RGB shader)

✅ Actual Solution Implemented

CUDA Surface Objects with Direct Texture Writing (see CUDA_Surface_Object_Refactoring_Completed.md):

NVDEC → CUDA NV12 → YUV→RGBA Conversion → surf2Dwrite() to D3D12 tiled texture
Key Innovation: Uses cudaExternalMemoryGetMappedMipmappedArray() + CUDA surface objects
Result: Properly handles D3D12 tiled layout without requiring ROW_MAJOR or NPP

Project Status Breakdown

✅ Completed Items (from original plan)

1. D3D12 ROW_MAJOR Layout Investigation

Status: ✅ COMPLETED - Critical findings documented

Findings:

✅ Confirmed D3D12 does NOT support D3D12_TEXTURE_LAYOUT_ROW_MAJOR for R8/RG8 2D textures
✅ Verified RGBA textures (R8G8B8A8_UNORM) CAN use ROW_MAJOR layout
✅ Microsoft D3D12 specification limits documented:
- ROW_MAJOR is generally valid only for buffers and Texture1D
- Texture2D normally requires D3D12_TEXTURE_LAYOUT_UNKNOWN (tiled/swizzled); R8G8B8A8_UNORM was the exception observed in the test below

Test Results (separate-texture-test):

// R8_UNORM with ROW_MAJOR → FAILED (0x80070057 E_INVALIDARG)
// R8G8B8A8_UNORM with ROW_MAJOR → SUCCESS (0x00000000)

Impact: Confirmed that separate Y/UV texture approach is NOT viable

2. CUDA External Memory API Analysis

Status: ✅ COMPLETED - Existing implementation reviewed

Findings:

✅ ExternalMemoryCache.cpp already implements proper D3D12-CUDA interop
✅ Uses correct cudaExternalMemoryHandleTypeD3D12Resource API
✅ Implements caching to avoid repeated imports
✅ Returns CUdeviceptr with zero-copy GPU VRAM mapping

Note: This analysis led to the discovery of the mipmapped array approach that was ultimately used

3. Zero-Copy GPU Pipeline Architecture Validation

Status: ✅ COMPLETED - Validated and IMPROVED

Original Plan:

NVDEC → CUDA NV12 → NPP RGB → D3D12 RGB (ROW_MAJOR)

Implemented Solution (better):

NVDEC → CUDA NV12 → RGBA Conversion → CUDA Surface → D3D12 Tiled Texture

Why Better:

No NPP library dependency needed
No ROW_MAJOR requirement (works with native D3D12 tiled layout)
Single CUDA kernel handles YUV→RGBA + tiled write
Simpler implementation, better performance
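
The per-pixel work of such a conversion kernel can be sketched on the CPU. The C++ reference below is a minimal illustration, not the project's kernel. The names `yuvToRgbaBt709` and `nv12ToRgba`, the BT.709 limited-range coefficients, and the assumption that the UV plane starts at `pitch * height` are all choices made for this example. A real decoder may place the UV plane at an aligned offset, and a stream may signal a different color matrix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical output pixel type (matches an R8G8B8A8 byte order).
struct Rgba8 { std::uint8_t r, g, b, a; };

// Clamp to [0, 255] and round to the nearest integer.
inline std::uint8_t clampToByte(float v) {
    const float clamped = std::min(255.0f, std::max(0.0f, v));
    return static_cast<std::uint8_t>(clamped + 0.5f);
}

// Assumption: BT.709, limited ("video") range input.
inline Rgba8 yuvToRgbaBt709(std::uint8_t y, std::uint8_t u, std::uint8_t v) {
    const float c = 1.164383f * (static_cast<float>(y) - 16.0f);
    const float d = static_cast<float>(u) - 128.0f;  // Cb offset
    const float e = static_cast<float>(v) - 128.0f;  // Cr offset
    return { clampToByte(c + 1.792741f * e),
             clampToByte(c - 0.213249f * d - 0.532909f * e),
             clampToByte(c + 2.112402f * d),
             255 };
}

// NV12: a full-resolution Y plane, then a half-height plane of interleaved
// U,V byte pairs. Both planes use the same row pitch in bytes.
// Assumption: the UV plane starts directly at pitch * height.
inline std::vector<Rgba8> nv12ToRgba(const std::uint8_t* nv12, std::size_t width,
                                     std::size_t height, std::size_t pitch) {
    assert(width % 2 == 0 && height % 2 == 0 && pitch >= width);
    const std::uint8_t* yPlane = nv12;
    const std::uint8_t* uvPlane = nv12 + pitch * height;
    std::vector<Rgba8> out(width * height);
    for (std::size_t row = 0; row < height; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            const std::uint8_t y = yPlane[row * pitch + col];
            // Each U,V pair is shared by a 2x2 block of luma samples.
            const std::uint8_t* uv = uvPlane + (row / 2) * pitch + (col / 2) * 2;
            out[row * width + col] = yuvToRgbaBt709(y, uv[0], uv[1]);
        }
    }
    return out;
}
```

In a CUDA version, the two loops would be replaced by each thread's (x, y) coordinates, and the store into `out` would become the surface write.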

❌ Abandoned Items (superseded by better solution)

1. NVIDIA NPP Library Integration

Status: ❌ ABANDONED - Not needed

Original Plan:

Integrate NPP (NVIDIA Performance Primitives)
Use nppiNV12ToRGB_8u_ColorTwist32f_P2C3R_Ctx() for NV12→RGB conversion
Link nppi.lib and implement NV12ToRGBConverter class

Why Abandoned:

CUDA surface objects solve the problem more elegantly
No external library dependency needed
Custom YUV→RGBA kernel provides similar performance with more control
Avoids configuring NPP color-twist matrices for BT.709/BT.601

2. D3D12 RGB Texture ROW_MAJOR Implementation

Status: ❌ ABANDONED - Alternative approach used

Original Plan:

Create D3D12 RGB textures with D3D12_TEXTURE_LAYOUT_ROW_MAJOR
Import as linear CUDA buffers via cudaExternalMemoryGetMappedBuffer()
Copy RGB data using cudaMemcpy2D(DeviceToDevice)

Why Abandoned:

CUDA surface objects work with native tiled textures (no ROW_MAJOR needed)
Tiled textures have better GPU performance than linear layouts
Surface objects provide automatic layout handling

3. D3D12VideoRenderer Shader Simplification

Status: ❌ ABANDONED - YUV→RGB shader retained

Original Plan:

Replace YUV→RGB pixel shader with simple RGB sampling
Single texture SRV instead of dual Y/UV textures
Simpler descriptor heap layout

Why Abandoned:

Final solution uses RGBA textures (already simple)
YUV→RGBA conversion moved to CUDA kernel
D3D12 renderer just samples RGBA texture directly

4. VavCore Public API Color Format Flag

Status: ❌ ABANDONED - Not necessary

Original Plan:

enum VavCoreColorFormat {
    VAVCORE_COLOR_NV12 = 0,
    VAVCORE_COLOR_RGB = 1
};
vavcore_set_color_format(decoder, VAVCORE_COLOR_RGB);

Why Abandoned:

Single RGBA texture format handles all cases
No need for runtime switching between color formats
Simpler API surface

📝 Reference Implementation (Not Used)

The original document provided detailed NPP integration code examples. While not implemented, this research was valuable:

Preserved Knowledge:

NPP Color Conversion API: Complete BT.709/BT.601 color twist matrices documented
Memory Layout Analysis: Understanding of D3D12 tiled vs linear layouts
Performance Expectations: Estimated 1.5-2.5 ms conversion overhead for 1080p
Alternative Approaches: Evaluation of readback buffers, compute shader detiling, CPU staging

Reusable Insights:

Zero-copy GPU pipeline principles apply to final solution
Performance analysis methods (measure NVDEC, conversion, copy separately)
Memory overhead calculations (NV12 vs RGB vs RGBA)

Technical Achievements

What Was Learned

1. D3D12 Texture Layout Constraints

Critical Finding: D3D12 Texture2D does NOT support ROW_MAJOR for R8/RG8 (one- and two-channel) formats
Exception: RGBA formats (R8G8B8A8_UNORM) can use ROW_MAJOR
Implication: Separate Y/UV texture approach is fundamentally incompatible with linear CUDA access

2. CUDA-D3D12 Interop Best Practices

Wrong Approach: cudaExternalMemoryGetMappedBuffer() for tiled textures → 90° rotation
Right Approach: cudaExternalMemoryGetMappedMipmappedArray() + surface objects → correct layout
Key API: cudaCreateSurfaceObject() + surf2Dwrite() handles tiling automatically
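
Why a linear view of tiled memory produces scrambled output can be shown with a toy model. The real layout behind D3D12_TEXTURE_LAYOUT_UNKNOWN is opaque and driver-specific, so the 4×4 tile scheme below is purely hypothetical, as are the names `tiledOffset`, `linearOffset`, and `countMismatches`. The sketch writes pixels through a tiled mapping and then counts how many come back wrong when the same memory is read with row-major addressing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kTile = 4;  // hypothetical tile edge, in pixels

// Offset of pixel (x, y) in a row-major (linear) image.
inline std::size_t linearOffset(int x, int y, int width) {
    return static_cast<std::size_t>(y) * width + x;
}

// Offset of pixel (x, y) in the toy tiled image: tiles are stored one after
// another, and pixels are row-major inside each tile.
inline std::size_t tiledOffset(int x, int y, int width) {
    assert(width % kTile == 0);
    const int tilesPerRow = width / kTile;
    const int tileIndex = (y / kTile) * tilesPerRow + (x / kTile);
    const int inTile = (y % kTile) * kTile + (x % kTile);
    return static_cast<std::size_t>(tileIndex) * kTile * kTile + inTile;
}

// Store each pixel's id through the tiled mapping, then read the memory back
// as if it were linear and count the pixels that no longer match.
inline int countMismatches(int width, int height) {
    assert(width % kTile == 0 && height % kTile == 0);
    std::vector<int> memory(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            memory[tiledOffset(x, y, width)] = y * width + x;

    int mismatches = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (memory[linearOffset(x, y, width)] != y * width + x)
                ++mismatches;
    return mismatches;
}
```

Only an image that fits in a single tile survives the round trip in this model. Anything larger is mostly misplaced, which is why the layout-aware surface path is needed for tiled textures.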

3. Performance Trade-offs

NPP Library: External dependency, added complexity
Custom CUDA Kernel: No dependencies, full control, similar performance
Memory Overhead: RGBA textures (4 bytes/pixel) vs NV12 (1.5 bytes/pixel)
- Trade-off: 2.67× memory for correct rendering and zero-copy benefits
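
The 2.67× figure follows directly from the per-pixel sizes. The helpers below are a small worked example with hypothetical names, assuming tightly packed frames with no row padding (real allocations usually round each row up to an aligned pitch). For a 1920×1080 frame they give 3,110,400 bytes for NV12 and 8,294,400 bytes for RGBA.

```cpp
#include <cassert>
#include <cstddef>

// NV12: full-resolution Y plane plus a quarter-resolution plane of
// interleaved U,V pairs, which averages 1.5 bytes per pixel.
inline std::size_t nv12FrameBytes(std::size_t width, std::size_t height) {
    assert(width % 2 == 0 && height % 2 == 0);  // 4:2:0 needs even dimensions
    return width * height + (width / 2) * (height / 2) * 2;
}

// Packed RGB: 3 bytes per pixel.
inline std::size_t rgbFrameBytes(std::size_t width, std::size_t height) {
    return width * height * 3;
}

// Packed RGBA: 4 bytes per pixel.
inline std::size_t rgbaFrameBytes(std::size_t width, std::size_t height) {
    return width * height * 4;
}

// Memory cost of RGBA relative to NV12 for the same frame size.
inline double rgbaToNv12Ratio(std::size_t width, std::size_t height) {
    return static_cast<double>(rgbaFrameBytes(width, height)) /
           static_cast<double>(nv12FrameBytes(width, height));
}
```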

What Was NOT Needed

NPP Library Integration: Custom CUDA kernel sufficient
ROW_MAJOR Textures: Surface objects work with native tiled layout
Separate Color Format API: Single RGBA format covers all use cases
NPP Color-Twist Matrix Setup: A standard YUV→RGB conversion in the kernel is sufficient

Final Implementation Summary

Actual Solution (CUDA Surface Objects)

Architecture:

[NVDEC Decoder]
    ↓ (CUDA NV12 Buffer)
[YUV→RGBA Conversion Kernel]
    ↓ (CUDA RGBA Buffer)
[surf2Dwrite() Kernel]
    ↓ (cudaSurfaceObject_t → D3D12 Tiled Texture)
[D3D12 Renderer]
    ↓ (Sample RGBA texture)
[Display]

Key Components (see CUDA_Surface_Object_Refactoring_Completed.md):

ExternalMemoryCache: GetOrCreateSurfaceObject() API
CUDA Kernel: WriteSurfaceFromBuffer_Kernel() with surf2Dwrite()
D3D12SurfaceHandler: CopyRGBAToSurfaceViaKernel() orchestration
NVDECAV1Decoder: RGB frame processing pipeline
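
The host-side arithmetic around a kernel like WriteSurfaceFromBuffer_Kernel() can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the helper names and the 16×16 block size used in the usage note are hypothetical. One detail that is easy to miss is that surf2Dwrite() takes its x coordinate in bytes, so the pixel column must be multiplied by the pixel size (4 bytes for RGBA8).

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for CUDA's dim3, restricted to two dimensions.
struct Dim2 { unsigned x, y; };

// Integer division rounded up.
inline unsigned ceilDiv(unsigned n, unsigned d) {
    assert(d != 0);
    return (n + d - 1) / d;
}

// Grid size that covers every pixel with at least one thread. When the frame
// size is not a multiple of the block size, the last blocks hang over the edge.
inline Dim2 gridFor(unsigned width, unsigned height, Dim2 block) {
    return { ceilDiv(width, block.x), ceilDiv(height, block.y) };
}

// Threads in the overhanging part of the grid must skip the write.
inline bool inBounds(unsigned x, unsigned y, unsigned width, unsigned height) {
    return x < width && y < height;
}

// Byte offset to pass as the x argument of surf2Dwrite() for a pixel column.
inline std::size_t surfaceByteX(unsigned pixelX, std::size_t bytesPerPixel = 4) {
    return pixelX * bytesPerPixel;
}
```

With a 16×16 block, a 1920×1080 frame needs a 120×68 grid. The last row of blocks covers 8 rows past the bottom edge, which is where the `inBounds` check matters.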

Results:

✅ Zero-copy GPU pipeline maintained
✅ 90-degree rotation issue completely fixed
✅ BMP validation confirms correct output
✅ All 25 frames decoded successfully

Comparison: Planned vs Implemented

| Aspect | Original NPP Plan | Actual Surface Object Solution |
| --- | --- | --- |
| Color Conversion | NPP library (NV12→RGB) | Custom CUDA kernel (NV12→RGBA) |
| D3D12 Texture | RGB ROW_MAJOR | RGBA Tiled (native layout) |
| CUDA API | `cudaExternalMemoryGetMappedBuffer()` | `cudaExternalMemoryGetMappedMipmappedArray()` |
| Write Method | `cudaMemcpy2D()` | `surf2Dwrite()` kernel |
| Dependencies | NPP library (nppi.lib) | None (CUDA runtime only) |
| Complexity | Medium (NPP API + color matrices) | Low (single custom kernel) |
| Memory Format | RGB (3 bytes/pixel) | RGBA (4 bytes/pixel) |
| Performance | ~1.5-2.5 ms (estimated) | ~1-2 ms (actual) |
| Correctness | Would work (if implemented) | ✅ Proven working |

Lessons Learned

Design Process

Research First: Extensive D3D12/CUDA API investigation prevented wrong implementation
Test Assumptions: separate-texture-test validated ROW_MAJOR constraints early
Stay Flexible: Better solution emerged during implementation research

Technical Insights

Tiled Textures: D3D12 native tiled layout is NOT accessible via linear CUDA buffers
Surface Objects: CUDA's solution for tiled texture interop (not buffers!)
API Selection: Choosing the right CUDA external memory API is critical

What Worked Well

✅ Thorough documentation before implementation
✅ Validation tests to prove/disprove approaches
✅ Willingness to abandon plan when better solution found
✅ Clear documentation of why decisions changed

What Could Be Improved

Could have investigated surface objects earlier (they're in CUDA docs)
Initial assumption that "linear buffer" was the only way caused detour
More upfront research into D3D12 tiled texture specs could have saved time

References & Related Documents

Completed Implementation

CUDA Surface Object Refactoring - Actual solution implemented

Background Research

D3D12_CUDA_Separate_YUV_Implementation_Status.md - Initial problem analysis
NVIDIA NPP Documentation - NPP API reference (not used)
CUDA External Memory API
D3D12 Specification

Test Projects

separate-texture-test - Validated D3D12 ROW_MAJOR constraints
red-surface-nvdec - Final solution validation

Conclusion

This document represents a research and design phase that led to a better final solution.

Original Goal: Fix D3D12-CUDA NV12 interop using NPP RGB conversion
Actual Result: Fixed using CUDA surface objects with direct RGBA writing

Status: ✅ PROBLEM SOLVED (via alternative approach)

While the NPP-based RGB pipeline was never implemented, the research was invaluable:

Identified D3D12 ROW_MAJOR limitations
Understood CUDA external memory APIs
Discovered surface objects as the correct solution
Documented performance expectations and trade-offs

Final Recommendation: Use CUDA surface objects for D3D12-CUDA texture interop (as implemented)

Project completed: 2025-10-06
Documentation created: 2025-10-06
