D3D12-CUDA RGB Pipeline Design - PROJECT STATUS

Project: D3D12-CUDA NV12 Interop Alternative Solutions
Status: SUPERSEDED BY CUDA SURFACE OBJECTS
Date Started: 2025-10-05
Date Completed: 2025-10-06 (via alternative solution)
Related: CUDA_Surface_Object_Refactoring_Completed.md


Executive Summary

Original Problem

D3D12-CUDA NV12 interop failed due to UV plane copy errors caused by tiled texture memory layouts.

Planned Solution (This Document)

NVIDIA NPP-based RGB Pipeline:

  • NVDEC → CUDA NV12 → NPP NV12→RGB conversion → D3D12 RGB texture (ROW_MAJOR)
  • Rationale: D3D12 supports ROW_MAJOR for RGB textures, enabling linear CUDA access
  • Expected benefits: Zero-copy GPU pipeline, simpler D3D12 renderer (no YUV→RGB shader)

Actual Solution Implemented

CUDA Surface Objects with Direct Texture Writing (see CUDA_Surface_Object_Refactoring_Completed.md):

  • NVDEC → CUDA NV12 → YUV→RGBA Conversion → surf2Dwrite() to D3D12 tiled texture
  • Key Innovation: Uses cudaExternalMemoryGetMappedMipmappedArray() + CUDA surface objects
  • Result: Properly handles D3D12 tiled layout without requiring ROW_MAJOR or NPP

Project Status Breakdown

Completed Items (from original plan)

1. D3D12 ROW_MAJOR Layout Investigation

Status: COMPLETED - Critical findings documented

Findings:

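  • Confirmed D3D12 does NOT support D3D12_TEXTURE_LAYOUT_ROW_MAJOR for R8/RG8 2D textures
  • Verified that an RGBA texture (R8G8B8A8_UNORM) CAN be created with ROW_MAJOR layout in this test
  • Layout limits recorded from the Microsoft D3D12 documentation during the investigation:
    • ROW_MAJOR is the layout intended for buffers and Texture1D
    • Texture2D normally requires D3D12_TEXTURE_LAYOUT_UNKNOWN (tiled/swizzled); the R8G8B8A8_UNORM result above was the exception observed in testing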

Test Results (separate-texture-test):

// R8_UNORM with ROW_MAJOR → FAILED (0x80070057 E_INVALIDARG)
// R8G8B8A8_UNORM with ROW_MAJOR → SUCCESS (0x00000000)

Impact: Confirmed that separate Y/UV texture approach is NOT viable

2. CUDA External Memory API Analysis

Status: COMPLETED - Existing implementation reviewed

Findings:

  • ExternalMemoryCache.cpp already implements proper D3D12-CUDA interop
  • Uses correct cudaExternalMemoryHandleTypeD3D12Resource API
  • Implements caching to avoid repeated imports
  • Returns CUdeviceptr with zero-copy GPU VRAM mapping
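The caching behaviour listed above can be illustrated with a minimal sketch. This is not the code in ExternalMemoryCache.cpp: the names `ImportCache`, `ImportedMapping`, and `GetOrImport` are hypothetical, and the import step is passed in as a callback so the example stays independent of the CUDA and D3D12 headers. It only shows the idea of importing a resource once and reusing the mapping afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for an imported mapping; a real cache would hold
// CUDA external-memory handles rather than a bare integer.
struct ImportedMapping {
    std::uint64_t device_address;  // placeholder for a CUdeviceptr-like value
};

// Illustrative cache keyed by the address of the D3D12 resource.
class ImportCache {
public:
    using ImportFn = std::function<ImportedMapping(const void*)>;

    explicit ImportCache(ImportFn import_fn) : import_fn_(std::move(import_fn)) {}

    // Returns the cached mapping, running the (expensive) import only on first use.
    const ImportedMapping& GetOrImport(const void* resource) {
        assert(resource != nullptr);
        auto it = cache_.find(resource);
        if (it == cache_.end()) {
            it = cache_.emplace(resource, import_fn_(resource)).first;
        }
        return it->second;
    }

    // Drops a mapping, for example when the texture is released or resized.
    void Release(const void* resource) { cache_.erase(resource); }

    std::size_t Size() const { return cache_.size(); }

private:
    ImportFn import_fn_;
    std::unordered_map<const void*, ImportedMapping> cache_;
};
```

Keying on the resource pointer assumes a resource is never destroyed and recreated at the same address without a `Release()` call in between; a production cache would need to handle that case explicitly.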

Note: This analysis led to discovering the mipmapped array approach instead

3. Zero-Copy GPU Pipeline Architecture Validation

Status: COMPLETED - Validated and IMPROVED

Original Plan:

  • NVDEC → CUDA NV12 → NPP RGB → D3D12 RGB (ROW_MAJOR)

Implemented Solution (better):

  • NVDEC → CUDA NV12 → RGBA Conversion → CUDA Surface → D3D12 Tiled Texture

Why Better:

  • No NPP library dependency needed
  • No ROW_MAJOR requirement (works with native D3D12 tiled layout)
  • Custom CUDA kernels handle YUV→RGBA conversion and the tiled write, with no NPP calls
  • Simpler implementation; measured ~1-2ms versus the ~1.5-2.5ms estimated for the NPP path

Abandoned Items (superseded by better solution)

1. NVIDIA NPP Library Integration

Status: ABANDONED - Not needed

Original Plan:

  • Integrate NPP (NVIDIA Performance Primitives)
  • Use nppiNV12ToRGB_8u_ColorTwist32f_P2C3R_Ctx() for NV12→RGB conversion
  • Link nppi.lib and implement NV12ToRGBConverter class

Why Abandoned:

  • CUDA surface objects solve the problem more elegantly
  • No external library dependency needed
  • Custom YUV→RGBA kernel provides comparable performance with more control
  • Avoids configuring NPP color-twist matrices for BT.709/BT.601; the kernel applies a standard YUV→RGB conversion
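For context on the color matrix point, the following CPU-side sketch converts a single 8-bit YUV sample to RGBA with either the BT.601 or the BT.709 limited-range coefficients. It is an assumption-laden reference, not the project's kernel: this document does not state which matrix or range the kernel uses, and the names `ColorMatrix`, `kBt601`, `kBt709`, and `YuvToRgba` are illustrative. It requires C++17 for `std::clamp`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Limited-range ("video range") Y'CbCr -> R'G'B' coefficients.
struct ColorMatrix {
    float y_scale;  // applied to (Y - 16)
    float v_to_r;   // applied to (V - 128) for red
    float u_to_g;   // applied to (U - 128) for green
    float v_to_g;   // applied to (V - 128) for green
    float u_to_b;   // applied to (U - 128) for blue
};

constexpr ColorMatrix kBt601{1.164383f, 1.596027f, -0.391762f, -0.812968f, 2.017232f};
constexpr ColorMatrix kBt709{1.164383f, 1.792741f, -0.213249f, -0.532909f, 2.112402f};

// Rounds to nearest and saturates to the 0..255 range.
inline std::uint8_t ClampToByte(float v) {
    assert(std::isfinite(v));  // a NaN coefficient would make the rounding meaningless
    return static_cast<std::uint8_t>(std::lround(std::clamp(v, 0.0f, 255.0f)));
}

// Converts one 8-bit limited-range YUV sample to opaque RGBA.
inline std::array<std::uint8_t, 4> YuvToRgba(std::uint8_t y, std::uint8_t u,
                                             std::uint8_t v, const ColorMatrix& m) {
    const float yf = m.y_scale * (static_cast<float>(y) - 16.0f);
    const float uf = static_cast<float>(u) - 128.0f;
    const float vf = static_cast<float>(v) - 128.0f;
    return {ClampToByte(yf + m.v_to_r * vf),
            ClampToByte(yf + m.u_to_g * uf + m.v_to_g * vf),
            ClampToByte(yf + m.u_to_b * uf),
            255};
}
```

Neutral samples (U = V = 128) map to the same gray under both matrices, while saturated colors differ, which is why picking the wrong matrix shows up as a hue shift rather than a brightness error.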

2. D3D12 RGB Texture ROW_MAJOR Implementation

Status: ABANDONED - Alternative approach used

Original Plan:

  • Create D3D12 RGB textures with D3D12_TEXTURE_LAYOUT_ROW_MAJOR
  • Import as linear CUDA buffers via cudaExternalMemoryGetMappedBuffer()
  • Copy RGB data using cudaMemcpy2D(DeviceToDevice)

Why Abandoned:

  • CUDA surface objects work with native tiled textures (no ROW_MAJOR needed)
  • Tiled textures have better GPU performance than linear layouts
  • Surface objects provide automatic layout handling

3. D3D12VideoRenderer Shader Simplification

Status: ABANDONED - YUV→RGB shader retained

Original Plan:

  • Replace YUV→RGB pixel shader with simple RGB sampling
  • Single texture SRV instead of dual Y/UV textures
  • Simpler descriptor heap layout

Why Abandoned:

  • The final solution already renders from a single RGBA texture, so no separate simplification pass was needed
  • YUV→RGBA conversion moved to CUDA kernel
  • D3D12 renderer just samples RGBA texture directly

4. VavCore Public API Color Format Flag

Status: ABANDONED - Not necessary

Original Plan:

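// Planned (never implemented) public API for selecting the output color format
enum VavCoreColorFormat {
    VAVCORE_COLOR_NV12 = 0,  // NV12 frames as produced by the decoder
    VAVCORE_COLOR_RGB = 1    // RGB frames from the planned NPP conversion
};

// Intended usage:
vavcore_set_color_format(decoder, VAVCORE_COLOR_RGB);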

Why Abandoned:

  • Single RGBA texture format handles all cases
  • No need for runtime switching between color formats
  • Simpler API surface

Reference Implementation (Not Used)

The original document provided detailed NPP integration code examples. While not implemented, this research was valuable:

Preserved Knowledge:

  1. NPP Color Conversion API: Complete BT.709/BT.601 color twist matrices documented
  2. Memory Layout Analysis: Understanding of D3D12 tiled vs linear layouts
  3. Performance Expectations: Estimated 1.5-2.5ms conversion overhead for 1080p
  4. Alternative Approaches: Evaluation of readback buffers, compute shader detiling, CPU staging

Reusable Insights:

  • Zero-copy GPU pipeline principles apply to final solution
  • Performance analysis methods (measure NVDEC, conversion, copy separately)
  • Memory overhead calculations (NV12 vs RGB vs RGBA)
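The "measure each stage separately" method can be sketched with a small host-side timer. This is an illustration only: `StageTimer` and its methods are hypothetical names, and the sketch measures wall-clock time on the CPU. For GPU stages the callable would have to synchronize before returning (for example with `cudaStreamSynchronize`), or CUDA events could be used instead; neither is shown here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

// Accumulates wall-clock time per named pipeline stage.
class StageTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Runs `work` and adds its duration to the named stage.
    template <typename Fn>
    void Measure(const std::string& stage, Fn&& work) {
        const auto start = Clock::now();
        work();
        const auto elapsed = Clock::now() - start;
        Entry& e = entries_[stage];
        e.total += std::chrono::duration<double, std::milli>(elapsed);
        ++e.count;
    }

    // Mean milliseconds per call for a stage measured at least once.
    double AverageMs(const std::string& stage) const {
        const auto it = entries_.find(stage);
        assert(it != entries_.end() && it->second.count > 0);
        return it->second.total.count() / static_cast<double>(it->second.count);
    }

    std::size_t StageCount() const { return entries_.size(); }

private:
    struct Entry {
        std::chrono::duration<double, std::milli> total{0.0};
        long count = 0;
    };
    std::map<std::string, Entry> entries_;
};
```

Wrapping the decode, conversion, and copy steps in separate `Measure()` calls gives per-stage averages that can be compared against the estimates listed above.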

Technical Achievements

What Was Learned

1. D3D12 Texture Layout Constraints

  • Critical Finding: D3D12 Texture2D does NOT support ROW_MAJOR for the R8/RG8 formats needed for separate Y and UV planes
  • Exception: R8G8B8A8_UNORM (RGBA) textures could be created with ROW_MAJOR
  • Implication: Separate Y/UV texture approach fundamentally incompatible with linear CUDA access

2. CUDA-D3D12 Interop Best Practices

  • Wrong Approach: cudaExternalMemoryGetMappedBuffer() for tiled textures → 90° rotation
  • Right Approach: cudaExternalMemoryGetMappedMipmappedArray() + surface objects → correct layout
  • Key API: cudaCreateSurfaceObject() + surf2Dwrite() handles tiling automatically

3. Performance Trade-offs

  • NPP Library: External dependency, added complexity
  • Custom CUDA Kernel: No dependencies, full control, similar performance
  • Memory Overhead: RGBA textures (4 bytes/pixel) vs NV12 (1.5 bytes/pixel)
    • Trade-off: 2.67× memory for correct rendering and zero-copy benefits
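The 2.67× figure follows directly from the bytes per pixel (4 / 1.5). A small sketch of the arithmetic for a whole frame, assuming tightly packed planes with no pitch or alignment padding (real GPU allocations are usually larger); the function names are illustrative:

```cpp
#include <cassert>
#include <cstddef>

// NV12: full-resolution Y plane plus one interleaved UV plane that is
// subsampled by two in both directions.
constexpr std::size_t Nv12FrameBytes(std::size_t width, std::size_t height) {
    return width * height + 2 * ((width + 1) / 2) * ((height + 1) / 2);
}

// Packed formats such as RGB (3 bytes/pixel) or RGBA (4 bytes/pixel).
constexpr std::size_t PackedFrameBytes(std::size_t width, std::size_t height,
                                       std::size_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

// How much larger an RGBA frame is than the NV12 frame it was decoded from.
inline double RgbaToNv12Ratio(std::size_t width, std::size_t height) {
    assert(width > 0 && height > 0);
    return static_cast<double>(PackedFrameBytes(width, height, 4)) /
           static_cast<double>(Nv12FrameBytes(width, height));
}
```

For 1920×1080 this gives 3,110,400 bytes for NV12, 6,220,800 bytes for RGB, and 8,294,400 bytes for RGBA.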

What Was NOT Needed

  1. NPP Library Integration: Custom CUDA kernel sufficient
  2. ROW_MAJOR Textures: Surface objects work with native tiled layout
  3. Separate Color Format API: Single RGBA format covers all use cases
  4. Complex Color Matrix Math: Standard YUV→RGB conversion in kernel

Final Implementation Summary

Actual Solution (CUDA Surface Objects)

Architecture:

[NVDEC Decoder]
    ↓ (CUDA NV12 Buffer)
[YUV→RGBA Conversion Kernel]
    ↓ (CUDA RGBA Buffer)
[surf2Dwrite() Kernel]
    ↓ (cudaSurfaceObject_t → D3D12 Tiled Texture)
[D3D12 Renderer]
    ↓ (Sample RGBA texture)
[Display]
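The first stage of the diagram reads a linear NV12 buffer. As a CPU-side illustration of how a per-pixel conversion might locate its samples, the sketch below fetches Y, U, and V for one pixel. It assumes both planes share the same pitch and that the UV plane starts immediately after `height` rows of Y; actual decoder surfaces may place the UV plane at an aligned offset, so treat the layout and the names `Nv12Sample` and `FetchNv12` as assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Nv12Sample {
    std::uint8_t y;
    std::uint8_t u;
    std::uint8_t v;
};

// Fetches the Y, U, V values that apply to pixel (x, y) in a linear NV12 frame.
// `pitch` is the row stride in bytes and is assumed to be shared by both planes.
inline Nv12Sample FetchNv12(const std::uint8_t* frame, std::size_t pitch,
                            std::size_t width, std::size_t height,
                            std::size_t x, std::size_t y) {
    assert(x < width && y < height && pitch >= width);
    const std::uint8_t* y_plane = frame;
    const std::uint8_t* uv_plane = frame + pitch * height;  // assumed UV plane start
    // Each 2x2 block of pixels shares one interleaved (U, V) byte pair.
    const std::size_t uv_offset = (y / 2) * pitch + (x / 2) * 2;
    return {y_plane[y * pitch + x], uv_plane[uv_offset], uv_plane[uv_offset + 1]};
}
```

This addressing is only valid because the NV12 buffer is linear. The destination D3D12 texture is tiled, which is why the final write in the diagram goes through a surface object rather than pointer arithmetic.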

Key Components (see CUDA_Surface_Object_Refactoring_Completed.md):

  1. ExternalMemoryCache: GetOrCreateSurfaceObject() API
  2. CUDA Kernel: WriteSurfaceFromBuffer_Kernel() with surf2Dwrite()
  3. D3D12SurfaceHandler: CopyRGBAToSurfaceViaKernel() orchestration
  4. NVDECAV1Decoder: RGB frame processing pipeline

Results:

  • Zero-copy GPU pipeline maintained
  • 90-degree rotation issue completely fixed
  • BMP validation confirms correct output
  • All 25 frames decoded successfully

Comparison: Planned vs Implemented

| Aspect | Original NPP Plan | Actual Surface Object Solution |
|---|---|---|
| Color Conversion | NPP library (NV12→RGB) | Custom CUDA kernel (NV12→RGBA) |
| D3D12 Texture | RGB ROW_MAJOR | RGBA Tiled (native layout) |
| CUDA API | cudaExternalMemoryGetMappedBuffer() | cudaExternalMemoryGetMappedMipmappedArray() |
| Write Method | cudaMemcpy2D() | surf2Dwrite() kernel |
| Dependencies | NPP library (nppi.lib) | None (CUDA runtime only) |
| Complexity | Medium (NPP API + color matrices) | Low (single custom kernel) |
| Memory Format | RGB (3 bytes/pixel) | RGBA (4 bytes/pixel) |
| Performance | ~1.5-2.5ms (estimated) | ~1-2ms (actual) |
| Correctness | Would work (if implemented) | Proven working |

Lessons Learned

Design Process

  1. Research First: Extensive D3D12/CUDA API investigation prevented a wrong implementation
  2. Test Assumptions: separate-texture-test validated ROW_MAJOR constraints early
  3. Stay Flexible: Better solution emerged during implementation research

Technical Insights

  1. Tiled Textures: D3D12 native tiled layout cannot be addressed correctly through linear CUDA buffers
  2. Surface Objects: CUDA's solution for tiled texture interop (not buffers!)
  3. API Selection: Choosing the right CUDA external memory API is critical
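Why linear addressing fails on a tiled texture can be shown with a toy model. The real D3D12_TEXTURE_LAYOUT_UNKNOWN layout is opaque and vendor-specific, so the Z-order (Morton) layout below is purely a stand-in chosen for illustration, not the layout any driver actually uses. The point is only that the same (x, y) maps to different offsets under the two schemes, so a writer that assumes row-major order scatters texels to the wrong places.

```cpp
#include <cassert>
#include <cstdint>

// Spreads the low 16 bits of v so that bit i moves to bit 2i.
inline std::uint32_t SpreadBits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Texel index in a Z-order layout: x bits in even positions, y bits in odd.
inline std::uint32_t MortonIndex(std::uint32_t x, std::uint32_t y) {
    assert(x < 65536u && y < 65536u);
    return SpreadBits(x) | (SpreadBits(y) << 1);
}

// Texel index in a plain row-major layout.
inline std::uint32_t LinearIndex(std::uint32_t x, std::uint32_t y, std::uint32_t width) {
    assert(x < width);
    return y * width + x;
}
```

In a 4×4 texture, texel (0, 1) sits at index 4 in row-major order but at index 2 in this Z-order model. A surface object hides this mapping: the kernel passes coordinates and the hardware resolves the physical address.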

What Worked Well

  • Thorough documentation before implementation
  • Validation tests to prove/disprove approaches
  • Willingness to abandon plan when better solution found
  • Clear documentation of why decisions changed

What Could Be Improved

  • Could have investigated surface objects earlier (they are covered in the CUDA docs)
  • Initial assumption that a "linear buffer" was the only way caused a detour
  • More upfront research into D3D12 tiled texture specs could have saved time

Completed Implementation

Background Research

Test Projects

  • separate-texture-test - Validated D3D12 ROW_MAJOR constraints
  • red-surface-nvdec - Final solution validation

Conclusion

This document represents a research and design phase that led to a better final solution.

Original Goal: Fix D3D12-CUDA NV12 interop using NPP RGB conversion
Actual Result: Fixed using CUDA surface objects with direct RGBA writing

Status: PROBLEM SOLVED (via alternative approach)

While the NPP-based RGB pipeline was never implemented, the research was invaluable:

  • Identified D3D12 ROW_MAJOR limitations
  • Understood CUDA external memory APIs
  • Discovered surface objects as the correct solution
  • Documented performance expectations and trade-offs

Final Recommendation: Use CUDA surface objects for D3D12-CUDA texture interop (as implemented)


Project completed: 2025-10-06
Documentation created: 2025-10-06