D3D12-CUDA RGB Pipeline Design - PROJECT STATUS ✅
Project: D3D12-CUDA NV12 Interop Alternative Solutions
Status: ✅ SUPERSEDED BY CUDA SURFACE OBJECTS
Date Started: 2025-10-05
Date Completed: 2025-10-06 (via alternative solution)
Related: CUDA_Surface_Object_Refactoring_Completed.md
Executive Summary
Original Problem
D3D12-CUDA NV12 interop failed due to UV plane copy errors caused by tiled texture memory layouts.
Planned Solution (This Document)
NVIDIA NPP-based RGB Pipeline:
- NVDEC → CUDA NV12 → NPP NV12→RGB conversion → D3D12 RGB texture (ROW_MAJOR)
- Rationale: D3D12 supports ROW_MAJOR for RGB textures, enabling linear CUDA access
- Expected benefits: Zero-copy GPU pipeline, simpler D3D12 renderer (no YUV→RGB shader)
✅ Actual Solution Implemented
CUDA Surface Objects with Direct Texture Writing (see CUDA_Surface_Object_Refactoring_Completed.md):
- NVDEC → CUDA NV12 → YUV→RGBA Conversion → surf2Dwrite() to D3D12 tiled texture
- Key Innovation: Uses cudaExternalMemoryGetMappedMipmappedArray() + CUDA surface objects
- Result: Properly handles D3D12 tiled layout without requiring ROW_MAJOR or NPP
Project Status Breakdown
✅ Completed Items (from original plan)
1. D3D12 ROW_MAJOR Layout Investigation
Status: ✅ COMPLETED - Critical findings documented
Findings:
- ✅ Confirmed D3D12 does NOT support D3D12_TEXTURE_LAYOUT_ROW_MAJOR for R8/RG8 2D textures
- ✅ Verified RGB textures (R8G8B8A8_UNORM) CAN use ROW_MAJOR layout
- ✅ Microsoft D3D12 specification limits documented:
  - ROW_MAJOR is guaranteed only for buffers
  - Texture2D normally requires D3D12_TEXTURE_LAYOUT_UNKNOWN (tiled/swizzled); ROW_MAJOR is accepted only in restricted cases
Test Results (separate-texture-test):
// R8_UNORM with ROW_MAJOR → FAILED (0x80070057 E_INVALIDARG)
// R8G8B8A8_UNORM with ROW_MAJOR → SUCCESS (0x00000000)
Impact: Confirmed that separate Y/UV texture approach is NOT viable
2. CUDA External Memory API Analysis
Status: ✅ COMPLETED - Existing implementation reviewed
Findings:
- ✅ ExternalMemoryCache.cpp already implements proper D3D12-CUDA interop
- ✅ Uses correct cudaExternalMemoryHandleTypeD3D12Resource API
- ✅ Implements caching to avoid repeated imports
- ✅ Returns CUdeviceptr with zero-copy GPU VRAM mapping
Note: This analysis led to discovering the mipmapped array approach instead
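The caching idea noted above can be sketched in plain C++. This is a minimal illustration, not the code in ExternalMemoryCache.cpp: the class name `ImportCache`, the `DevicePtr` alias, and the injected `ImportFn` callback are all hypothetical stand-ins. The callback stands in for the expensive external-memory import, so the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a mapped device pointer (CUdeviceptr is an integer handle).
using DevicePtr = std::uint64_t;

// Caches one mapping per resource so the expensive import runs only once per resource.
class ImportCache {
public:
    // Assumed callback that performs the real import (e.g. external memory import + mapping).
    using ImportFn = std::function<DevicePtr(const void* resource)>;

    explicit ImportCache(ImportFn importFn) : import_(std::move(importFn)) {}

    // Returns the cached mapping, importing the resource on first use.
    DevicePtr GetOrImport(const void* resource) {
        auto it = cache_.find(resource);
        if (it != cache_.end()) {
            return it->second;  // cache hit: no re-import
        }
        const DevicePtr ptr = import_(resource);
        cache_.emplace(resource, ptr);
        return ptr;
    }

    // Forgets a mapping, e.g. when the texture is destroyed.
    void Release(const void* resource) { cache_.erase(resource); }

    std::size_t Size() const { return cache_.size(); }

private:
    ImportFn import_;
    std::unordered_map<const void*, DevicePtr> cache_;
};
```

Keying on the resource pointer assumes a resource is not destroyed and reallocated at the same address while its entry is still cached; a real cache would call `Release` (and free the imported memory) when the texture goes away.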
3. Zero-Copy GPU Pipeline Architecture Validation
Status: ✅ COMPLETED - Validated and IMPROVED
Original Plan:
- NVDEC → CUDA NV12 → NPP RGB → D3D12 RGB (ROW_MAJOR)
Implemented Solution (better):
- NVDEC → CUDA NV12 → RGBA Conversion → CUDA Surface → D3D12 Tiled Texture
Why Better:
- No NPP library dependency needed
- No ROW_MAJOR requirement (works with native D3D12 tiled layout)
- Custom CUDA kernels handle the YUV→RGBA conversion and the tiled write
- Simpler implementation with comparable or better performance (~1-2ms measured vs ~1.5-2.5ms estimated for NPP)
❌ Abandoned Items (superseded by better solution)
1. NVIDIA NPP Library Integration
Status: ❌ ABANDONED - Not needed
Original Plan:
- Integrate NPP (NVIDIA Performance Primitives)
- Use nppiNV12ToRGB_8u_ColorTwist32f_P2C3R_Ctx() for NV12→RGB conversion
- Link nppi.lib and implement NV12ToRGBConverter class
Why Abandoned:
- CUDA surface objects solve the problem more elegantly
- No external library dependency needed
- Custom YUV→RGBA kernel provides same performance with more control
- Avoids configuring NPP color twist matrices for BT.709/BT.601
2. D3D12 RGB Texture ROW_MAJOR Implementation
Status: ❌ ABANDONED - Alternative approach used
Original Plan:
- Create D3D12 RGB textures with D3D12_TEXTURE_LAYOUT_ROW_MAJOR
- Import as linear CUDA buffers via cudaExternalMemoryGetMappedBuffer()
- Copy RGB data using cudaMemcpy2D(DeviceToDevice)
Why Abandoned:
- CUDA surface objects work with native tiled textures (no ROW_MAJOR needed)
- Tiled textures generally give better GPU sampling performance than linear layouts
- Surface objects provide automatic layout handling
3. D3D12VideoRenderer Shader Simplification
Status: ❌ ABANDONED - YUV→RGB shader retained
Original Plan:
- Replace YUV→RGB pixel shader with simple RGB sampling
- Single texture SRV instead of dual Y/UV textures
- Simpler descriptor heap layout
Why Abandoned:
- Final solution uses RGBA textures (already simple)
- YUV→RGBA conversion moved to CUDA kernel
- D3D12 renderer just samples RGBA texture directly
4. VavCore Public API Color Format Flag
Status: ❌ ABANDONED - Not necessary
Original Plan:
// Planned public API addition (never implemented)
enum VavCoreColorFormat {
    VAVCORE_COLOR_NV12 = 0,
    VAVCORE_COLOR_RGB = 1
};
// Intended usage:
vavcore_set_color_format(decoder, VAVCORE_COLOR_RGB);
Why Abandoned:
- Single RGBA texture format handles all cases
- No need for runtime switching between color formats
- Simpler API surface
📝 Reference Implementation (Not Used)
The original document provided detailed NPP integration code examples. While not implemented, this research was valuable:
Preserved Knowledge:
- NPP Color Conversion API: Complete BT.709/BT.601 color twist matrices documented
- Memory Layout Analysis: Understanding of D3D12 tiled vs linear layouts
- Performance Expectations: Estimated 1.5-2.5ms overhead for 1080p
- Alternative Approaches: Evaluation of readback buffers, compute shader detiling, CPU staging
Reusable Insights:
- Zero-copy GPU pipeline principles apply to final solution
- Performance analysis methods (measure NVDEC, conversion, copy separately)
- Memory overhead calculations (NV12 vs RGB vs RGBA)
Technical Achievements
What Was Learned
1. D3D12 Texture Layout Constraints
- Critical Finding: D3D12 Texture2D does NOT support ROW_MAJOR for single-channel formats
- Exception: RGB formats (R8G8B8A8_UNORM) can use ROW_MAJOR
- Implication: Separate Y/UV texture approach fundamentally incompatible
2. CUDA-D3D12 Interop Best Practices
- Wrong Approach: cudaExternalMemoryGetMappedBuffer() for tiled textures → 90° rotation
- Right Approach: cudaExternalMemoryGetMappedMipmappedArray() + surface objects → correct layout
- Key API: cudaCreateSurfaceObject() + surf2Dwrite() handles tiling automatically
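To see why a linear buffer view of a tiled texture goes wrong, here is a toy CPU model. The 4×4 tile layout is an assumption made up for illustration: the real D3D12_TEXTURE_LAYOUT_UNKNOWN layout is opaque and driver-specific, which is why the hardware-aware surface path is needed. All names here (`kTile`, `LinearOffset`, `TiledOffset`, `MakeTiledImage`) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kTile = 4;  // hypothetical 4x4-texel tile, for illustration only

// Offset of texel (x, y) in a plain row-major image.
std::size_t LinearOffset(std::size_t x, std::size_t y, std::size_t width) {
    return y * width + x;
}

// Offset of texel (x, y) in the toy tiled layout: tiles are stored one after
// another in row-major order, and texels are row-major inside each tile.
// Width is assumed to be a multiple of kTile.
std::size_t TiledOffset(std::size_t x, std::size_t y, std::size_t width) {
    const std::size_t tilesPerRow = width / kTile;
    const std::size_t tileIndex = (y / kTile) * tilesPerRow + (x / kTile);
    const std::size_t inTile = (y % kTile) * kTile + (x % kTile);
    return tileIndex * kTile * kTile + inTile;
}

// Fills an image through the tiled mapping; each texel stores y * 256 + x
// so a reader can tell which texel it actually fetched.
std::vector<std::uint16_t> MakeTiledImage(std::size_t width, std::size_t height) {
    std::vector<std::uint16_t> data(width * height);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            data[TiledOffset(x, y, width)] = static_cast<std::uint16_t>(y * 256 + x);
        }
    }
    return data;
}
```

In this model, reading texel (5, 2) of an 8×8 image with `LinearOffset` returns the value stored for texel (5, 1). A linear copy into tiled memory scrambles the picture in the same way, while a surface write goes through the correct mapping for every coordinate.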
3. Performance Trade-offs
- NPP Library: External dependency, added complexity
- Custom CUDA Kernel: No dependencies, full control, similar performance
- Memory Overhead: RGBA textures (4 bytes/pixel) vs NV12 (1.5 bytes/pixel)
- Trade-off: 2.67× memory for correct rendering and zero-copy benefits
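The 2.67× figure follows from the bytes-per-pixel numbers above. A small sketch of the arithmetic, assuming tightly packed frames with no pitch padding (real allocations are usually padded, so actual sizes will be somewhat larger); the function names are illustrative:

```cpp
#include <cstddef>

// NV12: full-resolution Y plane plus a half-resolution interleaved UV plane
// (2 bytes per 2x2 block of pixels), i.e. 1.5 bytes per pixel on average.
constexpr std::size_t Nv12Bytes(std::size_t width, std::size_t height) {
    return width * height + (width / 2) * (height / 2) * 2;
}

// RGBA8: 4 bytes per pixel.
constexpr std::size_t RgbaBytes(std::size_t width, std::size_t height) {
    return width * height * 4;
}

// Memory cost of RGBA relative to NV12 for the same frame size.
inline double RgbaToNv12Ratio(std::size_t width, std::size_t height) {
    return static_cast<double>(RgbaBytes(width, height)) /
           static_cast<double>(Nv12Bytes(width, height));
}
```

For a 1920×1080 frame this gives 3,110,400 bytes for NV12 and 8,294,400 bytes for RGBA, a ratio of 8/3 ≈ 2.67.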
What Was NOT Needed
- NPP Library Integration: Custom CUDA kernel sufficient
- ROW_MAJOR Textures: Surface objects work with native tiled layout
- Separate Color Format API: Single RGBA format covers all use cases
- Complex Color Matrix Math: Standard YUV→RGB conversion in kernel
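For reference, a standard per-pixel YUV→RGBA conversion looks roughly like the following CPU sketch. It assumes BT.709 limited-range video; this document does not state which matrix or range the actual kernel uses, so treat the coefficients and the names `Rgba8`, `ToByte`, and `YuvToRgbaBt709` as illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Rgba8 {
    std::uint8_t r, g, b, a;
};

// Rounds to nearest and clamps into the 0..255 range of an 8-bit channel.
inline std::uint8_t ToByte(float v) {
    const long rounded = std::lround(v);
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, rounded)));
}

// Assumption: BT.709 coefficients, limited range (Y in 16..235, U/V in 16..240).
inline Rgba8 YuvToRgbaBt709(std::uint8_t y, std::uint8_t u, std::uint8_t v) {
    const float luma = 1.164383f * (static_cast<float>(y) - 16.0f);
    const float cb = static_cast<float>(u) - 128.0f;
    const float cr = static_cast<float>(v) - 128.0f;
    return Rgba8{
        ToByte(luma + 1.792741f * cr),
        ToByte(luma - 0.213249f * cb - 0.532909f * cr),
        ToByte(luma + 2.112402f * cb),
        255  // opaque alpha
    };
}
```

A GPU kernel would apply the same arithmetic per thread. BT.601 content would need different coefficients, which is the complexity the color twist matrices were meant to cover.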
Final Implementation Summary
Actual Solution (CUDA Surface Objects)
Architecture:
[NVDEC Decoder]
↓ (CUDA NV12 Buffer)
[YUV→RGBA Conversion Kernel]
↓ (CUDA RGBA Buffer)
[surf2Dwrite() Kernel]
↓ (cudaSurfaceObject_t → D3D12 Tiled Texture)
[D3D12 Renderer]
↓ (Sample RGBA texture)
[Display]
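The first hop in the diagram hands the conversion kernel an NV12 buffer, so the kernel has to locate the luma and chroma samples for each pixel. A CPU sketch of that addressing follows. It assumes the UV plane starts right after `height` rows of Y and shares the same pitch; decoders may align the surface height, so the real UV offset should come from the decoder rather than this formula. `Yuv8` and `SampleNv12` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

struct Yuv8 {
    std::uint8_t y, u, v;
};

// Fetches the Y, U, V samples that cover pixel (x, y) in an NV12 frame.
// Layout assumed: `height` rows of Y, then height/2 rows of interleaved U,V
// pairs, with every row `pitch` bytes apart.
inline Yuv8 SampleNv12(const std::uint8_t* frame, std::size_t pitch,
                       std::size_t height, std::size_t x, std::size_t y) {
    const std::uint8_t* yPlane = frame;
    const std::uint8_t* uvPlane = frame + pitch * height;
    // Each U,V pair is shared by a 2x2 block of pixels.
    const std::uint8_t* uv = uvPlane + (y / 2) * pitch + (x / 2) * 2;
    return Yuv8{yPlane[y * pitch + x], uv[0], uv[1]};
}
```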
Key Components (see CUDA_Surface_Object_Refactoring_Completed.md):
- ExternalMemoryCache: GetOrCreateSurfaceObject() API
- CUDA Kernel: WriteSurfaceFromBuffer_Kernel() with surf2Dwrite()
- D3D12SurfaceHandler: CopyRGBAToSurfaceViaKernel() orchestration
- NVDECAV1Decoder: RGB frame processing pipeline
Results:
- ✅ Zero-copy GPU pipeline maintained
- ✅ 90-degree rotation issue completely fixed
- ✅ BMP validation confirms correct output
- ✅ All 25 frames decoded successfully
Comparison: Planned vs Implemented
| Aspect | Original NPP Plan | Actual Surface Object Solution |
|---|---|---|
| Color Conversion | NPP library (NV12→RGB) | Custom CUDA kernel (NV12→RGBA) |
| D3D12 Texture | RGB ROW_MAJOR | RGBA Tiled (native layout) |
| CUDA API | cudaExternalMemoryGetMappedBuffer() | cudaExternalMemoryGetMappedMipmappedArray() |
| Write Method | cudaMemcpy2D() | surf2Dwrite() kernel |
| Dependencies | NPP library (nppi.lib) | None (CUDA runtime only) |
| Complexity | Medium (NPP API + color matrices) | Low (single custom kernel) |
| Memory Format | RGB (3 bytes/pixel) | RGBA (4 bytes/pixel) |
| Performance | ~1.5-2.5ms (estimated) | ~1-2ms (actual) |
| Correctness | Would work (if implemented) | ✅ Proven working |
Lessons Learned
Design Process
- Research First: Extensive D3D12/CUDA API investigation prevented wrong implementation
- Test Assumptions: separate-texture-test validated ROW_MAJOR constraints early
- Stay Flexible: Better solution emerged during implementation research
Technical Insights
- Tiled Textures: D3D12 native tiled layout is NOT accessible via linear CUDA buffers
- Surface Objects: CUDA's solution for tiled texture interop (not buffers!)
- API Selection: Choosing right CUDA external memory API is critical
What Worked Well
- ✅ Thorough documentation before implementation
- ✅ Validation tests to prove/disprove approaches
- ✅ Willingness to abandon plan when better solution found
- ✅ Clear documentation of why decisions changed
What Could Be Improved
- Could have investigated surface objects earlier (they're in CUDA docs)
- Initial assumption that "linear buffer" was the only way caused detour
- More upfront research into D3D12 tiled texture specs could have saved time
References & Related Documents
Completed Implementation
- CUDA Surface Object Refactoring - Actual solution implemented
Background Research
- D3D12_CUDA_Separate_YUV_Implementation_Status.md - Initial problem analysis
- NVIDIA NPP Documentation - NPP API reference (not used)
- CUDA External Memory API
- D3D12 Specification
Test Projects
- separate-texture-test - Validated D3D12 ROW_MAJOR constraints
- red-surface-nvdec - Final solution validation
Conclusion
This document represents a research and design phase that led to a better final solution.
Original Goal: Fix D3D12-CUDA NV12 interop using NPP RGB conversion
Actual Result: Fixed using CUDA surface objects with direct RGBA writing
Status: ✅ PROBLEM SOLVED (via alternative approach)
While the NPP-based RGB pipeline was never implemented, the research was invaluable:
- Identified D3D12 ROW_MAJOR limitations
- Understood CUDA external memory APIs
- Discovered surface objects as the correct solution
- Documented performance expectations and trade-offs
Final Recommendation: Use CUDA surface objects for D3D12-CUDA texture interop (as implemented)
Project completed: 2025-10-06
Documentation created: 2025-10-06