Files

ened 6b04396772 Update project documents

2025-10-05 23:42:44 +09:00

19 KiB

Raw Blame History

D3D12-CUDA Interop: Separate Y/UV Texture Solution

Date: 2025-10-05 Status: Design Complete, Partial Implementation Problem: NV12 texture CUDA interop failures due to tiled memory layout

Problem Statement

Original Issue

NVDEC hardware decoder outputs NV12 frames in CUDA memory, which need to be copied to D3D12 textures for GPU rendering. The initial approach using combined NV12 textures (DXGI_FORMAT_NV12) with D3D12_TEXTURE_LAYOUT_UNKNOWN failed because:

NV12 requires tiled layout: D3D12 NV12 textures must use D3D12_TEXTURE_LAYOUT_UNKNOWN (GPU-optimized tiled format)
CUDA cannot access tiled memory: cudaMemcpy2D requires linear memory layout
ROW_MAJOR not supported: NV12 format does not support D3D12_TEXTURE_LAYOUT_ROW_MAJOR

Error Manifestation

[D3D12SurfaceHandler] UV plane copy failed: 1 (invalid argument)
[D3D12SurfaceHandler] UV copy details: src=0x130910E000, dst=0x20990E000,
                      src_pitch=1536, dst_pitch=1536, width=1280, height=360

Y plane copy: ✅ Success
UV plane copy: ❌ CUDA error 1 (invalid argument)

Root Cause Analysis

D3D12 NV12 textures use vendor-specific tiled memory layout that is not accessible as linear buffers from CUDA, even when using cudaExternalMemoryHandleTypeD3D12Heap.

Solution: Separate Y/UV Textures

Architecture Overview

Instead of using a single NV12 texture, split into two separate textures:

NVDEC Output (NV12 in CUDA memory)
  │
  ├─ Y Plane (width × height, 8-bit)
  │  └─> D3D12 Texture: DXGI_FORMAT_R8_UNORM, D3D12_TEXTURE_LAYOUT_ROW_MAJOR
  │
  └─ UV Plane (width × height/2, 16-bit interleaved)
     └─> D3D12 Texture: DXGI_FORMAT_R8G8_UNORM, D3D12_TEXTURE_LAYOUT_ROW_MAJOR

Key Advantages

Aspect	Combined NV12	Separate Y/UV
Layout Support	❌ ROW_MAJOR not allowed	✅ ROW_MAJOR supported
CUDA Access	❌ Tiled, non-linear	✅ Linear buffer access
Copy Operation	❌ `cudaMemcpy2D` fails on UV	✅ Both planes copy successfully
Memory Overhead	Lower (single allocation)	Slightly higher (two allocations)
Shader Complexity	Lower (single sampler)	Higher (two samplers, manual combine)

Texture Specifications

Y Plane Texture

D3D12_RESOURCE_DESC desc = {};
desc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
desc.Width = width;
desc.Height = height;
desc.DepthOrArraySize = 1;
desc.MipLevels = 1;
desc.Format = DXGI_FORMAT_R8_UNORM;              // Single 8-bit channel
desc.SampleDesc.Count = 1;
desc.SampleDesc.Quality = 0;
desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;    // Linear layout for CUDA
desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS;

UV Plane Texture

D3D12_RESOURCE_DESC desc = {};
desc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
desc.Width = width;
desc.Height = height / 2;                        // UV plane is half height
desc.DepthOrArraySize = 1;
desc.MipLevels = 1;
desc.Format = DXGI_FORMAT_R8G8_UNORM;            // Two 8-bit channels (U, V)
desc.SampleDesc.Count = 1;
desc.SampleDesc.Quality = 0;
desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;    // Linear layout for CUDA
desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS;

Implementation Details

1. D3D12 Texture Creation

File: D3D12Manager.h/.cpp

class D3D12Manager {
public:
    // Create separate Y and UV plane textures
    ID3D12Resource* CreateYPlaneTexture(uint32_t width, uint32_t height);
    ID3D12Resource* CreateUVPlaneTexture(uint32_t width, uint32_t height);

    // Readback separate textures and combine into NV12 buffer
    uint8_t* ReadbackSeparateTextures(ID3D12Resource* y_texture,
                                       ID3D12Resource* uv_texture,
                                       uint32_t width, uint32_t height);
};

Implementation:

vav2/platforms/windows/tests/large-resolution/src/D3D12Manager.cpp:214-280

2. CUDA-D3D12 Interop

CUDA Import Process:

// 1. Create shared handle for D3D12 resource
HANDLE shared_handle;
device->CreateSharedHandle(texture, nullptr, GENERIC_ALL, nullptr, &shared_handle);

// 2. Get resource allocation size
D3D12_RESOURCE_ALLOCATION_INFO alloc_info =
    device->GetResourceAllocationInfo(0, 1, &desc);

// 3. Import to CUDA as external memory
cudaExternalMemoryHandleDesc mem_desc = {};
mem_desc.type = cudaExternalMemoryHandleTypeD3D12Heap;
mem_desc.handle.win32.handle = shared_handle;
mem_desc.size = alloc_info.SizeInBytes;
mem_desc.flags = cudaExternalMemoryDedicated;

cudaExternalMemory_t external_memory;
cudaImportExternalMemory(&external_memory, &mem_desc);

// 4. Map to linear buffer
cudaExternalMemoryBufferDesc buffer_desc = {};
buffer_desc.size = mem_desc.size;

CUdeviceptr device_ptr;
cudaExternalMemoryGetMappedBuffer((void**)&device_ptr, external_memory, &buffer_desc);

Key Point: ROW_MAJOR textures are successfully imported as linear buffers, allowing cudaMemcpy2D to work correctly.

3. CUDA Copy Operations

// Copy Y plane (width × height, single channel)
cudaMemcpy2D(
    (void*)dst_y_ptr, dst_y_pitch,
    (void*)src_y_ptr, src_y_pitch,
    width, height,
    cudaMemcpyDeviceToDevice
);

// Copy UV plane (width × height/2, dual channel)
cudaMemcpy2D(
    (void*)dst_uv_ptr, dst_uv_pitch,
    (void*)src_uv_ptr, src_uv_pitch,
    width, height / 2,
    cudaMemcpyDeviceToDevice
);

Success Criteria:

✅ Both copies complete without CUDA errors
✅ Linear layout allows direct memory access
✅ Pitch alignment handled automatically by D3D12

4. D3D12 Readback

File: D3D12Manager.cpp:282-430

Process:

Query GetCopyableFootprints for both Y and UV textures
Create single readback buffer (size = Y bytes + UV bytes)
Copy Y plane to readback buffer (offset 0)
Copy UV plane to readback buffer (offset = Y total bytes)
Map readback buffer and combine into NV12 CPU buffer

// Get layouts for both planes
D3D12_PLACED_SUBRESOURCE_FOOTPRINT y_layout, uv_layout;
device->GetCopyableFootprints(&y_desc, 0, 1, 0, &y_layout, ...);
device->GetCopyableFootprints(&uv_desc, 0, 1, 0, &uv_layout, ...);

// Create combined readback buffer
UINT64 total_size = y_total_bytes + uv_total_bytes;
CreateCommittedResource(..., total_size, ...);

// Copy with proper offsets
dst_uv.PlacedFootprint.Offset = y_total_bytes;  // UV starts after Y

Shader Integration

HLSL Shader Changes

Before (Combined NV12):

Texture2D<float4> nv12Texture : register(t0);
SamplerState linearSampler : register(s0);

float4 PSMain(PSInput input) : SV_TARGET
{
    float4 yuv = nv12Texture.Sample(linearSampler, input.uv);
    // YUV to RGB conversion...
}

After (Separate Y/UV):

Texture2D<float> yTexture : register(t0);   // R8_UNORM
Texture2D<float2> uvTexture : register(t1); // RG8_UNORM
SamplerState linearSampler : register(s0);

float4 PSMain(PSInput input) : SV_TARGET
{
    float y = yTexture.Sample(linearSampler, input.uv).r;
    float2 uv = uvTexture.Sample(linearSampler, input.uv).rg;

    // YUV to RGB conversion
    float3 yuv = float3(y, uv.r, uv.g);
    // ... rest of conversion
}

Considerations:

Need to bind two separate textures instead of one
Slightly more complex shader logic
UV coordinates remain the same (automatically interpolated at half resolution)

Performance Analysis

Memory Usage

For 1920×1080 video:

Component	Combined NV12	Separate Y/UV	Difference
Y Plane	1920×1080 = 2,073,600 bytes	1920×1080 = 2,073,600 bytes	Same
UV Plane	1920×540 = 1,036,800 bytes	1920×540×2 = 2,073,600 bytes	+1,036,800 bytes
Total	~3.11 MB	~4.14 MB	+33% overhead

Note: RG8 format uses 2 bytes per pixel (U and V separate), while NV12 UV plane uses 2 bytes per 2 pixels (interleaved). This causes the +33% memory overhead.

Performance Characteristics

Operation	Combined NV12	Separate Y/UV
Texture Creation	1 call	2 calls
CUDA Import	1 import (fails)	2 imports (succeeds)
Copy Operations	2 copies (1 fails)	2 copies (both succeed)
Shader Sampling	1 texture read	2 texture reads
Overall	❌ Non-functional	✅ Functional

Expected Performance Impact:

Negligible: Memory bandwidth dominated by video resolution, not texture count
Modern GPUs handle multiple texture reads efficiently
Linear layout may actually improve cache performance

Implementation Status

✅ Completed

D3D12Manager Updates
- ✅ CreateYPlaneTexture() implemented
- ✅ CreateUVPlaneTexture() implemented
- ✅ ReadbackSeparateTextures() implemented
- Location: vav2/platforms/windows/tests/large-resolution/src/D3D12Manager.{h,cpp}
Standalone Test
- ✅ Proof-of-concept test created
- ✅ Verifies D3D12 texture creation succeeds
- ✅ Verifies CUDA import succeeds
- Location: vav2/platforms/windows/tests/separate-texture-test/src/main.cpp

🚧 Pending

VavCore Integration
- ⏳ Update D3D12SurfaceHandler to support separate texture copy
- ⏳ Add CopySeparateNV12Frame() method
- ⏳ Modify ExternalMemoryCache to cache 2 textures instead of 1
NVDEC Decoder Updates
- ⏳ Modify NVDECAV1Decoder to create 2 D3D12 textures per frame
- ⏳ Update surface allocation logic
- ⏳ Update frame metadata to track both textures
Rendering Pipeline
- ⏳ Update D3D12VideoRenderer to bind 2 textures
- ⏳ Modify YUV→RGB shader to read from separate textures
- ⏳ Update descriptor heap layout
Testing
- ⏳ Test with red-surface-nvdec
- ⏳ Test with large-resolution test
- ⏳ Performance benchmarking

Alternative Solutions Considered

Option A: NV12→RGB Conversion in CUDA

Description: Use CUDA kernel to convert NV12 to RGBA before D3D12 copy

Pros:

Single RGBA texture (simpler)
D3D12 RGBA textures support ROW_MAJOR
Standard RGB rendering pipeline

Cons:

Requires CUDA kernel implementation
Extra GPU computation (YUV→RGB conversion)
Higher memory bandwidth (RGBA = 4 bytes/pixel vs NV12 = 1.5 bytes/pixel)
Color space conversion happens twice (CUDA + shader)

Decision: Rejected - Unnecessary computation and bandwidth overhead

Option B: Separate Y/UV Textures (Selected)

Description: Current approach documented above

Pros:

Minimal changes to existing pipeline
No extra color space conversions
Linear layout guarantees CUDA compatibility
Proven to work with R8/RG8 formats

Cons:

+33% memory overhead
Slightly more complex shader
Need to manage 2 textures instead of 1

Decision: Selected - Best balance of simplicity and performance

Code Examples

Complete Y Plane Texture Creation

ID3D12Resource* D3D12Manager::CreateYPlaneTexture(uint32_t width, uint32_t height)
{
    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
    desc.Width = width;
    desc.Height = height;
    desc.DepthOrArraySize = 1;
    desc.MipLevels = 1;
    desc.Format = DXGI_FORMAT_R8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.SampleDesc.Quality = 0;
    desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;  // CRITICAL for CUDA
    desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS;

    D3D12_HEAP_PROPERTIES heap_props = {};
    heap_props.Type = D3D12_HEAP_TYPE_DEFAULT;

    ID3D12Resource* texture = nullptr;
    HRESULT hr = m_device->CreateCommittedResource(
        &heap_props,
        D3D12_HEAP_FLAG_SHARED,  // CRITICAL for CUDA interop
        &desc,
        D3D12_RESOURCE_STATE_COMMON,
        nullptr,
        IID_PPV_ARGS(&texture));

    if (FAILED(hr)) {
        printf("[ERROR] Failed to create Y plane texture: 0x%08X\n", hr);
        return nullptr;
    }

    return texture;
}

Complete UV Plane Texture Creation

ID3D12Resource* D3D12Manager::CreateUVPlaneTexture(uint32_t width, uint32_t height)
{
    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
    desc.Width = width;
    desc.Height = height / 2;  // UV plane is half height
    desc.DepthOrArraySize = 1;
    desc.MipLevels = 1;
    desc.Format = DXGI_FORMAT_R8G8_UNORM;  // Interleaved U and V
    desc.SampleDesc.Count = 1;
    desc.SampleDesc.Quality = 0;
    desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;  // CRITICAL for CUDA
    desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS;

    D3D12_HEAP_PROPERTIES heap_props = {};
    heap_props.Type = D3D12_HEAP_TYPE_DEFAULT;

    ID3D12Resource* texture = nullptr;
    HRESULT hr = m_device->CreateCommittedResource(
        &heap_props,
        D3D12_HEAP_FLAG_SHARED,  // CRITICAL for CUDA interop
        &desc,
        D3D12_RESOURCE_STATE_COMMON,
        nullptr,
        IID_PPV_ARGS(&texture));

    if (FAILED(hr)) {
        printf("[ERROR] Failed to create UV plane texture: 0x%08X\n", hr);
        return nullptr;
    }

    return texture;
}

Testing Strategy

Unit Test: Texture Creation

// Test 1: Verify Y plane texture creation succeeds
ID3D12Resource* y_texture = CreateYPlaneTexture(1920, 1080);
assert(y_texture != nullptr);

// Test 2: Verify UV plane texture creation succeeds
ID3D12Resource* uv_texture = CreateUVPlaneTexture(1920, 1080);
assert(uv_texture != nullptr);

// Test 3: Verify correct dimensions
D3D12_RESOURCE_DESC y_desc = y_texture->GetDesc();
assert(y_desc.Width == 1920 && y_desc.Height == 1080);

D3D12_RESOURCE_DESC uv_desc = uv_texture->GetDesc();
assert(uv_desc.Width == 1920 && uv_desc.Height == 540);

// Test 4: Verify correct formats
assert(y_desc.Format == DXGI_FORMAT_R8_UNORM);
assert(uv_desc.Format == DXGI_FORMAT_R8G8_UNORM);

// Test 5: Verify ROW_MAJOR layout
assert(y_desc.Layout == D3D12_TEXTURE_LAYOUT_ROW_MAJOR);
assert(uv_desc.Layout == D3D12_TEXTURE_LAYOUT_ROW_MAJOR);

Integration Test: CUDA Import

// Test 6: Verify CUDA can import Y plane
CUdeviceptr y_ptr;
bool y_success = TestCUDAImport(device, y_texture, &y_ptr);
assert(y_success == true);

// Test 7: Verify CUDA can import UV plane
CUdeviceptr uv_ptr;
bool uv_success = TestCUDAImport(device, uv_texture, &uv_ptr);
assert(uv_success == true);

// Test 8: Verify pointers are valid
assert(y_ptr != 0);
assert(uv_ptr != 0);

End-to-End Test: Video Decoding

// Test 9: Decode full video with separate textures
for (int frame = 0; frame < total_frames; frame++) {
    // Decode frame with NVDEC
    vavcore_decode_next_frame(player, &nvdec_frame);

    // Copy Y plane from CUDA to D3D12
    cudaMemcpy2D(y_d3d_ptr, ..., nvdec_frame.y_ptr, ...);

    // Copy UV plane from CUDA to D3D12
    cudaMemcpy2D(uv_d3d_ptr, ..., nvdec_frame.uv_ptr, ...);

    // Verify no CUDA errors
    assert(cudaGetLastError() == cudaSuccess);
}

Migration Guide

For Existing Code Using Combined NV12

Before:

// Create single NV12 texture
ID3D12Resource* nv12_texture = CreateNV12Texture(width, height);

// Copy NV12 frame
CopyNV12Frame(src_frame, src_pitch, nv12_texture, width, height);

// Bind texture
SetShaderTexture(0, nv12_texture);

After:

// Create separate Y and UV textures
ID3D12Resource* y_texture = CreateYPlaneTexture(width, height);
ID3D12Resource* uv_texture = CreateUVPlaneTexture(width, height);

// Copy Y and UV planes separately
CopySeparateNV12Frame(src_frame, src_pitch, y_texture, uv_texture, width, height);

// Bind both textures
SetShaderTexture(0, y_texture);
SetShaderTexture(1, uv_texture);

Shader Migration