NVDECAV1Decoder C++ Refactoring Design

Date: 2025-10-03
Status: Design Phase
Goal: Refactor NVDECAV1Decoder internal C++ code for readability and maintainability


Problem Analysis

Current State

  • File: vav2/platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp
  • Size: 1,722 lines (too large)
  • Main Method: DecodeToSurface() is 500+ lines with deeply nested logic

Key Issues

  1. Monolithic Method: DecodeToSurface() handles CPU, D3D11, D3D12, and CUDA targets in a single function
  2. Mixed Responsibilities: Decoding, surface copying, memory management, and fence signaling are interleaved
  3. Hard to Debug: Pitch/stride bugs are difficult to trace through the deep nesting
  4. Difficult to Test: Individual components cannot be unit tested in isolation
  5. Poor Readability: Excessive debug logging obscures the control flow

Design Goals

Primary Goals

  1. Readability: Each method should do ONE thing clearly
  2. Maintainability: Bugs (like the current NV12 stride issue) should be easy to locate and fix
  3. Testability: Each component can be tested independently
  4. Performance: No added runtime overhead; keep small helpers inline where appropriate

Non-Goals

  • NOT creating a C API (VavCore already provides that)
  • NOT changing external interface of NVDECAV1Decoder
  • NOT over-engineering with complex patterns

Proposed Architecture

File Structure

NVDECAV1Decoder.h         (Public interface - unchanged)
NVDECAV1Decoder.cpp       (Main decoder - ~600 lines)
  └── Uses helper classes below

D3D12SurfaceHandler.h     (D3D12-specific logic - ~300 lines)
D3D12SurfaceHandler.cpp
  ├── CopyNV12Frame()
  ├── GetD3D12CUDAPointer()
  └── SignalD3D12Fence()

ExternalMemoryCache.h     (CUDA-D3D12 interop cache - ~200 lines)
ExternalMemoryCache.cpp
  ├── GetOrCreateExternalMemory()
  ├── ImportD3D12Resource()
  ├── Release()
  └── ReleaseAll()

Class Diagram

NVDECAV1Decoder (Main decoder)
├── CUvideodecoder m_decoder
├── CUvideoparser m_parser
├── CUcontext m_cudaContext
└── std::unique_ptr<D3D12SurfaceHandler> m_d3d12Handler (created on demand)

D3D12SurfaceHandler
├── ID3D12Device* m_device
├── CUcontext m_cudaContext
├── std::unique_ptr<ExternalMemoryCache> m_cache
└── Methods:
    ├── CopyNV12Frame(src_frame, src_pitch, dst_texture, width, height)
    ├── GetD3D12CUDAPointer(resource, out_ptr)
    └── SignalD3D12Fence(fence_value)

ExternalMemoryCache
├── std::map<ID3D12Resource*, CachedEntry> m_cache
└── Methods:
    ├── GetOrCreateExternalMemory(resource, out_ptr)
    ├── Release(resource)
    └── ReleaseAll()

Refactored Code Structure

1. NVDECAV1Decoder.cpp (Main decoder - simplified)

Before: 500+ lines in DecodeToSurface()

After: Clean routing logic

bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data, size_t packet_size,
                                      void* target_surface, SurfaceType target_type)
{
    // Step 1: Decode packet to NVDEC internal buffer
    if (!DecodePacket(packet_data, packet_size)) {
        return false;
    }

    // Step 2: Get decoded frame info (maps the frame; it must be unmapped in Step 4)
    DecodedFrameInfo frame_info;
    if (!GetDecodedFrame(&frame_info)) {
        return false;
    }

    // Step 3: Copy to target surface based on type
    bool result = false;
    switch (target_type) {
        case SURFACE_TYPE_CPU:
            result = CopyToCPUSurface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_D3D12:
            result = CopyToD3D12Surface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_D3D11:
            result = CopyToD3D11Surface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_CUDA:
            result = CopyToCUDASurface(frame_info, target_surface);
            break;
        default:
            // Unknown surface type: report failure, but still unmap below
            LogError("Unsupported surface type: %d", static_cast<int>(target_type));
            break;
    }

    // Step 4: Cleanup (runs for every outcome of Step 3)
    cuvidUnmapVideoFrame(m_decoder, frame_info.device_ptr);
    return result;
}
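
Step 4 has to run on every exit path once the frame has been mapped, which is easy to break when an early return is added later. One way to make this harder to get wrong is a small scope guard. The following is a minimal sketch, not part of the existing code: ScopeGuard, MakeScopeGuard, and CopyWithGuard are hypothetical names, and a counter stands in for the real cuvidUnmapVideoFrame call so the behaviour can be checked without a GPU.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper (not part of VavCore or the NVDEC API): runs a callable
// when the guard leaves scope, unless Dismiss() was called first.
template <typename F>
class ScopeGuard {
public:
    explicit ScopeGuard(F fn) : m_fn(std::move(fn)), m_active(true) {}
    ScopeGuard(ScopeGuard&& other) noexcept
        : m_fn(std::move(other.m_fn)), m_active(other.m_active) {
        other.m_active = false;  // the moved-from guard must not fire
    }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;
    ScopeGuard& operator=(ScopeGuard&&) = delete;
    ~ScopeGuard() { if (m_active) m_fn(); }

    void Dismiss() { m_active = false; }

private:
    F m_fn;
    bool m_active;
};

template <typename F>
ScopeGuard<F> MakeScopeGuard(F fn) {
    return ScopeGuard<F>(std::move(fn));
}

// Stand-in for Steps 3 and 4: the counter models "frame was unmapped".
// In the decoder, the lambda body would be the cuvidUnmapVideoFrame call.
inline bool CopyWithGuard(bool copy_succeeds, int* unmap_calls) {
    assert(unmap_calls != nullptr);
    auto unmap = MakeScopeGuard([unmap_calls] { ++(*unmap_calls); });
    if (!copy_succeeds) {
        return false;  // guard fires on the failure path
    }
    return true;       // and on the success path
}
```

With this shape, the switch in Step 3 could return directly from each case without skipping the unmap. Whether the extra type is worth it for a single call site is a judgment call; the explicit Step 4 above is equally valid.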

2. Private Helper Methods (in NVDECAV1Decoder.cpp)

// Decode packet using cuvidParseVideoData
// Returns: true on success
// Complexity: ~30 lines
// (Declared in the private section of NVDECAV1Decoder)
bool NVDECAV1Decoder::DecodePacket(const uint8_t* data, size_t size)
{
    CUVIDSOURCEDATAPACKET packet = {};
    packet.payload = data;
    packet.payload_size = static_cast<unsigned long>(size);
    // Timestamp stays 0 from zero-initialization; pass a real PTS if callers need one
    packet.flags = CUVID_PKT_TIMESTAMP;

    CUresult result = cuvidParseVideoData(m_parser, &packet);
    if (result != CUDA_SUCCESS) {
        LogError("cuvidParseVideoData failed: %d", static_cast<int>(result));
        return false;
    }
    return true;
}

// Get decoded frame from internal queue
// Returns: true if frame available
// Complexity: ~40 lines
// (Struct and method are declared in the private section of NVDECAV1Decoder)
struct DecodedFrameInfo {
    CUdeviceptr device_ptr;  // Mapped NVDEC frame; must be unmapped by the caller
    uint32_t pitch;          // Row pitch of the mapped frame, in bytes
    uint32_t width;
    uint32_t height;
};

bool NVDECAV1Decoder::GetDecodedFrame(DecodedFrameInfo* out_info)
{
    if (!out_info || m_frameQueue.empty()) {
        return false;
    }

    int frame_index = m_frameQueue.front();
    m_frameQueue.pop();

    CUVIDPROCPARAMS proc_params = {};
    proc_params.progressive_frame = 1;

    CUdeviceptr device_ptr = 0;
    unsigned int pitch = 0;
    CUresult result = cuvidMapVideoFrame(m_decoder, frame_index,
                                         &device_ptr, &pitch, &proc_params);

    if (result != CUDA_SUCCESS) {
        LogError("cuvidMapVideoFrame failed: %d", static_cast<int>(result));
        return false;
    }

    out_info->device_ptr = device_ptr;
    out_info->pitch = pitch;
    out_info->width = m_width;
    out_info->height = m_height;

    return true;
}

// Copy to D3D12 surface (delegates to handler)
// Returns: true on success
// Complexity: ~20 lines
// (Declared in the private section of NVDECAV1Decoder)
bool NVDECAV1Decoder::CopyToD3D12Surface(const DecodedFrameInfo& frame, void* surface)
{
    auto* d3d12_resource = static_cast<ID3D12Resource*>(surface);
    if (!d3d12_resource) {
        return false;
    }

    // Create handler on-demand
    if (!m_d3d12Handler) {
        m_d3d12Handler = std::make_unique<D3D12SurfaceHandler>(
            m_d3d12Device, m_cudaContext
        );
    }

    return m_d3d12Handler->CopyNV12Frame(
        frame.device_ptr,
        frame.pitch,
        d3d12_resource,
        frame.width,
        frame.height
    );
}

3. D3D12SurfaceHandler.h (D3D12-specific operations)

#pragma once

#include <d3d12.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <memory>

namespace VavCore {

// Forward declaration
class ExternalMemoryCache;

class D3D12SurfaceHandler {
public:
    D3D12SurfaceHandler(ID3D12Device* device, CUcontext cuda_context);
    ~D3D12SurfaceHandler();

    // Copy NV12 frame from CUDA to D3D12 texture
    // Returns: true on success
    bool CopyNV12Frame(CUdeviceptr src_frame,
                       uint32_t src_pitch,
                       ID3D12Resource* dst_texture,
                       uint32_t width,
                       uint32_t height);

    // Signal D3D12 fence from CUDA stream
    // Returns: true on success
    bool SignalD3D12Fence(uint64_t fence_value);

private:
    // Get CUDA device pointer for D3D12 resource (uses cache)
    bool GetD3D12CUDAPointer(ID3D12Resource* resource, CUdeviceptr* out_ptr);

    // Copy Y plane (8-bit single channel)
    bool CopyYPlane(CUdeviceptr src, uint32_t src_pitch,
                   CUdeviceptr dst, uint32_t dst_pitch,
                   uint32_t width, uint32_t height);

    // Copy UV plane (8-bit dual channel, interleaved)
    bool CopyUVPlane(CUdeviceptr src, uint32_t src_pitch,
                    CUdeviceptr dst, uint32_t dst_pitch,
                    uint32_t width, uint32_t height);

private:
    ID3D12Device* m_device;
    CUcontext m_cudaContext;
    std::unique_ptr<ExternalMemoryCache> m_cache;
};

} // namespace VavCore

4. D3D12SurfaceHandler.cpp (Implementation)

#include "D3D12SurfaceHandler.h"
#include "ExternalMemoryCache.h"
#include <cstdio>

namespace VavCore {

D3D12SurfaceHandler::D3D12SurfaceHandler(ID3D12Device* device, CUcontext cuda_context)
    : m_device(device)
    , m_cudaContext(cuda_context)
    , m_cache(std::make_unique<ExternalMemoryCache>(device, cuda_context))
{
}

D3D12SurfaceHandler::~D3D12SurfaceHandler()
{
}

bool D3D12SurfaceHandler::CopyNV12Frame(CUdeviceptr src_frame,
                                        uint32_t src_pitch,
                                        ID3D12Resource* dst_texture,
                                        uint32_t width,
                                        uint32_t height)
{
    // Get CUDA pointer for D3D12 resource
    CUdeviceptr dst_ptr = 0;
    if (!GetD3D12CUDAPointer(dst_texture, &dst_ptr)) {
        return false;
    }

    // Get D3D12 texture layout (subresource 0 = Y plane, subresource 1 = UV plane)
    D3D12_RESOURCE_DESC desc = dst_texture->GetDesc();
    D3D12_PLACED_SUBRESOURCE_FOOTPRINT layouts[2] = {};
    UINT num_rows[2] = {0};
    UINT64 row_sizes[2] = {0};
    UINT64 total_bytes = 0;

    m_device->GetCopyableFootprints(&desc, 0, 2, 0,
                                    layouts, num_rows, row_sizes, &total_bytes);

    // Copy Y plane
    if (!CopyYPlane(src_frame, src_pitch,
                   dst_ptr, layouts[0].Footprint.RowPitch,
                   width, height)) {
        return false;
    }

    // Copy UV plane
    // Note: this assumes the decoded surface height equals `height`. If the decoder
    // allocates a taller (aligned) surface, the UV plane starts at
    // src_pitch * surface_height instead.
    CUdeviceptr src_uv = src_frame + (static_cast<CUdeviceptr>(src_pitch) * height);
    CUdeviceptr dst_uv = dst_ptr + layouts[1].Offset;

    if (!CopyUVPlane(src_uv, src_pitch,
                    dst_uv, layouts[1].Footprint.RowPitch,
                    width, height / 2)) {
        return false;
    }

    return true;
}

bool D3D12SurfaceHandler::GetD3D12CUDAPointer(ID3D12Resource* resource,
                                              CUdeviceptr* out_ptr)
{
    return m_cache->GetOrCreateExternalMemory(resource, out_ptr);
}

bool D3D12SurfaceHandler::CopyYPlane(CUdeviceptr src, uint32_t src_pitch,
                                     CUdeviceptr dst, uint32_t dst_pitch,
                                     uint32_t width, uint32_t height)
{
    // Width is in bytes: 8-bit Y plane, one byte per pixel
    cudaError_t err = cudaMemcpy2D(
        reinterpret_cast<void*>(dst), dst_pitch,
        reinterpret_cast<const void*>(src), src_pitch,
        width, height,  // Copy only valid pixels, not padding
        cudaMemcpyDeviceToDevice
    );

    if (err != cudaSuccess) {
        printf("[D3D12] Y plane copy failed: %d\n", static_cast<int>(err));
        return false;
    }

    return true;
}

bool D3D12SurfaceHandler::CopyUVPlane(CUdeviceptr src, uint32_t src_pitch,
                                      CUdeviceptr dst, uint32_t dst_pitch,
                                      uint32_t width, uint32_t height)
{
    // NV12 UV plane: interleaved U and V at half horizontal resolution,
    // so the row width in bytes equals the Y plane width.
    // The caller passes the UV row count (frame height / 2) as `height`.
    cudaError_t err = cudaMemcpy2D(
        reinterpret_cast<void*>(dst), dst_pitch,
        reinterpret_cast<const void*>(src), src_pitch,
        width, height,
        cudaMemcpyDeviceToDevice
    );

    if (err != cudaSuccess) {
        printf("[D3D12] UV plane copy failed: %d\n", static_cast<int>(err));
        return false;
    }

    return true;
}

} // namespace VavCore
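
The relationship between width, pitch, and plane offsets in the two copy methods can be checked on the CPU without CUDA. The sketch below is a reference model only: CopyPitched and CopyNV12Reference are hypothetical names, the loop stands in for what cudaMemcpy2D does per row, and both buffers are assumed to place the UV plane directly after the Y plane at pitch * height (the real D3D12 destination takes that offset from layouts[1].Offset instead).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// CPU stand-in for a pitched 2D copy: copies `height` rows of `width_bytes`
// bytes, advancing each pointer by its own pitch. Padding is never touched.
inline void CopyPitched(std::uint8_t* dst, std::size_t dst_pitch,
                        const std::uint8_t* src, std::size_t src_pitch,
                        std::size_t width_bytes, std::size_t height) {
    assert(width_bytes <= dst_pitch && width_bytes <= src_pitch);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dst + row * dst_pitch, src + row * src_pitch, width_bytes);
    }
}

// Hypothetical reference: copies an 8-bit NV12 frame between two buffers laid
// out as [Y plane: pitch * height bytes][UV plane: pitch * height / 2 bytes].
inline void CopyNV12Reference(std::uint8_t* dst, std::size_t dst_pitch,
                              const std::uint8_t* src, std::size_t src_pitch,
                              std::size_t width, std::size_t height) {
    // Y plane: one byte per pixel, `height` rows.
    CopyPitched(dst, dst_pitch, src, src_pitch, width, height);
    // UV plane: interleaved U/V at half resolution, so `width` bytes per row
    // and height / 2 rows.
    CopyPitched(dst + dst_pitch * height, dst_pitch,
                src + src_pitch * height, src_pitch,
                width, height / 2);
}
```

A model like this can serve as the expected result when comparing against a GPU readback, since it makes the three quantities that tend to get confused (visible width in bytes, source pitch, destination pitch) separate parameters.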

5. ExternalMemoryCache.h (CUDA-D3D12 interop cache)

#pragma once

#include <d3d12.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <map>

namespace VavCore {

class ExternalMemoryCache {
public:
    ExternalMemoryCache(ID3D12Device* device, CUcontext cuda_context);
    ~ExternalMemoryCache();

    // Get or create CUDA device pointer for D3D12 resource
    // Returns: true on success
    bool GetOrCreateExternalMemory(ID3D12Resource* resource, CUdeviceptr* out_ptr);

    // Release specific resource
    void Release(ID3D12Resource* resource);

    // Release all cached resources
    void ReleaseAll();

private:
    struct CachedEntry {
        cudaExternalMemory_t external_memory;
        CUdeviceptr device_ptr;
        size_t size;
    };

    bool ImportD3D12Resource(ID3D12Resource* resource,
                            cudaExternalMemory_t* out_ext_mem,
                            CUdeviceptr* out_ptr);

private:
    ID3D12Device* m_device;
    CUcontext m_cudaContext;
    std::map<ID3D12Resource*, CachedEntry> m_cache;
};

} // namespace VavCore
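
The get-or-create behaviour declared above can be modelled without D3D12 or CUDA types, which is useful for pinning down the intended semantics before the real import code is moved. The following is a sketch under stated assumptions: PointerCacheSketch and its callbacks are hypothetical, an opaque key stands in for ID3D12Resource*, a 64-bit integer stands in for CUdeviceptr, and the import/release callbacks stand in for the CUDA external memory calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical, API-free model of the cache semantics.
class PointerCacheSketch {
public:
    using Key = const void*;          // stands in for ID3D12Resource*
    using DevicePtr = std::uint64_t;  // stands in for CUdeviceptr
    using ImportFn = std::function<bool(Key, DevicePtr* out)>;
    using ReleaseFn = std::function<void(DevicePtr)>;

    PointerCacheSketch(ImportFn import_fn, ReleaseFn release_fn)
        : m_import(std::move(import_fn)), m_release(std::move(release_fn)) {
        assert(m_import && m_release);
    }
    ~PointerCacheSketch() { ReleaseAll(); }

    PointerCacheSketch(const PointerCacheSketch&) = delete;
    PointerCacheSketch& operator=(const PointerCacheSketch&) = delete;

    // Returns the cached pointer, importing the resource on first use.
    bool GetOrCreate(Key key, DevicePtr* out_ptr) {
        if (key == nullptr || out_ptr == nullptr) return false;
        auto it = m_entries.find(key);
        if (it == m_entries.end()) {
            DevicePtr imported = 0;
            if (!m_import(key, &imported)) return false;  // failures are not cached
            it = m_entries.emplace(key, imported).first;
        }
        *out_ptr = it->second;
        return true;
    }

    // Drops one entry; a no-op if the key was never imported.
    void Release(Key key) {
        auto it = m_entries.find(key);
        if (it == m_entries.end()) return;
        m_release(it->second);
        m_entries.erase(it);
    }

    void ReleaseAll() {
        for (auto& entry : m_entries) m_release(entry.second);
        m_entries.clear();
    }

    std::size_t Size() const { return m_entries.size(); }

private:
    ImportFn m_import;
    ReleaseFn m_release;
    std::map<Key, DevicePtr> m_entries;
};
```

One property worth keeping in mind for the real class: a cache keyed by raw pointer can return a stale entry if a resource is destroyed and a new one is later allocated at the same address, so Release() should be called before the application frees a texture.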

Key Improvements

Readability

Before:

  • DecodeToSurface(): 500+ lines with 5 levels of nesting
  • Mixed concerns: decoding, copying, caching, signaling

After:

  • DecodeToSurface(): 40 lines, clear 4-step process
  • Each helper method: 20-60 lines, single responsibility

Debugging

Before:

  • NV12 stride bug hidden in 500 lines of mixed logic
  • Hard to locate which cudaMemcpy2D call is wrong

After:

  • CopyYPlane() and CopyUVPlane() are separate methods
  • Easy to add breakpoint and inspect parameters
  • Clear separation of Y and UV plane logic

Testing

Before:

  • Cannot test D3D12 copying without full decoder setup
  • Cannot mock CUDA operations

After:

  • Can unit test D3D12SurfaceHandler independently
  • Can test ExternalMemoryCache in isolation
  • Easy to add mock implementations
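
To illustrate what a mock implementation could look like, the sketch below puts the 2D copy behind a small interface so the plane-splitting arithmetic can be tested without a GPU. Everything here is an assumption for illustration: IPlaneCopier, RecordingCopier, and CopyNV12Planes are hypothetical names that do not exist in the current code, and a single destination pitch is used for both planes for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical seam: the handler would call this instead of cudaMemcpy2D directly.
struct IPlaneCopier {
    virtual ~IPlaneCopier() = default;
    virtual bool Copy2D(std::uint64_t dst, std::size_t dst_pitch,
                        std::uint64_t src, std::size_t src_pitch,
                        std::size_t width_bytes, std::size_t height) = 0;
};

// Test double that records every request instead of touching the GPU.
struct RecordingCopier : IPlaneCopier {
    struct Call {
        std::uint64_t dst, src;
        std::size_t dst_pitch, src_pitch, width_bytes, height;
    };
    std::vector<Call> calls;
    bool succeed = true;

    bool Copy2D(std::uint64_t dst, std::size_t dst_pitch,
                std::uint64_t src, std::size_t src_pitch,
                std::size_t width_bytes, std::size_t height) override {
        calls.push_back({dst, src, dst_pitch, src_pitch, width_bytes, height});
        return succeed;
    }
};

// Plane-splitting logic under test; dst_uv_offset mirrors layouts[1].Offset.
inline bool CopyNV12Planes(IPlaneCopier& copier,
                           std::uint64_t src, std::size_t src_pitch,
                           std::uint64_t dst, std::size_t dst_pitch,
                           std::size_t dst_uv_offset,
                           std::size_t width, std::size_t height) {
    // Y plane: full height.
    if (!copier.Copy2D(dst, dst_pitch, src, src_pitch, width, height)) {
        return false;
    }
    // UV plane: same row width in bytes, half the rows.
    return copier.Copy2D(dst + dst_uv_offset, dst_pitch,
                         src + src_pitch * height, src_pitch,
                         width, height / 2);
}
```

A production implementation of the interface would forward to cudaMemcpy2D. The virtual call adds a small indirection per plane copy, which should be weighed against the zero-overhead goal stated earlier.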

Maintenance

Before:

  • Adding D3D11 support requires modifying 500+ line method
  • Risk of breaking existing D3D12 code

After:

  • Add new D3D11SurfaceHandler class
  • Existing D3D12 code untouched
  • Clean separation of concerns
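
If the per-API handlers eventually share a common shape, adding D3D11 could reduce to registering one more handler. The sketch below shows one possible arrangement and is not part of the proposed design: SurfaceKind, FrameView, ISurfaceHandler, SurfaceRouter, and CountingHandler are all hypothetical names, and the switch in DecodeToSurface() remains the simpler option while there are only a few surface types.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-ins for SurfaceType and DecodedFrameInfo.
enum class SurfaceKind { Cpu, D3D11, D3D12, Cuda };

struct FrameView {
    unsigned width;
    unsigned height;
    unsigned pitch;
};

// Common interface each per-API handler would implement.
struct ISurfaceHandler {
    virtual ~ISurfaceHandler() = default;
    virtual bool CopyFrame(const FrameView& frame, void* target) = 0;
};

// Routes a copy request to whichever handler is registered for the surface kind.
class SurfaceRouter {
public:
    void Register(SurfaceKind kind, std::unique_ptr<ISurfaceHandler> handler) {
        assert(handler != nullptr);
        m_handlers[kind] = std::move(handler);
    }

    bool Copy(SurfaceKind kind, const FrameView& frame, void* target) const {
        auto it = m_handlers.find(kind);
        if (it == m_handlers.end()) {
            return false;  // no handler registered: unsupported surface type
        }
        return it->second->CopyFrame(frame, target);
    }

private:
    std::map<SurfaceKind, std::unique_ptr<ISurfaceHandler>> m_handlers;
};

// Trivial handler used only to observe routing.
struct CountingHandler : ISurfaceHandler {
    explicit CountingHandler(int* counter) : m_counter(counter) {}
    bool CopyFrame(const FrameView&, void*) override {
        ++(*m_counter);
        return true;
    }
    int* m_counter;
};
```

In this arrangement, registering a D3D11 handler does not touch the D3D12 handler or the router, which is the property the list above is after.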

File Size Comparison

File                      Before         After
NVDECAV1Decoder.cpp       1,722 lines    ~600 lines
D3D12SurfaceHandler.cpp   -              ~300 lines
ExternalMemoryCache.cpp   -              ~200 lines
Total                     1,722 lines    ~1,100 lines

Reduction: about 36% fewer lines (1,722 to ~1,100) while improving readability


Implementation Plan

Phase 1: Extract D3D12 Handler (2-3 hours)

  1. Create D3D12SurfaceHandler.h/.cpp
  2. Move D3D12 resource import logic
  3. Move NV12 plane copying logic
  4. Test with existing Vav2Player

Acceptance Criteria:

  • Vav2Player displays video correctly
  • No memory leaks
  • Performance same or better

Phase 2: Extract External Memory Cache (1-2 hours)

  1. Create ExternalMemoryCache.h/.cpp
  2. Move external memory caching logic
  3. Add proper cleanup on resource release
  4. Test memory management

Acceptance Criteria:

  • Cache hit/miss working correctly
  • No memory leaks on repeated loads
  • Cache cleared on decoder cleanup

Phase 3: Refactor Main Decoder (1-2 hours)

  1. Simplify DecodeToSurface() to routing logic
  2. Extract DecodePacket() method
  3. Extract GetDecodedFrame() method
  4. Extract CopyToCPUSurface() method
  5. Test all surface types

Acceptance Criteria:

  • All surface types working
  • Code passes all existing tests
  • Debug logging reduced

Phase 4: Fix NV12 Stride Bug (30 minutes)

  1. Fix CopyYPlane() width parameter
  2. Fix CopyUVPlane() width parameter
  3. Verify with test video

Acceptance Criteria:

  • No stripe pattern in displayed video
  • Correct colors displayed
  • Performance maintained
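
Visual inspection catches the stripe pattern, but an automated check on a read-back frame is more repeatable. The sketch below compares only the visible bytes of two pitched planes, so buffers with different pitches can be compared directly. PlanesMatch and MakeTestPlane are hypothetical helpers, and how the frame is read back from the GPU is outside the scope of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Compares the visible bytes of two pitched planes row by row; padding is ignored.
inline bool PlanesMatch(const std::uint8_t* a, std::size_t a_pitch,
                        const std::uint8_t* b, std::size_t b_pitch,
                        std::size_t width_bytes, std::size_t height) {
    assert(width_bytes <= a_pitch && width_bytes <= b_pitch);
    for (std::size_t row = 0; row < height; ++row) {
        if (std::memcmp(a + row * a_pitch, b + row * b_pitch, width_bytes) != 0) {
            return false;
        }
    }
    return true;
}

// Builds a plane whose visible bytes depend on (row, column), so a row that
// lands at the wrong offset is detectable. Padding bytes are filled with 0xCD.
inline std::vector<std::uint8_t> MakeTestPlane(std::size_t pitch,
                                               std::size_t width_bytes,
                                               std::size_t height) {
    assert(width_bytes <= pitch);
    std::vector<std::uint8_t> plane(pitch * height, 0xCD);
    for (std::size_t row = 0; row < height; ++row) {
        for (std::size_t col = 0; col < width_bytes; ++col) {
            plane[row * pitch + col] =
                static_cast<std::uint8_t>((row * 31 + col) & 0xFF);
        }
    }
    return plane;
}
```

A copy that reads the source with the wrong pitch shifts every row after the first, so PlanesMatch fails on such output even when the first row happens to look correct.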

Testing Strategy

Unit Tests

TEST(D3D12SurfaceHandler, CopiesNV12FrameCorrectly)
{
    auto handler = CreateTestHandler();
    auto src_frame = CreateTestNV12Frame(1920, 1080);
    auto dst_texture = CreateTestD3D12Texture(1920, 1080);

    bool result = handler->CopyNV12Frame(
        src_frame.device_ptr, src_frame.pitch,
        dst_texture, 1920, 1080
    );

    EXPECT_TRUE(result);
    VerifyNV12Data(dst_texture);
}

TEST(ExternalMemoryCache, ReusesExistingEntry)
{
    auto cache = CreateTestCache();
    auto resource = CreateTestD3D12Resource();

    CUdeviceptr ptr1 = 0;
    CUdeviceptr ptr2 = 0;
    ASSERT_TRUE(cache->GetOrCreateExternalMemory(resource, &ptr1));
    ASSERT_TRUE(cache->GetOrCreateExternalMemory(resource, &ptr2));

    EXPECT_NE(ptr1, 0u);
    EXPECT_EQ(ptr1, ptr2);  // Second lookup should hit the cache
}

Integration Tests

  • Load video file
  • Decode multiple frames
  • Verify no memory leaks
  • Verify correct video display

Success Criteria

  • Design document complete
  • Phase 1 complete: D3D12SurfaceHandler working
  • Phase 2 complete: ExternalMemoryCache working
  • Phase 3 complete: Main decoder simplified
  • Phase 4 complete: NV12 stride bug fixed
  • All existing tests passing
  • No performance regression
  • Code review passed
  • Documentation updated

Next Step: Start Phase 1 - Extract D3D12SurfaceHandler

Last Updated: 2025-10-03