NVDECAV1Decoder C++ Refactoring Design

Date: 2025-10-03
Status: Design Phase
Goal: Refactor NVDECAV1Decoder internal C++ code for readability and maintainability

Problem Analysis

Current State

File: vav2/platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp
Size: 1,722 lines (too large to navigate comfortably)
Main Method: DecodeToSurface() is 500+ lines with deeply nested logic

Key Issues

Monolithic Method: DecodeToSurface() handles CPU, D3D11, D3D12, CUDA in one giant function
Mixed Responsibilities: Decoding + Surface copying + Memory management + Fence signaling all mixed
Hard to Debug: Pitch/stride bugs are difficult to trace due to complex nesting
Difficult to Test: Cannot unit test individual components in isolation
Poor Readability: Excessive debug logging makes logic hard to follow

Design Goals

Primary Goals

Readability: Each method should do ONE thing clearly
Maintainability: Easy to locate and fix bugs (like current NV12 stride issue)
Testability: Each component can be tested independently
Performance: Zero overhead - use inline functions where appropriate

Non-Goals

NOT creating a C API (VavCore already provides that)
NOT changing external interface of NVDECAV1Decoder
NOT over-engineering with complex patterns

Proposed Architecture

File Structure

NVDECAV1Decoder.h         (Public interface - unchanged)
NVDECAV1Decoder.cpp       (Main decoder - ~600 lines)
  └── Uses helper classes below

D3D12SurfaceHandler.h     (D3D12-specific logic - ~300 lines)
D3D12SurfaceHandler.cpp
  ├── CopyNV12Frame()
  ├── GetD3D12CUDAPointer()
  └── SignalD3D12Fence()

ExternalMemoryCache.h     (CUDA-D3D12 interop cache - ~200 lines)
ExternalMemoryCache.cpp
  ├── GetOrCreateExternalMemory()
  ├── Release()
  └── ReleaseAll()

Class Diagram

NVDECAV1Decoder (Main decoder)
├── CUvideodecoder m_decoder
├── CUvideoparser m_parser
├── CUcontext m_cudaContext
├── std::unique_ptr<D3D12SurfaceHandler> m_d3d12Handler (created on demand)
└── ExternalMemoryCache* m_memoryCache (on-demand)

D3D12SurfaceHandler
├── ID3D12Device* m_device
├── CUcontext m_cudaContext
├── std::unique_ptr<ExternalMemoryCache> m_cache
└── Methods:
    ├── CopyNV12Frame(src, srcPitch, dst, width, height)
    ├── GetD3D12CUDAPointer(resource, outPtr)
    └── SignalD3D12Fence(value)

ExternalMemoryCache
├── std::map<ID3D12Resource*, CachedEntry> m_cache
└── Methods:
    ├── GetOrCreateExternalMemory(resource, outPtr)
    ├── Release(resource)
    └── ReleaseAll()

Refactored Code Structure

1. NVDECAV1Decoder.cpp (Main decoder - simplified)

Before: 500+ lines in DecodeToSurface()

After: Clean routing logic

bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data, size_t packet_size,
                                      void* target_surface, SurfaceType target_type)
{
    // Step 1: Decode packet to NVDEC internal buffer
    if (!DecodePacket(packet_data, packet_size)) {
        return false;
    }

    // Step 2: Get decoded frame info (maps the frame; it must be unmapped in Step 4)
    DecodedFrameInfo frame_info;
    if (!GetDecodedFrame(&frame_info)) {
        return false;
    }

    // Step 3: Copy to target surface based on type
    bool result = false;
    switch (target_type) {
        case SURFACE_TYPE_CPU:
            result = CopyToCPUSurface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_D3D12:
            result = CopyToD3D12Surface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_D3D11:
            result = CopyToD3D11Surface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_CUDA:
            result = CopyToCUDASurface(frame_info, target_surface);
            break;
        default:
            // Unknown surface type: report it and fall through to cleanup
            LogError("Unsupported surface type: %d", static_cast<int>(target_type));
            break;
    }

    // Step 4: Unmap the frame mapped in Step 2, whether or not the copy succeeded
    cuvidUnmapVideoFrame(m_decoder, frame_info.device_ptr);
    return result;
}
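The routing function above depends on Step 4 running on every path once a frame has been mapped. One way to make that automatic is a small scope guard. The following is only a sketch: `FrameUnmapGuard` and `CopyWithGuard` are hypothetical names, and a counting callback stands in for the real `cuvidUnmapVideoFrame` call.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper (not part of the existing code): runs a cleanup
// callback exactly once when the guard goes out of scope.
class FrameUnmapGuard {
public:
    explicit FrameUnmapGuard(std::function<void()> unmap)
        : m_unmap(std::move(unmap)) {
        assert(m_unmap != nullptr);
    }
    ~FrameUnmapGuard() { m_unmap(); }

    FrameUnmapGuard(const FrameUnmapGuard&) = delete;
    FrameUnmapGuard& operator=(const FrameUnmapGuard&) = delete;

private:
    std::function<void()> m_unmap;
};

// Stand-in for Steps 3 and 4: the callback plays the role of
// cuvidUnmapVideoFrame(m_decoder, frame_info.device_ptr).
inline bool CopyWithGuard(bool copy_succeeds, int& unmap_calls) {
    FrameUnmapGuard guard([&unmap_calls] { ++unmap_calls; });
    if (!copy_succeeds) {
        return false;  // the early return still triggers the unmap
    }
    return true;
}
```

With a guard like this, a later change that adds an early return inside Step 3 would not leak a mapped frame.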

2. Private Helper Methods (in NVDECAV1Decoder.cpp)

// Decode packet using cuvidParseVideoData
// Private member; declared in the private section of NVDECAV1Decoder.h
// Returns: true on success
// Complexity: ~30 lines
bool NVDECAV1Decoder::DecodePacket(const uint8_t* data, size_t size)
{
    CUVIDSOURCEDATAPACKET packet = {};
    packet.payload = data;
    packet.payload_size = static_cast<unsigned long>(size);
    packet.flags = CUVID_PKT_TIMESTAMP;

    CUresult result = cuvidParseVideoData(m_parser, &packet);
    if (result != CUDA_SUCCESS) {
        LogError("cuvidParseVideoData failed: %d", result);
        return false;
    }
    return true;
}

// Get decoded frame from internal queue
// Private member; DecodedFrameInfo is defined in NVDECAV1Decoder.h and shown here for reference
// Returns: true if frame available
// Complexity: ~40 lines
struct DecodedFrameInfo {
    CUdeviceptr device_ptr;  // Mapped NVDEC frame (Y plane followed by UV plane)
    uint32_t pitch;          // Row pitch of the mapped frame, in bytes
    uint32_t width;
    uint32_t height;
};

bool NVDECAV1Decoder::GetDecodedFrame(DecodedFrameInfo* out_info)
{
    if (m_frameQueue.empty()) {
        return false;
    }

    int frame_index = m_frameQueue.front();
    m_frameQueue.pop();

    CUVIDPROCPARAMS proc_params = {};
    proc_params.progressive_frame = 1;

    CUdeviceptr device_ptr = 0;
    unsigned int pitch = 0;
    CUresult result = cuvidMapVideoFrame(m_decoder, frame_index,
                                         &device_ptr, &pitch, &proc_params);

    if (result != CUDA_SUCCESS) {
        LogError("cuvidMapVideoFrame failed: %d", result);
        return false;
    }

    // The caller owns the mapping and must call cuvidUnmapVideoFrame()
    out_info->device_ptr = device_ptr;
    out_info->pitch = pitch;
    out_info->width = m_width;
    out_info->height = m_height;

    return true;
}

// Copy to D3D12 surface (delegates to handler)
// Private member; declared in the private section of NVDECAV1Decoder.h
// Returns: true on success
// Complexity: ~20 lines
bool NVDECAV1Decoder::CopyToD3D12Surface(const DecodedFrameInfo& frame, void* surface)
{
    auto* d3d12_resource = static_cast<ID3D12Resource*>(surface);

    // Create handler on demand (m_d3d12Handler is a std::unique_ptr)
    if (!m_d3d12Handler) {
        m_d3d12Handler = std::make_unique<D3D12SurfaceHandler>(
            m_d3d12Device, m_cudaContext
        );
    }

    return m_d3d12Handler->CopyNV12Frame(
        frame.device_ptr,
        frame.pitch,
        d3d12_resource,
        frame.width,
        frame.height
    );
}
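The on-demand creation used in CopyToD3D12Surface() can be exercised without any GPU types. The sketch below is an illustration only: `FakeHandler` and `LazyOwner` are hypothetical stand-ins for D3D12SurfaceHandler and NVDECAV1Decoder, and the counter exists purely to make the lazy behaviour observable.

```cpp
#include <cassert>
#include <memory>

// Stand-in for D3D12SurfaceHandler; counts how many times it is constructed.
struct FakeHandler {
    explicit FakeHandler(int& counter) { ++counter; }
    bool Copy() { return true; }
};

// Hypothetical owner mirroring the m_d3d12Handler member: the handler is
// built on first use and reused for every later frame.
class LazyOwner {
public:
    explicit LazyOwner(int& counter) : m_counter(counter) {}

    bool CopyFrame() {
        if (!m_handler) {
            m_handler = std::make_unique<FakeHandler>(m_counter);
        }
        assert(m_handler != nullptr);
        return m_handler->Copy();
    }

private:
    int& m_counter;
    std::unique_ptr<FakeHandler> m_handler;
};
```

Decoders that never receive a D3D12 surface therefore never pay for the handler or its cache.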

3. D3D12SurfaceHandler.h (D3D12-specific operations)

#pragma once

#include <d3d12.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <memory>

namespace VavCore {

// Forward declaration
class ExternalMemoryCache;

class D3D12SurfaceHandler {
public:
    D3D12SurfaceHandler(ID3D12Device* device, CUcontext cuda_context);
    ~D3D12SurfaceHandler();

    // Copy NV12 frame from CUDA to D3D12 texture
    // Returns: true on success
    bool CopyNV12Frame(CUdeviceptr src_frame,
                       uint32_t src_pitch,
                       ID3D12Resource* dst_texture,
                       uint32_t width,
                       uint32_t height);

    // Signal D3D12 fence from CUDA stream
    // Returns: true on success
    bool SignalD3D12Fence(uint64_t fence_value);

private:
    // Get CUDA device pointer for D3D12 resource (uses cache)
    bool GetD3D12CUDAPointer(ID3D12Resource* resource, CUdeviceptr* out_ptr);

    // Copy Y plane (8-bit single channel)
    bool CopyYPlane(CUdeviceptr src, uint32_t src_pitch,
                   CUdeviceptr dst, uint32_t dst_pitch,
                   uint32_t width, uint32_t height);

    // Copy UV plane (8-bit dual channel, interleaved)
    bool CopyUVPlane(CUdeviceptr src, uint32_t src_pitch,
                    CUdeviceptr dst, uint32_t dst_pitch,
                    uint32_t width, uint32_t height);

private:
    ID3D12Device* m_device;
    CUcontext m_cudaContext;
    std::unique_ptr<ExternalMemoryCache> m_cache;
};

} // namespace VavCore

4. D3D12SurfaceHandler.cpp (Implementation)

#include "D3D12SurfaceHandler.h"
#include "ExternalMemoryCache.h"
#include <cstdio>

namespace VavCore {

D3D12SurfaceHandler::D3D12SurfaceHandler(ID3D12Device* device, CUcontext cuda_context)
    : m_device(device)
    , m_cudaContext(cuda_context)
    , m_cache(std::make_unique<ExternalMemoryCache>(device, cuda_context))
{
}

D3D12SurfaceHandler::~D3D12SurfaceHandler()
{
}

bool D3D12SurfaceHandler::CopyNV12Frame(CUdeviceptr src_frame,
                                        uint32_t src_pitch,
                                        ID3D12Resource* dst_texture,
                                        uint32_t width,
                                        uint32_t height)
{
    // Get CUDA pointer for D3D12 resource
    CUdeviceptr dst_ptr = 0;
    if (!GetD3D12CUDAPointer(dst_texture, &dst_ptr)) {
        return false;
    }

    // Get D3D12 texture layout
    D3D12_RESOURCE_DESC desc = dst_texture->GetDesc();
    D3D12_PLACED_SUBRESOURCE_FOOTPRINT layouts[2];
    UINT num_rows[2] = {0};
    UINT64 row_sizes[2] = {0};
    UINT64 total_bytes = 0;

    m_device->GetCopyableFootprints(&desc, 0, 2, 0,
                                    layouts, num_rows, row_sizes, &total_bytes);

    // Copy Y plane
    if (!CopyYPlane(src_frame, src_pitch,
                   dst_ptr, layouts[0].Footprint.RowPitch,
                   width, height)) {
        return false;
    }

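    // Copy UV plane. Assumes the UV plane starts src_pitch * height bytes into
    // the source frame; the offset is computed in 64 bits to avoid uint32_t overflow.
    CUdeviceptr src_uv = src_frame + static_cast<CUdeviceptr>(src_pitch) * height;
    CUdeviceptr dst_uv = dst_ptr + layouts[1].Offset;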

    if (!CopyUVPlane(src_uv, src_pitch,
                    dst_uv, layouts[1].Footprint.RowPitch,
                    width, height / 2)) {
        return false;
    }

    return true;
}

bool D3D12SurfaceHandler::GetD3D12CUDAPointer(ID3D12Resource* resource,
                                              CUdeviceptr* out_ptr)
{
    return m_cache->GetOrCreateExternalMemory(resource, out_ptr);
}

bool D3D12SurfaceHandler::CopyYPlane(CUdeviceptr src, uint32_t src_pitch,
                                     CUdeviceptr dst, uint32_t dst_pitch,
                                     uint32_t width, uint32_t height)
{
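    // Width is given in bytes; for 8-bit Y that equals the pixel width
    cudaError_t err = cudaMemcpy2D(
        reinterpret_cast<void*>(dst), dst_pitch,
        reinterpret_cast<const void*>(src), src_pitch,
        width, height,  // Copy only valid pixels, not padding
        cudaMemcpyDeviceToDevice
    );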

    if (err != cudaSuccess) {
        printf("[D3D12] Y plane copy failed: %d\n", err);
        return false;
    }

    return true;
}

bool D3D12SurfaceHandler::CopyUVPlane(CUdeviceptr src, uint32_t src_pitch,
                                      CUdeviceptr dst, uint32_t dst_pitch,
                                      uint32_t width, uint32_t height)
{
    // NV12 UV plane: U and V are interleaved, so each row is as wide in bytes
    // as a Y row; the caller passes half the Y height as the row count
    cudaError_t err = cudaMemcpy2D(
        reinterpret_cast<void*>(dst), dst_pitch,
        reinterpret_cast<const void*>(src), src_pitch,
        width, height,  // Same row width in bytes as Y, half the rows
        cudaMemcpyDeviceToDevice
    );

    if (err != cudaSuccess) {
        printf("[D3D12] UV plane copy failed: %d\n", err);
        return false;
    }

    return true;
}

} // namespace VavCore
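The plane arithmetic in CopyNV12Frame() can be checked on the CPU before any GPU is involved. The sketch below is a hypothetical reference, not part of the proposed files: `CopyPlaneRows` imitates a pitched 2D copy with `memcpy`, and it assumes, as the handler code does, that the source UV plane starts at `src_pitch * height`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// CPU stand-in for a pitched 2D copy: copies `height` rows of `width_bytes`
// bytes, advancing source and destination by their own pitches.
inline void CopyPlaneRows(const uint8_t* src, size_t src_pitch,
                          uint8_t* dst, size_t dst_pitch,
                          size_t width_bytes, size_t height) {
    assert(width_bytes <= src_pitch && width_bytes <= dst_pitch);
    for (size_t row = 0; row < height; ++row) {
        std::memcpy(dst + row * dst_pitch, src + row * src_pitch, width_bytes);
    }
}

// Hypothetical reference copy for an 8-bit NV12 frame: Y plane first, then
// the interleaved UV plane with the same row width and half the rows.
inline void CopyNV12Reference(const uint8_t* src, size_t src_pitch,
                              uint8_t* dst, size_t dst_pitch,
                              size_t dst_uv_offset,
                              size_t width, size_t height) {
    CopyPlaneRows(src, src_pitch, dst, dst_pitch, width, height);
    CopyPlaneRows(src + src_pitch * height, src_pitch,
                  dst + dst_uv_offset, dst_pitch,
                  width, height / 2);
}

// Builds a small padded source, copies it into a wider destination, and
// reports whether visible bytes arrived and destination padding stayed zero.
inline bool RunNV12ReferenceCheck() {
    const size_t width = 6, height = 4, src_pitch = 8, dst_pitch = 16;
    const size_t rows = height + height / 2;
    std::vector<uint8_t> src(src_pitch * rows, 0xEE);  // 0xEE marks source padding
    for (size_t r = 0; r < rows; ++r) {
        for (size_t x = 0; x < width; ++x) {
            src[r * src_pitch + x] = static_cast<uint8_t>(r * 10 + x);
        }
    }

    std::vector<uint8_t> dst(dst_pitch * rows, 0);
    CopyNV12Reference(src.data(), src_pitch, dst.data(), dst_pitch,
                      dst_pitch * height, width, height);

    for (size_t r = 0; r < rows; ++r) {
        for (size_t x = 0; x < dst_pitch; ++x) {
            const uint8_t expected =
                x < width ? static_cast<uint8_t>(r * 10 + x) : 0;
            if (dst[r * dst_pitch + x] != expected) return false;
        }
    }
    return true;
}
```

A check of this shape catches the typical stride mistakes (wrong row width, wrong UV offset, wrong row count) because any of them either leaks source padding into the destination or leaves visible bytes unwritten.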

5. ExternalMemoryCache.h (CUDA-D3D12 interop cache)

#pragma once

#include <d3d12.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <map>

namespace VavCore {

class ExternalMemoryCache {
public:
    ExternalMemoryCache(ID3D12Device* device, CUcontext cuda_context);
    ~ExternalMemoryCache();

    // Get or create CUDA device pointer for D3D12 resource
    // Returns: true on success
    bool GetOrCreateExternalMemory(ID3D12Resource* resource, CUdeviceptr* out_ptr);

    // Release specific resource
    void Release(ID3D12Resource* resource);

    // Release all cached resources
    void ReleaseAll();

private:
    struct CachedEntry {
        cudaExternalMemory_t external_memory;
        CUdeviceptr device_ptr;
        size_t size;
    };

    bool ImportD3D12Resource(ID3D12Resource* resource,
                            cudaExternalMemory_t* out_ext_mem,
                            CUdeviceptr* out_ptr);

private:
    ID3D12Device* m_device;
    CUcontext m_cudaContext;
    std::map<ID3D12Resource*, CachedEntry> m_cache;
};

} // namespace VavCore
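The header above leaves the get-or-create behaviour implicit. A minimal sketch of that logic follows, under loud assumptions: `SimpleInteropCache` is a hypothetical name, the key type stands in for `ID3D12Resource*`, `uint64_t` stands in for `CUdeviceptr`, and the two callbacks stand in for the CUDA external-memory import and release calls, which are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical, simplified cache illustrating the lookup-then-import flow.
template <typename Key>
class SimpleInteropCache {
public:
    using ImportFn = std::function<bool(Key, uint64_t*)>;
    using ReleaseFn = std::function<void(uint64_t)>;

    SimpleInteropCache(ImportFn import_fn, ReleaseFn release_fn)
        : m_import(std::move(import_fn)), m_release(std::move(release_fn)) {
        assert(m_import != nullptr && m_release != nullptr);
    }
    ~SimpleInteropCache() { ReleaseAll(); }

    SimpleInteropCache(const SimpleInteropCache&) = delete;
    SimpleInteropCache& operator=(const SimpleInteropCache&) = delete;

    // Returns the cached pointer if present; otherwise imports once and stores it.
    bool GetOrCreate(Key key, uint64_t* out_ptr) {
        auto it = m_entries.find(key);
        if (it == m_entries.end()) {
            uint64_t ptr = 0;
            if (!m_import(key, &ptr)) {
                return false;  // failed imports are not cached
            }
            it = m_entries.emplace(key, ptr).first;
        }
        *out_ptr = it->second;
        return true;
    }

    // Releases one entry, if it exists.
    void Release(Key key) {
        auto it = m_entries.find(key);
        if (it != m_entries.end()) {
            m_release(it->second);
            m_entries.erase(it);
        }
    }

    // Releases every entry; also called from the destructor.
    void ReleaseAll() {
        for (const auto& entry : m_entries) m_release(entry.second);
        m_entries.clear();
    }

    std::size_t Size() const { return m_entries.size(); }

private:
    ImportFn m_import;
    ReleaseFn m_release;
    std::map<Key, uint64_t> m_entries;
};
```

Copying is disabled because two copies would release the same imported memory twice.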

Key Improvements

Readability

Before:

DecodeToSurface(): 500+ lines with 5 levels of nesting
Mixed concerns: decoding, copying, caching, signaling

After:

DecodeToSurface(): 40 lines, clear 4-step process
Each helper method: 20-60 lines, single responsibility

Debugging

Before:

NV12 stride bug hidden in 500 lines of mixed logic
Hard to locate which cudaMemcpy2D call is wrong

After:

CopyYPlane() and CopyUVPlane() are separate methods
Easy to add breakpoint and inspect parameters
Clear separation of Y and UV plane logic
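Once each plane copy is its own method, its arguments can also be checked at the call boundary. The helper below is a hypothetical debugging aid, not part of the proposed design; it only encodes the general rule that a pitched copy cannot move more bytes per row than either pitch allows.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical debugging aid: returns nullptr when the arguments describe a
// valid pitched copy, otherwise a short description of the first problem.
inline const char* DescribePlaneCopyProblem(uint32_t src_pitch, uint32_t dst_pitch,
                                            uint32_t width_bytes, uint32_t height) {
    if (width_bytes == 0 || height == 0) return "empty copy region";
    if (width_bytes > src_pitch) return "row width exceeds source pitch";
    if (width_bytes > dst_pitch) return "row width exceeds destination pitch";
    return nullptr;
}

// Debug-build tripwire: stops at the call site that passed bad arguments
// instead of failing later inside the copy.
inline void AssertPlaneCopyArgs(uint32_t src_pitch, uint32_t dst_pitch,
                                uint32_t width_bytes, uint32_t height) {
    assert(DescribePlaneCopyProblem(src_pitch, dst_pitch, width_bytes, height) == nullptr);
}
```

Calling such a check at the top of CopyYPlane() and CopyUVPlane() would give a breakpoint location that names the offending parameter.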

Testing

Before:

Cannot test D3D12 copying without full decoder setup
Cannot mock CUDA operations

After:

Can unit test D3D12SurfaceHandler independently
Can test ExternalMemoryCache in isolation
Easy to add mock implementations
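As an illustration of what a mock could look like, the sketch below puts the copy call behind an interface. Everything here is an assumption for the sake of the example: `IPlaneCopier`, `RecordingCopier`, and `CopyNV12With` do not appear in the proposed headers, and plain `uint64_t` values stand in for `CUdeviceptr`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One recorded copy request.
struct PlaneCopyCall {
    uint64_t src, dst;
    uint32_t src_pitch, dst_pitch, width_bytes, height;
};

// Hypothetical seam: frame logic talks to this interface instead of calling
// the CUDA copy function directly.
class IPlaneCopier {
public:
    virtual ~IPlaneCopier() = default;
    virtual bool CopyPlane(const PlaneCopyCall& call) = 0;
};

// Test double: records every request and never touches a GPU.
class RecordingCopier : public IPlaneCopier {
public:
    bool CopyPlane(const PlaneCopyCall& call) override {
        calls.push_back(call);
        return true;
    }
    std::vector<PlaneCopyCall> calls;
};

// Same plane arithmetic as CopyNV12Frame(): Y plane first, then the
// interleaved UV plane at src_pitch * height with half the rows.
inline bool CopyNV12With(IPlaneCopier& copier,
                         uint64_t src, uint32_t src_pitch,
                         uint64_t dst, uint32_t dst_pitch, uint64_t dst_uv_offset,
                         uint32_t width, uint32_t height) {
    assert(width <= src_pitch && width <= dst_pitch);
    if (!copier.CopyPlane({src, dst, src_pitch, dst_pitch, width, height})) {
        return false;
    }
    const uint64_t src_uv = src + static_cast<uint64_t>(src_pitch) * height;
    return copier.CopyPlane(
        {src_uv, dst + dst_uv_offset, src_pitch, dst_pitch, width, height / 2});
}
```

A test can then assert on the recorded offsets and row counts, which is exactly where stride bugs show up.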

Maintenance

Before:

Adding D3D11 support requires modifying 500+ line method
Risk of breaking existing D3D12 code

After:

Add new D3D11SurfaceHandler class
Existing D3D12 code untouched
Clean separation of concerns
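The isolation claimed above can be pictured with a small routing table. This is only one possible shape and is heavier than the `switch` the design actually proposes; `SurfaceKind` and `SurfaceRouter` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Mirrors the surface kinds routed by DecodeToSurface(); the enum name is
// local to this sketch.
enum class SurfaceKind { CPU, D3D11, D3D12, CUDA };

// Hypothetical router: each surface kind maps to its own copy routine, so
// registering D3D11 support does not touch the D3D12 entry.
class SurfaceRouter {
public:
    using CopyFn = std::function<bool(void* target_surface)>;

    void Register(SurfaceKind kind, CopyFn fn) {
        assert(fn != nullptr);
        m_handlers[kind] = std::move(fn);
    }

    // Returns false for kinds that have no registered handler.
    bool Copy(SurfaceKind kind, void* target_surface) const {
        auto it = m_handlers.find(kind);
        return it != m_handlers.end() && it->second(target_surface);
    }

private:
    std::map<SurfaceKind, CopyFn> m_handlers;
};
```

Whether a table or a `switch` is used, the point is the same: each surface type has one entry point, and adding a new one leaves the others untouched.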

File Size Comparison

File                       Before         After
NVDECAV1Decoder.cpp        1,722 lines    ~600 lines
D3D12SurfaceHandler.cpp    -              ~300 lines
ExternalMemoryCache.cpp    -              ~200 lines
Total                      1,722 lines    ~1,100 lines

Reduction: about 36% fewer lines (1,722 to ~1,100) while improving readability

Implementation Plan

Phase 1: Extract D3D12 Handler (2-3 hours)

Create D3D12SurfaceHandler.h/.cpp
Move D3D12 resource import logic
Move NV12 plane copying logic
Test with existing Vav2Player

Acceptance Criteria:

Vav2Player displays video correctly
No memory leaks
Performance same or better

Phase 2: Extract External Memory Cache (1-2 hours)

Create ExternalMemoryCache.h/.cpp
Move external memory caching logic
Add proper cleanup on resource release
Test memory management

Acceptance Criteria:

Cache hit/miss working correctly
No memory leaks on repeated loads
Cache cleared on decoder cleanup

Phase 3: Refactor Main Decoder (1-2 hours)

Simplify DecodeToSurface() to routing logic
Extract DecodePacket() method
Extract GetDecodedFrame() method
Extract CopyToCPUSurface() method
Test all surface types

Acceptance Criteria:

All surface types working
Code passes all existing tests
Debug logging reduced

Phase 4: Fix NV12 Stride Bug (30 minutes)

Fix CopyYPlane() width parameter
Fix CopyUVPlane() width parameter
Verify with test video

Acceptance Criteria:

No stripe pattern in displayed video
Correct colors displayed
Performance maintained

Testing Strategy

Unit Tests

TEST(D3D12SurfaceHandler, CopiesNV12FrameCorrectly)
{
    auto handler = CreateTestHandler();
    auto src_frame = CreateTestNV12Frame(1920, 1080);
    auto dst_texture = CreateTestD3D12Texture(1920, 1080);

    bool result = handler->CopyNV12Frame(
        src_frame.device_ptr, src_frame.pitch,
        dst_texture, 1920, 1080
    );

    EXPECT_TRUE(result);
    VerifyNV12Data(dst_texture);
}

TEST(ExternalMemoryCache, ReusesExistingEntry)
{
    auto cache = CreateTestCache();
    auto resource = CreateTestD3D12Resource();

    CUdeviceptr ptr1 = 0, ptr2 = 0;
    ASSERT_TRUE(cache->GetOrCreateExternalMemory(resource, &ptr1));
    ASSERT_TRUE(cache->GetOrCreateExternalMemory(resource, &ptr2));

    EXPECT_EQ(ptr1, ptr2);  // Second call should hit the cache and return the same pointer
}

Integration Tests

Load video file
Decode multiple frames
Verify no memory leaks
Verify correct video display

Success Criteria

Design document complete
Phase 1 complete: D3D12SurfaceHandler working
Phase 2 complete: ExternalMemoryCache working
Phase 3 complete: Main decoder simplified
Phase 4 complete: NV12 stride bug fixed
All existing tests passing
No performance regression
Code review passed
Documentation updated

Next Step: Start Phase 1 - Extract D3D12SurfaceHandler

Last Updated: 2025-10-03
