# NVDECAV1Decoder C++ Refactoring Design

**Date**: 2025-10-03
**Status**: Design Phase
**Goal**: Refactor NVDECAV1Decoder internal C++ code for readability and maintainability

---

## Problem Analysis

### Current State
- **File**: `vav2/platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
- **Lines**: 1,722 lines (too large)
- **Main Method**: `DecodeToSurface()` is 500+ lines with deeply nested logic

### Key Issues
1. **Monolithic Method**: `DecodeToSurface()` handles CPU, D3D11, D3D12, CUDA in one giant function
2. **Mixed Responsibilities**: Decoding, surface copying, memory management, and fence signaling are interleaved in one method
3. **Hard to Debug**: Pitch/stride bugs are difficult to trace due to complex nesting
4. **Difficult to Test**: Cannot unit test individual components in isolation
5. **Poor Readability**: Excessive debug logging makes logic hard to follow

---

## Design Goals

### Primary Goals
1. **Readability**: Each method should do ONE thing clearly
2. **Maintainability**: Easy to locate and fix bugs (such as the current NV12 stride issue)
3. **Testability**: Each component can be tested independently
4. **Performance**: No measurable overhead from the split; keep small helpers inline where appropriate

### Non-Goals
- NOT creating a C API (VavCore already provides that)
- NOT changing external interface of NVDECAV1Decoder
- NOT over-engineering with complex patterns

---

## Proposed Architecture

### File Structure

```
NVDECAV1Decoder.h         (Public interface - unchanged)
NVDECAV1Decoder.cpp       (Main decoder - ~600 lines)
  └── Uses helper classes below

D3D12SurfaceHandler.h     (D3D12-specific logic - ~300 lines)
D3D12SurfaceHandler.cpp
  ├── CopyNV12Frame()
  ├── GetD3D12CUDAPointer()
  └── SignalD3D12Fence()

ExternalMemoryCache.h     (CUDA-D3D12 interop cache - ~200 lines)
ExternalMemoryCache.cpp
  ├── GetOrCreateExternalMemory()
  ├── ImportD3D12Resource()
  ├── Release()
  └── ReleaseAll()
```

### Class Diagram

```
NVDECAV1Decoder (Main decoder)
├── CUvideodecoder m_decoder
├── CUvideoparser m_parser
├── CUcontext m_cudaContext
└── std::unique_ptr<D3D12SurfaceHandler> m_d3d12Handler (created on demand)

D3D12SurfaceHandler
├── ID3D12Device* m_device
├── CUcontext m_cudaContext
├── std::unique_ptr<ExternalMemoryCache> m_cache (owned by the handler)
└── Methods:
    ├── CopyNV12Frame(src_frame, src_pitch, dst_texture, width, height)
    ├── GetD3D12CUDAPointer(resource, out_ptr)
    └── SignalD3D12Fence(fence_value)

ExternalMemoryCache
├── std::map<ID3D12Resource*, CachedEntry> m_cache
└── Methods:
    ├── GetOrCreateExternalMemory(resource, out_ptr)
    ├── Release(resource)
    └── ReleaseAll()
```

---

## Refactored Code Structure

### 1. NVDECAV1Decoder.cpp (Main decoder - simplified)

**Before**: 500+ lines in DecodeToSurface()

**After**: Clean routing logic

```cpp
bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data, size_t packet_size,
                                      void* target_surface, SurfaceType target_type)
{
    // Step 1: Decode packet to NVDEC internal buffer
    if (!DecodePacket(packet_data, packet_size)) {
        return false;
    }

    // Step 2: Get decoded frame info
    DecodedFrameInfo frame_info;
    if (!GetDecodedFrame(&frame_info)) {
        return false;
    }

    // Step 3: Copy to target surface based on type
    bool result = false;
    switch (target_type) {
        case SURFACE_TYPE_CPU:
            result = CopyToCPUSurface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_D3D12:
            result = CopyToD3D12Surface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_D3D11:
            result = CopyToD3D11Surface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_CUDA:
            result = CopyToCUDASurface(frame_info, target_surface);
            break;
        default:
            // Unknown surface type: fall through with result == false
            LogError("Unsupported surface type: %d", static_cast<int>(target_type));
            break;
    }

    // Step 4: Cleanup. Always unmap the frame mapped in GetDecodedFrame(),
    // even when the copy failed.
    cuvidUnmapVideoFrame(m_decoder, frame_info.device_ptr);
    return result;
}
```
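
Step 4 only runs if every path through Step 3 reaches it. Should a copy helper later gain an early return or throw, the mapped frame would leak. A minimal sketch of one way to guard against that follows; `ScopeExit` and `CopyWithGuaranteedUnmap` are hypothetical names, not existing VavCore code, and a counter stands in for `cuvidUnmapVideoFrame` so the sketch compiles without the CUDA SDK.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper: runs a callable when the guard leaves scope,
// on every return path (including exceptions).
class ScopeExit {
public:
    explicit ScopeExit(std::function<void()> fn) : m_fn(std::move(fn)) {}
    ~ScopeExit() { if (m_fn) m_fn(); }
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    std::function<void()> m_fn;
};

// Stand-in for Steps 3 and 4. Incrementing `unmap_calls` plays the role of
// cuvidUnmapVideoFrame(m_decoder, frame_info.device_ptr) in the real decoder.
inline bool CopyWithGuaranteedUnmap(bool copy_succeeds, int& unmap_calls)
{
    ScopeExit unmap_guard([&unmap_calls] { ++unmap_calls; });

    if (!copy_succeeds) {
        return false;  // early return: the guard still "unmaps"
    }
    return true;
}
```

In the real decoder the guard would be created right after `GetDecodedFrame()` succeeds. A templated guard would avoid the `std::function` indirection if that overhead matters.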

### 2. Private Helper Methods (in NVDECAV1Decoder.cpp)

```cpp
// Decode packet using cuvidParseVideoData
// Returns: true on success
// Complexity: ~30 lines
// (Private member; declared in the private section of NVDECAV1Decoder)
bool NVDECAV1Decoder::DecodePacket(const uint8_t* data, size_t size)
{
    CUVIDSOURCEDATAPACKET packet = {};
    packet.payload = data;
    packet.payload_size = static_cast<unsigned long>(size);  // field is unsigned long, not size_t
    packet.flags = CUVID_PKT_TIMESTAMP;  // timestamp stays 0 from the zero-initialization above

    CUresult result = cuvidParseVideoData(m_parser, &packet);
    if (result != CUDA_SUCCESS) {
        LogError("cuvidParseVideoData failed: %d", static_cast<int>(result));
        return false;
    }
    return true;
}

// Frame description handed from GetDecodedFrame() to the CopyTo*Surface() helpers
// (Private nested type of NVDECAV1Decoder)
struct DecodedFrameInfo {
    CUdeviceptr device_ptr;
    uint32_t pitch;
    uint32_t width;
    uint32_t height;
};

// Get decoded frame from internal queue
// Returns: true if frame available
// Complexity: ~40 lines
bool NVDECAV1Decoder::GetDecodedFrame(DecodedFrameInfo* out_info)
{
    if (m_frameQueue.empty()) {
        return false;
    }

    int frame_index = m_frameQueue.front();
    m_frameQueue.pop();

    CUVIDPROCPARAMS proc_params = {};
    proc_params.progressive_frame = 1;

    CUdeviceptr device_ptr = 0;
    unsigned int pitch = 0;
    CUresult result = cuvidMapVideoFrame(m_decoder, frame_index,
                                         &device_ptr, &pitch, &proc_params);

    if (result != CUDA_SUCCESS) {
        LogError("cuvidMapVideoFrame failed: %d", result);
        return false;
    }

    out_info->device_ptr = device_ptr;
    out_info->pitch = pitch;
    out_info->width = m_width;
    out_info->height = m_height;

    return true;
}

// Copy to D3D12 surface (delegates to handler)
// Returns: true on success
// Complexity: ~20 lines
// (Private member; declared in the private section of NVDECAV1Decoder)
bool NVDECAV1Decoder::CopyToD3D12Surface(const DecodedFrameInfo& frame, void* surface)
{
    auto* d3d12_resource = static_cast<ID3D12Resource*>(surface);

    // Create handler on-demand
    if (!m_d3d12Handler) {
        m_d3d12Handler = std::make_unique<D3D12SurfaceHandler>(
            m_d3d12Device, m_cudaContext
        );
    }

    return m_d3d12Handler->CopyNV12Frame(
        frame.device_ptr,
        frame.pitch,
        d3d12_resource,
        frame.width,
        frame.height
    );
}
```

### 3. D3D12SurfaceHandler.h (D3D12-specific operations)

```cpp
#pragma once

#include <d3d12.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <memory>

namespace VavCore {

// Forward declaration
class ExternalMemoryCache;

class D3D12SurfaceHandler {
public:
    D3D12SurfaceHandler(ID3D12Device* device, CUcontext cuda_context);
    ~D3D12SurfaceHandler();

    // Copy NV12 frame from CUDA to D3D12 texture
    // Returns: true on success
    bool CopyNV12Frame(CUdeviceptr src_frame,
                       uint32_t src_pitch,
                       ID3D12Resource* dst_texture,
                       uint32_t width,
                       uint32_t height);

    // Signal D3D12 fence from CUDA stream
    // Returns: true on success
    bool SignalD3D12Fence(uint64_t fence_value);

private:
    // Get CUDA device pointer for D3D12 resource (uses cache)
    bool GetD3D12CUDAPointer(ID3D12Resource* resource, CUdeviceptr* out_ptr);

    // Copy Y plane (8-bit single channel)
    bool CopyYPlane(CUdeviceptr src, uint32_t src_pitch,
                   CUdeviceptr dst, uint32_t dst_pitch,
                   uint32_t width, uint32_t height);

    // Copy UV plane (8-bit dual channel, interleaved)
    bool CopyUVPlane(CUdeviceptr src, uint32_t src_pitch,
                    CUdeviceptr dst, uint32_t dst_pitch,
                    uint32_t width, uint32_t height);

private:
    ID3D12Device* m_device;
    CUcontext m_cudaContext;
    std::unique_ptr<ExternalMemoryCache> m_cache;
};

} // namespace VavCore
```

### 4. D3D12SurfaceHandler.cpp (Implementation)

```cpp
#include "D3D12SurfaceHandler.h"
#include "ExternalMemoryCache.h"
#include <cstdio>

namespace VavCore {

D3D12SurfaceHandler::D3D12SurfaceHandler(ID3D12Device* device, CUcontext cuda_context)
    : m_device(device)
    , m_cudaContext(cuda_context)
    , m_cache(std::make_unique<ExternalMemoryCache>(device, cuda_context))
{
}

D3D12SurfaceHandler::~D3D12SurfaceHandler()
{
}

bool D3D12SurfaceHandler::CopyNV12Frame(CUdeviceptr src_frame,
                                        uint32_t src_pitch,
                                        ID3D12Resource* dst_texture,
                                        uint32_t width,
                                        uint32_t height)
{
    // Get CUDA pointer for D3D12 resource
    CUdeviceptr dst_ptr = 0;
    if (!GetD3D12CUDAPointer(dst_texture, &dst_ptr)) {
        return false;
    }

    // Get D3D12 texture layout
    D3D12_RESOURCE_DESC desc = dst_texture->GetDesc();
    D3D12_PLACED_SUBRESOURCE_FOOTPRINT layouts[2];
    UINT num_rows[2] = {0};
    UINT64 row_sizes[2] = {0};
    UINT64 total_bytes = 0;

    m_device->GetCopyableFootprints(&desc, 0, 2, 0,
                                    layouts, num_rows, row_sizes, &total_bytes);

    // Copy Y plane
    if (!CopyYPlane(src_frame, src_pitch,
                   dst_ptr, layouts[0].Footprint.RowPitch,
                   width, height)) {
        return false;
    }

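    // Copy UV plane
    // Assumes the mapped surface holds exactly `height` luma rows, so the
    // source UV plane starts at src_pitch * height. Verify this against the
    // decoder's surface height when chasing the stride bug.
    CUdeviceptr src_uv = src_frame + static_cast<CUdeviceptr>(src_pitch) * height;
    CUdeviceptr dst_uv = dst_ptr + layouts[1].Offset;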

    if (!CopyUVPlane(src_uv, src_pitch,
                    dst_uv, layouts[1].Footprint.RowPitch,
                    width, height / 2)) {
        return false;
    }

    return true;
}

bool D3D12SurfaceHandler::GetD3D12CUDAPointer(ID3D12Resource* resource,
                                              CUdeviceptr* out_ptr)
{
    return m_cache->GetOrCreateExternalMemory(resource, out_ptr);
}

bool D3D12SurfaceHandler::CopyYPlane(CUdeviceptr src, uint32_t src_pitch,
                                     CUdeviceptr dst, uint32_t dst_pitch,
                                     uint32_t width, uint32_t height)
{
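    // cudaMemcpy2D takes the copy width in bytes; 8-bit Y is 1 byte per pixel
    cudaError_t err = cudaMemcpy2D(
        reinterpret_cast<void*>(dst), dst_pitch,
        reinterpret_cast<const void*>(src), src_pitch,
        width, height,  // Copy only valid pixels, not padding
        cudaMemcpyDeviceToDevice
    );

    if (err != cudaSuccess) {
        printf("[D3D12] Y plane copy failed: %s\n", cudaGetErrorString(err));
        return false;
    }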

    return true;
}

bool D3D12SurfaceHandler::CopyUVPlane(CUdeviceptr src, uint32_t src_pitch,
                                      CUdeviceptr dst, uint32_t dst_pitch,
                                      uint32_t width, uint32_t height)
{
    // NV12 UV plane: interleaved U and V at half horizontal resolution,
    // so a row is (width / 2) pairs * 2 bytes = `width` bytes (for even widths)
    cudaError_t err = cudaMemcpy2D(
        reinterpret_cast<void*>(dst), dst_pitch,
        reinterpret_cast<const void*>(src), src_pitch,
        width, height,  // Same row width in bytes as Y; caller passes half the luma height
        cudaMemcpyDeviceToDevice
    );

    if (err != cudaSuccess) {
        printf("[D3D12] UV plane copy failed: %s\n", cudaGetErrorString(err));
        return false;
    }

    return true;
}

} // namespace VavCore
```

### 5. ExternalMemoryCache.h (CUDA-D3D12 interop cache)

```cpp
#pragma once

#include <d3d12.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

namespace VavCore {

class ExternalMemoryCache {
public:
    ExternalMemoryCache(ID3D12Device* device, CUcontext cuda_context);
    ~ExternalMemoryCache();

    // Get or create CUDA device pointer for D3D12 resource
    // Returns: true on success
    bool GetOrCreateExternalMemory(ID3D12Resource* resource, CUdeviceptr* out_ptr);

    // Release specific resource
    void Release(ID3D12Resource* resource);

    // Release all cached resources
    void ReleaseAll();

private:
    struct CachedEntry {
        cudaExternalMemory_t external_memory;
        CUdeviceptr device_ptr;
        size_t size;
    };

    bool ImportD3D12Resource(ID3D12Resource* resource,
                            cudaExternalMemory_t* out_ext_mem,
                            CUdeviceptr* out_ptr);

private:
    ID3D12Device* m_device;
    CUcontext m_cudaContext;
    std::map<ID3D12Resource*, CachedEntry> m_cache;
};

} // namespace VavCore
```
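
The header above leaves the lookup logic implicit. The sketch below shows the get-or-create pattern in isolation, under these assumptions: `ResourceKey` and `DevicePtr` are stand-ins for `ID3D12Resource*` and `CUdeviceptr`, and the hypothetical `Importer` callback stands in for `ImportD3D12Resource()`, so the example compiles without the D3D12 or CUDA headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Stand-ins so the sketch builds without d3d12.h / cuda.h.
using ResourceKey = const void*;     // plays the role of ID3D12Resource*
using DevicePtr   = std::uint64_t;   // plays the role of CUdeviceptr

class ExternalMemoryCacheSketch {
public:
    // Hypothetical import hook; the real class would call ImportD3D12Resource().
    using Importer = std::function<bool(ResourceKey, DevicePtr*)>;

    explicit ExternalMemoryCacheSketch(Importer importer)
        : m_importer(std::move(importer)) {}

    // Returns the cached pointer on a hit; imports and caches on a miss.
    bool GetOrCreate(ResourceKey resource, DevicePtr* out_ptr)
    {
        if (resource == nullptr || out_ptr == nullptr) {
            return false;
        }
        auto it = m_cache.find(resource);
        if (it != m_cache.end()) {
            *out_ptr = it->second;  // cache hit: no import
            return true;
        }
        DevicePtr imported = 0;
        if (!m_importer(resource, &imported)) {
            return false;           // failed imports are not cached
        }
        m_cache.emplace(resource, imported);
        *out_ptr = imported;
        return true;
    }

    void Release(ResourceKey resource) { m_cache.erase(resource); }
    void ReleaseAll() { m_cache.clear(); }
    std::size_t Size() const { return m_cache.size(); }

private:
    Importer m_importer;
    std::map<ResourceKey, DevicePtr> m_cache;
};
```

In the real class, `Release()` and `ReleaseAll()` would also need to free the mapped buffer and destroy the `cudaExternalMemory_t` held in each `CachedEntry` before erasing it. Keying on the raw resource pointer also means an entry must be released before its D3D12 resource is destroyed, or a later resource at the same address could hit a stale entry.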

---

## Key Improvements

### Readability
**Before**:
- `DecodeToSurface()`: 500+ lines with 5 levels of nesting
- Mixed concerns: decoding, copying, caching, signaling

**After**:
- `DecodeToSurface()`: 40 lines, clear 4-step process
- Each helper method: 20-60 lines, single responsibility

### Debugging
**Before**:
- NV12 stride bug hidden in 500 lines of mixed logic
- Hard to locate which `cudaMemcpy2D` call is wrong

**After**:
- `CopyYPlane()` and `CopyUVPlane()` are separate methods
- Easy to add breakpoint and inspect parameters
- Clear separation of Y and UV plane logic
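
To make the pitch handling concrete, here is a CPU-only model of the same two-plane copy. It is a sketch for reasoning about strides, not the GPU code path: `CopyPitched` and `CopyNV12OnCpu` are hypothetical names, both buffers are assumed to store the UV plane directly after `height` luma rows, and width and height are assumed even. The real D3D12 destination takes its UV offset from `layouts[1].Offset` instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies `rows` rows of `row_bytes` valid bytes between two pitched buffers.
// The pitches may differ; bytes past `row_bytes` in each row are padding
// and are never read or written.
inline bool CopyPitched(const std::uint8_t* src, std::size_t src_pitch,
                        std::uint8_t* dst, std::size_t dst_pitch,
                        std::size_t row_bytes, std::size_t rows)
{
    if (src == nullptr || dst == nullptr) {
        return false;
    }
    if (row_bytes > src_pitch || row_bytes > dst_pitch) {
        return false;  // a row cannot be wider than its pitch
    }
    for (std::size_t y = 0; y < rows; ++y) {
        std::memcpy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
    }
    return true;
}

// CPU model of CopyNV12Frame(): Y plane first, then the interleaved UV plane.
inline bool CopyNV12OnCpu(const std::vector<std::uint8_t>& src, std::size_t src_pitch,
                          std::vector<std::uint8_t>& dst, std::size_t dst_pitch,
                          std::size_t width, std::size_t height)
{
    const std::size_t total_rows = height + height / 2;
    if (src.size() < src_pitch * total_rows || dst.size() < dst_pitch * total_rows) {
        return false;
    }
    // Y plane: `width` bytes per row, `height` rows.
    if (!CopyPitched(src.data(), src_pitch, dst.data(), dst_pitch, width, height)) {
        return false;
    }
    // UV plane: also `width` bytes per row (U and V interleaved), height / 2 rows.
    return CopyPitched(src.data() + src_pitch * height, src_pitch,
                       dst.data() + dst_pitch * height, dst_pitch,
                       width, height / 2);
}
```

Passing the pitch where the row width belongs, or the reverse, is the kind of mistake that shows up as diagonal stripes, which is why keeping the two values as separate named parameters helps.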

### Testing
**Before**:
- Cannot test D3D12 copying without full decoder setup
- Cannot mock CUDA operations

**After**:
- Can unit test `D3D12SurfaceHandler` independently
- Can test `ExternalMemoryCache` in isolation
- Easy to add mock implementations
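
One possible shape for such a mock is sketched below. Everything here is an assumption for illustration: `IPlaneCopier`, `PlaneCopyRequest`, `RecordingCopier`, and `CopyNV12Via` do not appear in the design above, and plain integers stand in for `CUdeviceptr` so the example builds without CUDA. The idea is that the handler issues its 2D copies through a small interface, and a test double records the requests instead of touching the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Parameters of one 2D plane copy (mirrors the cudaMemcpy2D arguments).
struct PlaneCopyRequest {
    std::uint64_t src;
    std::size_t   src_pitch;
    std::uint64_t dst;
    std::size_t   dst_pitch;
    std::size_t   row_bytes;
    std::size_t   rows;
};

// Hypothetical seam between the handler and the CUDA copy call.
class IPlaneCopier {
public:
    virtual ~IPlaneCopier() = default;
    virtual bool Copy2D(const PlaneCopyRequest& request) = 0;
};

// Test double: records each request instead of copying anything.
class RecordingCopier : public IPlaneCopier {
public:
    bool Copy2D(const PlaneCopyRequest& request) override
    {
        requests.push_back(request);
        return true;
    }
    std::vector<PlaneCopyRequest> requests;
};

// Mirrors the Y/UV split in CopyNV12Frame(); `dst_uv_offset` plays the role
// of layouts[1].Offset.
inline bool CopyNV12Via(IPlaneCopier& copier,
                        std::uint64_t src, std::size_t src_pitch,
                        std::uint64_t dst, std::size_t dst_pitch,
                        std::size_t dst_uv_offset,
                        std::size_t width, std::size_t height)
{
    const PlaneCopyRequest y_plane{src, src_pitch, dst, dst_pitch, width, height};
    if (!copier.Copy2D(y_plane)) {
        return false;
    }
    const PlaneCopyRequest uv_plane{src + src_pitch * height, src_pitch,
                                    dst + dst_uv_offset, dst_pitch,
                                    width, height / 2};
    return copier.Copy2D(uv_plane);
}
```

Whether this seam is worth a virtual call is a judgment call against the "no over-engineering" non-goal; it only pays off if the plane arithmetic needs to be unit tested on machines without a GPU.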

### Maintenance
**Before**:
- Changing any one surface path (for example D3D11) requires modifying the 500+ line method
- Risk of breaking existing D3D12 code

**After**:
- Add new `D3D11SurfaceHandler` class
- Existing D3D12 code untouched
- Clean separation of concerns

---

## File Size Comparison

| File | Before | After |
|------|--------|-------|
| NVDECAV1Decoder.cpp | 1,722 lines | ~600 lines |
| D3D12SurfaceHandler.cpp | - | ~300 lines |
| ExternalMemoryCache.cpp | - | ~200 lines |
| **Total** | 1,722 lines | ~1,100 lines |

**Reduction**: about 36% fewer lines (1,722 to ~1,100), with improved readability

---

## Implementation Plan

### Phase 1: Extract D3D12 Handler (2-3 hours)
1. Create `D3D12SurfaceHandler.h/.cpp`
2. Move D3D12 resource import logic
3. Move NV12 plane copying logic
4. Test with existing Vav2Player

**Acceptance Criteria**:
- Vav2Player displays video correctly
- No memory leaks
- Performance same or better

### Phase 2: Extract External Memory Cache (1-2 hours)
1. Create `ExternalMemoryCache.h/.cpp`
2. Move external memory caching logic
3. Add proper cleanup on resource release
4. Test memory management

**Acceptance Criteria**:
- Cache hit/miss working correctly
- No memory leaks on repeated loads
- Cache cleared on decoder cleanup

### Phase 3: Refactor Main Decoder (1-2 hours)
1. Simplify `DecodeToSurface()` to routing logic
2. Extract `DecodePacket()` method
3. Extract `GetDecodedFrame()` method
4. Extract the `CopyTo*Surface()` methods (CPU, D3D11, D3D12, CUDA)
5. Test all surface types

**Acceptance Criteria**:
- All surface types working
- Code passes all existing tests
- Debug logging reduced

### Phase 4: Fix NV12 Stride Bug (30 minutes)
1. Fix `CopyYPlane()` width parameter
2. Fix `CopyUVPlane()` width parameter
3. Verify with test video

**Acceptance Criteria**:
- No stripe pattern in displayed video
- Correct colors displayed
- Performance maintained
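
The width and height arguments that Phase 4 has to get right can be written down once. The helper below is a sketch for 8-bit NV12 only; `Nv12PlaneExtents` and `ComputeNv12Extents` are hypothetical names, and rounding odd dimensions up for the chroma plane is an assumption that follows ordinary 4:2:0 subsampling.

```cpp
#include <cassert>
#include <cstdint>

// Row width (in bytes) and row count for each NV12 plane.
struct Nv12PlaneExtents {
    std::uint32_t y_row_bytes;
    std::uint32_t y_rows;
    std::uint32_t uv_row_bytes;
    std::uint32_t uv_rows;
};

// 8-bit NV12: the Y plane is 1 byte per pixel; the UV plane stores
// interleaved (U, V) byte pairs at half resolution in both directions.
inline Nv12PlaneExtents ComputeNv12Extents(std::uint32_t width, std::uint32_t height)
{
    const std::uint32_t chroma_width  = (width + 1) / 2;   // round up for odd widths
    const std::uint32_t chroma_height = (height + 1) / 2;  // round up for odd heights
    return Nv12PlaneExtents{width, height, chroma_width * 2, chroma_height};
}
```

For 1920x1080 this gives 1920 bytes by 1080 rows for Y and 1920 bytes by 540 rows for UV, matching the arguments passed to `CopyYPlane()` and `CopyUVPlane()` in the handler. The pitch is a separate value and never appears here.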

---

## Testing Strategy

### Unit Tests
```cpp
TEST(D3D12SurfaceHandler, CopiesNV12FrameCorrectly)
{
    auto handler = CreateTestHandler();
    auto src_frame = CreateTestNV12Frame(1920, 1080);
    auto dst_texture = CreateTestD3D12Texture(1920, 1080);

    bool result = handler->CopyNV12Frame(
        src_frame.device_ptr, src_frame.pitch,
        dst_texture, 1920, 1080
    );

    EXPECT_TRUE(result);
    VerifyNV12Data(dst_texture);
}

TEST(ExternalMemoryCache, ReusesExistingEntry)
{
    auto cache = CreateTestCache();
    auto resource = CreateTestD3D12Resource();

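    CUdeviceptr ptr1 = 0;
    CUdeviceptr ptr2 = 0;
    ASSERT_TRUE(cache->GetOrCreateExternalMemory(resource, &ptr1));
    ASSERT_TRUE(cache->GetOrCreateExternalMemory(resource, &ptr2));

    EXPECT_NE(ptr1, 0u);    // Import must yield a real device pointer
    EXPECT_EQ(ptr1, ptr2);  // Should return same pointer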
}
```

### Integration Tests
- Load video file
- Decode multiple frames
- Verify no memory leaks
- Verify correct video display

---

## Success Criteria

- [x] Design document complete
- [ ] Phase 1 complete: D3D12SurfaceHandler working
- [ ] Phase 2 complete: ExternalMemoryCache working
- [ ] Phase 3 complete: Main decoder simplified
- [ ] Phase 4 complete: NV12 stride bug fixed (no stripe pattern)
- [ ] All existing tests passing
- [ ] No performance regression
- [ ] Code review passed
- [ ] Documentation updated

---

**Next Step**: Start Phase 1 - Extract D3D12SurfaceHandler

**Last Updated**: 2025-10-03