# NVDECAV1Decoder C++ Refactoring Design

**Date**: 2025-10-03
**Status**: Design Phase
**Goal**: Refactor NVDECAV1Decoder internal C++ code for readability and maintainability

---

## Problem Analysis

### Current State

- **File**: `vav2/platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
- **Lines**: 1,722 lines (too large)
- **Main Method**: `DecodeToSurface()` is 500+ lines with deeply nested logic

### Key Issues

1. **Monolithic Method**: `DecodeToSurface()` handles CPU, D3D11, D3D12, and CUDA targets in one giant function
2. **Mixed Responsibilities**: Decoding, surface copying, memory management, and fence signaling are all interleaved
3. **Hard to Debug**: Pitch/stride bugs are difficult to trace through the complex nesting
4. **Difficult to Test**: Individual components cannot be unit tested in isolation
5. **Poor Readability**: Excessive debug logging makes the logic hard to follow

---

## Design Goals

### Primary Goals

1. **Readability**: Each method should do ONE thing clearly
2. **Maintainability**: Bugs (like the current NV12 stride issue) should be easy to locate and fix
3. **Testability**: Each component can be tested independently
4. **Performance**: Zero overhead - use inline functions where appropriate

### Non-Goals

- NOT creating a C API (VavCore already provides that)
- NOT changing the external interface of NVDECAV1Decoder
- NOT over-engineering with complex patterns

---

## Proposed Architecture

### File Structure

```
NVDECAV1Decoder.h              (Public interface - unchanged)
NVDECAV1Decoder.cpp            (Main decoder - ~600 lines)
  └── Uses helper classes below

D3D12SurfaceHandler.h          (D3D12-specific logic - ~300 lines)
D3D12SurfaceHandler.cpp
  ├── CopyNV12Frame()
  ├── GetD3D12CUDAPointer()
  └── SignalD3D12Fence()

ExternalMemoryCache.h          (CUDA-D3D12 interop cache - ~200 lines)
ExternalMemoryCache.cpp
  ├── GetOrCreateExternalMemory()
  ├── Release()
  └── ReleaseAll()
```

### Class Diagram

```
NVDECAV1Decoder (Main decoder)
  ├── CUvideodecoder m_decoder
  ├── CUvideoparser m_parser
  ├── CUcontext m_cudaContext
  ├── D3D12SurfaceHandler* m_d3d12Handler (on-demand)
  └── ExternalMemoryCache* m_memoryCache (on-demand)

D3D12SurfaceHandler
  ├── ID3D12Device* m_device
  ├── CUcontext m_cudaContext
  ├── ExternalMemoryCache* m_cache
  └── Methods:
      ├── CopyNV12Frame(src, srcPitch, dst, width, height)
      ├── GetD3D12CUDAPointer(ID3D12Resource*)
      └── SignalD3D12Fence(value)

ExternalMemoryCache
  ├── std::map<ID3D12Resource*, CachedEntry>
  └── Methods:
      ├── GetOrCreateExternalMemory(resource)
      └── ReleaseAll()
```

---

## Refactored Code Structure

### 1. NVDECAV1Decoder.cpp (Main decoder - simplified)

**Before**: 500+ lines in DecodeToSurface()
**After**: Clean routing logic

```cpp
bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data,
                                      size_t packet_size,
                                      void* target_surface,
                                      SurfaceType target_type) {
    // Step 1: Decode packet to NVDEC internal buffer
    if (!DecodePacket(packet_data, packet_size)) {
        return false;
    }

    // Step 2: Get decoded frame info (maps the frame on success)
    DecodedFrameInfo frame_info;
    if (!GetDecodedFrame(&frame_info)) {
        return false;
    }

    // Step 3: Copy to target surface based on type
    bool result = false;
    switch (target_type) {
        case SURFACE_TYPE_CPU:
            result = CopyToCPUSurface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_D3D12:
            result = CopyToD3D12Surface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_D3D11:
            result = CopyToD3D11Surface(frame_info, target_surface);
            break;
        case SURFACE_TYPE_CUDA:
            result = CopyToCUDASurface(frame_info, target_surface);
            break;
        default:
            // Unknown surface type: result stays false
            break;
    }

    // Step 4: Cleanup (always unmap, even if the copy failed)
    cuvidUnmapVideoFrame(m_decoder, frame_info.device_ptr);
    return result;
}
```

### 2. Private Helper Methods (in NVDECAV1Decoder.cpp)

```cpp
// Private nested type of NVDECAV1Decoder, declared in the class definition
struct DecodedFrameInfo {
    CUdeviceptr device_ptr;
    uint32_t pitch;
    uint32_t width;
    uint32_t height;
};

// Private: decode packet using cuvidParseVideoData
// Returns: true on success
// Complexity: ~30 lines
bool NVDECAV1Decoder::DecodePacket(const uint8_t* data, size_t size) {
    CUVIDSOURCEDATAPACKET packet = {};
    packet.payload = data;
    packet.payload_size = static_cast<unsigned long>(size);
    packet.flags = CUVID_PKT_TIMESTAMP;  // timestamp itself is left at 0 here

    CUresult result = cuvidParseVideoData(m_parser, &packet);
    if (result != CUDA_SUCCESS) {
        LogError("cuvidParseVideoData failed: %d", static_cast<int>(result));
        return false;
    }
    return true;
}

// Private: get decoded frame from internal queue
// Returns: true if frame available (caller must unmap device_ptr)
// Complexity: ~40 lines
bool NVDECAV1Decoder::GetDecodedFrame(DecodedFrameInfo* out_info) {
    if (m_frameQueue.empty()) {
        return false;
    }

    int frame_index = m_frameQueue.front();
    m_frameQueue.pop();

    CUVIDPROCPARAMS proc_params = {};
    proc_params.progressive_frame = 1;

    CUdeviceptr device_ptr = 0;
    unsigned int pitch = 0;
    CUresult result = cuvidMapVideoFrame(m_decoder, frame_index,
                                         &device_ptr, &pitch, &proc_params);
    if (result != CUDA_SUCCESS) {
        LogError("cuvidMapVideoFrame failed: %d", static_cast<int>(result));
        return false;
    }

    out_info->device_ptr = device_ptr;
    out_info->pitch = pitch;
    out_info->width = m_width;
    out_info->height = m_height;
    return true;
}

// Private: copy to D3D12 surface (delegates to handler)
// Returns: true on success
// Complexity: ~20 lines
bool NVDECAV1Decoder::CopyToD3D12Surface(const DecodedFrameInfo& frame,
                                         void* surface) {
    auto* d3d12_resource = static_cast<ID3D12Resource*>(surface);

    // Create handler on-demand
    if (!m_d3d12Handler) {
        m_d3d12Handler = std::make_unique<D3D12SurfaceHandler>(
            m_d3d12Device, m_cudaContext
        );
    }

    return m_d3d12Handler->CopyNV12Frame(
        frame.device_ptr, frame.pitch,
        d3d12_resource,
        frame.width, frame.height
    );
}
```

### 3. D3D12SurfaceHandler.h (D3D12-specific operations)

```cpp
#pragma once

#include <d3d12.h>
#include <cuda.h>
#include <cstdint>
#include <memory>

namespace VavCore {

// Forward declaration
class ExternalMemoryCache;

class D3D12SurfaceHandler {
public:
    D3D12SurfaceHandler(ID3D12Device* device, CUcontext cuda_context);
    ~D3D12SurfaceHandler();

    // Copy NV12 frame from CUDA to D3D12 texture
    // Returns: true on success
    bool CopyNV12Frame(CUdeviceptr src_frame, uint32_t src_pitch,
                       ID3D12Resource* dst_texture,
                       uint32_t width, uint32_t height);

    // Signal D3D12 fence from CUDA stream
    // Returns: true on success
    bool SignalD3D12Fence(uint64_t fence_value);

private:
    // Get CUDA device pointer for D3D12 resource (uses cache)
    bool GetD3D12CUDAPointer(ID3D12Resource* resource, CUdeviceptr* out_ptr);

    // Copy Y plane (8-bit single channel)
    bool CopyYPlane(CUdeviceptr src, uint32_t src_pitch,
                    CUdeviceptr dst, uint32_t dst_pitch,
                    uint32_t width, uint32_t height);

    // Copy UV plane (8-bit dual channel, interleaved)
    bool CopyUVPlane(CUdeviceptr src, uint32_t src_pitch,
                     CUdeviceptr dst, uint32_t dst_pitch,
                     uint32_t width, uint32_t height);

private:
    ID3D12Device* m_device;
    CUcontext m_cudaContext;
    std::unique_ptr<ExternalMemoryCache> m_cache;
};

} // namespace VavCore
```

### 4. D3D12SurfaceHandler.cpp (Implementation)

```cpp
#include "D3D12SurfaceHandler.h"
#include "ExternalMemoryCache.h"

#include <cuda_runtime.h>
#include <cstdio>

namespace VavCore {

D3D12SurfaceHandler::D3D12SurfaceHandler(ID3D12Device* device,
                                         CUcontext cuda_context)
    : m_device(device)
    , m_cudaContext(cuda_context)
    , m_cache(std::make_unique<ExternalMemoryCache>(device, cuda_context))
{
}

// Defined here (not in the header) so ExternalMemoryCache is a complete type
D3D12SurfaceHandler::~D3D12SurfaceHandler()
{
}

bool D3D12SurfaceHandler::CopyNV12Frame(CUdeviceptr src_frame, uint32_t src_pitch,
                                        ID3D12Resource* dst_texture,
                                        uint32_t width, uint32_t height) {
    // Get CUDA pointer for D3D12 resource
    CUdeviceptr dst_ptr = 0;
    if (!GetD3D12CUDAPointer(dst_texture, &dst_ptr)) {
        return false;
    }

    // Get D3D12 texture layout (subresource 0 = Y plane, 1 = UV plane)
    D3D12_RESOURCE_DESC desc = dst_texture->GetDesc();
    D3D12_PLACED_SUBRESOURCE_FOOTPRINT layouts[2];
    UINT num_rows[2] = {0};
    UINT64 row_sizes[2] = {0};
    UINT64 total_bytes = 0;
    m_device->GetCopyableFootprints(&desc, 0, 2, 0,
                                    layouts, num_rows, row_sizes, &total_bytes);

    // Copy Y plane
    if (!CopyYPlane(src_frame, src_pitch,
                    dst_ptr, layouts[0].Footprint.RowPitch,
                    width, height)) {
        return false;
    }

    // Copy UV plane.
    // Note: this assumes the source UV plane starts at src_pitch * height.
    // If the decoder pads the surface height, the padded height is needed here.
    CUdeviceptr src_uv = src_frame + (static_cast<CUdeviceptr>(src_pitch) * height);
    CUdeviceptr dst_uv = dst_ptr + layouts[1].Offset;
    if (!CopyUVPlane(src_uv, src_pitch,
                     dst_uv, layouts[1].Footprint.RowPitch,
                     width, height / 2)) {
        return false;
    }

    return true;
}

bool D3D12SurfaceHandler::GetD3D12CUDAPointer(ID3D12Resource* resource,
                                              CUdeviceptr* out_ptr) {
    return m_cache->GetOrCreateExternalMemory(resource, out_ptr);
}

bool D3D12SurfaceHandler::CopyYPlane(CUdeviceptr src, uint32_t src_pitch,
                                     CUdeviceptr dst, uint32_t dst_pitch,
                                     uint32_t width, uint32_t height) {
    cudaError_t err = cudaMemcpy2D(
        reinterpret_cast<void*>(dst), dst_pitch,
        reinterpret_cast<const void*>(src), src_pitch,
        width, height,  // Copy only valid pixels, not padding
        cudaMemcpyDeviceToDevice
    );
    if (err != cudaSuccess) {
        printf("[D3D12] Y plane copy failed: %d\n", static_cast<int>(err));
        return false;
    }
    return true;
}

bool D3D12SurfaceHandler::CopyUVPlane(CUdeviceptr src, uint32_t src_pitch,
                                      CUdeviceptr dst, uint32_t dst_pitch,
                                      uint32_t width, uint32_t height) {
    // NV12 UV plane: interleaved U and V, so width in bytes = width of Y plane
    cudaError_t err = cudaMemcpy2D(
        reinterpret_cast<void*>(dst), dst_pitch,
        reinterpret_cast<const void*>(src), src_pitch,
        width, height,  // UV plane has same width in bytes, half height
        cudaMemcpyDeviceToDevice
    );
    if (err != cudaSuccess) {
        printf("[D3D12] UV plane copy failed: %d\n", static_cast<int>(err));
        return false;
    }
    return true;
}

} // namespace VavCore
```

### 5. ExternalMemoryCache.h (CUDA-D3D12 interop cache)

```cpp
#pragma once

#include <d3d12.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <map>

namespace VavCore {

class ExternalMemoryCache {
public:
    ExternalMemoryCache(ID3D12Device* device, CUcontext cuda_context);
    ~ExternalMemoryCache();

    // Get or create CUDA device pointer for D3D12 resource
    // Returns: true on success
    bool GetOrCreateExternalMemory(ID3D12Resource* resource, CUdeviceptr* out_ptr);

    // Release specific resource
    void Release(ID3D12Resource* resource);

    // Release all cached resources
    void ReleaseAll();

private:
    struct CachedEntry {
        cudaExternalMemory_t external_memory;
        CUdeviceptr device_ptr;
        size_t size;
    };

    bool ImportD3D12Resource(ID3D12Resource* resource,
                             cudaExternalMemory_t* out_ext_mem,
                             CUdeviceptr* out_ptr);

private:
    ID3D12Device* m_device;
    CUcontext m_cudaContext;
    std::map<ID3D12Resource*, CachedEntry> m_cache;
};

} // namespace VavCore
```

---

## Key Improvements

### Readability

**Before**:
- `DecodeToSurface()`: 500+ lines with 5 levels of nesting
- Mixed concerns: decoding, copying, caching, signaling

**After**:
- `DecodeToSurface()`: ~40 lines, clear 4-step process
- Each helper method: 20-60 lines, single responsibility

### Debugging

**Before**:
- NV12 stride bug hidden in 500 lines of mixed logic
- Hard to locate which `cudaMemcpy2D` call is wrong

**After**:
- `CopyYPlane()` and `CopyUVPlane()` are separate methods
- Easy to set a breakpoint and inspect parameters
- Clear separation of Y and UV plane logic

### Testing

**Before**:
- Cannot test D3D12 copying without full decoder setup
- Cannot mock CUDA operations


**After**:
- Can unit test `D3D12SurfaceHandler` independently
- Can test `ExternalMemoryCache` in isolation
- Easy to add mock implementations

### Maintenance

**Before**:
- Adding D3D11 support requires modifying a 500+ line method
- Risk of breaking existing D3D12 code

**After**:
- Add a new `D3D11SurfaceHandler` class
- Existing D3D12 code untouched
- Clean separation of concerns

---

## File Size Comparison

| File | Before | After |
|------|--------|-------|
| NVDECAV1Decoder.cpp | 1,722 lines | ~600 lines |
| D3D12SurfaceHandler.cpp | - | ~300 lines |
| ExternalMemoryCache.cpp | - | ~200 lines |
| **Total** | 1,722 lines | ~1,100 lines |

**Reduction**: roughly 36% fewer lines overall (1,722 to ~1,100), while improving readability

---

## Implementation Plan

### Phase 1: Extract D3D12 Handler (2-3 hours)

1. Create `D3D12SurfaceHandler.h/.cpp`
2. Move D3D12 resource import logic
3. Move NV12 plane copying logic
4. Test with existing Vav2Player

**Acceptance Criteria**:
- Vav2Player displays video correctly
- No memory leaks
- Performance same or better

### Phase 2: Extract External Memory Cache (1-2 hours)

1. Create `ExternalMemoryCache.h/.cpp`
2. Move external memory caching logic
3. Add proper cleanup on resource release
4. Test memory management

**Acceptance Criteria**:
- Cache hit/miss working correctly
- No memory leaks on repeated loads
- Cache cleared on decoder cleanup

### Phase 3: Refactor Main Decoder (1-2 hours)

1. Simplify `DecodeToSurface()` to routing logic
2. Extract `DecodePacket()` method
3. Extract `GetDecodedFrame()` method
4. Extract `CopyToCPUSurface()` method
5. Test all surface types

**Acceptance Criteria**:
- All surface types working
- Code passes all existing tests
- Debug logging reduced

### Phase 4: Fix NV12 Stride Bug (30 minutes)

1. Fix `CopyYPlane()` width parameter
2. Fix `CopyUVPlane()` width parameter
3. Verify with test video

**Acceptance Criteria**:
- No stripe pattern in displayed video
- Correct colors displayed
- Performance maintained

---

## Testing Strategy

### Unit Tests

```cpp
TEST(D3D12SurfaceHandler, CopiesNV12FrameCorrectly) {
    auto handler = CreateTestHandler();
    auto src_frame = CreateTestNV12Frame(1920, 1080);
    auto dst_texture = CreateTestD3D12Texture(1920, 1080);

    bool result = handler->CopyNV12Frame(
        src_frame.device_ptr, src_frame.pitch,
        dst_texture,
        1920, 1080
    );

    EXPECT_TRUE(result);
    VerifyNV12Data(dst_texture);
}

TEST(ExternalMemoryCache, ReusesExistingEntry) {
    auto cache = CreateTestCache();
    auto resource = CreateTestD3D12Resource();

    CUdeviceptr ptr1 = 0;
    CUdeviceptr ptr2 = 0;
    ASSERT_TRUE(cache->GetOrCreateExternalMemory(resource, &ptr1));
    ASSERT_TRUE(cache->GetOrCreateExternalMemory(resource, &ptr2));

    EXPECT_EQ(ptr1, ptr2);  // Should return same pointer
}
```

### Integration Tests

- Load video file
- Decode multiple frames
- Verify no memory leaks
- Verify correct video display

---

## Success Criteria

- [x] Design document complete
- [ ] Phase 1 complete: D3D12SurfaceHandler working
- [ ] Phase 2 complete: ExternalMemoryCache working
- [ ] Phase 3 complete: Main decoder simplified
- [ ] Phase 4 complete: NV12 stride bug fixed
- [ ] All existing tests passing
- [ ] No performance regression
- [ ] Code review passed
- [ ] Documentation updated

---

**Next Step**: Start Phase 1 - Extract D3D12SurfaceHandler

**Last Updated**: 2025-10-03
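
---

## Appendix: CPU Reference for Pitched NV12 Copy

The Phase 4 verification step and the unit tests both depend on knowing what a correct pitched NV12 copy looks like. The sketch below is a CPU-only stand-in for the two `cudaMemcpy2D` calls. It is not part of the proposed classes. `CopyPlanePitched` and `CopyNV12Reference` are hypothetical names introduced for illustration, and the UV-plane byte offsets are passed in explicitly because the correct source offset depends on how the decoder lays out its surface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical CPU equivalent of one cudaMemcpy2D call: copies `rows` rows of
// `width_bytes` bytes, advancing each buffer by its own pitch so that row
// padding in either buffer is skipped rather than copied.
inline void CopyPlanePitched(const uint8_t* src, size_t src_pitch,
                             uint8_t* dst, size_t dst_pitch,
                             size_t width_bytes, size_t rows) {
    assert(width_bytes <= src_pitch && width_bytes <= dst_pitch);
    for (size_t y = 0; y < rows; ++y) {
        std::memcpy(dst + y * dst_pitch, src + y * src_pitch, width_bytes);
    }
}

// Hypothetical reference copy of a full 8-bit NV12 frame.
// The Y plane is `width` bytes by `height` rows. The interleaved UV plane is
// also `width` bytes wide (U and V alternate) but only `height / 2` rows tall.
// `src_uv_offset` and `dst_uv_offset` are the byte offsets of the UV plane.
inline void CopyNV12Reference(const uint8_t* src, size_t src_pitch,
                              size_t src_uv_offset,
                              uint8_t* dst, size_t dst_pitch,
                              size_t dst_uv_offset,
                              size_t width, size_t height) {
    CopyPlanePitched(src, src_pitch, dst, dst_pitch, width, height);
    CopyPlanePitched(src + src_uv_offset, src_pitch,
                     dst + dst_uv_offset, dst_pitch,
                     width, height / 2);
}
```

A test helper along the lines of `VerifyNV12Data` could read back the D3D12 texture and compare it row by row against a buffer produced this way. Stripe artifacts typically show up when the copy width and the row pitch are confused, which is the case this reference keeps separate.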