# NVDEC RingBuffer-based Asynchronous Decoding Design **Date**: 2025-10-05 (Updated) **Status**: βœ… **Design Finalized - Ready for Implementation** **Target**: NVDECAV1Decoder DecodeToSurface() optimization with multi-frame packet support --- ## πŸ“‹ Table of Contents 1. [Background & Problem Statement](#background--problem-statement) 2. [Design Goals](#design-goals) 3. [Architecture Overview](#architecture-overview) 4. [Detailed Design](#detailed-design) 5. [Implementation Plan](#implementation-plan) 6. [Performance Analysis](#performance-analysis) 7. [Risk Assessment](#risk-assessment) 8. [References](#references) --- ## 🎯 Background & Problem Statement ### Current Implementation Issues #### Issue 1: Sequential Decoding Bottleneck **Current Code** (`NVDECAV1Decoder::DecodeToSurface`): ```cpp bool DecodeToSurface(packet_data, target_surface, output_frame) { cuvidParseVideoData(packet); // Submit packet wait_for_frame(); // Block until decode complete cuvidMapVideoFrame(); // Map decoded frame CopyNV12Frame(srcPtr, target_surface); // Copy to surface cuvidUnmapVideoFrame(); // Unmap frame return true; } ``` **Timeline**: ``` Call 1: [Submit][Wait____][Map][Copy][Unmap] Call 2: [Submit][Wait____][Map][Copy][Unmap] Call 3: [Submit][Wait____][Map][Copy][Unmap] Total: ~45ms (15ms * 3) ``` **Problem**: GPU sits idle during packet submission and surface copying. --- #### Issue 2: Packet-Surface Mapping Mismatch **Scenario**: Multi-threaded DecodeToSurface calls ```cpp // Thread A DecodeToSurface(packet1, surface1, frame1); // I-frame, 15ms // Thread B (simultaneous) DecodeToSurface(packet2, surface2, frame2); // P-frame, 3ms // GPU completes out-of-order packet2 completes @ t=3ms β†’ picture_index=5 packet1 completes @ t=15ms β†’ picture_index=4 // Current queue-based retrieval Thread A: pop() β†’ picture_index=5 ❌ (gets packet2 result!) Thread B: pop() β†’ picture_index=4 ❌ (gets packet1 result!) 
// Result: surface1 contains packet2, surface2 contains packet1
```

**Root Cause**:
- FIFO queue doesn't track which packet belongs to which surface
- Decoding completion order β‰  submission order

---

#### Issue 3: Multi-Frame Packet Handling ⚠️ **Critical Discovery**

**Scenario**: Single WebM packet contains multiple AV1 frames

```cpp
// DecodeToSurface called ONCE
DecodeToSurface(packet_175bytes, surface1, frame1);

// NVDEC parser extracts MULTIPLE frames from the single packet:
β†’ HandlePictureDecode(CurrPicIdx=0, IntraPicFlag=1)  // I-frame
β†’ HandlePictureDecode(CurrPicIdx=1, IntraPicFlag=0)  // P-frame
β†’ HandlePictureDecode(CurrPicIdx=2, IntraPicFlag=0)  // P-frame
β†’ ... (up to 8 frames in one packet)

// Problem: Which picture_index should be returned?
// Current design assumes: 1 packet = 1 frame ❌
```

**Impact**:
- Slot allocation assumes 1 packet β†’ 1 slot β†’ 1 picture_index
- Reality: 1 packet β†’ 1 slot β†’ **N picture_indices**
- Must track multiple picture_indices per slot
- Must decide which frame to return (first? last? all?)

---

#### Issue 4: ulNumOutputSurfaces Underutilization

**NVDEC Configuration**:
```cpp
m_createInfo.ulNumOutputSurfaces = 8;  // 8 output buffers allocated
```

**Purpose of 8 Surfaces**:
1. Reference frame storage (AV1 needs up to 8 reference frames)
2. Display reordering buffer (decode order β‰  display order)
3. Pipeline parallelism (overlap decode/map/copy operations)

**Current Usage**: Only 1-2 surfaces are actually utilized β†’ wasted GPU memory (~24MB for 1920x1080)

---

## 🎯 Design Goals

### Primary Goals

1. **βœ… Correct Packet-Surface Mapping**
   - Guarantee: `DecodeToSurface(packetN, surfaceN)` β†’ surfaceN contains packetN's result
   - No out-of-order completion issues

2. **βœ… Sequential Return Order**
   - Even if packet2 completes before packet1, return surface1 first
   - Maintain FIFO semantics for API users

3. **⚑ Parallel Decoding Throughput**
   - Allow 2-3 packets to decode simultaneously on the GPU
   - Reduce total decode latency by ~30-40%

4. **πŸ›‘οΈ Simplicity & Maintainability**
   - Avoid complex ID mapping systems
   - Use a fixed-size RingBuffer (matches NVDEC hardware design)
   - Single polling thread (not one thread per decode)

### Non-Goals

- ❌ Full 8-frame parallel decoding (limited by reference frame dependencies)
- ❌ Out-of-order surface return (would break video playback semantics)
- ❌ Dynamic buffer sizing (fixed 8 slots matches NVDEC hardware)

---

## πŸ—οΈ Architecture Overview

### High-Level Design

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               DecodeToSurface() Call                β”‚
β”‚     (Thread 1, 2, 3, ... can call concurrently)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                RingBuffer (8 slots)                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ [0] β”‚ [1] β”‚ [2] β”‚ [3] β”‚ [4] β”‚ [5] β”‚ [6] β”‚ [7] β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜  β”‚
β”‚     β–²                                      β–²        β”‚
β”‚     β”‚                                      β”‚        β”‚
β”‚  m_returnIdCounter % 8        m_slotIdCounter % 8   β”‚
β”‚     (consumer)                     (producer)       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Polling Thread                                      β”‚
β”‚   while (m_pollingRunning) {                        β”‚
β”‚     slot = m_ringBuffer[m_returnIdCounter % 8]      β”‚
β”‚     if (all slot.picture_indices decoded OK) {      β”‚
β”‚       slot.is_ready = true                          β”‚
β”‚       slot.frame_ready.notify_one()                 β”‚
β”‚     }                                               β”‚
β”‚     sleep(100us)                                    β”‚
β”‚   }                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Data Structures

#### DecodeSlot Structure

```cpp
struct DecodeSlot {
    // Slot state
    bool in_use;                          // Is this slot allocated?

    // Submitted information (from DecodeToSurface args)
    void* target_surface;                 // Destination D3D12 resource
    VavCoreSurfaceType surface_type;      // Surface type

    // NVDEC information (from HandlePictureDecode callback)
    // ⚠️ Multi-frame support: one packet can decode to multiple frames
    std::vector<int> picture_indices;     // All NVDEC frame indices from this packet

    // Synchronization primitives
    std::condition_variable frame_ready;  // Signaled when ALL frames are decoded
    std::mutex slot_mutex;                // Protects this slot's state
    bool is_ready;                        // All frames decoded flag
};
```

#### RingBuffer Members

```cpp
class NVDECAV1Decoder {
private:
    static constexpr size_t RING_BUFFER_SIZE = 8;  // Match ulNumOutputSurfaces
    DecodeSlot m_ringBuffer[RING_BUFFER_SIZE];

    // 🎯 Option C: Unified slot allocation counter (no mapping needed!)
    std::atomic<uint64_t> m_slotIdCounter{0};    // Monotonically increasing slot ID
    std::atomic<uint64_t> m_returnIdCounter{0};  // Return order enforcement (FIFO)

    // Polling thread
    std::thread m_pollingThread;
    std::atomic<bool> m_pollingRunning{false};
};
```

---

## πŸ”§ Detailed Design

### Component 1: Slot Allocation (Producer)

**Purpose**: Assign a RingBuffer slot to each DecodeToSurface call using the Option C design

```cpp
// In DecodeToSurface()

// 1. 🎯 Option C: Allocate unique ID (serves as both slot ID and submission order)
uint64_t my_id = m_slotIdCounter.fetch_add(1);
size_t slot_idx = my_id % RING_BUFFER_SIZE;
DecodeSlot& my_slot = m_ringBuffer[slot_idx];

// 2. Check for overflow
{
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
    if (my_slot.in_use) {
        // RingBuffer full! Too many concurrent decodes
        LogError("RingBuffer overflow - max 8 concurrent decodes");
        return false;
    }

    // 3. Initialize slot
    my_slot.in_use = true;
    my_slot.target_surface = target_surface;
    my_slot.surface_type = target_type;
    my_slot.picture_indices.clear();  // Multi-frame support: clear previous frames
    my_slot.is_ready = false;
}
```

**🎯 Option C: No Mapping Needed - Direct Calculation**:
```
Thread 1: m_slotIdCounter.fetch_add(1) β†’ ID=0, slot_idx = 0 % 8 = slot[0]
Thread 2: m_slotIdCounter.fetch_add(1) β†’ ID=1, slot_idx = 1 % 8 = slot[1]
Thread 3: m_slotIdCounter.fetch_add(1) β†’ ID=2, slot_idx = 2 % 8 = slot[2]
...
Thread 9: m_slotIdCounter.fetch_add(1) β†’ ID=8, slot_idx = 8 % 8 = slot[0] (wrap around)
```

**Key Advantage**: The ID β†’ slot_idx calculation is **deterministic**; no map storage is needed.

**Overflow Protection**: If `slot[0].in_use == true` when Thread 9 arrives β†’ error

---

### Component 2: Packet Submission

**Purpose**: Submit the packet to NVDEC, tracking the slot ID via the timestamp field

```cpp
// 4. Submit packet to NVDEC parser
CUVIDSOURCEDATAPACKET packet = {};
packet.payload = packet_data;
packet.payload_size = packet_size;
packet.flags = CUVID_PKT_ENDOFPICTURE | CUVID_PKT_TIMESTAMP;  // Mark timestamp field valid
packet.timestamp = static_cast<CUvideotimestamp>(my_id);  // 🎯 Pass full slot_id (NOT modulo!)

CUresult result = cuvidParseVideoData(m_parser, &packet);
if (result != CUDA_SUCCESS) {
    my_slot.in_use = false;  // Release slot on error
    return false;
}
```

**🎯 Option C Timestamp Flow** (multi-frame packet support):
```
cuvidParseVideoData(packet, timestamp=my_id=17)
  ↓
HandleVideoSequence()  (first time only)
  ↓
HandlePictureDecode(timestamp=17, CurrPicIdx=0)
  β†’ slot_idx = 17 % 8 = 1
  β†’ m_ringBuffer[1].picture_indices.push_back(0)
  ↓
HandlePictureDecode(timestamp=17, CurrPicIdx=1)
  β†’ slot_idx = 17 % 8 = 1
  β†’ m_ringBuffer[1].picture_indices.push_back(1)  // Same packet, multiple frames!
  ↓
HandlePictureDecode(timestamp=17, CurrPicIdx=2)
  β†’ slot_idx = 17 % 8 = 1
  β†’ m_ringBuffer[1].picture_indices.push_back(2)
  ↓
PollingThread checks ALL picture_indices for slot[1]
  ↓
When all complete: slot[1].is_ready = true, notify thread
```

**Key Point**: The timestamp carries the **full slot_id**; HandlePictureDecode calculates slot_idx directly

---

### Component 3: Polling Thread (Background Status Checker)

**Purpose**: Continuously poll the `m_returnIdCounter` slot for decode completion (multi-frame support)

```cpp
void NVDECAV1Decoder::PollingThreadFunc() {
    while (m_pollingRunning) {
        // 1. Get current return ID and calculate slot index
        uint64_t current_return_id = m_returnIdCounter.load();
        size_t slot_idx = current_return_id % RING_BUFFER_SIZE;
        DecodeSlot& slot = m_ringBuffer[slot_idx];

        // 2. Check if slot is in use and not yet ready
        if (slot.in_use && !slot.is_ready) {
            // 3. Get copy of picture indices (multi-frame support)
            std::vector<int> picture_indices_copy;
            {
                std::lock_guard<std::mutex> lock(slot.slot_mutex);
                picture_indices_copy = slot.picture_indices;
            }

            // 4. Check if ALL frames are decoded
            bool all_complete = true;
            for (int pic_idx : picture_indices_copy) {
                CUVIDGETDECODESTATUS decodeStatus = {};
                CUresult result = cuvidGetDecodeStatus(m_decoder, pic_idx, &decodeStatus);
                if (result != CUDA_SUCCESS ||
                    decodeStatus.decodeStatus != cuvidDecodeStatus_Success) {
                    all_complete = false;
                    break;
                }
            }

            // 5. If all frames complete, signal ready
            if (all_complete && !picture_indices_copy.empty()) {
                {
                    std::lock_guard<std::mutex> lock(slot.slot_mutex);
                    slot.is_ready = true;
                }
                slot.frame_ready.notify_one();
            }
        }

        // 6. Sleep to avoid busy-wait
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
}
```

**Key Points**:
- βœ… Only polls the `m_returnIdCounter` slot (not all 8 slots) β†’ efficient
- βœ… **Multi-frame support**: checks ALL picture_indices for completion
- βœ… Uses the non-blocking `cuvidGetDecodeStatus()` query
- βœ… 100us sleep β†’ ~10,000 checks/second (low CPU usage)
- βœ… Thread-safe picture_indices copy to avoid lock contention

---

### Component 4: Sequential Return Wait (FIFO Guarantee)

**Purpose**: Enforce FIFO order even when decodes complete out-of-order, using Option C

```cpp
// In DecodeToSurface() - PHASE 2

// 5. Wait for my turn (FIFO order enforcement)
while (m_returnIdCounter.load() != my_id) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
// Now I'm first in line!

// 6. Wait for decode to complete
{
    std::unique_lock<std::mutex> lock(my_slot.slot_mutex);
    if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500),
                                      [&my_slot]() { return my_slot.is_ready; })) {
        // Timeout - decode took too long
        LogError("Decode timeout");
        my_slot.in_use = false;
        m_returnIdCounter.fetch_add(1);  // Skip this slot to avoid deadlock
        return false;
    }
}
```

**🎯 Option C Timeline Example** (using slot ID, not slot index):
```
Thread 1 (ID=17, slot 1): Wait for returnIdCounter==17 βœ… (immediate)
                          Wait for is_ready...
Thread 2 (ID=18, slot 2): Wait for returnIdCounter==18 ⏸️ (blocked)
Thread 3 (ID=19, slot 3): Wait for returnIdCounter==19 ⏸️ (blocked)

GPU: ID=18 completes first @ t=3ms
  β†’ slot[2].is_ready = true
  β†’ Thread 2 still blocked (returnIdCounter=17)

GPU: ID=17 completes @ t=15ms
  β†’ slot[1].is_ready = true
  β†’ Thread 1 wakes up βœ…
  β†’ Thread 1 processes β†’ returnIdCounter = 18
  β†’ Thread 2 now unblocked βœ…
```

**Key Point**: Wait on the **slot_id** (my_id), not slot_idx, for correct FIFO ordering

---

### Component 5: Frame Retrieval & Cleanup (Multi-Frame Support)

**Purpose**: Map the decoded frame, copy it to the surface, release the slot

```cpp
// In DecodeToSurface() - PHASE 3

// 7. Get first frame from multi-frame packet
int frameIdx = my_slot.picture_indices[0];  // Return first frame only

// 8. Map decoded frame from NVDEC
CUVIDPROCPARAMS procParams = {};
procParams.progressive_frame = 1;

CUdeviceptr srcDevicePtr = 0;
unsigned int srcPitch = 0;
CUresult result = cuvidMapVideoFrame(m_decoder, frameIdx,
                                     &srcDevicePtr, &srcPitch, &procParams);
if (result != CUDA_SUCCESS) {
    my_slot.in_use = false;
    m_returnIdCounter.fetch_add(1);
    return false;
}

// 9. Copy to D3D12 surface
ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
bool copySuccess = m_d3d12Handler->CopyNV12Frame(
    srcDevicePtr, srcPitch, d3d12Resource, m_width, m_height
);

// 10. Unmap frame
cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

// 11. Release slot
{
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
    my_slot.in_use = false;
}

// 12. Advance return ID counter (FIFO order)
m_returnIdCounter.fetch_add(1);

return copySuccess;
```

**Multi-Frame Decision**: Return the **first frame only** from a multi-frame packet (`picture_indices[0]`)

---

### Component 6: NVDEC Callback Integration (Option C)

**Purpose**: Link the NVDEC picture_index to a RingBuffer slot via direct slot_id calculation

```cpp
int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
    auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

    // 🎯 Option C: Direct slot_id β†’ slot_idx calculation (no map lookup!)
    uint64_t slot_id = static_cast<uint64_t>(pic_params->nTimeStamp);
    size_t slot_idx = slot_id % RING_BUFFER_SIZE;
    DecodeSlot& slot = decoder->m_ringBuffer[slot_idx];

    // Submit frame to NVDEC decoder
    CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
    if (result != CUDA_SUCCESS) {
        decoder->LogCUDAError(result, "cuvidDecodePicture failed");
        return 0;
    }

    // Store picture_index for polling (multi-frame support)
    {
        std::lock_guard<std::mutex> lock(slot.slot_mutex);
        slot.picture_indices.push_back(pic_params->CurrPicIdx);
    }

    // Polling thread will check cuvidGetDecodeStatus() for ALL picture_indices
    return 1;
}
```

**🎯 Option C Key Advantages**:
- βœ… **No mapping overhead**: direct modulo calculation, no unordered_map lookup
- βœ… **No mutex contention**: no global map mutex needed
- βœ… **Multi-frame support**: automatically handles multiple frames per packet
- βœ… **Deterministic**: the same slot_id always maps to the same slot_idx

---

## πŸ“ Implementation Plan (Updated for Option C)

### Phase 1: Data Structure Setup βœ…

**Files to Modify**:
- `NVDECAV1Decoder.h` - Add RingBuffer members with Option C design
- `NVDECAV1Decoder.cpp` - Initialize RingBuffer in constructor

**Tasks**:
- [x] Define `DecodeSlot` structure with `std::vector<int> picture_indices`
- [x] Add `m_ringBuffer[8]` array
- [x] **🎯 Option C**: Add `m_slotIdCounter`, `m_returnIdCounter` atomic counters (NOT submitIndex/returnIndex)
- [x] Add `m_pollingThread`, `m_pollingRunning` members

**Estimated Time**: 30 minutes

---

### Phase 2: Polling Thread Implementation (Multi-Frame Support)

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Implement `PollingThreadFunc()`

**Tasks**:
- [ ] Implement polling loop with `cuvidGetDecodeStatus()` checking ALL picture_indices
- [ ] Poll the slot at `m_returnIdCounter % RING_BUFFER_SIZE` (NOT all slots)
- [ ] Add thread start in `Initialize()`
- [ ] Add thread stop in `Cleanup()`
- [ ] Add debug logging for multi-frame packet events

**Testing**:
- Verify thread starts/stops correctly
- Verify multi-frame packet handling (all frames checked)
- Verify `cuvidGetDecodeStatus()` calls work

**Estimated Time**: 1 hour

---

### Phase 3: DecodeToSurface Refactoring (Option C Implementation)

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Rewrite `DecodeToSurface()` with Option C design

**Tasks**:
- [ ] **🎯 Step 1**: Slot allocation using `my_id = m_slotIdCounter.fetch_add(1)`
- [ ] **🎯 Step 2**: Packet submission with `packet.timestamp = my_id` (full slot_id, NOT modulo!)
- [ ] **🎯 Step 3**: FIFO wait using `while (m_returnIdCounter != my_id)` (NOT slot_idx!)
- [ ] **🎯 Step 4**: Frame retrieval from `picture_indices[0]` (first frame only)
- [ ] **🎯 Step 5**: Cleanup and `m_returnIdCounter.fetch_add(1)`
- [ ] Error handling for all failure paths

**Testing**:
- Single-threaded decode test
- Multi-threaded decode test (2-3 threads)
- Verify packet-surface mapping correctness with multi-frame packets
- Verify FIFO ordering with out-of-order completion

**Estimated Time**: 2 hours

---

### Phase 4: HandlePictureDecode Update (Option C)

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Modify `HandlePictureDecode()` callback

**Tasks**:
- [ ] **🎯 Option C**: Extract `slot_id` from `pic_params->nTimeStamp`
- [ ] **🎯 Option C**: Calculate `slot_idx = slot_id % RING_BUFFER_SIZE` (NO map lookup!)
- [ ] **🎯 Multi-frame**: Use `slot.picture_indices.push_back(CurrPicIdx)` (NOT a single index!)
- [ ] Add debug logging for multi-frame packets **Testing**: - Verify timestamp β†’ slot_idx direct calculation works - Verify picture_indices vector correctly stores multiple frames - Test with video that has multi-frame packets (test_720p_stripe.webm) **Estimated Time**: 30 minutes --- ### Phase 5: Integration Testing (Option C Validation) **Test Scenarios**: 1. **Single packet decode** - Verify Option C basic functionality 2. **Multi-frame packet** - Verify vector-based picture_indices handling (test_720p_stripe.webm) 3. **Sequential 3 packets** - Verify FIFO order using m_returnIdCounter 4. **Out-of-order completion** - Verify slot_id β†’ slot_idx mapping (I-frame after P-frame) 5. **RingBuffer overflow** - Verify error handling (9+ concurrent calls) 6. **Decode errors** - Verify graceful failure with multi-frame packets 7. **Performance benchmark** - Measure latency reduction vs old queue-based approach **Test Files**: - Simple test video (simple_test.webm) - basic validation - **Multi-frame packet video (test_720p_stripe.webm)** - critical multi-frame test ⚠️ **Validation Criteria**: - βœ… No slot ID β†’ slot_idx mapping errors - βœ… All frames from multi-frame packets detected and polled - βœ… FIFO order maintained even with out-of-order GPU completion - βœ… No memory corruption or race conditions **Estimated Time**: 2 hours --- ### Phase 6: Documentation & Cleanup **Tasks**: - [x] Update NVDEC design documentation with Option C and multi-frame support - [ ] Add inline code comments explaining Option C design choices - [ ] Remove old queue-based code and any IDβ†’Index mapping attempts - [ ] Move design doc to `docs/completed/` after successful implementation - [ ] Document multi-frame packet behavior and first-frame-only decision **Estimated Time**: 1 hour --- ## 🎯 Option C Design Summary ### **Core Principle**: Eliminate mapping overhead through deterministic calculation **Key Components**: 1. 
**Single Counter for Dual Purpose**: `m_slotIdCounter` serves as both unique ID and submission order 2. **Direct Slot Calculation**: `slot_idx = slot_id % RING_BUFFER_SIZE` (no map needed) 3. **FIFO via ID Comparison**: `while(m_returnIdCounter != my_id)` ensures ordering 4. **Multi-Frame Vector**: `std::vector picture_indices` handles packets with multiple frames **Data Flow**: ``` DecodeToSurface: my_id = m_slotIdCounter++ (e.g., 17) slot_idx = 17 % 8 = 1 packet.timestamp = 17 HandlePictureDecode: slot_id = pic_params->nTimeStamp (17) slot_idx = 17 % 8 = 1 m_ringBuffer[1].picture_indices.push_back(CurrPicIdx) PollingThread: return_id = m_returnIdCounter (17) slot_idx = 17 % 8 = 1 Check all m_ringBuffer[1].picture_indices[] DecodeToSurface (wait): while(m_returnIdCounter != 17) { wait } Process frame from picture_indices[0] m_returnIdCounter++ (18) ``` **Eliminated Complexity**: - ❌ No `std::unordered_map` mapping - ❌ No global map mutex - ❌ No map insert/erase operations - ❌ No lookup failures or stale entries **Multi-Frame Packet Handling**: - One packet β†’ multiple HandlePictureDecode calls - All frames stored in `picture_indices` vector - PollingThread checks ALL frames complete - Return first frame only (`picture_indices[0]`) --- ## πŸ“Š Performance Analysis ### Expected Improvements #### Scenario 1: Sequential 3-Frame Decode **Current (Sequential)**: ``` Frame 1: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms Frame 2: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms Frame 3: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms Total: 45ms ``` **With RingBuffer (Parallel)**: ``` Submit all 3: [S1][S2][S3] = 3ms GPU decode: [─── 10ms ───] (parallel) Sequential return: Frame 1: [Wait][Map][Copy][Unmap] = 5ms Frame 2: [Map][Copy][Unmap] = 4ms (already decoded) Frame 3: [Map][Copy][Unmap] = 4ms (already decoded) Total: 3ms + 10ms + 13ms = 26ms Improvement: 45ms β†’ 26ms (42% faster) ``` --- #### Scenario 2: Complex I/P Frame Mix 
**Video**: I-frame (15ms), P-frame (3ms), P-frame (3ms)

**Current**:
```
I: 15ms
P: 3ms
P: 3ms
Total: 21ms
```

**With RingBuffer**:
```
Submit: 3ms
GPU:    [I:15ms, P:3ms, P:3ms overlap] = 15ms (limited by I-frame)
Return: 9ms (sequential)
Total:  27ms (slower due to overhead)

But if the I-frame completes first: 21ms (same)
```

**Note**: The benefit depends on GOP structure. Simple videos may see less improvement.

---

### CPU & GPU Utilization

**Current**:
```
CPU: [Active──────][Idle────────][Active──────][Idle────────]
GPU:         [Decode──────]              [Decode──────]
```

**With RingBuffer**:
```
CPU: [Active][Active][Active][─────Wait─────][Active][Active]
GPU:         [Decode1][Decode2][Decode3]
```

**GPU Utilization**: ⬆️ Increased by ~30%
**CPU Utilization**: ➑️ Similar (waiting in different places)

---

## πŸ›‘οΈ Risk Assessment

### Risk 1: RingBuffer Overflow

**Scenario**: Application calls `DecodeToSurface()` 9+ times before the first frame completes

**Mitigation**:
```cpp
if (my_slot.in_use) {
    LogError("RingBuffer overflow");
    return false;  // Fail fast
}
```

**Alternative**: Block until a slot is available
```cpp
while (my_slot.in_use) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
```

**Recommendation**: Fail fast + document the max concurrent limit (8 frames)

---

### Risk 2: Decode Timeout Deadlock

**Scenario**: GPU hangs, polling thread never sets `is_ready`

**Mitigation**:
```cpp
if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500), ...)) {
    // Timeout - skip this slot
    my_slot.in_use = false;
    m_returnIdCounter.fetch_add(1);  // Advance to prevent deadlock
    return false;
}
```

**Risk**: Skipping a slot may cause the next slot to fail (reference frame missing)

**Recommendation**: Return an error, let the application decide (retry/abort)

---

### Risk 3: Polling Thread CPU Usage

**100us sleep β†’ ~10,000 checks/second**

**CPU Impact**:
```
Per check: ~1us (cuvidGetDecodeStatus overhead)
Total: 10,000 * 1us = 10ms/sec = 1% CPU
```

**Mitigation**: Acceptable overhead for real-time video decoding

**Alternative**: Adaptive polling (faster when frames are pending, slower when idle)

---

### Risk 4: Thread Safety - CUDA Context

**Issue**: `cuvidGetDecodeStatus()` requires the correct CUDA context

**Current Protection**:
```cpp
std::lock_guard<std::mutex> contextLock(m_cudaContextMutex);
cuCtxSetCurrent(m_cuContext);
```

**In Polling Thread**:
```cpp
void PollingThreadFunc() {
    // Set CUDA context once at thread start
    cuCtxSetCurrent(m_cuContext);

    while (m_pollingRunning) {
        // cuvidGetDecodeStatus() now safe
        cuvidGetDecodeStatus(...);
    }
}
```

**Recommendation**: Set the context once per thread (not per call)

---

## πŸ“š References

### NVIDIA Documentation

1. **NVDEC Video Decoder API Programming Guide**
   - URL: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/nvdec-video-decoder-api-prog-guide/
   - Section: "cuvidGetDecodeStatus" API reference

2. **Video Codec SDK 12.2**
   - URL: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/
   - Section: "NVDECODE API" - asynchronous decoding patterns

3. **CUDA Programming Guide**
   - URL: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
   - Section: "Multi-threading" - CUDA context management

### Previous Design Documents

1. **NVDEC Deferred Decoder Recreation Design** (Abandoned)
   - Path: `docs/completed/windows/nvdec/NVDEC_Deferred_Decoder_Recreation_Design.md`
   - Reason for abandonment: Incompatible with NVDEC's synchronous callback architecture
   - Lessons learned: The NVDEC parser requires immediate decoder availability

2. **VavCore Library Design**
   - Path: `VavCore_Library_Design.md`
   - Relevant sections: C API design, cross-platform surface types

### Code References

1. **Current DecodeToSurface Implementation**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 976-1200
   - Current approach: Synchronous wait with condition variable

2. **HandlePictureDisplay Callback**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 791-813
   - Current approach: Push picture_index to queue, notify condition variable

3. **NVDEC Initialization**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 600-650
   - Key setting: `m_createInfo.ulNumOutputSurfaces = 8`

---

## πŸ”„ Change Log

| Date       | Author | Change Description |
|------------|--------|--------------------|
| 2025-10-04 | Claude | Initial design document created |
| 2025-10-04 | Claude | Added RingBuffer architecture details |
| 2025-10-04 | Claude | Added polling thread design |
| 2025-10-04 | Claude | Added performance analysis |
| 2025-10-05 | Claude | Updated with Option C design and multi-frame packet support |

---

## βœ… Approval & Sign-off

**Design Review**:
- [ ] Architecture approved
- [ ] Performance estimates validated
- [ ] Risk mitigation strategies accepted

**Implementation Ready**:
- [ ] All design questions resolved
- [ ] Implementation plan reviewed
- [ ] Test plan defined

---

**Next Steps**: Proceed to Phase 1 - Data Structure Setup