# NVDEC RingBuffer-based Asynchronous Decoding Design

**Date**: 2025-10-04

**Status**: 🔄 **In Progress**

**Target**: NVDECAV1Decoder DecodeToSurface() optimization

---

## 📋 Table of Contents

1. [Background & Problem Statement](#background--problem-statement)
2. [Design Goals](#design-goals)
3. [Architecture Overview](#architecture-overview)
4. [Detailed Design](#detailed-design)
5. [Implementation Plan](#implementation-plan)
6. [Performance Analysis](#performance-analysis)
7. [Risk Assessment](#risk-assessment)
8. [References](#references)

---

## 🎯 Background & Problem Statement

### Current Implementation Issues

#### Issue 1: Sequential Decoding Bottleneck

**Current Code** (`NVDECAV1Decoder::DecodeToSurface`):
```cpp
bool DecodeToSurface(packet_data, target_surface, output_frame) {
    cuvidParseVideoData(packet);            // Submit packet
    wait_for_frame();                       // Block until decode complete
    cuvidMapVideoFrame();                   // Map decoded frame
    CopyNV12Frame(srcPtr, target_surface);  // Copy to surface
    cuvidUnmapVideoFrame();                 // Unmap frame
    return true;
}
```

**Timeline**:
```
Call 1: [Submit][Wait____][Map][Copy][Unmap]
Call 2: [Submit][Wait____][Map][Copy][Unmap]
Call 3: [Submit][Wait____][Map][Copy][Unmap]

Total: ~45ms (15ms * 3)
```

**Problem**: GPU sits idle during packet submission and surface copying.

---

#### Issue 2: Packet-Surface Mapping Mismatch

**Scenario**: Multi-threaded DecodeToSurface calls

```cpp
// Thread A
DecodeToSurface(packet1, surface1, frame1);  // I-frame, 15ms

// Thread B (simultaneous)
DecodeToSurface(packet2, surface2, frame2);  // P-frame, 3ms

// GPU completes out-of-order
packet2 completes @ t=3ms  → picture_index=5
packet1 completes @ t=15ms → picture_index=4

// Current queue-based retrieval
Thread A: pop() → picture_index=5 ❌ (gets packet2 result!)
Thread B: pop() → picture_index=4 ❌ (gets packet1 result!)

// Result: surface1 contains packet2, surface2 contains packet1
```

**Root Cause**:
- FIFO queue doesn't track which packet belongs to which surface
- Decoding completion order ≠ submission order

---

#### Issue 3: ulNumOutputSurfaces Underutilization

**NVDEC Configuration**:
```cpp
m_createInfo.ulNumOutputSurfaces = 8;  // 8 output buffers allocated
```

**Purpose of 8 Surfaces**:
1. Reference frame storage (AV1 needs up to 8 reference frames)
2. Display reordering buffer (B-frames, decode order ≠ display order)
3. Pipeline parallelism (overlap decode/map/copy operations)

**Current Usage**: Only 1-2 surfaces actually utilized → Wasted GPU memory (~24MB for 1920x1080)

---

## 🎯 Design Goals

### Primary Goals

1. **✅ Correct Packet-Surface Mapping**
   - Guarantee: `DecodeToSurface(packetN, surfaceN)` → surfaceN contains packetN result
   - No out-of-order completion issues

2. **✅ Sequential Return Order**
   - Even if packet2 completes before packet1, return surface1 first
   - Maintain FIFO semantics for API users

3. **⚡ Parallel Decoding Throughput**
   - Allow 2-3 packets to decode simultaneously on GPU
   - Reduce total decode latency by ~30-40%

4. **🛡️ Simplicity & Maintainability**
   - Avoid complex ID mapping systems
   - Use fixed-size RingBuffer (matches NVDEC hardware design)
   - Single polling thread (not per-decode thread)

### Non-Goals

- ❌ Full 8-frame parallel decoding (limited by reference frame dependencies)
- ❌ Out-of-order surface return (would break video playback semantics)
- ❌ Dynamic buffer sizing (fixed 8 slots matches NVDEC hardware)

---

## 🏗️ Architecture Overview

### High-Level Design

```
┌─────────────────────────────────────────────────────────────────┐
│                     DecodeToSurface() Call                      │
│  (Thread 1, 2, 3, ... can call concurrently)                    │
└────────────┬────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│                      RingBuffer (8 slots)                       │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐              │
│  │ [0] │ [1] │ [2] │ [3] │ [4] │ [5] │ [6] │ [7] │              │
│  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘              │
│     ▲                       ▲                                   │
│     │                       │                                   │
│  m_returnIndex        m_submitIndex                             │
│  (consumer)            (producer)                               │
└────────────┬────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Polling Thread                          │
│  while (polling_running) {                                      │
│    slot = ringBuffer[m_returnIndex]                             │
│    if (cuvidGetDecodeStatus(slot.picture_index) == Success) {   │
│      slot.is_ready = true                                       │
│      slot.frame_ready.notify_one()                              │
│    }                                                            │
│    sleep(100us)                                                 │
│  }                                                              │
└─────────────────────────────────────────────────────────────────┘
```

### Data Structures

#### DecodeSlot Structure

```cpp
struct DecodeSlot {
    // Slot state
    bool in_use;                         // Is this slot allocated?

    // Submitted information (from DecodeToSurface args)
    void* target_surface;                // Destination D3D12 resource
    VavCoreSurfaceType surface_type;     // Surface type

    // NVDEC information (from HandlePictureDisplay callback)
    int picture_index;                   // NVDEC frame index for cuvidMapVideoFrame

    // Synchronization primitives
    std::condition_variable frame_ready; // Signaled when decode complete
    std::mutex slot_mutex;               // Protects this slot's state
    bool is_ready;                       // Decode completed flag
};
```

#### RingBuffer Members

```cpp
class NVDECAV1Decoder {
private:
    static constexpr size_t RING_BUFFER_SIZE = 8;  // Match ulNumOutputSurfaces

    DecodeSlot m_ringBuffer[RING_BUFFER_SIZE];

    // Producer-consumer indices
    std::atomic<size_t> m_submitIndex{0};   // Next slot to allocate (producer)
    std::atomic<size_t> m_returnIndex{0};   // Next slot to return (consumer)

    // Polling thread
    std::thread m_pollingThread;
    std::atomic<bool> m_pollingRunning{false};
};
```

---

## 🔧 Detailed Design

### Component 1: Slot Allocation (Producer)

**Purpose**: Assign RingBuffer slot to each DecodeToSurface call

```cpp
// In DecodeToSurface()

// 1. Allocate next available slot
size_t my_slot_idx = m_submitIndex.fetch_add(1) % RING_BUFFER_SIZE;
DecodeSlot& my_slot = m_ringBuffer[my_slot_idx];

// 2. Check for overflow
{
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

    if (my_slot.in_use) {
        // RingBuffer full! Too many concurrent decodes
        LogError("RingBuffer overflow - max 8 concurrent decodes");
        return false;
    }

    // 3. Initialize slot
    my_slot.in_use = true;
    my_slot.target_surface = target_surface;
    my_slot.surface_type = target_type;
    my_slot.picture_index = -1;  // Set by HandlePictureDisplay
    my_slot.is_ready = false;
}
```

**Atomic Counter Behavior**:
```
Thread 1: m_submitIndex.fetch_add(1) → 0 % 8 = slot[0]
Thread 2: m_submitIndex.fetch_add(1) → 1 % 8 = slot[1]
Thread 3: m_submitIndex.fetch_add(1) → 2 % 8 = slot[2]
...
Thread 9: m_submitIndex.fetch_add(1) → 8 % 8 = slot[0] (wrap around)
```

**Overflow Protection**: If `slot[0].in_use == true` when Thread 9 arrives → error

---

### Component 2: Packet Submission

**Purpose**: Submit packet to NVDEC with slot index tracking

```cpp
// 4. Submit packet to NVDEC parser
CUVIDSOURCEDATAPACKET packet = {};
packet.payload = packet_data;
packet.payload_size = packet_size;
// CUVID_PKT_TIMESTAMP marks the timestamp field as valid; without it the parser ignores it
packet.flags = CUVID_PKT_ENDOFPICTURE | CUVID_PKT_TIMESTAMP;
packet.timestamp = my_slot_idx;  // ✅ Embed slot index in timestamp

CUresult result = cuvidParseVideoData(m_parser, &packet);
if (result != CUDA_SUCCESS) {
    my_slot.in_use = false;  // Release slot on error
    return false;
}
```

**Timestamp Flow**:
```
cuvidParseVideoData(packet, timestamp=2)
    ↓
HandleVideoSequence() (first time only)
    ↓
HandlePictureDecode(timestamp=2)
    ↓
GPU decodes packet...
    ↓
HandlePictureDisplay(timestamp=2, picture_index=5)
    ↓
m_ringBuffer[2].picture_index = 5  // ✅ Slot 2 now linked to picture_index 5
```

---

### Component 3: Polling Thread (Background Status Checker)

**Purpose**: Continuously poll `m_returnIndex` slot for decode completion

```cpp
void NVDECAV1Decoder::PollingThreadFunc() {
    while (m_pollingRunning) {
        // 1. Get current return slot (oldest pending decode)
        // m_returnIndex counts up without wrapping, so reduce it modulo the ring size
        size_t current_return_idx = m_returnIndex.load() % RING_BUFFER_SIZE;
        DecodeSlot& slot = m_ringBuffer[current_return_idx];

        // 2. Check if slot is in use and not yet ready
        if (slot.in_use && !slot.is_ready && slot.picture_index >= 0) {
            // 3. Query NVDEC for decode status
            CUVIDGETDECODESTATUS decodeStatus = {};
            CUresult result = cuvidGetDecodeStatus(m_decoder, slot.picture_index, &decodeStatus);

            if (result == CUDA_SUCCESS) {
                if (decodeStatus.decodeStatus == cuvidDecodeStatus_Success) {
                    // ✅ Decode complete!
                    {
                        std::lock_guard<std::mutex> lock(slot.slot_mutex);
                        slot.is_ready = true;
                    }

                    // Wake up waiting DecodeToSurface thread
                    slot.frame_ready.notify_one();
                    OutputDebugStringA("[Polling] Slot ready\n");
                }
                else if (decodeStatus.decodeStatus == cuvidDecodeStatus_Error) {
                    // Decode error - mark as ready to unblock
                    {
                        std::lock_guard<std::mutex> lock(slot.slot_mutex);
                        slot.is_ready = true;  // Error also counts as "ready"
                    }
                    slot.frame_ready.notify_one();
                    OutputDebugStringA("[Polling] Decode error\n");
                }
                // cuvidDecodeStatus_InProgress → keep polling
            }
        }

        // 4. Sleep to avoid busy-wait
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
}
```

**Key Points**:
- ✅ Only polls `m_returnIndex` slot (not all 8 slots) → efficient
- ✅ Uses `cuvidGetDecodeStatus()` non-blocking query
- ✅ 100us sleep → up to ~10,000 checks/second (low CPU usage)
- ✅ Handles decode errors gracefully

---

### Component 4: Sequential Return Wait

**Purpose**: Enforce FIFO order even when decodes complete out-of-order

```cpp
// In DecodeToSurface() - PHASE 2

// 5. Wait for my turn (sequential return order)
// m_returnIndex counts up without wrapping, so compare modulo the ring size
while (m_returnIndex.load() % RING_BUFFER_SIZE != my_slot_idx) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

// Now I'm first in line!

// 6. Wait for decode to complete
{
    std::unique_lock<std::mutex> lock(my_slot.slot_mutex);

    if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500),
        [&my_slot]() { return my_slot.is_ready; })) {
        // Timeout - decode took too long
        LogError("Decode timeout");
        my_slot.in_use = false;
        m_returnIndex.fetch_add(1);  // Skip this slot to avoid deadlock
        return false;
    }
}
```

**Timeline Example**:
```
Thread 1 (slot 0): Wait for returnIndex==0 ✅ (immediate)
                   Wait for is_ready...

Thread 2 (slot 1): Wait for returnIndex==1 ⏸️ (blocked)

Thread 3 (slot 2): Wait for returnIndex==2 ⏸️ (blocked)

GPU: packet2 completes first @ t=3ms
     → slot[1].is_ready = true
     → Thread 2 still blocked (returnIndex=0)

GPU: packet1 completes @ t=15ms
     → slot[0].is_ready = true
     → Thread 1 wakes up ✅
     → Thread 1 processes → returnIndex = 1
     → Thread 2 now unblocked ✅
```

---

### Component 5: Frame Retrieval & Cleanup

**Purpose**: Map decoded frame, copy to surface, release slot

```cpp
// In DecodeToSurface() - PHASE 3

int frameIdx = my_slot.picture_index;

// 7. Map decoded frame from NVDEC
CUVIDPROCPARAMS procParams = {};
procParams.progressive_frame = 1;

CUdeviceptr srcDevicePtr = 0;
unsigned int srcPitch = 0;

CUresult result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
if (result != CUDA_SUCCESS) {
    my_slot.in_use = false;
    m_returnIndex.fetch_add(1);
    return false;
}

// 8. Copy to D3D12 surface
ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
bool copySuccess = m_d3d12Handler->CopyNV12Frame(
    srcDevicePtr, srcPitch,
    d3d12Resource,
    m_width, m_height
);

// 9. Unmap frame
cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

// 10. Release slot
{
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
    my_slot.in_use = false;
}

// 11. Advance return index
m_returnIndex.fetch_add(1);

return copySuccess;
```

---

### Component 6: NVDEC Callback Integration

**Purpose**: Link NVDEC picture_index to RingBuffer slot

```cpp
int CUDAAPI NVDECAV1Decoder::HandlePictureDisplay(void* user_data, CUVIDPARSERDISPINFO* disp_info) {
    auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

    // Extract slot index from timestamp
    size_t slot_idx = static_cast<size_t>(disp_info->timestamp) % RING_BUFFER_SIZE;
    DecodeSlot& slot = decoder->m_ringBuffer[slot_idx];

    // Store NVDEC picture index
    {
        std::lock_guard<std::mutex> lock(slot.slot_mutex);
        slot.picture_index = disp_info->picture_index;
    }

    // Polling thread will check cuvidGetDecodeStatus() for this picture_index

    return 1;
}
```

---

## 📝 Implementation Plan

### Phase 1: Data Structure Setup ✅

**Files to Modify**:
- `NVDECAV1Decoder.h` - Add RingBuffer members
- `NVDECAV1Decoder.cpp` - Initialize RingBuffer in constructor

**Tasks**:
- [x] Define `DecodeSlot` structure
- [x] Add `m_ringBuffer[8]` array
- [x] Add `m_submitIndex`, `m_returnIndex` atomic counters
- [x] Add `m_pollingThread`, `m_pollingRunning` members

**Estimated Time**: 30 minutes

---

### Phase 2: Polling Thread Implementation

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Implement `PollingThreadFunc()`

**Tasks**:
- [ ] Implement polling loop with `cuvidGetDecodeStatus()`
- [ ] Add thread start in `Initialize()`
- [ ] Add thread stop in `Cleanup()`
- [ ] Add debug logging for polling events

**Testing**:
- Verify thread starts/stops correctly
- Verify `cuvidGetDecodeStatus()` calls work

**Estimated Time**: 1 hour

---

### Phase 3: DecodeToSurface Refactoring

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Rewrite `DecodeToSurface()`

**Tasks**:
- [ ] Phase 1: Slot allocation logic
- [ ] Phase 2: Sequential return wait logic
- [ ] Phase 3: Frame retrieval & cleanup logic
- [ ] Error handling for all failure paths

**Testing**:
- Single-threaded decode test
- Multi-threaded decode test (2-3 threads)
- Verify packet-surface mapping correctness

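The multi-threaded and mapping checks listed above can be prototyped without a GPU. The following is a minimal sketch under stated assumptions, not the project's test code: `MockRingDecoder` and `RunMappingTest` are hypothetical names, a `sleep_for` stands in for NVDEC decode time, an `int` stands in for the D3D12 surface, and a condition variable replaces the 1ms sleep loop from Component 4. Overflow handling is omitted, so the number of concurrent calls must stay at or below the ring size.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical mock: models only the slot bookkeeping, no NVDEC or D3D12 calls.
class MockRingDecoder {
public:
    static constexpr std::size_t kRingSize = 8;

    // Mimics DecodeToSurface(): claim a slot, "decode", then return in ticket order.
    void DecodeToSurface(int packet_id, int* target_surface) {
        assert(target_surface != nullptr);
        const std::size_t ticket = submit_index_.fetch_add(1);
        const std::size_t slot_idx = ticket % kRingSize;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            surfaces_[slot_idx] = target_surface;  // remember which surface owns this slot
        }
        // Mock GPU work: earlier slots take longer, so completion is out of order.
        const auto delay_ms = static_cast<long long>(3 * (kRingSize - slot_idx));
        std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));

        // Sequential return: wait until every earlier ticket has been returned.
        std::unique_lock<std::mutex> lock(mutex_);
        turn_cv_.wait(lock, [this, ticket] { return return_index_ == ticket; });
        *surfaces_[slot_idx] = packet_id;  // stands in for the NV12 copy
        return_order_.push_back(ticket);
        ++return_index_;
        lock.unlock();
        turn_cv_.notify_all();
    }

    std::vector<std::size_t> ReturnOrder() {
        std::lock_guard<std::mutex> lock(mutex_);
        return return_order_;
    }

private:
    std::atomic<std::size_t> submit_index_{0};
    std::size_t return_index_ = 0;            // guarded by mutex_
    int* surfaces_[kRingSize] = {};           // guarded by mutex_
    std::vector<std::size_t> return_order_;   // guarded by mutex_
    std::mutex mutex_;
    std::condition_variable turn_cv_;
};

// Runs thread_count concurrent calls (keep it <= kRingSize). Returns true when
// tickets were returned in submission order and each surface got its own packet.
bool RunMappingTest(std::size_t thread_count) {
    MockRingDecoder decoder;
    std::vector<int> surfaces(thread_count, -1);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < thread_count; ++i) {
        workers.emplace_back([&decoder, &surfaces, i] {
            decoder.DecodeToSurface(static_cast<int>(i), &surfaces[i]);
        });
    }
    for (auto& worker : workers) worker.join();

    const std::vector<std::size_t> order = decoder.ReturnOrder();
    if (order.size() != thread_count) return false;
    for (std::size_t i = 0; i < thread_count; ++i) {
        if (order[i] != i) return false;                     // FIFO return order
        if (surfaces[i] != static_cast<int>(i)) return false; // packet-surface mapping
    }
    return true;
}
```

Calling `RunMappingTest(3)` from a test runner exercises the 2-3 thread case listed above. Because the mock makes the first ticket the slowest, the later tickets finish "decoding" first and must still wait for their turn, which is the out-of-order completion case the real decoder has to handle.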
**Estimated Time**: 2 hours

---

### Phase 4: HandlePictureDisplay Update

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Modify `HandlePictureDisplay()`

**Tasks**:
- [ ] Extract slot_idx from timestamp
- [ ] Store picture_index in correct slot
- [ ] Add debug logging

**Testing**:
- Verify timestamp → slot_idx mapping
- Verify picture_index stored correctly

**Estimated Time**: 30 minutes

---

### Phase 5: Integration Testing

**Test Scenarios**:
1. **Single packet decode** - Verify basic functionality
2. **Sequential 3 packets** - Verify FIFO order
3. **Out-of-order completion** - Verify correct mapping (I-frame after P-frame)
4. **RingBuffer overflow** - Verify error handling (9+ concurrent calls)
5. **Decode errors** - Verify graceful failure
6. **Performance benchmark** - Measure latency reduction

**Test Files**:
- Simple test video (simple_test.webm)
- Complex GOP structure video (test_720p_stripe.webm)

**Estimated Time**: 2 hours

---

### Phase 6: Documentation & Cleanup

**Tasks**:
- [ ] Update NVDEC design documentation
- [ ] Add inline code comments
- [ ] Remove old queue-based code
- [ ] Move design doc to `docs/completed/`

**Estimated Time**: 1 hour

---

## 📊 Performance Analysis

### Expected Improvements

#### Scenario 1: Sequential 3-Frame Decode

**Current (Sequential)**:
```
Frame 1: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Frame 2: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Frame 3: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms

Total: 45ms
```

**With RingBuffer (Parallel)**:
```
Submit all 3: [S1][S2][S3] = 3ms
GPU decode:   [─── 10ms ───] (parallel)

Sequential return:
  Frame 1: [Wait][Map][Copy][Unmap] = 5ms
  Frame 2: [Map][Copy][Unmap] = 4ms (already decoded)
  Frame 3: [Map][Copy][Unmap] = 4ms (already decoded)

Total: 3ms + 10ms + 13ms = 26ms

Improvement: 45ms → 26ms (42% faster)
```

---

#### Scenario 2: Complex I/P Frame Mix

**Video**: I-frame (15ms), P-frame (3ms), P-frame (3ms)

**Current**:
```
I: 15ms
P: 3ms
P: 3ms
Total: 21ms
```

**With RingBuffer**:
```
Submit: 3ms
GPU:    [I:15ms, P:3ms, P:3ms overlap] = 15ms (limited by I-frame)
Return: 9ms (sequential)
Total:  27ms (slower due to overhead)

But if I-frame completes first: 21ms (same)
```

**Note**: Benefit depends on GOP structure. Simple videos may see less improvement.

---

### CPU & GPU Utilization

**Current**:
```
CPU: [Active──────][Idle────────][Active──────][Idle────────]
GPU:               [Decode──────]              [Decode──────]
```

**With RingBuffer**:
```
CPU: [Active][Active][Active][─────Wait─────][Active][Active]
GPU:         [Decode1][Decode2][Decode3]
```

**GPU Utilization**: ⬆️ Increased by ~30%

**CPU Utilization**: ➡️ Similar (waiting in different places)

---

## 🛡️ Risk Assessment

### Risk 1: RingBuffer Overflow

**Scenario**: Application calls `DecodeToSurface()` 9+ times before first frame completes

**Mitigation**:
```cpp
if (my_slot.in_use) {
    LogError("RingBuffer overflow");
    return false;  // Fail fast
}
```

**Alternative**: Block until slot available
```cpp
while (my_slot.in_use) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
```

**Recommendation**: Fail fast + document max concurrent limit (8 frames)

---

### Risk 2: Decode Timeout Deadlock

**Scenario**: GPU hangs, polling thread never sets `is_ready`

**Mitigation**:
```cpp
if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500), ...)) {
    // Timeout - skip this slot
    my_slot.in_use = false;
    m_returnIndex.fetch_add(1);  // Advance to prevent deadlock
    return false;
}
```

**Risk**: Skipping slot may cause next slot to fail (reference frame missing)

**Recommendation**: Return error, let application decide (retry/abort)

---

### Risk 3: Polling Thread CPU Usage

**100us sleep → ~10,000 checks/second**

**CPU Impact**:
```
Per check: ~1us (cuvidGetDecodeStatus overhead)
Total: 10,000 * 1us = 10ms/sec = 1% CPU
```

**Mitigation**: Acceptable overhead for real-time video decoding

**Alternative**: Adaptive polling (faster when frames pending, slower when idle)

---

### Risk 4: Thread Safety - CUDA Context

**Issue**: `cuvidGetDecodeStatus()` requires correct CUDA context

**Current Protection**:
```cpp
std::lock_guard<std::mutex> contextLock(m_cudaContextMutex);
cuCtxSetCurrent(m_cuContext);
```

**In Polling Thread**:
```cpp
void PollingThreadFunc() {
    // Set CUDA context once at thread start
    cuCtxSetCurrent(m_cuContext);

    while (m_pollingRunning) {
        // cuvidGetDecodeStatus() now safe
        cuvidGetDecodeStatus(...);
    }
}
```

**Recommendation**: Set context once per thread (not per call)

---

## 📚 References

### NVIDIA Documentation

1. **NVDEC Video Decoder API Programming Guide**
   - URL: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/nvdec-video-decoder-api-prog-guide/
   - Section: "cuvidGetDecodeStatus" API reference

2. **Video Codec SDK 12.2**
   - URL: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/
   - Section: "NVDECODE API" - asynchronous decoding patterns

3. **CUDA Programming Guide**
   - URL: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
   - Section: "Multi-threading" - CUDA context management

### Previous Design Documents

1. **NVDEC Deferred Decoder Recreation Design** (Abandoned)
   - Path: `docs/completed/windows/nvdec/NVDEC_Deferred_Decoder_Recreation_Design.md`
   - Reason for abandonment: Incompatible with NVDEC synchronous callback architecture
   - Lessons learned: NVDEC parser requires immediate decoder availability

2. **VavCore Library Design**
   - Path: `VavCore_Library_Design.md`
   - Relevant sections: C API design, cross-platform surface types

### Code References

1. **Current DecodeToSurface Implementation**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 976-1200
   - Current approach: Synchronous wait with condition variable

2. **HandlePictureDisplay Callback**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 791-813
   - Current approach: Push picture_index to queue, notify condition variable

3. **NVDEC Initialization**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 600-650
   - Key setting: `m_createInfo.ulNumOutputSurfaces = 8`

---

## 🔄 Change Log

| Date       | Author | Change Description                    |
|------------|--------|---------------------------------------|
| 2025-10-04 | Claude | Initial design document created       |
| 2025-10-04 | Claude | Added RingBuffer architecture details |
| 2025-10-04 | Claude | Added polling thread design           |
| 2025-10-04 | Claude | Added performance analysis            |

---

## ✅ Approval & Sign-off

**Design Review**:
- [ ] Architecture approved
- [ ] Performance estimates validated
- [ ] Risk mitigation strategies accepted

**Implementation Ready**:
- [ ] All design questions resolved
- [ ] Implementation plan reviewed
- [ ] Test plan defined

---

**Next Steps**: Proceed to Phase 1 - Data Structure Setup