# NVDEC Self-Queue Management Design

**Date**: 2025-10-11
**Author**: Claude
**Purpose**: Alternative architecture to Round-Robin synchronization for 4-player simultaneous AV1 playback

---

## 1. Executive Summary

### Current Problem

Round-Robin synchronization causes sequential blocking, resulting in:
- Player#3 DECODE: 53.6ms (waiting for Player#2)
- Player#3 QUEUE_DELAY: 58.0ms (UI thread callback delayed)
- Video jitter/stuttering despite Triple Buffering, VSync removal, and rendering separation

### Proposed Solution

Remove Round-Robin and rely on NVDEC's internal queue management:
- All players submit decode requests independently (no blocking)
- NVDEC manages concurrent requests using its internal FIFO queue
- Expected DECODE time: 2-5ms (actual hardware decode time)
- Expected QUEUE_DELAY: <10ms (consistent with Player#2 performance)

---

## 2. Current Architecture Analysis

### 2.1 Round-Robin Synchronization Flow

```
Timeline for 4 players (P1, P2, P3, P4):

Frame N:
  0ms    - P1 WaitForMyTurnInBuffering() → ACQUIRED (turn=1)
  0ms    - P1 vavcore_decode_to_surface() → 2.5ms
  2.5ms  - P1 SignalNextPlayer() → turn=2

  2.5ms  - P2 WaitForMyTurnInBuffering() → ACQUIRED (turn=2)
  2.5ms  - P2 vavcore_decode_to_surface() → 2.7ms
  5.2ms  - P2 SignalNextPlayer() → turn=3

  5.2ms  - P3 WaitForMyTurnInBuffering() → ACQUIRED (turn=3)
  5.2ms  - P3 vavcore_decode_to_surface() → 3.1ms
  8.3ms  - P3 SignalNextPlayer() → turn=4

  8.3ms  - P4 WaitForMyTurnInBuffering() → ACQUIRED (turn=4)
  8.3ms  - P4 vavcore_decode_to_surface() → 2.8ms
  11.1ms - P4 SignalNextPlayer() → turn=1 (wrap around)

Total latency for P4: 11.1ms (accumulated waiting + decode)
```

**Problem**: Player#3 waits 5.2ms before even starting decode, causing cascading delays.
### 2.2 Code Locations

**FrameProcessor.cpp (lines 98-107):**

```cpp
// Round-Robin coordination for ALL frames to prevent NVDEC queue saturation
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());

// Wait for my turn in round-robin (blocking)
GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());
```

**GlobalFrameScheduler.cpp:**

```cpp
void GlobalFrameScheduler::WaitForMyTurnInBuffering(int playerId) {
    std::unique_lock<std::mutex> lock(m_bufferingMutex);
    // Block until the shared turn counter matches this player
    m_bufferingCV.wait(lock, [this, playerId]() {
        return m_currentPlayerTurn == playerId;
    });
}

void GlobalFrameScheduler::SignalNextPlayer(int playerId) {
    std::unique_lock<std::mutex> lock(m_bufferingMutex);
    // Player IDs are 1-based, so wrap from m_totalPlayers back to 1
    m_currentPlayerTurn = (playerId % m_totalPlayers) + 1;
    m_bufferingCV.notify_all();
}
```

---

## 3. NVDEC Queue Characteristics

### 3.1 NVIDIA NVDEC Architecture

**Hardware Queue Management:**
- NVDEC has an internal FIFO queue for decode requests
- Queue depth: Typically 8-16 frames (hardware dependent)
- Concurrent submission: NVDEC accepts multiple `cuvidDecodePicture()` calls
- Internal scheduling: NVDEC processes requests in submission order
- Back-pressure: When the queue is full, `cuvidDecodePicture()` blocks until a slot is available

**CUDA DPB (Decoded Picture Buffer):**
- VavCore uses CUDA memory for DPB (for B-frame reordering)
- DPB size: 16 frames (VAVCORE_NVDEC_INITIAL_BUFFERING)
- Separate from NVDEC hardware queue
- Each player has its own DPB in CUDA memory

### 3.2 VavCore Internal Queue

**VavCore (vavcore_decode_to_surface):**

```c
// Simplified VavCore decode flow (pseudocode; error handling omitted)
VavCoreResult vavcore_decode_to_surface(
    VavCorePlayer* player,
    VavCoreSurfaceType surface_type,
    void* surface,
    VavCoreVideoFrame* frame
) {
    // 1. Read packet from demuxer
    AVPacket* pkt = av_packet_alloc();
    av_read_frame(player->format_ctx, pkt);

    // 2. Submit to NVDEC (CUDA)
    CUVIDPICPARAMS pic_params = {...};
    cuvidDecodePicture(player->cuda_decoder, &pic_params);  // ← NVDEC queue submission

    // 3. Map decoded surface (if available)
    if (frame_available) {
        CUVIDPROCPARAMS proc_params = {...};
        cuvidMapVideoFrame(player->cuda_decoder, ...);  // ← Get decoded frame

        // 4. Copy to D3D12 surface (if provided)
        if (surface) {
            cudaMemcpy2DToArray(...);  // ← CUDA → D3D12
        }
        return VAVCORE_SUCCESS;
    } else {
        return VAVCORE_PACKET_ACCEPTED;  // Buffering in CUDA DPB
    }
}
```

**Key Insight**: `cuvidDecodePicture()` is thread-safe and manages its own queue internally.

---

## 4. Proposed Architecture: NVDEC Self-Queue Management

### 4.1 Core Concept

**Remove Round-Robin entirely and rely on NVDEC's internal queue:**

```
Timeline for 4 players (P1, P2, P3, P4) - NO Round-Robin:

Frame N:
  0.0ms - P1, P2, P3, P4 all call vavcore_decode_to_surface() simultaneously

NVDEC Queue (internal):
  ┌─────────────────────────────────────┐
  │ [P1] [P2] [P3] [P4]                 │ ← FIFO queue
  │   ↓    ↓    ↓    ↓                  │
  │ Decode requests processed in order  │
  └─────────────────────────────────────┘

  0.0ms - P1 decode → 2.5ms (NVDEC processing)
  2.5ms - P2 decode → 2.7ms (NVDEC processing)
  5.2ms - P3 decode → 3.1ms (NVDEC processing)
  8.3ms - P4 decode → 2.8ms (NVDEC processing)

BUT: Each player's thread doesn't BLOCK on NVDEC completion
  - cuvidDecodePicture() returns immediately (async submission)
  - cuvidMapVideoFrame() waits only for its OWN frame
```

### 4.2 Implementation Changes

#### 4.2.1 Remove Round-Robin Synchronization

**FrameProcessor.cpp (lines 98-107) - DELETE:**

```cpp
// DELETE THIS ENTIRE BLOCK:
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());

GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());
```

**FrameProcessor.cpp (lines 209, 244) - DELETE:**

```cpp
// DELETE THIS:
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
```

#### 4.2.2 Keep Initial Buffering Synchronization

**IMPORTANT**: Keep the synchronization barrier for initial 16-frame buffering completion:

**FrameProcessor.cpp (lines 215-220) - KEEP:**

```cpp
// Synchronization barrier: Wait for all players to complete INITIAL_BUFFERING
// This ensures all players start TRIPLE_FILLING phase simultaneously
if (m_framesDecoded == VAVCORE_NVDEC_INITIAL_BUFFERING) {
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING completed - signaling and waiting for all players",
              m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().SignalPlayerBuffered(m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().WaitAllPlayersBuffered();
    LOGF_INFO("[Player#%d] [FrameProcessor] All players buffered - starting TRIPLE_FILLING phase",
              m_playerInstanceId);
}
```

**Rationale**: This barrier ensures all players have filled their CUDA DPB before starting triple buffer filling. This is a ONE-TIME synchronization (not per-frame).
#### 4.2.3 Updated Decode Flow

**New flow (without Round-Robin):**

```cpp
// Phase 1: Initial NVDEC DPB buffering (frames 0-15)
if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING) {
    result = vavcore_decode_to_surface(
        player,
        VAVCORE_SURFACE_D3D12_RESOURCE,
        nullptr,  // NULL surface during initial buffering
        &vavFrame
    );
    // Expected: VAVCORE_PACKET_ACCEPTED for first 16 frames
    // No per-frame synchronization - each player proceeds independently
}
// Phase 2: Triple buffer filling (frames 16-18)
else if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING + VAV2PLAYER_TRIPLE_BUFFER_SIZE) {
    auto backend = m_renderer->GetRGBASurfaceBackend();
    ID3D12Resource* decodeTexture = backend->GetNextDecodeTexture();

    result = vavcore_decode_to_surface(
        player,
        VAVCORE_SURFACE_D3D12_RESOURCE,
        decodeTexture,  // Valid D3D12 texture
        &vavFrame
    );

    if (result == VAVCORE_SUCCESS) {
        backend->AdvanceDecodeOnly();
    }
}
// Phase 3: Normal operation (frame 19+)
else {
    auto backend = m_renderer->GetRGBASurfaceBackend();
    ID3D12Resource* decodeTexture = backend->GetNextDecodeTexture();

    result = vavcore_decode_to_surface(
        player,
        VAVCORE_SURFACE_D3D12_RESOURCE,
        decodeTexture,
        &vavFrame
    );

    if (result == VAVCORE_SUCCESS) {
        backend->AdvanceFrame();
    }
}
```

---

## 5. Synchronization Strategy

### 5.1 What Replaces Round-Robin?

**Answer: Nothing (for per-frame synchronization).**

NVDEC's internal queue provides natural serialization:
- Thread-safe submission via `cuvidDecodePicture()`
- FIFO ordering prevents starvation
- Back-pressure when the queue is full (blocking at hardware level)

### 5.2 Remaining Synchronization Points

**ONE-TIME synchronization (kept):**

1. **Initial buffering completion** (lines 215-220):
   - Ensures all players complete 16-frame CUDA DPB buffering
   - Prevents a race condition where Player#1 starts rendering while Player#4 is still buffering
   - ONE-TIME event (happens once per playback session)

2. **Triple buffer filling completion** (implicit):
   - Each player fills textures [0, 1, 2] before starting normal playback
   - No explicit synchronization needed (local state machine per player)

**NO per-frame synchronization** (removed):
- Round-Robin turn-based waiting (DELETED)
- Per-frame SignalNextPlayer (DELETED)

---

## 6. Risk Analysis

### 6.1 Potential Issues

#### Risk 1: NVDEC Queue Saturation

**Description**: If all 4 players submit frames simultaneously, the NVDEC queue might fill up.

**Analysis**:
- NVDEC queue depth: 8-16 frames (hardware dependent)
- Submission rate: 4 players × 30fps = 120 submissions/sec
- Processing rate: NVDEC can handle 4K@60fps = 240 frames/sec (RTX 3080)
- **Conclusion**: Processing rate (240fps) > Submission rate (120fps) → No saturation

**Mitigation**: None needed (hardware has sufficient throughput).

#### Risk 2: Frame Reordering Across Players

**Description**: NVDEC might interleave frames from different players.

**Analysis**:
- Each player has a separate `CUvideodecoder` instance
- Each decoder has its own CUDA DPB (16 frames)
- NVDEC processes each decoder's queue independently
- **Conclusion**: No cross-player interference (separate decoder instances)

**Mitigation**: None needed (architecture prevents this).

#### Risk 3: Unbalanced Decode Timing

**Description**: One player might consistently decode faster, causing timing drift.

**Analysis**:
- Timing controlled by PlaybackController (PlaybackTimingThread)
- ProcessNextFrame() called at regular intervals (33.33ms for 30fps)
- Even if decode is fast (2ms), the player waits for the next timing tick
- **Conclusion**: Timing thread prevents drift (independent of decode speed)

**Mitigation**: None needed (PlaybackController already handles this).

#### Risk 4: CUDA Context Switching Overhead

**Description**: Frequent context switches between 4 players might add latency.
**Analysis**:
- Each player uses the same CUDA context (shared across application)
- VavCore creates one context per GPU device (not per player)
- Context switch cost: ~10-50μs (microseconds)
- **Conclusion**: Overhead negligible (<0.05ms per frame)

**Mitigation**: None needed (minimal overhead).

---

## 7. Performance Impact Analysis

### 7.1 Expected Improvements

#### Current Performance (with Round-Robin):

```
Player#1: DECODE: 2.5ms  (no wait)    + QUEUE_DELAY: 8ms    = 10.5ms
Player#2: DECODE: 2.7ms  (wait 2.5ms) + QUEUE_DELAY: 9.9ms  = 12.6ms
Player#3: DECODE: 53.6ms (wait 50ms)  + QUEUE_DELAY: 58.0ms = 111.6ms ❌
Player#4: DECODE: 2.8ms  (wait 8ms)   + QUEUE_DELAY: 12ms   = 14.8ms
```

#### Expected Performance (without Round-Robin):

```
Player#1: DECODE: 2.5ms (no wait) + QUEUE_DELAY: 5ms = 7.5ms ✅
Player#2: DECODE: 2.7ms (no wait) + QUEUE_DELAY: 5ms = 7.7ms ✅
Player#3: DECODE: 3.1ms (no wait) + QUEUE_DELAY: 6ms = 9.1ms ✅ (58ms → 6ms!)
Player#4: DECODE: 2.8ms (no wait) + QUEUE_DELAY: 5ms = 7.8ms ✅
```

**Key Improvement**: Player#3 QUEUE_DELAY drops from 58ms to ~6ms (90% reduction).

### 7.2 Worst-Case Scenario

**If the NVDEC queue becomes full** (unlikely, see Risk 1):
- `cuvidDecodePicture()` blocks until a queue slot is available
- Estimated blocking time: ~10-20ms (a conservative estimate, given that a single frame decodes in 2-5ms)
- Still better than Round-Robin (50ms wait)

**Comparison**:
- Round-Robin worst-case: 50ms wait (sequential blocking)
- NVDEC queue worst-case: 20ms wait (hardware back-pressure)
- **Conclusion**: Even the worst case is 2.5× faster

---

## 8. Implementation Plan

### 8.1 Code Changes

**File: FrameProcessor.cpp**

**Change 1: Remove Round-Robin waiting (lines 98-107)**

```cpp
// DELETE:
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());
GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());
```

**Change 2: Remove Round-Robin signaling (line 209)**

```cpp
// DELETE:
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player",
          m_playerInstanceId, m_framesDecoded.load());
```

**Change 3: Remove Round-Robin signaling (line 244)**

```cpp
// DELETE:
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player (SUCCESS path)",
          m_playerInstanceId, m_framesDecoded.load());
```

**KEEP: Initial buffering synchronization (lines 215-220)**

```cpp
// KEEP THIS:
if (m_framesDecoded == VAVCORE_NVDEC_INITIAL_BUFFERING) {
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING completed - signaling and waiting for all players",
              m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().SignalPlayerBuffered(m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().WaitAllPlayersBuffered();
    LOGF_INFO("[Player#%d] [FrameProcessor] All players buffered - starting TRIPLE_FILLING phase",
              m_playerInstanceId);
}
```

### 8.2 Testing Plan

**Test 1: Verify QUEUE_DELAY reduction**
- Run 4-player playback
- Monitor time.log for QUEUE_DELAY values
- Expected: All players <10ms (currently Player#3 is 58ms)

**Test 2: Verify no frame drops**
- Run 10-minute playback session
- Check `m_framesDropped` counter
- Expected: 0 dropped frames

**Test 3: Verify smooth playback**
- Visual inspection for jitter/stuttering
- Expected: Smooth 30fps playback across all 4 players

**Test 4: Verify NVDEC queue stability**
- Monitor DECODE times in logs
- Expected: Consistent 2-5ms (no sudden spikes indicating queue saturation)

### 8.3 Rollback Plan

If NVDEC queue saturation occurs (unlikely):

1. Revert code changes (`git restore FrameProcessor.cpp`)
2. Implement alternative: **Semaphore-based limiting** (max 2 concurrent decodes)
   - Less restrictive than Round-Robin (2 concurrent vs. 1 sequential)
   - Prevents queue saturation while reducing blocking time

---

## 9. Comparison Table

| Aspect | Round-Robin (Current) | NVDEC Self-Queue (Proposed) |
|--------|----------------------|----------------------------|
| **Synchronization** | Per-frame sequential | Hardware-managed FIFO |
| **Blocking Time** | 0-50ms (depends on player position) | 0-20ms (only if queue full) |
| **QUEUE_DELAY** | 8-58ms (varies by player) | 5-10ms (consistent) |
| **Code Complexity** | High (GlobalFrameScheduler) | Low (delete code) |
| **NVDEC Utilization** | Low (sequential submission) | High (parallel submission) |
| **Frame Drop Risk** | High (cascading delays) | Low (independent timing) |

---

## 10. Conclusion

### Recommended Action: **Remove Round-Robin**

**Justification**:

1. **Performance**: 90% reduction in QUEUE_DELAY for Player#3 (58ms → 6ms)
2. **Simplicity**: Deleting code reduces complexity and maintenance burden
3. **Hardware Design**: NVDEC is designed for concurrent submissions (thread-safe queue)
4. **Low Risk**: Risks analyzed and deemed negligible (hardware has sufficient throughput)
5. **Proven Architecture**: Industry-standard approach (FFmpeg, VLC, Chrome all submit concurrently)

**One-Time Synchronization** (kept):
- Initial buffering completion barrier ensures all players start TRIPLE_FILLING phase together
- This is necessary for correct startup behavior (not a performance bottleneck)

**Next Steps**:

1. User reviews this design document
2. If approved, implement code changes (3 deletions in FrameProcessor.cpp)
3. Test and verify QUEUE_DELAY improvement
4. Monitor for any unexpected issues (rollback plan available)

---

**End of Design Document**