# NVDEC Hybrid Synchronization Design

**Date**: 2025-10-11
**Author**: Claude
**Purpose**: Hybrid Round-Robin strategy for 4-player simultaneous AV1 playback (Round-Robin during INITIAL_BUFFERING only)

---

## 1. Executive Summary

### Current Problem

Round-Robin synchronization applied to **ALL frames** causes sequential blocking:
- Player#3 DECODE: 53.6ms (waiting for Player#2)
- Player#3 QUEUE_DELAY: 58.0ms (UI thread callback delayed)
- Video jitter/stuttering despite Triple Buffering, VSync removal, and rendering separation

### Proposed Solution (Hybrid Approach)

**Keep Round-Robin for the INITIAL_BUFFERING phase, remove it for TRIPLE_FILLING and NORMAL_PLAYBACK:**
- **Phase 1 (frames 0-15)**: WITH Round-Robin - Controlled startup to prevent NVDEC queue saturation
- **Phase 2 (frames 16-18)**: NO Round-Robin - Parallel triple buffer filling
- **Phase 3 (frames 19+)**: NO Round-Robin - Full parallel playback with NVDEC self-queue management
- **Expected QUEUE_DELAY**: <10ms (consistent across all players)

### Why Hybrid?

**Why Round-Robin was introduced for initial buffering**:
- If 4 players submit 16 packets each to NVDEC at the same time, the queue may saturate
- It secures stability during the initial startup stage
- Sequential filling of the CUDA DPB gives predictable behavior

**Why it is unnecessary in TRIPLE_FILLING/NORMAL_PLAYBACK**:
- Once initial buffering completes, the NVDEC queue has stabilized
- 30fps timing control means burst submission does not occur
- Each player is controlled independently by its own timing thread

---

## 2. Current Architecture Analysis

### 2.1 Current Round-Robin Applied to ALL Phases

**FrameProcessor.cpp (lines 98-107):**
```cpp
// Round-Robin coordination for ALL frames to prevent NVDEC queue saturation
// This ensures sequential submission to NVDEC across all players throughout entire playback
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());

// Wait for my turn in round-robin (blocking)
GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());
```

**Problem**: This blocking applies to **frames 0-∞**, causing cascading delays in the TRIPLE_FILLING and NORMAL_PLAYBACK phases.

### 2.2 Performance Impact by Phase

**INITIAL_BUFFERING (frames 0-15)**: Round-Robin is **NEEDED**
- Controlled sequential submission prevents startup queue saturation
- Minimal performance impact (happens once, during startup)

**TRIPLE_FILLING (frames 16-18)**: Round-Robin is **HARMFUL**
- Only 3 frames per player, so sequential blocking adds unnecessary latency
- Player#3 waits 5ms+ per frame unnecessarily

**NORMAL_PLAYBACK (frames 19+)**: Round-Robin is **HARMFUL**
- Every frame suffers from sequential blocking (53ms wait for Player#3)
- Cascading QUEUE_DELAY (58ms) causes visible jitter

---

## 3. Proposed Hybrid Architecture

### 3.1 Phase-Based Synchronization Strategy

```
┌─────────────────────────────────────────────────────────────────────┐
│ Phase 1: INITIAL_BUFFERING (frames 0-15)                            │
│   - WITH Round-Robin: Sequential submission                         │
│   - Purpose: Prevent NVDEC queue saturation during startup          │
│   - Performance: One-time cost, minimal impact                      │
└─────────────────────────────────────────────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────────┐
│ Synchronization Barrier: WaitAllPlayersBuffered()                   │
│   - Ensures all players complete INITIAL_BUFFERING simultaneously   │
└─────────────────────────────────────────────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────────┐
│ Phase 2: TRIPLE_FILLING (frames 16-18)                              │
│   - NO Round-Robin: Parallel submission                             │
│   - Purpose: Fill textures [0,1,2] quickly                          │
│   - Performance: Fast parallel filling                              │
└─────────────────────────────────────────────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────────┐
│ Phase 3: NORMAL_PLAYBACK (frames 19+)                               │
│   - NO Round-Robin: Full parallel operation                         │
│   - Purpose: Smooth 30fps playback without blocking                 │
│   - Performance: NVDEC self-queue management                        │
└─────────────────────────────────────────────────────────────────────┘
```

### 3.2 Timeline Comparison

**INITIAL_BUFFERING (frames 0-15): Sequential (KEEP Round-Robin)**
```
Frame 0:
  0ms   - P1 WaitForMyTurnInBuffering() → ACQUIRED
  0ms   - P1 decode → 2.5ms → SignalNextPlayer()
  2.5ms - P2 WaitForMyTurnInBuffering() → ACQUIRED
  2.5ms - P2 decode → 2.7ms → SignalNextPlayer()
  5.2ms - P3 WaitForMyTurnInBuffering() → ACQUIRED
  5.2ms - P3 decode → 3.1ms → SignalNextPlayer()
  8.3ms - P4 WaitForMyTurnInBuffering() → ACQUIRED
  8.3ms - P4 decode → 2.8ms → SignalNextPlayer()
  ↓ (Repeat for frames 1-15)

Frame 15 completed: All players call WaitAllPlayersBuffered()
```

**TRIPLE_FILLING (frames 16-18): Parallel (REMOVE Round-Robin)**
```
Frame 16:
  0.0ms - P1, P2, P3, P4 all decode simultaneously (NO WAITING)

NVDEC Queue (internal):
┌─────────────────────────────────────┐
│ [P1] [P2] [P3] [P4]                 │ ← FIFO queue
│   ↓    ↓    ↓    ↓                  │
│ Processed in submission order       │
└─────────────────────────────────────┘

Expected DECODE times:
  - P1: 2.5ms
  - P2: 2.7ms
  - P3: 3.1ms (NOT 53ms!)
  - P4: 2.8ms
  ↓ (Repeat for frames 17-18)
```

**NORMAL_PLAYBACK (frames 19+): Parallel (REMOVE Round-Robin)**
```
Frame 19:
  0.0ms - P1, P2, P3, P4 all decode simultaneously (NO WAITING)

Each player's timing thread controls submission rate (33.33ms interval)
NVDEC processes requests in FIFO order

Expected performance:
  - DECODE: 2-5ms (all players)
  - QUEUE_DELAY: <10ms (all players)
  - TOTAL: <15ms (all players)
```

---

## 4. Implementation Changes

### 4.1 Conditional Round-Robin Based on Phase

**FrameProcessor.cpp - Modify lines 95-107:**

**BEFORE (Current - Round-Robin for ALL frames):**
```cpp
} else {
    // NVDEC/Hardware: D3D12 surface decoding with CUDA DPB buffering

    // Round-Robin coordination for ALL frames to prevent NVDEC queue saturation
    // This ensures sequential submission to NVDEC across all players throughout entire playback
    LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
              m_playerInstanceId, m_framesDecoded.load());

    // Wait for my turn in round-robin (blocking)
    GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

    LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
              m_playerInstanceId, m_framesDecoded.load());
```

**AFTER (Hybrid - Round-Robin ONLY for INITIAL_BUFFERING):**
```cpp
} else {
    // NVDEC/Hardware: D3D12 surface decoding with CUDA DPB buffering

    // Round-Robin coordination ONLY during INITIAL_BUFFERING phase
    // This prevents NVDEC queue saturation during startup when all players submit 16 packets simultaneously
    // After initial buffering, NVDEC self-manages queue and timing thread controls submission rate
    if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING)
    {
        LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - WAITING for turn",
                  m_playerInstanceId, m_framesDecoded.load());

        GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

        LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - ACQUIRED turn",
                  m_playerInstanceId, m_framesDecoded.load());
    }
    // TRIPLE_FILLING and NORMAL_PLAYBACK: NO Round-Robin (parallel submission)
```

### 4.2 Conditional SignalNextPlayer Based on Phase

**FrameProcessor.cpp - Modify lines 207-211:**

**BEFORE (Current - Signal for ALL frames):**
```cpp
// Round-Robin coordination: Signal next player to proceed
// This applies to ALL frames, not just initial buffering
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player",
          m_playerInstanceId, m_framesDecoded.load());
```

**AFTER (Hybrid - Signal ONLY during INITIAL_BUFFERING):**
```cpp
// Round-Robin coordination: Signal next player ONLY during INITIAL_BUFFERING
// After initial buffering, no per-frame signaling needed (parallel operation)
if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING) {
    GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - SIGNALED next player",
              m_playerInstanceId, m_framesDecoded.load());
}
```

**FrameProcessor.cpp - Modify lines 242-246:**

**BEFORE (Current - Signal for ALL frames):**
```cpp
// Round-Robin coordination: Signal next player to proceed
// This applies to ALL successful decodes (TRIPLE_FILLING and NORMAL_PLAYBACK phases)
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player (SUCCESS path)",
          m_playerInstanceId, m_framesDecoded.load());
```

**AFTER (Hybrid - Signal ONLY during INITIAL_BUFFERING):**
```cpp
// Round-Robin coordination: Signal next player ONLY during INITIAL_BUFFERING
// TRIPLE_FILLING and NORMAL_PLAYBACK: No signaling needed (parallel operation)
if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING) {
    GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - SIGNALED next player (SUCCESS path)",
              m_playerInstanceId, m_framesDecoded.load());
}
```

**Note on boundaries**: The wait check uses `<` while both signal checks use `<=`. The two cover the same 16 frames only if `m_framesDecoded` is incremented between the wait and the signal, which the `== VAVCORE_NVDEC_INITIAL_BUFFERING` barrier in Section 4.3 suggests. Confirm this against `FrameProcessor.cpp`, because an unmatched wait or signal would desynchronize the round-robin turn.

### 4.3 Keep Initial Buffering Barrier (NO CHANGE)

**FrameProcessor.cpp (lines 215-220) - KEEP AS IS:**
```cpp
// Synchronization barrier: Wait for all players to complete INITIAL_BUFFERING
// This ensures all players start TRIPLE_FILLING phase simultaneously
if (m_framesDecoded == VAVCORE_NVDEC_INITIAL_BUFFERING) {
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING completed - signaling and waiting for all players",
              m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().SignalPlayerBuffered(m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().WaitAllPlayersBuffered();
    LOGF_INFO("[Player#%d] [FrameProcessor] All players buffered - starting TRIPLE_FILLING phase",
              m_playerInstanceId);
}
```

**Rationale**: This ONE-TIME barrier ensures a synchronized transition from INITIAL_BUFFERING to TRIPLE_FILLING.

---

## 5. Risk Analysis

### 5.1 INITIAL_BUFFERING Phase Risks

#### Risk: NVDEC Queue Saturation (WITH Round-Robin - MITIGATED)

**Analysis**:
- 4 players × 16 packets = 64 simultaneous submissions (if no Round-Robin)
- NVDEC queue depth: 8-16 frames → **SATURATED**
- Round-Robin ensures sequential submission → **NO SATURATION**

**Mitigation**: Keep Round-Robin for INITIAL_BUFFERING phase ✅

### 5.2 TRIPLE_FILLING Phase Risks

#### Risk: NVDEC Queue Saturation (NO Round-Robin)

**Analysis**:
- 4 players × 3 packets = 12 simultaneous submissions (worst-case)
- NVDEC queue depth: 8-16 frames → 12 fits within a 16-frame queue but exceeds an 8-frame one, so the margin depends on the actual depth
- NVDEC internal FIFO queue is expected to absorb the burst → **NO SATURATION EXPECTED**

**Mitigation**: None planned; confirm during Test 2 (Section 7.2) that no saturation occurs ✅

### 5.3 NORMAL_PLAYBACK Phase Risks

#### Risk: NVDEC Queue Saturation (NO Round-Robin)

**Analysis**:
- Timing thread controls submission rate: 33.33ms interval
- Even if all 4 players submit simultaneously, NVDEC processes at 240fps (RTX 3080)
- Submission rate: 4 × 30fps = 120fps < Processing rate: 240fps
- **Conclusion**: NO SATURATION

**Mitigation**: None needed (timing thread prevents burst) ✅

#### Risk: Frame Reordering Across Players

**Analysis**:
- Each player has a separate `CUvideodecoder` instance
- Each decoder has its own CUDA DPB (16 frames)
- NVDEC processes each decoder's queue independently
- **Conclusion**: No cross-player interference

**Mitigation**: None needed (architecture prevents this) ✅

#### Risk: Unbalanced Decode Timing

**Analysis**:
- Timing is controlled by PlaybackController (PlaybackTimingThread)
- ProcessNextFrame() is called at regular intervals (33.33ms for 30fps)
- Even if decode is fast (2ms), the player waits for the next timing tick
- **Conclusion**: Timing thread prevents drift

**Mitigation**: None needed (PlaybackController already handles this) ✅

---

## 6. Performance Impact Analysis

### 6.1 INITIAL_BUFFERING Phase (frames 0-15)

**Current (with Round-Robin)**: Same as proposed ✅
- Sequential submission
- One-time cost during startup
- Minimal impact on overall playback

**Proposed (with Round-Robin)**: No change
- Keep existing behavior
- Ensure stable startup

### 6.2 TRIPLE_FILLING Phase (frames 16-18)

**Current (with Round-Robin)**:
```
Player#3 Frame 16: DECODE: ~50ms (waiting for P1, P2)
Player#3 Frame 17: DECODE: ~50ms (waiting for P1, P2)
Player#3 Frame 18: DECODE: ~50ms (waiting for P1, P2)
Total delay for P3 during TRIPLE_FILLING: ~150ms
```

**Proposed (without Round-Robin)**:
```
Player#3 Frame 16: DECODE: 3.1ms (no wait)
Player#3 Frame 17: DECODE: 3.1ms (no wait)
Player#3 Frame 18: DECODE: 3.1ms (no wait)
Total delay for P3 during TRIPLE_FILLING: ~9ms
```

**Improvement**: 150ms → 9ms (94% reduction) ✅

### 6.3 NORMAL_PLAYBACK Phase (frames 19+)

**Current (with Round-Robin)**:
```
Player#1: DECODE: 2.5ms (no wait)     + QUEUE_DELAY: 8ms    = 10.5ms
Player#2: DECODE: 2.7ms (wait 2.5ms)  + QUEUE_DELAY: 9.9ms  = 12.6ms
Player#3: DECODE: 53.6ms (wait 50ms)  + QUEUE_DELAY: 58.0ms = 111.6ms ❌
Player#4: DECODE: 2.8ms (wait 8ms)    + QUEUE_DELAY: 12ms   = 14.8ms
```

**Proposed (without Round-Robin)**:
```
Player#1: DECODE: 2.5ms (no wait) + QUEUE_DELAY: 5ms = 7.5ms ✅
Player#2: DECODE: 2.7ms (no wait) + QUEUE_DELAY: 5ms = 7.7ms ✅
Player#3: DECODE: 3.1ms (no wait) + QUEUE_DELAY: 6ms = 9.1ms ✅ (58ms → 6ms!)
Player#4: DECODE: 2.8ms (no wait) + QUEUE_DELAY: 5ms = 7.8ms ✅
```

**Improvement**: Player#3 QUEUE_DELAY drops from 58ms to ~6ms (90% reduction) ✅

---

## 7. Implementation Summary

### 7.1 Code Changes Required

**File**: `FrameProcessor.cpp`

**3 locations to modify** (conditional Round-Robin based on phase):
1. **Lines 95-107**: Add `if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING)` around WaitForMyTurnInBuffering()
2. **Lines 207-211**: Add `if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING)` around SignalNextPlayer()
3. **Lines 242-246**: Add `if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING)` around SignalNextPlayer()

**1 location to keep unchanged**:
4. **Lines 215-220**: Keep WaitAllPlayersBuffered() barrier (ONE-TIME synchronization)

### 7.2 Testing Plan

**Test 1: Verify INITIAL_BUFFERING still works correctly**
- Monitor the first 16 frames for sequential submission
- Verify no NVDEC queue saturation during startup
- Expected: Stable startup, no errors

**Test 2: Verify TRIPLE_FILLING improvement**
- Monitor frames 16-18 for parallel submission
- Verify DECODE times are low (<5ms)
- Expected: No blocking delays

**Test 3: Verify NORMAL_PLAYBACK improvement**
- Monitor frames 19+ for parallel submission
- Verify QUEUE_DELAY reduction (<10ms)
- Expected: Smooth playback, no jitter

**Test 4: Verify overall stability**
- Run a 10-minute playback session
- Check the `m_framesDropped` counter
- Expected: 0 dropped frames, consistent performance

### 7.3 Rollback Plan

If unexpected issues occur:
1. Revert code changes (`git restore FrameProcessor.cpp`)
2. Return to full Round-Robin synchronization (current behavior)
3. Analyze logs to identify the specific failure point

---

## 8. Comparison Table

| Aspect | Full Round-Robin (Current) | Hybrid Round-Robin (Proposed) |
|--------|---------------------------|------------------------------|
| **INITIAL_BUFFERING** | Sequential (0-15) | Sequential (0-15) - **SAME** |
| **TRIPLE_FILLING** | Sequential (16-18) ❌ | Parallel (16-18) ✅ |
| **NORMAL_PLAYBACK** | Sequential (19+) ❌ | Parallel (19+) ✅ |
| **P3 DECODE Time** | 53.6ms ❌ | 3.1ms ✅ |
| **P3 QUEUE_DELAY** | 58.0ms ❌ | 6.0ms ✅ |
| **Startup Stability** | High ✅ | High ✅ |
| **Playback Performance** | Poor ❌ | Excellent ✅ |
| **Code Complexity** | Medium | Medium (same) |
| **Risk Level** | Low (proven) | Low (INITIAL_BUFFERING protected) |

---

## 9. Conclusion

### Recommended Action: **Implement Hybrid Round-Robin**

**Justification**:

1. **Best of Both Worlds**:
   - Keep Round-Robin for INITIAL_BUFFERING → Stable startup
   - Remove Round-Robin for TRIPLE_FILLING/NORMAL_PLAYBACK → Smooth playback

2. **Performance Gains**:
   - 90% reduction in QUEUE_DELAY for Player#3 (58ms → 6ms)
   - 94% reduction in TRIPLE_FILLING delay (150ms → 9ms)

3. **Risk Mitigation**:
   - INITIAL_BUFFERING: Protected by Round-Robin (no queue saturation risk)
   - TRIPLE_FILLING: NVDEC queue depth expected to be sufficient for 12 packets
   - NORMAL_PLAYBACK: Timing thread prevents burst submissions

4. **Minimal Code Changes**:
   - 3 conditional checks added
   - No architectural changes
   - Easy to roll back if needed

5. **User Request Alignment**:
   - Directly addresses the user's concern about initial buffering overload
   - Maintains safety where needed, removes the bottleneck where harmful

**Next Steps**:
1. User reviews and approves this hybrid design
2. Implement the 3 conditional checks in `FrameProcessor.cpp`
3. Test with 4-player playback
4. Monitor `time.log` for QUEUE_DELAY improvement
5. Verify smooth playback and no frame drops
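
---

The phase boundaries from Section 3 and the burst arithmetic from Section 5 can be sketched as a few pure functions. This is a minimal illustration, not the actual `FrameProcessor` code: `DecodePhase`, `PhaseForFrame`, `UsesRoundRobin`, `WorstCaseBurst`, `BurstFitsQueue`, and the two constants are hypothetical names, and the frame counts (16 and 3) are taken from the frame ranges quoted in this document rather than from the real `VAVCORE_NVDEC_INITIAL_BUFFERING` definition.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants mirroring the frame ranges used in this document.
const std::uint64_t kInitialBufferingFrames = 16;  // frames 0-15
const std::uint64_t kTripleFillingFrames = 3;      // frames 16-18

enum class DecodePhase { InitialBuffering, TripleFilling, NormalPlayback };

// Classify a zero-based frame index into one of the three phases.
DecodePhase PhaseForFrame(std::uint64_t frameIndex) {
    if (frameIndex < kInitialBufferingFrames) {
        return DecodePhase::InitialBuffering;
    }
    if (frameIndex < kInitialBufferingFrames + kTripleFillingFrames) {
        return DecodePhase::TripleFilling;
    }
    return DecodePhase::NormalPlayback;
}

// Under the hybrid design, Round-Robin applies only during INITIAL_BUFFERING.
bool UsesRoundRobin(std::uint64_t frameIndex) {
    return PhaseForFrame(frameIndex) == DecodePhase::InitialBuffering;
}

// Worst case: every player submits all of its packets at the same instant.
std::uint64_t WorstCaseBurst(std::uint64_t players, std::uint64_t packetsPerPlayer) {
    return players * packetsPerPlayer;
}

// True when the worst-case burst fits in a queue of the given depth.
bool BurstFitsQueue(std::uint64_t players, std::uint64_t packetsPerPlayer,
                    std::uint64_t queueDepth) {
    return WorstCaseBurst(players, packetsPerPlayer) <= queueDepth;
}
```

With these definitions, `UsesRoundRobin(15)` is true and `UsesRoundRobin(16)` is false, matching the Phase 1 to Phase 2 transition. `BurstFitsQueue(4, 3, 16)` is true while `BurstFitsQueue(4, 3, 8)` is false, so the TRIPLE_FILLING margin depends on where the real queue depth falls within the 8-16 range.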