NVDEC Hybrid Synchronization Design
Date: 2025-10-11
Author: Claude
Purpose: Hybrid Round-Robin strategy for 4-player simultaneous AV1 playback (Round-Robin applied to INITIAL_BUFFERING only)
1. Executive Summary
Current Problem
Round-Robin synchronization applied to ALL frames causes sequential blocking:
- Player#3 DECODE: 53.6ms (waiting for Player#2)
- Player#3 QUEUE_DELAY: 58.0ms (UI thread callback delayed)
- Video jitter/stuttering despite Triple Buffering, VSync removal, and rendering separation
Proposed Solution (Hybrid Approach)
Keep Round-Robin for INITIAL_BUFFERING phase, remove for TRIPLE_FILLING and NORMAL_PLAYBACK:
- Phase 1 (frames 0-15): WITH Round-Robin - Controlled startup to prevent NVDEC queue saturation
- Phase 2 (frames 16-18): NO Round-Robin - Parallel triple buffer filling
- Phase 3 (frames 19+): NO Round-Robin - Full parallel playback with NVDEC self-queue management
- Expected QUEUE_DELAY: <10ms (consistent across all players)
Why Hybrid?
Why Round-Robin was introduced for initial buffering:
- If 4 players each submit 16 packets to NVDEC at the same time, the queue may saturate
- It provides stability during the initial startup stage
- Filling the CUDA DPB sequentially gives predictable behavior
Why it is unnecessary in TRIPLE_FILLING/NORMAL_PLAYBACK:
- Once initial buffering completes, the NVDEC queue has stabilized
- The 30fps timing control prevents burst submission
- Each player is driven independently by its own timing thread
2. Current Architecture Analysis
2.1 Current Round-Robin Applied to ALL Phases
FrameProcessor.cpp (lines 98-107):
// Round-Robin coordination for ALL frames to prevent NVDEC queue saturation
// This ensures sequential submission to NVDEC across all players throughout entire playback
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
m_playerInstanceId, m_framesDecoded.load());
// Wait for my turn in round-robin (blocking)
GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
m_playerInstanceId, m_framesDecoded.load());
Problem: This blocking applies to frames 0-∞, causing cascading delays in TRIPLE_FILLING and NORMAL_PLAYBACK phases.
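The blocking behavior described above can be illustrated with a minimal turn-based scheduler. This is only a sketch under assumptions: `RoundRobinTurns` and its members are hypothetical stand-ins and do not reproduce the real GlobalFrameScheduler, and players are assumed to be numbered 1..N.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a round-robin turn scheduler.
// Not the project's GlobalFrameScheduler; players are numbered 1..playerCount.
class RoundRobinTurns {
public:
    explicit RoundRobinTurns(int playerCount) : m_playerCount(playerCount) {}

    // Blocks the caller until the turn belongs to playerId.
    void WaitForMyTurn(int playerId) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [&] { return m_currentPlayer == playerId; });
    }

    // Hands the turn to the next player, wrapping from the last player to 1.
    void SignalNextPlayer(int playerId) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_currentPlayer = (playerId % m_playerCount) + 1;
        }
        m_cv.notify_all();
    }

    int CurrentPlayer() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_currentPlayer;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_playerCount;
    int m_currentPlayer = 1;
};
```

With this shape, a player whose turn comes late in the cycle cannot start decoding until every earlier player has signaled, which is the per-frame wait that shows up as inflated DECODE time.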
2.2 Performance Impact by Phase
INITIAL_BUFFERING (frames 0-15): Round-Robin is NEEDED
- Controlled sequential submission prevents startup queue saturation
- Minimal performance impact (happens once, during startup)
TRIPLE_FILLING (frames 16-18): Round-Robin is HARMFUL
- Only 3 frames per player, sequential blocking adds unnecessary latency
- Player#3 waits 5ms+ unnecessarily
NORMAL_PLAYBACK (frames 19+): Round-Robin is HARMFUL
- Every frame suffers from sequential blocking (53ms wait for Player#3)
- Cascading QUEUE_DELAY (58ms) causing visible jitter
3. Proposed Hybrid Architecture
3.1 Phase-Based Synchronization Strategy
┌─────────────────────────────────────────────────────────────────────┐
│ Phase 1: INITIAL_BUFFERING (frames 0-15) │
│ - WITH Round-Robin: Sequential submission │
│ - Purpose: Prevent NVDEC queue saturation during startup │
│ - Performance: One-time cost, minimal impact │
└─────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────┐
│ Synchronization Barrier: WaitAllPlayersBuffered() │
│ - Ensures all players complete INITIAL_BUFFERING simultaneously │
└─────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────┐
│ Phase 2: TRIPLE_FILLING (frames 16-18) │
│ - NO Round-Robin: Parallel submission │
│ - Purpose: Fill textures [0,1,2] quickly │
│ - Performance: Fast parallel filling │
└─────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────┐
│ Phase 3: NORMAL_PLAYBACK (frames 19+) │
│ - NO Round-Robin: Full parallel operation │
│ - Purpose: Smooth 30fps playback without blocking │
│ - Performance: NVDEC self-queue management │
└─────────────────────────────────────────────────────────────────────┘
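The phase boundaries in the diagram can be written as a small classifier. This is a sketch: the constant names and the `DecodePhase` enum are illustrative assumptions, with the values 16 and 3 taken from the frame ranges stated in this document.

```cpp
#include <cstdint>

// Assumed values: frames 0-15 are INITIAL_BUFFERING, frames 16-18 are
// TRIPLE_FILLING. These names are illustrative, not the project's headers.
constexpr std::uint64_t kInitialBufferingFrames = 16;
constexpr std::uint64_t kTripleFillingFrames = 3;

enum class DecodePhase { InitialBuffering, TripleFilling, NormalPlayback };

// Maps a zero-based frame index to its phase.
constexpr DecodePhase PhaseForFrame(std::uint64_t frameIndex) {
    if (frameIndex < kInitialBufferingFrames) {
        return DecodePhase::InitialBuffering;
    }
    if (frameIndex < kInitialBufferingFrames + kTripleFillingFrames) {
        return DecodePhase::TripleFilling;
    }
    return DecodePhase::NormalPlayback;
}

// In the hybrid design only the first phase is serialized.
constexpr bool UsesRoundRobin(DecodePhase phase) {
    return phase == DecodePhase::InitialBuffering;
}
```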
3.2 Timeline Comparison
INITIAL_BUFFERING (frames 0-15): Sequential (KEEP Round-Robin)
Frame 0:
0ms - P1 WaitForMyTurnInBuffering() → ACQUIRED
0ms - P1 decode → 2.5ms → SignalNextPlayer()
2.5ms - P2 WaitForMyTurnInBuffering() → ACQUIRED
2.5ms - P2 decode → 2.7ms → SignalNextPlayer()
5.2ms - P3 WaitForMyTurnInBuffering() → ACQUIRED
5.2ms - P3 decode → 3.1ms → SignalNextPlayer()
8.3ms - P4 WaitForMyTurnInBuffering() → ACQUIRED
8.3ms - P4 decode → 2.8ms → SignalNextPlayer()
↓ (Repeat for frames 1-15)
Frame 15 completed: All players call WaitAllPlayersBuffered()
TRIPLE_FILLING (frames 16-18): Parallel (REMOVE Round-Robin)
Frame 16:
0.0ms - P1, P2, P3, P4 all decode simultaneously (NO WAITING)
NVDEC Queue (internal):
┌─────────────────────────────────────┐
│ [P1] [P2] [P3] [P4] │ ← FIFO queue
│ ↓ ↓ ↓ ↓ │
│ Processed in submission order │
└─────────────────────────────────────┘
Expected DECODE times:
- P1: 2.5ms
- P2: 2.7ms
- P3: 3.1ms (NOT 53ms!)
- P4: 2.8ms
↓ (Repeat for frames 17-18)
NORMAL_PLAYBACK (frames 19+): Parallel (REMOVE Round-Robin)
Frame 19:
0.0ms - P1, P2, P3, P4 all decode simultaneously (NO WAITING)
Each player's timing thread controls submission rate (33.33ms interval)
NVDEC processes requests in FIFO order
Expected performance:
- DECODE: 2-5ms (all players)
- QUEUE_DELAY: <10ms (all players)
- TOTAL: <15ms (all players)
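The timeline arithmetic above can be reproduced with a short calculation. This is a sketch with hypothetical helper names; the parallel figure is an idealized bound that assumes no contention inside NVDEC, which the document does not measure.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Start time of each player when decodes are strictly serialized:
// each player starts when the previous one finishes.
std::vector<double> SequentialStartTimesMs(const std::vector<double>& decodeMs) {
    std::vector<double> starts;
    double t = 0.0;
    for (double d : decodeMs) {
        starts.push_back(t);
        t += d;
    }
    return starts;
}

// Time until the last player finishes under serialization (sum of decodes).
double SequentialTotalMs(const std::vector<double>& decodeMs) {
    return std::accumulate(decodeMs.begin(), decodeMs.end(), 0.0);
}

// Idealized time until the last player finishes when all start together
// (longest single decode). Assumes no contention between players.
double IdealParallelTotalMs(const std::vector<double>& decodeMs) {
    return decodeMs.empty() ? 0.0
                            : *std::max_element(decodeMs.begin(), decodeMs.end());
}
```

For the per-player decode times used in the Frame 0 timeline (2.5, 2.7, 3.1, 2.8 ms), the serialized start times are 0, 2.5, 5.2, and 8.3 ms, matching the timeline.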
4. Implementation Changes
4.1 Conditional Round-Robin Based on Phase
FrameProcessor.cpp - Modify lines 95-107:
BEFORE (Current - Round-Robin for ALL frames):
} else {
// NVDEC/Hardware: D3D12 surface decoding with CUDA DPB buffering
// Round-Robin coordination for ALL frames to prevent NVDEC queue saturation
// This ensures sequential submission to NVDEC across all players throughout entire playback
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
m_playerInstanceId, m_framesDecoded.load());
// Wait for my turn in round-robin (blocking)
GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
m_playerInstanceId, m_framesDecoded.load());
AFTER (Hybrid - Round-Robin ONLY for INITIAL_BUFFERING):
} else {
// NVDEC/Hardware: D3D12 surface decoding with CUDA DPB buffering
// Round-Robin coordination ONLY during INITIAL_BUFFERING phase
// This prevents NVDEC queue saturation during startup when all players submit 16 packets simultaneously
// After initial buffering, NVDEC self-manages queue and timing thread controls submission rate
if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING) {
LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - WAITING for turn",
m_playerInstanceId, m_framesDecoded.load());
GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - ACQUIRED turn",
m_playerInstanceId, m_framesDecoded.load());
}
// TRIPLE_FILLING and NORMAL_PLAYBACK: NO Round-Robin (parallel submission)
4.2 Conditional SignalNextPlayer Based on Phase
FrameProcessor.cpp - Modify lines 207-211:
BEFORE (Current - Signal for ALL frames):
// Round-Robin coordination: Signal next player to proceed
// This applies to ALL frames, not just initial buffering
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player",
m_playerInstanceId, m_framesDecoded.load());
AFTER (Hybrid - Signal ONLY during INITIAL_BUFFERING):
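// Round-Robin coordination: Signal next player ONLY during INITIAL_BUFFERING
// After initial buffering, no per-frame signaling needed (parallel operation)
// Note: `<=` here versus `<` at the wait site assumes m_framesDecoded has already
// been incremented for the frame just decoded, so both checks cover frames 0-15
if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING) {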
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - SIGNALED next player",
m_playerInstanceId, m_framesDecoded.load());
}
FrameProcessor.cpp - Modify lines 242-246:
BEFORE (Current - Signal for ALL frames):
// Round-Robin coordination: Signal next player to proceed
// This applies to ALL successful decodes (TRIPLE_FILLING and NORMAL_PLAYBACK phases)
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player (SUCCESS path)",
m_playerInstanceId, m_framesDecoded.load());
AFTER (Hybrid - Signal ONLY during INITIAL_BUFFERING):
// Round-Robin coordination: Signal next player ONLY during INITIAL_BUFFERING
// TRIPLE_FILLING and NORMAL_PLAYBACK: No signaling needed (parallel operation)
if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING) {
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - SIGNALED next player (SUCCESS path)",
m_playerInstanceId, m_framesDecoded.load());
}
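The wait condition uses `<` while the signal conditions use `<=`. The sketch below checks that the two stay balanced, under the assumption that the counter is incremented between the wait and the signal and that VAVCORE_NVDEC_INITIAL_BUFFERING is 16; the helper names are hypothetical.

```cpp
#include <cstdint>

// Stands in for VAVCORE_NVDEC_INITIAL_BUFFERING (assumed to be 16).
constexpr std::uint64_t kInitialBuffering = 16;

// Checked before decode: the counter holds the number of frames finished so far.
constexpr bool ShouldWaitForTurn(std::uint64_t framesDecoded) {
    return framesDecoded < kInitialBuffering;
}

// Checked after decode: assumes the counter was already incremented.
constexpr bool ShouldSignalNext(std::uint64_t framesDecoded) {
    return framesDecoded <= kInitialBuffering;
}

struct GateCounts {
    int waits = 0;
    int signals = 0;
};

// Simulates one player's loop with the increment between the two checks.
GateCounts SimulateGates(std::uint64_t totalFrames) {
    GateCounts counts;
    std::uint64_t framesDecoded = 0;
    for (std::uint64_t i = 0; i < totalFrames; ++i) {
        if (ShouldWaitForTurn(framesDecoded)) ++counts.waits;
        ++framesDecoded;  // decode of frame i completes
        if (ShouldSignalNext(framesDecoded)) ++counts.signals;
    }
    return counts;
}
```

If the real counter is incremented elsewhere, the two comparisons would need to be re-checked so that every wait still has exactly one matching signal.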
4.3 Keep Initial Buffering Barrier (NO CHANGE)
FrameProcessor.cpp (lines 215-220) - KEEP AS IS:
// Synchronization barrier: Wait for all players to complete INITIAL_BUFFERING
// This ensures all players start TRIPLE_FILLING phase simultaneously
if (m_framesDecoded == VAVCORE_NVDEC_INITIAL_BUFFERING) {
LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING completed - signaling and waiting for all players", m_playerInstanceId);
GlobalFrameScheduler::GetInstance().SignalPlayerBuffered(m_playerInstanceId);
GlobalFrameScheduler::GetInstance().WaitAllPlayersBuffered();
LOGF_INFO("[Player#%d] [FrameProcessor] All players buffered - starting TRIPLE_FILLING phase", m_playerInstanceId);
}
Rationale: This ONE-TIME barrier ensures synchronized transition from INITIAL_BUFFERING to TRIPLE_FILLING.
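A one-time barrier of this kind can be sketched with a counter and a condition variable. `BufferedBarrier` is a hypothetical illustration; it mirrors the method names used above but is not the project's GlobalFrameScheduler.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical one-shot barrier: every player reports once, then all
// waiters are released together. Not the project's actual implementation.
class BufferedBarrier {
public:
    explicit BufferedBarrier(int playerCount) : m_expected(playerCount) {}

    // Called once by each player when its initial buffering is complete.
    void SignalPlayerBuffered() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            ++m_buffered;
        }
        m_cv.notify_all();
    }

    // Blocks until every expected player has reported.
    void WaitAllPlayersBuffered() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [&] { return m_buffered >= m_expected; });
    }

    bool AllBuffered() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_buffered >= m_expected;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_expected;
    int m_buffered = 0;
};
```

Because the counter never resets, the barrier costs one wait per player for the whole session, unlike the per-frame turn wait.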
5. Risk Analysis
5.1 INITIAL_BUFFERING Phase Risks
Risk: NVDEC Queue Saturation (WITH Round-Robin - MITIGATED)
Analysis:
- 4 players × 16 packets = 64 simultaneous submissions (if no Round-Robin)
- NVDEC queue depth: 8-16 frames → SATURATED
- Round-Robin ensures sequential submission → NO SATURATION
Mitigation: Keep Round-Robin for INITIAL_BUFFERING phase ✅
5.2 TRIPLE_FILLING Phase Risks
Risk: NVDEC Queue Saturation (NO Round-Robin)
Analysis:
- 4 players × 3 packets = 12 simultaneous submissions (worst-case)
- NVDEC queue depth: 8-16 frames → margin exists at the upper estimate (16); a depth of 8 would be exceeded by a full 12-packet burst
- NVDEC internal FIFO queue handles burst → no saturation expected
Mitigation: None needed if queue depth is at least 12 ✅
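The burst sizes in sections 5.1 and 5.2 can be compared against the quoted queue depth range with a small calculation. This is a sketch with hypothetical helper names; the 8-16 frame depth is the document's estimate, not a value queried from the driver.

```cpp
// Worst-case number of packets submitted at once.
constexpr int BurstSize(int players, int packetsPerPlayer) {
    return players * packetsPerPlayer;
}

// True when the whole burst fits in a queue of the given depth.
constexpr bool FitsInQueue(int burst, int queueDepth) {
    return burst <= queueDepth;
}

// INITIAL_BUFFERING without Round-Robin: 4 x 16 packets.
static_assert(BurstSize(4, 16) == 64, "initial buffering burst");
// TRIPLE_FILLING without Round-Robin: 4 x 3 packets.
static_assert(BurstSize(4, 3) == 12, "triple filling burst");
```

The 64-packet burst exceeds both ends of the estimated range, while the 12-packet burst fits only when the depth is 12 or more.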
5.3 NORMAL_PLAYBACK Phase Risks
Risk: NVDEC Queue Saturation (NO Round-Robin)
Analysis:
- Timing thread controls submission rate: 33.33ms interval
- Even if all 4 players submit simultaneously, NVDEC processes at 240fps (RTX 3080)
- Submission rate: 4 × 30fps = 120fps < Processing rate: 240fps
- Conclusion: NO SATURATION
Mitigation: None needed (timing thread prevents burst) ✅
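The rate comparison above amounts to a utilization check. The following sketch uses hypothetical helper names and takes the 240fps processing rate from this document as given rather than as a measured value.

```cpp
// Combined submission rate of all players, in frames per second.
constexpr double AggregateSubmissionFps(int players, double fpsPerPlayer) {
    return players * fpsPerPlayer;
}

// Fraction of decoder capacity consumed; values below 1.0 mean the decoder
// drains its queue faster than the players fill it.
constexpr double DecoderUtilization(double submissionFps, double decodeFps) {
    return submissionFps / decodeFps;
}
```

With 4 players at 30fps against 240fps of decode capacity, utilization is 0.5, so the queue should drain between timing ticks on average.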
Risk: Frame Reordering Across Players
Analysis:
- Each player has a separate CUvideodecoder instance
- Each decoder has its own CUDA DPB (16 frames)
- NVDEC processes each decoder's queue independently
- Conclusion: No cross-player interference
Mitigation: None needed (architecture prevents this) ✅
Risk: Unbalanced Decode Timing
Analysis:
- Timing controlled by PlaybackController (PlaybackTimingThread)
- ProcessNextFrame() called at regular intervals (33.33ms for 30fps)
- Even if decode is fast (2ms), player waits for next timing tick
- Conclusion: Timing thread prevents drift
Mitigation: None needed (PlaybackController already handles this) ✅
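One common way a timing thread avoids drift is to compute each deadline from the playback start instead of from the previous wake-up. The sketch below illustrates that idea only; `FrameInterval` and `TickDeadline` are hypothetical helpers, and how PlaybackController actually schedules ticks is not shown in this document.

```cpp
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Frame interval for a given rate, e.g. 30 fps gives about 33.33 ms.
inline std::chrono::nanoseconds FrameInterval(double fps) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::duration<double>(1.0 / fps));
}

// Deadline of tick `tickIndex`, measured from the playback start rather than
// from the previous wake-up, so decode-time variation does not accumulate.
inline Clock::time_point TickDeadline(Clock::time_point start,
                                      std::chrono::nanoseconds interval,
                                      std::int64_t tickIndex) {
    return start + interval * tickIndex;
}
```

A timing loop built on this would sleep until `TickDeadline(start, FrameInterval(30.0), n)` before requesting frame n, so a 2ms decode and a 5ms decode both resume on the same tick.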
6. Performance Impact Analysis
6.1 INITIAL_BUFFERING Phase (frames 0-15)
Current (with Round-Robin): Same as proposed ✅
- Sequential submission
- One-time cost during startup
- Minimal impact on overall playback
Proposed (with Round-Robin): No change
- Keep existing behavior
- Ensure stable startup
6.2 TRIPLE_FILLING Phase (frames 16-18)
Current (with Round-Robin):
Player#3 Frame 16: DECODE: ~50ms (waiting for P1, P2)
Player#3 Frame 17: DECODE: ~50ms (waiting for P1, P2)
Player#3 Frame 18: DECODE: ~50ms (waiting for P1, P2)
Total delay for P3 during TRIPLE_FILLING: ~150ms
Proposed (without Round-Robin):
Player#3 Frame 16: DECODE: 3.1ms (no wait)
Player#3 Frame 17: DECODE: 3.1ms (no wait)
Player#3 Frame 18: DECODE: 3.1ms (no wait)
Total delay for P3 during TRIPLE_FILLING: ~9ms
Improvement: 150ms → 9ms (94% reduction) ✅
6.3 NORMAL_PLAYBACK Phase (frames 19+)
Current (with Round-Robin):
Player#1: DECODE: 2.5ms (no wait) + QUEUE_DELAY: 8ms = 10.5ms
Player#2: DECODE: 2.7ms (wait 2.5ms) + QUEUE_DELAY: 9.9ms = 12.6ms
Player#3: DECODE: 53.6ms (wait 50ms) + QUEUE_DELAY: 58.0ms = 111.6ms ❌
Player#4: DECODE: 2.8ms (wait 8ms) + QUEUE_DELAY: 12ms = 14.8ms
Proposed (without Round-Robin):
Player#1: DECODE: 2.5ms (no wait) + QUEUE_DELAY: 5ms = 7.5ms ✅
Player#2: DECODE: 2.7ms (no wait) + QUEUE_DELAY: 5ms = 7.7ms ✅
Player#3: DECODE: 3.1ms (no wait) + QUEUE_DELAY: 6ms = 9.1ms ✅ (58ms → 6ms!)
Player#4: DECODE: 2.8ms (no wait) + QUEUE_DELAY: 5ms = 7.8ms ✅
Improvement: Player#3 QUEUE_DELAY reduces from 58ms to ~6ms (90% reduction) ✅
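The quoted percentages follow from a simple relative-reduction formula. The helper below is a hypothetical sketch for checking them against the figures in sections 6.2 and 6.3.

```cpp
// Relative reduction from `before` to `after`, in percent.
constexpr double PercentReduction(double before, double after) {
    return (before - after) / before * 100.0;
}
```

Applied to the numbers above, 58ms → 6ms is about 89.7% (rounded to 90%), and 150ms → 9ms is 94%.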
7. Implementation Summary
7.1 Code Changes Required
File: FrameProcessor.cpp
3 locations to modify (conditional Round-Robin based on phase):
- Lines 95-107: Add `if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING)` around WaitForMyTurnInBuffering()
- Lines 207-211: Add `if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING)` around SignalNextPlayer()
- Lines 242-246: Add `if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING)` around SignalNextPlayer()
1 location to keep unchanged:
- Lines 215-220: Keep WaitAllPlayersBuffered() barrier (ONE-TIME synchronization)
7.2 Testing Plan
Test 1: Verify INITIAL_BUFFERING still works correctly
- Monitor first 16 frames for sequential submission
- Verify no NVDEC queue saturation during startup
- Expected: Stable startup, no errors
Test 2: Verify TRIPLE_FILLING improvement
- Monitor frames 16-18 for parallel submission
- Verify DECODE times are low (<5ms)
- Expected: No blocking delays
Test 3: Verify NORMAL_PLAYBACK improvement
- Monitor frames 19+ for parallel submission
- Verify QUEUE_DELAY reduction (<10ms)
- Expected: Smooth playback, no jitter
Test 4: Verify overall stability
- Run 10-minute playback session
- Check `m_framesDropped` counter
- Expected: 0 dropped frames, consistent performance
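The pass criteria for Tests 2 and 3 can be expressed as a check over measured samples. This is a sketch: `FrameSample` and `MeetsPlaybackTargets` are hypothetical, and how samples are extracted from the logs is left out because the log format is not specified here.

```cpp
#include <algorithm>
#include <vector>

// One measurement for one frame of one player (hypothetical structure).
struct FrameSample {
    int playerId;
    double decodeMs;
    double queueDelayMs;
};

// True when every sample is under the thresholds named in the testing plan
// (DECODE < 5ms, QUEUE_DELAY < 10ms by default).
bool MeetsPlaybackTargets(const std::vector<FrameSample>& samples,
                          double maxDecodeMs = 5.0,
                          double maxQueueDelayMs = 10.0) {
    return std::all_of(samples.begin(), samples.end(),
                       [&](const FrameSample& s) {
                           return s.decodeMs < maxDecodeMs &&
                                  s.queueDelayMs < maxQueueDelayMs;
                       });
}
```

Checking every sample rather than an average matters here, since a single slow player was the original symptom.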
7.3 Rollback Plan
If unexpected issues occur:
- Revert code changes (git restore FrameProcessor.cpp)
- Return to full Round-Robin synchronization (current behavior)
- Analyze logs to identify specific failure point
8. Comparison Table
| Aspect | Full Round-Robin (Current) | Hybrid Round-Robin (Proposed) |
|---|---|---|
| INITIAL_BUFFERING | Sequential (0-15) | Sequential (0-15) - SAME |
| TRIPLE_FILLING | Sequential (16-18) ❌ | Parallel (16-18) ✅ |
| NORMAL_PLAYBACK | Sequential (19+) ❌ | Parallel (19+) ✅ |
| P3 DECODE Time | 53.6ms ❌ | 3.1ms ✅ |
| P3 QUEUE_DELAY | 58.0ms ❌ | 6.0ms ✅ |
| Startup Stability | High ✅ | High ✅ |
| Playback Performance | Poor ❌ | Excellent ✅ |
| Code Complexity | Medium | Medium (same) |
| Risk Level | Low (proven) | Low (INITIAL_BUFFERING protected) |
9. Conclusion
Recommended Action: Implement Hybrid Round-Robin
Justification:
- Best of Both Worlds:
  - Keep Round-Robin for INITIAL_BUFFERING → Stable startup
  - Remove Round-Robin for TRIPLE_FILLING/NORMAL_PLAYBACK → Smooth playback
- Performance Gains:
  - 90% reduction in QUEUE_DELAY for Player#3 (58ms → 6ms)
  - 94% reduction in TRIPLE_FILLING delay (150ms → 9ms)
- Risk Mitigation:
  - INITIAL_BUFFERING: Protected by Round-Robin (no queue saturation risk)
  - TRIPLE_FILLING: NVDEC queue depth sufficient for 12 packets
  - NORMAL_PLAYBACK: Timing thread prevents burst submissions
- Minimal Code Changes:
  - 3 conditional checks added
  - No architectural changes
  - Easy to rollback if needed
- User Request Alignment:
  - Directly addresses user's concern about initial buffering overload
  - Maintains safety where needed, removes bottleneck where harmful
Next Steps:
- User reviews and approves this hybrid design
- Implement 3 conditional checks in FrameProcessor.cpp
- Test with 4-player playback
- Monitor time.log for QUEUE_DELAY improvement
- Verify smooth playback and no frame drops
End of Design Document