NVDEC Hybrid Synchronization Design

Date: 2025-10-11
Author: Claude
Purpose: Hybrid Round-Robin strategy for 4-player simultaneous AV1 playback - INITIAL_BUFFERING only


1. Executive Summary

Current Problem

Round-Robin synchronization applied to ALL frames causes sequential blocking:

  • Player#3 DECODE: 53.6ms (waiting for Player#2)
  • Player#3 QUEUE_DELAY: 58.0ms (UI thread callback delayed)
  • Video jitter/stuttering despite Triple Buffering, VSync removal, and rendering separation

Proposed Solution (Hybrid Approach)

Keep Round-Robin for the INITIAL_BUFFERING phase and remove it for TRIPLE_FILLING and NORMAL_PLAYBACK:

  • Phase 1 (frames 0-15): WITH Round-Robin - Controlled startup to prevent NVDEC queue saturation
  • Phase 2 (frames 16-18): NO Round-Robin - Parallel triple buffer filling
  • Phase 3 (frames 19+): NO Round-Robin - Full parallel playback with NVDEC self-queue management
  • Expected QUEUE_DELAY: <10ms (consistent across all players)

Why Hybrid?

Why Round-Robin was introduced for initial buffering:

  • If 4 players submit 16 packets each to NVDEC at the same time, the queue may saturate
  • It provides stability during the initial startup phase
  • Filling the CUDA DPB sequentially gives predictable behavior

Why it is unnecessary in TRIPLE_FILLING/NORMAL_PLAYBACK:

  • Once initial buffering completes, the NVDEC queue has stabilized
  • 30fps timing control means burst submissions do not occur
  • Each player is driven independently by its own timing thread

2. Current Architecture Analysis

2.1 Current Round-Robin Applied to ALL Phases

FrameProcessor.cpp (lines 98-107):

// Round-Robin coordination for ALL frames to prevent NVDEC queue saturation
// This ensures sequential submission to NVDEC across all players throughout entire playback
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());

// Wait for my turn in round-robin (blocking)
GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());

Problem: This blocking applies to frames 0-∞, causing cascading delays in TRIPLE_FILLING and NORMAL_PLAYBACK phases.
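
To make the blocking concrete, here is a minimal sketch of a turn-taking scheduler. It is not the actual GlobalFrameScheduler, whose internals are not shown in this document: the class name TurnScheduler, the helper RunOneRound, and the zero-based player IDs are all hypothetical. It only illustrates why a player later in the order cannot submit until every earlier player has signaled.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the round-robin part of GlobalFrameScheduler.
class TurnScheduler {
public:
    explicit TurnScheduler(int playerCount) : m_playerCount(playerCount) {
        assert(playerCount > 0);
    }

    // Blocks until it is this player's turn (IDs are zero-based in this sketch).
    void WaitForMyTurn(int playerId) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [&] { return m_currentTurn == playerId; });
    }

    // Hands the turn to the next player, wrapping around after the last one.
    void SignalNextPlayer(int playerId) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_currentTurn != playerId) return;  // not the turn holder: ignore
            m_currentTurn = (m_currentTurn + 1) % m_playerCount;
        }
        m_cv.notify_all();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_playerCount;
    int m_currentTurn = 0;
};

// Runs one frame's worth of turns with each player on its own thread and
// returns the order in which the players got to "submit".
inline std::vector<int> RunOneRound(int playerCount) {
    TurnScheduler scheduler(playerCount);
    std::vector<int> order;
    std::vector<std::thread> threads;
    // Threads are started in reverse to show that the scheduler, not the
    // start order, decides who submits first.
    for (int id = playerCount - 1; id >= 0; --id) {
        threads.emplace_back([&, id] {
            scheduler.WaitForMyTurn(id);
            order.push_back(id);  // only the current turn holder writes here
            scheduler.SignalNextPlayer(id);
        });
    }
    for (auto& t : threads) t.join();
    return order;
}
```

With four players the submission order is always 0, 1, 2, 3, so the third player's wait includes the full decode time of the two before it, on every frame.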

2.2 Performance Impact by Phase

INITIAL_BUFFERING (frames 0-15): Round-Robin is NEEDED

  • Controlled sequential submission prevents startup queue saturation
  • Minimal performance impact (happens once, during startup)

TRIPLE_FILLING (frames 16-18): Round-Robin is HARMFUL

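  • Only 3 frames per player, so sequential blocking adds latency without protecting startup
  • Player#3 still waits for P1 and P2 on each frame: about 5ms from their decode times alone (2.5ms + 2.7ms), and ~50ms in the estimates of Section 6.2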

NORMAL_PLAYBACK (frames 19+): Round-Robin is HARMFUL

  • Every frame suffers from sequential blocking (53ms wait for Player#3)
  • Cascading QUEUE_DELAY (58ms) causing visible jitter

3. Proposed Hybrid Architecture

3.1 Phase-Based Synchronization Strategy

┌─────────────────────────────────────────────────────────────────────┐
│ Phase 1: INITIAL_BUFFERING (frames 0-15)                           │
│ - WITH Round-Robin: Sequential submission                          │
│ - Purpose: Prevent NVDEC queue saturation during startup           │
│ - Performance: One-time cost, minimal impact                        │
└─────────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────────┐
│ Synchronization Barrier: WaitAllPlayersBuffered()                  │
│ - Ensures all players complete INITIAL_BUFFERING simultaneously     │
└─────────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────────┐
│ Phase 2: TRIPLE_FILLING (frames 16-18)                             │
│ - NO Round-Robin: Parallel submission                               │
│ - Purpose: Fill textures [0,1,2] quickly                            │
│ - Performance: Fast parallel filling                                │
└─────────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────────┐
│ Phase 3: NORMAL_PLAYBACK (frames 19+)                              │
│ - NO Round-Robin: Full parallel operation                           │
│ - Purpose: Smooth 30fps playback without blocking                   │
│ - Performance: NVDEC self-queue management                          │
└─────────────────────────────────────────────────────────────────────┘
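
The phase boundaries in the diagram can be expressed as a small classification function. This is a sketch only: the constant values 16 and 3 are inferred from the frame ranges in this document, and DecodePhase, PhaseForFrame, and NeedsRoundRobin are hypothetical names, not identifiers from the codebase.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values, inferred from the frame ranges used in this document.
constexpr std::uint64_t kInitialBufferingFrames = 16;  // frames 0-15
constexpr std::uint64_t kTripleFillingFrames = 3;      // frames 16-18

enum class DecodePhase { InitialBuffering, TripleFilling, NormalPlayback };

// Maps a zero-based frame index to the phase it belongs to.
constexpr DecodePhase PhaseForFrame(std::uint64_t frameIndex) {
    return frameIndex < kInitialBufferingFrames
               ? DecodePhase::InitialBuffering
               : (frameIndex < kInitialBufferingFrames + kTripleFillingFrames
                      ? DecodePhase::TripleFilling
                      : DecodePhase::NormalPlayback);
}

// Under the hybrid design only the first phase takes a round-robin turn.
constexpr bool NeedsRoundRobin(std::uint64_t frameIndex) {
    return PhaseForFrame(frameIndex) == DecodePhase::InitialBuffering;
}

// Boundary checks matching the diagram: 15 | 16 and 18 | 19.
static_assert(NeedsRoundRobin(15), "frame 15 is still INITIAL_BUFFERING");
static_assert(!NeedsRoundRobin(16), "frame 16 starts TRIPLE_FILLING");
static_assert(PhaseForFrame(19) == DecodePhase::NormalPlayback, "frame 19 starts NORMAL_PLAYBACK");
```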

3.2 Timeline Comparison

INITIAL_BUFFERING (frames 0-15): Sequential (KEEP Round-Robin)

Frame 0:
  0ms   - P1 WaitForMyTurnInBuffering() → ACQUIRED
  0ms   - P1 decode → 2.5ms → SignalNextPlayer()
  2.5ms - P2 WaitForMyTurnInBuffering() → ACQUIRED
  2.5ms - P2 decode → 2.7ms → SignalNextPlayer()
  5.2ms - P3 WaitForMyTurnInBuffering() → ACQUIRED
  5.2ms - P3 decode → 3.1ms → SignalNextPlayer()
  8.3ms - P4 WaitForMyTurnInBuffering() → ACQUIRED
  8.3ms - P4 decode → 2.8ms → SignalNextPlayer()

↓ (Repeat for frames 1-15)

Frame 15 completed: All players call WaitAllPlayersBuffered()
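
The acquire times in the Frame 0 timeline follow directly from the per-player decode durations. The sketch below reproduces that arithmetic; Turn and SimulateRoundRobinFrame are hypothetical names, and the model assumes the handoff between players costs nothing, which the real scheduler will not quite achieve.

```cpp
#include <cassert>
#include <vector>

struct Turn {
    double acquiredMs;  // when the player gets its turn
    double releasedMs;  // when it signals the next player
};

// Sequential round-robin for one frame: each player acquires its turn
// at the moment the previous player releases.
inline std::vector<Turn> SimulateRoundRobinFrame(const std::vector<double>& decodeMs) {
    std::vector<Turn> turns;
    double clockMs = 0.0;
    for (double duration : decodeMs) {
        assert(duration >= 0.0);
        turns.push_back({clockMs, clockMs + duration});
        clockMs += duration;
    }
    return turns;
}
```

Feeding in the decode times from the timeline (2.5, 2.7, 3.1, 2.8 ms) gives acquire times of 0, 2.5, 5.2, and 8.3 ms, with the last player releasing at about 11.1 ms.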

TRIPLE_FILLING (frames 16-18): Parallel (REMOVE Round-Robin)

Frame 16:
  0.0ms - P1, P2, P3, P4 all decode simultaneously (NO WAITING)

  NVDEC Queue (internal):
  ┌─────────────────────────────────────┐
  │ [P1] [P2] [P3] [P4]                 │  ← FIFO queue
  │  ↓    ↓    ↓    ↓                   │
  │  Processed in submission order      │
  └─────────────────────────────────────┘

  Expected DECODE times:
  - P1: 2.5ms
  - P2: 2.7ms
  - P3: 3.1ms (NOT 53ms!)
  - P4: 2.8ms

↓ (Repeat for frames 17-18)

NORMAL_PLAYBACK (frames 19+): Parallel (REMOVE Round-Robin)

Frame 19:
  0.0ms - P1, P2, P3, P4 all decode simultaneously (NO WAITING)

  Each player's timing thread controls submission rate (33.33ms interval)
  NVDEC processes requests in FIFO order

  Expected performance:
  - DECODE: 2-5ms (all players)
  - QUEUE_DELAY: <10ms (all players)
  - TOTAL: <15ms (all players)

4. Implementation Changes

4.1 Conditional Round-Robin Based on Phase

FrameProcessor.cpp - Modify lines 95-107:

BEFORE (Current - Round-Robin for ALL frames):

} else {
    // NVDEC/Hardware: D3D12 surface decoding with CUDA DPB buffering

    // Round-Robin coordination for ALL frames to prevent NVDEC queue saturation
    // This ensures sequential submission to NVDEC across all players throughout entire playback
    LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
              m_playerInstanceId, m_framesDecoded.load());

    // Wait for my turn in round-robin (blocking)
    GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

    LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
              m_playerInstanceId, m_framesDecoded.load());

AFTER (Hybrid - Round-Robin ONLY for INITIAL_BUFFERING):

} else {
    // NVDEC/Hardware: D3D12 surface decoding with CUDA DPB buffering

    // Round-Robin coordination ONLY during INITIAL_BUFFERING phase
    // This prevents NVDEC queue saturation during startup when all players submit 16 packets simultaneously
    // After initial buffering, NVDEC self-manages queue and timing thread controls submission rate
    if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING) {
        LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - WAITING for turn",
                  m_playerInstanceId, m_framesDecoded.load());

        GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

        LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - ACQUIRED turn",
                  m_playerInstanceId, m_framesDecoded.load());
    }
    // TRIPLE_FILLING and NORMAL_PLAYBACK: NO Round-Robin (parallel submission)

4.2 Conditional SignalNextPlayer Based on Phase

FrameProcessor.cpp - Modify lines 207-211:

BEFORE (Current - Signal for ALL frames):

// Round-Robin coordination: Signal next player to proceed
// This applies to ALL frames, not just initial buffering
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player",
          m_playerInstanceId, m_framesDecoded.load());

AFTER (Hybrid - Signal ONLY during INITIAL_BUFFERING):

// Round-Robin coordination: Signal next player ONLY during INITIAL_BUFFERING
// After initial buffering, no per-frame signaling needed (parallel operation)
if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING) {
    GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - SIGNALED next player",
              m_playerInstanceId, m_framesDecoded.load());
}

FrameProcessor.cpp - Modify lines 242-246:

BEFORE (Current - Signal for ALL frames):

// Round-Robin coordination: Signal next player to proceed
// This applies to ALL successful decodes (TRIPLE_FILLING and NORMAL_PLAYBACK phases)
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player (SUCCESS path)",
          m_playerInstanceId, m_framesDecoded.load());

AFTER (Hybrid - Signal ONLY during INITIAL_BUFFERING):

// Round-Robin coordination: Signal next player ONLY during INITIAL_BUFFERING
// TRIPLE_FILLING and NORMAL_PLAYBACK: No signaling needed (parallel operation)
if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING) {
    GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING Frame %llu - SIGNALED next player (SUCCESS path)",
              m_playerInstanceId, m_framesDecoded.load());
}
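
The wait gate uses `<` while the signal gates use `<=`. The two only pair up if the decoded-frame counter is incremented after the decode and before the signal check. That ordering is an assumption here; the surrounding code is not shown in this document. The sketch below counts how often each gate fires under that assumption, using hypothetical names (kInitialBuffering, GateCounts, CountGatedCalls) and an assumed threshold of 16.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value: 16 matches "frames 0-15" in this document.
constexpr std::uint64_t kInitialBuffering = 16;

struct GateCounts {
    int waits = 0;
    int signals = 0;
};

// Walks one player's first totalFrames frames and counts gate hits.
// Assumption: the counter is incremented between the two gates.
inline GateCounts CountGatedCalls(std::uint64_t totalFrames) {
    GateCounts counts;
    std::uint64_t framesDecoded = 0;
    for (std::uint64_t frame = 0; frame < totalFrames; ++frame) {
        if (framesDecoded < kInitialBuffering) ++counts.waits;     // pre-decode gate
        ++framesDecoded;                                           // decode completes
        if (framesDecoded <= kInitialBuffering) ++counts.signals;  // post-decode gate
    }
    return counts;
}
```

Under this ordering every wait is matched by exactly one signal (16 of each). If the counter were instead incremented after the signal check, the `<=` gate would fire one extra time, so the increment position is worth confirming before applying the change.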

4.3 Keep Initial Buffering Barrier (NO CHANGE)

FrameProcessor.cpp (lines 215-220) - KEEP AS IS:

// Synchronization barrier: Wait for all players to complete INITIAL_BUFFERING
// This ensures all players start TRIPLE_FILLING phase simultaneously
if (m_framesDecoded == VAVCORE_NVDEC_INITIAL_BUFFERING) {
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING completed - signaling and waiting for all players", m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().SignalPlayerBuffered(m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().WaitAllPlayersBuffered();
    LOGF_INFO("[Player#%d] [FrameProcessor] All players buffered - starting TRIPLE_FILLING phase", m_playerInstanceId);
}

Rationale: This ONE-TIME barrier ensures synchronized transition from INITIAL_BUFFERING to TRIPLE_FILLING.
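
For reference, a one-time barrier of this kind can be sketched with a counter and a condition variable. This is a hypothetical illustration, not the GlobalFrameScheduler implementation: BufferedBarrier and MinCountSeenAfterBarrier are invented names, and the sketch omits the player ID argument that the real SignalPlayerBuffered() takes.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical one-shot barrier: nobody passes until every player has signaled.
class BufferedBarrier {
public:
    explicit BufferedBarrier(int playerCount) : m_expected(playerCount) {
        assert(playerCount > 0);
    }

    // Marks one more player as having finished INITIAL_BUFFERING.
    void SignalPlayerBuffered() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            ++m_buffered;
        }
        m_cv.notify_all();
    }

    // Blocks until all players have signaled; returns the count seen on release.
    int WaitAllPlayersBuffered() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_buffered >= m_expected; });
        return m_buffered;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    const int m_expected;
    int m_buffered = 0;
};

// Sends playerCount threads through the barrier and returns the smallest
// buffered count that any of them observed when released.
inline int MinCountSeenAfterBarrier(int playerCount) {
    BufferedBarrier barrier(playerCount);
    std::vector<int> seen(playerCount, 0);
    std::vector<std::thread> threads;
    for (int id = 0; id < playerCount; ++id) {
        threads.emplace_back([&barrier, &seen, id] {
            barrier.SignalPlayerBuffered();
            seen[id] = barrier.WaitAllPlayersBuffered();  // each thread writes its own slot
        });
    }
    for (auto& t : threads) t.join();
    return *std::min_element(seen.begin(), seen.end());
}
```

Because the counter never resets, the barrier costs each player a single wait at the phase transition and nothing afterwards, which is why it does not reintroduce per-frame blocking.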


5. Risk Analysis

5.1 INITIAL_BUFFERING Phase Risks

Risk: NVDEC Queue Saturation (WITH Round-Robin - MITIGATED)

Analysis:

  • 4 players × 16 packets = 64 simultaneous submissions (if no Round-Robin)
  • NVDEC queue depth: 8-16 frames → SATURATED
  • Round-Robin ensures sequential submission → NO SATURATION

Mitigation: Keep Round-Robin for INITIAL_BUFFERING phase

5.2 TRIPLE_FILLING Phase Risks

Risk: NVDEC Queue Saturation (NO Round-Robin)

Analysis:

  • 4 players × 3 packets = 12 simultaneous submissions (worst-case)
  • NVDEC queue depth: 8-16 frames → MARGIN EXISTS
  • NVDEC internal FIFO queue handles burst → NO SATURATION

Mitigation: None needed (queue depth sufficient)

5.3 NORMAL_PLAYBACK Phase Risks

Risk: NVDEC Queue Saturation (NO Round-Robin)

Analysis:

  • Timing thread controls submission rate: 33.33ms interval
  • Even if all 4 players submit simultaneously, NVDEC processes at 240fps (RTX 3080)
  • Submission rate: 4 × 30fps = 120fps < Processing rate: 240fps
  • Conclusion: NO SATURATION

Mitigation: None needed (timing thread prevents burst)
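
The rate argument above can be checked with a few lines of arithmetic. The 240fps decoder throughput is this document's own figure, taken as given; the function names are hypothetical. The sketch also derives one number not stated above: how long the decoder needs to drain a worst-case burst in which all players submit on the same tick.

```cpp
#include <cassert>

// Combined submission rate of all players, in frames per second.
constexpr double AggregateSubmissionFps(int players, double fpsPerPlayer) {
    return players * fpsPerPlayer;
}

// Time for the decoder to work through one frame from every player
// if they all submit at the same instant.
constexpr double BurstDrainMs(int players, double decoderFps) {
    return players * 1000.0 / decoderFps;
}

// True when the decoder's throughput exceeds the combined submission rate.
constexpr bool KeepsUp(int players, double fpsPerPlayer, double decoderFps) {
    return AggregateSubmissionFps(players, fpsPerPlayer) < decoderFps;
}

// 4 players at 30fps against the 240fps figure quoted in this section.
static_assert(KeepsUp(4, 30.0, 240.0), "120fps submitted < 240fps processed");
```

With these figures a simultaneous 4-frame burst drains in about 16.7ms, roughly half of the 33.33ms tick interval, so the queue is empty again before the next tick. The check fails at 8 players (240fps submitted), which marks the limit of this argument.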

Risk: Frame Reordering Across Players

Analysis:

  • Each player has separate CUvideodecoder instance
  • Each decoder has its own CUDA DPB (16 frames)
  • NVDEC processes each decoder's queue independently
  • Conclusion: No cross-player interference

Mitigation: None needed (architecture prevents this)

Risk: Unbalanced Decode Timing

Analysis:

  • Timing controlled by PlaybackController (PlaybackTimingThread)
  • ProcessNextFrame() called at regular intervals (33.33ms for 30fps)
  • Even if decode is fast (2ms), player waits for next timing tick
  • Conclusion: Timing thread prevents drift

Mitigation: None needed (PlaybackController already handles this)
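
The drift argument depends on the timing thread sleeping to absolute deadlines rather than sleeping a fixed amount after each decode. Whether PlaybackTimingThread does this is not shown here, so the following is only a sketch of that technique under that assumption; RunPacedTicks is a hypothetical helper.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Calls onTick tickCount times on a fixed schedule and returns the total
// elapsed time. Deadlines are computed from the start time, so the duration
// of one tick's work does not push later ticks back.
template <typename OnTick>
std::chrono::steady_clock::duration RunPacedTicks(
        int tickCount,
        std::chrono::steady_clock::duration interval,
        OnTick onTick) {
    assert(tickCount >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 1; i <= tickCount; ++i) {
        // Sleep until the i-th absolute deadline, then do this tick's work.
        std::this_thread::sleep_until(start + i * interval);
        onTick();
    }
    return std::chrono::steady_clock::now() - start;
}
```

For 30fps playback the interval would be about 33.33ms, for example `std::chrono::microseconds(33333)`, with `onTick` standing in for the call to ProcessNextFrame(). A 2ms decode then simply leaves about 31ms of idle time before the next deadline.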


6. Performance Impact Analysis

6.1 INITIAL_BUFFERING Phase (frames 0-15)

Current (with Round-Robin): Same as proposed

  • Sequential submission
  • One-time cost during startup
  • Minimal impact on overall playback

Proposed (with Round-Robin): No change

  • Keep existing behavior
  • Ensure stable startup

6.2 TRIPLE_FILLING Phase (frames 16-18)

Current (with Round-Robin):

Player#3 Frame 16: DECODE: ~50ms (waiting for P1, P2)
Player#3 Frame 17: DECODE: ~50ms (waiting for P1, P2)
Player#3 Frame 18: DECODE: ~50ms (waiting for P1, P2)

Total delay for P3 during TRIPLE_FILLING: ~150ms

Proposed (without Round-Robin):

Player#3 Frame 16: DECODE: 3.1ms (no wait)
Player#3 Frame 17: DECODE: 3.1ms (no wait)
Player#3 Frame 18: DECODE: 3.1ms (no wait)

Total delay for P3 during TRIPLE_FILLING: ~9ms

Improvement: 150ms → 9ms (94% reduction)

6.3 NORMAL_PLAYBACK Phase (frames 19+)

Current (with Round-Robin):

Player#1: DECODE: 2.5ms (no wait) + QUEUE_DELAY: 8ms = 10.5ms
Player#2: DECODE: 2.7ms (wait 2.5ms) + QUEUE_DELAY: 9.9ms = 12.6ms
Player#3: DECODE: 53.6ms (wait 50ms) + QUEUE_DELAY: 58.0ms = 111.6ms ❌
Player#4: DECODE: 2.8ms (wait 8ms) + QUEUE_DELAY: 12ms = 14.8ms

Proposed (without Round-Robin):

Player#1: DECODE: 2.5ms (no wait) + QUEUE_DELAY: 5ms = 7.5ms ✅
Player#2: DECODE: 2.7ms (no wait) + QUEUE_DELAY: 5ms = 7.7ms ✅
Player#3: DECODE: 3.1ms (no wait) + QUEUE_DELAY: 6ms = 9.1ms ✅ (58ms → 6ms!)
Player#4: DECODE: 2.8ms (no wait) + QUEUE_DELAY: 5ms = 7.8ms ✅

Improvement: Player#3 QUEUE_DELAY reduces from 58ms to ~6ms (90% reduction)
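
The percentages quoted in Sections 6.2 and 6.3 can be reproduced from the before and after figures. PercentReduction is a hypothetical helper, and the inputs are this document's estimates, not new measurements.

```cpp
#include <cassert>

// Relative reduction from beforeMs to afterMs, as a percentage.
constexpr double PercentReduction(double beforeMs, double afterMs) {
    return (beforeMs - afterMs) / beforeMs * 100.0;
}
```

For QUEUE_DELAY, `PercentReduction(58.0, 6.0)` is about 89.7, which rounds to the 90% quoted above. For the TRIPLE_FILLING delay, `PercentReduction(150.0, 9.0)` is 94.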


7. Implementation Summary

7.1 Code Changes Required

File: FrameProcessor.cpp

3 locations to modify (conditional Round-Robin based on phase):

  1. Lines 95-107: Add if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING) around WaitForMyTurnInBuffering()
  2. Lines 207-211: Add if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING) around SignalNextPlayer()
  3. Lines 242-246: Add if (m_framesDecoded <= VAVCORE_NVDEC_INITIAL_BUFFERING) around SignalNextPlayer()

1 location to keep unchanged:

  1. Lines 215-220: Keep WaitAllPlayersBuffered() barrier (ONE-TIME synchronization)

7.2 Testing Plan

Test 1: Verify INITIAL_BUFFERING still works correctly

  • Monitor first 16 frames for sequential submission
  • Verify no NVDEC queue saturation during startup
  • Expected: Stable startup, no errors

Test 2: Verify TRIPLE_FILLING improvement

  • Monitor frames 16-18 for parallel submission
  • Verify DECODE times are low (<5ms)
  • Expected: No blocking delays

Test 3: Verify NORMAL_PLAYBACK improvement

  • Monitor frames 19+ for parallel submission
  • Verify QUEUE_DELAY reduction (<10ms)
  • Expected: Smooth playback, no jitter

Test 4: Verify overall stability

  • Run 10-minute playback session
  • Check m_framesDropped counter
  • Expected: 0 dropped frames, consistent performance

7.3 Rollback Plan

If unexpected issues occur:

  1. Revert code changes (git restore FrameProcessor.cpp)
  2. Return to full Round-Robin synchronization (current behavior)
  3. Analyze logs to identify specific failure point

8. Comparison Table

| Aspect | Full Round-Robin (Current) | Hybrid Round-Robin (Proposed) |
|---|---|---|
| INITIAL_BUFFERING | Sequential (0-15) | Sequential (0-15) - SAME |
| TRIPLE_FILLING | Sequential (16-18) | Parallel (16-18) |
| NORMAL_PLAYBACK | Sequential (19+) | Parallel (19+) |
| P3 DECODE Time | 53.6ms | 3.1ms |
| P3 QUEUE_DELAY | 58.0ms | 6.0ms |
| Startup Stability | High | High |
| Playback Performance | Poor | Excellent |
| Code Complexity | Medium | Medium (same) |
| Risk Level | Low (proven) | Low (INITIAL_BUFFERING protected) |

9. Conclusion

Justification:

  1. Best of Both Worlds:
    • Keep Round-Robin for INITIAL_BUFFERING → Stable startup
    • Remove Round-Robin for TRIPLE_FILLING/NORMAL_PLAYBACK → Smooth playback
  2. Performance Gains:
    • 90% reduction in QUEUE_DELAY for Player#3 (58ms → 6ms)
    • 94% reduction in TRIPLE_FILLING delay (150ms → 9ms)
  3. Risk Mitigation:
    • INITIAL_BUFFERING: Protected by Round-Robin (no queue saturation risk)
    • TRIPLE_FILLING: NVDEC queue depth sufficient for 12 packets
    • NORMAL_PLAYBACK: Timing thread prevents burst submissions
  4. Minimal Code Changes:
    • 3 conditional checks added
    • No architectural changes
    • Easy to rollback if needed
  5. User Request Alignment:
    • Directly addresses user's concern about initial buffering overload
    • Maintains safety where needed, removes bottleneck where harmful

Next Steps:

  1. User reviews and approves this hybrid design
  2. Implement 3 conditional checks in FrameProcessor.cpp
  3. Test with 4-player playback
  4. Monitor time.log for QUEUE_DELAY improvement
  5. Verify smooth playback and no frame drops

End of Design Document