NVDEC Self-Queue Management Design

Date: 2025-10-11
Author: Claude
Purpose: Alternative architecture to Round-Robin synchronization for 4-player simultaneous AV1 playback


1. Executive Summary

Current Problem

Round-Robin synchronization causes sequential blocking, resulting in:

  • Player#3 DECODE: 53.6ms (waiting for Player#2)
  • Player#3 QUEUE_DELAY: 58.0ms (UI thread callback delayed)
  • Video jitter/stuttering despite Triple Buffering, VSync removal, and rendering separation

Proposed Solution

Remove Round-Robin and rely on NVDEC's internal queue management:

  • All players submit decode requests independently (no blocking)
  • NVDEC manages concurrent requests using its internal FIFO queue
  • Expected DECODE time: 2-5ms (actual hardware decode time)
  • Expected QUEUE_DELAY: <10ms (consistent with Player#2 performance)

2. Current Architecture Analysis

2.1 Round-Robin Synchronization Flow

Timeline for 4 players (P1, P2, P3, P4):

Frame N:
  0ms   - P1 WaitForMyTurnInBuffering() → ACQUIRED (turn=1)
  0ms   - P1 vavcore_decode_to_surface() → 2.5ms
  2.5ms - P1 SignalNextPlayer() → turn=2

  2.5ms - P2 WaitForMyTurnInBuffering() → ACQUIRED (turn=2)
  2.5ms - P2 vavcore_decode_to_surface() → 2.7ms
  5.2ms - P2 SignalNextPlayer() → turn=3

  5.2ms - P3 WaitForMyTurnInBuffering() → ACQUIRED (turn=3)
  5.2ms - P3 vavcore_decode_to_surface() → 3.1ms
  8.3ms - P3 SignalNextPlayer() → turn=4

  8.3ms - P4 WaitForMyTurnInBuffering() → ACQUIRED (turn=4)
  8.3ms - P4 vavcore_decode_to_surface() → 2.8ms
  11.1ms - P4 SignalNextPlayer() → turn=1 (wrap around)

Total latency for P4: 11.1ms (accumulated waiting + decode)

Problem: Player#3 waits 5.2ms before even starting decode, causing cascading delays.

2.2 Code Locations

FrameProcessor.cpp (lines 98-107):

// Round-Robin coordination for ALL frames to prevent NVDEC queue saturation
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());

// Wait for my turn in round-robin (blocking)
GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());

GlobalFrameScheduler.cpp:

// Blocks the calling player's decode thread until the shared turn counter
// equals its ID; every other player idles here once per frame.
void GlobalFrameScheduler::WaitForMyTurnInBuffering(int playerId) {
    std::unique_lock<std::mutex> lock(m_bufferingMutex);
    m_bufferingCV.wait(lock, [this, playerId]() {
        return m_currentPlayerTurn == playerId;
    });
}

// Advances the turn counter (1..m_totalPlayers, wrapping) and wakes all waiters.
void GlobalFrameScheduler::SignalNextPlayer(int playerId) {
    std::unique_lock<std::mutex> lock(m_bufferingMutex);
    m_currentPlayerTurn = (playerId % m_totalPlayers) + 1;
    m_bufferingCV.notify_all();
}

3. NVDEC Queue Characteristics

3.1 NVIDIA NVDEC Architecture

Hardware Queue Management:

  • NVDEC has an internal FIFO queue for decode requests
  • Queue depth: Typically 8-16 frames (hardware dependent)
  • Concurrent submission: NVDEC accepts multiple cuvidDecodePicture() calls
  • Internal scheduling: NVDEC processes requests in submission order
  • Back-pressure: When queue is full, cuvidDecodePicture() blocks until slot available

CUDA DPB (Decoded Picture Buffer):

  • VavCore uses CUDA memory for DPB (for B-frame reordering)
  • DPB size: 16 frames (VAVCORE_NVDEC_INITIAL_BUFFERING)
  • Separate from NVDEC hardware queue
  • Each player has its own DPB in CUDA memory (see the decoder-creation sketch below)
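
For illustration, here is a minimal, hypothetical sketch of per-player decoder creation with a 16-frame DPB; VavCore's actual creation code is not included in this document, and the parameter values are assumptions:

// Hypothetical sketch: one CUvideodecoder per player, each with its own
// 16-frame DPB. Parameter choices here are illustrative assumptions.
#include <nvcuvid.h>

CUvideodecoder CreatePlayerDecoder(unsigned int width, unsigned int height) {
    CUVIDDECODECREATEINFO info = {};
    info.CodecType           = cudaVideoCodec_AV1;              // AV1 hardware decode
    info.ChromaFormat        = cudaVideoChromaFormat_420;
    info.OutputFormat        = cudaVideoSurfaceFormat_NV12;
    info.ulWidth             = width;
    info.ulHeight            = height;
    info.ulTargetWidth       = width;
    info.ulTargetHeight      = height;
    info.ulNumDecodeSurfaces = 16;  // DPB size = VAVCORE_NVDEC_INITIAL_BUFFERING
    info.ulNumOutputSurfaces = 2;   // surfaces mappable at once via cuvidMapVideoFrame
    info.DeinterlaceMode     = cudaVideoDeinterlaceMode_Weave;  // progressive content

    CUvideodecoder decoder = nullptr;
    if (cuvidCreateDecoder(&decoder, &info) != CUDA_SUCCESS) {
        return nullptr;  // caller handles creation failure
    }
    return decoder;  // separate instance per player → independent DPB and queue
}

Because each player owns a separate decoder instance, the DPBs cannot interfere with each other; Risk 2 in Section 6.1 relies on exactly this property.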

3.2 VavCore Internal Queue

VavCore (vavcore_decode_to_surface):

// Simplified VavCore decode flow
VavCoreResult vavcore_decode_to_surface(
    VavCorePlayer* player,
    VavCoreSurfaceType surface_type,
    void* surface,
    VavCoreVideoFrame* frame
) {
    // 1. Read next packet from demuxer
    AVPacket* pkt = av_packet_alloc();
    av_read_frame(player->format_ctx, pkt);

    // 2. Submit to NVDEC (CUDA)
    CUVIDPICPARAMS pic_params = {...};
    cuvidDecodePicture(player->cuda_decoder, &pic_params);  // ← NVDEC queue submission

    // 3. Map decoded surface (if available)
    if (frame_available) {
        CUVIDPROCPARAMS proc_params = {...};
        cuvidMapVideoFrame(player->cuda_decoder, ...);      // ← Get decoded frame

        // 4. Copy to D3D12 surface (if provided)
        if (surface) {
            cudaMemcpy2DToArray(...);                       // ← CUDA → D3D12
        }

        return VAVCORE_SUCCESS;
    } else {
        return VAVCORE_PACKET_ACCEPTED;  // Buffering in CUDA DPB
    }
}

Key Insight: cuvidDecodePicture() is thread-safe and manages its own queue internally.
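
To make this concrete, below is a minimal sketch (setup and parameter filling omitted; the helper name is hypothetical) of the per-player submit/map pattern this design relies on:

// Sketch: each player thread talks only to its own decoder, with no
// cross-player lock. cuvidDecodePicture() queues the request (blocking only
// on hardware back-pressure); cuvidMapVideoFrame64() waits for THIS frame only.
#include <nvcuvid.h>

void SubmitAndMapOneFrame(CUvideodecoder decoder, CUVIDPICPARAMS* picParams) {
    cuvidDecodePicture(decoder, picParams);  // async submission to the NVDEC queue

    CUdeviceptr devPtr = 0;
    unsigned int pitch = 0;
    CUVIDPROCPARAMS procParams = {};
    cuvidMapVideoFrame64(decoder, picParams->CurrPicIdx,
                         &devPtr, &pitch, &procParams);  // blocks for own frame only
    // ... copy/convert the decoded NV12 frame at devPtr ...
    cuvidUnmapVideoFrame64(decoder, devPtr);  // release the output surface
}

Four such threads can run concurrently; serialization happens inside NVDEC, not in application code.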


4. Proposed Architecture: NVDEC Self-Queue Management

4.1 Core Concept

Remove Round-Robin entirely and rely on NVDEC's internal queue:

Timeline for 4 players (P1, P2, P3, P4) - NO Round-Robin:

Frame N:
  0.0ms - P1, P2, P3, P4 all call vavcore_decode_to_surface() simultaneously

  NVDEC Queue (internal):
  ┌─────────────────────────────────────┐
  │ [P1] [P2] [P3] [P4]                 │  ← FIFO queue
  │  ↓    ↓    ↓    ↓                   │
  │  Decode requests processed in order │
  └─────────────────────────────────────┘

  0.0ms - P1 decode → 2.5ms (NVDEC processing)
  2.5ms - P2 decode → 2.7ms (NVDEC processing)
  5.2ms - P3 decode → 3.1ms (NVDEC processing)
  8.3ms - P4 decode → 2.8ms (NVDEC processing)

BUT: Each player's thread doesn't BLOCK on NVDEC completion
     - cuvidDecodePicture() returns immediately (async submission)
     - cuvidMapVideoFrame() waits only for its OWN frame

4.2 Implementation Changes

4.2.1 Remove Round-Robin Synchronization

FrameProcessor.cpp (lines 98-107) - DELETE:

// DELETE THIS ENTIRE BLOCK:
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());

GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());

FrameProcessor.cpp (lines 209, 244) - DELETE:

// DELETE THIS:
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);

4.2.2 Keep Initial Buffering Synchronization

IMPORTANT: Keep the synchronization barrier for initial 16-frame buffering completion:

FrameProcessor.cpp (lines 215-220) - KEEP:

// Synchronization barrier: Wait for all players to complete INITIAL_BUFFERING
// This ensures all players start TRIPLE_FILLING phase simultaneously
if (m_framesDecoded == VAVCORE_NVDEC_INITIAL_BUFFERING) {
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING completed - signaling and waiting for all players", m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().SignalPlayerBuffered(m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().WaitAllPlayersBuffered();
    LOGF_INFO("[Player#%d] [FrameProcessor] All players buffered - starting TRIPLE_FILLING phase", m_playerInstanceId);
}

Rationale: This barrier ensures all players have filled their CUDA DPB before starting triple buffer filling. This is a ONE-TIME synchronization (not per-frame).
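
The barrier itself is not shown in this document; a minimal sketch of what SignalPlayerBuffered() / WaitAllPlayersBuffered() could look like (member names are assumptions, the real GlobalFrameScheduler may differ) is:

// Hypothetical one-time counting barrier: each player signals completion, then
// waits until all players have signaled. Runs once per playback session.
#include <condition_variable>
#include <mutex>

void GlobalFrameScheduler::SignalPlayerBuffered(int /*playerId*/) {
    std::lock_guard<std::mutex> lock(m_barrierMutex);
    ++m_playersBuffered;
    if (m_playersBuffered == m_totalPlayers) {
        m_barrierCV.notify_all();  // last player to finish releases everyone
    }
}

void GlobalFrameScheduler::WaitAllPlayersBuffered() {
    std::unique_lock<std::mutex> lock(m_barrierMutex);
    m_barrierCV.wait(lock, [this]() {
        return m_playersBuffered == m_totalPlayers;
    });
}

Unlike WaitForMyTurnInBuffering(), this wait fires exactly once per session, so it adds no per-frame latency.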

4.2.3 Updated Decode Flow

New flow (without Round-Robin):

// Phase 1: Initial NVDEC DPB buffering (frames 0-15)
if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING) {
    result = vavcore_decode_to_surface(
        player,
        VAVCORE_SURFACE_D3D12_RESOURCE,
        nullptr,  // NULL surface during initial buffering
        &vavFrame
    );
    // Expected: VAVCORE_PACKET_ACCEPTED for first 16 frames
    // No per-frame synchronization - each player proceeds independently
}

// Phase 2: Triple buffer filling (frames 16-18)
else if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING + VAV2PLAYER_TRIPLE_BUFFER_SIZE) {
    auto backend = m_renderer->GetRGBASurfaceBackend();
    ID3D12Resource* decodeTexture = backend->GetNextDecodeTexture();

    result = vavcore_decode_to_surface(
        player,
        VAVCORE_SURFACE_D3D12_RESOURCE,
        decodeTexture,  // Valid D3D12 texture
        &vavFrame
    );

    if (result == VAVCORE_SUCCESS) {
        backend->AdvanceDecodeOnly();
    }
}

// Phase 3: Normal operation (frame 19+)
else {
    auto backend = m_renderer->GetRGBASurfaceBackend();
    ID3D12Resource* decodeTexture = backend->GetNextDecodeTexture();

    result = vavcore_decode_to_surface(
        player,
        VAVCORE_SURFACE_D3D12_RESOURCE,
        decodeTexture,
        &vavFrame
    );

    if (result == VAVCORE_SUCCESS) {
        backend->AdvanceFrame();
    }
}

5. Synchronization Strategy

5.1 What Replaces Round-Robin?

Answer: Nothing (for per-frame synchronization).

NVDEC's internal queue provides natural serialization:

  • Thread-safe submission via cuvidDecodePicture()
  • FIFO ordering prevents starvation
  • Back-pressure when queue full (blocking at hardware level)

5.2 Remaining Synchronization Points

ONE-TIME synchronization (kept):

  1. Initial buffering completion (lines 215-220):

    • Ensures all players complete 16-frame CUDA DPB buffering
    • Prevents race condition where Player#1 starts rendering while Player#4 is still buffering
    • ONE-TIME event (happens once per playback session)
  2. Triple buffer filling completion (implicit):

    • Each player fills textures [0, 1, 2] before starting normal playback
    • No explicit synchronization needed (local state machine per player)

NO per-frame synchronization (removed):

  • Round-Robin turn-based waiting (DELETED)
  • Per-frame SignalNextPlayer (DELETED)

6. Risk Analysis

6.1 Potential Issues

Risk 1: NVDEC Queue Saturation

Description: If all 4 players submit frames simultaneously, NVDEC queue might fill up.

Analysis:

  • NVDEC queue depth: 8-16 frames (hardware dependent)
  • Submission rate: 4 players × 30fps = 120 submissions/sec
  • Processing rate: the RTX 3080's NVDEC can sustain roughly 240 frames/sec of this content (the equivalent of four 4K@60fps streams)
  • Conclusion: Processing rate (240fps) > Submission rate (120fps) → No saturation

Mitigation: None needed (hardware has sufficient throughput).

Risk 2: Frame Reordering Across Players

Description: NVDEC might interleave frames from different players.

Analysis:

  • Each player has separate CUvideodecoder instance
  • Each decoder has its own CUDA DPB (16 frames)
  • NVDEC processes each decoder's queue independently
  • Conclusion: No cross-player interference (separate decoder instances)

Mitigation: None needed (architecture prevents this).

Risk 3: Unbalanced Decode Timing

Description: One player might consistently decode faster, causing timing drift.

Analysis:

  • Timing controlled by PlaybackController (PlaybackTimingThread)
  • ProcessNextFrame() called at regular intervals (33.33ms for 30fps)
  • Even if decode is fast (2ms), player waits for next timing tick
  • Conclusion: Timing thread prevents drift (independent of decode speed)

Mitigation: None needed (PlaybackController already handles this; see the pacing sketch below).
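
The pacing behavior described above amounts to a fixed-cadence loop like the following sketch (loop structure and names are assumptions; the actual PlaybackTimingThread code is not shown here):

// Sketch: frame cadence is driven by absolute deadlines, so a 2ms decode and a
// 5ms decode both present at the same 33.33ms tick and no drift accumulates.
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

void PlaybackTimingLoop(std::atomic<bool>& running,
                        const std::function<void()>& processNextFrame) {
    using clock = std::chrono::steady_clock;
    const auto interval = std::chrono::microseconds(33333);  // 30fps tick
    auto nextTick = clock::now();

    while (running.load()) {
        processNextFrame();                       // e.g. FrameProcessor::ProcessNextFrame(), 2-5ms
        nextTick += interval;                     // advance the absolute deadline
        std::this_thread::sleep_until(nextTick);  // sleep out the remaining time
    }
}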

Risk 4: CUDA Context Switching Overhead

Description: Frequent context switches between 4 players might add latency.

Analysis:

  • Each player uses the same CUDA context (shared across application)
  • VavCore creates one context per GPU device (not per player)
  • Context switch cost: ~10-50μs
  • Conclusion: Overhead negligible (<0.05ms per frame)

Mitigation: None needed (minimal overhead).


7. Performance Impact Analysis

7.1 Expected Improvements

Current Performance (with Round-Robin):

Player#1: DECODE: 2.5ms (no wait) + QUEUE_DELAY: 8ms = 10.5ms
Player#2: DECODE: 2.7ms (wait 2.5ms) + QUEUE_DELAY: 9.9ms = 12.6ms
Player#3: DECODE: 53.6ms (wait 50ms) + QUEUE_DELAY: 58.0ms = 111.6ms ❌
Player#4: DECODE: 2.8ms (wait 8ms) + QUEUE_DELAY: 12ms = 14.8ms

Expected Performance (without Round-Robin):

Player#1: DECODE: 2.5ms (no wait) + QUEUE_DELAY: 5ms = 7.5ms ✅
Player#2: DECODE: 2.7ms (no wait) + QUEUE_DELAY: 5ms = 7.7ms ✅
Player#3: DECODE: 3.1ms (no wait) + QUEUE_DELAY: 6ms = 9.1ms ✅ (58ms → 6ms!)
Player#4: DECODE: 2.8ms (no wait) + QUEUE_DELAY: 5ms = 7.8ms ✅

Key Improvement: Player#3 QUEUE_DELAY reduces from 58ms to ~6ms (90% reduction).

7.2 Worst-Case Scenario

If NVDEC queue becomes full (unlikely, see Risk 1):

  • cuvidDecodePicture() blocks until queue slot available
  • Blocking time: ~10-20ms (time for NVDEC to process one frame)
  • Still better than Round-Robin (50ms wait)

Comparison:

  • Round-Robin worst-case: 50ms wait (sequential blocking)
  • NVDEC queue worst-case: 20ms wait (hardware back-pressure)
  • Conclusion: Even worst-case is 2.5× faster

8. Implementation Plan

8.1 Code Changes

File: FrameProcessor.cpp

Change 1: Remove Round-Robin waiting (lines 98-107)

// DELETE:
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());

GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());

Change 2: Remove Round-Robin signaling (line 209)

// DELETE:
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player",
          m_playerInstanceId, m_framesDecoded.load());

Change 3: Remove Round-Robin signaling (line 244)

// DELETE:
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player (SUCCESS path)",
          m_playerInstanceId, m_framesDecoded.load());

KEEP: Initial buffering synchronization (lines 215-220)

// KEEP THIS:
if (m_framesDecoded == VAVCORE_NVDEC_INITIAL_BUFFERING) {
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING completed - signaling and waiting for all players", m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().SignalPlayerBuffered(m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().WaitAllPlayersBuffered();
    LOGF_INFO("[Player#%d] [FrameProcessor] All players buffered - starting TRIPLE_FILLING phase", m_playerInstanceId);
}

8.2 Testing Plan

Test 1: Verify QUEUE_DELAY reduction

  • Run 4-player playback
  • Monitor time.log for QUEUE_DELAY values
  • Expected: All players <10ms (currently Player#3 is 58ms)

Test 2: Verify no frame drops

  • Run 10-minute playback session
  • Check m_framesDropped counter
  • Expected: 0 dropped frames

Test 3: Verify smooth playback

  • Visual inspection for jitter/stuttering
  • Expected: Smooth 30fps playback across all 4 players

Test 4: Verify NVDEC queue stability

  • Monitor DECODE times in logs
  • Expected: Consistent 2-5ms (no sudden spikes indicating queue saturation)

8.3 Rollback Plan

If NVDEC queue saturation occurs (unlikely):

  1. Revert code changes (git restore FrameProcessor.cpp)
  2. Implement alternative: Semaphore-based limiting (max 2 concurrent decodes)
    • Less restrictive than Round-Robin (2 concurrent vs. 1 sequential)
    • Prevents queue saturation while reducing blocking time
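
For reference, a minimal sketch of the semaphore-based limiter (C++20; the wrapper name is hypothetical) could look like:

// Hypothetical fallback: cap in-flight NVDEC submissions at 2. Unlike
// Round-Robin, two players can always decode in parallel.
#include <semaphore>

std::counting_semaphore<2> g_decodeSlots{2};  // at most 2 concurrent decodes

VavCoreResult DecodeWithLimit(VavCorePlayer* player, void* surface,
                              VavCoreVideoFrame* frame) {
    g_decodeSlots.acquire();   // blocks only while 2 decodes are in flight
    VavCoreResult result = vavcore_decode_to_surface(
        player, VAVCORE_SURFACE_D3D12_RESOURCE, surface, frame);
    g_decodeSlots.release();   // free the slot for the next waiting player
    return result;
}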

9. Comparison Table

| Aspect | Round-Robin (Current) | NVDEC Self-Queue (Proposed) |
| --- | --- | --- |
| Synchronization | Per-frame sequential | Hardware-managed FIFO |
| Blocking Time | 0-50ms (depends on player position) | 0-20ms (only if queue full) |
| QUEUE_DELAY | 8-58ms (varies by player) | 5-10ms (consistent) |
| Code Complexity | High (GlobalFrameScheduler) | Low (delete code) |
| NVDEC Utilization | Low (sequential submission) | High (parallel submission) |
| Frame Drop Risk | High (cascading delays) | Low (independent timing) |

10. Conclusion

Recommendation: Remove Round-Robin synchronization and rely on NVDEC's internal queue management.

Justification:

  1. Performance: 90% reduction in QUEUE_DELAY for Player#3 (58ms → 6ms)
  2. Simplicity: Deleting code reduces complexity and maintenance burden
  3. Hardware Design: NVDEC is designed for concurrent submissions (thread-safe queue)
  4. Low Risk: Risks analyzed and deemed negligible (hardware has sufficient throughput)
  5. Proven Architecture: Industry-standard approach (FFmpeg, VLC, Chrome all submit concurrently)

One-Time Synchronization (kept):

  • Initial buffering completion barrier ensures all players start TRIPLE_FILLING phase together
  • This is necessary for correct startup behavior (not a performance bottleneck)

Next Steps:

  1. User reviews this design document
  2. If approved, implement code changes (3 deletions in FrameProcessor.cpp)
  3. Test and verify QUEUE_DELAY improvement
  4. Monitor for any unexpected issues (rollback plan available)

End of Design Document