# NVDEC Self-Queue Management Design
**Date**: 2025-10-11
**Author**: Claude
**Purpose**: Alternative architecture to Round-Robin synchronization for 4-player simultaneous AV1 playback

---

## 1. Executive Summary

### Current Problem
Round-Robin synchronization causes sequential blocking, resulting in:
- Player#3 DECODE: 53.6ms (waiting for Player#2)
- Player#3 QUEUE_DELAY: 58.0ms (UI thread callback delayed)
- Video jitter/stuttering despite Triple Buffering, VSync removal, and rendering separation

### Proposed Solution
Remove Round-Robin and rely on NVDEC's internal queue management:
- All players submit decode requests independently (no blocking)
- NVDEC manages concurrent requests using its internal FIFO queue
- Expected DECODE time: 2-5ms (actual hardware decode time)
- Expected QUEUE_DELAY: <10ms (consistent with Player#2 performance)

---

## 2. Current Architecture Analysis

### 2.1 Round-Robin Synchronization Flow

```
Timeline for 4 players (P1, P2, P3, P4):

Frame N:
  0ms    - P1 WaitForMyTurnInBuffering() → ACQUIRED (turn=1)
  0ms    - P1 vavcore_decode_to_surface() → 2.5ms
  2.5ms  - P1 SignalNextPlayer() → turn=2

  2.5ms  - P2 WaitForMyTurnInBuffering() → ACQUIRED (turn=2)
  2.5ms  - P2 vavcore_decode_to_surface() → 2.7ms
  5.2ms  - P2 SignalNextPlayer() → turn=3

  5.2ms  - P3 WaitForMyTurnInBuffering() → ACQUIRED (turn=3)
  5.2ms  - P3 vavcore_decode_to_surface() → 3.1ms
  8.3ms  - P3 SignalNextPlayer() → turn=4

  8.3ms  - P4 WaitForMyTurnInBuffering() → ACQUIRED (turn=4)
  8.3ms  - P4 vavcore_decode_to_surface() → 2.8ms
  11.1ms - P4 SignalNextPlayer() → turn=1 (wrap around)

Total latency for P4: 11.1ms (accumulated waiting + decode)
```

**Problem**: Each player's wait is the sum of the decode times of all players ahead of it. Player#3 waits 5.2ms and Player#4 waits 8.3ms before even starting to decode, so a stall in any one player cascades to every player after it.

### 2.2 Code Locations

**FrameProcessor.cpp (lines 98-107):**
```cpp
// Round-Robin coordination for ALL frames to prevent NVDEC queue saturation
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());

// Wait for my turn in round-robin (blocking)
GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());
```

**GlobalFrameScheduler.cpp:**
```cpp
void GlobalFrameScheduler::WaitForMyTurnInBuffering(int playerId) {
    std::unique_lock<std::mutex> lock(m_bufferingMutex);
    m_bufferingCV.wait(lock, [this, playerId]() {
        return m_currentPlayerTurn == playerId;
    });
}

void GlobalFrameScheduler::SignalNextPlayer(int playerId) {
    std::unique_lock<std::mutex> lock(m_bufferingMutex);
    m_currentPlayerTurn = (playerId % m_totalPlayers) + 1;
    m_bufferingCV.notify_all();
}
```

---

## 3. NVDEC Queue Characteristics

### 3.1 NVIDIA NVDEC Architecture

**Hardware Queue Management:**
- NVDEC has an internal FIFO queue for decode requests
- Queue depth: assumed to be 8-16 frames (hardware and driver dependent; to be confirmed by measurement)
- Concurrent submission: NVDEC accepts multiple outstanding `cuvidDecodePicture()` calls
- Internal scheduling: NVDEC processes requests in submission order
- Back-pressure: when the queue is full, `cuvidDecodePicture()` blocks until a slot becomes available

**CUDA DPB (Decoded Picture Buffer):**
- VavCore uses CUDA memory for the DPB (for frame reordering)
- DPB size: 16 frames (VAVCORE_NVDEC_INITIAL_BUFFERING)
- Separate from the NVDEC hardware queue
- Each player has its own DPB in CUDA memory
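
The FIFO ordering and back-pressure behavior described above can be modeled in plain C++. This is only an illustrative sketch of the assumed queue semantics: `BoundedFifo` is a hypothetical class, it is not part of NVDEC or VavCore, and the real queue lives inside the driver and hardware.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical model of a bounded FIFO with back-pressure (not NVDEC code).
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t capacity) : m_capacity(capacity) {}

    // Blocks while the queue is full, mirroring the assumed back-pressure.
    void Push(int request) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notFull.wait(lock, [this] { return m_items.size() < m_capacity; });
        m_items.push_back(request);
        m_notEmpty.notify_one();
    }

    // Non-blocking variant: returns false instead of waiting when full.
    bool TryPush(int request) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_items.size() >= m_capacity) return false;
        m_items.push_back(request);
        m_notEmpty.notify_one();
        return true;
    }

    // Blocks while empty; returns the oldest request (FIFO order).
    int Pop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notEmpty.wait(lock, [this] { return !m_items.empty(); });
        int request = m_items.front();
        m_items.pop_front();
        m_notFull.notify_one();
        return request;
    }

private:
    const std::size_t m_capacity;
    std::mutex m_mutex;
    std::condition_variable m_notFull;
    std::condition_variable m_notEmpty;
    std::deque<int> m_items;
};
```

With a capacity of 8, four players submitting at once never reach the blocking path in `Push`, which is the situation the design relies on.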

### 3.2 VavCore Internal Queue

**VavCore (vavcore_decode_to_surface):**
```c
// Simplified VavCore decode flow
VavCoreResult vavcore_decode_to_surface(
    VavCorePlayer* player,
    VavCoreSurfaceType surface_type,
    void* surface,
    VavCoreVideoFrame* frame
) {
    // 1. Read packet from demuxer
    //    av_read_frame() fills a caller-allocated packet and returns < 0 on
    //    EOF or error (error handling and packet cleanup omitted here)
    AVPacket* pkt = av_packet_alloc();
    av_read_frame(player->format_ctx, pkt);

    // 2. Submit to NVDEC (CUDA)
    CUVIDPICPARAMS pic_params = {...};
    cuvidDecodePicture(player->cuda_decoder, &pic_params);  // ← NVDEC queue submission

    // 3. Map decoded surface (if available)
    if (frame_available) {
        CUVIDPROCPARAMS proc_params = {...};
        cuvidMapVideoFrame(player->cuda_decoder, ...);  // ← Get decoded frame

        // 4. Copy to D3D12 surface (if provided)
        if (surface) {
            cudaMemcpy2DToArray(...);  // ← CUDA → D3D12
        }

        return VAVCORE_SUCCESS;
    } else {
        return VAVCORE_PACKET_ACCEPTED;  // Buffering in CUDA DPB
    }
}
```

**Key Insight**: `cuvidDecodePicture()` is treated here as thread-safe and as managing its own queue internally, so no application-level serialization should be needed. The whole design rests on this assumption, which Test 4 in Section 8.2 is meant to validate under 4-player load.

---

## 4. Proposed Architecture: NVDEC Self-Queue Management

### 4.1 Core Concept

**Remove Round-Robin entirely and rely on NVDEC's internal queue:**

```
Timeline for 4 players (P1, P2, P3, P4) - NO Round-Robin:

Frame N:
  0.0ms - P1, P2, P3, P4 all call vavcore_decode_to_surface() simultaneously

  NVDEC Queue (internal):
  ┌─────────────────────────────────────┐
  │ [P1] [P2] [P3] [P4]                 │ ← FIFO queue
  │  ↓    ↓    ↓    ↓                   │
  │ Decode requests processed in order  │
  └─────────────────────────────────────┘

  0.0ms - P1 decode → 2.5ms (NVDEC processing)
  2.5ms - P2 decode → 2.7ms (NVDEC processing)
  5.2ms - P3 decode → 3.1ms (NVDEC processing)
  8.3ms - P4 decode → 2.8ms (NVDEC processing)

  BUT: Each player's thread doesn't BLOCK on NVDEC completion
    - cuvidDecodePicture() returns immediately (async submission)
    - cuvidMapVideoFrame() waits only for its OWN frame
```
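
The "submit returns immediately, wait only for your own frame" behavior in the timeline can be illustrated with a small model built on `std::future`. This is a sketch under stated assumptions: `FifoDecodeModel` and `DecodeResult` are hypothetical names, a single worker thread stands in for the decode engine, and none of this is the NVDEC API.

```cpp
#include <condition_variable>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

struct DecodeResult {
    int playerId;  // who submitted the request
    int order;     // 1-based position in which the engine completed it
};

// Hypothetical stand-in for a FIFO decode engine (not the NVDEC API).
class FifoDecodeModel {
public:
    FifoDecodeModel() : m_worker([this] { Run(); }) {}

    ~FifoDecodeModel() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stop = true;
        }
        m_wake.notify_one();
        m_worker.join();
    }

    // Returns at once; the future blocks only when the caller asks for its
    // own result, like the later "map" step in the timeline.
    std::future<DecodeResult> Submit(int playerId) {
        std::packaged_task<DecodeResult()> job([this, playerId] {
            return DecodeResult{playerId, ++m_completed};
        });
        std::future<DecodeResult> done = job.get_future();
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_jobs.push(std::move(job));
        }
        m_wake.notify_one();
        return done;
    }

private:
    void Run() {
        for (;;) {
            std::packaged_task<DecodeResult()> job;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_wake.wait(lock, [this] { return m_stop || !m_jobs.empty(); });
                if (m_jobs.empty()) return;  // stop requested and queue drained
                job = std::move(m_jobs.front());
                m_jobs.pop();
            }
            job();  // the "decode" runs outside the lock, in FIFO order
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_wake;
    std::queue<std::packaged_task<DecodeResult()>> m_jobs;
    bool m_stop = false;
    int m_completed = 0;   // touched only by the worker thread
    std::thread m_worker;  // declared last so the other members exist first
};
```

In this model a player that submits third receives `order == 3`, yet its own thread is free between `Submit()` and `get()`, which is the property the proposal depends on.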

### 4.2 Implementation Changes

#### 4.2.1 Remove Round-Robin Synchronization

**FrameProcessor.cpp (lines 98-107) - DELETE:**
```cpp
// DELETE THIS ENTIRE BLOCK:
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());

GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());
```

**FrameProcessor.cpp (lines 209, 244) - DELETE:**
```cpp
// DELETE THIS:
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
```

#### 4.2.2 Keep Initial Buffering Synchronization

**IMPORTANT**: Keep the synchronization barrier that waits for every player to complete its initial 16-frame buffering:

**FrameProcessor.cpp (lines 215-220) - KEEP:**
```cpp
// Synchronization barrier: Wait for all players to complete INITIAL_BUFFERING
// This ensures all players start TRIPLE_FILLING phase simultaneously
if (m_framesDecoded == VAVCORE_NVDEC_INITIAL_BUFFERING) {
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING completed - signaling and waiting for all players", m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().SignalPlayerBuffered(m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().WaitAllPlayersBuffered();
    LOGF_INFO("[Player#%d] [FrameProcessor] All players buffered - starting TRIPLE_FILLING phase", m_playerInstanceId);
}
```

**Rationale**: This barrier ensures all players have filled their CUDA DPB before starting triple buffer filling. This is a ONE-TIME synchronization (not per-frame).

#### 4.2.3 Updated Decode Flow

**New flow (without Round-Robin):**

```cpp
// Phase 1: Initial NVDEC DPB buffering (frames 0-15)
if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING) {
    result = vavcore_decode_to_surface(
        player,
        VAVCORE_SURFACE_D3D12_RESOURCE,
        nullptr,  // NULL surface during initial buffering
        &vavFrame
    );
    // Expected: VAVCORE_PACKET_ACCEPTED for first 16 frames
    // No per-frame synchronization - each player proceeds independently
}

// Phase 2: Triple buffer filling (frames 16-18)
else if (m_framesDecoded < VAVCORE_NVDEC_INITIAL_BUFFERING + VAV2PLAYER_TRIPLE_BUFFER_SIZE) {
    auto backend = m_renderer->GetRGBASurfaceBackend();
    ID3D12Resource* decodeTexture = backend->GetNextDecodeTexture();

    result = vavcore_decode_to_surface(
        player,
        VAVCORE_SURFACE_D3D12_RESOURCE,
        decodeTexture,  // Valid D3D12 texture
        &vavFrame
    );

    if (result == VAVCORE_SUCCESS) {
        backend->AdvanceDecodeOnly();
    }
}

// Phase 3: Normal operation (frame 19+)
else {
    auto backend = m_renderer->GetRGBASurfaceBackend();
    ID3D12Resource* decodeTexture = backend->GetNextDecodeTexture();

    result = vavcore_decode_to_surface(
        player,
        VAVCORE_SURFACE_D3D12_RESOURCE,
        decodeTexture,
        &vavFrame
    );

    if (result == VAVCORE_SUCCESS) {
        backend->AdvanceFrame();
    }
}
```
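
The three-way branch above is a pure function of the decoded-frame counter, so the phase boundaries can be checked in isolation. The sketch below assumes the two constants are 16 and 3, matching the "16-frame" buffering and triple buffering described in this document; `DecodePhase`, `PhaseForFrame`, and the `k...` constants are hypothetical names, and the real values live in the project headers. The `static_assert` lines confirm the frame ranges given in the comments (0-15, 16-18, 19+).

```cpp
#include <cstdint>

// Assumed values; the real constants are VAVCORE_NVDEC_INITIAL_BUFFERING and
// VAV2PLAYER_TRIPLE_BUFFER_SIZE in the project headers.
constexpr std::uint64_t kInitialBuffering = 16;
constexpr std::uint64_t kTripleBufferSize = 3;

enum class DecodePhase { InitialBuffering, TripleFilling, Normal };

// Maps the number of frames decoded so far to the phase of the next frame.
constexpr DecodePhase PhaseForFrame(std::uint64_t framesDecoded) {
    return framesDecoded < kInitialBuffering
               ? DecodePhase::InitialBuffering
               : framesDecoded < kInitialBuffering + kTripleBufferSize
                     ? DecodePhase::TripleFilling
                     : DecodePhase::Normal;
}

static_assert(PhaseForFrame(0) == DecodePhase::InitialBuffering, "frames 0-15");
static_assert(PhaseForFrame(15) == DecodePhase::InitialBuffering, "frames 0-15");
static_assert(PhaseForFrame(16) == DecodePhase::TripleFilling, "frames 16-18");
static_assert(PhaseForFrame(18) == DecodePhase::TripleFilling, "frames 16-18");
static_assert(PhaseForFrame(19) == DecodePhase::Normal, "frame 19+");
```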

---

## 5. Synchronization Strategy

### 5.1 What Replaces Round-Robin?

**Answer: Nothing (for per-frame synchronization).**

NVDEC's internal queue provides natural serialization:
- Thread-safe submission via `cuvidDecodePicture()`
- FIFO ordering prevents starvation
- Back-pressure when the queue is full (blocking at hardware level)

### 5.2 Remaining Synchronization Points

**ONE-TIME synchronization (kept):**

1. **Initial buffering completion** (lines 215-220):
   - Ensures all players complete 16-frame CUDA DPB buffering
   - Prevents a race condition where Player#1 starts rendering while Player#4 is still buffering
   - ONE-TIME event (happens once per playback session)

2. **Triple buffer filling completion** (implicit):
   - Each player fills textures [0, 1, 2] before starting normal playback
   - No explicit synchronization needed (local state machine per player)

**NO per-frame synchronization** (removed):
- Round-Robin turn-based waiting (DELETED)
- Per-frame SignalNextPlayer (DELETED)

---

## 6. Risk Analysis

### 6.1 Potential Issues

#### Risk 1: NVDEC Queue Saturation
**Description**: If all 4 players submit frames simultaneously, the NVDEC queue might fill up.

**Analysis**:
- NVDEC queue depth: 8-16 frames (hardware dependent)
- Submission rate: 4 players × 30fps = 120 submissions/sec
- Processing rate: estimated at 240 frames/sec for 4K@60fps-class content on an RTX 3080; the 2.5-3.1ms per-frame decode times in Section 2.1 imply roughly 320-400 frames/sec, which is consistent with this estimate
- Burst size: at most 4 requests arrive at once, which fits within the assumed queue depth
- **Conclusion**: Processing rate (240fps) > Submission rate (120fps), and a 4-request burst fits in the queue → No saturation expected

**Mitigation**: None needed (hardware has sufficient throughput).
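
The headroom argument above is simple arithmetic and can be written down as compile-time checks. This is a sketch using the figures from the analysis; the 240 fps capacity and the minimum queue depth of 8 are the document's estimates rather than measured values, and all identifiers are hypothetical.

```cpp
// Figures taken from the analysis above; capacity and queue depth are
// estimates, not measurements.
constexpr int kPlayers = 4;
constexpr int kFpsPerPlayer = 30;
constexpr int kDecoderCapacityFps = 240;
constexpr int kAssumedMinQueueDepth = 8;

// Total decode requests per second across all players.
constexpr int SubmissionRateFps(int players, int fpsPerPlayer) {
    return players * fpsPerPlayer;
}

// Share of decoder capacity in use, as a whole-number percentage.
constexpr int UtilizationPercent(int submissionFps, int capacityFps) {
    return submissionFps * 100 / capacityFps;
}

static_assert(SubmissionRateFps(kPlayers, kFpsPerPlayer) == 120, "4 x 30 fps");
static_assert(UtilizationPercent(120, kDecoderCapacityFps) == 50, "half of capacity");
static_assert(kPlayers <= kAssumedMinQueueDepth, "a 4-request burst fits in the queue");
```

Under these assumptions the decoder runs at about half of its estimated capacity, so the conclusion holds as long as the real capacity is not less than half of the estimate.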

#### Risk 2: Frame Reordering Across Players
**Description**: NVDEC might interleave frames from different players.

**Analysis**:
- Each player has a separate `CUvideodecoder` instance
- Each decoder has its own CUDA DPB (16 frames)
- NVDEC processes each decoder's queue independently
- **Conclusion**: No cross-player interference (separate decoder instances)

**Mitigation**: None needed (architecture prevents this).

#### Risk 3: Unbalanced Decode Timing
**Description**: One player might consistently decode faster, causing timing drift.

**Analysis**:
- Timing is controlled by PlaybackController (PlaybackTimingThread)
- ProcessNextFrame() is called at regular intervals (33.33ms for 30fps)
- Even if decode is fast (2ms), the player waits for the next timing tick
- **Conclusion**: Timing thread prevents drift (independent of decode speed)

**Mitigation**: None needed (PlaybackController already handles this).
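
To make the "no drift" argument concrete, here is one way a timing thread could compute its tick deadlines. This is an assumption for illustration only: `PlaybackController` internals are not shown in this document, and `TickDeadline` is a hypothetical helper, not an existing function.

```cpp
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Deadline of tick n, computed from the fixed start time rather than by
// adding a rounded 33.33ms interval each frame, so rounding error cannot
// accumulate over a long session.
inline Clock::time_point TickDeadline(Clock::time_point start, std::int64_t n, std::int64_t fps) {
    const std::chrono::nanoseconds offset(n * 1000000000LL / fps);
    return start + std::chrono::duration_cast<Clock::duration>(offset);
}
```

A decode that finishes in 2ms simply leaves the thread sleeping until the next deadline, so a faster player does not run ahead of the others.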

#### Risk 4: CUDA Context Switching Overhead
**Description**: Frequent context switches between 4 players might add latency.

**Analysis**:
- Each player uses the same CUDA context (shared across the application)
- VavCore creates one context per GPU device (not per player)
- Context switch cost, if one does occur: ~10-50μs (microseconds)
- **Conclusion**: Overhead negligible (at most ~0.05ms per frame)

**Mitigation**: None needed (minimal overhead).

---

## 7. Performance Impact Analysis

### 7.1 Expected Improvements

#### Current Performance (with Round-Robin):
```
Player#1: DECODE: 2.5ms  (no wait)    + QUEUE_DELAY: 8ms    = 10.5ms
Player#2: DECODE: 2.7ms  (wait 2.5ms) + QUEUE_DELAY: 9.9ms  = 12.6ms
Player#3: DECODE: 53.6ms (wait 50ms)  + QUEUE_DELAY: 58.0ms = 111.6ms ❌
Player#4: DECODE: 2.8ms  (wait 8ms)   + QUEUE_DELAY: 12ms   = 14.8ms
```

#### Expected Performance (without Round-Robin):
```
Player#1: DECODE: 2.5ms (no wait) + QUEUE_DELAY: 5ms = 7.5ms ✅
Player#2: DECODE: 2.7ms (no wait) + QUEUE_DELAY: 5ms = 7.7ms ✅
Player#3: DECODE: 3.1ms (no wait) + QUEUE_DELAY: 6ms = 9.1ms ✅ (58ms → 6ms!)
Player#4: DECODE: 2.8ms (no wait) + QUEUE_DELAY: 5ms = 7.8ms ✅
```

**Key Improvement**: Player#3 QUEUE_DELAY is expected to drop from 58ms to ~6ms (about a 90% reduction).

### 7.2 Worst-Case Scenario

**If NVDEC queue becomes full** (unlikely, see Risk 1):
- `cuvidDecodePicture()` blocks until a queue slot becomes available
- Blocking time: budgeted at ~10-20ms, a conservative allowance given that a single frame decodes in 2-5ms
- Still better than Round-Robin (50ms wait)

**Comparison**:
- Round-Robin worst-case: 50ms wait (sequential blocking)
- NVDEC queue worst-case: 20ms wait (hardware back-pressure)
- **Conclusion**: Even the worst case is 2.5× shorter (20ms vs. 50ms)

---

## 8. Implementation Plan

### 8.1 Code Changes

**File: FrameProcessor.cpp**

**Change 1: Remove Round-Robin waiting (lines 98-107)**
```cpp
// DELETE:
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - WAITING for turn",
          m_playerInstanceId, m_framesDecoded.load());

GlobalFrameScheduler::GetInstance().WaitForMyTurnInBuffering(m_playerInstanceId);

LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - ACQUIRED turn",
          m_playerInstanceId, m_framesDecoded.load());
```

**Change 2: Remove Round-Robin signaling (line 209)**
```cpp
// DELETE:
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player",
          m_playerInstanceId, m_framesDecoded.load());
```

**Change 3: Remove Round-Robin signaling (line 244)**
```cpp
// DELETE:
GlobalFrameScheduler::GetInstance().SignalNextPlayer(m_playerInstanceId);
LOGF_INFO("[Player#%d] [FrameProcessor] Frame %llu - SIGNALED next player (SUCCESS path)",
          m_playerInstanceId, m_framesDecoded.load());
```

**KEEP: Initial buffering synchronization (lines 215-220)**
```cpp
// KEEP THIS:
if (m_framesDecoded == VAVCORE_NVDEC_INITIAL_BUFFERING) {
    LOGF_INFO("[Player#%d] [FrameProcessor] INITIAL_BUFFERING completed - signaling and waiting for all players", m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().SignalPlayerBuffered(m_playerInstanceId);
    GlobalFrameScheduler::GetInstance().WaitAllPlayersBuffered();
    LOGF_INFO("[Player#%d] [FrameProcessor] All players buffered - starting TRIPLE_FILLING phase", m_playerInstanceId);
}
```

### 8.2 Testing Plan

**Test 1: Verify QUEUE_DELAY reduction**
- Run 4-player playback
- Monitor `time.log` for QUEUE_DELAY values
- Expected: All players <10ms (currently Player#3 is 58ms)

**Test 2: Verify no frame drops**
- Run 10-minute playback session
- Check `m_framesDropped` counter
- Expected: 0 dropped frames

**Test 3: Verify smooth playback**
- Visual inspection for jitter/stuttering
- Expected: Smooth 30fps playback across all 4 players

**Test 4: Verify NVDEC queue stability**
- Monitor DECODE times in logs
- Expected: Consistent 2-5ms (no sudden spikes indicating queue saturation)
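
The pass criteria for Tests 1 and 4 reduce to threshold checks over a series of timings. The helpers below are a hypothetical sketch: the log format of `time.log` is not specified in this document, so the code assumes the QUEUE_DELAY or DECODE samples have already been extracted into a vector of milliseconds.

```cpp
#include <algorithm>
#include <vector>

// True when every sample is strictly below the limit (e.g. QUEUE_DELAY < 10ms).
inline bool AllBelow(const std::vector<double>& samplesMs, double limitMs) {
    return std::all_of(samplesMs.begin(), samplesMs.end(),
                       [limitMs](double v) { return v < limitMs; });
}

// Largest sample, or 0.0 for an empty series; useful for spotting spikes.
inline double MaxSample(const std::vector<double>& samplesMs) {
    return samplesMs.empty()
               ? 0.0
               : *std::max_element(samplesMs.begin(), samplesMs.end());
}
```

For Test 1 the limit would be 10ms on QUEUE_DELAY; for Test 4 a DECODE maximum well above 5ms would point to queue saturation.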

### 8.3 Rollback Plan

If NVDEC queue saturation occurs (unlikely):
1. Revert code changes (`git restore FrameProcessor.cpp` for uncommitted changes, or `git revert` if the change was already committed)
2. Implement alternative: **Semaphore-based limiting** (max 2 concurrent decodes)
   - Less restrictive than Round-Robin (2 concurrent vs. 1 sequential)
   - Prevents queue saturation while reducing blocking time

---

## 9. Comparison Table

| Aspect | Round-Robin (Current) | NVDEC Self-Queue (Proposed) |
|--------|----------------------|----------------------------|
| **Synchronization** | Per-frame sequential | Hardware-managed FIFO |
| **Blocking Time** | 0-50ms (depends on player position) | 0-20ms (only if queue full) |
| **QUEUE_DELAY** | 8-58ms (varies by player) | 5-10ms (consistent) |
| **Code Complexity** | High (GlobalFrameScheduler) | Low (delete code) |
| **NVDEC Utilization** | Low (sequential submission) | High (parallel submission) |
| **Frame Drop Risk** | High (cascading delays) | Low (independent timing) |

---

## 10. Conclusion

### Recommended Action: **Remove Round-Robin**

**Justification**:
1. **Performance**: About 90% reduction in QUEUE_DELAY for Player#3 (58ms → 6ms)
2. **Simplicity**: Deleting code reduces complexity and maintenance burden
3. **Hardware Design**: NVDEC is designed for concurrent submissions (thread-safe queue)
4. **Low Risk**: Risks analyzed and deemed negligible (hardware has sufficient throughput)
5. **Established Practice**: Submitting concurrently without application-level turn-taking is understood to be the common approach (as in FFmpeg, VLC, and Chrome)

**One-Time Synchronization** (kept):
- Initial buffering completion barrier ensures all players start TRIPLE_FILLING phase together
- This is necessary for correct startup behavior (not a performance bottleneck)

**Next Steps**:
1. User reviews this design document
2. If approved, implement code changes (3 deletions in FrameProcessor.cpp)
3. Test and verify QUEUE_DELAY improvement
4. Monitor for any unexpected issues (rollback plan available)

---

**End of Design Document**