GlobalFrameBudget Design Document
1. Overview
Purpose
Problem: When 4 VideoPlayerControl2 instances play simultaneously, they all reach the triple-buffer filling bottleneck (frames 16-18, immediately after initial buffering) at the same time. This overflows the NVDEC queue and raises QUEUE_DELAY to 35-42ms, exceeding the 33.33ms frame budget at 30fps.
Solution: Implement a global frame processing budget manager that limits concurrent frame processing during the bottleneck phase. Capping the load at 3 of the 4 players is expected to bring QUEUE_DELAY down to ~28-33ms (within budget).
Key Constraints
- NVDEC DPB_SIZE = 16: Required by AV1 sequence header (min_num_decode_surfaces=9 for test video, up to 12+ for complex GOPs)
- INITIAL_BUFFERING = 16: NVDEC requires full DPB filling for B-frame reordering
- Cannot reduce buffer sizes: DPB_SIZE=4 and DPB_SIZE=8 were both tested and both crash with "Invalid CurrPicIdx"
- Must maintain sync: All 4 players should remain synchronized after initial buffering
2. Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ GlobalFrameBudget │
│ (Singleton) │
├─────────────────────────────────────────────────────────────┤
│ State: │
│ - m_activeFrames: atomic<int> (current active frames) │
│ - MAX_CONCURRENT_FRAMES_BOTTLENECK = 3 (bottleneck limit) │
├─────────────────────────────────────────────────────────────┤
│ Public API: │
│ + TryAcquireFrameSlot(playerId, frameNumber) → bool │
│ + ReleaseFrameSlot(playerId) → void │
│ + GetActiveFrameCount() → int │
│ + GetStatistics() → BudgetStatistics │
│ + ResetStatistics() → void │
└─────────────────────────────────────────────────────────────┘
▲
│ uses
┌────────────────┼────────────────┐
│ │ │
┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐
│ Player#0 │ │ Player#1 │ │ Player#2 │ ...
│FrameProc │ │FrameProc │ │FrameProc │
└──────────┘ └──────────┘ └──────────┘
Call Flow:
1. FrameProcessor::ProcessFrame()
→ Check Phase == TRIPLE_FILLING?
2. YES → TryAcquireFrameSlot()
→ m_activeFrames < 3?
3. YES → Proceed with decode
→ ReleaseFrameSlot() after render completes
4. NO → Skip frame (m_framesDropped++)
3. Processing Phases
FrameProcessor operates in 3 distinct phases:
enum class Phase {
INITIAL_BUFFERING, // frames 0-15: NULL surface submission to NVDEC DPB
TRIPLE_FILLING, // frames 16-18: triple buffer filling (BOTTLENECK)
NORMAL_PLAYBACK // frames 19+: stable rendering
};
Phase Details
| Phase | Frame Range | Behavior | QUEUE_DELAY |
|---|---|---|---|
| INITIAL_BUFFERING | 0-15 | NULL surface, no render | 6-15ms (stable) |
| TRIPLE_FILLING | 16-18 | Fill triple buffer, first renders | 35-42ms (SPIKE) |
| NORMAL_PLAYBACK | 19+ | Steady state rendering | 6-22ms (stable) |
GlobalFrameBudget is only active during the TRIPLE_FILLING phase.
4. Class Interface
Header: GlobalFrameBudget.h
namespace Vav2Player {
class GlobalFrameBudget
{
public:
static GlobalFrameBudget& GetInstance();
// Acquire permission to process frame
// Returns: true if slot acquired, false if budget limit reached
bool TryAcquireFrameSlot(int playerId, uint64_t frameNumber);
// Release slot after processing complete
void ReleaseFrameSlot(int playerId);
// Query current state
int GetActiveFrameCount() const { return m_activeFrames.load(); }
// Statistics
struct BudgetStatistics {
uint64_t totalAcquireAttempts;
uint64_t successfulAcquires;
uint64_t rejectedAcquires;
double rejectionRate;
};
BudgetStatistics GetStatistics() const;
void ResetStatistics();
private:
GlobalFrameBudget() = default;
~GlobalFrameBudget() = default;
// Disable copy/move
GlobalFrameBudget(const GlobalFrameBudget&) = delete;
GlobalFrameBudget& operator=(const GlobalFrameBudget&) = delete;
// Configuration
static constexpr int MAX_CONCURRENT_FRAMES_BOTTLENECK = 3;
// State
std::atomic<int> m_activeFrames{0};
// Statistics
std::atomic<uint64_t> m_totalAcquireAttempts{0};
std::atomic<uint64_t> m_successfulAcquires{0};
std::atomic<uint64_t> m_rejectedAcquires{0};
};
} // namespace Vav2Player
5. Integration with FrameProcessor
FrameProcessor Changes
FrameProcessor.h additions:
class FrameProcessor
{
public:
// Processing phase query
enum class Phase {
INITIAL_BUFFERING,
TRIPLE_FILLING,
NORMAL_PLAYBACK
};
Phase GetCurrentPhase() const;
private:
// Track if budget slot was acquired (for proper release)
std::atomic<bool> m_budgetSlotAcquired{false};
};
FrameProcessor.cpp integration:
bool FrameProcessor::ProcessFrame(VavCorePlayer* player,
std::function<void(bool)> onComplete)
{
// Existing: Skip if previous frame still processing
if (m_frameProcessing.load()) {
m_framesDropped++;
return false;
}
Phase currentPhase = GetCurrentPhase();
// NEW: Apply GlobalFrameBudget during bottleneck phase
if (currentPhase == Phase::TRIPLE_FILLING) {
if (!GlobalFrameBudget::GetInstance().TryAcquireFrameSlot(
m_playerInstanceId, m_framesDecoded)) {
LOGF_DEBUG("[Player#%d] Frame %llu SKIPPED (global budget limit)",
m_playerInstanceId, m_framesDecoded.load());
m_framesDropped++;
return false;
}
m_budgetSlotAcquired = true;
}
m_frameProcessing = true;
// ... existing decode logic ...
// UI thread callback with budget release
m_dispatcherQueue.TryEnqueue([this, renderIndex, onComplete]() {
bool renderSuccess = m_renderer->RenderFrame(renderIndex);
// NEW: Release budget slot after render complete
if (m_budgetSlotAcquired.load()) {
GlobalFrameBudget::GetInstance().ReleaseFrameSlot(m_playerInstanceId);
m_budgetSlotAcquired = false;
}
m_frameProcessing = false;
onComplete(renderSuccess);
});
m_framesDecoded++;
return true;
}
FrameProcessor::Phase FrameProcessor::GetCurrentPhase() const
{
uint64_t decoded = m_framesDecoded.load();
if (decoded < VAVCORE_NVDEC_INITIAL_BUFFERING) {
return Phase::INITIAL_BUFFERING;
}
else if (decoded < VAVCORE_NVDEC_INITIAL_BUFFERING + VAV2PLAYER_TRIPLE_BUFFER_SIZE) {
return Phase::TRIPLE_FILLING;
}
else {
return Phase::NORMAL_PLAYBACK;
}
}
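The phase boundaries can be checked in isolation. The following is a minimal sketch, assuming VAVCORE_NVDEC_INITIAL_BUFFERING is 16 and VAV2PLAYER_TRIPLE_BUFFER_SIZE is 3 (the values implied by the phase table; the real macros are defined elsewhere). `PhaseForFrame` and the `k...` constants are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Assumed values; the real constants come from the VavCore / Vav2Player headers.
constexpr uint64_t kInitialBuffering = 16;  // stands in for VAVCORE_NVDEC_INITIAL_BUFFERING
constexpr uint64_t kTripleBufferSize = 3;   // stands in for VAV2PLAYER_TRIPLE_BUFFER_SIZE

enum class Phase { INITIAL_BUFFERING, TRIPLE_FILLING, NORMAL_PLAYBACK };

// Hypothetical free-function version of FrameProcessor::GetCurrentPhase().
constexpr Phase PhaseForFrame(uint64_t framesDecoded)
{
    if (framesDecoded < kInitialBuffering) {
        return Phase::INITIAL_BUFFERING;
    }
    if (framesDecoded < kInitialBuffering + kTripleBufferSize) {
        return Phase::TRIPLE_FILLING;
    }
    return Phase::NORMAL_PLAYBACK;
}

// Boundary checks evaluated at compile time.
static_assert(PhaseForFrame(15) == Phase::INITIAL_BUFFERING, "frame 15 is still buffering");
static_assert(PhaseForFrame(16) == Phase::TRIPLE_FILLING, "frame 16 starts the bottleneck");
static_assert(PhaseForFrame(18) == Phase::TRIPLE_FILLING, "frame 18 ends the bottleneck");
static_assert(PhaseForFrame(19) == Phase::NORMAL_PLAYBACK, "frame 19 is steady state");
```

Keeping the classifier `constexpr` means an off-by-one in either threshold fails the build rather than showing up as a shifted bottleneck window at runtime.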
6. Simulation Scenario
Timeline with 4 Players
Initial State: m_activeFrames = 0, MAX = 3
t=0ms: All 4 players call Play(), start frame 0
Phase 1 (frames 0-15): INITIAL_BUFFERING
- All 4 players process normally
- No GlobalFrameBudget involvement
- QUEUE_DELAY: 6-15ms (stable)
t=533ms: All 4 players reach frame 16
Phase 2 (frames 16-18): TRIPLE_FILLING (BOTTLENECK)
Frame 16:
t=533ms: Player#0 TryAcquire → m_activeFrames: 0→1 ✅
t=533ms: Player#1 TryAcquire → m_activeFrames: 1→2 ✅
t=533ms: Player#2 TryAcquire → m_activeFrames: 2→3 ✅
t=533ms: Player#3 TryAcquire → REJECTED (3 >= 3) ❌ [FRAME SKIPPED]
t=543ms: Player#0 render complete → Release → m_activeFrames: 3→2
t=543ms: Player#3 ProcessFrame (retry frame 16) → TryAcquire → 2→3 ✅
Frame 17:
Similar pattern: One player skips, retries after slot release
Frame 18:
Similar pattern: One player skips, retries after slot release
t=633ms: All 4 players reach frame 19
Phase 3 (frames 19+): NORMAL_PLAYBACK
- All 4 players process normally
- No GlobalFrameBudget involvement
- QUEUE_DELAY: 6-22ms (stable)
Result:
- Bottleneck phase: 4 players → max 3 concurrent
- NVDEC queue load: 25% reduction
- QUEUE_DELAY: 35-42ms → ~28-33ms (within 33.33ms budget)
- Player synchronization: Maintained (a skipped frame is retried as soon as a slot is released)
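The frame-16 step of this timeline can be replayed with a small single-threaded model. This is only a sketch: `SimBudget`, `SimPlayer`, `Attempt`, `Complete`, and `SimulateFrame16` are hypothetical names, and the model assumes a rejected player keeps its frame number and retries it, as in the timeline.

```cpp
#include <array>
#include <cstdint>

// Hypothetical single-threaded stand-in for GlobalFrameBudget (no atomics needed here).
struct SimBudget {
    static constexpr int kMaxConcurrent = 3;
    int active = 0;

    bool TryAcquire() {
        if (active >= kMaxConcurrent) return false;
        ++active;
        return true;
    }
    void Release() { --active; }
};

struct SimPlayer {
    uint64_t framesDecoded = 16;  // every player has just reached frame 16
    uint64_t framesDropped = 0;
    bool holdsSlot = false;
};

// One ProcessFrame() attempt. A rejected player counts a drop and keeps its
// frame number; an accepted player advances, mirroring m_framesDecoded++.
inline bool Attempt(SimBudget& budget, SimPlayer& p) {
    if (!budget.TryAcquire()) {
        ++p.framesDropped;
        return false;
    }
    p.holdsSlot = true;
    ++p.framesDecoded;
    return true;
}

// Render completion on the UI thread: the slot goes back to the budget.
inline void Complete(SimBudget& budget, SimPlayer& p) {
    budget.Release();
    p.holdsSlot = false;
}

// Replays the frame-16 step and returns the final state of the four players.
inline std::array<SimPlayer, 4> SimulateFrame16() {
    SimBudget budget;
    std::array<SimPlayer, 4> players{};
    for (auto& p : players) Attempt(budget, p);  // t=533ms: #0-#2 accepted, #3 rejected
    Complete(budget, players[0]);                // t=543ms: Player#0 finishes rendering
    Attempt(budget, players[3]);                 // t=543ms: Player#3 retries frame 16
    return players;
}
```

After the replay, Player#3 has one dropped attempt but has still processed frame 16, which is the property the synchronization claim relies on.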
7. Thread Safety
Lock-Free Design
All operations use atomic primitives for thread safety without mutexes:
bool GlobalFrameBudget::TryAcquireFrameSlot(int playerId, uint64_t frameNumber)
{
    // Statistics: count every call (relaxed, see rationale below)
    m_totalAcquireAttempts.fetch_add(1, std::memory_order_relaxed);
    // Atomic read
    int current = m_activeFrames.load(std::memory_order_acquire);
    // Lock-free CAS loop; a full budget skips the loop entirely (fast path rejection)
    while (current < MAX_CONCURRENT_FRAMES_BOTTLENECK) {
        if (m_activeFrames.compare_exchange_weak(current, current + 1,
                                                 std::memory_order_acq_rel,
                                                 std::memory_order_acquire)) {
            m_successfulAcquires.fetch_add(1, std::memory_order_relaxed);
            return true; // Successfully acquired
        }
        // CAS failed (contention or spurious failure): current now holds the
        // latest value, so the loop condition re-checks the limit before retrying
    }
    m_rejectedAcquires.fetch_add(1, std::memory_order_relaxed);
    return false; // Budget exhausted
}
void GlobalFrameBudget::ReleaseFrameSlot(int playerId)
{
// Atomic decrement
m_activeFrames.fetch_sub(1, std::memory_order_acq_rel);
}
Memory Ordering Rationale
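- acquire/release: The acq_rel compare-exchange in TryAcquireFrameSlot() and the acq_rel fetch_sub in ReleaseFrameSlot() synchronize with each other, so writes made before a release are visible to the thread that acquires the freed slot
- relaxed (statistics): The counters are only read for reporting and do not order any other memory access, so exactness under contention is not required for correctness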
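One way to gain confidence in the CAS loop is a stress harness that checks the limit is never exceeded under contention. This is a minimal sketch: `MiniBudget` is a hypothetical stand-in that copies the loop from TryAcquireFrameSlot(), and `MeasurePeakConcurrency` is an illustrative helper, not part of the design.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in with the same CAS loop as TryAcquireFrameSlot().
class MiniBudget {
public:
    static constexpr int kMaxConcurrent = 3;

    bool TryAcquire() {
        int current = m_active.load(std::memory_order_acquire);
        while (current < kMaxConcurrent) {
            // On failure compare_exchange_weak reloads `current`,
            // so the loop condition re-checks the limit.
            if (m_active.compare_exchange_weak(current, current + 1,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire)) {
                return true;
            }
        }
        return false;
    }

    void Release() { m_active.fetch_sub(1, std::memory_order_acq_rel); }
    int Active() const { return m_active.load(std::memory_order_acquire); }

private:
    std::atomic<int> m_active{0};
};

// Runs `threadCount` workers that repeatedly acquire and release, and returns
// the largest holder count any worker observed while it held a slot.
inline int MeasurePeakConcurrency(int threadCount, int iterationsPerThread) {
    MiniBudget budget;
    std::atomic<int> peak{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterationsPerThread; ++i) {
                if (!budget.TryAcquire()) continue;
                int seen = budget.Active();
                int prev = peak.load(std::memory_order_relaxed);
                // Raise the recorded peak if this observation is larger.
                while (seen > prev &&
                       !peak.compare_exchange_weak(prev, seen, std::memory_order_relaxed)) {
                }
                budget.Release();
            }
        });
    }
    for (auto& w : workers) w.join();
    return peak.load();
}
```

With 8 workers the returned peak should stay between 1 and 3: the counter only ever moves from a value below the limit to that value plus one.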
8. Performance Impact
Expected Improvements
Before (No GlobalFrameBudget):
- Frames 16-18: All 4 players decode simultaneously
- NVDEC queue: 4 concurrent submissions
- QUEUE_DELAY: 35-42ms (exceeds 33.33ms budget)
- Result: Stutter/frame drops
After (With GlobalFrameBudget):
- Frames 16-18: Max 3 players decode concurrently
- NVDEC queue: 3 concurrent submissions (25% reduction)
- QUEUE_DELAY: ~28-33ms (within 33.33ms budget)
- Result: Smooth playback
Measured vs. Expected Metrics (baseline measured from time.log)
| Metric | Without Budget | With Budget (Expected) |
|---|---|---|
| QUEUE_DELAY (frame 16) | 35-42ms | 28-33ms |
| Frames dropped | 0-2 | 1-3 (brief skip) |
| Total playback time | ~600ms | ~650ms (+8% initially) |
| Sync after frame 19 | Perfect | Perfect |
9. Statistics and Monitoring
BudgetStatistics Structure
struct BudgetStatistics {
uint64_t totalAcquireAttempts; // Total TryAcquireFrameSlot() calls
uint64_t successfulAcquires; // Slots acquired successfully
uint64_t rejectedAcquires; // Rejections due to budget limit
double rejectionRate; // rejectedAcquires / totalAcquireAttempts
};
Usage Example
// After playback test
auto stats = GlobalFrameBudget::GetInstance().GetStatistics();
LOGF_INFO("GlobalFrameBudget Statistics:");
LOGF_INFO(" Total attempts: %llu", stats.totalAcquireAttempts);
LOGF_INFO(" Successful: %llu", stats.successfulAcquires);
LOGF_INFO(" Rejected: %llu", stats.rejectedAcquires);
LOGF_INFO(" Rejection rate: %.2f%%", stats.rejectionRate * 100.0);
// Expected results with 4 players:
// Total attempts: ~15-18 (successful + rejected)
// Successful: ~12 (4 players × 3 frames; all eventually succeed)
// Rejected: ~3-6 (transient rejections, retried after a slot is released)
// Rejection rate: ~20-33% (acceptable because rejected frames are retried)
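GetStatistics() is declared in the header but its body is not shown. A minimal sketch of how the snapshot and the rejection rate could be computed follows; `BudgetCounters`, `RecordAttempt`, and `Snapshot` are hypothetical names, and the zero-attempt guard is an assumption about the desired behavior before any call has been made.

```cpp
#include <atomic>
#include <cstdint>

struct BudgetStatistics {
    uint64_t totalAcquireAttempts;
    uint64_t successfulAcquires;
    uint64_t rejectedAcquires;
    double rejectionRate;
};

// Hypothetical holder for the three counters declared in GlobalFrameBudget.
class BudgetCounters {
public:
    // Called once per TryAcquireFrameSlot() outcome.
    void RecordAttempt(bool acquired) {
        m_total.fetch_add(1, std::memory_order_relaxed);
        if (acquired) {
            m_successful.fetch_add(1, std::memory_order_relaxed);
        } else {
            m_rejected.fetch_add(1, std::memory_order_relaxed);
        }
    }

    // The three loads are independent, so under concurrent updates the
    // snapshot is approximate; that is acceptable for reporting.
    BudgetStatistics Snapshot() const {
        BudgetStatistics s{};
        s.totalAcquireAttempts = m_total.load(std::memory_order_relaxed);
        s.successfulAcquires = m_successful.load(std::memory_order_relaxed);
        s.rejectedAcquires = m_rejected.load(std::memory_order_relaxed);
        // Guard against division by zero before any attempt is recorded.
        s.rejectionRate = (s.totalAcquireAttempts == 0)
            ? 0.0
            : static_cast<double>(s.rejectedAcquires) /
              static_cast<double>(s.totalAcquireAttempts);
        return s;
    }

private:
    std::atomic<uint64_t> m_total{0};
    std::atomic<uint64_t> m_successful{0};
    std::atomic<uint64_t> m_rejected{0};
};
```

For example, 12 successful and 3 rejected calls give 15 attempts in total and a rejection rate of 3 / 15 = 0.2, since every rejected call also counts as an attempt.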
10. Configuration Tuning
MAX_CONCURRENT_FRAMES_BOTTLENECK
Current value: 3
Rationale:
- 4 concurrent → 35-42ms QUEUE_DELAY (exceeds budget)
- 3 concurrent → ~28-33ms QUEUE_DELAY (within budget)
- 2 concurrent → Would be too conservative, longer total time
Tuning guide:
- Decrease if QUEUE_DELAY still exceeds the budget (more aggressive load reduction, at the cost of longer total time)
- Increase if QUEUE_DELAY has headroom and the rejection rate is high
- Monitor via BudgetStatistics.rejectionRate
Phase Detection Thresholds
Current values:
- INITIAL_BUFFERING: frames 0-15 (VAVCORE_NVDEC_INITIAL_BUFFERING)
- TRIPLE_FILLING: frames 16-18 (+VAV2PLAYER_TRIPLE_BUFFER_SIZE)
- NORMAL_PLAYBACK: frames 19+
Tuning guide:
- Extend TRIPLE_FILLING range if stuttering persists after frame 18
- Reduce if budget overhead is unnecessary
11. Error Handling
Slot Leak Prevention
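Problem: If ReleaseFrameSlot() is not called, m_activeFrames never decrements, causing permanent budget exhaustion. The opposite error, releasing more often than acquiring, drives the counter negative and silently raises the effective limit.
Solution: Pair every successful acquire with exactly one release (see Proper Cleanup Pattern below), and add a sanity check in ReleaseFrameSlot() that detects unbalanced releases: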
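void GlobalFrameBudget::ReleaseFrameSlot(int playerId)
{
    int previous = m_activeFrames.fetch_sub(1, std::memory_order_acq_rel);
    // Sanity check: a release without a matching acquire
    if (previous <= 0) {
        LOGF_ERROR("[GlobalFrameBudget] Player#%d attempted to release but m_activeFrames was %d!",
                   playerId, previous);
        // Undo the decrement so the counter does not stay negative
        m_activeFrames.fetch_add(1, std::memory_order_acq_rel);
    }
}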
Proper Cleanup Pattern
// In FrameProcessor::ProcessFrame()
bool slotAcquired = false;
if (currentPhase == Phase::TRIPLE_FILLING) {
    if (GlobalFrameBudget::GetInstance().TryAcquireFrameSlot(...)) {
        slotAcquired = true;
    } else {
        return false; // Skip frame
    }
}
// Ensure release happens exactly once on every code path.
// Capture by value: the callback runs after ProcessFrame() has returned,
// so a reference to the local slotAcquired would dangle.
auto cleanup = [this, slotAcquired]() {
    if (slotAcquired) {
        GlobalFrameBudget::GetInstance().ReleaseFrameSlot(...);
    }
};
// Error path: release immediately, before anything is enqueued,
// so the callback cannot release the same slot a second time
if (decodeError) {
    cleanup();
    return false;
}
// Normal path: release in the UI thread callback
m_dispatcherQueue.TryEnqueue([cleanup, ...]() {
    // ... render ...
    cleanup();
});
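An alternative to calling cleanup() by hand on each path is an RAII guard whose destructor performs the release. The sketch below is an assumption, not part of the current design: `FrameSlotGuard`, `MakeSlotGuard`, and `CountReleases` are hypothetical names, and a plain `std::function` stands in for the dispatcher queue callback.

```cpp
#include <functional>
#include <memory>
#include <utility>

// Hypothetical RAII guard: runs the release callback exactly once,
// when the last owner of the guard goes away.
class FrameSlotGuard {
public:
    explicit FrameSlotGuard(std::function<void()> release)
        : m_release(std::move(release)) {}
    ~FrameSlotGuard() { if (m_release) m_release(); }

    FrameSlotGuard(const FrameSlotGuard&) = delete;
    FrameSlotGuard& operator=(const FrameSlotGuard&) = delete;

private:
    std::function<void()> m_release;
};

// shared_ptr keeps the capturing lambda copyable.
inline std::shared_ptr<FrameSlotGuard> MakeSlotGuard(std::function<void()> release) {
    return std::make_shared<FrameSlotGuard>(std::move(release));
}

// Demonstration: counts how often the slot is released on each path.
inline int CountReleases(bool decodeError) {
    int releases = 0;
    {
        auto guard = MakeSlotGuard([&releases] { ++releases; });
        std::function<void()> queued;  // stand-in for the UI thread callback
        if (!decodeError) {
            queued = [guard] { /* render would happen here */ };
        }
        guard.reset();         // ProcessFrame() returns; only the callback may still own the guard
        if (queued) queued();  // the UI thread runs the callback
    }                          // callback destroyed here, which releases the slot if still held
    return releases;
}
```

Both paths release exactly once. One trade-off to keep in mind: the release runs when the callback object is destroyed, on whichever thread drops the last reference, rather than at a fixed point inside the callback.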
12. Future Enhancements
Adaptive Budget
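Dynamically adjust the concurrency limit (currently the compile-time constant MAX_CONCURRENT_FRAMES_BOTTLENECK) based on measured QUEUE_DELAY. Subclassing as sketched here would require GlobalFrameBudget's constructor and destructor to be protected rather than private: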
class AdaptiveFrameBudget : public GlobalFrameBudget
{
private:
std::atomic<int> m_maxConcurrent{3}; // Dynamic limit
public:
void UpdateBudget(double measuredQueueDelay) {
if (measuredQueueDelay > 35.0) {
// Too high, reduce concurrency
m_maxConcurrent.store(std::max(1, m_maxConcurrent.load() - 1));
} else if (measuredQueueDelay < 25.0) {
// Safe margin, can increase
m_maxConcurrent.store(std::min(4, m_maxConcurrent.load() + 1));
}
}
};
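The separate load() and store() calls in UpdateBudget() can lose an update if two threads adjust the limit at the same time. A possible variant, offered only as a sketch, performs the adjustment in a compare-exchange loop. `AdaptiveLimit` is a hypothetical standalone class; the 35 ms and 25 ms thresholds and the [1, 4] range are taken from the sketch above.

```cpp
#include <algorithm>
#include <atomic>

// Hypothetical helper: adjusts a shared limit with a CAS loop so that
// concurrent callers cannot overwrite each other's adjustment.
class AdaptiveLimit {
public:
    int Get() const { return m_limit.load(std::memory_order_relaxed); }

    // Returns the limit in effect after this update.
    int Update(double measuredQueueDelayMs) {
        int current = m_limit.load(std::memory_order_relaxed);
        for (;;) {
            int desired = current;
            if (measuredQueueDelayMs > 35.0) {
                desired = std::max(1, current - 1);  // too high, reduce concurrency
            } else if (measuredQueueDelayMs < 25.0) {
                desired = std::min(4, current + 1);  // safe margin, can increase
            }
            if (desired == current) {
                return current;  // inside the dead band or already clamped
            }
            // On failure `current` is reloaded and the adjustment is recomputed.
            if (m_limit.compare_exchange_weak(current, desired,
                                              std::memory_order_relaxed)) {
                return desired;
            }
        }
    }

private:
    std::atomic<int> m_limit{3};  // same starting value as m_maxConcurrent
};
```

The band between 25 ms and 35 ms leaves the limit unchanged, which keeps it from oscillating when QUEUE_DELAY hovers near the 33.33ms budget.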
Per-Decoder Budget
Different decoders may have different queue capacities:
struct DecoderBudget {
int maxConcurrentNVDEC = 3;
int maxConcurrentVPL = 4;
int maxConcurrentAMF = 3;
};
Priority-Based Slot Allocation
Assign priority to players for fair scheduling:
bool TryAcquireFrameSlot(int playerId, int priority, uint64_t frameNumber);
13. Testing Plan
Unit Tests
- Basic slot acquisition:
  - Acquire 3 slots → all succeed
  - Acquire 4th slot → fail
- Slot release:
  - Acquire 3 → release 1 → acquire 1 more → succeed
- Statistics tracking:
  - Verify counters increment correctly
Integration Tests
- Single player:
  - GlobalFrameBudget should not interfere
  - Verify normal playback
- 4 simultaneous players:
  - Monitor QUEUE_DELAY during frames 16-18
  - Verify it stays within the 33.33ms budget
  - Check synchronization after frame 19
- Stress test:
  - 8 simultaneous players
  - Verify budget prevents complete stall
Performance Benchmarks
Compare time.log with/without GlobalFrameBudget:
- QUEUE_DELAY distribution
- Total frame drop count
- Playback smoothness (subjective)
14. Implementation Checklist
- Create GlobalFrameBudget.h
- Implement GlobalFrameBudget.cpp
- Add Phase enum to FrameProcessor.h
- Implement GetCurrentPhase() in FrameProcessor.cpp
- Integrate TryAcquireFrameSlot() in ProcessFrame()
- Integrate ReleaseFrameSlot() in UI callback
- Add m_budgetSlotAcquired tracking
- Add GlobalFrameBudget.cpp to Vav2Player.vcxproj
- Build and verify compilation
- Test with 4 simultaneous players
- Analyze time.log for improvements
- Document final results
15. References
- Original Issue: time.log analysis showing 35-42ms QUEUE_DELAY spikes
- Root Cause: 4 players simultaneously hitting frames 16-18 (triple buffer filling)
- DPB Size Investigation: DPB_SIZE=4/8 crashes, must remain 16
- NVDEC Spec: min_num_decode_surfaces=9 from AV1 sequence header (codec spec, not H/W)
Document Version: 1.0
Last Updated: 2025-10-11
Author: Claude Code