# GlobalFrameBudget Design Document

## 1. Overview

### Purpose

**Problem**: When 4 VideoPlayerControl2 instances play simultaneously, they all hit the initial buffering bottleneck (frames 16-18) at the same time. This overflows the NVDEC queue and pushes QUEUE_DELAY to 35-42ms, exceeding the 33.33ms per-frame budget at 30fps.

**Solution**: Implement a global frame processing budget manager that limits concurrent frame processing during the bottleneck phase. Capping the load at 3 of the 4 players is expected to bring QUEUE_DELAY down to ~28-33ms (within budget).

### Key Constraints

- **NVDEC DPB_SIZE = 16**: Required by AV1 sequence header (min_num_decode_surfaces=9 for test video, up to 12+ for complex GOPs)
- **INITIAL_BUFFERING = 16**: NVDEC requires full DPB filling for B-frame reordering
- **Cannot reduce buffer sizes**: Tested DPB_SIZE=4/8; both crash with "Invalid CurrPicIdx"
- **Must maintain sync**: All 4 players should remain synchronized after initial buffering

---

## 2. Architecture Diagram

```
┌─────────────────────────────────────────────────────────────┐
│                      GlobalFrameBudget                      │
│                         (Singleton)                         │
├─────────────────────────────────────────────────────────────┤
│ State:                                                      │
│   - m_activeFrames: atomic<int> (current active frames)     │
│   - MAX_CONCURRENT_FRAMES_BOTTLENECK = 3 (bottleneck limit) │
├─────────────────────────────────────────────────────────────┤
│ Public API:                                                 │
│   + TryAcquireFrameSlot(playerId, frameNumber) → bool       │
│   + ReleaseFrameSlot(playerId) → void                       │
│   + GetActiveFrameCount() → int                             │
│   + GetStatistics() → BudgetStatistics                      │
│   + ResetStatistics() → void                                │
└─────────────────────────────────────────────────────────────┘
                               ▲
                               │ uses
              ┌────────────────┼────────────────┐
              │                │                │
         ┌────▼─────┐     ┌────▼─────┐     ┌────▼─────┐
         │ Player#0 │     │ Player#1 │     │ Player#2 │  ...
         │FrameProc │     │FrameProc │     │FrameProc │
         └──────────┘     └──────────┘     └──────────┘

Call Flow:
1. FrameProcessor::ProcessFrame() → Check Phase == TRIPLE_FILLING?
2. YES → TryAcquireFrameSlot() → m_activeFrames < 3?
3.   YES → Proceed with decode → ReleaseFrameSlot() after render completes
4.   NO  → Skip frame (m_framesDropped++)
```

---

## 3. Processing Phases

FrameProcessor operates in 3 distinct phases:

```cpp
enum class Phase {
    INITIAL_BUFFERING,  // frames 0-15: NULL surface submission to NVDEC DPB
    TRIPLE_FILLING,     // frames 16-18: triple buffer filling (BOTTLENECK)
    NORMAL_PLAYBACK     // frames 19+: stable rendering
};
```

### Phase Details

| Phase              | Frame Range | Behavior                          | QUEUE_DELAY      |
|--------------------|-------------|-----------------------------------|------------------|
| INITIAL_BUFFERING  | 0-15        | NULL surface, no render           | 6-15ms (stable)  |
| TRIPLE_FILLING     | 16-18       | Fill triple buffer, first renders | 35-42ms (SPIKE)  |
| NORMAL_PLAYBACK    | 19+         | Steady state rendering            | 6-22ms (stable)  |

**GlobalFrameBudget is only active during the TRIPLE_FILLING phase.**

---

## 4. Class Interface

### Header: GlobalFrameBudget.h

```cpp
#pragma once

#include <atomic>
#include <cstdint>

namespace Vav2Player {

class GlobalFrameBudget {
public:
    static GlobalFrameBudget& GetInstance();

    // Acquire permission to process frame
    // Returns: true if slot acquired, false if budget limit reached
    bool TryAcquireFrameSlot(int playerId, uint64_t frameNumber);

    // Release slot after processing complete
    void ReleaseFrameSlot(int playerId);

    // Query current state
    int GetActiveFrameCount() const { return m_activeFrames.load(); }

    // Statistics
    struct BudgetStatistics {
        uint64_t totalAcquireAttempts;
        uint64_t successfulAcquires;
        uint64_t rejectedAcquires;
        double rejectionRate;
    };

    BudgetStatistics GetStatistics() const;
    void ResetStatistics();

private:
    GlobalFrameBudget() = default;
    ~GlobalFrameBudget() = default;

    // Disable copy/move
    GlobalFrameBudget(const GlobalFrameBudget&) = delete;
    GlobalFrameBudget& operator=(const GlobalFrameBudget&) = delete;

    // Configuration
    static constexpr int MAX_CONCURRENT_FRAMES_BOTTLENECK = 3;

    // State
    std::atomic<int> m_activeFrames{0};

    // Statistics
    std::atomic<uint64_t> m_totalAcquireAttempts{0};
    std::atomic<uint64_t> m_successfulAcquires{0};
    std::atomic<uint64_t> m_rejectedAcquires{0};
};

} // namespace Vav2Player
```

---

## 5. Integration with FrameProcessor

### FrameProcessor Changes

**FrameProcessor.h additions:**

```cpp
class FrameProcessor {
public:
    // Processing phase query
    enum class Phase {
        INITIAL_BUFFERING,
        TRIPLE_FILLING,
        NORMAL_PLAYBACK
    };

    Phase GetCurrentPhase() const;

private:
    // Track if budget slot was acquired (for proper release)
    std::atomic<bool> m_budgetSlotAcquired{false};
};
```

**FrameProcessor.cpp integration:**

```cpp
bool FrameProcessor::ProcessFrame(VavCorePlayer* player,
                                  std::function<void(bool)> onComplete)
{
    // Existing: Skip if previous frame still processing
    if (m_frameProcessing.load()) {
        m_framesDropped++;
        return false;
    }

    Phase currentPhase = GetCurrentPhase();

    // NEW: Apply GlobalFrameBudget during bottleneck phase
    if (currentPhase == Phase::TRIPLE_FILLING) {
        if (!GlobalFrameBudget::GetInstance().TryAcquireFrameSlot(
                m_playerInstanceId, m_framesDecoded)) {
            LOGF_DEBUG("[Player#%d] Frame %llu SKIPPED (global budget limit)",
                       m_playerInstanceId, m_framesDecoded.load());
            m_framesDropped++;
            return false;
        }
        m_budgetSlotAcquired = true;
    }

    m_frameProcessing = true;

    // ... existing decode logic ...

    // UI thread callback with budget release
    m_dispatcherQueue.TryEnqueue([this, renderIndex, onComplete]() {
        bool renderSuccess = m_renderer->RenderFrame(renderIndex);

        // NEW: Release budget slot after render complete
        if (m_budgetSlotAcquired.load()) {
            GlobalFrameBudget::GetInstance().ReleaseFrameSlot(m_playerInstanceId);
            m_budgetSlotAcquired = false;
        }

        m_frameProcessing = false;
        onComplete(renderSuccess);
    });

    m_framesDecoded++;
    return true;
}

FrameProcessor::Phase FrameProcessor::GetCurrentPhase() const
{
    uint64_t decoded = m_framesDecoded.load();

    if (decoded < VAVCORE_NVDEC_INITIAL_BUFFERING) {
        return Phase::INITIAL_BUFFERING;
    } else if (decoded < VAVCORE_NVDEC_INITIAL_BUFFERING + VAV2PLAYER_TRIPLE_BUFFER_SIZE) {
        return Phase::TRIPLE_FILLING;
    } else {
        return Phase::NORMAL_PLAYBACK;
    }
}
```

---

## 6. Simulation Scenario

### Timeline with 4 Players

```
Initial State: m_activeFrames = 0, MAX = 3

t=0ms: All 4 players call Play(), start frame 0

Phase 1 (frames 0-15): INITIAL_BUFFERING
  - All 4 players process normally
  - No GlobalFrameBudget involvement
  - QUEUE_DELAY: 6-15ms (stable)

t=533ms: All 4 players reach frame 16

Phase 2 (frames 16-18): TRIPLE_FILLING (BOTTLENECK)

  Frame 16:
    t=533ms: Player#0 TryAcquire → m_activeFrames: 0→1 ✅
    t=533ms: Player#1 TryAcquire → m_activeFrames: 1→2 ✅
    t=533ms: Player#2 TryAcquire → m_activeFrames: 2→3 ✅
    t=533ms: Player#3 TryAcquire → REJECTED (3 >= 3) ❌ [FRAME SKIPPED]

    t=543ms: Player#0 render complete → Release → m_activeFrames: 3→2
    t=543ms: Player#3 ProcessFrame (retry frame 16) → TryAcquire → 2→3 ✅

  Frame 17:
    Similar pattern: One player skips, retries after slot release

  Frame 18:
    Similar pattern: One player skips, retries after slot release

t=633ms: All 4 players reach frame 19

Phase 3 (frames 19+): NORMAL_PLAYBACK
  - All 4 players process normally
  - No GlobalFrameBudget involvement
  - QUEUE_DELAY: 6-22ms (stable)

Result:
  - Bottleneck phase: 4 players → max 3 concurrent
  - NVDEC queue load: 25% reduction
  - QUEUE_DELAY: 35-42ms → ~28-33ms (within 33.33ms budget)
  - Player synchronization: Maintained (skipped frames retry immediately)
```

---

## 7. Thread Safety

### Lock-Free Design

All operations use atomic primitives for thread safety without mutexes:

```cpp
bool GlobalFrameBudget::TryAcquireFrameSlot(int playerId, uint64_t frameNumber)
{
    // Atomic read
    int current = m_activeFrames.load(std::memory_order_acquire);

    // Check limit
    if (current >= MAX_CONCURRENT_FRAMES_BOTTLENECK) {
        return false;  // Fast path rejection
    }

    // Lock-free CAS loop
    while (current < MAX_CONCURRENT_FRAMES_BOTTLENECK) {
        if (m_activeFrames.compare_exchange_weak(current, current + 1,
                                                 std::memory_order_acq_rel,
                                                 std::memory_order_acquire)) {
            return true;  // Successfully acquired
        }
        // compare_exchange_weak failed: current now holds the latest value, retry
    }

    return false;  // Budget exhausted during retry
}

void GlobalFrameBudget::ReleaseFrameSlot(int playerId)
{
    // Atomic decrement
    m_activeFrames.fetch_sub(1, std::memory_order_acq_rel);
}
```

### Memory Ordering Rationale

- **acquire/release**: Ensures proper synchronization between acquire and release operations on m_activeFrames
- **relaxed** (statistics counters, not shown in the listing above): Non-critical counters whose exact ordering does not affect correctness

---

## 8. Performance Impact

### Expected Improvements

**Before (No GlobalFrameBudget):**

- Frames 16-18: All 4 players decode simultaneously
- NVDEC queue: 4 concurrent submissions
- QUEUE_DELAY: 35-42ms (exceeds 33.33ms budget)
- Result: Stutter/frame drops

**After (With GlobalFrameBudget):**

- Frames 16-18: Max 3 players decode concurrently
- NVDEC queue: 3 concurrent submissions (25% reduction)
- QUEUE_DELAY: ~28-33ms (within 33.33ms budget)
- Result: Smooth playback

### Measured Metrics (from time.log)

| Metric                 | Without Budget | With Budget (Expected) |
|------------------------|----------------|------------------------|
| QUEUE_DELAY (frame 16) | 35-42ms        | 28-33ms                |
| Frames dropped         | 0-2            | 1-3 (brief skip)       |
| Total playback time    | ~600ms         | ~650ms (+8% initially) |
| Sync after frame 19    | Perfect        | Perfect                |

---

## 9. Statistics and Monitoring

### BudgetStatistics Structure

```cpp
struct BudgetStatistics {
    uint64_t totalAcquireAttempts;  // Total TryAcquireFrameSlot() calls
    uint64_t successfulAcquires;    // Slots acquired successfully
    uint64_t rejectedAcquires;      // Rejections due to budget limit
    double rejectionRate;           // rejectedAcquires / totalAcquireAttempts
};
```

### Usage Example

```cpp
// After playback test
auto stats = GlobalFrameBudget::GetInstance().GetStatistics();

LOGF_INFO("GlobalFrameBudget Statistics:");
LOGF_INFO("  Total attempts: %llu", stats.totalAcquireAttempts);
LOGF_INFO("  Successful: %llu", stats.successfulAcquires);
LOGF_INFO("  Rejected: %llu", stats.rejectedAcquires);
LOGF_INFO("  Rejection rate: %.2f%%", stats.rejectionRate * 100.0);

// Expected results with 4 players:
//   Successful: ~12 (4 players × 3 frames, all eventually succeed)
//   Rejected: ~3-6 (transient rejections, immediate retry)
//   Total attempts: ~15-18 (successful + rejected)
//   Rejection rate: ~20-33% (acceptable due to immediate retry)
```

---

## 10. Configuration Tuning

### MAX_CONCURRENT_FRAMES_BOTTLENECK

**Current value: 3**

Rationale:

- 4 concurrent → 35-42ms QUEUE_DELAY (exceeds budget)
- 3 concurrent → ~28-33ms QUEUE_DELAY (within budget)
- 2 concurrent → Would be too conservative, longer total time

**Tuning guide:**

- Decrease if QUEUE_DELAY still exceeds budget (more aggressive load reduction)
- Increase if QUEUE_DELAY has headroom and the rejection rate is high
- Monitor via BudgetStatistics.rejectionRate

### Phase Detection Thresholds

**Current values:**

- INITIAL_BUFFERING: frames 0-15 (VAVCORE_NVDEC_INITIAL_BUFFERING)
- TRIPLE_FILLING: frames 16-18 (+VAV2PLAYER_TRIPLE_BUFFER_SIZE)
- NORMAL_PLAYBACK: frames 19+

**Tuning guide:**

- Extend the TRIPLE_FILLING range if stuttering persists after frame 18
- Reduce it if the budget overhead turns out to be unnecessary

---

## 11. Error Handling

### Slot Leak Prevention

**Problem**: If ReleaseFrameSlot() is not called, m_activeFrames never decrements, causing permanent budget exhaustion.
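To make this failure mode concrete, the following standalone sketch reproduces it in isolation. `MiniBudget`, `SlotGuard`, `LeakyProcess`, and `GuardedProcess` are hypothetical names introduced only for illustration; they are not part of the Vav2Player code base. `MiniBudget` reduces GlobalFrameBudget to its slot counter, and `SlotGuard` is one possible RAII wrapper that releases the slot on every exit path.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for GlobalFrameBudget, reduced to the slot counter.
class MiniBudget {
public:
    static constexpr int kMaxSlots = 3;  // mirrors MAX_CONCURRENT_FRAMES_BOTTLENECK

    bool TryAcquire() {
        int current = m_active.load(std::memory_order_acquire);
        while (current < kMaxSlots) {
            // On failure, compare_exchange_weak reloads `current` for the retry.
            if (m_active.compare_exchange_weak(current, current + 1,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire)) {
                return true;
            }
        }
        return false;
    }

    void Release() {
        int previous = m_active.fetch_sub(1, std::memory_order_acq_rel);
        assert(previous > 0 && "release without matching acquire");
        (void)previous;  // avoids an unused-variable warning when NDEBUG is set
    }

    int Active() const { return m_active.load(std::memory_order_acquire); }

private:
    std::atomic<int> m_active{0};
};

// Hypothetical RAII guard: releases the slot when it goes out of scope.
class SlotGuard {
public:
    explicit SlotGuard(MiniBudget& budget)
        : m_budget(budget), m_held(budget.TryAcquire()) {}
    ~SlotGuard() { if (m_held) m_budget.Release(); }
    SlotGuard(const SlotGuard&) = delete;
    SlotGuard& operator=(const SlotGuard&) = delete;
    bool Held() const { return m_held; }

private:
    MiniBudget& m_budget;
    bool m_held;
};

// Leaky variant: the early return on a decode error skips Release().
bool LeakyProcess(MiniBudget& budget, bool decodeFails) {
    if (!budget.TryAcquire()) return false;
    if (decodeFails) return false;  // BUG: slot is never released
    budget.Release();
    return true;
}

// Guarded variant: the guard's destructor releases on every return path.
bool GuardedProcess(MiniBudget& budget, bool decodeFails) {
    SlotGuard guard(budget);
    if (!guard.Held()) return false;
    if (decodeFails) return false;  // slot released by ~SlotGuard
    return true;
}

// Slots still marked active after `frames` failing decode attempts.
int ActiveAfterLeakyFrames(int frames) {
    MiniBudget budget;
    for (int i = 0; i < frames; ++i) LeakyProcess(budget, true);
    return budget.Active();
}

int ActiveAfterGuardedFrames(int frames) {
    MiniBudget budget;
    for (int i = 0; i < frames; ++i) GuardedProcess(budget, true);
    return budget.Active();
}
```

With the leaky variant, three failed frames are enough to pin the counter at the limit, after which every further acquire is rejected. The guarded variant returns the counter to zero. Note that in FrameProcessor the release happens in a UI-thread callback, so a guard like this would have to be moved into that callback rather than live on the stack of ProcessFrame(); that adaptation is not shown here.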
**Solution**: Release the slot on every code path (see the cleanup pattern below), and add a sanity check in ReleaseFrameSlot() to catch the opposite error, a release without a matching acquire:

```cpp
void GlobalFrameBudget::ReleaseFrameSlot(int playerId)
{
    int previous = m_activeFrames.fetch_sub(1, std::memory_order_acq_rel);

    // Sanity check: an unmatched release would drive the counter negative
    if (previous <= 0) {
        LOGF_ERROR("[GlobalFrameBudget] Player#%d attempted to release but m_activeFrames was %d!",
                   playerId, previous);
        // Undo the unmatched decrement so the counter does not stay negative
        m_activeFrames.fetch_add(1, std::memory_order_acq_rel);
    }
}
```

### Proper Cleanup Pattern

```cpp
// In FrameProcessor::ProcessFrame()
bool slotAcquired = false;

if (currentPhase == Phase::TRIPLE_FILLING) {
    if (GlobalFrameBudget::GetInstance().TryAcquireFrameSlot(...)) {
        slotAcquired = true;
    } else {
        return false;  // Skip frame
    }
}

// Ensure release happens in all code paths.
// Capture by value: the callback runs on the UI thread after this function
// has returned, so a by-reference capture of the local would dangle.
auto cleanup = [slotAcquired]() {
    if (slotAcquired) {
        GlobalFrameBudget::GetInstance().ReleaseFrameSlot(...);
    }
};

// Error path: immediate cleanup, before the callback is enqueued,
// so the slot is never released twice
if (decodeError) {
    cleanup();
    return false;
}

// Normal path: UI thread callback
m_dispatcherQueue.TryEnqueue([cleanup, ...]() {
    // ... render ...
    cleanup();
});
```

---

## 12. Future Enhancements

### Adaptive Budget

Dynamically adjust MAX_CONCURRENT_FRAMES_BOTTLENECK based on measured QUEUE_DELAY:

```cpp
// Sketch only: GlobalFrameBudget's constructor and destructor are private,
// so they would have to become protected for this derivation to compile.
// Requires <algorithm> for std::min/std::max.
class AdaptiveFrameBudget : public GlobalFrameBudget {
private:
    std::atomic<int> m_maxConcurrent{3};  // Dynamic limit

public:
    void UpdateBudget(double measuredQueueDelay) {
        if (measuredQueueDelay > 35.0) {
            // Too high, reduce concurrency
            m_maxConcurrent.store(std::max(1, m_maxConcurrent.load() - 1));
        } else if (measuredQueueDelay < 25.0) {
            // Safe margin, can increase
            m_maxConcurrent.store(std::min(4, m_maxConcurrent.load() + 1));
        }
    }
};
```

### Per-Decoder Budget

Different decoders may have different queue capacities:

```cpp
struct DecoderBudget {
    int maxConcurrentNVDEC = 3;
    int maxConcurrentVPL = 4;
    int maxConcurrentAMF = 3;
};
```

### Priority-Based Slot Allocation

Assign priority to players for fair scheduling:

```cpp
bool TryAcquireFrameSlot(int playerId, int priority, uint64_t frameNumber);
```

---

## 13. Testing Plan

### Unit Tests

1. **Basic slot acquisition:**
   - Acquire 3 slots → all succeed
   - Acquire 4th slot → fail

2. **Slot release:**
   - Acquire 3 → release 1 → acquire 1 more → succeed

3. **Statistics tracking:**
   - Verify counters increment correctly

### Integration Tests

1. **Single player:**
   - GlobalFrameBudget should not interfere
   - Verify normal playback

2. **4 simultaneous players:**
   - Monitor QUEUE_DELAY during frames 16-18
   - Verify it stays within the 33.33ms budget
   - Check synchronization after frame 19

3. **Stress test:**
   - 8 simultaneous players
   - Verify the budget prevents a complete stall

### Performance Benchmarks

Compare time.log with/without GlobalFrameBudget:

- QUEUE_DELAY distribution
- Total frame drop count
- Playback smoothness (subjective)

---

## 14. Implementation Checklist

- [x] Create GlobalFrameBudget.h
- [x] Implement GlobalFrameBudget.cpp
- [ ] Add Phase enum to FrameProcessor.h
- [ ] Implement GetCurrentPhase() in FrameProcessor.cpp
- [ ] Integrate TryAcquireFrameSlot() in ProcessFrame()
- [ ] Integrate ReleaseFrameSlot() in UI callback
- [ ] Add m_budgetSlotAcquired tracking
- [ ] Add GlobalFrameBudget.cpp to Vav2Player.vcxproj
- [ ] Build and verify compilation
- [ ] Test with 4 simultaneous players
- [ ] Analyze time.log for improvements
- [ ] Document final results

---

## 15. References

- **Original Issue**: time.log analysis showing 35-42ms QUEUE_DELAY spikes
- **Root Cause**: 4 players simultaneously hitting frames 16-18 (triple buffer filling)
- **DPB Size Investigation**: DPB_SIZE=4/8 crashes, must remain 16
- **NVDEC Spec**: min_num_decode_surfaces=9 from AV1 sequence header (codec spec, not H/W)

---

*Document Version: 1.0*
*Last Updated: 2025-10-11*
*Author: Claude Code*
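
The three unit tests listed in the Testing Plan (Section 13) could be sketched roughly as follows. This is a minimal sketch under stated assumptions: `TestBudget` is a hypothetical, non-singleton stand-in that mirrors the GlobalFrameBudget interface so each test starts from a clean state, and the statistics counters are assumed to be updated with relaxed ordering as described in the Memory Ordering Rationale. The real tests would target GlobalFrameBudget itself through whatever test framework the project uses.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in mirroring the GlobalFrameBudget interface.
class TestBudget {
public:
    struct Stats {
        uint64_t attempts;
        uint64_t successes;
        uint64_t rejections;
        double rejectionRate;  // rejections / attempts, 0 when no attempts
    };

    bool TryAcquireFrameSlot() {
        m_attempts.fetch_add(1, std::memory_order_relaxed);
        int current = m_active.load(std::memory_order_acquire);
        while (current < kMax) {
            if (m_active.compare_exchange_weak(current, current + 1,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire)) {
                m_successes.fetch_add(1, std::memory_order_relaxed);
                return true;
            }
        }
        m_rejections.fetch_add(1, std::memory_order_relaxed);
        return false;
    }

    void ReleaseFrameSlot() { m_active.fetch_sub(1, std::memory_order_acq_rel); }

    int GetActiveFrameCount() const { return m_active.load(); }

    Stats GetStatistics() const {
        Stats s{};
        s.attempts = m_attempts.load(std::memory_order_relaxed);
        s.successes = m_successes.load(std::memory_order_relaxed);
        s.rejections = m_rejections.load(std::memory_order_relaxed);
        s.rejectionRate = s.attempts
            ? static_cast<double>(s.rejections) / static_cast<double>(s.attempts)
            : 0.0;
        return s;
    }

private:
    static constexpr int kMax = 3;  // mirrors MAX_CONCURRENT_FRAMES_BOTTLENECK
    std::atomic<int> m_active{0};
    std::atomic<uint64_t> m_attempts{0};
    std::atomic<uint64_t> m_successes{0};
    std::atomic<uint64_t> m_rejections{0};
};

// 1. Basic slot acquisition: 3 succeed, the 4th fails.
bool TestBasicAcquisition() {
    TestBudget b;
    bool firstThree = b.TryAcquireFrameSlot() && b.TryAcquireFrameSlot() &&
                      b.TryAcquireFrameSlot();
    bool fourth = b.TryAcquireFrameSlot();
    return firstThree && !fourth && b.GetActiveFrameCount() == 3;
}

// 2. Slot release: acquire 3, release 1, acquire 1 more.
bool TestSlotRelease() {
    TestBudget b;
    for (int i = 0; i < 3; ++i) b.TryAcquireFrameSlot();
    b.ReleaseFrameSlot();
    return b.TryAcquireFrameSlot() && b.GetActiveFrameCount() == 3;
}

// 3. Statistics tracking: 4 attempts give 3 successes and 1 rejection.
bool TestStatistics() {
    TestBudget b;
    for (int i = 0; i < 4; ++i) b.TryAcquireFrameSlot();
    TestBudget::Stats s = b.GetStatistics();
    return s.attempts == 4 && s.successes == 3 && s.rejections == 1 &&
           s.rejectionRate == 0.25;
}

// Convenience runner; aborts on the first failing test in debug builds.
void RunUnitTests() {
    assert(TestBasicAcquisition());
    assert(TestSlotRelease());
    assert(TestStatistics());
}
```

These tests are single-threaded and deterministic. They check the counting logic only and say nothing about contention behavior, which the 4-player and 8-player integration tests are meant to cover.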