video-v1/vav2/docs/working/GlobalFrameBudget_Design.md
2025-10-11 04:27:57 +09:00

GlobalFrameBudget Design Document

1. Overview

Purpose

Problem: When 4 VideoPlayerControl2 instances play simultaneously, they all hit the initial buffering bottleneck (frames 16-18) at the same time, causing NVDEC queue overflow with QUEUE_DELAY of 35-42ms (exceeding the 33.33ms budget for 30fps).

Solution: Implement a global frame processing budget manager that limits concurrent frame processing during the bottleneck phase, reducing the load from 4 players to 3 maximum, bringing QUEUE_DELAY down to ~28-33ms (within budget).
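
The 33.33ms figure is simply the frame interval at 30fps. As a minimal sketch of that arithmetic, with hypothetical helper names (`FrameBudgetMs`, `ExceedsBudget`) that are not part of the Vav2Player code:

```cpp
// Hypothetical helpers (not part of Vav2Player).

// Per-frame time budget in milliseconds for a given frame rate.
constexpr double FrameBudgetMs(double fps) { return 1000.0 / fps; }

// True when a measured queue delay would overrun the frame interval.
constexpr bool ExceedsBudget(double queueDelayMs, double fps) {
    return queueDelayMs > FrameBudgetMs(fps);
}

// Figures quoted above: 35-42ms overruns the 30fps budget, 28-33ms fits.
static_assert(ExceedsBudget(35.0, 30.0), "35ms overruns the 30fps frame interval");
static_assert(ExceedsBudget(42.0, 30.0), "42ms overruns the 30fps frame interval");
static_assert(!ExceedsBudget(28.0, 30.0), "28ms fits the 30fps frame interval");
static_assert(!ExceedsBudget(33.0, 30.0), "33ms fits the 30fps frame interval");
```

Note that the upper end of the expected range (33ms) leaves only about 0.3ms of headroom.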

Key Constraints

  • NVDEC DPB_SIZE = 16: Required by AV1 sequence header (min_num_decode_surfaces=9 for test video, up to 12+ for complex GOPs)
  • INITIAL_BUFFERING = 16: NVDEC requires full DPB filling for B-frame reordering
  • Cannot reduce buffer sizes: DPB_SIZE=4 and DPB_SIZE=8 were both tested and both crash with "Invalid CurrPicIdx"
  • Must maintain sync: All 4 players should remain synchronized after initial buffering

2. Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    GlobalFrameBudget                        │
│                     (Singleton)                             │
├─────────────────────────────────────────────────────────────┤
│  State:                                                     │
│    - m_activeFrames: atomic<int>  (current active frames)   │
│    - MAX_CONCURRENT_FRAMES_BOTTLENECK = 3  (phase limit)    │
├─────────────────────────────────────────────────────────────┤
│  Public API:                                                │
│    + TryAcquireFrameSlot(playerId, frameNumber) → bool      │
│    + ReleaseFrameSlot(playerId) → void                      │
│    + GetActiveFrameCount() → int                            │
│    + GetStatistics() → BudgetStatistics                     │
│    + ResetStatistics() → void                               │
└─────────────────────────────────────────────────────────────┘
                          ▲
                          │ uses
         ┌────────────────┼────────────────┐
         │                │                │
    ┌────▼─────┐    ┌────▼─────┐    ┌────▼─────┐
    │ Player#0 │    │ Player#1 │    │ Player#2 │ ...
    │FrameProc │    │FrameProc │    │FrameProc │
    └──────────┘    └──────────┘    └──────────┘

Call Flow:
  1. FrameProcessor::ProcessFrame()
     → Check Phase == TRIPLE_FILLING?
     → If not, proceed with decode (budget is not consulted)
  2. YES → TryAcquireFrameSlot()
     → m_activeFrames < 3?
  3. YES → Proceed with decode
     → ReleaseFrameSlot() after render completes
  4. NO → Skip frame (m_framesDropped++); the same frame is retried on the next ProcessFrame() call

3. Processing Phases

FrameProcessor operates in 3 distinct phases:

enum class Phase {
    INITIAL_BUFFERING,    // frames 0-15: NULL surface submission to NVDEC DPB
    TRIPLE_FILLING,       // frames 16-18: triple buffer filling (BOTTLENECK)
    NORMAL_PLAYBACK       // frames 19+: stable rendering
};

Phase Details

| Phase | Frame Range | Behavior | QUEUE_DELAY |
|---|---|---|---|
| INITIAL_BUFFERING | 0-15 | NULL surface, no render | 6-15ms (stable) |
| TRIPLE_FILLING | 16-18 | Fill triple buffer, first renders | 35-42ms (SPIKE) |
| NORMAL_PLAYBACK | 19+ | Steady state rendering | 6-22ms (stable) |

GlobalFrameBudget is only active during TRIPLE_FILLING phase.


4. Class Interface

Header: GlobalFrameBudget.h

#include <atomic>
#include <cstdint>

namespace Vav2Player {

class GlobalFrameBudget
{
public:
    static GlobalFrameBudget& GetInstance();

    // Acquire permission to process frame
    // Returns: true if slot acquired, false if budget limit reached
    bool TryAcquireFrameSlot(int playerId, uint64_t frameNumber);

    // Release slot after processing complete
    void ReleaseFrameSlot(int playerId);

    // Query current state
    int GetActiveFrameCount() const { return m_activeFrames.load(); }

    // Statistics
    struct BudgetStatistics {
        uint64_t totalAcquireAttempts;
        uint64_t successfulAcquires;
        uint64_t rejectedAcquires;
        double rejectionRate;
    };

    BudgetStatistics GetStatistics() const;
    void ResetStatistics();

private:
    GlobalFrameBudget() = default;
    ~GlobalFrameBudget() = default;

    // Disable copy/move
    GlobalFrameBudget(const GlobalFrameBudget&) = delete;
    GlobalFrameBudget& operator=(const GlobalFrameBudget&) = delete;

    // Configuration
    static constexpr int MAX_CONCURRENT_FRAMES_BOTTLENECK = 3;

    // State
    std::atomic<int> m_activeFrames{0};

    // Statistics
    std::atomic<uint64_t> m_totalAcquireAttempts{0};
    std::atomic<uint64_t> m_successfulAcquires{0};
    std::atomic<uint64_t> m_rejectedAcquires{0};
};

} // namespace Vav2Player

5. Integration with FrameProcessor

FrameProcessor Changes

FrameProcessor.h additions:

class FrameProcessor
{
public:
    // Processing phase query
    enum class Phase {
        INITIAL_BUFFERING,
        TRIPLE_FILLING,
        NORMAL_PLAYBACK
    };

    Phase GetCurrentPhase() const;

private:
    // Track if budget slot was acquired (for proper release)
    std::atomic<bool> m_budgetSlotAcquired{false};
};

FrameProcessor.cpp integration:

bool FrameProcessor::ProcessFrame(VavCorePlayer* player,
                                  std::function<void(bool)> onComplete)
{
    // Existing: Skip if previous frame still processing
    if (m_frameProcessing.load()) {
        m_framesDropped++;
        return false;
    }

    Phase currentPhase = GetCurrentPhase();

    // NEW: Apply GlobalFrameBudget during bottleneck phase
    if (currentPhase == Phase::TRIPLE_FILLING) {
        if (!GlobalFrameBudget::GetInstance().TryAcquireFrameSlot(
                m_playerInstanceId, m_framesDecoded)) {

            LOGF_DEBUG("[Player#%d] Frame %llu SKIPPED (global budget limit)",
                      m_playerInstanceId, m_framesDecoded.load());
            m_framesDropped++;
            return false;
        }
        m_budgetSlotAcquired = true;
    }

    m_frameProcessing = true;

    // ... existing decode logic ...

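    // UI thread callback with budget release
    const bool enqueued = m_dispatcherQueue.TryEnqueue([this, renderIndex, onComplete]() {
        bool renderSuccess = m_renderer->RenderFrame(renderIndex);

        // NEW: Release budget slot after render complete.
        // exchange() reads and clears the flag in one step, so the slot is released at most once.
        if (m_budgetSlotAcquired.exchange(false)) {
            GlobalFrameBudget::GetInstance().ReleaseFrameSlot(m_playerInstanceId);
        }

        m_frameProcessing = false;
        onComplete(renderSuccess);
    });

    // If the callback was never queued it will never run, so release here instead.
    // (Assumes TryEnqueue reports failure through its bool return value.)
    if (!enqueued) {
        if (m_budgetSlotAcquired.exchange(false)) {
            GlobalFrameBudget::GetInstance().ReleaseFrameSlot(m_playerInstanceId);
        }
        m_frameProcessing = false;
        return false;
    }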

    m_framesDecoded++;
    return true;
}

FrameProcessor::Phase FrameProcessor::GetCurrentPhase() const
{
    uint64_t decoded = m_framesDecoded.load();

    if (decoded < VAVCORE_NVDEC_INITIAL_BUFFERING) {
        return Phase::INITIAL_BUFFERING;
    }
    else if (decoded < VAVCORE_NVDEC_INITIAL_BUFFERING + VAV2PLAYER_TRIPLE_BUFFER_SIZE) {
        return Phase::TRIPLE_FILLING;
    }
    else {
        return Phase::NORMAL_PLAYBACK;
    }
}

6. Simulation Scenario

Timeline with 4 Players

Initial State: m_activeFrames = 0, MAX = 3

t=0ms: All 4 players call Play(), start frame 0

Phase 1 (frames 0-15): INITIAL_BUFFERING
  - All 4 players process normally
  - No GlobalFrameBudget involvement
  - QUEUE_DELAY: 6-15ms (stable)

t=533ms: All 4 players reach frame 16

Phase 2 (frames 16-18): TRIPLE_FILLING (BOTTLENECK)

  Frame 16:
    t=533ms: Player#0 TryAcquire → m_activeFrames: 0→1 ✅
    t=533ms: Player#1 TryAcquire → m_activeFrames: 1→2 ✅
    t=533ms: Player#2 TryAcquire → m_activeFrames: 2→3 ✅
    t=533ms: Player#3 TryAcquire → REJECTED (3 >= 3) ❌ [FRAME SKIPPED]

  t=543ms: Player#0 render complete → Release → m_activeFrames: 3→2
  t=543ms: Player#3 ProcessFrame (retry frame 16) → TryAcquire → 2→3 ✅

  Frame 17:
    Similar pattern: One player skips, retries after slot release

  Frame 18:
    Similar pattern: One player skips, retries after slot release

t=633ms: All 4 players reach frame 19

Phase 3 (frames 19+): NORMAL_PLAYBACK
  - All 4 players process normally
  - No GlobalFrameBudget involvement
  - QUEUE_DELAY: 6-22ms (stable)

Result:
  - Bottleneck phase: 4 players → max 3 concurrent
  - NVDEC queue load: 25% reduction
  - QUEUE_DELAY: 35-42ms → ~28-33ms (within 33.33ms budget)
  - Player synchronization: Maintained (skipped frames retry immediately)
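
The frame 16 timeline above can be replayed with a small deterministic model. This is only a sketch: `BudgetModel` and `SimulateBottleneckFrame` are hypothetical names, the model is single-threaded (the real class uses `std::atomic`), and players are assumed to arrive in player-ID order as in the timeline.

```cpp
#include <cassert>
#include <vector>

// Single-threaded model of the budget counter (hypothetical, for illustration only).
struct BudgetModel {
    static constexpr int kMaxConcurrent = 3;  // mirrors MAX_CONCURRENT_FRAMES_BOTTLENECK
    int active = 0;

    bool TryAcquire() {
        if (active >= kMaxConcurrent) return false;
        ++active;
        return true;
    }
    void Release() { --active; }
};

// Replays one bottleneck frame for `players` players in player-ID order.
// Returns how many players were rejected on their first attempt.
int SimulateBottleneckFrame(BudgetModel& budget, int players) {
    std::vector<int> waiting;
    for (int id = 0; id < players; ++id) {
        if (!budget.TryAcquire()) waiting.push_back(id);  // frame skipped, retried later
    }
    const int rejected = static_cast<int>(waiting.size());

    // Each render completion frees one slot, which a waiting player takes on retry.
    for (std::size_t i = 0; i < waiting.size(); ++i) {
        budget.Release();
        const bool ok = budget.TryAcquire();
        assert(ok);
        (void)ok;
    }

    // The remaining renders complete and return their slots.
    while (budget.active > 0) budget.Release();
    return rejected;
}
```

With 4 players the model rejects exactly one first attempt per frame, matching the Player#3 line in the timeline; with 8 players it rejects five.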

7. Thread Safety

Lock-Free Design

All operations use atomic primitives for thread safety without mutexes:

bool GlobalFrameBudget::TryAcquireFrameSlot(int playerId, uint64_t frameNumber)
{
    m_totalAcquireAttempts.fetch_add(1, std::memory_order_relaxed);

    // Atomic read
    int current = m_activeFrames.load(std::memory_order_acquire);

    // Lock-free CAS loop; exits immediately if the budget is already full
    while (current < MAX_CONCURRENT_FRAMES_BOTTLENECK) {
        if (m_activeFrames.compare_exchange_weak(current, current + 1,
                                                  std::memory_order_acq_rel,
                                                  std::memory_order_acquire)) {
            m_successfulAcquires.fetch_add(1, std::memory_order_relaxed);
            return true;  // Successfully acquired
        }
        // compare_exchange_weak failed (contention or spurious failure):
        // current now holds the latest value, so the loop re-checks the limit
    }

    m_rejectedAcquires.fetch_add(1, std::memory_order_relaxed);
    return false;  // Budget full, either on entry or after losing a race
}

void GlobalFrameBudget::ReleaseFrameSlot(int playerId)
{
    // Atomic decrement
    m_activeFrames.fetch_sub(1, std::memory_order_acq_rel);
}

Memory Ordering Rationale

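  • acquire/release: The decrement in ReleaseFrameSlot() is release-ordered and the read/CAS in TryAcquireFrameSlot() is acquire-ordered, so a player that takes a freed slot also observes the writes made before that slot was released
  • relaxed (statistics): Counters are diagnostic only; they need atomicity but no ordering relative to other memory, and small inaccuracies do not affect correctness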
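
The invariant the CAS loop is meant to guarantee (never more than three slots held at once, counter back at zero when idle) can be exercised with a small stress harness. The sketch below is an assumption-laden stand-in, not the project's test code: `TryAcquire`, `StressResult` and `RunStress` are hypothetical names, and the counter is a local `std::atomic<int>` rather than the singleton.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

constexpr int kLimit = 3;  // mirrors MAX_CONCURRENT_FRAMES_BOTTLENECK

// Same CAS pattern as TryAcquireFrameSlot(), on a caller-supplied counter.
bool TryAcquire(std::atomic<int>& active) {
    int current = active.load(std::memory_order_acquire);
    while (current < kLimit) {
        if (active.compare_exchange_weak(current, current + 1,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire)) {
            return true;
        }
        // On failure `current` holds the fresh value; the loop re-checks the limit.
    }
    return false;
}

struct StressResult {
    int peak;        // highest occupancy seen by any slot holder
    int finalCount;  // counter value after all workers finished
};

// Hypothetical harness: `threads` workers each attempt `iterations` acquisitions.
StressResult RunStress(int threads, int iterations) {
    std::atomic<int> active{0};
    std::atomic<int> peak{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&active, &peak, iterations] {
            for (int i = 0; i < iterations; ++i) {
                if (!TryAcquire(active)) continue;  // rejected: caller would skip the frame
                const int seen = active.load(std::memory_order_acquire);
                assert(seen >= 1 && seen <= kLimit);  // we hold a slot, limit not exceeded
                int prev = peak.load(std::memory_order_relaxed);
                while (seen > prev &&
                       !peak.compare_exchange_weak(prev, seen, std::memory_order_relaxed)) {
                }
                active.fetch_sub(1, std::memory_order_acq_rel);
            }
        });
    }
    for (auto& w : workers) w.join();
    return {peak.load(), active.load()};
}
```

Running it with 8 workers also approximates the stress case in the testing plan: the peak should never exceed 3 and the final count should be 0.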

8. Performance Impact

Expected Improvements

Before (No GlobalFrameBudget):

  • Frames 16-18: All 4 players decode simultaneously
  • NVDEC queue: 4 concurrent submissions
  • QUEUE_DELAY: 35-42ms (exceeds 33.33ms budget)
  • Result: Stutter/frame drops

After (With GlobalFrameBudget):

  • Frames 16-18: Max 3 players decode concurrently
  • NVDEC queue: 3 concurrent submissions (25% reduction)
  • QUEUE_DELAY: ~28-33ms (within 33.33ms budget)
  • Result: Smooth playback

Measured vs. Expected Metrics (baseline from time.log)

| Metric | Without Budget | With Budget (Expected) |
|---|---|---|
| QUEUE_DELAY (frame 16) | 35-42ms | 28-33ms |
| Frames dropped | 0-2 | 1-3 (brief skip) |
| Total playback time | ~600ms | ~650ms (+8% initially) |
| Sync after frame 19 | Perfect | Perfect |

9. Statistics and Monitoring

BudgetStatistics Structure

struct BudgetStatistics {
    uint64_t totalAcquireAttempts;   // Total TryAcquireFrameSlot() calls
    uint64_t successfulAcquires;     // Slots acquired successfully
    uint64_t rejectedAcquires;       // Rejections due to budget limit
    double rejectionRate;             // rejectedAcquires / totalAcquireAttempts
};
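
GetStatistics() is not spelled out in this document. One way it might derive the snapshot from the raw counters is sketched below; `MakeStatistics` and `ExampleSnapshot` are hypothetical helpers, and the zero-attempt guard is an assumption about the desired behavior.

```cpp
#include <cassert>
#include <cstdint>

struct BudgetStatistics {
    std::uint64_t totalAcquireAttempts;
    std::uint64_t successfulAcquires;
    std::uint64_t rejectedAcquires;
    double rejectionRate;
};

// Hypothetical helper: builds a snapshot from raw counter values.
// The total is derived from the two outcome counters so the fields stay consistent,
// and the division is guarded so an idle budget reports 0 instead of NaN.
BudgetStatistics MakeStatistics(std::uint64_t successes, std::uint64_t rejections) {
    const std::uint64_t total = successes + rejections;
    const double rate = (total == 0)
        ? 0.0
        : static_cast<double>(rejections) / static_cast<double>(total);
    return BudgetStatistics{total, successes, rejections, rate};
}

// Worked example: 12 successful acquires plus 4 transient rejections.
inline void ExampleSnapshot() {
    const BudgetStatistics s = MakeStatistics(12, 4);
    assert(s.totalAcquireAttempts == 16);
    assert(s.rejectionRate == 0.25);
}
```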

Usage Example

// After playback test
auto stats = GlobalFrameBudget::GetInstance().GetStatistics();

LOGF_INFO("GlobalFrameBudget Statistics:");
LOGF_INFO("  Total attempts: %llu", stats.totalAcquireAttempts);
LOGF_INFO("  Successful: %llu", stats.successfulAcquires);
LOGF_INFO("  Rejected: %llu", stats.rejectedAcquires);
LOGF_INFO("  Rejection rate: %.2f%%", stats.rejectionRate * 100.0);

// Expected results with 4 players:
//   Successful: ~12 (4 players × 3 frames, all eventually succeed)
//   Rejected: ~3-6 (transient rejections, immediate retry)
//   Total attempts: ~15-18 (successful + rejected)
//   Rejection rate: ~20-33% (acceptable due to immediate retry)

10. Configuration Tuning

MAX_CONCURRENT_FRAMES_BOTTLENECK

Current value: 3

Rationale:

  • 4 concurrent → 35-42ms QUEUE_DELAY (exceeds budget)
  • 3 concurrent → ~28-33ms QUEUE_DELAY (within budget)
  • 2 concurrent → Would be too conservative, longer total time

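Tuning guide:

  • Decrease if QUEUE_DELAY still exceeds the budget (fewer concurrent frames means a shorter NVDEC queue)
  • Increase if QUEUE_DELAY has comfortable headroom and the extra startup time is not acceptable
  • Monitor via BudgetStatistics.rejectionRate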

Phase Detection Thresholds

Current values:

  • INITIAL_BUFFERING: frames 0-15 (VAVCORE_NVDEC_INITIAL_BUFFERING)
  • TRIPLE_FILLING: frames 16-18 (+VAV2PLAYER_TRIPLE_BUFFER_SIZE)
  • NORMAL_PLAYBACK: frames 19+

Tuning guide:

  • Extend TRIPLE_FILLING range if stuttering persists after frame 18
  • Reduce if budget overhead is unnecessary
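
To illustrate what extending the TRIPLE_FILLING range means, here is a sketch of phase detection with the two thresholds as parameters. `PhaseForFrame` is a hypothetical free function; the defaults 16 and 3 are assumed from the frame ranges listed above and stand in for VAVCORE_NVDEC_INITIAL_BUFFERING and VAV2PLAYER_TRIPLE_BUFFER_SIZE.

```cpp
#include <cstdint>

enum class Phase { INITIAL_BUFFERING, TRIPLE_FILLING, NORMAL_PLAYBACK };

// Hypothetical parameterized version of GetCurrentPhase().
// initialBuffering: frames spent filling the DPB (assumed default 16).
// fillingFrames:    length of the budget-controlled window (assumed default 3).
constexpr Phase PhaseForFrame(std::uint64_t framesDecoded,
                              std::uint64_t initialBuffering = 16,
                              std::uint64_t fillingFrames = 3) {
    return framesDecoded < initialBuffering
               ? Phase::INITIAL_BUFFERING
               : framesDecoded < initialBuffering + fillingFrames
                     ? Phase::TRIPLE_FILLING
                     : Phase::NORMAL_PLAYBACK;
}

// Default window: frames 16-18 are budget-controlled.
static_assert(PhaseForFrame(18) == Phase::TRIPLE_FILLING, "frame 18 is inside the default window");
static_assert(PhaseForFrame(19) == Phase::NORMAL_PLAYBACK, "frame 19 is past the default window");

// Extended window of 5 frames: frames 16-20 are budget-controlled.
static_assert(PhaseForFrame(19, 16, 5) == Phase::TRIPLE_FILLING, "frame 19 is inside the extended window");
static_assert(PhaseForFrame(21, 16, 5) == Phase::NORMAL_PLAYBACK, "frame 21 is past the extended window");
```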

11. Error Handling

Slot Leak Prevention

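Problem: If ReleaseFrameSlot() is not called, m_activeFrames never decrements, causing permanent budget exhaustion. The opposite fault, a release without a matching acquire, drives the counter negative and silently raises the effective limit.

Solution: Release on every code path (see Proper Cleanup Pattern below), and add a sanity check in ReleaseFrameSlot() that detects and undoes an unmatched release:

void GlobalFrameBudget::ReleaseFrameSlot(int playerId)
{
    int previous = m_activeFrames.fetch_sub(1, std::memory_order_acq_rel);

    // Sanity check: an unmatched release would leave the counter negative
    if (previous <= 0) {
        LOGF_ERROR("[GlobalFrameBudget] Player#%d attempted to release but m_activeFrames was %d!",
                  playerId, previous);
        // Undo the bogus decrement so the limit is not silently raised
        m_activeFrames.fetch_add(1, std::memory_order_acq_rel);
    }
}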

Proper Cleanup Pattern

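// In FrameProcessor::ProcessFrame()
bool slotAcquired = false;

if (currentPhase == Phase::TRIPLE_FILLING) {
    if (GlobalFrameBudget::GetInstance().TryAcquireFrameSlot(...)) {
        slotAcquired = true;
    } else {
        return false;  // Skip frame
    }
}

// Ensure release happens in all code paths.
// Capture by value: the callback runs on the UI thread after this function
// has returned, so a reference to the local slotAcquired would dangle.
auto cleanup = [slotAcquired]() {
    if (slotAcquired) {
        GlobalFrameBudget::GetInstance().ReleaseFrameSlot(...);
    }
};

// Error path: release immediately, before anything is enqueued
if (decodeError) {
    cleanup();
    return false;
}

// Normal path: UI thread callback releases after render
bool enqueued = m_dispatcherQueue.TryEnqueue([cleanup, ...]() {
    // ... render ...
    cleanup();
});

// Enqueue failure: the callback will never run, so release here
// (assumes TryEnqueue reports failure through its bool return value)
if (!enqueued) {
    cleanup();
    return false;
}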
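
An alternative to calling a cleanup lambda by hand is to tie the slot to an object's lifetime. The following is a sketch under stated assumptions: `FrameSlotGuard` and `BeginFrame` are hypothetical names, a plain `std::atomic<int>` stands in for the singleton's counter, and the guard is held through `std::shared_ptr` because `std::function` requires a copyable callable.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical RAII guard: owns one already-acquired slot and returns it exactly once.
class FrameSlotGuard {
public:
    explicit FrameSlotGuard(std::atomic<int>& activeFrames) : m_active(&activeFrames) {}
    ~FrameSlotGuard() {
        const int previous = m_active->fetch_sub(1, std::memory_order_acq_rel);
        assert(previous > 0);  // a guard must only exist for a slot that was acquired
        (void)previous;
    }
    FrameSlotGuard(const FrameSlotGuard&) = delete;
    FrameSlotGuard& operator=(const FrameSlotGuard&) = delete;

private:
    std::atomic<int>* m_active;
};

// Stand-in for the decode + enqueue sequence. Returns the render callback,
// or an empty function when decoding failed.
std::function<void()> BeginFrame(std::atomic<int>& activeFrames, bool decodeError) {
    // Stands in for a successful TryAcquireFrameSlot().
    activeFrames.fetch_add(1, std::memory_order_acq_rel);
    auto guard = std::make_shared<FrameSlotGuard>(activeFrames);

    if (decodeError) {
        return {};  // guard is destroyed on return: slot released on the error path
    }

    return [guard]() mutable {
        // ... render ...
        guard.reset();  // release right after render
    };
    // If the callback is destroyed without ever running, its captured guard
    // is destroyed with it, so the slot is still returned.
}
```

The design choice here is that no code path needs to remember to release: error return, normal completion and a dropped callback all end the guard's lifetime.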

12. Future Enhancements

Adaptive Budget

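Dynamically adjust the concurrency limit (currently the compile-time constant MAX_CONCURRENT_FRAMES_BOTTLENECK) based on measured QUEUE_DELAY:

#include <algorithm>  // std::max, std::min

// Note: deriving from GlobalFrameBudget requires its constructor and destructor
// to become protected instead of private, and TryAcquireFrameSlot() to read
// m_maxConcurrent instead of the constant.
class AdaptiveFrameBudget : public GlobalFrameBudget
{
private:
    std::atomic<int> m_maxConcurrent{3};  // Dynamic limit

public:
    void UpdateBudget(double measuredQueueDelay) {
        int current = m_maxConcurrent.load();
        int desired;
        do {
            if (measuredQueueDelay > 35.0) {
                desired = std::max(1, current - 1);  // Too high, reduce concurrency
            } else if (measuredQueueDelay < 25.0) {
                desired = std::min(4, current + 1);  // Safe margin, can increase
            } else {
                return;                              // Within band, keep current limit
            }
            // CAS so concurrent updates cannot overwrite each other's adjustment
        } while (!m_maxConcurrent.compare_exchange_weak(current, desired));
    }
};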

Per-Decoder Budget

Different decoders may have different queue capacities:

struct DecoderBudget {
    int maxConcurrentNVDEC = 3;
    int maxConcurrentVPL = 4;
    int maxConcurrentAMF = 3;
};

Priority-Based Slot Allocation

Assign priority to players for fair scheduling:

bool TryAcquireFrameSlot(int playerId, int priority, uint64_t frameNumber);

13. Testing Plan

Unit Tests

  1. Basic slot acquisition:

    • Acquire 3 slots → all succeed
    • Acquire 4th slot → fail
  2. Slot release:

    • Acquire 3 → release 1 → acquire 1 more → succeed
  3. Statistics tracking:

    • Verify counters increment correctly
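
The three unit test cases above could look roughly like this. The sketch uses a hypothetical non-singleton stand-in (`TestBudget`) so each test starts from a clean counter; the real GlobalFrameBudget is a singleton and would need ResetStatistics() plus a drained counter between tests.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Minimal stand-in for GlobalFrameBudget (hypothetical, for test illustration).
class TestBudget {
public:
    bool TryAcquire() {
        m_attempts.fetch_add(1, std::memory_order_relaxed);
        int current = m_active.load(std::memory_order_acquire);
        while (current < kMax) {
            if (m_active.compare_exchange_weak(current, current + 1,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire)) {
                return true;
            }
        }
        m_rejected.fetch_add(1, std::memory_order_relaxed);
        return false;
    }
    void Release() { m_active.fetch_sub(1, std::memory_order_acq_rel); }
    int Active() const { return m_active.load(); }
    std::uint64_t Attempts() const { return m_attempts.load(); }
    std::uint64_t Rejected() const { return m_rejected.load(); }

private:
    static constexpr int kMax = 3;  // mirrors MAX_CONCURRENT_FRAMES_BOTTLENECK
    std::atomic<int> m_active{0};
    std::atomic<std::uint64_t> m_attempts{0};
    std::atomic<std::uint64_t> m_rejected{0};
};

// 1. Basic slot acquisition: three succeed, the fourth fails.
bool TestAcquireUpToLimit() {
    TestBudget b;
    bool firstThree = true;
    for (int i = 0; i < 3; ++i) firstThree = b.TryAcquire() && firstThree;
    const bool fourth = b.TryAcquire();
    return firstThree && !fourth && b.Active() == 3;
}

// 2. Slot release: acquire 3, release 1, acquire 1 more.
bool TestReleaseFreesSlot() {
    TestBudget b;
    for (int i = 0; i < 3; ++i) b.TryAcquire();
    b.Release();
    const bool reacquired = b.TryAcquire();
    return reacquired && b.Active() == 3;
}

// 3. Statistics tracking: five attempts against a limit of three.
bool TestStatistics() {
    TestBudget b;
    for (int i = 0; i < 5; ++i) b.TryAcquire();
    return b.Attempts() == 5 && b.Rejected() == 2;
}

void RunAllBudgetTests() {
    assert(TestAcquireUpToLimit());
    assert(TestReleaseFreesSlot());
    assert(TestStatistics());
}
```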

Integration Tests

  1. Single player:

    • GlobalFrameBudget should not interfere
    • Verify normal playback
  2. 4 simultaneous players:

    • Monitor QUEUE_DELAY during frames 16-18
    • Verify stays within 33.33ms budget
    • Check synchronization after frame 19
  3. Stress test:

    • 8 simultaneous players
    • Verify budget prevents complete stall

Performance Benchmarks

Compare time.log with/without GlobalFrameBudget:

  • QUEUE_DELAY distribution
  • Total frame drop count
  • Playback smoothness (subjective)

14. Implementation Checklist

  • Create GlobalFrameBudget.h
  • Implement GlobalFrameBudget.cpp
  • Add Phase enum to FrameProcessor.h
  • Implement GetCurrentPhase() in FrameProcessor.cpp
  • Integrate TryAcquireFrameSlot() in ProcessFrame()
  • Integrate ReleaseFrameSlot() in UI callback
  • Add m_budgetSlotAcquired tracking
  • Add GlobalFrameBudget.cpp to Vav2Player.vcxproj
  • Build and verify compilation
  • Test with 4 simultaneous players
  • Analyze time.log for improvements
  • Document final results

15. References

  • Original Issue: time.log analysis showing 35-42ms QUEUE_DELAY spikes
  • Root Cause: 4 players simultaneously hitting frames 16-18 (triple buffer filling)
  • DPB Size Investigation: DPB_SIZE=4/8 crashes, must remain 16
  • NVDEC Spec: min_num_decode_surfaces=9 from AV1 sequence header (codec spec, not H/W)

Document Version: 1.0
Last Updated: 2025-10-11
Author: Claude Code