Files
video-v1/vav2/docs/working/Playback_Timing_Fix_Design.md
2025-10-07 14:53:33 +09:00

11 KiB

Playback Timing Fix Design

Date: 2025-10-07 Author: Claude Code Status: Design Phase

Problem Statement

Symptom

Video playback is jerky ("통통 튄다") at 30fps despite NVDEC hardware acceleration working correctly.

Root Cause Analysis

NOT the problem:

  • CUDA synchronization (cuStreamSynchronize is fine)
  • Fence implementation (works correctly but unnecessary)
  • GPU decode performance (10-15ms, well within budget)

ACTUAL problem:

  • PlaybackController::TimingThreadLoop() does NOT wait for frame processing completion
  • Timing thread sends "next frame" signal every 33.33ms regardless of previous frame status
  • FrameProcessor drops frames when previous frame still processing (m_frameProcessing flag)

Evidence

PlaybackController.cpp:249-275 - Current broken implementation:

void PlaybackController::TimingThreadLoop()
{
    while (!m_shouldStopTiming && m_isPlaying) {
        // 1. Signal frame processing
        if (m_frameReadyCallback) {
            m_frameReadyCallback();  // → ProcessFrame()
        }

        // 2. ❌ IMMEDIATE advance without waiting!
        m_currentFrame++;
        m_currentTime = m_currentFrame / m_frameRate;

        // 3. Sleep until next frame
        std::this_thread::sleep_until(nextFrame);
    }
}

FrameProcessor.cpp:47-53 - Frame drop mechanism:

bool FrameProcessor::ProcessFrame(...)
{
    // Check if previous frame still processing
    bool expected = false;
    if (!m_frameProcessing.compare_exchange_strong(expected, true)) {
        m_framesDropped++;  // ← This causes jerky playback!
        return false;
    }
    // ... decode and render ...
}

Solution Design

Minimal change, keeps existing architecture

Modified PlaybackController.cpp

void PlaybackController::TimingThreadLoop()
{
    auto start = std::chrono::high_resolution_clock::now();

    while (!m_shouldStopTiming && m_isPlaying) {
        auto frameStart = std::chrono::high_resolution_clock::now();

        // 1. Signal frame processing
        if (m_frameReadyCallback) {
            m_frameReadyCallback();
        }

        // 2. ✅ WAIT for frame processing completion (max 100ms timeout)
        int waitCount = 0;
        while (IsFrameProcessing() && waitCount < 100) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            waitCount++;
        }

        if (waitCount >= 100) {
            LOGF_WARNING("[PlaybackController] Frame processing timeout");
        }

        // 3. Advance frame counter
        m_currentFrame++;
        m_currentTime = m_currentFrame / m_frameRate;

        // 4. Sleep for remaining time to maintain 30fps
        auto frameEnd = std::chrono::high_resolution_clock::now();
        double elapsed = std::chrono::duration<double, std::milli>(frameEnd - frameStart).count();

        auto nextFrameTime = start + std::chrono::microseconds(
            static_cast<long long>(33333.0 * m_currentFrame));
        std::this_thread::sleep_until(nextFrameTime);
    }
}

Required Changes

PlaybackController.h:

class PlaybackController
{
    // Add frame processor reference
    void SetFrameProcessor(FrameProcessor* processor) { m_frameProcessor = processor; }
    bool IsFrameProcessing() const {
        return m_frameProcessor && m_frameProcessor->IsProcessing();
    }

private:
    FrameProcessor* m_frameProcessor = nullptr;  // Non-owning pointer
};

VideoPlayerControl2.xaml.cpp:

void VideoPlayerControl2::LoadVideo(...)
{
    // ... existing code ...

    // Link PlaybackController to FrameProcessor
    m_playbackController->SetFrameProcessor(m_frameProcessor.get());
}

Benefits

  • Minimal code change (10-15 lines)
  • Keeps existing architecture intact
  • Easy to test and rollback
  • No risk to other components
  • Frame drops eliminated

Approach 2: Remove Fence + Fix Timing (COMPREHENSIVE)

Reduces complexity while fixing the root cause

Changes

1. Remove CUDA-D3D12 Fence (complexity reduction):

Delete:

  • D3D12SurfaceHandler::SetD3D12Fence()
  • D3D12SurfaceHandler::SignalD3D12Fence()
  • vavcore_get_sync_fence() API
  • VavCoreVideoFrame::sync_fence_value field

Replace with:

// NVDECAV1Decoder.cpp
VavCoreResult NVDECAV1Decoder::DecodeToSurface(...)
{
    // ... decode logic ...

    // Copy RGBA to D3D12 texture
    m_d3d12Handler->CopyRGBAFrame(...);

    // ✅ Simple synchronous wait (15ms max)
    cuStreamSynchronize(m_stream);

    // ❌ Remove fence signaling
    // m_d3d12Handler->SignalD3D12Fence(...);

    return VAVCORE_SUCCESS;
}

2. Fix TimingThreadLoop (same as Approach 1)

3. Keep D3D12 Internal Fence (required for SwapChain):

  • Keep D3D12VideoRenderer::m_fence
  • Keep WaitForFrameCompletion() for triple buffering

Benefits

  • Reduced complexity (remove External Fence)
  • Fixes root cause (timing)
  • Same performance (15ms CPU block < 33.33ms budget)
  • Easier debugging (synchronous flow)

Risks

  • ⚠️ More code changes (3 components)
  • ⚠️ Requires more thorough testing

Decision Matrix

Aspect Approach 1 (Fix Timing Only) Approach 2 (Remove Fence + Fix)
Code Changes 10-15 lines ~50 lines
Complexity Reduction None Medium
Performance Impact None None (15ms < 33.33ms)
Risk Level Very Low Low
Testing Effort Minimal Moderate
Recommendation Start here Follow-up refactor

Implementation Plan

Phase 1: Fix Timing (Immediate)

Goal: Stop jerky playback

Tasks:

  1. Modify PlaybackController::TimingThreadLoop() to wait for frame completion
  2. Add SetFrameProcessor() method to PlaybackController
  3. Link FrameProcessor in VideoPlayerControl2
  4. Test with sample video

Files Modified:

  • PlaybackController.h (add 3 lines)
  • PlaybackController.cpp (modify TimingThreadLoop, ~15 lines)
  • VideoPlayerControl2.xaml.cpp (add 1 line)

Expected Outcome: Smooth 30fps playback


Phase 2: Remove Fence (Follow-up)

Goal: Reduce technical debt

Tasks:

  1. Remove External Fence from D3D12SurfaceHandler
  2. Replace with cuStreamSynchronize() in NVDECAV1Decoder
  3. Remove vavcore_get_sync_fence() from VavCore API
  4. Remove fence wait from D3D12VideoRenderer
  5. Update VavCoreVideoFrame struct

Files Modified:

  • D3D12SurfaceHandler.h/.cpp
  • NVDECAV1Decoder.cpp
  • D3D12VideoRenderer.cpp
  • VavCore.h (C API)
  • VideoTypes.h (internal)

Expected Outcome: Same performance, simpler code


Performance Analysis

Frame Processing Timeline (30fps = 33.33ms budget)

Current (Broken):

0ms    : Timing thread signals frame N
0ms    : ProcessFrame(N) starts
10ms   : CUDA decode completes (async)
15ms   : UI render starts
25ms   : Frame N complete
33.33ms: Timing thread signals frame N+1 (regardless of N status!)
         → If frame N not done, FRAME DROP occurs

After Fix:

0ms    : Timing thread signals frame N
0ms    : ProcessFrame(N) starts
10ms   : CUDA decode completes
15ms   : UI render starts
25ms   : Frame N complete
25ms   : Timing thread detects completion, advances counter
33.33ms: Sleep completes, signal frame N+1
         → NO FRAME DROPS

Budget Analysis

  • NVDEC decode: 10-15ms
  • UI render: 5-10ms
  • Total: 15-25ms
  • Budget: 33.33ms (30fps)
  • Margin: 8-18ms spare time

Conclusion: We have plenty of time budget, synchronous processing is perfectly fine.


Alternative Considered and Rejected

Fence-based Async Processing

Why rejected:

  1. Current architecture doesn't benefit: Timing thread doesn't overlap work
  2. 33.33ms budget sufficient: No need for async optimization
  3. Adds complexity: External fence requires CUDA/D3D12 interop code
  4. Doesn't fix root cause: Timing thread still needs to wait

When Fence would be beneficial:

  • Variable framerate (60fps+)
  • Tight timing budget (< 16ms)
  • Truly parallel decode/render pipeline

Our case: Fixed 30fps, 33.33ms budget → synchronous is simpler and sufficient


Testing Plan

Phase 1 Testing (Timing Fix)

Unit Tests: None required (integration testing sufficient)

Integration Tests:

  1. Smooth Playback Test:

    • Load 30fps test video
    • Play for 30 seconds
    • Verify no frame drops
    • Verify smooth visual playback
  2. Frame Timing Test:

    • Log actual frame intervals
    • Verify 33.33ms ± 2ms intervals
    • Check m_framesDropped counter (should be 0)
  3. Long Duration Test:

    • Play 5-minute video
    • Monitor frame drops
    • Check for memory leaks

Success Criteria:

  • Zero frame drops over 5 minutes
  • Consistent 33.33ms frame intervals
  • Smooth visual playback (no jerking)

Phase 2 Testing (Fence Removal)

Additional Tests:

  1. Performance Regression Test:

    • Measure decode time before/after
    • Verify < 20ms average decode time
    • Check CPU usage (should be same)
  2. Multi-format Test:

    • Test with different resolutions (720p, 1080p, 4K)
    • Verify all formats work
  3. Decoder Compatibility Test:

    • Test NVDEC (affected by change)
    • Test VPL (should be unaffected)
    • Test dav1d fallback (should be unaffected)

Success Criteria:

  • Same performance as before
  • No visual regression
  • All decoders work correctly

Rollback Plan

Phase 1 Rollback

If timing fix causes issues:

  1. Revert PlaybackController.cpp changes
  2. Revert PlaybackController.h changes
  3. Revert VideoPlayerControl2.xaml.cpp link

Time: < 5 minutes

Phase 2 Rollback

If Fence removal causes issues:

  1. Git revert to Phase 1 completion
  2. Keep timing fix, restore Fence code

Time: < 10 minutes


Implementation Notes

Thread Safety Considerations

FrameProcessor::IsProcessing():

  • Called from Timing Thread
  • Reads atomic bool m_frameProcessing
  • Thread-safe (atomic operation)

PlaybackController::SetFrameProcessor():

  • Called from UI Thread during LoadVideo
  • Non-owning pointer (FrameProcessor lifetime > PlaybackController)
  • Safe (no concurrent access during init)

Performance Considerations

CPU Blocking in cuStreamSynchronize():

  • Blocks for ~15ms
  • Timing thread sleeps for 33.33ms anyway
  • No performance impact (within budget)

Wait Loop in TimingThreadLoop():

  • 1ms sleep per iteration
  • Max 100 iterations (100ms timeout)
  • Typical: 25ms wait = 25 iterations
  • Low CPU overhead

Success Metrics

Immediate (Phase 1)

  • Zero frame drops during playback
  • Smooth visual playback (user confirmed)
  • Consistent 30fps timing (logs verified)

Follow-up (Phase 2)

  • Code complexity reduced (fence code removed)
  • Same or better performance
  • All decoders working (NVDEC, VPL, dav1d)

References

Related Files:

  • vav2/platforms/windows/applications/vav2player/Vav2Player/src/Playback/PlaybackController.cpp
  • vav2/platforms/windows/applications/vav2player/Vav2Player/src/Playback/FrameProcessor.cpp
  • vav2/platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp
  • vav2/platforms/windows/applications/vav2player/Vav2Player/src/Rendering/D3D12VideoRenderer.cpp

Previous Attempts:

  • Fence implementation (2025-10-07): Did not fix jerky playback
  • Timing analysis: Identified root cause in PlaybackController

Approval

Approved by: [User] Date: [Pending] Implementation Start: [Pending]