# Playback Timing Fix Design **Date**: 2025-10-07 **Author**: Claude Code **Status**: Design Phase ## Problem Statement ### Symptom Video playback is jerky ("통통 튄다") at 30fps despite NVDEC hardware acceleration working correctly. ### Root Cause Analysis **NOT the problem**: - ❌ CUDA synchronization (cuStreamSynchronize is fine) - ❌ Fence implementation (works correctly but unnecessary) - ❌ GPU decode performance (10-15ms, well within budget) **ACTUAL problem**: - ✅ **PlaybackController::TimingThreadLoop()** does NOT wait for frame processing completion - ✅ Timing thread sends "next frame" signal every 33.33ms regardless of previous frame status - ✅ FrameProcessor drops frames when previous frame still processing (m_frameProcessing flag) ### Evidence **PlaybackController.cpp:249-275** - Current broken implementation: ```cpp void PlaybackController::TimingThreadLoop() { while (!m_shouldStopTiming && m_isPlaying) { // 1. Signal frame processing if (m_frameReadyCallback) { m_frameReadyCallback(); // → ProcessFrame() } // 2. ❌ IMMEDIATE advance without waiting! m_currentFrame++; m_currentTime = m_currentFrame / m_frameRate; // 3. Sleep until next frame std::this_thread::sleep_until(nextFrame); } } ``` **FrameProcessor.cpp:47-53** - Frame drop mechanism: ```cpp bool FrameProcessor::ProcessFrame(...) { // Check if previous frame still processing bool expected = false; if (!m_frameProcessing.compare_exchange_strong(expected, true)) { m_framesDropped++; // ← This causes jerky playback! return false; } // ... decode and render ... } ``` --- ## Solution Design ### Approach 1: Fix TimingThreadLoop (RECOMMENDED) **Minimal change, keeps existing architecture** #### Modified PlaybackController.cpp ```cpp void PlaybackController::TimingThreadLoop() { auto start = std::chrono::high_resolution_clock::now(); while (!m_shouldStopTiming && m_isPlaying) { auto frameStart = std::chrono::high_resolution_clock::now(); // 1. Signal frame processing if (m_frameReadyCallback) { m_frameReadyCallback(); } // 2. ✅ WAIT for frame processing completion (max 100ms timeout) int waitCount = 0; while (IsFrameProcessing() && waitCount < 100) { std::this_thread::sleep_for(std::chrono::milliseconds(1)); waitCount++; } if (waitCount >= 100) { LOGF_WARNING("[PlaybackController] Frame processing timeout"); } // 3. Advance frame counter m_currentFrame++; m_currentTime = m_currentFrame / m_frameRate; // 4. Sleep for remaining time to maintain 30fps auto frameEnd = std::chrono::high_resolution_clock::now(); double elapsed = std::chrono::duration(frameEnd - frameStart).count(); auto nextFrameTime = start + std::chrono::microseconds( static_cast(33333.0 * m_currentFrame)); std::this_thread::sleep_until(nextFrameTime); } } ``` #### Required Changes **PlaybackController.h**: ```cpp class PlaybackController { // Add frame processor reference void SetFrameProcessor(FrameProcessor* processor) { m_frameProcessor = processor; } bool IsFrameProcessing() const { return m_frameProcessor && m_frameProcessor->IsProcessing(); } private: FrameProcessor* m_frameProcessor = nullptr; // Non-owning pointer }; ``` **VideoPlayerControl2.xaml.cpp**: ```cpp void VideoPlayerControl2::LoadVideo(...) { // ... existing code ... // Link PlaybackController to FrameProcessor m_playbackController->SetFrameProcessor(m_frameProcessor.get()); } ``` #### Benefits - ✅ Minimal code change (10-15 lines) - ✅ Keeps existing architecture intact - ✅ Easy to test and rollback - ✅ No risk to other components - ✅ Frame drops eliminated --- ### Approach 2: Remove Fence + Fix Timing (COMPREHENSIVE) **Reduces complexity while fixing the root cause** #### Changes **1. Remove CUDA-D3D12 Fence** (complexity reduction): Delete: - `D3D12SurfaceHandler::SetD3D12Fence()` - `D3D12SurfaceHandler::SignalD3D12Fence()` - `vavcore_get_sync_fence()` API - `VavCoreVideoFrame::sync_fence_value` field Replace with: ```cpp // NVDECAV1Decoder.cpp VavCoreResult NVDECAV1Decoder::DecodeToSurface(...) { // ... decode logic ... // Copy RGBA to D3D12 texture m_d3d12Handler->CopyRGBAFrame(...); // ✅ Simple synchronous wait (15ms max) cuStreamSynchronize(m_stream); // ❌ Remove fence signaling // m_d3d12Handler->SignalD3D12Fence(...); return VAVCORE_SUCCESS; } ``` **2. Fix TimingThreadLoop** (same as Approach 1) **3. Keep D3D12 Internal Fence** (required for SwapChain): - ✅ Keep `D3D12VideoRenderer::m_fence` - ✅ Keep `WaitForFrameCompletion()` for triple buffering #### Benefits - ✅ Reduced complexity (remove External Fence) - ✅ Fixes root cause (timing) - ✅ Same performance (15ms CPU block < 33.33ms budget) - ✅ Easier debugging (synchronous flow) #### Risks - ⚠️ More code changes (3 components) - ⚠️ Requires more thorough testing --- ## Decision Matrix | Aspect | Approach 1 (Fix Timing Only) | Approach 2 (Remove Fence + Fix) | |--------|------------------------------|----------------------------------| | Code Changes | 10-15 lines | ~50 lines | | Complexity Reduction | None | Medium | | Performance Impact | None | None (15ms < 33.33ms) | | Risk Level | Very Low | Low | | Testing Effort | Minimal | Moderate | | **Recommendation** | ✅ Start here | Follow-up refactor | --- ## Implementation Plan ### Phase 1: Fix Timing (Immediate) **Goal**: Stop jerky playback **Tasks**: 1. Modify `PlaybackController::TimingThreadLoop()` to wait for frame completion 2. Add `SetFrameProcessor()` method to PlaybackController 3. Link FrameProcessor in VideoPlayerControl2 4. Test with sample video **Files Modified**: - `PlaybackController.h` (add 3 lines) - `PlaybackController.cpp` (modify TimingThreadLoop, ~15 lines) - `VideoPlayerControl2.xaml.cpp` (add 1 line) **Expected Outcome**: Smooth 30fps playback --- ### Phase 2: Remove Fence (Follow-up) **Goal**: Reduce technical debt **Tasks**: 1. Remove External Fence from D3D12SurfaceHandler 2. Replace with `cuStreamSynchronize()` in NVDECAV1Decoder 3. Remove `vavcore_get_sync_fence()` from VavCore API 4. Remove fence wait from D3D12VideoRenderer 5. Update VavCoreVideoFrame struct **Files Modified**: - `D3D12SurfaceHandler.h/.cpp` - `NVDECAV1Decoder.cpp` - `D3D12VideoRenderer.cpp` - `VavCore.h` (C API) - `VideoTypes.h` (internal) **Expected Outcome**: Same performance, simpler code --- ## Performance Analysis ### Frame Processing Timeline (30fps = 33.33ms budget) **Current (Broken)**: ``` 0ms : Timing thread signals frame N 0ms : ProcessFrame(N) starts 10ms : CUDA decode completes (async) 15ms : UI render starts 25ms : Frame N complete 33.33ms: Timing thread signals frame N+1 (regardless of N status!) → If frame N not done, FRAME DROP occurs ``` **After Fix**: ``` 0ms : Timing thread signals frame N 0ms : ProcessFrame(N) starts 10ms : CUDA decode completes 15ms : UI render starts 25ms : Frame N complete 25ms : Timing thread detects completion, advances counter 33.33ms: Sleep completes, signal frame N+1 → NO FRAME DROPS ``` ### Budget Analysis - NVDEC decode: 10-15ms - UI render: 5-10ms - **Total**: 15-25ms - **Budget**: 33.33ms (30fps) - **Margin**: 8-18ms spare time **Conclusion**: We have plenty of time budget, synchronous processing is perfectly fine. --- ## Alternative Considered and Rejected ### Fence-based Async Processing **Why rejected**: 1. **Current architecture doesn't benefit**: Timing thread doesn't overlap work 2. **33.33ms budget sufficient**: No need for async optimization 3. **Adds complexity**: External fence requires CUDA/D3D12 interop code 4. **Doesn't fix root cause**: Timing thread still needs to wait **When Fence would be beneficial**: - Variable framerate (60fps+) - Tight timing budget (< 16ms) - Truly parallel decode/render pipeline **Our case**: Fixed 30fps, 33.33ms budget → synchronous is simpler and sufficient --- ## Testing Plan ### Phase 1 Testing (Timing Fix) **Unit Tests**: None required (integration testing sufficient) **Integration Tests**: 1. **Smooth Playback Test**: - Load 30fps test video - Play for 30 seconds - Verify no frame drops - Verify smooth visual playback 2. **Frame Timing Test**: - Log actual frame intervals - Verify 33.33ms ± 2ms intervals - Check m_framesDropped counter (should be 0) 3. **Long Duration Test**: - Play 5-minute video - Monitor frame drops - Check for memory leaks **Success Criteria**: - ✅ Zero frame drops over 5 minutes - ✅ Consistent 33.33ms frame intervals - ✅ Smooth visual playback (no jerking) --- ### Phase 2 Testing (Fence Removal) **Additional Tests**: 1. **Performance Regression Test**: - Measure decode time before/after - Verify < 20ms average decode time - Check CPU usage (should be same) 2. **Multi-format Test**: - Test with different resolutions (720p, 1080p, 4K) - Verify all formats work 3. **Decoder Compatibility Test**: - Test NVDEC (affected by change) - Test VPL (should be unaffected) - Test dav1d fallback (should be unaffected) **Success Criteria**: - ✅ Same performance as before - ✅ No visual regression - ✅ All decoders work correctly --- ## Rollback Plan ### Phase 1 Rollback **If timing fix causes issues**: 1. Revert PlaybackController.cpp changes 2. Revert PlaybackController.h changes 3. Revert VideoPlayerControl2.xaml.cpp link **Time**: < 5 minutes ### Phase 2 Rollback **If Fence removal causes issues**: 1. Git revert to Phase 1 completion 2. Keep timing fix, restore Fence code **Time**: < 10 minutes --- ## Implementation Notes ### Thread Safety Considerations **FrameProcessor::IsProcessing()**: - Called from Timing Thread - Reads atomic bool `m_frameProcessing` - ✅ Thread-safe (atomic operation) **PlaybackController::SetFrameProcessor()**: - Called from UI Thread during LoadVideo - Non-owning pointer (FrameProcessor lifetime > PlaybackController) - ✅ Safe (no concurrent access during init) ### Performance Considerations **CPU Blocking in cuStreamSynchronize()**: - Blocks for ~15ms - Timing thread sleeps for 33.33ms anyway - ✅ No performance impact (within budget) **Wait Loop in TimingThreadLoop()**: - 1ms sleep per iteration - Max 100 iterations (100ms timeout) - Typical: 25ms wait = 25 iterations - ✅ Low CPU overhead --- ## Success Metrics ### Immediate (Phase 1) - [ ] Zero frame drops during playback - [ ] Smooth visual playback (user confirmed) - [ ] Consistent 30fps timing (logs verified) ### Follow-up (Phase 2) - [ ] Code complexity reduced (fence code removed) - [ ] Same or better performance - [ ] All decoders working (NVDEC, VPL, dav1d) --- ## References **Related Files**: - `vav2/platforms/windows/applications/vav2player/Vav2Player/src/Playback/PlaybackController.cpp` - `vav2/platforms/windows/applications/vav2player/Vav2Player/src/Playback/FrameProcessor.cpp` - `vav2/platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp` - `vav2/platforms/windows/applications/vav2player/Vav2Player/src/Rendering/D3D12VideoRenderer.cpp` **Previous Attempts**: - Fence implementation (2025-10-07): Did not fix jerky playback - Timing analysis: Identified root cause in PlaybackController --- ## Approval **Approved by**: [User] **Date**: [Pending] **Implementation Start**: [Pending]