# Vav2Player Stutter Fix Design **Date**: 2025-10-08 **Status**: In Progress **Priority**: Critical ## Problem Statement Vav2Player exhibits persistent stuttering ("통통 튀는 현상") during 30fps AV1 video playback despite previous B-frame reordering fixes. Analysis reveals three critical synchronization issues in the NVDEC → Triple Buffering → Staging Texture → Renderer pipeline. ## Root Cause Analysis ### Problem 1: GPU Copy Race Condition **Issue**: Asynchronous GPU copy to staging texture without completion wait **Current Flow**: ``` NVDEC Decode → Backend Texture (m_rgbaTextures[0,1,2]) ↓ GPU CopyResource (async, no wait) ↓ Staging Texture (m_stagingTexture) ↓ Renderer reads (60Hz Present) ``` **Race Condition**: - `FrameProcessor.cpp:109` calls `CopyToStagingTexture()` and returns immediately - GPU copy executes asynchronously on command queue - Next frame may overwrite staging texture before GPU copy completes - Renderer (`RGBASurfaceBackend.cpp:200-206`) reads staging texture while copy in-flight **Evidence**: ```cpp // FrameProcessor.cpp:106-114 if (result == VAVCORE_SUCCESS) { auto backend = m_renderer->GetRGBASurfaceBackend(); if (backend) { HRESULT hr = backend->CopyToStagingTexture(rgbaTexture); // ❌ NO WAIT HERE - returns immediately // ❌ GPU copy may not be complete } } ``` **Symptoms**: - Flickering/tearing during playback - Inconsistent frame presentation - Visual artifacts (partial frame updates) --- ### Problem 2: NVDEC Decode Completion Not Verified **Issue**: Polling thread exists but decode completion not enforced in DecodeToSurface **Current Implementation**: - `NVDECAV1Decoder.cpp` has `PollingThreadFunc()` for `cuvidGetDecodeStatus()` - `DecodeToSurface()` only waits for FIFO ordering (submission queue) - **No wait for `slot.is_ready` flag** (set by polling thread) **Evidence**: ```cpp // NVDECAV1Decoder.cpp - DecodeToSurface (approximate line ~1400) // FIFO wait only while (m_returnCounter.load() != submission_id) { std::unique_lock fifo_lock(m_fifoWaitMutex); m_fifoWaitCV.wait_for(fifo_lock, std::chrono::milliseconds(100), ...); } // ❌ MISSING: Wait for slot.is_ready // ❌ MISSING: cuvidGetDecodeStatus() completion check // Function returns immediately after FIFO order satisfied ``` **Impact**: - Backend texture may contain incomplete decoded data - GPU copy operates on partial decode results - Frame quality inconsistency --- ### Problem 3: Playback Timing Irregularity **Issue**: Sleep-first strategy causes cumulative timing jitter **Current Implementation** (`PlaybackController.cpp:354-371`): ```cpp // Sleep FIRST (fixed duration) std::this_thread::sleep_for(targetIntervalMs); // Then invoke callback (variable duration) m_frameReadyCallback(); // Blocking: 6-20ms depending on frame type ``` **Timing Analysis**: | Frame Type | Sleep | Callback | Total | Target | Error | |------------|-------|----------|-------|--------|-------| | Display-only | 33ms | 6ms | 39ms | 33.33ms | +5.67ms | | Normal decode | 33ms | 20ms | 53ms | 33.33ms | +19.67ms | **B-frame Pattern** (every 3rd frame is Display-only): ``` Frame 0: Normal (53ms total) → 19.67ms late Frame 1: Normal (53ms total) → 19.67ms late Frame 2: Display (39ms total) → 5.67ms late Frame 3: Normal (53ms total) → 19.67ms late ... ``` **Compounding Effects**: - Combined with VSync Present(1,0): frames shown for 1 or 2 VSync cycles (16.66ms or 33.33ms) - Irregular display duration causes perceived "jumping" motion - User perception: "통통 튀는 현상" (bouncy/stuttering playback) --- ## Solution Design ### Solution 1: GPU Copy Completion Synchronization **Objective**: Ensure GPU copy completes before proceeding to rendering **Implementation**: #### 1.1 Add GPU Fence to RGBASurfaceBackend **File**: `RGBASurfaceBackend.h` ```cpp class RGBASurfaceBackend : public IVideoBackend { public: // ... existing methods ... // New method: Wait for GPU copy to complete HRESULT WaitForCopyCompletion(); private: // ... existing members ... // GPU synchronization for copy operations ComPtr m_copyFence; UINT64 m_copyFenceValue = 0; HANDLE m_copyFenceEvent = nullptr; }; ``` #### 1.2 Create Fence in Initialize **File**: `RGBASurfaceBackend.cpp` (in `Initialize()` method) ```cpp // Create fence for GPU copy synchronization HRESULT hr = m_device->CreateFence( 0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&m_copyFence) ); if (FAILED(hr)) { return hr; } // Create fence event m_copyFenceEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr); if (m_copyFenceEvent == nullptr) { return HRESULT_FROM_WIN32(GetLastError()); } ``` #### 1.3 Signal Fence After Copy Submission **File**: `RGBASurfaceBackend.cpp` (in `CopyToStagingTexture()`) ```cpp HRESULT RGBASurfaceBackend::CopyToStagingTexture(ID3D12Resource* sourceTexture) { // ... existing copy commands ... m_copyCommandList->Close(); ID3D12CommandList* commandLists[] = { m_copyCommandList.Get() }; m_commandQueue->ExecuteCommandLists(1, commandLists); // Signal fence after copy submission m_copyFenceValue++; HRESULT hr = m_commandQueue->Signal(m_copyFence.Get(), m_copyFenceValue); if (FAILED(hr)) { LOGF_ERROR("[CopyToStagingTexture] Failed to signal fence: 0x%08X", hr); return hr; } LOGF_DEBUG("[CopyToStagingTexture] GPU copy submitted (fence value: %llu)", m_copyFenceValue); return S_OK; } ``` #### 1.4 Implement Wait Method **File**: `RGBASurfaceBackend.cpp` (new method) ```cpp HRESULT RGBASurfaceBackend::WaitForCopyCompletion() { // Check if copy already completed if (m_copyFence->GetCompletedValue() >= m_copyFenceValue) { return S_OK; // Already complete } // Wait for GPU copy to complete HRESULT hr = m_copyFence->SetEventOnCompletion( m_copyFenceValue, m_copyFenceEvent ); if (FAILED(hr)) { LOGF_ERROR("[WaitForCopyCompletion] SetEventOnCompletion failed: 0x%08X", hr); return hr; } DWORD waitResult = WaitForSingleObject(m_copyFenceEvent, 5000); // 5 second timeout if (waitResult != WAIT_OBJECT_0) { LOGF_ERROR("[WaitForCopyCompletion] Wait failed or timed out: %lu", waitResult); return E_FAIL; } LOGF_DEBUG("[WaitForCopyCompletion] GPU copy completed (fence value: %llu)", m_copyFenceValue); return S_OK; } ``` #### 1.5 Call Wait in FrameProcessor **File**: `FrameProcessor.cpp` (in `ProcessFrame()`) ```cpp // After successful decode, copy to staging texture for safe rendering if (result == VAVCORE_SUCCESS) { auto backend = m_renderer->GetRGBASurfaceBackend(); if (backend) { HRESULT hr = backend->CopyToStagingTexture(rgbaTexture); if (FAILED(hr)) { LOGF_ERROR("[FrameProcessor] Failed to copy to staging texture: 0x%08X", hr); } else { // Wait for GPU copy to complete before proceeding hr = backend->WaitForCopyCompletion(); if (FAILED(hr)) { LOGF_ERROR("[FrameProcessor] Failed to wait for copy completion: 0x%08X", hr); } else { LOGF_DEBUG("[FrameProcessor] GPU copy completed, staging texture ready"); } } } } ``` #### 1.6 Cleanup in Shutdown **File**: `RGBASurfaceBackend.cpp` (in `Shutdown()`) ```cpp // Close fence event handle if (m_copyFenceEvent != nullptr) { CloseHandle(m_copyFenceEvent); m_copyFenceEvent = nullptr; } // Release fence m_copyFence.Reset(); ``` --- ### Solution 2: NVDEC Decode Completion Wait **Objective**: Ensure NVDEC hardware decoding completes before accessing decoded surface **Implementation**: #### 2.1 Find Decode Slot in DecodeToSurface **File**: `NVDECAV1Decoder.cpp` (in `DecodeToSurface()`, after FIFO wait) ```cpp // After FIFO ordering wait, find the actual slot used by this submission int slot_idx = -1; { std::lock_guard lock(m_submissionMutex); // Search for slot matching this submission_id for (size_t i = 0; i < RING_BUFFER_SIZE; i++) { if (m_ringBuffer[i].submission_id == submission_id && m_ringBuffer[i].in_use.load()) { slot_idx = static_cast(i); break; } } } if (slot_idx < 0) { LOGF_ERROR("[DecodeToSurface] Failed to find decode slot for submission_id=%llu", submission_id); return VAVCORE_ERROR_DECODE_FAILED; } ``` #### 2.2 Wait for Decode Completion **File**: `NVDECAV1Decoder.cpp` (in `DecodeToSurface()`, after finding slot) ```cpp DecodeSlot& slot = m_ringBuffer[slot_idx]; // Wait for NVDEC decode to complete (signaled by polling thread) { std::unique_lock slot_lock(slot.slot_mutex); bool decode_ready = slot.frame_ready.wait_for( slot_lock, std::chrono::milliseconds(500), // 500ms timeout [&slot]() { return slot.is_ready.load(); } ); if (!decode_ready) { LOGF_ERROR("[DecodeToSurface] Decode timeout for slot %d (submission_id=%llu)", slot_idx, submission_id); // Mark slot as failed slot.decoding_failed.store(true); slot.in_use.store(false); return VAVCORE_ERROR_DECODE_TIMEOUT; } // Check if decoding failed if (slot.decoding_failed.load()) { LOGF_ERROR("[DecodeToSurface] Decode failed for slot %d (submission_id=%llu)", slot_idx, submission_id); slot.in_use.store(false); return VAVCORE_ERROR_DECODE_FAILED; } } LOGF_DEBUG("[DecodeToSurface] Decode completed for slot %d (submission_id=%llu)", slot_idx, submission_id); ``` #### 2.3 Update Polling Thread to Signal Readiness **File**: `NVDECAV1Decoder.cpp` (in `PollingThreadFunc()`) **Ensure polling thread properly signals `slot.is_ready` when `cuvidGetDecodeStatus()` returns success** ```cpp void NVDECAV1Decoder::PollingThreadFunc() { while (m_pollingRunning.load()) { // Poll all active slots for (size_t i = 0; i < RING_BUFFER_SIZE; i++) { DecodeSlot& slot = m_ringBuffer[i]; if (!slot.in_use.load() || slot.is_ready.load()) { continue; // Skip inactive or already ready slots } int pic_idx = slot.picture_index; if (pic_idx < 0) { continue; // No picture assigned yet } // Check decode status CUresult result = cuvidGetDecodeStatus(m_decoder, pic_idx); if (result == CUDA_SUCCESS) { // Decode complete { std::lock_guard lock(slot.slot_mutex); slot.is_ready.store(true); } slot.frame_ready.notify_all(); LOGF_DEBUG("[PollingThread] Slot %zu ready (pic_idx=%d)", i, pic_idx); } else if (result == CUDA_ERROR_NOT_READY) { // Still decoding, continue polling } else { // Decode error LOGF_ERROR("[PollingThread] Decode error for slot %zu (pic_idx=%d): %d", i, pic_idx, result); { std::lock_guard lock(slot.slot_mutex); slot.decoding_failed.store(true); slot.is_ready.store(true); // Signal to wake up waiter } slot.frame_ready.notify_all(); } } // Poll interval: 1ms for responsiveness std::this_thread::sleep_for(std::chrono::milliseconds(1)); } LOGF_DEBUG("[PollingThread] Polling thread exiting"); } ``` --- ### Solution 3: Playback Timing Strategy Redesign **Objective**: Maintain fixed 33.33ms frame intervals regardless of decode time variation **Implementation**: #### 3.1 Callback-First with Absolute Target Tracking **File**: `PlaybackController.cpp` (in `TimingThreadLoop()`) ```cpp void PlaybackController::TimingThreadLoop() { // Set Windows timer resolution to 1ms for accurate sleep timeBeginPeriod(1); LOGF_INFO("[PlaybackController] Set Windows timer resolution to 1ms"); double baseIntervalMs = 1000.0 / m_frameRate; auto startTime = std::chrono::high_resolution_clock::now(); auto nextFrameTarget = startTime; LOGF_INFO("[PlaybackController] Timing thread loop started (target: %.2f fps, %.2f ms per frame)", m_frameRate, baseIntervalMs); while (!m_shouldStopTiming && m_isPlaying) { auto frameStart = std::chrono::high_resolution_clock::now(); // Apply playback speed double speed = m_playbackSpeed.load(); double targetIntervalMs = baseIntervalMs / speed; // Invoke callback FIRST (blocking decode + render) // This allows decode time to vary (6-20ms) without affecting frame interval auto callbackStart = std::chrono::high_resolution_clock::now(); if (m_frameReadyCallback) { m_frameReadyCallback(); } auto callbackEnd = std::chrono::high_resolution_clock::now(); double callbackTime = std::chrono::duration(callbackEnd - callbackStart).count(); // Calculate next frame target (fixed interval from start time) nextFrameTarget += std::chrono::microseconds(static_cast(targetIntervalMs * 1000)); // Sleep until next frame target auto now = std::chrono::high_resolution_clock::now(); auto sleepDuration = nextFrameTarget - now; if (sleepDuration.count() > 0) { // Sleep for remaining time std::this_thread::sleep_until(nextFrameTarget); double sleepTime = std::chrono::duration(sleepDuration).count(); LOGF_DEBUG("[PlaybackController] Frame %llu timing: callback=%.2fms, sleep=%.2fms", m_currentFrame, callbackTime, sleepTime); } else { // Missed target - log warning double missedBy = std::chrono::duration(-sleepDuration).count(); LOGF_WARNING("[PlaybackController] Frame %llu MISSED target by %.2fms (callback took %.2fms)", m_currentFrame, missedBy, callbackTime); // Reset target to current time to avoid cumulative drift nextFrameTarget = now; } // Update current time m_currentFrame++; m_currentTime = m_currentFrame / m_frameRate; } // Restore Windows timer resolution timeEndPeriod(1); LOGF_INFO("[PlaybackController] Timing thread loop exited, timer resolution restored"); } ``` #### 3.2 Timing Characteristics **New Timing Behavior**: ``` Frame 0 (Normal, 20ms decode): [Callback: 20ms] [Sleep: 13.33ms] = 33.33ms total ✓ Frame 1 (Normal, 20ms decode): [Callback: 20ms] [Sleep: 13.33ms] = 33.33ms total ✓ Frame 2 (Display-only, 6ms decode): [Callback: 6ms] [Sleep: 27.33ms] = 33.33ms total ✓ Frame 3 (Normal, 20ms decode): [Callback: 20ms] [Sleep: 13.33ms] = 33.33ms total ✓ ``` **Benefits**: - ✅ Fixed 33.33ms frame interval maintained - ✅ Decode time variation absorbed by sleep duration - ✅ No cumulative timing drift - ✅ Consistent VSync alignment (every 2 frames = 33.33ms) --- ## Implementation Plan ### Phase 1: Solution 1 (GPU Copy Sync) 1. Add fence/event members to `RGBASurfaceBackend.h` 2. Create fence in `Initialize()`, cleanup in `Shutdown()` 3. Implement `WaitForCopyCompletion()` 4. Signal fence in `CopyToStagingTexture()` 5. Call wait in `FrameProcessor::ProcessFrame()` 6. **Build and test**: Verify staging texture stability ### Phase 2: Solution 2 (NVDEC Decode Sync) 1. Update `DecodeToSurface()` to find slot after FIFO wait 2. Add decode completion wait with timeout 3. Update `PollingThreadFunc()` to properly signal readiness 4. Add error handling for decode failures/timeouts 5. **Build and test**: Verify decode completion before surface access ### Phase 3: Solution 3 (Playback Timing) 1. Redesign `TimingThreadLoop()` with callback-first strategy 2. Implement absolute frame target tracking 3. Add missed frame detection and recovery 4. **Build and test**: Verify consistent 33.33ms frame intervals ### Phase 4: Integration Testing 1. Run full playback test with all 3 fixes 2. Measure frame timing consistency 3. Verify stutter elimination 4. Performance profiling (GPU/CPU utilization) --- ## Expected Results ### Before Fix - GPU copy race: Flickering/tearing artifacts - NVDEC not ready: Partial decoded frames - Timing irregular: 33-53ms frame intervals (39% variation) - User perception: **Severe stuttering** ("통통 튀는 현상") ### After Fix - GPU copy complete: Stable staging texture - NVDEC verified: Complete decoded frames - Timing fixed: Consistent 33.33ms intervals (±1ms tolerance) - User perception: **Smooth 30fps playback** --- ## Performance Impact ### CPU Impact - GPU fence wait: ~0.1ms per frame (minimal, GPU-bound) - NVDEC status poll: ~1ms per frame (already running in background) - Timing redesign: No additional CPU overhead ### GPU Impact - Fence overhead: Negligible (native GPU operation) - No additional GPU work introduced ### Latency Impact - Added synchronization: +1-2ms per frame - **Trade-off**: Slightly higher latency for stability and smoothness - Still well within 30fps budget (33.33ms) --- ## Risks and Mitigations ### Risk 1: Fence Wait Timeout **Mitigation**: 5-second timeout with error logging, graceful fallback ### Risk 2: NVDEC Decode Timeout **Mitigation**: 500ms timeout, mark slot as failed, continue with next frame ### Risk 3: Callback Takes >33ms **Mitigation**: Detect missed frames, log warning, reset timing target to prevent drift --- ## Testing Strategy ### Unit Tests - GPU fence creation/signaling/waiting - NVDEC decode status polling accuracy - Frame timing calculation correctness ### Integration Tests - Full 30fps playback for 60 seconds - Frame interval histogram (should cluster at 33.33ms ±1ms) - Visual inspection for stutter/artifacts ### Performance Tests - CPU utilization during playback - GPU utilization during playback - Memory usage over extended playback --- ## Success Criteria 1. **No GPU copy race**: Staging texture content stable across frames 2. **NVDEC decode verified**: All frames fully decoded before access 3. **Frame timing consistent**: 95% of frames within 33.33ms ±2ms 4. **Stutter eliminated**: No visible "jumping" or irregular motion 5. **Performance acceptable**: <5% CPU overhead, <2ms added latency --- ## References - Previous fix: `NVDEC_Frame_Reordering_Fix_Design.md` - CUDA Documentation: `cuvidGetDecodeStatus()` - D3D12 Documentation: `ID3D12Fence`, `SetEventOnCompletion()` - Windows Multimedia Timer: `timeBeginPeriod()` --- **Document Status**: Design Complete, Ready for Implementation **Next Step**: Implement Solution 1 (GPU Copy Sync)