Vav2Player Stutter Fix Design
Date: 2025-10-08
Status: In Progress
Priority: Critical
Problem Statement
Vav2Player exhibits persistent stuttering (a "bouncing" effect) during 30fps AV1 video playback despite previous B-frame reordering fixes. Analysis reveals three critical synchronization issues in the NVDEC → Triple Buffering → Staging Texture → Renderer pipeline.
Root Cause Analysis
Problem 1: GPU Copy Race Condition
Issue: Asynchronous GPU copy to staging texture without completion wait
Current Flow:
NVDEC Decode → Backend Texture (m_rgbaTextures[0,1,2])
↓
GPU CopyResource (async, no wait)
↓
Staging Texture (m_stagingTexture)
↓
Renderer reads (60Hz Present)
Race Condition:
- FrameProcessor.cpp:109 calls CopyToStagingTexture() and returns immediately
- GPU copy executes asynchronously on command queue
- Next frame may overwrite staging texture before GPU copy completes
- Renderer (RGBASurfaceBackend.cpp:200-206) reads staging texture while copy in-flight
Evidence:
// FrameProcessor.cpp:106-114
if (result == VAVCORE_SUCCESS) {
auto backend = m_renderer->GetRGBASurfaceBackend();
if (backend) {
HRESULT hr = backend->CopyToStagingTexture(rgbaTexture);
// ❌ NO WAIT HERE - returns immediately
// ❌ GPU copy may not be complete
}
}
Symptoms:
- Flickering/tearing during playback
- Inconsistent frame presentation
- Visual artifacts (partial frame updates)
Problem 2: NVDEC Decode Completion Not Verified
Issue: Polling thread exists but decode completion not enforced in DecodeToSurface
Current Implementation:
- NVDECAV1Decoder.cpp has PollingThreadFunc() for cuvidGetDecodeStatus()
- DecodeToSurface() only waits for FIFO ordering (submission queue)
- No wait for slot.is_ready flag (set by polling thread)
Evidence:
// NVDECAV1Decoder.cpp - DecodeToSurface (approximate line ~1400)
// FIFO wait only
while (m_returnCounter.load() != submission_id) {
std::unique_lock<std::mutex> fifo_lock(m_fifoWaitMutex);
m_fifoWaitCV.wait_for(fifo_lock, std::chrono::milliseconds(100), ...);
}
// ❌ MISSING: Wait for slot.is_ready
// ❌ MISSING: cuvidGetDecodeStatus() completion check
// Function returns immediately after FIFO order satisfied
Impact:
- Backend texture may contain incomplete decoded data
- GPU copy operates on partial decode results
- Frame quality inconsistency
Problem 3: Playback Timing Irregularity
Issue: Sleep-first strategy causes cumulative timing jitter
Current Implementation (PlaybackController.cpp:354-371):
// Sleep FIRST (fixed duration)
std::this_thread::sleep_for(targetIntervalMs);
// Then invoke callback (variable duration)
m_frameReadyCallback(); // Blocking: 6-20ms depending on frame type
Timing Analysis:
| Frame Type | Sleep | Callback | Total | Target | Error |
|---|---|---|---|---|---|
| Display-only | 33ms | 6ms | 39ms | 33.33ms | +5.67ms |
| Normal decode | 33ms | 20ms | 53ms | 33.33ms | +19.67ms |
B-frame Pattern (every 3rd frame is Display-only):
Frame 0: Normal (53ms total) → 19.67ms late
Frame 1: Normal (53ms total) → 19.67ms late
Frame 2: Display (39ms total) → 5.67ms late
Frame 3: Normal (53ms total) → 19.67ms late
...
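The per-frame errors in the table add up, because a sleep-first loop never corrects for time already spent. A minimal sketch of that arithmetic, assuming the 33ms sleep and the 6ms / 20ms callback durations quoted above; `sleepFirstLatenessMs` is a hypothetical helper, not a function in Vav2Player:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Vav2Player).
// callbackMs: callback duration of each frame, in milliseconds.
// Returns how far behind the ideal schedule the loop is after all frames,
// when every iteration sleeps a fixed time and then runs the callback.
double sleepFirstLatenessMs(const std::vector<double>& callbackMs,
                            double sleepMs, double targetIntervalMs) {
    assert(targetIntervalMs > 0.0);
    double lateness = 0.0;
    for (double cb : callbackMs) {
        // Each iteration costs sleep + callback; the excess is never recovered.
        lateness += (sleepMs + cb) - targetIntervalMs;
    }
    return lateness;
}
```

Under these assumed numbers, one Normal, Normal, Display-only group takes 145ms instead of 100ms, so the loop falls 45ms behind every three frames and effectively runs at roughly 20.7fps rather than 30fps.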
Compounding Effects:
- Combined with VSync Present(1,0): frames shown for 1 or 2 VSync cycles (16.67ms or 33.33ms)
- Irregular display duration causes perceived "jumping" motion
- User perception: bouncy, stuttering playback
Solution Design
Solution 1: GPU Copy Completion Synchronization
Objective: Ensure GPU copy completes before proceeding to rendering
Implementation:
1.1 Add GPU Fence to RGBASurfaceBackend
File: RGBASurfaceBackend.h
class RGBASurfaceBackend : public IVideoBackend {
public:
// ... existing methods ...
// New method: Wait for GPU copy to complete
HRESULT WaitForCopyCompletion();
private:
// ... existing members ...
// GPU synchronization for copy operations
ComPtr<ID3D12Fence> m_copyFence;
UINT64 m_copyFenceValue = 0;
HANDLE m_copyFenceEvent = nullptr;
};
1.2 Create Fence in Initialize
File: RGBASurfaceBackend.cpp (in Initialize() method)
// Create fence for GPU copy synchronization
HRESULT hr = m_device->CreateFence(
0,
D3D12_FENCE_FLAG_NONE,
IID_PPV_ARGS(&m_copyFence)
);
if (FAILED(hr)) {
return hr;
}
// Create fence event
m_copyFenceEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);
if (m_copyFenceEvent == nullptr) {
return HRESULT_FROM_WIN32(GetLastError());
}
1.3 Signal Fence After Copy Submission
File: RGBASurfaceBackend.cpp (in CopyToStagingTexture())
HRESULT RGBASurfaceBackend::CopyToStagingTexture(ID3D12Resource* sourceTexture) {
// ... existing copy commands ...
m_copyCommandList->Close();
ID3D12CommandList* commandLists[] = { m_copyCommandList.Get() };
m_commandQueue->ExecuteCommandLists(1, commandLists);
// Signal fence after copy submission
m_copyFenceValue++;
HRESULT hr = m_commandQueue->Signal(m_copyFence.Get(), m_copyFenceValue);
if (FAILED(hr)) {
LOGF_ERROR("[CopyToStagingTexture] Failed to signal fence: 0x%08X", hr);
return hr;
}
LOGF_DEBUG("[CopyToStagingTexture] GPU copy submitted (fence value: %llu)",
m_copyFenceValue);
return S_OK;
}
1.4 Implement Wait Method
File: RGBASurfaceBackend.cpp (new method)
HRESULT RGBASurfaceBackend::WaitForCopyCompletion() {
// Check if copy already completed
if (m_copyFence->GetCompletedValue() >= m_copyFenceValue) {
return S_OK; // Already complete
}
// Wait for GPU copy to complete
HRESULT hr = m_copyFence->SetEventOnCompletion(
m_copyFenceValue,
m_copyFenceEvent
);
if (FAILED(hr)) {
LOGF_ERROR("[WaitForCopyCompletion] SetEventOnCompletion failed: 0x%08X", hr);
return hr;
}
DWORD waitResult = WaitForSingleObject(m_copyFenceEvent, 5000); // 5 second timeout
if (waitResult != WAIT_OBJECT_0) {
LOGF_ERROR("[WaitForCopyCompletion] Wait failed or timed out: %lu", waitResult);
return E_FAIL;
}
LOGF_DEBUG("[WaitForCopyCompletion] GPU copy completed (fence value: %llu)",
m_copyFenceValue);
return S_OK;
}
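The signal-then-wait pattern used in 1.3 and 1.4 can be tried without a GPU. The sketch below is a CPU-only stand-in, not the D3D12 API: `SimFence` is a hypothetical class that mimics the monotonically increasing counter behind a fence, with a condition variable in place of the Win32 event.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// CPU-only stand-in for a GPU fence (hypothetical, NOT ID3D12Fence):
// a counter that the producer advances and the consumer waits on.
class SimFence {
public:
    // Producer side: mark all work up to `value` as finished.
    void Signal(std::uint64_t value) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            assert(value >= m_completed);  // fence values never go backwards
            m_completed = value;
        }
        m_cv.notify_all();
    }

    std::uint64_t GetCompletedValue() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_completed;
    }

    // Consumer side: block until the counter reaches `value` or the
    // timeout expires. Returns true if the value was reached.
    bool WaitFor(std::uint64_t value, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m_mutex);
        return m_cv.wait_for(lock, timeout,
                             [&] { return m_completed >= value; });
    }

private:
    mutable std::mutex m_mutex;
    std::condition_variable m_cv;
    std::uint64_t m_completed = 0;
};
```

As in WaitForCopyCompletion(), a wait on a value that has already been reached returns immediately, and a wait on a value that is never signaled ends at the timeout instead of blocking forever.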
1.5 Call Wait in FrameProcessor
File: FrameProcessor.cpp (in ProcessFrame())
// After successful decode, copy to staging texture for safe rendering
if (result == VAVCORE_SUCCESS) {
auto backend = m_renderer->GetRGBASurfaceBackend();
if (backend) {
HRESULT hr = backend->CopyToStagingTexture(rgbaTexture);
if (FAILED(hr)) {
LOGF_ERROR("[FrameProcessor] Failed to copy to staging texture: 0x%08X", hr);
} else {
// Wait for GPU copy to complete before proceeding
hr = backend->WaitForCopyCompletion();
if (FAILED(hr)) {
LOGF_ERROR("[FrameProcessor] Failed to wait for copy completion: 0x%08X", hr);
} else {
LOGF_DEBUG("[FrameProcessor] GPU copy completed, staging texture ready");
}
}
}
}
1.6 Cleanup in Shutdown
File: RGBASurfaceBackend.cpp (in Shutdown())
// Close fence event handle
if (m_copyFenceEvent != nullptr) {
CloseHandle(m_copyFenceEvent);
m_copyFenceEvent = nullptr;
}
// Release fence
m_copyFence.Reset();
Solution 2: NVDEC Decode Completion Wait
Objective: Ensure NVDEC hardware decoding completes before accessing decoded surface
Implementation:
2.1 Find Decode Slot in DecodeToSurface
File: NVDECAV1Decoder.cpp (in DecodeToSurface(), after FIFO wait)
// After FIFO ordering wait, find the actual slot used by this submission
int slot_idx = -1;
{
std::lock_guard<std::mutex> lock(m_submissionMutex);
// Search for slot matching this submission_id
for (size_t i = 0; i < RING_BUFFER_SIZE; i++) {
if (m_ringBuffer[i].submission_id == submission_id &&
m_ringBuffer[i].in_use.load()) {
slot_idx = static_cast<int>(i);
break;
}
}
}
if (slot_idx < 0) {
LOGF_ERROR("[DecodeToSurface] Failed to find decode slot for submission_id=%llu",
submission_id);
return VAVCORE_ERROR_DECODE_FAILED;
}
2.2 Wait for Decode Completion
File: NVDECAV1Decoder.cpp (in DecodeToSurface(), after finding slot)
DecodeSlot& slot = m_ringBuffer[slot_idx];
// Wait for NVDEC decode to complete (signaled by polling thread)
{
std::unique_lock<std::mutex> slot_lock(slot.slot_mutex);
bool decode_ready = slot.frame_ready.wait_for(
slot_lock,
std::chrono::milliseconds(500), // 500ms timeout
[&slot]() {
return slot.is_ready.load();
}
);
if (!decode_ready) {
LOGF_ERROR("[DecodeToSurface] Decode timeout for slot %d (submission_id=%llu)",
slot_idx, submission_id);
// Mark slot as failed
slot.decoding_failed.store(true);
slot.in_use.store(false);
return VAVCORE_ERROR_DECODE_TIMEOUT;
}
// Check if decoding failed
if (slot.decoding_failed.load()) {
LOGF_ERROR("[DecodeToSurface] Decode failed for slot %d (submission_id=%llu)",
slot_idx, submission_id);
slot.in_use.store(false);
return VAVCORE_ERROR_DECODE_FAILED;
}
}
LOGF_DEBUG("[DecodeToSurface] Decode completed for slot %d (submission_id=%llu)",
slot_idx, submission_id);
2.3 Update Polling Thread to Signal Readiness
File: NVDECAV1Decoder.cpp (in PollingThreadFunc())
Ensure the polling thread signals slot.is_ready once cuvidGetDecodeStatus() reports that decoding has finished:
void NVDECAV1Decoder::PollingThreadFunc() {
while (m_pollingRunning.load()) {
// Poll all active slots
for (size_t i = 0; i < RING_BUFFER_SIZE; i++) {
DecodeSlot& slot = m_ringBuffer[i];
if (!slot.in_use.load() || slot.is_ready.load()) {
continue; // Skip inactive or already ready slots
}
int pic_idx = slot.picture_index;
if (pic_idx < 0) {
continue; // No picture assigned yet
}
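// Check decode status. cuvidGetDecodeStatus() takes an output struct:
// the CUresult reports whether the query itself succeeded, and
// status.decodeStatus reports the state of the picture.
CUVIDGETDECODESTATUS status = {};
CUresult result = cuvidGetDecodeStatus(m_decoder, pic_idx, &status);
if (result == CUDA_SUCCESS && status.decodeStatus == cuvidDecodeStatus_Success) {
// Decode complete
{
std::lock_guard<std::mutex> lock(slot.slot_mutex);
slot.is_ready.store(true);
}
slot.frame_ready.notify_all();
LOGF_DEBUG("[PollingThread] Slot %zu ready (pic_idx=%d)", i, pic_idx);
} else if (result == CUDA_SUCCESS && status.decodeStatus == cuvidDecodeStatus_InProgress) {
// Still decoding, continue polling
} else {
// Query failed or the decoder reported an error status
LOGF_ERROR("[PollingThread] Decode error for slot %zu (pic_idx=%d): result=%d, status=%d",
i, pic_idx, static_cast<int>(result), static_cast<int>(status.decodeStatus));
{
std::lock_guard<std::mutex> lock(slot.slot_mutex);
slot.decoding_failed.store(true);
slot.is_ready.store(true); // Signal to wake up waiter
}
slot.frame_ready.notify_all();
}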
}
// Poll interval: 1ms for responsiveness
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
LOGF_DEBUG("[PollingThread] Polling thread exiting");
}
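The handshake between 2.2 and 2.3 has three possible outcomes for the waiter: ready, failed, or timed out. The following is a minimal, self-contained sketch of that handshake, assuming a simplified slot; `SketchSlot`, `publishResult`, and `waitForSlot` are hypothetical names and the real DecodeSlot carries more state.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical, simplified slot (the real DecodeSlot has more fields).
struct SketchSlot {
    std::mutex mutex;
    std::condition_variable ready_cv;
    bool is_ready = false;
    bool failed = false;
};

enum class WaitOutcome { Ready, Failed, TimedOut };

// Poller side: publish the verdict under the mutex, then wake the waiter.
void publishResult(SketchSlot& slot, bool success) {
    {
        std::lock_guard<std::mutex> lock(slot.mutex);
        slot.failed = !success;
        slot.is_ready = true;  // set on failure too, so the waiter wakes up
    }
    slot.ready_cv.notify_all();
}

// Decoder side: wait for the poller's verdict, bounded by a timeout.
WaitOutcome waitForSlot(SketchSlot& slot, std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    std::unique_lock<std::mutex> lock(slot.mutex);
    bool signaled = slot.ready_cv.wait_for(
        lock, timeout, [&] { return slot.is_ready; });
    if (!signaled) {
        return WaitOutcome::TimedOut;
    }
    return slot.failed ? WaitOutcome::Failed : WaitOutcome::Ready;
}
```

Setting the ready flag on failure as well is what lets the waiter return promptly with an error instead of sitting out the full timeout.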
Solution 3: Playback Timing Strategy Redesign
Objective: Maintain fixed 33.33ms frame intervals regardless of decode time variation
Implementation:
3.1 Callback-First with Absolute Target Tracking
File: PlaybackController.cpp (in TimingThreadLoop())
void PlaybackController::TimingThreadLoop()
{
// Set Windows timer resolution to 1ms for accurate sleep
timeBeginPeriod(1);
LOGF_INFO("[PlaybackController] Set Windows timer resolution to 1ms");
double baseIntervalMs = 1000.0 / m_frameRate;
auto startTime = std::chrono::high_resolution_clock::now();
auto nextFrameTarget = startTime;
LOGF_INFO("[PlaybackController] Timing thread loop started (target: %.2f fps, %.2f ms per frame)",
m_frameRate, baseIntervalMs);
while (!m_shouldStopTiming && m_isPlaying) {
// Apply playback speed
double speed = m_playbackSpeed.load();
double targetIntervalMs = baseIntervalMs / speed;
// Invoke callback FIRST (blocking decode + render)
// This allows decode time to vary (6-20ms) without affecting frame interval
auto callbackStart = std::chrono::high_resolution_clock::now();
if (m_frameReadyCallback) {
m_frameReadyCallback();
}
auto callbackEnd = std::chrono::high_resolution_clock::now();
double callbackTime = std::chrono::duration<double, std::milli>(callbackEnd - callbackStart).count();
// Calculate next frame target (fixed interval from start time)
nextFrameTarget += std::chrono::microseconds(static_cast<long long>(targetIntervalMs * 1000));
// Sleep until next frame target
auto now = std::chrono::high_resolution_clock::now();
auto sleepDuration = nextFrameTarget - now;
if (sleepDuration.count() > 0) {
// Sleep for remaining time
std::this_thread::sleep_until(nextFrameTarget);
double sleepTime = std::chrono::duration<double, std::milli>(sleepDuration).count();
LOGF_DEBUG("[PlaybackController] Frame %llu timing: callback=%.2fms, sleep=%.2fms",
m_currentFrame, callbackTime, sleepTime);
} else {
// Missed target - log warning
double missedBy = std::chrono::duration<double, std::milli>(-sleepDuration).count();
LOGF_WARNING("[PlaybackController] Frame %llu MISSED target by %.2fms (callback took %.2fms)",
m_currentFrame, missedBy, callbackTime);
// Reset target to current time to avoid cumulative drift
nextFrameTarget = now;
}
// Update current time
m_currentFrame++;
m_currentTime = m_currentFrame / m_frameRate;
}
// Restore Windows timer resolution
timeEndPeriod(1);
LOGF_INFO("[PlaybackController] Timing thread loop exited, timer resolution restored");
}
3.2 Timing Characteristics
New Timing Behavior:
Frame 0 (Normal, 20ms decode):
[Callback: 20ms] [Sleep: 13.33ms] = 33.33ms total ✓
Frame 1 (Normal, 20ms decode):
[Callback: 20ms] [Sleep: 13.33ms] = 33.33ms total ✓
Frame 2 (Display-only, 6ms decode):
[Callback: 6ms] [Sleep: 27.33ms] = 33.33ms total ✓
Frame 3 (Normal, 20ms decode):
[Callback: 20ms] [Sleep: 13.33ms] = 33.33ms total ✓
Benefits:
- ✅ Fixed 33.33ms frame interval maintained
- ✅ Decode time variation absorbed by sleep duration
- ✅ No cumulative timing drift
- ✅ Consistent VSync alignment (each frame spans 2 VSync intervals at 60Hz = 33.33ms)
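The sleep durations listed in 3.2 follow directly from tracking an absolute target. A minimal sketch that replays the loop on a virtual clock (no real sleeping), assuming the same callback durations as above; `simulateSleepsMs` is a hypothetical helper used only for illustration:

```cpp
#include <cassert>
#include <vector>

// Replays the callback-first loop on a virtual clock.
// Returns the sleep chosen after each callback, in milliseconds;
// 0 means the target was missed and the schedule was re-anchored.
std::vector<double> simulateSleepsMs(const std::vector<double>& callbackMs,
                                     double targetIntervalMs) {
    assert(targetIntervalMs > 0.0);
    std::vector<double> sleeps;
    double now = 0.0;         // virtual clock, in ms
    double nextTarget = 0.0;  // absolute deadline for the next frame
    for (double cb : callbackMs) {
        now += cb;                       // callback runs first
        nextTarget += targetIntervalMs;  // advance the absolute deadline
        if (nextTarget > now) {
            sleeps.push_back(nextTarget - now);
            now = nextTarget;            // equivalent of sleep_until(nextTarget)
        } else {
            sleeps.push_back(0.0);       // missed: no sleep this frame
            nextTarget = now;            // re-anchor the schedule at "now"
        }
    }
    return sleeps;
}
```

With callbacks of 20, 20, 6 and 20ms at 30fps, this yields sleeps of about 13.33, 13.33, 27.33 and 13.33ms, matching the timing listed above. A single 40ms callback yields a sleep of 0 and moves the target to the current time.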
Implementation Plan
Phase 1: Solution 1 (GPU Copy Sync)
- Add fence/event members to RGBASurfaceBackend.h
- Create fence in Initialize(), cleanup in Shutdown()
- Implement WaitForCopyCompletion()
- Signal fence in CopyToStagingTexture()
- Call wait in FrameProcessor::ProcessFrame()
- Build and test: Verify staging texture stability
Phase 2: Solution 2 (NVDEC Decode Sync)
- Update DecodeToSurface() to find slot after FIFO wait
- Add decode completion wait with timeout
- Update PollingThreadFunc() to properly signal readiness
- Add error handling for decode failures/timeouts
- Build and test: Verify decode completion before surface access
Phase 3: Solution 3 (Playback Timing)
- Redesign TimingThreadLoop() with callback-first strategy
- Implement absolute frame target tracking
- Add missed frame detection and recovery
- Build and test: Verify consistent 33.33ms frame intervals
Phase 4: Integration Testing
- Run full playback test with all 3 fixes
- Measure frame timing consistency
- Verify stutter elimination
- Performance profiling (GPU/CPU utilization)
Expected Results
Before Fix
- GPU copy race: Flickering/tearing artifacts
- NVDEC not ready: Partial decoded frames
- Timing irregular: 39-53ms frame intervals (17-59% over the 33.33ms target)
- User perception: Severe stuttering (bouncy playback)
After Fix
- GPU copy complete: Stable staging texture
- NVDEC verified: Complete decoded frames
- Timing fixed: Consistent 33.33ms intervals (±1ms tolerance)
- User perception: Smooth 30fps playback
Performance Impact
CPU Impact
- GPU fence wait: ~0.1ms per frame (minimal, GPU-bound)
- NVDEC status poll: ~1ms per frame (already running in background)
- Timing redesign: No additional CPU overhead
GPU Impact
- Fence overhead: Negligible (native GPU operation)
- No additional GPU work introduced
Latency Impact
- Added synchronization: +1-2ms per frame
- Trade-off: Slightly higher latency for stability and smoothness
- Still well within 30fps budget (33.33ms)
Risks and Mitigations
Risk 1: Fence Wait Timeout
Mitigation: 5-second timeout with error logging, graceful fallback
Risk 2: NVDEC Decode Timeout
Mitigation: 500ms timeout, mark slot as failed, continue with next frame
Risk 3: Callback Takes >33ms
Mitigation: Detect missed frames, log warning, reset timing target to prevent drift
Testing Strategy
Unit Tests
- GPU fence creation/signaling/waiting
- NVDEC decode status polling accuracy
- Frame timing calculation correctness
Integration Tests
- Full 30fps playback for 60 seconds
- Frame interval histogram (should cluster at 33.33ms ±1ms)
- Visual inspection for stutter/artifacts
Performance Tests
- CPU utilization during playback
- GPU utilization during playback
- Memory usage over extended playback
Success Criteria
- No GPU copy race: Staging texture content stable across frames
- NVDEC decode verified: All frames fully decoded before access
- Frame timing consistent: 95% of frames within 33.33ms ±2ms
- Stutter eliminated: No visible "jumping" or irregular motion
- Performance acceptable: <5% CPU overhead, <2ms added latency
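The frame-timing criterion can be evaluated from logged frame intervals. A minimal sketch, assuming the intervals have already been collected in milliseconds; `fractionWithinTolerance` is a hypothetical helper, not an existing test utility:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of measured frame intervals within +/- toleranceMs of targetMs.
// Returns 0.0 for an empty input.
double fractionWithinTolerance(const std::vector<double>& intervalsMs,
                               double targetMs, double toleranceMs) {
    assert(toleranceMs >= 0.0);
    if (intervalsMs.empty()) {
        return 0.0;
    }
    std::size_t inRange = 0;
    for (double v : intervalsMs) {
        double diff = v - targetMs;
        if (diff < 0.0) {
            diff = -diff;  // absolute deviation from the target
        }
        if (diff <= toleranceMs) {
            ++inRange;
        }
    }
    return static_cast<double>(inRange) /
           static_cast<double>(intervalsMs.size());
}
```

A run would meet the timing criterion when the returned fraction is at least 0.95 for a target of 33.33ms and a tolerance of 2ms.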
References
- Previous fix: NVDEC_Frame_Reordering_Fix_Design.md
- CUDA Documentation: cuvidGetDecodeStatus()
- D3D12 Documentation: ID3D12Fence, SetEventOnCompletion()
- Windows Multimedia Timer: timeBeginPeriod()
Document Status: Design Complete, Ready for Implementation
Next Step: Implement Solution 1 (GPU Copy Sync)