video-v1/vav2/docs/working/Vav2Player_Stutter_Fix_Design.md
2025-10-09 19:21:14 +09:00

Vav2Player Stutter Fix Design

Date: 2025-10-08
Status: In Progress
Priority: Critical

Problem Statement

Vav2Player exhibits persistent stuttering (a "bouncy" playback effect) during 30fps AV1 video playback despite the previous B-frame reordering fixes. Analysis reveals three critical synchronization issues in the NVDEC → Triple Buffering → Staging Texture → Renderer pipeline.

Root Cause Analysis

Problem 1: GPU Copy Race Condition

Issue: Asynchronous GPU copy to staging texture without completion wait

Current Flow:

NVDEC Decode → Backend Texture (m_rgbaTextures[0,1,2])
              ↓
        GPU CopyResource (async, no wait)
              ↓
        Staging Texture (m_stagingTexture)
              ↓
        Renderer reads (60Hz Present)

Race Condition:

  • FrameProcessor.cpp:109 calls CopyToStagingTexture() and returns immediately
  • GPU copy executes asynchronously on command queue
  • Next frame may overwrite staging texture before GPU copy completes
  • Renderer (RGBASurfaceBackend.cpp:200-206) reads staging texture while copy in-flight

Evidence:

// FrameProcessor.cpp:106-114
if (result == VAVCORE_SUCCESS) {
    auto backend = m_renderer->GetRGBASurfaceBackend();
    if (backend) {
        HRESULT hr = backend->CopyToStagingTexture(rgbaTexture);
        // ❌ NO WAIT HERE - returns immediately
        // ❌ GPU copy may not be complete
    }
}

Symptoms:

  • Flickering/tearing during playback
  • Inconsistent frame presentation
  • Visual artifacts (partial frame updates)

Problem 2: NVDEC Decode Completion Not Verified

Issue: Polling thread exists but decode completion not enforced in DecodeToSurface

Current Implementation:

  • NVDECAV1Decoder.cpp has PollingThreadFunc() for cuvidGetDecodeStatus()
  • DecodeToSurface() only waits for FIFO ordering (submission queue)
  • No wait for slot.is_ready flag (set by polling thread)

Evidence:

// NVDECAV1Decoder.cpp - DecodeToSurface (around line 1400)
// FIFO wait only
while (m_returnCounter.load() != submission_id) {
    std::unique_lock<std::mutex> fifo_lock(m_fifoWaitMutex);
    m_fifoWaitCV.wait_for(fifo_lock, std::chrono::milliseconds(100), ...);
}

// ❌ MISSING: Wait for slot.is_ready
// ❌ MISSING: cuvidGetDecodeStatus() completion check
// Function returns immediately after FIFO order satisfied

Impact:

  • Backend texture may contain incomplete decoded data
  • GPU copy operates on partial decode results
  • Frame quality inconsistency

Problem 3: Playback Timing Irregularity

Issue: Sleep-first strategy causes cumulative timing jitter

Current Implementation (PlaybackController.cpp:354-371):

// Sleep FIRST (fixed duration)
std::this_thread::sleep_for(targetIntervalMs);

// Then invoke callback (variable duration)
m_frameReadyCallback();  // Blocking: 6-20ms depending on frame type

Timing Analysis:

Frame Type      Sleep   Callback   Total   Target    Error
Display-only    33ms    6ms        39ms    33.33ms   +5.67ms
Normal decode   33ms    20ms       53ms    33.33ms   +19.67ms

B-frame Pattern (every 3rd frame is Display-only):

Frame 0: Normal  (53ms total) → 19.67ms late
Frame 1: Normal  (53ms total) → 19.67ms late
Frame 2: Display (39ms total) → 5.67ms late
Frame 3: Normal  (53ms total) → 19.67ms late
...
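
The arithmetic behind the table and the pattern above can be reproduced with a short sketch. The struct and function names here are illustrative and do not exist in the codebase; the 33ms sleep and the 6ms/20ms callback durations are the figures quoted above.

```cpp
#include <cassert>
#include <vector>

// Per-frame timing under a sleep-first loop: the thread sleeps for a fixed
// duration and only then runs the blocking callback.
struct FrameTiming {
    double totalMs;  // sleep + callback
    double errorMs;  // total minus the target frame interval
};

// Hypothetical helper: computes one row of the timing table.
inline FrameTiming SleepFirstTiming(double sleepMs, double callbackMs, double frameRate) {
    assert(frameRate > 0.0);
    const double targetMs = 1000.0 / frameRate;
    const double totalMs = sleepMs + callbackMs;
    return {totalMs, totalMs - targetMs};
}

// Hypothetical helper: sums the per-frame lateness over a sequence of
// callback durations, showing how the error accumulates.
inline double AccumulatedLatenessMs(const std::vector<double>& callbacksMs,
                                    double sleepMs, double frameRate) {
    double sum = 0.0;
    for (double cb : callbacksMs) {
        sum += SleepFirstTiming(sleepMs, cb, frameRate).errorMs;
    }
    return sum;
}
```

Calling AccumulatedLatenessMs({20.0, 20.0, 6.0}, 33.0, 30.0) covers one Normal, Normal, Display-only group and shows that the lateness adds up instead of cancelling out.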

Compounding Effects:

  • Combined with VSync Present(1,0): frames shown for 1 or 2 VSync cycles (16.67ms or 33.33ms)
  • Irregular display duration causes perceived "jumping" motion
  • User perception: "bouncy", stuttering playback
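
To illustrate why uneven frame intervals turn into uneven on-screen durations, the following sketch quantizes frame-ready times to VSync boundaries. It is a simplified model with hypothetical function names: it assumes each frame becomes visible at the first VSync at or after its ready time and stays on screen until the next frame replaces it, which ignores the rest of the real presentation path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the first VSync at or after readyUs (microseconds since playback
// start). Integer arithmetic avoids floating-point rounding at boundaries.
inline std::int64_t VsyncIndexAtOrAfter(std::int64_t readyUs, std::int64_t refreshHz) {
    assert(readyUs >= 0 && refreshHz > 0);
    return (readyUs * refreshHz + 999999) / 1000000;  // integer ceiling
}

// For consecutive frame-ready times, returns how many VSync cycles each frame
// stays on screen before the next one replaces it (model assumption).
inline std::vector<std::int64_t> DisplayedVsyncCycles(
        const std::vector<std::int64_t>& readyUs, std::int64_t refreshHz) {
    std::vector<std::int64_t> cycles;
    for (std::size_t i = 1; i < readyUs.size(); ++i) {
        cycles.push_back(VsyncIndexAtOrAfter(readyUs[i], refreshHz) -
                         VsyncIndexAtOrAfter(readyUs[i - 1], refreshHz));
    }
    return cycles;
}
```

Under this model, an even cadence of roughly 33.33ms maps every frame to the same number of cycles at 60Hz, while irregular ready times produce a mix of cycle counts, which is what reads as jumping motion.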

Solution Design

Solution 1: GPU Copy Completion Synchronization

Objective: Ensure GPU copy completes before proceeding to rendering

Implementation:

1.1 Add GPU Fence to RGBASurfaceBackend

File: RGBASurfaceBackend.h

class RGBASurfaceBackend : public IVideoBackend {
public:
    // ... existing methods ...

    // New method: Wait for GPU copy to complete
    HRESULT WaitForCopyCompletion();

private:
    // ... existing members ...

    // GPU synchronization for copy operations
    ComPtr<ID3D12Fence> m_copyFence;
    UINT64 m_copyFenceValue = 0;
    HANDLE m_copyFenceEvent = nullptr;
};

1.2 Create Fence in Initialize

File: RGBASurfaceBackend.cpp (in Initialize() method)

// Create fence for GPU copy synchronization
HRESULT hr = m_device->CreateFence(
    0,
    D3D12_FENCE_FLAG_NONE,
    IID_PPV_ARGS(&m_copyFence)
);
if (FAILED(hr)) {
    return hr;
}

// Create fence event
m_copyFenceEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);
if (m_copyFenceEvent == nullptr) {
    return HRESULT_FROM_WIN32(GetLastError());
}

1.3 Signal Fence After Copy Submission

File: RGBASurfaceBackend.cpp (in CopyToStagingTexture())

HRESULT RGBASurfaceBackend::CopyToStagingTexture(ID3D12Resource* sourceTexture) {
    // ... existing copy commands ...

    m_copyCommandList->Close();
    ID3D12CommandList* commandLists[] = { m_copyCommandList.Get() };
    m_commandQueue->ExecuteCommandLists(1, commandLists);

    // Signal fence after copy submission
    m_copyFenceValue++;
    HRESULT hr = m_commandQueue->Signal(m_copyFence.Get(), m_copyFenceValue);
    if (FAILED(hr)) {
        LOGF_ERROR("[CopyToStagingTexture] Failed to signal fence: 0x%08X", hr);
        return hr;
    }

    LOGF_DEBUG("[CopyToStagingTexture] GPU copy submitted (fence value: %llu)",
               m_copyFenceValue);

    return S_OK;
}

1.4 Implement Wait Method

File: RGBASurfaceBackend.cpp (new method)

HRESULT RGBASurfaceBackend::WaitForCopyCompletion() {
    // Check if copy already completed
    if (m_copyFence->GetCompletedValue() >= m_copyFenceValue) {
        return S_OK;  // Already complete
    }

    // Wait for GPU copy to complete
    HRESULT hr = m_copyFence->SetEventOnCompletion(
        m_copyFenceValue,
        m_copyFenceEvent
    );
    if (FAILED(hr)) {
        LOGF_ERROR("[WaitForCopyCompletion] SetEventOnCompletion failed: 0x%08X", hr);
        return hr;
    }

    DWORD waitResult = WaitForSingleObject(m_copyFenceEvent, 5000);  // 5 second timeout
    if (waitResult != WAIT_OBJECT_0) {
        LOGF_ERROR("[WaitForCopyCompletion] Wait failed or timed out: %lu", waitResult);
        return E_FAIL;
    }

    LOGF_DEBUG("[WaitForCopyCompletion] GPU copy completed (fence value: %llu)",
               m_copyFenceValue);

    return S_OK;
}
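
The signal/wait protocol from 1.3 and 1.4 can be exercised without a GPU. The sketch below is a CPU-only stand-in, not a D3D12 type: ModelFence and its methods are hypothetical names that mimic a monotonically increasing fence value, the "already complete" fast path, and a bounded wait.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// CPU-only model of a monotonically increasing fence (hypothetical class).
class ModelFence {
public:
    // Producer side: marks all work up to `value` as complete and wakes waiters.
    void Signal(std::uint64_t value) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            assert(value >= m_completed);  // fence values never go backwards
            m_completed = value;
        }
        m_cv.notify_all();
    }

    // Last value that has been signaled.
    std::uint64_t CompletedValue() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_completed;
    }

    // Consumer side: returns true as soon as the fence has reached `value`
    // (immediately if it already has), or false when the timeout elapses.
    bool WaitFor(std::uint64_t value, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m_mutex);
        return m_cv.wait_for(lock, timeout, [&] { return m_completed >= value; });
    }

private:
    mutable std::mutex m_mutex;
    std::condition_variable m_cv;
    std::uint64_t m_completed = 0;
};
```

The design point carried over from the real fence is that the waiter compares against a specific value, so a wait issued after the copy has already finished returns without blocking.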

1.5 Call Wait in FrameProcessor

File: FrameProcessor.cpp (in ProcessFrame())

// After successful decode, copy to staging texture for safe rendering
if (result == VAVCORE_SUCCESS) {
    auto backend = m_renderer->GetRGBASurfaceBackend();
    if (backend) {
        HRESULT hr = backend->CopyToStagingTexture(rgbaTexture);
        if (FAILED(hr)) {
            LOGF_ERROR("[FrameProcessor] Failed to copy to staging texture: 0x%08X", hr);
        } else {
            // Wait for GPU copy to complete before proceeding
            hr = backend->WaitForCopyCompletion();
            if (FAILED(hr)) {
                LOGF_ERROR("[FrameProcessor] Failed to wait for copy completion: 0x%08X", hr);
            } else {
                LOGF_DEBUG("[FrameProcessor] GPU copy completed, staging texture ready");
            }
        }
    }
}

1.6 Cleanup in Shutdown

File: RGBASurfaceBackend.cpp (in Shutdown())

// Close fence event handle
if (m_copyFenceEvent != nullptr) {
    CloseHandle(m_copyFenceEvent);
    m_copyFenceEvent = nullptr;
}

// Release fence
m_copyFence.Reset();

Solution 2: NVDEC Decode Completion Wait

Objective: Ensure NVDEC hardware decoding completes before accessing decoded surface

Implementation:

2.1 Find Decode Slot in DecodeToSurface

File: NVDECAV1Decoder.cpp (in DecodeToSurface(), after FIFO wait)

// After FIFO ordering wait, find the actual slot used by this submission
int slot_idx = -1;
{
    std::lock_guard<std::mutex> lock(m_submissionMutex);

    // Search for slot matching this submission_id
    for (size_t i = 0; i < RING_BUFFER_SIZE; i++) {
        if (m_ringBuffer[i].submission_id == submission_id &&
            m_ringBuffer[i].in_use.load()) {
            slot_idx = static_cast<int>(i);
            break;
        }
    }
}

if (slot_idx < 0) {
    LOGF_ERROR("[DecodeToSurface] Failed to find decode slot for submission_id=%llu",
               submission_id);
    return VAVCORE_ERROR_DECODE_FAILED;
}

2.2 Wait for Decode Completion

File: NVDECAV1Decoder.cpp (in DecodeToSurface(), after finding slot)

DecodeSlot& slot = m_ringBuffer[slot_idx];

// Wait for NVDEC decode to complete (signaled by polling thread)
{
    std::unique_lock<std::mutex> slot_lock(slot.slot_mutex);

    bool decode_ready = slot.frame_ready.wait_for(
        slot_lock,
        std::chrono::milliseconds(500),  // 500ms timeout
        [&slot]() {
            return slot.is_ready.load();
        }
    );

    if (!decode_ready) {
        LOGF_ERROR("[DecodeToSurface] Decode timeout for slot %d (submission_id=%llu)",
                   slot_idx, submission_id);

        // Mark slot as failed
        slot.decoding_failed.store(true);
        slot.in_use.store(false);

        return VAVCORE_ERROR_DECODE_TIMEOUT;
    }

    // Check if decoding failed
    if (slot.decoding_failed.load()) {
        LOGF_ERROR("[DecodeToSurface] Decode failed for slot %d (submission_id=%llu)",
                   slot_idx, submission_id);

        slot.in_use.store(false);
        return VAVCORE_ERROR_DECODE_FAILED;
    }
}

LOGF_DEBUG("[DecodeToSurface] Decode completed for slot %d (submission_id=%llu)",
           slot_idx, submission_id);
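
The wait in 2.2 has three outcomes: ready, failed, and timed out. A minimal portable sketch of that logic follows. ModelSlot is a cut-down stand-in for DecodeSlot that reuses the field names from this design; WaitOutcome, WaitForSlot, and MarkSlot are hypothetical names introduced only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Simplified stand-in for a decode slot (only the fields the wait needs).
struct ModelSlot {
    std::mutex slot_mutex;
    std::condition_variable frame_ready;
    std::atomic<bool> is_ready{false};
    std::atomic<bool> decoding_failed{false};
};

enum class WaitOutcome { Ready, Failed, TimedOut };

// Blocks until the slot is marked ready or the timeout elapses, then
// separates a failed decode from a successful one.
inline WaitOutcome WaitForSlot(ModelSlot& slot, std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    std::unique_lock<std::mutex> lock(slot.slot_mutex);
    const bool woke = slot.frame_ready.wait_for(
        lock, timeout, [&slot] { return slot.is_ready.load(); });
    if (!woke) {
        return WaitOutcome::TimedOut;
    }
    return slot.decoding_failed.load() ? WaitOutcome::Failed : WaitOutcome::Ready;
}

// What the polling side does on completion or error: update the flags under
// the slot mutex, then notify, so a waiter cannot miss the wake-up.
inline void MarkSlot(ModelSlot& slot, bool failed) {
    {
        std::lock_guard<std::mutex> lock(slot.slot_mutex);
        slot.decoding_failed.store(failed);
        slot.is_ready.store(true);
    }
    slot.frame_ready.notify_all();
}
```

Setting is_ready on the error path as well is what lets the waiter return promptly instead of sitting out the full timeout.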

2.3 Update Polling Thread to Signal Readiness

File: NVDECAV1Decoder.cpp (in PollingThreadFunc())

Ensure the polling thread signals slot.is_ready once cuvidGetDecodeStatus() reports that decoding of the picture has finished

void NVDECAV1Decoder::PollingThreadFunc() {
    while (m_pollingRunning.load()) {
        // Poll all active slots
        for (size_t i = 0; i < RING_BUFFER_SIZE; i++) {
            DecodeSlot& slot = m_ringBuffer[i];

            if (!slot.in_use.load() || slot.is_ready.load()) {
                continue;  // Skip inactive or already ready slots
            }

            int pic_idx = slot.picture_index;
            if (pic_idx < 0) {
                continue;  // No picture assigned yet
            }

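            // Query decode status. The CUresult only says whether the query
            // itself succeeded; the per-picture state is returned through
            // the CUVIDGETDECODESTATUS output structure.
            CUVIDGETDECODESTATUS status = {};
            CUresult result = cuvidGetDecodeStatus(m_decoder, pic_idx, &status);

            if (result == CUDA_SUCCESS &&
                status.decodeStatus == cuvidDecodeStatus_Success) {
                // Decode complete
                {
                    std::lock_guard<std::mutex> lock(slot.slot_mutex);
                    slot.is_ready.store(true);
                }
                slot.frame_ready.notify_all();

                LOGF_DEBUG("[PollingThread] Slot %zu ready (pic_idx=%d)", i, pic_idx);
            } else if (result == CUDA_SUCCESS &&
                       (status.decodeStatus == cuvidDecodeStatus_InProgress ||
                        status.decodeStatus == cuvidDecodeStatus_Invalid)) {
                // Still decoding (or, by assumption, not submitted yet):
                // continue polling
            } else {
                // Query failed or decode error
                LOGF_ERROR("[PollingThread] Decode error for slot %zu (pic_idx=%d): result=%d, status=%d",
                           i, pic_idx, static_cast<int>(result),
                           static_cast<int>(status.decodeStatus));
                {
                    std::lock_guard<std::mutex> lock(slot.slot_mutex);
                    slot.decoding_failed.store(true);
                    slot.is_ready.store(true);  // Signal to wake up waiter
                }
                slot.frame_ready.notify_all();
            }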
        }

        // Poll interval: 1ms for responsiveness
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

    LOGF_DEBUG("[PollingThread] Polling thread exiting");
}
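
cuvidGetDecodeStatus() reports the per-picture state through a CUVIDGETDECODESTATUS output structure, separately from the CUresult of the call itself, so the polling decision can be isolated in a small pure function. The sketch below uses stand-in enums (ModelDecodeStatus, PollAction) so that it compiles without the NVDEC headers. Treating an invalid status as "keep polling" and a concealed error as a failure are assumptions of this sketch, not decisions made by this design.

```cpp
#include <cassert>

// Stand-in for the decode states reported via CUVIDGETDECODESTATUS
// (hypothetical enum, declared here so the sketch builds without nvcuvid.h).
enum class ModelDecodeStatus { Invalid, InProgress, Success, Error, ErrorConcealed };

// What the polling thread should do with a slot after one status query.
enum class PollAction { KeepPolling, MarkReady, MarkFailed };

// Maps one status query to an action. `querySucceeded` models whether the
// status call itself returned CUDA_SUCCESS.
inline PollAction ClassifyPoll(bool querySucceeded, ModelDecodeStatus status) {
    if (!querySucceeded) {
        return PollAction::MarkFailed;  // the query itself failed
    }
    switch (status) {
        case ModelDecodeStatus::Success:
            return PollAction::MarkReady;
        case ModelDecodeStatus::InProgress:
        case ModelDecodeStatus::Invalid:  // assumption: not submitted yet
            return PollAction::KeepPolling;
        case ModelDecodeStatus::Error:
        case ModelDecodeStatus::ErrorConcealed:  // assumption: strict policy
            return PollAction::MarkFailed;
    }
    assert(false && "unhandled decode status");
    return PollAction::MarkFailed;
}
```

Keeping the mapping separate from the loop makes the policy for the borderline states easy to change and to unit test.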

Solution 3: Playback Timing Strategy Redesign

Objective: Maintain fixed 33.33ms frame intervals regardless of decode time variation

Implementation:

3.1 Callback-First with Absolute Target Tracking

File: PlaybackController.cpp (in TimingThreadLoop())

void PlaybackController::TimingThreadLoop()
{
    // Set Windows timer resolution to 1ms for accurate sleep
    timeBeginPeriod(1);
    LOGF_INFO("[PlaybackController] Set Windows timer resolution to 1ms");

    double baseIntervalMs = 1000.0 / m_frameRate;
    auto startTime = std::chrono::high_resolution_clock::now();
    auto nextFrameTarget = startTime;

    LOGF_INFO("[PlaybackController] Timing thread loop started (target: %.2f fps, %.2f ms per frame)",
              m_frameRate, baseIntervalMs);

    while (!m_shouldStopTiming && m_isPlaying) {
        auto frameStart = std::chrono::high_resolution_clock::now();

        // Apply playback speed
        double speed = m_playbackSpeed.load();
        double targetIntervalMs = baseIntervalMs / speed;

        // Invoke callback FIRST (blocking decode + render)
        // This allows decode time to vary (6-20ms) without affecting frame interval
        auto callbackStart = std::chrono::high_resolution_clock::now();
        if (m_frameReadyCallback) {
            m_frameReadyCallback();
        }
        auto callbackEnd = std::chrono::high_resolution_clock::now();
        double callbackTime = std::chrono::duration<double, std::milli>(callbackEnd - callbackStart).count();

        // Calculate next frame target (fixed interval from start time)
        nextFrameTarget += std::chrono::microseconds(static_cast<long long>(targetIntervalMs * 1000));

        // Sleep until next frame target
        auto now = std::chrono::high_resolution_clock::now();
        auto sleepDuration = nextFrameTarget - now;

        if (sleepDuration.count() > 0) {
            // Sleep for remaining time
            std::this_thread::sleep_until(nextFrameTarget);

            double sleepTime = std::chrono::duration<double, std::milli>(sleepDuration).count();
            LOGF_DEBUG("[PlaybackController] Frame %llu timing: callback=%.2fms, sleep=%.2fms",
                      m_currentFrame, callbackTime, sleepTime);
        } else {
            // Missed target - log warning
            double missedBy = std::chrono::duration<double, std::milli>(-sleepDuration).count();
            LOGF_WARNING("[PlaybackController] Frame %llu MISSED target by %.2fms (callback took %.2fms)",
                        m_currentFrame, missedBy, callbackTime);

            // Reset target to current time to avoid cumulative drift
            nextFrameTarget = now;
        }

        // Update current time
        m_currentFrame++;
        m_currentTime = m_currentFrame / m_frameRate;
    }

    // Restore Windows timer resolution
    timeEndPeriod(1);
    LOGF_INFO("[PlaybackController] Timing thread loop exited, timer resolution restored");
}

3.2 Timing Characteristics

New Timing Behavior:

Frame 0 (Normal, 20ms decode):
  [Callback: 20ms] [Sleep: 13.33ms] = 33.33ms total ✓

Frame 1 (Normal, 20ms decode):
  [Callback: 20ms] [Sleep: 13.33ms] = 33.33ms total ✓

Frame 2 (Display-only, 6ms decode):
  [Callback: 6ms] [Sleep: 27.33ms] = 33.33ms total ✓

Frame 3 (Normal, 20ms decode):
  [Callback: 20ms] [Sleep: 13.33ms] = 33.33ms total ✓
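
The sleep durations listed above follow from the absolute-target rule and can be checked on a virtual clock, without real sleeping. This is a sketch with hypothetical names (FrameSchedule, SimulateAbsoluteTargets); it mirrors the loop structure of 3.1, including the re-anchoring of the target after a missed frame.

```cpp
#include <cassert>
#include <vector>

struct FrameSchedule {
    double sleepMs;  // time slept after the callback (0 when the target was missed)
    bool missed;     // true if the callback overran the frame target
};

// Simulates callback-first scheduling against absolute targets.
// callbackMs holds the blocking callback duration of each frame.
inline std::vector<FrameSchedule> SimulateAbsoluteTargets(
        const std::vector<double>& callbackMs, double intervalMs) {
    assert(intervalMs > 0.0);
    std::vector<FrameSchedule> out;
    double now = 0.0;     // virtual clock in milliseconds
    double target = 0.0;  // absolute deadline of the next frame
    for (double cb : callbackMs) {
        now += cb;             // the callback runs first
        target += intervalMs;  // fixed step from the previous target
        if (target > now) {
            out.push_back({target - now, false});
            now = target;      // models sleep_until(target)
        } else {
            out.push_back({0.0, true});
            target = now;      // re-anchor to avoid chasing a stale schedule
        }
    }
    return out;
}
```

With callbacks of 20, 20, 6, and 20ms and an interval of 1000/30 ms, the simulated sleeps come out at about 13.33, 13.33, 27.33, and 13.33ms, matching the listing above. A 40ms callback is reported as missed and the following frame is scheduled relative to the new anchor.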

Benefits:

  • Fixed 33.33ms frame interval maintained
  • Decode time variation absorbed by sleep duration
  • No cumulative timing drift
  • Consistent VSync alignment (each frame spans 2 VSync cycles = 33.33ms at 60Hz)

Implementation Plan

Phase 1: Solution 1 (GPU Copy Sync)

  1. Add fence/event members to RGBASurfaceBackend.h
  2. Create fence in Initialize(), cleanup in Shutdown()
  3. Implement WaitForCopyCompletion()
  4. Signal fence in CopyToStagingTexture()
  5. Call wait in FrameProcessor::ProcessFrame()
  6. Build and test: Verify staging texture stability

Phase 2: Solution 2 (NVDEC Decode Sync)

  1. Update DecodeToSurface() to find slot after FIFO wait
  2. Add decode completion wait with timeout
  3. Update PollingThreadFunc() to properly signal readiness
  4. Add error handling for decode failures/timeouts
  5. Build and test: Verify decode completion before surface access

Phase 3: Solution 3 (Playback Timing)

  1. Redesign TimingThreadLoop() with callback-first strategy
  2. Implement absolute frame target tracking
  3. Add missed frame detection and recovery
  4. Build and test: Verify consistent 33.33ms frame intervals

Phase 4: Integration Testing

  1. Run full playback test with all 3 fixes
  2. Measure frame timing consistency
  3. Verify stutter elimination
  4. Performance profiling (GPU/CPU utilization)

Expected Results

Before Fix

  • GPU copy race: Flickering/tearing artifacts
  • NVDEC not ready: Partial decoded frames
  • Timing irregular: 39-53ms frame intervals against a 33.33ms target (5.67-19.67ms late per frame)
  • User perception: Severe stuttering ("bouncy" playback)

After Fix

  • GPU copy complete: Stable staging texture
  • NVDEC verified: Complete decoded frames
  • Timing fixed: Consistent 33.33ms intervals (±1ms tolerance)
  • User perception: Smooth 30fps playback

Performance Impact

CPU Impact

  • GPU fence wait: ~0.1ms per frame (minimal, GPU-bound)
  • NVDEC status poll: ~1ms per frame (already running in background)
  • Timing redesign: No additional CPU overhead

GPU Impact

  • Fence overhead: Negligible (native GPU operation)
  • No additional GPU work introduced

Latency Impact

  • Added synchronization: +1-2ms per frame
  • Trade-off: Slightly higher latency for stability and smoothness
  • Still well within 30fps budget (33.33ms)

Risks and Mitigations

Risk 1: Fence Wait Timeout

Mitigation: 5-second timeout with error logging, graceful fallback

Risk 2: NVDEC Decode Timeout

Mitigation: 500ms timeout, mark slot as failed, continue with next frame

Risk 3: Callback Takes >33ms

Mitigation: Detect missed frames, log warning, reset timing target to prevent drift


Testing Strategy

Unit Tests

  • GPU fence creation/signaling/waiting
  • NVDEC decode status polling accuracy
  • Frame timing calculation correctness

Integration Tests

  • Full 30fps playback for 60 seconds
  • Frame interval histogram (should cluster at 33.33ms ±1ms)
  • Visual inspection for stutter/artifacts

Performance Tests

  • CPU utilization during playback
  • GPU utilization during playback
  • Memory usage over extended playback

Success Criteria

  1. No GPU copy race: Staging texture content stable across frames
  2. NVDEC decode verified: All frames fully decoded before access
  3. Frame timing consistent: 95% of frames within 33.33ms ±2ms
  4. Stutter eliminated: No visible "jumping" or irregular motion
  5. Performance acceptable: <5% CPU overhead, <2ms added latency
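
Criterion 3 can be evaluated mechanically from a log of measured frame intervals. The sketch below shows one possible check. The function names are hypothetical, and the source of the interval measurements (for example, timestamps taken around the frame callback) is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of measured intervals that fall within targetMs ± toleranceMs.
inline double FractionWithinTolerance(const std::vector<double>& intervalsMs,
                                      double targetMs, double toleranceMs) {
    assert(toleranceMs >= 0.0);
    if (intervalsMs.empty()) {
        return 0.0;  // no data: treat as not meeting the criterion
    }
    std::size_t ok = 0;
    for (double v : intervalsMs) {
        const double delta = v - targetMs;
        if (delta <= toleranceMs && delta >= -toleranceMs) {
            ++ok;
        }
    }
    return static_cast<double>(ok) / static_cast<double>(intervalsMs.size());
}

// Success criterion 3: at least 95% of frames within ±2ms of the frame period.
inline bool MeetsTimingCriterion(const std::vector<double>& intervalsMs, double frameRate) {
    assert(frameRate > 0.0);
    return FractionWithinTolerance(intervalsMs, 1000.0 / frameRate, 2.0) >= 0.95;
}
```

The same helper with a 1ms tolerance could back the frame interval histogram check in the integration tests.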

References

  • Previous fix: NVDEC_Frame_Reordering_Fix_Design.md
  • CUDA Documentation: cuvidGetDecodeStatus()
  • D3D12 Documentation: ID3D12Fence, SetEventOnCompletion()
  • Windows Multimedia Timer: timeBeginPeriod()

Document Status: Design Complete, Ready for Implementation
Next Step: Implement Solution 1 (GPU Copy Sync)