NVDEC RingBuffer-based Asynchronous Decoding Design
Date: 2025-10-04 Status: 🔄 In Progress Target: NVDECAV1Decoder DecodeToSurface() optimization
📋 Table of Contents
- Background & Problem Statement
- Design Goals
- Architecture Overview
- Detailed Design
- Implementation Plan
- Performance Analysis
- Risk Assessment
- References
🎯 Background & Problem Statement
Current Implementation Issues
Issue 1: Sequential Decoding Bottleneck
Current Code (NVDECAV1Decoder::DecodeToSurface):
bool DecodeToSurface(packet_data, target_surface, output_frame) {
cuvidParseVideoData(packet); // Submit packet
wait_for_frame(); // Block until decode complete
cuvidMapVideoFrame(); // Map decoded frame
CopyNV12Frame(srcPtr, target_surface); // Copy to surface
cuvidUnmapVideoFrame(); // Unmap frame
return true;
}
Timeline:
Call 1: [Submit][Wait____][Map][Copy][Unmap]
Call 2: [Submit][Wait____][Map][Copy][Unmap]
Call 3: [Submit][Wait____][Map][Copy][Unmap]
Total: ~45ms (15ms * 3)
Problem: GPU sits idle during packet submission and surface copying.
Issue 2: Packet-Surface Mapping Mismatch
Scenario: Multi-threaded DecodeToSurface calls
// Thread A
DecodeToSurface(packet1, surface1, frame1); // I-frame, 15ms
// Thread B (simultaneous)
DecodeToSurface(packet2, surface2, frame2); // P-frame, 3ms
// GPU completes out-of-order
packet2 completes @ t=3ms → picture_index=5
packet1 completes @ t=15ms → picture_index=4
// Current queue-based retrieval
Thread A: pop() → picture_index=5 ❌ (gets packet2 result!)
Thread B: pop() → picture_index=4 ❌ (gets packet1 result!)
// Result: surface1 contains packet2, surface2 contains packet1
Root Cause:
- FIFO queue doesn't track which packet belongs to which surface
- Decoding completion order ≠ submission order
Issue 3: ulNumOutputSurfaces Underutilization
NVDEC Configuration:
m_createInfo.ulNumOutputSurfaces = 8; // 8 output buffers allocated
Purpose of 8 Surfaces:
- Reference frame storage (AV1 needs up to 8 reference frames)
- Display reordering buffer (B-frames, decode order ≠ display order)
- Pipeline parallelism (overlap decode/map/copy operations)
Current Usage: Only 1-2 surfaces actually utilized → Wasted GPU memory (~24MB for 1920x1080)
🎯 Design Goals
Primary Goals
- ✅ Correct Packet-Surface Mapping
  - Guarantee: DecodeToSurface(packetN, surfaceN) → surfaceN contains packetN's result
  - No out-of-order completion issues
- ✅ Sequential Return Order
  - Even if packet2 completes before packet1, return surface1 first
  - Maintain FIFO semantics for API users
- ⚡ Parallel Decoding Throughput
  - Allow 2-3 packets to decode simultaneously on GPU
  - Reduce total decode latency by ~30-40%
- 🛡️ Simplicity & Maintainability
  - Avoid complex ID mapping systems
  - Use fixed-size RingBuffer (matches NVDEC hardware design)
  - Single polling thread (not per-decode thread)
Non-Goals
- ❌ Full 8-frame parallel decoding (limited by reference frame dependencies)
- ❌ Out-of-order surface return (would break video playback semantics)
- ❌ Dynamic buffer sizing (fixed 8 slots matches NVDEC hardware)
🏗️ Architecture Overview
High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ DecodeToSurface() Call │
│ (Thread 1, 2, 3, ... can call concurrently) │
└────────────┬────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ RingBuffer (8 slots) │
│ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │
│ │ [0] │ [1] │ [2] │ [3] │ [4] │ [5] │ [6] │ [7] │ │
│ └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ │
│ ▲ ▲ │
│ │ │ │
│ m_returnIndex m_submitIndex │
│ (consumer) (producer) │
└────────────┬────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Polling Thread │
│ while (polling_running) { │
│ slot = ringBuffer[m_returnIndex] │
│ if (cuvidGetDecodeStatus(slot.picture_index) == Success) { │
│ slot.is_ready = true │
│ slot.frame_ready.notify_one() │
│ } │
│ sleep(100us) │
│ } │
└─────────────────────────────────────────────────────────────────┘
Data Structures
DecodeSlot Structure
struct DecodeSlot {
// Slot state
bool in_use; // Is this slot allocated?
// Submitted information (from DecodeToSurface args)
void* target_surface; // Destination D3D12 resource
VavCoreSurfaceType surface_type; // Surface type
// NVDEC information (from HandlePictureDisplay callback)
int picture_index; // NVDEC frame index for cuvidMapVideoFrame
// Synchronization primitives
std::condition_variable frame_ready; // Signaled when decode complete
std::mutex slot_mutex; // Protects this slot's state
bool is_ready; // Decode completed flag
};
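Because each slot embeds a std::mutex and std::condition_variable, DecodeSlot is neither copyable nor movable: the array is constructed in place and slots are recycled by clearing their fields under the slot lock. A minimal reset sketch (ResetSlot is a hypothetical helper, not existing code):
void ResetSlot(DecodeSlot& slot) {
    // DecodeSlot cannot be reassigned wholesale (mutex/condition_variable are
    // non-copyable), so per-decode fields are cleared individually.
    std::lock_guard<std::mutex> lock(slot.slot_mutex);
    slot.in_use = false;
    slot.target_surface = nullptr;
    slot.picture_index = -1;
    slot.is_ready = false;
}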
RingBuffer Members
class NVDECAV1Decoder {
private:
static constexpr size_t RING_BUFFER_SIZE = 8; // Match ulNumOutputSurfaces
DecodeSlot m_ringBuffer[RING_BUFFER_SIZE];
// Producer-consumer indices
std::atomic<size_t> m_submitIndex{0}; // Next slot to allocate (producer)
std::atomic<size_t> m_returnIndex{0}; // Next slot to return (consumer)
// Polling thread
std::thread m_pollingThread;
std::atomic<bool> m_pollingRunning{false};
};
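Phase 2 of the implementation plan starts the polling thread in Initialize() and stops it in Cleanup(). A minimal lifecycle sketch under that assumption (StartPolling/StopPolling are hypothetical helper names):
void NVDECAV1Decoder::StartPolling() {
    // Called from Initialize(), after the decoder and parser exist
    m_pollingRunning = true;
    m_pollingThread = std::thread(&NVDECAV1Decoder::PollingThreadFunc, this);
}

void NVDECAV1Decoder::StopPolling() {
    // Called from Cleanup(); join (never detach) so slot state outlives the thread
    m_pollingRunning = false;
    if (m_pollingThread.joinable())
        m_pollingThread.join();
}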
🔧 Detailed Design
Component 1: Slot Allocation (Producer)
Purpose: Assign RingBuffer slot to each DecodeToSurface call
// In DecodeToSurface()
// 1. Allocate next available slot
size_t my_slot_idx = m_submitIndex.fetch_add(1) % RING_BUFFER_SIZE;
DecodeSlot& my_slot = m_ringBuffer[my_slot_idx];
// 2. Check for overflow
{
std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
if (my_slot.in_use) {
// RingBuffer full! Too many concurrent decodes.
// Note: the fetch_add above has already consumed a sequence number even
// though this call fails; see the allocation sketch below for a variant
// that avoids this.
LogError("RingBuffer overflow - max 8 concurrent decodes");
return false;
}
// 3. Initialize slot
my_slot.in_use = true;
my_slot.target_surface = target_surface;
my_slot.surface_type = target_type;
my_slot.picture_index = -1; // Set by HandlePictureDisplay
my_slot.is_ready = false;
}
Atomic Counter Behavior:
Thread 1: m_submitIndex.fetch_add(1) → 0 % 8 = slot[0]
Thread 2: m_submitIndex.fetch_add(1) → 1 % 8 = slot[1]
Thread 3: m_submitIndex.fetch_add(1) → 2 % 8 = slot[2]
...
Thread 9: m_submitIndex.fetch_add(1) → 8 % 8 = slot[0] (wrap around)
Overflow Protection: If slot[0].in_use == true when Thread 9 arrives → error
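One subtlety: a fetch_add that subsequently fails the overflow check has still consumed a sequence number, which would leave m_returnIndex waiting on a slot that was never submitted. A sketch of an allocation variant that advances the index only on success (TryAllocateSlot is a hypothetical helper):
bool NVDECAV1Decoder::TryAllocateSlot(size_t& out_idx) {
    size_t seq = m_submitIndex.load();
    DecodeSlot& slot = m_ringBuffer[seq % RING_BUFFER_SIZE];
    std::lock_guard<std::mutex> lock(slot.slot_mutex);
    if (slot.in_use)
        return false;  // RingBuffer full: fail fast, no sequence number consumed
    if (!m_submitIndex.compare_exchange_strong(seq, seq + 1))
        return false;  // lost a race with another submitter; caller may retry
    slot.in_use = true;
    out_idx = seq % RING_BUFFER_SIZE;
    return true;
}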
Component 2: Packet Submission
Purpose: Submit packet to NVDEC with slot index tracking
// 4. Submit packet to NVDEC parser
CUVIDSOURCEDATAPACKET packet = {};
packet.payload = packet_data;
packet.payload_size = packet_size;
packet.flags = CUVID_PKT_ENDOFPICTURE;
packet.timestamp = my_slot_idx; // ✅ Embed slot index in timestamp
CUresult result = cuvidParseVideoData(m_parser, &packet);
if (result != CUDA_SUCCESS) {
my_slot.in_use = false; // Release slot on error
return false;
}
Timestamp Flow:
cuvidParseVideoData(packet, timestamp=2)
↓
HandleVideoSequence() (first time only)
↓
HandlePictureDecode(timestamp=2)
↓
GPU decodes packet...
↓
HandlePictureDisplay(timestamp=2, picture_index=5)
↓
m_ringBuffer[2].picture_index = 5 // ✅ Slot 2 now linked to picture_index 5
Component 3: Polling Thread (Background Status Checker)
Purpose: Continuously poll m_returnIndex slot for decode completion
void NVDECAV1Decoder::PollingThreadFunc() {
while (m_pollingRunning) {
// 1. Get current return slot (oldest pending decode)
size_t current_return_idx = m_returnIndex.load() % RING_BUFFER_SIZE; // wrap to ring size
DecodeSlot& slot = m_ringBuffer[current_return_idx];
// 2. Check if slot is in use and not yet ready
if (slot.in_use && !slot.is_ready && slot.picture_index >= 0) {
// 3. Query NVDEC for decode status
CUVIDGETDECODESTATUS decodeStatus = {};
CUresult result = cuvidGetDecodeStatus(m_decoder, slot.picture_index, &decodeStatus);
if (result == CUDA_SUCCESS) {
if (decodeStatus.decodeStatus == cuvidDecodeStatus_Success) {
// ✅ Decode complete!
{
std::lock_guard<std::mutex> lock(slot.slot_mutex);
slot.is_ready = true;
}
// Wake up waiting DecodeToSurface thread
slot.frame_ready.notify_one();
OutputDebugStringA("[Polling] Slot ready\n");
}
else if (decodeStatus.decodeStatus == cuvidDecodeStatus_Error) {
// Decode error - mark as ready to unblock
{
std::lock_guard<std::mutex> lock(slot.slot_mutex);
slot.is_ready = true; // Error also counts as "ready"
}
slot.frame_ready.notify_one();
OutputDebugStringA("[Polling] Decode error\n");
}
// cuvidDecodeStatus_InProgress → keep polling
}
}
// 4. Sleep to avoid busy-wait
std::this_thread::sleep_for(std::chrono::microseconds(100));
}
}
Key Points:
- ✅ Only polls the m_returnIndex slot (not all 8 slots) → efficient
- ✅ Uses the non-blocking cuvidGetDecodeStatus() query
- ✅ 100us sleep → ~10,000 checks/second (low CPU usage)
- ✅ Handles decode errors gracefully
Component 4: Sequential Return Wait
Purpose: Enforce FIFO order even when decodes complete out-of-order
// In DecodeToSurface() - PHASE 2
// 5. Wait for my turn (sequential return order)
while (m_returnIndex.load() % RING_BUFFER_SIZE != my_slot_idx) { // wrap: m_returnIndex counts beyond 8
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
// Now I'm first in line!
// 6. Wait for decode to complete
{
std::unique_lock<std::mutex> lock(my_slot.slot_mutex);
if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500),
[&my_slot]() { return my_slot.is_ready; })) {
// Timeout - decode took too long
LogError("Decode timeout");
my_slot.in_use = false;
m_returnIndex.fetch_add(1); // Skip this slot to avoid deadlock
return false;
}
}
Timeline Example:
Thread 1 (slot 0): Wait for returnIndex==0 ✅ (immediate)
Wait for is_ready...
Thread 2 (slot 1): Wait for returnIndex==1 ⏸️ (blocked)
Thread 3 (slot 2): Wait for returnIndex==2 ⏸️ (blocked)
GPU: packet2 completes first @ t=3ms
→ slot[1].is_ready = true
→ Thread 2 still blocked (returnIndex=0)
GPU: packet1 completes @ t=15ms
→ slot[0].is_ready = true
→ Thread 1 wakes up ✅
→ Thread 1 processes → returnIndex = 1
→ Thread 2 now unblocked ✅
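The 1 ms turn-polling loop above is simple and adequate at video frame rates; if wakeup latency matters, the same FIFO ordering can be enforced with a shared condition variable notified whenever m_returnIndex advances. A sketch under that assumption (m_returnMutex and m_returnCv would be new, hypothetical members):
{
    // Every site that does m_returnIndex.fetch_add(1) would also call
    // m_returnCv.notify_all() so blocked submitters re-check their turn.
    std::unique_lock<std::mutex> lock(m_returnMutex);
    m_returnCv.wait(lock, [this, my_slot_idx] {
        return m_returnIndex.load() % RING_BUFFER_SIZE == my_slot_idx;
    });
}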
Component 5: Frame Retrieval & Cleanup
Purpose: Map decoded frame, copy to surface, release slot
// In DecodeToSurface() - PHASE 3
int frameIdx = my_slot.picture_index;
// 7. Map decoded frame from NVDEC
CUVIDPROCPARAMS procParams = {};
procParams.progressive_frame = 1;
CUdeviceptr srcDevicePtr = 0;
unsigned int srcPitch = 0;
CUresult result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
if (result != CUDA_SUCCESS) {
my_slot.in_use = false;
m_returnIndex.fetch_add(1);
return false;
}
// 8. Copy to D3D12 surface
ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
bool copySuccess = m_d3d12Handler->CopyNV12Frame(
srcDevicePtr, srcPitch, d3d12Resource, m_width, m_height
);
// 9. Unmap frame
cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);
// 10. Release slot
{
std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
my_slot.in_use = false;
}
// 11. Advance return index
m_returnIndex.fetch_add(1);
return copySuccess;
Component 6: NVDEC Callback Integration
Purpose: Link NVDEC picture_index to RingBuffer slot
int CUDAAPI NVDECAV1Decoder::HandlePictureDisplay(void* user_data, CUVIDPARSERDISPINFO* disp_info) {
auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);
// Extract slot index from timestamp
size_t slot_idx = static_cast<size_t>(disp_info->timestamp) % RING_BUFFER_SIZE;
DecodeSlot& slot = decoder->m_ringBuffer[slot_idx];
// Store NVDEC picture index
{
std::lock_guard<std::mutex> lock(slot.slot_mutex);
slot.picture_index = disp_info->picture_index;
}
// Polling thread will check cuvidGetDecodeStatus() for this picture_index
return 1;
}
📐 Implementation Plan
Phase 1: Data Structure Setup ✅
Files to Modify:
- NVDECAV1Decoder.h - Add RingBuffer members
- NVDECAV1Decoder.cpp - Initialize RingBuffer in constructor
Tasks:
- Define DecodeSlot structure
- Add m_ringBuffer[8] array
- Add m_submitIndex, m_returnIndex atomic counters
- Add m_pollingThread, m_pollingRunning members
Estimated Time: 30 minutes
Phase 2: Polling Thread Implementation
Files to Modify:
- NVDECAV1Decoder.cpp - Implement PollingThreadFunc()
Tasks:
- Implement polling loop with cuvidGetDecodeStatus()
- Add thread start in Initialize()
- Add thread stop in Cleanup()
- Add debug logging for polling events
Testing:
- Verify thread starts/stops correctly
- Verify cuvidGetDecodeStatus() calls work
Estimated Time: 1 hour
Phase 3: DecodeToSurface Refactoring
Files to Modify:
- NVDECAV1Decoder.cpp - Rewrite DecodeToSurface()
Tasks:
- Phase 1: Slot allocation logic
- Phase 2: Sequential return wait logic
- Phase 3: Frame retrieval & cleanup logic
- Error handling for all failure paths
Testing:
- Single-threaded decode test
- Multi-threaded decode test (2-3 threads)
- Verify packet-surface mapping correctness
Estimated Time: 2 hours
Phase 4: HandlePictureDisplay Update
Files to Modify:
- NVDECAV1Decoder.cpp - Modify HandlePictureDisplay()
Tasks:
- Extract slot_idx from timestamp
- Store picture_index in correct slot
- Add debug logging
Testing:
- Verify timestamp → slot_idx mapping
- Verify picture_index stored correctly
Estimated Time: 30 minutes
Phase 5: Integration Testing
Test Scenarios:
- Single packet decode - Verify basic functionality
- Sequential 3 packets - Verify FIFO order
- Out-of-order completion - Verify correct mapping (I-frame after P-frame)
- RingBuffer overflow - Verify error handling (9+ concurrent calls)
- Decode errors - Verify graceful failure
- Performance benchmark - Measure latency reduction
Test Files:
- Simple test video (simple_test.webm)
- Complex GOP structure video (test_720p_stripe.webm)
Estimated Time: 2 hours
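For the out-of-order completion scenario, a minimal concurrency test sketch; the Packet struct and the exact DecodeToSurface signature are illustrative, and real verification would checksum each surface against a sequential reference decode:
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct Packet { const uint8_t* data; size_t size; };  // illustrative only

void TestConcurrentMapping(NVDECAV1Decoder& decoder,
                           const std::vector<Packet>& packets,
                           const std::vector<void*>& surfaces) {
    std::vector<char> ok(packets.size(), 0);  // char, not vector<bool>: safe per-thread writes
    std::vector<std::thread> workers;
    for (size_t i = 0; i < packets.size(); ++i) {
        workers.emplace_back([&, i] {
            // Each thread decodes packet i into surface i concurrently
            ok[i] = decoder.DecodeToSurface(packets[i].data, packets[i].size,
                                            surfaces[i]) ? 1 : 0;
        });
    }
    for (auto& t : workers) t.join();
    // Verify: each surfaces[i] must contain the decoded result of packets[i];
    // any swap indicates a packet-surface mapping bug.
}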
Phase 6: Documentation & Cleanup
Tasks:
- Update NVDEC design documentation
- Add inline code comments
- Remove old queue-based code
- Move design doc to docs/completed/
Estimated Time: 1 hour
📊 Performance Analysis
Expected Improvements
Scenario 1: Sequential 3-Frame Decode
Current (Sequential):
Frame 1: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Frame 2: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Frame 3: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Total: 45ms
With RingBuffer (Parallel):
Submit all 3: [S1][S2][S3] = 3ms
GPU decode: [─── 10ms ───] (parallel)
Sequential return:
Frame 1: [Wait][Map][Copy][Unmap] = 5ms
Frame 2: [Map][Copy][Unmap] = 4ms (already decoded)
Frame 3: [Map][Copy][Unmap] = 4ms (already decoded)
Total: 3ms + 10ms + 13ms = 26ms
Improvement: 45ms → 26ms (42% faster)
Scenario 2: Complex I/P Frame Mix
Video: I-frame (15ms), P-frame (3ms), P-frame (3ms)
Current:
I: 15ms
P: 3ms
P: 3ms
Total: 21ms
With RingBuffer:
Submit: 3ms
GPU: [I:15ms, P:3ms, P:3ms overlap] = 15ms (limited by I-frame)
Return: 9ms (sequential)
Total: 3ms + 15ms + 9ms = 27ms if the phases serialize (slower due to overhead)
But if submission overlaps the I-frame decode: ~21ms (same as current)
Note: Benefit depends on GOP structure. Simple videos may see less improvement.
CPU & GPU Utilization
Current:
CPU: [Active──────][Idle────────][Active──────][Idle────────]
GPU: [Decode──────] [Decode──────]
With RingBuffer:
CPU: [Active][Active][Active][─────Wait─────][Active][Active]
GPU: [Decode1][Decode2][Decode3]
GPU Utilization: ⬆️ Increased by ~30%
CPU Utilization: ➡️ Similar (waiting happens in different places)
🛡️ Risk Assessment
Risk 1: RingBuffer Overflow
Scenario: Application calls DecodeToSurface() 9+ times before first frame completes
Mitigation:
if (my_slot.in_use) {
LogError("RingBuffer overflow");
return false; // Fail fast
}
Alternative: Block until slot available
while (my_slot.in_use) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
Recommendation: Fail fast + document max concurrent limit (8 frames)
Risk 2: Decode Timeout Deadlock
Scenario: GPU hangs, polling thread never sets is_ready
Mitigation:
if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500), ...)) {
// Timeout - skip this slot
my_slot.in_use = false;
m_returnIndex.fetch_add(1); // Advance to prevent deadlock
return false;
}
Risk: Skipping slot may cause next slot to fail (reference frame missing)
Recommendation: Return error, let application decide (retry/abort)
Risk 3: Polling Thread CPU Usage
100us sleep → ~10,000 checks/second
CPU Impact:
Per check: ~1us (cuvidGetDecodeStatus overhead)
Total: 10,000 * 1us = 10ms/sec = 1% CPU
Mitigation: Acceptable overhead for real-time video decoding
Alternative: Adaptive polling (faster when frames pending, slower when idle)
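A sketch of that adaptive variant (thresholds are illustrative):
void NVDECAV1Decoder::PollingThreadFunc() {  // adaptive variant of Component 3
    cuCtxSetCurrent(m_cuContext);  // see Risk 4: set context once per thread
    while (m_pollingRunning) {
        DecodeSlot& slot = m_ringBuffer[m_returnIndex.load() % RING_BUFFER_SIZE];
        bool pending = slot.in_use && !slot.is_ready && slot.picture_index >= 0;
        if (pending) {
            // ... cuvidGetDecodeStatus() check exactly as in Component 3 ...
        }
        // Poll fast while a decode is pending; back off 10x when idle
        std::this_thread::sleep_for(pending ? std::chrono::microseconds(100)
                                            : std::chrono::microseconds(1000));
    }
}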
Risk 4: Thread Safety - CUDA Context
Issue: cuvidGetDecodeStatus() requires correct CUDA context
Current Protection:
std::lock_guard<std::mutex> contextLock(m_cudaContextMutex);
cuCtxSetCurrent(m_cuContext);
In Polling Thread:
void PollingThreadFunc() {
// Set CUDA context once at thread start
cuCtxSetCurrent(m_cuContext);
while (m_pollingRunning) {
// cuvidGetDecodeStatus() now safe
cuvidGetDecodeStatus(...);
}
}
Recommendation: Set context once per thread (not per call)
📚 References
NVIDIA Documentation
- NVDEC Video Decoder API Programming Guide
  - URL: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/nvdec-video-decoder-api-prog-guide/
  - Section: "cuvidGetDecodeStatus" API reference
- Video Codec SDK 12.2
  - URL: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/
  - Section: "NVDECODE API" - asynchronous decoding patterns
- CUDA Programming Guide
  - URL: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
  - Section: "Multi-threading" - CUDA context management
Previous Design Documents
- NVDEC Deferred Decoder Recreation Design (Abandoned)
  - Path: docs/completed/windows/nvdec/NVDEC_Deferred_Decoder_Recreation_Design.md
  - Reason for abandonment: Incompatible with NVDEC synchronous callback architecture
  - Lessons learned: NVDEC parser requires immediate decoder availability
- VavCore Library Design
  - Path: VavCore_Library_Design.md
  - Relevant sections: C API design, cross-platform surface types
Code References
- Current DecodeToSurface Implementation
  - File: platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp, lines 976-1200
  - Current approach: Synchronous wait with condition variable
- HandlePictureDisplay Callback
  - File: platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp, lines 791-813
  - Current approach: Push picture_index to queue, notify condition variable
- NVDEC Initialization
  - File: platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp, lines 600-650
  - Key setting: m_createInfo.ulNumOutputSurfaces = 8
🔄 Change Log
| Date | Author | Change Description |
|---|---|---|
| 2025-10-04 | Claude | Initial design document created |
| 2025-10-04 | Claude | Added RingBuffer architecture details |
| 2025-10-04 | Claude | Added polling thread design |
| 2025-10-04 | Claude | Added performance analysis |
✅ Approval & Sign-off
Design Review:
- Architecture approved
- Performance estimates validated
- Risk mitigation strategies accepted
Implementation Ready:
- All design questions resolved
- Implementation plan reviewed
- Test plan defined
Next Steps: Proceed to Phase 1 - Data Structure Setup