
NVDEC RingBuffer-based Asynchronous Decoding Design

Date: 2025-10-05 (Updated)
Status: Design Finalized - Ready for Implementation
Target: NVDECAV1Decoder DecodeToSurface() optimization with multi-frame packet support


📋 Table of Contents

  1. Background & Problem Statement
  2. Design Goals
  3. Architecture Overview
  4. Detailed Design
  5. Implementation Plan
  6. Performance Analysis
  7. Risk Assessment
  8. References

🎯 Background & Problem Statement

Current Implementation Issues

Issue 1: Sequential Decoding Bottleneck

Current Code (NVDECAV1Decoder::DecodeToSurface):

bool DecodeToSurface(packet_data, target_surface, output_frame) {
    cuvidParseVideoData(packet);           // Submit packet
    wait_for_frame();                      // Block until decode complete
    cuvidMapVideoFrame();                  // Map decoded frame
    CopyNV12Frame(srcPtr, target_surface); // Copy to surface
    cuvidUnmapVideoFrame();                // Unmap frame
    return true;
}

Timeline:

Call 1: [Submit][Wait____][Map][Copy][Unmap]
Call 2:                                      [Submit][Wait____][Map][Copy][Unmap]
Call 3:                                                                          [Submit][Wait____][Map][Copy][Unmap]

Total: ~45ms (15ms * 3)

Problem: GPU sits idle during packet submission and surface copying.


Issue 2: Packet-Surface Mapping Mismatch

Scenario: Multi-threaded DecodeToSurface calls

// Thread A
DecodeToSurface(packet1, surface1, frame1);  // I-frame, 15ms

// Thread B (simultaneous)
DecodeToSurface(packet2, surface2, frame2);  // P-frame, 3ms

// GPU completes out-of-order
packet2 completes @ t=3ms   picture_index=5
packet1 completes @ t=15ms  picture_index=4

// Current queue-based retrieval
Thread A: pop()  picture_index=5  (gets packet2 result!)
Thread B: pop()  picture_index=4  (gets packet1 result!)

// Result: surface1 contains packet2, surface2 contains packet1

Root Cause:

  • FIFO queue doesn't track which packet belongs to which surface
  • Decoding completion order ≠ submission order

Issue 3: Multi-Frame Packet Handling ⚠️ Critical Discovery

Scenario: Single WebM packet contains multiple AV1 frames

// DecodeToSurface called ONCE
DecodeToSurface(packet_175bytes, surface1, frame1);

// NVDEC parser extracts MULTIPLE frames from single packet:
 HandlePictureDecode(CurrPicIdx=0, IntraPicFlag=1)  // I-frame
 HandlePictureDecode(CurrPicIdx=1, IntraPicFlag=0)  // P-frame
 HandlePictureDecode(CurrPicIdx=2, IntraPicFlag=0)  // P-frame
 ... (up to 8 frames in one packet)

// Problem: Which picture_index should be returned?
// Current design assumes: 1 packet = 1 frame ❌

Impact:

  • Slot allocation assumes 1 packet → 1 slot → 1 picture_index
  • Reality: 1 packet → 1 slot → N picture_indices
  • Must track multiple picture_indices per slot
  • Must decide which frame to return (first? last? all?)

Issue 4: ulNumOutputSurfaces Underutilization

NVDEC Configuration:

m_createInfo.ulNumOutputSurfaces = 8;  // 8 output buffers allocated

Purpose of 8 Surfaces:

  1. Reference frame storage (AV1 needs up to 8 reference frames)
  2. Display reordering buffer (B-frames, decode order ≠ display order)
  3. Pipeline parallelism (overlap decode/map/copy operations)

Current Usage: Only 1-2 surfaces actually utilized → Wasted GPU memory (~24MB for 1920x1080)


🎯 Design Goals

Primary Goals

  1. Correct Packet-Surface Mapping

    • Guarantee: DecodeToSurface(packetN, surfaceN) → surfaceN contains packetN result
    • No out-of-order completion issues
  2. Sequential Return Order

    • Even if packet2 completes before packet1, return surface1 first
    • Maintain FIFO semantics for API users
  3. Parallel Decoding Throughput

    • Allow 2-3 packets to decode simultaneously on GPU
    • Reduce total decode latency by ~30-40%
  4. 🛡️ Simplicity & Maintainability

    • Avoid complex ID mapping systems
    • Use fixed-size RingBuffer (matches NVDEC hardware design)
    • Single polling thread (not per-decode thread)

Non-Goals

  • Full 8-frame parallel decoding (limited by reference frame dependencies)
  • Out-of-order surface return (would break video playback semantics)
  • Dynamic buffer sizing (fixed 8 slots matches NVDEC hardware)

🏗️ Architecture Overview

High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                    DecodeToSurface() Call                        │
│  (Thread 1, 2, 3, ... can call concurrently)                    │
└────────────┬────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│                        RingBuffer (8 slots)                      │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐            │
│  │ [0] │ [1] │ [2] │ [3] │ [4] │ [5] │ [6] │ [7] │            │
│  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘            │
│     ▲                                             ▲              │
│     │                                             │              │
│  m_returnIdCounter % 8                      m_slotIdCounter % 8  │
│  (consumer)                                 (producer)           │
└────────────┬────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Polling Thread                              │
│  while (polling_running) {                                       │
│    slot = ringBuffer[m_returnIdCounter % 8]                     │
│    if (all slot.picture_indices report Success) {               │
│      slot.is_ready = true                                       │
│      slot.frame_ready.notify_one()                              │
│    }                                                             │
│    sleep(100us)                                                 │
│  }                                                               │
└─────────────────────────────────────────────────────────────────┘

Data Structures

DecodeSlot Structure

struct DecodeSlot {
    // Slot state
    bool in_use;                          // Is this slot allocated?

    // Submitted information (from DecodeToSurface args)
    void* target_surface;                 // Destination D3D12 resource
    VavCoreSurfaceType surface_type;      // Surface type

    // NVDEC information (from HandlePictureDecode callback)
    // ⚠️ Multi-frame support: One packet can decode to multiple frames
    std::vector<int> picture_indices;     // All NVDEC frame indices from this packet

    // Synchronization primitives
    std::condition_variable frame_ready;  // Signaled when ALL frames are decoded
    std::mutex slot_mutex;                // Protects this slot's state
    bool is_ready;                        // All frames decoded flag
};

RingBuffer Members

class NVDECAV1Decoder {
private:
    static constexpr size_t RING_BUFFER_SIZE = 8;  // Match ulNumOutputSurfaces

    DecodeSlot m_ringBuffer[RING_BUFFER_SIZE];

    // 🎯 Option C: Unified slot allocation counter (no mapping needed!)
    std::atomic<uint64_t> m_slotIdCounter{0};   // Monotonically increasing slot ID
    std::atomic<uint64_t> m_returnIdCounter{0}; // Return order enforcement (FIFO)

    // Polling thread
    std::thread m_pollingThread;
    std::atomic<bool> m_pollingRunning{false};
};

🔧 Detailed Design

Component 1: Slot Allocation (Producer)

Purpose: Assign RingBuffer slot to each DecodeToSurface call using Option C design

// In DecodeToSurface()

// 1. 🎯 Option C: Allocate unique ID (serves as both slot ID and submission order)
uint64_t my_id = m_slotIdCounter.fetch_add(1);
size_t slot_idx = my_id % RING_BUFFER_SIZE;
DecodeSlot& my_slot = m_ringBuffer[slot_idx];

// 2. Check for overflow
{
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

    if (my_slot.in_use) {
        // RingBuffer full! Too many concurrent decodes
        LogError("RingBuffer overflow - max 8 concurrent decodes");
        return false;
    }

    // 3. Initialize slot
    my_slot.in_use = true;
    my_slot.target_surface = target_surface;
    my_slot.surface_type = target_type;
    my_slot.picture_indices.clear();  // Multi-frame support: clear previous frames
    my_slot.is_ready = false;
}

🎯 Option C: No Mapping Needed - Direct Calculation:

Thread 1: m_slotIdCounter.fetch_add(1) → ID=0, slot_idx = 0 % 8 = slot[0]
Thread 2: m_slotIdCounter.fetch_add(1) → ID=1, slot_idx = 1 % 8 = slot[1]
Thread 3: m_slotIdCounter.fetch_add(1) → ID=2, slot_idx = 2 % 8 = slot[2]
...
Thread 9: m_slotIdCounter.fetch_add(1) → ID=8, slot_idx = 8 % 8 = slot[0] (wrap around)

Key Advantage: ID → slot_idx calculation is deterministic, no map storage needed!
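
Because the mapping is a pure function of the ID, it can even be verified at compile time. A minimal illustration (SlotIndex is a hypothetical free-standing helper, not part of the decoder class):

#include <cstddef>
#include <cstdint>

// Sketch: the ID → slot mapping is deterministic, so it is checkable statically.
constexpr std::size_t RING_BUFFER_SIZE = 8;
constexpr std::size_t SlotIndex(std::uint64_t slot_id) { return slot_id % RING_BUFFER_SIZE; }

static_assert(SlotIndex(0) == 0 && SlotIndex(8) == 0, "IDs 0 and 8 wrap to slot 0");
static_assert(SlotIndex(17) == 1, "ID 17 lands in slot 1 (used in the flow below)");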

Overflow Protection: If slot[0].in_use == true when Thread 9 arrives → error


Component 2: Packet Submission

Purpose: Submit packet to NVDEC with slot ID tracking via timestamp

// 4. Submit packet to NVDEC parser
CUVIDSOURCEDATAPACKET packet = {};
packet.payload = packet_data;
packet.payload_size = packet_size;
packet.flags = CUVID_PKT_ENDOFPICTURE | CUVID_PKT_TIMESTAMP;  // TIMESTAMP flag so the parser propagates it
packet.timestamp = static_cast<int64_t>(my_id);  // 🎯 Pass full slot_id (NOT modulo!)

CUresult result = cuvidParseVideoData(m_parser, &packet);
if (result != CUDA_SUCCESS) {
    my_slot.in_use = false;  // Release slot on error
    return false;
}

🎯 Option C Timestamp Flow (multi-frame packet support):

cuvidParseVideoData(packet, timestamp=my_id=17)
    ↓
HandleVideoSequence() (first time only)
    ↓
HandlePictureDecode(timestamp=17, CurrPicIdx=0) → slot_idx = 17 % 8 = 1
  → m_ringBuffer[1].picture_indices.push_back(0)
    ↓
HandlePictureDecode(timestamp=17, CurrPicIdx=1) → slot_idx = 17 % 8 = 1
  → m_ringBuffer[1].picture_indices.push_back(1)  // Same packet, multiple frames!
    ↓
HandlePictureDecode(timestamp=17, CurrPicIdx=2) → slot_idx = 17 % 8 = 1
  → m_ringBuffer[1].picture_indices.push_back(2)
    ↓
PollingThread checks ALL picture_indices for slot[1]
    ↓
When all complete: slot[1].is_ready = true, notify thread

Key Point: Timestamp carries full slot_id, HandlePictureDecode calculates slot_idx directly


Component 3: Polling Thread (Background Status Checker)

Purpose: Continuously poll the slot at m_returnIdCounter % RING_BUFFER_SIZE for decode completion (multi-frame support)

void NVDECAV1Decoder::PollingThreadFunc() {
    while (m_pollingRunning) {
        // 1. Get current return ID and calculate slot index
        uint64_t current_return_id = m_returnIdCounter.load();
        size_t slot_idx = current_return_id % RING_BUFFER_SIZE;
        DecodeSlot& slot = m_ringBuffer[slot_idx];

        // 2. Check if slot is in use and not yet ready
        //    (opportunistic unlocked read; a stale value simply defers this
        //     slot to the next 100us iteration)
        if (slot.in_use && !slot.is_ready) {

            // 3. Get copy of picture indices (multi-frame support)
            std::vector<int> picture_indices_copy;
            {
                std::lock_guard<std::mutex> lock(slot.slot_mutex);
                picture_indices_copy = slot.picture_indices;
            }

            // 4. Check if ALL frames are decoded
            bool all_complete = true;
            for (int pic_idx : picture_indices_copy) {
                CUVIDGETDECODESTATUS decodeStatus = {};
                CUresult result = cuvidGetDecodeStatus(m_decoder, pic_idx, &decodeStatus);

                if (result != CUDA_SUCCESS ||
                    decodeStatus.decodeStatus != cuvidDecodeStatus_Success) {
                    all_complete = false;
                    break;
                }
            }

            // 5. If all frames complete, signal ready
            if (all_complete && !picture_indices_copy.empty()) {
                {
                    std::lock_guard<std::mutex> lock(slot.slot_mutex);
                    slot.is_ready = true;
                }
                slot.frame_ready.notify_one();
            }
        }

        // 6. Sleep to avoid busy-wait
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
}

Key Points:

  • Only polls the slot at m_returnIdCounter % RING_BUFFER_SIZE (not all 8 slots) → efficient
  • Multi-frame support: Checks ALL picture_indices for completion
  • Uses cuvidGetDecodeStatus() non-blocking query
  • 100us sleep → ~10,000 checks/second (low CPU usage)
  • Thread-safe picture_indices copy to avoid lock contention

Component 4: Sequential Return Wait (FIFO Guarantee)

Purpose: Enforce FIFO order even when decodes complete out-of-order using Option C

// In DecodeToSurface() - PHASE 2

// 5. Wait for my turn (FIFO order enforcement)
while (m_returnIdCounter.load() != my_id) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

// Now I'm first in line!

// 6. Wait for decode to complete
{
    std::unique_lock<std::mutex> lock(my_slot.slot_mutex);

    if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500),
        [&my_slot]() { return my_slot.is_ready; })) {
        // Timeout - decode took too long
        LogError("Decode timeout");
        my_slot.in_use = false;
        m_returnIdCounter.fetch_add(1);  // Skip this slot to avoid deadlock
        return false;
    }
}

🎯 Option C Timeline Example (using slot ID, not slot index):

Thread 1 (ID=17, slot 1): Wait for returnIdCounter==17 ✅ (immediate)
                          Wait for is_ready...

Thread 2 (ID=18, slot 2): Wait for returnIdCounter==18 ⏸️ (blocked)

Thread 3 (ID=19, slot 3): Wait for returnIdCounter==19 ⏸️ (blocked)

GPU: ID=18 completes first @ t=3ms
     → slot[2].is_ready = true
     → Thread 2 still blocked (returnIdCounter=17)

GPU: ID=17 completes @ t=15ms
     → slot[1].is_ready = true
     → Thread 1 wakes up ✅
     → Thread 1 processes → returnIdCounter = 18
     → Thread 2 now unblocked ✅

Key Point: Wait on slot_id (my_id), not slot_idx, for correct FIFO ordering


Component 5: Frame Retrieval & Cleanup (Multi-Frame Support)

Purpose: Map decoded frame, copy to surface, release slot

// In DecodeToSurface() - PHASE 3

// 7. Get first frame from multi-frame packet
int frameIdx = my_slot.picture_indices[0];  // Return first frame only

// 8. Map decoded frame from NVDEC
CUVIDPROCPARAMS procParams = {};
procParams.progressive_frame = 1;

CUdeviceptr srcDevicePtr = 0;
unsigned int srcPitch = 0;

CUresult result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
if (result != CUDA_SUCCESS) {
    my_slot.in_use = false;
    m_returnIdCounter.fetch_add(1);
    return false;
}

// 9. Copy to D3D12 surface
ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
bool copySuccess = m_d3d12Handler->CopyNV12Frame(
    srcDevicePtr, srcPitch, d3d12Resource, m_width, m_height
);

// 10. Unmap frame
cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

// 11. Release slot
{
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
    my_slot.in_use = false;
}

// 12. Advance return ID counter (FIFO order)
m_returnIdCounter.fetch_add(1);

return copySuccess;

Multi-Frame Decision: Return only the first frame of a multi-frame packet (picture_indices[0])
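
If the dropped frames ever need to be diagnosed, they can at least be logged before the slot is released. A sketch (LogDebug is an assumed sibling of the LogError used elsewhere in this document):

// Sketch: frames 1..N-1 of a multi-frame packet are decoded but never mapped;
// the parser reuses those decode surfaces for later pictures.
for (size_t i = 1; i < my_slot.picture_indices.size(); ++i) {
    LogDebug("Multi-frame packet: dropping extra frame picture_index=%d",
             my_slot.picture_indices[i]);
}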


Component 6: NVDEC Callback Integration (Option C)

Purpose: Link NVDEC picture_index to RingBuffer slot using direct slot_id calculation

int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
    auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

    // 🎯 Option C: Direct slot_id → slot_idx calculation (no map lookup!)
    uint64_t slot_id = static_cast<uint64_t>(pic_params->nTimeStamp);
    size_t slot_idx = slot_id % RING_BUFFER_SIZE;

    DecodeSlot& slot = decoder->m_ringBuffer[slot_idx];

    // Submit frame to NVDEC decoder
    CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
    if (result != CUDA_SUCCESS) {
        decoder->LogCUDAError(result, "cuvidDecodePicture failed");
        return 0;
    }

    // Store picture_index for polling (multi-frame support)
    {
        std::lock_guard<std::mutex> lock(slot.slot_mutex);
        slot.picture_indices.push_back(pic_params->CurrPicIdx);
    }

    // Polling thread will check cuvidGetDecodeStatus() for ALL picture_indices

    return 1;
}

🎯 Option C Key Advantages:

  • No mapping overhead: Direct modulo calculation, no unordered_map lookup
  • No mutex contention: No global map mutex needed
  • Multi-frame support: Automatically handles multiple frames per packet
  • Deterministic: Same slot_id always maps to same slot_idx
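
For context, a sketch of how the parser can be created so that these callbacks fire (assumed to run inside NVDECAV1Decoder::Initialize(); HandleVideoSequence and HandlePictureDisplay are the existing sibling callbacks referenced elsewhere in this document):

// Sketch: CUVID parser creation wiring the Option C callbacks.
CUVIDPARSERPARAMS parserParams = {};
parserParams.CodecType              = cudaVideoCodec_AV1;
parserParams.ulMaxNumDecodeSurfaces = RING_BUFFER_SIZE;      // match ulNumOutputSurfaces = 8
parserParams.ulMaxDisplayDelay      = 0;                     // low latency: no display queueing
parserParams.pUserData              = this;                  // delivered back as user_data
parserParams.pfnSequenceCallback    = HandleVideoSequence;   // first packet only
parserParams.pfnDecodePicture       = HandlePictureDecode;   // once per frame (N per packet)
parserParams.pfnDisplayPicture      = HandlePictureDisplay;  // display-order notification

CUresult result = cuvidCreateVideoParser(&m_parser, &parserParams);
if (result != CUDA_SUCCESS) {
    LogCUDAError(result, "cuvidCreateVideoParser failed");
}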

📐 Implementation Plan (Updated for Option C)

Phase 1: Data Structure Setup

Files to Modify:

  • NVDECAV1Decoder.h - Add RingBuffer members with Option C design
  • NVDECAV1Decoder.cpp - Initialize RingBuffer in constructor

Tasks:

  • Define DecodeSlot structure with std::vector<int> picture_indices
  • Add m_ringBuffer[8] array
  • 🎯 Option C: Add m_slotIdCounter, m_returnIdCounter atomic counters (NOT submitIndex/returnIndex)
  • Add m_pollingThread, m_pollingRunning members

Estimated Time: 30 minutes


Phase 2: Polling Thread Implementation (Multi-Frame Support)

Files to Modify:

  • NVDECAV1Decoder.cpp - Implement PollingThreadFunc()

Tasks:

  • Implement polling loop with cuvidGetDecodeStatus() checking ALL picture_indices
  • Poll slot at m_returnIdCounter % RING_BUFFER_SIZE (NOT all slots)
  • Add thread start in Initialize()
  • Add thread stop in Cleanup() (see the lifecycle sketch after this list)
  • Add debug logging for multi-frame packet events
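
A minimal sketch of that start/stop wiring, assuming StartPollingThread()/StopPollingThread() are new private helpers called from Initialize() and Cleanup() respectively:

// Sketch (hypothetical helpers): polling thread lifecycle.
void NVDECAV1Decoder::StartPollingThread() {
    m_pollingRunning = true;
    m_pollingThread = std::thread(&NVDECAV1Decoder::PollingThreadFunc, this);
}

void NVDECAV1Decoder::StopPollingThread() {
    m_pollingRunning = false;       // PollingThreadFunc() observes this and exits
    if (m_pollingThread.joinable()) {
        m_pollingThread.join();     // wait for the in-flight iteration to finish
    }
}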

Testing:

  • Verify thread starts/stops correctly
  • Verify multi-frame packet handling (all frames checked)
  • Verify cuvidGetDecodeStatus() calls work

Estimated Time: 1 hour


Phase 3: DecodeToSurface Refactoring (Option C Implementation)

Files to Modify:

  • NVDECAV1Decoder.cpp - Rewrite DecodeToSurface() with Option C design

Tasks:

  • 🎯 Phase 1: Slot allocation using my_id = m_slotIdCounter.fetch_add(1)
  • 🎯 Phase 2: Packet submission with packet.timestamp = my_id (full slot_id, NOT modulo!)
  • 🎯 Phase 3: FIFO wait using while(m_returnIdCounter != my_id) (NOT slot_idx!)
  • 🎯 Phase 4: Frame retrieval from picture_indices[0] (first frame only)
  • 🎯 Phase 5: Cleanup and m_returnIdCounter.fetch_add(1)
  • Error handling for all failure paths

Testing:

  • Single-threaded decode test
  • Multi-threaded decode test (2-3 threads)
  • Verify packet-surface mapping correctness with multi-frame packets
  • Verify FIFO ordering with out-of-order completion

Estimated Time: 2 hours


Phase 4: HandlePictureDecode Update (Option C)

Files to Modify:

  • NVDECAV1Decoder.cpp - Modify HandlePictureDecode() callback

Tasks:

  • 🎯 Option C: Extract slot_id from pic_params->nTimeStamp
  • 🎯 Option C: Calculate slot_idx = slot_id % RING_BUFFER_SIZE (NO map lookup!)
  • 🎯 Multi-frame: Use slot.picture_indices.push_back(CurrPicIdx) (NOT single index!)
  • Add debug logging for multi-frame packets

Testing:

  • Verify timestamp → slot_idx direct calculation works
  • Verify picture_indices vector correctly stores multiple frames
  • Test with video that has multi-frame packets (test_720p_stripe.webm)

Estimated Time: 30 minutes


Phase 5: Integration Testing (Option C Validation)

Test Scenarios:

  1. Single packet decode - Verify Option C basic functionality
  2. Multi-frame packet - Verify vector-based picture_indices handling (test_720p_stripe.webm)
  3. Sequential 3 packets - Verify FIFO order using m_returnIdCounter
  4. Out-of-order completion - Verify slot_id → slot_idx mapping (I-frame after P-frame)
  5. RingBuffer overflow - Verify error handling (9+ concurrent calls)
  6. Decode errors - Verify graceful failure with multi-frame packets
  7. Performance benchmark - Measure latency reduction vs old queue-based approach

Test Files:

  • Simple test video (simple_test.webm) - basic validation
  • Multi-frame packet video (test_720p_stripe.webm) - critical multi-frame test ⚠️

Validation Criteria:

  • No slot ID → slot_idx mapping errors
  • All frames from multi-frame packets detected and polled
  • FIFO order maintained even with out-of-order GPU completion
  • No memory corruption or race conditions
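
A rough sketch of scenarios 3 and 4 (decoder, packets, surfaces, and frames are hypothetical test fixtures; DecodeToSurface follows the signature used throughout this document):

#include <array>
#include <thread>

// Sketch: three threads submit concurrently. Option C guarantees each surface
// receives its own packet's frame, and calls return in slot-allocation order
// even if the GPU finishes a later packet first.
std::array<std::thread, 3> workers;
for (int i = 0; i < 3; ++i) {
    workers[i] = std::thread([&, i] {
        bool ok = decoder.DecodeToSurface(packets[i], surfaces[i], frames[i]);
        if (!ok) {
            LogError("DecodeToSurface %d failed", i);
        }
    });
}
for (auto& t : workers) {
    t.join();
}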

Estimated Time: 2 hours


Phase 6: Documentation & Cleanup

Tasks:

  • Update NVDEC design documentation with Option C and multi-frame support
  • Add inline code comments explaining Option C design choices
  • Remove old queue-based code and any ID→Index mapping attempts
  • Move design doc to docs/completed/ after successful implementation
  • Document multi-frame packet behavior and first-frame-only decision

Estimated Time: 1 hour


🎯 Option C Design Summary

Core Principle: Eliminate mapping overhead through deterministic calculation

Key Components:

  1. Single Counter for Dual Purpose: m_slotIdCounter serves as both unique ID and submission order
  2. Direct Slot Calculation: slot_idx = slot_id % RING_BUFFER_SIZE (no map needed)
  3. FIFO via ID Comparison: while(m_returnIdCounter != my_id) ensures ordering
  4. Multi-Frame Vector: std::vector<int> picture_indices handles packets with multiple frames

Data Flow:

DecodeToSurface:
  my_id = m_slotIdCounter++ (e.g., 17)
  slot_idx = 17 % 8 = 1
  packet.timestamp = 17

HandlePictureDecode:
  slot_id = pic_params->nTimeStamp (17)
  slot_idx = 17 % 8 = 1
  m_ringBuffer[1].picture_indices.push_back(CurrPicIdx)

PollingThread:
  return_id = m_returnIdCounter (17)
  slot_idx = 17 % 8 = 1
  Check all m_ringBuffer[1].picture_indices[]

DecodeToSurface (wait):
  while(m_returnIdCounter != 17) { wait }
  Process frame from picture_indices[0]
  m_returnIdCounter++ (18)

Eliminated Complexity:

  • No std::unordered_map<uint64_t, size_t> mapping
  • No global map mutex
  • No map insert/erase operations
  • No lookup failures or stale entries

Multi-Frame Packet Handling:

  • One packet → multiple HandlePictureDecode calls
  • All frames stored in picture_indices vector
  • PollingThread checks ALL frames complete
  • Return first frame only (picture_indices[0])

📊 Performance Analysis

Expected Improvements

Scenario 1: Sequential 3-Frame Decode

Current (Sequential):

Frame 1: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Frame 2: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Frame 3: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Total: 45ms

With RingBuffer (Parallel):

Submit all 3:     [S1][S2][S3] = 3ms
GPU decode:       [─── 10ms ───] (parallel)
Sequential return:
  Frame 1: [Wait][Map][Copy][Unmap] = 5ms
  Frame 2: [Map][Copy][Unmap] = 4ms (already decoded)
  Frame 3: [Map][Copy][Unmap] = 4ms (already decoded)
Total: 3ms + 10ms + 13ms = 26ms

Improvement: 45ms → 26ms (42% faster)

Scenario 2: Complex I/P Frame Mix

Video: I-frame (15ms), P-frame (3ms), P-frame (3ms)

Current:

I: 15ms
P: 3ms
P: 3ms
Total: 21ms

With RingBuffer:

Submit: 3ms
GPU:    [I:15ms, P:3ms, P:3ms overlap] = 15ms (limited by I-frame)
Return: 9ms (sequential)
Total: 27ms (slower due to overhead)

But if I-frame completes first: 21ms (same)

Note: Benefit depends on GOP structure. Simple videos may see less improvement.


CPU & GPU Utilization

Current:

CPU: [Active──────][Idle────────][Active──────][Idle────────]
GPU:        [Decode──────]               [Decode──────]

With RingBuffer:

CPU: [Active][Active][Active][─────Wait─────][Active][Active]
GPU:        [Decode1][Decode2][Decode3]

GPU Utilization: ⬆️ Increased by ~30%
CPU Utilization: ➡️ Similar (waiting in different places)


🛡️ Risk Assessment

Risk 1: RingBuffer Overflow

Scenario: Application calls DecodeToSurface() 9+ times before first frame completes

Mitigation:

if (my_slot.in_use) {
    LogError("RingBuffer overflow");
    return false;  // Fail fast
}

Alternative: Block until slot available

// NOTE: poll without holding slot_mutex so the owning thread can release the slot
while (my_slot.in_use) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

Recommendation: Fail fast + document max concurrent limit (8 frames)


Risk 2: Decode Timeout Deadlock

Scenario: GPU hangs, polling thread never sets is_ready

Mitigation:

if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500), ...)) {
    // Timeout - skip this slot
    my_slot.in_use = false;
    m_returnIdCounter.fetch_add(1);  // Advance to prevent deadlock
    return false;
}

Risk: Skipping slot may cause next slot to fail (reference frame missing)

Recommendation: Return error, let application decide (retry/abort)


Risk 3: Polling Thread CPU Usage

100us sleep → ~10,000 checks/second

CPU Impact:

Per check: ~1us (cuvidGetDecodeStatus overhead)
Total: 10,000 * 1us = 10ms/sec = 1% CPU

Mitigation: Acceptable overhead for real-time video decoding

Alternative: Adaptive polling (faster when frames pending, slower when idle)
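
One possible shape for that alternative (a sketch only; the 100us/1ms bounds are illustrative, not measured):

// Sketch: adaptive polling — poll fast while the current return slot has a
// decode in flight, back off when idle.
while (m_pollingRunning) {
    size_t slot_idx = m_returnIdCounter.load() % RING_BUFFER_SIZE;
    bool pending = m_ringBuffer[slot_idx].in_use;

    // ... existing cuvidGetDecodeStatus() checks from PollingThreadFunc() ...

    std::this_thread::sleep_for(pending
        ? std::chrono::microseconds(100)   // frame in flight: check often
        : std::chrono::milliseconds(1));   // idle: ~10x fewer wakeups
}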


Risk 4: Thread Safety - CUDA Context

Issue: cuvidGetDecodeStatus() requires correct CUDA context

Current Protection:

std::lock_guard<std::mutex> contextLock(m_cudaContextMutex);
cuCtxSetCurrent(m_cuContext);

In Polling Thread:

void PollingThreadFunc() {
    // Set CUDA context once at thread start
    cuCtxSetCurrent(m_cuContext);

    while (m_pollingRunning) {
        // cuvidGetDecodeStatus() now safe
        cuvidGetDecodeStatus(...);
    }
}

Recommendation: Set context once per thread (not per call)


📚 References

NVIDIA Documentation

  1. NVDEC Video Decoder API Programming Guide

  2. Video Codec SDK 12.2

  3. CUDA Programming Guide

Previous Design Documents

  1. NVDEC Deferred Decoder Recreation Design (Abandoned)

    • Path: docs/completed/windows/nvdec/NVDEC_Deferred_Decoder_Recreation_Design.md
    • Reason for abandonment: Incompatible with NVDEC synchronous callback architecture
    • Lessons learned: NVDEC parser requires immediate decoder availability
  2. VavCore Library Design

    • Path: VavCore_Library_Design.md
    • Relevant sections: C API design, cross-platform surface types

Code References

  1. Current DecodeToSurface Implementation

    • File: platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp
    • Lines: 976-1200
    • Current approach: Synchronous wait with condition variable
  2. HandlePictureDisplay Callback

    • File: platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp
    • Lines: 791-813
    • Current approach: Push picture_index to queue, notify condition variable
  3. NVDEC Initialization

    • File: platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp
    • Lines: 600-650
    • Key setting: m_createInfo.ulNumOutputSurfaces = 8

🔄 Change Log

Date        Author  Change Description
2025-10-04  Claude  Initial design document created
2025-10-04  Claude  Added RingBuffer architecture details
2025-10-04  Claude  Added polling thread design
2025-10-04  Claude  Added performance analysis

Approval & Sign-off

Design Review:

  • Architecture approved
  • Performance estimates validated
  • Risk mitigation strategies accepted

Implementation Ready:

  • All design questions resolved
  • Implementation plan reviewed
  • Test plan defined

Next Steps: Proceed to Phase 1 - Data Structure Setup