# NVDEC RingBuffer-based Asynchronous Decoding Design

**Date**: 2025-10-04
**Status**: 🔄 **In Progress**
**Target**: NVDECAV1Decoder DecodeToSurface() optimization

---

## 📋 Table of Contents

1. [Background & Problem Statement](#background--problem-statement)
2. [Design Goals](#design-goals)
3. [Architecture Overview](#architecture-overview)
4. [Detailed Design](#detailed-design)
5. [Implementation Plan](#implementation-plan)
6. [Performance Analysis](#performance-analysis)
7. [Risk Assessment](#risk-assessment)
8. [References](#references)

---

## 🎯 Background & Problem Statement

### Current Implementation Issues

#### Issue 1: Sequential Decoding Bottleneck

**Current Code** (`NVDECAV1Decoder::DecodeToSurface`):
```cpp
bool DecodeToSurface(packet_data, target_surface, output_frame) {
    cuvidParseVideoData(packet);           // Submit packet
    wait_for_frame();                      // Block until decode complete
    cuvidMapVideoFrame();                  // Map decoded frame
    CopyNV12Frame(srcPtr, target_surface); // Copy to surface
    cuvidUnmapVideoFrame();                // Unmap frame
    return true;
}
```

**Timeline**:
```
Call 1: [Submit][Wait____][Map][Copy][Unmap]
Call 2:                                      [Submit][Wait____][Map][Copy][Unmap]
Call 3:                                                                          [Submit][Wait____][Map][Copy][Unmap]

Total: ~45ms (15ms * 3)
```

**Problem**: GPU sits idle during packet submission and surface copying.

---

#### Issue 2: Packet-Surface Mapping Mismatch

**Scenario**: Multi-threaded DecodeToSurface calls

```cpp
// Thread A
DecodeToSurface(packet1, surface1, frame1);  // I-frame, 15ms

// Thread B (simultaneous)
DecodeToSurface(packet2, surface2, frame2);  // P-frame, 3ms

// GPU completes out-of-order
packet2 completes @ t=3ms  → picture_index=5
packet1 completes @ t=15ms → picture_index=4

// Current queue-based retrieval
Thread A: pop() → picture_index=5 ❌ (gets packet2 result!)
Thread B: pop() → picture_index=4 ❌ (gets packet1 result!)

// Result: surface1 contains packet2, surface2 contains packet1
```

**Root Cause**:
- FIFO queue doesn't track which packet belongs to which surface
- Decoding completion order ≠ submission order
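
The mismatch can be reproduced without a GPU. The sketch below is a hypothetical model, not decoder code: `Completion`, `retrieve_fifo`, and `retrieve_keyed` are illustrative names. It contrasts popping results in completion order with looking them up by the packet that produced them.

```cpp
#include <cstddef>
#include <deque>
#include <unordered_map>
#include <vector>

// One finished decode, tagged with the packet it came from (illustrative type).
struct Completion {
    int packet_id;      // which packet produced this picture
    int picture_index;  // decoder surface that holds the result
};

// FIFO retrieval: the i-th caller takes the i-th completion, whatever it is.
// Returns the picture_index each caller receives.
std::vector<int> retrieve_fifo(const std::vector<Completion>& completion_order,
                               std::size_t caller_count) {
    std::deque<Completion> queue(completion_order.begin(), completion_order.end());
    std::vector<int> received;
    for (std::size_t caller = 0; caller < caller_count && !queue.empty(); ++caller) {
        received.push_back(queue.front().picture_index);
        queue.pop_front();
    }
    return received;
}

// Keyed retrieval: each caller looks up the completion for its own packet.
std::vector<int> retrieve_keyed(const std::vector<Completion>& completion_order,
                                const std::vector<int>& caller_packets) {
    std::unordered_map<int, int> picture_by_packet;
    for (const Completion& c : completion_order) {
        picture_by_packet[c.packet_id] = c.picture_index;
    }
    std::vector<int> received;
    for (int packet : caller_packets) {
        received.push_back(picture_by_packet.at(packet));
    }
    return received;
}
```

With the completion order from the scenario (packet2 with picture_index 5, then packet1 with picture_index 4), FIFO retrieval hands picture 5 to the caller that submitted packet1, while keyed retrieval hands it picture 4.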

---

#### Issue 3: ulNumOutputSurfaces Underutilization

**NVDEC Configuration**:
```cpp
m_createInfo.ulNumOutputSurfaces = 8;  // 8 output buffers allocated
```

**Purpose of 8 Surfaces**:
1. Reference frame storage (AV1 maintains up to 8 reference frame slots)
2. Display reordering buffer (decode order ≠ display order, e.g. AV1 hidden/alt-ref frames)
3. Pipeline parallelism (overlap decode/map/copy operations)

**Current Usage**: Only 1-2 surfaces are actually used, so most of the ~24MB pool (8 NV12 surfaces at 1920x1080) sits idle.
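
The ~24MB figure can be checked with a short calculation. This is a rough sketch that assumes tightly packed NV12 (1.5 bytes per pixel) and ignores any pitch or height alignment the driver may add; the function names are illustrative.

```cpp
#include <cstddef>

// Bytes for one NV12 frame: a full-resolution luma plane plus a half-height
// interleaved chroma plane, i.e. 1.5 bytes per pixel. Alignment padding is ignored.
constexpr std::size_t nv12_frame_bytes(std::size_t width, std::size_t height) {
    return width * height * 3 / 2;
}

// Bytes for a pool of identical NV12 surfaces.
constexpr std::size_t pool_bytes(std::size_t width, std::size_t height,
                                 std::size_t surfaces) {
    return nv12_frame_bytes(width, height) * surfaces;
}

static_assert(nv12_frame_bytes(1920, 1080) == 3110400, "about 3.1 MB per surface");
static_assert(pool_bytes(1920, 1080, 8) == 24883200, "about 24.9 MB for 8 surfaces");
```

Under these assumptions the full pool is about 24.9 MB, and with only two surfaces in use roughly 18.7 MB of it is idle.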

---

## 🎯 Design Goals

### Primary Goals

1. **✅ Correct Packet-Surface Mapping**
   - Guarantee: `DecodeToSurface(packetN, surfaceN)` → surfaceN contains packetN result
   - No out-of-order completion issues

2. **✅ Sequential Return Order**
   - Even if packet2 completes before packet1, return surface1 first
   - Maintain FIFO semantics for API users

3. **⚡ Parallel Decoding Throughput**
   - Allow 2-3 packets to decode simultaneously on GPU
   - Reduce total decode latency by ~30-40%

4. **🛡️ Simplicity & Maintainability**
   - Avoid complex ID mapping systems
   - Use fixed-size RingBuffer (matches NVDEC hardware design)
   - Single polling thread (not per-decode thread)

### Non-Goals

- ❌ Full 8-frame parallel decoding (limited by reference frame dependencies)
- ❌ Out-of-order surface return (would break video playback semantics)
- ❌ Dynamic buffer sizing (fixed 8 slots matches NVDEC hardware)

---

## 🏗️ Architecture Overview

### High-Level Design

```
┌─────────────────────────────────────────────────────────────────┐
│                    DecodeToSurface() Call                        │
│  (Thread 1, 2, 3, ... can call concurrently)                    │
└────────────┬────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│                        RingBuffer (8 slots)                      │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐            │
│  │ [0] │ [1] │ [2] │ [3] │ [4] │ [5] │ [6] │ [7] │            │
│  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘            │
│     ▲                                             ▲              │
│     │                                             │              │
│  m_returnIndex                              m_submitIndex        │
│  (consumer)                                 (producer)           │
└────────────┬────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Polling Thread                              │
│  while (polling_running) {                                       │
│    slot = ringBuffer[m_returnIndex % 8]                         │
│    if (cuvidGetDecodeStatus(slot.picture_index) == Success) {   │
│      slot.is_ready = true                                       │
│      slot.frame_ready.notify_one()                              │
│    }                                                             │
│    sleep(100us)                                                 │
│  }                                                               │
└─────────────────────────────────────────────────────────────────┘
```

### Data Structures

#### DecodeSlot Structure

```cpp
struct DecodeSlot {
    // Slot state
    bool in_use = false;                  // Is this slot allocated?

    // Submitted information (from DecodeToSurface args)
    void* target_surface = nullptr;       // Destination D3D12 resource
    VavCoreSurfaceType surface_type{};    // Surface type

    // NVDEC information (from HandlePictureDisplay callback)
    int picture_index = -1;               // NVDEC frame index for cuvidMapVideoFrame (-1 until assigned)

    // Synchronization primitives
    std::condition_variable frame_ready;  // Signaled when decode complete
    std::mutex slot_mutex;                // Protects this slot's state
    bool is_ready = false;                // Decode completed flag
};
```

#### RingBuffer Members

```cpp
class NVDECAV1Decoder {
private:
    static constexpr size_t RING_BUFFER_SIZE = 8;  // Match ulNumOutputSurfaces

    DecodeSlot m_ringBuffer[RING_BUFFER_SIZE];

    // Producer-consumer indices
    std::atomic<size_t> m_submitIndex{0};   // Next slot to allocate (producer)
    std::atomic<size_t> m_returnIndex{0};   // Next slot to return (consumer)

    // Polling thread
    std::thread m_pollingThread;
    std::atomic<bool> m_pollingRunning{false};
};
```

---

## 🔧 Detailed Design

### Component 1: Slot Allocation (Producer)

**Purpose**: Assign RingBuffer slot to each DecodeToSurface call

```cpp
// In DecodeToSurface()

// 1. Allocate next available slot
size_t my_slot_idx = m_submitIndex.fetch_add(1) % RING_BUFFER_SIZE;
DecodeSlot& my_slot = m_ringBuffer[my_slot_idx];

// 2. Check for overflow
{
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

    if (my_slot.in_use) {
        // RingBuffer full! Too many concurrent decodes
        LogError("RingBuffer overflow - max 8 concurrent decodes");
        return false;
    }

    // 3. Initialize slot
    my_slot.in_use = true;
    my_slot.target_surface = target_surface;
    my_slot.surface_type = target_type;
    my_slot.picture_index = -1;  // Set by HandlePictureDisplay
    my_slot.is_ready = false;
}
```

**Atomic Counter Behavior**:
```
Thread 1: m_submitIndex.fetch_add(1) → 0 % 8 = slot[0]
Thread 2: m_submitIndex.fetch_add(1) → 1 % 8 = slot[1]
Thread 3: m_submitIndex.fetch_add(1) → 2 % 8 = slot[2]
...
Thread 9: m_submitIndex.fetch_add(1) → 8 % 8 = slot[0] (wrap around)
```
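
The wrap-around shown above can be exercised in isolation. The following is a minimal sketch, not the decoder's code: `Ticket` and `take_ticket` are hypothetical names, and `kRingSize` stands in for `RING_BUFFER_SIZE`.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kRingSize = 8;  // stands in for RING_BUFFER_SIZE

// A submission ticket: the raw counter value plus the slot it maps to.
struct Ticket {
    std::size_t sequence;  // monotonically increasing submission number
    std::size_t slot;      // sequence % kRingSize
};

// Take the next ticket. fetch_add returns the value before the increment,
// so concurrent callers always receive distinct sequence numbers.
inline Ticket take_ticket(std::atomic<std::size_t>& submit_index) {
    const std::size_t sequence = submit_index.fetch_add(1);
    return Ticket{sequence, sequence % kRingSize};
}
```

Keeping the raw sequence number next to the slot index is a design choice of this sketch: tickets 0 and 8 share slot 0, and only the sequence number tells them apart.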

**Overflow Protection**: If `slot[0].in_use == true` when Thread 9 arrives → error

---

### Component 2: Packet Submission

**Purpose**: Submit packet to NVDEC with slot index tracking

```cpp
// 4. Submit packet to NVDEC parser
CUVIDSOURCEDATAPACKET packet = {};
packet.payload = packet_data;
packet.payload_size = packet_size;
packet.flags = CUVID_PKT_ENDOFPICTURE | CUVID_PKT_TIMESTAMP;  // TIMESTAMP flag marks the timestamp field as valid
packet.timestamp = my_slot_idx;  // ✅ Embed slot index in timestamp

CUresult result = cuvidParseVideoData(m_parser, &packet);
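if (result != CUDA_SUCCESS) {
    // Release slot on error (under the slot lock, like every other state change)
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
    my_slot.in_use = false;
    // NOTE: m_returnIndex must still be advanced past this slot in its turn,
    // otherwise later callers wait forever in the sequential-return loop.
    return false;
}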
```

**Timestamp Flow**:
```
cuvidParseVideoData(packet, timestamp=2)
    ↓
HandleVideoSequence() (first time only)
    ↓
HandlePictureDecode(CurrPicIdx=5)   // CUVIDPICPARAMS has no timestamp field
    ↓
GPU decodes packet...
    ↓
HandlePictureDisplay(timestamp=2, picture_index=5)
    ↓
m_ringBuffer[2].picture_index = 5  // ✅ Slot 2 now linked to picture_index 5
```

---

### Component 3: Polling Thread (Background Status Checker)

**Purpose**: Continuously poll `m_returnIndex` slot for decode completion

```cpp
void NVDECAV1Decoder::PollingThreadFunc() {
    while (m_pollingRunning) {
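        // 1. Get current return slot (oldest pending decode)
        // m_returnIndex grows without bound, so wrap it before indexing
        size_t current_return_idx = m_returnIndex.load() % RING_BUFFER_SIZE;
        DecodeSlot& slot = m_ringBuffer[current_return_idx];

        // 2. Check if slot is in use and not yet ready (read under the slot lock)
        bool pending = false;
        int picture_index = -1;
        {
            std::lock_guard<std::mutex> lock(slot.slot_mutex);
            pending = slot.in_use && !slot.is_ready && slot.picture_index >= 0;
            picture_index = slot.picture_index;
        }
        if (pending) {

            // 3. Query NVDEC for decode status
            CUVIDGETDECODESTATUS decodeStatus = {};
            CUresult result = cuvidGetDecodeStatus(m_decoder, picture_index, &decodeStatus);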

            if (result == CUDA_SUCCESS) {
                if (decodeStatus.decodeStatus == cuvidDecodeStatus_Success) {
                    // ✅ Decode complete!
                    {
                        std::lock_guard<std::mutex> lock(slot.slot_mutex);
                        slot.is_ready = true;
                    }

                    // Wake up waiting DecodeToSurface thread
                    slot.frame_ready.notify_one();

                    OutputDebugStringA("[Polling] Slot ready\n");
                }
                else if (decodeStatus.decodeStatus == cuvidDecodeStatus_Error) {
                    // Decode error - mark as ready to unblock
                    {
                        std::lock_guard<std::mutex> lock(slot.slot_mutex);
                        slot.is_ready = true;  // Error also counts as "ready"
                    }
                    slot.frame_ready.notify_one();

                    OutputDebugStringA("[Polling] Decode error\n");
                }
                // cuvidDecodeStatus_InProgress → keep polling
            }
        }

        // 4. Sleep to avoid busy-wait
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
}
```

**Key Points**:
- ✅ Only polls `m_returnIndex` slot (not all 8 slots) → efficient
- ✅ Uses `cuvidGetDecodeStatus()` non-blocking query
- ✅ 100us sleep → at most ~10,000 checks/second (low CPU usage; the real rate depends on OS timer granularity)
- ✅ Handles decode errors gracefully
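
A single polling step can be separated from the thread loop so it is testable without NVDEC. The sketch below is hypothetical: `DecodeState`, `PolledSlot`, and `poll_once` are illustrative names, the status query is injected as a callback instead of calling `cuvidGetDecodeStatus()`, and the `failed` flag is an assumption added so a waiter could tell an error from a success.

```cpp
#include <functional>

// Illustrative stand-in for the decoder's three-way status result.
enum class DecodeState { InProgress, Success, Error };

// Reduced slot state for the sketch (no mutex: single-threaded model).
struct PolledSlot {
    bool in_use = false;
    bool is_ready = false;
    bool failed = false;     // assumption: not part of the original DecodeSlot
    int picture_index = -1;  // -1 until the display callback assigns it
};

// Runs one polling step. Returns true if this call made the slot ready.
inline bool poll_once(PolledSlot& slot,
                      const std::function<DecodeState(int)>& query_status) {
    // Nothing to do for free slots, finished slots, or slots without a picture yet.
    if (!slot.in_use || slot.is_ready || slot.picture_index < 0) {
        return false;
    }
    switch (query_status(slot.picture_index)) {
        case DecodeState::Success:
            slot.is_ready = true;
            return true;
        case DecodeState::Error:
            slot.failed = true;    // an error also unblocks the waiter
            slot.is_ready = true;
            return true;
        case DecodeState::InProgress:
            break;                 // keep polling on the next step
    }
    return false;
}
```

In a real thread loop this step would run under the slot lock and be followed by the sleep; injecting the status query keeps the decision logic independent of the GPU.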

---

### Component 4: Sequential Return Wait

**Purpose**: Enforce FIFO order even when decodes complete out-of-order

```cpp
// In DecodeToSurface() - PHASE 2

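// 5. Wait for my turn (sequential return order)
// m_returnIndex grows without bound, so reduce it modulo the ring size before comparing
while (m_returnIndex.load() % RING_BUFFER_SIZE != my_slot_idx) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}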

// Now I'm first in line!

// 6. Wait for decode to complete
{
    std::unique_lock<std::mutex> lock(my_slot.slot_mutex);

    if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500),
        [&my_slot]() { return my_slot.is_ready; })) {
        // Timeout - decode took too long
        LogError("Decode timeout");
        my_slot.in_use = false;
        m_returnIndex.fetch_add(1);  // Skip this slot to avoid deadlock
        return false;
    }
}
```

**Timeline Example**:
```
Thread 1 (slot 0): Wait for returnIndex==0 ✅ (immediate)
                   Wait for is_ready...

Thread 2 (slot 1): Wait for returnIndex==1 ⏸️ (blocked)

Thread 3 (slot 2): Wait for returnIndex==2 ⏸️ (blocked)

GPU: packet2 completes first @ t=3ms
     → polling thread is still watching slot[0], so slot[1].is_ready stays false
     → Thread 2 still blocked (returnIndex=0)

GPU: packet1 completes @ t=15ms
     → slot[0].is_ready = true
     → Thread 1 wakes up ✅
     → Thread 1 processes → returnIndex = 1
     → polling thread sees slot[1] already decoded → slot[1].is_ready = true
     → Thread 2 now unblocked ✅
```
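
The ordering rule in the timeline can be modelled without threads. The sketch below is a simplified, hypothetical model: `ReturnSequencer` is an illustrative name, it assumes at most `kSlots` tickets are outstanding at once, and it records completions for any slot instead of polling only the head slot.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kSlots = 8;  // stands in for RING_BUFFER_SIZE

// Releases tickets strictly in submission order, even when completions
// arrive out of order. Assumes no more than kSlots tickets are outstanding.
class ReturnSequencer {
public:
    // Record that the decode for `sequence` has finished (any order).
    void mark_ready(std::size_t sequence) { ready_[sequence % kSlots] = true; }

    // Release every ticket that is ready and next in line, oldest first.
    std::vector<std::size_t> release_in_order() {
        std::vector<std::size_t> released;
        while (ready_[next_ % kSlots]) {
            ready_[next_ % kSlots] = false;  // slot becomes reusable
            released.push_back(next_);
            ++next_;
        }
        return released;
    }

private:
    std::array<bool, kSlots> ready_{};  // all false initially
    std::size_t next_ = 0;              // plays the role of m_returnIndex
};
```

If ticket 1 finishes first, `release_in_order()` returns nothing until ticket 0 is marked ready, at which point both are released in order 0, 1.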

---

### Component 5: Frame Retrieval & Cleanup

**Purpose**: Map decoded frame, copy to surface, release slot

```cpp
// In DecodeToSurface() - PHASE 3

int frameIdx = my_slot.picture_index;

// 7. Map decoded frame from NVDEC
CUVIDPROCPARAMS procParams = {};
procParams.progressive_frame = 1;

CUdeviceptr srcDevicePtr = 0;
unsigned int srcPitch = 0;

CUresult result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
if (result != CUDA_SUCCESS) {
    my_slot.in_use = false;
    m_returnIndex.fetch_add(1);
    return false;
}

// 8. Copy to D3D12 surface
ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
bool copySuccess = m_d3d12Handler->CopyNV12Frame(
    srcDevicePtr, srcPitch, d3d12Resource, m_width, m_height
);

// 9. Unmap frame
cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

// 10. Release slot
{
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
    my_slot.in_use = false;
}

// 11. Advance return index
m_returnIndex.fetch_add(1);

return copySuccess;
```
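
Every exit path in this phase has to release the slot and advance the return index. One possible way to make that hard to forget is a scope guard; the sketch below is an assumption about how it could look, not the current implementation. `GuardedSlot`, `SlotReleaseGuard`, and `retrieve_frame` are hypothetical names, and the map and copy results are simulated with booleans.

```cpp
#include <atomic>
#include <cstddef>

// Reduced slot for the sketch: only the allocation flag.
struct GuardedSlot {
    std::atomic<bool> in_use{false};
};

// Releases the slot and advances the return index when it goes out of scope.
class SlotReleaseGuard {
public:
    SlotReleaseGuard(GuardedSlot& slot, std::atomic<std::size_t>& return_index)
        : slot_(slot), return_index_(return_index) {}
    SlotReleaseGuard(const SlotReleaseGuard&) = delete;
    SlotReleaseGuard& operator=(const SlotReleaseGuard&) = delete;

    ~SlotReleaseGuard() {
        slot_.in_use.store(false);   // step 10: release slot
        return_index_.fetch_add(1);  // step 11: let the next caller proceed
    }

private:
    GuardedSlot& slot_;
    std::atomic<std::size_t>& return_index_;
};

// Stand-in for map/copy/unmap; `map_ok` and `copy_ok` simulate their results.
inline bool retrieve_frame(GuardedSlot& slot, std::atomic<std::size_t>& return_index,
                           bool map_ok, bool copy_ok) {
    SlotReleaseGuard guard(slot, return_index);
    if (!map_ok) {
        return false;  // early return still releases the slot via the guard
    }
    return copy_ok;
}
```

With a guard like this, the map-failure branch and the normal path share the same cleanup, so the return index cannot be left behind by an added early return.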

---

### Component 6: NVDEC Callback Integration

**Purpose**: Link NVDEC picture_index to RingBuffer slot

```cpp
int CUDAAPI NVDECAV1Decoder::HandlePictureDisplay(void* user_data, CUVIDPARSERDISPINFO* disp_info) {
    auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

    // Extract slot index from timestamp
    size_t slot_idx = static_cast<size_t>(disp_info->timestamp) % RING_BUFFER_SIZE;

    DecodeSlot& slot = decoder->m_ringBuffer[slot_idx];

    // Store NVDEC picture index
    {
        std::lock_guard<std::mutex> lock(slot.slot_mutex);
        slot.picture_index = disp_info->picture_index;
    }

    // Polling thread will check cuvidGetDecodeStatus() for this picture_index

    return 1;
}
```
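
The timestamp-to-slot routing can be checked on its own. This sketch uses a hypothetical `DisplayInfo` struct that mirrors only the two `CUVIDPARSERDISPINFO` fields the callback reads; `RoutedSlot` and `route_display` are illustrative names, and rejecting negative values is an added assumption.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kRingBufferSize = 8;  // stands in for RING_BUFFER_SIZE

// Hypothetical stand-in for the two display-info fields used by the callback.
struct DisplayInfo {
    int picture_index;
    long long timestamp;  // carries the slot index set at submission time
};

struct RoutedSlot {
    int picture_index = -1;  // -1 until a picture is linked to the slot
};

// Stores the picture index in the slot named by the timestamp.
// Returns false (and changes nothing) for negative inputs.
inline bool route_display(std::array<RoutedSlot, kRingBufferSize>& ring,
                          const DisplayInfo& info) {
    if (info.timestamp < 0 || info.picture_index < 0) {
        return false;
    }
    const std::size_t slot =
        static_cast<std::size_t>(info.timestamp) % kRingBufferSize;
    ring[slot].picture_index = info.picture_index;
    return true;
}
```

Validating the timestamp before the cast avoids turning a negative value into a huge unsigned index, which the modulo would otherwise silently map to an arbitrary slot.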

---

## 📐 Implementation Plan

### Phase 1: Data Structure Setup ✅

**Files to Modify**:
- `NVDECAV1Decoder.h` - Add RingBuffer members
- `NVDECAV1Decoder.cpp` - Initialize RingBuffer in constructor

**Tasks**:
- [x] Define `DecodeSlot` structure
- [x] Add `m_ringBuffer[8]` array
- [x] Add `m_submitIndex`, `m_returnIndex` atomic counters
- [x] Add `m_pollingThread`, `m_pollingRunning` members

**Estimated Time**: 30 minutes

---

### Phase 2: Polling Thread Implementation

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Implement `PollingThreadFunc()`

**Tasks**:
- [ ] Implement polling loop with `cuvidGetDecodeStatus()`
- [ ] Add thread start in `Initialize()`
- [ ] Add thread stop in `Cleanup()`
- [ ] Add debug logging for polling events

**Testing**:
- Verify thread starts/stops correctly
- Verify `cuvidGetDecodeStatus()` calls work

**Estimated Time**: 1 hour

---

### Phase 3: DecodeToSurface Refactoring

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Rewrite `DecodeToSurface()`

**Tasks**:
- [ ] Phase 1: Slot allocation logic
- [ ] Phase 2: Sequential return wait logic
- [ ] Phase 3: Frame retrieval & cleanup logic
- [ ] Error handling for all failure paths

**Testing**:
- Single-threaded decode test
- Multi-threaded decode test (2-3 threads)
- Verify packet-surface mapping correctness

**Estimated Time**: 2 hours

---

### Phase 4: HandlePictureDisplay Update

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Modify `HandlePictureDisplay()`

**Tasks**:
- [ ] Extract slot_idx from timestamp
- [ ] Store picture_index in correct slot
- [ ] Add debug logging

**Testing**:
- Verify timestamp → slot_idx mapping
- Verify picture_index stored correctly

**Estimated Time**: 30 minutes

---

### Phase 5: Integration Testing

**Test Scenarios**:
1. **Single packet decode** - Verify basic functionality
2. **Sequential 3 packets** - Verify FIFO order
3. **Out-of-order completion** - Verify correct mapping (I-frame after P-frame)
4. **RingBuffer overflow** - Verify error handling (9+ concurrent calls)
5. **Decode errors** - Verify graceful failure
6. **Performance benchmark** - Measure latency reduction

**Test Files**:
- Simple test video (simple_test.webm)
- Complex GOP structure video (test_720p_stripe.webm)

**Estimated Time**: 2 hours

---

### Phase 6: Documentation & Cleanup

**Tasks**:
- [ ] Update NVDEC design documentation
- [ ] Add inline code comments
- [ ] Remove old queue-based code
- [ ] Move design doc to `docs/completed/`

**Estimated Time**: 1 hour

---

## 📊 Performance Analysis

### Expected Improvements

#### Scenario 1: Sequential 3-Frame Decode

**Current (Sequential)**:
```
Frame 1: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Frame 2: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Frame 3: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Total: 45ms
```

**With RingBuffer (Parallel)**:
```
Submit all 3:     [S1][S2][S3] = 3ms
GPU decode:       [─── 10ms ───] (parallel)
Sequential return:
  Frame 1: [Wait][Map][Copy][Unmap] = 5ms
  Frame 2: [Map][Copy][Unmap] = 4ms (already decoded)
  Frame 3: [Map][Copy][Unmap] = 4ms (already decoded)
Total: 3ms + 10ms + 13ms = 26ms

Improvement: 45ms → 26ms (~42% less total time)
```
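
The totals above follow from a simple cost model. The sketch below only restates that arithmetic; the struct and function names are illustrative, and the 1ms wake-up on the first frame is the extra `[Wait]` step shown in the timeline.

```cpp
// Per-stage costs in milliseconds, taken from the scenario above.
struct StageCosts {
    int submit_ms;
    int decode_ms;
    int map_ms;
    int copy_ms;
    int unmap_ms;
};

// Cost of map + copy + unmap for one frame.
constexpr int retrieve_ms(StageCosts c) {
    return c.map_ms + c.copy_ms + c.unmap_ms;
}

// Sequential path: every frame pays every stage back to back.
constexpr int sequential_total_ms(StageCosts c, int frames) {
    return frames * (c.submit_ms + c.decode_ms + retrieve_ms(c));
}

// Pipelined path as modelled in the scenario: all submits, one overlapped
// decode window, one wake-up for the first frame, then sequential retrieval.
constexpr int pipelined_total_ms(StageCosts c, int frames, int first_wake_ms) {
    return frames * c.submit_ms + c.decode_ms + first_wake_ms + frames * retrieve_ms(c);
}

// Reduction in total time, as a whole percentage (truncated).
constexpr int percent_time_saved(int before_ms, int after_ms) {
    return (before_ms - after_ms) * 100 / before_ms;
}

static_assert(sequential_total_ms(StageCosts{1, 10, 1, 2, 1}, 3) == 45, "45ms");
static_assert(pipelined_total_ms(StageCosts{1, 10, 1, 2, 1}, 3, 1) == 26, "26ms");
static_assert(percent_time_saved(45, 26) == 42, "42%");
```

The model treats the three decodes as fully overlapped, which is the optimistic case; reference-frame dependencies between frames would lengthen the decode window.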

---

#### Scenario 2: Complex I/P Frame Mix

**Video**: I-frame (15ms), P-frame (3ms), P-frame (3ms)

**Current**:
```
I: 15ms
P: 3ms
P: 3ms
Total: 21ms
```

**With RingBuffer**:
```
Submit: 3ms
GPU:    [I:15ms, P:3ms, P:3ms overlap] = 15ms (limited by I-frame)
Return: 9ms (sequential)
Total: 27ms (slower due to overhead)

But if I-frame completes first: 21ms (same)
```

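**Note**: These totals are rough. The current-path figure counts 15ms/3ms/3ms per frame with no separate submit or return cost, while the RingBuffer figure adds both, so the two are not strictly like-for-like. Benefit depends on GOP structure. Simple videos may see less improvement.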

---

### CPU & GPU Utilization

**Current**:
```
CPU: [Active──────][Idle────────][Active──────][Idle────────]
GPU:        [Decode──────]               [Decode──────]
```

**With RingBuffer**:
```
CPU: [Active][Active][Active][─────Wait─────][Active][Active]
GPU:        [Decode1][Decode2][Decode3]
```

**GPU Utilization**: ⬆️ Increased by ~30%

**CPU Utilization**: ➡️ Similar (waiting in different places)

---

## 🛡️ Risk Assessment

### Risk 1: RingBuffer Overflow

**Scenario**: Application calls `DecodeToSurface()` 9+ times before first frame completes

**Mitigation**:
```cpp
if (my_slot.in_use) {
    LogError("RingBuffer overflow");
    return false;  // Fail fast
}
```

**Alternative**: Block until slot available
```cpp
while (my_slot.in_use) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
```

**Recommendation**: Fail fast + document max concurrent limit (8 frames)

---

### Risk 2: Decode Timeout Deadlock

**Scenario**: GPU hangs, polling thread never sets `is_ready`

**Mitigation**:
```cpp
if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500), ...)) {
    // Timeout - skip this slot
    my_slot.in_use = false;
    m_returnIndex.fetch_add(1);  // Advance to prevent deadlock
    return false;
}
```

**Risk**: Skipping slot may cause next slot to fail (reference frame missing)

**Recommendation**: Return error, let application decide (retry/abort)

---

### Risk 3: Polling Thread CPU Usage

**100us sleep → up to ~10,000 checks/second**

**CPU Impact**:
```
Per check: ~1us (cuvidGetDecodeStatus overhead)
Total: 10,000 * 1us = 10ms/sec = 1% of one CPU core
```

**Mitigation**: Acceptable overhead for real-time video decoding

**Alternative**: Adaptive polling (faster when frames pending, slower when idle)
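
A minimal sketch of the adaptive alternative, under stated assumptions: `PollPolicy` and `next_poll_interval` are hypothetical names, the 100us floor comes from the design above, and the 5ms ceiling and the doubling back-off are illustrative choices, not measured values.

```cpp
#include <algorithm>
#include <chrono>

using std::chrono::microseconds;

// Bounds for the polling interval (the ceiling is an assumed value).
struct PollPolicy {
    microseconds min_interval{100};   // fastest rate, used while frames are pending
    microseconds max_interval{5000};  // slowest rate, reached after sustained idleness
};

// Picks the next sleep interval: snap to the minimum while work is pending,
// otherwise double the current interval up to the configured maximum.
inline microseconds next_poll_interval(microseconds current, bool frames_pending,
                                       PollPolicy policy = PollPolicy{}) {
    if (frames_pending) {
        return policy.min_interval;
    }
    return std::min(current * 2, policy.max_interval);
}
```

The polling loop would feed the returned value into its sleep call, so an idle decoder drops to a few hundred wake-ups per second while an active one keeps the 100us cadence.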

---

### Risk 4: Thread Safety - CUDA Context

**Issue**: `cuvidGetDecodeStatus()` requires correct CUDA context

**Current Protection**:
```cpp
std::lock_guard<std::mutex> contextLock(m_cudaContextMutex);
cuCtxSetCurrent(m_cuContext);
```

**In Polling Thread**:
```cpp
void PollingThreadFunc() {
    // Set CUDA context once at thread start
    cuCtxSetCurrent(m_cuContext);

    while (m_pollingRunning) {
        // cuvidGetDecodeStatus() now safe
        cuvidGetDecodeStatus(...);
    }
}
```

**Recommendation**: Set context once per thread (not per call)

---

## 📚 References

### NVIDIA Documentation

1. **NVDEC Video Decoder API Programming Guide**
   - URL: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/nvdec-video-decoder-api-prog-guide/
   - Section: "cuvidGetDecodeStatus" API reference

2. **Video Codec SDK 12.2**
   - URL: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/
   - Section: "NVDECODE API" - asynchronous decoding patterns

3. **CUDA Programming Guide**
   - URL: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
   - Section: "Multi-threading" - CUDA context management

### Previous Design Documents

1. **NVDEC Deferred Decoder Recreation Design** (Abandoned)
   - Path: `docs/completed/windows/nvdec/NVDEC_Deferred_Decoder_Recreation_Design.md`
   - Reason for abandonment: Incompatible with NVDEC synchronous callback architecture
   - Lessons learned: NVDEC parser requires immediate decoder availability

2. **VavCore Library Design**
   - Path: `VavCore_Library_Design.md`
   - Relevant sections: C API design, cross-platform surface types

### Code References

1. **Current DecodeToSurface Implementation**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 976-1200
   - Current approach: Synchronous wait with condition variable

2. **HandlePictureDisplay Callback**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 791-813
   - Current approach: Push picture_index to queue, notify condition variable

3. **NVDEC Initialization**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 600-650
   - Key setting: `m_createInfo.ulNumOutputSurfaces = 8`

---

## 🔄 Change Log

| Date       | Author | Change Description |
|------------|--------|-------------------|
| 2025-10-04 | Claude | Initial design document created |
| 2025-10-04 | Claude | Added RingBuffer architecture details |
| 2025-10-04 | Claude | Added polling thread design |
| 2025-10-04 | Claude | Added performance analysis |

---

## ✅ Approval & Sign-off

**Design Review**:
- [ ] Architecture approved
- [ ] Performance estimates validated
- [ ] Risk mitigation strategies accepted

**Implementation Ready**:
- [ ] All design questions resolved
- [ ] Implementation plan reviewed
- [ ] Test plan defined

---

**Next Steps**: Phase 1 (Data Structure Setup) is complete. Proceed to Phase 2 - Polling Thread Implementation