2025-10-04 14:57:14 +09:00
|
|
|
# NVDEC RingBuffer-based Asynchronous Decoding Design
|
|
|
|
|
|
2025-10-05 11:12:57 +09:00
|
|
|
**Date**: 2025-10-05 (Updated)
|
|
|
|
|
**Status**: ✅ **Design Finalized - Ready for Implementation**
|
|
|
|
|
**Target**: NVDECAV1Decoder DecodeToSurface() optimization with multi-frame packet support
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 📋 Table of Contents
|
|
|
|
|
|
|
|
|
|
1. [Background & Problem Statement](#background--problem-statement)
2. [Design Goals](#design-goals)
3. [Architecture Overview](#architecture-overview)
4. [Detailed Design](#detailed-design)
5. [Implementation Plan](#implementation-plan)
6. [Performance Analysis](#performance-analysis)
7. [Risk Assessment](#risk-assessment)
8. [References](#references)
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 🎯 Background & Problem Statement
|
|
|
|
|
|
|
|
|
|
### Current Implementation Issues
|
|
|
|
|
|
|
|
|
|
#### Issue 1: Sequential Decoding Bottleneck
|
|
|
|
|
|
|
|
|
|
**Current Code** (`NVDECAV1Decoder::DecodeToSurface`):
|
|
|
|
|
```cpp
bool DecodeToSurface(packet_data, target_surface, output_frame) {
    cuvidParseVideoData(packet);            // Submit packet
    wait_for_frame();                       // Block until decode complete
    cuvidMapVideoFrame();                   // Map decoded frame
    CopyNV12Frame(srcPtr, target_surface);  // Copy to surface
    cuvidUnmapVideoFrame();                 // Unmap frame
    return true;
}
```
|
|
|
|
|
|
|
|
|
|
**Timeline**:
|
|
|
|
|
```
Call 1: [Submit][Wait____][Map][Copy][Unmap]
Call 2: [Submit][Wait____][Map][Copy][Unmap]
Call 3: [Submit][Wait____][Map][Copy][Unmap]

Total: ~45ms (15ms * 3)
```
|
|
|
|
|
|
|
|
|
|
**Problem**: GPU sits idle during packet submission and surface copying.
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
#### Issue 2: Packet-Surface Mapping Mismatch
|
|
|
|
|
|
|
|
|
|
**Scenario**: Multi-threaded DecodeToSurface calls
|
|
|
|
|
|
|
|
|
|
```cpp
// Thread A
DecodeToSurface(packet1, surface1, frame1);  // I-frame, 15ms

// Thread B (simultaneous)
DecodeToSurface(packet2, surface2, frame2);  // P-frame, 3ms

// GPU completes out-of-order:
//   packet2 completes @ t=3ms  → picture_index=5
//   packet1 completes @ t=15ms → picture_index=4

// Current queue-based retrieval:
//   Thread A: pop() → picture_index=5 ❌ (gets packet2 result!)
//   Thread B: pop() → picture_index=4 ❌ (gets packet1 result!)

// Result: surface1 contains packet2, surface2 contains packet1
```
|
|
|
|
|
|
|
|
|
|
**Root Cause**:
- FIFO queue doesn't track which packet belongs to which surface
- Decoding completion order ≠ submission order
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
#### Issue 3: Multi-Frame Packet Handling ⚠️ **Critical Discovery**
|
|
|
|
|
|
|
|
|
|
**Scenario**: Single WebM packet contains multiple AV1 frames
|
|
|
|
|
|
|
|
|
|
```cpp
// DecodeToSurface called ONCE
DecodeToSurface(packet_175bytes, surface1, frame1);

// NVDEC parser extracts MULTIPLE frames from single packet:
//   → HandlePictureDecode(CurrPicIdx=0, IntraPicFlag=1)  // I-frame
//   → HandlePictureDecode(CurrPicIdx=1, IntraPicFlag=0)  // P-frame
//   → HandlePictureDecode(CurrPicIdx=2, IntraPicFlag=0)  // P-frame
//   → ... (up to 8 frames in one packet)

// Problem: Which picture_index should be returned?
// Current design assumes: 1 packet = 1 frame ❌
```
|
|
|
|
|
|
|
|
|
|
**Impact**:
- Slot allocation assumes 1 packet → 1 slot → 1 picture_index
- Reality: 1 packet → 1 slot → **N picture_indices**
- Must track multiple picture_indices per slot
- Must decide which frame to return (first? last? all?)
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
#### Issue 4: ulNumOutputSurfaces Underutilization
|
|
|
|
|
|
|
|
|
|
**NVDEC Configuration**:
|
|
|
|
|
```cpp
m_createInfo.ulNumOutputSurfaces = 8;  // 8 output buffers allocated
```
|
|
|
|
|
|
|
|
|
|
**Purpose of 8 Surfaces**:
1. Reference frame storage (AV1 needs up to 8 reference frames)
2. Display reordering buffer (B-frames, decode order ≠ display order)
3. Pipeline parallelism (overlap decode/map/copy operations)
|
|
|
|
|
|
|
|
|
|
**Current Usage**: Only 1-2 surfaces actually utilized → Wasted GPU memory (~24MB for 1920x1080)
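
The ~24MB figure follows from the NV12 layout: a full-resolution luma plane plus a half-size interleaved chroma plane, i.e. 12 bits per pixel. A minimal sketch of that arithmetic, assuming 8-bit NV12 with no pitch or height alignment padding (real decoder allocations are usually padded, so the actual footprint is likely somewhat higher). `Nv12FrameBytes` and `Nv12PoolBytes` are hypothetical helper names, not part of the decoder.

```cpp
#include <cassert>
#include <cstddef>

// Bytes for one 8-bit NV12 frame: Y plane (w*h) + interleaved UV plane (w*h/2).
// Assumption: no pitch/height alignment padding is applied.
constexpr std::size_t Nv12FrameBytes(std::size_t width, std::size_t height) {
    return width * height + (width * height) / 2;
}

// Bytes for a pool of `surfaces` NV12 frames (e.g. ulNumOutputSurfaces = 8).
constexpr std::size_t Nv12PoolBytes(std::size_t width, std::size_t height,
                                    std::size_t surfaces) {
    return Nv12FrameBytes(width, height) * surfaces;
}

// 1920x1080: 3,110,400 bytes per frame, 24,883,200 bytes for 8 surfaces.
static_assert(Nv12FrameBytes(1920, 1080) == 3110400, "one 1080p NV12 frame");
static_assert(Nv12PoolBytes(1920, 1080, 8) == 24883200, "8 surfaces, ~24.9 MB");
```

If only 1-2 of the 8 surfaces are in flight at any time, most of that pool is idle capacity from the pipeline's point of view.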
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 🎯 Design Goals
|
|
|
|
|
|
|
|
|
|
### Primary Goals
|
|
|
|
|
|
|
|
|
|
1. **✅ Correct Packet-Surface Mapping**
   - Guarantee: `DecodeToSurface(packetN, surfaceN)` → surfaceN contains packetN result
   - No out-of-order completion issues

2. **✅ Sequential Return Order**
   - Even if packet2 completes before packet1, return surface1 first
   - Maintain FIFO semantics for API users

3. **⚡ Parallel Decoding Throughput**
   - Allow 2-3 packets to decode simultaneously on GPU
   - Reduce total decode latency by ~30-40%

4. **🛡️ Simplicity & Maintainability**
   - Avoid complex ID mapping systems
   - Use fixed-size RingBuffer (matches NVDEC hardware design)
   - Single polling thread (not per-decode thread)
|
|
|
|
|
|
|
|
|
|
### Non-Goals
|
|
|
|
|
|
|
|
|
|
- ❌ Full 8-frame parallel decoding (limited by reference frame dependencies)
- ❌ Out-of-order surface return (would break video playback semantics)
- ❌ Dynamic buffer sizing (fixed 8 slots matches NVDEC hardware)
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 🏗️ Architecture Overview
|
|
|
|
|
|
|
|
|
|
### High-Level Design
|
|
|
|
|
|
|
|
|
|
```
┌─────────────────────────────────────────────────────────────────┐
│                     DecodeToSurface() Call                      │
│           (Thread 1, 2, 3, ... can call concurrently)           │
└────────────┬────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│                      RingBuffer (8 slots)                       │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐              │
│  │ [0] │ [1] │ [2] │ [3] │ [4] │ [5] │ [6] │ [7] │              │
│  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘              │
│     ▲                             ▲                             │
│     │                             │                             │
│  m_returnIdCounter % 8     m_slotIdCounter % 8                  │
│  (consumer)                (producer)                           │
└────────────┬────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Polling Thread                          │
│  while (polling_running) {                                      │
│    slot = ringBuffer[m_returnIdCounter % 8]                     │
│    if (cuvidGetDecodeStatus(slot.picture_index) == Success) {   │
│      slot.is_ready = true                                       │
│      slot.frame_ready.notify_one()                              │
│    }                                                            │
│    sleep(100us)                                                 │
│  }                                                              │
└─────────────────────────────────────────────────────────────────┘
```
|
|
|
|
|
|
|
|
|
|
### Data Structures
|
|
|
|
|
|
|
|
|
|
#### DecodeSlot Structure
|
|
|
|
|
|
|
|
|
|
```cpp
struct DecodeSlot {
    // Slot state
    bool in_use = false;                  // Is this slot allocated?

    // Submitted information (from DecodeToSurface args)
    void* target_surface = nullptr;       // Destination D3D12 resource
    VavCoreSurfaceType surface_type{};    // Surface type

    // NVDEC information (from HandlePictureDecode callback)
    // ⚠️ Multi-frame support: One packet can decode to multiple frames
    std::vector<int> picture_indices;     // All NVDEC frame indices from this packet

    // Synchronization primitives
    std::condition_variable frame_ready;  // Signaled when ALL frames are decoded
    std::mutex slot_mutex;                // Protects this slot's state
    bool is_ready = false;                // All frames decoded flag
};
```
|
|
|
|
|
|
|
|
|
|
#### RingBuffer Members
|
|
|
|
|
|
|
|
|
|
```cpp
class NVDECAV1Decoder {
private:
    static constexpr size_t RING_BUFFER_SIZE = 8;  // Match ulNumOutputSurfaces

    DecodeSlot m_ringBuffer[RING_BUFFER_SIZE];

    // 🎯 Option C: Unified slot allocation counter (no mapping needed!)
    std::atomic<uint64_t> m_slotIdCounter{0};    // Monotonically increasing slot ID
    std::atomic<uint64_t> m_returnIdCounter{0};  // Return order enforcement (FIFO)

    // Polling thread
    std::thread m_pollingThread;
    std::atomic<bool> m_pollingRunning{false};
};
```
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 🔧 Detailed Design
|
|
|
|
|
|
|
|
|
|
### Component 1: Slot Allocation (Producer)
|
|
|
|
|
|
**Purpose**: Assign a RingBuffer slot to each `DecodeToSurface` call using the Option C design

```cpp
// In DecodeToSurface()

// 1. 🎯 Option C: Allocate unique ID (serves as both slot ID and submission order)
uint64_t my_id = m_slotIdCounter.fetch_add(1);
size_t slot_idx = my_id % RING_BUFFER_SIZE;
DecodeSlot& my_slot = m_ringBuffer[slot_idx];

// 2. Check for overflow
{
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

    if (my_slot.in_use) {
        // RingBuffer full! Too many concurrent decodes.
        // NOTE: my_id has already been consumed here. The FIFO wait only advances
        // m_returnIdCounter one ID at a time, so this ID must still be skipped when
        // it reaches the front, otherwise later callers wait forever.
        LogError("RingBuffer overflow - max 8 concurrent decodes");
        return false;
    }

    // 3. Initialize slot
    my_slot.in_use = true;
    my_slot.target_surface = target_surface;
    my_slot.surface_type = target_type;
    my_slot.picture_indices.clear();  // Multi-frame support: clear previous frames
    my_slot.is_ready = false;
}
```
|
|
|
|
|
|
**🎯 Option C: No Mapping Needed - Direct Calculation**:

```
Thread 1: m_slotIdCounter.fetch_add(1) → ID=0, slot_idx = 0 % 8 = slot[0]
Thread 2: m_slotIdCounter.fetch_add(1) → ID=1, slot_idx = 1 % 8 = slot[1]
Thread 3: m_slotIdCounter.fetch_add(1) → ID=2, slot_idx = 2 % 8 = slot[2]
...
Thread 9: m_slotIdCounter.fetch_add(1) → ID=8, slot_idx = 8 % 8 = slot[0] (wrap around)
```

**Key Advantage**: ID → slot_idx calculation is **deterministic**, no map storage needed!

|
|
|
**Overflow Protection**: If `slot[0].in_use == true` when Thread 9 arrives → error
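
The allocation rule above can be exercised in isolation. The following is a minimal sketch, assuming nothing beyond the atomic counter and the modulo described in this section; `SlotAllocator`, `SlotTicket`, `SlotIndexFor`, and `kSlots` are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSlots = 8;  // mirrors RING_BUFFER_SIZE

// Result of one allocation: the unique ID and the slot it maps to.
struct SlotTicket {
    std::uint64_t id;        // monotonically increasing submission ID
    std::size_t   slot_idx;  // id % kSlots
};

// The slot index can always be recomputed from the ID alone (no map needed).
constexpr std::size_t SlotIndexFor(std::uint64_t id) {
    return static_cast<std::size_t>(id % kSlots);
}

// Hypothetical stand-in for the allocation step in DecodeToSurface().
class SlotAllocator {
public:
    SlotTicket Allocate() {
        // fetch_add returns the previous value, so IDs start at 0 and never repeat.
        const std::uint64_t id = counter_.fetch_add(1);
        return SlotTicket{id, SlotIndexFor(id)};
    }

private:
    std::atomic<std::uint64_t> counter_{0};
};
```

Because the ID is unique even when two callers land on the same slot index (ID 0 and ID 8 both map to slot 0), the ID rather than the index is the value that identifies a submission.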
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
### Component 2: Packet Submission
|
|
|
|
|
|
**Purpose**: Submit the packet to NVDEC, carrying the slot ID in the packet timestamp

```cpp
// 4. Submit packet to NVDEC parser
CUVIDSOURCEDATAPACKET packet = {};
packet.payload = packet_data;
packet.payload_size = packet_size;
// CUVID_PKT_TIMESTAMP marks the timestamp field as valid for the parser
packet.flags = CUVID_PKT_ENDOFPICTURE | CUVID_PKT_TIMESTAMP;
packet.timestamp = static_cast<int64_t>(my_id);  // 🎯 Pass full slot_id (NOT modulo!)

CUresult result = cuvidParseVideoData(m_parser, &packet);
if (result != CUDA_SUCCESS) {
    // Release slot on error (under the slot lock, like every other in_use write).
    // NOTE: my_id is already consumed; the FIFO wait in later calls needs this ID
    // to be skipped once it reaches the front.
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
    my_slot.in_use = false;
    return false;
}
```
|
|
|
|
|
|
**🎯 Option C Timestamp Flow** (multi-frame packet support):

```
cuvidParseVideoData(packet, timestamp=my_id=17)
    ↓
HandleVideoSequence() (first time only)
    ↓
HandlePictureDecode(timestamp=17, CurrPicIdx=0) → slot_idx = 17 % 8 = 1
    → m_ringBuffer[1].picture_indices.push_back(0)
    ↓
HandlePictureDecode(timestamp=17, CurrPicIdx=1) → slot_idx = 17 % 8 = 1
    → m_ringBuffer[1].picture_indices.push_back(1)  // Same packet, multiple frames!
    ↓
HandlePictureDecode(timestamp=17, CurrPicIdx=2) → slot_idx = 17 % 8 = 1
    → m_ringBuffer[1].picture_indices.push_back(2)
    ↓
PollingThread checks ALL picture_indices for slot[1]
    ↓
When all complete: slot[1].is_ready = true, notify thread
```
|
|
|
|
|
|
|
|
|
**Key Point**: Timestamp carries **full slot_id**, HandlePictureDecode calculates slot_idx directly
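
The bookkeeping half of this flow can be tried without NVDEC. The sketch below assumes only that the callback receives the submitted timestamp and a picture index; `RoutingSlot`, `RoutingRing`, `RecordDecodedPicture`, and `kRingSlots` are hypothetical names, and the real callback additionally submits the picture to the decoder.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

constexpr std::size_t kRingSlots = 8;  // mirrors RING_BUFFER_SIZE

// Reduced slot: only the fields touched by the callback bookkeeping.
struct RoutingSlot {
    std::mutex mutex;
    std::vector<int> picture_indices;
};

using RoutingRing = std::array<RoutingSlot, kRingSlots>;

// `timestamp` is the value submitted with the packet (the full slot ID);
// `curr_pic_idx` is the decoder's picture index for one frame of that packet.
// Returns the slot index the frame was recorded under.
std::size_t RecordDecodedPicture(RoutingRing& ring, std::int64_t timestamp,
                                 int curr_pic_idx) {
    const auto slot_id = static_cast<std::uint64_t>(timestamp);
    const auto slot_idx = static_cast<std::size_t>(slot_id % kRingSlots);
    RoutingSlot& slot = ring[slot_idx];
    std::lock_guard<std::mutex> lock(slot.mutex);
    slot.picture_indices.push_back(curr_pic_idx);  // one entry per frame in the packet
    return slot_idx;
}
```

Calling it three times with timestamp 17 and picture indices 0, 1, 2 leaves all three indices in slot 1, matching the flow shown above.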
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
### Component 3: Polling Thread (Background Status Checker)
|
|
|
|
|
|
**Purpose**: Continuously poll the slot at `m_returnIdCounter % RING_BUFFER_SIZE` for decode completion (multi-frame support)

```cpp
void NVDECAV1Decoder::PollingThreadFunc() {
    while (m_pollingRunning) {
        // 1. Get current return ID and calculate slot index
        uint64_t current_return_id = m_returnIdCounter.load();
        size_t slot_idx = current_return_id % RING_BUFFER_SIZE;
        DecodeSlot& slot = m_ringBuffer[slot_idx];

        // 2-3. Under the slot lock: check that the slot is in use and not yet ready,
        //      then take a copy of its picture indices (multi-frame support)
        std::vector<int> picture_indices_copy;
        {
            std::lock_guard<std::mutex> lock(slot.slot_mutex);
            if (slot.in_use && !slot.is_ready) {
                picture_indices_copy = slot.picture_indices;
            }
        }

        // 4. Check if ALL frames are decoded (an empty list is never "complete")
        bool all_complete = !picture_indices_copy.empty();
        for (int pic_idx : picture_indices_copy) {
            CUVIDGETDECODESTATUS decodeStatus = {};
            CUresult result = cuvidGetDecodeStatus(m_decoder, pic_idx, &decodeStatus);

            if (result != CUDA_SUCCESS ||
                decodeStatus.decodeStatus != cuvidDecodeStatus_Success) {
                all_complete = false;
                break;
            }
        }

        // 5. If all frames complete, signal ready
        if (all_complete) {
            {
                std::lock_guard<std::mutex> lock(slot.slot_mutex);
                slot.is_ready = true;
            }
            slot.frame_ready.notify_one();
        }

        // 6. Sleep to avoid busy-wait
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
}
```
|
|
|
|
|
|
|
|
|
|
**Key Points**:
- ✅ Only polls the slot at `m_returnIdCounter % RING_BUFFER_SIZE` (not all 8 slots) → efficient
- ✅ **Multi-frame support**: Checks ALL picture_indices for completion
- ✅ Uses the non-blocking `cuvidGetDecodeStatus()` query
- ✅ 100us sleep → at most ~10,000 checks/second (low CPU usage)
- ✅ Copies picture_indices under the slot lock, so status queries run without holding it
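
The "all frames complete" rule is easy to get subtly wrong (for example, treating an empty list as complete), so it can help to isolate it from the GPU. A minimal sketch, assuming the status lookup is injected as a callable; `SketchStatus`, `AllPicturesDecoded`, and the `query` parameter are hypothetical and stand in for the `cuvidGetDecodeStatus()` call in the real loop.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Stand-in for the decode status of one picture.
enum class SketchStatus { InProgress, Success, Error };

// Returns true only when the list is non-empty and every picture reports Success.
// `query` is an injection point so the logic can be exercised without a GPU.
bool AllPicturesDecoded(const std::vector<int>& picture_indices,
                        const std::function<SketchStatus(int)>& query) {
    if (picture_indices.empty()) {
        return false;  // the callback has not recorded any frame yet
    }
    for (int pic_idx : picture_indices) {
        if (query(pic_idx) != SketchStatus::Success) {
            return false;  // stop at the first incomplete or failed frame
        }
    }
    return true;
}
```

Note that in this sketch an `Error` status is treated the same as "not complete yet", which mirrors the loop above: a failed frame is only surfaced to the caller through the decode timeout.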
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
### Component 4: Sequential Return Wait (FIFO Guarantee)

**Purpose**: Using Option C, enforce FIFO return order even when decodes complete out of order

```cpp
// In DecodeToSurface() - PHASE 2

// 5. Wait for my turn (FIFO order enforcement)
while (m_returnIdCounter.load() != my_id) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

// Now I'm first in line!

// 6. Wait for decode to complete
{
    std::unique_lock<std::mutex> lock(my_slot.slot_mutex);

    if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500),
            [&my_slot]() { return my_slot.is_ready; })) {
        // Timeout - decode took too long
        LogError("Decode timeout");
        my_slot.in_use = false;
        m_returnIdCounter.fetch_add(1);  // Skip this slot to avoid deadlock
        return false;
    }
}
```
|
|
|
|
|
|
**🎯 Option C Timeline Example** (using slot ID, not slot index):

```
Thread 1 (ID=17, slot 1): Wait for returnIdCounter==17 ✅ (immediate)
                          Wait for is_ready...

Thread 2 (ID=18, slot 2): Wait for returnIdCounter==18 ⏸️ (blocked)

Thread 3 (ID=19, slot 3): Wait for returnIdCounter==19 ⏸️ (blocked)

GPU: ID=18 completes first @ t=3ms
     → slot[2] is decoded (is_ready is set once the polling thread reaches ID=18)
     → Thread 2 still blocked (returnIdCounter=17)

GPU: ID=17 completes @ t=15ms
     → slot[1].is_ready = true
     → Thread 1 wakes up ✅
     → Thread 1 processes → returnIdCounter = 18
     → Thread 2 now unblocked ✅
```
|
|
|
|
|
|
|
|
|
**Key Point**: Wait on **slot_id** (my_id), not slot_idx, for correct FIFO ordering
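
The design waits for its turn with a 1 ms sleep loop on `m_returnIdCounter`. As a point of comparison only, the same "wait until the counter equals my ID" rule can be expressed with a condition variable, which wakes waiters when the counter changes instead of on a timer. This is a sketch of an alternative, not the mechanism chosen in this document; `ReturnTurnstile` and its methods are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Callers block until the shared return counter equals their own ID.
class ReturnTurnstile {
public:
    // Blocks until it is `my_id`'s turn or `timeout` elapses; true on success.
    bool WaitForTurn(std::uint64_t my_id, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [&] { return next_id_ == my_id; });
    }

    // Called by the current holder after it has returned its frame.
    void Advance() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++next_id_;
        }
        cv_.notify_all();  // every waiter re-checks; only the matching ID proceeds
    }

    std::uint64_t Current() {
        std::lock_guard<std::mutex> lock(mutex_);
        return next_id_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::uint64_t next_id_ = 0;
};
```

The trade-off is one shared mutex on the return path in exchange for removing up to 1 ms of wake-up latency per frame; whether that matters depends on the frame budget.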
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
### Component 5: Frame Retrieval & Cleanup (Multi-Frame Support)

**Purpose**: Map the decoded frame, copy it to the surface, and release the slot

```cpp
// In DecodeToSurface() - PHASE 3

// 7. Get first frame from multi-frame packet
int frameIdx = my_slot.picture_indices[0];  // Return first frame only

// 8. Map decoded frame from NVDEC
CUVIDPROCPARAMS procParams = {};
procParams.progressive_frame = 1;

CUdeviceptr srcDevicePtr = 0;
unsigned int srcPitch = 0;

CUresult result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
if (result != CUDA_SUCCESS) {
    {
        // Release the slot under its lock, as in the success path below
        std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
        my_slot.in_use = false;
    }
    m_returnIdCounter.fetch_add(1);
    return false;
}

// 9. Copy to D3D12 surface
ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
bool copySuccess = m_d3d12Handler->CopyNV12Frame(
    srcDevicePtr, srcPitch, d3d12Resource, m_width, m_height
);

// 10. Unmap frame
cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

// 11. Release slot
{
    std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
    my_slot.in_use = false;
}

// 12. Advance return ID counter (FIFO order)
m_returnIdCounter.fetch_add(1);

return copySuccess;
```
|
|
|
|
|
|
|
|
|
**Multi-Frame Decision**: Return **first frame only** from multi-frame packet (indices[0])
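
Every exit path in the retrieval code has to do the same two things: mark the slot free and advance `m_returnIdCounter` exactly once. Missing either on a single error path stalls every later caller. One way to make that hard to forget is a small RAII guard. This is a sketch under the assumption that the slot exposes `slot_mutex` and `in_use` as described earlier; `GuardedSlot`, `SlotReleaseGuard`, and `RetrieveFrameSketch` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Minimal stand-in for the slot fields that matter here.
struct GuardedSlot {
    std::mutex slot_mutex;
    bool in_use = false;
};

// On scope exit: marks the slot free and advances the return counter once,
// whichever return path is taken.
class SlotReleaseGuard {
public:
    SlotReleaseGuard(GuardedSlot& slot, std::atomic<std::uint64_t>& return_counter)
        : slot_(slot), return_counter_(return_counter) {}

    ~SlotReleaseGuard() {
        {
            std::lock_guard<std::mutex> lock(slot_.slot_mutex);
            slot_.in_use = false;
        }
        return_counter_.fetch_add(1);
    }

    SlotReleaseGuard(const SlotReleaseGuard&) = delete;
    SlotReleaseGuard& operator=(const SlotReleaseGuard&) = delete;

private:
    GuardedSlot& slot_;
    std::atomic<std::uint64_t>& return_counter_;
};

// Example with an early-return path: the guard still runs.
bool RetrieveFrameSketch(GuardedSlot& slot, std::atomic<std::uint64_t>& return_counter,
                         bool map_succeeds) {
    SlotReleaseGuard guard(slot, return_counter);
    if (!map_succeeds) {
        return false;  // slot released and counter advanced by ~SlotReleaseGuard
    }
    // ... map, copy, unmap would happen here ...
    return true;
}
```

The guard takes the slot mutex in its destructor, so it should be constructed before any `std::unique_lock` on the same mutex; locals are destroyed in reverse order, which releases that lock first.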
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
### Component 6: NVDEC Callback Integration (Option C)

**Purpose**: Link each NVDEC picture_index to its RingBuffer slot using direct slot_id calculation

```cpp
int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
    auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

    // 🎯 Option C: Direct slot_id → slot_idx calculation (no map lookup!)
    // NOTE: nTimeStamp is assumed here. The stock CUVIDPICPARAMS does not appear to
    // carry a timestamp member (the parser reports timestamps via CUVIDPARSERDISPINFO
    // in the display callback), so verify this against the SDK headers in use.
    uint64_t slot_id = static_cast<uint64_t>(pic_params->nTimeStamp);
    size_t slot_idx = slot_id % RING_BUFFER_SIZE;

    DecodeSlot& slot = decoder->m_ringBuffer[slot_idx];

    // Submit frame to NVDEC decoder
    CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
    if (result != CUDA_SUCCESS) {
        decoder->LogCUDAError(result, "cuvidDecodePicture failed");
        return 0;
    }

    // Store picture_index for polling (multi-frame support)
    {
        std::lock_guard<std::mutex> lock(slot.slot_mutex);
        slot.picture_indices.push_back(pic_params->CurrPicIdx);
    }

    // Polling thread will check cuvidGetDecodeStatus() for ALL picture_indices

    return 1;
}
```
|
|
|
|
|
|
**🎯 Option C Key Advantages**:
- ✅ **No mapping overhead**: Direct modulo calculation, no unordered_map lookup
- ✅ **No mutex contention**: No global map mutex needed
- ✅ **Multi-frame support**: Automatically handles multiple frames per packet
- ✅ **Deterministic**: Same slot_id always maps to same slot_idx

|
|
|
---
|
|
|
|
|
|
## 📐 Implementation Plan (Updated for Option C)
|
|
|
|
|
|
|
|
### Phase 1: Data Structure Setup ✅

**Files to Modify**:
- `NVDECAV1Decoder.h` - Add RingBuffer members with Option C design
- `NVDECAV1Decoder.cpp` - Initialize RingBuffer in constructor

**Tasks**:
- [x] Define `DecodeSlot` structure with `std::vector<int> picture_indices`
- [x] Add `m_ringBuffer[8]` array
- [x] **🎯 Option C**: Add `m_slotIdCounter`, `m_returnIdCounter` atomic counters (NOT submitIndex/returnIndex)
- [x] Add `m_pollingThread`, `m_pollingRunning` members

**Estimated Time**: 30 minutes
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
### Phase 2: Polling Thread Implementation (Multi-Frame Support)

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Implement `PollingThreadFunc()`

**Tasks**:
- [ ] Implement polling loop with `cuvidGetDecodeStatus()` checking ALL picture_indices
- [ ] Poll slot at `m_returnIdCounter % RING_BUFFER_SIZE` (NOT all slots)
- [ ] Add thread start in `Initialize()`
- [ ] Add thread stop in `Cleanup()`
- [ ] Add debug logging for multi-frame packet events

**Testing**:
- Verify thread starts/stops correctly
- Verify multi-frame packet handling (all frames checked)
- Verify `cuvidGetDecodeStatus()` calls work

**Estimated Time**: 1 hour
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
### Phase 3: DecodeToSurface Refactoring (Option C Implementation)

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Rewrite `DecodeToSurface()` with Option C design

**Tasks**:
- [ ] **🎯 Phase 1**: Slot allocation using `my_id = m_slotIdCounter.fetch_add(1)`
- [ ] **🎯 Phase 2**: Packet submission with `packet.timestamp = my_id` (full slot_id, NOT modulo!)
- [ ] **🎯 Phase 3**: FIFO wait using `while(m_returnIdCounter != my_id)` (NOT slot_idx!)
- [ ] **🎯 Phase 4**: Frame retrieval from `picture_indices[0]` (first frame only)
- [ ] **🎯 Phase 5**: Cleanup and `m_returnIdCounter.fetch_add(1)`
- [ ] Error handling for all failure paths

**Testing**:
- Single-threaded decode test
- Multi-threaded decode test (2-3 threads)
- Verify packet-surface mapping correctness with multi-frame packets
- Verify FIFO ordering with out-of-order completion

**Estimated Time**: 2 hours
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
### Phase 4: HandlePictureDecode Update (Option C)

**Files to Modify**:
- `NVDECAV1Decoder.cpp` - Modify `HandlePictureDecode()` callback

**Tasks**:
- [ ] **🎯 Option C**: Extract `slot_id` from `pic_params->nTimeStamp`
- [ ] **🎯 Option C**: Calculate `slot_idx = slot_id % RING_BUFFER_SIZE` (NO map lookup!)
- [ ] **🎯 Multi-frame**: Use `slot.picture_indices.push_back(CurrPicIdx)` (NOT single index!)
- [ ] Add debug logging for multi-frame packets

**Testing**:
- Verify timestamp → slot_idx direct calculation works
- Verify picture_indices vector correctly stores multiple frames
- Test with video that has multi-frame packets (test_720p_stripe.webm)

**Estimated Time**: 30 minutes
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
### Phase 5: Integration Testing (Option C Validation)

**Test Scenarios**:
1. **Single packet decode** - Verify Option C basic functionality
2. **Multi-frame packet** - Verify vector-based picture_indices handling (test_720p_stripe.webm)
3. **Sequential 3 packets** - Verify FIFO order using m_returnIdCounter
4. **Out-of-order completion** - Verify slot_id → slot_idx mapping (I-frame after P-frame)
5. **RingBuffer overflow** - Verify error handling (9+ concurrent calls)
6. **Decode errors** - Verify graceful failure with multi-frame packets
7. **Performance benchmark** - Measure latency reduction vs old queue-based approach

**Test Files**:
- Simple test video (simple_test.webm) - basic validation
- **Multi-frame packet video (test_720p_stripe.webm)** - critical multi-frame test ⚠️

**Validation Criteria**:
- ✅ No slot ID → slot_idx mapping errors
- ✅ All frames from multi-frame packets detected and polled
- ✅ FIFO order maintained even with out-of-order GPU completion
- ✅ No memory corruption or race conditions

**Estimated Time**: 2 hours
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
### Phase 6: Documentation & Cleanup

**Tasks**:
- [x] Update NVDEC design documentation with Option C and multi-frame support
- [ ] Add inline code comments explaining Option C design choices
- [ ] Remove old queue-based code and any ID→Index mapping attempts
- [ ] Move design doc to `docs/completed/` after successful implementation
- [ ] Document multi-frame packet behavior and first-frame-only decision

**Estimated Time**: 1 hour
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
## 🎯 Option C Design Summary
|
|
|
|
|
|
|
|
|
|
### **Core Principle**: Eliminate mapping overhead through deterministic calculation
|
|
|
|
|
|
|
|
|
|
**Key Components**:
1. **Single Counter for Dual Purpose**: `m_slotIdCounter` serves as both unique ID and submission order
2. **Direct Slot Calculation**: `slot_idx = slot_id % RING_BUFFER_SIZE` (no map needed)
3. **FIFO via ID Comparison**: `while(m_returnIdCounter != my_id)` ensures ordering
4. **Multi-Frame Vector**: `std::vector<int> picture_indices` handles packets with multiple frames
|
|
|
|
|
|
|
|
|
|
**Data Flow**:
|
|
|
|
|
```
DecodeToSurface:
  my_id = m_slotIdCounter++ (e.g., 17)
  slot_idx = 17 % 8 = 1
  packet.timestamp = 17

HandlePictureDecode:
  slot_id = pic_params->nTimeStamp (17)
  slot_idx = 17 % 8 = 1
  m_ringBuffer[1].picture_indices.push_back(CurrPicIdx)

PollingThread:
  return_id = m_returnIdCounter (17)
  slot_idx = 17 % 8 = 1
  Check all m_ringBuffer[1].picture_indices[]

DecodeToSurface (wait):
  while(m_returnIdCounter != 17) { wait }
  Process frame from picture_indices[0]
  m_returnIdCounter++ (18)
```
|
|
|
|
|
|
|
|
|
|
**Eliminated Complexity**:
- ❌ No `std::unordered_map<uint64_t, size_t>` mapping
- ❌ No global map mutex
- ❌ No map insert/erase operations
- ❌ No lookup failures or stale entries
|
|
|
|
|
|
|
|
|
|
**Multi-Frame Packet Handling**:
- One packet → multiple HandlePictureDecode calls
- All frames stored in `picture_indices` vector
- PollingThread checks ALL frames complete
- Return first frame only (`picture_indices[0]`)
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
## 📊 Performance Analysis
|
|
|
|
|
|
|
|
|
|
### Expected Improvements
|
|
|
|
|
|
|
|
|
|
#### Scenario 1: Sequential 3-Frame Decode
|
|
|
|
|
|
|
|
|
|
**Current (Sequential)**:
|
|
|
|
|
```
Frame 1: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Frame 2: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Frame 3: [Submit 1ms][Wait 10ms][Map 1ms][Copy 2ms][Unmap 1ms] = 15ms
Total: 45ms
```
|
|
|
|
|
|
|
|
|
|
**With RingBuffer (Parallel)**:
```
Submit all 3: [S1][S2][S3] = 3ms
GPU decode: [─── 10ms ───] (parallel)
Sequential return:
  Frame 1: [Wait][Map][Copy][Unmap] = 5ms
  Frame 2: [Map][Copy][Unmap] = 4ms (already decoded)
  Frame 3: [Map][Copy][Unmap] = 4ms (already decoded)
Total: 3ms + 10ms + 13ms = 26ms

Improvement: 45ms → 26ms (~42% less wall-clock time)
```

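The Scenario 1 arithmetic can be reproduced with a small cost model. This is only a sketch of the estimate, not a benchmark: `StageCosts`, `SequentialTotalMs`, `PipelinedTotalMs`, and `ReductionPercent` are hypothetical names, and the 1 ms `residual_wait_ms` is an assumed reading of why frame 1 costs 5 ms on return while frames 2 and 3 cost 4 ms.

```cpp
#include <cassert>

// Per-stage costs in milliseconds, taken from the Scenario 1 figures.
struct StageCosts {
    int submit = 1;
    int decode_wait = 10;
    int map = 1;
    int copy = 2;
    int unmap = 1;
};

// Current path: every frame pays every stage, back to back.
int SequentialTotalMs(int frames, const StageCosts& c) {
    return frames * (c.submit + c.decode_wait + c.map + c.copy + c.unmap);
}

// RingBuffer path as modelled in the estimate: submit everything, pay one
// decode latency (frames assumed to decode concurrently), then return the
// frames one by one. residual_wait_ms is an assumption (see lead-in).
int PipelinedTotalMs(int frames, const StageCosts& c, int residual_wait_ms = 1) {
    const int submit_all = frames * c.submit;
    const int gpu_decode = c.decode_wait;
    const int sequential_return =
        residual_wait_ms + frames * (c.map + c.copy + c.unmap);
    return submit_all + gpu_decode + sequential_return;
}

// Reduction in total time, as a whole percentage (truncated).
int ReductionPercent(int before_ms, int after_ms) {
    assert(before_ms > 0);
    return (before_ms - after_ms) * 100 / before_ms;
}
```

Calling `SequentialTotalMs(3, StageCosts{})` and `PipelinedTotalMs(3, StageCosts{})` gives the 45 ms and 26 ms totals quoted above. Changing `decode_wait` or the frame count shows how quickly the estimate shifts when the concurrency assumption does not hold.
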

---

#### Scenario 2: Complex I/P Frame Mix

**Video**: per-frame cost of I-frame (15ms), P-frame (3ms), P-frame (3ms)

**Current**:
```
I: 15ms
P: 3ms
P: 3ms
Total: 21ms
```

**With RingBuffer**:
```
Submit: 3ms
GPU: [I:15ms, P:3ms, P:3ms overlap] = 15ms (limited by I-frame)
Return: 9ms (sequential)
Total: 27ms (slower due to overhead)

But if I-frame completes first: 21ms (same)
```

**Note**: The benefit depends on GOP structure. When a single expensive frame dominates decode time, as here, the RingBuffer path may show no improvement or a small regression. Simple videos may also see less improvement.


---

### CPU & GPU Utilization

**Current**:
```
CPU: [Active──────][Idle────────][Active──────][Idle────────]
GPU: [Decode──────] [Decode──────]
```

**With RingBuffer**:
```
CPU: [Active][Active][Active][─────Wait─────][Active][Active]
GPU: [Decode1][Decode2][Decode3]
```

**GPU Utilization**: ⬆️ Estimated increase of ~30%
**CPU Utilization**: ➡️ Similar (the waiting moves to different places)


---

## 🛡️ Risk Assessment

### Risk 1: RingBuffer Overflow

**Scenario**: Application calls `DecodeToSurface()` 9+ times before first frame completes

**Mitigation**:
```cpp
if (my_slot.in_use) {
    LogError("RingBuffer overflow");
    return false;  // Fail fast
}
```

**Alternative**: Block until a slot becomes available
```cpp
// Assumes in_use is a std::atomic<bool>; polling a plain bool from
// another thread would be a data race.
while (my_slot.in_use) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
```

**Recommendation**: Fail fast and document the maximum number of in-flight frames (8, i.e. `RING_BUFFER_SIZE`)

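A fail-fast claim can be sketched as follows. This is a sketch under stated assumptions rather than the decoder's implementation: `SlotClaimer`, `TryClaim`, and `Release` are hypothetical names, `in_use` is assumed to be a `std::atomic<bool>`, and a single submitting thread is assumed. One deliberate difference from the snippet above is that the ID counter is advanced only after the slot has been claimed, so a rejected call does not consume an ID that the FIFO return order would later wait for.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t RING_BUFFER_SIZE = 8;

struct Slot {
    std::atomic<bool> in_use{false};
};

// Hypothetical helper: assumes exactly one thread calls TryClaim().
class SlotClaimer {
public:
    // On success writes the claimed ID and returns true.
    // On overflow returns false and leaves the ID counter untouched.
    bool TryClaim(std::uint64_t& out_id) {
        const std::uint64_t id = m_slotIdCounter.load();
        Slot& slot = m_ringBuffer[id % RING_BUFFER_SIZE];
        bool expected = false;
        if (!slot.in_use.compare_exchange_strong(expected, true)) {
            return false;  // fail fast: slot still owned by an older frame
        }
        m_slotIdCounter.store(id + 1);
        out_id = id;
        return true;
    }

    // Called once the frame for this ID has been returned to the caller.
    void Release(std::uint64_t id) {
        Slot& slot = m_ringBuffer[id % RING_BUFFER_SIZE];
        assert(slot.in_use.load() && "releasing a slot that was never claimed");
        slot.in_use.store(false);
    }

private:
    std::atomic<std::uint64_t> m_slotIdCounter{0};
    std::array<Slot, RING_BUFFER_SIZE> m_ringBuffer;
};
```

With eight slots, the ninth consecutive `TryClaim` returns false until some earlier ID is released, which matches the documented limit of 8 in-flight frames.
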

---

### Risk 2: Decode Timeout Deadlock

**Scenario**: GPU hangs, polling thread never sets `is_ready`

**Mitigation**:
```cpp
if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500), ...)) {
    // Timeout - skip this slot
    my_slot.in_use = false;
    m_returnIdCounter.fetch_add(1);  // Advance to prevent deadlock
    return false;
}
```

**Risk**: Skipping a slot may cause later slots to fail, because a reference frame they depend on is missing

**Recommendation**: Return an error and let the application decide whether to retry or abort

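The timeout path can be exercised in isolation with standard C++ only. The sketch below is illustrative: `WaitSlot`, `WaitResult`, and `WaitForFrame` are hypothetical names, the predicate shown (`is_ready`) is an assumption about what the elided `...` argument checks, and the return counter is passed in rather than being a decoder member.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct WaitSlot {
    std::mutex mtx;
    std::condition_variable frame_ready;
    bool is_ready = false;  // set by the polling thread under mtx
    bool in_use = true;
};

enum class WaitResult { Ready, TimedOut };

// Waits until the slot is ready or the timeout expires. On timeout the
// slot is released and the return counter is advanced so that later
// slots are not blocked behind a frame that never arrives.
WaitResult WaitForFrame(WaitSlot& slot,
                        std::atomic<std::uint64_t>& return_id_counter,
                        std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(slot.mtx);
    const bool ready = slot.frame_ready.wait_for(
        lock, timeout, [&slot] { return slot.is_ready; });
    if (!ready) {
        slot.in_use = false;
        return_id_counter.fetch_add(1);
        return WaitResult::TimedOut;
    }
    return WaitResult::Ready;
}
```

Returning a distinct `TimedOut` result, rather than a bare `false`, is one way to let the caller tell a hung decode apart from other failures before deciding to retry or abort.
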

---

### Risk 3: Polling Thread CPU Usage

**100us sleep → up to ~10,000 checks/second**

**CPU Impact**:
```
Per check: ~1us (cuvidGetDecodeStatus overhead)
Total: 10,000 * 1us = 10ms/sec ≈ 1% of one CPU core
```

**Mitigation**: Acceptable overhead for real-time video decoding

**Alternative**: Adaptive polling (faster when frames are pending, slower when idle)

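One possible shape for adaptive polling is an interval that resets to the minimum whenever work is pending and backs off exponentially while idle. This is a sketch, not part of the design: `AdaptivePollInterval` and `Next` are hypothetical names, the 100 us minimum comes from the figure above, and the 5 ms ceiling is an assumed value chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Micros = std::chrono::microseconds;

constexpr Micros kMinPollInterval{100};   // from the 100us figure above
constexpr Micros kMaxPollInterval{5000};  // assumed idle ceiling

class AdaptivePollInterval {
public:
    // Call once per polling iteration; returns how long to sleep next.
    Micros Next(bool frames_pending) {
        if (frames_pending) {
            m_current = kMinPollInterval;  // react quickly while decoding
        } else {
            const Micros doubled = m_current * 2;  // back off while idle
            m_current = std::min(doubled, kMaxPollInterval);
        }
        assert(m_current >= kMinPollInterval && m_current <= kMaxPollInterval);
        return m_current;
    }

private:
    Micros m_current{kMinPollInterval};
};
```

The polling loop would pass the result to `std::this_thread::sleep_for`. Note that the sleep actually achieved depends on the OS timer resolution, so very short intervals may be rounded up in practice.
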

---

### Risk 4: Thread Safety - CUDA Context

**Issue**: `cuvidGetDecodeStatus()` requires correct CUDA context

**Current Protection**:
```cpp
std::lock_guard<std::mutex> contextLock(m_cudaContextMutex);
cuCtxSetCurrent(m_cuContext);
```

**In Polling Thread**:
```cpp
void PollingThreadFunc() {
    // Set CUDA context once at thread start
    cuCtxSetCurrent(m_cuContext);

    while (m_pollingRunning) {
        // cuvidGetDecodeStatus() now safe
        cuvidGetDecodeStatus(...);
    }
}
```

**Recommendation**: Set context once per thread (not per call)


---

## 📚 References

### NVIDIA Documentation

1. **NVDEC Video Decoder API Programming Guide**
   - URL: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/nvdec-video-decoder-api-prog-guide/
   - Section: "cuvidGetDecodeStatus" API reference

2. **Video Codec SDK 12.2**
   - URL: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/
   - Section: "NVDECODE API" - asynchronous decoding patterns

3. **CUDA Programming Guide**
   - URL: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
   - Section: "Multi-threading" - CUDA context management

### Previous Design Documents

1. **NVDEC Deferred Decoder Recreation Design** (Abandoned)
   - Path: `docs/completed/windows/nvdec/NVDEC_Deferred_Decoder_Recreation_Design.md`
   - Reason for abandonment: Incompatible with NVDEC synchronous callback architecture
   - Lessons learned: NVDEC parser requires immediate decoder availability

2. **VavCore Library Design**
   - Path: `VavCore_Library_Design.md`
   - Relevant sections: C API design, cross-platform surface types

### Code References

1. **Current DecodeToSurface Implementation**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 976-1200
   - Current approach: Synchronous wait with condition variable

2. **HandlePictureDisplay Callback**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 791-813
   - Current approach: Push picture_index to queue, notify condition variable

3. **NVDEC Initialization**
   - File: `platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp`
   - Lines: 600-650
   - Key setting: `m_createInfo.ulNumOutputSurfaces = 8`


---

## 🔄 Change Log

| Date       | Author | Change Description                    |
|------------|--------|---------------------------------------|
| 2025-10-04 | Claude | Initial design document created       |
| 2025-10-04 | Claude | Added RingBuffer architecture details |
| 2025-10-04 | Claude | Added polling thread design           |
| 2025-10-04 | Claude | Added performance analysis            |


---

## ✅ Approval & Sign-off

**Design Review**:
- [ ] Architecture approved
- [ ] Performance estimates validated
- [ ] Risk mitigation strategies accepted

**Implementation Ready**:
- [ ] All design questions resolved
- [ ] Implementation plan reviewed
- [ ] Test plan defined

---

**Next Steps**: Proceed to Phase 1 - Data Structure Setup