2025-10-07 03:49:32 +09:00
|
|
|
# CUDA Driver API Complete Unification Design
|
|
|
|
|
|
|
|
|
|
## Problem Analysis
|
|
|
|
|
|
|
|
|
|
### Root Cause: Runtime API + Driver API Mixing
|
|
|
|
|
|
|
|
|
|
**Current Issue:**
|
|
|
|
|
- `CUDA_ERROR_INVALID_HANDLE` occurs when kernel launches
|
|
|
|
|
- Surface object created successfully but cannot be used in kernel
|
|
|
|
|
- Context setting doesn't resolve the fundamental problem
|
|
|
|
|
|
|
|
|
|
**Why Mixing APIs Causes Issues:**
|
|
|
|
|
|
|
|
|
|
1. **Different Context Management:**
|
|
|
|
|
- Runtime API: Automatic context creation/management (implicit primary context)
|
|
|
|
|
- Driver API: Explicit context creation/management (requires `cuCtxSetCurrent()`)
|
|
|
|
|
- **Mixed usage can result in operations happening in different contexts**
|
|
|
|
|
|
|
|
|
|
2. **Handle Incompatibility:**
|
|
|
|
|
- Runtime API handles: Bound to Runtime API context
|
|
|
|
|
- Driver API handles: Bound to Driver API context
|
|
|
|
|
- **A surface created in Runtime context cannot be used in Driver context**
|
|
|
|
|
|
|
|
|
|
3. **NVDEC Constraint:**
|
|
|
|
|
- NVDEC uses **pure Driver API** (`cuvidCreateDecoder`, `cuvidMapVideoFrame`, etc.)
|
|
|
|
|
- All subsequent CUDA operations must use **the same Driver API context**
|
|
|
|
|
- Mixing Runtime API breaks this constraint
|
|
|
|
|
|
|
|
|
|
### Historical Context: How We Got Here
|
|
|
|
|
|
|
|
|
|
**Original Implementation (commit 73d9d8d):**
|
|
|
|
|
- Fully Runtime API based
|
|
|
|
|
- `cudaSurfaceObject_t`, `cudaCreateSurfaceObject()`
|
|
|
|
|
- Worked in isolation but failed with NVDEC integration
|
|
|
|
|
|
|
|
|
|
**Partial Fix Attempt:**
|
|
|
|
|
- Changed surface creation to Driver API (`cuSurfObjectCreate()`)
|
|
|
|
|
- Left external memory import as Runtime API
|
|
|
|
|
- **Result: Still mixing APIs → Still failing**
|
|
|
|
|
|
|
|
|
|
**Current State:**
|
|
|
|
|
```cpp
|
|
|
|
|
// Runtime API (wrong context)
|
|
|
|
|
cudaImportExternalMemory(&external_memory, &mem_desc);
|
|
|
|
|
cudaExternalMemoryGetMappedMipmappedArray(&mipmapped_array, ...);
|
|
|
|
|
|
|
|
|
|
// Driver API (NVDEC context)
|
|
|
|
|
cuSurfObjectCreate(&surface, &res_desc);
|
|
|
|
|
cuLaunchKernel(kernel, ...);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
## Solution: Complete Driver API Unification
|
|
|
|
|
|
|
|
|
|
### Phase 1: API Conversion (Current Task)
|
|
|
|
|
|
|
|
|
|
**Files to Modify:**
|
|
|
|
|
1. `ExternalMemoryCache.h` - Update type signatures
|
|
|
|
|
2. `ExternalMemoryCache.cpp` - Convert all Runtime API calls
|
|
|
|
|
3. `D3D12SurfaceHandler.cpp` - Convert remaining Runtime API calls
|
|
|
|
|
|
|
|
|
|
**Conversion Map:**
|
|
|
|
|
|
|
|
|
|
| Runtime API | Driver API | Notes |
|
|
|
|
|
|-------------|-----------|-------|
|
|
|
|
|
| `cudaExternalMemory_t` | `CUDA_EXTERNAL_MEMORY_HANDLE_DESC` + `CUexternalMemory` | Type change required |
|
|
|
|
|
| `cudaImportExternalMemory()` | `cuImportExternalMemory()` | Direct replacement |
|
|
|
|
|
| `cudaExternalMemoryGetMappedMipmappedArray()` | `cuExternalMemoryGetMappedMipmappedArray()` | Direct replacement |
|
|
|
|
|
| `cudaGetMipmappedArrayLevel()` | `cuMipmappedArrayGetLevel()` | Direct replacement |
|
|
|
|
|
| `cudaMipmappedArray_t` | `CUmipmappedArray` | Type change |
|
|
|
|
|
| `cudaArray_t` | `CUarray` | Type change |
|
|
|
|
|
| `cudaMemcpy2D()` | `cuMemcpy2D()` | Async version: `cuMemcpy2DAsync()` |
|
|
|
|
|
| `cudaDestroyExternalMemory()` | `cuDestroyExternalMemory()` | Direct replacement |
|
|
|
|
|
| `cudaFreeMipmappedArray()` | `cuMipmappedArrayDestroy()` | Direct replacement |
|
|
|
|
|
|
|
|
|
|
### Phase 2: Context Verification (After Unification)
|
|
|
|
|
|
|
|
|
|
After complete Driver API unification:
|
|
|
|
|
1. Verify NVDEC context is properly passed to all components
|
|
|
|
|
2. Ensure `cuCtxSetCurrent()` is called before Driver API operations
|
|
|
|
|
3. Confirm all operations use the same context
|
|
|
|
|
|
|
|
|
|
### Expected Benefits
|
|
|
|
|
|
|
|
|
|
✅ **Single Context:** All operations in NVDEC's Driver API context
|
|
|
|
|
✅ **Handle Compatibility:** All handles created/used in same context
|
|
|
|
|
✅ **Clear Debugging:** Unified API makes issues easier to diagnose
|
|
|
|
|
✅ **NVDEC Alignment:** Matches NVDEC's native API paradigm
|
|
|
|
|
✅ **Stability:** Eliminates context switching issues
|
|
|
|
|
|
|
|
|
|
## Implementation Plan
|
|
|
|
|
|
|
|
|
|
### Step 1: Update Type Definitions
|
|
|
|
|
|
|
|
|
|
**ExternalMemoryCache.h:**
|
|
|
|
|
```cpp
|
|
|
|
|
// Before (Runtime API types)
|
|
|
|
|
struct CachedEntry {
|
|
|
|
|
cudaExternalMemory_t external_memory;
|
|
|
|
|
cudaMipmappedArray_t mipmapped_array;
|
|
|
|
|
// ...
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
// After (Driver API types)
|
|
|
|
|
struct CachedEntry {
|
|
|
|
|
CUexternalMemory external_memory;
|
|
|
|
|
CUmipmappedArray mipmapped_array;
|
|
|
|
|
// ...
|
|
|
|
|
};
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
### Step 2: Convert External Memory Import
|
|
|
|
|
|
|
|
|
|
**ExternalMemoryCache.cpp - ImportD3D12TextureAsSurface():**
|
|
|
|
|
```cpp
|
|
|
|
|
// Before (Runtime API)
|
|
|
|
|
cudaExternalMemoryHandleDesc mem_desc = {};
|
|
|
|
|
mem_desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
|
|
|
|
|
// ...
|
|
|
|
|
cudaError_t err = cudaImportExternalMemory(&external_memory, &mem_desc);
|
|
|
|
|
|
|
|
|
|
// After (Driver API)
|
|
|
|
|
CUDA_EXTERNAL_MEMORY_HANDLE_DESC mem_desc = {};
|
|
|
|
|
mem_desc.type = CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE;
|
|
|
|
|
// ...
|
|
|
|
|
CUresult result = cuImportExternalMemory(&external_memory, &mem_desc);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
### Step 3: Convert Mipmapped Array Operations
|
|
|
|
|
|
|
|
|
|
```cpp
|
|
|
|
|
// Before (Runtime API)
|
|
|
|
|
cudaExternalMemoryGetMappedMipmappedArray(&mipmapped_array, external_memory, &mipmap_desc);
|
|
|
|
|
cudaGetMipmappedArrayLevel(&array, mipmapped_array, 0);
|
|
|
|
|
|
|
|
|
|
// After (Driver API)
|
|
|
|
|
cuExternalMemoryGetMappedMipmappedArray(&mipmapped_array, external_memory, &mipmap_desc);
|
|
|
|
|
cuMipmappedArrayGetLevel(&array, mipmapped_array, 0);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
### Step 4: Convert Memory Copy Operations
|
|
|
|
|
|
|
|
|
|
**D3D12SurfaceHandler.cpp - CopyYPlane() / CopyUVPlane():**
|
|
|
|
|
```cpp
|
|
|
|
|
// Before (Runtime API)
|
|
|
|
|
cudaError_t err = cudaMemcpy2D(
|
|
|
|
|
(void*)dst, dst_pitch,
|
|
|
|
|
(void*)src, src_pitch,
|
|
|
|
|
width, height,
|
|
|
|
|
cudaMemcpyDeviceToDevice
|
|
|
|
|
);
|
|
|
|
|
|
|
|
|
|
// After (Driver API)
|
|
|
|
|
CUresult result = cuMemcpy2D(dst, dst_pitch, src, src_pitch, width, height);
|
|
|
|
|
// Or async version:
|
|
|
|
|
// CUresult result = cuMemcpy2DAsync(dst, dst_pitch, src, src_pitch, width, height, stream);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
### Step 5: Convert Cleanup Operations
|
|
|
|
|
|
|
|
|
|
```cpp
|
|
|
|
|
// Before (Runtime API)
|
|
|
|
|
cudaFreeMipmappedArray(mipmapped_array);
|
|
|
|
|
cudaDestroyExternalMemory(external_memory);
|
|
|
|
|
|
|
|
|
|
// After (Driver API)
|
|
|
|
|
cuMipmappedArrayDestroy(mipmapped_array);
|
|
|
|
|
cuDestroyExternalMemory(external_memory);
|
|
|
|
|
```
|
|
|
|
|
|
2025-10-07 04:03:15 +09:00
|
|
|
## Implementation Status
|
|
|
|
|
|
|
|
|
|
### ✅ Phase 1: Driver API Unification - COMPLETED
|
|
|
|
|
|
|
|
|
|
**Completed Tasks:**
|
|
|
|
|
1. ✅ ExternalMemoryCache.cpp - All Runtime API converted to Driver API
|
|
|
|
|
2. ✅ D3D12SurfaceHandler.cpp - CopyYPlane/CopyUVPlane converted to Driver API
|
|
|
|
|
3. ✅ Type definitions updated (CUexternalMemory, CUmipmappedArray, etc.)
|
|
|
|
|
4. ✅ Build successful with Driver API unification
|
|
|
|
|
|
|
|
|
|
**Key Changes:**
|
|
|
|
|
- `cudaImportExternalMemory()` → `cuImportExternalMemory()`
|
|
|
|
|
- `cudaExternalMemoryGetMappedMipmappedArray()` → `cuExternalMemoryGetMappedMipmappedArray()`
|
|
|
|
|
- `cudaMemcpy2D()` → `cuMemcpy2D()` with CUDA_MEMCPY2D struct
|
|
|
|
|
- `cudaGetMipmappedArrayLevel()` → `cuMipmappedArrayGetLevel()`
|
|
|
|
|
|
|
|
|
|
### ✅ Phase 2: Error Recovery Mechanism - COMPLETED
|
|
|
|
|
|
|
|
|
|
**Implemented Features:**
|
|
|
|
|
1. ✅ **D3D12 Resource Cleanup:**
|
|
|
|
|
- `D3D12SurfaceHandler::ReleaseD3D12Resource()` method
|
|
|
|
|
- External memory cache release on error
|
|
|
|
|
|
|
|
|
|
2. ✅ **Slot-based Resource Tracking:**
|
|
|
|
|
- `DecodeSlot` structure enhanced with `d3d12_texture` and `surface_object` fields
|
|
|
|
|
- Resources tracked per-slot for proper cleanup
|
|
|
|
|
|
|
|
|
|
3. ✅ **NV12ToRGBAConverter Reset:**
|
|
|
|
|
- Automatic reset on error to clean state
|
|
|
|
|
- Prevents resource accumulation across failed frames
|
|
|
|
|
|
|
|
|
|
**Error Recovery Flow:**
|
|
|
|
|
```cpp
|
|
|
|
|
if (!copySuccess) {
|
|
|
|
|
// 1. Release D3D12 texture from cache
|
|
|
|
|
m_d3d12Handler->ReleaseD3D12Resource(my_slot.d3d12_texture);
|
|
|
|
|
|
|
|
|
|
// 2. Reset converter to clean state
|
|
|
|
|
m_rgbaConverter.reset();
|
|
|
|
|
m_rgbaConverter = std::make_unique<NV12ToRGBAConverter>();
|
|
|
|
|
|
|
|
|
|
// 3. Clean slot state
|
|
|
|
|
my_slot.d3d12_texture = nullptr;
|
|
|
|
|
my_slot.surface_object = 0;
|
|
|
|
|
my_slot.in_use.store(false);
|
|
|
|
|
}
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
### ❌ Phase 3: CUDA_ERROR_INVALID_HANDLE Investigation - IN PROGRESS
|
|
|
|
|
|
|
|
|
|
**Current Issue:**
|
|
|
|
|
- Driver API unification completed successfully
|
|
|
|
|
- Error recovery mechanism working correctly
|
|
|
|
|
- **BUT: `CUDA_ERROR_INVALID_HANDLE` persists during kernel launch**
|
|
|
|
|
|
|
|
|
|
**Verified:**
|
|
|
|
|
- ✅ Surface object created successfully (`surface=0x13`)
|
|
|
|
|
- ✅ Kernel handle valid (`kernel=000001DF7D337BC0`)
|
|
|
|
|
- ✅ Stream handle valid (`stream=000001DF073C08C0`)
|
|
|
|
|
- ✅ Source RGBA pointer valid (`src_rgba=0x130B200000`)
|
|
|
|
|
- ❌ Kernel launch fails with `CUDA_ERROR_INVALID_HANDLE`
|
|
|
|
|
|
|
|
|
|
**Hypothesis:**
|
|
|
|
|
The issue is NOT related to API mixing (fully unified to Driver API).
|
|
|
|
|
Possible causes:
|
|
|
|
|
1. Surface object type mismatch in kernel argument passing
|
|
|
|
|
2. CUDA context issue despite Driver API unification
|
|
|
|
|
3. Kernel parameter pointer alignment or type issues
|
|
|
|
|
|
|
|
|
|
### Context Management Analysis
|
|
|
|
|
|
|
|
|
|
**Test Results:**
|
|
|
|
|
- Removed all `cuCtxSetCurrent()` calls - Error persists
|
|
|
|
|
- This confirms context was already set correctly by `cuCtxCreate()`
|
|
|
|
|
- NVDEC context is active throughout the pipeline
|
|
|
|
|
|
|
|
|
|
**Conclusion:**
|
|
|
|
|
`cuCtxSetCurrent()` was unnecessary defensive programming. The real issue lies elsewhere.
|
|
|
|
|
|
2025-10-07 03:49:32 +09:00
|
|
|
## Testing Strategy
|
|
|
|
|
|
|
|
|
|
### Verification Points
|
|
|
|
|
|
2025-10-07 04:03:15 +09:00
|
|
|
1. ✅ **Build Success:** All type changes compile without errors
|
|
|
|
|
2. ✅ **Error Recovery:** Resources properly cleaned up on failure
|
|
|
|
|
3. ✅ **Context Consistency:** Single NVDEC context used throughout
|
|
|
|
|
4. ❌ **Kernel Execution:** Still fails with CUDA_ERROR_INVALID_HANDLE
|
|
|
|
|
5. ❌ **Frame Output:** No frames decoded successfully
|
2025-10-07 03:49:32 +09:00
|
|
|
|
|
|
|
|
### Debug Logging
|
|
|
|
|
|
|
|
|
|
Add diagnostic logs to verify context usage:
|
|
|
|
|
```cpp
|
|
|
|
|
CUcontext current_ctx;
|
|
|
|
|
cuCtxGetCurrent(¤t_ctx);
|
|
|
|
|
LOGF_DEBUG("[Component] Using CUDA context: 0x%llX", (unsigned long long)current_ctx);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Compare context handles across:
|
|
|
|
|
- NVDEC initialization
|
|
|
|
|
- External memory import
|
|
|
|
|
- Surface creation
|
|
|
|
|
- Kernel launch
|
|
|
|
|
|
|
|
|
|
All should show **the same context value**.
|
|
|
|
|
|
|
|
|
|
## Risk Mitigation
|
|
|
|
|
|
|
|
|
|
### Potential Issues
|
|
|
|
|
|
|
|
|
|
1. **API Signature Differences:**
|
|
|
|
|
- Some Driver API functions have different parameter orders
|
|
|
|
|
- Careful review of CUDA documentation required
|
|
|
|
|
|
|
|
|
|
2. **Error Handling:**
|
|
|
|
|
- Runtime API returns `cudaError_t`
|
|
|
|
|
- Driver API returns `CUresult`
|
|
|
|
|
- Update error checking code accordingly
|
|
|
|
|
|
|
|
|
|
3. **Enum Value Changes:**
|
|
|
|
|
- Runtime: `cudaExternalMemoryHandleTypeD3D12Resource`
|
|
|
|
|
- Driver: `CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE`
|
|
|
|
|
- Ensure correct enum mapping
|
|
|
|
|
|
|
|
|
|
### Rollback Plan
|
|
|
|
|
|
|
|
|
|
If Driver API unification fails:
|
|
|
|
|
- Git revert to current commit
|
|
|
|
|
- Consider alternative: Create separate Runtime API context and synchronize
|
|
|
|
|
- Document why unification was not viable
|
|
|
|
|
|
|
|
|
|
## Success Criteria
|
|
|
|
|
|
2025-10-07 04:03:15 +09:00
|
|
|
### Completed ✅
|
|
|
|
|
- ✅ All Runtime API calls converted to Driver API
|
|
|
|
|
- ✅ Build completes without errors
|
|
|
|
|
- ✅ Single CUDA context used throughout pipeline
|
|
|
|
|
- ✅ Error recovery mechanism implemented
|
|
|
|
|
- ✅ Resource cleanup on failure
|
|
|
|
|
|
|
|
|
|
### In Progress ❌
|
|
|
|
|
- ❌ Surface objects successfully used in kernels
|
|
|
|
|
- ❌ No `CUDA_ERROR_INVALID_HANDLE` errors
|
|
|
|
|
- ❌ Frames decode and render correctly
|
|
|
|
|
|
|
|
|
|
### Next Steps
|
|
|
|
|
1. Investigate kernel parameter passing for surface objects
|
|
|
|
|
2. Verify CUDA Driver API surface object compatibility with D3D12 external memory
|
|
|
|
|
3. Consider alternative approaches if Driver API surface objects incompatible
|
2025-10-07 03:49:32 +09:00
|
|
|
|
|
|
|
|
## References
|
|
|
|
|
|
|
|
|
|
- NVIDIA CUDA Driver API Documentation
|
|
|
|
|
- CUDA Runtime API vs Driver API Comparison
|
|
|
|
|
- NVDEC Programming Guide
|
|
|
|
|
- Previous commit: 73d9d8d (original Runtime API implementation)
|