10 KiB
CUDA Driver API Complete Unification Design
Problem Analysis
Root Cause: Runtime API + Driver API Mixing
Current Issue:
CUDA_ERROR_INVALID_HANDLEoccurs when kernel launches- Surface object created successfully but cannot be used in kernel
- Context setting doesn't resolve the fundamental problem
Why Mixing APIs Causes Issues:
-
Different Context Management:
- Runtime API: Automatic context creation/management (implicit primary context)
- Driver API: Explicit context creation/management (requires
cuCtxSetCurrent()) - Mixed usage can result in operations happening in different contexts
-
Handle Incompatibility:
- Runtime API handles: Bound to Runtime API context
- Driver API handles: Bound to Driver API context
- A surface created in Runtime context cannot be used in Driver context
-
NVDEC Constraint:
- NVDEC uses pure Driver API (
cuvidCreateDecoder,cuvidMapVideoFrame, etc.) - All subsequent CUDA operations must use the same Driver API context
- Mixing Runtime API breaks this constraint
- NVDEC uses pure Driver API (
Historical Context: How We Got Here
Original Implementation (commit 73d9d8d):
- Fully Runtime API based
cudaSurfaceObject_t,cudaCreateSurfaceObject()- Worked in isolation but failed with NVDEC integration
Partial Fix Attempt:
- Changed surface creation to Driver API (
cuSurfObjectCreate()) - Left external memory import as Runtime API
- Result: Still mixing APIs → Still failing
Current State:
// Runtime API (wrong context)
cudaImportExternalMemory(&external_memory, &mem_desc);
cudaExternalMemoryGetMappedMipmappedArray(&mipmapped_array, ...);
// Driver API (NVDEC context)
cuSurfObjectCreate(&surface, &res_desc);
cuLaunchKernel(kernel, ...);
Solution: Complete Driver API Unification
Phase 1: API Conversion (Current Task)
Files to Modify:
ExternalMemoryCache.h- Update type signaturesExternalMemoryCache.cpp- Convert all Runtime API callsD3D12SurfaceHandler.cpp- Convert remaining Runtime API calls
Conversion Map:
| Runtime API | Driver API | Notes |
|---|---|---|
cudaExternalMemory_t |
CUDA_EXTERNAL_MEMORY_HANDLE_DESC + CUexternalMemory |
Type change required |
cudaImportExternalMemory() |
cuImportExternalMemory() |
Direct replacement |
cudaExternalMemoryGetMappedMipmappedArray() |
cuExternalMemoryGetMappedMipmappedArray() |
Direct replacement |
cudaGetMipmappedArrayLevel() |
cuMipmappedArrayGetLevel() |
Direct replacement |
cudaMipmappedArray_t |
CUmipmappedArray |
Type change |
cudaArray_t |
CUarray |
Type change |
cudaMemcpy2D() |
cuMemcpy2D() |
Async version: cuMemcpy2DAsync() |
cudaDestroyExternalMemory() |
cuDestroyExternalMemory() |
Direct replacement |
cudaFreeMipmappedArray() |
cuMipmappedArrayDestroy() |
Direct replacement |
Phase 2: Context Verification (After Unification)
After complete Driver API unification:
- Verify NVDEC context is properly passed to all components
- Ensure
cuCtxSetCurrent()is called before Driver API operations - Confirm all operations use the same context
Expected Benefits
✅ Single Context: All operations in NVDEC's Driver API context ✅ Handle Compatibility: All handles created/used in same context ✅ Clear Debugging: Unified API makes issues easier to diagnose ✅ NVDEC Alignment: Matches NVDEC's native API paradigm ✅ Stability: Eliminates context switching issues
Implementation Plan
Step 1: Update Type Definitions
ExternalMemoryCache.h:
// Before (Runtime API types)
struct CachedEntry {
cudaExternalMemory_t external_memory;
cudaMipmappedArray_t mipmapped_array;
// ...
};
// After (Driver API types)
struct CachedEntry {
CUexternalMemory external_memory;
CUmipmappedArray mipmapped_array;
// ...
};
Step 2: Convert External Memory Import
ExternalMemoryCache.cpp - ImportD3D12TextureAsSurface():
// Before (Runtime API)
cudaExternalMemoryHandleDesc mem_desc = {};
mem_desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
// ...
cudaError_t err = cudaImportExternalMemory(&external_memory, &mem_desc);
// After (Driver API)
CUDA_EXTERNAL_MEMORY_HANDLE_DESC mem_desc = {};
mem_desc.type = CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE;
// ...
CUresult result = cuImportExternalMemory(&external_memory, &mem_desc);
Step 3: Convert Mipmapped Array Operations
// Before (Runtime API)
cudaExternalMemoryGetMappedMipmappedArray(&mipmapped_array, external_memory, &mipmap_desc);
cudaGetMipmappedArrayLevel(&array, mipmapped_array, 0);
// After (Driver API)
cuExternalMemoryGetMappedMipmappedArray(&mipmapped_array, external_memory, &mipmap_desc);
cuMipmappedArrayGetLevel(&array, mipmapped_array, 0);
Step 4: Convert Memory Copy Operations
D3D12SurfaceHandler.cpp - CopyYPlane() / CopyUVPlane():
// Before (Runtime API)
cudaError_t err = cudaMemcpy2D(
(void*)dst, dst_pitch,
(void*)src, src_pitch,
width, height,
cudaMemcpyDeviceToDevice
);
// After (Driver API)
CUresult result = cuMemcpy2D(dst, dst_pitch, src, src_pitch, width, height);
// Or async version:
// CUresult result = cuMemcpy2DAsync(dst, dst_pitch, src, src_pitch, width, height, stream);
Step 5: Convert Cleanup Operations
// Before (Runtime API)
cudaFreeMipmappedArray(mipmapped_array);
cudaDestroyExternalMemory(external_memory);
// After (Driver API)
cuMipmappedArrayDestroy(mipmapped_array);
cuDestroyExternalMemory(external_memory);
Implementation Status
✅ Phase 1: Driver API Unification - COMPLETED
Completed Tasks:
- ✅ ExternalMemoryCache.cpp - All Runtime API converted to Driver API
- ✅ D3D12SurfaceHandler.cpp - CopyYPlane/CopyUVPlane converted to Driver API
- ✅ Type definitions updated (CUexternalMemory, CUmipmappedArray, etc.)
- ✅ Build successful with Driver API unification
Key Changes:
cudaImportExternalMemory()→cuImportExternalMemory()cudaExternalMemoryGetMappedMipmappedArray()→cuExternalMemoryGetMappedMipmappedArray()cudaMemcpy2D()→cuMemcpy2D()with CUDA_MEMCPY2D structcudaGetMipmappedArrayLevel()→cuMipmappedArrayGetLevel()
✅ Phase 2: Error Recovery Mechanism - COMPLETED
Implemented Features:
-
✅ D3D12 Resource Cleanup:
D3D12SurfaceHandler::ReleaseD3D12Resource()method- External memory cache release on error
-
✅ Slot-based Resource Tracking:
DecodeSlotstructure enhanced withd3d12_textureandsurface_objectfields- Resources tracked per-slot for proper cleanup
-
✅ NV12ToRGBAConverter Reset:
- Automatic reset on error to clean state
- Prevents resource accumulation across failed frames
Error Recovery Flow:
if (!copySuccess) {
// 1. Release D3D12 texture from cache
m_d3d12Handler->ReleaseD3D12Resource(my_slot.d3d12_texture);
// 2. Reset converter to clean state
m_rgbaConverter.reset();
m_rgbaConverter = std::make_unique<NV12ToRGBAConverter>();
// 3. Clean slot state
my_slot.d3d12_texture = nullptr;
my_slot.surface_object = 0;
my_slot.in_use.store(false);
}
❌ Phase 3: CUDA_ERROR_INVALID_HANDLE Investigation - IN PROGRESS
Current Issue:
- Driver API unification completed successfully
- Error recovery mechanism working correctly
- BUT:
CUDA_ERROR_INVALID_HANDLEpersists during kernel launch
Verified:
- ✅ Surface object created successfully (
surface=0x13) - ✅ Kernel handle valid (
kernel=000001DF7D337BC0) - ✅ Stream handle valid (
stream=000001DF073C08C0) - ✅ Source RGBA pointer valid (
src_rgba=0x130B200000) - ❌ Kernel launch fails with
CUDA_ERROR_INVALID_HANDLE
Hypothesis: The issue is NOT related to API mixing (fully unified to Driver API). Possible causes:
- Surface object type mismatch in kernel argument passing
- CUDA context issue despite Driver API unification
- Kernel parameter pointer alignment or type issues
Context Management Analysis
Test Results:
- Removed all
cuCtxSetCurrent()calls - Error persists - This confirms context was already set correctly by
cuCtxCreate() - NVDEC context is active throughout the pipeline
Conclusion:
cuCtxSetCurrent() was unnecessary defensive programming. The real issue lies elsewhere.
Testing Strategy
Verification Points
- ✅ Build Success: All type changes compile without errors
- ✅ Error Recovery: Resources properly cleaned up on failure
- ✅ Context Consistency: Single NVDEC context used throughout
- ❌ Kernel Execution: Still fails with CUDA_ERROR_INVALID_HANDLE
- ❌ Frame Output: No frames decoded successfully
Debug Logging
Add diagnostic logs to verify context usage:
CUcontext current_ctx;
cuCtxGetCurrent(¤t_ctx);
LOGF_DEBUG("[Component] Using CUDA context: 0x%llX", (unsigned long long)current_ctx);
Compare context handles across:
- NVDEC initialization
- External memory import
- Surface creation
- Kernel launch
All should show the same context value.
Risk Mitigation
Potential Issues
-
API Signature Differences:
- Some Driver API functions have different parameter orders
- Careful review of CUDA documentation required
-
Error Handling:
- Runtime API returns
cudaError_t - Driver API returns
CUresult - Update error checking code accordingly
- Runtime API returns
-
Enum Value Changes:
- Runtime:
cudaExternalMemoryHandleTypeD3D12Resource - Driver:
CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE - Ensure correct enum mapping
- Runtime:
Rollback Plan
If Driver API unification fails:
- Git revert to current commit
- Consider alternative: Create separate Runtime API context and synchronize
- Document why unification was not viable
Success Criteria
Completed ✅
- ✅ All Runtime API calls converted to Driver API
- ✅ Build completes without errors
- ✅ Single CUDA context used throughout pipeline
- ✅ Error recovery mechanism implemented
- ✅ Resource cleanup on failure
In Progress ❌
- ❌ Surface objects successfully used in kernels
- ❌ No
CUDA_ERROR_INVALID_HANDLEerrors - ❌ Frames decode and render correctly
Next Steps
- Investigate kernel parameter passing for surface objects
- Verify CUDA Driver API surface object compatibility with D3D12 external memory
- Consider alternative approaches if Driver API surface objects incompatible
References
- NVIDIA CUDA Driver API Documentation
- CUDA Runtime API vs Driver API Comparison
- NVDEC Programming Guide
- Previous commit:
73d9d8d(original Runtime API implementation)