vav2/docs/working/Driver_API_Unification_Design.md

# CUDA Driver API Complete Unification Design

## Problem Analysis

### Root Cause: Runtime API + Driver API Mixing

**Current Issue:**
- `CUDA_ERROR_INVALID_HANDLE` occurs at kernel launch
- The surface object is created successfully but cannot be used in the kernel
- Setting the context with `cuCtxSetCurrent()` alone does not resolve the underlying problem

**Why Mixing APIs Causes Issues:**

1. **Different Context Management:**
   - Runtime API: Automatic context management (uses the context current on the calling thread, or implicitly initializes the device's primary context if none is current)
   - Driver API: Explicit context creation/management (requires `cuCtxSetCurrent()` on each thread)
   - **Mixed usage can result in operations happening in different contexts**

2. **Handle Incompatibility:**
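   - Runtime API handles: Bound to whichever context the Runtime call ran in (typically the primary context)
   - Driver API handles: Bound to the context that was current when they were created
   - **A surface created in one context cannot be used by a kernel launched in another context**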

3. **NVDEC Constraint:**
   - NVDEC uses **pure Driver API** (`cuvidCreateDecoder`, `cuvidMapVideoFrame`, etc.)
   - All subsequent CUDA operations must use **the same Driver API context**
   - Mixing Runtime API breaks this constraint

### Historical Context: How We Got Here

**Original Implementation (commit 73d9d8d):**
- Fully Runtime API based
- `cudaSurfaceObject_t`, `cudaCreateSurfaceObject()`
- Worked in isolation but failed with NVDEC integration

**Partial Fix Attempt:**
- Changed surface creation to Driver API (`cuSurfObjectCreate()`)
- Left external memory import as Runtime API
- **Result: Still mixing APIs → Still failing**

**Current State:**
```cpp
// Runtime API (wrong context)
cudaImportExternalMemory(&external_memory, &mem_desc);
cudaExternalMemoryGetMappedMipmappedArray(&mipmapped_array, ...);

// Driver API (NVDEC context)
cuSurfObjectCreate(&surface, &res_desc);
cuLaunchKernel(kernel, ...);
```

## Solution: Complete Driver API Unification

### Phase 1: API Conversion (Current Task)

**Files to Modify:**
1. `ExternalMemoryCache.h` - Update type signatures
2. `ExternalMemoryCache.cpp` - Convert all Runtime API calls
3. `D3D12SurfaceHandler.cpp` - Convert remaining Runtime API calls

**Conversion Map:**

| Runtime API | Driver API | Notes |
|-------------|-----------|-------|
| `cudaExternalMemory_t` | `CUexternalMemory` | Type change |
| `cudaExternalMemoryHandleDesc` | `CUDA_EXTERNAL_MEMORY_HANDLE_DESC` | Type change; enum values are renamed |
| `cudaImportExternalMemory()` | `cuImportExternalMemory()` | Direct replacement |
| `cudaExternalMemoryGetMappedMipmappedArray()` | `cuExternalMemoryGetMappedMipmappedArray()` | Direct replacement |
| `cudaGetMipmappedArrayLevel()` | `cuMipmappedArrayGetLevel()` | Direct replacement |
| `cudaMipmappedArray_t` | `CUmipmappedArray` | Type change |
| `cudaArray_t` | `CUarray` | Type change |
| `cudaMemcpy2D()` | `cuMemcpy2D()` | Takes a `CUDA_MEMCPY2D` descriptor instead of positional arguments; async version: `cuMemcpy2DAsync()` |
| `cudaDestroyExternalMemory()` | `cuDestroyExternalMemory()` | Direct replacement |
| `cudaFreeMipmappedArray()` | `cuMipmappedArrayDestroy()` | Direct replacement |

### Phase 2: Context Verification (After Unification)

After complete Driver API unification:
1. Verify NVDEC context is properly passed to all components
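2. Ensure `cuCtxSetCurrent()` (or a `cuCtxPushCurrent()`/`cuCtxPopCurrent()` pair) is called on every thread before it issues Driver API operations, since the current context is per-thread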
3. Confirm all operations use the same context
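
One way to make the context requirement hard to forget is a small scope guard that pushes the NVDEC context on entry and pops it on exit. The sketch below is illustrative only: `ScopedContext` is a hypothetical helper, not something from the code base, and it is written as a template over the push/pop callables so it compiles without the CUDA toolkit. With the real Driver API, `Context` would be `CUcontext`, and the callables would be `cuCtxPushCurrent` and `cuCtxPopCurrent`.

```cpp
#include <cassert>
#include <utility>

// Hypothetical scope guard: makes `ctx` current for the lifetime of the object.
// Assumed wiring with the Driver API: Context = CUcontext,
// push = cuCtxPushCurrent, pop = cuCtxPopCurrent.
template <typename Context, typename Push, typename Pop>
class ScopedContext {
public:
    ScopedContext(Context ctx, Push push, Pop pop)
        : ctx_(ctx), pop_(std::move(pop)) {
        push(ctx_);  // real code should check the returned CUresult
    }

    ~ScopedContext() {
        Context popped{};
        pop_(&popped);
        // A balanced push/pop pair hands back the context we pushed.
        assert(popped == ctx_);
        (void)popped;  // silence unused-variable warnings when NDEBUG is set
    }

    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;

private:
    Context ctx_;
    Pop pop_;
};

// Usage with the real Driver API (requires <cuda.h>, not compiled here;
// `nvdec_ctx` is a placeholder name for the decoder's context):
//   ScopedContext<CUcontext, decltype(&cuCtxPushCurrent), decltype(&cuCtxPopCurrent)>
//       guard(nvdec_ctx, &cuCtxPushCurrent, &cuCtxPopCurrent);
```

Push/pop is used here rather than `cuCtxSetCurrent()` so that whatever context the calling thread had before is restored when the guard goes out of scope.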

### Expected Benefits

- ✅ **Single Context:** All operations in NVDEC's Driver API context
- ✅ **Handle Compatibility:** All handles created/used in same context
- ✅ **Clear Debugging:** Unified API makes issues easier to diagnose
- ✅ **NVDEC Alignment:** Matches NVDEC's native API paradigm
- ✅ **Stability:** Eliminates context switching issues

## Implementation Plan

### Step 1: Update Type Definitions

**ExternalMemoryCache.h:**
```cpp
// Before (Runtime API types)
struct CachedEntry {
    cudaExternalMemory_t external_memory;
    cudaMipmappedArray_t mipmapped_array;
    // ...
};

// After (Driver API types)
struct CachedEntry {
    CUexternalMemory external_memory;
    CUmipmappedArray mipmapped_array;
    // ...
};
```

### Step 2: Convert External Memory Import

**ExternalMemoryCache.cpp - ImportD3D12TextureAsSurface():**
```cpp
// Before (Runtime API)
cudaExternalMemoryHandleDesc mem_desc = {};
mem_desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
// ...
cudaError_t err = cudaImportExternalMemory(&external_memory, &mem_desc);

// After (Driver API)
CUDA_EXTERNAL_MEMORY_HANDLE_DESC mem_desc = {};
mem_desc.type = CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE;
// ...
CUresult result = cuImportExternalMemory(&external_memory, &mem_desc);
```

### Step 3: Convert Mipmapped Array Operations

```cpp
// Before (Runtime API)
cudaExternalMemoryGetMappedMipmappedArray(&mipmapped_array, external_memory, &mipmap_desc);
cudaGetMipmappedArrayLevel(&array, mipmapped_array, 0);

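// After (Driver API)
// mipmap_desc must become CUDA_EXTERNAL_MEMORY_MIPMAPPED_ARRAY_DESC, whose
// arrayDesc member is a CUDA_ARRAY3D_DESCRIPTOR rather than formatDesc/extent.
cuExternalMemoryGetMappedMipmappedArray(&mipmapped_array, external_memory, &mipmap_desc);
cuMipmappedArrayGetLevel(&array, mipmapped_array, 0);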
```

### Step 4: Convert Memory Copy Operations

**D3D12SurfaceHandler.cpp - CopyYPlane() / CopyUVPlane():**
```cpp
// Before (Runtime API)
cudaError_t err = cudaMemcpy2D(
    (void*)dst, dst_pitch,
    (void*)src, src_pitch,
    width, height,
    cudaMemcpyDeviceToDevice
);

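// After (Driver API): cuMemcpy2D() takes a CUDA_MEMCPY2D descriptor.
// Assumes src and dst are device pointers (CUdeviceptr).
CUDA_MEMCPY2D copy = {};
copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
copy.srcDevice     = src;
copy.srcPitch      = src_pitch;
copy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
copy.dstDevice     = dst;
copy.dstPitch      = dst_pitch;
copy.WidthInBytes  = width;   // row width in bytes, not pixels
copy.Height        = height;  // number of rows
CUresult result = cuMemcpy2D(&copy);
// Or async version:
// CUresult result = cuMemcpy2DAsync(&copy, stream);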
```

### Step 5: Convert Cleanup Operations

```cpp
// Before (Runtime API)
cudaFreeMipmappedArray(mipmapped_array);
cudaDestroyExternalMemory(external_memory);

// After (Driver API)
cuMipmappedArrayDestroy(mipmapped_array);
cuDestroyExternalMemory(external_memory);
```
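
The cleanup order above (mipmapped array first, then external memory) can be enforced by the type system rather than by convention. The following is a minimal sketch under stated assumptions: `OwnedHandle` and `CachedEntrySketch` are hypothetical names, and the destroy functions are passed in as callables so the example compiles without `<cuda.h>`. In real code the handle types would be `CUexternalMemory` and `CUmipmappedArray`, with `cuDestroyExternalMemory` and `cuMipmappedArrayDestroy` as the destroy callables.

```cpp
#include <cassert>
#include <utility>

// Hypothetical RAII owner for one Driver API handle; releases it exactly once.
template <typename Handle, typename Destroy>
class OwnedHandle {
public:
    OwnedHandle(Handle handle, Destroy destroy)
        : handle_(handle), destroy_(std::move(destroy)), owns_(true) {}

    OwnedHandle(OwnedHandle&& other) noexcept
        : handle_(other.handle_),
          destroy_(std::move(other.destroy_)),
          owns_(other.owns_) {
        other.owns_ = false;  // the moved-from object must not destroy the handle
    }

    OwnedHandle(const OwnedHandle&) = delete;
    OwnedHandle& operator=(const OwnedHandle&) = delete;
    OwnedHandle& operator=(OwnedHandle&&) = delete;

    ~OwnedHandle() {
        if (owns_) destroy_(handle_);  // real code should log a failing CUresult
    }

    Handle get() const { return handle_; }

private:
    Handle handle_;
    Destroy destroy_;
    bool owns_;
};

// Members are destroyed in reverse declaration order, so mipmapped_array is
// released before the external_memory it was mapped from.
template <typename ExtMem, typename ExtMemDestroy,
          typename Mipmap, typename MipmapDestroy>
struct CachedEntrySketch {
    OwnedHandle<ExtMem, ExtMemDestroy> external_memory;
    OwnedHandle<Mipmap, MipmapDestroy> mipmapped_array;
};
```

Declaring `external_memory` before `mipmapped_array` is the design choice that matters here: swapping the two members would silently reverse the release order.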

## Testing Strategy

### Verification Points

1. **Build Success:** All type changes compile without errors
2. **Context Consistency:** Log context handles to verify single context usage
3. **Surface Creation:** Verify surface objects are created successfully
4. **Kernel Execution:** Confirm kernels can access surface objects
5. **Frame Output:** Validate decoded frames are rendered correctly

### Debug Logging

Add diagnostic logs to verify context usage:
```cpp
CUcontext current_ctx;
cuCtxGetCurrent(&current_ctx);
LOGF_DEBUG("[Component] Using CUDA context: 0x%llX", (unsigned long long)current_ctx);
```

Compare context handles across:
- NVDEC initialization
- External memory import
- Surface creation
- Kernel launch

All should show **the same context value**.
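
Comparing log lines by eye is error-prone, so the same check could be automated during bring-up. The sketch below assumes a hypothetical `ContextAudit` helper that records the context seen at each stage and reports the stages that disagree with the first one. It stores handles as `const void*`, which a `CUcontext` (an opaque pointer type) converts to implicitly, so the example itself needs no CUDA headers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One recorded observation: which pipeline stage saw which context.
struct StageContext {
    std::string stage;
    const void* context;  // pass the CUcontext obtained from cuCtxGetCurrent()
};

// Hypothetical bring-up helper, not part of the existing code base.
class ContextAudit {
public:
    void Record(const std::string& stage, const void* context) {
        assert(!stage.empty());  // a label is needed for a readable report
        entries_.push_back({stage, context});
    }

    // Returns the stages whose context differs from the first recorded stage.
    std::vector<std::string> Mismatches() const {
        std::vector<std::string> out;
        for (std::size_t i = 1; i < entries_.size(); ++i) {
            if (entries_[i].context != entries_[0].context) {
                out.push_back(entries_[i].stage);
            }
        }
        return out;
    }

private:
    std::vector<StageContext> entries_;
};
```

Recording NVDEC initialization first makes the decoder's context the reference value, so any stage listed by `Mismatches()` is one that ran outside it.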

## Risk Mitigation

### Potential Issues

1. **API Signature Differences:**
   - Some Driver API functions take different parameters, e.g. `cuMemcpy2D()` takes a single `CUDA_MEMCPY2D` descriptor instead of positional arguments
   - Careful review of the CUDA documentation is required

2. **Error Handling:**
   - Runtime API returns `cudaError_t`
   - Driver API returns `CUresult`
   - Update error checking code accordingly

3. **Enum Value Changes:**
   - Runtime: `cudaExternalMemoryHandleTypeD3D12Resource`
   - Driver: `CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE`
   - Ensure correct enum mapping
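
For the error-handling change in particular, a single checking helper keeps the converted call sites uniform. The sketch below is one possible shape, not the project's existing error path: `ThrowIfFailed` is a hypothetical name, and it is templated on the result type and on a name-lookup callable so it compiles without the CUDA toolkit. With the Driver API, `Result` would be `CUresult`, `success` would be `CUDA_SUCCESS`, and the lookup could wrap `cuGetErrorName()`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: turns a non-success result code into an exception that
// names the failing call and the symbolic error.
template <typename Result, typename NameFn>
void ThrowIfFailed(Result result, Result success, const char* call, NameFn name_of) {
    assert(call != nullptr);
    if (result == success) return;
    throw std::runtime_error(std::string(call) + " failed: " + name_of(result));
}

// Assumed wiring with the real Driver API (requires <cuda.h>, not compiled here):
//   const char* DriverErrorName(CUresult r) {
//       const char* name = nullptr;
//       return cuGetErrorName(r, &name) == CUDA_SUCCESS ? name : "unknown CUresult";
//   }
//   ThrowIfFailed(cuImportExternalMemory(&external_memory, &mem_desc),
//                 CUDA_SUCCESS, "cuImportExternalMemory", DriverErrorName);
```

If the code base avoids exceptions, the same structure works with a logging call and an early return in place of the `throw`.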

### Rollback Plan

If Driver API unification fails:
- Revert the unification commits with `git revert`, returning the tree to the current commit
- Consider alternative: Create separate Runtime API context and synchronize
- Document why unification was not viable

## Success Criteria

- ✅ All Runtime API calls converted to Driver API
- ✅ Build completes without errors
- ✅ Single CUDA context used throughout pipeline
- ✅ Surface objects successfully used in kernels
- ✅ No `CUDA_ERROR_INVALID_HANDLE` errors
- ✅ Frames decode and render correctly

## References

- NVIDIA CUDA Driver API Documentation
- CUDA Runtime API vs Driver API Comparison
- NVDEC Programming Guide
- Previous commit: 73d9d8d (original Runtime API implementation)

CUDA Driver API called 2025-10-07 03:49:32 +09:00
