Strengthen error recovery mechanism (add slot cleanup logic)

2025-10-07 04:03:15 +09:00
parent 23e7956375
commit f3fc17c796
5 changed files with 154 additions and 14 deletions
--- a/vav2/docs/working/Driver_API_Unification_Design.md
+++ b/vav2/docs/working/Driver_API_Unification_Design.md
@@ -167,15 +167,94 @@ cuMipmappedArrayDestroy(mipmapped_array);
 cuDestroyExternalMemory(external_memory);
 ```

+## Implementation Status
+
+### ✅ Phase 1: Driver API Unification - COMPLETED
+
+**Completed Tasks:**
+1. ✅ ExternalMemoryCache.cpp - All Runtime API calls converted to Driver API
+2. ✅ D3D12SurfaceHandler.cpp - CopyYPlane/CopyUVPlane converted to Driver API
+3. ✅ Type definitions updated (CUexternalMemory, CUmipmappedArray, etc.)
+4. ✅ Build successful with Driver API unification
+
+**Key Changes:**
+- `cudaImportExternalMemory()` → `cuImportExternalMemory()`
+- `cudaExternalMemoryGetMappedMipmappedArray()` → `cuExternalMemoryGetMappedMipmappedArray()`
+- `cudaMemcpy2D()` → `cuMemcpy2D()` with a `CUDA_MEMCPY2D` struct
+- `cudaGetMipmappedArrayLevel()` → `cuMipmappedArrayGetLevel()`
+
+### ✅ Phase 2: Error Recovery Mechanism - COMPLETED
+
+**Implemented Features:**
+1. ✅ **D3D12 Resource Cleanup:**
+   - `D3D12SurfaceHandler::ReleaseD3D12Resource()` method
+   - External memory cache release on error
+
+2. ✅ **Slot-based Resource Tracking:**
+   - `DecodeSlot` structure enhanced with `d3d12_texture` and `surface_object` fields
+   - Resources tracked per-slot for proper cleanup
+
+3. ✅ **NV12ToRGBAConverter Reset:**
+   - Automatic reset on error to clean state
+   - Prevents resource accumulation across failed frames
+
+**Error Recovery Flow:**
+```cpp
+if (!copySuccess) {
+    // 1. Release D3D12 texture from cache
+    m_d3d12Handler->ReleaseD3D12Resource(my_slot.d3d12_texture);
+
+    // 2. Reset converter to clean state (destroy the old instance before constructing a new one)
+    m_rgbaConverter.reset();
+    m_rgbaConverter = std::make_unique<NV12ToRGBAConverter>();
+
+    // 3. Clean slot state
+    my_slot.d3d12_texture = nullptr;
+    my_slot.surface_object = 0;
+    my_slot.in_use.store(false);
+}
+```
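The slot cleanup in steps 1 and 3 above could also be expressed as a scope guard, so that every early-return path cleans the slot the same way. The following is only a minimal sketch under stated assumptions: `DecodeSlot` here is a simplified stand-in containing just the fields named in this document, `SlotCleanupGuard` and `ProcessSlot` are hypothetical names, the release callback stands in for `ReleaseD3D12Resource()`, and the converter reset (step 2) is left out.

```cpp
#include <atomic>
#include <cstdint>
#include <utility>

// Simplified stand-in for the real DecodeSlot (assumption: only these fields matter here).
struct DecodeSlot {
    void* d3d12_texture = nullptr;      // ID3D12Resource* in the real code
    std::uint64_t surface_object = 0;   // CUsurfObject in the real code
    std::atomic<bool> in_use{false};
};

// Hypothetical guard: unless commit() is called, the destructor releases the
// texture through the supplied callback and returns the slot to its idle state.
template <typename ReleaseFn>
class SlotCleanupGuard {
public:
    SlotCleanupGuard(DecodeSlot& slot, ReleaseFn release)
        : slot_(slot), release_(std::move(release)) {}
    SlotCleanupGuard(const SlotCleanupGuard&) = delete;
    SlotCleanupGuard& operator=(const SlotCleanupGuard&) = delete;

    // Call once the frame has been processed successfully.
    void commit() noexcept { committed_ = true; }

    ~SlotCleanupGuard() {
        if (committed_) return;
        if (slot_.d3d12_texture != nullptr) release_(slot_.d3d12_texture);
        slot_.d3d12_texture = nullptr;
        slot_.surface_object = 0;
        slot_.in_use.store(false);  // mark the slot free last
    }

private:
    DecodeSlot& slot_;
    ReleaseFn release_;
    bool committed_ = false;
};

// Hypothetical caller: `copy` stands in for the surface copy that may fail.
template <typename CopyFn, typename ReleaseFn>
bool ProcessSlot(DecodeSlot& slot, CopyFn copy, ReleaseFn release) {
    SlotCleanupGuard<ReleaseFn> guard(slot, std::move(release));
    if (!copy(slot)) return false;  // guard cleans the slot on this path
    guard.commit();                 // success: keep the slot's resources
    return true;
}
```

With this shape, a failing copy leaves the slot with a null texture, a zero surface object, and `in_use == false`, while a successful copy leaves it untouched.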
+
+### ❌ Phase 3: CUDA_ERROR_INVALID_HANDLE Investigation - IN PROGRESS
+
+**Current Issue:**
+- Driver API unification completed successfully
+- Error recovery mechanism working correctly
+- **BUT: `CUDA_ERROR_INVALID_HANDLE` persists during kernel launch**
+
+**Verified:**
+- ✅ Surface object created successfully (`surface=0x13`)
+- ✅ Kernel handle valid (`kernel=000001DF7D337BC0`)
+- ✅ Stream handle valid (`stream=000001DF073C08C0`)
+- ✅ Source RGBA pointer valid (`src_rgba=0x130B200000`)
+- ❌ Kernel launch fails with `CUDA_ERROR_INVALID_HANDLE`
+
+**Hypothesis:**
+API mixing is ruled out as the cause, since every call now goes through the Driver API.
+Remaining candidates:
+1. Surface object type mismatch in kernel argument passing
+2. CUDA context issue despite Driver API unification
+3. Kernel parameter pointer alignment or type issues
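Candidates 1 and 3 both concern how arguments reach `cuLaunchKernel()`, whose `kernelParams` argument is an array of pointers, one per kernel parameter, each pointing at storage of exactly the parameter's type. The sketch below shows that packing without depending on the CUDA toolkit. It rests on assumptions: the kernel signature (source pointer, destination surface, width, height, pitch) is hypothetical, `KernelArgs` is an illustrative name, and the two aliases mirror the 64-bit `CUsurfObject` and `CUdeviceptr` typedefs from `cuda.h`.

```cpp
#include <array>
#include <cstdint>

// Stand-ins for the cuda.h typedefs on 64-bit platforms (assumption for this sketch).
using SurfObj = unsigned long long;    // CUsurfObject
using DevicePtr = unsigned long long;  // CUdeviceptr

static_assert(sizeof(SurfObj) == 8, "surface objects are passed as 64-bit values");
static_assert(sizeof(DevicePtr) == 8, "device pointers are passed as 64-bit values");

// Hypothetical argument set; the order must match the kernel's parameter list.
struct KernelArgs {
    DevicePtr src_rgba;      // source RGBA buffer
    SurfObj dst_surface;     // destination surface object
    unsigned int width;
    unsigned int height;
    unsigned int src_pitch;  // bytes per source row

    // Each entry points at the argument's storage, not at a copy of its value.
    // The struct must stay alive until the launch call returns.
    std::array<void*, 5> pack() {
        return { &src_rgba, &dst_surface, &width, &height, &src_pitch };
    }
};
```

In real code the result of `pack().data()` would be what gets handed to `cuLaunchKernel()` as `kernelParams`. Storing the surface object in a 32-bit variable, or passing its value where a pointer to it is expected, would be examples of the mismatch described in candidate 1.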
+
+### Context Management Analysis
+
+**Test Results:**
+- Removed all `cuCtxSetCurrent()` calls; the error persists unchanged
+- This suggests the context created by `cuCtxCreate()` was already current on the calling thread
+- NVDEC context is active throughout the pipeline
+
+**Conclusion:**
+The `cuCtxSetCurrent()` calls were redundant defensive programming on this code path. The cause of the error lies elsewhere.
+
 ## Testing Strategy

 ### Verification Points

-1. **Build Success:** All type changes compile without errors
-2. **Context Consistency:** Log context handles to verify single context usage
-3. **Surface Creation:** Verify surface objects are created successfully
-4. **Kernel Execution:** Confirm kernels can access surface objects
-5. **Frame Output:** Validate decoded frames are rendered correctly
+1. ✅ **Build Success:** All type changes compile without errors
+2. ✅ **Error Recovery:** Resources properly cleaned up on failure
+3. ✅ **Context Consistency:** Single NVDEC context used throughout
+4. ❌ **Kernel Execution:** Still fails with `CUDA_ERROR_INVALID_HANDLE`
+5. ❌ **Frame Output:** No frames decoded successfully

 ### Debug Logging

@@ -221,12 +300,22 @@ If Driver API unification fails:

 ## Success Criteria

-✅ All Runtime API calls converted to Driver API
-✅ Build completes without errors
-✅ Single CUDA context used throughout pipeline
-✅ Surface objects successfully used in kernels
-✅ No `CUDA_ERROR_INVALID_HANDLE` errors
-✅ Frames decode and render correctly
+### Completed ✅
+- ✅ All Runtime API calls converted to Driver API
+- ✅ Build completes without errors
+- ✅ Single CUDA context used throughout pipeline
+- ✅ Error recovery mechanism implemented
+- ✅ Resource cleanup on failure
+
+### In Progress ❌
+- ❌ Surface objects successfully used in kernels
+- ❌ No `CUDA_ERROR_INVALID_HANDLE` errors
+- ❌ Frames decode and render correctly
+
+### Next Steps
+1. Investigate kernel parameter passing for surface objects
+2. Verify CUDA Driver API surface object compatibility with D3D12 external memory
+3. Consider alternative approaches if Driver API surface objects prove incompatible

 ## References