vav2/docs/working/CUDA_API_Unification_Design.md

# CUDA API 통합 및 에러 복구 강화 설계

- **문서 작성일:** 2025년 10월 7일
- **작성자:** Claude
- **관련 문서:**
  - Color_Space_Correction_Design.md
  - NVDECAV1Decoder_Resource_Mgmt_Refactor_Design.md
  - NVDECAV1Decoder_Sync_Stabilization_Design.md

## 1. 문제 정의

### 1.1. CUDA API 불일치 문제

현재 `D3D12SurfaceHandler`에서 CUDA Runtime API와 Driver API가 혼용되어 사용되고 있어 `CUDA_ERROR_INVALID_HANDLE (400)` 에러가 발생한다.

**문제 발생 지점:**
```cpp
// ExternalMemoryCache.cpp:273 - Runtime API
cudaSurfaceObject_t surface;
cudaCreateSurfaceObject(&surface, &res_desc);  // Runtime API surface 생성

// D3D12SurfaceHandler.cpp:339 - Driver API
CUresult result = cuLaunchKernel(
    m_surfaceWriteKernel,  // Driver API kernel
    grid_size.x, grid_size.y, 1,
    block_size.x, block_size.y, 1,
    0,
    (CUstream)stream,
    kernel_args,  // Runtime API surface를 전달 → ERROR!
    nullptr
);
```

**에러 메시지:**
```
[D3D12SurfaceHandler] Kernel launch failed: 400 (CUDA_ERROR_INVALID_HANDLE: invalid resource handle)
```

**근본 원인:**
- `cudaCreateSurfaceObject()` (Runtime API)로 생성된 surface handle
- `cuLaunchKernel()` (Driver API)에 전달 시 호환되지 않음
- CUDA Runtime API와 Driver API는 별도의 핸들 공간을 사용

### 1.2. 에러 복구 메커니즘 부재

CUDA 커널 실패 시 `DecodeSlot` 정리 로직이 누락되어 다음 문제가 발생:

1. **슬롯 누수:** `DecodeSlot.in_use` 플래그가 `true`로 남아 슬롯 재사용 불가
2. **순서 보장 파괴:** `m_returnCounter` 증가 안됨 → FIFO 순서 보장 로직 데드락
3. **콜백 중복 호출:** NVDEC이 동일 `picture_index`를 여러 번 디스플레이로 발행
4. **픽셀 데이터 손상:** 동일 프레임이 중복 처리되어 Frame 6에서 441,600개 픽셀 에러

**테스트 결과:**
```
picture_index=0: 1회 (정상)
picture_index=1~7: 각 3회 (비정상 - HandlePictureDisplay 중복 호출)
picture_index=8: 2회

Frame 6: FAIL (441,600 pixel errors)
```

## 2. 목표

### 2.1. CUDA API 통합
- Runtime API와 Driver API 혼용 제거
- 일관된 API 사용으로 안정성 확보

### 2.2. 에러 복구 메커니즘 강화
- CUDA 커널 실패 시 슬롯 정리 보장
- 순서 보장 로직 데드락 방지
- HandlePictureDisplay 중복 호출 차단

## 3. 설계 옵션

### 옵션 A: Runtime API 통일 (권장)

**장점:**
- ✅ Surface 생성 코드 변경 불필요 (ExternalMemoryCache.cpp)
- ✅ 커널 로딩 단순화 (PTX 모듈 관리 불필요)
- ✅ 더 나은 에러 메시지
- ✅ 코드 변경 최소화

**단점:**
- ⚠️ PTX 기반 커널을 Runtime API로 전환 필요
- ⚠️ `cuModuleLoadData` → CUDA Runtime 컴파일로 변경

**구현 방안:**

#### 3.1.1. D3D12SurfaceHandler.h 수정
```cpp
class D3D12SurfaceHandler {
private:
    // Driver API → Runtime API 전환
    // CUmodule m_surfaceWriteModule;  // 제거
    // CUfunction m_surfaceWriteKernel;  // 제거

    // Runtime API 커널 함수 포인터 추가
    void (*m_kernelFunction)(const unsigned char*, unsigned long long,
                              unsigned int, unsigned int, unsigned int);
};
```

#### 3.1.2. D3D12SurfaceHandler.cpp 수정

**LoadSurfaceWriteKernel() 함수 변경:**
```cpp
bool D3D12SurfaceHandler::LoadSurfaceWriteKernel()
{
    LOGF_DEBUG("[D3D12SurfaceHandler] Loading kernel via Runtime API");

    // Runtime API: PTX를 모듈로 컴파일
    // 1. PTX를 파일로 저장 (임시)
    // 2. nvrtcCompileProgram으로 컴파일
    // 3. 커널 함수 포인터 획득

    // 또는: CUDA C++ 소스를 직접 컴파일하여 링크
    // (구현 세부사항은 4단계에서 결정)

    LOGF_DEBUG("[D3D12SurfaceHandler] Kernel loaded successfully");
    return true;
}
```

**CopyRGBAToSurfaceViaKernel() 함수 변경:**
```cpp
bool D3D12SurfaceHandler::CopyRGBAToSurfaceViaKernel(
    cudaSurfaceObject_t dst_surface,
    CUdeviceptr src_rgba,
    uint32_t width, uint32_t height, uint32_t src_pitch,
    cudaStream_t stream)
{
    if (!m_kernelFunction) {
        LOGF_ERROR("[D3D12SurfaceHandler] Kernel not loaded");
        return false;
    }

    // Launch parameters: 16x16 thread blocks
    dim3 block_size(16, 16);
    dim3 grid_size((width + 15) / 16, (height + 15) / 16);

    LOGF_DEBUG("[D3D12SurfaceHandler] Launching kernel via Runtime API: grid=(%u,%u), block=(%u,%u)",
               grid_size.x, grid_size.y, block_size.x, block_size.y);

    // Runtime API kernel launch (<<<>>> syntax)
    // Note: This requires the kernel to be compiled as a device function
    // For now, we'll use a workaround with a global function pointer

    // Alternative: Use CUDA Runtime kernel launch
    void* kernel_args[] = {
        (void*)&src_rgba,
        (void*)&dst_surface,
        (void*)&width,
        (void*)&height,
        (void*)&src_pitch
    };

    cudaError_t err = cudaLaunchKernel(
        (void*)m_kernelFunction,
        grid_size,
        block_size,
        kernel_args,
        0,  // Shared memory
        stream
    );

    if (err != cudaSuccess) {
        LOGF_ERROR("[D3D12SurfaceHandler] Kernel launch failed: %s", cudaGetErrorString(err));
        return false;
    }

    // Synchronize to ensure kernel completes
    err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        LOGF_ERROR("[D3D12SurfaceHandler] Kernel synchronization failed: %s", cudaGetErrorString(err));
        return false;
    }

    LOGF_DEBUG("[D3D12SurfaceHandler] Kernel completed successfully");
    return true;
}
```

### 옵션 B: Driver API 통일

**장점:**
- ✅ PTX 모듈 관리 유지 (기존 코드 활용)
- ✅ 더 세밀한 제어 가능

**단점:**
- ❌ Surface 생성 코드 대규모 수정 필요 (ExternalMemoryCache.cpp)
- ❌ `cudaCreateSurfaceObject` → `cuSurfObjectCreate` 전환
- ❌ 모든 CUDA Runtime API 호출을 Driver API로 변경
- ❌ 코드 복잡도 증가

**구현 방안:**

#### 3.2.1. ExternalMemoryCache.cpp 수정
```cpp
bool ExternalMemoryCache::ImportD3D12TextureAsSurface(ID3D12Resource* resource,
                                                       CUexternalMemory* out_ext_mem,
                                                       CUmipmappedArray* out_mipmap,
                                                       CUsurfObject* out_surface)
{
    // Step 6: Create surface object via Driver API
    CUDA_RESOURCE_DESC res_desc = {};
    res_desc.resType = CU_RESOURCE_TYPE_ARRAY;
    res_desc.res.array.hArray = array;

    CUsurfObject surface;
    CUresult result = cuSurfObjectCreate(&surface, &res_desc);

    if (result != CUDA_SUCCESS) {
        const char* errorName = nullptr;
        cuGetErrorName(result, &errorName);
        LOGF_ERROR("[ExternalMemoryCache] cuSurfObjectCreate failed: %s", errorName);
        // ... cleanup ...
        return false;
    }

    *out_surface = surface;
    return true;
}
```

#### 3.2.2. D3D12SurfaceHandler.h 수정
```cpp
// cudaSurfaceObject_t → CUsurfObject
bool CopyRGBAFrame(CUdeviceptr src_rgba,
                   ID3D12Resource* dst_texture,
                   uint32_t width,
                   uint32_t height,
                   CUstream stream);  // cudaStream_t → CUstream

bool CopyRGBAToSurfaceViaKernel(CUsurfObject dst_surface,  // 타입 변경
                                 CUdeviceptr src_rgba,
                                 uint32_t width, uint32_t height, uint32_t src_pitch,
                                 CUstream stream);
```

### 권장 사항: **옵션 A (Runtime API 통일)**

**이유:**
1. Surface 생성 코드가 이미 안정적으로 작동 중
2. 커널 로딩만 변경하면 되어 영향 범위 최소화
3. Runtime API가 더 간결하고 유지보수 용이
4. todo20.txt에서도 옵션 A를 권장

## 4. 에러 복구 메커니즘 강화

### 4.1. DecodeToSurface 슬롯 정리 로직 추가

**NVDECAV1Decoder.cpp 수정 위치:**
```cpp
bool NVDECAV1Decoder::DecodeToSurface(/* ... */)
{
    // ... 기존 코드 ...

    // D3D12 frame processing
    bool copy_success = false;
    if (target_type == VAVCORE_SURFACE_D3D12_RESOURCE) {
        if (isRGBAFormat) {
            // RGBA conversion and copy
            if (!m_rgbaConverter->Convert(/* ... */)) {
                LOGF_ERROR("[DecodeToSurface] RGBA conversion failed");
                goto cleanup_on_error;  // 에러 처리로 점프
            }

            if (!m_d3d12Handler->CopyRGBAFrame(/* ... */)) {
                LOGF_ERROR("[DecodeToSurface] Frame copy failed");
                goto cleanup_on_error;  // 에러 처리로 점프
            }
        } else {
            // NV12 copy
            if (!m_d3d12Handler->CopyNV12Frame(/* ... */)) {
                LOGF_ERROR("[DecodeToSurface] NV12 copy failed");
                goto cleanup_on_error;  // 에러 처리로 점프
            }
        }
        copy_success = true;
    }

    // ... 기존 성공 경로 ...

    // 10. Release slot
    my_slot.in_use.store(false);
    LOGF_DEBUG("[DecodeToSurface] Released slot %d", my_slot_idx);

    // 11. Increment return counter to allow next frame
    m_returnCounter.fetch_add(1);

    return true;

cleanup_on_error:
    // **NEW: 에러 복구 로직**
    LOGF_ERROR("[DecodeToSurface] Entering cleanup on error for slot %d, submission_id=%llu",
               my_slot_idx, my_submission_id);

    // 1. Release decode slot
    my_slot.in_use.store(false);
    my_slot.is_ready.store(false);

    // 2. Increment return counter to prevent FIFO deadlock
    m_returnCounter.fetch_add(1);

    // 3. Signal next waiting thread (if any)
    my_slot.slot_cv.notify_one();

    LOGF_ERROR("[DecodeToSurface] Cleanup complete, returning false");
    return false;
}
```

### 4.2. HandlePictureDisplay 중복 호출 방지

**추가 안전장치:**
```cpp
int NVDECAV1Decoder::HandlePictureDisplay(CUVIDPARSERDISPINFO* disp_info)
{
    // ... 기존 코드 ...

    // **NEW: 중복 호출 감지**
    if (slot.is_ready.load()) {
        LOGF_WARNING("[HandlePictureDisplay] Duplicate display callback for picture_index=%d (already ready)",
                     disp_info->picture_index);
        return 1;  // Skip duplicate
    }

    // ... 기존 코드 계속 ...
}
```

## 5. 구현 단계

### Phase 1: CUDA API 통합 (우선순위 높음)
1. **D3D12SurfaceHandler.h 수정**
   - Driver API 멤버 변수 제거
   - Runtime API 커널 함수 포인터 추가

2. **D3D12SurfaceHandler.cpp 수정**
   - `LoadSurfaceWriteKernel()` Runtime API로 전환
   - `CopyRGBAToSurfaceViaKernel()` Runtime API로 전환

3. **빌드 및 테스트**
   - VavCore 라이브러리 빌드
   - RedSurfaceNVDECTest 실행
   - CUDA 커널 에러 해결 확인

### Phase 2: 에러 복구 강화 (안정성 개선)
1. **NVDECAV1Decoder.cpp 수정**
   - `DecodeToSurface`에 `cleanup_on_error` 라벨 추가
   - 슬롯 정리 로직 구현

2. **HandlePictureDisplay 중복 방지**
   - 중복 호출 감지 로직 추가

3. **테스트**
   - 25프레임 모두 정상 디코딩 확인
   - HandlePictureDisplay 호출 횟수 검증
   - 픽셀 에러 0개 확인

## 6. 검증 기준

### 6.1. CUDA API 통합 성공 기준
- ✅ `CUDA_ERROR_INVALID_HANDLE` 에러 0건
- ✅ 모든 프레임 RGBA 변환 성공
- ✅ Kernel launch 성공 로그 출력

### 6.2. 에러 복구 강화 성공 기준
- ✅ HandlePictureDisplay 각 picture_index 1회만 호출
- ✅ 슬롯 누수 0건
- ✅ 25프레임 모두 픽셀 에러 0개
- ✅ FIFO 순서 보장 정상 작동

### 6.3. 성능 검증
- ✅ 디코딩 성능 유지 (기존 대비 ±5% 이내)
- ✅ 메모리 사용량 증가 없음

## 7. 위험 요소 및 완화 방안

### 7.1. Runtime API 커널 로딩 복잡도
**위험:** PTX → Runtime API 전환 시 커널 로딩 로직 복잡화

**완화:**
- NVRTC (NVIDIA Runtime Compilation) 사용
- 또는 사전 컴파일된 CUBIN 파일 사용
- 최악의 경우 옵션 B (Driver API 통일)로 전환

### 7.2. 기존 동기화 로직과의 충돌
**위험:** 에러 복구 로직이 기존 동기화 메커니즘과 충돌

**완화:**
- `cleanup_on_error`에서 슬롯 뮤텍스 획득 후 정리
- 철저한 단위 테스트

### 7.3. 성능 저하
**위험:** Runtime API 전환으로 인한 성능 저하

**완화:**
- 벤치마크 테스트 수행
- 성능 저하 시 Driver API 전환 고려

## 8. 참고 자료

- **CUDA Programming Guide**: Runtime API vs Driver API
- **todo20.txt**: CUDA API 불일치 분석 및 해결 제안
- **NVDECAV1Decoder_Sync_Stabilization_Design.md**: 동기화 개선 설계
CUDA Driver API called 2025-10-07 03:49:32 +09:00			`# CUDA API 통합 및 에러 복구 강화 설계`

			`- 문서 작성일: 2025년 10월 7일`
			`- 작성자: Claude`
			`- 관련 문서:`
			`- Color_Space_Correction_Design.md`
			`- NVDECAV1Decoder_Resource_Mgmt_Refactor_Design.md`
			`- NVDECAV1Decoder_Sync_Stabilization_Design.md`

			`## 1. 문제 정의`

			`### 1.1. CUDA API 불일치 문제`

			현재 `D3D12SurfaceHandler`에서 CUDA Runtime API와 Driver API가 혼용되어 사용되고 있어 `CUDA_ERROR_INVALID_HANDLE (400)` 에러가 발생한다.

			`문제 발생 지점:`
			```cpp
			`// ExternalMemoryCache.cpp:273 - Runtime API`
			`cudaSurfaceObject_t surface;`
			`cudaCreateSurfaceObject(&surface, &res_desc); // Runtime API surface 생성`

			`// D3D12SurfaceHandler.cpp:339 - Driver API`
			`CUresult result = cuLaunchKernel(`
			`m_surfaceWriteKernel, // Driver API kernel`
			`grid_size.x, grid_size.y, 1,`
			`block_size.x, block_size.y, 1,`
			`0,`
			`(CUstream)stream,`
			`kernel_args, // Runtime API surface를 전달 → ERROR!`
			`nullptr`
			`);`
			```

			`에러 메시지:`
			```
			`[D3D12SurfaceHandler] Kernel launch failed: 400 (CUDA_ERROR_INVALID_HANDLE: invalid resource handle)`
			```

			`근본 원인:`
			- `cudaCreateSurfaceObject()` (Runtime API)로 생성된 surface handle
			- `cuLaunchKernel()` (Driver API)에 전달 시 호환되지 않음
			`- CUDA Runtime API와 Driver API는 별도의 핸들 공간을 사용`

			`### 1.2. 에러 복구 메커니즘 부재`

			CUDA 커널 실패 시 `DecodeSlot` 정리 로직이 누락되어 다음 문제가 발생:

			1. 슬롯 누수: `DecodeSlot.in_use` 플래그가 `true`로 남아 슬롯 재사용 불가
			2. 순서 보장 파괴: `m_returnCounter` 증가 안됨 → FIFO 순서 보장 로직 데드락
			3. 콜백 중복 호출: NVDEC이 동일 `picture_index`를 여러 번 디스플레이로 발행
			`4. 픽셀 데이터 손상: 동일 프레임이 중복 처리되어 Frame 6에서 441,600개 픽셀 에러`

			`테스트 결과:`
			```
			`picture_index=0: 1회 (정상)`
			`picture_index=1~7: 각 3회 (비정상 - HandlePictureDisplay 중복 호출)`
			`picture_index=8: 2회`

			`Frame 6: FAIL (441,600 pixel errors)`
			```

			`## 2. 목표`

			`### 2.1. CUDA API 통합`
			`- Runtime API와 Driver API 혼용 제거`
			`- 일관된 API 사용으로 안정성 확보`

			`### 2.2. 에러 복구 메커니즘 강화`
			`- CUDA 커널 실패 시 슬롯 정리 보장`
			`- 순서 보장 로직 데드락 방지`
			`- HandlePictureDisplay 중복 호출 차단`

			`## 3. 설계 옵션`

			`### 옵션 A: Runtime API 통일 (권장)`

			`장점:`
			`- ✅ Surface 생성 코드 변경 불필요 (ExternalMemoryCache.cpp)`
			`- ✅ 커널 로딩 단순화 (PTX 모듈 관리 불필요)`
			`- ✅ 더 나은 에러 메시지`
			`- ✅ 코드 변경 최소화`

			`단점:`
			`- ⚠️ PTX 기반 커널을 Runtime API로 전환 필요`
			- ⚠️ `cuModuleLoadData` → CUDA Runtime 컴파일로 변경

			`구현 방안:`

			`#### 3.1.1. D3D12SurfaceHandler.h 수정`
			```cpp
			`class D3D12SurfaceHandler {`
			`private:`
			`// Driver API → Runtime API 전환`
			`// CUmodule m_surfaceWriteModule; // 제거`
			`// CUfunction m_surfaceWriteKernel; // 제거`

			`// Runtime API 커널 함수 포인터 추가`
			`void (m_kernelFunction)(const unsigned char, unsigned long long,`
			`unsigned int, unsigned int, unsigned int);`
			`};`
			```

			`#### 3.1.2. D3D12SurfaceHandler.cpp 수정`

			`LoadSurfaceWriteKernel() 함수 변경:`
			```cpp
			`bool D3D12SurfaceHandler::LoadSurfaceWriteKernel()`
			`{`
			`LOGF_DEBUG("[D3D12SurfaceHandler] Loading kernel via Runtime API");`

			`// Runtime API: PTX를 모듈로 컴파일`
			`// 1. PTX를 파일로 저장 (임시)`
			`// 2. nvrtcCompileProgram으로 컴파일`
			`// 3. 커널 함수 포인터 획득`

			`// 또는: CUDA C++ 소스를 직접 컴파일하여 링크`
			`// (구현 세부사항은 4단계에서 결정)`

			`LOGF_DEBUG("[D3D12SurfaceHandler] Kernel loaded successfully");`
			`return true;`
			`}`
			```

			`CopyRGBAToSurfaceViaKernel() 함수 변경:`
			```cpp
			`bool D3D12SurfaceHandler::CopyRGBAToSurfaceViaKernel(`
			`cudaSurfaceObject_t dst_surface,`
			`CUdeviceptr src_rgba,`
			`uint32_t width, uint32_t height, uint32_t src_pitch,`
			`cudaStream_t stream)`
			`{`
			`if (!m_kernelFunction) {`
			`LOGF_ERROR("[D3D12SurfaceHandler] Kernel not loaded");`
			`return false;`
			`}`

			`// Launch parameters: 16x16 thread blocks`
			`dim3 block_size(16, 16);`
			`dim3 grid_size((width + 15) / 16, (height + 15) / 16);`

			`LOGF_DEBUG("[D3D12SurfaceHandler] Launching kernel via Runtime API: grid=(%u,%u), block=(%u,%u)",`
			`grid_size.x, grid_size.y, block_size.x, block_size.y);`

			`// Runtime API kernel launch (<<<>>> syntax)`
			`// Note: This requires the kernel to be compiled as a device function`
			`// For now, we'll use a workaround with a global function pointer`

			`// Alternative: Use CUDA Runtime kernel launch`
			`void* kernel_args[] = {`
			`(void*)&src_rgba,`
			`(void*)&dst_surface,`
			`(void*)&width,`
			`(void*)&height,`
			`(void*)&src_pitch`
			`};`

			`cudaError_t err = cudaLaunchKernel(`
			`(void*)m_kernelFunction,`
			`grid_size,`
			`block_size,`
			`kernel_args,`
			`0, // Shared memory`
			`stream`
			`);`

			`if (err != cudaSuccess) {`
			`LOGF_ERROR("[D3D12SurfaceHandler] Kernel launch failed: %s", cudaGetErrorString(err));`
			`return false;`
			`}`

			`// Synchronize to ensure kernel completes`
			`err = cudaStreamSynchronize(stream);`
			`if (err != cudaSuccess) {`
			`LOGF_ERROR("[D3D12SurfaceHandler] Kernel synchronization failed: %s", cudaGetErrorString(err));`
			`return false;`
			`}`

			`LOGF_DEBUG("[D3D12SurfaceHandler] Kernel completed successfully");`
			`return true;`
			`}`
			```

			`### 옵션 B: Driver API 통일`

			`장점:`
			`- ✅ PTX 모듈 관리 유지 (기존 코드 활용)`
			`- ✅ 더 세밀한 제어 가능`

			`단점:`
			`- ❌ Surface 생성 코드 대규모 수정 필요 (ExternalMemoryCache.cpp)`
			- ❌ `cudaCreateSurfaceObject` → `cuSurfObjectCreate` 전환
			`- ❌ 모든 CUDA Runtime API 호출을 Driver API로 변경`
			`- ❌ 코드 복잡도 증가`

			`구현 방안:`

			`#### 3.2.1. ExternalMemoryCache.cpp 수정`
			```cpp
			`bool ExternalMemoryCache::ImportD3D12TextureAsSurface(ID3D12Resource* resource,`
			`CUexternalMemory* out_ext_mem,`
			`CUmipmappedArray* out_mipmap,`
			`CUsurfObject* out_surface)`
			`{`
			`// Step 6: Create surface object via Driver API`
			`CUDA_RESOURCE_DESC res_desc = {};`
			`res_desc.resType = CU_RESOURCE_TYPE_ARRAY;`
			`res_desc.res.array.hArray = array;`

			`CUsurfObject surface;`
			`CUresult result = cuSurfObjectCreate(&surface, &res_desc);`

			`if (result != CUDA_SUCCESS) {`
			`const char* errorName = nullptr;`
			`cuGetErrorName(result, &errorName);`
			`LOGF_ERROR("[ExternalMemoryCache] cuSurfObjectCreate failed: %s", errorName);`
			`// ... cleanup ...`
			`return false;`
			`}`

			`*out_surface = surface;`
			`return true;`
			`}`
			```

			`#### 3.2.2. D3D12SurfaceHandler.h 수정`
			```cpp
			`// cudaSurfaceObject_t → CUsurfObject`
			`bool CopyRGBAFrame(CUdeviceptr src_rgba,`
			`ID3D12Resource* dst_texture,`
			`uint32_t width,`
			`uint32_t height,`
			`CUstream stream); // cudaStream_t → CUstream`

			`bool CopyRGBAToSurfaceViaKernel(CUsurfObject dst_surface, // 타입 변경`
			`CUdeviceptr src_rgba,`
			`uint32_t width, uint32_t height, uint32_t src_pitch,`
			`CUstream stream);`
			```

			`### 권장 사항: 옵션 A (Runtime API 통일)`

			`이유:`
			`1. Surface 생성 코드가 이미 안정적으로 작동 중`
			`2. 커널 로딩만 변경하면 되어 영향 범위 최소화`
			`3. Runtime API가 더 간결하고 유지보수 용이`
			`4. todo20.txt에서도 옵션 A를 권장`

			`## 4. 에러 복구 메커니즘 강화`

			`### 4.1. DecodeToSurface 슬롯 정리 로직 추가`

			`NVDECAV1Decoder.cpp 수정 위치:`
			```cpp
			`bool NVDECAV1Decoder::DecodeToSurface(/* ... */)`
			`{`
			`// ... 기존 코드 ...`

			`// D3D12 frame processing`
			`bool copy_success = false;`
			`if (target_type == VAVCORE_SURFACE_D3D12_RESOURCE) {`
			`if (isRGBAFormat) {`
			`// RGBA conversion and copy`
			`if (!m_rgbaConverter->Convert(/* ... */)) {`
			`LOGF_ERROR("[DecodeToSurface] RGBA conversion failed");`
			`goto cleanup_on_error; // 에러 처리로 점프`
			`}`

			`if (!m_d3d12Handler->CopyRGBAFrame(/* ... */)) {`
			`LOGF_ERROR("[DecodeToSurface] Frame copy failed");`
			`goto cleanup_on_error; // 에러 처리로 점프`
			`}`
			`} else {`
			`// NV12 copy`
			`if (!m_d3d12Handler->CopyNV12Frame(/* ... */)) {`
			`LOGF_ERROR("[DecodeToSurface] NV12 copy failed");`
			`goto cleanup_on_error; // 에러 처리로 점프`
			`}`
			`}`
			`copy_success = true;`
			`}`

			`// ... 기존 성공 경로 ...`

			`// 10. Release slot`
			`my_slot.in_use.store(false);`
			`LOGF_DEBUG("[DecodeToSurface] Released slot %d", my_slot_idx);`

			`// 11. Increment return counter to allow next frame`
			`m_returnCounter.fetch_add(1);`

			`return true;`

			`cleanup_on_error:`
			`// NEW: 에러 복구 로직`
			`LOGF_ERROR("[DecodeToSurface] Entering cleanup on error for slot %d, submission_id=%llu",`
			`my_slot_idx, my_submission_id);`

			`// 1. Release decode slot`
			`my_slot.in_use.store(false);`
			`my_slot.is_ready.store(false);`

			`// 2. Increment return counter to prevent FIFO deadlock`
			`m_returnCounter.fetch_add(1);`

			`// 3. Signal next waiting thread (if any)`
			`my_slot.slot_cv.notify_one();`

			`LOGF_ERROR("[DecodeToSurface] Cleanup complete, returning false");`
			`return false;`
			`}`
			```

			`### 4.2. HandlePictureDisplay 중복 호출 방지`

			`추가 안전장치:`
			```cpp
			`int NVDECAV1Decoder::HandlePictureDisplay(CUVIDPARSERDISPINFO* disp_info)`
			`{`
			`// ... 기존 코드 ...`

			`// NEW: 중복 호출 감지`
			`if (slot.is_ready.load()) {`
			`LOGF_WARNING("[HandlePictureDisplay] Duplicate display callback for picture_index=%d (already ready)",`
			`disp_info->picture_index);`
			`return 1; // Skip duplicate`
			`}`

			`// ... 기존 코드 계속 ...`
			`}`
			```

			`## 5. 구현 단계`

			`### Phase 1: CUDA API 통합 (우선순위 높음)`
			`1. D3D12SurfaceHandler.h 수정`
			`- Driver API 멤버 변수 제거`
			`- Runtime API 커널 함수 포인터 추가`

			`2. D3D12SurfaceHandler.cpp 수정`
			- `LoadSurfaceWriteKernel()` Runtime API로 전환
			- `CopyRGBAToSurfaceViaKernel()` Runtime API로 전환

			`3. 빌드 및 테스트`
			`- VavCore 라이브러리 빌드`
			`- RedSurfaceNVDECTest 실행`
			`- CUDA 커널 에러 해결 확인`

			`### Phase 2: 에러 복구 강화 (안정성 개선)`
			`1. NVDECAV1Decoder.cpp 수정`
			- `DecodeToSurface`에 `cleanup_on_error` 라벨 추가
			`- 슬롯 정리 로직 구현`

			`2. HandlePictureDisplay 중복 방지`
			`- 중복 호출 감지 로직 추가`

			`3. 테스트`
			`- 25프레임 모두 정상 디코딩 확인`
			`- HandlePictureDisplay 호출 횟수 검증`
			`- 픽셀 에러 0개 확인`

			`## 6. 검증 기준`

			`### 6.1. CUDA API 통합 성공 기준`
			- ✅ `CUDA_ERROR_INVALID_HANDLE` 에러 0건
			`- ✅ 모든 프레임 RGBA 변환 성공`
			`- ✅ Kernel launch 성공 로그 출력`

			`### 6.2. 에러 복구 강화 성공 기준`
			`- ✅ HandlePictureDisplay 각 picture_index 1회만 호출`
			`- ✅ 슬롯 누수 0건`
			`- ✅ 25프레임 모두 픽셀 에러 0개`
			`- ✅ FIFO 순서 보장 정상 작동`

			`### 6.3. 성능 검증`
			`- ✅ 디코딩 성능 유지 (기존 대비 ±5% 이내)`
			`- ✅ 메모리 사용량 증가 없음`

			`## 7. 위험 요소 및 완화 방안`

			`### 7.1. Runtime API 커널 로딩 복잡도`
			`위험: PTX → Runtime API 전환 시 커널 로딩 로직 복잡화`

			`완화:`
			`- NVRTC (NVIDIA Runtime Compilation) 사용`
			`- 또는 사전 컴파일된 CUBIN 파일 사용`
			`- 최악의 경우 옵션 B (Driver API 통일)로 전환`

			`### 7.2. 기존 동기화 로직과의 충돌`
			`위험: 에러 복구 로직이 기존 동기화 메커니즘과 충돌`

			`완화:`
			- `cleanup_on_error`에서 슬롯 뮤텍스 획득 후 정리
			`- 철저한 단위 테스트`

			`### 7.3. 성능 저하`
			`위험: Runtime API 전환으로 인한 성능 저하`

			`완화:`
			`- 벤치마크 테스트 수행`
			`- 성능 저하 시 Driver API 전환 고려`

			`## 8. 참고 자료`

			`- CUDA Programming Guide: Runtime API vs Driver API`
			`- todo20.txt: CUDA API 불일치 분석 및 해결 제안`
			`- NVDECAV1Decoder_Sync_Stabilization_Design.md: 동기화 개선 설계`