video-v1/prompt-redesign-ringbuffer.txt

● NVDEC 비동기 파이프라인과 멀티프레임 패킷 구조의 불일치 문제를 해결하기 위한 몇 가지 접근 방법을 제안합니다:

  🎯 해결 방안 제안

  Option 1: 단순화 - 동기식 단일 프레임 디코딩 (가장 빠른 해결책)

  복잡한 ring buffer를 제거하고 간단한 동기식 디코딩으로 전환:

  장점:
  - 구현이 간단하고 디버깅 용이
  - 패킷-프레임 매핑 문제 완전히 회피
  - 안정성 확보 후 비동기로 확장 가능

  단점:
  - 성능이 비동기 방식보다 낮음
  - GPU 활용도가 최적이 아님

  구현 개요:
  bool DecodeToSurface(...) {
      // 1. Submit packet to parser
      cuvidParseVideoData(m_parser, &packet);

      // 2. Wait for decode complete (blocking)
      // Use a simple flag or semaphore

      // 3. Map and copy frame
      cuvidMapVideoFrame(...);
      CopyToD3D12(...);
      cuvidUnmapVideoFrame(...);

      return true;
  }

  ---
  Option 2: HandlePictureDisplay 기반 매핑 (NVIDIA 권장 방식)

  ulMaxDisplayDelay를 활용하여 HandlePictureDisplay에서 올바른 슬롯 찾기:

  핵심 아이디어:
  - HandlePictureDisplay는 disp_info->timestamp를 통해 슬롯 인덱스 전달받음
  - 하지만 현재는 호출되지 않는 문제 → ulMaxDisplayDelay 설정 재검토

  수정 필요:
  // CreateParser에서
  m_parserParams.ulMaxDisplayDelay = 1;  // 현재 값
  // 또는
  m_parserParams.ulMaxDisplayDelay = 0;  // 즉시 디스플레이 (테스트)

  // HandlePictureDisplay에서 올바른 슬롯 찾기
  int CUDAAPI HandlePictureDisplay(void* user_data, CUVIDPARSERDISPINFO* disp_info) {
      size_t slot_idx = static_cast<size_t>(disp_info->timestamp);
      DecodeSlot& slot = decoder->m_ringBuffer[slot_idx % RING_BUFFER_SIZE];

      slot.picture_index = disp_info->picture_index;
      slot.is_ready = true;
      slot.frame_ready.notify_one();

      return 1;
  }

  문제: 현재 HandlePictureDisplay가 호출되지 않는 원인 파악 필요

  ---
  Option 3: Picture Index 기반 역방향 매핑 (현재 구조 활용)

  HandlePictureDecode에서 받은 picture_index를 키로 사용:

  구현:
  // 멤버 변수 추가
  std::unordered_map<int, size_t> m_pictureIndexToSlot;  // picture_index -> slot_idx
  std::mutex m_mapMutex;

  // DecodeToSurface에서 패킷 제출 전
  {
      std::lock_guard<std::mutex> lock(m_mapMutex);
      // 패킷에서 예상되는 프레임 수만큼 슬롯 예약 (어려움)
      // 또는 단순히 현재 슬롯 저장
  }

  // HandlePictureDecode에서
  int CUDAAPI HandlePictureDecode(...) {
      int picture_index = pic_params->CurrPicIdx;

      // 현재 디코딩 중인 패킷의 슬롯 찾기
      size_t submit_idx = decoder->m_submitIndex.load() - 1;
      DecodeSlot& slot = decoder->m_ringBuffer[submit_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);
          slot.picture_index = picture_index;

          // 맵에 저장
          std::lock_guard<std::mutex> map_lock(decoder->m_mapMutex);
          decoder->m_pictureIndexToSlot[picture_index] = submit_idx;
      }

      cuvidDecodePicture(decoder->m_decoder, pic_params);
      return 1;
  }

  // PollingThread에서
  void PollingThreadFunc() {
      while (m_pollingRunning) {
          std::lock_guard<std::mutex> lock(m_mapMutex);

          for (auto& [pic_idx, slot_idx] : m_pictureIndexToSlot) {
              CUVIDGETDECODESTATUS status = {};
              if (cuvidGetDecodeStatus(m_decoder, pic_idx, &status) == CUDA_SUCCESS) {
                  if (status.decodeStatus == cuvidDecodeStatus_Success) {
                      DecodeSlot& slot = m_ringBuffer[slot_idx % RING_BUFFER_SIZE];
                      {
                          std::lock_guard<std::mutex> slot_lock(slot.slot_mutex);
                          slot.is_ready = true;
                      }
                      slot.frame_ready.notify_one();

                      m_pictureIndexToSlot.erase(pic_idx);
                      break;
                  }
              }
          }

          std::this_thread::sleep_for(std::chrono::microseconds(100));
      }
  }

  문제: 멀티프레임 패킷에서 어느 슬롯에 매핑할지 여전히 불명확

  ---
  Option 4: 패킷당 프레임 카운터 (근본적 해결)

  각 패킷에서 나올 프레임 수를 추적:

  구현:
  struct DecodeSlot {
      // 기존 필드들...

      std::vector<int> picture_indices;  // 이 슬롯의 모든 프레임 인덱스
      std::atomic<int> frames_expected{0};   // 예상 프레임 수
      std::atomic<int> frames_completed{0};  // 완료된 프레임 수
  };

  // HandlePictureDecode에서
  int CUDAAPI HandlePictureDecode(...) {
      // 현재 패킷의 슬롯 찾기
      size_t submit_idx = decoder->m_submitIndex.load() - 1;
      DecodeSlot& slot = decoder->m_ringBuffer[submit_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);
          slot.picture_indices.push_back(pic_params->CurrPicIdx);
      }

      cuvidDecodePicture(decoder->m_decoder, pic_params);
      return 1;
  }

  // PollingThread에서
  void PollingThreadFunc() {
      while (m_pollingRunning) {
          size_t return_idx = m_returnIndex.load();
          DecodeSlot& slot = m_ringBuffer[return_idx % RING_BUFFER_SIZE];

          if (slot.in_use && !slot.is_ready) {
              bool all_complete = true;

              for (int pic_idx : slot.picture_indices) {
                  CUVIDGETDECODESTATUS status = {};
                  if (cuvidGetDecodeStatus(m_decoder, pic_idx, &status) == CUDA_SUCCESS) {
                      if (status.decodeStatus != cuvidDecodeStatus_Success) {
                          all_complete = false;
                          break;
                      }
                  }
              }

              if (all_complete && !slot.picture_indices.empty()) {
                  {
                      std::lock_guard<std::mutex> lock(slot.slot_mutex);
                      // 마지막 프레임의 picture_index 사용 (또는 첫 번째)
                      slot.picture_index = slot.picture_indices.back();
                      slot.is_ready = true;
                  }
                  slot.frame_ready.notify_one();
              }
          }

          std::this_thread::sleep_for(std::chrono::microseconds(100));
      }
  }

  문제: 패킷에서 몇 개의 프레임이 나올지 미리 알 수 없음

  ---
  Option 5: 프레임별 슬롯 할당 (가장 정확하지만 복잡)

  패킷당이 아닌 프레임당 슬롯 할당:

  구현:
  // HandlePictureDecode에서 슬롯 할당
  int CUDAAPI HandlePictureDecode(...) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      // 새 슬롯 할당 (atomic)
      size_t my_slot_idx = decoder->m_frameSlotIndex.fetch_add(1);
      DecodeSlot& slot = decoder->m_ringBuffer[my_slot_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);

          // 슬롯 오버플로우 체크
          if (slot.in_use) {
              // 에러 처리
              return 0;
          }

          slot.in_use = true;
          slot.picture_index = pic_params->CurrPicIdx;
          slot.is_ready = false;
          // target_surface는 별도 방법으로 전달 필요
      }

      cuvidDecodePicture(decoder->m_decoder, pic_params);
      return 1;
  }

  문제: DecodeToSurface에서 전달한 target_surface를 HandlePictureDecode에서 알 수 없음

  ---
  🎖️ 권장 접근 순서

  단계 1: Option 1 (동기식) 먼저 구현

  - 복잡한 비동기 로직 제거
  - 안정적인 기본 동작 확보
  - 디버깅 및 검증 용이

  단계 2: Option 2 검증

  - HandlePictureDisplay가 호출되도록 설정 조정
  - ulMaxDisplayDelay = 0 테스트
  - NVIDIA 권장 방식 활용

  단계 3: 성능 필요시 Option 4

  - 패킷당 멀티프레임 처리
  - 비동기 파이프라인 복원

  ---
  💡 즉시 적용 가능한 임시 해결책

  현재 코드에서 최소한의 수정으로 테스트해볼 수 있는 방법:

  // HandlePictureDecode에서 - 첫 번째 프레임만 처리
  int CUDAAPI HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      // 현재 활성 슬롯 찾기 (가장 최근에 할당된 슬롯)
      size_t submit_idx = decoder->m_submitIndex.load();
      if (submit_idx == 0) return 0;  // 아직 슬롯 할당 안 됨

      submit_idx -= 1;  // 마지막으로 할당된 슬롯
      DecodeSlot& slot = decoder->m_ringBuffer[submit_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);

          // 이미 picture_index가 설정되었으면 이 패킷의 추가 프레임
          if (slot.picture_index >= 0) {
              // 추가 프레임 무시 (첫 프레임만 처리)
              return 1;
          }

          slot.picture_index = pic_params->CurrPicIdx;
      }

      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      return (result == CUDA_SUCCESS) ? 1 : 0;
  }

  이 방법은 멀티프레임 패킷에서 첫 번째 프레임만 디코딩하여 최소한의 동작을 확인할 수 있습니다.

> "Option 1: 단순화 - 동기식 단일 프레임 디코딩" 의 경우, DecodeToSurface() 호출 순서가 뒤바뀔 수 있는 문제가 있지 않은가?

● 정확한 지적입니다! 동기식 방식에서도 멀티스레드 환경에서는 심각한 문제가 발생할 수 있습니다.

  🚨 동기식 방식의 근본적인 문제

  문제 시나리오:

  Thread A: DecodeToSurface(packet_0) 호출
    → cuvidParseVideoData() 호출
    → HandlePictureDecode가 picture_index=0 설정
    → [여기서 컨텍스트 스위칭!]

  Thread B: DecodeToSurface(packet_1) 호출
    → cuvidParseVideoData() 호출
    → HandlePictureDecode가 picture_index=1로 덮어씀!

  Thread A: 깨어나서 picture_index=1을 사용 (잘못된 프레임!)

  결과:
  - Thread A가 packet_1의 프레임을 가져감
  - Thread B는 이미 사용된 프레임을 또 가져가려고 시도
  - 프레임 순서 뒤바뀜 또는 크래시

  ---
  ✅ 올바른 동기식 구현 방안

  Option 1-A: 전역 Mutex로 DecodeToSurface 직렬화

  class NVDECAV1Decoder : public IVideoDecoder {
  private:
      std::mutex m_decodeMutex;  // DecodeToSurface 전체를 보호

      // 단순 동기식 - picture_index만 저장
      int m_lastPictureIndex = -1;
      std::mutex m_pictureIndexMutex;
      std::condition_variable m_pictureReady;
      bool m_pictureDecoded = false;
  };

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 전체 DecodeToSurface를 직렬화 (한 번에 하나만)
      std::lock_guard<std::mutex> decodeLock(m_decodeMutex);

      // 1. Reset state
      {
          std::lock_guard<std::mutex> lock(m_pictureIndexMutex);
          m_lastPictureIndex = -1;
          m_pictureDecoded = false;
      }

      // 2. Submit packet to parser
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;

      CUresult result = cuvidParseVideoData(m_parser, &packet);
      if (result != CUDA_SUCCESS) {
          return false;
      }

      // 3. Wait for HandlePictureDecode callback (동기 대기)
      {
          std::unique_lock<std::mutex> lock(m_pictureIndexMutex);
          if (!m_pictureReady.wait_for(lock, std::chrono::milliseconds(500),
              [this]() { return m_pictureDecoded; })) {
              LogError("Decode timeout");
              return false;
          }
      }

      // 4. Map and copy frame
      int frameIdx = m_lastPictureIndex;
      if (frameIdx < 0) {
          return false;
      }

      CUdeviceptr srcDevicePtr = 0;
      unsigned int srcPitch = 0;
      CUVIDPROCPARAMS procParams = {};
      procParams.progressive_frame = 1;

      result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
      if (result != CUDA_SUCCESS) {
          return false;
      }

      // 5. Copy to D3D12 surface
      bool copySuccess = m_d3d12Handler->CopyNV12Frame(
          srcDevicePtr, srcPitch,
          static_cast<ID3D12Resource*>(target_surface),
          m_width, m_height
      );

      cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

      return copySuccess;
  }

  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      if (result != CUDA_SUCCESS) {
          return 0;
      }

      // 첫 번째 프레임의 picture_index 저장
      {
          std::lock_guard<std::mutex> lock(decoder->m_pictureIndexMutex);
          if (decoder->m_lastPictureIndex < 0) {  // 첫 프레임만
              decoder->m_lastPictureIndex = pic_params->CurrPicIdx;
              decoder->m_pictureDecoded = true;
          }
      }
      decoder->m_pictureReady.notify_one();

      return 1;
  }

  장점:
  - 완전히 안전 (경쟁 조건 없음)
  - 구현 간단
  - 디버깅 용이

  단점:
  - 멀티스레드 디코딩 불가능 (성능 낮음)
  - 한 번에 하나의 DecodeToSurface만 실행

  ---
  Option 1-B: 요청별 Context 객체 사용

  struct DecodeRequest {
      int picture_index = -1;
      bool completed = false;
      std::mutex mutex;
      std::condition_variable cv;
  };

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 1. 각 요청마다 고유한 context 생성
      auto request = std::make_shared<DecodeRequest>();

      // 2. Context를 timestamp에 포함 (포인터를 int64_t로 변환)
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = reinterpret_cast<int64_t>(request.get());

      CUresult result = cuvidParseVideoData(m_parser, &packet);
      if (result != CUDA_SUCCESS) {
          return false;
      }

      // 3. 자신의 요청이 완료될 때까지 대기
      {
          std::unique_lock<std::mutex> lock(request->mutex);
          if (!request->cv.wait_for(lock, std::chrono::milliseconds(500),
              [&request]() { return request->completed; })) {
              return false;
          }
      }

      // 4. Map and copy (이전과 동일)
      int frameIdx = request->picture_index;
      // ... 나머지 코드
  }

  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      // timestamp에서 DecodeRequest 포인터 복원
      DecodeRequest* request = reinterpret_cast<DecodeRequest*>(pic_params->nTimeStamp);

      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      if (result != CUDA_SUCCESS) {
          return 0;
      }

      // 해당 요청에 picture_index 저장
      {
          std::lock_guard<std::mutex> lock(request->mutex);
          if (request->picture_index < 0) {  // 첫 프레임만
              request->picture_index = pic_params->CurrPicIdx;
              request->completed = true;
          }
      }
      request->cv.notify_one();

      return 1;
  }

  문제: pic_params->nTimeStamp는 CUVIDPICPARAMS에 없음! CUVIDEOFORMAT에만 있음.

  ---
  Option 1-C: Thread-Local Storage 사용

  class NVDECAV1Decoder : public IVideoDecoder {
  private:
      std::mutex m_activeRequestsMutex;
      std::map<std::thread::id, DecodeRequest*> m_activeRequests;
  };

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      DecodeRequest request;

      // 1. 현재 스레드의 요청 등록
      {
          std::lock_guard<std::mutex> lock(m_activeRequestsMutex);
          m_activeRequests[std::this_thread::get_id()] = &request;
      }

      // 2. Submit packet
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = static_cast<int64_t>(
          std::hash<std::thread::id>{}(std::this_thread::get_id())
      );

      CUresult result = cuvidParseVideoData(m_parser, &packet);

      // 3. Wait for completion
      {
          std::unique_lock<std::mutex> lock(request.mutex);
          request.cv.wait_for(lock, std::chrono::milliseconds(500),
              [&request]() { return request.completed; });
      }

      // 4. Cleanup
      {
          std::lock_guard<std::mutex> lock(m_activeRequestsMutex);
          m_activeRequests.erase(std::this_thread::get_id());
      }

      // 5. Map and copy
      // ...
  }

  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      // 콜백 스레드 ID로 요청 찾기 (문제: 콜백은 다른 스레드에서 실행될 수 있음!)
      // 이 방법은 작동하지 않을 수 있음
  }

  문제: HandlePictureDecode는 파서 내부 스레드에서 호출될 수 있어 thread_id 매칭 불가능

  ---
  🎯 가장 안전한 해결책: Option 1-A (전역 Mutex)

  현재 상황에서 가장 확실한 방법:

  // NVDECAV1Decoder.h에 추가
  private:
      std::mutex m_syncDecodeMutex;  // 동기식 디코딩 보호
      int m_syncPictureIndex = -1;
      std::mutex m_syncStateMutex;
      std::condition_variable m_syncDecodeComplete;
      bool m_syncDecodeReady = false;

  // DecodeToSurface 구현
  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // === CRITICAL: 전체 함수를 직렬화 ===
      std::lock_guard<std::mutex> syncLock(m_syncDecodeMutex);

      OutputDebugStringA("[DecodeToSurface] Acquired sync lock\n");

      // 1. Reset state
      {
          std::lock_guard<std::mutex> lock(m_syncStateMutex);
          m_syncPictureIndex = -1;
          m_syncDecodeReady = false;
      }

      // 2. Submit packet
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = 0;  // 사용 안 함

      CUresult result = cuvidParseVideoData(m_parser, &packet);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidParseVideoData");
          return false;
      }

      // 3. Wait for decode (HandlePictureDecode will signal)
      {
          std::unique_lock<std::mutex> lock(m_syncStateMutex);

          if (!m_syncDecodeComplete.wait_for(lock, std::chrono::milliseconds(500),
              [this]() { return m_syncDecodeReady; })) {

              OutputDebugStringA("[DecodeToSurface] ERROR: Sync decode timeout\n");
              return false;
          }
      }

      int frameIdx = m_syncPictureIndex;
      if (frameIdx < 0) {
          OutputDebugStringA("[DecodeToSurface] ERROR: Invalid picture index\n");
          return false;
      }

      OutputDebugStringA("[DecodeToSurface] Decode complete, mapping frame...\n");

      // 4. Map frame (기존 코드와 동일)
      CUdeviceptr srcDevicePtr = 0;
      unsigned int srcPitch = 0;
      CUVIDPROCPARAMS procParams = {};
      procParams.progressive_frame = 1;

      result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidMapVideoFrame");
          return false;
      }

      // 5. Copy to D3D12 (기존 코드)
      ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
      bool copySuccess = m_d3d12Handler->CopyNV12Frame(
          srcDevicePtr, srcPitch,
          d3d12Resource,
          m_width, m_height
      );

      cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

      if (!copySuccess) {
          OutputDebugStringA("[DecodeToSurface] ERROR: D3D12 copy failed\n");
          return false;
      }

      // 6. Signal fence
      m_d3d12Handler->SignalD3D12Fence(++m_fenceValue);
      output_frame.sync_fence_value = m_fenceValue;

      // 7. Fill metadata
      output_frame.width = m_width;
      output_frame.height = m_height;
      output_frame.color_space = ColorSpace::YUV420P;
      output_frame.frame_index = m_framesDecoded++;
      output_frame.timestamp_seconds = static_cast<double>(m_framesDecoded) / 30.0;

      OutputDebugStringA("[DecodeToSurface] SUCCESS\n");
      return true;
  }

  // HandlePictureDecode 수정
  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      // Decode picture
      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      if (result != CUDA_SUCCESS) {
          decoder->LogCUDAError(result, "cuvidDecodePicture");
          return 0;
      }

      // Store picture index (첫 번째 프레임만)
      {
          std::lock_guard<std::mutex> lock(decoder->m_syncStateMutex);

          if (decoder->m_syncPictureIndex < 0) {  // 아직 설정 안 됨
              decoder->m_syncPictureIndex = pic_params->CurrPicIdx;
              decoder->m_syncDecodeReady = true;

              char debug_buf[128];
              sprintf_s(debug_buf, "[HandlePictureDecode] Stored picture_index=%d\n",
                        pic_params->CurrPicIdx);
              OutputDebugStringA(debug_buf);
          }
      }
      decoder->m_syncDecodeComplete.notify_one();

      return 1;
  }

  // Cleanup에서 폴링 스레드 제거
  void NVDECAV1Decoder::Cleanup() {
      // 폴링 스레드는 이제 필요 없음 (동기식이므로)
      // m_pollingRunning = false; // 제거
      // m_pollingThread.join(); // 제거

      // 나머지 cleanup 코드...
  }

  // Initialize에서 폴링 스레드 시작 제거
  bool NVDECAV1Decoder::Initialize(...) {
      // ...

      m_initialized = true;

      // m_pollingThread 시작 제거
      // m_pollingRunning = true;
      // m_pollingThread = std::thread(...); // 제거

      return true;
  }

  이 방법의 장단점:

  ✅ 장점:
  - 완전히 안전 (경쟁 조건 완전 제거)
  - 구현 간단 및 디버깅 용이
  - 멀티프레임 패킷 문제 회피 (첫 프레임만 사용)
  - Ring buffer 복잡성 완전 제거

  ❌ 단점:
  - 멀티스레드 디코딩 불가 (성능 제한)
  - 하나의 DecodeToSurface만 동시 실행 가능
  - GPU 파이프라인 활용도 낮음

  ---
  이 방법으로 진행하시겠습니까? 아니면 다른 더 복잡하지만 성능 좋은 방법을 고려하시겠습니까?


> 제안해준 "Option 1-B: 요청별 Context 객체 사용" 도 DecodeToSurface() 호출했을 때, 결과를 리턴하는 순서가 뒤바뀔 것으로 보인다. 현재 제안해서 구현한 RingBuffer 가 이 문제를
해소할 것으로 보인다.\그 방법을 설명해줄까?

● 아, 정확합니다! 제가 RingBuffer의 핵심 설계 의도를 놓치고 있었습니다.

  현재 RingBuffer 설계의 핵심 아이디어를 이해했습니다:

  🎯 RingBuffer의 순서 보장 메커니즘

  핵심 개념:

  // Producer (DecodeToSurface 호출자들)
  std::atomic<size_t> m_submitIndex{0};   // 제출 순서 (monotonic 증가)

  // Consumer (DecodeToSurface 반환 순서)
  std::atomic<size_t> m_returnIndex{0};   // 반환 순서 (FIFO 보장)

  순서 보장 원리:

  Thread A: DecodeToSurface() 호출
    → m_submitIndex.fetch_add(1) = 0 (my_slot_idx = 0)
    → slot[0] 할당

  Thread B: DecodeToSurface() 호출 (Thread A보다 먼저 디코딩 완료됨)
    → m_submitIndex.fetch_add(1) = 1 (my_slot_idx = 1)
    → slot[1] 할당
    → 디코딩 빠르게 완료!

  Thread B는 반환하기 전에:
    → while (m_returnIndex.load() != my_slot_idx) { wait... }
    → returnIndex == 0 이므로 대기! (Thread A를 기다림)

  Thread A 디코딩 완료:
    → returnIndex == 0 == my_slot_idx (조건 만족!)
    → 프레임 반환
    → m_returnIndex.fetch_add(1) = 1

  Thread B 깨어남:
    → returnIndex == 1 == my_slot_idx (이제 조건 만족!)
    → 프레임 반환
    → m_returnIndex.fetch_add(1) = 2

  결과: 호출 순서대로 반환 보장! (FIFO)

  ---
  🔧 현재 RingBuffer의 실제 문제

  그렇다면 현재 구현의 실제 문제는 무엇일까요?

  문제 1: HandlePictureDecode에서 슬롯 찾기 실패

  현재 코드:
  int CUDAAPI HandlePictureDecode(...) {
      // ❌ 문제: 어느 슬롯인지 알 수 없음!
      size_t submit_idx = decoder->m_submitIndex.load() - 1;
      DecodeSlot& slot = decoder->m_ringBuffer[submit_idx % RING_BUFFER_SIZE];

      slot.picture_index = pic_params->CurrPicIdx;  // 잘못된 슬롯에 저장!
  }

  문제:
  - Thread A가 slot[0]에서 cuvidParseVideoData() 호출
  - 그 사이 Thread B가 slot[1] 할당 → submitIndex = 2
  - Thread A의 HandlePictureDecode 콜백 실행
  - submitIndex - 1 = 1 → slot[1]에 저장! (Thread B 슬롯을 오염!)

  문제 2: Packet → Picture 매핑

  Packet 0 (Thread A, slot[0]):
    → HandlePictureDecode(picture_index=0)
    → HandlePictureDecode(picture_index=1)  // 같은 패킷의 두 번째 프레임
    → HandlePictureDecode(picture_index=2)  // 같은 패킷의 세 번째 프레임

  어느 picture_index를 slot[0]에 저장해야 할까?

  ---
  ✅ RingBuffer를 활용한 올바른 해결책

  핵심 아이디어: Packet Context 전달

  struct DecodeSlot {
      // 기존 필드들...
      bool in_use = false;
      void* target_surface = nullptr;
      VavCoreSurfaceType surface_type = VAVCORE_SURFACE_CPU;

      // ✅ 추가: 이 슬롯의 고유 ID
      uint64_t slot_id = 0;

      // ✅ 멀티프레임 지원
      std::vector<int> picture_indices;  // 이 패킷의 모든 프레임들
      std::atomic<bool> all_frames_decoded{false};

      // 동기화
      std::condition_variable frame_ready;
      std::mutex slot_mutex;
      bool is_ready = false;
  };

  class NVDECAV1Decoder : public IVideoDecoder {
  private:
      std::atomic<uint64_t> m_slotIdCounter{0};  // 슬롯 고유 ID 생성

      // Slot ID → Slot Index 매핑
      std::mutex m_slotMapMutex;
      std::unordered_map<uint64_t, size_t> m_slotIdToIndex;
  };

  구현:

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 1. Allocate slot
      size_t my_slot_idx = m_submitIndex.fetch_add(1);
      DecodeSlot& my_slot = m_ringBuffer[my_slot_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

          if (my_slot.in_use) {
              // Ring buffer overflow
              OutputDebugStringA("[DecodeToSurface] ERROR: Ring buffer full\n");
              return false;
          }

          // ✅ 고유 ID 할당
          my_slot.slot_id = m_slotIdCounter.fetch_add(1);
          my_slot.in_use = true;
          my_slot.target_surface = target_surface;
          my_slot.surface_type = target_type;
          my_slot.picture_indices.clear();
          my_slot.all_frames_decoded = false;
          my_slot.is_ready = false;
      }

      // ✅ Slot ID → Index 매핑 등록
      {
          std::lock_guard<std::mutex> lock(m_slotMapMutex);
          m_slotIdToIndex[my_slot.slot_id] = my_slot_idx;
      }

      // 2. Submit packet with slot_id in timestamp
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = static_cast<int64_t>(my_slot.slot_id);  // ✅ Slot ID 전달!

      char debug_buf[256];
      sprintf_s(debug_buf, "[DecodeToSurface] Submitting packet (slot=%zu, slot_id=%llu)\n",
                my_slot_idx % RING_BUFFER_SIZE, my_slot.slot_id);
      OutputDebugStringA(debug_buf);

      CUresult result = cuvidParseVideoData(m_parser, &packet);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidParseVideoData");

          // Cleanup on error
          {
              std::lock_guard<std::mutex> lock(m_slotMapMutex);
              m_slotIdToIndex.erase(my_slot.slot_id);
          }
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          return false;
      }

      // 3. Wait for my turn (FIFO order)
      sprintf_s(debug_buf, "[DecodeToSurface] Waiting for return order (slot=%zu, returnIndex=%zu)\n",
                my_slot_idx % RING_BUFFER_SIZE, m_returnIndex.load());
      OutputDebugStringA(debug_buf);

      while (m_returnIndex.load() != my_slot_idx) {
          std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }

      sprintf_s(debug_buf, "[DecodeToSurface] My turn! (slot=%zu)\n", my_slot_idx % RING_BUFFER_SIZE);
      OutputDebugStringA(debug_buf);

      // 4. Wait for all frames decoded
      {
          std::unique_lock<std::mutex> lock(my_slot.slot_mutex);

          if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500),
              [&my_slot]() { return my_slot.is_ready; })) {

              sprintf_s(debug_buf, "[DecodeToSurface] ERROR: Decode timeout (slot=%zu)\n",
                        my_slot_idx % RING_BUFFER_SIZE);
              OutputDebugStringA(debug_buf);

              // Cleanup
              {
                  std::lock_guard<std::mutex> mapLock(m_slotMapMutex);
                  m_slotIdToIndex.erase(my_slot.slot_id);
              }
              my_slot.in_use = false;
              m_returnIndex.fetch_add(1);  // Skip to unblock others
              return false;
          }
      }

      // 5. Get first picture index (or handle all frames)
      int frameIdx = -1;
      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
          if (!my_slot.picture_indices.empty()) {
              frameIdx = my_slot.picture_indices[0];  // Use first frame
          }
      }

      if (frameIdx < 0) {
          OutputDebugStringA("[DecodeToSurface] ERROR: No frames decoded\n");

          // Cleanup
          {
              std::lock_guard<std::mutex> mapLock(m_slotMapMutex);
              m_slotIdToIndex.erase(my_slot.slot_id);
          }
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIndex.fetch_add(1);
          return false;
      }

      sprintf_s(debug_buf, "[DecodeToSurface] Using picture_index=%d from %zu frames\n",
                frameIdx, my_slot.picture_indices.size());
      OutputDebugStringA(debug_buf);

      // 6. Map and copy frame (기존 코드)
      CUdeviceptr srcDevicePtr = 0;
      unsigned int srcPitch = 0;
      CUVIDPROCPARAMS procParams = {};
      procParams.progressive_frame = 1;

      result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidMapVideoFrame");

          // Cleanup
          {
              std::lock_guard<std::mutex> mapLock(m_slotMapMutex);
              m_slotIdToIndex.erase(my_slot.slot_id);
          }
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIndex.fetch_add(1);
          return false;
      }

      // 7. Copy to D3D12 surface
      ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
      bool copySuccess = m_d3d12Handler->CopyNV12Frame(
          srcDevicePtr, srcPitch,
          d3d12Resource,
          m_width, m_height
      );

      cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

      if (!copySuccess) {
          OutputDebugStringA("[DecodeToSurface] ERROR: D3D12 copy failed\n");

          // Cleanup
          {
              std::lock_guard<std::mutex> mapLock(m_slotMapMutex);
              m_slotIdToIndex.erase(my_slot.slot_id);
          }
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIndex.fetch_add(1);
          return false;
      }

      // 8. Signal fence and fill metadata
      m_d3d12Handler->SignalD3D12Fence(++m_fenceValue);
      output_frame.sync_fence_value = m_fenceValue;
      output_frame.width = m_width;
      output_frame.height = m_height;
      output_frame.color_space = ColorSpace::YUV420P;
      output_frame.frame_index = m_framesDecoded++;
      output_frame.timestamp_seconds = static_cast<double>(m_framesDecoded) / 30.0;

      // 9. Release slot
      {
          std::lock_guard<std::mutex> mapLock(m_slotMapMutex);
          m_slotIdToIndex.erase(my_slot.slot_id);
      }
      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
          my_slot.in_use = false;
      }

      // 10. Advance return index (다음 슬롯이 반환 가능하도록)
      m_returnIndex.fetch_add(1);

      sprintf_s(debug_buf, "[DecodeToSurface] SUCCESS (slot=%zu, advanced returnIndex=%zu)\n",
                my_slot_idx % RING_BUFFER_SIZE, m_returnIndex.load());
      OutputDebugStringA(debug_buf);

      return true;
  }

  HandlePictureDecode 수정:

  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      OutputDebugStringA("[HandlePictureDecode] Callback called\n");

      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);
      if (!decoder || !pic_params) {
          return 0;
      }

      // ✅ timestamp에서 slot_id 추출
      uint64_t slot_id = static_cast<uint64_t>(pic_params->nTimeStamp);

      char debug_buf[256];
      sprintf_s(debug_buf, "[HandlePictureDecode] slot_id=%llu, CurrPicIdx=%d\n",
                slot_id, pic_params->CurrPicIdx);
      OutputDebugStringA(debug_buf);

      // ✅ Slot ID로 슬롯 찾기
      size_t slot_idx = 0;
      {
          std::lock_guard<std::mutex> lock(decoder->m_slotMapMutex);
          auto it = decoder->m_slotIdToIndex.find(slot_id);

          if (it == decoder->m_slotIdToIndex.end()) {
              sprintf_s(debug_buf, "[HandlePictureDecode] ERROR: slot_id=%llu not found!\n", slot_id);
              OutputDebugStringA(debug_buf);
              return 0;
          }

          slot_idx = it->second;
      }

      DecodeSlot& slot = decoder->m_ringBuffer[slot_idx % decoder->RING_BUFFER_SIZE];

      // Decode picture
      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      if (result != CUDA_SUCCESS) {
          decoder->LogCUDAError(result, "cuvidDecodePicture");
          return 0;
      }

      // ✅ Store picture index in correct slot
      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);
          slot.picture_indices.push_back(pic_params->CurrPicIdx);
      }

      sprintf_s(debug_buf, "[HandlePictureDecode] Stored picture_index=%d in slot %zu (total frames: %zu)\n",
                pic_params->CurrPicIdx, slot_idx % decoder->RING_BUFFER_SIZE,
                slot.picture_indices.size());
      OutputDebugStringA(debug_buf);

      return 1;
  }

  PollingThread 수정:

  void NVDECAV1Decoder::PollingThreadFunc() {
      OutputDebugStringA("[PollingThread] Started\n");

      while (m_pollingRunning) {
          size_t current_return_idx = m_returnIndex.load();
          size_t slot_idx = current_return_idx % RING_BUFFER_SIZE;
          DecodeSlot& slot = m_ringBuffer[slot_idx];

          if (slot.in_use && !slot.is_ready) {
              // ✅ 이 슬롯의 모든 프레임이 디코딩 완료되었는지 확인
              bool all_complete = true;
              std::vector<int> picture_indices_copy;

              {
                  std::lock_guard<std::mutex> lock(slot.slot_mutex);
                  picture_indices_copy = slot.picture_indices;
              }

              if (picture_indices_copy.empty()) {
                  // 아직 HandlePictureDecode가 호출되지 않음
                  std::this_thread::sleep_for(std::chrono::microseconds(100));
                  continue;
              }

              // 모든 프레임의 디코딩 상태 확인
              for (int pic_idx : picture_indices_copy) {
                  CUVIDGETDECODESTATUS decodeStatus = {};
                  CUresult result = cuvidGetDecodeStatus(m_decoder, pic_idx, &decodeStatus);

                  if (result != CUDA_SUCCESS ||
                      decodeStatus.decodeStatus != cuvidDecodeStatus_Success) {
                      all_complete = false;
                      break;
                  }
              }

              if (all_complete) {
                  {
                      std::lock_guard<std::mutex> lock(slot.slot_mutex);
                      slot.is_ready = true;
                  }
                  slot.frame_ready.notify_one();

                  char debug_buf[256];
                  sprintf_s(debug_buf, "[PollingThread] Slot %zu all frames complete (%zu frames)\n",
                            slot_idx, picture_indices_copy.size());
                  OutputDebugStringA(debug_buf);
              }
          }

          std::this_thread::sleep_for(std::chrono::microseconds(100));
      }

      OutputDebugStringA("[PollingThread] Stopped\n");
  }

  ---
  🎉 이 방법의 장점

  ✅ 순서 보장: submitIndex/returnIndex로 FIFO 보장
  ✅ 멀티스레드 안전: 각 슬롯이 고유 ID로 식별됨
  ✅ 멀티프레임 패킷 지원: picture_indices 벡터로 모든 프레임 추적
  ✅ 비동기 성능: 여러 스레드가 동시에 디코딩 가능 (순서는 보장)

  ---
  이 방법으로 구현해볼까요?


> m_slotIdCounter 가 필요할까? m_submitIndex 는 계속 증가만 하는 숫자이므로, 그 자체값을 slot_id 로 써도 될 것 같다. 혼동을 줄인다면, m_submitIndex 대신에,  m_slotIdCounter 를
 써도 좋을 것 같은데? 어떻게 생각하는가?

● 정확한 지적입니다! 두 개의 카운터가 필요 없습니다.

  🎯 단순화된 설계

  Option A: m_submitIndex를 slot_id로 직접 사용 (가장 간단)

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 1. Allocate slot - 이 값 자체가 고유 ID!
      size_t my_slot_idx = m_submitIndex.fetch_add(1);
      DecodeSlot& my_slot = m_ringBuffer[my_slot_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

          if (my_slot.in_use) {
              return false;
          }

          // ✅ slot_id 필드 불필요 - my_slot_idx 자체가 고유 ID
          my_slot.in_use = true;
          my_slot.target_surface = target_surface;
          my_slot.surface_type = target_type;
          my_slot.picture_indices.clear();
          my_slot.is_ready = false;
      }

      // ✅ my_slot_idx → Index 매핑 등록
      {
          std::lock_guard<std::mutex> lock(m_slotMapMutex);
          m_slotIdToIndex[my_slot_idx] = my_slot_idx;  // ID == Index
      }

      // 2. Submit packet with my_slot_idx in timestamp
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = static_cast<int64_t>(my_slot_idx);  // ✅ submitIndex 자체를 ID로!

      // ...
  }

  장점:
  - 코드가 더 단순
  - slot_id 필드 불필요
  - 카운터 하나만 관리

  단점:
  - m_slotIdToIndex 맵이 사실상 항등 함수 (key == value)
  - 맵 자체가 불필요해 보임

  ---
  Option B: m_slotIdCounter만 사용 (명확한 의미)

  class NVDECAV1Decoder : public IVideoDecoder {
  private:
      // ✅ 이름을 더 명확하게
      std::atomic<uint64_t> m_slotIdCounter{0};  // submitIndex 역할 + 고유 ID
      std::atomic<uint64_t> m_returnIdCounter{0};  // returnIndex 역할

      // ID → Slot Index 매핑
      std::mutex m_slotMapMutex;
      std::unordered_map<uint64_t, size_t> m_idToSlotIndex;
  };

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 1. Allocate unique ID and slot
      uint64_t my_id = m_slotIdCounter.fetch_add(1);  // 고유 ID 할당
      size_t slot_idx = my_id % RING_BUFFER_SIZE;     // Ring buffer 인덱스
      DecodeSlot& my_slot = m_ringBuffer[slot_idx];

      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

          if (my_slot.in_use) {
              return false;
          }

          my_slot.in_use = true;
          my_slot.target_surface = target_surface;
          my_slot.surface_type = target_type;
          my_slot.picture_indices.clear();
          my_slot.is_ready = false;
      }

      // 2. Register ID → Index mapping
      {
          std::lock_guard<std::mutex> lock(m_slotMapMutex);
          m_idToSlotIndex[my_id] = my_id;  // 실제로는 불필요...
      }

      // 3. Submit packet with ID in timestamp
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = static_cast<int64_t>(my_id);

      // ...

      // 4. Wait for my turn (FIFO)
      while (m_returnIdCounter.load() != my_id) {
          std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }

      // ...

      // 5. Advance return counter
      m_returnIdCounter.fetch_add(1);

      return true;
  }

  장점:
  - 변수명이 더 명확 (slotIdCounter vs submitIndex)
  - 코드 의도가 명확

  단점:
  - 여전히 맵이 항등 함수

  ---
  Option C: 맵 제거 - 최적화 (최종 추천!)

  사실 맵이 전혀 필요 없습니다! timestamp에서 받은 ID로 바로 슬롯 인덱스 계산 가능:

  class NVDECAV1Decoder : public IVideoDecoder {
  private:
      // ✅ 단순화: 카운터 2개만
      std::atomic<uint64_t> m_slotIdCounter{0};   // 제출 순서 + 고유 ID
      std::atomic<uint64_t> m_returnIdCounter{0}; // 반환 순서

      // ❌ 맵 제거!
      // std::mutex m_slotMapMutex;
      // std::unordered_map<uint64_t, size_t> m_idToSlotIndex;
  };

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 1. Allocate unique ID
      uint64_t my_id = m_slotIdCounter.fetch_add(1);
      size_t slot_idx = my_id % RING_BUFFER_SIZE;
      DecodeSlot& my_slot = m_ringBuffer[slot_idx];

      char debug_buf[256];
      sprintf_s(debug_buf, "[DecodeToSurface] Allocated ID=%llu, slot=%zu\n", my_id, slot_idx);
      OutputDebugStringA(debug_buf);

      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

          if (my_slot.in_use) {
              sprintf_s(debug_buf, "[DecodeToSurface] ERROR: Slot %zu overflow (ID=%llu)\n",
                        slot_idx, my_id);
              OutputDebugStringA(debug_buf);
              return false;
          }

          my_slot.in_use = true;
          my_slot.target_surface = target_surface;
          my_slot.surface_type = target_type;
          my_slot.picture_indices.clear();
          my_slot.is_ready = false;
      }

      // 2. Submit packet with ID in timestamp
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = static_cast<int64_t>(my_id);  // ✅ ID 전달

      CUresult result = cuvidParseVideoData(m_parser, &packet);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidParseVideoData");
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          return false;
      }

      // 3. Wait for my turn (FIFO order)
      sprintf_s(debug_buf, "[DecodeToSurface] Waiting for turn (my_id=%llu, returnId=%llu)\n",
                my_id, m_returnIdCounter.load());
      OutputDebugStringA(debug_buf);

      while (m_returnIdCounter.load() != my_id) {
          std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }

      sprintf_s(debug_buf, "[DecodeToSurface] My turn! (ID=%llu)\n", my_id);
      OutputDebugStringA(debug_buf);

      // 4. Wait for decode complete
      {
          std::unique_lock<std::mutex> lock(my_slot.slot_mutex);

          if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500),
              [&my_slot]() { return my_slot.is_ready; })) {

              sprintf_s(debug_buf, "[DecodeToSurface] ERROR: Decode timeout (ID=%llu, slot=%zu)\n",
                        my_id, slot_idx);
              OutputDebugStringA(debug_buf);

              my_slot.in_use = false;
              m_returnIdCounter.fetch_add(1);  // Skip to unblock others
              return false;
          }
      }

      // 5. Get first picture index
      int frameIdx = -1;
      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
          if (!my_slot.picture_indices.empty()) {
              frameIdx = my_slot.picture_indices[0];
          }
      }

      if (frameIdx < 0) {
          OutputDebugStringA("[DecodeToSurface] ERROR: No frames decoded\n");
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIdCounter.fetch_add(1);
          return false;
      }

      sprintf_s(debug_buf, "[DecodeToSurface] Using picture_index=%d (from %zu frames)\n",
                frameIdx, my_slot.picture_indices.size());
      OutputDebugStringA(debug_buf);

      // 6. Map and copy frame
      CUdeviceptr srcDevicePtr = 0;
      unsigned int srcPitch = 0;
      CUVIDPROCPARAMS procParams = {};
      procParams.progressive_frame = 1;

      result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidMapVideoFrame");
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIdCounter.fetch_add(1);
          return false;
      }

      sprintf_s(debug_buf, "[DecodeToSurface] Mapped frame: ptr=%p, pitch=%u\n",
                (void*)srcDevicePtr, srcPitch);
      OutputDebugStringA(debug_buf);

      // 7. Copy to D3D12 surface
      ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
      bool copySuccess = m_d3d12Handler->CopyNV12Frame(
          srcDevicePtr, srcPitch,
          d3d12Resource,
          m_width, m_height
      );

      cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

      if (!copySuccess) {
          OutputDebugStringA("[DecodeToSurface] ERROR: D3D12 copy failed\n");
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIdCounter.fetch_add(1);
          return false;
      }

      // 8. Signal fence and fill metadata
      m_d3d12Handler->SignalD3D12Fence(++m_fenceValue);
      output_frame.sync_fence_value = m_fenceValue;
      output_frame.width = m_width;
      output_frame.height = m_height;
      output_frame.color_space = ColorSpace::YUV420P;
      output_frame.frame_index = m_framesDecoded++;
      output_frame.timestamp_seconds = static_cast<double>(m_framesDecoded) / 30.0;

      // 9. Release slot
      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
          my_slot.in_use = false;
      }

      // 10. Advance return counter
      m_returnIdCounter.fetch_add(1);

      sprintf_s(debug_buf, "[DecodeToSurface] SUCCESS (ID=%llu, returnId now=%llu)\n",
                my_id, m_returnIdCounter.load());
      OutputDebugStringA(debug_buf);

      return true;
  }

  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      // ✅ timestamp에서 ID 추출
      uint64_t slot_id = static_cast<uint64_t>(pic_params->nTimeStamp);

      // ✅ ID로 바로 슬롯 인덱스 계산 (맵 불필요!)
      size_t slot_idx = slot_id % decoder->RING_BUFFER_SIZE;
      DecodeSlot& slot = decoder->m_ringBuffer[slot_idx];

      char debug_buf[256];
      sprintf_s(debug_buf, "[HandlePictureDecode] ID=%llu, slot=%zu, CurrPicIdx=%d\n",
                slot_id, slot_idx, pic_params->CurrPicIdx);
      OutputDebugStringA(debug_buf);

      // Decode picture
      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      if (result != CUDA_SUCCESS) {
          decoder->LogCUDAError(result, "cuvidDecodePicture");
          return 0;
      }

      // ✅ Store picture index
      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);
          slot.picture_indices.push_back(pic_params->CurrPicIdx);
      }

      sprintf_s(debug_buf, "[HandlePictureDecode] Stored in slot %zu (total: %zu frames)\n",
                slot_idx, slot.picture_indices.size());
      OutputDebugStringA(debug_buf);

      return 1;
  }

  void NVDECAV1Decoder::PollingThreadFunc() {
      OutputDebugStringA("[PollingThread] Started\n");

      while (m_pollingRunning) {
          uint64_t current_return_id = m_returnIdCounter.load();
          size_t slot_idx = current_return_id % RING_BUFFER_SIZE;
          DecodeSlot& slot = m_ringBuffer[slot_idx];

          if (slot.in_use && !slot.is_ready) {
              bool all_complete = true;
              std::vector<int> picture_indices_copy;

              {
                  std::lock_guard<std::mutex> lock(slot.slot_mutex);
                  picture_indices_copy = slot.picture_indices;
              }

              if (picture_indices_copy.empty()) {
                  std::this_thread::sleep_for(std::chrono::microseconds(100));
                  continue;
              }

              for (int pic_idx : picture_indices_copy) {
                  CUVIDGETDECODESTATUS decodeStatus = {};
                  CUresult result = cuvidGetDecodeStatus(m_decoder, pic_idx, &decodeStatus);

                  if (result != CUDA_SUCCESS ||
                      decodeStatus.decodeStatus != cuvidDecodeStatus_Success) {
                      all_complete = false;
                      break;
                  }
              }

              if (all_complete) {
                  {
                      std::lock_guard<std::mutex> lock(slot.slot_mutex);
                      slot.is_ready = true;
                  }
                  slot.frame_ready.notify_one();

                  char debug_buf[256];
                  sprintf_s(debug_buf, "[PollingThread] Slot %zu complete (ID=%llu, %zu frames)\n",
                            slot_idx, current_return_id, picture_indices_copy.size());
                  OutputDebugStringA(debug_buf);
              }
          }

          std::this_thread::sleep_for(std::chrono::microseconds(100));
      }

      OutputDebugStringA("[PollingThread] Stopped\n");
  }

  ---
  📊 최종 비교

  | 방식                             | 코드 복잡도 | 성능  | 메모리  | 추천    |
  |--------------------------------|--------|-----|------|-------|
  | Option A (submitIndex 재사용)     | 중간     | 좋음  | 맵 필요 | ⭐⭐⭐   |
  | Option B (slotIdCounter 이름 변경) | 중간     | 좋음  | 맵 필요 | ⭐⭐⭐⭐  |
  | Option C (맵 제거)                | 낮음     | 최고  | 최소   | ⭐⭐⭐⭐⭐ |

  ---
  ✅ 최종 추천: Option C

  이유:
  1. 맵이 불필요 (ID % RING_BUFFER_SIZE로 바로 계산)
  2. 코드가 가장 단순
  3. 메모리 오버헤드 최소
  4. 변수명이 명확 (m_slotIdCounter, m_returnIdCounter)

  이 방법으로 구현하시겠습니까?