video-v1/prompt-redesign-ringbuffer.txt

● Here are several approaches to resolving the mismatch between the NVDEC asynchronous pipeline and multi-frame packets:

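  🎯 Proposed solutions

  Option 1: Simplify - synchronous single-frame decoding (the quickest fix)

  Remove the complex ring buffer and switch to simple synchronous decoding:

  Pros:
  - Simple to implement and easy to debug
  - Sidesteps the packet-to-frame mapping problem entirely
  - Can be extended to async once it is stable

  Cons:
  - Lower throughput than the async approach
  - GPU utilization is suboptimal

  Implementation outline: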
  bool DecodeToSurface(...) {
      // 1. Submit packet to parser
      cuvidParseVideoData(m_parser, &packet);

      // 2. Wait for decode complete (blocking)
      // Use a simple flag or semaphore

      // 3. Map and copy frame
      cuvidMapVideoFrame(...);
      CopyToD3D12(...);
      cuvidUnmapVideoFrame(...);

      return true;
  }
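
  Step 2 of the outline ("a simple flag or semaphore") could look roughly like the sketch below. DecodeSignal is a hypothetical helper, not part of the NVDEC API or the existing decoder; it only shows the flag + condition variable pattern a decode callback would signal and DecodeToSurface would block on.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper (not part of NVDEC): a resettable completion flag that a
// decode callback sets and the submitting thread waits on.
class DecodeSignal {
public:
    // Clear the state before submitting the next packet.
    void Reset() {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_pictureIndex = -1;
        m_done = false;
    }

    // Called from the decode callback once a picture index is known.
    void Complete(int pictureIndex) {
        assert(pictureIndex >= 0);
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_pictureIndex = pictureIndex;
            m_done = true;
        }
        m_cv.notify_one();
    }

    // Blocks until Complete() has been called or the timeout expires.
    // Returns the picture index, or -1 on timeout.
    int Wait(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m_mutex);
        if (!m_cv.wait_for(lock, timeout, [this] { return m_done; })) {
            return -1;
        }
        return m_pictureIndex;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_pictureIndex = -1;
    bool m_done = false;
};
```

  The predicate form of wait_for guards against spurious wakeups and against the callback firing before the wait starts, which matters if the parser invokes the callback from inside cuvidParseVideoData.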

  ---
  Option 2: HandlePictureDisplay-based mapping (the NVIDIA-recommended approach)

  Use ulMaxDisplayDelay so that HandlePictureDisplay can find the correct slot:

  Key idea:
  - HandlePictureDisplay receives the slot index through disp_info->timestamp
  - It is currently never called, though → revisit the ulMaxDisplayDelay setting

  Changes needed:
  // In CreateParser
  m_parserParams.ulMaxDisplayDelay = 1;  // current value
  // or
  m_parserParams.ulMaxDisplayDelay = 0;  // display immediately (for testing)

  // Find the correct slot in HandlePictureDisplay
  int CUDAAPI HandlePictureDisplay(void* user_data, CUVIDPARSERDISPINFO* disp_info) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);
      size_t slot_idx = static_cast<size_t>(disp_info->timestamp);
      DecodeSlot& slot = decoder->m_ringBuffer[slot_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);
          slot.picture_index = disp_info->picture_index;
          slot.is_ready = true;
      }
      slot.frame_ready.notify_one();

      return 1;
  }

  Open issue: need to find out why HandlePictureDisplay is currently not being called
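
  The timestamp round trip that Option 2 relies on can be sketched in isolation. The helper names and the ring size of 8 are assumptions made for illustration (the transcript does not state RING_BUFFER_SIZE). As far as I know the parser only treats packet.timestamp as valid when CUVID_PKT_TIMESTAMP is set in packet.flags, which is worth checking against nvcuvid.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed ring size; the real RING_BUFFER_SIZE is not given in the transcript.
constexpr std::size_t kRingBufferSize = 8;

// Pack a monotonically increasing submit index into the packet timestamp.
inline std::int64_t SlotToTimestamp(std::size_t submitIndex) {
    return static_cast<std::int64_t>(submitIndex);
}

// Recover the ring-buffer position from the timestamp that arrives in the
// display callback.
inline std::size_t TimestampToSlot(std::int64_t timestamp) {
    assert(timestamp >= 0);
    return static_cast<std::size_t>(timestamp) % kRingBufferSize;
}
```

  Storing the full submit index rather than the wrapped position keeps the value unique across wraparound; the modulo is applied only at lookup time.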

  ---
  Option 3: Reverse mapping keyed by picture index (reuses the current structure)

  Use the picture_index received in HandlePictureDecode as the key:

  Implementation:
  // New member variables
  std::unordered_map<int, size_t> m_pictureIndexToSlot;  // picture_index -> slot_idx
  std::mutex m_mapMutex;

  // In DecodeToSurface, before submitting the packet
  {
      std::lock_guard<std::mutex> lock(m_mapMutex);
      // Reserve as many slots as the packet is expected to produce frames (hard)
      // or simply record the current slot
  }

  // In HandlePictureDecode
  int CUDAAPI HandlePictureDecode(...) {
      int picture_index = pic_params->CurrPicIdx;

      // Find the slot of the packet currently being decoded
      size_t submit_idx = decoder->m_submitIndex.load() - 1;
      DecodeSlot& slot = decoder->m_ringBuffer[submit_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);
          slot.picture_index = picture_index;
      }
      {
          // Record in the map. Taken after slot_mutex is released so the two locks
          // are never held together (the polling thread takes m_mapMutex first)
          std::lock_guard<std::mutex> map_lock(decoder->m_mapMutex);
          decoder->m_pictureIndexToSlot[picture_index] = submit_idx;
      }

      cuvidDecodePicture(decoder->m_decoder, pic_params);
      return 1;
  }

  // In PollingThread
  void PollingThreadFunc() {
      while (m_pollingRunning) {
          {
              // Scope the lock so it is released before sleeping
              std::lock_guard<std::mutex> lock(m_mapMutex);

              for (auto& [pic_idx, slot_idx] : m_pictureIndexToSlot) {
                  CUVIDGETDECODESTATUS status = {};
                  if (cuvidGetDecodeStatus(m_decoder, pic_idx, &status) == CUDA_SUCCESS) {
                      if (status.decodeStatus == cuvidDecodeStatus_Success) {
                          DecodeSlot& slot = m_ringBuffer[slot_idx % RING_BUFFER_SIZE];
                          {
                              std::lock_guard<std::mutex> slot_lock(slot.slot_mutex);
                              slot.is_ready = true;
                          }
                          slot.frame_ready.notify_one();

                          m_pictureIndexToSlot.erase(pic_idx);
                          break;  // iterators are invalid after erase; rescan on the next pass
                      }
                  }
              }
          }

          std::this_thread::sleep_for(std::chrono::microseconds(100));
      }
  }

  Open issue: for multi-frame packets it is still unclear which slot each picture should map to
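
  The polling loop above handles one completed picture per pass because it has to break out after erasing. A variation, sketched here under assumptions, drains every completed entry in one pass using the iterator returned by erase. DrainCompleted and the isComplete callback are hypothetical; isComplete stands in for the cuvidGetDecodeStatus check.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Removes every entry whose picture index is reported complete and returns the
// slot indices that became ready. `isComplete` is a hypothetical stand-in for
// a cuvidGetDecodeStatus query.
inline std::vector<std::size_t> DrainCompleted(
    std::unordered_map<int, std::size_t>& pictureIndexToSlot,
    const std::function<bool(int)>& isComplete) {
    std::vector<std::size_t> readySlots;
    for (auto it = pictureIndexToSlot.begin(); it != pictureIndexToSlot.end();) {
        if (isComplete(it->first)) {
            readySlots.push_back(it->second);
            it = pictureIndexToSlot.erase(it);  // erase returns the next valid iterator
        } else {
            ++it;
        }
    }
    return readySlots;
}
```

  The caller would hold m_mapMutex only around this call, then mark the returned slots ready and notify their waiters after releasing it.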

  ---
  Option 4: Per-packet frame counter (addresses the root cause)

  Track how many frames each packet produces:

  Implementation:
  struct DecodeSlot {
      // Existing fields...

      std::vector<int> picture_indices;  // every picture index belonging to this slot
      std::atomic<int> frames_expected{0};   // number of frames expected
      std::atomic<int> frames_completed{0};  // number of frames completed
  };

  // In HandlePictureDecode
  int CUDAAPI HandlePictureDecode(...) {
      // Find the slot of the current packet
      size_t submit_idx = decoder->m_submitIndex.load() - 1;
      DecodeSlot& slot = decoder->m_ringBuffer[submit_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);
          slot.picture_indices.push_back(pic_params->CurrPicIdx);
      }

      cuvidDecodePicture(decoder->m_decoder, pic_params);
      return 1;
  }

  // In PollingThread
  void PollingThreadFunc() {
      while (m_pollingRunning) {
          size_t return_idx = m_returnIndex.load();
          DecodeSlot& slot = m_ringBuffer[return_idx % RING_BUFFER_SIZE];

          if (slot.in_use && !slot.is_ready) {
              bool all_complete = true;

              for (int pic_idx : slot.picture_indices) {
                  CUVIDGETDECODESTATUS status = {};
                  // Treat a failed status query as "not complete" as well
                  if (cuvidGetDecodeStatus(m_decoder, pic_idx, &status) != CUDA_SUCCESS ||
                      status.decodeStatus != cuvidDecodeStatus_Success) {
                      all_complete = false;
                      break;
                  }
              }

              if (all_complete && !slot.picture_indices.empty()) {
                  {
                      std::lock_guard<std::mutex> lock(slot.slot_mutex);
                      // Use the picture_index of the last frame (or the first)
                      slot.picture_index = slot.picture_indices.back();
                      slot.is_ready = true;
                  }
                  slot.frame_ready.notify_one();
              }
          }

          std::this_thread::sleep_for(std::chrono::microseconds(100));
      }
  }

  Open issue: the number of frames a packet will produce is not known in advance
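
  The "every picture in the slot is done" test can be pulled out into a small pure function, which makes it easy to check without a GPU. This is only a sketch: PictureStatus and the queryStatus callback are hypothetical stand-ins for cuvidGetDecodeStatus and its status enum.

```cpp
#include <functional>
#include <vector>

// Hypothetical status type standing in for the NVDEC decode status.
enum class PictureStatus { InProgress, Success, Error };

// Returns true only when the slot has at least one picture and every picture
// reports Success. An empty list means the decode callback has not run yet.
inline bool AllPicturesComplete(
    const std::vector<int>& pictureIndices,
    const std::function<PictureStatus(int)>& queryStatus) {
    if (pictureIndices.empty()) {
        return false;
    }
    for (int idx : pictureIndices) {
        if (queryStatus(idx) != PictureStatus::Success) {
            return false;
        }
    }
    return true;
}
```

  Note that this still cannot tell whether more pictures from the same packet are yet to arrive, which is exactly the open issue stated above.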

  ---
  Option 5: Per-frame slot allocation (most accurate, but complex)

  Allocate a slot per frame rather than per packet:

  Implementation:
  // Allocate the slot in HandlePictureDecode
  int CUDAAPI HandlePictureDecode(...) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      // Allocate a new slot (atomic)
      size_t my_slot_idx = decoder->m_frameSlotIndex.fetch_add(1);
      DecodeSlot& slot = decoder->m_ringBuffer[my_slot_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);

          // Check for slot overflow
          if (slot.in_use) {
              // Handle the error
              return 0;
          }

          slot.in_use = true;
          slot.picture_index = pic_params->CurrPicIdx;
          slot.is_ready = false;
          // target_surface has to be passed in some other way
      }

      cuvidDecodePicture(decoder->m_decoder, pic_params);
      return 1;
  }

  Open issue: HandlePictureDecode has no way of knowing the target_surface that was passed to DecodeToSurface
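
  The slot-claiming half of Option 5 (atomic index plus overflow check) is sketched below as a standalone class. FrameSlotRing and its method names are hypothetical and the sketch leaves the target_surface problem untouched. As in the code above, a failed claim still consumes an index.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <mutex>

struct FrameSlot {
    std::mutex mutex;
    bool inUse = false;
    int pictureIndex = -1;
};

// Hypothetical per-frame ring: one slot is claimed for each decoded picture.
template <std::size_t N>
class FrameSlotRing {
public:
    // Claims the next slot. On success writes the monotonic slot index to
    // outIdx and returns true; returns false if that ring position is busy.
    bool Claim(int pictureIndex, std::size_t& outIdx) {
        const std::size_t idx = m_next.fetch_add(1);
        FrameSlot& slot = m_slots[idx % N];
        std::lock_guard<std::mutex> lock(slot.mutex);
        if (slot.inUse) {
            return false;  // ring buffer overflow
        }
        slot.inUse = true;
        slot.pictureIndex = pictureIndex;
        outIdx = idx;
        return true;
    }

    // Frees the ring position that belongs to idx.
    void Release(std::size_t idx) {
        FrameSlot& slot = m_slots[idx % N];
        std::lock_guard<std::mutex> lock(slot.mutex);
        slot.inUse = false;
        slot.pictureIndex = -1;
    }

    int PictureIndexAt(std::size_t idx) {
        FrameSlot& slot = m_slots[idx % N];
        std::lock_guard<std::mutex> lock(slot.mutex);
        return slot.pictureIndex;
    }

private:
    std::array<FrameSlot, N> m_slots;
    std::atomic<std::size_t> m_next{0};
};
```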

  ---
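  🎖️ Recommended order of attack

  Step 1: Implement Option 1 (synchronous) first

  - Remove the complex async logic
  - Get a stable baseline working
  - Easier to debug and verify

  Step 2: Validate Option 2

  - Adjust the settings so that HandlePictureDisplay is called
  - Test with ulMaxDisplayDelay = 0
  - Uses the NVIDIA-recommended approach

  Step 3: Option 4 if performance requires it

  - Handle multiple frames per packet
  - Restore the async pipeline

  ---
  💡 Stopgap that can be applied right away

  A way to test with minimal changes to the current code:

  // In HandlePictureDecode - handle only the first frame
  int CUDAAPI HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      // Find the currently active slot (the most recently allocated one)
      size_t submit_idx = decoder->m_submitIndex.load();
      if (submit_idx == 0) return 0;  // no slot allocated yet

      submit_idx -= 1;  // the last allocated slot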
      DecodeSlot& slot = decoder->m_ringBuffer[submit_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);

          // If picture_index is already set, this is an extra frame from the same packet
          if (slot.picture_index >= 0) {
              // Ignore extra frames (only the first one is handled)
              return 1;
          }

          slot.picture_index = pic_params->CurrPicIdx;
      }

      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      return (result == CUDA_SUCCESS) ? 1 : 0;
  }

  This decodes only the first frame of a multi-frame packet, which is enough to confirm basic operation.

> With "Option 1: Simplify - synchronous single-frame decoding", isn't there a risk that DecodeToSurface() calls end up out of order?

● Good point. Even the synchronous approach can hit serious problems in a multithreaded environment.

  🚨 Fundamental problem with the synchronous approach

  Problem scenario:

  Thread A: calls DecodeToSurface(packet_0)
    → calls cuvidParseVideoData()
    → HandlePictureDecode sets picture_index=0
    → [context switch here!]

  Thread B: calls DecodeToSurface(packet_1)
    → calls cuvidParseVideoData()
    → HandlePictureDecode overwrites it with picture_index=1!

  Thread A: wakes up and uses picture_index=1 (wrong frame!)

  Result:
  - Thread A takes packet_1's frame
  - Thread B tries to take a frame that has already been consumed
  - Frames come out of order, or the process crashes

  ---
  ✅ Correct synchronous implementations

  Option 1-A: Serialize DecodeToSurface with a global mutex

  class NVDECAV1Decoder : public IVideoDecoder {
  private:
      std::mutex m_decodeMutex;  // protects the whole of DecodeToSurface

      // Simple synchronous path - only the picture_index is stored
      int m_lastPictureIndex = -1;
      std::mutex m_pictureIndexMutex;
      std::condition_variable m_pictureReady;
      bool m_pictureDecoded = false;
  };

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // Serialize the whole of DecodeToSurface (one call at a time)
      std::lock_guard<std::mutex> decodeLock(m_decodeMutex);

      // 1. Reset state
      {
          std::lock_guard<std::mutex> lock(m_pictureIndexMutex);
          m_lastPictureIndex = -1;
          m_pictureDecoded = false;
      }

      // 2. Submit packet to parser
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;

      CUresult result = cuvidParseVideoData(m_parser, &packet);
      if (result != CUDA_SUCCESS) {
          return false;
      }

      // 3. Wait for the HandlePictureDecode callback (blocking wait)
      {
          std::unique_lock<std::mutex> lock(m_pictureIndexMutex);
          if (!m_pictureReady.wait_for(lock, std::chrono::milliseconds(500),
              [this]() { return m_pictureDecoded; })) {
              LogError("Decode timeout");
              return false;
          }
      }

      // 4. Map and copy frame
      int frameIdx = m_lastPictureIndex;
      if (frameIdx < 0) {
          return false;
      }

      CUdeviceptr srcDevicePtr = 0;
      unsigned int srcPitch = 0;
      CUVIDPROCPARAMS procParams = {};
      procParams.progressive_frame = 1;

      result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
      if (result != CUDA_SUCCESS) {
          return false;
      }

      // 5. Copy to D3D12 surface
      bool copySuccess = m_d3d12Handler->CopyNV12Frame(
          srcDevicePtr, srcPitch,
          static_cast<ID3D12Resource*>(target_surface),
          m_width, m_height
      );

      cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

      return copySuccess;
  }

  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      if (result != CUDA_SUCCESS) {
          return 0;
      }

      // Store the picture_index of the first frame
      {
          std::lock_guard<std::mutex> lock(decoder->m_pictureIndexMutex);
          if (decoder->m_lastPictureIndex < 0) {  // first frame only
              decoder->m_lastPictureIndex = pic_params->CurrPicIdx;
              decoder->m_pictureDecoded = true;
          }
      }
      decoder->m_pictureReady.notify_one();

      return 1;
  }

  Pros:
  - Fully safe (no race conditions)
  - Simple to implement
  - Easy to debug

  Cons:
  - No multithreaded decoding (lower performance)
  - Only one DecodeToSurface runs at a time

  ---
  Option 1-B: Per-request context object

  struct DecodeRequest {
      int picture_index = -1;
      bool completed = false;
      std::mutex mutex;
      std::condition_variable cv;
  };

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 1. Create a unique context for each request
      auto request = std::make_shared<DecodeRequest>();

      // 2. Carry the context in the timestamp (pointer cast to int64_t)
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = reinterpret_cast<int64_t>(request.get());

      CUresult result = cuvidParseVideoData(m_parser, &packet);
      if (result != CUDA_SUCCESS) {
          return false;
      }

      // 3. Wait until this request completes
      {
          std::unique_lock<std::mutex> lock(request->mutex);
          if (!request->cv.wait_for(lock, std::chrono::milliseconds(500),
              [&request]() { return request->completed; })) {
              return false;
          }
      }

      // 4. Map and copy (same as before)
      int frameIdx = request->picture_index;
      // ... rest of the code
  }

  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      // Recover the DecodeRequest pointer from the timestamp
      DecodeRequest* request = reinterpret_cast<DecodeRequest*>(pic_params->nTimeStamp);

      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      if (result != CUDA_SUCCESS) {
          return 0;
      }

      // Store the picture_index on the matching request
      {
          std::lock_guard<std::mutex> lock(request->mutex);
          if (request->picture_index < 0) {  // first frame only
              request->picture_index = pic_params->CurrPicIdx;
              request->completed = true;
          }
      }
      request->cv.notify_one();

      return 1;
  }

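  Problem: CUVIDPICPARAMS has no nTimeStamp field, so pic_params->nTimeStamp does not compile. The packet timestamp is only surfaced through CUVIDPARSERDISPINFO::timestamp in the display callback, not in the decode callback.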

  ---
  Option 1-C: Thread-local storage

  class NVDECAV1Decoder : public IVideoDecoder {
  private:
      std::mutex m_activeRequestsMutex;
      std::map<std::thread::id, DecodeRequest*> m_activeRequests;
  };

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      DecodeRequest request;

      // 1. Register the current thread's request
      {
          std::lock_guard<std::mutex> lock(m_activeRequestsMutex);
          m_activeRequests[std::this_thread::get_id()] = &request;
      }

      // 2. Submit packet
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = static_cast<int64_t>(
          std::hash<std::thread::id>{}(std::this_thread::get_id())
      );

      CUresult result = cuvidParseVideoData(m_parser, &packet);

      // 3. Wait for completion
      {
          std::unique_lock<std::mutex> lock(request.mutex);
          request.cv.wait_for(lock, std::chrono::milliseconds(500),
              [&request]() { return request.completed; });
      }

      // 4. Cleanup
      {
          std::lock_guard<std::mutex> lock(m_activeRequestsMutex);
          m_activeRequests.erase(std::this_thread::get_id());
      }

      // 5. Map and copy
      // ...
  }

  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

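      // Look up the request by the callback's thread ID (problem: the callback may run on a different thread!)
      // This approach may not work
  }

  Problem: HandlePictureDecode may be invoked from a parser-internal thread, in which case the thread_id cannot be matched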

  ---
  🎯 Safest option: Option 1-A (global mutex)

  The most dependable approach in the current situation:

  // Add to NVDECAV1Decoder.h
  private:
      std::mutex m_syncDecodeMutex;  // protects synchronous decoding
      int m_syncPictureIndex = -1;
      std::mutex m_syncStateMutex;
      std::condition_variable m_syncDecodeComplete;
      bool m_syncDecodeReady = false;

  // DecodeToSurface implementation
  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // === CRITICAL: serialize the entire function ===
      std::lock_guard<std::mutex> syncLock(m_syncDecodeMutex);

      OutputDebugStringA("[DecodeToSurface] Acquired sync lock\n");

      // 1. Reset state
      {
          std::lock_guard<std::mutex> lock(m_syncStateMutex);
          m_syncPictureIndex = -1;
          m_syncDecodeReady = false;
      }

      // 2. Submit packet
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = 0;  // unused

      CUresult result = cuvidParseVideoData(m_parser, &packet);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidParseVideoData");
          return false;
      }

      // 3. Wait for decode (HandlePictureDecode will signal)
      {
          std::unique_lock<std::mutex> lock(m_syncStateMutex);

          if (!m_syncDecodeComplete.wait_for(lock, std::chrono::milliseconds(500),
              [this]() { return m_syncDecodeReady; })) {

              OutputDebugStringA("[DecodeToSurface] ERROR: Sync decode timeout\n");
              return false;
          }
      }

      int frameIdx = m_syncPictureIndex;
      if (frameIdx < 0) {
          OutputDebugStringA("[DecodeToSurface] ERROR: Invalid picture index\n");
          return false;
      }

      OutputDebugStringA("[DecodeToSurface] Decode complete, mapping frame...\n");

      // 4. Map frame (same as the existing code)
      CUdeviceptr srcDevicePtr = 0;
      unsigned int srcPitch = 0;
      CUVIDPROCPARAMS procParams = {};
      procParams.progressive_frame = 1;

      result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidMapVideoFrame");
          return false;
      }

      // 5. Copy to D3D12 (existing code)
      ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
      bool copySuccess = m_d3d12Handler->CopyNV12Frame(
          srcDevicePtr, srcPitch,
          d3d12Resource,
          m_width, m_height
      );

      cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

      if (!copySuccess) {
          OutputDebugStringA("[DecodeToSurface] ERROR: D3D12 copy failed\n");
          return false;
      }

      // 6. Signal fence
      m_d3d12Handler->SignalD3D12Fence(++m_fenceValue);
      output_frame.sync_fence_value = m_fenceValue;

      // 7. Fill metadata
      output_frame.width = m_width;
      output_frame.height = m_height;
      output_frame.color_space = ColorSpace::YUV420P;
      output_frame.frame_index = m_framesDecoded++;
      output_frame.timestamp_seconds = static_cast<double>(m_framesDecoded) / 30.0;

      OutputDebugStringA("[DecodeToSurface] SUCCESS\n");
      return true;
  }

  // HandlePictureDecode changes
  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

      // Decode picture
      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      if (result != CUDA_SUCCESS) {
          decoder->LogCUDAError(result, "cuvidDecodePicture");
          return 0;
      }

      // Store picture index (first frame only)
      {
          std::lock_guard<std::mutex> lock(decoder->m_syncStateMutex);

          if (decoder->m_syncPictureIndex < 0) {  // not set yet
              decoder->m_syncPictureIndex = pic_params->CurrPicIdx;
              decoder->m_syncDecodeReady = true;

              char debug_buf[128];
              sprintf_s(debug_buf, "[HandlePictureDecode] Stored picture_index=%d\n",
                        pic_params->CurrPicIdx);
              OutputDebugStringA(debug_buf);
          }
      }
      decoder->m_syncDecodeComplete.notify_one();

      return 1;
  }

  // Remove the polling thread from Cleanup
  void NVDECAV1Decoder::Cleanup() {
      // The polling thread is no longer needed (decoding is synchronous)
      // m_pollingRunning = false; // remove
      // m_pollingThread.join(); // remove

      // Rest of the cleanup code...
  }

  // Remove the polling thread startup from Initialize
  bool NVDECAV1Decoder::Initialize(...) {
      // ...

      m_initialized = true;

      // Remove the m_pollingThread startup
      // m_pollingRunning = true;
      // m_pollingThread = std::thread(...); // remove

      return true;
  }

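  Pros and cons of this approach:

  ✅ Pros:
  - Fully safe (race conditions eliminated)
  - Simple to implement and easy to debug
  - Sidesteps the multi-frame packet problem (only the first frame is used)
  - Removes the ring buffer complexity entirely

  ❌ Cons:
  - No multithreaded decoding (limits performance)
  - Only one DecodeToSurface can run at a time
  - Low GPU pipeline utilization

  ---
  Shall we go with this approach, or would you rather consider a more complex but faster one?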


> The "Option 1-B: per-request context object" you proposed also looks like it would return results from DecodeToSurface() in a different order than the calls were made. I believe the RingBuffer that was proposed and implemented resolves this problem. Shall I explain how?

● Ah, you're right. I was missing the core design intent of the RingBuffer.

  I now understand the key idea behind the current RingBuffer design:

  🎯 How the RingBuffer guarantees ordering

  Key concepts:

  // Producer (callers of DecodeToSurface)
  std::atomic<size_t> m_submitIndex{0};   // submission order (monotonically increasing)

  // Consumer (order in which DecodeToSurface returns)
  std::atomic<size_t> m_returnIndex{0};   // return order (FIFO guaranteed)

  How ordering is guaranteed:

  Thread A: calls DecodeToSurface()
    → m_submitIndex.fetch_add(1) returns 0 (my_slot_idx = 0)
    → slot[0] allocated

  Thread B: calls DecodeToSurface() (finishes decoding before Thread A)
    → m_submitIndex.fetch_add(1) returns 1 (my_slot_idx = 1)
    → slot[1] allocated
    → decoding finishes quickly!

  Before returning, Thread B:
    → while (m_returnIndex.load() != my_slot_idx) { wait... }
    → returnIndex == 0, so it waits (for Thread A)

  Thread A finishes decoding:
    → returnIndex == 0 == my_slot_idx (condition met!)
    → returns the frame
    → m_returnIndex.fetch_add(1), so returnIndex becomes 1

  Thread B wakes up:
    → returnIndex == 1 == my_slot_idx (condition now met!)
    → returns the frame
    → m_returnIndex.fetch_add(1), so returnIndex becomes 2

  Result: frames are returned in call order (FIFO)
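
  The ordering rule can be exercised on its own with a small ticket gate. This is a sketch rather than the decoder's code: FifoTurnstile and DemoReturnOrder are hypothetical names, and the gate waits on a condition variable instead of the sleep-and-poll loop used in DecodeToSurface.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical ticket gate: tickets are handed out in call order and holders
// may pass only in that same order.
class FifoTurnstile {
public:
    // Take a ticket at submit time (plays the role of m_submitIndex).
    std::size_t TakeTicket() { return m_submitIndex.fetch_add(1); }

    // Block until every earlier ticket has passed.
    void WaitForTurn(std::size_t ticket) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [&] { return m_returnIndex == ticket; });
    }

    // Let the next ticket through (plays the role of m_returnIndex).
    void Advance() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            ++m_returnIndex;
        }
        m_cv.notify_all();
    }

private:
    std::atomic<std::size_t> m_submitIndex{0};
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::size_t m_returnIndex = 0;  // guarded by m_mutex
};

// Starts `count` workers in reverse ticket order and records the order in
// which they pass the gate.
inline std::vector<std::size_t> DemoReturnOrder(std::size_t count) {
    FifoTurnstile gate;
    std::vector<std::size_t> order;
    order.reserve(count);

    std::vector<std::size_t> tickets;
    for (std::size_t i = 0; i < count; ++i) {
        tickets.push_back(gate.TakeTicket());
    }

    std::vector<std::thread> workers;
    for (std::size_t i = count; i-- > 0;) {
        const std::size_t ticket = tickets[i];
        workers.emplace_back([&gate, &order, ticket] {
            gate.WaitForTurn(ticket);
            order.push_back(ticket);  // only the ticket holder runs here
            gate.Advance();
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return order;
}
```

  Even though the last ticket's thread starts first, the recorded order follows ticket order, which mirrors Thread B waiting for Thread A in the walkthrough above.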

  ---
  🔧 The actual problems in the current RingBuffer

  So what is actually wrong with the current implementation?

  Problem 1: HandlePictureDecode cannot find its slot

  Current code:
  int CUDAAPI HandlePictureDecode(...) {
      // ❌ Problem: no way of knowing which slot this is!
      size_t submit_idx = decoder->m_submitIndex.load() - 1;
      DecodeSlot& slot = decoder->m_ringBuffer[submit_idx % RING_BUFFER_SIZE];

      slot.picture_index = pic_params->CurrPicIdx;  // may be stored in the wrong slot!
  }

  Problem:
  - Thread A calls cuvidParseVideoData() for slot[0]
  - Meanwhile Thread B allocates slot[1] → submitIndex = 2
  - Thread A's HandlePictureDecode callback runs
  - submitIndex - 1 = 1 → stored in slot[1]! (corrupts Thread B's slot)
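
  The interleaving in this list can be replayed deterministically on a single thread. The two helpers below are hypothetical and the ring size of 8 is an assumption; they only mirror the index arithmetic of the allocation and of the flawed lookup.

```cpp
#include <atomic>
#include <cstddef>

// Assumed ring size; the real RING_BUFFER_SIZE is not given in the transcript.
constexpr std::size_t kRingSize = 8;

// Slot a caller receives at submit time (what DecodeToSurface does).
inline std::size_t AllocateSlot(std::atomic<std::size_t>& submitIndex) {
    return submitIndex.fetch_add(1) % kRingSize;
}

// Slot the callback guesses from the shared counter (the flawed lookup).
inline std::size_t GuessSlotInCallback(const std::atomic<std::size_t>& submitIndex) {
    return (submitIndex.load() - 1) % kRingSize;
}
```

  If Thread A allocates, then Thread B allocates, and only then does A's callback run, the guess lands on B's slot: the lookup depends on the counter's current value instead of on anything tied to A's packet.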

  Problem 2: Packet → picture mapping

  Packet 0 (Thread A, slot[0]):
    → HandlePictureDecode(picture_index=0)
    → HandlePictureDecode(picture_index=1)  // second frame of the same packet
    → HandlePictureDecode(picture_index=2)  // third frame of the same packet

  Which picture_index should be stored in slot[0]?

  ---
  ✅ The right fix, building on the RingBuffer

  Key idea: pass a packet context through

  struct DecodeSlot {
      // Existing fields...
      bool in_use = false;
      void* target_surface = nullptr;
      VavCoreSurfaceType surface_type = VAVCORE_SURFACE_CPU;

      // ✅ New: unique ID for this slot
      uint64_t slot_id = 0;

      // ✅ Multi-frame support
      std::vector<int> picture_indices;  // every frame in this packet
      std::atomic<bool> all_frames_decoded{false};

      // Synchronization
      std::condition_variable frame_ready;
      std::mutex slot_mutex;
      bool is_ready = false;
  };

  class NVDECAV1Decoder : public IVideoDecoder {
  private:
      std::atomic<uint64_t> m_slotIdCounter{0};  // generates unique slot IDs

      // Slot ID → slot index mapping
      std::mutex m_slotMapMutex;
      std::unordered_map<uint64_t, size_t> m_slotIdToIndex;
  };

  Implementation:

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 1. Allocate slot
      size_t my_slot_idx = m_submitIndex.fetch_add(1);
      DecodeSlot& my_slot = m_ringBuffer[my_slot_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

          if (my_slot.in_use) {
              // Ring buffer overflow
              OutputDebugStringA("[DecodeToSurface] ERROR: Ring buffer full\n");
              return false;
          }

          // ✅ Assign a unique ID
          my_slot.slot_id = m_slotIdCounter.fetch_add(1);
          my_slot.in_use = true;
          my_slot.target_surface = target_surface;
          my_slot.surface_type = target_type;
          my_slot.picture_indices.clear();
          my_slot.all_frames_decoded = false;
          my_slot.is_ready = false;
      }

      // ✅ Register the slot ID → index mapping
      {
          std::lock_guard<std::mutex> lock(m_slotMapMutex);
          m_slotIdToIndex[my_slot.slot_id] = my_slot_idx;
      }

      // 2. Submit packet with slot_id in timestamp
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE | CUVID_PKT_TIMESTAMP;  // mark the timestamp as valid
      packet.timestamp = static_cast<int64_t>(my_slot.slot_id);  // ✅ pass the slot ID along

      char debug_buf[256];
      sprintf_s(debug_buf, "[DecodeToSurface] Submitting packet (slot=%zu, slot_id=%llu)\n",
                my_slot_idx % RING_BUFFER_SIZE, my_slot.slot_id);
      OutputDebugStringA(debug_buf);

      CUresult result = cuvidParseVideoData(m_parser, &packet);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidParseVideoData");

          // Cleanup on error
          {
              std::lock_guard<std::mutex> lock(m_slotMapMutex);
              m_slotIdToIndex.erase(my_slot.slot_id);
          }
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          return false;
      }

      // 3. Wait for my turn (FIFO order)
      sprintf_s(debug_buf, "[DecodeToSurface] Waiting for return order (slot=%zu, returnIndex=%zu)\n",
                my_slot_idx % RING_BUFFER_SIZE, m_returnIndex.load());
      OutputDebugStringA(debug_buf);

      while (m_returnIndex.load() != my_slot_idx) {
          std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }

      sprintf_s(debug_buf, "[DecodeToSurface] My turn! (slot=%zu)\n", my_slot_idx % RING_BUFFER_SIZE);
      OutputDebugStringA(debug_buf);

      // 4. Wait for all frames decoded
      {
          std::unique_lock<std::mutex> lock(my_slot.slot_mutex);

          if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500),
              [&my_slot]() { return my_slot.is_ready; })) {

              sprintf_s(debug_buf, "[DecodeToSurface] ERROR: Decode timeout (slot=%zu)\n",
                        my_slot_idx % RING_BUFFER_SIZE);
              OutputDebugStringA(debug_buf);

              // Cleanup
              {
                  std::lock_guard<std::mutex> mapLock(m_slotMapMutex);
                  m_slotIdToIndex.erase(my_slot.slot_id);
              }
              my_slot.in_use = false;
              m_returnIndex.fetch_add(1);  // Skip to unblock others
              return false;
          }
      }

      // 5. Get first picture index (or handle all frames)
      int frameIdx = -1;
      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
          if (!my_slot.picture_indices.empty()) {
              frameIdx = my_slot.picture_indices[0];  // Use first frame
          }
      }

      if (frameIdx < 0) {
          OutputDebugStringA("[DecodeToSurface] ERROR: No frames decoded\n");

          // Cleanup
          {
              std::lock_guard<std::mutex> mapLock(m_slotMapMutex);
              m_slotIdToIndex.erase(my_slot.slot_id);
          }
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIndex.fetch_add(1);
          return false;
      }

      sprintf_s(debug_buf, "[DecodeToSurface] Using picture_index=%d from %zu frames\n",
                frameIdx, my_slot.picture_indices.size());
      OutputDebugStringA(debug_buf);

      // 6. Map and copy frame (existing code)
      CUdeviceptr srcDevicePtr = 0;
      unsigned int srcPitch = 0;
      CUVIDPROCPARAMS procParams = {};
      procParams.progressive_frame = 1;

      result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidMapVideoFrame");

          // Cleanup
          {
              std::lock_guard<std::mutex> mapLock(m_slotMapMutex);
              m_slotIdToIndex.erase(my_slot.slot_id);
          }
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIndex.fetch_add(1);
          return false;
      }

      // 7. Copy to D3D12 surface
      ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
      bool copySuccess = m_d3d12Handler->CopyNV12Frame(
          srcDevicePtr, srcPitch,
          d3d12Resource,
          m_width, m_height
      );

      cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

      if (!copySuccess) {
          OutputDebugStringA("[DecodeToSurface] ERROR: D3D12 copy failed\n");

          // Cleanup
          {
              std::lock_guard<std::mutex> mapLock(m_slotMapMutex);
              m_slotIdToIndex.erase(my_slot.slot_id);
          }
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIndex.fetch_add(1);
          return false;
      }

      // 8. Signal fence and fill metadata
      m_d3d12Handler->SignalD3D12Fence(++m_fenceValue);
      output_frame.sync_fence_value = m_fenceValue;
      output_frame.width = m_width;
      output_frame.height = m_height;
      output_frame.color_space = ColorSpace::YUV420P;
      output_frame.frame_index = m_framesDecoded++;
      output_frame.timestamp_seconds = static_cast<double>(m_framesDecoded) / 30.0;

      // 9. Release slot
      {
          std::lock_guard<std::mutex> mapLock(m_slotMapMutex);
          m_slotIdToIndex.erase(my_slot.slot_id);
      }
      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
          my_slot.in_use = false;
      }

      // 10. Advance return index (so the next slot can return)
      m_returnIndex.fetch_add(1);

      sprintf_s(debug_buf, "[DecodeToSurface] SUCCESS (slot=%zu, advanced returnIndex=%zu)\n",
                my_slot_idx % RING_BUFFER_SIZE, m_returnIndex.load());
      OutputDebugStringA(debug_buf);

      return true;
  }

  HandlePictureDecode changes:

  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      OutputDebugStringA("[HandlePictureDecode] Callback called\n");

      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);
      if (!decoder || !pic_params) {
          return 0;
      }

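      // ✅ Extract slot_id from the timestamp
      // NOTE: as pointed out under Option 1-B, CUVIDPICPARAMS has no nTimeStamp field,
      // so this line does not compile as written; the slot_id needs another route into this callback
      uint64_t slot_id = static_cast<uint64_t>(pic_params->nTimeStamp);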

      char debug_buf[256];
      sprintf_s(debug_buf, "[HandlePictureDecode] slot_id=%llu, CurrPicIdx=%d\n",
                slot_id, pic_params->CurrPicIdx);
      OutputDebugStringA(debug_buf);

      // ✅ Look up the slot by slot ID
      size_t slot_idx = 0;
      {
          std::lock_guard<std::mutex> lock(decoder->m_slotMapMutex);
          auto it = decoder->m_slotIdToIndex.find(slot_id);

          if (it == decoder->m_slotIdToIndex.end()) {
              sprintf_s(debug_buf, "[HandlePictureDecode] ERROR: slot_id=%llu not found!\n", slot_id);
              OutputDebugStringA(debug_buf);
              return 0;
          }

          slot_idx = it->second;
      }

      DecodeSlot& slot = decoder->m_ringBuffer[slot_idx % decoder->RING_BUFFER_SIZE];

      // Decode picture
      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      if (result != CUDA_SUCCESS) {
          decoder->LogCUDAError(result, "cuvidDecodePicture");
          return 0;
      }

      // ✅ Store picture index in correct slot
      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);
          slot.picture_indices.push_back(pic_params->CurrPicIdx);
      }

      sprintf_s(debug_buf, "[HandlePictureDecode] Stored picture_index=%d in slot %zu (total frames: %zu)\n",
                pic_params->CurrPicIdx, slot_idx % decoder->RING_BUFFER_SIZE,
                slot.picture_indices.size());
      OutputDebugStringA(debug_buf);

      return 1;
  }

  PollingThread changes:

  void NVDECAV1Decoder::PollingThreadFunc() {
      OutputDebugStringA("[PollingThread] Started\n");

      while (m_pollingRunning) {
          size_t current_return_idx = m_returnIndex.load();
          size_t slot_idx = current_return_idx % RING_BUFFER_SIZE;
          DecodeSlot& slot = m_ringBuffer[slot_idx];

          if (slot.in_use && !slot.is_ready) {
              // ✅ Check whether every frame in this slot has finished decoding
              bool all_complete = true;
              std::vector<int> picture_indices_copy;

              {
                  std::lock_guard<std::mutex> lock(slot.slot_mutex);
                  picture_indices_copy = slot.picture_indices;
              }

              if (picture_indices_copy.empty()) {
                  // HandlePictureDecode has not been called yet
                  std::this_thread::sleep_for(std::chrono::microseconds(100));
                  continue;
              }

              // Check the decode status of every frame
              for (int pic_idx : picture_indices_copy) {
                  CUVIDGETDECODESTATUS decodeStatus = {};
                  CUresult result = cuvidGetDecodeStatus(m_decoder, pic_idx, &decodeStatus);

                  if (result != CUDA_SUCCESS ||
                      decodeStatus.decodeStatus != cuvidDecodeStatus_Success) {
                      all_complete = false;
                      break;
                  }
              }

              if (all_complete) {
                  {
                      std::lock_guard<std::mutex> lock(slot.slot_mutex);
                      slot.is_ready = true;
                  }
                  slot.frame_ready.notify_one();

                  char debug_buf[256];
                  sprintf_s(debug_buf, "[PollingThread] Slot %zu all frames complete (%zu frames)\n",
                            slot_idx, picture_indices_copy.size());
                  OutputDebugStringA(debug_buf);
              }
          }

          std::this_thread::sleep_for(std::chrono::microseconds(100));
      }

      OutputDebugStringA("[PollingThread] Stopped\n");
  }
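
  The "all frames complete" check inside the polling loop can be separated from the NVDEC calls and tested on its own. A minimal sketch, assuming a hypothetical DecodeStatusFn callback stands in for cuvidGetDecodeStatus (AllPicturesComplete and DecodeStatusFn are illustrative names, not part of the decoder):

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for cuvidGetDecodeStatus: returns true once the
// picture with the given index has finished decoding successfully.
using DecodeStatusFn = std::function<bool(int)>;

// Returns true only if the slot holds at least one picture and every
// picture reports completion. An empty list means HandlePictureDecode
// has not run yet, so the slot is treated as not ready.
bool AllPicturesComplete(const std::vector<int>& picture_indices,
                         const DecodeStatusFn& is_done) {
    if (picture_indices.empty()) {
        return false;
    }
    for (int pic_idx : picture_indices) {
        if (!is_done(pic_idx)) {
            return false;  // one pending picture is enough to keep waiting
        }
    }
    return true;
}
```

  In the real loop, is_done would wrap the cuvidGetDecodeStatus call and compare the result against cuvidDecodeStatus_Success, as the code above does.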

  ---
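  🎉 Advantages of this approach

  ✅ Ordering guaranteed: submitIndex/returnIndex enforce FIFO
  ✅ Thread-safe: each slot is identified by a unique ID
  ✅ Multi-frame packet support: the picture_indices vector tracks every frame
  ✅ Async performance: multiple threads can decode concurrently (order is still preserved)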
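
  The FIFO guarantee from the submit/return counter pair can be sketched in isolation. This is a simplified illustration under stated assumptions: FifoGate and its method names are hypothetical, and it spins with yield() instead of sleeping.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Ticket-style FIFO gate: Acquire() hands out monotonically increasing IDs,
// and WaitTurn() blocks until every earlier ID has called Release().
class FifoGate {
public:
    // Plays the role of m_submitIndex.fetch_add(1).
    uint64_t Acquire() { return m_submit.fetch_add(1); }

    // Spins until the return counter reaches this caller's ID.
    void WaitTurn(uint64_t id) const {
        while (m_return.load() != id) {
            std::this_thread::yield();
        }
    }

    // Plays the role of m_returnIndex.fetch_add(1); must be called exactly
    // once per acquired ID, including on error paths, or later IDs stall.
    void Release() { m_return.fetch_add(1); }

private:
    std::atomic<uint64_t> m_submit{0};
    std::atomic<uint64_t> m_return{0};
};
```

  The comment on Release() is the main design constraint: every early return in DecodeToSurface has to advance the return counter, otherwise all later callers wait forever.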

  ---
  Shall I implement it this way?


> Do we need m_slotIdCounter? m_submitIndex only ever increases, so its value could be used as the slot_id directly. Or, to reduce confusion,
 we could use m_slotIdCounter in place of m_submitIndex. What do you think?

● Exactly right! We don't need two counters.

  🎯 Simplified design

  Option A: Use m_submitIndex directly as the slot_id (simplest)

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 1. Allocate slot - this value itself is the unique ID!
      size_t my_slot_idx = m_submitIndex.fetch_add(1);
      DecodeSlot& my_slot = m_ringBuffer[my_slot_idx % RING_BUFFER_SIZE];

      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

          if (my_slot.in_use) {
              return false;
          }

          // ✅ No slot_id field needed - my_slot_idx itself is the unique ID
          my_slot.in_use = true;
          my_slot.target_surface = target_surface;
          my_slot.surface_type = target_type;
          my_slot.picture_indices.clear();
          my_slot.is_ready = false;
      }

      // ✅ Register the my_slot_idx → index mapping
      {
          std::lock_guard<std::mutex> lock(m_slotMapMutex);
          m_slotIdToIndex[my_slot_idx] = my_slot_idx;  // ID == Index
      }

      // 2. Submit packet with my_slot_idx in timestamp
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = static_cast<int64_t>(my_slot_idx);  // ✅ submitIndex itself is the ID!

      // ...
  }

  Pros:
  - Simpler code
  - No slot_id field needed
  - Only one counter to manage

  Cons:
  - The m_slotIdToIndex map is effectively an identity function (key == value)
  - The map itself looks unnecessary

  ---
  Option B: Use only m_slotIdCounter (clearer meaning)

  class NVDECAV1Decoder : public IVideoDecoder {
  private:
      // ✅ Clearer names
      std::atomic<uint64_t> m_slotIdCounter{0};  // acts as submitIndex + unique ID
      std::atomic<uint64_t> m_returnIdCounter{0};  // acts as returnIndex

      // ID → slot index mapping
      std::mutex m_slotMapMutex;
      std::unordered_map<uint64_t, size_t> m_idToSlotIndex;
  };

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 1. Allocate unique ID and slot
      uint64_t my_id = m_slotIdCounter.fetch_add(1);  // allocate unique ID
      size_t slot_idx = my_id % RING_BUFFER_SIZE;     // ring buffer index
      DecodeSlot& my_slot = m_ringBuffer[slot_idx];

      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

          if (my_slot.in_use) {
              return false;
          }

          my_slot.in_use = true;
          my_slot.target_surface = target_surface;
          my_slot.surface_type = target_type;
          my_slot.picture_indices.clear();
          my_slot.is_ready = false;
      }

      // 2. Register ID → Index mapping
      {
          std::lock_guard<std::mutex> lock(m_slotMapMutex);
          m_idToSlotIndex[my_id] = my_id;  // not actually needed...
      }

      // 3. Submit packet with ID in timestamp
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = static_cast<int64_t>(my_id);

      // ...

      // 4. Wait for my turn (FIFO)
      while (m_returnIdCounter.load() != my_id) {
          std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }

      // ...

      // 5. Advance return counter
      m_returnIdCounter.fetch_add(1);

      return true;
  }

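  Pros:
  - Clearer variable names (slotIdCounter vs submitIndex)
  - Clearer intent

  Cons:
  - The map is still an identity function

  ---
  Option C: Remove the map - optimized (final recommendation!)

  In fact the map isn't needed at all! The slot index can be computed directly from the ID carried in the timestamp: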

  class NVDECAV1Decoder : public IVideoDecoder {
  private:
      // ✅ Simplified: just two counters
      std::atomic<uint64_t> m_slotIdCounter{0};   // submission order + unique ID
      std::atomic<uint64_t> m_returnIdCounter{0}; // return order

      // ❌ Map removed!
      // std::mutex m_slotMapMutex;
      // std::unordered_map<uint64_t, size_t> m_idToSlotIndex;
  };

  bool NVDECAV1Decoder::DecodeToSurface(...) {
      // 1. Allocate unique ID
      uint64_t my_id = m_slotIdCounter.fetch_add(1);
      size_t slot_idx = my_id % RING_BUFFER_SIZE;
      DecodeSlot& my_slot = m_ringBuffer[slot_idx];

      char debug_buf[256];
      sprintf_s(debug_buf, "[DecodeToSurface] Allocated ID=%llu, slot=%zu\n", my_id, slot_idx);
      OutputDebugStringA(debug_buf);

      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);

          if (my_slot.in_use) {
              sprintf_s(debug_buf, "[DecodeToSurface] ERROR: Slot %zu overflow (ID=%llu)\n",
                        slot_idx, my_id);
              OutputDebugStringA(debug_buf);
              return false;
          }

          my_slot.in_use = true;
          my_slot.target_surface = target_surface;
          my_slot.surface_type = target_type;
          my_slot.picture_indices.clear();
          my_slot.is_ready = false;
      }

      // 2. Submit packet with ID in timestamp
      CUVIDSOURCEDATAPACKET packet = {};
      packet.payload = packet_data;
      packet.payload_size = packet_size;
      packet.flags = CUVID_PKT_ENDOFPICTURE;
      packet.timestamp = static_cast<int64_t>(my_id);  // ✅ pass the ID

      CUresult result = cuvidParseVideoData(m_parser, &packet);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidParseVideoData");
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          return false;
      }

      // 3. Wait for my turn (FIFO order)
      sprintf_s(debug_buf, "[DecodeToSurface] Waiting for turn (my_id=%llu, returnId=%llu)\n",
                my_id, m_returnIdCounter.load());
      OutputDebugStringA(debug_buf);

      while (m_returnIdCounter.load() != my_id) {
          std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }

      sprintf_s(debug_buf, "[DecodeToSurface] My turn! (ID=%llu)\n", my_id);
      OutputDebugStringA(debug_buf);

      // 4. Wait for decode complete
      {
          std::unique_lock<std::mutex> lock(my_slot.slot_mutex);

          if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(500),
              [&my_slot]() { return my_slot.is_ready; })) {

              sprintf_s(debug_buf, "[DecodeToSurface] ERROR: Decode timeout (ID=%llu, slot=%zu)\n",
                        my_id, slot_idx);
              OutputDebugStringA(debug_buf);

              my_slot.in_use = false;
              m_returnIdCounter.fetch_add(1);  // Skip to unblock others
              return false;
          }
      }

      // 5. Get first picture index
      int frameIdx = -1;
      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
          if (!my_slot.picture_indices.empty()) {
              frameIdx = my_slot.picture_indices[0];
          }
      }

      if (frameIdx < 0) {
          OutputDebugStringA("[DecodeToSurface] ERROR: No frames decoded\n");
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIdCounter.fetch_add(1);
          return false;
      }

      sprintf_s(debug_buf, "[DecodeToSurface] Using picture_index=%d (from %zu frames)\n",
                frameIdx, my_slot.picture_indices.size());
      OutputDebugStringA(debug_buf);

      // 6. Map and copy frame
      CUdeviceptr srcDevicePtr = 0;
      unsigned int srcPitch = 0;
      CUVIDPROCPARAMS procParams = {};
      procParams.progressive_frame = 1;

      result = cuvidMapVideoFrame(m_decoder, frameIdx, &srcDevicePtr, &srcPitch, &procParams);
      if (result != CUDA_SUCCESS) {
          LogCUDAError(result, "cuvidMapVideoFrame");
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIdCounter.fetch_add(1);
          return false;
      }

      sprintf_s(debug_buf, "[DecodeToSurface] Mapped frame: ptr=%p, pitch=%u\n",
                (void*)srcDevicePtr, srcPitch);
      OutputDebugStringA(debug_buf);

      // 7. Copy to D3D12 surface
      ID3D12Resource* d3d12Resource = static_cast<ID3D12Resource*>(target_surface);
      bool copySuccess = m_d3d12Handler->CopyNV12Frame(
          srcDevicePtr, srcPitch,
          d3d12Resource,
          m_width, m_height
      );

      cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);

      if (!copySuccess) {
          OutputDebugStringA("[DecodeToSurface] ERROR: D3D12 copy failed\n");
          {
              std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
              my_slot.in_use = false;
          }
          m_returnIdCounter.fetch_add(1);
          return false;
      }

      // 8. Signal fence and fill metadata
      m_d3d12Handler->SignalD3D12Fence(++m_fenceValue);
      output_frame.sync_fence_value = m_fenceValue;
      output_frame.width = m_width;
      output_frame.height = m_height;
      output_frame.color_space = ColorSpace::YUV420P;
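      // Derive the timestamp from the same counter value as frame_index, so frame 0
      // maps to t=0 (the hard-coded divisor assumes 30 fps).
      output_frame.frame_index = m_framesDecoded;
      output_frame.timestamp_seconds = static_cast<double>(m_framesDecoded) / 30.0;
      ++m_framesDecoded;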

      // 9. Release slot
      {
          std::lock_guard<std::mutex> lock(my_slot.slot_mutex);
          my_slot.in_use = false;
      }

      // 10. Advance return counter
      m_returnIdCounter.fetch_add(1);

      sprintf_s(debug_buf, "[DecodeToSurface] SUCCESS (ID=%llu, returnId now=%llu)\n",
                my_id, m_returnIdCounter.load());
      OutputDebugStringA(debug_buf);

      return true;
  }

  int CUDAAPI NVDECAV1Decoder::HandlePictureDecode(void* user_data, CUVIDPICPARAMS* pic_params) {
      auto* decoder = static_cast<NVDECAV1Decoder*>(user_data);

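      // ✅ Extract the ID from the timestamp
      // NOTE: the NVDEC headers expose the packet timestamp through
      // CUVIDPARSERDISPINFO::timestamp, not CUVIDPICPARAMS, so nTimeStamp here is a
      // placeholder for "recover the submitting ID" rather than a real field.
      uint64_t slot_id = static_cast<uint64_t>(pic_params->nTimeStamp);

      // ✅ Compute the slot index directly from the ID (no map needed!)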
      size_t slot_idx = slot_id % decoder->RING_BUFFER_SIZE;
      DecodeSlot& slot = decoder->m_ringBuffer[slot_idx];

      char debug_buf[256];
      sprintf_s(debug_buf, "[HandlePictureDecode] ID=%llu, slot=%zu, CurrPicIdx=%d\n",
                slot_id, slot_idx, pic_params->CurrPicIdx);
      OutputDebugStringA(debug_buf);

      // Decode picture
      CUresult result = cuvidDecodePicture(decoder->m_decoder, pic_params);
      if (result != CUDA_SUCCESS) {
          decoder->LogCUDAError(result, "cuvidDecodePicture");
          return 0;
      }

      // ✅ Store picture index
      {
          std::lock_guard<std::mutex> lock(slot.slot_mutex);
          slot.picture_indices.push_back(pic_params->CurrPicIdx);
      }

      sprintf_s(debug_buf, "[HandlePictureDecode] Stored in slot %zu (total: %zu frames)\n",
                slot_idx, slot.picture_indices.size());
      OutputDebugStringA(debug_buf);

      return 1;
  }

  void NVDECAV1Decoder::PollingThreadFunc() {
      OutputDebugStringA("[PollingThread] Started\n");

      while (m_pollingRunning) {
          uint64_t current_return_id = m_returnIdCounter.load();
          size_t slot_idx = current_return_id % RING_BUFFER_SIZE;
          DecodeSlot& slot = m_ringBuffer[slot_idx];

          if (slot.in_use && !slot.is_ready) {
              bool all_complete = true;
              std::vector<int> picture_indices_copy;

              {
                  std::lock_guard<std::mutex> lock(slot.slot_mutex);
                  picture_indices_copy = slot.picture_indices;
              }

              if (picture_indices_copy.empty()) {
                  std::this_thread::sleep_for(std::chrono::microseconds(100));
                  continue;
              }

              for (int pic_idx : picture_indices_copy) {
                  CUVIDGETDECODESTATUS decodeStatus = {};
                  CUresult result = cuvidGetDecodeStatus(m_decoder, pic_idx, &decodeStatus);

                  if (result != CUDA_SUCCESS ||
                      decodeStatus.decodeStatus != cuvidDecodeStatus_Success) {
                      all_complete = false;
                      break;
                  }
              }

              if (all_complete) {
                  {
                      std::lock_guard<std::mutex> lock(slot.slot_mutex);
                      slot.is_ready = true;
                  }
                  slot.frame_ready.notify_one();

                  char debug_buf[256];
                  sprintf_s(debug_buf, "[PollingThread] Slot %zu complete (ID=%llu, %zu frames)\n",
                            slot_idx, current_return_id, picture_indices_copy.size());
                  OutputDebugStringA(debug_buf);
              }
          }

          std::this_thread::sleep_for(std::chrono::microseconds(100));
      }

      OutputDebugStringA("[PollingThread] Stopped\n");
  }

  ---
  📊 Final comparison

  | Approach                         | Code complexity | Performance | Memory       | Rating |
  |----------------------------------|-----------------|-------------|--------------|--------|
  | Option A (reuse submitIndex)     | Medium          | Good        | Map required | ⭐⭐⭐   |
  | Option B (rename slotIdCounter)  | Medium          | Good        | Map required | ⭐⭐⭐⭐  |
  | Option C (remove map)            | Low             | Best        | Minimal      | ⭐⭐⭐⭐⭐ |

  ---
  ✅ Final recommendation: Option C

  Reasons:
  1. No map needed (the index is computed directly as ID % RING_BUFFER_SIZE)
  2. Simplest code
  3. Minimal memory overhead
  4. Clear variable names (m_slotIdCounter, m_returnIdCounter)
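
  The map-free lookup can be sketched on its own to show what the modulo gives you and where it can go wrong. This is an illustration only: SlotRing, Slot, and the owner_id field are hypothetical, the ring size of 8 is assumed from the discussion, and locking is left out for brevity.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kRingBufferSize = 8;  // assumed ring size

struct Slot {
    bool in_use = false;
    uint64_t owner_id = 0;  // hypothetical field used to reject stale IDs
};

// Not thread-safe: a real decoder would guard each slot with its mutex.
class SlotRing {
public:
    // Claims the slot for `id`; fails if an older ID still holds it,
    // which happens when more than kRingBufferSize IDs are in flight.
    bool TryClaim(uint64_t id) {
        Slot& s = m_slots[id % kRingBufferSize];
        if (s.in_use) {
            return false;
        }
        s.in_use = true;
        s.owner_id = id;
        return true;
    }

    // Finds the slot for `id` with no map: just a modulo plus an owner check.
    Slot* Find(uint64_t id) {
        Slot& s = m_slots[id % kRingBufferSize];
        return (s.in_use && s.owner_id == id) ? &s : nullptr;
    }

    void Release(uint64_t id) {
        if (Slot* s = Find(id)) {
            s->in_use = false;
        }
    }

private:
    std::array<Slot, kRingBufferSize> m_slots{};
};
```

  IDs 42 and 50 both land in slot 2, so the second claim fails until the first is released. Storing the owner ID is optional, but it lets a callback detect an ID that has already wrapped.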

  Would you like to implement it this way?


> I think the real reason the NVDEC API doesn't support per-picture user_data is that it processes each picture by ring buffer slot index
and leaves post-processing to be done through that slot.

● Spot-on analysis! That is exactly NVDEC's design philosophy.

  NVDEC's ring buffer based design:

  // Picture index provided by NVDEC (CurrPicIdx)
  typedef struct _CUVIDPICPARAMS {
      int CurrPicIdx;  // ← ring buffer slot index assigned by NVDEC (0~7 with 8 decode surfaces)
      // ...
  } CUVIDPICPARAMS;

  // Usage pattern NVDEC expects:
  // 1. HandlePictureDecode: decode using CurrPicIdx
  // 2. HandlePictureDisplay: picture_index reports which slot is ready
  // 3. Application: access that slot's surface via picture_index

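  Why NVDEC doesn't provide user_data:

  1. The ring buffer index already serves as the context:
    - CurrPicIdx (0~7) is the unique identifier
    - The application uses this index to access its own metadata array
  2. The structure NVDEC expects:
  // Application design NVIDIA expects: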
  struct AppFrameContext {
      void* surface;
      uint64_t timestamp;
      // ... other metadata
  };

  AppFrameContext g_frameContexts[8];  // same size as the ring buffer

  // HandlePictureDecode:
  int idx = pic_params->CurrPicIdx;
  g_frameContexts[idx].timestamp = current_timestamp;

  // HandlePictureDisplay:
  int idx = disp_info->picture_index;
  DisplayFrame(g_frameContexts[idx]);

  Our problem:

  We use a ring buffer structure that is separate from NVDEC's ring buffer (8 slots):
  - NVDEC: CurrPicIdx (0~7, managed internally by NVDEC)
  - VavCore: slot_id (monotonic counter), slot_idx = slot_id % 8

  Where they collide:

  // What NVDEC provides:
  CurrPicIdx = 3  // slot 3 of NVDEC's internal ring buffer

  // What we want:
  my_id = 42  // the frame with ID 42
  slot_idx = 42 % 8 = 2  // slot 2 of our ring buffer

  // ❌ CurrPicIdx ≠ slot_idx (two different ring buffers!)

  Solutions:

  Option 1: Use NVDEC's CurrPicIdx directly (NVIDIA's recommended approach)
  // Manage DecodeSlot by CurrPicIdx
  DecodeSlot m_ringBuffer[8];  // maps 1:1 to NVDEC's ring buffer

  // In HandlePictureDecode:
  int nvdec_slot = pic_params->CurrPicIdx;
  m_ringBuffer[nvdec_slot].target_surface = /* how do we know? */;
  Problem: we need a way to connect DecodeToSurface → HandlePictureDecode

  Option 2: m_currentSubmittingSlotId (current approach)
  // DecodeToSurface:
  m_currentSubmittingSlotId.store(my_id);  // set right before submission
  cuvidParseVideoData(...);  // synchronous call

  // HandlePictureDecode (called immediately, from inside cuvidParseVideoData):
  uint64_t slot_id = decoder->m_currentSubmittingSlotId.load();
  decoder->m_pictureIndexToSlotId[pic_params->CurrPicIdx] = slot_id;  // store the mapping

  Option 3: Submission Queue
  std::mutex m_submissionMutex;
  std::queue<uint64_t> m_pendingSlotIds;

  // DecodeToSurface:
  {
      std::lock_guard lock(m_submissionMutex);
      m_pendingSlotIds.push(my_id);
  }
  cuvidParseVideoData(...);

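  // HandlePictureDecode:
  {
      std::lock_guard lock(m_submissionMutex);
      // Guard against an empty queue: a packet carrying more than one picture
      // fires this callback again after its single ID has already been popped.
      if (m_pendingSlotIds.empty()) {
          return 0;
      }
      uint64_t slot_id = m_pendingSlotIds.front();
      m_pendingSlotIds.pop();
      m_pictureIndexToSlotId[pic_params->CurrPicIdx] = slot_id;
  }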
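
  The submission-queue idea can be exercised without NVDEC by calling the "callback" by hand. A minimal sketch, assuming one pending ID per submitted packet; SubmissionTracker and its method names are hypothetical:

```cpp
#include <cstdint>
#include <mutex>
#include <queue>
#include <unordered_map>

class SubmissionTracker {
public:
    // Called by the submitting thread right before parsing a packet.
    void OnSubmit(uint64_t slot_id) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_pending.push(slot_id);
    }

    // Called from the decode callback. Returns false when no ID is pending,
    // for example on the second picture of a multi-frame packet.
    bool OnPictureDecode(int curr_pic_idx) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_pending.empty()) {
            return false;
        }
        m_pictureToSlot[curr_pic_idx] = m_pending.front();
        m_pending.pop();
        return true;
    }

    // Looks up which slot ID a picture index was mapped to.
    bool SlotFor(int curr_pic_idx, uint64_t& out_slot_id) const {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_pictureToSlot.find(curr_pic_idx);
        if (it == m_pictureToSlot.end()) {
            return false;
        }
        out_slot_id = it->second;
        return true;
    }

private:
    mutable std::mutex m_mutex;
    std::queue<uint64_t> m_pending;
    std::unordered_map<int, uint64_t> m_pictureToSlot;
};
```

  The false return from OnPictureDecode shows the weakness of a pure queue: it pairs one ID with one picture, so a packet that yields several pictures needs a different handoff.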

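  Conclusion:

  You are right. NVDEC provides the CurrPicIdx ring buffer index, so its designers presumably saw no need for separate user_data. But because we use
  our own slot_id system, we need an intermediate step that maps NVDEC's CurrPicIdx to our slot_id.

  The m_currentSubmittingSlotId atomic is the simplest and safest way to do this mapping. cuvidParseVideoData() invokes its callbacks synchronously,
  so there is no race condition as long as only one thread submits at a time.

  I think the right move is to remove ParseContext and go back to m_currentSubmittingSlotId.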
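
  The handoff through m_currentSubmittingSlotId can be simulated by replacing cuvidParseVideoData with a loop that fires the callback synchronously. A sketch under stated assumptions: HandoffDecoder, Submit, and m_submitMutex are hypothetical, and the mutex serializes submitters so the single atomic is never overwritten mid-parse.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

class HandoffDecoder {
public:
    // Simulates DecodeToSurface: publish the ID, then "parse" synchronously.
    // pictures_in_packet lists the CurrPicIdx values the fake parser reports.
    void Submit(uint64_t slot_id, const std::vector<int>& pictures_in_packet) {
        std::lock_guard<std::mutex> submit_lock(m_submitMutex);  // one submitter at a time
        m_currentSubmittingSlotId.store(slot_id);
        for (int curr_pic_idx : pictures_in_packet) {
            OnPictureDecode(curr_pic_idx);  // stands in for the callback inside the parse call
        }
    }

    bool SlotFor(int curr_pic_idx, uint64_t& out_slot_id) const {
        std::lock_guard<std::mutex> lock(m_mapMutex);
        auto it = m_pictureIndexToSlotId.find(curr_pic_idx);
        if (it == m_pictureIndexToSlotId.end()) {
            return false;
        }
        out_slot_id = it->second;
        return true;
    }

private:
    // Every picture decoded during one Submit() maps to the same slot ID.
    void OnPictureDecode(int curr_pic_idx) {
        std::lock_guard<std::mutex> lock(m_mapMutex);
        m_pictureIndexToSlotId[curr_pic_idx] = m_currentSubmittingSlotId.load();
    }

    std::mutex m_submitMutex;
    mutable std::mutex m_mapMutex;
    std::atomic<uint64_t> m_currentSubmittingSlotId{0};
    std::unordered_map<int, uint64_t> m_pictureIndexToSlotId;
};
```

  Unlike a queue, this handles a multi-frame packet naturally, because the ID stays published for the whole parse call.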

> Is there any reason we can't merge CurrPicIdx and slot_id into a single value?

● Great question! Thinking again about why they couldn't be merged, I find no fundamental reason that prevents it.

  Let me analyze why we currently use two indices:

● Search(pattern: "CurrPicIdx|slot_id|slot_idx", path: "vav2\platforms\windows\vavcore\src\Decoder\NVDECAV1Decoder.cpp", output_mode: "content", head_limit:
        50)
  ⎿  Found 50 lines (ctrl+o to expand)

● Search(pattern: "slot_idx = |DecodeSlot\[|m_ringBuffer\[", path: "vav2\platforms\windows\vavcore\src\Decoder\NVDECAV1Decoder.cpp", output_mode: "content",
        head_limit: 30)
  ⎿  Found 6 lines (ctrl+o to expand)

● Excellent point! We can use CurrPicIdx directly.

  Problems with the current structure:

  // Current: needlessly complex
  slot_id = m_slotIdCounter.fetch_add(1);  // 0, 1, 2, 3, ... (monotonic)
  slot_idx = slot_id % 8;                   // 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, ...

  // What NVDEC provides:
  CurrPicIdx = 0~7  // already a ring buffer index!

  Key insight:

  NVDEC's CurrPicIdx (0~7) already acts as a ring buffer index!

  Why did we create slot_id?
  - To guarantee FIFO order (m_returnIdCounter)
  - To track submission order

  However:
  - NVDEC already manages ordering with its own ring buffer
  - CurrPicIdx reflects NVDEC's internal allocation order

  Proposed simplification:

  // Option: use CurrPicIdx directly
  DecodeSlot m_ringBuffer[8];  // same size as NVDEC's ring buffer

  // DecodeToSurface:
  // 1. Find a free slot (scan all slots)
  for (int i = 0; i < 8; i++) {
      if (!m_ringBuffer[i].in_use.load()) {
          // Reserve the slot
          m_ringBuffer[i].in_use = true;
          m_ringBuffer[i].target_surface = target_surface;
          break;
      }
  }

  // 2. Call cuvidParseVideoData
  cuvidParseVideoData(...);

  // HandlePictureDecode:
  // NVDEC tells us the CurrPicIdx
  int nvdec_slot = pic_params->CurrPicIdx;
  m_ringBuffer[nvdec_slot]./* set metadata */;

  // HandlePictureDisplay:
  int nvdec_slot = disp_info->picture_index;
  m_ringBuffer[nvdec_slot].is_ready = true;

  Problem:
  DecodeToSurface cannot know in advance which slot will be used. The CurrPicIdx that NVDEC assigns only becomes visible in the HandlePictureDecode callback.

  So a mapping is still needed:

  // Ways to pass information from DecodeToSurface → HandlePictureDecode:

  // Method 1: m_currentSubmittingSlotId (existing)
  m_currentSubmittingSlotId.store(my_id);

  // Method 2: stash target_surface temporarily
  m_pendingTargetSurface = target_surface;
  cuvidParseVideoData(...);

  // HandlePictureDecode:
  int nvdec_slot = pic_params->CurrPicIdx;
  m_ringBuffer[nvdec_slot].target_surface = decoder->m_pendingTargetSurface;

  Conclusion:

  CurrPicIdx can be used directly as the ring buffer index, but one problem remains: DecodeToSurface cannot allocate a slot in advance, because
  which CurrPicIdx NVDEC will use is only known inside HandlePictureDecode.

  Best simplification:

  // Simplify to m_currentSubmittingContext (the whole context instead of a slot_id)
  struct SubmittingContext {
      void* target_surface;
      VavCoreSurfaceType surface_type;
  };

  std::atomic<SubmittingContext*> m_currentSubmittingContext;

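  // DecodeToSurface:
  SubmittingContext ctx{target_surface, target_type};
  m_currentSubmittingContext.store(&ctx);     // ctx lives on this stack frame
  cuvidParseVideoData(...);
  m_currentSubmittingContext.store(nullptr);  // do not leave a dangling pointer behind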

  // HandlePictureDecode:
  int nvdec_slot = pic_params->CurrPicIdx;
  auto* ctx = decoder->m_currentSubmittingContext.load();
  decoder->m_ringBuffer[nvdec_slot].target_surface = ctx->target_surface;
  decoder->m_ringBuffer[nvdec_slot].surface_type = ctx->surface_type;
  decoder->m_ringBuffer[nvdec_slot].in_use = true;

  This removes slot_id, m_slotIdCounter, and the m_pictureIndexToSlotId mapping entirely, and uses NVDEC's CurrPicIdx directly!
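
  A CurrPicIdx-indexed slot table with a context handoff might look roughly like the sketch below. It is single-threaded and illustrative only: PicIdxSlots, BeginSubmit, and EndSubmit are hypothetical names, the slot count of 8 is assumed from the discussion, and the structs are simplified stand-ins for the real DecodeSlot and SubmittingContext (an int replaces VavCoreSurfaceType).

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kNumSlots = 8;  // assumed number of decode surfaces

struct SubmittingContext {
    void* target_surface;
    int surface_type;  // stand-in for VavCoreSurfaceType
};

struct DecodeSlot {
    void* target_surface = nullptr;
    int surface_type = 0;
    bool in_use = false;
    bool is_ready = false;
};

class PicIdxSlots {
public:
    // Bracket the synchronous parse call with these two.
    void BeginSubmit(const SubmittingContext* ctx) { m_ctx = ctx; }
    void EndSubmit() { m_ctx = nullptr; }

    // Decode callback: the slot is chosen by CurrPicIdx, not by the caller.
    bool OnPictureDecode(int curr_pic_idx) {
        if (m_ctx == nullptr || !InRange(curr_pic_idx)) {
            return false;
        }
        DecodeSlot& slot = m_slots[static_cast<std::size_t>(curr_pic_idx)];
        slot.target_surface = m_ctx->target_surface;
        slot.surface_type = m_ctx->surface_type;
        slot.in_use = true;
        slot.is_ready = false;
        return true;
    }

    // Display callback: marks the slot for picture_index as ready.
    bool OnPictureDisplay(int picture_index) {
        if (!InRange(picture_index)) {
            return false;
        }
        DecodeSlot& slot = m_slots[static_cast<std::size_t>(picture_index)];
        if (!slot.in_use) {
            return false;
        }
        slot.is_ready = true;
        return true;
    }

    const DecodeSlot& At(std::size_t i) const { return m_slots.at(i); }

private:
    static bool InRange(int idx) {
        return idx >= 0 && static_cast<std::size_t>(idx) < kNumSlots;
    }

    const SubmittingContext* m_ctx = nullptr;
    std::array<DecodeSlot, kNumSlots> m_slots{};
};
```

  Clearing the context in EndSubmit keeps a late callback from reading a pointer to a stack frame that has already returned.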

  Shall I refactor in this direction?

> Good. So instead of us assigning the ring buffer number first, we use the CurrPicIdx provided by the NVDEC API as the ring buffer index. Good.
  ⎿  Interrupted · What should Claude do instead?

> Before starting the work, reflect the improvements above in docs/working/NVDEC_RingBuffer_Decode_Design.md so the design document isn't confusing
when we reread it later.

● I'll reflect the improvements in the NVDEC RingBuffer design document: