Refactoring by Gemini

2025-10-07 00:52:35 +09:00
parent 8ff5472363
commit bcae9ee9c0
12 changed files with 726 additions and 648 deletions
--- a/vav2/docs/working/Color_Space_Correction_Design.md
+++ b/vav2/docs/working/Color_Space_Correction_Design.md
@@ -0,0 +1,103 @@
+# 색상 공간(Color Space) 처리 개선 설계
+
+- **문서 작성일:** 2025년 10월 6일
+- **작성자:** Gemini
+
+## 1. 문제 정의
+
+현재 `YUV420PUploadBackend`에 포함된 YUV to RGB 변환 컴퓨트 셰이더는 **BT.709** 색상 표준의 변환 행렬 계수를 하드코딩하여 사용하고 있다.
+
+```hlsl
+// In YUV420PUploadBackend.cpp
+// BT.709 YUV to RGB conversion
+// ...
+rgb.r = y + 1.5748f * v;
+rgb.g = y - 0.1873f * u - 0.4681f * v;
+rgb.b = y + 1.8556f * u;
+```
+
+이로 인해, 4K/HDR 콘텐츠에서 널리 사용되는 **BT.2020** 등 다른 색상 표준을 사용하는 비디오를 재생할 경우, 색상이 부정확하게(예: 물 빠진 색감으로) 표현되는 문제가 있다.
+
+## 2. 목표
+
+비디오 스트림의 메타데이터에 포함된 색상 공간 정보를 파이프라인 전체에 전달하여, 최종 렌더링 단계에서 각 비디오에 맞는 정확한 색상 변환을 수행하도록 시스템을 개선한다.
+
+## 3. 제안 설계
+
+데이터 구조를 확장하여 색상 정보를 전달하고, 렌더러가 이 정보를 사용해 동적으로 색상 변환을 수행하도록 변경한다.
+
+### 3.1. 데이터 구조 확장 (`IVideoDecoder.h` 또는 관련 공용 헤더)
+
+`VideoMetadata`와 `VideoFrame` 구조체에 비디오의 색상 표준 정보를 담을 수 있는 필드를 추가한다. WebM 컨테이너는 `MatrixCoefficients`, `TransferCharacteristics`, `Primaries` 정보를 제공하므로 이를 반영할 수 있는 열거형(enum)을 정의한다.
+
+```cpp
+// 예시: 공용 헤더 파일에 추가
+enum class VavMatrixCoefficients {
+    BT709 = 1,
+    BT2020_NON_CONSTANT_LUMINANCE = 9,
+    // ... 기타 표준
+};
+
+struct VideoFrame {
+    // ... 기존 필드 ...
+    VavMatrixCoefficients matrix_coefficients = VavMatrixCoefficients::BT709;
+};
+```
+
+### 3.2. 파서 및 디코더 수정
+
+-   **파서 (WebM 등):** 비디오 컨테이너에서 색상 관련 메타데이터를 읽어 `VideoMetadata` 객체에 채워 넣는다. (이번 작업 범위에서는 생략하고, 기본값을 BT.709로 가정)
+-   **디코더 (`NVDECAV1Decoder`):** `Initialize` 시점에 `VideoMetadata`로부터 색상 정보를 멤버 변수에 저장하고, `DecodeToSurface` 또는 `DecodeFrame` 함수에서 출력 `VideoFrame` 객체에 이 정보를 복사하여 전달한다.
+
+### 3.3. 렌더러 및 셰이더 수정 (`YUV420PUploadBackend.cpp`)
+
+1.  **상수 버퍼(Constant Buffer) 확장:** 컴퓨트 셰이더에 YUV-to-RGB 변환 행렬과 오프셋을 전달할 구조체를 정의한다.
+
+    ```cpp
+    struct YUVtoRGBConstants {
+        float matrix[3][3];
+        float offset[3];
+    };
+    ```
+
+2.  **백엔드 로직 수정:**
+    -   `RenderToBackBuffer` 또는 `UpdateFrame`과 유사한 함수에서 `VideoFrame`에 담겨온 `matrix_coefficients` 값을 확인한다.
+    -   확인된 값에 따라 BT.709 또는 BT.2020에 맞는 변환 행렬과 오프셋 값을 선택한다.
+    -   선택된 값으로 상수 버퍼의 내용을 갱신한다.
+
+3.  **컴퓨트 셰이더 수정:**
+    -   하드코딩된 변환 로직을 제거한다.
+    -   상수 버퍼로 전달받은 행렬과 오프셋을 사용하여 YUV to RGB 변환을 수행하도록 셰이더 코드를 수정한다.
+
+    ```hlsl
+    // 변경 후 셰이더 (개념)
+cbuffer YUVtoRGBConstants : register(b0)
+{
+    float3x3 yuv_to_rgb_matrix;
+    float3 yuv_offset;
+};
+
+// ...
+void main(uint3 id : SV_DispatchThreadID)
+{
+    float y = g_yTexture[id.xy].r - yuv_offset.r;
+    float u = g_uTexture[id.xy / 2].r - yuv_offset.g;
+    float v = g_vTexture[id.xy / 2].r - yuv_offset.b;
+
+    float3 rgb = mul(float3(y, u, v), yuv_to_rgb_matrix);
+
+    g_rgbTexture[id.xy] = float4(saturate(rgb), 1.0f);
+}
+    ```
+
+## 4. 기대 효과
+
+-   **색상 정확도 향상:** BT.709, BT.2020 등 다양한 색상 표준을 사용하는 비디오의 색상을 원본 의도에 맞게 정확하게 표현할 수 있다.
+-   **HDR 콘텐츠 지원 기반 마련:** 정확한 색역(Color Gamut) 및 전달 특성(Transfer Function) 처리를 통해 향후 HDR to SDR 톤 매핑(Tone-mapping) 등 고급 렌더링 기능을 구현할 수 있는 기반을 마련한다.
+
+## 5. 작업 범위
+
+1.  `IVideoDecoder.h` (또는 공용 헤더): `VideoFrame` 구조체에 색상 정보 필드 추가.
+2.  `NVDECAV1Decoder.cpp`: 디코딩된 `VideoFrame`에 색상 정보 채우기.
+3.  `YUV420PUploadBackend.cpp`: 상수 버퍼를 정의하고, 프레임의 색상 정보에 따라 적절한 행렬을 선택하여 업데이트하는 로직 추가.
+4.  `YUV420PUploadBackend.cpp`: 컴퓨트 셰이더 코드를 하드코딩 방식에서 상수 버퍼를 사용하도록 수정.
--- a/vav2/docs/working/NVDECAV1Decoder_Resource_Mgmt_Refactor_Design.md
+++ b/vav2/docs/working/NVDECAV1Decoder_Resource_Mgmt_Refactor_Design.md
@@ -0,0 +1,113 @@
+# NVDECAV1Decoder 리소스 관리 리팩토링 설계
+
+- **문서 작성일:** 2025년 10월 6일
+- **작성자:** Gemini
+
+## 1. 문제 정의
+
+현재 `NVDECAV1Decoder::DecodeToSurface` 함수 내에서, D3D12 표면 복사를 담당하는 `D3D12SurfaceHandler`와 색상 변환을 수행하는 `NV12ToRGBAConverter` 객체가 `if (!m_d3d12Handler)` 와 같은 조건문을 통해 **On-Demand 방식(필요할 때 생성하는 방식)**으로 생성되고 있다.
+
+이 방식은 메모리를 효율적으로 사용하는 것처럼 보일 수 있으나, `DecodeToSurface`는 매 프레임 호출되는 고성능 경로(hot path)이다. 이 경로 내에서 복잡한 객체를 생성하는 로직은 다음과 같은 문제를 유발할 수 있다.
+
+- **성능 예측 불가능성:** 비디오 시작 시, 또는 특정 조건(예: D3D 디바이스 변경) 만족 시 첫 프레임들에서 객체 생성으로 인한 미세한 끊김(stutter)이 발생할 수 있다.
+- **코드 복잡성:** 고성능 경로 내에 객체의 유무를 확인하고 생성하는 조건문이 포함되어 코드의 가독성과 유지보수성을 저해한다.
+
+## 2. 목표
+
+`D3D12SurfaceHandler`와 `NV12ToRGBAConverter`의 생성 시점을 디코딩의 고성능 경로 밖인 **초기화 단계**로 이동시킨다.
+
+-   `DecodeToSurface` 함수에서 객체 생성 로직을 제거하여 성능을 예측 가능하게 만든다.
+-   객체의 생명 주기를 디코더의 생명 주기와 일치시켜 코드를 단순화하고 명확하게 만든다.
+
+## 3. 제안 설계
+
+객체 생성의 책임을 `DecodeToSurface`에서 `SetD3DDevice` 함수로 이전한다. `SetD3DDevice`는 D3D12 연동이 필요한 시점에 호출되는 이상적인 초기화 지점이다.
+
+### 3.1. `SetD3DDevice` 함수 수정
+
+`NVDECAV1Decoder.cpp`의 `SetD3DDevice` 함수가 호출될 때, `m_d3d12Handler`를 생성하도록 수정한다. `m_rgbaConverter`는 실제 RGBA 변환이 필요할 때 생성되는 것이 더 효율적이므로, `DecodeToSurface` 내부에 두되 최초 한 번만 생성되도록 로직을 유지하거나, `SetD3DDevice`에서 함께 생성할 수 있다. (설계 단순화를 위해 함께 생성하는 것을 제안)
+
+**변경 후 `SetD3DDevice` 로직 (개념):**
+```cpp
+// In SetD3DDevice
+bool NVDECAV1Decoder::SetD3DDevice(void* d3d_device, VavCoreSurfaceType type) {
+    // ... 기존 로직 ...
+
+    if (type == VAVCORE_SURFACE_D3D12_RESOURCE) {
+        // ... 기존 로직 ...
+
+        // 핸들러 및 컨버터 (재)생성
+        m_d3d12Handler.reset(); // 기존 객체 해제
+        m_d3d12Handler = std::make_unique<D3D12SurfaceHandler>(
+            static_cast<ID3D12Device*>(m_d3d12Device),
+            m_cuContext
+        );
+
+        m_rgbaConverter.reset(); // 기존 객체 해제
+        m_rgbaConverter = std::make_unique<NV12ToRGBAConverter>();
+        // 컨버터 초기화는 실제 해상도를 알게 된 후 DecodeToSurface에서 수행
+
+        LOGF_DEBUG("[SetD3DDevice] D3D12SurfaceHandler and NV12ToRGBAConverter instances created.");
+
+        m_cachedD3D12Device = m_d3d12Device;
+        m_cachedCuContext = m_cuContext;
+
+        success = true;
+    }
+    // ...
+    return success;
+}
+```
+
+### 3.2. `DecodeToSurface` 함수 수정
+
+`DecodeToSurface` 함수 내에서 `m_d3d12Handler`와 `m_rgbaConverter`의 존재를 확인하고 생성하던 `if` 블록을 모두 제거한다. `SetD3DDevice`에서 생성이 보장되므로, 해당 객체들을 즉시 사용하면 된다. `m_rgbaConverter`의 `Initialize`는 여전히 최초 사용 시 한 번만 호출되도록 유지한다.
+
+**변경 전 `DecodeToSurface` 로직 (일부):**
+```cpp
+// Create D3D12 surface handler on-demand
+if (!m_d3d12Handler ||
+    m_cachedD3D12Device != m_d3d12Device ||
+    m_cachedCuContext != m_cuContext) {
+    // ... 생성 로직 ...
+}
+
+// ...
+
+if (isRGBAFormat) {
+    // Create RGBA converter on-demand
+    if (!m_rgbaConverter) {
+        // ... 생성 로직 ...
+    }
+    // ...
+}
+```
+
+**변경 후 `DecodeToSurface` 로직 (일부):**
+```cpp
+// D3D12SurfaceHandler는 SetD3DDevice에서 생성이 보장되므로 바로 사용
+// ...
+
+if (isRGBAFormat) {
+    // NV12ToRGBAConverter는 SetD3DDevice에서 생성이 보장됨
+    // 단, 해상도/스트림 정보가 필요한 초기화는 여기서 수행
+    if (!m_rgbaConverter->IsInitialized()) { // IsInitialized() 와 같은 확인 메서드 추가 가정
+        if (!m_rgbaConverter->Initialize(m_width, m_height, m_stream)) {
+            // ... 에러 처리 ...
+        }
+    }
+    // ...
+}
+```
+
+## 4. 기대 효과
+
+-   **성능 일관성 확보:** 프레임 처리 고성능 경로에서 객체 생성 코드가 사라져, 매 프레임의 처리 시간이 더 균일해지고 예측 가능해진다.
+-   **코드 가독성 향상:** `DecodeToSurface` 함수의 책임이 '디코딩과 복사'로 명확해지고, 초기화 관련 코드가 제거되어 구조가 단순해진다.
+
+## 5. 작업 범위
+
+-   **포함:**
+    -   `NVDECAV1Decoder.cpp`의 `SetD3DDevice`, `DecodeToSurface` 함수 내부 로직 수정
+-   **미포함:**
+    -   `NVDECAV1Decoder` 클래스의 공개(public) 인터페이스 변경
--- a/vav2/docs/working/NVDECAV1Decoder_Sync_Stabilization_Design.md
+++ b/vav2/docs/working/NVDECAV1Decoder_Sync_Stabilization_Design.md
@@ -0,0 +1,125 @@
+# NVDECAV1Decoder 동기화 로직 안정화 설계
+
+- **문서 작성일:** 2025년 10월 6일
+- **작성자:** Gemini
+
+## 1. 문제 정의
+
+현재 `NVDECAV1Decoder::DecodeToSurface` 함수는 디코딩 완료 및 프레임 반환 순서를 보장하기 위해 `sleep_for`를 사용하는 폴링(polling) 및 busy-wait 방식에 크게 의존하고 있다.
+
+1.  **FIFO 순서 보장 로직:** `while (m_returnCounter.load() != my_submission_id)` 루프 내 `sleep_for(1ms)`는 불필요한 지연을 유발하며, 특정 조건에서 데드락(무한 루프)을 발생시킬 수 있는 위험이 있다.
+2.  **디코딩 완료 대기 로직:** `frame_ready.wait_for`는 해상도에 따라 수 초에 달하는 매우 긴 타임아웃을 사용한다. 이는 시스템의 잠재적 불안정성을 회피하기 위한 방어 코드(고육지책)로 보이며, 실제 문제 발생 시 애플리케이션이 장시간 멈추는 결과를 초래한다.
+
+이 두 가지 `sleep` 기반 메커니즘은 시스템의 안정성을 저해하고, 예측 불가능한 성능 저하를 유발하는 핵심 원인이다.
+
+## 2. 목표
+
+대규모 아키텍처 변경 없이, `DecodeToSurface` 함수 **내부의 동기화 로직**을 안정적이고 효율적인 이벤트 기반 방식으로 수정한다.
+
+-   `sleep` 기반 대기를 제거하여 데드락 위험을 최소화하고 안정성을 향상시킨다.
+-   불필요한 지연 시간을 제거하여 디코딩 파이프라인의 응답성을 높인다.
+-   함수의 외부 인터페이스는 변경하지 않고 내부 구현만 개선하여 하위 호환성을 유지한다.
+
+## 3. 제안 설계
+
+핵심 아이디어는 **"전역적인 신호(global notification)를 개별적인 신호(per-slot notification)로"** 변경하는 것이다.
+
+### 3.1. `DecodeSlot` 구조체 수정
+
+`NVDECAV1Decoder.h` 파일의 `DecodeSlot` 구조체에 각 슬롯별 동기화를 위한 `std::condition_variable`과 `std::mutex`를 추가한다.
+
+**기존:**
+```cpp
+struct DecodeSlot {
+    std::atomic<bool> in_use{false};
+    std::atomic<bool> is_ready{false};
+    // ...
+};
+```
+
+**변경 후:**
+```cpp
+struct DecodeSlot {
+    std::atomic<bool> in_use{false};
+    std::atomic<bool> is_ready{false};
+    
+    std::mutex slot_mutex;
+    std::condition_variable slot_cv;
+
+    // ...
+};
+```
+
+### 3.2. `PollingThreadFunc` 수정 (신호 전송부)
+
+폴링 스레드는 특정 슬롯(`slot[i]`)의 디코딩 완료(`cuvidGetDecodeStatus`)를 감지하면, 해당 슬롯의 `condition_variable`에만 직접 신호를 보낸다.
+
+**기존 (개념):**
+```cpp
+// PollingThreadFunc
+if (decode_complete) {
+    slot.is_ready.store(true);
+    // 전역 CV에 신호를 보냄
+    global_cv.notify_all(); 
+}
+```
+
+**변경 후:**
+```cpp
+// PollingThreadFunc
+if (decode_complete) {
+    {
+        std::lock_guard<std::mutex> lock(slot.slot_mutex);
+        slot.is_ready.store(true);
+    } // lock 해제
+    slot.slot_cv.notify_one(); // 해당 슬롯을 기다리는 스레드만 깨움
+}
+```
+
+### 3.3. `DecodeToSurface` 수정 (신호 수신부)
+
+`DecodeToSurface` 함수는 더 이상 해상도 기반의 긴 타임아웃을 사용하지 않는다. 대신, 자신이 기다려야 할 슬롯(`my_slot`)을 특정한 후, 해당 슬롯의 `condition_variable`만 직접 기다린다.
+
+**기존:**
+```cpp
+// 7. Wait for decode to complete with adaptive timeout based on resolution
+// ...
+if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(timeout_ms), ...)) {
+    // 매우 긴 타임아웃 처리
+    // ...
+}
+```
+
+**변경 후:**
+```cpp
+// 7. Wait for decode to complete by waiting on the specific slot's condition variable
+DecodeSlot& my_slot = m_ringBuffer[my_slot_idx];
+{
+    std::unique_lock<std::mutex> lock(my_slot.slot_mutex);
+    // is_ready 플래그를 확인하며 신호가 올 때까지 대기
+    if (!my_slot.slot_cv.wait_for(lock, std::chrono::seconds(2), [&my_slot]{ return my_slot.is_ready.load(); })) {
+        // 안전장치: 2초의 고정 타임아웃. 드라이버 멈춤 등 예외 상황 방지
+        LOGF_ERROR("[DecodeToSurface] Decode timeout for slot %d after 2 seconds", my_slot_idx);
+        my_slot.in_use.store(false);
+        m_returnCounter.fetch_add(1); // 데드락 방지를 위해 카운터 증가
+        return false;
+    }
+}
+// is_ready가 true임이 보장됨
+```
+*참고: FIFO 순서 보장을 위한 `while` 루프는 이 설계 변경의 범위에 포함되지 않으나, 디코딩 완료 대기 로직이 안정화되면 해당 루프의 데드락 위험도 간접적으로 감소한다.*
+
+## 4. 기대 효과
+
+1.  **안정성 대폭 향상:** `sleep`과 긴 타임아웃으로 인한 불확실성을 제거하고, 특정 프레임의 처리가 멈추더라도 다른 프레임의 대기 로직에 영향을 주지 않아 데드락 위험이 크게 감소한다.
+2.  **지연 시간(Latency) 감소:** 디코딩 완료 즉시 해당 프레임을 기다리던 스레드가 깨어나므로, 불필요한 `sleep` 지연이 사라지고 시스템 응답성이 향상된다.
+3.  **코드 단순화 및 유지보수성 향상:** 복잡한 동적 타임아웃 계산 로직이 사라지고, "신호-대기" 관계가 명확해져 코드 이해가 쉬워진다.
+
+## 5. 작업 범위
+
+-   **포함:**
+    -   `NVDECAV1Decoder.h`의 `DecodeSlot` 구조체 수정
+    -   `NVDECAV1Decoder.cpp`의 `PollingThreadFunc`, `DecodeToSurface` 함수 내부 로직 수정
+-   **미포함:**
+    -   `NVDECAV1Decoder` 클래스의 공개(public) 인터페이스 변경
+    -   `Player` 또는 `Renderer` 등 외부 클래스의 아키텍처 변경
--- a/vav2/platforms/windows/applications/vav2player/Vav2Player/src/Rendering/YUV420PUploadBackend.cpp
+++ b/vav2/platforms/windows/applications/vav2player/Vav2Player/src/Rendering/YUV420PUploadBackend.cpp
--- a/vav2/platforms/windows/applications/vav2player/Vav2Player/src/Rendering/YUV420PUploadBackend.h
+++ b/vav2/platforms/windows/applications/vav2player/Vav2Player/src/Rendering/YUV420PUploadBackend.h
@@ -114,6 +114,14 @@ private:
    };
    ComPtr<ID3D12Resource> m_constantBuffer;

+    // Constant buffer for YUV-to-RGB color conversion
+    struct YUVtoRGBConstants {
+        DirectX::XMFLOAT4X4 colorMatrix;
+        DirectX::XMFLOAT4 offsets; // Y, U, V, _
+    };
+    ComPtr<ID3D12Resource> m_computeConstantBuffer;
+    YUVtoRGBConstants* m_pComputeCbvDataBegin = nullptr;
+
    // State
    bool m_initialized = false;
    uint32_t m_width = 0;   // Container width
@@ -122,6 +130,7 @@ private:
    uint32_t m_videoHeight = 0;

    // Helper methods
+    HRESULT UpdateColorMatrix(const VavCoreVideoFrame& frame);
    HRESULT CreateRingBuffers(uint32_t videoWidth, uint32_t videoHeight);
    HRESULT CreateGPUTextures(uint32_t videoWidth, uint32_t videoHeight);
    HRESULT CreateComputeShaderResources();
--- a/vav2/platforms/windows/applications/vav2player/Vav2Player/src/Rendering/d3dx12.h
+++ b/vav2/platforms/windows/applications/vav2player/Vav2Player/src/Rendering/d3dx12.h
@@ -7,6 +7,26 @@
 #include <d3d12.h>
 #include <cstring>

+// Standalone helper function for creating a transition barrier.
+// This is preferred over the CD3DX12_RESOURCE_BARRIER struct to avoid compiler issues.
+static inline D3D12_RESOURCE_BARRIER ResourceBarrierTransition(
+    ID3D12Resource* pResource,
+    D3D12_RESOURCE_STATES stateBefore,
+    D3D12_RESOURCE_STATES stateAfter,
+    UINT subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES,
+    D3D12_RESOURCE_BARRIER_FLAGS flags = D3D12_RESOURCE_BARRIER_FLAG_NONE)
+{
+    D3D12_RESOURCE_BARRIER result = {};
+    result.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
+    result.Flags = flags;
+    result.Transition.pResource = pResource;
+    result.Transition.StateBefore = stateBefore;
+    result.Transition.StateAfter = stateAfter;
+    result.Transition.Subresource = subresource;
+    return result;
+}
+
+
 // Helper struct for default values
 struct CD3DX12_DEFAULT {};
 static const CD3DX12_DEFAULT D3D12_DEFAULT = {};
--- a/vav2/platforms/windows/vavcore/include/VavCore/VavCore.h
+++ b/vav2/platforms/windows/vavcore/include/VavCore/VavCore.h
@@ -71,23 +71,39 @@ typedef enum {
    VAVCORE_SURFACE_COREVIDEO_PIXELBUFFER = 11  // CVPixelBuffer (iOS/macOS)
 } VavCoreSurfaceType;

+// YUV-to-RGB Matrix Coefficients
+// Based on https://www.itu.int/rec/T-REC-H.273-201612-I/en
+typedef enum {
+    VAVCORE_MATRIX_COEFFICIENTS_IDENTITY = 0,
+    VAVCORE_MATRIX_COEFFICIENTS_BT709 = 1,
+    VAVCORE_MATRIX_COEFFICIENTS_UNSPECIFIED = 2,
+    VAVCORE_MATRIX_COEFFICIENTS_BT470M = 4,
+    VAVCORE_MATRIX_COEFFICIENTS_BT470BG = 5,
+    VAVCORE_MATRIX_COEFFICIENTS_BT601 = 6,
+    VAVCORE_MATRIX_COEFFICIENTS_SMPTE240M = 7,
+    VAVCORE_MATRIX_COEFFICIENTS_YCGCO = 8,
+    VAVCORE_MATRIX_COEFFICIENTS_BT2020_NON_CONSTANT_LUMINANCE = 9,
+    VAVCORE_MATRIX_COEFFICIENTS_BT2020_CONSTANT_LUMINANCE = 10,
+} VavMatrixCoefficients;
+
 // Video frame data structure (C-compatible)
 typedef struct {
    // Legacy CPU fields (maintained for backward compatibility)
-    uint8_t* y_plane;      // Y plane data
-    uint8_t* u_plane;      // U plane data
-    uint8_t* v_plane;      // V plane data
+    uint8_t* y_plane;
+    uint8_t* u_plane;
+    uint8_t* v_plane;

-    int y_stride;          // Y plane stride
-    int u_stride;          // U plane stride
-    int v_stride;          // V plane stride
+    int y_stride;
+    int u_stride;
+    int v_stride;

    // Frame metadata
-    int width;             // Frame width
-    int height;            // Frame height
+    int width;
+    int height;
+    VavMatrixCoefficients matrix_coefficients;

-    uint64_t timestamp_us; // Timestamp in microseconds
-    uint64_t frame_number; // Frame sequence number
+    uint64_t timestamp_us;
+    uint64_t frame_number;

    // GPU synchronization
    uint64_t sync_fence_value; // Fence value to wait on before using the surface
@@ -103,77 +119,75 @@ typedef struct {

        struct {
            // D3D11 texture
-            void* d3d11_texture;       // ID3D11Texture2D*
-            void* d3d11_device;        // ID3D11Device*
+            void* d3d11_texture;
+            void* d3d11_device;
            uint32_t subresource_index;
        } d3d11;

        struct {
            // D3D12 resource (NV12 format for hardware decoders)
-            // Format: DXGI_FORMAT_NV12 (native 2-plane YUV format)
-            // NVDEC decodes directly to NV12 - pixel shader converts to RGB for display
-            void* d3d12_resource;      // ID3D12Resource* (NV12 texture)
-            void* d3d12_device;        // ID3D12Device*
+            void* d3d12_resource;
+            void* d3d12_device;
            uint32_t subresource_index;
        } d3d12;

        struct {
            // CUDA device pointer
-            uint64_t device_ptr;       // CUdeviceptr
+            uint64_t device_ptr;
            uint32_t pitch;
-            void* cuda_context;        // CUcontext
+            void* cuda_context;
        } cuda;

        struct {
            // AMF surface
-            void* amf_surface;         // AMFSurface*
-            void* amf_context;         // AMFContext*
+            void* amf_surface;
+            void* amf_context;
        } amf;

        struct {
            // Android native window
-            void* native_window;       // ANativeWindow*
-            int format;                // Android pixel format
+            void* native_window;
+            int format;
        } android_native;

        struct {
            // OpenGL ES texture (Android/iOS)
-            uint32_t texture_id;       // GLuint
-            uint32_t target;           // GL_TEXTURE_2D, GL_TEXTURE_EXTERNAL_OES
-            void* egl_context;         // EGLContext
+            uint32_t texture_id;
+            uint32_t target;
+            void* egl_context;
        } opengl_es;

        struct {
            // Vulkan image (Android/Windows)
-            void* vk_image;            // VkImage
-            void* vk_device;           // VkDevice
-            void* vk_device_memory;    // VkDeviceMemory
+            void* vk_image;
+            void* vk_device;
+            void* vk_device_memory;
            uint32_t memory_offset;
        } vulkan;

        struct {
            // Android hardware buffer
-            void* hardware_buffer;     // AHardwareBuffer*
-            int usage_flags;           // Buffer usage flags
+            void* hardware_buffer;
+            int usage_flags;
        } android_hardware;

        struct {
            // Desktop OpenGL texture
-            uint32_t texture_id;       // GLuint
-            uint32_t target;           // GL_TEXTURE_2D
-            void* gl_context;          // OpenGL context
+            uint32_t texture_id;
+            uint32_t target;
+            void* gl_context;
        } opengl;

        struct {
            // Metal texture (iOS/macOS)
-            void* metal_texture;       // MTLTexture*
-            void* metal_device;        // MTLDevice*
+            void* metal_texture;
+            void* metal_device;
        } metal;

        struct {
            // Core Video pixel buffer (iOS/macOS)
-            void* pixel_buffer;        // CVPixelBufferRef
-            uint32_t pixel_format;     // OSType
+            void* pixel_buffer;
+            uint32_t pixel_format;
        } corevideo;
    } surface_data;
 } VavCoreVideoFrame;
@@ -186,6 +200,7 @@ typedef struct {
    double duration_seconds;
    uint64_t total_frames;
    const char* codec_name;
+    VavMatrixCoefficients matrix_coefficients;
 } VavCoreVideoMetadata;

 // Performance metrics
@@ -240,12 +255,6 @@ VAVCORE_API VavCoreResult vavcore_set_target_framerate(VavCorePlayer* player, do
 VAVCORE_API int vavcore_supports_surface_type(VavCorePlayer* player, VavCoreSurfaceType type);
 VAVCORE_API VavCoreSurfaceType vavcore_get_optimal_surface_type(VavCorePlayer* player);

-// Decode directly to GPU surface (zero-copy for hardware decoders)
-// For VAVCORE_SURFACE_D3D12_RESOURCE:
-//   - target_surface must be ID3D12Resource* with DXGI_FORMAT_R8_UNORM
-//   - Texture height must be videoHeight * 1.5 (Y plane + UV plane)
-//   - NVDEC writes NV12 layout: Y at [0, videoHeight), UV at [videoHeight, videoHeight*1.5)
-//   - Resource must be created with D3D12_HEAP_FLAG_SHARED for CUDA interop
 VAVCORE_API VavCoreResult vavcore_decode_to_surface(VavCorePlayer* player,
                                                   VavCoreSurfaceType target_type,
                                                   void* target_surface,
@@ -258,7 +267,9 @@ VAVCORE_API void* vavcore_get_sync_fence(VavCorePlayer* player); // Returns ID3D

 // Android
 VAVCORE_API VavCoreResult vavcore_set_android_surface(VavCorePlayer* player, void* native_window);
+
 VAVCORE_API VavCoreResult vavcore_set_opengl_es_context(VavCorePlayer* player, void* egl_context);
+
 VAVCORE_API VavCoreResult vavcore_set_vulkan_device(VavCorePlayer* player, void* vk_device, void* vk_instance);

 // Cross-platform OpenGL
@@ -310,4 +321,4 @@ namespace VavCore {
    };
 }

-#endif
+#endif
--- a/vav2/platforms/windows/vavcore/src/Common/VideoTypes.h
+++ b/vav2/platforms/windows/vavcore/src/Common/VideoTypes.h
@@ -29,6 +29,21 @@ enum class ColorSpace {
    VULKAN_IMAGE     // Vulkan image format
 };

+// YUV-to-RGB Matrix Coefficients
+// Based on https://www.itu.int/rec/T-REC-H.273-201612-I/en
+enum class VavMatrixCoefficients {
+    IDENTITY = 0,
+    BT709 = 1,
+    UNSPECIFIED = 2,
+    BT470M = 4,
+    BT470BG = 5,
+    BT601 = 6,
+    SMPTE240M = 7,
+    YCGCO = 8,
+    BT2020_NON_CONSTANT_LUMINANCE = 9,
+    BT2020_CONSTANT_LUMINANCE = 10,
+};
+
 // Pixel format definitions
 enum class PixelFormat {
    YUV420P,
@@ -53,6 +68,7 @@ struct VideoMetadata {
    VideoCodecType codec_type = VideoCodecType::AV1;
    std::string codec_name;
    ColorSpace color_space = ColorSpace::YUV420P;
+    VavMatrixCoefficients matrix_coefficients = VavMatrixCoefficients::UNSPECIFIED;

    // Pixel format information
    uint32_t bit_depth = 8;
@@ -82,6 +98,7 @@ struct VideoFrame {
    uint32_t height = 0;
    PixelFormat format = PixelFormat::YUV420P;  // Pixel format
    ColorSpace color_space = ColorSpace::YUV420P;
+    VavMatrixCoefficients matrix_coefficients = VavMatrixCoefficients::UNSPECIFIED;

    // YUV data (per plane)
    std::unique_ptr<uint8_t[]> y_plane;
@@ -119,6 +136,7 @@ struct VideoFrame {
        , width(other.width)
        , height(other.height)
        , color_space(other.color_space)
+        , matrix_coefficients(other.matrix_coefficients)
        , y_plane(std::move(other.y_plane))
        , u_plane(std::move(other.u_plane))
        , v_plane(std::move(other.v_plane))
@@ -142,6 +160,7 @@ struct VideoFrame {
            width = other.width;
            height = other.height;
            color_space = other.color_space;
+            matrix_coefficients = other.matrix_coefficients;
            y_plane = std::move(other.y_plane);
            u_plane = std::move(other.u_plane);
            v_plane = std::move(other.v_plane);
@@ -193,6 +212,7 @@ struct VideoFrame {
        width = 0;
        height = 0;
        color_space = ColorSpace::YUV420P;
+        matrix_coefficients = VavMatrixCoefficients::UNSPECIFIED;
        y_plane.reset();
        u_plane.reset();
        v_plane.reset();
@@ -267,4 +287,4 @@ struct VideoPacket {
    }
 };

-} // namespace VavCore
+} // namespace VavCore
--- a/vav2/platforms/windows/vavcore/src/Decoder/D3D12SurfaceHandler.cpp
+++ b/vav2/platforms/windows/vavcore/src/Decoder/D3D12SurfaceHandler.cpp
@@ -243,7 +243,8 @@ bool D3D12SurfaceHandler::CopyUVPlane(CUdeviceptr src, uint32_t src_pitch,
 bool D3D12SurfaceHandler::CopyRGBAFrame(CUdeviceptr src_rgba,
                                        ID3D12Resource* dst_texture,
                                        uint32_t width,
-                                        uint32_t height)
+                                        uint32_t height,
+                                        CUstream stream)
 {
    if (!dst_texture) {
        LOGF_ERROR("[D3D12SurfaceHandler] Null destination texture");
@@ -263,7 +264,7 @@ bool D3D12SurfaceHandler::CopyRGBAFrame(CUdeviceptr src_rgba,
               width, height, (unsigned long long)dst_surface);

    // Use surface write kernel to handle tiled layout automatically
-    if (!CopyRGBAToSurfaceViaKernel(dst_surface, src_rgba, width, height, src_pitch)) {
+    if (!CopyRGBAToSurfaceViaKernel(dst_surface, src_rgba, width, height, src_pitch, stream)) {
        LOGF_ERROR("[D3D12SurfaceHandler] Failed to copy RGBA via surface kernel");
        return false;
    }
@@ -302,7 +303,8 @@ bool D3D12SurfaceHandler::LoadSurfaceWriteKernel()

 bool D3D12SurfaceHandler::CopyRGBAToSurfaceViaKernel(cudaSurfaceObject_t dst_surface,
                                                       CUdeviceptr src_rgba,
-                                                       uint32_t width, uint32_t height, uint32_t src_pitch)
+                                                       uint32_t width, uint32_t height, uint32_t src_pitch,
+                                                       CUstream stream)
 {
    if (!m_surfaceWriteKernel) {
        LOGF_ERROR("[D3D12SurfaceHandler] Surface write kernel not loaded");
@@ -331,7 +333,7 @@ bool D3D12SurfaceHandler::CopyRGBAToSurfaceViaKernel(cudaSurfaceObject_t dst_sur
        grid_size.x, grid_size.y, 1,
        block_size.x, block_size.y, 1,
        0,       // Shared memory
-        0,       // Stream
+        stream,       // Stream
        kernel_args,
        nullptr
    );
@@ -341,12 +343,12 @@ bool D3D12SurfaceHandler::CopyRGBAToSurfaceViaKernel(cudaSurfaceObject_t dst_sur
        return false;
    }

-    // Synchronize to ensure kernel completes
-    cudaError_t err = cudaDeviceSynchronize();
-    if (err != cudaSuccess) {
-        LOGF_ERROR("[D3D12SurfaceHandler] Kernel synchronization failed: %s", cudaGetErrorString(err));
-        return false;
-    }
+    // Synchronize to ensure kernel completes - TEMPORARILY DISABLED FOR DEBUGGING
+    // cudaError_t err = cudaStreamSynchronize(stream);
+    // if (err != cudaSuccess) {
+    //     LOGF_ERROR("[D3D12SurfaceHandler] Kernel stream synchronization failed: %s", cudaGetErrorString(err));
+    //     return false;
+    // }

    LOGF_DEBUG("[D3D12SurfaceHandler] Surface write kernel completed successfully");
    return true;
--- a/vav2/platforms/windows/vavcore/src/Decoder/D3D12SurfaceHandler.h
+++ b/vav2/platforms/windows/vavcore/src/Decoder/D3D12SurfaceHandler.h
@@ -44,7 +44,8 @@ public:
    bool CopyRGBAFrame(CUdeviceptr src_rgba,
                       ID3D12Resource* dst_texture,
                       uint32_t width,
-                       uint32_t height);
+                       uint32_t height,
+                       CUstream stream);

    // Signal D3D12 fence from CUDA stream (not implemented yet)
    bool SignalD3D12Fence(uint64_t fence_value);
@@ -69,7 +70,8 @@ private:
    // Copy RGBA buffer to D3D12 surface using CUDA kernel
    bool CopyRGBAToSurfaceViaKernel(cudaSurfaceObject_t dst_surface,
                                     CUdeviceptr src_rgba,
-                                     uint32_t width, uint32_t height, uint32_t src_pitch);
+                                     uint32_t width, uint32_t height, uint32_t src_pitch,
+                                     CUstream stream);

 private:
    ID3D12Device* m_device;
--- a/vav2/platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp
+++ b/vav2/platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.cpp
@@ -144,6 +144,7 @@ bool NVDECAV1Decoder::Initialize(const VideoMetadata& metadata) {
    // Store video properties
    m_width = metadata.width;
    m_height = metadata.height;
+    m_matrixCoefficients = metadata.matrix_coefficients;
    m_maxWidth = std::max(m_width, 4096u);
    m_maxHeight = std::max(m_height, 4096u);

@@ -469,6 +470,7 @@ bool NVDECAV1Decoder::DecodeFrame(const uint8_t* packet_data, size_t packet_size
    output_frame.width = m_width;
    output_frame.height = m_height;
    output_frame.format = PixelFormat::YUV420P;
+    output_frame.matrix_coefficients = m_matrixCoefficients;

    // Update statistics
    auto decode_end = std::chrono::high_resolution_clock::now();
@@ -1127,6 +1129,22 @@ bool NVDECAV1Decoder::SetD3DDevice(void* d3d_device, VavCoreSurfaceType type) {
            }


+            // (Re)create handlers since the device context is being set/changed.
+            m_d3d12Handler.reset();
+            m_d3d12Handler = std::make_unique<D3D12SurfaceHandler>(
+                static_cast<ID3D12Device*>(m_d3d12Device),
+                m_cuContext
+            );
+
+            m_rgbaConverter.reset();
+            m_rgbaConverter = std::make_unique<NV12ToRGBAConverter>();
+
+            LOGF_DEBUG("[SetD3DDevice] D3D12SurfaceHandler and NV12ToRGBAConverter instances created.");
+
+            m_cachedD3D12Device = m_d3d12Device;
+            m_cachedCuContext = m_cuContext;
+
+
            // DON'T reinitialize if already initialized - this would invalidate existing decoded frames
            if (m_initialized) {
                LOGF_DEBUG("[SetD3DDevice] Decoder already initialized, D3D12 device set for DecodeToSurface");
@@ -1265,40 +1283,40 @@ bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data, size_t packet_
        DecodeSlot& my_slot = m_ringBuffer[my_slot_idx];
        LOGF_DEBUG("[DecodeToSurface] Found slot_idx=%d for submission_id=%llu", my_slot_idx, my_submission_id);

-        // 6. Wait for my turn in FIFO order
-        while (m_returnCounter.load() != my_submission_id) {
-            std::this_thread::sleep_for(std::chrono::milliseconds(1));
+        // 6. Wait for my turn in FIFO order using a condition variable
+        {
+            std::unique_lock<std::mutex> fifo_lock(m_fifoWaitMutex);
+            // Use a timeout as a safety net against deadlocks
+            if (!m_fifoWaitCV.wait_for(fifo_lock, std::chrono::seconds(2), [this, my_submission_id] { return m_returnCounter.load() >= my_submission_id; })) {
+                LOGF_ERROR("[DecodeToSurface] Timeout waiting for FIFO turn (submission_id: %llu, current: %llu)", my_submission_id, m_returnCounter.load());
+                return false;
+            }
        }

        LOGF_DEBUG("[DecodeToSurface] My turn! submission_id=%llu", my_submission_id);

-        // 7. Wait for decode to complete with adaptive timeout based on resolution
+        // 7. Wait for decode to complete by waiting on the specific slot's condition variable
        {
            std::unique_lock<std::mutex> lock(my_slot.slot_mutex);
-
-            // Adaptive timeout: base 500ms for 720p, scale by pixel count
-            // 720p (1280x720 = 921,600 pixels) -> 500ms
-            // 1080p (1920x1080 = 2,073,600 pixels) -> 1,125ms
-            // 4K (3840x2160 = 8,294,400 pixels) -> 4,500ms
-            const uint64_t base_pixels = 1280 * 720;  // 720p reference
-            const uint64_t base_timeout_ms = 500;
-            const uint64_t current_pixels = static_cast<uint64_t>(m_width) * m_height;
-            const uint64_t timeout_ms = std::max<uint64_t>(base_timeout_ms,
-                (current_pixels * base_timeout_ms) / base_pixels);
-
-            LOGF_DEBUG("[DecodeToSurface] Adaptive timeout: %llums for %dx%d (%llu pixels)",
-                      timeout_ms, m_width, m_height, current_pixels);
-
-            if (!my_slot.frame_ready.wait_for(lock, std::chrono::milliseconds(timeout_ms),
-                [&my_slot]() { return my_slot.is_ready.load(); })) {
-                // Timeout - decode took too long
-                LOGF_ERROR("[DecodeToSurface] Decode timeout for slot %d after %llums", my_slot_idx, timeout_ms);
+            if (!my_slot.frame_ready.wait_for(lock, std::chrono::seconds(2), [&my_slot] { return my_slot.is_ready.load(); })) {
+                // Safety net timeout in case the polling thread gets stuck
+                LOGF_ERROR("[DecodeToSurface] Decode timeout for slot %d after 2 seconds", my_slot_idx);
                my_slot.in_use.store(false);
                m_returnCounter.fetch_add(1);  // Skip to avoid deadlock
+                m_fifoWaitCV.notify_all(); // Notify others that we are skipping
                return false;
            }
        }

+        // After waiting, check if the polling thread reported a decoding failure
+        if (my_slot.decoding_failed.load()) {
+            LOGF_ERROR("[DecodeToSurface] Decoding failed for slot %d (submission_id: %llu)", my_slot_idx, my_submission_id);
+            my_slot.in_use.store(false);
+            m_returnCounter.fetch_add(1);
+            m_fifoWaitCV.notify_all(); // Notify others that we are skipping
+            return false;
+        }
+
        LOGF_DEBUG("[DecodeToSurface] Decode complete for slot %d", my_slot_idx);

        // ===== Component 5: Frame Retrieval & Cleanup =====
@@ -1331,6 +1349,7 @@ bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data, size_t packet_
            output_frame.width = m_width;
            output_frame.height = m_height;
            output_frame.color_space = ColorSpace::YUV420P;
+            output_frame.matrix_coefficients = m_matrixCoefficients;
            output_frame.frame_index = m_framesDecoded;
            output_frame.timestamp_seconds = static_cast<double>(m_framesDecoded) / 30.0;

@@ -1348,22 +1367,7 @@ bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data, size_t packet_
                return false;
            }

-            // Create D3D12 surface handler on-demand
-            if (!m_d3d12Handler ||
-                m_cachedD3D12Device != m_d3d12Device ||
-                m_cachedCuContext != m_cuContext) {
-
-                m_d3d12Handler.reset();
-                m_d3d12Handler = std::make_unique<D3D12SurfaceHandler>(
-                    static_cast<ID3D12Device*>(m_d3d12Device),
-                    m_cuContext
-                );
-
-                m_cachedD3D12Device = m_d3d12Device;
-                m_cachedCuContext = m_cuContext;
-
-                LOGF_DEBUG("[DecodeToSurface] D3D12SurfaceHandler (re)created");
-            }
+            // D3D12SurfaceHandler is now created in SetD3DDevice.

            // Map decoded NVDEC frame
            CUVIDPROCPARAMS procParams = {};
@@ -1398,17 +1402,15 @@ bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data, size_t packet_
                // RGBA path: NV12 -> RGBA conversion
                LOGF_DEBUG("[DecodeToSurface] RGBA format detected, using NV12ToRGBAConverter");

-                // Create RGBA converter on-demand
-                if (!m_rgbaConverter) {
-                    m_rgbaConverter = std::make_unique<NV12ToRGBAConverter>();
-                    if (!m_rgbaConverter->Initialize(m_width, m_height, m_stream)) {
-                        LOGF_ERROR("[DecodeToSurface] Failed to initialize NV12ToRGBAConverter");
-                        cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);
-                        my_slot.in_use.store(false);
-                        m_returnCounter.fetch_add(1);
-                        return false;
-                    }
-                    LOGF_DEBUG("[DecodeToSurface] NV12ToRGBAConverter initialized");
+                // m_rgbaConverter is now created in SetD3D_Device.
+                // The Initialize method should be idempotent or check internally if it's already initialized.
+                if (!m_rgbaConverter->Initialize(m_width, m_height, m_stream)) {
+                    LOGF_ERROR("[DecodeToSurface] Failed to initialize NV12ToRGBAConverter");
+                    cuvidUnmapVideoFrame(m_decoder, srcDevicePtr);
+                    my_slot.in_use.store(false);
+                    m_returnCounter.fetch_add(1);
+                    m_fifoWaitCV.notify_all(); // Notify others that we are skipping
+                    return false;
                }

                // Convert NV12 to RGBA
@@ -1426,7 +1428,8 @@ bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data, size_t packet_
                copySuccess = m_d3d12Handler->CopyRGBAFrame(
                    rgbaPtr,
                    d3d12Resource,
-                    m_width, m_height
+                    m_width, m_height,
+                    m_stream
                );

                output_frame.color_space = ColorSpace::RGB32;
@@ -1462,6 +1465,7 @@ bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data, size_t packet_
            // Fill output frame metadata (color_space already set above)
            output_frame.width = m_width;
            output_frame.height = m_height;
+            output_frame.matrix_coefficients = m_matrixCoefficients;
            output_frame.frame_index = m_framesDecoded;
            output_frame.timestamp_seconds = static_cast<double>(m_framesDecoded) / 30.0;
        }
@@ -1476,8 +1480,9 @@ bool NVDECAV1Decoder::DecodeToSurface(const uint8_t* packet_data, size_t packet_

        // Note: pending submission already released immediately after cuvidParseVideoData (line 1237)

-        // 10. Advance return counter (FIFO order)
+        // 10. Advance return counter and notify waiting threads
        m_returnCounter.fetch_add(1);
+        m_fifoWaitCV.notify_all();

        // Update statistics
        auto end_time = std::chrono::high_resolution_clock::now();
@@ -1547,18 +1552,21 @@ void NVDECAV1Decoder::PollingThreadFunc() {
            CUVIDGETDECODESTATUS decodeStatus = {};
            CUresult result = cuvidGetDecodeStatus(m_decoder, slot.picture_index, &decodeStatus);

-            // 4. If decode complete, mark slot ready
-            if (result == CUDA_SUCCESS &&
-                decodeStatus.decodeStatus == cuvidDecodeStatus_Success) {
-
-                {
+            // 4. If decode is finished (successfully or with an error), mark slot ready
+            if (result == CUDA_SUCCESS && decodeStatus.decodeStatus != cuvidDecodeStatus_InProgress) {
+                if (decodeStatus.decodeStatus == cuvidDecodeStatus_Success) {
                    std::lock_guard<std::mutex> lock(slot.slot_mutex);
+                    slot.decoding_failed.store(false);
                    slot.is_ready.store(true);
+                    LOGF_DEBUG("[PollingThread] Slot %d ready (submission_id=%llu)",
+                               slot_idx, slot.submission_id);
+                } else { // cuvidDecodeStatus_Error
+                    LOGF_ERROR("[PollingThread] Decode error for slot %d", slot_idx);
+                    std::lock_guard<std::mutex> lock(slot.slot_mutex);
+                    slot.decoding_failed.store(true);
+                    slot.is_ready.store(true); // Mark as ready to unblock the waiting thread
                }
                slot.frame_ready.notify_one();
-
-                LOGF_DEBUG("[PollingThread] Slot %d ready (submission_id=%llu)",
-                           slot_idx, slot.submission_id);
            }
        }

--- a/vav2/platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.h
+++ b/vav2/platforms/windows/vavcore/src/Decoder/NVDECAV1Decoder.h
@@ -121,6 +121,7 @@ private:

    // Decoder configuration
    CUVIDPARSERPARAMS m_parserParams = {};
+    VavMatrixCoefficients m_matrixCoefficients = VavMatrixCoefficients::UNSPECIFIED;

    // Codec private data (AV1 sequence header from WebM)
    const uint8_t* m_codecPrivateData = nullptr;
@@ -156,6 +157,7 @@ private:
    struct DecodeSlot {
        // Slot state
        std::atomic<bool> in_use{false};          // Is this NVDEC slot currently decoding?
+        std::atomic<bool> decoding_failed{false}; // Did decoding complete with an error?

        // Submitted information (set by DecodeToSurface)
        void* target_surface = nullptr;           // Destination D3D12 resource
@@ -171,6 +173,11 @@ private:
        std::atomic<bool> is_ready{false};        // Decode completed flag
    };

+    // FIFO ordering synchronization
+    // Each submission waits for its turn using this condition variable
+    std::condition_variable m_fifoWaitCV;
+    std::mutex m_fifoWaitMutex;
+
    DecodeSlot m_ringBuffer[RING_BUFFER_SIZE];

    // Simplified: Only submission ID for FIFO ordering