ened 3ab4ab14c6 Organize project documents

2025-09-28 17:10:41 +09:00

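D3D Surface Direct Decoding Architecture Design

Overview

This document proposes an architecture for decoding directly into D3D surfaces in VavCore, eliminating CPU memory copies and enabling high-performance GPU-to-GPU rendering. The design supports all major hardware acceleration SDKs while remaining compatible with CPU-only decoding.

Current State Analysis

VavCoreVideoFrame (CPU only)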

typedef struct {
    uint8_t* y_plane;      // Y plane data (CPU memory)
    uint8_t* u_plane;      // U plane data (CPU memory)
    uint8_t* v_plane;      // V plane data (CPU memory)

    int y_stride;          // Y plane stride
    int u_stride;          // U plane stride
    int v_stride;          // V plane stride

    int width;             // Frame width
    int height;            // Frame height

    uint64_t timestamp_us; // Timestamp (microseconds)
    uint64_t frame_number; // Frame sequence number
} VavCoreVideoFrame;

Limitations:

Exposes CPU memory pointers only
Rendering requires a GPU → CPU → GPU memory round trip
Becomes a performance bottleneck for high-resolution content

SDK D3D Surface Support Analysis

1. AMD AMF (Advanced Media Framework)

D3D surface support: ✅ Full D3D11/D3D12 texture support

Key components:

AMFSurface - general-purpose surface abstraction
AMFContext::CreateSurfaceFromDX11Native() - wraps a D3D11 texture
AMFContext2::CreateSurfaceFromDX12Native() - wraps a D3D12 resource

Usage pattern:

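// Wrap an existing D3D11 texture as an AMF surface
ID3D11Texture2D* d3d11Texture; // assumed to point at an application-owned texture
AMFSurfacePtr amfSurface;
amfContext->CreateSurfaceFromDX11Native(d3d11Texture, &amfSurface, nullptr);

// Decode: the decoder consumes compressed input and returns the decoded
// picture as a GPU-resident AMF surface
amfDecoder->SubmitInput(inputBuffer); // inputBuffer: AMFBufferPtr holding one packet
AMFDataPtr outputSurface;
amfDecoder->QueryOutput(&outputSurface);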

2. Intel VPL (Video Processing Library)

D3D surface support: ✅ D3D11/D3D12 support through mfxFrameSurface1

Key components:

mfxFrameSurface1 - surface descriptor that carries the D3D handle
mfxHandleType - selects the D3D11/D3D12 handle type
External allocator integration

Usage pattern:

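// Describe a D3D11-backed surface for the decoder
mfxFrameSurface1 surface = {};
surface.Info = videoParams.mfx.FrameInfo;
surface.Data.MemId = d3d11Texture; // allocator-defined handle; here the D3D11 texture itself

// Decode into the D3D surface. bitstream is an mfxBitstream wrapping the
// compressed packet; passing nullptr instead would only drain buffered frames.
mfxFrameSurface1* outputSurface = nullptr;
mfxSyncPoint sync = nullptr;
MFXVideoDECODE_DecodeFrameAsync(session, &bitstream, &surface, &outputSurface, &sync);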

3. NVIDIA NVDEC

D3D surface support: ✅ CUDA device pointers, shared with D3D through interop

Key components:

cuvidMapVideoFrame() - maps a decoded frame to a CUDA device pointer
CUdeviceptr - CUDA device memory pointer
D3D-CUDA interoperability API

Usage pattern:

// Map the decoded frame to CUDA device memory
CUdeviceptr devicePtr = 0;
unsigned int pitch = 0;
cuvidMapVideoFrame(decoder, picIdx, &devicePtr, &pitch, &params);
// ... consume or copy the frame, then release it with
// cuvidUnmapVideoFrame(decoder, devicePtr)

// Register the D3D texture with CUDA
CUgraphicsResource cudaResource;
cuGraphicsD3D11RegisterResource(&cudaResource, d3d11Texture, CU_GRAPHICS_REGISTER_FLAGS_NONE);

4. dav1d (software decoder)

D3D surface support: ❌ CPU-only decoder

Characteristics:

Pure software implementation
Exposes CPU memory pointers only
No GPU surface integration
Rendering requires a CPU → GPU upload

Proposed Architecture

1. Extended Surface Types

New VavCoreSurfaceType enumeration:

typedef enum {
    VAVCORE_SURFACE_CPU = 0,           // Existing CPU memory
    VAVCORE_SURFACE_D3D11_TEXTURE = 1, // D3D11 texture
    VAVCORE_SURFACE_D3D12_RESOURCE = 2,// D3D12 resource
    VAVCORE_SURFACE_CUDA_DEVICE = 3,   // CUDA device pointer
    VAVCORE_SURFACE_AMF_SURFACE = 4    // AMF surface wrapper
} VavCoreSurfaceType;

Extended VavCoreVideoFrame:

typedef struct {
    // Existing CPU fields (kept for compatibility)
    uint8_t* y_plane;
    uint8_t* u_plane;
    uint8_t* v_plane;
    int y_stride;
    int u_stride;
    int v_stride;

    // Frame metadata
    int width;
    int height;
    uint64_t timestamp_us;
    uint64_t frame_number;

    // New D3D surface fields; surface_type selects the active union member
    VavCoreSurfaceType surface_type;
    union {
        struct {
            // CPU memory (existing)
            uint8_t* planes[3];
            int strides[3];
        } cpu;

        struct {
            // D3D11 texture
            void* d3d11_texture;       // ID3D11Texture2D*
            void* d3d11_device;        // ID3D11Device*
            uint32_t subresource_index;
        } d3d11;

        struct {
            // D3D12 resource
            void* d3d12_resource;      // ID3D12Resource*
            void* d3d12_device;        // ID3D12Device*
            uint32_t subresource_index;
        } d3d12;

        struct {
            // CUDA device pointer
            uint64_t device_ptr;       // CUdeviceptr
            uint32_t pitch;
            void* cuda_context;        // CUcontext
        } cuda;

        struct {
            // AMF surface
            void* amf_surface;         // AMFSurface*
            void* amf_context;         // AMFContext*
        } amf;
    } surface_data;
} VavCoreVideoFrame;
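
As a minimal sketch of how a consumer might read the tagged union safely, the snippet below mirrors a trimmed version of the struct (only the CPU and D3D11 members). The names `SurfaceType`, `VideoFrame`, `IsGpuResident` and `GetD3D11Texture` are hypothetical and not part of VavCore. The point it illustrates is that `surface_type` has to be checked before any union member is read.

```cpp
#include <cassert>
#include <cstdint>

// Trimmed mirror of the proposed types; only two union members are kept.
enum SurfaceType : int {
    SURFACE_CPU = 0,
    SURFACE_D3D11_TEXTURE = 1,
};

struct VideoFrame {
    int width = 0;
    int height = 0;
    SurfaceType surface_type = SURFACE_CPU;
    union {
        struct {
            uint8_t* planes[3];
            int strides[3];
        } cpu;
        struct {
            void* d3d11_texture;        // ID3D11Texture2D* in the real struct
            void* d3d11_device;         // ID3D11Device* in the real struct
            uint32_t subresource_index;
        } d3d11;
    } surface_data = {};
};

// Hypothetical helper: true when the pixel data lives in GPU memory.
inline bool IsGpuResident(const VideoFrame& frame) {
    return frame.surface_type != SURFACE_CPU;
}

// Hypothetical helper: returns the texture handle only when the tag matches,
// so callers never read an inactive union member.
inline void* GetD3D11Texture(const VideoFrame& frame) {
    if (frame.surface_type != SURFACE_D3D11_TEXTURE) {
        return nullptr;
    }
    return frame.surface_data.d3d11.d3d11_texture;
}
```

A renderer would call `GetD3D11Texture(frame)` and take the CPU upload path whenever it returns `nullptr`.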

2. Decoder Interface Extension

Extended decoder interface:

class IVideoDecoder {
public:
    // Virtual destructor so decoders can be destroyed through the interface
    virtual ~IVideoDecoder() = default;

    // Existing method
    virtual bool DecodeFrame(const uint8_t* packet_data, size_t packet_size,
                           VavCoreVideoFrame& frame) = 0;

    // New D3D surface methods
    virtual bool SupportsSurfaceType(VavCoreSurfaceType type) = 0;
    virtual bool DecodeToSurface(const uint8_t* packet_data, size_t packet_size,
                               VavCoreSurfaceType target_type,
                               void* target_surface,
                               VavCoreVideoFrame& frame) = 0;
    virtual bool SetD3DDevice(void* d3d_device, VavCoreSurfaceType type) = 0;
};
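
The following sketch shows one way a caller could use the extended interface: try the surface path first and fall back to the existing `DecodeFrame` path when it is unsupported or fails. It uses a reduced copy of the interface (without `SetD3DDevice`) and two hypothetical stand-in decoders, `CpuOnlyDecoder` and `FakeGpuDecoder`, that perform no real decoding; `DecodePreferGpu` is likewise an assumed helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum SurfaceType { SURFACE_CPU = 0, SURFACE_D3D11_TEXTURE = 1 };

struct VideoFrame {
    SurfaceType surface_type = SURFACE_CPU;
};

// Reduced version of the interface above (SetD3DDevice left out).
class IVideoDecoder {
public:
    virtual ~IVideoDecoder() = default;
    virtual bool DecodeFrame(const uint8_t* data, size_t size, VideoFrame& frame) = 0;
    virtual bool SupportsSurfaceType(SurfaceType type) = 0;
    virtual bool DecodeToSurface(const uint8_t* data, size_t size,
                                 SurfaceType target_type, void* target_surface,
                                 VideoFrame& frame) = 0;
};

// Hypothetical stand-in for a software decoder: CPU output only.
class CpuOnlyDecoder : public IVideoDecoder {
public:
    bool DecodeFrame(const uint8_t*, size_t, VideoFrame& frame) override {
        frame.surface_type = SURFACE_CPU;
        return true;
    }
    bool SupportsSurfaceType(SurfaceType type) override { return type == SURFACE_CPU; }
    bool DecodeToSurface(const uint8_t*, size_t, SurfaceType, void*, VideoFrame&) override {
        return false;  // no GPU surface path
    }
};

// Hypothetical stand-in for a hardware decoder that can target D3D11 textures.
class FakeGpuDecoder : public CpuOnlyDecoder {
public:
    bool SupportsSurfaceType(SurfaceType type) override {
        return type == SURFACE_CPU || type == SURFACE_D3D11_TEXTURE;
    }
    bool DecodeToSurface(const uint8_t*, size_t, SurfaceType target_type,
                         void* target_surface, VideoFrame& frame) override {
        if (target_type != SURFACE_D3D11_TEXTURE || target_surface == nullptr) {
            return false;
        }
        frame.surface_type = SURFACE_D3D11_TEXTURE;
        return true;
    }
};

// Try the GPU path first; fall back to the CPU path when it is unsupported or fails.
inline bool DecodePreferGpu(IVideoDecoder& decoder, const uint8_t* data, size_t size,
                            void* d3d11_texture, VideoFrame& frame) {
    if (d3d11_texture != nullptr &&
        decoder.SupportsSurfaceType(SURFACE_D3D11_TEXTURE) &&
        decoder.DecodeToSurface(data, size, SURFACE_D3D11_TEXTURE, d3d11_texture, frame)) {
        return true;
    }
    return decoder.DecodeFrame(data, size, frame);
}
```

Keeping the fallback in one helper means existing callers of `DecodeFrame` are unaffected, which matches the compatibility goal of the design.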

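3. Hardware-Specific Implementations

AMD AMF decoder implementation

class AMFDecoder : public IVideoDecoder {
private:
    AMFContextPtr m_amfContext;
    AMFComponentPtr m_amfDecoder;
    ID3D11Device* m_d3d11Device = nullptr;

public:
    // DecodeFrame() and SetD3DDevice() omitted for brevity

    bool SupportsSurfaceType(VavCoreSurfaceType type) override {
        return (type == VAVCORE_SURFACE_D3D11_TEXTURE ||
                type == VAVCORE_SURFACE_D3D12_RESOURCE ||
                type == VAVCORE_SURFACE_AMF_SURFACE);
    }

    bool DecodeToSurface(const uint8_t* packet_data, size_t packet_size,
                        VavCoreSurfaceType target_type,
                        void* target_surface,
                        VavCoreVideoFrame& frame) override {
        // The AMF decoder allocates its own output surfaces, so target_surface
        // is not used as a decode target in this sketch.
        (void)target_surface;
        if (!SupportsSurfaceType(target_type)) {
            return false;
        }

        // Copy the compressed packet into an AMF buffer (decoder input)
        AMFBufferPtr inputBuffer;
        if (m_amfContext->AllocBuffer(AMF_MEMORY_HOST, packet_size, &inputBuffer) != AMF_OK) {
            return false;
        }
        std::memcpy(inputBuffer->GetNative(), packet_data, packet_size); // needs <cstring>

        // Submit the packet and fetch the decoded, GPU-resident surface
        if (m_amfDecoder->SubmitInput(inputBuffer) != AMF_OK) {
            return false;
        }
        AMFDataPtr outputData;
        if (m_amfDecoder->QueryOutput(&outputData) != AMF_OK || !outputData) {
            return false; // no frame ready yet
        }

        // Fill in frame metadata. Detach() hands the reference to the frame,
        // so the consumer must Release() the surface when it is done.
        frame.surface_type = VAVCORE_SURFACE_AMF_SURFACE;
        frame.surface_data.amf.amf_surface = outputData.Detach();
        frame.surface_data.amf.amf_context = m_amfContext.GetPtr();

        return true;
    }
};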

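Intel VPL decoder implementation

class VPLDecoder : public IVideoDecoder {
private:
    mfxSession m_session = nullptr;
    mfxVideoParam m_videoParams = {};
    ID3D11Device* m_d3d11Device = nullptr;

public:
    // DecodeFrame(), SupportsSurfaceType() and SetD3DDevice() omitted for brevity

    bool DecodeToSurface(const uint8_t* packet_data, size_t packet_size,
                        VavCoreSurfaceType target_type,
                        void* target_surface,
                        VavCoreVideoFrame& frame) override {
        if (target_type != VAVCORE_SURFACE_D3D11_TEXTURE) {
            return false; // only D3D11 targets are handled here
        }

        // Wrap the compressed packet for the decoder
        mfxBitstream bitstream = {};
        bitstream.Data = const_cast<mfxU8*>(packet_data);
        bitstream.DataLength = static_cast<mfxU32>(packet_size);
        bitstream.MaxLength = static_cast<mfxU32>(packet_size);

        // NOTE: a real implementation needs a surface pool that outlives this
        // call, because the decoder may keep referencing the work surface.
        mfxFrameSurface1 surface = {};
        surface.Info = m_videoParams.mfx.FrameInfo;
        surface.Data.MemId = target_surface; // D3D11 texture handle (allocator-defined)

        mfxSyncPoint sync = nullptr;
        mfxFrameSurface1* outputSurface = nullptr;

        mfxStatus status = MFXVideoDECODE_DecodeFrameAsync(m_session, &bitstream, &surface,
                                                           &outputSurface, &sync);
        if (status != MFX_ERR_NONE || outputSurface == nullptr) {
            return false; // e.g. MFX_ERR_MORE_DATA: no frame produced yet
        }
        if (MFXVideoCORE_SyncOperation(m_session, sync, MFX_INFINITE) != MFX_ERR_NONE) {
            return false;
        }

        // Fill in frame metadata
        frame.surface_type = VAVCORE_SURFACE_D3D11_TEXTURE;
        frame.surface_data.d3d11.d3d11_texture = outputSurface->Data.MemId;
        frame.surface_data.d3d11.d3d11_device = m_d3d11Device;

        return true;
    }
};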

NVIDIA NVDEC implementation

class NVDECDecoder : public IVideoDecoder {
private:
    CUvideodecoder m_decoder = nullptr;
    CUcontext m_cudaContext = nullptr;

public:
    // DecodeFrame(), SupportsSurfaceType() and SetD3DDevice() omitted for brevity

    bool DecodeToSurface(const uint8_t* packet_data, size_t packet_size,
                        VavCoreSurfaceType target_type,
                        void* target_surface,
                        VavCoreVideoFrame& frame) override {
        // Decode the frame
        CUVIDPICPARAMS picParams = {};
        // ... fill picParams from packet_data

        if (cuvidDecodePicture(m_decoder, &picParams) != CUDA_SUCCESS) {
            return false;
        }

        // Map the decoded frame to a CUDA device pointer
        CUdeviceptr devicePtr = 0;
        unsigned int pitch = 0;
        CUVIDPROCPARAMS procParams = {};

        if (cuvidMapVideoFrame(m_decoder, picParams.CurrPicIdx,
                               &devicePtr, &pitch, &procParams) != CUDA_SUCCESS) {
            return false;
        }

        // Fill in frame metadata. The mapping stays valid until the consumer
        // calls cuvidUnmapVideoFrame(m_decoder, devicePtr).
        frame.surface_type = VAVCORE_SURFACE_CUDA_DEVICE;
        frame.surface_data.cuda.device_ptr = devicePtr;
        frame.surface_data.cuda.pitch = pitch;
        frame.surface_data.cuda.cuda_context = m_cudaContext;

        return true;
    }
};

4. Renderer Integration

D3D surface-aware renderer:

class D3DSurfaceRenderer {
public:
    bool RenderFrame(const VavCoreVideoFrame& frame) {
        switch (frame.surface_type) {
            case VAVCORE_SURFACE_D3D11_TEXTURE:
                return RenderD3D11Texture(frame.surface_data.d3d11);

            case VAVCORE_SURFACE_D3D12_RESOURCE:
                return RenderD3D12Resource(frame.surface_data.d3d12);

            case VAVCORE_SURFACE_CUDA_DEVICE:
                return RenderCudaDevicePtr(frame.surface_data.cuda);

            case VAVCORE_SURFACE_AMF_SURFACE:
                return RenderAMFSurface(frame.surface_data.amf);

            case VAVCORE_SURFACE_CPU:
            default:
                return RenderCPUFrame(frame);
        }
    }

private:
    // RenderD3D12Resource, RenderCudaDevicePtr, RenderAMFSurface and
    // RenderCPUFrame follow the same pattern and are omitted here.
    // The auto parameter requires C++20.
    bool RenderD3D11Texture(const auto& d3d11_data) {
        auto texture = static_cast<ID3D11Texture2D*>(d3d11_data.d3d11_texture);
        if (texture == nullptr) {
            return false;
        }
        // Copy the texture straight to the back buffer, or sample it in a shader.
        // No CPU memory copy is needed.
        return true;
    }
};

5. Fallback Strategy

Automatic surface type selection:

class AdaptiveDecoder {
public:
    VavCoreSurfaceType SelectOptimalSurfaceType(VavCoreDecoderType decoder_type) {
        switch (decoder_type) {
            case VAVCORE_DECODER_AMF:
                if (m_d3d11Device) return VAVCORE_SURFACE_D3D11_TEXTURE;
                if (m_d3d12Device) return VAVCORE_SURFACE_D3D12_RESOURCE;
                break;

            case VAVCORE_DECODER_VPL:
                if (m_d3d11Device) return VAVCORE_SURFACE_D3D11_TEXTURE;
                break;

            case VAVCORE_DECODER_NVDEC:
                return VAVCORE_SURFACE_CUDA_DEVICE;

            case VAVCORE_DECODER_DAV1D:
            case VAVCORE_DECODER_MEDIA_FOUNDATION:
            default:
                return VAVCORE_SURFACE_CPU;
        }

        return VAVCORE_SURFACE_CPU; // Fallback
    }

private:
    void* m_d3d11Device = nullptr; // ID3D11Device*, set when D3D11 is available
    void* m_d3d12Device = nullptr; // ID3D12Device*, set when D3D12 is available
};
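
The same selection logic can also be expressed as a preference list that is probed in order, which may be easier to extend than a switch as more surface types are added. This is only a sketch under assumed names: `DeviceAvailability`, `IsAvailable` and `PickSurfaceType` are hypothetical and do not exist in VavCore.

```cpp
#include <cassert>
#include <initializer_list>

enum SurfaceType {
    SURFACE_CPU = 0,
    SURFACE_D3D11_TEXTURE = 1,
    SURFACE_D3D12_RESOURCE = 2,
    SURFACE_CUDA_DEVICE = 3,
};

// Hypothetical snapshot of what the host application has initialized.
struct DeviceAvailability {
    bool d3d11 = false;
    bool d3d12 = false;
    bool cuda = false;
};

// CPU memory is always usable; every GPU type needs its device or context.
inline bool IsAvailable(SurfaceType type, const DeviceAvailability& devices) {
    switch (type) {
        case SURFACE_CPU:            return true;
        case SURFACE_D3D11_TEXTURE:  return devices.d3d11;
        case SURFACE_D3D12_RESOURCE: return devices.d3d12;
        case SURFACE_CUDA_DEVICE:    return devices.cuda;
    }
    return false;
}

// Walks a decoder's preference list in order and returns the first usable
// type. CPU memory is the final fallback when nothing else matches.
inline SurfaceType PickSurfaceType(std::initializer_list<SurfaceType> preferred,
                                   const DeviceAvailability& devices) {
    for (SurfaceType type : preferred) {
        if (IsAvailable(type, devices)) {
            return type;
        }
    }
    return SURFACE_CPU;
}
```

For an AMF-style decoder the call would be `PickSurfaceType({SURFACE_D3D11_TEXTURE, SURFACE_D3D12_RESOURCE}, devices)`, which reproduces the D3D11-then-D3D12-then-CPU order of the switch above.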

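Performance Impact

Expected Performance Improvement

4K AV1 decode + render pipeline (estimates):

Component	Current (CPU)	With D3D surface	Improvement
Decode	15-25 ms	10-20 ms	1.2-1.5x
GPU upload	5-10 ms	0 ms	eliminated
Render	1-3 ms	0.5-1 ms	2-3x
Total	21-38 ms	10.5-21 ms	about 1.8-2x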
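
The totals in the table can be checked with a few lines of arithmetic. The sketch below sums the per-stage estimates and compares them with the per-frame time budget; `StageMs`, `Add` and `FrameBudgetMs` are illustrative names, and the inputs are the table's estimates, not measurements. Comparing best case with best case and worst case with worst case gives a speedup of about 1.8x to 2x.

```cpp
#include <cassert>

// A best-case/worst-case pair of stage times in milliseconds.
struct StageMs {
    double best;
    double worst;
};

constexpr StageMs Add(StageMs a, StageMs b) {
    return {a.best + b.best, a.worst + b.worst};
}

// Time available per frame at a given frame rate.
constexpr double FrameBudgetMs(double fps) {
    return 1000.0 / fps;
}

// Estimates copied from the table (decode, GPU upload, render).
constexpr StageMs kCurrentTotal = Add(Add({15.0, 25.0}, {5.0, 10.0}), {1.0, 3.0});
constexpr StageMs kSurfaceTotal = Add(Add({10.0, 20.0}, {0.0, 0.0}), {0.5, 1.0});

// Like-for-like speedups: best case against best case, worst against worst.
constexpr double kBestCaseSpeedup = kCurrentTotal.best / kSurfaceTotal.best;     // 21 / 10.5
constexpr double kWorstCaseSpeedup = kCurrentTotal.worst / kSurfaceTotal.worst;  // 38 / 21
```

With these estimates, only the lower end of the D3D surface range (10.5 ms) fits the roughly 16.7 ms budget of 60 fps, while the current CPU path (21 ms or more) does not fit at all.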

Memory bandwidth savings:

4K YUV420 (8-bit): ~12 MB per frame
At 60 fps: ~720 MB/s of memory bandwidth saved
Less memory pressure and cache pollution
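
The per-frame figure follows from the YUV 4:2:0 layout: one full-resolution luma plane plus two chroma planes at half resolution in each dimension. A small sketch of the calculation, assuming 8-bit samples and a 3840x2160 frame (the function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one planar YUV 4:2:0 frame: a full-size Y plane plus U and V
// planes subsampled by two horizontally and vertically.
constexpr std::uint64_t Yuv420FrameBytes(std::uint64_t width, std::uint64_t height,
                                         std::uint64_t bytes_per_sample) {
    return (width * height + 2 * ((width + 1) / 2) * ((height + 1) / 2)) * bytes_per_sample;
}

// Bytes moved per second when every frame is copied once.
constexpr std::uint64_t CopyBytesPerSecond(std::uint64_t frame_bytes, std::uint64_t fps) {
    return frame_bytes * fps;
}

constexpr std::uint64_t kFrameBytes4K = Yuv420FrameBytes(3840, 2160, 1);       // 12,441,600
constexpr std::uint64_t kBytesPerSecond = CopyBytesPerSecond(kFrameBytes4K, 60); // 746,496,000
```

The exact values are about 12.4 MB per frame and about 746 MB/s at 60 fps; the rounder ~12 MB and ~720 MB/s figures come from rounding the frame size down first. A 10-bit stream stored in 16-bit samples would double both numbers.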

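Use Cases and Benefits

High-resolution content (4K+)
- Removes the GPU → CPU → GPU bottleneck
- Makes real-time 4K60 decode + render feasible
Multi-stream scenarios
- Handles multiple video streams without CPU memory copies
- Shares GPU memory efficiently
Real-time applications
- Lower latency for live streaming
- Lower CPU usage, which keeps the system responsive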

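Implementation Phases

Phase 1: Core infrastructure

Extend VavCoreVideoFrame with the surface union
Update the IVideoDecoder interface with the surface methods
Implement surface type capability detection

Phase 2: Hardware decoder integration

Implement AMD AMF surface decoding
Implement Intel VPL surface decoding
Integrate NVIDIA NVDEC with CUDA

Phase 3: Renderer updates

D3D11/D3D12 surface rendering
CUDA-D3D interoperability
AMF surface rendering

Phase 4: Optimization and testing

Performance benchmarking
Refinement of the fallback mechanism
Multi-GPU support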

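Compatibility Considerations

Backward compatibility

Existing CPU-based code keeps working without changes
VavCoreVideoFrame retains the existing CPU fields
Automatic fallback to CPU decoding when D3D is unavailable

Platform support

Windows 10/11: full D3D11/D3D12 support
Older Windows versions: fall back to CPU decoding
Non-Windows: CPU only (future: Vulkan/OpenGL)

Hardware requirements

AMD: RX 6000 series or newer for AV1 hardware decoding
Intel: Arc series or 11th-generation and newer integrated graphics
NVIDIA: RTX 30 series or newer for AV1 hardware decoding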

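Risk Assessment

Technical risks

Driver compatibility: hardware-specific driver issues
- Mitigation: complete fallback to CPU decoding
Memory management: managing the lifetime of D3D surfaces
- Mitigation: RAII wrappers and reference counting
Synchronization: complexity of GPU-to-GPU synchronization
- Mitigation: explicit synchronization primitives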
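
To make the RAII and reference counting mitigation concrete, here is a minimal sketch of a holder that adds a reference on construction and releases it on destruction, similar in spirit to `Microsoft::WRL::ComPtr`. `IRefCounted`, `SurfaceRef` and `FakeTexture` are hypothetical names; `FakeTexture` is a test double that only counts references, standing in for a COM object such as a D3D texture.

```cpp
#include <cassert>
#include <utility>

// Minimal COM-style interface: D3D objects expose AddRef/Release like this.
struct IRefCounted {
    virtual unsigned long AddRef() = 0;
    virtual unsigned long Release() = 0;
protected:
    ~IRefCounted() = default;
};

// Hypothetical RAII holder: owns exactly one reference while it is alive.
template <typename T>
class SurfaceRef {
public:
    SurfaceRef() = default;
    explicit SurfaceRef(T* ptr) : m_ptr(ptr) {
        if (m_ptr) m_ptr->AddRef();
    }
    SurfaceRef(const SurfaceRef& other) : SurfaceRef(other.m_ptr) {}
    SurfaceRef(SurfaceRef&& other) noexcept
        : m_ptr(std::exchange(other.m_ptr, nullptr)) {}
    // Copy-and-swap handles both copy and move assignment.
    SurfaceRef& operator=(SurfaceRef other) noexcept {
        std::swap(m_ptr, other.m_ptr);
        return *this;
    }
    ~SurfaceRef() {
        if (m_ptr) m_ptr->Release();
    }
    T* get() const { return m_ptr; }

private:
    T* m_ptr = nullptr;
};

// Test double standing in for a texture; counts references instead of
// freeing memory when the count reaches zero.
struct FakeTexture : IRefCounted {
    unsigned long refs = 0;
    unsigned long AddRef() override { return ++refs; }
    unsigned long Release() override { return --refs; }
};
```

Storing such a holder alongside each frame, rather than a raw `void*`, would keep the surface alive for as long as the renderer still references it.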

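Performance risks

Initial implementation: may be slower than the optimized CPU path
- Mitigation: staged rollout with performance gates
Memory overhead: additional surface metadata
- Mitigation: union-based storage keeps the overhead minimal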
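
A performance gate could be as simple as comparing the running average frame time of the GPU surface path with a measured CPU baseline and disabling the GPU path when it is slower. The sketch below is one possible shape for such a gate; the class name `PerformanceGate`, its parameters and the averaging policy are assumptions, not part of the design.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical gate: keeps the GPU surface path enabled only while its
// average frame time stays at or below the CPU baseline.
class PerformanceGate {
public:
    PerformanceGate(double cpu_baseline_ms, std::size_t min_samples)
        : m_cpu_baseline_ms(cpu_baseline_ms), m_min_samples(min_samples) {}

    // Record the measured time of one frame decoded through the GPU path.
    void RecordGpuFrame(double frame_ms) {
        m_total_ms += frame_ms;
        ++m_samples;
    }

    // Until enough samples exist, give the GPU path the benefit of the doubt.
    bool UseGpuPath() const {
        if (m_samples < m_min_samples) {
            return true;
        }
        return (m_total_ms / static_cast<double>(m_samples)) <= m_cpu_baseline_ms;
    }

private:
    double m_cpu_baseline_ms;
    std::size_t m_min_samples;
    double m_total_ms = 0.0;
    std::size_t m_samples = 0;
};
```

A production gate would likely use a sliding window and hysteresis so that a single slow frame does not flip the decision back and forth.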

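Conclusion

Decoding directly into D3D surfaces is a significant performance opportunity for high-resolution AV1 content. The proposed architecture preserves backward compatibility while enabling substantial gains in hardware-accelerated scenarios.

The implementation prioritizes:

Compatibility: existing code keeps working
Performance: unnecessary memory copies are removed
Flexibility: multiple hardware vendors are supported
Maintainability: clear abstraction layers

Implemented properly, this architecture is estimated to deliver roughly a 2x end-to-end improvement for 4K+ content while preserving system stability and compatibility.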
