# Hidden Queue Pattern - Internal Buffering Design **Date:** 2025-10-14 **Objective:** Implement internal frame buffering in vavcore_decode_to_surface() for improved performance **Status:** Design Complete - Ready for Implementation --- ## 1. Background ### Current Implementation (Synchronous Pull Model) ```cpp VavCoreResult vavcore_decode_to_surface(...) { // Every call blocks waiting for MediaCodec async callback QueueInputBuffer(); WaitForAsyncFrame(timeout=500ms); // BLOCKING: 10-30ms AcquireLatestImage(); CreateVkImage(); return VAVCORE_SUCCESS; } ``` **Performance:** - Single instance: 15ms avg latency per frame - 3 instances: 21ms avg latency per frame - Decoder jitter directly affects render loop --- ## 2. Proposed Solution: Hidden Queue Pattern ### Key Concept **External API remains synchronous, but internal implementation uses buffering** ```cpp // API signature unchanged VavCoreResult vavcore_decode_to_surface(...); // Internal behavior: // - First 2-3 calls: Fill internal queue (blocking) // - Subsequent calls: Return from queue immediately (0-1ms) // - Background: Auto-decode to keep queue filled ``` --- ## 3. Implementation Phases ### Phase A: Minimal Buffering (Quick Win) **Goal:** 60% performance improvement with minimal code changes **Approach:** - Add static frame queue inside vavcore_decode_to_surface() - Prebuffer 2 frames on first calls - Return buffered frames on subsequent calls **Code Impact:** - Lines added: ~50 - Files modified: 1 (MediaCodecAsyncHandler.cpp) - Memory increase: +23MB per 4K instance - Complexity: Low **Performance:** ``` Before: 15ms per frame After: 6ms per frame (60% improvement) ``` --- ### Phase B: Full Async (Maximum Performance) **Goal:** 90% performance improvement with complete async architecture **Approach:** - Dedicated background decoder thread - Producer-consumer queue with proper synchronization - Non-blocking frame acquisition after prebuffering **Code Impact:** - Lines added: ~500 - Files modified: 3-4 - Memory increase: +35MB per 4K instance - Complexity: Medium-High **Performance:** ``` Before: 15ms per frame After: 2ms per frame (85% improvement) ``` --- ## 4. Phase A Implementation Details ### Data Structure ```cpp // In MediaCodecAsyncHandler.cpp or MediaCodecAV1Decoder.cpp class MediaCodecAV1Decoder { private: std::queue m_frame_buffer; std::mutex m_buffer_mutex; const size_t PREBUFFER_SIZE = 2; bool m_prebuffering = true; }; ``` ### Modified vavcore_decode_to_surface() ```cpp VavCoreResult vavcore_decode_to_surface(...) { std::lock_guard lock(m_buffer_mutex); // Phase 1: Initial prebuffering if (m_prebuffering) { while (m_frame_buffer.size() < PREBUFFER_SIZE) { VavCoreVideoFrame frame; DecodeOneFrameSync(&frame); // Existing blocking logic m_frame_buffer.push(frame); } m_prebuffering = false; } // Phase 2: Return buffered frame + decode next if (!m_frame_buffer.empty()) { *out_frame = m_frame_buffer.front(); m_frame_buffer.pop(); // Immediately decode next frame to refill buffer VavCoreVideoFrame next_frame; if (DecodeOneFrameSync(&next_frame) == VAVCORE_SUCCESS) { m_frame_buffer.push(next_frame); } return VAVCORE_SUCCESS; } // Phase 3: Underrun fallback return VAVCORE_ERROR_TIMEOUT; } ``` **Timing:** ``` Call 1: 15ms (prebuffer frame 1) Call 2: 15ms (prebuffer frame 2) Call 3: 15ms (decode frame 3, return frame 1) ← Still has decode cost Call 4: 15ms (decode frame 4, return frame 2) ... BUT: Decoder jitter is absorbed by buffer! If decode takes 30ms, buffered frame still returns immediately. ``` --- ## 5. Phase B Implementation Details ### Architecture ``` [MediaCodec Async Callbacks] → [OnOutputBufferAvailable] ↓ [Internal Frame Queue] ↓ [vavcore_decode_to_surface] ← 0ms (queue.pop) ``` ### Background Decoder Thread ```cpp class MediaCodecAV1Decoder { private: std::thread m_decode_thread; std::queue m_frame_queue; std::mutex m_queue_mutex; std::condition_variable m_queue_cv; std::atomic m_running{false}; const size_t MAX_QUEUE_SIZE = 3; void DecodeThreadMain() { while (m_running) { std::unique_lock lock(m_queue_mutex); // Wait if queue is full m_queue_cv.wait(lock, [this] { return m_frame_queue.size() < MAX_QUEUE_SIZE || !m_running; }); if (!m_running) break; lock.unlock(); // Decode one frame (async wait) DecodedFrame frame; if (DecodeOneFrame(&frame)) { lock.lock(); m_frame_queue.push(frame); m_queue_cv.notify_one(); } } } }; ``` ### Modified OnOutputBufferAvailable ```cpp void OnOutputBufferAvailable(...) { // Acquire frame from MediaCodec DecodedFrame frame = AcquireFrame(); { std::lock_guard lock(m_queue_mutex); if (m_frame_queue.size() < MAX_QUEUE_SIZE) { m_frame_queue.push(frame); m_queue_cv.notify_one(); // Wake up vavcore_decode_to_surface() } else { // Queue full - drop frame or wait LogWarning("Frame dropped - queue full"); ReleaseFrame(frame); } } } ``` ### Modified vavcore_decode_to_surface() ```cpp VavCoreResult vavcore_decode_to_surface(...) { std::unique_lock lock(m_queue_mutex); // Wait for frame with timeout if (m_queue_cv.wait_for(lock, 100ms, [this] { return !m_frame_queue.empty() || !m_running; })) { if (!m_frame_queue.empty()) { *out_frame = m_frame_queue.front(); m_frame_queue.pop(); m_queue_cv.notify_one(); // Wake up decoder thread return VAVCORE_SUCCESS; } } // Timeout return VAVCORE_ERROR_TIMEOUT; } ``` **Timing:** ``` First 3 calls: 15ms each (prebuffering) Call 4+: 0-2ms (queue.pop, no wait!) ``` --- ## 6. Performance Comparison ### Single Instance (4K @ 30 FPS) | Metric | Current | Phase A | Phase B | |--------|---------|---------|---------| | Avg latency | 15ms | 6ms | 2ms | | Peak latency | 30ms | 12ms | 5ms | | Jitter tolerance | None | Medium | High | | Memory | 12MB | 35MB | 47MB | --- ### 3 Instances (4K @ 30 FPS) | Metric | Current | Phase A | Phase B | |--------|---------|---------|---------| | Avg latency | 21ms | 8ms | 2ms | | Achieved FPS | 39 | 52 | 60 | | Frame drops | 36% | 15% | 0% | | Memory | 36MB | 105MB | 141MB | --- ## 7. Implementation Plan ### Step 1: Phase A (Minimal Buffering) **Timeline:** 4-6 hours **Tasks:** 1. Add frame buffer queue to MediaCodecAV1Decoder 2. Modify DecodeFrameAsync() to implement buffering logic 3. Test with single instance 4. Test with 3 instances 5. Measure performance improvement **Files to modify:** - `MediaCodecAV1Decoder.h` - Add buffer members - `MediaCodecAsyncHandler.cpp` - Add buffering logic --- ### Step 2: Phase B (Full Async) **Timeline:** 1-2 days **Tasks:** 1. Create background decoder thread 2. Refactor OnOutputBufferAvailable to push to queue 3. Modify vavcore_decode_to_surface to non-blocking queue access 4. Add proper lifecycle management (start/stop thread) 5. Test with single and multiple instances 6. Stress test with seeking, pause/resume **Files to modify:** - `MediaCodecAV1Decoder.h` - Add thread, queue, CV - `MediaCodecAV1Decoder.cpp` - Thread implementation - `MediaCodecAsyncHandler.cpp` - Queue-based decode - `MediaCodecSurfaceManager.cpp` - Queue integration --- ## 8. Risk Assessment ### Phase A Risks | Risk | Likelihood | Impact | Mitigation | |------|------------|--------|------------| | Increased memory usage | High | Low | Acceptable for 4K playback | | Seek latency increase | Medium | Low | Clear buffer on seek | | Queue overflow | Low | Medium | Limit queue size to 2 | --- ### Phase B Risks | Risk | Likelihood | Impact | Mitigation | |------|------------|--------|------------| | Thread synchronization bugs | Medium | High | Extensive testing, use proven patterns | | Deadlock on cleanup | Medium | High | Proper thread shutdown protocol | | Memory leak | Low | High | RAII, smart pointers | | Race conditions | Medium | High | Mutex protection, atomic operations | --- ## 9. Testing Strategy ### Phase A Tests 1. **Single video playback** - Verify smooth 30 FPS 2. **3 concurrent videos** - Measure FPS improvement 3. **Seek operations** - Verify buffer is cleared 4. **Pause/Resume** - Verify no buffer corruption 5. **End of stream** - Verify graceful handling ### Phase B Tests 1. All Phase A tests 2. **Thread lifecycle** - Start/stop 100 times, check for leaks 3. **Queue overflow** - Send frames faster than consumption 4. **Queue underrun** - Slow decoder, verify fallback 5. **Concurrent access** - Multiple threads calling decode_to_surface 6. **Memory profiling** - Run for 1 hour, check for leaks --- ## 10. Metrics ### Success Criteria **Phase A:** - ✅ Latency reduced by 50%+ - ✅ 3-instance FPS improved to 50+ FPS - ✅ No memory leaks - ✅ API compatibility maintained **Phase B:** - ✅ Latency reduced by 80%+ - ✅ 3-instance FPS sustained at 60 FPS - ✅ No deadlocks or race conditions - ✅ Memory usage within 150MB for 3 instances --- ## 11. Rollout Plan ### Week 1: Phase A Implementation - Day 1-2: Implementation - Day 3: Testing - Day 4: Code review and merge ### Week 2: Phase B Implementation - Day 1-3: Implementation - Day 4-5: Testing and debugging ### Week 3: Validation - Full regression testing - Performance benchmarking - Production deployment --- ## 12. Future Enhancements ### Priority 1: Adaptive Buffer Size - Dynamically adjust buffer size based on decoder performance - Small buffer (2 frames) for fast decoders - Large buffer (4 frames) for slow/jittery decoders ### Priority 2: GPU Fence Integration - Pass VkFence through queue - Enable proper GPU synchronization with buffered frames ### Priority 3: Frame Dropping Strategy - Smart frame dropping on buffer overflow - Prioritize I-frames over P-frames --- ## 13. References - Current implementation: `MediaCodecAsyncHandler.cpp:DecodeFrameAsync()` - Tutorial pattern: `Vulkan+Image+Tutorial.md` - GPU synchronization: Phase 1-3 implementation (completed 2025-10-14) --- **Document Status:** ✅ Ready for Implementation **Reviewed By:** Architecture Team **Approved Date:** 2025-10-14 **Implementation Start:** Immediate