Hidden Queue Pattern - Internal Buffering Design
Date: 2025-10-14
Objective: Implement internal frame buffering in vavcore_decode_to_surface() for improved performance
Status: Design Complete - Ready for Implementation
1. Background
Current Implementation (Synchronous Pull Model)
VavCoreResult vavcore_decode_to_surface(...) {
    // Every call blocks waiting for MediaCodec async callback
    QueueInputBuffer();
    WaitForAsyncFrame(timeout=500ms);  // BLOCKING: 10-30ms
    AcquireLatestImage();
    CreateVkImage();
    return VAVCORE_SUCCESS;
}
Performance:
- Single instance: 15ms avg latency per frame
- 3 instances: 21ms avg latency per frame
- Decoder jitter directly affects render loop
2. Proposed Solution: Hidden Queue Pattern
Key Concept
The external API remains synchronous, while the internal implementation buffers decoded frames.
// API signature unchanged
VavCoreResult vavcore_decode_to_surface(...);
// Internal behavior:
// - First 2-3 calls: Fill internal queue (blocking)
// - Subsequent calls: Return from queue immediately (0-1ms)
// - Background: Auto-decode to keep queue filled
3. Implementation Phases
Phase A: Minimal Buffering (Quick Win)
Goal: 60% performance improvement with minimal code changes
Approach:
- Add a per-decoder frame queue (a member of MediaCodecAV1Decoder, so instances do not share frames) used by vavcore_decode_to_surface()
- Prebuffer 2 frames on first calls
- Return buffered frames on subsequent calls
Code Impact:
- Lines added: ~50
- Files modified: 2 (MediaCodecAV1Decoder.h, MediaCodecAsyncHandler.cpp)
- Memory increase: +23MB per 4K instance
- Complexity: Low
Performance:
Before: 15ms per frame
After: 6ms per frame (60% improvement)
Phase B: Full Async (Maximum Performance)
Goal: Close to 90% performance improvement (15ms to 2ms per frame) with complete async architecture
Approach:
- Dedicated background decoder thread
- Producer-consumer queue with proper synchronization
- Non-blocking frame acquisition after prebuffering
Code Impact:
- Lines added: ~500
- Files modified: 3-4
- Memory increase: +35MB per 4K instance
- Complexity: Medium-High
Performance:
Before: 15ms per frame
After: 2ms per frame (about 87% improvement)
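The improvement percentages quoted for both phases follow directly from the before/after latencies. A minimal sketch of that arithmetic (the helper name `ImprovementPercent` is illustrative and not part of VavCore):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: percentage reduction in per-frame latency.
double ImprovementPercent(double before_ms, double after_ms) {
    return (before_ms - after_ms) / before_ms * 100.0;
}
```

For Phase A, 15ms to 6ms is a 60% reduction. For Phase B, 15ms to 2ms is about 86.7%.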
4. Phase A Implementation Details
Data Structure
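// In MediaCodecAV1Decoder.h (used from MediaCodecAsyncHandler.cpp)
#include <mutex>
#include <queue>

class MediaCodecAV1Decoder {
private:
    std::queue<VavCoreVideoFrame> m_frame_buffer;  // decoded frames awaiting return
    std::mutex m_buffer_mutex;                     // guards m_frame_buffer and m_prebuffering
    static constexpr size_t PREBUFFER_SIZE = 2;    // frames decoded before the first return
    bool m_prebuffering = true;                    // true until the initial fill completes
};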
Modified vavcore_decode_to_surface()
VavCoreResult vavcore_decode_to_surface(...) {
    // Held across the blocking decodes below, so concurrent callers are serialized.
    std::lock_guard<std::mutex> lock(m_buffer_mutex);

    // Phase 1: Initial prebuffering
    if (m_prebuffering) {
        while (m_frame_buffer.size() < PREBUFFER_SIZE) {
            VavCoreVideoFrame frame;
            // Existing blocking logic; stop filling if a decode fails
            if (DecodeOneFrameSync(&frame) != VAVCORE_SUCCESS) {
                break;  // never queue a frame that was not decoded
            }
            m_frame_buffer.push(frame);
        }
        m_prebuffering = false;
    }

    // Phase 2: Return buffered frame + decode next
    if (!m_frame_buffer.empty()) {
        *out_frame = m_frame_buffer.front();
        m_frame_buffer.pop();

        // Immediately decode next frame to refill buffer (blocks this call)
        VavCoreVideoFrame next_frame;
        if (DecodeOneFrameSync(&next_frame) == VAVCORE_SUCCESS) {
            m_frame_buffer.push(next_frame);
        }
        return VAVCORE_SUCCESS;
    }

    // Phase 3: Underrun fallback (buffer empty and nothing decoded)
    return VAVCORE_ERROR_TIMEOUT;
}
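Timing (for the code above, assuming a steady 15ms decode):
Call 1: 45ms (prebuffer frames 1-2, return frame 1, decode frame 3)
Call 2: 15ms (return frame 2, decode frame 4) ← Still has decode cost
Call 3: 15ms (return frame 3, decode frame 5)
...
The buffer guarantees that a decoded frame is always on hand, but the refill decode runs inside the same call, so the caller still waits for it.
If a decode takes 30ms, that call takes 30ms. The 6ms average quoted for Phase A therefore depends on the refill not blocking the caller and should be confirmed by measurement.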
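The per-call cost of the Phase A control flow can be checked with a small simulation. This is a sketch rather than VavCore code: `SimulatePhaseACalls` and the decode times are made up for illustration, and the model assumes the first call fills the prebuffer before returning, as in the listing above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates the Phase A control flow and returns how long (ms) each call blocks.
// decode_ms[i] is the hypothetical decode time of frame i.
std::vector<int> SimulatePhaseACalls(const std::vector<int>& decode_ms,
                                     std::size_t prebuffer_size) {
    std::vector<int> call_ms;
    std::size_t next = 0;      // index of the next frame to decode
    std::size_t buffered = 0;  // frames currently in the queue
    bool prebuffering = true;

    while (true) {
        int blocked = 0;
        if (prebuffering) {
            // Initial fill happens inside the first call.
            while (buffered < prebuffer_size && next < decode_ms.size()) {
                blocked += decode_ms[next++];
                ++buffered;
            }
            prebuffering = false;
        }
        if (buffered == 0) break;  // underrun or end of stream
        --buffered;                // front frame is handed to the caller
        if (next < decode_ms.size()) {
            blocked += decode_ms[next++];  // synchronous refill
            ++buffered;
        }
        call_ms.push_back(blocked);
    }
    return call_ms;
}
```

With decode times {15, 15, 15, 30, 15, 15} and a prebuffer of 2, the calls block for 45, 30, 15, 15, 0 and 0 ms. The 30ms decode shows up directly in the second call, and only the last two calls, which drain the buffer at end of stream, return without decoding.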
5. Phase B Implementation Details
Architecture
[MediaCodec Async Callbacks] → [OnOutputBufferAvailable]
↓
[Internal Frame Queue]
↓
[vavcore_decode_to_surface] ← 0ms (queue.pop)
Background Decoder Thread
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

class MediaCodecAV1Decoder {
private:
    std::thread m_decode_thread;
    std::queue<DecodedFrame> m_frame_queue;
    std::mutex m_queue_mutex;
    std::condition_variable m_queue_cv;
    std::atomic<bool> m_running{false};
    static constexpr size_t MAX_QUEUE_SIZE = 3;

    void DecodeThreadMain() {
        while (m_running) {
            std::unique_lock<std::mutex> lock(m_queue_mutex);
            // Wait if queue is full
            m_queue_cv.wait(lock, [this] {
                return m_frame_queue.size() < MAX_QUEUE_SIZE || !m_running;
            });
            if (!m_running) break;
            lock.unlock();  // never hold the mutex across the blocking decode

            // Decode one frame (async wait)
            DecodedFrame frame;
            if (DecodeOneFrame(&frame)) {
                lock.lock();
                m_frame_queue.push(std::move(frame));
                m_queue_cv.notify_one();  // wake a waiting vavcore_decode_to_surface()
            }
        }
    }
};
Modified OnOutputBufferAvailable
void OnOutputBufferAvailable(...) {
    // Acquire frame from MediaCodec
    DecodedFrame frame = AcquireFrame();
    {
        std::lock_guard<std::mutex> lock(m_queue_mutex);
        if (m_frame_queue.size() < MAX_QUEUE_SIZE) {
            m_frame_queue.push(frame);
            m_queue_cv.notify_one();  // Wake up vavcore_decode_to_surface()
        } else {
            // Queue full - drop frame or wait
            LogWarning("Frame dropped - queue full");
            ReleaseFrame(frame);
        }
    }
}
Modified vavcore_decode_to_surface()
VavCoreResult vavcore_decode_to_surface(...) {
    std::unique_lock<std::mutex> lock(m_queue_mutex);

    // Wait for frame with timeout (std::chrono::milliseconds needs <chrono>)
    if (m_queue_cv.wait_for(lock, std::chrono::milliseconds(100), [this] {
            return !m_frame_queue.empty() || !m_running;
        })) {
        if (!m_frame_queue.empty()) {
            *out_frame = m_frame_queue.front();
            m_frame_queue.pop();
            m_queue_cv.notify_one();  // Wake up decoder thread
            return VAVCORE_SUCCESS;
        }
    }

    // Timeout, or decoder stopped with an empty queue
    return VAVCORE_ERROR_TIMEOUT;
}
Timing:
First 3 calls: 15ms each (prebuffering)
Call 4+: 0-2ms (queue.pop, no wait!)
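The producer-consumer queue at the center of Phase B can be sketched as a self-contained class. This is an illustration under assumptions, not the VavCore implementation: `BoundedFrameQueue`, `Push`, `PopFor` and `Stop` are hypothetical names, and the sketch uses two condition variables (one per direction) plus a stop flag that is only changed under the mutex, instead of the single condition variable and atomic flag in the listings above.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Illustrative bounded queue; not an existing VavCore class.
template <typename T>
class BoundedFrameQueue {
public:
    explicit BoundedFrameQueue(std::size_t capacity) : capacity_(capacity) {}

    // Producer side: blocks while the queue is full. Returns false once stopped.
    bool Push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return queue_.size() < capacity_ || stopped_; });
        if (stopped_) return false;
        queue_.push(std::move(item));
        not_empty_.notify_one();
        return true;
    }

    // Consumer side: waits up to `timeout`; returns false on timeout or
    // when stopped with nothing left to drain.
    bool PopFor(T* out, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait_for(lock, timeout, [this] { return !queue_.empty() || stopped_; });
        if (queue_.empty()) return false;
        *out = std::move(queue_.front());
        queue_.pop();
        not_full_.notify_one();
        return true;
    }

    // Wakes every waiter. The flag is set under the mutex so no wakeup is lost.
    void Stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopped_ = true;
        }
        not_full_.notify_all();
        not_empty_.notify_all();
    }

private:
    std::size_t capacity_;
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    bool stopped_ = false;
};
```

In this shape the decoder thread would call `Push` and vavcore_decode_to_surface() would call `PopFor` with its 100ms timeout. Splitting the condition variable means a producer notification can never be consumed by another producer.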
6. Performance Comparison
Single Instance (4K @ 30 FPS)
| Metric | Current | Phase A | Phase B |
|---|---|---|---|
| Avg latency | 15ms | 6ms | 2ms |
| Peak latency | 30ms | 12ms | 5ms |
| Jitter tolerance | None | Medium | High |
| Memory | 12MB | 35MB | 47MB |
3 Instances (4K @ 30 FPS)
| Metric | Current | Phase A | Phase B |
|---|---|---|---|
| Avg latency | 21ms | 8ms | 2ms |
| Achieved FPS | 39 | 52 | 60 |
| Frame drops | 36% | 15% | 0% |
| Memory | 36MB | 105MB | 141MB |
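The memory rows can be sanity-checked against the size of a single decoded frame. The sketch below assumes 8-bit 4:2:0 frames (NV12-style) at 3840x2160. The document does not state the pixel format, so both the format and the helper names are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one 8-bit 4:2:0 frame: full-size luma plane plus half-size chroma.
constexpr std::uint64_t Yuv420FrameBytes(std::uint64_t width, std::uint64_t height) {
    return width * height * 3 / 2;
}

// Total bytes for `frames` buffered frames in each of `instances` decoders.
constexpr std::uint64_t BufferedBytes(std::uint64_t width, std::uint64_t height,
                                      std::uint64_t frames, std::uint64_t instances) {
    return Yuv420FrameBytes(width, height) * frames * instances;
}
```

Under that assumption one 3840x2160 frame is 12,441,600 bytes (about 11.9 MiB). Two extra buffered frames add about 23.7 MiB and three add about 35.6 MiB, which is in line with the +23MB and +35MB per-instance figures.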
7. Implementation Plan
Step 1: Phase A (Minimal Buffering)
Timeline: 4-6 hours
Tasks:
- Add frame buffer queue to MediaCodecAV1Decoder
- Modify DecodeFrameAsync() to implement buffering logic
- Test with single instance
- Test with 3 instances
- Measure performance improvement
Files to modify:
- MediaCodecAV1Decoder.h - Add buffer members
- MediaCodecAsyncHandler.cpp - Add buffering logic
Step 2: Phase B (Full Async)
Timeline: 1-2 days
Tasks:
- Create background decoder thread
- Refactor OnOutputBufferAvailable to push to queue
- Modify vavcore_decode_to_surface to non-blocking queue access
- Add proper lifecycle management (start/stop thread)
- Test with single and multiple instances
- Stress test with seeking, pause/resume
Files to modify:
- MediaCodecAV1Decoder.h - Add thread, queue, CV
- MediaCodecAV1Decoder.cpp - Thread implementation
- MediaCodecAsyncHandler.cpp - Queue-based decode
- MediaCodecSurfaceManager.cpp - Queue integration
8. Risk Assessment
Phase A Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Increased memory usage | High | Low | Acceptable for 4K playback |
| Seek latency increase | Medium | Low | Clear buffer on seek |
| Queue overflow | Low | Medium | Limit queue size to 2 |
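The "clear buffer on seek" mitigation could look roughly like the following. This is a sketch under assumptions: `BufferedFrame` stands in for VavCoreVideoFrame, and `PrebufferQueue`, `FlushForSeek` and the release callback are hypothetical names for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for VavCoreVideoFrame.
struct BufferedFrame { int id; };

// Illustrative buffer wrapper; names are not taken from VavCore.
class PrebufferQueue {
public:
    explicit PrebufferQueue(std::function<void(const BufferedFrame&)> release)
        : release_(std::move(release)) {}

    void Push(const BufferedFrame& frame) {
        std::lock_guard<std::mutex> lock(mutex_);
        frames_.push(frame);
    }

    void SetPrebuffering(bool value) {
        std::lock_guard<std::mutex> lock(mutex_);
        prebuffering_ = value;
    }

    // Seek path: release every stale frame, then force a fresh prebuffer.
    void FlushForSeek() {
        std::lock_guard<std::mutex> lock(mutex_);
        while (!frames_.empty()) {
            release_(frames_.front());  // return the underlying image/surface
            frames_.pop();
        }
        prebuffering_ = true;
    }

    std::size_t Size() {
        std::lock_guard<std::mutex> lock(mutex_);
        return frames_.size();
    }

    bool Prebuffering() {
        std::lock_guard<std::mutex> lock(mutex_);
        return prebuffering_;
    }

private:
    std::function<void(const BufferedFrame&)> release_;
    std::queue<BufferedFrame> frames_;
    std::mutex mutex_;
    bool prebuffering_ = true;
};
```

Releasing each frame before dropping it matters because buffered frames hold decoder output resources. Resetting the prebuffering flag means the first call after a seek pays the initial fill cost again.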
Phase B Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Thread synchronization bugs | Medium | High | Extensive testing, use proven patterns |
| Deadlock on cleanup | Medium | High | Proper thread shutdown protocol |
| Memory leak | Low | High | RAII, smart pointers |
| Race conditions | Medium | High | Mutex protection, atomic operations |
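One way to address the "deadlock on cleanup" risk is a shutdown protocol that changes the running flag under the same mutex the worker waits on, notifies every waiter, and only then joins. The sketch below is illustrative: `DecodeWorker` is a hypothetical stand-in for the background decoder thread, and its loop body is a placeholder rather than real decode logic.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative worker; stands in for the background decoder thread.
class DecodeWorker {
public:
    ~DecodeWorker() { Stop(); }

    void Start() {
        if (thread_.joinable()) return;  // already running
        {
            std::lock_guard<std::mutex> lock(mutex_);
            running_ = true;
        }
        thread_ = std::thread([this] { Run(); });
    }

    // Safe to call repeatedly; must not be called from the worker thread itself.
    void Stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            running_ = false;  // changed under the mutex the waiter uses
        }
        cv_.notify_all();
        if (thread_.joinable()) thread_.join();
    }

    int Runs() {
        std::lock_guard<std::mutex> lock(mutex_);
        return runs_;
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        ++runs_;
        // A real worker would also wait for queue space and decode here.
        cv_.wait(lock, [this] { return !running_; });
    }

    std::thread thread_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool running_ = false;
    int runs_ = 0;
};
```

Because the flag is written while holding the mutex, the worker cannot check the predicate, miss the update, and then sleep forever. `Start` and `Stop` themselves are assumed to be called from a single owner thread.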
9. Testing Strategy
Phase A Tests
- Single video playback - Verify smooth 30 FPS
- 3 concurrent videos - Measure FPS improvement
- Seek operations - Verify buffer is cleared
- Pause/Resume - Verify no buffer corruption
- End of stream - Verify graceful handling
Phase B Tests
- All Phase A tests
- Thread lifecycle - Start/stop 100 times, check for leaks
- Queue overflow - Send frames faster than consumption
- Queue underrun - Slow decoder, verify fallback
- Concurrent access - Multiple threads calling decode_to_surface
- Memory profiling - Run for 1 hour, check for leaks
10. Metrics
Success Criteria
Phase A:
- ✅ Latency reduced by 50%+
- ✅ 3-instance FPS improved to 50+ FPS
- ✅ No memory leaks
- ✅ API compatibility maintained
Phase B:
- ✅ Latency reduced by 80%+
- ✅ 3-instance FPS sustained at 60 FPS
- ✅ No deadlocks or race conditions
- ✅ Memory usage within 150MB for 3 instances
11. Rollout Plan
Week 1: Phase A Implementation
- Day 1-2: Implementation
- Day 3: Testing
- Day 4: Code review and merge
Week 2: Phase B Implementation
- Day 1-3: Implementation
- Day 4-5: Testing and debugging
Week 3: Validation
- Full regression testing
- Performance benchmarking
- Production deployment
12. Future Enhancements
Priority 1: Adaptive Buffer Size
- Dynamically adjust buffer size based on decoder performance
- Small buffer (2 frames) for fast decoders
- Large buffer (4 frames) for slow/jittery decoders
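A possible shape for the adaptive policy is sketched below. Everything here is an assumption for illustration: the function name, the use of a peak-to-average ratio as the jitter measure, the 1.5x and 2.0x thresholds, and the intermediate size of 3 frames would all need tuning against real decoder measurements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical policy: 2 frames for steady decoders, up to 4 for jittery ones.
// The thresholds are placeholders, not measured values.
std::size_t ChooseBufferSize(const std::vector<double>& recent_decode_ms) {
    if (recent_decode_ms.empty()) return 2;
    double sum = 0.0;
    double peak = 0.0;
    for (double ms : recent_decode_ms) {
        sum += ms;
        peak = std::max(peak, ms);
    }
    const double avg = sum / static_cast<double>(recent_decode_ms.size());
    if (avg <= 0.0) return 2;
    const double ratio = peak / avg;  // 1.0 means perfectly steady
    if (ratio < 1.5) return 2;
    if (ratio < 2.0) return 3;
    return 4;
}
```

A window of steady 15ms decodes keeps the small buffer, while a window containing one 50ms spike among 10ms decodes selects the 4-frame buffer.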
Priority 2: GPU Fence Integration
- Pass VkFence through queue
- Enable proper GPU synchronization with buffered frames
Priority 3: Frame Dropping Strategy
- Smart frame dropping on buffer overflow
- Prioritize I-frames over P-frames
13. References
- Current implementation: MediaCodecAsyncHandler.cpp:DecodeFrameAsync()
- Tutorial pattern: Vulkan+Image+Tutorial.md
- GPU synchronization: Phase 1-3 implementation (completed 2025-10-14)
Document Status: ✅ Ready for Implementation
Reviewed By: Architecture Team
Approved Date: 2025-10-14
Implementation Start: Immediate