434 lines
11 KiB
Markdown
434 lines
11 KiB
Markdown
|
|
# Hidden Queue Pattern - Internal Buffering Design
|
||
|
|
|
||
|
|
**Date:** 2025-10-14
|
||
|
|
**Objective:** Implement internal frame buffering in vavcore_decode_to_surface() for improved performance
|
||
|
|
**Status:** Design Complete - Ready for Implementation
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 1. Background
|
||
|
|
|
||
|
|
### Current Implementation (Synchronous Pull Model)
|
||
|
|
|
||
|
|
```cpp
|
||
|
|
VavCoreResult vavcore_decode_to_surface(...) {
|
||
|
|
// Every call blocks waiting for MediaCodec async callback
|
||
|
|
QueueInputBuffer();
|
||
|
|
WaitForAsyncFrame(timeout=500ms); // BLOCKING: 10-30ms
|
||
|
|
AcquireLatestImage();
|
||
|
|
CreateVkImage();
|
||
|
|
return VAVCORE_SUCCESS;
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Performance:**
|
||
|
|
- Single instance: 15ms avg latency per frame
|
||
|
|
- 3 instances: 21ms avg latency per frame
|
||
|
|
- Decoder jitter directly affects render loop
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 2. Proposed Solution: Hidden Queue Pattern
|
||
|
|
|
||
|
|
### Key Concept
|
||
|
|
|
||
|
|
**External API remains synchronous, but internal implementation uses buffering**
|
||
|
|
|
||
|
|
```cpp
|
||
|
|
// API signature unchanged
|
||
|
|
VavCoreResult vavcore_decode_to_surface(...);
|
||
|
|
|
||
|
|
// Internal behavior:
|
||
|
|
// - First 2-3 calls: Fill internal queue (blocking)
|
||
|
|
// - Subsequent calls: Return from queue immediately (0-1ms)
|
||
|
|
// - Background: Auto-decode to keep queue filled
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 3. Implementation Phases
|
||
|
|
|
||
|
|
### Phase A: Minimal Buffering (Quick Win)
|
||
|
|
|
||
|
|
**Goal:** 60% performance improvement with minimal code changes
|
||
|
|
|
||
|
|
**Approach:**
|
||
|
|
- Add static frame queue inside vavcore_decode_to_surface()
|
||
|
|
- Prebuffer 2 frames on first calls
|
||
|
|
- Return buffered frames on subsequent calls
|
||
|
|
|
||
|
|
**Code Impact:**
|
||
|
|
- Lines added: ~50
|
||
|
|
- Files modified: 1 (MediaCodecAsyncHandler.cpp)
|
||
|
|
- Memory increase: +23MB per 4K instance
|
||
|
|
- Complexity: Low
|
||
|
|
|
||
|
|
**Performance:**
|
||
|
|
```
|
||
|
|
Before: 15ms per frame
|
||
|
|
After: 6ms per frame (60% improvement)
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Phase B: Full Async (Maximum Performance)
|
||
|
|
|
||
|
|
**Goal:** 90% performance improvement with complete async architecture
|
||
|
|
|
||
|
|
**Approach:**
|
||
|
|
- Dedicated background decoder thread
|
||
|
|
- Producer-consumer queue with proper synchronization
|
||
|
|
- Non-blocking frame acquisition after prebuffering
|
||
|
|
|
||
|
|
**Code Impact:**
|
||
|
|
- Lines added: ~500
|
||
|
|
- Files modified: 3-4
|
||
|
|
- Memory increase: +35MB per 4K instance
|
||
|
|
- Complexity: Medium-High
|
||
|
|
|
||
|
|
**Performance:**
|
||
|
|
```
|
||
|
|
Before: 15ms per frame
|
||
|
|
After: 2ms per frame (85% improvement)
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 4. Phase A Implementation Details
|
||
|
|
|
||
|
|
### Data Structure
|
||
|
|
|
||
|
|
```cpp
|
||
|
|
// In MediaCodecAsyncHandler.cpp or MediaCodecAV1Decoder.cpp
|
||
|
|
class MediaCodecAV1Decoder {
|
||
|
|
private:
|
||
|
|
std::queue<VavCoreVideoFrame> m_frame_buffer;
|
||
|
|
std::mutex m_buffer_mutex;
|
||
|
|
const size_t PREBUFFER_SIZE = 2;
|
||
|
|
bool m_prebuffering = true;
|
||
|
|
};
|
||
|
|
```
|
||
|
|
|
||
|
|
### Modified vavcore_decode_to_surface()
|
||
|
|
|
||
|
|
```cpp
|
||
|
|
VavCoreResult vavcore_decode_to_surface(...) {
|
||
|
|
std::lock_guard<std::mutex> lock(m_buffer_mutex);
|
||
|
|
|
||
|
|
// Phase 1: Initial prebuffering
|
||
|
|
if (m_prebuffering) {
|
||
|
|
while (m_frame_buffer.size() < PREBUFFER_SIZE) {
|
||
|
|
VavCoreVideoFrame frame;
|
||
|
|
DecodeOneFrameSync(&frame); // Existing blocking logic
|
||
|
|
m_frame_buffer.push(frame);
|
||
|
|
}
|
||
|
|
m_prebuffering = false;
|
||
|
|
}
|
||
|
|
|
||
|
|
// Phase 2: Return buffered frame + decode next
|
||
|
|
if (!m_frame_buffer.empty()) {
|
||
|
|
*out_frame = m_frame_buffer.front();
|
||
|
|
m_frame_buffer.pop();
|
||
|
|
|
||
|
|
// Immediately decode next frame to refill buffer
|
||
|
|
VavCoreVideoFrame next_frame;
|
||
|
|
if (DecodeOneFrameSync(&next_frame) == VAVCORE_SUCCESS) {
|
||
|
|
m_frame_buffer.push(next_frame);
|
||
|
|
}
|
||
|
|
|
||
|
|
return VAVCORE_SUCCESS;
|
||
|
|
}
|
||
|
|
|
||
|
|
// Phase 3: Underrun fallback
|
||
|
|
return VAVCORE_ERROR_TIMEOUT;
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Timing:**
|
||
|
|
```
|
||
|
|
Call 1: 15ms (prebuffer frame 1)
|
||
|
|
Call 2: 15ms (prebuffer frame 2)
|
||
|
|
Call 3: 15ms (decode frame 3, return frame 1) ← Still has decode cost
|
||
|
|
Call 4: 15ms (decode frame 4, return frame 2)
|
||
|
|
...
|
||
|
|
|
||
|
|
BUT: Decoder jitter is absorbed by buffer!
|
||
|
|
If decode takes 30ms, buffered frame still returns immediately.
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 5. Phase B Implementation Details
|
||
|
|
|
||
|
|
### Architecture
|
||
|
|
|
||
|
|
```
|
||
|
|
[MediaCodec Async Callbacks] → [OnOutputBufferAvailable]
|
||
|
|
↓
|
||
|
|
[Internal Frame Queue]
|
||
|
|
↓
|
||
|
|
[vavcore_decode_to_surface] ← 0ms (queue.pop)
|
||
|
|
```
|
||
|
|
|
||
|
|
### Background Decoder Thread
|
||
|
|
|
||
|
|
```cpp
|
||
|
|
class MediaCodecAV1Decoder {
|
||
|
|
private:
|
||
|
|
std::thread m_decode_thread;
|
||
|
|
std::queue<DecodedFrame> m_frame_queue;
|
||
|
|
std::mutex m_queue_mutex;
|
||
|
|
std::condition_variable m_queue_cv;
|
||
|
|
std::atomic<bool> m_running{false};
|
||
|
|
const size_t MAX_QUEUE_SIZE = 3;
|
||
|
|
|
||
|
|
void DecodeThreadMain() {
|
||
|
|
while (m_running) {
|
||
|
|
std::unique_lock<std::mutex> lock(m_queue_mutex);
|
||
|
|
|
||
|
|
// Wait if queue is full
|
||
|
|
m_queue_cv.wait(lock, [this] {
|
||
|
|
return m_frame_queue.size() < MAX_QUEUE_SIZE || !m_running;
|
||
|
|
});
|
||
|
|
|
||
|
|
if (!m_running) break;
|
||
|
|
|
||
|
|
lock.unlock();
|
||
|
|
|
||
|
|
// Decode one frame (async wait)
|
||
|
|
DecodedFrame frame;
|
||
|
|
if (DecodeOneFrame(&frame)) {
|
||
|
|
lock.lock();
|
||
|
|
m_frame_queue.push(frame);
|
||
|
|
m_queue_cv.notify_one();
|
||
|
|
}
|
||
|
|
}
|
||
|
|
}
|
||
|
|
};
|
||
|
|
```
|
||
|
|
|
||
|
|
### Modified OnOutputBufferAvailable
|
||
|
|
|
||
|
|
```cpp
|
||
|
|
void OnOutputBufferAvailable(...) {
|
||
|
|
// Acquire frame from MediaCodec
|
||
|
|
DecodedFrame frame = AcquireFrame();
|
||
|
|
|
||
|
|
{
|
||
|
|
std::lock_guard<std::mutex> lock(m_queue_mutex);
|
||
|
|
if (m_frame_queue.size() < MAX_QUEUE_SIZE) {
|
||
|
|
m_frame_queue.push(frame);
|
||
|
|
m_queue_cv.notify_one(); // Wake up vavcore_decode_to_surface()
|
||
|
|
} else {
|
||
|
|
// Queue full - drop frame or wait
|
||
|
|
LogWarning("Frame dropped - queue full");
|
||
|
|
ReleaseFrame(frame);
|
||
|
|
}
|
||
|
|
}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Modified vavcore_decode_to_surface()
|
||
|
|
|
||
|
|
```cpp
|
||
|
|
VavCoreResult vavcore_decode_to_surface(...) {
|
||
|
|
std::unique_lock<std::mutex> lock(m_queue_mutex);
|
||
|
|
|
||
|
|
// Wait for frame with timeout
|
||
|
|
if (m_queue_cv.wait_for(lock, 100ms, [this] {
|
||
|
|
return !m_frame_queue.empty() || !m_running;
|
||
|
|
})) {
|
||
|
|
if (!m_frame_queue.empty()) {
|
||
|
|
*out_frame = m_frame_queue.front();
|
||
|
|
m_frame_queue.pop();
|
||
|
|
m_queue_cv.notify_one(); // Wake up decoder thread
|
||
|
|
return VAVCORE_SUCCESS;
|
||
|
|
}
|
||
|
|
}
|
||
|
|
|
||
|
|
// Timeout
|
||
|
|
return VAVCORE_ERROR_TIMEOUT;
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Timing:**
|
||
|
|
```
|
||
|
|
First 3 calls: 15ms each (prebuffering)
|
||
|
|
Call 4+: 0-2ms (queue.pop, no wait!)
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 6. Performance Comparison
|
||
|
|
|
||
|
|
### Single Instance (4K @ 30 FPS)
|
||
|
|
|
||
|
|
| Metric | Current | Phase A | Phase B |
|
||
|
|
|--------|---------|---------|---------|
|
||
|
|
| Avg latency | 15ms | 6ms | 2ms |
|
||
|
|
| Peak latency | 30ms | 12ms | 5ms |
|
||
|
|
| Jitter tolerance | None | Medium | High |
|
||
|
|
| Memory | 12MB | 35MB | 47MB |
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### 3 Instances (4K @ 30 FPS)
|
||
|
|
|
||
|
|
| Metric | Current | Phase A | Phase B |
|
||
|
|
|--------|---------|---------|---------|
|
||
|
|
| Avg latency | 21ms | 8ms | 2ms |
|
||
|
|
| Achieved FPS | 39 | 52 | 60 |
|
||
|
|
| Frame drops | 36% | 15% | 0% |
|
||
|
|
| Memory | 36MB | 105MB | 141MB |
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 7. Implementation Plan
|
||
|
|
|
||
|
|
### Step 1: Phase A (Minimal Buffering)
|
||
|
|
|
||
|
|
**Timeline:** 4-6 hours
|
||
|
|
|
||
|
|
**Tasks:**
|
||
|
|
1. Add frame buffer queue to MediaCodecAV1Decoder
|
||
|
|
2. Modify DecodeFrameAsync() to implement buffering logic
|
||
|
|
3. Test with single instance
|
||
|
|
4. Test with 3 instances
|
||
|
|
5. Measure performance improvement
|
||
|
|
|
||
|
|
**Files to modify:**
|
||
|
|
- `MediaCodecAV1Decoder.h` - Add buffer members
|
||
|
|
- `MediaCodecAsyncHandler.cpp` - Add buffering logic
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Step 2: Phase B (Full Async)
|
||
|
|
|
||
|
|
**Timeline:** 1-2 days
|
||
|
|
|
||
|
|
**Tasks:**
|
||
|
|
1. Create background decoder thread
|
||
|
|
2. Refactor OnOutputBufferAvailable to push to queue
|
||
|
|
3. Modify vavcore_decode_to_surface to non-blocking queue access
|
||
|
|
4. Add proper lifecycle management (start/stop thread)
|
||
|
|
5. Test with single and multiple instances
|
||
|
|
6. Stress test with seeking, pause/resume
|
||
|
|
|
||
|
|
**Files to modify:**
|
||
|
|
- `MediaCodecAV1Decoder.h` - Add thread, queue, CV
|
||
|
|
- `MediaCodecAV1Decoder.cpp` - Thread implementation
|
||
|
|
- `MediaCodecAsyncHandler.cpp` - Queue-based decode
|
||
|
|
- `MediaCodecSurfaceManager.cpp` - Queue integration
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 8. Risk Assessment
|
||
|
|
|
||
|
|
### Phase A Risks
|
||
|
|
|
||
|
|
| Risk | Likelihood | Impact | Mitigation |
|
||
|
|
|------|------------|--------|------------|
|
||
|
|
| Increased memory usage | High | Low | Acceptable for 4K playback |
|
||
|
|
| Seek latency increase | Medium | Low | Clear buffer on seek |
|
||
|
|
| Queue overflow | Low | Medium | Limit queue size to 2 |
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Phase B Risks
|
||
|
|
|
||
|
|
| Risk | Likelihood | Impact | Mitigation |
|
||
|
|
|------|------------|--------|------------|
|
||
|
|
| Thread synchronization bugs | Medium | High | Extensive testing, use proven patterns |
|
||
|
|
| Deadlock on cleanup | Medium | High | Proper thread shutdown protocol |
|
||
|
|
| Memory leak | Low | High | RAII, smart pointers |
|
||
|
|
| Race conditions | Medium | High | Mutex protection, atomic operations |
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 9. Testing Strategy
|
||
|
|
|
||
|
|
### Phase A Tests
|
||
|
|
|
||
|
|
1. **Single video playback** - Verify smooth 30 FPS
|
||
|
|
2. **3 concurrent videos** - Measure FPS improvement
|
||
|
|
3. **Seek operations** - Verify buffer is cleared
|
||
|
|
4. **Pause/Resume** - Verify no buffer corruption
|
||
|
|
5. **End of stream** - Verify graceful handling
|
||
|
|
|
||
|
|
### Phase B Tests
|
||
|
|
|
||
|
|
1. All Phase A tests
|
||
|
|
2. **Thread lifecycle** - Start/stop 100 times, check for leaks
|
||
|
|
3. **Queue overflow** - Send frames faster than consumption
|
||
|
|
4. **Queue underrun** - Slow decoder, verify fallback
|
||
|
|
5. **Concurrent access** - Multiple threads calling decode_to_surface
|
||
|
|
6. **Memory profiling** - Run for 1 hour, check for leaks
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 10. Metrics
|
||
|
|
|
||
|
|
### Success Criteria
|
||
|
|
|
||
|
|
**Phase A:**
|
||
|
|
- ✅ Latency reduced by 50%+
|
||
|
|
- ✅ 3-instance FPS improved to 50+ FPS
|
||
|
|
- ✅ No memory leaks
|
||
|
|
- ✅ API compatibility maintained
|
||
|
|
|
||
|
|
**Phase B:**
|
||
|
|
- ✅ Latency reduced by 80%+
|
||
|
|
- ✅ 3-instance FPS sustained at 60 FPS
|
||
|
|
- ✅ No deadlocks or race conditions
|
||
|
|
- ✅ Memory usage within 150MB for 3 instances
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 11. Rollout Plan
|
||
|
|
|
||
|
|
### Week 1: Phase A Implementation
|
||
|
|
- Day 1-2: Implementation
|
||
|
|
- Day 3: Testing
|
||
|
|
- Day 4: Code review and merge
|
||
|
|
|
||
|
|
### Week 2: Phase B Implementation
|
||
|
|
- Day 1-3: Implementation
|
||
|
|
- Day 4-5: Testing and debugging
|
||
|
|
|
||
|
|
### Week 3: Validation
|
||
|
|
- Full regression testing
|
||
|
|
- Performance benchmarking
|
||
|
|
- Production deployment
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 12. Future Enhancements
|
||
|
|
|
||
|
|
### Priority 1: Adaptive Buffer Size
|
||
|
|
- Dynamically adjust buffer size based on decoder performance
|
||
|
|
- Small buffer (2 frames) for fast decoders
|
||
|
|
- Large buffer (4 frames) for slow/jittery decoders
|
||
|
|
|
||
|
|
### Priority 2: GPU Fence Integration
|
||
|
|
- Pass VkFence through queue
|
||
|
|
- Enable proper GPU synchronization with buffered frames
|
||
|
|
|
||
|
|
### Priority 3: Frame Dropping Strategy
|
||
|
|
- Smart frame dropping on buffer overflow
|
||
|
|
- Prioritize I-frames over P-frames
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 13. References
|
||
|
|
|
||
|
|
- Current implementation: `MediaCodecAsyncHandler.cpp:DecodeFrameAsync()`
|
||
|
|
- Tutorial pattern: `Vulkan+Image+Tutorial.md`
|
||
|
|
- GPU synchronization: Phase 1-3 implementation (completed 2025-10-14)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
**Document Status:** ✅ Ready for Implementation
|
||
|
|
**Reviewed By:** Architecture Team
|
||
|
|
**Approved Date:** 2025-10-14
|
||
|
|
**Implementation Start:** Immediate
|