The boring part of real-time vision is what keeps it usable

Computer vision demos usually focus on the box on the screen.

That box matters, but it is not the whole product. A useful vision system also needs to answer a quieter set of questions: how fast frames are moving through the pipeline, what happens when inference is slow, whether the browser gets ahead of the backend, and whether the UI stays honest about latency.

The demo shape

The project has a FastAPI backend, WebSocket streaming, a detector abstraction, and a React dashboard that draws bounding boxes over the video frame.

It ships with a deterministic mock detector so the whole thing runs without a GPU, camera setup, or model weights. The detector output is shaped like a real YOLO pipeline, so the demo can be upgraded without changing the product surface.
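To make the idea of a detector abstraction with a deterministic mock concrete, here is a minimal sketch. It is not the project's actual code: the names `Detection`, `Detector`, and `MockDetector`, the box format, and the hash-based placement are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical shape, loosely YOLO-like)."""
    label: str
    confidence: float
    # Box corners in pixel coordinates.
    x1: float
    y1: float
    x2: float
    y2: float


class Detector(Protocol):
    """Anything with this method can be plugged in as a detector."""

    def detect(self, frame: bytes, width: int, height: int) -> list[Detection]: ...


class MockDetector:
    """Deterministic stand-in: the same frame bytes always give the same box."""

    def detect(self, frame: bytes, width: int, height: int) -> list[Detection]:
        digest = hashlib.sha256(frame).digest()
        # Derive position and confidence from the hash, so no model is needed.
        box_w, box_h = width / 4, height / 4
        x1 = digest[0] / 255 * (width - box_w)
        y1 = digest[1] / 255 * (height - box_h)
        confidence = round(0.5 + digest[2] / 255 * 0.5, 3)
        return [Detection("person", confidence, x1, y1, x1 + box_w, y1 + box_h)]
```

A real model wrapper would implement the same `detect` method, which is what lets the rest of the system stay unchanged when the mock is swapped out.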

Why I added backpressure

Without backpressure, real-time dashboards become dishonest.

The frontend can send frames faster than the backend can process them. The backend queues the extra work, and the queue keeps growing. The UI keeps moving, so the dashboard looks live, but each detection it draws belongs to a frame captured further and further in the past.
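A rough back-of-the-envelope model shows how quickly this goes wrong. The sketch below assumes an unbounded first-in-first-out queue and constant rates; the function name and the example numbers are hypothetical, not measurements from the project.

```python
def overlay_lag_seconds(send_fps: float, process_fps: float, elapsed_s: float) -> float:
    """Age of the frame being processed after elapsed_s seconds of streaming.

    Idealized model: constant rates, unbounded FIFO queue, no dropped frames.
    """
    if send_fps <= process_fps:
        return 0.0  # the backend keeps up, so no backlog builds
    frames_processed = process_fps * elapsed_s
    # The frame now at the head of the queue was sent at this moment.
    captured_at = frames_processed / send_fps
    return elapsed_s - captured_at
```

Under these assumptions, a browser sending 30 frames per second to a backend that handles 10 is drawing boxes for frames about 6.7 seconds old after only 10 seconds of streaming, and the gap keeps widening.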

This project uses a one-frame acknowledgement loop: the browser sends a frame, waits for the result, updates the overlay and stats, then sends the next frame.
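The loop is easy to simulate. The sketch below is not the browser code: it uses in-memory `asyncio` queues as a stand-in for the WebSocket and a sleep as a stand-in for inference, and the names `backend` and `client` are made up for this example.

```python
import asyncio


async def backend(frames: asyncio.Queue, results: asyncio.Queue, infer_s: float) -> None:
    """Stand-in for the server: take a frame, pretend to run inference, reply."""
    while True:
        frame_id = await frames.get()
        if frame_id is None:  # shutdown signal
            return
        await asyncio.sleep(infer_s)  # simulated inference time
        await results.put(frame_id)


async def client(n_frames: int, infer_s: float) -> tuple[list[int], int]:
    """Send one frame, wait for its result, then send the next.

    Returns the acknowledged frame ids and the largest backlog observed.
    """
    frames: asyncio.Queue = asyncio.Queue()
    results: asyncio.Queue = asyncio.Queue()
    server = asyncio.create_task(backend(frames, results, infer_s))

    acked: list[int] = []
    max_backlog = 0
    for frame_id in range(n_frames):
        await frames.put(frame_id)
        max_backlog = max(max_backlog, frames.qsize())
        result_id = await results.get()  # the acknowledgement
        # A real UI would update the overlay and stats here.
        acked.append(result_id)

    await frames.put(None)
    await server
    return acked, max_backlog
```

Because the next frame is only sent after the previous result arrives, the backlog in this simulation never grows past one frame, no matter how slow the simulated inference is.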

It is less flashy than maximum FPS, but it makes the interface truthful. The latency numbers you see are real. The frame you are looking at is the frame the detector actually analyzed. That matters a lot more than a high FPS counter that is lying to you.

The real claim

The interesting part of this project is not the bounding boxes. It is that the system accurately represents its own state: frame rate, detection latency, connection limits, and whether the backend is keeping up.
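As an illustration of what reporting honest state can look like, here is a small sketch of a rolling stats tracker. The class name `StreamStats`, the window size, and the snapshot fields are assumptions for this example, not the project's actual API.

```python
from collections import deque


class StreamStats:
    """Rolling frame rate and latency over the last `window` completed frames."""

    def __init__(self, window: int = 30) -> None:
        # Each sample is (sent_at, received_at) in seconds.
        self._samples: deque = deque(maxlen=window)

    def record(self, sent_at: float, received_at: float) -> None:
        self._samples.append((sent_at, received_at))

    def snapshot(self) -> dict:
        if not self._samples:
            return {"fps": 0.0, "latency_ms": 0.0, "frames": 0}
        latencies = [(received - sent) * 1000 for sent, received in self._samples]
        # Frame rate is based on completed results, not on frames sent.
        span = self._samples[-1][1] - self._samples[0][1]
        fps = (len(self._samples) - 1) / span if span > 0 else 0.0
        return {
            "fps": fps,
            "latency_ms": sum(latencies) / len(latencies),
            "frames": len(self._samples),
        }
```

Measuring frame rate from completed results rather than from frames sent is the design choice that matters here: the number can only go up if the backend is actually keeping up.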

Most demos do not bother with that. I think they should.