The problem
Industrial monitoring needs to catch a worker falling as it happens, on hardware a site can actually afford — one mid-range GPU, not a cluster. The hard part isn't a model that classifies falls offline; it's a pipeline that ingests a live stream, holds latency under control, and stays honest about what is real inference and what is a fallback.
Approach & tradeoffs
Sentinel Vision is a single observable FastAPI service, not a notebook. A frame flows through YOLO26-Pose → ByteTrack identity tracking → SAM 2.1 masks (called only on new, stale or uncertain tracks, to skip the expensive call when nothing changed) → a per-person skeleton transformer that reads the action over a temporal window.
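The gating on the segmentation step can be sketched as a small per-track decision. This is a minimal illustration, not the project's code: `TrackState`, `needs_mask`, `select_tracks_for_masks` and both thresholds are hypothetical names and values, since the write-up does not state how "stale" or "uncertain" are defined.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical thresholds; the write-up does not give the real values.
STALE_AFTER_FRAMES = 30
MIN_POSE_CONFIDENCE = 0.5


@dataclass
class TrackState:
    track_id: int
    pose_confidence: float                 # confidence of the latest pose for this track
    last_mask_frame: Optional[int] = None  # frame index of the last mask, None if never segmented


def needs_mask(track: TrackState, frame_idx: int) -> bool:
    """Return True when the expensive segmentation call is worth making."""
    if track.last_mask_frame is None:
        return True  # new track: no mask yet
    if frame_idx - track.last_mask_frame >= STALE_AFTER_FRAMES:
        return True  # stale mask: too many frames since the last refresh
    return track.pose_confidence < MIN_POSE_CONFIDENCE  # uncertain track


def select_tracks_for_masks(tracks: List[TrackState], frame_idx: int) -> List[int]:
    """Pick the track ids to segment on this frame and mark them as refreshed."""
    selected = []
    for track in tracks:
        if needs_mask(track, frame_idx):
            track.last_mask_frame = frame_idx
            selected.append(track.track_id)
    return selected
```

With this shape, a scene of stable, confidently tracked people produces an empty selection on most frames, so the segmentation model is simply not invoked.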
The engineering decisions are about keeping a live system honest:
- Bounded latest-frame queues so live latency can't drift under load — the pipeline drops old frames instead of falling behind.
- Fail-closed production config that refuses to start on demo backends, and a /health/ready endpoint that exposes its degradations instead of hiding them.
- Models as swappable adapters (PyTorch / ONNX / TensorRT), with an auditable kinematic fallback for the temporal head.
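The bounded latest-frame queue from the first bullet can be sketched with the standard library alone. This is an assumption about the general technique rather than the service's actual class: `LatestFrameQueue` and its `dropped` counter are hypothetical names. The key property is that `put` never blocks; when the buffer is full, the oldest frame is evicted so the consumer always works on recent data.

```python
import threading
from collections import deque
from typing import Any, Optional


class LatestFrameQueue:
    """Bounded queue that evicts the oldest frame instead of blocking the producer."""

    def __init__(self, maxsize: int = 1) -> None:
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self._frames = deque(maxlen=maxsize)
        self._cond = threading.Condition()
        self.dropped = 0  # frames evicted before a consumer could read them

    def put(self, frame: Any) -> None:
        """Add a frame without ever waiting; count an eviction if the buffer is full."""
        with self._cond:
            if len(self._frames) == self._frames.maxlen:
                self.dropped += 1  # deque(maxlen=...) discards the oldest item on append
            self._frames.append(frame)
            self._cond.notify()

    def get(self, timeout: Optional[float] = None) -> Any:
        """Return the oldest buffered frame, waiting up to `timeout` seconds for one."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._frames) > 0, timeout=timeout):
                raise TimeoutError("no frame arrived in time")
            return self._frames.popleft()
```

With `maxsize=1` the consumer only ever sees the newest frame. A larger bound trades a little extra latency for tolerance to short stalls, and the `dropped` counter is the kind of value a drop-rate metric could be built from.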
The temporal head is trained on the UR Fall Detection Dataset with a by-sequence split: every frame of a recorded sequence lands on one side of the split, so the falls in validation are never seen in training and the score is not inflated by near-duplicate frames.
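A by-sequence split can be sketched as follows. The function name, the `val_fraction` default and the sequence ids in the usage note are illustrative assumptions, not the project's actual split code. The idea is to shuffle sequence ids rather than frames, then assign every sample to the side its sequence landed on.

```python
import random
from typing import List, Sequence, Tuple


def split_by_sequence(
    sequence_ids: Sequence[str], val_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, val_indices) such that no sequence appears on both sides."""
    unique_ids = sorted(set(sequence_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_val = max(1, round(len(unique_ids) * val_fraction))
    val_ids = set(unique_ids[:n_val])
    train_idx = [i for i, s in enumerate(sequence_ids) if s not in val_ids]
    val_idx = [i for i, s in enumerate(sequence_ids) if s in val_ids]
    return train_idx, val_idx
```

Calling `split_by_sequence(["fall-01", "fall-01", "adl-01", "adl-02"], val_fraction=0.5)` keeps both `fall-01` samples together on whichever side that sequence is assigned to. A fuller version would probably also stratify by class so that validation is guaranteed to contain fall sequences.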
Results
- Validation macro-F1 = 0.90 on held-out fall sequences — an honest split, not a synthetic 100%.
- 27.98 FPS, p95 49 ms end-to-end, ~1990 MiB VRAM on an RTX 2070, with 0% frame drops over the benchmark and every throughput / latency / drop SLO passing.
- Full observability: Prometheus metrics, a live FPS/latency dashboard, and a WebSocket telemetry stream.
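For readers who want to reproduce this kind of benchmark summary, the sketch below shows one way to derive FPS, p95 latency and drop rate from raw per-frame measurements and check them against SLOs. Everything here is an assumption for illustration: the function names, the nearest-rank percentile definition and the threshold parameters are hypothetical, and the write-up does not state the actual SLO values.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


@dataclass
class BenchmarkReport:
    fps: float
    p95_ms: float
    drop_rate: float
    passed: bool


def evaluate_slos(
    latencies_ms: Sequence[float],  # end-to-end latency of each processed frame
    frames_received: int,           # frames offered to the pipeline, including dropped ones
    wall_seconds: float,            # duration of the benchmark run
    min_fps: float,                 # hypothetical SLO thresholds supplied by the caller
    max_p95_ms: float,
    max_drop_rate: float,
) -> BenchmarkReport:
    """Summarise a run and report whether every SLO holds."""
    processed = len(latencies_ms)
    fps = processed / wall_seconds
    p95 = percentile_nearest_rank(latencies_ms, 95)
    drop_rate = (frames_received - processed) / frames_received
    passed = fps >= min_fps and p95 <= max_p95_ms and drop_rate <= max_drop_rate
    return BenchmarkReport(fps=fps, p95_ms=p95, drop_rate=drop_rate, passed=passed)
```

Reporting p95 rather than the mean matters for a live system: a handful of slow frames barely moves the average but is exactly what an operator notices.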
What I'd flag
Every FPS and VRAM figure is measured on one named GPU; nothing is extrapolated from a bigger card. Genuine zero-copy decode-to-inference is scoped to a DeepStream boundary, which the portable build does not claim; readiness reports the active decode path instead of pretending it's PCIe-free.
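The fail-closed startup check and the self-reporting readiness payload can be sketched as two plain functions. This is a hedged illustration only: `RuntimeState`, `assert_startable`, `readiness_report`, the backend names and the payload fields are all hypothetical, and the real service presumably wires something similar into its FastAPI startup and its /health/ready handler.

```python
from dataclasses import dataclass
from typing import Dict, List

DEMO_BACKENDS = {"demo", "mock"}  # hypothetical labels for non-production backends


@dataclass(frozen=True)
class RuntimeState:
    environment: str          # e.g. "production" or "development"
    backends: Dict[str, str]  # component name -> backend currently in use
    decode_path: str          # e.g. "cpu-decode" or "deepstream"
    temporal_fallback: bool   # True when the kinematic fallback replaced the transformer


def assert_startable(state: RuntimeState) -> None:
    """Fail closed: refuse to start in production if any component uses a demo backend."""
    if state.environment != "production":
        return
    offenders = sorted(name for name, backend in state.backends.items()
                       if backend in DEMO_BACKENDS)
    if offenders:
        raise RuntimeError(
            "refusing to start in production with demo backends: " + ", ".join(offenders)
        )


def readiness_report(state: RuntimeState) -> dict:
    """Build a readiness payload that lists degradations instead of hiding them."""
    degradations: List[str] = []
    if state.temporal_fallback:
        degradations.append("temporal_head: kinematic fallback active")
    if state.decode_path != "deepstream":
        degradations.append("decode: not zero-copy")
    return {
        "ready": True,
        "decode_path": state.decode_path,
        "zero_copy": state.decode_path == "deepstream",
        "degradations": degradations,
    }
```

The design choice is that a degraded service still reports ready, but it says exactly what is degraded, so a dashboard or an operator can tell real inference from a fallback at a glance.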