Real-Time Helmet Detection Project

Team: M Divya, Kavali Rohitha, Katragadda Navya
Code: GitHub Repository

This project was completed as part of the Edge AI Course, which focuses on real-time AI deployment on edge devices.

Video presentation (Google Drive): https://drive.google.com/file/d/1yNl1FJYWZVjtlT5UGi9V5FdqgKGQ1F8Q/view?usp=sharing

1. Problem Statement, Motivation & Objectives

Road accidents involving two-wheeler riders without helmets are a major safety concern. Manual monitoring by traffic authorities is inefficient and not scalable. This project addresses the need for an automated system that can detect helmet usage in real time using computer vision.

The motivation behind this work is to build a low-cost, real-time edge AI system that can operate without reliance on cloud infrastructure. By performing inference on edge devices like Raspberry Pi, the system reduces latency, improves privacy, and minimizes bandwidth usage.

Objectives:

Detect riders and helmet status (with or without helmet) in real time using a YOLOv5-based pipeline
Deploy the trained model on a Raspberry Pi 5 using ONNX Runtime, with no PyTorch dependency on the device
Optimize the system for real-time performance by minimizing latency and maximizing stable FPS on edge devices
Implement a TCP client-server architecture for distributed frame capture and inference
Evaluate model quality using Precision, Recall, mAP@0.5, and mAP@0.5:0.95
Achieve measured on-device inference latency under 1 second per frame on CPU

2. Proposed Solution

The system implements a two-node distributed inference pipeline. The laptop acts as the client: it captures frames in real time, compresses them, and streams them over TCP. The Raspberry Pi 5 serves as the inference server: it runs the ONNX model and returns structured JSON detections, which the client renders as bounding boxes for visualization.
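
As an illustration of the structured JSON reply, the sketch below serializes a list of detections on the server and parses it on the client. The field names (`latency_ms`, `detections`, `cls`, `conf`, `box`) and the class label string are hypothetical; the project's actual schema is not specified here.

```python
import json

def encode_detections(detections, latency_ms):
    """Serialize detections to UTF-8 JSON bytes for the TCP reply.

    Field names are hypothetical, not taken from the project code.
    """
    message = {
        "latency_ms": round(float(latency_ms), 1),
        "detections": [
            {
                "cls": name,
                "conf": round(float(conf), 3),
                "box": [int(v) for v in box],  # x1, y1, x2, y2 in pixels
            }
            for name, conf, box in detections
        ],
    }
    return json.dumps(message).encode("utf-8")

def decode_detections(payload):
    """Parse the JSON bytes back into a Python dict on the client."""
    return json.loads(payload.decode("utf-8"))

# Round trip with one made-up detection.
reply = encode_detections([("without_helmet", 0.91, (120, 40, 220, 160))], 969.0)
parsed = decode_detections(reply)
```

Keeping box coordinates as plain integer lists means the client can pass them straight to its drawing code without further conversion.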

System Pipeline:

Webcam (Laptop)
  → JPEG Compression (quality 80)
    → TCP Socket (length-prefixed framing)
      → Raspberry Pi 5: ONNX Runtime Inference
        → JSON Detection Results
          → TCP Socket → Laptop (Bounding Box Rendering)

Design Justifications:

Component	Choice	Rationale
Detection model	YOLOv5s	Lightweight single-stage detector; strong accuracy-to-compute ratio for CPU inference
Inference runtime	ONNX Runtime	Removes PyTorch from the Pi; delivers CPU-optimized execution with lower memory footprint
Transport	TCP + length-prefixed framing	Guarantees ordered, lossless delivery of variable-length image payloads; prevents partial reads
Compression	JPEG at quality 80	Reduces per-frame TCP payload size with negligible detection quality impact
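
A minimal sketch of length-prefixed framing over TCP, assuming a 4-byte big-endian length header (the exact header layout used in the project is not stated). The helper names `send_msg`, `recv_exact`, and `recv_msg` are illustrative.

```python
import socket
import struct

HEADER = struct.Struct(">I")  # 4-byte big-endian length (assumed header layout)

def send_msg(sock, payload):
    """Send one message: a length header followed by the payload bytes."""
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def recv_msg(sock):
    """Receive one length-prefixed message and return its payload."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)

# Loopback demonstration with a connected socket pair.
client_end, server_end = socket.socketpair()
send_msg(client_end, b"fake-jpeg-bytes")
send_msg(client_end, b"")
first = recv_msg(server_end)
second = recv_msg(server_end)
client_end.close()
server_end.close()
```

Because the receiver always knows how many bytes to expect, message boundaries survive TCP's stream semantics even when a single `recv()` call returns only part of a frame.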

3. Hardware & Software Setup

Hardware:

Component	Specification
Edge Inference Device	Raspberry Pi 5 (8GB RAM)
Client Device	Laptop (frame capture and result visualization)
Camera	USB Webcam

The Raspberry Pi 5 was selected for its significantly improved CPU performance over prior Pi generations, making CPU-bound ONNX inference feasible for near-real-time use without a dedicated hardware accelerator.

Software:

Component	Tool / Version
OS (Raspberry Pi)	Raspberry Pi OS 64-bit (Bookworm)
Language	Python 3.11
Detection Framework	YOLOv5 (Ultralytics)
Inference Runtime	ONNX Runtime (CPU)
Vision Library	OpenCV
Communication	Custom TCP protocol (length-prefixed binary framing)
Utilities	NumPy, JSON

4. Data Collection & Dataset Preparation

Source: Public Kaggle helmet detection dataset with YOLO-format annotations (normalized bounding boxes): https://www.kaggle.com/datasets/aneesarom/rider-with-helmet-without-helmet-number-plate
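
For reference, a YOLO-format label line stores `class cx cy w h` with all four box values normalized to the image size. The sketch below converts one such line to pixel corner coordinates; the class index `2` and the 640×480 image size are made-up example values.

```python
def yolo_to_pixels(line, img_w, img_h):
    """Convert 'class cx cy w h' (normalized) to (class_id, x1, y1, x2, y2) in pixels."""
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return class_id, round(x1), round(y1), round(x2), round(y2)

# A box centered in the image, a quarter of its width and half of its height.
box = yolo_to_pixels("2 0.5 0.5 0.25 0.5", img_w=640, img_h=480)
```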

Class Distribution:

Class	Approx. Instances
Rider	~120
Number Plate	~116
Without Helmet	~93
With Helmet	~64

The dataset covers all four classes required for the task. The “With Helmet” class is the least represented, with roughly half as many instances as “Rider” (~64 vs. ~120), but it still provides useful learning signal for the model. Collecting more data, especially for this class, is the most direct way to improve performance further.

Although the overall dataset size is relatively small, it is sufficient to demonstrate the working pipeline and achieve meaningful results. Expanding the dataset in the future will help the model generalize better across different environments, lighting conditions, and real-world scenarios.

Data Split: Standard train / validation / test partition.

Augmentation pipeline:

Horizontal flip - helps the model handle different riding directions and improves generalization.
HSV color jitter - makes the model more robust to changes in lighting and camera exposure.
Mosaic augmentation (YOLO default) - combines four images into one during training, which improves detection of small and partially visible objects like helmets and number plates.
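
To make the horizontal flip concrete: with normalized YOLO labels, mirroring an image only changes each box's x-center to `1 - cx`; width, height, and y-center are unchanged. The sketch below shows the label side only (YOLOv5 applies the flip internally during training; the function name here is illustrative).

```python
def hflip_yolo_labels(labels):
    """Mirror normalized YOLO boxes horizontally: only x_center changes."""
    return [(cls, 1.0 - cx, cy, w, h) for cls, cx, cy, w, h in labels]

# One box left of center moves to the mirrored position right of center.
labels = [(0, 0.25, 0.5, 0.2, 0.4)]
flipped = hflip_yolo_labels(labels)
restored = hflip_yolo_labels(flipped)  # flipping twice returns the original
```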

Key Dataset Characteristics & Challenges:

Occlusion: Helmets and number plates are sometimes partially hidden in real traffic scenes, helping the model learn to handle such cases.
Class distribution: The “With Helmet” class has fewer samples, providing an opportunity to further improve performance with additional data.
Lighting variability: The dataset includes both bright and low-light conditions, making the model more adaptable to real-world environments.

5. Model Design, Training & Evaluation

Architecture: YOLOv5s

Component	Details
Backbone	CSPDarknet - performs efficient feature extraction using cross-stage partial connections, reducing computational cost while maintaining strong feature representation
Neck	PANet - enhances feature fusion by combining low-level and high-level features, enabling robust detection across different object scales
Detection Heads	Three detection scales - specialized heads for small, medium, and large objects, improving detection accuracy for both tiny objects (e.g., helmets) and larger ones (e.g., riders)

YOLOv5s is a lightweight, single-stage object detection architecture designed for high-speed inference with competitive accuracy. The backbone (CSPDarknet) extracts rich spatial and semantic features from input images while keeping computation efficient, which is important for edge deployment.

The neck (PANet) plays a key role in aggregating features from multiple layers, allowing the model to retain both fine-grained details and high-level context. This is particularly useful in scenarios where objects vary significantly in size, such as small number plates and larger rider regions within the same frame.

The detection heads operate at three different scales, ensuring that objects of varying sizes are effectively detected. This multi-scale detection capability is critical for real-world traffic scenes, where object sizes can change due to distance, camera angle, and perspective.

Overall, YOLOv5s provides a strong balance between speed and accuracy, making it well-suited for real-time inference on resource-constrained devices like the Raspberry Pi 5.

Training Configuration:

Parameter	Value
Input resolution	640 × 640
Batch size	16
Epochs	50
Optimizer	SGD
Initial learning rate	0.01
LR schedule	Cosine annealing with linear warm-up
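
A minimal per-epoch sketch of a cosine-annealed schedule with linear warm-up, using the initial learning rate and epoch count from the table. The warm-up length (`warmup_epochs=3`) and the final learning-rate fraction (`final_fraction=0.01`) are assumptions, and YOLOv5's own scheduler differs in detail (for example, it warms up per iteration rather than per epoch).

```python
import math

def learning_rate(epoch, total_epochs=50, base_lr=0.01,
                  warmup_epochs=3, final_fraction=0.01):
    """Linear warm-up to base_lr, then cosine decay to base_lr * final_fraction."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (final_fraction + (1.0 - final_fraction) * cosine)

# One learning-rate value per epoch for the 50-epoch run.
schedule = [learning_rate(e) for e in range(50)]
```

The warm-up phase avoids large, unstable updates while the optimizer state is still cold; the cosine phase then lowers the rate smoothly toward the end of training.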

Evaluation Results:

Metric	Value
Precision	0.9527
Recall	0.8891
mAP@0.5	0.9430
mAP@0.5:0.95	0.6956

Per-Class AP@0.5:

Class	AP@0.5
With Helmet	0.972
Without Helmet	0.914
Rider	0.929
Number Plate	0.956
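
As a consistency check, mAP@0.5 is the mean of the per-class AP@0.5 values. The sketch below recomputes it from the table; the result (about 0.943) matches the reported 0.9430 up to rounding of the per-class figures.

```python
per_class_ap50 = {
    "With Helmet": 0.972,
    "Without Helmet": 0.914,
    "Rider": 0.929,
    "Number Plate": 0.956,
}

# mAP@0.5 is the unweighted mean over classes.
map50 = sum(per_class_ap50.values()) / len(per_class_ap50)
print(f"mAP@0.5 = {map50:.3f}")
```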

Analysis:

High precision (0.9527) indicates a low false positive rate: when the model fires, the detection is usually correct
Recall of 0.8891 reflects a small miss rate, most likely attributable to occlusion and the relatively limited “with helmet” training samples
Strong mAP@0.5 (0.9430) demonstrates robust multi-class detection performance across all four classes at standard IoU
The drop to mAP@0.5:0.95 (0.6956) is expected for a lightweight model like YOLOv5s and reflects reduced localization precision at stricter IoU thresholds - a known speed-accuracy trade-off
All three loss components (box, objectness, classification) decreased consistently across 50 epochs on both train and validation sets with no signs of overfitting, confirming good generalization given the dataset size
The optimal F1 score across all classes is 0.92 at a confidence threshold of 0.656
The confusion matrix shows “without helmet” recall at 0.80, with 20% of instances misclassified as “with helmet” - the most safety-critical error mode, likely driven by occlusion and class imbalance
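
The F1 score is the harmonic mean of precision and recall. As a rough cross-check, the sketch below plugs in the reported overall precision and recall; it yields about 0.92, in line with the reported optimal F1. This assumes the reported precision and recall correspond to roughly the same operating point as the F1 peak, which is not stated explicitly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reported overall precision and recall from the evaluation table.
f1 = f1_score(0.9527, 0.8891)
print(f"F1 = {f1:.2f}")
```
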
Figures

Figure 1: Training performance over 50 epochs showing loss curves (box, objectness, classification) along with precision, recall, and mAP metrics. The steady decrease in losses and consistent validation trends indicate stable training and good generalization.

Figure 2: Confusion matrix on the validation set. The model shows strong performance across all classes, with high recall for most categories. Minor confusion is observed between “with helmet” and “without helmet”, primarily due to occlusion and visual similarity in challenging cases.

Figure 3: Precision–Recall curves for each class and overall model performance (mAP@0.5 = 0.943). The curves demonstrate high precision across a wide recall range, indicating a well-calibrated and reliable detector.

Figure 4: Training loss curves (box, objectness, and classification) over epochs. All losses decrease steadily, indicating stable training and good model convergence without signs of overfitting.

Figure 5: Validation loss curves (box, objectness, and classification) over epochs. The consistent decrease with minor fluctuations indicates stable generalization and no significant overfitting.

Figure 6: Precision and recall trends over epochs. Both metrics improve steadily, indicating better detection performance and balanced learning as training progresses.

Figure 7: Mean Average Precision (mAP@0.5 and mAP@0.5:0.95) over epochs. The consistent increase reflects improved detection accuracy and localization performance across training.

Figure 8: F1–confidence curves for all classes. The model achieves an optimal F1 score of approximately 0.92 at a confidence threshold around 0.65, indicating a good balance between precision and recall.

Figure 9: Precision–confidence curves for all classes. Precision increases with higher confidence thresholds, indicating reduced false positives at stricter detection confidence levels.

Figure 10: Recall–confidence curves for all classes. Recall decreases as confidence increases, highlighting the trade-off between capturing all detections and maintaining prediction certainty.

6. Model Conversion & Efficiency Metrics

Conversion: PyTorch (.pt) → ONNX (.onnx) via torch.onnx.export

ONNX export is a runtime optimization, not a compression technique. It serializes the trained computation graph into a portable, framework-independent format executable by ONNX Runtime, a lightweight inference engine with CPU-oriented graph and kernel optimizations. This removes the need to install PyTorch on the Raspberry Pi, which reduces the deployment environment’s size, startup time, and memory footprint.
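
A hedged sketch of the export call, assuming `model` is an already-loaded PyTorch detection model. The opset version (`12`) and the tensor names (`images`, `output`) are assumptions, not values confirmed by the project; the YOLOv5 repository also ships its own export script, which may be what was actually used.

```python
def export_to_onnx(model, onnx_path, img_size=640, opset=12):
    """Trace the model with a dummy input and write an ONNX file."""
    import torch  # imported lazily: only the development machine needs PyTorch

    model.eval()
    # Dummy NCHW input matching the 640 x 640 training resolution.
    dummy = torch.zeros(1, 3, img_size, img_size)
    torch.onnx.export(
        model,
        dummy,
        onnx_path,
        opset_version=opset,       # assumed opset
        input_names=["images"],    # assumed tensor name
        output_names=["output"],   # assumed tensor name
    )
    return onnx_path
```

The resulting `.onnx` file is the only model artifact that needs to be copied to the Raspberry Pi.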

On-Device Performance (Raspberry Pi 5, CPU-only):

Metric	Value
Inference latency	~969 ms / frame
Throughput	~1.2 FPS
Inference runtime	ONNX Runtime (CPU)
PyTorch required on device	No

Trade-offs and Bottlenecks:

At ~969 ms per frame (~1.2 FPS), the system is suitable for controlled-intersection violation flagging, but insufficient for high-speed or multi-lane video streams
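The primary bottleneck is CPU-only execution; a dedicated accelerator such as the Coral TPU or Hailo-8 could reduce latency substantially (an estimated 10–20×, not measured in this project), although such accelerators typically require a quantized, recompiled model, so detection accuracy would need to be re-validated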
JPEG compression at quality 80 effectively reduces TCP transmission overhead with negligible impact on detection quality at typical traffic camera distances
ONNX Runtime removes PyTorch overhead on the Pi, lowering RAM usage and improving startup time, but does not address the fundamental compute limitations of a CPU-only deployment
For production use, INT8 quantization combined with hardware acceleration is the natural next step toward achieving real-time throughput (≥10 FPS)
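
A quick arithmetic sketch of what these figures imply. Under strictly serial, one-frame-at-a-time processing (an assumption of this sketch), ~969 ms per frame corresponds to roughly 1.03 FPS, slightly below the separately reported ~1.2 FPS, and a 10 FPS target leaves a budget of 100 ms per frame, i.e. close to a 10× speedup.

```python
def fps_from_latency(latency_ms):
    """Frames per second if frames are processed strictly one after another."""
    return 1000.0 / latency_ms

measured_latency_ms = 969.0   # reported on-device inference latency
target_fps = 10.0             # real-time target named above

serial_fps = fps_from_latency(measured_latency_ms)   # about 1.03 FPS
budget_ms = 1000.0 / target_fps                      # 100 ms per frame
required_speedup = measured_latency_ms / budget_ms   # about 9.7x
```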

7. Model Deployment & On-Device Performance

Deployment Steps

The deployment pipeline is designed to enable efficient edge inference on the Raspberry Pi 5 using a lightweight ONNX-based runtime. The complete flow, including preprocessing and postprocessing, is as follows:

Train the YOLOv5 model on a development machine using PyTorch
Export the trained model to ONNX format (.onnx) for optimized, framework-independent inference
Transfer the ONNX model to the Raspberry Pi 5 and set up the runtime environment (Python, OpenCV, ONNX Runtime, NumPy)
Start the inference server on the Raspberry Pi using ONNX Runtime with CPU execution and graph optimizations enabled
On the client (laptop), capture webcam frames and compress them using JPEG (quality = 80) to reduce transmission size
Send compressed frames over a TCP connection using a length-prefixed protocol to ensure reliable delivery
On the Raspberry Pi (Server-side processing):
- Decode the incoming JPEG frame into an image
- Apply letterbox preprocessing to resize the image to 640×640 while preserving aspect ratio (padding added as needed)
- Normalize pixel values and convert the image into tensor format (NCHW) for model input
Model Inference:
- Run the ONNX model using ONNX Runtime to generate raw predictions
- Output contains bounding boxes, objectness scores, and class probabilities
Postprocessing:
- Apply confidence thresholding to filter out low-confidence detections
- Convert bounding boxes from center format (x, y, w, h) to corner format (x1, y1, x2, y2)
- Apply Non-Maximum Suppression (NMS) per class to remove overlapping detections and retain the most confident ones
- Map bounding boxes back to original image coordinates by reversing letterbox scaling and padding
Package the final detections (class, confidence score, bounding box) into JSON format and send back to the client over TCP
On the client side, receive detection results and render bounding boxes, class labels, and confidence scores on the original video frame
Display real-time performance metrics such as FPS and inference latency for monitoring
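
The postprocessing steps above can be sketched with NumPy as follows. This is a minimal illustration, not the project's code: the output layout (rows of cx, cy, w, h, objectness, class probabilities) follows the usual YOLOv5 export convention, and the thresholds `conf_thres=0.25` and `iou_thres=0.45` are assumed defaults. The demo uses two classes for brevity; the project has four.

```python
import numpy as np

def xywh_to_xyxy(boxes):
    """Center format (cx, cy, w, h) -> corner format (x1, y1, x2, y2)."""
    out = np.empty_like(boxes)
    out[:, 0] = boxes[:, 0] - boxes[:, 2] / 2
    out[:, 1] = boxes[:, 1] - boxes[:, 3] / 2
    out[:, 2] = boxes[:, 0] + boxes[:, 2] / 2
    out[:, 3] = boxes[:, 1] + boxes[:, 3] / 2
    return out

def iou_one_to_many(box, others):
    """IoU between one corner-format box and an array of boxes."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms_per_class(boxes, scores, classes, iou_thres=0.45):
    """Greedy NMS run independently for each class; returns kept indices."""
    keep = []
    for cls in np.unique(classes):
        idx = np.where(classes == cls)[0]
        idx = idx[np.argsort(-scores[idx])]  # highest confidence first
        while idx.size > 0:
            best = idx[0]
            keep.append(int(best))
            rest = idx[1:]
            if rest.size == 0:
                break
            ious = iou_one_to_many(boxes[best], boxes[rest])
            idx = rest[ious <= iou_thres]  # drop boxes overlapping the winner
    return sorted(keep)

def postprocess(pred, conf_thres=0.25, iou_thres=0.45):
    """pred: (N, 5 + num_classes) rows of cx, cy, w, h, objectness, class probs."""
    scores_all = pred[:, 4:5] * pred[:, 5:]   # objectness * class probability
    classes = scores_all.argmax(axis=1)
    scores = scores_all.max(axis=1)
    mask = scores >= conf_thres               # confidence thresholding
    boxes = xywh_to_xyxy(pred[mask, :4])
    scores, classes = scores[mask], classes[mask]
    keep = nms_per_class(boxes, scores, classes, iou_thres)
    return boxes[keep], scores[keep], classes[keep]

# Synthetic predictions: two overlapping class-1 boxes, one class-0 box,
# and one low-confidence box that should be filtered out.
pred = np.array([
    [100, 100, 50, 50, 0.9, 0.1, 0.9],
    [102, 101, 50, 50, 0.8, 0.1, 0.9],
    [300, 300, 40, 40, 0.9, 0.8, 0.1],
    [400, 400, 40, 40, 0.2, 0.5, 0.5],
], dtype=np.float32)
boxes, scores, classes = postprocess(pred)
```

Running NMS per class keeps a rider box and a helmet box that overlap heavily, which a class-agnostic NMS could wrongly merge.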

This pipeline combines efficient preprocessing, optimized inference, and accurate postprocessing into a stable end-to-end, near-real-time detection system on edge hardware. The deployment removes the PyTorch dependency from the edge device and uses ONNX Runtime with graph optimizations and multi-threading for efficient CPU inference.
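
The letterbox and tensor-conversion steps of the server-side preprocessing, together with the inverse mapping applied after NMS, might look like the sketch below. It is NumPy-only for self-containment: nearest-neighbour resizing via index arrays stands in for `cv2.resize`, and the gray padding value `114` is the conventional YOLOv5 default, assumed rather than confirmed for this project.

```python
import numpy as np

def letterbox(image, new_size=640, pad_value=114):
    """Resize with preserved aspect ratio, then pad to new_size x new_size."""
    h, w = image.shape[:2]
    scale = min(new_size / h, new_size / w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Nearest-neighbour resize: pick the source row/column for each output pixel.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image[rows][:, cols]
    pad_y, pad_x = (new_size - new_h) // 2, (new_size - new_w) // 2
    canvas = np.full((new_size, new_size, 3), pad_value, dtype=image.dtype)
    canvas[pad_y:pad_y + new_h, pad_x:pad_x + new_w] = resized
    return canvas, scale, pad_x, pad_y

def to_tensor(canvas):
    """HWC uint8 -> NCHW float32 in [0, 1] (any BGR/RGB swap is omitted here)."""
    chw = canvas.astype(np.float32).transpose(2, 0, 1) / 255.0
    return np.ascontiguousarray(chw[np.newaxis])

def undo_letterbox(boxes, scale, pad_x, pad_y):
    """Map corner-format boxes from letterboxed space back to the original image."""
    boxes = boxes.astype(np.float32)
    boxes[:, [0, 2]] = (boxes[:, [0, 2]] - pad_x) / scale
    boxes[:, [1, 3]] = (boxes[:, [1, 3]] - pad_y) / scale
    return boxes

# A black 640x480 frame: no scaling needed, 80 px of padding above and below.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
canvas, scale, pad_x, pad_y = letterbox(frame)
tensor = to_tensor(canvas)
restored = undo_letterbox(np.array([[0, 80, 640, 560]]), scale, pad_x, pad_y)
```

Returning `scale`, `pad_x`, and `pad_y` from the forward step is what makes the reverse mapping exact, so the client can draw boxes on the original, unpadded frame.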

On-Device Performance

Metric	Observation
Inference latency	~900–1000 ms per frame
Throughput	~1–2 FPS
Resource utilization	CPU-based inference with multi-threading (4 threads)
Real-time behavior	Stable near real-time pipeline with continuous streaming

The system maintains consistent performance during continuous operation, with stable latency and reliable frame processing. While the frame rate is limited by CPU-only inference, the pipeline is suitable for near-real-time monitoring use cases that tolerate roughly one frame per second.
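
A sketch of how a CPU-only ONNX Runtime session with four threads and graph optimizations might be configured. The optimization level `ORT_ENABLE_ALL` is an assumption (the report only says graph optimizations were enabled), and the function names are illustrative.

```python
def create_session(model_path, num_threads=4):
    """Build a CPU-only ONNX Runtime session with graph optimizations enabled."""
    import onnxruntime as ort  # imported lazily so the sketch loads without the package

    options = ort.SessionOptions()
    options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    options.intra_op_num_threads = num_threads  # threads used inside each operator
    return ort.InferenceSession(
        model_path,
        sess_options=options,
        providers=["CPUExecutionProvider"],
    )

def run_inference(session, tensor):
    """Run one NCHW float32 tensor through the model and return the first output."""
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: tensor})[0]
```

Creating the session once at server start-up and reusing it for every frame keeps model loading and graph optimization out of the per-frame latency.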

8. System Prototype (Pictures / Figures)

The prototype demonstrates a complete distributed edge AI system with real-time interaction between a laptop and Raspberry Pi 5.

Hardware Setup: Raspberry Pi 5 acts as the inference server, while the laptop handles frame capture and visualization
Working Prototype: Frames captured from a webcam are compressed and streamed to the Pi, where inference is performed and results are sent back
Output Visualization: Detected objects are displayed with bounding boxes, class labels, confidence scores, and real-time performance metrics (FPS, inference time)

The system overlays runtime statistics such as FPS and inference latency directly on the output frames, enabling real-time monitoring of performance alongside detection results.
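
One simple way to produce the FPS figure for such an overlay is a rolling estimate over recent frame timestamps. The class below is an illustrative sketch (the name `FpsMeter` and the window size are assumptions); the resulting value could then be drawn on the frame with OpenCV's text-drawing functions.

```python
import time
from collections import deque

class FpsMeter:
    """Rolling FPS estimate over the last `window` frame timestamps."""

    def __init__(self, window=10):
        self.stamps = deque(maxlen=window)

    def tick(self, now=None):
        """Record a frame completion time and return the current FPS estimate."""
        self.stamps.append(time.perf_counter() if now is None else now)
        if len(self.stamps) < 2:
            return 0.0
        elapsed = self.stamps[-1] - self.stamps[0]
        return (len(self.stamps) - 1) / elapsed if elapsed > 0 else 0.0

# Synthetic timestamps, one frame per second, stand in for real frame arrivals.
meter = FpsMeter()
for t in (0.0, 1.0, 2.0, 3.0):
    fps = meter.tick(now=t)
```

Averaging over a short window smooths out per-frame jitter from the network while still reacting quickly to sustained slowdowns.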

Figure 11: Hardware setup showing the Raspberry Pi 5 (inference server) and laptop (client) used for real-time helmet detection.

Figure 12: System pipeline illustrating the distributed client–server architecture, including frame capture, TCP transmission, ONNX-based inference, and result communication.

Figure 13: Real-time detection output showing bounding boxes, class labels, confidence scores, and performance metrics (FPS and inference latency).

9. Conclusions & Limitations

Conclusion

A complete helmet monitoring system was successfully implemented using a distributed edge AI architecture. The system demonstrates that low-cost hardware such as the Raspberry Pi 5 can perform near-real-time object detection (~1 FPS) using optimized inference pipelines without relying on cloud infrastructure.

The use of ONNX Runtime, TCP-based streaming, and efficient preprocessing enables a practical solution for edge deployment. The system is robust, modular, and capable of continuous near-real-time operation.

Limitations

Limited FPS due to CPU-only inference on the Raspberry Pi
High latency (~1 second per frame) rules out high-speed real-time applications
Detection performance can be affected in cases of heavy occlusion
Dependence on network stability for continuous TCP streaming
Small dataset limits robustness across diverse real-world scenarios

10. Future Work

The system can be further improved in the following ways:

Integrate hardware accelerators such as Coral TPU or Hailo for significant speed improvements
Apply model quantization (e.g., INT8) to reduce inference latency and improve throughput
Optimize the pipeline using parallel processing or asynchronous inference
Extend support for multiple camera streams for large-scale monitoring
Integrate number plate recognition (OCR) for complete traffic violation detection
Upgrade to newer architectures such as YOLOv8 or YOLO-NAS for improved accuracy and efficiency

11. Challenges & Mitigation

Challenges

Running deep learning inference efficiently on resource-constrained hardware
Ensuring reliable transmission of image frames over TCP without data loss
Handling partial reads and incomplete data in socket communication
Managing large image payload sizes during real-time streaming
Debugging frame decoding issues and connection interruptions

Mitigation

Used ONNX Runtime with graph optimizations and multi-threading for efficient CPU inference
Implemented a length-prefixed TCP protocol to guarantee complete message delivery
Applied JPEG compression (quality = 80) to reduce network bandwidth usage
Built robust socket handling with exact byte reads to prevent partial frame issues
Added error handling for decode failures and connection resets to maintain stability

12. Edge AI Optimization and Deployment Impact

This project incorporates key Edge AI principles by optimizing the trained model for deployment on a resource-constrained device (Raspberry Pi 5). Instead of running inference using PyTorch, the model was converted to ONNX format and executed using ONNX Runtime, enabling efficient CPU-based inference.

Optimization Applied:

Conversion from PyTorch (.pt) to ONNX (.onnx) for lightweight inference
Use of ONNX Runtime with graph optimizations enabled
Removal of PyTorch dependency on the edge device
JPEG compression (quality = 80) to reduce network transmission overhead

Before vs After Optimization Comparison

Metric	Before Optimization (PyTorch - Training)	After Optimization (ONNX - Edge)
Runtime	PyTorch (GPU/desktop environment)	ONNX Runtime (CPU on Raspberry Pi 5)
Deployment feasibility	Not suitable for edge device	Fully deployable on Raspberry Pi
Precision	0.9527	~same (no re-evaluation performed)
Recall	0.8891	~same (expected unchanged)
mAP@0.5	0.9430	~same (expected unchanged)
Memory usage	High (PyTorch dependency)	Reduced (lightweight runtime)
Inference latency	Not measured on Pi	~969 ms per frame
Throughput	Not practical on Pi	~1.2 FPS

Observations:

ONNX conversion preserves model weights and computation graph, so no significant accuracy degradation is expected.
The primary improvement is in deployment feasibility and runtime efficiency rather than model accuracy.
ONNX Runtime enables practical inference on Raspberry Pi by reducing memory usage and removing heavy dependencies.
The system achieves stable near real-time performance (~1 FPS) on CPU-only hardware.
JPEG compression further reduces network overhead, improving end-to-end latency.

This demonstrates how model optimization and runtime selection are critical for enabling real-time Edge AI applications on low-power devices.