Edge AI-Based Real-Time Exercise Form Detection System (Push-ups + Squats)
Team: Anjesh (MTech CSA, IISc Bangalore) · Ashish Nambiar (MTech CSA, IISc Bangalore) · Garima Papnai (MTech AI, IISc Bangalore) · Shubham Bijalwan (MTech Smart Manufacturing, IISc Bangalore)
Code: GitHub Repository
1. Problem Statement, Motivation & Objectives
Form mistakes during push-ups and squats are a common cause of workout-related injuries and reduce training effectiveness. Real-time feedback typically requires a gym trainer or an expensive cloud-connected system, neither of which is practical for everyday home use. The goal of this project is to build a low-cost, portable, fully offline solution that runs entirely on an embedded device.
This project develops a real-time, on-device exercise form detection system using Edge AI on the Arduino Nicla Vision. Two lightweight MobileNetV1 models — one for push-ups and one for squats — are trained via transfer learning, quantized to INT8, and deployed directly on the microcontroller. Classification results are served through an on-device web dashboard accessible from any phone on the local network.
Why Edge AI?
- Real-time inference with minimal latency (42–53 ms per frame on-device)
- Privacy-preserving: no video or image data leaves the device
- No internet dependency — fully offline operation
- Low-cost and portable: runs on a ₹5000-range embedded board
Key Objectives:
- Collect and label a custom image dataset for push-up and squat form classification
- Train lightweight MobileNetV1-based models using transfer learning via Edge Impulse
- Apply INT8 quantization to optimize models for constrained hardware
- Deploy models on Arduino Nicla Vision (STM32H747) using the EON Compiler
- Build an on-device web dashboard for live feedback with model switching
2. Proposed Solution
The system implements a complete Edge AI pipeline — from image capture to on-device classification and live feedback:
Camera Capture → Resize 96×96 → RGB565 → EI Format (float buffer)
↓
Model Selection (Squat / Pushup)
↓
Run Inference (INT8 Model)
↓
Prediction Output (Label + Confidence)
↓
┌──────────────────────────────────────────────┐
↓ ↓
Downsample 80×60 (Streaming) On-device Web Server
↓
Web Dashboard (Live Feed + Scores)
The Arduino Nicla Vision captures frames, converts them to the Edge Impulse float buffer format, and runs inference locally using the selected INT8 model. A dual-model system allows switching between push-up and squat classifiers via the web dashboard. Results — label and confidence scores — are streamed to a phone browser in real time over the local network.
| Model | Task | Classes |
|---|---|---|
| Push-up | Binary classification | bad (incorrect form), good (correct form) |
| Squat | Multi-class classification | badform, deep, shallow |
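The model-selection and prediction-output stages can be illustrated with a minimal Python sketch. It is not the project's firmware: the names `MODEL_LABELS` and `format_prediction` are hypothetical, and the scores are assumed to arrive in the same order as the class labels in the table above.

```python
# Class labels per model, copied from the model table above.
MODEL_LABELS = {
    "pushup": ["bad", "good"],
    "squat": ["badform", "deep", "shallow"],
}


def format_prediction(model_name, scores):
    """Pair raw class scores with labels and pick the most confident class.

    Returns (top_label, top_confidence, per_class_scores).
    Assumes `scores` is ordered like MODEL_LABELS[model_name].
    """
    labels = MODEL_LABELS[model_name]
    if len(scores) != len(labels):
        raise ValueError("score count does not match the selected model")
    per_class = dict(zip(labels, scores))
    top_label = max(per_class, key=per_class.get)
    return top_label, per_class[top_label], per_class
```

Switching models then only changes which label list is paired with the classifier output; the dashboard can render `per_class` as confidence bars.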
3. Hardware & Software Setup
Hardware:
| Component | Details |
|---|---|
| Board | Arduino Nicla Vision |
| MCU | STM32H747 (Cortex-M7 @ 480 MHz + Cortex-M4) |
| Camera | Built-in onboard camera (captures exercise frames) |
| Memory | 1 MB RAM, 2 MB Flash |
| Power | Power bank (portable operation) |
Software:
| Tool | Purpose |
|---|---|
| Edge Impulse | Dataset management, model training, deployment pipeline |
| TensorFlow Lite Micro | On-device inference runtime |
| EON Compiler (RAM optimized) | Model optimization and memory-efficient deployment |
| OpenMV IDE | Firmware scripting, camera integration, and flashing on Nicla Vision |
| Arduino IDE | Firmware (.ino) development and flashing (alternative deployment path) |
| Google Colab | Burst-frame extraction pipeline (video → labeled image dataset) |
4. Data Collection & Dataset Preparation
The dataset combines multiple sources to improve robustness across lighting conditions, environments, and body variations:
- Kaggle datasets — https://www.kaggle.com/code/youssefemad004/pushups-data-videopreprocssing-data
- Zenodo squat dataset — Teng, C. (2025). Squat Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17558630
- YouTube video frame extraction — https://www.youtube.com/watch?v=txnwoJz-Rno, https://www.youtube.com/watch?v=daDK0huWvfc
- Google Images — supplementary frames for class diversity
- Self-recorded images and videos — captured using the Nicla Vision onboard camera, placed on the floor pointing at the person performing exercises (side/front view). Frames were extracted using a custom burst-frame extraction pipeline (Google Colab notebook) and labeled manually via the Edge Impulse Data Acquisition interface.
Push-up Dataset
Side-view push-up videos (correct and incorrect form) were recorded and converted to labeled image frames. The dataset was split at the video level to prevent data leakage between train and test sets.
| Split | Bad (incorrect) | Good (correct) | Total |
|---|---|---|---|
| Train | 1,509 | 1,062 | 2,571 |
| Validation | 431 | 303 | 734 |
| Test | 216 | 153 | 369 |
| Total | 2,156 | 1,518 | 3,674 |
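A video-level split like the one described above can be sketched as follows. This is an illustrative Python version, not the project's actual pipeline: the function name, the `(video_id, frame_path)` input format, and the 70/20/10 fractions (chosen to roughly match the table) are assumptions.

```python
import random
from collections import defaultdict


def split_by_video(frames, train_frac=0.7, val_frac=0.2, seed=0):
    """Split frames into train/val/test so that no video spans two splits.

    `frames` is a list of (video_id, frame_path) pairs (hypothetical format).
    """
    by_video = defaultdict(list)
    for video_id, path in frames:
        by_video[video_id].append(path)

    # Shuffle whole videos, not individual frames, to avoid leakage
    # from near-duplicate neighbouring frames.
    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)

    n_train = int(len(video_ids) * train_frac)
    n_val = int(len(video_ids) * val_frac)
    groups = {
        "train": video_ids[:n_train],
        "val": video_ids[n_train:n_train + n_val],
        "test": video_ids[n_train + n_val:],
    }
    return {
        name: [path for vid in vids for path in by_video[vid]]
        for name, vids in groups.items()
    }
```

Because the fractions apply to videos rather than frames, the resulting frame counts per split only approximate the target ratios when videos differ in length.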
Class imbalance: Bad-form samples outnumber good-form samples by ~42% (2,156 vs 1,518). Data augmentation was enabled during training to compensate. Per-class F1 reflects this: bad (0.88) vs good (0.83).
Squat Dataset
| Detail | Value |
|---|---|
| Total samples | 509 images |
| Classes | badform, deep, shallow |
| Train / Test split | 75% / 25% |
| Source | Self-collected + Kaggle + Zenodo (Teng 2025) + YouTube frames |
Preprocessing Steps
- Images resized to 96×96 pixels using “fit shortest axis” mode
- Frames extracted from videos via Colab burst-frame pipeline
- RGB565 → float buffer conversion handled in firmware at inference time
- Manual labeling performed in Edge Impulse Data Acquisition
- Noisy and ambiguous samples removed during review
- 100% of data subset used for training
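The RGB565 → float buffer step can be illustrated in Python. This is a sketch under an assumption: each pixel is packed as a single float holding `0xRRGGBB`, which is the layout used in Edge Impulse's public camera examples. The project's firmware may differ, and the function names here are hypothetical.

```python
def rgb565_to_rgb888(pixel):
    """Expand one 16-bit RGB565 pixel to 8-bit (r, g, b) channels."""
    r5 = (pixel >> 11) & 0x1F  # top 5 bits: red
    g6 = (pixel >> 5) & 0x3F   # middle 6 bits: green
    b5 = pixel & 0x1F          # low 5 bits: blue
    # Replicate the high bits into the low bits so full scale maps to 255.
    r = (r5 << 3) | (r5 >> 2)
    g = (g6 << 2) | (g6 >> 4)
    b = (b5 << 3) | (b5 >> 2)
    return r, g, b


def rgb565_frame_to_features(pixels):
    """Pack each pixel as float(0xRRGGBB), one value per pixel (assumed layout)."""
    features = []
    for pixel in pixels:
        r, g, b = rgb565_to_rgb888(pixel)
        features.append(float((r << 16) | (g << 8) | b))
    return features
```

For a 96×96 frame this yields 9,216 packed values; the image processing block is then assumed to unpack them into per-channel features.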
Edge Impulse Impulse Design
| Block | Push-up Config | Squat Config |
|---|---|---|
| Input | 96×96 image, fit shortest axis | 96×96 image, fit shortest axis |
| Processing | Image block | Image block |
| Learning | Transfer Learning (MobileNetV1) | Transfer Learning (MobileNetV1) |
| Output features | 2 classes: bad, good | 3 classes: badform, deep, shallow |
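To make the input block concrete, the sketch below computes the geometry of a "fit shortest axis" resize and the resulting feature count. It assumes this mode scales the image so its shorter side equals the target and then center-crops the longer side; that reading of the Edge Impulse option, and the function names, are assumptions rather than details confirmed by the project.

```python
def fit_shortest_axis(width, height, target=96):
    """Return (scaled_w, scaled_h, crop_box) for a center-cropped square resize.

    crop_box is (left, top, right, bottom) in the scaled image.
    """
    scale = target / min(width, height)
    scaled_w = max(target, round(width * scale))
    scaled_h = max(target, round(height * scale))
    left = (scaled_w - target) // 2
    top = (scaled_h - target) // 2
    return scaled_w, scaled_h, (left, top, left + target, top + target)


def input_features(target=96, channels=3):
    """Number of values fed to the model for a square input."""
    return target * target * channels
```

With three channels a 96×96 input gives 27,648 features and with one channel 9,216, which matches the two input feature counts reported in the training configurations (interpreting them as RGB and grayscale inputs is an inference).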
5. Model Design, Training & Evaluation
Both models use MobileNetV1 as the backbone with transfer learning (ImageNet pretrained weights fine-tuned on exercise data), implemented via Edge Impulse’s Transfer Learning block.
Push-up Model
Training Configuration:
| Parameter | Value |
|---|---|
| Architecture | MobileNetV1 96×96 0.1 (final layer: 8 neurons, 0.1 dropout) |
| Input features | 27,648 |
| Epochs | 50 |
| Learning rate | 0.0005 |
| Data augmentation | Enabled |
| Training processor | CPU |
Note: Initial training used a larger MobileNetV1 variant, which caused a Flash overflow on the Nicla Vision. The architecture was reduced to MobileNetV1 0.1 to fit within device memory.
Results (Validation Set — Quantized INT8):
| Metric | Value |
|---|---|
| Accuracy | 85.9% |
| Loss | 0.30 |
| ROC-AUC | 0.86 |
| Weighted Precision | 0.87 |
| Weighted Recall | 0.86 |
| Weighted F1 Score | 0.86 |
Confusion Matrix:
| | Predicted: BAD | Predicted: GOOD |
|---|---|---|
| Actual: BAD | 84.4% | 15.6% |
| Actual: GOOD | 11.8% | 88.2% |
| F1 Score | 0.88 | 0.83 |
Interpretation: The classifier separates correct and incorrect push-up postures reasonably well, with 84.4% of bad-form and 88.2% of good-form frames classified correctly. The slightly lower good-form F1 (0.83 vs 0.88) is consistent with the dataset class imbalance, which data augmentation only partially compensates for.
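The reported accuracy and per-class F1 scores can be approximately re-derived from the row-normalized confusion matrix. The Python sketch below does this under one assumption: the per-class sample counts are taken from the Validation row of the push-up dataset table (431 bad, 303 good), which may not be the exact set Edge Impulse evaluated on. The function name is hypothetical.

```python
def metrics_from_row_rates(labels, row_rates, support):
    """Compute accuracy and per-class F1 from a row-normalized confusion matrix.

    row_rates[i][j] is the fraction of class-i samples predicted as class j;
    support[i] is the (assumed) number of samples of class i.
    """
    n = len(labels)
    counts = [[row_rates[i][j] * support[i] for j in range(n)] for i in range(n)]
    accuracy = sum(counts[i][i] for i in range(n)) / sum(support)

    f1 = {}
    for k in range(n):
        tp = counts[k][k]
        predicted = sum(counts[i][k] for i in range(n))  # column sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / support[k] if support[k] else 0.0
        denom = precision + recall
        f1[labels[k]] = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


accuracy, f1 = metrics_from_row_rates(
    ["bad", "good"],
    [[0.844, 0.156], [0.118, 0.882]],
    [431, 303],  # assumed supports from the Validation row
)
```

Under these assumptions the results land close to the reported 85.9% accuracy and F1 scores of 0.88 and 0.83; small differences are expected because the matrix percentages are rounded.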
Squat Model
Training Configuration:
| Parameter | Value |
|---|---|
| Architecture | MobileNetV1 96×96 0.25 (no final dense layer, 0.1 dropout) |
| Input features | 9,216 |
| Epochs | 80 |
| Learning rate | 0.0004 |
| Data augmentation | Disabled |
| Training processor | CPU |
Results (Validation Set — Quantized INT8):
| Metric | Value |
|---|---|
| Accuracy | 79.2% |
| Loss | 0.48 |
| ROC-AUC | 0.93 |
| Weighted Precision | 0.80 |
| Weighted Recall | 0.79 |
| Weighted F1 Score | 0.80 |
Confusion Matrix:
| | Predicted: BADFORM | Predicted: DEEP | Predicted: SHALLOW |
|---|---|---|---|
| Actual: BADFORM | 84.8% | 15.2% | 0% |
| Actual: DEEP | 5% | 70% | 25% |
| Actual: SHALLOW | 4.2% | 16.7% | 79.2% |
| F1 Score | 0.89 | 0.65 | 0.79 |
Interpretation: Badform detection is strong (F1: 0.89). The main confusion is between deep and shallow squats (25% of deep squats are misclassified as shallow), which is expected given the high visual similarity at intermediate squat depths. The high ROC-AUC (0.93) indicates strong overall discriminative ability.
Model Comparison Summary
| Model | Task | Classes | Accuracy | Weighted F1 | ROC-AUC |
|---|---|---|---|---|---|
| Push-up | Binary | 2 | 85.9% | 0.86 | 0.86 |
| Squat | Multi-class | 3 | 79.2% | 0.80 | 0.93 |
6. Model Compression & Efficiency Metrics
Technique: INT8 Post-Training Quantization via Edge Impulse EON Compiler (RAM-optimized)
All weights and activations are quantized from 32-bit float to 8-bit integer, enabling faster fixed-point arithmetic on the Cortex-M7 and significantly reducing memory footprint.
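The float-to-INT8 mapping can be illustrated with the standard affine scheme used by TensorFlow Lite, where a real value is approximated as `(q - zero_point) * scale`. The Python sketch below shows the arithmetic only; the scale and zero-point values are made up for illustration and are not taken from the project's models.

```python
INT8_MIN, INT8_MAX = -128, 127


def quantize(x, scale, zero_point):
    """Map a float to a signed 8-bit integer, saturating at the INT8 range."""
    q = round(x / scale) + zero_point
    return max(INT8_MIN, min(INT8_MAX, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate float value represented by q."""
    return (q - zero_point) * scale
```

Values outside the representable range saturate, and values inside it are rounded to the nearest step of size `scale`, which is the source of the small accuracy difference between float and quantized models.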
Compression Results (target: Arduino Nicla Vision, Cortex-M7 @ 480 MHz)
| Metric | Float32 (Unoptimized) | INT8 (Quantized) | Improvement |
|---|---|---|---|
| Total Latency | 141 ms | 54 ms | ~2.6× faster |
| Peak RAM | 300.1 KB | 124.7 KB | ~58% reduction |
| Flash | 857.5 KB | 284.0 KB | ~67% reduction |
Key Observations:
- ~2.6× faster inference enables real-time operation
- ~58% RAM reduction (300.1 KB → 124.7 KB) keeps peak RAM well within Nicla Vision’s 1 MB limit
- ~67% Flash reduction leaves headroom for firmware and web server
- Negligible accuracy loss between INT8 and Float32 on validation set
- EON Compiler further reduces footprint beyond standard TFLite quantization
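The improvement figures in the compression table follow directly from the before and after values. A short Python check, with hypothetical helper names:

```python
def speedup(before, after):
    """How many times faster the optimized version is."""
    return before / after


def reduction_pct(before, after):
    """Percentage reduction relative to the unoptimized value."""
    return 100.0 * (before - after) / before


latency_gain = speedup(141, 54)            # Float32 vs INT8 latency in ms
ram_saving = reduction_pct(300.1, 124.7)   # peak RAM in KB
flash_saving = reduction_pct(857.5, 284.0) # Flash in KB
```

Rounded, these give the ~2.6× speedup and the ~58% and ~67% reductions listed above.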
7. Model Deployment & On-Device Performance
Deployment Steps:
- Train and validate models on Edge Impulse Studio (cloud)
- Select INT8 quantization + EON Compiler (RAM-optimized) in deployment settings
- Export as OpenMV library (`.zip`) targeting Arduino Nicla Vision
- Flash firmware to Nicla Vision using OpenMV IDE (primary) or Arduino IDE via `.ino` file
- Arduino `.ino` firmware handles: image capture → preprocessing → inference → web server output
- Add WiFi credentials in `arduino_secrets.h` and connect device to hotspot/router
- Results are streamed to the on-device web dashboard; access it via the IP shown in Serial Monitor from any phone browser on the local network
On-Device Performance (EON Compiler, RAM-optimized):
| Model | Inference Time | Peak RAM | Flash Usage |
|---|---|---|---|
| Push-up (INT8, EON) | 42 ms | 77.7 KB | 106.3 KB |
| Squat (INT8, EON) | 53 ms | 115.9 KB | 295.8 KB |
Both models run comfortably within Nicla Vision hardware limits (1 MB RAM, 2 MB Flash). At 42–53 ms per inference, the system achieves approximately 19–24 classifications per second — sufficient for real-time exercise monitoring.
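The throughput and headroom claims can be checked with a few lines of Python. The figures are copied from the on-device performance table; the helper names are hypothetical. The rate is an upper bound based on inference time alone, since it assumes no extra time for camera capture or streaming.

```python
RAM_LIMIT_KB = 1024    # 1 MB RAM
FLASH_LIMIT_KB = 2048  # 2 MB Flash

MODELS = {
    "pushup": {"latency_ms": 42, "ram_kb": 77.7, "flash_kb": 106.3},
    "squat": {"latency_ms": 53, "ram_kb": 115.9, "flash_kb": 295.8},
}


def inferences_per_second(latency_ms):
    """Upper bound on classification rate from inference latency alone."""
    return 1000.0 / latency_ms


def fits_on_device(model):
    """True if the model's peak RAM and Flash are within the board limits."""
    return model["ram_kb"] <= RAM_LIMIT_KB and model["flash_kb"] <= FLASH_LIMIT_KB


rates = {name: inferences_per_second(m["latency_ms"]) for name, m in MODELS.items()}
```

The push-up model comes out at roughly 24 inferences per second and the squat model at roughly 19, matching the range quoted above.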
8. System Prototype
System Pipeline
The full end-to-end pipeline implemented in firmware:
Camera Capture (Nicla Vision)
↓
Resize 96×96 (Inference)
↓
RGB565 → EI Format (float buffer)
↓ ↘
Model Selection Downsample 80×60 (Streaming)
(Squat / Pushup)
↓
Run Inference (INT8 Model)
↓
Prediction Output (Label + Confidence)
↓
On-device Web Server
↓
Web Dashboard (Live Feed + Scores)
On-Device Web Dashboard
The firmware hosts a lightweight web server accessible from any phone browser on the local network. The dashboard (“Exercise Classifier”) provides:
- Model toggle buttons: Switch between Squat and Pushup classifier in real time
- Active model label: Shows which model is currently running
- Live camera feed: Downsampled 80×60 grayscale stream from Nicla Vision
- Classification output: Current label (good/bad/badform/deep/shallow) displayed prominently
- Confidence bars: Per-class confidence scores shown as progress bars
Live Demo
The system was demonstrated live with a person performing push-ups in front of the Nicla Vision (placed on the floor). The phone browser showed real-time classification with confidence scores updating with each inference cycle. A demo video is available in the repository.
9. Conclusions & Limitations
Conclusions:
- Successfully deployed a real-time Edge AI exercise form detection system on the Arduino Nicla Vision with no cloud dependency
- Push-up binary classifier achieved 85.9% accuracy with strong bad/good form separation
- Squat multi-class classifier achieved 79.2% accuracy with excellent badform detection (F1: 0.89) and high discriminative ability (ROC-AUC: 0.93)
- INT8 quantization reduced inference latency by ~2.6× and RAM by ~58% with negligible accuracy loss
- On-device web dashboard enables live feedback accessible from any phone — no app install needed
Limitations:
- Sensitive to lighting conditions — performance degrades in low or inconsistent light
- Single-person detection only — system not designed for multi-person scenes
- Accuracy depends heavily on dataset quality and camera angle consistency
- No temporal modeling — each frame classified independently, ignoring motion context
- Deep vs. shallow squat confusion (25%) due to high visual similarity at intermediate depths
- Limited environmental diversity in training data (backgrounds, body types, angles)
10. Future Work
- Add more exercises (lunges, planks, deadlifts, burpees)
- Use pose estimation (keypoints) — e.g., MediaPipe Pose — for skeleton-based form analysis
- Add multi-person support for group fitness settings
- Mobile app integration for a richer user experience beyond the browser dashboard
- Add repetition counting using temporal state tracking
- Expand dataset with diverse users, lighting, and camera angles
- Explore temporal models (TCN, LSTM) using IMU data for motion-aware classification
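As a first, much simpler step toward the temporal ideas above, per-frame labels could be smoothed with a majority vote over the most recent frames. The Python sketch below is a hypothetical illustration, not part of the current system; the class name and window size are assumptions.

```python
from collections import Counter, deque


class LabelSmoother:
    """Majority vote over the last `window` per-frame labels."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, label):
        """Add the newest frame label and return the most frequent recent one."""
        self.history.append(label)
        # On ties, Counter keeps the label seen first within the window.
        return Counter(self.history).most_common(1)[0][0]
```

Usage: call `update()` once per inference and display the returned label instead of the raw one. A larger window suppresses more single-frame flicker at the cost of slower reaction to real changes.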
11. Challenges & Mitigation
| Challenge | Impact | Mitigation Applied |
|---|---|---|
| Video frame extraction created near-duplicate images | Risk of data leakage between train/test | Split performed at video level, not frame level |
| Class imbalance in push-up dataset (bad > good) | Biased model, lower good-form F1 | Data augmentation enabled; per-class F1 monitored |
| Initial larger model caused Flash overflow on Nicla Vision | Deployment failure | Switched to MobileNetV1 0.1; applied INT8 + EON Compiler |
| High inference latency on Float32 model (141 ms) | Not real-time capable | INT8 quantization brought latency to 42–53 ms |
| Deep vs. shallow squat class overlap | 25% misclassification | Increased epochs to 80; acknowledged as fundamental visual ambiguity |
| Sensitivity to lighting during live demo | Inconsistent inference results | Controlled demo environment; noted as key limitation for future work |
12. References
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications — Howard et al., 2017 — https://arxiv.org/abs/1704.04861
- TensorFlow Lite Micro — https://www.tensorflow.org/lite/microcontrollers
- Edge Impulse Documentation — https://docs.edgeimpulse.com
- Arduino Nicla Vision Documentation — https://docs.arduino.cc/hardware/nicla-vision/
- OpenMV IDE — https://openmv.io/
- Push-up Dataset (Kaggle) — https://www.kaggle.com/code/youssefemad004/pushups-data-videopreprocssing-data
- Squat Dataset (Zenodo) — Teng, C. (2025). Squat Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17558630