Air Piano: Real-Time Finger Detection for Virtual Piano Playing

Introduction

Instead of playing a heavy, expensive instrument like a piano, you can wave your hands in the air to create sounds!

Using the Arduino Nicla Vision board, we built a system to detect finger movements and turn them into music. A FOMO-based model was developed to detect these gestures in real time, achieving a validation F1 score of 78.1%. To run on Nicla Vision’s limited resources, the model was optimized and compressed into a 56 KB TFLite file.

The pipeline translates detected finger positions into 8 virtual piano keys, sending serial commands to a connected computer to produce music.

This project brings together creativity, machine learning, and embedded systems to deliver a real-time, interactive musical experience.

  • No heavy instruments: Pianos are hard to move around. Our system makes music creation lightweight and portable.
  • Low cost: Traditional instruments are expensive. This system makes music-making affordable for everyone.
  • Interactive and engaging: Play music just by waving your fingers in the air, an intuitive and almost magical experience!

Air Piano System

Methodology

Hardware Requirements

  • Arduino Nicla Vision

Software Used

  • Edge Impulse Studio / Google Colab – for training and deploying the FOMO object detection model
  • OpenMV IDE – for programming Nicla Vision and integrating the model with custom logic
  • Python – for preprocessing images and linking gestures to sound playback via serial communication
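
For that last piece, here is a minimal host-side sketch, assuming pyserial for the serial link and pygame for playback; the port name, the note file names, and the "KEY <n>" message format are illustrative assumptions (they match the device-side loop sketched under Working Principle):

    import serial
    import pygame

    pygame.mixer.init()
    # One pre-recorded note per virtual key (file names are placeholders).
    notes = [pygame.mixer.Sound("note_%d.wav" % i) for i in range(8)]

    ser = serial.Serial("/dev/ttyACM0", 115200, timeout=1)  # adjust port as needed
    while True:
        line = ser.readline().decode(errors="ignore").strip()
        if line.startswith("KEY"):
            key = int(line.split()[1])  # zone index 0-7 sent by the board
            notes[key].play()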

Working Principle

Step 1: Camera Initialization

Nicla Vision’s camera captures frames continuously.

Step 2: Image Preprocessing

Each frame is cropped and resized for the FOMO model (96×96 grayscale).
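
A minimal OpenMV sketch for Steps 1 and 2, assuming the standard sensor module; the centered 240×240 crop window is an assumption consistent with the 240 px key zones used below:

    import sensor

    # Step 1: initialize the Nicla Vision camera.
    sensor.reset()
    sensor.set_pixformat(sensor.GRAYSCALE)  # the FOMO model expects grayscale
    sensor.set_framesize(sensor.QVGA)       # 320x240 capture
    sensor.set_windowing((240, 240))        # crop to a centered square
    sensor.skip_frames(time=2000)           # let auto exposure settle

    # Step 2: grab a frame; the inference library scales it down to the
    # model's 96x96 input when the network is run.
    img = sensor.snapshot()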

Step 3: Running the Model

The model detects the index finger and returns:

  • Label
  • Confidence score
  • Bounding box: (x, y, width, height)
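
Continuing from the captured frame, a sketch of Step 3 modeled on the OpenMV example code that Edge Impulse generates for FOMO deployments (module and method names vary across firmware versions, so treat the details as assumptions):

    import math
    import tf

    labels = ["background", "finger", "fist"]  # assumed label order
    net = tf.load("trained.tflite", load_to_fb=True)

    min_confidence = 0.5
    # net.detect() returns one detection list per output class.
    for i, detection_list in enumerate(
            net.detect(img, thresholds=[(math.ceil(min_confidence * 255), 255)])):
        if i == 0:
            continue  # skip the background class
        for d in detection_list:
            x, y, w, h = d.rect()                       # bounding box
            print(labels[i], d.output(), (x, y, w, h))  # label + confidence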

Step 4: Interpreting Results

  • Bounding box area = width × height
  • Larger area → finger closer to the downward-facing camera, i.e. raised (release)
  • Smaller area → finger farther from the camera, i.e. lowered toward the table (press)

Step 5: Mapping to Piano Keys

  • Image width (240px) is split into 8 zones (30px each)
  • Bounding box center (x + width / 2) → mapped to key: key = center_x // 30
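
The zone mapping as a small helper (pure Python; the clamp to key 7 is an added safety guard for a center landing exactly on the right edge):

    NUM_KEYS = 8
    ZONE_WIDTH = 240 // NUM_KEYS  # 30 px per key

    def map_to_key(x, w):
        """Map a bounding box to one of the 8 virtual piano keys."""
        center_x = x + w // 2
        return min(center_x // ZONE_WIDTH, NUM_KEYS - 1)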

Step 6: Detecting Press Events

Frame-to-frame changes in the bounding box area distinguish press events from releases.
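
A sketch of that heuristic, combining Steps 4 and 6; the AREA_DELTA threshold is a placeholder that the project tuned by trial and error (see Challenges and Workarounds):

    AREA_DELTA = 400  # minimum area change that counts as movement; tuned empirically
    prev_area = None
    pressed = False

    def update_press_state(w, h):
        """Return 'press' or 'release' when the finger moves enough, else None."""
        global prev_area, pressed
        area = w * h
        event = None
        if prev_area is not None:
            if not pressed and prev_area - area > AREA_DELTA:
                pressed, event = True, "press"     # shrinking box: finger went down
            elif pressed and area - prev_area > AREA_DELTA:
                pressed, event = False, "release"  # growing box: finger lifted
        prev_area = area
        return event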

Step 7: Continuous Loop

Detection, key mapping, and serial output run in a continuous real-time loop, while the connected computer handles sound playback.
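
Tying the earlier sketches together, the Step 7 loop could look like this; on OpenMV, print() output goes over the USB virtual COM port, which is how commands reach the computer (the "KEY"/"REL" message format is an assumption):

    while True:
        img = sensor.snapshot()
        for i, detection_list in enumerate(
                net.detect(img, thresholds=[(math.ceil(0.5 * 255), 255)])):
            if labels[i] != "finger":
                continue
            for d in detection_list:
                x, y, w, h = d.rect()
                key = map_to_key(x, w)
                event = update_press_state(w, h)
                if event == "press":
                    print("KEY", key)  # serial command read by the host script
                elif event == "release":
                    print("REL", key)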


Data Collection

  • Device: Arduino Nicla Vision
  • Setup: Mounted 35 cm above tabletop
  • Captured: 3-minute video → ~683 frames

Dataset Breakdown

  • 546 images for training
  • 137 images for validation

Labeled Classes

  • Index Finger (with bounding box)
  • Fist (with bounding box)
  • Background (no annotations)

Data Augmentation

  • Transformations: Rotation, scaling, flipping
  • Final dataset size: 1,689 images
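
For illustration, the listed transformations expressed with Pillow (the project applied augmentation in the training pipeline, so this offline sketch and its parameters are only indicative):

    from PIL import Image

    def augment(path):
        """Yield rotated, scaled, and flipped variants of one image."""
        img = Image.open(path)
        yield img.rotate(15)                            # rotation
        w, h = img.size
        yield img.resize((int(w * 0.9), int(h * 0.9)))  # scaling
        yield img.transpose(Image.FLIP_LEFT_RIGHT)      # horizontal flip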

Model Development and Compression

Phase 1: Initial Model

  • Input: RGB 320×240 → downscaled to 48×48 grayscale
  • Pre-quantization model size: 351.41 KB, accuracy: 74.34%
  • Post-quantization (int8): 96.40 KB, accuracy: 55.75%
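
The int8 step corresponds to standard TensorFlow Lite post-training quantization; a sketch, where model is the trained Keras network and training_frames is a placeholder for preprocessed training images:

    import tensorflow as tf

    def representative_data():
        # Placeholder: yield a few hundred preprocessed 48x48 training frames
        # so the converter can calibrate the int8 ranges.
        for frame in training_frames:
            yield [frame[None, ...].astype("float32")]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

    open("model_int8.tflite", "wb").write(converter.convert())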

Phase 2: Optimized FOMO Model (Edge Impulse)

  • Architecture: FOMO (MobileNetV2 0.35)
  • Input: 96×96 grayscale
  • Output classes: Finger, Fist

Performance

  • F1 Score: 78.1%
  • Precision (non-bg): 75%
  • Recall (non-bg): 82%
  • Inference time: 65 ms

Final Deployed Model

  • Format: trained.tflite
  • Size: 56.0 KB

Final Testing Results

  • Accuracy: 70.73%
  • Precision: 70%
  • Recall: 78%
  • F1 Score: 74%

Prototype and Demo


Challenges and Workarounds

  • Gesture Confusion (Finger vs Fist):
    Solution: Added more diverse training samples + augmentation → F1 score boosted to 78.1%.

  • Laggy Performance:
    Solution: Reduced input to 96×96 grayscale + model optimization → achieved real-time response.

  • Press Threshold Tuning:
    Solution: Trial-and-error for bounding box area threshold → achieved reliable press detection.

