Real-Time Two-Way ASL Translator on Raspberry Pi 5


Team: Liz Maria George, Naznin Amirul Haque, Ayush Kumar, Prayanshu Sharma
(M.Tech. MVLSI, ECE, IISc)
Code: GitHub Repository

1. Introduction


Communication between ASL users and non-signers is a persistent everyday barrier. This project builds a real-time, two-way ASL translator that runs entirely on a Raspberry Pi 5 โ€” no cloud ML inference required.

Two operating modes:

Mode Input Output
๐ŸคŸ Speaker Pi Camera (live signs) Spoken audio via BT speaker
๐ŸŽ™๏ธ Listener Bluetooth microphone Transcribed text on touchscreen

The Sign-to-Speech pipeline recognises 19 ASL signs using MediaPipe landmark extraction + a CNN classifier (TFLite). Speech-to-Text uses Google Speech Recognition API.

2. System Architecture


High-Level Pipeline

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                  TWO-WAY ASL TRANSLATOR                  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚  SIGN-TO-SPEECH (Speaker Mode)                           โ”‚
โ”‚  Pi Camera โ”€โ–บ MediaPipe Holistic โ”€โ–บ Features โ”€โ–บ CNN     โ”‚
โ”‚  640ร—480        Pose+Hands (75 kp)   (30,504)  TFLite    โ”‚
โ”‚                                               โ”€โ–บ espeak-ngโ”‚
โ”‚                                                          โ”‚
โ”‚  SPEECH-TO-TEXT (Listener Mode)                          โ”‚
โ”‚  BT Mic โ”€โ–บ PyAudio โ”€โ–บ SpeechRecognition โ”€โ–บ Display      โ”‚
โ”‚  Sony XB12                 Google STT API    (Tkinter)   โ”‚
โ”‚                                                          โ”‚
โ”‚  SHARED: Tkinter GUI โ”‚ Threaded workers โ”‚ Mode toggle    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

System_Architecture

Figure 1: Top level system architecture

Sign-to-Speech: Step-by-Step

Picamera2          MediaPipe Holistic        Feature Eng.
640ร—480 @ 30fps โ”€โ”€โ–บ Pose (33 kp)      โ”€โ”€โ–บ  252-dim/frame
                    Left Hand (21 kp)        + velocity
                    Right Hand (21 kp)  โ”€โ”€โ–บ  504-dim/frame
                    (every 2nd frame)
                                        โ”€โ”€โ–บ  (30, 504) tensor
                                             โ†“
                                         TFLite CNN
                                         conf > 85%?
                                             โ†“
                                         espeak-ng TTS

Threading Architecture

Main Thread (Tkinter loop, 15 ms tick)
 โ”œโ”€โ”€ update_frame()  โ† camera + MediaPipe + feature buffer + render
 โ”œโ”€โ”€ InferenceWorker โ† TFLite on (30,504) via maxsize-1 queue
 โ”œโ”€โ”€ TTSWorker       โ† espeak-ng subprocess, flushes stale requests
 โ””โ”€โ”€ STTWorker       โ† PyAudio + Google STT (Listener mode only)

3. Hardware


Component Specification
Compute Raspberry Pi 5 (8 GB RAM)
Camera Pi Camera Module (CSI ribbon)
Display Official Pi Touchscreen (DSI)
Audio I/O Sony SRS-XB12 (BT speaker + mic)
OS Raspberry Pi OS Bookworm 64-bit

4. Dataset


Data collected on the Pi itself (same camera as deployment) to eliminate domain gap.

Property Value
Signs 19 common ASL words
Clips per sign 180 (ร—30 frames each)
Total clips 3,420
Resolution 640ร—480 @ 30 FPS
Format .mp4 โ†’ .npy (MediaPipe features)
Split 85% train/val ยท 15% held-out test

Supported signs:

ย  ย  ย  ย 
yes no hello please
thank you water more home
good bad come go
stop time me food
want help (18 shown) ย 

5. Feature Engineering


Raw landmarks โ†’ engineered features for robustness to hand size, distance, and orientation.

Per-Frame Feature Vector (252-dim)

Feature Group Dims Description
Pose landmarks 99 33 body keypoints ร— (x,y,z) โ€” spatial context
Right hand (wrist-normalised) 63 21 kp ร— 3, scaled by wrist-to-knuckle distance
Left hand (wrist-normalised) 63 Same normalisation
Right fingertip distances 10 C(5,2) pairwise distances (shape descriptor)
Left fingertip distances 10 Same
Right palm normal 3 Unit cross-product vector (hand orientation)
Left palm normal 3 Same
Inter-hand distance 1 Wrist-to-wrist distance (key for two-handed signs)

+ Velocity features: np.diff of all 252 base features = 504 features/frame
Model input tensor: (30 frames ร— 504 features)

6. Model Training


Training on GPU workstation (train.py) โ€” 3 automated stages:

Stage 1 โ€” Neural Architecture Search (NAS)

Keras Tuner Hyperband searches over 1D-CNN vs LSTM, filters/units {32,64,96,128}, kernel sizes {3,5}, dropout {0.2โ€“0.5}, LR {1e-3, 1e-4}. Up to 50 epochs per trial.

Stage 2 โ€” 5-Fold Stratified Cross-Validation

Best NAS architecture trained for 100 epochs ร— 5 folds โ†’ mean val accuracy + std dev + mean best epoch (used as fixed epoch count for Stage 3).

Stage 3 โ€” LR Scheduler Comparison

Scheduler Description
Constant Fixed LR
Cosine Decay Smooth anneal to zero
Exponential Decay Rate 0.9 per 1/10th of steps
ReduceLROnPlateau Halves LR after 5 stagnant epochs

Best scheduler โ†’ final exported model.

Confusion Matrix

Figure 2: Confusion matrix on held-out test set (19 classes)

7. Model Optimisation & Quantisation


Property FP32 Model INT8 Model
File size 628 KB 163 KB (โˆ’74%)
Tensor RAM 771.9 KB 252.5 KB (โˆ’67%)
x86 latency 0.082 ms 0.023 ms
Test accuracy 99.12% 99.12%
  • FP32: Standard TFLite conversion (TFLITE_BUILTINS + SELECT_TF_OPS)
  • INT8: Post-training integer quantisation using 200 representative calibration samples; input/output remain float32 for compatibility
  • Both models swappable at runtime via GUI radio buttons

8. Deployment Optimisations


Optimisation Effect
MediaPipe every 2nd frame + cache ~50% CPU reduction
Sliding window: 30-frame, 15-frame step Balances latency vs compute
Min. 8 hand-presence frames before inference Suppresses false triggers
Inference queue maxsize=1 Drops stale sequences instantly
TTS queue flush on new word No audio pileup
2-second same-word cooldown Prevents repetition

9. Application Interface


Tkinter GUI designed for touchscreen:

  • Live camera feed (640ร—480) with MediaPipe skeleton overlay
  • Transcript box โ€” 22pt bold text for detected sign / transcribed speech
  • Mode toggle โ€” Speaker โ†” Listener
  • Model selector โ€” FP32 / INT8 radio buttons (hot-swap)
  • Real-time metrics โ€” FPS + end-to-end inference latency (ms)
  • Status bar โ€” confidence score, hand tracking state, current mode

GUI Screenshot - Speaker Mode

Figure 3: GUI in Speaker mode โ€” live skeleton overlay + detected sign

GUI Screenshot - Listener Mode

Figure 4: GUI in Listener mode โ€” transcribed speech displayed

10. Results


Metric FP32 INT8
Test accuracy 99.12% 99.12%
Model size 628 KB 163 KB
x86 inference latency 0.082 ms 0.023 ms
Pi 5 end-to-end latency ~0.5 ms ~0.5 ms
GUI frame rate 20โ€“30 FPS 20โ€“30 FPS
No. of classes 19 19
Confidence threshold 85% 85%

INT8 quantisation achieves 4ร— model compression and 3.6ร— faster inference with zero accuracy loss.

11. Challenges & Learnings


Challenge Solution
Domain gap (laptop vs Pi camera) Collected all training data on the Pi itself
MediaPipe CPU cost on Pi 5 Process every 2nd frame; reuse cached results
Bluetooth audio latency (~100โ€“200 ms) Used espeak-ng for fast local synthesis
Hand-size / distance variance Wrist-normalised + hand-scale features
Whisper on-device STT (deferred) Used Google Speech API; distil-whisper planned for v2

12. Future Work


  • Vocabulary expansion: Fingerspelling (Aโ€“Z) + more conversational signs
  • Fully offline STT: On-device distil-Whisper to replace Google Speech API
  • Sentence construction: Sequential sign โ†’ grammatical English via lightweight LM
  • Wearable form factor: Chest-/head-mounted camera on smaller SBC or custom PCB
  • Continuous learning: Few-shot on-device fine-tuning for user-specific signs

13. Quick-Start Reproduction


# 1. Clone
git clone https://github.com/<your-username>/Realtime_ASL_RasPi.git && cd Realtime_ASL_RasPi

# 2. System deps (Pi OS Bookworm)
sudo apt update && sudo apt install -y \
    python3-pip python3-opencv espeak-ng portaudio19-dev \
    python3-tk libatlas-base-dev bluetooth bluez

# 3. Python deps
pip3 install mediapipe numpy opencv-python Pillow \
    SpeechRecognition tflite-runtime picamera2

# 4. Download dataset (optional โ€“ to retrain)
python3 fetch_dataset.py

# 5. Run the translator
cd Demo_RasPi && python3 pi_two_way_translator_V2.py

See README.md for full retraining pipeline (data_capture โ†’ preprocess โ†’ train).

14. Project Structure


Realtime_ASL_RasPi/
โ”œโ”€โ”€ Demo_RasPi/
โ”‚   โ”œโ”€โ”€ pi_two_way_translator_V2.py   # Main app
โ”‚   โ”œโ”€โ”€ sign_model_fp32.tflite        # 628 KB
โ”‚   โ””โ”€โ”€ sign_model_qat.tflite         # 163 KB (INT8)
โ””โ”€โ”€ sign_to_speech/
    โ”œโ”€โ”€ 01_data_collection_raspi/
    โ”‚   โ”œโ”€โ”€ data_capture_raspi_V3.py  # On-device collection
    โ”‚   โ””โ”€โ”€ preprocess.py             # Feature extraction
    โ””โ”€โ”€ 02_training/
        โ””โ”€โ”€ train.py                  # NAS + K-Fold + LR search

15. References


AI Tools Disclosure: GitHub Copilot used for Tkinter boilerplate. Claude used for documentation. All architecture, feature engineering, and system design decisions were made by the team.

Team


Name Affiliation Email Contribution
Liz Maria George M.Tech MVLSI, ECE, IISc lizgeorge@iisc.ac.in Dataset collection, feature engineering
Naznin Amirul Haque M.Tech MVLSI, ECE, IISc nazninhaque@iisc.ac.in Dataset collection, feature engineering
Ayush Kumar M.Tech MVLSI, ECE, IISc ayush1@iisc.ac.in Integration, GUI, data collection script
Prayanshu Sharma M.Tech MVLSI, ECE, IISc prayanshus@iisc.ac.in Dataset collection, model development