IDD Edge AI Segmentation

Team: Debanshu Mallick, Chandan Rai, Tamaghna Mandal, Yuvaraj DC
Code: GitHub Repository

This detailed report documents the complete project flow for semantic and instance segmentation on edge hardware: dataset post-processing, model training, logit knowledge distillation, quantization/compilation for Hailo, and Raspberry Pi deployment. GPU-cluster setup details are intentionally omitted, but the training logs were inspected to recover the losses and validation metrics reported here.

1. Project Scope

The project builds a real-time road-scene perception stack for the India Driving Dataset (IDD). The final deployed demo combines:

A semantic segmentation model running as a Hailo HEF.
A YOLOv8n-seg instance segmentation model running as a Hailo HEF.
A Raspberry Pi PyQt/OpenCV application that overlays semantic classes and detected dynamic objects.
Runtime switching between three semantic model variants.

2. End-to-End Pipeline

The full pipeline is:

Convert raw IDD polygon annotations into semantic masks, instance masks, color previews, or panoptic annotations.
Organize images and masks into train/validation/test splits for each label level.
Train strong semantic teachers on Label1ID, Label2ID, and Label3ID.
Train smaller semantic students using logit knowledge distillation.
Train YOLOv8n-seg on IDD instance classes using a dataset bridge.
Export selected PyTorch checkpoints to ONNX.
Parse ONNX to Hailo HAR, optimize/quantize with calibration images, and compile HEF files.
Deploy semantic and YOLO HEFs to Raspberry Pi with a PyQt/OpenCV/Hailo runtime.

3. IDD Dataset Post-Processing

The conversion logic is in idd_polygon_to_mask.ipynb. It reads the original IDD polygon annotations from gtFine and writes derived supervision targets.

The notebook exposes:

run_pipeline(
    datadir,
    out_basedir,
    encoding,
    do_semantic,
    do_instance,
    do_color,
    do_panoptic,
)

Supported encodings include:

Encoding	Use
`level1Id`	Coarse 7-class semantic labels.
`level2Id`	Medium 16-class semantic labels.
`level3Id`	Fine 26-class semantic labels.
`id`, `csId`, `csTrainId`, `level4Id`, `unifiedId`	Alternate IDD/Cityscapes-compatible encodings.

Generated outputs can include:

Output	Description
Semantic PNG	Single-channel class-ID mask.
Instance PNG	Encodes instance IDs as `classId * 1000 + instanceIndex`.
Color PNG	Visual preview using the label color table.
Panoptic PNG/JSON	Panoptic-style category and segment records.

Important conversion details:

Unknown labels are skipped.
Label names ending with group are normalized before lookup.
Background/ignore pixels use ID 255 for semantic-style encodings.
Instance counters are keyed by the actual encoded class value, which avoids collisions when using coarser label levels.
Panoptic category IDs are generated from the selected encoding rather than being hardcoded to a single IDD level.
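
The instance encoding and the per-class counters described above can be sketched as follows. This is a minimal illustration, not the notebook's code: `encode_instances` and `decode_instance` are hypothetical names, and starting the instance index at 0 is an assumption the report does not confirm.

```python
from collections import defaultdict

INSTANCE_FACTOR = 1000  # from the report: classId * 1000 + instanceIndex


def encode_instances(class_ids):
    """Assign an encoded instance ID to each polygon, in annotation order.

    `class_ids` holds the already-encoded class value of each polygon
    (for example its level2Id), so fine labels that merge into one coarse
    class share a counter and never receive the same instance ID.
    """
    counters = defaultdict(int)  # keyed by the encoded class value
    encoded = []
    for class_id in class_ids:
        # Assumption: indices start at 0 within each class.
        encoded.append(class_id * INSTANCE_FACTOR + counters[class_id])
        counters[class_id] += 1
    return encoded


def decode_instance(instance_id):
    """Split an encoded instance ID back into (classId, instanceIndex)."""
    return divmod(instance_id, INSTANCE_FACTOR)
```

For example, three polygons of class 7 and one of class 4 receive the IDs 7000, 7001, 4000, and 7002. The scheme only stays unambiguous while a class has fewer than 1000 instances in one image.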

For semantic training, the project expects split folders like:

<IDDLx>/train/images
<IDDLx>/train/masks
<IDDLx>/val/images
<IDDLx>/val/masks
<IDDLx>/test/images
<IDDLx>/test/masks

All semantic logs report:

Split	Images
Train	12,872
Validation	1,995

4. Label Spaces

The semantic projects train on three IDD label levels.

Label1ID

Label1ID is a 7-class coarse setup:

ID	Class
0	road
1	sky
2	vegetation
3	building
4	vehicle
5	person
6	background

Label2ID

Label2ID is a 16-class medium-granularity setup and is the label space used for all deployed models:

ID	Class
0	road
1	drivable-other
2	sidewalk
3	non-drivable-other
4	person
5	rider
6	2-wheeler
7	small-vehicle
8	large-vehicle
9	barrier-solid
10	barrier-open
11	structures-sign
12	structures-pole
13	construction
14	vegetation
15	sky

Label3ID

Label3ID is a 26-class fine-granularity setup:

ID	Class
0	road
1	parking
2	sidewalk
3	rail track
4	person
5	rider
6	motorcycle
7	bicycle
8	autorickshaw
9	car
10	truck
11	bus
12	large-vehicle
13	curb
14	wall
15	fence
16	guard rail
17	billboard
18	traffic sign
19	traffic light
20	pole
21	obs-str-bar-fallback
22	building
23	bridge/tunnel
24	vegetation
25	sky

5. Semantic Training Setup

The semantic dataset loader is IDDSegDataset. It performs:

ImageNet normalization.
Random resized crop to 512 x 512 during training.
Horizontal flip augmentation.
Color jitter augmentation.
Validation resize to 512 x 512.
Ignore index 255.

Default training settings (the ConvNeXt UPerNet project overrides batch size and logit KD epochs; see Section 6):

Setting	Value
Image size	`512 x 512`
Batch size	`32`
Workers	`8`
Seed	`42`
AMP	Enabled
Teacher epochs	`50`
Teacher LR	`1e-4`
Teacher weight decay	`1e-4`
Baseline epochs	`50`
Baseline LR	`6e-4`
Baseline weight decay	`1e-4`
Logit KD epochs	`125`
Logit KD LR	`6e-4`
Logit KD weight decay	`1e-4`
KD temperature	`4.0`
KD alpha	`1.0`

The mask encoding can run in automatic mode. If the mask already contains contiguous training IDs, it is used directly; otherwise raw IDD IDs are remapped into the target training label space.
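
One way to picture the automatic mode is the sketch below. It is dependency-free for illustration and works on a flat list of pixel values; a real loader would operate on image arrays. The function name `resolve_mask`, the `raw_to_train` lookup, and the exact detection rule (every value is a valid training ID or the ignore index) are assumptions, not the behaviour of `IDDSegDataset` as confirmed by the report.

```python
IGNORE_INDEX = 255


def resolve_mask(values, num_classes, raw_to_train):
    """Return training IDs for a flat sequence of mask pixel values.

    If every value is already a valid training ID (or the ignore index),
    the mask is used as-is. Otherwise each raw ID is looked up in
    `raw_to_train`; raw IDs without an entry become the ignore index.
    """
    valid = set(range(num_classes)) | {IGNORE_INDEX}
    if set(values) <= valid:
        return list(values)  # already in the training label space
    return [raw_to_train.get(v, IGNORE_INDEX) for v in values]
```

A rule like this cannot tell a raw-ID mask that happens to contain only small values from a training-ID mask, which is a reason to set the encoding explicitly when the mask source is known.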

6. Model Families

Original Teacher and Student

The original semantic project uses:

Role	Backbone	Decoder	Channels
Teacher	`mobilenetv4_conv_large.e600_r384_in1k`	DeepLabV3+	`[24, 48, 96, 192]`
Student	`mobilenetv4_conv_small.e2400_r224_in1k`	LR-ASPP	`[32, 32, 64, 96]`

The Hailo-optimized LR-ASPP student uses deployment-friendly changes:

Fixed interpolation sizes of 128 x 128 and 512 x 512.
Sigmoid instead of hard-sigmoid style operations.
Export mode returns a raw tensor instead of a dictionary.

ConvNeXt Teacher

The stronger teacher experiments use:

Role	Backbone	Decoder	Channels
Teacher	ConvNeXt-Base	UPerNet	`[128, 256, 512, 1024]`

The ConvNeXt UPerNet project uses:

Setting	Value
Batch size	`24`
Teacher epochs	`50`
Logit KD epochs	`100`

DeepLab Student Variants

Two additional students were trained against the ConvNeXt teacher:

Student
MobileNetV4-L DeepLabV3+
MobileNetV4-S DeepLabV3+

7. Logit Knowledge Distillation

Logit KD combines supervised cross-entropy with masked KL divergence between teacher and student output distributions:

loss = CE(student, target) + alpha * KL(student_logits / T, teacher_logits / T)

where T = 4.0 is the distillation temperature and alpha = 1.0 is the distillation weight. This method is applied consistently across all label levels and all student architectures.
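
The loss can be sketched per pixel in plain Python as below. Several details are assumptions rather than facts from the training code: the KL direction (teacher distribution against student distribution), the absence of a T² scaling factor, and mean reduction over non-ignored pixels. The function names are hypothetical. The logged totals in Section 8 equal CE + KL, which is consistent with alpha = 1.0.

```python
import math

IGNORE_INDEX = 255


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]


def logit_kd_loss(student, teacher, targets, temperature=4.0, alpha=1.0):
    """Mean CE + alpha * KL over pixels whose target is not ignored.

    `student` and `teacher` are per-pixel lists of class logits.
    Returns (ce, kl, total).
    """
    ce_sum = kl_sum = 0.0
    count = 0
    for s_logits, t_logits, target in zip(student, teacher, targets):
        if target == IGNORE_INDEX:
            continue  # masked out of both terms
        ce_sum -= log_softmax(s_logits)[target]
        log_p_s = log_softmax(s_logits, temperature)
        log_p_t = log_softmax(t_logits, temperature)
        # Assumed direction: KL(teacher || student) on softened distributions.
        kl_sum += sum(math.exp(lt) * (lt - ls)
                      for lt, ls in zip(log_p_t, log_p_s))
        count += 1
    if count == 0:
        return 0.0, 0.0, 0.0
    ce, kl = ce_sum / count, kl_sum / count
    return ce, kl, ce + alpha * kl
```

When student and teacher logits agree on every valid pixel, the KL term is zero and the total reduces to plain cross-entropy.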

8. Semantic Training Results

The following values were recovered from the training logs. “Best mIoU” is the best validation mIoU recorded for that run.
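
For reference, mIoU with an ignore index can be computed as in the sketch below. It is a generic illustration on flat label lists; `mean_iou` is a hypothetical name, and leaving classes that never occur out of the mean is an assumption, since the report does not say how the training code treats absent classes.

```python
IGNORE_INDEX = 255


def mean_iou(predictions, targets, num_classes):
    """Mean intersection-over-union across classes, ignoring label 255."""
    intersection = [0] * num_classes
    pred_area = [0] * num_classes
    target_area = [0] * num_classes
    for pred, target in zip(predictions, targets):
        if target == IGNORE_INDEX:
            continue  # ignored pixels count for no class
        pred_area[pred] += 1
        target_area[target] += 1
        if pred == target:
            intersection[pred] += 1
    ious = []
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - intersection[c]
        if union > 0:  # assumption: skip classes absent from both
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0
```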

Original MobileNetV4 Teacher/Student Project

Label	Run	Best mIoU	Best Epoch	Final Losses
Label1	Teacher	78.12%	50	CE 0.1101
Label1	Baseline student	72.42%	49	CE 0.1734
Label1	Logit KD	72.15%	123	CE 0.1844, KL 0.4075, total 0.5919
Label2	Teacher	67.61%	42	CE 0.1750
Label2	Baseline student	59.34%	46	CE 0.2818
Label2	Logit KD	59.67%	123	CE 0.2812, KL 0.4944, total 0.7756
Label3	Teacher	60.91%	50	CE 0.1890
Label3	Baseline student	51.56%	47	CE 0.3038
Label3	Logit KD	51.38%	114	CE 0.3053, KL 0.5106, total 0.8158

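Key observation: Logit KD did not consistently improve over the baseline for the original MobileNetV4-S LR-ASPP student. It gained 0.33 mIoU points on Label2 (59.67% vs 59.34%) but lost 0.27 points on Label1 (72.15% vs 72.42%) and 0.18 points on Label3 (51.38% vs 51.56%). The teacher-student capacity gap remained the primary limiting factor.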

ConvNeXt UPerNet Teacher to MobileNetV4-S LR-ASPP Student

Label	Run	Best mIoU	Best Epoch	Final Losses
Label1	Teacher	81.92%	39	CE 0.0626
Label1	Logit KD	71.85%	67	CE 0.2102, KL 0.8387, total 1.0489
Label2	Teacher	73.14%	42	CE 0.0898
Label2	Logit KD	59.37%	92	CE 0.3298, KL 1.2405, total 1.5702
Label3	Teacher	68.54%	44	CE 0.0946
Label3	Logit KD	51.59%	87	CE 0.3529, KL 1.2890, total 1.6419

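Key observation: The ConvNeXt UPerNet teacher is materially stronger than the MobileNetV4 teacher (+3.80, +5.53, and +7.63 mIoU points on Label1, Label2, and Label3), but the compact LR-ASPP student remains capacity-limited. Its distilled mIoU stays within about 0.3 points of the original logit KD results (71.85% vs 72.15%, 59.37% vs 59.67%, 51.59% vs 51.38%), so the stronger teacher signal closes no additional gap.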

ConvNeXt Teacher to MobileNetV4-L DeepLabV3+ Student

Label	Run	Best mIoU	Best Epoch	Final Losses
Label1	Teacher	81.92%	39	same ConvNeXt teacher
Label1	Logit KD	78.87%	116	CE 0.0978, KL 0.2991, total 0.3969
Label2	Teacher	73.14%	42	same ConvNeXt teacher
Label2	Logit KD	68.15%	71	CE 0.1716, KL 0.5660, total 0.7376

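Key observation: MobileNetV4-L DeepLabV3+ closes much more of the teacher-student gap than the smaller LR-ASPP student: the gap to the ConvNeXt teacher is 3.05 points on Label1 and 4.99 points on Label2, compared with 10.07 and 13.77 points for LR-ASPP. No Label3 run is reported for this student. This is the highest-accuracy deployed student.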

ConvNeXt Teacher to MobileNetV4-S DeepLabV3+ Student

Label	Run	Best mIoU	Best Epoch	Final Losses
Label1	Teacher	81.92%	39	same ConvNeXt teacher
Label1	Logit KD	76.25%	121	CE 0.1451, KL 0.5144, total 0.6595
Label2	Teacher	73.14%	42	same ConvNeXt teacher
Label2	Logit KD	63.99%	95	CE 0.2504, KL 0.8781, total 1.1285
Label3	Teacher	68.54%	44	same ConvNeXt teacher
Label3	Logit KD	56.33%	91	CE 0.2717, KL 0.9373, total 1.2090

Key observation: MobileNetV4-S DeepLabV3+ is a strong middle ground — it materially improves over the original LR-ASPP student while remaining much smaller than the MobileNetV4-L variant.

9. YOLOv8n Instance Segmentation

The dataset bridge script creates a YOLO-compatible dataset at:

datasets/idd_instance_seg

It hardlinks images and symlinks labels so Ultralytics can resolve images and segmentation labels from a local YOLO-style tree.
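
A bridge of this kind could look like the sketch below. It is not the project's script: `bridge_split` and its arguments are hypothetical, and the `images/<split>` and `labels/<split>` layout is the conventional Ultralytics tree, assumed here. Images without a label file are still linked, since the dataset log further down counts background images.

```python
import os
from pathlib import Path


def bridge_split(image_dir, label_dir, out_root, split):
    """Expose one split under <out_root>/images/<split> and labels/<split>.

    Images are hardlinked (no extra disk space, same filesystem required)
    and label files are symlinked. Existing entries are left untouched.
    Returns the number of images visited.
    """
    out_images = Path(out_root) / "images" / split
    out_labels = Path(out_root) / "labels" / split
    out_images.mkdir(parents=True, exist_ok=True)
    out_labels.mkdir(parents=True, exist_ok=True)
    visited = 0
    for image in sorted(Path(image_dir).iterdir()):
        if not image.is_file():
            continue
        dst_image = out_images / image.name
        if not dst_image.exists():
            os.link(image, dst_image)  # hardlink the image
        label = Path(label_dir) / f"{image.stem}.txt"
        dst_label = out_labels / label.name
        # An image without a label file is kept as a background image.
        if label.is_file() and not os.path.lexists(dst_label):
            os.symlink(label.resolve(), dst_label)
        visited += 1
    return visited
```

Hardlinks require the source and target directories to be on the same filesystem; if they are not, `os.link` raises `OSError` and a copy or symlink would be needed instead.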

YOLO class mapping:

YOLO ID	Class
0	person_animal
1	rider
2	motorcycle_bicycle
3	autorickshaw_car
4	large_vehicle

Training configuration:

Setting	Value
Base model	`yolov8n-seg.pt`
Epochs	`100`
Image size	`512`
Batch size	`192` default
Workers	`8`
Device	CUDA device `0`
Project/name	`runs/idd_yolov8n_seg`

YOLO dataset log summary:

Split	Images	Backgrounds	Corrupt
Train	12,872	90	0
Validation	1,995	21	0

The training log reports 381 of 417 pretrained weights transferred from the base YOLOv8n-seg checkpoint.

Best and final YOLO metrics from runs/idd_yolov8n_seg/results.csv:

Metric	Value
Best box mAP50-95	0.21013 at epoch 90
Best box mAP50	0.36054
Best box precision	0.69262
Best box recall	0.32409
Best mask mAP50-95	0.15582 at epoch 88
Best mask mAP50	0.32560
Best mask precision	0.63365
Best mask recall	0.28907
Epoch 100 train box loss	1.31612
Epoch 100 train segmentation loss	2.48142
Epoch 100 train classification loss	0.75843
Epoch 100 train DFL loss	0.92990
Epoch 100 validation box loss	1.35210
Epoch 100 validation segmentation loss	2.54123
Epoch 100 validation classification loss	0.78819
Epoch 100 validation DFL loss	0.93597
Epoch 100 box mAP50-95	0.20966
Epoch 100 mask mAP50-95	0.15497

10. Hailo Quantization and Compilation

The project exports selected PyTorch checkpoints to ONNX, parses them into Hailo HAR files, optimizes/quantizes with calibration images, and compiles HEF files for Hailo-8.

Common semantic Hailo flow:

Export PyTorch checkpoint to ONNX.
Parse ONNX to Hailo HAR.
Optimize HAR with a calibration set.
Compile optimized HAR to HEF for hailo8.
Copy HEFs into Final__Demo.

Common semantic model-script settings:

normalization mean = [123.675, 116.28, 103.53]
normalization std  = [58.395, 57.12, 57.375]
calibration size   = 64 images/tensors
calibration batch  = 1
target             = hailo8
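
The normalization constants above are the standard ImageNet RGB statistics rescaled from the [0, 1] range to the 0-255 pixel range, which matches the ImageNet normalization used by the training loader. The short check below reproduces them; the helper names are illustrative only.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # RGB, for inputs scaled to [0, 1]
IMAGENET_STD = (0.229, 0.224, 0.225)


def to_uint8_scale(values):
    """Rescale [0, 1]-range statistics to the 0-255 pixel range."""
    return [round(v * 255, 3) for v in values]


def normalize_pixel(rgb, mean, std):
    """Apply (x - mean) / std per channel to one 0-255 RGB pixel."""
    return [(x - m) / s for x, m, s in zip(rgb, mean, std)]


print(to_uint8_scale(IMAGENET_MEAN))  # [123.675, 116.28, 103.53]
print(to_uint8_scale(IMAGENET_STD))   # [58.395, 57.12, 57.375]
```

Placing this step in the model script means the compiled network applies it on the device, so the host can feed unnormalized 0-255 RGB frames.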

The Hailo logs show that optimization level was reduced to level 0 because only 64 calibration entries were available and no GPU was available for higher-level optimization. QAT fine-tuning was skipped.

Semantic HEF Summary

Hailo Folder	Model	ONNX Size	HEF Size	Compile FPS Estimate
`model1_hailo`	MobileNetV4-S LR-ASPP, 16 classes	1.38 MB	1.40 MB	605.732 FPS post-allocation
`model2_hailo`	MobileNetV4-L DeepLabV3+, 16 classes	25.59 MB	6.92 MB	51.6728 FPS post-allocation
`model3_hailo`	MobileNetV4-S DeepLabV3+, 16 classes	10.63 MB	10.41 MB	121.596 FPS post-allocation

Additional compile-log bottleneck FPS estimates:

Model	Raw FPS
`model1_hailo`	728.753
`model2_hailo`	61.663
`model3_hailo`	138.300

YOLOv8n-Seg HEF Summary

The YOLO Hailo folder is yolov8n_seg_hailo.

Export settings:

Setting	Value
Source checkpoint task	Segment
Architecture	`yolov8n-seg`
Classes	5
ONNX opset	11
Dynamic axes	Disabled
NMS in export	Disabled
Half precision export	Disabled
Batch	1
Export device	CPU

YOLO model-script settings:

normalization([0, 0, 0], [255, 255, 255])
change_output_activation(conv45, sigmoid)
change_output_activation(conv61, sigmoid)
change_output_activation(conv74, sigmoid)

YOLO Hailo artifact summary:

Artifact	Size
ONNX	13.20 MB
HEF	7.41 MB

YOLO compile log summary:

Item	Value
Calibration entries	64
Optimization level	Reduced to 0
QAT	Skipped
Resolver FPS estimate	224.836
Context 1 post-allocation FPS	1388.21
Context 2 post-allocation FPS	1732.73

11. Raspberry Pi Deployment

The deployment application is:

Final__Demo/python_if_models_switch_pipelined.py

It uses:

PyQt5 for the GUI.
OpenCV for video/camera input and visualization.
Hailo Platform APIs for loading HEFs and running inference.
A thread-based inference loop to keep the UI responsive.

Deployed HEFs:

File	Role
`model1_hailo.hef`	Semantic model 1, MobileNetV4-S LR-ASPP.
`model2_hailo.hef`	Semantic model 2, MobileNetV4-L DeepLabV3+.
`model3_hailo.hef`	Semantic model 3, MobileNetV4-S DeepLabV3+.
`yolov8n_seg_hailo.hef`	YOLOv8n-seg instance segmentation.

Runtime constants:

Setting	Value
Input size	`512`
Semantic classes	`16`
YOLO confidence threshold	`0.5`
YOLO NMS IoU threshold	`0.5`
YOLO DFL `REG_MAX`	`16`
YOLO mask dimension	`32`
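
As an illustration of how the two thresholds interact, the sketch below filters detections by confidence and then applies greedy NMS within each class. It is a generic implementation, not the application's code; treating NMS as class-aware is an assumption, and the function names are hypothetical.

```python
CONF_THRESHOLD = 0.5      # YOLO confidence threshold from the table above
NMS_IOU_THRESHOLD = 0.5   # YOLO NMS IoU threshold from the table above


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def multiclass_nms(detections):
    """Keep confident (box, score, class_id) tuples, suppressing overlaps.

    Detections are visited from highest to lowest score; a detection is
    dropped if it overlaps an already kept box of the same class by more
    than the IoU threshold.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= CONF_THRESHOLD),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score, class_id in candidates:
        if all(k[2] != class_id or iou(box, k[0]) <= NMS_IOU_THRESHOLD
               for k in kept):
            kept.append((box, score, class_id))
    return kept
```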

Semantic preprocessing:

Resize frame to 512 x 512.
Convert BGR to RGB.
Convert to float32.
Run semantic HEF.
Apply argmax over class logits.
Map class IDs to a color palette.

YOLO preprocessing and decoding:

Letterbox input to 512 x 512.
Convert BGR to RGB.
Run YOLO HEF.
Decode three detection heads:
- stride 8: regression conv44, class conv45, mask conv46
- stride 16: regression conv60, class conv61, mask conv62
- stride 32: regression conv73, class conv74, mask conv75
- prototype output: conv48
Decode boxes with DFL.
Apply confidence threshold and multiclass NMS.
Combine mask coefficients with prototype masks.
Undo letterbox scaling and overlay instance masks.
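
The DFL box decoding and the letterbox reversal from the steps above can be sketched as follows. The side order (left, top, right, bottom) and the +0.5 cell-centre offset follow the usual YOLOv8 convention and are assumed here rather than taken from the application; all function names are hypothetical.

```python
import math

REG_MAX = 16  # bins per box side, from the runtime constants


def dfl_distance(bin_logits):
    """Expected bin index under softmax: the DFL distance in grid units."""
    peak = max(bin_logits)
    weights = [math.exp(v - peak) for v in bin_logits]
    total = sum(weights)
    return sum(i * w for i, w in enumerate(weights)) / total


def decode_box(reg_logits, col, row, stride):
    """Decode 4 * REG_MAX regression logits at grid cell (col, row).

    Returns (x1, y1, x2, y2) in letterboxed-input pixels.
    """
    left, top, right, bottom = (
        dfl_distance(reg_logits[i * REG_MAX:(i + 1) * REG_MAX])
        for i in range(4)
    )
    cx, cy = col + 0.5, row + 0.5  # assumed cell-centre anchor point
    return ((cx - left) * stride, (cy - top) * stride,
            (cx + right) * stride, (cy + bottom) * stride)


def undo_letterbox(box, scale, pad_x, pad_y):
    """Map a box from letterboxed coordinates back to the source frame.

    `scale` is the resize factor applied to the frame and `pad_x`/`pad_y`
    are the left/top padding added to reach the square input.
    """
    x1, y1, x2, y2 = box
    return ((x1 - pad_x) / scale, (y1 - pad_y) / scale,
            (x2 - pad_x) / scale, (y2 - pad_y) / scale)
```

With `REG_MAX = 16`, each side distance lies between 0 and 15 grid cells, so the stride-32 head can describe boxes reaching up to 480 pixels from the anchor point in each direction.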

Application features:

Load video files.
Discover Raspberry Pi camera sources using libcamera/GStreamer and /dev/video*.
Switch between semantic models at runtime.
Enable or disable YOLO overlay.
Adjust overlay alpha.
Run a benchmark mode:

python python_if_models_switch_pipelined.py --benchmark <video> [frames]

The benchmark prints average milliseconds per frame and FPS for each semantic model.
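
The two reported numbers relate as in this minimal timing sketch. `benchmark` and `infer` are hypothetical names; the application's own benchmark code may measure differently, for example by excluding video decoding time.

```python
import time


def benchmark(infer, frames):
    """Time `infer(frame)` over `frames`; return (avg ms per frame, FPS)."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
        count += 1
    elapsed = time.perf_counter() - start
    if count == 0 or elapsed <= 0:
        return 0.0, 0.0
    return 1000.0 * elapsed / count, count / elapsed
```

Because FPS is simply 1000 divided by the average milliseconds per frame, either number alone is enough to compare the three semantic models.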

12. Model Selection Notes

The logs support the following practical conclusions:

ConvNeXt UPerNet is the best teacher family, reaching 81.92% Label1, 73.14% Label2, and 68.54% Label3 mIoU.
MobileNetV4-L DeepLabV3+ gives the best distilled student accuracy, with 78.87% Label1 and 68.15% Label2 using logit KD.
MobileNetV4-S DeepLabV3+ is a strong compact deployment option, reaching 63.99% Label2 and 56.33% Label3 with logit KD — a material improvement over the original LR-ASPP student.
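Logit KD results depend strongly on student capacity. For the MobileNetV4-S LR-ASPP student it changed mIoU by less than 0.4 points relative to the baseline, with a gain on Label2 only. This report lists no non-distilled baselines for the DeepLabV3+ students, so their KD gain cannot be isolated from the effect of the larger decoder.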
Hailo compilation succeeded for all selected semantic and YOLO models, but the calibration and optimization setup was conservative — only 64 calibration samples were used and QAT was skipped.

13. Known Limitations

For the MobileNetV4-L DeepLabV3+ student on Label2, the KD-D variant was recorded only up to epoch 17 and is not used; the logit KD run completed fully.
The Hailo logs indicate optimization level 0 due to limited calibration data and no GPU-assisted optimization. More calibration images and full optimization/QAT could improve quantized accuracy.
YOLO mask mAP is modest but remains useful as a dynamic-object overlay when combined with semantic segmentation.

The deployable system is a hybrid semantic-plus-instance perception pipeline: semantic segmentation supplies dense road-scene layout, YOLO supplies dynamic object instances, and the Raspberry Pi application fuses both outputs in real time through the Hailo accelerator.