Edge Deployed Crack Detection


Team: Niyati Jawariya, Priyanshi Dubey.
Code: GitHub Repository

YOLOv8n optimised through PTQ, QAT, pruning, and a resolution sweep for real-time inference on a VOLTA Bot Sync platform

Project: Real-time crack detection for autonomous building-defect inspection
Platform: VOLTA Bot Sync · Intel RealSense D455 · Raspberry Pi 5
Base model: YOLOv8n, optimised end-to-end for on-device deployment

1. Introduction


Buildings age. Cracks, spalling, and surface defects appear over time and, if left unmonitored, become safety hazards on bridges, dams, facades, and tunnels. Today most of this inspection is still done manually — an inspector walking a site with a clipboard, or a rope-access team scaling a tall structure. It is slow, expensive, inconsistent across inspectors, and dangerous in confined or elevated spaces.

An autonomous mobile robot equipped with a camera and an on-device AI model can scan large surfaces continuously, flag defects the moment they appear in the camera feed, and produce a consistent visual record. Three things make this hard in practice:

  • Compute is limited. The robot can only carry a low-power single-board computer like a Raspberry Pi 5 — no GPU, no cloud connection guaranteed in a tunnel or basement.

  • Detection must be real-time. If the model takes 200 ms per frame the robot has to slow down, which makes site coverage unbearably slow.

  • Cracks are hard targets. They are thin, low-contrast, and easy to confuse with surface texture — a generic detector tuned for COCO objects will not work out of the box.

This project addresses all three. We trained a YOLOv8n crack detector on the BD3 building-defect dataset, then put it through a five-stage optimisation pipeline (an FP32 baseline, post-training quantisation, quantisation-aware training, pruning, and a resolution sweep) to fit it onto a Raspberry Pi 5 at real-time framerates. The optimised model is deployed on a VOLTA Bot Sync autonomous platform with an Intel RealSense D455 depth camera, where it processes the live camera feed and flags cracks in real time during navigation.

Figure 1. VOLTA Bot Sync — the autonomous mobile platform that hosts the inference stack and camera.

2. Project Overview


The project is structured as a reproducible end-to-end pipeline. We start with model selection, comparing three lightweight detector candidates (YOLOv8n, YOLOv8s, YOLOv10n) on the same crack dataset and picking the best base. We then take the chosen model through five optimisation stages, rank the resulting candidates with a custom Raspberry-Pi-aware metric (PiScore), and pick the best (model, resolution) pair to deploy. The final artefact is a 3.23 MB TFLite file that runs on the Pi at real-time framerates.

What this report covers

  1. Hardware and software requirements for the deployment platform.

  2. Dataset collection, annotation, and augmentation pipeline.

  3. Comparative training and evaluation of three candidate detector architectures.

  4. The five-stage optimisation pipeline (FP32 baseline → PTQ → QAT → pruning → resolution sweep).

  5. The PiScore metric that drives final model selection.

  6. Results, the final selected model, and how it is deployed on the bot.

3. Hardware Requirements


Component | Role | Notes
VOLTA Bot Sync | Autonomous mobile platform | Carries the compute stack and camera; supports manual override
Raspberry Pi 5 | On-board inference | Quad-core ARM Cortex-A76 2.4 GHz; CPU inference using XNNPACK (multi-threaded, up to 4 threads)
Intel RealSense D455 | RGB-D image acquisition | Depth-aware capture for 3D surface reconstruction
Joystick controller | Manual motion control | Used during data capture and supervised navigation

Figure 2. Raspberry Pi 5, the on-board compute platform for inference.

Figure 3. Intel RealSense D455 camera.

4. Software Requirements


Tool | Version | Purpose
Visual Studio Code | 1.x | Primary development environment
Python | 3.10+ | Runtime for training, optimisation, deployment
Ultralytics YOLO | ≥ 8.2 | YOLOv8n training and export pipelines
PyTorch | ≥ 2.1 | Deep learning backend; pruning utilities
TensorFlow / TFLite | ≥ 2.15 | Quantisation, TFLite conversion, on-device inference
ONNX | ≥ 1.15 | Intermediate export format
OpenCV | ≥ 4.8 | Camera capture and pre-processing on the Pi
Roboflow | n/a | Dataset versioning, annotation, augmentation

Inference on the Pi uses the TensorFlow Lite Python interpreter with the XNNPACK delegate enabled and num_threads=4. Benchmarks reported in this document use 50 warmup runs followed by 200 timed runs at batch size 1.
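The benchmark protocol above (50 warm-up runs, 200 timed runs, batch size 1, median latency) can be sketched as a small harness. This is an illustrative sketch rather than the project's benchmarking script: the function name benchmark and the stand-in workload are assumptions, and on the Pi the callable would wrap a single interpreter.invoke() call.

```python
import statistics
import time


def benchmark(run_once, warmup=50, timed=200):
    """Return latency statistics (milliseconds) for a zero-argument callable."""
    # Warm-up runs let caches, thread pools, and lazy allocations settle.
    for _ in range(warmup):
        run_once()

    samples_ms = []
    for _ in range(timed):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    p50 = statistics.median(samples_ms)
    return {
        "p50_ms": p50,
        "mean_ms": statistics.fmean(samples_ms),
        "fps": 1000.0 / p50 if p50 > 0 else float("inf"),
        "runs": len(samples_ms),
    }


# Stand-in workload (hypothetical); replace with a single-frame inference call.
stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

Reporting the median (p50) rather than the mean keeps occasional scheduler stalls from skewing the figure.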

5. Dataset Collection and Pre-processing


5.1 Source

  • Primary repository: BD3 Dataset on GitHub — https://github.com/Praveenkottari/BD3-Dataset

  • Mirror: BD3 Dataset on Kaggle — https://www.kaggle.com/datasets/praveenkottari/bd3-dataset-for-building-defect-detection

  • BD3 is a building-defect detection dataset containing annotated images of surface defects such as cracks, peeling, stains, and spalling. The images are collected from real building surfaces under varying lighting, textures, and crack widths, which helps the model generalise. The dataset includes both original and augmented images, where transformations such as rotation, flipping, and colour adjustment increase diversity and robustness. Although BD3 contains multiple defect classes, this project focuses only on crack detection: we use the major crack, minor crack, and normal images for training, which reduces the task to a simplified detection problem.

5.2 Annotation format

Annotations are stored as polygons (segmentation labels) which preserve the irregular, branching shape of cracks better than bounding boxes do. For training the YOLOv8 detector, polygons are converted to tight axis-aligned bounding boxes via min/max of the polygon vertices in normalised coordinates.
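The polygon-to-box conversion described above can be sketched as follows. It assumes the usual YOLO text label layout (class id followed by normalised x y vertex pairs for segmentation, and class id followed by centre x, centre y, width, height for detection); the function name polygon_to_bbox and the clamping to [0, 1] are illustrative choices, not taken from the project code.

```python
def polygon_to_bbox(label_line):
    """Convert one YOLO segmentation label line to a YOLO detection label line.

    Input:  "<class> x1 y1 x2 y2 ... xn yn"  (normalised polygon vertices)
    Output: "<class> cx cy w h"               (normalised box centre and size)
    """
    parts = label_line.split()
    class_id = parts[0]
    coords = [float(v) for v in parts[1:]]
    if len(coords) < 6 or len(coords) % 2 != 0:
        raise ValueError("expected at least three (x, y) vertex pairs")

    # Assumption: clamp to [0, 1] so vertices slightly outside the image
    # cannot produce an out-of-range box.
    xs = [min(max(x, 0.0), 1.0) for x in coords[0::2]]
    ys = [min(max(y, 0.0), 1.0) for y in coords[1::2]]

    # Tight axis-aligned box: min/max over the polygon vertices.
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    w, h = x_max - x_min, y_max - y_min
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


print(polygon_to_bbox("0 0.10 0.20 0.50 0.25 0.40 0.80 0.15 0.60"))
# -> 0 0.300000 0.500000 0.400000 0.600000
```

For a thin diagonal crack the resulting box is mostly background, which is one reason the polygon labels are kept alongside the derived boxes.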

Figure 4. Detailed annotation example.

Figure 5. A second annotation example: a wider, branching crack on textured concrete.

5.3 Pre-processing and augmentation (Roboflow)

We staged the dataset through Roboflow for cleanup, splitting, and augmentation. The final version (2026-4-12_augmented, v2) expanded the original 1,800 source images into 4,189 images via the augmentation pipeline below. Augmentation matters here because the bot will encounter cracks under arbitrary orientations, lighting, and perspective — the model must be invariant to these.

Step | Setting | Why
Auto-orient | Applied | Strip EXIF orientation flags
Resize | Stretch to 512×512 | Common training resolution
Outputs per example | 3 | 3× the dataset volume via augmented variants
Flip | Horizontal, Vertical | Cracks are direction-invariant
90° rotate | CW, CCW, Upside-down | Bot can encounter cracks from any heading
Crop | 0% – 28% zoom | Robustness to varying camera distance
Rotation | ±11° | Compensates for slight camera tilt
Shear | ±13° H, ±14° V | Simulates oblique viewing angles
Blur | Up to 2.4 px | Robustness to motion blur during navigation

5.4 Dataset splits

Split | Images
Train | 3666
Validation | 349
Test | 174
Total | 4189

Figure 6. Roboflow dataset overview: version, splits, preprocessing, and augmentations as configured for v2.

Figure 7. Sample thumbnails from the train split (3,666 images) with crack polygon annotations overlaid.

6. Model Comparison: YOLOv8n vs YOLOv8s vs YOLOv10n


Before optimisation we trained three lightweight detector candidates on the same dataset splits and evaluated them on the held-out test set (174 images / 187 crack instances) to identify the best base model. All three were trained from official pretrained COCO weights for 80 epochs at 672×672 input resolution with identical augmentation, batch size 8, and learning rate 1e-3. Latency and FPS below are measured with batch size 1 to give a clear per-image comparison; the CPU latencies reported later in the optimisation section are measured separately on the deployment-target Pi / desktop CPU stack with TFLite + XNNPACK.

6.1 Test-set results

Model | Params (M) | GFLOPs | Size (MB) | mAP@0.5 | mAP@0.5:0.95 | Precision | Recall | Inference (ms) | FPS
YOLOv8n | 3.01 | 8.1 | 5.97 | 0.714 | 0.489 | 0.697 | 0.658 | 3.12 | 167.7
YOLOv8s | 11.13 | 28.4 | 21.48 | 0.686 | 0.480 | 0.761 | 0.615 | 3.25 | 170.7
YOLOv10n | 2.27 | 6.5 | 5.49 | 0.639 | 0.428 | 0.665 | 0.610 | 5.26 | 139.7

6.2 Visual comparison

The plots below were generated from the comparison notebook (train_compare 1.ipynb, cells 7–12) and visualise the trade-off across multiple axes simultaneously.

Figure 8. Detection quality (mAP@0.5, mAP@0.5:0.95, precision, recall) per model.

Figure 9. Accuracy vs file size and accuracy vs inference latency.

Figure 10. Training curves (box loss and validation mAP).

6.3 Observations

  • YOLOv8n leads on mAP (mAP@0.5 = 0.714): despite being the smallest YOLOv8 variant, it outperforms YOLOv8s on this crack dataset. YOLOv8s has higher precision (0.761 vs 0.697) but lower recall (0.615 vs 0.658); a plausible explanation is that the larger backbone over-fits on thin, low-feature objects such as cracks.

  • YOLOv10n trails on every accuracy axis (mAP@0.5 0.639, mAP@0.5:0.95 0.428). Although marginally smaller than v8n, the accuracy cost is too large for a safety-critical inspection task. It is also the slowest on inference (5.26 ms vs 3.1–3.3 ms for the two YOLOv8 models).

  • YOLOv8s is 3.6× bigger (21.5 MB vs 6.0 MB) for worse accuracy — clearly not optimal on this dataset.

  • Latency is similar for v8n and v8s on GPU, but the gap is expected to widen substantially on CPU (the deployment target), where compute cost tracks parameter count and FLOPs much more closely. This makes v8n even more attractive for the Pi.

7. Why YOLOv8n — and Why It Quantises Well


YOLOv8n was selected as the base model not only because it has the best mAP and the lowest per-image inference latency in the comparison while being nearly as small as YOLOv10n, but also because its architecture is well-suited to the optimisation pipeline that follows. Several architectural choices in YOLOv8 make it a clean target for INT8 quantisation:

  • Standard Conv-BN-SiLU blocks throughout the backbone. No deformable convs, no exotic attention mechanisms, no SE blocks — every layer maps to a TFLite kernel that has a well-tested INT8 implementation.

  • C2f modules (Cross-Stage Partial blocks with multiple 3×3 convs and shortcut connections) produce predictable weight distributions — the activation ranges stay reasonably bounded, which keeps INT8 quantisation error small.

  • Decoupled detection head separates classification and regression into independent branches. Each branch can be quantised cleanly; there is no shared logits tensor whose dynamic range would otherwise dominate the calibration.

  • Anchor-free design means fewer post-processing operations to quantise — the network outputs (cx, cy, w, h, conf) directly, with NMS handled in float on the CPU after dequantisation.

  • SiLU activation is smooth and bounded below (its minimum is about -0.28), so its output range is easy to calibrate, and in INT8 it can be evaluated with a lookup-table approximation.

  • Small parameter count (3.0 M for v8n) means quantisation error has fewer layers to accumulate through — deeper / wider networks tend to lose more accuracy under INT8 because errors compound.

Together these properties mean YOLOv8n loses only about half a mAP point under post-training INT8 quantisation (0.7146 → 0.7090 mAP@0.5 in Section 10.1), and recovers fully when fine-tuned with quantisation-aware training (QAT). Larger or more exotic detectors may require either FP16 (less compression) or per-layer mixed-precision schemes that are harder to deploy on a Pi.

8. Optimisation Pipeline


Once YOLOv8n is selected, the model goes through a five-stage optimisation pipeline. Stage 0 establishes the FP32 reference. Stages 1–3 explore three orthogonal compression strategies at 640×640 (the nominal training resolution). Stage 4 sweeps the input resolution on the top-2 candidates from the first round to find the best (model, resolution) pair for Raspberry Pi.

Stage | Pipeline tag | Technique
0 | S0_FP32 | FP32 baseline (reference only)
1 | S1_FP16, S1_INT8 | Post-Training Quantisation (PTQ)
2 | S2_QAT_INT8 | Quantisation-Aware Training (QAT)
3 | S3_PrunedINT8 | 10% L1-unstructured prune → masked fine-tune → INT8
4 | top-2 × {640, 512, 416, 320} | Resolution sweep on the best two pipelines

8.1 Stage 0 — FP32 baseline

Standard YOLOv8n training for 30 epochs on the augmented BD3 split, 640×640 input. The resulting best.pt is the reference for every downstream comparison. This stage answers a single question: what is the maximum accuracy the architecture can reach on this dataset without any compression applied?

8.2 Stage 1 — Post-Training Quantisation (PTQ)

Two flavours exported from the same FP32 checkpoint:

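  • FP16: half-precision. It is expected to have the smallest accuracy loss and nominally halves the bytes per weight. In our measurements (Section 10.1) CPU latency drops about 1.8× versus FP32, while the file size is almost unchanged (6.18 MB vs 6.25 MB).

  • INT8: 8-bit integer. In our measurements it is about 1.9× smaller and 1.9× faster than FP32 (only slightly faster than FP16), with a small mAP cost. Calibration uses a 200-image subset of the training split to estimate per-tensor quantisation ranges.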

PTQ is essentially free — no training is required — so it serves as both a baseline compression result and a quick sanity check for whether the architecture quantises cleanly.
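To make the calibration step concrete, the sketch below derives per-tensor affine INT8 parameters (scale and zero-point) from observed activation values and round-trips a few numbers through them. It assumes a plain min/max range estimator and the convention real = scale × (q − zero_point); the converter used in the project may estimate ranges differently, and the function names and sample values are illustrative.

```python
def calibrate(samples, qmin=-128, qmax=127):
    """Derive per-tensor affine quantisation parameters from observed values."""
    # The range is widened to include 0 so that zero is exactly representable.
    lo = min(0.0, min(samples))
    hi = max(0.0, max(samples))
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantise(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to its INT8 code, saturating at the range limits."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantise(q, scale, zero_point):
    """Map an INT8 code back to the real value it represents."""
    return scale * (q - zero_point)


# Hypothetical activation values gathered from a calibration batch.
observed = [-0.25, 0.0, 1.0, 6.125]
scale, zero_point = calibrate(observed)
for x in observed:
    q = quantise(x, scale, zero_point)
    print(f"{x:7.3f} -> q={q:4d} -> {dequantise(q, scale, zero_point):7.3f}")
```

Within the calibrated range the round-trip error is at most half a quantisation step (scale / 2), which is why tightly bounded activation ranges translate into small INT8 error.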

8.3 Stage 2 — Quantisation-Aware Training (QAT)

Fine-tune from the FP32 checkpoint for 15 epochs with a snap-to-INT8-grid callback applied after every batch. The callback rounds each Conv2d weight (except the detection head, model.22) to its INT8 quantisation grid, so the optimiser sees the quantisation noise during training and learns weights that survive INT8 export with less mAP loss than vanilla PTQ. Only weights are snapped; activations are not fake-quantised, so this is a lightweight approximation of full QAT.

QAT snap-to-INT8-grid callback (applied after every batch)

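    import torch
    import torch.nn as nn


    def attach_qat_snap(yolo_obj, skip_name='model.22'):
        """Register a callback that snaps Conv2d weights to an INT8 grid after each batch."""

        def snap(trainer):
            for name, m in trainer.model.named_modules():
                # Skip the detection head (model.22); snap every other Conv2d.
                if isinstance(m, nn.Conv2d) and skip_name not in name:
                    with torch.no_grad():
                        w = m.weight.data
                        # Symmetric per-tensor scale; the epsilon avoids division by zero.
                        s = w.abs().max() / 127.0 + 1e-9
                        m.weight.data = (w / s).round().clamp(-128, 127) * s

        yolo_obj.add_callback('on_train_batch_end', snap)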
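The effect of the snap on a single tensor can be checked without PyTorch. The sketch below is a pure-Python analogue of the callback body applied to a short list of made-up weights; it shows that the first snap moves each weight by at most half a grid step, and that snapping again leaves the integer codes unchanged, which is the property that lets the weights survive INT8 export.

```python
def snap_to_int8_grid(weights):
    """Pure-Python analogue of the callback body for one weight tensor."""
    # Symmetric per-tensor scale, with the same epsilon as the callback.
    scale = max(abs(w) for w in weights) / 127.0 + 1e-9
    codes = [max(-128, min(127, round(w / scale))) for w in weights]
    return [c * scale for c in codes], codes, scale


# Illustrative weights, not taken from the trained model.
weights = [0.5, -0.31, 0.127, 0.004, -0.0021]
snapped, codes, scale = snap_to_int8_grid(weights)
resnapped, recodes, _ = snap_to_int8_grid(snapped)

print("codes:", codes)
print("max snap error:", max(abs(w - s) for w, s in zip(weights, snapped)))
print("codes stable on re-snap:", codes == recodes)
print("max drift on re-snap:", max(abs(a - b) for a, b in zip(snapped, resnapped)))
```

The small drift on re-snapping comes from the 1e-9 epsilon in the scale; it is orders of magnitude below one grid step.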

8.4 Stage 3 — Pruning + Masked Fine-tune + INT8

L1-unstructured pruning at 10% sparsity is applied to every Conv2d weight (the detection head, model.22, is excluded). After pruning, a masked fine-tune for 15 epochs preserves the sparsity while recovering any accuracy lost to the pruning step. The recovered checkpoint is then exported to INT8 TFLite, stacking sparsity onto quantisation.

Stage 3: 10% L1-unstructured prune + masked fine-tune

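    import torch.nn as nn
    import torch.nn.utils.prune as prune

    SKIP_NAME = 'model.22'   # detection head, excluded from pruning
    PRUNE_AMOUNT = 0.10      # 10% L1-unstructured sparsity
    FT_EPOCHS = 15           # length of the masked fine-tune

    for name, m in yolo_pr.model.named_modules():
        if isinstance(m, nn.Conv2d) and SKIP_NAME not in name:
            prune.l1_unstructured(m, name='weight', amount=PRUNE_AMOUNT)

    yolo_pr.train(data=data_yaml, epochs=FT_EPOCHS, imgsz=640, lr0=1e-4)

    # Final sparsity ~10%, baked into weights for export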
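What L1-unstructured pruning and the masked fine-tune do to a single tensor can be illustrated in plain Python. The sketch below is a toy analogue, not the PyTorch implementation: it zeroes the smallest-magnitude 10% of a made-up weight list, then applies one hypothetical gradient step and re-applies the mask so the pruned entry stays at zero. All names and values are illustrative.

```python
def l1_unstructured_mask(weights, amount):
    """Return a 0/1 mask that drops the `amount` fraction of smallest-|w| weights."""
    n_prune = round(amount * len(weights))
    # Indices sorted by magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else 1.0 for i in range(len(weights))]


def apply_mask(weights, mask):
    return [w * m for w, m in zip(weights, mask)]


def masked_step(weights, grads, mask, lr):
    """One SGD step followed by re-masking, so pruned weights cannot regrow."""
    return apply_mask([w - lr * g for w, g in zip(weights, grads)], mask)


def sparsity(weights):
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.05, 0.3, -0.6, 0.01, 0.45, -0.2, 0.9, -0.75, 0.12]
mask = l1_unstructured_mask(weights, amount=0.10)
pruned = apply_mask(weights, mask)
updated = masked_step(pruned, grads=[0.1] * len(weights), mask=mask, lr=0.01)

print("mask:", mask)
print("sparsity after fine-tune step:", sparsity(updated))
```

Because the zeros are scattered rather than grouped by channel, the tensor shapes are unchanged; any speed-up has to come from the runtime exploiting the zeros.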

8.5 Stage 4 — Resolution sweep

The top-2 pipelines from Stages 1–3 (ranked by PiScore) are re-exported and benchmarked at four input resolutions: 640, 512, 416, and 320 px. Compute cost scales with pixel count, so CPU latency falls roughly quadratically with the side length, but mAP on thin cracks degrades sharply at the lowest resolution (320 px). The sweep identifies the optimal trade-off for the deployment target.
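The "roughly quadratic" scaling can be checked against the sweep results. The sketch below compares the S3_PrunedINT8 latencies reported in Section 10.2 with a pure pixel-count estimate anchored at 640 px. The quadratic model is a simplifying assumption; the measured values fall more slowly than it predicts, which may reflect a fixed per-inference overhead.

```python
# S3_PrunedINT8 p50 latencies (ms) from the Stage 4 table in Section 10.2.
measured_ms = {640: 10.71, 512: 7.42, 416: 5.09, 320: 3.81}


def quadratic_estimate(base_res, base_ms, res):
    """Latency estimate assuming cost is proportional to the number of input pixels."""
    return base_ms * (res / base_res) ** 2


for res, ms in measured_ms.items():
    est = quadratic_estimate(640, measured_ms[640], res)
    print(f"{res:>3} px  measured {ms:5.2f} ms  quadratic estimate {est:5.2f} ms")
```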

9. PiScore Metric


Picking a single “best” model from the candidate pool is a multi-objective decision — accuracy, latency, and size all matter, and they generally trade against each other. Rather than choosing arbitrarily we define a single composite score, PiScore, that combines the relevant metrics with weights chosen for Raspberry Pi deployment:

Metric | Weight | Direction
mAP@0.5 | 0.40 | higher is better
p50_cpu_ms | 0.35 | lower is better
size_mb | 0.15 | lower is better
mAP_drop (vs FP32) | 0.10 | lower is better

Each metric is min-max normalised across the candidate pool, then weighted and summed to produce a final score in [0, 1]. The candidate with the highest PiScore wins. Accuracy is weighted heaviest because the candidate pool already consists of small, fast quantised models — accuracy is the most discriminating axis. For the Stage 4 resolution sweep we shift to a simpler 0.7 / 0.3 accuracy/latency split so the resolution choice is dominated by the two metrics that actually move with image size.
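A minimal sketch of the scoring procedure, applied to the Stage 1–3 figures reported in Section 10.1, is given below. It is an illustration rather than the project's scoring code: the function names are hypothetical, and treating mAP_drop as the signed difference from the FP32 mAP@0.5 is an assumption. One observation from the numbers: using only the mAP@0.5, latency, and size weights reproduces the reported Round-1 scores (0.9000, 0.4007, 0.1967, 0.0000) to within rounding of the inputs, which suggests the mAP_drop term contributed nothing in that round. That reading is inferred from the table, not stated in the source.

```python
WEIGHTS = {
    # metric: (weight, higher_is_better)
    "map50": (0.40, True),
    "p50_cpu_ms": (0.35, False),
    "size_mb": (0.15, False),
    "map_drop": (0.10, False),
}

FP32_MAP50 = 0.7146  # S0_FP32 reference from Section 10.1

# (mAP@0.5, p50 latency in ms, size in MB) for the Stage 1-3 candidates.
RAW = {
    "S1_FP16": (0.7086, 21.61, 6.18),
    "S1_INT8": (0.7090, 20.34, 3.35),
    "S2_QAT_INT8": (0.7183, 18.33, 3.35),
    "S3_PrunedINT8": (0.7353, 10.71, 3.35),
}
CANDIDATES = {
    name: {"map50": m, "p50_cpu_ms": ms, "size_mb": mb, "map_drop": FP32_MAP50 - m}
    for name, (m, ms, mb) in RAW.items()
}


def min_max(values, higher_is_better):
    """Normalise to [0, 1] so that 1 is always the best candidate on this metric."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a metric with no spread scores 0
    if higher_is_better:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]


def pi_score(candidates, weights):
    """Weighted sum of min-max normalised metrics for every candidate."""
    names = list(candidates)
    totals = {name: 0.0 for name in names}
    for metric, (weight, higher) in weights.items():
        normed = min_max([candidates[n][metric] for n in names], higher)
        for name, value in zip(names, normed):
            totals[name] += weight * value
    return totals


# Three-term variant (mAP_drop left out), see the note above.
three_term = {k: v for k, v in WEIGHTS.items() if k != "map_drop"}
ranking = sorted(pi_score(CANDIDATES, three_term).items(), key=lambda kv: -kv[1])
for name, score in ranking:
    print(f"{name:<14} {score:.4f}")
```

Because min-max normalisation is relative to the candidate pool, adding or removing a candidate changes every score; PiScore values are only comparable within one ranking round.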

10. Results


10.1 Stage 1–3 results (640×640)

Pipeline | mAP@0.5 | mAP@0.5:0.95 | p50 (ms) | Size (MB) | Compression | Speed-up
S0_FP32 | 0.7146 | 0.5137 | 38.60 | 6.25 | 1.00× | 1.00×
S1_FP16 | 0.7086 | 0.5208 | 21.61 | 6.18 | 1.01× | 1.79×
S1_INT8 | 0.7090 | 0.5212 | 20.34 | 3.35 | 1.87× | 1.90×
S2_QAT_INT8 | 0.7183 | 0.5254 | 18.33 | 3.35 | 1.87× | 2.11×
S3_PrunedINT8 | 0.7353 | 0.5287 | 10.71 | 3.35 | 1.87× | 3.60×

Round-1 PiScore ranking (Stage 1–3 only)

Rank | Pipeline | PiScore
1 | S3_PrunedINT8 | 0.9000
2 | S2_QAT_INT8 | 0.4007
3 | S1_INT8 | 0.1967
4 | S1_FP16 | 0.0000

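The Pareto plot highlights S1_INT8, S2_QAT_INT8, and S3_PrunedINT8. Note that S3_PrunedINT8 is better than the other two on both mAP@0.5 and CPU latency and ties them on size (3.35 MB), so it is the only strictly non-dominated candidate. We carry the top 2 by PiScore (S3_PrunedINT8 and S2_QAT_INT8) into Stage 4.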

Figure 11. Pareto plot — mAP vs CPU latency for Stages 0–3. Top-left is best.

Figure 12. Round-1 PiScore (Stage 1–3 candidates only). The top-2 are carried into the resolution sweep.

10.2 Stage 4 — Resolution sweep

Pipeline | imgsz | mAP@0.5 | p50 (ms) | Size (MB) | FPS | PiScore
S3_PrunedINT8 | 640 | 0.7353 | 10.71 | 3.35 | 93.3 | 0.9000
S3_PrunedINT8 | 512 | 0.7225 | 7.42 | 3.27 | 134.8 | 0.7665
S3_PrunedINT8 | 416 | 0.7010 | 5.09 | 3.23 | 196.5 | 0.7161
S3_PrunedINT8 | 320 | 0.6479 | 3.81 | 3.20 | 262.5 | 0.5000
S2_QAT_INT8 | 640 | 0.7183 | 18.33 | 3.35 | 54.6 | 0.4007
S2_QAT_INT8 | 512 | 0.7215 | 9.29 | 3.27 | 107.6 | 0.7251
S2_QAT_INT8 | 416 | 0.7114 | 4.93 | 3.23 | 202.8 | 0.7668 ← winner
S2_QAT_INT8 | 320 | 0.6675 | 4.93 | 3.20 | 202.8 | 0.5677

Figure 13. Stage 4 sweep — mAP and CPU latency vs input resolution for the top-2 pipelines. The 320 px point shows the resolution cliff.

S2_QAT_INT8 @ 416 px wins on the final PiScore by combining near-baseline accuracy (only 0.0032 mAP@0.5 below FP32) with a 7.83× speed-up over the FP32 baseline. S3_PrunedINT8 @ 320 px is faster still (3.81 ms) but gives up too much accuracy. At 320 px both pipelines fall about 5 to 7 mAP points below the FP32 baseline, indicating the resolution cliff for small cracks.

11. Final Selected Model


WINNER: S2_QAT_INT8 @ 416 px

QAT-trained, INT8-quantised YOLOv8n re-exported and validated at a 416×416 input resolution.

Metric | Value
mAP@0.5 | 0.7114
mAP@0.5:0.95 | 0.5258
CPU p50 latency | 4.93 ms
Throughput | 202.8 FPS
File size | 3.23 MB
Speed-up vs FP32 | 7.83×
Compression vs FP32 | 1.93×
mAP drop vs FP32 | 0.0032

The deployed artefact is S2_QAT_INT8_r416.tflite (3.23 MB). At 4.93 ms per frame on a 4-thread Linux CPU, the model leaves headroom for the camera capture, pre-processing, and post-processing on the Pi 5 while still hitting real-time framerates for live navigation.
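The headline ratios in the table follow directly from the raw latency, size, and mAP figures. The sketch below recomputes them and adds a rough per-frame headroom estimate; the 30 FPS capture rate used for the headroom is an assumption for illustration, not a figure from this report.

```python
FP32 = {"p50_ms": 38.60, "size_mb": 6.25, "map50": 0.7146}   # S0_FP32, Section 10.1
FINAL = {"p50_ms": 4.93, "size_mb": 3.23, "map50": 0.7114}   # S2_QAT_INT8 @ 416 px


def summarise(baseline, model):
    """Derive the headline ratios from raw latency, size, and mAP figures."""
    return {
        "speed_up": baseline["p50_ms"] / model["p50_ms"],
        "compression": baseline["size_mb"] / model["size_mb"],
        "fps": 1000.0 / model["p50_ms"],
        "map_drop": baseline["map50"] - model["map50"],
    }


def headroom_ms(model_ms, camera_fps=30.0):
    """Time left per frame for capture, pre-processing, and post-processing.

    camera_fps=30 is an assumed capture rate, not a figure from the report.
    """
    return 1000.0 / camera_fps - model_ms


summary = summarise(FP32, FINAL)
print(f"speed-up {summary['speed_up']:.2f}x, compression {summary['compression']:.2f}x, "
      f"{summary['fps']:.1f} FPS, mAP drop {summary['map_drop']:.4f}")
print(f"headroom at 30 FPS: {headroom_ms(FINAL['p50_ms']):.2f} ms")
```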

12. Deployment


12.1 On-device runtime

Inference on the Pi uses the tflite_runtime Python interpreter with XNNPACK enabled and num_threads=4. The model accepts a 416×416×3 INT8 input and returns YOLOv8 detection tensors which are decoded with the standard non-max-suppression post-processing.

    import tflite_runtime.interpreter as tflite

    interpreter = tflite.Interpreter(
        model_path='S2_QAT_INT8_r416.tflite',
        num_threads=4,
    )
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Capture → resize 416×416 → quantise → invoke → NMS
    # img_int8 is the pre-processed 416×416×3 INT8 frame with a leading batch dimension.
    interpreter.set_tensor(input_details[0]['index'], img_int8)
    interpreter.invoke()
    boxes = interpreter.get_tensor(output_details[0]['index'])
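The non-max-suppression step mentioned above can be sketched in plain Python. This is a generic greedy NMS over boxes in (x1, y1, x2, y2) form, not the project's post-processing code; the score and IoU thresholds (0.25 and 0.45) are assumed defaults, and decoding the raw output tensor into boxes and scores is left out.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thr=0.25, iou_thr=0.45):
    """Greedy non-max suppression; returns indices of kept boxes, best score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Made-up detections: two overlapping boxes, one separate box, one weak box.
boxes = [(100, 100, 200, 200), (105, 105, 205, 205), (300, 300, 350, 350), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))  # -> [0, 2]
```

Running NMS in float on the CPU after dequantisation keeps the box arithmetic out of the INT8 graph.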

12.2 Camera integration

Frames are captured from the Intel RealSense D455 at 1280×720, then letter-boxed and resized to 416×416 for the detector. Detected cracks are projected back to the original frame coordinates and overlaid on the camera feed. Depth values from the D455 are used downstream to estimate crack depth and physical extent.
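The letter-box geometry and the projection back to frame coordinates reduce to a scale and an offset. The sketch below works through the 1280×720 to 416×416 case; centred padding and the function names are assumptions, since the report does not specify how the padding is placed.

```python
def letterbox_params(src_w, src_h, dst=416):
    """Scale and padding that fit a src_w x src_h frame inside a dst x dst square."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    # Assumption: the resized frame is centred, with equal padding on both sides.
    pad_x = (dst - new_w) / 2
    pad_y = (dst - new_h) / 2
    return scale, pad_x, pad_y


def to_frame_coords(box, scale, pad_x, pad_y, src_w, src_h):
    """Map an (x1, y1, x2, y2) box from detector pixels back to frame pixels."""
    x1, y1, x2, y2 = box
    # Undo the padding, undo the scaling, then clip to the frame.
    fx1 = min(max((x1 - pad_x) / scale, 0.0), src_w)
    fy1 = min(max((y1 - pad_y) / scale, 0.0), src_h)
    fx2 = min(max((x2 - pad_x) / scale, 0.0), src_w)
    fy2 = min(max((y2 - pad_y) / scale, 0.0), src_h)
    return fx1, fy1, fx2, fy2


scale, pad_x, pad_y = letterbox_params(1280, 720)
print(scale, pad_x, pad_y)  # a 1280x720 frame becomes 416x234 with 91 px bars top and bottom
print(to_frame_coords((104, 91, 312, 325), scale, pad_x, pad_y, 1280, 720))
```

The same frame coordinates can then be used to look up depth values in the aligned D455 depth image.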

13. Conclusion and Future Work


We delivered a deployable real-time crack detector for the VOLTA Bot Sync platform, taking a YOLOv8n FP32 model from 6.25 MB / 38.6 ms to 3.23 MB / 4.93 ms (a 1.93× size reduction and 7.83× latency reduction) at a cost of only 0.0032 mAP@0.5. The end-to-end pipeline (model selection → PTQ → QAT → pruning → resolution sweep) is reproducible, and the PiScore metric makes the final model selection auditable.

Future work

  • Structured pruning. Channel-level pruning would let TFLite shrink conv shapes for real on-device speed gains, unlike unstructured sparsity which TFLite stores densely.

  • Crack severity grading. Use D455 depth to estimate physical crack width and classify severity (hairline / minor / major).

  • 3D crack coordinates. Use SLAM to compute the 3D position of each detected crack in the world frame.

  • Full autonomy. Automate the system end to end so the bot operates without human intervention.

14. References