Model Optimization for Mobile
Why optimization is essential on mobile
A model that’s accurate but 200MB and slow is useless in a phone app — it bloats your download, eats memory, and lags. Optimization shrinks models and speeds up inference so they fit the tight budgets of a mobile device, ideally with little accuracy loss.
Quantization: the biggest win
Quantization reduces the precision of a model’s numbers — for example from 32-bit floats to 8-bit integers. This can shrink the model ~4× and make it run faster (especially on hardware that loves integers), usually with only a small accuracy drop.
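# TensorFlow Lite: post-training quantization, run off-device
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(path)  # path: SavedModel directory
# Optimize.DEFAULT alone gives dynamic-range quantization (int8 weights, float activations);
# full int8 additionally needs converter.representative_dataset for calibration.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()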
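On iOS, Core ML models can be compressed in a similar way, using weight quantization or palettization (clustering weights into a small lookup table), typically applied with coremltools before the model is shipped.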
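To make the float-to-int8 mapping concrete, here is a minimal pure-Python sketch of affine quantization with a single scale and zero point. It is an illustration under simplifying assumptions (one scale for the whole list, made-up weights), not the exact scheme TensorFlow Lite or Core ML uses internally; the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to int8 codes with one scale and zero point (affine quantization)."""
    lo = min(min(values), 0.0)  # keep 0.0 inside the range so it stays exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 when every value is 0
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # made-up example weights
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

float32_bytes = 4 * len(weights)  # 4 bytes per 32-bit float
int8_bytes = 1 * len(weights)     # 1 byte per 8-bit integer
print(codes, max_error, float32_bytes / int8_bytes)
```

Each weight drops from 4 bytes to 1, which is where the roughly 4× size reduction comes from, and the round-trip error per weight stays within about one quantization step.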
Other techniques
- Pruning — remove weights that barely matter, making the model smaller and sometimes faster.
- Knowledge distillation — train a small “student” model to mimic a big “teacher” model.
- Pick a mobile-first architecture — models like MobileNet/EfficientNet are designed to be small and fast from the start.
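The first two techniques in the list above can be sketched in a few lines of plain Python. This is a toy illustration, assuming unstructured magnitude pruning and a temperature-softened cross-entropy as the distillation objective; real training frameworks provide their own tooling, and the function names and numbers here are hypothetical.

```python
import math


def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune weights closest to zero.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune]
    drop = set(smallest)
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives softer probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


pruned = prune_by_magnitude([0.9, -0.01, 0.4, 0.02, -0.7, 0.003], sparsity=0.5)
close = distillation_loss([2.0, 1.0, 0.1], [2.1, 0.9, 0.0])  # student agrees with teacher
far = distillation_loss([0.1, 1.0, 2.0], [2.1, 0.9, 0.0])    # student disagrees
print(pruned, close, far)
```

Pruned weights only reduce file size if the model is stored in a sparse or compressed format, and they only reduce latency if the runtime can skip the zeros, which is why pruning is "sometimes" faster.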
Hardware delegates / accelerators
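Beyond shrinking the model, use the phone’s specialised chips. The GPU/NPU delegates (Android) and the Neural Engine (iOS) can run inference much faster and with less battery drain than the CPU. Not every operation is supported by every accelerator, and unsupported operations fall back to the CPU, so confirm the speed-up by measuring on the target device.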
Measure, don’t guess
Always benchmark on real devices (including older, low-end ones), not just the latest flagship. Track three things:
- Latency — milliseconds per inference.
- Model/app size — download and storage impact.
- Accuracy — make sure optimization didn’t break the feature.
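For the latency number, a small timing harness is usually enough. The sketch below assumes a hypothetical `run_inference` callable that performs one inference; here it is replaced by a stand-in workload so the example runs anywhere. On a phone you would use the platform's own benchmarking tools or the equivalent loop in app code.

```python
import statistics
import time


def benchmark(run_inference, warmup=5, runs=50):
    """Time a callable and report latency statistics in milliseconds."""
    for _ in range(warmup):  # warm-up runs let caches and lazy initialisation settle
        run_inference()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def fake_inference():
    """Stand-in workload; replace with a real interpreter call on the device."""
    sum(i * i for i in range(10_000))


stats = benchmark(fake_inference)
print(stats)
```

Reporting the median together with a high percentile matters because occasional slow runs (thermal throttling, background work) are what users notice as lag.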
Reducing app size
Bundling a model inflates your app. Options: quantize aggressively, download the model on first run instead of bundling it, or deliver it on demand (for example with Play Feature Delivery on Android) so that only users who need the feature download it.
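The download-on-first-run option could look roughly like the following. It is a sketch under stated assumptions: `fetch` is a hypothetical callable that returns the model bytes (in a real app, a network request), the expected SHA-256 digest ships with the app, and fake bytes stand in for a real model so the example is self-contained.

```python
import hashlib
import tempfile
from pathlib import Path


def ensure_model(cache_dir, filename, expected_sha256, fetch):
    """Return a local model path, fetching the file only if no valid copy is cached."""
    path = Path(cache_dir) / filename
    if path.exists() and hashlib.sha256(path.read_bytes()).hexdigest() == expected_sha256:
        return path  # cached copy is intact, no download needed
    data = fetch()  # hypothetical callable that returns the model bytes
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("downloaded model failed checksum verification")
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.parent / (path.name + ".part")
    tmp.write_bytes(data)
    tmp.replace(path)  # rename last so a half-written file never looks like a valid model
    return path


# Demo with fake bytes standing in for a real network download.
fake_model = b"not a real model"
digest = hashlib.sha256(fake_model).hexdigest()
calls = []


def fake_fetch():
    calls.append(1)  # count how often a "download" actually happens
    return fake_model


cache = tempfile.mkdtemp()
first = ensure_model(cache, "model.tflite", digest, fake_fetch)
second = ensure_model(cache, "model.tflite", digest, fake_fetch)
print(first, len(calls))
```

The second call finds a valid cached file and skips the fetch, so the user pays the download cost only once.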
The accuracy/size/speed triangle
You’re always trading off accuracy, size and speed. The art is finding the smallest, fastest model that’s still “good enough” for your feature: users will usually take a fast, slightly less perfect feature over a more accurate one that lags.
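One way to make "good enough" explicit is to set an accuracy floor and a latency budget, then pick the smallest model that meets both. The sketch below assumes hypothetical candidate names and illustrative numbers, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float    # fraction correct on the evaluation set
    size_mb: float
    latency_ms: float  # measured on the slowest device you support


def pick_model(candidates, min_accuracy, max_latency_ms):
    """Return the smallest candidate that meets both budgets, or None."""
    viable = [c for c in candidates
              if c.accuracy >= min_accuracy and c.latency_ms <= max_latency_ms]
    return min(viable, key=lambda c: (c.size_mb, c.latency_ms), default=None)


# Illustrative numbers only, not measurements.
candidates = [
    Candidate("float32", 0.92, 16.0, 120.0),
    Candidate("int8", 0.91, 4.0, 45.0),
    Candidate("int8-pruned", 0.86, 2.5, 40.0),
]
choice = pick_model(candidates, min_accuracy=0.90, max_latency_ms=60.0)
print(choice)
```

With these budgets the float32 model is rejected for latency and the pruned model for accuracy, leaving the int8 variant as the smallest model that still passes.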
Common mistakes
- Shipping an unquantized model and bloating the app.
- Benchmarking only on a flagship phone, then shipping lag to most users.
- Optimizing so hard that accuracy quietly breaks — always re-test quality.
Summary: Make models mobile-ready with quantization (the biggest win), pruning/distillation, and mobile-first architectures, plus hardware delegates for speed. Benchmark latency, size and accuracy on real low-end devices, and consider downloading models instead of bundling them.