Model Optimization for Mobile

Why optimization is essential on mobile

A model that’s accurate but 200 MB and slow is of little use in a phone app: it bloats your download, eats memory, and makes the UI lag. Optimization shrinks models and speeds up inference so they fit the tight storage, memory and power budgets of a mobile device, ideally with little accuracy loss.

Quantization: the biggest win

Quantization reduces the precision of a model’s numbers, for example from 32-bit floats to 8-bit integers. Each weight then takes 1 byte instead of 4, so the weights shrink by roughly 4×, and inference often gets faster, especially on hardware with fast integer arithmetic. The accuracy drop is usually small, but it should be measured rather than assumed.
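
To make the arithmetic concrete, here is a minimal pure-Python sketch of affine (scale and zero-point) quantization to int8. The function names and the sample weights are made up for illustration; real converters choose ranges per tensor or per channel and do considerably more work.

```python
def quantize_int8(values):
    """Map a list of floats onto int8 using one scale and one zero point."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the int8 values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.2, -0.4, 0.0, 0.3, 0.9, 2.1]  # illustrative values only
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

# Worst-case round-trip error and the storage saving for these weights.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
float32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)
print(q, max_error, float32_bytes // int8_bytes)
```

The round-trip error stays within about one quantization step (`scale`), which is why a model whose weights span a narrow range usually loses little accuracy.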

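# TensorFlow Lite: convert with post-training quantization (done off-device)
import tensorflow as tf

path = "saved_model_dir"  # placeholder: directory of your exported SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model(path)
# Optimize.DEFAULT alone gives dynamic-range quantization: int8 weights, float activations
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()  # serialized model as bytes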
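
Setting `Optimize.DEFAULT` on its own typically produces dynamic-range quantization (int8 weights, float activations). Full integer quantization also needs a representative dataset so the converter can calibrate activation ranges. The sketch below shows one way to set that up; the function name and the `calibration_samples` argument are assumptions for illustration, and you would supply real inputs from your own data.

```python
def convert_full_int8(saved_model_dir, calibration_samples):
    """Convert a SavedModel to a fully int8-quantized TFLite model (bytes).

    calibration_samples: a list of float32 NumPy arrays, each shaped like
    one input batch of the model (hypothetical; use a few hundred real inputs).
    """
    import tensorflow as tf  # imported here so the sketch loads without TF

    def representative_dataset():
        # The converter runs these inputs to observe activation ranges.
        for sample in calibration_samples:
            yield [sample]

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Fail the conversion if an op cannot be expressed in int8.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    # Make the model's input and output tensors int8 as well.
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()
```

Integer-only models are what most NPU and DSP accelerators expect, but they tend to lose more accuracy than dynamic-range quantization, so re-check quality after converting.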

On iOS, Core ML models can be compressed in a similar way: coremltools supports weight quantization and palettization (storing weights as indices into a small lookup table).

Other techniques

Pruning — remove weights that barely matter, making the model smaller and sometimes faster.
Knowledge distillation — train a small “student” model to mimic a big “teacher” model.
Pick a mobile-first architecture — models like MobileNet/EfficientNet are designed to be small and fast from the start.
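
The first two techniques can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions: `magnitude_prune` prunes one flat list of weights by absolute value, and `distillation_loss` is a KL divergence between temperature-softened teacher and student outputs. The function names and sample numbers are made up; real training code would use a framework's pruning and loss utilities.

```python
import math


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # how many weights to remove
    if k == 0:
        return list(weights)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


pruned = magnitude_prune([0.8, -0.01, 0.05, -1.3, 0.002, 0.4], sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.0, -1.3, 0.0, 0.4]

print(distillation_loss([2.0, 0.5, 0.1], [2.5, 0.3, 0.0]))
```

Zeroed weights only shrink the file once it is compressed or stored in a sparse format, and they only speed up inference when the runtime has sparse kernels, which is why pruning is "sometimes" faster. In practice the distillation term is usually combined with the ordinary loss on the true labels.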

Hardware delegates / accelerators

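Beyond shrinking the model, use the phone’s specialised chips. On Android, TensorFlow Lite delegates can hand supported operations to the GPU or NPU; on iOS, Core ML can schedule work on the GPU or the Neural Engine. When the model’s operations are supported, this is often much faster and more power-efficient than running on the CPU. Unsupported operations fall back to the CPU, so check that the speed-up actually materialises for your model.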
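
In an app you would configure delegates through the Kotlin/Java or Swift APIs, but the pattern is the same everywhere: try the accelerator, and fall back to the CPU if it is unavailable. Below is a rough sketch of that pattern with the TensorFlow Lite Python interpreter. The function name is made up, and `delegate_library` is a hypothetical path to whatever delegate shared library your platform provides.

```python
def make_interpreter(model_path, delegate_library=None):
    """Create a TFLite interpreter, using a delegate when one can be loaded."""
    import tensorflow as tf  # imported here so the sketch loads without TF

    delegates = []
    if delegate_library is not None:
        try:
            delegates.append(tf.lite.experimental.load_delegate(delegate_library))
        except (ValueError, OSError) as err:
            # Missing or incompatible delegate: keep going on the CPU.
            print(f"Delegate unavailable ({err}); falling back to CPU.")

    interpreter = tf.lite.Interpreter(
        model_path=model_path,
        experimental_delegates=delegates or None,
    )
    interpreter.allocate_tensors()
    return interpreter
```

Whichever API you use, benchmark with and without the delegate: small models can run slower on an accelerator because of the cost of moving data to and from it.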

Measure, don’t guess

Always benchmark on real devices (including older, low-end ones), not just the latest flagship. Track three things:

Latency — milliseconds per inference.
Model/app size — download and storage impact.
Accuracy — make sure optimization didn’t break the feature.
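
A small harness along these lines covers all three numbers. It is a sketch rather than a full benchmarking tool: `run_inference` is any zero-argument callable you supply that performs one inference, and `fake_inference` is a stand-in workload so the example runs on its own.

```python
import os
import statistics
import time


def benchmark(run_inference, warmup=5, runs=50):
    """Time repeated calls and report latency in milliseconds."""
    for _ in range(warmup):
        run_inference()  # warm caches and lazy initialisation before timing
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }


def model_size_mb(path):
    """Size of the model file on disk, in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def fake_inference():
    # Stand-in workload; replace with a real call into your model.
    sum(i * i for i in range(10_000))


report = benchmark(fake_inference)
print(report)
```

Reporting the 95th percentile alongside the median matters on phones, where thermal throttling and background work make the slow runs much slower than the typical one.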

Reducing app size

Bundling a model inflates your app. Options: quantize aggressively, download the model from a hosting service on first run (rather than bundling it), or use on-demand delivery such as Play Feature Delivery on Android so users only get the model if they need the feature.
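
The download-on-first-run option comes down to "use the cached copy if it is intact, otherwise fetch and verify". Here is a minimal sketch of that logic in Python; in an app the same steps would be written in Kotlin or Swift. The function names, the URL and the checksum are placeholders you would supply.

```python
import hashlib
import os
import urllib.request


def sha256_of(path):
    """Hex SHA-256 of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_model(url, dest_path, expected_sha256):
    """Return dest_path, downloading the model only if no valid copy is cached."""
    if os.path.exists(dest_path) and sha256_of(dest_path) == expected_sha256:
        return dest_path  # already downloaded and intact

    tmp_path = dest_path + ".part"
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
        while True:
            chunk = response.read(1 << 20)
            if not chunk:
                break
            out.write(chunk)

    if sha256_of(tmp_path) != expected_sha256:
        os.remove(tmp_path)
        raise ValueError("Downloaded model failed checksum verification")

    # Rename only after verification, so a partial file is never loaded.
    os.replace(tmp_path, dest_path)
    return dest_path
```

The feature needs a sensible fallback while the model is missing (first launch offline, download failed), since the app can no longer assume the file is there.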

The accuracy/size/speed triangle

You’re always trading off accuracy, size and speed. The art is finding the smallest, fastest model that’s still “good enough” for your feature. Users will often prefer a fast, slightly less accurate feature over a more accurate one that lags.
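
One way to make "good enough" explicit is to set an accuracy floor and a latency budget, then take the smallest candidate that meets both. The sketch below does that. The `Candidate` class, the budgets and all the numbers are invented for illustration and are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float    # fraction correct on your evaluation set
    size_mb: float     # model file size
    latency_ms: float  # median latency on your slowest target device


def pick_model(candidates, min_accuracy, max_latency_ms):
    """Smallest candidate that meets both budgets, or None if none does."""
    acceptable = [
        c for c in candidates
        if c.accuracy >= min_accuracy and c.latency_ms <= max_latency_ms
    ]
    if not acceptable:
        return None
    # Prefer the smallest model; break ties by lower latency.
    return min(acceptable, key=lambda c: (c.size_mb, c.latency_ms))


# Hypothetical numbers, purely to show the selection logic.
candidates = [
    Candidate("float32", accuracy=0.92, size_mb=16.0, latency_ms=80.0),
    Candidate("int8", accuracy=0.91, size_mb=4.0, latency_ms=30.0),
    Candidate("int8-pruned", accuracy=0.86, size_mb=2.5, latency_ms=25.0),
]
choice = pick_model(candidates, min_accuracy=0.90, max_latency_ms=50.0)
print(choice.name)  # int8
```

Here the float32 model misses the latency budget and the pruned model misses the accuracy floor, so the plain int8 model is chosen.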

Common mistakes

Shipping an unquantized model and bloating the app.
Benchmarking only on a flagship phone, then shipping lag to most users.
Optimizing so hard that accuracy quietly breaks — always re-test quality.

Summary: Make models mobile-ready with quantization (the biggest win), pruning/distillation, and mobile-first architectures, plus hardware delegates for speed. Benchmark latency, size and accuracy on real low-end devices, and consider downloading models instead of bundling them.