Model Optimization for Mobile

🗓 May 31, 2026 ⏱ 3 min read

Why optimization is essential on mobile

A model that’s accurate but 200MB and slow is useless in a phone app — it bloats your download, eats memory, and lags. Optimization shrinks models and speeds up inference so they fit the tight budgets of a mobile device, ideally with little accuracy loss.

Quantization: the biggest win

Quantization reduces the precision of a model’s numbers, for example from 32-bit floats to 8-bit integers. Each weight then takes 1 byte instead of 4, so the model shrinks roughly 4×, and inference often runs faster (especially on hardware with fast integer arithmetic), usually with only a small accuracy drop.

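# TensorFlow Lite: post-training quantization (done off-device)
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(path)  # path: your SavedModel directory
# Optimize.DEFAULT alone quantizes weights to int8 (dynamic range);
# full int8 also needs a representative dataset for calibration.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()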

On iOS, Core ML supports similar quantization/palettization to compress models.
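
To make the arithmetic behind quantization concrete, here is a minimal pure-Python sketch of affine int8 quantization. It is an illustration only, not how TensorFlow Lite or Core ML implement it internally; the function names `quantize_int8` and `dequantize` and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Affine-quantize floats to int8 codes; returns (codes, scale, zero_point)."""
    # The representable range must include 0 so that zero maps to an exact code.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


# Illustrative weights (hypothetical values, not from a real model)
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)

# Worst-case rounding error introduced by quantization
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage: 4 bytes per float32 weight vs 1 byte per int8 code
float32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)
print(codes, max_error, float32_bytes / int8_bytes)
```

The restored values differ from the originals by at most about one quantization step (`scale`), which is where the small accuracy drop comes from. The 4× figure ignores the few extra bytes needed to store `scale` and `zero_point`.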

Other techniques

  • Pruning — remove weights that barely matter, making the model smaller and sometimes faster.
  • Knowledge distillation — train a small “student” model to mimic a big “teacher” model.
  • Pick a mobile-first architecture — models like MobileNet/EfficientNet are designed to be small and fast from the start.
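
As a rough illustration of the pruning idea in the list above, the sketch below zeroes out the smallest-magnitude weights in a flat list. It is a simplified assumption of how magnitude pruning works; real toolkits prune gradually during training and usually fine-tune afterwards. The name `magnitude_prune` and the sample weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by magnitude, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Illustrative weights (made up for this example)
weights = [0.8, -0.01, 0.3, 0.002, -0.6, 0.05]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

Zeros alone do not shrink a file: the saving appears once the model is compressed or stored in a sparse format, and a speed-up needs a runtime that skips zeros. That is why pruning makes models only "sometimes faster".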

Hardware delegates / accelerators

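Beyond shrinking the model, use the phone’s specialised chips. The GPU/NPU delegates (Android) and the Neural Engine (iOS) can run inference much faster and with less battery drain than the CPU, provided the model’s operations are supported; unsupported operations typically fall back to the CPU, so check that the speed-up is real on your model.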

Measure, don’t guess

Always benchmark on real devices (including older, low-end ones), not just the latest flagship. Track three things:

  • Latency — milliseconds per inference.
  • Model/app size — download and storage impact.
  • Accuracy — make sure optimization didn’t break the feature.
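
For the latency metric, a small timing harness along these lines can be a starting point. This is a sketch under assumptions: `benchmark` and `fake_inference` are hypothetical names, and `fake_inference` is a stand-in workload. In practice you would pass a function that runs your model once, and run the measurement on the target device rather than a development machine.

```python
import statistics
import time


def benchmark(run_inference, warmup=5, runs=50):
    """Time a zero-argument callable and return latency stats in milliseconds."""
    for _ in range(warmup):
        run_inference()  # warm-up runs are discarded (caches, lazy initialisation)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def fake_inference():
    """Stand-in workload; replace with a single call into your model."""
    sum(i * i for i in range(10_000))


stats = benchmark(fake_inference)
print(stats)
```

Reporting the median together with a high percentile is a deliberate choice: an average can hide the occasional slow inference that users notice as lag.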

Reducing app size

Bundling a model inflates your app. Options: quantize aggressively, download the model on first run (rather than bundling it), or use on-demand delivery such as Play Feature Delivery on Android so users only get the model if they need the feature.
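
The download-on-first-run option could look roughly like the sketch below. It shows only the logic, in Python for consistency with the rest of this lesson; a real app would do this in Kotlin or Swift with the platform’s networking and storage APIs. The names `ensure_model` and `fake_fetch` and the file name `model.tflite` are hypothetical, and the fetch function is injected so that no real URL is assumed.

```python
import hashlib
import os
import tempfile


def ensure_model(cache_path, fetch_bytes, expected_sha256):
    """Return cache_path, fetching the model only if no valid cached copy exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() == expected_sha256:
                return cache_path  # valid cached copy, no download needed
    data = fetch_bytes()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("downloaded model failed checksum verification")
    # Write to a temporary file, then rename, so a crash never leaves
    # a half-written model at cache_path.
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, cache_path)
    return cache_path


# Demo with a fake download function instead of a real network call
model_bytes = b"pretend this is a .tflite file"
expected = hashlib.sha256(model_bytes).hexdigest()
calls = []


def fake_fetch():
    calls.append(1)
    return model_bytes


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "model.tflite")
    ensure_model(target, fake_fetch, expected)  # first run: downloads
    ensure_model(target, fake_fetch, expected)  # later runs: reuses the cache
    cached_ok = os.path.exists(target)

print(len(calls), cached_ok)
```

Verifying a checksum before use guards against a truncated or corrupted download being loaded as a model.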

The accuracy/size/speed triangle

You’re always trading off accuracy, size and speed. The art is finding the smallest, fastest model that’s still “good enough” for your feature: users often prefer a fast, slightly less perfect feature to an accurate one that lags.
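
One way to make "smallest, fastest model that’s still good enough" explicit is to write the thresholds down and filter the candidates against them. The sketch below assumes you have already measured each variant; the `Candidate` class, `pick_model` function, thresholds and all numbers are invented placeholders for illustration, not real benchmark results.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float    # fraction correct on your own test set
    size_mb: float     # model file size
    latency_ms: float  # measured on your slowest supported device


def pick_model(candidates, min_accuracy, max_latency_ms):
    """Return the smallest candidate meeting both thresholds, or None."""
    good_enough = [
        c for c in candidates
        if c.accuracy >= min_accuracy and c.latency_ms <= max_latency_ms
    ]
    if not good_enough:
        return None
    # Prefer the smallest model; break ties by lower latency
    return min(good_enough, key=lambda c: (c.size_mb, c.latency_ms))


# Placeholder numbers, invented purely to exercise the function
candidates = [
    Candidate("float32", accuracy=0.92, size_mb=16.0, latency_ms=120.0),
    Candidate("int8", accuracy=0.91, size_mb=4.0, latency_ms=45.0),
    Candidate("int8-pruned", accuracy=0.85, size_mb=3.0, latency_ms=40.0),
]
choice = pick_model(candidates, min_accuracy=0.90, max_latency_ms=60.0)
print(choice.name if choice else "no candidate meets the thresholds")
```

With these placeholder values the float32 variant is too slow and the pruned variant is not accurate enough, so the int8 variant is selected. Fixing the thresholds before you start optimizing keeps accuracy from quietly slipping below what the feature needs.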

Common mistakes

  • Shipping an unquantized model and bloating the app.
  • Benchmarking only on a flagship phone, then shipping lag to most users.
  • Optimizing so hard that accuracy quietly breaks — always re-test quality.

Summary: Make models mobile-ready with quantization (the biggest win), pruning/distillation, and mobile-first architectures, plus hardware delegates for speed. Benchmark latency, size and accuracy on real low-end devices, and consider downloading models instead of bundling them.