Computer Vision on Mobile
The camera is the AI superpower of phones
A huge share of Mobile AI features come from the camera: identifying objects, scanning documents, applying effects, measuring, translating signs. This lesson covers the common vision tasks and how to run them smoothly in real time.
The main vision tasks
- Image classification — “what is this a picture of?” (one label for the whole image).
- Object detection — “what objects are here and where?” (boxes + labels).
- Segmentation — which pixels belong to which object (e.g. background removal).
- Face/pose landmarks — key points for filters, fitness, AR.
- OCR — reading text from the scene.
Still image vs real-time
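Running a model on a single captured photo is easy: the user can wait a moment for the result. The challenge is real-time processing of the live camera feed. At 30 frames per second, each frame leaves a budget of roughly 33 ms, and sustained inference at that rate drains the battery and heats the device. Staying smooth means controlling how often the model runs and how much work each run does.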
Wiring up the camera + model
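import org.tensorflow.lite.support.image.TensorImage

// Android: the CameraX analyzer runs your model on each frame, off the main thread.
// classifier is assumed to be e.g. a TensorFlow Lite Task Library ImageClassifier.
imageAnalysis.setAnalyzer(executor) { imageProxy ->
    try {
        val bitmap = imageProxy.toBitmap() // available in recent CameraX versions
        val results = classifier.classify(TensorImage.fromBitmap(bitmap))
        runOnUiThread { overlay.show(results) } // touch views only on the UI thread
    } finally {
        imageProxy.close() // always close the frame, even if inference throws
    }
}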
On iOS, the equivalent is an AVCaptureVideoDataOutput delegate that hands each frame to Vision/Core ML.
Keeping real-time fast
- Throttle frames — you rarely need 30fps inference; process every 2nd–5th frame.
- Use a small, fast model — accuracy vs speed is a real trade-off on live video.
- Hardware acceleration — GPU/NPU delegates (Android) or the Neural Engine (iOS).
- Downscale input — feed the model a smaller image when possible.
- Drop frames under load — never queue up work faster than you can process it.
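The throttling and frame-dropping rules above can be sketched as a small gate that the analyzer consults before running the model. This is a minimal sketch, not part of CameraX or TensorFlow Lite: FrameGate, tryAcquire and release are hypothetical names chosen for illustration.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicLong

/**
 * Hypothetical helper: decides whether a camera frame should reach the model.
 * It lets through every Nth frame, and only when the previous inference is done.
 */
class FrameGate(private val everyNth: Int) {
    init {
        require(everyNth >= 1) { "everyNth must be at least 1" }
    }

    private val frameCount = AtomicLong(0)
    private val busy = AtomicBoolean(false)

    /** Returns true if the caller should run inference on this frame. */
    fun tryAcquire(): Boolean {
        val index = frameCount.getAndIncrement()
        // Throttle: skip all frames except every Nth one.
        if (index % everyNth != 0L) return false
        // Drop under load: refuse the frame if a previous inference is still running.
        return busy.compareAndSet(false, true)
    }

    /** Call when inference for an accepted frame has finished (ideally in a finally block). */
    fun release() {
        busy.set(false)
    }
}
```

In an analyzer, a frame rejected by tryAcquire() would be closed immediately and skipped; an accepted frame would be processed and then followed by release(). The busy flag matters mostly when inference is handed off to another thread. If the model runs synchronously inside the analyzer, CameraX's ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST backpressure strategy already discards frames that arrive while the analyzer is occupied.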
Drawing results over the camera
Detections are returned in the model’s coordinate space; you must map them to screen coordinates and draw boxes/labels on an overlay above the camera preview, accounting for rotation and scaling.
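As a rough illustration of that mapping, the sketch below rotates a box into the upright image and then scales it into the view. It rests on several assumptions: boxes are in pixel coordinates of the analyzed image, rotation is a clockwise multiple of 90 degrees (the kind of value CameraX reports through imageProxy.imageInfo.rotationDegrees), and the preview uses center-crop ("fill") scaling. Box, rotateBox and mapToView are hypothetical names, and front-camera mirroring is not handled.

```kotlin
/** A rectangle in pixel coordinates (hypothetical type for this sketch). */
data class Box(val left: Float, val top: Float, val right: Float, val bottom: Float)

/**
 * Rotates a box clockwise by rotationDegrees within an image of the given size.
 * For 90 and 270 the resulting image has width and height swapped.
 */
fun rotateBox(box: Box, imageWidth: Int, imageHeight: Int, rotationDegrees: Int): Box {
    val w = imageWidth.toFloat()
    val h = imageHeight.toFloat()
    return when (rotationDegrees) {
        0 -> box
        90 -> Box(h - box.bottom, box.left, h - box.top, box.right)
        180 -> Box(w - box.right, h - box.bottom, w - box.left, h - box.top)
        270 -> Box(box.top, w - box.right, box.bottom, w - box.left)
        else -> throw IllegalArgumentException("rotationDegrees must be 0, 90, 180 or 270")
    }
}

/**
 * Maps a box from the upright image into view coordinates, assuming the preview
 * scales the image to cover the view and crops the overflow equally on both sides.
 * imageWidth and imageHeight are the upright (post-rotation) dimensions.
 */
fun mapToView(box: Box, imageWidth: Int, imageHeight: Int, viewWidth: Int, viewHeight: Int): Box {
    // Center-crop uses the larger of the two scale factors so the image covers the view.
    val scale = maxOf(viewWidth.toFloat() / imageWidth, viewHeight.toFloat() / imageHeight)
    // Offsets are zero or negative: the part of the image that overflows the view.
    val dx = (viewWidth - imageWidth * scale) / 2f
    val dy = (viewHeight - imageHeight * scale) / 2f
    return Box(
        box.left * scale + dx,
        box.top * scale + dy,
        box.right * scale + dx,
        box.bottom * scale + dy
    )
}
```

If the preview letterboxes instead of cropping ("fit" scaling), the same code applies with minOf in place of maxOf, which makes the offsets zero or positive.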
Common mistakes
- Running inference on every frame on the main thread (the UI freezes and the device heats up).
- Forgetting to close/release camera frames (the analyzer stops receiving new frames, plus memory leaks and crashes).
- Not mapping detection coordinates correctly to the preview (boxes in the wrong place).
Summary: Phone cameras power most Mobile AI. Use CameraX (Android) or AVFoundation + Vision (iOS) to feed frames to a model. For real-time use, keep models small, throttle and downscale frames, accelerate with hardware, drop frames under load, and draw results on an overlay mapped to the preview.