Computer Vision on Mobile — AI Mobile Coders

The camera is the AI superpower of phones

A huge share of Mobile AI features come from the camera: identifying objects, scanning documents, applying effects, measuring, translating signs. This lesson covers the common vision tasks and how to run them smoothly in real time.

The main vision tasks

Image classification — “what is this a picture of?” (one label for the whole image).
Object detection — “what objects are here and where?” (boxes + labels).
Segmentation — which pixels belong to which object (e.g. background removal).
Face/pose landmarks — key points for filters, fitness, AR.
OCR — reading text from the scene.
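
One way to see how these tasks differ is by the shape of what each one returns. The types below are a hypothetical sketch: none of the names come from a specific library, and they assume scores in the range 0 to 1 and boxes in image pixel coordinates.

```kotlin
// Hypothetical result types, for illustration only. Real libraries
// (ML Kit, TensorFlow Lite Task Library, Vision) define their own.

data class Box(val left: Float, val top: Float, val right: Float, val bottom: Float)
data class Point(val x: Float, val y: Float)

// Image classification: one label (with a confidence score) for the whole image.
data class Classification(val label: String, val score: Float)

// Object detection: a label and score per object, plus where it is.
data class Detection(val label: String, val score: Float, val box: Box)

// Segmentation: one class id per pixel, stored row by row.
class SegmentationMask(val width: Int, val height: Int, val classIds: IntArray) {
    init { require(classIds.size == width * height) { "mask size mismatch" } }

    // Class id of the pixel at column x, row y.
    fun classAt(x: Int, y: Int): Int = classIds[y * width + x]
}

// Face/pose landmarks: named key points (e.g. "leftEye", "rightWrist").
data class Landmarks(val points: Map<String, Point>)

// OCR: recognized text and where it appears in the image.
data class TextBlock(val text: String, val box: Box)
```

Classification yields a single value per image, detection and OCR yield lists of boxed items, and segmentation yields a full per-pixel grid, which is why it tends to be the most expensive to post-process and draw.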

Still image vs real-time

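Running a model on a single captured photo is easy. The challenge is real-time processing of the live camera feed: at 30 fps a new frame arrives roughly every 33 ms, and inference has to keep up without lag or overheating. That means choosing which frames to process and keeping each inference cheap.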

Wiring up the camera + model

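// Android: the CameraX analyzer runs your model on each frame it receives.
// The executor should be a background executor, not the main thread.
imageAnalysis.setAnalyzer(executor) { imageProxy ->
    try {
        val bitmap = imageProxy.toBitmap()
        val results = classifier.classify(TensorImage.fromBitmap(bitmap))
        runOnUiThread { overlay.show(results) }   // only UI updates go to the main thread
    } finally {
        imageProxy.close()        // always close the frame, even if inference throws
    }
}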

On iOS, the equivalent is an AVCaptureVideoDataOutput sample buffer delegate (AVCaptureVideoDataOutputSampleBufferDelegate), which receives each frame and hands it to Vision/Core ML.

Keeping real-time fast

Throttle frames — you rarely need 30fps inference; process every 2nd–5th frame.
Use a small, fast model — accuracy vs speed is a real trade-off on live video.
Hardware acceleration — GPU/NPU delegates (Android) or the Neural Engine (iOS).
Downscale input — feed the model a smaller image when possible.
Drop frames under load — never queue up work faster than you can process it.
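
The throttling, dropping, and downscaling ideas above can be sketched in plain Kotlin. This is a minimal illustration, not a library API: `FrameGate` and `downscaledSize` are hypothetical names, and the sketch assumes frames are offered one at a time from the analyzer while inference may finish on another thread.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicLong
import kotlin.math.roundToInt

/**
 * Decides which camera frames are worth running inference on (hypothetical helper).
 * Throttle: only every [everyNth] frame is considered.
 * Drop under load: a frame is skipped if the previous inference is still running.
 */
class FrameGate(private val everyNth: Int) {
    init { require(everyNth >= 1) { "everyNth must be at least 1" } }

    private val frameCount = AtomicLong(0)
    private val busy = AtomicBoolean(false)

    /** Returns true if the caller should process this frame; call [done] afterwards. */
    fun tryAcquire(): Boolean {
        val index = frameCount.getAndIncrement()
        if (index % everyNth != 0L) return false      // throttled
        return busy.compareAndSet(false, true)        // dropped if still busy
    }

    /** Marks the current inference as finished so a later frame can be accepted. */
    fun done() = busy.set(false)
}

/** Shrinks (width, height) so the longer side is at most [maxSide], keeping the aspect ratio. */
fun downscaledSize(width: Int, height: Int, maxSide: Int): Pair<Int, Int> {
    val longer = maxOf(width, height)
    if (longer <= maxSide) return width to height     // never upscale
    val scale = maxSide.toFloat() / longer
    return maxOf(1, (width * scale).roundToInt()) to maxOf(1, (height * scale).roundToInt())
}
```

Inside an analyzer, the pattern would be to call `tryAcquire()` first, run the model only when it returns true, call `done()` in a `finally` block, and close the frame in every case. On CameraX, the `STRATEGY_KEEP_ONLY_LATEST` backpressure strategy (the default for `ImageAnalysis`) already discards stale frames while the analyzer is busy, so a gate like this mainly adds the every-Nth-frame throttle on top.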

Drawing results over the camera

Detections are returned in the model’s coordinate space; you must map them to screen coordinates and draw boxes/labels on an overlay above the camera preview, accounting for rotation and scaling.
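
A minimal sketch of that mapping follows. It rests on several assumptions: the box is in pixel coordinates of the analyzed frame, `degrees` is the clockwise rotation needed to make that frame upright (as CameraX reports in `imageInfo.rotationDegrees`), and the preview shows the same upright image scaled to fill the view and centered. `Box`, `rotateBox`, and `imageToView` are hypothetical names; a real app also has to account for any crop or aspect-ratio difference between the analysis and preview streams.

```kotlin
// Hypothetical helper types and functions, for illustration only.
data class Box(val left: Float, val top: Float, val right: Float, val bottom: Float)

/** Rotates [box] clockwise by [degrees] (0/90/180/270) within a frame of size [w] x [h]. */
fun rotateBox(box: Box, degrees: Int, w: Float, h: Float): Box = when (degrees) {
    0 -> box
    90 -> Box(h - box.bottom, box.left, h - box.top, box.right)
    180 -> Box(w - box.right, h - box.bottom, w - box.left, h - box.top)
    270 -> Box(box.top, w - box.right, box.bottom, w - box.left)
    else -> throw IllegalArgumentException("unsupported rotation: $degrees")
}

/** Maps a box from upright-image pixels to view pixels, assuming "scale to fill, centered". */
fun imageToView(box: Box, imageW: Float, imageH: Float, viewW: Float, viewH: Float): Box {
    val scale = maxOf(viewW / imageW, viewH / imageH)  // fill: the larger ratio wins
    val dx = (viewW - imageW * scale) / 2f             // negative when the sides are cropped
    val dy = (viewH - imageH * scale) / 2f
    return Box(
        box.left * scale + dx, box.top * scale + dy,
        box.right * scale + dx, box.bottom * scale + dy
    )
}
```

For example, a box `(100, 200, 300, 300)` in a 640 x 480 frame that needs a 90 degree rotation becomes `(180, 100, 280, 300)` in the upright 480 x 640 image. Mapped to a 1080 x 1920 view, the scale is 3 with a horizontal offset of -180, giving `(360, 300, 660, 900)`.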

Common mistakes

Running inference on every frame on the main thread (the UI freezes and the device heats up).
Forgetting to close/release camera frames (memory leaks, crashes).
Not mapping detection coordinates correctly to the preview (boxes in the wrong place).

Summary: Phone cameras power most Mobile AI. Use CameraX (Android) or AVFoundation + Vision (iOS) to feed frames to a model. For real-time use, keep models small, throttle and downscale frames, accelerate with hardware, and draw results on an overlay mapped to the preview.