
Generative AI & LLMs in Mobile Apps

🗓 May 31, 2026 ⏱ 3 min read

The generative AI wave

Large Language Models (LLMs) like Gemini and GPT can chat, write, summarise, translate, answer questions and call tools. Adding them to a mobile app unlocks features such as smart assistants, auto-replies and content creation. Because the most capable models are far too large to run on a phone, they usually run in the cloud, and your app calls them over an API.

How a mobile app talks to an LLM

You send a request (the user’s prompt + context) to the provider’s API and get back generated text. Treat it like any network call — with loading, error and offline handling.

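// conceptual: call an LLM endpoint (assumes a Ktor HttpClient with JSON ContentNegotiation installed)
import io.ktor.client.call.body
import io.ktor.client.request.*
import io.ktor.http.*

val response = httpClient.post("https://api.provider.com/v1/chat") {
    // API_KEY appears here for illustration only; see "Protect your API key!" below
    header("Authorization", "Bearer $API_KEY")
    contentType(ContentType.Application.Json)
    setBody(ChatRequest(messages = listOf(Message("user", prompt))))
}.body<ChatResponse>()
val answer = response.choices.firstOrNull()?.message?.content.orEmpty()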
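The "loading, error and offline handling" advice can be sketched as a small state model. This is a minimal illustration, not a prescribed architecture: `ChatUiState`, `askModel` and the `callLlm` lambda are hypothetical names, and the real network call is passed in so the sketch stays independent of any HTTP library.

```kotlin
import java.io.IOException

// Hypothetical UI states for one chat request; names are illustrative.
sealed interface ChatUiState {
    object Loading : ChatUiState                       // shown while askModel is running
    data class Success(val answer: String) : ChatUiState
    data class Failure(val message: String, val canRetry: Boolean) : ChatUiState
}

// Wraps any LLM call (passed in as a lambda) and maps its outcome to a UI state.
fun askModel(prompt: String, isOnline: Boolean, callLlm: (String) -> String): ChatUiState {
    if (prompt.isBlank()) return ChatUiState.Failure("Enter a question first.", canRetry = false)
    if (!isOnline) return ChatUiState.Failure("You appear to be offline.", canRetry = true)
    return try {
        ChatUiState.Success(callLlm(prompt))
    } catch (e: IOException) {
        // Network failures and socket timeouts are IOExceptions, so they end up here.
        ChatUiState.Failure("Network problem: ${e.message}", canRetry = true)
    }
}
```

In an app you would set the state to `Loading` before calling `askModel` on a background thread or coroutine, then render whichever state comes back.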

Streaming responses

LLMs generate text one token (roughly a word or word fragment) at a time. For a great UX, stream the response so text appears progressively (like a typing effect) instead of making the user wait for the whole answer. Most providers support streaming, typically over server-sent events (SSE).
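To make the streaming idea concrete, here is a rough sketch of consuming a server-sent event stream line by line. It rests on two assumptions that will not hold for every provider: each `data:` line carries a plain-text chunk (real APIs usually wrap chunks in JSON), and the stream ends with a `[DONE]` sentinel. `collectStream` and `onChunk` are hypothetical names.

```kotlin
// Assumptions: each SSE "data:" line carries a plain-text chunk, and the stream
// ends with a "[DONE]" sentinel. Check your provider's documentation for its format.
fun collectStream(lines: Sequence<String>, onChunk: (String) -> Unit): String {
    val full = StringBuilder()
    for (line in lines) {
        if (!line.startsWith("data:")) continue      // skip blank lines, comments, other fields
        // SSE strips at most one space after the colon; further spaces belong to the data.
        val payload = line.removePrefix("data:").removePrefix(" ")
        if (payload == "[DONE]") break
        full.append(payload)
        onChunk(full.toString())                      // the UI re-renders the text received so far
    }
    return full.toString()
}
```

The `lines` sequence would come from the response body of a streaming request. Passing the accumulated text to `onChunk` keeps the UI code simple, since it only ever displays the latest value.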

Protect your API key!

This is critical: never put your secret API key in the app. Anyone can extract it from the app binary and run up your bill. Instead, route LLM calls through your own backend, which holds the key and forwards requests. Your app calls your server; your server calls the LLM.

Your app → Your backend → LLM API
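The backend's role in that flow can be sketched without any web framework. The types below (`AppRequest`, `UpstreamRequest`, `LlmProxy`) are hypothetical, and the session check and provider call are passed in as lambdas. A real backend would be an HTTP service that loads the key from server-side configuration.

```kotlin
// Hypothetical request shapes; a real backend would receive and send HTTP requests.
data class AppRequest(val sessionToken: String, val prompt: String)
data class UpstreamRequest(val headers: Map<String, String>, val prompt: String)

class LlmProxy(
    private val apiKey: String,                       // lives only on the server, never in the app
    private val isValidSession: (String) -> Boolean,  // your own user authentication
    private val callProvider: (UpstreamRequest) -> String,
) {
    // The app sends its session token and prompt; the proxy attaches the secret key.
    fun handle(request: AppRequest): String {
        require(isValidSession(request.sessionToken)) { "unauthenticated" }
        val upstream = UpstreamRequest(
            headers = mapOf("Authorization" to "Bearer $apiKey"),
            prompt = request.prompt,
        )
        return callProvider(upstream)
    }
}
```

Because every request passes through `handle`, the backend is also a natural place to add per-user rate limits and logging.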

Prompts, context and cost

  • Prompt design — clear instructions and examples produce better results.
  • Context window — you can include chat history or app data, but there’s a size limit, and more tokens cost more.
  • Cost & rate limits — you pay per token; cache results and limit requests to control spend.
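Two of the points above, keeping context within a limit and caching results, can be sketched in a few lines. The four-characters-per-token estimate is only a rough rule of thumb for English text, and `ChatMessage`, `trimHistory` and `AnswerCache` are hypothetical names. Use your provider's tokenizer or reported usage when accuracy matters.

```kotlin
data class ChatMessage(val role: String, val text: String)

// Rough heuristic (assumption): about 4 characters per token, rounded up.
fun estimateTokens(text: String): Int = (text.length + 3) / 4

// Keeps the newest messages that fit in the budget, preserving chronological order.
fun trimHistory(history: List<ChatMessage>, maxTokens: Int): List<ChatMessage> {
    val kept = ArrayDeque<ChatMessage>()
    var used = 0
    for (message in history.asReversed()) {
        val cost = estimateTokens(message.text)
        if (used + cost > maxTokens) break            // older messages are dropped
        kept.addFirst(message)
        used += cost
    }
    return kept.toList()
}

// Caches answers by exact prompt so a repeated question does not trigger a second paid call.
class AnswerCache(private val callLlm: (String) -> String) {
    private val answers = mutableMapOf<String, String>()
    fun ask(prompt: String): String = answers.getOrPut(prompt) { callLlm(prompt) }
}
```

Dropping the oldest messages first is the simplest policy. Summarising older turns is a common alternative when early context still matters.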

On-device generative AI

Small on-device models (Gemini Nano via ML Kit GenAI, or compact open models) can do lightweight generation — summarise, rewrite, smart reply — privately and offline, with no per-call cost. Use them for simple tasks; use the cloud for advanced reasoning.
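One way to express the "simple tasks on-device, advanced reasoning in the cloud" rule is a small routing function. This is only a sketch: the `Task` and `Engine` enums and the notion of which tasks count as lightweight are assumptions made for illustration, and the code does not call any real on-device API.

```kotlin
// Hypothetical task and engine categories for illustration.
enum class Task { SUMMARIZE, REWRITE, SMART_REPLY, MULTI_STEP_REASONING }
enum class Engine { ON_DEVICE, CLOUD, UNAVAILABLE }

// Prefers the on-device model for lightweight tasks, falls back to the cloud when online.
fun chooseEngine(task: Task, onDeviceModelReady: Boolean, isOnline: Boolean): Engine {
    val lightweight = task != Task.MULTI_STEP_REASONING
    return when {
        lightweight && onDeviceModelReady -> Engine.ON_DEVICE
        isOnline -> Engine.CLOUD
        else -> Engine.UNAVAILABLE
    }
}
```

The `UNAVAILABLE` case matters in practice: not every device ships an on-device model, so the UI needs a path for "no model and no network".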

Handling the realities of LLMs

  • They can be wrong (“hallucinate”) — don’t present output as guaranteed fact for critical use.
  • They’re slow & online — show progress, handle timeouts and offline.
  • Add guardrails — validate and moderate output where needed.
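As a minimal sketch of the guardrails point, the function below validates model output before it reaches the user. The length cap and blocked-term list are hypothetical, app-specific policy inputs, and a keyword check like this is far weaker than a dedicated moderation service. It only shows where validation sits in the flow.

```kotlin
sealed interface Checked {
    data class Ok(val text: String) : Checked
    data class Rejected(val reason: String) : Checked
}

// maxChars and blockedTerms are hypothetical policy inputs chosen by the app.
fun checkOutput(raw: String, maxChars: Int, blockedTerms: Set<String>): Checked {
    val text = raw.trim()
    if (text.isEmpty()) return Checked.Rejected("empty response")
    if (text.length > maxChars) return Checked.Rejected("response too long")
    val hit = blockedTerms.firstOrNull { text.contains(it, ignoreCase = true) }
    if (hit != null) return Checked.Rejected("blocked term: $hit")
    return Checked.Ok(text)
}
```

On `Rejected`, the app might retry, show a neutral fallback message, or ask the user to rephrase, rather than displaying the raw output.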

Common mistakes

  • Embedding the API key in the app (security & cost disaster).
  • Not streaming, so the UI feels frozen during generation.
  • Trusting LLM output blindly for important decisions.

Summary: Add generative AI by calling LLMs through your own backend, never with the key in the app. Stream responses for great UX, manage context and cost, use on-device models for light tasks, and design for slowness, errors and the occasional wrong answer.