AI & On-Device ML
The 2025-2026 Android platform is AI-native. Gemini Nano ships in Android 14 QPR2+ devices with AICore. ML Kit has expanded from vision-only to full text / image / document understanding. TensorFlow Lite remains the escape hatch for custom models. This module covers the stack — when to use each, how to ship, and the privacy posture that makes on-device AI compelling in the first place.
The stack
Gemini Nano (AICore)
Google's on-device LLM — summarization, proofreading, reply suggestions. Android 14 QPR2+, Pixel 8 Pro / Pixel 9 / Samsung S24.
ML Kit
Pre-built ML features: text recognition, barcode, face, pose, translation, smart reply. Cross-device. No model hosting needed.
TensorFlow Lite + LiteRT
Run custom TF models on-device. GPU / NNAPI / NPU accelerated. Escape hatch when ML Kit doesn't fit.
MediaPipe / MediaPipe Tasks
High-level APIs for gesture, pose, segmentation — built on TF Lite but easier to use.
When to pick which
| Need | Recommended |
|---|---|
| Summarize a news article | Gemini Nano |
| Proofread a comment | Gemini Nano |
| Generate smart reply suggestions | ML Kit Smart Reply |
| Extract text from a receipt photo | ML Kit Text Recognition |
| Scan a barcode | ML Kit Barcode Scanning |
| Detect faces / smile | ML Kit Face Detection |
| Translate text offline | ML Kit Translation |
| Real-time pose estimation | MediaPipe Pose |
| Custom vision model (e.g., product type) | TensorFlow Lite + your model |
| Custom NLP model (e.g., classification) | TensorFlow Lite or Gemini Nano (prompt-engineered) |
| Streaming text generation | Gemini Nano with streaming API |
| Heavy LLM / image gen | Server-side (Gemini Pro, OpenAI) — not on-device |
AICore — the Android AI runtime
AICore is a system-level service on supported devices (Pixel 8 Pro, Pixel 9, Pixel Fold 2, Samsung S24, and a growing list). It hosts Gemini Nano and future on-device models:
- Apps don't embed the model — AICore hosts it, shared across apps
- Downloaded once, used everywhere — no per-app 3 GB download
- Automatic updates via Play Services
- Hardware acceleration — uses the NPU / DSP
Availability
import android.content.Context
import com.google.ai.edge.aicore.GenerativeModel
import com.google.ai.edge.aicore.generationConfig
import dagger.hilt.android.qualifiers.ApplicationContext
import javax.inject.Inject

class AiCoreAvailability @Inject constructor(
    // Named appContext so it doesn't shadow the builder's `context` property below
    @ApplicationContext private val appContext: Context
) {
    val genAiClient = GenerativeModel(
        generationConfig = generationConfig {
            context = appContext
            temperature = 0.2f
            topK = 16
            maxOutputTokens = 256
        }
    )

    suspend fun isAvailable(): Boolean = try {
        // Loads (and downloads, if needed) the model; throws on unsupported devices
        genAiClient.prepareInferenceEngine()
        true
    } catch (e: Exception) {
        // UnsupportedOperationException on unsupported hardware, but download or
        // AICore-installation failures should also count as "unavailable"
        false
    }
}
Check at runtime. Fall back to a server-side API for devices without AICore.
Privacy by default
Declare the posture in the Play Console's Data safety form (shown here as pseudo-config; the form itself is interactive, not a file):
# Used only on device; not shared
data_usage:
- data_type: user_messages
purpose: feature_functionality
shared_with: []
processing: on_device_only
user_deletable: true
The chapter roadmap
- 01 Gemini Nano: API, prompt engineering, streaming, function calling, response validation.
- 02 ML Kit: pre-built features for text, barcode, face, pose, translation, and smart reply.
- 03 TensorFlow Lite + LiteRT: custom model integration, GPU / NNAPI delegates, runtime model delivery via the Firebase ML model downloader.
Design patterns for AI features
1. The graceful degradation pattern
class SummarizationService @Inject constructor(
    private val geminiNano: GenerativeModel?, // null if unavailable
    private val serverApi: AiServerApi
) {
    suspend fun summarize(text: String): Outcome<String, SummarizationError> {
        // Try on-device first; capture in a local so the null check holds inside the lambda
        val model = geminiNano
        if (model != null) {
            return runCatching { model.generateContent("Summarize: $text").text.orEmpty() }
                .fold(
                    onSuccess = { Outcome.Ok(it) },
                    onFailure = { Outcome.Err(SummarizationError.OnDeviceFailed) }
                )
        }
        // Fall back to server for devices without AICore
        return runCatching { serverApi.summarize(text) }
            .fold(
                onSuccess = { Outcome.Ok(it) },
                onFailure = { Outcome.Err(SummarizationError.Network) }
            )
    }
}
On-device preferred; server fallback for older devices. Transparent to the caller.
2. The streaming pattern
// Gemini Nano supports streaming: each emission carries the next chunk of text
fun summarizeStreaming(text: String): Flow<String> =
    geminiNano.generateContentStream("Summarize: $text")
        .map { chunk -> chunk.text.orEmpty() }
@Composable
fun SummaryScreen(viewModel: SummaryViewModel = hiltViewModel()) {
val summary by viewModel.summary.collectAsStateWithLifecycle()
Text(summary, modifier = Modifier.animateContentSize())
}
Display chunks as they arrive — feels instant even for 500-token summaries.
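The glue between the Flow and the screen is a ViewModel that appends chunks into one growing string. A minimal sketch (Hilt wiring omitted; the injected summarize lambda stands in for the streaming function above):

import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.launch

class SummaryViewModel(
    private val summarize: (String) -> Flow<String>  // e.g. ::summarizeStreaming
) : ViewModel() {
    private val _summary = MutableStateFlow("")
    val summary: StateFlow<String> = _summary

    fun onSummarizeRequested(text: String) {
        viewModelScope.launch {
            _summary.value = ""
            // Append each chunk so the UI shows partial output immediately
            summarize(text).collect { chunk -> _summary.value += chunk }
        }
    }
}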
3. The cache pattern
AI inference takes 500ms-5s. Cache results by prompt hash:
class AiCache @Inject constructor(private val dao: AiCacheDao) {
suspend fun summarize(text: String, generator: suspend (String) -> String): String {
val hash = sha256(text)
dao.get(hash)?.let { return it.result }
val fresh = generator(text)
dao.insert(AiCacheEntry(hash = hash, result = fresh, createdAt = now()))
return fresh
}
}
Skip re-running inference for the same input (common for summarization of unchanged articles).
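AiCacheDao and AiCacheEntry aren't shown in this chapter (nor are the sha256 and now() helpers); one plausible Room definition:

import androidx.room.Dao
import androidx.room.Entity
import androidx.room.Insert
import androidx.room.OnConflictStrategy
import androidx.room.PrimaryKey
import androidx.room.Query

@Entity(tableName = "ai_cache")
data class AiCacheEntry(
    @PrimaryKey val hash: String,   // sha256 of the input text
    val result: String,             // cached model output
    val createdAt: Long             // epoch millis, useful for eviction
)

@Dao
interface AiCacheDao {
    @Query("SELECT * FROM ai_cache WHERE hash = :hash")
    suspend fun get(hash: String): AiCacheEntry?

    @Insert(onConflict = OnConflictStrategy.REPLACE)
    suspend fun insert(entry: AiCacheEntry)
}

Consider evicting by createdAt so stale summaries don't outlive edited content.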
4. The validation pattern
LLMs hallucinate. Always validate outputs:
suspend fun categorize(text: String): Category {
val raw = geminiNano.generateContent(
"""Classify as one of: FOOD, TECH, SPORTS, OTHER.
Text: $text
Return only the category name."""
).text?.trim()?.uppercase()
return Category.values().find { it.name == raw } ?: Category.OTHER
}
Constrain outputs to an enum. If the LLM returns nonsense, default to a safe value.
5. The safety guardrail pattern
suspend fun proofread(userText: String): String {
// Check against prompt injection
if (userText.contains("ignore previous instructions", ignoreCase = true)) {
return userText // don't feed suspicious text to the model
}
val prompt = "Proofread. Return only the corrected text:\n$userText"
val output = geminiNano.generateContent(prompt).text.orEmpty()
// Validate output isn't dramatically different (likely hallucination)
if (output.length > userText.length * 2) return userText
return output
}
Defense in depth: input filtering + output validation + user confirmation for high-stakes operations.
Performance considerations
- First inference is slow (~1-3s) due to model load; subsequent inferences are <500ms
- Warm up on app start if your feature is common
- Batch inferences when possible (classify 10 posts in one prompt, not 10 prompts)
- Respect battery — heavy inference at < 20% battery should prompt or defer
- Thermal throttling — 30+ seconds of sustained LLM inference can trigger thermal limits; monitor via PowerManager (see the sketch below)
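A minimal sketch of such a gate, combining the battery and thermal checks above (the class name and 20% threshold are assumptions; the thermal API requires API 29):

import android.content.Context
import android.os.BatteryManager
import android.os.Build
import android.os.PowerManager

class InferenceGate(context: Context) {
    private val power =
        context.getSystemService(Context.POWER_SERVICE) as PowerManager
    private val battery =
        context.getSystemService(Context.BATTERY_SERVICE) as BatteryManager

    /** True when heavy inference should be deferred or confirmed with the user. */
    fun shouldDefer(): Boolean {
        val level = battery.getIntProperty(BatteryManager.BATTERY_PROPERTY_CAPACITY)
        val lowBattery = level in 1..19  // property can report 0 when unknown
        val throttled = Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q &&
            power.currentThermalStatus >= PowerManager.THERMAL_STATUS_SEVERE
        return lowBattery || throttled
    }
}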
Model size and distribution
Gemini Nano is ~3-4 GB on disk. AICore downloads it as a platform component:
- User sees "Downloading Android System Components" occasionally
- No per-app model download
- Models auto-update via Play Services
For TF Lite custom models, use the Firebase ML model downloader to fetch models at runtime — keeps your base APK small.
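A sketch of that download path; the model name "product-classifier" is hypothetical:

import com.google.firebase.ml.modeldownloader.CustomModelDownloadConditions
import com.google.firebase.ml.modeldownloader.DownloadType
import com.google.firebase.ml.modeldownloader.FirebaseModelDownloader
import org.tensorflow.lite.Interpreter

fun loadClassifier() {
    val conditions = CustomModelDownloadConditions.Builder()
        .requireWifi()  // avoid burning mobile data on a large model
        .build()

    FirebaseModelDownloader.getInstance()
        .getModel("product-classifier", DownloadType.LOCAL_MODEL_UPDATE_IN_BACKGROUND, conditions)
        .addOnSuccessListener { model ->
            // Serve the cached copy immediately; updates arrive in the background
            val file = model.file ?: return@addOnSuccessListener
            val interpreter = Interpreter(file)
            // ... run inference with the interpreter
        }
}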
Observability
import io.opentelemetry.api.common.AttributeKey
import io.opentelemetry.api.common.Attributes
import io.opentelemetry.api.metrics.Meter
import javax.inject.Inject

class AiMetrics @Inject constructor(meter: Meter) {
    private val featureKey = AttributeKey.stringKey("feature")
    private val resultKey = AttributeKey.stringKey("result")

    private val inferenceDuration = meter.histogramBuilder("ai.inference.duration")
        .setUnit("ms").ofLongs().build()
    private val inferenceCounter = meter.counterBuilder("ai.inference.total").build()
    // Incremented by callers when output constraints are violated (see the validation pattern)
    private val validationFailures = meter.counterBuilder("ai.validation.failed").build()

    suspend fun <T> trackInference(feature: String, block: suspend () -> T): T {
        // Monotonic clock: immune to wall-clock adjustments mid-inference
        val start = System.nanoTime()
        return try {
            val result = block()
            inferenceDuration.record(
                (System.nanoTime() - start) / 1_000_000,
                Attributes.of(featureKey, feature)
            )
            inferenceCounter.add(1, Attributes.of(featureKey, feature, resultKey, "success"))
            result
        } catch (t: Throwable) {
            inferenceCounter.add(1, Attributes.of(featureKey, feature, resultKey, "error"))
            throw t
        }
    }
}
Track:
- Inference latency P50/P95/P99 per feature
- Success / error rate
- Validation failure rate (output constraints violated)
- Server fallback rate (when on-device fails)
Testing AI features
Deterministic inputs
Set temperature = 0 (and topK = 1) for near-deterministic outputs in tests — model updates can still shift results:
val config = generationConfig {
temperature = 0f
topK = 1
seed = 42
}
Golden outputs
@Test fun summarizes_news_article() = runTest {
val article = readFixture("news-article.txt")
val expected = readFixture("news-article-summary.txt")
val actual = summarizer.summarize(article)
// Fuzzy match — LLMs aren't exact
assertTrue(actual.length in 50..200)
assertTrue(similarity(actual, expected) > 0.7f)
}
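The similarity() helper isn't standard; a dependency-free stand-in is token-level Jaccard overlap (swap in embedding cosine similarity for a better signal):

// Jaccard similarity over lowercase word tokens: 0.0 (disjoint) to 1.0 (identical sets)
fun similarity(a: String, b: String): Float {
    val tokensA = a.lowercase().split(Regex("\\W+")).filter { it.isNotBlank() }.toSet()
    val tokensB = b.lowercase().split(Regex("\\W+")).filter { it.isNotBlank() }.toSet()
    if (tokensA.isEmpty() && tokensB.isEmpty()) return 1f
    return (tokensA intersect tokensB).size.toFloat() / (tokensA union tokensB).size
}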
Shadow / canary runs
Ship both on-device and server-side inference; log the difference for a sample. Gives you a quality signal before switching from server to on-device.
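One way to wire that up, reusing the types from the graceful degradation pattern (the sample rate and logging hook are assumptions, not a prescribed API):

import kotlin.random.Random

class ShadowSummarizer(
    private val serverApi: AiServerApi,
    private val geminiNano: GenerativeModel?,
    private val log: (String, Map<String, Any>) -> Unit  // your analytics hook
) {
    suspend fun summarize(text: String): String {
        val serverResult = serverApi.summarize(text)
        val model = geminiNano ?: return serverResult
        if (Random.nextFloat() < 0.05f) {  // shadow 5% of traffic
            runCatching { model.generateContent("Summarize: $text").text.orEmpty() }
                .onSuccess { onDevice ->
                    log("ai_shadow_diff", mapOf("similarity" to similarity(onDevice, serverResult)))
                }
        }
        return serverResult  // users always see the server output during the shadow phase
    }
}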
Key takeaways
- Prefer on-device: Gemini Nano for generative text, ML Kit for pre-built features, TensorFlow Lite for custom models.
- Check availability at runtime and keep a server-side fallback.
- Treat LLM output as untrusted: constrain it, validate it, and default to a safe value.
- Cache results by input hash; inference is far slower than a database read.
- Instrument latency, error rate, validation failures, and fallback rate from day one.
Next
Continue to Gemini Nano for the on-device LLM API, or ML Kit for pre-built features.