AI & On-Device ML
The 2025-2026 Android platform is AI-native. Gemini Nano ships in Android 14 QPR2+ devices with AICore. ML Kit has expanded from vision-only to full text / image / document understanding. TensorFlow Lite remains the escape hatch for custom models. This module covers the stack — when to use each, how to ship, and the privacy posture that makes on-device AI compelling in the first place.
The stack
Gemini Nano (AICore)
Google's on-device LLM — summarization, proofreading, reply suggestions. Android 14 QPR2+, Pixel 8 Pro / Pixel 9 / Samsung S24.
ML Kit
Pre-built ML features: text recognition, barcode, face, pose, translation, smart reply. Cross-device. No model hosting needed.
TensorFlow Lite + LiteRT
Run custom TF models on-device. GPU / NNAPI / NPU accelerated. Escape hatch when ML Kit doesn't fit.
MediaPipe / MediaPipe Tasks
High-level APIs for gesture, pose, segmentation — built on TF Lite but easier to use.
When to pick which
| Need | Recommended |
|---|---|
| Summarize a news article | Gemini Nano |
| Proofread a comment | Gemini Nano |
| Generate smart reply suggestions | ML Kit Smart Reply |
| Extract text from a receipt photo | ML Kit Text Recognition |
| Scan a barcode | ML Kit Barcode Scanning |
| Detect faces / smile | ML Kit Face Detection |
| Translate text offline | ML Kit Translation |
| Real-time pose estimation | MediaPipe Pose |
| Custom vision model (e.g., product type) | TensorFlow Lite + your model |
| Custom NLP model (e.g., classification) | TensorFlow Lite or Gemini Nano (prompt-engineered) |
| Streaming text generation | Gemini Nano with streaming API |
| Heavy LLM / image gen | Server-side (Gemini Pro, OpenAI) — not on-device |
AICore — the Android AI runtime
AICore is a system-level service on supported devices (Pixel 8 Pro, Pixel 9, Pixel Fold 2, Samsung S24, and a growing list). It hosts Gemini Nano and future on-device models:
- Apps don't embed the model — AICore hosts it, shared across apps
- Downloaded once, used everywhere — no per-app 3 GB download
- Automatic updates via Play Services
- Hardware acceleration — uses the NPU / DSP
Availability
import android.content.Context
import com.google.ai.edge.aicore.GenerativeModel
import com.google.ai.edge.aicore.generationConfig
import dagger.hilt.android.qualifiers.ApplicationContext
import javax.inject.Inject

class AiCoreAvailability @Inject constructor(
    // Named appContext so it doesn't shadow the builder's `context` property below
    @ApplicationContext private val appContext: Context
) {
    val genAiClient = GenerativeModel(
        generationConfig = generationConfig {
            context = appContext
            temperature = 0.2f
            topK = 16
            maxOutputTokens = 256
        }
    )

    suspend fun isAvailable(): Boolean = try {
        // Loads (and downloads, if needed) the model; throws on unsupported devices
        genAiClient.prepareInferenceEngine()
        true
    } catch (e: Exception) {
        // UnsupportedOperationException on unsupported hardware, but download or
        // AICore-installation failures should also count as "unavailable"
        false
    }
}
Check at runtime. Fall back to a server-side API for devices without AICore.
Privacy by default
Declare the posture in the Play Console's Data safety form (shown here as pseudo-config; the form itself is interactive, not a file):
# Used only on device; not shared
data_usage:
- data_type: user_messages
purpose: feature_functionality
shared_with: []
processing: on_device_only
user_deletable: true
The chapter roadmap
- 01 Gemini Nano: API, prompt engineering, streaming, function calling, response validation.
- 02 ML Kit: pre-built features for text, barcode, face, pose, translation, and smart reply.
- 03 TensorFlow Lite + LiteRT: custom model integration, GPU / NNAPI delegates, runtime model delivery via the Firebase ML model downloader.
Design patterns for AI features
1. The graceful degradation pattern
class SummarizationService @Inject constructor(
    private val geminiNano: GenerativeModel?, // null if unavailable
    private val serverApi: AiServerApi
) {
    suspend fun summarize(text: String): Outcome<String, SummarizationError> {
        // Try on-device first; capture in a local so the null check holds inside the lambda
        val model = geminiNano
        if (model != null) {
            return runCatching { model.generateContent("Summarize: $text").text.orEmpty() }
                .fold(
                    onSuccess = { Outcome.Ok(it) },
                    onFailure = { Outcome.Err(SummarizationError.OnDeviceFailed) }
                )
        }
        // Fall back to server for devices without AICore
        return runCatching { serverApi.summarize(text) }
            .fold(
                onSuccess = { Outcome.Ok(it) },
                onFailure = { Outcome.Err(SummarizationError.Network) }
            )
    }
}
On-device preferred; server fallback for older devices. Transparent to the caller.
2. The streaming pattern
// Gemini Nano supports streaming: each emission carries the next chunk of text
fun summarizeStreaming(text: String): Flow<String> =
    geminiNano.generateContentStream("Summarize: $text")
        .map { chunk -> chunk.text.orEmpty() }
@Composable
fun SummaryScreen(viewModel: SummaryViewModel = hiltViewModel()) {
val summary by viewModel.summary.collectAsStateWithLifecycle()
Text(summary, modifier = Modifier.animateContentSize())
}
Display chunks as they arrive — feels instant even for 500-token summaries.
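The glue between the Flow and the screen is a ViewModel that appends chunks into one growing string. A minimal sketch (Hilt wiring omitted; the injected summarize lambda stands in for the streaming function above):

import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.launch

class SummaryViewModel(
    private val summarize: (String) -> Flow<String>  // e.g. ::summarizeStreaming
) : ViewModel() {
    private val _summary = MutableStateFlow("")
    val summary: StateFlow<String> = _summary

    fun onSummarizeRequested(text: String) {
        viewModelScope.launch {
            _summary.value = ""
            // Append each chunk so the UI shows partial output immediately
            summarize(text).collect { chunk -> _summary.value += chunk }
        }
    }
}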
3. The cache pattern
AI inference takes 500ms-5s. Cache results by prompt hash:
class AiCache @Inject constructor(private val dao: AiCacheDao) {
suspend fun summarize(text: String, generator: suspend (String) -> String): String {
val hash = sha256(text)
dao.get(hash)?.let { return it.result }
val fresh = generator(text)
dao.insert(AiCacheEntry(hash = hash, result = fresh, createdAt = now()))
return fresh
}
}
Skip re-running inference for the same input (common for summarization of unchanged articles).
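AiCacheDao and AiCacheEntry aren't shown in this chapter (nor are the sha256 and now() helpers); one plausible Room definition:

import androidx.room.Dao
import androidx.room.Entity
import androidx.room.Insert
import androidx.room.OnConflictStrategy
import androidx.room.PrimaryKey
import androidx.room.Query

@Entity(tableName = "ai_cache")
data class AiCacheEntry(
    @PrimaryKey val hash: String,   // sha256 of the input text
    val result: String,             // cached model output
    val createdAt: Long             // epoch millis, useful for eviction
)

@Dao
interface AiCacheDao {
    @Query("SELECT * FROM ai_cache WHERE hash = :hash")
    suspend fun get(hash: String): AiCacheEntry?

    @Insert(onConflict = OnConflictStrategy.REPLACE)
    suspend fun insert(entry: AiCacheEntry)
}

Consider evicting by createdAt so stale summaries don't outlive edited content.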
4. The validation pattern
LLMs hallucinate. Always validate outputs:
suspend fun categorize(text: String): Category {
val raw = geminiNano.generateContent(
"""Classify as one of: FOOD, TECH, SPORTS, OTHER.
Text: $text
Return only the category name."""
).text?.trim()?.uppercase()
return Category.values().find { it.name == raw } ?: Category.OTHER
}
Constrain outputs to an enum. If the LLM returns nonsense, default to a safe value.
5. The safety guardrail pattern
suspend fun proofread(userText: String): String {
// Check against prompt injection
if (userText.contains("ignore previous instructions", ignoreCase = true)) {
return userText // don't feed suspicious text to the model
}
val prompt = "Proofread. Return only the corrected text:\n$userText"
val output = geminiNano.generateContent(prompt).text.orEmpty()
// Validate output isn't dramatically different (likely hallucination)
if (output.length > userText.length * 2) return userText
return output
}
Defense in depth: input filtering + output validation + user confirmation for high-stakes operations.
Performance considerations
- First inference is slow (~1-3s) due to model load; subsequent inferences are <500ms
- Warm up on app start if your feature is common
- Batch inferences when possible (classify 10 posts in one prompt, not 10 prompts)
- Respect battery — heavy inference at < 20% battery should prompt or defer
- Thermal throttling — 30+ seconds of sustained LLM inference can trigger thermal limits; monitor via PowerManager (see the sketch below)
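A minimal sketch of such a gate, combining the battery and thermal checks above (the class name and 20% threshold are assumptions; the thermal API requires API 29):

import android.content.Context
import android.os.BatteryManager
import android.os.Build
import android.os.PowerManager

class InferenceGate(context: Context) {
    private val power =
        context.getSystemService(Context.POWER_SERVICE) as PowerManager
    private val battery =
        context.getSystemService(Context.BATTERY_SERVICE) as BatteryManager

    /** True when heavy inference should be deferred or confirmed with the user. */
    fun shouldDefer(): Boolean {
        val level = battery.getIntProperty(BatteryManager.BATTERY_PROPERTY_CAPACITY)
        val lowBattery = level in 1..19  // property can report 0 when unknown
        val throttled = Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q &&
            power.currentThermalStatus >= PowerManager.THERMAL_STATUS_SEVERE
        return lowBattery || throttled
    }
}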
Model size and distribution
Gemini Nano is ~3-4 GB on disk. AICore downloads it as a platform component:
- User sees "Downloading Android System Components" occasionally
- No per-app model download
- Models auto-update via Play Services
For TF Lite custom models, use the Firebase ML model downloader to fetch models at runtime — keeps your base APK small.
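A sketch of that download path; the model name "product-classifier" is hypothetical:

import com.google.firebase.ml.modeldownloader.CustomModelDownloadConditions
import com.google.firebase.ml.modeldownloader.DownloadType
import com.google.firebase.ml.modeldownloader.FirebaseModelDownloader
import org.tensorflow.lite.Interpreter

fun loadClassifier() {
    val conditions = CustomModelDownloadConditions.Builder()
        .requireWifi()  // avoid burning mobile data on a large model
        .build()

    FirebaseModelDownloader.getInstance()
        .getModel("product-classifier", DownloadType.LOCAL_MODEL_UPDATE_IN_BACKGROUND, conditions)
        .addOnSuccessListener { model ->
            // Serve the cached copy immediately; updates arrive in the background
            val file = model.file ?: return@addOnSuccessListener
            val interpreter = Interpreter(file)
            // ... run inference with the interpreter
        }
}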
Observability
import io.opentelemetry.api.common.AttributeKey
import io.opentelemetry.api.common.Attributes
import io.opentelemetry.api.metrics.Meter
import javax.inject.Inject

class AiMetrics @Inject constructor(meter: Meter) {
    private val featureKey = AttributeKey.stringKey("feature")
    private val resultKey = AttributeKey.stringKey("result")

    private val inferenceDuration = meter.histogramBuilder("ai.inference.duration")
        .setUnit("ms").ofLongs().build()
    private val inferenceCounter = meter.counterBuilder("ai.inference.total").build()
    // Incremented by callers when output constraints are violated (see the validation pattern)
    private val validationFailures = meter.counterBuilder("ai.validation.failed").build()

    suspend fun <T> trackInference(feature: String, block: suspend () -> T): T {
        // Monotonic clock: immune to wall-clock adjustments mid-inference
        val start = System.nanoTime()
        return try {
            val result = block()
            inferenceDuration.record(
                (System.nanoTime() - start) / 1_000_000,
                Attributes.of(featureKey, feature)
            )
            inferenceCounter.add(1, Attributes.of(featureKey, feature, resultKey, "success"))
            result
        } catch (t: Throwable) {
            inferenceCounter.add(1, Attributes.of(featureKey, feature, resultKey, "error"))
            throw t
        }
    }
}
Track:
- Inference latency P50/P95/P99 per feature
- Success / error rate
- Validation failure rate (output constraints violated)
- Server fallback rate (when on-device fails)
Testing AI features
Deterministic inputs
Set temperature = 0 (and topK = 1) for near-deterministic outputs in tests — model updates can still shift results:
val config = generationConfig {
temperature = 0f
topK = 1
seed = 42
}
Golden outputs
@Test fun summarizes_news_article() = runTest {
val article = readFixture("news-article.txt")
val expected = readFixture("news-article-summary.txt")
val actual = summarizer.summarize(article)
// Fuzzy match — LLMs aren't exact
assertTrue(actual.length in 50..200)
assertTrue(similarity(actual, expected) > 0.7f)
}
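The similarity() helper isn't standard; a dependency-free stand-in is token-level Jaccard overlap (swap in embedding cosine similarity for a better signal):

// Jaccard similarity over lowercase word tokens: 0.0 (disjoint) to 1.0 (identical sets)
fun similarity(a: String, b: String): Float {
    val tokensA = a.lowercase().split(Regex("\\W+")).filter { it.isNotBlank() }.toSet()
    val tokensB = b.lowercase().split(Regex("\\W+")).filter { it.isNotBlank() }.toSet()
    if (tokensA.isEmpty() && tokensB.isEmpty()) return 1f
    return (tokensA intersect tokensB).size.toFloat() / (tokensA union tokensB).size
}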
Shadow / canary runs
Ship both on-device and server-side inference; log the difference for a sample. Gives you a quality signal before switching from server to on-device.
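One way to wire that up, reusing the types from the graceful degradation pattern (the sample rate and logging hook are assumptions, not a prescribed API):

import kotlin.random.Random

class ShadowSummarizer(
    private val serverApi: AiServerApi,
    private val geminiNano: GenerativeModel?,
    private val log: (String, Map<String, Any>) -> Unit  // your analytics hook
) {
    suspend fun summarize(text: String): String {
        val serverResult = serverApi.summarize(text)
        val model = geminiNano ?: return serverResult
        if (Random.nextFloat() < 0.05f) {  // shadow 5% of traffic
            runCatching { model.generateContent("Summarize: $text").text.orEmpty() }
                .onSuccess { onDevice ->
                    log("ai_shadow_diff", mapOf("similarity" to similarity(onDevice, serverResult)))
                }
        }
        return serverResult  // users always see the server output during the shadow phase
    }
}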
Key takeaways
- Prefer on-device: Gemini Nano for generative text, ML Kit for pre-built features, TensorFlow Lite for custom models.
- Check availability at runtime and keep a server-side fallback.
- Treat LLM output as untrusted: constrain it, validate it, and default to a safe value.
- Cache results by input hash; inference is far slower than a database read.
- Instrument latency, error rate, validation failures, and fallback rate from day one.
Next
Continue to Gemini Nano for the on-device LLM API, or ML Kit for pre-built features.