Gemini Nano & AICore

Gemini Nano is Google's on-device LLM, shipping in AICore on Android 14 QPR2+ devices (Pixel 8 Pro, Pixel 9, Samsung S24, Pixel Fold 2 and more). Apps get access via the GenerativeModel SDK — no model hosting, no per-inference cost, no data leaving the device.

Supported devices (as of 2025)

Pixel 8 Pro (original launch)
Pixel 9, 9 Pro, 9 Pro XL, 9 Pro Fold
Samsung Galaxy S24, S24+, S24 Ultra
Samsung Galaxy Z Fold6, Z Flip6
Growing list — check GenerativeModel.isSupported() at runtime

On unsupported devices, fall back to server-side LLM (Gemini Pro via the Firebase AI Logic SDK, or your own backend).

Setup

// libs.versions.toml
genai = "0.0.0-alpha01"
firebase-ai = "16.0.0"

genai-client = { module = "com.google.ai.edge.aicore:aicore", version.ref = "genai" }
firebase-ai = { module = "com.google.firebase:firebase-ai", version.ref = "firebase-ai" }

@Provides @Singleton
fun provideGenerativeModel(@ApplicationContext context: Context): GenerativeModel {
    val config = generationConfig {
        context = context
        temperature = 0.2f
        topK = 16
        maxOutputTokens = 256
    }
    return GenerativeModel(generationConfig = config)
}

Check availability

class GenAiAvailabilityChecker @Inject constructor(
    private val model: GenerativeModel
) {
    suspend fun canUseOnDevice(): Boolean = try {
        model.prepareInferenceEngine()   // downloads / prepares the model
        true
    } catch (e: Exception) {
        false
    }
}

prepareInferenceEngine() returns quickly if AICore is available and the feature is enabled. On non-supported devices or when AICore is disabled, it throws.

Basic inference

suspend fun summarize(text: String): String {
    val response = model.generateContent("Summarize in one sentence:\n$text")
    return response.text.orEmpty()
}

// Usage
val summary = summarize(articleText)

Streaming

For long outputs, stream chunks:

fun summarizeStreaming(text: String): Flow<String> = flow {
    model.generateContentStream("Summarize:\n$text").collect { chunk ->
        emit(chunk.text.orEmpty())
    }
}

@Composable
fun StreamingSummary(viewModel: SummaryViewModel = hiltViewModel()) {
    val partial by viewModel.partial.collectAsStateWithLifecycle()
    Text(partial, modifier = Modifier.animateContentSize())
}

class SummaryViewModel @Inject constructor(
    private val summarizer: Summarizer
) : ViewModel() {
    private val _partial = MutableStateFlow("")
    val partial: StateFlow<String> = _partial.asStateFlow()

    fun summarize(text: String) = viewModelScope.launch {
        _partial.value = ""
        summarizer.summarizeStreaming(text)
            .collect { chunk -> _partial.update { it + chunk } }
    }
}

Gives a streaming-feel ("typing in") UX even though total wall-time is the same.

Prompt engineering — the core skill

Gemini Nano is ~3-6B parameters (much smaller than Gemini Pro). Your prompts need to be tighter to get good results.

Be specific

// ❌ Vague
"Rewrite this"

// ✅ Specific
"""Rewrite the following text to be more formal. Preserve the meaning.
Keep it under 100 words.

TEXT: $text

REWRITE:"""

Structure the output

val prompt = """Extract product details from this receipt text.
Return as JSON with exactly these fields: name, priceCents, quantity.
If a field is missing, use null.

RECEIPT:
$text

JSON:"""

val json = model.generateContent(prompt).text.orEmpty().substringAfter("JSON:").trim()
val product = runCatching { Json.decodeFromString<Product>(json) }.getOrNull()

Always:

Ask for a specific format (JSON, one of N labels, bullet list)
Specify "return only the X" to prevent preamble
Parse defensively (the model sometimes wraps JSON in ```json blocks)

Few-shot examples

val prompt = """Classify the sentiment as POSITIVE, NEGATIVE, or NEUTRAL.

Example 1:
Text: I love this phone!
Sentiment: POSITIVE

Example 2:
Text: The battery is terrible.
Sentiment: NEGATIVE

Example 3:
Text: The screen is 6.2 inches.
Sentiment: NEUTRAL

Text: $input
Sentiment:"""

3-5 examples dramatically improve accuracy for classification tasks.

Anchor tokens

Force the model to start with a known prefix:

val prompt = """You are proofreading text for grammar and spelling only.
Return ONLY the corrected text. No explanation.

Original: $text

Corrected:"""

val result = model.generateContent(prompt).text.orEmpty()
    .substringAfter("Corrected:")
    .substringBefore("\n\n")
    .trim()

The model's output starts after "Corrected:" — your parsing is deterministic.

Function calling

Newer AICore builds support tool use — the model can request function invocations:

val tools = listOf(
    Tool(
        functionDeclarations = listOf(
            FunctionDeclaration(
                name = "get_weather",
                description = "Get current weather for a city",
                parameters = mapOf(
                    "city" to Schema.str("City name")
                )
            )
        )
    )
)

val config = generationConfig {
    context = context
    tools = tools
}

val model = GenerativeModel(generationConfig = config)

val response = model.generateContent("What's the weather in Tokyo?")

// Check if the model wants to call a function
val functionCall = response.functionCalls.firstOrNull()
if (functionCall?.name == "get_weather") {
    val city = functionCall.args["city"] as String
    val weather = weatherApi.current(city)

    // Feed the function result back
    val final = model.generateContent(
        Content.FunctionResponse("get_weather", mapOf("temperature" to weather.temp))
    )
    return final.text
}

Great for:

Weather / location queries
Triggering in-app actions from natural language
Multi-step agentic flows

Grounding with local data

Gemini Nano has no knowledge of your user's data. Pass relevant context in the prompt:

class NotesSearch @Inject constructor(
    private val notesDao: NoteDao,
    private val model: GenerativeModel
) {
    suspend fun askAboutNotes(question: String): String {
        val recent = notesDao.recent(limit = 20)
        val context = recent.joinToString("\n---\n") { "${it.title}\n${it.body}" }

        val prompt = """Answer the user's question based only on these notes.
If the answer isn't in the notes, say "I don't see that in your notes."

NOTES:
$context

QUESTION: $question

ANSWER:"""

        return model.generateContent(prompt).text.orEmpty()
    }
}

This is RAG (Retrieval-Augmented Generation) — retrieve relevant docs, insert into the prompt as context. The model answers from the context, not its trained knowledge.

Semantic search for RAG

For 1000+ notes, you need embeddings. TF Lite text embedding models let you compute cosine similarity on-device:

class NoteRagEngine @Inject constructor(
    private val embedder: TextEmbedder,   // TF Lite model
    private val noteEmbeddings: NoteEmbeddingDao
) {
    suspend fun relevant(question: String, topK: Int = 5): List<NoteEntity> {
        val queryEmbedding = embedder.embed(question)
        return noteEmbeddings.all()
            .map { it.note to cosineSimilarity(queryEmbedding, it.embedding) }
            .sortedByDescending { it.second }
            .take(topK)
            .map { it.first }
    }
}

See ML Kit & TensorFlow Lite for the embedding model setup.

Safety and guardrails

Response safety settings

val config = generationConfig {
    safetySettings = listOf(
        SafetySetting(HarmCategory.HARASSMENT, HarmBlockThreshold.MEDIUM_AND_ABOVE),
        SafetySetting(HarmCategory.HATE_SPEECH, HarmBlockThreshold.MEDIUM_AND_ABOVE),
        SafetySetting(HarmCategory.SEXUALLY_EXPLICIT, HarmBlockThreshold.MEDIUM_AND_ABOVE),
        SafetySetting(HarmCategory.DANGEROUS_CONTENT, HarmBlockThreshold.HIGH_AND_ABOVE)
    )
}

Responses flagged as unsafe return null text with a blocked reason. Handle both cases.

Prompt injection defense

fun sanitizePrompt(userInput: String): String {
    // Block common injection patterns
    val blocked = listOf(
        "ignore previous instructions",
        "disregard the above",
        "you are now",
        "new instructions:"
    )
    if (blocked.any { userInput.contains(it, ignoreCase = true) }) {
        return "[User input blocked for policy violation]"
    }
    return userInput
}

Never concatenate user input with system prompts without some sanitization. Prompt injection is a real vector.

Output validation

suspend fun categorize(text: String): Category {
    val raw = model.generateContent(
        """Classify: FOOD, TECH, SPORTS, OTHER.
           Return only the category.
           Text: $text"""
    ).text?.trim()?.uppercase() ?: return Category.OTHER

    return Category.values().find { it.name == raw } ?: Category.OTHER
}

Constrain outputs to a known set. If the LLM returns "I'm not sure", default safely.

Performance

Warm up on app start

class GenAiWarmup @Inject constructor(
    private val model: GenerativeModel,
    private val scope: CoroutineScope
) {
    fun install() {
        scope.launch {
            runCatching { model.prepareInferenceEngine() }
        }
    }
}

First inference latency: ~2-3s. Subsequent: <500ms. Warm up during app launch so the first user-facing request is fast.

Batch when possible

// ❌ Ten inferences, ten token overheads
val results = items.map { item -> model.generateContent("Classify: $item").text }

// ✅ One inference
val prompt = """Classify each item as FOOD / TECH / SPORTS / OTHER.
Return one category per line, same order.

Items:
${items.mapIndexed { i, it -> "$i. $it" }.joinToString("\n")}

Classifications:"""

val results = model.generateContent(prompt).text.orEmpty().lines()

Respect battery and thermal

suspend fun safeInference(prompt: String): String? {
    val battery = context.getSystemService(BatteryManager::class.java)
    val level = battery.getIntProperty(BatteryManager.BATTERY_PROPERTY_CAPACITY)

    if (level < 20 && !batteryManager.isCharging) {
        // Defer or use server fallback
        return null
    }

    val power = context.getSystemService(PowerManager::class.java)
    if (power.currentThermalStatus > PowerManager.THERMAL_STATUS_MODERATE) {
        // Thermal throttling — defer
        return null
    }

    return model.generateContent(prompt).text
}

Sustained LLM inference can drain 20% battery per hour and thermal- throttle after 5 minutes. Gate on battery + thermal state.

Full app example — compose proofreader

class ProofreaderViewModel @Inject constructor(
    private val model: GenerativeModel
) : ViewModel() {
    private val _state = MutableStateFlow(ProofreadState())
    val state: StateFlow<ProofreadState> = _state.asStateFlow()

    fun proofread(text: String) = viewModelScope.launch {
        _state.update { it.copy(isLoading = true, corrected = "", error = null) }
        try {
            val sanitized = sanitizePrompt(text)
            val prompt = """Correct grammar and spelling. Preserve meaning and style.
Return only the corrected text.

TEXT: $sanitized

CORRECTED:"""
            val corrected = model.generateContent(prompt).text
                ?.substringAfter("CORRECTED:")
                ?.trim()
                ?: text

            // Guardrail: output shouldn't be dramatically different
            if (corrected.length > text.length * 2) {
                _state.update { it.copy(isLoading = false, corrected = text, error = "Model returned unexpected output") }
                return@launch
            }

            _state.update { it.copy(isLoading = false, corrected = corrected) }
        } catch (e: Exception) {
            _state.update { it.copy(isLoading = false, error = e.message) }
        }
    }
}

@Composable
fun ProofreadScreen(viewModel: ProofreaderViewModel = hiltViewModel()) {
    val state by viewModel.state.collectAsStateWithLifecycle()
    var draft by rememberSaveable { mutableStateOf("") }

    Column(Modifier.padding(16.dp)) {
        OutlinedTextField(
            value = draft,
            onValueChange = { draft = it },
            label = { Text("Draft") },
            minLines = 5,
            modifier = Modifier.fillMaxWidth()
        )

        Spacer(Modifier.height(12.dp))

        Button(
            onClick = { viewModel.proofread(draft) },
            enabled = !state.isLoading && draft.isNotBlank()
        ) {
            if (state.isLoading) CircularProgressIndicator(strokeWidth = 2.dp)
            else Text("Proofread")
        }

        state.error?.let { Text(it, color = MaterialTheme.colorScheme.error) }

        if (state.corrected.isNotBlank()) {
            Spacer(Modifier.height(16.dp))
            Text("Corrected", style = MaterialTheme.typography.labelLarge)
            Text(state.corrected, style = MaterialTheme.typography.bodyMedium)
        }
    }
}

data class ProofreadState(
    val isLoading: Boolean = false,
    val corrected: String = "",
    val error: String? = null
)

Deploying Gemini Nano at scale

Adoption strategy

Device gate — enable feature only on AICore-supported devices
Server fallback for older devices (Gemini Pro via Firebase AI Logic)
Feature flag for gradual rollout (Remote Config)
A/B test — on-device vs server — measure quality and latency

class GenerativeAIStrategy @Inject constructor(
    private val onDevice: GenerativeModel,
    private val serverModel: FirebaseVertexAI,
    private val featureFlags: FeatureFlags
) {
    suspend fun summarize(text: String): String {
        if (!featureFlags.aiSummarizationEnabled) return ""

        val useOnDevice = featureFlags.preferOnDeviceAi && isOnDeviceAvailable()
        return if (useOnDevice) {
            runCatching { onDevice.generateContent("Summarize: $text").text.orEmpty() }
                .getOrElse { serverModel.generateContent("Summarize: $text").text.orEmpty() }
        } else {
            serverModel.generateContent("Summarize: $text").text.orEmpty()
        }
    }

    private suspend fun isOnDeviceAvailable(): Boolean =
        runCatching { onDevice.prepareInferenceEngine(); true }.getOrElse { false }
}

Quality monitoring

Log prompt hashes + user ratings (thumbs up/down on output)
Sample traces — review in aggregate; never log raw PII
Track inference success / error / safety-blocked rates

Common anti-patterns

Anti-patterns

Gemini Nano mistakes

No device availability check (crashes on unsupported)
Vague prompts ("make this better")
No output validation (LLM hallucinations ship)
Unbounded generation (100k tokens)
Raw user text into system prompts (injection)
Running inference on low battery

Best practices

Production-grade AI

prepareInferenceEngine() at startup; fallback for failure
Specific, few-shot, structured prompts
Validate output (enum, length, schema)
maxOutputTokens set; safety thresholds configured
Sanitize user input before adding to prompts
Battery + thermal gate before inference

Key takeaways

Practice exercises

01
Availability check
Wire up prepareInferenceEngine() + a server fallback. Log which path is used per user.
02
Streaming summary
Build a screen that summarizes the current article. Use generateContentStream to stream words into a Text composable.
03
RAG over notes
Implement semantic search over 100 notes using TF Lite embeddings. Pass top-5 relevant notes as context for user questions.
04
Structured output
Extract receipt details (name, price, quantity) as JSON. Parse defensively; retry with tighter prompt on parse failure.
05
Safety guardrails
Implement prompt injection detection + output validation. Test with adversarial inputs ("ignore previous instructions").

Continue to ML Kit & TensorFlow Lite for pre-built features and custom models.

Supported devices (as of 2025)​

Setup​

Check availability​

Basic inference​

Streaming​

Prompt engineering — the core skill​

Be specific​

Structure the output​

Few-shot examples​

Anchor tokens​

Function calling​

Grounding with local data​

Semantic search for RAG​

Safety and guardrails​

Response safety settings​

Prompt injection defense​

Output validation​

Performance​

Warm up on app start​

Batch when possible​

Respect battery and thermal​

Full app example — compose proofreader​

Deploying Gemini Nano at scale​

Adoption strategy​

Quality monitoring​

Common anti-patterns​