Observability & Monitoring
You don't know how your app behaves until you watch it run on devices that aren't yours. This module teaches you the three pillars of observability — logs, metrics, traces — and shows you how to translate them into actionable signal: crash-free rate, ANR rate, P95 screen load, cold start distribution, and business KPIs that leadership cares about.
Topic 1 · Crash & ANR reporting
Firebase Crashlytics is the de-facto standard. The first step toward actionable crash reports is attaching production logs to them, via a custom Timber tree that forwards warnings and errors to Crashlytics:
```kotlin
class CrashlyticsTree @Inject constructor() : Timber.Tree() {
    override fun log(priority: Int, tag: String?, message: String, t: Throwable?) {
        if (priority < Log.WARN) return // only WARN and above leave the device
        Firebase.crashlytics.log("[$tag] $message")
        t?.let { Firebase.crashlytics.recordException(it) }
    }
}
```

Plant it in Application.onCreate, alongside a plain DebugTree for debug builds.
```kotlin
class CrashlyticsBootstrap @Inject constructor(
    private val authRepo: AuthRepository,
    private val deviceInfo: DeviceInfo,
    @ApplicationScope private val scope: CoroutineScope // app-lifetime scope; never GlobalScope
) {
    fun install() {
        Firebase.crashlytics.apply {
            setCrashlyticsCollectionEnabled(!BuildConfig.DEBUG)
            setCustomKey("app.flavor", BuildConfig.FLAVOR)
            setCustomKey("device.abi", deviceInfo.abi)
            setCustomKey("device.model", Build.MODEL)
            setCustomKey("device.api", Build.VERSION.SDK_INT)
            setCustomKey("locale", Locale.getDefault().toLanguageTag())
        }
        // Update the user ID whenever the user signs in or out
        authRepo.userFlow.onEach { user ->
            Firebase.crashlytics.setUserId(user?.id ?: "")
        }.launchIn(scope)
    }
}
```
The three non-negotiable custom keys
| Key | Why it matters |
|---|---|
| user.id | Lets you reach out to affected users or reproduce with their data |
| app.flavor / build | Separates staging noise from production signal |
| last.screen | Set in a navigation observer; turns a stack trace into a story |
```kotlin
// Track the last screen every time the user navigates
navController.addOnDestinationChangedListener { _, destination, _ ->
    Firebase.crashlytics.setCustomKey("last.screen", destination.route ?: "unknown")
}
```
ANR tracking
Crashlytics captures ANRs automatically on Android 11+ (API 30), where the platform exposes them through ApplicationExitInfo. You can also read that data yourself for richer diagnostics:
```kotlin
// Dedicated wrapper type so ANR traces get their own issue group in Crashlytics
class AnrException(trace: String) : Exception(trace)

class AnrReporter @Inject constructor(
    @ApplicationContext private val context: Context,
    private val crashlytics: FirebaseCrashlytics
) {
    fun reportPreviousAnrs() {
        if (Build.VERSION.SDK_INT < Build.VERSION_CODES.R) return // ApplicationExitInfo is API 30+
        val am = context.getSystemService(ActivityManager::class.java)
        val reasons = am.getHistoricalProcessExitReasons(null, 0, 20)
        reasons.filter { it.reason == ApplicationExitInfo.REASON_ANR }
            .forEach { info ->
                crashlytics.log("ANR ${info.timestamp} status=${info.status} desc=${info.description}")
                info.traceInputStream?.use { stream ->
                    crashlytics.recordException(AnrException(stream.readBytes().toString(Charsets.UTF_8)))
                }
            }
    }
}
```
Topic 2 · Performance monitoring
Firebase Performance automatic traces
Firebase Performance auto-traces app start, screen rendering (slow and frozen frames), and HTTP requests. You get the P50/P75/P90/P95/P99 distribution for free.
Custom traces for business flows
Auto traces tell you "HTTP to /checkout took 800 ms P95." Custom traces tell you "end-to-end cart → payment → confirmation took 4.2 s P95," which is what the PM actually cares about.
```kotlin
class CheckoutTracer @Inject constructor() {
    suspend fun <T> traceCheckout(block: suspend () -> T): T {
        val trace = Firebase.performance.newTrace("checkout_flow")
        trace.start()
        return try {
            block().also { trace.putMetric("items_sold", /* ... */) }
        } finally {
            trace.stop()
        }
    }
}

// Usage from a ViewModel
viewModelScope.launch {
    tracer.traceCheckout {
        val cart = repo.currentCart()
        val payment = payments.charge(cart.total)
        orders.create(cart, payment)
    }
}
```
Topic 3 · Structured logging
Logs are only useful if you can search them. Use structured logs (JSON) with a stable schema — this turns a firehose into a queryable table:
```kotlin
data class LogEvent(
    val timestamp: Long = System.currentTimeMillis(),
    val level: String,
    val tag: String,
    val message: String,
    val userId: String? = null,
    val sessionId: String,
    val screen: String? = null,
    val requestId: String? = null,
    val extra: Map<String, Any?> = emptyMap()
)

class StructuredLoggingTree @Inject constructor(
    private val shipper: LogShipper, // ships to backend / Datadog / Elastic
    private val session: SessionProvider
) : Timber.Tree() {
    override fun log(priority: Int, tag: String?, message: String, t: Throwable?) {
        if (priority < Log.INFO) return
        val event = LogEvent(
            level = when (priority) {
                Log.ERROR -> "ERROR"
                Log.WARN -> "WARN"
                Log.INFO -> "INFO"
                else -> "DEBUG"
            },
            tag = tag ?: "app",
            message = message,
            userId = session.userId(),
            sessionId = session.sessionId(),
            screen = session.currentScreen(),
            extra = t?.let { mapOf("stacktrace" to it.stackTraceToString()) } ?: emptyMap()
        )
        shipper.enqueue(event)
    }
}
```
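The LogShipper above is left abstract. A minimal sketch of its batching side, generic over the event type so it stands alone; the send lambda is a stand-in for the real HTTP client:

```kotlin
// Hypothetical batching shipper: buffers events and hands them off in
// batches so one network call carries many log lines.
class BatchingShipper<T>(
    private val batchSize: Int = 20,
    private val send: (List<T>) -> Unit
) {
    private val buffer = ArrayDeque<T>()

    @Synchronized
    fun enqueue(event: T) {
        buffer.addLast(event)
        if (buffer.size >= batchSize) flush()
    }

    @Synchronized
    fun flush() {
        if (buffer.isEmpty()) return
        send(buffer.toList()) // ship a snapshot, then reset the buffer
        buffer.clear()
    }
}
```

A production shipper would also flush on a timer and when the app goes to the background, and persist the buffer across process death so crash-adjacent logs are not lost.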
Redacting PII
Logging is the single biggest source of PII leaking into log-retention systems. Redact at the source, not at the backend:
```kotlin
private val EMAIL_REGEX = Regex("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")
private val PHONE_REGEX = Regex("\\+?\\d[\\d ()-]{7,}")

fun String.redactPii(): String = this
    .replace(EMAIL_REGEX, "[email]")
    .replace(PHONE_REGEX, "[phone]")

// "Payment failed for jane@example.com, callback +1 (555) 123-4567".redactPii()
//   → "Payment failed for [email], callback [phone]"
```
Topic 4 · Distributed tracing
Modern Android apps make 10–40 HTTP calls per screen. A slow checkout could be the client, the API gateway, the payments service, or the database. You don't know without distributed tracing.
W3C Trace Context propagation with OkHttp
```kotlin
class TracingInterceptor @Inject constructor(
    private val tracer: Tracer // io.opentelemetry.api.trace.Tracer
) : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val req = chain.request()
        val span = tracer.spanBuilder("http ${req.method} ${req.url.encodedPath}")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("http.method", req.method)
            .setAttribute("http.url", req.url.toString())
            .startSpan()
        return try {
            span.makeCurrent().use {
                val traced = req.newBuilder()
                    .header("traceparent", span.spanContext.toTraceParent())
                    .build()
                val response = chain.proceed(traced)
                span.setAttribute("http.status_code", response.code.toLong())
                if (response.code >= 400) span.setStatus(StatusCode.ERROR)
                response
            }
        } catch (t: Throwable) {
            span.recordException(t)
            span.setStatus(StatusCode.ERROR)
            throw t
        } finally {
            span.end()
        }
    }
}

fun SpanContext.toTraceParent(): String =
    "00-${traceId}-${spanId}-${if (traceFlags.isSampled) "01" else "00"}"
```
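For reference, the header value built above has four dash-separated lowercase-hex fields defined by the W3C Trace Context spec: version (2 chars), trace-id (32), parent span id (16), and flags (2). A regex capturing that shape is handy as an assertion in interceptor tests:

```kotlin
// W3C traceparent: version(2)-trace-id(32)-parent-id(16)-flags(2), lowercase hex
val TRACEPARENT_FORMAT = Regex("^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")

// Example value from the W3C spec:
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```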
Ship spans to Grafana Tempo, Honeycomb, Datadog APM, or AWS X-Ray. When checkout is slow, you can drill from client span → API gateway span → payments service span → DB query span in one click.
Topic 5 · SLOs, SLIs, and alerting
Set Service Level Objectives for the user experience and alert when you're burning the error budget:
| SLI (signal) | SLO target | Budget per 30 d (1 M sessions) |
|---|---|---|
| Crash-free sessions | ≥ 99.5% | 5,000 crashes |
| ANR-free sessions | ≥ 99.8% | 2,000 ANRs |
| Cold start P95 | < 2,000 ms | 50,000 slow starts |
| Checkout success rate | ≥ 99.0% | 10,000 failed checkouts |
| Screen load P75 (critical screens) | < 1,200 ms | 250,000 slow loads |
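The budget column is plain arithmetic: budget = volume × (1 - target). A small helper you might keep next to your SLO definitions (names are illustrative):

```kotlin
import kotlin.math.roundToLong

// Error budget: how many bad events the SLO tolerates over the window.
fun errorBudget(volume: Long, sloTarget: Double): Long =
    (volume * (1.0 - sloTarget)).roundToLong()

// 1M sessions at a 99.5% crash-free target tolerates 5,000 crashes
```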
Alerting in production
```yaml
# Grafana / Cloud Monitoring alerting policy (pseudo-YAML)
- name: crash_free_regression
  condition: |
    avg_over_time(crash_free_sessions{app="prod"}[1h]) < 0.995
  for: 15m
  action:
    - pager: oncall-android
    - slack: '#android-incidents'
    - auto_halt_rollout: true

- name: anr_spike
  condition: |
    rate(anr_events{app="prod"}[15m]) > 3 * baseline_1w
  for: 10m
  action:
    - slack: '#android-incidents'

- name: checkout_success_drop
  condition: |
    sum(rate(checkout_success[5m])) / sum(rate(checkout_attempt[5m])) < 0.98
  for: 10m
  action:
    - pager: oncall-payments
    - slack: '#checkout-team'
```
Alert fatigue kills teams. Every alert should be:
- Actionable (there's a runbook)
- Ownable (a named team, not "mobile")
- Breaking something real (tied to an SLO, not a metric anomaly)
The observability stack
- Firebase Crashlytics: default choice. Free for unlimited events; issue grouping, Slack integration, velocity alerts.
- Firebase Performance: app start, screen render, and HTTP traces out of the box. Free for most teams.
- Sentry: commercial alternative with source maps, session replay, and distributed tracing in one tool.
- Datadog: real user monitoring + backend APM + logs in one pane. Expensive but powerful for enterprise.
- OpenTelemetry: SDK for traces and metrics that ships to any OTLP backend. Avoids vendor lock-in.
- Timber: swap log backends without touching call sites. Pair with custom Trees for Crashlytics + log shipping.
Privacy-preserving telemetry
Observability cannot become surveillance. Apply:
- Purpose limitation — collect only what you act on. No "just in case" data.
- Sampling — send 10% of spans, not 100%. You get signal without cost.
- Aggregation, not identification — for UX metrics, user IDs are unnecessary. Use randomized install IDs that reset on uninstall.
- Retention windows — logs 30 days, traces 7 days, crashes 90 days. Automate deletion; don't rely on policy.
- Opt-out controls — every analytics SDK needs a user-visible toggle. GDPR requires it; users notice it.
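The sampling bullet deserves one design note: sample per session, not per event, or you collect fragments of many sessions instead of complete pictures of a few. A deterministic sketch using hash-based bucketing (names are illustrative):

```kotlin
// Decide once per session whether to ship telemetry. Hashing the session
// ID makes the decision deterministic: every event in a sampled session
// ships, and unsampled sessions ship nothing.
fun isSessionSampled(sessionId: String, ratePercent: Int): Boolean {
    val bucket = (sessionId.hashCode().toLong() and 0x7FFFFFFF) % 100
    return bucket < ratePercent
}
```

Because the decision is a pure function of the session ID, client and backend can agree on it without coordination.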
Practice exercises
1. Crashlytics bootstrap: add a CrashlyticsBootstrap class that sets the user ID and flavor, and updates last.screen on every navigation event.
2. Custom Performance trace: wrap your checkout flow in a Firebase Performance custom trace with item-count and order-total metrics.
3. Structured logging tree: implement a StructuredLoggingTree that ships JSON logs to a backend endpoint, with PII redaction.
4. OpenTelemetry integration: add an OkHttp TracingInterceptor that propagates traceparent headers. Verify the trace appears in Honeycomb or Tempo.
5. SLO definition: write down three SLOs for your app (crash-free, cold start, business flow) and compute the monthly error budget for each.
Next module
Continue to Module 19 — Enterprise UX — design systems, accessibility, internationalization, and large-screen / foldable support at scale.