Observability & Monitoring
You don't know how your app behaves until you watch it run on devices that aren't yours. This module teaches you the three pillars of observability — logs, metrics, traces — and shows you how to translate them into actionable signal: crash-free rate, ANR rate, P95 screen load, cold start distribution, and business KPIs that leadership cares about.
Topic 1 · Crash & ANR reporting
Firebase Crashlytics is the de-facto standard. The first step toward actionable crash reports is attaching production logs to them, via a custom Timber tree that forwards warnings and errors to Crashlytics:
```kotlin
class CrashlyticsTree @Inject constructor() : Timber.Tree() {
    override fun log(priority: Int, tag: String?, message: String, t: Throwable?) {
        if (priority < Log.WARN) return // only WARN and above leave the device
        Firebase.crashlytics.log("[$tag] $message")
        t?.let { Firebase.crashlytics.recordException(it) }
    }
}
```

Plant it in Application.onCreate, alongside a plain DebugTree for debug builds.
```kotlin
class CrashlyticsBootstrap @Inject constructor(
    private val authRepo: AuthRepository,
    private val deviceInfo: DeviceInfo,
    @ApplicationScope private val scope: CoroutineScope // app-lifetime scope; never GlobalScope
) {
    fun install() {
        Firebase.crashlytics.apply {
            setCrashlyticsCollectionEnabled(!BuildConfig.DEBUG)
            setCustomKey("app.flavor", BuildConfig.FLAVOR)
            setCustomKey("device.abi", deviceInfo.abi)
            setCustomKey("device.model", Build.MODEL)
            setCustomKey("device.api", Build.VERSION.SDK_INT)
            setCustomKey("locale", Locale.getDefault().toLanguageTag())
        }
        // Update the user ID whenever the user signs in or out
        authRepo.userFlow.onEach { user ->
            Firebase.crashlytics.setUserId(user?.id ?: "")
        }.launchIn(scope)
    }
}
```
The three non-negotiable custom keys
| Key | Why it matters |
|---|---|
| user.id | Lets you reach out to affected users or reproduce with their data |
| app.flavor / build | Separates staging noise from production signal |
| last.screen | Set in a navigation observer; turns a stack trace into a story |
```kotlin
// Track the last screen every time the user navigates
navController.addOnDestinationChangedListener { _, destination, _ ->
    Firebase.crashlytics.setCustomKey("last.screen", destination.route ?: "unknown")
}
```
ANR tracking
Crashlytics captures ANRs automatically on Android 11+ (API 30), where the platform exposes them through ApplicationExitInfo. You can also read that data yourself for richer diagnostics:
```kotlin
// Dedicated wrapper type so ANR traces get their own issue group in Crashlytics
class AnrException(trace: String) : Exception(trace)

class AnrReporter @Inject constructor(
    @ApplicationContext private val context: Context,
    private val crashlytics: FirebaseCrashlytics
) {
    fun reportPreviousAnrs() {
        if (Build.VERSION.SDK_INT < Build.VERSION_CODES.R) return // ApplicationExitInfo is API 30+
        val am = context.getSystemService(ActivityManager::class.java)
        val reasons = am.getHistoricalProcessExitReasons(null, 0, 20)
        reasons.filter { it.reason == ApplicationExitInfo.REASON_ANR }
            .forEach { info ->
                crashlytics.log("ANR ${info.timestamp} status=${info.status} desc=${info.description}")
                info.traceInputStream?.use { stream ->
                    crashlytics.recordException(AnrException(stream.readBytes().toString(Charsets.UTF_8)))
                }
            }
    }
}
```
Topic 2 · Performance monitoring
Firebase Performance automatic traces
Firebase Performance auto-traces app start, screen rendering (slow and frozen frames), and HTTP requests. You get the P50/P75/P90/P95/P99 distribution for free.
Custom traces for business flows
Auto traces tell you "HTTP to /checkout took 800 ms P95." Custom traces tell you "end-to-end cart → payment → confirmation took 4.2 s P95," which is what the PM actually cares about.
```kotlin
class CheckoutTracer @Inject constructor() {
    suspend fun <T> traceCheckout(block: suspend () -> T): T {
        val trace = Firebase.performance.newTrace("checkout_flow")
        trace.start()
        return try {
            block().also { trace.putMetric("items_sold", /* ... */) }
        } finally {
            trace.stop()
        }
    }
}

// Usage from a ViewModel
viewModelScope.launch {
    tracer.traceCheckout {
        val cart = repo.currentCart()
        val payment = payments.charge(cart.total)
        orders.create(cart, payment)
    }
}
```
Topic 3 · Structured logging
Logs are only useful if you can search them. Use structured logs (JSON) with a stable schema — this turns a firehose into a queryable table:
```kotlin
data class LogEvent(
    val timestamp: Long = System.currentTimeMillis(),
    val level: String,
    val tag: String,
    val message: String,
    val userId: String? = null,
    val sessionId: String,
    val screen: String? = null,
    val requestId: String? = null,
    val extra: Map<String, Any?> = emptyMap()
)

class StructuredLoggingTree @Inject constructor(
    private val shipper: LogShipper, // ships to backend / Datadog / Elastic
    private val session: SessionProvider
) : Timber.Tree() {
    override fun log(priority: Int, tag: String?, message: String, t: Throwable?) {
        if (priority < Log.INFO) return
        val event = LogEvent(
            level = when (priority) {
                Log.ERROR -> "ERROR"
                Log.WARN -> "WARN"
                Log.INFO -> "INFO"
                else -> "DEBUG"
            },
            tag = tag ?: "app",
            message = message,
            userId = session.userId(),
            sessionId = session.sessionId(),
            screen = session.currentScreen(),
            extra = t?.let { mapOf("stacktrace" to it.stackTraceToString()) } ?: emptyMap()
        )
        shipper.enqueue(event)
    }
}
```
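The LogShipper above is left abstract. A minimal sketch of its batching side, generic over the event type so it stands alone; the send lambda is a stand-in for the real HTTP client:

```kotlin
// Hypothetical batching shipper: buffers events and hands them off in
// batches so one network call carries many log lines.
class BatchingShipper<T>(
    private val batchSize: Int = 20,
    private val send: (List<T>) -> Unit
) {
    private val buffer = ArrayDeque<T>()

    @Synchronized
    fun enqueue(event: T) {
        buffer.addLast(event)
        if (buffer.size >= batchSize) flush()
    }

    @Synchronized
    fun flush() {
        if (buffer.isEmpty()) return
        send(buffer.toList()) // ship a snapshot, then reset the buffer
        buffer.clear()
    }
}
```

A production shipper would also flush on a timer and when the app goes to the background, and persist the buffer across process death so crash-adjacent logs are not lost.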
Redacting PII
Logging is the single biggest source of PII leaking into log-retention systems. Redact at the source, not at the backend:
```kotlin
private val EMAIL_REGEX = Regex("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")
private val PHONE_REGEX = Regex("\\+?\\d[\\d ()-]{7,}")

fun String.redactPii(): String = this
    .replace(EMAIL_REGEX, "[email]")
    .replace(PHONE_REGEX, "[phone]")

// "Payment failed for jane@example.com, callback +1 (555) 123-4567".redactPii()
//   → "Payment failed for [email], callback [phone]"
```
Topic 4 · Distributed tracing
Modern Android apps make 10–40 HTTP calls per screen. A slow checkout could be the client, the API gateway, the payments service, or the database. You don't know without distributed tracing.
W3C Trace Context propagation with OkHttp
```kotlin
class TracingInterceptor @Inject constructor(
    private val tracer: Tracer // io.opentelemetry.api.trace.Tracer
) : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val req = chain.request()
        val span = tracer.spanBuilder("http ${req.method} ${req.url.encodedPath}")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("http.method", req.method)
            .setAttribute("http.url", req.url.toString())
            .startSpan()
        return try {
            span.makeCurrent().use {
                val traced = req.newBuilder()
                    .header("traceparent", span.spanContext.toTraceParent())
                    .build()
                val response = chain.proceed(traced)
                span.setAttribute("http.status_code", response.code.toLong())
                if (response.code >= 400) span.setStatus(StatusCode.ERROR)
                response
            }
        } catch (t: Throwable) {
            span.recordException(t)
            span.setStatus(StatusCode.ERROR)
            throw t
        } finally {
            span.end()
        }
    }
}

fun SpanContext.toTraceParent(): String =
    "00-${traceId}-${spanId}-${if (traceFlags.isSampled) "01" else "00"}"
```
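For reference, the header value built above has four dash-separated lowercase-hex fields defined by the W3C Trace Context spec: version (2 chars), trace-id (32), parent span id (16), and flags (2). A regex capturing that shape is handy as an assertion in interceptor tests:

```kotlin
// W3C traceparent: version(2)-trace-id(32)-parent-id(16)-flags(2), lowercase hex
val TRACEPARENT_FORMAT = Regex("^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")

// Example value from the W3C spec:
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```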
Ship spans to Grafana Tempo, Honeycomb, Datadog APM, or AWS X-Ray. When checkout is slow, you can drill from client span → API gateway span → payments service span → DB query span in one click.
Topic 5 · SLOs, SLIs, and alerting
Set Service Level Objectives for the user experience and alert when you're burning the error budget:
| SLI (signal) | SLO target | Budget per 30 d (1 M sessions) |
|---|---|---|
| Crash-free sessions | ≥ 99.5% | 5,000 crashes |
| ANR-free sessions | ≥ 99.8% | 2,000 ANRs |
| Cold start P95 | < 2,000 ms | 50,000 slow starts |
| Checkout success rate | ≥ 99.0% | 10,000 failed checkouts |
| Screen load P75 (critical screens) | < 1,200 ms | 250,000 slow loads |
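The budget column is plain arithmetic: budget = volume × (1 - target). A small helper you might keep next to your SLO definitions (names are illustrative):

```kotlin
import kotlin.math.roundToLong

// Error budget: how many bad events the SLO tolerates over the window.
fun errorBudget(volume: Long, sloTarget: Double): Long =
    (volume * (1.0 - sloTarget)).roundToLong()

// 1M sessions at a 99.5% crash-free target tolerates 5,000 crashes
```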
Alerting in production
```yaml
# Grafana / Cloud Monitoring alerting policy (pseudo-YAML)
- name: crash_free_regression
  condition: |
    avg_over_time(crash_free_sessions{app="prod"}[1h]) < 0.995
  for: 15m
  action:
    - pager: oncall-android
    - slack: '#android-incidents'
    - auto_halt_rollout: true

- name: anr_spike
  condition: |
    rate(anr_events{app="prod"}[15m]) > 3 * baseline_1w
  for: 10m
  action:
    - slack: '#android-incidents'

- name: checkout_success_drop
  condition: |
    sum(rate(checkout_success[5m])) / sum(rate(checkout_attempt[5m])) < 0.98
  for: 10m
  action:
    - pager: oncall-payments
    - slack: '#checkout-team'
```
Alert fatigue kills teams. Every alert should be:
- Actionable (there's a runbook)
- Ownable (a named team, not "mobile")
- Breaking something real (tied to an SLO, not a metric anomaly)
The observability stack
- Firebase Crashlytics: default choice. Free for unlimited events; issue grouping, Slack integration, velocity alerts.
- Firebase Performance: app start, screen render, and HTTP traces out of the box. Free for most teams.
- Sentry: commercial alternative with source maps, session replay, and distributed tracing in one tool.
- Datadog: real user monitoring + backend APM + logs in one pane. Expensive but powerful for enterprise.
- OpenTelemetry: SDK for traces and metrics that ships to any OTLP backend. Avoids vendor lock-in.
- Timber: swap log backends without touching call sites. Pair with custom Trees for Crashlytics + log shipping.
Privacy-preserving telemetry
Observability cannot become surveillance. Apply:
- Purpose limitation — collect only what you act on. No "just in case" data.
- Sampling — send 10% of spans, not 100%. You get signal without cost.
- Aggregation, not identification — for UX metrics, user IDs are unnecessary. Use randomized install IDs that reset on uninstall.
- Retention windows — logs 30 days, traces 7 days, crashes 90 days. Automate deletion; don't rely on policy.
- Opt-out controls — every analytics SDK needs a user-visible toggle. GDPR requires it; users notice it.
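The sampling bullet deserves one design note: sample per session, not per event, or you collect fragments of many sessions instead of complete pictures of a few. A deterministic sketch using hash-based bucketing (names are illustrative):

```kotlin
// Decide once per session whether to ship telemetry. Hashing the session
// ID makes the decision deterministic: every event in a sampled session
// ships, and unsampled sessions ship nothing.
fun isSessionSampled(sessionId: String, ratePercent: Int): Boolean {
    val bucket = (sessionId.hashCode().toLong() and 0x7FFFFFFF) % 100
    return bucket < ratePercent
}
```

Because the decision is a pure function of the session ID, client and backend can agree on it without coordination.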
Practice exercises
1. Crashlytics bootstrap: add a CrashlyticsBootstrap class that sets the user ID and flavor, and updates last.screen on every navigation event.
2. Custom Performance trace: wrap your checkout flow in a Firebase Performance custom trace with item-count and order-total metrics.
3. Structured logging tree: implement a StructuredLoggingTree that ships JSON logs to a backend endpoint, with PII redaction.
4. OpenTelemetry integration: add an OkHttp TracingInterceptor that propagates traceparent headers. Verify the trace appears in Honeycomb or Tempo.
5. SLO definition: write down three SLOs for your app (crash-free, cold start, business flow) and compute the monthly error budget for each.
Next module
Continue to Module 19 — Enterprise UX — design systems, accessibility, internationalization, and large-screen / foldable support at scale.