Benchmark & Property-Based Testing
Unit and instrumentation tests verify correctness. Benchmarks verify performance. Property-based tests explore the input space to find bugs you wouldn't think to write tests for. Both are specialist tools, but for the right problems they save weeks of production firefighting.
Microbenchmark — per-function CPU
Use Microbenchmark when you want to measure a specific function across scenarios (e.g., is parser A faster than parser B?).
// benchmark/build.gradle.kts
plugins {
    alias(libs.plugins.android.library)
    alias(libs.plugins.kotlin.android)
    alias(libs.plugins.benchmark)
}

android {
    defaultConfig {
        testInstrumentationRunner = "androidx.benchmark.junit4.AndroidBenchmarkRunner"
    }
}

dependencies {
    androidTestImplementation("androidx.benchmark:benchmark-junit4:1.3.3")
}
@RunWith(AndroidJUnit4::class)
class JsonParserBenchmark {
    @get:Rule val benchmarkRule = BenchmarkRule()

    private val json = ClassLoader.getSystemResourceAsStream("product.json")!!
        .bufferedReader().use { it.readText() }

    @Test fun moshi() {
        val adapter = Moshi.Builder().build().adapter(Product::class.java)
        benchmarkRule.measureRepeated {
            adapter.fromJson(json)
        }
    }

    @Test fun kotlinx() {
        benchmarkRule.measureRepeated {
            Json.decodeFromString<Product>(json)
        }
    }
}
Output:
JsonParserBenchmark.moshi   mean=85.2 µs  min=82 µs  stdev=1.8 µs
JsonParserBenchmark.kotlinx mean=61.8 µs  min=58 µs  stdev=2.1 µs
Microbenchmark guards against skewed results: it errors out on debuggable builds and warns when CPU clocks are unlocked or the device is busy. Run it on a release build, ideally with an isolated core, to keep results deterministic.
Hot vs cold phases
benchmarkRule.measureRepeated {
    // WARMUP is implicit — the Benchmark library auto-warms the JIT.
    // Everything in this block is timed unless explicitly excluded.
    val state = runWithTimingDisabled {
        prepare() // setup you don't want to measure
    }
    // Measured
    heavyWork(state)
}
runWithTimingDisabled pauses the clock for setup (fresh input, cache invalidation) and returns the value of its block. It keeps the benchmark honest.
Macrobenchmark — real user-facing performance
Use Macrobenchmark for end-to-end user flows — cold start, scroll frame timing, input latency. See Module 10: Baseline Profiles for the full startup chapter; this is the testing lens.
// benchmark/build.gradle.kts
plugins {
    alias(libs.plugins.android.test)
    alias(libs.plugins.benchmark)
}

dependencies {
    implementation("androidx.benchmark:benchmark-macro-junit4:1.3.3")
    implementation("androidx.test.uiautomator:uiautomator:2.3.0")
}
Cold start
@RunWith(AndroidJUnit4::class)
class StartupBenchmark {
    @get:Rule val rule = MacrobenchmarkRule()

    @Test fun startupCompilationNone() = startup(CompilationMode.None())

    @Test fun startupCompilationBaselineProfile() =
        startup(CompilationMode.Partial(BaselineProfileMode.Require))

    private fun startup(mode: CompilationMode) = rule.measureRepeated(
        packageName = "com.myapp",
        metrics = listOf(StartupTimingMetric(), FrameTimingMetric()),
        compilationMode = mode,
        startupMode = StartupMode.COLD,
        iterations = 10,
        setupBlock = { pressHome() },
        measureBlock = {
            startActivityAndWait()
            device.wait(Until.hasObject(By.res("home_root")), 5_000)
        }
    )
}
Results:
StartupBenchmark.startupCompilationNone
timeToInitialDisplayMs P50=1,480 P90=1,620 P99=1,790
StartupBenchmark.startupCompilationBaselineProfile
timeToInitialDisplayMs P50=1,010 P90=1,120 P99=1,240
Scroll jank
@Test fun feedScroll() = rule.measureRepeated(
    packageName = "com.myapp",
    metrics = listOf(FrameTimingMetric()),
    compilationMode = CompilationMode.Partial(),
    startupMode = StartupMode.WARM,
    iterations = 10,
    setupBlock = {
        startActivityAndWait()
        device.wait(Until.hasObject(By.res("feed")), 5_000)
    }
) {
    val list = device.findObject(By.res("feed"))
    list.setGestureMargin(device.displayWidth / 5)
    repeat(5) {
        list.fling(Direction.DOWN)
        device.waitForIdle()
    }
}
FrameTimingMetric reports:
- frameDurationCpuMs — P50/P90/P99 of CPU frame time
- frameOverrunMs — how much each dropped frame overran its deadline
Targets (for 60 Hz): P95 < 16.67 ms, P99 < 33.33 ms. At 90/120 Hz the per-frame budget shrinks to 11.11 ms / 8.33 ms, so targets tighten accordingly.
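The budgets above all come from the same arithmetic. As a sketch, with a hypothetical frameBudgetMs helper (not part of any library):

```kotlin
// Per-frame budget in milliseconds for a given display refresh rate.
// frameBudgetMs is a hypothetical helper, not a benchmark-library API.
fun frameBudgetMs(refreshRateHz: Double): Double = 1_000.0 / refreshRateHz

fun main() {
    println(frameBudgetMs(60.0))   // ~16.67 ms per frame at 60 Hz
    println(frameBudgetMs(90.0))   // ~11.11 ms at 90 Hz
    println(frameBudgetMs(120.0))  // ~8.33 ms at 120 Hz
}
```

Any frame whose CPU + GPU work exceeds its budget is delivered late, which is exactly what frameOverrunMs captures.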
Input latency
@Test fun inputLatency() = rule.measureRepeated(
    packageName = "com.myapp",
    metrics = listOf(TraceSectionMetric("onClick-submit")),
    iterations = 10
) {
    /* navigate to button, click, measure trace section */
}
With a Trace.beginSection("onClick-submit") in production code, you can
benchmark the latency of specific UI handlers.
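The production-side counterpart might look like the sketch below. android.os.Trace is the real API; onSubmitClicked, validateForm, and submitOrder are hypothetical app code, and the section name must match the one passed to TraceSectionMetric:

```kotlin
import android.os.Trace

// Hypothetical click handler instrumented with a custom trace section.
fun onSubmitClicked() {
    Trace.beginSection("onClick-submit")
    try {
        validateForm()  // hypothetical app logic being measured
        submitOrder()
    } finally {
        Trace.endSection() // always end the section, even if the handler throws
    }
}
```

The try/finally matters: an unbalanced beginSection/endSection pair corrupts every trace recorded after it.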
Running on Firebase Test Lab
./gradlew :benchmark:connectedCheck
For true device coverage, upload the benchmark APKs to Firebase Test Lab across a device matrix.
JankStats — measuring real users
Benchmarks measure your dev phone. JankStats (AndroidX) measures real users in production so you know actual field performance:
// build.gradle
implementation("androidx.metrics:metrics-performance:1.0.0-beta02")
class JankReporterInitializer @Inject constructor(
    private val crashReporter: CrashReporter
) : Initializer<Unit> {

    override fun create(context: Context) {
        ProcessLifecycleOwner.get().lifecycleScope.launch {
            ProcessLifecycleOwner.get().lifecycle.repeatOnLifecycle(Lifecycle.State.STARTED) {
                val window = /* obtain a Window reference */
                val jankStats = JankStats.createAndTrack(window, JankStats.OnFrameListener { frame ->
                    if (frame.isJank) {
                        crashReporter.log("jank: ${frame.frameDurationUiNanos / 1_000_000}ms on ${frame.states}")
                    }
                })
            }
        }
    }

    override fun dependencies() = emptyList<Class<out Initializer<*>>>()
}
Combined with per-screen PerformanceMetricsState, you get real-world
jank rates per screen:
val metricsHolder = remember(view) { PerformanceMetricsState.getHolderForHierarchy(view) }
LaunchedEffect(metricsHolder) {
    metricsHolder.state?.apply {
        putState("screen", "home_feed")
        putSingleFrameState("refresh", "user_initiated")
    }
}
Ship jank percentiles to your observability backend (Module 18).
Property-based testing
Example-based tests assert specific cases:
@Test fun `add commutes`() {
    assertEquals(5, add(2, 3))
    assertEquals(5, add(3, 2))
}
Property-based tests assert laws — properties that should hold for any input:
// build.gradle
testImplementation("io.kotest:kotest-property:5.9.1")
test("addition commutes") {
    checkAll(Arb.int(), Arb.int()) { a, b ->
        add(a, b) shouldBe add(b, a)
    }
}
Kotest generates 1000 random pairs by default. Two features make this more than blind randomness:
- Edge cases — boundary values (0, 1, -1, Int.MAX_VALUE, Int.MIN_VALUE) are deliberately mixed into the samples
- Shrinking — if a test fails, Kotest narrows it down to the minimal failing input
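When a property fails in CI, you want to replay the exact same inputs locally. One way, assuming Kotest's PropTestConfig (the seed value here is arbitrary):

```kotlin
test("addition commutes (replayable)") {
    // Pinning the seed makes the generated inputs identical on every run,
    // so a failure reported by CI can be reproduced on a dev machine.
    checkAll(PropTestConfig(seed = 987654321L), Arb.int(), Arb.int()) { a, b ->
        add(a, b) shouldBe add(b, a)
    }
}
```

Kotest also prints the seed of a failing run in its failure message, so you can pin it after the fact rather than in advance.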
When property-based pays off
Ideal problems
- Parsers and serializers (round-trip property)
- Data transformations (associativity, commutativity, idempotency)
- State machines (invariant checks)
- Collections (list reversed twice = original)
- Math and statistics
- Validators (parse + format + parse = original)
Use examples
- UI behavior (can't generate meaningful clicks)
- Thin CRUD wrappers
- One-off business rules with < 5 cases
- External side effects (network, DB)
- Tests where shrinking makes no sense (single enum value)
- Simple value objects
Round-trip property — always worth writing
For any encode/decode pair, assert the round-trip:
test("JSON round-trip") {
    checkAll(Arb.user()) { user ->
        val json = Json.encodeToString(user)
        val decoded = Json.decodeFromString<User>(json)
        decoded shouldBe user
    }
}

fun Arb.Companion.user(): Arb<User> = Arb.bind(
    Arb.string(minSize = 1),                      // id
    Arb.string(minSize = 1),                      // name
    Arb.string(minSize = 5).filter { "@" in it }, // email
    Arb.long(min = 0)                             // createdAt
) { id, name, email, created -> User(id, name, email, created) }
One round-trip test often catches bugs that 20 example tests miss — Unicode edge cases, null handling, character encoding.
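A toy illustration of why the property matters. This naive pipe-delimited codec (Account, encode, and decode are made up for this sketch) passes obvious examples but corrupts any id containing the delimiter, exactly the kind of input a generator stumbles on:

```kotlin
data class Account(val id: String, val name: String)

// Naive codec with a latent bug: no escaping for the '|' delimiter.
fun encode(a: Account): String = "${a.id}|${a.name}"
fun decode(s: String): Account {
    val (id, name) = s.split("|", limit = 2)
    return Account(id, name)
}

fun main() {
    val ok = Account("u1", "Ada")
    check(decode(encode(ok)) == ok) // the happy path round-trips fine

    val tricky = Account("u2|admin", "Bob") // '|' inside the id
    check(decode(encode(tricky)) != tricky) // round-trip silently corrupts it
}
```

An example-based suite would almost certainly contain only "happy path" ids; a generator producing arbitrary strings hits the delimiter within the first few hundred samples.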
Stateful property-based
For testing stateful systems (e.g., a cache), generate random sequences of commands:
test("cache respects LRU eviction") {
    checkAll(Arb.list(Arb.cacheCommand(), 1..100)) { commands ->
        val cache = LruCache<String, Int>(maxSize = 10)
        val model = LinkedHashMap<String, Int>()
        commands.forEach { cmd ->
            when (cmd) {
                is Put -> {
                    cache[cmd.key] = cmd.value
                    model.remove(cmd.key) // re-putting a key refreshes its recency
                    model[cmd.key] = cmd.value
                    while (model.size > 10) model.remove(model.keys.first())
                }
                is Get -> {
                    cache[cmd.key] shouldBe model[cmd.key]
                    model.remove(cmd.key)?.let { model[cmd.key] = it } // a hit refreshes recency too
                }
            }
        }
    }
}
The model tracks expected state; the cache is compared to it. Any divergence = a bug.
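The commands need a generator. One possible sketch, assuming Kotest's Arb combinators (Put, Get, and Arb.cacheCommand are names taken from the test above, not a library API); a deliberately small key space forces frequent collisions and evictions:

```kotlin
sealed interface CacheCommand
data class Put(val key: String, val value: Int) : CacheCommand
data class Get(val key: String) : CacheCommand

fun Arb.Companion.cacheCommand(): Arb<CacheCommand> {
    // Only 12 distinct keys against a 10-entry cache, so eviction paths
    // are exercised constantly rather than almost never.
    val keys = Arb.element("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l")
    return Arb.choice(
        Arb.bind(keys, Arb.int()) { k, v -> Put(k, v) },
        keys.map { Get(it) }
    )
}
```

Tuning the key space is the main lever here: too many keys and eviction never triggers, too few and you never test misses.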
Fuzz testing
Kotest's property-based testing is a structured flavor of fuzzing. For deeper fuzzing (feeding random bytes into a parser to find crashes), Jazzer and AFL++ are the industry tools — rare in Android apps outside of security-sensitive parsers.
Combining benchmarks with CI
# .github/workflows/performance.yml
on:
  schedule: [{cron: '0 3 * * *'}]  # nightly
  workflow_dispatch:               # manual

jobs:
  macrobenchmark:
    runs-on: macos-14-large
    steps:
      - uses: actions/checkout@v4
      - uses: gradle/actions/setup-gradle@v4
      - uses: reactivecircus/android-emulator-runner@v2
        with:
          api-level: 34
          target: google_apis
          arch: x86_64
          script: ./gradlew :benchmark:connectedBenchmarkAndroidTest
      - name: Compare with baseline
        run: |
          python compare_benchmarks.py \
            --baseline benchmarks/baseline.json \
            --current benchmarks/result.json \
            --threshold 10  # fail on a >10% regression
Catching a regression in CI is cheaper than discovering it from a user report. Set realistic thresholds and update the baseline on approved changes.
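compare_benchmarks.py (a script name assumed by the workflow above) boils down to one comparison per metric. A hypothetical Kotlin equivalent of that core check:

```kotlin
// True when the current measurement exceeds the baseline by more than
// thresholdPercent. isRegression is a hypothetical helper, not a library API.
fun isRegression(baselineMs: Double, currentMs: Double, thresholdPercent: Double): Boolean =
    currentMs > baselineMs * (1 + thresholdPercent / 100)

fun main() {
    // Baseline P50 = 1010 ms (the startup result above), threshold = 10%.
    check(!isRegression(1010.0, 1080.0, 10.0)) // +6.9%: within threshold, passes
    check(isRegression(1010.0, 1150.0, 10.0))  // +13.9%: fails the build
}
```

Comparing percentiles (P50/P90) rather than means keeps the check resilient to a single noisy iteration.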
Chaos testing (advanced)
For apps with real-time data (chat, games), introduce random network chaos into integration tests:
class ChaoticInterceptor(private val random: Random = Random.Default) : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val roll = random.nextFloat()
        return when {
            roll < 0.05f -> throw IOException("simulated network failure")
            roll < 0.10f -> {
                Thread.sleep(5_000) // simulated 5 s stall
                chain.proceed(chain.request())
            }
            else -> chain.proceed(chain.request())
        }
    }
}
Run UI tests with the chaotic interceptor in the graph. Your app should degrade gracefully under simulated failure.
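Wiring it in only for debug builds might look like this sketch (assumes OkHttp and the standard BuildConfig.DEBUG flag):

```kotlin
// Debug-only chaos: release builds get a clean client.
val client = OkHttpClient.Builder()
    .apply { if (BuildConfig.DEBUG) addInterceptor(ChaoticInterceptor()) }
    .build()
```

Because the interceptor sits in the OkHttp chain, every feature that talks to the network inherits the chaos with no per-test setup.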
Common anti-patterns
Benchmark / PBT mistakes
- Benchmarking in debug builds (throws / garbage numbers)
- Measuring with stack traces enabled
- Property-based tests on UI behavior
- Round-trip tests without a fixed seed for reproducibility
- Ignoring benchmark regressions ("probably noise")
- Macrobenchmark on emulator only
Production-grade
- Release builds, isolated CPU core
- CompilationMode.Partial for realistic prod conditions
- PBT for parsers, transformations, state machines
- Checkpoint seeds for shrinking reproducibility
- Baseline + threshold in CI; alert on regression
- Firebase Test Lab device matrix in nightly
Practice exercises
- 01 Microbenchmark two parsers: Compare Moshi vs kotlinx.serialization for your product model. Measure with BenchmarkRule.
- 02 Cold-start benchmark: Write a Macrobenchmark that measures cold start in CompilationMode.None vs CompilationMode.Partial(BaselineProfileMode.Require). Report the delta.
- 03 Round-trip PBT: For one domain type (User, Cart, Message), write a round-trip property asserting decode(encode(x)) == x across 1000 random inputs.
- 04 JankStats in prod: Add JankStats tracking to one screen. Ship a custom key `screen=home_feed`. Verify jank rates appear in your observability backend.
- 05 Chaotic interceptor: Add a ChaoticInterceptor in debug builds. Run your UI tests with it; fix any flows that crash or hang under simulated failure.
Next
Return to Module 09 Overview or continue to Module 10 — Performance Optimization.