Android System Design Walkthroughs

System design is the highest-variance round in senior Android interviews. Most engineers know the components (ViewModel, Room, Retrofit) but haven't practiced designing a full system end-to-end. This chapter walks through four classic Android system-design prompts using the five-step framework from the Module 20 overview.

The five-step framework

1. Scope (5 min): clarify requirements, assumptions, success metrics
2. High-level: sketch the architecture (layers, modules, services)
3. Data model + sync: what's stored where; how writes / reads flow
4. Failure modes: offline, race conditions, concurrency, scale
5. Observability: what metrics define success; rollout plan

Senior signal: spend half the time on steps 4-5. Juniors design the happy path; staff engineers stress-test it.

Walkthrough 1 — Offline-first chat app

1. Scope

Clarifying questions:

1-on-1 or group? → 1-on-1 first; group chat is stretch.
Scale? → 100M DAU, 1B messages/day, P99 1s send-to-deliver.
Offline? → Full offline: send, read, react when disconnected.
Media? → Photos + short video + voice notes.
E2E encryption? → Signal Protocol; assume 3rd-party library (libsignal).
Success metrics → P99 send latency < 1s, delivery success ≥ 99.9%, cold-start to inbox < 1s, crash-free ≥ 99.5%.
Out of scope → Calling, ephemeral messages, stickers.

2. High-level architecture

┌────────────────────────────────────────────────────────────────┐
│  UI Layer (Compose)                                            │
│    InboxScreen · ConversationScreen · ComposerScreen           │
├────────────────────────────────────────────────────────────────┤
│  Presentation (MVVM / MVI)                                     │
│    InboxViewModel · ConversationViewModel (with state machine) │
├────────────────────────────────────────────────────────────────┤
│  Domain                                                        │
│    SendMessageUseCase · ObserveConversationUseCase             │
│    Entities: Message, Conversation, Reaction, Presence         │
├────────────────────────────────────────────────────────────────┤
│  Data                                                          │
│    MessageRepository (single source of truth: Room)            │
│      ↓                 ↓                   ↓                   │
│    Room             OutboxQueue         ChatSocket             │
│    (encrypted       (WorkManager)       (WebSocket + resume)   │
│     via SQLCipher)                                             │
│      ↑                                                         │
│    SyncWorker ← cursor-based sync ← backend                    │
└────────────────────────────────────────────────────────────────┘

3. Data model

@Entity(tableName = "messages", indices = [Index("conversationId", "sentAt")])
data class MessageEntity(
    @PrimaryKey val id: String,             // client-generated UUID
    val conversationId: String,
    val authorId: String,
    val body: String,                        // encrypted before storage
    val sentAt: Long,                        // client clock
    val serverAt: Long?,                     // nullable until acked
    val status: MessageStatus,               // DRAFT → SENDING → SENT → DELIVERED → READ | FAILED
    val attempts: Int = 0,
    val replyToId: String? = null,
    val attachmentUri: String? = null
)

@Entity(tableName = "conversations")
data class ConversationEntity(
    @PrimaryKey val id: String,
    val lastMessageId: String?,
    val unreadCount: Int,
    val participantIds: String,              // CSV
    val updatedAt: Long
)

@Entity(primaryKeys = ["messageId", "userId", "emoji"])
data class ReactionEntity(
    val messageId: String,
    val userId: String,
    val emoji: String
)
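
The status comment on MessageEntity implies a small state machine. A minimal sketch of enforcing it is below; the retry edge from FAILED back to SENDING and the direct SENT to READ edge are assumptions, not something the data model above specifies.

```kotlin
// Hypothetical enum mirroring the status comment on MessageEntity.
enum class MessageStatus { DRAFT, SENDING, SENT, DELIVERED, READ, FAILED }

// Allowed transitions. FAILED -> SENDING models a manual or automatic retry (assumption).
private val allowedTransitions: Map<MessageStatus, Set<MessageStatus>> = mapOf(
    MessageStatus.DRAFT to setOf(MessageStatus.SENDING),
    MessageStatus.SENDING to setOf(MessageStatus.SENT, MessageStatus.FAILED),
    MessageStatus.SENT to setOf(MessageStatus.DELIVERED, MessageStatus.READ),
    MessageStatus.DELIVERED to setOf(MessageStatus.READ),
    MessageStatus.READ to emptySet(),
    MessageStatus.FAILED to setOf(MessageStatus.SENDING),
)

fun canTransition(from: MessageStatus, to: MessageStatus): Boolean =
    to in allowedTransitions.getValue(from)

// Applies an incoming status only if the transition is legal, so a late
// DELIVERED ack cannot downgrade a message that is already READ.
fun applyStatus(current: MessageStatus, incoming: MessageStatus): MessageStatus =
    if (canTransition(current, incoming)) incoming else current
```

Guarding updates this way keeps out-of-order acks from the socket and the sync path from fighting each other.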

Offline send flow:

User types → Room INSERT (status=SENDING, id=UUID)
     ↓ instant UI update (optimistic)
OutboxWorker picks up SENDING messages
     ↓
ChatSocket.send(msg) → server replies { id, serverAt }
     ↓
Room UPDATE status=SENT, serverAt=?
     ↓
Server fans out → other device receives via WebSocket
     ↓
Recipient ack → status=DELIVERED / READ
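
The flow above can be sketched without Android dependencies. LocalStore stands in for the Room table and FakeServer for the backend; both are hypothetical names used only for illustration. The point is that a retry with the same client-generated UUID is harmless.

```kotlin
import java.util.UUID

enum class Status { SENDING, SENT }

data class LocalMessage(
    val id: String,
    val body: String,
    val status: Status = Status.SENDING,
    val serverAt: Long? = null,
)

// Stand-in for the server: remembers ids it has seen, so a retried send
// with the same UUID returns the original ack instead of storing a duplicate.
class FakeServer {
    private val acks = mutableMapOf<String, Long>()
    private var clock = 1_000L
    val storedCount: Int get() = acks.size

    fun send(id: String, body: String): Long = acks.getOrPut(id) { clock++ }
}

// Stand-in for the Room table, keyed by the client-generated id.
class LocalStore {
    private val rows = linkedMapOf<String, LocalMessage>()

    fun insert(body: String): LocalMessage =
        LocalMessage(id = UUID.randomUUID().toString(), body = body).also { rows[it.id] = it }

    fun pending(): List<LocalMessage> = rows.values.filter { it.status == Status.SENDING }

    fun markSent(id: String, serverAt: Long) {
        rows[id] = rows.getValue(id).copy(status = Status.SENT, serverAt = serverAt)
    }

    fun get(id: String): LocalMessage = rows.getValue(id)
}

// One outbox pass. Safe to run again after a crash because ids are stable.
fun flushOutbox(store: LocalStore, server: FakeServer) {
    for (msg in store.pending()) {
        val serverAt = server.send(msg.id, msg.body)
        store.markSent(msg.id, serverAt)
    }
}
```

If the process dies after the server accepted the message but before the local row was updated, the next flushOutbox call resends the same id and the server returns the original ack.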

4. Failure modes (the senior signal)

Failure	Response
Process dies mid-send	OutboxWorker retries on next start with same UUID (idempotency)
WebSocket disconnects	Exponential backoff + resume cursor (last event ID)
Duplicate message IDs	Server dedupes via idempotency key
Deleted message while offline	Tombstone event; apply on next sync
Schema migration	Room migration + server-side version compatibility
Battery-constrained devices	Debounce typing indicator, coalesce read receipts
Low-memory kill during video upload	WorkManager foreground with partial-upload resume
Clock skew (client sentAt vs server)	Order by server-assigned serverAt once acked; use client sentAt only for pending messages
Participant removed from group	Server rejects subsequent writes; client gets 403 → mark
Network partition (split-brain)	Client always wins for own sends; conflicts resolved LWW
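
For the WebSocket reconnect row, one common way to compute the delay is exponential backoff with full jitter. The sketch below is illustrative; the 500 ms base and 30 s cap are assumed values, not figures from this design.

```kotlin
import kotlin.random.Random

// Delay before reconnect attempt number `attempt` (0-based), in milliseconds.
// Full jitter: pick uniformly in [0, min(capMs, baseMs * 2^attempt)].
fun backoffDelayMs(
    attempt: Int,
    baseMs: Long = 500,
    capMs: Long = 30_000,
    random: Random = Random.Default,
): Long {
    require(attempt >= 0) { "attempt must be non-negative" }
    // Clamp the shift so the doubling cannot overflow a Long.
    val exponential = baseMs shl attempt.coerceAtMost(20)
    val ceiling = minOf(capMs, exponential)
    return random.nextLong(ceiling + 1)
}
```

Jitter matters at this scale: without it, every client that lost the socket at the same moment reconnects at the same moment.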

5. Observability + rollout

SLIs:

Delivery success rate ≥ 99.9%
Send P99 latency < 1s
Send-to-deliver P99 latency < 1s (recipient online)
Conversation cold-load < 500ms
Crash-free sessions ≥ 99.5%

Custom Crashlytics keys: conversation_id, message_id, network_state.

Rollout: 1% → 5% → 25% → 100% over a week; automatic halt on crash-free regression below 99.5%.

Feature flag: chat_v2_enabled for gradual opt-in.
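
In practice the percentage would come from a remote-config service, but the bucketing logic behind a staged rollout can be sketched in a few lines. The function names below are hypothetical; only the flag name and the 99.5% threshold come from the plan above.

```kotlin
// Maps a user id to a stable bucket in 0..9999, so a user who is enabled at 5%
// stays enabled when the rollout grows to 25%.
fun rolloutBucket(userId: String, flag: String = "chat_v2_enabled"): Int =
    Math.floorMod("$flag:$userId".hashCode(), 10_000)

// `percent` is in the range 0.0..100.0.
fun isEnabled(userId: String, percent: Double): Boolean =
    rolloutBucket(userId) < (percent * 100).toInt()

// Halt rule from the rollout plan: stop if crash-free sessions drop below 99.5%.
fun shouldHalt(crashFreeSessions: Long, totalSessions: Long, threshold: Double = 0.995): Boolean =
    totalSessions > 0 && crashFreeSessions.toDouble() / totalSessions < threshold
```

Including the flag name in the hash keeps different experiments from always selecting the same users.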

Why this is a good answer

Five steps, each time-boxed (5-20 min; see Time management)
Data model is concrete (field names, indexes)
Failure modes list is ~half the answer — staff signal
Observability feeds into rollout discipline

Walkthrough 2 — Instagram-style photo feed

1. Scope

Scale: 500M DAU, each scrolling ~100 posts per session.
Content: Photos + short video; following + Explore / discover feeds.
Offline: Cache recent posts; post offline not required.
Interactions: Like, comment count, share button.
Success: Cold start < 1s, scroll jank < 1% frames, engagement (like rate) stable.
Out of scope: Posting, DMs.

2. High-level

┌──────────────────────────────────────────────────────────────┐
│  Feed (LazyColumn + Paging 3)                                │
├──────────────────────────────────────────────────────────────┤
│  FeedViewModel                                                │
│    - Observes PagingData<Post>                                │
│    - Emits prefetch requests for visible items                │
├──────────────────────────────────────────────────────────────┤
│  FeedRepository                                               │
│    - Pager<Int, PostEntity>(RemoteMediator)                   │
│    - Cache-aside via Room                                     │
├──────────────────────────────────────────────────────────────┤
│  Remote: GraphQL for structured data (Apollo)                 │
│  Media: CDN with adaptive images (srcset by density)          │
│  Prefetch: Coil ImageLoader warmed with next 3 posts         │
└──────────────────────────────────────────────────────────────┘

3. Data + paging

@Entity(tableName = "posts", indices = [Index("feedPosition")])
data class PostEntity(
    @PrimaryKey val id: String,
    val authorId: String,
    val authorName: String,        // denormalized
    val authorAvatar: String,      // denormalized
    val mediaUrl: String,
    val mediaAspect: Float,        // for layout stability
    val caption: String,
    val likeCount: Int,
    val commentCount: Int,
    val liked: Boolean,
    val createdAt: Long,
    val feedPosition: Long         // server-assigned cursor
)

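import androidx.paging.ExperimentalPagingApi
import androidx.paging.LoadType
import androidx.paging.PagingState
import androidx.paging.RemoteMediator
import androidx.room.withTransaction
import kotlinx.coroutines.CancellationException

@OptIn(ExperimentalPagingApi::class)
class FeedRemoteMediator(
    private val api: FeedApi,
    private val db: AppDatabase
) : RemoteMediator<Int, PostEntity>() {

    override suspend fun load(loadType: LoadType, state: PagingState<Int, PostEntity>): MediatorResult {
        // Resolve the cursor for this load; null means "start from the top".
        val cursor = when (loadType) {
            LoadType.REFRESH -> null
            // The feed only grows downward, so there is never anything to prepend.
            LoadType.PREPEND -> return MediatorResult.Success(endOfPaginationReached = true)
            // No stored cursor means there is nothing further to append.
            LoadType.APPEND -> db.remoteKeyDao().last()?.nextCursor
                ?: return MediatorResult.Success(endOfPaginationReached = true)
        }

        return try {
            val page = api.feed(cursor, limit = state.config.pageSize)
            // Replace-on-refresh and the cursor write happen atomically.
            db.withTransaction {
                if (loadType == LoadType.REFRESH) db.postDao().clear()
                db.postDao().insertAll(page.items.map(PostDto::toEntity))
                db.remoteKeyDao().insert(RemoteKey(nextCursor = page.nextCursor))
            }
            MediatorResult.Success(endOfPaginationReached = page.nextCursor == null)
        } catch (e: CancellationException) {
            throw e // never swallow coroutine cancellation
        } catch (e: Exception) {
            MediatorResult.Error(e)
        }
    }
}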

4. Failure modes

Slow network → Show cached feed with skeleton placeholders; retry with backoff and allow manual pull-to-refresh
Image failure → Per-image fallback (blurhash → dominant color → icon)
Stale feed → nextCursor becomes invalid after 24h; server 410 → trigger refresh
Like race condition → Optimistic with rollback (Module 04 + Data Patterns)
Scroll jank → Stable item keys, @Immutable PostEntity, Macrobenchmark frame-timing tests in CI
OOM from image cache → Coil's MemoryCache configured to 15% of heap
Comment count drift → Real-time listener over WebSocket for visible posts; batched elsewhere
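
The like race condition row relies on optimistic UI with rollback. A minimal synchronous sketch follows; PostUi, render, and remote are hypothetical stand-ins, and a real implementation would make the remote call a suspend function.

```kotlin
data class PostUi(val id: String, val liked: Boolean, val likeCount: Int)

// Flips the like locally and adjusts the count.
fun toggleLike(post: PostUi): PostUi =
    if (post.liked) post.copy(liked = false, likeCount = post.likeCount - 1)
    else post.copy(liked = true, likeCount = post.likeCount + 1)

// Renders the optimistic state first, then keeps it only if the remote call
// succeeds. On failure the previous state is rendered again and returned.
fun likeOptimistically(
    post: PostUi,
    render: (PostUi) -> Unit,
    remote: (postId: String, liked: Boolean) -> Result<Unit>,
): PostUi {
    val optimistic = toggleLike(post)
    render(optimistic)
    return remote(post.id, optimistic.liked).fold(
        onSuccess = { optimistic },
        onFailure = { render(post); post },
    )
}
```

Keeping the pre-toggle value around is what makes the rollback exact, rather than recomputing it from a count that may have changed.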

5. Observability

Feed load P95 < 500ms (on cache hit)
Scroll jank rate < 0.5%
Per-post impression logs sampled at 1%
Experiment: A/B ranking algorithm via server-provided feed cursor
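
Both headline numbers above are simple aggregates over raw samples. The sketch below shows one way to compute them; the 60 Hz frame deadline and the nearest-rank percentile method are assumptions.

```kotlin
import kotlin.math.ceil

// Fraction of frames that missed the deadline. The default assumes a 60 Hz display.
fun jankRate(frameDurationsMs: List<Double>, deadlineMs: Double = 1000.0 / 60): Double {
    if (frameDurationsMs.isEmpty()) return 0.0
    return frameDurationsMs.count { it > deadlineMs }.toDouble() / frameDurationsMs.size
}

// Nearest-rank percentile, e.g. p = 95.0 for P95.
fun percentile(samples: List<Double>, p: Double): Double {
    require(samples.isNotEmpty()) { "samples must not be empty" }
    require(p > 0.0 && p <= 100.0) { "p must be in (0, 100]" }
    val sorted = samples.sorted()
    val rank = ceil(p * sorted.size / 100.0).toInt()
    return sorted[rank - 1]
}
```

On high-refresh-rate devices the deadline should follow the display's actual refresh interval instead of the 60 Hz default.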

Walkthrough 3 — Ride-hailing customer app

1. Scope

Geography: Global, multi-city
Workflow: Request → match → pickup → ride → payment
Real-time: Driver location updates every 2s during ride
Payment: Credit card, wallet
Offline: Read-only after ride starts; request requires network
Success: 99.5% ride start, map frame rate 60fps, battery < 10%/ride

2. High-level

┌──────────────────────────────────────────────────────────────┐
│  Home (Map + search bar) · RequestRide · DriverMatched ·     │
│  InTrip · RideComplete · Payment · History                   │
├──────────────────────────────────────────────────────────────┤
│  RideViewModel — FSM (MVI): Idle → Requesting → Matched →   │
│                             PickingUp → InTrip → Complete    │
├──────────────────────────────────────────────────────────────┤
│  RideRepository                                               │
│    - Real-time ride state via gRPC bidirectional stream      │
│    - Cached trip history (Room)                              │
│    - Outbox for ratings / feedback (offline-submittable)     │
├──────────────────────────────────────────────────────────────┤
│  LocationClient (fused provider) · MapView · PaymentManager  │
│    (Google Pay + card via payment SDK)                       │
└──────────────────────────────────────────────────────────────┘

3. Ride state machine

sealed interface RideState {
    data object Idle : RideState
    data class Requesting(val pickup: LatLng, val dest: LatLng) : RideState
    data class Matched(val driver: Driver, val etaMinutes: Int) : RideState
    data class PickingUp(val driver: Driver, val etaMinutes: Int) : RideState   // driver en route to pickup
    data class InTrip(val driver: Driver, val currentLocation: LatLng, val etaMinutes: Int) : RideState
    data class Completed(val trip: Trip, val amountCents: Long) : RideState
    data class Cancelled(val reason: String) : RideState
}

sealed interface RideIntent {
    data class Request(val pickup: LatLng, val dest: LatLng, val rideType: RideType) : RideIntent
    data object Cancel : RideIntent
    data class Rate(val stars: Int, val tip: Long = 0) : RideIntent
}

Reducer: pure function over (State, Intent, Event) → (State, SideEffect). Easy to test: for each state, what does each intent do?
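
A reduced sketch of such a reducer is shown here. It covers only the first few states, and it folds user intents and server events into one hypothetical RideInput type; SideEffect and the "user" cancel reason are likewise illustrative names.

```kotlin
data class LatLng(val lat: Double, val lng: Double)
data class Driver(val id: String)

sealed interface RideState {
    data object Idle : RideState
    data class Requesting(val pickup: LatLng, val dest: LatLng) : RideState
    data class Matched(val driver: Driver, val etaMinutes: Int) : RideState
    data class Cancelled(val reason: String) : RideState
}

sealed interface RideInput {
    data class Request(val pickup: LatLng, val dest: LatLng) : RideInput            // user intent
    data object Cancel : RideInput                                                   // user intent
    data class DriverAssigned(val driver: Driver, val etaMinutes: Int) : RideInput  // server event
}

sealed interface SideEffect {
    data class SendRideRequest(val pickup: LatLng, val dest: LatLng) : SideEffect
    data object SendCancel : SideEffect
}

// Pure function: no I/O, no clocks. Inputs that make no sense in the current
// state leave it unchanged and emit no side effect.
fun reduce(state: RideState, input: RideInput): Pair<RideState, SideEffect?> = when (state) {
    is RideState.Idle -> when (input) {
        is RideInput.Request ->
            RideState.Requesting(input.pickup, input.dest) to
                SideEffect.SendRideRequest(input.pickup, input.dest)
        else -> state to null
    }
    is RideState.Requesting -> when (input) {
        is RideInput.DriverAssigned -> RideState.Matched(input.driver, input.etaMinutes) to null
        is RideInput.Cancel -> RideState.Cancelled("user") to SideEffect.SendCancel
        else -> state to null
    }
    is RideState.Matched -> when (input) {
        is RideInput.Cancel -> RideState.Cancelled("user") to SideEffect.SendCancel
        else -> state to null
    }
    is RideState.Cancelled -> state to null
}
```

Because reduce has no dependencies, a table-driven unit test can enumerate every (state, input) pair without mocks.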

4. Failure modes

Failure	Response
Driver cancels after match	RideState → Requesting; rematch automatically
Network lost during InTrip	Keep showing last known state; extrapolate position client-side (dead reckoning from last speed and heading)
Driver location updates delay	Interpolate toward last known position; show "driver may be delayed"
Low battery during trip	Reduce location updates to 10s; keep ride visible
Payment decline	Prompt alternate payment; offer wallet balance
Map tile load fails	Show cached tiles + offline indicator
Phone killed mid-trip	Ride persists in backend; notification re-opens app
Rating while offline	Outbox + WorkManager flush when connected
Driver GPS jumping	Smooth with exponential moving average
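
The exponential moving average in the last row is small enough to sketch in full. The alpha default of 0.3 is an assumed starting point to tune, and averaging raw longitudes ignores the antimeridian, which is acceptable only over short distances.

```kotlin
data class LatLng(val lat: Double, val lng: Double)

// Exponential moving average over successive fixes.
// alpha near 1 trusts the newest fix; alpha near 0 smooths harder but lags more.
class LocationSmoother(private val alpha: Double = 0.3) {
    init {
        require(alpha > 0.0 && alpha <= 1.0) { "alpha must be in (0, 1]" }
    }

    private var smoothed: LatLng? = null

    fun update(fix: LatLng): LatLng {
        val prev = smoothed
        val next = if (prev == null) fix else LatLng(
            lat = alpha * fix.lat + (1 - alpha) * prev.lat,
            lng = alpha * fix.lng + (1 - alpha) * prev.lng,
        )
        smoothed = next
        return next
    }
}
```

The first fix passes through unchanged; every later fix moves the marker only a fraction alpha of the way toward the new reading.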

5. Observability + cost

Real-time driver location stream: P95 server-to-client < 800ms
Ride complete success rate ≥ 99.5%
Battery drain per 30-min ride < 8%
Map tile cache hit rate > 85%

Cost considerations:

Location updates cost backend bandwidth → sample at 2s, interpolate client-side
Map tiles via CDN with 24h cache
gRPC with protobuf typically sends several times fewer bytes than REST+JSON for small location payloads; measure on real traffic
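
Client-side interpolation is what makes a 2s sampling interval look smooth. A minimal linear version is sketched below; Sample is a hypothetical type, and a production map would usually render slightly behind real time so that two samples are always available.

```kotlin
data class Sample(val lat: Double, val lng: Double, val timeMs: Long)

// Linear interpolation between two samples so the marker can move every frame
// even though updates arrive only every 2 s. Clamped so it never overshoots.
fun interpolate(from: Sample, to: Sample, nowMs: Long): Pair<Double, Double> {
    val span = (to.timeMs - from.timeMs).coerceAtLeast(1)
    val t = ((nowMs - from.timeMs).toDouble() / span).coerceIn(0.0, 1.0)
    return Pair(
        from.lat + (to.lat - from.lat) * t,
        from.lng + (to.lng - from.lng) * t,
    )
}
```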

Walkthrough 4 — Offline-first notes app

1. Scope

Scale: 10M DAU, 50 notes/user avg
Offline: Complete offline capability; edit, search, sync
Collaboration: Eventually multi-user; start solo
Rich text: Bold, italic, lists; not WYSIWYG editor
Sync: Across 2-3 devices per user (phone + tablet + web)
Success: Edit → sync P95 < 5s; no data loss ever

2. High-level

┌──────────────────────────────────────────────────────────────┐
│  UI: NoteListScreen · EditorScreen (Compose)                 │
├──────────────────────────────────────────────────────────────┤
│  EditorViewModel — TextFieldState for edit, outbox for sync │
├──────────────────────────────────────────────────────────────┤
│  NoteRepository                                               │
│    Single source of truth: Room                               │
│    Outbox table for pending ops                               │
│    SyncWorker (periodic + expedited on connectivity)          │
├──────────────────────────────────────────────────────────────┤
│  Conflict resolution: CRDT (Yjs) OR operational transform    │
│    (simpler: last-write-wins with server-side merge prompts) │
├──────────────────────────────────────────────────────────────┤
│  Search: Room FTS4 over note body                             │
│  Encryption: SQLCipher; keys in Keystore                     │
└──────────────────────────────────────────────────────────────┘

3. Data model + sync

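// External-content FTS table: indexes title and body of NoteEntity.
@Fts4(contentEntity = NoteEntity::class)
@Entity(tableName = "notes_fts")
data class NoteFts(
    @PrimaryKey @ColumnInfo(name = "rowid") val rowId: Long,   // FTS primary key must map to rowid
    val title: String,
    val body: String
)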

@Entity(tableName = "notes")
data class NoteEntity(
    @PrimaryKey val id: String,
    val title: String,
    val body: String,                      // plain text or markdown
    val createdAt: Long,
    val updatedAt: Long,
    val version: Long,                     // monotonic per note, for sync
    val syncState: SyncState               // LOCAL, SYNCED, CONFLICT
)

@Entity(tableName = "outbox")
data class OutboxEntity(
    @PrimaryKey(autoGenerate = true) val id: Long = 0,
    val noteId: String,
    val operation: String,                 // CREATE, UPDATE, DELETE
    val payload: String,                   // serialized op
    val attempt: Int = 0,
    val createdAt: Long = System.currentTimeMillis()
)

Edit flow

User types in EditorScreen
    ↓
TextFieldState → debounced (300ms) → Room UPDATE (version + 1, syncState = LOCAL)
    ↓ + Outbox INSERT
SyncWorker (triggered on update, or periodically)
    ↓ POST to backend with expected version
Backend 200 OK → Room UPDATE syncState = SYNCED; Outbox DELETE
Backend 409 Conflict → Room UPDATE syncState = CONFLICT; show merge UI
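
The 200 / 409 branch above is optimistic concurrency on the note version. The sketch below models it with an in-memory FakeBackend (a hypothetical stand-in) where baseVersion is the version the client's edit was based on.

```kotlin
data class Note(val id: String, val body: String, val version: Long)

sealed interface SyncResult {
    data class Ok(val newVersion: Long) : SyncResult         // corresponds to 200 OK
    data class Conflict(val serverNote: Note) : SyncResult   // corresponds to 409 Conflict
}

enum class SyncState { LOCAL, SYNCED, CONFLICT }

// Accepts a write only if the client edited the version the server currently holds.
class FakeBackend {
    private val notes = mutableMapOf<String, Note>()

    fun push(id: String, body: String, baseVersion: Long): SyncResult {
        val current = notes[id] ?: Note(id, body = "", version = 0)
        if (current.version != baseVersion) return SyncResult.Conflict(current)
        val saved = Note(id, body, baseVersion + 1)
        notes[id] = saved
        return SyncResult.Ok(saved.version)
    }
}

// What the client writes to NoteEntity.syncState after a push.
fun nextSyncState(result: SyncResult): SyncState = when (result) {
    is SyncResult.Ok -> SyncState.SYNCED
    is SyncResult.Conflict -> SyncState.CONFLICT
}
```

Returning the server's copy with the conflict gives the merge UI both sides without a second round trip.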

4. Failure modes

Process killed during edit → Debounced writes may be lost; minimize by flushing every 2s or on backgrounding
Two devices edit same note → Conflict detected via version mismatch; prompt user with 3-way merge
Network slow → Queue grows; user sees "Sync pending" indicator
Account locked → Writes fail 403; user sees "Sign in again"
Device reset or reinstall → Backend restores synced notes; edits that never reached the server are lost, so sync eagerly and surface unsynced state
Large note (10MB) → Chunk-based upload; reconstruct server-side
Search index stale → Rebuild FTS table after bulk import
Clock drift → Client uses updatedAt from server on sync to avoid future timestamps
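
For the large-note row, chunking and server-side reconstruction can be sketched as follows. The Chunk type and the 256 KiB default size are assumptions for illustration.

```kotlin
class Chunk(val index: Int, val total: Int, val bytes: ByteArray)

// Splits a payload into fixed-size chunks; the last chunk may be shorter.
fun splitIntoChunks(payload: ByteArray, chunkSize: Int = 256 * 1024): List<Chunk> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    if (payload.isEmpty()) return listOf(Chunk(0, 1, payload))
    val total = (payload.size + chunkSize - 1) / chunkSize
    return (0 until total).map { i ->
        val start = i * chunkSize
        val end = minOf(start + chunkSize, payload.size)
        Chunk(i, total, payload.copyOfRange(start, end))
    }
}

// Reassembly as the server might do it: tolerate out-of-order arrival, reject gaps.
fun reassemble(chunks: List<Chunk>): ByteArray {
    require(chunks.isNotEmpty()) { "no chunks" }
    val total = chunks.first().total
    val byIndex = chunks.associateBy { it.index }
    require(byIndex.size == total && (0 until total).all { it in byIndex }) { "missing chunks" }
    return (0 until total).fold(ByteArray(0)) { acc, i -> acc + byIndex.getValue(i).bytes }
}
```

Tracking which chunk indexes the server has acknowledged lets an interrupted upload resume from the first missing chunk instead of restarting.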

5. Observability + privacy

Sync success rate ≥ 99.9%
Edit-to-sync P95 < 5s
Search P95 < 200ms on 10k note corpus
Privacy: Notes encrypted on device; backend has body (not E2E-encrypted by default); user can opt into E2E via passphrase-derived key
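
The opt-in E2E mode depends on deriving a key from the user's passphrase. One way to do that with standard JCA classes is PBKDF2; the sketch below is an assumption about how it could be done, and the iteration count is a placeholder to tune against current guidance and device speed.

```kotlin
import java.security.SecureRandom
import javax.crypto.SecretKeyFactory
import javax.crypto.spec.PBEKeySpec
import javax.crypto.spec.SecretKeySpec

// Random per-user salt; stored alongside the ciphertext, it is not secret.
fun newSalt(size: Int = 16): ByteArray = ByteArray(size).also { SecureRandom().nextBytes(it) }

// Derives a 256-bit AES key from a passphrase with PBKDF2-HMAC-SHA256.
fun deriveKey(passphrase: CharArray, salt: ByteArray, iterations: Int = 600_000): SecretKeySpec {
    val spec = PBEKeySpec(passphrase, salt, iterations, 256)
    try {
        val raw = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256").generateSecret(spec).encoded
        return SecretKeySpec(raw, "AES")
    } finally {
        spec.clearPassword() // wipe the internal copy of the passphrase
    }
}
```

The same passphrase and salt always produce the same key, which is what lets a second device decrypt notes after the user enters the passphrase there.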

Patterns across all four

Pattern	Appears in
Offline-first single source of truth	Chat, Ride-hailing history, Notes
Outbox + WorkManager	Chat sends, Notes sync, ratings
Idempotency via client UUIDs	All four
Optimistic UI with rollback	Feed like, Chat send, Notes
State machine for complex flows	Ride (FSM), Checkout, Chat
Real-time updates via WebSocket / gRPC	Chat, Ride, Feed comment counts
Paging 3 with RemoteMediator	Feed, Chat history, Notes list
Room FTS for search	Chat, Notes
Encrypted storage (Keystore + SQLCipher)	Chat, Notes

Notice these are the same primitives. System design isn't memorizing four different designs — it's recognizing the right combination of patterns for the prompt.

Time management (60-min interview)

  0-5 min   Scope (questions, assumptions)
  5-15 min  High-level architecture (diagram)
 15-30 min  Data model + core flows
 30-50 min  Failure modes + trade-offs (the longest section)
 50-60 min  Observability + rollout
           Q&A (30-60 sec per question)

If the interviewer prompts mid-architecture, let them drive. They're often probing for staff-level signal: "what if the server returns 500?", "how do you handle version skew?"

Key takeaways

Practice exercises

01. Design Google Maps
Apply the five steps. Scope, high-level, data model (tile cache + offline regions), failure modes (GPS jumps, tile load fails), observability.
02. Design TikTok
Focus on infinite video scroll: prefetching, memory budget, network adaptation, view completion tracking.
03. Design Zoom
Real-time video call: codec selection, bandwidth adaptation, audio-only fallback, reconnection flows.
04. Mock interview
Ask a peer to give you a prompt + 60 min. Time yourself through the five steps. Record; review your pacing and depth.

Return to Module 20 Overview or continue to building your portfolio projects.
