Engineering Culture

Engineering culture is the set of shared values, practices, and norms that define how an engineering organization operates. Great culture does not happen by accident — it is intentionally built and continuously maintained by technical leaders at every level. This page covers the essential cultural practices that distinguish high-performing engineering organizations.


Code Review Culture

Code reviews are one of the most impactful practices in software engineering. They catch bugs, spread knowledge, maintain standards, and build team cohesion. But their effectiveness depends entirely on culture.

The Purpose of Code Review

| Purpose | Priority |
|---------|----------|
| Knowledge sharing | Highest — the primary long-term value |
| Catching bugs and design issues | High — finds problems before production |
| Maintaining consistency | Medium — ensures standards are followed |
| Mentoring | Medium — teaches patterns and practices |
| Documentation | Lower — the PR description serves as a record |

What Good Code Reviews Look Like

Reviewer behaviors:

  • Review within 24 hours (ideally same business day)
  • Focus on the “what” and “why”, not just the “how”
  • Ask questions instead of making demands: “What happens if this fails?” vs “Add error handling”
  • Distinguish between blocking issues and suggestions: “Nit:” or “Optional:” prefixes
  • Provide positive feedback on good patterns, not just criticism
  • Review the design first, then the implementation details

Author behaviors:

  • Keep PRs small (under 400 lines of changed code)
  • Write a clear description explaining what and why
  • Self-review before requesting review
  • Respond to all comments, even if just “Done”
  • Do not take feedback personally

Code Review Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Rubber stamping | LGTM without reading | Require substantive comments; rotate reviewers |
| Nitpick wars | Debating style preferences | Automate style with linters/formatters |
| Gatekeeping | One person must approve everything | Distribute review responsibility; trust the team |
| PR too large | 2000+ line PRs that nobody reads carefully | Enforce size limits; break work into smaller PRs |
| Slow reviews | PRs waiting days for review | Set SLA (e.g., review within 4 business hours) |
| Review by seniority only | Juniors never review | Everyone reviews; juniors learn by reviewing |
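The size-limit fix above is easy to automate. The following is a sketch of a CI gate, assuming the pipeline can supply `git diff --numstat` output for the PR (e.g. against its merge base); the 400-line threshold and function names are illustrative, not a standard tool.

```python
# Sketch of a CI gate that fails oversized PRs. The 400-line threshold
# mirrors the guidance above; adjust it to your team's standard.
MAX_CHANGED_LINES = 400

def changed_lines(numstat_output: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output.

    Binary files report "-" for both counts; skip them.
    """
    total = 0
    for line in numstat_output.strip().splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added == "-" or deleted == "-":
            continue
        total += int(added) + int(deleted)
    return total

def check_pr_size(numstat_output: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """Return True if the PR is within the size limit."""
    return changed_lines(numstat_output) <= limit
```

In CI, run `git diff --numstat origin/main...HEAD` (branch names assumed) and fail the build when `check_pr_size` returns `False`.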

Automating the Tedious Parts

| What HUMANS should review | What MACHINES should check |
|---------------------------|----------------------------|
| Architecture and design | Code formatting (Prettier, Black) |
| Business logic correctness | Linting (ESLint, Pylint) |
| Error handling strategy | Type checking (TypeScript, mypy) |
| Security implications | Test coverage thresholds |
| API contract changes | Dependency vulnerabilities |
| Naming and readability | Build success |
| Performance implications | Performance regression tests |

Blameless Postmortems

When things go wrong in production, the response should be learning, not blame. Blameless postmortems are a cornerstone of high-reliability engineering organizations.

Why Blameless?

  • People who fear punishment will hide mistakes and avoid reporting incidents
  • The root cause is almost always systemic, not individual
  • Blame shuts down the psychological safety needed for honest analysis
  • The goal is to fix the system, not to find a scapegoat

Postmortem Template

```markdown
# Incident Postmortem: Payment Processing Outage

**Date:** 2025-01-15
**Duration:** 2 hours 15 minutes (14:30 - 16:45 UTC)
**Severity:** SEV-1 (customer-facing, revenue impact)
**Author:** On-call engineer + incident commander
**Status:** Action items in progress

## Summary
Payment processing was unavailable for 2 hours 15 minutes due to
database connection pool exhaustion caused by a query regression
in the latest deployment.

## Impact
- 12,500 failed payment attempts
- Estimated revenue impact: $450,000
- 340 customer support tickets
- No data loss or corruption

## Timeline (all times UTC)
14:15 - Deploy v2.45.0 to production (includes order query optimization)
14:28 - Monitoring shows database connection count increasing
14:30 - First customer reports of payment failures
14:35 - PagerDuty alert fires for payment error rate > 5%
14:40 - On-call engineer acknowledges, begins investigation
14:55 - Incident escalated to SEV-1, incident commander assigned
15:10 - Root cause identified: new query missing index, holding
        connections for 30s+ instead of <100ms
15:20 - Decision to roll back deployment
15:35 - Rollback deployed, connection pool draining
16:00 - Connection pool recovered to normal levels
16:30 - Payment success rate back to 99.9%
16:45 - Incident resolved, monitoring confirmed stable

## Root Cause
The "order query optimization" in v2.45.0 changed a query that
previously used an index on (customer_id, created_at) to use
a full table scan. The query execution time increased from 50ms
to 30+ seconds, causing database connections to be held longer
than the pool could sustain.

## Contributing Factors
1. No query performance testing in CI/CD pipeline
2. Database migration and query changes reviewed by different
   people (context gap)
3. Connection pool alerts set at 80% (too close to exhaustion)
4. No query execution time monitoring for individual queries

## What Went Well
- PagerDuty alert fired within 5 minutes of impact
- Team assembled quickly and communicated clearly
- Rollback process worked smoothly
- Customer communication was timely and transparent

## What Went Wrong
- 20-minute gap between deploy and first alert
- Root cause identification took 30 minutes
- No automated rollback trigger for error rate spikes

## Action Items
| Action | Owner | Priority | Due Date |
|--------|-------|----------|----------|
| Add query performance tests to CI | @alice | P0 | 2025-01-22 |
| Lower connection pool alert to 60% | @bob | P0 | 2025-01-17 |
| Add per-query latency monitoring | @charlie | P1 | 2025-01-31 |
| Implement automated rollback on error spike | @dave | P1 | 2025-02-15 |
| Require DB team review for query changes | @alice | P2 | 2025-02-01 |

## Lessons Learned
The deployment pipeline needs to catch performance regressions
before they reach production. We are investing in query
performance testing and tighter monitoring to prevent similar
incidents.
```
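The first action item, query performance tests in CI, can start as small as a latency-budget assertion. A sketch with hypothetical names: `run_query` stands in for however your suite executes SQL against a seeded, production-like database, and budgets should sit well below the connection pool timeout.

```python
import time

# Sketch of a query performance regression guard. `run_query` is any
# zero-argument callable that executes the query under test.
def assert_query_under(run_query, budget_seconds: float) -> float:
    """Fail if the query takes longer than its latency budget."""
    start = time.perf_counter()
    run_query()
    elapsed = time.perf_counter() - start
    if elapsed > budget_seconds:
        raise AssertionError(
            f"query took {elapsed:.3f}s, budget is {budget_seconds:.3f}s"
        )
    return elapsed
```

A CI test might then read `assert_query_under(lambda: db.execute(ORDER_QUERY), 0.1)`, where `db` and `ORDER_QUERY` are placeholders for your own fixtures.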

Running an Effective Postmortem Meeting

  1. Set the tone — “We are here to learn, not to blame. The system failed, not a person.”
  2. Build the timeline collaboratively — Walk through events chronologically
  3. Ask “why” five times — Dig into root causes, not surface symptoms
  4. Focus on systemic fixes — “How do we prevent ANY engineer from making this mistake?”
  5. Assign action items with owners and deadlines — Postmortems without action items are theater
  6. Follow up — Track completion of action items; review in the next team meeting

Documentation Culture

Documentation is a force multiplier. One engineer writing a document saves hundreds of hours of questions, onboarding time, and rediscovered knowledge.

Types of Documentation

| Type | Audience | Lifespan | Example |
|------|----------|----------|---------|
| How-to guides | Engineers performing a task | Long | “How to set up the dev environment” |
| Tutorials | New learners | Long | “Building your first feature” |
| Reference | Engineers looking up details | Long | API documentation, configuration reference |
| Explanations | Engineers needing context | Medium | “Why we chose event sourcing” |
| ADRs | Current and future engineers | Permanent | Architecture decisions (see previous page) |
| Runbooks | On-call engineers | Medium | “How to handle a database failover” |
| Onboarding docs | New hires | Medium | “Week 1-4 onboarding plan” |

Documentation Best Practices

  • Store docs near the code — In the repo, not a separate wiki that rots
  • Treat docs like code — Review in PRs, test links, automate generation
  • Write for your audience — A runbook for on-call is different from an API guide
  • Keep docs current — Outdated docs are worse than no docs (they mislead)
  • Make docs discoverable — Clear naming, good search, table of contents
  • Lower the barrier — Templates, examples, and style guides make writing easier

The Documentation Quadrant

| | Studying (learning-oriented) | Working (task-oriented) |
|---|------------------------------|-------------------------|
| Practical | Tutorials: “Follow these steps to learn this concept” | How-to Guides: “Steps to achieve a specific goal” |
| Theoretical | Explanations: “Understanding the background and context” | Reference: “Technical specs, API docs, config options” |

(Based on the Divio documentation framework)

Knowledge Sharing

Tech Talks and Presentations

  • Lightning talks (5-10 min) — Low barrier, encourage broad participation
  • Lunch and learns (30-45 min) — Deeper dives, often with food as incentive
  • Architecture reviews (60 min) — Team presents system design for cross-team feedback
  • Demo days — Teams showcase what they shipped that sprint/quarter

Guilds and Communities of Practice

Guilds (or chapters, communities of practice) are cross-team groups organized around a shared interest or skill:

```
                 Frontend Guild
                ┌─────────────┐
Team A ─────────┤   Shared    ├───────── Team B
(FE dev)        │  standards, │          (FE dev)
                │  tooling,   │
Team C ─────────┤  knowledge  ├───────── Team D
(FE dev)        └─────────────┘          (FE dev)
```

Examples:
- Frontend Guild: shared component library, CSS standards
- Data Engineering Guild: data pipeline standards, tooling
- Security Guild: security review process, threat modeling
- Testing Guild: testing strategies, framework selection

Internal Knowledge Base Practices

| Practice | Description |
|----------|-------------|
| “Write it once” rule | If you explain something twice, write a document |
| Show-and-tell sessions | Monthly presentations of interesting work |
| Pair programming rotation | Engineers pair across teams to spread knowledge |
| Reading groups | Team reads and discusses a paper or book chapter weekly |
| Internal blog | Engineers share learnings, post-project reflections |
| Decision logs | Searchable archive of all ADRs and RFCs |

Hiring and Onboarding

Hiring for Culture Fit vs Culture Add

| Culture Fit | Culture Add |
|-------------|-------------|
| “Will this person get along with us?” | “What unique perspective does this person bring?” |
| Risk: homogeneous teams, groupthink | Benefit: diverse thinking, innovation |
| Check: shared values and work ethic | Check: complementary skills and viewpoints |

Effective Technical Interviews

| Interview Type | What It Assesses | Duration |
|----------------|------------------|----------|
| Coding (live) | Problem-solving, communication, code quality | 45-60 min |
| System design | Architecture skills, trade-off analysis | 45-60 min |
| Take-home project | Real-world coding in a realistic setting | 2-4 hours |
| Code review | How they evaluate others’ code, communication | 30-45 min |
| Behavioral | Collaboration, conflict resolution, values | 30-45 min |

Onboarding Checklist

Week 1: Getting Started
[ ] Development environment setup (automated, documented)
[ ] Access to all tools (source control, CI/CD, monitoring, chat)
[ ] Meet the team (1:1s with each team member)
[ ] Read team charter, ADRs, and architecture docs
[ ] First "good first issue" assigned
[ ] Buddy/mentor assigned
Week 2-3: Contributing
[ ] First PR merged
[ ] Participate in code review (both giving and receiving)
[ ] Attend sprint ceremonies
[ ] Understand deployment process
[ ] Shadow on-call rotation
Week 4-6: Owning
[ ] Complete a small feature independently
[ ] Present at team standup or demo
[ ] Understand at least 3 team services
[ ] Give feedback on onboarding process (improve it!)
Month 2-3: Growing
[ ] Own a medium-sized project
[ ] Participate in design review
[ ] Identify one area for improvement and propose a fix
[ ] 30/60/90 day check-in with manager
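The Week 1 item “development environment setup (automated, documented)” pairs well with a doctor script new hires can run themselves. A sketch; the tool list is hypothetical and should match your team's real toolchain.

```python
import shutil

# Hypothetical toolchain for a "doctor" script a new hire runs on day
# one; substitute the tools your team actually requires.
REQUIRED_TOOLS = ["git", "docker", "node", "make"]

def missing_tools(required=REQUIRED_TOOLS):
    """Return the required command-line tools not found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]
```

Printing the missing tools with install instructions, and exiting nonzero, turns a half-day of Slack questions into a self-service checklist.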

Developer Experience (DevEx)

Developer experience is the quality of the tools, processes, and environment that developers interact with daily. Great DevEx reduces friction and lets engineers focus on solving problems.

The Three Dimensions of DevEx

```
┌─────────────────────────────────────────────────────────┐
│                  Developer Experience                   │
├─────────────────┬─────────────────┬─────────────────────┤
│  Feedback Loops │  Cognitive Load │  Flow State         │
│                 │                 │                     │
│  How fast do I  │  How much       │  How often can      │
│  get results?   │  mental effort  │  developers get     │
│                 │  is required?   │  into and stay      │
│                 │                 │  in flow?           │
│                 │                 │                     │
│  • CI time      │  • Docs         │  • Interruptions    │
│  • Build time   │  • Code         │  • Meeting load     │
│  • Deploy time  │    complexity   │  • Context          │
│  • Test time    │  • Tool         │    switching        │
│                 │    usability    │  • Autonomy         │
│                 │                 │  • Clear goals      │
└─────────────────┴─────────────────┴─────────────────────┘
```

Measuring Developer Experience

| Metric | Target | How to Measure |
|--------|--------|----------------|
| Build time | Under 2 minutes (local) | CI metrics, developer surveys |
| CI pipeline duration | Under 10 minutes | CI/CD dashboard |
| Deploy frequency | Multiple times per day | Deployment tracking |
| Time to first PR (new hire) | Under 1 week | Onboarding metrics |
| Developer satisfaction | Quarterly survey score | Anonymous surveys |
| Toil percentage | Under 20 percent | Time tracking, surveys |

DevEx Improvements with High ROI

| Improvement | Impact |
|-------------|--------|
| Automate dev environment setup | Saves days per new hire |
| Fast, reliable CI/CD | Reduces context switching from waiting |
| Good error messages | Reduces debugging time |
| Self-service infrastructure | Removes bottleneck on platform team |
| Pre-configured linters and formatters | Eliminates style debates in code review |
| Comprehensive local development docs | Reduces “it works on my machine” problems |
| Fast, incremental builds | Tighter feedback loop during development |
| Feature flags | Enables decoupling deploy from release |
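Feature flags can start as a deterministic hash bucket before you adopt a full flag service. A minimal sketch; real systems (homegrown or commercial) layer targeting rules, kill switches, and audit logs on top of this idea.

```python
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into a flag's percentage rollout.

    The same user always gets the same answer for the same flag, so a
    partially rolled-out feature is stable per user across requests.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # 0..99
    return bucket < rollout_percent
```

Raising `rollout_percent` from 0 to 100 releases the feature gradually with no redeploy, which is exactly what decouples deploy from release.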

Measuring Engineering Productivity

DORA Metrics

The four key metrics from the DORA (DevOps Research and Assessment) research:

| Metric | Elite | High | Medium | Low |
|--------|-------|------|--------|-----|
| Deployment Frequency | On-demand (multiple/day) | Weekly-monthly | Monthly-biannually | Less than once per 6 months |
| Lead Time for Changes | Less than 1 hour | 1 day - 1 week | 1 week - 1 month | 1 - 6 months |
| Mean Time to Recover | Less than 1 hour | Less than 1 day | 1 day - 1 week | More than 6 months |
| Change Failure Rate | 0 - 15 percent | 16 - 30 percent | 16 - 30 percent | 16 - 30 percent |
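Three of the four DORA metrics can be derived from deployment records alone. The sketch below assumes a hypothetical record shape (`deployed_at`, `commit_at`, `failed`); mean time to recover is omitted because it needs incident open/resolve timestamps rather than deploy data.

```python
from datetime import datetime, timedelta
from statistics import median

def dora_metrics(deploys, window_days=30):
    """Compute deployment frequency, lead time, and change failure rate.

    Each deploy record is a dict with hypothetical keys:
      "deployed_at" - datetime the deploy reached production
      "commit_at"   - datetime of the oldest commit in the deploy
      "failed"      - bool, did this change cause an incident?
    """
    freq = len(deploys) / window_days  # deployments per day
    lead = median(
        (d["deployed_at"] - d["commit_at"]).total_seconds() / 3600
        for d in deploys
    )  # median lead time, in hours
    cfr = sum(d["failed"] for d in deploys) / len(deploys)
    return {
        "deploys_per_day": freq,
        "lead_time_hours": lead,
        "change_failure_rate": cfr,
    }
```

Feeding this from your deployment tracker once a week gives a trend line, which matters far more than any single snapshot.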

SPACE Framework

A holistic framework for developer productivity (from Microsoft Research and GitHub):

S - Satisfaction and well-being
"How fulfilled are developers with their work?"
P - Performance
"What are the outcomes of the development process?"
A - Activity
"How much output are developers producing?"
C - Communication and collaboration
"How effectively do developers communicate?"
E - Efficiency and flow
"How smoothly can developers get work done?"