Engineering Culture
Engineering culture is the set of shared values, practices, and norms that define how an engineering organization operates. Great culture does not happen by accident — it is intentionally built and continuously maintained by technical leaders at every level. This page covers the essential cultural practices that distinguish high-performing engineering organizations.
Code Review Culture
Code reviews are one of the most impactful practices in software engineering. They catch bugs, spread knowledge, maintain standards, and build team cohesion. But their effectiveness depends entirely on culture.
The Purpose of Code Review
| Purpose | Priority |
|---|---|
| Knowledge sharing | Highest — the primary long-term value |
| Catching bugs and design issues | High — finds problems before production |
| Maintaining consistency | Medium — ensures standards are followed |
| Mentoring | Medium — teaches patterns and practices |
| Documentation | Lower — the PR description serves as a record |
What Good Code Reviews Look Like
Reviewer behaviors:
- Review within 24 hours (ideally same business day)
- Focus on the “what” and “why”, not just the “how”
- Ask questions instead of making demands: “What happens if this fails?” vs “Add error handling”
- Distinguish between blocking issues and suggestions: “Nit:” or “Optional:” prefixes
- Provide positive feedback on good patterns, not just criticism
- Review the design first, then the implementation details
Author behaviors:
- Keep PRs small (under 400 lines of changed code)
- Write a clear description explaining what and why
- Self-review before requesting review
- Respond to all comments, even if just “Done”
- Do not take feedback personally
Code Review Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Rubber stamping | LGTM without reading | Require substantive comments; rotate reviewers |
| Nitpick wars | Debating style preferences | Automate style with linters/formatters |
| Gatekeeping | One person must approve everything | Distribute review responsibility; trust the team |
| PR too large | 2000+ line PRs that nobody reads carefully | Enforce size limits; break work into smaller PRs |
| Slow reviews | PRs waiting days for review | Set SLA (e.g., review within 4 business hours) |
| Review by seniority only | Juniors never review | Everyone reviews; juniors learn by reviewing |
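The "enforce size limits" fix can be automated in CI. A minimal sketch, assuming the pipeline can produce `git diff --numstat` output for the PR; the 400-line budget mirrors the guideline above, and the function names are illustrative:

```python
# Sketch: enforce a PR size budget in CI.
# Input is the text produced by `git diff --numstat`, which emits one
# "added<TAB>deleted<TAB>path" line per file ("-" for binary files).

MAX_CHANGED_LINES = 400  # budget from the team guideline above

def changed_lines(numstat_output: str) -> int:
    """Sum added + deleted lines across all files, skipping binaries."""
    total = 0
    for line in numstat_output.strip().splitlines():
        added, deleted, _path = line.split("\t")
        if added == "-" or deleted == "-":
            continue  # binary file, no line counts
        total += int(added) + int(deleted)
    return total

def check_pr_size(numstat_output: str) -> bool:
    """Return True if the PR is within the size budget."""
    return changed_lines(numstat_output) <= MAX_CHANGED_LINES
```

A CI job would run `git diff --numstat origin/main...HEAD`, feed the output to `check_pr_size`, and fail the build (or post a warning comment) when it returns False.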
Automating the Tedious Parts
| What HUMANS should review | What MACHINES should check |
|---|---|
| Architecture and design | Code formatting (Prettier, Black) |
| Business logic correctness | Linting (ESLint, Pylint) |
| Error handling strategy | Type checking (TypeScript, mypy) |
| Security implications | Test coverage thresholds |
| API contract changes | Dependency vulnerabilities |
| Naming and readability | Build success |
| Performance implications | Performance regression tests |
Blameless Postmortems
When things go wrong in production, the response should be learning, not blame. Blameless postmortems are a cornerstone of high-reliability engineering organizations.
Why Blameless?
- People who fear punishment will hide mistakes and avoid reporting incidents
- The root cause is almost always systemic, not individual
- Blame shuts down the psychological safety needed for honest analysis
- The goal is to fix the system, not to find a scapegoat
Postmortem Template
# Incident Postmortem: Payment Processing Outage
**Date:** 2025-01-15
**Duration:** 2 hours 15 minutes (14:30 - 16:45 UTC)
**Severity:** SEV-1 (customer-facing, revenue impact)
**Author:** On-call engineer + incident commander
**Status:** Action items in progress
## Summary
Payment processing was unavailable for 2 hours 15 minutes due to a database connection pool exhaustion caused by a query regression in the latest deployment.
## Impact
- 12,500 failed payment attempts
- Estimated revenue impact: $450,000
- 340 customer support tickets
- No data loss or corruption
## Timeline (all times UTC)
14:15 - Deploy v2.45.0 to production (includes order query optimization)
14:28 - Monitoring shows database connection count increasing
14:30 - First customer reports of payment failures
14:35 - PagerDuty alert fires for payment error rate > 5%
14:40 - On-call engineer acknowledges, begins investigation
14:55 - Incident escalated to SEV-1, incident commander assigned
15:10 - Root cause identified: new query missing index, holding connections for 30s+ instead of <100ms
15:20 - Decision to rollback deployment
15:35 - Rollback deployed, connection pool draining
16:00 - Connection pool recovered to normal levels
16:30 - Payment success rate back to 99.9%
16:45 - Incident resolved, monitoring confirmed stable
## Root Cause
The "order query optimization" in v2.45.0 changed a query that previously used an index on (customer_id, created_at) to use a full table scan. The query execution time increased from 50ms to 30+ seconds, causing database connections to be held longer than the pool could sustain.
## Contributing Factors
1. No query performance testing in CI/CD pipeline
2. Database migration and query changes reviewed by different people (context gap)
3. Connection pool alerts set at 80% (too close to exhaustion)
4. No query execution time monitoring for individual queries
## What Went Well
- PagerDuty alert fired within 5 minutes of impact
- Team assembled quickly and communicated clearly
- Rollback process worked smoothly
- Customer communication was timely and transparent
## What Went Wrong
- 20-minute gap between deploy and first alert
- Root cause identification took 30 minutes
- No automated rollback trigger for error rate spikes
## Action Items
| Action | Owner | Priority | Due Date |
|--------|-------|----------|----------|
| Add query performance tests to CI | @alice | P0 | 2025-01-22 |
| Lower connection pool alert to 60% | @bob | P0 | 2025-01-17 |
| Add per-query latency monitoring | @charlie | P1 | 2025-01-31 |
| Implement automated rollback on error spike | @dave | P1 | 2025-02-15 |
| Require DB team review for query changes | @alice | P2 | 2025-02-01 |
## Lessons Learned
The deployment pipeline needs to catch performance regressions before they reach production. We are investing in query performance testing and tighter monitoring to prevent similar incidents.
Running an Effective Postmortem Meeting
- Set the tone — “We are here to learn, not to blame. The system failed, not a person.”
- Build the timeline collaboratively — Walk through events chronologically
- Ask “why” five times — Dig into root causes, not surface symptoms
- Focus on systemic fixes — “How do we prevent ANY engineer from making this mistake?”
- Assign action items with owners and deadlines — Postmortems without action items are theater
- Follow up — Track completion of action items; review in the next team meeting
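A recurring action item from incidents like the example above is an automated rollback trigger for error-rate spikes. A minimal sketch of the idea; the class name, 5% threshold, and 100-request window are hypothetical, and a real system would read rates from monitoring rather than individual requests:

```python
# Sketch: fire a rollback when the error rate over a sliding window
# of recent requests exceeds a threshold.
from collections import deque

class RollbackTrigger:
    """Tracks recent request outcomes; fires once the window is full
    and the failure fraction exceeds the threshold."""

    def __init__(self, threshold: float = 0.05, window: int = 100):
        self.threshold = threshold           # e.g. 5% error rate
        self.results = deque(maxlen=window)  # recent outcomes, oldest dropped

    def record(self, success: bool) -> bool:
        """Record one request; return True if a rollback should fire."""
        self.results.append(success)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data to judge yet
        failures = self.results.count(False)
        return failures / len(self.results) > self.threshold
```

The window requirement avoids firing on the first failure after a deploy; the threshold should sit well above the normal background error rate.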
Documentation Culture
Documentation is a force multiplier. One engineer writing a document saves hundreds of hours of questions, onboarding time, and rediscovered knowledge.
Types of Documentation
| Type | Audience | Lifespan | Example |
|---|---|---|---|
| How-to guides | Engineers performing a task | Long | “How to set up the dev environment” |
| Tutorials | New learners | Long | “Building your first feature” |
| Reference | Engineers looking up details | Long | API documentation, configuration reference |
| Explanations | Engineers needing context | Medium | “Why we chose event sourcing” |
| ADRs | Current and future engineers | Permanent | Architecture decisions (see previous page) |
| Runbooks | On-call engineers | Medium | “How to handle a database failover” |
| Onboarding docs | New hires | Medium | “Week 1-4 onboarding plan” |
Documentation Best Practices
- Store docs near the code — In the repo, not a separate wiki that rots
- Treat docs like code — Review in PRs, test links, automate generation
- Write for your audience — A runbook for on-call is different from an API guide
- Keep docs current — Outdated docs are worse than no docs (they mislead)
- Make docs discoverable — Clear naming, good search, table of contents
- Lower the barrier — Templates, examples, and style guides make writing easier
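"Treat docs like code" includes testing links in CI. A minimal sketch of a relative-link check for markdown; the regex is a simplification rather than a full markdown parser, and the `exists` callback is injected so the check can run against a repo checkout, a site map, or a test:

```python
# Sketch: find broken relative links in markdown text.
import re

# Captures the link target inside [text](target), stopping at "#" so
# anchors like (setup.md#intro) resolve to the file path.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)[^)]*\)")

def broken_links(markdown: str, exists) -> list:
    """Return relative link targets for which exists(target) is False.
    External links are skipped; those need an HTTP check instead."""
    broken = []
    for target in LINK_RE.findall(markdown):
        if target.startswith(("http://", "https://", "mailto:")):
            continue
        if not exists(target):
            broken.append(target)
    return broken
```

In CI this would run over every `*.md` file, passing a check like `lambda t: (doc.parent / t).exists()`, and fail the build when any list comes back non-empty.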
The Documentation Quadrant
                   Studying            Working
              (learning-oriented)  (task-oriented)
             ┌─────────────────┬─────────────────┐
 Practical   │ Tutorials       │ How-to Guides   │
             │                 │                 │
             │ "Follow these   │ "Steps to       │
             │  steps to learn │  achieve a      │
             │  this concept"  │  specific goal" │
             ├─────────────────┼─────────────────┤
 Theoretical │ Explanations   │ Reference        │
             │                 │                 │
             │ "Understanding  │ "Technical      │
             │  the background │  specs, API     │
             │  and context"   │  docs, config   │
             │                 │  options"       │
             └─────────────────┴─────────────────┘

(Based on the Divio documentation framework)
Knowledge Sharing
Tech Talks and Presentations
- Lightning talks (5-10 min) — Low barrier, encourage broad participation
- Lunch and learns (30-45 min) — Deeper dives, often with food as incentive
- Architecture reviews (60 min) — Team presents system design for cross-team feedback
- Demo days — Teams showcase what they shipped that sprint/quarter
Guilds and Communities of Practice
Guilds (or chapters, communities of practice) are cross-team groups organized around a shared interest or skill:
            Frontend Guild
            ┌──────────┐
Team A ─────┤          ├───── Team B
(FE dev)    │ Shared   │      (FE dev)
            │ standards│
Team C ─────┤ tooling  ├───── Team D
(FE dev)    │ knowledge│      (FE dev)
            └──────────┘

Examples:
- Frontend Guild: shared component library, CSS standards
- Data Engineering Guild: data pipeline standards, tooling
- Security Guild: security review process, threat modeling
- Testing Guild: testing strategies, framework selection
Internal Knowledge Base Practices
| Practice | Description |
|---|---|
| “Write it once” rule | If you explain something twice, write a document |
| Show-and-tell sessions | Monthly presentations of interesting work |
| Pair programming rotation | Engineers pair across teams to spread knowledge |
| Reading groups | Team reads and discusses a paper or book chapter weekly |
| Internal blog | Engineers share learnings, post-project reflections |
| Decision logs | Searchable archive of all ADRs and RFCs |
Hiring and Onboarding
Hiring for Culture Fit vs Culture Add
| Culture Fit | Culture Add |
|---|---|
| “Will this person get along with us?” | “What unique perspective does this person bring?” |
| Risk: homogeneous teams, groupthink | Benefit: diverse thinking, innovation |
| Check: shared values and work ethic | Check: complementary skills and viewpoints |
Effective Technical Interviews
| Interview Type | What It Assesses | Duration |
|---|---|---|
| Coding (live) | Problem-solving, communication, code quality | 45-60 min |
| System design | Architecture skills, trade-off analysis | 45-60 min |
| Take-home project | Real-world coding in a realistic setting | 2-4 hours |
| Code review | How they evaluate others’ code, communication | 30-45 min |
| Behavioral | Collaboration, conflict resolution, values | 30-45 min |
Onboarding Checklist
Week 1: Getting Started
[ ] Development environment setup (automated, documented)
[ ] Access to all tools (source control, CI/CD, monitoring, chat)
[ ] Meet the team (1:1s with each team member)
[ ] Read team charter, ADRs, and architecture docs
[ ] First "good first issue" assigned
[ ] Buddy/mentor assigned

Week 2-3: Contributing
[ ] First PR merged
[ ] Participate in code review (both giving and receiving)
[ ] Attend sprint ceremonies
[ ] Understand deployment process
[ ] Shadow on-call rotation

Week 4-6: Owning
[ ] Complete a small feature independently
[ ] Present at team standup or demo
[ ] Understand at least 3 team services
[ ] Give feedback on onboarding process (improve it!)

Month 2-3: Growing
[ ] Own a medium-sized project
[ ] Participate in design review
[ ] Identify one area for improvement and propose a fix
[ ] 30/60/90 day check-in with manager
Developer Experience (DevEx)
Developer experience is the quality of the tools, processes, and environment that developers interact with daily. Great DevEx reduces friction and lets engineers focus on solving problems.
The Three Dimensions of DevEx
┌───────────────────────────────────────────────────┐
│                Developer Experience               │
├──────────────┬──────────────┬─────────────────────┤
│              │              │                     │
│ Feedback     │ Cognitive    │ Flow State          │
│ Loops        │ Load         │                     │
│              │              │                     │
│ How fast     │ How much     │ How often can       │
│ do I get     │ mental       │ developers get      │
│ results?     │ effort is    │ into and stay       │
│              │ required?    │ in flow?            │
│              │              │                     │
│ • CI time    │ • Docs       │ • Interruptions     │
│ • Build time │ • Code       │ • Meeting load      │
│ • Deploy     │   complexity │ • Context switching │
│   time       │ • Tool       │ • Autonomy          │
│ • Test time  │   usability  │ • Clear goals       │
│              │              │                     │
└──────────────┴──────────────┴─────────────────────┘
Measuring Developer Experience
| Metric | Target | How to Measure |
|---|---|---|
| Build time | Under 2 minutes (local) | CI metrics, developer surveys |
| CI pipeline duration | Under 10 minutes | CI/CD dashboard |
| Deploy frequency | Multiple times per day | Deployment tracking |
| Time to first PR (new hire) | Under 1 week | Onboarding metrics |
| Developer satisfaction | Quarterly survey score | Anonymous surveys |
| Toil percentage | Under 20 percent | Time tracking, surveys |
DevEx Improvements with High ROI
| Improvement | Impact |
|---|---|
| Automate dev environment setup | Saves days per new hire |
| Fast, reliable CI/CD | Reduces context switching from waiting |
| Good error messages | Reduces debugging time |
| Self-service infrastructure | Removes bottleneck on platform team |
| Pre-configured linters and formatters | Eliminates style debates in code review |
| Comprehensive local development docs | Reduces “it works on my machine” problems |
| Fast, incremental builds | Tighter feedback loop during development |
| Feature flags | Enables decoupling deploy from release |
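The feature-flag row deserves a concrete illustration of how deploy gets decoupled from release. A minimal sketch with an in-memory flag store and percentage rollouts; the flag names and the dict-based store are hypothetical, and production systems back this with a config service:

```python
# Sketch: deterministic percentage rollout for feature flags.
import hashlib

FLAGS = {
    "new_checkout": {"enabled": True, "rollout_pct": 25},
    "dark_mode":    {"enabled": False, "rollout_pct": 0},
}

def is_enabled(flag: str, user_id: str) -> bool:
    """Bucket a user 0-99 and enable the flag below the rollout cutoff."""
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False
    # Hash flag+user so each flag rolls out to an independent user set,
    # and each user gets a stable answer across requests.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < cfg["rollout_pct"]
```

Code behind a disabled flag can ship to production safely; release becomes a config change (raising `rollout_pct`) instead of a deploy, and rollback becomes flipping `enabled` off.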
Measuring Engineering Productivity
DORA Metrics
The four key metrics from the DORA (DevOps Research and Assessment) research:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | On-demand (multiple/day) | Weekly-monthly | Monthly-biannually | Less than once per 6 months |
| Lead Time for Changes | Less than 1 hour | 1 day - 1 week | 1 week - 1 month | 1 - 6 months |
| Mean Time to Recover | Less than 1 hour | Less than 1 day | 1 day - 1 week | More than 6 months |
| Change Failure Rate | 0 - 15 percent | 16 - 30 percent | 16 - 30 percent | 16 - 30 percent |
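Two of these metrics fall directly out of a deployment log. A minimal sketch, assuming a hypothetical record format with timestamps in hours and a per-deploy failure flag; real pipelines pull this from the deploy system and incident tracker:

```python
# Sketch: compute lead time for changes and change failure rate
# from a list of deployment records.
from dataclasses import dataclass

@dataclass
class Deploy:
    commit_time: float  # hours, when the change was committed
    deploy_time: float  # hours, when it reached production
    failed: bool        # did this deploy cause a production failure?

def lead_time_hours(deploys: list) -> float:
    """Mean lead time for changes, commit to production."""
    return sum(d.deploy_time - d.commit_time for d in deploys) / len(deploys)

def change_failure_rate(deploys: list) -> float:
    """Fraction of deploys that caused a production failure."""
    return sum(d.failed for d in deploys) / len(deploys)
```

Deployment frequency is a simple count per time period from the same log; time to recover additionally needs incident open/close timestamps.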
SPACE Framework
A holistic framework for developer productivity (from Microsoft Research and GitHub):
S - Satisfaction and well-being "How fulfilled are developers with their work?"
P - Performance "What are the outcomes of the development process?"
A - Activity "How much output are developers producing?"
C - Communication and collaboration "How effectively do developers communicate?"
E - Efficiency and flow "How smoothly can developers get work done?"