Systems Design
Two practical system design cases with constraints, measurable targets, trade-offs, and implementation decisions.
Pull Request Review Queue
Problem: Engineers needed fast feedback on pull requests, but synchronous checks were slowing merges.
Impact Snapshot
Feedback SLA: < 2 minutes average
Durability goal: 0 dropped jobs during spikes
Auditability: 100% traceable review comments
Architecture Flow
Git webhooks → Job queue → Worker pool → Rules + model service → Review results store
Requirements / Constraints
- Keep average feedback time under 2 minutes.
- Handle release-day spikes without dropping jobs.
- Keep a clear audit trail for every review comment.
Role & Decisions
- Designed queue topology and worker concurrency boundaries.
- Introduced repository-priority lanes to reduce merge-day tail latency.
- Defined retry strategy for oversized diffs to protect feedback SLA.
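The priority lanes and retry handling described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the production design: the lane names, the `ReviewQueue` class, the attempt limit, and the choice to demote retries to a low-priority lane are all assumptions made for the example.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical lanes; a lower number is served first.
LANE_PRIORITY = {"release": 0, "default": 1, "bulk": 2}


@dataclass(order=True)
class Job:
    priority: int
    seq: int  # tie-breaker that keeps FIFO order inside a lane
    repo: str = field(compare=False)
    diff_lines: int = field(compare=False)
    attempts: int = field(default=0, compare=False)


class ReviewQueue:
    """In-memory stand-in for a durable job queue with priority lanes."""

    def __init__(self, max_attempts=3):
        self._heap = []
        self._seq = itertools.count()
        self.max_attempts = max_attempts
        self.dead_letter = []  # exhausted jobs are parked here, never dropped

    def enqueue(self, repo, diff_lines, lane="default", attempts=0):
        job = Job(LANE_PRIORITY[lane], next(self._seq), repo, diff_lines, attempts)
        heapq.heappush(self._heap, job)
        return job

    def dequeue(self):
        return heapq.heappop(self._heap) if self._heap else None

    def retry(self, job):
        """Requeue a failed job in the bulk lane so it cannot delay fresh work.

        Returns False when the job has used up its attempts and was parked.
        """
        if job.attempts + 1 >= self.max_attempts:
            self.dead_letter.append(job)
            return False
        self.enqueue(job.repo, job.diff_lines, lane="bulk", attempts=job.attempts + 1)
        return True
```

Demoting retries is one way to keep a failing oversized diff from competing with first-attempt jobs for the feedback SLA. A real deployment would back this with a durable broker rather than a process-local heap.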
Key Trade-offs
- Queueing improved reliability, but feedback became asynchronous, so users lost instant responses.
- Shared worker pools improved utilization, but noisy repos degraded tail latency for everyone else.
Notes
- Most delays came from retries on very large diffs, not model latency.
- Separate queues per repository priority reduced merge-day complaints.
Future Improvements
- Add diff-size-based routing to dedicated workers.
- Cache repeated review hints for common file patterns.
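Both improvements could look something like the sketch below. The line threshold, pool names, glob patterns, and the `HintCache` class are hypothetical placeholders chosen for illustration; real values would come from measured diff sizes and observed file patterns.

```python
from fnmatch import fnmatch

LARGE_DIFF_LINES = 2_000  # hypothetical cut-off for "oversized" diffs
HINT_PATTERNS = ["*.lock", "migrations/*.sql", "*.md"]  # hypothetical patterns


def pick_pool(changed_lines):
    """Send oversized diffs to dedicated workers, everything else to the shared pool."""
    return "large-diff" if changed_lines >= LARGE_DIFF_LINES else "shared"


class HintCache:
    """Caches one review hint per known file pattern."""

    def __init__(self, compute):
        self._compute = compute  # callable: key -> hint text
        self._hints = {}
        self.misses = 0

    def hint_for(self, path):
        pattern = next((p for p in HINT_PATTERNS if fnmatch(path, p)), None)
        if pattern is None:
            # Uncommon file: compute a path-specific hint and do not cache it.
            return self._compute(path)
        if pattern not in self._hints:
            self.misses += 1
            self._hints[pattern] = self._compute(pattern)
        return self._hints[pattern]
```

For example, `HintCache(lambda key: f"hint:{key}")` computes the hint for `*.lock` once and reuses it for both `yarn.lock` and `Cargo.lock`.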
Tenant Analytics Pipeline
Problem: Product and support teams needed daily reports plus near real-time usage visibility per tenant.
Impact Snapshot
Dashboard freshness target: < 1 hour
Storage strategy: low-cost raw event retention
Isolation requirement: tenant-level boundaries by design
Architecture Flow
Client events → Ingestion API → Stream bus → ETL jobs → Warehouse marts → Dashboards
Requirements / Constraints
- Enforce strict tenant-level data isolation.
- Keep long-term storage for raw events cheap.
- Keep dashboard freshness within one hour.
Role & Decisions
- Designed ingestion-to-mart flow for daily and near real-time reporting.
- Set partition and processing profiles to avoid tenant cost imbalance.
- Added schema-governance guardrails to reduce dashboard breakage risk.
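One possible shape for a schema-governance guardrail is a check at the ingestion boundary that reports every deviation from an agreed event shape. The field names and types in `EXPECTED_SCHEMA` are assumptions for this sketch, not the pipeline's actual contract.

```python
# Hypothetical event contract: field name -> required Python type.
EXPECTED_SCHEMA = {"tenant_id": str, "event": str, "ts": float}


def validate_event(event):
    """Return a list of human-readable schema problems (empty when the event conforms)."""
    problems = []
    for name, expected in EXPECTED_SCHEMA.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            actual = type(event[name]).__name__
            problems.append(
                f"wrong type for {name}: expected {expected.__name__}, got {actual}"
            )
    # Unknown fields are reported too, since silent additions are a common form of drift.
    for name in sorted(event.keys() - EXPECTED_SCHEMA.keys()):
        problems.append(f"unexpected field: {name}")
    return problems
```

Returning all problems at once, rather than raising on the first, makes the output usable both as an ingestion filter and as a report for the producing team.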
Key Trade-offs
- Shared stream topics cut cost, but partition mistakes became risky.
- Hourly batch jobs were simple, but incident analysis lagged behind live events.
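To illustrate how the partition risk on shared topics might be contained, the sketch below derives the partition from the tenant ID alone and refuses to publish events that lack one. The partition count, the `partition_for` helper, and the error type are assumptions for the example.

```python
import hashlib

NUM_PARTITIONS = 12  # hypothetical partition count for the shared topic


class MissingTenantError(ValueError):
    """Raised when an event cannot be attributed to a tenant."""


def partition_for(event, num_partitions=NUM_PARTITIONS):
    """Map an event to a partition using only its tenant_id.

    A stable hash (not Python's per-process salted hash()) keeps the mapping
    identical across producers and restarts.
    """
    tenant_id = event.get("tenant_id")
    if not tenant_id:
        raise MissingTenantError("event has no tenant_id; refusing to publish")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Several tenants can still share a partition under this scheme, so downstream jobs would need to filter by `tenant_id` as well. The helper only guarantees that one tenant's events never scatter unpredictably.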
Notes
- Schema drift broke charts more often than pipeline failures.
- Small tenants overpaid when using the same processing profile as large tenants.
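The second note suggests sizing the processing profile to tenant volume. A minimal sketch of that idea follows; the tier thresholds, worker counts, and batch intervals are invented for illustration, with every interval kept under the one-hour freshness target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    workers: int
    batch_minutes: int


# Hypothetical tiers as (minimum daily events, profile), largest first.
TIERS = [
    (10_000_000, Profile("large", workers=8, batch_minutes=15)),
    (500_000, Profile("medium", workers=2, batch_minutes=30)),
    (0, Profile("small", workers=1, batch_minutes=45)),
]


def profile_for(daily_events):
    """Pick the first tier whose floor the tenant's daily volume reaches."""
    for floor, profile in TIERS:
        if daily_events >= floor:
            return profile
    raise ValueError("daily_events must be non-negative")
```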
Future Improvements
- Move top KPIs to incremental processing.
- Add producer contract tests to catch schema drift early.
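Incremental processing for a top KPI could take roughly this form: keep a running aggregate plus a watermark, and fold in only events newer than the watermark instead of recomputing the full batch. The `IncrementalEventCount` class and the event fields are assumptions for this sketch.

```python
from collections import Counter


class IncrementalEventCount:
    """Running per-tenant event count that only processes unseen events."""

    def __init__(self):
        self.counts = Counter()
        self.watermark = 0.0  # timestamp of the newest event already counted

    def apply(self, events):
        """Fold in events newer than the watermark; return how many were applied."""
        fresh = [e for e in events if e["ts"] > self.watermark]
        for e in fresh:
            self.counts[e["tenant_id"]] += 1
        if fresh:
            self.watermark = max(e["ts"] for e in fresh)
        return len(fresh)
```

A plain watermark ignores events that arrive late with a timestamp at or below it, so a production version would need an allowed-lateness window or a periodic batch reconciliation.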