
Case Study

Reliability Project: Improving HOSP

This project tackled a hard reliability constraint: improve availability on a fragile system we could not rewrite. Working in team STABL, I focused on traffic evidence analysis and ALB rule strategy to reduce backend crashes while keeping failure reporting honest.

STABL team · ALB listener rules · ALB logs and Athena · Honest reliability
Figure: reliability dashboard showing success rates by host and image screening failures.

We used a control-panel dashboard to track request success rates and investigate failures. By filtering garbage traffic at the ALB and keeping failure reporting honest, we pushed the STABL environment to 99.99% success.

Context and goal

STABL sits in front of an existing application called HOSP. We did not control the application code, and we could not change upstream behaviour.

Reliability dropped sharply during experiments, so the goal was to maximise honest reliability under strict constraints: 500 errors were not acceptable, while 401 and 404 responses were acceptable when correct. We did not return fake 200 responses, and we did not mask failures.

Core question: “How do we reduce crashes and improve availability without lying about system health?”

Challenge and solution

One challenge was an early CloudFront/WAF route that caused a spike in 401/403 responses. We solved it with a full rollback and a simpler ALB allow-list + fixed-404 strategy, which stabilised traffic and drove honest reliability to 99.99%.

Baseline architecture

At the start, all traffic went straight through the Application Load Balancer to a single HOSP instance. There was no traffic filtering, so probes and random paths could hit a fragile backend and trigger crashes.

Clients
  ↓
Application Load Balancer (HTTP :80)
  ↓
Single application instance (HOSP)

Why it was failing

  • Garbage traffic reached the backend.
  • Unhandled paths triggered 500 crashes.
  • Limited capacity amplified failures under load.
  • Failures cascaded when the app crashed repeatedly.

Evidence-driven investigation

We used ALB access logs and Athena to understand what was failing before changing anything. The goal was to separate real backend failures from noise and infrastructure issues.

What we proved with Athena

  • 500 errors were real backend failures (elb_status_code = 500 and target_status_code = 500).
  • The worst routes included /notes, /patients, and /patients/*/screen across GET and POST.
  • Both fast failures and slow failures existed, which pointed to fragile logic plus expensive work.
  • Unknown paths and probes were hitting the backend and inflating failures.
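The core check behind the first finding can be sketched in code. The Athena SQL we actually ran is not reproduced here; this pure-Python version applies the same logic to a few hypothetical parsed log rows, using the standard ALB access log field names `elb_status_code` and `target_status_code`.

```python
from collections import Counter

# Hypothetical sample of parsed ALB access log rows; field names follow the
# standard ALB access log schema, the values are illustrative only.
rows = [
    {"method": "POST", "path": "/patients", "elb_status_code": 500, "target_status_code": 500},
    {"method": "GET",  "path": "/notes",    "elb_status_code": 500, "target_status_code": 500},
    {"method": "GET",  "path": "/.env",     "elb_status_code": 404, "target_status_code": 404},
    {"method": "POST", "path": "/patients", "elb_status_code": 500, "target_status_code": 500},
    {"method": "GET",  "path": "/hospitals","elb_status_code": 200, "target_status_code": 200},
]

def real_backend_failures(rows):
    """A 500 reported by both the ELB and the target is a real backend
    failure, not an ALB or routing problem (contrast a 502/504 where the
    target never produced a response)."""
    return [r for r in rows
            if r["elb_status_code"] == 500 and r["target_status_code"] == 500]

# Rank the worst routes by failure count, as the investigation did.
failures = real_backend_failures(rows)
worst_routes = Counter((r["method"], r["path"]) for r in failures)
print(worst_routes.most_common())
```

Ranking `(method, path)` pairs this way is what surfaced `/notes`, `/patients`, and the screening routes as the worst offenders.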

Key conclusion

The system was not failing because of the ALB or routing. It was failing because uncontrolled traffic was reaching a fragile single instance backend.

Early experiment and rollback

We tried putting CloudFront in front of the ALB and locking the ALB down with WAF-style enforcement. This caused a large spike in 401 and 403 responses and tanked reliability, because health checks and internal calls were blocked.

The lesson was simple: observability comes before enforcement. We rolled back fully and returned to a known good baseline.

Final solution: ALB allow-list and fixed 404

The design principle was that only valid application paths should reach the backend. Everything else is garbage and should be rejected early.

Allow-list rule

Forward only known routes to the backend target group.

  • /hospitals*
  • /patients*
  • /staffs*
  • /notes*

Default rule

Reject everything else with a fixed response. This stops probes and unknown paths from wasting backend capacity or triggering crashes.

Status: 404
Body: {"message":"Not Found"}
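The combined behaviour of the allow-list rule and the default rule can be simulated in a few lines. The real rules live on the ALB listener; this sketch only mirrors their logic, using the four route prefixes and the fixed response from this case study.

```python
# Path prefixes forwarded to the backend target group (from the allow-list rule).
ALLOWED_PREFIXES = ("/hospitals", "/patients", "/staffs", "/notes")

def route(path: str):
    """Forward allow-listed routes; everything else gets the fixed 404."""
    if path.startswith(ALLOWED_PREFIXES):
        return ("forward", path)
    # Default rule: fixed response, so probes and unknown paths never
    # consume backend capacity or trigger crashes.
    return ("fixed-response", 404, '{"message":"Not Found"}')

print(route("/patients/7/screen"))  # forwarded to the target group
print(route("/.git/config"))        # rejected at the edge
```

Note that prefix matching is what makes `/patients*` cover nested paths like `/patients/7/screen` without enumerating them.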

Why this improves reliability honestly

We did not mask failures. We made it harder for invalid traffic to crash a fragile backend, and we kept HTTP semantics correct.

What we avoided

  • No fake 200 responses.
  • No hiding crashes with redirects.
  • No blindly retrying until it looks green.
  • No masking backend bugs.

What we improved

  • Reduced load by blocking garbage early.
  • Kept 404 and 401 meaningful when correct.
  • Ensured remaining 500s were real failures.
  • Made behaviour deterministic and easier to explain.

Handling traffic increases with one instance

We could not autoscale and we could not add a second application instance, so the safest lever was controlling what traffic reaches the backend.

  • Remove garbage traffic at the ALB.
  • Keep health checks clean and reliable.
  • Shepherd change carefully with clear metrics.
  • Emergency option: return 503 on one problematic endpoint to protect the whole system.
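The emergency option relies on ALB rules being evaluated in priority order (lowest number first), so a fixed 503 for one endpoint can be slotted in ahead of the allow-list forward rules without touching them. The evaluation loop below is a simulation, and `/notes` as the problematic endpoint is a hypothetical choice for illustration.

```python
rules = [
    # (priority, path prefix, action) — lower priority number wins first.
    (10,  "/notes",     ("fixed-response", 503)),  # hypothetical emergency rule
    (100, "/hospitals", ("forward",)),
    (110, "/patients",  ("forward",)),
    (120, "/staffs",    ("forward",)),
    (130, "/notes",     ("forward",)),
]
DEFAULT = ("fixed-response", 404)  # the fixed-404 default rule

def evaluate(path: str):
    """Return the action of the first matching rule in priority order."""
    for _priority, prefix, action in sorted(rules):
        if path.startswith(prefix):
            return action
    return DEFAULT

print(evaluate("/notes"))     # shielded endpoint: 503 protects the backend
print(evaluate("/patients"))  # other routes still forward normally
```

Because the 503 rule sits at a higher priority than the forward rule for the same prefix, it shadows only that endpoint; deleting it restores normal routing with no other changes.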

Key takeaways

  • Reliability is about controlling failure, not hiding it.
  • Edge filtering is a valid reliability tool.
  • 500s should be treated as bugs.
  • Simpler routing beats clever routing.
  • Observability comes before enforcement.
  • A clean rollback is a success.

One-sentence summary: rather than masking failures, we used ALB allow-listing and traffic control to reduce backend crashes, so reliability improvements reflected real system health.