Personal Project
Reliability Layer API
I built this backend/platform project to solve a common production problem: reliability logic duplicated across services and handled inconsistently. Instead of scattering timeout, retry, breaker, cache, and metrics behaviour across callers, I centralised these policies into one shared gateway with observable outcomes.
Why I built it
Real systems rarely fail in a clean binary way. They time out, slow down, return intermittent 5xx responses, and occasionally throttle with 429s. Without strong client behaviour, those upstream issues leak directly into your own system and can even get amplified by bad retry patterns.
I wanted a portfolio project that looked like something a platform team could actually own. That meant focusing less on a polished frontend and more on production-grade HTTP client behaviour, operational visibility, and failure-mode handling.
This project is intentionally backend-first: the proof is in the request pipeline, tests, metrics, dashboards, and Dockerised runtime, not in a marketing UI.
Challenge and solution
One challenge was preventing retry storms when upstream services degraded. I solved it by capping retries with jitter, enforcing idempotency rules, and opening circuit breakers on repeated failures, which kept degraded behaviour controlled and observable.
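The capped, jittered backoff described above can be sketched in a few lines; the function name and defaults here are illustrative, not the project's actual code:

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Exponential backoff with full jitter: the delay ceiling doubles
    with each attempt but the actual wait is randomised, so retries from
    many callers do not synchronise into a retry storm."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (randomising over the whole interval rather than adding a small offset) is what spreads retries out; the cap keeps worst-case waits bounded.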
The service sits between internal callers and upstream APIs, applying authentication, rate limiting, allowlisted routing, cache policy, circuit breaking, retry rules, and metrics in one place.
Strict timeout policy
Every outbound call uses explicit connect, read, write, and pool timeouts through a shared HTTPX client. That stops hanging requests from silently consuming capacity.
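A shared client configured this way might look like the following; the specific values are illustrative:

```python
import httpx

# Explicit per-phase timeouts: no call can hang indefinitely in any
# phase (connecting, reading, writing, or waiting for a pool slot).
timeout = httpx.Timeout(connect=2.0, read=5.0, write=5.0, pool=1.0)
client = httpx.AsyncClient(timeout=timeout)
```

Because every phase has its own bound, a slow upstream surfaces as a fast, typed error instead of an invisible stalled worker.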
Bounded retries only
Retries are capped, use exponential backoff plus jitter, and are restricted for non-idempotent methods unless an idempotency key is present.
Fail fast when needed
Circuit breakers open after repeated failures, allowing the service to fail fast or serve stale cache instead of repeatedly hammering a bad upstream.
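A breaker with this behaviour reduces to a small state machine; the sketch below (thresholds and names assumed, not taken from the project) shows the core transitions:

```python
import time

class CircuitBreaker:
    """Minimal breaker: opens after `threshold` consecutive failures,
    then permits a trial call once `cooldown` seconds have elapsed
    (the half-open state); any success fully closes it again."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # open: only allow a probe once the cooldown has passed
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

While open, callers get an immediate refusal (or stale cache) instead of burning a timeout per request against a dead upstream.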
Request flow and degraded behaviour
The flow is deliberate: authenticate, rate limit, route only to allowlisted upstreams, check fresh cache, enforce breaker state, make the timed upstream call, then either store a success response, serve stale on error, or return a controlled failure.
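The ordering above can be sketched as a single handler; every `deps.*` hook here is a hypothetical stand-in for the real component, shown only to make the sequencing concrete:

```python
def handle(request, deps):
    """Illustrative pipeline ordering: cheap local checks first, the
    expensive upstream call last, and stale cache as the degraded path."""
    deps.authenticate(request)
    deps.rate_limit(request)
    upstream = deps.resolve_allowlisted(request)  # reject unknown hosts
    if (cached := deps.cache.get_fresh(request)) is not None:
        return cached                             # fresh hit: no upstream call
    if not deps.breaker.allow(upstream):
        return deps.cache.get_stale(request) or deps.fail_fast(upstream)
    try:
        response = deps.call_with_timeouts(upstream, request)
    except deps.UpstreamError:
        return deps.cache.get_stale(request) or deps.fail_fast(upstream)
    deps.cache.store(request, response)
    return response
```

The key property is that the two degraded exits (breaker open, upstream error) converge on the same stale-or-controlled-failure path, so callers see consistent behaviour either way.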
What the implementation includes
- Allowlisted proxy routes for `GET` and `POST`.
- Safe header forwarding instead of arbitrary pass-through.
- Redis-backed cache and rate limiting with local fallbacks.
- Structured logs to stdout and a Prometheus `/metrics` endpoint.
- Docker Compose stack for API, Redis, Prometheus, Grafana, and an upstream simulator.
- Health endpoints, docs, dashboards, and CI-ready test coverage.
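The local fallback for rate limiting mentioned above could be as simple as an in-process fixed-window counter; this sketch assumes a hypothetical shape, not the project's actual implementation:

```python
import time
from collections import defaultdict

class LocalRateLimiter:
    """In-process fixed-window fallback for when Redis is unreachable.
    Each (key, window) bucket counts requests; the shared Redis counter
    would normally be preferred so limits hold across replicas."""

    def __init__(self, limit: int, window: float = 1.0):
        self.limit = limit
        self.window = window
        self.counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str) -> bool:
        bucket = (key, int(time.monotonic() / self.window))
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit
```

A local fallback is deliberately lossy (each replica enforces the limit independently), but it degrades gracefully rather than failing open or closed when Redis is down.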
Why it is a strong portfolio project
- It shows backend engineering beyond CRUD by focusing on resilience under partial failure.
- It demonstrates operational thinking through metrics, dashboards, health signalling, and failure-mode tests.
- It combines application code with platform concerns like Docker, Redis, Prometheus, and Grafana.
- It gives me a concrete way to talk about timeouts, retries, load protection, observability, and production readiness.
How it can be explored
The project exposes FastAPI docs at `/docs`, health endpoints at `/health/live` and `/health/ready`, metrics at `/metrics`, and a local Prometheus plus Grafana stack through Docker Compose.
docker compose -f deployments/docker/docker-compose.yml up --build
How it is verified
The important proof is behavioural: unit tests cover retry decisions, cache key determinism, breaker state transitions, and rate-limit atomicity, while integration tests simulate timeouts, upstream 500s, stale fallback, and 429 behaviour.
./.venv/bin/python -m pytest -q
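As one example of the cache-key determinism tests mentioned above, a property worth asserting is that query-parameter order never changes the key; the helper here is illustrative, not the project's actual function:

```python
import hashlib
from urllib.parse import urlencode

def cache_key(method: str, url: str, params: dict) -> str:
    """Deterministic cache key: upper-case the method and sort query
    params so equivalent requests always hash to the same key."""
    canonical = f"{method.upper()} {url}?{urlencode(sorted(params.items()))}"
    return hashlib.sha256(canonical.encode()).hexdigest()

def test_cache_key_is_param_order_independent():
    a = cache_key("get", "https://example.test/items", {"b": "2", "a": "1"})
    b = cache_key("GET", "https://example.test/items", {"a": "1", "b": "2"})
    assert a == b
```

Non-deterministic keys silently halve the hit rate, so this is the kind of small invariant that pays to pin down in a unit test.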
What I would build next
The next step would be turning this from a strong local platform project into a stronger deployment story: ECS Fargate or another container platform, shared breaker state across replicas, alerting rules, and a small amount of production-style infrastructure as code.
In one sentence: this project shows how I think about backend systems when dependencies are unreliable, not just how I build happy-path application features.