What is Rate Limiting?

Rate limiting is the practice of capping how many requests a single client can make to an API within a time window — say 100 requests per minute — so that one runaway script, abusive actor, or buggy integration cannot overwhelm your servers or starve everyone else of capacity. When a client exceeds the cap, the API returns HTTP 429 and tells it when to try again.

Why every real API needs it

An API with no limits is one bad client away from an outage. A retry loop with no backoff, a scraper, a credential-stuffing attack, or a single customer who scripts against you too aggressively can consume all your capacity and take the service down for everyone. Rate limiting is the seatbelt: it protects the backend, enforces fair use across tenants, contains the blast radius of abuse, and — on metered plans — is how usage tiers get enforced in the first place.

The common algorithms

A few patterns dominate. Fixed window counts requests per calendar window (per minute, per hour). It is simple, but a client that spends its full quota just before the boundary and again just after it can send up to twice the limit in a short span. Sliding window smooths that edge by weighting the previous window. Token bucket adds tokens at a steady rate up to a cap and spends one per request, allowing controlled bursts while holding a long-run average. Leaky bucket drains requests at a constant rate, queuing or dropping the overflow. Token bucket is a common default because real traffic is bursty.
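As an illustration of the token bucket described above, here is a minimal single-process sketch. The class name, the `rate` and `capacity` parameters, and the injectable `clock` are assumptions made for this example, not part of any particular library.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # long-run average, in requests per second
        self.capacity = capacity      # largest burst the bucket will absorb
        self.clock = clock            # injectable so the logic can be tested
        self.tokens = float(capacity) # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Credit the tokens earned since the last call, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would answer with HTTP 429 here
```

A bucket with `rate=1.0` and `capacity=3` lets three requests through back to back, rejects the fourth, and then admits roughly one more per second. This sketch is not thread-safe and keeps its state in one process.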

429 and the headers that matter

When a client trips the limit, the correct response is HTTP 429 Too Many Requests, accompanied by a Retry-After header so the caller knows how long to wait. Good APIs also surface the limit, remaining quota, and reset time (for example via the widely used X-RateLimit-* headers or the RateLimit header fields proposed in an IETF draft) so well-behaved clients can self-pace. Pair this with idempotency keys so that retries after a 429 do not accidentally double-charge or duplicate work.
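On the client side, honoring Retry-After means handling both forms the header can take: a number of seconds or an HTTP date. The helper below is a sketch under that assumption; the function name and the `default` fallback are hypothetical choices for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=1.0):
    """Convert a Retry-After value (delta-seconds or HTTP-date) to a wait in seconds."""
    if header_value is None:
        return default  # header absent: fall back to the caller's own backoff
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable value: do not guess
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

A caller would sleep for the returned number of seconds before retrying, ideally resending the same idempotency key so the retried request is not processed twice.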

Where to enforce it

The earlier you reject abusive traffic, the less it costs you. That is why rate limiting commonly lives at the edge (in an API gateway, a CDN, or a reverse proxy) so floods are stopped before they ever touch application servers. The hard part is doing it across a fleet: if you run many instances behind a load balancer, each one only sees part of the traffic, so distributed limits usually keep their counters in a shared fast store such as Redis.
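The shared-counter idea can be sketched as a fixed-window check in which every instance increments the same key. The code assumes a store object exposing `incr(key)` and `expire(key, seconds)`, which is the shape of the redis-py client's methods of the same names; `InMemoryStore` is a hypothetical stand-in so the example runs without a server, and the key format is an assumption.

```python
import time


class InMemoryStore:
    """Stand-in for a shared store; mimics the incr and expire calls used below."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def expire(self, key, seconds):
        # A real store would delete the key after `seconds`; omitted in this stand-in.
        return True


def allow_request(store, client_id, limit, window_seconds, now=None):
    """Fixed-window check: all instances increment one shared counter per window."""
    now = time.time() if now is None else now
    window = int(now // window_seconds)          # index of the current window
    key = f"ratelimit:{client_id}:{window}"      # hypothetical key layout
    count = store.incr(key)
    if count == 1:
        # First hit in this window: let the key expire once the window is over.
        store.expire(key, window_seconds)
    return count <= limit
```

Because the window index is part of the key, a counter from an old window is never reused even if its expiry was never set. A production version would typically make the increment and expiry atomic, for example with a server-side script.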

Identity, fairness, and security

What you key the limit on matters. IP-based limits are easy but unfair behind shared NAT and easy to evade with rotating addresses. API-key or account-based limits are fairer and harder to dodge. Sensitive endpoints — login, password reset, signup — deserve stricter, separate limits, because rate limiting is a frontline defense against brute-force and credential attacks, not just a capacity tool. The right scheme balances protection against accidentally blocking legitimate users.
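One way to express those keying rules is a small function that picks the identity and the quota for each request. Everything here is illustrative: the route names, the fallback to the client IP, and the limits of 100 and 5 requests per minute are assumptions for the sketch.

```python
# Hypothetical auth-sensitive routes that get their own, stricter limit.
SENSITIVE_PATHS = {"/login", "/password-reset", "/signup"}

DEFAULT_PER_MINUTE = 100   # assumed general quota
SENSITIVE_PER_MINUTE = 5   # assumed quota for auth-sensitive routes


def limit_for(path, api_key, client_ip):
    """Return (bucket_key, requests_per_minute) for a request."""
    # Prefer the API key: fairer behind shared NAT and harder to evade than an IP.
    identity = f"key:{api_key}" if api_key else f"ip:{client_ip}"
    if path in SENSITIVE_PATHS:
        # Separate counter per sensitive route so it cannot borrow general quota.
        return f"{identity}:{path}", SENSITIVE_PER_MINUTE
    return identity, DEFAULT_PER_MINUTE
```

The returned key would then be passed to whichever limiter enforces the quota, so that login attempts are counted apart from ordinary API traffic.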

At QUANT LAB

We treat rate limiting as a default, not an afterthought. Our API development work ships token-bucket limits keyed to API keys, honest 429 responses with Retry-After, standard rate-limit headers, and a Redis-backed counter so the limits hold across every instance behind the gateway. For auth-sensitive routes we add tighter, dedicated limits. The aim is a system that shrugs off abuse without punishing the customers who are using it correctly.

Hardening an API against abuse?

We design rate limiting that holds across a fleet and protects auth endpoints without blocking real users. Book a 30-minute call.
