API Engineering · 2026
API Rate Limiting Strategies: A 2026 Engineering Guide
Rate limiting is the cheapest insurance your API can buy: it caps your cloud bill, blunts abuse, and keeps one noisy client from degrading everyone else. This guide compares the algorithms, the distributed designs, and the HTTP details that make it work in production.

Quick answer
Use a token bucket as your default algorithm — it tolerates real bursts while enforcing a steady average. Keep limiter state in a shared store like Redis with atomic increments so it holds across multiple servers. Limit per authenticated identity and per IP, pair short-term rate limits with longer-term per-tenant quotas, and return 429 with Retry-After and RateLimit headers. Fail closed on sensitive endpoints, fail open on low-risk reads, and always alert when the limiter degrades.
Rate limiting sits at the intersection of reliability, cost control, and security. The OWASP API Security Top 10 (2023) lists the failure mode as API4:2023, Unrestricted Resource Consumption. Without limits, one client can exhaust capacity, run up your bill, or brute-force credentials. We build APIs that take real traffic, and our custom software practice treats limiting as a day-one control. For the full API hardening picture, pair this with our API security best practices guide.
1. The four algorithms, and when each fits
Four algorithms cover almost every real need. The right choice is a tradeoff between burst tolerance, accuracy, and memory.
- Fixed window. Count requests per calendar window (e.g. per minute). Simple, but a client can send a full window's worth right before the reset and again right after — double the intended rate across the boundary.
- Sliding window log. Store a timestamp per request and count those inside the trailing window. Perfectly accurate, but memory grows with traffic.
- Sliding window counter. Approximate the sliding window by weighting the current and previous fixed windows. Near-accurate with constant memory — a great default for high throughput.
- Token bucket. Refill tokens at a steady rate up to a cap; each request spends one. Permits bursts up to the bucket size, then throttles to the refill rate. The best fit for user-facing APIs.
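The sliding window counter is the least obvious of the four, so here is a minimal sketch of the weighting step. The function name, the shape of `state`, and the choice to align windows to multiples of `windowMs` are assumptions for illustration, not a prescribed implementation.

```javascript
// Sliding window counter: weight the previous fixed window by how much of it
// still overlaps the trailing window, then add the current window's count.
// `state` is { windowStart, current, previous }; all names are illustrative.
function slidingWindowAllow(state, now, windowMs, limit) {
  const windowStart = Math.floor(now / windowMs) * windowMs;
  let { current, previous } = state;

  if (windowStart !== state.windowStart) {
    // Rolled into a new window. The old "current" becomes "previous" only if
    // it is the immediately preceding window; otherwise it has fully expired.
    previous = windowStart - state.windowStart === windowMs ? current : 0;
    current = 0;
  }

  // Fraction of the previous window that is still inside the trailing window.
  const overlap = 1 - (now - windowStart) / windowMs;
  const estimated = previous * overlap + current;

  if (estimated + 1 > limit) {
    return { ok: false, next: { windowStart, current, previous } };
  }
  return { ok: true, next: { windowStart, current: current + 1, previous } };
}

// Example: limit of 10 per 60 s. The previous window saw 10 requests and we
// are 30 s into the current one, so half of those (5) still count.
// estimated = 10 * 0.5 + 4 = 9, which leaves room for one more request.
const example = slidingWindowAllow(
  { windowStart: 60000, current: 4, previous: 10 },
  90000,
  60000,
  10
);
```

Only two counters and one timestamp are stored per key, which is where the constant-memory property comes from. The estimate assumes requests in the previous window were spread evenly, so it can be slightly off for very bursty traffic.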
2. Implementing a token bucket
A token bucket needs only two stored values per key — the current token count and the timestamp of the last refill. On each request you compute how many tokens have accrued since then, cap at the bucket size, and spend one if available.
// Token bucket: refill lazily on each request.
// `now` and `state.lastRefill` are millisecond timestamps; retryAfter is in seconds.
function allow(state, now, ratePerSec, capacity) {
  // Clamp at zero so a backwards clock step cannot remove tokens.
  const elapsed = Math.max(0, now - state.lastRefill) / 1000;
  const tokens = Math.min(capacity, state.tokens + elapsed * ratePerSec);
  if (tokens < 1) {
    // Denied: the caller keeps the old state and retries once a token has accrued.
    return { ok: false, retryAfter: (1 - tokens) / ratePerSec };
  }
  return {
    ok: true,
    next: { tokens: tokens - 1, lastRefill: now },
  };
}

The key design decision is scope: rate-limit per authenticated identity for fairness, and per IP as a backstop against unauthenticated abuse. Use distinct, stricter buckets for expensive or sensitive endpoints: login, password reset, search, export, and anything that triggers downstream cost.
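One way to express that scoping is sketched below. It repeats the lazy-refill logic so it runs on its own, and keeps state in an in-process `Map` purely for illustration. The policy table, endpoint names, key format, and numbers are all hypothetical placeholders.

```javascript
// Hypothetical policy table: endpoint names and numbers are placeholders.
const POLICIES = new Map([
  ["default", { ratePerSec: 10, capacity: 20 }],
  ["login", { ratePerSec: 0.2, capacity: 5 }], // stricter: sensitive endpoint
]);

// key -> { tokens, lastRefill }. In-process only; a fleet needs a shared store.
const buckets = new Map();

// Lazy-refill token bucket, keyed by an arbitrary string.
function takeToken(key, policy, now) {
  const state = buckets.get(key) ?? { tokens: policy.capacity, lastRefill: now };
  const elapsed = Math.max(0, now - state.lastRefill) / 1000;
  const tokens = Math.min(policy.capacity, state.tokens + elapsed * policy.ratePerSec);
  if (tokens < 1) return false;
  buckets.set(key, { tokens: tokens - 1, lastRefill: now });
  return true;
}

// A request must pass the per-IP backstop and, when authenticated, the
// per-identity bucket. Note that a token spent on the IP bucket is not
// refunded if the identity bucket then rejects.
function allowRequest({ userId, ip, endpoint }, now = Date.now()) {
  const policy = POLICIES.get(endpoint) ?? POLICIES.get("default");
  if (!takeToken(`ip:${endpoint}:${ip}`, policy, now)) return false;
  if (userId && !takeToken(`user:${endpoint}:${userId}`, policy, now)) return false;
  return true;
}
```

In practice the per-IP policy is often looser than the per-identity one, since many legitimate users can share an address behind NAT. This sketch reuses one policy for both to stay short.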
3. Making it work across many servers
In-memory counters break the moment you run more than one instance — each server enforces its own limit, so a client's real allowance is multiplied by your fleet size. The fix is shared, atomic state.
-- Atomic fixed-window counter in Redis (Lua, runs server-side)
-- KEYS[1] = counter key, ARGV[1] = window in ms, ARGV[2] = limit
local current = redis.call("INCR", KEYS[1])
if current == 1 then
  redis.call("PEXPIRE", KEYS[1], ARGV[1]) -- first hit starts the window
end
if current > tonumber(ARGV[2]) then
  return 0 -- reject
end
return 1 -- allow

- Use a Lua script or a MULTI/EXEC transaction so the read-modify-write is atomic; a plain pipeline only batches commands and does not prevent concurrent requests from racing past the limit.
- For extreme throughput, keep a node-local token bucket and reconcile with Redis periodically; you trade a little precision for far fewer round trips.
- Push coarse limiting to the edge — a CDN, API gateway, or WAF — and reserve application-level limiting for per-tenant fairness and business logic.
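From application code, the fixed-window Lua script above might be invoked roughly as follows. This is a sketch that assumes a client object exposing `eval(script, numKeys, ...keysAndArgs)` and returning a Promise, which is the shape ioredis uses; other clients take different arguments. The `rl:` key prefix and the function name are hypothetical.

```javascript
// The same fixed-window script as above, kept as a string for EVAL.
const FIXED_WINDOW_LUA = `
local current = redis.call("INCR", KEYS[1])
if current == 1 then
  redis.call("PEXPIRE", KEYS[1], ARGV[1])
end
if current > tonumber(ARGV[2]) then
  return 0
end
return 1
`;

// `redis` is assumed to expose eval(script, numKeys, ...keysAndArgs).
// Resolves to true when the request is within the limit.
async function allowFixedWindow(redis, identity, windowMs, limit) {
  const key = `rl:${identity}`; // hypothetical key naming scheme
  const allowed = await redis.eval(FIXED_WINDOW_LUA, 1, key, windowMs, limit);
  return allowed === 1;
}
```

Because the increment, the expiry, and the comparison all run inside one script, every server in the fleet sees the same counter and no two requests can both read a stale value.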
4. The HTTP contract: 429 and the right headers
A rate limiter is only as good as the signal it sends back. Return 429 Too Many Requests with a Retry-After header, and emit the RateLimit-* headers proposed in the IETF draft on RateLimit header fields so disciplined clients throttle themselves before hitting the wall.
- RateLimit-Limit: the ceiling for the window.
- RateLimit-Remaining: requests left in the current window.
- RateLimit-Reset: seconds until the window refreshes.
- Never return 200 with an error body. HTTP clients and proxies rely on the 429 status to trigger automatic backoff.
Document your limits publicly. A client that knows the rules can design polite retry-with-jitter behavior; a client guessing in the dark will hammer you.
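Both halves of that contract can be sketched in a few lines: the server building the 429, and a client computing a polite retry delay. The function names, the response shape (a plain object with a `headers` map), the JSON body fields, and the backoff constants are assumptions for illustration. The client half only handles Retry-After given in seconds, not as an HTTP date.

```javascript
// Server side: build a 429 response from the limiter's output.
function tooManyRequests({ limit, remaining, resetSeconds }) {
  const wait = Math.max(1, Math.ceil(resetSeconds)); // whole seconds, at least 1
  return {
    status: 429,
    headers: {
      "Retry-After": String(wait),
      "RateLimit-Limit": String(limit),
      "RateLimit-Remaining": String(Math.max(0, remaining)),
      "RateLimit-Reset": String(wait),
    },
    body: JSON.stringify({ error: "rate_limited", retryAfterSeconds: wait }),
  };
}

// Client side: honor Retry-After when it is a number of seconds, otherwise
// fall back to capped exponential backoff. Jitter is added in both cases.
function retryDelayMs(response, attempt, random = Math.random) {
  const retryAfter = Number(response.headers["Retry-After"]);
  const baseMs =
    Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : Math.min(30000, 500 * 2 ** attempt);
  // Up to 20% extra so clients released together do not retry in lockstep.
  return Math.round(baseMs * (1 + 0.2 * random()));
}
```

Jitter is only ever added on top of the server's hint, never subtracted, so a client following this sketch does not retry before the window has actually reset.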
Rate limits and quotas are not the same control
Short-term rate limits protect capacity now; long-term quotas enforce the plan a customer pays for. Most APIs need both.
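A minimal sketch of layering the two controls follows. It assumes a quota counted per tenant per calendar month in UTC and an in-memory `Map` as the usage store; a real billing period and a durable store would replace both. The function names and reason strings are hypothetical.

```javascript
// "tenant:YYYY-MM" -> requests consumed in that billing month (in-memory stand-in).
const usage = new Map();

// Assumes billing months are calendar months in UTC.
function quotaKey(tenantId, now) {
  const d = new Date(now);
  const month = String(d.getUTCMonth() + 1).padStart(2, "0");
  return `${tenantId}:${d.getUTCFullYear()}-${month}`;
}

// `rateLimitOk` is the result of whatever short-term limiter runs first.
// A rate-limited request returns early, so it never consumes monthly quota.
function admit(tenantId, monthlyLimit, rateLimitOk, now = Date.now()) {
  if (!rateLimitOk) return { ok: false, reason: "rate_limited" };
  const key = quotaKey(tenantId, now);
  const used = usage.get(key) ?? 0;
  if (used >= monthlyLimit) return { ok: false, reason: "quota_exceeded" };
  usage.set(key, used + 1);
  return { ok: true, used: used + 1 };
}
```

Returning distinct reasons lets the API tell a client whether to back off for a few seconds or to upgrade its plan, which are very different remedies.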
Algorithm comparison at a glance
| Algorithm | Strength / weakness |
|---|---|
| Fixed window | Simplest; boundary burst doubles the effective rate |
| Sliding log | Exact; memory grows with request volume |
| Sliding counter | Near-exact with constant memory; great default at scale |
| Token bucket | Burst-tolerant, steady average; best for user-facing APIs |
| Leaky bucket | Smooths output to a constant rate; good for protecting downstreams |
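To make the leaky bucket row concrete, here is a rough sketch of the queue-style variant, where admitted requests are released at a fixed drain rate instead of being passed through in bursts. The function name, the `state` shape, and the decision to count the in-service slot toward `capacity` are assumptions.

```javascript
// Leaky bucket as a queue: each admitted request is assigned the next free
// drain slot, so output is spaced evenly no matter how requests arrive.
// `state` is { nextSlot } in milliseconds; start with { nextSlot: 0 }.
function leakyBucketSchedule(state, now, drainPerSec, capacity) {
  const intervalMs = 1000 / drainPerSec;
  // Earliest slot this request could take: now, or after those already queued.
  const slot = Math.max(now, state.nextSlot);
  const ahead = (slot - now) / intervalMs; // slots already claimed ahead of us
  if (ahead >= capacity) {
    return { ok: false, next: state }; // bucket is full: reject (overflow)
  }
  return {
    ok: true,
    delayMs: slot - now, // how long the caller should hold the request
    next: { nextSlot: slot + intervalMs },
  };
}
```

Where a token bucket would let a burst of three through immediately, this version admits them but spaces them one drain interval apart, which is the behavior a fragile downstream system needs.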
Operational practices that hold over time
Limits drift out of step with reality unless you watch them:
- Instrument every rejection. A spike in 429s is either an attack or a limit set too low for a legitimate workload — you cannot tell them apart without metrics. See our observability for startups guide.
- Decide fail-open vs fail-closed per endpoint. Sensitive endpoints reject when the limiter is down; low-risk reads stay available. Make it explicit and alert on degradation.
- Offload heavy work to a queue. When a tenant hits their limit on an expensive operation, enqueue it rather than rejecting outright — see background jobs and queues in production.
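The fail-open versus fail-closed decision can be made explicit with a small wrapper around the limiter call. This is a sketch under assumptions: `check` stands in for any async limiter lookup that rejects when the store is unreachable, and `onDegraded` is a hypothetical hook where a metric or alert would be emitted. The 50 ms default timeout is a placeholder.

```javascript
// Wraps a limiter lookup so store failures and slow responses resolve to an
// explicit per-endpoint decision instead of an unhandled error.
async function guardedCheck(check, { failClosed, timeoutMs = 50, onDegraded = () => {} }) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("limiter timeout")), timeoutMs);
  });
  try {
    // Normal path: whatever the limiter decided (true = allow).
    return await Promise.race([check(), timeout]);
  } catch (err) {
    onDegraded(err); // count it and alert, so degradation is never silent
    return !failClosed; // fail closed rejects; fail open allows
  } finally {
    clearTimeout(timer);
  }
}
```

A login route would call this with `failClosed: true` and a low-risk read with `failClosed: false`, which keeps the choice visible at the call site rather than buried in the limiter.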
Frequently asked questions
What is the best rate limiting algorithm for an API?
For most production APIs, the token bucket is the best default. It allows short bursts up to the bucket size while enforcing a steady average rate, which matches how real clients behave. The sliding-window log is the most accurate but the most memory-hungry; sliding-window counter is a strong middle ground. Fixed window is the simplest but suffers a boundary problem where a client can fire two full windows of traffic across the reset edge. Choose based on whether you value burst tolerance, accuracy, or simplicity.
What is the difference between token bucket and leaky bucket?
Both smooth traffic, but in opposite directions. The token bucket lets requests through as long as tokens are available, so it permits bursts up to the bucket capacity and then throttles to the refill rate. The leaky bucket processes requests at a fixed drain rate regardless of arrival pattern, smoothing output into a constant stream. Token bucket is better for user-facing APIs that should tolerate bursts; leaky bucket is better when a downstream system must receive a strictly even flow.
How do you rate limit across multiple servers?
Keep the limiter state in a shared store — typically Redis — rather than in each instance's memory, which would let a client multiply their allowance by the number of servers. Use an atomic operation (a Lua script or an INCR with expiry) so concurrent requests cannot race past the limit. For very high throughput, a sliding-window counter in Redis gives accuracy without storing every timestamp, and node-local token buckets that periodically sync to Redis trade a little precision for far less network chatter.
What HTTP status and headers should a rate limiter return?
Return 429 Too Many Requests with a Retry-After header telling the client how long to wait. Emit the RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers proposed in the IETF draft on RateLimit header fields so well-behaved clients can self-throttle before they hit the wall. Never return a 200 with an error body: clients and proxies treat 429 specially, and a wrong status code defeats automatic backoff in most HTTP libraries.
What is the difference between rate limiting and quotas?
Rate limiting protects short-term capacity — requests per second or per minute — to keep one client from overwhelming the system right now. Quotas govern longer-term consumption, such as requests per day or per billing month, and usually map to a pricing tier. You generally want both: a rate limit so no tenant can spike and degrade others, and a quota so usage stays within the plan a customer pays for. They are enforced at different time scales and for different reasons.
Should a rate limiter fail open or fail closed?
It depends on what the endpoint protects. For a login, password reset, or any abuse-prone endpoint, fail closed — if the limiter store is unreachable, reject rather than expose yourself to credential stuffing. For a low-risk read endpoint where availability matters more than strict enforcement, failing open avoids turning a Redis blip into a full outage. Decide per endpoint, make the choice explicit, and alert when the limiter degrades so you are not silently unprotected.