Building a Token Bucket Rate Limiter in Rust with Axum

Rate limiting is one of those problems that looks simple on the surface, until you think carefully about what happens when hundreds of requests arrive for the same client at exactly the same moment. In this post I will walk through building an in-memory token bucket rate limiter in Rust, using Axum for the HTTP layer and standard library RwLock/Mutex for concurrency. No external crates for state management, just Rust's type system and std::sync.
The companion Go implementation lives at rate-limiter-go and makes the same design decisions using sync.RWMutex and goroutines. The locking structure in both projects is intentionally identical, which makes the comparison meaningful.
The token bucket algorithm
The idea is simple. Each client gets a bucket that holds up to capacity tokens. Every request consumes one token. A background task replenishes tokens at a fixed rate over time. When the bucket is empty the request is rejected with a 429 and a Retry-After header telling the client when to try again.
| Time | Event | Tokens remaining | Response |
|---|---|---|---|
| t=0s | Bucket created | 10 | |
| t=0s | 3 requests arrive | 7 | 200 |
| t=1s | Refill fires (+1) | 8 | |
| t=1s | 9 requests arrive | 0 (last gets rejected) | 200 × 8, 429 × 1 |
| t=2s | Refill fires (+1) | 1 | |
| t=2s | Next request | 0 | 200 |
The key property is that the check and the decrement must be atomic. If two requests arrive simultaneously and the bucket has one token, exactly one of them should succeed. That is the concurrency problem this post focuses on.
Domain types
#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct ClientId(pub String);
impl ClientId {
pub fn new(s: impl Into<String>) -> Self {
Self(s.into())
}
}
#[derive(Debug)]
pub struct Bucket {
pub tokens: f64,
pub capacity: f64,
pub last_refill: Instant,
}
#[derive(Debug, Clone)]
pub struct RateLimiterConfig {
pub capacity: f64,
pub refill_rate: f64,
pub refill_interval_ms: u64,
pub cleanup_ttl_secs: u64,
pub cleanup_interval_secs: u64,
}
ClientId is a newtype over String. This is a minor but deliberate choice --- it means you cannot accidentally pass a raw string where a ClientId is expected. Since it is used as a HashMap key it derives Hash and Eq.
Bucket stores tokens as f64 because refills are time-proportional and produce fractional values. We only consume whole tokens, so a client with 0.8 tokens still cannot make a request.
last_refill on Bucket is actually doing double duty: it records when the bucket was last refilled so the refill task can calculate how many tokens to add, and it also serves as the activity timestamp for the cleanup task.
The store: two levels of locking
pub struct Store {
buckets: RwLock<HashMap<ClientId, Mutex<Bucket>>>,
}
There are two distinct locks:
RwLockon theHashMap, many threads can hold the read lock simultaneously to look up existing clients. Only inserting a new client requires the write lock.Mutexper bucket, the check-and-decrement inallowand the token addition inrefill_allboth lock the individual bucket, not the whole map.
This is structurally identical to the Go implementation: sync.RWMutex on the map, sync.Mutex per bucket. The difference is that in Go you write the lock/unlock calls by hand, while in Rust you call .lock().unwrap() and get a MutexGuard that releases the lock when it goes out of scope.
Double-checked insert
pub fn get_or_insert(&self, client_id: ClientId, capacity: f64) {
{
let read = self.buckets.read().unwrap();
if read.contains_key(&client_id) {
return;
}
}
let mut write = self.buckets.write().unwrap();
write.entry(client_id).or_insert_with(|| {
Mutex::new(Bucket {
tokens: capacity,
capacity,
last_refill: Instant::now(),
})
});
}
The first read lock is the optimistic fast path. Most requests are from clients that already have a bucket, so we return immediately without ever taking a write lock. When a new client is seen, we drop the read lock, take the write lock, then use entry().or_insert_with() to insert only if the key is still absent. The or_insert_with handles the race where two threads both see the key missing under the read lock and both try to insert only one insertion will happen.
Accessing a bucket
pub fn with_bucket<F, R>(&self, client_id: &ClientId, f: F) -> Option<R>
where
F: FnOnce(&mut Bucket) -> R,
{
let read = self.buckets.read().unwrap();
let mutex = read.get(client_id)?;
let mut bucket = mutex.lock().unwrap();
Some(f(&mut bucket))
}
with_bucket holds the map's read lock while locking the individual bucket's Mutex and calling f. The read lock on the map ensures the Mutex pointer stays valid. The bucket Mutex ensures the check-and-decrement is atomic. Both locks are released when the guards go out of scope at the end of the function.
The rate limiter
pub fn allow(&self, client_id: ClientId) -> Result<(), RateLimitError> {
self.store.get_or_insert(client_id.clone(), self.config.capacity);
let result = self.store.with_bucket(&client_id, |bucket| {
if bucket.tokens >= 1.0 {
bucket.tokens -= 1.0;
true
} else {
false
}
});
match result {
Some(true) => Ok(()),
_ => {
let retry_after_secs = (1.0 / self.config.refill_rate).ceil() as u64;
Err(RateLimitError::RateLimitExceeded { retry_after_secs })
}
}
}
pub fn refill_all(&self) {
let refill_rate = self.config.refill_rate;
self.store.for_each_bucket(|bucket| {
let elapsed = bucket.last_refill.elapsed().as_secs_f64();
let tokens_to_add = elapsed * refill_rate;
bucket.tokens = (bucket.tokens + tokens_to_add).min(bucket.capacity);
bucket.last_refill = Instant::now();
});
}
allow first ensures a bucket exists for the client, then calls with_bucket to lock that bucket and perform the check-and-decrement atomically. If two concurrent calls arrive for the same client with one token remaining, only one will find tokens >= 1.0 while holding the bucket's Mutex. The other will see 0.0 and return an error.
RateLimiter is Clone because Axum extracts state by cloning it into each handler. Internally the store is behind an Arc, so cloning the RateLimiter only bumps the reference count.
refill_all uses for_each_bucket, which holds the map's read lock while calling the closure for each bucket under that bucket's own Mutex. Tokens added = elapsed * refill_rate, capped at capacity. This is time-proportional: it does not matter how often the refill task fires.
Background tasks
pub fn spawn_refill_task(limiter: RateLimiter) {
let interval_ms = limiter.config().refill_interval_ms;
tokio::spawn(async move {
let mut ticker = time::interval(Duration::from_millis(interval_ms));
loop {
ticker.tick().await;
limiter.refill_all();
}
});
}
pub fn spawn_cleanup_task(limiter: RateLimiter) {
let interval_secs = limiter.config().cleanup_interval_secs;
tokio::spawn(async move {
let mut ticker = time::interval(Duration::from_secs(interval_secs));
loop {
ticker.tick().await;
let before = limiter.store().len();
limiter.cleanup();
let after = limiter.store().len();
if before != after {
info!(removed = before - after, remaining = after, "cleaned up inactive buckets");
}
}
});
}
Both tasks take ownership of a cloned RateLimiter. Because RateLimiter holds an Arc<Store>, both tasks and all request handlers are pointing at the same underlying map. tokio::spawn moves the limiter into the task. The ticker.tick().await suspends the task between intervals, yielding the thread back to the runtime.
The cleanup task is important for long-running servers. Without it, every unique client IP that ever makes a request stays in the map forever. The cleanup task removes any bucket whose last_refill is older than the configured TTL, keeping memory bounded.
Axum middleware
pub async fn rate_limit_middleware(
State(limiter): State<RateLimiter>,
req: Request,
next: Next,
) -> Result<Response, RateLimitError> {
let client_id = extract_client_id(&req);
limiter.allow(client_id)?;
Ok(next.run(req).await)
}
fn extract_client_id(req: &Request) -> ClientId {
if let Some(key) = req.headers().get("X-API-Key").and_then(|v| v.to_str().ok()) {
return ClientId::new(key);
}
if let Some(forwarded) = req.headers().get("X-Forwarded-For").and_then(|v| v.to_str().ok()) {
if let Some(ip) = forwarded.split(',').next() {
return ClientId::new(ip.trim());
}
}
ClientId::new("unknown")
}
The ? operator on limiter.allow(client_id)? is doing real work here. RateLimitError implements IntoResponse, so when allow returns an error the middleware returns a 429 response automatically --- no match statement required.
The middleware is attached with route_layer, which applies it to every route on the router:
let app = Router::new()
.route("/", get(health_handler))
.route("/ping", get(ping_handler))
.route_layer(axum_middleware::from_fn_with_state(
limiter.clone(),
rate_limit_middleware,
))
.with_state(limiter);
Error handling
#[derive(Debug, Error)]
pub enum RateLimitError {
#[error("rate limit exceeded")]
RateLimitExceeded { retry_after_secs: u64 },
#[error("internal error: {0}")]
Internal(String),
}
impl IntoResponse for RateLimitError {
fn into_response(self) -> Response {
match self {
RateLimitError::RateLimitExceeded { retry_after_secs } => {
let mut response = (
StatusCode::TOO_MANY_REQUESTS,
Json(json!({
"error": "rate limit exceeded",
"retry_after_secs": retry_after_secs,
})),
).into_response();
response.headers_mut().insert(
"Retry-After",
retry_after_secs.to_string().parse().unwrap(),
);
response
}
RateLimitError::Internal(msg) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({ "error": msg })),
).into_response(),
}
}
}
thiserror generates the Error impl from the #[error(...)] attributes. The IntoResponse impl sets both the JSON body and the Retry-After header so clients know exactly when they can retry without guessing.
Testing
The five integration tests live in tests/ratelimit_tests.rs and import from the library crate (lib.rs re-exports all modules). This means tests exercise the real types without going through HTTP.
#[tokio::test]
async fn test_client_exhausts_bucket_and_is_rejected() {
let limiter = make_limiter();
let client = ClientId::new("192.168.1.1");
for i in 0..5 {
assert!(limiter.allow(client.clone()).is_ok(), "request {} should be allowed", i + 1);
}
let result = limiter.allow(client);
assert!(matches!(result, Err(RateLimitError::RateLimitExceeded { .. })));
}
#[tokio::test]
async fn test_concurrent_requests_same_client_only_capacity_succeed() {
let limiter = Arc::new(make_limiter());
let client = ClientId::new("192.168.1.3");
let capacity = test_config().capacity as usize;
let total = 20;
let mut set = JoinSet::new();
for _ in 0..total {
let limiter = Arc::clone(&limiter);
let client = client.clone();
set.spawn(async move { limiter.allow(client) });
}
let mut ok = 0usize;
let mut rejected = 0usize;
while let Some(r) = set.join_next().await {
match r.unwrap() {
Ok(()) => ok += 1,
Err(_) => rejected += 1,
}
}
assert_eq!(ok, capacity);
assert_eq!(rejected, total - capacity);
}
The concurrency test spawns 20 Tokio tasks simultaneously for the same client with a capacity of 5. Exactly 5 must succeed and 15 must be rejected. This test would be flaky or wrong if the check-and-decrement were not atomic.
The full test suite:
| Test | Scenario |
|---|---|
test_client_exhausts_bucket_and_is_rejected |
5 requests succeed, 6th is 429 |
test_tokens_refill_and_requests_succeed_again |
Sleep 1.1s, call refill_all, next request succeeds |
test_concurrent_requests_same_client_only_capacity_succeed |
20 concurrent tasks, exactly capacity succeed |
test_different_clients_have_independent_buckets |
Client A exhausted does not affect client B |
test_inactive_buckets_are_cleaned_up |
TTL=0 cleanup removes all buckets |
Rust vs Go: the same locking structure, different enforcement
Both implementations use the same two-level locking pattern:
| Rust | Go | |
|---|---|---|
| Map guard | std::sync::RwLock<HashMap> |
sync.RWMutex + map |
| Per-bucket guard | std::sync::Mutex<Bucket> |
sync.Mutex per bucket |
The algorithm is identical. The difference is how the language enforces the contract.
In Rust, mutex.lock().unwrap() returns a MutexGuard. You cannot access the data inside the Mutex without holding the guard. The guard releases the lock when it goes out of scope, the compiler enforces this through the borrow checker. You cannot forget to release it and you cannot access the data without it.
In Go, bucket.Mu.Lock() and bucket.Mu.Unlock() are explicit calls. defer bucket.Mu.Unlock() makes the release automatic, but nothing in the type system prevents you from accessing bucket.Tokens without holding the lock.
This is not a criticism of Go, explicit primitives are easier to reason about in many contexts. It is just a genuine difference in how the two languages model ownership of shared state.
Project structure
src/
lib.rs -- crate root
main.rs -- entry point
types.rs -- ClientId, Bucket, RateLimiterConfig
store.rs -- RwLock<HashMap> + per-bucket Mutex store
limiter.rs -- allow, refill_all, cleanup
tasks.rs -- background Tokio tasks
middleware.rs -- Axum middleware
error.rs -- typed errors with IntoResponse
tests/
ratelimit_tests.rs
Running it
cargo run
# GET http://localhost:3000/ping
# Hit it 11 times -- 11th returns 429
for i in $(seq 1 11); do
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/ping
done
cargo test
Source code: github.com/lethuzulu/rate-limiter-rs
The Go version using goroutines and sync.RWMutex: github.com/lethuzulu/rate-limiter-go




