Rate limiting

Terminology

A rate-limiting system should be a fail-open system: if the rate limiter itself fails or is unreachable, requests should be allowed through rather than rejected.

Why rate limiting is needed

Requirements clarification

Functional requirement

Non-functional requirement

Strategies

No rate limiting

No rate limiting is the baseline the design needs to consider as the worst-case scenario. Using timeouts, deadlines, and the circuit-breaker pattern helps your service stay robust in the absence of rate limiting.
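
As a rough illustration of that fallback posture, here is a minimal Go sketch (the downstream URL and timeout values are made up) that caps every outbound call with a client-wide timeout and a per-request deadline; a production service would also wrap such calls in a circuit breaker.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Client-wide timeout: no outbound call can exceed this, even when the
	// downstream applies no rate limiting at all.
	client := &http.Client{Timeout: 2 * time.Second}

	// Per-request deadline so a slow downstream cannot hold the caller indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://downstream.example.com/api", nil)
	if err != nil {
		panic(err)
	}

	resp, err := client.Do(req)
	if err != nil {
		// Deadline errors are a signal to back off or trip a circuit breaker.
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```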

Pass through

The service calls other services to fulfill requests. A 429 Too Many Requests HTTP response may be passed back to the caller.

Enforce rate limits

Put rate limits in place to protect the current service or its downstream services.

To enforce rate limiting, first understand why it is being applied in this case, and then determine which attributes of the request are best suited to be used as the limiting key (for example, source IP address, user, API key). After you choose a limiting key, a limiting implementation can use it to track usage. When limits are reached, the service returns a limiting signal (usually a 429 HTTP response).
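
A minimal sketch of that flow, keyed by source IP and using an in-memory map of golang.org/x/time/rate limiters (the handler, port, and limits are made up; a single-instance setup is assumed):

```go
package main

import (
	"net"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// perKeyLimiter tracks one token-bucket limiter per limiting key
// (here the caller's IP) in process memory.
type perKeyLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	limit    rate.Limit
	burst    int
}

func (p *perKeyLimiter) get(key string) *rate.Limiter {
	p.mu.Lock()
	defer p.mu.Unlock()
	l, ok := p.limiters[key]
	if !ok {
		l = rate.NewLimiter(p.limit, p.burst)
		p.limiters[key] = l
	}
	return l
}

func main() {
	limiter := &perKeyLimiter{
		limiters: map[string]*rate.Limiter{},
		limit:    5, // 5 requests per second per IP
		burst:    10,
	}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		ip, _, _ := net.SplitHostPort(r.RemoteAddr)
		if !limiter.get(ip).Allow() {
			// The limiting signal back to the caller.
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		w.Write([]byte("ok"))
	})

	http.ListenAndServe(":8080", nil)
}
```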

Defer response

Under high traffic, even responding to the caller's request is a challenge.

Client side strategies

If the backend service does not provide rate limiting, the client can apply self-imposed throttling.
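
A minimal sketch of self-imposed throttling with the golang.org/x/time/rate package (the rate and number of calls are made up):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Self-imposed cap of one call every 500ms with a burst of 1, applied
	// on the client side because the backend exposes no rate limit.
	limiter := rate.NewLimiter(rate.Every(500*time.Millisecond), 1)

	for i := 0; i < 5; i++ {
		// Wait blocks until the next token is available (or the ctx is done).
		if err := limiter.Wait(context.Background()); err != nil {
			fmt.Println("throttle wait failed:", err)
			return
		}
		fmt.Printf("call %d sent at %s\n", i, time.Now().Format(time.RFC3339Nano))
	}
}
```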

Architecture

Local rate limiting

(Figure: local rate limiting)

Why not run the rate-limiting service as the gateway

We could have a rate-limiter client injected into the gateway that calls the rate-limiting service.

What if the in-memory request cache crashes

All request counts are lost, which could let peak traffic hit the backend service. So we need to persist the request counts in a distributed way. If Redis is used as the caching layer, it handles the single-point-of-failure problem out of the box.
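
A minimal sketch of such a Redis-backed counter (a fixed-window variant), assuming the github.com/redis/go-redis/v9 client and made-up key, limit, and window values; it also fails open if Redis is unreachable, in line with the fail-open note above.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// allow keeps the request count in Redis instead of process memory, so a
// crash or restart of the limiter instance does not reset the counters.
func allow(ctx context.Context, rdb *redis.Client, key string, limit int64, window time.Duration) (bool, error) {
	count, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if count == 1 {
		// First request in this window: start the window's TTL.
		rdb.Expire(ctx, key, window)
	}
	return count <= limit, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ctx := context.Background()

	ok, err := allow(ctx, rdb, "ratelimit:user-42", 100, time.Minute)
	if err != nil {
		// Fail open: when the limiter itself is broken, let the request through.
		ok = true
	}
	fmt.Println("request allowed:", ok)
}
```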

Distributed rate limiting

(Figure: distributed rate limiting with scale)

How to implement rate limiting: techniques for enforcing rate limits

Token bucket

Java sample implementation

Leaky bucket

(Figure: leaky bucket)

This is similar to the token bucket: if no tokens are available, we can either put the request to sleep until tokens are refilled, or discard the request and return 429 Too Many Requests to the client.
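
A minimal sketch of both behaviors, using the golang.org/x/time/rate token bucket (the rates, paths, and port are made up): /reject discards and returns 429, while /wait sleeps until a token is refilled.

```go
package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

func main() {
	// A shared bucket: refill rate of 10 tokens per second, capacity 20.
	limiter := rate.NewLimiter(10, 20)

	// Variant 1: discard the request and return 429 when the bucket is empty.
	http.HandleFunc("/reject", func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		w.Write([]byte("ok"))
	})

	// Variant 2: put the request to sleep until a token is refilled.
	http.HandleFunc("/wait", func(w http.ResponseWriter, r *http.Request) {
		if err := limiter.Wait(r.Context()); err != nil {
			// The caller gave up or was cancelled while waiting for a token.
			http.Error(w, "request cancelled while throttled", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	})

	http.ListenAndServe(":8080", nil)
}
```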

Golang implementation

Pros of Token Bucket and Leaky bucket

Cons of Token Bucket and Leaky bucket

Fixed window

(Figure: fixed window)

Fixed window Pros

Fixed window Cons

Sliding log

Sliding log Pros

Sliding log Cons

Sliding window

(Figure: sliding window)

Sliding window Pros

Sliding window Cons

Rate limiting in a distributed system

Global rate limiting vs Local rate limiting

(Figure: distributed rate limiting)

If the global rate limit is set to 4 QPS, each of the services needs the same 4 QPS configuration; it is NOT 4 / 2 QPS per service. In the worst case, all requests are redirected to one of the services and 0 requests go to the others. Once the global limit is reached, we should not allow new requests even if one service still has available slots.

In the above case, tracking the count locally, as local rate limiting does, is not feasible. We need to know the total number of requests that have been received across all services in order to decide whether we have reached the limit.

Use the following formula to calculate whether we have reached the limit.
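
A reasonable form of that check, assuming each service can learn the other services' counters (the names localCount, remoteCounts, and globalLimit below are hypothetical):

```go
package main

import "fmt"

// reachedLimit is a hypothetical helper: localCount is this service's own
// counter, remoteCounts are the counters learned from the other services
// (via broadcast, gossip, or a shared store), and globalLimit is the
// cluster-wide budget (4 QPS in the example above).
func reachedLimit(localCount int, remoteCounts []int, globalLimit int) bool {
	total := localCount
	for _, c := range remoteCounts {
		total += c
	}
	return total >= globalLimit
}

func main() {
	// Service A-1 has served 1 request this second; A-2 reports 3.
	fmt.Println(reachedLimit(1, []int{3}, 4)) // true: the global budget of 4 is used up
}
```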

How could we get the rest of the services' used capacity?

Solution 1

Broadcast the current service's status to all other services, e.g. service A-1 broadcasts its status to service A-2. This is also called full mesh.

Challenges:

This solution does not scale when there is a large number of services in one mesh.

Solution 2

Gossip protocol: each service randomly picks another service and tells it its status. Yahoo uses this solution.

Solution 3

(Figure: global rate limiting solution)

Challenges:

Rate limiting in K8S

K8S apiserver

The apiserver implementation has a flag called maxInflightLimit, which is used to size a buffered channel. Each time a new request comes in, the filter tries to add one element (a bool) into the channel, the server invokes handler.ServeHTTP(w, r), and then one element is popped out of the channel. If the buffer is full, 429 is returned.
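
A stripped-down sketch of that buffered-channel pattern (not the actual apiserver filter; the limit and port below are made up):

```go
package main

import "net/http"

// withMaxInFlight uses a buffered channel as a semaphore sized by the
// max-in-flight limit, mirroring the pattern described above.
func withMaxInFlight(limit int, handler http.Handler) http.Handler {
	inflight := make(chan bool, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case inflight <- true:
			// A slot was available: serve the request, then free the slot.
			defer func() { <-inflight }()
			handler.ServeHTTP(w, r)
		default:
			// The buffer is full: too many requests are already in flight.
			http.Error(w, "too many requests", http.StatusTooManyRequests)
		}
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", withMaxInFlight(400, ok))
}
```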

client-go rate limiting

The K8S controller-runtime uses the client-go default rate limiter if no custom rate limiter is provided in the controller options. The client-go default controller rate limiter combines ItemExponentialFailureRateLimiter and BucketRateLimiter and returns the worst-case delay of the two. BucketRateLimiter is backed by the Go token-bucket implementation in golang.org/x/time/rate.
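
A sketch of how that combination can be built with client-go's workqueue package, assuming a client-go version that still exposes the untyped rate-limiter helpers; the backoff and bucket values mirror the commonly used defaults but are shown here only for illustration:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Combine a per-item exponential backoff with an overall token bucket;
	// MaxOfRateLimiter returns the worst (longest) delay of the two.
	limiter := workqueue.NewMaxOfRateLimiter(
		// Per-item backoff: 5ms base delay, capped at 1000s.
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		// Overall bucket: 10 qps with a burst of 100, backed by golang.org/x/time/rate.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	item := "example-key"
	for i := 0; i < 3; i++ {
		// When reports how long this item should wait before being retried.
		fmt.Printf("retry %d delayed by %s\n", i, limiter.When(item))
	}
}
```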

envoyproxy rate limiting

This was created by Lyft and is backed by Redis:

See implementations for more details.

References