
Realtime presence platform

Requirements and User stories

Functional requirements

Non-functional requirements

Assumptions & Calculations

10 status changes/user/day × 100 million users = 1 billion status changes/day ≈ 12K writes/second (10^9 / 86,400 s ≈ 11.6K). Assuming reads are ~10× writes, that implies ~120K reads/second.

Data model

type Presence struct {
    UserID    int64 // 8 bytes
    Presence  bool  // true = online, false = offline; 1 byte
    Timestamp int64 // last-active Unix timestamp; 8 bytes in Redis
}

APIs

# Change user's status to online
curl -X PUT -H "Accept: text/event-stream" -H "Authorization: Bearer <JWT>" \
-d '{"user_id": "xxx", "presence": "online"}' https://xxx/users/presence

# Change user's status to offline
curl -X PUT -H "Authorization: Bearer <JWT>" \
-d '{"user_id": "xxx", "presence": "offline"}' https://xxx/users/presence

# Get users' status
curl -X GET -H "Authorization: Bearer <JWT>" \
-d '{"user_id": "xxx", "users_status": ["user_b", "user_c"]}' https://xxx/users/presence

We might not need to send user_id in the body, since the JWT already carries the user's identity.
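As a sketch of what the write path could look like without user_id in the body (Go, net/http; userFromJWT is a hypothetical helper standing in for real JWT verification):

package presence

import (
    "encoding/json"
    "log"
    "net/http"
)

// userFromJWT is a hypothetical helper: a real implementation would verify
// the Bearer token and extract the caller's user ID from its claims.
func userFromJWT(r *http.Request) (int64, error) { return 0, nil }

type presenceRequest struct {
    Presence string `json:"presence"` // "online" | "offline"
}

// putPresence handles PUT /users/presence. The caller's identity comes from
// the verified JWT rather than from the request body.
func putPresence(w http.ResponseWriter, r *http.Request) {
    userID, err := userFromJWT(r)
    if err != nil {
        http.Error(w, "unauthorized", http.StatusUnauthorized)
        return
    }
    var req presenceRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }
    // Persisting the change and broadcasting it via the realtime platform
    // are elided in this sketch.
    log.Printf("user %d is now %s", userID, req.Presence)
    w.WriteHeader(http.StatusNoContent)
}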

Architecture

Data Store

We need to persist each user's last-active timestamp, so a data store is required. For an edge trigger (user sign on/off) we have to update the timestamp; for a level trigger (heartbeat) we also have to update it. Since we have 100 million concurrent online users, the data store must be able to handle high-volume writes.

A relational database is not a good choice: the access pattern is simple key-value reads and writes at very high write throughput, with no need for joins or relational queries.

NoSQL Database

100,000,000 concurrent online users × 17 bytes ≈ 1.7 GB is needed to store the online presence status. We do not have to persist every user's presence status; even if we did, the total data size would only be ~11 GB.

We broadcast status changes in realtime through the realtime platform, so there is no strong consistency requirement at the database level (eventual consistency is fine).

This is a read-intensive application, so we should replicate the data and allow reads from every replica.

From the above, we know:

- Writes are heavy (~12K/second from heartbeats and sign on/off events)
- The data set is small (~1.7 GB for online users, ~11 GB at most)
- Eventual consistency is acceptable
- Reads are even heavier (~120K/second), so replicas should serve reads

Several NoSQL databases could satisfy these requirements; Redis, which this note already uses for the presence timestamps, is one natural candidate.
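A minimal sketch of the write path under that choice, assuming the go-redis v9 client; the presence:<user_id> key layout and the 30-second heartbeat period are assumptions of this sketch:

package presence

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

const heartbeatInterval = 30 * time.Second // assumed heartbeat period

// touch refreshes a user's last-active timestamp. Both edge triggers
// (sign on/off) and level triggers (heartbeats) funnel into this write.
// The TTL doubles as an implicit offline marker: if no update arrives
// before it expires, the user can be treated as offline.
func touch(ctx context.Context, rdb *redis.Client, userID int64) error {
    key := fmt.Sprintf("presence:%d", userID)
    return rdb.Set(ctx, key, time.Now().Unix(), 2*heartbeatInterval).Err()
}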

Realtime Platform

The realtime platform is a pub-sub based realtime message/event delivery platform. This note covers all the details around it.

Presence Service

Heartbeat module (with network fluctuation handling)

Client sends heartbeat

The biggest problem with this approach is network overhead: 1) each heartbeat establishes a new HTTP connection; 2) it is hard to tune the system to avoid network spikes (all users sending heartbeats at the same time).

For chat applications that already maintain a persistent WebSocket connection, we can instead ask the client to send heartbeats over that connection.

So the overall idea is to reduce network overhead.
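A minimal sketch of the WebSocket variant, assuming a gorilla/websocket connection is already open; ping/pong control frames reuse the existing connection instead of establishing a new HTTP connection per heartbeat:

package heartbeat

import (
    "context"
    "time"

    "github.com/gorilla/websocket"
)

// wsHeartbeat reuses an already-open WebSocket connection: each heartbeat is
// a ping control frame instead of a fresh HTTP request, and the server's
// pong handler can refresh the user's last-active timestamp.
func wsHeartbeat(ctx context.Context, conn *websocket.Conn) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            deadline := time.Now().Add(5 * time.Second)
            if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
                return // connection is gone; reconnect logic is elided
            }
        }
    }
}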

Implement a Heartbeat service to send heartbeats

Our realtime platform already maintains the SSE connection for realtime message/event delivery; we can leverage that connection to derive a user's online status.
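A sketch of that idea; the ConnectionRegistry interface and its ActiveUserIDs method are assumptions standing in for whatever the realtime platform actually exposes:

package heartbeat

import (
    "context"
    "log"
    "time"
)

// ConnectionRegistry stands in for the realtime platform's connection state;
// ActiveUserIDs is a hypothetical method listing users with a live SSE
// connection.
type ConnectionRegistry interface {
    ActiveUserIDs() []int64
}

// run reports every connected user as online on a fixed cadence, so clients
// never have to send explicit heartbeats themselves.
func run(ctx context.Context, reg ConnectionRegistry, touch func(context.Context, int64) error) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            for _, id := range reg.ActiveUserIDs() {
                if err := touch(ctx, id); err != nil {
                    log.Printf("touch user %d: %v", id, err)
                }
            }
        }
    }
}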

Avoid spikes in heartbeat
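One common way to avoid the spike is client-side jitter; a minimal sketch (the 30-second base interval and ±10% jitter window are arbitrary choices):

package heartbeat

import (
    "context"
    "log"
    "math/rand"
    "time"
)

// heartbeatLoop sleeps the base interval plus a random offset in [-10%, +10%)
// before each heartbeat, so clients that started at the same moment drift
// apart instead of all hitting the server in the same instant.
func heartbeatLoop(ctx context.Context, send func() error) {
    const base = 30 * time.Second
    for {
        jitter := time.Duration(rand.Int63n(int64(base/5))) - base/10
        select {
        case <-ctx.Done():
            return
        case <-time.After(base + jitter):
            if err := send(); err != nil {
                log.Printf("heartbeat failed: %v", err)
            }
        }
    }
}

With send wired to either an HTTP PUT or a WebSocket ping, the same loop covers both transports above.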

Failure handling

Backend HTTP Server/Handler failure

The backend HTTP server/handler is stateless and can run with multiple replicas (a Deployment in K8S).

Database node failure

Heartbeat failure

Heartbeat service failure

The heartbeat service is stateless. With a K8S Deployment (replicas=1), a new instance will be provisioned to replace the terminated one. However, provisioning a new instance takes time; during that gap the heartbeat handler receives no heartbeats and would treat users as offline. (We do not want to broadcast false status changes to all other users.)

Heartbeat handler failure

Unfortunately the timer is in-memory: if we lose the handler instance, we also lose the timer. While waiting for a new handler to be provisioned, offline status broadcasting will be delayed (because the timer is reset).

// TODO: Can we avoid using a timer in the heartbeat handler? E.g., set key expiration in Redis, or leverage a distributed scheduler?
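One possible answer to the TODO, sketched assuming Redis keyspace notifications are enabled (notify-keyspace-events "Ex") and presence keys carry a TTL as in the data-store sketch above:

package presence

import (
    "context"
    "strings"

    "github.com/redis/go-redis/v9"
)

// watchExpirations replaces the in-memory timer: every presence key carries
// a TTL, and when it lapses (no heartbeat arrived in time) Redis publishes
// an expired event. Redis owns the timer, so losing a handler instance no
// longer resets it.
func watchExpirations(ctx context.Context, rdb *redis.Client, broadcastOffline func(userID string)) {
    sub := rdb.PSubscribe(ctx, "__keyevent@0__:expired")
    defer sub.Close()
    for msg := range sub.Channel() { // channel closes when the subscription does
        if id, ok := strings.CutPrefix(msg.Payload, "presence:"); ok {
            broadcastOffline(id)
        }
    }
}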

Scaling

References