Slack recently described how it delivers millions of real-time messages around the world every day. The company gives a detailed look at the Pub/Sub architecture it built to handle real-time messaging at scale. It highlights the particular challenges of delivering real-time messages across time zones and regions, and how Slack engineers designed the infrastructure to meet them.
Slack Senior Software Engineer Sameera Thangudu explains the importance of this architecture:
"Our servers serve tens of millions of channels per host, tens of millions of connected clients, and our system delivers messages worldwide in 500 ms. Projections show that the linear scalability of the current architecture will allow us to serve more customers."
She said the company plans to enhance its architecture to serve a larger customer base.
The backend of the system consists of several services. A Channel Server (CS) is a stateful, in-memory server that holds channel history. Consistent hashing maps each CS to a subset of channels. At peak times, each host serves approximately 16 million channels. The Consistent Hash Ring Manager (CHARM) manages the consistent hash ring of CSs and allows an unhealthy CS to be replaced within 20 seconds. Consul stores the latest consistent hash ring configuration.
Source: https://slack.engineering/real-time-messaging/
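The channel-to-server mapping described above can be sketched with a small consistent hash ring. This is a minimal illustration, not Slack's implementation: the `HashRing` class, the server names, the channel IDs, and the number of virtual nodes are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent hash ring mapping channel IDs to Channel Servers."""

    def __init__(self, servers, vnodes=64):
        self.vnodes = vnodes   # virtual nodes per server (assumed value)
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # hash position -> server name
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, channel_id: str) -> str:
        """Return the server owning the first ring point at or after the key."""
        idx = bisect.bisect_left(self._points, _hash(channel_id))
        return self._owner[self._points[idx % len(self._points)]]


# Hypothetical server names and channel IDs.
ring = HashRing(["cs-1", "cs-2", "cs-3"])
channels = [f"C{n:05d}" for n in range(1000)]
before = {c: ring.lookup(c) for c in channels}

ring.remove("cs-2")   # an unhealthy server leaves the ring
ring.add("cs-4")      # its replacement joins
after = {c: ring.lookup(c) for c in channels}
```

In this sketch, a channel that was not on `cs-2` either stays on its original server or moves to the new `cs-4`; no channel is shuffled between the two surviving servers. That locality is what makes swapping out a single server cheap.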
A Gateway Server (GS), like a CS, is a stateful, in-memory server. It holds user information and WebSocket channel subscriptions, and acts as the interface between Slack clients and CSs. GSs are deployed across multiple geographic regions to keep connections fast. The Admin Server (AS) is a stateless, in-memory server that interfaces between the Webapp backend and CSs. Finally, the Presence Server (PS) keeps track of which users are online and powers the green presence dots in Slack clients.
Every Slack client keeps a persistent WebSocket connection to Slack's servers so that it can receive real-time events and keep its state up to date. A client goes through several steps to set up that connection. It first gets a user token and WebSocket connection configuration from the Webapp backend. The client then opens a WebSocket connection to the nearest edge region, where a GS fetches the user information and sends the first message to the client. Envoy load balances incoming traffic and handles TLS termination.
Source: https://slack.engineering/real-time-messaging/
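The connection steps above can be sketched as a small in-process simulation. Everything here is a stand-in: the `Webapp`, `Gateway`, and `connect` names, the latency table used to pick the nearest region, and the shape of the first message are assumptions for illustration, and the Envoy and TLS layer is left out.

```python
from dataclasses import dataclass


@dataclass
class ConnectionConfig:
    user_token: str
    latency_ms: dict  # region -> round-trip estimate; a hypothetical way to pick "nearest"


class Webapp:
    """Stand-in for the Webapp backend (step 1: token and connection config)."""

    def bootstrap(self, user_id: str) -> ConnectionConfig:
        return ConnectionConfig(
            user_token=f"token-{user_id}",
            latency_ms={"us-east": 85, "eu-west": 20},
        )


class Gateway:
    """Stand-in for a Gateway Server in one edge region."""

    def __init__(self, region: str, users: dict):
        self.region = region
        self.users = users     # token -> user info, replacing a real lookup
        self.connected = {}    # token -> user info for open connections

    def accept(self, token: str) -> dict:
        user = self.users[token]          # step 3: GS fetches user information
        self.connected[token] = user
        # Step 4: first message to the client; the "hello" shape is illustrative.
        return {"type": "hello", "user": user["name"], "region": self.region}


def connect(user_id: str, webapp: Webapp, gateways: dict) -> dict:
    config = webapp.bootstrap(user_id)
    nearest = min(config.latency_ms, key=config.latency_ms.get)  # step 2
    return gateways[nearest].accept(config.user_token)


users = {"token-U1": {"name": "alice"}}
gateways = {r: Gateway(r, users) for r in ("us-east", "eu-west")}
first_message = connect("U1", Webapp(), gateways)
```

With the assumed latency table, the client lands on the `eu-west` gateway, which records the connection and returns the first message.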
Once clients are connected, each message sent in a channel is broadcast to every online client in that channel. The message passes through the Webapp API, an AS, and a CS before being sent to all subscribed GSs around the world. Each GS that receives the message forwards it to every connected client subscribed to that channel ID.
Source: https://slack.engineering/real-time-messaging/
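The two-level fan-out described above (CS to subscribed GSs, then GS to subscribed clients) can be sketched as follows. This is a simplified in-memory model under assumed names (`ChannelServer`, `GatewayServer`, `publish`, `deliver`); it skips the Webapp API and AS hops and does no real networking.

```python
from collections import defaultdict


class ChannelServer:
    """Tracks which gateways have subscribed to each channel."""

    def __init__(self):
        self.gateways = defaultdict(list)   # channel_id -> gateway servers

    def register(self, channel_id, gateway):
        if gateway not in self.gateways[channel_id]:
            self.gateways[channel_id].append(gateway)

    def publish(self, channel_id, message):
        # First hop: only gateways subscribed to this channel get the message.
        for gateway in self.gateways.get(channel_id, []):
            gateway.deliver(channel_id, message)


class GatewayServer:
    """Tracks which connected clients have subscribed to each channel."""

    def __init__(self, name):
        self.name = name
        self.subscribers = defaultdict(set)  # channel_id -> client ids
        self.delivered = []                  # (client_id, message), kept for inspection

    def subscribe(self, client_id, channel_id, channel_server):
        self.subscribers[channel_id].add(client_id)
        channel_server.register(channel_id, self)

    def deliver(self, channel_id, message):
        # Second hop: forward to every local client subscribed to the channel.
        for client_id in sorted(self.subscribers.get(channel_id, ())):
            self.delivered.append((client_id, message))


cs = ChannelServer()
gs_us, gs_eu = GatewayServer("gs-us"), GatewayServer("gs-eu")
gs_us.subscribe("alice", "C1", cs)
gs_eu.subscribe("bob", "C1", cs)
gs_eu.subscribe("carol", "C2", cs)

cs.publish("C1", "hello")
```

After the publish, `alice` and `bob` each have the message recorded on their own gateway, while `carol`, who is subscribed only to `C2`, receives nothing.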
Apart from chat messages, events are another type of message that changes client state in real time. Transient events, such as a user typing in a channel, follow a slightly different flow because they are not persisted to the database. The diagram below illustrates this flow.
Source: https://slack.engineering/real-time-messaging/
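The difference between persisted messages and transient events can be sketched as a single branch in a handler. The `Event` and `Pipeline` names, the `transient` flag, the event kinds, and the choice to persist before broadcasting are assumptions for illustration; the source states only that transient events are not persisted.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    channel_id: str
    kind: str               # e.g. "message" or "user_typing" (hypothetical names)
    payload: str = ""
    transient: bool = False


@dataclass
class Pipeline:
    store: list = field(default_factory=list)      # stand-in for the database
    broadcast: list = field(default_factory=list)  # stand-in for fan-out to gateways

    def handle(self, event: Event) -> None:
        if not event.transient:
            self.store.append(event)   # chat messages are persisted
        self.broadcast.append(event)   # every event is still sent to subscribers


pipeline = Pipeline()
pipeline.handle(Event("C1", "message", "hi team"))
pipeline.handle(Event("C1", "user_typing", transient=True))
```

Both events end up in `broadcast`, but only the chat message is written to `store`.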