Design a Chat System
Step 1 - Understand the problem and establish design scope #
Requirements:
- Both 1-on-1 and group chat
- Mobile + web
- 50 million DAU
- Group max: 100 members
- Features: 1-on-1 chat, group chat, online presence indicator
- Text only, max 100,000 chars per message
- No end-to-end encryption (discuss if time allows)
- Chat history stored forever
- Multiple device support (same account logged into multiple devices)
Step 2 - Propose high-level design and get buy-in #
Communication protocols #
Sender side: HTTP with keep-alive (persistent connections, fewer TCP handshakes)
Receiver side — three approaches:
| Technique | Description | Drawbacks |
|---|---|---|
| Polling | Client periodically asks server for messages | Wastes resources; mostly empty responses |
| Long polling | Client holds connection open until messages arrive or timeout | Server affinity issues; can't detect client disconnection; still inefficient |
| WebSocket | Bidirectional, persistent connection; initiated as HTTP then upgraded | Best option of the three; persistent connections need efficient connection management on the server |

Final choice: WebSocket for both sending and receiving — simplifies design. HTTP used for everything else (signup, login, profile).

High-level architecture #

Three categories:
- Stateless services (behind a load balancer): login, signup, user profile. Can be monolithic or microservices. Service discovery recommends the best chat server for a client (based on geographic location and server capacity).
- Stateful service (chat service): each client maintains a persistent WebSocket connection to a chat server and stays connected to the same server for as long as it is available.
- Third-party integration: push notifications for messages that arrive while the recipient is offline.
Scalability note #
1M concurrent users × 10KB per connection ≈ 10GB memory — could fit on one server in theory, but single point of failure is unacceptable.
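The estimate above can be checked with a quick back-of-the-envelope calculation. The 10 KB per connection figure comes from the text; the use of decimal units (1 KB = 1,000 bytes) is an assumption made to keep the numbers round.

```python
# Back-of-the-envelope memory estimate for holding open WebSocket connections.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 GB = 1,000,000,000 bytes).
CONCURRENT_USERS = 1_000_000
BYTES_PER_CONNECTION = 10 * 1_000  # 10 KB per connection, figure from the text

total_bytes = CONCURRENT_USERS * BYTES_PER_CONNECTION
total_gb = total_bytes / 1_000_000_000

print(f"{total_gb:.0f} GB")  # prints "10 GB"
```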
Adjusted high-level design #

- Chat servers: Message sending/receiving via WebSocket
- Presence servers: Online/offline status
- API servers: Login, signup, profile
- Notification servers: Push notifications
- Key-value store: Chat history persistence
Storage #
Generic data (user profile, settings, friends list) → relational database with replication/sharding.
Chat history data — characteristics:
- Enormous volume: Facebook Messenger + WhatsApp process 60 billion messages/day
- Mostly recent chats accessed; old chat random access (search, mentions) must still work
- Read:write ratio ≈ 1:1 for 1-on-1 chat
Why key-value store for messages:
- Easy horizontal scaling
- Very low latency
- Relational DBs struggle with long-tail data; indexes grow large, random access expensive
- Proven: Facebook Messenger uses HBase, Discord uses Cassandra
Data models #
1-on-1 message table: Primary key = message_id (determines sequence; created_at insufficient — two messages can have same timestamp).

Group chat message table: Composite primary key = (channel_id, message_id). channel_id is partition key.
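A minimal sketch of the two tables as Python dataclasses. Only `message_id`, `channel_id`, and `created_at` are named in the notes; the other field names (`message_from`, `message_to`, `user_id`, `content`) are illustrative assumptions. The example also shows why ordering by `message_id` works where `created_at` cannot.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DirectMessage:
    message_id: int     # primary key; also defines message order
    message_from: int   # assumed field name
    message_to: int     # assumed field name
    content: str        # assumed field name
    created_at: float


@dataclass(frozen=True)
class GroupMessage:
    channel_id: int     # partition key
    message_id: int     # orders messages within the channel
    user_id: int        # assumed field name (the sender)
    content: str        # assumed field name
    created_at: float

    @property
    def primary_key(self) -> Tuple[int, int]:
        # Composite primary key: (channel_id, message_id)
        return (self.channel_id, self.message_id)


# Two messages created at the same instant can still be ordered by message_id.
same_instant = 1_700_000_000.0
messages = [
    DirectMessage(2, 1, 9, "second", same_instant),
    DirectMessage(1, 1, 9, "first", same_instant),
]
ordered = sorted(messages, key=lambda m: m.message_id)
print([m.content for m in ordered])  # ['first', 'second']
```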

Message ID generation:
- Must be unique and sortable by time (newer = higher ID)
- Options: Snowflake (global 64-bit), or local sequence number generator (unique within a channel, easier to implement, sufficient for message ordering)
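The local sequence number option could be sketched as follows. The class name `LocalSequenceGenerator` is hypothetical, and the counters live in memory only; a real deployment would need to persist them, which this sketch does not attempt.

```python
import threading
from collections import defaultdict


class LocalSequenceGenerator:
    """Hands out increasing message IDs that are unique within one channel."""

    def __init__(self) -> None:
        self._last_id = defaultdict(int)  # channel_id -> last issued ID
        self._lock = threading.Lock()     # keeps IDs unique across threads

    def next_id(self, channel_id: int) -> int:
        with self._lock:
            self._last_id[channel_id] += 1
            return self._last_id[channel_id]


gen = LocalSequenceGenerator()
ids_a = [gen.next_id(channel_id=1) for _ in range(3)]
ids_b = [gen.next_id(channel_id=2) for _ in range(2)]
print(ids_a, ids_b)  # [1, 2, 3] [1, 2]
```

IDs repeat across channels, which is acceptable here because `channel_id` is part of the key and ordering only matters within a channel.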
Step 3 - Design deep dive #
Service discovery #
Apache ZooKeeper registers all available chat servers and picks the best one for a client.

1. User A logs in.
2. The load balancer routes the login request to the API servers.
3. After authentication, service discovery picks the best chat server for User A.
4. User A connects to that chat server via WebSocket.
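The selection step could look roughly like the sketch below. It does not use ZooKeeper's API; the `ChatServer` fields and the rule (prefer servers in the user's region, then the lowest load ratio) are assumptions chosen to illustrate the "geo location, server capacity" criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChatServer:
    host: str
    region: str
    connections: int  # current open connections
    capacity: int     # maximum connections the server accepts


def pick_chat_server(servers: List[ChatServer], user_region: str) -> ChatServer:
    """Prefer servers in the user's region, then pick the least loaded one."""
    available = [s for s in servers if s.connections < s.capacity]
    if not available:
        raise RuntimeError("no chat server has spare capacity")
    local = [s for s in available if s.region == user_region]
    candidates = local or available  # fall back to any region
    return min(candidates, key=lambda s: s.connections / s.capacity)


servers = [
    ChatServer("chat-1", "eu", 900, 1000),
    ChatServer("chat-2", "eu", 400, 1000),
    ChatServer("chat-3", "us", 100, 1000),
]
print(pick_chat_server(servers, "eu").host)  # chat-2
print(pick_chat_server(servers, "ap").host)  # chat-3
```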
Message flows #
1-on-1 chat flow:

1. User A sends a message to Chat server 1.
2. Chat server 1 obtains a message ID from the ID generator.
3. Chat server 1 sends the message to the message sync queue.
4. The message is stored in the KV store.
5. Delivery depends on User B's status:
   - 5a. User B online → forwarded to Chat server 2 → delivered
   - 5b. User B offline → push notification sent
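The flow above can be simulated in a single process. Every component here is an in-memory stand-in (a counter for the ID generator, a `deque` for the sync queue, a `dict` for the KV store), and all names are hypothetical; the sketch only shows the order of the steps.

```python
from collections import deque
from itertools import count

id_generator = count(1)                     # stand-in for the ID generator
sync_queue = deque()                        # stand-in for the message sync queue
kv_store = {}                               # message_id -> message
online_users = {"user_b": "chat-server-2"}  # user -> chat server holding the socket
delivered = []                              # (chat server, message_id)
push_notifications = []                     # (user, message_id)


def send_message(sender: str, recipient: str, text: str) -> int:
    """Steps 1-3: receive the message, assign an ID, enqueue it."""
    message_id = next(id_generator)
    sync_queue.append(
        {"id": message_id, "from": sender, "to": recipient, "text": text}
    )
    return message_id


def process_sync_queue() -> None:
    """Steps 4-5: persist each message, then deliver it or notify."""
    while sync_queue:
        message = sync_queue.popleft()
        kv_store[message["id"]] = message
        server = online_users.get(message["to"])
        if server is not None:
            delivered.append((server, message["id"]))                   # 5a
        else:
            push_notifications.append((message["to"], message["id"]))  # 5b


send_message("user_a", "user_b", "hi B")  # user_b is online
send_message("user_a", "user_c", "hi C")  # user_c is offline
process_sync_queue()
print(delivered)           # [('chat-server-2', 1)]
print(push_notifications)  # [('user_c', 2)]
```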
Multi-device sync:

Each device maintains cur_max_message_id. New messages: recipient_id == current_user_id AND message_id > cur_max_message_id. Each device syncs independently.
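A sketch of that sync rule, assuming messages are plain dictionaries held in a list that stands in for the KV store. `cur_max_message_id`, `recipient_id`, and `message_id` come from the text; the `Device` class and `sync` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Device:
    user_id: str
    cur_max_message_id: int = 0  # highest message ID this device has seen


def sync(device: Device, store: List[Dict]) -> List[Dict]:
    """Return messages addressed to the device's user that it has not seen."""
    new_messages = sorted(
        (
            m for m in store
            if m["recipient_id"] == device.user_id
            and m["message_id"] > device.cur_max_message_id
        ),
        key=lambda m: m["message_id"],
    )
    if new_messages:
        device.cur_max_message_id = new_messages[-1]["message_id"]
    return new_messages


store = [
    {"message_id": 1, "recipient_id": "alice", "text": "a"},
    {"message_id": 2, "recipient_id": "alice", "text": "b"},
    {"message_id": 3, "recipient_id": "bob", "text": "c"},
    {"message_id": 4, "recipient_id": "alice", "text": "d"},
]
phone = Device("alice", cur_max_message_id=2)  # already saw messages 1 and 2
laptop = Device("alice")                       # has seen nothing yet

phone_ids = [m["message_id"] for m in sync(phone, store)]
laptop_ids = [m["message_id"] for m in sync(laptop, store)]
print(phone_ids, laptop_ids)  # [4] [1, 2, 4]
```

Because each device keeps its own cursor, the phone and the laptop catch up independently without coordinating with each other.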
Small group chat flow:

Message from User A is copied to each group member's message sync queue (each recipient has an inbox). Good for small groups (WeChat caps at 500). For large groups, storing a copy per member is too expensive.
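The per-recipient inbox idea can be sketched as a simple fan-out on write. The function and variable names are hypothetical, and skipping the sender's own inbox is an assumption of this sketch.

```python
from collections import defaultdict, deque
from typing import Dict, List

# user_id -> that user's message sync queue (inbox)
inboxes = defaultdict(deque)


def fan_out(message: Dict, members: List[str]) -> None:
    """Copy the message into the inbox of every member except the sender."""
    for member in members:
        if member != message["from"]:
            inboxes[member].append(message)


group = ["user_a", "user_b", "user_c"]
fan_out({"id": 1, "from": "user_a", "text": "hello group"}, group)

print(sorted(inboxes))               # ['user_b', 'user_c']
print(inboxes["user_b"][0]["text"])  # hello group
```

Each send costs one write per recipient, which is why the approach suits small groups and becomes expensive for large ones.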

Online presence #
Login: WebSocket connected → status saved as "online" + last_active_at in KV store.

Logout: Status changed to "offline" in KV store.

Disconnection — heartbeat mechanism: Client sends heartbeat every 5s. If no heartbeat for x seconds (e.g., 30s), mark offline. Avoids flickering from brief disconnections.
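A minimal sketch of the heartbeat check, using the 30-second example timeout from the text. The `PresenceTracker` class is hypothetical, and timestamps are passed in explicitly so the example is deterministic; a real server would read a clock instead.

```python
from typing import Dict


class PresenceTracker:
    """Marks a user offline once no heartbeat has arrived within the timeout."""

    def __init__(self, timeout_seconds: float = 30.0) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_heartbeat: Dict[str, float] = {}

    def heartbeat(self, user_id: str, now: float) -> None:
        self._last_heartbeat[user_id] = now

    def is_online(self, user_id: str, now: float) -> bool:
        last = self._last_heartbeat.get(user_id)
        return last is not None and now - last < self.timeout_seconds


tracker = PresenceTracker(timeout_seconds=30.0)
tracker.heartbeat("alice", now=0.0)
tracker.heartbeat("alice", now=5.0)

print(tracker.is_online("alice", now=20.0))  # True: 15s since last heartbeat
print(tracker.is_online("alice", now=40.0))  # False: 35s since last heartbeat
```

A connection drop of a few seconds never reaches the timeout, so the user's status does not flip back and forth.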

Online status fanout — pub/sub model: Each friend pair has a channel. Status change published to all friend channels. Friends subscribed. Works well for small groups (WeChat caps at 500). For large groups: fetch status on-demand (enter group, manual refresh).
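The per-friend-pair channels can be sketched with a tiny in-memory publish/subscribe bus. The `PresenceBus` class and the `A-B` channel naming are assumptions for illustration; a production system would use a real message broker.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class PresenceBus:
    """In-memory pub/sub: one channel per (user, friend) pair."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)  # channel -> callbacks

    def subscribe(self, channel: str, callback: Callable[[Dict], None]) -> None:
        self._subscribers[channel].append(callback)

    def publish(self, channel: str, event: Dict) -> None:
        for callback in self._subscribers.get(channel, []):
            callback(event)


def channel_name(user: str, friend: str) -> str:
    return f"{user}-{friend}"  # assumed naming scheme


def publish_status(bus: PresenceBus, user: str, friends: List[str], status: str) -> None:
    """Publish the user's status change to the channel of every friend."""
    for friend in friends:
        bus.publish(channel_name(user, friend), {"user": user, "status": status})


bus = PresenceBus()
received = {"B": [], "C": []}
bus.subscribe(channel_name("A", "B"), received["B"].append)
bus.subscribe(channel_name("A", "C"), received["C"].append)

publish_status(bus, "A", ["B", "C"], "online")
print(received["B"])  # [{'user': 'A', 'status': 'online'}]
```

One status change triggers one publish per friend, which is the cost that makes on-demand fetching preferable for large groups.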

Step 4 - Wrap up #
Additional talking points:
- Media files: Compression, cloud storage, thumbnails
- End-to-end encryption: Only sender and recipient can read messages (WhatsApp model)
- Client-side caching: Reduce data transfer
- Geo-distributed network: Cache user data near users for fast load times (Slack's Flannel)
- Error handling: Chat server failure → ZooKeeper provides a new server; message retry via queues
- Message resend mechanism: Retry + queueing