When we set out to build Starforge MMO, the vision was clear: a living galactic map where thousands of commanders watch fleet movements, battle outcomes, and sector captures unfold in real time, directly in the browser, without a page refresh. No "check back in 10 minutes." No polling-based fake-real-time. Actual real-time.
That vision turned out to be considerably harder to build than it was to describe. This is the engineering story of how we got there: the good decisions, the catastrophic mistakes, the race conditions that made us question our career choices, and the architecture we landed on after several painful iterations.
The First Implementation: Naive but Necessary
The initial prototype used a single Node.js server with the ws library: plain WebSockets, no abstraction, no clustering. Every connected client maintained an open socket to one server process. Game state updates triggered broadcasts: when a fleet moved, we serialised the delta to JSON and pushed it to every connected client.
This worked beautifully up to about 200 concurrent connections in testing. Performance was good, latency was under 20ms for local connections, and the galactic map felt alive in a way that justified the entire direction.
Then we opened closed beta to 500 players and immediately learned the first lesson of real-time multiplayer: naïve broadcast does not scale.
The problem was simple in retrospect: with 500 players, a single fleet movement event was being serialised and sent 500 times, once to every connected client, regardless of whether they were viewing the affected sector. A major battle with 30 fleet-move events per second was generating 15,000 individual message serialisations per second on a single process. CPU usage hit 95%. Latency spiked to 400–800ms. Players started complaining that their map was "lagging" even though the server wasn't actually slow; it was just drowning in unnecessary work.
The fix was spatial filtering: only send events to clients whose current map view intersects with the sector where the event occurred. This required tracking each client's viewport position, but it reduced broadcast volume by approximately 85% for typical play sessions and brought CPU usage down to 30% under the same load.
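A minimal sketch of the viewport filter, assuming sectors and viewports are axis-aligned rectangles on a 2D map. The names here (intersects, recipientsFor, broadcast, the viewport field) are illustrative and not taken from the Starforge codebase.

```javascript
// Hypothetical shape: a rectangle is { x, y, w, h } in map units.
function intersects(a, b) {
  return a.x < b.x + b.w && b.x < a.x + a.w &&
         a.y < b.y + b.h && b.y < a.y + a.h;
}

// Return only the clients whose viewport overlaps the sector's bounds.
function recipientsFor(clients, sectorBounds) {
  return clients.filter((c) => intersects(c.viewport, sectorBounds));
}

// Serialise once, then hand the same string to each matching client.
// `send` is an injected function, e.g. (client, payload) => client.socket.send(payload).
function broadcast(clients, sectorBounds, event, send) {
  const payload = JSON.stringify(event);
  const targets = recipientsFor(clients, sectorBounds);
  for (const client of targets) send(client, payload);
  return targets.length;
}
```

Serialising once per event instead of once per recipient is a separate saving that this sketch happens to include; the filter itself is what cuts the number of sends.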
Lesson learned: Broadcast without filtering is not a real-time system; it's a distributed denial-of-service attack you're running on yourself.
The Race Condition Problem
Spatial filtering solved the broadcast problem, but it exposed a deeper issue we'd been lucky not to hit at smaller scale: race conditions in game state mutations.
Here's the specific bug that cost us three days of debugging: when two fleets from different players entered the same sector simultaneously, the server was processing both "fleet enter sector" events in parallel. Each event read the current sector state, modified it, and wrote it back. Because these operations weren't atomic, one write could overwrite the other, resulting in a sector that had only one fleet in it when it should have had two, or in one case, a fleet existing in two sectors simultaneously.
That second case, a fleet in two sectors at once, was genuinely surreal to observe in testing. The fleet appeared in both sectors on the map, two players received different combat notifications about the same fleet, and when we tried to query the fleet's position from the database, the result changed depending on which database replica we hit first.
The root cause was that we were using MongoDB without transactions for state that required atomicity. Each event handler was a separate async function that read-modified-wrote without any locking mechanism.
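The lost update is easy to reproduce in a few lines. This sketch uses an in-memory Map as a stand-in for the database, and readSector, writeSector, and enterSectorUnsafe are hypothetical helpers; the gap between the read and the write is where the second handler slips in.

```javascript
// In-memory stand-in for the sector collection (not MongoDB).
const store = new Map([["47", { fleets: [] }]]);

// Each helper yields to the event loop, as a real database call would.
async function readSector(id) {
  await Promise.resolve();
  return { fleets: [...store.get(id).fleets] };
}

async function writeSector(id, state) {
  await Promise.resolve();
  store.set(id, state);
}

// Unsafe read-modify-write: nothing stops two handlers from interleaving.
async function enterSectorUnsafe(sectorId, fleetId) {
  const state = await readSector(sectorId); // both handlers read an empty list
  state.fleets.push(fleetId);
  await writeSector(sectorId, state);       // the later write replaces the earlier one
}

// Two fleets enter sector 47 "simultaneously".
async function demoLostUpdate() {
  await Promise.all([
    enterSectorUnsafe("47", "fleet-A"),
    enterSectorUnsafe("47", "fleet-B"),
  ]);
  return store.get("47").fleets; // one fleet is missing
}
```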
We explored several solutions:
Option 1: MongoDB transactions. We implemented multi-document transactions and tested them under load. They worked correctly but introduced 40–80ms of additional latency per operation due to transaction overhead. For a real-time game where we were targeting sub-50ms event propagation, adding 40–80ms to every state mutation was unacceptable.
Option 2: Event queue with single-writer pattern. We moved all state mutations through a single event queue. Only one worker processes mutations at a time, eliminating the race condition. This worked but introduced a different problem: under high load, the queue backed up, and events that were supposed to be real-time were being processed 2–4 seconds after they occurred.
Option 3: Sector-partitioned locks. The solution we ultimately implemented. Each sector in the galaxy has its own lock. Events affecting a sector acquire that sector's lock before reading or mutating state, then release it immediately after. Since most events affect at most two sectors, lock contention is localised rather than global. Under typical play, a sector lock is held for less than 5ms, which means negligible contention.
Combined with Redis-based distributed locks (more on our Redis layer below), this gave us atomicity without transaction overhead and without queue backup under load.
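A minimal in-process sketch of sector-partitioned locking, assuming a single Node.js process; a multi-server deployment would need the lock held in a shared store instead. SectorLocks and withSectors are hypothetical names. Locks are taken in sorted order so that two events touching the same pair of sectors cannot deadlock each other.

```javascript
class SectorLocks {
  constructor() {
    // sectorId -> promise that settles once the most recent holder has released.
    // (A production version would also evict idle entries; this sketch does not.)
    this.tails = new Map();
  }

  // Wait for the current holder, if any, then return a release function.
  async acquire(sectorId) {
    const previous = this.tails.get(sectorId) ?? Promise.resolve();
    let release;
    const current = new Promise((resolve) => { release = resolve; });
    this.tails.set(sectorId, previous.then(() => current));
    await previous;
    return release;
  }

  // Run fn while holding the lock of every listed sector.
  async withSectors(sectorIds, fn) {
    const ordered = [...new Set(sectorIds)].sort(); // consistent order avoids deadlock
    const releases = [];
    try {
      for (const id of ordered) releases.push(await this.acquire(id));
      return await fn();
    } finally {
      for (const release of releases.reverse()) release();
    }
  }
}
```

A fleet moving from sector 12 to sector 47 would wrap its read-modify-write in something like locks.withSectors(["12", "47"], handler), while an unrelated event in sector 3 proceeds without waiting.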
Redis Pub/Sub: Scaling Beyond One Server
The sector-partitioned lock system worked on a single server. The next problem was that a single server wasn't going to be enough.
With closed beta running at 800 concurrent connections, we were at about 60% CPU on our primary server. Comfortable, but with open beta targeting 5,000–10,000 concurrent players, we needed horizontal scaling. The problem: WebSocket connections are stateful. Player A's socket is connected to Server 1. Player B's socket is connected to Server 2. When a battle occurs in Sector 47 that both players should see, how does the event notification reach both sockets?
This is the fundamental architectural challenge of real-time multiplayer at scale: the event occurs on one server, but the recipients may be connected to different servers.
We implemented a Redis pub/sub layer as the event bus between server instances. Here's how it works:
1. A game event occurs (fleet moves, battle starts, sector captured).
2. The game logic server publishes the event to a Redis channel, specifically a channel named for the affected sector (e.g., sector:events:47).
3. All WebSocket servers subscribed to that Redis channel receive the event.
4. Each WebSocket server checks which of its connected clients have that sector in their viewport.
5. Matching clients receive the event over their WebSocket connection.
This decouples game logic from WebSocket delivery entirely. The game logic servers don't know or care which WebSocket server a player is connected to. The WebSocket servers don't run any game logic. Redis sits between them as a low-latency message bus.
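The five steps above can be sketched with an in-memory bus standing in for Redis. InMemoryBus, WebSocketNode, publishEvent, and the visibleSectors set are hypothetical; only the channel naming (sector:events:47) comes from the text. A real deployment would replace the bus with a Redis client's publish and subscribe calls.

```javascript
// Stand-in for Redis pub/sub: channel name -> set of handlers.
class InMemoryBus {
  constructor() { this.handlers = new Map(); }

  subscribe(channel, handler) {
    if (!this.handlers.has(channel)) this.handlers.set(channel, new Set());
    this.handlers.get(channel).add(handler);
  }

  publish(channel, message) {
    for (const handler of this.handlers.get(channel) ?? []) handler(message);
  }
}

const channelFor = (sectorId) => `sector:events:${sectorId}`;

// One WebSocket server: holds its own clients and runs no game logic.
class WebSocketNode {
  constructor(bus) {
    this.bus = bus;
    this.clients = [];
  }

  addClient(client) { this.clients.push(client); }

  // Step 3: subscribe this server to a sector's channel.
  watchSector(sectorId) {
    this.bus.subscribe(channelFor(sectorId), (raw) => {
      // Steps 4 and 5: deliver only to clients with this sector in view.
      for (const c of this.clients) {
        if (c.visibleSectors.has(sectorId)) c.send(raw);
      }
    });
  }
}

// Steps 1 and 2: game logic publishes without knowing who is connected where.
function publishEvent(bus, sectorId, event) {
  bus.publish(channelFor(sectorId), JSON.stringify({ sectorId, ...event }));
}
```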
The numbers from our current staging environment running 4,000 simulated concurrent connections across three WebSocket servers:
- Average event-to-client latency: 18ms (event occurs to client WebSocket receipt)
- Redis pub/sub overhead: 2–4ms per event
- Peak throughput: 12,000 events/second across all channels without degradation
- p99 latency under peak load: 47ms
- Connection capacity per WebSocket server: ~2,500 concurrent connections at target CPU (under 40%)
These numbers support a target of 15,000–20,000 concurrent connections across 6–8 server instances, which covers our projected open beta load with comfortable headroom.
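The instance counts follow directly from the per-server figure. This small sketch assumes capacity scales linearly with the number of WebSocket servers, which the staging numbers suggest but do not prove; serversNeeded is a hypothetical helper.

```javascript
// Servers needed to hold a target number of connections at a given per-server capacity.
function serversNeeded(targetConnections, perServerCapacity) {
  return Math.ceil(targetConnections / perServerCapacity);
}

const PER_SERVER = 2500; // staging figure quoted in the list above

console.log(serversNeeded(15000, PER_SERVER)); // 6
console.log(serversNeeded(20000, PER_SERVER)); // 8
```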
The Reconnection Problem
One issue we underestimated until it became acute in closed beta: connection drops. Mobile players especially drop WebSocket connections constantly: network switches, brief signal losses, app backgrounding. When a player reconnects, they need to receive any events they missed during the disconnection.
Our first implementation had no reconnection state recovery. Players who disconnected for 30 seconds reconnected to a stale view of the galaxy. Fleets had moved without their client knowing. Battles had resolved. Sectors had changed hands. The client-server state was desynchronised, and the only fix was a full page reload, which is not an acceptable user experience.
The solution was a per-player event buffer in Redis with a 5-minute TTL. When a player disconnects, their user ID and last-received event sequence number are stored. On reconnection, the WebSocket server queries the buffer for all events since that sequence number and delivers them in order before resuming live updates. For most connection drops (under 30 seconds), this is seamless: the player's map updates with the missed events and they're current within 200ms of reconnecting.
For longer disconnections (over 5 minutes), we fall back to a full state sync from the database rather than trying to replay potentially thousands of buffered events.
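A sketch of the replay-or-resync decision, assuming each buffered event carries a monotonically increasing seq and a storedAt timestamp. The buffer here is a plain array rather than Redis, and PlayerEventBuffer and planReconnect are hypothetical names; the 5-minute window is the TTL described above.

```javascript
const BUFFER_TTL_MS = 5 * 60 * 1000; // 5-minute TTL from the text

class PlayerEventBuffer {
  constructor() { this.events = []; }

  append(seq, payload, now) {
    this.events.push({ seq, payload, storedAt: now });
  }

  // Drop entries older than the TTL (Redis key expiry would do this for us).
  prune(now) {
    this.events = this.events.filter((e) => now - e.storedAt <= BUFFER_TTL_MS);
  }

  // Everything after the client's last-received sequence number, in order.
  since(lastSeq) {
    return this.events
      .filter((e) => e.seq > lastSeq)
      .sort((a, b) => a.seq - b.seq);
  }
}

// Decide between replaying missed events and a full state sync.
function planReconnect(buffer, lastSeq, disconnectedAt, now) {
  if (now - disconnectedAt > BUFFER_TTL_MS) {
    return { mode: "full-sync", events: [] };
  }
  buffer.prune(now);
  return { mode: "replay", events: buffer.since(lastSeq) };
}
```

Timestamps are passed in as arguments rather than read from the clock, which keeps the decision logic easy to test.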
What We'd Do Differently
Two decisions we'd revisit if starting over:
Starting with WebSockets instead of Server-Sent Events for read-only clients. Most players spend the majority of their session in a read-heavy mode: watching the galactic map, monitoring fleet positions, reading event feeds. They're not sending data to the server constantly. Server-Sent Events (SSE) are simpler to implement for this use case, scale better, and reconnect automatically. We'd use SSE for the galactic map view and reserve WebSockets for interactive sessions (active combat, fleet command). Estimated reduction in WebSocket connection load: 40%.
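For a sense of what the SSE side would involve, here is a sketch of the wire format. The id, event, and data fields and the Last-Event-ID request header are part of the SSE specification; using the event sequence number as the id is an assumption that would let the automatic reconnect tie into the replay buffer. formatSse and resumeSeq are hypothetical helpers.

```javascript
// Format one event for a text/event-stream response.
function formatSse(seq, type, payload) {
  const data = JSON.stringify(payload); // compact JSON contains no raw newlines
  return `id: ${seq}\nevent: ${type}\ndata: ${data}\n\n`; // blank line ends the event
}

// On automatic reconnect, the browser reports the last id it saw.
// Node.js lowercases incoming header names.
function resumeSeq(headers) {
  const seq = Number.parseInt(headers["last-event-id"] ?? "", 10);
  return Number.isNaN(seq) ? 0 : seq;
}
```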
Not designing the event schema for versioning from the start. We've had to evolve the shape of game events as features changed, and every schema change required coordinated deploys of all consumers (game logic servers, WebSocket servers, the client JavaScript). We now have event versioning, but retrofitting it was painful. If we'd defined a versioned envelope format on day one, schema evolution would have been much smoother.
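A versioned envelope can be quite small. This sketch is one possible shape, not the format we shipped: the type, version, and payload fields, the upgraders table, and the fleet_move change from version 1 to version 2 are all invented for illustration.

```javascript
// Hypothetical envelope: { type, version, payload }.
function wrap(type, version, payload) {
  return { type, version, payload };
}

// Upgraders keyed by "type@version" lift a payload one version forward.
const upgraders = {
  // Assumed change: v1 carried a single `sector` field, v2 carries `from` and `to`.
  "fleet_move@1": (p) => ({ fleetId: p.fleetId, from: null, to: p.sector }),
};

// Bring an older envelope up to the version this consumer understands.
function upgrade(envelope, targetVersion) {
  let { type, version, payload } = envelope;
  while (version < targetVersion) {
    const step = upgraders[`${type}@${version}`];
    if (!step) throw new Error(`no upgrader for ${type}@${version}`);
    payload = step(payload);
    version += 1;
  }
  return { type, version, payload };
}
```

With something like this in place, a consumer that understands version 2 can keep accepting version 1 events during a rollout, so producers and consumers no longer have to deploy in lockstep.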
Where We Are Now
The current architecture handles real-time event delivery across a horizontally-scaled WebSocket layer with sector-partitioned locking for state consistency and Redis as the event bus. It works. The galactic map moves in real time. Fleet battles resolve and the results propagate to all viewing players within 50ms under normal load.
We're not done: the next major work is moving the game logic layer to an event-sourced architecture that makes state replay and debugging substantially easier. But that's a story for Dev Log #4.
For now: the ships move. The stars burn. The galaxy is live.