design – How can users be notified of changes during collaborative editing in a system like Google Docs?

“Designing Google Docs” is a popular system design interview question, and there are plenty of articles on it on the web. However, those mostly touch on merging changes, but don’t explain how to swiftly notify a user of a change made by another user in the same document. Supposedly, for collaborative editing getting notified of changes as fast as possible is one of the main requirements. So my question is, what design would allow to quickly notify all connected users of a document that another user just made a change? This design should meet the following requirements:

  • Deliver change notifications within seconds
  • Be scalable (up to 100 users can edit the document at the same time; no limit on the number of documents being concurrently edited in the system)
  • Be fault-tolerant

Change merging strategies are out of the scope of this question.

That was the question, now here are some of my thoughts and what I came up with:

The user client needs to have an open Websocket/longpolling connection to quickly get new changes. If this connection is handled by any random node, we have several ways of checking for new changes in the document:

  • Polling the database/some cache for new changes — puts unnecessary load on the system, is not going to be fast, doesn’t scale
  • Running the changes through a single message queue and processing it on each node to notify users connected to node — doesn’t scale, as the time between enqueuing a message and processing it would grow with the number of documents being edited at the same time

Neither of these two ideas seem to work, so what if user connections for a particular document are handled by a node designated to that document? Say, users working with documents D1, D2, D3 are routed to node N1, and users working with documents D4, D5 are routed to node N2. In this case a change made in a document by one of the collaborators can be sent to all the other collaborators as soon as this change is validated, merged and persisted to the database, because all collaborators are connected to the same node. Now, this design requires a separate registry service, which would:

  • Be aware which documents are being handled by which node
  • Designate a node to handle a document if this document isn’t currently being handled by any node
  • Do healthchecks against nodes to quickly reassign documents from a failing node to a healthy one

So, when a user wants to interact with a document, the user client makes a request to establish a connection, the node for the requested document is looked up in the registry, and the connection is established to this node.

Fault tolerance:

  • The client that makes a change gets notified once the change is persisted and acknowledged. If the client connection is broken, all unacknowledged changes get resent on reconnect.
  • If the client connection is broken, the client gets all the latest changes on reconnect.

Thus if the node unexpectedly shuts down, the routing service assigns the document to a new node, all users switch connections to that new node, get all the latest persisted changes, and unacknowledged changes are resent to the new node.

I think this design can work and is scalable, since it’s possible to introduce as many nodes as needed, spreading documents as thinly as needed between these nodes. However, I’m skeptical of the fact that instead of nodes that can handle incoming connections for any document I now have special-purpose nodes that can’t be killed without some hassle of switching all connected users to new nodes.

How do you find this design? What design would you advise?