TIL about Snowflake IDs

Or the weird numbers we see on Twitter, Discord, and other platforms.

I recently read a blog post about building a fault-resilient, consistent, and efficient distributed system for broadcasting messages, written by a senior of mine from college. Inspired by it, I started working through the Fly.io Distributed Systems Challenge. I got stuck on the same problem he had written about, so his post was a lifesaver and gave me a couple of ideas to try out.

But this post is not about that. It's about the weird numbers we see on Twitter, Discord, and other platforms. You know, those long numbers that look like this: https://x.com/CATIAManikin/status/1860383419210817897. I never really thought about them, and assumed they were just an index of some sort. I only truly understood them when I had to implement the second part of the challenge, which required us to generate unique IDs while making sure they can be generated in parallel and are unique across all nodes.

Now that we know our goal, let's think about how we can achieve it. The first thing that comes to mind is to use a centralized database to generate IDs, but then every node would need a network round trip for each ID, and the database would become both a bottleneck and a single point of failure. So we need a scheme where each node can generate IDs on its own, in parallel, while still guaranteeing uniqueness across all nodes.

Twitter introduced a format called Snowflake IDs: each ID is a 64-bit integer that is unique across all nodes. The format is as follows:

bit 0: Sign bit (always 0)
bits 1 - 41: Timestamp in milliseconds since epoch
bits 42 - 51: Node ID (unique for each node)
bits 52 - 63: Sequence number (increments for each ID generated in the same millisecond)
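
To make the layout concrete, here is a minimal Python sketch of how the three fields could be packed into one integer with shifts and ORs. The function name `pack_snowflake` and the constant names are my own, not taken from Twitter's implementation, and `timestamp_ms` is assumed to already be the offset from whichever epoch you pick.

```python
TIMESTAMP_BITS = 41
NODE_BITS = 10
SEQUENCE_BITS = 12


def pack_snowflake(timestamp_ms: int, node_id: int, sequence: int) -> int:
    """Pack the three fields into one integer; the sign bit stays 0."""
    # Reject values that would spill into a neighbouring field.
    if not 0 <= timestamp_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in 41 bits")
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node ID does not fit in 10 bits")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence does not fit in 12 bits")

    # Timestamp goes in the high bits, then the node ID, then the sequence.
    return (
        (timestamp_ms << (NODE_BITS + SEQUENCE_BITS))
        | (node_id << SEQUENCE_BITS)
        | sequence
    )
```

For example, `pack_snowflake(1, 1, 1)` returns 4198401, which is 2^22 + 2^12 + 1. Because the timestamp sits in the most significant bits, IDs sort roughly by creation time.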

Thus, for each millisecond, we can generate a maximum of 4,194,304 IDs. This is calculated as 4096 ($2^{12}$) sequence numbers $\times$ 1024 ($2^{10}$) nodes. While this is sufficient for most applications, the number of bits allocated to the sequence number or node ID can be adjusted, or the reference epoch can be modified, to cater to specific constraints and requirements.
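
Here is a rough sketch of what a per-node generator could look like in Python. This is my own illustration, not Twitter's code: the class name `SnowflakeGenerator`, the injectable `clock` parameter, and the choice to busy-wait when a node uses up its 4096 sequence numbers within one millisecond are all assumptions. `EPOCH_MS` is set to Twitter's epoch here, but any fixed instant in the past would do.

```python
import threading
import time

EPOCH_MS = 1288834974657  # Twitter's epoch; an arbitrary choice for this sketch
NODE_BITS = 10
SEQUENCE_BITS = 12
MAX_NODE_ID = (1 << NODE_BITS) - 1        # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1   # 4095
IDS_PER_MS = (MAX_NODE_ID + 1) * (MAX_SEQUENCE + 1)  # 4,194,304


class SnowflakeGenerator:
    """Generates IDs for a single node; safe to call from several threads."""

    def __init__(self, node_id, clock=lambda: time.time_ns() // 1_000_000):
        if not 0 <= node_id <= MAX_NODE_ID:
            raise ValueError("node ID does not fit in 10 bits")
        self._node_id = node_id
        self._clock = clock  # returns Unix time in milliseconds
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Reusing an old millisecond could produce duplicates.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4096 sequence numbers used: wait for the next ms.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (NODE_BITS + SEQUENCE_BITS))
                | (self._node_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Each node only needs its own `node_id` and a local clock, so no coordination happens at generation time. The one thing that must be agreed on up front is that no two nodes share a `node_id`.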

The layout also works in reverse: given a Snowflake ID, we can extract the timestamp, node ID, and sequence number. Trying this with the ID from the example above, we can see how it works:

  1. Convert to 64-bit Binary Representation
    The ID 1860383419210817897, written in binary and padded to 64 bits, looks like this:
    0001100111010001011010001101110011110000000101100110000101101001

  2. Split Binary into Components
    The binary representation can be divided into four parts:

    • Sign (1 bit): 0
    • Timestamp (41 bits): 00110011101000101101000110111001111000000
    • Node ID (10 bits): 0101100110
    • Sequence (12 bits): 000101101001
  3. Decode the Timestamp
    The timestamp portion (00110011101000101101000110111001111000000) represents the number of milliseconds since the Twitter epoch (Nov 04, 2010 01:42:54.657 UTC, which is 1288834974657 ms in Unix time). Converting it to decimal gives 443549971392. Adding this to the Twitter epoch gives us the final Unix timestamp:
    443549971392 + 1288834974657 = 1732384946049 ms, which translates to 2024-11-23 18:02:26.049 UTC.

  4. Decode the Node ID
    The node ID portion (0101100110) identifies the node that generated the ID. Converting it to decimal gives 358.

  5. Decode the Sequence Number
    The sequence portion (000101101001) represents the sequence number of the ID generated within the same millisecond. Converting it to decimal gives 361.
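
The five steps above can be sketched in a few lines of Python using shifts and masks. The function name `decode_snowflake` and the dictionary keys are my own choices for illustration, and the sketch assumes the ID uses Twitter's epoch and the 41/10/12 bit split described earlier.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657  # Nov 04, 2010 01:42:54.657 UTC


def decode_snowflake(snowflake: int) -> dict:
    """Split a Snowflake ID into its timestamp, node ID, and sequence."""
    sequence = snowflake & 0xFFF          # lowest 12 bits
    node_id = (snowflake >> 12) & 0x3FF   # next 10 bits
    offset_ms = snowflake >> 22           # remaining high bits
    unix_ms = offset_ms + TWITTER_EPOCH_MS
    # Build the datetime from integer milliseconds to avoid float rounding.
    created_at = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
        milliseconds=unix_ms
    )
    return {
        "offset_ms": offset_ms,
        "unix_ms": unix_ms,
        "created_at": created_at,
        "node_id": node_id,
        "sequence": sequence,
    }


tweet_id = 1860383419210817897
print(format(tweet_id, "064b"))  # step 1: the 64-bit binary representation

parts = decode_snowflake(tweet_id)
print(parts["offset_ms"])  # 443549971392
print(parts["unix_ms"])    # 1732384946049
print(parts["created_at"].isoformat(timespec="milliseconds"))
# 2024-11-23T18:02:26.049+00:00
print(parts["node_id"])    # 358
print(parts["sequence"])   # 361
```

Other platforms that use Snowflake-style IDs may pick a different epoch or bit split, so the constants would need to change for IDs that do not come from Twitter.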