TIL about Snowflake IDs
Or the weird numbers we see on Twitter, Discord, and other platforms.
I recently read a blog post on building a fault-resilient, consistent, and efficient distributed system for broadcasting messages, written by a senior of mine from college. Inspired by that, I started attempting the Fly.io Distributed Systems Challenge. I got stuck on the same problem he had written about, so his post was a lifesaver and gave me a couple of ideas to try out.
But this post is not about that. It's about the weird numbers we see on Twitter, Discord, and other platforms. You know, those long numbers that look like this: https://x.com/CATIAManikin/status/1860383419210817897. I never really thought about them, and assumed they were just an index of some sort. I only truly understood them when I had to implement the second part of the challenge, which requires us to generate IDs in parallel while making sure they are unique across all nodes.
Now that we know our goal, let's think about how we can achieve it. The first thing that comes to mind is a centralized database that hands out IDs, but every node would have to contact it for every single ID, which is inefficient and does not scale well. So we need a way for each node to generate IDs on its own, in parallel, while keeping them unique across all nodes.
Twitter introduced a format called Snowflake IDs. A Snowflake ID is a 64-bit integer that is unique across all nodes. Counting bits from the most significant end, the format is as follows:
bit 0: Sign bit (always 0)
bits 1 - 41: Timestamp in milliseconds since epoch
bits 42 - 51: Node ID (unique for each node)
bits 52 - 63: Sequence number (increments for each ID generated in the same millisecond)
Thus, for each millisecond, we can generate a maximum of 4,194,304 IDs. This is calculated as 2^12 (4,096) sequence numbers × 2^10 (1,024) nodes. While this is sufficient for most applications, the number of bits allocated to the sequence number or node ID can be adjusted, or the reference epoch can be modified, to cater to specific constraints and requirements.
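To make the layout concrete, here is a minimal sketch of a generator for a single node, assuming the bit widths above and the Twitter epoch. The class name `SnowflakeGenerator`, the injectable `clock` parameter, and the choices to wait when the sequence runs out and to raise when the clock goes backwards are my own assumptions for illustration, not Twitter's actual implementation.

```python
import threading
import time

TWITTER_EPOCH_MS = 1288834974657  # Nov 04, 2010 01:42:54.657 UTC

NODE_BITS = 10
SEQUENCE_BITS = 12

MAX_NODE_ID = (1 << NODE_BITS) - 1       # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1  # 4095

# Capacity per millisecond: per node, and across all possible nodes.
IDS_PER_MS_PER_NODE = 1 << SEQUENCE_BITS             # 4096
IDS_PER_MS_TOTAL = 1 << (NODE_BITS + SEQUENCE_BITS)  # 4194304


class SnowflakeGenerator:
    """Hypothetical per-node generator (illustrative sketch only)."""

    def __init__(self, node_id, epoch_ms=TWITTER_EPOCH_MS, clock=None):
        if not 0 <= node_id <= MAX_NODE_ID:
            raise ValueError(f"node_id must be in [0, {MAX_NODE_ID}]")
        self.node_id = node_id
        self.epoch_ms = epoch_ms
        # The clock returns Unix time in milliseconds; injectable for testing.
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Assumption: refuse to issue IDs rather than risk duplicates.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4096 sequence numbers used: wait for the next ms.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            # Pack: 41-bit timestamp | 10-bit node ID | 12-bit sequence.
            return (
                ((now - self.epoch_ms) << (NODE_BITS + SEQUENCE_BITS))
                | (self.node_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Calling `SnowflakeGenerator(node_id=1).next_id()` on each node yields IDs that never collide as long as every node has a distinct `node_id`, because the node ID occupies its own bits. No coordination between nodes is needed at generation time.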
This also works in reverse: given a Snowflake ID, we can extract the timestamp, node ID, and sequence number. Trying this with the ID from the example above, we can see how it works:
- Convert to 64-bit Binary Representation
  The ID 1860383419210817897 in binary looks like this:
  0001100111010001011010001101110011110000000101100110000101101001
- Split Binary into Components
  The binary representation can be divided into four parts:
  - Sign (1 bit): 0
  - Timestamp (41 bits): 00110011101000101101000110111001111000000
  - Node ID (10 bits): 0101100110
  - Sequence (12 bits): 000101101001
- Decode the Timestamp
  The timestamp portion (00110011101000101101000110111001111000000) represents the number of milliseconds since the Twitter epoch (Nov 04, 2010 01:42:54.657 UTC). Converting it to decimal gives 443549971392. Adding this to the Twitter epoch gives us the final timestamp: 1732384946049 ms, which translates to 2024-11-23 18:02:26.049 UTC.
- Decode the Node ID
  The node ID portion (0101100110) identifies the node that generated the ID. Converting it to decimal gives 358.
- Decode the Sequence ID
  The sequence portion (000101101001) represents the sequence number of the ID generated within the same millisecond. Converting it to decimal gives 361.
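The same decoding can be done with bit shifts and masks instead of slicing a binary string by hand. Below is a small sketch, assuming the 41/10/12 layout and the Twitter epoch described above. The function name `decode_snowflake` and the shape of the returned dictionary are my own choices for illustration.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657  # Nov 04, 2010 01:42:54.657 UTC
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_snowflake(snowflake, epoch_ms=TWITTER_EPOCH_MS):
    """Split a 64-bit Snowflake ID into its components (illustrative helper)."""
    sequence = snowflake & 0xFFF         # lowest 12 bits
    node_id = (snowflake >> 12) & 0x3FF  # next 10 bits
    since_epoch_ms = snowflake >> 22     # remaining 41 bits (sign bit is 0)
    timestamp_ms = since_epoch_ms + epoch_ms
    return {
        "binary": format(snowflake, "064b"),
        "since_epoch_ms": since_epoch_ms,
        "timestamp_ms": timestamp_ms,
        # Integer milliseconds via timedelta avoid float rounding.
        "created_at": UNIX_EPOCH + timedelta(milliseconds=timestamp_ms),
        "node_id": node_id,
        "sequence": sequence,
    }


parts = decode_snowflake(1860383419210817897)
print(parts["created_at"].isoformat())     # 2024-11-23T18:02:26.049000+00:00
print(parts["node_id"], parts["sequence"])  # 358 361
```

Platforms that use a different epoch or bit split would need different constants, so the defaults here only apply to IDs that follow Twitter's layout.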