UUIDs are obsolete in the age of Docker
May 18, 2023 in Engineering philosophy
This post originally appeared in Ukrainian in my daily channel on May 17.
So, imagine you need an identifier that can be safely generated in different places in your infrastructure, or just look less predictable than a sequential integer ID. And, you think, UUID is a nice well-known globally unique ID, I’ll take that. Unfortunately, you are signing up for a one-way road to a dead end.
What’s a UUID made of, anyway? Well, it’s straightforward to generate an identifier that’s unique in the bounds of one machine. The timestamp is already nearly unique; add some decollisioning sequence number on top, and you’re done. To make it universally unique, a simple trick is performed: you take the locally unique identifier, and tack on a unique identifier of the machine that generates it – a “node name”. That’s basically it: 10 bytes of locally unique ID plus 6 bytes of “node name”.
(Note: this is only correct about UUID version 1. However, it is what most applications use. The only other practical option is version 4 – the random UUID – but random is intuitively worse, right? Read on to find out.)
Now, where could you find a unique identifier of the machine? As it turns out, every network card has a unique hardware identifier – a MAC address – that is “burned in” by the manufacturer. Each manufacturer gets assigned a particular range of MACs, and they take care to distribute the identifiers without collisions. As the MAC address is, by definition, a unique identifier, it was picked as the standard node name for UUIDs. Technically, you can use an ID from some other source, but the implementations I’ve checked do use the MAC address for generating UUIDs. Setting aside the privacy concerns of exposing a unique fingerprint of your machine, it was at least a good method to establish uniqueness.
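To make the layout concrete, here is a minimal sketch that takes a version 1 UUID apart using Python's standard `uuid` module. The node value `EXAMPLE_NODE` and the helper `describe_v1` are made up for illustration; without an explicit `node`, `uuid.uuid1()` falls back to `uuid.getnode()`, which tries to find a hardware address.

```python
import uuid

# Hypothetical node value, used only to keep the example reproducible.
EXAMPLE_NODE = 0x0242AC110003

def describe_v1(u: uuid.UUID) -> dict:
    """Split a version 1 UUID into its logical parts."""
    return {
        "timestamp_100ns": u.time,   # 60-bit count of 100 ns ticks since 1582-10-15
        "clock_seq": u.clock_seq,    # 14-bit sequence number
        "node": f"{u.node:012x}",    # 48-bit node name (6 bytes)
        "version": u.version,
    }

u = uuid.uuid1(node=EXAMPLE_NODE)
parts = describe_v1(u)
print(u)
print(parts)
```

The last group of the printed UUID is the node name verbatim, which is why anyone holding the ID can read it back.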
However, the age of virtualization has broken the uniqueness of MAC addresses conclusively. In Docker, the MAC address of a network adapter is determined based on the IP address. This dramatically reduces its variance, of course – especially because most Docker deployments will use the same default IP range. You can easily confirm this by googling the UUIDs generated on Docker with IP 172.17.0.3. Presently, Google finds 2 million of them – and of course, these are just the IDs that somehow became part of the searchable content and were indexed by Google.
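As an illustration, the sketch below reproduces the mapping that Docker's default bridge network has historically used, as far as I know: a fixed `02:42` prefix followed by the four octets of the container's IPv4 address. The exact scheme is my assumption, not something stated above, and newer Docker versions may behave differently. The function name `docker_style_mac` is made up.

```python
import ipaddress

def docker_style_mac(ip: str) -> str:
    """Assumed scheme: prefix 02:42, then the four IPv4 octets in hex."""
    octets = ipaddress.IPv4Address(ip).packed
    return ":".join(f"{b:02x}" for b in b"\x02\x42" + octets)

print(docker_style_mac("172.17.0.3"))  # 02:42:ac:11:00:03
```

Under that assumption, every container that gets 172.17.0.3 on a default installation ends up with the node name `0242ac110003`.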
This is not even the worst case I know. How’s this one: every container on AWS Fargate receives the same MAC address – 0a:58:a9:fe:ac:02. This means that every UUID generated on AWS ECS has the same node name! In global terms, this has less of an impact – only 90K googleable IDs. But if you are running your app on Fargate, you lose all the benefits of UUIDs – your machines are just as likely to generate a collision as if you were using a locally unique identifier.
How bad is this? I don’t think it has global repercussions. We practically never need a truly universal identifier: it is usually qualified by a domain of some sort. And, as far as I know, there is no demand for a huge number of UUIDs within one domain.
However, it has very practical local implications for your own distributed system. A UUID v1 timestamp has a resolution of 100 nanoseconds. At that resolution, a collision seems highly unlikely. However, even with two machines generating 1 UUID per second each, the probability of generating a collision reaches 50% in just 41 days! UUIDs without a unique node name are guaranteed to collide eventually.
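The exact figure depends on how you model the generators. As a rough sanity check, here is a sketch under a deliberately simple model of my own: both machines share the node name and the clock sequence, and in every second each machine lands on a uniformly random 100 ns tick. With these assumptions the 50% mark comes out at about 80 days, the same order of magnitude as the figure above. The function name and parameters are hypothetical.

```python
import math

TICKS_PER_SECOND = 10_000_000  # UUID v1 timestamp resolution: 100 ns

def days_until_collision(probability: float, uuids_per_second_each: int = 1) -> float:
    """Days until two same-node generators collide with the given probability.

    Simplified model (an assumption): within each second, every UUID lands on
    a uniformly random tick, and cross-machine pairs are treated as independent.
    """
    # Chance that one UUID from each machine, made in the same second, share a tick.
    p_pair = 1 / TICKS_PER_SECOND
    pairs_per_second = uuids_per_second_each ** 2
    # Logarithm of the probability of seeing no collision during one second.
    log_no_collision = pairs_per_second * math.log1p(-p_pair)
    seconds = math.log1p(-probability) / log_no_collision
    return seconds / 86_400

print(round(days_until_collision(0.5)))         # about 80 days at 1 UUID/s each
print(round(days_until_collision(0.5, 10), 1))  # under a day at 10 UUID/s each
```

Whatever the precise model, the wait shrinks with the square of the generation rate, so a busier system gets there much sooner.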
Instead of fixing this, I kindly suggest that UUIDs are never the right answer. That’s right, you should never use them.
- They are awful as keys – stored as strings, they are dramatically slower to compare than integers. And even if your database has a UUID type, it’s still worse because a 128-bit identifier doesn’t fit into a machine word.
- They are excessively long – each character of a UUID encodes only about 3.5 bits of information if you count the dashes. That’s not much more than half of the 6 bits per character you get with Base64.
- They are not time-ordered – despite containing a timestamp, its bits are mixed up within the UUID: the top bytes of the UUID contain the bottom bytes of the timestamp. Databases do not like an unordered primary key – it means that freshly inserted rows can go anywhere in the index. And you can’t use UUIDs for ad-hoc sorting by time, either.
- They are bad for human comprehension – UUIDs tend to look alike, and it’s hard to visually seek and compare them. This comes from experience.
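The ordering and density points in the list above can be checked with a short sketch. It builds two version 1 UUIDs from hand-picked timestamps that are one tick apart and shows that the later one sorts first. The helper `v1_from_timestamp`, the timestamps and the node value are all made up for illustration.

```python
import uuid

def v1_from_timestamp(ts: int, clock_seq: int = 0, node: int = 0x0242AC110003) -> uuid.UUID:
    """Build a version 1 UUID from a 60-bit timestamp (100 ns ticks)."""
    time_low = ts & 0xFFFFFFFF                         # lowest 32 bits go FIRST
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000   # version 1 in the top nibble
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant bits
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))

earlier = v1_from_timestamp(0x1EE0000FFFFFFFF)  # low 32 bits are all ones
later = v1_from_timestamp(0x1EE000100000000)    # one tick later, low bits wrap to zero

print(earlier)  # ffffffff-0000-11ee-8000-0242ac110003
print(later)    # 00000000-0001-11ee-8000-0242ac110003
print(sorted([str(earlier), str(later)])[0] == str(later))  # True: the later ID sorts first

# Information density of the textual form: 128 bits over 36 characters.
bits_per_char = 128 / len(str(earlier))
print(round(bits_per_char, 2))  # 3.56
```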
Instead of UUIDs, a 64-bit integer is enough for almost every application. You can still follow the same approach of combining a timestamp, sequence number and a machine identifier, but within 64 bits. For an example, consider Snowflake ID – but you can totally design your own.
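Here is a minimal sketch of such a design, loosely following the commonly described Snowflake layout of 41 bits of millisecond timestamp, 10 bits of machine ID and 12 bits of sequence number. The class name, the custom epoch and the handling of clock regressions are my own assumptions, not a reference implementation.

```python
import threading
import time

class SnowflakeLikeGenerator:
    EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch (September 2020)
    MACHINE_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, machine_id: int, clock=lambda: time.time_ns() // 1_000_000):
        if not 0 <= machine_id < (1 << self.MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self._machine_id = machine_id
        self._clock = clock            # injectable millisecond clock, handy for tests
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock went backwards (or we borrowed ahead): keep the last timestamp.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & ((1 << self.SEQUENCE_BITS) - 1)
                if self._sequence == 0:
                    # Sequence exhausted within this millisecond: borrow the next one.
                    now += 1
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - self.EPOCH_MS) << (self.MACHINE_BITS + self.SEQUENCE_BITS)
                    | self._machine_id << self.SEQUENCE_BITS
                    | self._sequence)

gen = SnowflakeLikeGenerator(machine_id=42)
print(gen.next_id())
```

The machine ID still has to come from somewhere you control, such as configuration or an orchestrator-assigned index, rather than from a MAC address.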
If you need a string ID, ksuid is a great alternative to UUID that fixes all of its aforementioned issues – it’s compact, time-ordered, and better at uniqueness. Or, you can simply convert an integer ID into a string using a decimal or hex representation, Base64 or other methods.
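For the integer-to-string route, a sketch using only the Python standard library might look like this. The function names are made up, and unpadded URL-safe Base64 is just one of several reasonable encodings.

```python
import base64

def int_id_to_base64(n: int) -> str:
    """Encode a 64-bit unsigned integer as unpadded URL-safe Base64."""
    return base64.urlsafe_b64encode(n.to_bytes(8, "big")).rstrip(b"=").decode("ascii")

def base64_to_int_id(s: str) -> int:
    """Inverse of int_id_to_base64: restore the padding, then decode."""
    padded = s + "=" * (-len(s) % 4)
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")

example_id = 0x0123456789ABCDEF
encoded = int_id_to_base64(example_id)
print(len(encoded))              # 11 characters, versus 36 for a textual UUID
print(format(example_id, "016x"))  # 0123456789abcdef, the hex alternative
```

Because the bytes are big-endian and fixed-width, the hex form sorts in the same order as the integers. Base64 strings do not, since the alphabet is not in ASCII order.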
If you require a globally unique string ID, consider URIs! They can be truly unique and human-readable.
Sometimes, though, you are forced to generate UUIDs by external constraints. Or you are stuck with a system that already uses UUIDs, and migration is a pain. If you must have UUIDs, I suggest going with UUID v4 – the random one. It has an astronomically low chance of collisions. Or, if having a timestamp encoded into the ID is valuable to you, override the node name with a value whose uniqueness you can guarantee at least system-wide.
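Both options are available in Python's standard `uuid` module. In the sketch below, `node_from_instance_index` is a hypothetical helper: it assumes your deployment can hand every instance a distinct small integer, and it sets the multicast bit, as RFC 4122 recommends for node values that are not real MAC addresses.

```python
import uuid

# Option 1: a random (version 4) UUID, no node name involved.
random_id = uuid.uuid4()

# Option 2: a version 1 UUID with an explicitly chosen node name.
def node_from_instance_index(index: int) -> int:
    """Turn a system-wide unique instance index into a 48-bit node value."""
    if not 0 <= index < (1 << 40):
        raise ValueError("index must fit in 40 bits")
    # The multicast bit marks the value as "not a hardware address".
    return index | (1 << 40)

timed_id = uuid.uuid1(node=node_from_instance_index(7))

print(random_id, random_id.version)
print(timed_id, f"{timed_id.node:012x}")  # node prints as 010000000007
```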