New UUID Formats – IETF Draft
source link: https://news.ycombinator.com/item?id=28088213
UUID6: a timestamp with a weird epoch and 100 ns precision like in UUID1, but in a big-endian order that sorts naturally by time, plus some random bits instead of a predictable MAC address.
UUID7: like UUID6, but uses normal Unix timestamps and allows more timestamp precision.
UUID8: like UUID7, but relaxes requirements on where the timestamp is coming from. Want to use a custom epoch or NTP timestamps or something? UUID8 allows it for the sake of flexibility and future-proofing, but the downside is that there's no standard way to parse the time from one of these -- the time source could be anything monotonic.
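To make the UUID7 idea concrete, here is a rough sketch of what a UUIDv7-style generator could look like under the draft. The field widths and bit positions follow one common reading of it (48-bit millisecond Unix timestamp up front, version and variant bits stamped in per RFC 4122); the exact layout may differ in the final RFC.

```python
import os
import time
import uuid

def uuidv7() -> uuid.UUID:
    """Sketch of a draft-style UUIDv7: 48-bit millisecond Unix timestamp,
    then version/variant bits, with the remainder filled from os.urandom."""
    ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = ts_ms << 80 | rand
    # Version lives in the high nibble of octet 6 (bits 79..76 of the
    # 128-bit integer); the variant in the top two bits of octet 8
    # (bits 63..62). Overwrite the random bits that landed there.
    value &= ~(0xF << 76)
    value |= 0x7 << 76   # version = 7
    value &= ~(0x3 << 62)
    value |= 0x2 << 62   # variant = 0b10, RFC 4122
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, values generated in different milliseconds sort by creation time both as integers and as hex strings.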
IMHO that should not be a concern. The goal of a UUID is to create a Unique Identifier, not to record the time something was created. Maybe we should all have quantum random number generators, but even then I don't think 128 bits is enough for a UUID. It might be enough for specific applications, though.
First rule of database design: Your unique IDs should not have a real-world meaning.
Well, this is an opinion. I would definitely push back on making it a rule. Especially the first one.
Natural keys exist, make perfectly good unique identifiers, and have inherent meaning.
When using synthetic keys it is difficult (read: requires making trade-offs) to guarantee they were created in order and/or that they increase monotonically. But if you make those trade-offs (or just align expectations) you can still associate real world meaning to them.
These new UUIDs standardize more of those trade-offs.
Most of the "natural" things you think are going to be good Primary Keys aren't. You quickly get into one of those "Falsehoods Programmers Believe About X" lists.
I remember some years ago running into the Experian team who were designing a system to uniquely identify people and they'd got it all figured out for the US market. See, Americans have Social Security Numbers, so you can just use those to key off. Alas, as hopefully anybody here who actually deals with Social Security Numbers knows, they aren't unique. Perhaps you could argue they should be unique, but that's not useful in practice, because in practice they aren't. They seemed really disappointed about that. I doubt they fixed it though.
It's far more likely that the "definitely unique" Natural key for your table isn't actually unique than that there will be an insurmountable problem caused by using artificial row IDs, UUIDs or something like that.
Never, ever, ever try to use SSN as a primary key or anything even remotely like a primary key.
Product serial number might be a better example. Or file hash.
Nope. The "serial number" is just whatever you printed on the box or maybe on the product itself. You may intend that to be unique but you've got no way to ensure that. If you tell the people in manufacturing that if any duplicates occur it's a show stopper they will tell you to stop being stupid and design your IT systems to cope with the real world.
I buy serial-numbered LED lamps, the manufacturer replaces them if they drop dead within a certain number of years and I have some applications (18 hours per day of lighting) where that's plausible, so I need to track those serial numbers. Except they once shipped me a lamp (one in a box purchased together with others) with no serial number. Completely blank, no manufacturer, no type identifier, no voltage, the entire label was just missing from the product as shipped. Emailed them. No problem, apologies for the defect, your warranty registration for a lamp with no serial number will be fine.
You've got a pretty low opinion of people in manufacturing there, buddy.
No manufacturer I've ever worked with would even blink at a requirement for unique serial numbers.
It's easy and even expected to have a requirement. There's a requirement that US SSNs are unique too. A requirement doesn't cut it. The requirement is going to fail sooner or later and what then? How much did the manufacturers you worked with indemnify you against the consequences of non-unique serial numbers?
That's what we're getting at here. When it breaks somebody has to fix it. Or, you could design it not to break.
It's a man's laws versus nature's laws problem. Did you know you can pick up the soccer ball, and just carry it over the goal line? Yeah, that's "against the rules" but you can do it anyway, being against the rules doesn't make it impossible.
Now, contrariwise, you can't teleport the ball - doesn't matter whether the rules allow it, you can't do it anyway. Nature's laws are facts. Design your database to rely on facts, not man's laws, as much as possible.
First of all it'll get picked up by the people who set up the serial number printer. They don't just set it to a random number - the factory will have a procedure to make sure the number is set right.
Then it'll get picked up by the end-of-line testing, as test reports are filed by serial number.
Then it'll get picked up when the first part comes off the production line and gets checked by the factory's most conscientious testers and engineers.
Then it'll get picked up at goods inbound, when they count the items and check them in, and find the system says some of them are used, or there's a discrepancy in the number of items.
Then it'll get picked up at acceptance testing, when they find the shipment's test records aren't in order.
Then it'll get picked up by accounts payable, when the manufacturer bills for parts they've already been paid for.
Then depending on the numbers and costs involved, you either send the bad units back for relabelling or trash them, in either case at the manufacturer's expense.
Printer ran out of toner and nobody noticed until the next shift.
> Then it'll get picked up by the end-of-line testing, as test reports are filed by serial number.
Nah, test report operator was out sick that day and product needed to be pushed anyway. Manager assumed they're all good and overrode tests.
> Then it'll get picked up when the first part comes off the production line and gets checked by the factory's most conscientious testers and engineers.
Those testers and engineers don't have all day to test every single product. They only tested the first 100.
> Then it'll get picked up at goods inbound, when they count the items and check them in, and find the system says some of them are used, or there's a discrepancy in the number of items.
Counted by total weight divided by individual weight.
> Then it'll get picked up at acceptance testing, when they find the shipment's test records aren't in order.
Counted total number of UPCs.
> Then it'll get picked up by accounts payable, when the manufacturer bills for parts they've already been paid for.
Accounts payable sees a box of 1000 parts, they're labeled, and the top 100 appear genuine... pay it.
> Then depending on the numbers and costs involved, you either send the bad units back for relabelling or trash them, in either case at the manufacturer's expense.
Manufacturer thinks you're trying to defraud them and won't authorize a return without a serial number.
The whole point of a serial number is to be able to track an item through your supply chain.
If it can get through your supply chain without anyone ever reading the serial number, why are you putting a serial number on it at all? Decoration?
My point is that even though there are multiple points where the serial number "should" be checked, there are also multiple points where that could fail for one reason or another.
You buy the other F-150 sn 1234. You take your truck in for service. They refuse to service salvage title vehicles under warranty. You explain you don’t have a salvage title. They blame “the system”. Your insurance company fares no better.
Or, Ford could put a unique constraint on serial number and catch the problem in the factory.
It’s not clear to me how the internal_db_id did anything except propagate an error until it was way too late to fix.
The entire point of properly modeling data is to codify and enforce the “rules”.
> Nature's laws are facts. Design your database to rely on facts, not man's laws, as much as possible.
Exactly. Natural keys can, along with normalization and constraints, stop you from picking up the ball and walking into the goal. In other words, natural keys define the nature of your system.
> Emailed them. No problem, apologies for the defect, your warranty registration for a lamp with no serial number will be fine.
Seems fine?
They probably do, but every time I thought something would make a good natural key, I've been wrong.
Sortable unique IDs have many performance benefits in real world systems.
I see a lot of comments conflating a unique ID with a unique column. It's OK for you to have a unique constraint on something like an email column. But you wouldn't use the email column as the primary key.
My thesis is that computers are so full of unwanted, unneeded things like this that no engineer knows them all while also having a good understanding of software engineering and infosec. Most vulnerabilities are due to a lack of understanding of how the primitives work, as opposed to flaws in reasoning (the former is clearly distinct: the summary provided to save him from spending a month looking at the implementation is inadequate).
I have designed UUIDs myself and they are simply a long random bitstring. The random part is already necessary because we need sufficient randomness for crypto to work in the first place. IETF likes to "engineer" things.
I don't think that was the downside mentioned, I think it was because of the lack of standardization on how to parse the timestamp.
Exactly as used by Windows to store file times. And the epoch is the start of the modern calendar: https://en.wikipedia.org/wiki/Gregorian_calendar
Talking about dates before this epoch makes little sense (any date before the Gregorian epoch will not resemble the same stellar / planetary configuration as a date after the Gregorian epoch).
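For reference, the 1582-10-15 epoch is straightforward to work with in practice. A sketch of recovering the wall-clock time embedded in a version-1 UUID, where the offset constant is the number of 100 ns ticks between the Gregorian epoch and the Unix epoch (the same constant appears in the RFC 4122 reference code):

```python
from datetime import datetime, timezone
import uuid

# 100 ns ticks between 1582-10-15 (Gregorian epoch) and 1970-01-01
# (Unix epoch): 12_219_292_800 seconds.
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000  # 122_192_928_000_000_000

def uuid1_to_datetime(u: uuid.UUID) -> datetime:
    """Recover the timestamp from a version-1 UUID's 60-bit time field."""
    unix_100ns = u.time - GREGORIAN_TO_UNIX_100NS
    return datetime.fromtimestamp(unix_100ns / 1e7, tz=timezone.utc)
```

The constant can be sanity-checked against Python's own (proleptic Gregorian) date arithmetic: the two epochs are exactly 12,219,292,800 seconds apart.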
I'd also argue that you're right, and that as a principle, good UI/UX and software architecture should default to following natural Schelling Points.
I bet you'd see a lot of Jan 1 1900, 2000, or year 1. Very few would pick 1970.
Other fun non-Jan 1 ideas might be December 26, 1791 (Charles Babbage's birthday) or February 5, 1944 (the date Colossus came online).
With a great deal of existing routines that work with those values and a good number of existing files and storage applications based on the simple version, it makes even more sense to extend that in an unsigned way with a larger bitfield for any timestamp. Though as recently came up, leap seconds and synchronization could use a few bits to describe which timestamp is in use.
It’s like picking 2020 as the beginning of an epoch now (of course we’d probably fall back to 2000 as that’s a big round number).
1997-08-29T06:14:00Z
Ah yes, nobody will ever need to represent a date/time before 1970-01-01.
As for your thought experiment: recently I had to pick a "NULL/no/unknown date" value to put into a non-nullable database column and I picked 1900-01-01. Unix epoch crossed my mind only briefly and I discarded it immediately because it's too arbitrary and, more importantly, too near valid dates that the user might use. (Use-case: materials from historical archives being ingested.)
So my opinion is that the RFC authors should be commended for having the foresight of choosing an epoch that's widely more applicable and far less arbitrary (the 1st day of the calendar system that most of the world is using today) than the Unix epoch.
If only we had a way to represent numbers less than zero :)
More like: the few people interested in doing this can use UUIDv8 (unspecified time format). There are many, many reasons to store times before 1970 in computer systems, but I'm not sure any apply to these UUID formats.
tl;dr of the "Background" section of the RFC: they're designing it so that two UUIDs with fresh timestamps will have values near each other in the database keyspace, which in many cases improves efficiency. If your timestamp isn't freshly generated, I don't know why you'd embed it in a UUID with these formats.
It's arguably not a good practice to extract the timestamp from the UUID at all. Instead, I might just treat them, once generated, as opaque approximately-sorted bytes. It's clearer to have separate fields for timestamps of particular importance. Though I might not feel too religious about that if tight on storage space.
Ingesting past/archival data. As to why use these formats: to still benefit from their sorting properties. As to why use UUID at all: because it's a format understood by a wide variety of software.
> It's arguably not a good practice to extract the timestamp from the UUID at all.
Ya, I'd do it only if I trust the source (i.e., a closed system).
Their argument about the value of sorting is quite particular. Put stuff roughly in ascending order by the record creation time so that creating a bunch of records will touch (eg) fewer btree nodes. The simplest way to achieve that is to just use the current time. If you have some other timestamp at hand and happen to be scanning them in ascending order by that, you still get no advantage by this argument over just using the current timestamp:
> However some properties of [RFC4122] UUIDs are not well suited to this task. First, most of the existing UUID versions such as UUIDv4 have poor database index locality. Meaning new values created in succession are not close to each other in the index and thus require inserts to be performed at random locations. The negative performance effects of which on common structures used for this (B-tree and its variants) can be dramatic. As such newly inserted values SHOULD be time-ordered to address this.
My gut tells me that you should rarely if ever stick past timestamps in UUID primary keys. One reason why: can it ever change? Maybe your artifact dating technique was wrong, and you want to update the dates of a bunch of artifacts. But now you have to change your primary key. You probably don't want that.
And in any case, 15 October 1582 isn't some hard cutoff point where we can stop worrying about calendar conversions. Only four countries adopted the Gregorian calendar on that day, and even in Europe there are several countries that only switched in the 20th century. If a piece of software needs to support historical dates that go anywhere near 1582, it needs to be aware of the existence of different calendars.
[1] https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar
That is not the only sense in which we care about dates. For example, we might want to talk about which of two events came first, and by how much. Historians have lots of uses for dates before the introduction of the Gregorian calendar.
> we might want to talk about which of two events came first, and by how much
Neither of which makes sense without having a common reference point (start of Gregorian calendar being one such possible reference point). Trivial example: there is exactly one day between 1582-10-04 (Julian c.) and 1582-10-15 (Gregorian c.), i.e., different reference points.
> Historians have lots of uses for dates before the introduction of the Gregorian calendar.
Yes, and I bet they're hyper-aware about the calendar system(s) they're using. The rest of our mortals can just use Gregorian calendar :)
I ask because a common newbie bug is to use a UUID as a secure token, but currently only v4 UUIDs with a cryptographically secure PRNG can be used this way. Separately, even if not used as a token, using UUIDs with randomness for object IDs can help mitigate IDOR vulnerabilities if there are other bugs in code that aren't adequately checking object permissions.
v6 UUIDs only have 48 bits of randomness, which means that some newbies will think that's "good enough" for security, when in reality it's not.
I still like these new UUID types because I would use them as DB primary keys now (benefit of being time sorted but also globally unique and offers some protection against IDOR bugs), but important to know where not to use them.
Uuid6 is the one I’ll be switching to now.
Any randomness in the UUIDs is meant to minimize the possibility of collisions. If e.g. you only used a timestamp then you might get collisions even on the same machine, and are almost guaranteed collisions sooner than later on a global scale.
The draft mentions randomness in the background section, but not really.
> Lastly, privacy and network security issues arise from using a MAC address in the node field of Version 1 UUIDs. Exposed MAC addresses can be used as an attack surface to locate machines and reveal various other information about such machines (minimally manufacturer, potentially other details). Instead "cryptographically secure" pseudo-random number generators (CSPRNGs) or pseudo-random number generators (PRNG) SHOULD be used within an application context to provide uniqueness and unguessability.
And in the Security Considerations section
> Instead [of MACs] pseudo-random bits SHOULD be selected from a source with sufficient entropy to ensure guaranteed uniqueness among UUID generation.
However, the rest of the document does not mention how to generate the random data, at all. If the draft advanced to RFC or even Standard like this, it would be fine to use the "Debian PRNG" [1].
Then again, only the proposed v6 mandates any random data at all (and 48-bits "only"). v7 and v8 have essentially a "may use" for any randomness.
Using only UUIDs for "security" is generally a bad idea in my humble opinion, regardless, even v4 with a cryptographically secure PRNG (CSPRNG). But I suppose mandating the use of a CSPRNG would still alleviate some of the injuries to people who point the UUID gun at their feet and pull the trigger.
[1] https://i.imgur.com/WkLh32P.png
UUIDv4 can be entirely suitable for security purposes, so long as the implementation is defined to use a secure RNG, which most sane implementations would be (since that's the dead simplest way to ensure the guarantees of UUIDv4).
Most people use them for security purposes because UUID libraries tend to make it easy - you get 122 bits of randomness (more than enough for many use cases), a nice way to convert to strings in multiple formats, parsers, etc.
Only they should be used that way. Unfortunately other options (other UUID types, v4 with bad RNG) are sometimes used that way. At least other UUID variants are more faff to hack around than a simple increasing integer, particularly if decent request and access validation is in place, though if the wrong type of key is being used, perhaps hoping for good validation elsewhere is a bit much.
Are you concerned about the 0.0000001% chance of collision after generating a 100 trillion UUIDs?
Or are you trying to include metadata in your identifier? (Not the worst thing, but it's also not super useful info.)
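That collision figure matches the usual birthday-bound approximation. A quick back-of-the-envelope check, assuming all 122 random bits of a v4 UUID are uniformly random:

```python
def birthday_collision_prob(n: int, bits: int) -> float:
    """Approximate probability of at least one collision among n
    uniformly random values of the given bit width: n^2 / 2^(bits+1).
    A standard birthday-bound estimate, accurate while the result is small."""
    return n * n / 2.0 ** (bits + 1)

# UUIDv4 carries 122 random bits; after 100 trillion of them:
p = birthday_collision_prob(100_000_000_000_000, 122)
# p is roughly 9.4e-10, i.e. about 0.0000001% as a percentage.
```

The estimate is the first term of the full birthday-problem expansion, which is why it only holds while n is far below 2^(bits/2).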
I've long thought it was a problem that PostgreSQL doesn't have a real clustering key like MySQL or SQL Server.
But PostgreSQL users have told me it doesn't really matter.
Mainly because indices have to be updated and searched all over the place with UUIDs. The locality of the data on disk itself is fine because it's inserted sequentially.
Those users were very wrong, unless they explicitly only talked about the case where the uuids are not indexed.
At my last job, this was managed by using two IDs on each row - a serial one that was basically only used for database optimization purposes, and the "real" UUIDv4 ID. It felt gross then, and it feels gross now, but it seemed to do the trick for our needs.
With sequential-ish values indexed by something like B-trees, the same index pages get modified over and over again because new table rows will be in a narrow part of the index. As databases / operating systems typically buffer writes for a while, this reduces the number of writes hitting storage substantially.
Conversely, with random values, every leaf page will be modified with the same probability. That's not a problem with a small index, because you'll soon dirty every page anyway. But as soon as the index gets larger you'll see many more writes hitting storage.
If your access patterns are such that you often access IDs generated around the same time together, v6-8 will probably be faster due to locality. If you access data generated on the same node together, then the same goes for v1-2.
If you want a reproducible id based on some other identifier, then only v3 and v5 will work.
So even the IETF hasn’t read Things Every Developer Should Know About Time.
The UUID specs use the terms "variant" and "version" a little funny; "variant" is essentially the revision (so all modern UUIDs have variant=0b10 to specify RFC 4122 UUIDs), and "version" is a 4-bit number identifying the sub-type within that variant:
UUIDv1 time-based
UUIDv2 legacy DCE security thing, not widely used
UUIDv3 name-based with md5 hashing
UUIDv4 randomness-based
UUIDv5 name-based with sha1 hashing
This draft registers a few new sub-types:
UUIDv6 sortable time-based, Gregorian calendar
UUIDv7 sortable time-based, Unix time
UUIDv8 sortable time-based, custom time
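To make the variant/version bit layout concrete, here's a small check of those fields against Python's uuid module (bit positions as laid out in RFC 4122):

```python
import uuid

u = uuid.uuid4()
raw = u.int  # the UUID as a 128-bit integer, octet 0 most significant

version = (raw >> 76) & 0xF  # high nibble of octet 6 ("time_hi_and_version")
variant = (raw >> 62) & 0x3  # top two bits of octet 8 ("clock_seq_hi")

assert version == 4     # sub-type within the RFC 4122 variant
assert variant == 0b10  # variant marking this as an RFC 4122 UUID
```

Every modern UUID carries variant 0b10; the version nibble is what distinguishes v1 through (in this draft) v8.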
Use UUIDv4 and get on with your day.
UUIDv1 non-sortable time-based
UUIDv6 sortable time-based, Gregorian calendar
UUIDv7 sortable time-based, Unix time
UUIDv8 sortable time-based, custom time
That is: The custom-time version should only be used if you cannot use one of the standard time versions.
Due to the shortcomings of UUIDv1 and UUIDv4 detailed so far, many widely distributed database applications and large application vendors have sought to solve the problem of creating a better time-based, sortable unique identifier for use as a database key. This has led to numerous implementations over the past 10+ years solving the same problem in slightly different ways.
- Timestamps MUST be k-sortable. That is, values within or close to the same timestamp are ordered properly by sorting algorithms.
- Timestamps SHOULD be big-endian with the most-significant bits of the time embedded as-is without reordering.
- Timestamps SHOULD utilize millisecond precision and the Unix Epoch as the timestamp source. Although there is some variation to this among implementations, depending on the application requirements.
- The ID format SHOULD be Lexicographically sortable while in the textual representation.
- IDs MUST ensure proper embedded sequencing to facilitate sorting when multiple UUIDs are created during a given timestamp.
- IDs MUST NOT require unique network identifiers as part of achieving uniqueness.
- Distributed nodes MUST be able to create collision-resistant unique IDs without consulting a centralized resource.
" First, most of the existing UUID versions such as UUIDv4 have poor database index locality. Meaning new values created in succession are not close to each other in the index and thus require inserts to be performed at random locations. The negative performance effects of which on common structures used for this (B-tree and its variants) can be dramatic. As such newly inserted values SHOULD be time-ordered to address this."
To quantify how different: UUIDv4 has 6 fixed bits and 122 random bits.
My mistake - I meant to write "dynamic optimality".
As for the "not used in any database storage backends", I have personally started experimenting with them due to higher-order effects they can provide relative to their optimization technique.
The fact that the most recently accessed node is always at the root also means that incremental tree updates are well-bounded in terms of # of nodes that would need to be rewritten to disk. This allows for much more efficient usage of storage-friendly things like append-only log structures.
Once you put 2+2 together on it, it's like a lot of concepts seem to snap into place all at once. You also get incredible locality of access - each batch can sometimes fit within a single I/O block depending on system loading and data patterns, so traversing the tree to any node modified in the previous flush to disk would involve a single I/O.
Edit: If you search "splay tree append only log" on Google, you will find one of my previous HN comments on the first page of results: https://news.ycombinator.com/item?id=28010167 This tells me there isn't much precedent for this idea.
- They defeat data compression, which is an integral part of many database engines these days for both performance and scalability reasons.
- For some very large systems, there is a UUID collision tail risk that while small is large enough to be plausible.
Also, some data models (think graphs) tend to be not much more than giant collections of primary keys. Using the smallest primary key data type that will satisfy the requirements of the data model is a standard performance optimization in databases. It is not uncommon for the UUID that the user sees to be derived from a stored primary key that is a 32-bit or 64-bit integer.
The main feature of a uuid is it allows distributed generation. 32-bit or 64-bit integers are almost always sequential numbers. The sequential nature allows efficient page filling and index creation, but the contention involved in creating a sequence grows rapidly with scale.
So while a 128-bit uuid is larger than a 64 bit integer, this version allows for the bulk of the benefits of sequential integers while reducing the biggest drawback of contention at the point of creation.
You also do not need 128-bits to decentralize unique key generation, even though it is quite reasonable if you design your keys well. Many do it with 64-bit unique keys in massive scale systems.
A subtle point that you may be overlooking is that while large-scale distributed databases, including all the ones I work on, export globally unique 128-bit keys, in most systems I’ve worked with they are internally represented and stored as 64-bits or less even if the key space is much larger than 64 bits. There are many different techniques for doing key space elision and inference that are used inside distributed databases to save space. The 128-bit value is only materialized when sending it over the wire to some external client system. But you don’t need to store it.
Literally storing a primary key in a distributed system as a 128-bit value is all downside with few benefits. For small systems the performance and scaling cost may not matter that much but in very large systems it matters a lot. It can — literally! — cost you millions of dollars per year.
Unless you need to be completely disconnected, a little coordination can drastically improve things. In past companies, I used a simple counter with 64-bit integers and each distributed process would increment a billion-number range to use for IDs. Fast, efficient, compatible with everything, naturally ordered, and guaranteed to never have a collision.
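A minimal sketch of that block-allocation scheme. The class and block size here are illustrative; a real system would use a database sequence or similar durable counter for the one coordinated step, rather than an in-process lock:

```python
import threading

class BlockIdAllocator:
    """Central counter that hands each worker a block of one billion IDs;
    after that, each worker allocates locally with no further coordination."""
    BLOCK = 1_000_000_000

    def __init__(self) -> None:
        self._next_block = 0
        self._lock = threading.Lock()

    def new_worker(self):
        with self._lock:  # the only coordinated step, done once per worker
            start = self._next_block
            self._next_block += self.BLOCK
        return iter(range(start, start + self.BLOCK))

central = BlockIdAllocator()
worker_a = central.new_worker()
worker_b = central.new_worker()
```

Since the blocks are disjoint, IDs can never collide, and within each worker they come out strictly increasing.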
If a UUID is completely random it means it can be inserted anywhere which can require the DB to reshuffle records and pages in order to make room for the new record.
Having a sequential element to the UUID makes it a lot easier to have an index where each record is inserted at the end. Which, like you said, makes page usage more efficient and decreases the amount of work a DB has to do on insertion.
All this is a compounded problem if you have a DB with frequent writes, a lot of indexes, or both.
These should be high-entropy items.
Should you desire a high-entropy key, e.g. for hash tables or security, this can be trivially derived from the low-entropy key while still guaranteeing deterministic uniqueness.
Compressible primary keys are superior in every way.
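One way to sketch that derivation is with a keyed hash such as BLAKE2: the small sequential key stays inside the database, and a high-entropy external identifier is computed from it on the way out. The secret name and digest size here are illustrative assumptions:

```python
import hashlib

SECRET = b"per-deployment secret"  # hypothetical; keep it out of source control

def public_id(internal_id: int) -> str:
    """Derive a high-entropy external identifier from a small sequential
    primary key. Deterministic, so the uniqueness of the input carries
    over; unguessable without the key."""
    digest = hashlib.blake2b(
        internal_id.to_bytes(8, "big"), key=SECRET, digest_size=16
    )
    return digest.hexdigest()
```

The mapping is injective for 64-bit inputs in practice (a collision would be a 128-bit hash collision), so uniqueness is preserved while the stored key stays compressible.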
Generally one index contains the data; all the other indexes have pointers to that index in their leaf nodes. Usually the PK index is the one that contains the data.
So loading 1000 consecutive rows in a created_at secondary index takes two steps: 1) created_at index B-Tree traversal (maybe 10 pages from disk, max?) 2) Potentially loading 1000 randomly sorted pages from disk, dereferencing pointers (EXPENSIVE!)
Whereas, if your primary key is sorted by time, loading 1000 rows is 1) slightly larger primary B-Tree traversal 2) no 2, you're done :)
My best theory is that it's the Reddit "I don't need to understand how stuff works to be a programmer" crowd.
I guess for use-cases like Instagram's: "Generated IDs should be sortable by time (so a list of photo IDs, for example, could be sorted without fetching more information about the photos)... If you use a timestamp as the first component of the ID, the IDs remain time-sortable... [but require] more storage space (96 bits or higher) to make reasonable uniqueness guarantees" [0]
[0] https://blog.twitter.com/engineering/en_us/a/2010/announcing...
The only case where I don't use them is if I don't want to create time of the item to be known to the user.
What are the advantages of sortable UUIDs with embedded timestamps over having an auto-incrementing id column, a created_at column, and using the composite primary key (created_at, id)?
That sequence allocation needs to be ACID as well, so that there is no chance of any allocation being repeated.
Secondly, whether or not being predictable is a problem depends on the use case. The keys used on this forum are predictable and it is not an issue.
The issue with using Mac addresses is it essentially leaked tracking information on the machine being used to generate the uuid - you could associate every ID it ever generated back to that same machine, correlate it to things like open WiFi AP logs, whatever. Which is pretty scary sometimes.
Many OS’s now randomly generate a new MAC on every disconnect/connect for this reason, to at least give some privacy.
At any rate it's definitely simpler, and you don't have to forego the created_at column either.
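The composite-key alternative asked about above can be sketched with SQLite's in-memory database (table and column names are made up; a real schema would presumably use an auto-incrementing id for the second component):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE events (
           created_at TEXT    NOT NULL,  -- ISO 8601 sorts lexicographically
           id         INTEGER NOT NULL,  -- per-timestamp tiebreaker
           payload    TEXT,
           PRIMARY KEY (created_at, id)
       )"""
)
conn.execute("INSERT INTO events VALUES ('2021-08-05T12:00:00Z', 1, 'a')")
conn.execute("INSERT INTO events VALUES ('2021-08-05T12:00:00Z', 2, 'b')")
rows = conn.execute(
    "SELECT id FROM events ORDER BY created_at, id"
).fetchall()
```

Like a time-ordered UUID, new rows land at the end of the primary key's ordering; the trade-off is that the key is only unique within one database, not globally.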
This is clearly an individual draft. OK. And it has previously expired without action (last year) but a newer draft was submitted this year.
But it doesn't seem to have been adopted by any working group, and although the word "dispatch" appears in the title it was not discussed at IETF 111's DISPATCH or GENDISPATCH or SECDISPATCH last week. If this was in fact dispatched somewhere - perhaps before it expired last time - there's no indication where it went or what its status is now.
It is the nature of these things that if everybody chooses to do exactly this, even if it was only documented in a by-then expired draft, or on the back of an envelope, then that's how it is -- the IETF has no enforcement arm. However, if you support this proposal, or even if you think it'd be a good idea with minor tweaks, you should find out where (and if) it's being developed and get on board with that.
I for one have been waiting for an IETF-standard sortable UUIDs for a while, so I'm happy to upvote this even if it is "blog post wishlist"-stage of standardization.
However, I'd like to see the draft include a mention about the practice of embedding machine and data type identifiers into the format which helps in distributed applications.
If you need to obscure both when in time and where in space, then UUIDv4 is a better option.
Do you mean you prefer ksuid to this new draft? Your comment would be more helpful if you said what about it you prefer.
they have 128 bits of pseudorandom data
Both text and binary representations are lexicographically sortable
They don't have the awkward dash ('-') to mess with url encoding
Can you explain what you mean by that? I have experienced issues with " " as "+" vs. "%20", but dashes... ?
ETA: also, when you put a UUID as text into a lot of search engines, it treats the dashes as token delimiters.
Let's see if I can figure it out (correct me if I'm wrong).
I think maybe these newly proposed UUIDs ARE lexicographically sortable (in timestamp order, I think is the implication) in both text and binary form. I got the impression that was one of the main things they provide compared to older UUID formats, and one of the main reasons for their proposal? Am I wrong?
I know the new UUIDs look the same as the ones we're used to, where they often have dashes (although there's certainly no requirement you display them with dashes, you can strip the dashes out before putting them in a URL if you want, although yeah it's an extra step).
[If we're talking about URL-encoding, we're talking about putting them in URLs? ksuid's 160 bits instead of UUID's 128, making a longer URL, seems like a downside?]
Not sure how many bits of pseudorandom/entropy they have... Looks like... UUIDv6 has 48 bits of pseudorandomness in addition to the timestamp, to reduce timestamp collisions. UUIDv7 is variable and can let the application decide (am I understanding right?), but supports up to 62 bits of pseudorandom data. [I guess you prefer a longer ID with more bytes of pseudorandomness... just to further minimize the possibility of timestamp collisions, or for other reasons? I'm not sure how often existing UUID timestamp collisions happen in practice, anyone know? Anyone ever had one?]
Well, the entire UUID is 128, so obviously not 128 bits of pseudorandom data.
> What about 128 bits of pseudorandom data is preferable to you?
Same as with any hash: collision resistance. That's why we don't use SHA-1 for anything secure any more. The dash problem arises not in the URL but the multipart/form-data encoding, as well as the following:
"when UUIDs are indexed by a search engine, where the dashes will likely be interpreted as token delimiters. The base62 encoding avoids this pitfall and retains the lexicographic ordering properties of the binary encoding."
I have never had or thought of an application where a search engine interpreting the dashes as token delimiters was a problem. But if you have, and it was a problem, ok, that explains your preference I guess!
Curious whether anyone reading has run into collisions with existing v1 UUIDs, or knows anyone who has, and how often.
Pretty sure these new UUIDs are lexicographically sortable (in timestamp order) in their hex representation (with or without dashes) as well as binary. Lots of people agree this is important; in fact it seems to be one of the main motivations of these new UUID formats in the draft we're discussing, I think?
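A quick sketch of why that works (assumptions: a simplified v7-like layout of a 48-bit big-endian Unix-millisecond timestamp followed by random bits -- not the exact bit layout from the draft, which also reserves version/variant bits):

```python
import os
import time

def v7ish() -> str:
    """A simplified, hypothetical v7-style ID: big-endian ms timestamp + randomness."""
    ts = int(time.time() * 1000).to_bytes(6, "big")  # 48-bit big-endian millisecond timestamp
    rand = os.urandom(10)                            # 80 bits of randomness
    return (ts + rand).hex()

a = v7ish()
time.sleep(0.002)  # make sure we land in a later millisecond
b = v7ish()
# The hex string sorts in creation order because the timestamp is
# big-endian and sits at the front of the value.
assert a < b
```

The big-endian ordering is the whole trick: byte-wise (and therefore hex-character-wise) comparison then agrees with numeric timestamp comparison, so both the text and binary forms sort by time.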
> They don't have the awkward dash ('-') to mess with url encoding
But yeah, I’ve found the ability to move IDs between alphabets quite useful.
> This machine ID MUST be placed in the UUID proceeding [sic] the timestamp and sequence counter bits. This position is selected to ensure that the sorting by timestamp and clock sequence is still possible.
This guarantees uniqueness at a global level, as long as each machine doesn't run out of sequence counters within a given timestamp.
But why must that machine ID precede the timestamp & sequence counter? Why not have it after? (Or does "proceeding" have a meaning I'm not aware of? I read it as a typo for "preceding", but I'd assume it should be "succeeding", especially given what the next sentence says.)
My intuition is range requests based on timestamp would work better if the machine ID is after, not before... If before, it seems it would violate the key requirement in abstract of "sortable using the monotonic creation time".
(It's already violated since each machine in distributed system doesn't have the same clock so "creation time" is all relative. But for purposes of analytics, such as querying the "last 24h", having the timestamp be at the beginning seems preferable, since range queries can be done easily)
I read this as a strange word choice, but interpret as follows: "This machine ID MUST proceed from the timestamp and sequence counter." X proceeding from Y implies that X comes after Y.
If we examine the context, it is absolutely unambiguous (emphasis mine):
> This machine ID MUST be placed in the UUID proceeding [sic] the timestamp and sequence counter bits. This position is selected to ensure that the sorting by timestamp and clock sequence is still possible.
The proceeding sentence makes the intent clear, that a naive sort should preserve time ordering. The only way for this to work is for the timestamp and sequence counter bits to precede the optional machine ID.
An interesting digression with respect to this RFC: none of the proposed v6, v7, v8 UUIDs match the SQL Server sort-order behavior, and they would still cause a lot of clustered-index churn unless new sort behaviors could be opted into. That might be something this RFC could help address by making sort orders a bit more standard, though it does not directly make recommendations on that front.
For what it is worth, it is the SQL Server sort order that is weird: it sorts the last group of the UUID string form first, which in v1 UUIDs corresponded to the MAC address. Clustering by machine probably made sense to whoever decided on that sort order at the time, but even with v1 UUIDs they probably would have gotten better results sorting by timestamp than by machine.
(The ULID timestamp fits exactly into that "MAC address" group and it's just a matter of swapping the front 6 bytes to the back of the UUID when storing to uniqueidentifier.)
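The byte shuffle described above is small enough to sketch (assumptions: a 16-byte ULID-style value whose first 6 bytes are the big-endian timestamp; the function names are made up for illustration):

```python
def ulid_to_sql_guid_bytes(ulid: bytes) -> bytes:
    """Move the 48-bit timestamp to the last group, which SQL Server sorts first."""
    return ulid[6:] + ulid[:6]

def sql_guid_to_ulid_bytes(guid: bytes) -> bytes:
    """Inverse: pull the timestamp back from the last 6 bytes to the front."""
    return guid[10:] + guid[:10]

# The two transforms round-trip cleanly:
raw = bytes(range(16))
assert sql_guid_to_ulid_bytes(ulid_to_sql_guid_bytes(raw)) == raw
```

Because the timestamp lands in the group SQL Server compares first, a naive `uniqueidentifier` sort then matches creation order.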
Fun reference on UUID sort orders: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...
Unfortunately converting between them without any db extensions is a PITA in the absence of native support. I’ve written about it at length here (SQLite in the example, but the fundamentals apply regardless): https://neosmart.net/blog/2018/converting-a-binary-blob-guid...
(ETA: Also to answer your indirect question, Microsoft was definitely an early adopter of GUID/UUID. Though the inventor was Apollo Computer as a part of their complex network design: https://en.wikipedia.org/wiki/Apollo_Computer)
Anyway, nice to meet another .NET Ulid user in the wild!
I actually took the opposite path and explored several of the micro-optimized ULID libraries in C# before settling down with NUlid which I decided had the cleanest API and documentation (and reasonably good response times to GitHub issues). The implementation of Cysharp's lib has a lot of Unsafe code, which I understand why they took that path (given the focus on Unity code), but I'm hesitant about it and I liked a lot better (conceptually) how much more Span<T> code mcb2001's CSharp.Ulid library had and if I were to micro-optimize a library that's much more the path that I would take. (But so far micro-optimizations of that sort haven't been a priority in my real world usage and I'll take the well tested library with the nice API.) I created a couple simple extension methods to do the specific "ToSqlGuid()" and "ToUlidFromSql()" that I desired for the SQL Server sort order. (Which is just shuffling the first six bytes of a ULID byte array to the end of the GUID byte array and vice versa.)
I actually trust that I can at least generate ULIDs in SQL Server, again because the part I care to be sorted/needs to be correct (the timestamp) fits into the non-endian-issue portion of a GUID in SQL Server. I've been using just the generator function from ulid-mssql [1] (I used it mostly just for DB migrations and it was nice to have for that). It was debugging the endian issues and Base-32 encoding issues of the "ULID GUID to string" functions where I gave up and decided I didn't trust doing that in database.
Apart from that little misadventure (last I checked, there was a request for comments on a proposal to drop that from the Ulid spec altogether), there's really nothing else in the spec that isn't sane and straightforward, and little else to complain about.
But UUIDv7 is described as: Unix timestamp + fractions of second + increasing sequence + random number. Now imagine two processes start simultaneously and start generating UUIDs right away. IMHO chances of the two processes generating exactly the same UUID sequence are pretty high unless the implementation is smart enough to feed something like process id as a seed to random function.
Is 48 bits really enough to be a spatially unique identifier? Roughly 16 million entities would have a 50% chance of collision.
If you have a spatially unique identifier collision, it seems it might be possible for two independent entities to generate the same time stamp and counter codes resulting in an overall UUID collision.
If you need 16 million entries to get a 50% chance of collision, 100ns time resolution means you have 10^7 timestamps per second, so to actually experience a 50% chance of collision requires an instantaneous hash rate of over 10^14 per second, which I don't think you are likely to ever see in practice before these UUID versions are long obsolete.
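The birthday-bound arithmetic behind those figures can be checked directly (a minimal sketch; `draws_for_collision_prob` is a made-up helper name):

```python
import math

def draws_for_collision_prob(bits: int, p: float) -> float:
    """Approximate number of random draws from a 2**bits space that gives
    collision probability p: n ~= sqrt(2 * N * ln(1/(1-p)))."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))

# ~17-20 million 48-bit node IDs gives a ~50% chance of at least one collision:
n48 = draws_for_collision_prob(48, 0.5)
assert 1.5e7 < n48 < 2.5e7

# For comparison, a fully random 122-bit value needs >10^18 draws for the same odds:
assert draws_for_collision_prob(122, 0.5) > 1e18
```

So the 48-bit node field alone is indeed weak by itself; the argument above is that the timestamp multiplies the effective space enormously.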
XML and markdown are the most common source document formats.
> MAC addresses pose inherent security risks and MUST not be used for node generation.
Interesting concern in the distributed generation pathway.
I've used MAC addresses in the past for absolutely unique identifiers, but this is calling that out as a security risk, because the time + ARP data might be known and used to predict a future UUID from a machine?
Whether you consider this specific example threat a privacy risk versus a security risk here depends on your personal threat model, of course. But general consensus at this point is to label it a security risk.
The RFC leaves finding a machine-unique ID that isn't a security risk as an exercise to the user. (Mentioned directly in the v7 description where some bytes are optionally allocated for "flake"-like machine IDs.)
Another thing I was quite curious about (looks like you can use them in existing DB columns etc):
> The UUID length of 16 octets (128 bits) remains unchanged. The textual representation of a UUID consisting of 36 hexadecimal and dash characters in the format 8-4-4-4-12 remains unchanged for human readability. In addition the position of both the Version and Variant bits remain unchanged in the layout.
e.g. MySQL accepts only "no dashes" and the standard textual format: https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functi...
Postgres is a bit more flexible, but still expects hyphens to be after multiples of four characters if present: https://www.postgresql.org/docs/9.1/datatype-uuid.html
I think youtube does something similar with their video IDs.
So the authors wrote this XML https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...
That XML got turned into this plaintext https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u... by the xml2rfc tool https://xml2rfc.tools.ietf.org/
And then that plaintext got turned into the linked HTML by the datatracker.ietf.org server software, which can be found in SVN https://svn.ietf.org/svn/tools/ietfdb/
[0] https://github.com/uuid6/uuid6-ietf-draft/blob/master/draft-...
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
But rfc2629.xslt isn't uploaded in the same directory (and Firefox gets real unhappy when it tries to parse the HTML 404 page as an XSLT file). Which is fine, the XML file isn't there to show it rendered, it's to download the source. So you can still just save it.
I was sort of hoping this would yield something obviously superior to KSUIDs, since nominally this is an attempt to create an IETF standard to solve the same problem KSUIDs solve, but...
...I'm not seeing it. I just want enough random bits to ensure I don't need to worry about a collision, combined with enough of a timestamp that I can roughly k-sort them and get acceptable performance using them in a DB index. KSUIDs do this. These proposed formats... not so much.
Specifically:
1. All three of these standards devote significant numbers of bits to providing timestamps with very high precision or very distant epochs, or both. If you're generating a v6 UUID and your device's wall clock says it's currently the year 1582 (or really, any year prior to 2021), something has gone very wrong. As for the high precision, there's nothing wrong with it per se, but you could shave off a few bits there and spend that space on more randomness.
2. All three of these standards devote shockingly few bits to raw random data. v6 has 48, and v7 and v8 don't require any (!) and max out at 62 bits. If you could trust that all devices generating v6 UUIDs had super accurate, super in-sync wall clocks, I suppose 48 bits of randomness and a 100 nanosecond timestamp might make sense. But in the real world, I think you're better off focusing on randomness, with only enough of a timestamp to make your database happy.
3. Both v6 and v7 devote significant attention to a clock mechanism. Clocks make a ton of sense in a system like Boundary's Flake (and Twitter's Snowflake), where you're embedding a node ID in the generated value, but these proposed formats don't do this. (Aside from an offhand comment that you could repurpose some of the random bits for this, but that this is "out of scope" for the spec.) Which is fine, but if your UUIDs aren't namespaced per-node, then your clock now needs to be synchronized globally, which is somewhere between "unfeasible" and "outright impossible". I appreciate the review they did of prior solutions, but you can't just grab arbitrary features from successful designs without considering the context that made the feature work in that design. As far as I'm concerned, if the format doesn't have a node ID, then all the bits devoted to clocks should be spent on more randomness.
I assume/hope that these UUIDs are useful for some applications, but I'm really struggling to see it. They seem good at things that aren't going to benefit users and bad at the things they're trying to solve, particularly at scale, although there's no real point for hobbyists either.
What I am trying to determine is whether 62 bits of entropy combined with a timestamp gives me better or worse collision resistance than 122 bits in a purely-random format. Most of our keys only live within the scope of a single computer, but ~5% of them might be generated by/shared between other systems.
Being able to order our keys by time would be really nice, and I like that they would compress/index better.
Maybe I could do a hybrid between V4 and V7 keys depending on if the type would be shared with external parties. There are many types that only ever scope to a single box.
I probably couldn't play with the idea of adding a "machine id" because the space of all possible machines is difficult to anticipate/control right now.
I think I just talked myself into sticking with v4.
For a one-in-a-million chance of having at least one collision, you'd need to generate 3.04e6 random 62-bit values or 3.26e15 random 122-bit values. For a one-in-a-thousand chance, those numbers would be 9.6e7 or 1.03e17 respectively.
In other words: if you're allocating v7 IDs (with no sub-second portion of the timestamp) at a sustained rate of 3.04 million per second, there's a one-in-a-million chance each second of having at least one collision. Taking 1 - 0.999999^86400, that gives an 8.3% chance of at least one collision after a day, 45.4% after a week, and 92.5% after a month.
If you instead were allocating v4 IDs at the same rate for a month, you'd have allocated (3.04e6)×86400×30 IDs, and you'd have a probability of 5.84e-12 of at least one collision among all of those IDs.
[1] Collision probability given number of draws: https://www.wolframalpha.com/input/?i=n%3D3.04e6%2C+H%3D2%5E...
[2] Min number of draws to get a given collision probability: https://www.wolframalpha.com/input/?i=p%3D0.000001%2C+H%3D2%...
edit: If you are storing milliseconds in subsec_a and generating IDs at the same rate, you'd have 3040 ids per millisecond, which have a probability 1.00164e-12 of a collision in any given millisecond. Taking 1-(1-1.00164e-12)^(86400*1000) gives you a probability of 0.0087% of a collision in a day, 0.061% in a week, or 0.26% in a month. So as long as your event generation is relatively constant across that second, you will indeed have a much lower collision chance if you are using millisecond subdivisions. If you have even higher precision time or are using a sequence counter, that can be pushed down even further.
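Those figures are straightforward to reproduce with the standard birthday approximation (a minimal sketch; `collision_prob` is a made-up helper name):

```python
import math

def collision_prob(n: float, bits: int) -> float:
    """Approximate probability of >=1 collision among n uniform draws
    from a space of 2**bits values: 1 - exp(-n*(n-1) / (2 * 2**bits))."""
    return 1 - math.exp(-n * (n - 1) / (2 * 2 ** bits))

# 3.04 million 62-bit draws per second -> ~one-in-a-million chance per second:
per_second = collision_prob(3.04e6, 62)
assert abs(per_second - 1e-6) < 1e-7

# Compounding over a month of seconds reproduces the ~92.5% figure:
per_month = 1 - (1 - per_second) ** (86400 * 30)
assert 0.90 < per_month < 0.95
```

The edit above about millisecond buckets follows the same shape: shrinking n per bucket by 1000x shrinks the per-bucket probability by ~10^6, which more than compensates for having 1000x more buckets.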
If so, with v7, I'd feel the need to find the right balance of time precision, time synchronization, node identification, and randomness. Just some thoughts out loud.
Imagine you had three devices one of which has an accurate wall clock, one of which is 5 minutes slow, and one of which is 10 minutes slow, and you're using millisecond precision. So what happens is:
For a given arbitrary "accurate" time such as UTC 00:00:00.5 Dec 1st 2021 (500ms after midnight), device one gets 1 millisecond to generate UUIDs into that timestamp "bucket". If it generates enough, it might get a collision, but likely it'll be fine. After 1 millisecond, its wall clock moves on to 501ms after midnight, and it stops generating UUIDs that (might) collide with other UUIDs generated at 500ms after midnight. Five minutes later at 00:05:00.500 the second device (incorrectly) thinks it's 500ms after midnight, and starts generating UUIDs for that millisecond "bucket". And then after another 5 minutes, the third device does. And we can extend it; an hour later a glitchy NTP server (or daylight savings, or whatever) might cause the devices to all roll their time back, and then they all go through the "500ms after midnight" window a second time, getting another shot to generate conflicts. And that's fine!
What matters is the total number of UUIDs you generate (over the entire lifetime of your system) by devices which believe (correctly or not) that it's that specific timestamp.
Now, one potential gotcha is that a user might be overly clever and think they can back the timestamp out of the UUID to figure out exactly when it was created. That does require precise synchronisation if you want accurate answers, but if you're trying to figure out the exact creation time to the millisecond of UUIDs created in a distributed fashion...well, eh, hopefully you know what you're doing. :)
If you have your To Do items identified by an incrementing integer, you have to come up with some other unique identifier to use for To Do items that have yet to be synced with the server's database (since you can't know which ID to use next).
On the other hand, if you use UUIDs, your app can just create a new UUID for the To Do item, which will continue to be its ID for subsequent updates.
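The offline-sync pattern described above looks roughly like this (assumptions: a hypothetical `TodoItem` structure; `uuid4` stands in for whatever collision-resistant ID scheme you pick):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class TodoItem:
    title: str
    # The client mints the ID itself -- no round-trip to the server is needed
    # before the item can be referenced, cached, or updated offline.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    synced: bool = False

item = TodoItem("buy milk")
# Later, the sync layer pushes the item; the server stores it under the same
# ID, so edits made offline reconcile without any ID remapping step.
assert len(item.id) == 36 and not item.synced
```

With server-assigned integer IDs you'd instead need a temporary local ID plus a remapping pass after sync, which is exactly the complexity client-generated UUIDs avoid.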
"The motivation for using UUIDs as database keys stems primarily from the fact that applications are increasingly distributed in nature. Simplistic "auto increment" schemes with integers in sequence do not work well in a distributed system since the effort required to synchronize such numbers across a network can easily become a burden. The fact that UUIDs can be used to create unique and reasonably short values in distributed systems without requiring synchronization makes them a good candidate for use as a database key in such environments. "
So UUIDs are great because multiple servers can generate plethora of them without having to worry about collisions.
This is all explained in the first few pages of the document, although not in simple terms.
Also I haven't spotted anything about UUIDv6 trying to avoid collisions, but I haven't read very far.
" node: 48-bit pseudo-random number used as a spatially unique identifier Occupies bits 80 through 127 (octets 10-15)"
This is so common, I've experienced this myself as a student with a summer job: the company I was working for used an online service for their payroll, I accessed my pay slip one day and changed ?employee=123 to 124 in the URL out of curiosity and it loaded the data for a different person in a different company. They fixed it incredibly quickly once I told them about it :)
this approach also allows you to generate ids client side. which if you're working with a local db on ios/android etc this is very convenient
A ULID is 48 bits of timestamp plus 80 bits of entropy. These new formats just don't have enough random-component entropy. 64 bits isn't enough. Who cares if your millisecond timestamp wraps every 8900 years, as a ULID's does?
What’s the actual advantage of the other uuids? I know what they are on paper but has anyone used other ones in practice?
The wheel turns.