New UUID Formats – IETF Draft
source link: https://news.ycombinator.com/item?id=28088213
UUID6: a timestamp with a weird epoch and 100 ns precision like in UUID1, but in a big-endian order that sorts naturally by time, plus some random bits instead of a predictable MAC address.
UUID7: like UUID6, but uses normal Unix timestamps and allows more timestamp precision.
UUID8: like UUID7, but relaxes requirements on where the timestamp is coming from. Want to use a custom epoch or NTP timestamps or something? UUID8 allows it for the sake of flexibility and future-proofing, but the downside is that there's no standard way to parse the time from one of these -- the time source could be anything monotonic.
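To make the UUID7 idea concrete, here is a rough sketch of what a UUIDv7-style generator could look like under the draft. The field widths and bit positions follow one common reading of it (48-bit millisecond Unix timestamp up front, version and variant bits stamped in per RFC 4122); the exact layout may differ in the final RFC.

```python
import os
import time
import uuid

def uuidv7() -> uuid.UUID:
    """Sketch of a draft-style UUIDv7: 48-bit millisecond Unix timestamp,
    then version/variant bits, with the remainder filled from os.urandom."""
    ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = ts_ms << 80 | rand
    # Version lives in the high nibble of octet 6 (bits 79..76 of the
    # 128-bit integer); the variant in the top two bits of octet 8
    # (bits 63..62). Overwrite the random bits that landed there.
    value &= ~(0xF << 76)
    value |= 0x7 << 76   # version = 7
    value &= ~(0x3 << 62)
    value |= 0x2 << 62   # variant = 0b10, RFC 4122
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, values generated in different milliseconds sort by creation time both as integers and as hex strings.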
IMHO that should not be a concern. The goal of a UUID is to create a Unique Identifier, not to record the time something was created. Maybe we should all have quantum random number generators, but even then I don't think 128 bits is enough for a UUID. It might be enough for specific applications, though.
First rule of database design: Your unique IDs should not have a real-world meaning.
Well, this is an opinion. I would definitely push back on making it a rule. Especially the first one.
Natural keys exist, make perfectly good unique identifiers, and have inherent meaning.
When using synthetic keys it is difficult (read: requires making trade-offs) to guarantee they were created in order and/or that they increase monotonically. But if you make those trade-offs (or just align expectations) you can still associate real world meaning to them.
These new UUIDs standardize more of those trade-offs.
Most of the "natural" things you think are going to be good Primary Keys aren't. You quickly get into one of those "Falsehoods Programmers Believe About X" lists.
I remember some years ago running into the Experian team who were designing a system to uniquely identify people and they'd got it all figured out for the US market. See, Americans have Social Security Numbers, so you can just use those to key off. Alas, as hopefully anybody here who actually deals with Social Security Numbers knows, they aren't unique. Perhaps you could argue they should be unique, but that's not useful in practice, because in practice they aren't. They seemed really disappointed about that. I doubt they fixed it though.
It's far more likely that the "definitely unique" Natural key for your table isn't actually unique than that there will be an insurmountable problem caused by using artificial row IDs, UUIDs or something like that.
Never, ever, ever try to use SSN as a primary key or anything even remotely like a primary key.
Product serial number might be a better example. Or file hash.
Nope. The "serial number" is just whatever you printed on the box or maybe on the product itself. You may intend that to be unique but you've got no way to ensure that. If you tell the people in manufacturing that if any duplicates occur it's a show stopper they will tell you to stop being stupid and design your IT systems to cope with the real world.
I buy serial-numbered LED lamps, the manufacturer replaces them if they drop dead within a certain number of years and I have some applications (18 hours per day of lighting) where that's plausible, so I need to track those serial numbers. Except they once shipped me a lamp (one in a box purchased together with others) with no serial number. Completely blank, no manufacturer, no type identifier, no voltage, the entire label was just missing from the product as shipped. Emailed them. No problem, apologies for the defect, your warranty registration for a lamp with no serial number will be fine.
You've got a pretty low opinion of people in manufacturing there, buddy.
No manufacturer I've ever worked with would even blink at a requirement for unique serial numbers.
It's easy and even expected to have a requirement. There's a requirement that US SSNs are unique too. A requirement doesn't cut it. The requirement is going to fail sooner or later and what then? How much did the manufacturers you worked with indemnify you against the consequences of non-unique serial numbers?
That's what we're getting at here. When it breaks somebody has to fix it. Or, you could design it not to break.
It's a man's laws versus nature's laws problem. Did you know you can pick up the soccer ball, and just carry it over the goal line? Yeah, that's "against the rules" but you can do it anyway, being against the rules doesn't make it impossible.
Now, contrariwise, you can't teleport the ball - doesn't matter whether the rules allow it, you can't do it anyway. Nature's laws are facts. Design your database to rely on facts, not man's laws, as much as possible.
First of all it'll get picked up by the people who set up the serial number printer. They don't just set it to a random number - the factory will have a procedure to make sure the number is set right.
Then it'll get picked up by the end-of-line testing, as test reports are filed by serial number.
Then it'll get picked up when the first part comes off the production line and gets checked by the factory's most conscientious testers and engineers.
Then it'll get picked up at goods inbound, when they count the items and check them in, and find the system says some of them are used, or there's a discrepancy in the number of items.
Then it'll get picked up at acceptance testing, when they find the shipment's test records aren't in order.
Then it'll get picked up by accounts payable, when the manufacturer bills for parts they've already been paid for.
Then depending on the numbers and costs involved, you either send the bad units back for relabelling or trash them, in either case at the manufacturer's expense.
Printer ran out of toner and nobody noticed until the next shift.
> Then it'll get picked up by the end-of-line testing, as test reports are filed by serial number.
Nah, test report operator was out sick that day and product needed to be pushed anyway. Manager assumed they're all good and overrode tests.
> Then it'll get picked up when the first part comes off the production line and gets checked by the factory's most conscientious testers and engineers.
Those testers and engineers don't have all day to test every single product. They only tested the first 100.
> Then it'll get picked up at goods inbound, when they count the items and check them in, and find the system says some of them are used, or there's a discrepancy in the number of items.
Counted by total weight divided by individual weight.
> Then it'll get picked up at acceptance testing, when they find the shipment's test records aren't in order.
Counted total number of UPCs.
> Then it'll get picked up by accounts payable, when the manufacturer bills for parts they've already been paid for.
Accounts payable sees a box of 1000 parts, they're labeled, and the top 100 appear genuine... pay it.
> Then depending on the numbers and costs involved, you either send the bad units back for relabelling or trash them, in either case at the manufacturer's expense.
Manufacturer thinks you're trying to defraud them and won't authorize a return without a serial number.
The whole point of a serial number is to be able to track an item through your supply chain.
If it can get through your supply chain without anyone ever reading the serial number, why are you putting a serial number on it at all? Decoration?
My point is that even though there are multiple points where the serial number "should" be checked, there are also multiple points where that could fail for one reason or another.
You buy the other F-150 sn 1234. You take your truck in for service. They refuse to service salvage title vehicles under warranty. You explain you don’t have a salvage title. They blame “the system”. Your insurance company fares no better.
Or, Ford could put a unique constraint on serial number and catch the problem in the factory.
It’s not clear to me how the internal_db_id did anything except propagate an error until it was way too late to fix.
The entire point of properly modeling data is to codify and enforce the “rules”.
> Nature's laws are facts. Design your database to rely on facts, not man's laws, as much as possible.
Exactly. Natural keys can, along with normalization and constraints, stop you from picking up the ball and walking into the goal. In other words, natural keys define the nature of your system.
> Emailed them. No problem, apologies for the defect, your warranty registration for a lamp with no serial number will be fine.
Seems fine?
They probably do, but every time I thought something would make a good natural key, I've been wrong.
Sortable unique IDs have many performance benefits in real world systems.
I see a lot of comments conflating a unique ID with a unique column. It's OK for you to have a unique constraint on something like an email column. But you wouldn't use the email column as the primary key.
My thesis is that computers are so full of unwanted, unneeded things like this that no engineer knows them all while also having a good understanding of software engineering and infosec. Most vulnerabilities are due to a lack of understanding of how the primitives work, as opposed to flaws in reasoning (the former is clearly distinct: the summary provided to save him from spending a month looking at the implementation is inadequate).
I have designed UUIDs myself and they are simply a long random bitstring. The random part is already necessary because we need sufficient randomness for crypto to work in the first place. IETF likes to "engineer" things.
I don't think that was the downside mentioned, I think it was because of the lack of standardization on how to parse the timestamp.
Exactly as used by Windows to store file times. And the epoch is the start of the modern calendar: https://en.wikipedia.org/wiki/Gregorian_calendar
Talking about dates before this epoch makes little sense (any date before the Gregorian epoch will not resemble the same stellar / planetary configuration as a date after the Gregorian epoch).
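For reference, the 1582-10-15 epoch is straightforward to work with in practice. A sketch of recovering the wall-clock time embedded in a version-1 UUID, where the offset constant is the number of 100 ns ticks between the Gregorian epoch and the Unix epoch (the same constant appears in the RFC 4122 reference code):

```python
from datetime import datetime, timezone
import uuid

# 100 ns ticks between 1582-10-15 (Gregorian epoch) and 1970-01-01
# (Unix epoch): 12_219_292_800 seconds.
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000  # 122_192_928_000_000_000

def uuid1_to_datetime(u: uuid.UUID) -> datetime:
    """Recover the timestamp from a version-1 UUID's 60-bit time field."""
    unix_100ns = u.time - GREGORIAN_TO_UNIX_100NS
    return datetime.fromtimestamp(unix_100ns / 1e7, tz=timezone.utc)
```

The constant can be sanity-checked against Python's own (proleptic Gregorian) date arithmetic: the two epochs are exactly 12,219,292,800 seconds apart.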
I'd also argue that you're right, and that as a principle, good UI/UX and software architecture should default to following natural Schelling Points.
I bet you'd see a lot of Jan 1 1900, 2000, or year 1. Very few would pick 1970.
Other fun non-Jan 1 ideas might be December 26, 1791 (Charles Babbage's birthday) or February 5, 1944 (the date Colossus came online).
With a great deal of existing routines that work with those values and a good number of existing files and storage applications based on the simple version, it makes even more sense to extend that in an unsigned way with a larger bitfield for any timestamp. Though as recently came up, leap seconds and synchronization could use a few bits to describe which timestamp is in use.
It’s like picking 2020 as the beginning of an epoch now (of course we’d probably fall back to 2000 as that’s a big round number).
1997-08-29T06:14:00Z
Ah yes, nobody will ever need to represent a date/time before 1970-01-01.
As for your thought experiment: recently I had to pick a "NULL/no/unknown date" value to put into a non-nullable database column and I picked 1900-01-01. Unix epoch crossed my mind only briefly and I discarded it immediately because it's too arbitrary and, more importantly, too near valid dates that the user might use. (Use-case: materials from historical archives being ingested.)
So my opinion is that the RFC authors should be commended for having the foresight of choosing an epoch that's widely more applicable and far less arbitrary (the 1st day of the calendar system that most of the world is using today) than the Unix epoch.
If only we had a way to represent numbers less than zero :)
More like: the few people interested in doing this can use UUIDv8 (unspecified time format). There are many, many reasons to store times before 1970 in computer systems, but I'm not sure any apply to these UUID formats.
tl;dr of the "Background" section of the RFC: they're designing it so that two UUIDs with fresh timestamps will have values near each other in the database keyspace, which in many cases improves efficiency. If your timestamp isn't freshly generated, I don't know why you'd embed it in a UUID with these formats.
It's arguably not a good practice to extract the timestamp from the UUID at all. Instead, I might just treat them, once generated, as opaque approximately-sorted bytes. It's clearer to have separate fields for timestamps of particular importance. Though I might not feel too religious about that if tight on storage space.
Ingesting past/archival data. As to why use these formats: to still benefit from their sorting properties. As to why use UUID at all: because it's a format understood by a wide variety of software.
> It's arguably not a good practice to extract the timestamp from the UUID at all.
Ya, I'd do it only if I trust the source (i.e., a closed system).
Their argument about the value of sorting is quite particular. Put stuff roughly in ascending order by the record creation time so that creating a bunch of records will touch (eg) fewer btree nodes. The simplest way to achieve that is to just use the current time. If you have some other timestamp at hand and happen to be scanning them in ascending order by that, you still get no advantage by this argument over just using the current timestamp:
> However some properties of [RFC4122] UUIDs are not well suited to this task. First, most of the existing UUID versions such as UUIDv4 have poor database index locality. Meaning new values created in succession are not close to each other in the index and thus require inserts to be performed at random locations. The negative performance effects of which on common structures used for this (B-tree and its variants) can be dramatic. As such newly inserted values SHOULD be time-ordered to address this.
My gut tells me that you should rarely if ever stick past timestamps in UUID primary keys. One reason why: can it ever change? Maybe your artifact dating technique was wrong, and you want to update the dates of a bunch of artifacts. But now you have to change your primary key. You probably don't want that.
And in any case, 15 October 1582 isn't some hard cutoff point where we can stop worrying about calendar conversions. Only four countries adopted the Gregorian calendar on that day, and even in Europe there are several countries that only switched in the 20th century. If a piece of software needs to support historical dates that go anywhere near 1582, it needs to be aware of the existence of different calendars.
[1] https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar
That is not the only sense in which we care about dates. For example, we might want to talk about which of two events came first, and by how much. Historians have lots of uses for dates before the introduction of the Gregorian calendar.
> we might want to talk about which of two events came first, and by how much
Neither of which makes sense without having a common reference point (start of Gregorian calendar being one such possible reference point). Trivial example: there is exactly one day between 1582-10-04 (Julian c.) and 1582-10-15 (Gregorian c.), i.e., different reference points.
> Historians have lots of uses for dates before the introduction of the Gregorian calendar.
Yes, and I bet they're hyper-aware about the calendar system(s) they're using. The rest of our mortals can just use Gregorian calendar :)
I ask because a common newbie bug is to use a UUID as a secure token, but currently only v4 UUIDs with a cryptographically secure PRNG can be used this way. Separately, even if not used as a token, using UUIDs with randomness for object IDs can help mitigate IDOR vulnerabilities if there are other bugs in code that aren't adequately checking object permissions.
v6 UUIDs only have 48 bits of randomness, which means that some newbies will think that's "good enough" for security, when in reality it's not.
I still like these new UUID types because I would use them as DB primary keys now (benefit of being time sorted but also globally unique and offers some protection against IDOR bugs), but important to know where not to use them.
Uuid6 is the one I’ll be switching to now.
Any randomness in the UUIDs is meant to minimize the possibility of collisions. If e.g. you only used a timestamp then you might get collisions even on the same machine, and are almost guaranteed collisions sooner than later on a global scale.
The draft mentions randomness in the background section, but not really.
> Lastly, privacy and network security issues arise from using a MAC address in the node field of Version 1 UUIDs. Exposed MAC addresses can be used as an attack surface to locate machines and reveal various other information about such machines (minimally manufacturer, potentially other details). Instead "cryptographically secure" pseudo-random number generators (CSPRNGs) or pseudo-random number generators (PRNG) SHOULD be used within an application context to provide uniqueness and unguessability.
And in the Security Considerations section
> Instead [of MACs] pseudo-random bits SHOULD be selected from a source with sufficient entropy to ensure guaranteed uniqueness among UUID generation.
However, the rest of the document does not mention how to generate the random data, at all. If the draft advanced to RFC or even Standard like this, it would be fine to use the "Debian PRNG" [1].
Then again, only the proposed v6 mandates any random data at all (and 48-bits "only"). v7 and v8 have essentially a "may use" for any randomness.
Using only UUIDs for "security" is generally a bad idea in my humble opinion, regardless, even v4 with a cryptographically secure PRNG (CSPRNG). But I suppose mandating the use of a CSPRNG would still alleviate some of the injuries to people who point the UUID gun at their feet and pull the trigger.
[1] https://i.imgur.com/WkLh32P.png
UUIDv4 can be entirely suitable for security purposes, so long as the implementation is defined to use a secure RNG, which most sane implementations would be (since that's the dead simplest way to ensure the guarantees of UUIDv4).
Most people use them for security purposes because UUID libraries tend to make it easy - you get 122 bits of randomness (more than enough for many use cases), a nice way to convert to strings in multiple formats, parsers, etc.
Only they should be used that way. Unfortunately other options (other UUID types, v4 with bad RNG) are sometimes used that way. At least other UUID variants are more faff to hack around than a simple increasing integer, particularly if decent request and access validation is in place, though if the wrong type of key is being used, perhaps hoping for good validation elsewhere is a bit much.
Are you concerned about the 0.0000001% chance of collision after generating a 100 trillion UUIDs?
Or are you trying to include metadata in your identifier? (Not the worst thing, but it's also not super useful info.)
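That collision figure matches the usual birthday-bound approximation. A quick back-of-the-envelope check, assuming all 122 random bits of a v4 UUID are uniformly random:

```python
def birthday_collision_prob(n: int, bits: int) -> float:
    """Approximate probability of at least one collision among n
    uniformly random values of the given bit width: n^2 / 2^(bits+1).
    A standard birthday-bound estimate, accurate while the result is small."""
    return n * n / 2.0 ** (bits + 1)

# UUIDv4 carries 122 random bits; after 100 trillion of them:
p = birthday_collision_prob(100_000_000_000_000, 122)
# p is roughly 9.4e-10, i.e. about 0.0000001% as a percentage.
```

The estimate is the first term of the full birthday-problem expansion, which is why it only holds while n is far below 2^(bits/2).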
I've long thought it was a problem that PostgreSQL doesn't have a real clustering key like MySQL or SQL Server.
But PostgreSQL users have told me it doesn't really matter.
Mainly because indices have to be updated and searched all over the place with UUIDs. The locality of the data on disk itself is fine because it's inserted sequentially.
Those users were very wrong, unless they explicitly only talked about the case where the uuids are not indexed.
At my last job, this was managed by using two IDs on each row - a serial one that was basically only used for database optimization purposes, and the "real" UUIDv4 ID. It felt gross then, and it feels gross now, but it seemed to do the trick for our needs.
With sequential-ish values indexed by something like B-trees, the same index pages get modified over and over again because new table rows will be in a narrow part of the index. As databases / operating systems typically buffer writes for a while, this reduces the number of writes hitting storage substantially.
Conversely, with random values, every leaf page will be modified with the same probability. That's not a problem with a small index, because you'll soon dirty every page anyway. But as soon as the index gets larger you'll see many more writes hitting storage.
If your access patterns are such that you often access IDs generated around the same time together, v6-8 will probably be faster due to locality. If you access data generated on the same node together, then the same goes for v1-2.
If you want a reproducible id based on some other identifier, then only v3 and v5 will work.
So even the IETF hasn’t read Things Every Developer Should Know About Time.
The UUID specs use the terms "variant" and "version" a little funny; "variant" is essentially the revision (so all modern UUIDs have variant=0b10 to specify RFC 4122 UUIDs), and "version" is a 4-bit number identifying the sub-type within that variant:
UUIDv1 time-based
UUIDv2 legacy DCE security thing, not widely used
UUIDv3 name-based with md5 hashing
UUIDv4 randomness-based
UUIDv5 name-based with sha1 hashing
This draft registers a few new sub-types:
UUIDv6 sortable time-based, Gregorian calendar
UUIDv7 sortable time-based, Unix time
UUIDv8 sortable time-based, custom time
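To make the variant/version bit layout concrete, here's a small check of those fields against Python's uuid module (bit positions as laid out in RFC 4122):

```python
import uuid

u = uuid.uuid4()
raw = u.int  # the UUID as a 128-bit integer, octet 0 most significant

version = (raw >> 76) & 0xF  # high nibble of octet 6 ("time_hi_and_version")
variant = (raw >> 62) & 0x3  # top two bits of octet 8 ("clock_seq_hi")

assert version == 4     # sub-type within the RFC 4122 variant
assert variant == 0b10  # variant marking this as an RFC 4122 UUID
```

Every modern UUID carries variant 0b10; the version nibble is what distinguishes v1 through (in this draft) v8.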
Use UUIDv4 and get on with your day.
UUIDv1 non-sortable time-based
UUIDv6 sortable time-based, Gregorian calendar
UUIDv7 sortable time-based, Unix time
UUIDv8 sortable time-based, custom time
That is: The custom-time version should only be used if you cannot use one of the standard time versions.
Due to the shortcomings of UUIDv1 and UUIDv4 detailed so far, many widely distributed database applications and large application vendors have sought to solve the problem of creating a better time-based, sortable unique identifier for use as a database key. This has led to numerous implementations over the past 10+ years solving the same problem in slightly different ways.
- Timestamps MUST be k-sortable. That is, values within or close to the same timestamp are ordered properly by sorting algorithms.
- Timestamps SHOULD be big-endian with the most-significant bits of the time embedded as-is without reordering.
- Timestamps SHOULD utilize millisecond precision and the Unix Epoch as the timestamp source. Although there is some variation to this among implementations, depending on the application requirements.
- The ID format SHOULD be Lexicographically sortable while in the textual representation.
- IDs MUST ensure proper embedded sequencing to facilitate sorting when multiple UUIDs are created during a given timestamp.
- IDs MUST NOT require unique network identifiers as part of achieving uniqueness.
- Distributed nodes MUST be able to create collision-resistant unique IDs without consulting a centralized resource.
" First, most of the existing UUID versions such as UUIDv4 have poor database index locality. Meaning new values created in succession are not close to each other in the index and thus require inserts to be performed at random locations. The negative performance effects of which on common structures used for this (B-tree and its variants) can be dramatic. As such newly inserted values SHOULD be time-ordered to address this."
To quantify how different: UUIDv4 has 6 fixed bits and 122 random bits.
My mistake - I meant to write "dynamic optimality".
As for the "not used in any database storage backends", I have personally started experimenting with them due to higher-order effects they can provide relative to their optimization technique.
The fact that the most recently accessed node is always at the root also means that incremental tree updates are well-bounded in terms of # of nodes that would need to be rewritten to disk. This allows for much more efficient usage of storage-friendly things like append-only log structures.
Once you put 2+2 together on it, it's like a lot of concepts seem to snap into place all at once. You also get incredible locality of access - each batch can sometimes fit within a single I/O block depending on system loading and data patterns, so traversing the tree to any node modified in the previous flush to disk would involve a single I/O.
Edit: If you search "splay tree append only log" on Google, you will find one of my previous HN comments on the first page of results: https://news.ycombinator.com/item?id=28010167 This tells me there isn't much precedent for this idea.
- They defeat data compression, which is an integral part of many database engines these days for both performance and scalability reasons.
- For some very large systems, there is a UUID collision tail risk that while small is large enough to be plausible.
Also, some data models (think graphs) tend to be not much more than giant collections of primary keys. Using the smallest primary key data type that will satisfy the requirements of the data model is a standard performance optimization in databases. It is not uncommon for the UUID that the user sees to be derived from a stored primary key that is a 32-bit or 64-bit integer.
The main feature of a uuid is it allows distributed generation. 32-bit or 64-bit integers are almost always sequential numbers. The sequential nature allows efficient page filling and index creation, but the contention involved in creating a sequence grows rapidly with scale.
So while a 128-bit uuid is larger than a 64 bit integer, this version allows for the bulk of the benefits of sequential integers while reducing the biggest drawback of contention at the point of creation.
You also do not need 128-bits to decentralize unique key generation, even though it is quite reasonable if you design your keys well. Many do it with 64-bit unique keys in massive scale systems.
A subtle point that you may be overlooking is that while large-scale distributed databases, including all the ones I work on, export globally unique 128-bit keys, in most systems I’ve worked with they are internally represented and stored as 64-bits or less even if the key space is much larger than 64 bits. There are many different techniques for doing key space elision and inference that are used inside distributed databases to save space. The 128-bit value is only materialized when sending it over the wire to some external client system. But you don’t need to store it.
Literally storing a primary key in a distributed system as a 128-bit value is all downside with few benefits. For small systems the performance and scaling cost may not matter that much but in very large systems it matters a lot. It can — literally! — cost you millions of dollars per year.
Unless you need to be completely disconnected, a little coordination can drastically improve things. In past companies, I used a simple counter with 64-bit integers and each distributed process would increment a billion-number range to use for IDs. Fast, efficient, compatible with everything, naturally ordered, and guaranteed to never have a collision.
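A minimal sketch of that block-allocation scheme. The class and block size here are illustrative; a real system would use a database sequence or similar durable counter for the one coordinated step, rather than an in-process lock:

```python
import threading

class BlockIdAllocator:
    """Central counter that hands each worker a block of one billion IDs;
    after that, each worker allocates locally with no further coordination."""
    BLOCK = 1_000_000_000

    def __init__(self) -> None:
        self._next_block = 0
        self._lock = threading.Lock()

    def new_worker(self):
        with self._lock:  # the only coordinated step, done once per worker
            start = self._next_block
            self._next_block += self.BLOCK
        return iter(range(start, start + self.BLOCK))

central = BlockIdAllocator()
worker_a = central.new_worker()
worker_b = central.new_worker()
```

Since the blocks are disjoint, IDs can never collide, and within each worker they come out strictly increasing.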
If a UUID is completely random it means it can be inserted anywhere which can require the DB to reshuffle records and pages in order to make room for the new record.
Having a sequential element to the UUID makes it a lot easier to have an index where each record is inserted at the end. Which, like you said, makes page usage more efficient and decreases the amount of work a DB has to do on insertion.
All this is a compounded problem if you have a DB with frequent writes, a lot of indexes, or both.
These should be high-entropy items.
Should you desire a high-entropy key, e.g. for hash tables or security, this can be trivially derived from the low-entropy key while still guaranteeing deterministic uniqueness.
Compressible primary keys are superior in every way.
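One way to sketch that derivation is with a keyed hash such as BLAKE2: the small sequential key stays inside the database, and a high-entropy external identifier is computed from it on the way out. The secret name and digest size here are illustrative assumptions:

```python
import hashlib

SECRET = b"per-deployment secret"  # hypothetical; keep it out of source control

def public_id(internal_id: int) -> str:
    """Derive a high-entropy external identifier from a small sequential
    primary key. Deterministic, so the uniqueness of the input carries
    over; unguessable without the key."""
    digest = hashlib.blake2b(
        internal_id.to_bytes(8, "big"), key=SECRET, digest_size=16
    )
    return digest.hexdigest()
```

The mapping is injective for 64-bit inputs in practice (a collision would be a 128-bit hash collision), so uniqueness is preserved while the stored key stays compressible.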
Generally one index contains the data; all the other indexes have pointers to that index in their leaf nodes. Usually the PK index is the one that contains the data.
So loading 1000 consecutive rows in a created_at secondary index takes two steps: 1) created_at index B-Tree traversal (maybe 10 pages from disk, max?) 2) Potentially loading 1000 randomly sorted pages from disk, dereferencing pointers (EXPENSIVE!)
Whereas, if your primary key is sorted by time, loading 1000 rows is 1) slightly larger primary B-Tree traversal 2) no 2, you're done :)
My best theory is that it's the Reddit "I don't need to understand how stuff works to be a programmer" crowd.
I guess for use-cases like Instagram's: "Generated IDs should be sortable by time (so a list of photo IDs, for example, could be sorted without fetching more information about the photos)... If you use a timestamp as the first component of the ID, the IDs remain time-sortable... [but require] more storage space (96 bits or higher) to make reasonable uniqueness guarantees" [0]
[0] https://blog.twitter.com/engineering/en_us/a/2010/announcing...
The only case where I don't use them is if I don't want to create time of the item to be known to the user.
What are the advantages of sortable UUIDs with embedded timestamps over having an auto-incrementing id column, a created_at column, and using the composite primary key (created_at, id)?
That sequence allocation needs to be ACID as well, so that there is no chance of any allocation being repeated.
Secondly, whether or not being predictable is a problem depends on the use case. The keys used on this forum are predictable and it is not an issue.
The issue with using Mac addresses is it essentially leaked tracking information on the machine being used to generate the uuid - you could associate every ID it ever generated back to that same machine, correlate it to things like open WiFi AP logs, whatever. Which is pretty scary sometimes.
Many OS’s now randomly generate a new MAC on every disconnect/connect for this reason, to at least give some privacy.
At any rate it's definitely simpler, and you don't have to forego the created_at column either.
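The composite-key alternative asked about above can be sketched with SQLite's in-memory database (table and column names are made up; a real schema would presumably use an auto-incrementing id for the second component):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE events (
           created_at TEXT    NOT NULL,  -- ISO 8601 sorts lexicographically
           id         INTEGER NOT NULL,  -- per-timestamp tiebreaker
           payload    TEXT,
           PRIMARY KEY (created_at, id)
       )"""
)
conn.execute("INSERT INTO events VALUES ('2021-08-05T12:00:00Z', 1, 'a')")
conn.execute("INSERT INTO events VALUES ('2021-08-05T12:00:00Z', 2, 'b')")
rows = conn.execute(
    "SELECT id FROM events ORDER BY created_at, id"
).fetchall()
```

Like a time-ordered UUID, new rows land at the end of the primary key's ordering; the trade-off is that the key is only unique within one database, not globally.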
This is clearly an individual draft. OK. And it has previously expired without action (last year) but a newer draft was submitted this year.
But it doesn't seem to have been adopted by any working group, and although the word "dispatch" appears in the title it was not discussed at IETF 111's DISPATCH or GENDISPATCH or SECDISPATCH last week. If this was in fact dispatched somewhere - perhaps before it expired last time - there's no indication where it went or what its status is now.
It is the nature of these things that if everybody chooses to do exactly this, even if it was only documented in a by-then expired draft, or on the back of an envelope, then that's how it is -- the IETF has no enforcement arm. However, if you support this proposal, or even if you think it'd be a good idea with minor tweaks, you should find out where (and if) it's being developed and get on board with that.
I for one have been waiting for an IETF-standard sortable UUIDs for a while, so I'm happy to upvote this even if it is "blog post wishlist"-stage of standardization.
However, I'd like to see the draft include a mention about the practice of embedding machine and data type identifiers into the format which helps in distributed applications.
If you need to obscure both when in time and where in space, then UUIDv4 is a better option.
Do you mean you prefer ksuid to this new draft? Your comment would be more helpful if you said what about it you prefer.
they have 128 bits of pseudorandom data
Both text and binary representations are lexicographically sortable
They don't have the awkward dash ('-') to mess with url encoding
Can you explain what you mean by that? I have experienced issues with " " as "+" vs. "%20", but dashes... ?
ETA: also, when you put a UUID as text into a lot of search engines, it treats the dashes as token delimiters.
Let's see if I can figure it out (correct me if I'm wrong).
I think maybe these newly proposed UUIDs ARE lexicographically sortable (in timestamp order, I think is the implication) in both text and binary form. I got the impression that was one of the main things they provide compared to older UUID formats, and one of the main reasons for their proposal? Am I wrong?
I know the new UUIDs look the same as the ones we're used to, where they often have dashes (although there's certainly no requirement you display them with dashes, you can strip the dashes out before putting them in a URL if you want, although yeah it's an extra step).
[If we're talking about URL-encoding, we're talking about putting them in URLs? ksuid's 160 bits instead of UUID's 128, making a longer URL, seems like a downside?]
Not sure how many bits of pseudorandom/entropy they have... Looks like... UUIDv6 has 48 bits of pseudorandomness in addition to the timestamp, to reduce timestamp collisions. UUIDv7 is variable and can let the application decide (am I understanding right?), but supports up to 62 bits of pseudorandom data. [I guess you prefer a longer ID with more bytes of pseudorandomness... just to further minimize the possibility of timestamp collisions, or for other reasons? I'm not sure how often existing UUID timestamp collisions happen in practice, anyone know? Anyone ever had one?]
Well, the entire UUID is 128, so obviously not 128 bits of pseudorandom data.
> What about 128 bits of pseudorandom data is preferable to you?
Same as with any hash: collision resistance. That's why we don't use SHA-1 for anything secure any more. The dash problem arises not in the URL but the multipart/form-data encoding, as well as the following:
"when UUIDs are indexed by a search engine, where the dashes will likely be interpreted as token delimiters. The base62 encoding avoids this pitfall and retains the lexicographic ordering properties of the binary encoding."
I have never had or thought of an application where a search engine interpreting the dashes as token delimiters was a problem. But if you have, and it was a problem, ok, that explains your preference I guess!
Curious whether anyone reading has run into collisions with existing v1 UUIDs, or knows anyone who has, and how often.
Pretty sure these new UUIDs are lexicographically sortable (in timestamp order) in their hex representation (with or without dashes) as well as binary. Lots of people agree this is important; in fact it seems to be one of the main motivations of these new UUID formats in the draft we're discussing, I think?
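A quick sketch of why that works (assumptions: a simplified v7-like layout of a 48-bit big-endian Unix-millisecond timestamp followed by random bits -- not the exact bit layout from the draft, which also reserves version/variant bits):

```python
import os
import time

def v7ish() -> str:
    """A simplified, hypothetical v7-style ID: big-endian ms timestamp + randomness."""
    ts = int(time.time() * 1000).to_bytes(6, "big")  # 48-bit big-endian millisecond timestamp
    rand = os.urandom(10)                            # 80 bits of randomness
    return (ts + rand).hex()

a = v7ish()
time.sleep(0.002)  # make sure we land in a later millisecond
b = v7ish()
# The hex string sorts in creation order because the timestamp is
# big-endian and sits at the front of the value.
assert a < b
```

The big-endian ordering is the whole trick: byte-wise (and therefore hex-character-wise) comparison then agrees with numeric timestamp comparison, so both the text and binary forms sort by time.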
> They don't have the awkward dash ('-') to mess with url encoding
But yeah, I’ve found the ability to move IDs between alphabets quite useful.
> This machine ID MUST be placed in the UUID proceeding [sic] the timestamp and sequence counter bits. This position is selected to ensure that the sorting by timestamp and clock sequence is still possible.
This guarantees uniqueness at a global level, as long as each machine doesn't run out of sequence counters within a given timestamp.
But why must that machine ID precede the timestamp & sequence counter? Why not have it after? (Or does "proceeding" have a meaning I'm not aware of? I read it as a typo for "preceding", but I'd assume it should be "succeeding", especially given what the next sentence says.)
My intuition is range requests based on timestamp would work better if the machine ID is after, not before... If before, it seems it would violate the key requirement in abstract of "sortable using the monotonic creation time".
(It's already violated since each machine in distributed system doesn't have the same clock so "creation time" is all relative. But for purposes of analytics, such as querying the "last 24h", having the timestamp be at the beginning seems preferable, since range queries can be done easily)
I read this as a strange word choice, but interpret as follows: "This machine ID MUST proceed from the timestamp and sequence counter." X proceeding from Y implies that X comes after Y.
If we examine the context, it is absolutely unambiguous (emphasis mine):
> This machine ID MUST be placed in the UUID proceeding [sic] the timestamp and sequence counter bits. This position is selected to ensure that the sorting by timestamp and clock sequence is still possible.
The proceeding sentence makes the intent clear, that a naive sort should preserve time ordering. The only way for this to work is for the timestamp and sequence counter bits to precede the optional machine ID.
An interesting digression with respect to this RFC: none of the proposed v6, v7, v8 UUIDs match the SQL Server sort-order behavior, and they would still cause a lot of clustered-index churn unless new sort behaviors could be opted into. That might be something this RFC could help address by making sort orders a bit more standard, though it does not directly make recommendations on that front.
For what it is worth, it is the SQL Server sort order that is weird: it sorts the last group of the UUID string form first, which in v1 UUIDs corresponded to the MAC address. Clustering by machine probably made sense to whoever decided on that sort order at the time, but even with v1 UUIDs they probably would have gotten better results sorting by timestamp than by machine.
(The ULID timestamp fits exactly into that "MAC address" group and it's just a matter of swapping the front 6 bytes to the back of the UUID when storing to uniqueidentifier.)
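The byte shuffle described above is small enough to sketch (assumptions: a 16-byte ULID-style value whose first 6 bytes are the big-endian timestamp; the function names are made up for illustration):

```python
def ulid_to_sql_guid_bytes(ulid: bytes) -> bytes:
    """Move the 48-bit timestamp to the last group, which SQL Server sorts first."""
    return ulid[6:] + ulid[:6]

def sql_guid_to_ulid_bytes(guid: bytes) -> bytes:
    """Inverse: pull the timestamp back from the last 6 bytes to the front."""
    return guid[10:] + guid[:10]

# The two transforms round-trip cleanly:
raw = bytes(range(16))
assert sql_guid_to_ulid_bytes(ulid_to_sql_guid_bytes(raw)) == raw
```

Because the timestamp lands in the group SQL Server compares first, a naive `uniqueidentifier` sort then matches creation order.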
Fun reference on UUID sort orders: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...
Unfortunately converting between them without any db extensions is a PITA in the absence of native support. I’ve written about it at length here (SQLite in the example, but the fundamentals apply regardless): https://neosmart.net/blog/2018/converting-a-binary-blob-guid...
(ETA: Also to answer your indirect question, Microsoft was definitely an early adopter of GUID/UUID. Though the inventor was Apollo Computer as a part of their complex network design: https://en.wikipedia.org/wiki/Apollo_Computer)
Anyway, nice to meet another .NET Ulid user in the wild!
I actually took the opposite path and explored several of the micro-optimized ULID libraries in C# before settling down with NUlid which I decided had the cleanest API and documentation (and reasonably good response times to GitHub issues). The implementation of Cysharp's lib has a lot of Unsafe code, which I understand why they took that path (given the focus on Unity code), but I'm hesitant about it and I liked a lot better (conceptually) how much more Span<T> code mcb2001's CSharp.Ulid library had and if I were to micro-optimize a library that's much more the path that I would take. (But so far micro-optimizations of that sort haven't been a priority in my real world usage and I'll take the well tested library with the nice API.) I created a couple simple extension methods to do the specific "ToSqlGuid()" and "ToUlidFromSql()" that I desired for the SQL Server sort order. (Which is just shuffling the first six bytes of a ULID byte array to the end of the GUID byte array and vice versa.)
I actually trust that I can at least generate ULIDs in SQL Server, again because the part I care to be sorted/needs to be correct (the timestamp) fits into the non-endian-issue portion of a GUID in SQL Server. I've been using just the generator function from ulid-mssql [1] (I used it mostly just for DB migrations and it was nice to have for that). It was debugging the endian issues and Base-32 encoding issues of the "ULID GUID to string" functions where I gave up and decided I didn't trust doing that in database.
Apart from that little misadventure (last I checked, there was a request for comments on a proposal to drop that from the Ulid spec altogether), there's really nothing else in the spec that isn't sane and straightforward, and little else to complain about.
But UUIDv7 is described as: Unix timestamp + fractions of second + increasing sequence + random number. Now imagine two processes start simultaneously and start generating UUIDs right away. IMHO chances of the two processes generating exactly the same UUID sequence are pretty high unless the implementation is smart enough to feed something like process id as a seed to random function.
Is 48 bits really enough to be a spatially unique identifier? Roughly 16 million entities would have a 50% chance of collision.
If you have a spatially unique identifier collision, it seems it might be possible for two independent entities to generate the same time stamp and counter codes resulting in an overall UUID collision.
If you need 16 million entries to get a 50% chance of collision, 100ns time resolution means you have 10^7 timestamps per second, so to actually experience a 50% chance of collision requires an instantaneous hash rate of over 10^14 per second, which I don't think you are likely to ever see in practice before these UUID versions are long obsolete.
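The birthday-bound arithmetic behind those figures can be checked directly (a minimal sketch; `draws_for_collision_prob` is a made-up helper name):

```python
import math

def draws_for_collision_prob(bits: int, p: float) -> float:
    """Approximate number of random draws from a 2**bits space that gives
    collision probability p: n ~= sqrt(2 * N * ln(1/(1-p)))."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))

# ~17-20 million 48-bit node IDs gives a ~50% chance of at least one collision:
n48 = draws_for_collision_prob(48, 0.5)
assert 1.5e7 < n48 < 2.5e7

# For comparison, a fully random 122-bit value needs >10^18 draws for the same odds:
assert draws_for_collision_prob(122, 0.5) > 1e18
```

So the 48-bit node field alone is indeed weak by itself; the argument above is that the timestamp multiplies the effective space enormously.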
XML and markdown are the most common source document formats.
> MAC addresses pose inherent security risks and MUST not be used for node generation.
Interesting concern in the distributed generation pathway.
I've used MAC addresses in the past for absolutely unique identifiers, but this is calling that out as a security risk, because the time + ARP data might be known and used to predict a future UUID from a machine?
Whether you consider this specific example threat a privacy risk versus a security risk here depends on your personal threat model, of course. But general consensus at this point is to label it a security risk.
The RFC leaves finding a machine-unique ID that isn't a security risk as an exercise to the user. (Mentioned directly in the v7 description where some bytes are optionally allocated for "flake"-like machine IDs.)
Another thing I was quite curious about (looks like you can use them in existing DB columns etc):
> The UUID length of 16 octets (128 bits) remains unchanged. The textual representation of a UUID consisting of 36 hexadecimal and dash characters in the format 8-4-4-4-12 remains unchanged for human readability. In addition the position of both the Version and Variant bits remain unchanged in the layout.
e.g. MySQL accepts only "no dashes" and the standard textual format: https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functi...
Postgres is a bit more flexible, but still expects hyphens to be after multiples of four characters if present: https://www.postgresql.org/docs/9.1/datatype-uuid.html
I think youtube does something similar with their video IDs.
So the authors wrote this XML https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...
That XML got turned into this plaintext https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u... by the xml2rfc tool https://xml2rfc.tools.ietf.org/
And then that plaintext got turned into the linked HTML by the datatracker.ietf.org server software, which can be found in SVN https://svn.ietf.org/svn/tools/ietfdb/
[0] https://github.com/uuid6/uuid6-ietf-draft/blob/master/draft-...
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
But rfc2629.xslt isn't uploaded in the same directory (and Firefox gets real unhappy when it tries to parse the HTML 404 page as an XSLT file). Which is fine, the XML file isn't there to show it rendered, it's to download the source. So you can still just save it.
I was sort of hoping this would yield something obviously superior to KSUIDs, since nominally this is an attempt to create an IETF standard to solve the same problem KSUIDs solve, but...
...I'm not seeing it. I just want enough random bits to ensure I don't need to worry about a collision, combined with enough of a timestamp that I can roughly k-sort them and get acceptable performance using them in a DB index. KSUIDs do this. These proposed formats... not so much.
Specifically:
1. All three of these standards devote significant numbers of bits to providing timestamps with very high precision or very distant epochs, or both. If you're generating a v6 UUID and your device's wall clock says it's currently the year 1582 (or really, any year prior to 2021), something has gone very wrong. As for the high precision, there's nothing wrong with it per se, but you could shave off a few bits there and spend that space on more randomness.
2. All three of these standards devote shockingly few bits to raw random data. v6 has 48, and v7 and v8 don't require any (!) and max out at 62 bits. If you could trust that all devices generating v6 UUIDs had super accurate, super in-sync wall clocks, I suppose 48 bits of randomness and a 100 nanosecond timestamp might make sense. But in the real world, I think you're better off focusing on randomness, with only enough of a timestamp to make your database happy.
3. Both v6 and v7 devote significant attention to a clock mechanism. Clocks make a ton of sense in a system like Boundary's Flake (and Twitter's Snowflake), where you're embedding a node ID in the generated value, but these proposed formats don't do this. (Aside from an offhand comment that you could repurpose some of the random bits for this, but that this is "out of scope" for the spec.) Which is fine, but if your UUIDs aren't namespaced per-node, then your clock now needs to be synchronized globally, which is somewhere between "unfeasible" and "outright impossible". I appreciate the review they did of prior solutions, but you can't just grab arbitrary features from successful designs without considering the context that made the feature work in that design. As far as I'm concerned, if the format doesn't have a node ID, then all the bits devoted to clocks should be spent on more randomness.
I assume/hope that these UUIDs are useful for some applications, but I'm really struggling to see it. They seem good at things that aren't going to benefit users and bad at the things they're trying to solve, particularly at scale, although there's no real point for hobbyists either.
What I am trying to determine is whether 62 bits of entropy combined with a timestamp gives me better or worse collision resistance than 122 bits in a purely-random format. Most of our keys only live within the scope of a single computer, but ~5% of them might be generated by/shared between other systems.
Being able to order our keys by time would be really nice, and I like that they would compress/index better.
Maybe I could do a hybrid between V4 and V7 keys depending on if the type would be shared with external parties. There are many types that only ever scope to a single box.
I probably couldn't play with the idea of adding a "machine id" because the space of all possible machines is difficult to anticipate/control right now.
I think I just talked myself into sticking with v4.
For a one-in-a-million chance of having at least one collision, you'd need to generate 3.04e6 random 62-bit values or 3.26e15 random 122-bit values. For a one-in-a-thousand chance, those numbers would be 9.6e7 or 1.03e17 respectively.
In other words: if you're allocating v7 IDs (with no sub-second portion of the timestamp) at a sustained rate of 3.04 million per second, there's a one-in-a-million chance each second of having at least one collision. Taking 1 - 0.999999^86400, that gives an 8.3% chance of at least one collision after a day, 45.4% after a week, and 92.5% after a month.
If you instead were allocating v4 IDs at the same rate for a month, you'd have allocated (3.04e6)×86400×30 IDs, and you'd have a probability of 5.84e-12 of at least one collision among all of those IDs.
[1] Collision probability given number of draws: https://www.wolframalpha.com/input/?i=n%3D3.04e6%2C+H%3D2%5E...
[2] Min number of draws to get a given collision probability: https://www.wolframalpha.com/input/?i=p%3D0.000001%2C+H%3D2%...
edit: If you are storing milliseconds in subsec_a and generating IDs at the same rate, you'd have 3040 ids per millisecond, which have a probability 1.00164e-12 of a collision in any given millisecond. Taking 1-(1-1.00164e-12)^(86400*1000) gives you a probability of 0.0087% of a collision in a day, 0.061% in a week, or 0.26% in a month. So as long as your event generation is relatively constant across that second, you will indeed have a much lower collision chance if you are using millisecond subdivisions. If you have even higher precision time or are using a sequence counter, that can be pushed down even further.
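Those figures are straightforward to reproduce with the standard birthday approximation (a minimal sketch; `collision_prob` is a made-up helper name):

```python
import math

def collision_prob(n: float, bits: int) -> float:
    """Approximate probability of >=1 collision among n uniform draws
    from a space of 2**bits values: 1 - exp(-n*(n-1) / (2 * 2**bits))."""
    return 1 - math.exp(-n * (n - 1) / (2 * 2 ** bits))

# 3.04 million 62-bit draws per second -> ~one-in-a-million chance per second:
per_second = collision_prob(3.04e6, 62)
assert abs(per_second - 1e-6) < 1e-7

# Compounding over a month of seconds reproduces the ~92.5% figure:
per_month = 1 - (1 - per_second) ** (86400 * 30)
assert 0.90 < per_month < 0.95
```

The edit above about millisecond buckets follows the same shape: shrinking n per bucket by 1000x shrinks the per-bucket probability by ~10^6, which more than compensates for having 1000x more buckets.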
If so, with v7, I'd feel the need to find the right balance of time precision, time synchronization, node identification, and randomness. Just some thoughts out loud.
Imagine you had three devices one of which has an accurate wall clock, one of which is 5 minutes slow, and one of which is 10 minutes slow, and you're using millisecond precision. So what happens is:
For a given arbitrary "accurate" time such as UTC 00:00:00.5 Dec 1st 2021 (500ms after midnight), device one gets 1 millisecond to generate UUIDs into that timestamp "bucket". If it generates enough, it might get a collision, but likely it'll be fine. After 1 millisecond, its wall clock moves on to 501ms after midnight, and it stops generating UUIDs that (might) collide with other UUIDs generated at 500ms after midnight. Five minutes later at 00:05:00.500 the second device (incorrectly) thinks it's 500ms after midnight, and starts generating UUIDs for that millisecond "bucket". And then after another 5 minutes, the third device does. And we can extend it; an hour later a glitchy NTP server (or daylight savings, or whatever) might cause the devices to all roll their time back, and then they all go through the "500ms after midnight" window a second time, getting another shot to generate conflicts. And that's fine!
What matters is the total number of UUIDs you generate (over the entire lifetime of your system) by devices which believe (correctly or not) that it's that specific timestamp.
Now, one potential gotcha is that a user might be overly clever and think they can back the timestamp out of the UUID to figure out exactly when it was created. That does require precise synchronisation if you want accurate answers, but if you're trying to figure out the exact creation time to the millisecond of UUIDs created in a distributed fashion...well, eh, hopefully you know what you're doing. :)
If you have your To Do items identified by an incrementing integer, you have to come up with some other unique identifier to use for To Do items that have yet to be synced with the server's database (since you can't know which ID to use next).
On the other hand, if you use UUIDs, your app can just create a new UUID for the To Do item, which will continue to be its ID for subsequent updates.
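The offline-sync pattern described above looks roughly like this (assumptions: a hypothetical `TodoItem` structure; `uuid4` stands in for whatever collision-resistant ID scheme you pick):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class TodoItem:
    title: str
    # The client mints the ID itself -- no round-trip to the server is needed
    # before the item can be referenced, cached, or updated offline.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    synced: bool = False

item = TodoItem("buy milk")
# Later, the sync layer pushes the item; the server stores it under the same
# ID, so edits made offline reconcile without any ID remapping step.
assert len(item.id) == 36 and not item.synced
```

With server-assigned integer IDs you'd instead need a temporary local ID plus a remapping pass after sync, which is exactly the complexity client-generated UUIDs avoid.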
"The motivation for using UUIDs as database keys stems primarily from the fact that applications are increasingly distributed in nature. Simplistic "auto increment" schemes with integers in sequence do not work well in a distributed system since the effort required to synchronize such numbers across a network can easily become a burden. The fact that UUIDs can be used to create unique and reasonably short values in distributed systems without requiring synchronization makes them a good candidate for use as a database key in such environments. "
So UUIDs are great because multiple servers can generate plethora of them without having to worry about collisions.
This is all explained in the first few pages of the document, although not in simple terms.
Also I haven't spotted anything about UUIDv6 trying to avoid collisions, but I haven't read very far.
" node: 48-bit pseudo-random number used as a spatially unique identifier Occupies bits 80 through 127 (octets 10-15)"
This is so common, I've experienced this myself as a student with a summer job: the company I was working for used an online service for their payroll, I accessed my pay slip one day and changed ?employee=123 to 124 in the URL out of curiosity and it loaded the data for a different person in a different company. They fixed it incredibly quickly once I told them about it :)
this approach also allows you to generate ids client side. which if you're working with a local db on ios/android etc this is very convenient
A ULID is 48 bits of timestamp plus 80 bits of entropy. These new formats just don't have enough random-component entropy. 64 bits isn't enough. Who cares if your millisecond timestamp wraps every 8900 years, as a ULID's does?
What’s the actual advantage of the other uuids? I know what they are on paper but has anyone used other ones in practice?
The wheel turns.