Choosing the Right Integers
source link: https://www.thecodedmessage.com/posts/programming-integers/
#rust #c++ #programming #computers
Paying attention to beginner questions is important. While many are just re-hashes of things that books have explained 1000 times, some are quite interesting, like this one that I saw recently in a Reddit post:
How do I determine when to use i8-64 (same for u8-64)?
Now, here is a surprisingly hard question. It’s really easy to find out what `i16` and `u8` and friends mean, and what they do: the appropriate section comes really early in The Book. It’s also fairly easy to learn that you use `u8` when dealing with the concept of raw bytes, or that you might need to use a particular one for a data format or an existing API call.
But it’s harder to get actual advice on what to do with that information for your own code, and what type to pick when you are the one making the data structure or API. I realized, reading this question, that I’m not entirely sure that what I do is the best thing to do.
And, oddly enough, the answer seems to be social just as much as it is technical. I’d like to answer the question, but first I’d like to explain why Rust works the way it works. For that, as with so many things in programming, we shall need to delve into history.
The Old C `int`s
My “native” C and C++ had an easy answer for this question built into the programming language: when in doubt, use `int`. It’s the one named after the general concept of “integers,” and so it’s the commonly-used default type. The theory was that a concrete meaning for `int` would then be chosen by each platform at its own discretion.

It was originally anticipated that this would correspond to the platform’s native word width: `int` would be a 16-bit value on 16-bit platforms, and a 32-bit value on 32-bit platforms. For a long time, those were the two kinds of platforms people cared about, and the following overall convention emerged:
- `short` would always mean 16-bit
- `long` would always mean 32-bit
- `int` would mean whatever the native word type was, and would be entirely equivalent to either `short` or `long`
- `long long` could be used as a non-standard extension for 64-bit, which was usually a non-native type simulated using multiple instructions per operation
And so, if code cared about whether a value was a certain width, like in a structure that would be written to disk or used in inter-process communication – or even when compactness mattered or when having a full range of values mattered – it would use `short` or `long`. When these things didn’t matter, it would use `int`.
And as a result, many millions of lines of code were written that used `short` and `long` in important memory layouts to mean 16-bit or 32-bit, and it would break compatibility to have them instead mean something else. Some less careful code would start assuming 32-bit (since after a certain point, 16-bit started seeming obsolete), and so soon you had code that also assumed that `int` was 32-bit, and it would break compatibility if it came to mean something else.
The 64-bit Revolution

By the time I started programming in C and C++, `int` always meant 32-bit, on all the computers I could program on. I knew that `int` used to mean 16-bit, on the DOS and Windows 3.1 computers I used to use, but that was a historical curiosity for me. I assumed, however, that when 64-bit came – if it ever would – `int` would come to mean 64-bit.

I was far from alone in that assumption. As I said, the convention at that time was that `int` was the native word type. This convention was documented in books I read, which used “native word type signed integer” as the definition of `int`. I’m pretty sure at least one book I read even asserted that, when 64-bit chips become a thing, `int` will mean 64-bit.

I occasionally even saw people use `typedef` or various header files to define other `int` types, `int32` or similar, and asked, why would they do that? After all, `long`, I thought, would always be 32 bits. Shouldn’t they just say `long` if they actually cared? And if they didn’t, which they didn’t seem to that much, shouldn’t they just say `int`?
Yes, this is how naive I was. But then, in 2003, I heard the Athlon 64 was coming out, implementing the new x86-64 architecture. I read everything I could about this new architecture, and I couldn’t have been more excited: not only would each register double in width, but the number of registers would also double.
I turned to read the new System V ABI, which specified the way this new architecture would be used by C compilers on Linux. And here, again, there was much to be excited about. Up to 6 parameters would now be passed in registers rather than pushed onto the stack!
But all this excitement was tempered by the data model that was chosen, LP64. Under LP64, the following meanings were assigned to `short`, `long`, and `int`:

- `short` would be 16 bits.
- `int` would be 32 bits.
- `long` would be 64 bits.
I was shocked. All my C code used `int`s as the default type for all my integers. If `int` was 32 bits, wouldn’t my code, for all practical purposes, still have basically the same capabilities? I wasn’t even sure my code would be 64-bit clean: I had so thoroughly assumed that `int` meant word size that I used `int` as my type for indices and freely cast between `int`s and pointer types!

Furthermore, I asked the document, didn’t this mean that `long` was changing meanings? I had, incorrectly, thought that `long` simply always meant 32 bits, probably from a book that assumed 16-bit and 32-bit were everything. I had anticipated that `short`, `long`, and `long long` would keep their meanings, and that `int` would move to `long long`.
Rationales

The reason for LP64 instead of ILP64 (under which `int`, too, would have been 64 bits) is simple: too much code had baked in the assumption that `int`s were 32 bits. It doesn’t take much to assume that. You can write an `int` to a file or a socket as 4 bytes, writing 4 instead of `sizeof(int)` in a `write` or `read` system call. Or it’s enough to even get it right and say `sizeof(int)`, but expect the file or socket to be compatible between 32-bit and 64-bit builds of the same program. It’s easy to make assumptions in C.
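In Rust, the same hazard can be sidestepped by serializing with an explicit, fixed width instead of a platform-dependent size; a minimal sketch:

```rust
fn main() {
    // to_le_bytes on an i32 always yields exactly 4 little-endian bytes,
    // regardless of whether the build targets a 32-bit or 64-bit platform,
    // so the on-disk format can't silently drift between builds.
    let n: i32 = 0x1234_5678;
    let bytes = n.to_le_bytes();
    assert_eq!(bytes, [0x78, 0x56, 0x34, 0x12]);
    // Round-trip back from the wire format:
    assert_eq!(i32::from_le_bytes(bytes), n);
}
```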
Apparently more code made that assumption than made the assumption that pointers could be cast to `int`s and back (which was undefined behavior anyway, and which I was now forced to repent of). And using `int`s as an index wasn’t that bad, if no longer fully correct; how many data structures really do have more than 2 billion items, even if it is theoretically possible now?

As for `long`, well, it was deemed reasonable that there should be some primitive type that was equivalent to the native word length. There was also some precedent: `long` already meant 64-bit in Java. And besides, the word pretty much called for it. And this way, code that casted to `int` and back could just use `long` instead.
Well, I wanted my values to be 64-bit. So when I finally got my hands on a 64-bit computer, for a while, I just replaced all my uses of `int` with `long` (including writing `long main(long argc,...)` – I was a stubborn teenager). But ultimately, that felt silly, and I tried to find a better solution.
New Conventions

Eventually, I figured out the real solution. I came to understand the header files full of `typedef`s so that code could say things like `int32`. The C standards committee had recently adopted something similar themselves: C99 came with a `stdint.h` that would define types like `int64_t` or `uint32_t`.

For various specific use cases, there were also specialized types, which would automatically resolve to the right type on your platform for that use case. For storing indices, there was `size_t`. `ssize_t` was its signed equivalent: more on that later. For being the same width as a pointer, there was `intptr_t`/`uintptr_t`.
But for general-purpose integers, which fit none of these use cases, after my brief flirtation with `long` (because 64-bit!) I continued to use `int`. It was still the default; the one I used if there was no reason to use anything else, the one C APIs would treat as the most normal.
General Purpose

What do I mean by general-purpose? What is there to do with `int`s besides indexing? This struck me when I saw the original question: even though I use `int`s all the time in C++, and use various integer types all the time in Rust, for reasons other than indexing an array or vector, I had never enumerated or thought deeply about what situations call for such a thing.

Now I have thought about it, and here are some general-purpose use cases for `int`:
- Assigning unique IDs
- Locating widgets on a screen in pixels
- Bit fields of options (e.g. `O_RDONLY`, `O_APPEND` in the `open` syscall)
- Indices that will be low enough and can’t take up 8 bytes
- Counting how many times something has happened
- Counting how many times you want something to happen
- Counting how many times you want to try something before giving up
- Counting how many milliseconds to wait before giving up
- Counting how many milliseconds to sleep for
- Counting, in general. Computers, it turns out, do a lot of that.
- (Please let me know if you think of more)
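One item above, bit fields of options, can be sketched like this; the flag names and values are made up for illustration, not the real `open(2)` constants:

```rust
// Illustrative option flags in the style of O_RDONLY / O_APPEND.
// Each flag occupies one bit so they can be combined with `|`.
const WRONLY: u32 = 1 << 0;
const CREAT: u32 = 1 << 6;
const APPEND: u32 = 1 << 10;

fn main() {
    let flags = WRONLY | APPEND;
    assert_ne!(flags & APPEND, 0); // APPEND is set
    assert_ne!(flags & WRONLY, 0); // WRONLY is set
    assert_eq!(flags & CREAT, 0);  // CREAT is not
}
```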
Here are some less generic ones that you definitely want to use special `typedef`s for, and for which you should make a conscious decision about what width to use:

- Encoding unicode code points
- Pointers/words in simulated architectures
- Fields in structs that are serialized in wire formats or file formats
  - Go ahead and make them `int`, but maybe write `int32_t` for these
- Port numbers (`uint16_t`/`u16`) or other OS-level constructs
- Values for hardware registers
- Enumerations (in fact, use `enum`, in Rust or C)
- Values for color or sound samples in image or sound formats
  - Where it’s in a large collection
  - Where the value is only measurable to a certain level
  - Where it’s effectively a fixed-point value
- Individual bytes as bytes: this is always `u8`
- (Please let me know if you think of more)
Application to Rust

In Rust, there is no type with the generic name `int`. The programming language is, compared to C, neutral in naming the various widths. `usize` is available for indices, and is effectively considered to be the same as `uintptr_t` from C as well (even though some people are interested in changing that).
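A quick illustration of `usize` in its indexing role:

```rust
use std::mem::size_of;

fn main() {
    let v = vec![10, 20, 30];
    // Slice and Vec indexing take usize; other widths need an explicit cast.
    let i: usize = 2;
    assert_eq!(v[i], 30);
    let j: i32 = 1;
    assert_eq!(v[j as usize], 20);
    // And usize matches the platform's pointer width:
    assert_eq!(size_of::<usize>(), size_of::<*const u8>());
}
```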
But there’s no named type for when you “just” want an integer. If you’re counting things, or assigning IDs, and you want to write out the type, you have to not only pick a width, you have to actively type it. So what “should” you do? Do you go with the default word width of the processor? Well, your code may be multi-platform, so that might lead to inconsistencies.
Well, it turns out that what I do is think `int`, C’s default, and just write `i32`. And, as far as choosing 32 goes, a quick scan of checked-out code indicates I’m far from the only one. And this convention is endorsed by the programming language, but more subtly than in C: `i32` is the default type for integer literals when the type inferencer isn’t constrained to choose another one.
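The inference default is easy to observe; this sketch assumes a recent toolchain, since `type_name_of_val` was only stabilized in Rust 1.76 (and its output strings are not formally guaranteed, though they are `"i32"` and `"u8"` in practice):

```rust
fn main() {
    // With no constraint, an integer literal falls back to i32...
    let unconstrained = 42;
    assert_eq!(std::any::type_name_of_val(&unconstrained), "i32");

    // ...but any constraint overrides the default:
    let constrained: u8 = 42;
    assert_eq!(std::any::type_name_of_val(&constrained), "u8");
}
```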
So `i32` is the Rust convention. But it’s a convention that comes from C. It comes from the C `int` type being `i32` on both 32-bit and 64-bit architectures.
Choosing an integer width is a complicated choice. It’s a trade-off between compactness and space enough for a wide range of numbers, but it’s also a matter of convention: more integers being the same width reduces the cognitive load. Having a default reduces the cognitive load, even if the programming language isn’t “in on” the default like C is. And so, we simply inherited a default from C.
So that’s my advice: use `usize` for indices, and `i32` for your default. Use other types when a situation explicitly calls for them.
(This section has been corrected thanks to a Reddit comment. Previously, it implied that the Rust programming language itself did nothing to privilege `i32` over other types. This is false; it is used as a default by the inferencer.)
Signedness

I wish that was it, but it isn’t. We have to talk about signedness. Which is a whole ’nother kettle of fish, isn’t it? What is `isize` even for? Should you use `u32` if your value can never be negative?

Again, the literal distinction between signed and unsigned types is straightforward and well-known: signed types have a special interpretation of the most-significant bit (the MSB). Instead of indicating 2**31, the MSB in an `i32` indicates -2**31 when the bit is 1, making it the “sign bit” (all Rust platforms and all commonly-used C/C++ platforms use two’s complement arithmetic for integers). But the practical implications are less clear.
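The sign-bit interpretation can be checked directly:

```rust
fn main() {
    // Only the MSB set: as a u32 that's 2^31, but reinterpreted as an
    // i32 (where the MSB contributes -2^31) it's the minimum value.
    let msb: u32 = 1 << 31;
    assert_eq!(msb, 2u32.pow(31));
    assert_eq!(msb as i32, i32::MIN);
    assert_eq!(i32::MIN as i64, -(2i64.pow(31)));

    // All 32 bits set: u32::MAX reinterprets as -1 in two's complement.
    assert_eq!(u32::MAX as i32, -1);
}
```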
Signedness in C/C++

C/C++ has a weird relationship with signed and unsigned. For example, if a signed value overflows or underflows, this is undefined behavior. In an unsigned value, it is defined, and it will wrap around, so that `uint32_t` arithmetic is defined modulo 2**32.
In general, unsigned integers are more raw. They are supposed to represent, abstractly, the underlying types used by the processor. Unsigned integers are for bit fields, for “raw” memory accesses, for simulating processors, and for other low-level technical purposes.
For all other purposes, signed integers are really a better choice. If you are doing application programming in C/C++, unsigned integers might never be called for.
Here are some specific reasons:
- As signed overflow is undefined behavior, compilers are allowed to make extra optimizations. Some loops just run faster if you use signed types. This happens surprisingly often, though more often with `int` vs `uint32_t` than with `size_t`.
- Use of unsigned integers could lead to surprises in loops:

```c
for (uint32_t i = 12; i >= 0; --i) {
    // loops infinitely: i >= 0 is always true for an unsigned i
}
```
- -1 is often used as a sentinel number, for errors or invalid values, even for quantities that would otherwise be unsigned. This necessitated `ssize_t`, as a return type for the `read` and `write` system calls.
For all these reasons, signed is generally preferred to unsigned in C or C++ when correspondence with the machine type is not required, even if the number is not expected to be negative.
Signedness in Rust
In Rust, both signed and unsigned have defined (but configurable) behavior on overflow. They both either panic on overflow (default in debug builds), or they wrap around with two’s complement (default in release builds). This means the optimization difference from C/C++ is moot, and that argument can be entirely set aside.
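Beyond the build-profile defaults, the explicit arithmetic methods make each behavior available on demand; a quick sketch:

```rust
fn main() {
    // Wrapping arithmetic: two's-complement wraparound, as in release builds.
    assert_eq!(i32::MAX.wrapping_add(1), i32::MIN);
    assert_eq!(u8::MAX.wrapping_add(1), 0);

    // Checked arithmetic: overflow becomes None instead of a panic.
    assert_eq!(i32::MAX.checked_add(1), None);
    assert_eq!(1i32.checked_add(1), Some(2));

    // Saturating arithmetic: clamp at the type's bounds.
    assert_eq!(i32::MAX.saturating_add(1), i32::MAX);
}
```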
I haven’t played around with different loop types to know what’s fully best, but I do know that Rust’s support for ranges somewhat mitigates the dangers of manually fiddling with `for`-loop conditions. I suspect unsigned still comes out looking worse, though, as underflow around 0, very close to the numbers that come up all the time, is way easier to trigger by accident than overflow or underflow around `MAX` or `MIN` on a signed value.
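A reversed range expresses the counting-down loop from the earlier C example without the unsigned-underflow trap:

```rust
fn main() {
    // The C loop `for (uint32_t i = 12; i >= 0; --i)` never terminates;
    // a reversed range counts down through 0 and then stops.
    let countdown: Vec<u32> = (0..=3).rev().collect();
    assert_eq!(countdown, [3, 2, 1, 0]);

    // Underflow near zero is still easy to hit with bare unsigned math;
    // checked_sub surfaces it as None instead of wrapping or panicking.
    let zero: u32 = 0;
    assert_eq!(zero.checked_sub(1), None);
}
```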
Similarly, using -1 as a sentinel or error value is not good Rust code hygiene: `Option<u32>` or `Result<u32, E>` could just as easily be used instead. However, in compact data structures, or when working with other programming languages that use it, -1 as a sentinel value still makes perfect sense, and is easy to check for.
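A sketch of the two styles, using hypothetical lookup helpers (these are made-up names for illustration):

```rust
// C-style: -1 as an out-of-band "not found" marker, which forces a
// signed return type even though a valid index is never negative.
fn find_sentinel(haystack: &[i32], needle: i32) -> i32 {
    haystack
        .iter()
        .position(|&x| x == needle)
        .map_or(-1, |i| i as i32)
}

// Rust-style: the possibility of failure lives in the type itself.
fn find_option(haystack: &[i32], needle: i32) -> Option<usize> {
    haystack.iter().position(|&x| x == needle)
}

fn main() {
    let data = [10, 20, 30];
    assert_eq!(find_sentinel(&data, 20), 1);
    assert_eq!(find_sentinel(&data, 99), -1); // caller must remember to check
    assert_eq!(find_option(&data, 20), Some(1));
    assert_eq!(find_option(&data, 99), None); // the type forces the check
}
```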
Old habits die hard. I still use `i32`, the signed version, as my default in Rust. Besides, I work with a lot of other programming languages that might use -1 as a sentinel; it’s shockingly common. But I’ve also seen a lot of code that uses `u32` whenever it doesn’t make sense for the number to be negative. This is in line with Rust’s philosophy of making invalid values unrepresentable in the type.

In the end, I’m not entirely sure. I think I’ll continue to use `i32`, but I feel like I need to think and learn more on the issue. What a deep question!
Conclusion

What do other people think? This is something I think about way less often than I should. Do we agree that `i32` is a good default for non-indices, and that `usize` is a good choice for indices? Has anyone had need for `isize` in Rust?

What about the `i32`/`u32` debacle? In the end, this Redditor’s question made me realize that I’m still figuring out what to do about Rust’s signedness.
Further Reading
- A commenter suggested The Lost Art of Structure Packing, a true classic