Choosing the Right Integers

2022-04-14 :: Jimmy Hartzell

#rust  #c++  #programming  #computers 

Paying attention to beginner questions is important. While many are just re-hashes of things that books have explained 1000 times, some are quite interesting, like this one that I saw recently in a Reddit post:

How do I determine when to use i8-64 (same for u8-64)?

Now, here is a surprisingly hard question. It’s really easy to find out what i16 and u8 and friends mean, and what they do: the appropriate section comes really early in The Book. It’s also fairly easy to learn that you use u8 when dealing with the concept of raw bytes, or that you might need to use a particular one for a data format or an existing API call.

But it’s harder to get actual advice on what to do with that information for your own code, and what type to pick when you are the one making the data structure or API. I realized, reading this question, that I’m not entirely sure that what I do is the best thing to do.

And, oddly enough, the answer seems to be social just as much as it is technical. I’d like to answer the question, but first I’d like to explain why Rust works the way it works. For that, as with so many things in programming, we shall need to delve into history.

The Old C ints#

My “native” C and C++ had an easy answer for this question built into the programming language: When in doubt, use int. It’s the one named after the general concept of “integers,” and so it’s the commonly-used default type. The theory was that a concrete meaning for int would then be chosen by each platform at its own discretion.

It was originally anticipated that this would correspond to the platform’s native word width: int would be a 16-bit value on 16-bit platforms, and a 32-bit value on 32-bit platforms. For a long time, those were the two kinds of platforms people cared about, and the following overall convention emerged:

  • short would always mean 16-bit
  • long would always mean 32-bit
  • int would mean whatever the native word type was, and would be entirely equivalent to either short or long.
  • long long could be used as a non-standard extension for 64-bit, which was usually a non-native type simulated using multiple instructions per operation.

And so, if code cared about whether a value was a certain width, like in a structure that would be written to disk or used in inter-process communication – or even when compactness mattered or when having a full range of values mattered – it would use short or long. When these things didn’t matter, it would use int.

And as a result, many millions of lines of code were written that used short and long in important memory layouts to mean 16-bit or 32-bit, and it would break compatibility to have them mean something else. Some less careful code started assuming 32-bit (since after a certain point, 16-bit started seeming obsolete), and so soon you had code that also assumed int was 32-bit, and it would break compatibility if it ever meant something else.

The 64-bit Revolution#

By the time I started programming in C and C++, int always meant 32-bit, on all the computers I could program on. I knew that int used to mean 16-bit, on the DOS and Windows 3.1 computers I used to use, but that was a historical curiosity for me. I assumed, however, that when 64-bit came – if it ever would – int would come to mean 64-bit.

I was far from alone in that assumption. As I said, the convention was at that time that int was the native word type. This convention was documented in books I read, which used “native word type signed integer” as the definition of int. I’m pretty sure at least one book I read even asserted that, when 64-bit chips become a thing, int will mean 64-bit.

I occasionally even saw people use typedef or various header files to define other int types, int32 or similar, and asked, why would they do that? After all, long, I thought, would always be 32 bits. Shouldn’t they just say long if they actually cared? And if they didn’t, which they didn’t seem to that much, shouldn’t they just say int?

Yes, this is how naive I was. But then, in 2003, I heard the Athlon 64 was coming out, implementing the new x86-64 architecture. I read everything I could about this new architecture, and I couldn’t have been more excited: not only would each register double in width, but the number of registers would also double.

I turned to read the new System V ABI, which specified the way this new architecture would be used by C compilers on Linux. And here, again, there was much to be excited about. Up to 6 parameters would now be passed in registers rather than pushed onto the stack!

But all this excitement was tempered by the data model that was chosen, LP64. Under LP64, the following meanings were assigned to short, long, and int:

  • short would be 16 bits.
  • int would be 32 bits.
  • long would be 64 bits.

I was shocked. All my C code used ints as the default type for all my integers. If int was 32 bits, wouldn’t my code, for all practical purposes, still have basically the same capabilities? I wasn’t even sure my code would be 64-bit clean: I had so thoroughly assumed that int meant word size that I used int as my type for indices and freely cast between ints and pointer types!

Furthermore, I asked the document, didn’t this mean that long was changing meanings? I had, incorrectly, thought that long simply always meant 32 bits, probably from a book that assumed 16-bit and 32-bit were everything. I had anticipated that short, long, and long long would keep their meanings, and that int would move to match long long.

Rationales#

The reason for LP64 instead of ILP64 is simple: too much code had baked in the assumption that ints were 32 bits. It doesn’t take much to assume that. You can write an int to a file or a socket as 4 bytes, or write 4 instead of sizeof(int) in a write or read system call. Or you can even get it right and say sizeof(int), but expect the file or socket to be compatible between 32-bit and 64-bit builds of the same program. It’s easy to make assumptions in C.

Apparently more code made that assumption than made the assumption that pointers could be cast to ints and back (which was undefined behavior anyway, and which I was now forced to repent of). And using ints as an index wasn’t that bad, if no longer fully correct; how many data structures really have more than 2 billion items, even if it is theoretically possible now?

As for long, well, it was deemed reasonable that there should be some primitive type that was equivalent to the native word length. There was also some precedent: long already meant 64 bits in Java. And besides, the word pretty much called for it. And this way, code that casted to int and back could just use long instead.

Well, I wanted my values to be 64-bit. So when I finally got my hands on a 64-bit computer, for a while, I just replaced all my uses of int with long (including writing long main(long argc,...) – I was a stubborn teenager). But ultimately, that felt silly, and I tried to find a better solution.

New Conventions#

Eventually, I figured out the real solution, and came to understand the header files full of typedefs that let code say things like int32. The C standards committee had recently adopted something similar themselves: C99 came with stdint.h, which defines types like int64_t and uint32_t.

For various specific use cases, there were also specialized types, which would automatically resolve to the right type on your platform for that use case. For storing indices, there was size_t. ssize_t was its signed equivalent: more on that later. For being the same width as a pointer, there was intptr_t/uintptr_t.

But for general-purpose integers, which fit none of these use cases, after my brief flirtation with long (because 64-bit!) I continued to use int. It was still the default; the one I used if there was no reason to use anything else, the one C APIs would treat as the most normal.

General Purpose#

What do I mean by general-purpose? What is there to do with ints besides indexing? This struck me when I saw the original question: Even though I use ints all the time in C++, and use various integer types all the time in Rust, for reasons other than indexing an array or vector, I had never enumerated or thought deeply about what situations call for such a thing.

Now I have thought about it, and here are some general-purpose use cases for int (a short sketch in Rust follows the list):

  • Assigning unique IDs
  • Locating widgets on a screen in pixels
  • Bit fields of options (e.g. O_RDONLY, O_APPEND in open syscall)
  • Indices that will be low enough and that can’t afford to take up 8 bytes
  • Counting how many times something has happened
  • Counting how many times you want something to happen
  • Counting how many times you want to try something before giving up
  • Counting how many milliseconds to wait before giving up
  • Counting how many milliseconds to sleep for
  • Counting, in general. Computers, it turns out, do a lot of that.
  • (Please let me know if you think of more)

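To make these concrete, here is a minimal Rust sketch; the names (max_retries, timeout_ms, hits) are illustrations of mine, not from the original question:

// Hypothetical examples of "general-purpose" integers: counts, retry
// limits, and millisecond timeouts all fit comfortably in i32.
fn main() {
    let max_retries: i32 = 5;    // how many times to try before giving up
    let timeout_ms: i32 = 1_500; // how many milliseconds to wait
    let mut hits: i32 = 0;       // counting how many times something happened

    for attempt in 0..max_retries {
        if attempt % 2 == 0 {
            hits += 1;
        }
    }
    println!("hits = {hits}, timeout = {timeout_ms} ms");
}
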
Here are some less generic ones that you definitely want to use special typedefs for, and for which you should make a conscious decision about what width to use (a sketch follows this list):

  • Encoding Unicode code points
  • Pointers/words in simulated architectures
  • Fields in structs that are serialized in wire formats or file formats
    • Go ahead and make them 32 bits, but write int32_t rather than int for these
  • Port numbers (uint16_t/u16) or other OS-level constructs
  • Values for hardware registers
  • Enumerations (in fact, use enum, in Rust or C)
  • Values for color or sound samples in image or sound formats
    • Where it’s in a large collection
    • Where the value is only measurable to a certain level
    • Where it’s effectively a fixed-point value
  • Individual bytes as bytes: this is always u8
  • (Please let me know if you think of more)

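In Rust, the analogous move is to spell the width out whenever a format or protocol dictates it. Here is a hedged sketch with made-up field names; the layout is purely illustrative, not any real wire format:

// Hypothetical wire-format header: the field widths are dictated by the
// format itself, not by what happens to be convenient on this platform.
struct PacketHeader {
    version: u8,      // an individual raw byte
    port: u16,        // port numbers are 16-bit at the OS level
    code_point: u32,  // a Unicode code point fits in 32 bits (char is usually better in Rust)
    payload_len: u32, // width fixed by the on-the-wire layout
}

fn main() {
    let header = PacketHeader {
        version: 1,
        port: 8080,
        code_point: 0x1F600,
        payload_len: 64,
    };
    println!(
        "v{} port {} code point U+{:X} carries {} bytes",
        header.version, header.port, header.code_point, header.payload_len
    );
}
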
Application to Rust#

In Rust, there is no type with the generic name int. The programming language is, compared to C, neutral in naming the various widths.

usize is available for indices, and is effectively considered to be the same as uintptr_t from C as well (even though some people are interested in changing that).

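As a quick illustration of what that means in practice (a sketch of my own, not from the original post): indexing a slice or Vec takes a usize, so any other width has to be converted first.

fn main() {
    let data = vec![10, 20, 30];
    let i: usize = 2;  // indices are usize
    let j: i32 = 1;    // a "general-purpose" integer
    println!("{}", data[i]);
    // println!("{}", data[j]); // does not compile: the index must be usize
    println!("{}", data[j as usize]); // an explicit conversion is required
}
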
But there’s no named type for when you “just” want an integer. If you’re counting things, or assigning IDs, and you want to write out the type, you have to not only pick a width, you have to actively type it. So what “should” you do? Do you go with the default word width of the processor? Well, your code may be multi-platform, so that might lead to inconsistencies.

Well, what I do, it turns out, is think int, C’s default, and just write i32. And as far as choosing 32 goes, a quick scan of checked-out code indicates I’m far from the only one. This convention is also endorsed by the programming language, though more subtly than in C: i32 is the default type for integer literals when the type inferencer isn’t constrained to choose another one.

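A small sketch of that default in action (my own example):

fn main() {
    // Nothing here constrains the type of `n`, so the inferencer falls
    // back to the default integer type, i32.
    let n = 42;
    println!("{}", n.leading_zeros()); // prints 26: a 32-bit representation
    // With an annotation, you opt out of the default:
    let m: i64 = 42;
    println!("{}", m.leading_zeros()); // prints 58: a 64-bit representation
}
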
So i32 is the Rust convention. But it’s a convention that comes from C. It comes from the C int type on both 32-bit and 64-bit architectures being i32.

Choosing an integer width is a complicated choice. It’s a trade-off between compactness and enough space for a wide range of numbers, but it’s also a matter of convention: more ints being the same width reduces the cognitive load. Having a default reduces the cognitive load, even if the programming language isn’t “in on” the default the way C is. And so, we simply inherited a default from C.

So that’s my advice: use usize for indices, and i32 as your default. Use other types when a situation explicitly calls for them.

(This section has been corrected thanks to a Reddit comment. Previously, it implied that the Rust programming language itself did nothing to privilege i32 over other types. This is false; it is used as a default by the inferencer.)

Signedness#

I wish that was it, but it isn’t. We have to talk about signedness. Which is a whole ‘nother kettle of fish, isn’t it? What is isize even for? Should you use u32 if your value can never be negative?

Again, the literal distinction between signed and unsigned types is straightforward and well-known: signed types give a special interpretation to the most-significant bit (the MSB). Instead of indicating 2**31, the MSB in an i32 indicates -2**31 when the bit is 1, making it the “sign bit” (all Rust platforms and all commonly-used C/C++ platforms use two’s complement arithmetic for integers). But the practical implications are less clear.

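A small sketch of the sign bit at work, as my own illustration:

fn main() {
    // In an i32, the most-significant bit contributes -2^31 rather than +2^31.
    let msb_value: i64 = -(1_i64 << 31); // -2^31, computed in a wider type
    assert_eq!(i32::MIN as i64, msb_value);

    // The same 32-bit pattern reads differently as unsigned vs. signed:
    let bits: u32 = 0x8000_0000;       // only the MSB is set
    assert_eq!(bits, 1_u32 << 31);     // as unsigned: +2^31
    assert_eq!(bits as i32, i32::MIN); // reinterpreted as signed: -2^31
    println!("ok");
}
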
Signedness in C/C++#

C/C++ has a weird relationship with signed and unsigned.

For example, if a signed value overflows or underflows, this is considered undefined behavior. For an unsigned value, it is defined to wrap around, so uint32_t arithmetic is defined modulo 2**32.

In general, unsigned integers are more raw. They are supposed to represent, abstractly, the underlying types used by the processor. Unsigned integers are for bit fields, for “raw” memory accesses, for simulating processors, and for other low-level technical purposes.

For all other purposes, signed integers are really a better choice. If you are doing application programming in C/C++, unsigned integers might never be called for.

Here are some specific reasons:

  • Because signed overflow is undefined behavior, compilers are allowed to make extra optimizations. Some loops simply run faster if you use signed types. This happens surprisingly often, though more often with int vs uint32_t than with size_t.
  • Use of unsigned integers could lead to surprises in loops:
for (uint32_t i = 12; i >= 0; --i) {
    // loops forever: i >= 0 is always true for an unsigned type,
    // and --i wraps around to UINT32_MAX after i reaches 0
}
  • -1 is often used as a sentinel number, for errors or invalid values, even when the valid values are never negative. This is why ssize_t exists as well, as the return type of the read and write system calls.

For all these reasons, signed is generally preferred to unsigned in C or C++ when correspondence with the machine type is not required, even if the number is not expected to be negative.

Signedness in Rust#

In Rust, both signed and unsigned have defined (but configurable) behavior on overflow. They both either panic on overflow (default in debug builds), or they wrap around with two’s complement (default in release builds). This means the optimization difference from C/C++ is moot, and that argument can be entirely set aside.

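A minimal sketch of those behaviors, using the explicit methods (since the implicit behavior depends on build settings):

fn main() {
    let x = i32::MAX;
    // x + 1 would panic in a debug build and wrap in a release build
    // (unless overflow-checks is configured otherwise). The explicit
    // methods let you choose the behavior you want:
    assert_eq!(x.checked_add(1), None);        // detect overflow
    assert_eq!(x.wrapping_add(1), i32::MIN);   // two's-complement wraparound
    assert_eq!(x.saturating_add(1), i32::MAX); // clamp at the boundary
    println!("ok");
}
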
I haven’t played around with different loop types enough to know what’s truly best, but I do know that Rust’s support for ranges somewhat mitigates the dangers of manually fiddling with for-loop conditions. I suspect unsigned still comes out looking worse, though, as underflow around 0, very close to the numbers that come up all the time, is far easier to trigger by accident than overflow or underflow around MAX or MIN of a signed value.

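For instance, the counting-down loop from the C example above has a safe equivalent using a reversed range, even over an unsigned type (a sketch of my own):

fn main() {
    // The range carries the bounds, so there is no hand-written `i >= 0`
    // condition to get wrong, and the loop terminates even though a u32
    // can never go below zero.
    for i in (0..=12u32).rev() {
        print!("{i} ");
    }
    println!();
}
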
Similarly, using -1 as a sentinel or error value is not good Rust code hygiene: Option<u32> or Result<u32, E> could just as easily be used instead. However, in compact data structures, or working with other programming languages that use it, -1 as a sentinel value still makes perfect sense, and is easy to check for.

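Here is a brief sketch of the two styles side by side; the function names are mine and purely illustrative:

// Sentinel style, as you might see when mirroring a C API: -1 means "not found".
fn find_sentinel(haystack: &[i32], needle: i32) -> i32 {
    for (i, &x) in haystack.iter().enumerate() {
        if x == needle {
            return i as i32;
        }
    }
    -1
}

// Idiomatic Rust style: the "no result" case is its own variant, not a magic value.
fn find_option(haystack: &[i32], needle: i32) -> Option<usize> {
    haystack.iter().position(|&x| x == needle)
}

fn main() {
    let data = [3, 1, 4];
    assert_eq!(find_sentinel(&data, 4), 2);
    assert_eq!(find_sentinel(&data, 9), -1);
    assert_eq!(find_option(&data, 9), None);
    println!("ok");
}
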
Old habits die hard. I still use i32, the signed version, as my default in Rust. Besides, I work with a lot of other programming languages that might use -1 as a sentinel; it’s shockingly common.

But I’ve also seen a lot of code that uses u32 whenever it doesn’t make sense for the number to be negative. This is in line with Rust’s philosophy of making invalid values unrepresentable in the type.

In the end, I’m not entirely sure. I think I’ll continue to use i32, but I feel like I need to think and learn more on the issue. What a deep question!

Conclusion#

What do other people think? This is something I think about way less often than I should. Do we agree that i32 is a good default for non-indices, and that usize is a good choice for indices? Has anyone had need for isize in Rust?

What about the i32/u32 debacle? In the end, this Redditor’s question made me realize that I’m still figuring out what to do about Rust’s signedness.
