
Validity of Values In Programming Languages

source link: https://jerf.org/iri/post/2023/validity_in_values/

In Interfaces and Nil in Go, or, Don’t Lie to Computers, I take on the popular misconception that Go has multiple kinds of nil values.

I have noticed as I have fielded questions in /r/golang a related… perhaps “misconception” is too strong, but a fuzziness around the concept of “validity”.

For programming in general, what does it mean for a value to be “valid”?

Like so many things in life, this may seem obvious at first but when you dig into it, the question only gets harder.

Validity of nil Pointers

One of the major sticking points I’ve seen in the Interfaces post is my claim that a nil pointer to some struct is not invalid in Go (if you’ll pardon the double-negative), e.g.:

package main

import "fmt"

type SomeStruct struct{}

func (ss *SomeStruct) Print() {
	fmt.Println("hello!")
}

func main() {
	ss := (*SomeStruct)(nil)
	fmt.Println("ss is nil:", ss == nil)
	ss.Print()
}

That code is perfectly legal and will print

ss is nil: true
hello!

I can see from repeated experience that this still bothers people. They want to insist the nil pointer is “invalid” even with the above code.

What Exactly Is “invalid”?

The problem that causes all this angst in the first place is that you may call a method on an interface value that has a nil in it, and it may panic when you didn’t expect it to. Does that mean a value is invalid if and only if it causes panics?

Well, that is obviously not the concept people have in mind, because those who insist nil is always invalid can verify for themselves, with all the objectivity of mathematical proof, that the above snippet of code produces a nil pointer that does not panic when used.
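In fact, methods with nil receivers can be genuinely useful. As a minimal sketch (the `List` type here is hypothetical, but the pattern itself is idiomatic Go), a nil pointer can serve as a perfectly valid empty list:

```go
package main

import "fmt"

// List is a hypothetical singly linked list. A nil *List is a
// valid, meaningful value: the empty list.
type List struct {
	head int
	tail *List
}

// Len works on a nil receiver: the nil list simply has length 0.
func (l *List) Len() int {
	if l == nil {
		return 0
	}
	return 1 + l.tail.Len()
}

func main() {
	var l *List          // nil, but perfectly valid for calling Len
	fmt.Println(l.Len()) // prints 0
	l = &List{head: 1, tail: &List{head: 2}}
	fmt.Println(l.Len()) // prints 2
}
```

Here nil is not a degenerate case to be guarded against by the caller; it is the natural representation of emptiness, and the method handles it.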

On the flip side, producing a similar value that is not a pointer at all but panics when used is trivial:

type AlwaysInvalid struct {}

func (ai AlwaysInvalid) Print() {
    panic("I do not print")
}

func main() {
    ai := AlwaysInvalid{}
    ai.Print()
}

This panics when Print() is called. It can’t be invalid because it’s a nil pointer, since it isn’t a pointer of any kind, nil or otherwise.

If I can produce a nil pointer that doesn’t panic when used, and a value-type object that always panics when used, but someone considers the former “invalid” and the latter “valid”, then validity must not be related to whether or not a value panics.

Which brings us back around to, what does “valid” mean in this context?

Invalid in C?

As I said in my Interfaces post, I think this is an inappropriate carryover from the world of C, where NULL (not nil!) pointers are conventionally considered “invalid”, and there is nothing you can do with them at all that won’t crash…

… except that’s not true either. You can compare a pointer to NULL legally. Some functions deliberately return NULL to indicate various things; something as classic as strchr returns a NULL char* if you search a string for a character that does not exist in the string.

Is that NULL “invalid”?

My answer would be, “no”. It’s a perfectly sensible, documented return value, and if you unconditionally dereference it in C, the error is yours1.

But clearly there is something about NULL that is “invalid”…

Again, what does it mean to be “invalid”?

Validity Is Relative

The answer is one I think a lot of people don’t like very much: there is no universal definition of validity. Data can only be valid or invalid within some particular context. This means that the same data may be valid in one context and invalid in another.

In this case, a C NULL is invalid in the specific context of dereferencing it. If you do that, the program will fault somehow. NULL is not unconditionally invalid, it is invalid for specific uses.
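Go has a direct analogue of this relativity built into the language: a nil map is valid for reading, for `len`, and for ranging over, but invalid for writing. A small sketch:

```go
package main

import "fmt"

func main() {
	var m map[string]int // a nil map

	// Valid contexts: length and reads work fine on a nil map.
	fmt.Println(len(m))  // prints 0
	fmt.Println(m["x"])  // prints 0 (the zero value for int)

	// Invalid context: writing to a nil map panics.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("write panicked:", r)
		}
	}()
	m["x"] = 1 // panic: assignment to entry in nil map
}
```

The same nil map is simultaneously valid and invalid; only the operation being attempted decides which.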

More Commentary About Validity

This further implies that certain common practices are actually code smells, including some I use myself. For instance, it is easy to stick a .IsValid() bool (or some other variant on the return value) method on an object in most languages, or do some local equivalent. (Even in a functional programming language, you can still do something equivalent for my purposes by writing a is_valid function that examines private/internal/(whatever) values as part of its decision.) Doing that implies a certain universality to the declaration of “validity”, but that is often not the best thing to do. Even if your library/package/module has some basic definition of validity that it can enforce, e.g., “this binary tree is correctly balanced”, there may be another context for which the definition of “validity” may entail further constraints on what the tree can contain.

For instance, the “validity” of a string in C amounts to “having a NUL ('\0') terminator in an allocated memory segment”, but code that wants to check the “validity” of a given string will almost certainly impose further constraints, like “this complies with the definition of an HTTP header name” or “this contains only 7-bit ASCII”, or any number of further validity constraints. A data type can impose a certain baseline of validity for itself, but generalized validity is necessarily an external concern for a given data type.
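To sketch what such layered validity looks like in Go (the validator below is hypothetical, using the token rule from RFC 7230 as its context): the argument is always a baseline-valid Go string, but only some strings are valid in the narrower context of an HTTP header field name.

```go
package main

import (
	"fmt"
	"strings"
)

// isValidHeaderName is a hypothetical context-specific validator.
// Any Go string passed in is already "valid" as a string; this
// checks the stricter context of HTTP header field names, which
// must be non-empty and contain only token characters (RFC 7230).
func isValidHeaderName(s string) bool {
	if s == "" {
		return false
	}
	const extra = "!#$%&'*+-.^_`|~"
	for _, r := range s {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
			// letters and digits are token characters
		case strings.ContainsRune(extra, r):
			// so are these punctuation characters
		default:
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isValidHeaderName("Content-Type")) // true
	fmt.Println(isValidHeaderName("Bad Header"))   // false: space is not a token character
}
```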

Therefore, the only correct answer to the questions “is a nil pointer to a struct in Go valid?” or “is a NULL char pointer in C valid?” is that the question is ill-defined. Without more context the question has no answer.

I have many .IsValid() methods defined in my various code bases. Technically those ought to be understood not as some sort of universal definition of validity, but as being a system small enough that there is only one relevant context and therefore I can collapse the separation between context and data type on the grounds that I am confident there is only one such context. I find there are many such things in code that I write, where conceptually there is a separation which I then elide because the separation is useless2.

Validity of User Input Data

There’s another long-standing debate that this ties into, between people who believe you should “sanitize” incoming user input to “remove bad data” and those who believe that you should be “encoding” the user’s data on the way out, so that it can’t be interpreted in some security-violating manner. The latter is correct, and that’s in no small part because at the time that the data enters the system, there is no way of turning the data into “valid” data at that point because that part of the system doesn’t know what “valid” even is, because it can’t know the context.

If the input is headed out into potentially unescaped HTML, there’s one set of “bad” characters in this context.

If the input is headed into potentially string-concatenated SQL, there’s another set of “bad” characters in this other context.

If the input is headed to CSV files, there’s another set of “bad” characters, depending on the details of which variant of CSV is being used, whether or not the writing code quotes everything, etc. And this is itself a whole set of contexts, not just one.

And in all cases, the “bad” characters may also be important characters; a double-quote can be a problem in all three contexts, plus any number of others, but it is also an important part of the original data; you cannot in general just strip it out and throw it away.

Worst of all, additional outputs may be added in the future for data that didn’t even exist at the time the data was ingested; how can any conceivable “sanitization” code account for that?

The only correct option for safely handling data is that at output time, the output code must correctly encode the data, because only the output code can correctly define “valid”, because only the output code has the relevant context.
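The Go standard library makes this concrete: the same stored input is encoded differently at output time depending on the destination context. A sketch using `html.EscapeString` and `encoding/csv` (the helper function names are my own):

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"html"
)

// encodeForHTML encodes a value at output time for an HTML context.
func encodeForHTML(s string) string {
	return html.EscapeString(s)
}

// encodeForCSV encodes a value at output time for a CSV context,
// letting the csv writer apply its quoting rules.
func encodeForCSV(s string) string {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.Write([]string{s})
	w.Flush()
	return buf.String()
}

func main() {
	// The same user input, stored unmodified at ingestion time.
	input := `He said "2 < 3" to me`

	fmt.Println(encodeForHTML(input)) // He said &#34;2 &lt; 3&#34; to me
	fmt.Print(encodeForCSV(input))    // "He said ""2 < 3"" to me"
}
```

The double-quotes survive intact in both outputs; each encoder neutralizes them differently because each context defines “bad” differently. No sanitization at input time could have produced both.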

A Valid Definition Of Validity Should Be Precise

I would suggest that if you want to argue with this, especially in the specific context of nil pointers in Go, you should take the time to precisely specify what you mean by “valid” before you post your vigorous disagreement with me. In particular, what makes a given value invalid, and when? If you want to make unconditional claims about “when”, by all means say “always”, but be clear about it.

Since I just said the definition is intrinsically relative, it is difficult for me to tell you that you are “wrong”. However, I reserve the right to tell you that your definition is not very useful.

While this may not apply to $YOU, dear reader, it seems clear to me that some people have a probably-implicit definition of “validity” for Go pointers that simply defines nil as invalid and not-nil as valid. Like I just said, I can’t necessarily call this “wrong”, but I will say this is not a very useful definition of validity… but perhaps not for the reason you think. This definition lacks utility not primarily because of what I demonstrate above (where nil pointers can still have methods on them and not-nil things can still crash for all uses), but more because we already have a term for pointers that are nil and pointers that are not-nil, which is…

nil and not nil.

We don’t need a second term for that. It would be helpful to reserve the term “valid” for something other than that.

However, it is probably impossible to provide a useful definition of “valid” for which I can’t show you a valid nil pointer and something not-nil that is invalid by your definition.

In the context of the original Go discussion, this is relevant because the frequently-requested solution to the nil-pointer-in-an-interface pseudo-problem is some ability to dig into the interface’s value and find out whether the internal pointer is nil.

However, that raises the same problem as when I was discussing user input above; the code doing this does not know whether a nil pointer of an arbitrary type, potentially one that it has never heard of and didn’t even exist when the original code was being written3, is “valid” or not. It can’t. It does not have the context necessary to make that decision, potentially because that context is in the code’s future and thus, in the most rigid possible way, can’t possibly have been taken into account by the person writing the code.
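For what it’s worth, Go’s `reflect` package already permits this kind of inspection; the point is not that detecting the nil is impossible, but that the answer carries no verdict on validity. A sketch (reusing the `SomeStruct` type from the example above):

```go
package main

import (
	"fmt"
	"reflect"
)

type SomeStruct struct{}

func (ss *SomeStruct) Print() { fmt.Println("hello!") }

// holdsNilPointer reports whether the interface value wraps a
// nil pointer. It can answer "is it nil?" but not "is it valid?";
// that judgment belongs to the caller's context.
func holdsNilPointer(v any) bool {
	rv := reflect.ValueOf(v)
	return rv.Kind() == reflect.Pointer && rv.IsNil()
}

func main() {
	var p *SomeStruct               // nil, yet its Print method works fine
	fmt.Println(holdsNilPointer(p)) // prints true
	p.Print()                       // prints "hello!", no panic
}
```

The inspection succeeds, and yet the “detected” nil pointer is perfectly usable; the generic code has learned a fact, not a verdict.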

But this comes up in all languages in one form or another. There may well be more code in the world that assumes there is some universal definition of validity than code written with the understanding that validity is intrinsically relative, and even if today there is only one viable definition, at least in principle there may be others tomorrow. So even if you aren’t a Go programmer, or even loathe the language, there is still something to be learned from this discussion.


  1. Yes, yes, Option types etc etc etc. We’re talking C here. It doesn’t have those. If you are programming in C, you accepted this contract. ↩︎

  2. Another example is serialization code. Many static languages have support for declaring a struct and then converting JSON straight into that struct. In theory, you should always have a separation between that JSON struct and your “real” struct, because you should not be constraining your internal representation to “things that happen to be easy to serialize to and from JSON”. There are numerous important cases where you cannot serialize things directly into and out of JSON.

    In practice, I very often have those collapsed, because in the simple cases where JSON serialization does work, the “conversion” would just be some sort of direct copy, or unnecessary casting, etc. There’s no reason to actually carry around multiple different representations if they aren’t bringing any value.

    In my head, though, I still conceive of the struct as pulling double duty, and if as the program evolves, the default serialization behavior becomes an impediment, I split the type at the drop of a hat. Having just one is a bonus. A bonus I very often managed to claim, but still just a bonus. ↩︎
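    A sketch of what that split looks like when it becomes worthwhile (the types here are hypothetical): a JSON-facing “wire” struct alongside the internal type it converts to.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// eventWire is the hypothetical JSON-facing representation,
// constrained to what serializes cleanly (a Unix timestamp).
type eventWire struct {
	Name string `json:"name"`
	At   int64  `json:"at"`
}

// Event is the internal representation, free to use types that
// don't round-trip through JSON the way we want.
type Event struct {
	Name string
	At   time.Time
}

// parseEvent decodes the wire form and converts it to the
// internal form in one explicit step.
func parseEvent(data []byte) (Event, error) {
	var w eventWire
	if err := json.Unmarshal(data, &w); err != nil {
		return Event{}, err
	}
	return Event{Name: w.Name, At: time.Unix(w.At, 0)}, nil
}

func main() {
	e, err := parseEvent([]byte(`{"name":"deploy","at":1700000000}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(e.Name, e.At.Unix()) // deploy 1700000000
}
```

    In the simple cases, Event and eventWire would have identical fields and the conversion would be a pointless copy, which is exactly when collapsing them into one struct is the sensible move.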

  3. This is one of the distinguishing characteristics of an “interface”, that it can contain types that did not even exist at the time that the code using some interface is being written. Therefore, interfaces in general impose a constraint that the thing operating through an interface must not use any additional knowledge about the type within the interface other than what is provided by the interface itself.

    Go does provide a way to penetrate this abstraction, but it must always be used with care, and in particular with the understanding that while you may be able to name some concrete types if you want to do optimizations or special handling, you cannot in general name all the possible concrete types that may come in.

    In Go, the exception is if an interface contains an unexported method, which closes the interface to external implementations; it then becomes possible to name all the concrete types that implement it, because they can only be located in that package. ↩︎
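    As an illustrative sketch of such a closed interface (the types here are hypothetical):

```go
package main

import "fmt"

// Shape is closed to outside implementations: the unexported
// isShape method can only be implemented within this package.
type Shape interface {
	Area() float64
	isShape()
}

type Square struct{ Side float64 }

func (s Square) Area() float64 { return s.Side * s.Side }
func (Square) isShape()        {}

type Circle struct{ Radius float64 }

func (c Circle) Area() float64 { return 3.14159 * c.Radius * c.Radius }
func (Circle) isShape()        {}

// describe can safely switch over every concrete type, because
// the closed interface guarantees the case list is exhaustive.
func describe(s Shape) string {
	switch s.(type) {
	case Square:
		return "square"
	case Circle:
		return "circle"
	default:
		return "unreachable"
	}
}

func main() {
	fmt.Println(describe(Square{Side: 2}))   // square
	fmt.Println(describe(Circle{Radius: 1})) // circle
}
```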

