Reject octal zeros in IPv4 addresses by Smittyvb · Pull Request #86984 · rust-la...
source link: https://github.com/rust-lang/rust/pull/86984
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
This fixes #86964 by rejecting octal zeros in IP addresses, such that 192.168.00.00000000
is rejected with a parse error, since having leading zeros in front of another zero indicates it is a zero written in octal notation, which is not allowed in the strict mode specified by RFC 6943 3.1.1. Octal rejection was implemented in #83652, but due to the way it was implemented octal zeros were still allowed.
(rust-highfive has picked a reviewer for you, use r? to override)
The documentation for std::net::Ipv4Addr
pretty clearly states that the octets are in "decimal notation"; thus I personally don't even understand why the octet 001
is rejected since that is a perfectly reasonable base-10 representation of the integer 1. While that octet can be interpreted in any numeral system (including base-8), the octet 127
can also be interpreted in any numeral system starting from base-8. To me, this is inconsistent and very much confusing. It's like the parser says "127
can be in base-8, base-9, base-10, etc.; but we will default to interpreting it as base-10"; but for some reason, the parser doesn't apply that same logic to the octet 001
.
The documentation also clearly cites IETF RFC 6943, but Section 3.1.1. in that RFC defines the "strict" form—the form that forbids other numeral systems (e.g., base-8)—as 4 octets in decimal notation separated by a "." such that each octet is between one and three digits. The current behavior doesn't conform to any obvious standard.
Apparently other languages default to interpreting any integer with padded 0
s as base-8, but that is not true in most areas of Rust except here (e.g., assert_eq!("010".parse::<u32>().unwrap(), 10u32);
).
To me if for some reason we want to forbid 0
-padded octets, then it should be applied consistently for all integers not all integers except 0.
AIUI, it was defined in POSIX that a leading 0
in an IPv4 address should indicate octal. Rust was always parsing as decimal, but this could lead to bugs, even security issues, if you have system that combines Rust and POSIX ideas of a given input address. You might validate "010" in one mode and then use it in the other with a different interpretation. The Rust parser was changed to reject octal to avoid this confusion altogether.
You could debate whether "0"
alone really has a leading zero to make it octal rather than decimal, but it usually doesn't matter. Longer strings of zeros are more plausibly octal, I guess, but it still seems pedantic to reject it.
Hm, maybe this is yet another area where my math background is biting me in the butt because it's rather foreign to me to treat leading 0
s as "more likely" to be base-8 than any other base since all numeral systems are based on the same fundamental algorithm. Seems overly restrictive to assume leading 0
s means base-8 (with the exception of 0 apparently) too since clearly something like 008
is not base-8 due to the 8
not being a symbol in the base-8 numeral system; and since the documentation states each octet is in base-10, the problem of "how to interpret each octet" is already answered: base-10. I'm clearly in the minority with this point-of-view though, so I'll drop it.
Octal 0...
goes back to C, at least: https://en.cppreference.com/w/c/language/integer_constant
- decimal-constant is a non-zero decimal digit (1, 2, 3, 4, 5, 6, 7, 8, 9), followed by zero or more decimal digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
- octal-constant is the digit zero (0) followed by zero or more octal digits (0, 1, 2, 3, 4, 5, 6, 7)
(which does mean a lone 0
is technically octal as well in C, not that it matters)
The documentation for std::net::Ipv4Addr pretty clearly states that the octets are in "decimal notation"; thus I personally don't even understand why the octet 001 is rejected since that is a perfectly reasonable base-10 representation of the integer 1.
The docs should probably be clarified: the octets are in decimal notation, and leading 0s are disallowed (except, as usual, for a plain 0
).
This almost matches current behavior, except for 0000
. So with this PR, those docs would be correct. Without this PR, the docs should say: the octets are in decimal notation, and leading 0s are disallowed, except for the number 0
itself which may have an arbitrary amount of leading 0s.
(I don't have an opinion between these two options, but this PR does not update the docs so IMO it is incomplete either way.)
Seems overly restrictive to assume leading 0s means base-8
I'm not disagreeing, but this is what other parts of the ecosystem do. It is much better for Rust to reject 011
in IP addresses than for both Rust and POSIX to accept it and interpret it differently -- would you agree to that? Rust does not exist in a vacuum, we have to take into account the surrounding reality here, and reality often uses a leading 0 to indicate octal notation (and yes this makes no mathematical sense; whoever decided that decades ago didn't ask mathematicians it seems, but without a time machine no amount of good arguments is going to change this). We could probably stretch the limits here and reject 011
but accept 008
(since POSIX could not accept the latter), but that seems even more surprising and weird than just rejecting leading 0
s outright.
The docs possibly should say that this is why leading 0s are rejected.
The docs should probably be clarified: the octets are in decimal notation, and leading 0s are disallowed (except, as usual, for a plain
0
).
That is perfectly OK to me. While I would prefer something like 001.001.001.001
to be parseable because that is technically valid base-10 (at least mathematically)—I know my Nintendo Switch requires exactly 3 base-10 digits when typing in the IP—I realize that POSIX compatibility is very important.
I'm not disagreeing, but this is what other parts of the ecosystem do. It is much better for Rust to reject
011
in IP addresses than for both Rust and POSIX to accept it and interpret it differently -- would you agree to that? Rust does not exist in a vacuum, we have to take into account the surrounding reality here, and reality often uses a leading 0 to indicate octal notation (and yes this makes no mathematical sense; whoever decided that decades ago didn't ask mathematicians it seems, but without a time machine no amount of good arguments is going to change this). We could probably stretch the limits here and reject011
but accept008
(since POSIX could not accept the latter), but that seems even more surprising and weird than just rejecting leading0
s outright.
I absolutely agree. I was not aware POSIX stated that leading 0s meant base-8 until @cuviper pointed that out (i.e., my "complaint" would have been purely about the documentation and nothing else had I known this). I admit that I am probably in the minority when it comes to my background as a Rustacean. I've never written a line of assembly, C, C++, or kernel code in my life—something I'm not proud of, but c'est la vie—so it very well could be the case that "decimal notation" obviously means "mathematical base-10 with the added restriction that leading 0s are not allowed with the exception of 0 itself which is allowed to have any non-negative integer amount of leading 0s" to the majority of other Rustaceans.
Also want to add that what's considered "pedantry" to one may not be "pedantry" to another. While my 30+ years on this planet has shown me that my brain is wired very differently than most, I don't think this is an example of my brain thinking that differently. Case in point, I have a .CSV file of IPv4 addresses. Each row contains exactly 16 UTF-8 code units—with the last Unicode scalar value being a newline (i.e., U+000A). Based on the documentation, I wrote code that used this parsing algorithm; but of course, I was dismayed when I saw it fail. Ideally when one reads the documentation, they should have a good sense of the invariants and what's supposed to work and not. Obviously it didn't take long for me to change my implementation, but it would have been nice if I knew something like 001.001.001.001
would lead to a parsing failure a priori. Honestly even if two examples were added to the documentation that highlighted something like 001.001.001.001
leading to a parsing failure and 0000000.0.0000.00
leading to a parsing success would be enough for someone like me to deduce what "decimal" notation means in this context.
so it very well could be the case that "decimal notation" obviously means "mathematical base-10 with the added restriction that leading 0s are not allowed with the exception of 0 itself which is allowed to have any non-negative integer amount of leading 0s" to the majority of other Rustaceans.
I don't think it does. :) Leading 0s being code for "base 8" is a weird C thing that Rust did not do for its own integer literals, but that doesn't help us here. And either way the docs should be understandable for people with lots of different backgrounds, so if it is confusing for some, we should fix that! It's near impossible for one person (whoever writes the docs) to foresee all the ways it can be misleading, which is why documentation feedback and bugreports are extremely valuable. :)
So, we are in full agreement that the docs should document this more clearly. Usually, lading 0s are allowed, as you pointed out in your bugreport.
While I would prefer something like 001.001.001.001 to be parseable because that is technically valid base-10 (at least mathematically)—I know my Nintendo Switch requires exactly 3 base-10 digits when typing in the IP—I realize that POSIX compatibility is very important.
What I could imagine being useful (but I am not a libs person) is a parsing function that explicitly is not POSIX compatible, and that does interpret 010.010.010.010
entirely in decimal. But that should not be the default parsing function. I am not sure if the Rust stdlib is the right place for this, but it might be worth suggesting.
I am honestly quite shocked that POSIX would interpret this as 8.8.8.8
, and I did not know that this is the case... but my ping
agrees:
$ ping 010.010.010.010
PING 010.010.010.010 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=118 time=11.5 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=118 time=10.4 ms
Now this probably is a pedantic request—thus something I am not that passionate about and OK with not doing—but an example like the parsing of 0000.0.0.0
leading to a successful result despite having leading 0
s is informative as that result does not obviously follow from the documentation (also has the property that more than 3 digits are allowed despite going against the "strict" form mentioned in RFC 6943 Section 3.1.1.).
Once I know how to do "pull requests", I won't ask others to do such a thing.
Does this require libs team signoff or an fcp or something? This is another breaking change after the one that rejected the octal notation. I've encountered the regression at work and will probably encounter this one too when its available in stable.
That being said, I think the PR does make things more consistent and coherent.
r? @joshtriplett for libs-api(?) examination of this issue. I think it probably merits an FCP, at least...
I think we should go ahead with this change.
@rfcbot merge
Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:
No concerns currently listed.
Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!
See this document for info about what commands tagged team members can give me.
I was personally somewhat on the side of just interpreting octets as decimal, even with a leading zero, POSIX be damned. But one thing that convinced me otherwise was the fact that, well, those leading zeroes are indeed interpreted as indicating octal base in many other places. So if our routine accepts the same IP but gives a different result than others, that could indeed be quite surprising. So I think it's the right decision to fail here. (Or alternatively, just implement octal support.)
(I am not aware of the history of rejecting octal in this routine, so I may just be re-treading old ground, but wanted to mention it since there was some discussion above.)
Copy link
rfcbot commented 15 days ago
This is now entering its final comment period, as per the review above.
Copy link
rfcbot commented 5 days ago
The final comment period, with a disposition to merge, as per the review above, is now complete.
As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.
This will be merged soon.
@bors r+
Commit 403d269 has been approved by m-ou-se
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK