Github Tracking issue for RFC 2457, "Allow non-ASCII identifiers" · Is...
source link: https://github.com/rust-lang/rust/issues/55467
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
New issue
Tracking issue for RFC 2457, "Allow non-ASCII identifiers" #55467
Centril opened this issue on Oct 29, 2018 · 125 comments
Comments
This is a tracking issue for the RFC "Allow non-ASCII identifiers" (rust-lang/rfcs#2457).
Steps:
-
Implement the RFC (cc @rust-lang/compiler @Manishearth)
- Normalize identifiers to NFC whilst parsing (#66670, #67702)
- Ensure that
#![forbid(non_ascii_idents)]
works. (#61883) - Lint:
confusable_idents
(#71542, #72770) - Lint:
less_used_codepoints
uncommon_codepoints
(#67810) - Adjustments to "bad style
non_standard_style
" lints. (See #73839) - Lint:
mixed_script_confusables
(#72770) - Provide reusable crates for above lints and checks on crates.io. (unicode-security)
- Adjust documentation (see instructions on forge)
- Stabilization PR (see instructions on forge)
Unresolved questions:
- Which context is adequate for confusable detection: file, current scope, crate?
- Should ZWNJ and ZWJ be allowed in identifiers?
-
How are non-ASCII idents best supported in debuggers?
Resolved: DWARF and debuggers handle UTF-8 just fine - Which name mangling scheme is used by the compiler? (Punycode, see RFC2603)
-
Is there a better name for the
less_used_codepoints
lint?
Resolved in favour ofuncommon_codepoints
-
Which lint should the global mixed scripts confusables detection trigger?
Resolved in favor ofmixed_script_confusables
-
How badly do non-ASCII idents exacerbate const pattern confusion
(#7526, #49680)?
Can we improve precision of linting here? -
In
mixed_script_confusables
, do we actually need to make an exception forLatin
identifiers? - Terminal width is a tricky with unicode. Some characters are long, some have lengths dependent on the fonts installed (e.g. emoji sequences), and modifiers are a thing. The concept of monospace font doesn't generalize to other scripts as well. How does rustfmt deal with this when determining line width?
- right-to-left scripts can lead to weird rendering in mixed contexts (depending on the software used), especially when mixed with operators. This is not something that should block stabilization, however we feel it is important to explicitly call out. Future RFCs (preferably put forth by RTL-using communities) may attempt to improve this situation (e.g. by allowing bidi control characters in specific contexts).
-
Tweak
XID_Start
/XID_Continue
? #4928http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm
The ISO JTC1/SC22/WG14 (C language) think that possibly UTR#31 didn't quite hit the nail on the head in terms of defining identifier syntax. They have a couple tweaks in mind. Consider following their lead.
-
Similarly to out-of-line modules (
mod фоо;
), extern crates and paths with a first segment naming a crate should not be able to do filesystem search using those non-ASCII identifiers (i.e. no ,extern crate ьаг;
orму_сгате::baz
). (#73305)
zulip channel topic for real-time discussion:
https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/nonascii.20identifiers(rfc.202457)
last unresolved question isn't a real unresolved question, it was included in the RFC for completeness but does not block this issue.
@joshtriplett Please check that the list of checkboxes above are satisfactory. :)
@Manishearth alright; leave a note under it to that effect?
The note saying so is already in the unresolved q
Is there a better name for the
less_used_codepoints
lint?
Substituting "rare" or "unusual" for "less used" seems to me a simple, if not necessarily final, improvement, replacing the somewhat awkward "less used" with a single, shorter, more usual synonym.
(Edit: I note that I personally oppose allowing non-ASCII identifiers, but I recognize that the Rust Team favors it, and I have no problem bowing to their decision and chipping in my cents to help.)
I would prefer “rare” as it sounds more objective to me than “unusual”, and perhaps less judgemental as well.
My first thought was "uncommon", but that's not strong enough of an adjective to get the intended meaning across.
I'm partial towards "rare" as well; rare_codepoints
is pretty short and sweet.
If we need something even stronger we might try "mythic".
I prefer uncommon_codepoints
as being more serious/boring than rare/legendary/mythic/etc.
Copy link
8573 commented on Nov 4, 2018
I prefer
uncommon_codepoints
as being more serious/boring than rare/legendary/mythic/etc.
As a native and competent English-user who generally is seen as serious/boring [1], while I agree that "legendary" and "mythic" sound rather fantastical [2], I don't think "rare" does. The distinction I would draw between uncommon_codepoints
and rare_codepoints
is that "rare" may be seen as stronger than "uncommon", at least in US English (Merriam-Webster and Wiktionary say so, whereas Oxford and Cambridge don't say so explicitly).
[1] (I recognize this may not have been a trait I displayed when I was a member of your channel.)
[2] (similarly to the Unicode term "astral plane", which, possibly relevantly, if I understand correctly, was changed to "supplementary plane")
I'm much against non-ascii identifiers, but I'm not going to repeat what other folks said. I just noticed that no example of malicious code was provided so far, so I'm providing one:
fn list_items_in_category(category: &str) -> io::Result<String> { let cate𝚐ory = sanitize_untrusted_input(category); debug!("Listing category {}", cate𝚐ory); system(format!("grep '^{},' /my/simple/database | awk -F , "{{ print $2 }}", category)) }
Who can spot the problem without looking at character codes?
As a side note, I'd like to provide my experience with attempts to localize everything (feel free to skip the rest of my comment if you want to remain technical). I went to a school where they translate literally every technical term. In an attempt to make everything understandable to everyone, they translated even things that are very difficult to translate reasonably.
You'd expect that it was much easier to learn at that school compared to others schools that don't do that, right? Well, life is weird. It was hard to understand, I felt like Alice in wonderland and it took me a week to realize that "that weird term I didn't hear before" was the actual thing I wanted to study and the very reason I signed up for that specific school!
Of course, this is not directly an argument against non-ascii identifiers. I just wanted to express my concerns to all those wonderful loving people (seriously) who want for everyone to feel great in Rust community, so that they remain vigilant and avoid accidentally going against their own beliefs.
while I agree that "legendary" and "mythic" sound rather fantastical [2], I don't think "rare" does.
(Agreed. My comment was entirely in jest (as I hope should've been evident?), and I had no intention of tarnishing "rare" by association, a word which is itself ordinary and common.)
Who can spot the problem without looking at character codes?
It's trivial on my font. Also, we want to start with lints against this sort of thing.
Who can spot the problem without looking at character codes?
Github's monospace font configuration on my machine ended up using a binocular glyph for U+0067 LATIN SMALL LETTER G but a monocular one for U+1D690 MATHEMATICAL MONOSPACE SMALL G, so I spotted it instantly.
That said, the confusable_idents
lint will once implemented almost certainly flag this code, since MATHEMATICAL MONOSPACE SMALL G → LATIN SMALL LETTER G
is listed in confusables.txt
.
The Oxford English Dictionary has as one definition of "uncommon":
Of an unusual type or character; exceptional in kind or quality
which seems especially appropriate (Emphasis mine.)
I feel it's a more suitable choice of words than "rare", which has significantly more meanings than "uncommon", some associated specifically with being good, e.g. (also OED):
Unusually good, fine, or worthy; of uncommon excellence or merit.
I also think more reserved wording is generally appropriate for compiler naming conventions.
but I'm not going to repeat what other folks said
That's precisely what you're doing -- your "counterexample" is caught by both the less_used_codepoints lint and the confusable_idents lint.
At this point these counterexample discussions have been done to death (and we're leveraging a unicode spec designed to deal with this!) -- please actually check if your "counterexample" isn't something we or the Unicode Consortium have thought of already.
Exactly. People have thought a lot about this, and it's certainly possible to implement a feature that many people will find useful while dealing with complications that might arise. At this point, adding this feature is a given. If you have ideas about how to improve the lints for finding confusable identifiers, by all means share them, but there's no need to simply point out an issue that everyone is already aware of.
Re. less_used_codepoints
: I propose that we bring the bikeshed to a halt in favor of uncommon_codepoints
because "meh".
I was personally in favor of rare_codepoints
but uncommon_codepoints
is basically the same and also works for me so...
Everyone OK with this?
works for me. Slight preference for rare but very slight. Both work for me, and I don't think it's really worth bikeshedding this too much :)
@Manishearth Sorry, I didn't mean to argue that the example is unsolved, I just wanted to provide actual code for those who might have difficulties imagining how to turn this feature into something malicious.
I wonder though, why confusables lint is not mandatory (according to RFC). It sounds to me like making borrow checker non-mandatory. My understanding is that Rust should be safe by default, where you can opt-into unsafety. For me, this means deny
ing confusables in every crate I make, if I want to ensure high quality of my crates.
This isn't a matter of safety in the way that Rust describes it.
Making the warnings on by default has been discussed. Please let's not use this tracking issue to relitigate things which have already been decided through a rather long RFC.
Let's see if we can get this stabilized. We have a bunch of unresolved questions, I feel like I have a concrete answer for many of these, (I have, after all, been thinking about this for years now). I think we actually are pretty close to stabilization on this.
cc @crlf0710 @rust-lang/lang @pyfisch
(lang team, how should we proceed on this? perhaps nominate it for discussion?)
- Which context is adequate for confusable detection: file, current scope, crate?
Crate, as implemented, seems fine for now. We can tweak later, this is not a stabilization blocker.
I actually speak a language that natively uses ZWJ for representing a certain relatively rare letter. The version of it without the zwj is acceptable, but not technically correct.
But most of the use cases are for affecting joining behavior in compounds, and overall it seems minor. I think we should disallow for now and keep an issue open for people who wish to use them to collect data. There's significant risk associated with allowing these since they're invisible, and for most languages do not introduce a semantic distinction.
- How are non-ASCII idents best supported in debuggers?
cc @fitzgen . I think we should just do what C++ does.
- Which lint should the global mixed scripts confusables detection trigger?
mixed_script_confusables
? That's what we do now.
This is not a blocker, and is only a problem for users of non-ascii idents. We can build better lints as needed.
- In
mixed_script_confusables
, do we actually need to make an exception forLatin
identifiers?
I think the exception as is, where Latin is always treated as trusted, is fine and necessary. We always consider ASCII to be "ok" and without this exception using, say, ë (which is confusable with a cyrillic character) will suddenly cause your compiler to error because it seems like the only time you use non-ascii latin is for a mixed script confusable.
- Terminal width is a tricky with unicode. Some characters are long, some have lengths dependent on the fonts installed (e.g. emoji sequences), and modifiers are a thing. The concept of monospace font doesn't generalize to other scripts as well. How does rustfmt deal with this when determining line width?
I consider this to not be a blocker. We should file a followup on rustfmt.
- Tweak
XID_Start
/XID_Continue
? #4928http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm
The ISO JTC1/SC22/WG14 (C language) think that possibly UTR#31 didn't quite hit the nail on the head in terms of defining identifier syntax. They have a couple tweaks in mind. Consider following their lead.
It's hard to understand exactly what their specification is here, and how it differs. The way it's specified makes it seem like punctuation is also allowed, which seems ... wrong? They mention Pattern_*
but not in enough detail. I've spent a fair amount of time on going through the things allowed by UTR 31 and it has so far seemed fine to me.
@fitzgen asked me how I thought GDB would handle non-ASCII identifiers. It's been a while since I worked on GDB, but last I knew, it simply stored everything as null-terminated strings. I think if you're lucky it'll put identifiers that don't match some sort of expected syntax in single quotes when it prints them.
The best thing is to just open up an ELF file in an editor, change some names to something else that is the same number of bytes in length, and see what GDB says.
@tromey might actually know the real answer.
I experimented by editing the DWARF by hand to add an emoji in a local variable name and gdb
seemed to work just fine:
gdb
also seems to handle non-UTF-8 as well as can be expected:
There's significant risk associated with allowing [ZWNJ and ZWJ] since they're invisible, and for most languages do not introduce a semantic distinction.
If this risk is significant, is it acceptable for identifiers to include U+034F COMBINING GRAPHEME JOINER? It is invisible and makes almost no semantic difference, but RFC 2457 allows it, along with many other such characters. If these characters are okay, why not ZWNJ and ZWJ too?
@dscorbett None of them have an Allowed
Identifier_Status
, so they won't go by unlinted: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ADI%3A%5D%26%5B%3AXID_Continue%3A%5D&g=Identifier_status&i=Identifier_Type
ZWJ and ZWNJ do. An easy fix here is to include them in the set of characters we lint on, but my point here is that we can stabilize without allowing these and figure it out later based on user feedback.
I guess "significant" was not the right word to use.
@Manishearth I've nominated this for discussion, but I anticipate that the lang team would be happy to move forward on this. Would you be able and willing to write a stabilization report for this feature, including the questions you answered above, as well as any other key decisions that were made since the RFC was approved?
@cramertj Yes, but I'm not clear on what is involved in writing such a report
(or is it just the points you mentioned?)
Terminal width is a tricky with unicode. Some characters are long, some have lengths dependent on the fonts installed (e.g. emoji sequences), and modifiers are a thing. The concept of monospace font doesn't generalize to other scripts as well. How does rustfmt deal with this when determining line width?
I consider this to not be a blocker. We should file a followup on rustfmt.
This is already a problem that rustc has to deal with because of string literals. I 100% agree it isn't a blocker.
@tromey might actually know the real answer.
For names in DWARF, GDB just hopes for the best. In theory DWARF lets a compiler specify that names are UTF-8-encoded (DW_AT_use_UTF8
). In practice, GDB just ignores this and assumes that whatever is in the DWARF matches however you've happened to set up the GDB locale. Lame, I guess, but maybe expensive to fix, and also low-impact, since my belief is that most people use UTF-8 locales now.
It's worth noting that enabling the confusable idents lint at least is likely to increase compile times, particularly on smaller crates - https://perf.rust-lang.org/compare.html?start=bd0bacc694d7d8175804bb6f690cb846bfa4a9ee&end=faa97a0482a1a2dd0650603410128fe53ec06fbc is the last perf run I could find on the PR implementing the confusable idents lint (#71542). It's not too big a hit, but if we intend to enable the lint by default it may be worth doing some more investigation on whether that hit can be mitigated. (It's not 100% clear to me whether the lint would be warn-by-default after stabilization).
@Manishearth it's probably good to adjust the stabilization report to include the current lint levels for each of the lints and (if applicable) suggested/required new levels.
@Mark-Simulacrum Lints always run regardless of whether they are enabled (we may optimize this in the future). Stabilization cannot change any performance characteristics here.
(It's not 100% clear to me whether the lint would be warn-by-default after stabilization).
It should, good catch
@Mark-Simulacrum Actually, no, all of these lints are Warn
by default already, aside from non_ascii_idents
(which is supposed to be Allow
). I'll clarify this in the report anyway.
(This could be done without stabilization because these lints only apply to so-far-unstable code, and lints have a relaxed stability policy due to cap-lints)
@Mark-Simulacrum Lints always run regardless of whether they are enabled (we may optimize this in the future). Stabilization cannot change any performance characteristics here.
This isn't true for these lints:
Ah good point. Either way, their level has already been Warn
for a while.
We could perhaps do ascii-checks during parsing and just store a "contains non-ascii idents" bool that globally disables these. But I feel these are all followups, we have been running with the perf impact of these lints for a year already.
We discussed this in today's @rust-lang/lang meeting, and agreed to merge based on this stabilization report. Thanks, @Manishearth!
@rfcbot merge
Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:
No concerns currently listed.
Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!
See this document for info about what commands tagged team members can give me.
@rfcbot reviewed
Copy link
rfcbot commented 3 days ago
This is now entering its final comment period, as per the review above.
No one assigned
None yet
No milestone
Successfully merging a pull request may close this issue.
None yet
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK