New issue

Tracking issue for RFC 2457, "Allow non-ASCII identifiers" #55467

Centril opened this issue on Oct 29, 2018 · 125 comments

Comments

Copy link

Contributor

Centril commented on Oct 29, 2018 •

edited by Manishearth

This is a tracking issue for the RFC "Allow non-ASCII identifiers" (rust-lang/rfcs#2457).

Steps:

Implement the RFC (cc @rust-lang/compiler @Manishearth)
- Normalize identifiers to NFC whilst parsing (#66670, #67702)
- Ensure that #![forbid(non_ascii_idents)] works. (#61883)
- Lint: confusable_idents(#71542, #72770)
- Lint: less_used_codepoints uncommon_codepoints (#67810)
- Adjustments to "bad stylenon_standard_style" lints. (See #73839)
- Lint: mixed_script_confusables(#72770)
- Provide reusable crates for above lints and checks on crates.io. (unicode-security)
Adjust documentation (see instructions on forge)
Stabilization PR (see instructions on forge)

Unresolved questions:

Which context is adequate for confusable detection: file, current scope, crate?
Should ZWNJ and ZWJ be allowed in identifiers?
How are non-ASCII idents best supported in debuggers?
Resolved: DWARF and debuggers handle UTF-8 just fine
Which name mangling scheme is used by the compiler? (Punycode, see RFC2603)
Is there a better name for the less_used_codepoints lint?
Resolved in favour of uncommon_codepoints
Which lint should the global mixed scripts confusables detection trigger?
Resolved in favor of mixed_script_confusables
How badly do non-ASCII idents exacerbate const pattern confusion
(#7526, #49680)?
Can we improve precision of linting here?
In mixed_script_confusables, do we actually need to make an exception for Latin identifiers?
Terminal width is a tricky with unicode. Some characters are long, some have lengths dependent on the fonts installed (e.g. emoji sequences), and modifiers are a thing. The concept of monospace font doesn't generalize to other scripts as well. How does rustfmt deal with this when determining line width?
right-to-left scripts can lead to weird rendering in mixed contexts (depending on the software used), especially when mixed with operators. This is not something that should block stabilization, however we feel it is important to explicitly call out. Future RFCs (preferably put forth by RTL-using communities) may attempt to improve this situation (e.g. by allowing bidi control characters in specific contexts).
Tweak XID_Start / XID_Continue? #4928

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm

The ISO JTC1/SC22/WG14 (C language) think that possibly UTR#31 didn't quite hit the nail on the head in terms of defining identifier syntax. They have a couple tweaks in mind. Consider following their lead.
Similarly to out-of-line modules (mod фоо;), extern crates and paths with a first segment naming a crate should not be able to do filesystem search using those non-ASCII identifiers (i.e. no , extern crate ьаг; or му_сгате::baz). (#73305)

zulip channel topic for real-time discussion:
https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/nonascii.20identifiers(rfc.202457)

last unresolved question isn't a real unresolved question, it was included in the RFC for completeness but does not block this issue.

Copy link

Contributor

Author

Centril commented on Oct 29, 2018

@joshtriplett Please check that the list of checkboxes above are satisfactory. :)

@Manishearth alright; leave a note under it to that effect?

The note saying so is already in the unresolved q

Copy link

8573 commented on Oct 29, 2018 •

edited

Is there a better name for the less_used_codepoints lint?

Substituting "rare" or "unusual" for "less used" seems to me a simple, if not necessarily final, improvement, replacing the somewhat awkward "less used" with a single, shorter, more usual synonym.

(Edit: I note that I personally oppose allowing non-ASCII identifiers, but I recognize that the Rust Team favors it, and I have no problem bowing to their decision and chipping in my cents to help.)

I like "unusual" -Manish Goregaokar

…

Copy link

Contributor

Serentty commented on Oct 29, 2018

I would prefer “rare” as it sounds more objective to me than “unusual”, and perhaps less judgemental as well.

My first thought was "uncommon", but that's not strong enough of an adjective to get the intended meaning across.

Copy link

Contributor

Author

Centril commented on Nov 1, 2018

I'm partial towards "rare" as well; rare_codepoints is pretty short and sweet.

Copy link

Contributor

glaebhoerl commented on Nov 1, 2018

If we need something even stronger we might try "mythic".

Copy link

Member

eddyb commented on Nov 3, 2018

I prefer uncommon_codepoints as being more serious/boring than rare/legendary/mythic/etc.

Copy link

8573 commented on Nov 4, 2018

@eddyb—

I prefer uncommon_codepoints as being more serious/boring than rare/legendary/mythic/etc.

As a native and competent English-user who generally is seen as serious/boring [1], while I agree that "legendary" and "mythic" sound rather fantastical [2], I don't think "rare" does. The distinction I would draw between uncommon_codepoints and rare_codepoints is that "rare" may be seen as stronger than "uncommon", at least in US English (Merriam-Webster and Wiktionary say so, whereas Oxford and Cambridge don't say so explicitly).

[1] (I recognize this may not have been a trait I displayed when I was a member of your channel.)
[2] (similarly to the Unicode term "astral plane", which, possibly relevantly, if I understand correctly, was changed to "supplementary plane")

Copy link

Contributor

Kixunil commented on Nov 5, 2018

I'm much against non-ascii identifiers, but I'm not going to repeat what other folks said. I just noticed that no example of malicious code was provided so far, so I'm providing one:

fn list_items_in_category(category: &str) -> io::Result<String> {
    let cate𝚐ory = sanitize_untrusted_input(category);
    debug!("Listing category {}", cate𝚐ory);
    system(format!("grep '^{},' /my/simple/database | awk -F , "{{ print $2 }}", category))
}

Who can spot the problem without looking at character codes?

As a side note, I'd like to provide my experience with attempts to localize everything (feel free to skip the rest of my comment if you want to remain technical). I went to a school where they translate literally every technical term. In an attempt to make everything understandable to everyone, they translated even things that are very difficult to translate reasonably.

You'd expect that it was much easier to learn at that school compared to others schools that don't do that, right? Well, life is weird. It was hard to understand, I felt like Alice in wonderland and it took me a week to realize that "that weird term I didn't hear before" was the actual thing I wanted to study and the very reason I signed up for that specific school!

Of course, this is not directly an argument against non-ascii identifiers. I just wanted to express my concerns to all those wonderful loving people (seriously) who want for everyone to feel great in Rust community, so that they remain vigilant and avoid accidentally going against their own beliefs.

Copy link

Contributor

glaebhoerl commented on Nov 5, 2018

while I agree that "legendary" and "mythic" sound rather fantastical [2], I don't think "rare" does.

(Agreed. My comment was entirely in jest (as I hope should've been evident?), and I had no intention of tarnishing "rare" by association, a word which is itself ordinary and common.)

Copy link

Member

eddyb commented on Nov 5, 2018

Who can spot the problem without looking at character codes?

It's trivial on my font. Also, we want to start with lints against this sort of thing.

Copy link

Contributor

ketsuban commented on Nov 5, 2018 •

edited

Who can spot the problem without looking at character codes?

Github's monospace font configuration on my machine ended up using a binocular glyph for U+0067 LATIN SMALL LETTER G but a monocular one for U+1D690 MATHEMATICAL MONOSPACE SMALL G, so I spotted it instantly.

That said, the confusable_idents lint will once implemented almost certainly flag this code, since MATHEMATICAL MONOSPACE SMALL G → LATIN SMALL LETTER G is listed in confusables.txt.

Copy link

Member

varkor commented on Nov 5, 2018

The Oxford English Dictionary has as one definition of "uncommon":

Of an unusual type or character; exceptional in kind or quality

which seems especially appropriate (Emphasis mine.)

I feel it's a more suitable choice of words than "rare", which has significantly more meanings than "uncommon", some associated specifically with being good, e.g. (also OED):

Unusually good, fine, or worthy; of uncommon excellence or merit.

I also think more reserved wording is generally appropriate for compiler naming conventions.

Copy link

Member

Manishearth commented on Nov 5, 2018

@Kixunil

but I'm not going to repeat what other folks said

That's precisely what you're doing -- your "counterexample" is caught by both the less_used_codepoints lint and the confusable_idents lint.

At this point these counterexample discussions have been done to death (and we're leveraging a unicode spec designed to deal with this!) -- please actually check if your "counterexample" isn't something we or the Unicode Consortium have thought of already.

Copy link

Contributor

Serentty commented on Nov 9, 2018

Exactly. People have thought a lot about this, and it's certainly possible to implement a feature that many people will find useful while dealing with complications that might arise. At this point, adding this feature is a given. If you have ideas about how to improve the lints for finding confusable identifiers, by all means share them, but there's no need to simply point out an issue that everyone is already aware of.

Copy link

Contributor

Author

Centril commented on Nov 9, 2018

Re. less_used_codepoints: I propose that we bring the bikeshed to a halt in favor of uncommon_codepoints because "meh".

I was personally in favor of rare_codepoints but uncommon_codepoints is basically the same and also works for me so...

Everyone OK with this?

Copy link

Member

Manishearth commented on Nov 9, 2018

works for me. Slight preference for rare but very slight. Both work for me, and I don't think it's really worth bikeshedding this too much :)

Copy link

Contributor

Kixunil commented on Nov 15, 2018

@Manishearth Sorry, I didn't mean to argue that the example is unsolved, I just wanted to provide actual code for those who might have difficulties imagining how to turn this feature into something malicious.

I wonder though, why confusables lint is not mandatory (according to RFC). It sounds to me like making borrow checker non-mandatory. My understanding is that Rust should be safe by default, where you can opt-into unsafety. For me, this means denying confusables in every crate I make, if I want to ensure high quality of my crates.

This isn't a matter of safety in the way that Rust describes it.

Making the warnings on by default has been discussed. Please let's not use this tracking issue to relitigate things which have already been decided through a rather long RFC.

Copy link

Member

Manishearth commented 15 days ago

Let's see if we can get this stabilized. We have a bunch of unresolved questions, I feel like I have a concrete answer for many of these, (I have, after all, been thinking about this for years now). I think we actually are pretty close to stabilization on this.

cc @crlf0710 @rust-lang/lang @pyfisch

(lang team, how should we proceed on this? perhaps nominate it for discussion?)

Which context is adequate for confusable detection: file, current scope, crate?

Crate, as implemented, seems fine for now. We can tweak later, this is not a stabilization blocker.

Should ZWNJ and ZWJ be allowed in identifiers?

I actually speak a language that natively uses ZWJ for representing a certain relatively rare letter. The version of it without the zwj is acceptable, but not technically correct.

But most of the use cases are for affecting joining behavior in compounds, and overall it seems minor. I think we should disallow for now and keep an issue open for people who wish to use them to collect data. There's significant risk associated with allowing these since they're invisible, and for most languages do not introduce a semantic distinction.

How are non-ASCII idents best supported in debuggers?

cc @fitzgen . I think we should just do what C++ does.

Which lint should the global mixed scripts confusables detection trigger?

mixed_script_confusables? That's what we do now.

How badly do non-ASCII idents exacerbate const pattern confusion
(#7526, #49680)?

This is not a blocker, and is only a problem for users of non-ascii idents. We can build better lints as needed.

In mixed_script_confusables, do we actually need to make an exception for Latin identifiers?

I think the exception as is, where Latin is always treated as trusted, is fine and necessary. We always consider ASCII to be "ok" and without this exception using, say, ë (which is confusable with a cyrillic character) will suddenly cause your compiler to error because it seems like the only time you use non-ascii latin is for a mixed script confusable.

Terminal width is a tricky with unicode. Some characters are long, some have lengths dependent on the fonts installed (e.g. emoji sequences), and modifiers are a thing. The concept of monospace font doesn't generalize to other scripts as well. How does rustfmt deal with this when determining line width?

I consider this to not be a blocker. We should file a followup on rustfmt.

Tweak XID_Start / XID_Continue? #4928

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm
The ISO JTC1/SC22/WG14 (C language) think that possibly UTR#31 didn't quite hit the nail on the head in terms of defining identifier syntax. They have a couple tweaks in mind. Consider following their lead.

It's hard to understand exactly what their specification is here, and how it differs. The way it's specified makes it seem like punctuation is also allowed, which seems ... wrong? They mention Pattern_* but not in enough detail. I've spent a fair amount of time on going through the things allowed by UTR 31 and it has so far seemed fine to me.

Copy link

Contributor

jimblandy commented 15 days ago

@fitzgen asked me how I thought GDB would handle non-ASCII identifiers. It's been a while since I worked on GDB, but last I knew, it simply stored everything as null-terminated strings. I think if you're lucky it'll put identifiers that don't match some sort of expected syntax in single quotes when it prints them.

The best thing is to just open up an ELF file in an editor, change some names to something else that is the same number of bytes in length, and see what GDB says.

Copy link

Contributor

jimblandy commented 15 days ago

@tromey might actually know the real answer.

Copy link

Member

fitzgen commented 15 days ago

I experimented by editing the DWARF by hand to add an emoji in a local variable name and gdb seemed to work just fine:

Copy link

Member

fitzgen commented 15 days ago

gdb also seems to handle non-UTF-8 as well as can be expected:

Copy link

Contributor

dscorbett commented 15 days ago

There's significant risk associated with allowing [ZWNJ and ZWJ] since they're invisible, and for most languages do not introduce a semantic distinction.

If this risk is significant, is it acceptable for identifiers to include U+034F COMBINING GRAPHEME JOINER? It is invisible and makes almost no semantic difference, but RFC 2457 allows it, along with many other such characters. If these characters are okay, why not ZWNJ and ZWJ too?

Copy link

Member

Manishearth commented 15 days ago

@dscorbett None of them have an Allowed Identifier_Status, so they won't go by unlinted: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ADI%3A%5D%26%5B%3AXID_Continue%3A%5D&g=Identifier_status&i=Identifier_Type

ZWJ and ZWNJ do. An easy fix here is to include them in the set of characters we lint on, but my point here is that we can stabilize without allowing these and figure it out later based on user feedback.

I guess "significant" was not the right word to use.

Copy link

Member

cramertj commented 15 days ago

@Manishearth I've nominated this for discussion, but I anticipate that the lang team would be happy to move forward on this. Would you be able and willing to write a stabilization report for this feature, including the questions you answered above, as well as any other key decisions that were made since the RFC was approved?

Copy link

Member

Manishearth commented 15 days ago

@cramertj Yes, but I'm not clear on what is involved in writing such a report

Copy link

Member

Manishearth commented 15 days ago

(or is it just the points you mentioned?)

Copy link

Contributor

estebank commented 15 days ago

Terminal width is a tricky with unicode. Some characters are long, some have lengths dependent on the fonts installed (e.g. emoji sequences), and modifiers are a thing. The concept of monospace font doesn't generalize to other scripts as well. How does rustfmt deal with this when determining line width?

I consider this to not be a blocker. We should file a followup on rustfmt.

This is already a problem that rustc has to deal with because of string literals. I 100% agree it isn't a blocker.

Copy link

Contributor

tromey commented 14 days ago

@tromey might actually know the real answer.

For names in DWARF, GDB just hopes for the best. In theory DWARF lets a compiler specify that names are UTF-8-encoded (DW_AT_use_UTF8). In practice, GDB just ignores this and assumes that whatever is in the DWARF matches however you've happened to set up the GDB locale. Lame, I guess, but maybe expensive to fix, and also low-impact, since my belief is that most people use UTF-8 locales now.

Copy link

Member

Manishearth commented 14 days ago •

edited

@cramertj I wrote up a stabilization report here

Inlined report

It's worth noting that enabling the confusable idents lint at least is likely to increase compile times, particularly on smaller crates - https://perf.rust-lang.org/compare.html?start=bd0bacc694d7d8175804bb6f690cb846bfa4a9ee&end=faa97a0482a1a2dd0650603410128fe53ec06fbc is the last perf run I could find on the PR implementing the confusable idents lint (#71542). It's not too big a hit, but if we intend to enable the lint by default it may be worth doing some more investigation on whether that hit can be mitigated. (It's not 100% clear to me whether the lint would be warn-by-default after stabilization).

@Manishearth it's probably good to adjust the stabilization report to include the current lint levels for each of the lints and (if applicable) suggested/required new levels.

Copy link

Member

Manishearth commented 10 days ago

@Mark-Simulacrum Lints always run regardless of whether they are enabled (we may optimize this in the future). Stabilization cannot change any performance characteristics here.

(It's not 100% clear to me whether the lint would be warn-by-default after stabilization).

It should, good catch

Copy link

Member

Manishearth commented 10 days ago •

edited

@Mark-Simulacrum Actually, no, all of these lints are Warn by default already, aside from non_ascii_idents (which is supposed to be Allow). I'll clarify this in the report anyway.

(This could be done without stabilization because these lints only apply to so-far-unstable code, and lints have a relaxed stability policy due to cap-lints)

@Mark-Simulacrum Lints always run regardless of whether they are enabled (we may optimize this in the future). Stabilization cannot change any performance characteristics here.

This isn't true for these lints:

let check_non_ascii_idents = cx.builder.lint_level(NON_ASCII_IDENTS).0 != Level::Allow; let check_uncommon_codepoints = cx.builder.lint_level(UNCOMMON_CODEPOINTS).0 != Level::Allow; let check_confusable_idents = cx.builder.lint_level(CONFUSABLE_IDENTS).0 != Level::Allow; let check_mixed_script_confusables = cx.builder.lint_level(MIXED_SCRIPT_CONFUSABLES).0 != Level::Allow; if !check_non_ascii_idents && !check_uncommon_codepoints && !check_confusable_idents && !check_mixed_script_confusables { return; }

Copy link

Member

Manishearth commented 10 days ago

Ah good point. Either way, their level has already been Warn for a while.

We could perhaps do ascii-checks during parsing and just store a "contains non-ascii idents" bool that globally disables these. But I feel these are all followups, we have been running with the perf impact of these lints for a year already.

Copy link

Member

joshtriplett commented 10 days ago

We discussed this in today's @rust-lang/lang meeting, and agreed to merge based on this stabilization report. Thanks, @Manishearth!

@rfcbot merge

Copy link

rfcbot commented 10 days ago •

edited by scottmcm

Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

Copy link

Contributor

nikomatsakis commented 9 days ago

@rfcbot reviewed

Copy link

rfcbot commented 3 days ago

This is now entering its final comment period, as per the review above.

Assignees

No one assigned

Projects

None yet

Milestone

No milestone

Linked pull requests

Successfully merging a pull request may close this issue.

None yet

33 participants

and others

Github Tracking issue for RFC 2457, "Allow non-ASCII identifiers" · Is...

Tracking issue for RFC 2457, "Allow non-ASCII identifiers" #55467

Comments

Centril commented on Oct 29, 2018 •

Centril commented on Oct 29, 2018

8573 commented on Oct 29, 2018 •

Serentty commented on Oct 29, 2018

Centril commented on Nov 1, 2018

glaebhoerl commented on Nov 1, 2018

eddyb commented on Nov 3, 2018

8573 commented on Nov 4, 2018

Kixunil commented on Nov 5, 2018

glaebhoerl commented on Nov 5, 2018

eddyb commented on Nov 5, 2018

ketsuban commented on Nov 5, 2018 •

varkor commented on Nov 5, 2018

Manishearth commented on Nov 5, 2018

Serentty commented on Nov 9, 2018

Centril commented on Nov 9, 2018

Manishearth commented on Nov 9, 2018

Kixunil commented on Nov 15, 2018

Manishearth commented 15 days ago

jimblandy commented 15 days ago

jimblandy commented 15 days ago

fitzgen commented 15 days ago

fitzgen commented 15 days ago

dscorbett commented 15 days ago

Manishearth commented 15 days ago

cramertj commented 15 days ago

Manishearth commented 15 days ago

Manishearth commented 15 days ago

estebank commented 15 days ago

tromey commented 14 days ago

Manishearth commented 14 days ago •

Manishearth commented 10 days ago

Manishearth commented 10 days ago •

Manishearth commented 10 days ago

joshtriplett commented 10 days ago

rfcbot commented 10 days ago •

nikomatsakis commented 9 days ago

rfcbot commented 3 days ago

Recommend

About Joyk