
RFC: `c"…"` string literals by m-ou-se · Pull Request #3348 · rust-lan...

 1 year ago
source link: https://github.com/rust-lang/rfcs/pull/3348

Conversation

Member

@m-ou-se m-ou-se commented Nov 15, 2022

edited by tmandry
BurntSushi, seritools, Kixiron, thomcc, Diggsey, mejrs, ChayimFriedman2, joshtriplett, petrochenkov, jhpratt, and 25 more reacted with thumbs up emoji
steffahn, clarfonthey, BurntSushi, mati865, faptc, Kixiron, thomcc, mejrs, Nashenas88, coolreader18, and 6 more reacted with heart emoji

m-ou-se

added T-lang Relevant to the language subteam, which will review and decide on the RFC. A-syntax Syntax related proposals & ideas

labels

Nov 15, 2022

Member

Author

m-ou-se commented Nov 15, 2022

Three weeks ago, the lang team said they would be interested in potentially doing this in the future. So here's an RFC. :)

scottmcm and Karmylr reacted with heart emoji

Contributor

clarfonthey commented Nov 15, 2022

I'm on board. I'd even consider that a future extension might be to allow os"..." string literals, but that seems probably more iffy since it'd be the first case of a language item not being available in no_std environments. (I think?)

One other potential thing to think about is whether c"..." string patterns should be allowed. Like, completely outside of the realm of constant patterns, if c"..." would be considered a valid pattern for macros, etc.

Accepted escape codes: [Quote](https://doc.rust-lang.org/reference/tokens.html#quote-escapes) & [Unicode](https://doc.rust-lang.org/reference/tokens.html#unicode-escapes) & [Byte](https://doc.rust-lang.org/reference/tokens.html#byte-escapes).

Unicode characters are accepted and encoded as UTF-8. That is, `c"🦀"`, `c"\u{1F980}"` and `c"\xf0\x9f\xa6\x80"` are all accepted and equivalent.
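The equivalence claimed in the quoted RFC text can be checked at the byte level with APIs that already exist. The following is a minimal sketch, not part of the RFC: the function name `crab_variants` is made up for illustration, and the trailing `\0` is written by hand because `CStr::from_bytes_with_nul` requires it (a `c"…"` literal would append it automatically).

```rust
use std::ffi::CStr;

/// Builds the crab emoji as a C string in three ways, mirroring the
/// three spellings from the RFC text. (Hypothetical helper name.)
pub fn crab_variants() -> [&'static CStr; 3] {
    // Literal character, encoded as UTF-8 by the compiler.
    let literal = CStr::from_bytes_with_nul("🦀\0".as_bytes()).unwrap();
    // Unicode escape, also encoded as UTF-8.
    let unicode_escape = CStr::from_bytes_with_nul("\u{1F980}\0".as_bytes()).unwrap();
    // Raw byte escapes spelling out the same UTF-8 sequence.
    let byte_escapes = CStr::from_bytes_with_nul(b"\xf0\x9f\xa6\x80\0").unwrap();
    [literal, unicode_escape, byte_escapes]
}
```

All three values compare equal because `CStr` only stores bytes; how those bytes were spelled in the source is not preserved.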

I wish byte string literals had this support too, so big +1 on this!

Member

Author

@m-ou-se m-ou-se Nov 15, 2022

It might be worth proposing that in a separate RFC. That would also resolve one unresolved question of concat_bytes, if we accept that mixing UTF-8 and non-UTF-8 in byte strings is okay.

clarfonthey reacted with thumbs up emoji
BurntSushi reacted with rocket emoji

Member

Author

@m-ou-se m-ou-se Nov 15, 2022

edited

Wrote an RFC for that: #3349

BurntSushi, joshtriplett, and farnz reacted with heart emoji

Member

Author

m-ou-se commented Nov 15, 2022

edited

I'd even consider that a future extension might be to allow os"..." string literals

I was hoping to make things like os!"..." possible without extending the language for each prefix: #3267. But that proposal turned out to be quite controversial and was rejected.

An alternative would be to allow literals like "…" to implicitly convert to more than just &str (just like how 123 can be u32 or i64, etc. etc.). Some kind of const FromLiteral trait or something, once we have const traits. Then "…" could implicitly become a &CStr, and 123 a BigNum, etc. Not sure how exactly that feature would work though, but I'll mention it in the alternatives section.

clarfonthey reacted with thumbs up emoji

One concern I have is that if single-letter prefixes become common, extending the language with new prefixes can become confusing. Although, if br and cr are treated as fixed literals rather than composition, this may be a non-issue.

Member

nagisa commented Nov 15, 2022

I have two rhetorical questions with regards to the RFC text:

  1. What does the dependence of this feature on the standard library types mean for #[no_core] crates? Would it be possible to do something/anything that would make #[no_core] crates utilizing the c"" literals still work out of the box?
  2. What does the defaulting to UTF-8 encoding mean when interacting with C source that targets non-UTF-8 locales (let's say the linked-in C code is encoded in JIS, and the environment is also set up for JIS)? How does that interact with whatever reasonable assumptions a developer might make about c""?

Member

Author

m-ou-se commented Nov 15, 2022

edited
  1. What does the dependence of this feature on the standard library types mean for #[no_core] crates? Would it be possible to do something/anything that would make #[no_core] crates utilizing the c"" literals still work out of the box?

Do we even support no_core? I suppose it just means that they'd have to define the CStr lang item if they want to use c"" syntax. I think we could make not the type but a constructor function the lang item, such that they can decide themselves what to do with the [u8; N]. (In core, that'd basically be CStr::from_bytes_with_nul_unchecked.)
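A rough illustration of the "constructor function" idea follows. It is only a sketch of what such a lowering could look like, not what the compiler does: the static `HI_BYTES` and the function `hi` are hypothetical names, and the byte array stands in for the `[u8; N]` the compiler would hand to the lang item.

```rust
use std::ffi::CStr;

// Bytes a compiler might emit for a hypothetical `c"hi"` literal:
// the contents followed by one appended NUL. (Illustrative name.)
static HI_BYTES: [u8; 3] = *b"hi\0";

/// Stand-in for the constructor the lang item would point at.
pub fn hi() -> &'static CStr {
    // SAFETY: HI_BYTES ends with a NUL byte and contains no interior NUL,
    // which is exactly what `from_bytes_with_nul_unchecked` requires.
    unsafe { CStr::from_bytes_with_nul_unchecked(&HI_BYTES) }
}
```

In this shape the validity check happens once, when the literal is lexed, so the constructor itself can skip it.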

  2. What does the defaulting to UTF-8 encoding mean when interacting with C source that targets non-UTF-8 locales (let's say the linked-in C code is encoded in JIS, and the environment is also set up for JIS)? How does that interact with whatever reasonable assumptions a developer might make about c""?

The exact same as would happen when using regular string literals. For example, libc::puts("我名字叫玛拉。".as_ptr() as _) is already possible. It'll just pass the string as UTF-8 encoded bytes. 🤷‍♀️
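To make "UTF-8 encoded bytes" concrete, here is a small sketch that avoids the `libc` crate and only inspects what a C function would receive. The helper names `greeting` and `c_visible_len` are illustrative assumptions, and the NUL terminator is added explicitly because a regular string literal does not carry one.

```rust
use std::ffi::CStr;

/// The sentence from the comment above as a NUL-terminated C string.
/// (Hypothetical helper; the `\0` is written by hand.)
pub fn greeting() -> &'static CStr {
    CStr::from_bytes_with_nul("我名字叫玛拉。\0".as_bytes()).unwrap()
}

/// Number of bytes a C function such as `strlen` would count,
/// i.e. everything before the terminating NUL.
pub fn c_visible_len(s: &CStr) -> usize {
    s.to_bytes().len()
}
```

The seven characters each occupy three bytes in UTF-8, so the C side sees 21 bytes whatever locale it assumes; no transcoding takes place.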

- Also add `c'…'` C character literals? (`u8`, `i8`, `c_char`, or something more flexible?)

- Should we make `&CStr` a thin pointer before stabilizing this? (If so, how?)

I think this should be a blocker on stabilization, yeah.

Member

@Kixiron Kixiron Nov 15, 2022

edited

I don't see how this feature is blocked by that at all really. It produces an &'static CStr regardless of what &CStr itself is made of.

@Kixiron To be clear, I think considering that question should be a blocker for stabilization.

Given that a major use case of this will be FFI, it seems important that we have a simple, not-error-prone way of passing a C string to C functions. If we decide that &CStr wasn't that mechanism, then we should decide what that mechanism should be, and make sure c"..." works well with that.
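For readers unfamiliar with the thin-pointer question, the sketch below shows the current situation as I understand it: `&CStr` carries a length alongside the address, so it is wider than the `*const c_char` that C expects, and `CStr::as_ptr` is the existing way to get the latter. The standard library documents `CStr` as having no guaranteed layout, so treat the size comparison as an observation about present-day toolchains, not a promise. The function names are made up for illustration.

```rust
use std::ffi::CStr;
use std::mem::size_of;
use std::os::raw::c_char;

/// True when `&CStr` is wider than a raw C string pointer,
/// which is the case while `&CStr` is a fat pointer.
pub fn cstr_ref_is_fat() -> bool {
    size_of::<&CStr>() > size_of::<*const c_char>()
}

/// The explicit conversion needed today before handing a C string to FFI.
pub fn to_ffi_ptr(s: &CStr) -> *const c_char {
    s.as_ptr()
}
```

Whatever representation is chosen, a `c"…"` literal would still produce a `&'static CStr`, so calls like `to_ffi_ptr` would keep working.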

SUPERCILEX, Veykril, tux3, daboross, and tmandry reacted with thumbs up emoji

rfcbot commented Nov 17, 2022

edited by cramertj

Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

rfcbot

added proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. disposition-merge This RFC is in PFCP or FCP with a disposition to merge it.

labels

Nov 17, 2022

joshtriplett

removed the I-lang-nominated Indicates that an issue has been nominated for prioritizing at the next lang team meeting. label

Nov 17, 2022

Member

Author

m-ou-se commented Nov 17, 2022

I implemented this yesterday evening, but it got a bit ugly/verbose because this literal type accepts yet another different set of escape codes. I'll clean it up more before sending a PR, but the code would get simpler if #3349 were also accepted. :)

Member

Author

m-ou-se commented Nov 17, 2022

I was hoping to make things like os!"..." possible without extending the language for each prefix: #3267. But that proposal turned out to be quite controversial and was rejected.

A pity considering that other languages do have custom string literal processors. That would have made it a library concern instead of a lang issue and users could bring their own specialty string types if needed.

I'm confused by both of these comments. The RFC opens with "Implementing cstr!() as a macro_rules or proc_macro requires non-trivial code to get it completely right", but #3267 was just a shorter macro syntax which wouldn't work for this RFC because the whole point is that the macro system makes implementing this kind of thing difficult.

Ah sorry for the confusion. I should clarify: I think c"" is a better solution for C string literals than the c!"" macro syntax idea I proposed before. But if we had c!"", then #3349 would've been enough to make it accept all the escape codes. The implementation would still be non-trivial, but we could provide it as a built-in macro to work around that.

I do not think that we should have many different prefixes (for paths, os strings, wide strings, utf16 strings, allocated strings, Cows, etc. etc. etc.), because that quickly gets out of hand. The macro shorthand syntax could've helped with that, but there are other options too:

The "custom string literal processors" thing that @the8472 mentions sounds like what I described in #3348 (comment), some kind of "const" FromLiteral trait or something, to allow any string- or integer-like type to be created directly from a literal. It's mentioned in the RFC in the alternatives section. I've also added it to the future possibilities section, because I think we might still want to consider that at some point in the future, even if this RFC is accepted. (Maybe once we have const traits.)

I'm still confused as to why macros and prefixes are being presented as incompatible ideas. If we don't want to play the "add a prefix" game for every type and #3349 is accepted, why can't we have a macro like this: c!("")? The parens are a little weird, but honestly so is the bang for a string literal. To be clear, I do think adding a prefix for cstrs is a good idea because cstrs are a fundamental, platform-independent type (whereas the examples like paths already have easy ways to be constructed and are domain-specific). I just want to make sure we don't fall into the trap of thinking that because a macro has parens it means it can't be used as a shorthand for other types.

Member

nagisa commented Nov 17, 2022

edited

even in the unlikely event that we do end up horribly regretting the choice later then it's easily fixed via an edition.

To be honest, this is not the first time I’m finding it unfortunate that editions are being used as a motivation for dismissing concerns about future-proofness of proposed changes. Changing some existing behaviour across an edition boundary isn’t as trivial of a question as “is the change technically feasible?” It is also a question of how widespread is the use of that functionality, how confusing it would be to have multiple flavours of the language where specific syntactic constructs behave differently, etc. In practice there are likely very few syntactic constructs we would be willing to change, even if we do horribly regret them. More likely we’d deprecate the old and add some new construct instead, possibly acknowledging that it is less-than-ideal thing to do but still a superior alternative.

Ixrec, kornelski, jminer, m-ou-se, joshtriplett, Nilstrieb, daboross, riking, Gilnaa, and CAD97 reacted with thumbs up emoji
afetisov reacted with heart emoji

Lonami commented Nov 17, 2022

I haven't seen it mentioned, but perhaps the following alternative could be considered:

The compiler would refuse to compile c"..." unless the last character is \0. The compiler would not append this character on its own. This way the NUL byte is explicit. But perhaps that would be more annoying than helpful (and at this point using b"..." would probably be just as good).

This way the NUL byte is explicit.

I'd argue that's what the c prefix is for. Also, the compiler would have to reject strings that don't have the nul byte, so it seems like requiring it explicitly would just be an annoying thing you have to do to please the compiler.
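The "explicit NUL" behaviour discussed here is essentially what `CStr::from_bytes_with_nul` already provides at run time, which may help in comparing the two designs. A minimal sketch follows; `parse` is a hypothetical wrapper name.

```rust
use std::ffi::CStr;

/// Accepts a byte slice only if it ends in exactly one NUL
/// and contains no NUL before that. (Hypothetical wrapper.)
pub fn parse(bytes: &[u8]) -> Option<&CStr> {
    CStr::from_bytes_with_nul(bytes).ok()
}
```

With this wrapper, `parse(b"hello\0")` succeeds, while `parse(b"hello")` (no terminator) and `parse(b"he\0llo\0")` (interior NUL) are both rejected. The proposed `c"…"` literal moves the interior-NUL check to compile time and supplies the terminator itself.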

As an additional data point, in gtk-rs we have a very similar use case and these C string literals would almost fit there but we need the UTF-8 guarantee that normal strings would give.

Currently there's a GStr type (basically a CStr plus the UTF-8 guarantee) and a gstr! proc-macro that converts a normal string literal into a &GStr.

Not sure if there's anything that could adapt this proposal to also allow for that use case, but with CStr not guaranteeing valid UTF-8 that seems to be hard. Maybe this would need something like externally extensible string literal modifiers so we could implement a g"123" string literal or so.

Contributor

Lokathor commented Nov 18, 2022

edited

The "non-portable" argument against having c'.' to make a c_char seems extremely weak. People can already declare a const to be of type c_char, so properly portable code that's expecting c_char can't just pick u8 or i8 and forget about the difference.

EDIT: to be clear, I'm not saying that c'.' syntax needs to be in this rfc, but if we're not going to have it here I'd much rather that the rfc just say "this is out of scope for now".

# Motivation

[motivation]: #motivation

Looking at the [amount of `cstr!()` invocations just on GitHub](https://cs.github.com/?scopeName=All+repos&scope=&q=cstr%21+lang%3Arust) (about 3.2k files with matches) it seems like C string literals

Contributor

nikomatsakis commented Nov 29, 2022

@rfcbot reviewed

rfcbot

added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substantial objections are raised. label

Nov 29, 2022

rfcbot commented Nov 29, 2022

🔔 This is now entering its final comment period, as per the review above. 🔔

m-ou-se reacted with rocket emoji

rfcbot

removed the proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. label

Nov 29, 2022

rfcbot

added finished-final-comment-period The final comment period is finished for this RFC. to-announce

and removed final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substantial objections are raised.

labels

Dec 9, 2022

rfcbot commented Dec 9, 2022

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

CAD97 commented Dec 13, 2022

Just a minor additional note: I want to second that even if c"…" and bc"…" both create &CStr, having the former carry a guarantee of well-formed UTF-8 is beneficial to readers of the code, who then know that the former is UTF-8 encoded (and that the latter is probably intended to contain non-UTF-8 encoded data). This includes procedural macros, which are capable of seeing the prefix used and using the UTF-8 guarantee for 3rd party guaranteed UTF-8 CStr variants like e.g. cstr8::CStr8 (disclaimer: my own crate) and interop with C++ std::u8string/std::u8string_view.

Just having c"…" be &CStr and allow arbitrary nonnul bytes is probably the more practical choice. The proc macro which would've used the guaranteed-UTF-8 can just as easily take a normal string literal and convert it to a c"…" literal internally like it would today (but benefitting from the automatic interior-nul checking).

(Polymorphic string literals are probably the ideal long-term position, but having c", c8", c16", u", u8", u16", u32", char" (etc. or w/e) prefixes to explicitly disambiguate which string type from whatever string types this theoretical future std provides is still reasonable and a good idea. (Super explicit: not proposing any of these at this time.))

However, as a data point, the windows crate provides c!("…") as just concat!("…", "\0").as_ptr(), and despite the lack of interior-nul checking, the guaranteed-UTF-8 is useful. (They also currently provide w! for the same thing but for UTF-16, and h! for HSTRING.) Asking the team working on the windows crate how they'd ideally like to utilize c"…" is probably worth doing sometime before stabilization. (Not to prioritize windows over Linux or macOS; it's just what I'm familiar with. It's probably worth asking the Rust-for-Linux and Android people for their input as well.)
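The macro shape described above can be sketched as follows. This is not the windows crate's actual source: the macro name `c_ptr!` and the two helper functions are hypothetical, chosen only to show what `concat!("…", "\0").as_ptr()` gives you and what it does not check.

```rust
use std::os::raw::c_char;

// Hypothetical macro mirroring the `concat!("…", "\0").as_ptr()` shape.
macro_rules! c_ptr {
    ($s:literal) => {
        // The input is a normal string literal, so it is guaranteed UTF-8;
        // a single NUL is appended at compile time.
        concat!($s, "\0").as_ptr() as *const c_char
    };
}

/// Well-behaved use: no NUL inside the literal.
pub fn hello_ptr() -> *const c_char {
    c_ptr!("hello")
}

/// Nothing stops an interior NUL here; C code would see only "he".
pub fn unchecked_interior_nul() -> *const c_char {
    c_ptr!("he\0llo")
}
```

The second function compiles without complaint, which is the missing interior-NUL check mentioned above; a `c"…"` literal is meant to reject that case at compile time.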

tmandry

merged commit 873890e into

rust-lang:master

Dec 14, 2022

Contributor

tmandry commented Dec 14, 2022

Huzzah! The @rust-lang/lang team has decided to accept this RFC.

To track further discussion, subscribe to the tracking issue here:
rust-lang/rust#105723

SUPERCILEX, ehuss, CAD97, attila-lin, and m-ou-se reacted with hooray emoji
