7

International domain names: where does https://meßagefactory.ca lead you?

 1 year ago
source link: https://lemire.me/blog/2023/01/23/international-domain-names-where-does-https-mesagefactory-ca-lead-you/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Skip to content

Daniel Lemire's blog

Daniel Lemire is a computer science professor at the Data Science Laboratory of the Université du Québec (TÉLUQ) in Montreal. His research is focused on software performance and data engineering. He is a techno-optimist and a free-speech advocate.

International domain names: where does https://meßagefactory.ca lead you?

Originally, the domain part of a web address was all ASCII (so no accents, no emojis, no Chinese characters). This was extended a long time ago thanks to something called internationalized domain name (IDN).

Today, in theory, you can use any Unicode character you like as part of a domain name, including emojis. Whether that is wise is something else.

What does the standard says? Given a domain name, we should identify its labels. They are normally separated by dots (.) into labels: www.microsoft.com has three labels. But you may also use other Unicode characters as separators ( ., ., 。, 。). Each label is further processed. If it is all ASCII, then it is left as is. Otherwise, we must convert it to an ASCII code called “punycode” after doing the following according to RFC 3454:

  • Map characters (section 3 of RFC 3454),
  • Normalize (section 4 of RFC 3454),
  • Reject forbidden characters,
  • Optionally reject based on unassigned code points (section 7).

And then you get to the punycode algorithm. There are further conditions to be satisfied, such as the domain name in ASCII cannot exceed 255 bytes.

That’s quite a lot of work. The goal is to transcribe each Unicode domain name into an ASCII domain name. You would hope that it would be a well-defined algorithm: given a Unicode domain name, there should be a unique output.

Let us choose a common non-ASCII character, the letter ß, called Eszett. Let me create a link with this character:

What happens if you click on this link? The result depends on your browser. If you are using Microsoft Edge, Google Chrome or the Brave browser, you may end up at https://messagefactory.ca/. If you are using Safari or Firefox  you may end up at https://xn--meagefactory-m9a.ca. Of course, your results may vary depending on your exact system. Under ios (iPhone), I expect that the Safari behaviour will prevail irrespective of your browser.

Not what I expected.

Published by

2ca999bef9535950f5b84281a4dab006?s=56&d=mm&r=g

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ). View all posts by Daniel Lemire

Leave a Reply Cancel reply

Your email address will not be published. The comment form expects plain text. If you need to format your text, you can use HTML elements such strong, blockquote, cite, code and em. For formatting code as HTML automatically, I recommend tohtml.com.

Comment *

Name *

Email *

Website

Save my name, email, and website in this browser for the next time I comment.

Receive Email Notifications?

You may subscribe to this blog by email.

Post navigation


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK