Funycode issue

From Wikidna.org

  • At 16:36 11/05/2008, Vint Cerf wrote:

I think we should say nothing about display. John's focus is on whether and how to do the lookup.

I agree with what I understand his two positions to be:

1. just put the punycode string into the DNS query opaquely.

OR

2. do the conversion and handle as if the resulting Unicode had been submitted.

technical question:

if someone generates an arbitrary string of the form "xn-- <random sequence of lowercase a-z, 0-9 and hyphen>", does the algorithm ALWAYS produce a sequence of UNICODE code points? Note I did not say a PVALID set of code points or even ASSIGNED.

I am asking because I am wondering how a relatively simple-minded implementation might look from the UI perspective.

If we always get a sequence of code points regardless of the sequence of LDH, the simple-minded implementation could easily produce gibberish if attempting to invert to UNICODE a sequence of random LDH characters (confining the letters to lowercase).

Is the following correct:

let s be a random string of <lower case a-z, 0-9, hyphen> prefixed by "xn--"

let To UNICODE be a function that maps s into UNICODE

let To ASCII be a function that maps UNICODE into punycode

s is valid punycode If and Only If s = To ASCII ( To UNICODE (s) )

I hope I haven't mangled the question too badly.

v
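
The criterion in the question above can be tried out directly. What follows is only a minimal Python sketch, assuming that Python's built-in "punycode" codec (an implementation of RFC 3492) can stand in for the To UNICODE and To ASCII functions of the question. It applies no Nameprep step and no PVALID/ASSIGNED checks, and the function name roundtrips is made up for illustration.

```python
def roundtrips(s: str) -> bool:
    """Return True if s == ToASCII(ToUnicode(s)) for a label of the form xn--<ldh>."""
    prefix = "xn--"
    if not s.startswith(prefix):
        return False
    body = s[len(prefix):]
    try:
        # "To UNICODE": decode the part after the prefix.
        unicode_form = body.encode("ascii").decode("punycode")
        # "To ASCII": encode the decoded string back into Punycode.
        ascii_form = unicode_form.encode("punycode").decode("ascii")
    except UnicodeError:
        # The codec refused the input or the intermediate string.
        return False
    return ascii_form == body


print(roundtrips("xn--bcher-kva"))  # a well-formed label
print(roundtrips("xn--en32g"))      # a label the codec rejects
```

A label that passes this test is only known to be stable under the codec; whether its code points are acceptable is a separate question.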


  • At 17:13 11/05/2008, Cary Karp wrote:

I'm not sure if or how it weighs into the consideration of this question but on April 28th, the .SU TLD registry began accepting the registration of subdomain labels beginning with xn-- without requiring them to be valid output of the Punycode algorithm:

"A user or service provider can either design software used for decoding and decoding algorithm on his own or use PUNYCODE algorithm recommended by ICANN and published in RFC documents."

http://www.fid.su/english/?newsid=1207819620

One purpose of allowing non-IDNA-compliant alternatives appears to be to permit script mixing:

"In the process of generation of domain names with xn-- prefix using encoding algorithms mentioned in RFC documents registrators are not allowed to mix symbols of different national alphabets."


  • At 18:20 11/05/2008, Vint Cerf wrote:

If this is true, this is a very disappointing outcome: the .SU operators are certainly damaging our general efforts to make the Internet a less confusing place in which to operate. v


  • At 18:58 11/05/2008, Erik van der Poel wrote:

Hi Vint, I think it's fine for the protocol document to focus on lookup (and registration). But let's not forget that the bidi draft has a heavy focus on display issues (and that's OK).

> if someone generates an arbitrary string of the form "xn-- <random sequence of lowercase a-z, 0-9 and hyphen>", does the algorithm ALWAYS produce a sequence of UNICODE code points?

No. For example, xn--en32g would produce U+110000, which is outside the range of valid code points. (The highest code point is U+10FFFF.)

If an app receives such a punycode string, it should not attempt to display the corresponding Unicode (since it is invalid). I'm guessing that we can all agree on that.
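
The xn--en32g example can be reproduced with a decoder that returns raw integers instead of characters. The following is a minimal Python sketch of the decoding procedure of RFC 3492, using the parameter values given there; it assumes lowercase input only, omits the overflow checks of the RFC, and the function names are made up for illustration.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128


def digit_value(ch: str) -> int:
    """Map a-z to 0..25 and 0-9 to 26..35."""
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError(f"not a Punycode digit: {ch!r}")


def adapt(delta: int, numpoints: int, first: bool) -> int:
    """Bias adaptation function of RFC 3492."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)


def decode_code_points(body: str) -> list[int]:
    """Decode the part after "xn--" into integers, without validating them."""
    basic, _, extended = body.rpartition("-")
    output = [ord(c) for c in basic]
    n, i, bias = INITIAL_N, 0, INITIAL_BIAS
    pos = 0
    while pos < len(extended):
        oldi, w, k = i, 1, BASE
        while True:
            if pos >= len(extended):
                raise ValueError("truncated input")
            d = digit_value(extended[pos])
            pos += 1
            i += d * w
            t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
            if d < t:
                break
            w *= BASE - t
            k += BASE
        bias = adapt(i - oldi, len(output) + 1, oldi == 0)
        n += i // (len(output) + 1)
        i %= len(output) + 1
        output.insert(i, n)
        i += 1
    return output


print([hex(cp) for cp in decode_code_points("en32g")])  # ['0x110000']
```

Because the result is a list of integers, the caller still has to check each value against the upper limit U+10FFFF before treating it as a code point.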


  • At 19:28 11/05/2008, JFC Morfin wrote:

Dear Vint,

the only issues for a ccTLD Registry Manager (and any zone registry manager) are that (1) users get the names they want (pay for) which do not violate ccTLD (or zone) Manager's rules, and (2) ICANN does not interfere with national sovereignty (cf. WSIS) about ISO 3166 related language TLD strings. Everything else is in user applications and best practices realm and can/will be handled as the users think it is the best for them.

The only real issue of common interest, because it affects different applications and the Web (hence W3C, Unicode, IETF) is the display issue.

It would be interesting to try to register "xn--vint-cerf.xx" in each top level registry.

1. to see how many would accept it today.
2. as you know, this kind of A-Label name, whose U-Label can be registered as a TM, is dubbed a "Babel Name". It is a topic I have raised with the WIPO for years, and it creates a real IP problem.


  • At 21:10 11/05/2008, John C Klensin wrote:

--On Sunday, 11 May, 2008 17:13 +0200 Cary Karp <ck@nic.museum> wrote: "In the process of generation of domain names with xn-- prefix using encoding algorithms mentioned in RFC documents registrators are not allowed to mix symbols of different national alphabets."

Of course, there is no such requirement or restriction in either IDNA2003 or proposed for IDNA2008, nor are there strong guidelines that prohibit such registrations. Some serious confusion and/or FUD going on here.


  • At 22:17 11/05/2008, JFC Morfin wrote:

Dear John,
I do not know then how to read ICANN Guidelines:

1. Domain registries that implement internationalized domain name capabilities at any level, including their own top-level designations, will do so in strict compliance with the technical requirements described in RFCs 3454, 3490, 3491, and 3492 (collectively, the "IDN standards").
2. In implementing the IDN standards, domain registries will employ an "inclusion-based" approach (meaning that code points which are not explicitly permitted by the registry are prohibited) for identifying permissible sets of code points from among the full Unicode repertoire, as described below. A registry may not even by exception permit code points that are prohibited by the IDN standards.
3. In implementing the IDN standards, domain registries will associate each label in a registered internationalized domain name, as it appears in their registry, with a single script as defined by the block division of the Unicode code chart. A more specific association may be made by combining descriptors for both language and script. Alternatively, a label may be associated with a set of languages, or with more than one designator under the conditions described below.
3.1 A domain registry will publish the aggregate set of code points that it makes available in clearly identified IDN-specific character tables, and will define equivalent character variants if registration policies are established on their basis. Any such table will be designated in a manner that indicates the script(s) and/or language(s) it is intended to support.
3.2 All code points in a single label will be taken from the same script as determined by the Unicode Standard Annex #24: Script Names <http://www.unicode.org/reports/tr24>. Exceptions to this guideline are permissible for languages with established orthographies and conventions that require the commingled use of multiple scripts. Even in the case of this exception, visually confusable characters from different scripts will not be allowed to co-exist in a single set of permissible codepoints unless a corresponding policy and character table is clearly defined.

".su" is just favoring a WSIS people-centric vision of the Internet over the network-centric vision of the IETF. It seems that currently 40 ccTLDs out of more than 250 have signed an agreement with ICANN.

I do not think this creates any problem as long as users can filter out the code points they do not want to accept in their private environment?


  • At 02:35 12/05/2008, Vint Cerf wrote:

Jefsey, I think what John meant is that the RFCs did not impose the restriction, ICANN did. vint


  • At 03:50 12/05/2008, JFC Morfin wrote:

Sure. ".su" is a ccTLD. They do not refer to the IETF who refused to enter into an MoU with them over the matter (cf. Brian Carpenter's response to my invitation to attend the ccTLD Meeting in Luxembourg to investigate it - it was well timed with the NTIA principles). They refer to the "serious confusion and/or FUD" introduced by ICANN.

This means that the very disappointing outcome, damaging the WG-IDNABIS efforts to make the Internet a less confusing place in which to operate, comes from ICANN. This is why I would advocate a crash meeting in Paris at the end of June, bringing together ccTLDs, ICANN and the IETF WG-IDNABIS to settle a common road map for the implementation of IDNA. I know that this is very short notice but, again, I think that the Beijing Olympic Games will show the non-ASCII world how every Chinese athlete has a Chinese-name web site and mail. If there is no consensual road map by then, I am afraid it will be very difficult to prevent a non-coordinated ML-DNS deployment as a result. It will take some time after decisions are taken, so we might consider that such deployments could mushroom in Fall 2009. Not necessarily good timing for ICANN.

Up to you to decide.


  • At 05:57 12/05/2008, Patrik Fältström wrote:

In the implementations of IDNA I am writing (more for the DNS registration side than the lookup side) I am definitely doing multiple A-label => U-label => A-label iterations to ensure the string is "stable" and accepted regardless of whether the string starts as an A-label or a U-label. That is the easiest way to check (programmatically) that the string is "ok", instead of having different rules depending on how the label "enters the system".

I would be very very nervous if some document made it "impossible" to do such tests.
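
A stability test of this kind can be sketched in a few lines. The following minimal Python example is not Patrik's implementation; it assumes that Python's built-in "idna" codec, which implements the IDNA2003 ToASCII and ToUnicode operations, is an acceptable stand-in, and the function name stable_a_label and the number of rounds are made up for illustration.

```python
from typing import Optional


def stable_a_label(label: str, rounds: int = 2) -> Optional[str]:
    """Return the A-label if the label survives repeated A -> U -> A conversion.

    The input may be either an A-label or a U-label. None is returned when a
    conversion fails or when the A-label changes between iterations.
    """
    try:
        # Normalise whatever was entered to its A-label form first.
        a_label = label.encode("idna").decode("ascii")
        for _ in range(rounds):
            u_label = a_label.encode("ascii").decode("idna")
            again = u_label.encode("idna").decode("ascii")
            if again != a_label:
                return None
    except UnicodeError:
        # Raised by the codec for labels it cannot convert.
        return None
    return a_label


print(stable_a_label("b\u00fccher"))
print(stable_a_label("xn--bcher-kva"))
print(stable_a_label("xn--en32g"))
```

The same function is used whichever form the label has when it enters the system, which is the property the message above asks for.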


  • At 06:17 12/05/2008, James Seng wrote: (co Chair WG-IDNA - 2001-2003)

What Patrik proposed makes a lot of sense: some form of round-trip stability check should be performed.

I am in the "New compromise" camp, to be exact:

Client is permitted to look up any label that is already in Punycode, even if it has an unassigned code point encoded inside it, but the client SHOULD update itself or warn the user when an unassigned code point is encountered in a label that is not already in Punycode.

but I emphasize the word "SHOULD" because I don't think it is a "MUST" requirement on the app developer. The reason is that the app developer may provide warnings in the form of a warning pop-up, highlighting the dangerous A-URL, sending the user to a warning page before proceeding, or something we haven't thought of right now.


  • At 15:21 12/05/2008, Andrew Sullivan wrote:

What I like about the overall "internationalize LDH" approach is that it is conceptually simple. It gives us some pretty clear guidance on the cases that are problematic. It allows us to use a set of properties of scripts from some standard that does have the involvement of the anthropological and linguistic experts needed to make the kinds of judgement in question (even if everyone doesn't always agree with the results). We know where the "land mines" are in the DNS, so we can address those cases specifically, and just derive everything else.

If we begin to deviate from this mostly-derived path, then we set ourselves up as somehow knowing something about what "should" be allowed in the DNS, on grounds of utility ("these historic scripts are useless, so they should be left out"; "this particular historic script -- or code point -- even though categorized like the others, is different and needs to be allowed in"). Since different people will have different views on this utility, it opens us to endless discussions about what is in and out.


  • At 17:07 12/05/2008, John C Klensin wrote:

Jefsey, I think what John meant is that the RFCs did not impose the restriction, ICANN did. Vint

Yes. And, as has been pointed out many times, ICANN's restrictions apply to only a few handfuls of domains (if that many) out of many millions. For anyone else, their guidelines are a suggestion at most.

Jefsey also wrote... "I do not think this creates any problem as long as users can filter out the code-points they do not want to accept in their private environment?" I think we need to be very careful here. I agree with what you are saying as I understand it, but I also believe that there are ways of reading the above that would get us into trouble.

(further comments by JFC Morfin at 21:06 12/05/2008 are embedded to shorten the report)

Who is "us"? This sounds IETF network-centric. If we want to clearly understand one another, I recall that I am "people centric", as the WSIS consensus is. That is a step further than ISOC, which is "user centric", which still implies some dependence on the network. I think it is important not to waste time.

So...

Q: Would it be reasonable for a user to set up a sort of whitelist of domains to be accepted, with all others being rejected or producing warnings?

I spoke of code-points. It is as simple as this: why accept a URL I cannot print or read?

A: Yes. Whether it would be a good idea or not would depend on the user and usage patterns, but, if a user wanted to do it, I don't think we should try to interfere.

Moreover, you cannot prevent it. I am surprised that no one has yet discussed OPES in relation to IDNs. You are engaged in a complex project, caring about domain names, etc. I just load in seconds the DISALLOWED list I want.

Q: Would it be reasonable for a user to set up some sort of algorithm or collection of rules to effectively perform whitelist selection?

A: Sure. And if that algorithm includes rejecting IDNs in scripts that the user doesn't read, I don't see any problem with it as long as the user is aware that there is no necessary relationship between the character set / language/ script of the content reached through a domain name and the script of the domain name itself. To avoid getting tangled up in a different misunderstanding, it is important to remember that all standard-conforming domain names are based on Unicode, so there is no question of character set and that domain names do not have language bindings except heuristically and possibly at registration time.

There is no standard yet, nor will there be for a long time, but there may be local laws. There is an International Standard (which is a confusing term), ISO 10646. The only requirement is that the DNS receives LDH ASCII values. This is not nitpicking: your only chance to enforce what you propose is that every user, every country and every hacker fully adheres to it as the very best anyone can think of.

Q: Would it be reasonable for a user to set up a sort of whitelist of domains to be accepted, with all others being rejected or producing warnings?

A: Yes. Whether it would be a good idea or not would depend on the user and usage patterns, but, if a user wanted to do it, I don't think we should try to interfere. Note that this really has nothing to do with the script in which those domain names are written or even whether they are LDH or IDNs. I would also suppose that the idea would be much more useful on the basis of domain reputation than on the basis of lexical analysis, but, if the user is creating explicit lists, there is no need for anyone else to be concerned about the basis being used.

Q: Is it reasonable for someone else to set algorithmic or heuristic rules that let users see some domain names and not others?

A: First of all, this goes on today. We have reputation systems that filter out or create strong warnings about some domains based on prior bad uses (e.g., phishing, porn, spyware, or viruses) of those domains. We have filters that refuse to display U-labels based on whether or not the user has the relevant scripts enabled as part of language choices (e.g., IE) and other filters that make decisions about U-label display based on policies of the associated TLD (e.g., Firefox). Personally, I am much more comfortable with these sorts of actions if either (i) they generate warnings of one sort or another rather than rejecting (e.g., refusing to look up) the name and/or (ii) the user can override the choices. There is text in Rationale that is consistent with that view (of course, it could be changed if there were consensus to do so).

But it is an area in which, although I don't think standards should have much to say about how a user sees or handles names, we need to be somewhat careful. Ultimately, at the extremes, there are only two types of identifier systems. In one, identifiers are unique, unambiguous, and universal: one should, at least in principle, be able to reach the identified object or, if not, at least be assured that some other object will not turn up instead. At the other extreme, we have what might be called, with apologies to Lewis Carroll, a Humpty Dumpty naming system in which words mean whatever one cares to have them mean and identifiers do not exist except with regard to each particular interpretation system.

The DNS is clearly designed to be one of the former. IDNs should not change that. If two different people register the same label with an "xn--" prefix in different zones and do so with different assumptions about what U-label it will be mapped into (if it is mapped at all), then the DNS still works because the respective FQDNs are still unique. But IDNs essentially stop working because, for IDNs to be viable, the mappings, in both directions, between A-labels and U-labels must be consistent and predictable _and_, given the way the DNS is constructed, the mappings to be used must not depend on the zone (or DNS hierarchy) in which the label is embedded.

To obtain that, the process must be end to end at the network level, which IDNA does not intend to be: IDNA fakes a presentation layer at the user application level, but does not implement an Internet presentation layer.

I think there is a place for the latter, more local, type of identifier as well and that, in particular, one area in which the Internet is not fully mature yet is that of personal aliases in which a user can decide what to call a particular object and have that decision honored by relevant applications software and environments. That might include aliases that trigger selection lists of the "which of these did you really mean?" variety. But my personal aliases are normally useful to me and not to you. If we agree that they should be useful to you, we need to be running similar software, you need to know that a given personal alias belongs to me and not to you or someone else, you need to know where my alias databases are located and have access to them, and so on.

Confusion between what a user can filter and local decisions about how strings should be interpreted or mapped between A-labels and U-labels, or between universally-interpretable identifiers and personal (or local) aliases gets us into a lot of trouble, IMO.

Interesting discussion but not the point I raised. Because, if we analyse it,
  - you put yourself at the networked user application level to organise the way users could accept or not "domain names" and what that may imply.
  - I start at a character filter level to keep my screen/printer/machine tidy.
At the end of the day one comes back to the same problem. Without a presentation layer you can only be at the user application level. That can work as long as everyone agrees. But the problem here is that the target is to support a dynamic diversity.
This is why IMHO the target should be to document a Punycode-based mechanism that everyone can easily adapt/extend, so that it becomes a part of ISO 10646/Unicode and it is more convenient for everyone to use the same one. Like ASCII, TCP/IP. Like the DNS, etc. With _no_ built-in constraint, but with all the constraints easily loadable by the user if he wants them, i.e. not saying "this is the way it must work", but "if you want this, do it that way". A Multilingual Internet RFC 1958 extension.


  • At 17:15 12/05/2008, John C Klensin wrote:
On Monday, 12 May, 2008 05:57 +0200 Patrik Fältström <patrik@frobbit.se> wrote: > In the implementations of IDNA I am writing (more for the DNS registration side than the lookup side) I am definitely doing multiple A-label => U-label => A-label iterations to ensure the string is "stable" and accepted regardless of whether the string starts as an A-label or a U-label. That is the easiest way to check (programmatically) that the string is "ok", instead of having different rules depending on how the label "enters the system". I would be very very nervous if some document made it "impossible" to do such tests.

I would be very nervous if some document did not _require_ the equivalent of such tests. In other words, regardless of what is done on the lookup side, I believe that it is the responsibility of every registry -- every zone administrator-- to put only those labels into the DNS that are consistent with the standards applicable to those labels. For ASCII labels intended for normal (e.g., not SRV) use, that means the LDH rules must be followed. For IDNs, it means that anything that looks like an A-label must be an A-label and that the U-label form not contain unassigned characters, disallowed characters, or anything that violates contextual or bidi rules.

Of course, some zone administrator might choose to violate that standard. What will happen to them if they do so is not an IETF matter, except insofar as standards about lookup (and software that conforms to those standards) make the non-compliant names hard to find. From an IETF point of view, non-conformant systems are just outside the standard and it is pointless to try to make them conformant to some other model of conformance.

FWIW, although the usual qualifications about things being subject to change apply, the forthcoming version of Protocol more clearly reflects the above model.


  • At 17:22 12/05/2008, Erik van der Poel wrote:
On Sat, May 10, 2008 at 12:59 PM, John C Klensin <klensin@jck.com> wrote:
I think we get out into dangerous territory if we give more than general advice about display and I think some will argue that we should not do even that. But I don't see that as an issue in this case.
There will be an issue in getting the wording right (for which I will certainly need help). But I think that "MAY treat the putative A-label as opaque" rule can be written to give the implementation a choice between opaque or not. So, e.g.,
* If you decide to treat it as opaque, you look it up without inspecting its contents but don't, ever, convert it to a U-label.
* If you do decide to convert it to a U-label, then it isn't opaque, it must be valid as a U-label (and hence as an A-label). Obviously, if it contains DISALLOWED or UNASSIGNED characters, or even CONTEXT-required characters that don't follow whatever rules need to be followed for looking, then you need to treat it as invalid for lookup and tell the user whatever you tell the user under such circumstances.

Having re-read this proposal and thought about it some more, it appears that it does not really allow for the easier transitions that I have been talking about. If IDNA-aware clients are permitted to look up labels that are already in Punycode and contain unassigned or disallowed code points, then it would be somewhat easier to make transitions in the future, e.g. from unassigned to pvalid or from disallowed to pvalid. Such transitions would be easier because people that want to start using the newly pvalidated characters can use them in Punycode form and be assured that IDNA-aware clients will at least look them up, thereby providing for a minimal functionality.

If I may argue the "other side" of this, one reason that we don't want to allow clients to look up labels with disallowed characters is to deter zone operators from registering such labels. One might also argue that Unicode 5.1 is really quite mature now, and that future assignments will be less and less interesting from the IDN perspective, thereby making easy transitions less important. If the yes/no/maybe discussion for historic scripts makes good progress, we may also end up having fewer reasons to move characters from disallowed to other categories.

In order to convince implementors that our rules are worth following, I believe we will need clear and convincing enough reasons to put such things as symbols in the disallowed set. Otherwise, implementors might decide not to follow our rules for some scripts or character types, and we would end up with less interoperability.

So, I'm going to try to find time to take another good look at the long(!) drafts to see if our reasoning is convincing enough. It might also be good to simplify and shorten the drafts somehow. Implementors are less likely to read and understand our reasons if they are buried in long and hard-to-understand documents. This is not to say that John hasn't been doing a great job -- on the contrary, this is very hard work because of all of the complexities, and these drafts tend to grow "organically", sometimes without removing parts that need to be removed.

One paragraph, in particular, probably needs attention:

5. Domain Name Resolution (Lookup) Protocol
Resolution is conceptually different from registration and different tests are applied on the client. Although some validity checks are necessary to avoid serious problems with the protocol (see Section 5.4 ff.), the resolution-side tests are more permissive and rely heavily on the assumption that names that are present in the DNS are valid. Among other things, this distinction, applied carefully, facilitates expansion of the permitted character lists to include new scripts and accommodate new versions of Unicode without introducing ambiguity into domain name processing.

Some might interpret this to mean that it is OK to look up labels with unassigned characters. Or am I misunderstanding this?


  • At 18:07 12/05/2008, Tina Dam wrote:

Vint, this is true. ICANN staff have recommended to the .SU operators not to open for IDNs in this matter and also offered to assist or recommend other experts to assist in deploying IDN.SU in a secure manner, just FYI. Tina


  • At 18:15 12/05/2008, Tina Dam wrote:

At the ICANN meeting in Paris, at the end of June, we already have an item on the agenda scheduled to provide information about the new protocol proposal, the differences from the current protocol and related topics. We are working on arranging speakers and on specifying the agenda; details will be posted, and all presentations will be made publicly available.

You can see other IDN related sessions on the agenda for the entire meeting at: http://par.icann.org/en/schedule


  • At 18:30 12/05/2008, Shawn Steele wrote:
For example, xn--en32g would produce U+110000, which is outside the range of valid code points. (The highest code point is U+10FFFF.)
If an app receives such a punycode string, it should not attempt to display the corresponding Unicode (since it is invalid). I'm guessing that we can all agree on that.

Well, it does indicate that *some* validation of the resulting Unicode string is necessary. What happens if there's a U+0020 or U+0007 embedded in it?

Note that on the client side it would be required to convert and display the Unicode string if lookup actually succeeds. xn--asdfasdf isn't acceptable from the "we want our users to know what they're seeing" crowd.

If the client is required to display a successfully resolved string, then there doesn't seem to be much point in disallowing smiley face at this (client) level, since anything with a smiley that resolves would be displayed. That would put the disallowed character tests at the registration level.

I expected some disagreement with my assertion that some protocols/users will require the Unicode form, so therefore the benefit of looking up Punycode is limited to some specific scenarios, probably leading to inconsistent experiences with "new" names.


  • At 18:59 12/05/2008, JFC Morfin wrote:

Tina, the meeting I propose is not for you to give us information. But for all of us to reach a consensus where there is none.


  • At 19:15 12/05/2008, JFC Morfin wrote:
At 17:22 12/05/2008, Erik van der Poel wrote: Some might interpret this to mean that it is OK to look up labels with unassigned characters. Or am I misunderstanding this?

Erik, if you want applications not to look up labels with unassigned characters, I am afraid you will have to convince their developers, for centuries to come, that whatever reason they may have for doing so is a bad idea.

Am I misunderstanding something basic? Is not the whole idea that we do not change anything in the DNS and that everything is at the application level? Yet we want to impose constraints and entropy at a pre-DNS level, and we are surprised because people want to bypass the constraints and dismayed because the information has degraded.

IMHO, as long as there is no end-to-end necessity to respect the IDNA constraints, wherever that necessity is located, doing as ".su" does will be the most sensible thing for a ccTLD Manager (unless it is protected by a local law).


  • At 20:22 12/05/2008, JFC Morfin wrote:
At 18:07 12/05/2008, Tina Dam wrote: Vint, this is true. ICANN staff have recommended to the .SU operators not to open for IDNs in this matter and also offered to assist or recommend other experts to assist in deploying IDN.SU in a secure manner, just FYI.

Tina, the problem is that your recommendation is no more technically secure from a registry point of view, as there is no ICANN/WIPO-warranted conversion tool, but it is legally insecure. A Registry sells a registration and nameserver support for ASCII labels. This is its job as defined by the RFCs and ICP-1, with a WIPO/ICANN UDRP. What the registrant does with the names is not its cup of tea. It never knows.

Why should second-level Registry Managers, alone among the whole Internet, consider the use made of the domain names they sell, for one single user application among millions, and take technical and IP responsibility for it when this has not even been legally worked out by WIPO? Moreover, phishing is carried out at the third level and above, using IE7 and Firefox, with any existing DN.

There is a Consultation going on in France by the Government over the management of the French ccTLD Registries as Public Services. One of the france@large answers, backed by several lawyers, is that .fr and the other French TLDs behave like .su as long as there is no State control and warranty. We will see what the answer will be. The Registry cannot be legally responsible for what the users may do, all the more so if it operates by State delegation. The ".gr" Registry Manager has already explained what they do. Likewise, IE7 and Firefox have described the different ways in which they support IDNs; is the Registry Manager to be legally responsible?

Now let us suppose there is an extended Punycode algorithm. Common sense suggests that modern symbols like famous logos will be of interest on mobiles, and their code points for sale. This logocode business can be very fruitful, or regulated by law. This is one of the reasons why I prefer ISO 10646 to Unicode as the same reference table. Are we discussing archaic symbols, or modern logos and music tunes?


  • At 21:23 12/05/2008, John C Klensin wrote:
--On Monday, 12 May, 2008 09:30 -0700 Shawn Steele wrote: Well, it does indicate that *some* validation of the resulting Unicode string is necessary. What happens if there's a U+0020 or U+0007 embedded in it?

While I would prefer some validation if a putative A-label is presented to the application (but continue to believe that should be an implementation choice), I don't think this changes the answer. If the application gets a string that starts with "xn--", it MAY just ignore IDNA and look the thing up and there is no requirement that it ever convert it to a native character ("Unicode") form (or to try to do so). If, on the other hand, it intends to make the conversion at some point, I'd think it would be lots better to try to make it before the lookup. Doing so effectively invokes IDNA and, if the string contains bad characters (U+0020 or U+0007 as well as DISALLOWED or UNASSIGNED ones), then the user should get an error message and the string not be looked up.
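
The two paths described in the paragraph above can be sketched as follows. This is only a minimal Python illustration, not text from any draft: it assumes Python's built-in "punycode" codec for the conversion, uses the Unicode general categories Cn, Cc, Zs and So as a rough, hypothetical stand-in for the UNASSIGNED and DISALLOWED tables (which are not reproduced here), and the names InvalidLabel and label_for_lookup are made up.

```python
import unicodedata


class InvalidLabel(ValueError):
    """Raised when a label that was converted turns out to be unusable."""


# Hypothetical stand-in for the UNASSIGNED/DISALLOWED tables:
# Cn = unassigned, Cc = control (U+0007), Zs = space (U+0020), So = symbols.
REJECTED_CATEGORIES = {"Cn", "Cc", "Zs", "So"}


def label_for_lookup(label: str, opaque: bool) -> str:
    """Return the string to put into the DNS query, or raise InvalidLabel."""
    if not label.startswith("xn--") or opaque:
        # Opaque path: look the label up as is and never convert it.
        return label
    try:
        u_label = label[4:].encode("ascii").decode("punycode")
    except UnicodeError as exc:
        raise InvalidLabel(f"{label}: not decodable") from exc
    for ch in u_label:
        category = unicodedata.category(ch)
        if category in REJECTED_CATEGORIES:
            raise InvalidLabel(f"{label}: U+{ord(ch):04X} ({category}) not allowed")
    return label


print(label_for_lookup("xn--en32g", opaque=True))       # looked up without inspection
print(label_for_lookup("xn--bcher-kva", opaque=False))  # converted and checked first
```

With opaque=False the check happens before the lookup, so a bad label produces an error for the user rather than a query.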

As I've said before, I don't see much choice. If a U-label is presented to a non-IDNA-aware application that does any checking at all, it will be rejected as a syntax error. If it is not so rejected, it presumably won't be found on lookup since a non-IDNA-aware application would not be able to perform the U-label to A-label conversion. However, if an A-label is presented to that application, it is certainly going to be treated as a conventional LDH label and looked up. Since it is a conventional LDH label, no one is going to try to convert it to a native character string.
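The behaviour of a non-IDNA-aware application described here can be illustrated with a plain LDH syntax check. This is only a sketch: the helper name `is_ldh_label` is hypothetical, and the pattern assumes the conventional hostname-label rules (letters, digits, hyphen, 1 to 63 characters, no leading or trailing hyphen).

```python
import re

# Letters, digits, hyphen; 1-63 characters; no leading or trailing hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_ldh_label(label):
    """True if the label is a conventional LDH hostname label."""
    return LDH_LABEL.fullmatch(label) is not None

print(is_ldh_label("xn--bcher-kva"))  # A-label: passes, so it is simply looked up
print(is_ldh_label("bücher"))         # U-label: fails the syntax check
```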

Note that on the client side it would be required to convert and display the Unicode string if lookup actually succeeds. xn--asdfasdf isn't acceptable from the "we want our users to know what they're seeing" crowd.
If the client is required to display a successfully resolved string, then there doesn't seem to be much point in disallowing smiley face at this (client) level, since anything with a smiley that resolves would be displayed. That would put the disallowed character tests at the registration level.

There is no way to require a client to do that. We know that experimentally, since IDNA2003 contains exactly that requirement (although not stated that way). To say that requirement has been widely ignored would be incorrect-- it has been deliberately and systematically violated in the interest of protecting users. We also know what clients do now. For an example with which you are presumably familiar, IE7 would presumably convert the punycode form to a Unicode one, scan it, and, upon discovering an unassigned character or smiley face, would compare those code points to the code points associated with the languages the user had configured and then decide to display the punycode form (unless it permits smiley faces regardless of the languages configured). As I understand it, the decision was that, even if 'xn--asdfasdf isn't acceptable from the "we want our users to know what they're seeing" crowd', neither a native-character string in a user-unknown script nor a row of little boxes are acceptable to the "we have to offer reasonable protection to our users" crowd.
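The display decision sketched in this paragraph (convert, scan, compare against the user's configured languages, fall back to the punycode form) might look roughly like the following. This is not IE7's actual algorithm. Python's standard library has no script property, so the first word of the Unicode character name is used here as a crude stand-in for the script; `script_hint`, `display_form` and the `user_scripts` parameter are hypothetical names.

```python
import unicodedata

def script_hint(ch):
    """Crude script proxy (assumption): first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def display_form(a_label, user_scripts):
    """Return the Unicode form if every character fits user_scripts, else the A-label."""
    if not a_label.lower().startswith("xn--"):
        return a_label
    try:
        u_label = a_label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return a_label                    # undecodable: keep the punycode form
    for ch in u_label:
        if ch.isascii() and (ch.isdigit() or ch == "-"):
            continue                      # treat ASCII digits and hyphen as script-neutral
        if script_hint(ch) not in user_scripts:
            return a_label                # unknown script or unnamed code point
    return u_label

print(display_form("xn--bcher-kva", {"LATIN"}))
print(display_form("xn--bcher-kva", {"CYRILLIC"}))
```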

I expected some disagreement with my assertion that some protocols/users will require the Unicode form, so therefore the benefit of looking up punycode is limited to some specific scenarios, probably leading to inconsistent experiences with "new" names.

I'm not sure I understand this. Applications that have not been upgraded to understand IDNA are going to look those punycode forms up because they don't know any better _and_ because those forms are perfectly good LDH labels. That was a major part of the IDNA design and some other alternatives were sacrificed to get it. And, of course, some protocols and users will "require the Unicode form". Regardless of what may be found embedded in files or referrals of some sort, I'd certainly encourage a user interface designer to consider whether direct typing of the A-label form should be associated with warnings and/or "expert user" configuration.


  • At 21:34 12/05/2008, Shawn Steele wrote:

That is a technical part of IDN. As a practical matter I expect the Unicode form, not the A form, on a business card. Specifically, imagine that I go to the library and they have the latest updates. I click on a link (which happens to be in a-form) and find a web site useful to a project I'm working on at work. I then write the "URL" from the address bar down on a scrap of paper and take it to work. I try to email this to my coworkers but discover the URL I wrote down doesn't work because my work environment hasn't received the latest updates yet.

As the owner of such a domain name I need users to have reliable access to the name. If I can't use the Unicode form on a business card or a display ad because the adoption rate is too low, then the A form is useless to me in a link since I can't even get the core scenarios to work.

So I'm suggesting that the A-form is interesting for back-compat (all those xn-- domains registered before IDN, which the IDN RFCs effectively made illegal anyway), but not for forward-compat.


  • At 21:40 12/05/2008, Shawn Steele wrote:

I wasn't very clear. IE is "training" users that xn-- forms are suspect in the display and shouldn't be trusted. If IE was trying to display an xn-- name that was in the user's script with "new" characters, then IE would try to display it in Unicode, if it could resolve the name (which it can't for new characters). If the intent is for "new" characters to be used without being discriminated against, then conversion to Unicode is necessary.

FWIW: In practice we're discussing the difference between Unicode 3.2 and Unicode 5 characters. IE "knows" the scripts and character properties of the newer characters, so its phishing logic will still behave as expected. It doesn't know how to handle IDN labels though that are greater than Unicode 3.2. Were IDN more current, then the disparity wouldn't be very visible. If the Unicode set of IDN and IE were similar, then only "new" characters would fall into this bucket (like Unicode 6 or whatever), and then IE would start failing the phishing stuff since it couldn't find Unicode data and script data for the Unicode 6 code points, regardless of how it handled IDN names.
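The gap between Unicode 3.2 and later versions can be made visible with Python's `unicodedata` module, which ships a frozen Unicode 3.2 database (`unicodedata.ucd_3_2_0`) alongside the current one. The sketch below lists the characters of a label that are assigned in the running interpreter's Unicode data but were not yet assigned in 3.2; the function names are illustrative, and U+1E9E (added after Unicode 3.2) is used only as a sample.

```python
import unicodedata

def assigned_in_3_2(ch):
    """True if the code point was already assigned in Unicode 3.2."""
    return unicodedata.ucd_3_2_0.category(ch) != "Cn"

def newer_than_3_2(label):
    """Code points assigned in this interpreter's Unicode data but not in Unicode 3.2."""
    return [f"U+{ord(c):04X}" for c in label
            if not assigned_in_3_2(c) and unicodedata.category(c) != "Cn"]

print(newer_than_3_2("stra\u1e9ee"))   # contains LATIN CAPITAL LETTER SHARP S
print(newer_than_3_2("bücher"))
```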


  • At 23:30 12/05/2008, John C Klensin wrote:
On Friday, 09 May, 2008 09:16 -0700 Paul Hoffman <phoffman@imc.org> wrote:
The IETF has a long history of wrestling with the question of interoperability, agreement among the parties who care, enforcement, resistance, and updates. Different people in the IETF (we are people, not a monolithic organization) have different views about the best way to write a standard to get the greatest interoperability. Some of those differences of opinion are very relevant in the current discussion.
My view for IDNAbis (and IDNA2003) is that if the standard does not say that software that does a lookup needs to follow the rules for allowed and disallowed characters, meaningful interoperability will not be possible. Without lookup validation, registries will make mistakes (or possibly intentional mistakes) and then leave them in place with impunity. The reason we have had such good interoperability in IDNA2003 is the fact that most/all major software refuses to get to names that go against the standard.
This view makes updating the standard harder due to the already-discussed issues of delay in software updates. However, those issues are just as present if the standard does not require checking for lookup but does require checking for display. Programs that display IDNs need to be updated as often as those checking them before sending out queries. Thus, the standard does not gain any flexibility for updates unless both sides are made optional. A standard that only proposed limits on what could be registered would lead to near-zero interoperability due to the high number of mistakes that would be made.


  • At 00:53 13/05/2008, Mark Davis wrote:

Vint,
In answer to your question, I wrote a quick and dirty test of taking random "xn--something" codes (where something is from [- a-z 0-9]+) and seeing what the percents would be. Here are the results.

For the lengths up to 4, I do an exhaustive test; above that it is a random sampling.

The percentages are not what one would expect from a random sampling of strings (eg < 10% of possible Unicode code points are assigned LMN), I suspect because PunyCode would favor locality of deltas.

Key:

  • illegal_punycode - means that converting to unicode and back has an error
  • unassigned - has at least one unassigned character
  • non_LMN - has at least one non Letter/Mark/Number
  • non_folded - has at least one non NFKC or non CaseFolded
  • all_ascii - is all ASCII
  • otherwise_ok - everything else

Hope this is useful, Mark

length: 1

   97.297%    :    illegal_punycode    (36)
   02.703%    :    non_LMN             (1)

length: 2

   92.909%    :    illegal_punycode    (1,271)
   04.386%    :    non_LMN             (60)
   02.705%    :    all_ascii           (37)

length: 3

   48.121%    :    otherwise_ok        (24,357)
   30.725%    :    illegal_punycode    (15,552)
   11.109%    :    non_LMN             (5,623)
   04.645%    :    unassigned          (2,351)
   02.705%    :    all_ascii           (1,369)
   02.695%    :    non_folded          (1,364)

length: 4

   46.001%    :    illegal_punycode    (861,512)
   23.924%    :    otherwise_ok        (448,047)
   16.716%    :    unassigned          (313,059)
   08.354%    :    non_LMN             (156,447)
   02.705%    :    all_ascii           (50,653)
   02.300%    :    non_folded          (43,074)

length: 5

   46.694%    :    illegal_punycode    (29,064)
   31.959%    :    otherwise_ok        (19,892)
   09.399%    :    non_LMN             (5,850)
   06.754%    :    unassigned          (4,204)
   02.696%    :    all_ascii           (1,678)
   02.498%    :    non_folded          (1,555)

length: 6

   47.881%    :    illegal_punycode    (29,946)
   25.041%    :    otherwise_ok        (15,661)
   13.676%    :    unassigned          (8,553)
   08.294%    :    non_LMN             (5,187)
   02.723%    :    all_ascii           (1,703)
   02.386%    :    non_folded          (1,492)

length: 7

   49.545%    :    illegal_punycode    (30,959)
   26.269%    :    otherwise_ok        (16,415)
   10.865%    :    unassigned          (6,789)
   08.253%    :    non_LMN             (5,157)
   02.713%    :    all_ascii           (1,695)
   02.356%    :    non_folded          (1,472)

length: 8

   50.707%    :    illegal_punycode    (31,414)
   22.831%    :    otherwise_ok        (14,144)
   13.564%    :    unassigned          (8,403)
   07.871%    :    non_LMN             (4,876)
   02.741%    :    all_ascii           (1,698)
   02.287%    :    non_folded          (1,417)

length: 9

   50.912%    :    illegal_punycode    (31,940)
   24.295%    :    otherwise_ok        (15,242)
   12.015%    :    unassigned          (7,538)
   07.693%    :    non_LMN             (4,826)
   02.766%    :    all_ascii           (1,735)
   02.319%    :    non_folded          (1,455)

length: 10

   51.009%    :    illegal_punycode    (31,760)
   22.071%    :    otherwise_ok        (13,742)
   14.408%    :    unassigned          (8,971)
   07.512%    :    non_LMN             (4,677)
   02.676%    :    all_ascii           (1,666)
   02.324%    :    non_folded          (1,447)

length: 11

   52.386%    :    illegal_punycode    (32,852)
   21.739%    :    otherwise_ok        (13,633)
   13.479%    :    unassigned          (8,453)
   07.289%    :    non_LMN             (4,571)
   02.751%    :    all_ascii           (1,725)
   02.355%    :    non_folded          (1,477)

length: 12

   52.535%    :    illegal_punycode    (32,787)
   20.875%    :    otherwise_ok        (13,028)
   14.156%    :    unassigned          (8,835)
   07.188%    :    non_LMN             (4,486)
   02.697%    :    all_ascii           (1,683)
   02.549%    :    non_folded          (1,591)

length: 13

   52.923%    :    illegal_punycode    (33,188)
   20.389%    :    otherwise_ok        (12,786)
   14.432%    :    unassigned          (9,050)
   07.005%    :    non_LMN             (4,393)
   02.682%    :    all_ascii           (1,682)
   02.569%    :    non_folded          (1,611)

length: 14

   53.417%    :    illegal_punycode    (33,263)
   19.515%    :    otherwise_ok        (12,152)
   14.837%    :    unassigned          (9,239)
   06.803%    :    non_LMN             (4,236)
   02.730%    :    non_folded          (1,700)
   02.698%    :    all_ascii           (1,680)

length: 15

   54.071%    :    illegal_punycode    (33,796)
   18.905%    :    otherwise_ok        (11,816)
   14.684%    :    unassigned          (9,178)
   06.712%    :    non_LMN             (4,195)
   02.909%    :    non_folded          (1,818)
   02.720%    :    all_ascii           (1,700)

length: 16

   53.885%    :    illegal_punycode    (33,741)
   18.150%    :    otherwise_ok        (11,365)
   15.368%    :    unassigned          (9,623)
   06.703%    :    non_LMN             (4,197)
   03.145%    :    non_folded          (1,969)
   02.750%    :    all_ascii           (1,722)

length: 17

   54.281%    :    illegal_punycode    (33,862)
   18.016%    :    otherwise_ok        (11,239)
   15.265%    :    unassigned          (9,523)
   06.526%    :    non_LMN             (4,071)
   03.302%    :    non_folded          (2,060)
   02.610%    :    all_ascii           (1,628)

length: 18

   54.470%    :    illegal_punycode    (34,239)
   17.489%    :    otherwise_ok        (10,993)
   15.230%    :    unassigned          (9,573)
   06.615%    :    non_LMN             (4,158)
   03.513%    :    non_folded          (2,208)
   02.684%    :    all_ascii           (1,687)

length: 19

   54.406%    :    illegal_punycode    (34,149)
   17.162%    :    otherwise_ok        (10,772)
   15.237%    :    unassigned          (9,564)
   06.741%    :    non_LMN             (4,231)
   03.763%    :    non_folded          (2,362)
   02.691%    :    all_ascii           (1,689)

length: 20

   54.940%    :    illegal_punycode    (34,364)
   16.606%    :    otherwise_ok        (10,387)
   15.089%    :    unassigned          (9,438)
   06.732%    :    non_LMN             (4,211)
   03.895%    :    non_folded          (2,436)
   02.737%    :    all_ascii           (1,712)
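
The classification behind these tables can be approximated with a short Python sketch. It is not Mark's test program: it relies on Python's own `punycode` codec and the Unicode version bundled with the interpreter, the order in which the buckets are checked is an assumption, and `non_folded` is approximated as "changes under case folding followed by NFKC". The percentages it produces will therefore differ from the ones listed above.

```python
import random
import unicodedata
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

def classify(payload):
    """Classify the part after "xn--" into one of the buckets from the key."""
    try:
        text = payload.encode("ascii").decode("punycode")
        # Round trip: decoding and re-encoding must give back the input.
        if text.encode("punycode").decode("ascii") != payload:
            return "illegal_punycode"
    except UnicodeError:
        return "illegal_punycode"
    if text.isascii():
        return "all_ascii"
    cats = [unicodedata.category(c) for c in text]
    if "Cn" in cats:
        return "unassigned"
    if any(cat[0] not in "LMN" for cat in cats):
        return "non_LMN"
    if unicodedata.normalize("NFKC", text.casefold()) != text:
        return "non_folded"
    return "otherwise_ok"

def sample(length, trials, seed=0):
    """Count the buckets for `trials` random payloads of the given length."""
    rng = random.Random(seed)
    return Counter(classify("".join(rng.choices(ALPHABET, k=length)))
                   for _ in range(trials))

print(sample(5, 1000))
```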


  • At 02:39 13/05/2008, John C Klensin wrote:

Mark, I, at least found this very helpful and even more interesting. A few comments below...

The percentages are not what one would expect from a random sampling of strings (eg < 10% of possible Unicode code points are assigned LMN), I suspect because PunyCode would favor locality of deltas.

Certainly it does favor locality, so that is a reasonable hypothesis.

length: 1

   97.297%    :    illegal_punycode    (36)
   02.703%    :    non_LMN             (1)

length: 2

   92.909%    :    illegal_punycode    (1,271)
   04.386%    :    non_LMN             (60)
   02.705%    :    all_ascii           (37)

The nature of the IDNA and punycode beasts essentially makes one-character strings impossible (my guess is that the one non_LMN character you found is in the 0..7F range) and two unlikely. It is good to have that impression confirmed.

length: 3

   48.121%    :    otherwise_ok        (24,357)
   30.725%    :    illegal_punycode    (15,552)
   11.109%    :    non_LMN             (5,623)
   04.645%    :    unassigned          (2,351)
   02.705%    :    all_ascii           (1,369)
   02.695%    :    non_folded          (1,364)

length: 4

   46.001%    :    illegal_punycode    (861,512)
   23.924%    :    otherwise_ok        (448,047)
   16.716%    :    unassigned          (313,059)
   08.354%    :    non_LMN             (156,447)
   02.705%    :    all_ascii           (50,653)
   02.300%    :    non_folded          (43,074)

...

Part of what is interesting here (and in the rest of your results) is that, if we ignore the cases that might be shifted from DISALLOWED to PVALID as a vanishingly small percentage and ignore strings requiring bidi and/or contextual tests (as your experiment did), the number of strings that could ever be valid (valid now ("otherwise_ok") or unassigned now) is around 52% at length three and significantly under half (pretty constantly between 30 and 40%) for all lengths longer than that.
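The figures in this paragraph can be re-derived from Mark's tables by adding the "otherwise_ok" and "unassigned" percentages for each length. The short Python check below copies those two columns; the sums come out at about 52.8% for length three, 40.6% for length four, and between roughly 31.7% and 38.7% for lengths five to twenty.

```python
# (otherwise_ok %, unassigned %) per label length, copied from Mark's tables.
RESULTS = {
    3: (48.121, 4.645),   4: (23.924, 16.716),  5: (31.959, 6.754),
    6: (25.041, 13.676),  7: (26.269, 10.865),  8: (22.831, 13.564),
    9: (24.295, 12.015),  10: (22.071, 14.408), 11: (21.739, 13.479),
    12: (20.875, 14.156), 13: (20.389, 14.432), 14: (19.515, 14.837),
    15: (18.905, 14.684), 16: (18.150, 15.368), 17: (18.016, 15.265),
    18: (17.489, 15.230), 19: (17.162, 15.237), 20: (16.606, 15.089),
}

# Share of random strings that are valid now or could become valid later.
could_ever_be_valid = {n: round(ok + un, 3) for n, (ok, un) in RESULTS.items()}

for n, pct in could_ever_be_valid.items():
    print(f"length {n:2d}: {pct:.3f}%")
```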

So it is actually reasonable to infer that, if people start generating strings at random and building putative A-labels from them, most of the results will be invalid for one reason or another, independent of future versions of Unicode. If I were designing an application that was doing lookup, that, plus the display concerns, would almost certainly induce me to try to convert the string to U-label form and test it before trying to look it up. I still don't think we should try to require that behavior, but...


  • At 06:26 13/05/2008, Martin Duerst wrote:
At 09:39 08/05/13, John C Klensin wrote: So it is actually reasonable to infer that, if people start generating strings at random and building putative A-labels from them, most of the results will be invalid for one reason or another, independent of future versions of Unicode. If I were designing an application that was doing lookup, that, plus the display concerns, would almost certainly induce me to try to convert the string to U-label form and test it before trying to look it up. I still don't think we should try to require that behavior, but...

Just to check: Are you saying that, because with random strings the chance that lookup is successful (for reasonably long strings) is rather low, you would convert to a U-label and check to avoid unnecessary lookups?

Surely except for very special tests, the input to your application wouldn't be random strings, or would they?

As an aside, are we expecting that applications will produce different warnings for illegal (punycode) strings and for failed lookups? My expectation would be that for applications with a general audience (e.g. browsers,...), there should be a generic message (something like "site not reachable, please: - check spelling,...").


  • At 06:41 13/05/2008, Martin Duerst wrote:
At 00:07 08/05/13, John C Klensin wrote: I think there is a place for the latter, more local, type of identifier as well and that,

Yes, of course. In everybody's head, in bookmark files, in Web pages, and so on.

in particular, one area in which the Internet is not fully mature yet is that of personal aliases in which a user can decide what to call a particular object and have that decision honored by relevant applications software and environments.

I'd not 'blame' that on the Internet. It is essentially a local function, not a network function. The mechanisms to make it work cross-application are already available: On each OS, there is functionality to register applications as resolvers for particular URIs; this mechanism can easily be used e.g. with a my: URI scheme or something similar. There are also facilities such as SGML/XML catalogs (see e.g. http://www.oasis-open.org/committees/entity/spec-2001-08-06.html) for URI mapping, and it's also easily possible to implement the desired functionality in an HTTP proxy or with a simple CGI. A friend of mine had such a thing, he simply called it "do the right thing".

That might include aliases that trigger selection lists of the "which of these did you really mean?" variety.

Of course. Technically easy, HTTP already has "multiple choice", and if you don't want to use that, just put the choices on a Web page.
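A local alias mechanism of the kind discussed here can be sketched in a few lines of Python. Everything in the example is hypothetical: the alias table, the `example.org`/`example.net` targets and the `resolve_alias` helper are placeholders; only the "my:" scheme and the use of the HTTP "multiple choices" status come from the messages above.

```python
from http import HTTPStatus

# Hypothetical personal alias table, keyed by "my:" URIs.
ALIASES = {
    "my:idna-list": ["http://example.org/idna/archive"],
    "my:punycode": ["http://example.org/rfc3492",
                    "http://example.net/punycode-notes"],
}

def resolve_alias(uri, aliases=ALIASES):
    """Map a personal alias to (HTTP status, targets), as a local proxy might."""
    targets = aliases.get(uri, [])
    if not targets:
        return HTTPStatus.NOT_FOUND, []
    if len(targets) == 1:
        return HTTPStatus.FOUND, targets           # single target: plain redirect
    return HTTPStatus.MULTIPLE_CHOICES, targets    # several: "which did you mean?"

print(resolve_alias("my:punycode"))
```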

But my personal aliases are normally useful to me and not to you. If we agree that they should be useful to you, we need to be running similar software, you need to know that a given personal alias belongs to me and not to you or someone else, you need to know where my alias databases are located and have access to them, and so on.

There is quite some research on shared bookmarks and similar stuff.

The reason that all this stuff above is not widely used is that sharing of personal identifiers is very hard, not technically but content-wise. I wouldn't want to share my personal identifiers with just anybody, and I don't think there will be many people whose personal identifiers I could make use of, either in bulk (all or most of the identifiers used by a person) or only individually (single identifiers).

Apart from that, many people identify many things not by an explicit (textual) label, but by other means, such as "halfway down on the right side of that Web page" or so.


  • At 06:59 13/05/2008, JFC Morfin wrote:

Thank you Mark for this important computation.

At 02:39 13/05/2008, John C Klensin wrote: So it is actually reasonable to infer that, if people start generating strings at random and building putative A-labels from them, most of the results will be invalid for one reason or another, independent of future versions of Unicode. If I were designing an application that was doing lookup, that, plus the display concerns, would almost certainly induce me to try to convert the string to U-label form and test it before trying to look it up. I still don't think we should try to require that behavior, but...

This also means that there is an impressive pool of funycodes, i.e. xn-- domain names that do not conflict with IDNs, which can be used, for example, to fool users or for other purposes. I am sure there are "xn--Famous-TM" labels among them. I suppose this calls for an urgent discussion with WIPO in order to protect TM owners from the legal impact, as an "xn--Famous-TM" funycode will display in ASCII everywhere. This may lead WIPO to consider "x--n" instead of "xn--", as "x--nFamous-TM" is somewhat less conflicting.

This certainly makes Louis Bleriot's suggestion of a funycode filter, which could be implemented as an anti-virus option, interesting. I suggest that WIPO and the IPC be invited to the Paris review meeting I proposed.


  • At 14:48 15/05/2008, jefsey wrote:

Dear all, As far as I understand it, the punyspace (punycoded namespace) is formed by all the labels with an "xn--" header. There are at least three levels of tests that can be carried out on it before resolving domain names that include labels from the punyspace.

1. punycode validity test.

   Roughly half of the punycode space can resolve as valid Unicode strings, the other half being False Unicode or funycode.

2. the non-funycode namespace can be split into :

   - disallowed code-points identified for their capacity to confuse users or the applications.
   - possible code-points

3. the possible code-points can be split into:

   - permitted code-points
   - non-permitted code-points.

For each split a filtering is to occur.

In (1) the split is against the punycode process. It would therefore be a positive point if there were a very strict document, completed by an IETF (or Unicode, or ICANN) guaranteed program that everyone could use for legal checking of the nature of an "xn--" label. The resulting funycodes are public domain and anyone can build private extensions through Sunycodes (specialised punycode-like processes).

In (2) the test is against an ISO 10646 based list documented by the IETF (that Unicode or other organization may wish to extend).

In (3) the test is against a user defined/chosen list or process. The IETF should specify the possible formats and answers of such processes, and the application standardisers (W3C for the Web, IETF for SMTP, etc.) should determine the resulting application behaviour.
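
The three levels of tests listed above can be sketched as a small filter pipeline. This is only an illustration of the proposed structure: the `DISALLOWED` and `USER_PERMITTED` sets are hypothetical stand-ins for the IETF-documented list of level 2 and the user-chosen list of level 3, and level 1 is approximated with a round trip through Python's `punycode` codec.

```python
# Hypothetical stand-ins for the real lists.
DISALLOWED = {"\u263a"}                                  # e.g. WHITE SMILING FACE
USER_PERMITTED = set("abcdefghijklmnopqrstuvwxyz0123456789-üéè")

def filter_label(label):
    """Return the first level (1, 2 or 3) that rejects the label, or 0 if it passes."""
    payload = label[4:]                                  # strip the "xn--" header
    try:
        # Level 1: punycode validity (decode, re-encode, compare).
        text = payload.encode("ascii").decode("punycode")
        if text.encode("punycode").decode("ascii") != payload:
            return 1
    except UnicodeError:
        return 1
    if any(ch in DISALLOWED for ch in text):             # level 2: disallowed code points
        return 2
    if any(ch not in USER_PERMITTED for ch in text):     # level 3: user-permitted list
        return 3
    return 0

print(filter_label("xn--bcher-kva"))
```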

This should result in an international document co-signed by WIPO, protecting zone managers from any legal responsibility in the use of a domain name, such responsibility lying with the registrant. This agreement should also include an international clause protecting TM owners from their current obligation to protect their TM in "xn--" headed domain names. Another clause should consider the case of TM symbols entered as code-points in any semiotic table.

The documentation produced for the user level filtering should make sure it does not lead to conflicts with possibilities permitted by hosts.txt (aliases and aliases table).

Please correct me if I am wrong before I use this summary to explain the issue on http://wikidna.org which is used by some as a focal (sometimes bilingual) reminder.


  • At 16:00 15/05/2008, Vint Cerf wrote:

Jefsey, I would recommend against any attempt to sanction any non-standard uses of punycode. This particular representation should be applied only to validly derived punycode from permitted Unicode strings. V


  • At 16:12 15/05/2008, Gervase Markham wrote:

What would be the advantage in doing such experiments or building such extensions within the xn-- prefix, as opposed to using another prefix?


  • At 16:43 15/05/2008, JFC Morfin wrote:

Not our cup of tea. Since Anaxagoras we have known that nature hates emptiness. My architectural rule nr.1 is "if something can be developed, is fun, makes money, or can hurt someone, it will be used", which matches the only principle of RFC 1958 (apart from that principle, everything may change).

It is better to acknowledge the funyspace and reserve it for private use (it will most probably be used anyway) than to leave it to hackers, because private sunycodes will react, while punycode alone will be predictable.

Why? Because I reasonably expect a tremendous extension of graphemes with semiotic support and multilingualisation at the semantic strata. IDNA is not to be developed for the current ASCII internationalised Internet, but for the Multilingual Semantic Internet, and I will not bet on the ISO 10646 size in 2020.

Your suggestion to use another prefix could be considered. I only suggest that using the same prefix for the current and future solution is in line with RFC 1958 guidelines. Also, using this empty space for something hackers (or existing commercial projects) will also need, is a way to protect it as stable for everyone.


  • At 22:17 15/05/2008, Frank Ellermann wrote:
Andrew Sullivan wrote: I am not convinced that this working group, or indeed the IETF in general, really has the broad participation of relevant anthropological and linguistic experts to make this kind of judgement. So I don't think we should make it.

If Ken tells us that "Lo" vs. "So" means less than we might hope for the purposes of IDNAbis wrt archaic scripts I assume that he knows what he is talking about. If all else fails I admire the one PVALID Phaistos Disc symbol, folks would need the rest of the Phaistos Disc to do anything in U-labels with it - unless they are up to no good with corresponding A-labels.

IMO you *must* make judgement calls, it is the duty of this WG, and I hope that folks like Michael won't let you get this wrong.

It allows us to use a set of properties of scripts from some standard that does have the involvement of the anthropological and linguistic experts needed to make the kinds of judgement in question (even if everyone doesn't always agree with the results).

Mark and Ken said that they don't populate the list of archaic scripts with a random generator, using it should be okay.

Erik van der Poel wrote: For example, xn--en32g would produce U+110000, which is outside the range of valid code points. (The highest code point is U+10FFFF.) If an app receives such a punycode string, it should not attempt to display the corresponding Unicode (since it is invalid). I'm guessing that we can all agree on that.

Definitely. My UTF-32BE to UTF-8 encoder failed miserably for 0xFFFFFFFF, I forgot to check "negative" (non-) code points. So much for "assume valid input".
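
Erik's example can be reproduced with a minimal sketch of the decoding procedure from RFC 3492. Unlike a production decoder, `raw_decode` below deliberately returns bare integers without any range check (and without the overflow checks the RFC asks for), so that out-of-range results become visible; the function names are illustrative.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
MAX_CODE_POINT = 0x10FFFF

def adapt(delta, numpoints, first):
    """Bias adaptation function from RFC 3492."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)

def digit_value(ch):
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError(f"not a punycode digit: {ch!r}")

def raw_decode(payload):
    """Decode the part after "xn--" into a list of integers, unchecked."""
    b = payload.rfind("-")
    basic, ext = (payload[:b], payload[b + 1:]) if b > 0 else ("", payload)
    out = [ord(c) for c in basic]
    n, i, bias, pos = INITIAL_N, 0, INITIAL_BIAS, 0
    while pos < len(ext):
        oldi, w, k = i, 1, BASE
        while True:
            if pos >= len(ext):
                raise ValueError("truncated input")
            d = digit_value(ext[pos])
            pos += 1
            i += d * w
            t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
            if d < t:
                break
            w *= BASE - t
            k += BASE
        bias = adapt(i - oldi, len(out) + 1, oldi == 0)
        n += i // (len(out) + 1)
        i %= len(out) + 1
        out.insert(i, n)
        i += 1
    return out

points = raw_decode("en32g")
print([hex(p) for p in points], all(p <= MAX_CODE_POINT for p in points))
```

For "en32g" the single decoded value is 0x110000, one past the highest code point, which is exactly the case a displaying application has to reject.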

Shawn Steele wrote: I surf for one of the web-based ones when I need to convert.

+1. Used to a Unicode-ignorant platform, it even took me some time to figure out how to produce an unassigned UTF-8 code for a funny "I <unassigned black telephone dingbat> Unicode" title of a blog entry in reply to Mark's 5.1 article.

Some Wordpad Alt-X magic finally did the trick - using an NCR produced a feed validator warning, and warnings are bad news.

Consider also the IMA/EAI UTF-8 effort. Clients that get a pretty UTF-8 name aren't going to be able to process that name if they can't do the punycode conversion to do the DNS query because they're on the wrong version of IDN. Sure, they can (hopefully) use the fallback address, but I suspect that'll have a human readable name, not a punycode string.

AFAIK that affects only the RHS (domain part), for the LHS (local part) knowing what the UTF-8 means is not essential: The MSA can be (again) a "smart host" wrt EAI, and clients aren't forced to support punycode if they can handle UTF-8. The same goes for say whois servers vs. whois clients.


  • At 23:36 15/05/2008, Shawn Steele wrote:

That's kind of what I meant. The EAI specifies UTF-8 for the domain part, not punycode, so a mail server can't deliver the mail unless it knows how to do the Punycode conversion (or its DNS API does it for it). Therefore requiring that unknown A-names resolve might enable a URL, but mail will still break. If I own the domain, that won't help me then.
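
The conversion step a mail server needs before it can query the DNS can be sketched as follows. This is an assumption-laden illustration: it uses Python's built-in `idna` codec, which implements IDNA2003 (RFC 3490) rather than IDNAbis, and the helper name `dns_query_name` and the sample address are hypothetical.

```python
def dns_query_name(address):
    """Return the ASCII (A-label) domain to query for an address with a UTF-8 domain part."""
    # Split at the last "@"; only the domain part is converted.
    _local_part, _, domain = address.rpartition("@")
    # Assumption: Python's built-in "idna" codec (IDNA2003).
    return domain.encode("idna").decode("ascii")

print(dns_query_name("user@bücher.example"))
```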


  • At 01:59 16/05/2008, JFC Morfin wrote:
At 16:00 15/05/2008, Vint Cerf wrote: Jefsey, I would recommend against any attempt to sanction any non-standard uses of punycode. This particular representation should be applied only to validly derived punycode from permitted Unicode strings. V

Vint,
unfortunately I did not choose IDNA or punycode. Yet, this is what we have, sanctioned by RFCs and all the work already done. SHOULDs are not technically binding. IMHO we MUST make punycode misuse more complex than it is possibly rewarding. This calls for some thinking: a warning in the security section cannot be enough when some TM owners discover that they (at random) have to spend much, much more to protect their name than others, due to a non-end-to-end Internet core process.

The ".su" case (which is not unique as a TLD, and is the general SLD case) shows that the motivation is that users can use extended algorithms, what I name "sunycode" for super punycode. Why would someone do that? Greed, fun, hack. We had alt-roots; we cannot afford alt-idns.

Can we entrust the success of IDNA to the ways in which IE and Firefox may differently disrespect the RFCs in order to best protect users? Which TLD Manager is going to guarantee and support names when he does not know whether they will really work in real life, depending on releases, browsers, and nameservers (as some nameserver software will most probably decide to "enforce" the standard, as firewalls most probably will too)?

We have a problem. IMHO each of its aspects is to be clearly identified and delimited. Then answers must be found together with all the stakeholders. If we want to meet the Nov 08 target, or to have an international consensus to postpone it, I really think this open round must be organised at the ICANN Paris meeting.


  • At 12:20 16/05/2008, JFC Morfin wrote:

Dear Colleagues,

Vint Cerf raised a pertinent question concerning the IDNs in the IDNA context (Internationalized Domain Names for Applications) as the Chair of the IETF/WG-IDNABIS. The resulting responses most probably call for a common review.

I spent some time presenting them in a single formatted document at http://wikidna.org/index.php?title=Funycode_issue.

In it I suggest a consultation meeting on the issue on the occasion of the World Internet Week de Paris (EGENI, ICANN, ISOC, VoxInternet, MLTF, france@large, GIGANET, etc. meetings) at the end of June. As you may know, I have been circulating for a week a Draft for a question to Vint Cerf concerning the WG-IDNABIS target. This actually is complementary to Vint's own question; I will therefore issue it today after receiving the last comments. All this may be a unique occasion to fully understand, discuss and settle the Multilingual Internet issue this year, permitting consistent progress on IRIs, semantic addressing and other naming/documentation application needs.


  • At 13:10 16/05/2008, Nigel Roberts wrote:

Can I have a one paragraph summary, translated from gobbledygook into English, please? PS. And which ccTLD are you representing?


  • At 13:25 16/05/2008, laura.garcia@arenotech.org wrote:

Could you also tell us this in French or in another European language?
Could you tell us this in Spanish or in another European language?
Thank you


  • At 13:13 16/05/2008, MLTF IDN Mailing List wrote:

I am sorry,

1) English is a European language. If you have any doubt about it, my Franglish certainly is European.

2) One of the major problems we have to become aware of (cf. the responses at the URL given) is that the IETF is not ISO, and so Internet technology is conceived, documented and operated purely in English. Moreover, it is not its vocation to have at its disposal the linguists, ethnologists, etc. that these subjects would need. This is the source of the majority of its problems.

The problem is not that it is in English, but that it is steered by a single language when dealing with multilingualism: this leads to seeing all other languages through that one language. This is where the concept of "globalization" (mondialisation) was born. It is good that, at the very moment when the whole online cultural exception, and therefore probably the cultural exception everywhere, is in practice being played out, the French realize the strategic, economic, cultural and development stakes. To speak is to think. It is therefore normal that languages influence ways of thinking, and vice versa. Not using at least English, French, and Japanese or Chinese simultaneously in the global standardization process seems very presumptuous, given the three fundamental and complementary modes of thought to which these languages most naturally lead.
