High-level changes from IDNA2003 in the "current work"

From Wikidna.org


At 00:02 07/03/2008, Paul Hoffman wrote: "It would be useful for those coming in late to have a summary of what changes are embodied in the current set of documents. Here's my first take on such a list. If people like this format, it could be used as the beginning of an outline for the BoF/WG." This summary is based upon the various comments he received.


Methodology

Some of these goals are possibly stated a little more narrowly than is intended. The present debate may help, but it is strongly suggested that people read both the drafts and RFC 4690 before the IETF BoF meeting on March 12th (to discuss a WG-IDNABIS).

Rationale of IDNA 2003 update

John Klensin: My conclusions are: (1) Looking up unregistered code points is untenable because it makes moving to future versions of Unicode impossible. That conclusion is already reflected in IDNA200X, but IDNA2003 requires such lookups.

Mark Davis: I disagree. While I'm willing to live with John, Harald, and Patrik's decision to disallow the resolution with unassigned characters -- just so we can get this thing out the door -- we should not be basing any other decisions on thinking that it is "untenable".
Consider a character X that was unassigned in Unicode 5.1, but assigned in Unicode 6.0, and see what happens. Let's suppose that a U5.1 client sends out "aXc.com" ("a" and "c" are some particular strings, not the literal U+0061 and U+0063). Before the registry upgrades to U6.0, it will fail, as expected -- it wasn't (and couldn't have been) registered.
So let's look at the case where the registry has upgraded to U6.0. There are a small number of cases, and I don't see that *any* of them cause a problem.
Cases:
1. X is illegal according to IDNA200X rules under U6.0. The registry can't register it, so it won't work. Not a problem.
2. X is legal and unaffected by normalization. This is true of the vast majority of characters. Then if the registry adds "aXc.com", then the old client will work, as expected. Not a problem -- in fact, a positive benefit.
3. X is legal but affected by normalization -- but not in the context of "a...c". This is true of the vast majority of those few characters remaining from case #2. Then if the registry adds "aXc.com", then the old resolver will work. Not a problem -- in fact, a positive benefit.
4. X is legal, and affected by normalization, in the context of "a...c". For example, suppose that string a ends with a non-spacing mark that reorders with X in NFC. In that case, "aXc.com" would not be legal, and could not be registered. So even in this rare case, not a problem.
John, if you think this situation is untenable, which of the above cases causes a problem, and exactly what would that problem be?
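Case 4 above can be checked mechanically. The following Python sketch is only an illustration: the helper name `is_nfc_stable` and the sample string are assumptions, not anything taken from the drafts, and it uses two long-assigned combining marks as stand-ins for "a non-spacing mark followed by X". It relies on the standard `unicodedata` module to show that a label whose combining marks are not in canonical order changes under NFC, which is the reason such a label could not be registered.

```python
import unicodedata

def is_nfc_stable(label: str) -> bool:
    """Return True if NFC normalization leaves the label unchanged."""
    return unicodedata.normalize("NFC", label) == label

# Hypothetical stand-in for a string ending in a non-spacing mark, followed by X:
# U+0301 COMBINING ACUTE ACCENT (class 230), then U+0323 COMBINING DOT BELOW (class 220).
unstable = "a\u0301\u0323"
classes = [unicodedata.combining(ch) for ch in unstable]

print(classes)                  # canonical combining classes of each character
print(is_nfc_stable(unstable))  # the marks are reordered by NFC, so not stable
print(is_nfc_stable("abc"))     # plain ASCII letters are unaffected
```

A registry applying an "NFC-stable labels only" rule would reject the first string, so an old client sending it would simply fail to resolve, as case 4 describes.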

(2) When we do things that are untenable, the odds of implementations simply ignoring us and doing what they consider The Right Thing are very high. This particular example with MSIE7 is, I think, the best one so far, but we have others. The much more restrictive rules of IDNA200X are intended, long-term, to give us more latitude because it would be possible, if painful, to change a prohibited (Disallowed) code point to be Protocol-Valid without thereby creating an ambiguity in coding of labels "before" and "after".

(3) The IDNA200X proposals make the assumption that there is a collection of cases that violate at least one of

(i) the clear intent of IDNA2003
(ii) the clear intent of either the cautionary notes in the IESG Statement about IDNs, the general guidance of the ICANN Guidelines, or both
(iii) Good sense, including thoughtful application of the robustness principle and moving outside IDNA2003 where its provisions cannot be reasonably implemented in practice without causing other problems.

Some of these cases represent legitimate, albeit possibly misguided, uses. Others represent defensive registrations, and still others represent behavior that is either deliberately excessively cute or malicious. They are very hard to tell apart, even by walking the DNS tree, inspecting the records of various registries, or inspecting a large corpus of documents.

We believe that these cases are sufficiently problematic that the right course of action is to minimize the degree to which they can arise in the future and then to work out the transition or compatibility questions for existing registrations and applications. In some cases, that will imply that registries will either need to adopt variant techniques or prohibit some registrations that the standard will permit in order to avoid ambiguities. In others, we may end up having to take the position that the registrations are simply not going to work any more (if it were even the case that they worked reliably now). That "so sorry... you pushed the boundaries of the rules and you now lose" scenario is going to be a difficult one, but the alternatives are to invalidate a few existing domains versus going the "next prefix" route and invalidating everything.

If we want viable IDNs -- usable as part of high-quality identifiers and able to support the whole range of applications that depend on, and infrastructure built on top of, the DNS -- we need to get this right and to do so now. And that may require cleaning out a certain amount of cruft rather than saying "cruft now and cruft forever" (which is clearly one of the alternatives).

Charset

Update base character set from Unicode 3.2 to Unicode 5.0 or 5.1

John Klensin: "Actually, the goal is to make the revised standard Unicode version-agnostic. Getting to 5.0 or 5.1 is a consequence of that approach. The underlying issue is discussed in Sections 3 and 5.2 of RFC 4690 (although, if those sections were written today, I believe they might differ in some details)."

Mark Davis: John's statement is the one that is in the current working drafts, and in my opinion the correct strategy. We should not be in a position where the RFC needs to be rev'ed with each new version of Unicode.

Paul Hoffman: We're not. There were multiple Unicode versions between 3.2 and 5.1 and no change in IDNA. We choose when and if we make changes to IDNA based on many factors. Becoming Unicode version-agnostic is one choice; tying ourselves to a version knowing that we might want to update later is another.

Mark Davis: And the fact that there were multiple versions of Unicode with no update of IDNA was a significant problem for anyone who needed the new characters. Reformulating the RFC so that updates to a new Unicode version don't require a new RFC is a significant advantage.

John Klensin: My personal view -- and I want to stress that it is just a personal view -- is that a decision to upgrade to a specific version of Unicode only would be equivalent to rejecting the general approach in the IDNA200X documents. In addition, based on experience with IDNA2003 implementations using libraries that do not report Unicode versions, I do not believe that a decision to simply switch the hard binding from one Unicode version to another is plausible. Part of that issue was discussed at length in RFC 4690 which, while not formally an IETF consensus document, should not be a surprise to anyone on this list. I believe that the "unassigned code point lookup" issue, the prohibition on display of punycode-encoded strings, and the idea of a strong tie to a particular version of Unicode all constitute known technical deficiencies in RFC 3490 and friends, in the sense that phrase is used in RFC 2026.

Paul Hoffman: [is this WG chartered to untie IDNA from specific versions of Unicode using algorithms that define validity based on Unicode properties,] We need to clear this up in the charter depending on how the group wants to go.

John Klensin: For the reasons outlined above, in RFC 4690, and in the existing documents, I do not personally believe that there is any real "clearing up in the charter" for any reason but clearing up the ambiguity that you have pointed out. The assumption of Unicode version independence is strongly enough embedded in the IDNA200X documents that, if the group doesn't "want to go" that way, I do not believe that the existing documents are usable as a base and one would need to think about an entirely different charter and set of benchmarks.
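The practical effect of binding to one Unicode version can be seen with Python's standard `unicodedata` module, which happens to ship a frozen Unicode 3.2 database (`unicodedata.ucd_3_2_0`) next to its current one. This is only a sketch: the helper `is_assigned` is a hypothetical name, and the sample character was chosen simply because it was added in Unicode 5.1.

```python
import unicodedata

def is_assigned(ch: str, db=unicodedata) -> bool:
    """True if the given Unicode database knows the code point (category is not Cn)."""
    return db.category(ch) != "Cn"

ch = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S, added in Unicode 5.1

print(unicodedata.ucd_3_2_0.unidata_version)   # the frozen 3.2 database
print(is_assigned(ch, unicodedata.ucd_3_2_0))  # unknown to a 3.2-bound implementation
print(unicodedata.unidata_version)             # whatever version the local Python ships
print(is_assigned(ch))                         # known to any database at 5.1 or later
```

An implementation tied to the 3.2 tables sees the character as unassigned no matter how many later Unicode versions have been published, which is the situation the version-agnostic approach is meant to avoid.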

Disallow most symbol characters

John Klensin: "Change the way that the protocol specifies which characters are allowed in labels"

Change the way that the protocol specifies which characters are allowed in labels

Change the way that the protocol specifies which characters are allowed in labels from "humans decide what the table of codepoints contains" to "decisions about codepoints are based on Unicode properties plus a small exclusion list created by humans".

John Klensin: I'm not sure I know how to express this better, but note that what the issue is about may seem to change a great deal based on how one states it. While I've had some serious disagreements with some active participants in the development of Unicode, I have never had reason to believe that any member of the Unicode Technical Committee (UTC) is non-human. Given that the Unicode properties are created by humans, the distinction that the statement above seems to make is not really a distinction.

It seems to me that there are two underlying questions. One is about where the locus of decision-making lies and the units in which decisions are made. I think we still agree that the IETF is not the right place to try to construct a consensus table on a character-by-character basis. The other is a question of how the conclusions are expressed, a question described in the proposed documents as whether the tables are normative or the rules that generate them are. That question is somewhat intertwined with (a), above: if the standard is going to be Unicode-version-agnostic, then having the IETF adopt a new standard and new tables for each new version of Unicode (or to try to freeze things and not move forward... again see RFC 4690 for a discussion of this) is pretty much a contradiction.
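To make the "rules that generate the tables" idea concrete, here is a deliberately simplified Python sketch. It is not the algorithm from the IDNA200X drafts: the set of accepted general categories, the contents of `EXCEPTIONS`, and the function name `classify` are all assumptions made for illustration. The point is only that the result is computed from Unicode properties in whatever database is installed, with a short hand-maintained list layered on top.

```python
import unicodedata

# Hypothetical hand-maintained list (the "small exclusion list created by humans").
EXCEPTIONS = {
    "-": "PVALID",  # hyphen is punctuation by category but is allowed in labels
}

# Assumed rule: lowercase or uncased letters, combining marks, and decimal digits.
VALID_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

def classify(ch: str) -> str:
    """Derive a status for one code point from its Unicode general category."""
    if ch in EXCEPTIONS:
        return EXCEPTIONS[ch]
    category = unicodedata.category(ch)
    if category == "Cn":
        return "UNASSIGNED"
    return "PVALID" if category in VALID_CATEGORIES else "DISALLOWED"

for ch in ("a", "A", "7", "-", "\u2665"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), classify(ch))
```

Under these assumed rules the heart symbol U+2665 and uppercase letters come out as DISALLOWED, and no table needs to be republished when a new Unicode version adds characters.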

James Seng: Wouldn't that give rise to inconsistency of results as implementations vary?

John Klensin: Partially because the prohibition in 3490 on display of punycode didn't work (some browser vendors now view display of punycode in varying circumstances as a feature) ...

Gervase Markham: "A feature" is a bit strong; "the least worst thing to do" is probably better. When displaying the decoded version would lead to risk to the user, and displaying nothing is not an option, the punycode version at least has the significant advantages that it's a) always readable and b) unique to that domain.

Ideally, Firefox would never display punycode in normal browsing, because all registries would have homograph policies and would have registered for the whitelist. I admit it does continue to surprise me that various large gTLDs have not attempted to register. But I guess that's their business decision.

John Klensin (continuing): ... and because various implementations are already doing some local mapping in URIs and IRIs (i.e., mapping that is not specified in IDNA2003), that inconsistency already exists. It is aggravated by a user-level inconsistency for both end users and registrants: other than computers and a few patient experts, no one actually understands what characters are, and are not, accepted by IDNA and there is continuing confusion about what "registration" means and whether one can register a string that cannot come out of ToUnicode(ToASCII(string)). Different registries have different policies on the latter. A few registrars have concluded that IDNs are so confusing and problem-prone that they don't want to touch them. Those comments are certainly anecdotal rather than definitive and certainly do not represent checks all the way down the DNS tree, but they are, I believe, symptomatic of a broader problem.

In addition, glyphs in fonts seem to be available more often for base characters than for compatibility ones. While applications can sometimes tell which scripts are supported locally, determining which particular characters are present within fonts is much harder. In general, the application can only assume that the characters are there and try to display them (typically resulting in question marks or little boxes if the characters are not available, but sometimes in mapping to other, similar, characters). Or the application must assume that display is not possible and display the punycode-encoded string instead (see above).

The lack of full support for compatibility and other mapped-out characters should not be surprising, but display as little boxes or question marks is the worst possible case, since information is actually lost and copy-and-paste is even less likely to work than usual.

We need to do something to clean that up (or, of course, we could decide we like the warts). The approach taken in the proposed documents is that we need to minimize variation in domain names in interchange and -- whether you look at it as part of the goal or as an effect -- make the equivalents of the ToASCII and ToUnicode operations reversible. We believe that will significantly reduce confusion and significantly improve interoperability.
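The lack of reversibility in IDNA2003 is easy to observe with Python's built-in `idna` codec, which implements RFC 3490 including the nameprep mappings. The helper names `round_trip` and `is_reversible` below are illustrative assumptions; the codec calls themselves are standard.

```python
def round_trip(label: str) -> str:
    """ToUnicode(ToASCII(label)) as performed by Python's IDNA2003 codec."""
    return label.encode("idna").decode("idna")

def is_reversible(label: str) -> bool:
    """True if the label survives the round trip unchanged."""
    return round_trip(label) == label

# "Bücher" (capital B), "straße" (sharp s), and "bücher" (already lowercase)
for label in ("B\u00fccher", "stra\u00dfe", "b\u00fccher"):
    print(label, "->", label.encode("idna"), "->", round_trip(label), is_reversible(label))
```

The first two labels come back changed because case folding and the sharp-s mapping happen inside the protocol. Only the third, which contains no mapped characters, is reversible, and that is the property the proposed documents want every valid label to have.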

Once the mappings are removed from IDNA, there are several possible approaches:

(i) Do mapping externally to the IDNA protocol set, possibly using a standardized model that preserves complete compatibility with IDNA2003. This at least gets us clarity about what can be registered and what goes onto the model. It also gives us the potential for different rules for different protocol contexts, which might be either an advantage or a disadvantage.
(ii) Do as little mapping as possible except in contexts where backward-compatibility is more important than cleaning things up. Given the comments above and earlier discussions on this list and in draft-klensin-idnabis-issues, this might be the best approach going forward, and probably would have been the best approach if we were starting from a clean slate. Whether it is wise or not today depends on what we think about the importance of IDNs that are in active use now using characters that map out versus the much larger number of IDNs that may exist in the future.
(iii) View mapping among Unicode characters as a completely local matter, just as we have always viewed mapping into Unicode from local character sets and codings. This requires strong "it better not leak if you expect it to resolve" constraints (which we have today in different form), but is consistent with the knowledge that some local mappings are inevitable as application implementations attempt to compensate for perceived inadequacies, vis-a-vis their script or writing systems, in either IDNA or Unicode.

There are probably some hybrid possibilities as well, but the proposed documents are, deliberately, completely agnostic on this subject (draft-klensin-idnabis-issues may not be quite clear about that in the current version -- I've learned a bit more about how to explain it in the few weeks since the most recent version was posted).

Bidi

  • Allowing typical words and names in languages such as Dhivehi and Yiddish to be expressed, which was probably not even considered since it would have been easy to fix the first time around.

Restrict bidi domain names so that their display is not surprising, whether they be isolated or be embedded in a paragraph of text.

  • Make bidirectional domain names (delimited strings of labels, not just labels standing on their own) display in a non-surprising fashion
  • Make bidirectional domain names in a paragraph display in a non-surprising fashion

This was actually part of the original intent of the IDNA2003 bidi restrictions -- it's just that the actual rules didn't encompass that intent correctly -- nostra culpa. Much more work has been done in refining the proposed bidi restrictions in the new bidi document to make them actually satisfy that original intent, both in terms of allowing labels that shouldn't have been disallowed, and disallowing labels that should not have been allowed. We abandoned them fairly quickly when we couldn't find a sensible way of doing them. It is a good thing that the topic is re-opened because they are laudable goals; we now need to see if we can get there.
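The Dhivehi problem mentioned above follows from the IDNA2003 rule inherited from RFC 3454, section 6: a label containing any right-to-left (RandALCat) character must not contain left-to-right (LCat) characters, and must both begin and end with a RandALCat character. The sketch below applies that rule with the standard `stringprep` tables. The function name `idna2003_bidi_ok` and the sample strings are assumptions for illustration, and the code says nothing about what the new bidi document requires.

```python
import stringprep

def idna2003_bidi_ok(label: str) -> bool:
    """Apply requirements 2 and 3 of RFC 3454 section 6 to a single label."""
    if not any(stringprep.in_table_d1(ch) for ch in label):
        return True   # the rule only constrains labels with right-to-left characters
    if any(stringprep.in_table_d2(ch) for ch in label):
        return False  # mixes RandALCat and LCat characters
    # First and last characters must themselves be RandALCat.
    return stringprep.in_table_d1(label[0]) and stringprep.in_table_d1(label[-1])

hebrew = "\u05e9\u05dc\u05d5\u05dd"              # Hebrew letters only, ends in a letter
thaana = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"  # Thaana text ending in a vowel sign

print(idna2003_bidi_ok(hebrew))
print(idna2003_bidi_ok(thaana))
```

The Thaana string is rejected only because its last character is a combining vowel sign (a non-spacing mark) rather than a letter, which is how ordinary words in such scripts ended up disallowed.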

Architecture

  • Remove the mapping and normalization steps from the protocol and have them instead done by the applications themselves, possibly in a local fashion, before invoking the protocol.

New concept of characters

(h) Introduce the new concept of characters that can be used only in specific contexts.

John Klensin: This is driven primarily by the need to permit the use of previously-prohibited (for the obvious specific case, mapped-to-nothing) joining characters where they are necessary to preserve information in scripts because of Unicode character shaping and presentation rules. By way of explanation, rather than trying to go into the details, this is a particular problem for word formation and use with several scripts used in the Indian Subcontinent and nearby areas. For those scripts, it is claimed that mapping the zero-width joiners and non-joiners to nothing loses too much information to be appropriate (as well as creating some serious problems with "meaning" when ToUnicode(ToASCII(string)) is applied). There are some separate questions (still being discussed) about the use of the same or similar characters as virtual word-separators in other scripts.
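The "mapped-to-nothing" behaviour can be reproduced with the standard `stringprep` module, whose table B.1 is the list of characters that IDNA2003's nameprep deletes. This is a sketch only: `map_to_nothing` is a hypothetical helper covering just that one mapping step, and the Devanagari sequence is an arbitrary example of a joiner placed after a virama.

```python
import stringprep

def map_to_nothing(label: str) -> str:
    """Drop every character listed in RFC 3454 table B.1 (first step of nameprep)."""
    return "".join(ch for ch in label if not stringprep.in_table_b1(ch))

ZWJ, ZWNJ = "\u200d", "\u200c"

with_joiner = "\u0915\u094d" + ZWJ + "\u0937"  # DEVANAGARI KA, VIRAMA, ZWJ, SSA
without_joiner = "\u0915\u094d\u0937"          # the same letters with no joiner

print(stringprep.in_table_b1(ZWJ), stringprep.in_table_b1(ZWNJ))
print(map_to_nothing(with_joiner) == without_joiner)
```

After the mapping the two spellings are indistinguishable, so whatever the joiner was meant to express about shaping is lost before the label ever reaches the DNS.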

Registration vs. Lookup

(i) Explicitly separate the definitions for the "registration" and "lookup" activities and introduce explicit rules for validation of IDN strings before DNS lookup.

John Klensin: IDNA2003, whether explicitly intended or not, has the effect of putting almost all responsibility for conformance on the registration process. If something can be registered, even in violation of the standard, it will be looked up. It appears that some registries have deliberately violated the registration rules to make things consistent with their beliefs about correct local conventions. Some of these violations are harmless except that, for some of them, some applications will find the names on lookup and others will not. Others could create significant risks. The proposed documents attempt to identify the latter cases and prohibit the strings even being looked up in the DNS, creating an "even if you register that, the lookups will almost always fail" situation for rogue registries.
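A lookup-side check of the kind described above might look like the following Python sketch. Everything here is an assumption made for illustration: the exception class `LookupRejected`, the function `check_before_lookup`, and the three tests it applies (NFC form, no code points unassigned in the local Unicode data, no symbols) are stand-ins, not the validation rules of the proposed documents. The sketch only shows where such a check sits, namely before any DNS query is issued.

```python
import unicodedata

class LookupRejected(ValueError):
    """Raised when a label must not be sent to the DNS at all (hypothetical)."""

def check_before_lookup(label: str) -> str:
    """Return the label if it passes the assumed checks, otherwise raise."""
    if unicodedata.normalize("NFC", label) != label:
        raise LookupRejected("label is not in NFC")
    for ch in label:
        category = unicodedata.category(ch)
        if category == "Cn":
            raise LookupRejected(f"U+{ord(ch):04X} is unassigned in the local Unicode data")
        if category.startswith("S"):
            raise LookupRejected(f"U+{ord(ch):04X} is a symbol")
    return label

for candidate in ("b\u00fccher", "i\u2665ny"):
    try:
        print("would look up:", check_before_lookup(candidate))
    except LookupRejected as reason:
        print("refused:", reason)
```

With a check like this in the resolver path, a registry that registers the second label anyway gains little, because conforming clients never ask for it.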
