Disallowing code points

From Wikidna.org

The issue is the neutrality of Internet bandwidth versus an efficient IDNA, given that IDNA is based on the globalization process invented and supported by Unicode: how best to reconcile Internet use and architectural capacity with Unicode optimization in a network context. The matter discussed directly concerns the support of multilingualism, and also of every similar kind of multilateral and/or smart network application.

At 13:42 16/07/2009, Vint Cerf wrote:

The problem we foresee is that not all registry operators (remember this is not just at top or second level but all through the hierarchy) will show the same diligence. So the most troubling cases can be excluded by protocol.

The Charter says:

The WG will stop work and recommend that a new charter be generated if it concludes that any of the following are necessary to meet its goals: (iii) A change to the basic approach taken in the design team documents (Namely: independence from Unicode version and elimination of character mapping in the protocol)


Background agreement with Vint Cerf

jefsey,

thanks for taking the trouble to prepare this note. It does lay out the issues clearly. I think I can report agreement with you in several areas.

First, while the IDNA working group members may not yet have reached consensus, there are at least some among them who would treat the mapping on lookup as a "should" (for compatibility reasons) but not a MUST, and who distinguish between the basic IDN protocol, which preserves the A-label/U-label definitions and the 1:1 conversion feature as part of the protocol, and the pre-processing of strings that may be destined for lookup in the DNS. Some of the potential preprocessing is not specified in any of the documents except to say that by the time a lookup occurs, the string must be in A-label form (including the "xn--") or must be a "traditional" ASCII form ("LDH"). One form of preprocessing, about which we continue to debate, is to map some of the characters of the eventual "lookup" string so as to reduce variance from expectations that arise from the IDNA2003 behaviors.

The DNS itself continues to have the purely ASCII behavior it has always had to avoid requiring changes on the domain name server side.

Second, I think there is considerable room for innovation and standardization at a conceptual "presentation" layer - such a layer was not defined in the Internet while some effort to define one was made in the OSI system. As applications have become increasingly internationalized, it is arguable that codifying presentation concepts may prove useful. Some might go so far as to suggest re-inventing the system of binding strings to internet addresses (what the DNS does). While that path was not chosen in the present DNS-IDNA2003-IDNA2008 sequence, it might be considered in the future and in my opinion, it is in these areas (presentation and re-definition) that much of your work, jefsey, has relevance.

I am interpreting your message here as arguing not to change the existing basic DNS, to confer stability at the server level, and here we agree.

vint
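
The 1:1 conversion between U-labels and A-labels mentioned in the message above can be illustrated with a minimal Python sketch. It uses only the standard library's `punycode` codec. The helper names `to_a_label` and `to_u_label` are hypothetical, and the sketch performs none of the validity checks that IDNA2008 requires, so it should be read as an illustration of the round trip rather than as an IDNA implementation.

```python
ACE_PREFIX = "xn--"  # the prefix that marks an A-label


def to_a_label(u_label: str) -> str:
    """Encode a label for DNS lookup; pure-ASCII labels pass through unchanged."""
    if u_label.isascii():
        return u_label
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Decode an A-label back to its Unicode form; other labels pass through."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


if __name__ == "__main__":
    a = to_a_label("bücher")
    print(a, to_u_label(a))
```

Whatever preprocessing an application chooses to apply would happen before `to_a_label` is called; the DNS itself only ever sees the ASCII result.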

Position statement by JFC Morfin

Possible appeal after the IETF/LC

DISALLOWing being a "mapping to nil", here voluntarily included in the protocol, I hereby formally request that this WG stops all work and a new charter is subsequently generated.


However, as expressed in my mail sent yesterday, I fully realize that we are currently in the process of a coup by Unicode members. They are simply trying to force a new extension of their "globalization" doctrine layer violation (cf. attached note). They actually maneuvered just as we had foreseen, including utilizing us. The alternative they propose is:

(1) either the IETF will accept their control over the Internet linguistic architecture
(2) or they will claim that some people, like myself and IUCG@IETF members, are delaying the "overwhelming consensus" that they have identified, and in turn make topics of the future, such as the sex of the glyphs, a priority. As a result, they have been forced to proceed independently.

This is exactly why I will not appeal now against a negative or non-response from you: I prefer the regular IETF/LC to apply first. I, therefore, shall leave them with the responsibility to modify the Internet architecture as well as to introduce constraints in favor of a business interest group vs. the rest of the world's languages, cultures, and nations. However, I will quote this mail in my appeal should the IETF/LC fail to protect the Internet neutrality that the WG-IDNABIS Charter adequately protects. (May I remind you that I will not participate in the WG/LC and in the IETF due to other maneuvers of the same influence group, in the same lengthy opposition they have to the people's interests I try to defend). As far as I am concerned, when back from vacation, I will continue to support the IUCG, so it may proceed with its user documentation of the way the existing Internet architecture does address the "multilingualization" they need, in an adequate, open, cross-technology, presentation layer based, off-the-shelf, neutral, pollution free, and semantic addressing ready manner (cf. attached note). Our alternative is outside of their trap.

Actually, I have to thank the Unicode consortium members for having gallantly answered the challenge that I forced them into. They have far superior knowledge in the Unicode area compared to simple users such as us. We had to force them to document, from the horse's mouth, "why we were wrong and why they were right", and who was really concerned politically and why. We also needed their tables to be conveniently documented. We would never otherwise have obtained such precious material for ensuring that the Internet’s existing possibilities would permit the support of multilingualization, as we now know it does.

jfc

Review of the semiotic context

For the convenience of the readers, the context is as follows:

There is a linguistic (semiotic) support pile wherein:

  • Universalization consists in the neutral, equal, and technical support of every language, script, orthotypography (the way in which a culture will semantically interpret the way something is written) and quoting capacity of other languages, i.e. every existing language entity is fully supported and understood. Universalization is today commonly obtained through the use of metaconcepts being numerically transcoded. Further work in that area belongs to the semantic facilitation technologies.
  • Multilingualization is the architectural approach that permits universalization through, for example, the globalization (cf. below) of every language. Its simplest support is provided by the presentation layer. As in all network architecture, the Internet includes a presentation layer. However, up to now its exploration and documentation were limited.
  • Globalization is a method that is documented by Unicode that enables the reduction of the linguistic barriers between a given language and other natural languages. It involves (to this day):
      • the internationalization of the media by including the signs of every script (e.g. Unicode). The NFC is a reduction of the Unicode list that can be utilized via the Internet. What is being discussed today is to attain a further reduction for the Internet namespace.
      • the localization of the other end, through the use of "locale" files.
      • the langtagging (permitting language filtering) of the content for the application’s benefit.

The CSoC

It so happens that the Unicode members seem to also benefit from a centralizing set of chances (CSoC) that could defeat the distributed nature of the world and of the Internet, and in turn sterilize the tremendous innovation capacity of the Internet presentation layer. This would be great for the present form of the Internet economy and governance that they dominate or participate in (stakeholders), which have not been properly prepared to cope with the impact of the activation of the presentation layer. This is why they are striving for innovation moratoria that they like to call the "status quo", in spite of the detriment that it represents for all the people.

A technical, business, and political status quo is permitted by the CSoC. By luck, the CSoC seems to result from porting globalization tasks within the multilingualization layer, something ASCII users found convenient and with consequences favorable to e-business. This layer violation limits the multilingualization layer to a single globalization occurrence: in this case, this is the ASCII English language globalization presentation.

This is technically an exceedingly unstable layer violation because it rests only on the user's subjective acceptance of limiting his/her Internet usage due to a misunderstanding. However, unexpectedly, the CSoC happens to technically, commercially, and politically influence those who design, use, and manage the Internet in favor of such self-limitation. In particular, its constraints result in violations of Internet neutrality that might make the deployment of the presentation layer more difficult (some 63-character DNS labels will be mapped to others, or to nil, upon the sole IETF engineering authority). This would amount to the machines and engineers (of some major corporations) prevailing over all the people of the world and their languages and cultures. This is purely bad luck.

Interplus

Interplus comprises a two-layer (interapplication and pseudo-network) transparent "post-wire" continuation (interplug) of the network on the user side. It completes the Internet architecture on an edge-to-edge basis (the network ends on the user's side), in which every user now has an intelligent system. This was not the case in 1983 when the Internet and DNS began.

The Interplus (interplugged layers on the user side system) includes a double interapplication networking capacity (through the Internet or through OPES [open pluggable edge services(*)] overlays) and an unlimited capacity for network interrelated services applications. (*) See the notes on OPES RFCs below.

One of these services is to support presentations, i.e. the appearing change of the network to a qualified externet (an external network look alike within the Internet). The qualifications can include linguistic networks (the default being the English ASCII version), encryption, the support of other codes than Unicode or algorithms than punycode, etc. Presentations can be identified through label headers, (such as "xn--" for IDNA), TLD (as in the ".fra" experiment), classes, or a mix of them. Presentations are transparent to applications.

Another service is the support of the Domain Names Pile. This is a smart extension of the Host.txt possibility. The user name space can fully support the orthotypography of every language and trademarks through User/Universal Domain Names (UDN). UDNs will first be converted into Application Domain Names (ADNs) in order to relate with applications (such as Skype, authentication, payment services, etc.). If needed, they will in turn be converted into regular Domain Names (for example, through punycode). UDNs are independent from the Internet and can be used for application, TV, process resolution, IA, etc. As far as the Internet is concerned, IDNs (Internet Domain Names), as accepted and resolved by the DNS, are the reference. This reference should become universal and be read as based upon the 0-Z numbering set as well as the "." and "-" strong and weak separators.

(*) Notes on OPES RFCs

RFC 3238 IAB Architectural and Policy Considerations for Open Pluggable Edge Services
RFC 3752 Open Pluggable Edge Services (OPES) Use Cases and Deployment Scenarios
RFC 3836 Requirements for Open Pluggable Edge Services (OPES) Callout Protocols
RFC 3835 An Architecture for Open Pluggable Edge Services (OPES)
RFC 3837 Security Threats and Risks for Open Pluggable Edge Services (OPES)
RFC 3838 Policy, Authorization, and Enforcement Requirements of the Open Pluggable Edge Services (OPES)
RFC 3897 Open Pluggable Edge Services (OPES) Entities and End Points Communication
RFC 3914 Open Pluggable Edge Services (OPES) Treatment of IAB Considerations
RFC 4037 Open Pluggable Edge Services (OPES) Callout Protocol (OCP) Core
RFC 4236 HTTP Adaptation with Open Pluggable Edge Services (OPES)
RFC 4496 Open Pluggable Edge Services (OPES) SMTP Use Cases
RFC 4902 Integrity, Privacy, and Security in Open Pluggable Edge Services (OPES) for SMTP

Response from Lisa Dusseault, Applications AD

Hi,

I've CC'ed Vint on this response to your request to have the WG recharter.

I'm very aware of the charter text that requires the WG to recharter, and I'm keeping an eye on that. I will certainly press the WG to modify its charter if it concludes that a different approach from that described in the charter is needed.

However, I am interpreting some things in a somewhat different light than you.

1. I will wait until the WG *concludes* that it needs to take a different approach than that described in its charter. Explorations of different approaches are not contrary to the charter (although the chair may rule such explorations out of scope and/or decide that the WG came to consensus not to follow the different approach).

2. I don't consider DISALLOWED to be a mapping to nil at all. There is no requirement for clients to map DISALLOWED characters to nil and then request a domain -- a client could reject a string with DISALLOWED characters. Thus, even if the WG concludes to have DISALLOWED characters, that does not depart from the charter; it was very clear when the charter was written that some character classes were to be DISALLOWED.

In a similar light, although this is a little off-topic, I would ask you to look again at how people are using the phrase "at the protocol level". Personally, I think it's nearly a meaningless phrase, and I'm disappointed that so many arguments are focusing on whether mappings are happening at the protocol level or not. But to the extent that it has a meaning, I think your interpretation is not quite the common one. Suggesting to user interface implementations that they could offer alternatives to disallowed characters, to help the user find the site they are looking for, is about as far from "at the protocol level" as you can get.

Thank-you,
Lisa Dusseault

Comment from John Klensin (author of RFC 4690 and IDNA2008)

On Friday, July 17, 2009 00:14 +0200 JFC Morfin <jefsey@jefsey.com> wrote:

DISALLOWing being a "mapping to nil", here voluntarily included in the protocol, I hereby formally request that this WG stops all work and a new charter is subsequently generated.

Disallowing is _not_ "mapping to nil". It is a simple prohibition of the use of the code point (I'm reluctant to say "character" here) in DNS labels. While there were some mappings to nil in IDNA2003, we have eliminated all of them (despite a few objections from some of the Unicode folks) in IDNA2008. I believe any attempt to revisit those decisions will meet very strong resistance from the WG (or at least from me).

It would help with this process if you would read the documents carefully enough to know what you are talking about.

Response of JFC

Lisa, John,

I agree with Shawn Steel, let's call a cat a cat. There are too many words being used without considering their whole implications. This is like just thinking over details instead of also taking care of the whole picture. I can't and I do not think it can be productive, moreover in a reticular system like the Internet.


1. When a character is DISALLOWED this means that one way or another, if it is used, something is going to be discarded, i.e. that "something" is mapped to nil. The "something" may be a single code point or the whole domain name, or something in between - the application decides. The consequence is the same: loss of information and typographic bandwidth reduction.


2. I consider "protocol level" as a fundamental point indeed (I do not think the IESG and IAB use nearly meaningless phrases in a Charter, moreover when issuing strict conditions). If the Charter says "elimination of character mapping in the protocol" this may only mean one thing. That thing - as we understood it, and the reason we are waiting for Vint's plan to complete - is well documented in Pete's draft: the mapping is to happen outside of the protocol, i.e. not on the wire. Because what is involved is outside of the network control and of the IETF network territory. It is users' usage operation territory.

This clearly translates in Pete's I-D:

This document describes the operations that can be

First fundamental point: there is no MUST, this is consistent with the rest of the position. The only MUST is to be acceptable to the DNS.

applied to user input in order to get it into a form acceptable by the Internationalized Domain Names in Applications (IDNA) protocol [I-D.ietf-idnabis-protocol]. The document describes the underlying architectural principles (in section 2) and the general implementation procedure (in section 3).

I class Patrik's document among the underlying architectural principles the WG is to define that "can be applied to user input" (meaning by the application). This is a BCP not a Standard.

It should be noted that this document does not specify the behavior of a protocol that appears "on the wire".

This conforms to the Charter. Character mapping of any kind MUST not appear on the wire, but off the wire.

It describes an operation that is to be applied to user input in order to prepare that user input for use in an "on the network" protocol.

All this clearly describes something which is outside the "on the network" protocol.

As unusual as this may be for an IETF protocol document, it is a necessary operation to maintain interoperability.

The text underlines that it is unusual and qualifies what is not part of the protocol as an operation. We have widely published that we are in agreement with the Charter and with this text.


3. The coup consists in trying to obtain URL indexing stability by invading and annexing that territory, my/our territory, and then starting to rule it through MUSTs where interoperability calls for negotiation based upon SHOULDs. If I want to make Tatweel resolve the IETF site, I just have to use the 1972 host.txt service, making the Tatweel code-point correspond to the IETF IP. Whatever way I enter a tatweel and punycode it, it will work. If I try the same under IDNA2008, my entry will be mapped to nil and nothing will happen. I do not really call this innovation and progress in 37 years :-) Worse, I might also name it an alias/domain name conflict.

Gee! I am on vacations !


4. I forgot a point.

At 00:44 17/07/2009, Lisa Dusseault wrote:

Suggesting to user interface implementations that they could offer alternatives to disallowed characters, to help the user find the site they are looking for, is about as far from "at the protocol level" as you can get.

Yes. But "suggesting" is not carried in using MUSTs (while SHOULD is acceptable). This is why I am OK with everything that has no other MUST than DNS respect and resolvability. Because the user may want to use tatweel. It is also illusory to use a MUST: how are you going to enforce it?

Comment from John Klensin

On Friday, July 17, 2009 04:04 +0200 JFC Morfin <jefsey@jefsey.com> wrote:

I agree with Shawn Steel, let's call a cat a cat. There are too many words being used without considering their whole implications. This is like just thinking over details instead of also taking care of the whole picture. I can't and I do not think it can be productive, moreover in a reticular system like the Internet.

I think it is precisely the whole picture that we are concerned about. The whole, global identifier, picture, not one country or language group at a time.


1. When a character is DISALLOWED this means that one way or another, if it is used, something is going to be discarded, i.e. that "something" is mapped to nil. The "something" may be a single code point or the whole domain name, or something in between - the application decides. The consequence is the same: loss of information and typographic bandwidth reduction.

Again, no. If a character is DISALLOWED, then the string that contains it is prohibited as a label, is not looked up, and, I hope, results in a message to the user that it is invalid. That has nothing to do with discarding all or part of the string and then looking something else (or a subset) up. The latter would be a clear violation of the protocol.


2. I consider "protocol level" as a fundamental point indeed (I do not think the IESG and IAB use nearly meaningless phrases in a Charter, moreover when issuing strict conditions). If the Charter says "elimination of character mapping in the protocol" this may only mean one thing. That thing - as we understood it, and the reason we are waiting for Vint's plan to complete - is well documented in Pete's draft: the mapping is to happen outside of the protocol, i.e. not on the wire. Because what is involved is outside of the network control and of the IETF network territory. It is users' usage operation territory.

Yes. We actually agree about that. But it has nothing to do with which characters are DISALLOWED, which the mapping document carefully does not touch.


This clearly translates in Pete's I-D: ["This document describes the operations that can be"] First fundamental point: there is no MUST, this is consistent with the rest of the position. The only MUST is to be acceptable to the DNS.

No. The MUST is to be consistent with the core IDNA2008 specs, which include consistency with the DNS. And that is exactly what the rest of the sentence says.


I class Patrick's document among the underlying architectural principles the WG is to define that "can be applied to user input" (meaning by the application). This is a BCP not a Standard.

If, by "Patrick's (sic) document" you mean draft-ietf-idnabis-tables, it is referenced normatively from Protocol and is a core part of the Standards-Track IDNA2008 specification, not a BCP.


["It should be noted that this document does not specify the behavior of a protocol that appears "on the wire"."] This conforms to the Charter. Character mapping of any kind MUST not appear on the wire, but off the wire.

Sure. But, strictly speaking, IDNA does not appear on the wire. Only A-labels do. That has been the source of a lot of confusion. Note that the IRI document, which is well outside the WG's scope, does refer to IRIs as protocol elements, which pushes them into the "on the wire" category, including their IDN components.


3. The coup consists in ...

There is no coup and the above is either a misinterpretation or outside the WG's scope. URIs are prohibited from containing non-ASCII domain names -- I believe globally although there are some factions who believe that such names can be included in %-escaped UTF-8 in those fields.

And you can't use the 1972 Hosts.TXT service because it was shut down long ago. More important, it was strictly limited --technically and administratively-- to ASCII letters, digits, and hyphens, so there is no way to express a Tatweel in those tables. In IDNA2008, you simply cannot express Tatweel in a label so, in that narrow sense, the behavior under the Hosttable rules and that under IDNA2008 is identical.
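
The "letters, digits, and hyphens" restriction described above can be sketched as a simple check. This is a minimal illustration, not the historical host table grammar: the function name `is_ldh_label` is hypothetical, and the extra constraints used here (no leading or trailing hyphen, at most 63 characters) are assumptions borrowed from common DNS hostname conventions rather than anything stated in the message.

```python
import re

# Assumed shape of an LDH label: ASCII letters, digits and hyphens only,
# not starting or ending with a hyphen, 1 to 63 characters long.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if the whole label fits the assumed LDH pattern."""
    return LDH_LABEL.fullmatch(label) is not None


if __name__ == "__main__":
    print(is_ldh_label("ietf"))          # plain ASCII label
    print(is_ldh_label("ie\u0640tf"))    # label containing ARABIC TATWEEL
```

Under such a rule a Tatweel simply has no representation, which is the sense in which the message calls the two behaviors identical.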


At 00:44 17/07/2009, Lisa Dusseault wrote:

I don't consider DISALLOWED to be a mapping to nil at all. There is no requirement for clients to map DISALLOWED characters to nil and then request a domain -- a client could reject a string with DISALLOWED characters.

Indeed, I think the client is required to do so and that simply mapping a DISALLOWED character to nothing would be a violation of the specification. Except for those characters that IDNA2003 _required_ be mapped to nothing, that is also true of IDNA2003. There is no provision in either for a conforming implementation to drop prohibited characters and look up the rest of the label.
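
The distinction being argued in this exchange, rejecting a label versus silently dropping a code point from it, can be made concrete with a short sketch. Everything here is hypothetical: the two-entry `DISALLOWED` set is an illustrative stand-in for the much larger IDNA2008 tables, and the function and exception names are invented for this example.

```python
# Illustrative subset only; NOT the real IDNA2008 derived property table.
DISALLOWED = {0x0640, 0x07FA}  # ARABIC TATWEEL, NKO LAJANYALAN


class DisallowedCodePoint(ValueError):
    """Raised when a label contains a code point from DISALLOWED."""


def check_label(label: str) -> str:
    """Reject the whole label if any code point is disallowed (no lookup)."""
    for ch in label:
        if ord(ch) in DISALLOWED:
            raise DisallowedCodePoint(f"U+{ord(ch):04X} is not permitted in a label")
    return label


def strip_disallowed(label: str) -> str:
    """The 'mapping to nil' behavior: drop the code point and keep the rest.

    Shown only for contrast; the messages above describe this as a
    violation of the specification.
    """
    return "".join(ch for ch in label if ord(ch) not in DISALLOWED)


if __name__ == "__main__":
    try:
        check_label("a\u0640b")
    except DisallowedCodePoint as err:
        print("rejected:", err)
    print("stripped:", strip_disallowed("a\u0640b"))
```

With `check_label` the user is told the string is invalid and nothing is looked up; with `strip_disallowed` a different label ("ab") would be looked up in its place.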

Important memo of John Klensin IRT Unicode etc.

(the pending answer from JFC will show they are mostly in agreement, and will explain why in spite of apparent contradictions: John is interested in the document, JFC is interested in the use of IDNs).

Jefsey,

Let me make a very few comments, based mostly on the beginning and end of your note.

First of all (although last in your note) a "clean Unicode" isn't going to be Unicode at all. A large fraction of your objections (and many of mine) over the years have been due to inconsistencies in Unicode. Most of those inconsistencies are the result of a fundamental design decision, made in around 1990 or 1991. As far as I can tell --and all discussions about character set coding in ISO and ISO/IEC JTC1 from the mid-1980s forward have been consistent with this-- there are only two possible ways to construct an all-script (in deference to another note from you, I'm not going to say "global" or "international") character coding standard. They are:

  • (1) Create such a standard from scratch, starting with and consistently reflecting principles about code constructions. Such a standard would not have alternate ways to represent the same character (e.g., precomposed versus base + combining character), would not "unify" some closely-related writing systems (e.g., CJK) while keeping other closely-related ones (e.g., Greek-Latin-Cyrillic) distinct. As a result, the standard would not require any special mapping (or "normalization", etc.) rules to make it usable because it would not suffer from large numbers of coding artifacts. A consequence of that, in turn, would be that decisions about what would be permitted would be strictly a matter for applications, a presentation layer, or constraints imposed by non-character-coding protocols.

    It is important to note that, even were such a standard possible, it would still suffer from difficulties with collation patterns and perceived advantages of low-numbered codes versus high-numbered codes. In addition, decisions about, e.g., unification inevitably involve controversial tradeoffs and such tradeoffs imply that some people, on some days, are going to be unhappy no matter what the result.
  • (2) Create such a standard by combining, largely concatenating, existing national and international (e.g., ISO) standards, adding to them as needed and making decisions about the inevitable edge cases. That inevitably means there will be inconsistencies because different national standards would have made different decisions. If those national standards were restricted to 7-bit or 8-bit codes, there will also be inconsistencies between any decisions made because of the limits those imply and codings that take advantage of a wider (16- or 32-bit) structure.
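
The "alternate ways to represent the same character" mentioned under the first model can be observed directly with Python's standard `unicodedata` module. This is only a small illustration of why normalization rules are needed; the helper name `same_after_nfc` is hypothetical.

```python
import unicodedata

PRECOMPOSED = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(PRECOMPOSED == DECOMPOSED)                 # raw comparison
    print(same_after_nfc(PRECOMPOSED, DECOMPOSED))   # comparison after NFC
```

The two strings render identically but compare unequal until both are normalized, which is the kind of coding artifact a from-scratch standard would aim to avoid.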

The original ISO/IEC 10646 DIS-1 was based on the first model above. It was not successful and was replaced for DIS-2 by Unicode 1.0, which is an example of the second model. The way in which that was done is, I believe, a permanent stain on ISO's reputation but I've come to understand what probably happened only in watching the behavior of the UTC folks in the last few years. But it was done.

As I think you know, I disagreed with that decision then and still do. But it is not clear to me how one would go about switching to the other one at this point even if one wanted to. At minimum, it would require a completely new standard that would have to overcome claims that the first choice is impossible in practice.

To meet your needs (or mine) in an elegant way, "Clean Unicode" would have to be a complete replacement, probably using that first model. It won't come out of the current UTC group, nor out of ISO/IEC JTC1/SC2 (at least without a radical change in membership) -- given an opportunity to make the same fundamental design decisions again, I am firmly convinced that those bodies would make them the same way.

There is, of course, a third model. It involves abandoning the idea of a single, all-script, character set entirely and going back to table-switching models based on ISO 2022 or equivalent. That approach actually has many advantages, including the ability to tailor coding and collation decisions closely to the needs of particular languages and locales. If one based IDNs on it, it would be painfully simple to impose and enforce a "one label, one language" rule by permitting only one table designator in the string. But the IETF very generally rejected that option in the early 1990s. I don't believe that reopening the question today would produce a different answer, especially given moves in JTC1/SC2 to deprecate code table switching approaches.

If you want to reopen either the decisions to avoid code table switching or the fundamental Unicode decision outlined above, I suggest the right place to start is with a new work item project proposal to JTC1.

If one accepts Unicode as a given --which the IDNA WG needs to do, not only by charter but by many other constraints-- then one has to start examining Unicode for places where their decisions, most of them made in the context of large bodies of running text, are inappropriate (either generally or specifically to DNS needs) and work around them. With regard to those issues, there are many odd decisions -- really coding errors although one can argue the case either way in the context of the original national standards or large bodies of text destined for display.

Tatweel and Lajanyalan are two of those. They have only one purpose, which is to justify text within a line. There are many characters in Unicode whose only substantive purpose is to affect line layout, e.g., U+0089 (CHARACTER TABULATION WITH JUSTIFICATION) and the various tricky spaces in the 2000...200B range. In the Unicode property scheme, they are classified as control characters and hence DISALLOWED under IDNA2008. The vast majority of them were prohibited in one way or another in IDNA2003 as well because they are considered to be "space" characters. However, if one is going to have an inherently-cursive script, justification controls have associated glyphs, so, depending on the order in which property classification rules are applied, the code point can be identified as a letter, rather than as justification controls. That is exactly what Unicode did with Tatweel and Lajanyalan, but that doesn't make them justification controls any less.
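
The classification point can be checked against the Unicode character database that ships with Python. This sketch only reports the General Category recorded for a few code points (U+0640 and U+07FA are the Unicode code points for Tatweel and Lajanyalan). The helper `describe` is hypothetical, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(cp: int) -> str:
    """Return 'U+XXXX <category> <name>' for a code point."""
    ch = chr(cp)
    # Control characters have no name in the database, so supply a default.
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{cp:04X} {unicodedata.category(ch)} {name}"


if __name__ == "__main__":
    # Tatweel, Lajanyalan, a C1 control, and one of the spaces in the 2000 range.
    for cp in (0x0640, 0x07FA, 0x0089, 0x2003):
        print(describe(cp))
```

Tatweel is reported with a letter category (`Lm`, modifier letter) while U+0089 is reported as a control (`Cc`), which is why a rule keyed only on category treats the two differently.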

That they are coded at all is partially another artifact of concatenating national standards that were developed using different assumptions. Justification controls that can be associated with glyphs exist in several other scripts as well, but the national bodies who coded those scripts or provided input into Unicode didn't see them as "characters", so they didn't assign glyph code points and neither did Unicode.

I believe you have accepted the notion that control characters should not be permitted in IDNs. What you seem to believe you are hearing from the WG is that we are prohibiting a character that may be important to the writing system and that hence should be prohibited only on registration if at all. But what is actually being said in the WG, including by the UTC experts, is not that but "this is really a justification control and we have banned all the others as punctuation".

The argument from our Iranian colleagues about permitting Tatweel is a little different. They have never claimed that it is a character that one uses in writing individual words or that it is anything but a justification control. Instead, they are trying to wrestle with the difficult and interesting problem that occurs when one wants to construct domain name labels by joining two or three words together. We don't permit using embedded spaces to do that for Western European scripts, a decision you apparently don't find problematic (at least I haven't noticed complaints about it). We do permit simply cramming the words together, a style that works much better for German than it does for French or English. And, as has been discussed extensively, it is common to try to use case distinctions to mark those boundaries when the labels are derived from some source languages. There are no obvious word markers for labels formed from concatenated words for the Persian languages, so they are trying to use Tatweel as such a marker. I can understand and sympathize with that, but, once the ability to use case distinctions is eliminated by either basic DNS rules or your position that case carries too much meaning to be mapped away, there are no such markers for any other script either (unless one counts the hyphen, whose use is not restricted to Latin-based scripts).

I also note that there are no MUSTs, or even SHOULDs, in the mapping document and that the people who are arguing that mapping MUST be required, or even that a particular one SHOULD be required, are a small minority. As far as I can tell, the only argument going on now is whether the mapping document should be dropped entirely rather than having it referred to as optional. Perhaps you didn't see that convergence while you've been on vacation or preoccupied with other work. To the extent to which you are arguing for mapping that is optional at most, either of those outcomes --clearly optional mappings or no mapping specification at all-- should be acceptable to you. So I'm not entirely sure what that part of your comments is about.

I also note that there has always been an alternative to the IDNA approach, which would be to simply permit the use of UTF-8 strings directly in the DNS. It turns out that doing so raises a whole series of other problems, perhaps more difficult than the IDNA ones, and that it would certainly take a long time to deploy effectively. We've been through those arguments before and, for whatever it is worth, I believe the IAB will try to revisit them, as part of a long-term review, over the next year or so. Part of the reason for that discussion is that it is the only strategy that is "true to the DNS". But it also has some odd side-effects, such as making the effective limit on the length of labels 63 characters for undecorated Latin and around 20 characters for CJK, with other scripts falling somewhere in between.
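The label-length side-effect mentioned above can be illustrated with a small sketch. It assumes the 63-octet DNS label limit applies directly to the UTF-8 encoding of the label; the helper name `max_chars_for` and the sample characters are illustrative choices, not anything taken from the IDNA documents.

```python
# Sketch only: assumes a label is stored as raw UTF-8 and that the
# 63-octet DNS label limit applies to that encoding.
DNS_LABEL_MAX_OCTETS = 63


def max_chars_for(sample_char: str) -> int:
    """Return how many copies of sample_char fit in one 63-octet label."""
    octets_per_char = len(sample_char.encode("utf-8"))
    return DNS_LABEL_MAX_OCTETS // octets_per_char


# Hypothetical representatives of three scripts.
for name, ch in [("Latin", "a"), ("Cyrillic", "\u0434"), ("CJK", "\u4e2d")]:
    print(name, len(ch.encode("utf-8")), "octet(s) per character,",
          max_chars_for(ch), "characters per label")
```

An unaccented Latin letter takes one octet, so 63 of them fit; a common CJK ideograph takes three octets, which leaves room for 21, in line with the "around 20" figure above.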

Finally, you indicate a preference for "Unicode takes responsibility of IDNA" which I have trouble understanding. You don't like many of the Unicode coding decisions, but this would turn everything over to them. Clearly, they would ban characters like Tatweel and Lajanyalan, which you oppose disallowing. They have made it clear that they prefer to use the CaseFold operation in any matching situation but CaseFold destroys more information than even the proposed (optional) IDNA mappings do. And they have already written a specification for what they consider appropriate in identifiers and suggested that specification would be appropriate for IDN use. Independent of what one might think of the details of that document, it is actually more restrictive and harder to adapt to local needs than the current proposed IDNA2008 rules. So I cannot see how that approach would help at all with the issues you are raising.
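As a rough illustration of the remark that CaseFold destroys more information than a simple mapping, the sketch below compares Python's `str.casefold()` (which applies Unicode full case folding) with `str.lower()`. Using `str.lower()` as a stand-in for a lowercase-style mapping is an assumption made for this example only; it is not the mapping document's actual rule set, and the helper name `compare_mappings` is hypothetical.

```python
# Sketch only: str.lower() stands in for a simple lowercase mapping
# (an assumption); str.casefold() applies Unicode full case folding.
def compare_mappings(label: str) -> dict:
    """Return the label under lowercase mapping and under case folding."""
    return {
        "original": label,
        "lowercase": label.lower(),
        "casefold": label.casefold(),
    }


# German sharp s and Greek final sigma both survive lowercasing
# but are rewritten by case folding.
for sample in ["Stra\u00dfe", "\u03bf\u03b4\u03bf\u03c2"]:
    print(compare_mappings(sample))
```

In both samples the lowercase form keeps a character (sharp s, final sigma) that case folding replaces, so two spellings that a user might regard as distinct collapse into one after folding.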

Pre-Stockholm JFC support to authors and Vint

Dear all,

1. I started answering John's mail with a first part which takes time. However, it is only a rationale that explains my/france@large positions. It does not affect IDNA at this stage.

2. What is proposed is a balanced compromise. Like every balanced compromise, it can turn catastrophic, but the IETF/LC is the part of the Internet standard process in which to discuss that risk. So, at this stage I do support this compromise, provided "Mapping should NOT be a MUST on lookup, to allow for the fact that a substantial range of transformations may take place from the time a possible DNS reference is input into an application to the point where a DNS query is made."

3. I also have the following comments on the documents:

  • I consider some points in the Mapping Document to be architecturally fundamental, and therefore that this document needs to be a standards-track document.
  • I consider that draft-ietf-idnabis-tables should be a BCP implementing an IANA informational open DDDS registry to record the current status of the considered codepoints by application designers, registries, and in the future presentations and coding systems. The presented tables would be considered as the default of such a registry.

We intend to focus on preparing a presentation of the Interplus, built on the IDNA2008 premises, for January 2010 at a telecommunications industry meeting in Paris, to which Vint Cerf was invited but declined. Hopefully, IDNA2008 will at least have been approved by the IESG by that date.

jfc
