A-label

From Wikidna.org

At 06:04 20/06/2008, Frank Ellermann wrote:

YAO Jiankang wrote:

> It is better if we clarify 3 definitions.

> LDH, which is the domain name label defined in RFC 1034 and 1035

I think we need the RFC 3696 concept "updating" the RFC 1123 idea, which in turn updated RFC 1034/1035. It's a subtle point, but the older definitions had "must start with a letter", that's obsolete.

RFC 1123 fixed that by claiming that the <toplabel> can contain only letters, but that would rule out IDN TLDs. RFC 3696 finally got it right, but did not update RFC 1123. Somebody has to fix this, and we are in the position to do it.

> U-label, which contains at least one non-ASCII character

Okay, but please without the "standard Unicode encoding" blurb, it only needs Unicode code points (the numbers, any encoding).

> A-label, which is transformed from a U-label with the algorithm (punycode), plus a prefix such as XN-- (a label with the prefix XN-- that cannot be converted to a U-label is not a valid A-label)

+1, define A-label based on U-label, and not the other way around.
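The direction proposed here, U-label first and A-label derived from it, can be sketched in a few lines of Python. This is an illustration only: it relies on the standard library's `punycode` codec (RFC 3492) and skips every IDNA validity check, case mapping, and length limit. The helper names `to_a_label` and `to_u_label` are made up for this example.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Derive the "xn--" form from a label given as Unicode code points."""
    if u_label.isascii():
        # Per the definition quoted above, a U-label has a non-ASCII character.
        raise ValueError("not a U-label: no non-ASCII code point")
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Undo to_a_label(); raises ValueError if the prefix is missing or the
    remainder is not decodable Punycode."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("missing xn-- prefix")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


if __name__ == "__main__":
    print(to_a_label("bücher"))         # xn--bcher-kva
    print(to_u_label("xn--bcher-kva"))  # bücher
```

A label that starts with "xn--" but makes `to_u_label` raise is the case described in the quoted text: it carries the prefix without being a valid A-label.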

> LDH label includes A-label.

+1, that is the whole point of this business.

Frank

_______________________________________________ Idna-update mailing list Idna-update@alvestrand.no http://www.alvestrand.no/mailman/listinfo/idna-update

At 01:20 21/06/2008, John C Klensin wrote:


--On Friday, 20 June, 2008 06:04 +0200 Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com> wrote:

> YAO Jiankang wrote:

>> It is better if we clarify 3 definitions.

>> LDH, which is the domain name label defined in RFC 1034 and 1035

> I think we need the RFC 3696 concept "updating" the RFC 1123 idea, which in turn updated RFC 1034/1035. It's a subtle point, but the older definitions had "must start with a letter", that's obsolete.

> RFC 1123 fixed that claiming that the <toplabel> can contain only letters, but that would rule out IDN TLDs. RFC 3696 finally got it right, but did not update RFC 1123. Somebody has to fix this, we are in the position to do it.

3696 is an informational document that doesn't update anything... and can't. Even had it been standards-track, it deliberately takes a very permissive view toward what is permitted and has little to do with the present situation, much less what is permitted as a TLD name.

>> U-label, which contains at least a non-ASCII character

> Okay, but please without the "standard Unicode encoding" blurb, it only needs Unicode code points (the numbers, any encoding).

No. Unless I misunderstand what you are asking for, it really is important that U-label and A-label refer to _valid_ IDNA label forms. If we go off on an adventure into "any Unicode code point", we rapidly slide down the slippery slope toward labels that are IDNA-invalid (which is part of what caused so much confusion about what "punycode" referred to) and the use of binary and other non-character labels (see RFC 2181).

>> A-label, which is transformed from a U-label with the algorithm (punycode), plus a prefix such as XN-- (a label with the prefix XN-- that cannot be converted to a U-label is not a valid A-label)

> +1, define A-label based on U-label, and not the other way around.

At the moment, neither is defined in terms of the other (in rationale). There is an implication of linkage, but that is because A-labels have to be IDNA-valid and IDNA-validity is defined in terms of operations on U-labels. What are you suggesting?

>> LDH label includes A-label.

> +1, that is the whole point of this business.

No, actually, "rationale" creates, effectively, four categories which are disjoint:

* LDH labels (as defined in 1035, with no prefix or other IDNA implications)
* A-labels (prefix, punycode encoding of the rest of the string, IDNA-valid)
* U-labels (Unicode string that is valid under IDNA)
* Invalid

Treating A-labels as a subset of LDH labels gets us back into situations in which there are LDH labels that look like A-labels and aren't. And that has been a _huge_ source of confusion and something of a source of bad behavior.
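A rough Python sketch of that four-way split may help make "disjoint" concrete. It is an approximation under loud assumptions: real IDNA validity needs the protocol's code point tables, so here a U-label is any non-ASCII string whose Punycode form fits in 63 octets, and an A-label is any "xn--" label that decodes to non-ASCII text and re-encodes to itself. The rule that other "--" prefixes in positions three and four are not plain LDH follows the rationale draft's current wording, which is itself contested in this thread. The function name `classify` is hypothetical.

```python
import re

# 1..63 letters, digits, hyphens; no hyphen at either end.
LDH_SHAPE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def classify(label: str) -> str:
    """Place a label in exactly one of the four categories (approximation)."""
    if not label.isascii():
        # Stand-in for real IDNA validity: the encoded form must fit in 63 octets.
        ace = "xn--" + label.encode("punycode").decode("ascii")
        return "U-label" if len(ace) <= 63 else "invalid"
    if not LDH_SHAPE.fullmatch(label):
        return "invalid"
    if label[:4].lower() == "xn--":
        rest = label[4:].lower()
        try:
            decoded = rest.encode("ascii").decode("punycode")
        except ValueError:
            return "invalid"
        same = decoded.encode("punycode").decode("ascii") == rest
        return "A-label" if same and not decoded.isascii() else "invalid"
    if label[2:4] == "--":
        return "invalid"  # other "--" prefixes: reserved, not plain LDH
    return "LDH label"


if __name__ == "__main__":
    for name in ("example", "xn--bcher-kva", "bücher", "-bad", "fe--2008-11-11"):
        print(name, "->", classify(name))
```

Because every input returns exactly one category, a string that merely looks like an A-label never lands in the LDH bucket; that is the confusion the disjoint definition is meant to avoid.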

   john

At 08:23 21/06/2008, Frank Ellermann wrote:

John C Klensin wrote:

  [See separate reply wrt <toplabel>]

>>> U-label, which contains at least a non-ASCII character

>> Okay, but please without the "standard Unicode encoding" blurb, it only needs Unicode code points (the numbers, any encoding).

> No. Unless I misunderstand what you are asking for, it really is important that U-label and A-label refer to _valid_ IDNA label forms.

Yes, no problem with that, it's the definition. I'm concerned about something else. At the moment IRIs are the most important application of IDNA, with EAI still struggling to get on board. I'm ignoring XML system identifiers, which are temporarily broken.

In an IRI it's the <ihost> part that actually uses IDNA. And an IRI can in essence exist in *any* charset, it's not limited to the "standard Unicode encodings" (UTF-16, -16BE, -16LE, -32, -32BE, -32LE, -8).

In a reply not yet visible on the list from my POV Ken noted that SCSU is no "standard Unicode encoding", but a registered charset. AFAIK it's an "Unicode standard", as opposed to say UTF-EBCDIC, UTF-7, UTF-1, or BOCU-1. If an "Unicode standard" charset is not the same as a "standard Unicode encoding" that is okay, it isn't the point I'm concerned about. Maybe it is only me, but it might indicate why "standard Unicode encoding" in idnabis-rationale can be misleading:

IRIs with perfectly valid <ihost>s containing "U-labels" can use other encodings (based on the document charset), not only the seven (at the moment) "standard Unicode encodings". What you really want is that U-labels survive the IDNAbis procedure resulting in a corresponding A-label.

In RFC 3987 the IRI to URI translation starts with "transform whatever it is to UTF-8". After that step (wannabe-) U-labels are UTF-8 and in a "standard Unicode encoding". But they were already U-labels before that step.

The requirement for U-labels (before the IDNAbis procedure) is "must have corresponding Unicode code points", not "must be in a standard Unicode encoding". I hope it's now clearer what I mean, and why I proposed "I-label" instead of "U-label".

For an IRI in a Latin-1 document an "U-label" will use Latin-1 octets, and iso-8859-1 is no "standard Unicode encoding".

Digression: It might also use percent-encoded UTF-8, and while you might not like this IRI-magic, "percent encoded UTF-8" is also no "standard Unicode encoding", ditto all RFC 5137 ideas.
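The point that a label is identified by its code points rather than by its octets can be shown with a small Python sketch. The sample label "bücher" and the helper names are assumptions for illustration; the A-label step uses the standard library's `punycode` codec and performs no IDNA validation.

```python
from urllib.parse import unquote

latin1_octets = b"b\xfccher"       # as found in an ISO-8859-1 document
utf8_octets = b"b\xc3\xbccher"     # the same label in UTF-8
percent_encoded = "b%C3%BCcher"    # percent-encoded UTF-8 inside an IRI/URI

candidates = [
    latin1_octets.decode("iso-8859-1"),
    utf8_octets.decode("utf-8"),
    unquote(percent_encoded),      # unquote() decodes as UTF-8 by default
]


def code_points(label: str) -> list:
    """List the code points of a label in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in label]


def a_label(label: str) -> str:
    """Punycode-encode the code points and add the xn-- prefix."""
    return "xn--" + label.encode("punycode").decode("ascii")


if __name__ == "__main__":
    for label in candidates:
        print(code_points(label), a_label(label))
```

All three inputs yield the same code point sequence and therefore the same A-label, whatever charset carried them.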

[...]

>> +1, define A-label based on U-label, and not the other way around.

> At the moment, neither is defined in terms of the other (in rationale). There is an implication of linkage, but that is because A-labels have to be IDNA-valid and IDNA-validity is defined in terms of operations on U-labels. What are you suggesting?

Swap the paragraphs, start with U-label followed by A-label.

>>> LDH label includes A-label.

>> +1, that is the whole point of this business.

> No, actually, "rationale" creates, effectively, four categories which are disjoint:

> * LDH labels (as defined in 1035, with no prefix or other IDNA implications)

ldh-label = <letdig> [1*61<l-d-h> <letdig>] ;or similar

> * A-labels (prefix, punycode encoding of the rest of the string, IDNA-valid)

a-label = "xn--" *<l-d-h> "-" 1*<letdig>  ;or similar

Limited to length 63 and only valid if following the rules in idnabis-protocol, TUS, RFC 3492, the works. Any valid <a-label> is by definition also a valid <ldh-label>, that is what I meant.

> * U-labels (Unicode string that is valid under IDNA)

NAK, you need at least one non-<l-d-h> code point to get U-label != LDH-label, and therefore U-label != A-label.

> * Invalid

An invalid <a-label> matching the ABNF outlined above, and not longer than 63 octets, is still a valid <ldh-label>.

Like a label that is no valid <ldh-label>, it can be still a valid label, DNS allows any 1*63<octet>.

> Treating A-labels as a subset of LDH labels gets us back into situations in which there are LDH labels that look like A-labels and aren't.

But A-labels just *are* a proper subset of LDH labels, that is the one and only point of IDNA(bis), as opposed to using raw UTF-8 octets up to a maximal length of 63 octets.

That not any LDH-label starting with "xn--" is also a valid <a-label> is the fine print, and one reason why IDNAbis and IDNA need several RFCs for the details.

Frank

At 15:05 21/06/2008, JFC Morfin wrote:

This mail was blocked due to the number of recipients. I am sending it again. The cc list is: "Roberto Gaetano" <roberto@icann.org>, "Subrenat" <jj.subrenat@welho.com>, bdelachapelle@gmail.com, alac@atlarge-lists.icann.org, anne-rachel.inne@icann.org, twomey@icann.org, gerard.lang@insee.fr, maaya@funredes.org, listegenerale@franceatlarge.org, MKUMMER@unog.ch, genetech@mail.mltf.org


To follow on Frank's remarks calling for clear definitions for the words we use at the IETF/WGIDNABIS.

We (france@large and related entities and experts, as well as Members from the French Internet Governance) have identified a difficult (and politically sensitive) problem in translating Internet terms into French.

This shows us that many terms that we commonly understand intuitively may not be well understood in another language (French is certainly one of the most difficult semantic filters for external pragmatics). Relations with other languages of Members of the involved entities (European languages, Arabic, African languages) make us understand that the most difficult part of the IDNA/ML-DNS issue is the linguistic interintelligibility of user guides and contracts. To interthink may be more difficult than to internet.

We are therefore considering an Internet terminology project (ITP/TPI) in which everyone would be welcome to participate. It would be articulated as follows:

1. to make a list of the Internet-related words (technical and governance) with their English definitions; we would like some help from the IETF side to be sure we collect the largest number of terms, starting with the very IDNA words.
2. to translate that list into French, to come back with an English/French glossary and all the definition clarifications we need or can propose.
3. to engage in a mutual relation with experts for all ISO 3166 listed scripts and languages in order to provide an Internet Ontology with embedded reference links and conceptual, political, technical, managerial, and user-oriented notes and comments.
4. to require Multilinc geocultural semantic addressing registry projects to contribute their own language definitions (Corsican, Breton, Welsh, etc.).
5. as a common reference, we plan to organise (based upon wiki work through http://wikidna.org) a maintained unique document built after the existing and future RFCs (for the IDNA part) about the technical architecture, logic, formats, operations, databases, etc. supporting the Multilingual Internet. For compatibility with IETF and other works we would like this document to be at least bilingual.

We will discuss this project in more detail during our france@large July 3rd kick-off meeting (hosting, leaders, methodology, support, etc.). As with the IGF process, we would certainly like this Internet based/related effort to serve as a practical example of intergovernance support that could be copied in other areas.

Comments and help welcome. jfc

At 20:35 21/06/2008, Frank Ellermann wrote:

John C Klensin wrote:

>> (1) LDH label, that's AFAIK 1 to 63 letters, digits, and hyphens, not starting or ending with a hyphen.

> And not having two hyphens in the third or fourth positions, according to the current definition in idnabis-rationale.

That would be a major change to what is currently known as label in a FQDN of a host. I hope for one "updates: 1123", but not twenty of "updates: ????" for the various RFCs with their own idea of a host <label>:

| Domain = sub-domain *("." sub-domain)
| sub-domain = Let-dig [Ldh-str]
| Let-dig = ALPHA / DIGIT
| Ldh-str = *( ALPHA / DIGIT / "-" ) Let-dig

That is an example in a not yet approved RFC about SMTP.

> Note that IIR 1035 doesn't say "LDH", and 1123 doesn't either, they say "host name".

RFC 1035 defines <ldh-str> and <let-dig>, RFC 821 defines <ldh-str> and <let-dig>, RFC 937 uses <ldh>, 819, 882, 883, 1034, 2486, 2645, 2821, 3467, 3490, 3696, 3743, 4185, 4282, 4290, 4408, 4471, 4690, 4713, 5178. That's what I found with <http://purl.net/xyzzy/-a9/LDH+RFC>

<toplabel> 

> I hope that it is out of scope for this WG, but that is certainly subject to debate. As you know, I've written the IESG asking them to give some priority to validating that erratum.

I don't understand why you hope that this is out of scope. It has to be fixed for future IDN TLDs, and your erratum update killed the happy theory that RFC 3696 is the last word on <toplabel>, e.g., as used in the following draft:

<http://www.icann.org/topics/dns-stability-draft-paper-06feb08.pdf>

Folks are grabbing for anything, informational RFC or even unverified erratum, just to get any "authoritative" source about this.

> We probably should extend the 1123 rule to permit those hyphens but, IMO, that is as far as we should go.

That is already good enough, there are only two variants,

 toplabel = <let> [1*61<l-d-h> <let-dig>]  ; variant 1
 toplabel = <let> 0*61<l-d-h> <let-dig>    ; variant 2

Let's just pick what you like better, but not variant 1
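The two variants can be compared mechanically. Below is a minimal Python sketch that transcribes each ABNF line literally into a regular expression; the constant and function names are invented for the example. Read literally, variant 1 accepts a one-letter label but, because its optional group needs at least two characters, not a two-character one, while variant 2 starts at length two.

```python
import re

LET = "[A-Za-z]"
LET_DIG = "[A-Za-z0-9]"
L_D_H = "[A-Za-z0-9-]"

# toplabel = <let> [1*61<l-d-h> <let-dig>]  ; variant 1
VARIANT_1 = re.compile(rf"{LET}(?:{L_D_H}{{1,61}}{LET_DIG})?")
# toplabel = <let> 0*61<l-d-h> <let-dig>    ; variant 2
VARIANT_2 = re.compile(rf"{LET}{L_D_H}{{0,61}}{LET_DIG}")


def matches(pattern: re.Pattern, label: str) -> bool:
    """True if the whole label matches the given variant."""
    return pattern.fullmatch(label) is not None


if __name__ == "__main__":
    for label in ("a", "de", "com", "xn--p1ai", "123"):
        print(f"{label!r}: variant 1 = {matches(VARIANT_1, label)}, "
              f"variant 2 = {matches(VARIANT_2, label)}")
```

Both patterns cap the label at 1 + 61 + 1 = 63 characters and both accept "xn--" labels, so the practical difference lies only in the shortest labels.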

> A combination of I-Ds, informational and experimental documents, and opinions that don't represent demonstrated community consensus. Sorry if I don't find much authority in these.

That is because everybody waits for you to say what you think is best in a published RFC on standards track with an "updates: 1123" note. The USEFOR RFC is on standards track, with the 3696 version of variant 2 (= length two).

>> By definition an A-label is also a valid <toplabel>, and we don't need to talk about this.

> By whose definition?

By your definition in either RFC 3696 or Errata ID 1353, and your definition in idnabis-rationale. The latter defines (in prose)...

x-label = "xn--" *<l-d-h> "-" 1*<let-dig> ; length 6..63

...and any valid A-label matches <x-label>. Because any <x-label> also matches <ldh-label>, and any <toplabel> is simply an <ldh-label> starting with a letter (length 1..63 or 2..63 depending on the chosen variant) I get:

  • "x" is a letter
  • "xn--" + "-" + 1*<let-dig> has length 6, and 6 > 1
  • 6..63 has the same maximal length as 1+61+1

> all the ICANN test collection proves is that one can violate 1123 without causing very many problems, at least for the mostly-web applications that have been used in tests.

Joke - I had to fix my rxwhois client, anything with a hyphen went into the "guess what NIC handle" procedure.

> not obviously in the WG's charter.

| In particular, IDNs continue to use the "xn--" prefix

The Charter wants "xn--", it does not say "but not for TLDs". Vint or Lisa would tell us if they don't want IDN TLDs for some obscure reason.

 <potentially open question: valid U-toplabel>

>> Depending on the script "one code point" can express things that would need several letters in other scripts. ICANN can sort this out.

> It is not clear who gets to "sort this out".

What I wrote was a proposal. Do you want to tackle the minimal length of an U-toplabel in Unicode code points ?

I'm not (yet) aware of technical reasons to do this, a corresponding A-toplabel has length 6..63, is that not good enough ?

> again, I hope that work doesn't belong to this WG.

That matches "ICANN can sort this out", it would be bad if we say "two code points", and some language in some script uses a single code point for "motherland".

The Chinese IDN test TLDs use only two code points for "test". The Cyrillic RF proposal uses two code points, and it won't surprise me if somebody wants or needs one.

> The current rule (banning anything with "--" in positions two and three that isn't a valid A-label) in IDNA2008 is extremely conservative wrt prefix forms as a means of avoiding nonsense

Nobody can prevent me from creating a label fe--2008-11-11, it is LDH, and it makes sense from my POV. How could we find out if somebody uses similar labels already, and get them to change it ?

The IDNA "xn--" approach used a proper subset of LDH for its purposes out of necessity, but I see no technical necessity to say that other LDH subsets are *invalid*.

IMO figuring out which <x-label>s (see above) are valid A-labels is interesting enough.

> That isn't much of a restriction, since no one has really demonstrated a need for such strings.

There is no need to have hmdmhdfmhdjmzdtjmzdtzktdkztdjz as label, nevertheless I ended up with it, after a piece of software rejected about a dozen less obscure ideas, and I lost my patience. IIRC I needed a working jabber account fast, I wasn't aware that this would be a label and local part later.

> If the WG concludes that is excessive and wants to drop back all or part of the way to a rule that merely says that, if the label starts in "xn--", it must be an A-label, I won't lose any sleep over it...

I guess you could say that any <x-label> that is not a valid A-label MUST NOT be registered as <toplabel>, and that it also MUST NOT be registered in any "decent" TLD registry (at any level managed by the TLD registry).

That is already difficult, constructed example, what if an URI scheme xn--foo needs xn--foo.uri.arpa ? Subtle point, this is no <x-label> as defined above.

<xn--cocacola> 

> If one decides that an A-label that cannot satisfy those rules is "whatever it is", one ends up with a string with two possible interpretations depending on the version of Unicode being used

Okay, to eliminate any "it is not even an <x-label>" argument let's take xn--coca-cola, a valid <x-label>.

The MUSTard would guarantee that xn--coca-cola cannot be registered if it has no corresponding U-label for Unicode 5.1 (caveat, maybe it has, I didn't check it).

At other levels folks will do what they want no matter what IDNAbis tries to decree. Applications could not decode it to an U-label, because there is no U-label.

Isn't that good enough, treat xn--coca-cola "as is" ?

Frank

At 02:54 22/06/2008, Frank Ellermann wrote:

Sorry, I got <x-label> wrong. There is no "third hyphen" if the U-label contains no LDH:

wrong: x-label = "xn--" *<l-d-h> "-" 1*<let-dig>
fixed: x-label = "xn--" [*<l-d-h> "-"] 1*<let-dig>

The punycode detail is not more interesting in the fixed form, just x-label = "xn--" *<l-d-h> <let-dig> will do.

Any valid A-label is still an x-label, and any x-label is still a top-label.

xn--cocacola and xn--foo are <x-label>s, they need no third hyphen to become more obscure. And if they are no valid A-labels they are still ldh-labels.
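The simplified syntax and the subset claim can be checked with a short Python sketch. The regular expressions are a transcription made for this illustration: the ldh-label pattern follows the prose definition of 1 to 63 letters, digits, and hyphens with no hyphen at either end, and the x-label pattern follows the simplified form "xn--" *<l-d-h> <let-dig> with the 63-octet limit applied separately. Nothing here decides whether an x-label is a valid A-label.

```python
import re

LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")
X_LABEL = re.compile(r"xn--[A-Za-z0-9-]*[A-Za-z0-9]", re.IGNORECASE)


def is_ldh_label(label: str) -> bool:
    """Letters, digits, hyphens; length 1..63; no leading or trailing hyphen."""
    return LDH_LABEL.fullmatch(label) is not None


def is_x_label(label: str) -> bool:
    """Syntactic "xn--" form only; says nothing about IDNA validity."""
    return len(label) <= 63 and X_LABEL.fullmatch(label) is not None


if __name__ == "__main__":
    for label in ("xn--cocacola", "xn--foo", "xn--coca-cola",
                  "xn--", "example", "fe--2008-11-11"):
        print(f"{label!r}: x-label = {is_x_label(label)}, "
              f"ldh-label = {is_ldh_label(label)}")
```

Every string accepted by `is_x_label` starts with a letter, ends with a letter or digit, and stays within 63 characters, so it is also accepted by `is_ldh_label`; the reverse does not hold.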

Maybe the <x-label> syntax could help in the draft, it's clearer than talking about "potential A-labels", if the term A-label implies valid.

Frank

At 04:51 23/06/2008, Mark Andrews wrote:


> John C Klensin wrote:

> >> (1) LDH label, that's AFAIK 1 to 63 letters, digits, and hyphens, not starting or ending with a hyphen.

> > And not having two hyphens in the third or fourth positions, according to the current definition in idnabis-rationale.

> That would be a major change to what is currently known as label in a FQDN of a host. I hope for one "updates: 1123", but not twenty of "updates: ????" for the various RFCs with their own idea of a host <label>:

> | Domain = sub-domain *("." sub-domain)
> | sub-domain = Let-dig [Ldh-str]
> | Let-dig = ALPHA / DIGIT
> | Ldh-str = *( ALPHA / DIGIT / "-" ) Let-dig

> That is an example in a not yet approved RFC about SMTP

> > Note that IIR 1035 doesn't say "LDH", and 1123 doesn't either, they say "host name".

> RFC 1035 defines <ldh-str> and <let-dig>, RFC 821 defines <ldh-str> and <let-dig>, RFC 937 uses <ldh>, 819, 882, 883, 1034, 2486, 2645, 2821, 3467, 3490, 3696, 3743, 4185, 4282, 4290, 4408, 4471, 4690, 4713, 5178. That's what I found with <http://purl.net/xyzzy/-a9/LDH+RFC>

Hostnames are defined in RFC 952 which is modified by 1123. RFC 1035 does NOT define hostnames. It says to use *existing* rules. For hostnames it says this would be those for hosts.txt, RFC 952.

RFC 1123 also does not preclude alphanumeric. What it does say is that all the currently allocated TLDs (at the time of writing) are alphabetic and that because they are alphabetic there is no possibility of a clash with a dotted decimal notation for an IPv4 address.

At best there is guidance not to allocate a TLD which will potentially clash with a representation of an IPv4 address.

0xde.0xad.0xbe.0xef
222.173.190.239
0xdeadbeef
0336.0255.0276.0357
033653337357
3735928559

xn--* will never clash with a dotted decimal or any other representation of an IPv4 address.

xn--* is a legal TLD under RFC 952 and it was not made illegal by RFC 1123.
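The clash argument can be made concrete with a small parser. The sketch below assumes the classic inet_aton-style conventions (a leading 0x means hexadecimal, a leading 0 means octal, and the last component fills all remaining octets); it is a hand-written illustration, and the function names are hypothetical.

```python
import re


def _parse_part(part: str) -> int:
    """Parse one dot-separated component as hex, octal, or decimal."""
    if re.fullmatch(r"0[xX][0-9A-Fa-f]+", part):
        return int(part, 16)
    if re.fullmatch(r"0[0-7]*", part):
        return int(part, 8)
    if re.fullmatch(r"[1-9][0-9]*", part):
        return int(part, 10)
    raise ValueError(f"not a numeric component: {part!r}")


def parse_legacy_ipv4(text: str) -> int:
    """Return the 32-bit value of a legacy IPv4 spelling, or raise ValueError."""
    parts = [_parse_part(p) for p in text.split(".")]
    if len(parts) > 4:
        raise ValueError("too many components")
    *head, last = parts
    tail_bytes = 4 - len(head)  # octets covered by the last component
    if any(v > 0xFF for v in head) or last >= 1 << (8 * tail_bytes):
        raise ValueError("component out of range")
    value = 0
    for v in head:
        value = (value << 8) | v
    return (value << (8 * tail_bytes)) | last


if __name__ == "__main__":
    spellings = ("0xde.0xad.0xbe.0xef", "222.173.190.239", "0xdeadbeef",
                 "0336.0255.0276.0357", "033653337357", "3735928559")
    for s in spellings:
        print(s, "->", hex(parse_legacy_ipv4(s)))
    try:
        parse_legacy_ipv4("xn--p1ai")
    except ValueError as exc:
        print("xn--p1ai ->", exc)
```

A final label beginning with "xn--" can never pass `_parse_part`, since neither "n" nor "-" is a legal character in any of the three numeric forms.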

--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: Mark_Andrews@isc.org

At 14:52 23/06/2008, John C Klensin wrote:

Mark,

I think the issue here is not what is syntactically possible, but about what makes good sense. We presumably all know that virtually anything can be placed into a DNS label, including arbitrary strings in UTF-8, UTF-32, or random character sets. RFC 2181 is quite clear about that and, in that area, it really contained nothing new although there had been enough confusion to justify writing it.

IDNA was designed around the assumption that we didn't want to tamper with the host name rules which are, in turn, reflected as the "existing" rules in 1035, specific syntax in 821, etc. It wasn't that we couldn't simply, e.g., store UTF-8 in domain labels, it is that we decided to not do (for a lot of perfectly good reasons, but let's not reprise that discussion here).

The question of what strings should be permitted as TLD labels has always been more a matter of judgment than of protocol. Being pedantic about what the protocol permits does not help us make progress; again, it is clear that the DNS itself permits almost anything. Jon's judgment was that we would be better off with a clear lexical distinction, based on length, between ccTLDs and gTLDs. For better or worse, that distinction is now ancient history. My recollection and understanding, from the pre-1591 discussions, is that "alphabetic" meant exactly that -- the intention was that, if new gTLDs were allocated, their names would contain nothing but alphabetic ASCII characters. The reason was to avoid any possible confusion with IP addresses or other, non-DNS identifiers, either by stupid parsing algorithms or by careless people.

Now, IDNs change that rule. One cannot have a full range of A-labels without digits and cannot have A-labels at all without hyphens in the third and fourth positions. My intuition --again consistent with extrapolation from the 1591 discussions-- is that TLD U-labels (or, more generally, anything that isn't strictly an A-label) should not include any digits (in any script) or punctuation (even hyphens), regardless of what is permitted elsewhere.

How dangerous would it be to be more relaxed than that? I don't know. Certainly it is possible that I'm being too conservative. But I'm also not really interested in finding out, given the sweeping consequences of misinterpreting a TLD string and also given that there is no obvious _need_ for such strings, regardless of what people might "like" to do. If we get down to "like to do", then there are clearly folks who would "like" to create confusion and attack vectors -- both can be quite profitable.

Now, all of that said, one issue clearly remains:

Should IETF try to impose any requirements or limitations that would apply strictly to TLD labels, or should we decide that they are just "policy" and leave them to ICANN? My personal view is that the type of restrictions described above are not "just policy" because they are important to preserving the ability of older, non-IDNA-aware, applications to continue to behave smoothly and predictably. I also don't trust ICANN's decision-making processes very much and, in particular, do not trust them to favor conservatism about long-term identifier integrity over the short-term commercial interests of someone with a clever idea. I also believe, based on some small experience, that the argument will be made there that, if something wasn't important enough for the IETF to lay down a firm rule, then there should be no restrictions and commercial ("competitive") interests should prevail.

YMMV on any or all of those points -- you may, in particular, believe that the IETF should stay out of the TLD syntax issues on principle regardless of consequences; you may trust ICANN and its processes to protect the integrity of the DNS and its identifiers; or you may believe that the interests of the Internet are best served by uncontrolled commercialization of the DNS. I don't believe that either of our interpretations of history will help reach conclusions on those issues if, indeed, we disagree.

But, FWIW, my conclusions from my reasoning about this and the assumption that IDN TLDs are either a good idea or inevitable are that:

* We should continue to restrict ASCII TLD strings (a subset of "LDH labels" in IDNA2008-speak) to alphabetic-only... no digits or hyphens at all.

* We should apply the same rules to U-labels (native character string forms) for TLDs, i.e., no digits, no punctuation, and, preferably, at least two or three characters (in the "print position" sense of "character" long, not dependent on however Unicode coding happens to work) long.

* We should permit whatever A-labels fall out from the above. I.e., A-labels contain hyphens by definition and often contain digits as a consequence of the coding.
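As an illustration only, the three rules above can be written as a small C check. This is a minimal sketch under stated assumptions: the function names `has_ace_prefix` and `tld_label_ok` are hypothetical, the "xn--" branch only looks at the character repertoire, and nothing here verifies that an "xn--" label actually decodes to a valid U-label.

```c
#include <ctype.h>
#include <string.h>

/* 1 if s begins with the ACE prefix "xn--" (case-insensitive) and has
 * at least one character after it. */
static int has_ace_prefix(const char *s)
{
    return strlen(s) > 4 &&
           tolower((unsigned char)s[0]) == 'x' &&
           tolower((unsigned char)s[1]) == 'n' &&
           s[2] == '-' && s[3] == '-';
}

/* Hypothetical check: an ASCII TLD label is either purely alphabetic,
 * or an "xn--" label built from letters, digits and hyphens.
 * It does NOT check that the label converts back to a U-label. */
int tld_label_ok(const char *s)
{
    size_t i, n = strlen(s);

    if (n == 0 || n > 63)
        return 0;
    if (has_ace_prefix(s)) {
        for (i = 4; i < n; i++)
            if (!isalnum((unsigned char)s[i]) && s[i] != '-')
                return 0;
        return s[n - 1] != '-';     /* no trailing hyphen */
    }
    for (i = 0; i < n; i++)
        if (!isalpha((unsigned char)s[i]))
            return 0;               /* digits and hyphens rejected */
    return 1;
}
```

With these assumptions a label such as "com" or "xn--p1ai" passes, while "c0m", "co-m" and "123" do not.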

But that is just the conclusion I get to by applying my conclusions as postulates. If I do, the above rules more or less fall out. If one starts with different postulates, one ends up with different rules.

A few more comments below...


--On Monday, 23 June, 2008 12:51 +1000 Mark Andrews <Mark_Andrews@isc.org> wrote:

>...
> RFC 1123 also does not preclude alphanumeric. What it does say is that all the currently allocated tlds (at the time of writing) are alphabetic and that because they are alphabetic there is no possibility of a clash with a dotted decimal notation for a IPv4 address.

No. Go back and read it again. "will be alphabetic" was a comment about future allocations, not just the circumstances at the time.

> At best there is guidance not to allocate a TLD which will potentially clash with a representation of a IPv4 address.
>
> 0xde.0xad.0xbe.0xef
> 222.137.190.239
> 0xdeadbeef
> 0337.0211.0276.0357
> 033653337357
> 3735928559
>
> xn--* will never clash with a dotted decimal or any other representation of a IPv4 address.
>
> xn--* is a legal tld under RFC 952 and it was not made illegal by RFC 1123.

That is absolutely correct. There is no "illegal" in any of this. The question, as noted above, is about strategies and wisdom in allocating TLD labels given that 1123 eliminated the "no leading digit" rule for labels -- a rule that was never enforced by the DNS but by applications protocols (such as SMTP).

Thus, a different way to put the question is "should there be restrictions on TLD labels that are a superset of the restrictions on labels elsewhere in the tree". If one's answer is "no", then my concerns, and most of the discussion above and earlier, are irrelevant (and Frank's "<toplabel>" production is trivial). If the answer is "yes", or even "maybe", then the questions are what those additional restrictions should be and who should define them.

Even if the answer is "yes, but only to prevent confusion with IPv4 addresses", one could still have an FQDN of 1.2.3.4.5 or 1.2.3, as long as 1.2.3.4 is avoided. I'd rather not, if only because I can imagine ways in which parsers based on other assumptions, DNAMEs, and mappings to or from reverse forms could lead to trouble, but, again, that is a matter of conservative preferences, not because there would be serious problems for very careful applications.

>...

   john


At 22:55 23/06/2008, Frank Ellermann wrote:

Mark Andrews wrote:

> At best there is guidance not to allocate a TLD which will potentially clash with a representation of a IPv4 address.

> 0xde.0xad.0xbe.0xef
> 222.137.190.239
> 0xdeadbeef
> 0337.0211.0276.0357
> 033653337357
> 3735928559

The 0x concept would break various of my scripts based on "if it only contains the characters '0.123456789' it might be an IPv4, and it is no FQDN (or vice versa)".

Maybe it is a good idea to say that "0x" 1*HEXDIG cannot be a <toplabel>, but that has to be said in something that is fresher than "RFC 952 - status: unknown". With no note about "updated by RFC 1123".

> xn--* will never clash with a dotted decimal or any other representation of a IPv4 address.

Yes, any valid A-label is automatically a valid <toplabel>.

> xn--* is a legal tld under RFC 952 and it was not made illegal by RFC 1123.

RFC 952 has a limit of 24 characters and proposes suffixes like "-GW" and "-NIC", let's say it was not designed for IDNA and RFC 3492. If you think it helps we could move RFC 952 to HISTORIC, it muddies the water when it shows up in ICANN documents published in 2008.

The decruft experiment (RFC 4450) missed RFC 952, because it was limited to standards, excluding "status: unknown".

Frank


At 00:51 24/06/2008, Frank Ellermann wrote:

John C Klensin wrote:

> Jon's judgment was that we would be better off with a clear lexical distinction, based on length, between ccTLDs and gTLDs. For better or worse, that distinction is now ancient history.

As ancient as one minute ago (somehow I ended up in a Wikipedia dispute about a "proposed top-level domain" QC, the proponent did not read the "country code top-level domain" article with the RFC 1591 fine print).

> My intuition --again consistent with extrapolation from the 1591 discussions-- is that TLD U-labels (or, more generally, anything that isn't strictly a U-label) should not include any digits (in any script) or punctuation (even hyphens), regardless of what is permitted elsewhere.

Dunno, figuring out what is a good, bad, or ugly U-toplabel for a given valid U-label is something ICANN can do. If they want a rule in the IDNAbis RFCs about it, fine. I'd have vague ideas why "only non-ASCII digits" in a U-toplabel would be odd, after all "only ASCII digits" isn't permitted. But if they don't want a rule better don't talk about it.

> Certainly it is possible that I'm being too conservative. But I'm also not really interested in finding out, given the sweeping consequences of misinterpreting a TLD string and also given that there is no obvious _need_ for such strings

Maybe somewhere in the world countries are known by strings containing digits, and want to get TLDs (pure speculation), IMO this is not "obvious". Conservative is good, but I like KISS better...

> Should IETF try to impose any requirements or limitations that would apply strictly to TLD labels, or should we decide that they are just "policy" and leave them to ICANN?

The proper <toplabel> subset of LDH has to be specified, it's messy at the moment with at least five versions in the wild. U-toplabel could be "policy".

> and, in particular, do not trust them to favor conservatism about long-term identifier integrity over the short-term commercial interests of someone with a clever idea.

That's why I wrote "pick version 1 or 2, but not 1" (SC TLD) for <toplabel>. For a single non-ASCII code point U-toplabel see above. If you want to do something about the U-toplabels please keep it simple, I'm more interested in LDH <toplabel>s.

> * We should continue to restrict ASCII TLD strings (a subset of "LDH labels" in IDNA2008-speak) to alphabetic-only... no digits or hyphens at all.

"xn--" labels have hyphens and digits, for implementations it means they have to accept this anyway. MUST start with <let> is ok. (eid=1335), MUST contain non-digit is ok. (RFC 3696), let's not make this more complex.

Same idea as in MIME, implementations do not need to know that =?...?.?...?= is magic, they treat it just as a word. Similarly, implementations don't need to know that xn--... is magic, it's just a peculiar LDH label. Only IDNA software will try to do more with xn--...

> one could still have an FQDN of 1.2.3.4.5 or 1.2.3, as long as 1.2.3.4 is avoided.

Some stupid software looks for "only dots or digits" for its decision "might be a FQDN or IPv4", and it doesn't count dots, or check that 1.2.3.456 is no IPv4. Now you could say "this software is broken, fix it", but it was good enough to handle IPv4 vs. FQDN implicitly.
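The "only dots or digits" heuristic can be sketched in a few lines of C. This is an illustration under assumptions, not anybody's actual script: the function name `only_dots_and_digits` is hypothetical, and the character set is the '0.123456789' string quoted earlier in the thread.

```c
#include <string.h>

/* 1 if s is non-empty and consists only of the characters
 * "0.123456789". Deliberately naive: it does not count dots and does
 * not range-check the components. */
int only_dots_and_digits(const char *s)
{
    return *s != '\0' && strspn(s, "0.123456789") == strlen(s);
}
```

Under this check "1.2.3.456" and "1.2.3.4.5" both look like addresses although neither is a dotted quad, and "0x7f.0x01" looks like a name, which is the implicit IPv4 vs. FQDN split described above.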

> not because there would be serious problems for very careful applications.

Sure, some protocols insist on square brackets for IPs, after that decision the whole issue doesn't exist. But other popular protocols don't for IPv4, notably STD 66.

Frank


At 01:25 24/06/2008, Mark Andrews wrote:


> Mark Andrews wrote:
>
> > At best there is guidance not to allocate a TLD which will potentially clash with a representation of a IPv4 address.
> >
> > 0xde.0xad.0xbe.0xef
> > 222.137.190.239
> > 0xdeadbeef
> > 0337.0211.0276.0357
> > 033653337357
> > 3735928559
>
> The 0x concept would break various of my scripts based on "if it only contains the characters '0.123456789' it might be an IPv4, and it is no FQDN (or vice versa)".
>
> Maybe it is a good idea to say that "0x" 1*HEXDIG cannot be a <toplabel>, but that has to be said in something that is fresher than "RFC 952 - status: unknown". With no note about "updated by RFC 1123".
>
> > xn--* will never clash with a dotted decimal or any other representation of a IPv4 address.
>
> Yes, any valid A-label is automatically a valid <toplabel>.
>
> > xn--* is a legal tld under RFC 952 and it was not made illegal by RFC 1123.
>
> RFC 952 has a limit of 24 characters and proposes suffixes like "-GW" and "-NIC", let's say it was not designed for IDNA and RFC 3492. If you think it helps we could move RFC 952 to HISTORIC, it muddies the water when it shows up in ICANN documents published in 2008.
>
> The decruft experiment (RFC 4450) missed RFC 952, because it was limited to standards, excluding "status: unknown".

It is the current RFC that limits hostnames to LDH. -GW and -NIC etc. are just shoulds not musts.

To move RFC 952 to historic we need to write a RFC which consolidates all the changes to hostnames: syntax, lengths etc. into one document.

-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: Mark_Andrews@isc.org

At 02:12 24/06/2008, Mark Andrews wrote:


> Even if the answer is "yes, but only to prevent confusion with IPv4 addresses", one could still have an FQDN of 1.2.3.4.5 or 1.2.3, as long as 1.2.3.4 is avoided. I'd rather not, if only because I can imagine ways in which parsers based on other assumptions, DNAMEs, and mappings to or from reverse forms could lead to trouble, but, again, that is a matter of conservative preferences, not because there would be serious problems for very careful applications.

Most applications will accept 127.1 for 127.0.0.1.

Anything that matches valid inet_addr() input is problematic not just dotted decimal.
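A quick way to see which strings fall into that set is to ask the C library directly. The sketch below uses inet_aton(), the BSD-derived sibling of inet_addr() with an unambiguous return value; the helper names `is_ipv4_literal` and `same_ipv4_literal` are hypothetical, and the answers depend on the platform's C library, so this is something to experiment with and not a normative definition.

```c
#define _DEFAULT_SOURCE
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* 1 if the local inet_aton() accepts s as an IPv4 address literal. */
int is_ipv4_literal(const char *s)
{
    struct in_addr a;

    return inet_aton(s, &a) != 0;
}

/* 1 if both strings are accepted and denote the same IPv4 address. */
int same_ipv4_literal(const char *x, const char *y)
{
    struct in_addr a, b;

    return inet_aton(x, &a) != 0 &&
           inet_aton(y, &b) != 0 &&
           a.s_addr == b.s_addr;
}
```

On common BSD-derived implementations, "127.1" and "0x7f.0x01" are both accepted and compare equal to "127.0.0.1", whereas "1.2.3.456" and "1.2.3.4.5" are rejected and would therefore be handed to the resolver as names.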

Mark


At 02:17 24/06/2008, Mark Andrews wrote:


> But, FWIW, my conclusions from my reasoning about this and the assumption that IDN TLDs are either a good idea or inevitable are that:
>
> * We should continue to restrict ASCII TLD strings (a subset of "LDH labels" in IDNA2008-speak) to alphabetic-only... no digits or hyphens at all.
>
> * We should apply the same rules to U-labels (native character string forms) for TLDs, i.e., no digits, no punctuation, and, preferably, at least two or three characters (in the "print position" sense of "character" long, not dependent on however Unicode coding happens to work) long.
>
> * We should permit whatever A-labels fall out from the above. I.e., A-labels contain hyphens by definition and often contain digits as a consequence of the coding.
>
> But that is just the conclusion I get to by applying my conclusions as postulates. If I do, the above rules more or less fall out. If one starts with different postulates, one ends up with different rules.

I'm happy with that as an outcome.

At 02:30 24/06/2008, Frank Ellermann wrote:

Mark Andrews wrote:

>> If you think it helps we could move RFC 952 to HISTORIC, it muddies the water when it shows up in ICANN documents published in 2008.

>> The decruft experiment (RFC 4450) missed RFC 952, because it was limited to standards, excluding "status: unknown".

> It is the current RFC that limits hostnames to LDH. -GW and -NIC etc. are just shoulds not musts.

RFC 1035 claims that RFC 952 "specifies the format of
HOSTS.TXT, the host/address table replaced by the DNS."
                                  ^^^^^^^^

RFC 1035 says "63", not "24", it is an Internet Standard, it was updated by RFC 1123, another STD. And it defines LDH.

For what purpose do you consider RFC 952 as current ? It has in essence the same LDH concept, only limited to "24". I'm not *generally* opposed to old RFCs with an unknown status, but RFC 952 is (apparently) "de facto" obsolete. It's just that nobody bothered to note the fact "officially" so far.

> To move RFC 952 to historic we need to write a RFC which consolidates all the changes to hostnames: syntax, lengths etc. into one document.

Okay, I wanted an "updates: 1123" for the <toplabel> issue, we don't need "updates: 1035" because RFC 1123 already did this, but adding "obsoletes: 952" to idnabis-rationale is a possibility. But I don't see the necessity to justify an "obsoletes: 952" in an IDNAbis memo. Unlike the <toplabel> bug, that is IMO required for IDNAbis, and it doesn't belong into say 2606bis.

We could of course also fix this bug in a separate IDNAbis memo, and while at it add an "obsoletes: 952" - short RFCs are good. Is that what you propose ?

Frank


At 03:17 24/06/2008, Mark Andrews wrote:


> Mark Andrews wrote:
>
> > At best there is guidance not to allocate a TLD which will potentially clash with a representation of a IPv4 address.
> >
> > 0xde.0xad.0xbe.0xef
> > 222.137.190.239
> > 0xdeadbeef
> > 0337.0211.0276.0357
> > 033653337357
> > 3735928559
>
> The 0x concept would break various of my scripts based on "if it only contains the characters '0.123456789' it might be an IPv4, and it is no FQDN (or vice versa)".

Well, your scripts are most probably already broken.

% telnet 0x7f.0x01
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host
%

Mark

At 04:01 24/06/2008, Vint Cerf wrote:

seems to me that clarity might suggest a distinct RFC. the WG would have to agree that such an RFC is in scope (I certainly believe it to be, on the grounds that we are having to look very carefully at how we word our definitions as we introduce IDNs, since we don't want to create unplanned side problems with the introduction of the xn-- format at each level).

vint



At 04:04 24/06/2008, Mark Andrews wrote:


> Mark Andrews wrote:
>
> >> If you think it helps we could move RFC 952 to HISTORIC, it muddies the water when it shows up in ICANN documents published in 2008.
>
> >> The decruft experiment (RFC 4450) missed RFC 952, because it was limited to standards, excluding "status: unknown".
>
> > It is the current RFC that limits hostnames to LDH. -GW and -NIC etc. are just shoulds not musts.
>
> RFC 1035 claims that RFC 952 "specifies the format of
> HOSTS.TXT, the host/address table replaced by the DNS."
>                                   ^^^^^^^^

The DNS replaced hosts.txt as the distribution method. It did not change the syntax.

> RFC 1035 says "63", not "24", it is an Internet Standard, it was updated by RFC 1123, another STD. And it defines LDH.
>
> For what purpose do you consider RFC 952 as current ? It has in essence the same LDH concept, only limited to "24". I'm not *generally* opposed to old RFCs with an unknown status, but RFC 952 is (apparently) "de facto" obsolete. It's just that nobody bothered to note the fact "officially" so far.

RFC 1035 does NOT define hostnames.

Hostnames, when stored in the DNS, are a subset of the available domain names in the DNS.

Hostnames are allowed to be up to 255 characters in length and labels up to 63 characters (RFC 1123).

The DNS is only capable of representing hostnames that are up to 253 characters in length which is what you get back to when you take the maximal wire format and convert it back into a presentation format that is only LDH for the labels.
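The arithmetic behind the 253 figure: in wire format every label is preceded by one length octet and the name ends with a zero octet for the root, so a name of N presentation characters (no trailing period) occupies N + 2 octets, and the 255-octet wire limit leaves 253 characters. A minimal sketch, assuming plain LDH text without \DDD escapes; the function name `wire_length` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Number of octets the name occupies in DNS wire format, or -1 if a
 * label is empty or longer than 63 octets. Assumes plain LDH text with
 * no escapes; a single trailing period is tolerated. */
int wire_length(const char *name)
{
    size_t total = 1;               /* terminating root label */
    const char *p = name;

    if (*p == '\0' || strcmp(p, ".") == 0)
        return 1;                   /* the root itself */
    while (*p != '\0') {
        size_t len = strcspn(p, ".");

        if (len == 0 || len > 63)
            return -1;
        total += 1 + len;           /* length octet + label bytes */
        p += len;
        if (*p == '.')
            p++;                    /* skip the separator */
    }
    return (int)total;
}
```

For example, "www.example.com" has 15 characters and needs 17 octets (4 + 8 + 4 + 1), with or without a trailing period.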

Hostnames don't have a trailing period. Domain names may have a trailing period.

For an arbitrary domain name the presentation format is anything up to ~1k in length.

Even RFC 1123 says that hostname syntax is modified from RFC 952.

2. GENERAL ISSUES

  This section contains general requirements that may be applicable to
  all application-layer protocols.
  2.1  Host Names and Numbers
     The syntax of a legal Internet host name was specified in RFC-952
     [DNS:4].  One aspect of host name syntax is hereby changed: the
     restriction on the first character is relaxed to allow either a
     letter or a digit.  Host software MUST support this more liberal
     syntax.
     Host software MUST handle host names of up to 63 characters and
     SHOULD handle host names of up to 255 characters.

DNS domain names and hostnames are different things. Most host names can be stored in and retrieved from the DNS.


At 04:44 24/06/2008, Frank Ellermann wrote:

Mark Andrews wrote:

> Well, your scripts are most probably already broken.

Yes, the missing IPv6 support is broken, OTOH the API doesn't support it.

> % telnet 0x7f.0x01
> Trying 127.0.0.1...
> telnet: connect to address 127.0.0.1: Connection refused
> telnet: Unable to connect to remote host
> %

D:\PROGRA~1\bin>rxgeturl 0x7f.0x01
   +++ "WindowsNT COMMAND D:\Programme\bin\RXGETURL.REX"
   113 *-* if sign( verify( arg( 1 ), '0.123456789' )) = 0
       >>>   "0"

   118 *-* else
   118 *-*   if SockGetHostByName( arg( 1 ), 'PEER.' ) = 0
       >>>     "1"

   119 *-* then
   119 *-*   return RXMSG( 'unknown host' arg( 1 ) value( 'h_errno' ))

Found no host 0x7f.0x01, as expected.

Frank


At 05:18 24/06/2008, Mark Andrews wrote:


> Mark Andrews wrote:
>
> > Well, your scripts are most probably already broken.
>
> Yes, the missing IPv6 support is broken, OTOH the API doesn't support it.

0x7f.0x01 is an IPv4 address for a large portion of the world. If you have a host called 0x7f.0x01.example.com and a search list containing example.com, then when someone attempts to telnet to 0x7f.0x01 it won't go to the address in the A record associated with 0x7f.0x01.example.com. It will instead connect to 127.0.0.1.

Lots of applications have something like the following.

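addr = inet_addr(argv[1]);              /* accepts hex, octal and short forms too */
if (addr != INADDR_NONE) {
        r = connect();                  /* arguments omitted in this sketch */
} else {
        he = gethostbyname(argv[1]);    /* only reached if it was not a literal */
        if (he != NULL) {
                i = 0;
                do {
                        /* try each returned address in turn */
                        memcpy(&addr, he->h_addr_list[i], sizeof(addr));
                        r = connect();
                        if (r >= 0)
                                break;
                } while (he->h_addr_list[++i] != NULL);
        }
}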

gethostbyname() itself may do a similar thing.

Firefox accepts ftp://0x7f.0x1/ as if it was ftp://127.0.0.1/, to give you an example of where it succeeds in a URL. I haven't looked to see at which level 0x7f.0x1 was treated as a raw address, but it doesn't generate DNS lookups.

YMMV but you need to be aware of what applications on various platforms accept as valid IP addresses and steer clear of using anything which could be potentially confusing.

I've seen some platforms zero fill decimal numbers. 067 was treated as 67 not 55. We have code to detect just such ambiguous use.
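The 67-versus-55 split is simply the difference between reading a component as decimal and reading it with C's literal rules. A minimal sketch of a detector, with hypothetical function names; it is not the code referred to above, only one way such a check could look.

```c
#include <stdlib.h>

/* C-style reading: base 0 makes strtol() honour a "0x" prefix (hex)
 * and a leading "0" (octal). */
long parse_c_style(const char *s)
{
    return strtol(s, NULL, 0);
}

/* Plain decimal reading: leading zeros are treated as padding. */
long parse_decimal(const char *s)
{
    return strtol(s, NULL, 10);
}

/* 1 if the two readings disagree, i.e. the component is ambiguous. */
int is_ambiguous_component(const char *s)
{
    return parse_c_style(s) != parse_decimal(s);
}
```

Here "067" is 55 under the C-style reading and 67 under the decimal one, so it is flagged, while "67" reads the same either way.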

Mark


At 06:41 24/06/2008, Frank Ellermann wrote:

Mark Andrews wrote:

> If you have a host called 0x7f.0x01.example.com and a search list containing example.com, then when someone attempts to telnet to 0x7f.0x01 it won't go to the address in the A record associated with 0x7f.0x01.example.com. It will instead connect to 127.0.0.1.

Testing this...

telnet 0xD0.0x4D.0xBC.0xA6 80<crlf>
GET / HTTP/1.0<crlf>
<crlf>
<crlf>

Odd, it works, as *not* expected by me. Also with Firefox 2 and <http://0xD0.0x4D.0xBC.0xA6/> They have no ftp or smtp, but http. No idea what it is good for, it could be a joke or a mistake.

> addr = inet_addr(argv[1]);
> if (addr != INADDR_NONE) {
>         r = connect();
> } else {
>         he = gethostbyname(argv[1]);

If folks want to support IPv4 given in odd (or old) formats it is what they want. But I think a popular browser, which won't let me use gopher on interesting ports claiming that "whois" or "daytime" are insecure, could be more restrictive with strange IPv4 formats.

> gethostbyname() itself may do a similar thing.

Apparently not, my snippet used a REXX API on top of an ordinary library, it didn't return 127.0.0.1 for the hex. format.

> YMMV but you need to be aware of what applications on various platforms accept as valid IP addresses and steer clear of using anything which could be potentially confusing.

Staying as far as possible away from trouble is good, but I don't consider the Firefox behaviour as feature. For purposes such as SURBL it could be annoying. And it is AFAIK not specified in any RFC.

> 067 was treated as 67 not 55.

Only hardcore C-fans can consider this as a bug...

We should not put idiosyncrasies of specific languages into an RFC. Look at say RFC 2821, there is no chance to get it wrong, it uses an unambiguous format, and if an application thinks that 067 is 55 it is broken.

All discussed <toplabel> variants forbid "digit only", that also covers "octal" <toplabel>s. For the hex. pitfall we could decree MUST NOT "0x" 1*HEXDIG. This case is anyway already mentioned in an ICANN proposal.
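Put together, the "contains a non-digit" variant plus the proposed hex exclusion could look like the following C sketch. It is an illustration under assumptions: the names `is_hex_literal` and `toplabel_ok` are hypothetical, the label is assumed to be ASCII, and the other <toplabel> variants discussed in this thread (for instance "must start with a letter") would need a different test.

```c
#include <ctype.h>
#include <string.h>

/* 1 if s is "0x" followed by one or more hex digits ("0x" 1*HEXDIG). */
static int is_hex_literal(const char *s)
{
    size_t i;

    if (s[0] != '0' || (s[1] != 'x' && s[1] != 'X') || s[2] == '\0')
        return 0;
    for (i = 2; s[i] != '\0'; i++)
        if (!isxdigit((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Hypothetical <toplabel> check: an LDH label of 1..63 characters that
 * is not all digits and not a C-style hex literal. */
int toplabel_ok(const char *s)
{
    size_t i, n = strlen(s), digits = 0;

    if (n == 0 || n > 63 || s[0] == '-' || s[n - 1] == '-')
        return 0;
    for (i = 0; i < n; i++) {
        if (isdigit((unsigned char)s[i]))
            digits++;
        else if (!isalpha((unsigned char)s[i]) && s[i] != '-')
            return 0;               /* not a letter, digit or hyphen */
    }
    if (digits == n)
        return 0;                   /* digit-only, covers "octal" too */
    return !is_hex_literal(s);
}
```

Under these rules "com" and "xn--p1ai" pass, while "123", "0357" and "0xdeadbeef" are rejected.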

Frank


At 07:10 24/06/2008, Frank Ellermann wrote:

Mark Andrews wrote:

> Hostnames are allowed to be up to 255 characters in length and labels up to 63 characters (RFC 1123).

> The DNS is only capable of representing hostnames that are up to 253 characters in length which is what you get back to when you take the maximal wire format and convert it back into a presentation format that is only LDH for the labels.

Right, we talked temporarily about different things. For the at the moment existing www.example.com I'd sometimes say that www is the host name, and www.example.com the FQDN. And at other times I'd say that this is host www.example.com, which is clearly inconsistent. You had www.example.com (the FQDN) in mind, I had www (the label) in mind.

Yes, I'm aware of the 253 limit, RFC 4408 mentions it. And of course the 255 in (among others) 2821bis, which doesn't elaborate how this can ever work if DNS doesn't support it.

It used to be irrelevant, but with those long xn--... labels it might be something "draft-idnabis-952bis" should mention.

> Domain names may have a trailing period.

Even in RFC 4408, after my appeal about last minute changes in AUTH48 met Dave Null. Nobody ever said that this caused a problem, fortunately it turned out that I was paranoid.

> For an arbitrary domain name the presentation format is anything up to ~1k in length.

Now you've lost me, 253 or 255 are not ~1K. Is what you are talking about the maximal length expected to work in an /etc/hosts file ?

Frank


At 08:25 24/06/2008, Mark Andrews wrote:


> Mark Andrews wrote:
>
> > Hostnames are allowed to be up to 255 characters in length and labels up to 63 characters (RFC 1123).
>
> > The DNS is only capable of representing hostnames that are up to 253 characters in length which is what you get back to when you take the maximal wire format and convert it back into a presentation format that is only LDH for the labels.
>
> Right, we talked temporarily about different things. For the at the moment existing www.example.com I'd sometimes say that www is the host name, and www.example.com the FQDN. And at other times I'd say that this is host www.example.com, which is clearly inconsistent. You had www.example.com (the FQDN) in mind, I had www (the label) in mind.
>
> Yes, I'm aware of the 253 limit, RFC 4408 mentions it. And of course the 255 in (among others) 2821bis, which doesn't elaborate how this can ever work if DNS doesn't support it.
>
> It used to be irrelevant, but with those long xn--... labels it might be something "draft-idnabis-952bis" should mention.
>
> > Domain names may have a trailing period.
>
> Even in RFC 4408, after my appeal about last minute changes in AUTH48 met Dave Null. Nobody ever said that this caused a problem, fortunately it turned out that I was paranoid.
>
> > For an arbitrary domain name the presentation format is anything up to ~1k in length.
>
> Now you've lost me, 253 or 255 are not ~1K. Is what you are talking about the maximal length expected to work in an /etc/hosts file ?

As I said, an arbitrary *domain name*. I did not say an arbitrary *hostname*.

Hostnames and domain names are NOT the same thing. There are lots of RFCs which confuse the two. There are lots of people who confuse the two.

A domain name is the owner of data in the DNS. A hostname is a name which is restricted to LDH labels and has associated addresses.

_sip._udp.isc.org is a domain name, it is not a hostname. isc.org is a domain name, it is also a hostname.

\000.example.net is a domain name. It is not a hostname. The first label in wire format is "0x01 0x00", length 1 with an octet of all zero bits.
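The syntactic half of that distinction can be sketched as a C predicate. This is a minimal illustration under assumptions: the function name `is_hostname` is hypothetical, it only tests the LDH syntax (labels of 1..63 letters, digits and hyphens, no leading or trailing hyphen, at most 253 characters, no trailing period), and it cannot tell whether any addresses are actually associated with the name.

```c
#include <ctype.h>
#include <string.h>

/* 1 if name is syntactically a hostname: dot-separated LDH labels of
 * 1..63 characters each, at most 253 characters in total, and no
 * trailing period. Says nothing about whether address records exist. */
int is_hostname(const char *name)
{
    size_t total = strlen(name);
    const char *p = name;

    if (total == 0 || total > 253)
        return 0;
    for (;;) {
        size_t i, len = strcspn(p, ".");

        if (len == 0 || len > 63)
            return 0;               /* empty label or label too long */
        if (p[0] == '-' || p[len - 1] == '-')
            return 0;
        for (i = 0; i < len; i++)
            if (!isalnum((unsigned char)p[i]) && p[i] != '-')
                return 0;           /* e.g. '_' or '\' */
        if (p[len] == '\0')
            return 1;               /* last label checked */
        p += len + 1;               /* move past the dot */
    }
}
```

With this check "isc.org" qualifies, while "_sip._udp.isc.org" and the presentation form "\000.example.net" do not, matching the examples above.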

Mark


At 06:54 24/06/2008, Martin Duerst wrote:

Hello John,

I mostly agree with you, but I have to disagree clearly on one point:

At 21:52 08/06/23, John C Klensin wrote:

> * We should apply the same rules to U-labels (native > character string forms) for TLDs, i.e., no digits, no > punctuation, and, preferably, at least two or three > characters (in the "print position" sense of "character" > long, not dependent on however Unicode coding happens to > work) long.

The "at least two or three characters" in my view is fine for alphabetic scripts (in the wider sense, i.e. including things such as Arabic (mostly just consonants) and Indic scripts (inherent vowels)).

But it can make a lot of sense to use one-character TLDs from ideographic and syllabic scripts (Han ideographs, Hangul, Ethiopic,...). Most of the characters in these scripts would be written with two or more letters in Latin and similar scripts, and most of them would be entered by two or more keystrokes. In particular, for Han ideographs and Hangul, there are more of them available than basic Latin two-letter combinations.

So we either make a script-type based distinction, or we leave it to ICANN altogether.
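The counting argument can be checked with a few lines of Python. The ranges used are the Hangul Syllables range U+AC00..U+D7A3 and the CJK Unified Ideographs block U+4E00..U+9FFF; the block size is an upper bound, since not every position in the block is assigned. The variable names are illustrative only.

```python
# Two-letter combinations over the basic Latin alphabet.
latin_two_letter = 26 ** 2

# Precomposed Hangul syllables, U+AC00 through U+D7A3.
hangul_syllables = 0xD7A3 - 0xAC00 + 1

# Size of the main CJK Unified Ideographs block, U+4E00 through U+9FFF.
cjk_block_size = 0x9FFF - 0x4E00 + 1

print(latin_two_letter, hangul_syllables, cjk_block_size)

# One "print position" is one code point here, whatever the encoding needs.
syllable = "\ud55c"  # a single Hangul syllable
print(len(syllable), len(syllable.encode("utf-8")), len(syllable.encode("utf-16-le")))
```

The last line also illustrates why a length rule should count characters rather than encoded octets: the same single syllable takes three octets in UTF-8 and two in UTF-16.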

Regards, Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp


At 11:44 24/06/2008, Vint Cerf wrote:

Martin, that sounds right to me too. John, are we missing some important collateral issue?

Vint


----- Original Message -----

From: idna-update-bounces@alvestrand.no <idna-update-bounces@alvestrand.no> To: John C Klensin <klensin@jck.com>; Mark Andrews <Mark_Andrews@isc.org> Cc: idna-update@alvestrand.no <idna-update@alvestrand.no> Sent: Mon Jun 23 21:54:32 2008 Subject: Re: A-label definition


At 12:34 24/06/2008, Patrik Fältström wrote:


On 24 jun 2008, at 05.18, Mark Andrews wrote:

> If you have a host called 0x7f.0x01.example.com and > a search list containing example.com then when someone attempts > to telnet to 0x7f.0x01 it won't go to the address in the A > record associated with 0x7f.0x01.example.com.
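Mark's example presumably relies on the classic BSD-style address parser (`inet_aton`), which accepts hexadecimal and octal parts and fewer than four parts, so that 0x7f.0x01 is taken as an address before any search list is consulted. That reading is an assumption; the message does not spell it out. The Python sketch below mimics those parsing rules with a hypothetical helper, `legacy_ipv4`, rather than calling the platform routine, whose behaviour varies between systems.

```python
import re

# One part of a legacy address: hexadecimal with 0x prefix, or digits.
PART = re.compile(r"0[xX][0-9a-fA-F]+|[0-9]+")


def parse_part(part):
    """Parse one part as hex (0x...), octal (leading 0) or decimal."""
    if part[:2].lower() == "0x":
        return int(part[2:], 16)
    if len(part) > 1 and part[0] == "0":
        return int(part, 8)  # raises ValueError for digits 8 and 9
    return int(part, 10)


def legacy_ipv4(text):
    """Return dotted-quad form if text parses as a legacy address, else None."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        return None
    numbers = []
    for part in parts:
        if not PART.fullmatch(part):
            return None
        try:
            numbers.append(parse_part(part))
        except ValueError:
            return None
    *head, last = numbers
    # Leading parts are single octets; the last part fills what remains.
    if any(n > 255 for n in head) or last >= 256 ** (4 - len(head)):
        return None
    value = last
    for index, n in enumerate(head):
        value |= n << (8 * (3 - index))
    return ".".join(str((value >> shift) & 255) for shift in (24, 16, 8, 0))


print(legacy_ipv4("0x7f.0x01"))
print(legacy_ipv4("www.example.com"))
```

Under these rules 0x7f.0x01 never reaches the resolver as a name, which is why the A record for 0x7f.0x01.example.com would not be found through the search list.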

Can't someone write a draft that says "search lists are bad for you"?

And then turn off the search list features in software....

   Patrik


At 13:14 24/06/2008, John C Klensin wrote:


--On Monday, 23 June, 2008 22:01 -0400 Vint Cerf <vint@google.com> wrote:

> seems to me that clarity might suggest a distinct RFC. the WG > would have to agree that such an RFC is in scope (I > certainly believe it to be on the grounds that we are having > to look very carefully at how we word our definitions as we > introduce IDNs since we don't want to create unplanned side > problems with the introduction of the xn-- format at each > level.

I still haven't been able to read every note in this thread, but, FWIW, this is the direction I think I was headed in with my earlier note today. After thinking about it a bit more, let me suggest the following:

(1) We not spend energy tracing the derivation of LDH, hostname, etc., rules and their relationships to IDNs. We also avoid spending any more energy than absolutely necessary on figuring out what "will be alphabetic" meant in RFC 1123, how it relates to some comments in RFC 1591, and how relevant either one is today.

(2) Instead, let's move in the direction of getting a BCP document together that makes recommendations about whatever the IETF thinks is the best/most desirable practice in this area. IMO, we should be talking about best practice recommendations because, for reasons that Mark and I have been discussing from our rather different perspectives, the underlying DNS protocol is able to handle a lot of strings that no sensible person would use... these are really recommendations about how the DNS is used in the protocols that call on it and hence about registration policies, not the DNS protocol itself. If we go that route, the BCP should explicitly update/qualify the language in 1034, 1123, 1591, etc.

(3) If we do such a BCP, the "no strings that have hyphens in positions 3 and 4 and are not A-labels" text should come out of "Rationale". It would either belong in the BCP (if consensus exists) or not at all. The IDNA2008 documents would presumably still have something to say about strings that start in "xn--" but that are not valid A-labels, at least wrt IDNA-aware applications.

(4) I don't have a strong opinion about whether the BCP effort should be on this WG's task list or not. Either way, it would clearly need to be carefully reviewed by the DNS Directorate and appropriate DNS WGs and mailing lists, since its applicability would be to DNS use and applications, not just to IDNs.

I think that document is more or less the one Mark was looking for when he said "...RFC which consolidates all the changes to hostnames: syntax, lengths, etc., into one document...", although perhaps that turns it back into a standards track document that also contains some recommendations.


I'd be happy to work with Mark and/or Frank to get such a document together if that would help. I don't think the text is very complicated. Getting consensus on its provisions probably would be, but I'd consider it a success in the more difficult areas if it could simply explain the tradeoffs to be considered carefully s.t. the "best practice" would be to consider them.

Does that help?

    john



At 13:17 24/06/2008, John C Klensin wrote:


--On Tuesday, 24 June, 2008 02:44 -0700 Vint Cerf <vint@google.com> wrote:

> Martin that sounds right to me too. John are we missing some > important collateral issue?

Nope. While single-character names make me nervous even with "ideographic" scripts, Martin is clearly correct.

Were we (for some value of "we") to go with a document that was partially or mostly a discussion of tradeoffs and recommendations, as suggested in my previous note, it could sensibly discuss the issues here and then leave the decisions to ICANN (presumably on a case-by-case basis, which is the way all TLD decisions are made).

     john

> ----- Original Message ----- > From: idna-update-bounces@alvestrand.no > <idna-update-bounces@alvestrand.no> To: John C Klensin > <klensin@jck.com>; Mark Andrews <Mark_Andrews@isc.org> Cc: > idna-update@alvestrand.no <idna-update@alvestrand.no> Sent: > Mon Jun 23 21:54:32 2008 > Subject: Re: A-label definition > > Hello John, > > I mostly agree with you, but I have to disagree clearly > on one point: > > At 21:52 08/06/23, John C Klensin wrote: > >> * We should apply the same rules to U-labels (native >> character string forms) for TLDs, i.e., no digits, no >> punctuation, and, preferably, at least two or three >> characters (in the "print position" sense of "character" >> long, not dependent on however Unicode coding happens to >> work) long. > > The "at least two or three characters" in my view is fine for > alphabetic scripts (in the wider sense, i.e. including things > such as Arabic (mostly just consonants) and Indic scripts > (inherent vowels)). > > But it can make a lot of sense to use one-character TLDs from > ideographic and syllabic scripts (Han ideographs, Hangul, > Ethiopic,...). Most of the characters in these scripts would > be written with two or more letters in Latin and simlar > scripts, most of the characters in these scripts would be > entered by two or more keystrokes,..., and in particular for > Han ideographs and Hangul, there are more of them available > than basic Latin two-letter combinations. > > So we either make a script-type based distinction, or we leave > it to ICANN altogether. > > Regards, Martin. > > ># -#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin ># University -#-# http://www.sw.it.aoyama.ac.jp ># mailto:duerst@it.aoyama.ac.jp > > _______________________________________________ > Idna-update mailing list > Idna-update@alvestrand.no > http://www.alvestrand.no/mailman/listinfo/idna-update




At 13:31 24/06/2008, John C Klensin wrote:


--On Tuesday, 24 June, 2008 07:10 +0200 Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com> wrote:

>... > Right, we talked temporarily about different things. For the > at the moment existing www.example.com I'd sometimes say that > www is the host name, and www.example.com the FQDN. And at > other times I'd say that this is host www.example.com, which > is clearly inconsistent. You had www.example.com (the FQDN) > in mind, I had www (the label) in mind. >...

And this is why I've tried (unsuccessfully) on several occasions to simply eliminate "host name" from our vocabulary. The term is used in different places and by different people to identify either

* the first label in an FQDN

* the first label in an FQDN iff it identifies a host rather than something else.

* an FQDN that identifies a host.

There are some similar issues with "domain": Can it be a single label or some combination of labels that is not a FQDN? Does it refer to an FQDN without a leaf node (e.g., a hostname-FQDN without the first label)?

Curiously, I was motivated to try to make disjoint { A-label, U-label, LDH-label } definitions precisely because I saw us going down the same road with terms like

* punycode string (does it contain the prefix or not? Used both ways in various contexts)

* IDN (a label or an FQDN? Does it or does it not include all-ASCII, non-punycode-encoded and prefixed, strings?. Used in all of those ways.)

IMO, whether the particular words chosen are right or need tuning and regardless of how we tune the definitions, ending up with three distinct and non-overlapping terms, possibly plus some things like:

* IDNA-invalid label: something that is none of a U-label, A-label, or LDH-label.

* IDNA-domain: an FQDN that contains at least one U-label or A-label (i.e., whose labels are not all unprefixed LDH strings). Note that even this is ambiguous, because it is unclear whether an IDNA-domain as defined that way would include SRV FQDNs with U-labels or A-labels in the rightmost label positions. So, if we need this at all, we need to do better.
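One way to see what "distinct and non-overlapping" means in practice is a classifier that returns exactly one category per label. The Python sketch below is a simplification under stated assumptions, not the IDNA2008 definitions: `classify` is a hypothetical name, any non-ASCII label is treated as a U-label without checking IDNA validity, an A-label is only checked for a Punycode round trip, and the contested rule about hyphens in positions 3 and 4 is applied as the rationale draft describes it.

```python
import re

LDH_SHAPE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def classify(label):
    """Sort a label into exactly one of four disjoint categories."""
    if not label.isascii():
        # Assumption: no check of code point validity or normalization.
        return "U-label"
    if not LDH_SHAPE.fullmatch(label):
        return "IDNA-invalid"
    if label[:4].lower() == "xn--":
        payload = label[4:].lower()
        try:
            decoded = payload.encode("ascii").decode("punycode")
            round_trip = decoded.encode("punycode").decode("ascii")
        except ValueError:  # UnicodeError is a subclass of ValueError
            return "IDNA-invalid"
        if not decoded.isascii() and round_trip == payload:
            return "A-label"
        return "IDNA-invalid"
    if label[2:4] == "--":
        # Hyphens in positions 3 and 4 without being an A-label.
        return "IDNA-invalid"
    return "LDH-label"


for sample in ["www", "xn--bcher-kva", "b\u00fccher", "ab--cd", "-www"]:
    print(sample, classify(sample))
```

Because every branch returns, no label can land in two categories, which is the property the three terms plus "IDNA-invalid" are meant to have.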

john



At 13:41 24/06/2008, JFC Morfin wrote:

At 12:34 24/06/2008, Patrik Fältström wrote: >On 24 jun 2008, at 05.18, Mark Andrews wrote: > > If you have a host called 0x7f.0x01.example.com and > > a search list containing example.com then when someone attempts > > to telnet to 0x7f.0x01 it won't go to the address in the A > > record associated with 0x7f.0x01.example.com. > >Can't someone write a draft that says "search lists are bad for you"? >And then turn off the search list features in software....

Dear Patrik, I am afraid that the market input would more probably be "why search lists are good for you". There are plenty of such search lists, used to remove advertising and to support aliases and local TLDs. Anyway, the dissemination of Unbound under Windows will make the Internet multirooted even before this WG is due to publish. This is why I suggested taking, and took, a more deductive approach rather than pursuing an inductive approach of _staying_ backward compatible. The only difference in principle is that our ML-DNS approach is to say: let us build a solution that _will_be_ backward compatible with non-IDNA and IDNA, and compatible with the state of the art and its architectural consequences.

Turning off means a billion software updates. If you do that, why not upgrade the software and make it ML-compatible by the same token, so you do not have nightmares anymore?


At 13:47 24/06/2008, JFC Morfin wrote:

At 13:17 24/06/2008, John C Klensin wrote: > > Martin that sounds right to me too. John are we missing some > > important collateral issue? > >Nope. While single-character names make me nervous even with >"ideographic" scripts, Martin is clearly correct. > >Were we (for some value of "we") to go with a document that was >partially or mostly a discussion of tradeoffs and >recommendations, as suggested in my previous note, it could >sensibly discuss the issues here and then leave the decisions to >ICANN (presumably on a case-by-case basis, which is the way all >TLD decisions are made). > > john

Full agreement with John's last two mails (about the BCP and this one). This certainly fits in the ICANN new gTLD evaluation process, where they also consider other forms of ambiguity and risk due to the way proposed TLDs are composed. jfc


At 18:16 24/06/2008, John C Klensin wrote:


--On Tuesday, 24 June, 2008 00:51 +0200 Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com> wrote:

>> one could still have an FQDN of 1.2.3.4.5 or 1.2.3, as >> long as 1.2.3.4 is avoided. > > Some stupid software looks for "only dots or digits" for > its decision "might be a FQDN or IPv4", and it doesn't > count dots, or check that 1.2.3.456 is no IPv4. Now you > could say "this software is broken, fix it", but it was > good enough to handle IPv4 vs. FQDN implicitly. > >> not because there would be serious problems for very >> careful applications. > > Sure, some protocols insist on square brackets for IPs, > after that decision the whole issue doesn't exist. But > other popular protocols don't for IPv4, notably STD 66.

Frank,

There are two schools of protocol design here. One starts from the assumption that heuristics are a bad idea, partially because different folks use different ones. The other likes them. The first view gets one "address literals must be in square brackets", the second gets the discussion you and Mark are having and an argument about whether 1.2.3.456 ought to be looked up as a domain name.

I can't usefully participate in that discussion because every example of something that might be considered as an address (or domain name) leads me to believe even more strongly that any protocol that does not make a clear lexical distinction between address literals and domain names is defective, if not broken.

The bad news is that we have lots of protocols that don't -- SMTP is definitely in the minority in requiring a lexical distinction. URNs (BCP66/RFC3406) is certainly not the only one; the problem goes back to most UIs to Telnet and FTP if not earlier.
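The two schools can be contrasted in a short Python sketch. The function names are hypothetical; the "dots and digits" heuristic is the one Frank describes, the strict check uses the standard `ipaddress` module, and the bracket test stands for a purely lexical rule of the kind SMTP uses for address literals.

```python
import ipaddress
import re


def naive_looks_like_ipv4(text):
    """Heuristic from the thread: only dots and digits means 'address'."""
    return re.fullmatch(r"[0-9.]+", text) is not None


def strict_ipv4(text):
    """True only for four decimal octets, each in the range 0..255."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True


def is_bracketed_literal(text):
    """Lexical rule: an address literal is whatever sits in square brackets."""
    return text.startswith("[") and text.endswith("]")


for candidate in ["1.2.3.4", "1.2.3.456", "1.2.3.4.5", "[1.2.3.4]"]:
    print(candidate,
          naive_looks_like_ipv4(candidate),
          strict_ipv4(candidate),
          is_bracketed_literal(candidate))
```

With the lexical rule, 1.2.3.456 is simply a domain name because it has no brackets; with heuristics, the answer depends on which heuristic a given implementation happens to use.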

    john


At 09:06 25/06/2008, Frank Ellermann wrote:

John C Klensin wrote:

> let's move in the direction of getting a BCP document together > that makes recommendations about whatever the IETF thinks is > the best/ most desirable practice in this area.

> IMO, we should be talking about best practice recommendations > because, for reasons that Mark and I have been discussing from > our rather different perspectives, the underlying DNS protocol > is able to handle a lot of strings that no sensible person would > use... these are really recommendations about how the DNS is > used in the protocols that call on it and hence about > registration policies, not the DNS protocol itself. If we go > that route, the BCP should explicitly update/qualify the > language in 1034, 1123, 1591, etc.

Sounds good, BCP or PS will do. You forgot "obsoletes 952".

> the "no strings that have hyphens in positions 3 and 4 and > are not A-labels" text should come out of "Rationale". It > would either belong in the BCP (if consensus exists) or not > at all.

Not at all. Folks hate obscure restrictions. Even if their only purpose is to show that it's a bad idea and futile (for the "nic", "whois", and "www" labels in the test-tlds draft).

> The IDNA2008 documents would presumably still have something > to say about strings that start in "xn--" but that are not > valid A-labels, at least wrt IDNA-aware applications.

+1

> I don't have a strong opinion about whether the BCP effort > should be on this WG's task list or not.

Without IDNA it would be the "editorial" erratum that it is at the moment. Let's do it here.

> I don't think the text is very complicated.

+1 With your "must start with ALPHA" proposal a <toplabel> is clear, there is no chance to confuse it with 0x or octal, and "xn--" does start with ALPHA => problem solved. Compare <http://article.gmane.org/gmane.ietf.usenet.format/30441>

> Getting consensus on its provisions probably would be

"No IETF consensus for IDN TLDS, film@11". Not really.

> Does that help?

Yes... </t><figure><artwork type="abnf">

 hostname = *( label "." ) toplabel 
 label    = alnum [ *61( ldh ) alnum ]
 toplabel = ALPHA [ 1*61( ldh ) ] alnum
 ldh      = ALPHA / DIGIT / "-"    ; letter digit hyphen
 alnum    = ALPHA / DIGIT 
 ALPHA    = <see STD 68>           ; ASCII letter 
 DIGIT    = <see STD 68>           ; 0-9

</artwork></figure><t> ...something like this.
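The ABNF above translates fairly directly into a regular expression. The Python sketch below is one possible reading, not part of the proposal: `matches_hostname` is a hypothetical name, and the sketch lets the middle run of `ldh` in a label be empty so that two-character labels such as "de" match, which is an assumption about the intent of the `label` rule.

```python
import re

ALNUM = "[A-Za-z0-9]"
LDH = "[A-Za-z0-9-]"

# label: one alnum, optionally followed by up to 61 ldh and a closing alnum.
LABEL = f"{ALNUM}(?:{LDH}{{0,61}}{ALNUM})?"

# toplabel: starts with ALPHA, ends with alnum, 2 to 63 characters in all.
TOPLABEL = f"[A-Za-z]{LDH}{{0,61}}{ALNUM}"

# hostname: any number of "label." followed by a toplabel.
HOSTNAME = re.compile(f"(?:{LABEL}\\.)*{TOPLABEL}")


def matches_hostname(name):
    """True if the whole string matches the hostname rule sketched above."""
    return HOSTNAME.fullmatch(name) is not None


for name in ["www.example.com", "example.xn--p1ai", "1.2.3.4", "0x7f.0x01"]:
    print(name, matches_hostname(name))
```

Requiring the top label to start with a letter is what keeps dotted-decimal and 0x-prefixed strings out, while "xn--" labels still qualify because they start with "x".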

Frank


At 18:10 27/06/2008, Stephane Bortzmeyer wrote:

On Mon, Jun 23, 2008 at 08:52:52AM -0400,

John C Klensin <klensin@jck.com> wrote 
a message of 169 lines which said:

> Should IETF try to impose any requirements or limitations that would > apply strictly to TLD labels, or should we decide that they are just > "policy" and leave them to ICANN?

My view is not that ICANN is perfect and will do perfect things but that:

1) TLDs are not special. They are domains like other domains and I'm very reluctant to make them a special case IN THE PROTOCOL.

2) Sure, ICANN can do bad things. But why would the IETF be better here?


At 03:22 20/06/2008, Frank Ellermann wrote:

Hi, I thought I knew what an "A-label" is, but looking into draft-ietf-idnabis-rationale-00 I found that this is not the case:

(1) LDH label, that's AFAIK 1 to 63 letters, digits,
   and hyphens, not starting or ending with a hyphen.

All LDH labels are technically valid host name labels, because that's what the relevant IETF standards say.

(2) Toplabel, that is at the moment a shaky RFC 1123
   erratum.  IMO it should be the same as LDH label,
   but including at least one non-digit.  It needs
   an "updates 1123" in idnabis-rationale.  While at
   it we could also say "not only a single letter".

   If we do the latter:  Folks often need syntax in
   the form of STD 68 ABNF in their drafts, and we
   can copy <toplabel> from RFC.ietf-usefor-usefor

   If we don't do this we can copy <toplabel> from
   RFC 4408.  You can guess who needed this syntax,
   and arrived at a slight difference.  <shudder />

   JFTR, a USEFOR co-Chair (i.e. Harald) asked IAB
   and ICANN (IIRC) about this issue.  Somebody
   found a simpler <toplabel> version for the "not
   only a letter" variant, I can find it if needed.

(3) U-label, the definition should mention that this
   is about labels with at least one non-ASCII code
   point, otherwise we would get a confusing overlap
   with LDH labels.

(4) A-label, that is apparently the proper subset of
   valid LDH labels (see 1) starting with "xn--",
   and corresponding to valid U-labels (see 3).  By
   definition an A-label is also a valid <toplabel>,
   and we don't need to talk about this.

There's an open question about "valid U-toplabel", is more than one code point required. I think it is not required: Depending on the script "one code point" can express things that would need several letters in other scripts. ICANN can sort this out.

(5) I-label (making up a new term for this article):
   An "I-label" is a U-label in legacy non-Unicode
   and non-ASCII charsets, as found in RFC 3987 IRIs,
   or more precisely in labels of an <ihost> for a
   corresponding registered DNS host name.
   The typical example is "bücher", unless I screw up
   and send this as UTF-8.  Please assume that I want
   windows-1252 or iso-8859-1, not UTF-8.
   Maybe idnabis-rationale should define I-label with
   a reference to RFC 3987.  I also don't see why the
   U-label is limited to a "standard Unicode encoding
   form", that would mean "can be SCSU, but not BOCU,
   UTF-7, UTF-1, GB 18030, etc.".  IMO the question of
   encoding forms misses some points, maybe we should
   simply rename U-label to I-label:
   "I" as in I18N, IDNAbis, IRI is intuitive and KISS.
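The point that a label needs only code points, whatever the charset on the wire, can be illustrated in Python. The helper name `code_points` is hypothetical; the charset names are the standard Python codec names for the charsets mentioned above.

```python
def code_points(data, charset):
    """Decode octets with the given charset and list the code points."""
    return [ord(ch) for ch in data.decode(charset)]


label = "b\u00fccher"  # the example label, with U+00FC for the umlaut

as_latin1 = label.encode("iso-8859-1")
as_cp1252 = label.encode("windows-1252")
as_utf8 = label.encode("utf-8")

# The octets differ between the legacy charsets and UTF-8 ...
print(as_latin1, as_cp1252, as_utf8)

# ... but once decoded with the right charset the code points are the same.
print(code_points(as_latin1, "iso-8859-1") == code_points(as_utf8, "utf-8"))
```

Guessing the charset wrongly is the failure mode to worry about: the legacy octets for this label are not valid UTF-8, so decoding them as UTF-8 raises an error instead of silently producing a different label.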

Above all I disagree with the proposed decree that all LDH labels with hyphens in positions 3 and 4 have to be A-labels. That could require updating hundreds of RFCs simultaneously, followed by a worldwide upgrade.

Looking at this from the other side: If a worldwide upgrade would work we could simply decree that host names can use UTF-8, and be done with it. As this is obviously wrong we cannot say that certain LDH labels are "invalid", we can only define valid A-labels, and anything else is whatever it is, xn--cocacola.
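To make the scope of the contested rule concrete, the following Python sketch flags labels with hyphens in positions 3 and 4 and separately notes whether they carry the "xn--" prefix. The function names and the sample labels other than xn--cocacola are illustrative assumptions.

```python
def has_hyphens_in_3_and_4(label):
    """True if the third and fourth characters are both hyphens."""
    return label[2:4] == "--"


def has_ace_prefix(label):
    """True if the label starts with the IDNA prefix, in any letter case."""
    return label[:4].lower() == "xn--"


samples = ["www", "xn--bcher-kva", "xn--cocacola", "ab--cd", "a--b"]
for label in samples:
    print(label, has_hyphens_in_3_and_4(label), has_ace_prefix(label))
```

Under the proposed decree every label for which the first test is true would have to be a valid A-label; under the alternative argued here, only valid A-labels are defined and the rest are left as ordinary LDH labels.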

Frank


At 03:51 20/06/2008, YAO Jiankang wrote:


IMO,

   It is better if we clarify 3 definitions.
   LDH, which is the domain name label defined in RFC 1034 and 1035
   U-label, which contains at least one non-ASCII character
   A-label, which is transformed from a U-label with the algorithm (punycode), plus a prefix such as XN--
   (a label with the prefix XN-- that cannot be converted to a U-label is not a valid A-label)
   LDH label includes A-label.
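The relationship between the two forms can be sketched with Python's built-in "punycode" codec. This is a bare illustration of the transformation only: `to_a_label` and `to_u_label` are hypothetical names, and the sketch does none of the validity checks, case handling or normalization that the IDNA documents require.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label):
    """Punycode-encode a U-label and add the prefix."""
    if u_label.isascii():
        raise ValueError("a U-label needs at least one non-ASCII code point")
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label):
    """Strip the prefix and Punycode-decode the remainder."""
    if a_label[:4].lower() != ACE_PREFIX:
        raise ValueError("missing xn-- prefix")
    return a_label[4:].encode("ascii").decode("punycode")


print(to_a_label("b\u00fccher"))
print(to_u_label("xn--bcher-kva") == "b\u00fccher")
```

Defining the A-label as the output of `to_a_label` on a valid U-label, rather than as "anything starting with xn--", is what makes a prefixed string that fails to decode fall outside the definition.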


YAO Jiankang





----- Original Message -----

From: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com> To: <idna-update@alvestrand.no> Sent: Friday, June 20, 2008 9:22 AM Subject: A-label definition (was: IDN test TLDs)



At 02:02 21/06/2008, Kenneth Whistler wrote:

I'll skip the various issues related to A-label definition that John addressed, but... Frank Ellermann stated:

> ... I also don't see why the > U-label is limited to a "standard Unicode encoding > form", that would mean "can be SCSU, but not BOCU, > UTF-7, UTF-1, GB 18030, etc.". IMO the question of > encoding forms misses some points, maybe we should > simply rename U-label to I-label: > > "I" as in I18N, IDNAbis, IRI is intuitive and KISS.

First of all, a standard Unicode Character Encoding Form could be UTF-8, UTF-16, or UTF-32, but *not* SCSU, which is not a Unicode Character Encoding Form at all, by Unicode Standard definitions. (I realize that SCSU is a registered charset, but that is an entirely different thing.)

draft-ietf-idnabis-rationale-00.txt states that:

 * A "U-label" is an IDNA-valid string of Unicode characters,
   expressed in a standard Unicode Encoding Form, normally
   UTF-8 in an Internet transmission context...
   

I assume the 2nd part is obvious in this discussion (i.e. why UTF-8, rather than UTF-16 or UTF-32).

The reason why it should be a standard Unicode Encoding Form can be seen from the proposed protocol definition:

draft-ietf-idnabis-protocol-01.txt requires that:

 Some system routine, or a localized front-end to the IDNA
 process, ensures that the proposed label is a Unicode string.
 That string MUST be in Unicode Normalization Form C.
 

Unicode text content compressed by the SCSU algorithm is a sequence of bytes that is neither a Unicode string nor in Normalization Form C. And it could be put in NFC only by first extracting it from the compressed form into one of the 3 Unicode Character Encoding Forms, and then normalizing it by the UAX #15 algorithm -- which works on Unicode strings (in standard Encoding Forms).
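The order of operations described here (first get a Unicode string from a standard encoding form, then normalize) can be sketched in Python with the standard `unicodedata` module. The helper name `to_nfc_string` and its restriction to three codec names are assumptions made for the illustration.

```python
import unicodedata

STANDARD_FORMS = ("utf-8", "utf-16", "utf-32")


def to_nfc_string(data, encoding_form):
    """Decode octets from a standard encoding form, then apply NFC."""
    if encoding_form not in STANDARD_FORMS:
        raise ValueError("not a standard Unicode encoding form")
    return unicodedata.normalize("NFC", data.decode(encoding_form))


# "u" followed by a combining diaeresis, i.e. not yet in NFC.
decomposed = "bu\u0308cher"

for form in STANDARD_FORMS:
    normalized = to_nfc_string(decomposed.encode(form), form)
    print(form, normalized == "b\u00fccher", len(normalized))
```

Normalization works on code points, so it can only run after decoding; a compressed byte stream has to be expanded into one of these forms before the NFC requirement even applies.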

--Ken


At 12:50 21/06/2008, John C Klensin wrote:

--On Friday, 20 June, 2008 03:22 +0200 Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com> wrote:

> Hi, I thought I know what an "A-label" is, but looking > into draft-ietf-idnabis-rationale-00 I found that this > is not the case: > > (1) LDH label, that's AFAIK 1 to 63 letters, digits, > and hyphens, not starting or ending with a hyphen.

And not having hyphens in the third and fourth positions, according to the current definition in idnabis-rationale. I've clarified this slightly in the working draft for rationale-01. If the WG concludes that it doesn't want the restriction prohibiting non-IDNA labels with hyphens in those positions, that will need to be revised.

> All LDH labels are technically valid host name labels, > because that's what the relevant IETF standards say.

Yes. But the terminology in "rationale" is a little different, and says so. Note that IIR 1035 doesn't say "LDH", and 1123 doesn't either, they say "host name".

> (2) Toplabel, that is at the moment a shaky RFC 1123 > erratum. IMO it should be the same as LDH label, > but including at least one non-digit. It needs > an "updates 1123" in idnabis-rationale. While at > it we could also say "not only a single letter".

This is really a separate discussion. I hope that it is out of scope for this WG, but that is certainly subject to debate. As you know, I've written the IESG asking them to give some priority to validating that erratum. On the other hand, 1123 is actually quite clear, IMO: it says "alphabetic" and meant it. And "alphabetic", in ordinary, common-sense, usage means "no digits" (if 1123 had intended "alphanumeric", it would have said so). We probably should extend the 1123 rule to permit those hyphens but, IMO, that is as far as we should go.

> If we do the latter: Folks often need syntax in > the form of STD 68 ABNF in their drafts, and we > can copy <toplabel> from RFC.ietf-usefor-usefor > > If we don't do this we can copy <toplabel> from > RFC 4408. You can guess who needed this syntax, > and arrived at a slight difference. <shudder /> > > JFTR, a USEFOR co-Chair (i.e. Harald) asked IAB > and ICANN (IIRC) about this issue. Somebody > found a simpler <toplabel> version for the "not > only a letter" variant, I can find it if needed.

A combination of I-Ds, informational and experimental documents, and opinions that don't represent demonstrated community consensus. Sorry if I don't find much authority in these.

> (3) U-label, the definition should mention that this > is about labels with at least one non-ASCII code > point, otherwise we would get a confusing overlap > with LDH labels.

Correction made in the working version of "rationale-01". Thanks.

> (4) A-label, that is apparently the proper subset of > valid LDH labels (see 1) starting with "xn--", > and corresponding to valid U-labels (see 3). By > definition an A-label is also a valid <toplabel>, > and we don't need to talk about this.

By whose definition? By the definition in 1123, one cannot have an A-label as a TLD label, because it isn't alphabetic. In that context, all the ICANN test collection proves is that one can violate 1123 without causing very many problems, at least for the mostly-web applications that have been used in tests.

As noted above, I think we should probably change that, but it means updating 1123, which is not obviously in the WG's charter.

> There's an open question about "valid U-toplabel", is more than one code point required. I think it is not required: Depending on the script "one code point" can express things that would need several letters in other scripts. ICANN can sort this out.

It is not clear who gets to "sort this out". When RFC 1591 was written, its author and contributors assumed (and discussed the assumption) that, if and when future TLDs were allocated, they would be allocated according to the 2-3-4 (ccTLDs, gTLDs, ARPA) rule and would be all-alphabetic. That document anticipated neither IDNs nor ICANN decisions to allocate gTLDs with names of more or less arbitrary length.

But the enforcers of the validity of DNS labels (at any level) have always been the application protocols. If the IETF concludes that there are substantive reasons to prohibit one-character labels, or labels containing all (or any) digits at the top level, etc.; incorporates those rules into protocol syntax; and we are lucky enough to have anyone pay attention to us, then ICANN ends up in a very difficult situation in which they can allocate strings that don't follow the rules but find that applications won't look them up or otherwise consider them valid.

So, if it is important enough that we can convince others that we have a valid basis for doing so, we still have the ability to do the sorting out. And, again, I hope that work doesn't belong to this WG.

> (5) I-label (making up a new term for this article): An "I-label" is an U-label in legacy non-Unicode and non-ASCII charsets, as found in RFC 3987 IRIs, or more precisely in labels of an <ihost> for a corresponding registered DNS host name.

> The typical example is "bücher", unless I screw up and send this as UTF-8. Please assume that I want windows-1252 or iso-8859-1, not UTF-8.

> Maybe idnabis-rationale should define I-label with a reference to RFC 3987. I also don't see why the U-label is limited to a "standard Unicode encoding form", that would mean "can be SCSU, but not BOCU, UTF-7, UTF-1, GB 18030, etc.". IMO the question of encoding forms misses some points, maybe we should simply rename U-label to I-label:

> "I" as in I18N, IDNAbis, IRI is intuitive and KISS.

I believe that 3987, in permitting non-Unicode labels, is either:

(i) a piece of user interface specification, as provided for in Section 4.1 and 5.1 of draft-ietf-idnabis-protocol-01, and not suitable for use "on the wire" or

(ii) a serious threat to interoperability.

See Ken's note for an explanation of why "standard Unicode encoding form" is exactly the right definition.

> Above all I disagree with the proposed decree that all LDH labels with a hyphen in position 3 and 4 have to be A-labels. That could require to update hundreds of RFCs simultaneously, followed by a worldwide upgrade.

It does no such thing, IMO. Remember that any use of the IDNA "trick" (i.e., treating some domain names as a special encoding of something else, rather than whatever they appear to be) at all requires that we take some subset of domain names that were previously interpreted as themselves and start interpreting them as something else. It doesn't require "update hundreds of RFCs simultaneously, followed by a worldwide upgrade". What it does require is "if you are going to be IDNA-aware and -capable, then you need to interpret prefixed labels in a particular way and take some other precautions". Nothing new.

The current rule (banning anything with "--" in positions two and three that isn't a valid A-label) in IDNA2008 is extremely conservative wrt prefix forms as a means of avoiding nonsense in the present and preserving the ability to introduce new special codings in the future. It still doesn't change anything for applications that are not IDNA-aware. For IDNA-aware registries, it prohibits registering such names as a precaution. That isn't much of a restriction, since no one has really demonstrated a need for such strings. And, for IDNA-aware lookup applications, it recommends not looking the strings up, at least unless the application is sure it knows how to interpret them. Not a really big deal, IMO.
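A minimal Python sketch of how an IDNA-aware application might sort labels under this conservative rule follows. It assumes the rule concerns "--" in the third and fourth characters of a label, the function name `classify_label` and its three result strings are hypothetical, and the A-label test is only a Punycode round trip: the real validity checks on the resulting U-label (for example Disallowed or Unassigned code points) are not modelled here.

```python
def classify_label(label: str) -> str:
    """Sort an ASCII label into 'ordinary', 'A-label', or 'reserved'."""
    lower = label.lower()
    if lower[2:4] != "--":
        return "ordinary"  # no "--" in the third and fourth characters
    if not lower.startswith("xn--"):
        return "reserved"  # some other prefix, no defined meaning
    try:
        payload = lower[4:].encode("ascii")
        decoded = payload.decode("punycode")
        round_trips = decoded.encode("punycode") == payload
    except ValueError:  # UnicodeError is a subclass of ValueError
        return "reserved"
    # An A-label must correspond to a label with at least one
    # non-ASCII code point and must be the canonical encoding of it.
    if decoded.isascii() or not round_trips:
        return "reserved"
    return "A-label"


for sample in ("example", "xn--bcher-kva", "ab--cd", "xn--abc-"):
    print(sample, classify_label(sample))
```

An application that is not IDNA-aware never runs such a check, which is why the rule changes nothing for it.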

If the WG concludes that is excessive and wants to drop back all or part of the way to a rule that merely says that, if the label starts in "xn--", it must be an A-label, I won't lose any sleep over it... but let's not try to get there by hyperbole about global changes to RFCs and worldwide upgrades.

> Looking at this from the other side: If a worldwide upgrade would work we could simply decree that host names can use UTF-8, and be done with it. As this is obviously wrong we cannot say that certain LDH labels are "invalid", we can only define valid A-labels, and anything else is whatever it is, xn--cocacola.

The latter is the one thing you cannot do because it prevents future expansion within the Unicode set. A valid A-label today is one that satisfies the U-label conversion rules, e.g., it doesn't map to Disallowed or Unassigned Unicode code points. If one decides that an A-label that cannot satisfy those rules is "whatever it is", one ends up with a string with two possible interpretations depending on the version of Unicode being used by the application.
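The dependence on the Unicode version can be seen with Python's standard `unicodedata` module, which ships one fixed version of the Unicode Character Database per interpreter release. This is only a rough sketch: the helper name `unassigned_code_points` is hypothetical, and general category `Cn` (which also covers noncharacters) is used as a stand-in for the Unassigned class of the IDNA2008 tables, which this code does not implement.

```python
import unicodedata


def unassigned_code_points(label: str) -> list[str]:
    """List code points of `label` that this Unicode database reports as Cn."""
    return [f"U+{ord(ch):04X}"
            for ch in label
            if unicodedata.category(ch) == "Cn"]


# The answer for a given label can change when the interpreter is
# upgraded to a newer Unicode database.
print("Unicode database version:", unicodedata.unidata_version)
print(unassigned_code_points("b\u00fccher"))
```

Two applications built against different database versions can therefore disagree about the same prefixed label, which is the two-interpretations problem described above.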

    john



At 17:56 27/06/2008, Stephane Bortzmeyer wrote:

On Fri, Jun 20, 2008 at 09:51:21AM +0800,

YAO Jiankang <yaojk@cnnic.cn> wrote 
a message of 116 lines which said:

> LDH, which is the domain name label defined in RFC 1034 and 1035

LDH is a nice name (although "traditional host name syntax" is more common) but I suggest changing the references. RFC 1034, section 3.5, and RFC 1035, section 2.3.1, do not define LDH; they just reuse traditional host name definitions (which are now in RFC 1123, section 2.1).

AFAIK, there is not one authoritative RFC where you can find the official grammar for the "traditional host name syntax", LDH. Even RFC 1123 is written (on this point) as a diff to RFC 952.
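The "diff" character of the RFC 1123 text can be sketched in Python as two label-level patterns that differ only in the first character. This is a simplification under stated assumptions: the pattern names and the helper `accepted_by` are hypothetical, length limits and the RFC 952 ban on single-character names are left out, and "3com" is just an illustrative input.

```python
import re

# RFC 952 style: the first character must be a letter.
RFC952_STYLE = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")
# RFC 1123 style: the first character may be a letter or a digit.
RFC1123_STYLE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def accepted_by(label: str) -> dict[str, bool]:
    """Report which of the two simplified patterns accepts `label`."""
    return {
        "RFC 952 style": RFC952_STYLE.fullmatch(label) is not None,
        "RFC 1123 style": RFC1123_STYLE.fullmatch(label) is not None,
    }


for sample in ("example", "3com", "-bad", "bad-"):
    print(sample, accepted_by(sample))
```

Neither pattern is taken from a single authoritative grammar, which is the gap pointed out in this message.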

At 18:00 27/06/2008, Stephane Bortzmeyer wrote:

On Fri, Jun 20, 2008 at 05:02:37PM -0700,

Kenneth Whistler <kenw@sybase.com> wrote 
a message of 50 lines which said:

> draft-ietf-idnabis-rationale-00.txt states that:
>
> * A "U-label" is an IDNA-valid string of Unicode characters, expressed in a standard Unicode Encoding Form, normally UTF-8 in an Internet transmission context...

Frankly, I do not see why we even need to specify a Unicode Encoding Form here, since it has no use in the protocol. Punycode is necessary for interoperability but Unicode labels can be exchanged in whatever format people wish, since they do not appear on the wire with IDNA.

I would therefore suggest a more abstract definition:

  • A "U-label" is an IDNA-valid string of Unicode characters,
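The point that only the code points matter can be illustrated with a short Python sketch. It assumes the label has already passed whatever IDNA validity checks apply (none are performed here), and the helper name `to_a_label` is hypothetical. The same label arriving as ISO-8859-1 bytes or as UTF-8 bytes decodes to the same sequence of code points and therefore to the same Punycode form.

```python
def to_a_label(raw: bytes, charset: str) -> str:
    """Decode `raw` to code points, then apply Punycode plus the xn-- prefix."""
    code_points = raw.decode(charset)
    return "xn--" + code_points.encode("punycode").decode("ascii")


# The same label as it might arrive in two different charsets.
latin1_bytes = "b\u00fccher".encode("iso-8859-1")
utf8_bytes = "b\u00fccher".encode("utf-8")

print(latin1_bytes != utf8_bytes)  # the byte sequences differ
print(to_a_label(latin1_bytes, "iso-8859-1"))
print(to_a_label(utf8_bytes, "utf-8"))
```

Both calls yield the same string, because Punycode operates on code points and never sees the encoding form used in transit.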

