090628 - Wil-Vint ping-pong

From Wikidna.org

28 June 2009 13:20 Wil Tan

Hi folks,

RFC 3492 contains a mixed-case annotation feature which, though not used in IDNA2003, may affect the IDNA2008 specs. In particular, basic code points ([a-z]) that are left unencoded in Punycode may be substituted in upper case, and the result of the ToUnicode operation will preserve them. For example,

ToUnicode("xn--RSUM-bpad.com") = "RéSUMé.com"
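The case-preserving behaviour can be reproduced with a minimal sketch, assuming Python's built-in "punycode" codec as a stand-in for the decoding step of ToUnicode. It does not perform the other ToUnicode steps, and the helper name decode_xn_label is hypothetical.

```python
def decode_xn_label(label: str) -> str:
    """Decode a single XN-label with the raw Punycode codec.

    Hypothetical helper: only strips the ACE prefix and runs the
    Punycode decoder; no Nameprep or validity checks are applied.
    """
    if label[:4].lower() != "xn--":
        return label  # not an XN-label, return unchanged
    return label[4:].encode("ascii").decode("punycode")


mixed = decode_xn_label("xn--RSUM-bpad")
lower = decode_xn_label("xn--rsum-bpad")
print(mixed)  # RéSUMé
print(lower)  # résumé
```

The basic code points come back in whatever case they had in the input label; only the non-ASCII code points are reconstructed from the encoded tail.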

From reading the rationale and protocol drafts, I'm not entirely sure whether the input is considered an A-label. The output is certainly not a U-label, since "R", "S", "U" and "M" are disallowed code points.

I don't know if this is a problem, but it may warrant at least some discussion in section 5.4 of idnabis-protocol?

28 June 2009 13:26 Vint Cerf

If we adopt a policy of mapping prior to lookup, and if we map upper case to lower case, it may be that xn--RSUM-bpad.com will be changed to xn--rsum-bpad.com prior to lookup and it will work.

28 June 2009 15:13 Wil Tan

Yes, that would work. Should we also discourage the use of such labels, and explicitly say that XN-labels containing uppercase characters are not A-labels?

28 June 2009 15:47 Vint Cerf

Well, this is tricky, especially if we adopt a practice of mapping for lookup.

I think we want to preserve the definitional idea that punycode A form and Unicode U form must be convertible.

My understanding is that the punycode algorithm treats upper and lower case ASCII letters as equivalent for purposes of conversion (they have the same values in the algorithm).

I hope someone with more facility with the coding algorithms will jump in at this point.

28 June 2009 16:10 Wil Tan

The algorithm treats them differently. Basic (ASCII) code points are copied verbatim to the output. We only see lowercase output because Nameprep does the casefolding, so in IDNA2003 only lowercase characters go in as input to the Punycode encoding process.

28 June 2009 17:05 Vint Cerf

So, absent Nameprep, we would see upper and lowercase output from Punycode? And what about conversion back to Unicode form?

28 June 2009 17:21 Wil Tan

Yes. Punycode will encode "foobäRr" into "foobRr-eua". Simon Josefsson's tool comes in handy:

<http://josefsson.org/idn.php?data=foobäRr&profile=Nameprep&mode=punyencode&charset=UTF-8&lastcharset=UTF-8>

It is a lossless algorithm so decoding back to Unicode will give you the exact original.
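The same round trip can be checked locally with a short sketch, assuming Python's built-in "punycode" codec (raw Punycode, with no Nameprep applied beforehand). The variable names are illustrative only.

```python
original = "foobäRr"

# Basic code points are copied verbatim, so the uppercase "R" survives.
encoded = original.encode("punycode")
print(encoded)  # b'foobRr-eua'

# Decoding restores the exact original string, including its case.
decoded = encoded.decode("punycode")
print(decoded)  # foobäRr

# With all-lowercase input, as Nameprep would supply in IDNA2003,
# the encoded form is all lowercase as well.
folded = "foobärr".encode("punycode")
print(folded)  # b'foobrr-eua'
```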

As an alternative to lowercasing the XN-label before lookup, perhaps we can specify an additional step to casefold any ASCII code points in the punycode decoding process in section 5.4 "A-label Input" of idnabis-protocol?

28 June 2009 17:34 Vint Cerf

Casefolding has a broad effect, as I understand it, beyond lowercasing, and this may have side effects that should be considered before coming to that general conclusion. I think one objective of this lookup-only mapping is to preserve the case insensitivity that has been associated with DNS lookups. That was accomplished by the matching algorithm in the name servers. Since we seek a solution that is client-side only, to avoid any need to modify servers, we have to accomplish an approximation at the lookup client side; at the same time, we want to ensure that the 1:1 conversion property of A-label and U-label is preserved. Sorry if I am being redundant here. Just trying to keep straight the constraints within which we are looking to define a lookup-only mapping function.

28 June 2009 17:48 Wil Tan

I do understand and agree with the design constraints within which we are working.

Your proposal to case fold the XN-label prior to lookup works. The only side effect I perceive is that XN-labels that are not all-lowercase may not qualify as A-labels, since they don't produce a valid U-label.

My proposal is to case fold only the ASCII code points in the Unicode string obtained from Punycode decoding of the XN-label, prior to checking the validity of the characters. I'm not aware of any side effects of ASCII lowercasing, but do appreciate that the protocol steps must be very carefully considered.
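The two orderings discussed here can be compared in a rough sketch, assuming Python's built-in "punycode" codec for the decoding step. The function names fold_then_decode, decode_then_fold and lowercase_ascii are hypothetical, and no IDNA2008 validity checking is attempted.

```python
def lowercase_ascii(text: str) -> str:
    """Lowercase only A-Z; leave every other code point untouched."""
    return "".join(ch.lower() if "A" <= ch <= "Z" else ch for ch in text)


def fold_then_decode(xn_label: str) -> str:
    """Lowercase the whole XN-label first, then Punycode-decode it."""
    return lowercase_ascii(xn_label)[4:].encode("ascii").decode("punycode")


def decode_then_fold(xn_label: str) -> str:
    """Punycode-decode first, then lowercase only the ASCII code points."""
    decoded = xn_label[4:].encode("ascii").decode("punycode")
    return lowercase_ascii(decoded)


label = "xn--RSUM-bpad"
a = fold_then_decode(label)
b = decode_then_fold(label)
print(a, b)  # résumé résumé
```

For this label both orderings give the same string; the difference between them lies in whether the mixed-case XN-label itself is treated as an A-label, not in the decoded result.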

I'm hoping someone would jump in here too.

29 June 2009 00:34 Marie-France Berny

Dear William Tan,

I am afraid I am quite confused by this burst of technical ping pong with Vint who wants mapping at protocol level.

I just want to know how the following would be handled:

  • ecole.fra
  • école.fra
  • Ecole.fra

These are three French orthotypographies with three different meanings, which may relate to three different IP addresses. How do you propose to support them?

Thank you.

Unfortunately, this was not answered. "ecole.fra" is an ASCII label. "école.fra" is a U-label. "Ecole.fra" can be an ASCII label or a U-label. The aim was to understand how "ÉCOLE.FRA" would be supported, possibly resulting in "éCOLE.FRA" on the other side.
