Requirements/Proposal Page for Syntax Issue #4: XRI Normalization

1. Introduction/Motivation

This issue was raised by Wil Tan, Les Chasen, and Sharon Wodjenski at NeuStar, based on their implementation experience with Internationalized Domain Names. Since normalization rules for widely used infrastructure can be loosened but almost never tightened after adoption, they recommend that the TC look carefully at specifying Unicode Normalization Form KC (NFKC) for XRI normalization.

2. Status

3. Requirements

4. Background

The following spec excerpts provide the background needed to understand the issue and the proposal.

4.1. Excerpt of Start of Section 3.1 of IRI Spec

   Applications MUST map IRIs to URIs by using the following two steps.

   Step 1.  Generate a UCS character sequence from the original IRI
            format.  This step has the following three variants,
            depending on the form of the input:

            a. If the IRI is written on paper, read aloud, or otherwise
               represented as a sequence of characters independent of
               any character encoding, represent the IRI as a sequence
               of characters from the UCS normalized according to
               Normalization Form C (NFC, [UTR15]).

            b. If the IRI is in some digital representation (e.g., an
               octet stream) in some known non-Unicode character
               encoding, convert the IRI to a sequence of characters
               from the UCS normalized according to NFC.

            c. If the IRI is in a Unicode-based character encoding (for
               example, UTF-8 or UTF-16), do not normalize (see section
               5.3.2.2 for details).  Apply step 2 directly to the
               encoded Unicode character sequence.

   Step 2.  For each character in 'ucschar' or 'iprivate', apply steps
            2.1 through 2.3 below.

       2.1.  Convert the character to a sequence of one or more octets
             using UTF-8 [RFC3629].

       2.2.  Convert each octet to %HH, where HH is the hexadecimal
             notation of the octet value.  Note that this is identical
             to the percent-encoding mechanism in section 2.1 of
             [RFC3986].  To reduce variability, the hexadecimal notation
             SHOULD use uppercase letters.

       2.3.  Replace the original character with the resulting character
             sequence (i.e., a sequence of %HH triplets).
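
Step 2 of the excerpt above can be sketched in Python as follows. This is a minimal illustration, not a conforming implementation: the function name iri_to_uri is hypothetical, and the test "code point at or above U+00A0" is a simplifying assumption standing in for the exact 'ucschar' and 'iprivate' ranges. Because a Python str is already a Unicode sequence, the sketch corresponds to variant (c) of Step 1 and does not normalize.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters of an IRI (simplified Step 2).

    Assumption: any code point >= U+00A0 is treated as 'ucschar' or
    'iprivate'; the real grammar uses narrower ranges.
    """
    parts = []
    for ch in iri:
        if ord(ch) >= 0xA0:
            # 2.1: convert the character to UTF-8 octets.
            # 2.2: write each octet as %HH using uppercase hex digits.
            parts.append("".join(f"%{octet:02X}" for octet in ch.encode("utf-8")))
        else:
            # Characters outside the (simplified) ranges are left untouched.
            parts.append(ch)
    # 2.3: the triplets replace the original characters.
    return "".join(parts)


print(iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html"))
# prints http://www.example.org/r%C3%A9sum%C3%A9.html
```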

4.2. Excerpt of Section 5.3.2.2 of IRI Spec

5.3.2.2.  Character Normalization

   The Unicode Standard [UNIV4] defines various equivalences between
   sequences of characters for various purposes.  Unicode Standard Annex
   #15 [UTR15] defines various Normalization Forms for these
   equivalences, in particular Normalization Form C (NFC, Canonical
   Decomposition, followed by Canonical Composition) and Normalization
   Form KC (NFKC, Compatibility Decomposition, followed by Canonical
   Composition).

   Equivalence of IRIs MUST rely on the assumption that IRIs are
   appropriately pre-character-normalized rather than apply character
   normalization when comparing two IRIs.  The exceptions are conversion
   from a non-digital form, and conversion from a non-UCS-based
   character encoding to a UCS-based character encoding. In these cases,
   NFC or a normalizing transcoder using NFC MUST be used for
   interoperability.  To avoid false negatives and problems with
   transcoding, IRIs SHOULD be created by using NFC.  Using NFKC may
   avoid even more problems; for example, by choosing half-width Latin
   letters instead of full-width ones, and full-width instead of
   half-width Katakana.

   As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML
   Notation) is in NFC.  On the other hand,
   "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.

   The former uses precombined e-acute characters, and the latter uses
   "e" characters followed by combining acute accents.  Both usages are
   defined as canonically equivalent in [UNIV4].

   Note: Because it is unknown how a particular sequence of characters
      is being treated with respect to character normalization, it would
      be inappropriate to allow third parties to normalize an IRI
      arbitrarily.  This does not contradict the recommendation that
      when a resource is created, its IRI should be as character
      normalized as possible (i.e., NFC or even NFKC).  This is similar
      to the uppercase/lowercase problems.  Some parts of a URI are case
      insensitive (domain name).  For others, it is unclear whether they
      are case sensitive, case insensitive, or something in between
      (e.g., case sensitive, but with a multiple choice selection if the
      wrong case is used, instead of a direct negative result).  The
      best recipe is that the creator use a reasonable capitalization
      and, when transferring the URI, capitalization never be changed.

   Various IRI schemes may allow the usage of Internationalized Domain
   Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
   Character Normalization also applies to IDNs, as discussed in section
   5.3.3.
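
The two points made in this excerpt (canonically equivalent spellings of the same IRI, and the extra folding that NFKC performs on width variants) can be checked with Python's standard unicodedata module. The snippet below is only an illustrative sketch; the variable names are chosen for this example.

```python
import unicodedata

# The excerpt's example: precomposed e-acute versus "e" + combining acute.
precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

# Canonically equivalent, but not equal as code point sequences.
equal_as_written = precomposed == decomposed
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Width variants: full-width Latin "A" and half-width Katakana "KA".
width_variants = "\uff21\uff76"
after_nfc = unicodedata.normalize("NFC", width_variants)    # unchanged
after_nfkc = unicodedata.normalize("NFKC", width_variants)  # "A" + U+30AB

print(equal_as_written, equal_after_nfc)      # False True
print(after_nfc == width_variants)            # True
print(after_nfkc == "A\u30ab")                # True
```

NFC leaves the width variants alone because they differ from the ordinary letters only by a compatibility mapping, which is exactly the class of differences NFKC removes.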

4.3. Excerpt of Section 4 of StringPrep Spec

4. Normalization

   The output of the mapping step is optionally normalized using one of
   the Unicode normalization forms, as described in [UAX15].  A profile
   can specify one of two options for Unicode normalization:

   - no normalization

   - Unicode normalization with form KC

   A profile MAY choose to do no normalization.  However, such a profile
   can easily yield results that will be surprising to typical users,
   depending on the input mechanism they use.  For example, some input
   mechanisms enter compatibility characters that look exactly like the
   underlying characters, but have different code points.  Another
   example of where Unicode normalization helps create predictable
   results is with characters that have multiple combining diacritics:
   normalization orders those diacritics in a predictable fashion.

   On the other hand, Unicode normalization requires fairly large tables
   and somewhat complicated character reordering logic.  The size and
   complexity should not be considered daunting except in the most
   restricted of environments, and needs to be weighed against the
   problems of user surprise from comparing unnormalized strings.  Note
   that the tables used for normalization are not given in this
   document, but instead must be derived from the Unicode database, as
   described in [UAX15].

   There is a third form of normalization, Unicode normalization with
   form C.  If a profile is going to use a Unicode normalization, it
   MUST use Unicode normalization form KC.  Form KC maps many
   "compatibility characters" to their equivalents.  Some user interface
   systems make it possible to enter compatibility characters instead of
   the base equivalents.  Thus, using form KC instead of form C will
   cause more strings that users would expect to match to actually
   match.
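
Both situations described in this excerpt, compatibility characters that look like their base equivalents and multiple combining diacritics entered in different orders, can be reproduced with Python's unicodedata module. This is a small sketch for illustration; the helper name nfkc and the sample strings are assumptions made for the example.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)


# A compatibility character: U+FB01 LATIN SMALL LIGATURE FI followed by "le".
ligature_word = "\ufb01le"
plain_word = "file"

# Form C keeps the ligature; form KC maps it to "f" + "i".
matches_under_nfc = unicodedata.normalize("NFC", ligature_word) == plain_word
matches_under_nfkc = nfkc(ligature_word) == plain_word

# Two combining diacritics on "a", typed in opposite orders:
# U+0301 COMBINING ACUTE ACCENT and U+0323 COMBINING DOT BELOW.
acute_then_dot = "a\u0301\u0323"
dot_then_acute = "a\u0323\u0301"
same_after_normalization = nfkc(acute_then_dot) == nfkc(dot_then_acute)

print(matches_under_nfc, matches_under_nfkc)  # False True
print(same_after_normalization)               # True
```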

5. Proposal

(Note that the following proposal was generated in the special TC call held on this topic – see the Discussion section below.)

Revise the appropriate sections of the Syntax spec to require NFKC normalization as part of encoding an XRI in XRI normal form. This eliminates the need to specify normalization in the transformation of an XRI into an IRI, since the normalization will already have been done. Because NFKC is stricter than NFC (any string in NFKC is also in NFC), this also maintains full compatibility with IRI: XRIs transformed to IRIs will be a subset of all valid IRIs.
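
As a rough illustration of the proposal, the sketch below applies NFKC to a native XRI and then confirms that the result is already stable under NFC, which is why no further normalization would be needed when transforming to an IRI. The function name normalize_native_xri and the sample identifier are hypothetical; the sketch covers only the Unicode normalization step, not the other rules of XRI normal form.

```python
import unicodedata


def normalize_native_xri(native_xri: str) -> str:
    """Apply only the NFKC step proposed for XRI normal form (hypothetical helper)."""
    return unicodedata.normalize("NFKC", native_xri)


# Hypothetical native XRI: full-width "example" plus a decomposed e-acute.
native = "xri://@\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45*re\u0301sum\u00e9"
normalized = normalize_native_xri(native)

# Applying NFC or NFKC again changes nothing once NFKC has been applied.
stable_under_nfc = unicodedata.normalize("NFC", normalized) == normalized
stable_under_nfkc = normalize_native_xri(normalized) == normalized

print(normalized == "xri://@example*r\u00e9sum\u00e9")  # True
print(stable_under_nfc, stable_under_nfkc)              # True True
```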

6. Discussion

The following is a copy of the discussion from the minutes of a special TC call on this topic, held at 5 PM Pacific on 2005/09/13.


Discussion on the call quickly moved from: a) the issue of whether NFKC or NFC should be specified on the conversion from an XRI in XRI normal form to IRI normal form, to b) the issue of whether NFKC or NFC should be specified on the conversion from a native XRI to an XRI in XRI normal form.

The latter approach matches the approach taken in the IRI spec, where conversion of a "native" IRI (the IRI in a native application before any encoding has been applied, or when conversion into UTF-8 is necessary) requires normalization using NFC.

The attendees agreed that IRI was most likely motivated to use NFC by the huge installed base of existing IRI implementations, which effectively precluded them from specifying NFKC. However, XRI does not have this installed base. So there was unanimous agreement that we would be doing the world of XRI adopters a big service by specifying NFKC as the normalization requirement *in the conversion of a native XRI into XRI normal form*.

After lengthy discussion it was agreed that, for the reasons cited in excerpt 4.3 above (Section 4 of the StringPrep spec), this would be the best approach for the entirety of XRI infrastructure, since it requires XRIs to be normalized according to NFKC at their very earliest point of origin in the infrastructure (the native user interfaces or applications that generate them). This simplifies the processing and equivalence-checking burden on every other participant in the infrastructure. As Section 4 of the StringPrep spec says:

"The size and complexity [of Unicode NFKC normalization] should not be considered daunting [on an application] except in the most restricted of environments, and needs to be weighed against the problems of user surprise from comparing unnormalized strings."

Thus our final recommendation is that the appropriate sections of the Syntax spec be revised to require NFKC normalization as part of encoding an XRI in XRI normal form. This eliminates the need to specify normalization in the transformation of an XRI into an IRI, since the normalization will already have been done. Because NFKC is stricter than NFC, this also maintains full compatibility with IRI: XRIs transformed to IRIs will be a subset of all valid IRIs.



Xri2Cd02/SynTax/I4XriNormalization (last edited 2009-08-12 18:07:17 by localhost)