Requirements/Proposal Page for Syntax Issue #4: XRI Normalization
1. Introduction/Motivation
This issue was raised by Wil Tan, Les Chasen, and Sharon Wodjenski at NeuStar, based on their implementation experience with Internationalized Domain Names. Since normalization rules for widely used infrastructure can be loosened but almost never tightened after adoption, they recommend that the TC look carefully at specifying Unicode Normalization Form KC (NFKC) for XRI normalization.
2. Status
- Version: 1
- Action: Active proposal that needs discussion and closure.
3. Requirements
- Before XRIs gain wide adoption, establish a clear standard for the critical issue of how XRIs that use the Unicode character set will be normalized.
- Reasonably minimize the discrepancy between user expectation of equivalence and machine determination of equivalence of XRIs.
- Minimize the normalization processing burden for the whole of XRI infrastructure while also making sure no one point in the infrastructure suffers too great a burden.
- If possible, ensure that XRI normalization does not create incompatibility with the IRI and URI specifications.
4. Background
The following background and spec excerpts are very helpful in understanding the issue and proposal.
The IRI specification requires Unicode NFC normalization when an IRI is first encoded (if it is not already in a Unicode-based encoding), NOT on the conversion of an IRI to a URI (as some of us expected). See section 3.1, steps 1a and 1b, as well as section 5.3.2.2 (both excerpted below for easy reference).
The Internationalized Domain Names (IDN) specifications rely on the StringPrep specification, which requires the stricter NFKC rules. NFKC normalizes a wider set of "compatibility characters" that are allowed under Unicode but that make identifier comparison more difficult for both humans and machines. Note that the IDN specifications (and NFKC) apply to the ireg-name component of an IRI in certain schemes. The reasons for requiring NFKC are explained in section 4 of the StringPrep spec (excerpted below for easy reference).
4.1. Excerpt of Start of Section 3.1 of IRI Spec
Applications MUST map IRIs to URIs by using the following two steps.
Step 1. Generate a UCS character sequence from the original IRI
format. This step has the following three variants,
depending on the form of the input:
a. If the IRI is written on paper, read aloud, or otherwise
represented as a sequence of characters independent of
any character encoding, represent the IRI as a sequence
of characters from the UCS normalized according to
Normalization Form C (NFC, [UTR15]).
b. If the IRI is in some digital representation (e.g., an
octet stream) in some known non-Unicode character
encoding, convert the IRI to a sequence of characters
from the UCS normalized according to NFC.
c. If the IRI is in a Unicode-based character encoding (for
example, UTF-8 or UTF-16), do not normalize (see section
5.3.2.2 for details). Apply step 2 directly to the
encoded Unicode character sequence.
Step 2. For each character in 'ucschar' or 'iprivate', apply steps
2.1 through 2.3 below.
2.1. Convert the character to a sequence of one or more octets
using UTF-8 [RFC3629].
2.2. Convert each octet to %HH, where HH is the hexadecimal
notation of the octet value. Note that this is identical
to the percent-encoding mechanism in section 2.1 of
[RFC3986]. To reduce variability, the hexadecimal notation
SHOULD use uppercase letters.
2.3. Replace the original character with the resulting character
sequence (i.e., a sequence of %HH triplets).
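Step 2 of the excerpt can be sketched in a few lines of Python. This is a minimal illustration rather than a conforming implementation: the function name `iri_to_uri` is hypothetical, the sketch treats every non-ASCII character as a member of 'ucschar' or 'iprivate', and it assumes the input is already a Unicode string (variant c of step 1), so no normalization is applied.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode each non-ASCII character as uppercase %HH UTF-8 octets.

    Simplifying assumption: every non-ASCII character is treated as
    'ucschar' or 'iprivate'; ASCII characters are passed through untouched.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)  # ASCII: left as-is
        else:
            # Step 2.1: UTF-8 octets; step 2.2: %HH with uppercase hex digits
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)  # step 2.3: triplets replace the original character


precomposed_uri = iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html")
decomposed_uri = iri_to_uri("http://www.example.org/re\u0301sume\u0301.html")
```

Because no normalization happens in this step, the precomposed and decomposed spellings of the same visible text map to different URIs, which is why the point at which normalization is required matters.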
4.2. Excerpt of Section 5.3.2.2 of IRI Spec
5.3.2.2. Character Normalization
The Unicode Standard [UNIV4] defines various equivalences between
sequences of characters for various purposes. Unicode Standard Annex
#15 [UTR15] defines various Normalization Forms for these
equivalences, in particular Normalization Form C (NFC, Canonical
Decomposition, followed by Canonical Composition) and Normalization
Form KC (NFKC, Compatibility Decomposition, followed by Canonical
Composition).
Equivalence of IRIs MUST rely on the assumption that IRIs are
appropriately pre-character-normalized rather than apply character
normalization when comparing two IRIs. The exceptions are conversion
from a non-digital form, and conversion from a non-UCS-based
character encoding to a UCS-based character encoding. In these cases,
NFC or a normalizing transcoder using NFC MUST be used for
interoperability. To avoid false negatives and problems with
transcoding, IRIs SHOULD be created by using NFC. Using NFKC may
avoid even more problems; for example, by choosing half-width Latin
letters instead of full-width ones, and full-width instead of
half-width Katakana.
As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML
Notation) is in NFC. On the other hand,
"http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.
The former uses precombined e-acute characters, and the latter uses
"e" characters followed by combining acute accents. Both usages are
defined as canonically equivalent in [UNIV4].
Note: Because it is unknown how a particular sequence of characters
is being treated with respect to character normalization, it would
be inappropriate to allow third parties to normalize an IRI
arbitrarily. This does not contradict the recommendation that
when a resource is created, its IRI should be as character
normalized as possible (i.e., NFC or even NFKC). This is similar
to the uppercase/lowercase problems. Some parts of a URI are case
insensitive (domain name). For others, it is unclear whether they
are case sensitive, case insensitive, or something in between
(e.g., case sensitive, but with a multiple choice selection if the
wrong case is used, instead of a direct negative result). The
best recipe is that the creator use a reasonable capitalization
and, when transferring the URI, capitalization never be changed.
Various IRI schemes may allow the usage of Internationalized Domain
Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
Character Normalization also applies to IDNs, as discussed in section
5.3.3.
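The canonical-equivalence example in this excerpt can be checked with Python's standard `unicodedata` module. The following is only a sketch: the helper name `is_nfc` is an assumption made for illustration and is not part of any spec.

```python
import unicodedata

# Precombined e-acute (U+00E9) versus "e" followed by combining acute (U+0301)
precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"


def is_nfc(text: str) -> bool:
    """Return True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


# A plain code point comparison treats the two spellings as different ...
differ_as_code_points = precomposed != decomposed
# ... but NFC maps the decomposed spelling onto the precomposed one.
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

Here `is_nfc(precomposed)` holds and `is_nfc(decomposed)` does not, matching the excerpt's statement that only the first form is in NFC.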
4.3. Excerpt of Section 4 of StringPrep Spec
4. Normalization
The output of the mapping step is optionally normalized using one of the Unicode normalization forms, as described in [UAX15]. A profile can specify one of two options for Unicode normalization:
- no normalization
- Unicode normalization with form KC
A profile MAY choose to do no normalization. However, such a profile can easily yield results that will be surprising to typical users, depending on the input mechanism they use. For example, some input mechanisms enter compatibility characters that look exactly like the underlying characters, but have different code points. Another example of where Unicode normalization helps create predictable results is with characters that have multiple combining diacritics: normalization orders those diacritics in a predictable fashion.
On the other hand, Unicode normalization requires fairly large tables and somewhat complicated character reordering logic. The size and complexity should not be considered daunting except in the most restricted of environments, and needs to be weighed against the problems of user surprise from comparing unnormalized strings.
Note that the tables used for normalization are not given in this document, but instead must be derived from the Unicode database, as described in [UAX15].
There is a third form of normalization, Unicode normalization with form C. If a profile is going to use a Unicode normalization, it MUST use Unicode normalization form KC. Form KC maps many "compatibility characters" to their equivalents. Some user interface systems make it possible to enter compatibility characters instead of the base equivalents. Thus, using form KC instead of form C will cause more strings that users would expect to match to actually match.
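The difference between form C and form KC for compatibility characters can be illustrated with Python's standard `unicodedata` module. The sample strings below are chosen purely for illustration and do not come from the StringPrep spec.

```python
import unicodedata

samples = {
    "fullwidth_latin": "\uff58\uff52\uff49",  # full-width "xri"
    "ligature": "\ufb01le",                   # "fi" ligature (U+FB01) + "le"
    "halfwidth_katakana": "\uff76",           # half-width KA
}

# Form C leaves compatibility characters in place ...
nfc = {name: unicodedata.normalize("NFC", s) for name, s in samples.items()}
# ... while form KC maps them to their ordinary equivalents.
nfkc = {name: unicodedata.normalize("NFKC", s) for name, s in samples.items()}
```

Under NFC each sample is returned unchanged, whereas NFKC yields "xri", "file", and full-width KA (U+30AB) respectively, so strings that look alike to a user compare equal only after form KC.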
5. Proposal
(Note that the following proposal was generated in the special TC call held on this topic – see the Discussion section below.)
Revise the appropriate sections of the Syntax spec to require NFKC normalization as part of encoding an XRI in XRI normal form. This eliminates the need to specify normalization in the transformation of an XRI into an IRI, since this normalization will already have been done. Because NFKC is stricter than NFC, this approach also maintains full compatibility with IRI, since XRIs transformed to IRIs will be a subset of all valid IRIs.
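The effect of this proposal can be sketched as follows. The helper `normalize_native_xri` and the sample identifiers are hypothetical, and the sketch covers only the NFKC step; any other steps involved in producing XRI normal form are deliberately left out.

```python
import unicodedata


def normalize_native_xri(native_xri: str) -> str:
    """Hypothetical helper: apply only the proposed NFKC step."""
    return unicodedata.normalize("NFKC", native_xri)


# Three native spellings a user would likely consider the same identifier
variants = [
    "@example*r\u00e9sum\u00e9",       # precomposed e-acute
    "@example*re\u0301sume\u0301",     # "e" + combining acute accent
    "@\uff45xample*r\u00e9sum\u00e9",  # full-width "e" as the first letter
]

# After NFKC at the point of origin, all spellings collapse to one string,
# so later equivalence checks can be plain string comparisons.
normalized = {normalize_native_xri(v) for v in variants}

# Text in NFKC is also in NFC, so the result needs no further normalization
# to meet the NFC expectation of the IRI spec.
already_nfc = all(unicodedata.normalize("NFC", n) == n for n in normalized)
```

In this sketch `normalized` contains a single string and `already_nfc` is true, which is the sense in which NFKC-normalized XRIs form a subset of NFC-normalized identifiers.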
6. Discussion
Following is a copy of the discussion from the minutes of a special TC call on this topic held 5PM Pacific on 2005/09/13.
Discussion on the call quickly moved from: a) the issue of whether NFKC or NFC should be specified on the conversion from an XRI in XRI normal form to IRI normal form, to b) the issue of whether NFKC or NFC should be specified on the conversion from a native XRI to an XRI in XRI normal form.
The latter approach matches the approach taken in the IRI spec, where conversion of a "native" IRI (the IRI in a native application before any encoding has been applied, or when conversion into UTF-8 is necessary) requires normalization using NFC.
The attendees agreed that the IRI authors were most likely motivated to specify NFC by the huge installed base of existing IRI implementations, which effectively precluded them from specifying NFKC. However, XRI does not have this installed base. So there was unanimous agreement that we would be doing the world of XRI adopters a big service by specifying NFKC as the normalization requirement *in the conversion of a native XRI into XRI normal form*.
After lengthy discussion it was agreed that, for the reasons cited in the third excerpt above (section 4 of the StringPrep spec), this would be the best approach for the entirety of XRI infrastructure, since it requires XRIs to be normalized according to NFKC at their very earliest point of origin in the infrastructure (the native user interfaces or applications that generate them). This simplifies the processing and equivalence-checking burden on every other participant in the infrastructure. As section 4 of the StringPrep spec says:
"The size and complexity [of Unicode NFKC normalization] should not be considered daunting [on an application] except in the most restricted of environments, and needs to be weighed against the problems of user surprise from comparing unnormalized strings."
Thus our final recommendation is that the appropriate sections of the Syntax spec be revised to require NFKC normalization as part of encoding an XRI in XRI normal form. This eliminates the need to specify normalization in the transformation of an XRI into an IRI, since this normalization will already have been done. Because NFKC is stricter than NFC, this approach also maintains full compatibility with IRI, since XRIs transformed to IRIs will be a subset of all valid IRIs.