Contents

  1. Requirements
    1. Must be able to represent standalone codes
      1. Current solution (status: STABLE)
    2. Must be able to represent balanced paired-codes
      1. Current solution (status: STABLE)
    3. Must be able to represent paired codes that have been separated
      1. Current solution (status: STABLE)
    4. Must be able to represent paired codes that are overlapping each other
      1. Current solution (status: STABLE)
    5. Must allow to associate spans of content with metadata
      1. Current solution (status: UNDER DISCUSSION)
    6. Must be able to store a display-friendly representation of an inline code for informational purpose
      1. Current solution (status: STABLE)
    7. Must be able to store a text-equivalent representation of an inline code for linguistic processes
      1. Current solution (status: STABLE)
    8. Must be able to identify uniquely an inline code within a segment (<source> and/or <target> in XLIFF 1.2 or <seg> in TMX)
      1. Current solution (status: STABLE)
    9. Must be able to associate each code of the source with its corresponding code in the target
      1. Current solution (status: STABLE)
    10. Must be able to represent the duplication of inline codes in the segment
      1. Current solution (status: STABLE)
    11. Must be able to represent inline codes added to the segment
      1. Current solution (status: STABLE)
    12. Must allow three ways to deal with the native data corresponding to an XLIFF inline code
      1. to store only the XLIFF representation, discarding the native data
        1. Current solution (status: STABLE)
      2. to store it along with its XLIFF representation
        1. Current solution (status: STABLE)
      3. to store a pointer to it along with its XLIFF representation
        1. Current solution (status: STABLE)
    13. Must be able to represent separately different flows of text and codes when, in the original format, they are mixed together
      1. Current solution (status: STABLE)
    14. Should be able to represent the mutual relationships between a nested flow of text and its parent
      1. Current solution (status: STABLE)
    15. Should be able to represent illegal XML characters in the content
      1. Current solution (status: STABLE)
    16. Inline codes should have a way to store information about the effect on the segmentation
      1. Current solution (status: UNDER DISCUSSION)
    17. Should preserve span-like structures
      1. Current solution (status: STABLE)
  2. Guiding Principles
    1. Consider other Standards rather than re-inventing the wheel
    2. If possible, all text nodes of the content should be real text, not codes

For comparison between solutions see: OneContentModel/Comparison

1. Requirements

1.1. Must be able to represent standalone codes

For example, a line break in HTML:

Line 1<br/>line 2

Or an image in HTML:

An elephant: <img src="elephant.png">

1.1.1. Current solution (status: STABLE)

The placeholder element:

<ph id='1'/>

1.2. Must be able to represent balanced paired-codes

For example, normal bolded text in HTML:

<b>Bold</b> text

1.2.1. Current solution (status: STABLE)

The paired codes element:

<pc id='1'>text</pc>

1.3. Must be able to represent paired codes that have been separated

For example, in HTML, <b> may be in one segment while </b> is in a different one:

<p>Text in <b>bold. This too</b>.</p>
--> seg1=[Text in <b>bold.] and seg2=[This too</b>.]

The format must support marking of pairs across segment boundaries.

1.3.1. Current solution (status: STABLE)

The start code and end code elements:

<sc id='1'/>text<ec id='2' rid='1'/>

or

<sc id='1'/>text<ec rid='1'/>

Name of attribute needs to be finalized (rid vs idref)
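
As an illustration only, here is a minimal Python sketch of how a tool might re-pair separated codes. It assumes the attribute is called rid, as in the examples above (the name is not final), and it wraps each segment in a hypothetical <seg> element purely so that the fragment can be parsed.

```python
import xml.etree.ElementTree as ET

def pair_codes(segments):
    """Match each <ec> to its <sc> through rid, across segment boundaries.

    Returns a list of (code id, start segment index, end segment index).
    Raises KeyError if an <ec> refers to an <sc> that was never opened.
    """
    open_codes = {}  # id of an <sc> -> index of the segment that holds it
    pairs = []
    for index, xml_text in enumerate(segments):
        # <seg> is a hypothetical wrapper, used only to get one root element.
        root = ET.fromstring(f"<seg>{xml_text}</seg>")
        for elem in root.iter():
            if elem.tag == "sc":
                open_codes[elem.get("id")] = index
            elif elem.tag == "ec":
                rid = elem.get("rid")
                pairs.append((rid, open_codes.pop(rid), index))
    return pairs

segments = ["Text in <sc id='1'/>bold.", "This too<ec rid='1'/>."]
print(pair_codes(segments))  # [('1', 0, 1)]
```

The result says that code 1 opens in the first segment and closes in the second one.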

1.4. Must be able to represent paired codes that are overlapping each other

For example, the bookmarks in ODF:

<text:p><text:bookmark-start text:name="bm1"/>Text of bookmark bm1 <text:bookmark-start text:name="bm2"/>and bm2.<text:bookmark-end text:name="bm1"/>
Text of bookmark bm2.<text:bookmark-end text:name="bm2"/></text:p>

1.4.1. Current solution (status: STABLE)

The start code and end code elements cover this requirement.
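
To illustrate, the sketch below rewrites the ODF bookmark example with <sc/> and <ec/> and reports which spans overlap. The rid attribute name, the id values and the <seg> wrapper are assumptions made for the example, not part of the solution itself.

```python
import xml.etree.ElementTree as ET

def find_overlaps(xml_text):
    """Return (first_id, second_id) pairs of spans that overlap.

    Two spans overlap when the first one is closed while a span opened
    after it is still open. Properly nested spans are not reported.
    """
    root = ET.fromstring(f"<seg>{xml_text}</seg>")  # hypothetical wrapper
    open_ids = []   # ids of currently open <sc> codes, in opening order
    overlaps = []
    for elem in root.iter():
        if elem.tag == "sc":
            open_ids.append(elem.get("id"))
        elif elem.tag == "ec":
            rid = elem.get("rid")
            position = open_ids.index(rid)
            # Every code opened after rid and still open overlaps with it.
            overlaps.extend((rid, later) for later in open_ids[position + 1:])
            del open_ids[position]
    return overlaps

# The two bookmarks of the ODF example: bm1 is code 1, bm2 is code 2.
content = ("<sc id='1'/>Text of bookmark bm1 <sc id='2'/>and bm2."
           "<ec rid='1'/> Text of bookmark bm2.<ec rid='2'/>")
print(find_overlaps(content))  # [('1', '2')]
```

A <pc> element could not express this content, because XML elements cannot overlap; independent start and end codes can.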

1.5. Must allow to associate spans of content with metadata

Examples of potential metadata associated with a span of content:

1.5.1. Current solution (status: UNDER DISCUSSION)

Discussion:

1.6. Must be able to store a display-friendly representation of an inline code for informational purpose

For example, for an inline code representing a variable, it is useful to see the value of the variable or its name when translating. The alternative representation could also be used as a hint when no other information is available during tasks such as alignment.

For instance: given a native code "<#@style-type-534562-BoldFace@#>" indicating a start of bolded text, the display-friendly representation would be "<b>".

1.6.1. Current solution (status: STABLE)

The disp attribute in the <sc/>, <ec/> and <ph/> elements, and the disp and dispEnd attributes for the <pc> element.

1.7. Must be able to store a text-equivalent representation of an inline code for linguistic processes

Indicates an equivalent text to substitute in place of an inline code when doing linguistic-related processes.

For example: if, in a text "F&ile", the '&' is an inline code, the text equivalent would be an empty string indicating "F&ile" should be seen as "File" for linguistic purposes.

1.7.1. Current solution (status: STABLE)

The equiv attribute in the <sc/>, <ec/> and <ph/> elements, and the equiv and equivEnd attributes for the <pc> element.
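
A possible way for a linguistic process to use these attributes is sketched below in Python. The <seg> wrapper and the choice to treat a missing attribute as an empty string are assumptions of this example.

```python
import xml.etree.ElementTree as ET

def _flatten(elem):
    """Concatenate the text of elem, replacing each inline code by its equiv."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(child.get("equiv", ""))         # text standing in for the code
        parts.append(_flatten(child))                # content of a <pc>, if any
        if child.tag == "pc":
            parts.append(child.get("equivEnd", ""))  # text for the closing code
        parts.append(child.tail or "")
    return "".join(parts)

def linguistic_text(xml_text):
    """Return the content as a linguistic process should see it."""
    # <seg> is a hypothetical wrapper, used only to get one root element.
    return _flatten(ET.fromstring(f"<seg>{xml_text}</seg>"))

print(linguistic_text("F<ph id='1' equiv=''/>ile"))                        # File
print(linguistic_text("<pc id='1' equiv='' equivEnd=''>Bold</pc> text"))   # Bold text
```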

1.8. Must be able to identify uniquely an inline code within a segment (<source> and/or <target> in XLIFF 1.2 or <seg> in TMX)

An inline code may be associated with external metadata. In order to link together the code and its associated metadata, a way to identify the inline code uniquely within the segment is needed.

1.8.1. Current solution (status: STABLE)

The id attribute.

The value would be unique within the <source> element and within the <target> element.
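
The uniqueness constraint could be verified along the following lines. This is only a sketch: the function name is hypothetical, and it simply counts the id values found on the elements inside one <source> or <target>.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_ids(container_xml):
    """Return the sorted id values used more than once inside one container.

    container_xml is a single <source> or <target> element. The id of the
    container itself, if it has one, is not counted.
    """
    root = ET.fromstring(container_xml)
    counts = Counter(
        elem.get("id")
        for elem in root.iter()
        if elem is not root and elem.get("id") is not None
    )
    return sorted(code_id for code_id, count in counts.items() if count > 1)

good = "<source>Text in <pc id='1'>bold</pc> and a break<ph id='2'/></source>"
bad = "<source>One<ph id='1'/> two<ph id='1'/></source>"
print(duplicate_ids(good))  # []
print(duplicate_ids(bad))   # ['1']
```

Because the check runs on one container at a time, the same id value may appear once in the source and once in the target without being reported.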

1.9. Must be able to associate each code of the source with its corresponding code in the target

For example, in the following source and translation:

English: The text is in <b>bold</b> and <i>italics</i>.
Yoda-English: In <i>italics</i> and <b>bold</b> the text is.

The tags <b>, </b>, <i>, and </i> of the source should be mappable to the ones in the translation.

1.9.1. Current solution (status: STABLE)

1.10. Must be able to represent the duplication of inline codes in the segment

Sometimes translating formatted text requires splitting a formatted span of the source into several parts placed at different positions in the segment, so the original codes need to be replicated.

English: He often <B>came a cropper </B> due to stress.
German: Er ist oft <B>auf die Nase</B> wegen Stress <B>gefallen</B>.

1.10.1. Current solution (status: STABLE)

1.11. Must be able to represent inline codes added to the segment

Translated text may need extra information inserted in the form of inline codes, for example directionality markers for bidirectional languages. Another example: the following Japanese text has a title between special marks that is rendered in italics in English:

Japanese: 私は『時間コレラの』愛を読む
English: I just read <i>Love in the Time of Cholera</i>

1.11.1. Current solution (status: STABLE)

1.12. Must allow three ways to deal with the native data corresponding to an XLIFF inline code

1.12.1. to store only the XLIFF representation, discarding the native data

1.12.1.1. Current solution (status: STABLE)

Empty <sc/>, <ec/> and <ph/> elements and the <pc> element.

1.12.2. to store it along with its XLIFF representation

1.12.2.1. Current solution (status: STABLE)

1.12.3. to store a pointer to it along with its XLIFF representation

1.12.3.1. Current solution (status: STABLE)

<originalData>
  <data id='d1'>&lt;b></data>
  <data id='d2'>&lt;/b></data>
</originalData>

Names still to be defined.
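
As a sketch of how a merger might follow such pointers, the Python example below restores the native codes from <originalData>. Since the names are not defined yet, the pointing attribute is given the hypothetical name dataRef, and the enclosing <unit> element is likewise an assumption of this example.

```python
import xml.etree.ElementTree as ET

# <unit> and dataRef are hypothetical names chosen for this illustration.
UNIT = """<unit>
  <originalData>
    <data id='d1'>&lt;b></data>
    <data id='d2'>&lt;/b></data>
  </originalData>
  <source>Text in <sc id='1' dataRef='d1'/>bold<ec rid='1' dataRef='d2'/>.</source>
</unit>"""

def restore_native(unit_xml):
    """Rebuild the native string by replacing each code with the data it points to.

    Only empty codes directly under <source> are handled in this sketch.
    """
    unit = ET.fromstring(unit_xml)
    data = {d.get("id"): d.text or "" for d in unit.find("originalData")}
    source = unit.find("source")
    parts = [source.text or ""]
    for code in source:
        parts.append(data[code.get("dataRef")])  # look up the native data
        parts.append(code.tail or "")            # text that follows the code
    return "".join(parts)

print(restore_native(UNIT))  # Text in <b>bold</b>.
```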

1.13. Must be able to represent separately different flows of text and codes when, in the original format, they are mixed together

Example 1: In DITA a footnote is stored at the location where it is referred to:

<p>Palouse horses<fn>A Palouse horse is the same as an Appaloosa.</fn> have spotted coats.</p>

This p element contains two separate flows: "Palouse horses have spotted coats" and "A Palouse horse is the same as an Appaloosa."

Example 2: The value of the HTML ALT attribute is stored in the IMG tag and can be within a paragraph:

<p>Click here: <img alt='OK' src='ok.png'/>.</p>

1.13.1. Current solution (status: STABLE)

1.14. Should be able to represent the mutual relationships between a nested flow of text and its parent

The format should be able to represent both flows and have some information about their relationships, so the two texts can be put in context when needed.

For example, the relation between the value of an HTML ALT attribute and the paragraph element where it appears should be somehow preserved:

<p>Click here: <img alt='OK' src='ok.png'/>.</p>

1.14.1. Current solution (status: STABLE)

1.15. Should be able to represent illegal XML characters in the content

Some characters are illegal in XML, but they may appear in extracted text and we should have a common way to represent them so they can be preserved and merged back if necessary, without causing the XML tools to fail.

For example, in the Java property string "Text with \u001a", the character U+001A is illegal in XML but needs to have a representation in XLIFF.

Note: An example of how some XML formats handle this case is the TS format from Qt-Linguist, which uses a <byte> element to represent such characters.

1.15.1. Current solution (status: STABLE)

The solution is a specialized element <cp> with an attribute hex that holds the Unicode code point of the illegal character. This is the same solution as the one used in LDML:

<cp hex='001a'/>
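
A minimal Python sketch of the conversion in both directions follows. The function names are hypothetical, the set of legal characters is taken from the Char production of XML 1.0, and escaping of ordinary markup characters such as & and < is deliberately left out.

```python
import re

# Any character outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
_CP = re.compile(r"<cp hex='([0-9A-Fa-f]+)'/>")

def encode_illegal(text):
    """Replace each character that is illegal in XML by a <cp/> element."""
    return _ILLEGAL.sub(lambda m: "<cp hex='%04x'/>" % ord(m.group()), text)

def decode_illegal(markup):
    """Inverse operation, used when merging the text back."""
    return _CP.sub(lambda m: chr(int(m.group(1), 16)), markup)

encoded = encode_illegal("Text with \u001a")
print(encoded)  # Text with <cp hex='001a'/>
```

Decoding the encoded string gives back the original Java property value, so the character survives the round trip.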

1.16. Inline codes should have a way to store information about the effect on the segmentation

As some inline codes may have an effect on the segmentation of a given content, it would be useful to store segmentation-specific hints along with an inline code.

For example: In HTML a <BR> element indicates a forced line break, while a <B>...</B> element should not affect the segmentation.

1.16.1. Current solution (status: UNDER DISCUSSION)

No solution so far, but some discussion: http://lists.oasis-open.org/archives/xliff-inline/201109/msg00000.html

Current ideas:

1.17. Should preserve span-like structures

When processing original markup with a span-like structure, it should be represented using a span-like element in the XLIFF inline markup, rather than using two XML elements denoting the start and end of the span. This notation allows easier XML processing and corresponds to the original structure. For example, given the following original XML content:

<img src='image.png'/> is a <b>beautiful</b> image.

We should be able to represent it in XLIFF using a markup where "<img src='image.png'/>" is represented by an empty XLIFF element; and "<b>"/"</b>" are represented by a unique XLIFF element that encloses "beautiful".

For instance, using a representation such as this imaginary one:

<code id="1" content="img"/> is a <code id="2" start="b" end="b">beautiful</code> image.

Instead of:

<code id="1"/> is a <startCode id="2" start="b"/>beautiful<endCode id="3" end="b"/> image.

1.17.1. Current solution (status: STABLE)

2. Guiding Principles

2.1. Consider other Standards rather than re-inventing the wheel

Wherever possible, data categories or representation mechanisms from other standards should be considered. Examples of these include ITS and RDF.

2.2. If possible, all text nodes of the content should be real text, not codes

When processing the content with XML parsers, all the nodes of type TEXT should contain real text. This allows the separation between textual content and codes to be physical even in XML tree representation, rather than requiring interpretation of the markup.

For example, the imaginary representation below stores the native codes [startBold] and [endBold] as part of the content. This is what we want to try to avoid.

This text is in <code>[startBold]</code>bold<code>[endBold]</code>.

In contrast, the imaginary representation below stores the native codes [startBold] and [endBold] outside the content. Therefore the sum of all TEXT nodes represents only true text. This is what we want to try to achieve.

This text is in <code native="[startBold]"/>bold<code native="[endBold]"/>.

Note that this objective may or may not be possible to achieve, depending on various factors.
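
The difference can be observed with any XML parser. The Python sketch below concatenates the TEXT nodes of the two imaginary representations; the <seg> wrapper is a hypothetical addition, and the second representation is written with empty elements so that it parses.

```python
import xml.etree.ElementTree as ET

def text_nodes(xml_text):
    """Concatenate all TEXT nodes of the content, in document order."""
    # <seg> is a hypothetical wrapper, used only to get one root element.
    root = ET.fromstring(f"<seg>{xml_text}</seg>")
    return "".join(root.itertext())

codes_in_content = "This text is in <code>[startBold]</code>bold<code>[endBold]</code>."
codes_in_attributes = ("This text is in <code native='[startBold]'/>bold"
                       "<code native='[endBold]'/>.")

print(text_nodes(codes_in_content))     # This text is in [startBold]bold[endBold].
print(text_nodes(codes_in_attributes))  # This text is in bold.
```

With the first representation a tool has to know that the content of <code> is not text; with the second one no such knowledge is needed.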

OneContentModel/Requirements (last edited 2011-10-05 14:52:18 by ysavourel)