Contents

  1. Overview
    1. Definitions/Terminology
  2. Plan
  3. Scope
  4. Requirements
    1. Must be able to represent standalone codes
    2. Must be able to represent balanced paired-codes
    3. Must be able to represent paired codes that have been separated
    4. Must be able to represent paired codes that are overlapping each other
    5. Must allow to associate spans of content with metadata
    6. Must be able to store a display-friendly representation of an inline code for informational purpose
    7. Must be able to store a text-equivalent representation of an inline code for linguistic processes
    8. Must be able to identify uniquely an inline code within a segment (<source> and/or <target> in XLIFF 1.2 or <seg> in TMX)
    9. Must be able to associate each code of the source with its corresponding code in the target
    10. Must be able to represent the duplication of inline codes in the segment
    11. Must be able to represent inline codes added to the segment
    12. Must allow three ways to deal with the native data corresponding to an XLIFF inline code
    13. Must be able to represent separately different flows of text and codes when, in the original format, they are mixed together
    14. Should be able to represent the mutual relationships between a nested flow of text and its parent
    15. Should be able to represent illegal XML characters in the content
    16. Inline codes should have a way to store information about the effect on the segmentation
    17. Should preserve span-like structures
  5. Guiding Principles
    1. Consider other Standards rather than re-inventing the wheel
    2. If possible, all text nodes of the content should be real text, not codes

1. Overview

As component-based and internet-driven technologies evolve, tools need to communicate as seamlessly as possible, not only by exchanging documents but also by exchanging small segments of information.

Many of the Web services, plugins, and other building blocks that make up the tools being built today need to exchange data at the segment level, not at the file level. Whether these components identify terms, highlight spelling mistakes, provide TM matches, or offer MT guesses, they all ultimately need to access the same abstracted extracted text.

Having a single representation for this abstracted extracted text improves the interoperability of data and thus of tools. From a tool provider's point of view, one advantage of a single representation is that no bridges along the lines of "If dealing with X then ... if dealing with Y then ..." are necessary.

1.1. Definitions/Terminology

This section is under construction. The following documents may provide guidance.

http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm#SectionContentMarkup

http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine

http://www.w3.org/International/articles/inline-bidi-markup/

http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_XHTML-1.0-Strict

http://www.w3.org/TR/xhtml1/dtds.html#dtdentry_xhtml1-strict.dtd_phrase

http://www.w3.org/TR/xml-c14n

http://www.w3.org/TR/2008/NOTE-xml-i18n-bp-20080213/#DevSeg

Term: Block-level Markup
Examples: Headings, lists, and blockquotes in HTML
Definition: Delimiters for content which is usually linguistically independent from its surrounding content.

Term: Canonical Representation
Alternative Terms: Canonical Form
Examples: The canonical form of an XML document.
Definition: The physical representation of a given content fragment which has been derived from a native representation in an agreed-upon way. Since the way of derivation is defined, derivation can only produce a single possible result.

Term: Inline-level Markup
Examples: Big, sup, and em in HTML
Definition: Delimiters for content which is usually linguistically dependent on its surrounding content.

Term: Segment
Examples: A sentence in a language such as English.
Definition: Content which has either been marked up as "block", or has been created from a "block" by means of a segmentation mechanism such as sentence boundary detection.

Term: Sub-flow
Examples: A footnote in an academic text
Definition: Content which is block-level in nature but is semantically closely related to other blocks.

Term: Inline Code
Definition: An inline code is non-linguistic data embedded within a unit of translatable content.

Term: Inline Content
Definition: The representation of inline content consists of text, possibly augmented with the following types of data:
  a) genuine inline entities: entities which belong to the original extracted document (e.g. inline markup like "em")
  b) supplementary entities: entities which supplement/augment the original (annotations that are not available in the original)
  c) surrogate entities to represent 'illegal characters' (characters not allowed in XML)

2. Plan

Step 1: Agree on Scope for the specification

Step 2: Define Requirements

Step 3: Design possible solutions

3. Scope

The purpose of this specification is to define a common model for inline-level markup for localization, allowing task- and tool-agnostic resource exchange and processing. The main implementation targets are XLIFF and TMX.

This specification will encourage a common semantic representation of native format constructs, such as formatting, in order to facilitate the re-use of translations across file formats and the common processing of localizable data across native file formats.

The markup model will be represented in XML and defined by an XML schema. There is no backward compatibility requirement with earlier versions of XLIFF and TMX, but a migration path from the previous version of these specifications is envisioned. Implementation details such as the use of XML namespaces will be decided during the development of this specification.

The markup model covers inline-level markup and also describes how sub-flows are handled, including the relation between a sub-flow's extracted representation and its original location. An extensibility mechanism will also be defined as part of this specification.

For comparison between solutions see: OneContentModel/Comparison

4. Requirements

4.1. Must be able to represent standalone codes

For example, a line break in HTML:

Line 1<br/>line 2

Or an image in HTML:

An elephant: <img src="elephant.png">

4.1.1. Current solution (status: STABLE)

The placeholder element:

<ph id='1'/>

4.2. Must be able to represent balanced paired-codes

For example, normal bolded text in HTML:

<b>Bold</b> text

4.2.1. Current solution (status: STABLE)

The paired codes element:

<pc id='1'>text</pc>

4.3. Must be able to represent paired codes that have been separated

For example, in HTML, <b> may be in one segment while </b> is in a different one:

<p>Text in <b>bold. This too</b>.</p>
--> seg1=[Text in <b>bold.] and seg2=[This too</b>.]

The format must support marking of pairs across segment boundaries.

4.3.1. Current solution (status: STABLE)

The start code and end code elements:

<sc id='1'/>text<ec id='2' rid='1'/>

or

<sc id='1'/>text<ec rid='1'/>

The name of the attribute still needs to be finalized (rid vs. idref).
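As an illustration of how a tool might detect codes whose partner lies in another segment, here is a minimal Python sketch. It assumes the rid attribute name (not yet finalized) and a hypothetical <seg> wrapper element; neither is defined by this page.

```python
import xml.etree.ElementTree as ET

def unmatched_codes(segment_xml):
    """Return (open_ids, orphan_rids) for the <sc/>/<ec/> codes of one segment."""
    root = ET.fromstring(segment_xml)
    open_ids = []     # <sc/> ids whose <ec/> has not been seen in this segment
    orphan_rids = []  # <ec/> rids whose <sc/> is not in this segment
    for elem in root.iter():
        if elem.tag == "sc":
            open_ids.append(elem.get("id"))
        elif elem.tag == "ec":
            rid = elem.get("rid")
            if rid in open_ids:
                open_ids.remove(rid)
            else:
                orphan_rids.append(rid)
    return open_ids, orphan_rids

# <seg> is a hypothetical wrapper used only to make the fragments well-formed.
seg1 = "<seg>Text in <sc id='1'/>bold.</seg>"
seg2 = "<seg>This too<ec rid='1'/>.</seg>"
print(unmatched_codes(seg1))  # (['1'], [])
print(unmatched_codes(seg2))  # ([], ['1'])
```

A start code left open in one segment and an orphan end code in a later segment with the same value indicate a pair that was separated by segmentation.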

4.4. Must be able to represent paired codes that are overlapping each other

For example, the bookmarks in ODF:

<text:p><text:bookmark-start text:name="bm1"/>Text of bookmark bm1 <text:bookmark-start text:name="bm2"/>and bm2.<text:bookmark-end text:name="bm1"/>
Text of bookmark bm2.<text:bookmark-end text:name="bm2"/></text:p>

4.4.1. Current solution (status: STABLE)

The start code and end code elements cover this requirement.
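To see why isolated start and end codes are enough for overlapping pairs, the following Python sketch collects the text enclosed by each pair for the bookmark example. The <seg> wrapper and the rid attribute name are assumptions, and the sketch expects both halves of each pair to be in the same segment.

```python
import xml.etree.ElementTree as ET

def code_spans(segment_xml):
    """Map each <sc/> id to the text found between it and its <ec/>."""
    root = ET.fromstring(segment_xml)
    open_spans = {}  # id -> text pieces collected while the span is open
    closed = {}      # id -> complete text of the span
    # Text before the first code belongs to no span, so root.text is skipped.
    for elem in root:
        if elem.tag == "sc":
            open_spans[elem.get("id")] = []
        elif elem.tag == "ec":
            rid = elem.get("rid")
            closed[rid] = "".join(open_spans.pop(rid))
        # The text after a code belongs to every span that is still open.
        if elem.tail:
            for pieces in open_spans.values():
                pieces.append(elem.tail)
    return closed

seg = (
    "<seg><sc id='1'/>Text of bookmark bm1 <sc id='2'/>and bm2.<ec rid='1'/>"
    " Text of bookmark bm2.<ec rid='2'/></seg>"
)
print(code_spans(seg))
# {'1': 'Text of bookmark bm1 and bm2.', '2': 'and bm2. Text of bookmark bm2.'}
```

The two spans share the text "and bm2.", which a single tree of nested elements could not express.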

4.5. Must allow to associate spans of content with metadata

Examples of potential metadata associated with a span of content:

  • Flag indicating the span must not be translated
  • Flag indicating the span is a term
  • Part of speech, etc.
  • Reference ID used to point to external annotation
  • Translator comment
  • Tool-specific processing instructions

4.5.1. Current solution (status: UNDER DISCUSSION)

Discussion:

  • Should the marker elements be very specific (e.g. <term>, <notrans> etc.) or general with various attributes (e.g. <mrk term='yes' translate='no'>)?

  • Should we use ITS? <span its:translate='no'>?

  • How to deal with overlapping and broken spans: replicate spans, or use isolated markers (similar to <sc/> and <ec/>)?

4.6. Must be able to store a display-friendly representation of an inline code for informational purpose

For example, for an inline code representing a variable, it is useful to see the value of the variable or its name when translating. The alternative representation can also serve as a hint when no other information is available during tasks such as alignment.

For instance: given a native code "<#@style-type-534562-BoldFace@#>" indicating the start of bolded text, the display-friendly representation would be "<b>".

4.6.1. Current solution (status: STABLE)

The disp attribute in the <sc/>, <ec/>, and <ph/> elements, and the disp and dispEnd attributes for the <pc> element.

4.7. Must be able to store a text-equivalent representation of an inline code for linguistic processes

Indicates an equivalent text to substitute in place of an inline code when doing linguistic-related processes.

For example: if, in a text "F&ile", the '&' is an inline code, the text equivalent would be an empty string indicating "F&ile" should be seen as "File" for linguistic purposes.

4.7.1. Current solution (status: STABLE)

The equiv attribute in the <sc/>, <ec/>, and <ph/> elements, and the equiv and equivEnd attributes for the <pc> element.
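The following Python sketch shows one way a tool could use these attributes: it flattens a segment to a plain string, substituting each code with the value of a chosen attribute (equiv for linguistic processing, or disp for display). The <seg> wrapper and the sample disp values are hypothetical.

```python
import xml.etree.ElementTree as ET

CODE_TAGS = {"sc", "ec", "ph", "pc"}

def flatten(elem, attr):
    """Render content as a string, replacing each code by its `attr` value."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag in CODE_TAGS:
            parts.append(child.get(attr, ""))
        if child.tag == "pc":
            # A paired code also has content and an end-side attribute.
            parts.append(flatten(child, attr))
            parts.append(child.get(attr + "End", ""))
        parts.append(child.tail or "")
    return "".join(parts)

seg = ET.fromstring(
    "<seg>Open the <pc id='1' disp='[b]' dispEnd='[/b]'>"
    "F<ph id='2' disp='&amp;' equiv=''/>ile</pc> menu.</seg>"
)
print(flatten(seg, "equiv"))  # Open the File menu.
print(flatten(seg, "disp"))   # Open the [b]F&ile[/b] menu.
```

With equiv, a spell checker or word counter sees "File" as one word; with disp, a translator sees where the formatting and the accelerator key are.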

4.8. Must be able to identify uniquely an inline code within a segment (<source> and/or <target> in XLIFF 1.2 or <seg> in TMX)

An inline code may be associated with external metadata. In order to link together the code and its associated metadata, a way to identify the inline code uniquely within the segment is needed.

4.8.1. Current solution (status: STABLE)

The id attribute.

The value would be unique within the <source> element and within the <target> element.

4.9. Must be able to associate each code of the source with its corresponding code in the target

For example, in the following source and translation:

English: The text is in <b>bold</b> and <i>italics</i>.
Yoda-English: In <i>italics</i> and <b>bold</b> the text is.

The tags <b>, </b>, <i>, and </i> of the source should be mappable to the ones in the translation.

4.9.1. Current solution (status: STABLE)

  • The id of each code in the source is the link to its correspondence in the target.
  • Extra codes in the target must have new id values (not mapped to any code in the source).
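A minimal Python sketch of this id-based mapping, using the example above with the bold and italic tags represented as <pc> elements. The function names are illustrative only.

```python
import xml.etree.ElementTree as ET

CODE_TAGS = {"sc", "ec", "ph", "pc"}

def code_ids(xml_text):
    """List the id values of all inline codes, in document order."""
    root = ET.fromstring(xml_text)
    return [e.get("id") for e in root.iter() if e.tag in CODE_TAGS and e.get("id")]

def compare_codes(source_xml, target_xml):
    """Classify codes as mapped, missing from the target, or added to it."""
    src, trg = code_ids(source_xml), code_ids(target_xml)
    return {
        "mapped": [i for i in src if i in trg],
        "missing": [i for i in src if i not in trg],
        "added": [i for i in trg if i not in src],
    }

source = "<source>The text is in <pc id='1'>bold</pc> and <pc id='2'>italics</pc>.</source>"
target = "<target>In <pc id='2'>italics</pc> and <pc id='1'>bold</pc> the text is.</target>"
print(compare_codes(source, target))
# {'mapped': ['1', '2'], 'missing': [], 'added': []}
```

The codes are reordered in the target, but the id values still identify which source code each target code corresponds to.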

4.10. Must be able to represent the duplication of inline codes in the segment

Sometimes translating formatted text requires splitting the formatted span into several parts placed at different positions in the segment, so the original codes need to be replicated.

English: He often <B>came a cropper </B> due to stress.
German: Er ist oft <B>auf die Nase</B> wegen Stress <B>gefallen</B>.

4.10.1. Current solution (status: STABLE)

  • New code with new id in the target
  • Native data can be:
    • copied and stored along with the code (inside or outside)
    • referenced, using a rel attribute to point to the code used as a model.

4.11. Must be able to represent inline codes added to the segment

Translated text may need to have extra information inserted in the form of inline codes, for example directionality markers for bidirectional languages. Another example: the following Japanese text has a title between special marks that are rendered as italics in English:

Japanese: 私は『時間コレラの』愛を読む
English: I just read <i>Love in the Time of Cholera</i>

4.11.1. Current solution (status: STABLE)

  • New code with new id in the target
  • Native data can be:
    • stored along with the code (inside or outside)
    • omitted: use the type attribute to specify the type of code, and rely on the merging tool to generate the proper native code.

4.12. Must allow three ways to deal with the native data corresponding to an XLIFF inline code

4.12.1. to store only the XLIFF representation, discarding the native data

4.12.1.1. Current solution (status: STABLE)

Empty <sc/>, <ec/> and <ph/> elements and the <pc> element.

4.12.2. to store it along with its XLIFF representation

4.12.2.1. Current solution (status: STABLE)

  • The content of <sc>, <ec> and <ph> is the native data. The data is text.

  • The <pc> element does not support this notation.

4.12.3. to store a pointer to it along with its XLIFF representation

4.12.3.1. Current solution (status: STABLE)

  • All code elements have an attribute (nid) to point to the storage element.

  • Storage element at the unit level.
  • A simple structure like the following to store the data:

<originalData>
  <data id='d1'>&lt;b></data>
  <data id='d2'>&lt;/b></data>
</originalData>

Names still to be defined.
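As a sketch of how the pointer approach could be resolved when merging, the Python code below rebuilds the native string by following each nid attribute into the storage element. The <unit> and <source> wrappers are assumptions, and the element and attribute names are still provisional as noted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical unit: the storage element sits at the unit level.
unit = ET.fromstring("""
<unit>
  <originalData>
    <data id='d1'>&lt;b></data>
    <data id='d2'>&lt;/b></data>
  </originalData>
  <source>Text in <sc id='1' nid='d1'/>bold<ec rid='1' nid='d2'/>.</source>
</unit>
""")

def native_text(unit):
    """Rebuild the native string of <source> by following each nid pointer."""
    data = {d.get("id"): d.text or "" for d in unit.iter("data")}
    source = unit.find("source")
    parts = [source.text or ""]
    for code in source:
        # A code without a usable nid contributes no native data.
        parts.append(data.get(code.get("nid"), ""))
        parts.append(code.tail or "")
    return "".join(parts)

print(native_text(unit))  # Text in <b>bold</b>.
```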

4.13. Must be able to represent separately different flows of text and codes when, in the original format, they are mixed together

Example 1: In DITA a footnote is stored at the location where it is referred to:

<p>Palouse horses<fn>A Palouse horse is the same as an Appaloosa.</fn> have spotted coats.</p>

This p element contains two separate flows: "Palouse horses have spotted coats" and "A Palouse horse is the same as an Appaloosa."

Example 2: The value of the HTML ALT attribute is stored in the IMG tag and can be within a paragraph:

<p>Click here: <img alt='OK' src='ok.png'/>.</p>

4.13.1. Current solution (status: STABLE)

  • The current consensus is to have the different flows in different units.
  • The order in which the units are stored is not being defined.
  • The representation of the "sub-flow" units in relation to their "parent" may or may not be specified by the container format.

4.14. Should be able to represent the mutual relationships between a nested flow of text and its parent

The format should be able to represent both flows and carry some information about their relationship, so the two texts can be put in context when needed.

For example, the relation between the value of an HTML ALT attribute and the paragraph element where it appears should be somehow preserved:

<p>Click here: <img alt='OK' src='ok.png'/>.</p>

4.14.1. Current solution (status: STABLE)

  • subFlows attribute, with a list of ids. Each id is an xsd:NMTOKEN value.

  • The inline code has a subFlows attribute that holds the list of the units where each sub-flow is stored. The list value is an xsd:NMTOKENS.
  • In the unit, the pointer back to the inline code where the extracted text is coming from is determined by the host format.
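The lookup from an inline code to its sub-flows could be sketched in Python as follows, using the ALT attribute example. The <file> and <unit> container elements and their id attributes are hypothetical, since the host format decides how units are stored.

```python
import xml.etree.ElementTree as ET

# Hypothetical container: the ALT text 'OK' is extracted into its own unit.
doc = ET.fromstring("""
<file>
  <unit id='u1'><source>Click here: <ph id='1' subFlows='u2'/>.</source></unit>
  <unit id='u2'><source>OK</source></unit>
</file>
""")

def sub_flow_texts(doc, unit_id):
    """Return {sub-flow unit id: text} for every code in the given unit."""
    units = {u.get("id"): u for u in doc.iter("unit")}
    result = {}
    for code in units[unit_id].find("source").iter():
        # subFlows is a whitespace-separated list (xsd:NMTOKENS).
        for ref in code.get("subFlows", "").split():
            result[ref] = "".join(units[ref].find("source").itertext())
    return result

print(sub_flow_texts(doc, "u1"))  # {'u2': 'OK'}
```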

4.15. Should be able to represent illegal XML characters in the content

Some characters are illegal in XML, but they may appear in extracted text and we should have a common way to represent them so they can be preserved and merged back if necessary, without causing the XML tools to fail.

For example, in the Java property string "Text with \u001a", the character U+001A is illegal in XML but needs to have a representation in XLIFF.

Note: An example of how some XML formats handle this case is the TS format from Qt-Linguist, which uses a <byte> element to represent such characters.

4.15.1. Current solution (status: STABLE)

The solution is a specialized element <cp> with an attribute hex that holds the Unicode code point of the illegal character. This is the same solution as the one used in LDML.

<cp hex='001a'/>
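A round trip through <cp> could look like the following Python sketch. The function names are illustrative, the <source> wrapper is an assumption, and the pattern covers only the C0 control characters, U+FFFE, and U+FFFF (surrogates are left out for brevity).

```python
import re
import xml.etree.ElementTree as ET

# Characters not allowed in XML 1.0 content: C0 controls other than
# tab, LF, and CR, plus U+FFFE and U+FFFF.
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")

def to_xml_content(text):
    """Escape text for XML content, turning illegal characters into <cp/>."""
    escaped = text.replace("&", "&amp;").replace("<", "&lt;")
    return _ILLEGAL.sub(lambda m: "<cp hex='%04x'/>" % ord(m.group()), escaped)

def from_cp_elements(elem):
    """Rebuild the original string from an element that contains <cp/> codes."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "cp":
            parts.append(chr(int(child.get("hex"), 16)))
        parts.append(child.tail or "")
    return "".join(parts)

original = "Text with \u001a"
content = to_xml_content(original)
print(content)  # Text with <cp hex='001a'/>

restored = from_cp_elements(ET.fromstring("<source>" + content + "</source>"))
print(restored == original)  # True
```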

4.16. Inline codes should have a way to store information about the effect on the segmentation

As some inline codes may have an effect on the segmentation of a given content, it is useful if segmentation-specific hints could be stored along with an inline code.

For example: In HTML a <BR> element indicates a forced line break, while a <B>...</B> element should not affect the segmentation.

4.16.1. Current solution (status: UNDER DISCUSSION)

No solution so far, but some discussion: http://lists.oasis-open.org/archives/xliff-inline/201109/msg00000.html

Current ideas:

  • A way of defining an inline XLIFF element as causing a break in the segmentation.
  • A way of defining that a period is not a sentence delimiter (i.e. product-unique abbreviations).
  • A way of defining that an inline XLIFF element should be handled as whitespace.
  • Look at ITS for guidance

4.17. Should preserve span-like structures

When processing original markup with a span-like structure, it should be represented using a span-like element in the XLIFF inline markup, rather than using two XML elements denoting the start and end of the span. This notation allows easier XML processing and corresponds to the original structure. For example, given the following original XML content:

<img src='image.png'/> is a <b>beautiful</b> image.

We should be able to represent it in XLIFF using a markup where "<img src='image.png'/>" is represented by an empty XLIFF element, and "<b>"/"</b>" are represented by a single XLIFF element that encloses "beautiful".

For instance, using a representation such as this imaginary one:

<code id="1" content="img"/> is a <code id="2" start="b" end="b">beautiful</code> image.

Instead of:

<code id="1"/> is a <startCode id="2" start="b"/>beautiful<endCode id="3" end="b"/> image.

4.17.1. Current solution (status: STABLE)

  • The <pc>...</pc> element allows for span-like notation

5. Guiding Principles

5.1. Consider other Standards rather than re-inventing the wheel

Wherever possible, data categories or representation mechanisms from other standards should be considered. Examples of these include ITS and RDF.

5.2. If possible, all text nodes of the content should be real text, not codes

When processing the content with XML parsers, all the nodes of type TEXT should contain real text. This allows the separation between textual content and codes to be physical even in XML tree representation, rather than requiring interpretation of the markup.

For example, the imaginary representation below stores the native codes [startBold] and [endBold] as part of the content. This is what we want to try to avoid.

This text is in <code>[startBold]</code>bold<code>[endBold]</code>.

In contrast, the imaginary representation below stores the native codes [startBold] and [endBold] outside the content. Therefore the sum of all TEXT nodes represents only true text. This is what we want to try to achieve.

This text is in <code native="[startBold]"/>bold<code native="[endBold]"/>.

Note that this objective may or may not be possible to achieve, depending on various factors.
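The difference between the two imaginary representations can be observed with any generic XML parser. The Python sketch below concatenates all TEXT nodes of each variant; the <seg> wrapper is hypothetical, and the second variant uses empty <code/> elements so that the fragment is well-formed.

```python
import xml.etree.ElementTree as ET

# Native codes stored as element content (to be avoided).
inside = ET.fromstring(
    "<seg>This text is in <code>[startBold]</code>bold<code>[endBold]</code>.</seg>"
)
# Native codes stored in attributes, outside the text content.
outside = ET.fromstring(
    "<seg>This text is in <code native='[startBold]'/>bold"
    "<code native='[endBold]'/>.</seg>"
)

# itertext() yields every TEXT node, which is what a generic XML tool sees.
print("".join(inside.itertext()))   # This text is in [startBold]bold[endBold].
print("".join(outside.itertext()))  # This text is in bold.
```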

OneContentModel (last edited 2012-05-10 12:06:41 by ysavourel)