Contents

  1. Overview
    1. Definitions/Terminology
  2. Plan
  3. Defining the Scope (Step 1)
    1. Questions we need to consider in defining the scope
      1. Common representation of 'inline' markup vs 'block-level' markup
      2. Canonical Representation of native content
      3. Extensibility / Annotations
      4. Content Manipulation
      5. XML Implementation
      6. General Scope
  4. Requirements (Step 2)
    1. Should be able to represent standalone codes
    2. Should be able to represent balanced paired-codes
    3. Should be able to represent paired codes that have been separated
    4. Should be able to represent paired codes that are overlapping each other
    5. If possible, all text nodes of the content should be real text, not codes
    6. Should be able to mark up spans of content and associate them with standardized information
    7. Should be able to mark up spans of content and associate them with user-defined information
    8. Should be able to store a text-equivalent representation of the code in any markup that represents a code
    9. Should be able to identify uniquely a code within a content fragment
    10. Should be able to associate the same codes between the source and the target segments
    11. Should be able to store the actual data of a code along with the code
    12. Should be able to store a pointer to the actual data of a code along with the code
    13. Should be able to store no information about the data of a code along with the code
    14. Should be able to represent a flow of text that, in the original format, was stored as a nested flow
    15. Should be able to represent the relationship between a nested flow of text and its parent
    16. Should be able to represent invalid-XML character in the content
    17. Inline codes should have a way to store information about segmentation

This page continues the exchange started in http://lists.oasis-open.org/archives/xliff/200903/msg00006.html

1. Overview

As component-based and internet-driven technologies evolve, tools need to communicate as seamlessly as possible, not only by exchanging documents but also by exchanging small segments of information.

Many of the Web services, plugins, and other bricks that make up the tools being built today need to exchange data at the segment level, not at the file level. Whether these components identify terms, highlight spelling mistakes, or provide TM matches or MT guesses, they all ultimately need to access the same abstracted extracted text.

Having a single representation for this abstracted extracted text brings more interoperability with respect to data and thus tools. From a tool provider's point of view, the advantage of a single representation arises, for example, from the fact that no bridges along the lines of "If dealing with X then ... if dealing with Y then ..." are necessary.

1.1. Definitions/Terminology

This section is under construction. The following documents may provide guidance.

http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm#SectionContentMarkup

http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine

http://www.w3.org/International/articles/inline-bidi-markup/

http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_XHTML-1.0-Strict

http://www.w3.org/TR/xhtml1/dtds.html#dtdentry_xhtml1-strict.dtd_phrase

http://www.w3.org/TR/xml-c14n

http://www.w3.org/TR/2008/NOTE-xml-i18n-bp-20080213/#DevSeg

| Term | Alternative Terms | Examples | Definition |
|---|---|---|---|
| Block-level Markup | | Headings, lists, and blockquotes in HTML | Delimiters for content which is usually linguistically independent from its surrounding content. |
| Canonical Representation | Canonical Form | The canonical form of an XML document. | The physical representation of a given content fragment which has been derived from a native representation in an agreed upon way. Since the way of derivation is defined, derivation can only produce a single possible result. |
| Inline-level Markup | | Big, sup, and em in HTML | Delimiters for content which is usually linguistically dependent on its surrounding content. |
| Segment | | A sentence in a language such as English. | Content which has either been marked up as "block", or has been created from a "block" by means of a segmentation mechanism such as sentence boundary detection. |
| Sub-flow | | A footnote in an academic text | Content which is block-level in nature but is semantically closely related to other blocks. |

2. Plan

Step 1: Agree on Scope for the specification

Step 2: Define Requirements

Step 3: Design possible solutions

Note: Format for annotating items resolved by TC:

[Resolved (minutes: URL to minutes goes here)"Explanation goes here."]

3. Defining the Scope (Step 1)

3.1. Questions we need to consider in defining the scope

3.1.1. Common representation of 'inline' markup vs 'block-level' markup

  1. Does this proposal outline how blocks of text relate to each other, <group>-level, or only at the segment-level (<source>,<target>)?

[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200905/msg00023.html)"it was agreed to work at segment level, with flexibility to include groups when needed. Rodolfo will expand this section with comments about merging/splitting segments."]


While there is a need for XLIFF to address "block-level" relation, and maybe in TMX as well, I think the content model we are talking about here is about the representation of extracted data in source/target.

As far as I can think (but I may have a narrow view) the only two parts of that model that may have to do with relation between a given content and others would be:


I agree with Yves' comment above. The whole activity from my understanding should relate to a "generic inline markup".

I see a danger in injecting XLIFF-specific concepts such as "seg-source" (see comment from Rodolfo below). From my understanding, format-specific concepts could be introduced after the scope discussion has finished.


Discussion at segment level should consider <seg-source> and its effects on <target> as well. When <seg-source> is used, the content of <target> does not represent the translation of <source>, because there may be extra markup (<mrk>) present in <target> that is not included in <source>.


  2. Does the model allow for sub-flows?

[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200905/msg00023.html)"it was agreed that sub flows will not be allowed as they are now. Yves will add content to the wiki, summarizing the TCs view."] The model should allow handling of text that is marked up as a sub-flow in the original format.

  3. If nesting (<sub> flows) is discouraged, does this proposal describe how to reference inline the other unit(s) containing the nested content? (cf. W3C ITS "elements within text" data category)

[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200909/msg00001.html) "Everyone agreed that it would be good to move subflows to their own translation units. The specifications should describe how to indicate the original location of the moved text and the type of markup that enclosed it".]

The model should describe how sub-flows are handled, including the relation between their extracted representation and their original location.

3.1.2. Canonical Representation of native content

  1. Should there only be one physical representation for a given native representation (wrt codes, inline-markup, 'skeleton' data, sub-flows)?

[Not Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200906/msg00003.html) "The feasibility of having a canonical representation was discussed. Action Item: Christian to extend the wiki content regarding canonical forms."The request to allow annotations and extensions in the new model was analyzed. The scope of the request was not clear and several possibilities were discussed. Conversation may continue in an email thread". And June 16 "The meaning and need for a canonical representation of inline markup was discussed (wiki section 3.1.2). Christian proposed to move discussion to an email thread. "]

[Tabled (minutes: http://lists.oasis-open.org/archives/xliff/200909/msg00001.html) "A roll call vote was held and it was agreed that discussion of canonical representation will be kept as a working item for further discussion." Bryan: We have spent many meeting cycles and cannot seem to agree on an answer to the question (with regard to defining scope) "Should there only be one physical representation for a given native representation (wrt codes, inline-markup, 'skeleton' data, sub-flows)?" So I request the interested parties establish a point/counterpoint email thread. When all points are documented I will conduct an email ballot to arrive at a "yes/no" decision - no later than 13-OCTOBER-2009]


There is no need. It will not be possible to include all representations in the specification document.


A canonical representation would, from my point of view, simplify processing substantially.

We may want to differentiate between two types of canonical representation:

  1. a semantic/abstract one
  2. an encoded/concrete one

This distinction can be found for example in ITS (http://www.w3.org/TR/2007/REC-its-20070403/#design-decisions ; "abstraction"). The distinction establishes some middle ground, since it may be much harder or even impossible to come up with an encoded canonical form, whereas a semantic canonical form may be feasible. Furthermore, the distinction would provide an opportunity to deploy an "association mechanism" like the one in ITS (http://www.w3.org/TR/2007/REC-its-20070403/#associating-its-with-existing-markup).

We may for example decide that the generic inline markup will need to have one representation each for standalone codes and for balanced-pair codes. This would be two abstract, semantic representations.

We may then say that the single allowed representation for standalone codes is an XML element "x" with a couple of predefined attributes. This would be a concrete, encoded representation.

An "association mechanism" like the one in ITS (see above) may be the middle ground which could reveal information such as: In my format, the XML element "y" corresponds to the "x" element in the generic inline markup namespace.


3.1.3. Extensibility / Annotations

  1. Should the proposal outline how to annotate and/or extend the common representation in an exchange-friendly way?
  2. Should the proposal separate possibly localizable text, native format (includes structural information), and annotations to enable easy parsing/transformation?


Rodolfo - September 14, 2009

What is the meaning of "common representation"?

As far as I know, when creating an XLIFF file translatable text is separated from native format. Native format is stored in a skeleton (internal or external) and translatable text is placed in <trans-unit> elements. Annotations, if any, are not part of XLIFF and I think they should not be. XLIFF already supports extensions via namespaces and that's how I see annotations should be implemented by those that want it.


The proposal should include an extensibility feature.

The proposal should clearly separate between different content categories such as localizable text and native format.

[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200909/msg00006.html ) "It was agreed that current method is sufficient and provides a friendly way to extend or annotate, therefore no additional work is required. The TC agreed on inviting all tool developers to contribute their custom extensions to XLIFF for publishing in an open repository that the XLIFF TC would maintain. "] - note: this was agreed upon by the members of the 14 Sep 2009 call - but a member left the call so with no quorum, the vote was not binding - so resolution is pending a roll call vote.

[Resolved (regarding extensibility) (minutes: http://lists.oasis-open.org/archives/xliff/200910/msg00002.html ) "(1) Rodolfo suggested adding a note to section 2.5 explaining that extension points are intended for providing support for internal tool processing, not for solving XLIFF deficiencies.

(2) A roll call ballot was conducted on these terms: A yes vote means that you agree on adding a note indicating that extensions should be used for tool specific internal processing purposes, not for holding localization data that should be supported using standard XLIFF elements. The motion was approved unanimously.

(3) A second roll call ballot was conducted on these terms: A yes vote means that you agree that current extensibility options are sufficient and provide a way to exchange in a user friendly way. The motion passed. "]

[Resolved (regarding annotation) (minutes: http://lists.oasis-open.org/archives/xliff/200911/msg00003.html ) "Christian explained that annotation refers to information targeted at human readers, contrasting it to extensibility which he thinks is targeted to processing tools. Standard XML comments would not be good enough, as there is a risk of losing them during processing. Magnus expressed that current extensibility mechanism is rather weak and doesn't provide all the means required for improving processing. Christian proposed a ballot with two questions regarding the future inline markup specification: should the new specification outline an extensibility mechanism? and should the extensibility mechanism include the possibility to classify extensions in different types? Two roll call ballots were conducted and the answer to both questions was yes."]

3.1.4. Content Manipulation

  1. Should the specification define how the inline-content (and block level) model can be manipulated, including:
    1. indicate when a code can be deleted or not, can be cloned or not,
    2. indicate if a code can be moved out of sequence or not;

(this type of info is useful when doing QA, when the translator is manually editing a segment, when composing a target based on various matches, or in many other scenarios.)

[Resolved (minutes: http://www.oasis-open.org/apps/org/workgroup/xliff/email/archives/200906/msg00011.html)"It was agreed that the specification should define how the inline-content (and block level) model can be manipulated."]
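
As a rough illustration of how such manipulation information could be used during QA, the following Python sketch attaches three flags to each code and checks a target against its source. The class `CodeRule`, its flag names, and the function `check_target` are assumptions made for this example only; nothing here has been agreed by the TC. The sketch also assumes that code identifiers are unique within the source.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeRule:
    """Hypothetical per-code manipulation flags (names are illustrative)."""
    can_delete: bool = True
    can_clone: bool = True
    can_reorder: bool = True


def check_target(source_ids, target_ids, rules):
    """Return a list of problems found when comparing target codes to source codes."""
    problems = []
    counts = Counter(target_ids)
    for cid in source_ids:
        rule = rules.get(cid, CodeRule())
        if counts[cid] == 0 and not rule.can_delete:
            problems.append(f"{cid}: deleted")
        if counts[cid] > 1 and not rule.can_clone:
            problems.append(f"{cid}: cloned")
    # Codes that may not be moved must keep their relative source order.
    fixed = [c for c in source_ids if not rules.get(c, CodeRule()).can_reorder]
    in_target = list(dict.fromkeys(c for c in target_ids if c in fixed))
    expected = [c for c in fixed if c in in_target]
    if in_target != expected:
        problems.append("fixed-order codes were moved out of sequence")
    return problems


rules = {
    "1": CodeRule(can_delete=False),
    "2": CodeRule(can_clone=False),
    "3": CodeRule(can_reorder=False),
    "4": CodeRule(can_reorder=False),
}
source_ids = ["1", "2", "3", "4"]
problems = check_target(source_ids, ["2", "4", "2", "3"], rules)
```

Codes without an entry in `rules` fall back to the permissive defaults, so a tool that knows nothing about a code would not block the translator.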


It should be possible in many cases. Code checking is one of the problems many LSPs face; being able to perform smarter modifications/checks for some formats would help a lot. I think we should at least explore the possibilities.


I agree with Yves' comment above.


3.1.5. XML Implementation

  1. tied to current version or successor of an existing namespace (such as TMX or XLIFF)?
  2. or placed in a namespace of its own (to support a modular XML content architecture; cf. "xml:lang" or "xml:space")?
  3. Should the specification be backwards compatible with existing versions of e.g. TMX and XLIFF?
    1. If not, should the specification define migration paths and mappings for existing content models (TMX and XLIFF)?
  4. enabling for the W3C Internationalization Tag Set (I guess allowing local ITS markup would be sufficient)?


The specification should not prescribe the use of ITS and should not force applications to interpret ITS markup present in an XLIFF file. Applications should be able to safely ignore non-XLIFF markup present in an XLIFF file. If non-XLIFF markup cannot be ignored, then the file and/or its producer should not be considered XLIFF compliant.


Q3) Not necessarily



3.1.6. General Scope


I disagree. The work done here applies only to exchange of localization data. Exchange of translation memories and other tasks are beyond the goals described in XLIFF TC's charter.

4. Requirements (Step 2)

4.1. Should be able to represent standalone codes

For example, a line break in HTML:

Line 1<br/>line 2

4.2. Should be able to represent balanced paired-codes

For example, normal bolded text in HTML:

<b>Bold</b> text

4.3. Should be able to represent paired codes that have been separated

For example, in HTML, <b> is in one segment while </b> is in a different one:

Text in <b>bold. This too</b>.

4.4. Should be able to represent paired codes that are overlapping each other

For example:

Text in <startB/>bold and in <startI/>italic<endB/>.<endI/>
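
One way to see why empty start and end markers make overlap representable is the Python sketch below: the content stays well-formed XML, and each pair is recovered by matching identifiers rather than by element nesting. The element names `sc` and `ec` and the attributes `id` and `rid` are assumptions chosen for this illustration, not names agreed for the model.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <sc/> opens a code, <ec/> closes the code whose id equals rid.
content = (
    '<seg>Text in <sc id="b"/>bold and in <sc id="i"/>italic'
    '<ec rid="b"/>.<ec rid="i"/></seg>'
)


def code_spans(xml_text):
    """Return {code id: text covered by that code}; spans may overlap."""
    root = ET.fromstring(xml_text)
    open_ids = set()
    spans = {}

    def feed(text):
        # Add the text to every code that is currently open.
        for cid in open_ids:
            spans[cid] += text

    feed(root.text or "")
    for marker in root:
        if marker.tag == "sc":
            open_ids.add(marker.get("id"))
            spans[marker.get("id")] = ""
        elif marker.tag == "ec":
            open_ids.discard(marker.get("rid"))
        feed(marker.tail or "")
    return spans


spans = code_spans(content)
```

Here `spans` maps `b` to "bold and in italic" and `i` to "italic.", which a properly nested pair of elements could not express.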

4.5. If possible, all text nodes of the content should be real text, not codes

When processing the content with XML parsers, all the nodes of type TEXT should contain real text. This makes the separation between textual content and codes physical, even in the parsed object, rather than dependent on the markup itself.
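
The difference can be sketched in Python by comparing what a parser reports as text for two encodings of the same content. The element `x` and its `data` attribute are assumptions used only for this example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the HTML content "Line 1<br/>line 2".
codes_as_markup = '<seg>Line 1<x id="1" data="&lt;br/&gt;"/>line 2</seg>'
codes_as_text = '<seg>Line 1&lt;br/&gt;line 2</seg>'


def text_nodes(xml_text):
    """Return the text nodes a parser reports for the fragment, in document order."""
    return list(ET.fromstring(xml_text).itertext())


clean = text_nodes(codes_as_markup)  # only real text
mixed = text_nodes(codes_as_text)    # the code leaks into the text
```

With the first encoding, a tool can hand every text node to a spell checker or word counter without first filtering out codes; with the second, it has to re-parse the text to find them.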

4.6. Should be able to mark up spans of content and associate them with standardized information

Some examples:

4.7. Should be able to mark up spans of content and associate them with user-defined information

Same requirement as above, but where the associated information is user-defined.

4.8. Should be able to store a text-equivalent representation of the code in any markup that represents a code

For example, for an inline code representing a variable, it is useful to see the value of the variable or its name when translating. The text-equivalent information could also be used as a hint when no other information is present while doing tasks such as alignment, etc.

4.9. Should be able to identify uniquely a code within a content fragment

As a code needs to be associated with additional information within or outside its container, it needs a unique identifier.

4.10. Should be able to associate the same codes between the source and the target segments

For example, in the following source and translation:

English: "The text is in <b>bold</b> and <i>italics</i>."
Yoda-English: "In <i>italics</i> and <b>bold</b> the text is."

The codes <b>, </b>, <i>, and </i> of the source should be mappable to the corresponding ones in the translation.
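
A minimal Python sketch of such a mapping, assuming each paired code carries an identifier that is preserved in the translation. The element name `pc` and the attribute `id` are assumptions for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical representation: each paired code is a <pc> element with a unique id.
source = ('<source>The text is in <pc id="1">bold</pc> and '
          '<pc id="2">italics</pc>.</source>')
target = ('<target>In <pc id="2">italics</pc> and '
          '<pc id="1">bold</pc> the text is.</target>')


def codes_by_id(xml_text):
    """Map each code id to the text the code encloses."""
    root = ET.fromstring(xml_text)
    return {el.get("id"): el.text for el in root.iter("pc")}


src = codes_by_id(source)
tgt = codes_by_id(target)

# Codes are associated by id, regardless of their order in the segment.
pairs = {cid: (src[cid], tgt[cid]) for cid in src if cid in tgt}
missing = sorted(set(src) - set(tgt))  # in source but not in target
added = sorted(set(tgt) - set(src))    # in target but not in source
```

Because the association relies on identifiers, the reordering in the translation does not prevent the codes from being matched.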

4.11. Should be able to store the actual data of a code along with the code

Some tools may require the actual codes to be available from within the container.

Note: this is one of three ways to store codes; the others are the two requirements below.

4.12. Should be able to store a pointer to the actual data of a code along with the code

Some tools may require the actual codes to be available from outside the text container.

Note: this is one of three ways to store codes; the others are the requirement above and the one below.

4.13. Should be able to store no information about the data of a code along with the code

Note: this is one of three ways to store codes; the others are the two requirements above.
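
The three storage options of requirements 4.11 to 4.13 can be pictured with one small data structure, sketched here in Python. The class `InlineCode`, its field names, and the dictionary used as an external store are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InlineCode:
    """Hypothetical inline code supporting the three storage options."""
    code_id: str
    data: Optional[str] = None      # 4.11: actual data stored with the code
    data_ref: Optional[str] = None  # 4.12: pointer to data stored elsewhere

    def resolve(self, external):
        """Return the original data if it can be found, else None (4.13)."""
        if self.data is not None:
            return self.data
        if self.data_ref is not None:
            return external.get(self.data_ref)
        return None


# Stand-in for data kept outside the text container, e.g. in a skeleton.
external_store = {"d1": "<br/>"}

embedded = InlineCode("1", data="<b>")
pointed = InlineCode("2", data_ref="d1")
opaque = InlineCode("3")
```

A tool that only needs to preserve codes can treat all three alike through `code_id`, while a tool that needs the original data calls `resolve`.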

4.14. Should be able to represent a flow of text that, in the original format, was stored as a nested flow

For example, the value of the HTML ALT attribute is stored in the IMG tag that can be within a paragraph:

<p>Click here: <img alt='OK' src='ok.png'/>.</p>

Another example: in ODF a footnote is stored at the location where the footnote reference is displayed when publishing.

4.15. Should be able to represent the relationship between a nested flow of text and its parent

The format should be able to represent both flows and have some information about their relationship, so the two texts can be put in context when needed.

For example, the relation between the value of an HTML ALT attribute and the paragraph element where it appears should be somehow preserved.
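
As a sketch of one possible approach, the Python code below extracts the ALT value into a unit of its own and records both directions of the relationship: the parent text keeps a placeholder naming the sub-flow unit, and the sub-flow unit names its parent and its original location. The unit identifiers, the placeholder syntax, and the dictionary keys are all assumptions for this example.

```python
import xml.etree.ElementTree as ET

html = "<p>Click here: <img alt='OK' src='ok.png'/>.</p>"


def extract_units(fragment, unit_id="u1"):
    """Split a paragraph into a parent unit and one unit per ALT sub-flow."""
    para = ET.fromstring(fragment)
    units = {unit_id: {"text": para.text or "", "parent": None}}
    for n, el in enumerate(para, start=1):
        alt = el.get("alt")
        if alt is not None:
            sub_id = f"{unit_id}-sub{n}"
            # The sub-flow remembers its parent and where it came from.
            units[sub_id] = {"text": alt, "parent": unit_id,
                             "origin": f"{el.tag}/@alt"}
            units[unit_id]["text"] += f"[code {n} subflow={sub_id}]"
        else:
            units[unit_id]["text"] += f"[code {n}]"
        units[unit_id]["text"] += el.tail or ""
    return units


units = extract_units(html)
```

With both links stored, a tool can show "OK" next to "Click here: ..." when the translator needs context, and can put the translated ALT value back in the right place on merge.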

4.16. Should be able to represent invalid-XML character in the content

Some characters are illegal in XML, but they are used in extracted text and we should have a way to represent them without causing the XML tools to fail.

For example, non-XML source documents may have control characters that cannot be represented directly in XML. It should be possible to code them somehow so they can make it through the translation process without being lost or corrupted.

Note: An example of how some XML formats handle this is the TS format from Qt Linguist, which uses a <byte> element to represent such characters.
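
A minimal Python sketch of this idea replaces each character that XML 1.0 does not allow with an empty element carrying the code point, and restores it when reading the content back. The element name `cp` and its `hex` attribute are assumptions for this illustration and are not taken from the TS format; the character class covers the C0 control characters other than tab, line feed and carriage return, plus U+FFFE and U+FFFF.

```python
import re
import xml.etree.ElementTree as ET

# Characters that XML 1.0 cannot carry directly (surrogates are not handled here).
INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def encode(text):
    """Build a <seg> element in which invalid characters become <cp hex="..."/>."""
    seg = ET.Element("seg")
    seg.text = ""
    last = None
    pos = 0
    for m in INVALID.finditer(text):
        chunk = text[pos:m.start()]
        if last is None:
            seg.text = chunk
        else:
            last.tail = chunk
        last = ET.SubElement(seg, "cp", hex=format(ord(m.group()), "04X"))
        pos = m.end()
    if last is None:
        seg.text = text[pos:]
    else:
        last.tail = text[pos:]
    return seg


def decode(seg):
    """Rebuild the original string from a <seg> element produced by encode()."""
    parts = [seg.text or ""]
    for cp in seg:
        parts.append(chr(int(cp.get("hex"), 16)))
        parts.append(cp.tail or "")
    return "".join(parts)


raw = "Esc:\x1b[0m done"
xml_text = ET.tostring(encode(raw), encoding="unicode")
round_trip = decode(ET.fromstring(xml_text))
```

The serialized form contains no forbidden character, so it survives any conforming XML tool, and `round_trip` equals the original string.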

4.17. Inline codes should have a way to store information about segmentation

As some inline codes may have an effect on the segmentation of a given content, it is useful if segmentation-specific hints could be stored along with the inline code.

For example: In HTML a <br/> element indicates a forced line break, while a <b>...</b> element should not affect the segmentation.
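
The following Python sketch shows how a segmenter might use such a hint. The class `Code` and its `breaks_segment` flag are assumptions for this example; the sketch keeps the breaking code at the end of the segment it closes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Code:
    """Hypothetical inline code with a segmentation hint."""
    data: str
    breaks_segment: bool = False


def split_on_hints(items):
    """Split a mixed list of text and codes after every code flagged as a break."""
    segments, current = [], []
    for item in items:
        current.append(item)
        if isinstance(item, Code) and item.breaks_segment:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


content = [
    "Line 1",
    Code("<br/>", breaks_segment=True),  # forced line break: ends the segment
    Code("<b>"),                         # formatting: no effect on segmentation
    "line",
    Code("</b>"),
    " 2",
]
segments = split_on_hints(content)
```

The `<br/>` code yields two segments, while the `<b>` pair stays inside the second one.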


OneContentModel/Requirements (last edited 2009-11-03 17:38:14 by bryan.s.schnabel)