Scope

The purpose of this specification is to define a common model for inline-level markup for localization, allowing task- and tool-agnostic resource exchange and processing. The main implementation targets are XLIFF and TMX.

This specification will encourage a common semantic representation of native format constructs, such as formatting, in order to facilitate both the re-use of translations across file formats and common processing of localizable data across native file formats.

The markup model will be represented in XML and defined by an XML schema. There is no backward compatibility requirement with earlier versions of XLIFF and TMX, but a migration path from the previous version of these specifications is envisioned. Implementation details such as the use of XML namespaces will be decided during the development of this specification.

The markup model relates to inline-level markup, and also describes how sub-flows are handled, including the relation between their extracted representation and their original location. An extensibility mechanism will also be defined as part of this specification.

Working Document: Defining the Scope (Step 1)

Questions we need to consider in defining the scope

Common representation of 'inline' markup vs 'block-level' markup

  1. Does this proposal outline how blocks of text relate to each other at the <group> level, or only at the segment level (<source>, <target>)?

[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200905/msg00023.html) "it was agreed to work at segment level, with flexibility to include groups when needed. Rodolfo will expand this section with comments about merging/splitting segments."]


While there is a need for XLIFF, and maybe TMX as well, to address "block-level" relations, I think the content model we are talking about here concerns the representation of extracted data in source/target.

As far as I can tell (but I may have a narrow view), the only two parts of that model that may involve a relation between a given content and others would be:


I agree with Yves' comment above. The whole activity from my understanding should relate to a "generic inline markup".

I see a danger in injecting XLIFF-specific concepts such as "seg-source" (see comment from Rodolfo below). From my understanding, format-specific concepts could be introduced after the scope discussion has finished.


Discussion at segment level should consider <seg-source> and its effects on <target> as well. When <seg-source> is used, the content of <target> does not represent the translation of <source> because there may be extra markup (<mrk>) present in <target> that is not included in <source>.
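For illustration, here is a minimal XLIFF 1.2 fragment (with invented content) showing how segmentation markup appears in <seg-source> and is mirrored in <target>, while <source> remains unsegmented:

```xml
<trans-unit id="1">
  <source>First sentence. Second sentence.</source>
  <!-- Segmentation added after extraction -->
  <seg-source>
    <mrk mtype="seg" mid="1">First sentence.</mrk>
    <mrk mtype="seg" mid="2">Second sentence.</mrk>
  </seg-source>
  <!-- The target mirrors the <mrk> structure that <source> lacks -->
  <target>
    <mrk mtype="seg" mid="1">Première phrase.</mrk>
    <mrk mtype="seg" mid="2">Deuxième phrase.</mrk>
  </target>
</trans-unit>
```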


  1. Does the model allow for sub-flows?

[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200905/msg00023.html) "it was agreed that sub flows will not be allowed as they are now. Yves will add content to the wiki, summarizing the TCs view."] The model should allow handling of text that is marked up as a sub-flow in the original format.

  1. If nesting (<sub> flows) is discouraged, does this proposal describe how to reference inline the other unit(s) containing the nested content? (cf. W3C ITS "elements within text" data category)

[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200909/msg00001.html) "Everyone agreed that it would be good to move subflows to their own translation units. The specifications should describe how to indicate the original location of the moved text and the type of markup that enclosed it".]

The model should describe how sub-flows are handled, including the relation between their extracted representation and their original location.
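A hypothetical sketch of what such a representation could look like, extracting an image's alt text into its own translation unit. The subFlow attribute and the ids here are invented for illustration only; in XLIFF 1.2 the alt text would instead be nested inline inside a <sub> element:

```xml
<!-- Original HTML: <p>Click <img src="go.png" alt="Go button"/> to start.</p> -->
<trans-unit id="tu1">
  <source>Click <ph id="ph1" subFlow="tu2">&lt;img src="go.png"/&gt;</ph> to start.</source>
</trans-unit>
<!-- The alt text, moved to its own unit and linked back to its original location -->
<trans-unit id="tu2">
  <source>Go button</source>
</trans-unit>
```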

Canonical Representation of native content

  1. Should there only be one physical representation for a given native representation (wrt codes, inline-markup, 'skeleton' data, sub-flows)?

[Not Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200906/msg00003.html) "The feasibility of having a canonical representation was discussed. Action Item: Christian to extend the wiki content regarding canonical forms. The request to allow annotations and extensions in the new model was analyzed. The scope of the request was not clear and several possibilities were discussed. Conversation may continue in an email thread." And June 16: "The meaning and need for a canonical representation of inline markup was discussed (wiki section 3.1.2). Christian proposed to move discussion to an email thread."]

[Tabled (minutes: http://lists.oasis-open.org/archives/xliff/200909/msg00001.html) "A roll call vote was held and it was agreed that discussion of canonical representation will be kept as a working item for further discussion." Bryan: We have spent many meeting cycles and cannot seem to agree on an answer to the question (with regard to defining scope) "Should there only be one physical representation for a given native representation (wrt codes, inline-markup, 'skeleton' data, sub-flows)?" So I request the interested parties establish a point/counterpoint email thread. When all points are documented I will conduct an email ballot to arrive at a "yes/no" decision - no later than 13-OCTOBER-2009]


There is no need. It will not be possible to include all representations in the specification document.


A canonical representation, from my point of view, would simplify processing substantially.

We may want to differentiate between two types of canonical representation:

  1. a semantic/abstract one
  2. an encoded/concrete one

This distinction can be found for example in ITS (http://www.w3.org/TR/2007/REC-its-20070403/#design-decisions ; "abstraction"). The distinction establishes some middle ground, since it may be much harder or even impossible to come up with an encoded canonical form, whereas a semantic canonical form may be feasible. Furthermore, the distinction would provide an opportunity to deploy an "association mechanism" like the one in ITS (http://www.w3.org/TR/2007/REC-its-20070403/#associating-its-with-existing-markup).

We may for example decide that the generic inline markup will need to have one representation each for standalone codes and for balanced-pair codes. These would be two abstract, semantic representations.

We may then say that the single allowed representation for standalone codes is an XML element "x" with a couple of predefined attributes. This would be a concrete, encoded representation.
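As a point of comparison, XLIFF 1.2 already makes this abstract/concrete distinction for existing codes: the abstract categories "standalone" and "balanced pair" are encoded concretely as the <x/> and <g> (or <bpt>/<ept>) elements:

```xml
<!-- Standalone code: a line break encoded as the empty element <x/> -->
<source>First line<x id="1" ctype="lb"/>second line</source>

<!-- Balanced-pair code: bold formatting encoded as the paired element <g> -->
<source>Some <g id="2" ctype="bold">bold</g> text</source>
```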

An "association mechanism" like that in ITS (see above) may be the middle ground, which could reveal information such as: in my format, the XML element "y" corresponds to the "x" element in the generic inline markup namespace.


Extensibility / Annotations

  1. Should the proposal outline how to annotate and/or extend the common representation in an exchange-friendly way?
  2. separation between possibly localizable text, native format (includes structural information), and annotations to enable easy parsing/transformation


Rodolfo - September 14, 2009

What is the meaning of "common representation"?

As far as I know, when creating an XLIFF file translatable text is separated from native format. Native format is stored in a skeleton (internal or external) and translatable text is placed in <trans-unit> elements. Annotations, if any, are not part of XLIFF and I think they should not be. XLIFF already supports extensions via namespaces and that's how I see annotations should be implemented by those that want it.
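A sketch of the namespace-based extensibility Rodolfo describes; the acme namespace and the <acme:note> element are invented for illustration:

```xml
<trans-unit id="1" xmlns:acme="urn:example:acme-ext">
  <source>Hello world</source>
  <target>Bonjour le monde</target>
  <!-- Tool-specific data carried in a custom namespace; other tools may ignore it -->
  <acme:note reviewer="jsmith">Checked against glossary v2.</acme:note>
</trans-unit>
```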


The proposal should include an extensibility feature.

The proposal should clearly separate between different content categories such as localizable text and native format.

[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200909/msg00006.html) "It was agreed that current method is sufficient and provides a friendly way to extend or annotate, therefore no additional work is required. The TC agreed on inviting all tool developers to contribute their custom extensions to XLIFF for publishing in an open repository that the XLIFF TC would maintain."] - note: this was agreed upon by the members of the 14 Sep 2009 call, but a member left the call; with no quorum, the vote was not binding, so resolution is pending a roll call vote.

[Resolved (regarding extensibility) (minutes: http://lists.oasis-open.org/archives/xliff/200910/msg00002.html ) "(1) Rodolfo suggested adding a note to section 2.5 explaining that extension points are intended for providing support for internal tool processing, not for solving XLIFF deficiencies.

(2) A roll call ballot was conducted on these terms: A yes vote means that you agree on adding a note indicating that extensions should be used for tool specific internal processing purposes, not for holding localization data that should be supported using standard XLIFF elements. The motion was approved unanimously.

(3) A second roll call ballot was conducted on these terms: A yes vote means that you agree that current extensibility options are sufficient and provide a way to exchange in a user friendly way. The motion passed. "]

[Resolved (regarding annotation) (minutes: http://lists.oasis-open.org/archives/xliff/200911/msg00003.html) "Christian explained that annotation refers to information targeted at human readers, contrasting it to extensibility which he thinks is targeted to processing tools. Standard XML comments would not be good enough, as there is a risk of losing them during processing. Magnus expressed that current extensibility mechanism is rather weak and doesn't provide all the means required for improving processing. Christian proposed a ballot with two questions regarding the future inline markup specification: should the new specification outline an extensibility mechanism? and should the extensibility mechanism include the possibility to classify extensions in different types? Two roll call ballots were conducted and the answer to both questions was yes."]

Content Manipulation

  1. Should the specification define how the inline-content (and block-level) model can be manipulated, including:
    1. indicating whether a code can be deleted or not, and cloned or not;
    2. indicating whether a code can be moved out of sequence or not?

(This type of information is useful when doing QA, when the translator is manually editing a segment, when composing a target based on various matches, or in many other scenarios.)
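A hypothetical encoding of such constraints as attributes on inline codes; canDelete and canReorder are invented names, not part of any current specification:

```xml
<!-- The placeholder {0} is required and fixed in position; the <br/> is optional -->
<source>Press <ph id="1" canDelete="no" canReorder="no">{0}</ph> to
continue<ph id="2" canDelete="yes" canReorder="yes">&lt;br/&gt;</ph></source>
```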

[Resolved (minutes: http://www.oasis-open.org/apps/org/workgroup/xliff/email/archives/200906/msg00011.html)"It was agreed that the specification should define how the inline-content (and block level) model can be manipulated."]


It should be possible in many cases. Code checking is one of the problems many LSPs face; having some way to perform smarter modifications/checks for some formats would help a lot. I think we should at least explore the possibilities.


I agree with Yves' comment above.


XML Implementation

  1. tied to current version or successor of an existing namespace (such as TMX or XLIFF)?
  2. or placed in a namespace of its own (to support a modular XML content architecture; cf. "xml:lang" or "xml:space")?
  3. Should the specification be backwards compatible with existing versions of e.g. TMX and XLIFF?
    1. If not, should the specification define migration paths and mappings for existing content models (TMX and XLIFF)?
  4. enabling for the W3C Internationalization Tag Set (I guess allowing local ITS markup would be sufficient)?


The specification should not prescribe the use of ITS and should not force applications to interpret ITS markup present in an XLIFF file. Applications should be able to safely ignore non-XLIFF markup present in an XLIFF file. If non-XLIFF markup cannot be ignored, then the file and/or its producer should not be considered XLIFF compliant.
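For reference, local ITS markup inside an XLIFF file might look like the following; whether a conforming application may ignore the its:translate attribute is exactly the compliance question raised above:

```xml
<trans-unit id="1" xmlns:its="http://www.w3.org/2005/11/its">
  <source>The command name <mrk mtype="protected" its:translate="no">grep</mrk>
  must stay in English.</source>
</trans-unit>
```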


Q3) Not necessarily



General Scope


I disagree. The work done here applies only to exchange of localization data. Exchange of translation memories and other tasks are beyond the goals described in XLIFF TC's charter.

OneContentModel/Scope (last edited 2010-05-24 12:12:03 by asgeirf)