Contents
- Overview
- Plan
- Defining the Scope (Step 1)
- Requirements (Step 2)
- Should be able to represent standalone codes
- Should be able to represent balanced paired-codes
- Should be able to represent paired codes that have been separated
- Should be able to represent paired codes that are overlapping each other
- If possible, all text nodes of the content should be real text, not codes
- Should be able to mark up spans of content and associate them with standardized information
- Should be able to mark up spans of content and associate them with user-defined information
- Should be able to store a text-equivalent representation of the code in any markup that represents a code
- Should be able to identify uniquely a code within a content fragment
- Should be able to associate the same codes between the source and the target segments
- Should be able to store the actual data of a code along with the code
- Should be able to store a pointer to the actual data of a code along with the code
- Should be able to store no information about the data of a code along with the code
- Should be able to represent a flow of text that, in the original format, was stored as a nested flow
- Should be able to represent the relationship between a nested flow of text and its parent
- Should be able to represent invalid XML characters in the content
- Inline codes should have a way to store information about segmentation
This page continues the exchange started in http://lists.oasis-open.org/archives/xliff/200903/msg00006.html
1. Overview
As component-based and internet-driven technologies evolve, tools need to communicate as seamlessly as possible, exchanging not only whole documents but also small segments of information.
Many of the Web services, plugins, and other bricks that make up the tools being built today need to exchange data at the segment level, not at the file level. Whether these components identify terms, highlight spelling mistakes, provide TM matches, or offer MT guesses, they all ultimately need to access the same abstracted extracted text.
Having a single representation for this abstracted extracted text brings more interoperability with respect to data and thus tools. From a tool provider's point of view, one advantage of a single representation is that no bridges along the lines of "If dealing with X then ... if dealing with Y then ..." are necessary.
1.1. Definitions/Terminology
This section is under construction. The following documents may provide guidance.
http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm#SectionContentMarkup
http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine
http://www.w3.org/International/articles/inline-bidi-markup/
http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_XHTML-1.0-Strict
http://www.w3.org/TR/xhtml1/dtds.html#dtdentry_xhtml1-strict.dtd_phrase
http://www.w3.org/TR/2008/NOTE-xml-i18n-bp-20080213/#DevSeg
Term | Alternative Terms | Examples | Definition
Block-level Markup | | Headings, lists, and blockquotes in HTML | Delimiters for content which is usually linguistically independent from its surrounding content.
Canonical Representation | Canonical Form | The canonical form of an XML document. | The physical representation of a given content fragment which has been derived from a native representation in an agreed-upon way. Since the way of derivation is defined, derivation can only produce a single possible result.
Inline-level Markup | | Big, sup, and em in HTML | Delimiters for content which is usually linguistically dependent on its surrounding content.
Segment | | A sentence in a language such as English. | Content which has either been marked up as "block", or has been created from a "block" by means of a segmentation mechanism such as sentence boundary detection.
Sub-flow | | A footnote in an academic text | Content which is block-level in nature but is semantically closely related to other blocks.
2. Plan
Step 1: Agree on Scope for the specification
Step 2: Define Requirements
Step 3: Design possible solutions
Note: Format for annotating items resolved by TC:
[Resolved (minutes: URL to minutes goes here)"Explanation goes here."]
3. Defining the Scope (Step 1)
3.1. Questions we need to consider in defining the scope
3.1.1. Common representation of 'inline' markup vs 'block-level' markup
Does this proposal outline how blocks of text relate to each other (<group> level), or does it work only at the segment level (<source>, <target>)?
[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200905/msg00023.html)"it was agreed to work at segment level, with flexibility to include groups when needed. Rodolfo will expand this section with comments about merging/splitting segments."]
- Yves, 13May2009
While there is a need for XLIFF to address "block-level" relations, and maybe for TMX as well, I think the content model we are talking about here concerns the representation of extracted data in source/target.
As far as I can tell (though my view may be narrow), the only two parts of that model that concern relations between a given content and others would be:
- Relating a sub-flow placeholder to the text unit that holds the sub-flow content.
- Relating inline codes or segments between source and target.
- Christian, May 14, 2009
I agree with Yves' comment above. The whole activity from my understanding should relate to a "generic inline markup".
I see a danger in injecting XLIFF-specific concepts such as "seg-source" (see comment from Rodolfo below). From my understanding, format-specific concepts could be introduced after the scope discussion has finished.
- Rodolfo - May 13, 2009
Discussion at segment level should consider <seg-source> and its effects on <target> as well. When <seg-source> is used, the content of <target> does not represent the translation of <source> because there may be extra markup (<mrk>) present in <target> that is not included in <source>.
- Does the model allow for sub-flows?
[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200905/msg00023.html)"it was agreed that sub flows will not be allowed as they are now. Yves will add content to the wiki, summarizing the TCs view."] The model should be able to handle text that is marked up as a sub-flow in the original format.
If nesting (<sub> flows) is discouraged, does this proposal describe how to reference inline the other unit(s) containing the nested content? (cf. W3C ITS "elements within text" data category)
[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200909/msg00001.html) "Everyone agreed that it would be good to move subflows to their own translation units. The specifications should describe how to indicate the original location of the moved text and the type of markup that enclosed it".]
The model should describe how sub-flows are handled, including the relation between a sub-flow's extracted representation and its original location.
3.1.2. Canonical Representation of native content
- Should there only be one physical representation for a given native representation (wrt codes, inline-markup, 'skeleton' data, sub-flows)?
[Not Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200906/msg00003.html) "The feasibility of having a canonical representation was discussed. Action Item: Christian to extend the wiki content regarding canonical forms." "The request to allow annotations and extensions in the new model was analyzed. The scope of the request was not clear and several possibilities were discussed. Conversation may continue in an email thread." And June 16: "The meaning and need for a canonical representation of inline markup was discussed (wiki section 3.1.2). Christian proposed to move discussion to an email thread."]
[Tabled (minutes: http://lists.oasis-open.org/archives/xliff/200909/msg00001.html) "A roll call vote was held and it was agreed that discussion of canonical representation will be kept as a working item for further discussion." Bryan: We have spent many meeting cycles and cannot seem to agree on an answer to the question (with regard to defining scope) "Should there only be one physical representation for a given native representation (wrt codes, inline-markup, 'skeleton' data, sub-flows)?" So I request the interested parties establish a point/counterpoint email thread. When all points are documented I will conduct an email ballot to arrive at a "yes/no" decision - no later than 13-OCTOBER-2009]
- Rodolfo - May 13, 2009
There is no need. It will not be possible to include all representations in the specification document.
- Christian - May 13, 2009; Updated 15 May 2009 (to address action item from http://lists.oasis-open.org/archives/xliff/200906/msg00003.html)
A canonical representation would, from my point of view, simplify processing substantially.
We may want to differentiate between two types of canonical representation:
- a semantic/abstract one
- an encoded/concrete one
This distinction can be found for example in ITS (http://www.w3.org/TR/2007/REC-its-20070403/#design-decisions ; "abstraction"). The distinction establishes some middle ground, since it may be much harder or even impossible to come up with an encoded canonical form whereas a semantic canonical form may be feasible. Furthermore, the distinction would provide an opportunity to deploy an "association mechanism" like the one in ITS (http://www.w3.org/TR/2007/REC-its-20070403/#associating-its-with-existing-markup).
We may for example decide that the generic inline markup will need to have one representation each for standalone codes and for balanced-pair codes. This would be two abstract, semantic representations.
We may then say that the single allowed representation for standalone codes is an XML element "x" with a couple of predefined attributes. This would be a concrete, encoded representation.
An "association mechanism" like the one in ITS (see above) may be the middle ground; it could reveal information such as: In my format, the XML element "y" corresponds to the "x" element in the generic inline markup namespace.
3.1.3. Extensibility / Annotations
- Should the proposal outline how to annotate and/or extend the common representation in an exchange-friendly way?
- separation between possibly localizable text, native format (includes structural information), and annotations to enable easy parsing/transformation
- Rodolfo - September 14, 2009
What is the meaning of "common representation"?
As far as I know, when creating an XLIFF file, translatable text is separated from native format. Native format is stored in a skeleton (internal or external) and translatable text is placed in <trans-unit> elements. Annotations, if any, are not part of XLIFF and I think they should not be. XLIFF already supports extensions via namespaces, and that is how I think annotations should be implemented by those who want them.
- Christian - May 13, 2009
The proposal should include an extensibility feature.
The proposal should clearly separate different content categories such as localizable text and native format.
[Resolved (minutes: http://lists.oasis-open.org/archives/xliff/200909/msg00006.html ) "It was agreed that current method is sufficient and provides a friendly way to extend or annotate, therefore no additional work is required. The TC agreed on inviting all tool developers to contribute their custom extensions to XLIFF for publishing in an open repository that the XLIFF TC would maintain. "] - note: this was agreed upon by the members of the 14 Sep 2009 call - but a member left the call so with no quorum, the vote was not binding - so resolution is pending a roll call vote.
[Resolved (regarding extensibility) (minutes: http://lists.oasis-open.org/archives/xliff/200910/msg00002.html ) "(1) Rodolfo suggested adding a note to section 2.5 explaining that extension points are intended for providing support for internal tool processing, not for solving XLIFF deficiencies. (2) A roll call ballot was conducted on these terms: A yes vote means that you agree on adding a note indicating that extensions should be used for tool specific internal processing purposes, not for holding localization data that should be supported using standard XLIFF elements. The motion was approved unanimously. (3) A second roll call ballot was conducted on these terms: A yes vote means that you agree that current extensibility options are sufficient and provide a way to exchange in a user friendly way. The motion passed."]
[Resolved (regarding annotation) (minutes: http://lists.oasis-open.org/archives/xliff/200911/msg00003.html ) "Christian explained that annotation refers to information targeted at human readers, contrasting it to extensibility which he thinks is targeted to processing tools. Standard XML comments would not be good enough, as there is a risk of losing them during processing. Magnus expressed that current extensibility mechanism is rather weak and doesn't provide all the means required for improving processing. Christian proposed a ballot with two questions regarding the future inline markup specification: should the new specification outline an extensibility mechanism? and should the extensibility mechanism include the possibility to classify extensions in different types? Two roll call ballots were conducted and the answer to both questions was yes."]
3.1.4. Content Manipulation
- Should the specification define how the inline-content (and block level) model can be manipulated, including:
- indicate when a code can be deleted or not, can be cloned or not,
- indicate if a code can be moved out of sequence or not;
(this type of info is useful when doing QA, when the translator is manually editing a segment, when composing a target based on various matches, or in many other scenarios.)
[Resolved (minutes: http://www.oasis-open.org/apps/org/workgroup/xliff/email/archives/200906/msg00011.html)"It was agreed that the specification should define how the inline-content (and block level) model can be manipulated."]
- Yves - May 13, 2009
It should be possible in many cases. Code checking is one of the problems many LSPs face; being able to perform smarter modifications and checks for some formats would help a lot. I think we should at least explore the possibilities.
- Christian - May 13, 2009
I agree with Yves' comment above.
3.1.5. XML Implementation
- tied to current version or successor of an existing namespace (such as TMX or XLIFF)?
- or placed in a namespace of its own (to support a modular XML content architecture; cf. "xml:lang" or "xml:space")?
- Should the specification be backwards compatible with existing versions of e.g. TMX and XLIFF?
- If not, should the specification define migration paths and mappings for existing content models (TMX and XLIFF)?
- enabling for the W3C Internationalization Tag Set (I guess allowing local ITS markup would be sufficient)?
- Rodolfo - May 13, 2009
The specification should not prescribe the use of ITS and should not force applications to interpret ITS markup present in an XLIFF file. Applications should be able to safely ignore non-XLIFF markup present in an XLIFF file. If non-XLIFF markup cannot be ignored, then the file and/or its producer should not be considered XLIFF compliant.
- Yves - May 13, 2009
Q3) Not necessarily
- Christian - May 13, 2009
- Not tied to current version or successor of an existing namespace (such as TMX or XLIFF). This would be in line with the modularization approaches we see for example in XHTML.
- Placed in a namespace of its own (to support a modular XML content architecture; cf. "xml:lang" or "xml:space"), since this would, among other things, make it easier to develop and maintain the proposal.
- Should be backwards compatible with existing versions of e.g. TMX and XLIFF. I guess it would need to be discussed whether the definition of mappings for existing content models (TMX and XLIFF) to the proposal would count as backwards compatibility.
- Enabling for the W3C Internationalization Tag Set by allowing at least local ITS markup.
- Defined Scope
3.1.6. General Scope
- The purpose of this specification is to define a content model that is task agnostic in the sense that it is suited to any context, including translation memory exchange, translation workflow, etc.
- Rodolfo - September 14, 2009
I disagree. The work done here applies only to exchange of localization data. Exchange of translation memories and other tasks are beyond the goals described in XLIFF TC's charter.
4. Requirements (Step 2)
4.1. Should be able to represent standalone codes
For example, a line break in HTML:
Line 1<br/>line 2
4.2. Should be able to represent balanced paired-codes
For example, normal bolded text in HTML:
<b>Bold</b> text
4.3. Should be able to represent paired codes that have been separated
For example, in HTML, <b> is in one segment while </b> is in a different one:
Text in <b>bold. This too</b>.
4.4. Should be able to represent paired codes that are overlapping each other
For example:
Text in <startB/>bold and in <startI/>italic<endB/>.<endI/>
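The four cases in 4.1 to 4.4 can be told apart mechanically. The following is only a minimal sketch, not a proposed design: it assumes a fragment is held as a list of text runs and code markers, and the `Code` class, its `kind` values, and the `classify` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Code:
    kind: str  # "open", "close", or "standalone" (hypothetical labels)
    name: str  # label of the original code, e.g. "b" or "br"


def classify(fragment):
    """Return the set of code conditions found in one content fragment."""
    found = set()
    stack = []  # names of the codes currently open, innermost last
    for item in fragment:
        if not isinstance(item, Code):
            continue  # a plain text run
        if item.kind == "standalone":
            found.add("standalone")
        elif item.kind == "open":
            stack.append(item.name)
        elif item.kind == "close":
            if item.name not in stack:
                # closing code whose opening code is in another fragment
                found.add("separated")
            elif stack[-1] == item.name:
                stack.pop()
                found.add("balanced")
            else:
                # closes a code that is not the innermost open one
                stack.remove(item.name)
                found.add("overlapping")
    if stack:
        # opening code whose closing code is in another fragment
        found.add("separated")
    return found
```

For the example in 4.4, the fragment would be `["Text in ", Code("open", "b"), "bold and in ", Code("open", "i"), "italic", Code("close", "b"), ".", Code("close", "i")]`, and the result includes "overlapping". Each half of the split example in 4.3 yields "separated".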
4.5. If possible, all text nodes of the content should be real text, not codes
When processing the content with XML parsers, all the nodes of type TEXT should contain real text. This makes the separation between textual content and codes physical even in the parsed object, rather than dependent on the markup itself.
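To illustrate the difference, here is a small sketch using Python's standard ElementTree parser. The element name `code` and the attribute name `data` are assumptions made up for this example; they are not taken from any agreed model.

```python
import xml.etree.ElementTree as ET


def text_nodes(elem):
    """Collect, in document order, every text node under elem."""
    nodes = [elem.text] if elem.text else []
    for child in elem:
        nodes.extend(text_nodes(child))
        if child.tail:
            nodes.append(child.tail)
    return nodes


# Code data kept in an attribute: every text node is real text.
in_attribute = ET.fromstring(
    '<source>Line 1<code data="&lt;br/&gt;"/>line 2</source>'
)

# Code data kept as element content: one text node is actually a code.
in_content = ET.fromstring(
    '<source>Line 1<code>&lt;br/&gt;</code>line 2</source>'
)

print(text_nodes(in_attribute))  # only translatable text
print(text_nodes(in_content))    # the escaped <br/> shows up as text
```

With the first form, a tool that walks the text nodes never has to ask whether a given node is text or code. With the second form it must inspect the parent element of every node.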
4.6. Should be able to mark up spans of content and associate them with standardized information
Some examples:
- assigning term information to a run of text
- associating part of the content with an annotation.
- associating part of the content with directionality and/or language information
4.7. Should be able to mark up spans of content and associate them with user-defined information
Same requirement as above, but where the associated information is user-defined.
4.8. Should be able to store a text-equivalent representation of the code in any markup that represents a code
For example, for an inline code representing a variable, it is useful to see the value of the variable or its name when translating. The text-equivalent information could also be used as a hint when no other information is present while doing tasks such as alignment.
4.9. Should be able to identify uniquely a code within a content fragment
As a code needs to be associated with additional information within or outside its container, it needs a unique identifier.
4.10. Should be able to associate the same codes between the source and the target segments
For example, in the following source and translation:
English: "The text is in <b>bold</b> and <i>italics</i>."
Yoda-English: "In <i>italics</i> and <b>bold</b> the text is."
The codes <b>, </b>, <i>, and </i> of the source should be mappable to the corresponding ones in the translation.
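One possible way to satisfy 4.9 and 4.10 together is to give each code an identifier that is unique within its fragment and reuse the same identifier in the target. The sketch below assumes exactly that; the `(id, data)` tuples and the `map_codes` function are hypothetical and only illustrate the idea.

```python
def map_codes(source, target):
    """Map each source code id to (position in source, position in target).

    Both arguments are lists of (code_id, original_data) tuples, in the
    order in which the codes occur. A target position of None means the
    code is missing from the translation.
    """
    for codes in (source, target):
        ids = [code_id for code_id, _ in codes]
        if len(ids) != len(set(ids)):
            raise ValueError("code ids must be unique within a fragment")
    target_pos = {code_id: i for i, (code_id, _) in enumerate(target)}
    return {
        code_id: (i, target_pos.get(code_id))
        for i, (code_id, _) in enumerate(source)
    }


# Codes of the English sentence and of its Yoda-English translation.
source = [("1", "<b>"), ("2", "</b>"), ("3", "<i>"), ("4", "</i>")]
target = [("3", "<i>"), ("4", "</i>"), ("1", "<b>"), ("2", "</b>")]

mapping = map_codes(source, target)
```

Here the opening bold code, first in the source, is found third in the target, so a checker can confirm that every code survived the reordering.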
4.11. Should be able to store the actual data of a code along with the code
Some tools may require the actual codes to be available from within the container.
Note that this is one of three ways to store codes; the others are the two requirements below.
4.12. Should be able to store a pointer to the actual data of a code along with the code
Some tools may require the actual codes to be available from outside the text container.
Note that this is one of three ways to store codes; the others are the requirement above and the one below.
4.13. Should be able to store no information about the data of a code along with the code
Note that this is one of three ways to store codes; the others are the two requirements above.
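The three storage options in 4.11 to 4.13 can coexist in a single structure. The following sketch assumes a code object with two optional fields; `InlineCode`, `data`, `ref`, and `original_data` are hypothetical names, and the external store is modeled as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InlineCode:
    id: str
    data: Optional[str] = None  # 4.11: actual data stored with the code
    ref: Optional[str] = None   # 4.12: pointer to data stored elsewhere


def original_data(code, store):
    """Return the original data of a code, or None if it is unknown."""
    if code.data is not None:
        return code.data            # data travels inside the container
    if code.ref is not None:
        return store.get(code.ref)  # data lives outside the container
    return None                     # 4.13: nothing stored about the data


# A hypothetical external store, e.g. the content of a skeleton.
store = {"d1": "<br/>"}

print(original_data(InlineCode("1", data="<b>"), store))
print(original_data(InlineCode("2", ref="d1"), store))
print(original_data(InlineCode("3"), store))
```

A tool that needs the actual code can call one function regardless of which option the producer chose, and still gets an explicit "unknown" for the third case.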
4.14. Should be able to represent a flow of text that, in the original format, was stored as a nested flow
For example, the value of the HTML ALT attribute is stored in the IMG tag, which can be within a paragraph:
<p>Click here: <img alt='OK' src='ok.png'/>.</p>
Another example: in ODF, a footnote is stored at the location where the footnote reference is displayed in the published document.
4.15. Should be able to represent the relationship between a nested flow of text and its parent
The format should be able to represent both flows and have some information about their relationship, so the two texts can be put in context when needed.
For example, the relation between the value of an HTML ALT attribute and the paragraph element where it appears should be somehow preserved.
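As a rough illustration of 4.14 and 4.15, the sketch below moves the ALT text into a unit of its own and records the link in both directions. The unit identifiers, the dictionary layout, and the `extract` function are assumptions made for this example, and the function only handles direct children of the paragraph.

```python
import xml.etree.ElementTree as ET


def extract(xml_paragraph):
    """Split a paragraph into a parent unit and one unit per ALT text."""
    root = ET.fromstring(xml_paragraph)
    parent = {"content": [root.text or ""], "parent": None}
    units = {"u1": parent}
    for n, child in enumerate(root, start=2):
        item = {"code": child.tag}  # placeholder for the inline element
        alt = child.get("alt")
        if alt is not None:
            sub_id = f"u{n}"
            # The nested text becomes its own unit and points to its parent.
            units[sub_id] = {"content": [alt], "parent": "u1"}
            # The placeholder in the parent points to the nested unit.
            item["subflow"] = sub_id
        parent["content"].append(item)
        parent["content"].append(child.tail or "")
    return units


units = extract("<p>Click here: <img alt='OK' src='ok.png'/>.</p>")
```

The parent unit keeps "Click here: ", a placeholder for the image, and the final period, while the unit `u2` holds "OK" and names `u1` as its parent, so either text can be shown in the context of the other.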
4.16. Should be able to represent invalid XML characters in the content
Some characters are illegal in XML, but they are used in extracted text and we should have a way to represent them without causing the XML tools to fail.
For example, non-XML source documents may have control characters that cannot be represented directly in XML. It should be possible to encode them somehow so they can make it through the translation process without being lost or corrupted.
Note: An example of how some XML formats handle this is the TS format from Qt Linguist, which uses a <byte> element to represent such characters.
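A sketch of one way this could work: every character outside the XML 1.0 "Char" production is replaced by an empty element that carries the code point, and restored when the text is read back. The element name `cp`, its `hex` attribute, and the `source` wrapper are assumptions chosen for illustration, not a decision of the TC.

```python
import re
import xml.etree.ElementTree as ET

# Matches any character outside the XML 1.0 "Char" production.
INVALID_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def encode(text):
    """Serialize text, replacing each invalid character with <cp hex=.../>."""
    root = ET.Element("source")
    last = None  # most recent <cp> element, if any
    pos = 0
    for match in INVALID_XML_CHAR.finditer(text):
        chunk = text[pos:match.start()]
        if last is None:
            root.text = chunk
        else:
            last.tail = chunk
        last = ET.SubElement(root, "cp", hex=format(ord(match.group()), "04X"))
        pos = match.end()
    if last is None:
        root.text = text[pos:]
    else:
        last.tail = text[pos:]
    return ET.tostring(root, encoding="unicode")


def decode(xml_fragment):
    """Rebuild the original text from the output of encode()."""
    root = ET.fromstring(xml_fragment)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "cp":
            parts.append(chr(int(child.get("hex"), 16)))
        parts.append(child.tail or "")
    return "".join(parts)


original = "A\x01B\x0cC"  # contains two control characters
assert decode(encode(original)) == original
```

Because the control characters never appear in the serialized form, ordinary XML tools can parse the fragment, and the text nodes still contain only real text as asked for in 4.5.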
4.17. Inline codes should have a way to store information about segmentation
As some inline codes may have an effect on the segmentation of a given content, it is useful if segmentation-specific hints could be stored along with the inline code.
For example: In HTML a <br/> element indicates a forced line break, while a <b>...</b> element should not affect the segmentation.
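A minimal sketch of how such a hint might be consumed, assuming each code carries a boolean flag: the `Code` class, the `breaks_segment` field, and the `split_on_hints` function are hypothetical names, and a real segmenter would combine this hint with its own sentence-boundary rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Code:
    data: str
    breaks_segment: bool = False  # hypothetical segmentation hint


def split_on_hints(fragment):
    """Split a list of text runs and codes after each breaking code."""
    segments, current = [], []
    for item in fragment:
        current.append(item)
        if isinstance(item, Code) and item.breaks_segment:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


fragment = [
    "Line 1",
    Code("<br/>", breaks_segment=True),  # forced line break: ends a segment
    "line ",
    Code("<b>"),                         # formatting only: no effect
    "2",
    Code("</b>"),
]
segments = split_on_hints(fragment)
```

The fragment is split into two segments after the <br/> code, while the <b>...</b> pair stays inside the second segment.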