(Y32) Segmentation - DONE

Owner: Yves, Helena

Importance: Uncategorized

Category: Data Model

State: Done Core/Module: core

David Spec sign-off: YES [TODO: PRs for how elements and attributes in segment, source, target should behave when re-segmenting.]

Tom Schema sign-off: No

Spec Updated: Yes. Schema Updated: Yes

Segmentation representation should be an integral part of XLIFF.

There should be a unique way to represent segmentation inside an XLIFF document, so tools can create, remove or manipulate segments as part of their features, and pass on those changes to the next tool in the process chain. The representation should include a way to represent un-segmented content. Access to segmented and un-segmented content should be similar, so that tools do not need conditional logic to handle the two cases. One should be able to annotate and assign various processing information to each segment.

Moved 2.11. Split and merge segments (---> See Segmentation) here

Moved 2.18. Crossover aligned segments (---> See Segmentation) here

Segmentation is an integral part of the localization and translation processes. While the tools creating the initial XLIFF document may not be responsible for segmenting the extracted content, the format still needs to provide a way to store the result of a segmentation process. That representation should be unique and clearly defined.

Segmentation representation is also very important because some of the information carried by XLIFF relates directly to segments, for example the status of the translation, leveraged text, etc.

Some requirements

Possible solutions

Overview

In the context of XLIFF, a segment is content that is either a unit of extracted text, or has been created from a unit of extracted text by means of a segmentation mechanism such as sentence boundary detection. For example, a segment can be a title, the text of a menu item, a paragraph or a sentence in a paragraph.

Other types of segmentation are, in the context of XLIFF, seen as tokenization and represented using inline codes. For example, the words in a segment can be identified and marked up using the appropriate inline codes.

XLIFF does not specify how segmentation is carried out, only how to represent its result. The ULI/CLDR [Link to the relevant document] provides detailed information about segmenting content.

Representation

In XLIFF each segment is represented by a <segment> element.

There is always at least one segment per unit, and a segment can be the whole content of the unit. That is, a single <segment> can represent un-segmented content.

Each <segment> element has one <source> element that contains the source content and one optional <target> element that can be empty or contain the translation of the source content at a given state.

Content parts between segments are represented with the <ignorable> element, which has the same content model as <segment>.

<unit id="1">
<segment>
 <source>Source text.</source>
 <target>Text source.</target>
</segment>
<ignorable>
 <source> </source>
</ignorable>
<segment>
 <source>Second sentence.</source>
</segment>
</unit>
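As a rough illustration of why a uniform representation helps, the sketch below rebuilds the full source text of a unit by walking its <segment> and <ignorable> children in document order. The same loop works whether the unit holds one segment or many. The sample markup, the unit id, the absence of a namespace and the function name unit_source_text are all assumptions made for this example, not part of the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <unit> wrapper and its id are assumed, no namespace is used.
UNIT_XML = """
<unit id="1">
 <segment>
  <source>Source text.</source>
  <target>Text source.</target>
 </segment>
 <ignorable>
  <source> </source>
 </ignorable>
 <segment>
  <source>Second sentence.</source>
 </segment>
</unit>
"""


def unit_source_text(unit):
    """Concatenate the <source> text of each <segment> and <ignorable>, in document order."""
    parts = []
    for child in unit:
        if child.tag not in ("segment", "ignorable"):
            continue
        source = child.find("source")
        if source is not None:
            # itertext() also collects text nested inside inline elements.
            parts.append("".join(source.itertext()))
    return "".join(parts)


unit = ET.fromstring(UNIT_XML)
full_source = unit_source_text(unit)
print(repr(full_source))
```

Because <ignorable> shares the content model of <segment>, the loop needs no special case for the parts between segments beyond accepting both tag names.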

Segmentation indicator

Some content may have gone through segmentation but remain unchanged.

The segmented attribute of <unit>, when set to 'yes', indicates that content made of a single <segment> has already gone through segmentation.

Segment identification

The <segment> element has an optional id attribute.

The value of the id attribute is an xsd:NMTOKEN value and MUST be unique within the parent <unit> element.
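A tool could check the uniqueness constraint along the following lines. This is only a sketch: the function name duplicate_segment_ids and the sample ids are hypothetical, segments without an id are skipped because the attribute is optional, and the check does not validate that each value is a well-formed xsd:NMTOKEN.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_segment_ids(unit):
    """Return the sorted list of id values used by more than one <segment> in the unit."""
    ids = [seg.get("id") for seg in unit.findall("segment") if seg.get("id") is not None]
    return sorted(value for value, count in Counter(ids).items() if count > 1)


# Hypothetical unit: two segments share the id "s1", one segment has no id.
unit = ET.fromstring(
    "<unit>"
    '<segment id="s1"><source>A.</source></segment>'
    "<segment><source>B.</source></segment>"
    '<segment id="s1"><source>C.</source></segment>'
    "</unit>"
)
duplicates = duplicate_segment_ids(unit)
print(duplicates)
```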

Order of the segments

Some applications (e.g. aligner tools) may create segmented content where the target segments are not in the same order as the source.

To be able to map order differences, the <target> element has an optional order attribute that indicates its position in the sequence of segments (and inter-segments). Its value is an integer from 1 to N, where N is the total number of parts in the unit, counting the parts between segments as well as the segments themselves.

For example, in the following XLIFF 1.2 segmented content, the source order is "Sentence A. Sentence B. Sentence C." and the target order is "Phrase B. Phrase C. Phrase A."

<trans-unit id="1">
 <source xml:lang="en">Sentence A. Sentence B. Sentence C.</source>
 <seg-source><mrk mid='1' mtype='seg'>Sentence A.</mrk> <mrk mid='2' mtype='seg'>Sentence B.</mrk> <mrk mid='3' mtype='seg'>Sentence C.</mrk></seg-source>
 <target xml:lang="fr"><mrk mid='2' mtype='seg'>Phrase B.</mrk> <mrk mid='3' mtype='seg'>Phrase C.</mrk> <mrk mid='1' mtype='seg'>Phrase A.</mrk></target>
</trans-unit>

The XLIFF representation of the same content for this specification is:

<segment id="1">
 <source>Sentence A.</source>
 <target order="5">Phrase A.</target>
</segment>
<ignorable>
 <source> </source>
</ignorable>
<segment id="2">
 <source>Sentence B.</source>
 <target order="1">Phrase B.</target>
</segment>
<ignorable>
 <source> </source>
</ignorable>
<segment id="3">
 <source>Sentence C.</source>
 <target order="3">Phrase C.</target>
</segment>
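The order values in the example above can be read back into a target string as sketched below. The text does not say how parts without an order attribute are positioned, so this sketch makes two assumptions: a part without an explicit order keeps its document position, and an <ignorable> without a <target> contributes its source text. The <unit> wrapper and the function name target_text_in_order are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# The example content, wrapped in an assumed <unit> so it parses as one document.
UNIT_XML = """<unit id="1">
<segment id="1"><source>Sentence A.</source><target order="5">Phrase A.</target></segment>
<ignorable><source> </source></ignorable>
<segment id="2"><source>Sentence B.</source><target order="1">Phrase B.</target></segment>
<ignorable><source> </source></ignorable>
<segment id="3"><source>Sentence C.</source><target order="3">Phrase C.</target></segment>
</unit>"""


def target_text_in_order(unit):
    """Assemble the target text of a unit, honoring the order attribute of <target>."""
    placed = []
    parts = [child for child in unit if child.tag in ("segment", "ignorable")]
    for index, part in enumerate(parts, start=1):
        target = part.find("target")
        source = part.find("source")
        if target is not None:
            # Assumption: a target without an order attribute keeps its document position.
            position = int(target.get("order", index))
            text = "".join(target.itertext())
        elif part.tag == "ignorable" and source is not None:
            # Assumption: an ignorable part without a target reuses its source text.
            position = index
            text = "".join(source.itertext())
        else:
            continue  # a segment without a target contributes nothing
        placed.append((position, text))
    placed.sort(key=lambda item: item[0])
    return "".join(text for _, text in placed)


target_text = target_text_in_order(ET.fromstring(UNIT_XML))
print(target_text)
```

With the five parts above, the targets land at positions 1, 3 and 5 and the two ignorable parts stay at positions 2 and 4.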

Segmentation modification

The attribute canSplitSegments of the <file> element indicates if the content of the units can be segmented further. The default value is 'yes'.

The attribute canJoinSegments of the <file> element indicates if the content of the units can have <segment> and/or <ignorable> elements merged together. The default value is 'yes'.

[YS note: maybe a single attribute canReSegment would make more sense as allowing/disallowing only join or split may be too granular? Tools usually do both or nothing]

When allowed, segments within the same unit may be split or joined as needed.
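Before changing the segmentation, a tool would first need to read the two flags from <file>, falling back to the default when an attribute is absent. The sketch below assumes that 'no' is the only other permitted value, which the text above does not state. The function name resegmentation_permissions and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def resegmentation_permissions(file_element):
    """Return (can_split, can_join) for a <file> element, defaulting both to 'yes'."""
    can_split = file_element.get("canSplitSegments", "yes") == "yes"
    can_join = file_element.get("canJoinSegments", "yes") == "yes"
    return can_split, can_join


# Hypothetical <file>: joining is disallowed, splitting is left at its default.
file_element = ET.fromstring(
    '<file canJoinSegments="no">'
    '<unit id="1"><segment><source>A.</source></segment></unit>'
    "</file>"
)
permissions = resegmentation_permissions(file_element)
print(permissions)
```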

TODO: Needs PE for expected behavior for <mrk> and other paired codes when splitting or joining.

TODO: Any other side effects of split/join? other attributes to update?

TODO: How should translation candidates be handled when the segmentation changes? (How should the attributes of the candidates, such as score and quality, be modified?)

Segmentation hints

Inline codes may carry information helpful to segmenting the content.



XLIFF2.0/Feature/Segmentation (last edited 2013-04-09 13:03:27 by David.Filip)