(Y32) Segmentation - DONE
Owner: Yves, Helena
Category: Data Model
State: Done. Core/Module: core
David Spec sign-off: YES [TODO: PRs for how elements and attributes in segment, source, target should behave when re-segmenting.]
Tom Schema sign-off: No
Spec Updated: Yes. Schema Updated: Yes
Segmentation representation should be an integral part of XLIFF.
Proposed via XLIFF list (http://lists.oasis-open.org/archives/xliff/201104/msg00005.html)
There should be a unique way to represent segmentation inside an XLIFF document, so tools can create, remove or manipulate segments as part of their features, and pass on those changes to the next tool in the process chain. The representation should include a way to represent un-segmented content. Access to segmented and un-segmented content should be similar, so tools do not have to rely on conditional logic. One should be able to annotate each segment and assign various processing information to it.
Moved 2.11. Split and merge segments (---> See Segmentation) here
Moved 2.18. Crossover aligned segments (---> See Segmentation) here
Proposed via XLIFF list (http://lists.oasis-open.org/archives/xliff/201105/msg00038.html)
Segmentation is an integral part of the localization and translation processes. While the tools creating the initial XLIFF document may not be responsible for segmenting the extracted content, the format still needs to provide a way to store the result of a segmentation process. That representation should be unique and clearly defined.
Segmentation representation is also very important because some of the information carried by XLIFF relates directly to segments, for example the status of the translation, leveraged text, etc.
- One should be able to represent segmented content in a text unit.
- Tools should not be forced to segment the content.
- Processing segmented content and non-segmented content should be easy and, if possible, not require conditions (e.g. if segmented doThis() else doThat()).
- Segmented content should be able to have non-segment parts on each side, outside the segments.
- One should be able to represent the target segments in a different order than the source segments.
- One should be able to control if the segmentation can be modified (split/join).
- One should be able to detect if a content with a single segment went through segmentation (vs. being a content that is not segmented yet).
- One should be able to identify uniquely segments within their text unit.
In the context of XLIFF a segment is a content which is either a unit of extracted text, or has been created from a unit of extracted text by means of a segmentation mechanism such as sentence boundary detection. For example, a segment can be a title, the text of a menu item, a paragraph or a sentence in a paragraph.
Other types of segmentation are, in the context of XLIFF, seen as tokenization and represented using inline codes. For example, the words in a segment can be identified and marked up using the appropriate inline codes.
XLIFF does not specify how segmentation is carried out, only how to represent its result. The ULI/CLDR [Link to the relevant document] provides detailed information about segmenting content.
In XLIFF each segment is represented by a <segment> element.
There is always at least one segment per unit, and a segment can be the whole content of the unit. That is, <segment> can also represent un-segmented content.
Each <segment> element has one <source> element that contains the source content and one optional <target> element that can be empty or contain the translation of the source content at a given state.
Content parts between segments are represented with the <ignorable> element, which has the same content model as <segment>.
<unit>
 <segment>
  <source>Source text.</source>
  <target>Text source.</target>
 </segment>
 <ignorable>
  <source> </source>
 </ignorable>
 <segment>
  <source>Second sentence.</source>
 </segment>
</unit>
- PE: User agents MUST assume a unit may or may not be segmented.
- PE: User agents MUST assume <ignorable> elements may or may not exist between segments.
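As an illustration of the "no conditions" goal, here is a minimal Python sketch that reads the source text of a unit the same way whether or not it has been segmented. The function name unit_source_text is hypothetical, the elements are assumed to be un-namespaced as in the snippets on this page, and inline codes are reduced to their text content.

```python
import xml.etree.ElementTree as ET

def unit_source_text(unit):
    """Concatenate the source text of every <segment> and <ignorable>
    child of a unit, in document order (hypothetical helper)."""
    pieces = []
    for part in unit:
        if part.tag in ("segment", "ignorable"):
            source = part.find("source")
            if source is not None:
                # itertext() flattens any inline markup to plain text.
                pieces.append("".join(source.itertext()))
    return "".join(pieces)

segmented_unit = ET.fromstring(
    "<unit>"
    "<segment><source>Source text.</source><target>Text source.</target></segment>"
    "<ignorable><source> </source></ignorable>"
    "<segment><source>Second sentence.</source></segment>"
    "</unit>"
)
unsegmented_unit = ET.fromstring(
    "<unit><segment><source>Source text. Second sentence.</source></segment></unit>"
)
```

Both units yield the same text through the same code path, with no "if segmented" branch.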
Some content may have gone through segmentation but remain unchanged.
The segmented attribute of <unit>, when set to 'yes', indicates that a content made of a single <segment> has gone through segmentation.
PE: User agents MUST ignore the segmented attribute if the whole content of <unit> is not stored in a single <segment> element.
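A possible reading of the rule above, sketched in Python. The function name has_been_segmented is hypothetical, and treating a unit with more than one part as "already segmented" is an interpretation of this page, not a stated requirement.

```python
import xml.etree.ElementTree as ET

def has_been_segmented(unit):
    """Tell whether a unit went through segmentation (hypothetical helper)."""
    parts = [c for c in unit if c.tag in ("segment", "ignorable")]
    if len(parts) > 1:
        # The content is not stored in a single <segment>, so the
        # segmented attribute is ignored; the parts speak for themselves.
        return True
    # Single <segment>: only the attribute can tell the two cases apart.
    return unit.get("segmented") == "yes"

single_marked = ET.fromstring(
    "<unit segmented='yes'><segment><source>Title</source></segment></unit>"
)
single_unmarked = ET.fromstring(
    "<unit><segment><source>Title</source></segment></unit>"
)
multi_part = ET.fromstring(
    "<unit>"
    "<segment><source>One.</source></segment>"
    "<ignorable><source> </source></ignorable>"
    "<segment><source>Two.</source></segment>"
    "</unit>"
)
```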
The <segment> element has an optional id attribute.
The value of the id attribute is an xsd:NMTOKEN value and MUST be unique within the parent <unit> element.
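The id constraints could be checked along these lines. This is only a sketch: check_segment_ids is a hypothetical name, and the regular expression is an ASCII-only approximation of xsd:NMTOKEN (the real definition also accepts many non-ASCII name characters).

```python
import re
import xml.etree.ElementTree as ET

# Assumption: ASCII-only approximation of xsd:NMTOKEN.
NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9._:\-]+")

def check_segment_ids(unit):
    """Return a list of problems found in the id attributes of the
    <segment> children of one unit (hypothetical helper)."""
    problems = []
    seen = set()
    for segment in unit.findall("segment"):
        seg_id = segment.get("id")
        if seg_id is None:
            continue  # the id attribute is optional
        if not NMTOKEN_ASCII.fullmatch(seg_id):
            problems.append(f"not an NMTOKEN (ASCII check): {seg_id!r}")
        if seg_id in seen:
            problems.append(f"duplicate id: {seg_id!r}")
        seen.add(seg_id)
    return problems

good_unit = ET.fromstring(
    "<unit>"
    "<segment id='1'><source>A.</source></segment>"
    "<segment><source>B.</source></segment>"
    "<segment id='s-2'><source>C.</source></segment>"
    "</unit>"
)
bad_unit = ET.fromstring(
    "<unit>"
    "<segment id='1'><source>A.</source></segment>"
    "<segment id='1'><source>B.</source></segment>"
    "<segment id='a b'><source>C.</source></segment>"
    "</unit>"
)
```

Uniqueness is checked per unit only, so the same id may appear again in another unit.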
Order of the segments
Some applications (e.g. aligner tools) may create segmented content where the target segments are not in the same order as the source.
To be able to map order differences, the <target> element has an optional order attribute that indicates its position in the sequence of segments (and inter-segments). Its value is an integer from 1 to N, where N is the total number of parts in the unit, counting the parts between segments.
For example, in the following XLIFF 1.2 segmented content, the source order is "Sentence A. Sentence B. Sentence C." and the target order is "Phrase B. Phrase C. Phrase A."
<trans-unit id="1">
 <source xml:lang="en">Sentence A. Sentence B. Sentence C.</source>
 <seg-source><mrk mid='1' mtype='seg'>Sentence A.</mrk> <mrk mid='2' mtype='seg'>Sentence B.</mrk> <mrk mid='3' mtype='seg'>Sentence C.</mrk></seg-source>
 <target xml:lang="fr"><mrk mid='2' mtype='seg'>Phrase B.</mrk> <mrk mid='3' mtype='seg'>Phrase C.</mrk> <mrk mid='1' mtype='seg'>Phrase A.</mrk></target>
</trans-unit>
The XLIFF representation of the same content for this specification is:
<segment id="1">
 <source>Sentence A.</source>
 <target order="5">Phrase A.</target>
</segment>
<ignorable>
 <source> </source>
</ignorable>
<segment id="2">
 <source>Sentence B.</source>
 <target order="1">Phrase B.</target>
</segment>
<ignorable>
 <source> </source>
</ignorable>
<segment id="3">
 <source>Sentence C.</source>
 <target order="3">Phrase C.</target>
</segment>
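To make the order attribute concrete, the following Python sketch rebuilds the target text of the example above. It rests on two assumptions that this page does not state: parts without an explicit order fill the remaining positions in document order, and a part without a <target> (here the <ignorable> elements) contributes its source text. The function name target_text_in_order is hypothetical, and the sketch does not validate out-of-range or duplicate order values.

```python
import xml.etree.ElementTree as ET

def target_text_in_order(unit):
    """Assemble the target text of a unit, honouring the optional
    order attribute of <target> (hypothetical helper)."""
    parts = [c for c in unit if c.tag in ("segment", "ignorable")]
    slots = [None] * len(parts)   # positions 1..N, stored 0-based
    unplaced = []                 # parts with no explicit order
    for part in parts:
        target = part.find("target")
        if target is not None:
            text = "".join(target.itertext())
            order = target.get("order")
        else:
            # Assumption: fall back to the source text when there is no target.
            text = "".join(part.find("source").itertext())
            order = None
        if order is not None:
            slots[int(order) - 1] = text
        else:
            unplaced.append(text)
    # Assumption: remaining parts take the free positions in document order.
    result = []
    for slot in slots:
        result.append(slot if slot is not None else unplaced.pop(0))
    return "".join(result)

reordered_unit = ET.fromstring(
    "<unit>"
    "<segment id='1'><source>Sentence A.</source>"
    "<target order='5'>Phrase A.</target></segment>"
    "<ignorable><source> </source></ignorable>"
    "<segment id='2'><source>Sentence B.</source>"
    "<target order='1'>Phrase B.</target></segment>"
    "<ignorable><source> </source></ignorable>"
    "<segment id='3'><source>Sentence C.</source>"
    "<target order='3'>Phrase C.</target></segment>"
    "</unit>"
)
```

Under those assumptions the two <ignorable> parts land in positions 2 and 4, which gives the target order quoted in the example.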
The canSplitSegments attribute of the <file> element indicates if the content of the units can be segmented further. The default value is 'yes'.
The canJoinSegments attribute of the <file> element indicates if the content of the units can have <segment> and/or <ignorable> elements merged together. The default value is 'yes'.
[YS note: maybe a single attribute canReSegment would make more sense as allowing/disallowing only join or split may be too granular? Tools usually do both or nothing]
When allowed, segments within the same unit may be split or joined as needed.
- PE: User agents MAY split content only if the canSplitSegments attribute has the value 'yes'.
- PE: User agents MAY join content parts only if the canJoinSegments attribute has the value 'yes'.
- PE: There MUST be always at least one segment in the text unit.
- PE: When splitting or joining segments, the user agent MUST update the order attributes of the <target> elements of the resulting segment(s) if they exist.
- PE: If the result of a segmentation is a single <segment> for the whole unit, the user agent MUST set the segmented attribute of that unit to 'yes'.
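The join rules above might be applied as in this Python sketch. It is deliberately narrow: it joins every part of a unit into one <segment>, handles plain-text sources only, and refuses units that already have <target> elements, since the expected handling of targets, order attributes and inline codes is still open in the TODO items. The function name join_all_parts is hypothetical, and 'no' as the disallowing value is an assumption (this page only states the default 'yes').

```python
import xml.etree.ElementTree as ET

def join_all_parts(file_elem, unit):
    """Join all parts of a unit into a single <segment>, if permitted.
    Returns True when the unit was changed (hypothetical helper)."""
    # A missing attribute means the default value, 'yes'.
    if file_elem.get("canJoinSegments", "yes") != "yes":
        return False
    parts = [c for c in unit if c.tag in ("segment", "ignorable")]
    if len(parts) < 2:
        return False  # nothing to join
    if any(p.find("target") is not None for p in parts):
        raise ValueError("this sketch only joins parts that have no <target>")
    joined = "".join("".join(p.find("source").itertext()) for p in parts)
    position = list(unit).index(parts[0])
    for part in parts:
        unit.remove(part)
    # There is always at least one segment left in the unit.
    segment = ET.Element("segment")
    ET.SubElement(segment, "source").text = joined
    unit.insert(position, segment)
    # A single <segment> that results from segmentation work is flagged.
    unit.set("segmented", "yes")
    return True

UNIT_XML = (
    "<unit>"
    "<segment><source>First.</source></segment>"
    "<ignorable><source> </source></ignorable>"
    "<segment><source>Second.</source></segment>"
    "</unit>"
)
open_file = ET.fromstring("<file>" + UNIT_XML + "</file>")
locked_file = ET.fromstring("<file canJoinSegments='no'>" + UNIT_XML + "</file>")
```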
TODO: Needs PE for expected behavior for <mrk> and other paired codes when splitting or joining.
TODO: Any other side effects of split/join? other attributes to update?
TODO: How should translation candidates be handled when the segmentation changes? (How should the attributes of the candidates be modified: score, quality, etc.?)
Inline codes may carry information helpful to segmenting the content.
- The equiv attribute provides a hint at what the code represents when rendered as text.
- The type attribute has the pre-defined values lb and cb, representing a line break and a column break.
- TODO: special character cases? U+2029 and U+200D in plain text
- PE: When performing segmentation, user agents SHOULD take into account the segmentation-related information present on inline codes.
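As one way a segmenter could pick up such hints, the Python sketch below lists the plain-text offsets at which an inline code of type lb or cb occurs in a <source>. The element name <ph> used in the sample is only a placeholder, since this page does not name the inline code elements; the function looks at the type attribute alone and only at direct children of <source>. The name break_offsets is hypothetical.

```python
import xml.etree.ElementTree as ET

BREAK_TYPES = {"lb", "cb"}  # line break and column break

def break_offsets(source):
    """Return the offsets, in the plain text of <source>, where an
    inline code with type lb or cb occurs (hypothetical helper)."""
    offsets = []
    position = len(source.text or "")
    for code in source:
        if code.get("type") in BREAK_TYPES:
            offsets.append(position)
        # Advance past any text inside the code, then the text after it.
        position += len("".join(code.itertext()))
        position += len(code.tail or "")
    return offsets

# <ph> is a placeholder element name, not taken from this page.
sample_source = ET.fromstring(
    "<source>Line one.<ph type='lb'/>Line two.<ph type='cb'/>Next column.</source>"
)
```

A segmenter could treat these offsets as candidate boundaries in addition to its own sentence-breaking rules.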