Building and Using Parallel Texts:
Data Driven Machine Translation and Beyond
HLT-NAACL 2003 Workshop
Rada Mihalcea
Ted Pedersen
Task definition
The task of word alignment consists of finding correspondences between
words and phrases in parallel texts. Assuming a sentence aligned bilingual
corpus in languages L1 and L2, the task of a word alignment system is to
indicate which word token in the corpus of language L1 corresponds to which
word token in the corpus of language L2.
Systems participating in this shared task on word alignment will be provided
with training data, consisting of sentence aligned parallel texts. Two
subtasks are defined:
(1) "Limited resources", where systems are allowed to use ONLY the resources
provided.
(2) "Unlimited resources", where systems are allowed to use any resources in
addition to those provided. Such resources should be explicitly mentioned in
the system description.
Teams are encouraged to participate in both these subtasks.
Test data, consisting again of sentence aligned parallel corpora, will be
released one week prior to the deadline for results submissions. Participating
systems will produce word alignments, following the format specified below,
and submit their output by the deadline indicated in the timetable. Results
will be returned to each team within three days of submission. Comparative
results will be made public at the workshop.
Training data
Two sets of training data will be made available to participants.
(1) A set of Romanian-English parallel texts, consisting of about 1 million
Romanian words, and about the same number of English words. This data was
collected from various Romanian newspapers.
(2) A set of English-French parallel texts, consisting of about 20 million
English words, and about the same number of French words. This is a subset
of the Canadian Hansards, processed and sentence aligned by Ulrich Germann at
ISI [1].
The data is pre-tokenized, using a tokenizer similar to the one used on the
test data. Any particulars of the data, including source languages, file
naming conventions, preprocessing details, and others, will be provided when
the training data is released.
Trial data
Two sets of trial data will be made available at the same time as the training
data. These are sentence aligned texts, provided together with their manually
determined word alignments. Trial data is provided so that participants can
better understand the format required for the word alignment results to be
submitted.
We estimate that about 40 sentences (English-French) and 20 sentences
(Romanian-English) will be released as part of these trial data.
Test data
About 450 English-French sentences (data set created by Franz Och and
Prof. Hermann Ney) and 250 English-Romanian sentences (data set created by
Rada Mihalcea) will be released one week prior to the deadline for results
submissions. The participating word alignment systems will be applied to
these test texts to produce the corresponding word alignments. Results
files need to be submitted by the deadline indicated in the shared task
timetable. A team can submit an unlimited number of results sets for each
language pair. Results will be submitted electronically; instructions
regarding the electronic submission will be made available at the time that
test data are released.
Output format
The results file should include one line for each word-to-word alignment
identified by the system. The lines in the results file should follow the
format below:
sentence_no position_L1 position_L2 [S|P] [confidence]
- sentence_no represents the id of the sentence within the test file.
Sentences in the test data already have an id assigned (see the examples
below).
- position_L1 represents the position of the token that is aligned from
the text in language L1; the first token in each sentence is token 1 (not 0).
- position_L2 represents the position of the token that is aligned from the
text in language L2; again, the first token is token 1.
- S|P can be either S or P, representing a Sure or Probable alignment. All
alignments that are tagged as S are also considered to be part of the P
alignments set (that is, all alignments that are considered "Sure" alignments
are also part of the "Probable" alignments set). If the S|P field is missing,
a value of S will be assumed by default.
- confidence is a real number in the range (0, 1] (1 meaning highly confident,
values close to 0 meaning not confident); this field is optional, and by
default we assume a confidence of 1.
While the S|P and confidence fields overlap in their meaning, the intent of
having both fields available is to enable participating teams to draw their
own line on what they consider to be a Sure or Probable alignment. Again,
both these fields are optional. The standard evaluations that will be performed
(see below) will ignore these fields. However, teams whose systems can assign
confidence scores to their alignments are welcome to do so, in which case
additional weighted scoring measures may be applied.
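As an illustration of the format above, the following is a minimal Python sketch of a parser for one results line. The names Alignment and parse_alignment_line are hypothetical and not part of the task definition; the sketch assumes whitespace-separated fields and applies the stated defaults (S, confidence 1) when the optional fields are missing.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """One word-to-word alignment (hypothetical container, for illustration)."""
    sentence_no: int
    position_l1: int
    position_l2: int
    label: str = "S"          # default when the S|P field is missing
    confidence: float = 1.0   # default when the confidence field is missing


def parse_alignment_line(line: str) -> Alignment:
    """Parse 'sentence_no position_L1 position_L2 [S|P] [confidence]'."""
    fields = line.split()
    if not 3 <= len(fields) <= 5:
        raise ValueError(f"expected 3 to 5 fields, got {len(fields)}: {line!r}")
    sentence_no, pos_l1, pos_l2 = (int(f) for f in fields[:3])

    label, confidence = "S", 1.0
    rest = fields[3:]
    # The S|P marker, if present, comes before the confidence score.
    if rest and rest[0] in ("S", "P"):
        label = rest.pop(0)
    if rest:
        confidence = float(rest.pop(0))
    if rest:
        raise ValueError(f"unexpected trailing fields: {line!r}")
    # Assumption: the range (0, 1] is enforced strictly.
    if not 0.0 < confidence <= 1.0:
        raise ValueError(f"confidence outside (0, 1]: {line!r}")
    return Alignment(sentence_no, pos_l1, pos_l2, label, confidence)
```

A line with only three fields, such as "18 1 1", is read as a Sure alignment with confidence 1.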
Running example
Consider the two following aligned sentences:
[from the English file]
They had gone .
[from the French file]
Ils etaient alles .
A correct word alignment for this sentence pair is:
18 1 1
18 2 2
18 3 3
18 4 4
This states that all these alignments are from sentence 18, that English
token 1 ("They") aligns with French token 1 ("Ils"), English token 2 ("had")
aligns with French token 2 ("etaient"), and so on. Note that punctuation is
also aligned (English token 4 (".") aligns with French token 4 (".")) and
counts towards the final scoring figures.
Alternatively, systems may also provide an S|P marker and/or a confidence
score, as in the following example:
18 1 1 1
18 2 2 P 0.7
18 3 3 S
18 4 4 S 1
with missing S|P fields considered by default to be S, and missing confidence
scores considered by default to be 1.
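A system might serialize its output along the following lines. This is only a sketch: the function name format_alignment is hypothetical, and it assumes that optional fields are simply left out when the system has no value for them.

```python
def format_alignment(sentence_no, position_l1, position_l2,
                     label=None, confidence=None):
    """Build one results line; optional fields are omitted when None."""
    fields = [str(sentence_no), str(position_l1), str(position_l2)]
    if label is not None:
        if label not in ("S", "P"):
            raise ValueError(f"label must be 'S' or 'P', got {label!r}")
        fields.append(label)
    if confidence is not None:
        # 'g' formatting drops trailing zeros, e.g. 1.0 is written as 1.
        fields.append(f"{confidence:g}")
    return " ".join(fields)


# The four lines of the example above, rebuilt from their fields.
lines = [
    format_alignment(18, 1, 1, confidence=1),
    format_alignment(18, 2, 2, "P", 0.7),
    format_alignment(18, 3, 3, "S"),
    format_alignment(18, 4, 4, "S", 1),
]
```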
Annotation guide for word alignments
The guidelines and examples in this section are from the Blinker Annotation
Project. Please refer to [2] for additional details.
a. All items separated by a white space are considered to be a word (or
token), and therefore should be aligned (punctuation included).
b. Omissions in translation should use the NULL token, i.e. token with id
0. For instance, in the examples below:
And he said , appoint me thy wages , and I will give it .
fixe moi ton salaire , et je te le donnerai .
"And he said ," from the English sentence (three words and the comma that
follows) has no corresponding translation in French, and therefore all four
tokens are aligned with the token id 0.
18 1 0
18 2 0
18 3 0
18 4 0
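The NULL convention can be sketched in Python as follows. The function name add_null_links is hypothetical, and the sketch covers only the case shown above, where L1 tokens without a counterpart are linked to position 0; how unaligned L2 tokens should be written is not specified here, so it is left out.

```python
def add_null_links(links, len_l1):
    """Add (i, 0) for every L1 position i that appears in no link.

    links  -- iterable of (position_l1, position_l2) pairs, 1-based
    len_l1 -- number of tokens in the L1 sentence
    """
    aligned_l1 = {i for i, _ in links}
    null_links = {(i, 0) for i in range(1, len_l1 + 1) if i not in aligned_l1}
    return sorted(set(links) | null_links)


# Toy input (not gold data): a 4-token L1 sentence where only
# tokens 3 and 4 have a counterpart in L2.
toy_links = add_null_links({(3, 1), (4, 2)}, len_l1=4)
```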
c. Phrasal correspondence will produce multiple word-to-word alignments.
For instance, in the examples below:
cultiver la terre
to be a husbandman
Since the words do not correspond one to one, and yet the two phrases mean
the same thing in the given context, the phrases should be linked as wholes,
by linking each word in one to each word in the other. For the example above,
this translates into 12 word-to-word alignments (3 French tokens times 4
English tokens):
18 1 1
18 1 2
18 1 3
18 1 4
18 2 1
18 2 2
18 2 3
18 2 4
18 3 1
18 3 2
18 3 3
18 3 4
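Linking two phrases as wholes amounts to taking the Cartesian product of their token positions. A minimal Python sketch, where the function name link_phrases is hypothetical:

```python
from itertools import product


def link_phrases(sentence_no, positions_l1, positions_l2):
    """Link every token of one phrase to every token of the other."""
    return [f"{sentence_no} {i} {j}"
            for i, j in product(positions_l1, positions_l2)]


# "cultiver la terre" (positions 1-3) vs "to be a husbandman" (positions 1-4)
phrase_links = link_phrases(18, range(1, 4), range(1, 5))
```

For the phrase pair above this yields the same 12 lines, in the same order.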
Evaluation
All submissions will be evaluated using the standard precision and recall
measures, and using the alignment error rate score. Separate precision and
recall rates will be determined corresponding to each of the S and P sets in
the gold standard data.
Here, precision is defined as the percentage of correct alignments identified
by a system out of the total set of alignments provided by the system.
Recall is defined as the percentage of correct alignments identified by a
system out of the total set of correct alignments.
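The two definitions can be sketched in Python as below. This is not the official scoring software: the function name precision_recall is hypothetical, alignments are assumed to be represented as (sentence_no, position_L1, position_L2) tuples, and values are returned as fractions in [0, 1] rather than percentages. The same function would be called once with the S gold set and once with the P gold set.

```python
def precision_recall(system, gold):
    """Return (precision, recall) of a system alignment set against a gold set.

    system, gold -- sets of (sentence_no, position_l1, position_l2) tuples
    """
    correct = len(system & gold)
    # Assumption: an empty set yields 0.0 instead of a division error.
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data (not from the task): 2 of 3 system links are in the gold set of 4.
system = {(18, 1, 1), (18, 2, 2), (18, 3, 4)}
gold = {(18, 1, 1), (18, 2, 2), (18, 3, 3), (18, 4, 4)}
p, r = precision_recall(system, gold)
```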
Alignment error rate is defined as in [3]:
AER = 1 - ( |A & S| + |A & P| ) / ( |A| + |S| )
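(with & symbolizing set intersection, A the set of alignments produced by the
system, and S and P the Sure and Probable alignment sets in the gold standard)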
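The AER formula translates directly into set operations. The sketch below is an illustration, not the official scorer: the function name alignment_error_rate is hypothetical, alignments are assumed to be (sentence_no, position_L1, position_L2) tuples, and S is folded into P because every Sure alignment also counts as Probable.

```python
def alignment_error_rate(system, sure, probable):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)."""
    probable = probable | sure  # every Sure link is also a Probable link
    denominator = len(system) + len(sure)
    if denominator == 0:
        raise ValueError("AER is undefined when both A and S are empty")
    matched = len(system & sure) + len(system & probable)
    return 1.0 - matched / denominator


# Toy data (not from the task): |A & S| = 1, |A & P| = 2, |A| = 3, |S| = 1.
A = {(18, 1, 1), (18, 2, 2), (18, 3, 4)}
S = {(18, 1, 1)}
P = {(18, 1, 1), (18, 2, 2)}
aer = alignment_error_rate(A, S, P)  # 1 - 3/4
```

A system that returns exactly the Sure set, with no additional links, scores an AER of 0.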
Participants are encouraged to suggest new evaluation schemes. Scoring
algorithms proposed by participants, documented and provided with scoring
software that relies on the data format mentioned above, may be used to score
all submitted systems. Participants with a scoring algorithm available should
contact us before March 21.
References
[1] Ulrich Germann, editor (2001). Aligned Hansards of the 36th Parliament
of Canada.
[2] I. Dan Melamed (1998). Annotation Style Guide for the Blinker Project,
IRCS Technical Report #98-06.
[3] Franz Josef Och, Hermann Ney (2000). A Comparison of Alignment Models for
Statistical Machine Translation. COLING 2000.