Building and Using Parallel Texts:
Data Driven Machine Translation and Beyond
ACL 2005 Workshop

GUIDELINES FOR THE SHARED TASK ON WORD ALIGNMENT

Joel Martin
Rada Mihalcea
Ted Pedersen


Task definition
---------------

The task of word alignment consists of finding correspondences between 
words and phrases in parallel texts. Given a sentence-aligned bilingual 
corpus in languages L1 and L2, the task of a word alignment system is to 
indicate which word token in the corpus of language L1 corresponds to which 
word token in the corpus of language L2. 

Systems participating in this shared task on word alignment will be provided 
with training data, consisting of sentence-aligned parallel texts. Two 
subtasks are defined: 
(1) "Limited resources", where systems are allowed to use ONLY the resources 
provided.
(2) "Unlimited resources", where systems are allowed to use any resources in 
addition to those provided. Such resources should be explicitly mentioned in 
the system description.
Teams are encouraged to participate in both these subtasks. 

Test data, consisting again of sentence-aligned parallel corpora, will be 
released one week prior to the deadline for result submissions. Participating 
systems will produce word alignments, following the format specified below, 
and submit their output by the deadline indicated in the timetable. Results 
will be returned to each team within three days of submission. Comparative 
results will be made public at the workshop.


Training data
-------------

Two sets of training data will be made available to participants.   

(1) A set of Romanian-English parallel texts, consisting of about 1 million 
Romanian words, and about the same number of English words. This data was 
collected from various Romanian newspapers. 

(2) A set of Inuktitut-English parallel texts, consisting of about 3.5 million 
English words, and about 1.5 million Inuktitut words. This is a subset 
of the Canadian Hansards, processed and sentence-aligned by Joel Martin.

The data is pre-tokenized, using a tokenizer similar to the one used on the 
test data. Any particulars of the data, including source languages, file 
naming conventions, and preprocessing details, are provided with the 
training/test data.


Trial data
----------

Two sets of trial data will be made available at the same time as the training 
data. These are sentence-aligned texts, provided together with their manually 
determined word alignments. Trial data is provided so that participants can 
better understand the format required for the word alignment results to be 
submitted.

We estimate that about 20-30 sentences (Inuktitut-English) and 250 sentences 
(Romanian-English) will be released as part of these trial data.  


Test data
---------

About 70-80 Inuktitut-English and about 200 Romanian-English sentences 
will be released one week prior to the deadline for result submissions. 
The participating word alignment systems will be applied to these test 
texts to produce the corresponding word alignments. Results files need to 
be submitted by the deadline indicated in the shared task timetable. A 
team can submit an unlimited number of result sets for each 
language pair. Results will be submitted electronically; instructions 
regarding the electronic submission will be made available at the time that 
test data are released.


Output format
-------------

The results file should include one line for each word-to-word alignment 
identified by the system. The lines in the results file should follow the 
format below:     

sentence_no position_L1 position_L2 [S|P] [confidence] 

where:
- sentence_no represents the id of the sentence within the test file. 
Sentences in the test data already have an id assigned (see the examples 
below).

- position_L1 represents the position of the token that is aligned from 
the text in language L1; the first token in each sentence is token 1 (not 0).

- position_L2 represents the position of the token that is aligned from the 
text in language L2; again, the first token is token 1.    

- S|P can be either S or P, representing a Sure or Probable alignment. All 
alignments that are tagged as S are also considered to be part of the P 
alignments set (that is, all alignments that are considered "Sure" alignments 
are also part of the "Probable" alignments set). If the S|P field is missing, 
a value of S will be assumed by default.

- confidence is a real number in the range (0, 1] (1 meaning highly confident, 
values near 0 meaning not confident); this field is optional, and by default 
we assume a confidence of 1.

While the S|P and confidence fields overlap in their meaning, the intent of 
having both fields available is to enable participating teams to draw their 
own line on what they consider to be a Sure or Probable alignment. Again, 
both these fields are optional. The standard evaluations that will be performed
(see below) will ignore these fields. However, teams whose systems can rank 
alignments by confidence are welcome to provide these values, in which case 
additional weighted scoring measures may be applied.
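
As an illustration of the format above, the following is a minimal Python 
sketch of a reader for a single results line. It is not part of the official 
task software. The names AlignmentLink and parse_alignment_line are 
hypothetical, and three choices are assumptions rather than rules stated in 
these guidelines: the sentence id is kept as a string, a numeric fourth field 
is read as a confidence value when the S|P field is omitted, and a confidence 
of exactly 0 is rejected.

```python
from typing import NamedTuple


class AlignmentLink(NamedTuple):
    """One word-to-word alignment (hypothetical container, not official)."""
    sentence_no: str          # kept as a string; the id format comes from the test data
    position_l1: int          # 1-based token position in the L1 sentence
    position_l2: int          # 1-based token position in the L2 sentence
    kind: str = "S"           # "S" (Sure) or "P" (Probable); S is the default
    confidence: float = 1.0   # default confidence when the field is absent


def parse_alignment_line(line: str) -> AlignmentLink:
    """Parse 'sentence_no position_L1 position_L2 [S|P] [confidence]'."""
    fields = line.split()
    if not 3 <= len(fields) <= 5:
        raise ValueError(f"expected 3 to 5 fields, got {len(fields)}: {line!r}")

    sentence_no = fields[0]
    position_l1, position_l2 = int(fields[1]), int(fields[2])
    if position_l1 < 1 or position_l2 < 1:
        raise ValueError(f"token positions start at 1, not 0: {line!r}")

    kind, confidence = "S", 1.0
    rest = fields[3:]
    # Optional S|P field.
    if rest and rest[0] in ("S", "P"):
        kind = rest.pop(0)
    # Optional confidence field (assumption: allowed even without S|P).
    if rest:
        confidence = float(rest.pop(0))
    if rest:
        raise ValueError(f"unexpected trailing fields: {line!r}")
    if not 0.0 < confidence <= 1.0:
        raise ValueError(f"confidence must lie in (0, 1]: {line!r}")

    return AlignmentLink(sentence_no, position_l1, position_l2, kind, confidence)
```

For instance, parse_alignment_line("18 2 3 P 0.7") yields a Probable link 
between token 2 of the L1 sentence and token 3 of the L2 sentence in sentence 
18, while parse_alignment_line("18 1 1") falls back to the defaults S and 1.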


Evaluation
----------

All submissions will be evaluated using the standard precision and recall 
measures, and using the alignment error rate score. Separate precision and 
recall rates will be determined corresponding to each of the S and P sets in 
the gold standard data.

Here, precision is defined as the percentage of correct alignments identified 
by a system out of the total set of alignments provided by the system.

Recall is defined as the percentage of correct alignments identified by a 
system out of the total set of correct alignments.

Alignment error rate:

AER = 1 - ( |A & S| + |A & P| ) / ( |A| + |S| )
(with & symbolizing the set intersection)
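
The measures above can be sketched in Python as follows. This is an 
illustrative sketch and not the official scorer: the function names and the 
Link type are hypothetical, scores are returned as fractions rather than 
percentages, and returning 0 when a denominator is empty is an assumption. 
Here A is the set of system alignments, and the P set is taken to include 
the S set, as stated in the output format section.

```python
from typing import Set, Tuple

# (sentence_no, position_L1, position_L2); hypothetical representation
Link = Tuple[str, int, int]


def precision(system: Set[Link], gold: Set[Link]) -> float:
    """Fraction of system alignments that appear in the gold set."""
    return len(system & gold) / len(system) if system else 0.0


def recall(system: Set[Link], gold: Set[Link]) -> float:
    """Fraction of gold alignments that the system found."""
    return len(system & gold) / len(gold) if gold else 0.0


def alignment_error_rate(system: Set[Link],
                         sure: Set[Link],
                         probable: Set[Link]) -> float:
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)."""
    probable = probable | sure  # every Sure alignment is also Probable
    denominator = len(system) + len(sure)
    if denominator == 0:
        return 0.0  # assumption: nothing to align and nothing proposed
    matched = len(system & sure) + len(system & probable)
    return 1.0 - matched / denominator


# Toy example with invented alignments for a single sentence "1".
sure = {("1", 1, 1), ("1", 2, 2)}
probable = sure | {("1", 3, 2)}
system = {("1", 1, 1), ("1", 3, 2), ("1", 4, 4)}

aer = alignment_error_rate(system, sure, probable)  # 1 - (1 + 2) / (3 + 2)
```

In this toy example the system matches one of two Sure alignments and two of 
three Probable alignments, so the AER is 1 - 3/5 = 0.4. Precision and recall 
would be computed once against the S set and once against the P set.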

Participants are encouraged to suggest new evaluation schemes. Scoring 
algorithms proposed by participants, documented and provided with scoring 
software that relies on the data format mentioned above, may be used to score 
all submitted systems. Participants with a scoring algorithm available should 
contact us before March 31.


Please refer to the guidelines of the HLT/NAACL 2003 shared task on 
word alignment for additional information (including alignment examples,
evaluation details, etc.).