Building and Using Parallel Texts: Data Driven Machine Translation and Beyond
ACL 2005 Workshop

GUIDELINES FOR THE SHARED TASK ON WORD ALIGNMENT

Joel Martin
Rada Mihalcea
Ted Pedersen


Task definition
---------------

The task of word alignment consists of finding correspondences between
words and phrases in parallel texts. Assuming a sentence aligned
bilingual corpus in languages L1 and L2, the task of a word alignment
system is to indicate which word token in the corpus of language L1
corresponds to which word token in the corpus of language L2.

Systems participating in this shared task on word alignment will be
provided with training data, consisting of sentence aligned parallel
texts. Two subtasks are defined:

(1) "Limited resources", where systems are allowed to use ONLY the
    resources provided.
(2) "Unlimited resources", where systems are allowed to use any
    resources in addition to those provided. Such resources should be
    explicitly mentioned in the system description.

Teams are encouraged to participate in both subtasks.

Test data, consisting again of sentence aligned parallel corpora, will
be released one week prior to the deadline for results submissions.
Participating systems will produce word alignments, following the
format specified below, and submit their output by the deadline
indicated in the timetable.

Results will be returned to each team within three days of submission.
Comparative results will be made public at the workshop.


Training data
-------------

Two sets of training data will be made available to participants.

(1) A set of Romanian-English parallel texts, consisting of about
    1 million Romanian words, and about the same number of English
    words. This data was collected from various Romanian newspapers.
(2) A set of Inuktitut-English parallel texts, consisting of about
    3.5 million English words, and about 1.5 million Inuktitut words.
    This is a subset of the Canadian Hansards, processed and sentence
    aligned by Joel Martin.

The data is pre-tokenized, using a tokenizer similar to the one used on
the test data. Any particulars of the data, including source languages,
file naming conventions, details regarding preprocessing, and others,
are provided with the training/test data.


Trial data
----------

Two sets of trial data will be made available at the same time as the
training data. These are sentence aligned texts, provided together with
their manually determined word alignments. Trial data is provided so
that participants can better understand the format required for the
word alignment results to be submitted.

We estimate that about 20-30 sentences (Inuktitut-English) and 250
sentences (Romanian-English) will be released as part of these trial
data.


Test data
---------

About 70-80 Inuktitut-English and about 200 Romanian-English sentences
will be released one week prior to the deadline for result submissions.
The participating word alignment systems will be applied to these test
texts and produce the corresponding word alignments. Results files need
to be submitted by the deadline indicated in the shared task timetable.

A team can submit an unlimited number of results sets for each language
pair. Results will be submitted electronically; instructions regarding
the electronic submission will be made available at the time that test
data are released.


Output format
-------------

The results file should include one line for each word-to-word
alignment identified by the system. The lines in the results file
should follow the format below:

    sentence_no position_L1 position_L2 [S|P] [confidence]

where:

- sentence_no represents the id of the sentence within the test file.
  Sentences in the test data already have an id assigned (see the
  examples below).
- position_L1 represents the position of the token that is aligned
  from the text in language L1; the first token in each sentence is
  token 1.
  (not 0)
- position_L2 represents the position of the token that is aligned
  from the text in language L2; again, the first token is token 1.
- S|P can be either S or P, representing a Sure or Probable alignment.
  All alignments that are tagged as S are also considered to be part
  of the P alignments set (that is, all alignments that are considered
  "Sure" alignments are also part of the "Probable" alignments set).
  If the S|P field is missing, a value of S will be assumed by default.
- confidence is a real number in the range (0, 1] (1 meaning highly
  confident, values near 0 meaning not confident); this field is
  optional, and by default we assume a confidence of 1.

While the S|P and confidence fields overlap in their meaning, the
intent of having both fields available is to enable participating teams
to draw their own line on what they consider to be a Sure or Probable
alignment. Again, both these fields are optional. The standard
evaluations that will be performed (see below) will ignore these
fields. However, teams whose systems can rank alignments by confidence
are welcome to do so, in which case additional weighted scoring
measures may be applied.


Evaluation
----------

All submissions will be evaluated using the standard precision and
recall measures, and using the alignment error rate score. Separate
precision and recall figures will be computed for each of the S and P
sets in the gold standard data.

Here, precision is defined as the percentage of the alignments provided
by a system that are correct. Recall is defined as the percentage of
the correct alignments that are identified by the system.

Alignment error rate:

    AER = 1 - ( |A & S| + |A & P| ) / ( |A| + |S| )

where A is the set of alignments provided by the system, S and P are
the Sure and Probable sets in the gold standard, and & symbolizes set
intersection.

Participants are encouraged to suggest new evaluation schemes.
Scoring algorithms proposed by participants, documented and provided
with scoring software that relies on the data format mentioned above,
may be used to score all submitted systems. Participants with a scoring
algorithm available should contact us before March 31.

Please refer to the guidelines of the HLT/NAACL 2003 shared task on
word alignment for additional information (including alignment
examples, evaluation details, etc.)
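
As an informal illustration of the output format and of the evaluation
measures described above, the following Python sketch parses results
lines and computes precision, recall, and AER. It is not the official
scoring software. The names Alignment, parse_line, links and score are
hypothetical, and the sketch makes several assumptions: sentence ids
are compared as plain strings, a confidence value may appear without a
preceding S|P field, the S|P and confidence fields of the system output
are ignored when scoring, and a ratio with an empty denominator is
reported as 0.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple

# A link is (sentence_no, position_L1, position_L2); positions start at 1.
Link = Tuple[str, int, int]


class Alignment(NamedTuple):
    sentence_no: str          # kept as a string (assumption about id format)
    position_l1: int
    position_l2: int
    kind: str = "S"           # default when the S|P field is missing
    confidence: float = 1.0   # default when the confidence field is missing


def parse_line(line: str) -> Alignment:
    """Parse 'sentence_no position_L1 position_L2 [S|P] [confidence]'."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields: {line!r}")
    kind, confidence = "S", 1.0
    extras = fields[3:]
    if extras and extras[0] in ("S", "P"):
        kind = extras.pop(0)
    if extras:
        confidence = float(extras.pop(0))
    if extras:
        raise ValueError(f"unexpected trailing fields: {line!r}")
    if not 0.0 < confidence <= 1.0:
        raise ValueError(f"confidence must lie in (0, 1]: {line!r}")
    return Alignment(fields[0], int(fields[1]), int(fields[2]),
                     kind, confidence)


def links(alignments: Iterable[Alignment]) -> Set[Link]:
    """Reduce alignments to bare links, dropping S|P and confidence."""
    return {(a.sentence_no, a.position_l1, a.position_l2)
            for a in alignments}


def score(system: Iterable[Alignment],
          gold: Iterable[Alignment]) -> Dict[str, float]:
    """Precision and recall against the S and P gold sets, plus AER."""
    gold = list(gold)
    a = links(system)
    s = links(g for g in gold if g.kind == "S")
    p = links(gold)  # every Sure link is also a Probable link

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0  # assumption: empty set scores 0

    return {
        "precision_S": ratio(len(a & s), len(a)),
        "recall_S": ratio(len(a & s), len(s)),
        "precision_P": ratio(len(a & p), len(a)),
        "recall_P": ratio(len(a & p), len(p)),
        "AER": 1.0 - ratio(len(a & s) + len(a & p), len(a) + len(s)),
    }


if __name__ == "__main__":
    # Made-up lines for illustration only; not taken from the task data.
    gold = [parse_line(x) for x in ("1 1 1 S", "1 2 2 S", "1 3 2 P")]
    system = [parse_line(x) for x in ("1 1 1", "1 3 2 P 0.6", "1 4 4")]
    for name, value in score(system, gold).items():
        print(f"{name}: {value:.3f}")
```

In the made-up example, the system proposes three links, one of which
is in the Sure set and two of which are in the Probable set, so that
AER = 1 - (1 + 2) / (3 + 2) = 0.4.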