ACL 2005: Building and Using Parallel Texts for Languages with Scarce Resources.

ACL 2005: Building and Using Parallel Texts:
Data Driven Machine Translation and Beyond

Track 1: Building and Using Parallel Texts
for Languages with Scarce Resources

Joel Martin, National Research Council of Canada
Rada Mihalcea, University of North Texas
Ted Pedersen, University of Minnesota Duluth

Program (tentative)
Workshop Program (June 29)
8:45-9:00 Welcome
Invited Talk
9:00-10:00 So many languages, so few resources: How to bridge the gap?
Mike Maxwell
Regular Papers
10:00-10:20 Association-Based Bilingual Word Alignment
Robert C. Moore
10:30-11:00 Break
11:00-11:20 Cross Language Text Categorization by acquiring Multilingual Domain Models from Comparable Corpora
Alfio Gliozzo, Carlo Strapparava
11:20-11:40 Parsing Word-Aligned Parallel Corpora in a Grammar Induction Context
Jonas Kuhn
11:40-12:00 Bilingual Word Spectral Clustering for Statistical Machine Translation
Bing Zhao, Eric P. Xing, Alex Waibel
12:00-12:20 Revealing Phonological Similarities between Related Languages from Automatically Generated Parallel Corpora
Karin Mueller
12:20-12:40 Acquiring and Using Parallel Texts and Morpho-syntactic Language Resources for Serbian-English Statistical Machine Translation
Maja Popovic, David Vilar, Hermann Ney, Slobodan Jovicic, Zoran Saric
12:40-2:00 Lunch
2:00-2:20 Induction of Fine-grained Part-of-speech Taggers via Classifier Combination and Crosslingual Projection
Elliott Franco Drabek, David Yarowsky
2:20-2:40 Comparison, Selection, and Use of Sentence Alignment Algorithms for New Language Pairs
Anil Kumar Singh, Samar Husain
Shared Task on Word Alignment
2:40-3:00 Word Alignment for Languages with Scarce Resources
Joel Martin, Rada Mihalcea, Ted Pedersen
3:00-3:20 A hybrid approach to align sentences and words in English-Hindi parallel corpora
Niraj Aswani, Robert Gaizauskas
3:20-3:35 NUKTI: English-Inuktitut Word Alignment System Description
Philippe Langlais, Fabrizio Gotti, Guihong Cao
3:35-4:00 Break
4:00-4:15 Models for Inuktitut-English Word Alignment
Charles Schafer, Elliott Franco Drabek
4:15-4:30 Improved HMM Alignment Models for Languages with Scarce Resources
Adam Lopez, Philip Resnik
4:30-4:45 Symmetric Probabilistic Alignment
Ralf D. Brown, Jae Dong Kim, Peter J. Jansen, and Jaime G. Carbonell
4:45-5:00 ISI's Participation in the Romanian-English Alignment Task
Alexander Fraser, Daniel Marcu
5:00-5:15 Experiments Using MAR for Aligning Corpora
Juan Miguel Vilar
5:15-5:30 Combined word alignments
Dan Tufis, Radu Ion, Alexandru Ceausu, Dan Stefanescu
Panel and Discussions
5:30-6:30 "Building and Exploiting Parallel Texts for Languages with Scarce Resources: Lessons Learned and Future Directions".
Ralf Brown, Joel Martin, Bob Moore, Charles Schafer

Invited talk

So many languages, so few resources: How to bridge the gap?

Mike Maxwell
Linguistic Data Consortium
University of Pennsylvania

Abstract: It is now common knowledge that many of the world's more than six thousand languages are in danger of becoming extinct. While languages have disappeared throughout history and pre-history, the present rate of extinction is unprecedented. Attempts are being made both to preserve languages in the living state, and to document and describe languages.

The primary resource for language documentation is undoubtedly parallel text. Traditionally (and necessarily) field linguists have created parallel text in the languages they study by hand, starting out by transcribing previously unwritten languages. This work continues today, aided by modern tools, but it is still labor intensive and slow.

Computational linguists have also demonstrated the utility of parallel text as the fuel for many areas of NLP, including statistical machine translation. But while these uses of parallel text have a record of success with languages like French, Chinese and Arabic, recent efforts in so-called "Low Density" languages such as Hindi, Cebuano and others have shown lesser success, in large part because of the shortage of parallel text.

In sum, apart from a few large languages, parallel text is a scarce and expensive commodity. I will try to give a feel for the availability of parallel text in a wide range of languages, and discuss efforts to create more parallel text, including better tools for field linguists, web search, paid translation, and Open-Mind style efforts. I conclude by suggesting that if the scarcity of parallel text is to be solved, both for language documentation and for NLP, then it is time to try new methods, perhaps including wikification.

Short bio: Mike Maxwell is a researcher at the Linguistic Data Consortium of the University of Pennsylvania. He obtained his BS in zoology at the University of Illinois in 1972, an MA in linguistics at the University of Washington in 1977, and his PhD at the University of Washington in 1984. As a member of the Summer Institute of Linguistics, he worked with indigenous languages of Mexico, Ecuador and Colombia, and developed tools for doing morphological analysis. He has also worked in syntactic parsing at Boeing Computer Services. At the Linguistic Data Consortium, his work has included developing morphological transducers for various languages, and creating corpora for "low density" languages, that is, languages without extensive computational resources, ranging from Hindi to Tigrinya. His interests include documentation and description of endangered languages, collecting and building resources for low density languages, and morphology.

Task definition

The goal of this shared task is to provide an environment for the evaluation of systems for word alignment, with a focus on languages with scarce resources. This follows on the success of the word alignment shared task that took place as part of the NAACL 2003 workshop on parallel texts. All researchers who have a word alignment system available are invited to participate in this shared task on word alignment, individually or as part of a team.

Participants in the shared task will be provided with common sets of training data, consisting of English-Inuktitut, Romanian-English, and English-Hindi parallel texts (a participating team can choose to apply their system on one, two, or all three language pairs). Participants will be given approximately one month to train their systems with this data, and then previously held out test data will be released. Participants will run their alignment system on this test data and submit their results, which will be evaluated using a common set of metrics.
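
As an illustration only: this page does not spell out the common metrics, but word alignment output is commonly scored with precision, recall, F-measure and alignment error rate (AER) against gold-standard links marked as sure (S) or probable (P). The Python sketch below assumes those standard definitions. The function name, the link representation and the example data are hypothetical, and this is not the official evaluation code.

```python
from typing import Dict, Set, Tuple

# A link pairs a word position in one language with a word position in the other.
# This representation is an assumption made for the sketch.
Link = Tuple[int, int]


def alignment_scores(hyp: Set[Link], sure: Set[Link], probable: Set[Link]) -> Dict[str, float]:
    """Score hypothesized links against gold sure (S) and probable (P) links.

    Assumed definitions, with A = hyp and S treated as a subset of P:
      precision = |A & P| / |A|
      recall    = |A & S| / |S|
      AER       = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    """
    probable = probable | sure          # every sure link also counts as probable
    hits_sure = len(hyp & sure)
    hits_probable = len(hyp & probable)

    precision = hits_probable / len(hyp) if hyp else 0.0
    recall = hits_sure / len(sure) if sure else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0

    denom = len(hyp) + len(sure)
    aer = 1.0 - (hits_sure + hits_probable) / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f_measure": f_measure, "aer": aer}


if __name__ == "__main__":
    # Made-up links for a single sentence pair.
    system_links = {(1, 1), (2, 2), (3, 4)}
    gold_sure = {(1, 1), (2, 2)}
    gold_probable = {(3, 3)}
    for name, value in alignment_scores(system_links, gold_sure, gold_probable).items():
        print(f"{name}: {value:.3f}")
```

Lower AER is better. A system that proposes every sure link and nothing outside the probable set would score an AER of 0 under these definitions.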


The registration form is now available here. All participants who intend to take part in the word alignment shared task are required to register. During the test period (April 3 - April 10), test data will be released only to registered participants. The last day to register for participation in the shared task is April 7.

Everybody interested in the shared task is invited to join the shared task mailing list, which is open to anyone interested in word alignment, regardless of participation in the shared task. A list of general text alignment resources is also provided.


Activity                          Availability
Task guidelines                   March 7
Training data                     March 7
Development data                  March 10
Test data                         April 3
Submission of results             April 10
Results back to participants      April 12
Submission of short papers        April 17

Guidelines and data sets
  • Code for alignment evaluation, and for format validation of alignment files.
  • General text alignment resources