ACL 2005: Building and Using Parallel Texts for Languages with Scarce Resources.

ACL 2005: Building and Using Parallel Texts:
Data Driven Machine Translation and Beyond

Track 1: Building and Using Parallel Texts
for Languages with Scarce Resources

Joel Martin, National Research Council of Canada
Rada Mihalcea, University of North Texas
Ted Pedersen, University of Minnesota Duluth

Program (tentative)

Workshop Program (June 29)
8:45-9:00	Welcome
Invited Talk
9:00-10:00	So many languages, so few resources: How to bridge the gap?
	Mike Maxwell
Regular Papers
10:00-10:20	Association-Based Bilingual Word Alignment
	Robert C. Moore
10:30-11:00	Break
11:00-11:20	Cross Language Text Categorization by acquiring Multilingual Domain Models from Comparable Corpora
	Alfio Gliozzo, Carlo Strapparava
11:20-11:40	Parsing Word-Aligned Parallel Corpora in a Grammar Induction Context
	Jonas Kuhn
11:40-12:00	Bilingual Word Spectral Clustering for Statistical Machine Translation
	Bing Zhao, Eric P. Xing, Alex Waibel
12:00-12:20	Revealing Phonological Similarities between Related Languages from Automatically Generated Parallel Corpora
	Karin Mueller
12:20-12:40	Acquiring and Using Parallel Texts and Morpho-syntactic Language Resources for Serbian-English Statistical Machine Translation
	Maja Popovic, David Vilar, Hermann Ney, Slobodan Jovicic, Zoran Saric
12:40-2:00	Lunch
2:00-2:20	Induction of Fine-grained Part-of-speech Taggers via Classifier Combination and Crosslingual Projection
	Elliott Franco Drabek, David Yarowsky
2:20-2:40	Comparison, Selection, and Use of Sentence Alignment Algorithms for New Language Pairs
	Anil Kumar Singh, Samar Husain
Shared Task on Word Alignment
2:40-3:00	Word Alignment for Languages with Scarce Resources
	Joel Martin, Rada Mihalcea, Ted Pedersen
3:00-3:20	A hybrid approach to align sentences and words in English-Hindi parallel corpora
	Niraj Aswani, Robert Gaizauskas
3:20-3:35	NUKTI: English-Inuktitut Word Alignment System Description
	Philippe Langlais, Fabrizio Gotti, Guihong Cao
3:35-4:00	Break
4:00-4:15	Models for Inuktitut-English Word Alignment
	Charles Schafer, Elliott Franco Drabek
4:15-4:30	Improved HMM Alignment Models for Languages with Scarce Resources
	Adam Lopez, Philip Resnik
4:30-4:45	Symmetric Probabilistic Alignment
	Ralf D. Brown, Jae Dong Kim, Peter J. Jansen, Jaime G. Carbonell
4:45-5:00	ISI's Participation in the Romanian-English Alignment Task
	Alexander Fraser, Daniel Marcu
5:00-5:15	Experiments Using MAR for Aligning Corpora
	Juan Miguel Vilar
5:15-5:30	Combined word alignments
	Dan Tufis, Radu Ion, Alexandru Ceausu, Dan Stefanescu
Panel and Discussions
5:30-6:30	"Building and Exploiting Parallel Texts for Languages with Scarce Resources: Lessons Learned and Future Directions"
	Ralf Brown, Joel Martin, Bob Moore, Charles Schafer

Invited talk

So many languages, so few resources: How to bridge the gap?

Mike Maxwell
Linguistic Data Consortium
University of Pennsylvania

Abstract: It is now common knowledge that many of the world's more than six thousand languages are in danger of becoming extinct. While languages have disappeared throughout history and pre-history, the present rate of extinction is unprecedented. Attempts are being made both to preserve languages in the living state, and to document and describe languages.

The primary resource for language documentation is undoubtedly parallel text. Traditionally (and necessarily) field linguists have created parallel text in the languages they study by hand, starting out by transcribing previously unwritten languages. This work continues today, aided by modern tools, but it is still labor intensive and slow.

Computational linguists have also demonstrated the utility of parallel text as the fuel for many areas of NLP, including statistical machine translation. But while these uses of parallel text have a record of success with languages like French, Chinese and Arabic, recent efforts in so-called "Low Density" languages such as Hindi, Cebuano and others have shown less success, in large part because of the shortage of parallel text.

In sum, apart from a few large languages, parallel text is a scarce and expensive commodity. I will try to give a feel for the availability of parallel text in a wide range of languages, and discuss efforts to create more parallel text, including better tools for field linguists, web search, paid translation, and Open-Mind style efforts. I conclude by suggesting that if the scarcity of parallel text is to be solved, both for language documentation and for NLP, then it is time to try new methods, perhaps including wikification.

Short bio: Mike Maxwell is a researcher at the Linguistic Data Consortium of the University of Pennsylvania. He obtained his BS in zoology at the University of Illinois in 1972, an MA in linguistics at the University of Washington in 1977, and his PhD at the University of Washington in 1984. As a member of the Summer Institute of Linguistics, he worked with indigenous languages of Mexico, Ecuador and Colombia, and developed tools for doing morphological analysis. He has also worked in syntactic parsing at Boeing Computer Services. At the Linguistic Data Consortium, his work has included developing morphological transducers for various languages, and creating corpora for "low density" languages, that is, languages without extensive computational resources, ranging from Hindi to Tigrinya. His interests include documentation and description of endangered languages, collecting and building resources for low density languages, and morphology.

Task definition

The goal of this shared task is to provide an environment for the evaluation of systems for word alignment, with a focus on languages with scarce resources. This follows on the success of the word alignment shared task that took place as part of the NAACL 2003 workshop on parallel texts. All researchers who have a word alignment system available are invited to participate in this shared task on word alignment, individually or as part of a team.

Participants in the shared task will be provided with common sets of training data, consisting of English-Inuktitut, Romanian-English, and English-Hindi parallel texts (a participating team can choose to apply their system to one, two, or all three language pairs). Participants will be given approximately one month to train their systems with this data, and then previously held-out test data will be released. Participants will run their alignment system on this test data and submit their results, which will be evaluated using a common set of metrics.
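
The metrics are not named on this page. As an illustration only, the sketch below computes the measures commonly reported for word alignment: precision, recall, F-measure, and alignment error rate (AER). It assumes a gold standard with "sure" and "probable" links. The function name `alignment_scores`, the link representation as (source position, target position) pairs, and the toy data are hypothetical and are not taken from the task's evaluation code.

```python
from typing import Dict, Set, Tuple

# A link pairs a source-word position with a target-word position (assumed format).
Link = Tuple[int, int]


def alignment_scores(predicted: Set[Link], sure: Set[Link], probable: Set[Link]) -> Dict[str, float]:
    """Score one system alignment against sure (S) and probable (P) gold links."""
    # By convention every sure link also counts as probable.
    probable = probable | sure
    hits_sure = len(predicted & sure)
    hits_probable = len(predicted & probable)

    # Precision is measured against probable links, recall against sure links.
    precision = hits_probable / len(predicted) if predicted else 0.0
    recall = hits_sure / len(sure) if sure else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)

    # AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
    denominator = len(predicted) + len(sure)
    aer = 1.0 - (hits_sure + hits_probable) / denominator if denominator else 0.0

    return {"precision": precision, "recall": recall, "f_measure": f_measure, "aer": aer}


# Toy example: three predicted links, two sure gold links, one extra probable link.
scores = alignment_scores(
    predicted={(1, 1), (2, 2), (3, 4)},
    sure={(1, 1), (2, 2)},
    probable={(3, 3)},
)
print(scores)
```

In the toy example the system finds both sure links but adds one link that is neither sure nor probable, so recall is perfect while precision and AER are penalized.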

Registration

The registration form is now available here. All active participants who intend to participate in the word alignment shared task are required to register. During the test period (April 3 - April 10) test data will be released only to registered participants! Last day to register for participation in the shared task: April 7.

Everybody interested in the shared task is invited to join the shared task mailing list (the list is open to everybody interested in word alignment, regardless of their participation in the shared task). A list of general text alignment resources is also provided.

Timetable

Activity	Availability
Task guidelines	March 7
Training data	March 7
Development data	March 10
Test data	April 3
Submission of results	April 10
Results back to participants	April 12
Submission of short papers	April 17

Guidelines and data sets

Guidelines for the shared task.
Training data
- English-Inuktitut training data. A collection of Inuktitut-English parallel texts from the Legislative Assembly of Nunavut, sentence-aligned. An introduction to Inuktitut that participants might find helpful is available here.
- Romanian-English training data. This collection groups together the parallel text of 1984, the Romanian Constitution, and a large (about 900,000 tokens) collection of texts collected from the Web. (To get access to this data set, please send an email to Rada Mihalcea, rada at cs unt edu.)
- English-Hindi training data. A collection of English-Hindi parallel texts from the Emille project. Data provided by Niraj Aswani and Rob Gaizauskas from the University of Sheffield.
Development data
- English-Inuktitut development data.
- Romanian-English development data (this is the test set from the HLT/NAACL 2003 shared task, which now plays the role of development data)
- English-Hindi development data.
Test data

Code for alignment evaluation, and for format validation of alignment files.

General text alignment resources
