Rada Mihalcea: Downloads

rada mihalcea


downloads
[see also the research page for related information]

Various software modules and data sets that are/were used in my research. They are made available under the terms of the GNU General Public License. Both data and software are distributed without any warranty.

For any questions regarding the content of this page, please contact Rada Mihalcea, mihalcea at umich.edu

new Compositional Demographic Word Embeddings
new Social Roles across Cultures
new Character Relatedness in Movies
new Multimodal Dialog Deception
new Code for Longitudinal Dialog
new Code for Embedding Stability
new Human Activity Similarity
Value Lexicon
Fake News
Multimodal Prediction of Gender and Personality
Usage Expressions
Grounded Emotions
Demographic-aware Word Associations
Targeted Sentiment
Real-life Deception
Semantic Affordances
Open-Domain Deception
Cross-Cultural Deception
Linguistic Ethnography
Summarization and Keyword Extraction from Emails
MOUD: Multimodal Opinion Utterances Dataset
Sense Clustering Dataset
Efficient Indexer for the Google Web 1T Ngram corpus
Wikipedia Interlingual Links Evaluation Dataset
Sentiment Lexicons in Spanish
Measuring the Semantic Relatedness between Words and Images
Text Mining for Automatic Image Tagging
Learning to Identify Educational Materials (LIEM)
Cross-Lingual Semantic Relatedness (CLSR)
Data for Automatic Short Answer Grading
Multilingual Subjectivity Analysis: Gold Standard and Training Data
GWSD: Graph-based Unsupervised Word Sense Disambiguation
Affective Text: data annotated for emotions and polarity
SenseLearner: all words word sense disambiguation tool
Benchmark for the evaluation of back-of-the-book indexing systems
FrameNet - WordNet verb sense mapping
Resources and Tools for Romanian NLP
Open Mind Word Expert sense tagged data
TWA sense tagged data set
Sense annotated data from Senseval 3
Word Alignment Resources
SemCor 1.6, SemCor 1.7, SemCor 1.7.1, SemCor 2.0, SemCor 2.1, SemCor 3.0
Mappings among various WordNet versions
Senseval-2 and Senseval-3 English all-words data converted into SemCor format
Evaluation code for text filtering
Questions annotated with answer types

Compositional Demographic Word Embeddings

A repository of code to create compositional demographic embeddings and personalized embeddings and language models for specific users. [download] [github] (March, 2021)

Charles Welch, Jonathan K. Kummerfeld, Verónica Pérez-Rosas, Rada Mihalcea, Compositional Demographic Word Embeddings, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, [pdf]
Charles Welch, Jonathan K. Kummerfeld, Verónica Pérez-Rosas, Rada Mihalcea, Exploring the Value of Personalized Word Embeddings, Proceedings of the International Conference on Computational Linguistics (COLING 2020) [pdf]

Social Roles across Cultures

A dataset containing manual annotations of social role perceptions (descriptors and actions) for 49 roles, covering two different cultures (US and India). [data] (November 20, 2019)

Meixing Dong, David Jurgens, Carmen Banea and Rada Mihalcea, Perceptions of social roles across cultures, Proceedings of Social Informatics (SocInfo), 2019. [pdf]

Character Relatedness in Movies

A dataset containing relatedness scores between every pair of characters in 18 movies. It consists of a dense character interaction matrix for 4,761 unique character pairs over 22 hours of dialogue. [data] (October 25, 2019)

Mahmoud Azab, Stephane Dadian, Vivi Nastase, Larry An, Rada Mihalcea, Towards Extracting Medical Family History from Natural Language Interactions: A New Dataset and Baselines, Proceedings of the International Conference on Empirical Methods in Natural Language Processing (EMNLP 2019). [pdf]

Multimodal Deception Detection in Dialogs

A dataset consisting of the dialogs from several Box of Lies shows, annotated for deception at utterance level. [data] (June 1, 2019)

Felix Soldner, Verónica Pérez-Rosas, Rada Mihalcea, Box of Lies: Multimodal Deception Detection in Dialogues, in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL 2019), Minneapolis, June 2019. [pdf]

Code to Process and Analyze Longitudinal Dialog Data

A collection of scripts to collect, process, and analyze several aspects of communication in longitudinal dialog, with a focus on personal text messages. [download code] (April 30, 2019)

Charles Welch, Verónica Pérez-Rosas, Jonathan K. Kummerfeld, Rada Mihalcea, Look Who's Talking: Inferring Speaker Attributes from Personal Longitudinal Dialog, in Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, 2019. Best paper award [pdf]

Stability of Word Embeddings

Code for evaluating the stability of word embeddings. [download code] (April 17, 2018)

Laura Wendlandt, Jonathan K. Kummerfeld, Rada Mihalcea, Factors Influencing the Surprising Instability of Word Embeddings, Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018. [pdf]

Value Hierarchical Lexicon and Lexicon Construction Methodology

A hierarchical lexicon of personal values, along with words used to express such values, which can be used for the recognition of values in text. Comes along with a crowd-powered approach to hierarchical lexicon construction. [download lexicon] [code for crowd-powered lexicon construction] (Oct. 12, 2018)

Steven R. Wilson, Yiting Shen, Rada Mihalcea, Building and Validating Hierarchical Lexicons with a Case Study on Personal Values, in Proceedings of the 10th International Conference on Social Informatics (SocInfo), St. Petersburg, Russia, 2018. Best paper award [pdf]

Evaluation Benchmark for Similarity of Human Activities

A dataset of annotated pairs of activities. Each pair of activities is annotated with four scores: similarity, relatedness, motivational alignment, and perceived actor congruence. [download dataset] (October 31, 2017)

Steven R. Wilson, Rada Mihalcea, Measuring Semantic Relations between Human Activities, Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Taipei, Taiwan, November 2017 [pdf]

Fake News

A dataset of fake and legitimate news, covering several domains (technology, education, business, sports, politics, entertainment and celebrity news). It consists of nearly 1,000 news articles, split evenly between fake and legitimate, collected through crowdsourcing or from web sources. download (August 20, 2018)

Veronica Perez-Rosas, Bennett Kleinberg, Alexandra Lefevre, Rada Mihalcea, Automatic Detection of Fake News, in Proceedings of the International Conference on Computational Linguistics (COLING 2018), Santa Fe, NM, August 2018. [pdf]

Usage Expressions

A dataset used to develop and evaluate methods for the identification of usage expression sentences in consumer product reviews. It consists of 565 reviews spanning five distinct product categories, with more than 3,000 annotated sentences. download (October 29, 2017)

Shibamouli Lahiri, V. G. Vinod Vydiswaran, Rada Mihalcea, Identifying Usage Expression Sentences in Consumer Product Reviews, in Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP 2017), Taipei, Taiwan, November 2017. [pdf]

Multimodal Prediction of Gender and Personality

Code for predicting gender and personality given a user’s images and text. [download code] (April 9, 2018)

Laura Wendlandt, Rada Mihalcea, Ryan L. Boyd, James W. Pennebaker, Multimodal Analysis and Prediction of Latent User Dimensions, Proceedings of the 9th International Conference on Social Informatics (SocInfo 2017), Oxford, UK, September 2017 [pdf]

Grounded Emotions

A dataset consisting of external factors associated with emotions expressed in tweets, including weather, news events, social network, user predisposition, and timing, used in experiments aiming to show the role played by these factors in predicting emotions. download (October 26, 2017)

Vicki Liu, Carmen Banea, Rada Mihalcea, Grounded Emotions, in Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, Texas, October 2017. [pdf]

Demographic-aware Word Associations

A dataset consisting of word association responses for approximately 300 stimulus words collected from 800 respondents of different gender (male/female) and from different locations (India/United States). download (September 5, 2017)

Aparna Garimella, Carmen Banea, Rada Mihalcea, Demographic-Aware Word Associations, in Proceedings of the International Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark, September 2017 [pdf]

Targeted Sentiment

A dataset consisting of 1,042 student comments annotated with targeted sentiment, i.e., all the courses and instructors mentioned and the sentiment that the student has toward them. download (July 15, 2017)

Charles Welch, Rada Mihalcea, Targeted Sentiment to Understand Student Comments, in Proceedings of the International Conference on Computational Linguistics (COLING 2016), Osaka, Japan, December 2016 [pdf]

Real-life Deception

A multimodal dataset consisting of real-life deception: deceptive and truthful trial testimonies, manually transcribed and annotated. The dataset includes 121 short videos, along with their transcriptions and gesture annotations. download (June 15, 2016)

Veronica Perez-Rosas, Mohamed Abouelenien, Rada Mihalcea, Mihai Burzo, Deception Detection using Real-life Trial Data, in Proceedings of the ACM International Conference on Multimodal Interaction (ICMI 2015), Seattle, November 2015. [pdf]

Semantic Affordances

A crowdsourced dataset consisting of ground truth affordances on 20 PASCAL VOC object classes and 957 action classes. Given an object (noun), we identify whether an action (verb) can be performed on it. This is equivalent to connecting verb nodes and noun nodes in WordNet, or filling an affordance matrix encoding the plausibility of each action-object pair. download (June 15, 2016)
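The affordance matrix described above can be pictured with a minimal sketch. The object names, action names, and scores below are made up for illustration and are not taken from the dataset, which covers 20 object classes and 957 action classes; the helper functions are likewise hypothetical.

```python
# Hypothetical toy vocabulary (the real dataset has 20 objects and 957 actions).
objects = ["bicycle", "dog", "sofa"]
actions = ["ride", "feed", "sit_on"]

# Affordance matrix: one row per action (verb), one column per object (noun).
# Each cell is an invented plausibility score in [0, 1].
affordance = [
    [1.0, 0.2, 0.0],  # ride
    [0.0, 1.0, 0.0],  # feed
    [0.3, 0.1, 1.0],  # sit_on
]


def plausibility(action, obj):
    """Look up the plausibility of performing `action` on `obj`."""
    return affordance[actions.index(action)][objects.index(obj)]


def afforded_actions(obj, threshold=0.5):
    """List the actions whose plausibility for `obj` meets the threshold."""
    col = objects.index(obj)
    return [a for a, row in zip(actions, affordance) if row[col] >= threshold]


if __name__ == "__main__":
    print(plausibility("ride", "bicycle"))
    print(afforded_actions("dog"))
```

Reading a column gives the actions an object affords; reading a row gives the objects an action applies to. The threshold of 0.5 is an arbitrary choice for this sketch.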

Yu-Wei Chao, Zhan Wang, Rada Mihalcea, Jia Deng, Mining Semantic Affordances of Visual Object Categories, In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015, Boston, June 2015. [pdf] [data]

Open-Domain Deception

This is a crowdsourced deception dataset consisting of short open domain truths and lies from 512 users. Seven lies and seven truths are provided for each user. The dataset also includes users' demographic information, such as gender, age, country of origin, and education level. download (August 27, 2015)

Veronica Perez-Rosas and Rada Mihalcea, Experiments in Open Domain Deception Detection, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, September 2015.

Cross-Cultural Deception

This is a deception dataset covering four different cultures: US, India, Mexico, and Romania. Each dataset consists of short deceptive and truthful essays for three topics: opinions on abortion, opinions on death penalty, and feelings about a best friend. download (November 20, 2014)

Veronica Perez-Rosas and Rada Mihalcea, Cross-cultural Deception Detection, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, Maryland, June 2014.

Linguistic Ethnography

A package of tools/resources to perform linguistic ethnography. That is, given a collection of documents that are representative for a certain phenomenon (e.g., happiness blogs; lies; women-authored texts; etc.), these tools can assist in analysing the collection and discovering potentially interesting patterns. download (March 9, 2015)

Rada Mihalcea and Stephen Pulman, Linguistic Ethnography: Identifying Dominant Word Classes in Text, in Proceedings of the Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2009), Mexico City, Mexico, March 2009.

Summarization and Keyword Extraction from Emails

This is a dataset consisting of 349 emails manually annotated with abstractive summaries, extractive summaries, and keywords. download (June 20, 2014)

Vanessa Loza and Shibamouli Lahiri and Rada Mihalcea and Po-Hsiang Lai, Building a Dataset for Summarization and Keyword Extraction from Emails, in Proceedings of the International Conference on Language Resources and Evaluations (LREC 2014), Reykjavik, Iceland, May 2014.

Sense Clustering Dataset

This is a dataset consisting of pairs of senses from Wikipedia, manually clustered. download (December 7, 2013).

Bharath Dandala, Chris Hokamp, Rada Mihalcea and Razvan Bunescu, Sense Clustering using Wikipedia, in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2013), Bulgaria, September 2013.

MOUD: Multimodal Opinion Utterances Dataset

This is a collection of video reviews, segmented at utterance level, transcribed, and annotated for sentiment. Acoustic and visual features, automatically extracted, are also included. download (large file, over 500M) (August 15, 2013).

Veronica Perez-Rosas, Rada Mihalcea, and Louis-Philippe Morency, Utterance-Level Multimodal Sentiment Analysis, in Proceedings of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, 2013.

Efficient Indexer for the Google Web 1T Ngram corpus

This is an efficient indexer for the Google Web 1T Ngram corpus, along with a client-server model for fast querying. The software also accepts queries with wildcards. download (July 15, 2012).

Hakan Ceylan and Rada Mihalcea, An Efficient Indexer for Large N-Gram Corpora, in Proceedings of the ACL-HLT 2011 System Demonstrations, Portland, Oregon, 2011.

Wikipedia Interlingual Links Evaluation Dataset

This resource contains manual annotations for 195 pairs of articles in Wikipedia, covering four language pairs. download. The metafile, containing all the candidate interlingual links for ten language pairs can also be downloaded (July 15, 2012).

Bharath Dandala, Rada Mihalcea, Razvan Bunescu, Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia, in Proceedings of *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, Montreal, Canada, June 2012.

Sentiment Lexicons in Spanish

This resource contains two polarity lexicons in Spanish. The lexicons have been automatically or semi-automatically generated. [download] (April 3, 2012).

Veronica Perez Rosas, Carmen Banea, Rada Mihalcea, Learning Sentiment Lexicons in Spanish, in Proceedings of the International Conference on Language Resources and Evaluations (LREC 2012), Istanbul, Turkey, May 2012.

Measuring the Semantic Relatedness between Words and Images

This dataset contains the list of synsets from ImageNet and related data used in an experiment to compute semantic relatedness between words and images. download (January 12, 2011).

Ben Leong and Rada Mihalcea, Measuring the semantic relatedness between words and images, in Proceedings of the International Conference on Computational Semantics (IWCS 2011), Oxford, UK, January 2011.

Text Mining for Automatic Image Tagging

This dataset contains images, texts and gold-standard annotations of 300 image-text pairs randomly collected over the web. download (August 23, 2010).

Text Mining for Automatic Image Tagging, in Proceedings of the International Conference on Computational Linguistics (COLING 2010), Beijing, China, August 2010

Learning to Identify Educational Materials (LIEM)

The data set is a collection of 862 documents, each annotated for its educativeness value along with other user-selected features. download (August 11, 2009).

Learning to Identify Educational Materials, in Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP 2009), Borovets, Bulgaria, September 2009

Cross-Lingual Semantic Relatedness (CLSR)

A validated translation of the original Miller-Charles (Miller and Charles, 1998) and WordSimilarity-353 (Finkelstein et al., 2001) data sets into Spanish, Romanian, and Arabic. download (August 10, 2009)

Samer Hassan and Rada Mihalcea, Cross-Lingual Semantic Relatedness using Encyclopedic Knowledge, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), Suntec, Singapore, August 2009.

Data for Automatic Short Answer Grading

A collection of short student answers and grades for a course in Computer Science. The data set consists of 21 questions with 30 student answers each. [download] (February 10, 2009)

Michael Mohler and Rada Mihalcea, Text-to-text Semantic Similarity for Automatic Short Answer Grading, in Proceedings of the European Chapter of the Association for Computational Linguistics (EACL 2009), Athens, Greece, March 2009. [pdf]

A larger collection of short student answers and grades for a course in Computer Science. The data set consists of 10 assignments (with 4-7 questions each) and 2 exams (with 10 questions each), with 30 student answers each. [download] (July 1, 2011)

Michael Mohler, Razvan Bunescu, Rada Mihalcea, Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics – Human Language Technologies (ACL HLT 2011), Portland, June 2011. [pdf]

Multilingual Subjectivity Analysis: Gold Standard and Training Data

Gold standard data for multilingual subjectivity analysis. The data set consists of 500 sentences in English, Romanian and Spanish, manually annotated for subjectivity. [download] (October 25, 2008)
- Rada Mihalcea, Carmen Banea and Jan Wiebe, Learning Multilingual Subjective Language via Cross-Lingual Projections, In Proceedings of the Association for Computational Linguistics (ACL 2007), Prague, June 2007. [pdf]
Multilingual training data, automatically annotated for subjectivity, in English, Romanian, and Spanish. [download] (October 25, 2008)
- Carmen Banea, Rada Mihalcea, Janyce Wiebe and Samer Hassan, Multilingual subjectivity analysis using machine translation, In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), Honolulu, Hawaii, October 2008. [pdf]
Multilingual training data, automatically annotated for subjectivity, in English, Arabic, French, German, Romanian, and Spanish. [download] (August 20, 2010)
- Carmen Banea, Rada Mihalcea and Janyce Wiebe, Multilingual subjectivity: are more languages better?, In Proceedings of the International Conference on Computational Linguistics (COLING 2010), Beijing, China, August 2010. [pdf]

GWSD: Unsupervised Graph-based Word Sense Disambiguation

GWSD is a system for unsupervised all-words graph-based word sense disambiguation. [download GWSD 1.0] (September 13, 2007).
- Ravi Sinha and Rada Mihalcea, Unsupervised Graph-based Word Sense Disambiguation Using Measures of Word Semantic Similarity, In Proceedings of the IEEE International Conference on Semantic Computing (ICSC 2007), Irvine, CA, September 2007. [pdf]
- Rada Mihalcea, Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling, In Proceedings of the Joint Conference on Human Language Technology / Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver, October, 2005. [pdf]

Affective Text: Data Annotated for Emotions and Polarity

Affective Text is a data set consisting of 1000 test headlines and 200 development headlines, each of them annotated with the six Ekman emotions and the polarity orientation. [download] (July 13, 2007).
- Carlo Strapparava and Rada Mihalcea, SemEval-2007 Task 14: Affective Text, in Proceedings of the 4th International Workshop on the Semantic Evaluations (SemEval 2007), Prague, Czech Republic, June 2007. [pdf]
- Read more about the task here.

SenseLearner: All-Words Word Sense Disambiguation Tool

SenseLearner 2.0 [download] (June 13, 2005).
- Changes in version 2.0: a client-server model that allows for significantly faster tagging; simpler input file format (the SemCor-like format is no longer required)
SenseLearner 1.0 (beta) [download] (Nov 18, 2004)

Benchmark for the evaluation of back-of-the-book indexing systems

A benchmark for the evaluation of systems for back-of-the-book indexing [download]. The benchmark is described in:

FrameNet - WordNet verb sense mapping

FnWnVerbMap 1.0 [download]
A mapping between verb lexical units in FrameNet II and verb senses in WordNet. The mapping process is described in:

Resources and Tools for Romanian NLP

Romanian corpus of newspaper articles (and two novels), 50 mil. words. [research purpose only - send a request to mihalcea at umich dot edu]
Romanian sense tagged data, 39 ambiguous words [download]
Romanian-English parallel texts, sentence-aligned, 1 mil. words (each side) [download; research purpose only - send a request to mihalcea at umich dot edu]
Romanian-English word aligned data (2003) [download]
- See also the webpage of the HLT/NAACL 2003 workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond for related tools & resources.
Romanian-English word aligned data (2005) [download]
- See also the webpage of the ACL 2005 workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond for related tools & resources.
Romanian-English dictionary (38,000 entries) [download]
For other resources and tools for Romanian, see the ConsiLR webpage.

Open Mind Word Expert Sense Tagged Data

OMWE 1.0: Sense tagged data for 288 nouns, created within the Open Mind Word Expert framework during one year of activity (2002) [download]
OMWE 2.0: Sense tagged data for nouns, verbs, adjectives, created within the Open Mind Word Expert framework. These data sets were used during the Senseval-3 evaluations.
- Romanian OMWE: Data for 39 ambiguous words in Romanian [download]
- English OMWE: Data for 57 ambiguous words, annotated with WordNet/Wordsmyth senses [download]
- English-Hindi OMWE: Data for 41 English words annotated with their corresponding Hindi translation [download]

TWA Sense Tagged Data

Sense tagged data for six words with two-way ambiguities (bass, crane, motion, palm, plant, tank). [download]

Sense annotated data from Senseval 3

Sense tagged data for many tasks, including English all words, English, Italian, Basque, Catalan, Chinese, Romanian, Spanish lexical sample, Multilingual lexical sample, WSD of WordNet glosses, semantic roles, and logic forms. [site with links to all datasets]

Resources for Word Alignment

Word aligned data for Romanian-English, English-French.
Parallel texts for training.
Code for word alignment evaluation.

All of these are available from the webpage of the HLT/NAACL 2003 workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond.

SemCor

SemCor 1.6 [download]
SemCor 1.7 [download]
SemCor 1.7.1 [download]
SemCor 2.0 [download]
SemCor 2.1 [download]
SemCor 3.0 [download]

WordNet mappings

WordNet 1.6 - 1.7 [download]
WordNet 1.6 - 1.7.1 [download]
WordNet 1.7 - 1.7.1 [download]
WordNet 1.6 - 2.0 [download]
WordNet 1.7.1 - 2.0 [download]

Senseval-2 and Senseval-3 English all-words data converted into SemCor format

Senseval-2 English all-words converted into SemCor format. [download]
Senseval-3 English all-words converted into SemCor format. [download]

Text Filtering

Evaluation software for text filtering systems. It implements the normalized utility, F-measure, precision, and recall, as defined in the TREC 2002 Filtering task. Usage is straightforward and closely follows the TREC 2002 Filtering guidelines. [download].
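As a rough illustration of the four measures named above, here is a minimal sketch. It is not the downloadable software. It assumes the linear utility 2·R+ − N+ (R+ relevant retrieved, N+ non-relevant retrieved), a normalization floor of −0.5, and an F-measure with beta = 0.5; these settings are how the TREC 2002 definitions are commonly stated and should be checked against the official guidelines. The function and parameter names are hypothetical.

```python
def filtering_metrics(rel_ret, nonrel_ret, rel_total, beta=0.5, min_nu=-0.5):
    """Compute precision, recall, F-beta, and a scaled linear utility.

    rel_ret    -- relevant documents retrieved (R+)
    nonrel_ret -- non-relevant documents retrieved (N+)
    rel_total  -- all relevant documents for the topic
    beta, min_nu -- assumed settings, not taken from the source page
    """
    retrieved = rel_ret + nonrel_ret
    precision = rel_ret / retrieved if retrieved else 0.0
    recall = rel_ret / rel_total if rel_total else 0.0

    # F-beta: beta < 1 weights precision more heavily than recall.
    b2 = beta * beta
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0

    # Linear utility: reward relevant hits, penalize non-relevant ones.
    utility = 2 * rel_ret - nonrel_ret
    max_utility = 2 * rel_total
    normalized = utility / max_utility if max_utility else 0.0

    # Clip at the floor and rescale so the score lies in [0, 1].
    scaled = (max(normalized, min_nu) - min_nu) / (1 - min_nu)

    return {
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
        "utility": utility,
        "scaled_utility": scaled,
    }


if __name__ == "__main__":
    print(filtering_metrics(rel_ret=8, nonrel_ret=2, rel_total=10))
```

With this scaling, a system that retrieves nothing scores above zero, so retrieving many non-relevant documents is penalized relative to staying silent.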
More soon...

QA Data Set: Annotated questions

Annotations for about 5,500 questions used in an analysis of information requests. Questions are drawn from the Excite log and from the TREC QA benchmark. This is the data set used in the experiments reported in:
Rada Mihalcea, The Semantic Wildcard, in Proceedings of the LREC 2002 Workshop on "Using Semantics for Information Retrieval and Filtering: State of the Art and Future Research", Las Palmas, Spain, May 2002.

	Excite	TREC
Annotated data	What Which	What Which
Question types	What Which	What Which