[see also the research page for related information]
Various software modules and data sets that are/were used in my research. They are made available under the terms of the GNU General Public License. Both data and software are distributed without any warranty.
For any questions regarding the content of this page, please contact Rada Mihalcea, mihalcea at umich.edu
Compositional Demographic Word Embeddings
A repository of code to create compositional demographic embeddings and personalized embeddings and language models for specific users. [download] [github] (March, 2021)
Social Roles across Cultures
A dataset containing manual annotations of social role perceptions (descriptors and actions) for 49 roles, covering two different cultures (US and India). [data] (November 20, 2019)
- Charles Welch, Jonathan K. Kummerfeld, Verónica Pérez-Rosas, Rada Mihalcea, Compositional Demographic Word Embeddings, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, [pdf]
- Charles Welch, Jonathan K. Kummerfeld, Verónica Pérez-Rosas, Rada Mihalcea, Exploring the Value of Personalized Word Embeddings, Proceedings of the International Conference on Computational Linguistics (COLING 2020) [pdf]
Character Relatedness in Movies
A dataset containing relatedness scores between every pair of characters in 18 movies. It consists of a dense character interaction matrix for 4,761 unique character pairs over 22 hours of dialogue. [data] (October 25, 2019)
- Meixing Dong, David Jurgens, Carmen Banea and Rada Mihalcea, Perceptions of social roles across cultures, Proceedings of Social Informatics (SocInfo), 2019. [pdf]
Multimodal Deception Detection in Dialogs
A dataset consisting of the dialogs from several Box of Lies shows, annotated for deception at utterance level. [data] (June 1, 2019)
- Mahmoud Azab, Stephane Dadian, Vivi Nastase, Larry An, Rada Mihalcea, Towards Extracting Medical Family History from Natural Language Interactions: A New Dataset and Baselines, Proceedings of the International Conference on Empirical Methods in Natural Language Processing (EMNLP 2019). [pdf]
Code to Process and Analyze Longitudinal Dialog Data
A collection of scripts to collect, process, and analyze several aspects of communication in longitudinal dialog, with a focus on personal text messages.
[download code] (April 30, 2019)
- Felix Soldner, Verónica Pérez-Rosas, Rada Mihalcea, Box of Lies: Multimodal Deception Detection in Dialogues, in Proceedings of the North American Association for Computational Linguistics (NAACL 2019), Minneapolis, June 2019. [pdf]
Stability of Word Embeddings
Code for evaluating the stability of word embeddings. [download code] (April 17, 2018)
- Charles Welch, Verónica Pérez-Rosas, Jonathan K. Kummerfeld, Rada Mihalcea, Look Who's Talking: Inferring Speaker Attributes from Personal Longitudinal Dialog, in Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, 2019. Best paper award [pdf]
Value Hierarchical Lexicon and Lexicon Construction Methodology
A hierarchical lexicon of personal values, along with words used to express such values, which can be used for the recognition of values in text. Comes along with a crowd-powered approach to hierarchical lexicon construction.
[download lexicon] [code for crowd-powered lexicon construction] (Oct. 12, 2018)
- Laura Wendlandt, Jonathan K. Kummerfeld, Rada Mihalcea, Factors Influencing the Surprising Instability of Word Embeddings, Proceedings of the North American Conference on Computational Linguistics (NAACL), 2018. [pdf]
Evaluation Benchmark for Similarity of Human Activities
A dataset of annotated pairs of activities. Each pair of activities is annotated with four scores: similarity, relatedness, motivational alignment, and perceived actor congruence. [download dataset] (October 31, 2017)
- Steven R. Wilson, Yiting Shen, Rada Mihalcea, Building and Validating Hierarchical Lexicons with a Case Study on Personal Values, in Proceedings of the 10th International Conference on Social Informatics (SocInfo), St. Petersburg, Russia, 2018. Best paper award [pdf]
A dataset of fake and legitimate news, covering several domains (technology, education, business, sports, politics, entertainment and celebrity news). It consists of nearly 1,000 news articles, split evenly between fake and legitimate, collected through crowdsourcing or from web sources.
download (August 20, 2018)
- Steven R. Wilson, Rada Mihalcea, Measuring Semantic Relations between Human Activities, Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Taipei, Taiwan, November 2017 [pdf]
A dataset used to develop and evaluate methods for the identification of usage expression sentences in consumer product reviews. It consists of 565 reviews spanning five distinct product categories, with more than 3,000 annotated sentences.
download (October 29, 2017)
- Veronica Perez-Rosas, Bennett Kleinberg, Alexandra Lefevre, Rada Mihalcea, Automatic Detection of Fake News, in Proceedings of the International Conference on Computational Linguistics (COLING 2018), Santa Fe, NM, August 2018. [pdf]
Multimodal Prediction of Gender and Personality
Code for predicting gender and personality given a user’s images and text. [download code] (April 9, 2018)
- Shibamouli Lahiri, V. G. Vinod Vydiswaran, Rada Mihalcea, Identifying Usage Expression Sentences in Consumer Product Reviews, in Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP 2017), Taipei, Taiwan, November 2017. [pdf]
A dataset consisting of external factors associated with emotions expressed in tweets, including weather, news events, social network, user predisposition, and timing, used in experiments aiming to show the role played by these factors in predicting emotions.
download (October 26, 2017)
- Laura Wendlandt, Rada Mihalcea, Ryan L. Boyd, James W. Pennebaker, Multimodal Analysis and Prediction of Latent User Dimensions, Proceedings of the 9th International Conference on Social Informatics (SocInfo 2017), Oxford, UK, September 2017 [pdf]
Demographic-aware Word Associations
A dataset consisting of word association responses for approximately 300 stimulus words collected from 800 respondents of different gender (male/female) and from different locations (India/United States). download (September 5, 2017)
- Vicki Liu, Carmen Banea, Rada Mihalcea, Grounded Emotions, in Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, Texas, October 2017. [pdf]
A dataset consisting of 1,042 student comments annotated with targeted sentiment, i.e., all the courses and instructors mentioned and the sentiment that the student has toward them. download (July 15, 2017)
- Aparna Garimella, Carmen Banea, Rada Mihalcea, Demographic-Aware Word Associations, in Proceedings of the International Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark, September 2017 [pdf]
A multimodal dataset consisting of real-life deception: deceptive and truthful trial testimonies, manually transcribed and annotated. The dataset includes 121 short videos, along with their transcriptions and gesture annotations. download (June 15, 2016)
- Charles Welch, Rada Mihalcea, Targeted Sentiment to Understand Student Comments, in Proceedings of the International Conference on Computational Linguistics (COLING 2016), Osaka, Japan, December 2016 [pdf]
A crowdsourced dataset consisting of ground truth affordances on 20 PASCAL VOC object classes and 957 action classes. Given an object (noun), we identify whether an action (verb) can be performed on it. This is equivalent to connecting verb nodes and noun nodes in WordNet, or filling an affordance matrix encoding the plausibility of each action-object pair. download (June 15, 2016)
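The affordance matrix described above can be pictured with a minimal sketch. The object names, action names, and plausibility judgments below are hypothetical toy values chosen for illustration; they are not taken from the dataset, and the dataset's actual file format is not assumed here.

```python
# Hypothetical toy vocabulary; the real dataset covers 20 object classes
# and 957 action classes.
objects = ["bicycle", "sofa", "dog"]
actions = ["ride", "sit_on", "feed"]

# Illustrative (action, object) pairs judged plausible. These judgments are
# invented for the example and do not come from the dataset.
plausible_pairs = {
    ("ride", "bicycle"),
    ("sit_on", "bicycle"),
    ("sit_on", "sofa"),
    ("feed", "dog"),
}


def build_affordance_matrix(actions, objects, plausible_pairs):
    """Return a matrix with one row per action and one column per object.

    A cell is 1 if the action can plausibly be performed on the object,
    and 0 otherwise.
    """
    return [
        [1 if (action, obj) in plausible_pairs else 0 for obj in objects]
        for action in actions
    ]


def affords(matrix, actions, objects, action, obj):
    """Look up whether a given action-object pair is marked plausible."""
    return matrix[actions.index(action)][objects.index(obj)] == 1


matrix = build_affordance_matrix(actions, objects, plausible_pairs)
```

Each row can also be read as the set of edges from one verb node to the noun nodes, which is the graph view mentioned in the description.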
- Veronica Perez-Rosas, Mohamed Abouelenien, Rada Mihalcea, Mihai Burzo, Deception Detection using Real-life Trial Data, in Proceedings of the ACM International Conference on Multimodal Interaction (ICMI 2015), Seattle, November 2015. [pdf]
This is a crowdsourced deception dataset consisting of short open domain truths and lies from 512 users. Seven lies and seven truths are provided for each user. The dataset also includes users' demographic information, such as gender, age, country of origin, and education level. download (August 27, 2015)
This is a deception dataset covering four different cultures: US, India, Mexico, and Romania. Each dataset consists of short deceptive and truthful essays for three topics: opinions on abortion, opinions on death penalty, and feelings about a best friend. download (November 20, 2014)
- Yu-Wei Chao, Zhan Wang, Rada Mihalcea, Jia Deng, Mining Semantic Affordances of Visual Object Categories, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015, Boston, June 2015. [pdf] [data]
A package of tools/resources to perform linguistic ethnography. That is, given a collection of documents that are representative for a certain phenomenon (e.g., happiness blogs; lies; women-authored texts; etc.), these tools can assist in analysing the collection and discovering potentially interesting patterns. download (March 9, 2015)
Summarization and Keyword Extraction from Emails
This is a dataset consisting of 349 emails manually annotated with abstractive summaries, extractive summaries, and keywords. download (June 20, 2014)
Sense Clustering Dataset
This is a dataset consisting of pairs of senses from Wikipedia, manually clustered. download (December 7, 2013).
- Veronica Perez-Rosas and Rada Mihalcea, Cross-cultural Deception Detection, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, Maryland, June 2014.
MOUD: Multimodal Opinion Utterances Dataset
This is a collection of video reviews, segmented at utterance level, transcribed, and annotated for sentiment. Acoustic and visual features, automatically extracted, are also included. download (large file, over 500M) (August 15, 2013).
Efficient Indexer for the Google Web 1T Ngram corpus
This is an efficient indexer for the Google Web 1T Ngram corpus, along with a client-server model for fast querying. The software also accepts queries with wildcards. download (July 15, 2012).
Wikipedia Interlingual Links Evaluation Dataset
This resource contains manual annotations for 195 pairs of articles in Wikipedia, covering four language pairs. download. The metafile, containing all the candidate interlingual links for ten language pairs can also be downloaded (July 15, 2012).
Sentiment Lexicons in Spanish
This resource contains two polarity lexicons in Spanish. The lexicons have been automatically or semi-automatically generated. [download] (April 3, 2012).
- Bharath Dandala, Chris Hokamp, Rada Mihalcea and Razvan Bunescu, Sense Clustering using Wikipedia, in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2013), Bulgaria, September 2013.
Measuring the Semantic Relatedness between Words and Images
This dataset contains the list of synsets from ImageNet and related data used in an experiment to compute semantic relatedness between words and images. download (January 12, 2011).
Text Mining for Automatic Image Tagging
This dataset contains images, texts and gold-standard annotations of 300 image-text pairs randomly collected over the web. download (August 23, 2010).
Learning to Identify Educational Material (LIEM)
The data set is a collection of 862 documents annotated for their educativeness value along with other user-selected features. download (August 11, 2009).
Cross-Lingual Semantic Relatedness (CLSR)
A validated translation of the original Miller-Charles (Miller and Charles, 1991) and WordSimilarity-353 (Finkelstein et al., 2001) into Spanish, Romanian, and Arabic. download (August 10, 2009)
Data for Automatic Short Answer Grading
A collection of short student answers and grades for a course in Computer Science. The data set consists of 21 questions with 30 student answers each. [download] (February 10, 2009)
- Veronica Perez Rosas, Carmen Banea, Rada Mihalcea, Learning Sentiment Lexicons in Spanish, in Proceedings of the International Conference on Language Resources and Evaluations (LREC 2012), Istanbul, Turkey, May 2012.
A larger collection of short student answers and grades for a course in Computer Science. The data set consists of 10 assignments (with 4-7 questions each) and 2 exams (with 10 questions each), with 30 student answers each. [download] (July 1, 2011)
Michael Mohler and Rada Mihalcea, Text-to-text Semantic Similarity for Automatic Short Answer Grading, in Proceedings of the European Chapter of the Association for Computational Linguistics (EACL 2009), Athens, Greece, March 2009. [pdf]
- Michael Mohler, Razvan Bunescu, Rada Mihalcea, Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments, Proceedings of the 49th Annual Meeting of the Association of Computational Linguistics – Human Language Technologies (ACL HLT 2011), Portland, June 2011. [pdf]
Multilingual Subjectivity Analysis: Gold Standard and Training Data
- Gold standard data for multilingual subjectivity analysis. The data set consists of 500 sentences in English, Romanian and Spanish, manually annotated for subjectivity. [download] (October 25, 2008)
Rada Mihalcea, Carmen Banea and Jan Wiebe, Learning Multilingual Subjective Language via Cross-Lingual Projections, In Proceedings of the Association for Computational Linguistics (ACL 2007), Prague, June 2007. [pdf]
- Multilingual training data, automatically annotated for subjectivity, in English, Romanian, and Spanish. [download] (October 25, 2008)
- Carmen Banea, Rada Mihalcea, Janyce Wiebe and Samer Hassan, Multilingual subjectivity analysis using machine translation, In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), Honolulu, Hawaii, October 2008. [pdf]
- Multilingual training data, automatically annotated for subjectivity, in English, Arabic, French, German, Romanian, and Spanish. [download] (August 20, 2010)
Carmen Banea, Rada Mihalcea and Janyce Wiebe, Multilingual subjectivity: are more languages better, In Proceedings of the International Conference on Computational Linguistics (COLING 2010), Beijing, China, August 2010. [pdf]
GWSD: Unsupervised Graph-based Word Sense Disambiguation
- GWSD is a system for unsupervised all-words graph-based word sense disambiguation. download GWSD 1.0 (September 13, 2007).
- Ravi Sinha and Rada Mihalcea, Unsupervised Graph-based Word Sense Disambiguation Using Measures of Word Semantic Similarity, In Proceedings of the IEEE International Conference on Semantic Computing (ICSC 2007), Irvine, CA, September 2007. [pdf]
- Rada Mihalcea, Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling, In Proceedings of the Joint Conference on Human Language Technology / Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver, October, 2005. [pdf]
Affective Text: Data Annotated for Emotions and Polarity
- Affective Text is a data set consisting of 1000 test headlines and 200 development headlines, each of them annotated with the six Ekman emotions and the polarity orientation. [download] (July 13, 2007).
- Carlo Strapparava and Rada Mihalcea, SemEval-2007 Task 14: Affective Text, in Proceedings of the 4th International Workshop on the Semantic Evaluations (SemEval 2007), Prague, Czech Republic, June 2007. [pdf]
- Read more about the task here.
SenseLearner: All-Words Word Sense Disambiguation Tool
- SenseLearner 2.0 [download] (June 13, 2005).
- Changes in version 2.0: a client-server model that allows for significantly faster tagging; simpler input file format (the SemCor-like format is no longer required)
- SenseLearner 1.0 (beta) [download] (Nov 18, 2004)
Benchmark for the evaluation of back-of-the-book indexing systems
- A benchmark for the evaluation of systems for back-of-the-book indexing [download].
The benchmark is described in:
Andras Csomai and Rada Mihalcea, Creating a Testbed for the Evaluation of Automatically Generated Back-of-the-book Indexes, in Proceedings of the Conference on Computational Linguistics and Intelligent Text Processing (CICLing), LNCS, Mexico City, February 2006. [pdf]
FrameNet - WordNet verb sense mapping
- FnWnVerbMap 1.0 [download]
A mapping between verb lexical units in FrameNet II and verb senses in WordNet. The mapping process is described in:
Lei Shi and Rada Mihalcea, Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing, CICLing 2005, Mexico [pdf]
Resources and Tools for Romanian NLP
- Romanian corpus of newspaper articles (and two novels), 50 mil. words. [research purpose only - send a request to mihalcea at umich dot edu]
- Romanian sense tagged data, 39 ambiguous words [download]
- Romanian-English parallel texts, sentence-aligned, 1 mil. words (each side) [download; research purpose only - send a request to mihalcea at umich dot edu]
- Romanian-English word aligned data (2003) [download]
- Romanian-English word aligned data (2005) [download]
- Romanian-English dictionary (38,000 entries) [download]
- For other resources and tools for Romanian, see the ConsiLR webpage.
Open Mind Word Expert Sense Tagged Data
- OMWE 1.0: Sense tagged data for 288 nouns, created within the Open Mind Word Expert framework during one year of activity (2002) [download]
- OMWE 2.0: Sense tagged data for nouns, verbs, adjectives, created within the Open Mind Word Expert framework. These data sets were used during the Senseval-3 evaluations.
- Romanian OMWE: Data for 39 ambiguous words in Romanian [download]
- English OMWE: Data for 57 ambiguous words, annotated with WordNet/Wordsmyth senses [download]
- English-Hindi OMWE: Data for 41 English words annotated with their corresponding Hindi translation [download]
TWA Sense Tagged Data
- Sense tagged data for six words with two-way ambiguities (bass, crane, motion, palm, plant, tank). [download]
Sense annotated data from Senseval 3
- Sense tagged data for many tasks, including English all words, English, Italian, Basque, Catalan, Chinese, Romanian, Spanish lexical sample, Multilingual lexical sample, WSD of WordNet glosses, semantic roles, and logic forms.
[site with links to all datasets]
Resources for Word Alignment
All of these are available from the webpage of the HLT/NAACL 2003 workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond.
- Word aligned data for Romanian-English, English-French.
- Parallel texts for training.
- Code for word alignment evaluation.
Texts semantically annotated with WordNet 1.6 senses (created at Princeton University), and automatically mapped to WordNet 1.7, WordNet 1.7.1, WordNet 2.0, WordNet 2.1, WordNet 3.0
- SemCor 1.6 [download]
- SemCor 1.7 [download]
- SemCor 1.7.1 [download]
- SemCor 2.0 [download]
- SemCor 2.1 [download]
- SemCor 3.0 [download]
A mapping between synset offsets in various WordNet versions.
- WordNet 1.6 - 1.7 [download]
- WordNet 1.6 - 1.7.1 [download]
- WordNet 1.7 - 1.7.1 [download]
- WordNet 1.6 - 2.0 [download]
- WordNet 1.7.1 - 2.0 [download]
Senseval-2 and Senseval-3 English all-words data converted into SemCor format
- Evaluation software for text filtering systems, implements the normalized utility, F-measure, precision, and recall, as defined in the TREC 2002 Filtering task. Straightforward usage, follows closely the TREC 2002 Filtering guidelines. [download].
- More soon...
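As a rough illustration of three of the measures named for the text filtering evaluation software above, here is a minimal sketch of set-based precision, recall, and F-measure. It is not the distributed tool: the function name `filtering_scores`, its parameters, and the document identifiers are assumptions made for this example, and the normalized utility is left out because its definition is not given on this page.

```python
def filtering_scores(retrieved, relevant, beta=1.0):
    """Compute precision, recall, and F-measure for one filtering topic.

    retrieved: identifiers of documents the system accepted (hypothetical input)
    relevant:  identifiers of documents judged relevant (hypothetical input)
    beta:      weight of recall relative to precision in the F-measure
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)

    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0

    denominator = beta ** 2 * precision + recall
    f_measure = (
        (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    )
    return precision, recall, f_measure


# Toy example with made-up document identifiers.
precision, recall, f_measure = filtering_scores(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
```

The `beta` parameter is exposed so that a precision-weighted or recall-weighted variant can be computed; which value the TREC 2002 guidelines prescribe should be checked against those guidelines.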
QA Data Set: Annotated questions
Annotations for about 5,500 questions used in an analysis of information requests. Questions are drawn from the Excite log and from the TREC QA benchmark. This is the data set used in the experiments reported in:
- Rada Mihalcea, The Semantic Wildcard, in Proceedings of the LREC 2002 Workshop on "Using Semantics for Information Retrieval and Filtering: State of the Art and Future Research", Las Palmas, Spain, May 2002.