Emily Mower Provost

Associate Professor, Computer Science and Engineering

Emily Mower Provost

Associate Professor

University of Michigan
EECS Department
Computer Science & Engineering
3629 BBB
Ann Arbor, MI 48109-2121
Tel: 734-647-1802
Email:



My Portrait

Computational Human Artificial Intelligence (CHAI) Lab

Research Goals

Our goal is to advance speech-centered machine learning for human behavior detection. We focus on three main areas: 1) emotion recognition, 2) mental health modeling, and 3) assistive technology.

The research on this page is organized into focus areas. Please click on the blue boxes in each section to see a subset of relevant papers published since 2016.


Emotion Recognition

General Purpose Emotion Recognition

Emotion is often subtle and ambiguous. Yet, it is central to our communication and provides critical information about our wellness. Consequently, there has been extensive work asking how emotion can be automatically detected from speech samples. This work has focused on classifier design (more recently, network architectures) and feature engineering (or feature learning).

Biqiao Zhang, Yuqing Kong, Georg Essl, Emily Mower Provost. “f-Similarity Preservation Loss for Soft Labels: A Demonstration on Cross-Corpus Speech Emotion Recognition.” AAAI. Hawaii. January 2019.
Embedding visualization. Dots are data points in MSP-Improv. Colors represent the activation labels, the darkest are [0, 1] and the lightest are [1,0]. The rows are the embeddings at epoch 10, 30, 50 (e.g., E50). The columns correspond to six models based on different divergence measures. Abstract: In this paper, we propose a Deep Metric Learning (DML) approach that supports soft labels. DML seeks to learn representations that encode the similarity between examples through deep neural networks. DML generally presupposes that data can be divided into discrete classes using hard labels. However, some tasks, such as our exemplary domain of speech emotion recognition (SER), work with inherently subjective data, data for which it may not be possible to identify a single hard label. We propose a family of loss functions, f-Similarity Preservation Loss (f-SPL), based on the dual form of f-divergence for DML with soft labels. We show that the minimizer of f-SPL preserves the pairwise label similarities in the learned feature embeddings. We demonstrate the efficacy of the proposed loss function on the task of cross-corpus SER with soft labels. Our approach, which combines f-SPL and classification loss, significantly outperforms a baseline SER system with the same structure but trained with only classification loss in most experiments. We show that the presented techniques are more robust to over-training and can learn an embedding space in which the similarity between examples is meaningful.
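A minimal PyTorch sketch of the general idea behind similarity-preservation training with soft labels. This is an illustrative simplification, not the paper's f-divergence-based f-SPL: the shapes, the cosine/dot-product similarities, and the mean-squared-error comparison are all assumptions.

```python
import torch
import torch.nn.functional as F

def soft_label_similarity_loss(embeddings, soft_labels):
    """Encourage pairwise embedding similarity to track pairwise label similarity.

    embeddings:  (batch, dim) feature vectors from an encoder
    soft_labels: (batch, num_classes) label distributions (rows sum to 1)

    Simplified stand-in for f-SPL: the paper derives its loss from the dual
    form of an f-divergence, whereas this sketch just matches the two
    pairwise similarity matrices with a mean squared error.
    """
    emb = F.normalize(embeddings, dim=1)
    emb_sim = emb @ emb.t()                  # cosine similarity between embeddings
    lab_sim = soft_labels @ soft_labels.t()  # dot-product similarity between label distributions
    return F.mse_loss(emb_sim, lab_sim)

# Toy usage: batch of 8 utterances, 64-d embeddings, 2-bin soft activation labels
emb = torch.randn(8, 64, requires_grad=True)
labels = torch.softmax(torch.randn(8, 2), dim=1)
loss = soft_label_similarity_loss(emb, labels)
loss.backward()
```

As in the paper, a term like this would be added to a standard classification loss rather than used alone.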

Biqiao Zhang, Soheil Khorram, and Emily Mower Provost. “Exploiting Acoustic and Lexical Properties of Phonemes to Recognize Valence from Speech.” International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Brighton, England. May 2019.
Abstract: Emotions modulate speech acoustics as well as language. The latter influences the sequences of phonemes that are produced, which in turn further modulate the acoustics. Therefore, phonemes impact emotion recognition in two ways: (1) they introduce an additional source of variability in speech signals and (2) they provide information about the emotion expressed in speech content. Previous work in speech emotion recognition has considered (1) or (2), individually. In this paper, we investigate how we can jointly consider both factors to improve the prediction of emotional valence (positive vs. negative), and the relationship between improved prediction and the emotion elicitation process (e.g., fixed script, improvisation, natural interaction). We present a network that exploits both the acoustic and the lexical properties of phonetic information using multi-stage fusion. Our results on the IEMOCAP and MSP-Improv datasets show that our approach outperforms systems that either do not consider the influence of phonetic information or that only consider a single aspect of this influence.

Biqiao Zhang, Georg Essl, and Emily Mower Provost. "Predicting the Distribution of Emotion Perception: Capturing Inter-Rater Variability." International Conference on Multimodal Interaction (ICMI). Glasgow, Scotland, November 2017. [Note: full paper, oral presentation]
The process of generating the two-dimensional discrete probability distributions: (a) individual evaluations (size of dot proportional to the number of evaluations); (b) annotation cloud generated by averaging a subsample of evaluations and adding random noise; (c) probability density distribution calculated by KDE; (d) discretized probability distribution at 4 × 4 resolution. Abstract: Emotion perception is person-dependent and variable. Dimensional characterizations of emotion can capture this variability by describing emotion in terms of its properties (e.g., valence, positive vs. negative, and activation, calm vs. excited). However, in many emotion recognition systems, this variability is often considered “noise” and is attenuated by averaging across raters. Yet, inter-rater variability provides information about the subtlety or clarity of an emotional expression and can be used to describe complex emotions. In this paper, we investigate methods that can effectively capture the variability across evaluators by predicting emotion perception as a discrete probability distribution in the valence-activation space. We propose: (1) a label processing method that can generate two-dimensional discrete probability distributions of emotion from a limited number of ordinal labels; (2) a new approach that predicts the generated probabilistic distributions using dynamic audio-visual features and Convolutional Neural Networks (CNNs). Our experimental results on the MSP-IMPROV corpus suggest that the proposed approach is more effective than the conventional Support Vector Regressions (SVRs) approach with utterance-level statistical features, and that feature-level fusion of the audio and video modalities outperforms decision-level fusion. The proposed CNN model predominantly improves the prediction accuracy for the valence dimension and brings a consistent performance improvement over data recorded from natural interactions. The results demonstrate the effectiveness of generating emotion distributions from a limited number of labels and predicting the distribution using dynamic features and neural networks.
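A rough sketch of the label-processing steps listed in the caption above, assuming (valence, activation) evaluations in [0, 1]; the jitter level and resampling count are hypothetical choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

def discretize_annotations(evals, resolution=4, noise_std=0.05, n_samples=200, seed=0):
    """Turn a handful of (valence, activation) evaluations into a discrete
    probability distribution: resample + jitter the evaluations, fit a KDE,
    then evaluate the density on a resolution x resolution grid."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(evals), size=n_samples)
    cloud = evals[idx] + rng.normal(0.0, noise_std, size=(n_samples, 2))  # annotation cloud
    kde = gaussian_kde(cloud.T)                                           # density estimate
    centers = (np.arange(resolution) + 0.5) / resolution                  # grid-cell centers
    vv, aa = np.meshgrid(centers, centers, indexing="ij")
    density = kde(np.vstack([vv.ravel(), aa.ravel()])).reshape(resolution, resolution)
    return density / density.sum()                                        # normalize to a distribution

# Five hypothetical raters, mostly positive valence / high activation
evals = np.array([[0.8, 0.7], [0.9, 0.6], [0.7, 0.8], [0.6, 0.9], [0.85, 0.75]])
print(discretize_annotations(evals))
```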

Zakaria Aldeneh, Soheil Khorram, Dimitrios Dimitriadis, Emily Mower Provost. "Pooling Acoustic and Lexical Features for the Prediction of Valence." International Conference on Multimodal Interaction (ICMI). Glasgow, Scotland, November 2017. [Note: short paper, oral presentation]
Overall network architecture. Abstract: In this paper, we present an analysis of different multimodal fusion approaches in the context of deep learning, focusing on pooling intermediate representations learned for the acoustic and lexical modalities. Traditional approaches to multimodal feature pooling include: concatenation, element-wise addition, and element-wise multiplication. We compare these traditional methods to outer-product and compact bilinear pooling approaches, which consider more comprehensive interactions between features from the two modalities. We also study the influence of each modality on the overall performance of a multimodal system. Our experiments on the IEMOCAP dataset suggest that: (1) multimodal methods that combine acoustic and lexical features outperform their unimodal counterparts; (2) the lexical modality is better for predicting valence than the acoustic modality; (3) outer-product-based pooling strategies outperform other pooling strategies.
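For concreteness, a short sketch of the simpler pooling operations compared in the paper, assuming same-dimensional acoustic and lexical utterance embeddings (compact bilinear pooling is omitted):

```python
import torch

acoustic = torch.randn(32, 128)   # assumed utterance-level acoustic representation
lexical = torch.randn(32, 128)    # assumed utterance-level lexical representation

concat = torch.cat([acoustic, lexical], dim=1)        # (32, 256) concatenation
added = acoustic + lexical                            # (32, 128) element-wise addition
multiplied = acoustic * lexical                       # (32, 128) element-wise multiplication
outer = torch.einsum("bi,bj->bij", acoustic, lexical) # (32, 128, 128) outer product
outer_flat = outer.flatten(start_dim=1)               # (32, 16384), fed to the valence classifier
```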

Duc Le, Zakariah Aldeneh, and Emily Mower Provost. “Discretized Continuous Speech Emotion Recognition with Multi-Task Deep Recurrent Neural Network.” Interspeech. Stockholm, Sweden, August 2017.
Overall network architecture. Abstract: Estimating continuous emotional states from speech as a function of time has traditionally been framed as a regression problem. In this paper, we present a novel approach that moves the problem into the classification domain by discretizing the training labels at different resolutions. We employ a multi-task deep bidirectional long-short term memory (BLSTM) recurrent neural network (RNN) trained with cost-sensitive Cross Entropy loss to model these labels jointly. We introduce an emotion decoding algorithm that incorporates long- and short-term temporal properties of the signal to produce more robust time series estimates. We show that our proposed approach achieves competitive audio-only performance on the RECOLA dataset, relative to previously published works as well as other strong regression baselines. This work provides a link between regression and classification, and contributes an alternative approach for continuous emotion recognition.
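A minimal sketch of the discretization step, assuming frame-level continuous labels in [-1, 1]; each resolution would become one classification task for the multi-task network.

```python
import numpy as np

def discretize(labels, num_bins):
    """Map continuous labels in [-1, 1] to integer class indices 0..num_bins-1."""
    edges = np.linspace(-1.0, 1.0, num_bins + 1)
    # np.digitize returns 1..num_bins for in-range values; shift to 0-based and clip the endpoints
    return np.clip(np.digitize(labels, edges) - 1, 0, num_bins - 1)

frame_labels = np.array([-0.9, -0.2, 0.05, 0.4, 0.95])               # hypothetical arousal trace
tasks = {bins: discretize(frame_labels, bins) for bins in (2, 4, 8)}
# tasks[4] -> array([0, 1, 2, 2, 3]); each resolution drives a separate output of the BLSTM
```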

John Gideon, Soheil Khorram, Zakariah Aldeneh, Dimitrios Dimitriadis, and Emily Mower Provost. “Progressive Neural Networks for Transfer Learning in Emotion Recognition.” Interspeech. Stockholm, Sweden, August 2017.
Progressive Neural Network (ProgNet) used in our experiments. The arrows represent dense connections between each layer. The black arrows show frozen weights from the transferred representations. The number of outputs (N) varies depending on the experiment. Abstract: Many paralinguistic tasks are closely related and thus representations learned in one domain can be leveraged for another. In this paper, we investigate how knowledge can be transferred between three paralinguistic tasks: speaker, emotion, and gender recognition. Further, we extend this problem to cross-dataset tasks, asking how knowledge captured in one emotion dataset can be transferred to another. We focus on progressive neural networks and compare these networks to the conventional deep learning method of pre-training and fine-tuning. Progressive neural networks provide a way to transfer knowledge and avoid the forgetting effect present when pre-training neural networks on different tasks. Our experiments demonstrate that: (1) emotion recognition can benefit from using representations originally learned for different paralinguistic tasks and (2) transfer learning can effectively leverage additional datasets to improve the performance of emotion recognition systems.
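A compact sketch of the progressive-network idea under simplifying assumptions: fully connected layers, a single frozen source column, and one lateral connection. The layer sizes and the source column itself are hypothetical.

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """New target-task column that receives lateral input from a frozen source column."""
    def __init__(self, source, in_dim=40, hidden=256, num_out=4):
        super().__init__()
        self.source = source
        for p in self.source.parameters():
            p.requires_grad = False              # frozen transferred weights (black arrows)
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.lateral = nn.Linear(hidden, hidden) # adapter from the source column's hidden layer
        self.out = nn.Linear(hidden, num_out)    # N outputs, task-dependent

    def forward(self, x):
        with torch.no_grad():
            src_h = torch.relu(self.source.fc1(x))          # representation from the frozen column
        h = torch.relu(self.fc1(x))
        h = torch.relu(self.fc2(h) + self.lateral(src_h))   # combine new and transferred features
        return self.out(h)

# Hypothetical frozen source column (e.g., trained for speaker recognition)
source = nn.Module()
source.fc1 = nn.Linear(40, 256)
model = ProgressiveColumn(source)
logits = model(torch.randn(8, 40))   # (8, 4) emotion logits
```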

Soheil Khorram, Zakariah Aldeneh, Dimitrios Dimitriadis, Melvin McInnis, and Emily Mower Provost. “Capturing Long-term Temporal Dependencies with Convolutional Networks for Continuous Emotion.” Interspeech. Stockholm, Sweden, August 2017.
A visualization of our dilated convolution network. We use convolutions with a different dilation factor for different layers. We use a 1 × 1 convolution for the last layer to produce the final output. Abstract: The goal of continuous emotion recognition is to assign an emotion value to every frame in a sequence of acoustic features. We show that incorporating long-term temporal dependencies is critical for continuous emotion recognition tasks. To this end, we first investigate architectures that use dilated convolutions. We show that even though such architectures outperform previously reported systems, the output signals produced from such architectures undergo erratic changes between consecutive time steps. This is inconsistent with the slow moving ground-truth emotion labels that are obtained from human annotators. To deal with this problem, we model a downsampled version of the input signal and then generate the output signal through upsampling. Not only does the resulting downsampling/upsampling network achieve good performance, it also generates smooth output trajectories. Our method yields the best known audio-only performance on the RECOLA dataset.
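A minimal sketch of a dilated temporal convolution stack of the kind described, ending in a 1 × 1 convolution; the downsampling/upsampling variant is not shown, and the feature and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class DilatedEmotionNet(nn.Module):
    """Stack of 1-D convolutions with growing dilation, followed by a 1x1 projection."""
    def __init__(self, num_features=40, channels=64, num_layers=4):
        super().__init__()
        layers, in_ch = [], num_features
        for i in range(num_layers):
            dilation = 2 ** i                 # 1, 2, 4, 8: widens the temporal receptive field
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 dilation=dilation, padding=dilation),
                       nn.ReLU()]
            in_ch = channels
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, 1, kernel_size=1)   # 1x1 conv: one emotion value per frame

    def forward(self, x):                # x: (batch, num_features, time)
        return self.head(self.body(x))   # (batch, 1, time) frame-level predictions

model = DilatedEmotionNet()
pred = model(torch.randn(2, 40, 500))    # e.g., 500 frames of 40-d acoustic features
```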

Zakariah Aldeneh and Emily Mower Provost. “Using Regional Saliency for Speech Emotion Recognition.” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). New Orleans, Louisiana, USA, March 2017.
Network Overview (four filters shown). Abstract: In this paper, we show that convolutional neural networks can be directly applied to temporal low-level acoustic features to identify emotionally salient regions without the need for defining or applying utterance-level statistics. We show how a convolutional neural network can be applied to minimally hand-engineered features to obtain competitive results on the IEMOCAP and MSP-IMPROV datasets. In addition, we demonstrate that, despite their common use across most categories of acoustic features, utterance-level statistics may obfuscate emotional information. Our results suggest that convolutional neural networks with Mel Filterbanks (MFBs) can be used as a replacement for classifiers that rely on features obtained from applying utterance-level statistics.
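A small sketch of the regional-saliency idea under assumed dimensions: convolutional filters slide over Mel Filterbank frames, and global max pooling over time stands in for utterance-level statistics.

```python
import torch
import torch.nn as nn

class RegionalSaliencyCNN(nn.Module):
    def __init__(self, num_mfb=40, num_filters=128, kernel_size=8, num_classes=4):
        super().__init__()
        self.conv = nn.Conv1d(num_mfb, num_filters, kernel_size)  # filters slide over time
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, x):                  # x: (batch, num_mfb, time), variable length is fine
        h = torch.relu(self.conv(x))       # (batch, num_filters, time')
        h, _ = h.max(dim=2)                # max over time: keeps the most salient region per filter
        return self.fc(h)                  # utterance-level emotion logits

model = RegionalSaliencyCNN()
logits = model(torch.randn(4, 40, 300))    # 4 utterances, 300 MFB frames each
```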

Zakaria Aldeneh and Emily Mower Provost. “You’re Not You When You’re Angry: Robust Emotion Features Emerge by Recognizing Speakers,” IEEE Transactions on Affective Computing, vol: To appear, 2021.
Reconstruction errors obtained from autoencoders trained with embeddings extracted from neutral utterances. Abstract: The robustness of an acoustic emotion recognition system hinges on first having access to features that represent an acoustic input signal. These representations should abstract extraneous low-level variations present in acoustic signals and only capture speaker characteristics relevant for emotion recognition. Previous research has demonstrated that, in other classification tasks, when large labeled datasets are available, neural networks trained on these data learn to extract robust features from the input signal. However, the datasets used for developing emotion recognition systems remain significantly smaller than those used for developing other speech systems. Thus, acoustic emotion recognition systems remain in need of robust feature representations. In this work, we study the utility of speaker embeddings, representations extracted from a trained speaker recognition network, as robust features for detecting emotions. We first study the relationship between emotions and speaker embeddings and demonstrate how speaker embeddings highlight the differences that exist between neutral speech and emotionally expressive speech. We quantify the modulations that variations in emotional expression incur on speaker embeddings and show how these modulations are greater than those incurred from lexical variations in an utterance. Finally, we demonstrate how speaker embeddings can be used as a replacement for traditional low-level acoustic features for emotion recognition.

Zakaria Aldeneh, Matthew Perez, Emily Mower Provost. "Learning Paralinguistic Features from Audiobooks through Style Voice Conversions." Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). Mexico City, Mexico. June 2021.
An overview of the proposed Expressive Voice Conversion Autoencoder (EVoCA). The model takes two inputs, the expressive and synthetic speech samples, and outputs the reconstructed expressive speech sample. The paralinguistic encoder extracts an embedding from the expressive speech sample such that it can be used by the Voice Converter to insert paralinguistics into the synthetic speech input sample. The network is trained with an L2 loss between the generated expressive sample and the original expressive sample. Once the full model is trained, the paralinguistic encoder is disconnected and used as a general purpose paralinguistic feature extractor. Abstract: Paralinguistics, the non-lexical components of speech, play a crucial role in human-human interaction. Models designed to recognize paralinguistic information, particularly speech emotion and style, are difficult to train because of the limited labeled datasets available. In this work, we present a new framework that enables a neural network to learn to extract paralinguistic attributes from speech using data that are not annotated for emotion. We assess the utility of the learned embeddings on the downstream tasks of emotion recognition and speaking style detection, demonstrating significant improvements over surface acoustic features as well as over embeddings extracted from other unsupervised approaches. Our work enables future systems to leverage the learned embedding extractor as a separate component capable of highlighting the paralinguistic components of speech.

Robust and Generalizable Emotion Recognition

There is also a critical secondary question: how can we create emotion recognition algorithms that are functional in real-world environments? In order to reap the benefits of emotion recognition technologies, we must have systems that are robust and generalizable. Our work focuses on how we can encourage classifiers to learn to recognize emotion in contexts different from the ones in which they have been trained.

Alex Wilf and Emily Mower Provost. “Towards Noise Robust Speech Emotion Recognition Using Dynamic Layer Customization.” Affective Computing and Intelligent Interaction (ACII). Tokyo, Japan. September 2021.
Abstract: Robustness to environmental noise is important to creating automatic speech emotion recognition systems that are deployable in the real world. In this work, we experiment with two paradigms, one where we can anticipate noise sources that will be seen at test time and one where we cannot. In our first experiment, we assume that we have advance knowledge of the noise conditions that will be seen at test time. We show that we can use this knowledge to create “expert” feature encoders for each noise condition. If the noise condition is unchanging, data can be routed to a single encoder to improve robustness. However, if the noise source is variant, this paradigm is too restrictive. Instead, we introduce a new approach, dynamic layer customization (DLC), that allows the data to be dynamically routed to noise-matched encoders and then recombined. Critically, this process maintains temporal order, enabling extensions for multimodal models that generally benefit from long-term context. In our second experiment, we investigate whether partial knowledge of noise seen at test time can still be used to train systems that generalize well to unseen noise conditions using state-of-the-art domain adaptation algorithms. We find that DLC enables performance increases in both cases, highlighting the utility of mixture-of-expert approaches, domain adaptation methods and DLC to noise robust automatic speech emotion recognition.

Mimansa Jaiswal, Cristian-Paul Bara, Yuanhang Luo, Mihai Burzo, Rada Mihalcea, and Emily Mower Provost. "MuSE: a Multimodal Dataset of Stressed Emotion." Language Resources and Evaluation Conference (LREC). Marseille, France. May 2020.
Broad visual overview of recordings. Abstract: Endowing automated agents with the ability to provide support, entertainment and interaction with human beings requires sensing of the users’ affective state. These affective states are impacted by a combination of emotion inducers, current psychological state, and various contextual factors. Although emotion classification in both singular and dyadic settings is an established area, the effects of these additional factors on the production and perception of emotion is understudied. This paper presents a dataset, Multimodal Stressed Emotion (MuSE), to study the multimodal interplay between the presence of stress and expressions of affect. We describe the data collection protocol, the possible areas of use, and the annotations for the emotional content of the recordings. The paper also presents several baselines to measure the performance of multimodal features for emotion and stress classification.

Mimansa Jaiswal, Zakaria Aldeneh, Emily Mower Provost, “Using Adversarial Training to Investigate the Effect of Confounders on Multimodal Emotion Classification.” International Conference on Multimodal Interaction (ICMI). Suzhou, Jiangsu, China. October 2019.

Mimansa Jaiswal, Zakaria Aldeneh, Cristian-Paul Bara, Yuanhang Luo, Mihai Burzo, Rada Mihalcea, Emily Mower Provost. “MuSE-ing on the impact of utterance ordering on crowdsourced emotion annotations.” International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Brighton, England. May 2019.
Abstract: Emotion recognition algorithms rely on data annotated with high quality labels. However, emotion expression and perception are inherently subjective. There is generally not a single annotation that can be unambiguously declared “correct”. As a result, annotations are colored by the manner in which they were collected. In this paper, we conduct crowdsourcing experiments to investigate this impact on both the annotations themselves and on the performance of these algorithms. We focus on one critical question: the effect of context. We present a new emotion dataset, Multimodal Stressed Emotion (MuSE), and annotate the dataset using two conditions: randomized, in which annotators are presented with clips in random order, and contextualized, in which annotators are presented with clips in order. We find that contextual labeling schemes result in annotations that are more similar to a speaker’s own self-reported labels and that labels generated from randomized schemes are most easily predictable by automated systems.

John Gideon, Melvin McInnis, Emily Mower Provost. "Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG)," IEEE Transactions on Affective Computing, vol:12, issue:4, Oct.-Dec., 2019.
Adversarial Discriminative Domain Generalization (ADDoG) Network. Consists of three main parts: (1) the feature encoder; (2) emotion classifier; (3) critic. The critic learns to estimate the earth mover’s or Wasserstein distance between the SRC and TAR dataset encoded feature representations. The emotion classifier ensures that valence is also preserved in the generalized representation. Abstract: Automatic speech emotion recognition provides computers with critical context to enable user understanding. While methods trained and tested within the same dataset have been shown successful, they often fail when applied to unseen datasets. To address this, recent work has focused on adversarial methods to find more generalized representations of emotional speech. However, many of these methods have issues converging, and only involve datasets collected in laboratory conditions. In this paper, we introduce Adversarial Discriminative Domain Generalization (ADDoG), which follows an easier to train “meet in the middle” approach. The model iteratively moves representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce Multiclass ADDoG, or MADDoG, which is able to extend the proposed method to more than two datasets, simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when not using labels from the target dataset. We also show how, in most cases, ADDoG and MADDoG can be used to improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Even though our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings.
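A heavily simplified sketch of the adversarial setup, closer to a generic Wasserstein-critic domain-adaptation step than to the full iterative ADDoG/MADDoG procedure; module sizes are assumptions, and critic regularization (weight clipping or a gradient penalty) plus the alternating optimizer steps are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the three parts in the figure: encoder, emotion classifier, critic
encoder = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 64))
emotion_clf = nn.Linear(64, 3)     # e.g., low/mid/high valence
critic = nn.Linear(64, 1)          # scores used to estimate the Wasserstein distance

src_x, src_y = torch.randn(32, 40), torch.randint(0, 3, (32,))   # labeled SRC batch
tar_x = torch.randn(32, 40)                                      # unlabeled TAR batch

src_z, tar_z = encoder(src_x), encoder(tar_x)

# Critic step: widen the gap between mean critic scores (the Wasserstein estimate)
critic_loss = -(critic(src_z.detach()).mean() - critic(tar_z.detach()).mean())

# Encoder step: classify emotion on SRC while shrinking the estimated distance,
# pulling the two datasets' representations toward one another
emotion_loss = F.cross_entropy(emotion_clf(src_z), src_y)
distance = critic(src_z).mean() - critic(tar_z).mean()
encoder_loss = emotion_loss + distance
```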

Biqiao (Didi) Zhang, Emily Mower Provost, and Georg Essl. "Cross-corpus Acoustic Emotion Recognition with Multi-task Learning: Seeking Common Ground while Preserving Differences," IEEE Transactions on Affective Computing, vol: To appear, 2017.
Abstract: There is growing interest in emotion recognition due to its potential in many applications. However, a pervasive challenge is the presence of data variability caused by factors such as differences across corpora, speaker’s gender, and the “domain” of expression (e.g., whether the expression is spoken or sung). Prior work has addressed this challenge by combining data across corpora and/or genders, or by explicitly controlling for these factors. In this work, we investigate the influence of corpus, domain, and gender on the cross-corpus generalizability of emotion recognition systems. We use a multi-task learning approach, where we define the tasks according to these factors. We find that incorporating variability caused by corpus, domain, and gender through multi-task learning outperforms approaches that treat the tasks as either identical or independent. Domain is a larger differentiating factor than gender for multi-domain data. When considering only the speech domain, gender and corpus are similarly influential. Defining tasks by gender is more beneficial than by either corpus or corpus and gender for valence, while the opposite holds for activation. On average, cross-corpus performance increases with the number of training corpora. The results demonstrate that effective cross-corpus modeling requires that we understand how emotion expression patterns change as a function of non-emotional factors.

Privacy in Emotion Recognition

An emotion recognition algorithm that does not protect its users' privacy can have dire consequences. These range from the risk of demographic information being sensed and recorded without a user's consent to biases that result from a model's unequal performance across demographic groups. An emerging line of our work focuses on how to design classifiers that mask demographic information and increase model generalizability across different groups of users.

Mimansa Jaiswal and Emily Mower Provost, "Privacy Enhanced Multimodal Neural Representations for Emotion Recognition," AAAI. New York, New York. February 2020.

Time in Emotion Recognition

Emotion data are labeled either statically or dynamically. When emotion is labeled statically, a single sentence (or unit of speech) is assigned a label that is believed to capture the entirety of that sample. The goal of the classifier is then to predict that single label. When emotion is labeled dynamically, a human evaluator (or annotator) is asked to continuously adjust their rating of the emotion present. This provides a time-continuous description of emotion. Classifiers are then trained to predict these dynamic ratings. However, different annotators behave very differently when evaluating a given emotional display. These differences must be considered when the evaluations from multiple annotators are used.

Soheil Khorram, Melvin McInnis, Emily Mower Provost. "Jointly Aligning and Predicting Continuous Emotion Annotations," IEEE Transactions on Affective Computing, vol: To appear, 2019.
A visualization of our multi-delay sinc network with M clusters. τm is the delay considered for the m-th component. Standard and sinc kernels are shown in black and red colors, respectively. In this figure, we show the structure of the first and the last clusters. Abstract: Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction-time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network (multi-delay sinc network) that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.
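A minimal sketch of a delayed sinc layer as described: a low-pass sinc kernel whose time shift is a learnable parameter, so the delay can be fit by gradient descent. The kernel length and cutoff are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DelayedSinc(nn.Module):
    """Low-pass sinc filter with a learnable delay (in frames)."""
    def __init__(self, kernel_size=81, cutoff=0.1, init_delay=0.0):
        super().__init__()
        self.delay = nn.Parameter(torch.tensor(init_delay))  # learned time shift
        self.cutoff = cutoff                                  # normalized cutoff frequency
        self.register_buffer("t", torch.arange(kernel_size) - kernel_size // 2)

    def forward(self, x):                          # x: (batch, 1, time) predicted emotion trace
        # The kernel is a differentiable function of the delay, so gradients reach it
        kernel = self.cutoff * torch.sinc(self.cutoff * (self.t - self.delay))
        return F.conv1d(x, kernel.view(1, 1, -1), padding=kernel.numel() // 2)

layer = DelayedSinc(init_delay=4.0)
aligned = layer(torch.randn(2, 1, 500))            # shifted and smoothed 500-frame prediction
```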

Soheil Khorram, Melvin McInnis, and Emily Mower Provost. “Trainable Time Warping: aligning time-series in the continuous-time domain.” International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Brighton, England. May 2019.
Abstract: DTW calculates the similarity or alignment between two signals, subject to temporal warping. However, its computational complexity grows exponentially with the number of time-series. Although there have been algorithms developed that are linear in the number of time-series, they are generally quadratic in time-series length. The exception is generalized time warping (GTW), which has linear computational cost. Yet, it can only identify simple time warping functions. There is a need for a new fast, high-quality multi-sequence alignment algorithm. We introduce trainable time warping (TTW), whose complexity is linear in both the number and the length of time-series. TTW performs alignment in the continuous-time domain using a sinc convolutional kernel and a gradient-based optimization technique. We compare TTW and GTW on 85 UCR datasets in time-series averaging and classification. TTW outperforms GTW on 67.1% of the datasets for the averaging tasks, and 61.2% of the datasets for the classification tasks.

Self-Reported Emotion Recognition

The majority of research in the field of emotion recognition is focused on estimating how a group of outside observers would perceive an emotional display. This is convenient from a machine learning perspective: it allows us to sidestep the challenge of guessing how a given speaker is truly feeling and instead quantify the emotion present in their outward displays of behavior. However, often, particularly in mental health modeling, this isn't what we need. What we need is to understand how an individual interprets their own emotion. In this line of work, we investigate the feasibility of self-reported emotion recognition and effective methods to estimate these types of labels.

Biqiao Zhang and Emily Mower Provost. "Automatic Recognition of Self-Reported and Perceived Emotions." Multimodal Behavior Analysis in the Wild. Academic Press, 2019. 443-470.
Abstract: Emotion is an essential component in our interactions with others. It transmits information that helps us interpret the meaning behind an individual’s behavior. The goal of automatic emotion recognition is to provide this information, distilling emotion from behavioral data. Yet, emotion may be defined in multiple manners: recognition of a person’s true felt sense, how that person believes his/her behavior will be interpreted, or how others actually do interpret that person’s behavior. The selection of a definition fundamentally impacts system design, behavior, and performance. The goal of this chapter is to provide an overview of the theories, resources, and ongoing research related to automatic emotion recognition that considers multiple definitions of emotion.

Biqiao Zhang, Georg Essl, and Emily Mower Provost. "Automatic Recognition of Self-Reported and Perceived Emotion: Does Joint Modeling Help?" International Conference on Multimodal Interaction (ICMI). Tokyo, Japan, November 2016.
[Note: full paper, oral presentation, best paper honorable mention]
Abstract: Emotion labeling is a central component of automatic emotion recognition. Evaluators are asked to estimate the emotion label given a set of cues, produced either by themselves (self-report label) or others (perceived label). This process is complicated by the mismatch between the intentions of the producer and the interpretation of the perceiver. Traditionally, emotion recognition systems use only one of these types of labels when estimating the emotion content of data. In this paper, we explore the impact of jointly modeling both an individual’s self-report and the perceived label of others. We use deep belief networks (DBN) to learn a representative feature space, and model the potentially complementary relationship between intention and perception using multi-task learning. We hypothesize that the use of DBN feature-learning and multi-task learning of self-report and perceived emotion labels will improve the performance of emotion recognition systems. We test this hypothesis on the IEMOCAP dataset, an audio-visual and motion-capture emotion corpus. We show that both DBN feature learning and multi-task learning offer complementary gains. The results demonstrate that the perceived emotion tasks see greatest performance gain for emotionally subtle utterances, while the self-report emotion tasks see greatest performance gain for emotionally clear utterances. Our results suggest that the combination of knowledge from the self-report and perceived emotion labels lead to more effective emotion recognition systems.




Mental Health Modeling

Our speech, both its language and its acoustics, provides critical insight into our well-being. In this line of work, we ask how we can design speech-centered approaches to determine the level of symptom severity for individuals with bipolar disorder and to identify risk factors for individuals at risk of suicide.

We discuss our research under the umbrella of PRIORI (Predicting Individual Outcomes for Rapid Intervention). The PRIORI project asks how natural collections of speech can be used to intuit changes in mental health symptom severity. The original version of PRIORI was a phone-based app that recorded one side of conversational telephone speech data. This app was used to collect audio data from both individuals with bipolar disorder and individuals at risk for suicide.

This work is part of a long-running collaboration with Dr. Melvin McInnis at the Prechter Bipolar Research Program (Depression Center, University of Michigan) and Dr. Heather Schatten at Brown University.

Katie Matton, Melvin G McInnis, Emily Mower Provost, “Into the Wild: Transitioning from Recognizing Mood in Clinical Interactions to Personal Conversations for Individuals with Bipolar Disorder.” Interspeech. Graz, Austria. September 2019.

Zakaria Aldeneh, Mimansa Jaiswal, Michael Picheny, Melvin McInnis, Emily Mower Provost. “Identifying Mood Episodes Using Dialogue Features from Clinical Interviews.” Interspeech. Graz, Austria. September 2019.
Abstract: Bipolar disorder, a severe chronic mental illness characterized by pathological mood swings from depression to mania, requires ongoing symptom severity tracking to both guide and measure treatments that are critical for maintaining long-term health. Mental health professionals assess symptom severity through semi-structured clinical interviews. During these interviews, they observe their patients’ spoken behaviors, including both what the patients say and how they say it. In this work, we move beyond acoustic and lexical information, investigating how higher-level interactive patterns also change during mood episodes. We then perform a secondary analysis, asking if these interactive patterns, measured through dialogue features, can be used in conjunction with acoustic features to automatically recognize mood episodes. Our results show that it is beneficial to consider dialogue features when analyzing and building automated systems for predicting and monitoring mood.

Soheil Khorram, Mimansa Jaiswal, John Gideon, Melvin McInnis, Emily Mower Provost. “The PRIORI Emotion Dataset: Linking Mood to Emotion Detected In-the-Wild.” Interspeech. Hyderabad, India. September 2018.
Abstract: Bipolar Disorder is a chronic psychiatric illness characterized by pathological mood swings associated with severe disruptions in emotion regulation. Clinical monitoring of mood is key to the care of these dynamic and incapacitating mood states. Frequent and detailed monitoring improves clinical sensitivity to detect mood state changes, but typically requires costly and limited resources. Speech characteristics change during both depressed and manic states, suggesting automatic methods applied to the speech signal can be effectively used to monitor mood state changes. However, speech is modulated by many factors, which renders mood state prediction challenging. We hypothesize that emotion can be used as an intermediary step to improve mood state prediction. This paper presents critical steps in developing this pipeline, including (1) a new in the wild emotion dataset, the PRIORI Emotion Dataset, collected from everyday smartphone conversational speech recordings, (2) activation/valence emotion recognition baselines on this dataset (PCC of 0.71 and 0.41, respectively), and (3) significant correlation between predicted emotion and mood state for individuals with bipolar disorder. This provides evidence and a working baseline for the use of emotion as a meta-feature for mood state monitoring.

Soheil Khorram, John Gideon, Melvin McInnis, and Emily Mower Provost. "Recognition of Depression in Bipolar Disorder: Leveraging Cohort and Person-Specific Knowledge." Interspeech. San Francisco, CA, September 2016.
Abstract: Individuals with bipolar disorder typically exhibit changes in the acoustics of their speech. Mobile health systems seek to model these changes to automatically detect and correctly identify current states in an individual and to ultimately predict impending mood episodes. We have developed a program, PRIORI (Predicting Individual Outcomes for Rapid Intervention), that analyzes acoustics of speech as predictors of mood states from mobile smartphone data. Mood prediction systems generally assume that the symptomatology of an individual can be modeled using patterns common in a cohort population due to limitations in the size of available datasets. However, individuals are unique. This paper explores person-level systems that can be developed from the current PRIORI database of an extensive and longitudinal collection composed of two subsets: a smaller labeled portion and a larger unlabeled portion. The person-level system employs the unlabeled portion to extract i-vectors, which characterize single individuals. The labeled portion is then used to train person-level and population-level supervised classifiers, operating on the i-vectors and on speech rhythm statistics, respectively. The unification of these two approaches results in a significant improvement over the baseline system, demonstrating the importance of a multi-level approach to capturing depression symptomatology.

John Gideon, Emily Mower Provost, Melvin McInnis. “Mood State Prediction From Speech Of Varying Acoustic Quality For Individuals With Bipolar Disorder.” International Conference on Acoustics, Speech and Signal Processing (ICASSP). Shanghai, China, March 2016.
Abstract: Speech contains patterns that can be altered by the mood of an individual. There is an increasing focus on automated and distributed methods to collect and monitor speech from large groups of patients suffering from mental health disorders. However, as the scope of these collections increases, the variability in the data also increases. This variability is due in part to the range in the quality of the devices, which in turn affects the quality of the recorded data, negatively impacting the accuracy of automatic assessment. It is necessary to mitigate variability effects in order to expand the impact of these technologies. This paper explores speech collected from phone recordings for analysis of mood in individuals with bipolar disorder. Two different phones with varying amounts of clipping, loudness, and noise are employed. We describe methodologies for use during preprocessing, feature extraction, and data modeling to correct these differences and make the devices more comparable. The results demonstrate that these pipeline modifications result in statistically significantly higher performance, which highlights the potential of distributed mental health systems.

Brian Stasak, Julien Epps, Heather T. Schatten, Ivan W. Miller, Emily Mower Provost, and Michael F. Armey. “Read Speech Voice Quality and Disfluency in Individuals with Recent Suicidal Ideation or Suicide Attempt,” Speech Communication, vol:132, pages 10-20. 2021.
Abstract: Individuals that have incurred trauma due to a suicide attempt often acquire residual health complications, such as cognitive, mood, and speech-language disorders. Due to limited access to suicidal speech audio corpora, behavioral differences in patients with a history of suicidal ideation and/or behavior have not been thoroughly examined using subjective voice quality and manual disfluency measures. In this study, we examine the Butler-Brown Read Speech (BBRS) database that includes 20 healthy controls with no history of suicidal ideation or behavior (HC group) and 226 psychiatric inpatients with recent suicidal ideation (SI group) or a recent suicide attempt (SA group). During read aloud sentence tasks, SI and SA groups reveal poorer average subjective voice quality composite ratings when compared with individuals in the HC group. In particular, the SI and SA groups exhibit average ‘grade’ and ‘roughness’ voice quality scores four to six times higher than those of the HC group. We demonstrate that manually annotated voice quality measures, converted into a low-dimensional feature vector, help to identify individuals with recent suicidal ideation and behavior from a healthy population, generating an automatic classification accuracy of up to 73%. Furthermore, our novel investigation of manual speech disfluencies (e.g., manually detected hesitations, word/phrase repeats, malapropisms, speech errors, non-self-correction) shows that inpatients in the SI and SA groups produce on average approximately twice as many hesitations and four times as many speech errors when compared with individuals in the HC group. We demonstrate automatic classification of inpatients with a suicide history from individuals with no suicide history with up to 80% accuracy using manually annotated speech disfluency features. Knowledge regarding voice quality and speech disfluency behaviors in individuals with a suicide history presented herein will lead to a better understanding of this complex phenomenon and thus contribute to the future development of new automatic speech-based suicide-risk identification systems.

John Gideon, Heather T Schatten, Melvin G McInnis, Emily Mower Provost. “Emotion Recognition from Natural Phone Conversations in Individuals With and Without Recent Suicidal Ideation.” Interspeech. Graz, Austria. September 2019.
Abstract: Suicide is a serious public health concern in the U.S., taking the lives of over 47,000 people in 2017. Early detection of suicidal ideation is key to prevention. One promising approach to symptom monitoring is suicidal speech prediction, as speech can be passively collected and may indicate changes in risk. However, directly identifying suicidal speech is difficult, as characteristics of speech can vary rapidly compared with suicidal thoughts. Suicidal ideation is also associated with emotion dysregulation. Therefore, in this work, we focus on the detection of emotion from speech and its relation to suicide. We introduce the Ecological Measurement of Affect, Speech, and Suicide (EMASS) dataset, which contains phone call recordings of individuals recently discharged from the hospital following admission for suicidal ideation or behavior, along with controls. Participants self-report their emotion periodically throughout the study. However, the dataset is relatively small and has uncertain labels. Because of this, we find that most features traditionally used for emotion classification fail. We demonstrate how outside emotion datasets can be used to generate more relevant features, making this analysis possible. Finally, we use emotion predictions to differentiate healthy controls from those with suicidal ideation, providing evidence for suicidal speech detection using emotion.




Assistive Technology

An individual's speech patterns provide insight into their physical health. Speech changes can reflect language impairments, muscular changes, and cognitive impairment. In this line of work, we ask how new speech-centered algorithms can be designed to detect changes in health.

Matthew Perez, Amrit Romana, Angela Roberts, Noelle Carlozzi, Jennifer Ann Miner, Praveen Dayalu and Emily Mower Provost. “Articulatory Coordination for Speech Motor Tracking in Huntington Disease.” Interspeech. Brno, Czech Republic. August 2021.
Abstract: Huntington Disease (HD) is a progressive disorder which often manifests in motor impairment. Motor severity (captured via motor score) is a key component in assessing overall HD severity. However, motor score evaluation involves in-clinic visits with a trained medical professional, which are expensive and not always accessible. Speech analysis provides an attractive avenue for tracking HD severity because speech is easy to collect remotely and provides insight into motor changes. HD speech is typically characterized as having irregular articulation. With this in mind, acoustic features that can capture vocal tract movement and articulatory coordination are particularly promising for characterizing motor symptom progression in HD. In this paper, we present an experiment that uses Vocal Tract Coordination (VTC) features extracted from read speech to estimate a motor score. When using an elastic-net regression model, we find that VTC features significantly outperform other acoustic features across varied-length audio segments, which highlights the effectiveness of these features for both short- and long-form reading tasks. Lastly, we analyze the F-value scores of VTC features to visualize which channels are most related to motor score. This work enables future research efforts to consider VTC features for acoustic analyses which target HD motor symptomatology tracking.

Amrit Romana, John Bandon, Noelle Carlozzi, Angela Roberts, Emily Mower Provost. “Classification of Manifest Huntington Disease using Vowel Distortion Measures.” Interspeech. Shanghai, China. October 2020.
Abstract: Huntington disease (HD) is a fatal autosomal dominant neurocognitive disorder that causes cognitive disturbances, neuropsychiatric symptoms, and impaired motor abilities (e.g., gait, speech, voice). Due to its progressive nature, HD treatment requires ongoing clinical monitoring of symptoms. Individuals with the gene mutation which causes HD may exhibit a range of speech symptoms as they progress from premanifest to manifest HD. Differentiating between premanifest and manifest HD is an important yet understudied problem, as this distinction marks the need for increased treatment. Speech-based passive monitoring has the potential to augment clinical assessments by continuously tracking manifestation symptoms. In this work we present the first demonstration of how changes in connected speech can be measured to differentiate between premanifest and manifest HD. To do so, we focus on a key speech symptom of HD: vowel distortion. We introduce a set of vowel features which we extract from connected speech. We show that our vowel features can differentiate between premanifest and manifest HD with 87% accuracy.

Matthew Perez, Wenyu Jin, Duc Le, Noelle Carlozzi, Praveen Dayalu, Angela Roberts, Emily Mower Provost. “Classification of Huntington’s Disease Using Acoustic and Lexical Features.” Interspeech. Hyderabad, India. September 2018.
Abstract: Speech is a critical biomarker for Huntington Disease (HD), with changes in speech increasing in severity as the disease progresses. Speech analyses are currently conducted using either transcriptions created manually by trained professionals or using global rating scales. Manual transcription is both expensive and time-consuming and global rating scales may lack sufficient sensitivity and fidelity. Ultimately, what is needed is an unobtrusive measure that can cheaply and continuously track disease progression. We present first steps towards the development of such a system, demonstrating the ability to automatically differentiate between healthy controls and individuals with HD using speech cues. The results provide evidence that objective analyses can be used to support clinical diagnoses, moving towards the tracking of symptomatology outside of laboratory and clinical environments.

Amrit Romana, John Bandon, Matthew Perez, Stephanie Gutierrez, Richard Richter, Angela Roberts and Emily Mower Provost. “Automatically Detecting Errors and Disfluencies in Read Speech to Predict Cognitive Impairment in People with Parkinson’s Disease.” Interspeech. Brno, Czech Republic. August 2021.
Abstract: Parkinson’s disease (PD) is a central nervous system disorder that causes motor impairment. Recent studies have found that people with PD also often suffer from cognitive impairment (CI). While a large body of work has shown that speech can be used to predict motor symptom severity in people with PD, much less has focused on cognitive symptom severity. Existing work has investigated if acoustic features, derived from speech, can be used to detect CI in people with PD. However, these acoustic features are general and are not targeted toward capturing CI. Speech errors and disfluencies provide additional insight into CI. In this study, we focus on read speech, which offers a controlled template from which we can detect errors and disfluencies, and we analyze how errors and disfluencies vary with CI. The novelty of this work is an automated pipeline, including transcription and error and disfluency detection, capable of predicting CI in people with PD. This will enable efficient analyses of how cognition modulates speech for people with PD, leading to scalable speech assessments of CI.

Matthew Perez, Zakaria Aldeneh, Emily Mower Provost. “Aphasic Speech Recognition using a Mixture of Speech Intelligibility Experts.” Interspeech. Shanghai, China. October 2020.
Abstract: Robust speech recognition is a key prerequisite for semantic feature extraction in automatic aphasic speech analysis. However, standard one-size-fits-all automatic speech recognition models perform poorly when applied to aphasic speech. One reason for this is the wide range of speech intelligibility due to different levels of severity (i.e., higher severity lends itself to less intelligible speech). To address this, we propose a novel acoustic model based on a mixture of experts (MoE), which handles the varying intelligibility stages present in aphasic speech by explicitly defining severity-based experts. At test time, the contribution of each expert is decided by estimating speech intelligibility with a speech intelligibility detector (SID). We show that our proposed approach significantly reduces phone error rates across all severity stages in aphasic speech compared to a baseline approach that does not incorporate severity information into the modeling process.

Duc Le, Keli Licata, and Emily Mower Provost. "Automatic Quantitative Analysis of Spontaneous Aphasic Speech," Speech Communication, vol: To appear, 2018.
Abstract: Spontaneous speech analysis plays an important role in the study and treatment of aphasia, but can be difficult to perform manually due to the time consuming nature of speech transcription and coding. Techniques in automatic speech recognition and assessment can potentially alleviate this problem by allowing clinicians to quickly process large amount of speech data. However, automatic analysis of spontaneous aphasic speech has been relatively under-explored in the engineering literature, partly due to the limited amount of available data and difficulties associated with aphasic speech processing. In this work, we perform one of the first large-scale quantitative analysis of spontaneous aphasic speech based on automatic speech recognition (ASR) output. We describe our acoustic modeling method that sets a new recognition benchmark on AphasiaBank, a large-scale aphasic speech corpus. We propose a set of clinically-relevant quantitative measures that are shown to be highly robust to automatic transcription errors. Finally, we demonstrate that these measures can be used to accurately predict the revised Western Aphasia Battery (WAB-R) Aphasia Quotient (AQ) without the need for manual transcripts. The results and techniques presented in our work will help advance the state-of-the-art in aphasic speech processing and make ASR-based technology for aphasia treatment more feasible in real-world clinical applications.

Duc Le, Keli Licata, Carol Persad, and Emily Mower Provost. "Automatic Assessment of Speech Intelligibility for Individuals with Aphasia," IEEE Transactions on Audio, Speech, and Language Processing, vol: 24, no: 11, Nov. 2016.
Abstract: Traditional in-person therapy may be difficult to access for individuals with aphasia due to the shortage of speech-language pathologists and high treatment cost. Computerized exercises offer a promising low-cost and constantly accessible supplement to in-person therapy. Unfortunately, the lack of feedback for verbal expression in existing programs hinders the applicability and effectiveness of this form of treatment. A prerequisite for producing meaningful feedback is speech intelligibility assessment. In this work, we investigate the feasibility of an automated system to assess three aspects of aphasic speech intelligibility: clarity, fluidity, and prosody. We introduce our aphasic speech corpus, which contains speech-based interaction between individuals with aphasia and a tablet-based application designed for therapeutic purposes. We present our method for eliciting reliable ground-truth labels for speech intelligibility based on the perceptual judgment of nonexpert human evaluators. We describe and analyze our feature set engineered for capturing pronunciation, rhythm, and intonation. We investigate the classification performance of our system under two conditions, one using human-labeled transcripts to drive feature extraction, and another using transcripts generated automatically. We show that some aspects of aphasic speech intelligibility can be estimated at human-level performance. Our results demonstrate the potential for the computerized treatment of aphasia and lay the groundwork for bridging the gap between human and automatic intelligibility assessment.

Duc Le, Keli Licata, and Emily Mower Provost. “Automatic Paraphasia Detection from Aphasic Speech: A Preliminary Study.” Interspeech. Stockholm, Sweden, August 2017.
Abstract: Aphasia is an acquired language disorder resulting from brain damage that can cause significant communication difficulties. Aphasic speech is often characterized by errors known as paraphasias, the analysis of which can be used to determine an appropriate course of treatment and to track an individual’s recovery progress. Being able to detect paraphasias automatically has many potential clinical benefits; however, this problem has not previously been investigated in the literature. In this paper, we perform the first study on detecting phonemic and neologistic paraphasias from scripted speech samples in AphasiaBank. We propose a speech recognition system with task-specific language models to transcribe aphasic speech automatically. We investigate features based on speech duration, Goodness of Pronunciation, phone edit distance, and Dynamic Time Warping on phoneme posteriorgrams. Our results demonstrate the feasibility of automatic paraphasia detection and outline the path toward enabling this system in real-world clinical applications.

Duc Le and Emily Mower Provost. "Improving Automatic Recognition of Aphasic Speech with AphasiaBank." Interspeech. San Francisco, CA, September 2016.
