The research on this page is organized into focus areas. Please click on the blue boxes in each section to see a subset of relevant papers published since 2016.
Emotion is often subtle and ambiguous. Yet, it is central to our communication and provides critical information about our wellness. Consequently, there has been extensive work asking how emotion can be automatically detected from speech samples. This work has focused on classifier design (more recently, network architectures) and feature engineering (or feature learning).
Abstract: In this paper, we propose a Deep Metric Learning (DML) approach that supports soft labels. DML seeks to learn representations that encode the similarity between examples through deep neural networks. DML generally presupposes that data can be divided into discrete classes using hard labels. However, some tasks, such as our exemplary domain of speech emotion recognition (SER), work with inherently subjective data, data for which it may not be possible to identify a single hard label. We propose a family of loss functions, f-Similarity Preservation Loss (f-SPL), based on the dual form of f-divergence for DML with soft labels. We show that the minimizer of f-SPL preserves the pairwise label similarities in the learned feature embeddings. We demonstrate the efficacy of the proposed loss function on the task of cross-corpus SER with soft labels. Our approach, which combines f-SPL and classification loss, significantly outperforms a baseline SER system with the same structure but trained with only classification loss in most experiments. We show that the presented techniques are more robust to over-training and can learn an embedding space in which the similarity between examples is meaningful.
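As a rough illustration of the underlying idea, the sketch below pairs pairwise soft-label similarities with pairwise embedding similarities through a simple squared-error penalty. It is not the paper's f-divergence dual formulation of f-SPL; the Bhattacharyya-style label similarity and cosine embedding similarity are assumptions chosen for clarity.

```python
import torch
import torch.nn.functional as F

def similarity_preservation_loss(embeddings, soft_labels):
    """Encourage pairwise embedding similarities to match pairwise label similarities.

    embeddings:  (B, D) learned features
    soft_labels: (B, C) per-example soft label distributions (rows sum to 1)

    NOTE: a simplified squared-error surrogate for illustration only, not the
    exact f-divergence dual formulation (f-SPL) proposed in the paper.
    """
    # Pairwise similarity between soft-label distributions (Bhattacharyya coefficient).
    label_sim = torch.sqrt(soft_labels) @ torch.sqrt(soft_labels).t()   # (B, B), in [0, 1]
    # Pairwise cosine similarity between embeddings, rescaled to [0, 1].
    z = F.normalize(embeddings, dim=1)
    embed_sim = (z @ z.t() + 1.0) / 2.0                                  # (B, B)
    return F.mse_loss(embed_sim, label_sim)

# Usage (hypothetical): total = classification_loss + lam * similarity_preservation_loss(z, y_soft)
```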
Abstract: Emotions modulate speech acoustics as well as language. The latter influences the sequences of phonemes that are produced, which in turn further modulate the acoustics. Therefore, phonemes impact emotion recognition in two ways: (1) they introduce an additional source of variability in speech signals and (2) they provide information about the emotion expressed in speech content. Previous work in speech emotion recognition has considered (1) or (2), individually. In this paper, we investigate how we can jointly consider both factors to improve the prediction of emotional valence (positive vs. negative), and the relationship between improved prediction and the emotion elicitation process (e.g., fixed script, improvisation, natural interaction). We present a network that exploits both the acoustic and the lexical properties of phonetic information using multi-stage fusion. Our results on the IEMOCAP and MSP-Improv datasets show that our approach outperforms systems that either do not consider the influence of phonetic information or that only consider a single aspect of this influence.
Abstract:
Emotion perception is person-dependent and variable. Dimensional characterizations of emotion can capture this variability by describing emotion in terms of its properties (e.g., valence, positive vs. negative, and activation, calm vs. excited). However, in many emotion recognition systems, this variability is often considered “noise” and is attenuated by averaging across raters. Yet, inter-rater variability provides information about the subtlety or clarity of an emotional expression and can be used to describe complex emotions. In this paper, we investigate methods that can effectively capture the variability across evaluators by predicting emotion perception as a discrete probability distribution in the valence-activation space. We propose: (1) a label processing method that can generate two-dimensional discrete probability distributions of emotion from a limited number of ordinal labels; (2) a new approach that predicts the generated probabilistic distributions using dynamic audio-visual features and Convolutional Neural Networks (CNNs). Our experimental results on the MSP-IMPROV corpus suggest that the proposed approach is more effective than the conventional Support Vector Regressions (SVRs) approach with utterance-level statistical features, and that feature-level fusion of the audio and video modalities outperforms decision-level fusion. The proposed CNN model predominantly improves the prediction accuracy for the valence dimension and brings a consistent performance improvement over data recorded from natural interactions. The results demonstrate the effectiveness of generating emotion distributions from a limited number of labels and predicting the distribution using dynamic features and neural networks.
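The sketch below illustrates one simple way a handful of ordinal (valence, activation) ratings could be turned into a two-dimensional discrete distribution. It is a kernel-smoothing stand-in written for illustration; the grid size and bandwidth are assumptions, not the paper's label processing method.

```python
import numpy as np

def ordinal_labels_to_distribution(ratings, grid_size=5, bandwidth=0.75):
    """Smooth a few ordinal (valence, activation) ratings into a grid_size x grid_size
    discrete probability distribution over the valence-activation space.

    ratings: list of (valence, activation) pairs on a 1..grid_size ordinal scale.
    NOTE: an illustrative kernel-smoothing stand-in, not the paper's exact method.
    """
    axis = np.arange(1, grid_size + 1, dtype=float)
    vv, aa = np.meshgrid(axis, axis, indexing="ij")
    density = np.zeros((grid_size, grid_size))
    for v, a in ratings:
        density += np.exp(-((vv - v) ** 2 + (aa - a) ** 2) / (2 * bandwidth ** 2))
    return density / density.sum()

# e.g. three evaluators rated (valence, activation) as (4, 2), (5, 2), (4, 3):
p = ordinal_labels_to_distribution([(4, 2), (5, 2), (4, 3)])   # 5x5 grid, sums to 1
```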
Abstract:
In this paper, we present an analysis of different multimodal fusion approaches in the context of deep learning, focusing on pooling intermediate representations learned for the acoustic and lexical modalities. Traditional approaches to multimodal feature pooling include: concatenation, element-wise addition, and element-wise multiplication. We compare these traditional methods to outer-product and compact bilinear pooling approaches, which consider more comprehensive interactions between features from the two modalities. We also study the influence of each modality on the overall performance of a multimodal system. Our experiments on the IEMOCAP dataset suggest that: (1) multimodal methods that combine acoustic and lexical features outperform their unimodal counterparts; (2) the lexical modality is better for predicting valence than the acoustic modality; (3) outer-product-based pooling strategies outperform other pooling strategies.
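For readers unfamiliar with these pooling operators, the sketch below shows minimal versions of concatenation, addition, multiplication, and outer-product pooling of two modality embeddings. Compact bilinear pooling is omitted for brevity, and the assumed equal feature dimension is an illustration choice, not a detail from the paper.

```python
import torch

def fuse(acoustic, lexical, method="concat"):
    """Pool intermediate acoustic and lexical representations, both of shape (B, D).

    NOTE: illustrative versions of the pooling operators compared in the paper.
    """
    if method == "concat":
        return torch.cat([acoustic, lexical], dim=1)             # (B, 2D)
    if method == "add":
        return acoustic + lexical                                # (B, D)
    if method == "mul":
        return acoustic * lexical                                # (B, D)
    if method == "outer":
        outer = acoustic.unsqueeze(2) * lexical.unsqueeze(1)     # (B, D, D) all pairwise products
        return outer.flatten(start_dim=1)                        # (B, D*D)
    raise ValueError(method)
```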
Abstract:
Estimating continuous emotional states from speech as a function of time has traditionally been framed as a regression problem. In this paper, we present a novel approach that moves the problem into the classification domain by discretizing the training labels at different resolutions. We employ a multi-task deep bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) trained with cost-sensitive Cross Entropy loss to model these labels jointly. We introduce an emotion decoding algorithm that incorporates long- and short-term temporal properties of the signal to produce more robust time series estimates. We show that our proposed approach achieves competitive audio-only performance on the RECOLA dataset, relative to previously published works as well as other strong regression baselines. This work provides a link between regression and classification, and contributes an alternative approach for continuous emotion recognition.
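To make the multi-resolution discretization concrete, the sketch below bins continuous labels at two different resolutions, producing the kind of class targets a multi-task model could predict jointly. The label range and bin counts are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def discretize(labels, num_bins):
    """Map continuous emotion labels in [-1, 1] to class indices at a given resolution.

    NOTE: a minimal sketch of multi-resolution label discretization; the range and
    bin edges are assumptions.
    """
    edges = np.linspace(-1.0, 1.0, num_bins + 1)
    return np.digitize(labels, edges[1:-1])          # values in 0 .. num_bins - 1

labels = np.array([-0.8, -0.1, 0.05, 0.6])
coarse = discretize(labels, num_bins=3)              # coarse targets, e.g. low / mid / high
fine = discretize(labels, num_bins=9)                # finer-grained targets for a second task
```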
Abstract:
Many paralinguistic tasks are closely related and thus representations learned in one domain can be leveraged for another. In this paper, we investigate how knowledge can be transferred between three paralinguistic tasks: speaker, emotion, and gender recognition. Further, we extend this problem to cross-dataset tasks, asking how knowledge captured in one emotion dataset can be transferred to another. We focus on progressive neural networks and compare these networks to the conventional deep learning method of pre-training and fine-tuning. Progressive neural networks provide a way to transfer knowledge and avoid the forgetting effect present when pre-training neural networks on different tasks. Our experiments demonstrate that: (1) emotion recognition can benefit from using representations originally learned for different paralinguistic tasks and (2) transfer learning can effectively leverage additional datasets to improve the performance of emotion recognition systems.
Soheil Khorram, Zakaria Aldeneh, Dimitrios Dimitriadis, Melvin McInnis, and Emily Mower Provost. “Capturing Long-term Temporal Dependencies with Convolutional Networks for Continuous Emotion Recognition.” Interspeech. Stockholm, Sweden, August 2017.
Abstract:
The goal of continuous emotion recognition is to assign an emotion value to every frame in a sequence of acoustic features. We show that incorporating long-term temporal dependencies is critical for continuous emotion recognition tasks. To this end, we first investigate architectures that use dilated convolutions. We show that even though such architectures outperform previously reported systems, the output signals produced from such architectures undergo erratic changes between consecutive time steps. This is inconsistent with the slow moving ground-truth emotion labels that are obtained from human annotators. To deal with this problem, we model a downsampled version of the input signal and then generate the output signal through upsampling. Not only does the resulting downsampling/upsampling network achieve good performance, it also generates smooth output trajectories. Our method yields the best known audio-only performance on the RECOLA dataset.
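The sketch below illustrates the downsampling/upsampling idea: a strided convolutional encoder predicts a coarse trajectory that is then interpolated back to the input frame rate, which naturally yields smoother outputs. All layer sizes, strides, and the linear upsampler are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class DownUpEmotionNet(nn.Module):
    """Predict a frame-level emotion trajectory by modeling a downsampled version of
    the input and upsampling the prediction, which encourages smooth outputs.

    NOTE: a minimal sketch under assumed layer sizes, not the published model.
    """
    def __init__(self, num_features=40, hidden=64, factor=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(num_features, hidden, kernel_size=8, stride=factor, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Conv1d(hidden, 1, kernel_size=1)

    def forward(self, x):                                  # x: (B, num_features, T)
        coarse = self.head(self.encoder(x))                # prediction at ~T / factor
        return nn.functional.interpolate(                  # upsample back to T frames
            coarse, size=x.shape[-1], mode="linear", align_corners=False)

model = DownUpEmotionNet()
trajectory = model(torch.randn(2, 40, 800))                # (2, 1, 800) smooth output
```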
Abstract:
In this paper, we show that convolutional neural networks can be directly applied to temporal low-level acoustic features to identify emotionally salient regions without the need for defining or applying utterance-level statistics. We show how a convolutional neural network can be applied to minimally hand-engineered features to obtain competitive results on the IEMOCAP and MSP-IMPROV datasets. In addition, we demonstrate that, despite their common use across most categories of acoustic features, utterance-level statistics may obfuscate emotional information. Our results suggest that convolutional neural networks with Mel Filterbanks (MFBs) can be used as a replacement for classifiers that rely on features obtained from applying utterance-level statistics.
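The sketch below shows the general pattern: convolutions over frame-level Mel filterbank features followed by a global max pool over time, so no hand-crafted utterance-level statistics are needed. The specific layer sizes are assumptions for illustration, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MFBConvNet(nn.Module):
    """Classify emotion directly from frame-level Mel filterbank (MFB) features:
    convolutions locate salient regions and global max pooling over time replaces
    utterance-level statistics.

    NOTE: layer sizes are illustrative assumptions, not the published configuration.
    """
    def __init__(self, num_mfb=40, num_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(num_mfb, 128, kernel_size=16, padding=8), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=8, padding=4), nn.ReLU(),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):                  # x: (B, num_mfb, T), variable-length T
        h = self.conv(x)                   # (B, 128, T')
        pooled, _ = h.max(dim=-1)          # global max over time
        return self.classifier(pooled)     # (B, num_classes)
```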
Abstract: The robustness of an acoustic emotion recognition system hinges on first having access to features that represent an acoustic input signal. These representations should abstract extraneous low-level variations present in acoustic signals and only capture speaker characteristics relevant for emotion recognition. Previous research has demonstrated that, in other classification tasks, when large labeled datasets are available, neural networks trained on these data learn to extract robust features from the input signal. However, the datasets used for developing emotion recognition systems remain significantly smaller than those used for developing other speech systems. Thus, acoustic emotion recognition systems remain in need of robust feature representations. In this work, we study the utility of speaker embeddings, representations extracted from a trained speaker recognition network, as robust features for detecting emotions. We first study the relationship between emotions and speaker embeddings and demonstrate how speaker embeddings highlight the differences that exist between neutral speech and emotionally expressive speech. We quantify the modulations that variations in emotional expression incur on speaker embeddings and show how these modulations are greater than those incurred from lexical variations in an utterance. Finally, we demonstrate how speaker embeddings can be used as a replacement for traditional low-level acoustic features for emotion recognition.
Abstract: Paralinguistics, the non-lexical components of speech, play a crucial role in human-human interaction. Models designed to recognize paralinguistic information, particularly speech emotion and style, are difficult to train because of the limited labeled datasets available. In this work, we present a new framework that enables a neural network to learn to extract paralinguistic attributes from speech using data that are not annotated for emotion. We assess the utility of the learned embeddings on the downstream tasks of emotion recognition and speaking style detection, demonstrating significant improvements over surface acoustic features as well as over embeddings extracted from other unsupervised approaches. Our work enables future systems to leverage the learned embedding extractor as a separate component capable of highlighting the paralinguistic components of speech.
Robust and Generalizable Emotion Recognition
There is also a critical secondary question: how can we create emotion recognition algorithms that are functional in real-world environments? In order to reap the benefits of emotion recognition technologies, we must have systems that are robust and generalizable. Our work focuses on how we can encourage classifiers to learn to recognize emotion in contexts different from the ones in which they have been trained.
Abstract: Robustness to environmental noise is important to creating automatic speech emotion recognition systems that are deployable in the real world. In this work, we experiment with two paradigms, one where we can anticipate noise sources that will be seen at test time and one where we cannot. In our first experiment, we assume that we have advance knowledge of the noise conditions that will be seen at test time. We show that we can use this knowledge to create “expert” feature encoders for each noise condition. If the noise condition is unchanging, data can be routed to a single encoder to improve robustness. However, if the noise source is variant, this paradigm is too restrictive. Instead, we introduce a new approach, dynamic layer customization (DLC), that allows the data to be dynamically routed to noise-matched encoders and then recombined. Critically, this process maintains temporal order, enabling extensions for multimodal models that generally benefit from long-term context. In our second experiment, we investigate whether partial knowledge of noise seen at test time can still be used to train systems that generalize well to unseen noise conditions using state-of-the-art domain adaptation algorithms. We find that DLC enables performance increases in both cases, highlighting the utility of mixture-of-expert approaches, domain adaptation methods and DLC to noise robust automatic speech emotion recognition.
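To give a flavor of routing data to noise-matched expert encoders and recombining the results, the sketch below uses a small noise-condition classifier to softly weight per-condition encoders. This is a simplified soft-routing stand-in, not the published DLC implementation; the class name, layer sizes, and utterance-level routing decision are all assumptions.

```python
import torch
import torch.nn as nn

class NoiseMatchedEncoders(nn.Module):
    """Route inputs to noise-condition "expert" encoders and recombine their outputs,
    weighted by the posterior of a small noise-condition classifier.

    NOTE: a simplified soft-routing sketch of the general idea, not the paper's DLC.
    """
    def __init__(self, in_dim=40, hid=64, num_conditions=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.GRU(in_dim, hid, batch_first=True) for _ in range(num_conditions)])
        self.noise_classifier = nn.Linear(in_dim, num_conditions)

    def forward(self, x):                                                    # x: (B, T, in_dim)
        # Estimate the noise condition from the mean frame of each utterance.
        weights = torch.softmax(self.noise_classifier(x.mean(dim=1)), dim=-1)   # (B, K)
        outputs = torch.stack([expert(x)[0] for expert in self.experts], dim=1)  # (B, K, T, hid)
        # Recombine expert outputs; temporal order is preserved throughout.
        return (weights[:, :, None, None] * outputs).sum(dim=1)                  # (B, T, hid)
```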
Mimansa Jaiswal, Christian-Paul Bara, Yuanhang Luo, Mihai Burzo, Rada Mihalcea, and Emily Mower Provost. “MuSE: a Multimodal Dataset of Stressed Emotion.” Language Resources and Evaluation Conference (LREC). Marseille, France, May 2020.
Abstract: Endowing automated agents with the ability to provide support, entertainment and interaction with human beings requires sensing of the users’ affective state. These affective states are impacted by a combination of emotion inducers, current psychological state, and various contextual factors. Although emotion classification in both singular and dyadic settings is an established area, the effects of these additional factors on the production and perception of emotion are understudied. This paper presents a dataset, Multimodal Stressed Emotion (MuSE), to study the multimodal interplay between the presence of stress and expressions of affect. We describe the data collection protocol, the possible areas of use, and the annotations for the emotional content of the recordings. The paper also presents several baselines to measure the performance of multimodal features for emotion and stress classification.
Abstract: Emotion recognition algorithms rely on data annotated with high quality labels. However, emotion expression and perception are inherently subjective. There is generally not a single annotation that can be unambiguously declared “correct”. As a result, annotations are colored by the manner in which they were collected. In this paper, we conduct crowdsourcing experiments to investigate this impact on both the annotations themselves and on the performance of these algorithms. We focus on one critical question: the effect of context. We present a new emotion dataset, Multimodal Stressed Emotion (MuSE), and annotate the dataset using two conditions: randomized, in which annotators are presented with clips in random order, and contextualized, in which annotators are presented with clips in order. We find that contextual labeling schemes result in annotations that are more similar to a speaker’s own self-reported labels and that labels generated from randomized schemes are most easily predictable by automated systems.
Abstract: Automatic speech emotion recognition provides computers with critical context to enable user understanding. While methods trained and tested within the same dataset have been shown successful, they often fail when applied to unseen datasets. To address this, recent work has focused on adversarial methods to find more generalized representations of emotional speech. However, many of these methods have issues converging, and only involve datasets collected in laboratory conditions. In this paper, we introduce Adversarial Discriminative Domain Generalization (ADDoG), which follows an easier to train “meet in the middle” approach. The model iteratively moves representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce Multiclass ADDoG, or MADDoG, which is able to extend the proposed method to more than two datasets, simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when not using labels from the target dataset. We also show how, in most cases, ADDoG and MADDoG can be used to improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Even though our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings.
Abstract: There is growing interest in emotion recognition due to its potential in many applications. However, a pervasive challenge is the presence of data variability caused by factors such as differences across corpora, speaker’s gender, and the “domain” of expression (e.g., whether the expression is spoken or sung). Prior work has addressed this challenge by combining data across corpora and/or genders, or by explicitly controlling for these factors. In this work, we investigate the influence of corpus, domain, and gender on the cross-corpus generalizability of emotion recognition systems. We use a multi-task learning approach, where we define the tasks according to these factors. We find that incorporating variability caused by corpus, domain, and gender through multi-task learning outperforms approaches that treat the tasks as either identical or independent. Domain is a larger differentiating factor than gender for multi-domain data. When considering only the speech domain, gender and corpus are similarly influential. Defining tasks by gender is more beneficial than by either corpus or corpus and gender for valence, while the opposite holds for activation. On average, cross-corpus performance increases with the number of training corpora. The results demonstrate that effective cross-corpus modeling requires that we understand how emotion expression patterns change as a function of non-emotional factors.
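The sketch below shows the multi-task pattern described above: a shared encoder with one emotion-prediction head per task, where tasks are defined by non-emotional factors such as corpus, domain, or gender, and each example is scored only at its own head. Sizes and the single-output (activation or valence) head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskEmotionModel(nn.Module):
    """Shared encoder with one regression head per task, where tasks are defined by
    non-emotional factors (e.g., one head per corpus, domain, or gender group).

    NOTE: a minimal sketch with assumed sizes, not the paper's exact configuration.
    """
    def __init__(self, in_dim=88, hid=128, num_tasks=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hid, 1) for _ in range(num_tasks)])

    def forward(self, x, task_ids):                                    # x: (B, in_dim), task_ids: (B,) long
        h = self.shared(x)
        preds = torch.stack([head(h) for head in self.heads], dim=1)   # (B, K, 1)
        # Select each example's own task head (loss is computed only against it).
        return preds[torch.arange(x.size(0)), task_ids].squeeze(-1)    # (B,)
```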
Privacy in Emotion Recognition
An emotion recognition algorithm that isn't secure has dire consequences for its users. The consequences range from risks of demographic information being sensed and recorded without a user's consent to biases that result from a model's unequal performance across demographic groups. An emerging line of our work focuses on how to design classifiers to mask demographic information and to increase model generalizability across different groups of users.
Abstract: Automatic speech emotion recognition provides computers with critical context to enable user understanding. While methods trained and tested within the same dataset have been shown successful, they often fail when applied to unseen datasets. To address this, recent work has focused on adversarial methods to find more generalized representations of emotional speech. However, many of these methods have issues converging, and only involve datasets collected in laboratory conditions. In this paper, we introduce Adversarial Discriminative Domain Generalization (ADDoG), which follows an easier to train “meet in the middle” approach. The model iteratively moves representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce Multiclass ADDoG, or MADDoG, which is able to extend the proposed method to more than two datasets, simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when not using labels from the target dataset. We also show how, in most cases, ADDoG and MADDoG can be used to improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Even though our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings.
Time in Emotion Recognition
Emotion data are labeled either statically or dynamically. When emotion is labeled statically, a single sentence (or unit of speech) is assigned a label that is believed to capture the entirety of that sample. The goal of the classifier is then to predict that single label. When emotion is labeled dynamically, a human evaluator (or annotator) is asked to continuously adjust their rating of the emotion present. This provides a time-continuous description of emotion. Classifiers are then trained to predict these dynamic ratings. However, different annotators behave very differently when evaluating a given emotional display. These differences must be considered when the evaluations from multiple annotators are used.
Abstract: Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction-time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network (multi-delay sinc network) that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.
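To illustrate the core building block, the sketch below implements a time-shifted low-pass (sinc) filter whose delay is a single learnable parameter that enters the kernel construction differentiably, so it can be trained with gradient descent. The kernel length, cutoff, and lack of windowing are assumptions for illustration, not the published delayed sinc layer.

```python
import torch
import torch.nn as nn

class DelayedSinc(nn.Module):
    """A time-shifted low-pass (sinc) filter with one learnable delay, applied via
    1-D convolution. The delay is part of the kernel construction, so its gradient
    is available to the optimizer.

    NOTE: a minimal sketch of the delayed sinc idea; kernel length and cutoff are
    assumptions, not the published layer.
    """
    def __init__(self, kernel_size=129, cutoff=0.05):
        super().__init__()
        self.delay = nn.Parameter(torch.zeros(1))          # delay, in frames
        self.cutoff = cutoff
        self.register_buffer("n", torch.arange(kernel_size).float() - kernel_size // 2)

    def forward(self, x):                                  # x: (B, 1, T)
        t = self.n - self.delay                            # shift the filter by the learned delay
        kernel = 2 * self.cutoff * torch.sinc(2 * self.cutoff * t)
        kernel = kernel / kernel.sum()                     # unit DC gain
        return nn.functional.conv1d(x, kernel.view(1, 1, -1), padding=kernel.numel() // 2)

layer = DelayedSinc()
aligned = layer(torch.randn(2, 1, 500))                    # (2, 1, 500) delayed, smoothed signal
```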
Abstract: Dynamic time warping (DTW) calculates the similarity or alignment between two signals, subject to temporal warping. However, its computational complexity grows exponentially with the number of time-series. Although there have been algorithms developed that are linear in the number of time-series, they are generally quadratic in time-series length. The exception is generalized time warping (GTW), which has linear computational cost. Yet, it can only identify simple time warping functions. There is a need for a new fast, high-quality multi-sequence alignment algorithm. We introduce trainable time warping (TTW), whose complexity is linear in both the number and the length of time-series. TTW performs alignment in the continuous-time domain using a sinc convolutional kernel and a gradient-based optimization technique. We compare TTW and GTW on 85 UCR datasets in time-series averaging and classification. TTW outperforms GTW on 67.1% of the datasets for the averaging tasks, and 61.2% of the datasets for the classification tasks.
Self-Reported Emotion Recognition
The majority of research in the field of emotion recognition is focused on estimating how a group of outside observers would perceive an emotional display. This is convenient from a machine learning perspective: it allows us to sidestep the challenge of guessing how a given speaker is truly feeling and instead quantify the emotion present in their outward displays of behavior. However, often, particularly in mental health modeling, this isn't what we need. What we need is to understand how an individual interprets their own emotion. In this line of work, we investigate the feasibility of self-reported emotion recognition and effective methods to estimate these types of labels.
Abstract: Emotion is an essential component in our interactions with others. It transmits information that helps us interpret the meaning behind an individual’s behavior. The goal of automatic emotion recognition is to provide this information, distilling emotion from behavioral data. Yet, emotion may be defined in multiple manners: recognition of a person’s true felt sense, how that person believes his/her behavior will be interpreted, or how others actually do interpret that person’s behavior. The selection of a definition fundamentally impacts system design, behavior, and performance. The goal of this chapter is to provide an overview of the theories, resources, and ongoing research related to automatic emotion recognition that considers multiple definitions of emotion.
Abstract: Emotion labeling is a central component of automatic emotion recognition. Evaluators are asked to estimate the emotion label given a set of cues, produced either by themselves (self-report label) or others (perceived label). This process is complicated by the mismatch between the intentions of the producer and the interpretation of the perceiver. Traditionally, emotion recognition systems use only one of these types of labels when estimating the emotion content of data. In this paper, we explore the impact of jointly modeling both an individual’s self-report and the perceived label of others. We use deep belief networks (DBN) to learn a representative feature space, and model the potentially complementary relationship between intention and perception using multi-task learning. We hypothesize that the use of DBN feature-learning and multi-task learning of self-report and perceived emotion labels will improve the performance of emotion recognition systems. We test this hypothesis on the IEMOCAP dataset, an audio-visual and motion-capture emotion corpus. We show that both DBN feature learning and multi-task learning offer complementary gains. The results demonstrate that the perceived emotion tasks see the greatest performance gain for emotionally subtle utterances, while the self-report emotion tasks see the greatest performance gain for emotionally clear utterances. Our results suggest that the combination of knowledge from the self-report and perceived emotion labels leads to more effective emotion recognition systems.
Our speech, both its language and its acoustics, provides critical insight into our well-being. In this line of work, we ask how we can design speech-centered approaches to determine the level of symptom severity for individuals with bipolar disorder and risk factors for individuals at risk of suicide.
We discuss our research under the umbrella of PRIORI (Predicting Individual Outcomes for Rapid Intervention). The PRIORI project asks how natural collections of speech can be used to infer changes in mental health symptom severity. The original version of PRIORI was a phone-based app that recorded one side of conversational telephone speech. This app was used to collect audio data from both individuals with bipolar disorder and individuals at risk for suicide.
Abstract: Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction-time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network (multi-delay sinc network) that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.
Abstract: Bipolar disorder, a severe chronic mental illness characterized by pathological mood swings from depression to mania, requires ongoing symptom severity tracking to both guide and measure treatments that are critical for maintaining long-term health. Mental health professionals assess symptom severity through semi-structured clinical interviews. During these interviews, they observe their patients’ spoken behaviors, including both what the patients say and how they say it. In this work, we move beyond acoustic and lexical information, investigating how higher-level interactive patterns also change during mood episodes. We then perform a secondary analysis, asking if these interactive patterns, measured through dialogue features, can be used in conjunction with acoustic features to automatically recognize mood episodes. Our results show that it is beneficial to consider dialogue features when analyzing and building automated systems for predicting and monitoring mood.
Abstract: Bipolar Disorder is a chronic psychiatric illness characterized by pathological mood swings associated with severe disruptions in emotion regulation. Clinical monitoring of mood is key to the care of these dynamic and incapacitating mood states. Frequent and detailed monitoring improves clinical sensitivity to detect mood state changes, but typically requires costly and limited resources. Speech characteristics change during both depressed and manic states, suggesting automatic methods applied to the speech signal can be effectively used to monitor mood state changes. However, speech is modulated by many factors, which renders mood state prediction challenging. We hypothesize that emotion can be used as an intermediary step to improve mood state prediction. This paper presents critical steps in developing this pipeline, including (1) a new in-the-wild emotion dataset, the PRIORI Emotion Dataset, collected from everyday smartphone conversational speech recordings, (2) activation/valence emotion recognition baselines on this dataset (PCC of 0.71 and 0.41, respectively), and (3) significant correlation between predicted emotion and mood state for individuals with bipolar disorder. This provides evidence and a working baseline for the use of emotion as a meta-feature for mood state monitoring.
Abstract: Individuals with bipolar disorder typically exhibit changes in the acoustics of their speech. Mobile health systems seek to model these changes to automatically detect and correctly identify current states in an individual and to ultimately predict impending mood episodes. We have developed a program, PRIORI (Predicting Individual Outcomes for Rapid Intervention), that analyzes acoustics of speech as predictors of mood states from mobile smartphone data. Mood prediction systems generally assume that the symptomatology of an individual can be modeled using patterns common in a cohort population due to limitations in the size of available datasets. However, individuals are unique. This paper explores person-level systems that can be developed from the current PRIORI database of an extensive and longitudinal collection composed of two subsets: a smaller labeled portion and a larger unlabeled portion. The person-level system employs the unlabeled portion to extract i-vectors, which characterize single individuals. The labeled portion is then used to train person-level and population-level supervised classifiers, operating on the i-vectors and on speech rhythm statistics, respectively. The unification of these two approaches results in a significant improvement over the baseline system, demonstrating the importance of a multi-level approach to capturing depression symptomatology.
Abstract:
Speech contains patterns that can be altered by the mood of an individual. There is an increasing focus on automated and distributed methods to collect and monitor speech from large groups of patients suffering from mental health disorders. However, as the scope of these collections increases, the variability in the data also increases. This variability is due in part to the range in the quality of the devices, which in turn affects the quality of the recorded data, negatively impacting the accuracy of automatic assessment. It is necessary to mitigate variability effects in order to expand the impact of these technologies. This paper explores speech collected from phone recordings for analysis of mood in individuals with bipolar disorder. Two different phones with varying amounts of clipping, loudness, and noise are employed. We describe methodologies for use during preprocessing, feature extraction, and data modeling to correct these differences and make the devices more comparable. The results demonstrate that these pipeline modifications result in statistically significantly higher performance, which highlights the potential of distributed mental health systems.
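As a small illustration of the kind of device-harmonization preprocessing described above, the sketch below flags heavily clipped recordings and normalizes loudness to a common RMS level. The thresholds and steps are assumptions, not the paper's pipeline.

```python
import numpy as np

def preprocess(signal, target_rms=0.05, clip_threshold=0.99):
    """Simple device-harmonization steps for phone-recorded speech: estimate the
    fraction of clipped samples and normalize loudness to a target RMS level.

    NOTE: an illustrative sketch of this kind of preprocessing, not the exact pipeline.
    """
    clipped_fraction = np.mean(np.abs(signal) >= clip_threshold)   # crude clipping indicator
    rms = np.sqrt(np.mean(signal ** 2)) + 1e-8
    normalized = signal * (target_rms / rms)                       # loudness normalization
    return normalized, clipped_fraction

audio = np.random.uniform(-0.3, 0.3, 16000)     # 1 s of placeholder audio at 16 kHz
normalized, clipped = preprocess(audio)
```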
Abstract: Individuals that have incurred trauma due to a suicide attempt often acquire residual health complications, such as cognitive, mood, and speech-language disorders. Due to limited access to suicidal speech audio corpora, behavioral differences in patients with a history of suicidal ideation and/or behavior have not been thoroughly examined using subjective voice quality and manual disfluency measures. In this study, we examine the Butler-Brown Read Speech (BBRS) database that includes 20 healthy controls with no history of suicidal ideation or behavior (HC group) and 226 psychiatric inpatients with recent suicidal ideation (SI group) or a recent suicide attempt (SA group). During read aloud sentence tasks, SI and SA groups reveal poorer average subjective voice quality composite ratings when compared with individuals in the HC group. In particular, the SI and SA groups exhibit average ‘grade’ and ‘roughness’ voice quality scores four to six times higher than those of the HC group. We demonstrate that manually annotated voice quality measures, converted into a low-dimensional feature vector, help to identify individuals with recent suicidal ideation and behavior from a healthy population, generating an automatic classification accuracy of up to 73%. Furthermore, our novel investigation of manual speech disfluencies (e.g., manually detected hesitations, word/phrase repeats, malapropisms, speech errors, non-self-correction) shows that inpatients in the SI and SA groups produce on average approximately twice as many hesitations and four times as many speech errors when compared with individuals in the HC group. We demonstrate automatic classification of inpatients with a suicide history from individuals with no suicide history with up to 80% accuracy using manually annotated speech disfluency features. Knowledge regarding voice quality and speech disfluency behaviors in individuals with a suicide history presented herein will lead to a better understanding of this complex phenomenon and thus contribute to the future development of new automatic speech-based suicide-risk identification systems.
Abstract: Suicide is a serious public health concern in the U.S., taking the lives of over 47,000 people in 2017. Early detection of suicidal ideation is key to prevention. One promising approach to symptom monitoring is suicidal speech prediction, as speech can be passively collected and may indicate changes in risk. However, directly identifying suicidal speech is difficult, as characteristics of speech can vary rapidly compared with suicidal thoughts. Suicidal ideation is also associated with emotion dysregulation. Therefore, in this work, we focus on the detection of emotion from speech and its relation to suicide. We introduce the Ecological Measurement of Affect, Speech, and Suicide (EMASS) dataset, which contains phone call recordings of individuals recently discharged from the hospital following admission for suicidal ideation or behavior, along with controls. Participants self-report their emotion periodically throughout the study. However, the dataset is relatively small and has uncertain labels. Because of this, we find that most features traditionally used for emotion classification fail. We demonstrate how outside emotion datasets can be used to generate more relevant features, making this analysis possible. Finally, we use emotion predictions to differentiate healthy controls from those with suicidal ideation, providing evidence for suicidal speech detection using emotion.
An individual's speech patterns provide insight into their physical health. Speech changes are reflective of language impairments, muscular changes, and cognitive impairment. In this line of work, we ask how new speech-centered algorithms can be designed to detect changes in health.
Abstract: Huntington Disease (HD) is a progressive disorder which often manifests in motor impairment. Motor severity (captured via motor score) is a key component in assessing overall HD severity. However, motor score evaluation involves in-clinic visits with a trained medical professional, which are expensive and not always accessible. Speech analysis provides an attractive avenue for tracking HD severity because speech is easy to collect remotely and provides insight into motor changes. HD speech is typically characterized as having irregular articulation. With this in mind, acoustic features that can capture vocal tract movement and articulatory coordination are particularly promising for characterizing motor symptom progression in HD. In this paper, we present an experiment that uses Vocal Tract Coordination (VTC) features extracted from read speech to estimate a motor score. When using an elastic-net regression model, we find that VTC features significantly outperform other acoustic features across varied-length audio segments, which highlights the effectiveness of these features for both short- and long-form reading tasks. Lastly, we analyze the F-value scores of VTC features to visualize which channels are most related to motor score. This work enables future research efforts to consider VTC features for acoustic analyses which target HD motor symptomatology tracking.
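The sketch below shows the modeling step in its simplest form: an elastic-net regression mapping feature vectors to a motor score. The data here are random placeholders standing in for VTC features and clinician-rated scores, and the feature dimensionality and hyperparameters are assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for vocal tract coordination (VTC) feature vectors
# and clinician-rated motor scores (dimensions and ranges are assumptions).
X = np.random.randn(100, 240)           # one VTC feature vector per audio segment
y = np.random.uniform(0, 60, 100)       # motor scores

# Elastic-net regression from features to motor score, with feature standardization.
model = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))
model.fit(X, y)
predicted_scores = model.predict(X)
```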
Abstract: Huntington disease (HD) is a fatal autosomal dominant neurocognitive disorder that causes cognitive disturbances, neuropsychiatric symptoms, and impaired motor abilities (e.g., gait, speech, voice). Due to its progressive nature, HD treatment requires ongoing clinical monitoring of symptoms. Individuals with the gene mutation which causes HD may exhibit a range of speech symptoms as they progress from premanifest to manifest HD. Differentiating between premanifest and manifest HD is an important yet understudied problem, as this distinction marks the need for increased treatment. Speech-based passive monitoring has the potential to augment clinical assessments by continuously tracking manifestation symptoms. In this work we present the first demonstration of how changes in connected speech can be measured to differentiate between premanifest and manifest HD. To do so, we focus on a key speech symptom of HD: vowel distortion. We introduce a set of vowel features which we extract from connected speech. We show that our vowel features can differentiate between premanifest and manifest HD with 87% accuracy.
Abstract: Speech is a critical biomarker for Huntington Disease (HD), with changes in speech increasing in severity as the disease progresses. Speech analyses are currently conducted using either transcriptions created manually by trained professionals or using global rating scales. Manual transcription is both expensive and time-consuming and global rating scales may lack sufficient sensitivity and fidelity. Ultimately, what is needed is an unobtrusive measure that can cheaply and continuously track disease progression. We present first steps towards the development of such a system, demonstrating the ability to automatically differentiate between healthy controls and individuals with HD using speech cues. The results provide evidence that objective analyses can be used to support clinical diagnoses, moving towards the tracking of symptomatology outside of laboratory and clinical environments.
Abstract: Parkinson’s disease (PD) is a central nervous system disorder that causes motor impairment. Recent studies have found that people with PD also often suffer from cognitive impairment (CI). While a large body of work has shown that speech can be used to predict motor symptom severity in people with PD, much less has focused on cognitive symptom severity. Existing work has investigated if acoustic features, derived from speech, can be used to detect CI in people with PD. However, these acoustic features are general and are not targeted toward capturing CI. Speech errors and disfluencies provide additional insight into CI. In this study, we focus on read speech, which offers a controlled template from which we can detect errors and disfluencies, and we analyze how errors and disfluencies vary with CI. The novelty of this work is an automated pipeline, including transcription and error and disfluency detection, capable of predicting CI in people with PD. This will enable efficient analyses of how cognition modulates speech for people with PD, leading to scalable speech assessments of CI.
Abstract: Robust speech recognition is a key prerequisite for semantic feature extraction in automatic aphasic speech analysis. However, standard one-size-fits-all automatic speech recognition models perform poorly when applied to aphasic speech. One reason for this is the wide range of speech intelligibility due to different levels of severity (i.e., higher severity lends itself to less intelligible speech). To address this, we propose a novel acoustic model based on a mixture of experts (MoE), which handles the varying intelligibility stages present in aphasic speech by explicitly defining severity-based experts. At test time, the contribution of each expert is decided by estimating speech intelligibility with a speech intelligibility detector (SID). We show that our proposed approach significantly reduces phone error rates across all severity stages in aphasic speech compared to a baseline approach that does not incorporate severity information into the modeling process.
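To illustrate the combination step in a severity-based mixture of experts, the sketch below mixes frame-level phone posteriors from several severity-specific models using weights from an intelligibility detector. It is a simplified combination sketch, not the paper's exact MoE formulation, and the numbers of experts, frames, and phones are illustrative assumptions.

```python
import numpy as np

def combine_experts(expert_posteriors, severity_weights):
    """Combine frame-level phone posteriors from severity-specific expert acoustic
    models, weighted by a speech intelligibility detector's severity posterior.

    expert_posteriors: (K, T, P) posteriors from K severity experts
    severity_weights:  (K,) weights over the K severity levels (sum to 1)
    NOTE: a simplified sketch of the combination step only.
    """
    return np.tensordot(severity_weights, expert_posteriors, axes=([0], [0]))   # (T, P)

posteriors = np.random.dirichlet(np.ones(40), size=(3, 200))   # 3 experts, 200 frames, 40 phones
weights = np.array([0.2, 0.5, 0.3])                            # hypothetical SID output
mixed = combine_experts(posteriors, weights)
```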
Abstract: Spontaneous speech analysis plays an important role in the study and treatment of aphasia, but can be difficult to perform manually due to the time consuming nature of speech transcription and coding. Techniques in automatic speech recognition and assessment can potentially alleviate this problem by allowing clinicians to quickly process large amounts of speech data. However, automatic analysis of spontaneous aphasic speech has been relatively under-explored in the engineering literature, partly due to the limited amount of available data and difficulties associated with aphasic speech processing. In this work, we perform one of the first large-scale quantitative analyses of spontaneous aphasic speech based on automatic speech recognition (ASR) output. We describe our acoustic modeling method that sets a new recognition benchmark on AphasiaBank, a large-scale aphasic speech corpus. We propose a set of clinically-relevant quantitative measures that are shown to be highly robust to automatic transcription errors. Finally, we demonstrate that these measures can be used to accurately predict the revised Western Aphasia Battery (WAB-R) Aphasia Quotient (AQ) without the need for manual transcripts. The results and techniques presented in our work will help advance the state-of-the-art in aphasic speech processing and make ASR-based technology for aphasia treatment more feasible in real-world clinical applications.
Abstract: Traditional in-person therapy may be difficult to access for individuals with aphasia due to the shortage of speech-language pathologists and high treatment cost. Computerized exercises offer a promising low-cost and constantly accessible supplement to in-person therapy. Unfortunately, the lack of feedback for verbal expression in existing programs hinders the applicability and effectiveness of this form of treatment. A prerequisite for producing meaningful feedback is speech intelligibility assessment. In this work, we investigate the feasibility of an automated system to assess three aspects of aphasic speech intelligibility: clarity, fluidity, and prosody. We introduce our aphasic speech corpus, which contains speech-based interaction between individuals with aphasia and a tablet-based application designed for therapeutic purposes. We present our method for eliciting reliable ground-truth labels for speech intelligibility based on the perceptual judgment of nonexpert human evaluators. We describe and analyze our feature set engineered for capturing pronunciation, rhythm, and intonation. We investigate the classification performance of our system under two conditions, one using human-labeled transcripts to drive feature extraction, and another using transcripts generated automatically. We show that some aspects of aphasic speech intelligibility can be estimated at human-level performance. Our results demonstrate the potential for the computerized treatment of aphasia and lay the groundwork for bridging the gap between human and automatic intelligibility assessment.
Abstract: Aphasia is an acquired language disorder resulting from brain damage that can cause significant communication difficulties. Aphasic speech is often characterized by errors known as paraphasias, the analysis of which can be used to determine an appropriate course of treatment and to track an individual’s recovery progress. Being able to detect paraphasias automatically has many potential clinical benefits; however, this problem has not previously been investigated in the literature. In this paper, we perform the first study on detecting phonemic and neologistic paraphasias from scripted speech samples in AphasiaBank. We propose a speech recognition system with task-specific language models to transcribe aphasic speech automatically. We investigate features based on speech duration, Goodness of Pronunciation, phone edit distance, and Dynamic Time Warping on phoneme posteriorgrams. Our results demonstrate the feasibility of automatic paraphasia detection and outline the path toward enabling this system in real-world clinical applications.
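One of the cues mentioned above, phone edit distance, can be computed with a standard dynamic-programming edit distance between the produced and target phone sequences, as sketched below. The ARPAbet-style phone strings in the example are illustrative only.

```python
def phone_edit_distance(produced, target):
    """Levenshtein distance between a produced phone sequence and the target phone
    sequence from the script; one of the cues used to flag possible paraphasias.

    NOTE: a standard dynamic-programming edit distance, shown for illustration.
    """
    m, n = len(produced), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if produced[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

# e.g. target "cat" /K AE T/ produced as /K AE P/ -> distance 1
print(phone_edit_distance(["K", "AE", "P"], ["K", "AE", "T"]))
```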