A New Perspective on Stress Detection: An Automated Approach for Detecting Eustress and Distress
Previous studies have focused solely on establishing Machine Learning (ML) models for automated detection of stress arousal. However, these studies do not account for stress appraisal and presume that stress is a negative mental state. Yet, stress can be classified according to its influence on individuals; the way people perceive a stressor determines whether the stress reaction is considered eustress (positive stress) or distress (negative stress). Thus, this study assesses the potential of an ML approach to determine stress appraisal and identify instances of eustress and distress using physiological and behavioral features. The results indicate that distress leads to higher perceived stress arousal than eustress. An XGBoost model that combined physiological and behavioral features over a 30-second time window achieved F-scores of 83.38% and 78.79% for predicting eustress and distress, respectively. Gender-based models yielded an average increase of 2-4% in eustress and distress prediction accuracy. Finally, a model predicting the simultaneous assessment of eustress and distress, distinguishing between pure eustress, pure distress, eustress-distress coexistence, and the absence of stress, achieved a moderate F-score of 65.12%. The results of this study lay the foundation for work management interventions that maximize eustress and minimize distress in the workplace.
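A minimal sketch of the kind of pipeline the abstract describes: an XGBoost classifier trained on features aggregated over 30-second windows. The feature names, windowing scheme, and synthetic data below are illustrative assumptions, not the study's actual setup.

```python
# Hypothetical eustress classifier on 30-second windowed features (sketch only).
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_windows = 500
# Placeholder per-window features: physiological (heart rate, EDA) and
# behavioral (keystroke rate), each already aggregated over a 30 s window.
X = pd.DataFrame({
    "hr_mean": rng.normal(75, 10, n_windows),
    "eda_mean": rng.normal(2.0, 0.5, n_windows),
    "keystrokes_per_min": rng.normal(180, 40, n_windows),
})
y = rng.integers(0, 2, n_windows)  # 1 = eustress reported in this window

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_tr, y_tr)
print("F1:", f1_score(y_te, model.predict(X_te)))
```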
Multimodal Feature Selection for Detecting Mothers' Depression in Dyadic Interactions with their Adolescent Offspring
Depression is the most common psychological disorder, a leading cause of disability worldwide, and a major contributor to inter-generational transmission of psychopathology within families. To contribute to our understanding of depression within families and to inform modality selection and feature reduction, it is critical to identify interpretable features in developmentally appropriate contexts. Mothers with and without depression were studied. Depression was defined as a history of treatment for depression together with elevations in current or recent symptoms. We explored two multimodal feature selection strategies for depression detection in dyadic interaction tasks between mothers and their adolescent children. Modalities included face and head dynamics, facial action units, speech-related behavior, and verbal features. The initial feature space was vast and inter-correlated (collinear). To reduce dimensionality and gain insight into the relative contribution of each modality and feature, we explored feature selection strategies using the Variance Inflation Factor (VIF) and Shapley values. On average, collinearity correction through VIF reduced the number of features by about a factor of four across unimodal and multimodal feature sets. Collinearity correction was also found to be an optimal intermediate step prior to Shapley analysis. Shapley feature selection following VIF yielded the best performance: the top 15 features obtained through Shapley achieved 78% accuracy. The most informative features came from all four modalities sampled, which supports the importance of multimodal feature selection.
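To make the two-step selection idea concrete, here is a sketch of VIF-based collinearity pruning followed by Shapley-value ranking. The threshold, model choice, and synthetic features are assumptions for illustration, not the paper's settings.

```python
# Step 1: drop collinear features via VIF; step 2: rank survivors with SHAP.
import numpy as np
import pandas as pd
import shap
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.ensemble import RandomForestClassifier

def vif_prune(df: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Iteratively remove the feature with the highest VIF above threshold."""
    cols = list(df.columns)
    while True:
        vifs = [variance_inflation_factor(df[cols].values, i) for i in range(len(cols))]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold or len(cols) <= 2:
            return df[cols]
        cols.pop(worst)

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 10)), columns=[f"f{i}" for i in range(10)])
X["f10_dup"] = X["f0"] * 0.95 + rng.normal(scale=0.1, size=300)  # deliberately collinear
y = (X["f0"] + X["f3"] > 0).astype(int)

X_pruned = vif_prune(X)                      # collinearity correction
model = RandomForestClassifier(random_state=0).fit(X_pruned, y)
explainer = shap.TreeExplainer(model)        # Shapley-based importance
shap_values = explainer.shap_values(X_pruned)  # mean |SHAP| per feature gives the ranking
```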
Applying Probabilistic Programming to Affective Computing
Affective Computing is a rapidly growing field spurred by advancements in artificial intelligence, but it is often held back by the inability to translate psychological theories of emotion into tractable computational models. To address this, we propose a probabilistic programming approach to affective computing, which models psychologically grounded theories as generative models of emotion and implements them as stochastic, executable computer programs. We first review probabilistic approaches that integrate reasoning about emotions with reasoning about other latent mental states (e.g., beliefs, desires) in context. Recently developed probabilistic programming languages offer several key desiderata over previous approaches, such as: (i) flexibility in representing emotions and emotional processes; (ii) modularity and compositionality; (iii) integration with deep learning libraries that facilitate efficient inference and learning from large, naturalistic data; and (iv) ease of adoption. Furthermore, a probabilistic programming framework offers a standardized platform for theory-building and experimentation: competing theories (e.g., of appraisal or other emotional processes) can be easily compared via modular substitution of code followed by model comparison. To jumpstart adoption, we illustrate our points with executable code that researchers can easily modify for their own models. We end with a discussion of applications and future directions of the probabilistic programming approach.
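A toy generative appraisal model written as a probabilistic program (here in Pyro) illustrates the style of model the abstract advocates. The latent variables, priors, and observation model are invented for illustration only and are not from the paper.

```python
# Hypothetical generative model: latent mental states -> appraisal -> observed expression.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import Importance

def emotion_model(observed_expression=None):
    # Latent mental states: did the agent attain its goal, and how much does it care?
    goal_attained = pyro.sample("goal_attained", dist.Bernoulli(0.5))
    goal_importance = pyro.sample("goal_importance", dist.Beta(2.0, 2.0))
    # Appraisal: valence as a deterministic function of the latent states.
    valence = (2.0 * goal_attained - 1.0) * goal_importance
    # Noisy observation model: an observed facial-expression valence score.
    return pyro.sample("expression", dist.Normal(valence, 0.3), obs=observed_expression)

# Posterior inference over latent states given an observed expression,
# e.g. with importance sampling.
posterior = Importance(emotion_model, num_samples=1000).run(torch.tensor(0.8))
```

Swapping in a competing appraisal function amounts to editing the few lines that compute `valence`, which is the kind of modular substitution the abstract describes.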
Modeling emotion in complex stories: the Stanford Emotional Narratives Dataset
Human emotions unfold over time, and affective computing research must prioritize capturing this crucial component of real-world affect. Modeling dynamic emotional stimuli requires solving the twin challenges of time-series modeling and of collecting high-quality time-series datasets. We begin by assessing the state of the art in time-series emotion recognition, and we review contemporary time-series approaches in affective computing, including discriminative and generative models. We then introduce the first version of the Stanford Emotional Narratives Dataset (SENDv1): a set of rich, multimodal videos of self-paced, unscripted emotional narratives, annotated for emotional valence over time. The complex narratives and naturalistic expressions in this dataset provide a challenging test for contemporary time-series emotion recognition models. We demonstrate several baseline and state-of-the-art modeling approaches on the SEND, including a Long Short-Term Memory model and a multimodal Variational Recurrent Neural Network, which perform comparably to the human benchmark. We end by discussing the implications for future research in time-series affective computing.
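In the spirit of the LSTM baseline mentioned above, a minimal sketch of a recurrent model that maps a sequence of multimodal features to a valence rating per time step. The feature and hidden dimensions are placeholders, not the SENDv1 configuration.

```python
# Sketch: sequence-to-sequence valence regression with an LSTM (PyTorch).
import torch
import torch.nn as nn

class ValenceLSTM(nn.Module):
    def __init__(self, input_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)   # one valence value per time step

    def forward(self, x):                       # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)       # (batch, time)

model = ValenceLSTM()
features = torch.randn(8, 120, 64)              # 8 clips, 120 time steps (synthetic)
valence = model(features)                       # predicted valence trajectory
```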
Learning Pain from Action Unit Combinations: A Weakly Supervised Approach via Multiple Instance Learning
Patient pain can be detected highly reliably from facial expressions using a set of facial muscle-based action units (AUs) defined by the Facial Action Coding System (FACS). A key characteristic of the facial expression of pain is the simultaneous occurrence of pain-related AU combinations, whose automated detection would be highly beneficial for efficient and practical pain monitoring. Existing general Automated Facial Expression Recognition (AFER) systems prove inadequate when applied specifically to pain detection: they either focus on detecting individual pain-related AUs rather than combinations, or they bypass AU detection by training a binary pain classifier directly on pain intensity data and are limited by a lack of sufficient labeled data for satisfactory training. In this paper, we propose a new approach that mimics the strategy of human coders by decoupling pain detection into two consecutive tasks: one performed at the individual video-frame level and the other at the video-sequence level. Using state-of-the-art AFER tools to detect single AUs at the frame level, we propose two novel data structures to encode AU combinations from single AU scores. Two weakly supervised learning frameworks, namely multiple instance learning (MIL) and multiple clustered instance learning (MCIL), are employed, one for each data structure, to learn pain from video sequences. Experimental results show an 87% pain recognition accuracy with 0.94 AUC (Area Under the Curve) on the UNBC-McMaster Shoulder Pain Expression dataset. Tests on long videos in a lung cancer patient video dataset demonstrate the potential value of the proposed system for pain monitoring in clinical settings.
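A schematic illustration of the weakly supervised setup: each video sequence is a bag of frames described by pain-related AU scores, and only the sequence-level pain label is available. The AU subset, the max-pooling bag encoding, and the classifier are simplified stand-ins for the paper's data structures and MIL/MCIL frameworks.

```python
# Toy MIL-style pipeline: frame-level AU scores -> bag feature -> sequence label.
import numpy as np
from sklearn.linear_model import LogisticRegression

PAIN_AUS = ["AU4", "AU6", "AU7", "AU9", "AU10", "AU43"]   # commonly cited pain-related AUs

def frame_scores_to_bag_feature(frame_scores: np.ndarray) -> np.ndarray:
    """Encode a bag (sequence) by the per-AU maximum over its frames."""
    return frame_scores.max(axis=0)

rng = np.random.default_rng(0)
bags = [rng.random((rng.integers(30, 90), len(PAIN_AUS))) for _ in range(200)]  # synthetic sequences
labels = rng.integers(0, 2, len(bags))            # 1 = pain reported for the sequence

X = np.stack([frame_scores_to_bag_feature(b) for b in bags])
clf = LogisticRegression().fit(X, labels)         # weakly supervised: no frame labels used
```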
A Scalable Off-the-Shelf Framework for Measuring Patterns of Attention in Young Children and its Application in Autism Spectrum Disorder
Autism spectrum disorder (ASD) is associated with deficits in the processing of social information and difficulties in social interaction, and individuals with ASD exhibit atypical attention and gaze. Traditionally, gaze studies have relied upon precise and constrained means of monitoring attention using expensive equipment in laboratories. In this work, we develop a low-cost, off-the-shelf alternative for measuring attention that can be used in natural settings. The head and iris positions of 104 children aged 16-31 months, an age range appropriate for ASD screening and diagnosis, 22 of whom were diagnosed with ASD, were recorded using the front-facing camera of an iPad while they watched, on the device screen, a movie displaying dynamic stimuli: social stimuli on the left and nonsocial stimuli on the right. The head and iris positions were then automatically analyzed via computer vision algorithms to detect the direction of attention. Children in the ASD group paid less attention to the movie, showed less attention to the social as compared to the nonsocial stimuli, and often fixated their attention on one side of the screen. The proposed method provides a low-cost means of monitoring attention to properly designed stimuli, demonstrating that integrating stimulus design with automatic response analysis creates the opportunity to use off-the-shelf cameras to assess behavioral biomarkers.
Automated Classification of Dyadic Conversation Scenarios using Autonomic Nervous System Responses
Two people's physiological responses become more similar as those people talk or cooperate, a phenomenon called physiological synchrony. The degree of synchrony correlates with conversation engagement and cooperation quality, and could thus be used to characterize interpersonal interaction. In this study, we used a combination of physiological synchrony metrics and pattern recognition algorithms to automatically classify four different dyadic conversation scenarios: two-sided positive conversation, two-sided negative conversation, and two one-sided scenarios. Heart rate, skin conductance, respiration, and peripheral skin temperature were measured from 16 dyads in all four scenarios, and individual as well as synchrony features were extracted from them. A two-stage classifier based on stepwise feature selection and linear discriminant analysis achieved a four-class classification accuracy of 75.0% in leave-dyad-out cross-validation. Removing synchrony features reduced accuracy to 65.6%, indicating that synchrony is informative. In the future, such classification algorithms may be used, for example, to provide real-time feedback about conversation mood to participants, with applications in areas such as mental health counseling and education. The approach may also generalize to group scenarios and adjacent areas such as cooperation and competition.
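The following sketch shows the shape of such a pipeline: a simple synchrony feature (maximum lagged cross-correlation) per signal pair, combined with individual features and fed to linear discriminant analysis. The feature set, lag range, and synthetic signals are assumptions, not the study's exact method.

```python
# Sketch: synchrony + individual features -> four-class scenario classification.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def max_lagged_xcorr(a: np.ndarray, b: np.ndarray, max_lag: int = 50) -> float:
    """Maximum Pearson correlation between two signals over a range of lags."""
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[lag:], b[:len(b) - lag]
        else:
            x, y = a[:lag], b[-lag:]
        if len(x) > 2:
            best = max(best, np.corrcoef(x, y)[0, 1])
    return best

rng = np.random.default_rng(0)
n_dyads, n_samples = 16, 1000
X, y = [], []
for dyad in range(n_dyads):
    for scenario in range(4):                        # four conversation scenarios
        hr_a, hr_b = rng.normal(size=(2, n_samples))  # synthetic heart-rate traces
        sync = max_lagged_xcorr(hr_a, hr_b)
        X.append([sync, hr_a.mean(), hr_b.mean()])    # synchrony + individual features
        y.append(scenario)

clf = LinearDiscriminantAnalysis().fit(np.array(X), np.array(y))
```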
Modeling multiple time series annotations as noisy distortions of the ground truth: An Expectation-Maximization approach
Studies of time-continuous human behavioral phenomena often rely on ratings from multiple annotators. Since the ground truth of the target construct is often latent, the standard practice is to use ad hoc metrics (such as averaging annotator ratings). Despite being easy to compute, such metrics may not provide accurate representations of the underlying construct. In this paper, we present a novel method for modeling multiple time-series annotations of a continuous variable that computes the ground truth by modeling annotator-specific distortions. We condition the ground truth on a set of features extracted from the data and further assume that the annotators provide their ratings as modifications of the ground truth, with each annotator having specific distortion tendencies. We train the model using an Expectation-Maximization based algorithm and evaluate it on a study involving natural interaction between a child and a psychologist, predicting confidence ratings of the children's smiles. We compare and analyze the model against two baselines in which: (i) the ground truth is taken to be the framewise mean of ratings from the various annotators, and (ii) each annotator is assumed to have a distinct time delay in annotation, and the annotations are aligned before computing the framewise mean.
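A toy version of the modeling idea: each annotator's rating is treated as a scaled and shifted copy of a latent ground-truth series plus noise, and EM-style updates alternate between the latent series and the per-annotator distortion parameters. This is a deliberately simplified sketch, not the paper's model, which additionally conditions the ground truth on data features.

```python
# Simplified EM-style fusion of multiple annotations with per-annotator scale/bias.
import numpy as np

def em_fuse(ratings: np.ndarray, n_iter: int = 50):
    """ratings: (n_annotators, n_frames). Returns latent series and (scale, bias)."""
    n_annot, n_frames = ratings.shape
    scale = np.ones(n_annot)
    bias = np.zeros(n_annot)
    latent = ratings.mean(axis=0)
    for _ in range(n_iter):
        # E-step (point estimate): latent value that best explains all ratings.
        latent = ((ratings - bias[:, None]) * scale[:, None]).sum(axis=0) / (scale ** 2).sum()
        # M-step: least-squares fit of each annotator's scale and bias.
        for a in range(n_annot):
            A = np.vstack([latent, np.ones(n_frames)]).T
            scale[a], bias[a] = np.linalg.lstsq(A, ratings[a], rcond=None)[0]
    return latent, scale, bias

rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 6, 300))
ratings = np.stack([1.5 * truth + 0.2 + rng.normal(0, 0.1, 300),
                    0.7 * truth - 0.1 + rng.normal(0, 0.1, 300)])
latent, scale, bias = em_fuse(ratings)
```

In practice the scale of the latent series would need to be fixed (e.g., normalized each iteration) to keep the solution identifiable; the sketch omits this for brevity.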
Probabilistic Multigraph Modeling for Improving the Quality of Crowdsourced Affective Data
We propose a probabilistic approach to the joint modeling of participants' reliability and humans' regularity in crowdsourced affective studies. Reliability measures how likely a subject is to respond to a question seriously; regularity measures how often a human agrees with other seriously entered responses from a targeted population. Crowdsourcing-based studies or experiments that rely on human self-reported affect pose additional challenges compared with typical crowdsourcing studies that aim to acquire labels of objects. The reliability of participants has been extensively pursued for typical non-affective crowdsourcing studies, whereas the regularity of humans in an affective experiment in its own right has not been thoroughly considered. It has often been observed that different individuals exhibit different feelings on the same test question, which has no single correct response in the first place. High reliability of responses from one individual therefore cannot conclusively imply high consensus across individuals. Instead, globally testing the consensus of a population is of interest to investigators. Built upon the agreement multigraph among tasks and workers, our probabilistic model differentiates subject reliability from population regularity. We demonstrate the method's effectiveness for in-depth, robust analysis of large-scale crowdsourced affective data, including emotion and aesthetic assessments collected by presenting visual stimuli to human subjects.
Multi-label Multi-task Deep Learning for Behavioral Coding
We propose a methodology for estimating human behaviors in psychotherapy sessions using mutli-label and multi-task learning paradigms. We discuss the problem of behavioral coding in which data of human interactions is the annotated with labels to describe relevant human behaviors of interest. We describe two related, yet distinct, corpora consisting of therapist client interactions in psychotherapy sessions. We experimentally compare the proposed learning approaches for estimating behaviors of interest in these datasets. Specifically, we compare single and multiple label learning approaches, single and multiple task learning approaches, and evaluate the performance of these approaches when incorporating turn context. We demonstrate the prediction performance gains which can be achieved by using the proposed paradigms and discuss the insights these models provide into these complex interactions.
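A minimal sketch of the multi-label idea in isolation: a shared encoder over an utterance's features with one sigmoid output per behavior code, so several behaviors can be active at once. Extending this to the multi-task setting would add one such output head per corpus. Sizes and the feature representation are placeholders.

```python
# Sketch: multi-label behavioral coding head (PyTorch).
import torch
import torch.nn as nn

class MultiLabelCoder(nn.Module):
    def __init__(self, n_features: int = 100, n_behaviors: int = 6):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.behavior_logits = nn.Linear(64, n_behaviors)   # one logit per behavior code

    def forward(self, x):
        return self.behavior_logits(self.encoder(x))

model = MultiLabelCoder()
utterance = torch.randn(4, 100)                              # synthetic utterance features
targets = torch.randint(0, 2, (4, 6)).float()                # multiple codes can co-occur
loss = nn.BCEWithLogitsLoss()(model(utterance), targets)
```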
Interpretation of Depression Detection Models via Feature Selection Methods
Given the prevalence of depression worldwide and its major impact on society, several studies have employed artificial intelligence modelling to automatically detect and assess depression. However, the interpretation of these models and their cues is rarely discussed in detail in the AI community, although it has received increased attention lately. In this study, we analyse the commonly selected features using a proposed framework of several feature selection methods and their effect on the classification results, which provides an interpretation of the depression detection model. The developed framework aggregates and selects the most promising features for modelling depression detection from 38 feature selection algorithms of different categories. Using three real-world depression datasets, 902 behavioural cues were extracted from speech behaviour, speech prosody, eye movement, and head pose. To verify the generalisability of the proposed framework, we applied the entire process to the depression datasets both individually and combined. The results showed that speech behaviour features (e.g. pauses) are the most distinctive features of the depression detection model. From the speech prosody modality, the strongest feature groups were F0, HNR, formants, and MFCCs; for the eye activity modality, they were left-right eye movement and gaze direction; and for the head modality, it was yaw head movement. Modelling depression detection using the selected features (even though there are only nine) outperformed using all features across all the individual and combined datasets. Our feature selection framework not only provided an interpretation of the model, but also produced higher depression detection accuracy with a small number of features across varied datasets. This could help to reduce the processing time needed to extract features and to create the model.
Exploring Complexity of Facial Dynamics in Autism Spectrum Disorder
Atypical facial expression is one of the early symptoms of autism spectrum disorder (ASD), characterized by reduced regularity and lack of coordination of facial movements. Automatic quantification of these behaviors can offer novel biomarkers for screening, diagnosis, and treatment monitoring of ASD. In this work, 40 toddlers with ASD and 396 typically developing toddlers were shown developmentally appropriate and engaging movies presented on a smart tablet during a well-child pediatric visit. The movies consisted of social and non-social dynamic scenes designed to evoke certain behavioral and affective responses. The front-facing camera of the tablet was used to capture the toddlers' faces. The dynamics of facial landmarks were then automatically computed using computer vision algorithms. Subsequently, the complexity of these landmark dynamics was estimated for the eyebrow and mouth regions using multiscale entropy. Compared to typically developing toddlers, toddlers with ASD showed higher complexity (i.e., less predictability) in these landmark dynamics. This complexity in facial dynamics contained novel information not captured by traditional facial affect analyses. These results suggest that computer vision analysis of facial landmark movements is a promising approach for detecting and quantifying early behavioral symptoms associated with ASD.
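For readers unfamiliar with the measure, here is a sketch of multiscale entropy on a single landmark-derived time series: sample entropy computed on coarse-grained versions of the signal at several scales. The parameter values (m, r, number of scales) are typical defaults and the input is synthetic, not the study's exact settings or data.

```python
# Sketch: multiscale (sample) entropy of a facial-landmark time series.
import numpy as np

def sample_entropy(x: np.ndarray, m: int = 2, r_frac: float = 0.2) -> float:
    r = r_frac * np.std(x)
    def count_matches(length):
        templates = np.array([x[i:i + length] for i in range(len(x) - length)])
        dists = np.max(np.abs(templates[:, None] - templates[None, :]), axis=2)
        return (np.sum(dists <= r) - len(templates)) / 2   # exclude self-matches
    B, A = count_matches(m), count_matches(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

def multiscale_entropy(x: np.ndarray, scales=range(1, 6)):
    mse = []
    for s in scales:
        coarse = x[:len(x) // s * s].reshape(-1, s).mean(axis=1)   # coarse-graining
        mse.append(sample_entropy(coarse))
    return mse

landmark_distance = np.random.default_rng(0).normal(size=600)     # e.g. mouth opening over time
print(multiscale_entropy(landmark_distance))
```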
Indirect Identification of Perinatal Psychosocial Risks from Natural Language
During the perinatal period, psychosocial health risks, including depression and intimate partner violence, are associated with serious adverse health outcomes for birth parents and children. To appropriately intervene, healthcare professionals must first identify those at risk, yet stigma often prevents people from directly disclosing the information needed to prompt an assessment. In this research, we use short diary entries to indirectly elicit information that could indicate psychosocial risks, then examine patterns that emerge in the language of those at risk. We find that diary entries exhibit consistent themes, extracted using topic modeling, and emotional perspective, drawn from dictionary-informed sentiment features. Using these features, we apply regularized regression to predict screening measures for depression and psychological aggression by an intimate partner. Journal text entries quantified through topic models and sentiment features show promise for depression prediction, corresponding with self-reported screening measures almost as well as closed-form questions. Text-based features are less useful in predicting intimate partner violence, but topic models generate themes that align with known risk correlates. The indirect features uncovered in this research could aid in the detection and analysis of stigmatized risks.
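A rough sketch of the shape of such a text pipeline: topic proportions from LDA plus a simple lexicon-style sentiment count as features, then regularized (ridge) regression against a screening score. The example entries, toy lexicon, and target values are stand-ins, not the study's data or instruments.

```python
# Sketch: topic + sentiment features -> regularized regression on a screening score.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import Ridge

entries = ["slept badly again and felt alone all day",
           "good walk with the baby, felt calm and hopeful",
           "argued at home, too tired to write much"]
screening_score = np.array([14.0, 3.0, 11.0])             # hypothetical scale values

negative_words = {"badly", "alone", "argued", "tired"}     # toy sentiment lexicon
sentiment = np.array([[sum(w in negative_words for w in e.split())] for e in entries])

counts = CountVectorizer().fit_transform(entries)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

X = np.hstack([topics, sentiment])
model = Ridge(alpha=1.0).fit(X, screening_score)
```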
Artificial Emotional Intelligence in Socially Assistive Robots for Older Adults: A Pilot Study
This paper presents our recent research on integrating artificial emotional intelligence in a social robot (Ryan) and studies the robot's effectiveness in engaging older adults. Ryan is a socially assistive robot designed to provide companionship for older adults with depression and dementia through conversation. We used two versions of Ryan for our study, empathic and non-empathic. The empathic Ryan utilizes a multimodal emotion recognition algorithm and a multimodal emotion expression system. Using different input modalities for emotion, i.e., facial expression and speech sentiment, the empathic Ryan detects the user's emotional state and utilizes an affective dialogue manager to generate a response. In contrast, the non-empathic Ryan lacks facial expression and uses scripted dialogues that do not factor in the user's emotional state. We studied these two versions of Ryan with 10 older adults living in a senior care facility. The statistically significant improvement in the users' reported face-scale mood measurement indicates an overall positive effect from the interaction with both the empathic and non-empathic versions of Ryan. However, the number-of-spoken-words measurement and the exit survey analysis suggest that the users perceive the empathic Ryan as more engaging and likable.
Computer Vision Analysis for Quantification of Autism Risk Behaviors
Observational behavior analysis plays a key role in the discovery and evaluation of risk markers for many neurodevelopmental disorders. Research on autism spectrum disorder (ASD) suggests that behavioral risk markers can be observed at 12 months of age or earlier, with diagnosis possible at 18 months. To date, these studies and evaluations involving observational analysis tend to rely heavily on clinical practitioners and specialists who have undergone intensive training to be able to reliably administer carefully designed behavior-eliciting tasks, code the resulting behaviors, and interpret such behaviors. These methods are therefore extremely expensive, time-intensive, and not easily scalable for large-population or longitudinal observational analysis. We developed a self-contained, closed-loop, mobile application with movie stimuli designed to engage the child's attention and elicit specific behavioral and social responses, which are recorded with a mobile device camera and then analyzed via computer vision algorithms. Here, in addition to presenting this paradigm, we validate the system's ability to measure engagement, name-call responses, and emotional responses of toddlers with and without ASD who were presented with the application. Additionally, we show examples of how the proposed framework can further risk marker research with fine-grained quantification of behaviors. The results suggest that these objective and automatic methods can aid behavioral analysis and are well suited for objective, automatic analysis in future studies.
Personalized Multitask Learning for Predicting Tomorrow's Mood, Stress, and Health
While accurately predicting mood and wellbeing could have a number of important clinical benefits, traditional machine learning (ML) methods frequently yield low performance in this domain. We posit that this is because a one-size-fits-all machine learning model is inherently ill-suited to predicting outcomes like mood and stress, which vary greatly due to individual differences. Therefore, we employ Multitask Learning (MTL) techniques to train personalized ML models which are customized to the needs of each individual, but still leverage data from across the population. Three formulations of MTL are compared: i) MTL deep neural networks, which share several hidden layers but have final layers unique to each task; ii) Multi-task Multi-Kernel learning, which feeds information across tasks through kernel weights on feature types; and iii) a Hierarchical Bayesian model in which tasks share a common Dirichlet Process prior. We offer the code for this work in open source. These techniques are investigated in the context of predicting future mood, stress, and health using data collected from surveys, wearable sensors, smartphone logs, and the weather. Empirical results demonstrate that using MTL to account for individual differences provides large performance improvements over traditional machine learning methods and provides personalized, actionable insights.
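A sketch of the first MTL formulation described above: a network whose hidden layers are shared across people and whose final layer is unique to each person (task). Layer sizes, the number of people, and the input features are placeholders, not the study's configuration.

```python
# Sketch: personalized multitask network with shared layers and per-person heads.
import torch
import torch.nn as nn

class PersonalizedMTLNet(nn.Module):
    def __init__(self, n_features: int, n_people: int, hidden: int = 32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        # One output head per person; each predicts tomorrow's mood score.
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_people)])

    def forward(self, x, person_id: int):
        return self.heads[person_id](self.shared(x)).squeeze(-1)

model = PersonalizedMTLNet(n_features=20, n_people=5)
x = torch.randn(4, 20)                    # today's survey/sensor/phone/weather features
mood_pred = model(x, person_id=2)         # predictions from person 2's head
```

The shared layers let every person's data shape the common representation, while the per-person heads capture individual differences, which is the core trade-off the abstract describes.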
Automatic Estimation of Self-Reported Pain by Trajectory Analysis in the Manifold of Fixed Rank Positive Semi-Definite Matrices
We propose an automatic method to estimate self-reported pain based on facial landmarks extracted from videos. For each video sequence, we decompose the face into four different regions, and the pain intensity is measured by modeling the dynamics of facial movement using the landmarks of these regions. A formulation based on Gram matrices is used to represent the trajectory of landmarks on the Riemannian manifold of symmetric positive semi-definite matrices of fixed rank. A curve fitting algorithm is used to smooth the trajectories, and temporal alignment is performed to compute the similarity between the trajectories on the manifold. A Support Vector Regression model is then trained to encode the extracted trajectories into pain intensity levels consistent with self-reported pain intensity measurements. Finally, a late fusion of the estimates for each region is performed to obtain the final predicted pain level. The proposed approach is evaluated on two publicly available datasets, the UNBC-McMaster Shoulder Pain Archive and the Biovid Heat Pain dataset. We compare our method to the state of the art on both datasets using different testing protocols, showing the competitiveness of the proposed approach.
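A minimal illustration of the representation: each frame's landmarks for a region are centered and turned into a Gram matrix, giving a trajectory of positive semi-definite matrices, and a trajectory-level feature feeds a Support Vector Regression model. The frame-by-frame Frobenius distance used here is a crude stand-in for the paper's manifold metric, curve fitting, and temporal alignment.

```python
# Sketch: landmark trajectories -> Gram-matrix trajectories -> SVR on pain intensity.
import numpy as np
from sklearn.svm import SVR

def gram_trajectory(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (frames, points, 2) -> (frames, points, points) Gram matrices."""
    centered = landmarks - landmarks.mean(axis=1, keepdims=True)
    return np.einsum("fpd,fqd->fpq", centered, centered)

rng = np.random.default_rng(0)
sequences = [rng.normal(size=(40, 10, 2)) for _ in range(30)]    # 30 synthetic clips
pain_levels = rng.uniform(0, 10, size=30)                        # self-reported pain (synthetic)

reference = gram_trajectory(sequences[0])
features = np.array([[np.linalg.norm(gram_trajectory(s) - reference)] for s in sequences])
svr = SVR(kernel="rbf").fit(features, pain_levels)
```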
A Computational Study of Expressive Facial Dynamics in Children with Autism
Several studies have established that facial expressions of children with autism are often perceived as atypical, awkward, or less engaging by typical adult observers. Despite this clear deficit in the quality of facial expression production, very little is understood about its underlying mechanisms and characteristics. This paper takes a computational approach to studying the details of facial expressions of children with high functioning autism (HFA). The objective is to uncover characteristics of these facial expressions that are distinct from those of typically developing children and that are otherwise difficult to detect by visual inspection. We use motion capture data obtained from subjects with HFA and typically developing subjects while they produced various facial expressions. These data are analyzed to investigate how the overall and local facial dynamics of children with HFA differ from their typically developing peers. Our major observations include reduced complexity in the dynamic facial behavior of the HFA group, arising primarily from the eye region.
Jointly Aligning and Predicting Continuous Emotion Annotations
Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.
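A conceptual sketch of a delayed sinc layer: a low-pass sinc filter whose time shift (delay) is a learnable scalar, applied to a sequence via 1-D convolution. The kernel length, cutoff, and initialization here are illustrative choices, not the paper's exact parameterization.

```python
# Sketch: sinc filter with a learnable delay, applied as a 1-D convolution (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DelayedSinc(nn.Module):
    def __init__(self, kernel_size: int = 101, cutoff: float = 0.1):
        super().__init__()
        self.delay = nn.Parameter(torch.tensor(0.0))   # single delay, learned via backprop
        self.cutoff = cutoff
        self.register_buffer("t", torch.arange(kernel_size, dtype=torch.float32)
                             - (kernel_size - 1) / 2)

    def forward(self, x):                               # x: (batch, 1, time)
        shifted = self.t - self.delay                   # time-shifted sinc kernel
        kernel = 2 * self.cutoff * torch.sinc(2 * self.cutoff * shifted)
        kernel = kernel.view(1, 1, -1)
        return F.conv1d(x, kernel, padding=kernel.shape[-1] // 2)

layer = DelayedSinc()
labels = torch.randn(2, 1, 300)          # e.g. predicted arousal trajectories (synthetic)
aligned = layer(labels)                  # delayed, low-pass-filtered output
```

Stacking several such layers with different learned delays, as the abstract describes, would let the model compensate for delays that vary across the acoustic space.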
Cognitive Load Measurement in a Virtual Reality-based Driving System for Autism Intervention
Autism Spectrum Disorder (ASD) is a highly prevalent neurodevelopmental disorder with enormous individual and social cost. In this paper, a novel virtual reality (VR)-based driving system was introduced to teach driving skills to adolescents with ASD. This driving system is capable of gathering eye gaze, electroencephalography, and peripheral physiology data in addition to driving performance data. The objective of this paper is to fuse multimodal information to measure cognitive load during driving such that driving tasks can be individualized for optimal skill learning. Individualization of ASD intervention is an important criterion due to the spectrum nature of the disorder. Twenty adolescents with ASD participated in our study, and the data collected were used for systematic feature extraction and classification of cognitive load based on five well-known machine learning methods. Subsequently, three information fusion schemes (feature-level fusion, decision-level fusion, and hybrid-level fusion) were explored. Results indicate that multimodal information fusion can be used to measure cognitive load with high accuracy. Such a mechanism is essential since it will allow individualization of driving skill training based on cognitive load, which will facilitate acceptance of this driving system for clinical use and eventual commercialization.
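To contrast two of the fusion schemes named above, here is a toy example: feature-level fusion concatenates modality features before a single classifier, while decision-level fusion trains one classifier per modality and combines their predicted probabilities. The modalities, dimensionalities, and classifier are placeholders, not the study's setup.

```python
# Sketch: feature-level vs. decision-level fusion for cognitive-load classification.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
gaze, eeg, physio = (rng.normal(size=(100, d)) for d in (5, 8, 4))  # synthetic modality features
cog_load = rng.integers(0, 2, 100)                                  # high vs. low cognitive load

# Feature-level fusion: one model on the concatenated feature vector.
feat_fused = LogisticRegression(max_iter=1000).fit(np.hstack([gaze, eeg, physio]), cog_load)

# Decision-level fusion: average the per-modality predicted probabilities.
models = [LogisticRegression(max_iter=1000).fit(m, cog_load) for m in (gaze, eeg, physio)]
probs = np.mean([m.predict_proba(x)[:, 1] for m, x in zip(models, (gaze, eeg, physio))], axis=0)
decision_fused = (probs > 0.5).astype(int)
```

A hybrid scheme would combine both, e.g. by feeding per-modality decisions alongside selected raw features into a final classifier.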
Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG)
Automatic speech emotion recognition provides computers with critical context to enable user understanding. While methods trained and tested within the same dataset have been shown to be successful, they often fail when applied to unseen datasets. To address this, recent work has focused on adversarial methods to find more generalized representations of emotional speech. However, many of these methods have issues converging, and only involve datasets collected in laboratory conditions. In this paper, we introduce Adversarial Discriminative Domain Generalization (ADDoG), which follows an easier-to-train "meet in the middle" approach. The model iteratively moves the representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce Multiclass ADDoG, or MADDoG, which extends the proposed method to more than two datasets simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when labels from the target dataset are not used. We also show how, in most cases, ADDoG and MADDoG can be used to improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Even though our experiments focus on cross-corpus speech emotion recognition, these methods could be used to remove unwanted factors of variation in other settings.
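A high-level sketch of the adversarial idea: an encoder maps speech features from two corpora into a shared space, a critic tries to tell the corpora apart, and the encoder is trained both to predict emotion and to push both corpora's representations toward the "middle" of the critic's decision. This is a generic domain-adversarial training loop for illustration, not the exact ADDoG procedure, and all dimensions are placeholders.

```python
# Sketch: encoder + emotion head + dataset critic trained adversarially (PyTorch).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 32))
emotion_head = nn.Linear(32, 1)            # e.g. valence regression
critic = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 1))

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(emotion_head.parameters()), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(x_a, y_a, x_b):
    # 1) Update the critic to distinguish dataset A (label 0) from dataset B (label 1).
    z_a, z_b = encoder(x_a).detach(), encoder(x_b).detach()
    critic_loss = bce(critic(z_a), torch.zeros(len(x_a), 1)) + \
                  bce(critic(z_b), torch.ones(len(x_b), 1))
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # 2) Update encoder + emotion head: predict emotion on A and push both
    #    datasets' representations toward the critic's "middle" (output 0.5).
    z_a, z_b = encoder(x_a), encoder(x_b)
    emo_loss = nn.functional.mse_loss(emotion_head(z_a), y_a)
    fool_loss = bce(critic(z_a), torch.full((len(x_a), 1), 0.5)) + \
                bce(critic(z_b), torch.full((len(x_b), 1), 0.5))
    loss = emo_loss + fool_loss
    opt_main.zero_grad(); loss.backward(); opt_main.step()

train_step(torch.randn(16, 40), torch.randn(16, 1), torch.randn(16, 40))
```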