IMAGE AND VISION COMPUTING

A survey on computer vision based human analysis in the COVID-19 era
Eyiokur FI, Kantarcı A, Erakın ME, Damer N, Ofli F, Imran M, Križaj J, Salah AA, Waibel A, Štruc V and Ekenel HK
The emergence of COVID-19 has had a global and profound impact, not only on society as a whole, but also on the lives of individuals. Various prevention measures were introduced around the world to limit the transmission of the disease, including face masks, mandates for social distancing and regular disinfection in public spaces, and the use of screening applications. These developments also triggered the need for novel and improved computer vision techniques capable of providing support to the prevention measures through an automated analysis of visual data, on the one hand, and facilitating normal operation of existing vision-based services, such as biometric authentication schemes, on the other. Especially important here are computer vision techniques that focus on the analysis of people and faces in visual data, as these have been affected the most by the partial occlusions introduced by face-mask mandates. Such computer vision based human analysis techniques include face and face-mask detection approaches, face recognition techniques, crowd counting solutions, age and expression estimation procedures, models for detecting face-hand interactions, and many others, and have seen considerable attention over recent years. The goal of this survey is to provide an introduction to the problems COVID-19 has induced in this line of research and to present a comprehensive review of the work done in the computer vision based human analysis field. Particular attention is paid to the impact of facial masks on the performance of various methods and to recent solutions for mitigating this problem. Additionally, a detailed review of existing datasets useful for the development and evaluation of methods for COVID-19 related applications is provided. Finally, to help advance the field further, a discussion of the main open challenges and future research directions is given at the end of the survey. This work is intended to have broad appeal and to be useful not only for computer vision researchers but also for the general public.
Progressive ShallowNet for large scale dynamic and spontaneous facial behaviour analysis in children
Qayyum A, Razzak I, Moustafa N and Mazher M
COVID-19 has severely disrupted every aspect of society and left a negative impact on our lives. Resisting the temptation to engage in face-to-face social connection is not as easy as we imagine. Breaking ties within one's social circle makes us lonely and isolated, which in turn increases the likelihood of depression-related disease and can even lead to death by increasing the risk of heart disease. Not only adults but also children are impacted, and for children the contribution of emotional competence to social competence has long-term implications. Early identification of deficits in recognizing facial behaviour, emotions, and expressions may help to prevent low social functioning. Deficits in young children's ability to differentiate human emotions can lead to social functioning impairment. However, existing work mostly focuses on emotion recognition in adults and ignores emotion recognition in children. Inspired by the working of pyramidal cells in the cerebral cortex, in this paper we present progressive lightweight shallow learning for classification that efficiently utilizes skip-connections for spontaneous facial behaviour recognition in children. Unlike earlier deep neural networks, we limit the alternative paths for the gradient in the earlier part of the network and increase them gradually with network depth. Progressive ShallowNet is not only able to explore more of the feature space but also resolves the over-fitting issue on smaller data by limiting the residual path locally, making the network less vulnerable to perturbations. We have conducted extensive experiments on a benchmark for facial behaviour analysis in children, which showed significant performance gains over comparable methods.
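The central mechanism described here, skip connections that are suppressed in early blocks and enabled progressively with depth, can be illustrated with a minimal PyTorch sketch. All layer sizes, the `skip_from` switch, and class names below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a "progressive" residual stack: skip connections are
# disabled in early blocks and enabled with depth, loosely following the idea
# described above (not the authors' architecture).
import torch
import torch.nn as nn

class ProgressiveBlock(nn.Module):
    def __init__(self, channels, use_skip):
        super().__init__()
        self.use_skip = use_skip
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.body(x)
        if self.use_skip:          # residual path only in deeper blocks
            out = out + x
        return self.act(out)

class ProgressiveShallowNet(nn.Module):
    def __init__(self, num_classes=6, channels=32, depth=6, skip_from=3):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        # skip connections switched on only from block `skip_from` onwards
        self.blocks = nn.Sequential(
            *[ProgressiveBlock(channels, use_skip=(i >= skip_from))
              for i in range(depth)])
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):
        x = self.blocks(self.stem(x))
        x = x.mean(dim=(2, 3))     # global average pooling
        return self.head(x)

logits = ProgressiveShallowNet()(torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 6])
```

In this toy configuration the first `skip_from` blocks behave as a plain shallow stack, while deeper blocks regain the residual path, mirroring the progressive gradient routing described in the abstract.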
FMD-Yolo: An efficient face mask detection method for COVID-19 prevention and control in public
Wu P, Li H, Zeng N and Li F
Coronavirus disease 2019 (COVID-19) is a worldwide epidemic, and efficient prevention and control of this disease has become the focus of global scientific communities. In this paper, a novel face mask detection framework, FMD-Yolo, is proposed to monitor whether people in public wear masks correctly, which is an effective way to block virus transmission. In particular, the feature extractor employs Im-Res2Net-101, which combines the Res2Net module with a deep residual network, where the use of a hierarchical convolutional structure, deformable convolution, and non-local mechanisms enables thorough information extraction from the input. Afterwards, an enhanced path aggregation network, En-PAN, is applied for feature fusion, where high-level semantic information and low-level details are sufficiently merged so that model robustness and generalization ability can be enhanced. Moreover, a localization loss is designed and adopted in the model training phase, and the Matrix NMS method is used at the inference stage to improve detection efficiency and accuracy. Benchmark evaluation is performed on two public databases, with the results compared against eight other state-of-the-art detection algorithms. At IoU = 0.5, the proposed FMD-Yolo achieves the best AP50 of 92.0% and 88.4% on the two datasets, and its AP75 at IoU = 0.75 improves on the second-best method by 5.5% and 3.9%, respectively, which demonstrates the superiority of FMD-Yolo in face mask detection in terms of both theoretical value and practical significance.
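Of the components listed, the Matrix NMS inference step is the most self-contained and can be sketched compactly. The NumPy sketch below implements the Gaussian-decay variant of Matrix NMS as known from the SOLOv2 literature; the sigma value and the choice to operate on boxes rather than masks are assumptions, and the paper's exact configuration may differ.

```python
# Minimal NumPy sketch of Matrix NMS (Gaussian decay variant) for boxes:
# overlapping detections receive down-weighted scores instead of being
# hard-suppressed, and the whole step is loop-free.
import numpy as np

def iou_matrix(boxes):
    # boxes: (N, 4) as [x1, y1, x2, y2]
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter + 1e-9)

def matrix_nms(boxes, scores, sigma=2.0):
    order = np.argsort(-scores)
    boxes, scores = boxes[order], scores[order]
    iou = np.triu(iou_matrix(boxes), k=1)   # IoU with higher-scoring boxes
    compensate = iou.max(axis=0)            # how suppressed each suppressor is
    decay = np.exp(-sigma * (iou ** 2 - compensate[:, None] ** 2)).min(axis=0)
    return boxes, scores * decay            # decayed, not hard-suppressed

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
print(matrix_nms(boxes, np.array([0.9, 0.8, 0.7]))[1])  # [0.9, ~0.32, 0.7]
```

Unlike classical NMS, no detection is discarded inside the decay step; a score threshold is applied afterwards to obtain the final detections.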
Learning Facial Action Units with Spatiotemporal Cues and Multi-label Sampling
Chu WS, De la Torre F and Cohn JF
Facial action units (AUs) may be represented spatially, temporally, and in terms of their correlation. Previous research focuses on one or another of these aspects or addresses them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations and a Long Short-Term Memory (LSTM) network to model temporal dependencies among them. The outputs of the CNNs and LSTMs are aggregated into a fusion network to produce per-frame predictions of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches on two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to standard multi-label CNNs and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and obtained increased accuracy for AU detection. To address class imbalance within and between batches during training, we introduce multi-label sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualizations of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.
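The spatial-temporal-fusion pipeline can be captured in a short PyTorch sketch: a per-frame CNN, an LSTM over the frame features, and a fusion layer producing per-frame multi-label AU logits. Layer sizes and names are illustrative stand-ins, not the paper's architecture.

```python
# A minimal sketch of the hybrid idea: per-frame CNN features, an LSTM over
# time, and a fusion layer emitting per-frame multi-label AU scores.
import torch
import torch.nn as nn

class HybridAUNet(nn.Module):
    def __init__(self, num_aus=12, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                 # spatial representation
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # temporal
        self.fusion = nn.Linear(feat_dim + hidden, num_aus)      # CNN + LSTM

    def forward(self, clips):                     # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        temporal, _ = self.lstm(feats)
        return self.fusion(torch.cat([feats, temporal], dim=-1))

scores = HybridAUNet()(torch.randn(2, 8, 3, 64, 64))
print(scores.shape)   # per-frame multi-label logits: torch.Size([2, 8, 12])
```

A multi-label loss such as `BCEWithLogitsLoss` over the per-frame logits would complete a training setup of this shape.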
Dense 3D Face Alignment from 2D Video for Real-Time Use
Jeni LA, Cohn JF and Kanade T
To enable real-time, person-independent 3D registration from 2D video, we developed a 3D cascade regression approach in which facial landmarks remain invariant across pose over a range of approximately 60 degrees. From a single 2D image of a person's face, a dense 3D shape is registered in real time for each frame. The algorithm utilizes a fast cascade regression framework trained on high-resolution 3D face-scans of posed and spontaneous emotion expression. The algorithm first estimates the location of a dense set of landmarks and their visibility, then reconstructs face shapes by fitting a part-based 3D model. Because no assumptions are required about illumination or surface properties, the method can be applied to a wide range of imaging conditions that include 2D video and uncalibrated multi-view video. The method has been validated in a battery of experiments that evaluate its precision of 3D reconstruction, extension to multi-view reconstruction, temporal integration for videos and 3D head-pose estimation. Experimental findings strongly support the validity of real-time, 3D registration and reconstruction from 2D video. The software is available online at http://zface.org.
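The cascade-regression core, repeatedly mapping appearance features sampled at the current landmark estimate to an additive shape update, reduces to a few lines. The sketch below uses random matrices as stand-ins for trained stage regressors and a dummy feature extractor; it shows only the update rule, not the part-based model fitting or visibility estimation.

```python
# Schematic NumPy sketch of a cascade-regression pass: each stage maps
# features extracted around the current landmark estimate to a shape update.
import numpy as np

rng = np.random.default_rng(0)
n_landmarks, feat_dim, n_stages = 68, 32, 4

def extract_features(image, shape):
    # stand-in for local appearance features sampled at each landmark
    return rng.standard_normal(feat_dim)

stages = [rng.standard_normal((n_landmarks * 3, feat_dim)) * 0.01
          for _ in range(n_stages)]              # trained regressors in practice

shape = np.zeros(n_landmarks * 3)                # mean 3D shape (flattened)
image = None                                     # placeholder frame
for R in stages:                                 # coarse-to-fine refinement
    shape = shape + R @ extract_features(image, shape)
print(shape.reshape(n_landmarks, 3).shape)       # (68, 3) registered landmarks
```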
Lessons from Collecting a Million Biometric Samples
Phillips PJ, Flynn PJ and Bowyer KW
Over the past decade, independent evaluations have become commonplace in many areas of experimental computer science, including face and gesture recognition. A key attribute of many successful independent evaluations is a curated data set. Desired aspects associated with these data sets include appropriateness to the experimental design, a corpus size large enough to allow statistically rigorous characterization of results, and the availability of comprehensive metadata that allow inferences to be made on various data set attributes. In this paper, we review a ten-year biometric sampling effort that enabled the creation of several key biometrics challenge problems. We summarize the design and execution of data collections, identify key challenges, and convey some lessons learned.
Multiview stereo and silhouette fusion via minimizing generalized reprojection error
Li Z, Wang K, Jia W, Chen HC, Zuo W, Meng D and Sun M
Accurate reconstruction of 3D geometrical shape from a set of calibrated 2D multiview images is an active yet challenging task in computer vision. The existing multiview stereo methods usually perform poorly in recovering deeply concave and thinly protruding structures, and suffer from several common problems like slow convergence, sensitivity to initial conditions, and high memory requirements. To address these issues, we propose a two-phase optimization method for generalized reprojection error minimization (TwGREM), where a generalized framework of reprojection error is proposed to integrate stereo and silhouette cues into a unified energy function. For the minimization of the function, we first introduce a convex relaxation on 3D volumetric grids which can be efficiently solved using variable splitting and Chambolle projection. Then, the resulting surface is parameterized as a triangle mesh and refined using surface evolution to obtain a high-quality 3D reconstruction. Our comparative experiments with several state-of-the-art methods show that the performance of TwGREM based 3D reconstruction is among the highest with respect to accuracy and efficiency, especially for data with smooth texture and sparsely sampled viewpoints.
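As a back-of-envelope illustration of the silhouette half of the unified energy, the sketch below projects occupied voxels into one calibrated view and counts those falling outside the view's silhouette. The camera, shape, and thresholds are synthetic assumptions; the paper combines this cue with stereo terms and minimizes the result via convex relaxation and surface evolution.

```python
# Toy silhouette reprojection error on a voxel set: occupied points are
# projected into a calibrated view and penalized when they land outside the
# silhouette (here a circle slightly smaller than the true outline).
import numpy as np

rng = np.random.default_rng(6)
voxels = rng.uniform(-1, 1, size=(500, 3))       # candidate 3D points
occupancy = np.linalg.norm(voxels, axis=1) < 0.8 # toy shape: a sphere

def project(points, cam):                        # cam: 3x4 projection matrix
    homo = np.hstack([points, np.ones((len(points), 1))])
    uvw = homo @ cam.T
    return uvw[:, :2] / uvw[:, 2:3]

cam = np.hstack([np.eye(3), np.array([[0.0], [0.0], [3.0]])])
uv = project(voxels[occupancy], cam)
in_silhouette = np.linalg.norm(uv, axis=1) < 0.25  # toy circular silhouette
error = float(np.mean(~in_silhouette))             # fraction falling outside
print(f"silhouette reprojection error: {error:.2f}")
```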
Nonverbal Social Withdrawal in Depression: Evidence from manual and automatic analysis
Girard JM, Cohn JF, Mahoor MH, Mavadati SM, Hammal Z and Rosenwald DP
The relationship between nonverbal behavior and severity of depression was investigated by following depressed participants over the course of treatment and video recording a series of clinical interviews. Facial expressions and head pose were analyzed from video using manual and automatic systems. Both systems were highly consistent for FACS action units (AUs) and showed similar effects for change over time in depression severity. When symptom severity was high, participants made fewer affiliative facial expressions (AUs 12 and 15) and more non-affiliative facial expressions (AU 14). Participants also exhibited diminished head motion (i.e., amplitude and velocity) when symptom severity was high. These results are consistent with the Social Withdrawal hypothesis: that depressed individuals use nonverbal behavior to maintain or increase interpersonal distance. As individuals recover, they send more signals indicating a willingness to affiliate. The finding that automatic facial expression analysis was both consistent with manual coding and revealed the same pattern of findings suggests that automatic facial expression analysis may be ready to relieve the burden of manual coding in behavioral and clinical science.
Classification and Weakly Supervised Pain Localization using Multiple Segment Representation
Sikka K, Dhall A and Bartlett MS
Automatic pain recognition from videos is a vital clinical application and, owing to its spontaneous nature, poses interesting challenges to automatic facial expression recognition (AFER) research. Previous pain vs. no-pain systems have highlighted two major challenges: (1) ground truth is provided for the sequence, but the presence or absence of the target expression in a given frame is unknown, and (2) the time point and the duration of the pain expression event(s) in each video are unknown. To address these issues we propose a novel framework (referred to as MS-MIL) in which each sequence is represented as a bag containing multiple segments, and multiple instance learning (MIL) is employed to handle this weakly labeled data in the form of sequence-level ground truth. These segments are generated via multiple clusterings of a sequence or by running a multi-scale temporal scanning window, and are represented using a state-of-the-art Bag of Words (BoW) representation. This work extends the idea of detecting facial expressions through 'concept frames' to 'concept segments' and argues through extensive experiments that algorithms such as MIL are needed to reap the benefits of such a representation. The key advantages of our approach are: (1) joint detection and localization of painful frames using only sequence-level ground truth, (2) incorporation of temporal dynamics by representing the data not as individual frames but as segments, and (3) extraction of multiple segments, which is well suited to signals with uncertain temporal location and duration in the video. Extensive experiments on the UNBC-McMaster Shoulder Pain dataset highlight the effectiveness of the approach by achieving competitive results on both tasks of pain classification and localization in videos. We also empirically evaluate the contributions of the different components of MS-MIL. The paper also includes visualization of the discriminative facial patches, important for pain detection, discovered by our algorithm, and relates them to Action Units that have been associated with pain expression. We conclude the paper by demonstrating that MS-MIL yields a significant improvement on another spontaneous facial expression dataset, the FEEDTUM dataset.
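The MIL decision rule at the heart of MS-MIL, bag-level prediction and localization driven by the maximum instance score, can be shown in a toy NumPy sketch. The linear scorer and synthetic BoW descriptors below are placeholders for the trained instance classifier.

```python
# Toy MIL decision rule: a video is a bag of segment descriptors, an
# instance-level scorer rates each segment, and the bag label is driven by
# the maximum segment score, which also localizes the event.
import numpy as np

def score_segments(segments, w, b):
    return segments @ w + b                      # instance-level scores

rng = np.random.default_rng(1)
segments = rng.standard_normal((5, 20))          # 5 segments, 20-d BoW each
w, b = rng.standard_normal(20), 0.0

scores = score_segments(segments, w, b)
bag_score = scores.max()                         # sequence-level pain score
localized = int(scores.argmax())                 # most painful segment index
print(f"pain={bag_score > 0}, segment={localized}")
```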
Non-rigid Face Tracking with Local Appearance Consistency Constraint
Wang Y, Lucey S, Cohn JF and Saragih J
In this paper we present a new discriminative approach to achieve consistent and efficient tracking of non-rigid object motion, such as facial expressions. By utilizing both spatial and temporal appearance coherence at the patch level, the proposed approach can reduce ambiguity and increase accuracy. Recent research demonstrates that feature-based approaches, such as constrained local models (CLMs), can achieve good performance in non-rigid object alignment/tracking using local region descriptors and a non-rigid shape prior. However, the matching performance of the learned generic patch experts is susceptible to local appearance ambiguity. Since there is no motion continuity constraint between neighboring frames of the same sequence, the resulting object alignment might not be consistent from frame to frame and the motion field might not be temporally smooth. In this paper, we extend the CLM method into the spatio-temporal domain by enforcing an appearance consistency constraint on each local patch between neighboring frames. More importantly, we show that the global warp update can be optimized jointly in an efficient manner using convex quadratic fitting. Finally, we demonstrate that our approach achieves improved performance for the task of non-rigid facial motion tracking on videos of clinical patients.
A Portable Stereo Vision System for Whole Body Surface Imaging
Yu W and Xu B
This paper presents a whole body surface imaging system based on stereo vision technology. We have adopted a compact and economical configuration which involves only four stereo units to image the frontal and rear sides of the body. The success of the system depends on a stereo matching process that can effectively segment the body from the background in addition to recovering sufficient geometric details. For this purpose, we have developed a novel sub-pixel, dense stereo matching algorithm which includes two major phases. In the first phase, the foreground is accurately segmented with the help of a predefined virtual interface in the disparity space image, and a coarse disparity map is generated with block matching. In the second phase, local least squares matching is performed in combination with global optimization within a regularization framework, so as to ensure both accuracy and reliability. Our experimental results show that the system can realistically capture smooth and natural whole body shapes with high accuracy.
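The first-phase coarse block matching can be sketched directly: for each block in the left image, search along the same scanline in the right image for the disparity minimizing a sum-of-absolute-differences cost. Window and search-range values below are illustrative, and the segmentation, least-squares refinement, and regularization phases are omitted.

```python
# Coarse block matching over rectified stereo pairs: per-block SAD search
# along the epipolar (same-row) direction.
import numpy as np

def block_match(left, right, block=8, max_disp=32):
    h, w = left.shape
    disp = np.zeros((h // block, w // block), dtype=int)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            ref = left[y:y + block, x:x + block].astype(float)
            best, best_d = np.inf, 0
            for d in range(min(max_disp, x) + 1):   # same-row search
                cand = right[y:y + block, x - d:x - d + block]
                cost = np.abs(ref - cand).sum()     # SAD cost
                if cost < best:
                    best, best_d = cost, d
            disp[by, bx] = best_d
    return disp

left = np.random.rand(64, 64)
right = np.roll(left, -4, axis=1)                   # synthetic 4-px disparity
print(np.median(block_match(left, right)))          # ~4
```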
Efficient Constrained Local Model Fitting for Non-Rigid Face Alignment
Lucey S, Wang Y, Cox M, Sridharan S and Cohn JF
Active appearance models (AAMs) have demonstrated great utility when employed for non-rigid face alignment/tracking. The "simultaneous" algorithm for fitting an AAM achieves good non-rigid face registration performance, but has poor real-time performance (2-3 fps). The "project-out" algorithm for fitting an AAM achieves faster than real-time performance (> 200 fps) but suffers from poor generic alignment performance. In this paper we introduce an extension to a discriminative method for non-rigid face registration/tracking referred to as a constrained local model (CLM). Our proposed method is able to achieve superior performance to the "simultaneous" AAM algorithm along with real-time fitting speeds (35 fps). We improve upon the canonical CLM formulation, to gain this performance, in a number of ways by employing: (i) linear SVMs as patch experts, (ii) a simplified optimization criterion, and (iii) a composite rather than additive warp update step. Most notably, our simplified optimization criterion for fitting the CLM divides the problem of finding a single complex registration/warp displacement into that of finding N simple warp displacements. From these N simple warp displacements, a single complex warp displacement is estimated using a weighted least-squares constraint. Another major advantage of this simplified optimization stems from its ability to be parallelized, which we also explore theoretically in this paper. We refer to our approach for fitting the CLM as the "exhaustive local search" (ELS) algorithm. Experiments were conducted on the CMU Multi-PIE database.
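The fusion step singled out above, combining N simple local displacements into one global warp update via a weighted least-squares constraint, is a single linear solve. In the NumPy sketch below, the Jacobian, displacements, and confidence weights are random stand-ins for the quantities produced by the shape model and patch experts.

```python
# Weighted least-squares fusion of N local displacement estimates into one
# global warp update: p = (J^T W J)^{-1} J^T W dx.
import numpy as np

rng = np.random.default_rng(2)
n_points, n_params = 66, 10
J = rng.standard_normal((2 * n_points, n_params))   # warp Jacobian d(x)/d(p)
dx = rng.standard_normal(2 * n_points)              # local displacements
w = rng.random(2 * n_points)                        # patch-expert confidences

W = np.diag(w)
p_update = np.linalg.solve(J.T @ W @ J, J.T @ W @ dx)
print(p_update.shape)                               # (10,) global warp update
```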
Visual and Multimodal Analysis of Human Spontaneous Behavior: Introduction to the Special Issue of Image & Vision Computing Journal
Pantic M and Cohn JF
Modelling and Recognition of the Linguistic Components in American Sign Language
Ding L and Martinez AM
The manual signs in sign languages are generated and interpreted using three basic building blocks: handshape, motion, and place of articulation. When combined, these three components (together with palm orientation) uniquely determine the meaning of the manual sign. This means that the use of pattern recognition techniques that only employ a subset of these components is inappropriate for interpreting the sign or for building automatic recognizers of the language. In this paper, we define an algorithm to model these three basic components from a single video sequence of two-dimensional pictures of a sign. Recognition results for these three components are then combined to determine the class of the signs in the videos. Experiments are performed on a database of (isolated) American Sign Language (ASL) signs. The results demonstrate that, using semi-automatic detection, all three components can be reliably recovered from two-dimensional video sequences, allowing for an accurate representation and recognition of the signs.
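The final combination step, fusing the three component recognizers into a sign-level decision, can be illustrated with a toy score-fusion sketch. Treating the components as independent and summing per-sign log-probabilities is an assumption made here purely for illustration; the paper's actual combination rule may differ.

```python
# Toy fusion of handshape, motion, and place-of-articulation recognizers:
# per-sign log-probabilities from each component are summed and the best
# scoring sign class is selected. Scores here are synthetic.
import numpy as np

rng = np.random.default_rng(5)
n_signs = 10
log_p_handshape = np.log(rng.dirichlet(np.ones(n_signs)))
log_p_motion = np.log(rng.dirichlet(np.ones(n_signs)))
log_p_place = np.log(rng.dirichlet(np.ones(n_signs)))

combined = log_p_handshape + log_p_motion + log_p_place
print("recognized sign class:", int(combined.argmax()))
```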
The Painful Face - Pain Expression Recognition Using Active Appearance Models
Ashraf AB, Lucey S, Cohn JF, Chen T, Ambadar Z, Prkachin KM and Solomon PE
Pain is typically assessed by patient self-report. Self-reported pain, however, is difficult to interpret and may be impaired or in some circumstances (i.e., young children and the severely ill) not even possible. To circumvent these problems, behavioral scientists have identified reliable and valid facial indicators of pain. Hitherto, these methods have required manual measurement by highly skilled human observers. In this paper we explore an approach for automatically recognizing acute pain without the need for human observers. Specifically, our study was restricted to automatically detecting pain in adult patients with rotator cuff injuries. The system employed video input of the patients as they moved their affected and unaffected shoulders. Two types of ground truth were considered. Sequence-level ground truth consisted of Likert-type ratings by skilled observers. Frame-level ground truth was calculated from the presence/absence and intensity of facial actions previously associated with pain. Active appearance models (AAMs) were used to decouple shape and appearance in the digitized face images. Support vector machines (SVMs) were compared across several representations derived from the AAM and across ground truth of varying granularity. We explored two questions pertinent to the construction, design and development of automatic pain detection systems. First, at what level (i.e., sequence- or frame-level) should datasets be labeled in order to obtain satisfactory automatic pain detection performance? Second, how important is it, at both levels of labeling, that we non-rigidly register the face?
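The classification stage, an SVM over AAM-derived representations, maps naturally onto a short scikit-learn pipeline. The features and frame-level labels below are synthetic stand-ins, since the AAM shape/appearance decoupling itself is out of scope for a sketch.

```python
# Illustrative SVM classification over (synthetic stand-ins for) AAM-derived
# feature vectors with frame-level pain labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.standard_normal((400, 50))            # AAM shape+appearance features
y = (X[:, 0] + 0.5 * rng.standard_normal(400) > 0).astype(int)  # pain labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_tr, y_tr)
print(f"frame-level pain accuracy: {clf.score(X_te, y_te):.2f}")
```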
Figure-Ground Segmentation Using Factor Graphs
Shen H, Coughlan J and Ivanchenko V
Foreground-background segmentation has recently been applied [26,12] to the detection and segmentation of specific objects or structures of interest from the background as an efficient alternative to techniques such as deformable templates [27]. We introduce a graphical model (i.e., Markov random field) based formulation of structure-specific figure-ground segmentation built on simple geometric features extracted from an image, such as local configurations of linear features, that are characteristic of the desired figure structure. Our formulation is novel in that it is based on factor graphs, which are graphical models that encode interactions among arbitrary numbers of random variables. The ability of factor graphs to express interactions of higher than pairwise order (the highest order encountered in most graphical models used in computer vision) is useful for modeling a variety of pattern recognition problems. In particular, we show how this property makes factor graphs a natural framework for performing grouping and segmentation, and demonstrate that the factor graph framework emerges naturally from a simple maximum entropy model of figure-ground segmentation. We cast our approach in a learning framework, in which the contributions of multiple grouping cues are learned from training data, and apply our framework to the problem of finding printed text in natural scenes. Experimental results are described, including a performance analysis that demonstrates the feasibility of the approach.
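The key modeling point, factors that couple more than two variables at once, can be made concrete with a brute-force MAP computation over three binary figure/ground labels. The potentials below are made-up numbers purely to show a third-order factor alongside pairwise ones.

```python
# A tiny factor graph: two pairwise smoothness factors plus one third-order
# factor scoring a triple of binary labels jointly, which a purely pairwise
# MRF cannot express as a single potential. MAP found by exhaustive search.
import itertools

def pairwise_factor(a, b):
    return 1.5 if a == b else 0.5                  # smoothness prior

def triple_factor(a, b, c):
    return 2.0 if (a, b, c) == (1, 1, 1) else 1.0  # grouping cue on a triple

best = max(itertools.product([0, 1], repeat=3),
           key=lambda z: pairwise_factor(z[0], z[1]) *
                         pairwise_factor(z[1], z[2]) *
                         triple_factor(*z))
print("MAP labeling:", best)                       # (1, 1, 1)
```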
The role of image registration in brain mapping
Toga AW and Thompson PM
Image registration is a key step in a great variety of biomedical imaging applications. It provides the ability to geometrically align one dataset with another, and is a prerequisite for all imaging applications that compare datasets across subjects, imaging modalities, or across time. Registration algorithms also enable the pooling and comparison of experimental findings across laboratories, the construction of population-based brain atlases, and the creation of systems to detect group patterns in structural and functional imaging data. We review the major types of registration approaches used in brain imaging today. We focus on their conceptual basis, the underlying mathematics, and their strengths and weaknesses in different contexts. We describe the major goals of registration, including data fusion, quantification of change, automated image segmentation and labeling, shape measurement, and pathology detection. We indicate that registration algorithms have great potential when used in conjunction with a digital brain atlas, which acts as a reference system in which brain images can be compared for statistical analysis. The resulting armory of registration approaches is fundamental to medical image analysis, and in a brain mapping context provides a means to elucidate clinical, demographic, or functional trends in the anatomy or physiology of the brain.
Postnatal gestational age estimation of newborns using Small Sample Deep Learning
Torres Torres M, Valstar M, Henry C, Ward C and Sharkey D
A baby's gestational age determines whether or not they are premature, which helps clinicians decide on suitable postnatal treatment. The most accurate dating methods use Ultrasound Scan (USS) machines, but these are expensive, require trained personnel and cannot always be deployed to remote areas. In the absence of USS, the Ballard Score, a postnatal clinical examination, can be used. However, this method is highly subjective and results vary widely depending on the experience of the examiner. Our main contribution is a novel system for automatic postnatal gestational age estimation using small sets of images of a newborn's face, foot and ear. Our two-stage architecture makes the most of Convolutional Neural Networks trained on small sets of images to predict broad classes of gestational age, and then fuses the outputs of these discrete classes with a baby's weight to make fine-grained predictions of gestational age using Support Vector Regression. On a purpose-collected dataset of 130 babies, experiments show that our approach surpasses current automatic state-of-the-art postnatal methods and attains an expected error of 6 days. It is three times more accurate than the Ballard method. Making use of images improves predictions by 33% compared to using weight only. This indicates that, even with a very small set of data, our method is a viable candidate for postnatal gestational age estimation in areas where USS is not available.
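The second-stage fusion, feeding the coarse CNN class probabilities together with birth weight into Support Vector Regression, is easy to sketch with scikit-learn. All data below are synthetic, and the three-class granularity is an assumed stand-in for the paper's broad gestational-age classes.

```python
# Sketch of the fusion stage only: coarse gestational-age class probabilities
# (stand-ins for CNN outputs) are concatenated with birth weight and fed to
# SVR for a fine-grained age estimate in days.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
n = 130
probs = rng.dirichlet(np.ones(3), size=n)     # P(class) per baby, 3 classes
weight = rng.normal(2.8, 0.6, size=n)         # birth weight in kg
age_days = 200 + 80 * probs[:, 2] + 20 * weight + rng.normal(0, 5, n)

X = np.column_stack([probs, weight])          # fuse class probs with weight
reg = SVR(kernel="rbf", C=10.0).fit(X, age_days)
print(f"predicted gestational age: {reg.predict(X[:1])[0]:.0f} days")
```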