IEEE TRANSACTIONS ON MULTIMEDIA

Head Motion Modeling for Human Behavior Analysis in Dyadic Interaction
Xiao B, Georgiou P, Baucom B and Narayanan SS
This paper presents a computational study of head motion in human interaction, notably of its role in conveying interlocutors' behavioral characteristics. Head motion is physically complex and carries rich information; current modeling approaches based on visual signals, however, are still limited in their ability to adequately capture these important properties. Guided by the methodology of kinesics, we propose a data-driven approach to identify typical head motion patterns. The approach follows the steps of first segmenting motion events, then parametrically representing the motion by linear predictive features, and finally generalizing the motion types using Gaussian mixture models. The proposed approach is experimentally validated using video recordings of communication sessions from real couples involved in a couples therapy study. In particular, we use the head motion model to classify binarized expert judgments of the interactants' specific behavioral characteristics where entrainment in head motion is hypothesized to play a role. We achieve accuracies in the range of 60% to 70% for the various experimental settings and conditions. In addition, we describe a measure of motion similarity between the interaction partners based on the proposed model. We show that the relative change of head motion similarity during the interaction significantly correlates with the expert judgments of the interactants' behavioral characteristics. These findings demonstrate the effectiveness of the proposed head motion model, and underscore the promise of analyzing human behavioral characteristics through signal processing methods.
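The described pipeline (segment motion events, represent each with linear predictive features, generalize motion types with a Gaussian mixture model) can be illustrated with a minimal sketch. The LPC order, segment lengths, and number of mixture components below are illustrative assumptions, not the authors' settings, and the placeholder signals stand in for real head-motion tracks.

```python
# Minimal sketch: LPC features per motion segment, clustered with a GMM.
import numpy as np
from scipy.linalg import solve_toeplitz
from sklearn.mixture import GaussianMixture

def lpc_coefficients(signal, order=8):
    """Autocorrelation-method LPC for a 1-D motion signal (e.g. head pitch)."""
    signal = signal - signal.mean()
    acf = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    acf = acf / acf[0]
    # Solve the Toeplitz normal equations R a = r for the predictor coefficients.
    return solve_toeplitz((acf[:order], acf[:order]), acf[1:order + 1])

# Each row: LPC features of one automatically segmented motion event.
segments = [np.random.randn(120) for _ in range(200)]   # placeholder segments
features = np.vstack([lpc_coefficients(s, order=8) for s in segments])

# Generalize motion types as mixture components.
gmm = GaussianMixture(n_components=5, covariance_type="diag", random_state=0)
motion_types = gmm.fit_predict(features)
```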
Real-Time and Accurate UAV Pedestrian Detection for Social Distancing Monitoring in COVID-19 Pandemic
Shao Z, Cheng G, Ma J, Wang Z, Wang J and Li D
Coronavirus Disease 2019 (COVID-19) is a highly infectious disease that has created a health crisis for people all over the world. Social distancing has proved to be an effective non-pharmaceutical measure to slow down the spread of COVID-19. As the unmanned aerial vehicle (UAV) is a flexible mobile platform, it is a promising option for social distancing monitoring. Therefore, we propose a lightweight pedestrian detection network that accurately detects pedestrians through human head detection in real time and then calculates the social distance between pedestrians in UAV images. In particular, our network uses PeleeNet as the backbone and further incorporates multi-scale features and spatial attention to enhance the features of small objects, such as human heads. The experimental results on the Merge-Head dataset show that our method achieves 92.22% AP (average precision) and 76 FPS (frames per second), outperforming YOLOv3 and SSD models and enabling real-time detection in actual applications. The ablation experiments also indicate that the multi-scale features and spatial attention significantly contribute to the performance of pedestrian detection. The test results on the UAV-Head dataset show that our method can also achieve high-precision pedestrian detection on UAV images with 88.5% AP and 75 FPS. In addition, we have conducted a precision calibration test to obtain the transformation matrix from images (vertical images and tilted images) to real-world coordinates. Based on the accurate pedestrian detection and the transformation matrix, social distancing monitoring between individuals is reliably achieved.
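A minimal sketch of the distancing step, assuming a planar image-to-ground homography is available from a calibration test of the kind described above: project detected head centers to ground-plane coordinates and flag pairs closer than a threshold. The matrix H and the 1.0 m threshold are placeholder values, not figures from the paper.

```python
# Minimal sketch: map head detections through a homography and find close pairs.
import numpy as np

H = np.array([[0.02, 0.0, -5.0],       # image -> world homography (assumed values)
              [0.0, 0.02, -3.0],
              [0.0, 0.0,   1.0]])

def to_world(points_px, H):
    """Apply a homography to Nx2 pixel coordinates."""
    pts = np.hstack([points_px, np.ones((len(points_px), 1))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def close_pairs(points_px, H, min_dist=1.0):
    """Return index pairs of detections closer than min_dist metres on the ground."""
    world = to_world(np.asarray(points_px, dtype=float), H)
    dists = np.linalg.norm(world[:, None, :] - world[None, :, :], axis=-1)
    i, j = np.where(np.triu(dists < min_dist, k=1))
    return list(zip(i.tolist(), j.tolist()))

violations = close_pairs([[320, 240], [340, 250], [900, 600]], H)
```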
Cross-Referencing Self-Training Network for Sound Event Detection in Audio Mixtures
Park S, Han DK and Elhilali M
Sound event detection is an important facet of audio tagging that aims to identify sounds of interest and define both the sound category and time boundaries for each sound event in a continuous recording. With advances in deep neural networks, there has been tremendous improvement in the performance of sound event detection systems, although at the expense of costly data collection and labeling efforts. In fact, current state-of-the-art methods employ supervised training that leverages large amounts of data samples and corresponding labels in order to facilitate identification of sound categories and time stamps of events. As an alternative, the current study proposes a semi-supervised method for generating pseudo-labels from unsupervised data using a student-teacher scheme that balances self-training and cross-training. Additionally, this paper explores post-processing that extracts sound intervals from network predictions to further improve sound event detection performance. The proposed approach is evaluated on the sound event detection task of the DCASE2020 challenge. The results of these methods on both the "validation" and "public evaluation" sets of the DESED database show significant improvement compared to state-of-the-art systems in semi-supervised learning.
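The idea of post-processing that turns frame-wise network predictions into sound intervals can be sketched as follows, assuming simple median-filter smoothing and thresholding; the filter length, threshold, and hop size are illustrative choices rather than the paper's configuration.

```python
# Minimal sketch: extract (onset, offset) intervals from frame-wise probabilities.
import numpy as np
from scipy.ndimage import median_filter

def extract_intervals(frame_probs, threshold=0.5, filter_len=7, hop_s=0.064):
    """frame_probs: (n_frames,) predicted probabilities for one sound class."""
    smoothed = median_filter(frame_probs, size=filter_len)
    active = smoothed > threshold
    # Rising/falling edges of the binary activity curve mark onsets/offsets.
    edges = np.diff(active.astype(int), prepend=0, append=0)
    onsets = np.where(edges == 1)[0]
    offsets = np.where(edges == -1)[0]
    return [(on * hop_s, off * hop_s) for on, off in zip(onsets, offsets)]

probs = np.clip(np.sin(np.linspace(0, 6, 200)) + 0.2 * np.random.randn(200), 0, 1)
events = extract_intervals(probs)
```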
Indoor Camera Pose Estimation from Room Layouts and Image Outer Corners
Chen X and Fan G
To support indoor scene understanding, room layouts have recently been introduced that define a few typical space configurations according to junctions and boundary lines. In this paper, we study camera pose estimation from eight common room layouts with at least two boundary lines, which is cast as a Perspective-n-Line (PnL) problem. Specifically, the intersection points between image borders and room layout boundaries, named image outer corners (IOCs), are introduced to create additional auxiliary lines for PnL optimization. Therefore, a new PnL-IOC algorithm is proposed, with two implementations according to the room layout type. The first considers six layouts with more than two boundary lines, where 3D correspondence estimation of IOCs creates sufficient line correspondences for camera pose estimation. The second is an extended version that handles two challenging layouts with only two coplanar boundaries, where correspondence estimation of IOCs is ill-posed due to insufficient conditions. Thus, the NSGA-II algorithm is embedded in PnL-IOC to estimate the correspondences of IOCs. In the last step, the camera pose is jointly optimized with 3D correspondence refinement of IOCs using the iterative Gauss-Newton algorithm. Experimental results on both simulated and real images show the advantages of the proposed PnL-IOC method over existing PnL methods in terms of accuracy and robustness of camera pose estimation from eight different room layouts. The code is available at https://github.com/XiaoweiChenOSU/PnL-IOC.
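The IOC construction itself is plane geometry: intersect a layout boundary line with the four image borders and keep the intersections that fall inside the image. The sketch below illustrates this in homogeneous coordinates; the line endpoints and image size are placeholder values, and this is not the released PnL-IOC code.

```python
# Minimal sketch: image outer corners as border intersections of a boundary line.
import numpy as np

def line_from_points(p, q):
    """Homogeneous line through two 2-D points."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def image_outer_corners(p, q, width, height):
    """Intersections of line pq with the four image borders, kept if inside the image."""
    borders = [
        np.array([1.0, 0.0, 0.0]),            # x = 0 (left border)
        np.array([1.0, 0.0, -(width - 1)]),   # x = width - 1 (right border)
        np.array([0.0, 1.0, 0.0]),            # y = 0 (top border)
        np.array([0.0, 1.0, -(height - 1)]),  # y = height - 1 (bottom border)
    ]
    line = line_from_points(p, q)
    corners = []
    for b in borders:
        pt = np.cross(line, b)
        if abs(pt[2]) < 1e-9:                 # boundary parallel to this border
            continue
        x, y = pt[0] / pt[2], pt[1] / pt[2]
        if 0 <= x <= width - 1 and 0 <= y <= height - 1:
            corners.append((x, y))
    return corners

iocs = image_outer_corners((100, 50), (500, 400), width=640, height=480)
```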
Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA
Vosoughi A, Deng S, Zhang S, Tian Y, Xu C and Luo J
To increase the generalization capability of VQA systems, many recent studies have tried to de-bias spurious language or vision associations that shortcut the question or image to the answer. Despite these efforts, the literature fails to address the confounding effect of vision and language simultaneously. As a result, when these methods reduce bias learned from one modality, they usually increase bias from the other. In this paper, we first model a confounding effect that causes language and vision bias simultaneously, then propose a counterfactual inference to remove the influence of this effect. The model trained with this strategy can concurrently and efficiently reduce vision and language bias. To the best of our knowledge, this is the first work to reduce biases resulting from confounding effects of vision and language in VQA by leveraging causal explain-away relations. We accompany our method with an explain-away strategy that improves accuracy on questions with numerical answers, which has remained an open problem for existing methods. The proposed method outperforms state-of-the-art methods on the VQA-CP v2 dataset. Our code is available at https://github.com/ali-vosoughi/PW-VQA.
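A minimal sketch of counterfactual inference at answer time, assuming a nonlinear fusion of joint, language-only, and vision-only branch logits: the debiased score is the total effect minus the effect obtained when the joint branch receives a counterfactual blank input, so that only single-modality shortcuts remain. The branch names and fusion rule are assumptions for illustration, not the exact PW-VQA formulation.

```python
# Minimal sketch: subtract the counterfactual (shortcut-only) effect from the total effect.
import torch

def fuse(joint_logits, lang_logits, vis_logits, eps=1e-9):
    # Assumed nonlinear fusion: product of per-branch sigmoid scores in log space.
    z = torch.sigmoid(joint_logits) * torch.sigmoid(lang_logits) * torch.sigmoid(vis_logits)
    return torch.log(z + eps)

def debiased_scores(joint_logits, lang_logits, vis_logits):
    # Total effect: all branches see the real (question, image) pair.
    total_effect = fuse(joint_logits, lang_logits, vis_logits)
    # Counterfactual: the joint branch sees a "blank" input (zero logits),
    # leaving only the single-modality shortcut branches.
    counterfactual = fuse(torch.zeros_like(joint_logits), lang_logits, vis_logits)
    return total_effect - counterfactual

joint = torch.randn(4, 3000)   # batch of 4 questions, 3000 candidate answers
lang = torch.randn(4, 3000)    # language-only branch logits
vis = torch.randn(4, 3000)     # vision-only branch logits
answers = debiased_scores(joint, lang, vis).argmax(dim=-1)
```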
Support Vector Regression-based Reduced-Reference Perceptual Quality Model for Compressed Point Clouds
Su H, Liu Q, Yuan H, Cheng Q and Hamzaoui R
Video-based point cloud compression (V-PCC) is a state-of-the-art Moving Picture Experts Group (MPEG) standard for point cloud compression. V-PCC can be used to compress both static and dynamic point clouds in a lossless, near-lossless, or lossy way. Many objective quality metrics have been proposed for distorted point clouds. Most of these metrics are full-reference metrics that require both the original point cloud and the distorted one. However, in some real-time applications, the original point cloud is not available, and no-reference or reduced-reference quality metrics are needed. Three main challenges in the design of a reduced-reference quality metric are how to build a set of features that characterize the visual quality of the distorted point cloud, how to select the most effective features from this set, and how to map the selected features to a perceptual quality score. We address the first challenge by proposing a comprehensive set of features consisting of compression, geometry, normal, curvature, and luminance features. To deal with the second challenge, we use the least absolute shrinkage and selection operator (LASSO) method, which is a variable selection method for regression problems. Finally, we map the selected features to the mean opinion score in a nonlinear space. Although we have used only 19 features in our current implementation, our metric is flexible enough to allow any number of features, including more effective features identified in the future. Experimental results on the Waterloo point cloud dataset version 2 (WPC2.0) and the MPEG point cloud compression dataset (M-PCCD) show that our method, namely PCQAML, outperforms state-of-the-art full-reference and reduced-reference quality metrics in terms of Pearson linear correlation coefficient, Spearman rank-order correlation coefficient, Kendall's rank-order correlation coefficient, and root mean squared error.
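The two learning stages described above (LASSO-based feature selection followed by a nonlinear mapping to MOS) can be sketched as follows, using support vector regression as the nonlinear regressor in line with the title. The feature dimensionality, regularization strength, SVR hyperparameters, and the synthetic data are placeholder assumptions, not the paper's configuration.

```python
# Minimal sketch: LASSO feature selection, then SVR mapping to mean opinion scores.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                                   # candidate features per distorted cloud
mos = 3.0 + 0.8 * X[:, 0] - 0.5 * X[:, 3] + 0.1 * rng.normal(size=200)  # placeholder MOS

# Stage 1: variable selection with the LASSO (nonzero coefficients are kept).
selector = make_pipeline(StandardScaler(), Lasso(alpha=0.05))
selector.fit(X, mos)
keep = np.flatnonzero(selector.named_steps["lasso"].coef_ != 0)

# Stage 2: nonlinear mapping of the selected features to quality scores.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X[:, keep], mos)
predicted_quality = model.predict(X[:, keep])
```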