IEEE/ACM Transactions on Audio, Speech, and Language Processing

Deep Learning Based Real-time Speech Enhancement for Dual-microphone Mobile Phones
Tan K, Zhang X and Wang D
In mobile speech communication, speech signals can be severely corrupted by background noise when the far-end talker is in a noisy acoustic environment. To suppress background noise, speech enhancement systems are typically integrated into mobile phones, in which one or more microphones are deployed. In this study, we propose a novel deep learning based approach to real-time speech enhancement for dual-microphone mobile phones. The proposed approach employs a new densely-connected convolutional recurrent network to perform dual-channel complex spectral mapping. We utilize a structured pruning technique to compress the model without significantly degrading the enhancement performance, which yields a low-latency and memory-efficient enhancement system for real-time processing. Experimental results suggest that the proposed approach consistently outperforms an earlier approach to dual-channel speech enhancement for mobile phone communication, as well as a deep learning based beamformer.
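As an illustration of dual-channel complex spectral mapping (not the authors' densely-connected convolutional recurrent network), the toy PyTorch model below stacks the real and imaginary (RI) STFT components of the two microphones as four input feature maps and predicts the RI components of the enhanced speech at a reference microphone; all layer sizes and the architecture are illustrative assumptions.

import torch
import torch.nn as nn

class ToyDualChannelMapper(nn.Module):
    """Minimal conv-GRU sketch of dual-channel complex spectral mapping."""
    def __init__(self, n_freq=161, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=(1, 3), padding=(0, 1)), nn.ELU(),
            nn.Conv2d(16, 32, kernel_size=(1, 3), padding=(0, 1)), nn.ELU(),
        )
        self.rnn = nn.GRU(32 * n_freq, hidden, batch_first=True)   # causal over time frames
        self.decoder = nn.Linear(hidden, 2 * n_freq)               # RI of enhanced speech

    def forward(self, x):                   # x: (batch, 4, time, freq) = RI of two mics
        b, _, t, f = x.shape
        z = self.encoder(x)                 # (batch, 32, time, freq)
        z = z.permute(0, 2, 1, 3).reshape(b, t, -1)
        z, _ = self.rnn(z)
        out = self.decoder(z).reshape(b, t, 2, f)
        return out.permute(0, 2, 1, 3)      # (batch, 2, time, freq) = (real, imag)

# Usage: est_ri = ToyDualChannelMapper()(torch.randn(1, 4, 100, 161))
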
Towards Model Compression for Deep Learning Based Speech Enhancement
Tan K and Wang D
The use of deep neural networks (DNNs) has dramatically elevated the performance of speech enhancement over the last decade. However, achieving strong enhancement performance typically requires a large DNN, which is both memory- and computation-intensive, making it difficult to deploy such speech enhancement systems on devices with limited hardware resources or in applications with strict latency requirements. In this study, we propose two compression pipelines to reduce the model size for DNN-based speech enhancement, which incorporate three different techniques: sparse regularization, iterative pruning and clustering-based quantization. We systematically investigate these techniques and evaluate the proposed compression pipelines. Experimental results demonstrate that our approach reduces the sizes of four different models by large margins without significantly sacrificing their enhancement performance. In addition, we find that the proposed approach performs well on speaker separation, which further demonstrates the effectiveness of the approach for compressing speech separation models.
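The numpy sketch below illustrates two of the ingredients named above under simplifying assumptions: one-shot magnitude pruning of a single weight matrix followed by clustering-based (k-means) quantization of the surviving weights. The paper's pipelines are iterative and operate on full DNNs; the sparsity level and cluster count here are illustrative only.

import numpy as np

def magnitude_prune(w, sparsity=0.7):
    # Zero out the smallest-magnitude weights until the target sparsity is reached.
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

def kmeans_quantize(w, n_clusters=16, n_iters=20):
    # Replace each surviving (nonzero) weight by its nearest k-means centroid.
    nz = w[w != 0]
    centroids = np.quantile(nz, np.linspace(0, 1, n_clusters))   # simple init
    for _ in range(n_iters):
        assign = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centroids[k] = nz[assign == k].mean()
    assign = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
    q = w.copy()
    q[w != 0] = centroids[assign]   # in practice, store indices plus a small codebook
    return q, centroids

w = np.random.randn(256, 256)
w_compressed, codebook = kmeans_quantize(magnitude_prune(w, sparsity=0.7), n_clusters=16)
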
Towards Robust Speech Super-resolution
Wang H and Wang D
Speech super-resolution (SR) aims to increase the sampling rate of a given speech signal by generating high-frequency components. This paper proposes a convolutional neural network (CNN) based SR model that takes advantage of information from both time and frequency domains. Specifically, the proposed CNN is a time-domain model that takes the raw waveform of low-resolution speech as the input, and outputs an estimate of the corresponding high-resolution waveform. During the training stage, we employ a cross-domain loss to optimize the network. We compare our model with several deep neural network (DNN) based SR models, and experiments show that our model outperforms existing models. Furthermore, the robustness of DNN-based models is investigated, in particular regarding microphone channels and downsampling schemes, which have a major impact on the performance of DNN-based SR models. By training with proper datasets and preprocessing, we improve the generalization capability for untrained microphone channels and unknown downsampling schemes.
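As a concrete illustration of a cross-domain training objective of the kind described above (not necessarily the paper's exact loss), the sketch below combines a time-domain term with an STFT-magnitude term; the weighting and STFT settings are assumptions.

import torch

def cross_domain_loss(est_wave, ref_wave, n_fft=512, hop=128, alpha=0.5):
    # Time-domain term: mean absolute error on raw waveforms.
    time_loss = torch.mean(torch.abs(est_wave - ref_wave))
    # Frequency-domain term: mean absolute error on STFT magnitudes.
    win = torch.hann_window(n_fft, device=est_wave.device)
    est_mag = torch.stft(est_wave, n_fft, hop, window=win, return_complex=True).abs()
    ref_mag = torch.stft(ref_wave, n_fft, hop, window=win, return_complex=True).abs()
    freq_loss = torch.mean(torch.abs(est_mag - ref_mag))
    return alpha * time_loss + (1.0 - alpha) * freq_loss

# est = torch.randn(2, 16000, requires_grad=True); ref = torch.randn(2, 16000)
# cross_domain_loss(est, ref).backward()
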
Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation
Wang ZQ, Wang P and Wang D
We propose multi-microphone complex spectral mapping, a simple way of applying deep learning to time-varying non-linear beamforming, for speaker separation in reverberant conditions. We aim at both speaker separation and dereverberation. Our study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation (CSS). Assuming a fixed array geometry between training and testing, we train deep neural networks (DNNs) to predict the real and imaginary (RI) components of target speech at a reference microphone from the RI components of multiple microphones. We then integrate multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation, and combine it with frame-level speaker counting for block-online CSS. Although our system is trained on simulated room impulse responses (RIRs) based on a fixed number of microphones arranged in a given geometry, it generalizes well to a real array with the same geometry. State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
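The MVDR step mentioned above can be illustrated with the standard reference-microphone formulation below, where per-frequency speech and noise spatial covariance matrices (here assumed to be accumulated from the DNN's estimates) yield the beamforming weights; shapes and regularization constants are assumptions.

import numpy as np

def mvdr_weights(phi_s, phi_n, ref_mic=0):
    # phi_s, phi_n: (freq, mics, mics) speech and noise spatial covariance matrices.
    n_freq, n_mic, _ = phi_s.shape
    u = np.zeros(n_mic); u[ref_mic] = 1.0          # reference-microphone selector
    w = np.zeros((n_freq, n_mic), dtype=complex)
    for f in range(n_freq):
        num = np.linalg.solve(phi_n[f] + 1e-6 * np.eye(n_mic), phi_s[f])
        w[f] = (num / (np.trace(num) + 1e-8)) @ u   # MVDR weights for frequency f
    return w

# Apply per frequency f and frame t as: enhanced[f, t] = w[f].conj() @ mixture[f, :, t]
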
Audio object classification using distributed beliefs and attention
Bellur A and Elhilali M
One of the unique characteristics of human hearing is its ability to recognize acoustic objects even in the presence of severe noise and distortions. In this work, we explore two mechanisms underlying this ability: 1) redundant mapping of acoustic waveforms along distributed latent representations and 2) adaptive feedback based on prior knowledge to selectively attend to targets of interest. We propose a bio-mimetic account of acoustic object classification by developing a novel distributed deep belief network validated for the task of robust acoustic object classification using the UrbanSound database. The proposed distributed belief network (DBN) encompasses an array of independent sub-networks trained generatively to capture different abstractions of natural sounds. A supervised classifier then performs a readout of this distributed mapping. The overall architecture not only matches the state-of-the-art system for acoustic object classification but leads to significant improvement over the baseline in mismatched noisy conditions (31.4% relative improvement in 0 dB conditions). Furthermore, we incorporate mechanisms of attentional feedback that allow the DBN to deploy local memories of sound targets estimated at multiple views to bias network activation when attending to a particular object. This adaptive feedback results in further improvement of object classification in unseen noise conditions (relative improvement of 54% over the baseline in 0 dB conditions).
Glottal Airflow Estimation using Neck Surface Acceleration and Low-Order Kalman Smoothing
Morales A, Yuz JI, Cortés JP, Fontanet JG and Zañartu M
The use of non-invasive skin accelerometers placed over the extrathoracic trachea has been proposed in the literature for measuring vocal function. Glottal airflow is estimated using inverse filtering or Bayesian techniques based on a subglottal impedance-based model when utilizing these sensors. However, deviations in glottal airflow estimates can arise due to sensor positioning and model mismatch, and addressing them requires a significant computational load. In this paper, we utilize system identification techniques to obtain a low-order state-space representation of the subglottal impedance-based model. We then employ the resulting low-order model in a Kalman smoother to estimate the glottal airflow. Our proposed approach reduces the model order by 94% and requires only 1.5% of the computing time compared to previous Bayesian methods in the literature, while achieving slightly better accuracy when correcting for glottal airflow deviations. Additionally, our Kalman smoother approach provides a measure of uncertainty in the airflow estimate, which is valuable when measurements are taken under different conditions. With its comparable accuracy in signal estimation and reduced computational load, the proposed approach has the potential for real-time estimation of glottal airflow and its associated uncertainty in wearable voice ambulatory monitors using neck-surface acceleration.
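The estimation backbone described above is a fixed-interval Kalman (Rauch-Tung-Striebel) smoother. The generic numpy sketch below shows the forward filtering and backward smoothing passes for a model without exogenous input; the actual low-order subglottal model matrices (A, C, Q, R) come from the paper's system identification and are placeholders here.

import numpy as np

def kalman_smoother(y, A, C, Q, R, x0, P0):
    # Fixed-interval RTS smoother for x[k+1] = A x[k] + w,  y[k] = C x[k] + v.
    # y: (T, m) measurements; x0, P0: prior mean and covariance of the state.
    n, T = x0.size, len(y)
    xf = np.zeros((T, n)); Pf = np.zeros((T, n, n))   # filtered estimates
    xp = np.zeros((T, n)); Pp = np.zeros((T, n, n))   # one-step predictions
    x, P = x0, P0
    for k in range(T):                                # forward (filter) pass
        xp[k], Pp[k] = x, P
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)
        x = x + K @ (y[k] - C @ x)
        P = P - K @ C @ P
        xf[k], Pf[k] = x, P
        x, P = A @ x, A @ P @ A.T + Q                 # predict next step
    xs, Ps = xf.copy(), Pf.copy()
    for k in range(T - 2, -1, -1):                    # backward (smoothing) pass
        G = Pf[k] @ A.T @ np.linalg.inv(Pp[k + 1])
        xs[k] = xf[k] + G @ (xs[k + 1] - xp[k + 1])
        Ps[k] = Pf[k] + G @ (Ps[k + 1] - Pp[k + 1]) @ G.T
    return xs, Ps   # Ps provides the per-sample uncertainty mentioned in the abstract
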
Self-attending RNN for Speech Enhancement to Improve Cross-corpus Generalization
Pandey A and Wang D
Deep neural networks (DNNs) represent the mainstream methodology for supervised speech enhancement, primarily due to their capability to model complex functions using hierarchical representations. However, a recent study revealed that DNNs trained on a single corpus fail to generalize to untrained corpora, especially in low signal-to-noise ratio (SNR) conditions. Developing a noise-, speaker-, and corpus-independent speech enhancement algorithm is essential for real-world applications. In this study, we propose a self-attending recurrent neural network (SARNN) for time-domain speech enhancement to improve cross-corpus generalization. SARNN consists of recurrent neural networks (RNNs) augmented with self-attention blocks and feedforward blocks. We evaluate SARNN on different corpora with nonstationary noises in low SNR conditions. Experimental results demonstrate that SARNN substantially outperforms competitive approaches to time-domain speech enhancement, such as RNNs and dual-path SARNNs. Additionally, we report an important finding that the two popular approaches to speech enhancement, complex spectral mapping and time-domain enhancement, obtain similar results for RNN and SARNN with large-scale training. We also provide a challenging subset of the test set used in this study for evaluating future algorithms and facilitating direct comparisons.
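A toy PyTorch sketch of the kind of building block described above follows: an RNN layer augmented with a self-attention block and a feedforward block, with residual connections. Layer sizes and the exact block ordering are assumptions; this is not the authors' SARNN.

import torch
import torch.nn as nn

class SelfAttendingRNNBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                        # x: (batch, time, dim)
        r, _ = self.rnn(x)                       # recurrent modeling over time
        a, _ = self.attn(r, r, r, need_weights=False)
        h = r + a                                # residual around self-attention
        return h + self.ff(h)                    # residual around feedforward block

# Usage: y = SelfAttendingRNNBlock()(torch.randn(2, 100, 256))
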
Proportionate Adaptive Filtering Algorithms Derived Using an Iterative Reweighting Framework
Lee CH, Rao BD and Garudadri H
In this paper, based on sparsity-promoting regularization techniques from the sparse signal recovery (SSR) area, least mean square (LMS)-type sparse adaptive filtering algorithms are derived. The approach mimics iterative reweighting methods from SSR that majorize the regularized objective function during the optimization process. We show that introducing the majorizers leads to the same algorithm as simply using the gradient update of the regularized objective function, as is done in existing approaches. Unlike past work, the reweighting formulation naturally leads to an affine scaling transformation (AST) strategy, which effectively introduces a diagonal weighting on the gradient, giving rise to new algorithms that demonstrate improved convergence properties. Interestingly, setting the regularization coefficient to zero in the proposed AST-based framework leads to the Sparsity-promoting LMS (SLMS) and Sparsity-promoting Normalized LMS (SNLMS) algorithms, which exploit, but do not strictly enforce, the sparsity of the system response if it exists. The SLMS and SNLMS realize proportionate adaptation for convergence speedup should sparsity be present in the underlying system response. In this manner, we develop a new way of rigorously deriving a large class of proportionate algorithms, and also explain why they are useful in applications where the underlying systems admit certain sparsity, e.g., in acoustic echo and feedback cancellation.
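The sketch below illustrates a proportionate NLMS-style update in the spirit of the algorithms discussed above: a diagonal weighting derived from the current coefficient magnitudes (a simple reweighting) scales the gradient, which speeds convergence when the underlying response is sparse. It does not reproduce the paper's AST derivation; the step size and weighting rule are assumptions.

import numpy as np

def sparsity_promoting_nlms(x, d, n_taps=64, mu=0.5, eps=1e-3):
    # x: input signal, d: desired signal (e.g., microphone with echo).
    w = np.zeros(n_taps)            # adaptive filter coefficients
    buf = np.zeros(n_taps)          # input tap-delay line
    for n in range(len(x)):
        buf = np.roll(buf, 1); buf[0] = x[n]
        e = d[n] - w @ buf                       # a priori error
        g = np.abs(w) + eps                      # reweighting-derived diagonal weighting
        g = g / g.sum() * n_taps                 # normalize the scaling
        denom = (g * buf) @ buf + eps
        w = w + mu * e * g * buf / denom         # proportionate normalized update
    return w

# Example use: identify a sparse echo path h from input x and observation d = x * h + noise.
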
Low-Latency Active Noise Control Using Attentive Recurrent Network
Zhang H, Pandey A and Wang D
Processing latency is a critical issue for active noise control (ANC) due to the causality constraint of ANC systems. This paper addresses low-latency ANC in the context of deep learning (i.e., deep ANC). A time-domain method using an attentive recurrent network (ARN) is employed to perform deep ANC with smaller frame sizes, thus reducing the algorithmic latency of deep ANC. In addition, we introduce delay-compensated training, which performs ANC using noise predicted several milliseconds ahead. Moreover, a revised overlap-add method is utilized during signal resynthesis to avoid the latency introduced by overlaps between neighboring time frames. Experimental results show the effectiveness of the proposed strategies for achieving low-latency deep ANC. Combining the proposed strategies yields zero or even negative algorithmic latency without substantially affecting ANC performance, thus alleviating the causality constraint in ANC design.
Consonant-Vowel Transition Models Based on Deep Learning for Objective Evaluation of Articulation
Mathad VC, Liss JM, Chapman K, Scherer N and Berisha V
Spectro-temporal dynamics of consonant-vowel (CV) transition regions are considered to provide robust cues related to articulation. In this work, we propose an objective measure of precise articulation, dubbed the objective articulation measure (OAM), by analyzing the CV transitions segmented around vowel onsets. The OAM is derived from the posteriors of a convolutional neural network pre-trained to discriminate between different consonants using CV regions as input. We demonstrate that the OAM is correlated with perceptual measures in a variety of contexts, including (a) adult dysarthric speech, (b) the speech of children with cleft lip/palate, and (c) a database of accented English speech from native Mandarin and Spanish speakers.
Attentive Training: A New Training Framework for Speech Enhancement
Pandey A and Wang D
Dealing with speech interference in a speech enhancement system requires either speaker separation or target speaker extraction. Speaker separation has multiple output streams with arbitrary assignments, while target speaker extraction requires additional cueing for speaker selection. Neither is suitable for a standalone speech enhancement system with one output stream. In this study, we propose a novel training framework, called attentive training, to extend speech enhancement to deal with speech interruptions. Attentive training is based on the observation that, in the real world, multiple talkers are very unlikely to start speaking at the same time, and therefore a deep neural network can be trained to create a representation of the first speaker and utilize it to attend to or track that speaker in a multitalker noisy mixture. We present experimental results and comparisons to demonstrate the effectiveness of attentive training for speech enhancement.
Robust Vocal Quality Feature Embeddings for Dysphonic Voice Detection
Zhang J, Liss J, Jayasuriya S and Berisha V
Approximately 1.2% of the world's population has impaired voice production. As a result, automatic dysphonic voice detection has attracted considerable academic and clinical interest. However, existing methods for automated voice assessment often fail to generalize outside the training conditions or to other related applications. In this paper, we propose a deep learning framework for generating acoustic feature embeddings sensitive to vocal quality and robust across different corpora. A contrastive loss is combined with a classification loss to train our deep learning model jointly. Data warping methods are used on input voice samples to improve the robustness of our method. Empirical results demonstrate that our method not only achieves high in-corpus and cross-corpus classification accuracy but also generates good embeddings sensitive to voice quality and robust across different corpora. We also compare our results against three baseline methods on clean and three variations of deteriorated in-corpus and cross-corpus datasets and demonstrate that the proposed model consistently outperforms the baseline methods.
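A sketch of jointly training with a classification loss and a contrastive loss on embeddings, as described above, follows; the pairwise contrastive form, margin, and weighting are assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def joint_loss(embeddings, logits, labels, margin=1.0, lam=0.5):
    # Classification term (labels: integer class indices).
    ce = F.cross_entropy(logits, labels)
    # Pairwise contrastive term on the embedding space.
    dist = torch.cdist(embeddings, embeddings)               # pairwise distances
    same = (labels[:, None] == labels[None, :]).float()
    pos = same * dist.pow(2)                                  # pull same-class pairs together
    neg = (1 - same) * F.relu(margin - dist).pow(2)           # push different classes apart
    mask = 1 - torch.eye(len(labels), device=dist.device)     # exclude self-pairs
    contrastive = ((pos + neg) * mask).sum() / mask.sum()
    return ce + lam * contrastive

# emb = torch.randn(8, 64); logits = torch.randn(8, 2); y = torch.randint(0, 2, (8,))
# joint_loss(emb, logits, y)
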
Bilateral Cochlear Implant Processing of Coding Strategies With CCi-MOBILE, an Open-Source Research Platform
Ghosh R and Hansen JHL
While speech understanding for cochlear implant (CI) users in quiet is relatively effective, listeners experience difficulty in identifying the speaker and sound location. To better exploit residual hearing abilities and support speech intelligibility, bilateral and bimodal forms of assisted hearing are becoming popular among CI users. Effective bilateral processing calls for precise algorithm synchronization and fitting between the left and right ear channels in order to capture interaural time and level difference cues (ITDs and ILDs). This work demonstrates bilateral implant algorithm processing using a custom-made CI research platform, CCi-MOBILE, which is capable of capturing precise source localization information and supports researchers in testing bilateral CI processing in real-time naturalistic environments. Simulation-based, objective, and subjective testing has been performed to validate the accuracy of the platform. The subjective test results produced an RMS error of ±8.66° for source localization, which is comparable to the performance of commercial CI processors.
Dense CNN with Self-Attention for Time-Domain Speech Enhancement
Pandey A and Wang D
Speech enhancement in the time domain has become increasingly popular in recent years, due to its capability to jointly enhance both the magnitude and the phase of speech. In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is an encoder-decoder based architecture with skip connections. Each layer in the encoder and the decoder comprises a dense block and an attention module. Dense blocks and attention modules help in feature extraction using a combination of feature reuse, increased network depth, and maximum context aggregation. Furthermore, we reveal previously unknown problems with a loss based on the spectral magnitude of enhanced speech. To alleviate these problems, we propose a novel loss based on the magnitudes of enhanced speech and a predicted noise. Even though the proposed loss is based on magnitudes only, a constraint imposed by noise prediction ensures that the loss enhances both magnitude and phase. Experimental results demonstrate that DCN trained with the proposed loss substantially outperforms other state-of-the-art approaches to causal and non-causal speech enhancement.
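A hedged sketch of a magnitude-domain loss with a noise-prediction constraint in the spirit described above: the noise estimate is formed as the difference between the noisy input and the enhanced output, and magnitude losses are applied to both the speech and noise estimates. STFT settings and equal weighting are assumptions, not the paper's exact recipe.

import torch

def _mag(x, n_fft=512, hop=128):
    # STFT magnitude of a batch of waveforms.
    win = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft, hop, window=win, return_complex=True).abs()

def speech_plus_noise_magnitude_loss(enhanced, noisy, clean):
    noise_est = noisy - enhanced                 # predicted noise (time domain)
    noise_ref = noisy - clean                    # true noise
    speech_term = torch.mean(torch.abs(_mag(enhanced) - _mag(clean)))
    noise_term = torch.mean(torch.abs(_mag(noise_est) - _mag(noise_ref)))
    return speech_term + noise_term
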
Neural Cascade Architecture with Triple-domain Loss for Speech Enhancement
Wang H and Wang D
This paper proposes a neural cascade architecture to address the monaural speech enhancement problem. The cascade architecture is composed of three modules, which successively optimize the enhanced speech with respect to the magnitude spectrogram, the time-domain signal, and the complex spectrogram. Each module takes as input the noisy speech and the output of the previous module, and generates a prediction of the respective target. Our model is trained in an end-to-end manner, using a triple-domain loss function that accounts for the three domains of signal representation. Experimental results on the WSJ0 SI-84 corpus show that the proposed model outperforms other strong speech enhancement baselines in terms of objective speech quality and intelligibility.
Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization
Pal M, Kumar M, Peri R, Park TJ, Kim SH, Lord C, Bishop S and Narayanan S
Most speaker diarization systems based on x-vector embeddings are vulnerable to noisy environments and lack domain robustness. Earlier work on speaker diarization using a generative adversarial network (GAN) with an encoder network (ClusterGAN) to project input x-vectors into a latent space has shown promising performance on meeting data. In this paper, we extend the ClusterGAN network to improve diarization robustness and enable rapid generalization across various challenging domains. To this end, we take the pre-trained encoder from the ClusterGAN and fine-tune it with a prototypical loss (meta-ClusterGAN or MCGAN) under the meta-learning paradigm. Experiments are conducted on CALLHOME telephonic conversations, AMI meeting data, the DIHARD-II development set, which is a challenging multi-domain corpus, and two child-clinician interaction corpora (ADOS, BOSCC) related to the autism spectrum disorder domain. Extensive analyses of the experimental data are conducted to investigate the effectiveness of the proposed ClusterGAN and MCGAN embeddings over x-vectors. The results show that the proposed embeddings with a normalized maximum eigengap spectral clustering (NME-SC) back-end consistently outperform the Kaldi state-of-the-art x-vector diarization system. Finally, we employ embedding fusion with x-vectors to provide further improvement in diarization performance. We achieve a relative diarization error rate (DER) improvement of 6.67% to 53.93% on the aforementioned datasets using the proposed fused embeddings over x-vectors. In addition, the MCGAN embeddings outperform x-vectors and ClusterGAN in estimating the number of speakers and in diarizing short speech segments on telephonic conversations.
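A minimal sketch of a prototypical loss of the standard form used in meta-learning follows (class prototypes are support-set means; query embeddings are classified by negative distance to the prototypes). It shows the loss only, not the ClusterGAN encoder or the fine-tuning protocol.

import torch
import torch.nn.functional as F

def prototypical_loss(support, support_labels, query, query_labels):
    # support/query: (n, dim) embeddings; labels: integer class ids.
    classes = support_labels.unique()
    protos = torch.stack([support[support_labels == c].mean(0) for c in classes])
    dists = torch.cdist(query, protos)                       # (n_query, n_classes)
    targets = torch.stack([(classes == y).nonzero().squeeze() for y in query_labels])
    return F.cross_entropy(-dists, targets)                  # closer prototype = higher logit

# s = torch.randn(20, 32); sl = torch.randint(0, 4, (20,))
# q = torch.randn(8, 32);  ql = torch.randint(0, 4, (8,))
# prototypical_loss(s, sl, q, ql)
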
Fusing Bone-conduction and Air-conduction Sensors for Complex-Domain Speech Enhancement
Wang H, Zhang X and Wang D
Speech enhancement aims to improve the listening quality and intelligibility of noisy speech in adverse environments. It proves to be challenging to perform speech enhancement in very low signal-to-noise ratio (SNR) conditions. Conventional speech enhancement utilizes air-conduction (AC) microphones, which are sensitive to background noise but capable of capturing full-band signals. On the other hand, bone-conduction (BC) sensors are unaffected by acoustic noise, but the recorded speech has limited bandwidth. This study proposes an attention-based fusion method to combine the strengths of AC and BC signals and perform complex spectral mapping for speech enhancement. Experiments on the EMSB dataset demonstrate that the proposed approach effectively leverages the advantages of AC and BC sensors, and outperforms a recent time-domain baseline in all conditions. We also show that the sensor fusion method is superior to single-sensor counterparts, especially in low SNR conditions. As the amount of BC data is very limited, we additionally propose a semi-supervised technique to utilize both parallel and non-parallel recordings of AC and BC speech signals. With additional AC speech from the AISHELL-1 dataset, we achieve performance similar to supervised learning with only 50% parallel data.
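A toy sketch of attention-based fusion of AC and BC feature streams follows: per-frame, per-frequency weights are predicted from the concatenated features and used to blend the two streams. The actual model fuses within a complex spectral mapping network; the gating form and dimensions here are assumptions.

import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    def __init__(self, feat_dim=161):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, ac_feat, bc_feat):         # each: (batch, time, feat_dim)
        a = self.gate(torch.cat([ac_feat, bc_feat], dim=-1))   # attention weights in [0, 1]
        return a * ac_feat + (1 - a) * bc_feat                 # weighted blend of AC and BC

# fused = AttentiveFusion()(torch.randn(2, 100, 161), torch.randn(2, 100, 161))
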
Speech Intelligibility Prediction using Spectro-Temporal Modulation Analysis
Edraki A, Chan WY, Jensen J and Fogerty D
Spectro-temporal modulations are believed to mediate the analysis of speech sounds in the human primary auditory cortex. Inspired by humans' robustness in comprehending speech in challenging acoustic environments, we propose an intrusive speech intelligibility prediction (SIP) algorithm, wSTMI, for normal-hearing listeners based on spectro-temporal modulation analysis (STMA) of the clean and degraded speech signals. In the STMA, each of 55 modulation frequency channels contributes an intermediate intelligibility measure. A sparse linear model with parameters optimized using Lasso regression results in combining the intermediate measures of 8 of the most salient channels for SIP. In comparison with a suite of 10 SIP algorithms, wSTMI performs consistently well across 13 datasets, which together cover degradation conditions including modulated noise, noise reduction processing, reverberation, near-end listening enhancement, and speech interruption. We show that the optimized parameters of wSTMI may be interpreted in terms of modulation transfer functions of the human auditory system. Thus, the proposed approach offers evidence affirming previous studies of the perceptual characteristics underlying speech signal intelligibility.
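A sketch of the final combination stage described above: per-channel intermediate intelligibility measures are combined by a sparse linear model fit with Lasso regression. The spectro-temporal modulation analysis itself is not reproduced; the arrays below are random placeholders standing in for precomputed channel measures and listening-test scores.

import numpy as np
from sklearn.linear_model import Lasso

# X: (n_conditions, 55) intermediate per-channel measures; y: measured intelligibility.
X = np.random.rand(200, 55)      # placeholder for STMA channel outputs
y = np.random.rand(200)          # placeholder for listening-test scores

model = Lasso(alpha=0.01).fit(X, y)          # sparsity selects the most salient channels
selected = np.flatnonzero(model.coef_)       # e.g., around 8 channels in the paper
predicted_intelligibility = model.predict(X[:5])
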
Microscopic and Blind Prediction of Speech Intelligibility: Theory and Practice
Karbasi M, Zeiler S and Kolossa D
Being able to estimate speech intelligibility without the need for listening tests would confer great benefits for a wide range of speech processing applications. Many attempts have therefore been made to introduce an objective, and ideally reference-free, measure for this purpose. Most works analyze speech intelligibility prediction (SIP) methods from a macroscopic point of view, averaging over longer time spans. This paper, in contrast, presents a theoretical framework for the microscopic evaluation of SIP methods. Within our framework, a Statistically estimated Accuracy based on Theory (StAT) is derived, which numerically quantifies the statistical limitations inherent in microscopic SIP. A state-of-the-art approach to microscopic SIP, namely the use of automatic speech recognition (ASR) to directly predict listening test results, is evaluated within this framework. The practical results are in good agreement with the theory. As the final contribution, a fully blind DIscriminative Speech intelligibility Predictor (DISP) is introduced and is also evaluated within the StAT framework. It is shown that this novel, blind estimator can predict intelligibility as well as, and often with better accuracy than, the non-blind ASR-based approach, and that its results are again in good agreement with its theoretically derived performance potential.
The Temporal Limits Encoder as a Sound Coding Strategy for Bilateral Cochlear Implants
Kan A and Meng Q
The difference in binaural benefit between bilateral cochlear implant (CI) users and normal-hearing (NH) listeners has typically been attributed to CI sound coding strategies not encoding the acoustic fine structure (FS) interaural time differences (ITDs). The Temporal Limits Encoder (TLE) strategy is proposed as a potential way of improving binaural hearing benefits for CI users in noisy situations. TLE works by downward transposition of mid-frequency band-limited channel information and can theoretically provide FS-ITD cues. In this work, the effect of the choice of lower limit of the modulator in TLE was examined by measuring performance on a word recognition task and computing the magnitude of binaural benefit in bilateral CI users. Listening performance with the TLE strategy was compared with that of the commonly used Advanced Combinational Encoder (ACE) CI sound coding strategy. Results showed that setting the lower limit to ≥200 Hz maintained word recognition performance comparable to that of ACE. While most CI listeners exhibited a large binaural benefit (≥6 dB) in at least one of the conditions tested, there was no systematic relationship between the lower limit of the modulator and performance. These results indicate that the TLE strategy has the potential to improve binaural hearing abilities in CI users, but further work is needed to understand how binaural benefit can be maximized.
Selective Acoustic Feature Enhancement for Speech Emotion Recognition With Noisy Speech
Leem SG, Fulford D, Onnela JP, Gard D and Busso C
A speech emotion recognition (SER) system deployed in a real-world application can encounter speech contaminated with unconstrained background noise. To deal with this issue, a speech enhancement (SE) module can be attached to the SER system to compensate for the environmental differences of an input. Although the SE module can improve the quality and intelligibility of a given speech signal, there is a risk of affecting discriminative acoustic features for SER that are resilient to environmental differences. Exploring this idea, we propose to enhance only the weak features that degrade emotion recognition performance. Our model first identifies weak feature sets by using multiple models, each trained with one acoustic feature at a time using clean speech. After training the single-feature models, we rank each speech feature by measuring three criteria: performance, robustness, and a joint rank that combines performance and robustness. We group the weak features by cumulatively incrementing the features from the bottom to the top of each rank. Once the weak feature set is defined, we only enhance those weak features, keeping the resilient features unchanged. We implement these ideas with low-level descriptors (LLDs). We show that directly enhancing the weak LLDs leads to better performance than extracting LLDs from an enhanced speech signal. Our experiments with clean and noisy versions of the MSP-Podcast corpus show that the proposed approach yields performance gains of 17.7% (arousal), 21.2% (dominance), and 3.3% (valence) over a system that enhances all the LLDs for the 10 dB signal-to-noise ratio (SNR) condition.