Speechformer-CTC: Sequential Modeling of Depression Detection with Speech Temporal Classification
Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, leading to low generalizability and robustness across different scenarios. Moreover, depression corpora do not provide fine-grained labels (at the phoneme or word level), which makes the dynamic depression pattern within speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, to model non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function, without the need for input-output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated into objectives at various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that the performance of depression detection, in terms of Macro F1-score, is improved on both the DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using the HuBERT policy for label generation achieves an 83.15% F1-score, which is close to the state of the art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns, and that the proposed framework can potentially be used to determine significant depressive regions in speech utterances.
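A minimal sketch, not the paper's implementation, of how a single segment-level depression label can be scored against frame-level encoder outputs with a CTC objective so that no frame-by-frame alignment is required. The class indices, repeat count, and tensor shapes below are assumptions for illustration only.

```python
# Hypothetical CTC-over-segment-labels sketch; label inventory and shapes are assumed.
import torch
import torch.nn as nn

BLANK, NON_DEP, DEP = 0, 1, 2                   # assumed label inventory
ctc_loss = nn.CTCLoss(blank=BLANK, zero_infinity=True)

batch, frames, n_classes = 4, 200, 3
log_probs = torch.randn(frames, batch, n_classes).log_softmax(dim=-1)  # stand-in for encoder output

# The segment label is repeated a fixed number of times per utterance, a crude
# stand-in for the paper's label-generation policies.
segment_labels = torch.tensor([DEP, NON_DEP, DEP, NON_DEP])
repeats = 5
targets = segment_labels.unsqueeze(1).repeat(1, repeats)               # (batch, repeats)

input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), repeats, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(float(loss))
```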
Post-Processing Automatic Transcriptions with Machine Learning for Verbal Fluency Scoring
The aim of this study was to compare verbal fluency scores derived from manual transcriptions with those obtained using automatic speech recognition enhanced with machine learning classifiers.
Influence of chromium and sodium on development, physiology, and anatomy of Conilon coffee seedlings
Some components of tannery sludge are plant nutrients, so the sludge can be considered an alternative fertilization source with favorable agronomic characteristics. However, some studies report that the chromium and sodium present in this residue cause physiological and anatomical disturbances that inhibit plant development. The objective of this study was to evaluate the influence of chromium and sodium on the physiology, anatomy, and development of Conilon coffee seedlings grown on substrates produced with tannery sludge and with equivalent doses of chromium and sodium. The experiment was carried out in a nursery using a randomized block design with 5 treatments and 7 replications. The treatments consisted of the application of a 40% tannery sludge dose and of equivalent doses of chromium and sodium mixed with a conventional substrate. Notably, the presence of sodium in the substrate caused greater damage to the plants, negatively influencing their physiology, anatomy, and, consequently, development, whereas chromium appeared to have little influence on the evaluated characteristics. The treatment with tannery sludge, on the other hand, despite containing the same chromium and sodium contents, had a more pronounced negative influence on the physiology, anatomy, and development of the seedlings. This indicates that sodium and chromium alone are not the only factors responsible for the reduced growth indicators observed.
Audibility emphasis of low-level sounds improves consonant identification while preserving vowel identification for cochlear implant users
Consonant perception is challenging for listeners with hearing loss, and transmission of speech over communication channels further deteriorates the acoustics of consonants. Part of the challenge arises from the short-term, low-energy spectro-temporal profile of consonants (for example, relative to vowels). We hypothesized that an audibility-enhancement approach aimed at boosting the energy of low-level sounds would improve identification of consonants without diminishing vowel identification. We tested this hypothesis with 11 cochlear implant users, who completed an online listening experiment remotely using the media device and implant settings that they most commonly use when making video calls. Loudness growth and detection thresholds were measured for pure-tone stimuli to characterize the relative loudness of the test conditions. Consonant and vowel identification were measured in quiet and in speech-shaped noise at progressively more difficult signal-to-noise ratios (+12, +6, 0, -6 dB SNR). These conditions were tested with and without an audibility-emphasis algorithm designed to enhance consonant identification at the source. The results show that the algorithm improves consonant identification in noise for cochlear implant users without diminishing vowel identification. We conclude that low-level emphasis of audio can improve speech recognition for cochlear implant users in video calls or other telecommunications where the target speech can be preprocessed separately from environmental noise.
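For illustration only, a simple low-level emphasis sketch that boosts frames whose RMS falls below a threshold, approximating the idea of raising the audibility of weak consonant cues. The frame size, threshold, and maximum gain are assumptions and do not reproduce the study's algorithm.

```python
# Hypothetical low-level emphasis: boost quiet frames toward a target level.
import numpy as np

def emphasize_low_levels(x, sr, frame_ms=10.0, threshold_db=-35.0, max_gain_db=12.0):
    frame = int(sr * frame_ms / 1000)
    y = x.astype(float).copy()
    for start in range(0, len(y) - frame, frame):
        seg = y[start:start + frame]                      # view into y, modified in place
        rms_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)
        if rms_db < threshold_db:
            gain_db = min(max_gain_db, threshold_db - rms_db)
            seg *= 10 ** (gain_db / 20)
    return np.clip(y, -1.0, 1.0)

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
x = 0.02 * np.sin(2 * np.pi * 300 * t)                    # deliberately low-level tone
x_emphasized = emphasize_low_levels(x, sr)
```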
Nonlinear waveform distortion: Assessment and detection of clipping on speech data and systems
Speech, speaker, and language systems have traditionally relied on carefully collected speech material for training acoustic models. Today, however, there is an enormous amount of freely accessible audio content. A major challenge is that such data is not professionally recorded and therefore may contain a wide diversity of background noise, nonlinear distortions, or other unknown environmental or technology-based contamination or mismatch. There is a crucial need for automatic analysis to screen such unknown datasets before acoustic model development and training, or to perform input audio purity screening prior to classification. In this study, we propose a waveform-based clipping detection algorithm for naturalistic audio streams and examine the impact of clipping at different severities on speech quality measurements and automatic speaker recognition systems. We use the TIMIT and NIST SRE08 corpora as case studies. The results show, as expected, that clipping introduces a nonlinear distortion into clean speech data, which reduces speech quality and performance for speaker recognition. We also investigate what degree of clipping can be present while still sustaining effective speech system performance. The proposed detection system, which will be released, could support massive new audio collections for speech and language technology development (e.g., Google Audioset (Gemmeke et al., 2017) and CRSS-UTDallas Apollo Fearless-Steps (Yu et al., 2014), 19,000 h of naturalistic audio from NASA Apollo missions).
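A bare-bones clipping detector sketch, not the algorithm proposed in the paper: it flags a waveform when many samples sit at or near the amplitude ceiling or when long flat runs occur at the extremes. The thresholds and the toy signal are assumptions.

```python
# Hypothetical waveform clipping score based on ceiling-pinned samples.
import numpy as np

def clipping_score(x, near=0.99):
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x)) + 1e-12
    at_ceiling = np.abs(x) >= near * peak
    frac = at_ceiling.mean()                     # fraction of samples pinned near the ceiling
    run, longest = 0, 0                          # longest run of consecutive pinned samples
    for flag in at_ceiling:
        run = run + 1 if flag else 0
        longest = max(longest, run)
    return frac, longest

tone = np.sin(2 * np.pi * np.arange(16000) * 220 / 16000)
clipped = np.clip(3.0 * tone, -1.0, 1.0)         # hard-clipped test tone
frac, longest_run = clipping_score(clipped)
print(f"pinned fraction={frac:.3f}, longest flat run={longest_run}")
```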
Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations
Classification of infant and parent vocalizations, particularly emotional vocalizations, is critical to understanding how infants learn to regulate emotions in social dyadic processes. This work is an experimental study of classifiers, features, and data augmentation strategies applied to the task of classifying infant and parent vocalization types. Our data were recorded both in the home and in the laboratory. Infant vocalizations were manually labeled as cry, fus (fuss), lau (laugh), bab (babble) or scr (screech), while parent (mostly mother) vocalizations were labeled as ids (infant-directed speech), ads (adult-directed speech), pla (playful), rhy (rhythmic speech or singing), lau (laugh) or whi (whisper). Linear discriminant analysis (LDA) was selected as a baseline classifier, because it gave the highest accuracy in a previously published study covering part of this corpus. LDA was compared to two neural network architectures: a two-layer fully-connected network (FCN), and a convolutional neural network with self-attention (CNSA). Baseline features extracted using the OpenSMILE toolkit were augmented by extra voice quality, phonetic, and prosodic features, each targeting perceptual features of one or more of the vocalization types. Three data augmentation and transfer learning methods were tested: pre-training of network weights for a related task (adult emotion classification), augmentation of under-represented classes using data uniformly sampled from other corpora, and augmentation of under-represented classes using data selected by a minimum cross-corpus information difference criterion. Feature selection using Fisher scores and experiments with weighted and unweighted samplers were also conducted. Two datasets were evaluated: a benchmark dataset (CRIED) and our own corpus. In terms of unweighted average recall (UAR) on the CRIED dataset, the CNSA achieved the best result compared with previous studies. In terms of classification accuracy, weighted F1, and macro F1 on our own dataset, the neural networks both significantly outperformed LDA; the FCN slightly (but not significantly) outperformed the CNSA. Cross-examining features selected by different feature selection algorithms permits a type of post-hoc feature analysis, in which the most important acoustic features for each binary type discrimination are listed. Examples of each vocalization type were selected, and their spectrograms are presented and discussed with respect to the type-discriminative acoustic features selected by the various algorithms. MFCC, log Mel Frequency Band Energy, LSP frequency, and F1 are found to be the most important spectral envelope features; F0 is found to be the most important prosodic feature.
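A sketch of the LDA baseline under stated assumptions: the feature matrix and vocalization labels below are random placeholders standing in for OpenSMILE features and the corpus labels, not the study's data.

```python
# Hypothetical LDA baseline on a placeholder feature matrix.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 88))          # stand-in for an OpenSMILE-style feature set
y = rng.integers(0, 5, size=500)        # stand-in for cry / fuss / laugh / babble / screech

lda = LinearDiscriminantAnalysis()
scores = cross_val_score(lda, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```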
Oral configurations during vowel nasalization in English
Speech nasalization is achieved primarily through the opening and closing of the velopharyngeal port. However, the resultant acoustic features can also be influenced by tongue configuration. Although vowel nasalization is not contrastive in English, two previous studies have found possible differences in the oral articulation of nasal and oral vowel productions, albeit with inconsistent results. In an attempt to further understand the conflicting findings, we evaluated the oral kinematics of nasalized and non-nasalized vowels in a cohort of both male and female American English speakers via electromagnetic articulography. Tongue body and lip positions were captured during vowels produced in nasal and oral contexts (e.g., /mɑm/, /bɑb/). Large contrasts were seen in all participants between tongue position of /æ/ in oral and nasal contexts, in which tongue positions were higher and more forward during /mæm/ than /bæb/. Lip aperture was smaller in a nasal context for /æ/. Lip protrusion was not different between vowels in oral and nasal contexts. Smaller contrasts in tongue and lip position were seen for vowels /ɑ, i, u/; this is consistent with biomechanical accounts of vowel production that suggest that /i, u/ are particularly constrained, whereas /æ/ has fewer biomechanical constraints, allowing for more flexibility for articulatory differences in different contexts. Thus we conclude that speakers of American English do indeed use different oral configurations for vowels that are in nasal and oral contexts, despite vowel nasalization being non-contrastive. This effect was consistent across speakers for only one vowel, perhaps accounting for previously-conflicting results.
Analysis of Glottal Inverse Filtering in the Presence of Source-Filter Interaction
The validity of glottal inverse filtering (GIF) to obtain a glottal flow waveform from the radiated pressure signal in the presence and absence of source-filter interaction was studied systematically. A driven vocal fold surface model was used to generate source signals. A one-dimensional wave reflection algorithm was used to solve for acoustic pressures in the vocal tract. Several test signals were generated with and without source-filter interaction at various fundamental frequencies and vowels. Linear Predictive Coding (LPC), Quasi Closed Phase (QCP), and Quadratic Programming (QPR) based algorithms, along with the supraglottal impulse response, were used to inverse filter the radiated pressure signals to obtain the glottal flow pulses. The accuracy of each algorithm was tested for its recovery of maximum flow declination rate (MFDR), peak glottal flow, open phase ripple factor, closed phase ripple factor, and mean squared error. The algorithms were also tested for their absolute relative errors of the Normalized Amplitude Quotient, the Quasi-Open Quotient, and the Harmonic Richness Factor. The results indicated that the mean squared error decreased as the source-filter interaction level increased, suggesting that the inverse filtering algorithms perform better in the presence of source-filter interaction. All glottal inverse filtering algorithms predicted the open phase ripple factor better than the closed phase ripple factor of a glottal flow waveform, irrespective of the source-filter interaction level. Major prediction errors occurred in the estimation of the closed phase ripple factor, MFDR, peak glottal flow, Normalized Amplitude Quotient, and Quasi-Open Quotient. Feedback-related nonlinearity (source-filter interaction) affected the recovered signal primarily when the fundamental frequency was well below the first formant frequency of a vowel. The prediction error increased when the fundamental frequency was close to the first formant frequency due to the difficulty of estimating the precise value of resonance frequencies, which was exacerbated by nonlinear kinetic losses in the vocal tract.
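A textbook LPC inverse-filtering sketch, one of the GIF variants named above (QCP and QPR are not shown). The autocorrelation solve and the toy "vowel" frame are my own illustration, not the authors' code.

```python
# Hypothetical LPC analysis and inverse filtering of a toy frame.
import numpy as np
from scipy.signal import lfilter

def lpc_inverse_filter_coeffs(frame, order):
    # autocorrelation method: solve the normal equations for predictor coefficients
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    return np.concatenate(([1.0], -a))          # inverse filter A(z)

sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 700 * t)  # toy "vowel"
frame = speech[:480] * np.hamming(480)

A = lpc_inverse_filter_coeffs(frame, order=12)
residual = lfilter(A, [1.0], frame)             # rough glottal-source estimate
```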
Temporal envelope cues and simulations of cochlear implant signal processing
Conventional signal processing implemented on clinical cochlear implant (CI) sound processors is based on envelope signals extracted from overlapping frequency regions. Conventional strategies do not encode temporal envelope or temporal fine-structure cues with high fidelity. In contrast, several research strategies have been developed recently to enhance the encoding of temporal envelope and fine-structure cues. The present study examines the salience of temporal envelope cues when encoded into vocoder representations of CI signal processing. Normal-hearing listeners were evaluated on measures of speech reception, speech quality ratings, and spatial hearing when listening to vocoder representations of CI signal processing. Conventional vocoder techniques using envelope signals with noise- or tone-excited reconstruction were evaluated in comparison to a novel approach based on impulse-response reconstruction. A variation of this impulse-response approach was based on a research strategy, the Fundamentally Asynchronous Stimulus Timing (FAST) algorithm, designed to improve the temporal precision of envelope cues. The results indicate that the introduced impulse-response approach, combined with the FAST algorithm, produces similar results on speech reception measures as the conventional vocoder approaches, while providing significantly better sound quality and spatial hearing outcomes. This novel approach for simulating how temporal envelope cues are encoded into CI stimulation has potential for examining diverse aspects of hearing, particularly musical pitch perception and spatial hearing.
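A generic noise-excited channel-vocoder sketch to illustrate envelope-based CI simulation; the band edges, filter orders, and envelope cutoff are assumptions and do not reproduce the study's impulse-response or FAST-based conditions.

```python
# Hypothetical noise-excited vocoder: bandpass analysis, envelope extraction, noise carriers.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def noise_vocoder(x, sr, edges=(100, 400, 1000, 2400, 6000), env_cut=50.0):
    rng = np.random.default_rng(0)
    out = np.zeros_like(x, dtype=float)
    env_sos = butter(2, env_cut / (sr / 2), btype="low", output="sos")
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo / (sr / 2), hi / (sr / 2)], btype="band", output="sos")
        band = sosfiltfilt(band_sos, x)
        env = np.maximum(sosfiltfilt(env_sos, np.abs(band)), 0.0)   # smoothed envelope
        carrier = sosfiltfilt(band_sos, rng.normal(size=len(x)))    # band-limited noise
        out += env * carrier
    return out / (np.max(np.abs(out)) + 1e-12)

sr = 16000
x = np.sin(2 * np.pi * 150 * np.arange(sr) / sr)                    # placeholder input
y = noise_vocoder(x, sr)
```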
The Prosodic Marionette: a method to visualize speech prosody and assess perceptual and expressive prosodic abilities
Speech technology applications have emerged as a promising method for assessing speech-language abilities, including prosody, and for at-home therapy. Many applications assume that observed prosody errors are due to an underlying disorder; however, they may instead be due to atypical realizations of prosody arising from immature and developing speech motor control, or from compensatory adaptations by those with congenital neuromotor disorders. The result is the same: vocal productions may not be a reliable measure of prosody knowledge. Therefore, in this study we examine the usability of a new technology application, the Prosodic Marionette (PM) graphical user interface for artificial resynthesis of speech prosody, which allows users to express prosody knowledge without relying on vocalizations. We tested the ability of neurotypical participants to use the PM interface to control prosody through 2D movements of word-icon blocks vertically (fundamental frequency), horizontally (pause length), and by stretching (word duration) to correctly mark target prosodic contrasts. Nearly all participants used vertical movements to correctly mark fundamental frequency changes where appropriate (e.g., raised second word for pitch accent on second word). A smaller percentage of participants used the stretching feature to mark duration changes; when used, participants correctly lengthened the appropriate word (e.g., stretched the second item to accent the second word). Our results suggest the PM interface can be used reliably to correctly signal speech prosody, which validates future use of the interface to assess prosody in clinical and developmental populations with atypical speech motor control.
Where Has All the Power Gone? Energy Production and Loss in Vocalization
Human voice production for speech is an inefficient process in terms of energy expended to produce acoustic output. A traditional measure of vocal efficiency relates acoustic power radiated from the mouth to aerodynamic power produced in the trachea. This efficiency ranges between 0.001% and 1.0% in speech-like vocalization. Simplified Navier-Stokes equations for non-steady compressible airflow from trachea to lips were used to calculate steady aerodynamic power, acoustic power, and combined total power at seven strategic locations along the airway. A portion of the airway was allowed to collapse to produce self-sustained oscillation for sound production. A glottal efficiency, defined as the ratio of acoustic power generated in the glottis to aerodynamic power dissipated, was found to be on the order of 10%, but wall vibration, air viscosity, and kinetic pressure losses consumed almost all of that power. This sound, reflected back and forth in the airway, was dissipated at a level on the order of 99.9%.
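A compact restatement of the two efficiency measures described above, using my own symbols (a sketch of the definitions implied by the abstract, not the paper's notation):

```latex
% Vocal efficiency (mouth-radiated vs. tracheal aerodynamic power) and glottal
% efficiency (glottal acoustic power vs. aerodynamic power dissipated).
\[
  \eta_{\text{vocal}} \;=\; \frac{P_{\text{radiated}}}{P_{\text{aero,trachea}}}
  \;\approx\; 10^{-5}\text{--}10^{-2},
  \qquad
  \eta_{\text{glottal}} \;=\; \frac{P_{\text{acoustic,glottis}}}{P_{\text{aero,dissipated}}}
  \;\approx\; 0.1 .
\]
```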
SEDA: A tunable Q-factor wavelet-based noise reduction algorithm for multi-talker babble
We introduce a new wavelet-based algorithm to enhance the quality of speech corrupted by multi-talker babble noise. The algorithm comprises three stages: The first stage classifies short frames of the noisy speech as speech-dominated or noise-dominated. We design this classifier specifically for multi-talker babble noise. The second stage performs preliminary de-noising of noisy speech frames using oversampled wavelet transforms and parallel group thresholding. The final stage performs further denoising by attenuating residual high-frequency components in the signal produced by the second stage. A significant improvement in intelligibility and quality was observed in evaluation tests of the algorithm with cochlear implant users.
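A sketch of wavelet-domain denoising under stated assumptions: an ordinary discrete wavelet transform with soft thresholding stands in for the paper's oversampled tunable-Q transform and parallel group thresholding.

```python
# Hypothetical DWT soft-thresholding denoiser (illustration, not SEDA).
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db8", level=4):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745          # noise estimate from finest scale
    thr = sigma * np.sqrt(2 * np.log(len(x)))                # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:len(x)]

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 4096))
noisy = clean + 0.3 * rng.normal(size=clean.size)
denoised = wavelet_denoise(noisy)
```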
The development and validation of the Closed-set Mandarin Sentence (CMS) test
Matrix-styled sentence tests offer a closed-set paradigm that may be useful when evaluating speech intelligibility. Ideally, sentence test materials should reflect the distribution of phonemes within the target language. We developed and validated the Closed-set Mandarin Sentence (CMS) test to assess Mandarin speech intelligibility in noise. CMS test materials were selected to be familiar words and to represent the natural distribution of vowels, consonants, and lexical tones found in Mandarin Chinese. Ten key words in each of five categories (Name, Verb, Number, Color, and Fruit) were produced by a native Mandarin talker, resulting in a total of 50 words that could be combined to produce 100,000 unique sentences. Normative data were collected in 10 normal-hearing, adult Mandarin-speaking Chinese listeners using a closed-set test paradigm. Two test runs were conducted for each subject, and 20 sentences per run were randomly generated while ensuring that each word was presented only twice in each run. First, the levels of the words in each category were adjusted to produce equal intelligibility in noise. Test-retest reliability for word-in-sentence recognition was excellent according to Cronbach's alpha (0.952). After the category level adjustments, speech reception thresholds (SRTs) for sentences in noise, defined as the signal-to-noise ratio (SNR) that produced 50% correct whole-sentence recognition, were adaptively measured by adjusting the SNR according to the correctness of response. The mean SRT was -7.9 (SE=0.41) and -8.1 (SE=0.34) dB for runs 1 and 2, respectively. The mean standard deviation across runs was 0.93 dB, and paired t-tests showed no significant difference between runs 1 and 2 (p=0.74) despite random sentences being generated for each run and each subject. The results suggest that the CMS provides a large stimulus set with which to repeatedly and reliably measure Mandarin-speaking listeners' speech understanding in noise using a closed-set paradigm.
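A simplified 1-down/1-up adaptive SNR track, given only to illustrate how an SRT near 50% correct whole-sentence recognition can be measured; the step size, trial count, and simulated listener are assumptions, not the CMS procedure.

```python
# Hypothetical adaptive SNR tracking toward the 50%-correct point.
import numpy as np

def measure_srt(p_correct, start_snr=0.0, step=2.0, trials=20, rng=None):
    rng = rng or np.random.default_rng(0)
    snr, track = start_snr, []
    for _ in range(trials):
        track.append(snr)
        correct = rng.random() < p_correct(snr)
        snr += -step if correct else step        # harder after a correct response
    return np.mean(track[-10:])                  # crude SRT estimate from the late trials

# Simulated listener: 50% whole-sentence recognition near -8 dB SNR.
logistic = lambda snr: 1.0 / (1.0 + np.exp(-(snr + 8.0)))
print(round(measure_srt(logistic), 1))
```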
An acoustically-driven vocal tract model for stop consonant production
The purpose of this study was to further develop a multi-tier model of the vocal tract area function in which the modulations of shape to produce speech are generated by the product of a vowel substrate and a consonant superposition function. The new approach consists of specifying input parameters for a target consonant as a set of directional changes in the resonance frequencies of the vowel substrate. Using calculations of acoustic sensitivity functions, these "resonance deflection patterns" are transformed into time-varying deformations of the vocal tract shape without any direct specification of the location or extent of the consonant constriction along the vocal tract. The configurations of the constrictions and expansions generated by this process were shown to be physiologically realistic and to produce speech sounds that are easily identifiable as the target consonants. This model is a useful enhancement for area function-based synthesis and can serve as a tool for understanding how the vocal tract is shaped by a talker during speech production.
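One commonly used form of the acoustic sensitivity relation, stated here in my own notation as a sketch (it may differ in detail from the paper's formulation): the fractional shift of resonance frequency produced by small area perturbations along the tract is weighted by the kinetic-minus-potential energy density of that resonance.

```latex
% Sensitivity-function perturbation relation (assumed standard form).
\[
  \frac{\Delta F_n}{F_n} \;\approx\; \sum_{i}
  S_n(x_i)\,\frac{\Delta A(x_i)}{A(x_i)},
  \qquad
  S_n(x_i) \;=\; \frac{KE_n(x_i) - PE_n(x_i)}{KE_{n,\mathrm{tot}} + PE_{n,\mathrm{tot}}}.
\]
```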
Noise Perturbation for Supervised Speech Separation
Speech separation can be treated as a mask estimation problem, where interference-dominant portions are masked in a time-frequency representation of noisy speech. In supervised speech separation, a classifier is typically trained on a mixture set of speech and noise. It is important to efficiently utilize limited training data to make the classifier generalize well. When target speech is severely corrupted by nonstationary noise, a classifier tends to mistake noise patterns for speech patterns. Expanding the noise set through proper perturbation during training exposes the classifier to a broader variety of noisy conditions, and hence may lead to better separation performance. This study examines three noise perturbations for supervised speech separation at low signal-to-noise ratios (SNRs): noise-rate, vocal-tract-length, and frequency perturbation. The speech separation performance is evaluated in terms of classification accuracy, hit minus false-alarm rate, and short-time objective intelligibility (STOI). The experimental results show that frequency perturbation is the best among the three perturbations in terms of separation performance. In particular, the results show that frequency perturbation is effective in reducing the error of misclassifying a noise pattern as a speech pattern.
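A sketch of frequency perturbation as a training-time augmentation, under stated assumptions: each frame of a magnitude spectrogram is resampled along the frequency axis by a small random warp. This is an illustration, not the paper's exact perturbation.

```python
# Hypothetical per-frame frequency warping of a magnitude spectrogram.
import numpy as np

def frequency_perturb(spec, max_shift=0.05, rng=None):
    """spec: (n_freq, n_frames) magnitude spectrogram."""
    rng = rng or np.random.default_rng(0)
    n_freq, n_frames = spec.shape
    bins = np.arange(n_freq)
    out = np.empty_like(spec)
    for t in range(n_frames):
        warp = 1.0 + rng.uniform(-max_shift, max_shift)     # small random stretch/compress
        out[:, t] = np.interp(bins, bins * warp, spec[:, t])
    return out

spec = np.abs(np.random.default_rng(1).normal(size=(257, 100)))
perturbed = frequency_perturb(spec)
```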
Generating Tonal Distinctions in Mandarin Chinese Using an Electrolarynx with Preprogrammed Tone Patterns
An electrolarynx (EL) is a valuable rehabilitative option for individuals who have undergone laryngectomy, but current monotone ELs do not support controlled variations in fundamental frequency for producing tonal languages. The present study examined the production and perception of Mandarin Chinese using a customized hand-held EL driven by computer software to generate tonal distinctions (tonal EL). Four native Mandarin speakers were trained to articulate their speech in synchrony with preprogrammed tonal patterns in order to produce mono- and di-syllabic words with a monotone EL and a tonal EL. Three native Mandarin speakers later transcribed and rated the speech samples for intelligibility and acceptability. Results indicated that words produced using the tonal EL were significantly more intelligible and acceptable than those produced using the monotone EL.
Cry-based infant pathology classification using GMMs
Traditional studies of infant cry signals have focused mostly on non-pathology-based classification of infants. In this paper, we introduce a noninvasive health care system that performs acoustic analysis of noisy, unprocessed infant cry signals to quantitatively extract and measure certain cry characteristics and to classify healthy and sick newborn infants from their cries alone. In this newborn cry-based diagnostic system, dynamic features along with static Mel-Frequency Cepstral Coefficients (MFCCs) are extracted from both expiratory and inspiratory cry vocalizations to produce a discriminative and informative feature vector. Next, we create a unique cry pattern for each cry vocalization type and pathological condition by using the Boosting Mixture Learning (BML) method to derive healthy and pathological subclass models separately from a Gaussian Mixture Model-Universal Background Model (GMM-UBM). Our newborn cry-based diagnostic system (NCDS) has a hierarchical scheme that is a tree-like combination of individual classifiers. Moreover, a score-level fusion of the proposed expiratory and inspiratory cry-based subsystems is performed to make a more reliable decision. The experimental results indicate that the adapted BML method has lower error rates than reference methods such as the Bayesian approach and maximum a posteriori probability (MAP) adaptation.
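A minimal sketch of GMM-based cry classification with MFCC-like features; the features and labels are synthetic placeholders, and a per-class GaussianMixture stands in for the paper's GMM-UBM with Boosted Mixture Learning adaptation.

```python
# Hypothetical two-class GMM likelihood comparison over MFCC-like frames.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
healthy_frames = rng.normal(0.0, 1.0, size=(1000, 13))   # stand-in MFCC frames
sick_frames = rng.normal(0.7, 1.2, size=(1000, 13))

gmm_healthy = GaussianMixture(n_components=8, covariance_type="diag").fit(healthy_frames)
gmm_sick = GaussianMixture(n_components=8, covariance_type="diag").fit(sick_frames)

def classify(frames):
    # compare average per-frame log-likelihood under each class model
    return "sick" if gmm_sick.score(frames) > gmm_healthy.score(frames) else "healthy"

print(classify(rng.normal(0.7, 1.2, size=(200, 13))))
```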
Using Automatic Speech Recognition to Assess Spoken Responses to Cognitive Tests of Semantic Verbal Fluency
Cognitive tests of verbal fluency (VF) consist of verbalizing as many words as possible in one minute that either start with a specific letter of the alphabet or belong to a specific semantic category. These tests are widely used in neurological, psychiatric, mental health, and school settings, and their validity for clinical applications has been extensively demonstrated. However, VF tests are currently administered and scored manually, making them too cumbersome to use, particularly for longitudinal cognitive monitoring in large populations. The objective of the current study was to determine if automatic speech recognition (ASR) could be used for computerized administration and scoring of VF tests. We examined established techniques for constraining language modeling to a predefined vocabulary from a specific semantic category (e.g., animals). We also experimented with post-processing ASR output with confidence scoring, as well as with using speaker adaptation to improve automated VF scoring. Audio responses to a VF task were collected from 38 novice and experienced professional fighters (boxing and mixed martial arts) participating in a longitudinal study of the effects of repetitive head trauma on brain function. Word error rate, correlation with the manual word count, and distance from the manual word count were used to compare the ASR-based scoring approaches with each other and with the manually scored reference standard. Our results show that responses to the VF task contain a large number of extraneous utterances and noise that lead to relatively poor baseline ASR performance. However, we also found that speaker adaptation combined with confidence scoring significantly improves all three metrics and can enable the use of ASR for reliable estimates of the traditional manual VF scores.
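A sketch of confidence-based post-processing for verbal-fluency scoring: count only in-vocabulary words recognized above a confidence threshold, crediting each word once. The hypothesis format, vocabulary, and threshold are assumptions for illustration, not the study's system.

```python
# Hypothetical confidence-filtered semantic fluency count.
ANIMAL_VOCAB = {"dog", "cat", "horse", "lion", "tiger", "zebra"}

def fluency_score(hypotheses, min_conf=0.6):
    """hypotheses: list of (word, confidence) tuples from an ASR decoder."""
    counted, score = set(), 0
    for word, conf in hypotheses:
        word = word.lower()
        if conf >= min_conf and word in ANIMAL_VOCAB and word not in counted:
            counted.add(word)        # repetitions are not credited
            score += 1
    return score

hyp = [("dog", 0.95), ("uh", 0.30), ("cat", 0.81), ("cat", 0.77), ("lion", 0.55)]
print(fluency_score(hyp))            # -> 2 ("lion" falls below the threshold)
```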
A Method of Speech Periodicity Enhancement Using Transform-domain Signal Decomposition
Periodicity is an important property of speech signals. It underlies the signal's fundamental frequency and the pitch of the voice, which are crucial to speech communication. This paper presents a novel framework of periodicity enhancement for noisy speech. The enhancement is applied to the linear prediction residual of the speech. The residual signal goes through a constant-pitch time warping process and two sequential lapped-frequency transforms, by which the periodic component is concentrated in certain transform coefficients. By emphasizing these transform coefficients, periodicity enhancement of the noisy residual signal is achieved. The enhanced residual signal and estimated linear prediction filter parameters are used to synthesize the output speech. An adaptive algorithm is proposed for adjusting the weights for the periodic and aperiodic components. The effectiveness of the proposed approach is demonstrated via experimental evaluation. It is observed that the harmonic structure of the original speech can be properly restored, improving the perceptual quality of the enhanced speech.
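A sketch of constant-pitch time warping, one building block named above: each pitch cycle is resampled to a common length so the periodic component lines up across cycles before the subsequent transforms. The pitch marks here come from crude zero-crossing detection on a synthetic signal, not from the paper's method.

```python
# Hypothetical constant-pitch warping of a signal with slowly changing pitch.
import numpy as np

def warp_to_constant_pitch(x, pitch_marks, period_len=160):
    cycles = []
    for start, end in zip(pitch_marks[:-1], pitch_marks[1:]):
        cycle = x[start:end]
        t_old = np.linspace(0.0, 1.0, num=len(cycle))
        t_new = np.linspace(0.0, 1.0, num=period_len)
        cycles.append(np.interp(t_new, t_old, cycle))      # resample cycle to fixed length
    return np.concatenate(cycles)

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * (110 + 20 * t) * t)                 # slowly rising pitch
marks = np.where(np.diff(np.signbit(x).astype(int)) == -1)[0]   # rough cycle boundaries
warped = warp_to_constant_pitch(x, marks)
```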
Determining the relevance of different aspects of formant contours to intelligibility
Previous studies have shown that "clear" speech, where the speaker intentionally tries to enunciate, has better intelligibility than "conversational" speech, which is produced in regular conversation. However, conversational and clear speech vary along a number of acoustic dimensions, and it is unclear what aspects of clear speech lead to better intelligibility. Previously, Kain et al. [J. Acoust. Soc. Am. (4), 2308-2319 (2008)] showed that a combination of short-term spectra and duration was responsible for the improved intelligibility of one speaker. This study investigates subsets of specific features of the short-term spectra, including temporal aspects. Similar to Kain's study, hybrid stimuli were synthesized with a combination of features from clear speech and complementary features from conversational speech to determine which acoustic features cause the improved intelligibility of clear speech. Our results indicate that, although steady-state formant values of tense vowels contributed to the intelligibility of clear speech, neither the steady-state portion nor the formant transition was sufficient to yield intelligibility comparable to that of clear speech. In contrast, when the entire formant contour of conversational speech, including the phoneme duration, was replaced by that of clear speech, intelligibility was comparable to that of clear speech. This indicates that the combination of formant contour and duration information was relevant to the improved intelligibility of clear speech. The study provides a better understanding of the relevance of different aspects of formant contours to the improved intelligibility of clear speech.
Formant measurement in children's speech based on spectral filtering
Children's speech presents a challenging problem for formant frequency measurement. In part, this is because high fundamental frequencies, typical of children's speech production, generate widely spaced harmonic components that may undersample the spectral shape of the vocal tract transfer function. In addition, there is often a weakening of upper harmonic energy and a noise component due to glottal turbulence. The purpose of this study was to develop a formant measurement technique based on cepstral analysis that does not require modification of the cepstrum itself or transformation back to the spectral domain. Instead, a narrow-band spectrum is low-pass filtered with a cutoff point (i.e., cutoff "quefrency" in the terminology of cepstral analysis) to preserve only the spectral envelope. To test the method, speech representative of a 2-3 year-old child was simulated with an airway modulation model of speech production. The model, which includes physiologically-scaled vocal folds and vocal tract, generates sound output analogous to a microphone signal. The vocal tract resonance frequencies can be calculated independently of the output signal and thus provide test cases that allow for assessing the accuracy of the formant tracking algorithm. When applied to the simulated child-like speech, the spectral filtering approach was shown to provide a clear spectrographic representation of formant change over the time course of the signal and to facilitate tracking of formant frequencies for further analysis.
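A small sketch of the spectral-filtering idea under stated assumptions: the log magnitude spectrum of a frame is low-pass filtered along the frequency axis (a quefrency cutoff) so that only the slowly varying envelope, and hence candidate formant peaks, remains. The cutoff value and the toy frame are placeholders, not the study's settings.

```python
# Hypothetical spectral-envelope extraction by low-pass filtering the log spectrum.
import numpy as np
from scipy.signal import butter, sosfiltfilt, find_peaks

sr, n_fft = 16000, 2048
t = np.arange(n_fft) / sr
frame = np.hamming(n_fft) * np.sin(2 * np.pi * 300 * t)      # placeholder voiced frame

log_spec = 20 * np.log10(np.abs(np.fft.rfft(frame, n_fft)) + 1e-12)

# Treat frequency as the filtering axis; the normalized cutoff acts as a quefrency cutoff.
sos = butter(4, 0.03, btype="low", output="sos")
envelope = sosfiltfilt(sos, log_spec)

peaks, _ = find_peaks(envelope)                              # candidate formant locations
formant_hz = peaks * sr / n_fft
```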