The Mason-Alberta Phonetic Segmenter: a forced alignment system based on deep neural networks and interpolation
Given an orthographic transcription, forced alignment systems automatically determine boundaries between segments in speech, facilitating the use of large corpora. In the present paper, we introduce a neural network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS). MAPS serves as a testbed for two possible improvements we pursue for forced alignment systems. The first is treating the acoustic model as a tagger, rather than a classifier, motivated by the common understanding that segments are not truly discrete and often overlap. The second is an interpolation technique to allow more precise boundaries than the typical 10 ms limit in modern systems. During testing, all system configurations we trained significantly outperformed the state-of-the-art Montreal Forced Aligner at the 10 ms boundary placement tolerance threshold. The greatest difference achieved was a 28.13% relative performance increase. The Montreal Forced Aligner began to slightly outperform our models at around a 30 ms tolerance. We also reflect on the training process for acoustic modeling in forced alignment, highlighting how the output targets for these models do not match phoneticians' conception of similarity between phones and that reconciling this tension may require rethinking the task and output targets or how speech itself should be segmented.
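The evaluation terms in this abstract (accuracy at a boundary placement tolerance, and a relative performance increase) can be illustrated with a minimal sketch. It assumes that accuracy is the share of predicted boundaries falling within the tolerance of the corresponding hand-placed boundary, and that the relative increase is the accuracy difference divided by the baseline accuracy; the abstract does not spell out either definition. The function names and all boundary times below are hypothetical and are not taken from the paper.

```python
def tolerance_accuracy(predicted_ms, reference_ms, tolerance_ms=10.0):
    """Share of predicted boundaries within tolerance_ms of the reference.

    Assumes the two lists are already paired one-to-one (same segments,
    same order); this pairing is an assumption of the sketch.
    """
    if len(predicted_ms) != len(reference_ms):
        raise ValueError("boundary lists must have the same length")
    if not reference_ms:
        raise ValueError("at least one boundary is required")
    hits = sum(
        1 for p, r in zip(predicted_ms, reference_ms)
        if abs(p - r) <= tolerance_ms
    )
    return hits / len(reference_ms)


def relative_increase(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


# Hypothetical boundary times in milliseconds (illustration only).
reference = [120.0, 245.0, 390.0, 512.0]
system_a = [124.0, 251.0, 402.0, 515.0]   # absolute errors: 4, 6, 12, 3 ms
system_b = [135.0, 250.0, 398.0, 530.0]   # absolute errors: 15, 5, 8, 18 ms

acc_a = tolerance_accuracy(system_a, reference, tolerance_ms=10.0)
acc_b = tolerance_accuracy(system_b, reference, tolerance_ms=10.0)
print(acc_a, acc_b, relative_increase(acc_a, acc_b))
```

Raising `tolerance_ms` (for example to 30.0) shows how the ranking of two systems can change with the threshold, which is the kind of crossover the abstract reports.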
Vertical larynx actions and intergestural timing stability in Hausa ejectives and implosives
The current project undertakes a kinematic examination of vertical larynx actions and intergestural timing stability within multi-gesture complex segments such as ejectives and implosives that may possess specific temporal goals critical to their articulatory realization. Using real-time MRI (rtMRI) speech production data from Hausa non-pulmonic and pulmonic consonants, this study illuminates speech timing between oral constriction and vertical larynx actions within segments and the role this intergestural timing plays in realizing phonological contrasts and processes in varying prosodic contexts. Results suggest that vertical larynx actions have greater magnitude in the production of ejectives compared to their pulmonic counterparts, but implosives and pulmonic consonants are differentiated not by vertical larynx magnitude but by the intergestural timing patterns between their oral and vertical larynx gestures. Moreover, intergestural timing stability/variability between oral and non-oral (vertical larynx) actions differs among ejectives, implosives, and pulmonic consonants, with ejectives having the most stable temporal lags, followed by implosives and pulmonic consonants, respectively. Lastly, the findings show how contrastive linguistic 'molecules' - here, segment-sized phonological complexes with multiple gestures - interact with phrasal context in speech in a way that variably shapes the temporal organization between participating gestures while respecting stability in relative timing between the gestures comprising a segment.
Tracey M. Derwing, Murray J. Munro, Ron I. Thomson: The Routledge Handbook of Second Language Acquisition and Speaking
What R Mandarin Chinese /ɹ/s? - acoustic and articulatory features of Mandarin Chinese rhotics
Rhotic sounds are well known for their considerable phonetic variation within and across languages and their complexity in speech production. Although rhotics in many languages have been examined and documented, the phonetic features of Mandarin rhotics remain unclear, and debates about the prevocalic rhotic (the syllable-onset rhotic) persist. This paper extends the investigation of rhotic sounds by examining the articulatory and acoustic features of Mandarin Chinese rhotics in prevocalic, syllabic (the rhotacized vowel [ɚ]), and postvocalic (r-suffix) positions. Eighteen speakers from Northern China were recorded using ultrasound imaging. Results showed that Mandarin syllabic and postvocalic rhotics can be articulated with various tongue shapes, including tongue-tip-up retroflex and tongue-tip-down bunched shapes. Different tongue shapes have no significant acoustic differences in the first three formants, demonstrating a many-to-one articulation-acoustics relationship. The prevocalic rhotics in our data were found to be articulated only with bunched tongue shapes, and were sometimes produced with frication noise at the start. In general, rhotics in all syllable positions are characterized by a close F2 and F3, though the prevocalic rhotic has a higher F2 and F3 than the syllabic and postvocalic rhotics. The effects of syllable position and vowel context are also discussed.
The vowel space of multiethnolectal (Stuttgart) German
The emergence of multiethnolects, i.e. specific speaking styles or varieties associated with second and third generation speakers from immigrant backgrounds, has been observed and studied in several major cities in Europe and elsewhere in the world. The multiethnolect that is the focus of this study is one such variety of colloquial German. Most previous research on multiethnolectal German has focused on grammatical features. This paper reports on the first comprehensive study of the vowel system (vowel quality and global vowel space size) of multiethnolectal German, based on data from Stuttgart. The results show that the vowel space of multiethnolectal speakers is in general more centralized than that of a comparison group. A more detailed analysis reveals that the linguistic background plays an important role, as speakers with a Turkish or South Slavonic language background are responsible for this effect.
The acoustic characteristics of Swedish vowels
The Swedish vowel space is relatively densely populated with 21 categories that differ in quality and quantity. Existing descriptions of the entire space rest on recordings made in the late 1990s or earlier, while recent work in general has focused on subsets of the space. The present paper reports on static and dynamic acoustic analyses of the entire vowel space using a recently released database of words (SwehVd). The results highlight the importance of static and dynamic spectral and temporal cues for Swedish vowel category distinction. The first two formants and vowel duration are the primary acoustic cues to vowel identity; however, the third formant contributes to increased category separability for neighboring contrasts presumed to differ in lip-rounding. In addition, even though all long-short vowel pairs differ systematically in duration, they also display considerable spectral differences, suggesting that quantity distinctions are not separate from quality distinctions in Swedish. The dynamic analysis further suggests formant movements in both long and short vowels, with [e:] and [o:] displaying clearer patterns of diphthongization.
Controversial issues in the
The introductory text in the is an instruction manual, not an authoritative textbook in phonetics/phonology. That does not, however, exempt it from an obligation to be explicit and unambiguous. In that respect, the account of the distinction between phonetic and phonemic representations leaves something to be desired. Another issue is the ambiguous attitude to force of articulation (fortis and lenis) in obstruent consonants. Finally, the should be consistent in the notation of diphthongs and affricates.
Revisiting the nature and the context of T4 alternations: insights from disyllabic, trisyllabic and quadrisyllabic word/digit productions in Taiwan Mandarin
The tone values of a Tone 4 (T4) syllable are conventionally assumed to change from '51' to '53' when the syllable is followed by another T4 syllable in Mandarin Chinese. Literature focusing on T4 alternation is still inconclusive regarding the contexts for the alternations and whether the phenomenon should be better categorized as tone sandhi (i.e., represented as an abstract phonological rule in mental grammar) or tonal coarticulation (i.e., a natural articulation phenomenon at the phonetic level). The current study probes into these issues by focusing on disyllabic pseudowords, right-branching trisyllabic words as well as unstructured trisyllabic and quadrisyllabic digits. Productions from a total of 148 participants were collected and fundamental frequency (f0) contours, vowel lengths and f0 slopes were included in the analysis. The results from the experiments supported the tonal coarticulation view and showed that the trigger for the alternations was the high-onset tones following T4. Implications for the phonological analysis of tonal alternations in Mandarin Chinese are discussed.
The development of English language connected speech perception skills: an empirical study on Chinese EFL children
Connected speech processes (CSPs) occur randomly in everyday conversations of native speakers; however, such phonological variations can bring about challenges for non-native listeners. Looking at CSP literature, there seem to be very few studies involving young foreign language learners. Therefore, the present study aimed to explore the development of connected speech perception skills by focusing on 201 Chinese EFL children aged 9 to 12. It also incorporated systematic error analysis to further probe into the specific perceptual difficulties. The results indicate that: (1) Despite a significantly ascending trend for the overall growth of perception skills, no significant differences were found between 11- and 12-year-olds in elision and contraction, which suggests that the developmental trend varied depending on different CSP types; (2) Although random errors decreased with age, the number of lexicon and syntax errors gradually increased, and the distribution of perceptual errors shifted from the level of words and syllables to that of phonemes; (3) The primary types of errors resulting in the perception difficulties for elision and contraction were consonant errors, grammatical errors and morphology errors. Ergo, this study enhances the understanding of connected speech perception among EFL children and provides some implications for EFL/ESL listening instruction.
The role of place and manner of articulation in Kurtöp tonogenesis: refining the model
While the general acoustic mechanisms that explain the development of tone in language have been understood since at least Maspero (1912. Étude Sur La Phonétique Historique de La Langue Annamite: Les Initiales. 12. 1-126), we are still far from having a predictive theory of tonogenesis. Kurtöp, a Tibeto-Burman language of Bhutan shown to be undergoing tonogenesis, provides a rare opportunity to advance our understanding of how and why languages develop lexical tone. This study examines the role that sonority and place of articulation have in the spread of tone from voicing contrasts on preceding consonants in Kurtöp. First, we find that tone is more likely to be produced following fricatives than when following stops. Second, we see that within the stops, tone phonologises more readily following some places of articulation over others. Taken as a whole, this shows us that tone is moving through Kurtöp, following the most sonorous segments first and moving to the least sonorous segments. These findings thus help us refine our theory of tonogenesis and show that functional pressures have strong influences in this particular pathway of sound change.
Exploring and explaining variation in phrase-final f0 movements in spontaneous Papuan Malay
This study investigates the variation in phrase-final f0 movements found in dyadic unscripted conversations in Papuan Malay, an Eastern Indonesian language. This is done by a novel combination of exploratory and confirmatory classification techniques. In particular, this study investigates the linguistic factors that potentially drive f0 contour variation in phrase-final words produced in a naturalistic interactive dialogue task. To this end, a cluster analysis, manual labelling and random forest analysis are carried out to reveal the main sources of contour variation. Taking conversational interaction into account, these are turn transition, topic continuation, information structure (givenness and contrast), and context-independent properties of words such as word class, syllable structure, voicing and intrinsic f0. Results indicate that contour variation in Papuan Malay, in particular f0 direction and target level, is best explained by turn transitions between speakers, corroborating similar findings for related languages. The applied methods provide opportunities to further lower the threshold of incorporating intonation and prosody in the early stages of language documentation.
The effects of watching subtitled videos on the perception of L2 connected speech by L1 Chinese-L2 English speakers
The current study explores whether watching subtitled videos could facilitate L1 Chinese-L2 English speakers' perception of L2 English connected speech. Three hundred ninety-seven Chinese college students of L2 English completed a video-based spot dictation task after watching English videos with or without L1/L2 subtitles, featuring various connected speech types (e.g., linking, deletion, and their combinations). Results suggested an overall facilitation effect of watching videos on L2 connected speech perception, which was modulated by proficiency, subtitle form, and the complexity of connected speech. First, subtitled videos were more facilitative than non-subtitled videos in L2 perception. Second, participants with higher L2 proficiency better perceived English connected speech than those with lower proficiency. Third, the more connective devices an item used, the more difficult it was for L2 perception. When this complexity was controlled, the L2 perception was not influenced by connected speech type. Finally, the complexity of connected speech also mediated the subtitle facilitation effects. When the connected speech involved triple connective devices, L2 speakers benefited more from L1 subtitles than L2 subtitles. The findings can provide insights into multi-modal speech perception and English connected speech learning.
Perception of illusory clusters: the role of native timing
We explore the influence of native timing patterns on nonnative speech perception by asking whether a nonnative CVCV sequence can be perceived as CCV when the temporal organization of nonnative CVCV is similar to native CCV. To explore this question, Georgian listeners are tested on a CCa-CVCá discrimination in French. Georgian has a rich word-onset cluster inventory, with component consonants loosely timed. The loose timing often, though not always, results in a schwa-like CC transition. French, the stimulus language, exhibits tighter timing in biconsonantal clusters, no vocalic transitions, and a reduced non-prominent first vowel in CVCá sequences. We hypothesize that the cross-language difference in inter-consonantal timing can facilitate the perception of an illusory cluster when Georgian listeners hear French CVCá. The findings reveal such perceptual confusion, particularly in the CCa-CøCá contrast in which the nonnative /ø/ is phonetically similar to the CC transition in Georgian, both in terms of temporal organization and tongue shape. This confirms the possibility of illusory clusters, which is consistent with the interpretation that Georgian listeners utilize their knowledge of how word-onset CC clusters are temporally implemented in their native language when responding to the task. We propose that the timing pattern may constitute language-specific knowledge and that it can influence the perceptual assimilation patterns in nonnative speech perception.
Do letters matter? The influence of spelling on acoustic duration
The present article describes a modified and extended replication of a corpus study by Brewer (2008. . Tucson, AZ: University of Arizona PhD thesis) which reports differences in the acoustic duration of homophonous but heterographic sounds. The original findings point to a quantity effect of spelling on acoustic duration, i.e., the more letters are used to spell a sound, the longer the sound's duration. Such a finding would have extensive theoretical implications and necessitate more research on how exactly spelling would come to influence speech production. However, the effects found by Brewer (2008) did not consistently reach statistical significance and the analysis did not include many of the covariates which are known by now to influence segment duration, rendering the robustness of the results at least questionable. Employing a more nuanced operationalization of graphemic units and a more advanced statistical analysis, the current replication fails to find the reported effect of letter quantity. Instead, we find an effect of graphemic complexity. Speakers realize consonants that do not have a visible graphemic correlate with shorter durations: the /s/ in is shorter than the /s/ in . The effect presumably resembles orthographic visibility effects found in perception. In addition, our results highlight the need for a more rigorous approach to replicability in linguistics.
Dynamic specification of vowels in Hijazi Arabic
Research on various languages shows that dynamic approaches to vowel acoustics - in particular Vowel-Inherent Spectral Change (VISC) - can play a vital role in characterising and classifying monophthongal vowels compared with a static model. This study's aim was to investigate whether dynamic cues also allow for better description and classification of the Hijazi Arabic (HA) vowel system, a phonological system based on both temporal and spectral distinctions. Along with static and dynamic F1 and F2 patterns, we evaluated the extent to which vowel duration, F0, and F3 contribute to increased/decreased discriminability among vowels. Data were collected from 20 native HA speakers (10 females and 10 males) producing eight HA monophthongal vowels in a word list with varied consonantal contexts. Results showed that dynamic cues provide further insights regarding HA vowels that are not normally gleaned from static measures alone. Using discriminant analysis, the dynamic cues (particularly the seven-point model) had relatively higher classification rates, and vowel duration was found to play a significant role as an additional cue. Our results are in line with dynamic approaches and highlight the importance of looking beyond static cues and beyond the first two formants for further insights into the description and classification of vowel systems.
Hiatus resolution and linguistic diversity in Australian English
Vowel hiatus is typically resolved in Australian English through complementary strategies of liaison (j-gliding/w-gliding/linking-r) and glottalisation. Previous work suggests a change in progress towards increased use of glottalisation as an optimal hiatus-breaker, which creates syntagmatic contrast between adjacent vowels, particularly when the right-edge vowel is strong (i.e. at the foot boundary). Liaison continues to be used when right-edge vowels are weak, but glottalisation as a hiatus resolution strategy in general appears to be increasing and may be more common in speakers from non-English-speaking backgrounds, raising the question of whether exposure to linguistic diversity could be driving the change. We examine hiatus resolution in speakers from neighbourhoods that vary according to levels of language diversity. We elicited gliding and linking-r hiatus contexts to determine how prosodic strength of flanking vowels and speakers' exposure to linguistic diversity affect hiatus resolution. Results confirm that glottalisation occurs most frequently with strong right-edge vowels, and gliding/linking-r are more likely with weak right-edge vowels. However, strategies differ between gliding and linking-r contexts, suggesting differing implementation mechanisms. In addition, speakers from ethnolinguistically diverse areas produce increased glottalisation in all contexts, supporting the idea that change to the hiatus resolution system may be driven by language contact.
On the two rhotic schwas in Southwestern Mandarin: when homophony meets morphology in articulation
This is an acoustic and articulatory study of the two rhotic schwas in Southwestern Mandarin (SWM), i.e., the -suffix (a functional morpheme) and the rhotic schwa phoneme. Electromagnetic Articulography (EMA) and ultrasound results from 10 speakers show that the two rhotic schwas were both produced exclusively with the bunching of the tongue body. No retroflex versions of the two rhotic schwas were found, nor was retraction of the tongue root into the pharynx observed. On the other hand, the -suffix and the rhotic schwa, though homophonous, significantly differ in certain types of acoustic and articulatory measurements. In particular, more pronounced lip protrusion is involved in the production of the rhotic schwa phoneme than in the -suffix. It is equally remarkable that contrast preservation is not an issue because the two rhotic schwas are in complementary distribution. Taken together, the present results suggest that while morphologically-induced phonetic variation can be observed in articulation, gestural economy may act to constrain articulatory variability, resulting in the absence of retroflex tongue variants in the two rhotic schwas, the only two remaining r-colored sounds in SWM.
Variability in cross-language and cross-dialect perception. How Irish and Chinese migrants process Australian English vowels
We investigate how three adult groups - experienced L2 English listeners; experienced D2 (second dialect) listeners; and native L1/D1 listeners - categorise Australian English (AusE) lax front vowels /ɪ e æ/ in /hVt/, /hVl/ and /mVl/ environments in a forced-choice categorisation task of synthesised continua. In study 1, AusE listeners show predictable categorisations, with an effect of coarticulation raising the vowel in perception for nasal onset stimuli, and a following lateral lowering the vowel in perception. In study 2, Irish (D2) and Chinese listeners (L2) have different categorisations than AusE listeners, likely guided by their D1/L1. Coarticulation influences the D1/D2 groups in similar ways, but results in more difficulty and less agreement for the Chinese. We also investigate the role of extralinguistic factors. For the Chinese listeners, higher proficiency in English does not correlate with more Australian-like categorisation behaviour. However, having fewer Chinese in their social network results in more Australian-like categorisation for some stimuli. These findings lend partial support to the role of experience and exposure in L2/D2 contexts, whereby categorisation is likely still driven by native categories, with increased exposure leading to better mapping, but not to a restructuring of underlying phonetic categories.
A perception-induced /t/-to-/k/ sound change: evidence from a cross-linguistic study
John Ohala claimed that the source of sound change may lie in misperceptions which can be replicated in the laboratory. We tested this claim for a historical change of /t/ to /k/ in the coda in the Southern Min dialect of Chaoshan. We conducted a forced-choice segment identification task with CVC syllables in which the final C varied across the segments [p t k ʔ] in addition to a number of further variables, including the V, which ranged across [i u a]. The results from three groups of participants whose native languages have the coda systems /p t k ʔ/ (Zhangquan), /p k ʔ/ (Chaoshan) and /p t k/ (Dutch) indicate that [t] is the least stably perceived segment overall. It is particularly disfavoured when it follows [a], where there is a bias towards [k]. We argue that this finding supports a perceptual account of the historically documented scenario whereby a change from /at/ to /ak/ preceded and triggered a more general merger of /t/ with /k/ in the coda of Chaoshan. While we grant that perceptual sound changes are not the only or even the most common type of sound change, the fact that the perception results are essentially the same across the three language groups lends credibility to Ohala's perceptually motivated sound changes.