9.3: Speech categorization as a Machine Learning task

    Given the messy nature of speech categorization, speech recognition technologies face the same challenges as human infants. The last twenty years have witnessed rapid development in speech recognition technology, which enabled the birth of products like Siri and Alexa. However, the performance of Siri and Alexa is still not satisfactory: “Sorry I don’t understand what you said” is a constant frustration for their users. Siri and Alexa are built on traditional speech technology, which relies on linguistic resources and textual information to build acoustic and language models, and which is developed by training models on larger and larger amounts of labeled data. Traditional approaches in speech recognition include Hidden Markov Models (HMM) and Dynamic Time Warping (DTW), often combined with artificial neural networks. HMM-based models are the most popular models in the speech recognition field: speech signals are viewed as piecewise stationary signals that can be quantized into sequences of feature vectors, on which both the acoustic model and the language model are trained. One advantage of HMM-based models is that accuracy is strongly tied to the size of the training data, so an HMM-based model is relatively easy to improve as long as large amounts of training data are available. However, when several gigabytes of data or memory are not available, HMM-based models cannot perform well. This is also why Amazon and Apple both rely on cloud computing and require a network connection for Alexa and Siri, rather than running recognition on the local device.

    More recent work has focused on end-to-end speech recognition models that jointly combine all components of the speech pipeline and train them together. Recent attempts have successfully trained supervised systems using textual transcripts only (Hannun et al., 2014; Miao, Gowayyed and Metze, 2015). However, this is still not the most efficient approach to speech recognition, since it retains the extra step of translating speech input into textual information. Computational linguists therefore turned to a more efficient approach modeled on the way human infants process speech. From babbling at 6 months of age to producing full sentences by the age of 3, young children learn how to talk before they know how to read and write, and with minimal instruction. Inspired by early language acquisition, zero-resource speech technologies were first proposed at the 2012 JHU CLSP workshop, “with an aim to construct a system that learn an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only information available to a language learning infant” (Jansen et al., 2013). “Zero resource” refers to zero labeled data in the training data, in order to imitate the unsupervised learning process. In 2015, the first Zero Resource Speech Challenge was organized to bring researchers together and compare their systems within a common open-source evaluation setting: participants work on the same data and are evaluated against the same criteria.
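
    For contrast with the zero-resource setting, the traditional word-level HMM recognizer can be sketched in a few lines of Python: one Gaussian HMM is trained per word on labeled utterances, and recognition simply picks the model with the highest likelihood. This is only an illustrative sketch, not any of the systems mentioned above; it assumes the third-party hmmlearn package, and the feature frames are random stand-ins for real MFCC sequences.

import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def fake_utterances(n_utts, n_frames, n_dims, shift):
    """Random stand-ins for MFCC frame sequences of one word category."""
    return [rng.normal(loc=shift, size=(n_frames, n_dims)) for _ in range(n_utts)]

def fit_word_hmm(utterances, n_states=3):
    """Train one Gaussian HMM on all example utterances of a single word."""
    X = np.vstack(utterances)
    lengths = [len(u) for u in utterances]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model

# One HMM per word; recognition picks the model with the highest log-likelihood.
models = {word: fit_word_hmm(fake_utterances(10, 40, 13, shift))
          for word, shift in [("yes", 0.0), ("no", 2.0)]}
test = rng.normal(loc=2.0, size=(40, 13))                # an unseen "no"-like utterance
print(max(models, key=lambda w: models[w].score(test)))  # expected: "no"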

    There are two major tasks in the Zero Resource Speech Challenge: subword modeling and spoken term discovery. Subword modeling requires building a representation of the speech signal that is robust across different talkers, speech rates and contexts, which is similar to speech categorization. Since this is an unsupervised task, the definition of subword is not confined to phonemes, sounds, or any other predefined linguistic category; instead, a subword is defined as a basic unit that distinguishes words. Subword modeling is similar to speech categorization in child language acquisition: both tasks involve finding features in the sound stream that are linguistically relevant (e.g., phoneme structure) and discarding non-linguistic features (e.g., speaker identity). In the Zero Resource Speech Challenge, participants are required to provide a feature representation that maximally discriminates speech units in the raw input. Evaluation of such representations usually involves training a phone classifier and measuring its classification accuracy; in the Zero Resource Speech Challenge, however, Minimal-Pair ABX tasks are used to evaluate the feature representation, since they do not require any labeled training data (Schatz, 2013; Schatz, 2014). The Minimal-Pair ABX task is a match-to-sample task that measures the discriminability of two sound categories. If sound A and sound B belong to two separate categories, α and β, then given a new sound X the task is to decide whether X belongs to α or β. The ABX discriminability is defined as the probability that the Dynamic Time Warping (DTW) divergence between A and X is smaller than that between B and X, when X belongs to the same category as A. The frame-level dissimilarity is calculated with either the cosine distance or the KL-divergence.
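
    A single trial of this minimal-pair ABX computation can be sketched as follows, using DTW over frame-level cosine distances. This is only an illustration of the idea, not the challenge’s evaluation code, and the feature matrices are random placeholders standing in for real representations.

import numpy as np

def dtw(a, b):
    """DTW divergence between two (frames x dims) feature matrices,
    using cosine distance between individual frames."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    d = 1.0 - a @ b.T                           # pairwise cosine distances
    n, m = d.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = d[i - 1, j - 1] + min(acc[i - 1, j],
                                              acc[i, j - 1],
                                              acc[i - 1, j - 1])
    return acc[n, m] / (n + m)                  # length-normalized path cost

def abx_correct(A, B, X):
    """X comes from the same category as A: the trial counts as correct
    when X is closer to A than to B under the DTW divergence."""
    return dtw(A, X) < dtw(B, X)

rng = np.random.default_rng(1)
mu_a = np.eye(13)[0] * 3.0                      # "direction" of category alpha
mu_b = np.eye(13)[1] * 3.0                      # "direction" of category beta
A = mu_a + rng.normal(0.0, 0.3, (30, 13))       # token of alpha
B = mu_b + rng.normal(0.0, 0.3, (25, 13))       # token of beta
X = mu_a + rng.normal(0.0, 0.3, (28, 13))       # another token of alpha
print(abx_correct(A, B, X))                     # expected: True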

    In the 2015 Zero Resource Speech Challenge, two data sets were provided to participants: the Buckeye corpus of conversational English (Pitt et al., 2007) and the Xitsonga section of the NCHLT corpus of South Africa’s languages (de Vries et al., 2014). For the English corpus, 6 male and 6 female native speakers of English recorded a total of 4h59m05s of speech; for the Xitsonga section, 12 male and 12 female speakers recorded a total of 2h29m07s of speech (Versteegh et al., 2016). A total of five subword modeling algorithms were accepted for publication. Their scores on ABX discriminability are shown in Table 1. The baseline feature representation is the result of Mel-Frequency Cepstral Coefficients (MFCC). MFCCs are coefficients that collectively represent the short-term power spectrum of a sound; they are used as the baseline features because they are not language-specific. The topline feature representation is the result of training on labeled data, derived from the Kaldi GMM-HMM system. As shown in Table 1, for English most of the algorithms performed better than the baseline in both the across-speaker and within-speaker tasks, and two of them even beat the topline in the within-speaker task. For Xitsonga, most of the algorithms performed better than the baseline, but none performed better than the topline.
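
    As a point of reference, MFCC baseline features of this kind can be computed with standard tools. The short sketch below assumes the third-party librosa package; the file name is a placeholder for any available recording.

import librosa

y, sr = librosa.load("utterance.wav", sr=16000)      # placeholder path, resampled to 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
frames = mfcc.T                                      # one 13-dim feature vector per frame
print(frames.shape)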

    The best performing algorithm for cross- and within-speaker English, and for within-speaker Xitsonga, is the DPGMM (Chen et al., 2015). Chen and his colleagues applied a pipeline of talker-normalized MFCCs followed by a Dirichlet process Gaussian mixture model (DPGMM). The DPGMM is a Bayesian nonparametric model which automatically learns the number of components from the observed data, and which has been successfully applied to clustering speech segments (Kamper, Jansen, King and Goldwater, 2014). This approach produced results very close to the topline in the across-speaker tasks, and in the within-speaker tasks it even outperformed the topline. These results indicate that speech recognition without previously labeled data is plausible and worth further pursuit. Badino et al. (2015) also modeled the feature space in their algorithm. They used binarized auto-encoders and HMM encoders to learn input features. Their results in the cross-speaker tasks were only slightly better than the MFCC baseline, and worse in the within-speaker tasks. One possible explanation is that phonological features are not binary in nature; applying a binary encoder to non-binary data can result in overrepresentation.
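
    The clustering idea behind the DPGMM can be illustrated with scikit-learn, whose BayesianGaussianMixture with a Dirichlet-process prior leaves unused components with negligible weight, so the effective number of clusters is inferred from the data. This is a schematic sketch rather than Chen et al.’s pipeline, and the frames below are random placeholders for talker-normalized MFCC frames.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three well-separated "phone-like" clusters of 13-dim frames (placeholders).
frames = np.vstack([rng.normal(loc=c, scale=0.5, size=(300, 13))
                    for c in (-3.0, 0.0, 3.0)])

dpgmm = BayesianGaussianMixture(
    n_components=20,                                  # upper bound, not the true number
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=500,
    random_state=0,
)
labels = dpgmm.fit_predict(frames)
print("clusters actually used:", len(np.unique(labels)))
print("components with non-negligible weight:", int(np.sum(dpgmm.weights_ > 0.01)))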

    Instead of modeling the feature space, Renshaw’s and Thiolliere’s teams both exploited top-down information. They generated word-like pairs using an unsupervised term discovery system and used the discovered pairs as input to a neural network. Renshaw’s team used a correspondence auto-encoder (CAE) to learn the patterns in the input, while Thiolliere’s team used the discovered pairs to train a siamese network. They achieved the best results in the Xitsonga cross-speaker task.
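
    The siamese idea can be sketched as follows: pairs of feature vectors labeled “same word” or “different word” train a shared encoder whose outputs are pulled together or pushed apart by a contrastive loss. The sketch below uses PyTorch with random placeholder pairs and fixed-size vectors, rather than variable-length sequences found by a term discovery system.

import torch
import torch.nn as nn

# Shared encoder applied to both members of each pair.
encoder = nn.Sequential(nn.Linear(39, 64), nn.ReLU(), nn.Linear(64, 32))

def contrastive_loss(za, zb, same, margin=1.0):
    """Pull same-word pairs together and push different-word pairs
    at least `margin` apart, in cosine distance."""
    d = 1.0 - nn.functional.cosine_similarity(za, zb)
    return torch.mean(same * d**2 + (1 - same) * torch.clamp(margin - d, min=0.0)**2)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for step in range(100):
    xa, xb = torch.randn(16, 39), torch.randn(16, 39)   # placeholder pair batches
    same = torch.randint(0, 2, (16,)).float()           # 1 = pair of the same discovered word
    loss = contrastive_loss(encoder(xa), encoder(xb), same)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))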

    Baljekar’s team applied articulatory information derived from a speech synthesis system previously trained for languages without a writing system. The results were worse than the baseline. They also compared the articulatory features with segment-based inferred phones, and found that the inferred phones had the worst performance in the Xitsonga tasks. Baljekar’s team did not build a strictly unsupervised system, since they relied on information from a partially supervised system. Their results are nonetheless interesting in that they demonstrate how supervised features interact with unsupervised systems.

    In the 2017 Zero Resource Speech Challenge, there were two groups of data sets: the development data and the surprise data. The development data consist of English, French and Mandarin corpora, with phones force-aligned using Kaldi (Povey et al., 2011; Wang, Zhang and Zhang, 2015). The surprise data consist of German and Wolof corpora (Gauthier et al., 2016), and were not revealed to the participants (Dunbar et al., 2017). A description of the corpus statistics is shown in Table 2. There were a total of 6 papers with 16 systems for subword modeling, almost three times as many as in the previous challenge. All the systems were evaluated using Minimal-Pair ABX tasks, with a focus on phone triplet minimal pairs that differ in the central sound. For example, with A = beg (α) and B = bag (β), a new token X = bag’ should be categorized as β. The scores for each system are shown in Table 3. In general, most of the submitted models performed better on the development data than on the surprise data. The sixteen systems can be categorized into four strategies.

    Heck et al. applied bottom-up frame-level clustering, inspired by the success of Chen et al. (2015), together with learned feature transformations (such as LDA) to neutralize talker variance; the training label of a sound is taken to be the same as that of its left and right neighbors. The results showed that both P1 and P2 were successful, since both were better than the baseline results. Comparing P1 and P2, re-estimating the centroids only slightly improved the results.
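
    The feature-transformation step can be pictured with scikit-learn’s LinearDiscriminantAnalysis: fitted on frames with (unsupervised) cluster labels, it projects the features into a lower-dimensional space that separates those labels, which can also reduce talker-specific variance. This is only a schematic illustration, not Heck et al.’s pipeline; the frames and labels below are random placeholders.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 39))        # placeholder feature frames
labels = rng.integers(0, 10, size=1000)     # placeholder (unsupervised) cluster labels

lda = LinearDiscriminantAnalysis(n_components=9)   # at most n_classes - 1 dimensions
transformed = lda.fit_transform(frames, labels)
print(transformed.shape)                    # (1000, 9)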

    Chen et al. applied the DPGMM to cluster frames separately for each language. The model was then trained on MFCCs (C1) and on MFCCs transformed using unsupervised linear VTLN (C2). The results on both the development data and the surprise data outperformed the baseline, and in the German within-speaker task the algorithm outperformed the topline as well. Ansari et al. trained on all five languages with two sets of features. The first set is a high-dimensional hidden layer trained on MFCC frames; the second is a hidden layer trained on labels obtained from a Gaussian mixture model over speech frames. The inputs to the deep neural network are labels derived from MFCCs (A1), a Gaussian-mixture HMM (A2), auto-encoder features (A3), and HMM posteriorgram features (A4). The results showed that all four models performed better than the baseline, with the MFCC and Gaussian-mixture-HMM models performing better than the other two.
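
    The pseudo-labeling strategy underlying these systems (cluster frames without supervision, then train a neural network on the cluster labels) can be sketched schematically with scikit-learn. The GMM-plus-MLP pairing below is only an illustration of the strategy, not either team’s architecture, and the frames are random placeholders.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Placeholder frames drawn from three loose clusters.
frames = np.vstack([rng.normal(loc=c, size=(200, 13)) for c in (-2.0, 0.0, 2.0)])

# Step 1: unsupervised clustering provides pseudo-labels for every frame.
pseudo_labels = GaussianMixture(n_components=5, random_state=0).fit_predict(frames)

# Step 2: a small neural network is trained to predict those pseudo-labels;
# its hidden layer can then serve as a learned feature representation.
dnn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
dnn.fit(frames, pseudo_labels)
print(dnn.score(frames, pseudo_labels))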

    The third strategy is to improve spoken term discovery. Inspired by Thiolliere et al. (2015) and Renshaw et al. (2015), Yuan (2017) obtained bottleneck features through an unsupervised word-pair generating model and applied a spoken term discovery (STD) system to discover acoustic features of word pairs, on English only (Y1) and on all five languages (Y2). They also created a supervised comparison, using transcribed pairs from the Switchboard corpus as labels to train the STD system (YS). All of the results were better than the baseline, and the results of the two unsupervised models were very similar to those of the supervised one.
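
    The notion of a bottleneck feature, i.e. keeping the activations of a deliberately narrow hidden layer as the new representation, can be sketched as below. For simplicity the network is trained here as a plain autoencoder on placeholder frames; the system described above trains on discovered word pairs instead.

import torch
import torch.nn as nn

# Encoder with a deliberately narrow (8-dim) final layer, plus a decoder.
bottleneck = nn.Sequential(nn.Linear(39, 64), nn.ReLU(), nn.Linear(64, 8))
decoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 39))
opt = torch.optim.Adam(list(bottleneck.parameters()) + list(decoder.parameters()), lr=1e-3)

frames = torch.randn(512, 39)                # placeholder feature frames
for step in range(200):                      # plain reconstruction training
    recon = decoder(bottleneck(frames))
    loss = nn.functional.mse_loss(recon, frames)
    opt.zero_grad()
    loss.backward()
    opt.step()

features = bottleneck(frames).detach()       # bottleneck features, shape (512, 8)
print(features.shape)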

    The last strategy is to use supervised training on non-target languages. Shibata et al. generated features from a neural network acoustic model trained on Japanese as part of an HMM (S1). In (S2), they trained on ten other languages (including English, Mandarin and German) using an end-to-end convolutional network and a bidirectional LSTM. The model trained on ten languages outperformed the Japanese one. However, since the target languages were also included in the training data, this is not a strict zero-resource speech recognition task.

    The clear winner of the 2015 and 2017 Zero Resource Speech Challenges is the DPGMM model, as demonstrated by Chen et al. (2016) and Heck et al. (2017). The most successful strategy for speech unit categorization is bottom-up clustering, in both monolingual and multilingual environments. Among all the algorithms considered, bottom-up clustering also best resembles how young children build mental representations of speech units. Its success is inspiring for the field of child language acquisition: for decades, psycholinguists have struggled to model the process of speech categorization, and successful machine learning algorithms like bottom-up clustering could serve as a basis for building a model of child speech categorization. Moreover, in both years of the Zero Resource Speech Challenge, some unsupervised algorithms outperformed supervised ones, which might indicate that the mechanism underlying child speech categorization, like these unsupervised algorithms, requires no innate knowledge or structure.


    This page titled 9.3: Speech categorization as a Machine Learning task is shared under a not declared license and was authored, remixed, and/or curated by Matthew J. C. Crump via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
