
9.5: Word Categorization as a Machine Learning Problem


Word categorization, or part-of-speech (POS) tagging, is one of the most developed fields in Natural Language Processing. POS tagging is the process of assigning a word category (tag) to each word in a given input. The training data for POS tagging tasks usually consists of sets of word-tag pairs. Most modern language processing on English uses the Penn Treebank tag set, which has 45 tags (Marcus et al., 1993). The two most widely used algorithms for tagging are the Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM). The HMM is a generative model based on the probabilities of n-gram word combinations. The MEMM adapts logistic regression to sequences, making it a discriminative sequence model. The accuracy of POS tagging depends heavily on the labeled training dataset, and producing such labeled data is costly.
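To make the HMM approach concrete, here is a minimal sketch of a toy HMM tagger with Viterbi decoding. The three-tag set and all transition and emission probabilities below are invented for illustration, not estimated from a corpus.

```python
# A minimal sketch of HMM-based POS tagging with Viterbi decoding.
# All probabilities are toy values, not corpus estimates.

import math

# Hypothetical tag set (a tiny subset of the 45 Penn Treebank tags)
TAGS = ["DT", "NN", "VB"]

# P(tag | previous tag); "<s>" marks the sentence start
TRANS = {
    "<s>": {"DT": 0.6,  "NN": 0.3, "VB": 0.1},
    "DT":  {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN":  {"DT": 0.1,  "NN": 0.3, "VB": 0.6},
    "VB":  {"DT": 0.5,  "NN": 0.4, "VB": 0.1},
}

# P(word | tag); unseen words get a small smoothing probability
EMIT = {
    "DT": {"the": 0.7, "a": 0.3},
    "NN": {"dog": 0.4, "walk": 0.3, "park": 0.3},
    "VB": {"walks": 0.5, "walk": 0.5},
}
SMOOTH = 1e-6

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    # trellis[i][tag] = (log prob of best path ending in tag, backpointer)
    trellis = [{}]
    for tag in TAGS:
        p = TRANS["<s>"][tag] * EMIT[tag].get(words[0], SMOOTH)
        trellis[0][tag] = (math.log(p), None)
    for i, word in enumerate(words[1:], start=1):
        trellis.append({})
        for tag in TAGS:
            trellis[i][tag] = max(
                (trellis[i - 1][prev][0]
                 + math.log(TRANS[prev][tag])
                 + math.log(EMIT[tag].get(word, SMOOTH)), prev)
                for prev in TAGS
            )
    # Trace back from the best final state
    tag = max(TAGS, key=lambda t: trellis[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = trellis[i][tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["the", "dog", "walks"]))  # expected: ['DT', 'NN', 'VB']
```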

Before discussing how to build algorithms for word categorization, it is important to understand how grammatical categories are defined. Pike (1967) discussed this question from etic and emic perspectives: were the categories created first and words made to fit them, or were words sorted into categories that emerged from them? This parallels the difference between supervised and unsupervised learning tasks. Supervised learning tasks such as POS tagging use labeled data to train the algorithm, which matches the "categories before words" point of view. In unsupervised learning tasks, categories emerge from the words themselves, in line with the "words before categories" point of view.
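As a sketch of the "words before categories" view, the following toy example clusters words by the distribution of their immediate neighbors. The corpus, the feature scheme, and the choice of three clusters are all assumptions made for illustration.

```python
# A minimal sketch of unsupervised word categorization: words are grouped
# by the distribution of their left and right neighbors in a toy corpus.

from collections import Counter, defaultdict
import numpy as np
from sklearn.cluster import KMeans

corpus = "the dog walks the cat sleeps a dog sleeps a cat walks".split()
vocab = sorted(set(corpus))

# Context vector: counts of the words immediately before (L_) and after (R_)
contexts = defaultdict(Counter)
for i, w in enumerate(corpus):
    if i > 0:
        contexts[w]["L_" + corpus[i - 1]] += 1
    if i < len(corpus) - 1:
        contexts[w]["R_" + corpus[i + 1]] += 1

features = sorted({f for c in contexts.values() for f in c})
X = np.array([[contexts[w][f] for f in features] for w in vocab], dtype=float)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for w, lab in zip(vocab, labels):
    print(w, "-> cluster", lab)
# Words with similar neighbor distributions ("dog"/"cat", "walks"/"sleeps",
# "the"/"a") tend to land in the same cluster: categories emerge from data.
```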

In addition to speech segmentation, young children are also able to abstract syntactic and semantic information from the speech stream. The mechanism of syntactic acquisition is still open to debate. From the perspective of formal language theory and generative grammar, there is a finite set of words, denoted by Σ. A language L is a subset of Σ*, the set of all finite strings over Σ; L can be defined as the set of grammatically and semantically well-formed sentences of a language. In principle, L is an infinite set, since there are infinitely many possible sentences in any language. A grammar is a generative device that represents L in a finite way. In Chomsky's terms, such grammars are i-languages, which are internal to individual speakers and, on his view, grounded in an innate, universal capacity. He also posited a Language Acquisition Device (LAD), which takes linguistic data as input and outputs a grammar of some kind. The existence of the LAD has provoked decades-long debate between empiricists and nativists. In computational learning theory, the LAD is no longer a mysterious black box; it is treated as an algorithm that can be studied and interpreted using mathematical tools and computational modeling. Cracking the internal structure of the LAD therefore requires reverse engineering: whereas software engineering and machine learning design an algorithm to perform a desired task, in the case of the LAD we start from an algorithm that already performs the task and must interpret and model its internal structure.
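To illustrate how a finite grammar can represent an infinite L, here is a minimal sketch of a toy context-free grammar with one recursive rule. The grammar is invented for illustration and is not a model of English.

```python
# A minimal sketch of a grammar as a finite device for an infinite language.
# Five finite rule sets generate unboundedly many sentences via recursion.

import itertools

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"], ["the", "N", "that", "VP"]],  # recursion -> infinite L
    "VP": [["V"], ["V", "NP"]],
    "N":  [["dog"], ["cat"]],
    "V":  [["chased"], ["saw"]],
}

def expand(symbols, depth):
    """Yield terminal strings derivable from `symbols` within `depth` expansions."""
    if depth == 0:
        return
    if not symbols:
        yield []
        return
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:          # terminal symbol: keep it and continue
        for tail in expand(rest, depth):
            yield [head] + tail
    else:                            # nonterminal: try each rewrite rule
        for rule in GRAMMAR[head]:
            for sentence in expand(rule + rest, depth - 1):
                yield sentence

# The rule set is finite, yet raising `depth` yields ever more sentences:
for s in itertools.islice(expand(["S"], 8), 6):
    print(" ".join(s))
```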

In machine learning, the generative approach can also be insightful, in that "generative models can be tested in [their] precision and recall, two customary measures of performance in natural language engineering, which can address perennial questions of model relevance and scalability" (Kolodny, Lotem, and Edelman, 2015). The goal is for an algorithm to capture the generative character of language acquisition, so that it can parse and produce new material. Existing statistically based models still struggle to parse new material, let alone produce novel utterances.
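For reference, precision and recall can be computed as below; the proposed and gold items in the example are hypothetical.

```python
# A minimal sketch of precision and recall for evaluating a model's output
# (e.g., proposed word boundaries) against a gold standard.

def precision_recall(proposed, gold):
    """Precision = TP/|proposed|, recall = TP/|gold|."""
    proposed, gold = set(proposed), set(gold)
    true_positives = len(proposed & gold)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical boundary positions proposed by a model vs. gold boundaries
p, r = precision_recall(proposed=[3, 7, 12, 15], gold=[3, 7, 10, 15])
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.75, recall=0.75
```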

In the Zero Resource Speech Challenges, word categorization is addressed as spoken term discovery. As described in Versteegh et al. (2016), spoken term discovery is the task of finding speech fragments that ideally correspond to word-like units of the language. Unlike speech categorization tasks, which are evaluated with dissimilarity scores, a spoken term discovery algorithm is evaluated in three steps. The first is to examine pairwise fragment discovery in the audio stream: the Normalized Edit Distance (NED) between paired speech intervals and the coverage (COV) are used to assess the acoustic/phonological discrimination of the discovered pairs. The second step is to cluster all the discovered pairs into classes; the clusters are evaluated against a gold lexicon. Finally, the classes are used as labels to "parse" new input; this step is evaluated by how many word tokens were correctly segmented (token scores) and how many word boundaries were correctly placed (boundary scores). The baseline is provided by a randomized matching algorithm (Jansen and Van Durme, 2011), which achieves a good NED score (accurate matching) but poor coverage.
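A minimal sketch of how NED might be computed over the phoneme transcriptions of two matched fragments is shown below. The phoneme sequences are hypothetical, and challenge implementations differ in detail.

```python
# A minimal sketch of Normalized Edit Distance (NED) between the phoneme
# transcriptions of two discovered fragments. Lower NED means a better match;
# 0.0 means the transcriptions are identical.

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def ned(a, b):
    """Edit distance normalized by the length of the longer sequence."""
    return edit_distance(a, b) / max(len(a), len(b))

# Hypothetical phoneme sequences for two discovered fragments
print(ned(["b", "ah", "t", "er"], ["b", "ah", "d", "er"]))  # 0.25
```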

In the 2015 Zero Resource Speech Challenge, two papers on spoken term discovery algorithms were accepted for publication; their results are summarized in table 4. Rasanen et al. (2015) proposed using syllable segmentation for spoken term discovery, comparing three systems for segmenting the speech stream into syllable-like units: Vseg, EnvMin, and Osc. Using syllables to find spoken terms is a highly original approach, since it relies on prior knowledge about speech. As shown in table 3, the approach is effective, consistently beating the baseline results; the Osc algorithm appears to be the most effective of the syllable-based systems. Lyzinski et al. focused on the second step of the spoken term discovery process: clustering the discovered pairwise matches into classes. The study used the baseline algorithm to find pairwise matching segments and applied three algorithms to cluster the resulting pairs: a simple Connected Components (CC) method and two modularity-based algorithms, FG and Louvain. Although the three algorithms performed similarly to, and sometimes worse than, the baseline, their results differed from one another. This investigation provides insight into how the choice of clustering algorithm can affect the performance of a spoken term discovery system.
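To illustrate the clustering step, here is a minimal sketch of Connected Components clustering, in which pairwise matches form the edges of a graph and each connected component becomes one class. The fragment names and matches are hypothetical.

```python
# A minimal sketch of the Connected Components (CC) clustering step:
# pairwise matched fragments are edges of a graph; each connected
# component becomes one class of (putatively) same-word fragments.

from collections import defaultdict

# Hypothetical pairwise matches discovered in step one
matches = [("frag1", "frag4"), ("frag4", "frag7"), ("frag2", "frag5")]

graph = defaultdict(set)
for a, b in matches:
    graph[a].add(b)
    graph[b].add(a)

def connected_components(graph):
    """Group nodes into classes via depth-first search."""
    seen, classes = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(graph[node] - seen)
        classes.append(sorted(component))
    return classes

print(connected_components(graph))
# [['frag1', 'frag4', 'frag7'], ['frag2', 'frag5']]
```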


