
2.3: Previous Work


Computational classification techniques have been around for decades, and their applications are remarkably diverse. In this section, we review text-oriented classification techniques, particularly machine learning methods associated with natural language processing that have been widely tested on text. Much medical and clinical data is written text, ranging from clinical notes and patient charts to papers published in medical journals, and machine learning applications such as classification help make sense of all this data. Popular classification techniques include topic modeling, neural networks, and clustering, but these algorithms often need the support of word sense disambiguation.

    Topic Modeling

Topic modeling (formerly referred to as Latent Semantic Analysis/Indexing, LSA/LSI) is a statistical approach that works at the word or sentence level to classify documents into similar categories, or "topics". Landauer et al. wrote their 1997 paper "An Introduction to Latent Semantic Analysis" as a broad introduction to LSA and its potential. They described the method as a way of extracting and representing the semantic meaning of words through statistical computations applied to a large corpus of text, and argued that, in a way, LSA mimics human sorting and categorization of words. For example, LSA has been found capable of simulating aspects of human cognition, from vocabulary development and word recognition to sentence-word semantic priming, discourse comprehension, and judgments of essay quality.

The knowledge derived from LSA can be described as sufficient but lacking in experience. While humans often understand their world through experience, human knowledge is not limited to experience-only learning. LSA's uniqueness lies not only in its comparability to human learning, but also in how it differs from other traditional natural language processing and artificial intelligence applications of its time. LSA takes raw text as input and does not use dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphology. Instead, the raw text is parsed into words represented as unique character strings and organized into a matrix in which each row is a unique word and each column is a text passage (most typically a document), and singular value decomposition is applied to that matrix. Landauer et al. tested LSA on multiple judgment tasks and reported good performance, but concluded that LSA's lack of raw experience makes it somewhat incomparable to human cognition. As an overall computational classification technique, however, LSA led the way to more sophisticated topic modeling.
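As a rough illustration of that pipeline, the sketch below builds a small passage-by-word count matrix and applies truncated SVD with scikit-learn; the toy passages and the number of latent dimensions are illustrative assumptions, not Landauer et al.'s setup.

```python
# A rough LSA sketch: build a passage-by-word count matrix (the transpose of
# the word-by-passage matrix described above) and apply truncated SVD to get
# a low-dimensional semantic space.  The toy passages and the number of
# dimensions are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

passages = [
    "the patient reported chest pain and shortness of breath",
    "chest pain may indicate a cardiac condition",
    "the essay discusses discourse comprehension and vocabulary",
    "vocabulary growth supports reading comprehension",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(passages)         # passages x unique words

svd = TruncatedSVD(n_components=2, random_state=0)  # rank-k approximation
passage_vectors = svd.fit_transform(counts)         # each passage in latent space
word_vectors = svd.components_.T                    # each word in latent space

print(passage_vectors.shape, word_vectors.shape)
```

Similarity between passages or between words can then be measured in this reduced space rather than on the raw counts.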

Just two years after the publication of "An Introduction to Latent Semantic Analysis", the paper "Latent Semantic Indexing: A Probabilistic Analysis" by Papadimitriou et al. sought to improve LSA by introducing random projection, a mathematical technique for reducing dimensionality, which they believed would increase speed while maintaining accuracy. Applying random projection to the initial corpus was intended to reduce the bottleneck that often comes with LSI, which they achieved with some success. While the model ran faster, Papadimitriou et al. were somewhat dissatisfied with its performance. Both Papadimitriou et al. and Landauer et al. agreed that LSA handles polysemy and synonymy poorly. The bag-of-words model employed by LSA may be to blame, since it treats the context in which a word appears as independent of the word itself.
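The random projection step itself is straightforward to sketch; the matrix size and target dimension below are illustrative assumptions, not Papadimitriou et al.'s experimental settings.

```python
# Random projection as a cheap dimensionality-reduction step: project a
# high-dimensional document-term matrix onto a much smaller random subspace.
# The matrix size and target dimension are illustrative assumptions.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
doc_term = rng.poisson(0.3, size=(1000, 5000))     # 1000 documents x 5000 terms

projector = GaussianRandomProjection(n_components=300, random_state=0)
reduced = projector.fit_transform(doc_term)        # 1000 documents x 300 dims
print(reduced.shape)                               # (1000, 300)
```

The reduced matrix can then be fed to the expensive SVD step, which is the source of the speedup.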

Despite the effort that LSA puts toward classifying documents and retrieving relevant information, it remains limited in disambiguating word senses. More recent descendants of LSA, now called topic models, have attempted to address these issues with polysemy and synonymy. Wallach attempted to bridge this gap in her 2006 paper "Topic Modeling: Beyond Bag-of-Words" by combining bag-of-words and n-gram statistics. She extended latent Dirichlet allocation (Blei et al., 2003), which represents documents as random mixtures over latent topics, each topic characterized by a distribution over words, with a bigram model that takes word order into account. Her results showed that the predictive accuracy of her model is significantly better than that of either latent Dirichlet allocation or the hierarchical Dirichlet language model. Her model also automatically infers a separate topic for function words, so the other topics are less dominated by them. This contribution is especially important because Wallach's model uses a larger number of topics than either Dirichlet model and achieves a greater reduction in information rate as more topics are added, again while maintaining accuracy.
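For reference, here is a minimal sketch of standard latent Dirichlet allocation, the baseline that Wallach extends; her bigram extension is not implemented here, and the toy corpus and topic count are illustrative assumptions.

```python
# Standard latent Dirichlet allocation (Blei et al., 2003) over a bag-of-words
# matrix; Wallach's bigram extension is not implemented here.  The toy corpus
# and topic count are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "patients with diabetes were given a new drug",
    "the drug trial reported improved patient outcomes",
    "topic models represent documents as mixtures of topics",
    "each topic is a distribution over words",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)             # documents x topics

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):        # topics x words
    top_words = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_words}")
```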

    Neural Networks

Another popular approach to text analysis and classification is neural networks, which are currently prized for their speed and accuracy across many applications. Partially responsible for the recent rise of neural network applications in natural language processing is the popular Word2Vec model from Mikolov et al. of Google, who published their paper "Efficient Estimation of Word Representations in Vector Space" and made the Word2Vec algorithm publicly available in 2013.

Part of Word2Vec's attractiveness is the speed with which the word vectors are developed. This is due to the structure from which the word vectors are derived: shallow neural networks. The Word2Vec model creates a two-layer network, in which one layer is hidden. This structure is supported by the log-linear architectures proposed by Mikolov et al., which learn distributed representations of words while minimizing computational complexity. These include a continuous bag-of-words (CBOW) model and a continuous skip-gram model.

The CBOW architecture is like that of a feedforward neural network language model in which the non-linear hidden layer is removed and the projection layer is shared for all words; therefore, all words are projected into the same position and their vectors are averaged. The continuous skip-gram model is like the CBOW model, but instead tries to maximize classification of a word based on another word in the same sentence: each current word is used as input to a log-linear classifier with a continuous projection layer. The word vectors returned support various tasks, such as generating similar words and deciding which word does not belong to a set.
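A minimal sketch of training both architectures with the gensim library; the toy corpus and hyperparameters are illustrative assumptions rather than Mikolov et al.'s settings.

```python
# Training both Word2Vec architectures with gensim: sg=0 selects CBOW,
# sg=1 selects continuous skip-gram.  The toy corpus and hyperparameters
# are illustrative assumptions, not Mikolov et al.'s settings.
from gensim.models import Word2Vec

sentences = [
    ["the", "patient", "was", "diagnosed", "with", "cancer"],
    ["the", "doctor", "reviewed", "the", "patient", "chart"],
    ["clinical", "notes", "describe", "the", "patient", "history"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["patient"][:5])                       # first 5 dims of one embedding
print(skipgram.wv.most_similar("patient", topn=3))  # nearest neighbours (toy data)
```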

Perhaps the most interesting contribution of the Word2Vec model is its ability to conceptualize words. This is explained best through the famous king and queen example: simply put, v(king) - v(man) + v(woman) ≈ v(queen), where subtracting the vector for man from the vector for king and adding the vector for woman yields a vector close to that of queen. While Word2Vec's application to natural language is more general and flexible, other neural network systems target specific natural language processing tasks. For example, Chen et al. proposed a standard neural network dependency parser that uses part-of-speech tag and arc label embeddings to yield 1,000 parses per second at 92.2% accuracy. Since Word2Vec was released, many other algorithms have built on the model. For example, Sense2Vec, published two years after Word2Vec, uses part-of-speech tagging to help disambiguate word senses.
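With pretrained vectors, the analogy can be reproduced directly; the sketch below uses gensim's downloader (the pretrained Google News model is a large download, and the exact neighbour returned may vary by model).

```python
# The king/queen analogy as vector arithmetic, using pretrained Google News
# vectors (a large download; any sufficiently large Word2Vec model will do,
# and the exact neighbour returned may vary by model).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# v(king) - v(man) + v(woman) should land near v(queen)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```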

Trask et al. argue that Word2Vec does not do enough linguistic preprocessing to accurately disambiguate words such as "duck", which, depending on the context, can be either a noun or a verb. Sense2Vec maintains Word2Vec's general architecture but adds more linguistic features to enhance the word embeddings. Another popular natural language processing topic is morphology, which Luong et al. addressed in their paper "Better Word Representations with Recursive Neural Networks for Morphology". Luong et al. argue that while vector space representations have had success over the past few years, morphological relations have been lacking. Their solution was a recursive neural network capable of finding word similarity and distinguishing rare words. Luong et al. used both supervised and unsupervised approaches in their experiments and obtained comparable results between the two. Neural networks continue to prove practical and useful in natural language processing and in many areas of machine classification.

    Clustering

The final computational classification technique discussed here is clustering. Clustering is an unsupervised learning task of mapping data to find where it "clusters", that is, where data situates itself near other similar data. Clustering is a popular classification technique especially because it can be performed unsupervised. Unsupervised learning techniques have become more popular in the age of the internet, where researchers have a constant stream of raw, accessible data to analyze. One especially popular source is the microblogging site Twitter, with its easy-to-access APIs and rich textual content.

In the paper "Social Network Data Mining Using Natural Language Processing and Density Based Clustering", Khanaferov et al. proposed a system to mine Twitter data for information relevant to obesity and health. Their goal of demonstrating a practical approach to solving a healthcare issue through a computational method focused on mining useful patterns out of public data. First, they used a data warehouse with three distinct layers as a staging step for the online mining operations. After the collected data was cleaned and standardized, a density-based clustering algorithm was applied to find relevant patterns (a minimal sketch of this kind of clustering appears below). The output was a set of transactions, each with a set of associated search terms. To better visualize the cluster data, it was plotted onto a map using the Google Maps API, which showed that tweets coming out of the United States and Europe had a negative sentiment, while those coming out of South Asia, Canada, and Central Africa had a positive sentiment. Overall, they were able to cluster tweets in a somewhat meaningful way, although the loose relationship between healthcare and social media makes it tricky to extract meaningful results.

Cardie et al. experimented with noun phrase clustering in their paper "Noun Phrase Coreference as Clustering". They introduced a new, unsupervised algorithm for this task by treating each group of coreferential noun phrases as an equivalence class. They identified various features of each noun phrase, such as individual words, head noun, position, and pronoun type, and then defined a distance measure between two noun phrases. The clustering algorithm worked backwards through the document, since noun phrases tend to refer to noun phrases that precede them. The approach was somewhat accurate, with results ranging from 41.3% to 64.9%, leaving room for improvement.
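Here is the generic density-based clustering sketch promised above, run over TF-IDF vectors of short texts; the example tweets and DBSCAN parameters are illustrative assumptions, not Khanaferov et al.'s actual pipeline.

```python
# A generic density-based clustering sketch over TF-IDF vectors of short
# texts; the example tweets and DBSCAN parameters are illustrative
# assumptions, not Khanaferov et al.'s pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

tweets = [
    "trying a new diet to fight obesity",
    "obesity rates keep rising in my city",
    "loving this healthy meal plan",
    "great workout and a healthy dinner tonight",
    "traffic was terrible today",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(tweets)

# cosine distance works directly on the sparse TF-IDF rows
labels = DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit_predict(tfidf)
print(labels)   # -1 marks noise points that fall in no dense region
```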

    Word Sense Disambiguation

An important distinction in the computational classification of text is word sense disambiguation: choosing the correct sense of a polysemous word given the context in which it occurs. Word sense disambiguation within the medical field, called biomedical text normalization, is especially relevant given the specialized nature of the data at hand. Applications of biomedical text normalization that work within a medical and clinical context may not transfer successfully to outside subjects, and vice versa. For example, the abbreviation "CA" can mean two things within the medical field: "cancer" or "carbohydrate antigen". Outside of the medical field, however, "CA" could very likely stand for "California", or for other words that have nothing to do with "cancer" or "carbohydrate antigen". For this reason, research in biomedical text normalization is an important task that will hopefully lead to higher accuracy in classification applications.

Given how much medical data exists, techniques range from supervised to semi-supervised and unsupervised learning, although in recent years unsupervised learning has been favored by many researchers. Tulkens et al. created a successful biomedical text normalization program in their paper "Using Distributed Representations to Disambiguate Biomedical and Clinical Concepts". Their approach was an unsupervised method that classified concepts by clustering. To achieve this, they used the Word2Vec continuous skip-gram model to create their word representations, which were then transformed via compositional functions into concept vectors, essentially an entire concept represented as a single vector. Every ambiguous concept tested in their experiment was defined by having more than one concept unique identifier (CUI) in the Unified Medical Language System (UMLS). Much like the "cancer" versus "carbohydrate antigen" example above, the test concepts had multiple, distinct meanings. Tulkens et al. obtained between 69% and 89% accuracy by transforming both the training and test data into concept vectors and measuring the cosine distance between them.
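A minimal sketch of that disambiguation-by-composition idea: average word vectors into concept vectors and pick the sense closest to the context by cosine similarity. The stand-in random vectors and hand-made sense descriptions are illustrative assumptions; Tulkens et al. derived their concept vectors from UMLS concepts and trained skip-gram embeddings.

```python
# Sketch of disambiguation with composed concept vectors: average the word
# vectors of each candidate sense's description, average the context words,
# and pick the sense whose concept vector is closest in cosine similarity.
# The tiny vocabulary and random stand-in embeddings are illustrative
# assumptions, not Tulkens et al.'s UMLS-derived setup.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["tumor", "malignant", "blood", "antigen", "marker",
         "biopsy", "growth", "level", "test", "elevated"]
word_vec = {w: rng.normal(size=50) for w in vocab}   # stand-in embeddings

def compose(words):
    """Average the available word vectors into one concept/context vector."""
    vecs = [word_vec[w] for w in words if w in word_vec]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

senses = {
    "cancer": compose(["tumor", "malignant", "biopsy", "growth"]),
    "carbohydrate antigen": compose(["blood", "antigen", "marker", "level"]),
}
context = compose(["elevated", "marker", "blood", "test"])

best = max(senses, key=lambda s: cosine(senses[s], context))
print("CA resolved to:", best)
```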

In a semi-supervised approach, Siu et al. experimented with semantic type classification of complex noun phrases. In medical text, complex noun phrases often consist of specific names (diseases, drugs, etc.) and common words such as "condition", "degree", or "process". The common words can have different semantic types depending on their context in the noun phrase, and Siu et al. attempted to classify these common words into fine-grained semantic types. They argue that it is crucial to consider these common nouns in information extraction because, while they can carry biomedical meaning, they can also be used in a general, uninformative sense. Their semi-supervised method labeled each target word within a noun phrase with a suitable semantic type or tagged it as uninformative. Experiments with this method yielded a 91.34% micro-average and an 83.57% macro-average over 50 frequently appearing target words.
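To make the semi-supervised setup concrete, here is a generic self-training sketch (not Siu et al.'s actual method): a handful of labeled target-word contexts, the rest marked unlabeled, and a classifier that propagates labels. The example contexts, features, and labels are illustrative assumptions.

```python
# A generic semi-supervised sketch (not Siu et al.'s actual method): label a
# few target-word contexts, mark the rest as unlabeled (-1), and let a
# self-training classifier propagate labels.  Data and features are
# illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

contexts = [
    "degree of stenosis in the left artery",     # labeled: informative
    "a degree in biology from the university",   # labeled: uninformative
    "the degree of liver damage was assessed",   # labeled: informative
    "to some degree the treatment helped",       # unlabeled
    "degree of tumor invasion into tissue",      # unlabeled
]
labels = [1, 0, 1, -1, -1]    # -1 marks unlabeled examples

X = TfidfVectorizer().fit_transform(contexts)
clf = SelfTrainingClassifier(LogisticRegression()).fit(X, labels)
print(clf.predict(X[3:]))     # predicted types for the unlabeled contexts
```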

Another unsupervised approach to biomedical word sense disambiguation is that of Henry et al. in their paper "Evaluating Feature Extraction Methods for Knowledge-Based Biomedical Word Sense Disambiguation". They compared vector representations in the 2-MRD WSD algorithm and evaluated four dimensionality reduction methods: continuous bag of words, skip-gram, singular value decomposition, and principal component analysis. Like Tulkens et al., Henry et al. measured accuracy with cosine similarity. Singular value decomposition performed well in their experiments; however, it may not do as well with larger data sets. Regarding dimensionality, low vector dimensionality was sufficient for the continuous bag-of-words and skip-gram models, but higher dimensionality achieved better results for singular value decomposition. Although principal component analysis is commonly used for dimensionality reduction, in this case it did not improve results for word sense disambiguation. Regardless of the method, normalization of biomedical and clinical text remains a nuanced and necessary step in processing for information retrieval and document classification. Part of the urgency to make advances on this task is the fact that computational methods are applied to medical and clinical data every day; their effectiveness relies on the ability to disambiguate terms and classify accurately, as a human would, or they could be rendered useless.


