
2.4: Medical and Clinical Applications


    Computational classification techniques have been applied to medical data for years in the hope of contributing to new methods for helping patients: identifying and diagnosing diseases, and preventing illnesses from occurring in the first place. This work is predictive and requires copious amounts of data to prove effective. As with any other computational classification task, there are both supervised and unsupervised approaches. Since the amount of available medical data is ever-growing, many researchers are turning to semi-supervised and unsupervised techniques to process more data in less time. While supervised learning is extremely practical and yields highly accurate results, there is always the cost of annotating data: annotation must be conducted by people and, depending on the size of the data set or task, can be very time consuming. In this section, we review various contributions researchers are making to the biomedical natural language processing community and the techniques they use.

    The ability of a computer or artificial intelligence to aid health care providers in diagnosing patients sounds like something out of a science fiction novel. However, experimental applications of machine diagnosis have become popular in recent years: medical data from patients with specific ailments is used to learn models that can predict the same occurrence in future patients. Apostolova et al sought to create a system to detect sepsis and septic shock in patients early, while treatment is still effective. They report that sepsis has an approximate 50% mortality rate worldwide, and that the infection can often be detected through clues in nurses’ notes. Using the MIMIC-III corpus, a publicly available data set of ICU patient records, alone was unsuccessful. However, they noticed that when a patient has an infection or is suspected of having one, nurses tend to mention that the patient is on an antibiotic. Using this heuristic along with a list of commonly prescribed antibiotics, they were able to extract the language used to describe a patient’s state when an infection was present. Notes with infection-hinting and infection-confirmed language, combined with notes where infection was not present, served as training data for Support Vector Machines (SVMs), a supervised machine learning algorithm, in binary classification of these free-form notes. Using this technique, they achieved F1-scores ranging between 79% and 96%. These results are a good start toward their end goal of an automated system for detecting early sepsis in at-risk patients.
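    The note-classification setup described above can be sketched, at a much smaller scale, with an off-the-shelf linear SVM. The notes and labels below are invented for illustration (the real study used MIMIC-III), and the tf-idf features are an assumed preprocessing choice, not necessarily the study's:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy nursing notes (invented); 1 = infection-hinting language, 0 = no infection.
notes = [
    "patient febrile, started on vancomycin for suspected infection",
    "white count elevated, continuing ceftriaxone, possible sepsis",
    "afebrile overnight, tolerating diet, ambulating without issue",
    "vital signs stable, no acute events, plan for discharge tomorrow",
]
labels = [1, 1, 0, 0]

# Vectorize the free-form text, then fit a linear SVM for binary classification.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(notes, labels)

# Classify a new, unseen note.
pred = model.predict(["on antibiotics, concern for new infection"])[0]
```

    A linear kernel is a common default for sparse text features, since the vocabulary size usually far exceeds the number of documents.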

    SVMs have also been applied to cancer diagnosis, with mixed results. In one paper from 2002, “Gene Selection for Cancer Classification Using Support Vector Machines”, Guyon et al proposed a method to make sense of the massive amount of data DNA microarrays generate. Briefly, DNA microarrays are microscopic collections of DNA attached to a surface. Researchers use them to classify and predict the diagnostic category of a sample based on its gene expression profile; in this case, the expression of cancer. Guyon et al used samples from both cancer patients and patients without cancer to train their model, an SVM based on Recursive Feature Elimination (RFE), which uses weight magnitude as its ranking criterion. Their technique was able to extract biologically relevant genes from patients with colon cancer or leukemia and yielded a high classification accuracy of 98% on colon cancer, compared to 86% for the baseline system. Guyon et al argued that SVMs lend themselves well to this type of gene classification because of their ability to easily handle large feature sets (here, thousands of genes) with a small number of patterns (dozens of patients). However, Koo et al argued in their 2006 paper “Structured polychotomous machine diagnosis of multiple cancer types using gene expression” that even though SVMs are a popular and accurate classification technique, their results are implicit and therefore difficult to interpret. In an attempt to address this drawback, Koo et al proposed an extension of import vector machines using an analysis-of-variance decomposition and structured kernels, called the structured polychotomous machine.
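    The SVM-RFE idea, fitting a linear SVM, ranking features by weight magnitude, and pruning the lowest-ranked before refitting, can be sketched with scikit-learn's RFE wrapper. The expression matrix below is synthetic (real microarrays have thousands of genes), and the informative "genes" 3 and 7 are planted by construction:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))            # 100 samples x 20 synthetic "genes"
y = (X[:, 3] + X[:, 7] > 0).astype(int)   # outcome driven by genes 3 and 7

# RFE repeatedly fits the linear SVM and drops the gene with the
# smallest |weight| each round, until 4 genes remain.
selector = RFE(LinearSVC(max_iter=10000), n_features_to_select=4, step=1)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # indices of surviving genes
```

    Because the labels are a linear function of genes 3 and 7, the SVM concentrates weight on them and they survive the elimination rounds.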

    Import vector machines are similar to SVMs, but they are typically computationally cheaper and can provide estimates of posterior probabilities. The DNA microarray data Koo et al used came from a few sources, including a small round blue cell tumor data set and a leukemia data set. They wanted a system that not only improved upon import vector machines but also provided a method for finding genes that accurately discriminate cancer subtypes. Overall, Koo et al achieved 0% error rates, and their method selected a smaller set of genes while successfully classifying among samples. Their model also outperformed the SVM baseline in several tests, as they expected. As these two comparative studies show, robust machine learning systems are necessary for biomedical classification applications.

    DNA microarrays are one example of an expensive data set commonly used in biomedical classification; another commonly used data source is Twitter. As mentioned earlier in this paper, Twitter is a frequently used source of text data. In a study by Nadeem et al, “Identifying Depression on Twitter”, data was crowdsourced from Twitter users who had been diagnosed with Major Depressive Disorder in an effort to measure and predict depression in users. From these users, along with a general demographic, tweets from up to a year prior were extracted, and a bag-of-words model was applied to quantify each tweet.

    Finally, statistical classifiers were applied to the tweets to analyze the risk of depression. Linguistic features were extracted from the tweets of those with depression to see how their language differs from the language of those without it. From this analysis, Nadeem et al found approximately 20 words (and one sad-face emoticon) that were used at a much higher rate by depressed users. Nadeem et al employed Decision Tree, Support Vector Machine, Logistic Regression, Ridge Classifier, and two Naïve Bayes classifiers. Of the six, the Logistic Regression classifier had the highest precision and F1-score, while the SVM had the highest recall and the Naïve Bayes with 1-grams had the highest overall accuracy, 86%. The statistical classifiers here were trained with supervised learning, resulting in accuracies comparable to the experiments of Koo et al and Guyon et al. Both methods proved useful and accurate in their classification tasks of diagnosing disease.

    Like Nadeem et al, Gorrell et al attempted to identify first episodes of psychosis, in their case in psychiatric patient records. They filtered thousands of records and obtained 9,109 individual clinical records. Of those, 560 screened positive for psychosis, 5,234 screened negative (but remained at risk), and 3,315 were excluded for various reasons. Gorrell et al chose SVM, Random Forests, and JRip algorithms to classify their data, for reasons of speed and accuracy. They used two- and three-fold validation to define their features. Three-fold features included missing demographic information, such as borough, ethnicity, gender, postcode, first primary diagnosis, and age; where available, first primary diagnosis was included (bipolar hypomanic/unspecified and severe depressive with psychotic symptoms). Text features included in three-fold validation consisted of “olanzapine”, “risperidone”, “auditory hallucinations”, “voices”, “paranoid”, “psychotic” and “psychosis”. Two-fold validated features were somewhat less specific, including first primary diagnosis (bipolar, organic delusional schizophrenia-like disorder, organic mood disorder) and text features such as “aripiprazole”, “quetiapine”, “persecutory”, and “schizophrenia”. The three algorithms, with varying feature set sizes, obtained decent accuracies ranging from 66.46% to 82.2%. Surprisingly, the Random Forests classifier had both the weakest and the strongest accuracy: 66.46% with the full feature set plus unigrams, and 82.2% with a reduced feature set.
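    The multi-classifier comparisons described above can be sketched generically: fit several supervised classifiers on the same features and compare held-out metrics. The data here is synthetic stand-in data, and the classifier list approximates, but does not exactly reproduce, the models used in these studies:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic features standing in for vectorized tweets or record features.
X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "ridge": RidgeClassifier(),
    "svm": LinearSVC(max_iter=5000),
    "tree": DecisionTreeClassifier(random_state=0),
    "nb": BernoulliNB(),
}

# Train each model and record held-out (accuracy, F1) for comparison.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    scores[name] = (accuracy_score(y_te, pred), f1_score(y_te, pred))
```

    Reporting precision, recall, F1, and accuracy side by side, as Nadeem et al do, matters because different classifiers can lead on different metrics.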

    While text classification is a useful tool for diagnosing depression and other mental illnesses, researchers have also experimented with multimodal tools to identify and classify these illnesses. Morales et al explored this approach in “OpenMM: An Open-source Multimodal Feature Extraction Tool”, using text, speech, and face-mapping features to identify depression in individuals. Morales et al argue that to usefully model situational awareness, machines must have access to the same visual and verbal cues that humans have. To this end, they built a pipeline that extracts visual and acoustic features, performs automatic speech recognition, and uses that output to transcribe the speech and extract relevant linguistic features. OpenMM was tested on deception, depression, and sentiment classification, showing promising results: depression detection had a baseline of 55.36% accuracy, and OpenMM’s acoustic feature set produced an accuracy of 76.79%. OpenMM is publicly available for other researchers to experiment with and build upon, which is necessary for making full use of these classification techniques.

    Just as machines can help diagnose disease, machine learning can also be leveraged to help prevent disease through predictive models. This goal can be pursued through text mining techniques on clinical and medical data. Jacobson et al experimented with detecting healthcare-associated infections in patients by applying deep learning techniques to Swedish medical records. The Swedish Health Record Research Bank data they obtained contained two million patient records from over 800 clinical units between 2006 and 2014. They also used a special subset, the Stockholm EPR Detect-HAI Corpus, which contains 213 patient records classified and gold-annotated by two domain experts. After the necessary preprocessing, the records were transformed into numerical vectors: bag-of-words and tf-idf representations on one hand, and Word2Vec word vectors on the other. Artificial neural networks were then built from these representations, including stacked sparse autoencoders and stacked restricted Boltzmann machines. The results were somewhat diverse, ranging from 66% to 91%, though most scores hovered between 70% and 80%. Jacobson et al noted that deep learning techniques are often expensive to train, and researchers usually sacrifice some agility for increased accuracy, a trade-off that unfortunately did not pay off in this experiment. Despite the shortcomings, this research is still useful, and future work is promising.
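    A much-simplified sketch of the record-vectorization step described above: clinical notes become tf-idf vectors feeding a small neural network. The records and labels below are invented, and a basic multilayer perceptron stands in for the stacked autoencoders and restricted Boltzmann machines used in the study:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Invented record snippets; 1 = healthcare-associated infection suspected.
records = [
    "fever after catheter placement, culture positive",
    "wound site red and draining, started antibiotics",
    "routine follow up, incision healing well",
    "no complaints, vitals within normal limits",
]
labels = [1, 1, 0, 0]

# tf-idf vectors feed a small neural network (a stand-in architecture,
# not the stacked autoencoders / RBMs from the study).
model = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
model.fit(records, labels)
```

    The real pipeline would also compare these sparse count-based vectors against dense Word2Vec embeddings, which is where much of the study's experimental variation came from.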

    Having a broad range of potential diseases to identify and classify is attractive, but narrowly scoped models are also necessary. To this point, Abdinurova et al sought to create a model that could classify epilepsy, namely its various stages. The stages of epilepsy, absence of seizure, pre-seizure, seizure, and seizure-free, are all used in clinical data. Their system used artificial neural networks and SVMs for the supervised approaches, and k-means clustering for the unsupervised approach. The various techniques they experimented with all showed favorable results with high accuracies. As expected, the supervised methods performed better than their unsupervised counterparts, though the unsupervised results were not drastically worse. This experiment is an excellent comparison of state-of-the-art supervised and unsupervised learning and shows that both methods can yield comparable results. In addition to being excellent classification systems, the models also performed well on information retrieval tasks, an important function of machine classification.
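    The unsupervised side of such a comparison can be sketched with k-means: feature vectors (for instance EEG-derived features, an assumption here) are grouped into one cluster per hypothesized seizure stage. The three well-separated synthetic "stages" below make the clustering easy by construction:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Three synthetic "stages", each a tight Gaussian blob in 4-D feature space.
stages = [rng.normal(loc=c, scale=0.3, size=(30, 4)) for c in (0.0, 2.0, 4.0)]
X = np.vstack(stages)  # 90 samples, no labels given to the algorithm

# k-means partitions the samples into 3 clusters by distance to centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_
```

    Unlike the supervised models, k-means never sees stage labels; evaluating it requires mapping discovered clusters back onto the clinical stages afterward, which is one reason unsupervised accuracy typically trails supervised accuracy.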


    This page titled 2.4: Medical and Clinical Applications is shared under a not declared license and was authored, remixed, and/or curated by Matthew J. C. Crump via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
