Friday, June 7, 2024

Kate Weber's Dissertation Defense

10:00 AM to 11:00 AM

Room 2903 Taubman Health Sciences Library, 1135 Catherine Street, Ann Arbor, MI

Kate will defend her PhD dissertation, "Detecting Risky Alcohol Use with Natural Language Processing and Computable Phenotypes in Clinical Records."

Abstract

Alcohol use is common throughout the United States. Although severe alcohol use disorder is rare, it has destructive impacts on the physical and social health of an individual. Unless a person has sought help for alcohol abuse, information about their consumption is largely restricted to free-text notes in their social history and is difficult to locate with simple searches. Because of the correlation between alcohol use and poor surgical outcomes, there is a need to locate this information and calculate alcohol-use risk for clinicians who may be seeing a patient for the first time before a procedure. This information can help with decisions to perform additional screening or perioperative interventions.
The aims of this dissertation are to 1) develop a binary natural language processing (NLP) classifier that determines whether patients’ text records indicate high- or low-risk alcohol use, and compare it to a similar algorithm using only standardized medical codes; 2) develop a four-class ordinal NLP classifier of patient risk ranging from “does not drink” to “probably dependent on alcohol”; and 3) develop a four-class computable phenotype for risky alcohol use that uses structured data in the clinical record.
The binary classifier has an F1 score of 0.78, far outstripping the ability of ICD codes alone to correctly identify high-risk patients. The four-class NLP algorithm applies a transformer architecture for intermediate labeling and a bidirectional LSTM neural network as an inference head, which lets it build an effective model despite scarce data in rare classes. It achieves an overall macro F1 score of 0.77, with true-positive performance of 0.83 and 0.73 for Probable-Dependence and High-Risk, respectively. The proposed computable phenotype is novel in its ability to provide stratified levels of risk and, as a two-class classifier, is significantly more effective than others in the literature. However, it is unable to classify 41% of patients and relies on non-standard structured data in the record for its improvements over another published phenotype.
Finally, we consider this model in the context of other Large Language Model approaches to clinical concept extraction, examine the utility of alternatives to the F1 statistic for model selection in ordinal classifiers, and validate the NLP approach to extracting and calculating a numeric value for a patient’s weekly alcohol consumption. This dissertation’s contributions include a novel multi-stage approach for extracting sparse information from a noisy, imbalanced dataset, a new ordinal NLP classifier representing alcohol-use risk, and a four-class computable phenotype for alcohol-use risk.
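
For readers unfamiliar with the macro F1 score cited in the abstract, the following is a minimal sketch of how such a score is commonly computed for a four-class problem. It is not the dissertation's code. The integer labels 0 to 3 are a hypothetical encoding of the four risk classes, the toy predictions are invented for illustration, and treating "true-positive performance" as per-class recall is an assumption.

```python
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int],
                 labels: Sequence[int]) -> Dict[int, float]:
    """Return the one-vs-rest F1 score for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never appears
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             labels: Sequence[int]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def recall(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """Fraction of true members of `label` that were predicted as `label`."""
    total = sum(1 for t in y_true if t == label)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical encoding: 0 = lowest-risk class ... 3 = highest-risk class.
    labels = [0, 1, 2, 3]
    # Invented toy data, not results from the dissertation.
    y_true = [0, 0, 1, 1, 2, 2, 3, 3]
    y_pred = [0, 0, 1, 2, 2, 2, 3, 2]
    print("macro F1:", macro_f1(y_true, y_pred, labels))
    print("recall for class 3:", recall(y_true, y_pred, 3))
```

Because the macro average weights every class equally, a model that performs poorly on a rare class is penalized as heavily as one that performs poorly on a common class, which is why the metric is often reported for imbalanced datasets.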