Note that the linked slides/videos may not perfectly represent the finished project. Please reach out if you would like to chat more!
I use Neural Multi-Task Logistic Regression (N-MTLR) to model all-cause readmission-free survival as a function of time. Despite producing probability predictions for all future time points, N-MTLR outperforms XGBoost and deep learning approaches trained specifically to predict readmissions at 30 days. Further, I show that N-MTLR, augmented with a sequence model, can learn a patient’s representation directly from their history of medical codes, predicting all-cause readmission with an AUROC of 0.846 using only sequences of historical medical codes as input. This approach significantly outperforms the LACE baseline common in hospital settings.
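As a minimal numpy sketch of the MTLR piece, assuming the standard MTLR parameterization (the probability of the event falling in interval k is proportional to the exponentiated sum of the per-interval scores from k onward); in the real model a neural network produces these scores from patient features:

```python
import numpy as np

def mtlr_survival_curve(interval_logits):
    """Turn per-interval MTLR scores (theta_j . x) into a survival curve.

    Under the MTLR parameterization, the probability that the event falls
    in interval k is proportional to exp(sum of scores from k to the end),
    and the survival probability at each boundary is the mass remaining
    in later intervals.
    """
    interval_logits = np.asarray(interval_logits, dtype=float)
    # Cumulative score from interval k through the last interval.
    scores = np.cumsum(interval_logits[::-1])[::-1]
    scores -= scores.max()  # numerical stability before exponentiating
    density = np.exp(scores) / np.exp(scores).sum()
    # S(t_k) = probability the event occurs after interval k.
    survival = 1.0 - np.cumsum(density)
    return np.clip(survival, 0.0, 1.0)

# Example: three time intervals with equal scores give a uniform density,
# so the survival curve steps down by one third at each boundary.
curve = mtlr_survival_curve([0.0, 0.0, 0.0])
```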
Recent changes in global climate patterns (such as extreme average temperatures, droughts, and floods) have made the yields of Canadian farms much more unpredictable. Given these trends, scientists have turned to another major factor that can affect the success of a growing season: the genetics of the crop organism. To what extent can we predict gene expression changes driven by DNA sequences in Canadian protein crop species? To explore this question, we tested a number of deep-transfer-learning convolutional models on this problem and proposed appropriate baselines for use with both chromosomal and organellar genomes.
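As a hedged illustration of the kind of input these convolutional models consume, here is a toy one-hot encoding and a single hand-set motif filter (the function names and the "TATA" filter are invented for this sketch; the actual models are learned and far deeper):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    m = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        m[i, BASES.index(b)] = 1.0
    return m

def conv1d(x, kernel):
    """Valid 1-D convolution of a (L, 4) input with a (k, 4) filter."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel)
                     for i in range(x.shape[0] - k + 1)])

# A filter that fires maximally on the motif "TATA".
tata_filter = one_hot("TATA")
activations = conv1d(one_hot("GGTATAGG"), tata_filter)
```

The filter's activation peaks exactly where the motif occurs; a trained model learns many such filters and stacks further layers on top of their activations.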
Nearly 10% of patients hospitalized in Canada are readmitted within 30 days, costing approximately 2 billion Canadian dollars per year. Predicting the readmission risk of individual patients can help better target interventions, saving both lives and expenses. Consider a patient who has accrued a number of medical codes associated with their care, such as diagnoses and procedures. Listing these medical codes in chronological order yields a representation of the patient analogous to a document containing a sequence of words. From there, skip-gram can be used to create representations of medical concepts, which can be combined to represent a patient. We test a number of machine learning models on these representations for the prediction of 30-day hospital readmission.
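The combining step can be sketched as follows, assuming skip-gram vectors for each code have already been trained (the toy 2-D vectors below are made up for illustration; the codes are ordinary ICD-10 examples):

```python
import numpy as np

# Toy embeddings standing in for skip-gram vectors trained on sequences of
# medical codes; the real project learns these from patient histories.
code_vectors = {
    "I50.9": np.array([0.9, 0.1]),  # heart failure, unspecified
    "N18.3": np.array([0.2, 0.8]),  # chronic kidney disease, stage 3
    "Z95.1": np.array([0.7, 0.3]),  # aortocoronary bypass graft status
}

def patient_vector(codes):
    """Represent a patient as the mean of their medical-code embeddings,
    analogous to averaging word vectors to represent a document."""
    return np.mean([code_vectors[c] for c in codes], axis=0)

v = patient_vector(["I50.9", "Z95.1"])
```

The resulting fixed-length vector can then be fed to any standard classifier for the 30-day readmission prediction task.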
Cancer is a molecular disease with an estimated 19.3 million new cases and 10 million deaths in 2020 alone. Personalized medical oncology aims to provide individualized cancer treatments by acknowledging that every cancer patient is unique in terms of prognosis, treatment tolerance, and survival outcome due in part to each individual tumor’s distinctive molecular profile. We show that meaningful survival distributions can be learned from whole-transcriptome data, that models trained on multi-cancer data can outperform models trained on single-cancer data, and that our unsupervised learned transcriptome representations outperform supervised benchmarks, and we demonstrate how these models can be interpreted.
An accurate model of a patient’s individual survival distribution can help determine the appropriate treatment for terminal patients. Unfortunately, risk scores do not provide survival probabilities, single-time probability models provide probabilities at only one time point, and standard Kaplan-Meier survival curves provide only population averages that are not specific to individual patients. This motivates an alternative class of tools that can learn a model providing an individual survival distribution (ISD) for each subject, which gives survival probabilities across all times, such as extensions to the Cox model, Accelerated Failure Time, an extension to Random Survival Forests, and Multi-Task Logistic Regression. We motivate and define a novel evaluation approach, “D-Calibration”, which determines whether an ISD model’s probability estimates are meaningful. [GitHub]
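A minimal sketch of the uncensored case of D-Calibration: if an ISD model is well calibrated, the predicted survival probabilities evaluated at each subject's observed event time should look uniform on [0, 1], which a Pearson chi-squared statistic over equal-width bins can quantify (the handling of censored subjects is omitted here):

```python
import numpy as np

def d_calibration_statistic(survival_probs_at_event, n_bins=10):
    """Pearson chi-squared statistic for D-Calibration (uncensored case).

    If an ISD model is D-calibrated, the predicted survival probabilities
    evaluated at each subject's observed event time are uniformly
    distributed on [0, 1], so each bin should hold ~ n / n_bins subjects.
    """
    p = np.asarray(survival_probs_at_event, dtype=float)
    counts, _ = np.histogram(p, bins=n_bins, range=(0.0, 1.0))
    expected = len(p) / n_bins
    return np.sum((counts - expected) ** 2 / expected)

# Perfectly uniform predictions give a statistic of zero; predictions
# piled up near 1.0 (a mis-calibrated model) give a large statistic.
rng = np.random.default_rng(0)
stat_uniform = d_calibration_statistic((np.arange(1000) + 0.5) / 1000)
stat_skewed = d_calibration_statistic(rng.beta(5, 1, size=1000))
```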
Here, we experiment to see if correlations exist between BERT representations and EEG brain-wave data for a given input sequence of words. Exploring the similarities and differences between machine learning and the brain can help inform the building of future ML architectures. As part of this project, we constructed sets of features based on combinations of BERT layer embeddings and trained different types of mapping models to determine which are most correlated with EEG data. [GitHub]
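As an illustrative sketch of one such mapping model, a closed-form ridge regression from feature vectors to EEG channels (all names, dimensions, and data below are synthetic stand-ins, not the project's actual features):

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression mapping features X (n, d) to targets
    Y (n, c): W = (X'X + alpha*I)^-1 X'Y, one column per target channel."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

# Toy stand-ins: rows are words; X plays the role of BERT layer embeddings
# and Y the EEG response aligned to each word.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 16))                       # "BERT" features
W_true = rng.normal(size=(16, 4))
Y = X @ W_true + 0.01 * rng.normal(size=(200, 4))    # "EEG" channels

W = fit_ridge(X, Y, alpha=0.1)
predictions = X @ W
# Correlation between predicted and observed channels scores the mapping.
corr = np.corrcoef(predictions[:, 0], Y[:, 0])[0, 1]
```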
It has long been known that augmenting a recurrent neural network with external memory can increase the complexity of the patterns it can express. This extension is particularly interesting for the transduction problem: learning a model that both decides whether a string belongs to an input language and, if it does, translates it into another target language. In this work, we explore ways in which a Stack-RNN (SRNN) can be incorporated into a Seq2Seq model, with the end goal of creating an effective transduction pipeline. [GitHub]
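A minimal numpy sketch of the soft stack update in the style of Joulin and Mikolov's Stack-RNN, assuming only push and pop actions (the real cell also produces the action weights and the pushed value from its hidden state, and everything is trained end to end):

```python
import numpy as np

def stack_step(stack, a_push, a_pop, pushed_value):
    """One soft update of a differentiable stack: with weight a_push the new
    value goes on top and everything shifts down; with weight a_pop
    everything shifts up. a_push + a_pop = 1; index 0 is the top."""
    down = np.concatenate(([pushed_value], stack[:-1]))  # push shifts down
    up = np.concatenate((stack[1:], [0.0]))              # pop shifts up
    return a_push * down + a_pop * up

# With hard 0/1 action weights the soft stack behaves like a normal stack.
stack = np.zeros(4)
stack = stack_step(stack, 1.0, 0.0, 1.0)  # push 1.0
stack = stack_step(stack, 1.0, 0.0, 2.0)  # push 2.0
stack = stack_step(stack, 0.0, 1.0, 0.0)  # pop
top = stack[0]
```

Because every operation is a convex combination, gradients flow through the action weights, which is what lets the controller learn when to push and pop.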
This project's aim is to improve upon existing methods for using coordinate data to localize events that have multiple centers. To accomplish this, we propose a framework that allows for greater flexibility in center localization compared to previous literature, then test our method on publicly available geo-tagged Tweets. This work is especially important during the current pandemic, as understanding hot spots of COVID-19 infection is of particular societal interest. [GitHub]
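As a hedged baseline sketch (a common point of comparison, not the framework proposed in this project), Lloyd's algorithm with farthest-point initialization can localize k event centers from coordinate data:

```python
import numpy as np

def k_centers(points, k, iters=20):
    """Localize k centers from (n, 2) coordinate data with plain Lloyd's
    algorithm, seeded by deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        # Next seed: the point farthest from all centers chosen so far.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers],
                   axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute means.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :],
                               axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([points[labels == j].mean(axis=0)
                            for j in range(k)])
    return centers

# Two synthetic hot spots around (0, 0) and (10, 10) on a flat toy plane.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
centers = k_centers(pts, k=2)
```

Real geo-tagged data would of course need proper great-circle distances and a way to choose k, which is part of what a more flexible framework has to address.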
Previous authors generated Latent Dirichlet Allocation (LDA) topics for four sentiment datasets, and applied combinations of classifiers and ensemble methods to better understand the effects of LDA on sentiment classification performance. We follow their procedure on two datasets and attempt to answer the question "Is LDA worth it?" by comparing sentiment analysis performance and computational burden with and without LDA preprocessing, a comparison absent from the original publication. [GitHub]
Heritability is the standard measure used to quantify the relative genetic contributions to various phenotypes, diseases, and conditions. However, heritability values derived from the two most popular estimation methods (namely, twin studies and GWAS) rarely agree; this discrepancy is called the missing heritability problem. In this project, we develop a novel approach to estimating heritability by assessing the magnitude of the secular trend in the incidence or prevalence of a specific phenotype or disease for a given population, and compare our heritability values to those in the literature.
Roughly 1 in every 1,000 to 3,000 abdominal operations results in the unintended retention of surgical tools — in other words, tweezers, scalpels, needles, sponges, towels, and more occasionally remain in patients after surgery. In this project, I led a team responsible for creating the machine-learning component of a medical-tool object-detection application. Downstream iterations of the model could one day act as another set of tool-counting “eyes” in the operating room. Please refer to the AIMSS official release to learn more. [GitHub]
COVID-19 has arguably impacted every dimension of social living — be that employment, schooling, healthcare, or recreational activities. One key element of the North American pandemic response has been the emphasis that the spread or prevention of the pandemic depends largely on the measures taken by members of the general public. Our central research question focuses on identifying the socioeconomic, health, and demographic factors associated with an individual forgoing personal measures that prevent the spread of COVID-19.
This application was developed to demonstrate AI (with a medical twist) to children aged 6-12. First, I created and labeled a large dataset of photos containing paper cutouts of organs, then used TensorFlow's Object Detection API to tune a model to classify them. The application takes a webcam feed as input, uses the model to predict which medical item appears in the video, draws a bounding box around it in every frame, and displays the augmented video live on screen. This tool was demonstrated to over 200 children and family members.