Research

I'm interested in contributing to genomics research, such as improving sequence alignment techniques and methods for identifying variants and relating them to human disease, as well as building support tools for electronic health record (EHR) analysis.

Harvard Medical School - Cassa Lab

Deep learning methods toward resolving uncertain variant classifications

PI: Chris Cassa, PhD

Many patients have rare variants in genes associated with actionable diseases; however, the low frequency of these variants makes it difficult to assess their risk accurately from population data alone. Methods that improve impact prediction based on specific sources of evidence would improve the assessment and classification of these rare variants. In this project, we aimed to select and optimize a large language model (LLM) to improve classifications of variants of uncertain significance within the ClinVar database and to create a framework for better understanding these variants using the text summary information provided. ClinVar submissions were preprocessed to extract plaintext comments and their interpretations for model training. Six LLMs were trained and tested; dmis-lab/biobert-base-cased-v1.2, a BERT model trained on additional data from PubMed and PMC, performed best, with an area under the precision-recall curve (AUPRC) of 0.92. Using a case review approach, a subset of comments was analyzed with attention mapping using non-negative matrix factorization (ecco package) to determine where the model focuses for the prediction task. These findings enhance the interpretability of the fine-tuned LLM and suggest that the model can identify key concepts relevant to the American College of Medical Genetics and Genomics evidence framework for determining variant pathogenicity.
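For readers unfamiliar with the AUPRC metric reported above, here is a minimal sketch of one common way to estimate it, average precision. The project description does not say how the 0.92 figure was computed, so this estimator is an assumption, and the labels and scores in the usage lines are made-up illustration values (1 = pathogenic, 0 = benign is a hypothetical encoding).

```python
def average_precision(labels, scores):
    """Estimate the area under the precision-recall curve.

    labels: list of 0/1 ground-truth classes (1 = positive class).
    scores: list of model scores, higher = more likely positive.
    Assumes no tied scores; ties would need grouped thresholds.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")

    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where this positive is recovered.
            precision_sum += hits / rank
    return precision_sum / total_positives


# Hypothetical example: two positives ranked 1st and 3rd of four comments.
example_labels = [1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.1]
example_ap = average_precision(example_labels, example_scores)
```

A score of 1.0 means every positive example is ranked above every negative one. Unlike accuracy, the metric stays informative when one class is much rarer than the other.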

Vertex Pharmaceuticals - Biometrics

Implementing the Admiral Pharmaverse R package to generate ADaM Datasets

Automated Program for Cross-Checking cSDRG documents with relevant SDTM and Metadata datasets in R

Supervisor: Todd Case, MS, Senior Director

The project aimed to use and build upon recently developed open-source R packages for generating clinical trial analysis datasets. I researched and implemented the Admiral Pharmaverse R package alongside Tidyverse to generate Subject-Level (ADSL) and Laboratory Test Result (ADLB) analysis datasets that followed the pre-established specifications provided to me. This also included creating a variety of supplemental functions, used either at the dataset level for specific variable creation or across datasets to efficiently structure relevant metadata parameters. Additionally, when documents are prepared and cross-checked for regulatory submission to government agencies for drug development or approval, many errors can arise because of the large number of fields and documents a person must review and compare. I therefore wrote an automated program in R to cross-check standardized clinical trial documents against their respective SDTM and metadata datasets. This required processing multiple data modalities and then extracting relevant features and text for comparison. Ultimately, I created a summarized output dataset that shows pertinent information to a human reviewer to minimize cross-checking errors during study submissions.
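The cross-checking idea can be illustrated with a small sketch. This is not the R program described above. It is a simplified Python illustration that assumes the values stated in a document and the values found in the datasets have already been extracted into dictionaries. The function name, the status labels, and the per-domain record counts are all hypothetical.

```python
def cross_check(documented, actual):
    """Compare values stated in a document with values found in the data.

    documented: dict mapping a field name to the value written in the document.
    actual: dict mapping a field name to the value derived from the datasets.
    Returns one row per field: (field, documented value, dataset value, status).
    """
    rows = []
    for field in sorted(set(documented) | set(actual)):
        doc_value = documented.get(field)
        data_value = actual.get(field)
        if field not in documented:
            status = "missing in document"
        elif field not in actual:
            status = "missing in data"
        elif doc_value == data_value:
            status = "match"
        else:
            status = "mismatch"
        rows.append((field, doc_value, data_value, status))
    return rows


# Hypothetical record counts per domain, for illustration only.
document_counts = {"DM": 120, "AE": 340, "LB": 5000}
dataset_counts = {"DM": 120, "AE": 342, "VS": 900}

summary = cross_check(document_counts, dataset_counts)
# Keep only the rows a human reviewer needs to look at.
needs_review = [row for row in summary if row[3] != "match"]
```

Returning every row with a status, rather than only the discrepancies, lets a reviewer confirm what was checked as well as what failed.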

Red Hat - AIOps Data Group

As a Data Scientist Software Engineer Intern, I strengthened my version control skills by tracking progress with Git. I updated the content and organization of the Data Science Workflow repository and created documentation on best practices for data science visualizations.

Here is my presentation to the engineering team: Examples of the Best Practices for Data Science Visualization. Enjoy!

ZED Lab at UChicago

I joined Dr. Ishanu Chattopadhyay’s lab during my summer break in 2020. ZED Lab studies automated inference and investigates the core algorithmic principles behind data analysis with minimal human intervention. I've learned a great deal about machine learning concepts such as classification and regression since joining the lab.

I began creating my own Jupyter notebooks to try out some of the methods I was learning about, such as running principal component analysis (PCA) on Spellman gene data.

Feel free to browse my collection of Jupyter notebooks here.

Dr. Mark Nicolls’ Lab at Stanford

I had the privilege of working in Dr. Mark Nicolls’ Lab at Stanford during Summer 2019. I deeply enjoyed working with and learning from all members of the lab, and appreciated how collaborative and dedicated everyone was. I learned how to collect tissue samples from dissected rats for staining and analysis, maintained the mouse colony, performed PCR for genotyping, and ultimately, absorbed as much information as I could.

I assisted on a project that tracked growth changes over time in a mouse tail lymphedema model to assess how dysregulation of hypoxia-inducible factor 2α (HIF-2α) affects disease pathogenesis. Using control HIF-2α and HIF-2α OE mice, we performed lymphatic surgery and induced lymphedema. I then tracked lymphedema progression through photos. I analyzed pictures of the mouse tails using Photoshop and Excel and calculated percent changes in tail volume using a truncated cone model. Ultimately, we found that HIF-2α OE mice had a smaller tail volume over time, indicating that HIF-2α may serve as a mediator in promoting lymphatic health.
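As an illustration of the truncated cone model mentioned above, the sketch below treats the tail as a stack of truncated cones between consecutive diameter measurements and sums their volumes, using V = πh(r1² + r1·r2 + r2²)/3 for each segment. The actual analysis was done in Photoshop and Excel. The measurement spacing, the diameters, and the function names here are assumptions for illustration, not data from the study.

```python
import math


def frustum_volume(d1, d2, length):
    """Volume of a truncated cone with end diameters d1 and d2."""
    r1, r2 = d1 / 2.0, d2 / 2.0
    return math.pi * length * (r1 * r1 + r1 * r2 + r2 * r2) / 3.0


def tail_volume(diameters, spacing):
    """Sum the truncated cone volumes between consecutive measurements.

    diameters: tail diameters measured at evenly spaced points (same units).
    spacing: distance between neighbouring measurement points.
    """
    if len(diameters) < 2:
        raise ValueError("need at least two diameter measurements")
    return sum(
        frustum_volume(a, b, spacing)
        for a, b in zip(diameters, diameters[1:])
    )


def percent_change(baseline, current):
    """Percent change of current relative to baseline."""
    return 100.0 * (current - baseline) / baseline


# Made-up diameters (mm) measured every 5 mm along the tail.
baseline_volume = tail_volume([3.0, 2.8, 2.6, 2.4], spacing=5.0)
followup_volume = tail_volume([3.6, 3.3, 2.9, 2.5], spacing=5.0)
swelling = percent_change(baseline_volume, followup_volume)
```

Expressing each time point as a percent change from the same animal's baseline makes tails of different starting sizes comparable.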