Machine Learning 🚀 In Genomics 🧬 and HealTh ❤️ = 💡

Research overview

The general theme of our research is to develop computational frameworks and machine learning algorithms that effectively integrate genome-wide and phenome-wide heterogeneous measurements to reveal meaningful biological patterns associated with specific phenotypic traits, including complex diseases.

Phenotyping via topic models from health big data

The broad adoption of electronic health record (EHR) systems has created unprecedented resources and opportunities for conducting health informatics research. Hospitals routinely generate EHR data for millions of patients, which are increasingly being standardized along with free-form text from clinical notes. In the United States, the proportion of non-federal acute care hospitals with basic digital systems increased from 9.4% to 96% over the seven-year period between 2008 and 2015. The share of comprehensive EHR data recording multiple data types, including clinical notes, increased from only 1.6% in 2008 to 40% in 2015. With the aid of effective computational methods, these EHR data promise to define an encyclopedia of diseases, disorders, injuries and other related health conditions, uncovering a modular phenotypic network. However, EHRs present several modeling challenges: highly sparse data matrices; noisy, irregular clinical notes; arbitrary biases in billing code assignment; diagnosis-driven lab tests leading to not-missing-at-random (NMAR) biases; heterogeneous data types across clinical notes, billing codes, lab tests, and medications; and longitudinal records with sparse, irregularly sampled time points. To address these challenges, we develop multi-view Bayesian and deep learning frameworks for EHR data integration and modeling in the vein of collaborative filtering and latent topic models.
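
To make the idea of a latent topic model over sparse patient records concrete, here is a minimal sketch of plain latent Dirichlet allocation fitted by collapsed Gibbs sampling. It is not the MixEHR family of models described above: it handles a single data type, ignores missingness and time, and the function name, hyperparameters (alpha, beta), and the toy patients with ICD-like codes are all assumptions made for illustration.

```python
import random


def fit_lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for plain LDA (hypothetical helper).

    docs: list of patients, each a list of code strings (a sparse record).
    Returns (theta, phi, vocab): per-patient topic mixtures, per-topic
    code distributions, and the sorted code vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({code for doc in docs for code in doc})
    index = {code: i for i, code in enumerate(vocab)}
    n_codes = len(vocab)

    # Count tables used by the sampler.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_code = [[0] * n_codes for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Random initial topic assignment for every observed code.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([])
        for code in doc:
            k = rng.randrange(n_topics)
            assign[d].append(k)
            doc_topic[d][k] += 1
            topic_code[k][index[code]] += 1
            topic_total[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for pos, code in enumerate(doc):
                w, k = index[code], assign[d][pos]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_code[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this code.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_code[t][w] + beta)
                    / (topic_total[t] + n_codes * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][pos] = k
                doc_topic[d][k] += 1
                topic_code[k][w] += 1
                topic_total[k] += 1

    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_code[t][w] + beta) / (topic_total[t] + n_codes * beta)
         for w in range(n_codes)]
        for t in range(n_topics)
    ]
    return theta, phi, vocab


# Toy patients (made up): each record lists only the codes that were observed.
patients = [
    ["I50", "I10", "I10", "E78"],
    ["I50", "I10", "E78"],
    ["J45", "J44", "J45"],
    ["J44", "J45", "J44", "I10"],
]
theta, phi, vocab = fit_lda_gibbs(patients, n_topics=2)
for t, dist in enumerate(phi):
    top = sorted(zip(dist, vocab), reverse=True)[:3]
    print("topic", t, [code for _, code in top])
```

Each row of theta is a patient's mixture over topics and each row of phi is a topic's distribution over codes; storing only the observed codes per patient is one simple way to work with a highly sparse patient-by-code matrix.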

Selected publications (trainees/mentees from Li lab; †equal contribution; *corresponding authors):
  1. Yue Li* et al (2020). Inferring multimodal latent topics from electronic health records. Nature Communications 1–17. http://doi.org/10.1038/s41467-020-16378-3
  2. Yuri Ahuja*, Yuesong Zou, Aman Verma, David Buckeridge*, Yue Li*. (2022) MixEHR-Guided: A guided multi-modal topic modeling approach for large-scale automatic phenotyping using the electronic health record. Journal of Biomedical Informatics 104190 doi:10.1016/j.jbi.2022.104190
  3. Yixuan Li, Archer Y. Yang*, Ariane Marelli* and Yue Li*. (2024) MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records. Journal of Biomedical Informatics. 153 (104638). doi.org/10.1016/j.jbi.2024.104638
  4. Yuening Wang, Rodrigo Benavides, Luda Diatchenko, Audrey Grant*, Yue Li*. (2022) A graph-embedded topic model enables characterization of diverse pain phenotypes among UK Biobank individuals. iScience 25, 104390 doi: https://doi.org/10.1016/j.isci.2022.104390
  5. Ziyang Song, Yuanyi Hu, Aman Verma, David Buckeridge, and Yue Li*. (2022) Automatic phenotyping by a seed-guided topic model. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22), August 14–18, 2022, Washington, DC, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3534678.3542675 (KDD HealthDay2022 Best Paper award)
  6. Yuesong Zou, Ahmad Pesaranghader, Aman Verma, David Buckeridge, and Yue Li*. (2022) Modeling electronic health record data using a knowledge-graph-embedded topic model. Scientific Reports 12, 17868. doi.org/10.1038/s41598-022-22956-w
  7. Ruohan Wang, Zilong Wang, Ziyang Song, David L. Buckeridge and Yue Li*. (2024) MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling. In Proceedings of The 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB ’24). ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3698587.3701368 (acceptance rate: 35%)

Time-series forecasting and representation learning of health

EHR data is inherently longitudinal. For example, the Quebec Congenital Heart Disease (CHD) database includes up to 20 years of follow-up diagnosis codes for each patient. To address this, I collaborated with Dr. Ariane Marelli (Department of Medicine) and Dr. David Buckeridge (School of Public Health) at McGill to model longitudinal EHR data from MUHC and PopHR. Our early work developed the Deep Heart Trajectory Model (DHTM), a recurrent neural network predicting future heart failure risk from past healthcare records. Building on this, we created hART, a transformer-based model, for more accurate predictions, published in the International Journal of Medical Informatics (IJMI) and led by Harry Moroz, a Master’s student in Digital Health Innovation, co-supervised by Dr. Marelli and me. EHR data is also irregularly observed, with visit frequencies varying by disease state. To infer longer trajectories than those in training data, we developed TimelyGPT and TrajGPT, leveraging extrapolatable position embeddings (xPOS) and scalable linearized attention. Trained on 1.2 million Quebec patients over 20 years, these models show strong capabilities in predicting future diagnoses. For bi-directional representation of longitudinal data, we further extended TimelyGPT to biTimelyGPT.

Selected publications:
  1. Harry Moroz, Yue Li*, Ariane Marelli*. (2024) hART: Deep Learning-Informed Lifespan Heart Failure Risk Trajectories. International Journal of Medical Informatics (In press) https://doi.org/10.1016/j.ijmedinf.2024.105384
  2. Ziyang Song, Qincheng Lu, He Zhu and Yue Li* (2024) Bidirectional Generative Pre-training for Improving Time Series Representation Learning. In the Proceedings of the 9th Machine Learning for Healthcare Conference (MLHC), Proceedings of Machine Learning Research (PMLR), Volume 252. arXiv:2402.09558
  3. Ziyang Song, Qincheng Lu, He Zhu, David L. Buckeridge and Yue Li* (2024) TimelyGPT: Extrapolatable Transformer Pre-training for Long-term Time-Series Forecasting in Healthcare. In Proceedings of The 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB ’24). ACM, Shenzhen, Guangdong, PR China, 16 pages. https://doi.org/10.1145/3698587.3701364 (acceptance rate: 35%)
  4. Ziyang Song, Qincheng Lu, He Zhu, David L. Buckeridge and Yue Li* (2024) TrajGPT: Irregular Time-Series Representation Learning for Health Trajectory Analysis. arXiv preprint arXiv:2410.02133

Inferring causal variants and cell types from GWAS

Genome-wide association studies (GWAS) can provide numerous insights into the genetic basis of complex diseases, and ultimately contribute to personalized risk prediction and precision medicine. When the regions of association contain protein-altering variants, the path from GWAS hits to disease mechanism and eventually therapeutic development can start with the disrupted gene as a candidate target. However, over 90% of the disease-associated loci consist exclusively of non-coding variants, hindering the ability to interpret their function. Moreover, genome-wide significant loci explain only a small fraction of the phenotypic variance attributable to genetics, a difference often referred to as the "missing heritability". Finally, because of linkage disequilibrium (LD), many significant variants are non-causal but merely linked to causal variants within the LD block, which may span from a few kilobases to hundreds of kilobases. We are interested in developing statistical inference models to computationally fine-map causal variants by rigorously integrating various sources of functional genomic reference information, and in inferring the causal mechanisms underlying various related genetic traits or diseases.
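
As a rough illustration of why LD makes fine-mapping necessary, the sketch below computes posterior inclusion probabilities (PIPs) for the variants in one locus from summary statistics. It uses a textbook simplification, Wakefield-style approximate Bayes factors under the assumption of exactly one causal variant per locus, and is not the method of any paper listed here. The prior effect-size standard deviation (prior_sd), the function names, and the z-scores are all made-up assumptions.

```python
import math


def log_approx_bayes_factor(z, se, prior_sd=0.2):
    """Log approximate Bayes factor for association versus the null.

    z: marginal z-score; se: standard error of the effect estimate;
    prior_sd: assumed prior standard deviation of the true effect size.
    """
    v = se ** 2
    w = prior_sd ** 2
    shrink = w / (v + w)
    return 0.5 * math.log(1.0 - shrink) + 0.5 * z * z * shrink


def single_causal_pips(z_scores, std_errors, prior_sd=0.2):
    """PIPs assuming exactly one causal variant and a flat prior over variants."""
    log_bfs = [
        log_approx_bayes_factor(z, se, prior_sd)
        for z, se in zip(z_scores, std_errors)
    ]
    top = max(log_bfs)
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = [math.exp(lb - top) for lb in log_bfs]
    total = sum(weights)
    return [wt / total for wt in weights]


def credible_set(pips, coverage=0.95):
    """Smallest set of variant indices whose PIPs sum to at least `coverage`."""
    order = sorted(range(len(pips)), key=lambda i: pips[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += pips[i]
        if mass >= coverage:
            break
    return chosen


# Three variants in tight LD carry similar z-scores; the fourth is unlinked.
z_scores = [5.2, 5.0, 4.8, 1.0]
std_errors = [0.05, 0.05, 0.05, 0.05]
pips = single_causal_pips(z_scores, std_errors)
print([round(p, 3) for p in pips])
print(credible_set(pips))
```

In this toy locus all three linked variants pass genome-wide significance, yet no single one receives most of the posterior mass, which is the ambiguity that functional annotations are meant to help resolve.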

Selected publications:
  1. Shadi Zabad, Simon Gravel*, Yue Li*. (2023) Fast and Scalable Polygenic Risk Modeling with Variational Inference. American Journal of Human Genetics doi:10.1016/j.ajhg.2023.03.009.
  2. Wenmin Zhang*, Hamed Najabadi, Yue Li*. (2023) SparsePro: an efficient genome-wide fine-mapping method integrating summary statistics and functional annotations. PLOS Genetics 19(12): e1011104. https://doi.org/10.1371/journal.pgen.1011104
  3. Yue Li* & Manolis Kellis* (2016). Joint Bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases. Nucleic Acids Research, 16(2), 1-13.

Single-cell data integration

Rapid technological innovations in single-cell multi-omic profiling assays allow gene regulatory programs to be investigated from multiple dimensions. Specifically, single-cell multi-omic data include single-cell proteomics, methylation, chromatin interactions, Assay for Transposase-Accessible Chromatin (ATAC), etc. Appropriately integrating these multi-omic data will lead to new mechanistic insights into complex human diseases. A key challenge is to develop rigorous computational strategies that account for different data modalities while identifying disease-causal pathways by borrowing information across these omics.

Selected publications:
  1. Xinyu Yuan, Zhihao Zhan, Zuobai Zhang, Manqi Zhou, Jianan Zhao, Boyu Han, Yue Li*, Jian Tang*. Cell ontology guided transcriptome foundation model. NeurIPS 2024 spotlight. arXiv preprint arXiv:2408.12373
  2. Anjali Chawla, Doruk Cakmakci, Wenmin Zhang, Malosree Maitra, Reza Rahimian, Haruka Mitsuhashi, MA Davoli, Jenny Yang, Gary Gang Chen, Ryan Denniston, Deborah Mash, Naguib Mechawar, Matthew Suderman, Yue Li*, Corina Nagy*, Gustavo Turecki*. Differential Chromatin Architecture and Risk Variants in Deep Layer Excitatory Neurons and Grey Matter Microglia Contribute to Major Depressive Disorder. bioRxiv 2023.10.02.560567; doi: https://doi.org/10.1101/2023.10.02.560567 (accepted at Nature Genetics)
  3. Yimin Fan, Yu Li, Jun Ding*, Yue Li*. (2024). GFETM: Genome Foundation-Based Embedded Topic Model for scATAC-seq Modeling. In: Ma, J. (eds) Research in Computational Molecular Biology. RECOMB 2024. Lecture Notes in Computer Science, vol 14758. Springer, Cham.
  4. Yifan Zhao†, Huiyu Cai†, Zuobai Zhang, Jian Tang*, Yue Li*. (2021) Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data. Nature Communications, 12(1), 5261. https://doi.org/10.1038/s41467-021-25534-2
  5. Lakshmipuram Seshadri Swapna, Michael Huang, and Yue Li*. (2023) Guided-topic modelling of single-cell transcriptomes enables joint cell-type-specific and disease-subtype deconvolution of bulk transcriptomes with a focus on cancer studies. Preprint available from bioRxiv doi: https://doi.org/10.1101/2022.12.22.521640
  6. Bahrami, M., Maitra, M., Nagy, C., Turecki, G., Rabiee, H. R., Li, Y. (2020). Deep feature extraction of single-cell transcriptomes by generative adversarial network. Bioinformatics (Oxford, England). http://doi.org/10.1093/bioinformatics/btaa976

Learning regulatory potential from functional genomic data

Disruption and aberrant coordination of gene expression often rank high among the causes of complex human diseases. To improve our ability to interpret non-coding sequence, various functional genomics data were recently generated from ChIP-seq, massively parallel reporter assays (MPRA), and Hi-C. To harness the information provided by these data and improve the inference of causal eQTL/GWAS SNPs, I'm interested in developing supervised and semi-supervised learning strategies that jointly learn the underlying regulatory properties implicated by both functional genomic data and GWAS/eQTL signals.

Selected publications:
  1. Li, Y., Shi, A., Tewhey, R., Sabeti, P., Ernst, J., and Kellis, M. (2017), Genome-wide regulatory model from MPRA data predicts functional regions, eQTLs, and GWAS hits. bioRxiv. doi: https://doi.org/10.1101/110171
  2. Kreimer, A., Zeng, H., Edwards, M. D., Guo, Y., Tian, K., Shin, S., Welch, R., Wainberg, M., Mohan, R., Sinnott-Armstrong, N. A., Li, Y., Eraslan, G., Amin, T. B., Goke, J., Mueller, N. S., Kellis, M., Kundaje, A., Beer, M. A., Keles, S., Gifford, D. K. and Yosef, N. (2017), Predicting gene expression in massively parallel reporter assays: a comparative study. Human Mutation. 38(9):1240-1250

Inferring microRNA regulatory networks in cancers

MicroRNAs (miRNAs) are noncoding RNA species about 22 nucleotides long. The regulatory roles of miRNAs have important implications in development and disease. Functional characterization of miRNAs requires accurate identification of their RNA targets, which has been a challenging computational task due to various confounding factors centering around the combinatorial co-regulatory relationships between miRNAs and mRNAs. Earlier sequence-based methods rely mostly on seed match, phylogenetic conservation, and binding energy. Recently, there has been a paradigm shift from sequence-based binary classification to more quantitative, expression-based and network-focused approaches. The momentum of this shift is largely facilitated by the increasing amount of expression profiling data of mRNAs and miRNAs across various experimental conditions. One of our research interests is to infer cancer-specific miRNA regulatory networks that can characterize cancer phenotypes and/or facilitate prognostic biomarker development.
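
For readers unfamiliar with the seed-match criterion used by earlier sequence-based methods, the following sketch scans a 3' UTR for perfect complements of miRNA positions 2 to 8. It is a bare illustration of the idea only: real predictors also weigh site type, conservation, and binding energy. The function names, the example miRNA sequence, and the UTR are assumptions made up for illustration.

```python
# Watson-Crick pairing for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def to_rna(seq):
    """Upper-case a sequence and convert DNA-style T to U."""
    return seq.upper().replace("T", "U")


def seed_site(mirna, start=2, end=8):
    """mRNA motif complementary to miRNA positions start..end (1-based, inclusive)."""
    seed = to_rna(mirna)[start - 1:end]
    # The target site pairs antiparallel to the seed, so reverse-complement it.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_matches(mirna, utr):
    """0-based start positions of perfect seed-complementary sites in the UTR."""
    site = seed_site(mirna)
    utr = to_rna(utr)
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)
    return hits


# Example miRNA sequence (illustrative) and a made-up UTR with two planted sites.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
site = seed_site(mirna)
utr = "AAG" + site + "AUUGG" + site + "AA"
print(site)
print(find_seed_matches(mirna, utr))
```

A binary hit list like this says nothing about how strongly a target is repressed in a given tissue, which is the gap that expression-based and network-focused approaches try to fill.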

Selected publications:
  1. Li, Y. and Zhang, Z. (2015). WIREs RNA.
  2. Li, Y., Zhang, Z. (2014). Potential microRNA-mediated oncogenic intercellular communication revealed by pan-cancer analysis. Scientific Reports. 4(7097)
  3. Li Y, Liang M, Zhang Z. (2014) Regression Analysis of Combined Gene Expression Regulation in Acute Myeloid Leukemia. PLoS Computational Biology. 10(10) e1003908
  4. Li, Y.*, Liang, C.*, Wong, KC., Luo, J., Zhang, Z. (2014). Mirsynergy: detecting synergistic miRNA regulatory modules by overlapping neighbourhood expansion. Bioinformatics. 30(18), 2627-2635.
  5. Li, Y., Liang, C., Wong, KC, Jin, K., and Zhang, Z. (2014) Inferring probabilistic miRNA-mRNA interaction signatures in cancers: a role-switch approach. Nucleic Acids Research, 42(9), e76.
  6. Li, Y., Goldenberg, A., Wong, KC., Zhang Z. (2013). A probabilistic approach to explore human miRNA targetome by integrating miRNA-overexpression data and sequence information. Bioinformatics (Oxford, England), 30(5), 621–628

N6-methyladenosine RNA modification

N6-methyladenosine (m6A) is the most prevalent endogenous methylation in RNA. Dominissini et al. (2012) and Meyer et al. (2012) demonstrated a novel NGS protocol, m6A-seq, which interrogates transcriptome-wide m6A methylation through antibody-mediated capture and massively parallel sequencing. Despite being implicated in the regulation of gene expression, the functional roles of m6A are still largely unknown. In collaboration with Prof. Crystal Zhao, we are exploring the fundamental biology of m6A in mammalian development in greater depth with a combined experimental and computational approach.

Selected publications:
  1. Wang Y, Li Y, Yue M, Wang J, Kumar S, Wechsler-Reya RJ, Zhang Z, Ogawa Y, Kellis M, Duester G, Zhao JC. (2018) N6-methyladenosine RNA modification regulates embryonic neural stem cell self-renewal through histone modifications. Nature Neuroscience 21(2):195-206
  2. Li Y*, Wang Y*, Zhang Z, Zamudio AV, Zhao JC. Genome-wide detection of high abundance N6-methyladenosine sites by microarray. (2015) RNA (8):1511-8
  3. Wang, Y., Li, Y., Toth, J. I., Petroski, M. D., Zhang, Z., & Zhao, J. C. (2014). N6-methyladenosine modification destabilizes developmental regulators in embryonic stem cells. Nature Cell Biology, 16(2) 1-10

Detection of protein-associated noncoding RNA from RIP-seq, CLIP-seq, and PAR-CLIP experiments

Comprehensive transcriptome analyses suggest that only 1%-2% of the human or mouse genome is protein coding, whereas 70%-90% is transcriptionally active but does not code for proteins; these transcripts are thus denoted non-coding RNA (ncRNA) (ENCODE Project Consortium, 2007). Many lines of evidence suggest that many of these ncRNAs are evolutionarily conserved, functionally interact with transcription factors and/or chromatin regulators, and participate in gene regulation. NGS platforms such as PAR-CLIP and RIP-seq enable unbiased genome-wide identification of these ncRNAs and thus promise to reveal unique aspects of molecular biology.

Selected publications:
  1. Li, Y., Zhao, D. Y., Greenblatt, J. F., & Zhang, Z. (2013). RIPSeeker: a statistical package for identifying protein-associated transcripts from RIP-seq experiments. Nucleic Acids Research, 41(8), e94

Kinome analysis

We proposed and implemented a computational pipeline to analyze peptide array kinome data (Li et al., 2012). This work, my B.Sc. Honours thesis, was conducted under the supervision of Dr. Anthony Kusalik and in collaboration with immunologists (co-authors) from the Vaccine and Infectious Disease Organization (VIDO) at the University of Saskatchewan. To our knowledge, the proposed pipeline is the first integrative approach that addresses kinome-specific computational challenges in microarray analyses. In particular, our statistical testing for differentially phosphorylated kinase peptides takes into account the technical variation inherent to the technology and the biological variation arising from dynamic kinase activities between biological replicates. Compared with existing methods, our approach is more sensitive in detecting kinases involved in well-defined signaling pathways activated by the selected stimuli. The central roles of kinases in immune defence make them promising therapeutic targets. Rigorous detection of subtle changes in treatment-specific kinase activities via a powerful platform such as the kinome microarray may facilitate pharmaceutical design against diseases.

Selected publications:
  1. Li, Y., Arsenault, R. J., Trost, B., Slind, J., Griebel, P. J., Napper, S., and Kusalik, A. (2012). A Systematic Approach for Analysis of Peptide Array Kinome Data. Science Signaling, 5(220), pl2–pl2.