Large Language Models (LLMs) like GPT and BERT are starting to appear on the scene in genomics. For those new to machine learning in genomics, this might seem like a puzzling development – could we really develop something like ChatGPT to interpret my single-cell RNA-seq data?
But LLMs can do much more than be chatbots. LLMs are more complex and potentially more powerful than some of the easier-to-grasp machine learning techniques that many biologists have only recently become familiar with, like convolutional neural networks (CNNs) and support vector machines (SVMs). Anyone interested in genomics is going to hear a lot more about LLMs in the coming years. Here are five useful links to get you started on LLMs:
Video primer: Large Language Models for Computational Biology
Dr. Jian Ma, an outstanding computational biologist at Carnegie Mellon, presented an introductory talk on LLMs and computational biology in 2023:
LLMs and Deep Learning for Genomics
Here is a 2023 review that is a bit more technical, discussing where LLMs fit into the larger picture of deep learning in genomics:
Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models
Geneformer and Single Cell Omics
Geneformer was published in Nature last summer and demonstrates how models with LLM architectures can analyze your single-cell transcriptomic data:
Transfer learning enables predictions in network biology
Gene enrichment analysis: SPINDOCTOR
A group out of the Lawrence Berkeley National Lab shows that LLMs for genomic analysis are still a work in progress. The model SPINDOCTOR doesn’t yet outperform standard Gene Ontology-type analyses at finding interesting pathways in your gene set.
Gene Set Summarization using Large Language Models
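If you haven't run a "standard Gene Ontology-type analysis" before, the usual core of it is an over-representation test: given your gene set, is a particular annotation term more common than you'd expect by chance? Here is a minimal sketch of that idea using a one-sided hypergeometric test. This illustrates the conventional baseline only, not the SPINDOCTOR method or the exact procedure used in the paper, and the function name and all of the numbers are made up for illustration.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    Returns P(X >= n_overlap), where X is the number of annotated genes
    seen when n_selected genes are drawn without replacement from a
    background of n_background genes, n_annotated of which carry the term.
    (Hypothetical helper for illustration, not part of any library.)
    """
    max_overlap = min(n_annotated, n_selected)
    # Number of ways to pick any gene set of this size from the background
    total = comb(n_background, n_selected)
    # Number of ways to pick a gene set with at least n_overlap annotated genes
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Made-up example: 20,000 background genes, 200 annotated with a term,
# and a 100-gene set of interest that contains 8 of the annotated genes.
p_value = hypergeom_enrichment_p(20_000, 200, 100, 8)
print(f"enrichment p-value: {p_value:.3g}")
```

In a real analysis you would repeat this for every term and correct for multiple testing, which dedicated enrichment tools handle for you.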
LLMs in genomics: The technical details
If you’re ready to get into the weeds of this field, a group at the University of Toronto has you covered with this excellent review:
To Transformers and Beyond: Large Language Models for the Genome
Enjoy!