In less than a decade, deep learning has gone from a niche effort in genomics to something that nobody working in this field can ignore. For most of that decade, deep learning basically meant convolutional neural networks (CNNs). It’s easy to see why CNNs are a natural fit for genomic prediction tasks: they take DNA sequence or some other biological entity as input, transform that input over multiple “neural” layers, and then output a prediction about functional activity or a classification, such as whether that DNA sequence will bind regulatory factors. Part of the appeal is that, in many ways, CNNs resemble more familiar statistical models like logistic regression.
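To make that description concrete, here is a minimal sketch (in PyTorch) of the kind of sequence-to-activity CNN being described. It is not any particular published model; the layer sizes, kernel widths, and the single “binding” output are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

# One-hot encode a DNA sequence: 4 channels (A, C, G, T) x sequence length.
def one_hot_dna(seq):
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq.upper()):
        if base in mapping:          # ambiguous bases (e.g. N) stay all-zero
            x[mapping[base], i] = 1.0
    return x

# A toy sequence-to-activity CNN: convolutional layers scan for motif-like
# patterns, pooling summarizes them across positions, and a final linear
# layer outputs a single logit for a hypothetical binding-prediction task.
class TinySeqCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=12, padding=6),   # motif scanners
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=6, padding=3),   # combinations of motifs
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                        # summarize over position
        )
        self.head = nn.Linear(64, 1)                        # binding logit

    def forward(self, x):              # x: (batch, 4, seq_len)
        h = self.conv(x).squeeze(-1)   # (batch, 64)
        return self.head(h)            # (batch, 1)

# Example: score one 200-bp sequence (a trivial repeat, just to show the shapes).
seq = "ACGT" * 50
model = TinySeqCNN()
logit = model(one_hot_dna(seq).unsqueeze(0))
print(torch.sigmoid(logit))            # predicted binding probability
```

Notice that the final linear layer plus a sigmoid is essentially logistic regression, just applied to features the convolutional layers learn from the sequence rather than to hand-picked predictors. That is the resemblance to familiar statistical modeling.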
What may at first seem like a less natural fit for genomic analysis is ChatGPT. Most people learned about ChatGPT when it was publicly released in late November 2022. In the year and a half since then, millions of us have been impressed by ChatGPT’s capacity to summarize and explain information, while also being amused or appalled (or both) at its propensity to confidently spout BS. As a tool, ChatGPT can be unreliable in ways that we generally don’t tolerate in our data analyses. For example, the quality of the answer you get depends on knowing how to write good prompts, and even then, the exact same prompt will yield different responses in different sessions. And its answers don’t come with any estimate of how reliable they are.
Given this variability in ChatGPT, it may surprise you that large language models (LLMs) are quickly becoming a hot area in machine learning for genomics. A fascinating new preprint from the labs of Rahul Dhodapkar at USC and David van Dijk at Yale describes a method to analyze single-cell transcriptomics data using a modified version of GPT. The method converts gene expression data into “cell sentences” that are associated with different cell types. Using natural language prompts that include gene names or cell types, the model lets the user predict cell types from transcriptomic data. You can also go in the other direction: give a cell type as the prompt, and the model will reconstruct the expected gene expression profile.
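To give a feel for what a “cell sentence” might look like, here is a tiny sketch of one way to turn a cell’s expression vector into text: rank the genes by expression and write out the gene names in order. The gene names, counts, and prompt wording below are invented for illustration and simplify whatever pipeline the preprint actually uses.

```python
import numpy as np

# Toy expression vector for one cell (counts per gene); the gene names and
# values here are made up for illustration.
genes = np.array(["CD3D", "CD8A", "GZMB", "MS4A1", "LYZ", "NKG7"])
counts = np.array([120, 85, 60, 0, 3, 40])

# Build a "cell sentence": gene names ordered by decreasing expression,
# dropping genes with zero counts.
order = np.argsort(-counts)
cell_sentence = " ".join(genes[i] for i in order if counts[i] > 0)
print(cell_sentence)   # "CD3D CD8A GZMB NKG7 LYZ"

# The sentence can then be dropped into a natural-language prompt for a
# fine-tuned language model (the prompt wording is hypothetical).
prompt = f"The cell expresses the following genes: {cell_sentence}. What cell type is this?"
print(prompt)
```

Going in the reverse direction presumably amounts to mapping the ranks of the gene names the model generates back onto estimated expression levels, which is how a cell-type prompt can be turned back into an expression profile.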
Beginning in 2023, interesting potential applications of LLMs started to show up on preprint servers and are now appearing in journals. (A PubMed search for “large language models” returns 673 hits in 2023 alone, though most of these are not related to genomics.) LLMs are now being used to annotate microbial proteins in sequencing datasets, identify biological pathways in gene interaction data, and predict the function of uncharacterized proteins.
This trend towards LLMs raises some important questions: Can we make these models reliable? Is ‘writing good model prompts’ going to be a new critical skill for genomic scientists? How do you convert quantitative, multi-omics data to English (or any other language the model might use)?
We can’t definitively answer those questions yet, but over the next few weeks we’ll take a deeper dive into LLMs and what they mean for genomics.