When are Foundation Models Not the Answer?
Genomic large language models can be impressive, but we need 'on-device', data-efficient deep learning strategies to tackle all human cell types.
Apple recently rolled out Apple Intelligence, their version of deep learning tools that are meant to be “AI for the rest of us.” Generative AI is supposed to help us be more efficient by drafting the emails we’re too busy to write, summarizing documents we don’t want to read, and turning us into passable artists and graphic designers no matter how terrible our skills are. (See above image.) Apple’s release is the latest entry in the tight competition among tech companies to turn the impressive capabilities of generative AI like ChatGPT and Imagen into must-have consumer features that we will all use in our everyday lives.
However, the problem with generative AI on your phone is that you can’t actually run these enormous models on your phone’s hardware. State-of-the-art large language models take vast amounts of data to train and massive amounts of computing power to run. One solution is to have the AI on your phone rely on remote servers to do the heavy lifting, but this approach has some problems. An obvious one is the privacy risk, but another is that generative AI on your phone should ideally continue to learn from your interactions with it, and thereby optimize its output for you. Issues like these are why “on-device” learning is a big part of Apple’s implementation of AI on its devices. Apple uses a variety of strategies to achieve this, one of which is a set of fine-tuned “adapters”, which are “small collections of model weights that are overlaid onto the common base foundation model.” Apple says that “they can be dynamically loaded and swapped — giving the foundation model the ability to specialize itself on-the-fly for the task at hand.”
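To make that concrete, here is a minimal sketch of the adapter idea, in the spirit of LoRA-style low-rank adapters: the base weights stay frozen, and a tiny trainable overlay supplies the task-specific correction, so specialization is cheap and the overlay can be swapped per task. This is my illustration of the general pattern (all names are made up), not Apple’s actual implementation.

```python
import torch.nn as nn

class AdapterLinear(nn.Module):
    """A frozen base linear layer plus a small, swappable low-rank adapter."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # the shared base model stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # tiny adapter: project down...
        self.up = nn.Linear(rank, base.out_features, bias=False)    # ...then back up
        nn.init.zeros_(self.up.weight)                 # start as a no-op overlay

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))   # base output + adapter correction

# One frozen base, many small adapters: load a different (down, up) pair per task.
layer = AdapterLinear(nn.Linear(512, 512), rank=8)
task_adapter = {k: v for k, v in layer.state_dict().items() if not k.startswith("base")}
layer.load_state_dict(task_adapter, strict=False)      # "dynamically loaded and swapped"
```

Because only the small per-task matrices are trained and stored, the cost of specializing the model is a small fraction of the base model itself.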
We need ways to train models of cell type activities with less data
In a not yet well-developed way, I’ve been mulling over the possibilities of genomic “on-device learning”. Perhaps the analogy to phones isn’t great, but there is a clear need for data-efficient deep learning in genomics that I don’t think very big foundation models can solve, at least by themselves. AlphaFold is impressive, Nobel prize-level impressive, but protein structure prediction doesn’t face one crucial challenge that makes modeling the non-coding genome very hard: cell type specificity.
Most functional DNA elements outside of protein-coding sequences act in highly cell type-specific ways. For these sequences, there is no context-free functional target for your model to predict. The structure of a G-protein coupled receptor doesn’t change dramatically from cell type to cell type, but the Rhodopsin promoter is completely inactive in all cell types except rod photoreceptors.
Foundation models in genomics are not yet, and may not ever be, up to the task of predicting the activity of non-coding DNA in all human cell types, because we may never have the right data at the necessary scale. I say this tentatively, but as of this moment most of the training data used in the field comes from cell lines, which even in the best cases are mediocre proxies of endogenous cell types that function within the context of whole tissues. For the disease-relevant cell types we care about, some of which can be very rare in the human body, will we ever collect enough data to learn, say, the sequence grammar of regulatory DNA?
I don’t think so, for two reasons: 1) Single-cell technologies, which can assay rare populations, may not generate all of the measurements that we need. Also, the epigenomic readouts we do get are only proxies for function, which is harder to measure directly. 2) For any given cell type, there simply may not be enough examples in the genome to learn that grammar. 40,000 open chromatin regions in one cell type is not enough to train a foundation model.
Thus the need to think about the genomic equivalent of on-device learning in a small data regime, if we ever want models that accurately predict the effects of non-coding mutations or generate specific regulatory activity in rare, disease-affected cell types.
I don’t really have solutions to propose, but I’ve been doing a little reading. If you’re interested in data-efficient deep learning, below are a few papers that seem helpful.
Apple’s on-device learning
Apple published a paper describing its approach to implementing foundation models on phones.
Transfer learning to predict in vivo expression
Alex Stark’s lab published a nice paper earlier this year demonstrating one promising approach for dealing with low data: pre-training and fine-tuning. The lab trained a CNN on an abundant dataset (ATAC-seq), then fine-tuned it on a less abundant in vivo dataset.
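For the flavor of it, here is a toy version of that pre-train/fine-tune pattern: a small sequence CNN whose convolutional trunk is carried over from the data-rich task, with a fresh output head (and a gentler learning rate on the trunk) for the scarce in vivo data. The architecture and hyperparameters below are placeholders, not the ones from the paper.

```python
import torch
import torch.nn as nn

class SeqCNN(nn.Module):
    """Toy CNN over one-hot DNA (batch x 4 x length)."""
    def __init__(self, n_outputs: int):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=15, padding=7), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, n_outputs)

    def forward(self, x):
        return self.head(self.trunk(x))

# 1) Pre-train on the abundant assay (e.g., ATAC-seq peaks); training loop omitted.
pretrained = SeqCNN(n_outputs=1)

# 2) Fine-tune on the scarce in vivo dataset: reuse the trunk, attach a new head,
#    and let the trunk move only slowly so the pre-trained features survive.
finetuned = SeqCNN(n_outputs=1)
finetuned.trunk.load_state_dict(pretrained.trunk.state_dict())
optimizer = torch.optim.Adam([
    {"params": finetuned.trunk.parameters(), "lr": 1e-5},  # gentle updates to shared features
    {"params": finetuned.head.parameters(), "lr": 1e-3},   # fresh head learns faster
])
```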
Data efficient deep learning for medical imaging
In a quick scan of the literature, I found that data efficiency for deep learning is a focus of attention in medical imaging. Here is a review of the field, with a focus on alternatives to supervised learning.
Low-N protein engineering
George Church’s lab published an approach to model a fitness landscape using as few as 24 assayed mutant sequences.
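The key move is to lean on a representation learned from huge unlabeled protein databases, so that a simple, heavily regularized model can be fit on a couple dozen labeled variants. Here is a bare-bones sketch of that low-N recipe; embed() is a hypothetical stand-in for the pre-trained language model, and the sequences and fitness values are placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge

def embed(sequence: str) -> np.ndarray:
    """Hypothetical stand-in for a pre-trained protein language model embedding."""
    rng = np.random.default_rng(abs(hash(sequence)) % (2**32))
    return rng.normal(size=256)

# Low-N regime: a couple dozen assayed variants with measured fitness (placeholders here).
train_seqs = ["MKTAYIA", "MKTCYIA", "MKTGYIA"]
train_fitness = np.array([0.8, 1.2, 0.5])

X = np.stack([embed(s) for s in train_seqs])
top_model = Ridge(alpha=1.0).fit(X, train_fitness)     # simple, regularized "top model"

# Score new candidates on the learned fitness landscape.
candidates = ["MKTTYIA", "MKTWYIA"]
scores = top_model.predict(np.stack([embed(s) for s in candidates]))
```

The representation does the heavy lifting; the labeled data only has to pin down a low-dimensional, well-regularized mapping on top of it.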
Optimized training datasets for MPRAs
Georg Seelig has put up a cool preprint describing different approaches for model-guided design of regulatory elements, using optimized training datasets from massively parallel reporter assays (MPRAs). This is an approach I would like to see extended beyond cell lines.
Importance sampling
This next one is from outside the context of biology. A team from Huawei describes an importance sampling strategy that subsamples the available data based on an adaptable importance criterion.
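Roughly, the idea is to spend the training budget on the examples that currently matter most, rather than passing over the whole dataset uniformly. Here is a generic sketch of loss-based importance subsampling; the criterion in the paper is more sophisticated and adapts during training, so treat this as an illustration of the pattern only.

```python
import numpy as np

def importance_subsample(losses: np.ndarray, budget: int, temperature: float = 1.0) -> np.ndarray:
    """Choose a subset of training examples, favoring those the current model finds hard."""
    weights = np.exp(losses / temperature)          # higher loss -> higher sampling weight
    probs = weights / weights.sum()
    return np.random.choice(len(losses), size=budget, replace=False, p=probs)

# Each epoch: score the pool with the current model, then train on the chosen slice
# instead of the full dataset.
per_example_losses = np.random.rand(10_000)         # placeholder for real model losses
chosen_indices = importance_subsample(per_example_losses, budget=1_000)
```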
Active learning for nanopore sequencing and drug discovery
Nanopore sequencing blows my mind. Single DNA molecules are threaded through a nanopore and generate currents as the bases pass through. The raw data is very complex, and machine learning is crucial for processing it: innovations in nanopore sequencing rely on models that can interpret the signals generated when macromolecules pass through the pore. This paper out of Nanjing University of Aeronautics and Astronautics describes active learning to get around the challenge of few labeled training examples.
Here is another active learning paper, this one for low-data drug discovery.
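The common loop in both papers is pool-based active learning: train on what has been labeled so far, then spend the labeling budget where the model is least certain. Below is a bare-bones sketch of one round using least-confident sampling; `model` is any classifier with fit/predict_proba, and `acquire_fn` stands in for the expensive step (an assay, an annotator) that active learning is trying to ration.

```python
import numpy as np

def active_learning_round(model, labeled, unlabeled_pool, acquire_fn, batch_size=32):
    """One round: fit on labeled data, query the most uncertain pool examples, label them."""
    X_lab, y_lab = labeled
    model.fit(X_lab, y_lab)

    probs = model.predict_proba(unlabeled_pool)
    uncertainty = 1.0 - probs.max(axis=1)            # least-confident sampling
    query_idx = np.argsort(uncertainty)[-batch_size:]

    new_X = unlabeled_pool[query_idx]
    new_y = acquire_fn(new_X)                        # the expensive labels we are rationing
    labeled = (np.vstack([X_lab, new_X]), np.concatenate([y_lab, new_y]))
    remaining = np.delete(unlabeled_pool, query_idx, axis=0)
    return model, labeled, remaining
```

Running a handful of these rounds, instead of labeling a large random set up front, is what lets low-data settings like these get away with far fewer labels.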
Geospatial sensing
Remote sensing using satellites is also an area in which deep learning needs to be done in a low-data context. This paper reviews ten strategies for small datasets.
Foundation models
Finally, here are a couple of papers on foundation models. The cover story in the latest issue of Science described a prokaryotic foundation model trained on over 2 million genomes. One potential solution for models of human cell types would be to gather more cross-species datasets, particularly when you know you have orthologous cell types. And here is a review of deep active learning for foundation models.
Other thoughts or resources? Leave a comment or reach me on X or Bluesky @genologos.