Learning How Proteins Navigate the Genome: PADIT-seq
New work on the signals that DNA-binding proteins use to find their targets among 3.2 billion base pairs
Plaque at The Eagle pub. Credit: Benjah-bmm27 via Wikimedia Commons.
On February 28th, 1953, an excited Francis Crick walked over to the Cambridge pub The Eagle and announced that he and James Watson had “found the secret of life” by figuring out the structure of DNA. Crick and Watson realized that the ordered sequence of DNA base pairs was likely how genetic information is stored. But the real secret of life is not how genetic information is stored, but how it is read out by millions of acts of molecular recognition. The secret of life is specificity. As D.L. Nanney put it in a prescient 1958 paper on epigenetics, the genome is a “library of specificities.”
One of the outstanding problems in genomics has to do with the specificity of DNA-binding regulatory proteins. To do the job of regulating gene expression, these proteins need to find their correct binding sites within a vast, 3.2 billion base-pair genome. The puzzle is that DNA-binding proteins don’t seem to have enough specificity to do this. They intrinsically recognize short sequence motifs that occur millions of times in the genome, and in most cases, probably fewer than one in a thousand of those sites is actually bound. So where is the information that enables DNA-binding proteins to find their targets?
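For a rough sense of scale, here is a back-of-the-envelope sketch of how often a short motif turns up by chance. It assumes a random genome with equal base frequencies and counts exact matches on both strands, which real genomes and real, degenerate motifs don't respect; the function name is mine, purely for illustration.

```python
GENOME_SIZE = 3.2e9  # base pairs, the figure used in the text


def expected_random_matches(motif_length, genome_size=GENOME_SIZE, both_strands=True):
    """Expected exact matches to one fixed motif in a random sequence.

    Assumes each base is independent and equally likely (p = 1/4),
    and ignores the special case of palindromic motifs.
    """
    p_match = 0.25 ** motif_length
    strands = 2 if both_strands else 1
    return genome_size * strands * p_match


if __name__ == "__main__":
    for k in (6, 8, 10, 12):
        print(f"{k}-mer: ~{expected_random_matches(k):,.0f} chance matches")
```

Under these assumptions, a fixed 6-mer is expected well over a million times, which is the heart of the puzzle: a short motif on its own can't single out a few thousand functional sites.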
Last week, here at Washington University, we heard a nice talk by Harvard biologist Martha Bulyk, who has long been a leader in this field. Martha was participating in our symposium held in honor of my colleague Gary Stormo, a pioneer of computational biology who, with his colleagues, invented the ubiquitous position weight matrix model of DNA motifs. (More on this symposium next month… stay tuned.)
Dr. Bulyk spoke about her lab’s newest technology for studying transcription factor binding sites: PADIT-seq (protein affinity to DNA by in vitro transcription and RNA sequencing). PADIT-seq cleverly sits in between in vivo functional assays with reporter genes and biochemical binding assays like the protein-binding microarrays (PBMs) that the Bulyk lab invented. In vivo assays measure the function of a regulatory protein binding site within the complex environment of a cell, where the experimenter doesn’t have complete knowledge of what is binding to a particular segment of DNA and contributing to its functional activity. In vitro assays like PBMs only measure binding affinity, but we know that binding affinity of a protein to DNA is only partially correlated with regulatory activity.
PADIT-seq measures regulatory function, which is typically assayed in vivo, in an in vitro system. In PADIT-seq, a library of transcription factor binding sites controls in vitro transcription of barcoded reporter genes. This transcription isn’t subject to all of the complexities of an in vivo cell environment, but it is more sensitive than typical in vitro binding assays because it amplifies a protein binding event by producing RNA transcripts. For the aficionados, PADIT-seq is a state-of-the-art method that is related to the original MPRA and to the one-hybrid yeast assays by Gary Stormo and Scot Wolfe (who also spoke at our Stormo symposium).
The punch line of the PADIT-seq paper is more evidence for an idea that has been floating around in the field for a little while: effective DNA binding sites are often surrounded by a halo of weaker affinity binding sites. These weaker sites may act to create an elevated binding affinity landscape around the main binding site, thus distinguishing functional binding sites from non-functional sites.
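To make the "halo" idea concrete, here is a toy sketch, not the analysis from the paper. The consensus sequence, the tenfold-per-mismatch penalty, and the two example sequences are all made up for illustration; the point is only that summing relative affinity over every overlapping position in a region gives a higher total when a strong site is flanked by near-matches.

```python
# Toy illustration only: CONSENSUS and MISMATCH_PENALTY are assumed values,
# not parameters from the PADIT-seq work.
CONSENSUS = "TGACGT"
MISMATCH_PENALTY = 0.1  # each mismatch cuts relative affinity tenfold (assumed)


def relative_affinity(site):
    """Relative affinity of one site: 1.0 for the consensus, lower per mismatch."""
    mismatches = sum(a != b for a, b in zip(site, CONSENSUS))
    return MISMATCH_PENALTY ** mismatches


def affinity_profile(seq):
    """Relative affinity at every overlapping position along seq."""
    k = len(CONSENSUS)
    return [relative_affinity(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def region_affinity(seq):
    """Total relative affinity of the region: the sum over all positions."""
    return sum(affinity_profile(seq))


# Same central consensus site, with and without one-mismatch neighbors.
isolated = "CCCCCCCC" + "TGACGT" + "CCCCCCCC"
with_halo = "CTGACGAC" + "TGACGT" + "CTGTCGTC"

if __name__ == "__main__":
    print("isolated :", round(region_affinity(isolated), 4))
    print("with halo:", round(region_affinity(with_halo), 4))
```

Both sequences carry the identical best site, so a best-match score can't tell them apart; only the summed landscape does.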
Figure 8 of Khetan and Bulyk.
How common is this in the genome? It would be nice to see a systematic evaluation, but this may be a common mechanism by which DNA-binding proteins read the genome.
Along with the PADIT-seq preprint, Martha also spoke about her lab’s recent paper looking at how mutations in DNA-binding proteins change their affinity and specificity. It’s worth reading alongside the PADIT-seq work.