Columbia University in the City of New York

Tavazoie Lab

Predicting gene expression from sequence

Deciphering the regulatory genome

The cell's control over the context of gene expression is the most important early step in the manifestation of phenotype. The expression output of a gene is dictated by the convergence of various upstream inputs that impinge upon DNA and RNA sequence elements in the vicinity of genes. The richness of gene expression programs (in a given cell across time, across distinct cell types, and in response to diverse stimuli) results from the combinatorial logic of spatially organized nucleic acid elements that bind transcription factors, RNA-binding proteins, and microRNAs. Identifying these regulatory elements and elucidating the rules by which they operate remain central challenges for modern biology.

Even before the publication of the very first microarray expression data, we were intrigued by the possibility that statistical methods applied to such data could lead to unbiased identification of regulatory elements on a comprehensive scale. Subsequent work led to the development of the first computational framework that systematically revealed these elements through their over-representation within the promoter regions of co-expressed genes (Tavazoie et al. Nature Genetics 1999, 22:281). By making minimal a priori assumptions, and identifying most of the known cell-cycle regulatory elements, we were able to demonstrate the power of computational methods in extracting real biology from large genomic datasets. In addition to regulatory element discovery, we introduced the notion of biological validation of gene sets through statistical analysis of functional category enrichment.
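As a rough illustration of what "over-representation within the promoter regions of co-expressed genes" can mean statistically, the sketch below computes a hypergeometric upper-tail p-value. The choice of the hypergeometric test, the function name, and the parameter names are assumptions made for this example; this is not the lab's code, and the same calculation applies equally to functional category enrichment of a gene set.

```python
from math import comb


def enrichment_pvalue(total_genes, genes_with_feature, cluster_size, cluster_with_feature):
    """Hypergeometric upper tail: probability of seeing at least
    `cluster_with_feature` feature-bearing genes in a cluster of
    `cluster_size` genes drawn at random from `total_genes` genes,
    of which `genes_with_feature` carry the feature (for example a
    candidate motif in the promoter, or membership in a functional category).
    """
    upper = min(genes_with_feature, cluster_size)
    denominator = comb(total_genes, cluster_size)
    numerator = sum(
        comb(genes_with_feature, k)
        * comb(total_genes - genes_with_feature, cluster_size - k)
        for k in range(cluster_with_feature, upper + 1)
    )
    return numerator / denominator


if __name__ == "__main__":
    # Made-up numbers: 6000 genes, 300 with the motif, and a cluster of
    # 100 co-expressed genes in which 25 carry the motif.
    print(enrichment_pvalue(6000, 300, 100, 25))
```

A small p-value suggests the feature occurs in the cluster more often than chance would predict. When many candidate motifs or categories are tested, a multiple-testing correction would be needed on top of this.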

Although regulatory element discovery from expression data represented a significant step forward in decoding the regulatory genome, the full elucidation of the regulatory logic by which these elements function has proven more challenging. We found that single regulatory elements were poor predictors of a gene's expression pattern. This, in part, reflects combinatorial regulation, where the expression level of a gene can depend on the occupancy states of multiple transcription-factor binding sites. Also, the function of these sites may require precise genomic context and spatial configurations. A central focus of our work has been the development of computational methods that reveal these context-dependent rules well enough to allow prediction of gene expression patterns from DNA sequence alone. To achieve this goal, we developed a Bayesian computational framework to systematically explore the immense space of potential sequence motifs and their combinatorial interactions. The application of this framework to yeast revealed highly predictive combinatorial rules involving multiple transcription-factor binding sites with precise constraints on their strengths, positions and orientations (Beer & Tavazoie, Cell 2004, 117:185). Our observations solved a major conundrum in the field: how can we distinguish between functional and non-functional sites in the genome? Furthermore, our analyses revealed pervasive combinatorial interactions involving logical relationships (AND, OR, NOT) between multiple known and novel predicted regulatory elements. This framework allowed us, for the first time, to predict expression patterns of ~70% of yeast genes from their local regulatory sequences.
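To make the idea of combinatorial rules with position and orientation constraints concrete, here is a minimal sketch of how one such rule could be evaluated on a promoter. The motif names (MOTIF_A, MOTIF_B, MOTIF_C), the distance thresholds, and the data layout are all hypothetical, and the rule is hand-written here, whereas the published framework learns its rules from data.

```python
from typing import Dict, List, Optional, Tuple

# A site is (distance upstream of the start codon in bp, strand "+" or "-").
Site = Tuple[int, str]
# A promoter maps a motif name to the sites where that motif occurs.
Promoter = Dict[str, List[Site]]


def has_site(promoter: Promoter, motif: str,
             max_distance: Optional[int] = None,
             strand: Optional[str] = None) -> bool:
    """True if the promoter has a site for `motif` that satisfies the
    optional position and orientation constraints."""
    for distance, site_strand in promoter.get(motif, []):
        if max_distance is not None and distance > max_distance:
            continue
        if strand is not None and site_strand != strand:
            continue
        return True
    return False


def predicted_in_pattern(promoter: Promoter) -> bool:
    """Hypothetical rule: MOTIF_A within 150 bp AND MOTIF_B on the
    forward strand within 300 bp AND NOT MOTIF_C anywhere."""
    return (
        has_site(promoter, "MOTIF_A", max_distance=150)
        and has_site(promoter, "MOTIF_B", max_distance=300, strand="+")
        and not has_site(promoter, "MOTIF_C")
    )


if __name__ == "__main__":
    promoter = {"MOTIF_A": [(90, "-")], "MOTIF_B": [(240, "+")]}
    print(predicted_in_pattern(promoter))
```

The same two motifs at other distances, or in the presence of MOTIF_C, would fail the rule. This is one way a site can be present in the sequence yet non-functional for a given expression pattern.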

A universal framework for regulatory element discovery

A major focus of ongoing work is to extract mammalian regulatory programs from the rapidly increasing flood of expression data, encompassing tissue-specific gene expression in normal and disease states (e.g. cancer) and highly structured expression patterns in organs like the brain. However, regulatory element discovery in mammalian genomes poses unique challenges, such as large intergenic regions, distal regulatory elements, local and large-scale chromosomal composition trends, a vast diversity of cell types, and highly complex expression programs. To deal with these challenges, we have rephrased the problem in an information-theoretic framework, which has freed us from making any assumptions about the structure of the underlying sequence or the specific model by which the sequence affects gene expression. This has led to a versatile framework for regulatory element discovery across all genomes and data-types, with exceptional sensitivity and near-zero false-positive rates (Elemento et al. Molecular Cell 2007, 28:337). Our approach, named FIRE (for Finding Informative Regulatory Elements), is available to the public, both as standalone software and through a user-friendly web interface: iget.c2b2.columbia.edu.
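One way to picture an information-theoretic, model-free score is the mutual information between a motif's presence in each gene's regulatory region and that gene's expression cluster, with significance judged by shuffling the cluster labels. The sketch below illustrates that general idea under our own assumptions (function names, the shuffle count, and the toy data are invented for the example). It is not the FIRE software or its interface.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length label sequences,
    estimated from their empirical joint distribution."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (count / n) * math.log2(count * n / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )


def shuffle_pvalue(xs, ys, n_shuffles=1000, seed=0):
    """Fraction of label shuffles whose mutual information is at least the
    observed value (with a +1 correction so the estimate is never zero)."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if mutual_information(xs, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    # Toy data: motif presence (1/0) per gene and the gene's expression cluster.
    motif_present = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    cluster = ["up", "up", "up", "up", "flat", "flat", "down", "down",
               "flat", "down", "flat", "down"]
    print(mutual_information(motif_present, cluster))
    print(shuffle_pvalue(motif_present, cluster))
```

Because the score depends only on co-occurrence counts, it makes no assumption about how the motif affects expression (activation, repression, or something non-monotonic), which is the property the paragraph above emphasizes.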

Global regulatory perturbations in disease states

Our computational methods have led the way in pathway and regulatory network analysis of microarray expression data in the context of human disease. We have developed an information-theoretic framework that addresses two challenging and largely unsolved problems: (1) discovering perturbed pathways from microarray expression studies, and (2) revealing the underlying transcriptional and post-transcriptional regulatory processes through which the observed pathway perturbations are orchestrated. In a major recent study, we analyzed a compendium of publicly available cancer microarray expression data. Across this extensive and diverse dataset, we discovered a large number of known and novel pathway perturbations, and then went on to identify the local DNA and RNA regulatory elements that mediate these changes in gene expression. This ab initio approach recovers the vast majority of previously known cancer pathways along with their associated transcription factor binding sites. Most of the pathways and regulatory elements it identifies, however, are novel. In fact, a surprisingly large fraction of cis-regulatory elements reside within 3' UTRs, and most do not correspond to microRNA targeting sequences. This points to a largely unexplored role for post-transcriptional processes in cancer. Our framework connects these putative regulatory elements to specific pathways, providing a powerful starting point for understanding their biology and mechanisms of action.

We are encouraged by the high sensitivity with which our approach rediscovers the components of classical tumor-promoting and tumor-suppressing pathways (e.g. MAPK, WNT, P53, CREB, TNF, ERK, JNK, VEGF). These pathways and their gene regulatory consequences were worked out over the course of 20 years of labor-intensive genetics and biochemistry across hundreds of laboratories. Our rediscovery of this preexisting knowledge, along with an equally significant set of novel findings, represents an important step forward in our ability to characterize and understand regulatory perturbations in cancer. More broadly, our approach should lead to new molecular understanding of perturbations that are associated with other disease states. A manuscript describing this work was recently published in Molecular Cell (Goodarzi et al. 2009, 36:900-911). Our ongoing work is focused on the experimental validation of these regulatory network predictions, especially on exploring the role of post-transcriptional regulators in cancer.

Related publications

Revealing global regulatory perturbations across human cancers.
Molecular Cell. 2009 Dec 11;36:900-911.
Goodarzi H, Elemento O, Tavazoie S

Global protein occupancy landscape of a bacterial genome.
Molecular Cell. 2009 Jul 31;35(2):247-53.
Vora T, Hottes AK, Tavazoie S

Coupling of zygotic transcription to mitotic control at the Drosophila mid-blastula transition.
Development. 2009 Jun;136(12):2101-10.
Lu X, Li JM, Elemento O, Tavazoie S, Wieschaus EF

Microarray profiling of phage-display selections for rapid mapping of transcription factor-DNA interactions.
PLoS Genet. 2009 Apr;5(4):e1000449. Epub 2009 Apr 10.
Freckleton G, Lippman SI, Broach JR, Tavazoie S

let-7 Overexpression leads to an increased fraction of cells in G2/M, direct down-regulation of Cdc34, and stabilization of Wee1 kinase in primary fibroblasts.
J Biol Chem. 2009 Mar 13;284(11):6605-9. Epub 2009 Jan 6.
Legesse-Miller A, Elemento O, Pfau SJ, Forman JJ, Tavazoie S, Coller HA

A universal framework for regulatory element discovery across all genomes and data-types.
Molecular Cell. 2007;28(2):337-50.
Elemento O, Slonim N, Tavazoie S

Unmasking the zygotic genome using chromosome deletion in the Drosophila embryo.
PLoS Biology. 2007;5(5):e117.
De Renzis S, Elemento O, Tavazoie S, Wieschaus EF

Role of transcription factor Kar4 in regulating downstream events in the Saccharomyces cerevisiae pheromone response pathway.
Mol Cell Biol. 2006 Nov 13. Epub.
Lahav R, Gammie A, Tavazoie S, Rose MD

Predicting gene expression from sequence.
Cell. 2004 Apr 16;117(2):185-98.
Beer MA, Tavazoie S

Ras and Gpa2 mediate one branch of a redundant glucose signaling pathway in yeast.
PLoS Biology. 2004 May;2(5). Epub 2004 May 11.
Wang Y, Pierce M, Schneper L, Güldal CG, Zhang X, Tavazoie S, Broach JR

Mapping global histone acetylation patterns to gene expression.
Cell. 2004 Jun 11;117(6):721-33.
Kurdistani SK, Tavazoie S, Grunstein M

Genomewide binding map of RPD3 histone deacetylase in yeast.
Nature Genetics. 2002;31:248-254.
Kurdistani S, Robyr D, Tavazoie S, Grunstein M

Computational identification of cis-regulatory elements within functionally related groups of genes in Saccharomyces cerevisiae.
Journal of Molecular Biology. 2000;296:1205-1214.
Hughes JD, Estep PW, Tavazoie S, Church GM

Systematic determination of genetic network architecture.
Nature Genetics. 1999;22:281-285.
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM

© 2011 Columbia University