BioMed Notes: ENCODE salvages “junk” DNA

The “ENCyclopedia Of DNA Elements” (ENCODE), founded in 2003 with grants from the NIH Genome Institute, seeks to identify all the functional parts of the human genome, as assessed by DNA and histone modifications, chromatin looping, transcription factor binding, chromatin compaction (DNase accessibility), and transcripts. The collaboration of ~37 groups first developed the necessary technology. Recently it published its first salvo of 30 research papers, several of them in Nature along with a News & Views.

The paper by Djebali and scores of colleagues offers “a genome-wide catalogue of human transcripts”, together with their location (nucleus or cytoplasm) and whether they have a 5’ m7G cap or a 3’ poly-A tail. They prepared RNA from 15 human cell lines after fractionation (whole cell, nucleus, and cytosol) and separated it into short and long (>200 nucleotides) RNAs. Long RNAs were further separated into those with and without poly-A tails. They sequenced these RNAs and determined their initiation sites and their 5’ and 3’ termini (using technologies felicitously named CAGE and PET). Then they did the bioinformatics: comparison with the annotated genome (GENCODE), statistics, etc. All these data are available for your perusal using the RNA Dashboard.

They made many interesting observations; e.g., they conclude there is very little “junk” DNA. Nearly 75% of the genome is transcribed in at least one of the cell lines, though only a little over 50% in any given line. (This is similar to previous findings, although those were not as “encyclopedic”.) Only 28% of the 7,053 small RNAs (including snRNAs, snoRNAs, miRNAs, and tRNAs) annotated by GENCODE are found in any of these cell lines, suggesting that the expression of many annotated small RNAs is cell-type specific.
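The gap between the cumulative figure (about 75%) and the per-line figure (a little over 50%) is what you would expect when each cell line transcribes a partly different portion of the genome. A minimal sketch of that arithmetic on a toy 100-base "genome" follows. The function names, the half-open (start, end) coordinates, and the two invented cell lines are all assumptions made for illustration; this is not the pipeline the authors used.

```python
def covered_positions(intervals):
    """Return the set of base positions covered by half-open (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def coverage_summary(transcripts_by_line, genome_length):
    """Fraction of the genome covered in each cell line, and in at least one line."""
    covered = {
        name: covered_positions(intervals)
        for name, intervals in transcripts_by_line.items()
    }
    per_line = {name: len(pos) / genome_length for name, pos in covered.items()}
    # Cumulative coverage is the union across lines, not the sum of fractions.
    cumulative = len(set().union(*covered.values())) / genome_length
    return per_line, cumulative


# Toy data (hypothetical): two cell lines on a 100-base genome.
toy = {
    "line_A": [(0, 40), (30, 50)],  # overlapping transcripts are counted once
    "line_B": [(40, 75)],
}
per_line, cumulative = coverage_summary(toy, genome_length=100)
print(per_line)    # {'line_A': 0.5, 'line_B': 0.35}
print(cumulative)  # 0.75
```

Storing every position in a set is fine for a toy example; at the scale of a real genome one would merge sorted intervals instead.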

They also find that protein-coding transcripts are more abundant than long non-coding RNAs (lncRNAs) and that the same genes are transcribed in different cells. Figure 3, shown here, plots transcript abundance (r.p.k.m., reads per kilobase per million reads) on the x axis against the nuclear/cytoplasmic ratio on the y axis. Protein-coding transcripts (orange) tend to be abundant (right) and cytoplasmic (down), whereas non-coding (blue) and novel intergenic (green) transcripts tend to be expressed at lower levels (left) and are mostly nuclear (up). A few individual transcripts are also identified, giving an appreciation for the range of expression.
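For readers unfamiliar with the unit, r.p.k.m. normalizes a raw read count by both transcript length and sequencing depth, so that long transcripts and deeply sequenced samples do not look artificially abundant. A minimal sketch of the standard formula follows, along with a plain nuclear/cytoplasmic ratio. The function names and the example numbers are invented for illustration, and the paper may scale or transform the plotted ratio differently.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total reads must be positive")
    # reads / (length / 1e3) / (total / 1e6)  ==  reads * 1e9 / (length * total)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def nuclear_to_cytoplasmic_ratio(nuclear_rpkm, cytoplasmic_rpkm):
    """Plain ratio of nuclear to cytoplasmic abundance (assumed definition)."""
    if cytoplasmic_rpkm <= 0:
        raise ValueError("cytoplasmic abundance must be positive")
    return nuclear_rpkm / cytoplasmic_rpkm


# Hypothetical transcript: 500 reads on a 2,000-base transcript
# in a library of 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))              # 12.5
print(nuclear_to_cytoplasmic_ratio(3.0, 12.0))  # 0.25, i.e. mostly cytoplasmic
```

A ratio below 1 would place a transcript in the lower ("cytoplasmic") part of the plot, and a ratio above 1 in the upper ("nuclear") part.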

Also not for the first time, they suggest that shrinking “intergenic” regions “prompts the reconsideration of the definition of a gene”. They “propose that the transcript be considered as the basic atomic unit of inheritance” and that “gene … denote … all those transcripts … that contribute to a given phenotypic trait”. Mendel would approve.