The word landscape of the non-coding segments of the Arabidopsis thaliana genome
Date
2009-10-08
Publisher
BioMed Central
Abstract
Background: Genome sequences can be conceptualized as arrangements of motifs or words.
The frequencies and positional distributions of these words within particular non-coding genomic
segments provide important insights into how the words function in processes such as mRNA
stability and regulation of gene expression.
Results: Using an enumerative word discovery approach, we investigated the frequencies and
positional distributions of all 65,536 different 8-letter words in the genome of Arabidopsis thaliana.
Focusing on promoter regions, introns, and 3’ and 5’ untranslated regions (3’UTRs and 5’UTRs), we
compared word frequencies in these segments to genome-wide frequencies. The statistically
interesting words in each segment were clustered with similar words to generate motif logos. We
investigated whether words were clustered at particular locations or were distributed randomly
within each genomic segment, and we classified the words using gene expression information from
public repositories. Finally, we investigated whether particular sets of words appeared together
more frequently than others.
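The enumerative step described above can be illustrated with a minimal sketch. It counts every overlapping 8-letter word in a set of sequences and compares each word's relative frequency in a segment with its genome-wide relative frequency. The function names `count_words` and `enrichment`, the pseudocount, and the simple frequency ratio are assumptions made for illustration only; the abstract does not specify the scoring statistic the authors used.

```python
from collections import Counter


def count_words(sequences, k=8):
    """Count every overlapping k-letter word over A/C/G/T in the sequences."""
    alphabet = set("ACGT")
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Skip words containing N or other ambiguity symbols.
            if set(word) <= alphabet:
                counts[word] += 1
    return counts


def enrichment(segment_counts, genome_counts, k=8, pseudocount=1.0):
    """Ratio of segment relative frequency to genome-wide relative frequency.

    Hypothetical scoring: a plain frequency ratio with an additive
    pseudocount, not the statistic used in the paper.
    """
    n_words = 4 ** k  # 65,536 possible words when k = 8
    seg_total = sum(segment_counts.values()) + pseudocount * n_words
    gen_total = sum(genome_counts.values()) + pseudocount * n_words
    ratios = {}
    for word in genome_counts.keys() | segment_counts.keys():
        seg_freq = (segment_counts[word] + pseudocount) / seg_total
        gen_freq = (genome_counts[word] + pseudocount) / gen_total
        ratios[word] = seg_freq / gen_freq
    return ratios
```

Under these assumptions, a ratio well above 1 marks a word that is over-represented in the segment, and a ratio well below 1 marks an under-represented one. Words absent from the genome altogether, as the abstract's 'unwords' appear to be, would need to be identified separately by checking which of the 4^k possible words have a count of zero.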
Conclusion: Our studies provide a detailed view of the word composition of several segments of
the non-coding portion of the Arabidopsis genome. Each segment contains a unique word-based
signature. The respective signatures consist of the sets of enriched words, ‘unwords’, and word
pairs within a segment, as well as the preferential locations and functional classifications for the
signature words. Additionally, the positional distributions of enriched words within the segments
highlight possible functional elements, and the co-associations of words in promoter regions likely
represent the formation of higher order regulatory modules. This work is an important step
toward fully cataloguing the functional elements of the Arabidopsis genome.
Citation
Jens Lichtenberg et al., "The word landscape of the non-coding segments of the Arabidopsis thaliana genome," BMC Genomics 10 (2009), doi: 10.1186/1471-2164-10-463, http://www.biomedcentral.com/1471-2164/10/463