In principle, any gene feature expected to be held (or avoided) in concert for disease genes can be used in this approach. However, given our concern for biasing predictions towards well-studied genes, we built an extensive set of >40,000 features with minimal preference for previously characterized genes. Genomic features were derived from publicly available genome-wide expression data, transcription factor binding information (both observed and predicted), phylogenetic profiles, protein domain organization, and predicted microRNA (miRNA) targets (Table 1). In most cases, preprocessing was required to generate useful predictors. For example, for microarray analysis, we automated downloading, normalization, clustering, and differential gene expression analysis for 1,437 human and murine microarray data sets obtained through the Gene Expression Omnibus (GEO) [23]. As a comparison to assess the impact of biased features on our predictions, we also downloaded GO annotations and used each of the 17,156 terms as descriptive features.
Complete datasets corresponding to the GNF Body Atlas [55] and Neurocrine Tissue Atlases [56] were downloaded from GEO (GSE7307). The GNF atlas includes 158 microarrays corresponding to 79 different human tissue types (2 technical replicates). The Neurocrine atlas includes 676 microarrays corresponding to 65 human tissue types or cell lines (normal and/or diseased), from 10 donors (a total of 141 conditions were available - such as 'normal prostate' or 'diseased prostate'). Intensities for the GNF arrays were averaged across replicates. For each tissue atlas, the intensity corresponding to the 75th, 90th and 99th percentile across all tissues was identified. Only those probesets mapping uniquely to a single gene were considered, and for those genes mapping to multiple probesets, the probeset with the highest mean expression across all tissues was used. Each combination of percentile threshold and array was included as a single feature, with 1s and 0s assigned to genes whose expression exceeded or fell below the percentile cutoff, respectively. Imputation was performed for any genes not included on the microarrays.
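The thresholding step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `percentile_features` is hypothetical, the input is assumed to be a genes-by-arrays intensity matrix already reduced to one probeset per gene, and the cutoff is assumed to be computed once per atlas over all intensity values.

```python
import numpy as np

def percentile_features(expr, percentiles=(75, 90, 99)):
    """Binarize a genes x arrays intensity matrix at atlas-wide percentile cutoffs.

    Returns a dict mapping (percentile, array_index) to a 0/1 vector over genes.
    Assumption: one cutoff per percentile, taken over every value in the atlas.
    """
    expr = np.asarray(expr, dtype=float)
    features = {}
    for p in percentiles:
        cutoff = np.percentile(expr, p)
        for j in range(expr.shape[1]):
            # 1 if the gene's intensity on this array exceeds the cutoff, else 0
            features[(p, j)] = (expr[:, j] > cutoff).astype(int)
    return features

# Toy atlas: 3 genes (rows) x 2 arrays (columns); the values are made up.
toy = [[1, 2], [3, 4], [5, 100]]
feats = percentile_features(toy)
```

Each (percentile, array) pair yields one feature, so the toy atlas with two arrays and three cutoffs produces six binary vectors.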
We devised an automated approach to determine gene signatures for a large number of microarray experiments deposited in GEO. First, we identified all experiments performed on the two most commonly used microarray platforms: Affymetrix Human Genome U133A Plus 2.0 and Affymetrix Mouse Genome 430 2.0 Array. Next, for each experiment, we downloaded all the corresponding microarrays, normalized them, and performed hierarchical clustering of samples using the hclust function in the R statistical framework [57]. We defined individual sample groups based on multiple cutpoints along the dendrogram (corresponding to tree heights of 0.975, 0.95, 0.925) and evaluated differentially expressed genes between groups using limma [53]. Features consisted of genes that were significantly changed (false discovery rate
A list of predicted TFBSs based on evolutionary conservation was downloaded from the ECRbase (Database of Evolutionary Conserved Regions) website [60]. The exact file used was tfbs_ecrs.hg18mm9.v102.txt. In this file, evolutionarily conserved sites were identified and mapped to transcription factors in the TRANSFAC v9.4 database [61]. Chromosomal positions provided by the authors were mapped to Refseq genes within 25 kb, and for each gene the number of each TFBS was tallied. Four discrete features were created for each TFBS, corresponding to whether a gene had at least 1, more than 1, more than 2, or more than 3 copies of the TFBS. Thus, a gene with five copies of a given TFBS would have 1s for all four features, while a gene with two copies would have scores of 1, 1, 0 and 0 for the four features, respectively. Since genes vary widely in the number of TFBSs, including these four features of varying stringency increased the likelihood of having at least one informative feature for every TFBS (that is, one that is not predominantly 1s or predominantly 0s across genes).
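As an illustration of the four stringency levels, the sketch below encodes a per-gene TFBS count. The function name `tfbs_threshold_features` is hypothetical, and the first feature is read as "at least one copy", which is the interpretation consistent with the worked example in the text.

```python
def tfbs_threshold_features(count):
    """Encode a per-gene TFBS copy count as four 0/1 features.

    Order: at least 1 copy, more than 1, more than 2, more than 3.
    """
    return [int(count >= 1), int(count > 1), int(count > 2), int(count > 3)]

# The two cases from the text, plus a gene with no copies.
five_copies = tfbs_threshold_features(5)
two_copies = tfbs_threshold_features(2)
no_copies = tfbs_threshold_features(0)
```

A gene with five copies scores 1 on all four features, and a gene with two copies scores 1, 1, 0, 0, matching the description above.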
The results of 691 individual ChIP-chip or ChIP-Seq experiments were downloaded from GEO [62] or the UCSC Genome Browser [63] and used to generate predictive features. Each binding site was mapped to all nearby genes (within 10 kb). Since genes could have one or more nearby binding sites, we determined the number of binding sites corresponding to the 75th, 90th and 99th percentile. Each cutoff was used to derive a separate feature.
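The mapping and thresholding for a single experiment might look like the sketch below. Several details are assumptions rather than statements from the text: genes are represented as (chromosome, start, end) spans, a site is "nearby" if it falls within the gene span extended by 10 kb on each side, percentiles are taken over the per-gene counts, and a gene scores 1 when its count exceeds the cutoff. The function names are hypothetical, and the quadratic loop is for clarity only.

```python
import numpy as np

def count_sites_per_gene(genes, sites, window=10_000):
    """Count binding sites within `window` bp of each gene span.

    genes: dict name -> (chrom, start, end); sites: list of (chrom, position).
    """
    counts = {}
    for name, (chrom, start, end) in genes.items():
        counts[name] = sum(
            1 for site_chrom, pos in sites
            if site_chrom == chrom and start - window <= pos <= end + window
        )
    return counts

def count_percentile_features(counts, percentiles=(75, 90, 99)):
    """One 0/1 feature per percentile cutoff over the per-gene site counts."""
    names = sorted(counts)
    values = np.array([counts[n] for n in names], dtype=float)
    features = {}
    for p in percentiles:
        cutoff = np.percentile(values, p)
        features[p] = {n: int(counts[n] > cutoff) for n in names}
    return features

# Toy coordinates, invented for illustration.
genes = {
    "A": ("chr1", 100_000, 110_000),
    "B": ("chr1", 500_000, 510_000),
    "C": ("chr2", 100_000, 110_000),
}
sites = [("chr1", 95_000), ("chr1", 105_000), ("chr1", 125_000),
         ("chr1", 515_000), ("chr2", 50_000)]
site_counts = count_sites_per_gene(genes, sites)
site_features = count_percentile_features(site_counts)
```

For genome-scale data, an interval index or sorted sweep would replace the nested loop.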
Protein domain compositions were extracted from the file protein2ipr.dat downloaded from the InterPro website [65]. For each protein domain, we generated a single feature corresponding to all proteins with one or more domains of that type. A total of 12,623 protein domains were included as features.
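A minimal sketch of the domain features is shown below. It assumes a tab-separated input in which the first column is a protein accession and the second an InterPro entry identifier; the helper names and the example rows are hypothetical, and the mapping from proteins to genes is omitted.

```python
from collections import defaultdict

def proteins_by_domain(lines):
    """Group protein accessions by domain identifier.

    Assumption: each line is tab-separated, with the protein accession in
    column 1 and the domain identifier in column 2; other columns are ignored.
    """
    by_domain = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        by_domain[fields[0 + 1]].add(fields[0])
    return by_domain

def domain_feature(by_domain, domain, proteins):
    """0/1 vector: does each protein carry at least one domain of this type?"""
    members = by_domain.get(domain, set())
    return [int(p in members) for p in proteins]

# Invented rows in the assumed layout; P1 carries the first domain twice.
rows = [
    "P1\tIPR000001\tExample domain\tPF00001\t10\t90",
    "P1\tIPR000001\tExample domain\tPF00001\t120\t200",
    "P2\tIPR000002\tOther domain\tPF00002\t5\t60",
]
domains = proteins_by_domain(rows)
feature = domain_feature(domains, "IPR000001", ["P1", "P2", "P3"])
```

Because membership is stored as a set, a protein with several copies of a domain still contributes a single 1, in line with the "one or more domains" rule.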
Phylogenetic profiles, consisting of the presence or absence of human gene orthologs in 49 other species, were downloaded from the Ensembl database [67]. Presence or absence of an ortholog in each species was used as a feature.