Identifying the relevant genes (or other genomic features such as transcripts, miRNAs, lncRNAs, etc.) across conditions (e.g. tumor and non-tumor tissue samples) is a common research interest in gene-expression studies. In this gene selection problem, researchers are often interested in detecting a small set of genes for diagnostic purposes in medicine, which involves identifying the minimal subset of genes that achieves maximal predictive performance.
VoomDDA is a decision support tool developed for RNA-Sequencing datasets to assist researchers in diagnostic biomarker discovery and classification problems. VoomDDA consists of both sparse and non-sparse statistical learning classifiers adapted with the voom method. Voom is a recent method that estimates the mean-variance relationship of the log-counts of RNA-Seq data (log counts per million, log-cpm) at the observation level. It also provides precision weights for each observation that can be incorporated with the log-cpm values for further analysis. The algorithms in our tool incorporate the log-cpm values and the corresponding precision weights into the biomarker discovery and classification problem. For this purpose, these algorithms use weighted statistics in estimating the discriminating functions of the underlying statistical learning algorithms.
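As a rough illustration of the log-cpm transformation mentioned above, the following Python sketch computes log-cpm values from a raw count matrix using the offsets described in the voom paper (0.5 added to each count, 1 added to each library size). The function name `log_cpm` and the toy matrix are illustrative only, and the estimation of the precision weights is not shown.

```python
import numpy as np

def log_cpm(counts):
    """Convert a genes x samples matrix of raw counts to log2 counts per million."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)  # total mapped reads per sample
    # The offsets 0.5 and 1 avoid taking the logarithm of zero
    return np.log2((counts + 0.5) / (lib_sizes + 1.0) * 1e6)

# Toy data (hypothetical): 3 genes, 2 samples
toy_counts = np.array([[10, 20], [0, 5], [90, 75]])
toy_log_cpm = log_cpm(toy_counts)
```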
VoomNSC is a sparse classifier developed to bring together two powerful methods for RNA-Seq classification: (i) the voom method and (ii) the nearest shrunken centroids algorithm.
VoomNSC provides fast, accurate and sparse classification results for RNA-Seq data. More details can be found in the research paper. This tool also includes RNA-Seq extensions of the diagonal linear and diagonal quadratic discriminant classifiers: (i) voomDLDA and (ii) voomDQDA.
[1] Zararsiz, G., Goksuluk, D., Korkmaz, S., et al. (2015). VoomDDA: Discovery of Diagnostic Biomarkers and Classification of RNA-Seq Data.
[2] Law, C.W., Chen, Y., Shi, W., et al. (2014). voom: Precision weights unlock linear model analysis tools for RNA-Seq read counts. Genome Biology; 15:R29.
[3] Tibshirani, R., Hastie, T., Narasimhan, B., et al. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS; 99(10): 6567-72.
[4] Dudoit, S., Fridlyand, J. and Speed, T.P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association; 97(457): 77-87.
1. Uploading the Data
Two example datasets are available in the voomDDA web application: the cervical cancer dataset contains miRNA expression data and the lung cancer dataset contains gene expression data. For GO analysis, users should select the appropriate option (miRNA or gene) to obtain the related analysis results.
The VoomDDA application requires three inputs from the user: a training set, a test set and the class labels. The training and test sets should be text files (.txt) that contain the raw mapped read counts in matrix form, where rows correspond to genomic features (for simplicity, genes) and columns correspond to observations (or samples). This type of count data can be obtained from feature counting software such as HTSeq [1] or featureCounts [2]. Note that the data should contain the raw number of mapped reads; they should not be normalized and should not contain RPKM values. Class labels should also be in a text file (.txt) and should give the condition of each sample. Note that each row should contain the label of only one sample. Example datasets for the Witten et al. cervical dataset are given below:
If the purpose is to predict the class labels of new test observations, users should upload all three files. However, the test set is not required when the purpose is only to identify diagnostic biomarkers.
After uploading the data, make sure that the data are displayed on the screen.
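Before uploading, it may help to check the files locally. The following Python sketch parses a count matrix and a label file and verifies that the counts are non-negative integers and that there is one label per sample. It assumes a tab-delimited layout with a header row of sample names and gene names in the first column; this layout, the function names and the toy data are assumptions made for illustration, not a specification of the format the application expects.

```python
import csv
import io

def read_count_matrix(handle, delimiter="\t"):
    """Parse a genes x samples table (header row, gene names in column 1)."""
    rows = list(csv.reader(handle, delimiter=delimiter))
    samples = rows[0][1:]
    genes, counts = [], []
    for row in rows[1:]:
        values = [float(v) for v in row[1:]]
        # Raw mapped read counts are non-negative integers; normalized or
        # RPKM values would typically fail this check.
        if any(v < 0 or v != int(v) for v in values):
            raise ValueError(f"non-integer or negative count for {row[0]}")
        genes.append(row[0])
        counts.append([int(v) for v in values])
    return genes, samples, counts

def read_labels(handle):
    """Read one class label per row, in the same order as the samples."""
    return [line.strip() for line in handle if line.strip()]

# Toy files (hypothetical content) held in memory
toy_count_file = io.StringIO("gene\tS1\tS2\tS3\nmiR-1\t12\t0\t7\nmiR-2\t0\t3\t150\n")
toy_label_file = io.StringIO("tumor\nnormal\ntumor\n")

genes, samples, counts = read_count_matrix(toy_count_file)
labels = read_labels(toy_label_file)
if len(labels) != len(samples):
    raise ValueError("number of labels must match number of samples")
```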
2. Pre-processing the Data
2.1. Filtering
The VoomDDA classifiers (VoomNSC, VoomDLDA and VoomDQDA) introduced in this application have the same assumptions as the voom+limma pipeline [3], namely that rows with zero or very low counts should be filtered out. RNA-Seq count data often contain rows with a single unique value (mostly zero). Such rows may lead to unreliable estimation of the mean-variance relationship of the data and unstable model fitting for the introduced classifiers. Three filtering criteria are available: (i) DESeq2 outlier and independent filtering, (ii) near-zero variance filtering, (iii) variance filtering.
The DESeq2 package [4] contains filtering criteria based on outlier detection and independent filtering. Outliers are detected based on Cook's distance, and independent filtering is applied based on the gene-wise mean of normalized counts. More details can be found in the vignette of the DESeq2 package [5].
Near-zero variance filtering is described in the caret package of R [6]. This package applies filtering based on two criteria: (i) the ratio of the frequency of the most frequent value to the frequency of the second most frequent value is higher than 19 (95/5), (ii) the number of unique values divided by the sample size is less than 10%.
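The two criteria can be sketched in Python as follows. This is an illustration of the rule as described above rather than the caret implementation itself; the sketch assumes that a gene is flagged only when both criteria hold, and it also flags genes with a single unique value. The function name and the toy vectors are hypothetical.

```python
from collections import Counter

def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if a gene's counts look near-constant across samples."""
    freqs = Counter(values).most_common()
    if len(freqs) == 1:
        return True  # a single unique value, i.e. zero variance
    # Criterion (i): frequency ratio of the two most frequent values
    freq_ratio = freqs[0][1] / freqs[1][1]
    # Criterion (ii): percentage of unique values relative to the sample size
    percent_unique = 100.0 * len(freqs) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut

mostly_zero = [0] * 40 + [1]   # ratio 40/1, 2 unique values in 41 samples
all_zero = [0] * 41
variable = list(range(41))     # every value is unique

flags = [near_zero_variance(g) for g in (mostly_zero, all_zero, variable)]
```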
Variance filtering is another option for filtering out non-informative genes. This option may also be selected to decrease the computational cost of the model building process for very large datasets. After selecting this option, users can enter the number of genes to be included in the classification models.
After one or more filtering criteria are selected, filtering statistics are displayed on the screen.
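A minimal Python sketch of variance filtering is given here: genes are ranked by their variance across samples and the requested number of genes with the highest variance is kept. Whether the application computes the variance on raw or transformed counts is not stated above, so the use of the input matrix as given is an assumption, as are the function name and the toy data.

```python
import numpy as np

def top_variance_genes(counts, n_keep):
    """Return the row indices of the n_keep genes with the largest variance."""
    counts = np.asarray(counts, dtype=float)
    variances = counts.var(axis=1, ddof=1)   # sample variance per gene
    order = np.argsort(variances)[::-1]      # highest variance first
    return np.sort(order[:n_keep])           # keep the original row order

# Toy data (hypothetical): 4 genes, 3 samples
toy = np.array([[1, 1, 1], [0, 10, 20], [5, 6, 7], [100, 0, 50]])
kept = top_variance_genes(toy, n_keep=2)
filtered = toy[kept]
```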
2.2. Normalization
The library size of each observation depends on the experimental design, and differences in library size may introduce technical biases. These biases can have a significant effect on the classification results and should be corrected before building classification models. In our experiments, we found that normalization has a significant effect on the classification results for datasets that have very large library size differences across samples. Two normalization approaches are available in the application: (i) DESeq median ratio [7], (ii) trimmed mean of M values (TMM) [8]. More details about these approaches can be found in the referenced papers.
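To illustrate the idea behind the DESeq median ratio approach, the Python sketch below computes one size factor per sample as the median ratio of that sample's counts to the gene-wise geometric means, using only genes with no zero counts. It is a simplified sketch of the published method, not the code used by the application; the function name and the toy matrix are illustrative.

```python
import numpy as np

def median_ratio_size_factors(counts):
    """Estimate one size factor per sample from a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    expressed = (counts > 0).all(axis=1)       # genes with no zero counts
    log_counts = np.log(counts[expressed])
    # Gene-wise log geometric mean across samples serves as the reference
    log_geo_means = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_means, axis=0))

# Toy data (hypothetical): sample 2 was sequenced about twice as deeply as sample 1
toy = np.array([[10, 20], [100, 200], [5, 10], [0, 3]])
size_factors = median_ratio_size_factors(toy)
normalized = toy / size_factors
```

In this toy example the normalized counts of the three expressed genes become equal across the two samples, which is the intended effect of correcting for library size.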
3. Model Building for Classification
After data processing, users can build classification models with the three introduced algorithms: (i) voomNSC, (ii) voomDLDA, (iii) voomDQDA. VoomNSC is a sparse classifier that brings together two powerful methods, the voom method [3] and the nearest shrunken centroids algorithm [9], for the classification of RNA-Seq data. VoomDLDA and voomDQDA are non-sparse classifiers that extend the diagonal discriminant classifiers [10]. Details of these classifiers are given in the referenced paper [11].
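For readers unfamiliar with diagonal discriminant classifiers, the following Python sketch shows a plain diagonal linear discriminant analysis (DLDA) on a samples x genes matrix such as log-cpm values. It uses equal weights for all observations, so it omits the voom precision weights that voomDLDA incorporates; the exact weighted formulas are given in the referenced paper [11]. Function names and toy data are hypothetical.

```python
import numpy as np

def fit_dlda(X, y):
    """Fit DLDA: class centroids, pooled per-gene variances and class priors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)                     # sorted unique class labels
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # Pooled within-class variance per gene (diagonal covariance matrix);
    # genes with zero variance should be filtered out beforehand.
    resid = X - means[np.searchsorted(classes, y)]
    pooled_var = (resid ** 2).sum(axis=0) / (len(y) - len(classes))
    priors = np.array([(y == k).mean() for k in classes])
    return classes, means, pooled_var, priors

def predict_dlda(model, X_new):
    """Assign each new sample to the class with the smallest discriminant score."""
    classes, means, pooled_var, priors = model
    X_new = np.asarray(X_new, dtype=float)
    diff = X_new[:, None, :] - means[None, :, :]
    scores = (diff ** 2 / pooled_var).sum(axis=2) - 2.0 * np.log(priors)
    return classes[np.argmin(scores, axis=1)]

# Toy data (hypothetical): 4 training samples, 2 genes
X_train = [[1.0, 5.0], [1.2, 5.1], [3.0, 2.0], [3.1, 2.2]]
y_train = ["normal", "normal", "tumor", "tumor"]
model = fit_dlda(X_train, y_train)
predicted = predict_dlda(model, [[1.1, 5.0], [2.9, 2.1]])
```

The diagonal quadratic variant (DQDA) differs in that it estimates a separate variance for each gene within each class instead of a pooled one.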
After any of the three classifiers is selected, a summary of the fitting process is displayed on the screen. A confusion matrix and several statistical diagnostic measures are given to examine how well the classifier fits the given data. Furthermore, a heatmap is constructed to display the expression levels of genes and the gene-wise and sample-wise relationships. For the non-sparse classifiers the heatmap is displayed for all genes that were not filtered out, while for the sparse voomNSC classifier it is displayed for the selected gene subset.
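As an illustration of how a confusion matrix summarizes a fit, the Python sketch below counts the four cells of a two-class confusion matrix and derives accuracy, sensitivity and specificity. The text above does not list which measures the application reports, so this particular choice of measures, the function name and the toy labels are assumptions.

```python
def confusion_counts(actual, predicted, positive):
    """Return (tp, fn, fp, tn) for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1   # positive sample predicted as positive
            else:
                fn += 1   # positive sample predicted as negative
        else:
            if p == positive:
                fp += 1   # negative sample predicted as positive
            else:
                tn += 1   # negative sample predicted as negative
    return tp, fn, fp, tn

# Toy labels (hypothetical): T = tumor, N = normal
actual = ["T", "T", "T", "N", "N", "N", "N", "T"]
predicted = ["T", "T", "N", "N", "N", "T", "N", "T"]

tp, fn, fp, tn = confusion_counts(actual, predicted, positive="T")
accuracy = (tp + tn) / (tp + fn + fp + tn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```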
4. Identification of Diagnostic Biomarkers
If VoomNSC is the selected classifier, the subset of genes most relevant to the class condition is identified and the gene names are displayed on the screen. Several plots are also given. The first plot shows the selection of the threshold parameter; the value that yields the most accurate and sparsest model is identified as optimal. The second plot displays the distribution of the selected genes in each class. The third plot displays the shrunken differences of the selected genes. The final plot is the heatmap discussed in the previous section.
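The role of the threshold parameter can be illustrated with the soft-thresholding step of the nearest shrunken centroids algorithm [9]: each standardized difference between a class centroid and the overall centroid is moved towards zero by the threshold, and genes whose differences become zero in every class drop out of the model. The Python sketch below shows only this step, with hypothetical values; it does not reproduce the weighted statistics used by voomNSC.

```python
import numpy as np

def soft_threshold(d, delta):
    """Shrink each difference towards zero by delta, setting small ones to zero."""
    d = np.asarray(d, dtype=float)
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

# Hypothetical standardized differences: 2 classes x 3 genes
d = np.array([[2.5, -0.4, 0.1],
              [-2.5, 0.4, -0.1]])

shrunken = soft_threshold(d, delta=0.5)
# A gene is selected if its shrunken difference is non-zero in at least one class
selected = np.flatnonzero(np.any(shrunken != 0, axis=0))
```

A larger threshold removes more genes, which is why the threshold is chosen as a trade-off between accuracy and sparsity.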
5. Prediction
Based on the selected classifier, predictions appear on the screen for each test observation. Note that the test observations should be processed in the same way as the training observations: the same experimental and computational procedures should be applied before obtaining the raw count data. The test data should be in the same format as the training data to obtain the predictions. They should contain the raw mapped read counts, and the gene names should match those in the training data.
The VoomDDA application filters and normalizes the test data based on the information obtained from the training data. Thus, the parameters estimated from the training data are used for the test data. This ensures that both sets are on the same scale and homoscedastic with respect to each other.
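One way to reuse training information for test data is sketched below in Python for the DESeq median ratio approach: the gene-wise geometric means are computed from the training set only and then serve as the reference when estimating the size factor of each test sample. This is a sketch of the general idea under that assumption; the exact procedure used by the application is described in the referenced paper [11]. Function names and toy data are hypothetical.

```python
import numpy as np

def fit_reference(train_counts):
    """Store the reference genes and their log geometric means from training data."""
    train_counts = np.asarray(train_counts, dtype=float)
    expressed = (train_counts > 0).all(axis=1)   # genes with no zero counts
    log_geo_means = np.log(train_counts[expressed]).mean(axis=1)
    return expressed, log_geo_means

def size_factors_from_reference(counts, reference):
    """Estimate size factors for new samples against the training reference."""
    expressed, log_geo_means = reference
    counts = np.asarray(counts, dtype=float)[expressed]
    factors = []
    for column in counts.T:
        positive = column > 0                    # skip zero counts in the new sample
        ratios = np.log(column[positive]) - log_geo_means[positive]
        factors.append(np.exp(np.median(ratios)))
    return np.array(factors)

# Toy data (hypothetical): 4 genes, 2 training samples, 1 test sample
train = np.array([[10, 20], [100, 200], [5, 10], [0, 3]])
test = np.array([[40], [400], [20], [7]])

reference = fit_reference(train)
test_size_factors = size_factors_from_reference(test, reference)
normalized_test = test / test_size_factors
```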
6. Downstream Analysis
After detecting diagnostic biomarkers with the voomNSC algorithm, it may be useful to visualize the results to examine the interactions, or to carry out further analyses such as GO analysis. For this purpose, several downstream analysis tools are also available in this web application. These tools include heatmaps, network analysis and gene ontology analysis. Detailed information about gene ontology analysis can be found in the documentation of the topGO Bioconductor package.
References
[1] Anders, S., Pyl, P.T., and Huber, W. (2015) HTSeq - a Python framework to work with high-throughput sequencing data. Bioinformatics; 31(2):166-9.
[2] Liao, Y., Smyth, G.K., and Shi, W. (2013). featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics. doi: 10.1093/bioinformatics/btt656.
[3] Law, C.W., Chen, Y., Shi, W. and Smyth, G.K. (2014). voom: Precision weights unlock linear model analysis tools for RNA-Seq read counts. Genome Biology; 15:R29.
[4] Love, M.I., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology; 15:550. doi:10.1186/s13059-014-0550-8.
[5] Love, M.I., Huber, W. and Anders, S. (2015). Differential analysis of count data – the DESeq2 package. http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf (19.06.2015).
[6] Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software; 28(5).
[7] Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology; 11:R106. doi:10.1186/gb-2010-11-10-r106.
[8] Robinson, M.D., and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology; 11:R25.
[9] Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS; 99(10): 6567–72.
[10] Dudoit, S., Fridlyand, J. and Speed, T.P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association; 97(457): 77-87.
[11] Zararsiz, G., Goksuluk, D., Korkmaz, S., et al. (2015). VoomDDA: Discovery of Diagnostic Biomarkers and Classification of RNA-Seq Data.
Hacettepe University Faculty of Medicine Department of Biostatistics
gokmen.zararsiz@hacettepe.edu.tr
Hacettepe University Faculty of Medicine Department of Biostatistics
dincer.goksuluk@hacettepe.edu.tr
EMBL Heidelberg
Hacettepe University Faculty of Medicine Department of Biostatistics
selcuk.korkmaz@hacettepe.edu.tr
Istanbul University Faculty of Science Department of Biology
Cankiri Karatekin University Faculty of Science Department of Biology
Hacettepe University Faculty of Medicine Department of Biostatistics
Erciyes University Faculty of Medicine Department of Biostatistics
(2) Lung cancer data added as an example dataset
(2) Bug fixes and improvements
(1) Bug fixes and improvements
(1) Gene ontology results added
(2) Gene ontology plot added
(3) Bug fixes and improvements
(1) Heatmap added
(2) Network plot added
(3) Bug fixes and improvements
(1) Upgraded to shiny version 0.14
(1) VoomDDA web application has been released.
MLSeq: Machine learning interface for RNA-Seq data
MLViS: machine learning-based virtual screening tool
easyROC: a web-tool for ROC curve analysis
MVN: a web-tool for assessing multivariate normality
DDNAA: Decision support system for differential diagnosis of nontraumatic acute abdomen
If you use this application, please cite it as follows:
Zararsiz, G., Goksuluk, D., Klaus, B., Korkmaz, S., Eldem, V., Karabulut, E., & Ozturk, A. (2017). voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data. PeerJ, 5, e3890.