voomDDA: Discovery of Diagnostic Biomarkers and Classification of RNA-Seq Data (ver. 1.5)

Input data


Data files may be delimited by commas, tabs, semicolons or spaces.

Note: Samples are in the columns and genes are in the rows.


Identifying the relevant genes (or other genomic features such as transcripts, miRNAs, lncRNAs, etc.) across conditions (e.g. tumor and non-tumor tissue samples) is a common research interest in gene-expression studies. In this gene selection, researchers are often interested in detecting a small set of genes for diagnostic purposes in medicine, that is, in identifying the minimal subset of genes that achieves maximal predictive performance. This is the diagnostic biomarker discovery and classification problem.

VoomDDA is a decision support tool developed for RNA-Sequencing datasets to assist researchers in diagnostic biomarker discovery and classification. VoomDDA includes both sparse and non-sparse statistical learning classifiers adapted to the voom method. Voom estimates the mean-variance relationship of the log-counts of RNA-Seq data (log counts per million, log-cpm) at the observation level. It also provides a precision weight for each observation, which can be used together with the log-cpm values in further analysis. The algorithms in this tool incorporate the log-cpm values and the corresponding precision weights into biomarker discovery and classification. For this purpose, they use weighted statistics to estimate the discriminant functions of the underlying statistical learning algorithms.
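As a rough illustration of the log-cpm transformation mentioned above, the following Python sketch applies the formula given in the voom paper, log2((count + 0.5) / (library size + 1) x 10^6), to a small made-up count matrix. The function name and the toy counts are hypothetical, and the precision weights, which voom derives from a fitted mean-variance trend, are not computed here.

```python
import math

def log_cpm(counts):
    """Return log2 counts per million for a genes-by-samples count matrix.

    counts: list of rows (one per gene), each a list of raw counts per sample.
    The 0.5 and 1 offsets keep the logarithm finite for zero counts.
    """
    n_samples = len(counts[0])
    # Library size = total number of mapped reads in each sample (column sum)
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [math.log2((row[j] + 0.5) / (lib_sizes[j] + 1.0) * 1e6)
         for j in range(n_samples)]
        for row in counts
    ]

# Hypothetical toy data: 3 genes (rows) by 2 samples (columns)
toy_counts = [[10, 20], [0, 5], [90, 175]]
toy_log_cpm = log_cpm(toy_counts)
```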

VoomNSC is a sparse classifier that brings together two powerful methods, voom and nearest shrunken centroids (NSC), with two aims:

1. to extend the voom method to RNA-Seq classification studies,
2. to make the nearest shrunken centroids (NSC) algorithm available for RNA-Seq technology.

VoomNSC provides fast, accurate and sparse classification results for RNA-Seq data. More details can be found in the research paper. This tool also includes RNA-Seq extensions of the diagonal linear and diagonal quadratic discriminant classifiers: (i) voomDLDA and (ii) voomDQDA.

References

[1] Zararsiz, G., Goksuluk, D., Korkmaz, S., et al. (2015). VoomDDA: Discovery of Diagnostic Biomarkers and Classification of RNA-Seq Data.

[2] Law, C.W., Chen, Y., Shi, W., et al. (2014). voom: Precision weights unlock linear model analysis tools for RNA-Seq read counts. Genome Biology; 15:R29.

[3] Tibshirani, R., Hastie, T., Narasimhan, B., et al. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS; 99(10): 6567-72.

[4] Dudoit, S., Fridlyand, J. and Speed, T.P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association; 97(457): 77-87.


            


Tutorial

1. Uploading the Data

Two example datasets are available in the voomDDA web application: the cervical cancer data is a miRNA dataset, and the lung cancer data is a gene expression dataset. For GO analysis, users should select the matching option (miRNA or gene) to obtain the related results.

The VoomDDA application requires three inputs from the user. The train and test sets should be text files (.txt) that contain the raw mapped read counts in matrix form, where rows correspond to genomic features (for simplicity, genes) and columns correspond to observations (or samples). This type of count data can be obtained from feature-counting software such as HTSeq [1] or featureCounts [2]. Note that the data should contain the raw number of mapped reads; it should not be normalized or contain RPKM values. Class labels should also be in a text file (.txt) that gives the condition of each sample, with exactly one sample label per row. Example files for the cervical dataset of Witten et al. are given below:

Training set of cervical data

Test set of cervical data

Class labels of cervical data

If the purpose is to predict the class labels of new test observations, users should upload all three files. The test set is not required when the purpose is only to identify diagnostic biomarkers.

After uploading the data, make sure that the data is displayed on the screen.
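Before uploading, it can help to check that the files match the layout described in this section. The Python sketch below is one way to do that, assuming a tab-delimited file whose first row holds sample names and whose first column holds gene names; that header layout, the function names and the toy data are assumptions made for illustration, not a specification of what the application accepts.

```python
import csv
import io

def read_count_matrix(text, delimiter="\t"):
    """Parse a genes-by-samples table of raw counts.

    Assumed layout: header row with sample names, then one gene per row
    with the gene name in the first column.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    samples = rows[0][1:]
    genes, counts = [], []
    for row in rows[1:]:
        # int() raises ValueError on values such as "3.5", which would
        # suggest normalized or RPKM values rather than raw counts
        values = [int(v) for v in row[1:]]
        if len(values) != len(samples) or any(v < 0 for v in values):
            raise ValueError("malformed row for gene " + row[0])
        genes.append(row[0])
        counts.append(values)
    return samples, genes, counts

def read_labels(text):
    """Parse class labels: one sample label per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Hypothetical toy files
train_text = "gene\tS1\tS2\tS3\nmiR-1\t10\t0\t7\nmiR-2\t3\t5\t0\n"
labels_text = "tumor\nnormal\ntumor\n"

samples, genes, counts = read_count_matrix(train_text)
labels = read_labels(labels_text)
if len(labels) != len(samples):
    raise ValueError("number of class labels must equal number of samples")
```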


2. Pre-processing the Data

2.1. Filtering

The VoomDDA classifiers (VoomNSC, VoomDLDA and VoomDQDA) introduced in this application share a requirement of the voom+limma pipeline [3]: rows with zero or very low counts should be filtered out. RNA-Seq count data often contain rows with a single unique value (mostly zero). Such rows may lead to unreliable estimation of the mean-variance relationship and to unstable model fitting for the introduced classifiers. Three filtering criteria are available: (i) DESeq2 outlier and independent filtering, (ii) near-zero variance filtering, (iii) variance filtering.

The DESeq2 package [4] provides filtering criteria based on outlier detection and independent filtering. Outliers are detected using Cook’s distance, and independent filtering is applied based on the gene-wise mean of normalized counts. More details can be found in the vignette of the DESeq2 package [5].

Near-zero variance filtering is described in the caret package of R [6]. It flags a feature based on two criteria: (i) the ratio of the frequency of the most frequent value to the frequency of the second most frequent value is higher than 19 (95/5), and (ii) the number of unique values divided by the sample size is less than 10%.
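The two criteria can be sketched in a few lines of Python. This is an illustration rather than the caret implementation: the function name is hypothetical, and flagging a gene only when both criteria hold (or when it has a single unique value) is an assumption about how the criteria are combined.

```python
from collections import Counter

def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if a gene's counts look (near-)constant across samples."""
    value_counts = Counter(values).most_common()
    if len(value_counts) == 1:
        return True  # a single unique value means zero variance
    # (i) frequency of the most common value over the second most common
    freq_ratio = value_counts[0][1] / value_counts[1][1]
    # (ii) unique values as a percentage of the sample size
    percent_unique = 100.0 * len(value_counts) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut

# Hypothetical genes measured on 40 samples
mostly_zero = [0] * 39 + [1]   # ratio 39/1 = 39, 2 unique values = 5%
all_zero = [0] * 40
variable = list(range(40))

flags = [near_zero_variance(g) for g in (mostly_zero, all_zero, variable)]
```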

Variance filtering is another option for filtering out non-informative genes. It may also be selected to decrease the computational cost of model building for very large datasets. After selecting this option, users can enter the number of genes to be included in the classification models.
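A minimal sketch of variance filtering is shown below: genes are ranked by their variance across samples and the requested number is kept. Computing the variance on the raw counts is an assumption made here for simplicity (the application may rank genes on a transformed scale), and the function name and toy data are hypothetical.

```python
import statistics

def top_variance_genes(genes, counts, n_keep):
    """Keep the n_keep genes with the largest sample variance.

    genes: list of gene names; counts: matching list of per-sample rows.
    Returned names preserve the original row order.
    """
    variances = [statistics.variance(row) for row in counts]
    ranked = sorted(range(len(genes)), key=lambda i: variances[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [genes[i] for i in keep]

# Hypothetical toy data: 3 genes by 4 samples
toy_genes = ["g1", "g2", "g3"]
toy_counts = [[1, 1, 1, 2], [0, 50, 100, 10], [5, 6, 7, 8]]
kept = top_variance_genes(toy_genes, toy_counts, n_keep=2)
```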

After one or more filtering criteria are selected, filtering statistics are displayed on the screen.

2.2. Normalization

Library sizes differ across observations depending on the experimental design, which may introduce technical biases. These biases can have a significant effect on the classification results and should be corrected before building classification models. In our experiments, we found that normalization has a significant effect on the classification results for datasets with very large library size differences across samples. Two normalization approaches are available in the application: (i) DESeq median ratio [7], (ii) trimmed mean of M values (TMM) [8]. More details about these approaches can be found in the referenced papers.
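To make the median ratio idea concrete, the Python sketch below computes one size factor per sample as the median, over genes, of the ratio between that sample's count and the gene's geometric mean across samples. It follows the approach described in [7] in simplified form; the function name and toy data are hypothetical, and genes containing a zero count are skipped because their geometric mean is zero.

```python
import math
import statistics

def median_ratio_size_factors(counts):
    """Return one size factor per sample for a genes-by-samples count matrix."""
    n_samples = len(counts[0])
    # Use only genes with no zero counts, so the log geometric mean is finite
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors

# Hypothetical toy data: the second sample was sequenced twice as deeply
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
size_factors = median_ratio_size_factors(toy_counts)
normalized = [[c / f for c, f in zip(row, size_factors)] for row in toy_counts]
```

Dividing each column by its size factor puts the samples on a common scale; in the toy data the first three genes end up with equal normalized counts in both samples.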


3. Model Building for Classification

After pre-processing, users can build classification models with the three introduced algorithms: (i) voomNSC, (ii) voomDLDA, (iii) voomDQDA. VoomNSC is a sparse classifier that brings together two powerful methods, the voom method [3] and the nearest shrunken centroids algorithm [9], for the classification of RNA-Seq data. VoomDLDA and voomDQDA are non-sparse classifiers that extend the diagonal discriminant classifiers [10]. Details of these classifiers are given in the referenced paper [11].
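For readers unfamiliar with diagonal discriminant classifiers, the Python sketch below shows plain diagonal linear discriminant analysis in the sense of [10]: each class has its own mean per gene, variances are pooled across classes and treated as independent between genes, and a sample is assigned to the class with the smallest variance-scaled distance. This is not the voomDLDA implementation, which works with log-cpm values and precision-weighted statistics as described in [11]; all names and the toy data are hypothetical.

```python
import math

def fit_dlda(x, labels):
    """Fit DLDA on a genes-by-samples matrix x with one label per sample."""
    classes = sorted(set(labels))
    n = len(labels)
    means = {}
    pooled = [0.0] * len(x)
    for k in classes:
        idx = [j for j, lab in enumerate(labels) if lab == k]
        means[k] = [sum(row[j] for j in idx) / len(idx) for row in x]
        for g, row in enumerate(x):
            pooled[g] += sum((row[j] - means[k][g]) ** 2 for j in idx)
    # Pooled within-class variance per gene (must be non-zero for prediction)
    pooled = [ss / (n - len(classes)) for ss in pooled]
    priors = {k: labels.count(k) / n for k in classes}
    return classes, means, pooled, priors

def predict_dlda(model, sample):
    """Assign one sample (a list of per-gene values) to the closest class."""
    classes, means, pooled, priors = model

    def score(k):
        dist = sum((v - m) ** 2 / s for v, m, s in zip(sample, means[k], pooled))
        return dist - 2.0 * math.log(priors[k])

    return min(classes, key=score)

# Hypothetical toy data: 2 genes by 4 samples on a log scale
toy_x = [[1.0, 1.2, 5.0, 5.3], [2.0, 2.1, 2.2, 1.9]]
toy_labels = ["N", "N", "T", "T"]
model = fit_dlda(toy_x, toy_labels)
predictions = [predict_dlda(model, s) for s in ([4.9, 2.0], [1.3, 2.0])]
```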

After one of the three classifiers is selected, a summary of the fitting process is displayed on the screen. A confusion matrix and several statistical diagnostic measures are given to show how well the classifier fits the given data. Furthermore, a heatmap is constructed to display the expression levels of genes and the gene-wise and sample-wise relationships. For the non-sparse classifiers the heatmap shows all genes that were not filtered out, while for the sparse voomNSC classifier it shows the selected gene subset.
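The kind of summary described above can be sketched for a two-class problem as follows. The choice of measures here (accuracy, sensitivity, specificity) is an assumption for illustration, since the text does not list which measures the application reports; the function name and toy labels are hypothetical.

```python
def binary_measures(actual, predicted, positive):
    """Build a 2x2 confusion matrix and a few common diagnostic measures."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return {
        "confusion": {"TP": tp, "FN": fn, "FP": fp, "TN": tn},
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Hypothetical true and predicted class labels for six samples
actual = ["T", "T", "T", "N", "N", "N"]
predicted = ["T", "T", "N", "N", "N", "T"]
summary = binary_measures(actual, predicted, positive="T")
```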


4. Identification of Diagnostic Biomarkers

If VoomNSC is the selected classifier, the subset of genes most relevant to the class condition is identified and the gene names are displayed on the screen. Several plots are also given. The first plot shows the selection of the threshold parameter; the value that yields the most accurate and sparsest model is identified as optimal. The second plot displays the distribution of the selected genes in each class. The third plot displays the shrunken differences of the selected genes. The final plot is the heatmap discussed in the previous section.
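The role of the threshold can be illustrated with the soft-thresholding step of the nearest shrunken centroids algorithm [9]: each standardized class difference is moved toward zero by the threshold, and a gene is dropped once its differences are zero in every class. The Python sketch below shows that step, plus one possible reading of "most accurate and sparsest" (fewest errors first, then fewest genes). The function names, the tie-breaking rule and all numbers are hypothetical.

```python
def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values within delta of zero become 0."""
    shrunk = max(abs(d) - delta, 0.0)
    if shrunk == 0.0:
        return 0.0
    return shrunk if d > 0 else -shrunk

def selected_genes(differences, delta):
    """Keep genes with a non-zero shrunken difference in at least one class."""
    return [g for g, ds in differences.items()
            if any(soft_threshold(d, delta) != 0.0 for d in ds)]

def choose_threshold(results):
    """results: (delta, n_errors, n_genes) tuples; assumed selection rule."""
    return min(results, key=lambda r: (r[1], r[2]))[0]

# Hypothetical standardized differences for 3 genes in 2 classes
toy_differences = {"g1": [2.5, -2.5], "g2": [0.4, -0.3], "g3": [-1.2, 1.5]}
kept = selected_genes(toy_differences, delta=1.0)

# Hypothetical cross-validation results for four candidate thresholds
best_delta = choose_threshold([(0.5, 3, 120), (1.0, 1, 40), (1.5, 1, 12), (2.0, 4, 3)])
```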


5. Prediction

Based on the selected classifier, predictions appear on the screen for each test observation. Note that the test observations should be processed in the same way as the training observations: the same experimental and computational procedures should be applied to obtain the raw count data. The test data should be in the same format as the training data; they should contain the raw mapped read counts, and the gene names should match those in the training data.

The VoomDDA application filters and normalizes the test data based on information obtained from the training data. Thus, the parameters estimated from the training data are used for the test data. This ensures that both sets are on the same scale and homoscedastic with respect to each other.


6. Downstream Analysis

After detecting diagnostic biomarkers with the voomNSC algorithm, it may be useful to visualize the results to examine interactions, or to carry out further analysis such as GO analysis. For this purpose, several downstream analysis tools are available in this web application, including heatmaps, network analysis and gene ontology analysis. Detailed information about gene ontology analysis can be found in the topGO Bioconductor package.


References

[1] Anders, S., Pyl, P.T. and Huber, W. (2015). HTSeq - a Python framework to work with high-throughput sequencing data. Bioinformatics; 31(2): 166-9.

[2] Liao, Y., Smyth, G.K. and Shi, W. (2013). featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics. doi:10.1093/bioinformatics/btt656.

[3] Law, C.W., Chen, Y., Shi, W. and Smyth, G.K. (2014). voom: Precision weights unlock linear model analysis tools for RNA-Seq read counts. Genome Biology; 15:R29.

[4] Love, M.I., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology; 15:550. doi:10.1186/s13059-014-0550-8.

[5] Love, M.I., Huber, W. and Anders, S. (2015). Differential analysis of count data – the DESeq2 package. http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf (19.06.2015).

[6] Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software; 28(5).

[7] Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology; 11:R106. doi:10.1186/gb-2010-11-10-r106.

[8] Robinson, M.D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology; 11:R25.

[9] Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS; 99(10): 6567–72.

[10] Dudoit, S., Fridlyand, J. and Speed, T.P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association; 97(457): 77-87.

[11] Zararsiz, G., Goksuluk, D., Korkmaz, S., et al. (2015). VoomDDA: Discovery of Diagnostic Biomarkers and Classification of RNA-Seq Data.

Authors

Gokmen Zararsiz, PhD

Hacettepe University Faculty of Medicine Department of Biostatistics

gokmen.zararsiz@hacettepe.edu.tr


Dincer Goksuluk, PhD

Hacettepe University Faculty of Medicine Department of Biostatistics

dincer.goksuluk@hacettepe.edu.tr


Bernd Klaus, PhD

EMBL Heidelberg

bernd.klaus@embl.de


Selcuk Korkmaz, PhD

Hacettepe University Faculty of Medicine Department of Biostatistics

selcuk.korkmaz@hacettepe.edu.tr


Vahap Eldem, PhD

Istanbul University Faculty of Science Department of Biology

vahap.eldem@istanbul.edu.tr


Turgay Unver, PhD

Cankiri Karatekin University Faculty of Science Department of Biology

turgayunver@gmail.com


Erdem Karabulut, PhD

Hacettepe University Faculty of Medicine Department of Biostatistics

ekarabul@hacettepe.edu.tr


Ahmet Ozturk, PhD

Erciyes University Faculty of Medicine Department of Biostatistics

ahmets67@hotmail.com


News

Version 1.5 (November 25, 2016)

(1) Lung cancer data added as an example dataset

(2) Bug fixes and improvements

Version 1.4 (November 14, 2016)

(1) Bug fixes and improvements

Version 1.3 (November 5, 2016)

(1) Gene ontology results added

(2) Gene ontology plot added

(3) Bug fixes and improvements

Version 1.2 (August 20, 2016)

(1) Heatmap added

(2) Network plot added

(3) Bug fixes and improvements

Version 1.1 (July 18, 2016)

(1) Upgraded to shiny version 0.14

Version 1.0 (June 18, 2015)

(1) VoomDDA web application has been released.


Other Tools

MLSeq: Machine learning interface for RNA-Seq data

MLViS: machine learning-based virtual screening tool

easyROC: a web-tool for ROC curve analysis

MVN: a web-tool for assessing multivariate normality

DDNAA: Decision support system for differential diagnosis of nontraumatic acute abdomen


Please feel free to send us bugs and feature requests.

If you use this application, please cite it as follows:

Zararsiz, G., Goksuluk, D., Klaus, B., Korkmaz, S., Eldem, V., Karabulut, E., & Ozturk, A. (2017). voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data. PeerJ, 5, e3890.

Click here for the published paper.