webTclass: A system for biomarker discovery with stability analysis
Biomarker discovery

Data analysis

webTclass is a system for biomarker discovery based on the scheme developed in the Tclass classification system (Bioinformatics, 2002, 18:325-6), which was originally written in MATLAB for sample class prediction from gene expression profiles. The Tclass system integrates stepwise-optimization feature selection with the naive Bayes discriminant method. Its main characteristic is stability analysis, through which the classification power of different feature sets can be distinguished clearly even when those feature sets have the same leave-one-out classification accuracy. The Tclass system has been applied in the following analyses.
  • An Approach to Studying Lung Cancer-related Proteins in Human Blood (Mol Cell Proteomics, 2005, 4:1480-6).
  • How many genes are needed for early detection of breast cancer, based on gene expression patterns in ... (Breast Cancer Res., 2005, 7:E5).
  • Construction of mathematical model for high-level expression of foreign genes in pPIC9 vector... (Biochem Biophys Res Commun, 2007, 354:498-504).
  • Construction of two mathematical models for prediction of bacterial sRNA targets (Biochem Biophys Res Commun, 2008, 372:346-50).
  • sTarPicker: a method for efficient prediction of bacterial sRNA targets based on a two-step model for hybridization (PLoS One, 2011, 6:e22705).
  • Identification of Novel Autoantibodies for Detection of Malignant Mesothelioma (PLoS One, 2013, 8:e72458).
  • The potential biomarker panels for identification of Major Depressive Disorder (MDD) patients with and without ... (PLoS One, 2014, 9:e97479).

To support researchers in related fields, we developed the webTclass web server in the R language. We also provide an option to analyze data from the GEO database directly by entering a GEO dataset name such as GDS6063.

    Stability analysis

    Input a matrix

    The first step in developing a prediction model is to input a matrix with sample type information into the webTclass system. There are three ways to input a matrix (Figure01).

    Figure01: Three options for inputting a matrix (1, 2, and 3).

    1. The first way is to input a matrix in csv format (Figure02).

    Figure02: Data format of the csv file: Rows represent features, columns represent samples, the first row for the name of samples, the second row for sample type information, and the first column for the feature names.

    In the csv file above (Figure02), the second row, named samtype, must be provided; it is used to determine the number of samples in each class automatically. After uploading the file, click the button 'Get matrix'. The matrix is then read into the system, and the related classification information and the matrix are displayed automatically (Figure03).
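
As an illustration of the layout in Figure02, the following minimal Python sketch parses such a file and counts the samples in each class from the samtype row. It is not the webTclass implementation (which is written in R); the function name `read_matrix`, the header cell `feature`, the demo values, and the choice to treat empty cells as missing values are all assumptions made for this example.

```python
import csv
import io
from collections import Counter


def read_matrix(csv_text):
    """Parse a features-by-samples CSV whose second row is named 'samtype'."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    sample_names = rows[0][1:]            # first row: sample names
    if rows[1][0] != "samtype":
        raise ValueError("the second row must be named 'samtype'")
    sample_types = rows[1][1:]            # second row: sample type per column
    feature_names = [r[0] for r in rows[2:]]   # first column: feature names
    # Assumption: an empty cell encodes a missing value (stored as None).
    values = [[float(v) if v != "" else None for v in r[1:]] for r in rows[2:]]
    return sample_names, sample_types, feature_names, values


# Hypothetical demo data, not taken from webTclass.
demo = """feature,s1,s2,s3,s4
samtype,tumor,tumor,normal,normal
geneA,1.0,2.0,0.5,0.4
geneB,3.0,,2.5,2.0
"""

names, types, features, values = read_matrix(demo)
class_sizes = Counter(types)   # number of samples in each class
```

Here `class_sizes` plays the role of the class information that the server derives automatically from the samtype row.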

    Figure03: Information for sample groups and feature filtering.

    Because the matrix may contain missing values, we provide an option to discard those samples whose ratio of missing values exceeds a given value. The default value is 20% (0.2). If the ratio of missing values in a feature across all samples is less than the given value, the missing values are filled with zeros. This step is necessary so that the matrix can be used to develop prediction models.

    The second filtering step selects features using a t-test (two classes) or AOV, i.e. analysis of variance (more than two classes), at a given p value. The default p value is 0.1. Here we assume that the values of a potential biomarker should differ significantly between sample groups. If the user sets a smaller p value, fewer features will be selected. The remaining features will be used to develop models.

    Furthermore, a standardization option is provided. When this option is selected, each sample vector (each column of the input matrix) is standardized so that its mean and standard deviation become zero and one, respectively.

    Finally, click the button 'Filter'. The program performs the calculation and displays the updated matrix, which will be used to detect optimal feature sets and develop models. This matrix can also be downloaded.
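
The missing-value step can be sketched as follows. This is only an illustration under stated assumptions: the ratio is computed per feature (per row) across all samples, a row whose ratio equals the threshold exactly is kept, and missing values are represented as `None`. The function name `filter_and_fill` is hypothetical and the actual webTclass behavior may differ.

```python
def filter_and_fill(values, max_missing_ratio=0.2):
    """Drop rows with too many missing values; zero-fill the rest.

    `values` is a list of rows (one row per feature), with None marking
    a missing value. Returns (original_row_index, filled_row) pairs.
    """
    kept = []
    for idx, row in enumerate(values):
        ratio = sum(v is None for v in row) / len(row)
        if ratio > max_missing_ratio:
            continue  # too many missing values: discard this row
        # Assumption: rows at or below the threshold are kept and zero-filled.
        kept.append((idx, [0.0 if v is None else v for v in row]))
    return kept


rows = [
    [1.0, None, 3.0, 4.0, 5.0],   # 20% missing: kept, gap filled with 0
    [None, None, 1.0, 2.0, 3.0],  # 40% missing: discarded
    [1.0, 2.0, 3.0, 4.0, 5.0],    # complete: kept unchanged
]
filtered = filter_and_fill(rows)
```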
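
A minimal sketch of column-wise standardization is given below. It assumes the sample standard deviation (n - 1 denominator) is used; the text does not say which estimator webTclass applies, and the function name `standardize_columns` is hypothetical.

```python
import statistics


def standardize_columns(matrix):
    """Return a copy of `matrix` with each column scaled to mean 0, SD 1.

    `matrix` is a list of rows (features); each column is one sample.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [matrix[i][j] for i in range(n_rows)]
        mu = statistics.mean(col)
        sd = statistics.stdev(col)  # assumption: sample SD (n - 1)
        if sd == 0:
            raise ValueError(f"column {j} is constant and cannot be standardized")
        for i in range(n_rows):
            out[i][j] = (col[i] - mu) / sd
    return out


# Two samples (columns) measured on three features (rows).
scaled = standardize_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

After this transformation both columns contain the same standardized values, even though their original scales differ by a factor of ten.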

    2. The second way is to input a GEO dataset name such as GDS6063.

    At present, we only accept datasets from the GEO database. The format is GDS followed by a series of digits; the hint shown for this option is GDS6063. To test a GEO dataset, leave the first option (the csv file upload) blank, then click the button 'Get matrix', and the related information will be displayed. Note that the sample group information is extracted automatically. Users can change the group information using the button 'Reset group information' in Figure03.
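
The expected name format can be checked before submission with a short sketch like the one below. The helper `is_gds_accession` is hypothetical, and treating the prefix as case-sensitive is an assumption; the check only validates the shape of the name, not whether the dataset exists in GEO.

```python
import re

# "GDS" followed by one or more digits, e.g. GDS6063.
GDS_PATTERN = re.compile(r"GDS\d+")


def is_gds_accession(name):
    """Return True if `name` looks like a GEO dataset name such as GDS6063."""
    return GDS_PATTERN.fullmatch(name.strip()) is not None
```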

    3. The third way is to click the button demoMatrix directly.

    When the user clicks this button, the classification information and matrix are displayed automatically, as in the first way.

    Select the optimal feature sets and develop models

    To build a model, some parameters must be set (Figure04).

    Figure04: Parameter options for developing models.

    1. Input maximum number of features: the maximum number of features in a set for developing models.

    2. Input maximum number of feature set: the maximum number of the feature sets for each number of features from 1 to the maximum number of features.

    3. Discriminant methods: four methods are provided, namely the naive Bayes method, the linear discriminant algorithm, the KNN algorithm, and the Mahalanobis distance discriminant.

    4. Object function: the objective function used to search for the optimal feature sets while developing the model. For example, when Leave-one-out CV (cross-validation) is selected, the candidate feature sets are evaluated with this function, and the feature sets with the top values are carried forward to stability analysis. Four options are provided: 100% training, Leave-one-out CV, Leave-one-out mcc, and Leave-one-out f1.

    5. Stability analysis: to remove the effect of sample partitioning on model performance, for each selected feature set we partition the whole dataset into a training dataset and a test dataset at the given ratio a given number of times. The default ratio and number of simulations are 0.75 and 100, respectively. The stability index is defined as the average accuracy over the test datasets from all simulations. Finally, the feature sets with the highest stability index are used to develop the prediction model.

    6. Partition ratio: the ratio used to divide the whole dataset into the training dataset and the test dataset.

    7. Simulations: the number of simulations in stability analysis, which is usually between 100 and 1000.
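
The stability index described in items 5 to 7 can be sketched as follows. This is an illustration only, not the webTclass code: a simple nearest-centroid classifier stands in for the four discriminant methods listed above, the partitions are drawn by unstratified random shuffling, and the function names and demo data are hypothetical. Each entry of `samples` is one sample restricted to the selected feature set.

```python
import random


def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier: assign each test sample to the closest class mean."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in test_x]


def stability_index(samples, labels, ratio=0.75, simulations=100, seed=0):
    """Average test accuracy over repeated random training/test partitions."""
    rng = random.Random(seed)
    n = len(samples)
    n_train = int(round(ratio * n))
    if not 0 < n_train < n:
        raise ValueError("ratio must leave both partitions non-empty")
    accuracies = []
    for _ in range(simulations):
        order = list(range(n))
        rng.shuffle(order)  # assumption: unstratified random partition
        train, test = order[:n_train], order[n_train:]
        preds = nearest_centroid_predict(
            [samples[i] for i in train],
            [labels[i] for i in train],
            [samples[i] for i in test],
        )
        correct = sum(p == labels[i] for p, i in zip(preds, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Hypothetical two-class data with two selected features per sample.
demo_samples = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
    [10.0, 9.8], [9.9, 10.1], [10.2, 10.0], [9.7, 10.3],
]
demo_labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
index = stability_index(demo_samples, demo_labels, ratio=0.75, simulations=100)
```

Two feature sets with the same leave-one-out accuracy can receive different values of this index, which is what allows them to be ranked.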

    After setting the parameters, the user can click the button "construct prediction models" to construct the model. At the same time, a link to the calculation results is displayed, from which users can download the prediction model for further use ("Ensembl classifiers" or "Predict new samples"). For the demo matrix with default filtering, developing the model takes about 2 minutes. When the calculation finishes, model information such as the candidate feature sets, their stability indices, and the backward analysis is provided.

    We also provide a download button to view the calculation results. When the user clicks the download button, the candidate feature sets, their stability indices, and the backward analysis are displayed if the task has completed. Users can also use the load() function in the R language to view the detailed model information.

    Display or develop new ensemble classifiers

    When a user uploads a model developed in the previous step ("Construct models"), the model information developed using the feature set with the highest stability is displayed (Figure05). All other feature sets evaluated in the previous step are also displayed, from which the user can develop new models using alternative feature sets.

    Figure05: Display or develop new models.

    When the user selects a new feature set and clicks the button "Construct ensemble classifier", a new model is built and predictions are made on the existing data. The prediction results, probability and accuracy are displayed at the bottom of the page. Additionally, the model can also be downloaded.

    Make prediction on new samples

    Users can predict new samples using the model developed in the previous steps ("Construct models" or "Ensembl classifiers"). The prediction interface is displayed in Figure06.

    Figure06: Predict new samples.

    To predict new samples, users first upload the model developed in the previous steps ("Construct models" or "Ensembl classifiers"). Then the user should input the new sample information. The new sample data should keep the same structure as in "Input a matrix", excluding the row "samtype"; the related features are extracted from it by name. It is therefore very important to keep the feature names consistent between the training dataset and the new samples. Otherwise, the prediction results will be wrong. The prediction results are shown in a table at the bottom of the page.
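
Because features are matched by name, it can help to check the new sample file against the model's feature set before uploading it. The sketch below shows one way to do this; the function `extract_model_features` and the demo names are hypothetical, and failing loudly on a missing name is a design choice of this example rather than documented webTclass behavior.

```python
def extract_model_features(model_features, new_feature_names, new_values):
    """Pick the rows of `new_values` that match the model's features by name.

    Rows are returned in the order the model expects. A KeyError is raised
    if any model feature is absent from the new samples.
    """
    index = {name: i for i, name in enumerate(new_feature_names)}
    missing = [f for f in model_features if f not in index]
    if missing:
        raise KeyError(f"features missing from new samples: {missing}")
    return [new_values[index[f]] for f in model_features]


# Hypothetical example: the model uses geneB and geneA, in that order.
selected = extract_model_features(
    ["geneB", "geneA"],
    ["geneA", "geneC", "geneB"],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
)
```

A name that differs only in spelling or case between the training data and the new samples would be reported here as missing, which surfaces the inconsistency before any prediction is made.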

    Prof. Shengqi Wang


    Beijing Institute of Microbiology and Epidemiology

    Prof. Wuju Li

    liwj@bmi.ac.cn or wujuchina@126.com

    Beijing Institute of Microbiology and Epidemiology