mass spectrometry for proteomics

Quick-Start

This quick-start tutorial guides you through all the steps needed to (i) load a preprocessed dataset (i.e. peak lists), (ii) perform quality control, (iii) match all the spectra in the dataset to make them comparable and (iv) carry out different analyses, such as clustering, Principal Component Analysis or biomarker discovery.

Contents

  1. Load peak lists
  2. Perform quality control
  3. Match peaks
  4. Analyze data
    1. Biomarker discovery
    2. Principal Component Analysis
    3. Clustering
    4. Classification

1. Load peak lists

Download the Cancer dataset and decompress it on your computer. This dataset contains spectra from twelve samples belonging to three categories (labels): Lymphoma, Myeloma and Healthy. The samples are separated into three directories, one per label.
Since they are labeled peak lists, press the Load Peak List button on the toolbar. A dialog box will appear:

Load Peak List Dialog

Select the LABELED Experiment Type. Then, another dialog box will appear:

Load Peak List Dialog II

In the Data Directories section, add the directory corresponding to each label group. Click the OK button to load the data.

After the data is loaded, you can see it in the Mass-Up data section:

Mass-Up data
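
For readers who want to inspect the same data outside Mass-Up, the following is a minimal sketch of reading labeled peak lists from one directory per label. It assumes, hypothetically, that each peak list is a CSV file with an m/z column followed by an intensity column; the file layout and the function names parse_peak_list and load_labeled_peak_lists are illustrative and are not part of Mass-Up.

```python
import csv
from pathlib import Path


def parse_peak_list(lines):
    """Parse 'm/z,intensity' rows into a list of (mz, intensity) tuples."""
    peaks = []
    for row in csv.reader(lines):
        # Skip blank lines and rows with fewer than two fields.
        if len(row) < 2:
            continue
        try:
            peaks.append((float(row[0]), float(row[1])))
        except ValueError:
            # Non-numeric rows (for example a header) are ignored.
            continue
    return peaks


def load_labeled_peak_lists(label_dirs):
    """Map each label (directory name) to {relative file path: peaks}."""
    dataset = {}
    for label_dir in map(Path, label_dirs):
        spectra = {}
        # Assumed layout: CSV files anywhere below the label directory.
        for path in sorted(label_dir.rglob("*.csv")):
            with path.open(newline="") as handle:
                spectra[str(path.relative_to(label_dir))] = parse_peak_list(handle)
        dataset[label_dir.name] = spectra
    return dataset
```

Calling load_labeled_peak_lists(["Lymphoma", "Myeloma", "Healthy"]) would then return one dictionary entry per label, mirroring the three label directories added in the dialog.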

2. Perform quality control

The quality control (QC) step allows you to check that the spectra in your dataset are correct. For this reason, it is usually performed before carrying out the analyses.

The QC has two levels: (i) replicates, a low-level QC analysis focused on the replicates of each sample, and (ii) samples, a high-level QC analysis with additional information from the intra-sample m/z matching process. At the replicates level, the user can check basic information about each individual spectrum (e.g. peak count, m/z range and intensity range) and compare all spectra in the dataset. At the samples level, the user can check the performance of the intra-sample peak matching process by comparing the percentage of presence (POP) counts (globally and by condition) and the POPs of each sample.
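
As an illustration of the replicate-level checks, the sketch below computes the basic per-spectrum information mentioned above (peak count, m/z range and intensity range). It assumes a peak list is represented as a list of (m/z, intensity) tuples; the function name summarize_spectrum is hypothetical and does not reproduce Mass-Up's own QC code.

```python
def summarize_spectrum(peaks):
    """Return peak count, m/z range and intensity range for one peak list."""
    if not peaks:
        return {"peak_count": 0, "mz_range": None, "intensity_range": None}
    mzs = [mz for mz, _ in peaks]
    intensities = [intensity for _, intensity in peaks]
    return {
        "peak_count": len(peaks),
        "mz_range": (min(mzs), max(mzs)),
        "intensity_range": (min(intensities), max(intensities)),
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from the Cancer dataset.
    replicate = [(1000.0, 5.0), (1500.0, 20.0), (1200.0, 1.0)]
    print(summarize_spectrum(replicate))
```

Comparing these summaries across replicates is one simple way to spot a spectrum with unusually few peaks or an unexpected m/z range.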

To apply QC to the loaded peak lists, press the Quality Control button on the toolbar. A dialog box will appear:

Quality Control Dialog

You have to provide the following information:

  • Labeled/Unlabeled peak lists: in the Labeled tab you can add the loaded Labeled Peak Lists by clicking the Add Elements button.
  • Intra-sample peak matching: the intra-sample peak matching parameters. For this quickstart, you can use the Forward algorithm with the default parameters.

Click the OK button to perform the quality control. After this operation, the result appears in the Mass-Up data section and the Quality Control View opens automatically:

QC View

For further help on the Quality Control operation, please refer to the help.

3. Match peaks

The Match Peaks operation makes a set of spectra comparable and suitable for further analysis. The analysis operations take Matched Peak Lists as input.

For further help on how the peak matching process works, please refer to the help.

To apply the Peak Matching to the loaded peak lists, press the Peak Matching button on the toolbar. A dialog box will appear:

Match Peaks Dialog

You have to provide the following information:

  • Labeled/Unlabeled peak lists: in the Labeled tab you can add the loaded Labeled Peak Lists by clicking the Add Elements button.
  • Intra-sample peak matching: the intra-sample peak matching parameters. For this quickstart, you can use the Forward algorithm with the default parameters.
    • Generate consensus spectrum: choose whether you want to generate a consensus spectrum. For this quickstart, you can set it to true and use the default POP parameter.
  • Inter-sample peak matching: the inter-sample peak matching parameters. For this quickstart, you can use the Forward algorithm with the default parameters.
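
To make the ideas of peak matching and POP-based consensus spectra more concrete, here is a minimal sketch. It is not Mass-Up's Forward algorithm (refer to the help for that); it is a simple greedy grouping under an assumed absolute m/z tolerance, and the names match_peaks, consensus_spectrum, tolerance and min_pop are hypothetical.

```python
def match_peaks(replicates, tolerance):
    """Group m/z values from several peak lists that lie within `tolerance`.

    `replicates` is a list of lists of m/z values. Returns a list of
    (mean m/z, set of replicate indices that contain the peak).
    """
    pooled = sorted((mz, idx) for idx, mzs in enumerate(replicates) for mz in mzs)
    groups = []
    for mz, idx in pooled:
        # Open a new group when the peak is too far from the group's first m/z.
        if not groups or mz - groups[-1]["start"] > tolerance:
            groups.append({"start": mz, "mzs": [], "members": set()})
        groups[-1]["mzs"].append(mz)
        groups[-1]["members"].add(idx)
    return [(sum(g["mzs"]) / len(g["mzs"]), g["members"]) for g in groups]


def consensus_spectrum(replicates, tolerance, min_pop):
    """Keep matched peaks present in at least `min_pop` (0 to 1) of the replicates."""
    matched = match_peaks(replicates, tolerance)
    return [mz for mz, members in matched if len(members) / len(replicates) >= min_pop]


if __name__ == "__main__":
    # Illustrative m/z values for three replicates of one sample.
    replicates = [[1000.0, 2000.0], [1000.2, 2000.1, 3000.0], [1000.1]]
    print(consensus_spectrum(replicates, tolerance=0.5, min_pop=0.6))
```

In this sketch a peak seen in only one of three replicates has a POP of about 33% and is dropped from the consensus when min_pop is 0.6, which is the kind of filtering the POP parameter controls.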

Click the OK button to perform the peak matching. After this operation, the Matched Peak Lists appear in the Mass-Up data section. Note that, since you have matched several Labeled Peak Lists, the new Matched Peak Lists appear under a Labeled Matched Peak List Set.

Matched Peak Lists

This set, called Labeled Matched Peak List Set 1, will be used as input for the analysis operations.

4. Analyze data

4.1 Biomarker discovery

When identifying new biomarkers, it is necessary to distinguish between two types of datasets: (i) those where there is a known and well-defined number of conditions (e.g. healthy vs. diseased, or different stages of a disease), and (ii) those where there are no conditions or where they are not clearly defined. In accordance with this distinction, Mass-Up provides two types of biomarker discovery analysis: (i) the inter-label analysis, for the former type of data, and (ii) the intra-label analysis, for the latter type of data.

Since we are working with the first type of dataset, we will apply the inter-label analysis, which allows the user to identify those peaks that are potential biomarkers for differentiating the conditions by performing the appropriate statistical tests.

To apply the Inter-Label Biomarker Discovery analysis to the matched peak lists, press the Biomarker Discovery button on the toolbar. A dialog box will appear:

Biomarker Discovery Dialog

In this dialog, you have to select the Labeled Matched Peak List Set and the individual Matched Peak Lists that you want to use. Select the Labeled Matched Peak List Set 1 and check all the Matched Peak Lists, as the image shows.

Click the OK button to perform the analysis. After this operation, the result appears in the Mass-Up data section. Clicking on it opens the Biomarker Discovery View.

BD View

The first three columns contain the m/z value, the p-value and the q-value, respectively, while the other columns show in which samples the m/z values are present. By default, peaks are sorted by p-value. You can sort them by q-value by clicking on the column header. As can be seen, the peaks with a q-value < 0.05 are candidate biomarkers, as they differentiate certain conditions from others.
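
To illustrate how q-values relate to p-values, the sketch below applies the Benjamini-Hochberg procedure, a common way of deriving q-values from a list of p-values. This tutorial does not state which correction Mass-Up uses, so treat the choice of method as an assumption; the function name and the example p-values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    # Illustrative p-values, one per matched m/z value.
    p_values = [0.01, 0.04, 0.03, 0.5]
    q_values = benjamini_hochberg(p_values)
    candidates = [i for i, q in enumerate(q_values) if q < 0.05]
    print(q_values, candidates)
```

Because the correction accounts for the number of peaks tested, a peak with p < 0.05 does not necessarily keep q < 0.05, which is why sorting and filtering by q-value is the more conservative choice.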

4.2 Principal Component Analysis

Principal Component Analysis (PCA) allows the user to visually identify whether there is a separation between the conditions (labels) present in the dataset.

To apply PCA to the matched peak lists, press the PCA button on the toolbar. A dialog box will appear:

PCA Dialog Type

Select the LABELED Experiment Type. Then, another dialog box will appear:

PCA Dialog

You have to provide the following information:

  • Data: the Labeled Matched Peak List Set that you want to use.
  • Max. components: the maximum number of principal components to retain.
  • Variance covered: the amount of variance to account for when retaining principal components.
  • Normalize: whether input data will be normalized.
  • Discretize: whether input data will be discretized into 0 and 1 vectors before applying PCA.

For this quickstart, use the default settings and choose the previous Labeled Matched Peak List Set.
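
The sketch below shows one plausible way the Discretize, Max. components and Variance covered settings could behave. It is an assumption-laden illustration, not Mass-Up's implementation: it treats any intensity above zero as presence, and it retains leading components until the requested share of variance is reached or the maximum count is hit. The function names and the eigenvalues are hypothetical.

```python
def discretize(matrix):
    """Turn an intensity matrix into 0/1 presence vectors."""
    return [[1 if value > 0 else 0 for value in row] for row in matrix]


def components_to_retain(eigenvalues, variance_covered, max_components=None):
    """Count the leading components needed to reach `variance_covered` (0 to 1).

    `max_components=None` means no upper limit on the number of components.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if not ordered or total <= 0:
        return 0
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= variance_covered or count == max_components:
            return count
    return len(ordered)


if __name__ == "__main__":
    # Illustrative eigenvalues of a covariance matrix.
    eigenvalues = [4.0, 3.0, 2.0, 1.0]
    print(components_to_retain(eigenvalues, variance_covered=0.5))
    print(components_to_retain(eigenvalues, variance_covered=0.95, max_components=2))
```

The two settings act as competing stopping rules: whichever is satisfied first determines how many principal components are kept.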

Click the OK button to perform the analysis. After this operation, the result appears in the Mass-Up data section. Clicking on it opens the PCA View.

PCA View

As you can see, the spectra of the Cancer dataset can be grouped by their corresponding conditions using PCA.

4.3 Clustering

The clustering analysis finds groups of similar spectra among all the samples being studied. In the case of labeled data, it allows you to check whether the different conditions present in the input data are separable by means of the m/z values of each sample.

To apply the Clustering analysis to the matched peak lists, press the Clustering button on the toolbar. A dialog box will appear:

Clustering Dialog

You have to provide the following information:

  • Minimum Variance: peaks with a variance lower than or equal to this value are removed.
  • Peak List: if provided, only these peaks will be analyzed.
  • Peak Mass Tolerance Type: type of peak mass tolerance used: absolute, relative or ppm.
  • Peak Mass Tolerance: acceptable difference between two measurements of the same mass.
  • DistanceFunctionType Reference: which value to use when comparing two clusters.
  • DistanceFunctionType Function: function used to measure the distance between two clusters.
  • Conversion Values: presence, percentage of presence or intensity.
  • Deep Clustering: check if you want to perform a spectrum-based clustering instead of sample-based.
  • Directory: directory in which the cluster files will be stored.

For this quickstart select the previous Labeled Matched Peak List Set, use the default settings and choose a directory to store the clustering results.
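
The sketch below illustrates two of the settings above: the three peak mass tolerance types and the minimum variance filter. It rests on assumptions that this tutorial does not confirm: the relative tolerance is taken as a fraction of the larger mass, ppm is computed against the larger mass, and the population variance is used. The function names are hypothetical.

```python
from statistics import pvariance


def within_tolerance(mass_a, mass_b, tolerance, tolerance_type="ppm"):
    """Check whether two masses match under the given tolerance type."""
    difference = abs(mass_a - mass_b)
    if tolerance_type == "absolute":
        return difference <= tolerance
    # Assumption: relative and ppm tolerances are scaled by the larger mass.
    reference = max(mass_a, mass_b)
    if tolerance_type == "relative":
        return difference <= tolerance * reference
    if tolerance_type == "ppm":
        return difference <= tolerance * reference / 1_000_000
    raise ValueError(f"unknown tolerance type: {tolerance_type}")


def drop_low_variance(columns, minimum_variance):
    """Remove peaks whose values have a variance <= `minimum_variance`.

    `columns` maps an m/z value to its list of values across samples.
    """
    return {
        mz: values
        for mz, values in columns.items()
        if pvariance(values) > minimum_variance
    }


if __name__ == "__main__":
    print(within_tolerance(1000.0, 1000.3, 0.5, "absolute"))
    print(within_tolerance(1000.0, 1000.3, 200, "ppm"))
    print(drop_low_variance({1000.0: [1, 1, 1], 2000.0: [0, 1, 1]}, 0))
```

A peak that is present in every sample has zero variance and cannot help separate the clusters, which is the motivation for filtering such peaks out before clustering.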

Click the OK button to perform the analysis. After this operation, the result appears in the Mass-Up data section. Clicking on it opens the Clustering View.

Clustering View

As you can see, the spectra of the Cancer dataset can be grouped by their corresponding conditions using Clustering.

4.4 Classification

Through the "Classification Analysis" operation, the user can evaluate which classifier performs best for the data under analysis. This operation provides an interface adapted from the Weka software that allows the user to select and configure a classifier, and to evaluate its performance by means of a cross-validation scheme. The output log of the evaluation process summarizes the performance of the classifier using different statistical measurements, such as accuracy, kappa, precision and recall.

To create a Classification analysis using the matched peak lists, press the Classification button on the toolbar. A dialog box will appear:

Classification Dialog

Give a name to your experiment, choose the Labeled Matched Peak List Set 1 and click the OK button to perform the analysis. After this operation, the result appears in the Mass-Up data section. Clicking on it opens the Classification View.

Classification View

In the Classification view you have to:

  1. Choose a classifier: for this quickstart, you can choose the IBk classifier (lazy > IBk).
  2. Choose a validation scheme: for this quickstart, you can use the default validation scheme, that is, a 10-fold cross-validation.
  3. Click the start button to run the Classification Analysis.

When it finishes, you can check the output log of the evaluation process. In this case, you can see in the "Confusion Matrix" that all the samples have been correctly classified, obtaining an accuracy of 100% and a kappa statistic of 1.
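
To show how the accuracy and the kappa statistic follow from a confusion matrix, here is a minimal sketch using the standard definition of Cohen's kappa. The matrices in the example are illustrative; in particular, the per-class counts of the perfect matrix are an assumption and are not taken from the Weka output log.

```python
def accuracy_and_kappa(confusion):
    """Compute accuracy and Cohen's kappa from a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Chance agreement: sum over classes of (row total * column total).
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(size)
    ) / (total * total)
    # Kappa is undefined when chance agreement is already perfect.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else float("nan")
    return observed, kappa


if __name__ == "__main__":
    # Hypothetical perfect classification of three classes.
    print(accuracy_and_kappa([[4, 0, 0], [0, 4, 0], [0, 0, 4]]))
    # Hypothetical imperfect two-class result for comparison.
    print(accuracy_and_kappa([[2, 1], [1, 2]]))
```

Any confusion matrix with non-zero entries only on the diagonal gives an accuracy of 1 and a kappa of 1, which matches the result reported for this dataset.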