Tutorial content

  1. Overview of the interface and basic operations
    1. Load a corpus
    2. Result viewer
  2. Classification example using TCBR-HMM
    1. Create a TCBR-HMM classifier
    2. Train and Learn with the classifier
    3. Test and results review

1   Overview of the interface and basic operations

As can be seen in the figure, the BioClass interface consists of:

  • Menu Bar, which provides access to the operations implemented in the BioClass tool.
  • Clipboard panel, which contains the elements generated through operations, such as the data that the algorithms work with.
  • Log panel, which keeps the user informed of what is happening in the tool through messages about the operations being executed.
  • Viewer panel, where each classifier and data object can be explored and modified.

BioClass operations are grouped into three categories: Corpus, Classification and Filtering. The operations available in the Corpus menu allow the user to load datasets to be processed by the application. Through Classification, reasoning models can be created, trained or tested in different ways. Finally, the Filtering menu contains the algorithms that transform the datasets based on their dimensions.

Two basic operations are described in the next sections: how to load a corpus and how to view the results of a classification process.

1.1   Load Corpus

 

The corpus represents a set of documents to be processed by the application. The supported formats for these document matrices are CSV and ARFF, the latter being the native format of the Weka API. In the download section, multiple examples of text corpora are included.
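As an illustration of the ARFF layout, the following minimal sketch builds a tiny dense document-term matrix and prints it as ARFF text. The term names, document identifiers, weights and class labels are hypothetical, and the exact attribute layout that BioClass expects is not specified here, so the example corpora in the download section remain the reference.

```python
def to_arff(relation, terms, rows, classes):
    """Build a dense ARFF document for a tiny document-term matrix."""
    lines = [f"@relation {relation}", ""]
    # Hypothetical identifier attribute, named PMID as in the load window.
    lines.append("@attribute PMID string")
    for term in terms:
        lines.append(f"@attribute {term} numeric")
    # Nominal class attribute listing the allowed categories.
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for doc_id, weights, label in rows:
        values = [f"'{doc_id}'"] + [str(w) for w in weights] + [label]
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


# Toy data, invented for illustration only.
terms = ["gene", "protein", "therapy"]
rows = [
    ("doc1", [0.5, 0.0, 1.2], "relevant"),
    ("doc2", [0.0, 0.8, 0.0], "nonrelevant"),
]
arff_text = to_arff("toy_corpus", terms, rows, ["relevant", "nonrelevant"])
print(arff_text)
```

Each line after `@data` holds one document: its identifier, one weight per term, and its class.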

This operation is carried out by the Load corpus from file menu option, located in the Corpus category:

As can be seen, the load window has the following setting options:

  • Matrix name: Name by which the corpus will be displayed in the Clipboard and operations.
  • Sparse matrix file: Path of the dataset to be processed.
  • PMID (Document Identifier Attribute): In some cases, it is necessary to identify each document through a key attribute in the corpus. This field contains the name of that identifier attribute.
  • Has document identifier: If selected, the load process takes the content of the PMID parameter into account and uses it as a key.

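The interplay between the last two options can be sketched as follows. This is an illustrative reading of the behaviour described above, not BioClass code: the function `load_corpus`, its parameters and the sample data are hypothetical, and falling back to the row position when no identifier is used is an assumption.

```python
import csv
import io


def load_corpus(csv_text, id_attribute="PMID", has_identifier=True):
    """Read a CSV corpus into a dict keyed by document identifier."""
    reader = csv.DictReader(io.StringIO(csv_text))
    corpus = {}
    for index, row in enumerate(reader):
        if has_identifier:
            # Use the named attribute as the key and remove it from the features.
            key = row.pop(id_attribute)
        else:
            # Assumption: without an identifier, documents are keyed by position.
            key = str(index)
        corpus[key] = row
    return corpus


# Toy CSV corpus, invented for illustration only.
csv_text = (
    "PMID,gene,protein,class\n"
    "doc1,0.5,0.0,relevant\n"
    "doc2,0.0,0.8,nonrelevant\n"
)
keyed = load_corpus(csv_text, "PMID", has_identifier=True)
unkeyed = load_corpus(csv_text, "PMID", has_identifier=False)
print(sorted(keyed))
print(sorted(unkeyed))
```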
Once the "OK" button is pressed, the corpus is loaded from the selected path and then stored in the Clipboard, where it can be used later in a classification process.

1.2   Result Viewer

 

After a test corpus has been classified with a trained model, a matrix of results is produced in which each document is assigned to a category. To verify that the classification process has been successful, BioClass offers a result viewer in which the actual categories are compared with the predicted ones. To access this view, simply select an element of type "result" in the Clipboard panel.
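The comparison made in the result viewer can be pictured with a short sketch. It assumes two hypothetical mappings from document identifier to category, one with the actual categories and one with the predicted ones. The data is invented and this is not how BioClass stores its results.

```python
def compare_categories(actual, predicted):
    """Pair the actual and predicted category of every document."""
    rows = []
    for doc_id in sorted(actual):
        real = actual[doc_id]
        guess = predicted[doc_id]
        rows.append((doc_id, real, guess, real == guess))
    return rows


# Toy categories, invented for illustration only.
actual = {"doc1": "relevant", "doc2": "nonrelevant", "doc3": "relevant"}
predicted = {"doc1": "relevant", "doc2": "relevant", "doc3": "relevant"}

rows = compare_categories(actual, predicted)
for doc_id, real, guess, ok in rows:
    print(f"{doc_id}\t{real}\t{guess}\t{'ok' if ok else 'MISMATCH'}")

correct = sum(1 for row in rows if row[3])
print(f"{correct} of {len(rows)} documents classified correctly")
```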

2   Classification example using TCBR-HMM

 

2.1   Create a TCBR-HMM classifier

 

As the workflow indicates, the first step after the document corpora are loaded is the creation of the reasoning models. In this case, the selected classifier is the TCBR-HMM classifier proposed by our research group.

The operation for creating a TCBR-HMM classifier is located in the Classification/CreateModel section:

Once selected, an interface for the creation of the classifier is shown:

The parameterization options for the TCBR-HMM are:

  • Classifier name: Name by which the classifier will be displayed in the Clipboard and operations.
  • Number of states: Number of states for the TCBR-HMM classifier. It must be an integer value greater than 0. The states represent ranking positions for words. The greater the number of states, the more closely the model fits the training corpus. A high number of states therefore also reduces the generalization capacity of the model.
  • Generalization factor: This factor (f) controls the importance of word rankings, as opposed to a more general approach in which only the occurrence of the words is taken into account. A higher value of f means that the rankings and the order of appearance of words have more importance, which reduces the generalization capacity of the classifier and increases the fit of the model to the training corpus.

After creating the TCBR-HMM classifier, it will be stored in the list of classification models of the Clipboard.

2.2   Train and Learn with the classifier

 

Once the TCBR-HMM classifier is created and a training corpus is loaded, the classifier can be trained so that it can subsequently classify new instances (documents) based on the information extracted during the training process.

To train the classifier, start the operation with the "Train Model" option in the Classification menu, where the classifier and the training corpus must be specified. Alternatively, right-click the model in the Clipboard:

Once the model is trained, it can be updated with new documents. At this stage, the "learn" process with TCBR-HMM aims to further improve classification by adding new instances from a new corpus. To do this, the "Learn / UpdateModel" operation must be executed:

The parameters to complete the learning process are:

  • Sparse matrix: Corpus with which the model will learn and improve its future classification. The corpus must be compatible with the training corpus used to train the model (that is, a corpus with the same type of documents). Moreover, since this is supervised learning, the corpus must specify the actual class of the documents it contains so that the model can learn from them.
  • Trained model: Trained Model/classifier selected for learning.
  • Learning weight (L): Learning factor that determines the weight given to the new instances compared to those previously used in the training process. The higher the learning factor, the more closely the model will adjust to the new documents.
  • Learn from errors: Boolean indicating whether the algorithm should take into account only the documents that cause a misclassification. Initially, the learning process classifies the new corpus and determines which documents are classified correctly.
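The effect of the "Learn from errors" option can be sketched as a filter applied before learning. This is only an illustration of the selection step described above, not the TCBR-HMM update itself: the function names, the toy keyword classifier and the sample documents are all hypothetical.

```python
def select_learning_documents(documents, classify, learn_from_errors):
    """Choose which labelled documents are passed on to the learning step."""
    if not learn_from_errors:
        # Every document of the new corpus is used.
        return list(documents)
    # Only the documents that the current model misclassifies are kept.
    return [(text, label) for text, label in documents if classify(text) != label]


def toy_classify(text):
    """Stand-in for a trained model (hypothetical keyword rule)."""
    return "relevant" if "gene" in text else "nonrelevant"


# Toy labelled corpus, invented for illustration only.
new_corpus = [
    ("gene expression study", "relevant"),
    ("clinical therapy trial", "relevant"),
    ("weather report", "nonrelevant"),
]

errors_only = select_learning_documents(new_corpus, toy_classify, True)
everything = select_learning_documents(new_corpus, toy_classify, False)
print(errors_only)
print(len(everything))
```

With the option enabled, only the second document is kept, because it is the only one the toy classifier gets wrong.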

2.3   Test and results review

 

Once the classifier model is trained and possibly updated with another corpus, it can be used to classify the documents contained in a new corpus whose classes are unknown. This process can be initiated with the "Test model" option in the Classification menu, where the classifier and the test corpus must be specified.

The testing process adds an object of type "model tested" to the object tree. If it is selected, the results of the classification are displayed in the viewer panel. The precision values achieved are shown at the bottom. The most representative of them is the "F-measure", which combines precision and recall into a single value. The higher this value is in both categories (Class) of documents, the more effective the classifier is.
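For reference, the F-measure of a class is the harmonic mean of its precision and recall: F = 2 · P · R / (P + R). The following minimal sketch computes the three values per class from lists of actual and predicted categories. The labels and data are invented, and the code uses the standard definition rather than anything taken from BioClass.

```python
def per_class_scores(actual, predicted):
    """Return {label: (precision, recall, f_measure)} for every class."""
    pairs = list(zip(actual, predicted))
    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        scores[label] = (precision, recall, f_measure)
    return scores


# Toy results, invented for illustration only.
actual = ["relevant", "relevant", "relevant", "nonrelevant", "nonrelevant"]
predicted = ["relevant", "relevant", "nonrelevant", "nonrelevant", "relevant"]

scores = per_class_scores(actual, predicted)
for label, (precision, recall, f_measure) in scores.items():
    print(f"{label}: P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```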