BioClass Workflow

BioClass organizes the supervised classification workflow into a few independent steps:

  • Load Corpus: BioClass provides a corpus management module. This module allows the user to load a set of preprocessed documents and create different matrix instances to store the data (e.g. Train and Test sets for supervised learning).
  • Filter: Filtering algorithms like PCA or InfoGain can be applied to the corpus in order to further reduce its dimensionality or transform its representation. This optional step can reduce the execution time and increase the performance of certain classifiers.
  • Create Classifiers: The classification process starts by creating an initial document classifier. BioClass provides the user with various classification algorithms like k-NN, Naive Bayes, SVM and the text-oriented classifiers T-HMM and TCBR-HMM proposed by the authors.
  • Train & Test: The classifiers are applied in the Train and Test steps. The user can train a defined classifier on a training set of documents and then use it to classify new documents contained in a different test set.
  • Learn: This step applies to models like TCBR-HMM that can learn from new sets of documents after they are trained. It takes as input a new dataset in the same format used by the Train or Test sets. The resulting readjusted classifier is saved and can be used in a new learn cycle or to classify a different Test set.
  • Store Results: The results of the classification process are stored and can be shown along with evaluation measures and plots. This helps the user to generate some feedback based on the classification errors in order to compare different classifiers or readjust their parameters.
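
The steps above can be illustrated with a minimal, self-contained Python sketch. It is not the BioClass API: every name in it (NaiveBayes, select_terms, apply_filter, the toy corpus) is hypothetical and chosen only for illustration. A document-frequency threshold stands in for filters such as InfoGain or PCA, a small multinomial Naive Bayes stands in for the available classifiers, and the Learn step is approximated by accumulating counts from a new batch rather than by the TCBR-HMM procedure.

```python
import json
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes over term-count dicts (illustrative only)."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.term_counts = defaultdict(Counter)  # term totals per class
        self.vocab = set()

    def train(self, docs, labels):
        # Counts are accumulated, so a second call refines the same model.
        for doc, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.term_counts[label].update(doc)
            self.vocab.update(doc)

    learn = train  # assumption: "Learn" is modeled as another counting pass

    def classify(self, doc):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            score = math.log(n_docs / total_docs)
            for term, freq in doc.items():
                score += freq * math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def select_terms(docs, min_df):
    """Filter stand-in: keep terms that occur in at least min_df documents."""
    df = Counter(term for doc in docs for term in doc)
    return {term for term, n in df.items() if n >= min_df}


def apply_filter(docs, kept):
    return [{t: c for t, c in doc.items() if t in kept} for doc in docs]


# Load Corpus: toy, already preprocessed documents as term-count dicts.
train_docs = [
    {"gene": 2, "protein": 1, "assay": 1},
    {"protein": 2, "gene": 1, "cohort": 1},
    {"market": 2, "price": 1, "quarter": 1},
    {"price": 2, "market": 1, "merger": 1},
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
test_docs = [{"gene": 1, "tumor": 1}, {"price": 1, "stock": 1}]
test_labels = ["relevant", "irrelevant"]

# Filter: reduce dimensionality using statistics from the Train set only.
kept = select_terms(train_docs, min_df=2)
train_set = apply_filter(train_docs, kept)
test_set = apply_filter(test_docs, kept)

# Create Classifiers, then Train & Test.
clf = NaiveBayes()
clf.train(train_set, train_labels)
predictions = [clf.classify(doc) for doc in test_set]

# Store Results: keep predictions, an evaluation measure and the errors.
results = {
    "predictions": predictions,
    "accuracy": sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels),
    "errors": [i for i, (p, t) in enumerate(zip(predictions, test_labels)) if p != t],
}
report = json.dumps(results, indent=2)

# Learn: readjust the trained model with a new batch in the same format.
new_docs = apply_filter([{"gene": 1, "protein": 1, "tumor": 2}], kept)
clf.learn(new_docs, ["relevant"])
```

Keeping each step as a separate function or method mirrors the independence of the steps: the filter can be skipped or swapped without touching the classifier, and the readjusted classifier can be reused for another learn cycle or a different Test set.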