Download and Run

  1. Download the latest version of BioClass: Download Now.
  2. Extract the content of the .zip file into a folder.
  3. In order to start the BioClass tool:
    • Run the "run.sh" file in the "BioClass" folder for Linux users
    • Run the "run.bat" file in the "BioClass" folder for Windows users
    • Requirement: Java 6 or later must be installed

Sample data

The sample corpus available for download is taken from the OHSUMED collection, a subset of the MEDLINE database, which is a bibliographic database of important medical literature maintained by the National Library of Medicine. OHSUMED contains 348,566 references, with fields such as titles, abstracts, and MeSH descriptors, drawn from 279 medical journals published between 1987 and 1991. The complete set of medical abstracts for the year 1991 is taken as the initial sample corpus.

Each document of the initial corpus has one or more associated disease categories. To adapt them to the BioClass scheme, which distinguishes relevant documents from non-relevant ones, we select one of these categories as relevant and consider the others as non-relevant. For this example, the Cardio (C14) category is selected as the relevant category, since it is the most frequent category in the OHSUMED corpus.
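
The relabeling described above can be illustrated with a minimal Python sketch. It is not BioClass code: the in-memory document layout, the name `to_binary_label`, and the sample documents are hypothetical. It also assumes that a document counts as relevant whenever C14 appears among its categories, which the text does not state explicitly for multi-category documents.

```python
# Hypothetical representation: each document carries an id and a set of
# OHSUMED disease category codes. This is NOT the BioClass file format.
RELEVANT_CATEGORY = "C14"

def to_binary_label(categories, relevant=RELEVANT_CATEGORY):
    """Map a set of category codes to 'relevant' or 'non-relevant'.

    Assumption: a document is relevant if the chosen category is among
    its categories, even when it has other categories as well.
    """
    return "relevant" if relevant in categories else "non-relevant"

# Made-up sample documents, for illustration only.
docs = [
    {"id": "d1", "categories": {"C14", "C23"}},
    {"id": "d2", "categories": {"C04"}},
    {"id": "d3", "categories": {"C14"}},
]

labels = {doc["id"]: to_binary_label(doc["categories"]) for doc in docs}
print(labels)
```

Choosing a different relevant category only requires passing another code as the `relevant` argument.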

In addition, in order to reduce the input feature size used to train the classifier, standard text pre-processing techniques are applied. A predefined list of stopwords (common English words) is removed from the text and a stemmer based on the Lovins stemmer is applied. Then, words occurring in fewer than 10 documents of the entire training corpus are also removed.
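
A rough Python sketch of this pipeline is given below, under several assumptions. The stopword list is a tiny stand-in (the actual list is not given here), `stem` is a placeholder that returns the word unchanged rather than a Lovins implementation, and the function names and toy corpus are hypothetical. Only the document-frequency filter follows the description directly, with the threshold of 10 as its default.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the list actually used is not given.
STOPWORDS = {"the", "of", "and", "in", "a", "is", "to", "with"}

def stem(word):
    # Placeholder only: the described pipeline uses a Lovins-based stemmer,
    # which is not reproduced here. Plug in a real stemmer as needed.
    return word

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]

def filter_rare_terms(tokenized_docs, min_df=10):
    """Remove terms that occur in fewer than min_df documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    keep = {term for term, n in doc_freq.items() if n >= min_df}
    return [[t for t in tokens if t in keep] for tokens in tokenized_docs]

# Toy corpus with a toy threshold of 2 so that the effect is visible.
corpus = [
    "The heart rate of the patient",
    "Heart failure and hypertension",
    "Treatment of renal failure",
]
tokenized = [tokenize(doc) for doc in corpus]
filtered = filter_rare_terms(tokenized, min_df=2)
print(filtered)
```

In this toy run only "heart" and "failure" survive, since every other term appears in a single document.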

Finally, in order to evaluate the reasoning models, the corpus is randomly divided into three different subsets:

  • Train Corpus (50% of the original corpus), to train the classifiers.
  • Evaluation Corpus (10% of the original corpus), which classifiers such as TCBR-HMM use as the corpus to learn from.
  • Test Corpus (40% of the original corpus), to test the classifiers.
Download Sample C14 OHSUMED Corpus: Train | Evaluation | Test.
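
The random 50% / 10% / 40% division described above can be sketched in Python as follows. This is an illustration only, not the procedure used to build the downloadable files: the function name `split_corpus`, the fixed seed, and the use of integer document ids are assumptions.

```python
import random

def split_corpus(docs, seed=0):
    """Shuffle and split into train (50%), evaluation (10%) and test (40%).

    The seed is an assumption added for reproducibility; the test subset
    receives whatever remains after the first two subsets are cut.
    """
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n // 2
    n_eval = n // 10
    train = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]
    return train, evaluation, test

# 100 hypothetical document ids.
train, evaluation, test = split_corpus(range(100))
print(len(train), len(evaluation), len(test))  # prints "50 10 40"
```

The three subsets are disjoint and together cover the whole input, so no document is used both for training and for testing.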

User manual

Currently, the complete version of the BioClass user manual is available only in Spanish. It can be downloaded from the following link: