Manual

Working with BDBM

Viewing and searching files

All text files (such as FASTA files) can be viewed by double clicking on the selected file. The Line wrap checkbox wraps long lines, which is useful when dealing with files that contain long text lines. The font type and size can be changed in the Change Font pop-up menu. It is also possible to search using regular expressions: the Next and Previous buttons move to the next and previous hit, respectively, and the Clear button clears the search. Since genome and transcriptome files are usually very long, only a fraction of the document is shown in the BDBM window. Different portions of the file can be viewed using the Set Position button.

Deleting and exporting files

FASTA files, BLAST-formatted databases and output files can be deleted or exported by double clicking on the item with the right mouse button. FASTA and output files (but not BLAST-formatted databases) can be renamed when exported.

The File menu

The File menu gives the user the possibility to configure BDBM (the Configuration option) and exit the program.

The Configuration option

The Configuration option under File can be used to specify the location of the BLAST, EMBOSS, COMPART, SPLIGN, PROCOMPART, PROSPLIGN and BEDTOOLS binaries, as well as the location of the Repository.

Important note

Please note that the Configuration option is not available when BDBM is installed using the Docker-based installers. With those installers, the external tools are installed and BDBM is configured during the automated installation process, so no manual configuration is needed.

The Operations menu

The Operations menu lets the user perform ten different operations (Import Fasta, Make BLAST Database, BLAST DB alias, Retrieve Search Entry, Get ORF (EMBOSS), Splign-Compart (NCBI), ProSplign-Compart (NCBI), Reformat Fasta, Merge Fastas, and Refine annotation) on FASTA-formatted files and BLAST databases.

Import FASTA

Files can be imported using the Import Fasta option under Operations. Depending on whether the FASTA files contain nucleotide or protein sequences, this operation copies the selected files to either the /fasta/nucleotides or the /fasta/proteins folder located in the specified repository folder. Moving the FASTA files directly into these subfolders has the same effect as using the Import Fasta option. When using the Import Fasta option, the user must indicate whether the file contains nucleotide or protein sequences and the location of the file. Remember that FASTA and output text files (but not BLAST-formatted files) can be opened by double clicking on them, and can then be searched using regular expressions. Therefore, if unsure about the nature of a given FASTA file, it is advisable to inspect it this way before importing it.
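The effect of the Import Fasta option can be sketched in Python as a plain file copy into the repository. This is only an illustration under stated assumptions: the function names import_fasta and looks_like_nucleotides are hypothetical, and the 90% composition heuristic is an assumption of this sketch, not something BDBM is documented to do (BDBM asks the user for the sequence type).

```python
import shutil
from pathlib import Path

NUCLEOTIDE_LETTERS = set("ACGTUNacgtun")


def looks_like_nucleotides(fasta_path, sample_size=1000):
    """Heuristic (assumption): True if at least 90% of the sampled residues are A, C, G, T, U or N."""
    residues = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip header lines
            residues.extend(line.strip())
            if len(residues) >= sample_size:
                break
    if not residues:
        raise ValueError("no sequence data found")
    hits = sum(1 for residue in residues if residue in NUCLEOTIDE_LETTERS)
    return hits / len(residues) >= 0.9


def import_fasta(fasta_path, repository, is_nucleotide):
    """Copy a FASTA file into <repository>/fasta/nucleotides or <repository>/fasta/proteins."""
    subfolder = "nucleotides" if is_nucleotide else "proteins"
    target_dir = Path(repository) / "fasta" / subfolder
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(fasta_path, target_dir))
```

A call such as import_fasta("genes.fasta", "/path/to/repository", is_nucleotide=True) would place a copy of the file in the nucleotides subfolder, which is equivalent to moving it there by hand.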

Make Blast database

In order to be able to perform BLAST searches, FASTA files must be formatted. This can be achieved by using either the Make Blast database option under Operations or by double clicking the right mouse button on top of the selected FASTA file. The user must specify the database type (DB Type), the FASTA file to be used (Input fasta) and the name of the database to be created (Output name).
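For readers who want to reproduce this step outside the graphical interface, the three parameters map naturally onto the makeblastdb program of the BLAST+ suite. The sketch below only builds the command line; the helper name makeblastdb_command is hypothetical, and whether BDBM passes additional flags is not stated in this manual.

```python
def makeblastdb_command(input_fasta, db_type, output_name):
    """Build a makeblastdb command line from the three parameters requested by BDBM.

    db_type must be "nucl" (nucleotide) or "prot" (protein), as expected by makeblastdb.
    """
    if db_type not in ("nucl", "prot"):
        raise ValueError("db_type must be 'nucl' or 'prot'")
    return [
        "makeblastdb",
        "-in", input_fasta,     # Input fasta
        "-dbtype", db_type,     # DB Type
        "-out", output_name,    # Output name
    ]
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where the BLAST+ binaries are installed.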

BLAST DB Alias

Multiple BLAST-formatted databases can be treated as one by using the BLAST DB Alias option under Operations or by double clicking the right mouse button on top of a selected database. The user is asked to select the BLAST-formatted databases to be included using the Add button, and then to choose the name of the alias (Output name).
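Outside BDBM, an alias of this kind can be created with the blastdb_aliastool program of the BLAST+ suite. The following sketch only assembles the command line; the helper name alias_command is hypothetical, and the exact call made by BDBM is an assumption.

```python
def alias_command(databases, db_type, output_name, title=None):
    """Build a blastdb_aliastool command that groups several databases under one alias."""
    if not databases:
        raise ValueError("at least one database is required")
    command = [
        "blastdb_aliastool",
        "-dblist", " ".join(databases),  # space-separated list of database names
        "-dbtype", db_type,              # "nucl" or "prot"
        "-out", output_name,             # name of the alias (Output name)
    ]
    if title:
        command += ["-title", title]
    return command
```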

Retrieve Search Entry

The Retrieve Search Entry option allows the user to select a given entry from a given database. The full name of the entry must be given, unless a GI (GenInfo Identifier) number is specified. Therefore, short, sensible names are advised for the sequences contained in the FASTA files that are used to create BLAST-formatted databases. If the user does not remember the full name but still has the FASTA file used to create the database, the simplest way to recover it is to double click on the appropriate FASTA file and search for it using the regular expression search (see Viewing and searching files above).
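The same lookup can be sketched outside the viewer. The function below lists every FASTA header that matches a regular expression; the name find_entry_names is hypothetical, and returning the whole header line (rather than only its first word) is a choice of this sketch.

```python
import re


def find_entry_names(fasta_lines, pattern):
    """Return the FASTA headers (without the leading '>') that match a regular expression."""
    regex = re.compile(pattern)
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if regex.search(header):
                names.append(header)
    return names
```

With an open file handle, find_entry_names(handle, r"Adh") would list every entry whose header contains Adh, from which the full name can be copied into the Retrieve Search Entry dialog.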

Get ORF (EMBOSS)

Many transcriptomes are not annotated, preventing the easy retrieval of coding sequences for phylogenetic and evolutionary comparative studies. The Get ORF (EMBOSS) option under Operations retrieves all ORFs that start with a methionine codon and whose length lies between the minimum and maximum sizes specified by the user. This option can also be used with non-annotated genomes, when the genes to be annotated are known (or assumed) to be intronless. The resulting file is saved in the /fasta/nucleotides folder located in the specified repository folder, and can thus be used in further operations such as BLAST analyses. When the Remove new lines checkbox is selected, each sequence in the resulting file is written on a single line rather than being split across several lines. The Get ORF (EMBOSS) option can also be selected by double clicking the right mouse button on top of a FASTA file.
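To make the selection criterion concrete, here is a minimal Python sketch of an ORF search. It is not the EMBOSS getorf implementation: it scans only the three forward reading frames, measures ORF length in nucleotides, and excludes the stop codon from the reported sequence. All of these are simplifying assumptions, and the function name find_orfs is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_size, max_size):
    """Return (start, end, orf) tuples for forward-strand ORFs that start with ATG.

    An ORF runs from the first ATG after the previous stop codon up to (but not
    including) the next in-frame stop codon; only ORFs whose length in
    nucleotides lies between min_size and max_size are kept.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:pos]
                if min_size <= len(orf) <= max_size:
                    orfs.append((start, pos, orf))
                start = None
    return orfs
```

For example, find_orfs("CCATGAAATTTTAGGG", 6, 100) returns [(2, 11, "ATGAAATTT")], while raising min_size to 10 returns an empty list.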

Splign-Compart (NCBI)

When genomes are not annotated and genes have (or are suspected to have) introns, the Splign-Compart (NCBI) option under Operations can be used to annotate exons or genes, as long as a CDS reference sequence is available from a closely related species. How closely related the species must be depends on how fast the gene(s) in question evolve. For instance, a few highly conserved Drosophila virilis genes can be annotated this way using Drosophila melanogaster CDSs as reference (the common ancestor of the two species lived more than 40 million years ago).

When using the Splign-Compart (NCBI) option, the user must specify the FASTA file containing the genome to be annotated (Genome Fasta), the file containing the reference CDSs (CDS Fasta) and the name of the output file (Output name). The resulting file is saved in the /fasta/nucleotides folder located in the specified repository folder, and can thus be used in further operations such as BLAST analyses. If the Concatenate Exons option is used, adjacent exons are concatenated. Therefore, if an annotation is obtained for every exon of a given gene, the resulting sequence will be the complete CDS. The resulting CDS is based on nucleotide homology to a given sequence, and thus its length may not be a multiple of three if, for instance, sequencing errors causing frameshifts are present in the genome to be annotated. Nevertheless, the existence of intron splicing signals at the 5’ and 3’ ends of the exons is taken into account.

When using this option, it is advisable to use short FASTA headers and to avoid special characters as well as new lines within sequences; otherwise the pipeline will likely crash. Remember that any FASTA file can be easily reformatted using the Reformat Fasta option under the Operations menu (see below). Since this is a homology-based approach, it is possible to keep the stop codon in the reference sequence. When this is done, the stop codon will likely be included in the resulting annotation as well, increasing the chances of having a complete annotation, even when the previous codon differs in the reference and target sequences. The inclusion of a stop codon in the CDS annotation is also a guarantee that the CDS is indeed complete.
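The two sanity checks mentioned above (length divisible by three, presence of a terminal stop codon) are easy to automate on the resulting annotation. The sketch below is a hypothetical helper, check_cds, written for illustration only; it assumes the standard stop codons TAA, TAG and TGA and is not part of BDBM.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(cds):
    """Report basic consistency checks for an annotated CDS."""
    seq = cds.upper()
    # In-frame codons read from the first position; a trailing partial codon is ignored.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return {
        "multiple_of_three": len(seq) % 3 == 0,
        "ends_with_stop": len(seq) >= 3 and seq[-3:] in STOP_CODONS,
        # A stop codon before the last in-frame codon suggests a frameshift or a wrong frame.
        "internal_stop": any(codon in STOP_CODONS for codon in codons[:-1]),
    }
```

A sequence for which multiple_of_three is False, or internal_stop is True, is a candidate for manual inspection, for instance to look for sequencing errors in the genome being annotated.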

ProSplign-Compart (NCBI)

An alternative to Splign-Compart (NCBI) is ProSplign-Compart (NCBI). With this option, protein reference sequences are used rather than CDS (nucleotide) reference sequences. Since protein sequences change at a slower pace than nucleotide sequences, the reference and target sequences can in principle be more distantly related than when using the Splign-Compart (NCBI) option, although it is difficult to quantify how distantly related they can be. Note, however, that Splign-Compart (NCBI) runs considerably faster than ProSplign-Compart (NCBI).

When using the ProSplign-Compart (NCBI) option, the user must specify the FASTA file containing the genome to be annotated (Genome Fasta), the file containing the reference protein sequences (Query protein FASTA), and the name of the output file (Output name). The main results file is saved in the /fasta/nucleotides folder located in the specified repository folder. In each header of this file, the first number is an index, which is followed by the name of the protein sequence used to obtain the annotation. The remaining information links this file to two other files that are saved in the /Export Files/nucleotides folder located in the specified repository folder, and gives the name of the sequence that was annotated (see the text after Header:).

In the /Export Files/nucleotides folder, the file with the txt extension shows the output of the tblastx search used for the subsequent annotation, while the file with the fasta extension gives the genome region where the gene has been annotated, including the name of the target nucleotide sequence (see the text after Header:). The two files are matched through their numbers: the first four numbers in the file with the fasta extension must match the first, second, fourth and fifth numbers, respectively, in the file with the txt extension. Note that a single reference sequence can give rise to more than one annotation if ProSplign cannot align the reference and target sequences with full confidence (positions that are confidently aligned are labelled with an asterisk in the file with the txt extension).

The resulting CDS annotation is based on homology to a given protein reference sequence, and thus its length may not be a multiple of three if, for instance, sequencing errors causing frameshifts are present in the genome to be annotated. Nevertheless, the existence of intron splicing signals at the 5’ and 3’ ends of the exons is taken into account. There will be no stop codon in the CDS annotation since the reference sequence is a protein.

Reformat FASTA

Very often FASTA files contain sequences that are split across several lines. This is problematic for some of the implemented operations (such as the Splign-Compart option) but not for others. The Reformat Fasta option under Operations removes those line breaks from FASTA files when the Sequence fragment length parameter is smaller than one. Setting the Sequence fragment length parameter to a positive value produces a FASTA file in which sequences are split into lines of the specified length, an option that may be useful when preparing FASTA files to be deposited elsewhere. The SMART option automatically extracts the GI (GenInfo Identifier) codes. The GENERIC option extracts the information present in the fields that are delimited by the | symbol; the first field is number 0. The PREFIX option adds a prefix to all sequence names, with the possibility of erasing or keeping the original headers. When using the Reformat Fasta option, the user must indicate whether the file contains nucleotide or protein sequences and the location of the file. The Reformat Fasta option can also be used by double clicking the right mouse button on top of a selected FASTA file.
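The behaviour described above can be illustrated with a short Python sketch. It is a simplified approximation, not BDBM's code: the function names and the parameters fragment_length, field_index and prefix are hypothetical stand-ins for the Sequence fragment length, GENERIC and PREFIX options, and the SMART option is not modelled.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining sequences that span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def wrap_sequence(sequence, fragment_length):
    """Split a sequence into lines of fragment_length; values below one mean a single line."""
    if fragment_length < 1:
        return [sequence]
    return [sequence[i:i + fragment_length] for i in range(0, len(sequence), fragment_length)]


def reformat_fasta(lines, fragment_length=0, field_index=None, prefix=""):
    """Return reformatted FASTA lines; field_index selects a |-delimited header field (first is 0)."""
    output = []
    for header, sequence in read_fasta(lines):
        name = header if field_index is None else header.split("|")[field_index]
        output.append(">" + prefix + name)
        output.extend(wrap_sequence(sequence, fragment_length))
    return output
```

With the default fragment_length of 0, every sequence ends up on one line, which is the form recommended before running the Splign-Compart option.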

Merge Fasta

The sequences contained in different FASTA files can be merged even if they are not in the same order, as long as the sequence names are the same. This is a useful option when preparing files for phylogenetic analyses, for instance. Multiple files are selected by holding down the Ctrl key.
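The sketch below shows one plausible reading of this operation: sequences that share a name are concatenated across files, in the order in which the files are given. That interpretation is an assumption (the manual does not spell out the merge rule), and the names parse_fasta and merge_fastas are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records


def merge_fastas(texts):
    """Concatenate same-named sequences across files (assumed merge rule).

    The order of the names follows the first file; every file must contain
    exactly the same set of sequence names.
    """
    parsed = [parse_fasta(text) for text in texts]
    names = list(parsed[0])
    for records in parsed[1:]:
        if set(records) != set(names):
            raise ValueError("all files must contain the same sequence names")
    return {name: "".join(records[name] for records in parsed) for name in names}
```

Under this reading, two single-gene alignments with the species in different orders would yield one concatenated sequence per species, ready for a multi-gene phylogenetic analysis.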

Refine annotation

Depending on the number and location of the differences found between the reference CDS and the target sequences, the Splign-Compart (NCBI) and the ProSplign-Compart (NCBI) options do not always provide a complete CDS annotation. Nevertheless, if all exon-intron splice junctions are covered in the partial CDS annotation, it may be possible to obtain a complete CDS annotation by combining the results produced by the Splign-Compart (NCBI) and the ProSplign-Compart (NCBI) options with information on putative open reading frames generated by the getorf application. When using the Refine Annotation option, the user must provide the name of the FASTA file generated by the Splign-Compart (NCBI) or the ProSplign-Compart (NCBI) option, as well as a FASTA file with the approximate genome region where the gene is located (for instance, the file with the fasta extension that is saved by the ProSplign-Compart (NCBI) option in the /Export Files/nucleotides folder, located in the specified repository folder). The user must also specify the size of the region used to determine whether there is an overlap, as well as the minimum and maximum size of the open reading frames to be reported by the getorf application.

The Refine Annotation option will automatically perform the following steps:

1. Get all open reading frames (between STOP codons) for the provided genome region (only the plus strand is considered, so it is important to give the genome sequence in the proper orientation) and sort them by sequence size (from the longest to the shortest).
2. For each sequence in the partial CDS annotation file: extract the last n positions of the sequence and try to find a match in the open reading frames obtained in (1); only the first match is considered; if a match is found, add the piece of sequence found after the hit to the sequence from which the search motif originates.
3. For all sequences obtained in (2): compare each possible pair of sequences in order to extract the first n positions of the first sequence and try to find a match in the second sequence; if a match is found, erase the sequence region from the hit to the end of the target sequence and concatenate the two sequences.
4. Repeat step (3) until there is no possibility of merging two sequences.
5. Get all open reading frames (between a START and a STOP codon) for the provided genome region and sort them by size (from the longest to the shortest).
6. For each sequence obtained in (4): extract the first n positions of the sequence and try to find a match in the open reading frames obtained in (5); only the first match is considered; if a hit is found, erase the sequence of the open reading frame showing the hit from the place where the motif is found until the end, and prefix the sequence being processed with the sequence just obtained.
7. For all sequences obtained in (6): compare each possible pair of sequences in order to extract the first n positions of the first sequence and try to find a match in the second sequence; if a match is found, erase the sequence region from the hit to the end of the target sequence and concatenate the two sequences.
8. Repeat step (7) until there is no possibility of merging two sequences.
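Steps (2) to (4) of the procedure above can be sketched in Python as plain string operations. This is an interpretation of the description, not BDBM's actual code: exact string matching is assumed for the "match", the open reading frames are passed in as ready-made strings, and the function names extend_right, merge_pair and merge_all are hypothetical.

```python
def extend_right(partial, orfs, n):
    """Step (2): extend a partial CDS using the first ORF that contains its last n positions."""
    motif = partial[-n:]
    for orf in sorted(orfs, key=len, reverse=True):  # longest ORF first
        hit = orf.find(motif)
        if hit != -1:
            # Append the part of the ORF located after the hit.
            return partial + orf[hit + len(motif):]
    return partial


def merge_pair(first, second, n):
    """Step (3): if the first n positions of `first` occur in `second`, join the two sequences."""
    hit = second.find(first[:n])
    if hit == -1:
        return None
    # Erase `second` from the hit to its end, then concatenate `first`.
    return second[:hit] + first


def merge_all(sequences, n):
    """Step (4): repeat pairwise merging until no two sequences can be merged."""
    seqs = list(sequences)
    merged = True
    while merged:
        merged = False
        for i in range(len(seqs)):
            for j in range(len(seqs)):
                if i == j:
                    continue
                combined = merge_pair(seqs[i], seqs[j], n)
                if combined is not None:
                    seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [combined]
                    merged = True
                    break
            if merged:
                break
    return seqs
```

Each successful merge reduces the number of sequences by one, so the loop always terminates. Steps (6) to (8) mirror this logic at the 5’ end, using the START-to-STOP open reading frames from step (5).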

The BLAST menu

The BLAST menu lets the user perform local BLAST searches.

BLASTN, BLASTP, TBLASTN and TBLASTX (with or without external query)

These options allow the user to perform BLASTN (search a nucleotide database using a nucleotide query), BLASTP (search a protein database using a protein query), TBLASTN (search a translated nucleotide database using a protein query) and TBLASTX (search a translated nucleotide database using a translated nucleotide query) searches, using either a sequence that was retrieved with the Retrieve Search Entry option or any other FASTA-formatted sequence (when using the options with external query).
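The query and database types expected by the four programs can be summarised in a small lookup table, together with a helper that assembles the corresponding BLAST+ command line. The helper name blast_command is hypothetical and the options BDBM actually passes are not documented here, so this is only a sketch of an equivalent command-line search.

```python
# program -> (query type, database type), as described in the text above
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}


def blast_command(program, query_fasta, database, output_file, evalue=None):
    """Build a BLAST+ command line for one of the four supported programs."""
    if program not in BLAST_PROGRAMS:
        raise ValueError("unsupported program: " + program)
    command = [program, "-query", query_fasta, "-db", database, "-out", output_file]
    if evalue is not None:
        command += ["-evalue", str(evalue)]
    return command
```

For instance, blast_command("tblastn", "adh_protein.fasta", "dvir_db", "hits.txt") describes a search of a nucleotide database with a protein query; the list can be run with subprocess.run(command, check=True) where the BLAST+ binaries are installed.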
