Introduction

SEDA (SEquence DAtaset builder) is an open-source, multiplatform application for processing FASTA files containing DNA and protein sequences.

As the following image shows, SEDA uses the Input-Process-Output (IPO) model to process sequence files in FASTA format (https://en.wikipedia.org/wiki/FASTA_format). This means that every operation in SEDA takes as input one or more FASTA files and produces one or more FASTA files.

_images/19.png

According to the FASTA format, each file may contain one or more sequences. Each sequence is composed by a header line which begins with ‘>’ and one or more lines containing the nucleotide or amino acid sequences represented using single-letter codes. The header of a sequence typically should give a name (unique identifier) for the sequence, and may also contain additional information (called description). The description is separated by a blank space from the sequence name/identifier. The following example shows a sequence in FASTA format.

>SEQUENCE_NAME_IDENTIFIER Description
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTG

To facilitate the usage of the application, SEDA processing operations are grouped in six main groups:

  • Alignment-related: including functions to align sequences using Clustal Omega, concatenate sequences, and create consensus sequences.
  • BLAST: including an operation for performing batch BLAST analyses and a two-way ortholog identification method.
  • Filtering: including different operations to filter sequences (e.g. those that meet some criteria, remove isoforms or duplicated sequences, among others).
  • Gene Annotation: including gene annotation pipelines based on Augustus as implemented in SAPP, getorf from EMBOSS, Splign/Compart (https://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi), and ProSplign/ProCompart (https://www.ncbi.nlm.nih.gov/sutils/static/prosplign/prosplign.html).
  • Reformatting: providing operations to change the format of the FASTA files such as a powerful operation for changing sequence headers, among others.
  • General: containing operations whose functionality is not related to the other groups (e.g. splitting files and extracting random sequences, DNA to Protein translation, or FASTA files comparison, among others).