Skip to content
Snippets Groups Projects
Commit ee5f74ad authored by tcarver's avatar tcarver
Browse files

sequence and annotation file formats

parent 239663e5
No related branches found
No related tags found
No related merge requests found
<SECT1 ID="FILETYPES">
<TITLE>Sequence and Annotation File Formats</TITLE>
<PARA>
&prog; reads in the common sequence and annotation file formats. As larger data sets become more common it is now
possible to index some of these formats (FASTA and GFF3) to speed up and improve the performance of &prog;.
&prog; can read the following sequence and annotation file formats. As discussed in the next
section these can be read individually or as a combination of different annotation files
being read in and overlaid on the same sequence.
</PARA>
<ITEMIZEDLIST>
<LISTITEM>
<PARA>
<ULINK URL="http://www.ebi.ac.uk/embl/Documentation/">EMBL</ULINK> format.
</PARA>
</LISTITEM>
<LISTITEM>
<PARA>
<ULINK URL="http://www.ncbi.nlm.nih.gov/genbank/">GenBank</ULINK> format.
</PARA>
</LISTITEM>
<LISTITEM>
<PARA>
<ULINK URL="http://www.sequenceontology.org/gff3.shtml">GFF3</ULINK> format. The
file can contain both the sequence and annotation or the
GFF can just be the feature annotation and be read in as a seperate entry and overlaid
onto another entry containing the sequence.
</PARA>
</LISTITEM>
<LISTITEM>
<PARA>
FASTA nucleotide sequence files can be one of the following:
</PARA>
<ITEMIZEDLIST>
<LISTITEM>
<PARA>
Single FASTA sequence.
</PARA>
</LISTITEM>
<LISTITEM>
<PARA>
Multiple FASTA sequence. The sequences are concatenated together when opened in &prog;.
</PARA>
</LISTITEM>
<LISTITEM>
<PARA>
Indexed FASTA files can be read in. The files are indexed
using <ULINK URL="http://samtools.sourceforge.net/">SAMtools</ULINK>:
</PARA>
<SYNOPSIS>
samtools faidx ref.fasta
</SYNOPSIS>
<PARA>
A drop down menu of the sequences in the Entry toolbar (see <XREF
LINKEND="MAINWINDOW-BREAKDOWN">) at the top can then used to select the sequence to view.
</PARA>
</LISTITEM>
</ITEMIZEDLIST>
</LISTITEM>
<LISTITEM>
<PARA>
Indexed GFF3 format. This can be read in and overlaid onto an indexed FASTA file. The indexed GFF3 file contains
the feature annotations. To index the GFF first sort and bgzip the file and then use tabix with
"-p gff" option (see the <ULINK URL="http://samtools.sourceforge.net/tabix.shtml">tabix manual</ULINK>):
</PARA>
<SYNOPSIS>
(grep ^"#" in.gff; grep -v ^"#" in.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz;
tabix -p gff sorted.gff.gz
</SYNOPSIS>
<PARA>
A drop down menu of the sequences in the Entry toolbar (see <XREF
LINKEND="MAINWINDOW-BREAKDOWN">) at the top can then used to select the sequence to view.
Using indexed FASTA and indexed GFF files improves the memory management
and enables large genomes to be viewed. Note that as it is indexed the sequence and annotation are
read-only and cannot be edited. When there are many contigs to select from it can be easier
to display the one of interest by typing the name into the drop down list.
</PARA>
</LISTITEM>
<LISTITEM>
<PARA>
The output of <ULINK
URL="http://sonnhammer.sbc.su.se/download/software/MSPcrunch+Blixem/"><COMMAND>MSPcrunch</COMMAND></ULINK>.
<COMMAND>MSPcrunch</COMMAND> must be run with the <COMMAND>-x</COMMAND> or
<COMMAND>-d</COMMAND> flags.
</PARA>
</LISTITEM>
<LISTITEM>
<PARA>
The output of <ULINK URL="http://www.ncbi.nlm.nih.gov/blast/">blastall version
2.2.2</ULINK> or better. <COMMAND>blastall</COMMAND> must be run with the
<COMMAND>-m 8</COMMAND> flag which generates one line of information per HSP.
Note that currently &prog; displays each Blast HSP as a separate feature
rather than displaying each BLAST hit as a feature.
</PARA>
</LISTITEM>
</ITEMIZEDLIST>
</SECT1>
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment