sequence and annotation file formats

ee5f74ad · tcarver · 239663e5 · ee5f74ad
Commit ee5f74ad authored 12 years ago by tcarver
--- a/docs/filetypes.sgml
+++ b/docs/filetypes.sgml
+  <SECT1 ID="FILETYPES">
+    <TITLE>Sequence and Annotation File Formats</TITLE>
+    <PARA>
+&prog; reads in the common sequence and annotation file formats. As larger data sets become more common it is now
+possible to index some of these formats (FASTA and GFF3) to speed up and improve the performance of &prog;.
+&prog; can read the following sequence and annotation file formats. As discussed in the next
+section these can be read individually or as a combination of different annotation files
+being read in and overlaid on the same sequence.
+    </PARA>
+    <ITEMIZEDLIST>
+      <LISTITEM>
+        <PARA>
+          <ULINK URL="http://www.ebi.ac.uk/embl/Documentation/">EMBL</ULINK> format.
+        </PARA>
+      </LISTITEM>
+      <LISTITEM>
+        <PARA>
+          <ULINK URL="http://www.ncbi.nlm.nih.gov/genbank/">GenBank</ULINK> format.
+        </PARA>
+      </LISTITEM>
+      <LISTITEM>
+        <PARA>
+          <ULINK URL="http://www.sequenceontology.org/gff3.shtml">GFF3</ULINK> format. The
+          file can contain both the sequence and annotation or the
+          GFF can just be the feature annotation and be read in as a seperate entry and overlaid
+          onto another entry containing the sequence.
+        </PARA>
+      </LISTITEM>
+
+      <LISTITEM>
+        <PARA>
+          FASTA nucleotide sequence files can be one of the following:
+        </PARA>
+        <ITEMIZEDLIST>
+          <LISTITEM>
+            <PARA>
+               Single FASTA sequence.
+            </PARA>
+          </LISTITEM>
+          <LISTITEM>
+            <PARA>
+               Multiple FASTA sequence. The sequences are concatenated together when opened in &prog;.
+            </PARA>
+          </LISTITEM>
+          <LISTITEM>
+            <PARA>
+          Indexed FASTA files can be read in. The files are indexed
+using <ULINK URL="http://samtools.sourceforge.net/">SAMtools</ULINK>:
+        </PARA>
+       <SYNOPSIS>
+samtools faidx ref.fasta
+        </SYNOPSIS>
+        <PARA>
+A drop down menu of the sequences in the Entry toolbar (see <XREF
+LINKEND="MAINWINDOW-BREAKDOWN">) at the top can then used to select the sequence to view.
+        </PARA>
+          </LISTITEM>
+        </ITEMIZEDLIST>
+      </LISTITEM>
+
+      <LISTITEM>
+        <PARA>
+Indexed GFF3 format. This can be read in and overlaid onto an indexed FASTA file. The indexed GFF3 file contains
+the feature annotations. To index the GFF first sort and bgzip the file and then use tabix with 
+"-p gff" option (see the <ULINK URL="http://samtools.sourceforge.net/tabix.shtml">tabix manual</ULINK>):
+        </PARA>
+       <SYNOPSIS>
+(grep ^"#" in.gff; grep -v ^"#" in.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz;
+tabix -p gff sorted.gff.gz
+        </SYNOPSIS>
+        <PARA>
+A drop down menu of the sequences in the Entry toolbar (see <XREF
+LINKEND="MAINWINDOW-BREAKDOWN">) at the top can then used to select the sequence to view.
+Using indexed FASTA and indexed GFF files improves the memory management
+and enables large genomes to be viewed. Note that as it is indexed the sequence and annotation are
+read-only and cannot be edited. When there are many contigs to select from it can be easier
+to display the one of interest by typing the name into the drop down list.
+        </PARA>
+      </LISTITEM>
+
+      <LISTITEM>
+        <PARA>
+The output of <ULINK
+URL="http://sonnhammer.sbc.su.se/download/software/MSPcrunch+Blixem/"><COMMAND>MSPcrunch</COMMAND></ULINK>.
+<COMMAND>MSPcrunch</COMMAND> must be run with the <COMMAND>-x</COMMAND> or
+<COMMAND>-d</COMMAND> flags.
+        </PARA>
+      </LISTITEM>
+      <LISTITEM>
+        <PARA>
+The output of <ULINK URL="http://www.ncbi.nlm.nih.gov/blast/">blastall version
+2.2.2</ULINK> or better.  <COMMAND>blastall</COMMAND> must be run with the
+<COMMAND>-m 8</COMMAND> flag which generates one line of information per HSP.
+Note that currently &prog; displays each Blast HSP as a separate feature
+rather than displaying each BLAST hit as a feature.
+        </PARA>
+      </LISTITEM>
+    </ITEMIZEDLIST>
+
+  </SECT1>