Assembly Graph File Formats

Assembly graphs describe the structure of DNA assemblies using nodes (segments or contigs) and edges (connections or overlaps). Several file formats are used to represent these graphs, each with its own syntax and conventions.

This page summarises the four most common formats supported by agtools.

GFA

Extension: .gfa

Versions supported: GFA1 (commonly used in assembly)

GFA (Graphical Fragment Assembly) is a standard format for describing genome assemblies as graphs. Each line is a tab-separated record, and its first field identifies the record type:

  • Headers (start with the tag H) denote metadata.
  • Segments (start with the tag S) are nodes (contigs or unitigs).
  • Links (start with the tag L) define edges between segments.
  • Paths (start with the tag P) optionally define traversal orders.

agtools also supports J (jump), C (containment), and W (walk) records. Please refer to the GFA specification for further details.

Example

S   unitig1 ATGCGTACGGGGTAAGTGAGCCTG
S   unitig2 CGGTACCTTAAAGTCTGG
L   unitig1 +   unitig2 -   10M
P   contig1 unitig1+,unitig2-   10M
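
As a rough illustration of how the record types above map onto a graph, the following Python sketch reads the S, L, and P records of the example into plain dictionaries and lists. It is not the agtools API: `parse_gfa` and `GFA_TEXT` are hypothetical names introduced here, and the sketch assumes tab-separated GFA1 input.

```python
# Tab-separated GFA1 records mirroring the example above (assumed input).
GFA_TEXT = (
    "S\tunitig1\tATGCGTACGGGGTAAGTGAGCCTG\n"
    "S\tunitig2\tCGGTACCTTAAAGTCTGG\n"
    "L\tunitig1\t+\tunitig2\t-\t10M\n"
    "P\tcontig1\tunitig1+,unitig2-\t10M\n"
)


def parse_gfa(text):
    """Collect segments, links, and paths from GFA1 text (hypothetical helper)."""
    segments = {}  # segment name -> sequence
    links = []     # (from_name, from_orient, to_name, to_orient, overlap)
    paths = {}     # path name -> list of (segment name, orientation)
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        tag = fields[0]
        if tag == "S":
            segments[fields[1]] = fields[2]
        elif tag == "L":
            links.append(tuple(fields[1:6]))
        elif tag == "P":
            # Each step is a segment name followed by its orientation sign.
            steps = fields[2].split(",")
            paths[fields[1]] = [(step[:-1], step[-1]) for step in steps]
        # Other record types (H, J, C, W) are ignored in this sketch.
    return segments, links, paths


segments, links, paths = parse_gfa(GFA_TEXT)
print(segments)
print(links)
print(paths)
```

Optional tags after the mandatory fields are dropped here; a fuller reader would keep them alongside each record.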

References: GFA Format

FASTG

Extension: .fastg

FASTG represents possible assembly paths using a FASTA-like structure that encodes graph topology.

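Each node in the graph (a contig or unitig, named NODE_X in the example below) has:

  • A header line holding the node name and, after a colon, a comma-separated list of neighbours (e.g. :neighbor1,neighbor2...) describing outgoing edges. The header ends with a semicolon.
  • A sequence entry representing the nucleotide sequence.

Each forward entry is followed by an entry for its reverse complement (NODE_X'). See the example below.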

Example

>NODE_1:NODE_2';
ATGCGTACGTTAG
>NODE_1';
CTAACGTACGCAT
>NODE_2:NODE_1',NODE_3';
CGGTAACCTGACC
>NODE_2';
GGTCAGGTTACCG
>NODE_3:NODE_2';
TTGACCGGATCGA
>NODE_3';
TCGATCCGGTCAA
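
The header layout described above can be read with a few lines of Python. The sketch below is only an illustration under the simplified NODE_X naming used in the example, not the agtools implementation; `parse_fastg`, `reverse_complement`, and `FASTG_TEXT` are hypothetical names. It also checks that each primed entry is the reverse complement of its forward entry.

```python
FASTG_TEXT = """>NODE_1:NODE_2';
ATGCGTACGTTAG
>NODE_1';
CTAACGTACGCAT
>NODE_2:NODE_1',NODE_3';
CGGTAACCTGACC
>NODE_2';
GGTCAGGTTACCG
>NODE_3:NODE_2';
TTGACCGGATCGA
>NODE_3';
TCGATCCGGTCAA
"""

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fastg(text):
    """Return (sequences, neighbours) keyed by entry name (hypothetical helper)."""
    sequences = {}
    neighbours = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header form: >NAME:NEIGHBOUR1,NEIGHBOUR2; (the list is optional).
            header = line[1:].rstrip(";")
            name, _, adjacency = header.partition(":")
            neighbours[name] = adjacency.split(",") if adjacency else []
            sequences[name] = ""
        elif name is not None:
            # Sequences may span several lines, so append to the current entry.
            sequences[name] += line
    return sequences, neighbours


sequences, neighbours = parse_fastg(FASTG_TEXT)
for entry, seq in sequences.items():
    if not entry.endswith("'"):
        assert sequences[entry + "'"] == reverse_complement(seq)
print(neighbours)
```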

References: FASTG Format

Note: The FASTG format described in the official FASTG specification differs substantially from the FASTG files produced by MEGAHIT and early versions of SPAdes (prior to version 3.10.0).

ASQG

Extension: .asqg

The ASQG format represents assembly graphs as a set of overlapping sequences (reads or contigs). It uses three record types:

  • Header records (start with the tag HT) denote metadata.
  • Vertex records (start with the tag VT) denote the sequences.
  • Edge description records (start with the tag ED) denote pairs of overlapping sequences.

Example

HT  VN:i:1  ER:f:0  OL:i:45 IN:Z:reads.fa   CN:i:1  TE:i:0
VT  read1   GATCGATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGG
VT  read2   CGATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGGATA
VT  read3   ATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGGATATT
ED  read2 read1 0 46 50 3 49 50 0 0
ED  read3 read2 0 47 50 2 49 50 0 0
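
To make the ED fields concrete, the Python sketch below reads the example and checks each overlap against the vertex sequences. It assumes the ED field order used by SGA (both sequence names, then start, inclusive end, and length for each sequence, then a reverse flag and the number of differences); `parse_asqg` and `Edge` are hypothetical names, not part of agtools.

```python
from collections import namedtuple

# Assumed ED field order (SGA convention); coordinates are 0-based, ends inclusive.
Edge = namedtuple(
    "Edge", "seq1 seq2 start1 end1 len1 start2 end2 len2 is_reverse num_diff"
)

ASQG_TEXT = (
    "HT\tVN:i:1\tER:f:0\tOL:i:45\tIN:Z:reads.fa\tCN:i:1\tTE:i:0\n"
    "VT\tread1\tGATCGATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGG\n"
    "VT\tread2\tCGATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGGATA\n"
    "VT\tread3\tATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGGATATT\n"
    "ED\tread2 read1 0 46 50 3 49 50 0 0\n"
    "ED\tread3 read2 0 47 50 2 49 50 0 0\n"
)


def parse_asqg(text):
    """Return (vertices, edges) from ASQG text (hypothetical helper)."""
    vertices = {}  # sequence name -> sequence
    edges = []
    for line in text.splitlines():
        fields = line.split()  # splits on tabs and spaces alike
        if not fields:
            continue
        if fields[0] == "VT":
            vertices[fields[1]] = fields[2]
        elif fields[0] == "ED":
            numbers = [int(value) for value in fields[3:11]]
            edges.append(Edge(fields[1], fields[2], *numbers))
        # HT (header) records are skipped in this sketch.
    return vertices, edges


vertices, edges = parse_asqg(ASQG_TEXT)
for edge in edges:
    overlap1 = vertices[edge.seq1][edge.start1:edge.end1 + 1]
    overlap2 = vertices[edge.seq2][edge.start2:edge.end2 + 1]
    # Both example edges are same-strand, exact overlaps.
    print(edge.seq1, edge.seq2, len(overlap1), overlap1 == overlap2)
```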

References: ASQG Format

DOT

Extension: .dot (GraphViz) or .gv (ABySS)

DOT files describe general graphs and can be used to render visual diagrams of assembly graphs.

Example

graph {
  0 [
    id=0
    name=1
  ];
  1 [
    id=1
    name=2
  ];
  2 [
    id=2
    name=3
  ];
  3 [
    id=3
    name=4
  ];
  4 [
    id=4
    name=5
  ];

  1 -- 0;
  2 -- 1;
  3 -- 1;
  4 -- 2;
  4 -- 3;
}

To render it as a PNG image using GraphViz:

dot -Tpng graph.dot -o graph.png
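
For programmatic access rather than rendering, the node attributes and edges of the GraphViz example can be pulled out with regular expressions, as in the Python sketch below. This is a minimal illustration that only handles the simple layout shown above; a dedicated DOT parser would be more robust. `parse_simple_dot` and `DOT_TEXT` are hypothetical names.

```python
import re

DOT_TEXT = """graph {
  0 [
    id=0
    name=1
  ];
  1 [
    id=1
    name=2
  ];
  2 [
    id=2
    name=3
  ];
  3 [
    id=3
    name=4
  ];
  4 [
    id=4
    name=5
  ];

  1 -- 0;
  2 -- 1;
  3 -- 1;
  4 -- 2;
  4 -- 3;
}
"""


def parse_simple_dot(text):
    """Return (names, adjacency) for an undirected DOT graph (hypothetical helper)."""
    names = {}  # node id -> value of its name attribute
    for node_id, body in re.findall(r"(\w+)\s*\[(.*?)\]", text, re.S):
        attributes = dict(re.findall(r"(\w+)=(\w+)", body))
        names[node_id] = attributes.get("name", node_id)
    adjacency = {name: set() for name in names.values()}
    for left, right in re.findall(r"(\w+)\s*--\s*(\w+)", text):
        # Undirected edge: record it in both directions, keyed by node name.
        adjacency[names[left]].add(names[right])
        adjacency[names[right]].add(names[left])
    return names, adjacency


names, adjacency = parse_simple_dot(DOT_TEXT)
print(sorted(adjacency["2"]))
```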

References: GraphViz DOT Format

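The same example graph can be represented in ABySS DOT format, where each sequence appears once per orientation (+ and -) and:

  • l denotes sequence length
  • d denotes the distance between two sequences; a negative value indicates an overlap (d=-7 is an overlap of 7 bases)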

Example

digraph g {
"1+" [l=8]
"1-" [l=8]
"2+" [l=8]
"2-" [l=8]
"3+" [l=8]
"3-" [l=8]
"4+" [l=8]
"4-" [l=8]
"5+" [l=8]
"5-" [l=8]
"1+" -> "2+" [d=-7]
"2-" -> "1-" [d=-7]
"2+" -> "3+" [d=-7]
"3-" -> "2-" [d=-7]
"2+" -> "4+" [d=-7]
"4-" -> "2-" [d=-7]
"3+" -> "5+" [d=-7]
"5-" -> "3-" [d=-7]
"4+" -> "5+" [d=-7]
"5-" -> "4-" [d=-7]
}
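
The l and d attributes can be extracted with a short Python sketch such as the one below, which converts negative distances into overlap lengths. It assumes one node or edge per line with quoted names, as in the example; `parse_abyss_dot` and `ABYSS_TEXT` are hypothetical names, not agtools functions.

```python
import re

ABYSS_TEXT = """digraph g {
"1+" [l=8]
"1-" [l=8]
"2+" [l=8]
"2-" [l=8]
"3+" [l=8]
"3-" [l=8]
"4+" [l=8]
"4-" [l=8]
"5+" [l=8]
"5-" [l=8]
"1+" -> "2+" [d=-7]
"2-" -> "1-" [d=-7]
"2+" -> "3+" [d=-7]
"3-" -> "2-" [d=-7]
"2+" -> "4+" [d=-7]
"4-" -> "2-" [d=-7]
"3+" -> "5+" [d=-7]
"5-" -> "3-" [d=-7]
"4+" -> "5+" [d=-7]
"5-" -> "4-" [d=-7]
}
"""

# The patterns tolerate extra attributes inside the brackets.
EDGE_RE = re.compile(r'^"([^"]+)"\s*->\s*"([^"]+)"\s*\[[^\]]*\bd=(-?\d+)')
NODE_RE = re.compile(r'^"([^"]+)"\s*\[[^\]]*\bl=(\d+)')


def parse_abyss_dot(text):
    """Return (lengths, edges) from ABySS-style DOT text (hypothetical helper)."""
    lengths = {}  # oriented node name -> sequence length
    edges = []    # (source, target, distance)
    for line in text.splitlines():
        line = line.strip()
        edge_match = EDGE_RE.match(line)
        if edge_match:
            source, target, distance = edge_match.groups()
            edges.append((source, target, int(distance)))
            continue
        node_match = NODE_RE.match(line)
        if node_match:
            lengths[node_match.group(1)] = int(node_match.group(2))
    return lengths, edges


lengths, edges = parse_abyss_dot(ABYSS_TEXT)
# A negative distance d means the two sequences overlap by -d bases.
overlaps = {(src, dst): max(0, -dist) for src, dst, dist in edges}
print(len(lengths), "nodes,", len(edges), "edges")
```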

References: ABySS DOT Format


Comparison of Assembly Graph File Formats

| Feature | GFA | FASTG | ASQG | DOT (GraphViz / ABySS) |
|---|---|---|---|---|
| Primary use | General-purpose assembly graph representation | Compact representation of assembly paths | Read overlap graphs | Visualisation and graph layout |
| Typical extension | .gfa | .fastg | .asqg | .dot, .gv |
| Graph type | Directed graph | Directed graph | Directed graph | Directed or undirected |
| Nodes represent | Segments / unitigs | Contigs or unitigs | Reads or contigs | Generic graph nodes |
| Edges represent | Overlaps / links | Traversable paths | Sequence overlaps | Graph connections |
| Stores sequences | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| Stores overlaps | ✅ Yes | ✅ Implicitly | ✅ Yes | Optional |
| Supports paths | ✅ Yes (P records) | ✅ Yes (via node labels) | ❌ No | ❌ No |
| Orientation-aware | ✅ Yes | ✅ Yes | ✅ Yes | Optional |
| Standardised | ✅ Yes (GFA1) | ⚠️ Semi-standard | ❌ Tool-specific | ✅ GraphViz |
| Used by assemblers | SPAdes, Flye, hifiasm | MEGAHIT | SGA | ABySS |
| Supported by agtools | ✅ Full support | ✅ Supported | ✅ Supported | ✅ Supported |
| Best for | General graph analysis | Path-centric graphs | Overlap-based assemblies | Visualisation |
| Visualisation support | Via Bandage | Via Bandage | Limited | Native (GraphViz) |

Summary

| Format | When to Use |
|---|---|
| GFA | Best all-purpose format for assembly graphs; recommended default |
| FASTG | Useful when working with assembler-generated paths |
| ASQG | Suitable for overlap-based assemblies |
| DOT | Best for visualisation and debugging |