Cleaning an assembly graph file

agtools can clean a GFA file against a FASTA file of its segments. Cleaning removes any segment that is not found in the FASTA file, along with the other elements that refer to that segment. It also adds segment sequences that are missing from the GFA file, so that the segment records are consistent with the GFA specification. Cleaning is available through the clean subcommand of the command-line interface. Please refer to the CLI reference for further details on the clean subcommand.
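The rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the agtools implementation: the function names `referenced_segments` and `clean_lines` are hypothetical, fields are split on whitespace (so a tag value containing spaces would not survive), and a sequence field of `*` is assumed to count as missing.

```python
import re

# Matches an optional GFA tag such as DP:f:1.0
TAG = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZJHB]:")


def referenced_segments(fields):
    """Return the segment names that a non-segment record refers to."""
    kind = fields[0]
    if kind in ("L", "J", "C") and len(fields) > 3:
        return [fields[1], fields[3]]  # the two segments joined by the record
    if kind == "P" and len(fields) > 2:
        # Oriented names such as seq1+,seq3- ; drop the trailing sign
        return [step[:-1] for step in re.split(r"[,;]", fields[2])]
    if kind == "W":
        # The walk is the field that starts with > or <, e.g. <seq1>seq2
        walk = next((f for f in fields[1:] if f.startswith(("<", ">"))), "")
        return re.findall(r"[<>]([^<>]+)", walk)
    return []


def clean_lines(gfa_lines, sequences):
    """Yield cleaned GFA lines; `sequences` maps segment name to sequence."""
    for line in gfa_lines:
        line = line.rstrip("\n")
        fields = line.split()  # assumes no whitespace inside a field
        if not fields:
            yield line
        elif fields[0] == "S" and len(fields) > 1:
            name = fields[1]
            if name not in sequences:
                continue  # segment is absent from the FASTA file
            if len(fields) > 2 and not TAG.match(fields[2]):
                seq, tags = fields[2], fields[3:]
            else:
                seq, tags = "*", fields[2:]  # no sequence field at all
            if seq == "*":
                seq = sequences[name]  # fill in the missing sequence
            yield "\t".join(["S", name, seq, *tags])
        elif any(s not in sequences for s in referenced_segments(fields)):
            continue  # record refers to a removed segment
        else:
            yield line


# Small in-memory example (the names mirror the example used on this page)
sequences = {"seq1": "ATGCGTATGCGTATGCGTAA", "seq2": "CGTACTGACTGACTGACTGA"}
graph = ["S\tseq1\tDP:f:1.0", "S\tseqX\tDP:f:1.0", "L\tseqX\t+\tseq2\t-\t5M"]
for cleaned in clean_lines(graph, sequences):
    print(cleaned)
```

Header and comment lines pass through unchanged because they refer to no segments. In this sketch an existing sequence is kept as it is and only an absent or `*` sequence is taken from the FASTA data.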

Here is an example GFA file. The segment records (the lines starting with S) are missing their sequences.

# GFA file: 10 segments (20 bp) and 1 short segment (10 bp: seqX)
H   VN:Z:1.0
# Segments
S   seq1        DP:f:1.0
S   seq2        DP:f:1.0
S   seq3        DP:f:1.0
S   seq4        DP:f:1.0
S   seq5        DP:f:1.0
S   seq6        DP:f:1.0
S   seq7        DP:f:1.0
S   seq8        DP:f:1.0
S   seq9        DP:f:1.0
S   seq10       DP:f:1.0
S   seqX        DP:f:1.0
L   seqX    +   seq2    -   5M
L   seq4    +   seq5    +   10M
L   seq6    -   seq7    +   7M
L   seq9    +   seq10   -   5M
J   seq5    +   seqX    -   *
J   seq3    -   seq8    +   *
J   seq7    +   seq2    +   *
C   seq1    +   seqX    -   10M 5   0
C   seq2    +   seqX    +   10M 2   0
P   seqpath1    seq1+,seqX+,seq3-   20M,10M,20M
P   seqpath2    seq4+,seq5+,seq6+   20M,20M,20M
W   seqread1    0   *   <seq6<seqX>seq9
W   seqread2    0   *   <seq1>seq2<seq3
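To see which segment records lack a sequence before cleaning, a short check like the one below can help. It is a minimal sketch, not part of agtools: the helper name `segments_without_sequence` is hypothetical, fields are split on whitespace, and a record is treated as lacking a sequence when the field after the name is absent, is `*`, or already looks like a tag.

```python
import re

# Matches an optional GFA tag such as DP:f:1.0
TAG = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZJHB]:")


def segments_without_sequence(gfa_lines):
    """Return the names of S records whose sequence is absent or '*'."""
    missing = []
    for line in gfa_lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != "S":
            continue  # not a segment record
        if len(fields) == 2 or fields[2] == "*" or TAG.match(fields[2]):
            missing.append(fields[1])
    return missing


lines = [
    "S   seq1        DP:f:1.0",
    "S   seq2    CGTACTGACTGACTGACTGA    DP:f:1.0",
    "L   seqX    +   seq2    -   5M",
]
print(segments_without_sequence(lines))  # → ['seq1']
```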

The corresponding segments file in FASTA format is shown below. It does not contain seqX.

>seq1
ATGCGTATGCGTATGCGTAA
>seq2
CGTACTGACTGACTGACTGA
>seq3
TGCATGCATGCATGCATGCA
>seq4
ACGTACGTACGTACGTACGT
>seq5
GTACGTACGTACGTACGTAC
>seq6
CGTACGTACGTACGTACGTA
>seq7
TACGTACGTACGTACGTACG
>seq8
ATCGATCGATCGATCGATCG
>seq9
GCGTGCGTGCGTGCGTGCGT
>seq10
TTGCTTGCTTGCTTGCTTGC
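Which segments would be dropped can be predicted by comparing the segment names in the two files. The sketch below does this with a set difference. The function names are hypothetical, and it assumes that a FASTA record name is the first word after `>` on the header line.

```python
def fasta_names(fasta_lines):
    """Collect record names: the first word after '>' on each header line."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def gfa_segment_names(gfa_lines):
    """Collect the names of all S records."""
    split_lines = (line.split() for line in gfa_lines)
    return {f[1] for f in split_lines if len(f) > 1 and f[0] == "S"}


def segments_missing_from_fasta(gfa_lines, fasta_lines):
    """Return the sorted segment names in the GFA that the FASTA lacks."""
    return sorted(gfa_segment_names(gfa_lines) - fasta_names(fasta_lines))


gfa = [
    "S   seq1        DP:f:1.0",
    "S   seqX        DP:f:1.0",
    "L   seqX    +   seq1    -   5M",
]
fasta = [">seq1", "ATGCGTATGCGTATGCGTAA"]
print(segments_missing_from_fasta(gfa, fasta))  # → ['seqX']
```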

You can run the following command to clean the GFA file.

agtools clean -g test_graph.gfa -f test_fasta.fasta -o results/cleaned_graph.gfa

The cleaned graph is shown below. Note that the segment sequences have been added and that seqX has been removed, together with the link, jump, containments and walk that referred to it.

# GFA file: 10 segments (20 bp) and 1 short segment (10 bp: seqX)
H   VN:Z:1.0
# Segments
S   seq1    ATGCGTATGCGTATGCGTAA    DP:f:1.0
S   seq2    CGTACTGACTGACTGACTGA    DP:f:1.0
S   seq3    TGCATGCATGCATGCATGCA    DP:f:1.0
S   seq4    ACGTACGTACGTACGTACGT    DP:f:1.0
S   seq5    GTACGTACGTACGTACGTAC    DP:f:1.0
S   seq6    CGTACGTACGTACGTACGTA    DP:f:1.0
S   seq7    TACGTACGTACGTACGTACG    DP:f:1.0
S   seq8    ATCGATCGATCGATCGATCG    DP:f:1.0
S   seq9    GCGTGCGTGCGTGCGTGCGT    DP:f:1.0
S   seq10   TTGCTTGCTTGCTTGCTTGC    DP:f:1.0
L   seq4    +   seq5    +   10M
L   seq6    -   seq7    +   7M
L   seq9    +   seq10   -   5M
J   seq3    -   seq8    +   *
J   seq7    +   seq2    +   *
P   seqpath1    seq1+,seqX+,seq3-   20M,10M,20M
P   seqpath2    seq4+,seq5+,seq6+   20M,20M,20M
W   seqread2    0   *   <seq1>seq2<seq3