agtools API Reference

UnitigGraph

Represents a unitig-level assembly graph parsed from a GFA file.

Attributes:
  • graph (Graph) –

    The undirected graph representing the unitig-level assembly graph.

  • vcount (int) –

    The number of vertices (segments) in the graph.

  • lcount (int) –

    The number of links (lines starting with tag "L") in the GFA file.

  • ecount (int) –

    The number of edges in the graph after simplification.

  • pcount (int) –

    The number of paths (lines starting with tag "P") in the GFA file.

  • file_path (str) –

    Path to the GFA file.

  • segment_names (list) –

    List of segment names.

  • segment_name_to_id (dict) –

    Mapping from segment name to internal ID. This is used to map segment names to their vertex IDs in the graph.

  • segment_lengths (dict) –

    Mapping from segment name to length of sequence.

  • segment_offsets (dict) –

    Mapping from segment name to byte offset of the segment line in the GFA file.

  • oriented_links (dict) –

    Mapping from [from segment id][to segment id] -> list of (from orientation, to orientation).

  • link_overlap (dict) –

    Mapping from oriented segment pair (from segment id, from orientation, to segment id, to orientation) -> overlap length.

  • path_index (dict) –

    Mapping from path name to byte offset of the path line in the GFA file.

  • self_loops (list) –

    List of segment IDs that form self-loops.

Methods:

Name Description
from_gfa

Parse a GFA file into a UnitigGraph object.

get_segment_sequence

Retrieve a DNA sequence for a segment.

get_neighbors

Get neighboring segments of a given segment.

get_adjacency_matrix

Return the adjacency matrix as a matrix or a pandas DataFrame.

is_connected

Check if there is a path between two segments in the graph.

get_connected_components

Get connected components of the graph.

calculate_average_node_degree

Calculate the average node degree of the graph.

calculate_total_length

Calculate the total length of all segments in the graph.

calculate_average_segment_length

Calculate the average segment length.

calculate_n50_l50

Calculate N50 and L50 for the segments in the graph.

get_gc_content

Calculate the GC content of segment sequences.

Examples:

>>> from agtools.core.unitig_graph import UnitigGraph
>>> ug = UnitigGraph.from_gfa("assembly.gfa")
>>> ug.vcount
42
>>> ug.ecount
80
References

GFA: Graphical Fragment Assembly (GFA) Format Specification https://github.com/GFA-spec/GFA-spec

calculate_average_node_degree()

Calculate the average node degree of the graph.

Returns:
  • int

    Average node degree of the graph.

Raises:
  • ValueError

    If the graph does not have any segments.

Examples:

>>> ug.calculate_average_node_degree()
2.576374745417515

calculate_average_segment_length()

Calculate the average segment length.

Returns:
  • int

    Average segment length.

Raises:
  • ValueError

    If the graph does not have any segments.

Examples:

>>> ug.calculate_average_segment_length()
8490.319755600814

calculate_n50_l50()

Calculate N50 and L50 for the segment in the graph.

Returns:
  • tuple of (int, int)

    A tuple containing: - N50 : int The length N such that 50% of the total length is contained in segments of length >= N. - L50 : int The minimum number of segments whose summed length >= 50% of the total length.

Examples:

>>> ug.calculate_n50_l50()
(15000, 12)

calculate_total_length()

Calculate the total length of all segments in the graph.

Returns:
  • int

    Total length of all segments.

Examples:

>>> ug.calculate_total_length()
350000

from_gfa(file_path) classmethod

Parse a GFA file into a UnitigGraph object.

Parameters:
  • file_path (str) –

    Path to the GFA file.

Returns:

Examples:

>>> ug = UnitigGraph.from_gfa("assembly.gfa")
>>> ug.vcount
42
>>> ug.ecount
80

get_adjacency_matrix(type='matrix')

Return the adjacency matrix as igraph or pandas DataFrame.

Parameters:
  • type (str, default: 'matrix' ) –

    The return type. Options are: - "matrix": Return the adjacency matrix object from self.graph.get_adjacency(). - "pandas": Return a Pandas DataFrame with unitig names as row and column labels.

Returns:
  • adjacency( object or DataFrame ) –
    • If type="matrix", returns the adjacency matrix object.
    • If type="pandas", returns a DataFrame where both rows and columns are indexed by unitig names.
Raises:
  • ValueError

    If type is not "matrix" or "pandas".

Examples:

>>> matrix = ug.get_adjacency_matrix()
>>> isinstance(matrix, list)
True
>>> df = ug.get_adjacency_matrix(type="pandas")
>>> df.head()
            unitig_1  unitig_2  unitig_3
unitig_1          0         1         0
unitig_2          1         0         1
unitig_3          0         1         0

get_connected_components()

Get connected components of the graph.

Returns:
  • list

    A list of the connected components with internal segment IDs

Examples:

>>> components = ug.get_connected_components()
>>> len(components)
3
>>> [len(c) for c in components]
[10, 8, 5]
>>> components[0]
[0, 1, 2, 3, ...]

get_gc_content()

Calculate the GC content of segment sequences.

Returns:
  • float

    GC content as a percentage of total base pairs.

Raises:
  • ValueError

    If total length of the segments is zero.

Examples:

>>> round(ug.get_gc_content(), 2)
0.42

get_neighbors(seg_name)

Get neighboring segments of a given segment.

Parameters:
  • seg_name (str) –

    The segment name.

Returns:
  • list of str

    List of neighboring segment names.

Examples:

>>> ug.get_neighbors("unitig_1")
['unitig_2', 'unitig_3']

get_path(path_name)

Retrieve the segment string and overlaps string of a path.

This method retrieves the segment string and overlaps string of a path from the original GFA file using byte offsets.

Parameters:
  • path_name (str) –

    The path identifier whose segment sequence should be retrieved.

Returns:
  • segments( str ) –

    The segment string for the path.

  • overlaps( str ) –

    The overlaps string for the path.

Raises:
  • KeyError

    If the path does not exist in the graph.

Examples:

>>> ug.get_path("path_1")
('unitig_1+,unitig_2+,unitig_3+', '*')

get_segment_sequence(seg_name)

Retrieve a DNA sequence for a segment.

This method retrieves the sequence of a segment from the original GFA file using byte offsets, without loading all sequences into memory at once.

Parameters:
  • seg_name (str) –

    The segment name whose DNA sequence should be retrieved.

Returns:
  • Seq

    The DNA sequence corresponding to the given segment.

Raises:
  • KeyError

    If the segment name does not exist in the graph.

  • ValueError

    If the retrieved sequence length does not match the expected length recorded during graph construction.

Examples:

>>> ug.get_segment_sequence("unitig_1")[:10]
Seq('ATGCGTACGG')

is_connected(from_seg, to_seg)

Check if there is a path between two segments in the graph.

This method determines whether a path exists between the segment specified by from_seg and the segment specified by to_seg using the underlying graph's shortest path search.

Parameters:
  • from_seg (str) –

    Name of the starting segment.

  • to_seg (str) –

    Name of the target segment.

Returns:
  • bool

    True if there is a path connecting from_seg to to_seg, False otherwise.

Examples:

>>> ug.is_connected("unitig_1", "unitig_2")
True

ContigGraph

Represents a contig-level assembly graph derived from a GFA file.

Attributes:
  • graph (Graph) –

    The undirected graph representing the contig-level assembly graph.

  • vcount (int) –

    The number of vertices (contigs) in the graph.

  • lcount (int) –

    The number of links (lines starting with tag "L") in the GFA file.

  • ecount (int) –

    The number of edges in the graph after simplification

  • file_path (str) –

    Path to the GFA file.

  • contig_names (list) –

    List of contig names.

  • contig_name_to_id (dict) –

    Mapping from contig name to internal ID. This is used to map contig names to their vertex IDs in the graph.

  • contig_parser (FastaParser) –

    FastaParser object containing the file pointers to contig sequences.

  • contig_descriptions ((dict[str, str], optional)) –

    Dictionary mapping contig names to additional descriptions in FASTA file.

  • graph_to_contig_map ((bidict[int, str], optional)) –

    Bi-directional dictionary mapping from contig identifiers in the GFA file to FASTA file.

  • self_loops ((list[str], optional)) –

    List of contig names that form self-loops in the graph.

Methods:

Name Description
get_contig_sequence

Retrieve a DNA sequence for a contig.

get_neighbors

Get neighboring contigs of a given contig.

get_adjacency_matrix

Return the adjacency matrix as igraph or pandas DataFrame.

is_connected

Check if there is a path between two contigs in the graph.

get_connected_components

Get connected components of the graph.

calculate_average_node_degree

Calculate the average node degree of the graph.

calculate_total_length

Calculate the total length of all contigs in the graph.

calculate_average_contig_length

Calculate the average contig length.

calculate_n50_l50

Calculate N50 and L50 for the contigs in the graph.

get_gc_content

Calculate the GC content of contig sequences.

calculate_average_contig_length()

Calculate the average contig length.

Returns:
  • int

    Average contig length.

Raises:
  • ValueError

    If the graph does not have any contig.

Examples:

>>> cg.calculate_average_contig_length()
40000

calculate_average_node_degree()

Calculate the average node degree of the graph.

Returns:
  • int

    Average node degree of the graph.

Raises:
  • ValueError

    If the graph does not have any contigs.

Examples:

>>> cg.calculate_average_node_degree()
1

calculate_n50_l50()

Calculate N50 and L50 for the contigs in the graph.

Returns:
  • tuple of (int, int)

    A tuple containing: - N50 : int The length N such that 50% of the total length is contained in contigs of length >= N. - L50 : int The minimum number of contigs whose summed length >= 50% of the total length.

Examples:

>>> cg.calculate_n50_l50()
(15000, 12)

calculate_total_length()

Calculate the total length of all contigs in the graph.

Returns:
  • int

    Total length of all contigs.

Examples:

>>> cg.calculate_total_length()
120000

get_adjacency_matrix(type='matrix')

Return the adjacency matrix as igraph or pandas DataFrame.

Parameters:
  • type (str, default: 'matrix' ) –

    The return type. Options are: - "matrix": Return the adjacency matrix object from self.graph.get_adjacency(). - "pandas": Return a Pandas DataFrame with contig names as row and column labels.

Returns:
  • adjacency( object or DataFrame ) –
    • If type="matrix", returns the adjacency matrix object.
    • If type="pandas", returns a DataFrame where both rows and columns are indexed by contig names.
Raises:
  • ValueError

    If type is not "matrix" or "pandas".

Examples:

>>> matrix = cg.get_adjacency_matrix()
>>> isinstance(matrix, list)
True
>>> df = cg.get_adjacency_matrix(type="pandas")
>>> df.head()
            contig_1  contig_2  contig_3
contig_1          0         1         0
contig_2          1         0         1
contig_3          0         1         0

get_connected_components()

Get connected components of the graph.

Returns:
  • list

    A list of the connected components with internal contig IDs.

Examples:

>>> components = cg.get_connected_components()
>>> len(components)
3
>>> [len(c) for c in components]
[10, 8, 5]
>>> components[0]
[0, 1, 2, 3, ...]

get_contig_sequence(contig_name)

Retrieve a DNA sequence for a contig.

This method retrieves the sequence of a contig from the contigs file using byte offsets, without loading all sequences into memory at once.

Parameters:
  • contig_name (str) –

    The contig identifier whose DNA sequence should be retrieved.

Returns:
  • Seq

    The DNA sequence corresponding to the given contig.

Examples:

>>> cg.get_contig_sequence("contig_1")
Seq('TTGATGCGACGTACGG')

get_gc_content()

Calculate the GC content of contig sequences.

Returns:
  • float

    GC content as a percentage of total base pairs.

Raises:
  • ValueError

    If total length of the contigs is zero.

Examples:

>>> cg.get_gc_content()
0.42

get_neighbors(contig_name)

Get neighboring contigs of a given contig.

Parameters:
  • contig_name (str) –

    The contig name.

Returns:
  • list of str

    List of neighboring contig names.

Examples:

>>> cg.get_neighbors("contig_1")
['contig_2', 'contig_3']

is_connected(from_contig, to_contig)

Check if there is a path between two contigs in the graph.

This method determines whether a path exists between the contig specified by from_contig and the contig specified by to_contig using the underlying graph's shortest path search.

Parameters:
  • from_contig (str) –

    Name of the starting contig.

  • to_contig (str) –

    Name of the target contig.

Returns:
  • bool

    True if there is a path connecting from_contig to to_contig, False otherwise.

Raises:
  • KeyError

    If the contig names do not exist in the assembly.

Examples:

>>> cg.is_connected("contig_1", "contig_2")
True

FastaParser

A minimal, lightweight FASTA parser with on-demand sequence retrieval.

This parser builds an index mapping of sequence names to byte offsets in the file, allowing sequences to be fetched lazily without loading the entire FASTA file into memory. Works with both plain-text FASTA and gzip-compressed FASTA (.gz).

Attributes:
  • file_path (str) –

    Path to the FASTA file (plain or gzipped).

  • assembler (str) –

    Assembler used to get the GFA file.

  • mapping (dict) –

    Name mapping of contigs in graph and FASTA file (MEGAHIT).

  • index (dict) –

    Mapping of sequence name to file offset for the header line.

  • gzipped (bool) –

    True if the file is gzip-compressed.

Methods:

Name Description
get_sequence

Retrieve a DNA sequence by sequence name.

get_index

Retrieve the file pointer of the DNA sequence by sequence name.

Examples:

>>> from agtools.core.fasta_parser import FastaParser
>>> parser = FastaParser("contigs.fasta")

get_index(seq_name)

Retrieve the file pointer of the DNA sequence by sequence name.

Parameters:
  • seq_name (str) –

    The sequence name to fetch (matching the FASTA header without '>').

Returns:
  • int

    The DNA sequence as a string, or None if the name is not found.

Raises:
  • KeyError

    If the sequence is not found in the index

Examples:

>>> parser.get_index("contig_1")
8487228

get_sequence(seq_name)

Retrieve a DNA sequence by sequence name.

Parameters:
  • seq_name (str) –

    The sequence name to fetch (matching the FASTA header without '>').

Returns:
  • Seq

    The DNA sequence corresponding to the given sequence name.

Raises:
  • RuntimeWarning

    If the sequence is not found in the contigs FASTA file

Examples:

>>> seq = parser.get_sequence("contig_1")
>>> len(seq)
1500
>>> seq[:10]
Seq('TGGCTCTTCA')