Guides

Filtering SNP alignments with SKA

SKA is a tool for comparing small and highly similar genomes using split k-mers. This guide explains how to create a SNP alignment from a skf file using the different options implemented in the command ska align (here using ska v0.3.5).


Recommended command line

For those in a hurry, the recommended command line for filtering an alignment to get precise SNP calls is:

ska align --no-gap-only-sites --filter-ambig-as-missing --filter no-ambig-or-const
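To give a feel for what site filtering of this kind does to an alignment, here is a minimal Python sketch. It is not SKA's implementation, and the exact behaviour of each flag should be checked against `ska align --help`. The sketch assumes ambiguous bases have already been turned into `N` (treated as missing), and it keeps only sites where at least two different called bases remain. The function names and the toy sequences are made up for illustration.

```python
def is_informative(column, missing="N-"):
    """Return True if a column still varies once missing bases are ignored."""
    called = {base for base in column if base not in missing}
    return len(called) > 1


def filter_sites(seqs):
    """Drop constant sites and sites whose only variation is missing data.

    Assumption: this mimics the general idea of the filters, not SKA's code.
    """
    kept = [column for column in zip(*seqs) if is_informative(column)]
    if not kept:
        return ["" for _ in seqs]
    # Transpose the kept columns back into one string per sample
    return ["".join(bases) for bases in zip(*kept)]


# Toy alignment: sites 1, 3 and 5 carry no real variation
toy_alignment = ["ACGTA", "ACGTN", "ATGCA"]
filtered = filter_sites(toy_alignment)
print(filtered)
```

Only the second and fourth sites survive here, because the last site differs between samples only through an `N`.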

Breakdown using an example

In this example, let’s consider a skf file generated using k=51 from Illumina sequencing reads of 45 Mycobacterium tuberculosis isolates collected by the UKHSA and belonging to the same transmission cluster (i.e. fewer than 12 SNPs between isolates).
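As a rough illustration of what a threshold like "fewer than 12 SNPs between isolates" means in practice, the sketch below counts pairwise differences in a small, made-up alignment. The isolate names, sequences and the choice to skip positions with `N` or `-` are all assumptions for the example, not part of SKA or of the dataset described above.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count positions where two aligned sequences differ, skipping missing data."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in ignore and b not in ignore
    )


# Hypothetical toy alignment (real alignments would be read from a FASTA file)
alignment = {
    "isolate_1": "ACGTAC",
    "isolate_2": "ACGTAT",
    "isolate_3": "ANGTGT",
}

distances = {
    (m, n): snp_distance(alignment[m], alignment[n])
    for m, n in combinations(sorted(alignment), 2)
}

SNP_THRESHOLD = 12  # cluster definition used in the text
same_cluster = all(d < SNP_THRESHOLD for d in distances.values())
print(distances, same_cluster)
```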


Building trees with SKA

SKA is a tool for comparing small and highly similar genomes using split k-mers. This guide will explain how to use SKA to build a phylogenetic tree for different Escherichia coli lineages in a few minutes. Although SKA is tailored more towards analysing variation within a lineage, tree-building ends up working fine for the whole species but requires more memory.


Why SKA is good for building phylogenetic trees

The basic approach to building a tree with SKA is to generate a SNP alignment using split k-mers and then feed that to a tree-building algorithm of your choice. Since SKA does not require a reference sequence to determine the SNPs, it avoids potential biases introduced by reference choice and is thus particularly well suited to analysing microbial genomes derived from outbreaks or pathogen surveillance.
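The split k-mer idea can be sketched in a few lines of Python. Each k-mer is split into its two flanking arms and its middle base; when two samples share the same arms but have a different middle base, that is a SNP, and no reference is needed to find it. This is a toy version only: it ignores reverse complements and is nothing like SKA's optimised implementation. The function names, the short sequences and k=7 are chosen purely for illustration.

```python
def split_kmers(seq, k=7):
    """Map each pair of flanking arms to the set of middle bases seen between them."""
    if k % 2 == 0:
        raise ValueError("k must be odd so that there is a single middle base")
    half = k // 2
    table = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        flanks = (kmer[:half], kmer[half + 1:])
        table.setdefault(flanks, set()).add(kmer[half])
    return table


def snps_between(seq_a, seq_b, k=7):
    """List split k-mers shared by both sequences whose middle bases differ."""
    table_a = split_kmers(seq_a, k)
    table_b = split_kmers(seq_b, k)
    snps = []
    for flanks in sorted(table_a.keys() & table_b.keys()):
        mids_a, mids_b = table_a[flanks], table_b[flanks]
        # Skip arms seen with more than one middle base (repeats)
        if len(mids_a) == 1 and len(mids_b) == 1 and mids_a != mids_b:
            snps.append((flanks, next(iter(mids_a)), next(iter(mids_b))))
    return snps


# Two toy sequences that differ at a single position (A vs T)
seq_a = "ACGTTGCATGCCAGT"
seq_b = "ACGTTGCTTGCCAGT"
snps = snps_between(seq_a, seq_b, k=7)
print(snps)
```

Doing this across all samples and writing out the middle bases per shared split k-mer is, conceptually, how a reference-free SNP alignment comes about.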

October 18, 2022

A beginner's guide to fitting PopPUNK models

PopPUNK now has a lot of different models available, which can make it hard to know where to start, or to tell if you’ve done the right thing when fitting one to your data.

Some questions I’ll address:

tl;dr

Don’t fit a model if you don’t have to.
Try to use a refine/boundary model.
Make sure your component near the origin is sensible (not too big or too small); don’t just rely on the network score.
Check your clusters on a tree, in Microreact, or with some other visualisation.

Do you need to fit a new model?

If there is an existing model you can download from a database, you can usually forgo this step entirely. As well as avoiding extra work and validation, you’ll also get cluster names consistent with other studies, which is a big advantage.