Filtering SNP alignments with SKA
SKA is a tool for comparing small and highly similar genomes using split k-mers. This guide explains how to create a SNP alignment from a skf file using the different options implemented in the command ska align (here using ska v0.3.5).

Recommended command line
For those in a hurry, the recommended command line for filtering an alignment to get precise SNP calls is:
ska align --no-gap-only-sites --filter-ambig-as-missing --filter no-ambig-or-const
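To give a feel for what this kind of column filtering does, here is a toy Python sketch. It is not SKA's implementation. It encodes one plausible reading of the flag names in the command above, which is an assumption: ambiguous bases are treated as missing, and columns that are then constant or contain no real bases are dropped. The function name `filter_snp_columns` and the example sequences are hypothetical; check `ska align --help` for the exact behaviour of each option.

```python
from typing import Dict, List

UNAMBIGUOUS = set("ACGT")


def filter_snp_columns(alignment: Dict[str, str]) -> Dict[str, str]:
    """Keep only columns with at least two different unambiguous bases.

    Assumption (not taken from SKA's source): IUPAC ambiguity codes, N and
    gaps all count as missing, so they never make a column variable.
    """
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all sequences must have the same length")

    kept: List[int] = []
    n_cols = len(seqs[0]) if seqs else 0
    for col in range(n_cols):
        # Only A, C, G and T are counted when deciding if a site varies.
        bases = {s[col] for s in seqs} & UNAMBIGUOUS
        if len(bases) >= 2:
            kept.append(col)

    def mask(base: str) -> str:
        # Keep real bases and gaps; show any other symbol as missing.
        return base if base in UNAMBIGUOUS or base == "-" else "N"

    return {
        name: "".join(mask(s[col]) for col in kept)
        for name, s in zip(names, seqs)
    }


toy = {"s1": "ACGTA-", "s2": "ACTTR-", "s3": "ACGCA-"}
print(filter_snp_columns(toy))
```

In the toy alignment only the third and fourth columns survive: the first two are constant, the fifth varies only through an ambiguous base, and the last is all gaps.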
Breakdown using an example
In this example, let’s consider a skf file generated using k=51 from Illumina sequencing reads of 45 Mycobacterium tuberculosis isolates collected by the UKHSA and belonging to the same transmission cluster (i.e. fewer than 12 SNPs between isolates).
Building trees with SKA
SKA is a tool for comparing small and highly similar genomes using split k-mers. This guide explains how to use SKA to build a phylogenetic tree for different Escherichia coli lineages in a few minutes. SKA is tailored more towards analysing variation within a lineage, but tree-building also works well for the whole species, at the cost of more memory.

Why SKA is good for building phylogenetic trees
The basic approach to building a tree with SKA is to generate a SNP alignment using split k-mers and then feed that to a tree-building algorithm of choice. Since SKA does not require specifying a reference sequence to determine the SNPs, it avoids potential biases introduced by reference choice and is thus particularly well suited to analysing microbial genomes derived from outbreaks or pathogen surveillance.
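The idea of finding SNPs without a reference can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not how SKA itself is written: each k-mer is split into its two flanking arms (the key) and its middle base (the value), and a SNP is reported wherever two samples share the same arms but differ in the middle base. The function names `split_kmers` and `snp_sites` and the two sequences are hypothetical, and the sketch ignores reverse complements and split k-mers that occur more than once.

```python
from typing import Dict, List, Tuple

SplitKmer = Tuple[str, str]


def split_kmers(seq: str, k: int) -> Dict[SplitKmer, str]:
    """Map (left arm, right arm) -> middle base for every k-mer in seq."""
    if k % 2 == 0:
        raise ValueError("k must be odd so that there is a single middle base")
    half = k // 2
    table: Dict[SplitKmer, str] = {}
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Toy simplification: a repeated key simply overwrites the earlier one.
        table[(kmer[:half], kmer[half + 1:])] = kmer[half]
    return table


def snp_sites(a: str, b: str, k: int) -> List[Tuple[str, str]]:
    """Middle-base pairs that differ between split k-mers shared by a and b."""
    kmers_a, kmers_b = split_kmers(a, k), split_kmers(b, k)
    shared = sorted(kmers_a.keys() & kmers_b.keys())
    return [
        (kmers_a[key], kmers_b[key])
        for key in shared
        if kmers_a[key] != kmers_b[key]
    ]


sample_a = "ACGTTGCATGCA"
sample_b = "ACGTTGAATGCA"  # differs from sample_a by one C -> A substitution
print(snp_sites(sample_a, sample_b, k=5))  # [('C', 'A')]
```

Neither sequence is treated as a reference here: both are reduced to the same kind of table, and the comparison is symmetric.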
A beginner's guide to fitting PopPUNK models
PopPUNK now has a lot of different models available, which can make it hard to know where to start, or to tell if you’ve done the right thing when fitting one to your data.
Some questions I’ll address:
tl;dr
- Don’t fit a model if you don’t have to.
- Try and use a refine/boundary model.
- Make sure your component near the origin is sensible (not too big or too small); don’t just rely on the network score.
- Check your clusters on a tree, in Microreact, or with some other visualisation.
Do you need to fit a new model?
If an existing model is available to download from a database, you can usually forgo this step entirely. As well as avoiding extra work and validation, you’ll also get cluster names that are consistent with other studies, which is a big advantage.