July 22, 2025

Hunting collisions with ntHash

This blog post summarises some checks and verifications carried out while developing Sparrowhawk, after we discovered an unsettling surprise. The first two sections briefly explain what a genomic assembler is, describe the issue we found during development (poor performance when assembling at large k values), and explain why it prompted us to study hash collisions as a potential cause. If you are here only for the hashes and the collisions, jump directly to the third section (ntHash and collisions).
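As a rough illustration of what hunting for collisions means here, the sketch below counts how many distinct k-mers of a sequence end up sharing a hash value. It does not use ntHash: `toy_hash` is a hypothetical stand-in built on BLAKE2b and truncated to a chosen number of bits, and the random sequence is made up for the example.

```python
import hashlib
import random
from collections import defaultdict


def toy_hash(kmer: str, bits: int = 16) -> int:
    """Stand-in hash (NOT ntHash): BLAKE2b truncated to the lowest `bits` bits."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def count_collisions(sequence: str, k: int, bits: int = 16) -> int:
    """Count distinct k-mers whose hash is already taken by another distinct k-mer."""
    buckets = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        buckets[toy_hash(kmer, bits)].add(kmer)
    # A bucket holding n distinct k-mers contributes n - 1 collisions.
    return sum(len(kmers) - 1 for kmers in buckets.values())


# Made-up random sequence, used only to exercise the functions.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(5000))
collisions_small = count_collisions(genome, k=31, bits=12)  # tiny hash space
collisions_large = count_collisions(genome, k=31, bits=64)  # full 64-bit space
print(collisions_small, collisions_large)
```

Shrinking the hash space forces collisions, while a well-behaved 64-bit hash should produce essentially none on a sequence this small. The same counting approach could be pointed at any hash function under test.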

August 1, 2024

Internship in the PIM group

Introduction

The intersection of biology and computer science has given rise to the rapidly evolving field of bioinformatics. For aspiring scientists in this field, the opportunity to work alongside leading experts at a research institute is a unique experience. Each year, the French Embassy Internship Program makes this dream a reality for a small group of French students, opening the doors to EMBL-EBI.

April 25, 2024

ggCaller installation

Prerequisites

Before you begin, make sure your system meets the following requirements:

Operating System: Linux (recommended) or macOS.
Compiler: GCC (GNU Compiler Collection) version 4.8 or higher.
CMake: Version 3.1.0 or later.
Git: Version 2.0 or later.
Python: Version 3.6 or later.
Pip: Python package installer.
Basic familiarity with the command line interface.
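A quick way to check part of this list is the short Python sketch below. It only tests whether each tool can be found on the PATH and whether the running Python interpreter is recent enough. It does not verify the GCC, CMake or Git versions, and the command names (`gcc`, `cmake`, `git`, `pip`) are assumptions about how the tools are installed on your system.

```python
import shutil
import sys

# Command names are assumptions; adjust them if your system uses e.g. pip3.
REQUIRED_TOOLS = ["gcc", "cmake", "git", "pip"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_recent_enough(minimum=(3, 6)):
    """Compare the running interpreter with the minimum (major, minor) version."""
    return sys.version_info[:2] >= minimum


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")

if not python_is_recent_enough():
    print("Python 3.6 or later is required.")
```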

Please download the example fasta files from this link. These fasta files will be the input of ggCaller.

December 12, 2023

Multiple horses for multiple courses

This post is about a talk I gave in February 2020 at RSLondonSouthEast, a local conference for research software engineers. First an overview of the talk, then an update after following my own advice for the past few years.

Multiple horses for multiple courses

Choosing a language

When deciding which programming language to use for a project, a useful principle is ‘horses for courses’, i.e. pick the one that is best suited to the task. How might you decide which one this is?

October 30, 2023

Graph-based gene prediction with ggCaller

In this blog, I give a brief overview of bacterial pangenome analysis, and what problems our tool, ggCaller, solves.

The bacterial pangenome – quantifying within-species diversity

A genome is a set of biological instructions, known as ‘genes’, which describe how to make and maintain a living organism. The biological similarities and differences we observe between organisms, either between those that are closely or distantly related, are in no small part due to the genes they possess. By identifying the genes within an organism’s genome, we can make predictions about how it behaves, such as how it interacts with other organisms, and in which environments it can survive.

July 18, 2023

Review: Wellcome Ideathon 2023

We attended the Wellcome Data Science Ideathon as semi-finalists in July 2023. The event saw the Wellcome Trust invite around 100 researchers across 25 teams to compete in tackling some of the biggest public health challenges we face today.

The format

The Ideathon was similar to a Hackathon; groups were tasked with answering specific questions in one of three themes - Infectious Disease, Climate & Health, and Mental Health. The tasks were principally focused on providing low-code solutions employing machine-learning techniques to analyse diverse qualitative and quantitative data.

February 27, 2023

EMBL-UNESCO Residency in Infection Biology Research

This application is now closed

EMBL and UNESCO have recently announced a visitor programme as part of its infection biology scheme, in which we are participating (along with many other EMBL groups, see the link at the end for a full list).

Our group is participating in the scheme, and we’d love to hear from interested candidates who would like to work with us. We have particular expertise in analysing genomic data from bacterial pathogens, developing and using new bioinformatic tools to do so, as well as developing mathematical models for pathogen transmission and evolution. We are a friendly and collaborative group who like to learn and work together, rather than alone or in silos.

December 13, 2022

Mutation Spectra in Streptococcus pneumoniae

An Introduction to Mutation Spectra

It is intuitive that an organism’s ecological, phenotypic, or epidemiological context exposes it to distinct mutagens, and might thus produce specific signatures and patterns of mutation – that organism’s mutational spectrum.
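To make the idea of a spectrum concrete, here is a minimal sketch that tallies single-base substitution types (such as C>T) between two aligned sequences. It is a deliberate simplification: real analyses typically infer mutations along a phylogeny and often include the neighbouring bases. The function name and the toy sequences are made up for illustration.

```python
from collections import Counter


def mutation_spectrum(reference: str, sample: str) -> Counter:
    """Count single-base substitutions, e.g. 'C>T', between two aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    spectrum = Counter()
    for ref_base, alt_base in zip(reference.upper(), sample.upper()):
        # Skip matches, gaps and ambiguous bases; keep only clean substitutions.
        if ref_base != alt_base and ref_base in "ACGT" and alt_base in "ACGT":
            spectrum[f"{ref_base}>{alt_base}"] += 1
    return spectrum


# Toy aligned sequences differing at two positions, both C to T.
spectrum = mutation_spectrum("ACGTCCGA", "ATGTCTGA")
print(spectrum)
```

Comparing such counts between organisms, or between lineages of the same organism, is one simple way to look for differences in the patterns of mutation.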

This idea is well-established in oncology. Cancer epidemiology studies have shown that a handful of genes, most prominently the human p53 gene, show patterns of mutation specific to the corresponding cancer types. Further, certain mutagens are associated with certain mutation types or patterns. For example, in their 2009 review¹ on the subject, Pfeifer & Besaratinia summarised findings from various studies on mouse models containing human p53 genes (Hupki mouse models).

Peer Review of the pre-print 'Endonuclease fingerprint indicates a synthetic origin of SARS-CoV-2'

This is a peer review of the pre-print “Endonuclease fingerprint indicates a synthetic origin of SARS-CoV-2”. It is highly recommended that you read the pre-print first in order to understand this review.

Introduction

The broad thread of the argument in the pre-print is that a synthetically engineered SARS-CoV-2 virus would be created using a process where ‘restriction’ enzymes cut the viral genome into roughly equal fragments so that they can be cloned in a bacterial system before being reassembled. Restriction enzymes cut the genome at very specific sites where the DNA matches certain short sequences. These target sequences, called restriction sites, may appear in the genome by chance as natural nucleotide mutations occur. However, for a given genome the restriction sites will not usually be in the equally spaced locations needed for cutting the genome into equal chunks. To get around this, scientists introduce mutations in the viral genome to create new, more equally spaced restriction sites and to remove old restriction sites which are too close together. Observing equally spaced restriction sites might therefore be a marker of a synthetically engineered virus, since a wild-type viral genome has only a small probability of having restriction sites so equally spaced throughout the genome.
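The notion of restriction-site spacing can be sketched in a few lines of Python. The example below finds every occurrence of a recognition sequence in a linear genome and reports the resulting fragment lengths. It is a simplified illustration, not the method used in the pre-print: it cuts at the start of each site, ignores the reverse strand, and uses the EcoRI site GAATTC and a made-up toy genome purely as an example.

```python
def find_sites(genome: str, site: str) -> list:
    """Return 0-based start positions of every occurrence of `site` (overlaps included)."""
    positions = []
    start = genome.find(site)
    while start != -1:
        positions.append(start)
        start = genome.find(site, start + 1)
    return positions


def fragment_lengths(genome: str, site: str) -> list:
    """Lengths of the pieces obtained by cutting a linear genome at the start of each site."""
    cuts = find_sites(genome, site)
    boundaries = [0] + cuts + [len(genome)]
    # Drop empty pieces, which arise when a site sits at the very start of the genome.
    return [end - begin for begin, end in zip(boundaries, boundaries[1:]) if end > begin]


# Toy genome with two GAATTC sites; not real viral sequence.
toy_genome = "AAAA" + "GAATTC" + "CCCCCC" + "GAATTC" + "GG"
sites = find_sites(toy_genome, "GAATTC")
lengths = fragment_lengths(toy_genome, "GAATTC")
print(sites, lengths, max(lengths))
```

The longest fragment, or the spread between the longest and shortest, gives a simple summary of how evenly the sites are spaced.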

October 7, 2022

Visualising microbial population structure with mandrake

Paper: https://doi.org/10.1098/rstb.2021.0237

(Joint work with Gerry Tonkin-Hill)

Dimensional reduction and embeddings

Dimension reduction methods are a popular way to understand large amounts of genetic data: PCA, t-SNE and UMAP have all been used to analyse and visualise large numbers of samples in two dimensions (with the latter being particularly popular with single-cell techniques).