Hunting collisions with ntHash
This blog post summarises some checks and verifications done while developing Sparrowhawk, after we discovered an unsettling surprise. The first two sections briefly explain what a genomic assembler is, the issue we found during development (bad performance when assembling at large k values), and why it prompted us to study hash collisions as a potential cause. If you are here only for the hashes and the collisions, jump directly to the third section (ntHash and collisions).
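To make "hash collisions" concrete, here is a minimal sketch of how one might count them over the distinct k-mers of a sequence. It is not ntHash and not Sparrowhawk code: `toy_hash` is a hypothetical stand-in built from a truncated BLAKE2b digest, and the helper names are made up for illustration.

```python
import hashlib
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of distinct k-mers in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def toy_hash(kmer, bits=8):
    """Hypothetical stand-in hash (NOT ntHash): BLAKE2b truncated to `bits` bits."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


def count_collisions(distinct_kmers, hash_fn):
    """Count distinct k-mers that share a hash value with an earlier k-mer."""
    buckets = defaultdict(set)
    for kmer in distinct_kmers:
        buckets[hash_fn(kmer)].add(kmer)
    # A bucket holding n different k-mers contributes n - 1 collisions.
    return sum(len(group) - 1 for group in buckets.values())


distinct = kmers("ACGTACGTTGCA", 4)
print(len(distinct), "distinct k-mers,", count_collisions(distinct, toy_hash), "collisions")
```

Truncating the hash to few bits makes collisions easy to provoke in a toy setting; with a 64-bit hash, the same counting would need far more k-mers before a chance collision shows up.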
Internship in the PIM group
Introduction
The intersection of biology and computer science has given rise to the rapidly evolving field of bioinformatics. For aspiring scientists in this field, the opportunity to work alongside leading experts at a research institute is a unique experience. Each year, the French Embassy Internship Program makes this dream a reality for a small group of French students, opening the doors to EMBL-EBI.
ggCaller installation
Prerequisites
Before you begin, make sure your system meets the following requirements:
- Operating System: Linux (recommended) or macOS.
- Compiler: GCC (GNU Compiler Collection) version 4.8 or higher.
- CMake: Version 3.1.0 or later.
- Git: Version 2.0 or later.
- Python: Version 3.6 or later.
- Pip: Python package installer.
- Basic familiarity with the command line interface.
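A quick way to check the list above is to look for each tool on your `PATH`. The sketch below is only an illustration: the executable names (`gcc`, `cmake`, `git`, `python3`, `pip`) are assumptions that may differ on your system (for example `pip3`), and it checks presence only, not version numbers.

```python
import shutil

# Assumed executable names; adjust them to match your system.
REQUIRED_TOOLS = ["gcc", "cmake", "git", "python3", "pip"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed tools were found on PATH.")
```

To compare versions against the minimums listed, run each tool with `--version` and read the output.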
Please download the example FASTA files from this link. These FASTA files will be the input to ggCaller.
Multiple horses for multiple courses
This post is about a talk I gave in February 2020 at RSLondonSouthEast, a local conference for research software engineers. First an overview of the talk, then an update after following my own advice for the past few years.
Multiple horses for multiple courses
Choosing a language
When deciding which programming language to use for a project, a useful principle is ‘horses for courses’, i.e. pick the one that is best suited to the task. How might you decide which one this is?
Graph-based gene prediction with ggCaller
In this blog, I give a brief overview of bacterial pangenome analysis, and what problems our tool, ggCaller, solves.
The bacterial pangenome – quantifying within-species diversity
A genome is a set of biological instructions, known as ‘genes’, which describe how to make and maintain a living organism. The biological similarities and differences we observe between organisms, either between those that are closely or distantly related, are in no small part due to the genes they possess. By identifying the genes within an organism’s genome, we can make predictions about how it behaves, such as how it interacts with other organisms, and in which environments it can survive.
Review: Wellcome Ideathon 2023
We attended the Wellcome Data Science Ideathon as semi-finalists in July 2023, which saw the Wellcome Trust invite around 100 researchers across 25 teams to compete to answer some of the biggest public health challenges we face today.
The format
The Ideathon was similar to a Hackathon; groups were tasked with answering specific questions in one of three themes - Infectious Disease, Climate & Health, and Mental Health. The tasks were principally focused on providing low-code solutions employing machine-learning techniques to analyse diverse qualitative and quantitative data.
EMBL-UNESCO Residency in Infection Biology Research
This application is now closed
EMBL and UNESCO have recently announced a visitor programme as part of their infection biology scheme, in which we are participating (along with many other EMBL groups; see the link at the end for a full list).
We’d love to hear from interested candidates who would like to work with us. We have particular expertise in analysing genomic data from bacterial pathogens, developing and using new bioinformatic tools to do so, and developing mathematical models for pathogen transmission and evolution. We are a friendly and collaborative group who like to learn and work together, rather than alone or in silos.
Mutation Spectra in Streptococcus pneumoniae
An Introduction to Mutation Spectra
It is intuitive that an organism’s ecological, phenotypic, or epidemiological context exposes it to distinct mutagens, and might thus produce specific signatures and patterns of mutation – that organism’s mutational spectrum.
This idea is well-established in oncology. Cancer epidemiology studies have shown that a handful of genes, most prominently the human p53 gene, show patterns of mutation specific to the corresponding cancer types. Further, certain mutagens are associated with certain mutation types or patterns. For example, Pfeifer & Besaratinia’s 2009 review [1] on the subject summarised findings from various studies on mouse models containing human p53 genes (Hupki mouse models).
Peer Review of the pre-print 'Endonuclease fingerprint indicates a synthetic origin of SARS-CoV-2'
This is a peer review of the pre-print “Endonuclease fingerprint indicates a synthetic origin of SARS-CoV-2”. We highly recommend reading the pre-print first in order to understand this review.
Introduction
The broad thread of the argument in the pre-print is that a synthetically engineered SARS-CoV-2 virus would be created using a process where ‘restriction’ enzymes cut the viral genome into roughly equal fragments so that they can be cloned in a bacterial system before being reassembled. Restriction enzymes cut the genome at very select sites where the DNA matches certain short sequences. These target sequences, called restriction sites, may appear in the genome by chance as natural nucleotide mutations occur. However, for a given genome the restriction sites will not usually be in the equally spaced locations needed for cutting the genome into equal chunks using restriction enzymes. To get around this, scientists create mutations in the viral genome to create new, more equally spaced restriction sites and remove old restriction sites which are too close together. Observing equally spaced restriction sites might be a marker of a synthetically engineered virus, since a wild-type viral genome has a small probability of having restriction sites so equally spaced throughout the genome.
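The spacing idea can be sketched in a few lines. This is a toy illustration, not the pre-print's method: the sequence is made up, the site `GAATTC` (the EcoRI recognition sequence) is chosen only as a familiar example, and each site's start position is treated as the cut point, which is a simplification.

```python
def find_sites(genome, site):
    """Return 0-based start positions of every occurrence of `site` in `genome`."""
    positions = []
    start = genome.find(site)
    while start != -1:
        positions.append(start)
        start = genome.find(site, start + 1)
    return positions


def fragment_lengths(genome_length, cut_positions):
    """Lengths of the pieces of a linear genome cut at each given position."""
    boundaries = [0] + sorted(cut_positions) + [genome_length]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


# Made-up toy sequence with two copies of the example site.
toy_genome = "AAAA" + "GAATTC" + "AAAAAAAA" + "GAATTC" + "AAAA"
sites = find_sites(toy_genome, "GAATTC")
fragments = fragment_lengths(len(toy_genome), sites)
print(sites)           # [4, 18]
print(fragments)       # [4, 14, 10]
print(max(fragments))  # 14
```

A statistic such as the longest fragment length summarises how evenly the sites are spaced: the more even the spacing, the shorter the longest fragment for a given number of sites.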
Visualising microbial population structure with mandrake
Paper: https://doi.org/10.1098/rstb.2021.0237
(Joint work with Gerry Tonkin-Hill)
Dimensional reduction and embeddings
Dimension reduction methods are a popular way to understand large amounts of genetic data: PCA, t-SNE and UMAP have all been used to analyse and visualise large numbers of samples in two dimensions (with the latter being particularly popular with single-cell techniques).