Escherichia coli diversity and evolution: perspectives from the study of 80,000 genomes

Lucile Vigue

13 September 2023

Thesis defence

Practical info

14h - 23h
Conference room Rosalind Franklin
research professional
Reduced mobility access

Under the supervision of Olivier Tenaillon, at IAME (Faculty of Medicine, Bichat Hospital) and in Ivan Matic's team Robustesse et évolvabilité de la vie

Abstract:
 
A commensal bacterium in the gut of humans and many other vertebrates, Escherichia coli is also a deadly pathogen responsible for 950,000 deaths per year worldwide. As a generalist organism capable of adapting to different ecological niches, it is a species of choice for studying evolution on different time scales. Its status as a model organism in biology and its importance for human health have led to the sequencing of many strains worldwide. The aim of this thesis is to analyse the diversity present in 81,440 of these genomes and to understand how this diversity can inform us about the evolutionary processes at work in this species.
 
The 81,440 genomes collected cover the natural diversity of Escherichia coli. Strains isolated from humans, and more specifically in a clinical context, are heavily represented. In particular, 11,000 of these genomes are Shigella, obligate pathogenic strains of primates that have adopted an intracellular lifestyle. To study these 81,440 genomes, I extracted the coding sequences and organised them in a database. A comparison of the core genomes of these strains allowed me to classify them into 240 clusters, from which I was able to infer a global phylogeny of the species corrected for recombination.

To analyse the mutational patterns further, I used Direct-Coupling Analysis (DCA). This statistical physics approach makes it possible to predict the effect of a mutation that occurs in a gene and induces an amino-acid change in the corresponding protein. By modelling the interactions between pairs of amino acids within the protein, DCA takes into account the genetic context in which the mutation occurs.
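
As an illustration of how pairwise interactions make the predicted effect of a mutation depend on its context, here is a minimal sketch of the standard Potts-model formulation commonly used in DCA. It is not the implementation used in the thesis: the function names, the toy parameters `h` (per-site fields) and `J` (pairwise couplings), and the two-letter alphabet are assumptions chosen for readability.

```python
import numpy as np

def potts_energy(seq, h, J):
    """Statistical energy E(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, j, s_i, s_j].

    seq is a list of integer-encoded residues; lower energy means a
    sequence the model considers more favourable.
    """
    n = len(seq)
    energy = -sum(h[i, seq[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy -= J[i, j, seq[i], seq[j]]
    return energy

def mutation_delta_energy(seq, pos, new_aa, h, J):
    """Energy change caused by replacing the residue at `pos` with `new_aa`.

    A positive value means the mutant is predicted to be less favourable
    than the original sequence in this particular context.
    """
    mutant = list(seq)
    mutant[pos] = new_aa
    return potts_energy(mutant, h, J) - potts_energy(seq, h, J)

# Toy model (hypothetical values): 3 sites, 2 possible residues (0 and 1).
# Sites 0 and 1 are coupled and "prefer" to carry the same residue.
h = np.zeros((3, 2))
J = np.zeros((3, 3, 2, 2))
J[0, 1, 0, 0] = 1.0
J[0, 1, 1, 1] = 1.0

# The same mutation (site 0: residue 0 -> 1) in two different contexts.
delta_in_context_a = mutation_delta_energy([0, 0, 0], 0, 1, h, J)
delta_in_context_b = mutation_delta_energy([0, 1, 0], 0, 1, h, J)
print(delta_in_context_a, delta_in_context_b)
```

In this toy model the mutation raises the energy when site 1 carries residue 0 and lowers it when site 1 carries residue 1, which is the sense in which a pairwise model accounts for genetic context. In practice the parameters would be inferred from a multiple sequence alignment of homologous proteins rather than set by hand.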
 
By applying DCA to thousands of E. coli core genes, I have shown that it can predict not only the native amino acids of this species but also the polymorphisms observed in it. DCA also predicts the probability of observing a mutation at a certain frequency. In doing so, it reveals differences in the efficiency of natural selection between different subpopulations of E. coli. In particular, natural selection appears to be much less effective in Shigella strains, consistent with the reduced effective size of this population.

Genetic context was found to be key to the quality of the predictions made by DCA. This context is built up over long time scales by the accumulation of many weak interactions between amino acids. These interactions do not affect all residues of a protein in the same way, and DCA can predict how variable each residue is. In particular, between 30% and 50% of the sites in a protein are highly constrained by the genetic background of E. coli. A mutation at one of these sites will generally be deleterious if it occurs alone. These sites therefore do not tolerate polymorphisms. However, they can co-evolve over long time scales, so that the amino acids observed there vary widely between species.

Just as individual residues of a protein can evolve at different rates, so can proteins. I have developed a selection test, based on DCA, which allows genes to be compared with each other. In the short term, essential genes are those under the strongest purifying selection, while the level of expression determines the long-term rate of evolution. This test also detects inactivations of transcription factors, which appear to be selected in the short term but counter-selected in the longer term.

The present work demonstrates the value of coupling the study of large genome databases with modelling approaches to understand the evolution of a species on different time scales.