Predicting local genome similarity with DNABERT
In recent years, the number of sequenced genomes has grown rapidly. Many species are no longer represented by a single reference genome, but by numerous genomes of individuals. To make optimal use of the genomic diversity found in large collections of genomes, a transition is under way from a reference-centric approach to one based on pangenomes: computational representations of multiple genomes that facilitate fast analyses. Computational pangenomics is currently an active and challenging field of research [1]. We have developed a pangenome solution called PanTools, which compresses multiple annotated sequences into a single graph data structure that is constructed, stored, and annotated in a Neo4j graph database [2]. The representation stores genome sequences (as so-called compressed, colored de Bruijn graphs) and links genomes by detecting which genes are similar, indicating they may have the same function. However, for regions of the genome outside genes, no such similarity measures are available yet, other than those derived from costly whole-genome alignments.
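To make the graph representation mentioned above concrete, here is a minimal sketch of a colored de Bruijn graph in plain Python. It is not the PanTools implementation: the function name `build_colored_dbg`, the dictionary input format and the toy sequences are assumptions made for illustration. The sketch also skips the compression step (merging non-branching paths into single nodes) and ignores reverse complements.

```python
from collections import defaultdict

def build_colored_dbg(genomes, k):
    """Build an uncompressed colored de Bruijn graph (illustrative only).

    genomes: dict mapping a genome name to its sequence (hypothetical format).
    Returns (colors, edges):
      colors maps each k-mer to the set of genomes that contain it;
      edges maps each k-mer to the set of k-mers that directly follow it.
    """
    colors = defaultdict(set)
    edges = defaultdict(set)
    for name, seq in genomes.items():
        prev = None
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            colors[kmer].add(name)      # "color" the node with this genome
            if prev is not None:
                edges[prev].add(kmer)   # consecutive k-mers overlap by k-1 bases
            prev = kmer
    return dict(colors), dict(edges)

# Two toy genomes that differ at a single position.
genomes = {"A": "ACGTAC", "B": "ACGTTC"}
colors, edges = build_colored_dbg(genomes, k=3)

# k-mers present in every genome form the shared part of the graph.
shared = sorted(kmer for kmer, c in colors.items() if len(c) == len(genomes))
print(shared)
print(edges["CGT"])
```

In this toy example the two genomes share a path up to the k-mer `CGT`, after which the graph branches into one genome-specific path per genome.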
An exciting trend in the analysis of sequence data has been the development of Transformer-based models that are pre-trained on large data volumes and can subsequently be fine-tuned to perform a range of specific tasks with relatively minor effort. A well-known example is BERT, which has had a major impact on natural language processing tasks. Inspired by BERT, similar approaches have been developed for genome sequence data, e.g. DNABERT, HyenaDNA and GROVER. For an overview of such genomic language models (gLMs), see [3]. After minor fine-tuning, gLMs show good performance in various specific prediction tasks related to genome annotation, for example predicting where proteins can bind to DNA or predicting gene expression.
In this project you will explore how we can exploit gLMs to predict local genome similarity in plants. In earlier work we found that whether two genomic sequences align (true/false) can be predicted with reasonable accuracy using a Siamese neural network based on HyenaDNA. However, several questions remain. Can the network architecture, the prediction task and particularly the training dataset be further optimized? How sensitive is the approach to the evolutionary distance between genomes, and in what type of genomic regions does the model perform particularly well or poorly? Can we add higher-level information on the contents of the genome (genes, functions, pathways) to arrive at a measure that combines both sequence and functional similarity?
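The Siamese setup mentioned above can be sketched as follows: the same encoder is applied to both sequences, and the two embeddings are compared to produce a similarity score. This is only a toy illustration under stated assumptions, not the model from the earlier work. A normalized k-mer count vector stands in for gLM embeddings, nothing is trained, and the names `encode`, `siamese_score` and `predict_aligned`, as well as the threshold of 0.5, are hypothetical.

```python
import math
from collections import Counter
from itertools import product

K = 3  # k-mer length for the stand-in encoder (arbitrary choice)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def encode(seq):
    """Stand-in encoder: unit-length k-mer count vector.

    In a real setup this would be replaced by embeddings from a gLM.
    """
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    vec = [counts.get(kmer, 0) for kmer in KMERS]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

def siamese_score(seq_a, seq_b):
    """Encode both inputs with the SAME encoder, then take cosine similarity."""
    emb_a, emb_b = encode(seq_a), encode(seq_b)
    return sum(a * b for a, b in zip(emb_a, emb_b))

def predict_aligned(seq_a, seq_b, threshold=0.5):
    """Binary prediction; the threshold is an untuned, hypothetical value."""
    return siamese_score(seq_a, seq_b) >= threshold

same = siamese_score("ACGTACGTACGT", "ACGTACGTACGT")
different = siamese_score("ACGTACGTACGT", "TTTTTTTTTTTT")
print(same, different)
```

Sharing one encoder between both branches makes the score symmetric in its two inputs. In a trained Siamese network, the encoder weights and the decision rule would be learned from labeled sequence pairs rather than fixed by hand as they are here.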
References
[1] Computational Pan-Genomics Consortium. 2018. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics 19(1):118-135.
[2] Sheikhizadeh S, Schranz ME, Akdel M, de Ridder D, Smit S. 2016. PanTools: representation, storage and exploration of pan-genomic data. Bioinformatics 32(17):i487-i494.
[3] Benegas G, Ye C, Albors C, Li JC, Song YS. 2024. Genomic language models: opportunities and challenges. arXiv:2407.11435.
Contact: Twan van Laarhoven (RU); Dick de Ridder (RU, WUR).