Fold2vec: protein structure embedding for deep learning 

The structure of a protein largely determines its function and activity in the cell. Thus far, computational biological research on proteins has focused more on their sequences than on their structures, owing to a lack of experimental structures and the challenges of working with 3D data. Recently, however, the AlphaFold2 deep learning protein structure predictor was introduced, leading to a dramatic increase in the number and quality of available protein structures. This calls for novel methods to work with protein structures, particularly ones based on deep learning. A major challenge that could benefit from structure information is protein function prediction, which until now has mainly depended on sequence data.

An important aspect of the success of deep learning is the ability to reuse pre-trained embeddings to tackle new problems, avoiding duplication of effort and wasteful computation. Such approaches have already been applied to protein sequences [1], making it possible to capitalize on the large amounts of unlabelled data available, and have recently also been applied to protein structures [2-4]. We also explored this option in a previous thesis project, combining graph neural networks with invariant local structure features that we developed earlier [5].

In this project, we will explore extending a protein structure embedding method (either the one developed earlier in house, or one from the literature) in two directions: (1) increasing resolution, to allow modelling of side chains in addition to the protein backbone, which is usually all that is modelled at present; and (2) taking the per-residue uncertainties generated in protein structure predictions into account in structure-based deep learning. To make use of as much data as possible, an important requirement will be to first set up a (limited) self-supervised learning environment. Overall, the project aims to generate insights into the opportunities and limitations of using protein structure data in computational biology.
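As a rough illustration of direction (2), the sketch below reads per-residue confidence from an AlphaFold-style PDB file, where the pLDDT score (0-100) is stored in the B-factor column, and attaches it as a weight to the edges of a residue contact graph. This is only one possible starting point, not the method the project will use: the function names, the 8 Å C-alpha cutoff and the choice to weight an edge by the lower pLDDT of its two residues are all assumptions made for the example.

```python
import math


def parse_ca_atoms(pdb_text):
    """Collect (residue_number, (x, y, z), plddt) for every C-alpha atom.

    Assumes fixed-column PDB ATOM records in which the B-factor field
    (columns 61-66) holds the pLDDT score, as in AlphaFold model files.
    """
    residues = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resnum = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            plddt = float(line[60:66])
            residues.append((resnum, xyz, plddt))
    return residues


def confidence_weighted_edges(residues, cutoff=8.0):
    """Build contact-graph edges (i, j, weight) between C-alpha atoms.

    Two residues are connected when their C-alpha atoms lie within
    `cutoff` angstroms (hypothetical default). The weight is the lower
    of the two pLDDT values rescaled to [0, 1], so an edge is trusted
    only as much as its least confident endpoint (illustrative choice).
    """
    edges = []
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            distance = math.dist(residues[i][1], residues[j][1])
            if distance <= cutoff:
                weight = min(residues[i][2], residues[j][2]) / 100.0
                edges.append((i, j, weight))
    return edges
```

The resulting weighted edge list could be passed to a graph neural network as edge features or used to down-weight messages from low-confidence regions; which of these works best is one of the questions the project would need to examine.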

References

[1] Wang Z, Combs SA, Brand R, Romero Calvo M, Xu P, Price G, Golovach N, Salawu EO, Wise CJ, Ponnapalli SP, Clark PM. 2022. LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction. Sci. Reports 12:6832, doi: 10.1038/s41598-022-10775-y.

[2] Villegas-Morcillo A, Makrodimitris S, van Ham RCHJ, Gomez AM, Sanchez V, Reinders MJT. 2021. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 37(2):162-170, doi: 10.1093/bioinformatics/btaa701.

[3] Blaabjerg LM, Jonsson N, Boomsma W, Stein A, Lindorff-Larsen K. 2023. A joint embedding of protein sequence and structure enables robust variant effect predictions. bioRxiv 2023.12.14.571755, doi: 10.1101/2023.12.14.571755.

[4] Ibtehaz N, Kihara D. 2023. Application of sequence embedding in protein sequence-based predictions. In: Machine Learning in Bioinformatics of Protein Sequences: Algorithms, Databases and Resources for Modern Protein Bioinformatics, Kurgan L (ed.), pp. 31-55, doi: 10.1142/12899.

[5] Durairaj J, Akdel M, de Ridder D, van Dijk ADJ. 2020. Geometricus represents protein structures as shape-mers derived from moment invariants. Bioinformatics 36(Suppl_2):i718-i725, doi: 10.1093/bioinformatics/btaa839.

Contact: Twan van Laarhoven (RU); Dick de Ridder (RU, WUR).