EMBL-EBI at the Wellcome Genome Campus, Hinxton, Cambridge. (Credit: EMBL-EBI)

Intricate Structural Information revealed by Deep Learning

Mar 23 2022

The EMBL-European Bioinformatics Institute has expanded its open-access protein family database (Pfam) with the help of deep learning models. Pfam gives biologists information on protein characteristics, including vital protein annotations, structures and multiple sequence alignments. It is widely used to classify protein sequences into phylogenies and to identify domains that provide insights into protein activity.

The increase in knowledge content(1) was achieved through the use of deep learning methods developed by Google Research that were trained to use data from the Pfam research base to annotate previously undescribed protein domains, shedding light on potential protein function.

“Initially I was rather sceptical about using deep learning to reproduce the protein families within Pfam. Then I started collaborating more closely with Lucy Colwell and her team at Google Research and my scepticism quickly changed to excitement for the potential of these methods to improve our ability to classify sequences into domains and families,” said Alex Bateman, Senior Team Leader of Protein Sequence Resources at EMBL-EBI.

“These models exceed my expectations. They’re not just copying the data already in Pfam, they’re able to learn from the data and find new information that is yet to be discovered. What this gives us is the ability to expand the Pfam collection and potentially that of other resources using these same deep learning methods.”

Exceeding previous expansion efforts

The project expanded the Pfam database by almost 10%, exceeding the expansion efforts made over the last decade. The deep learning methods were also able to predict the function of 360 human proteins that had no previous annotation data available in Pfam.

Additional protein family predictions generated by the Google Research team’s neural networks were used to create a supplement to Pfam called Pfam-N (network), which added a further 6.8 million protein sequences to the Pfam database.

“We’re also now building on these established deep learning methods to expand the information in the database even further,” said Bateman. “We’re changing the way the existing deep learning model works so that we can call multiple protein domains at once. This new update to the database should be ready very soon.”

“My personal view is that there’s still a lot of scope to improve the deep learning models we’re currently using,” Bateman added. “We’re in the early days of this and I’m very hopeful for what it will mean for the future classification of protein families. This may even be something that will get solved in the next five years.”

This work is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.
