History of demystifying proteins with Artificial Intelligence and Machine Learning approaches 

An account of the role of Artificial Intelligence (AI) and Machine Learning (ML) in solving some of the biggest problems in biology. We will look at the evolution of computational approaches to protein folding and the design of novel proteins.

Charles Darwin with the finches he studied for his theory of evolution, and Alan Turing with a Turing machine (Photos: Contunico ZDF Studios, thecrazyprogrammer.com)

For centuries, scientists have formulated universal laws in the form of mathematical equations: a set of inputs producing an output. Biology, in its early days, did not adhere to this approach until the revolutionary works of Darwin and Mendel in the 1800s. Their work demonstrated that biological systems share common traits and patterns that can be expressed as “models”. In the early 20th century, mathematical and statistical methods were used to integrate Mendelian genetics with Darwinian evolution. Prof. Mukund Thattai, a computational biologist at the National Centre for Biological Sciences, says, “There is a long tradition of using mathematical models in biology, for example in codifying the rules of evolutionary biology.” For instance, there are models that predict how a new mutation will affect a population. “The exciting thing is that now we can do experiments that track what these mutations do, and it fits with these theories,” he adds.

The models of evolution and heredity set the stage for thinking about biological processes in a systematic manner. While the concept of discrete units of inheritance had been developed, the actual nature of these units remained a mystery. The tools available at the time were inadequate for studying fundamental biological components such as proteins and DNA, yet the nature and role of these molecules were actively under investigation. This created a pressing need for innovative approaches. As the mid-20th century unfolded, the motivation to bring computation and biology together intensified alongside our growing understanding of biological molecules.

The Code of Life

In 1950, Alan Turing, a young polymath, proposed the concept of machine intelligence, exploring the potential of computers to think like humans and laying the foundation for artificial intelligence (AI). The 1950s were an exciting era. The role of DNA as a genetic information-encoding molecule had been firmly established. The analogy with a code was contributed by the physicist George Gamow, who inspired biologists to develop a framework describing the flow of information. The word ‘code’ highlighted that the sequence of DNA carries instructions for the synthesis of proteins, much like a code carries information in a language. The actual deciphering of the genetic code began in 1961, revealing the rules that dictate the translation of genetic information into proteins.

By that time, the first protein structure had been solved and the first protein sequence, that of insulin, had been published. As the size and variety of data grew, there was a need for repositories to organise this information, giving birth to “The Atlas of Protein Sequence and Structure” by Margaret Dayhoff and her team in 1965. Around the same time, the notion of the protein folding “problem” emerged.

The Protein Folding Problem 

The protein folding problem is that of determining the three-dimensional atomic structure of a protein molecule from its primary structure, the amino acid sequence (Box 1). This is challenging because even relatively small proteins can fold in an astronomical number of possible ways, and experimental methods for determining protein structures can be time-consuming and expensive. A part of the puzzle was solved in 1968, with Cyrus Levinthal’s observation that even though proteins have vast conformational spaces, they quickly converge to their native states.
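
To get a feel for how astronomical the search space is, here is a back-of-the-envelope calculation in the spirit of Levinthal’s observation. The numbers are illustrative assumptions, not figures from this article: roughly three backbone conformations per residue, a 100-residue protein, and sampling one conformation per picosecond.

```python
# Back-of-the-envelope estimate in the spirit of Levinthal's paradox.
# All numbers below are illustrative assumptions.

conformations_per_residue = 3        # assumed backbone states per residue
n_residues = 100                     # a modest-sized protein
sampling_rate_per_second = 1e12      # one conformation per picosecond

total_conformations = conformations_per_residue ** n_residues
seconds_to_enumerate = total_conformations / sampling_rate_per_second
years_to_enumerate = seconds_to_enumerate / (60 * 60 * 24 * 365)

print(f"Possible conformations: {total_conformations:.2e}")
print(f"Years to try them all:  {years_to_enumerate:.2e}")
# ~5e47 conformations and ~1.6e28 years -- far longer than the age of
# the universe, yet real proteins fold in milliseconds to seconds.
```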

Box 1: Protein folding

The protein energy funnel guides a protein through many different sequences of kinetic traps toward its low-energy folded (native) structure. Each protein normally folds into a single stable conformation.

Image by Thomas Splettstoesser


New methods to study the sequence and structure of proteins were blossoming, and some began to use computers for analysis. The initial ideas and efforts towards computer-aided drug design were also developed during the 1960s. Through the 1970s and 1980s, parallel advances in computer science and molecular biology made momentous contributions to the field. The discovery of the molecular forces that govern protein folding, along with common structural motifs in proteins, built the foundation for the field of structure prediction.

Simultaneously, the theory of protein folding was evolving. It was proposed (and later verified) that the final structure adopted by a sequence is generally the one in which the free energy is minimised (Box 1). Proteins are dynamic networks of atoms that interact with other molecules and with the environment around them. The earliest computer simulation of protein folding, using Molecular Dynamics (MD), came in 1975. Sophisticated structure determination methods made it possible to obtain atomic resolution, and researchers began to implement bioinformatics algorithms to model and visualise 3D structures.
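
The idea of a protein relaxing downhill on an energy landscape can be illustrated with a toy model. This is a deliberately simplified sketch, not real molecular dynamics: a single bead on a one-dimensional double-well potential, nudged by random noise and drifting toward a low-energy well, the same principle by which a folding chain descends the funnel in Box 1.

```python
# Toy illustration (an assumption of this sketch, not real MD): a bead
# on a 1-D double-well "energy landscape" relaxing toward a low-energy
# state by following the downhill gradient plus thermal jitter.
import random

def energy(x):
    return (x**2 - 1)**2        # two minima ("native" wells) at x = -1, +1

def gradient(x):
    return 4 * x * (x**2 - 1)   # dE/dx

x = 2.0          # start far from either well
step = 0.01      # step size
noise = 0.05     # strength of the thermal jitter

for _ in range(2000):
    x -= step * gradient(x) + noise * (step ** 0.5) * random.gauss(0, 1)

print(f"final position: {x:.2f}, energy: {energy(x):.3f}")
# The bead settles near x = +1 or x = -1, the landscape's low-energy wells.
```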

The year 1982 marked a starting point for the pharmaceutical computation industry with the program DOCK, which predicted the binding of small molecules to protein structures. In 1988, a neural network was used for secondary structure prediction from protein sequence. Several scripting languages that remain popular today emerged in the mid-to-late 1980s. In 1987, the first program allowing multiple protein sequences to be aligned together was introduced, making it possible to study the evolutionary relationships between proteins from different organisms, known as homology.
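
To illustrate what alignment programs compute, here is a minimal sketch of global pairwise alignment scoring in the Needleman–Wunsch style, the dynamic-programming idea that multiple-alignment tools build on. The sequences and scoring values are illustrative assumptions, not parameters from any particular tool.

```python
# Minimal sketch of global pairwise alignment scoring, Needleman-Wunsch
# style. Scores are illustrative: +1 match, -1 mismatch, -2 gap.

def align_score(a, b, match=1, mismatch=-1, gap=-2):
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap                      # align a[:i] to all gaps
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap                      # align b[:j] to all gaps
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag,                # substitute
                           dp[i-1][j] + gap,    # gap in b
                           dp[i][j-1] + gap)    # gap in a
    return dp[-1][-1]

# Two short, hypothetical peptide fragments:
print(align_score("HEAGAWGHEE", "PAWHEAE"))
```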

Genomics Online

The 1990s ushered in the big data revolution with genomics, structural bioinformatics, and the advent of the internet. With the Human Genome Project starting, there was a wave of algorithmic activity aimed at solving biomolecular problems. The internet led to the creation of many bioinformatics resources accessible throughout the world. In tandem, progress in biophysics and the development of force fields enhanced the accuracy of structure prediction.

Modeller, released in 1993, was the first automatic tool that could compute the three-dimensional structure of a protein sequence based on related proteins; other homology-based programs followed. Ab initio methods, by contrast, predict the three-dimensional structure of a protein from its amino acid sequence alone, without templates from related proteins. In 2000, Rosetta, a software suite for structure prediction and protein design, was released by David Baker’s group.

The early success of these tools led to a rapid rise in biologists adopting computation into their research. “The genomics revolution combined with efficient sequence-based search really accelerated what biologists saw as the benefit of computation, because you didn't have to be a modeller or a mathematician. So it became the first pass, where if you pull out a protein of interest, you search the databases. This goes hand in hand with the availability of genomes,” says Prof. Thattai.

AI-ML: Generating Novel Proteins 

Concerted efforts were made in the 1990s and 2000s towards de novo protein design. But why do we want to design novel proteins? Existing proteins solve the problems that were relevant during evolution. However, today's challenges, for instance new diseases, are very different and dynamic. New proteins would likely evolve to solve these problems, but that would take millions of years, and the urgency of contemporary issues demands faster solutions. Protein design uses the rules that cause natural proteins to fold and function to develop novel proteins. Box 2 covers the methods of de novo protein design in detail.

Box 2: De Novo Protein Design

The early designs from the 1990s were simple structures with short chains of amino acids. Most of these designs were generated through minimal, rational, or very early computational design. The complexity of designs began to increase in the 2000s. From 2015 onwards, the lengths and complexities of de novo proteins have increasingly mirrored those of natural proteins.

From A Brief History of De Novo Protein Design: Minimal, Rational, and Computational by Derek N. Woolfson

In the 2010s, AI and ML gained prominence through the use of neural networks in image recognition and natural language processing. Since protein sequences contain all the information necessary to reach the folded structure, the ideas that proved useful for associating labels with images can also help associate a folded structure with a protein sequence. One way to think about protein sequences and structures is to treat them as ‘text’ and apply language modelling algorithms that follow biological ‘grammar’ and ‘syntax’ rules. By considering proteins in this linguistic context, the algorithm not only learns relationships between amino acids but also acquires knowledge about the biological world. This enables neural networks to generate cohesive and meaningful representations of protein sequences, akin to constructing a fluent sentence or document.
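
As a concrete illustration of the ‘protein as text’ idea, here is a minimal sketch of the masked-token objective used by BERT-style language models, the family that ProteinBERT (discussed below) belongs to. The toy sequence and masking rate are illustrative assumptions; a real model would feed the masked tokens into a transformer and learn to predict the hidden residues from their context.

```python
# Minimal sketch of the "protein as language" idea: treat each amino
# acid as a token and hide a fraction of them; a BERT-style model is
# trained to recover the hidden residues from the surrounding context.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues

def mask_sequence(seq, rate=0.15):
    """Replace ~15% of residues with <MASK>; the model's training task
    is to predict the original residue at each masked position."""
    tokens, targets = [], []
    for aa in seq:
        if random.random() < rate:
            tokens.append("<MASK>")
            targets.append(aa)          # what the model must recover
        else:
            tokens.append(aa)
            targets.append(None)        # unmasked positions carry no loss
    return tokens, targets

# A toy (hypothetical) protein fragment:
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(tokens)
```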

The Critical Assessment of Techniques for Protein Structure Prediction (CASP) is a biennial community experiment, begun in 1994, that assesses methods for predicting protein structures. Protein structure prediction benefitted greatly from the influx of ideas from ML. The AlphaFold team from DeepMind won the 2018 competition and made headlines by predicting protein structures with exceptional accuracy. In the same year, ProteinGAN was introduced as a generative model that could create artificial protein sequences, marking one of the early attempts at using generative AI in protein design. AlphaFold 2, introduced in 2020, further improved upon its predecessor and surpassed the performance of many existing methods. ProteinBERT, inspired by BERT (Bidirectional Encoder Representations from Transformers), a popular model in natural language processing, was introduced in 2022. ProteinBERT captures contextual information and relationships within protein sequences, which is crucial for structure and function prediction. These neural architectures belong to the same category of deep learning networks used to produce AI-generated artwork in programs like DALL-E and text in programs like ChatGPT.

Read more about the evolution of AI and ML here.
