Harnessing the power of artificial intelligence (AI) and the world's fastest supercomputers, a research team led by the U.S. Department of Energy's (DOE) Argonne National Laboratory has developed an innovative computing framework to speed up the design of new proteins.

One of the key innovations of the team's MProt-DPO framework is its ability to integrate different types of data streams, or "multimodal data." It combines traditional protein sequence data with experimental results, molecular simulations and even text-based narratives that provide detailed insights into each protein's properties. This approach can potentially accelerate protein discovery for a wide range of applications.

"Say you want to build a new vaccine or design an enzyme that can break down plastics for recycling in an environmentally friendly way," said Arvind Ramanathan, Argonne computational biologist. "Our AI framework can help researchers zero in on promising proteins from countless possibilities, including candidates that may not exist."

Navigating the protein world

Mapping a protein's amino acid sequence to its structure and function is a long-standing research challenge. Each unique arrangement of amino acids—the building blocks of proteins—can yield different properties and behaviors. The sheer volume of potential variations makes testing them all through experiments alone impractical.

To put this in perspective, allowing any of the 20 standard amino acids at just three positions in a sequence creates 20 × 20 × 20 = 8,000 possible combinations. But most proteins are far more complex, with some research targets containing hundreds to thousands of amino acids.
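The 8,000 figure follows from the 20 standard amino acids: three positions that can each hold any of the 20 give 20^3 combinations. The short Python sketch below reproduces that count by brute force and shows how quickly it grows. The example sequence, the chosen positions and the function names are hypothetical and used only for illustration; they are not part of MProt-DPO.

```python
from itertools import product

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_variants(num_positions, alphabet_size=len(AMINO_ACIDS)):
    """Number of sequences when each chosen position can hold any amino acid."""
    return alphabet_size ** num_positions


def enumerate_variants(sequence, positions):
    """Yield every sequence obtained by placing any amino acid at the given positions."""
    for combo in product(AMINO_ACIDS, repeat=len(positions)):
        chars = list(sequence)
        for pos, amino_acid in zip(positions, combo):
            chars[pos] = amino_acid
        yield "".join(chars)


# Hypothetical 10-residue sequence with three positions opened up for change.
variants = list(enumerate_variants("MKTAYIAKQR", [1, 4, 7]))
print(len(variants))       # 8000, which includes the unmodified sequence
print(count_variants(10))  # 10240000000000 if all 10 positions may vary
```

Enumeration is feasible for three positions but not for a full-length protein, which is why the search has to be guided rather than exhaustive.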

Large language models (LLMs), which form the basis of chatbots like ChatGPT, are AI models trained on large amounts of data to detect patterns and generate new information. LLMs help researchers sift through massive datasets in science, providing insights and predictions for complex problems like protein design.

Role of AI

Building and training the framework's LLMs required powerful supercomputers, including the Aurora exascale system at the Argonne Leadership Computing Facility (ALCF). The ALCF is a DOE Office of Science user facility.

"The language models we trained are on the order of a few billion parameters," said Venkat Vishwanath, AI and machine learning team lead at the ALCF. "Supercomputers are crucial not only for training and fine-tuning the models, but also for running the end-to-end workflow. This includes performing large-scale simulations to verify the stability and catalytic activity of the generated protein sequences."

In addition to Aurora, the team deployed their framework on other top systems: Frontier at DOE's Oak Ridge National Laboratory, Alps at the Swiss National Supercomputing Centre, Leonardo at the CINECA center in Italy and the PDX machine at NVIDIA. They achieved over one exaflop of sustained performance (mixed precision) on each machine, with a peak performance of 5.57 exaflops on Aurora. The Argonne system recently earned the top spot in AI performance, achieving 10.6 exaflops on the HPL-MxP benchmark.

Surpassing an exaflop, which equals a quintillion calculations per second, highlights the immense computational power required for this effort.

Learning from outcomes

The DPO in MProt-DPO stands for Direct Preference Optimization. The DPO algorithm helps AI models improve by learning from preferred or unpreferred outcomes. By adapting DPO for protein design, the Argonne team enabled their framework to learn from experimental feedback and simulations as they happen.
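To make the idea of learning from preferred and unpreferred outcomes concrete, here is a minimal sketch of the standard DPO loss for a single pair of outcomes, written in plain Python. It illustrates the general algorithm, not the Argonne team's adapted implementation. The log-probability values are made up, and beta=0.1 is an assumed setting rather than a value taken from the article.

```python
import math


def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Each argument is the total log-probability that a model assigns to a
    sequence. "policy" is the model being trained; "ref" is a frozen
    reference copy. beta (assumed value) scales how strongly the policy is
    pushed away from the reference.
    """
    # How much more likely each sequence became relative to the reference.
    preferred_gain = policy_logp_preferred - ref_logp_preferred
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (preferred_gain - rejected_gain)
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one pair of protein sequences, where the
# "preferred" sequence is the one that scored better in an experiment or simulation.
unchanged = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy identical to reference
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)    # policy favors the preferred sequence
worsened = dpo_loss(-11.0, -11.0, -10.0, -12.0)   # policy favors the rejected sequence
print(unchanged, improved, worsened)
```

The loss falls when the model raises the probability of the preferred sequence relative to the rejected one, and rises when it does the opposite, so minimizing it nudges the model toward outcomes that experiments or simulations ranked higher.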

While generative AI techniques like LLMs have been developed for biological systems, existing tools have been limited by their inability to incorporate multimodal data. MProt-DPO, however, includes experimental data and text-based narratives that give context to each protein's behavior. This approach builds on earlier work by Ramanathan and colleagues, who created a text-guided protein design framework.

Ramanathan noted that using experimental data also helps improve the trustworthiness of their AI models.

The team tested MProt-DPO on two tasks to demonstrate its ability to handle complex protein design challenges. First, they focused on the yeast protein HIS7, using experimental data to improve the performance of various mutations. For the second task, they worked on malate dehydrogenase, an enzyme that plays a key role in how cells produce energy. Using simulation data, they optimized the design of the enzyme to improve its catalytic efficiency.

The creation of MProt-DPO is also helping to advance Argonne's broader AI for science and autonomous discovery initiatives. The tool's use of multimodal data is central to the ongoing efforts to develop AuroraGPT, a foundation model designed to aid in autonomous scientific exploration across disciplines.
