
    Accelerated Molecular Modeling with NVIDIA cuEquivariance and NVIDIA NIM microservices

    cuEquivariance expands to accelerate next-gen protein structure models

    The emergence of models like AlphaFold2 has skyrocketed the demand for faster inference and training of molecular AI models. The need for speed comes with unique computational challenges, including algorithmic complexity, memory efficiency, and strict accuracy requirements. To address this, NVIDIA collaborated with partners to provide accelerated solutions like faster equivariant operations and faster MSA generation.

    Today, we released new kernels in cuEquivariance and an NVIDIA NIM microservice to speed up the training and inference of molecular AI models, such as Boltz-2, an open-source foundation model developed by MIT and Recursion. These accelerations enable the development of more sophisticated molecular AI systems and faster insights into molecular structures at scale.

    NVIDIA cuEquivariance expands to accelerate next-gen protein structure models

    NVIDIA cuEquivariance is a CUDA-X library designed to accelerate the demanding computations of geometry-aware neural networks like MACE, Allegro, NequIP, and DiffDock. It provides highly optimized CUDA kernels and comprehensive APIs that significantly speed up core equivariance operations such as those involving Segmented Tensor Products.

    Starting with cuEquivariance v0.5, the library includes accelerated Triangle Attention and Triangle Multiplication kernels, operations pivotal to Nobel Prize-winning protein structure prediction models like AlphaFold2. With the addition of accelerated triangle operations, the impact of cuEquivariance expands to applications like protein folding, RNA/DNA binding, blind docking, protein complex prediction, and affinity scoring.

    Understanding a protein’s 3D structure is crucial because it reveals how the protein works. However, the true complexity of cells, and life itself, stems from dynamic interactions within biological complexes. These complexes aren’t just made of proteins—they’re intricate assemblies of proteins, nucleic acids (like DNA and RNA), lipids, carbohydrates, and various small molecules, all working together.

    Predicting the structure and dynamic behavior of these individual molecules and complexes represents the next frontier in molecular AI. This next scientific breakthrough can shed light on cellular pathways, identify disease mechanisms, and guide the design of drugs that precisely target specific molecular interactions.

    Proteins, RNA, and DNA are all long molecules built from repeating units: nucleotides for DNA and RNA, and amino acids for proteins. When these sequences are produced within cells, their specific arrangement of building blocks causes them to fold into intricate three-dimensional structures. These 3D shapes are critically important, as they dictate the molecule’s function and its interactions with other cellular components.

    Two fundamental, yet computationally intensive, operations in state-of-the-art geometry-aware neural networks like AlphaFold3, Proteina, Chai-1, Neo-1, and Boltz-2 are triangle multiplication and triangle attention. These operations often rank among the most time-consuming components of such models.

    Pairwise attention mechanisms, made popular through the introduction of Transformers, work by calculating the correlation of a token (a building block in Transformer vocabulary) with every other token, allowing the model to understand the relevance of a word in the context of all other words in a sentence, for example. 
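    To make the pairwise mechanism concrete, here is a minimal single-head self-attention sketch in plain PyTorch. All names are illustrative, and the learned projections and multi-head logic of a real Transformer are omitted:

```python
import torch

def pairwise_attention(x):
    """Minimal single-head scaled dot-product self-attention.

    x: (N, d) token embeddings. Each token attends to every other token,
    so the score matrix is (N, N) -- quadratic in sequence length.
    """
    d = x.shape[-1]
    q, k, v = x, x, x  # self-attention: queries, keys, values from the same tokens
    scores = q @ k.transpose(-1, -2) / d ** 0.5   # (N, N) pairwise correlations
    weights = torch.softmax(scores, dim=-1)       # each row sums to 1
    return weights @ v                            # context-mixed token representations

x = torch.randn(8, 16)
out = pairwise_attention(x)
assert out.shape == (8, 16)
```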

    Since molecular AI models are tasked with predicting a 3D structure based on a 2D representation, pairwise relationships don’t provide all the contextual information. This is where “Triangular Relationships” can act as a powerful proxy for capturing spatial relationships. For example, if a building block i is close to k, and k is close to j, i and j are likely spatially related, even without a strong direct pairwise signal.

    For a molecule with N building blocks, these operations naively exhibit O(N³) complexity. This computational intensity poses a major challenge for large molecules and complex multi-molecular assemblies, driving up compute costs and placing hard limits on how far an AI model can scale.
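    To illustrate where the O(N³) cost comes from, here is a naive sketch of the “outgoing edges” triangular update on a pair representation, with the learned projections and gating of the full AlphaFold-style module omitted for clarity:

```python
import torch

def triangle_multiply_outgoing(z):
    """Naive 'outgoing edges' triangle multiplication on a pair representation.

    z: (N, N, c) pairwise features. The edge (i, j) is updated from the pair
    of edges i->k and j->k, summed over every intermediate node k, so the
    cost is O(N^3) in the number of building blocks N.
    """
    a = z  # stand-ins for the learned linear projections of z
    b = z
    # out[i, j, c] = sum_k a[i, k, c] * b[j, k, c]  -- the triangular update
    return torch.einsum('ikc,jkc->ijc', a, b)

N, c = 16, 8
z = torch.randn(N, N, c)
out = triangle_multiply_outgoing(z)
assert out.shape == (N, N, c)
```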

    cuEquivariance accelerates Triangle Operations

    Here, we discuss the performance of the cuEquivariance forward Triangle Attention module compared to a vanilla PyTorch implementation. This benchmark measures module run time only, not the full end-to-end speed-up of inference or training. We discuss end-to-end performance benchmarks later in this post.
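    For reference, module-level wall time of the kind reported here can be estimated with a simple harness like the following sketch (CPU-safe; on a GPU you would also call torch.cuda.synchronize around the timers so kernel launches are not timed asynchronously):

```python
import time
import torch

def module_wall_time(fn, *args, warmup=3, iters=10):
    """Rough wall-clock timing of a single module call, averaged over iters.

    This is module-level measurement only, not end-to-end inference time.
    """
    for _ in range(warmup):
        fn(*args)                      # warm up caches and the allocator
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters

# Time a naive triangular update as the baseline module.
z = torch.randn(16, 16, 8)
baseline = module_wall_time(lambda t: torch.einsum('ikc,jkc->ijc', t, t), z)
assert baseline > 0
```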

    With cuEquivariance Triangle Attention kernels at BF16 precision, inference achieves up to 5x kernel-level speedups over the PyTorch implementation and a memory footprint reduced from O(N³) to O(N²). We also compared the cuEquivariance Triangle Attention forward kernel with Trifast and observed 1.5x to 2x kernel-level speedups at BF16 precision.
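    What the accelerated kernel computes can be sketched in plain PyTorch as a simplified single-head “starting node” triangle attention. Materializing the (N, N, N) logits tensor is exactly the O(N³) memory cost a fused kernel can avoid; learned projections, gating, and multiple heads are omitted, and all names here are illustrative rather than library API:

```python
import torch

def triangle_attention_start(z, bias):
    """Reference 'starting node' triangle attention in plain PyTorch.

    z:    (N, N, c) pair representation; row i holds edges sharing start node i.
    bias: (N, N) pairwise bias added to the attention logits.
    Each edge (i, j) attends over all edges (i, k) that share its start node,
    so the logits tensor below is (N, N, N) -- cubic in N.
    """
    c = z.shape[-1]
    q, k, v = z, z, z                                # stand-ins for learned projections
    logits = torch.einsum('ijc,ikc->ijk', q, k) / c ** 0.5
    logits = logits + bias[None, :, :]               # pair bias b[j, k]
    weights = torch.softmax(logits, dim=-1)          # normalize over k
    return torch.einsum('ijk,ikc->ijc', weights, v)

N, c = 12, 8
z, bias = torch.randn(N, N, c), torch.randn(N, N)
out = triangle_attention_start(z, bias)
assert out.shape == (N, N, c)
```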

    Similarly, cuEquivariance Triangle Multiplication kernels with BF16 precision offer up to 5x module-level speedups without any accuracy regressions.

    “These kernels are long-awaited and will become an integral part of the Boltz family of models, helping address the bottlenecks in speed and memory consumption,” said Gabriele Corso, researcher at MIT.

    Figure 2. Box plot comparing kernel-level wall times of PyTorch, Trifast, and cuEquivariance implementations of Triangle Attention across precisions FP32 (blue), TF32 (orange), and BF16 (green).
    Figure 3. Box plot comparing module-level wall times of PyTorch and cuEquivariance implementations of Triangle Multiplication across precisions FP32 (blue), TF32 (orange), and BF16 (green).

    On Boltz-1x, a next-generation version of Boltz-1, we compared end-to-end inference run time across precisions (TF32, FP32, and BF16) for the PyTorch, Trifast, and cuEquivariance-based implementations, using the default test dataset published by the Boltz-1x authors. Holding precision constant, moving from PyTorch BF16 to cuEquivariance BF16 yields up to a 1.75x performance boost; moving from PyTorch FP32 to cuEquivariance BF16 yields up to a 2.5x boost on Boltz-1x.

    We see up to 1.35x end-to-end training speedups going from PyTorch FP32 to cuEquivariance BF16 configurations with Boltz-1x. End-to-end speedups will vary with model architecture.

    “This extension to cuEquivariance is extremely valuable, we have already seen a more than 2-fold training and 3-fold inference speedup, dramatically cutting down model iteration cycles and unlocking inference on an order of magnitude larger molecules,” said Luca Naef, CTO at VantAI.

    The accelerations provided by cuEquivariance have been well received by several of our partners, including MIT, VantAI, Molecular Glue Labs (MGL), Dyno, Peptone, Genesis, and Xaira, who were able to test early versions and provide feedback. We are excited for the rest of the community to benefit from these accelerations and provide critical feedback that helps us improve our work, thereby pushing the envelope of scientific innovations.

    Enterprise-grade cofolding with Boltz-2 NIM for digital biology 

    Building on the success of models like Boltz-1, the next-generation Boltz-2 model, developed by the Boltz team at MIT in collaboration with Recursion, represents a significant step forward. Boltz-2 is engineered to be a larger and more capable model, with inference-time optimizations merged from Boltz-1x, and incorporating unique, state-of-the-art affinity prediction capabilities. To deliver the most accessible version of this cutting-edge model, NVIDIA is packaging Boltz-2 as an NVIDIA NIM.

    NIM microservices are easy-to-use, pre-built containers that provide optimized, production-ready inference for state-of-the-art AI models. The Boltz-2 NIM will offer researchers and developers streamlined access to its powerful capabilities, enabling real-time predictions and efficient test-time scaling for demanding drug discovery workflows. This approach democratizes access to leading-edge molecular AI, allowing a broader range of users to leverage Boltz-2’s predictive power.

    Accelerated computing for the next frontier of molecular AI

    The enhanced computational efficiency driven by cuEquivariance is paramount. For training, these accelerated kernels enable researchers to build larger foundation models that can further leverage pre-training scaling laws where increased computational throughput often correlates with improved model performance. Moreover, the resulting efficiencies in compute time and cost free up resources for more model development cycles, further pushing the boundaries of next-generation capabilities. At test-time, the accelerations facilitate more extensive in silico experiments, enabling virtual screening campaigns to scale to hundreds of thousands or even millions of inferences. 

    “NVIDIA’s cuEquivariance library delivers significant accelerations that are fundamental to structurally-aware biomolecular models like Boltz-2,” said Ben Mabey, chief technology officer at Recursion. “By tackling key compute bottlenecks, this will enable faster R&D cycles in the pharmaceutical industry’s deployment of these powerful models for drug discovery.”

    Complementing these library-level accelerations, NVIDIA is also enhancing accessibility to state-of-the-art models through products like NVIDIA NIM microservices. For example, packaging advanced models such as Boltz-2 as a NIM provides researchers and developers with a streamlined, production-ready solution for deploying these powerful capabilities, efficiently scaling demanding drug discovery workflows.

    Working with the broader scientific community, NVIDIA develops and refines foundational software like the cuEquivariance library, and delivers optimized model access through NIM. These offerings, powered by NVIDIA’s compute platform, equip the scientific community to push the boundaries of research, accelerating the journey from computational insight to real-world impact in drug development and biology more broadly.

    Try cuEquivariance today

    These accelerations are available today with a PyTorch API front-end under an Apache 2.0 license. See the cuEquivariance documentation to learn more about the accelerations, the supported precisions, and examples.

