NVIDIA cuQuantum is an SDK of optimized libraries and tools that accelerate quantum computing emulations at both the circuit and device level. With NVIDIA Tensor Core GPUs, developers can speed up quantum computer simulations based on quantum dynamics, state vectors, and tensor network methods by orders of magnitude. In many cases, this gives researchers simulations at scales and speeds that would otherwise be impossible.

What’s new in cuQuantum 25.06?

Version 25.06 updates all cuQuantum libraries: cuDensityMat, cuStateVec, and cuTensorNet. New features include gradients for quantum dynamics workflows, further optimizations for NVIDIA Grace Blackwell, NVIDIA GB200 NVL72, and NVIDIA GB300 NVL72 systems, and primitives for density matrix renormalization group (DMRG) tensor network algorithms. For more information, see the cuQuantum 25.06 release notes.

Unlocking AI for quantum processor design workflows

cuDensityMat provides new APIs that facilitate the calculation of gradients of quantum state evolution. Developers of quantum Hamiltonian dynamics frameworks and solvers can use these new APIs to efficiently backpropagate quantum dynamics simulations with respect to optimizable Hamiltonian parameters, opening an efficient route to rational quantum processing unit (QPU) design. This is critically important because it enables QPU builders to train large AI models for calibration, control, gate, and qubit design, reducing the timeline to useful quantum processors.

On the same single NVIDIA B200 GPU, cuQuantum achieves a 16.86x speedup for back-propagation and a 26.15x speedup for the forward pass when computing gradients for a fluxonium qubit system, compared with another JAX-based quantum framework.

*Figure 1. Speedups on NVIDIA B200 for both feed-forward and back-propagation for a common fluxonium qubit system consisting of a qubit and resonator*

All simulations for Figure 1 were run on one NVIDIA DGX B200 GPU. The observed speedups originate from automated exploitation of the Hamiltonian structure and reliance on efficient backend CUDA libraries.

Researchers designing fluxonium qubit-based QPUs need to compute gradients of some target cost function calculated from a simulation of a fluxonium qubit system to optimize their QPU layout and/or drive pulses. We first considered a simplified model, a qubit with 32 levels and a resonator with 255 levels, each with local dissipators, and a drive on the resonator. We calculated the gradient of the overlap of the output quantum state obtained as a result of the operator action on an input quantum state against some fictitious target. This model represents the main building block of real fluxonium qubit quantum dynamics optimization scenarios.
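To make the quantity being differentiated concrete, here is a minimal sketch in plain Python. It is a toy, not the cuDensityMat API: it assumes a two-level system, a Hamiltonian of the hypothetical form H(theta) = H0 + theta * V acting on a state vector, and a cost equal to the squared overlap with a target state. The real workload involves a much larger qubit-plus-resonator space with dissipators. All function and variable names are illustrative.

```python
# Toy illustration only: a 2-level system standing in for the far larger
# qubit-plus-resonator model. All names here are hypothetical.

def matvec(m, v):
    """Dense matrix-vector product for small lists of complex numbers."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def inner(a, b):
    """Inner product <a|b>, conjugating the first argument."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

H0 = [[1.0 + 0j, 0j], [0j, -1.0 + 0j]]   # static part (Pauli Z), assumed
V = [[0j, 1.0 + 0j], [1.0 + 0j, 0j]]     # drive operator (Pauli X), assumed

def operator_action(theta, psi):
    """Forward pass: apply H(theta) = H0 + theta * V to psi."""
    h0_psi = matvec(H0, psi)
    v_psi = matvec(V, psi)
    return [a + theta * b for a, b in zip(h0_psi, v_psi)]

def cost(theta, psi, target):
    """Squared overlap |<target| H(theta) |psi>|^2."""
    return abs(inner(target, operator_action(theta, psi))) ** 2

def cost_gradient(theta, psi, target):
    """Analytic d(cost)/d(theta) = 2 Re( conj(<t|H psi>) * <t|V psi> )."""
    overlap = inner(target, operator_action(theta, psi))
    d_overlap = inner(target, matvec(V, psi))
    return 2.0 * (overlap.conjugate() * d_overlap).real

psi = [0.6 + 0j, 0.8j]                        # normalized input state
target = [2 ** -0.5 + 0j, 2 ** -0.5 + 0j]     # fictitious target state
theta = 0.3                                   # optimizable drive amplitude

analytic = cost_gradient(theta, psi, target)
eps = 1e-6
numeric = (cost(theta + eps, psi, target) - cost(theta - eps, psi, target)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```

The central-difference check is there only to validate the analytic expression. In a realistic setting with many parameters, a back-propagated gradient avoids one extra forward pass per parameter, which is why an efficient backward operator action matters.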

Figure 1 shows the observed speedups for the feed-forward operator action and its back-propagation through the new cuDensityMat API executed on the NVIDIA B200 GPU. The observed 16-26x speedups over a GEMM-based JAX reference implementation, executed on the same GPU, are very encouraging for researchers deploying AI models for qubit design and optimization workloads that rely on auto-differentiation.
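As a rough intuition for why exploiting operator structure can beat a dense matrix multiply, consider a drive term that acts only on the resonator. The sketch below is a generic illustration in plain Python, not a description of cuDensityMat internals or of the JAX reference. It applies a resonator-local operator to a state stored as psi[q][r] and compares that with the equivalent dense Kronecker-product matrix, then counts multiply-adds for the 32-level qubit and 255-level resonator from the text. The function names and the small test operator are hypothetical.

```python
def apply_local(op, psi):
    """Apply `op` to the resonator index r of psi[q][r], leaving q untouched."""
    return [[sum(op_row[s] * row[s] for s in range(len(row))) for op_row in op]
            for row in psi]

def apply_dense(op, psi):
    """Same action via the explicit Kronecker product (identity on qubit) x op."""
    dq, dr = len(psi), len(op)
    dim = dq * dr
    flat = [amp for row in psi for amp in row]          # index i = q * dr + r
    full = [[op[i % dr][j % dr] if i // dr == j // dr else 0.0
             for j in range(dim)] for i in range(dim)]
    out = [sum(full[i][j] * flat[j] for j in range(dim)) for i in range(dim)]
    return [out[q * dr:(q + 1) * dr] for q in range(dq)]

# Small check that both routes give the same result (2-level x 3-level toy).
op = [[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
psi = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
assert apply_local(op, psi) == apply_dense(op, psi)

# Multiply-add counts for one operator action on a state vector
# with the dimensions quoted in the text.
dq, dr = 32, 255
dense_macs = (dq * dr) ** 2      # full (dq*dr) x (dq*dr) matrix times vector
local_macs = dq * dr * dr        # dr x dr operator applied to each qubit level
print(dense_macs, local_macs, dense_macs // local_macs)
```

The gap widens further for density matrices and for Hamiltonians with several local terms, though the actual speedups in Figure 1 also depend on kernel quality and memory traffic, which this count ignores.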

NVIDIA Blackwell kernel optimizations

cuStateVec introduces further custom GPU kernels that optimize more operations on the latest NVIDIA GPU architecture, delivering roughly 2-3x performance improvements over NVIDIA Hopper systems.

Figure 2 shows the speedup of B200 over H100 for the same software and algorithm, quantum phase estimation (QPE). In double precision on a 32-qubit problem, the speedup is 2.14x, and in single precision on a 33-qubit problem, it is 2.99x over the same problems on the previous-generation NVIDIA H100 GPU.

*Figure 2. Speedup of end-to-end simulation time of quantum phase estimation (QPE) on a single GPU of an NVIDIA DGX H100 compared to an NVIDIA DGX B200*
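One plausible reason the benchmark pairs 32 qubits with double precision and 33 qubits with single precision (the post does not state this) is that both state vectors occupy the same amount of memory. A quick back-of-the-envelope check, assuming one complex amplitude per basis state and no workspace overhead:

```python
def statevector_bytes(num_qubits, bytes_per_amplitude):
    """Memory for a full state vector: 2**n complex amplitudes."""
    return (2 ** num_qubits) * bytes_per_amplitude

GIB = 2 ** 30
double_gib = statevector_bytes(32, 16) / GIB   # complex128: two 8-byte floats
single_gib = statevector_bytes(33, 8) / GIB    # complex64: two 4-byte floats
print(double_gib, single_gib)                  # 64.0 64.0
```

Each additional qubit doubles the footprint, so halving the precision buys exactly one more qubit in the same memory.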

With these improvements, researchers get the best performance out of advanced NVIDIA hardware, and even more performance for operations that include batching, expectation value calculations, and collapse operators. These continued updates enable quantum computing developers to use the cutting edge of AI supercomputing hardware.

Accelerating and scaling quantum emulations with DMRG primitives

With cuTensorNet’s latest release, we ship our first matrix product state density matrix renormalization group (MPS-DMRG) primitives, enabling developers and researchers to run DMRG in the context of quantum computing simulations. By offering primitives for iteratively optimizing the fidelity of an MPS approximation to a quantum circuit, cuTensorNet makes it easy for quantum computing researchers to employ GPU acceleration for DMRG. These same primitives can also be used to perform quantum-dynamical simulations through the MPS time-dependent variational principle (MPS-TDVP) algorithm.
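To see why an MPS approximation scales where a full state vector cannot, it helps to count stored tensor elements. The sketch below is plain arithmetic, not cuTensorNet code. It assumes an open-boundary MPS with one rank-3 tensor per qubit and a cap on the bond dimension; the 50-qubit size and the cap of 256 are arbitrary illustrative choices, and the function name is hypothetical.

```python
def mps_element_count(num_sites, max_bond, phys_dim=2):
    """Total elements in an open-boundary MPS with bond dimension capped at max_bond."""
    total = 0
    for site in range(num_sites):
        # A bond can never exceed the Hilbert-space dimension on either side of it.
        left = min(phys_dim ** site, phys_dim ** (num_sites - site), max_bond)
        right = min(phys_dim ** (site + 1), phys_dim ** (num_sites - site - 1), max_bond)
        total += left * phys_dim * right
    return total

num_qubits, max_bond = 50, 256
mps_elements = mps_element_count(num_qubits, max_bond)
full_elements = 2 ** num_qubits
print(mps_elements, full_elements, full_elements / mps_elements)
```

The price of this compression is the approximation error introduced by truncating bonds, which is what a fidelity-optimizing sweep such as DMRG tries to minimize for a given bond dimension.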

These primitives are a stepping stone to many new features cuQuantum plans to support in future releases, including faster and larger-scale MPS quantum circuit simulations and approximate quantum-dynamical simulations for larger-scale QPU design. Quantum algorithm developers will have access to large-scale simulations for designing algorithms for current and near-term devices. QPU builders will be able to model longer-range interactions and larger Hilbert spaces without resorting to less accurate trajectory methods. Both capabilities reduce the timeline to useful quantum computing.

Getting started with cuQuantum

Install cuQuantum with pip install cuquantum-cu12 to start experimenting with these features, or integrate them into your framework, simulator, or solver. For other ways to get started, check out the documentation page.

Please reach out with questions, requests, or issues on GitHub.
Learn more about NVIDIA quantum computing.