
    Accelerating Embedding Lookups with cuEmbed


    NVIDIA recently released cuEmbed, a high-performance, header-only CUDA library that accelerates embedding lookups on NVIDIA GPUs. If you’re building recommendation systems, embedding operations are likely consuming significant computational resources.

    Embedding lookups present a unique optimization challenge. They’re memory-intensive operations with irregular access patterns. cuEmbed is designed specifically to address these challenges, achieving throughputs that are more than double the peak HBM memory bandwidth for power-law distributed input indices.

    In this post, I explain what embedding lookups are, why they’re critical for recommenders, and how the cuEmbed optimization techniques deliver exceptional performance. I also provide practical guidance for integrating cuEmbed into your projects, whether you’re using C++ directly or working with PyTorch.

    Recognizing that embedding use cases vary widely across applications, NVIDIA made cuEmbed completely open source. This enables you to customize and extend the core high-performance kernels.

    What are embedding lookups?

    Certain inputs are natural to process with neural networks, such as vectors of floating-point numbers or pixel values, which can be passed directly to convolutional or fully connected layers. 

    However, much data is categorical rather than numerical: it is represented as one option from a set of categories. Examples include the item being considered for recommendation (for example, a particular movie), as well as its genre or year of release.

    Embeddings are a way of translating non-numerical features into vectors of floating-point numbers to use for predictions. An embedding table is the set of all learned vectors corresponding to a non-numerical feature. The core operation optimized in cuEmbed is the embedding lookup operation (Figure 1). This operation has the following characteristics:

    • Takes one or more lookup indices that represent the category or categories being referenced as input.
    • Retrieves the corresponding rows from an embedding table.
    • (Optional) Combines these vectors, for example, through sum, mean, or concatenation.
    • Produces a dense output vector for downstream neural network processing.

    In PyTorch, these operations are accessible using nn.Embedding for single index lookups, and nn.EmbeddingBag for lookups with multiple indices.
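
    To make the lookup semantics concrete, the following is a minimal CPU reference sketch in plain C++ (a reference sketch, not cuEmbed's API): it gathers the rows named by one sample's indices and sums them into a single dense output vector.

    // Minimal CPU reference for the forward lookup described above
    // (sum-combine mode). For illustration only; not cuEmbed's API.
    #include <cstddef>
    #include <vector>

    // table is row-major with shape [num_categories x width].
    std::vector<float> embedding_lookup_ref(const std::vector<float>& table,
                                            const std::vector<int>& indices,
                                            std::size_t width) {
      std::vector<float> output(width, 0.0f);
      for (int row : indices) {
        for (std::size_t d = 0; d < width; ++d) {
          // Gather the referenced row and accumulate it into the output.
          output[d] += table[static_cast<std::size_t>(row) * width + d];
        }
      }
      return output;
    }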

    The scale of embedding operations in production systems is substantial. A typical embedding operation looks up anywhere from O(1) to O(100) indices per sample, so processing a full batch can mean loading tens of thousands or even millions of embedding rows. For more information, see Recommendation Data From Ele.me and User Behaviour Dataset from Alibaba.

    A diagram shows an embedding table with a height of “Num Categories ex: 1M” and a width of “Embedding Width ex: 256”. A vector representing indices is shown pointing to four specific rows of the embedding table. Those four rows are gathered into a third matrix, shown with a second set of arrows. This matrix then feeds into a reduction operator that may be either “Sum”, “Mean” or “Concatenation” to generate a single output vector.
    Figure 1. Embedding lookup forward pass for one batch sample

    Learning the embedding vectors requires a backward pass, in which gradients arriving from the downstream neural network are propagated back to the embedding table. This is computationally similar to the forward pass. In the forward pass, you gather and optionally accumulate all the rows referenced by the indices in a single batch sample. In the backward pass, however, you gather and accumulate the gradients destined for a single embedding row from all the batch samples whose indices referenced that row in the forward lookup.
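
    A companion sketch (again plain C++, not cuEmbed's API) of the backward accumulation for sum-combine mode: each batch sample scatters its output gradient into every row it referenced in the forward pass, so rows that appear in many samples accumulate many gradient contributions.

    // CPU reference for the backward accumulation (sum-combine mode).
    // indices/offsets describe the batch in CSR style: sample s uses
    // indices[offsets[s] .. offsets[s+1]). Not cuEmbed's API.
    #include <cstddef>
    #include <vector>

    void embedding_backward_ref(std::vector<float>& grad_table,        // [num_categories x width], zero-initialized
                                const std::vector<float>& grad_output, // [num_samples x width]
                                const std::vector<int>& indices,
                                const std::vector<int>& offsets,       // size num_samples + 1
                                std::size_t width) {
      const std::size_t num_samples = offsets.size() - 1;
      for (std::size_t s = 0; s < num_samples; ++s) {
        for (int i = offsets[s]; i < offsets[s + 1]; ++i) {
          const std::size_t row = static_cast<std::size_t>(indices[i]);
          for (std::size_t d = 0; d < width; ++d) {
            // Every sample that referenced this row contributes its gradient.
            grad_table[row * width + d] += grad_output[s * width + d];
          }
        }
      }
    }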

    Optimizing recommendation embeddings and cuEmbed

    Embedding lookup operations are memory-intensive and relatively light on floating-point math. They are a great fit for GPUs, which have terabytes per second (TB/s) of memory bandwidth available from HBM.

    It is well known that to achieve high memory bandwidth on GPUs, accesses should be coalesced across threads. However, Figure 2 shows that even relatively narrow rows, such as a width-32 float embedding (128 bytes per row), accessed uniformly at random across the table, can achieve near-peak HBM throughput. As long as accesses are aligned and coalesced within a warp, different warps can access disjoint memory locations and still achieve high memory performance, which is exactly the access pattern of an embedding lookup.

    A line graph shows the title “Uniform Index Distribution”. The y axis is the rate of processing (TB/s) with a range from 0 to 4. The x axis is the batch size, which ranges from 512 to 131072 in increasing powers of 2. A line labeled “HBM Peak BW” is flat at roughly 3.3 TB/s. Three lines represent embedding performance for widths 32, 64, and 128. At the lowest batch size, the performance ranges from 0 to just over 1 TB/s, and at the highest batch sizes the performance of all three plateaus to roughly 3 TB/s.
    Figure 2. cuEmbed performance for forward-pass embedding lookup on H100 SXM

    Figure 2 covers 10M categories, a uniform index distribution, a fixed-hotness format with hotness=64, float32 embedding type, sum combine mode, and 1K iterations.

    GPUs further shine on embedding lookups due to the bandwidth amplification effect of caches. The indices provided to embedding lookups tend to follow a power-law distribution, meaning that some items, or rows, are more popular than others. The popular rows are retained in L1 and L2 caches on the GPU. This can alleviate downstream pressure in the memory system and allow faster processing to take place on-chip. 

    The strategy used by cuEmbed to increase performance of embedding lookups is to maximize the number of loads-in-flight presented to the memory system. This is motivated by Little’s law, which relates the number of items being processed in parallel and the latency of the accesses to the overall bandwidth achieved. 
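
    In its memory-system form, Little's law can be written as

    $$\text{achieved bandwidth} \approx \frac{\text{bytes in flight}}{\text{average memory latency}}$$

    so, with latency roughly fixed, the main lever for approaching (or, when cache hits lower the effective latency, exceeding) peak HBM bandwidth is to keep more loads in flight.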

    We provision fewer threads, each with more registers, to increase the total resources devoted to global load (LDG) instructions. We also use loop unrolling to increase memory-level parallelism, loading multiple embedding rows simultaneously from a single thread. We increase the granularity of loads to 128 bits per thread (LDG.128) by using vector types (for example, float4) for loads and stores, and we preload indices into shared memory to reduce the latency of index accesses.
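
    As a rough illustration of how these techniques fit together, the following is a simplified forward-lookup kernel. It is a sketch under simplifying assumptions (fixed hotness per sample, sum combine, float embeddings, width divisible by 4), not the actual cuEmbed kernel; host allocation and error handling are omitted. It stages one sample's indices in shared memory, issues 128-bit loads through float4, and unrolls across rows so that each thread keeps several independent loads in flight.

    // Simplified sketch of the techniques described above; not the cuEmbed kernel.
    // One thread block processes one batch sample.
    __global__ void embedding_forward_sketch(const float4* __restrict__ table,   // [num_categories x width/4]
                                             const int* __restrict__ indices,    // [batch x hotness]
                                             float4* __restrict__ output,        // [batch x width/4]
                                             int width_vec,                      // embedding width / 4
                                             int hotness) {
      extern __shared__ int s_indices[];  // this sample's lookup indices

      const int sample = blockIdx.x;

      // Preload indices into shared memory to reduce index-access latency.
      for (int i = threadIdx.x; i < hotness; i += blockDim.x) {
        s_indices[i] = indices[sample * hotness + i];
      }
      __syncthreads();

      // Each thread owns a strided set of float4 columns, so every row access
      // is a 128-bit (LDG.128) load that is coalesced within the warp.
      for (int col = threadIdx.x; col < width_vec; col += blockDim.x) {
        float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);

        // Unrolling keeps several independent row loads in flight per thread.
        #pragma unroll 4
        for (int i = 0; i < hotness; ++i) {
          const float4 v = table[static_cast<long long>(s_indices[i]) * width_vec + col];
          acc.x += v.x; acc.y += v.y; acc.z += v.z; acc.w += v.w;
        }
        output[static_cast<long long>(sample) * width_vec + col] = acc;
      }
    }

    A launch of this sketch would use one block per batch sample and hotness * sizeof(int) bytes of dynamic shared memory, for example embedding_forward_sketch<<<batch_size, 128, hotness * sizeof(int)>>>(...).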

    Figure 3 shows that the rate at which embedding rows are processed can exceed 8 TB/s, more than 2x the peak HBM bandwidth.

    A line graph shows the title “Power-Law Index Distribution.” The y axis is the rate of processing (TB/s) with a range from 0 to 9. The x axis is the batch size, which ranges from 512 to 131072 in increasing powers of 2. A line labeled “HBM Peak BW” is flat at roughly 3.3 TB/s. Three lines represent embedding performance for widths 32, 64, and 128. At the lowest batch sizes, the performance ranges from 0 to just over 1 TB/s, and at the highest batch sizes the performance of all three grows to over 7.5 TB/s, with the width 32 line reaching the highest point at over 8 TB/s.
    Figure 3. cuEmbed performance for forward-pass embedding lookup on H100 SXM

    Figure 3 covers 10M categories, a power-law (PSX) index distribution with α=1.05, a fixed-hotness format with hotness=64, float32 embedding type, sum combine mode, and 1K iterations.

    Altogether, cuEmbed running on the H100 GPU is capable of processing embedding rows at TB/s rates on a wide range of configurations and at reasonable batch sizes.

    How to use cuEmbed

    cuEmbed is a C++ header-only library, so you don't have to build or install it separately. Instead, add cuEmbed as a submodule to your project and include the relevant .cuh files to access the API.

    For example, to access the embedding lookup functions, include the following:

    cuembed/include/embedding_lookup.cuh

    The backward pass requires additional transformations to the lookup indices, such as transposition:

    cuembed/include/index_transformations.cuh
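
    For example, assuming the repository root is on your include path (the exact prefix depends on how you add cuEmbed to your project), a .cu file might include both headers as follows:

    #include "cuembed/include/embedding_lookup.cuh"
    #include "cuembed/include/index_transformations.cuh"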

    You can also add cuEmbed directly to your project through CMake Package Manager (CPM). Add the following code to your CMakeLists.txt file, replacing my_library with the name of your target:

    CPMAddPackage(
      NAME cuembed
      GIT_REPOSITORY https://github.com/NVIDIA/cuembed.git
      GIT_TAG main
      OPTIONS
        "BUILD_TESTS OFF"
        "BUILD_BENCHMARKS OFF"
    )
    target_link_libraries(my_library cuembed::hdrs)

    The /examples/pytorch directory contains an example integration of cuEmbed into PyTorch. The cuEmbed forward, backward, and various helper functions are exposed to PyTorch as C++ extensions, which are further wrapped in a custom PyTorch op. Currently, only a subset of torch.nn.EmbeddingBag functionality is supported in cuEmbed and this example.
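
    As a rough sketch of what such a binding looks like (placeholder names and shapes; the real binding lives in /examples/pytorch and differs in detail), the C++ extension side allocates an output tensor and dispatches to the library, and PyTorch then wraps the bound function in a custom op:

    // Hypothetical C++ extension sketch; function names are placeholders,
    // not cuEmbed's actual PyTorch binding.
    #include <torch/extension.h>

    torch::Tensor embedding_bag_forward(torch::Tensor weight,    // [num_categories, width]
                                        torch::Tensor indices,   // flattened lookup indices
                                        torch::Tensor offsets) { // per-sample start positions
      TORCH_CHECK(weight.is_cuda(), "weight must be a CUDA tensor");
      auto output = torch::empty({offsets.size(0), weight.size(1)}, weight.options());
      // ... launch the cuEmbed forward lookup here ...
      return output;
    }

    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
      m.def("embedding_bag_forward", &embedding_bag_forward,
            "Sum-mode EmbeddingBag forward (placeholder binding)");
    }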

    For more information, including detailed documentation and examples, see the /NVIDIA/cuembed GitHub repo. cuEmbed is completely open source. The kernels are designed using C++ templates to be easy to read, so you can extend and enhance them to suit your needs while retaining the optimizations required for high performance.

    Pinterest’s success story: Faster recommender training

    We shared cuEmbed with engineers at Pinterest to test its performance on real-world recommender workloads. The response we got was encouraging.

    Chen Yang, on the Pinterest ML Foundations team, said, “Embedding lookups for sparse or categorical features are often a bottleneck in our GPU-based recommender models. After integrating cuEmbed with minimal code changes, we achieved 15-30% GPU-roofline training throughput improvements across our key production ranking and recommendation ML workloads.”

    Summary

    cuEmbed provides a high-performance solution for embedding lookups on NVIDIA GPUs, achieving throughput rates that exceed raw HBM bandwidth through optimized memory access patterns and effective cache utilization. By open-sourcing this library, we aim to enable the community to customize and extend these optimizations for diverse embedding use cases.

    Whether you’re building recommendation systems, graph neural networks, or working with language models, cuEmbed can significantly accelerate your embedding operations with minimal code changes.

    Next steps


    • Download cuEmbed from the /NVIDIA/cuembed GitHub repo.
      • For Python integration patterns, explore the /examples directory.
      • For advanced usage scenarios, see the cuEmbed documentation.