The introduction of the llm-d community at Red Hat Summit 2025 marks a significant step forward in accelerating generative AI inference innovation for the open source ecosystem. Built on top of vLLM and Inference Gateway, llm-d extends the capabilities of vLLM with a Kubernetes-native architecture for large-scale inference deployments. This post explains key NVIDIA Dynamo components that…
At NVIDIA GTC 2025, we announced NVIDIA Dynamo, a high-throughput, low-latency open-source inference serving framework for deploying generative AI and reasoning models in large-scale distributed environments. The latest v0.2 release of Dynamo includes several new features. In this post, we'll walk through these features and how they can help you get more out of your GPU investments.
NVIDIA announced the release of NVIDIA Dynamo today at GTC 2025. NVIDIA Dynamo is a high-throughput, low-latency open-source inference serving framework for deploying generative AI and reasoning models in large-scale distributed environments. The framework boosts the number of requests served by up to 30x when running the open-source DeepSeek-R1 models on NVIDIA Blackwell.
As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo. The demand for AI-enabled services continues to grow rapidly, placing increasing pressure on IT and infrastructure teams. These teams are tasked with provisioning the necessary hardware and software to meet that demand while simultaneously balancing cost efficiency with optimal user experience. This challenge was faced by the…
Generative AI models are advancing rapidly. Each new generation of models brings more parameters and longer context windows. The Llama 2 series of models, introduced in July 2023, had a context length of 4K tokens, and the Llama 3.1 models, introduced only a year later, dramatically expanded that to 128K tokens. While long context lengths allow models to perform cognitive tasks…
In this blog post, we take a closer look at chunked prefill, a feature of NVIDIA TensorRT-LLM that increases GPU utilization and simplifies the deployment experience for developers. This builds on our previous post, which discussed how advanced KV cache optimization features in TensorRT-LLM improve performance by up to 5x in use cases that require system prompts. When a user submits a request to…
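To make the idea concrete, here is a minimal Python sketch of chunked prefill under stated assumptions: `forward_fn` is a hypothetical stand-in for one transformer forward pass, and the chunk size is illustrative. This is not the TensorRT-LLM API; the real engine handles chunking internally.

```python
# Minimal sketch of chunked prefill: process a long prompt in fixed-size
# chunks instead of one monolithic pass, so prefill work can be interleaved
# with other requests' decode steps.
def chunked_prefill(prompt_tokens, forward_fn, chunk_size=512):
    """Prefill `prompt_tokens` chunk by chunk, growing the KV cache each step."""
    kv_cache = []  # accumulated key-value entries
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        # Each chunk attends to the KV entries of all earlier chunks, so the
        # final cache matches what a single full-prompt prefill would produce.
        kv_cache = forward_fn(chunk, kv_cache)
    return kv_cache
```

Because each forward pass is bounded by the chunk size rather than the full prompt length, the scheduler can keep batch sizes predictable regardless of how long individual prompts are.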
In our previous blog post, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up to 14x on x86-based NVIDIA H100 Tensor Core GPUs and 28x on the NVIDIA GH200 Superchip. In this post, we shed light on KV cache reuse techniques and best practices that can drive even further TTFT speedups. LLMs are rapidly…
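The sketch below illustrates one way prefix-based KV cache reuse can work; the block size, hash-keyed CPU store, and function names are assumptions for illustration, not the actual TensorRT-LLM implementation.

```python
import hashlib

BLOCK = 64          # tokens per cached KV block (illustrative choice)
cpu_kv_store = {}   # prefix-hash -> KV block offloaded to host memory

def _key(tokens, end):
    """Hash the token prefix tokens[:end] so identical prefixes share an entry."""
    return hashlib.sha256(repr(tokens[:end]).encode()).hexdigest()

def reusable_prefix_len(tokens):
    """Count leading tokens whose KV blocks already sit in the CPU store;
    prefill can then skip straight to the first uncached block."""
    cached = 0
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        if _key(tokens, end) in cpu_kv_store:
            cached = end  # this block was computed before; no prefill needed
        else:
            break
    return cached
```

The TTFT win comes from skipping prefill for the cached prefix: requests that share a long system prompt or conversation history only pay for the tokens that differ.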
Deploying generative AI workloads in production environments, where user numbers can fluctuate from hundreds to hundreds of thousands and where input sequence lengths differ with each request, poses unique challenges. To achieve low-latency inference in these environments, multi-GPU setups are a must, irrespective of the GPU generation or its memory capacity. To enhance inference performance in…
Deploying large language models (LLMs) in production environments often requires making hard trade-offs between enhancing user interactivity and increasing system throughput. While enhancing user interactivity requires minimizing time to first token (TTFT), increasing throughput requires increasing tokens per second. Improving one aspect often results in the decline of the other…
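A quick sketch of how these two metrics are commonly measured for a streamed request helps show why they pull in opposite directions; `stream_tokens` is a hypothetical generator yielding tokens as a server produces them.

```python
import time

def measure(stream_tokens, prompt):
    """Measure TTFT and decode throughput for one streamed request."""
    start = time.perf_counter()
    first, count = None, 0
    for _tok in stream_tokens(prompt):
        if first is None:
            first = time.perf_counter()  # TTFT clock stops at the first token
        count += 1
    total = time.perf_counter() - start
    ttft = (first - start) if first is not None else float("inf")
    throughput = count / total if total > 0 else 0.0  # tokens per second
    return ttft, throughput
```

Larger batches raise aggregate tokens per second but make each request wait longer for its first token, which is the tension this trade-off describes.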
During the 2024 OCP Global Summit, NVIDIA announced that it has contributed the NVIDIA GB200 NVL72 rack, along with its liquid-cooled compute and switch tray designs, to the Open Compute Project (OCP). This post provides details about this contribution and explains how it increases the utility of current design standards to meet the high compute density demands of modern data centers.
In the latest round of MLPerf Inference, a suite of standardized, peer-reviewed inference benchmarks, the NVIDIA platform delivered outstanding performance across the board. Among the many submissions made using the NVIDIA platform were results using the NVIDIA GH200 Grace Hopper Superchip. GH200 tightly couples an NVIDIA Grace CPU with an NVIDIA Hopper GPU using NVIDIA NVLink-C2C…
Six years ago, we embarked on a journey to develop, from the ground up, an AI inference serving solution designed for high-throughput, time-sensitive production use cases. At that time, ML developers were deploying bespoke, framework-specific AI solutions, which drove up their operational costs and failed to meet their latency and throughput service-level agreements.
With the rapid growth of generative AI, CIOs and IT leaders are looking for ways to reclaim data center resources to accommodate new AI use cases that promise greater return on investment without impacting current operations. This is leading IT decision makers to reassess past infrastructure decisions and explore strategies to consolidate traditional workloads into fewer…
Demand for data processing is growing exponentially, with global data volumes projected to reach 175 zettabytes by 2025. This contrasts sharply with the slowing pace of CPU performance improvements. For more than a decade, semiconductor advancements have not kept up with the pace predicted by Moore's Law, leading to a pressing need for more efficient computing solutions. NVIDIA GPUs have emerged as the most efficient…
As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo. AI is transforming every industry, addressing grand scientific challenges such as precision drug discovery and the development of autonomous vehicles, as well as solving commercial problems such as automating the creation of e-commerce product descriptions and extracting insights from legal contracts. Today…
As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo. Ever spotted someone in a photo wearing a cool shirt or some unique apparel and wondered where they got it? How much did it cost? Maybe you've even thought about buying one for yourself. This challenge inspired Snap's ML engineering team to introduce Screenshop, a service within Snapchat's app that uses AI to locate…
As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo. Diffusion models are transforming creative workflows across industries. These models generate stunning images from simple text or image inputs by iteratively shaping random noise into AI-generated art through denoising diffusion techniques. This can be applied to many enterprise use cases, such as creating personalized…
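For intuition, here is a toy sketch of the denoising loop at the heart of that process; `predict_noise` is a stand-in for a trained model such as a U-Net, and the update rule is greatly simplified relative to real samplers, which follow a learned noise schedule.

```python
import numpy as np

def denoise(predict_noise, shape=(64, 64, 3), steps=50):
    """Start from pure Gaussian noise and iteratively remove the noise the
    model predicts, gradually revealing an image."""
    x = np.random.randn(*shape)      # begin with random noise
    for t in reversed(range(steps)):
        eps = predict_noise(x, t)    # model's estimate of the remaining noise
        x = x - eps / steps          # simplified update: strip a fraction of it
    return x                         # progressively refined output
```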