Low-Latency Inference – NVIDIA Technical Blog
News and tutorials for developers, data scientists, and IT admins
Feed: http://www.open-lab.net/blog/feed/ (last updated 2025-05-16T23:50:38Z)

AI Fabric Resiliency and Why Network Convergence Matters, by Berkin Kartal
http://www.open-lab.net/blog/?p=98574
Published 2025-05-14T16:20:00Z, updated 2025-05-15T19:07:25Z
Image: Typical data center interconnection schema for a Clos fabric.

High-performance computing and deep learning workloads are extremely sensitive to latency. Packet loss forces retransmission or stalls in the communication pipeline, which directly increases latency and disrupts the synchronization between GPUs. This can degrade the performance of collective operations such as all-reduce or broadcast, where every GPU's participation is required before progressing.
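The synchronization requirement is easiest to see in code. Below is a minimal sketch (not from the post) of an all-reduce collective using PyTorch's torch.distributed with the NCCL backend: the call blocks until every rank has contributed its tensor, so a retransmission stall on any single link holds back the entire group. It assumes a single node launched with torchrun; the script name in the run command is hypothetical.

```python
# Minimal sketch: a blocking all-reduce across GPUs on one node.
# Assumes launch via torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes one gradient-like tensor.
    grad = torch.ones(1024, device="cuda") * (rank + 1)

    # Blocks until ALL ranks have supplied their tensor; a dropped or
    # retransmitted packet on any one link delays every GPU at this line.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    print(f"rank {rank}: reduced value = {grad[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run, for example, with `torchrun --nproc_per_node=4 allreduce_demo.py` to see all four ranks print the same reduced value only after every rank has reached the collective.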

Spotlight: Perplexity AI Serves 400 Million Search Queries a Month Using NVIDIA Inference Stack, by Amr Elmeleegy
http://www.open-lab.net/blog/?p=93396
Published 2024-12-05T17:58:43Z, updated 2025-03-18T18:26:38Z

As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo. The demand for AI-enabled services continues to grow rapidly, placing increasing pressure on IT and infrastructure teams. These teams are tasked with provisioning the necessary hardware and software to meet that demand while simultaneously balancing cost efficiency with optimal user experience. This challenge was faced by the...

Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance, by Nick Comly
http://www.open-lab.net/blog/?p=88938
Published 2024-09-26T21:44:00Z, updated 2024-11-29T21:06:06Z

Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).
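As a rough illustration of how TTFT can be measured from the client side, here is a minimal sketch against an OpenAI-compatible streaming endpoint. The base URL, API key, and model name are placeholder assumptions for illustration, not details from the post.

```python
# Minimal sketch: measure time to first token (TTFT) for a streaming request.
# Assumes an OpenAI-compatible endpoint (URL and model name are hypothetical).
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-70b-instruct",  # hypothetical deployed model name
    messages=[{"role": "user", "content": "Summarize NVLink in one sentence."}],
    stream=True,
)

ttft = None
for chunk in stream:
    # The first chunk carrying generated text marks the end of the TTFT window.
    if chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start
        break

print(f"TTFT: {ttft:.3f} s" if ttft is not None else "no tokens returned")
```

Measuring at the client captures the full user-perceived delay, including prompt ingestion on the server and network transit, which is the quantity the blog series optimizes for.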

Low Latency Inference Chapter 1: Up to 1.9x Higher Llama 3.1 Performance with Medusa on NVIDIA HGX H200 with NVLink Switch, by Ashraf Eassa
http://www.open-lab.net/blog/?p=88127
Published 2024-09-05T18:30:00Z, updated 2024-11-29T21:06:37Z
Image: An HGX H200 system.

As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that real-time generative AI applications demand. Performance depends both on the ability for the combined GPUs to process requests as "one mighty GPU" with ultra-fast GPU-to-GPU communication and advanced software able to take full...
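The post's results come from the NVIDIA inference stack; as a generic stand-in, the sketch below shows how a serving library such as vLLM can shard one model across NVLink-connected GPUs with tensor parallelism, which is one way multiple GPUs act as a single larger one. The model name and GPU count are assumptions for illustration.

```python
# Minimal sketch (not the article's setup): tensor parallelism with vLLM as a
# generic stand-in. tensor_parallel_size shards each layer's weights across
# GPUs, so every forward pass relies on GPU-to-GPU collectives over NVLink.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # assumed model id
    tensor_parallel_size=8,                          # assumed 8 GPUs per node
)

params = SamplingParams(max_tokens=64, temperature=0.0)
outputs = llm.generate(["What does NVLink Switch provide?"], params)
print(outputs[0].outputs[0].text)
```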
