This is the second post in the LLM Benchmarking series, which shows how to use GenAI-Perf to benchmark the Meta Llama 3 model when deployed with NVIDIA NIM. When building LLM-based applications, it is critical to understand the performance characteristics of these models on a given hardware. This serves multiple purposes: As a client-side LLM-focused benchmarking tool…
]]>This is the first post in the large language model latency-throughput benchmarking series, which aims to instruct developers on common metrics used for LLM benchmarking, fundamental concepts, and how to benchmark your LLM applications. The past few years have witnessed the rise in popularity of generative AI and large language models (LLMs), as part of a broad AI revolution.
]]>NVIDIA announced the release of NVIDIA Dynamo today at GTC 2025. NVIDIA Dynamo is a high-throughput, low-latency open-source inference serving framework for deploying generative AI and reasoning models in large-scale distributed environments. The framework boosts the number of requests served by up to 30x, when running the open-source DeepSeek-R1 models on NVIDIA Blackwell.
]]>As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo. Diffusion models are transforming creative workflows across industries. These models generate stunning images based on simple text or image inputs by iteratively shaping random noise into AI-generated art through denoising diffusion techniques. This can be applied to many enterprise use cases such as creating personalized…
]]>