Sharan Chetlur – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2025-01-11T17:32:51Z http://www.open-lab.net/blog/feed/ Sharan Chetlur <![CDATA[TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x]]> http://www.open-lab.net/blog/?p=92847 2025-01-11T17:32:51Z 2024-12-02T23:09:43Z NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that...]]>

NVIDIA TensorRT-LLM support for speculative decoding now delivers over 3x higher total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support for numerous popular large language models (LLMs) on NVIDIA GPUs. By adding support for speculative decoding on single-GPU and single-node multi-GPU configurations, the library further expands its supported…

Source

]]>
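The idea behind speculative decoding can be sketched in a few lines. This is an illustrative toy, not TensorRT-LLM's actual API: a cheap "draft" model proposes K tokens autoregressively, the large "target" model checks all K positions (in a real engine this is a single batched forward pass), and the longest agreeing prefix is accepted plus one corrected or bonus token from the target. The `target`/`drafty` lambdas below are made-up stand-ins for real models.

```python
# Illustrative sketch of greedy speculative decoding (assumption: simplified
# toy, not TensorRT-LLM's API). Draft proposes, target verifies.

def speculative_step(target_next, draft_next, prefix, k=4):
    """One draft-then-verify round. `target_next` / `draft_next` map a
    token sequence to that model's next greedy token."""
    # 1. Draft model autoregressively proposes k candidate tokens.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Target model verifies every position (a single parallel pass in
    #    a real engine; a simple loop here).
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            # First disagreement: keep the target's token and stop.
            accepted.append(expected)
            break
    else:
        # All k drafts accepted: the target contributes one bonus token.
        accepted.append(target_next(ctx))
    return prefix + accepted

# Toy deterministic "models": both count upward, but the draft wraps
# around at 6, so the two eventually disagree.
target = lambda seq: seq[-1] + 1
drafty = lambda seq: (seq[-1] + 1) % 6

print(speculative_step(target, drafty, [0], k=4))  # → [0, 1, 2, 3, 4, 5]
```

Because verification is one batched pass, each accepted draft token costs far less than a full target-model decode step, which is where the throughput gain comes from.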
Sharan Chetlur <![CDATA[Streamlining AI Inference Performance and Deployment with NVIDIA TensorRT-LLM Chunked Prefill]]> http://www.open-lab.net/blog/?p=92052 2024-11-15T17:59:38Z 2024-11-15T17:59:35Z In this blog post, we take a closer look at chunked prefill, a feature of NVIDIA TensorRT-LLM that increases GPU utilization and simplifies the deployment...]]>

In this blog post, we take a closer look at chunked prefill, a feature of NVIDIA TensorRT-LLM that increases GPU utilization and simplifies the deployment experience for developers. This builds on our previous post discussing how advanced KV cache optimization features in TensorRT-LLM improve performance by up to 5x in use cases that require system prefills. When a user submits a request to…

Source

]]>
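The scheduling idea behind chunked prefill can be sketched as follows. This is a simplified model of the concept, not TensorRT-LLM's scheduler: a long prompt's prefill is split into fixed-size chunks so that each iteration's token budget can mix one prefill chunk with ongoing decode requests, instead of stalling decodes behind one monolithic prefill. The `schedule` function and its parameters are hypothetical names for illustration.

```python
# Illustrative sketch of chunked-prefill scheduling (assumption: simplified
# toy, not TensorRT-LLM's scheduler).

def schedule(prompt_len, decode_reqs, chunk=8, budget=12):
    """Yield per-iteration batches of (kind, n_tokens) work items for one
    long prefill interleaved with `decode_reqs` active decode requests."""
    remaining = prompt_len
    while remaining > 0:
        batch = []
        # One prefill chunk, capped by chunk size, leftover prompt, and budget.
        n = min(chunk, remaining, budget)
        batch.append(("prefill", n))
        remaining -= n
        # Fill the rest of the iteration budget with decode steps (1 token each),
        # so decode latency stays smooth while the long prefill progresses.
        for _ in range(min(decode_reqs, budget - n)):
            batch.append(("decode", 1))
        yield batch

# A 20-token prompt with 3 active decode requests completes its prefill
# over three iterations without ever monopolizing an iteration's budget.
for batch in schedule(20, 3):
    print(batch)
```

Without chunking, the 20-token prefill would occupy entire iterations by itself and the three decode requests would see a latency spike; with chunking, every iteration still advances all of them.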