Haohang Huang – NVIDIA Technical Blog
News and tutorials for developers, data scientists, and IT admins
2025-03-11T01:44:00Z
http://www.open-lab.net/blog/feed/

NVIDIA TensorRT-LLM Now Supports Recurrent Drafting for Optimizing LLM Inference
Haohang Huang
http://www.open-lab.net/blog/?p=92963
2025-03-11T01:44:00Z 2024-12-18T17:31:01Z

Recurrent drafting (referred to as ReDrafter) is a novel speculative decoding technique for large language model (LLM) inference, developed and open-sourced by Apple and now available with NVIDIA TensorRT-LLM. ReDrafter helps developers significantly boost LLM workload performance on NVIDIA GPUs. NVIDIA TensorRT-LLM is a library for optimizing LLM inference. It provides an easy-to-use Python API to…

NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching
Haohang Huang
http://www.open-lab.net/blog/?p=93516
2024-12-12T19:35:15Z 2024-12-11T22:10:51Z

NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes inference for diverse model architectures, including the following: The addition of encoder-decoder model support further expands TensorRT-LLM capabilities, providing highly optimized inference for an even broader range of…
