Laikh Tewari – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2025-07-10T18:29:51Z http://www.open-lab.net/blog/feed/ Laikh Tewari <![CDATA[Think Smart and Ask an Encyclopedia-Sized Question: Multi-Million Token Real-Time Inference for 32X More Users]]> http://www.open-lab.net/blog/?p=102927 2025-07-10T18:29:51Z 2025-07-08T01:00:00Z Modern AI applications increasingly rely on models that combine huge parameter counts with multi-million-token context windows. Whether it is AI agents...]]>

Modern AI applications increasingly rely on models that combine huge parameter counts with multi-million-token context windows. Whether it's AI agents following months of conversation, legal assistants reasoning through gigabytes of case law as voluminous as an entire encyclopedia set, or coding copilots navigating sprawling repositories, preserving long-range context is essential for relevance and…

Source

]]>
Laikh Tewari <![CDATA[Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM]]> http://www.open-lab.net/blog/?p=95040 2025-04-23T15:02:57Z 2025-01-16T22:57:30Z Language models generate text by predicting the next token, given all the previous tokens including the input text tokens. Key and value elements of the...]]>

Language models generate text by predicting the next token, given all previous tokens, including the input text tokens. The key and value elements of those previous tokens serve as historical context in LLM serving when generating the next set of tokens. Caching these key and value elements avoids expensive recomputation and effectively leads to higher throughput. However…
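The caching idea above can be illustrated with a toy sketch (this is not TensorRT-LLM code; `kv_states` is a hypothetical stand-in for the per-token key/value projection). Without a cache, every decode step recomputes key/value states for the entire prefix, giving quadratic total work; with a cache, each token's states are computed exactly once:

```python
def kv_states(token):
    """Hypothetical stand-in for the per-token key/value projection."""
    return (token * 2, token * 3)  # (key, value)

def decode_without_cache(tokens):
    """Recompute K/V for the whole prefix at every step: O(n^2) projections."""
    work = 0
    for step in range(1, len(tokens) + 1):
        _ = [kv_states(t) for t in tokens[:step]]  # redundant recomputation
        work += step
    return work

def decode_with_cache(tokens):
    """Compute each token's K/V once and reuse it: O(n) projections."""
    cache, work = [], 0
    for t in tokens:
        cache.append(kv_states(t))  # stored for all later attention steps
        work += 1
    return work

# For a 100-token sequence: 5050 projections without the cache, 100 with it.
print(decode_without_cache(list(range(100))))  # → 5050
print(decode_with_cache(list(range(100))))     # → 100
```

The real KV cache holds per-layer attention tensors rather than toy tuples, but the work-saving structure is the same, which is why cache reuse across requests compounds the benefit.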

Source

]]>
Laikh Tewari <![CDATA[TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x]]> http://www.open-lab.net/blog/?p=92847 2025-01-11T17:32:51Z 2024-12-02T23:09:43Z NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that...]]>

NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support for numerous popular large language models (LLMs) on NVIDIA GPUs. By adding support for speculative decoding on single-GPU and single-node multi-GPU configurations, the library further expands its supported…
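Speculative decoding can be sketched in miniature as follows (illustrative only, not the TensorRT-LLM API; both `draft_model` and `target_model` are hypothetical toy functions). A cheap draft model proposes several tokens ahead, and the expensive target model verifies them, accepting the longest agreeing prefix; whenever the draft is right, multiple tokens are emitted for roughly the cost of one target step:

```python
def draft_model(prefix, k=4):
    """Hypothetical cheap model: guesses the next k tokens."""
    return [(prefix[-1] + i + 1) % 7 for i in range(k)]

def target_model(prefix):
    """Hypothetical reference model: the single token it would emit next."""
    return (sum(prefix) * 31 + 7) % 7

def speculative_step(prefix, k=4):
    """One speculation round: returns the tokens accepted this round."""
    proposed = draft_model(prefix, k)
    accepted = []
    for tok in proposed:
        # Real systems score all k positions in ONE target forward pass;
        # this loop is sequential only for clarity.
        correct = target_model(prefix + accepted)
        if tok == correct:
            accepted.append(tok)       # draft guessed right: keep going
        else:
            accepted.append(correct)   # first miss: take the target's token
            break
    return accepted

# Every round emits at least one token, and up to k+0 when the draft agrees.
print(speculative_step([1, 2, 3]))
```

Because verification always yields at least one correct token per round, the output distribution matches greedy decoding with the target model alone; the speedup comes purely from how often the draft's guesses are accepted.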

Source

]]>