Jie-Fang Zhang – NVIDIA Technical Blog
News and tutorials for developers, data scientists, and IT admins
Feed: http://www.open-lab.net/blog/feed/ (last updated 2024-12-12)

NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200
By Jie-Fang Zhang | Published 2024-11-22 | http://www.open-lab.net/blog/?p=92591

Generative AI models are advancing rapidly. Each generation of models comes with more parameters and a longer context window. The Llama 2 series of models, introduced in July 2023, had a context length of 4K tokens; the Llama 3.1 models, introduced only a year later, dramatically expanded that to 128K tokens. While long context lengths allow models to perform cognitive tasks…
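One reason long contexts are costly is that the KV cache grows linearly with sequence length. A minimal back-of-the-envelope sketch, assuming illustrative Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dimension 128, FP16) rather than figures from the post itself:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache for one sequence: a K and a V tensor per layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len

for ctx in (4 * 1024, 128 * 1024):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:5.1f} GiB KV cache")
# 4K tokens -> 2.0 GiB; 128K tokens -> 64.0 GiB for a single sequence
```

Under these assumed dimensions, a 32x longer context means a 32x larger cache per sequence, which is why attention over long sequences becomes memory-bandwidth-bound and benefits from schemes like multiblock attention.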

Boosting Llama 3.1 405B Throughput by Another 1.5x on NVIDIA H200 Tensor Core GPUs and NVLink Switch
By Jie-Fang Zhang | Published 2024-10-09 | http://www.open-lab.net/blog/?p=90040

The continued growth of LLM capabilities, fueled by increasing parameter counts and support for longer contexts, has led to their use in a wide variety of applications, each with diverse deployment requirements. For example, a chatbot serves a small number of users at very low latency for good interactivity, while synthetic data generation requires high throughput to process many items…
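The latency/throughput tension above can be sketched with a toy model of batched decoding. All numbers here are assumptions for illustration, not measurements from the post: decode is memory-bound (dominated by loading the weights) until per-sequence compute of a large batch becomes the bottleneck.

```python
def decode_step_ms(batch, weight_ms=10.0, per_seq_ms=0.5):
    """Toy per-step decode time: weight-load floor vs. per-sequence compute."""
    return max(weight_ms, batch * per_seq_ms)

for b in (1, 8, 32, 128):
    step = decode_step_ms(b)
    throughput = b / step * 1000  # tokens/sec summed across the batch
    print(f"batch={b:>3}: {step:5.1f} ms per token per user, {throughput:6.0f} tok/s total")
```

In this sketch, growing the batch from 1 to 8 multiplies total throughput with no latency cost (the step stays at the 10 ms weight-load floor), while very large batches trade per-user latency for aggregate throughput. That is the tradeoff a chatbot resolves one way and a synthetic-data pipeline the other.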
