This is the first part of a two-part series discussing the NVIDIA Triton Inference Server’s FasterTransformer (FT) library, one of the fastest libraries for distributed inference of transformers of any size (up to trillions of parameters). It provides an overview of FasterTransformer, including the benefits of using the library. Join the NVIDIA Triton and NVIDIA TensorRT community to stay…
]]>This is the second part of a two-part series about NVIDIA tools that allow you to run large transformer models for accelerated inference. For an introduction to the FasterTransformer library (Part 1), see Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server. Join the NVIDIA Triton and NVIDIA TensorRT community to stay current on the latest product updates…
]]>