State-of-the-art image diffusion models take tens of seconds to process a single image. This makes video diffusion even more challenging, as it demands significant computational resources and drives up costs. By leveraging the latest FP8 quantization features on NVIDIA Hopper GPUs with NVIDIA TensorRT, it’s possible to significantly reduce inference costs and serve more users with fewer GPUs.
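As a rough sketch of where FP8 comes in at engine-build time, the snippet below uses the TensorRT Python builder API. The unet.onnx file name is a placeholder, and it assumes TensorRT 10 (explicit batch by default), a Hopper-class GPU, and an ONNX graph that already carries FP8 quantize/dequantize nodes from a quantization toolkit; the builder flag only allows FP8 kernels to be selected.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Explicit-batch network (the default in TensorRT 10).
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("unet.onnx"):  # placeholder model path
    raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # keep FP16 as a fallback precision
config.set_flag(trt.BuilderFlag.FP8)   # allow FP8 kernels (Hopper, TensorRT >= 9)

# Serialize the engine so it can be deployed without rebuilding.
plan = builder.build_serialized_network(network, config)
with open("unet_fp8.plan", "wb") as f:
    f.write(plan)
```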
NVIDIA TensorRT, an established inference library for data centers, has rapidly emerged as a desirable inference backend for NVIDIA GeForce RTX and NVIDIA RTX GPUs. Now, deploying TensorRT into apps has gotten even easier with prebuilt TensorRT engines. The newly released TensorRT 10.0 with weight-stripped engines offers a unique solution for minimizing the engine shipment size by reducing…
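A minimal sketch of that workflow, assuming the TensorRT 10 Python API (BuilderFlag.STRIP_PLAN and OnnxParserRefitter) and a placeholder model.onnx: the engine is built without weights, then refitted from the original ONNX file on the target machine.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Build an engine with the weights stripped out of the serialized plan.
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
parser.parse_from_file("model.onnx")  # placeholder model path

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.STRIP_PLAN)  # ship a much smaller, weight-less plan
plan = builder.build_serialized_network(network, config)

# On the target machine: deserialize and refit weights from the original ONNX file.
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(plan)
refitter = trt.Refitter(engine, logger)
parser_refitter = trt.OnnxParserRefitter(refitter, logger)
assert parser_refitter.refit_from_file("model.onnx")
assert refitter.refit_cuda_engine()
```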
Video quality metrics are used to evaluate the fidelity of video content. They provide a consistent quantitative measurement to assess the performance of the encoder. VMAF (Video Multimethod Assessment Fusion) combines human vision modeling with machine learning techniques that are continuously evolving, enabling it to adapt to new content. VMAF excels in aligning with human visual perception by combining detailed analysis…
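One common way to measure VMAF in practice is FFmpeg’s libvmaf filter. The sketch below simply invokes ffmpeg from Python; it assumes an FFmpeg build with libvmaf enabled, placeholder file names, and the JSON log layout of recent libvmaf versions.

```python
import json
import subprocess

def vmaf_score(distorted: str, reference: str, log_path: str = "vmaf.json") -> float:
    """Return the pooled mean VMAF score of `distorted` against `reference`."""
    cmd = [
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log_path}",
        "-f", "null", "-",
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    with open(log_path) as f:
        # Field names assume a recent libvmaf JSON log.
        return json.load(f)["pooled_metrics"]["vmaf"]["mean"]

print(vmaf_score("encoded.mp4", "reference.mp4"))
```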
This post is the fifth in a series about optimizing end-to-end AI. NVIDIA TensorRT is a solution for speed-of-light inference deployment on NVIDIA hardware. Provided with an AI model architecture, TensorRT can be used pre-deployment to run an exhaustive search for the most efficient execution strategy. TensorRT optimizations include reordering operations in a graph…
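The sketch below, assuming the TensorRT Python API and a placeholder model.onnx, shows two knobs around that search: the builder optimization level, which trades build time for a wider tactic search, and a timing cache that persists the profiling results so later builds can skip them.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
parser.parse_from_file("model.onnx")  # placeholder model path

config = builder.create_builder_config()
config.builder_optimization_level = 3  # higher levels search more kernel tactics (TRT >= 8.6)

# Cache the results of the tactic search so subsequent builds are faster.
cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch=False)

plan = builder.build_serialized_network(network, config)
with open("model.timing.cache", "wb") as f:
    f.write(cache.serialize())
```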
This post is the fourth in a series about optimizing end-to-end AI. As explained in the previous post in the End-to-End AI for NVIDIA-Based PCs series, there are multiple execution providers (EPs) in ONNX Runtime that enable the use of hardware-specific features or optimizations for a given deployment scenario. This post covers the CUDA EP and TensorRT EP using the highly optimized NVIDIA…
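For context, registering those EPs from Python looks roughly like the snippet below; the model path, input shape, and the trt_fp16_enable option are illustrative. ONNX Runtime assigns graph nodes to the providers in the order given, falling back when an operator is not supported.

```python
import numpy as np
import onnxruntime as ort

# Prefer the TensorRT EP, then the CUDA EP, then the CPU fallback.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input shape
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```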