Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B-parameter and 90B-parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for great performance and…
Microsoft Bing Visual Search enables people around the world to find content using photographs as queries. The heart of this capability is Microsoft's TuringMM visual embedding model, which maps images and text into a shared high-dimensional space. Because the service operates on billions of images across the web, performance is critical. This post details efforts to optimize the TuringMM pipeline using NVIDIA…
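For readers unfamiliar with shared embedding spaces, here is a minimal, self-contained sketch of how retrieval works once an image and several captions have been embedded. The random vectors below are stand-ins for the outputs of real image and text encoders such as TuringMM; the dimensions and setup are purely illustrative.

```python
import numpy as np

# Stand-in embeddings for one image and three candidate captions; a real
# system would produce these with the image and text encoder towers.
rng = np.random.default_rng(0)
image_vec = rng.standard_normal(512)
text_vecs = rng.standard_normal((3, 512))

# L2-normalize so that a dot product equals cosine similarity
image_vec /= np.linalg.norm(image_vec)
text_vecs /= np.linalg.norm(text_vecs, axis=1, keepdims=True)

scores = text_vecs @ image_vec          # higher score = closer in the shared space
print("best caption index:", scores.argmax())
```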
NVIDIA TensorRT, an established inference library for data centers, has rapidly emerged as a desirable inference backend for NVIDIA GeForce RTX and NVIDIA RTX GPUs. Now, deploying TensorRT into apps has gotten even easier with prebuilt TensorRT engines. The newly released TensorRT 10.0 with weight-stripped engines offers a unique solution for minimizing the engine shipment size by reducing…
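As a rough build-side sketch (the flag and API names follow the TensorRT 10 Python bindings as I understand them, and model.onnx is a placeholder path), a weight-stripped engine might be produced like this:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)          # explicit-batch network
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.STRIP_PLAN)  # serialize the engine without weights
with open("model_stripped.plan", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
# At install time, the weights are refit back into the engine,
# for example from the original ONNX file.
```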
In this post, we delve deeper into the inference optimization process to improve the performance and efficiency of our machine learning models during the inference stage. We discuss the techniques employed, such as inference computation graph simplification, quantization, and lowering precision. We also showcase the benchmarking results of our scene text detection and recognition models…
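As an example of one such technique, ONNX Runtime ships a post-training dynamic quantization utility. The sketch below uses a hypothetical text_detector.onnx model file to show the shape of the call:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Post-training dynamic quantization: weights are stored as INT8 on disk
# and activations are quantized on the fly at run time.
quantize_dynamic(
    model_input="text_detector.onnx",        # hypothetical FP32 model
    model_output="text_detector_int8.onnx",  # smaller, faster INT8 variant
    weight_type=QuantType.QInt8,
)
```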
To make scene text detection and recognition work on irregular text or for specific use cases, you must have full control of your model so that you can perform incremental learning or fine-tuning for your particular use cases and datasets. Keep in mind that this pipeline is the main building block of scene understanding, AI-based inspection, and document processing platforms. It should be accurate and have low…
Identifying and recognizing text from natural scenes and images has become important for use cases such as video caption text recognition, detecting signboards from vehicle-mounted cameras, information retrieval, scene understanding, vehicle number plate recognition, and recognizing text on products. Most of these use cases require near real-time performance. The common technique for text…
This post is part of a series about optimizing end-to-end AI. The performance of AI models is heavily influenced by the precision of the computational resources being used. Lower precision can lead to faster processing speeds and reduced memory usage, while higher precision can contribute to more accurate results. Finding the right balance between precision and performance is crucial for…
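A quick way to observe that trade-off in practice is PyTorch's autocast, which runs matmul-heavy operations in FP16 while keeping numerically sensitive ones in FP32. The toy model below is purely illustrative:

```python
import torch
from torch import nn

# Toy model, chosen only to make the example self-contained
model = nn.Linear(512, 512).cuda().eval()
x = torch.randn(8, 512, device="cuda")

with torch.no_grad():
    y_fp32 = model(x)                         # full-precision baseline

# Mixed precision: matmul-heavy ops run in FP16, sensitive ops stay in FP32
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y_fp16 = model(x)

# Quantify the accuracy cost of the lower precision
print((y_fp32 - y_fp16.float()).abs().max().item())
```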
This post is part of a series about optimizing end-to-end AI. While NVIDIA hardware can process the individual operations that constitute a neural network incredibly fast, it is important to ensure that you are using the tools correctly. Using tools such as ONNX Runtime or TensorRT out of the box with ONNX usually gives you good performance, but why settle for good performance…
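One knob beyond the out-of-the-box defaults is ONNX Runtime's graph optimization level. This sketch (model.onnx is a placeholder path) enables the full set of graph rewrites and dumps the optimized graph so you can inspect what was fused:

```python
import onnxruntime as ort

opts = ort.SessionOptions()
# Enable the full set of graph rewrites (constant folding, node fusion, ...)
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Save the optimized graph so you can inspect what actually got fused
opts.optimized_model_filepath = "model_optimized.onnx"

sess = ort.InferenceSession(
    "model.onnx",                     # placeholder model path
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```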
This post is the fifth in a series about optimizing end-to-end AI. NVIDIA TensorRT is a solution for speed-of-light inference deployment on NVIDIA hardware. Provided with an AI model architecture, TensorRT can be used pre-deployment to run an exhaustive search for the most efficient execution strategy. TensorRT optimizations include reordering operations in a graph…
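In the Python API, that pre-deployment search is driven by the builder configuration. The sketch below uses placeholder file names, and builder_optimization_level is assumed from recent TensorRT releases; raising it trades longer build time for a wider tactic search:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:          # placeholder model path
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # let the search consider FP16 kernels
config.builder_optimization_level = 5        # widen the tactic search (slower build)
plan = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(plan)
```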
Learn how AI is boosting creative applications for creators during NVIDIA GTC 2023, March 20-23.
This post is the fourth in a series about optimizing end-to-end AI. As explained in the previous post in the End-to-End AI for NVIDIA-Based PCs series, there are multiple execution providers (EPs) in ONNX Runtime that enable the use of hardware-specific features or optimizations for a given deployment scenario. This post covers the CUDA EP and TensorRT EP using the highly optimized NVIDIA…
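Selecting EPs is a one-line change at session creation. In this sketch the TensorRT EP is tried first (the trt_fp16_enable option name is taken from the ONNX Runtime documentation as I recall it), falling back to CUDA and then CPU:

```python
import onnxruntime as ort

# EPs are tried in priority order; unavailable ones are skipped at run time.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),  # option name per ORT docs
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)
print(sess.get_providers())   # shows which EPs were actually activated
```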
This post is the third in a series about optimizing end-to-end AI. When your model has been converted to the ONNX format, there are several ways to deploy it, each with advantages and drawbacks. One method is to use ONNX Runtime. ONNX Runtime serves as the backend, reading a model from an intermediate representation (ONNX), handling the inference session, and scheduling execution on an…
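A minimal inference session looks like the following; the model path and input shape are assumed for illustration:

```python
import numpy as np
import onnxruntime as ort

# Load the ONNX intermediate representation and start an inference session
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = sess.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # shape assumed for illustration
outputs = sess.run(None, {input_name: x})   # None = return all model outputs
print(outputs[0].shape)
```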
This post is the second in a series about optimizing end-to-end AI. In this post, I discuss how to use ONNX to transition your AI models from research to production while avoiding common mistakes. Considering that PyTorch has become the most popular machine learning framework, all my examples use it, but I also supply references to TensorFlow tutorials. ONNX (Open Neural Network…
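A typical export, plus the validation step that catches many such mistakes early, might look like this (ResNet-18 is chosen purely as an example; dynamic_axes avoids baking a fixed batch size into the graph):

```python
import torch
import torchvision
import onnx

# Example model only; any traced or scripted PyTorch module works the same way
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model, dummy, "resnet18.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # keep batch flexible
    opset_version=17,
)

# Validate the exported graph before shipping it anywhere
onnx.checker.check_model(onnx.load("resnet18.onnx"))
```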
This post is the first in a series about optimizing end-to-end AI. The great thing about the GPU is that it offers tremendous parallelism; it allows you to perform many tasks at the same time. At its most granular level, this comes down to the fact that there are thousands of tiny processing cores that run the same instruction at the same time. But that is not where such parallelism stops.
Join the NVIDIA Triton and NVIDIA TensorRT community to stay current on the latest product updates, bug fixes, content, best practices, and more. Every AI application needs a strong inference engine. Whether you're deploying an image recognition service, intelligent virtual assistant, or a fraud detection application, a reliable inference server delivers fast, accurate…
TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and a runtime that delivers low latency and high throughput for deep learning applications. TensorRT uses the ONNX format as an intermediate representation for converting models from major frameworks such as TensorFlow and PyTorch. In this post, you learn how to convert PyTorch…
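Once the ONNX model has been built into an engine (see the builder sketch above), running it from Python might look like the following. The tensor names and shapes match the hypothetical export above, and set_tensor_address/execute_async_v3 follow the TensorRT 10 bindings as I understand them:

```python
import tensorrt as trt
import torch

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.plan", "rb") as f:          # engine built from the exported ONNX
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate I/O on the GPU with torch and hand TensorRT the raw pointers;
# tensor names ("input", "logits") and shapes are assumed from the export above.
inp = torch.randn(1, 3, 224, 224, device="cuda")
out = torch.empty(1, 1000, device="cuda")
context.set_tensor_address("input", inp.data_ptr())
context.set_tensor_address("logits", out.data_ptr())

stream = torch.cuda.Stream()
context.execute_async_v3(stream.cuda_stream)
stream.synchronize()
```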
Microsoft and NVIDIA have collaborated to build, validate, and publish the ONNX Runtime Python package and Docker container for the NVIDIA Jetson platform, now available on the Jetson Zoo. Today's release of ONNX Runtime for Jetson extends the performance and portability benefits of ONNX Runtime to Jetson edge AI systems, allowing models from many different frameworks…
As more and more deep learning models are deployed into production environments, there is a growing need to separate work on the model itself from the work of integrating it into a production pipeline. Windows ML meets this demand by enabling efficient deployment of pretrained deep learning models into Windows applications. Developing and training the model itself…
Every year, clever researchers introduce ever more complex and interesting deep learning models to the world. There is of course a big difference between a model that works as a nice demo in isolation and a model that performs a function within a production pipeline. This is particularly pertinent to creative apps where generative models must run with low latency to generate or enhance image…
NVIDIA has released TensorRT 4 at CVPR 2018. This new version of NVIDIA's powerful inference optimizer and runtime engine provides several new capabilities. Additional features include the ability to execute custom neural network layers using FP16 precision and support for the Xavier SoC through NVIDIA DRIVE AI platforms. TensorRT 4 speeds up deep learning inference applications such as neural machine…