Documents such as PDFs, graphs, charts, and dashboards are rich sources of data that, when extracted and organized, provide insights that inform decision-making. From automating financial statement processing to improving business intelligence workflows, intelligent document processing is becoming a core component of AI solutions in enterprises. Organizations can accelerate the AI…
Vision language models (VLMs) have transformed video analytics by enabling broader perception and richer contextual understanding compared to traditional computer vision (CV) models. However, challenges like limited context length and lack of audio transcription still exist, restricting how much video a VLM can process at a time. To overcome this, the NVIDIA AI Blueprint for video search and…
When interacting with transformer-based models like large language models (LLMs) and vision-language models (VLMs), the structure of the input shapes the model's output. But prompts are often more than a simple user query. In practice, they optimize the response by dynamically assembling data from various sources such as system instructions, context data, and user input.
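As a rough, hypothetical sketch of that assembly step (the helper and field names below are illustrative, not from the post), a prompt can be built by combining system instructions, retrieved context, and the user query into a chat-style message list:

```python
# Illustrative sketch only: dynamically assembling a prompt from system
# instructions, retrieved context, and the user's query. The helper name
# and example strings are hypothetical.
def build_messages(system_instructions: str, context_chunks: list[str], user_query: str) -> list[dict]:
    """Combine the three prompt components into one chat-style message list."""
    context_block = "\n\n".join(context_chunks)
    return [
        {"role": "system", "content": system_instructions},
        {"role": "user", "content": f"Context:\n{context_block}\n\nQuestion: {user_query}"},
    ]

messages = build_messages(
    system_instructions="Answer only from the provided context.",
    context_chunks=["Q3 revenue grew 12% year over year."],
    user_query="How did revenue change in Q3?",
)
```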
This is the first post in the LLM Benchmarking series, which shows how to use GenAI-Perf to benchmark the Meta Llama 3 model when deployed with NVIDIA NIM. Researchers from University College London (UCL) Deciding, Acting, and Reasoning with Knowledge (DARK) Lab leverage NVIDIA NIM microservices in their new game-based benchmark suite, Benchmarking Agentic LLM and VLM Reasoning On Games…
The growing volume and complexity of medical data, together with the pressing need for early disease diagnosis and improved healthcare efficiency, are driving unprecedented advancements in medical AI. Among the most transformative innovations in this field are multimodal AI models that simultaneously process text, images, and video. These models offer a more comprehensive understanding of patient data than…
Large language models (LLMs) have shown remarkable generalization capabilities in natural language processing (NLP). They are used in a wide range of applications, including translation, digital assistants, recommendation systems, context analysis, code generation, cybersecurity, and more. In automotive applications, there is growing demand for LLM-based solutions for both autonomous driving and…
In today's data-driven world, the ability to retrieve accurate information from even modest amounts of data is vital for developers seeking streamlined, effective solutions for quick deployments, prototyping, or experimentation. One of the key challenges in information retrieval is managing the diverse modalities in unstructured datasets, including text, PDFs, images, tables, audio, video…
Vision language models (VLMs) are evolving at a breakneck speed. In 2020, the first VLMs revolutionized the generative AI landscape by bringing visual understanding to large language models (LLMs) through the use of a vision encoder. These initial VLMs were limited in their abilities, only able to understand text and single image inputs. Fast-forward a few years and VLMs are now capable of…
Master prompt engineering, fine-tuning, and customization to build video analytics AI agents.
The introduction of the NVIDIA Jetson Orin Nano Super Developer Kit sparked a new age of generative AI for small edge devices. The new Super Mode delivered an unprecedented generative AI performance boost of up to 1.7x on the developer kit, making it the most affordable generative AI supercomputer. JetPack 6.2 is now available to support Super Mode for Jetson Orin Nano and Jetson Orin NX…
This post was originally published July 29, 2024 but has been extensively revised with NVIDIA AI Blueprint information. Traditional video analytics applications and their development workflow are typically built on fixed-function, limited models that are designed to detect and identify only a select set of predefined objects. With generative AI, NVIDIA NIM microservices…
Now available in preview, NVIDIA VILA is an advanced multimodal VLM that provides visual understanding of multiple images and video.
Building a question-answering chatbot with large language models (LLMs) is now a common workflow for text-based interactions. What about creating an AI system that can answer questions about video and image content? This presents a far more complex task. Traditional video analytics tools struggle due to their limited functionality and a narrow focus on predefined objects.
The exponential growth of visual data, ranging from images to PDFs to streaming videos, has made manual review and analysis virtually impossible. Organizations are struggling to transform this data into actionable insights at scale, leading to missed opportunities and increased risks. To solve this challenge, vision-language models (VLMs) are emerging as powerful tools…
Expanding the open-source Meta Llama collection of models, the Llama 3.2 collection includes vision language models (VLMs), small language models (SLMs), and an updated Llama Guard model with support for vision. When paired with the NVIDIA accelerated computing platform, Llama 3.2 offers developers, researchers, and enterprises valuable new capabilities and optimizations to realize their…
Vision-language models (VLMs) combine the powerful language understanding of foundational LLMs with the vision capabilities of vision transformers (ViTs) by projecting text and images into the same embedding space. They can take unstructured multimodal data, reason over it, and return the output in a structured format. Building on a broad base of pretraining, they can be easily adapted for…
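To make the shared-embedding-space idea concrete, here is a minimal sketch using the openly available CLIP model through Hugging Face Transformers. It illustrates projecting images and text into one space and comparing them, not the specific VLM architecture described in the post; the image path is a placeholder.

```python
# Sketch: embed an image and candidate captions in the same space with CLIP,
# then score cross-modal similarity. "chart.png" is a placeholder path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chart.png")
texts = ["a bar chart of quarterly revenue", "a photo of a cat"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image and text features now live in the same embedding space,
# so cosine similarity measures cross-modal relevance.
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```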
An exciting breakthrough in AI technology, vision language models (VLMs) offer a more dynamic and flexible method for video analysis. VLMs enable users to interact with image and video input using natural language, making the technology more accessible and adaptable. These models can run on the NVIDIA Jetson Orin edge AI platform or discrete GPUs through NVIDIA NIM microservices. This blog post explores how to build…
Full fine-tuning (FT) is commonly employed to tailor general pretrained models for specific downstream tasks. To reduce the training cost, parameter-efficient fine-tuning (PEFT) methods have been introduced to fine-tune pretrained models with a minimal number of parameters. Among these, Low-Rank Adaptation (LoRA) and its variants have gained considerable popularity because they avoid additional…
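As a rough illustration of the LoRA idea (a from-scratch sketch, not the implementation referenced in the post), a frozen pretrained linear layer can be augmented with a trainable low-rank update so that only a small number of parameters are learned:

```python
# Sketch of LoRA: keep the pretrained weight frozen and learn a low-rank
# update W + (alpha / r) * B @ A with far fewer trainable parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank path adds only r * (in_features + out_features) trainable parameters.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
out = layer(torch.randn(2, 4096))
```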
NVIDIA JetPack SDK powers NVIDIA Jetson modules, offering a comprehensive solution for building end-to-end accelerated AI applications. JetPack 6 expands the Jetson platform's flexibility and scalability with microservices and a host of new features. It's the most downloaded version of JetPack in 2024. With the JetPack 6.0 production release now generally available…
Note: As of January 6, 2025, VILA is now part of the Cosmos Nemotron VLM family. NVIDIA is proud to announce the release of NVIDIA Cosmos Nemotron, a family of state-of-the-art vision language models (VLMs) designed to query and summarize images and videos from physical or virtual environments. Cosmos Nemotron builds upon NVIDIA's groundbreaking visual understanding research including VILA…
Note: As of January 6, 2025, VILA is now part of the new Cosmos Nemotron vision language models. Visual language models have evolved significantly in recent years. However, existing technology typically supports only a single image: it cannot reason across multiple images, support in-context learning, or understand videos, and it is not optimized for inference speed. We developed VILA…
Recently, NVIDIA unveiled Jetson Generative AI Lab, which empowers developers to explore the limitless possibilities of generative AI in a real-world setting with NVIDIA Jetson edge devices. Unlike other embedded platforms, Jetson is capable of running large language models (LLMs), vision transformers, and stable diffusion locally. That includes the largest Llama-2-70B model on Jetson AGX Orin at…
The NVIDIA Jetson Orin Nano and Jetson AGX Orin Developer Kits are now available at a discount for qualified students, educators, and researchers. Since its initial release almost 10 years ago, the NVIDIA Jetson platform has set the global standard for embedded computing and edge AI. These high-performance, low-power modules and developer kits for deep learning and computer vision give developers…