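Run Multimodal Extraction for More Efficient AI Pipelines Using One GPU

As enterprises generate and consume increasingly diverse data, extracting insights from multimodal documents such as PDFs and presentations has become a major challenge. Traditional text-only extraction and basic retrieval-augmented generation (RAG) pipelines fall short because they cannot capture the full value of these complex documents. The result is missed insights, inefficient workflows, and rising operational costs.

In this post, we walk through the key components of an effective multimodal extraction pipeline, with code examples that use NVIDIA NeMo Retriever microservices and a single GPU.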

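NVIDIA NeMo Retriever extraction pipeline

NeMo Retriever extraction is an example architecture for a multimodal document processing pipeline. It uses microservices to efficiently extract information from different file types across millions of documents. Paired with NeMo Retriever embedding and reranking models, it forms a complete, scalable RAG solution, as demonstrated in the NVIDIA AI Blueprint for RAG (Figure 1).

A diagram showing the multimodal extraction pipeline and retrieval pipeline of NVIDIA NeMo Retriever. — *Figure 1. Extraction pipeline diagram, part of the Build an Enterprise RAG Pipeline NVIDIA AI Blueprint*

In the next section, we walk through a simple use case that shows the NeMo Retriever extraction pipeline step by step, along with other NeMo Retriever components that consume the extracted data.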

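Completing a business task with multimodal document understanding

For this example, consider an organization's shared folder that holds product resources about NVIDIA Blackwell GPUs. The files contain text, images, charts, and tables in different file types, such as PDF, PPTX, and JPEG.

In this example, a customer support engineer asks for a comparison of NVIDIA Blackwell performance in order to support a partner.

Such a request can be sent through any application, such as a chat UI or an automated content generator. Here, we demonstrate it with a direct prompt from the pipeline's Python client.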

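Step 1: Deploy the pipeline with one GPU

Deploy the blueprint with Docker on a local or cloud machine. See the deployment instructions in the NeMo Retriever extraction quickstart guide in the NVIDIA documentation.

In this example, the entire pipeline is deployed on an AWS g6e.xlarge machine (one L40S GPU, 48 GB). Verify that all deployed services for the required profiles are up and running.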
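One way to run this check from a script is to poll each service's HTTP readiness endpoint. The sketch below is a minimal illustration and not part of the blueprint. The service names, port 7670, and the `/v1/health/ready` path are assumptions; replace them with the values from your own Docker deployment.

```python
import http.client
from urllib import error, request

# Hypothetical endpoints: adjust names, ports, and paths to your deployment.
SERVICES = {
    "nv-ingest": "http://localhost:7670/v1/health/ready",
    "embedding": "http://localhost:8012/v1/health/ready",
}


def http_probe(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (error.URLError, OSError, ValueError, http.client.HTTPException):
        return False


def check_services(services, probe=http_probe):
    """Map each service name to True (ready) or False (not ready)."""
    return {name: probe(url) for name, url in services.items()}


def unready(status):
    """List the services that failed the readiness probe, sorted by name."""
    return sorted(name for name, ok in status.items() if not ok)
```

Calling `unready(check_services(SERVICES))` returns an empty list when every probed service reports ready. The `probe` argument is injectable so that the check can be exercised without a live deployment.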

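An image of deployed services of NeMo Retriever Extraction, which includes all pipeline models, serving, and observability tools. — *Figure 2. Deployed services of NVIDIA NeMo Retriever extraction*

The pipeline services include visual element detection and optical character recognition (OCR) models, an embedding model, Milvus DB, and observability tools (Prometheus, Grafana, Attu, Zipkin, and others).

Note: For prototyping purposes, the pipeline source code can be accessed from the Build an Enterprise RAG pipeline blueprint page.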

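Step 2: Submit an ingestion job for files in storage

Once all services are up, we can submit an extraction job through the Python client or the command-line interface (CLI).

Here we show the Python client.

In this example, we pass the path to the files in the collection and define the tasks to include in the job: extract, split, and embed. We configure the extract task to cover all modalities and the split task to chunk text into 1,024 tokens. The chain also includes dedup and vdb_upload tasks.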

from nv_ingest_client.client import Ingestor
 
demo_files = "demo_files/*"
 
ingestor = (
    Ingestor(message_client_hostname="localhost")
    .files(demo_files)
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=True,
        text_depth="page",
    )
    .dedup()
    .split(
        tokenizer="meta-llama/Llama-3.2-1B",
        chunk_size=1024,
    )
    .embed()
    .vdb_upload()
)
 
result = ingestor.ingest()

Step 3: Analyze the job results

Once the extraction job completes, we can inspect the structure of the results (Figure 3):

import pandas as pd
 
df = pd.DataFrame([])
for doc in result:
    for obj in doc:
        df = pd.concat([df,pd.json_normalize(obj)])
 
display(df)

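A screenshot showing extraction job results structure. — *Figure 3. Extraction job results*

The extraction job produced multiple objects from different modalities for each of our documents. These include text, image, and structured objects, where structured refers to charts and tables.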
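A quick way to see this mix is to count the objects by `document_type`. The sketch below uses made-up placeholder records shaped like the nested `result` list iterated in the loop above (one list of objects per document). With a real job, pass `result` instead of `sample_result`.

```python
from collections import Counter


def count_by_type(result):
    """Count extracted objects per document_type across all documents."""
    return Counter(obj["document_type"] for doc in result for obj in doc)


# Placeholder data for illustration only; this is not real job output.
sample_result = [
    [{"document_type": "text"}, {"document_type": "structured"}],
    [{"document_type": "text"}, {"document_type": "image"}],
]

counts = count_by_type(sample_result)
print(counts.most_common())
```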

For example, here is one of the extracted text objects:

# Print a randomly sampled text object that was extracted.

# Use positional access (.iloc[0]) so the lookup does not depend on index labels.
print(df[df['document_type']=='text'].sample(1)['metadata.content'].iloc[0])
 
# Output:
# NVIDIA GB200 NVL72 | Datasheet | 1
# NVIDIA GB200 NVL72
# Powering the new era of computing.
# Unlocking Real-Time Trillion-Parameter Models
# NVIDIA GB200 NVL72 connects 36 Grace CPUs and 72 Blackwell GPUs in an NVIDIA?
# NVLink?-connected, liquid-cooled, rack-scale design. Acting as a single, massive GPU, it
# delivers 30X faster real-time trillion-parameter large language model (LLM) inference.
# The GB200 Grace Blackwell Superchip is a key component of the NVIDIA GB200
# NVL72, connecting two high-performance NVIDIA Blackwell GPUs and an NVIDIA...

Text can also be split into smaller chunks. The chunking strategy is controlled through the Ingestor's split configuration.
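To make fixed-size chunking concrete, here is a minimal sketch. It is not the splitter that the pipeline uses: it treats whitespace-separated words as stand-ins for tokens, whereas the job above is configured with the `meta-llama/Llama-3.2-1B` tokenizer. The function name and the `overlap` parameter are illustrative assumptions.

```python
def chunk_tokens(text, chunk_size=1024, overlap=0):
    """Split text into chunks of at most chunk_size whitespace tokens.

    Consecutive chunks share `overlap` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Smaller chunks tend to give more focused retrieval hits, while larger chunks keep more surrounding context together, so the right size depends on the documents and the questions being asked.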

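Here is a randomly sampled extracted table:

# Display a randomly sampled structured object (a table or a chart).
from base64 import b64decode
from IPython import display

rand_extracted_object = df[df['document_type']=='structured'].sample(1)

# The content is base64-encoded image data; decode it before rendering.
display.Image(b64decode(rand_extracted_object['metadata.content'].iloc[0]))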

An image of a table object that was extracted by the ingestion job, aligned and bounded correctly. — *Figure 4. A table extracted by the ingestion job*

In addition to the visual object, its textual content is also stored:

# Textual content of the sampled table
rand_extracted_object['metadata.table_metadata.table_content'].iloc[0]
 
# Output:
# " | Product Specifications' |\n| The NVIDIA GB200 Grace Blackwell Superchip comes in two configurations: GB200 NVL72 and GB200 NVL2 |\n| Feature | GB200 NVL72 | GB200NVL2 | GB200 Grace Blackwell | Superchip |\n| Configuration | 36 Grace CPUs, | 2 Grace CPUs, | 1 Grace CPU, |\n| 72 Blackwell GPUs, | 2 Blackwell GPUs | 2 Blackwell GPUs ....
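Because the table text is stored as pipe-delimited rows separated by newlines, it can be split back into cells for downstream use. The helper below is a hypothetical sketch based only on the output format shown above. Real extractions may need more careful handling of merged or wrapped cells, as the uneven rows in that output suggest.

```python
def parse_table_content(content):
    """Split pipe-delimited table text into a list of rows of cell strings."""
    rows = []
    for line in content.split("\n"):
        # Drop the leading and trailing pipes, then split on the inner ones.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if any(cells):
            rows.append(cells)
    return rows


# Simplified sample modeled on the output above; not verbatim job output.
sample = (
    "| Feature | GB200 NVL72 | GB200NVL2 |\n"
    "| Configuration | 36 Grace CPUs, | 2 Grace CPUs, |"
)
for row in parse_table_content(sample):
    print(row)
```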

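These objects are automatically chunked and embedded by the job. We can inspect the vector embeddings in the Milvus collection that the pipeline creates automatically, either through a Milvus client or through Attu, the Milvus web UI, which was deployed with the rest of the service bundle in step 1.

A screenshot of an Attu dashboard showing the created Milvus collection. — *Figure 5. The Milvus collection created automatically by the NVIDIA NeMo Retriever extraction pipeline*

Step 4: Retrieval

We now demonstrate how to build a retrieval component on top of the NeMo Retriever extraction pipeline, the extracted data, and NeMo Retriever embeddings. First, define an NVIDIA client for the embedding and generator LLM microservices.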

from openai import OpenAI
 
nvidia_client = OpenAI(
  api_key="...",
  base_url="https://integrate.api.nvidia.com/v1"
)

Embed the user query (using the same embedding model that was used for extraction):

user_query = "I am a customer support engineering asking for my client - What is the main difference between the two configurations of grace blackwell?"
 
# embed user query
response = nvidia_client.embeddings.create(
    input=user_query,
    model="nvidia/nv-embedqa-e5-v5",
    encoding_format="float",
    extra_body={"input_type": "query", "truncate": "NONE"}
)
user_query_vector=response.data[0].embedding

Use the retriever from the NeMo Retriever extraction Python client to fetch the results most similar to the user query:

from nv_ingest_client.util.milvus import nvingest_retrieval
 
query_results = nvingest_retrieval(
    [user_query],
    "nv_ingest_collection",
    hybrid=False,
    embedding_endpoint="http://localhost:8012/v1",
    model_name="nvidia/llama-3.2-nv-embedqa-1b-v2",
    top_k=1,
    gpu_search=True,
)
 
top_result = query_results[0][0]['entity']['text']
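Conceptually, dense vector retrieval ranks the stored chunk embeddings by their similarity to the query embedding and returns the `top_k` best matches. The sketch below illustrates that idea in plain Python with cosine similarity on toy 3-dimensional vectors. It is not how Milvus implements search, which relies on vector indexes, and the similarity metric, function names, and data are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=1):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings for illustration only; real embeddings have many more dimensions.
toy_chunks = [
    ("GB200 NVL72 configuration table", [0.9, 0.1, 0.0]),
    ("Cooling requirements", [0.0, 1.0, 0.0]),
    ("Company boilerplate", [0.0, 0.0, 1.0]),
]

best = top_k([1.0, 0.0, 0.0], toy_chunks, k=1)
print(best[0][1])
```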

Create a prompt for the generator LLM and get the response:

prompt = """Based on the following context answer the user query:
 
context:
{}
 
user query:
{}
 
""".format(top_result, user_query)
 
completion = nvidia_client.chat.completions.create(
  model="meta/llama-3.2-3b-instruct",
  messages=[{"role":"user","content":prompt}],
  temperature=0.2,
  top_p=0.7,
  max_tokens=200,
  stream=True
)
 
for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")
 
 
# Output:
# Based on the provided context, the main difference between the two configurations of the NVIDIA GB200 Grace Blackwell Superchip is the number of Grace CPUs and Blackwell GPUs.
#
# The two configurations are:
# 1. GB200 NVL72: 36 Grace CPUs, 72 Blackwell GPUs
# 2. GB200 NVL2: 2 Grace CPUs, 2 Blackwell GPUs
#
# This difference affects the overall performance and capabilities of the system, with the NVL72 configuration having more processing power and memory bandwidth due to the higher number of Grace CPUs and Blackwell GPUs.

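We retrieved a highly relevant chunk without any direct search or review of the original files.

This simple use case shows how a quick-to-deploy setup can automate contextual understanding of multimodal enterprise source files.

Summary

The NeMo Retriever extraction pipeline addresses the challenges of multimodal document processing by automatically handling different file types, such as PDFs, presentations, and spreadsheets. By extracting meaningful content from text, images, tables, and charts, it turns previously siloed information into accessible, structured data. This enables organizations to gain deeper insights from their existing knowledge bases.

The architecture behind this solution combines advanced components such as object detection, chart parsing, and vector embeddings to enable efficient, context-aware retrieval. By preserving relationships across modalities and surfacing them through semantic search, the pipeline offers a comprehensive approach to document understanding. Implementing this end-to-end pipeline with NeMo Retriever marks a significant step forward for enterprise knowledge management: it turns underused static documents into high-value assets that power generative AI applications and better-informed decision-making.

By continuously extracting and using new data, NeMo Retriever can also help organizations build a data flywheel, in which higher-quality data produces better AI models, which in turn generate more valuable data.

Get started with the NeMo Retriever extraction pipeline using the NVIDIA AI Blueprint for RAG, or try the individual NeMo Retriever microservices for extraction, embedding, and reranking on build.nvidia.com.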
