The integration of NVIDIA NIM microservices into Azure AI Foundry marks a major leap forward in enterprise AI development. By combining NIM microservices with Azure’s scalable, secure infrastructure, organizations can now deploy powerful, ready-to-use AI models more efficiently than ever before.
NIM microservices are containers that provide GPU-accelerated inferencing for pretrained and customized AI models. Integrating leading inference technology from NVIDIA and the broader community, NIM microservices deliver optimized response latency and throughput for the latest AI models on NVIDIA accelerated infrastructure.
Developers can access AI models through APIs that adhere to industry standards for each domain, simplifying the development of AI applications. NIM supports AI use cases across multiple domains and a range of AI models, including community models, NVIDIA AI Foundation models, and custom AI models provided by NVIDIA partners. This includes models for speech, images, video, 3D, drug discovery, medical imaging, and more.
Azure AI Foundry
Azure AI Foundry is a trusted, integrated platform for developers and IT administrators to design, customize, and manage AI applications and agents. It offers a rich set of AI capabilities and tools through a simple portal, unified SDK, and APIs. The platform facilitates secure data integration, model customization, and enterprise-grade governance to accelerate the path to production.
NVIDIA NIM on Azure AI Foundry
NIM microservices are natively supported on Azure AI Foundry, giving developers a streamlined path to deployment. The microservices run on Azure’s managed compute, removing the complexity of setting up and maintaining GPU infrastructure while ensuring high availability and scalability, even for highly demanding workloads. This enables teams to move quickly from model selection to production use.
NVIDIA NIM is part of NVIDIA AI Enterprise, a suite of easy-to-use microservices engineered for secure, reliable, and high-performance AI inferencing. Each NIM deployment includes NVIDIA AI Enterprise support, offering consistent performance and reliable updates for enterprise-grade use.
NIM deployed on Azure AI Foundry can be accessed through standard APIs, making it easy to integrate into Azure services like AI Agent Service and agentic AI frameworks like Semantic Kernel. Leveraging robust technologies from NVIDIA and the community including NVIDIA Dynamo, NVIDIA TensorRT, NVIDIA TensorRT-LLM, vLLM, PyTorch, and more, NIM microservices are built to scale seamlessly on managed Azure compute.
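For example, because each NIM exposes an OpenAI-compatible endpoint, an agentic framework such as Semantic Kernel can target it through its OpenAI connector. The following is a hedged sketch rather than a definitive integration; the endpoint URL, key, and model ID are placeholders, and it assumes the `semantic-kernel` and `openai` Python packages:

```python
import asyncio

from openai import AsyncOpenAI
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion

# Placeholders: your NIM deployment's OpenAI-compatible endpoint and key.
client = AsyncOpenAI(base_url="https://<your-endpoint>/v1", api_key="<your-api-key>")

kernel = Kernel()
kernel.add_service(
    OpenAIChatCompletion(ai_model_id="meta/llama-3.1-8b-instruct", async_client=client)
)

async def main() -> None:
    # Run a simple inline prompt through the NIM-backed chat service.
    answer = await kernel.invoke_prompt("Summarize what a NIM microservice is.")
    print(answer)

asyncio.run(main())
```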
NIM microservices provide:
- Zero-configuration deployment: Get up and running quickly with out-of-the-box optimization.
- Seamless Azure integration: Work easily with Azure AI Agent Service and Semantic Kernel.
- Enterprise-grade reliability: Benefit from NVIDIA AI Enterprise support for continuous performance and security.
- Scalable inference: Tap into Azure’s NVIDIA-accelerated infrastructure for demanding workloads.
You can deploy these services with just a few clicks. Select a model such as Llama 3.3 70B NIM from the model catalog in Azure AI Foundry and integrate it directly into your AI workflows. Start building generative AI applications that work flawlessly within the Azure ecosystem.
Deploy a NIM on Azure AI Foundry
This section walks you through how to deploy a NIM on Azure AI Foundry.
Azure AI Foundry Portal
Begin by accessing the Azure AI Foundry portal and then following the steps below.
1. Navigate to ai.azure.com and ensure that you have a Hub and Project available.
2. Select Model Catalog from the left sidebar menu (Figure 1).
3. In the Collections filter, select NVIDIA to see all the NIM microservices that are available on Azure AI Foundry.

4. Select the NIM you want to use. This example uses the Llama 3.1 8B Instruct NIM microservice.
5. Click Deploy.
6. Choose the deployment name and virtual machine (VM) type that you would like to use for your deployment (Figure 2). VM SKUs that are supported for the selected NIM and also specified within the model card will be preselected. Note that this step requires having sufficient quota available in your Azure subscription for the selected VM type. If needed, follow the instructions to request a service quota increase.

7. Optional: Customize the deployment configuration, such as the instance count. As part of this step, it’s also possible to leverage an existing endpoint for your deployment.
8. Click Next, and then review the Pricing breakdown and terms of use for the NIM deployment (Figure 3). The pricing breakdown consists of the Azure Compute charges plus a flat fee per GPU for the NVIDIA AI Enterprise license that is required to use the NIM software.

9. After acknowledging the Terms of Use, click Deploy to launch the deployment process.
Python SDK deployment
It’s also possible to programmatically deploy NIM on Azure AI Foundry using the Azure Machine Learning Python SDK. To do so, follow the steps presented in this section.
Install the Azure ML SDK:
```bash
pip install azure-ai-ml
pip install azure-identity
```
Set up the credentials:
```python
from azure.ai.ml import MLClient
from azure.identity import InteractiveBrowserCredential

workspace_ml_client = MLClient(
    credential=InteractiveBrowserCredential(),
    subscription_id="azure-subscription-id",
    resource_group_name="resource-group-name",
    workspace_name="azure-ai-foundry-project-name",
)
```
Create the managed online endpoint that will host the NIM deployment:
```python
import time

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    ProbeSettings,
)

# Make the endpoint name unique
timestamp = int(time.time())
online_endpoint_name = "endpoint" + str(timestamp)

# Create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    auth_mode="key",
)
workspace_ml_client.online_endpoints.begin_create_or_update(endpoint).wait()
```
Finally, use the model ID to deploy the model:
```python
model_name = "azureml://registries/azureml-nvidia/models/Llama-3.1-8B-Instruct-NIM-microservice/versions/2"

demo_deployment = ManagedOnlineDeployment(
    name="nim-deployment",
    endpoint_name=online_endpoint_name,
    model=model_name,
    instance_type="Standard_NC24ads_A100_v4",
    instance_count=1,
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        timeout=2,
        period=10,
        initial_delay=1000,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=10,
        success_threshold=1,
        timeout=10,
        period=10,
        initial_delay=1000,
    ),
)
workspace_ml_client.online_deployments.begin_create_or_update(demo_deployment).wait()

endpoint.traffic = {"nim-deployment": 100}
workspace_ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```
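Once the deployment finishes, you can fetch the endpoint URL and authentication key that the client examples below rely on. A minimal sketch using the same `MLClient`:

```python
# Retrieve the scoring URL and authentication key for the new deployment.
endpoint = workspace_ml_client.online_endpoints.get(online_endpoint_name)
keys = workspace_ml_client.online_endpoints.get_keys(online_endpoint_name)

print(endpoint.scoring_uri)  # endpoint URL
print(keys.primary_key)      # authentication key
```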
Integrating NIM into AI solutions using OpenAI or Azure AI Foundry SDK
NVIDIA NIM on Azure AI Foundry exposes an OpenAI-compatible API. To learn more about the supported payloads, see the NVIDIA NIM for LLMs documentation. Another option is to use the Azure AI Foundry SDK.
OpenAI SDK with NIM on Azure AI Foundry
Use the `openai` SDK to consume Llama 3.1 8B NIM deployments in Azure AI Foundry and Azure ML. The NVIDIA Meta Llama 3 family of models in Azure AI and Azure ML offers API compatibility with the OpenAI Chat Completion API, enabling customers and users to transition seamlessly from OpenAI models to Meta Llama LLM NIM microservices.
The API can be directly used with OpenAI’s client libraries or third-party tools, like LangChain or LlamaIndex.
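As a hedged illustration, LangChain’s OpenAI-compatible chat client can be pointed at a NIM deployment; the endpoint URL and key below are placeholders, and the `langchain-openai` package is an assumption rather than part of the original walkthrough:

```python
# Sketch: LangChain's ChatOpenAI talking to a NIM deployment's
# OpenAI-compatible endpoint. URL and key are placeholders.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="meta/llama-3.1-8b-instruct",
    base_url="https://<your-endpoint>/v1",  # your deployment URL, including /v1
    api_key="<your-api-key>",               # key from your deployment
)

print(llm.invoke("Name one famous French painter.").content)
```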
The following example shows how to use `openai` with a Meta Llama 3 chat model deployed in Azure AI and Azure ML. For more details, see Azure/azureml-examples on GitHub.
```bash
pip install openai
```
Note that you will need an endpoint URL and an authentication key associated with that endpoint; both can be acquired from the previous steps. To work with `openai`, configure the client as follows:

- `base_url`: Use the endpoint URL from your deployment, including `/v1` as part of the URL.
- `api_key`: Use your API key.
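A minimal sketch of that configuration, with placeholder values standing in for your deployment’s endpoint URL and key:

```python
# Sketch: configure the OpenAI client against the NIM deployment.
# Replace the placeholders with your deployment's endpoint URL and key.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>/v1",  # endpoint URL, including /v1
    api_key="<your-api-key>",               # authentication key
)
```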
Use the client to create chat completions requests:
```python
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Who is the most renowned French painter? Provide a short answer.",
        }
    ],
    model="meta/llama-3.1-8b-instruct",
)
```
The generated text can be accessed as follows:
```python
print(response.choices[0].message.content)
```
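Because the deployment follows the OpenAI Chat Completion API, standard parameters such as `stream` should also apply; here is a sketch, assuming the deployment supports streaming as documented for NIM for LLMs:

```python
# Sketch: stream tokens with the standard OpenAI `stream` parameter.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Write a one-line poem about GPUs."}],
    model="meta/llama-3.1-8b-instruct",
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```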
Azure AI Inference SDK with NIM on Azure AI Foundry
This example shows how to consume Llama 3.1 8B NIM deployments in Azure AI Foundry and Azure ML using the Azure AI Inference SDK. For more details, see Azure/azureml-examples on GitHub.
First, install the `azure-ai-inference` package using your package manager, like pip:
```bash
pip install azure-ai-inference
```
Use `ChatCompletionsClient` from the `azure.ai.inference` package to call the Llama LLM NIM:
```python
import os

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# Endpoint URL and key from your deployment; the endpoint environment
# variable name is an assumption mirroring the key variable used here.
endpoint = os.getenv("AZURE_AI_CHAT_ENDPOINT", "https://<your-endpoint>")
key = os.getenv("AZURE_AI_CHAT_KEY", "keyhere")

client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key),
)

response = client.complete(
    messages=[
        SystemMessage("You are a helpful assistant."),
        UserMessage("Can you write me a song?"),
    ],
)

print(response.choices[0].message.content)
```
Get started with NVIDIA NIM on Azure AI Foundry
The integration of NVIDIA NIM microservices into Azure AI Foundry brings together the best of both worlds—the cutting-edge NVIDIA AI inferencing platform that includes hardware and software, and Azure’s enterprise-grade cloud infrastructure. This powerful combination enables developers and organizations to rapidly deploy, scale, and operationalize AI models with minimal configuration and maximum performance. With just a few clicks or lines of code, you can unlock high-performance inferencing for some of today’s most advanced models.
To get started, check out these resources:
- Learn more about NVIDIA NIM Microservices
- Register an account and access NIM microservices for free through the NVIDIA Developer Program
- NVIDIA NIM on Azure AI Foundry
- NVIDIA NIM LLMs Benchmarking
- How to Deploy NVIDIA Inference Microservices
- Container Security Development Lifecycle for NVIDIA AI Enterprise on Azure AI Foundry
- Boost Llama Model Performance on Microsoft Azure AI Foundry with NVIDIA TensorRT-LLM
Stay tuned to learn more about NVIDIA NIM on AI Foundry and explore how to get more out of NVIDIA on Azure.