Large language models (LLMs) have created unprecedented opportunities across various industries. However, moving LLMs from research and development into reliable, scalable, and maintainable production systems presents unique operational challenges.
LLMOps, or large language model operations, is designed to address these challenges. Building upon the principles of traditional machine learning operations (MLOps), it provides a framework for managing the entire LLM lifecycle, from data preparation and model fine-tuning to deployment, monitoring, and continuous improvement. Operationalizing LLMs introduces several significant challenges throughout the pipeline and deployment phases, including:
- Fine-tuning pipeline management: Orchestrating the fine-tuning process involves managing large models, tracking numerous experiments with varying hyperparameters, ensuring reproducibility of results, and efficiently using distributed computing resources.
- Evaluation at scale: A critical challenge in scaling evaluations is moving beyond simple metrics to assess the subjective quality and safety of LLM outputs. This problem becomes acute in multi-agent systems, where isolating and evaluating the performance of each agent-node individually becomes difficult and creates a significant barrier to effective, large-scale analysis.
- Model versioning and lineage: Tracking the lineage of fine-tuned models, including the base model version, fine-tuning data, hyperparameters, and evaluation results, is critical for reproducibility, debugging, and regulatory compliance. Managing and storing numerous large model artifacts efficiently is also difficult.
- Inference serving complexity: Deploying and serving LLMs for real-time inference with low latency and high throughput requires specialized serving frameworks, efficient model loading (especially for large models or techniques like LoRA), and dynamic scaling based on demand. Optimizing inference performance across different hardware configurations is a continuous challenge.
Amdocs, a company specializing in telecommunications solutions, is addressing these very challenges to overcome the complexities of operationalizing their custom LLMs and accelerate their AI initiatives. Amdocs has built a robust LLMOps pipeline based on the NVIDIA AI Blueprint for building data flywheels, which uses NVIDIA NeMo microservices for streamlined fine-tuning, evaluation, and guardrailing, and serves the resulting models as NVIDIA NIM for efficient, scalable deployment.
Their adoption was specifically driven by a cloud-native, GitOps approach for automated and declarative management. In this blog post, we’ll delve into the architecture they’re using, showcase how this stack addresses key LLMOps challenges, and highlight the results.
As Liad Levi-Raz, data scientist at Amdocs, put it: “Leveraging the NVIDIA NeMo microservices and NVIDIA NIM stack orchestrated by GitOps has fundamentally transformed our ability to iterate on and deploy LLMs. We integrated it into a CI/CD automation system, which enables rapid and efficient evaluation of new LLMs, ensuring they are suitable for our use cases. As a data scientist, I can solely focus on LLM fine-tuning and not worry about infrastructure details.”
NVIDIA NeMo microservices
NVIDIA NeMo microservices are designed to facilitate a continuous improvement cycle for LLMs, often visualized as an “Enterprise AI Flywheel.” This flywheel concept emphasizes the iterative nature of LLMOps, where insights gained from deployed models and new data feed back into the development process, leading to continuous improvements in LLM performance, capabilities, and model optimization.
This flywheel is implemented as part of the NVIDIA AI Blueprint for building data flywheels, a reference architecture built on NVIDIA NeMo microservices. The following diagram illustrates the concept and the role of key NeMo microservices within it.

This iterative process begins with enterprise data curated and processed by NeMo Curator. NVIDIA NeMo Customizer then uses the prepared data to fine-tune the model, and NVIDIA NeMo Evaluator evaluates the result. NeMo Guardrails provides a layer of safety and alignment for the underlying model use case.
Finally, the model can be deployed as an NVIDIA NIM for advanced inference. NVIDIA NIM is designed to accelerate and simplify the deployment of generative AI models. It provides a standardized way to package and deploy these models as containerized microservices and optimize their performance for inference.
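For illustration, a single NIM can be deployed on Kubernetes with an ordinary Deployment and Service (NVIDIA also provides Helm charts and an operator for this). The following is a minimal sketch only; the image tag, secret names, and GPU count are assumptions to adapt to the target environment and the NIM documentation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama31-8b-nim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama31-8b-nim
  template:
    metadata:
      labels:
        app: llama31-8b-nim
    spec:
      imagePullSecrets:
      - name: ngc-registry-secret          # assumed pull secret for nvcr.io
      containers:
      - name: nim
        image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest   # example NIM image; pin a version in practice
        env:
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: ngc-api-secret         # assumed secret holding the NGC API key
              key: NGC_API_KEY
        ports:
        - containerPort: 8000              # OpenAI-compatible HTTP endpoint
        resources:
          limits:
            nvidia.com/gpu: 1              # one GPU for the 8B model
---
apiVersion: v1
kind: Service
metadata:
  name: llama31-8b-nim
spec:
  selector:
    app: llama31-8b-nim
  ports:
  - port: 8000
    targetPort: 8000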
Case study: Amdocs amAIz GenAI platform
To tackle LLMOps challenges, especially validating new LLMs quickly and reliably, Amdocs adopted a robust GitOps-based LLMOps strategy. This approach, using the NVIDIA AI Blueprint for building data flywheels, is integrated directly into their existing amAIz platform's CI/CD pipeline. It enables powerful evaluation and crucial regression testing for any newly released LLM considered for use by the amAIz Suite. Providing the LLM's configuration automatically triggers the process, including deployment of GPU-dependent components on a Kubernetes cluster running on NVIDIA DGX Cloud. The following section demonstrates the GitOps-based LLMOps pipeline.
GitOps-based LLMOps methodology
This section details the GitOps-based LLMOps pipeline. Figure 2 shows its high-level workflow. The upper half of the diagram shows the components deployed in the Amdocs environment, while the lower half shows the components deployed in NVIDIA DGX Cloud.

Components within the Amdocs environment include:
- Git repository: A version control system that serves as the single source of truth. It stores all configuration files and workflow definitions; data scientists commit changes to it, and ArgoCD pulls updates from it.
- Management Kubernetes cluster: The core orchestration environment within Amdocs, hosting ArgoCD and Argo Workflows. This cluster is provisioned through Azure Kubernetes Service (AKS) and has only CPU-based compute nodes.
- ArgoCD: A continuous delivery tool for Kubernetes. It continuously monitors the Git repository for changes, pulling them and synchronizing the Kubernetes cluster to match the desired state. It also triggers the deployment of microservices and other GPU-dependent components to NVIDIA DGX Cloud.
- Argo Workflows: The workflow execution engine running within the Kubernetes cluster, responsible for creating and executing LLM workflows.
- LLM workflow components: Predefined, reusable building blocks for LLM workflows, referenced by LLM workflow templates. These components are based on Argo Workflows' ClusterWorkflowTemplate resource. A ClusterWorkflowTemplate is a cluster-scoped definition of a reusable workflow, or a template for a workflow step, that can be referenced across different namespaces, which allows for standardized and shareable automation logic. For more details, refer to the Argo Cluster Workflow Templates documentation. The following shows an example of such a component, defining a NeMo customization step:
apiVersion: argoproj.io/v1alpha1
kind: ClusterWorkflowTemplate
metadata:
  name: nemo-customization-template
spec:
  templates:
  - name: nemo-customization
    inputs:
      parameters:
      - name: nemo_customizer_endpoint
      - name: dataset_name
      - name: dataset_namespace
      - name: config
      - name: output_model
      - name: name
      - name: training_type
      - name: finetuning_type
      - name: batch_size
      - name: epochs
      - name: learning_rate
      - name: adapter_dim
      - name: adapter_dropout
    container:
      image: ubuntu:24.10
      command: [/bin/bash, -c]
      args:
      - |
        apt-get update && apt-get install git curl jq -y && \
        curl -k -X POST "{{ inputs.parameters.nemo_customizer_endpoint }}/v1/customization/jobs" \
          -H 'accept: application/json' \
          -H 'Content-Type: application/json' \
          -d '{
            "name": "{{inputs.parameters.name}}",
            "output_model": "{{inputs.parameters.output_model}}",
            "config": "{{inputs.parameters.config}}",
            "dataset": {
              "name": "{{inputs.parameters.dataset_name}}",
              "namespace": "{{inputs.parameters.dataset_namespace}}"
            },
            "hyperparameters": {
              "training_type": "{{inputs.parameters.training_type}}",
              "finetuning_type": "{{inputs.parameters.finetuning_type}}",
              "batch_size": {{inputs.parameters.batch_size}},
              "epochs": {{inputs.parameters.epochs}},
              "learning_rate": {{inputs.parameters.learning_rate}},
              "lora": {
                "adapter_dim": {{inputs.parameters.adapter_dim}},
                "adapter_dropout": {{inputs.parameters.adapter_dropout}}
              }
            }
          }' \
        | jq -r '.id' > /tmp/customization_id.txt && \
        cat /tmp/customization_id.txt
    outputs:
      parameters:
      - name: customization_ids
        valueFrom:
          path: /tmp/customization_id.txt
This ClusterWorkflowTemplate, named nemo-customization-template, defines a reusable step for triggering a model customization (fine-tuning) job through the NeMo Customizer microservice API. It takes various parameters as inputs, such as the NeMo Customizer endpoint, the dataset name and namespace, the customization config, the output model name, and the training hyperparameters. The curl command within the container makes an API call to the NeMo Customizer, passing these parameters to begin the customization process. The output captures the customization job ID for subsequent steps in the workflow.
- LLM workflow templates: These are reusable blueprints for complete LLM tasks, based on Argo WorkflowTemplate resources. They define the structure and sequence of operations for an LLM pipeline and refer to and combine various LLM workflow components (like the nemo-customization-template) to form a complete workflow. Sequences of operations can be linear (a simple sequence of steps) or a directed acyclic graph (DAG) for parallel execution and complex dependencies between components. Figure 3 shows a high-level LLM workflow for fine-tuning; each box in the diagram represents a component performing a specific task within the overall pipeline. A minimal sketch of how such a template wires components together follows.
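The sketch below is illustrative only: it shows an Argo WorkflowTemplate referencing the nemo-customization-template component through templateRef with clusterScope: true and chaining it to a downstream evaluation step in a DAG. The llm-finetune-pipeline name and the nemo-evaluation-template component are hypothetical, and a real template would pass along all of the component's input parameters, which are elided here for brevity:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: llm-finetune-pipeline              # hypothetical workflow template name
spec:
  entrypoint: main
  arguments:
    parameters:
    - name: nemo_customizer_endpoint
    - name: dataset_name
  templates:
  - name: main
    dag:
      tasks:
      - name: customize
        templateRef:
          name: nemo-customization-template   # the cluster-scoped component shown above
          template: nemo-customization
          clusterScope: true
        arguments:
          parameters:
          - name: nemo_customizer_endpoint
            value: "{{workflow.parameters.nemo_customizer_endpoint}}"
          - name: dataset_name
            value: "{{workflow.parameters.dataset_name}}"
          # ...remaining inputs of the component go here (elided for brevity)
      - name: evaluate
        dependencies: [customize]             # runs only after customization completes
        templateRef:
          name: nemo-evaluation-template      # hypothetical evaluation component
          template: nemo-evaluation
          clusterScope: true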

The deployment of NeMo microservices, NVIDIA NIM, and execution of fine-tuning and evaluation jobs occur automatically on the NVIDIA DGX Cloud, as depicted in Figure 2. This is facilitated by the integration of Amdocs’ ArgoCD instance with a dedicated Kubernetes cluster running within DGX Cloud. All incoming requests directed to this cluster first pass through a gateway, which then intelligently routes them to either the Kubernetes ingress controller or directly to the Kube-API server, ensuring efficient and secure access to the deployed components and triggered jobs.
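As an illustration of this declarative setup, each remote deployment can be expressed as an ArgoCD Application resource that points at a path in the Git repository and targets the DGX Cloud cluster registered in ArgoCD. This is a minimal sketch; the repository URL, path, namespaces, and cluster name are hypothetical placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nemo-microservices               # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/llmops/deployments.git   # hypothetical repo
    targetRevision: main
    path: nemo-microservices                                  # hypothetical path holding manifests/Helm values
  destination:
    name: dgx-cloud-cluster              # hypothetical name of the registered DGX Cloud cluster
    namespace: nemo
  syncPolicy:
    automated:
      prune: true        # remove resources deleted from Git
      selfHeal: true     # revert manual drift to match Git
    syncOptions:
    - CreateNamespace=true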
LLMOps pipeline in action
The LLMOps pipeline starts with dataset preparation. Telco-specific data, including customer bill data, is uploaded. It is then automatically formatted and split into training and testing sets. The data undergoes transformations, anonymization, and tokenization, and may be expanded synthetically using the NVIDIA NeMo framework capabilities.
For this use case, Amdocs needed an annotated dataset and created a compact tuning dataset of dozens of examples with expected inputs and outputs. Table 1 shows a sample.
| # | Various bill_headers (inputs) | User question | JSON Output |
| --- | --- | --- | --- |
| 1 | [{'id': 'amaiz_id_1300_10_27_24', 'bill_date': '2024-10-19', 'billing_period': {'start_datetime': '2023-10-27', 'end_datetime': '2023-11-26'}, …}] | I noticed that my bill has increased recently. Can you explain why? | {'bill_found': 'true', 'bill_id': 'amaiz_id_1300_11_27_24', 'bill_date': '2024-11-19T17:33:00.000000'} |
| 2 | [{'id': 'amaiz_id_9241_10_24_24', 'bill_date': '2024-10-19', 'billing_period': {'start_datetime': '2023-10-24', 'end_datetime': '2023-11-23'}, 'due_amount': …}] | Hi why is my bill more than last month?? my bill increase from $180.98 in November to $208.35 in December | {'bill_found': 'false'} |
The dataset is uploaded to the NVIDIA NeMo Data Store. Next, the pipeline deploys the new foundation LLM as a NIM (such as Llama 3.1 8B Instruct).
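One possible sketch of the upload step: NeMo Data Store exposes a Hugging Face Hub-compatible API, so a workflow component can push the prepared dataset with the huggingface-cli client. The /v1/hf path, token handling, local /data location, and component and repository names below are assumptions to verify against the NeMo Data Store documentation:

apiVersion: argoproj.io/v1alpha1
kind: ClusterWorkflowTemplate
metadata:
  name: datastore-upload-template        # hypothetical component name
spec:
  templates:
  - name: datastore-upload
    inputs:
      parameters:
      - name: datastore_endpoint         # base URL of the NeMo Data Store
      - name: dataset_repo               # e.g. "default/billing-qa" (hypothetical)
    container:
      image: python:3.12-slim
      command: [/bin/bash, -c]
      args:
      - |
        # Assumption: the Data Store's Hugging Face-compatible API is served under /v1/hf,
        # the prepared dataset files are available at /data (e.g. via an Argo artifact),
        # and a token may be required depending on how the Data Store is configured.
        pip install huggingface_hub && \
        export HF_ENDPOINT="{{inputs.parameters.datastore_endpoint}}/v1/hf" && \
        huggingface-cli upload "{{inputs.parameters.dataset_repo}}" /data --repo-type dataset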
Next, the model is fine-tuned with parameter-efficient fine-tuning (PEFT) techniques such as LoRA, using NeMo Customizer, on a subset of the prepared data stored in the NeMo Data Store. The pipeline automatically discovers the best hyperparameters for fine-tuning and uploads the resulting model to the NeMo Data Store through NeMo Customizer.
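The exact search strategy isn't described here, but one natural way to express such a sweep in Argo is to fan out the customization component over a list of candidate values with withItems, launching one NeMo Customizer job per value in parallel. A hypothetical sketch, again eliding the component's remaining required inputs:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: lr-sweep                          # hypothetical sweep template
spec:
  entrypoint: sweep
  templates:
  - name: sweep
    dag:
      tasks:
      - name: customize
        templateRef:
          name: nemo-customization-template
          template: nemo-customization
          clusterScope: true
        withItems: [0.0001, 0.00005, 0.00001]   # candidate learning rates, one job each
        arguments:
          parameters:
          - name: learning_rate
            value: "{{item}}"
          # ...remaining inputs of the component go here (elided for brevity)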
Following this, a multi-stage evaluation process is triggered, using NeMo Evaluator.
First, the pipeline runs both the original base model and the fine-tuned model against standardized benchmarks such as GSM8K, SQuAD, GLUE, and SuperGLUE. This step is a critical regression test to ensure the new model's general capabilities haven't been negatively impacted.
Next, a business-specific evaluation is conducted. This compares the predictions of the base and fine-tuned models and analyzes performance with a more customized benchmark: LLM-as-a-judge, using a custom dataset and custom metrics. Domain experts then conduct a final human evaluation of the top-performing models to ensure alignment with business requirements. The selected fine-tuned LoRA adapter is then deployed using NVIDIA NIM for LLMs.
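For illustration, the evaluation stages can be wrapped in the same kind of reusable component as the customization step, such as the hypothetical nemo-evaluation-template referenced in the earlier DAG sketch. The job-creation endpoint and payload shape below are assumptions modeled on the customization example and should be verified against the NeMo Evaluator API reference for the deployed version:

apiVersion: argoproj.io/v1alpha1
kind: ClusterWorkflowTemplate
metadata:
  name: nemo-evaluation-template          # hypothetical evaluation component
spec:
  templates:
  - name: nemo-evaluation
    inputs:
      parameters:
      - name: nemo_evaluator_endpoint
      - name: config                      # evaluation config (benchmark or LLM-as-a-judge)
      - name: target                      # model or endpoint under evaluation
    container:
      image: ubuntu:24.10
      command: [/bin/bash, -c]
      args:
      - |
        # Assumption: the evaluator exposes a job-creation endpoint at /v1/evaluation/jobs
        # that accepts a config and a target; adjust to the actual API schema.
        apt-get update && apt-get install curl jq -y && \
        curl -k -X POST "{{inputs.parameters.nemo_evaluator_endpoint}}/v1/evaluation/jobs" \
          -H 'accept: application/json' \
          -H 'Content-Type: application/json' \
          -d '{
            "config": "{{inputs.parameters.config}}",
            "target": "{{inputs.parameters.target}}"
          }' \
        | jq -r '.id' > /tmp/evaluation_id.txt && \
        cat /tmp/evaluation_id.txt
    outputs:
      parameters:
      - name: evaluation_id
        valueFrom:
          path: /tmp/evaluation_id.txt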
In this pipeline, Git serves as the single source of truth, where data scientists commit LLM workflows as code and DevOps teams manage infrastructure configurations. The entire process is orchestrated by Argo: ArgoCD continuously monitors the repository to deploy and synchronize microservices on Kubernetes, while Argo Workflows executes demanding tasks like model fine-tuning and evaluation on NVIDIA DGX Cloud using NeMo microservices. This also enables development agility, allowing data scientists to experiment interactively in Jupyter notebooks that interface directly with the NeMo microservices APIs. For complete oversight, MLflow is integrated to automatically capture all experiment metrics for analysis.
This integrated GitOps approach provides a powerful, automated orchestration layer for the LLMOps pipeline and is reproducible because all configurations and workflow definitions are versioned in Git. Users can also quickly assess the performance and suitability of new models without manual intervention, speeding up the AI adoption cycle.
Results
Evaluation results show the benefits of the fine-tuning process across multiple dimensions. Regression tests using standard benchmarks like TriviaQA confirm that the fine-tuned model retains core capabilities without degradation, achieving a score of 0.6, which matches the base model.
Fine-tuning delivered a performance boost on the specific task, with an accuracy of 0.83 for the LoRA fine-tuned version, even with only 50 training examples, as shown in Figure 4. This outperforms the base Llama 3.1 8B Instruct model's score of 0.74.

This improvement extends beyond quantitative metrics. Qualitative analysis shows the fine-tuned model learned to produce correctly formatted and accurate results, identifying when a bill wasn’t found, whereas the base model failed in these specific instances, as shown in Figure 5.

To achieve deeper, domain-specific insights, a key component of the evaluation is a custom LLM-as-a-judge. This approach uses a dedicated judge LLM to compare responses against human-curated references, employing custom datasets and metrics aligned with application KPIs to assess correctness, relevance, and fluency.
The flexibility and robustness of this method enable tailored evaluation across multiple dimensions. This advanced technique, combined with automated benchmarks and similarity metrics, provides a comprehensive performance view and establishes a data flywheel: an iterative cycle that continuously enhances the quality of the fine-tuned model through consistent, repeated evaluation. Smaller models can then power production-grade workflows, optimizing for both performance and cost.
Conclusion
Operationalizing LLMs presents significant challenges, particularly around pipeline complexity, deployment at scale, and continuous performance and safety. As demonstrated by the architecture using the NVIDIA AI Blueprint for building data flywheels, which includes NVIDIA NeMo microservices for continuous fine-tuning and evaluation, NVIDIA NIM for efficient inference, and GitOps orchestration through ArgoCD and Argo Workflows, building a robust LLMOps pipeline is achievable. This stack facilitates automated workflows, enables a continuous improvement cycle akin to the data flywheel, and directly addresses many of the core complexities of managing LLMs in production. The case study highlights the practical benefits of integrating such a pipeline into existing CI/CD processes for rapid fine-tuning.
Learn more about the collaboration between Amdocs and NVIDIA.
Get started with the NVIDIA AI Blueprint for data flywheels and explore AI-powered operations for telecom.