Nicole Luo – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2025-04-01T19:02:02Z http://www.open-lab.net/blog/feed/ Nicole Luo <![CDATA[Mastering LLM Techniques: Text Data Processing]]> http://www.open-lab.net/blog/?p=91738 2025-04-01T19:02:02Z 2024-11-13T18:05:06Z Training and customizing LLMs for high accuracy is fraught with challenges, primarily due to their dependency on high-quality data. Poor data quality and...]]>

Training and customizing LLMs for high accuracy is fraught with challenges, primarily due to their dependency on high-quality data. Poor data quality and inadequate volume can significantly reduce model accuracy, making dataset preparation a critical task for AI developers. Datasets frequently contain duplicate documents, personally identifiable information (PII), and formatting issues.


]]>
Nicole Luo <![CDATA[Curating Non-English Datasets for LLM Training with NVIDIA NeMo Curator]]> http://www.open-lab.net/blog/?p=84628 2024-10-18T20:14:04Z 2024-07-10T16:00:00Z Data curation plays a crucial role in the development of effective and fair large language models (LLMs). High-quality, diverse training data directly...]]>

Data curation plays a crucial role in the development of effective and fair large language models (LLMs). High-quality, diverse training data directly impacts LLM performance, addressing issues like bias, inconsistencies, and redundancy. By curating high-quality datasets, we can ensure that LLMs are accurate, reliable, and generalizable. When training a localized multilingual LLM…


]]>
Nicole Luo <![CDATA[Deploy Multilingual LLMs with NVIDIA NIM]]> http://www.open-lab.net/blog/?p=84933 2024-07-25T18:19:11Z 2024-07-08T18:49:33Z Multilingual large language models (LLMs) are increasingly important for enterprises operating in today's globalized business landscape. As businesses expand...]]>

Multilingual large language models (LLMs) are increasingly important for enterprises operating in today’s globalized business landscape. As businesses expand their reach across borders and cultures, the ability to communicate effectively in multiple languages is crucial for success. By supporting and investing in multilingual LLMs, enterprises can break down language barriers, foster inclusivity…


]]>
Nicole Luo <![CDATA[Training Localized Multilingual LLMs with NVIDIA NeMo, Part 2]]> http://www.open-lab.net/blog/?p=82295 2025-02-17T05:27:39Z 2024-05-17T17:29:49Z In Part 1, we discussed how to train a monolingual tokenizer and merge it with a pretrained LLM’s tokenizer to form a multilingual tokenizer. In this post, we...]]>

In Part 1, we discussed how to train a monolingual tokenizer and merge it with a pretrained LLM’s tokenizer to form a multilingual tokenizer. In this post, we show you how to integrate the customized tokenizer into the pretrained LLM as well as how to start a continual pretraining task in NVIDIA NeMo. Please import the following libraries before starting: After…


]]>
Nicole Luo <![CDATA[Training Localized Multilingual LLMs with NVIDIA NeMo, Part 1]]> http://www.open-lab.net/blog/?p=82294 2024-10-18T20:22:45Z 2024-05-17T17:29:13Z In today's globalized world, the ability of AI systems to understand and communicate in diverse languages is increasingly crucial. Large language models (LLMs)...]]>

In today’s globalized world, the ability of AI systems to understand and communicate in diverse languages is increasingly crucial. Large language models (LLMs) have revolutionized the field of natural language processing, enabling AI to generate human-like text, answer questions, and perform various language tasks. However, most mainstream LLMs are trained on data corpora that primarily consist of…


]]>