The Automated Audio Captioning (AAC) task centers on generating natural language descriptions from audio inputs. Because the input (audio) and the output (text) are distinct modalities, AAC systems typically rely on an audio encoder to extract relevant information from the sound, represented as feature vectors, which a decoder then uses to generate text descriptions.
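To make the encoder-decoder framing concrete, here is a minimal, illustrative PyTorch sketch (not any specific published AAC system): a small convolutional encoder turns mel-spectrogram frames into feature vectors, and a Transformer decoder attends over them to predict caption tokens. All layer sizes, class names, and shapes below are placeholders chosen for readability.

```python
# Minimal AAC encoder-decoder sketch (illustrative only; real systems typically
# use pretrained audio encoders and much larger text decoders).
import torch
import torch.nn as nn

class TinyAudioCaptioner(nn.Module):
    def __init__(self, n_mels=64, d_model=256, vocab_size=5000):
        super().__init__()
        # Audio encoder: turns mel-spectrogram frames into feature vectors.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
        )
        # Text decoder: attends over the audio features and predicts the next token.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mels, token_ids):
        # mels: (batch, n_mels, time); token_ids: (batch, seq_len)
        memory = self.encoder(mels).transpose(1, 2)       # (batch, time', d_model)
        tgt = self.embed(token_ids)                       # (batch, seq_len, d_model)
        seq_len = token_ids.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                          # next-token logits

model = TinyAudioCaptioner()
logits = model(torch.randn(2, 64, 400), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 5000])
```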
Building an effective automatic speech recognition (ASR) model for underrepresented languages presents unique challenges due to limited data resources. In this post, I discuss best practices for preparing the dataset, configuring the model, and training it effectively. I also cover the evaluation metrics and the challenges encountered along the way. By following these practices…
NVIDIA NeMo is an end-to-end platform for the development of multimodal generative AI models at scale anywhere, on any cloud and on-premises. The NeMo team just released Canary, a multilingual model that transcribes speech in English, Spanish, German, and French with punctuation and capitalization. Canary also provides bi-directional translation between English and the three other supported…
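As a quick illustration of how a model like this might be loaded through NeMo, here is a hedged sketch. The checkpoint id "nvidia/canary-1b", the EncDecMultiTaskModel class, and the sample file name are assumptions to verify against the Canary model card rather than a definitive recipe.

```python
# Hedged sketch: loading Canary via NeMo and transcribing one file.
from nemo.collections.asr.models import EncDecMultiTaskModel

canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")  # assumed checkpoint id
# Transcription in the source language; translation is selected through the
# task/language prompts documented on the model card.
print(canary.transcribe(["sample_en.wav"]))  # hypothetical audio file
```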
NVIDIA NeMo, an end-to-end platform for developing multimodal generative AI models at scale anywhere (on any cloud and on-premises), recently released Parakeet-TDT. This new addition to the NeMo ASR Parakeet model family boasts better accuracy and 64% greater speed than the previously best model, Parakeet-RNNT-1.1B. This post explains Parakeet-TDT and how to use it to generate highly accurate…
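A minimal sketch of what inference could look like, assuming the checkpoint is published as "nvidia/parakeet-tdt-1.1b" and that NeMo's generic ASRModel loader and transcribe() API apply; the audio file name is hypothetical.

```python
# Hedged sketch: transcribing with Parakeet-TDT through NeMo.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")  # assumed id
transcripts = asr_model.transcribe(["meeting_recording.wav"])  # hypothetical file
print(transcripts[0])
```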
NVIDIA NeMo, an end-to-end platform for the development of multimodal generative AI models at scale anywhere (on any cloud and on-premises), released the Parakeet family of automatic speech recognition (ASR) models. These state-of-the-art ASR models, developed in collaboration with Suno.ai, transcribe spoken English with exceptional accuracy. This post details Parakeet ASR models that are…
Speech and translation AI models developed at NVIDIA are pushing the boundaries of performance and innovation. The NVIDIA Parakeet automatic speech recognition (ASR) family of models and the NVIDIA Canary multilingual, multitask ASR and translation model currently top the Hugging Face Open ASR Leaderboard. In addition, a multilingual P-Flow-based text-to-speech (TTS) model won the LIMMITS '24…
Breaking barriers in speech recognition, NVIDIA NeMo proudly presents pretrained models tailored for Dutch and Persian, languages often overlooked in the AI landscape. These models leverage the recently introduced FastConformer architecture and were trained simultaneously with CTC and transducer objectives to maximize each model's accuracy. Automatic speech recognition (ASR) is a…
The integration of speech and translation AI into our daily lives is rapidly reshaping our interactions, from virtual assistants to call centers and augmented reality experiences. Speech AI Day provided valuable insights into the latest advancements in speech AI, showcasing how this technology addresses real-world challenges. In this first of three Speech AI Day sessions…
Learn how to build and deploy production-quality conversational AI apps with real-time transcription and NLP.
Every year, as part of their coursework, students from the University of Warsaw, Poland, get to work under the supervision of engineers from the NVIDIA Warsaw office on challenging problems in deep learning and accelerated computing. We present the work of three M.Sc. students (Alicja Ziarko, Paweł Pawlik, and Michał Siennicki) who managed to significantly reduce the latency in TorToiSe…
Large language models (LLMs) are becoming an integral tool for businesses to improve their operations, customer interactions, and decision-making processes. However, off-the-shelf LLMs often fall short in meeting the specific needs of enterprises due to industry-specific terminology, domain expertise, or unique requirements. This is where custom LLMs come into play.
Large language models (LLMs), such as GPT, have emerged as revolutionary tools in natural language processing (NLP) due to their ability to understand and generate human-like text. These models are trained on vast amounts of diverse data, enabling them to learn patterns, language structures, and contextual relationships. They serve as foundational models that can be customized to a wide range of…
One of the main challenges for businesses leveraging AI in their workflows is managing the infrastructure needed to support large-scale training and deployment of machine learning (ML) models. The NVIDIA FLARE platform provides a solution: a powerful, scalable infrastructure for federated learning that makes it easier to manage complex AI workflows across enterprises. NVIDIA FLARE 2.3.0…
Large language models (LLMs) have generated excitement worldwide due to their ability to understand and process human language at a scale that is unprecedented. They have transformed the way we interact with technology. Having been trained on a vast corpus of text, LLMs can manipulate and generate text for a wide variety of applications without much instruction or training. However…
Voice-enabled technology is becoming ubiquitous. But many are being left behind by an anglocentric and demographically biased algorithmic world. Mozilla Common Voice (MCV) and NVIDIA are collaborating to change that by partnering on a public crowdsourced multilingual speech corpus (now the largest of its kind in the world) and open-source pretrained models. It is now easier than ever before to…
The telecommunication industry has seen a proliferation of AI-powered technologies in recent years, with speech recognition and translation leading the charge. Multilingual AI virtual assistants, digital humans, chatbots, agent assists, and audio transcription are technologies that are revolutionizing the telco industry. Businesses are implementing AI in call centers to address incoming requests…
Develop safe and trustworthy LLM conversational applications with NVIDIA NeMo Guardrails, an open-source toolkit that enables programmable guardrails for defining desired user interactions within an application.
On May 23 at 9 am CEST, learn to build and deploy production-quality conversational AI applications with real-time transcription and natural language processing capabilities.
ChatGPT has made quite an impression. Users are excited to use the AI chatbot to ask questions, write poems, have it adopt a persona for interaction, act as a personal assistant, and more. Large language models (LLMs) power ChatGPT, and these models are the topic of this post. Before considering LLMs more carefully, we would first like to establish what a language model does. A language model gives…
Large language models (LLMs) are incredibly powerful and capable of answering complex questions, performing feats of creative writing, developing and debugging source code, and so much more. You can build incredibly sophisticated LLM applications by connecting them to external tools, for example reading data from a real-time source, or enabling an LLM to decide what action to take given a user's…
The Dataiku platform for everyday AI simplifies deep learning. Use cases are far-reaching, from image classification to object detection and natural language processing (NLP). Dataiku helps you with labeling, model training, explainability, model deployment, and centralized management of code and code environments. This post dives into high-level Dataiku and NVIDIA integrations for image…
Project Mellon is a lightweight Python package capable of harnessing the heavyweight power of speech AI (NVIDIA Riva) and large language models (LLMs) (NVIDIA NeMo service) to simplify user interactions in immersive environments. NVIDIA announced at NVIDIA GTC 2023 that developers can start testing Project Mellon to explore creating hands-free extended reality (XR) experiences controlled by…
Over 55% of the global population uses social media, easily sharing online content with just one click. While connecting with others and consuming entertaining content, you can also spot harmful narratives posing real-life threats. That's why VP of Engineering at Pendulum, Ammar Haris, wants his company's AI to help clients gain deeper insight into the harmful content being generated…
Multilingual automatic speech recognition (ASR) models have gained significant interest because of their ability to transcribe speech in more than one language. This is fueled by growing multilingual communities as well as by the need to reduce complexity: you only need one model to handle multiple languages. This post explains how to use pretrained multilingual NeMo ASR models from the…
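A hedged sketch of the typical workflow: list the pretrained checkpoints NeMo knows about, then load one and transcribe. The multilingual model name below is only illustrative; pick an actual name from the printed list or the post.

```python
# Discover and load pretrained NeMo ASR checkpoints (illustrative sketch).
import nemo.collections.asr as nemo_asr

# Print the names of checkpoints the ASRModel class can download.
for info in nemo_asr.models.ASRModel.list_available_models():
    print(info.pretrained_model_name)

# Load one multilingual model (example name; substitute one from the list above)
# and transcribe an audio file (hypothetical path).
model = nemo_asr.models.ASRModel.from_pretrained("stt_enes_conformer_ctc_large")
print(model.transcribe(["es_or_en_audio.wav"]))
```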
Real-time natural language understanding will transform how we interact with intelligent machines and applications.
Join experts from Google, Meta, NVIDIA, and more at the first annual NVIDIA Speech AI Summit. Register now!
Learn how to build, train, customize, and deploy a GPU-accelerated automatic speech recognition service with NVIDIA Riva in this self-paced course.
Speech recognition technology is growing in popularity for voice assistants and robotics, for solving real-world problems through assisted healthcare or education, and more. This is helping democratize access to speech AI worldwide. As labeled datasets for unique, emerging languages become more widely available, developers can build AI applications readily, accurately, and affordably to enhance…
When examining an intricate speech AI robotic system, it's easy for developers to feel intimidated by its complexity. Arthur C. Clarke claimed, "Any sufficiently advanced technology is indistinguishable from magic." From accepting natural-language commands to safely interacting in real time with its environment and the humans around it, today's speech AI robotics systems can perform tasks to…
Text normalization (TN) converts text from written form into its verbalized form, and it is an essential preprocessing step before text-to-speech (TTS). TN ensures that TTS can handle all input texts without skipping unknown symbols. For example, "$123" is converted to "one hundred and twenty-three dollars." Inverse text normalization (ITN) is a part of the automatic speech recognition (ASR)…
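For orientation, a hedged sketch of TN and ITN with the nemo_text_processing package; the class paths, constructor arguments, and exact verbalized output should be checked against its documentation, and the sample strings are illustrative.

```python
# Hedged sketch: WFST-based text normalization and inverse text normalization.
from nemo_text_processing.text_normalization.normalize import Normalizer
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

tn = Normalizer(input_case="cased", lang="en")
print(tn.normalize("$123"))  # expected to verbalize, e.g. "one hundred and twenty-three dollars"

itn = InverseNormalizer(lang="en")
print(itn.inverse_normalize("one hundred twenty three dollars", verbose=False))  # e.g. "$123"
```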
Speaker diarization is the process of segmenting audio recordings by speaker labels; it aims to answer the question "Who spoke when?" This makes it distinct from speech recognition: before you perform speaker diarization, you know what is spoken but you don't know who spoke it. Therefore, speaker diarization is an essential feature for a speech recognition…
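To illustrate the "who spoke when" idea (this is not any library's API, just a toy combination of hypothetical diarization segments and ASR word timestamps):

```python
# Toy illustration: attach speaker labels to ASR words by their timestamps.
diarization = [("speaker_0", 0.0, 3.2), ("speaker_1", 3.2, 6.0)]  # (who, start_s, end_s)
asr_words = [("hello", 0.4), ("there", 0.9), ("hi", 3.5), ("back", 4.1)]  # (what, when_s)

def speaker_at(t):
    # Return the speaker whose segment covers time t, or "unknown".
    return next((spk for spk, start, end in diarization if start <= t < end), "unknown")

for word, t in asr_words:
    print(f"{speaker_at(t)}: {word}")
```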
Loss functions for training automatic speech recognition (ASR) models are not set in stone. The older rules of loss functions are not necessarily optimal. Consider connectionist temporal classification (CTC) and see how changing some of its rules enables you to reduce GPU memory, which is required for training and inference of CTC-based models and more. For more information about the…
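For reference, here is the vanilla CTC loss in PyTorch (the post discusses modifying CTC's rules; this sketch only shows the standard formulation on random tensors):

```python
# Standard CTC loss on dummy data, for reference.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
T, B, C = 50, 4, 32                       # time steps, batch size, vocab size (incl. blank)
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, 12))    # label ids, excluding the blank index 0
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```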
Learn about the latest tools, trends, and technologies for building and deploying conversational AI.
Artificial intelligence (AI) has transformed synthesized speech from monotone robocalls and decades-old GPS navigation systems to the polished tone of virtual assistants in smartphones and smart speakers. It has never been so easy for organizations to use customized state-of-the-art speech AI technology for their specific industries and domains. Speech AI is being used to power virtual…
Video conferencing, audio and video streaming, and telecommunications recently exploded due to pandemic-related closures and work-from-home policies. Businesses, educational institutions, and public-sector agencies are experiencing a skyrocketing demand for virtual collaboration and content creation applications. The crucial part of online communication is the video stream, whether it's a simple…
With audio and video streaming, conferencing, and telecommunication on the rise, it has become essential for developers to build applications with outstanding audio quality and enable end users to communicate and collaborate effectively. Various background noises can disrupt communication, ranging from traffic and construction to dogs barking and babies crying. Moreover, a user could talk in a…
Deep learning is proving to be a powerful tool when it comes to high-quality synthetic speech development and customization. A Toronto-based startup, and NVIDIA Inception member, Resemble AI is upping the stakes with a new generative voice tool able to create high-quality synthetic AI voices. The technology can generate cross-lingual and naturally speaking voices in over 50 of the most…
You can save time and produce a more accurate result when processing audio data with automated speech recognition (ASR) models from NVIDIA NeMo and Label Studio. NVIDIA NeMo provides reusable neural modules that make it easy to create new neural network architectures, including prebuilt modules and ready-to-use models for ASR. With the power of NVIDIA NeMo, you can get audio transcriptions…
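A hedged sketch of one way this workflow could look: batch-transcribe audio with a pretrained NeMo model, then write the results as JSON tasks for import into Label Studio as pre-annotations. The model name, file paths, and the exact Label Studio JSON layout are assumptions to verify against the two projects' current documentation.

```python
# Hedged sketch: NeMo transcriptions saved as Label Studio pre-annotation tasks.
import json
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_large")  # assumed name
files = ["call_001.wav", "call_002.wav"]  # hypothetical audio files
texts = model.transcribe(files)  # assumes plain strings are returned (typical for CTC models)

tasks = [
    {
        "data": {"audio": path},
        "predictions": [{"result": [{"from_name": "transcription", "to_name": "audio",
                                     "type": "textarea", "value": {"text": [text]}}]}],
    }
    for path, text in zip(files, texts)
]
with open("pre_annotations.json", "w") as out:
    json.dump(tasks, out, indent=2)
```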
Deepgram, an NVIDIA Inception startup developing automatic speech recognition (ASR) deep learning models, recently published a new demo that highlights the speed and scalability of its platform on NVIDIA GPUs. "We've reinvented Automatic Speech Recognition (ASR) with a complete, deep learning model that allows companies to get faster, more accurate transcription, resulting in more reliable…
Facebook AI this week announced they are open sourcing a deep learning model called M2M-100 that can translate any language pair, among 100 languages, without relying on English data. For example, when translating from Chinese to French, previous models would train on Chinese to English to French. M2M-100 directly trains on Chinese to French to better preserve meaning. "Deploying M2M-100 will…
COVID-19 is fundamentally changing the doctor-patient dynamic worldwide. Telemedicine is now becoming an essential technology that healthcare providers can offer patients as an adjunct or alternative for in-person visits that is both effective and convenient. We've all been there, at one time or another in the last six months: speaking with a nurse or family doctor using video calling on our…
A team of Emory University students won Amazon's 2020 Alexa Socialbot Grand Challenge, a worldwide competition to create the most engaging AI chatbot. The team earned $500,000 for their chatbot named Emora. The researchers developed Emora as a social companion that can provide comfort and warmth to people interacting with Alexa-enabled devices. Emora can chat about movies, sports…
To help accelerate natural language processing in biomedicine, Microsoft Research developed a BERT-based AI model that outperforms previous biomedicine natural language processing (NLP) methods. The work promises to help researchers rapidly advance research in this field. The model, built on top of Google's BERT, can classify documents, extract medical information…
To help localize subtitles from English to other languages, such as Russian, Spanish, or Portuguese, Netflix developed a proof-of-concept AI model that can automatically simplify and translate subtitles to multiple languages. The work is presented in a paper, Simplify-then-Translate: Automatic Preprocessing for Black-Box Machine Translation, published this month on the preprint platform…
At GTC 2020, NVIDIA announced and shipped a range of new AI SDKs, enabling developers to support the new Ampere architecture. For the first time, developers have the tools to build end-to-end deep learning-based pipelines for conversational AI and recommendation systems. Today, NVIDIA announced Riva, a fully accelerated application framework for building multimodal conversational AI services.
Many of today's speech synthesis models lack emotion and human-like expression. To help tackle this problem, a team of researchers from the NVIDIA Applied Deep Learning Research group developed a state-of-the-art model that generates more realistic expressions and provides better user control than previously published models. Named "Flowtron," the model debuted publicly for the first time as…
This week, OpenAI released Jukebox, a neural network that generates music with rudimentary singing, in a variety of genres and artist styles. "Provided with genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch," the company stated in their post, Jukebox. Generating CD-quality music is a challenging problem to solve, as a typical song has over 10…
To improve how natural language processing (NLP) systems such as Alexa handle complex requests, Amazon researchers, in collaboration with the University of Massachusetts Amherst, developed a deep learning-based, sequence-to-sequence model that can better handle simple and complex queries. "Virtual assistants such as Amazon Alexa, Apple Siri, and Google Assistant often rely on a semantic…
The article below is a guest post by Nuance, a company focused on conversational AI. In this post, Nuance engineers describe their use of NVIDIA's automatic mixed precision to speed up their AI models in the healthcare industry. By Wenxuan Teng, Ralf Leibold, and Gagandeep Singh. Nuance's ambient clinical intelligence (ACI) technology is an example of how it is accelerating development of…
Today, NVIDIA released TensorRT 6, which includes new capabilities that dramatically accelerate conversational AI applications, speech recognition, and 3D image segmentation for medical applications, as well as image-based applications in industrial automation. TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low-latency, high-throughput inference for AI…
To help people with speech impairments better interact with everyday smart devices, Google researchers have developed a deep learning-based automatic speech recognition (ASR) system that aims to improve communications for people with amyotrophic lateral sclerosis (ALS), a disease that can affect a person's speech. The research, part of Project Euphonia, is an ASR platform that performs speech…
NVIDIA announced breakthroughs today in language understanding that give developers the opportunity to more naturally develop conversational AI applications using BERT and real-time inference tools, such as TensorRT, to dramatically speed up their AI speech applications. In today's announcement, researchers and developers from NVIDIA set records in both training and inference of BERT…
AI-enabled services such as speech recognition and natural language processing are increasing in demand. To help developers manage growing datasets, latency requirements, customer requirements, and more complex neural networks, we are highlighting a few AI speech applications that rely on NVIDIA's inference platform to solve common AI speech challenges. From Amazon's Alexa Research group…
Current audio speech recognition models normally do not perform well in noisy environments. To help solve the problem, researchers from Samsung and Imperial College London developed a deep learning solution that uses computer vision for visual speech recognition. The model is capable of lipreading, as well as synthesizing the audio of the speech it sees in a video. Lipreading is primarily used by…
To potentially improve natural language queries, including the retrieval of images from speech, researchers from IBM and the University of Virginia developed a deep learning model that can generate objects and their attributes from natural language descriptions. Unlike other recent methods, this approach does not use GANs. "We show that under minor modifications, the proposed framework can…
Developers from Amazon's Alexa Research group have just published a developer blog post and a paper describing how they are using adversarial training to recognize and improve emotion detection. "A person's tone of voice can tell you a lot about how they're feeling. Not surprisingly, emotion recognition is an increasingly popular conversational-AI research topic," said Viktor Rozgic…
Almost 1,000,000 books are published every year in the United States; however, only around 40,000 of them are converted into audiobooks, primarily due to costs and production time. To help with the process, DeepZen, a London-based company and a member of the Inception program, NVIDIA's startup incubator, developed a deep learning-based system that can generate complete audio recordings of…
To enhance the capability of text-to-speech and automatic speech recognition algorithms, Microsoft researchers developed a deep learning model that uses unsupervised learning, an approach not commonly used in this field, to improve the accuracy of the two speech tasks. By using the Transformer model, which is based on a sequence-to-sequence architecture, the team achieved a 99.84%…
Microsoft AI Research just announced a new breakthrough in the field of conversational AI that achieves new records in seven of nine natural language processing tasks from the General Language Understanding Evaluation (GLUE) benchmark. Microsoft's natural language processing algorithm called Multi-Task DNN, first released in January and updated this month, incorporates Google's BERT NLP model…
To help people who suffer from hearing loss, researchers from Columbia University just developed a deep learning-based system that can help amplify specific speakers in a group, a breakthrough that could lead to better hearing aids. "The brain area that processes sound is extraordinarily sensitive and powerful; it can amplify one voice over others, seemingly effortlessly…
Every week we highlight NVIDIA's Top 5 AI stories of the week. In this week's edition we cover a new deep learning-based algorithm from OpenAI that can automatically generate new music. Plus, an automatic speech recognition model that could improve Alexa's algorithm by 15%. Watch below: Planning a workout that is specific to a user's needs can be challenging.
Trying to generate music like Mozart, Beethoven, or perhaps Lady Gaga? AI research organization OpenAI just released a demo of a new deep learning algorithm that can automatically generate original music using many different instruments and styles. "We've created MuseNet, a deep neural network that can generate 4-minute musical compositions with 10 different instruments, and can combine styles…
Researchers from Johns Hopkins University and Amazon published a new paper describing how they trained a deep learning system that can help Alexa ignore speech not intended for her, improving the speech recognition model by 15%. "Voice-controlled household devices, like Amazon Echo or Google Home, face the problem of performing speech recognition of device-directed speech in the presence of…
Gridspace, a southern California-based company, recently presented an end-to-end deep learning-based solution that can allow businesses to automate the call center process by using NVIDIA GPUs on the cloud. In an example shown at GTC Silicon Valley, the company presented a video of an AI-generated voice interacting with a customer as if it were a real person. The tool has the potential to allow…
When you ask your phone a question, you don't just want the right answer. You want the right answer, right now. The answer to this seemingly simple question requires an AI-powered service that involves multiple neural networks that have to perform a variety of predictions and get an answer back to you in under one second so it feels instantaneous. These include: All of these AI…
From an AI algorithm that can predict earthquakes to a system that can decode rodent chatter, here are the top 5 AI stories of the week. Most people can't detect an earthquake until the ground under their feet is already shaking or sliding, leaving little time to prepare or take shelter. Scientists are trying to short-circuit that surprise using the critical time window during the…
According to the U.N., up to 100 elephants are slaughtered every day in Africa by poachers taking part in the illegal ivory trade. This amounts to around 35,000 elephants killed each year due to poaching. To help fight the problem, Conservation Metrics, a Santa Cruz, California-based startup, is using deep learning to help detect the sounds of elephants, as well as gunfire, and get a more detailed…
To help up-and-coming musicians create the best beats for their songs, developers from a Japan-based AI startup developed a deep learning system called Neural Beatboxer that can convert everyday sounds into hours of automatically compiled rhythms. Users can visit their website, feed it some sounds, and the neural network automatically produces a custom drum kit that can go on for hours.
Researchers from Facebook developed a deep learning system that can replicate the music it hears and play it back as if it were Mozart, Beethoven, or Bach. This is the first time researchers have produced high-fidelity musical translation between instruments, styles, and genres. "Humans have always created music and replicated it, whether it is by singing, whistling, clapping, or…
Voicea, a San Francisco Bay Area startup, recently announced $20 million in funding for their GPU-based deep learning system that can now fully transcribe meetings and put together highlights. The system was designed to help teams better collaborate in an enterprise environment. Eva, the startup's AI assistant, joins meetings and conference calls through a combination of machine learning…
Researchers at Microsoft announced they reached a 5.1% error rate, a new milestone toward human parity: recognizing words in a conversation as well as professional human transcribers do. They improved the accuracy of their system from last year on the Switchboard conversational speech recognition task. The benchmarking task is a corpus of recorded telephone conversations that the…
Take part in the world's top GPU developer event May 8–11, 2017 in Silicon Valley where artificial intelligence, virtual reality and autonomous vehicles will take center stage. GTC 2017 provides developers and thought leaders with the opportunity to share their work with thousands of the world's brightest minds. The 2016 event had more than 5,500 attendees, and 600+ sessions on GPU…
Alexey Kamenev, Software Engineer at Microsoft Research, talks about their open-source Computational Network Toolkit (CNTK) for deep learning, which describes neural networks as a series of computational steps via a directed graph. Kamenev also shares a bit about how they're using GPUs, the CUDA Toolkit and GPU-accelerated libraries for the variety of Microsoft products that benefit from deep…
Yann LeCun, Director of Facebook AI Research, invited NVIDIA CEO Jen-Hsun Huang to speak at "The Future of AI" symposium at NYU, where industry leaders discussed the state of AI and its continued advancement. Jen-Hsun published a blog on his talk that covers topics such as how deep learning is a new software model that needs a new computing model; why AI researchers have adopted GPU-accelerated…
Researchers from Karlsruhe Institute of Technology, MIT, and the University of Toronto published MovieQA, a dataset that contains 7,702 reasoning questions and answers from 294 movies. Their innovative dataset and accuracy metrics provide a well-defined challenge for question-answering machine learning algorithms. The questions range from simpler "Who" did "What" to "Whom" questions that can be solved by computer vision…
NVIDIA announced that Facebook will accelerate its next-generation computing system with the NVIDIA Tesla Accelerated Computing Platform, which will enable them to drive a broad range of machine learning applications. Facebook is the first company to train deep neural networks on the new Tesla M40 GPUs, introduced last month, and this will play a large role in their new open source "Big Sur"…
Wired discusses Google's announcement that it is open sourcing its TensorFlow machine learning system, noting the system uses GPUs to both train and run artificial intelligence services at the company. Inside Google, when tackling tasks like image recognition and speech recognition and language translation, TensorFlow depends on machines equipped with GPUs that were originally designed to render…
Instagram could offer a novel way of monitoring the drinking habits of teenagers. Using photos and text from Instagram, a team of researchers from the University of Rochester has shown that this data can not only expose patterns of underage drinking more cheaply and faster than conventional surveys, but also find new patterns, such as what alcohol brands or types are favored by different…
In a recent interview with TIME, NVIDIA's senior director of automotive Danny Shapiro shares how the company's innovations in gaming graphics are well-suited to the needs of autonomous vehicles. Driverless cars, which take passengers from A to B with minimal human input, are already hitting American roads. A variety of automakers and technology firms are experimenting with driverless technology…
The HPC and GPU Supercomputing Group of Silicon Valley will be hosting two researchers from Baidu on Tuesday, October 6, 2015 from 6:30 PM to 9:30 PM at the NVIDIA Headquarters in Santa Clara, CA. We are very excited to have Awni Hannun and Erich Elsen from Baidu Research join us. Awni, a former Stanford researcher, has extensive experience in solving hard speech recognition tasks using deep…
Two winners in the Visionary category are harnessing the computing power of NVIDIA GPUs to drive their artificial intelligence applications. MIT Technology Review recently revealed its annual "35 Innovators Under 35," which lists young technologists using today's emerging technologies to transform tomorrow's world. Ilya Sutskever, 29, is a key member of the Google Brain research team…
Chinese search giant Baidu recently presented a new GPU-based Deep Speech deep learning system that has 94% accuracy when handling voice queries in Mandarin. Originally unveiled in December 2014, the speech recognition system was only able to recognize the English language. Baidu senior research engineer Awni Hannun was interviewed by Medium to share why Mandarin is such a tough…
]]>