• <xmp id="om0om">
  • Purnendu Mukherjee, NVIDIA; Thor Johnsen, NVIDIA
    gtc-dc 2019
    We’ll explain why training large-scale language models such as BERT, GPT-2, and XLNet requires massively parallel computation to converge within a reasonable amount of time, and why multi-node GPU training is necessary to meet these computational demands. We’ll present the tools we used to scale out pre-training of these language models without losing accuracy: distributed training tools, improved optimization for large-batch training, automatic mixed precision, cluster managers, and GPU profiling tools such as Nsight. We’ll also discuss common bottlenecks and how to avoid them.