Microsoft Open Sources Breakthrough Optimizations for Large-Scale BERT Models
Emma Ning, Microsoft | Nathan Yan, Microsoft
GTC 2020
We'll talk about BERT (Bidirectional Encoder Representations from Transformers), a popular deep-learning model for natural language processing. Inferencing BERT at scale has traditionally been extremely costly, and may not even be feasible under strict latency constraints. We'll share examples of how we've scaled BERT with NVIDIA GPUs, such as Bing improving BERT inference for its real-time service needs, serving more than one million BERT inferences per second within Bing's latency limits. We'll also discuss open-source, enhanced versions of these optimizations, now available in ONNX Runtime and extended to work on both GPU and CPU. With ONNX Runtime, AI developers can easily serve large transformer models with high performance across both CPU and GPU hardware, using the same technology Microsoft uses to serve its customers.