
    GTC Silicon Valley 2019, ID S9422: An Automatic Batching API for High Performance RNN Inference

    Murat Guney (NVIDIA)
    We describe a new API that makes more effective use of GPU hardware when an application runs many single-instance inferences of the same RNN model. Many NLP applications must serve multiple independent inference instances under real-time latency requirements. The proposed API accepts independent inference requests from an application and transparently combines them into one large batched execution. Time steps from independent inference tasks are grouped together, which achieves high performance while staying within the application's latency budget for each time step. We also discuss functionality that lets the user wait for completion of a specific time step, which is possible because the implementation consists mainly of non-blocking function calls. Finally, we present performance data on the Turing architecture for an example RNN model with LSTM cells and projections.
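The batching idea in the abstract can be illustrated with a small CPU-only sketch. This is not the API presented in the talk: the class `AutoBatcher`, its methods `submit` and `close`, and the parameters `max_batch`, `timeout_s`, `w_x`, and `w_h` are all hypothetical names chosen for illustration, and a scalar tanh recurrence stands in for the LSTM cell. The sketch assumes that `timeout_s` plays the role of the per-time-step latency budget and that a `Future` is a reasonable stand-in for "wait on completion of a certain time step".

```python
import math
import queue
import threading
import time
from concurrent.futures import Future


class AutoBatcher:
    """Hypothetical sketch: merges time steps from independent streams into batches."""

    def __init__(self, max_batch=8, timeout_s=0.005, w_x=0.5, w_h=0.9):
        self.max_batch = max_batch      # upper bound on streams per batched step
        self.timeout_s = timeout_s      # assumed stand-in for the latency budget
        self.w_x, self.w_h = w_x, w_h   # toy scalar weights, not a real LSTM
        self.hidden = {}                # recurrent state per stream
        self.batch_sizes = []           # recorded only for inspection
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, stream_id, x):
        """Non-blocking: enqueue one time step, return a Future for its output."""
        fut = Future()
        self._pending.put((stream_id, x, fut))
        return fut

    def close(self):
        """Process everything submitted so far, then stop the worker."""
        self._pending.put(None)
        self._worker.join()

    def _place(self, item, batch, seen, carry):
        # A stream may contribute at most one time step per batch, because
        # step t+1 depends on the hidden state produced by step t.
        if item[0] in seen or len(batch) >= self.max_batch:
            carry.append(item)
        else:
            seen.add(item[0])
            batch.append(item)

    def _run(self):
        carry, closing = [], False
        while True:
            batch, seen = [], set()
            waiting, carry = carry, []
            for item in waiting:        # deferred steps first, keeps stream order
                self._place(item, batch, seen, carry)
            deadline = None
            while not closing and len(batch) < self.max_batch:
                if batch and deadline is None:
                    deadline = time.monotonic() + self.timeout_s
                wait = None if deadline is None else max(0.0, deadline - time.monotonic())
                try:
                    item = self._pending.get(timeout=wait)
                except queue.Empty:
                    break               # latency budget used up, run what we have
                if item is None:
                    closing = True
                else:
                    self._place(item, batch, seen, carry)
            if batch:
                self._step(batch)
            if closing and not carry:
                return

    def _step(self, batch):
        # On a GPU this loop would be one batched kernel launch; here it is a
        # plain Python loop that advances each stream by exactly one time step.
        self.batch_sizes.append(len(batch))
        for stream_id, x, fut in batch:
            h = math.tanh(self.w_x * x + self.w_h * self.hidden.get(stream_id, 0.0))
            self.hidden[stream_id] = h
            fut.set_result(h)


if __name__ == "__main__":
    batcher = AutoBatcher(max_batch=4)
    futures = [batcher.submit(s, 1.0) for s in ("a", "b", "c")]  # returns at once
    print([f.result() for f in futures])  # block only on the steps we need
    batcher.close()
    print("batch sizes:", batcher.batch_sizes)
```

In this sketch the caller never blocks in `submit`; it blocks only when it calls `result()` on the `Future` for the time step it cares about. How many requests end up in each batch depends on arrival timing, so the recorded batch sizes can vary between runs, while the per-stream outputs do not.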

    View the slides (pdf)