Cris Cecka – NVIDIA Technical Blog
News and tutorials for developers, data scientists, and IT admins
2025-07-16T00:51:12Z http://www.open-lab.net/blog/feed/

Cris Cecka <![CDATA[CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design]]> http://www.open-lab.net/blog/?p=103394 2025-07-16T00:51:12Z 2025-07-16T15:45:00Z GEMM optimization on GPUs is a modular problem. Performant implementations need to specify hyperparameters such as tile shapes, math and copy instructions, and...]]>

Source

]]>
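The tile shapes mentioned in the excerpt above are the central hyperparameter of a blocked GEMM. As a rough illustration only (plain NumPy stands in for GPU math and copy instructions, and `tiled_gemm` with its tile sizes is a hypothetical name, not CUTLASS API), a tiled matrix multiply parameterized by tile shapes might look like:

```python
import numpy as np

def tiled_gemm(A, B, tile_m=4, tile_n=4, tile_k=4):
    """Toy blocked GEMM with explicit tile-shape hyperparameters.

    On a GPU, each (i, j) output tile would map to a thread block and the
    k-loop would stage A/B tiles through shared memory; here NumPy slices
    stand in for those copies and the tensor-core math instruction.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile_m):          # rows of C: one "thread block" per tile
        for j in range(0, N, tile_n):      # columns of C
            for k in range(0, K, tile_k):  # accumulate partial products over K
                C[i:i + tile_m, j:j + tile_n] += (
                    A[i:i + tile_m, k:k + tile_k] @ B[k:k + tile_k, j:j + tile_n]
                )
    return C
```

Changing `tile_m`, `tile_n`, and `tile_k` does not change the result, only the blocking; on real hardware that choice is what determines occupancy and memory traffic, which is why CUTLASS exposes it compositionally.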
Cris Cecka <![CDATA[CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial Microkernels]]> http://www.open-lab.net/blog/?p=103359 2025-07-16T00:04:51Z 2025-07-16T15:30:00Z In the era of generative AI, utilizing GPUs to their maximum potential is essential to training better models and serving users at scale. Often, these models...]]>

In the era of generative AI, utilizing GPUs to their maximum potential is essential to training better models and serving users at scale. Often, these models have layers that cannot be expressed as off-the-shelf library operations due to subtle modifications, and DL compilers typically forgo the last few percentage points of optimizations to make their deployment feasible.

Source

]]>
Cris Cecka <![CDATA[Pro Tip: cuBLAS Strided Batched Matrix Multiply]]> http://www.open-lab.net/blog/parallelforall/?p=7561 2022-08-21T23:38:07Z 2017-02-28T03:39:17Z There's a new computational workhorse in town. For decades, general matrix-matrix multiply—known as GEMM in Basic Linear Algebra Subroutines (BLAS)...]]>

There’s a new computational workhorse in town. For decades, general matrix-matrix multiply—known as GEMM in Basic Linear Algebra Subroutines (BLAS) libraries—has been a standard benchmark for computational performance. GEMM is possibly the most optimized and widely used routine in scientific computing. Expert implementations are available for every architecture and quickly achieve the peak…
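The strided batched interface named in the entry title can be sketched in NumPy. This models the semantics of cuBLAS `cublasSgemmStridedBatched`, where the i-th problem reads its operands at a fixed element offset `i * stride` inside one flat allocation, so no per-problem pointer arrays are needed. The Python function name, row-major layout, and omission of transpose arguments are simplifying assumptions, not the actual cuBLAS signature:

```python
import numpy as np

def gemm_strided_batched(alpha, A, strideA, B, strideB, beta, C, strideC,
                         m, n, k, batch_count):
    """Semantics sketch of a strided batched GEMM.

    A, B, C are flat 1-D buffers holding batch_count packed matrices;
    problem i lives at element offset i * stride within each buffer.
    Computes C_i = alpha * A_i @ B_i + beta * C_i in place.
    """
    for i in range(batch_count):
        Ai = A[i * strideA : i * strideA + m * k].reshape(m, k)
        Bi = B[i * strideB : i * strideB + k * n].reshape(k, n)
        Ci = C[i * strideC : i * strideC + m * n].reshape(m, n)  # view into C
        Ci[:] = alpha * (Ai @ Bi) + beta * Ci
```

The stride is an element count, not a byte count; setting `strideA = 0` would broadcast one A matrix across the whole batch, a trick the real interface also permits.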

Source

]]>