Berkin Kartal – NVIDIA Technical Blog
News and tutorials for developers, data scientists, and IT admins
2025-05-29T19:05:04Z http://www.open-lab.net/blog/feed/

Berkin Kartal
AI Fabric Resiliency and Why Network Convergence Matters
http://www.open-lab.net/blog/?p=98574 2025-05-29T19:05:04Z 2025-05-14T16:20:00Z

High-performance computing and deep learning workloads are extremely sensitive to latency. Packet loss forces retransmission or stalls in the communication pipeline, which directly increases latency and disrupts the synchronization between GPUs. This can degrade the performance of collective operations such as all-reduce or broadcast, where every GPU’s participation is required before progressing.

Berkin Kartal
Comparing Solutions for Boosting Data Center Redundancy
http://www.open-lab.net/blog/?p=70873 2023-10-19T19:05:58Z 2023-09-29T19:46:58Z

In today’s data center, there are many ways to achieve system redundancy for a server connected to a fabric. Customers usually seek redundancy to increase service availability (for example, to keep end-to-end AI workloads running) and to improve system efficiency using different multihoming techniques. In this post, we discuss the pros and cons of the well-known proprietary multi-chassis link aggregation…
