NVLink and the New Hardware Game
We have spent the previous chapters slowly climbing the stack from copper wire to user-facing application. Let us close the technical part of this volume with a glimpse of where the frontier is, and why it matters that you understand everything beneath it.
Why is NVIDIA the most valuable company in the world?
At the time of writing, NVIDIA's market capitalization has surpassed essentially every other public company in the world. It has done this on the back of selling GPUs to companies that are training and serving large AI models.
A junior engineer reading the news will say: "GPUs are good at AI, NVIDIA makes GPUs, NVIDIA wins." That is true but shallow. There is a deeper truth, and it is the same truth this entire book has been pointing at.
The bottleneck for large-scale AI is not "do GPUs exist." Several companies make GPUs. AMD makes them. Intel makes them. Google makes TPUs. Many startups make accelerators.
The bottleneck is how fast GPUs can talk to each other.
A single GPU cannot hold a state-of-the-art model: the weights alone exceed the memory of one card. So you build a cluster of 8, 16, 64, or thousands of GPUs and split the model across them. Now every step of training or inference requires the GPUs to exchange intermediate results with each other, which means hundreds of gigabytes per second of data flying between chips.
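To put rough numbers on "too large", here is a back-of-envelope sketch. The 70-billion-parameter model, the 2 bytes per parameter (16-bit weights), and the 80 GB of GPU memory are illustrative assumptions, not figures from this chapter, and the helper names are made up for the example.

```python
import math


def weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to store the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, gpu_memory_gb: float,
                         bytes_per_param: int = 2) -> int:
    """Smallest number of GPUs whose combined memory fits the weights."""
    return math.ceil(weights_gb(num_params, bytes_per_param) / gpu_memory_gb)


# Assumed example: 70 billion parameters, 16-bit weights, 80 GB per GPU.
size = weights_gb(70e9)
gpus = min_gpus_for_weights(70e9, gpu_memory_gb=80)
print(f"weights: {size:.0f} GB, GPUs needed for weights alone: {gpus}")
```

This counts only the weights. Training also has to hold gradients, optimizer state, and activations, so the real requirement is typically several times higher.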
If you connect those GPUs over a standard PCIe bus or an ordinary network, the inter-GPU bandwidth becomes the limit. The GPUs sit idle waiting for data, and your billion-dollar cluster runs at a fraction of its theoretical capacity.
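A rough calculation shows how much the link matters. The sketch below uses the standard bandwidth-only cost model for a ring all-reduce, in which each GPU sends about 2(N-1)/N times the buffer size, and it ignores latency and any overlap with computation. The bandwidth figures are assumptions: approximate per-direction peaks of about 32 GB/s for a PCIe Gen 4 x16 link and about 450 GB/s for NVLink on a recent data-center GPU. Check current vendor documentation for real values. The 10 GB gradient buffer and the 8-GPU cluster are likewise illustrative.

```python
def ring_allreduce_seconds(buffer_gb: float, num_gpus: int,
                           link_gb_per_s: float) -> float:
    """Bandwidth-only time estimate for one ring all-reduce."""
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    # Each GPU sends 2 * (N - 1) chunks of size buffer / N.
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * buffer_gb
    return traffic_gb / link_gb_per_s


# Assumed approximate per-direction peak bandwidths, in GB/s.
LINKS_GB_PER_S = {
    "PCIe Gen 4 x16 (assumed)": 32.0,
    "NVLink (assumed)": 450.0,
}

for name, bandwidth in LINKS_GB_PER_S.items():
    seconds = ring_allreduce_seconds(buffer_gb=10.0, num_gpus=8,
                                     link_gb_per_s=bandwidth)
    print(f"{name}: {seconds * 1000:.0f} ms per all-reduce")
```

With these assumed numbers, the PCIe case costs roughly half a second per all-reduce and the NVLink case roughly 40 milliseconds. If the computation in a training step takes a comparable amount of time, the slower link leaves the GPUs waiting for a large share of every step.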
NVIDIA's edge, the one nobody else has fully matched yet, is a technology called NVLink: an interconnect designed specifically to let multiple GPUs read and write each other's memory at bandwidths many times higher than PCIe, though still below a GPU's own local memory bandwidth. With NVLink and the NVSwitch chips built around it (including the rack-scale NVLink Switch System), you can wire dozens of GPUs together in a single rack and have them behave, from the model's point of view, almost like one big unified GPU.
This is, fundamentally, a hardware victory, and a hardware-aware software stack victory. It is what the whole industry has been chasing.
Why is this in a book about computer fundamentals?
Because it is the same lesson, one more time, at the latest frontier:
- The hub became a switch by becoming aware of which device sits behind which port.
- The copper cable was beaten by fiber on distance.
- The IPv4 shortage was held off by NAT through clever address translation.
- Slow file-by-file network transfers became fast by minimizing syscalls and copying.
- The single GPU was beaten by tightly interconnected clusters through direct memory sharing.
Every leap forward in this story has the same shape. Find the real bottleneck. Build something that addresses it directly. Let the rest of the world catch up.
If you understand only the application layer, you will spend your career making apps slightly faster by switching frameworks. If you understand the full stack, you will spend your career identifying the bottlenecks that move industries.
That is the difference this book is trying to make.
Push On It
- Read NVIDIA's own technical documentation for NVLink and NVSwitch. Identify the bandwidth numbers. Compare to PCIe Gen 4 and Gen 5. The gap is the moat.
- Read about RDMA (Remote Direct Memory Access) and InfiniBand. These are older technologies that pioneered the idea of letting one machine read and write another's memory without involving the remote machine's CPU. How is NVLink an evolution of these ideas?
- Look at one of the open-source distributed training frameworks (PyTorch DDP, FSDP, DeepSpeed, Megatron-LM). Find the part of the code that handles cross-GPU communication. Notice how much engineering goes into doing it well.
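As a warm-up for the last exercise, it helps to know what the cross-GPU communication is computing. In data-parallel training the central operation is an all-reduce: every GPU ends up with the element-wise sum of every GPU's gradients. One widely used algorithm for this is the ring all-reduce. What follows is a toy pure-Python simulation of it, not code from any of the frameworks listed above. The function name `ring_allreduce` is hypothetical, and each "chunk" is a single number to keep the example small.

```python
def ring_allreduce(buffers: list[list[float]]) -> list[list[float]]:
    """Simulate a ring all-reduce (sum) over n workers.

    buffers[i] is worker i's data, split into n one-number chunks.
    Returns new buffers in which every worker holds the element-wise sum.
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("each worker needs exactly n chunks")
    data = [list(b) for b in buffers]  # copy so the input is not mutated

    # Phase 1, reduce-scatter: after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first: sends happen simultaneously.
        sends = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (idx, value) in enumerate(sends):
            data[(i + 1) % n][idx] += value  # right neighbour accumulates

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, data[i][(i + 1 - step) % n])
                 for i in range(n)]
        for i, (idx, value) in enumerate(sends):
            data[(i + 1) % n][idx] = value  # right neighbour overwrites

    return data


grads = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
result = ring_allreduce(grads)
print(result[0])  # [1111, 2222, 3333, 4444]
```

Each worker only ever talks to its right-hand neighbour, and it sends 2(n-1) chunks in total. That makes the time for the whole operation depend almost entirely on the bandwidth of the links between neighbours, which is exactly where the interconnect earns its keep.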