Virtual Machines and Containers

We have spent three chapters on the operating system that runs on real hardware. This chapter is about the operating systems that run on operating systems, and about a clever trick that runs many workloads on the same operating system while making them believe otherwise.

The history of virtualisation is one of the most consequential threads in modern infrastructure. It made cloud computing economically possible. It made serverless functions feasible. It made Docker a verb. Understanding it will change how you think about every deployment you ever make.

Emulator, simulator — a quick distinction.

You will hear these words used loosely. Worth separating cleanly.

A simulator imitates the behaviour of a system at some level of abstraction. A flight simulator gives a pilot the experience of flying without anyone actually leaving the ground. A traffic simulator predicts congestion patterns by approximating how cars and drivers behave. A simulator does not usually run the same binary code as the real system — it models the system's behaviour in a separate program.

An emulator runs the actual code of one machine on a different machine. A SNES emulator on your laptop runs the binary instructions of a Super Nintendo cartridge by translating each SNES instruction into one or more instructions your laptop's CPU understands. The original binary thinks it is running on a SNES; in fact every instruction is being interpreted.

In computing infrastructure, "emulator" usually means "I am running x86 code on an ARM processor by translating its instructions." That is what Rosetta 2 does on Apple Silicon Macs when you launch an Intel application (it translates most of the code ahead of time rather than one instruction at a time). It is what qemu-system-x86_64 does on an ARM host.

Virtual machines are a related but stricter idea, and they need their own paragraph.

What a virtual machine actually is.

A virtual machine is a software environment that imitates a complete computer — CPU, RAM, disks, network interfaces, BIOS — so that an unmodified operating system can boot inside it as if on real hardware.

A program called the hypervisor sits between the host operating system (or the bare hardware) and the guest virtual machine. The guest believes it is the only thing running on a real computer. The hypervisor presents it with a virtual CPU, virtual RAM, and virtual devices, and it intercepts the privileged operations that would, on real hardware, talk directly to the chips.

There are two flavours. Type-1 hypervisors (VMware ESXi, Xen, Hyper-V, KVM in some configurations) run directly on the hardware. Type-2 hypervisors (VirtualBox, VMware Workstation, Parallels) run as an application inside a host operating system. The line between the two has blurred over time; the distinction matters less than the technique.

The crucial trick that makes modern virtualisation fast is hardware virtualisation extensions.

VT-x, AMD-V, and the silicon that made it cheap.

In the early 2000s, virtualisation was slow because everything privileged had to be trapped, decoded, and emulated in software. Every system call inside the guest cost real CPU work in the hypervisor.

Intel and AMD then added instruction set extensions — Intel calls theirs VT-x, AMD calls theirs AMD-V — that let the CPU itself distinguish between "host mode" and "guest mode" execution. A guest OS can now run its own instructions directly on the CPU at native speed. The CPU silently switches modes for the few operations that need to trap into the hypervisor. The hypervisor handles those traps, then resumes the guest.

This made virtualisation almost free for most workloads. It is the reason CPU-bound code in a virtual machine on modern hardware typically runs within a few percent of native speed. Without these extensions, every VM would feel like wading through tar.

You can check whether your CPU supports them: on Linux, look for the vmx (Intel) or svm (AMD) flag in /proc/cpuinfo. On most modern desktops and servers, they are present. On many laptops, they are present but disabled in the firmware (BIOS/UEFI) settings; turn them on before you try to run a VM.
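The check above can be sketched as a small shell function. This is a minimal sketch, not a definitive detector: the function name virt_flag is made up for illustration, and it only reports what the cpuinfo file advertises.

```shell
# Print which hardware virtualisation flag a cpuinfo file advertises.
# Usage: virt_flag [path]   (path defaults to /proc/cpuinfo)
# virt_flag is a hypothetical helper name, not a standard tool.
virt_flag() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first "flags" line; every core normally repeats the same list.
    flags=$(grep -m 1 '^flags' "$cpuinfo" 2>/dev/null || true)
    # Pad with spaces so a flag at the very end of the line still matches.
    case " $flags " in
        *" vmx "*) echo "vmx" ;;   # Intel VT-x
        *" svm "*) echo "svm" ;;   # AMD-V
        *)         echo "none" ;;  # not advertised, or not an x86 Linux host
    esac
}

virt_flag
```

A result of "none" on an x86 laptop is a hint to look in the firmware settings. On non-x86 machines the file has no flags line at all, so "none" says nothing about whether virtualisation is available.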

KVM, QEMU, and how they relate.

These two words come up together so often it is worth disambiguating them.

KVM (Kernel-based Virtual Machine) is a Linux kernel module that turns the Linux kernel itself into a hypervisor. It uses VT-x or AMD-V to run guest code natively. KVM by itself only handles CPU and memory virtualisation; it does not draw a screen, emulate a network card, or pretend to be a BIOS.

QEMU (Quick Emulator) is a userspace machine emulator. It can run on its own, without KVM, by emulating every CPU instruction in software — slow but flexible (this is how QEMU runs ARM code on x86 hosts and vice versa). When QEMU is combined with KVM, QEMU provides the virtual hardware (devices, BIOS, peripherals) and KVM provides the fast CPU and memory virtualisation. This is the most common combo on Linux: QEMU + KVM.

When you read "I ran a VM with KVM," what almost always actually happened is QEMU using KVM as its acceleration engine.
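To make the division of labour concrete, here is a hedged sketch of how a launcher script might pick QEMU's accelerator. The helper name pick_accel and the file names vmlinuz and initramfs are placeholders. The -accel option selects kvm (hardware-assisted) or tcg (QEMU's pure-software translator).

```shell
# Choose a QEMU accelerator: "kvm" if the KVM device node is usable, else "tcg".
# Usage: pick_accel [device]   (device defaults to /dev/kvm)
# pick_accel is a hypothetical helper name used only for this sketch.
pick_accel() {
    dev="${1:-/dev/kvm}"
    if [ -r "$dev" ] && [ -w "$dev" ]; then
        echo kvm    # the KVM module is loaded and this user may open the device
    else
        echo tcg    # fall back to software emulation of every instruction
    fi
}

# Print (rather than run) the command so the sketch is safe to paste.
# vmlinuz and initramfs are placeholder file names.
echo qemu-system-x86_64 -accel "$(pick_accel)" -m 512 \
    -kernel vmlinuz -initrd initramfs \
    -append "console=ttyS0" -nographic
```

If the printed command says tcg on a machine whose CPU has the extensions, the usual causes are a missing kvm module or a user account without permission to open /dev/kvm.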

Firecracker — the VM stripped to the bone.

For most of virtualisation's history, the question was "how do we run a desktop OS inside another desktop OS?" The answer involved full hardware emulation: BIOS, PCI buses, USB controllers, graphics adapters, audio. A heavyweight machine, booting in tens of seconds.

Cloud computing changed the question. AWS Lambda needed to spin up an isolated environment per request, run a small function in it, and tear it down. A 30-second VM boot was a non-starter. Booting fast became more important than supporting every legacy peripheral.

Amazon's answer was Firecracker, an open-source virtual machine monitor stripped to almost nothing. No BIOS. No PCI bus. No graphical devices. Just the minimum device model required to boot a Linux kernel and run one workload. By its own published figures, Firecracker boots a VM in as little as ~125 milliseconds and adds less than 5 MiB of memory overhead per VM.
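How little there is to configure shows up in the VM description itself. The sketch below writes a JSON file in the shape used by Firecracker's getting-started material, as best I recall it. Treat the field names as an assumption to check against the version you install. The helper name write_fc_config and all file paths are placeholders.

```shell
# Write a minimal Firecracker VM description: one kernel, one root disk,
# a vCPU count and a memory size. There is no firmware, PCI, or display section.
# Usage: write_fc_config <output.json> <kernel> <rootfs>
# write_fc_config is a hypothetical helper; paths must not contain quotes.
write_fc_config() {
    cat > "$1" <<EOF
{
  "boot-source": {
    "kernel_image_path": "$2",
    "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "$3",
      "is_root_device": true,
      "is_read_only": false
    }
  ],
  "machine-config": {
    "vcpu_count": 1,
    "mem_size_mib": 128
  }
}
EOF
}

# Placeholder paths; the command is printed, not executed.
cfg="${TMPDIR:-/tmp}/vm_config.json"
write_fc_config "$cfg" ./vmlinux ./rootfs.ext4
echo firecracker --api-sock /tmp/firecracker.socket --config-file "$cfg"
```

Compare this with the length of a typical QEMU command line for a desktop guest: the missing options are the legacy Firecracker decided not to carry.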

This is the technology behind AWS Lambda and AWS Fargate. When Lambda needs a new execution environment for your function (a cold start), a fresh micro-VM starts up around your code; it is reused for later invocations while it stays warm, then thrown away. The economics that make this level of isolation viable only exist because the VM boots in a fraction of a second.

Firecracker matters not just because it powers a major cloud product, but because it demonstrates the lesson we keep running into: find the real bottleneck, address it directly, ignore the legacy you do not need.

Containers — the trick where the kernel becomes the boundary.

A container is not a virtual machine.

A virtual machine virtualises hardware. The guest runs its own kernel. The hypervisor lies to it about the CPU and RAM.

A container shares the host kernel. There is only one kernel on the machine. The container is just a set of processes running on the host kernel, but the kernel has been instructed to lie to them about what they can see.

The lies are implemented through two Linux kernel features.

Namespaces let the kernel give a group of processes their own view of certain resources. There is a PID namespace (each container sees its own process tree starting at PID 1), a network namespace (each container has its own interfaces and routing tables), a mount namespace (each container has its own filesystem view), a UTS namespace (hostname), a user namespace (UID mappings), and a few others. Inside a container, ps aux shows only the container's processes. ip addr shows only the container's network. The host can see everything; the container can see only what its namespaces allow.
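The kernel exposes each process's namespace membership as symbolic links under /proc/PID/ns, which makes the "different view" easy to observe. The following is a minimal sketch. The helper name ns_ids is made up, and its second argument is a hypothetical convenience for pointing it at a different proc directory.

```shell
# List the namespace identifiers a process belongs to.
# Usage: ns_ids [pid] [procroot]
#   pid defaults to the current shell; procroot defaults to /proc.
# ns_ids and the procroot argument are hypothetical, for illustration only.
ns_ids() {
    for link in "${2:-/proc}/${1:-$$}"/ns/*; do
        [ -L "$link" ] || continue    # skip if the glob matched nothing
        # Each link reads like "pid:[4026531836]": namespace type, then its ID.
        printf '%s %s\n' "${link##*/}" "$(readlink "$link")"
    done
}

ns_ids
```

Run it once on the host and once inside a container. Two processes that print the same ID for a given type share that namespace. A containerised process will normally print different IDs for pid, net, mnt and uts than a host shell does.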

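Cgroups (which we met in the last chapter) let the kernel apply resource limits to the same group of processes: CPU time, memory, I/O bandwidth, number of processes. Network rate is not on that list: cgroups have no bandwidth controller for it, and traffic shaping is handled elsewhere in the kernel. The container's processes cannot collectively use more than the cgroup permits.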
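With cgroup v2, those limits are plain files in a directory. The sketch below assumes the v2 hierarchy is mounted at /sys/fs/cgroup and that the memory, cpu and pids controllers are enabled for child groups. The helper name make_cgroup, the group name demo, and the specific limit values are arbitrary choices for illustration.

```shell
# Create a cgroup-v2 group with memory, CPU and process-count limits.
# Usage: make_cgroup <cgroup-root> <name>
# On a real system <cgroup-root> is usually /sys/fs/cgroup and this needs root.
# make_cgroup is a hypothetical helper; the limit values are arbitrary.
make_cgroup() {
    dir="$1/$2"
    mkdir -p "$dir"
    echo 268435456      > "$dir/memory.max"  # hard memory cap: 256 MiB, in bytes
    echo "50000 100000" > "$dir/cpu.max"     # 50 ms of CPU per 100 ms period (half a core)
    echo 64             > "$dir/pids.max"    # at most 64 tasks in the group
}

# Typical use as root (not executed here):
#   make_cgroup /sys/fs/cgroup demo
#   echo $$ > /sys/fs/cgroup/demo/cgroup.procs   # move this shell into the group
```

Once a shell has been moved into the group, everything it starts inherits the same budget. A container runtime does essentially this for you before it launches the container's first process.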

Combine namespaces (for visibility) with cgroups (for resource limits) and you have something that behaves like an isolated machine — its own processes, its own network, its own filesystem, its own resource budget — while in fact being a set of ordinary processes on the host kernel.

This is why containers start in tens of milliseconds (no kernel boot, no hardware emulation), use almost no extra memory (no second kernel and no emulated devices to keep resident), and feel like lightweight VMs. They are not VMs. They are a clever set of lies told by a shared kernel.

The trade-offs, honestly.

Containers are smaller, faster, and cheaper to run than VMs. They also share the host kernel. If a vulnerability in the kernel lets a process escape its namespace, every container on the host is exposed. VMs, with their own kernel, provide a stronger isolation boundary — to escape a VM, you have to find a bug in the hypervisor itself, which is a much smaller and more hardened attack surface.

This is why high-isolation workloads — multi-tenant cloud, untrusted customer code, security-sensitive functions — often run inside VMs (or micro-VMs like Firecracker), even though containers would be lighter. The extra security is worth the extra cost.

This is also why some technologies blur the line. Kata Containers wraps each container in its own micro-VM, giving you container ergonomics with VM-level isolation. AWS Lambda runs your function in Firecracker. The frontier is "containers when you control the workload, micro-VMs when you don't."

Docker, Podman, containerd, Rancher Desktop — the runtime zoo.

The container runtime ecosystem is more complicated than it needs to be, so a quick map.

Docker is the company and the original product. The Docker daemon manages images, runs containers, and exposes a CLI. For years it was the only mainstream way to use containers.

containerd is the lower-level runtime that Docker now uses under the hood. It can be used directly without the Docker daemon (Kubernetes does this).

runc is the lowest-level component that actually creates the container (sets up namespaces, cgroups, drops capabilities, starts the process). containerd and Docker both use runc.

Podman is a Docker-compatible CLI that does not need a daemon. Useful for some security and rootless-container scenarios.

Docker Desktop is the GUI app for Mac and Windows. It now requires a license for commercial use in larger companies. This is why Rancher Desktop exists — an open-source alternative that bundles containerd or dockerd, Kubernetes, and the necessary VM layer for Mac/Windows hosts (Linux containers need a Linux kernel; on Mac/Windows, the desktop apps run a small VM in the background).

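If you are on a Mac and your company will not pay for Docker Desktop, install Rancher Desktop. If you select its dockerd engine, the container images you build and the docker CLI commands you type are unchanged; with the containerd engine you use nerdctl instead, which accepts largely the same commands.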

The mental model to take with you.

When you deploy software, you have three rough choices.

You can deploy on the bare metal — install your application directly on a Linux server. Maximum performance, minimum isolation. One bad neighbour can break the whole machine.

You can deploy in a virtual machine — full kernel-level isolation, full hardware abstraction, slower to start, heavier in resource use. The default in classic infrastructure.

You can deploy in a container — shared kernel, lightweight isolation via namespaces and cgroups, milliseconds to start, dense packing on the host. The default in modern infrastructure.

You can also combine them: containers running inside VMs (Kubernetes on EC2), micro-VMs running container-shaped workloads (Firecracker for Lambda), nested VMs inside VMs (developer laptops running cloud-like local clusters).

The senior engineer reads each system in front of them and asks: which boundary is doing the actual isolation here, and which boundary is just there because it was easy to add? When something breaks, the layer that broke is the one that was load-bearing. Knowing that map — bare metal, kernel, namespaces, hypervisor, micro-VM, full VM — is what separates "I deployed it" from "I understand how it runs."

That is the workout we have been building toward.

Push On It

Install QEMU on your machine. Download any small Linux kernel image and a root filesystem (Alpine is convenient). Boot the VM with qemu-system-x86_64 -kernel ... -initrd ... -append "console=ttyS0" -nographic. Watch it boot from nothing. Time it.
Get Firecracker working (it is harder on a Mac, easier on Linux). Boot the same kernel inside Firecracker. Compare boot times against QEMU. The gap is the cost of the things Firecracker chose not to do.
Install Rancher Desktop (or Podman, or anything with a working docker CLI). Run an Alpine container interactively (docker run -it alpine sh). Inside the container, run ps aux, ip addr, and mount. Compare them to the same commands on your host. The container's lies are visible from inside.
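Read the Linux manual page on namespaces (man 7 namespaces). Create a new network namespace manually with sudo ip netns add test. Verify that processes inside it see a different network, for example with sudo ip netns exec test ip addr, which should typically show little more than a loopback interface that is down. You have just built half a container with the kernel's own tools.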
