Processes, Scheduling, and Signals

We met processes briefly in the foundation part — a program is a recipe, a process is the cooking, isolation matters. This chapter goes a layer deeper. We are going to look at what the operating system is actually doing to the processes it runs, and what every senior engineer should know about reading the symptoms.

The states a process can be in.

At any given moment, a process is in one of a small set of states. The kernel maintains this state in the process table and uses it to make decisions.

Running. The process is currently executing on a CPU core, or it is in the runqueue waiting for one. On most schedulers, "running" includes both "actively on-CPU right now" and "ready to run, just waiting for a slot."

Sleeping. The process is blocked waiting for something — a disk read to complete, a network packet to arrive, a lock to be released, a timer to expire. The kernel will not schedule it until whatever it is waiting for becomes available. Most processes on a real machine, most of the time, are sleeping.

Stopped. The process has been paused by a signal: SIGSTOP, or SIGTSTP, which is what the terminal sends when you hit Ctrl-Z in a shell. It is neither running nor sleeping; it is frozen mid-execution and will only resume when it receives SIGCONT.

Zombie. This is the weird one, and it is worth lingering on.

A zombie is a process that has finished executing but whose entry in the process table has not been cleaned up. It happens when the parent process has not yet collected the child's exit status. The zombie owns no memory, no file descriptors, no CPU time. It is just a tiny record — exit code, PID, a few flags — sitting in the table waiting for its parent to acknowledge it.
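
These states surface as single letters in the State: field of /proc/<pid>/status, the same letters ps prints in its STAT column. Here is a minimal Python sketch of reading them, assuming a Linux /proc filesystem. The helper names parse_state and process_state are made up for illustration, and the table includes "D", the uninterruptible flavour of sleeping that you will also meet in practice.

```python
from pathlib import Path

# Single-letter codes the Linux kernel reports for the states described above.
# "D" (uninterruptible sleep, usually disk I/O) is a second kind of sleeping.
STATE_NAMES = {
    "R": "running or runnable",
    "S": "sleeping (interruptible)",
    "D": "sleeping (uninterruptible)",
    "T": "stopped",
    "Z": "zombie",
}


def parse_state(status_text: str) -> str:
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tS (sleeping)".
            return line.split()[1]
    raise ValueError("no State: line found")


def process_state(pid: int) -> str:
    """Read the state of a live process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/status").read_text())
```

Keeping the parsing separate from the file read means you can test it on a saved copy of the status file, which is handy when the process you care about has already gone.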

You can see them with ps aux | grep Z, looking for a Z in the STAT column (the grep also matches any other line containing a capital Z, so read the column rather than counting lines). On a healthy system, you will see zero or close to zero. On a system where a parent process is buggy and not reaping its children, they accumulate. Each zombie still holds a PID, so eventually you hit the system's PID limit and new processes refuse to start. Sending SIGKILL to a zombie does nothing, because it is already dead. The fix is to fix the parent so that it calls wait(), or to kill the parent and let init adopt the zombies (which it will then reap correctly).
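
The parent's side of the fix is small. As a sketch in Python (assuming Python 3.9 or later for os.waitstatus_to_exitcode; the function name run_child_and_reap is hypothetical), a parent that reaps its child looks roughly like this:

```python
import os


def run_child_and_reap(exit_code: int) -> int:
    """Fork a child that exits at once, then collect its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping the parent's cleanup handlers.
        os._exit(exit_code)
    # Parent: waitpid() blocks until the child has exited, then releases its
    # process-table entry. Without this call the child would stay a zombie
    # for as long as the parent lives.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

The exit code the child chose comes back to the parent through that one call, which is exactly the record a zombie exists to hold.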

I will leave the creation of a zombie as homework. The recipe is in the milestone at the end of this chapter.

Scheduling — who gets the CPU next.

A computer with many runnable processes and a small number of CPU cores has to decide, many times per second, which process gets which core.

The scheduler does this. For many years the default Linux scheduler was the Completely Fair Scheduler (CFS), which (roughly) tracks how much CPU time each runnable process has received and gives the next slot to the one that has received the least. Kernels from 6.6 onward replace it with EEVDF, which keeps the same fairness goal but picks tasks by deadline. Either way the scheduler is preemptive: it will interrupt a running process when its time slice runs out and switch to another, even if the first one was happy to keep running.

Every process has a nice value, a number from −20 (highest priority) to 19 (lowest priority). The default is 0. Under contention, a process with a nice value of −5 will get more CPU time than a default process; one with a nice value of 10 will get less. You set it with the nice and renice commands; an unprivileged user can raise a process's nice value but normally cannot lower it. The system also has real-time scheduling classes (SCHED_FIFO, SCHED_RR) for processes that absolutely must run when ready, such as audio engines and robotics controllers, but these are rare in normal application code.
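
A process can also adjust its own nice value from code. This is a small Python sketch of that, using the standard os module; the two wrapper names are invented for the example.

```python
import os


def current_niceness() -> int:
    """Return the nice value of the calling process."""
    return os.getpriority(os.PRIO_PROCESS, 0)


def be_nicer(increment: int) -> int:
    """Raise our own nice value (lower our priority) and return the new value.

    Any process may do this; moving back down normally needs privileges.
    """
    return os.nice(increment)
```

A batch job that calls be_nicer(10) at startup is volunteering to lose when the machine is busy, which is usually what you want from a batch job.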

What happens when the scheduler gets it wrong? You see it. The cursor stutters. Audio glitches. A long task takes longer than it should because the kernel kept switching it out. A senior engineer reads these symptoms and asks: is the workload genuinely CPU-bound, or is the scheduler making bad choices? Most of the time the workload is the problem. Sometimes the scheduler is. The way to find out is to measure, not to guess.
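
One cheap way to start measuring is to ask the kernel how often your process was switched out. The sketch below uses Python's resource.getrusage; the function name measure and the shape of the returned dictionary are illustrative choices, not a standard API.

```python
import resource


def measure(work):
    """Run work() and report the CPU time and context switches it incurred."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    work()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return {
        # User plus system CPU time consumed by this process during work().
        "cpu_seconds": (after.ru_utime + after.ru_stime)
        - (before.ru_utime + before.ru_stime),
        # Voluntary: we blocked or yielded (I/O, sleep, lock).
        "voluntary_switches": after.ru_nvcsw - before.ru_nvcsw,
        # Involuntary: the scheduler preempted us while we still wanted to run.
        "involuntary_switches": after.ru_nivcsw - before.ru_nivcsw,
    }
```

A CPU-bound task with a large involuntary count is competing for cores with something else. A task with a large voluntary count is mostly waiting, and no amount of scheduler tuning will speed it up.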

Signals — the kernel's way of telling a process to react.

Signals are the inter-process notification system in Linux. A signal is a small integer with a name and a default behaviour. You send one with kill -SIGNAL pid, or programmatically with kill(pid, signal).

The famous ones:

SIGTERM (15) — a polite "please shut down." The process can catch this, clean up, and exit. This is what kill <pid> sends by default and what systemd sends to stop a service.
SIGKILL (9) — non-negotiable termination. The kernel kills the process immediately. The process cannot intercept SIGKILL. This is what you reach for when SIGTERM did not work; this is also what the kernel sends when the OOM-killer fires.
SIGINT (2) — sent when you hit Ctrl-C in a terminal. Most processes interpret this as "stop what you are doing." Applications can catch it.
SIGSTOP (19) and SIGCONT (18) — pause and resume. SIGSTOP is uncatchable, like SIGKILL.
SIGSEGV (11) — segmentation fault. The kernel sends this when a process tries to access memory it does not own. The default behaviour is to dump core and die.
SIGBUS, SIGFPE, SIGILL — other kernel-originated faults (misaligned memory access, floating-point error, illegal instruction).
SIGHUP (1) — historically, "the terminal hung up." Many daemons now interpret SIGHUP as "reload your config files." A senior engineer knows that kill -HUP nginx-pid is the polite way to make nginx re-read /etc/nginx/nginx.conf without restarting.
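
The "SIGTERM first, SIGKILL if that did not work" routine is easy to script. Here is one way it might look in Python with the standard subprocess module. The names stop_process and spawn_sleeper are hypothetical, and the sleeping child is only a stand-in for a real service.

```python
import signal
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float) -> int:
    """Ask proc to exit with SIGTERM; fall back to SIGKILL after a grace period."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL, which cannot be caught or ignored
        return proc.wait()


def spawn_sleeper() -> subprocess.Popen:
    """Start a throwaway child that just sleeps (stand-in for a real service)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
```

subprocess reports death by signal as a negative return code, so a child that went quietly returns -15 and one that had to be forced returns -9. That number tells you after the fact which path was taken.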

Applications install signal handlers to react. A well-behaved server traps SIGTERM, finishes serving its in-flight requests, closes its sockets, flushes its logs, and exits cleanly. A poorly-behaved one ignores SIGTERM and forces you to send SIGKILL, losing the in-flight work.
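
The well-behaved pattern has a simple shape: the handler only records that a stop was requested, and the main loop notices the flag between units of work. A toy Python sketch follows; the class GracefulWorker and its methods are invented for illustration, and the "requests" are just list items.

```python
import signal


class GracefulWorker:
    """Toy worker loop that stops between items once SIGTERM or SIGINT arrives."""

    def __init__(self, items):
        self.pending = list(items)
        self.done = []
        self.stopping = False
        self.cleaned_up = False

    def install_handlers(self):
        # The handlers only record the request; the main loop acts on it.
        signal.signal(signal.SIGTERM, self._request_stop)
        signal.signal(signal.SIGINT, self._request_stop)

    def _request_stop(self, signum, frame):
        self.stopping = True

    def run(self):
        while self.pending and not self.stopping:
            self.done.append(self.pending.pop(0))  # "serve" one request
        self.cleanup()

    def cleanup(self):
        # Stand-in for closing sockets and flushing logs.
        self.cleaned_up = True
```

Doing almost nothing inside the handler is deliberate. A signal can arrive at any point in your program, and the less the handler touches, the fewer half-finished states it can trip over.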

The OOM-killer — when memory runs out.

Linux is, by default, an overcommitting operating system. It will let processes allocate more virtual memory than the machine has physical RAM, on the bet that not every allocation will be touched. Most of the time, this bet pays off — programs allocate optimistically and use only a fraction.

Sometimes the bet loses. The machine runs out of physical RAM, swap is full, and the kernel cannot reclaim enough memory to satisfy the next allocation. At that point, the OOM-killer wakes up. It scores each process, mostly on how much memory it is using, adjusted by a per-process tunable (oom_score_adj) that administrators use to protect or sacrifice particular services, and SIGKILLs the one with the highest score.
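
The kernel publishes each process's current score in /proc/<pid>/oom_score, so you can see who is first in line before anything goes wrong. A rough Python sketch, assuming a Linux /proc layout; the function name oom_candidates and its proc_root parameter (there so it can be pointed at a test directory) are hypothetical.

```python
from pathlib import Path


def oom_candidates(proc_root: str = "/proc", top: int = 5):
    """Return up to `top` (oom_score, pid, name) tuples, most killable first."""
    rows = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            score = int((entry / "oom_score").read_text())
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process exited mid-scan, or we lack permission
        rows.append((score, int(entry.name), name))
    return sorted(rows, reverse=True)[:top]
```

If the process at the top of that list is the one you can least afford to lose, that is worth knowing while the machine still has memory to spare.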

You will see this in dmesg as a line like Out of memory: Killed process 1234 (some-app). The kernel was being a bodyguard. If it had not killed something, the whole system would have hung.

A senior engineer who is paged at 3 a.m. about a service that "just disappeared" knows to check dmesg for the OOM-killer's footprint before doing anything else.
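
That first check is easy to automate. The sketch below scans kernel log lines for the message quoted above. It is written against that one line shape only, and the names OOM_LINE and find_oom_kills are made up for the example.

```python
import re

# Matches the kernel log line quoted above. Newer kernels append memory
# details after the closing parenthesis, which this pattern ignores.
OOM_LINE = re.compile(r"Out of memory: Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return (pid, process_name) for every OOM kill found in the log lines."""
    kills = []
    for line in log_lines:
        match = OOM_LINE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

Feed it the output of dmesg or journalctl -k, split into lines. An empty result does not prove the OOM-killer was not involved (the kernel log buffer is finite and old entries fall out), but a hit settles the question quickly.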

Hard and soft limits — the kernel's negotiable rules.

Every process runs under a set of limits — maximum number of open files, maximum process count, maximum CPU time, maximum locked memory, maximum stack size. You see them with ulimit -a.

The distinction between soft and hard limits matters. The soft limit is what the kernel currently enforces on this process. The hard limit is the ceiling, set by the administrator. A non-root process can raise its own soft limit, but only up to the hard limit. Only root can raise the hard limit.
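
The same rules are visible from inside a process. Here is a short Python sketch using the standard resource module; the helper name set_soft_limit is an assumption of this example.

```python
import resource


def set_soft_limit(which: int, wanted: int) -> int:
    """Set the soft limit for one resource, clamped to the hard limit.

    Returns the soft limit actually in force afterwards.
    """
    _, hard = resource.getrlimit(which)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)  # a non-root process cannot exceed the ceiling
    resource.setrlimit(which, (wanted, hard))
    return resource.getrlimit(which)[0]
```

For example, set_soft_limit(resource.RLIMIT_NOFILE, 65536) asks for 65536 open files and quietly settles for the hard limit if that is lower. Whether settling quietly is right for your service is a design choice; failing loudly at startup is often kinder to whoever is on call.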

Many "weird errors" in production turn out to be hidden limits. An nginx hitting "too many open files" on a busy day. A database hitting the locked-memory limit. A worker that cannot spawn more threads because the per-user process limit is exhausted. For login sessions the fix is in /etc/security/limits.conf, which PAM applies at login; for services started by systemd, which do not go through that path, it is in the unit file (LimitNOFILE= and its siblings).

When you raise a limit, raise it deliberately and document why. Limits exist to protect the system from one process eating everything. A higher limit is a willingness to lose more if that process misbehaves.

Kernel errors and what they actually mean.

You will sometimes see error messages that look like they come from your application but are actually from the kernel.

A segmentation fault is the kernel telling your process "you tried to access memory you do not own; I am terminating you." It is not a bug in the OS. It is the OS catching a bug in your code (or in a library you depend on).

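A stack overflow is the kernel telling your process "your stack grew beyond the limit; I will not let it grow further." On Linux it usually surfaces as a segmentation fault, because the access just past the end of the stack is an access to memory you do not own. Usually caused by unbounded recursion or a very large stack-allocated variable.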

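A broken pipe is the kernel telling your process "you tried to write to a socket or pipe whose other end was closed." It arrives as SIGPIPE, whose default action terminates the process, which is what happens to naive programs. Most server frameworks ignore the signal, and the write then fails with an EPIPE error that the code can handle.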

A bus error is rarer but worth knowing: usually an alignment problem, or a memory-mapped file that was truncated underneath you.
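
The broken-pipe case is the easiest of these to reproduce safely. Python ignores SIGPIPE at startup, so in the sketch below the kernel's complaint shows up as an exception rather than a dead process. The function name write_to_closed_pipe is illustrative.

```python
import errno
import os


def write_to_closed_pipe():
    """Write to a pipe whose read end is closed and return the resulting errno."""
    read_end, write_end = os.pipe()
    os.close(read_end)  # nobody is listening any more
    try:
        os.write(write_end, b"hello")
    except BrokenPipeError as exc:
        # With SIGPIPE ignored, the kernel reports the problem as EPIPE.
        return exc.errno
    finally:
        os.close(write_end)
    return None
```

A program that leaves SIGPIPE at its default would never reach the except branch; it would simply be terminated at the write.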

These errors are coming from the layer below your application. The senior move is to recognise that, look at the surrounding kernel state (dmesg, system logs, resource limits), and reason about which subsystem ran out of patience.
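
A first step in recognising a kernel-delivered death is reading the exit status properly. The Python sketch below launches a child that deliberately reads address 0 and then decodes how it died. The function names are hypothetical, and the crashing one-liner is a demonstration trick, not something to put in real code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a human-readable cause."""
    if returncode < 0:
        # subprocess reports "killed by signal N" as -N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_crashing_child() -> int:
    """Run a child that reads address 0, which the kernel answers with SIGSEGV."""
    code = "import ctypes; ctypes.string_at(0)"
    result = subprocess.run([sys.executable, "-c", code], stderr=subprocess.DEVNULL)
    return result.returncode
```

A supervisor that logs describe_exit(...) for every child it loses turns "the worker just vanished" into "the worker was killed by SIGSEGV" or "killed by SIGKILL", and those two send you to very different places.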

Cgroups — bounding a process's resource use.

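The newer, structured way to apply limits to processes is control groups (cgroups). A cgroup is a collection of processes that share a set of resource constraints: CPU shares, memory limit, I/O bandwidth, PIDs limit. Network traffic is the exception; cgroups can tag a group's packets, but rate limiting itself is left to the kernel's traffic-control layer.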

Cgroups are how Docker, Podman, Kubernetes, and systemd's resource-management features all work under the hood. When you set --memory=512m on a Docker container, Docker creates a cgroup with that memory limit and puts the container's processes into it. The kernel enforces the limit; if the container exceeds it, the OOM-killer fires inside the cgroup — only the container's processes are eligible victims, not the rest of the system.

You can interact with cgroups directly. On a modern systemd system: systemctl set-property nginx.service MemoryMax=2G CPUQuota=50%. Or write into /sys/fs/cgroup/... directly if you want to live dangerously.
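
Reading is a good deal safer than writing. The sketch below assumes cgroup v2 (the unified hierarchy mounted at /sys/fs/cgroup): it finds a process's cgroup from the text of /proc/<pid>/cgroup and reads that group's memory usage and limit. Both function names and the root parameter are assumptions made for this example.

```python
from pathlib import Path


def cgroup_v2_path(proc_cgroup_text: str) -> str:
    """Extract the cgroup v2 path from the text of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        # The unified (v2) hierarchy is the entry of the form "0::<path>".
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return parts[2]
    raise ValueError("no cgroup v2 entry found")


def read_memory_usage(cgroup_path: str, root: str = "/sys/fs/cgroup"):
    """Return (current_bytes, limit_bytes) for a cgroup; limit is None if unset."""
    base = Path(root) / cgroup_path.lstrip("/")
    current = int((base / "memory.current").read_text())
    limit_text = (base / "memory.max").read_text().strip()
    # The kernel writes the literal word "max" when no limit is configured.
    limit = None if limit_text == "max" else int(limit_text)
    return current, limit
```

Comparing the two numbers over time tells you how close a service is running to its ceiling, which is a far better early warning than waiting for the OOM-killer's log line.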

This is the foundation we will build on in the next chapter, when we look at containers and what makes them different from virtual machines.

A small parting observation.

There is a recurring shape to all of this. The OS gives every process the illusion of owning the machine — its own memory space, its own CPU time, its own file descriptors, its own filesystem view. Behind the curtain, the OS is multiplexing one set of physical resources across many simultaneous illusions, arbitrating disputes, killing the unruly, and writing logs about what it did.

When you write code, you write it as if you owned the machine. You almost always do not. The senior engineer is the one who knows where the illusion ends, and who knows what to look at when it breaks.

Push On It

Write a tiny program that forks, has the child exit immediately, and has the parent sleep without calling wait(). Run it. In another terminal, find the zombie with ps aux | grep Z. Now kill the parent and observe the zombie disappear. Explain to yourself why.
Open top or htop on a machine doing real work. Sort by CPU. Pick a process. Send it SIGSTOP (kill -STOP <pid>), then SIGCONT (kill -CONT <pid>). Watch its state in top change. Notice that other processes get more CPU during the stop. The scheduler is rebalancing.
Pick any service running under systemd. Find its cgroup (systemctl status <name> shows the path). Look at /sys/fs/cgroup/<path>/memory.current and cpu.stat. You are reading live resource usage of a constrained process.
Look up the OOM-killer's scoring algorithm. Then deliberately trigger it in a VM (write a small program that allocates memory in a tight loop until killed). Verify your prediction about which process the kernel would choose.
