Vol. II · № 4

How Memory Actually Works

A from-the-ground-up tour of the memory every running program uses, the regions inside it, the allocators that hand it out, and the tools that report on it. Language-agnostic, no leaks yet.

The first time I cared about memory, I was in college doing competitive programming.

Every problem came with two budgets, a time limit and a memory limit. The judge would happily reject a correct answer that took 1.1 seconds when the limit was 1.0, or used 65 megabytes when the cap was 64. Half the fun of competitive programming, and most of the pain, lived in that gap. You’d write a solution that worked on the samples, submit it, and watch the verdicts come back: TLE. MLE. Wrong answer on test 47. Runtime error. Suddenly an algorithmic problem became a memory problem too.

That’s where I first read about malloc, about the heap, about how an int[] of a million elements is four megabytes and where those four megabytes actually live. Almost every memory bug I’ve debugged since has rhymed with that early lesson: the memory you allocate has to live somewhere, somebody has to give it back, and somebody (your runtime, the operating system, your future self) has to know what’s still in use.

Kent Beck has a line for this.

Make it work. Make it right. Make it fast. — Kent Beck

Most of the engineering time I see these days goes into the first step. Some into the second. The third, making it fast, making it small, making it actually fit, is rare. Abundant hardware and forgiving runtimes hide a lot of sins. But when “make it fast” does come up, it almost always means understanding the rooms underneath the runtime. The first two steps are about logic. This one is about machinery.

Most of what I knew about memory in college was bookish. The rest I picked up later, slowly, from working on real systems and watching tools disagree with each other about what was happening. This post is the part both versions agree on: what an operating system gives a program, what an allocator does, what a garbage collector does, and how the layers under your runtime fit together.

What does the OS give a program?

When you run ./my-program, the operating system creates a process and hands it something called an address space.

An address space is just a long list of byte slots, numbered from 0 up to a very large number. On a typical 64-bit system the hardware implements 48 bits of address, so that number is 2^48 — 256 terabytes of addressable bytes. Your program does not have 256 terabytes of memory. It has the illusion of 256 terabytes of memory. The rest of this post is mostly about the gap between that illusion and reality.

The illusion is called virtual memory. Each process gets its own private address space. When the program reads or writes to address 0x7fff_a823_4000, the CPU translates that address — through hardware called the MMU — into a real physical address in your RAM. The translation table is called a page table, and the unit of translation is a page, usually 4 kilobytes.

[Figure: Each process sees its own private address space. The kernel and CPU translate those virtual addresses into physical RAM, page by page, on demand.]

A few things follow from this design that matter for everything below.

First, a virtual address is not real memory until it’s used. The kernel can hand you a billion bytes of address space and not back a single one of them with physical RAM. Pages get backed on demand, the first time you read or write to them. This is why a program can ask for a huge mmap and the OS shrugs and says yes.

Second, two processes with the same virtual address point to different physical bytes. The translation is per-process. This is why processes can’t trample each other’s memory.

Third, the kernel can move pages around. It can swap them to disk if memory is tight, share them between processes if they’re identical (think glibc, loaded once, mapped into every process), or unmap them entirely. From the program’s point of view, the address stays the same.

That’s the foundation. Now let’s look at what’s actually inside the address space.

The five regions of a process

Every Unix-like process is laid out roughly the same way. The address space is divided into a handful of regions, each with its own purpose, growth direction, and rules.

[Figure: The layout of a typical Linux process. High addresses at the top, low at the bottom. The stack grows downward into the gap; the heap grows upward into it.]

text

The text region holds your program’s compiled instructions. When you run a binary, the kernel maps the code section of the file into this region and marks it read-only and executable. It’s fixed size, doesn’t grow, and you generally don’t think about it. If you ever see a Segmentation fault from a write through a wild pointer that happens to land in your code, this region’s read-only protection is what caught you.

data and bss

The data segment holds globals and statics that have an initial value:

globals.c — ends up in data and bss
int counter = 42;     // initialized — goes in .data
int buffer[1024];     // uninitialized — goes in .bss
static char *name;    // uninitialized — also .bss

.bss (Block Started by Symbol — the name is historical and pointless) holds uninitialized globals. The kernel zero-fills them at startup so you never observe garbage in them. Both regions are fixed size.

heap

The heap is where your program asks for memory at runtime. It’s the interesting region — most of what we’ll discuss in this post happens here. We’ll come back to it in the next two sections.

mmap region

The mmap region is where the kernel maps in things that are too big or too special to live on the heap. Three main occupants:

  • Shared libraries. Every .so your program links against (libc.so, libssl.so, the JVM, etc.) is mmaped into this region. They live here because they can be shared across processes — the kernel maps the same physical pages into many address spaces.
  • Large heap allocations. When you ask for, say, a 10 MB buffer, allocators usually skip the heap and ask the kernel for a fresh mmap directly. We’ll see why in a moment.
  • Memory-mapped files. When you mmap a file, the kernel makes the file’s bytes appear as part of your address space.

stack

The stack holds function call state. Every time a function is called, a new stack frame is pushed: arguments, local variables, the return address, saved registers. When the function returns, the frame is popped. This is automatic, fast, and completely safe — you never free a stack variable.

The stack has a fixed size limit (on Linux, 8 MB by default). If you blow past it, you get a stack overflow and the process dies. You blow past it by recursing too deeply or allocating something huge on the stack:

overflow.c — guaranteed crash
void boom() {
    char buffer[16 * 1024 * 1024];  // 16 MB on the stack
    // segfault before this line ever executes
}

The stack and the heap grow toward each other. The stack starts at high addresses and grows down. The heap starts low and grows up. The unmapped gap between them shrinks as your program does more work.

The stack in detail

Function calls don’t happen by magic. When you call a function, the caller and the callee follow a contract called the calling convention — a set of rules about which registers hold what, where return values go, and what gets pushed onto the stack.

A typical stack frame looks like this:

[Figure: A stack frame, top of stack at the top. Each function call pushes a new frame; each return pops one. Frames live and die in strict last-in-first-out order.]

Two properties make the stack fast and limited.

It’s fast because allocation is one instruction: subtract from the stack pointer. Deallocation is one instruction: add to it. There’s no list of free chunks to search, no metadata to update.

It’s limited because the lifetime of a stack variable is the lifetime of the function call. The moment a function returns, every byte in its frame is reclaimed. If you need memory that outlives the function — to return to the caller, to store in a long-lived data structure, to pass to another thread — the stack is the wrong place. You need the heap.

lifetime.c — why the heap exists
int *bad() {
    int x = 42;
    return &x;       // returning a pointer to a stack variable
}                    // x is destroyed here. The pointer is dangling.

int *good() {
    int *x = malloc(sizeof(int));
    *x = 42;
    return x;        // heap-allocated, survives the return
}

That contrast is the whole reason the heap exists.

The heap in detail

The heap is a region of address space that your allocator manages. The allocator is a library — usually the C standard library’s malloc/free, or one of its drop-in replacements like jemalloc, tcmalloc, or mimalloc. It is not the kernel. It’s a userspace component that sits between your program and the kernel.

When your program calls malloc(size), here’s what happens.

[Figure: A small malloc is served from the allocator's internal cache. A large one bypasses the cache and asks the kernel for fresh pages directly via mmap.]

The allocator keeps a cache of memory it has already gotten from the kernel. When you ask for 64 bytes, it looks in its cache, finds a free slot, hands you a pointer. No system call needed.

When the cache runs dry, the allocator asks the kernel for more memory. There are two ways to ask:

  • brk / sbrk — moves the top of the heap up. This is the classic Unix way: the heap has a single contiguous range, and sbrk extends it. Cheap, but the heap is one piece.
  • mmap — gets a fresh, separate region of pages somewhere in the address space. More flexible, but each region is its own thing.

Modern allocators use both. Small allocations come from a heap extended by sbrk. Large ones (typically over 128 KB on glibc) skip the heap and live in their own mmap regions, so they can be returned to the kernel independently when freed.

Inside the heap, the allocator keeps free lists — linked lists of chunks that are not currently in use, organized by size. When you malloc(64), the allocator picks a free chunk that fits, splits it if it’s bigger, returns the rest to a free list. When you free(ptr), the chunk goes back on a free list, possibly merged with neighbors.

[Figure: The heap, mid-program. Some chunks are in use (allocated to your program), some are free (returned to the allocator and waiting to be reused).]

This is also where fragmentation happens. Your program allocates and frees many things over time. Eventually the free space is broken into small islands, none of them big enough for the next big allocation, even though the total free memory is plenty. Allocators spend a lot of effort fighting this — bin sizes, slab caches, coalescing — but they can never eliminate it completely.

If you ever wonder why a long-running program keeps getting bigger even though “the heap should be the same size by now,” fragmentation is one of the answers.

Managed runtimes layer on top

Now the second floor of the building.

Languages with garbage collection — Java, Python, Go, JavaScript, C# — don’t expose malloc to you. You write new Foo() or [1, 2, 3] or {}, and an object appears. There’s no free. So where does the memory come from?

It comes from mmap. The runtime asks the kernel for a giant region of memory — usually tens or hundreds of megabytes at a time — and then sub-allocates inside it using its own bookkeeping. Each runtime has its own scheme.

  • The JVM has generational heaps — Eden, Survivor, Old, Metaspace — each its own arena, each with its own collector strategy.
  • V8 and JSC (the engines behind Node.js, Chrome, Bun) split the heap into a young generation (small, fast, frequent collection) and an old generation (larger, slower).
  • CPython has a small-object allocator on top of malloc, with arenas and pools.
  • The Go runtime has an mheap that’s structured into spans, each carved into objects of a fixed size class.

The point isn’t the specifics. It’s the shape: when your runtime says “the heap is 200 MB,” it means its arenas hold 200 MB of live objects. The kernel might see your process holding 800 MB of mmap’d pages — the rest is arena overhead, free space inside the arenas, fragmentation, and other runtime structures. The OS heap and the runtime heap are not the same thing, and they almost never report the same number.

[Figure: What the runtime calls 'the heap' is a logical view inside the much larger memory the OS actually mapped to your process. Live objects are a subset of arena space, which is a subset of resident memory.]

This is the first place where memory tooling starts to lie to you. process.memoryUsage().heapUsed reports the third box. ps -o rss reports the first. They are different numbers, and their disagreement is normal.

What “leak” actually means

A memory leak is, loosely, memory your program no longer needs but has not given back. The reason that loose definition is unsatisfying is that the underlying mechanism differs by language. There are three flavors.

Flavor 1: forgotten free (C, C++)

This is the classic. You allocated, you didn’t free, you lost the pointer.

leak.c — the forgotten free
char *make_greeting() {
    char *msg = malloc(64);
    sprintf(msg, "hello");
    return msg;        // caller is supposed to call free()
}

int main() {
    for (int i = 0; i < 1000000; i++) {
        make_greeting();   // pointer thrown away
    }                       // 64 MB lost; nothing can reach it
}

After the loop, 64 MB of heap chunks exist, every one of them allocated, none of them reachable from anything in your program. The allocator doesn’t know they’re unreachable — it just knows they’re not free. The OS doesn’t know either. They sit there, billed to your process, until the process exits.

This flavor is impossible in a managed language because there’s no malloc/free for you to forget. Which brings us to:

Flavor 2: reachable orphan (Java, Go, Python, JavaScript)

In a managed language, the garbage collector traces from a set of roots (the call stacks of all threads, global variables, registers) and finds every object that’s still reachable. Anything not reachable is recycled. So flavor 1 cannot happen.

What can happen instead is the opposite: memory that is reachable, by accident, that you no longer need.

leak.js — the modern leak: 'cache' grows forever
const cache = new Map();

function handle(req) {
    cache.set(req.id, req.payload);
    return cache.get(req.id);
}

Every entry in cache is reachable from a module-level global. As far as the garbage collector can tell, every entry is in use. It can’t help you. The leak is not “I forgot to free”; it’s “I remembered too well.” A cache that doesn’t evict, an event listener that’s never unregistered, a closure that pins a large object alive — these are the modern leaks.

Flavor 3: cycle (Python, Swift, anything refcounted)

Some languages don’t trace from roots. They keep a reference count on every object. When the count hits zero, the object is freed immediately. Python, Swift, Objective-C, and COM all use refcounting (Python with a tracing cycle collector layered on top).

Refcounting is fast and predictable. It’s also blind to cycles.

cycle.py — two objects keeping each other alive
class Node:
    def __init__(self): self.peer = None

a = Node()
b = Node()
a.peer = b
b.peer = a
del a, b
# refcounts: a is held by b.peer (1), b is held by a.peer (1)
# nothing reaches 0; both objects stay alive
[Figure: A reference cycle. Each object's refcount is held up by the other. Neither hits zero, neither is freed — even though nothing else in the program can reach them.]

Python ships a separate cycle collector that periodically walks the object graph to find and break these cycles. Swift doesn’t, which is why Swift programmers learn the words weak and unowned.

So: same word “leak,” three different mechanisms, three different fixes. The mechanism you have determines the tool you reach for.

Garbage collection in detail

Since two of the three leak flavors live inside garbage-collected runtimes, it’s worth a little time on how they actually work.

A tracing collector does what we just described: starts at a set of roots (stack frames, globals, registers), walks every reference it can follow, marks each object it visits, and at the end of the walk reclaims everything that wasn’t marked. This is the mark-and-sweep algorithm, and almost every modern GC is a variant of it.

[Figure: A tracing GC walks from roots, marking everything it can reach. Whatever isn't marked at the end is unreachable and gets recycled.]

The basic algorithm has problems in practice. It walks every reachable object, which can be millions on a real heap. While it walks, the program has to pause — a “stop-the-world” GC. For a long time, GC pauses were the main reason JVM apps got a bad reputation in latency-sensitive shops.

1. That reputation is mostly historical. ZGC, on heaps in the tens of gigabytes, hits sub-millisecond max pauses. The “stop the world” name has lasted longer than the pauses themselves.

Modern collectors fix this with two big ideas.

The first is the generational hypothesis: most objects die young. A typical web request creates a flurry of short-lived objects (request bodies, JSON parses, intermediate strings) that die before the next request arrives. A small number of objects (caches, connections, the HTTP server itself) live forever. So why scan everything every time? Modern collectors split the heap into a small young generation and a larger old generation. They scan the young generation often and quickly. They scan the old one rarely.

The second is concurrency. A concurrent collector does most of its work alongside the program, with only short pauses for the parts that absolutely need the program to be still. ZGC and Shenandoah on the JVM, Go’s collector, V8’s Orinoco — all concurrent.

A refcounting collector — Python, Swift — is a totally different shape. Every object has an integer counter. Every assignment that creates a reference bumps the counter up; every reassignment or scope exit bumps it down. When it hits zero, the object is freed immediately. No tracing, no pauses, very predictable. The cost is that it can’t reclaim cycles, and the per-assignment overhead is real.

Most modern runtimes are hybrids. Python is refcounting plus a cycle collector. Modern JVMs are tracing plus generations plus concurrency. The shape of your runtime’s GC determines the shape of the leaks you’ll have.

The hidden room: native memory

Here’s the part most engineers don’t internalize until they’ve been bitten.

Every managed runtime has a C or C++ underbelly. The runtime itself is written in C++ (V8, JSC, the JVM, CPython, Go’s runtime). Many libraries you use are also written in C and called via the runtime’s foreign function interface. All of these can — and do — allocate memory using malloc or mmap directly, without going through the runtime’s heap.

That memory exists. The OS sees it (it’s part of your RSS). But the runtime’s heap counter doesn’t know about it, because the runtime didn’t allocate it.

Examples of native allocation, by ecosystem:

  • Java: DirectByteBuffer allocates off-heap memory for I/O. JNI calls allocate whatever the native code asks for. Netty uses pooled direct buffers heavily.
  • Node.js: Buffer.allocUnsafe(N) hands you N bytes of memory outside V8’s JavaScript heap, allocated on the C++ side. Every TCP socket, every file read, every stream chunk passes through buffers like these, and much of Node’s streams machinery keeps its internal queues on the C++ side too.
  • Python: Every C extension (numpy, lxml, cryptography) allocates with malloc, completely invisible to Python’s allocator counters.
  • Go: cgo calls into C. Anything allocated by the C side is outside Go’s heap.
native.js — heap stays flat, RSS climbs without bound
const buffers = [];
setInterval(() => {
    buffers.push(Buffer.allocUnsafe(10 * 1024 * 1024));
}, 100);
// process.memoryUsage().heapUsed:    barely moves
// process.memoryUsage().rss:         climbs forever
[Figure: The total memory your process holds is the sum of two things: what the runtime allocates for managed objects, and what native code allocates directly. Most tools only show you one of them.]

This split is why memory debugging in managed languages can be frustrating. Your runtime gives you a heap snapshot — it’s a beautiful tool, you can see every object, every reference, every retainer chain. But if the leak is on the native side, the heap snapshot will tell you nothing. The whole leak is invisible to it.

What different layers report

There are three layers that can answer the question “how much memory is my process using.” They almost never agree, and the disagreement is the signal you actually want.

layer       what it sees                           how you ask
OS          resident memory, virtual size          ps, /proc/<pid>/smaps_rollup
runtime     its own managed heap                   process.memoryUsage, MemoryMXBean
allocator   native pages, fragmentation, arenas    jemalloc stats, mallinfo
bash — peeking at the OS-level truth (Linux)
ps -o pid,rss,vsz -p $$
cat /proc/$$/status | grep -E '^Vm'
cat /proc/$$/smaps_rollup
js — what V8 self-reports
console.log(process.memoryUsage());
// {
//   rss:           18923520,    // OS-level — total resident
//   heapTotal:      6328320,    // V8's heap reservation
//   heapUsed:       4587392,    // V8's live objects
//   external:        781072,    // C++ side allocations V8 tracks
//   arrayBuffers:     17890     // ArrayBuffer payloads
// }

A few useful relationships fall out of these layers.

heapUsed is always less than heapTotal (live objects fit inside the runtime’s reservation). heapTotal is part of rss. rss is heapTotal plus external plus arrayBuffers plus all the native allocations the runtime doesn’t track plus the code, the stack, the loaded .so files, and so on.

When rss and heapUsed grow together, your memory pressure is on the runtime side. When rss grows and heapUsed stays flat, the pressure is somewhere the runtime cannot see — the room your tool does not look in.

This is the part that surprises most people, and it’s where the gnarliest memory bugs live: a runtime that swears everything is fine, sitting inside an operating system that is in the process of killing the container.

Going further

Three pieces I’d reach for if any of this caught your interest:

  • Ulrich Drepper, What Every Programmer Should Know About Memory — the canonical deep dive on caches, virtual memory, and what the CPU actually does. Long, but ages remarkably well. lwn.net/Articles/250967
  • Julia Evans, Bite Size Linux and Memory Allocation zines — the most readable, most genuinely fun primers I know on this material. wizardzines.com
  • Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective — the textbook version. The chapters on virtual memory and dynamic allocation are worth the whole book.

Every program runs on the same machinery. The languages on top hide different parts of it. Knowing the parts is the difference between being a guest in your runtime and being at home in it.
