Mechanical Sympathy • The Applied Go Weekly Newsletter 2025-11-30

Your weekly source of Go news, tips, and projects

Mechanical Sympathy
Hi,
There are two things in Go that probably don't grab a newcomer's immediate attention: The mechanics of heap allocation and data races. The first is the garbage collector's domain until performance tuning becomes important. The second may go unnoticed until the first few attempts at writing concurrent code start producing strange results.
In either case, it's good to have an idea about the things going on behind the scenes. There is a term for this: mechanical sympathy. And that's what the two featured articles help you acquire.
–Christoph
Featured articles
Why Your Go Code Is Slower Than It Should Be: A Deep Dive Into Heap Allocations
Go's garbage collection is a huge win for development speed while still producing fairly fast apps. (I hear you shouting, Rustaceans: "but not blazingly fast!" Yes, that domain is left to languages with manual (and, in Rust's case, smartly compiler-aided) memory management.) But sometimes, performance can be improved with small changes. Cristian Curteanu collected some typical patterns.
A million ways to die from a data race in Go
Well, not quite a million, but five are surely enough reasons for honing your data race avoidance skills.
Spotlight: Show some sympathy (for your hardware)
(Yes, I finally found some time to write one!)
Many programming languages try to be as "high-level" as possible by shielding the developer from the details of the hardware and the inner workings of compiler and runtime. What a shame! Users of these languages get some ease in coding at the price of needlessly slow apps.
What if a language laid only a thin abstraction over the hardware it is supposed to run on? And what if developers knew enough about the hardware they write code for?
Why you should have (or develop) mechanical sympathy
In an ideal world, algorithms would always execute without any friction—no CPU or RAM limits, no I/O delay. Reality, however, confronts the developer with hard and soft limits as well as more or less subtle hardware peculiarities. Mechanical sympathy means knowing about these factors and using them to your code's advantage.
The term "mechanical sympathy" originates from Jackie Stewart, a racing driver, who said:
You don’t have to be an engineer to be a racing driver, but you do have to have mechanical sympathy.
– Jackie Stewart
What he meant is that if you know your car, you become a better driver. A better driver can drive his car faster, which, after all, is the point of car races, right? The same applies to software development: Know your hardware architecture—the CPU, RAM, I/O, and so forth—and you can become a better developer. A better developer writes software that's more efficient, in terms of both speed and memory consumption. Did you notice that hardware became tremendously faster over the last decades while software feels as sluggish as ever? Performance optimization has been neglected for too long.
A common meme in our industry is “avoid pre-optimization.” I used to preach this myself, and now I’m very sorry that I did.
I want to extend the notion of mechanical sympathy from hardware to software mechanics. Compilers and runtimes can contain complex algorithms that are worth developing some mechanical sympathy for. Think of:
- memory allocation and garbage collection
- concurrency as a "mechanical" system
to name only two. My point is: Certain kinds of software abstraction layers can expose hardware-like behavior on their own.
Mechanical sympathy for hardware
Now let's look at some features of modern CPUs that can affect your code in unforeseen ways but that you can also leverage to make your code perform better.
Local caches
From a software point of view, RAM is linear. Available memory is a long row of memory cells numbered from 0 to 2^x - 1 (where x is the width of the address bus in bits).
In reality, RAM is hierarchical: A series of caches sit between RAM and a CPU core, each one smaller but faster than the previous one. (As these caches are local to the CPU core, they are typically called "L1", "L2", and "L3".) When the core fetches data from RAM, the data remains cached as long as available cache space permits, and future requests for the same data become much faster.
Algorithms can be designed to take advantage of the caches by operating on amounts of data that fit into a local CPU cache—ideally into the L1 cache that sits closest to the CPU and is the fastest but also the smallest one in the cache hierarchy.
On the flip side, algorithms that are unaware of the cache architecture can become slower than necessary, even to an extent that cancels out performance gains from concurrent work.
How so? Imagine a concurrent app with goroutines running on multiple cores at once. Goroutine a on core A reads a variable and updates it. Now the variable's current value sits in core A's L1 cache. Then, goroutine b on core B wants to read and update the same variable. To make this possible, the processor must first invalidate the value in A's L1 cache, write the new value back to main RAM, and then let B fetch the value, process it, and store it in B's L1 cache. Now imagine goroutine a wants to update the value again; the slow cache invalidation process repeats.
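This effect is often called "false sharing", and you can provoke it on purpose. Here is a minimal sketch, assuming 64-byte cache lines (typical for x86-64); the struct and counter names are made up for illustration, and whether the slowdown actually shows up depends on the two goroutines landing on different cores. Two goroutines increment two independent counters, yet if both counters sit on the same cache line, the cores keep invalidating each other's caches:

package main

import (
    "sync"
    "sync/atomic"
)

// sharedLine places both counters on the same cache line, so concurrent
// updates from two cores constantly invalidate each other's L1 caches.
type sharedLine struct {
    a atomic.Int64
    b atomic.Int64
}

// paddedLines pushes b onto its own cache line (assuming 8-byte counters
// and 64-byte cache lines).
type paddedLines struct {
    a atomic.Int64
    _ [56]byte
    b atomic.Int64
}

func main() {
    var c paddedLines // swap in sharedLine and time the difference
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        for i := 0; i < 10_000_000; i++ {
            c.a.Add(1)
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i < 10_000_000; i++ {
            c.b.Add(1)
        }
    }()
    wg.Wait()
}

Timing both variants (for example with the testing package's benchmarks) usually shows the padded version finishing noticeably faster, although the exact numbers depend on your CPU.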
Structure alignment
Terms like "32-bit processor" or "64-bit architecture" refer to the size of one unit of information that the processor can read from RAM. A 64-bit processor reads data in chunks of 64 bits or eight bytes at once.
But what about smaller data, such as an int16 or a boolean? The CPU still must fetch a full 64-bit chunk of data to read a 16-bit value from RAM.
For this reason, data must be aligned at 64-bit borders.
Consider this struct:
var s struct {
    a int8
    b int64
    c bool
}
If it were stored in memory as one contiguous block, field b would spread across two 64-bit memory cells:
0 8            64
+-+-------------+
|a|b            |
+-+-------------+
|b|c|###########|
+-+-------------+
(The hash signs ### denote unused space within a memory cell.)
To fetch field b, the CPU would have to read both memory cells, extract the two fragments of b, and combine them into one value.
As this would be highly inefficient, the compiler maps the struct to memory so that no value spans two memory cells:
0 8            64
+-+-------------+
|a|#############|
+-+-------------+
|b              |
+-+-------------+
|c|#############|
+-+-------------+
Clearly, our particular struct wastes quite some space by storing two 8-bit values in two separate 64-bit memory cells.
Luckily, as Gophers, we are empowered to change this. The compiler maps struct fields in the sequence given in the source code, so by rearranging the struct a bit, we can have the compiler map the two short fields, a and c, to the same memory cell:
var s struct {
    b int64
    a int8
    c bool
}
The compiler happily arranges a single cell for a and c, thus reducing the required memory by one third!
0 8            64
+-+-------------+
|b              |
+-+-------------+
|a|c|###########|
+-+-------------+
This might not sound like much, but for large slices of structs, it can make a difference.
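If you want to verify the effect of reordering, unsafe.Sizeof reports a struct's size including padding. A quick sketch; the type names are made up, and the reported sizes assume a typical 64-bit platform:

package main

import (
    "fmt"
    "unsafe"
)

type loose struct {
    a int8
    b int64
    c bool
}

type compact struct {
    b int64
    a int8
    c bool
}

func main() {
    fmt.Println(unsafe.Sizeof(loose{}))   // 24 bytes on typical 64-bit platforms
    fmt.Println(unsafe.Sizeof(compact{})) // 16 bytes
}

If you'd rather not do this by hand, the fieldalignment analyzer from golang.org/x/tools can flag structs whose fields could be reordered to save space.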
These were just two examples of using knowledge about hardware behavior to improve efficiency. Next, I'll transpose this principle to software mechanics.
Mechanical sympathy for Go
Programming languages, as simple as some of them may seem, are in fact quite complex systems, composed of a compiler and runtime libraries that directly influence how code performs. Languages often build their data types around specific paradigms.
Two examples of this in Go are slices and goroutine scheduling.
Slices
Every Go newcomer quickly learns that slices are quite special: While other languages treat dynamic arrays as an opaque concept that "just works" (somehow), Go makes the internal structure explicit. Slices live on top of underlying fixed-size arrays, and once a Go developer learns how the mechanics of slicing, copying, and appending work, they can design algorithms that perform better than those that have to rely on an impenetrable abstraction of dynamic arrays.
Slice header
+-----------------+
| len | cap | ptr |
+--------------|--+
  +------------+
  |   Array
+-V---------------------------------+
|   |   |   |   |   |   |   |   |   |
+-----------------------------------+
Algorithms that operate on a known (maximum) amount of data can pre-allocate the underlying array, either by explicitly allocating an array and slicing it or by calling make with a capacity argument: s := make([]int, 100, 10000). Without the need to resize the underlying array, allocations are reduced to a minimum. Memory-intensive operations become faster, and the garbage collector has much less to do.
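As a minimal sketch (the function and its size parameter are made up for illustration), compare appending with and without a capacity hint:

// fill appends n values to a slice. The capacity hint avoids the
// repeated grow-and-copy cycles that append would otherwise trigger.
func fill(n int) []int {
    s := make([]int, 0, n) // one allocation up front
    for i := 0; i < n; i++ {
        s = append(s, i)
    }
    return s
}

Without the capacity argument to make, append would re-allocate and copy the underlying array several times as the slice grows.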
Another aspect of slices worth mentioning: As contiguous blocks of data, they are much more cache-friendly than pointer-based structures like linked lists, whose elements are typically allocated individually on the heap and can therefore end up anywhere in RAM. So if you're about to implement an algorithm that would seem to require connecting individual data elements through pointers, see if you can find a slice-based implementation instead that can take advantage of cache locality.
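For illustration, here is a sketch contrasting a contiguous slice with container/list from the standard library; the summing functions are made up, and the actual speedup depends on data size and access pattern:

import "container/list"

// sumSlice walks one contiguous block of memory, which is cache-friendly.
func sumSlice(s []int) int {
    total := 0
    for _, v := range s {
        total += v
    }
    return total
}

// sumList chases pointers; each element is a separate heap allocation
// that can live anywhere in RAM.
func sumList(l *list.List) int {
    total := 0
    for e := l.Front(); e != nil; e = e.Next() {
        total += e.Value.(int)
    }
    return total
}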
The Go scheduler
You have probably come across demo apps that spawn bazillions of goroutines to demonstrate the amazingly low overhead per goroutine. Yet, knowing and adapting to the way goroutines are scheduled onto system threads can pay off.
Only one system thread can run on a CPU core at any given time, and swapping out a thread to run another one is rather expensive. That's why the Go inventors designed goroutines so that scheduling them onto a small pool of threads becomes dirt cheap.
The scheduler isn't exactly a simple algorithm, so I attempted to make the logic more accessible by comparing the scheduler to a restaurant.
The most important part of designing concurrent work is to know whether the goroutines you plan to spawn are I/O-bound or CPU-bound.
- I/O-bound goroutines are frequently blocked by slow data exchange with storage or network connections. You can spawn a lot more of them than the number of cores in your target system, as they will mostly be in a blocked state, giving room for OS activity or other goroutines to run.
- CPU-bound goroutines max out the CPU by doing calculations on data that's readily available in RAM. Here, you'd want to schedule no more goroutines to run simultaneously than there are CPU cores available. Any goroutine above that number only generates additional scheduling overhead without any further speed-up (see the sketch after this list).
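For the CPU-bound case, a common pattern is to cap the number of workers at the number of cores Go may use in parallel. A minimal sketch; the job type and the process function are placeholders:

package main

import (
    "runtime"
    "sync"
)

// process stands in for CPU-bound work on data that's already in memory.
func process(job int) {
    _ = job * job
}

func main() {
    jobs := make(chan int)
    var wg sync.WaitGroup

    // One worker per core that Go may run in parallel; more workers would
    // only add scheduling overhead for purely CPU-bound work.
    for w := 0; w < runtime.GOMAXPROCS(0); w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs {
                process(job)
            }
        }()
    }

    for i := 0; i < 1000; i++ {
        jobs <- i
    }
    close(jobs)
    wg.Wait()
}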
Concurrency is inherently difficult, even though Go makes it more accessible through primitives that are easy to reason about (goroutines and channels, that is). So always keep an eye on how concurrent flows unfold in your app: Monitor CPU and scheduling profiles to spot contention, memory use, and goroutine leaks, and adjust your concurrent flows accordingly.
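One low-effort way to get at those profiles is the net/http/pprof package from the standard library. A minimal sketch (the port is arbitrary):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... the actual application work goes here ...
    select {} // keep this sketch running so the endpoint stays reachable
}

With the endpoint running, go tool pprof http://localhost:6060/debug/pprof/profile collects a CPU profile, and /debug/pprof/goroutine shows what all goroutines are currently doing.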
Conclusion: To write good software, learn hardware
As a universal rule, if you want to get better in a particular domain of life, lean out of the window, peek over the fence, and explore the domain adjacent to yours. It is always interesting and sometimes revealing when you take on a new perspective.
Exploring modern hardware is fascinating, and if modern CPU architecture feels overwhelming, start with programming a microcontroller.
Whatever you decide to explore outside the domain you want to get better at, it'll pay off in many ways.
Further reading
Within the boundaries of a Spotlight, I can barely scratch the surface of this topic. No worries! There's a lot of great stuff about mechanical sympathy just one click away:
About cache misses and concurrent access to data
Slow down your code with goroutines
The Go scheduler explained with chefs, pans, and stoves
About mechanical sympathy
Mechanical Sympathy: Understanding the Hardware Makes You a Better Developer
Hardware-Aware Coding: CPU Architecture Concepts Every Developer Should Know
Mechanical sympathy and Go
Writing Go Code with Mechanical Sympathy: Optimizing for Hardware Efficiency
Quote of the Week: On implementing ideas
I'd rather build and fail than spend years debating in GitHub comments whether it's a good idea.
More articles, videos, talks
When Should a Spring Boot Dev Actually Switch?
What could make a Java Spring Boot developer want to switch to Go or Rust? Mohammedouchkhi asked /r/golang and distilled the answers into this article.
Go proposal: Goroutine metrics
Another accepted proposal, explained by Anton Zhiyanov: runtime/metrics will report goroutine states, and goroutine and thread counts.
Gist of Go: Concurrency testing
Kernighan's Law says that debugging is twice as hard as coding—and since concurrent programming is already hard, what about debugging concurrent code? Well, better arm your code with good tests to avoid debugging as much as possible.
Projects
Libraries
GitHub - KromDaniel/regengo: a compile-time finite state machine generator for regular expressions
While typical regular expression libraries do most of the work at runtime, regengo moves the bulk of the work to compile time by generating a state machine from the regexp. (Compare to modernc.org/rec.)
GitHub - sandrolain/httpcache: A Transport for http.Client that will cache responses according to the HTTP RFC
For httpcache users: Version 1.4.0 adds a few responses for even better RFC 9111 compliance. A heads-up: This is also the final v1 version. The developer is now focusing on v2, because modernizing the library requires breaking changes.
Tools and applications
GitHub - 1rhino2/go-memory-visualizer: Real-time Go struct memory layout visualization and optimization for VS Code. Analyze padding, alignment, and cache performance with one-click field reordering.
When every byte in memory counts, having immediate feedback on the memory layout of a structure can come in handy.
GitHub - MadAppGang/dingo: A meta-language for Go that adds Result types, error propagation (?), and pattern matching while maintaining 100% Go ecosystem compatibility
Is Go having its TypeScript moment?
Supertonic (TTS) - a Hugging Face Space by Supertone
Make your Go apps talk to the user. Supertonic is a fast text-to-speech (TTS) engine that comes with a Go SDK and runs on the local machine or even inside a browser, for maximum privacy (and independence from network connections).
Completely unrelated to Go
There's always going to be a way to not code error handling
Imagine a language where error handling is the happy path.
A ChatGPT prompt equals about 5.1 seconds of Netflix
What would you rather do tonight: Watch Netflix for 1.5 hours, or run 500-1000 ChatGPT queries?
(Spoiler: From an energy standpoint, it's the same.)
