How did machines get “smart”?
In March 2016, the computer program AlphaGo faced off against Lee Sedol, one of the world's top Go players. Computers had already proven their superiority at chess, but Go was widely thought to be out of reach for machines, because its space of possible moves is far too vast for any computer to search exhaustively. AlphaGo won 4-1.
AlphaGo's lead said in an interview that the model “play[ed] millions of matches against itself, learning from each victory and loss.”¹
A decade later, people routinely rely on Large Language Models for knowledge work and obtain decent results. Before that, the main way computers contributed to our writing was spell-check. Popular descriptions of these models tell us that all they do is predict text “One Word at a Time”.²
We have had game programs playing each other for decades. The idea of using statistics to predict text goes back to at least 1913.³ To understand what recently changed the performance of these systems, we must look closer at the mechanics of Deep Learning.
Distinguishing AI from Traditional Programming
Computers can do a large variety of things, but few of them are naturally categorized as Artificial Intelligence (AI). I would define AI as a system where the “smarts” are in the program’s ability to adapt, rather than exclusively residing in the programmer’s explicit instructions.
Take an operating system, for example. You can put Linux on a USB drive and boot it on virtually any computer. But this works because systems are designed modularly and rigorously tested against a matrix of hardware configurations. The programmer anticipated the variations. That is not quite Artificial Intelligence.
True AI comes into play when a program is highly likely to encounter situations the programmer hasn’t explicitly prepared for. In areas like language or vision, there are far more edge cases than one could ever manually code.
So, how have we dealt with these sprawling problems?
The Era of Expert Systems
You can try to find expert rules, model the problem conceptually, and attempt to emulate it via an algorithm. For generating text, you might create a list of words and encode grammar structures directly into code. For chess, you might try to hard-code human strategies and heuristics to narrow down the choices.
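To make the text-generation idea concrete, here is a minimal sketch of a hand-written grammar. The word lists and the single "subject verb object" rule are made up for illustration; a real rule-based system would need vastly more of both.

```python
import random

# Hypothetical vocabulary, chosen by the programmer ahead of time.
NOUN_PHRASES = ["the cat", "the dog", "the bird"]
VERBS = ["chases", "sees", "ignores"]


def make_sentence(rng: random.Random) -> str:
    """Apply one hard-coded grammar rule: subject + verb + object."""
    subject = rng.choice(NOUN_PHRASES)
    verb = rng.choice(VERBS)
    obj = rng.choice(NOUN_PHRASES)
    return f"{subject} {verb} {obj}."


rng = random.Random(0)  # fixed seed so repeated runs give the same sentences
for _ in range(3):
    print(make_sentence(rng))
```

Every sentence this produces is grammatical, but only because the programmer anticipated the structure. Anything outside the rule simply cannot be generated.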
Brute Force and Search Trees
Early pioneers like Claude Shannon approached game playing by mapping out a tree of possible future moves. The issue here is the “combinatorial explosion”: the number of positions grows so fast with each extra move of lookahead that you can spend vast amounts of computation exploring a bad path. Even so, powerful traditional chess engines like Stockfish successfully combined hard-coded human heuristics with deep search trees to dominate the game.
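A search tree is easier to see on a toy game than on chess. The sketch below uses a simplified Nim (an assumption made for illustration, not something the engines above play): players alternately take 1 to 3 stones, and whoever takes the last stone wins. It runs a plain minimax search with no heuristics and counts how many positions it visits.

```python
def minimax(stones: int, maximizing: bool, counter: list) -> int:
    """Return +1 if the maximizing player wins with perfect play, else -1."""
    counter[0] += 1  # count every position the search visits
    if stones == 0:
        # The player who just moved took the last stone and won.
        return -1 if maximizing else 1
    scores = [
        minimax(stones - take, not maximizing, counter)
        for take in (1, 2, 3)
        if take <= stones
    ]
    return max(scores) if maximizing else min(scores)


def solve(stones: int) -> tuple:
    """Search the whole tree; return (outcome for first player, positions visited)."""
    counter = [0]
    outcome = minimax(stones, True, counter)
    return outcome, counter[0]


for n in (4, 8, 12, 16):
    outcome, visited = solve(n)
    print(f"{n} stones: outcome {outcome}, positions visited {visited}")
```

Even with only three choices per turn, the number of visited positions grows by a roughly constant factor for each added stone. Chess and Go have far more options per move, which is why exhaustive search alone does not scale.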
Early Machines that Learned
In the 1950s and 1960s, Arthur Samuel developed a system that learned to play checkers. The system pitted a training version against the previously strongest version of itself, incrementally tweaking its own parameters. However, the constraints of this predefined learning approach meant it didn’t necessarily apply to more complex domains, and it often lost to human players.
From Heuristics to Machine Learning
In the process of living online, humanity has generated staggering amounts of example data. Long before the internet, Andrey Markov manually counted letters in a novel to calculate the statistical frequency of vowels and consonants.
Computers excel at performing simple operations at scale. That is more or less the only thing they do. So we can easily extend Markov’s work from letters to words and from a novel to massive collections of literature.
For language, we can calculate the statistical frequencies of word pairs or triplets, and string them together to generate something resembling sentences. This would fall under the definition of traditional machine learning: turning a general problem into a statistical one and extracting some useful signal from those statistics. When applied to a large amount of text, we can get something that looks like language but has little to no meaning.
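Here is a rough sketch of the word-pair idea. The tiny corpus and the function names are invented for illustration; the same counting would work on a whole library of text.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(text: str) -> dict:
    """Count how often each word is followed by each other word."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts: dict, start: str, length: int, seed: int = 0) -> str:
    """Chain words together, picking each next word in proportion to its count."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = counts.get(word)
        if not followers:
            break  # this word never had a successor in the corpus
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        output.append(word)
    return " ".join(output)


corpus = "the cat sat on the mat and the cat ran"
model = train_bigrams(corpus)
print(generate(model, "the", 8))
```

Each adjacent pair in the output was seen in the corpus, so short stretches look plausible, but nothing ties the sentence together as a whole.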
The Neural Revolution
A biological neuron is a cell specialized to integrate and transmit electrochemical signals. When neurotransmitters bind to it, the cell’s voltage shifts. If the voltage surpasses a specific threshold, the neuron fires, releasing more neurotransmitters to downstream cells in a massive cascading chain.
Artificial neurons are loosely inspired by the real ones. Each upstream neuron provides a number to the current neuron. That number is scaled by a multiplier (called a weight), representing the strength of that particular connection. The neuron adds up its weighted inputs and passes the sum through an activation function (the mathematical stand-in for the threshold).
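In code, a single artificial neuron is only a few lines. This sketch assumes the common textbook form: a weighted sum of the inputs plus a bias term, passed through a sigmoid activation. The bias and the choice of sigmoid are my additions for illustration; other activation functions are widely used too.

```python
import math


def sigmoid(x: float) -> float:
    """Smooth threshold: squashes any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs: list, weights: list, bias: float) -> float:
    """Weight each upstream value, add them up, then apply the activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(total)


# Two upstream values; the second connection is negative, so it inhibits firing.
print(neuron([1.0, 2.0], [0.5, -0.25], bias=0.0))
```

A strongly positive weighted sum pushes the output toward 1 (the neuron "fires"), while a strongly negative one pushes it toward 0.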
In Deep Learning, millions of these artificial neurons are organized into stacked layers. Data enters the initial layer and is sequentially transformed, layer by layer, into increasingly abstract and useful representations.
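Stacking layers just means feeding one layer's outputs into the next. The sketch below is a toy with hand-picked weights and a ReLU activation (both assumptions for illustration); a real network has far more neurons, and its weights are learned rather than typed in.

```python
def relu(x: float) -> float:
    """A simple activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, x)


def layer(inputs: list, weights: list, biases: list) -> list:
    """One layer: each row of weights, with its bias, defines one neuron."""
    return [
        relu(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs: list, layers: list) -> list:
    """Pass the data through every layer in order."""
    for weights, biases in layers:
        inputs = layer(inputs, weights, biases)
    return inputs


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
network = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [-0.5]),
]
print(forward([2.0, 1.0], network))
```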
The real breakthrough wasn’t just the architecture; it was the realization that neural networks could algorithmically adjust their own weights to “learn” the relationships within the data, given enough compute and examples.
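What does "adjusting its own weights" look like? One standard way to do it is gradient descent, which I am using here as an illustration since the text above does not name a specific method. This sketch learns a single weight so that weight * x matches the examples, which were generated from the made-up rule y = 3x.

```python
def train(examples: list, learning_rate: float = 0.01, epochs: int = 200) -> float:
    """Learn one weight by repeatedly nudging it to reduce the squared error."""
    weight = 0.0  # start knowing nothing
    for _ in range(epochs):
        for x, target in examples:
            prediction = weight * x
            error = prediction - target
            # The gradient of error**2 with respect to weight is 2 * error * x.
            weight -= learning_rate * 2 * error * x
    return weight


# Hypothetical training data following y = 3x.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
learned = train(examples)
print(f"learned weight: {learned:.4f}")
```

Nobody told the program the answer was 3. It only saw examples and a rule for reducing its own error, and the weight settled close to 3 on its own. Deep networks apply the same idea to millions of weights at once.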
Turning the World into Numbers
But how do we feed reality into an algorithm? We’ve mastered the art of digitizing the world. Humanity turned spoken language into written characters, and computers turn characters into standard numerical codes. Images are grids of pixels, which are just clusters of numbers determining red, green, and blue intensities. Audio can be turned into a list of numbers representing sound pressure, obtained from a microphone.
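A quick sketch of what these conversions look like in practice. The 2x2 image and the synthetic 440 Hz tone are stand-ins I made up; real data would come from a file, a camera, or a microphone.

```python
import math

# Text: each character maps to a standard numeric code (its Unicode code point).
text = "Go"
char_codes = [ord(ch) for ch in text]

# Image: a tiny 2x2 grid where each pixel is a (red, green, blue) triple, 0 to 255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
flat_pixels = [channel for row in image for pixel in row for channel in pixel]

# Audio: a pure tone sampled 8000 times per second, standing in for a microphone.
SAMPLE_RATE = 8000
samples = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(8)]

print(char_codes)
print(flat_pixels)
print([round(s, 3) for s in samples])
```

In all three cases the result is the same kind of object: a plain list of numbers.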
So, all we have to do is feed these massive arrays of numbers into a neural network, add more layers, and we get intelligence, right? We will examine that next.