Imagine a hyper-intelligent, god-like alien called Omega arrives on Earth. It offers you two boxes. One of them is transparent, and contains £1,000. The other is opaque, and contains either £1 million, or nothing at all. Omega has now disappeared off to its own planet, and left you with a decision. You can choose either both boxes, or just the opaque one.
It may sound obvious that you should take both: you get the extra thousand in either case. But the twist is that, before Omega filled the boxes, it predicted what you would do. And if it predicted you would take just the opaque box, it put the million pounds in there. If it predicted you’d take both, it left the opaque box empty. It knows you know this. It’s done this to 100 other people already, and it’s successfully predicted their choice 99 times. What is the rational, optimal thing to do? Do you two-box or one-box?
This thought experiment is called Newcomb’s paradox, and it has been a source of heated philosophical debate for more than 50 years. The philosopher Robert Nozick, in a long and agonised discussion of the problem, wrote in 1969:
“I have put this problem to a large number of people, both friends and students in class. To almost everyone it is perfectly clear and obvious what should be done. The difficulty is that these people seem to divide almost evenly on the problem, with large numbers thinking that the opposing half is just being silly.”
The book Rationality: From AI to Zombies mentions this problem. Its author, Eliezer Yudkowsky, acknowledges the philosophical difficulties. But he has a robust and sensible approach. The rational course of action, he says, is the course of action that gives you a million pounds, rather than the one that doesn’t.
In the early 2000s, the young Yudkowsky had something of an epiphany. He was only 20 or so, but he was already quite a well-known figure in certain subcultures of the early internet: the transhumanist, technophile set who talked about cryonics and uploading our brains into the cloud.
This was before social media, so these subcultures existed as mailing lists. Great long nerdy conversations went on in people’s emails. Yudkowsky set up his own, SL4; it was about how to bring about the technological Singularity, where advances in computing and AI will render human life unrecognisable. He also founded what was then known as the Singularity Institute (now the Machine Intelligence Research Institute), with the same goal.
The idea was to bring about superintelligent AI as quickly as possible, because, he thought, being superintelligent, it would be clever enough to fix all the world’s problems: “I have had it,” he wrote, as a 17-year-old in 1996.
“I have had it with crack houses, dictatorships, torture chambers, disease, old age, spinal paralysis, and world hunger. I have had it with a planetary death rate of 150,000 sentient beings per day. I have had it with this planet. I have had it with mortality.
“None of this is necessary. The time has come to stop turning away from the mugging on the corner, the beggar on the street. It is no longer necessary to look nervously away, repeating the mantra: ‘I can’t solve all the problems of the world.’ We can. We can end this.”*
The reasoning went, very roughly: human intelligence has made human life much better, over the last few centuries; if we make something much more intelligent, it will fix the problems that remain.
But some time between 2000 and 2002, Yudkowsky had the aforementioned epiphany: just because an AI is intelligent, doesn’t mean it will do what we want it to do. Instead, it will do what we tell it to do. And the difference between those two statements is huge, and potentially disastrous. AI could, he said, ultimately destroy humanity, if we are not immensely careful about how we build it: not because it “goes rogue”, or “becomes sentient”, but simply because it follows its instructions to the letter, rather than in spirit.
Yudkowsky had, though, a devil of a time convincing people of this. If it’s intelligent, it would know what we wanted it to do, his critics argued. It would know the right thing to do.
So, in 2006, he started to write a blog, to explain why they were all wrong. He had to explain why artificial intelligence wouldn’t be like human intelligence. But, he realised, to explain that, he had to explain why human intelligence wasn’t exactly like rational thought — it is full of biases and shortcuts and systematic errors.
But to explain why human thought wasn’t like rational thought, he found, he had to explain what rational thought actually was — in the decision-theory sense, of making optimal decisions with the available information. And then, he found, he had to explain all the ways in which human thought differed from rational thought: he had to list all those biases and shortcuts and systematic errors.
And to explain those, he had to explain — almost everything. At one point, he wanted to write a relatively unimportant post using an analogy of AI utility functions. But in order to do that, he felt he ought to first explain a few other concepts. To explain those, he realised, he needed to explain a few more. But to explain those…
Anyway, long story short, over the following month he ended up writing about two dozen blog posts explaining evolution by natural selection. And then he could write the not-especially-central-to-his-thesis post about the utility functions.
The whole series of blog posts, which became known as the Sequences, sprawled out to a million words. It was edited down into an ebook version, Rationality, which is a more manageable 600,000 or so — though still rather longer than War and Peace, or than all three volumes of The Lord of the Rings put together.
The Sequences are the founding text for what would, later, become known as the “rationality community”, or the “rationalists”. I’ve written about them, and their AI concerns, a few times in the past; my own first book, The Rationalist’s Guide to the Galaxy, is about them. (It’s also just 80,000 words and gets most of the key ideas across, if you are pressed for time. All good bookshops!) They are a strange bunch: clever, thoughtful, sometimes paranoid; they get accused of being a cult, with the Sequences as their holy scripture and Yudkowsky as a sort of messianic figure (although I think that is not true).
But the community is that rare thing, on the internet: a place where people can disagree and argue calmly and in good faith. And it sprang up around Yudkowsky’s great sprawling series of blog posts, because he explicitly encouraged that norm.
As well as — or as part of — trying to lay the foundations for AI safety, the book tries to establish what it means to be rational, and to make us more so. In other words, the aim is to help us be “less wrong”, as the blog the Sequences are published on is called. Following decision theorists like Judea Pearl, Yudkowsky defines rationality in two forms.
First is epistemic rationality, establishing true beliefs about the world — maximising the extent to which my mental model of the universe, the “map” in my head, corresponds to the universe itself, the “territory” of reality. I believe that there is a door behind me, and that black holes emit Hawking radiation: those are facts about my brain. Whether there actually is a door behind me, or whether black holes really do emit Hawking radiation, are facts about reality.
Epistemic rationality is about trying to make the beliefs line up with the facts. “This correspondence between belief and reality is commonly called ‘truth’,” says Yudkowsky, “and I’m happy to call it that.”
The second is instrumental rationality, choosing the course of action that is most likely, given what you know now, to bring about the things you want to achieve. If you want to achieve Goal X — wealth, world domination, world peace, preventing climate change, Oxford United winning the Champions League — what steps are most likely to achieve that? Rationality, he says, means winning. Winning at whatever you want to win at, whether it’s saving the world or making money. But winning.
I can’t begin to go into the whorls and sprawls of its reasoning: in its efforts to ground morality and rationality on Bayes’ theorem and utilitarian calculus, Rationality reminds me of one of those huge Enlightenment-era works of philosophy, by Spinoza or Leibniz or someone, that start out with some guy sitting in an armchair and end up trying to establish a Theory of Everything. But I do want to come back to Newcomb’s problem, because I think it is key.
In 1969, Nozick concluded, after much agonising, that you should take both boxes. Imagine the far side of the opaque box is transparent, he says, and that a friend of yours can see into both. Whatever Omega has done, whether the £1 million is there or not, she would be hoping that you take both: if it’s not there, you at least gain £1,000; if it is, you get a bonus £1,000 on top of your million. It’s already happened. The money is in the box, or it isn’t.
Yudkowsky, though, disagrees. The rational thing to do, he says, assuming that you would like a million pounds, is the thing that is most likely to make you a million pounds. Almost everyone who chose to one-box got a million pounds. Almost everyone who chose to two-box didn’t. Only a very clever person could convince themselves that the sensible thing to do is the option that will almost certainly cost you £1 million.
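Yudkowsky’s argument can be put as a quick expected-value sketch. The numbers below simply take the 99-out-of-100 hit rate from the thought experiment at face value and assume it applies to your own choice; that assumption is exactly what two-boxers dispute:

```python
# Illustrative expected-value calculation for Newcomb's problem,
# assuming Omega's observed 99/100 accuracy applies to your choice.
ACCURACY = 0.99          # probability Omega predicts your choice correctly
SMALL, BIG = 1_000, 1_000_000

# One-box: the million is in the box iff Omega correctly
# predicted you would one-box.
ev_one_box = ACCURACY * BIG

# Two-box: you always get the visible £1,000; the million is there
# only if Omega *wrongly* predicted you would one-box.
ev_two_box = SMALL + (1 - ACCURACY) * BIG

print(f"one-box: £{ev_one_box:,.0f}")   # roughly £990,000
print(f"two-box: £{ev_two_box:,.0f}")   # roughly £11,000
```

On these numbers the one-boxer expects nearly ninety times as much money, which is the whole of Yudkowsky’s point: rationality means winning.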
That’s why I like the book. It may address very serious, high-concept ideas — AI, transhumanism, morality, alien gods visiting the planet and playing strange parlour games with boxes — but it gets to them all in a robust, commonsense way. The moral thing to do is, usually, the thing that kills the fewest people, or makes the largest number happy. As he writes, in a wonderful section on how it seems obvious that consciousness has an effect on the world, because philosophers write books about consciousness, “You can argue clever reasons why this is not so, but you have to be clever.”
But following these robust, commonsense views to their logical conclusion often ends up in strange places. If you accept some fairly uncontroversial premises, you can end up with AIs destroying the world, brains uploaded to the cloud, galaxy-spanning civilisations of quasi-immortal demigods.
It’s been especially relevant this year. The rationalist community was well ahead of the public, and of many scientists, in spotting the dangers of Covid-19: they are used to data, to exponential curves, to reasoning under uncertainty. They, and the wider tech community, were wearing masks and stocking up on essential goods even as others insisted the flu was the thing to worry about, or mocked tech workers for avoiding handshakes. And now, when we should be working out the risks to ourselves and our families of going home for Christmas, a bit of extra familiarity with Bayes’ theorem and probability theory would probably not be a bad idea.
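Bayes’ theorem itself fits in one line. Here is a minimal sketch of the kind of update it licenses; the base rate and test accuracies below are invented for illustration, not real Covid figures:

```python
# Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
# Toy example with invented numbers: how much a positive test
# result should shift your belief that you are infected.
prior = 0.05            # assumed base rate of infection (illustrative)
p_pos_given_inf = 0.90  # test sensitivity (illustrative)
p_pos_given_not = 0.10  # false-positive rate (illustrative)

# Total probability of a positive result, infected or not.
p_pos = p_pos_given_inf * prior + p_pos_given_not * (1 - prior)

# Updated (posterior) belief after seeing the positive test.
posterior = p_pos_given_inf * prior / p_pos

print(f"P(infected | positive test) = {posterior:.2f}")
```

Note that even a fairly accurate test leaves the posterior around one in three here, because the base rate is low: exactly the sort of counterintuitive result that familiarity with the theorem guards against.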
In the 12 years or so since the Sequences ended, Yudkowsky himself has withdrawn somewhat, working at MIRI or on other projects, occasionally cropping up on Facebook or Twitter. But they remain as a fascinating piece of writing, pulling together a thousand strands of thought into one place. And, whether or not you agree with their conclusions on AI or rationality, they have created a small, safe harbour for peaceable disagreement on the internet. That alone is a legacy worth having.
*Precocious 17-year-olds can be quite annoying.