January 25, 2021

“Ain’t I a woman?” asked the American abolitionist Sojourner Truth in 1851. In terms of human worth and value, was she not the equal of any white woman? Well, no. Not according to modern face recognition software. As computer scientist Joy Buolamwini discovered when researching AI bias some 170 years later, the algorithm thought Sojourner was a man.

Many image classification systems tend to misclassify black women’s faces as male. That’s if they recognise them as human faces at all. Google’s image classification system got it badly wrong in 2015 when it mislabelled black people as ‘Gorillas’.

It’s not unusual for algorithms to make mistakes about humans. Sometimes those errors are harmless. You might see adverts for inappropriate products, or jobs for which you are wildly unsuitable. At other times they are more serious: sending people to jail with the wrong sentence, or rejecting potential job candidates because they don’t resemble recruits from past years.

It’s not just that the programs aren’t intelligent enough; it’s that they don’t share our values. They don’t have goals like “fairness” or even “not being racist”. And the more we delegate decisions to machines, the more that matters.

That’s the problem Brian Christian tackles in his new book, The Alignment Problem, subtitled How Can Machines Learn Human Values? It is one pithy question encompassing several others: How can machines learn anything? What are “human values”? And do either machines or humans learn values in the same way that they learn information, strategies, skills, habits and goals?

The book begins with an account of how machines came to learn at all. It’s a history of brilliant and often odd people, ideas that seemed absurdly far-fetched, inspiration found in unlikely places, and creators confounded by their own creations.

From the early 20th century, computer development and research into human cognition were parallel projects with myriad connections. Engineers looked to the human mind to help them design machines that would think, but equally, neuroscientists and psychologists looked to mathematics and logic to build conceptual models of human thought. Those machines, which began with the simple task of recognising whether a square was on the left or the right of a card and progressed to playing Go better than any human, were seen as working models of the human mind.

In this way, familiar ideas in computer science map, very broadly, onto more familiar ideas in psychology. For example, cognitive scientist Tom Griffiths builds reward-motivated AI systems. He describes his daughter dropping food on the floor to earn praise for sweeping it up again in the same terms he might use at work. “As a parent you are designing the reward function for your kids, right?” He learned to praise her for a clean floor, not for the act of sweeping up.
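Griffiths's point about reward design can be made concrete with a toy example. What follows is a minimal sketch, not anything taken from the book: the two reward functions, the three actions ("drop", "sweep", "wait") and the two behaviours are all hypothetical, chosen only to show how rewarding the act of sweeping pays off for a child who makes a mess first, while rewarding the clean floor does not.

```python
def step(dirty, action):
    """Return whether the floor is dirty after the action."""
    if action == "drop":
        return True
    if action == "sweep":
        return False
    return dirty  # "wait" leaves the floor as it was


def reward_for_sweeping(dirty_before, action, dirty_after):
    """Praise the act: one point each time a dirty floor is swept."""
    return 1 if action == "sweep" and dirty_before else 0


def reward_for_clean_floor(dirty_before, action, dirty_after):
    """Praise the outcome: one point for each step that ends with a clean floor."""
    return 0 if dirty_after else 1


def drop_then_sweep(dirty):
    """Hypothetical behaviour: make a mess, then clean it up, and repeat."""
    return "sweep" if dirty else "drop"


def leave_it_clean(dirty):
    """Hypothetical behaviour: sweep only if needed, otherwise do nothing."""
    return "sweep" if dirty else "wait"


def total_reward(policy, reward_fn, steps=10):
    """Add up the reward a behaviour earns, starting from a clean floor."""
    dirty, total = False, 0
    for _ in range(steps):
        action = policy(dirty)
        dirty_after = step(dirty, action)
        total += reward_fn(dirty, action, dirty_after)
        dirty = dirty_after
    return total


for reward_fn in (reward_for_sweeping, reward_for_clean_floor):
    for policy in (drop_then_sweep, leave_it_clean):
        print(reward_fn.__name__, policy.__name__, total_reward(policy, reward_fn))
```

Over ten steps, praising the sweeping gives the mess-maker 5 points and the tidy child none. Praising the clean floor gives the mess-maker 5 and the tidy child 10. The behaviours are the same in both cases; only the reward function has changed.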

When they couldn’t get their computer programs to succeed at given tasks, researchers looked back at humans to help them see what was missing. Machines quickly overtook humans in certain ways: speed of logical reasoning, or processing more information than one human could handle at once. But that was not enough. What did neural networks lack that real human brains used to solve new problems and learn to navigate new environments?

Computers built to mimic human logical reasoning were missing fundamental human drives, among them curiosity. Novelty and surprise, it turned out, were as important as information processing, not only for human life but for machines playing the Atari game Montezuma’s Revenge. The human brain’s dopamine system suggested a new model to reward experimentation and persistence in game-playing computer programs, and that equipped them to win.
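One simple way to reward novelty is to pay a bonus that shrinks each time the program revisits the same situation. The sketch below illustrates that general idea only; it is not the specific method used on Montezuma’s Revenge, and the class name, the room labels and the one-over-square-root formula are assumptions made for the example.

```python
import math
from collections import Counter


class NoveltyBonus:
    """Pay a bonus that fades as a state becomes familiar (illustrative only)."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.visits = Counter()  # how many times each state has been seen

    def __call__(self, state):
        self.visits[state] += 1
        # Assumed formula: bonus = scale / sqrt(number of visits so far)
        return self.scale / math.sqrt(self.visits[state])


bonus = NoveltyBonus()
first_visit = bonus("room_1")   # a new room earns the full bonus
second_visit = bonus("room_1")  # the same room again earns less
new_room = bonus("room_2")      # an unexplored room earns the full bonus again
print(first_visit, second_visit, new_room)
```

A program that adds this bonus to the game’s own score has a reason to open doors it has not opened before, even when the game itself offers no points for doing so.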

Alongside the machines, their human creators were also learning: not only how to build them, and what data to train them on, but also how to teach them. Learning by imitation, which comes naturally to humans almost from birth, can also work for robots using Artificial Intelligence. With a method called Inverse Reinforcement Learning, AI can even infer what a human is trying to do, and outstrip its teacher, in complex tasks like flying helicopter drones.

Machines can learn to outperform their human teachers at playing games, sorting images, or even controlling a vehicle on real roads. A robot can imitate your behaviour, and infer your goals, but can it learn the right thing to do? Can machines learn human values?

Ask a philosopher this question and you would probably get a question in return: What are “human values”? Most of us muddle along with an impure, cobbled-together morality made up of habits, boundary lines we took from family, religion, law or social norms, other lines we drew ourselves as a result of experience, instincts of fairness, loyalty, love and anger, and intuitions that we’d struggle to explain. How we behave is seldom as simple as applying an ordered list of moral principles. We’re influenced by what other people are doing, whether other people are watching, and who we think will find out what we did.

We certainly don’t have one, unanimous set of Human Values ready to be inserted into a computer. So if we humans sometimes struggle to know what is right or wrong, how can we expect a machine to get the correct answer? This is a problem that Brian Christian is slow to address, though it could just be that he comes at the question almost from the point of view of an AI program.

Christian goes to a problem that can be expressed in mathematical terms to a machine: uncertainty. How do we make decisions with incomplete information? That is something that can be programmed. Just as the human brain turned out, after all, not to run on binary logic, computers designed to use probability, instead of deterministic pictures of the world, cope better with real conditions. Can this approach help either humans or robots make decisions when we are not sure what is right or wrong?
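The simplest version of deciding under uncertainty is to weigh each possible outcome by how likely it is, and pick the option with the best average. The sketch below shows only that arithmetic. The options, probabilities and values are invented for illustration and come from neither the book nor any real system.

```python
def expected_value(outcomes):
    """Average the values of the outcomes, weighted by their probabilities."""
    return sum(probability * value for probability, value in outcomes)


# Hypothetical choice: each option lists (probability, value) pairs,
# here for a 30% chance of rain and a 70% chance of a dry day.
options = {
    "take_umbrella": [(0.3, 5), (0.7, -1)],
    "leave_umbrella": [(0.3, -10), (0.7, 2)],
}

best = max(options, key=lambda name: expected_value(options[name]))
print(best)
```

The hard part, as the questions above suggest, is not the sum but the numbers that go into it: somebody still has to decide what counts as a good or bad outcome, and by how much.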

Christian likes the example of the effective altruism movement, which takes a strongly utilitarian approach to morality. The right decision is the one that maximises the good resulting from the act. For example, effective altruism recommends funding the fight against malaria as the most good you can do with your money.

This approach lends itself more neatly to mathematical reasoning than the usual mess of categorical principles, consequentialist calculations and spontaneous intuitions that most of us use to make decisions. It also poses the question: what is “good”? But if we can agree on a broadly acceptable answer, it would be easy for a machine to apply.

AI designed to make optimal utilitarian decisions might result in more good overall, but it might also mean, for example, sacrificing innocent lives to save others. It would not always be aligned with our moral instincts, or with the way humans like to treat one another.

If moral values can’t be expressed as mathematics, can machines ever learn to share our values? Researchers trying to improve how machines learn stumbled on an illuminating insight. Stuart Russell was part of a team using Inverse Reinforcement Learning, a technique in which a program observes what a human does and infers what that person is trying to do. Building on IRL, his team developed Co-operative Inverse Reinforcement Learning, or CIRL. With CIRL, the robot takes on the (inferred) human goal as its own goal, but as helper, not usurper. Instead of driving, or flying, or playing Go, better than its human teacher, the AI joins the human’s support crew. The human’s goal becomes a shared goal. Machine and human goals are aligned.

This seemed to be a revelation for the researchers. “What if, instead of allowing machines to pursue their objectives, we insist that they pursue our objectives?” Christian quotes Russell as saying. “This is probably what we should have done all along.”

Humans are social. Moral values are social. Thinking is social. The whole of human society is based, not on a solitary thinker deriving everything from abstract reasoning, but on people co-operating, learning from each other, testing and revising our ideas and values, and persisting in profound disagreements some of which, in time, result in seismic changes in how we live together.

“Human Values” are not an abstraction living in one human mind, let alone one that can be programmed into an artificial intelligence. For machines to become aligned with human values, they would have to work alongside us, learning and adapting and changing, and they would still only align partially, with some of us more than others. And we would have to accept that.

But because we are social creatures, if we live alongside machines that learn from us and interact with us, will we not also be changed?

Christian briefly describes a future in which our machine helpers influence our behaviour, but this world is already here. Whether to sell us products or nudge us into healthier habits, our ubiquitous devices give us feedback and reinforcement as much as we give it to them. The more machines become our medium of relating to one another, the more we ourselves become understood as systems that are measurable and predictable in mathematical terms.

Alongside the history Christian tells, of AI as working models of the human mind, is a parallel shadow history, of human beings understood as machines. Our understanding of the brain, the mind, and the human person has been influenced over the last century by mathematics, logic and computer science. The aspects of human life that can be quantified, gamified and datafied constitute the working model of ourselves for which products and policies are designed.

In the conclusion, Christian seems to recognise that this redefinition of humans, as walking neural nets being trained by the systems around us, is the real danger. While our attention is focused on the spectre of super-intelligent AI taking over the world, “We are in danger of losing control of the world not to AI or to machines as such but to models, to formal, often numerical specifications for what exists and for what we want.”

And, he might add, for what we are.
