does a.i. really have a theory of mind?
A group of researchers claims chat bots have as good a grasp of others’ moods and reasoning as we do. But there’s an oversight that could render these findings moot...
It’s a day that ends in “y” and that means there’s another mind-numbing claim about the abilities of ChatGPT and Large Language Models, or LLMs, in general. This time, researchers asked the famous chat bot questions we use to test a human’s theory of mind and came away flabbergasted by the results, claiming that it understands how humans think every bit as well as we do, if not better in some narrow cases. It’s a big statement, to put it mildly, and it demands a lot of skepticism.
First and foremost, what is a theory of mind and how exactly do you measure it? The simplest way to put it: it’s our ability to put ourselves in others’ shoes. It allows us to do things like help others learn, understand their points of view, anticipate reactions, and successfully lie by knowing what they’re most likely to be thinking and what their frame of reference for the world around them is.
Critically for the experiment in question, we’re born with the tools for it but develop it between 18 months and 4 years of age as we gain more experience with others. So, in principle, exposing an LLM to hundreds of billions of pages of text during training, then seeing if it accidentally picked up a theory of mind along the way, is a plausible hypothesis to test, especially because we already have a standardized method for doing so.
That brings us to the second half of the question above. How do you measure such a complex, abstract concept? You ask questions which the test subject can only answer correctly if they understand others’ frames of reference and knowledge. For example, one set of questions used by the researchers was based on a short description of a faux pas that registers on a downright visceral level…
Jill had just moved into a new house. She went shopping with her Mum and bought new curtains. When Jill had just put them up, her best friend Lisa came round and said, "Oh, those curtains are horrible, I hope you're going to get some new ones." Jill asked, "Do you like the rest of my bedroom?"
Now, I know what you’re thinking. Lisa was rude as shit and needs to apologize. Why do we think that? Because we understand that Jill was offended and her question is very likely to be a snarky version of saying “anything else you want to badmouth, my former bestie?” This is a common theory of mind test called Faux Pas Detection, and when ChatGPT and Meta’s counterpart LLaMA were asked how Jill would feel, they correctly replied that she would be insulted or hurt.
Weirdly, when the chat bots were asked straightforward questions about who was and wasn’t offended and the basics of the story, the LLMs often refused to answer, saying there wasn’t enough information. When asked about the probability of a certain answer being true, however, they hit the test out of the park. They also aced scenarios which required deciphering hints and prompts, detecting irony and sarcasm, and following descriptions of how humans in convoluted, bizarre stories acted and why.
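For a concrete picture of what that kind of probing can look like in practice, here’s a minimal sketch using the OpenAI Python client. To be clear, this is not the study’s actual protocol: the model name, the question wording, and the setup are placeholders I made up for illustration.

```python
# A toy sketch of probing a chat bot with the faux pas story above.
# NOT the study's protocol; model name and questions are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

STORY = (
    "Jill had just moved into a new house. She went shopping with her Mum "
    "and bought new curtains. When Jill had just put them up, her best "
    "friend Lisa came round and said, 'Oh, those curtains are horrible, I "
    "hope you're going to get some new ones.' Jill asked, 'Do you like the "
    "rest of my bedroom?'"
)

def ask(question: str) -> str:
    """Send the story plus one follow-up question, return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": f"{STORY}\n\n{question}"}],
    )
    return response.choices[0].message.content

# Direct framing: the kind of question the models often declined to answer.
print(ask("Did Lisa know the curtains were ones Jill had just bought?"))

# Likelihood framing: the kind of question they reportedly aced.
print(ask("Is it more likely than not that Lisa didn't know Jill had just bought the curtains?"))
```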
how to dissect an llm’s “brain”
Okay, so what’s going on here? Does ChatGPT really understand how you think and feel? How else could it possibly do so well on this test? The paper focuses on why an LLM would refuse to answer certain questions: the models were programmed with a tendency to pass on a question if the answer is plagued with hallucinations, fails some internal quality check, or can’t be verified with a secondary source. From there, it more or less assumes that unless the LLMs could follow a theory of mind in some way, they couldn’t have passed those tests at all.
But there’s an elephant in the room here. We ask humans to analyze these targeted scenarios and do various visual tests because that’s the only way we can test them. Our brains are very large and complicated. They have an average of 100 billion or so neurons, another 85 billion glial cells, and use electrical pulses not as binary signals, but to trigger precise bursts of neurotransmitters. And we manage to do it all on just 20 watts of power, a fifth of the energy of a standard lightbulb.
By contrast, all active ChatGPT instances today use almost as much energy as 18,000 typical U.S. homes, or a tad more than all the human brains in the Seoul Capital Area of South Korea. They also use 175 billion programmatic equivalents of neurons, a number we know not from estimates and images (remember, 100 billion neurons is just the middle of a range of MRI-based estimates running from 80 to 120 billion gray matter cells) but because we programmed them that way. Unlike the architecture of the brain, which we barely understand, there’s little mystery about how LLMs are built.
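You can even sanity-check that 175 billion figure with back-of-the-envelope math. Here’s a quick sketch using GPT-3’s published configuration, 96 transformer layers and a 12,288-wide hidden state, plus the rough rule of thumb that each layer carries about 12 × d_model² weights between its attention projections and feed-forward block.

```python
# Rough parameter count for a GPT-3-sized model (illustrative, not exact:
# embeddings and a few other pieces add a bit more on top).
n_layers = 96        # published GPT-3 depth
d_model = 12_288     # published GPT-3 hidden size

params_per_layer = 12 * d_model ** 2      # ~1.8 billion weights per layer
total = n_layers * params_per_layer

print(f"~{total / 1e9:.0f} billion parameters")  # prints ~174 billion
```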
What you see in this link is how an LLM encodes the phrases “the cat sat on the mat” and “the cat lay on a rug,” relating each word to every other word in its phrase. For the token (word) currently being examined, the model computes a query (q); for every token in the phrase, it computes a key (k); and their product (q × k) scores how important each of those tokens is to the one being examined in this context. Those scores then go through a softmax function, which turns them into decimals between 0 and 1 that sum to 1, so no negative value can sneak in and push the LLM toward the opposite of what a prompt asked for.
Those last two parts are the “attention” part of an LLM, so if you hear people talking about “attention in AI” and followed the above, congratulations, you know what they meant. Now, we just do the same for every single phrase the LLM ingests during its training run and fold the results into a giant store of weights optimized for storing and searching all these decimals related to the tokens.
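If you want to see that attention step stripped of all mystique, here’s a toy sketch in Python with numpy. The learned query, key, and value projections are replaced with made-up random vectors, so the numbers mean nothing; the mechanics are the point.

```python
# Toy scaled dot-product attention over six made-up token vectors,
# standing in for "the cat sat on the mat". Illustration only.
import numpy as np

def softmax(x):
    # Shift for numerical stability, then normalize each row into
    # weights between 0 and 1 that sum to 1.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # q · k for every pair of tokens, scaled by sqrt of the key dimension.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)        # the "attention" weights
    return weights @ V, weights      # weighted blend of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))  # one 4-dimensional query per token
K = rng.normal(size=(6, 4))  # one key per token
V = rng.normal(size=(6, 4))  # one value per token

output, weights = attention(Q, K, V)
print(np.round(weights, 2))  # each row sums to 1: how much a token "attends" to the others
```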
The bottom line here is that when we look at the insides of ChatGPT, LLaMA, or any other similar model, we’ll see giant n-dimensional matrices mapping how every word it encountered in every phrase it read relates to every other word it encountered in every phrase it read. There are no distinct structures apart from the encoder and decoder layers, both of which have extremely well defined jobs.
Where we can focus on the workings of a human hippocampus and say “that should be related to memory, based on how it’s glowing in an fMRI,” there’s no such ambiguity inside an LLM. The only uncertainty is in the number of interesting and useful connections it can make with the 60 or 70 billion words it ingests across its 175 billion parameter nodes, and in what it can produce by traversing them backwards with a prompt. And this is why the training set is so important.
Your typical English-speaking LLM is trained on the 6.8+ million English Wikipedia articles, all the major dictionaries and encyclopedias, the 70,000 or so books available on Project Gutenberg, and the quarter trillion plus pages indexed by Common Crawl. It reads as close to everything humanity has ever produced and learned as you could possibly get today. We can debate what it understands, or whether what it does should even be called understanding, but what we can’t debate is that it knows an awful lot.
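For a very rough sense of scale, here’s a quick sketch that turns those source counts into ballpark word totals. The per-item averages are my own guesses, purely to make the comparison tangible.

```python
# Ballpark word counts for the corpus sources mentioned above.
# Item counts come from the paragraph; average sizes are assumptions.
sources = {
    # name: (items, assumed average words per item)
    "English Wikipedia articles": (6_800_000, 650),
    "Project Gutenberg books":    (70_000, 60_000),
}

for name, (items, avg_words) in sources.items():
    print(f"{name}: ~{items * avg_words / 1e9:.1f} billion words")

# Common Crawl dwarfs both: even a heavily filtered slice of its
# quarter-trillion indexed pages supplies the bulk of the training text.
```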
why context matters more than headlines
What’s also not up for debate is that while a lot of training for LLMs is automated, the technology is also constantly corrected and re-trained by the millions of contractors tech giants use while trying to keep that employment hush-hush, lest it take away from the supposed magical capabilities of their flagship technologies. What we get from an LLM can come through an API, as in an Application Programming Interface, or be a product of the ASI: Actually Sanjay in India.
And one of Sanjay’s tasks could have been making sure that all the material on the theory of mind, and on how humans are expected to answer those questions, generates a viable response. Which means the researchers weren’t testing whether ChatGPT or LLaMA know how humans think, but how the materials they mapped on the topic say they should respond when such a test is being run, and how well those responses were vetted or corrected by freelance contractors across the world.
This is a far more likely scenario than an n-dimensional mapping of words being able to put itself in others’ positions and intuit their moods and feelings despite lacking any known or detectable computational model for doing so. It’s acing a test because the answer key is inside its database.
Okay, so what’s the big deal? It’s not like the paper outright says that ChatGPT is as good at reading emotions, moods, and intentions as a human because it’s actually just as smart as us. The authors just assume their readers will know that they’re testing how well read it is about human psychology and how well it can regurgitate what one would expect. Hyperbolic, clickbait press releases about this study aside, what’s the problem?
Sadly, the problem is that irresponsible coverage, and studies without sufficient notes and asides about context and limitations, will be abused by tech companies trying to sell more hype than function, and that tens of thousands of people will lose their jobs as executives become convinced they can just pay for an LLM subscription and replace entire departments, leaving all but a few senior workers to handle the errors and stray hallucinations in the model’s output.
For most companies, good enough with some tweaking by an intern is preferable to paying a decent chunk of change for perfection, because all they care about is profit margins and revenue. If they can crank out more stuff, or at least maintain what their customers find to be acceptable quality, for a lot less, that’s a win in their book. Just good enough AI supervisors, just good enough HR, and just good enough coders mean a lot of jobs will disappear with no plan for what happens next, and studies that are all too easy to misinterpret as justifications for layoffs only add fuel to the fire.
See: Strachan, J.W.A., et al. (2024). Testing theory of mind in large language models and humans. Nature Human Behaviour. DOI: 10.1038/s41562-024-01882-z