Old & Busted: Large Language Models. New Hotness: General World Models
Why and How General World Models are Going to Take Over the World
In This Post:
What are General World Models?
How Humans Learn
How LLMs Learn
Why GWMs are So Different
I recently saw a promo video from Runway Research in which the narrator described 'General World Models' (GWMs) as AI models that learn the way her dog, Reuben, does. I'm a sucker for any story with cute dogs in it, so I watched the whole thing (3.5 mins), and it reminded me of my own recent essay "A.I. as a Really Smart Horse." I've had some intuition about GWMs, since they seem like a logical extension of Large Language Models ("LLMs"), but when I searched around for an article that breaks them down, I couldn't really find one, so I decided to write it myself:
What are General World Models?
A "General World Model" in AI is a system designed to understand and predict a wide range of real-world scenarios, not just one specific area, by using multi-modal inputs in a unified way. An AI is given a general idea of the world which it is free to develop and shape on its own, it can adapt and learn from new situations, similar to how humans use their broad knowledge to handle diverse challenges, even those in which they have no prior experience.
Why is this a big deal? More on that below. But briefly, it's a big deal because it allows AI to 'learn' in a way that's much more human than the way the now-ubiquitous LLMs (ChatGPT, Claude, etc.) 'learned' how to do what they do. The recent turn toward consumer-facing, multi-modal AIs (ChatGPT, for example, can now review code, 'see' images, and produce images) suggests GWMs are around the corner, and I'll explain why in a minute. If they are, our world is about to change even more dramatically. It took ChatGPT (an LLM) about a year to basically take over the world. A GWM is going to learn a lot faster, and better, and its utility to humans will be a quantum leap beyond whatever ChatGPT has done so far. First, let's have a look at how humans learn:
How Humans Learn
At a faraway residence, a lifetime ago, I'd often grow frustrated with my UPS driver because they would rumble down a busy adjacent avenue towards my street; when finally at the intersection, they would always turn right, away from my building. It was maddening, since my doorstep was only 50 ft off the intersection. They would drive off to parts unknown, and then mysteriously reappear 10 minutes later from the opposite direction. I had varying theories as to why this was so, and even if I could remember them, I'm sure they would be too embarrassing to list here. Regardless, if I was expecting some package and had been waiting for several days, those extra ten minutes felt like an eternity, and I could never understand why the hell the UPS driver wouldn't just turn down my street so that I could get my package 10 minutes earlier.
I later found out that this was quite by design - UPS designs their routes so that drivers rarely have to turn left. UPS drivers turn left only about 10% of the time. This strategic move by UPS apparently reduces the amount of time their drivers spend in traffic, speeds delivery (overall) and saves tons of money (and gas). So, well done, UPS. And shame on Eric for any petty feelings I may or may not have had in the past towards any hardworking UPS drivers.
My education on the routing of UPS trucks included two very different types of human learning:
Learning by observation. I detected a pattern and developed a hypothesis. I saw these drivers doing something that seemed, to me, irrational. They turned away from me, when they must have known that I really, really wanted my package, so naturally, they must have had something against me.
Learning by research. Well, not exactly research. I read about the UPS ‘no left turn’ protocol in an article, somewhere. I didn't go out searching for the explanation, because I thought I already had it.
The two kinds of learning themselves reveal two things about the nature of humanity.
The first type of learning - observation and hypothesis - needn't be particularly sophisticated. In fact, it's so simple, my dog does it. He gets excited whenever I put on my shoes, because he thinks we might be going to the park. Clearly, his little dog brain is capable of observing my behavior, and developing a hypothesis about what it might mean for his immediate future.
The second type of learning - learning by research - is distinctly human. My dog cannot google things (that I'm aware of) or evaluate his own hypothesis in the company of other dogs. If he could, he might say to his dog friends "when Eric puts on his shoes, we're going to the park" and another dog might reply "when my owner puts on his shoes, he's taking me to the vet." Since his hypotheses are never subjected to inspection by more cynical dogs, he remains happy and excited every time I put on my shoes - even in those cases where I really do take him to the vet.
The difference, I think, is in syntax.
Why Syntax is Important
In the 1970s, a gorilla named Koko gained international stardom by learning sign language. Koko was taught ASL by Dr. Penny Patterson, and gave the world a glimpse into the mind, and feelings, of some of our closest evolutionary relatives. Koko would eventually develop a vocabulary of over 1000 signs and could understand over 2000 human words.
While we could understand Koko, and some of her feelings, and her intentions, she never really learned to communicate at a fully human level. In her early years, Koko was reported to test roughly a year behind a human child of the same age: when she was 4, she had the linguistic capacity of a 3-year-old. However, despite living until age 46, Koko never progressed much beyond that, at least linguistically, as was obvious in her syntax.
When she was thirsty, she wouldn't sign "I'm thirsty, could I have some water?" She would sign "Have give me drink, Thirsty big pour drink." Her meaning was clear, even if her language was imprecise. That'll do if you want some water. It won't do if you're trying to design a building, or draft a constitution.
My early panic about ChatGPT arose out of this observation. Syntax separates us from the animals. All animals communicate, and some animals converse. Some animals, like cows and dolphins, even have regional accents. But humans' use of syntax gives rise to semantics, and that in turn gives rise to many other things. I later found my worry voiced by a much more eloquent writer, Yuval Noah Harari, whose essay in the Economist - "Yuval Noah Harari argues that AI has hacked the operating system of human civilisation" - argues that through the mastery (or fakery) of language, LLMs have granted themselves almost all the capabilities that define being human:
“Language is the stuff almost all human culture is made of. Human rights, for example, aren't inscribed in our DNA. Rather, they are cultural artefacts we created by telling stories and writing laws. Gods aren't physical realities. Rather, they are cultural artefacts we created by inventing myths and writing scriptures.
Money, too, is a cultural artefact. Banknotes are just colourful pieces of paper, and at present more than 90% of money is not even banknotes--it is just digital information in computers. What gives money value is the stories that bankers, finance ministers and cryptocurrency gurus tell us about it. Sam Bankman-Fried, Elizabeth Holmes and Bernie Madoff were not particularly good at creating real value, but they were all extremely capable storytellers.”
A machine that has mastered (or can fake) language, therefore, can master (or fake) religion, politics, finance, and all the rest. Syntax - the precise selection and ordering of words - allows for the construction of meaning, without the use of intonation, body language, pointing, gesturing, examples etc. In other words, it allows for the rise of written language, and therefore the transmission of ideas, concepts, values, and knowledge, across generations, populations, and continents. LLMs may now have that power - time will tell. What we know now is that they didn't learn it in the same way that we (humans) did:
How LLMs Learn
Large Language Models learn syntactically - they're basically giant statistical engines that calculate the next word (or word fragment) based on the context they can immediately observe. An LLM trying to finish the sentence:
When I'm in [[blank]], I enjoy [[blank]] pudding.
has many options for how to complete each blank, and the selection of one can and should inform the other.
If the first blank is filled in with 'the hospital', it might read like this:
When I'm in [[the hospital]], I enjoy [[Jello]] pudding.
If the first blank is filled in with 'London', it might read like this:
When I'm in [[London]], I enjoy [[figgy]] pudding.
This works in both directions. So if the second blank were filled in with 'Jello,' a human might intuit that 'hospital' was the most reasonable guess as to what the first blank should be.
For a machine, though, it's not intuition - it's just a math problem. ChatGPT was trained on a giant corpus of human text (much of the internet), and from that it is able to make a guess about how to complete each sentence, answer questions, and so on.
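To make the "math problem" part concrete, here's a deliberately tiny sketch in Python. It is not how a real transformer works - just frequency counting over a made-up three-sentence corpus - and every sentence and name in it is invented for illustration. It picks the pudding word with the highest estimated probability given a context word:

```python
# A toy sketch of "it's just a math problem": estimate P(completion | context)
# by counting, over a tiny hypothetical corpus, which word follows "enjoy".
from collections import Counter

corpus = [
    "when i'm in the hospital , i enjoy jello pudding",
    "when i'm in the hospital , i enjoy jello pudding",
    "when i'm in london , i enjoy figgy pudding",
]

def completion_counts(context_word):
    """Count which word follows 'enjoy' in sentences that mention context_word."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        if context_word in words and "enjoy" in words:
            counts[words[words.index("enjoy") + 1]] += 1
    return counts

def most_likely_completion(context_word):
    counts = completion_counts(context_word)
    total = sum(counts.values())
    word, count = counts.most_common(1)[0]
    return word, count / total  # the completion and its estimated probability

print(most_likely_completion("hospital"))  # ('jello', 1.0)
print(most_likely_completion("london"))    # ('figgy', 1.0)
```

A real model does this with billions of learned parameters over enormous contexts rather than a lookup over three sentences, but the underlying idea is the same: pick the continuation the training data makes most probable.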
This may seem like an elaborate trick, which is probably why Noam Chomsky referred to ChatGPT as 'glorified autocorrect' - and who knows, maybe he's right. With GWMs, however, we'll move beyond any doubt, into something much, much more human.
Why GWMs are So Different
ChatGPT is a language engine, trained on language. The human brain is a thought engine, trained on sights, smells, language, touch, experiences, and community. GWMs will be closer to the latter.
They'll get there through multi-modality. Multi-modal AIs (ChatGPT, for instance, can now take in code and images as well as text, and produce images in return), and especially 'natively' multi-modal AIs (like Google's Gemini), are the technological platform from which AI can embark on an entirely new kind of learning called 'reward-free' learning. That, in turn, moves us toward general behavioral models.
Reward-Free Learning and General Behavioral Models
GWMs move us significantly closer to 'reward-free' machine learning. What's that? Since researchers began working on AI, they've struggled with how to 'reward' AI systems once they do something properly. Conventional computer code doesn't need reward. You type instructions and it executes. If it fails to execute, it's probably because you wrote the code wrong. Either way, the computer doesn't feel one way or another about the outcome. But if a computer program is to learn, it must sometimes fail, and it must be taught that succeeding is better than failing. The difference between a swan dive and a belly flop might be elementary to a human. Humans who have never belly flopped can observe another human doing so, and understand that it's painful. But why? And how would you explain that to a machine?
The LLMs like ChatGPT that have become so ubiquitous were trained, in part, with rewards: after learning to predict text from a huge corpus, the model is fine-tuned on human feedback. Some human trainer has to give the AI a doggie treat when it writes "When I'm in the hospital, I enjoy Jello pudding," and that treat must be withheld when the AI writes "When I'm in the hospital, I enjoy figgy pudding." Can LLMs learn from their mistakes? Yes. But they need some kind of 'refractor' to tell them which was the mistake, and which was the successful output.
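Here's a toy sketch of what that doggie treat looks like in code. It is vastly simplified from how real systems (e.g. RLHF) actually work, and every function, score, and number below is invented for illustration: imagine the 'model' is just a table of scores over two candidate words, nudged up or down by a human label.

```python
# A toy sketch of reward-based training: a human label (the doggie treat)
# nudges the chosen completion's score up or down. Everything here is hypothetical.

scores = {"jello": 0.0, "figgy": 0.0}  # the "model": one score per candidate word
LEARNING_RATE = 0.5

def human_feedback(prompt, completion):
    """The human trainer's verdict: +1.0 (treat) or -1.0 (no treat)."""
    correct = "jello" if "hospital" in prompt else "figgy"
    return 1.0 if completion == correct else -1.0

def training_step(prompt):
    # The model proposes its current favorite; the reward adjusts that score.
    choice = max(scores, key=scores.get)
    reward = human_feedback(prompt, choice)
    scores[choice] += LEARNING_RATE * reward
    return choice, reward

for _ in range(3):
    print(training_step("When I'm in the hospital, I enjoy ___ pudding."), scores)
```

The point is the dependency, not the arithmetic: without the human_feedback function, this loop has nothing to learn from.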
GWMs introduce an entirely different possible form of learning - one without human assistance. Because GWMs can compare one form of input against another, the varying modes of input can be refractors of one another.
An AI that has a multimodal world model can observe hospital patients (say, via CCTV) throwing away their figgy pudding and enjoying their Jello pudding, and extrapolate from that the idea that Jello pudding is appropriate in hospitals, while figgy pudding is not. In this case, none of the disgusted patients has to tell the AI (with words, or with code) that figgy pudding is the wrong answer and Jello pudding is more appropriate. The AI can assess as much by observing the patients' behavior. Later, when asked to complete the sentence "When I'm in the hospital, I enjoy [[blank]] pudding," it can successfully complete it with the word 'Jello.'
It can simultaneously observe the time of year, and infer that in London, around the holidays, figgy pudding is the most appropriate answer - finishing this sentence:
"When I'm in [[the holiday spirit]], I enjoy [[blank]] pudding."
with the word 'figgy' but only if given the appropriate contextual clues.
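Here's a minimal sketch of that reward-free loop, assuming an entirely hypothetical stream of (context, pudding, observed outcome) records from the camera feed. The patients' behavior - a second modality - supervises the language side; no human labels any output as right or wrong.

```python
# A toy sketch of observation-as-feedback: what the camera sees patients do
# decides which completion wins. All of the records below are hypothetical.
from collections import defaultdict, Counter

observations = [
    ("hospital", "figgy", "discarded"),
    ("hospital", "jello", "eaten"),
    ("hospital", "jello", "eaten"),
    ("holiday", "figgy", "eaten"),
    ("holiday", "jello", "discarded"),
]

# Tally which pudding is actually enjoyed in each context
enjoyed = defaultdict(Counter)
for context, pudding, outcome in observations:
    if outcome == "eaten":
        enjoyed[context][pudding] += 1

def preferred_pudding(context):
    """Fill the blank in "When I'm in <context>, I enjoy ___ pudding." from behavior alone."""
    return enjoyed[context].most_common(1)[0][0]

print(preferred_pudding("hospital"))  # jello
print(preferred_pudding("holiday"))   # figgy
```

One modality (observed behavior) acts as the 'refractor' for another (language), which is the whole trick.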
General Behavioral Models
Once an AI has built a GWM, it necessarily creates a General Behavioral Model, which can then be tested in partnership with humans. Instead of identifying the syntactic relationship between written words, it can identify the relationship between actions, occurrences, speech, text, and behaviors. The same way that my dog gets excited every time I put on my shoes, it will be able to associate one kind of behavior with another, merely through observation. Unlike my dog, it will be able to ask, in conversation, what the hell is going on: "I see you are putting on your shoes . . . Are we going to the park?" to which I can answer "No, when I put on my shoes and put on my tool belt, I am going outside to fix something on the house." For that matter, it needn't require any human correction at all, provided it can fully observe me:
Eric + Shoes + Leash = Park
Eric + Shoes + Toolbelt - Leash = Going outside to fix something.
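A behavioral model like that can be sketched the same way - here as a simple association table over sets of observed cues, with every observation invented for illustration. When the model has never seen a particular combination of cues before, it has nothing to predict, which is exactly the moment it would ask the question out loud.

```python
# A toy sketch of a general behavioral model as an association table:
# sets of observed cues map to counts of what actually happened next.
# All cues and outcomes here are hypothetical.
from collections import defaultdict, Counter

history = defaultdict(Counter)  # frozenset of cues -> Counter of observed outcomes

def observe(cues, outcome):
    history[frozenset(cues)][outcome] += 1

def predict(cues):
    """Most frequently observed outcome for this exact cue set, or None if unseen."""
    outcomes = history.get(frozenset(cues))
    return outcomes.most_common(1)[0][0] if outcomes else None

# Associations accumulated just by watching
observe({"eric", "shoes", "leash"}, "park")
observe({"eric", "shoes", "leash"}, "park")
observe({"eric", "shoes", "toolbelt"}, "fix something outside")

print(predict({"eric", "shoes", "leash"}))     # park
print(predict({"eric", "shoes", "toolbelt"}))  # fix something outside
print(predict({"eric", "shoes"}))              # None -> "Are we going to the park?"
```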
Having an AI that understands what we are trying to say has been groundbreaking. Having an AI that understands what we are trying to do will be even more so, as it undoubtedly will have some ideas on how to help us with whatever we're doing.
Hopefully you now understand why the narrator from the Runway Research video opined that GWMs might learn like a dog. But I think she put it that way because it's less threatening than the truth: GWMs will actually learn like a human. A huge amount of our human learning occurs pre-verbally. Between birth and roughly 18 months, we lack the ability to understand or produce language, but we're busy building a model of the world. We learn through touch, sound, vision, and smell, without ever having the ability to put words to things. That comes later. We learn the important things:
Mom = Friendly
Dog = Bitey
Anything that fits in my mouth = food
We are assisted in this task by older, wiser humans (our parents or caregivers) who hopefully establish the guardrails that keep any of our 'learning' from becoming fatal (e.g. don't play in the street). Through observations, experimentation, trial and error, we build a mental model of the world - how it works, and what should be done about it. Hopefully, we never stop learning, and therefore never stop refining that model. We (humans, but especially architects) subsequently design and build the world according to our models of it. The world we build informs the world model that our children develop. In a way, what's happening now, technologically, is no different. AIs will start building their own models of the world, based on the world that we have already created. Our role now is to create parental guardrails and steer that effort so that their world model reflects our highest aspirations, and learns from all our greatest mistakes.