As large language models advance in processing language, their boundaries are becoming clearer: they excel at writing, searching, editing, and programming, but struggle with challenges involving three-dimensional space, temporal evolution, and physical constraints. Morgan Stanley is betting that the next wave of growth will come from "world models"—AI systems that learn to understand, simulate, and make decisions within environments. Applications extend beyond robotics and autonomous driving to reshape digital content industries such as gaming, design, and film production.
According to Morgan Stanley's North American equity analyst Adam Jonas, "AI is moving beyond language toward models that understand, simulate and navigate the physical world." This implies that the next phase of competition will not focus on mimicking human conversation, but on compressing real-world rules into usable internal representations and turning them into interactive "imagination engines."
The report cites concrete engineering practices already underway: Waymo has used world models based on DeepMind’s Genie 3 to conduct billions of miles of virtual road testing; Microsoft employed Muse to create a fully AI-rendered, playable version of the 1997 game Quake II; and Roblox has disclosed research into using proprietary world models to generate immersive environments and iterate game designs via natural language. Established players—including DeepMind, Meta, Microsoft, Tesla, and NVIDIA—are investing heavily, while startups are also attracting talent and capital.
Notably, Morgan Stanley highlights two emerging companies: Fei-Fei Li’s World Labs, which focuses on generating navigable 3D worlds, and Yann LeCun’s AMI Labs, which emphasizes learning efficient latent representations for prediction and reasoning. Both approaches address the same core question: How should AI "understand the world," and when will this understanding evolve from demos to practical productivity?
From Language to Physics: World Models Address LLMs' Key Limitations

The report describes the physical world as a more demanding arena, governed by constraints such as matter, thermodynamics, fluid dynamics, and lighting, all operating within a dynamic 3D space. While LLMs, trained primarily on text and its derivatives, perform well in white-collar tasks like coding, searching, and writing, they lack the ability to maintain a consistent representation of an environment and reason about it over time—particularly for questions like "What happens next?" or "What are the consequences of this action?"
World models are defined as "internally usable representations of environments." They must not only reproduce what is observed but also roll forward states and generate different future branches based on changing action conditions—an ability the report refers to as AI’s "imagination engine."
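The two capilities named above—rolling states forward and branching futures under different actions—can be sketched as a minimal interface. Everything here is illustrative: `ToyWorldModel` and its trivial linear dynamics are hypothetical stand-ins for a learned neural predictor, not any lab's actual system.

```python
class ToyWorldModel:
    """Hypothetical sketch of a world model's core contract: given an
    internal state and an action, predict the next state."""

    def predict(self, state, action):
        # A real model would be a learned network; this trivial
        # (position, velocity) dynamic is purely for illustration.
        position, velocity = state
        velocity += {"accelerate": 1.0, "brake": -1.0, "coast": 0.0}[action]
        return (position + velocity, velocity)

    def imagine(self, state, action_sequences):
        """Branch several futures from one state -- the 'imagination
        engine': evaluate candidate plans without acting in the world."""
        branches = []
        for actions in action_sequences:
            s = state
            for a in actions:
                s = self.predict(s, a)  # roll the state forward
            branches.append(s)
        return branches

model = ToyWorldModel()
# Two candidate plans from the same start state produce two futures.
futures = model.imagine((0.0, 0.0),
                        [["accelerate", "accelerate"],
                         ["accelerate", "brake"]])
```

The key design point is that `imagine` never touches a real environment: all consequences are generated internally, which is what makes the "different future branches" in the definition cheap to explore.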
World Models Are Not Monolithic: Five Parallel Approaches

Morgan Stanley categorizes current approaches into several overlapping groups:
1. Interactive, action-conditioned world models: function like "learned game engines," where environments change in real time based on agent actions (e.g., DeepMind Genie).
2. Consistent 3D world generators: emphasize geometric consistency and multi-perspective exploration (e.g., World Labs' Marble).
3. Abstract representation/non-generative models: prioritize predicting high-level latent structures and dynamics over pixel-level generation, favoring efficiency and reasoning (e.g., Meta's V-JEPA, AMI Labs).
4. Predictive generative world models: focus on forecasting next frames or states for planning, prediction, and driving inference (e.g., Wayve's GAIA, NVIDIA Cosmos Predict).
5. Physically constrained simulation data engines: combine world models with simulation/physics engines and data pipelines to produce physically consistent synthetic data for robotics training (e.g., NVIDIA Cosmos Transfer).
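The sharpest divide in this taxonomy is what gets predicted: generative approaches (groups 1 and 4) output the next observation in pixel space, while abstract approaches (group 3) output only a compact latent state. A toy contrast—illustrative only, with random weights standing in for trained networks, not any lab's actual architecture—shows why the output dimensionality differs so much:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative step: predict the next observation in full pixel space.
# The output has the dimensionality of a frame -- renderable but costly.
def generative_step(frame, action_vec, W):
    return np.tanh(W @ np.concatenate([frame, action_vec]))

# Abstract (JEPA-style) step: predict only the next *latent* state.
# Far smaller output -- aimed at reasoning and planning, not rendering.
def latent_step(z, action_vec, V):
    return np.tanh(V @ np.concatenate([z, action_vec]))

frame = rng.standard_normal(64 * 64)   # flattened 64x64 frame (4096 dims)
z = rng.standard_normal(32)            # 32-dim latent state
action = rng.standard_normal(4)        # 4-dim action encoding

# Random stand-in weights sized to each output space.
W = rng.standard_normal((frame.size, frame.size + action.size)) * 0.01
V = rng.standard_normal((z.size, z.size + action.size)) * 0.1

next_frame = generative_step(frame, action, W)   # 4096-dim prediction
next_z = latent_step(z, action, V)               # 32-dim prediction
```

Even in this toy, the generative step must produce two orders of magnitude more values per tick, which is one intuition behind the report's note that the approaches diverge in computational demands.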
This classification underscores that, despite the shared label, world models vary widely in goals—from generating explorable worlds to compressing reality into computable states—leading to differences in product forms, computational demands, and commercialization paths.
Gaming and Content Production: High Potential, Gradual Adoption

Gaming is highlighted as the most intuitive application. World models could generate interactive environments from minimal prompts, potentially accelerating content production by orders of magnitude. Microsoft's playable Quake II demo, which bypasses traditional frame-by-frame rendering by predicting frames based on player input, serves as a powerful example.
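The "predicting frames based on player input" loop can be sketched schematically. This is a hypothetical toy, not Muse's architecture: `predict_next_frame` stands in for a learned neural predictor, and a `Frame` here is reduced to two numbers rather than pixels.

```python
from dataclasses import dataclass

# Toy stand-in for a rendered frame: a real model predicts pixels;
# here we track only a step counter and a player position.
@dataclass
class Frame:
    step: int
    player_x: int

def predict_next_frame(history, player_input):
    """Stand-in for a learned predictor: produce the next frame from
    recent frames plus the player's controller input -- no game engine,
    no geometry, no rasterization."""
    last = history[-1]
    dx = {"left": -1, "right": 1, "none": 0}[player_input]
    return Frame(step=last.step + 1, player_x=last.player_x + dx)

# The core loop of an AI-rendered game: feed each predicted frame
# back in as context, conditioned on live input.
history = [Frame(step=0, player_x=0)]
for key in ["right", "right", "left"]:
    history.append(predict_next_frame(history, key))
```

The structural point is the feedback loop: each output becomes input for the next prediction, which is also why the error-accumulation problems discussed later in the report bite hardest in exactly this setting.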
However, Morgan Stanley’s video game analysts caution that adoption may be gradual. Long-term scenarios include incumbents integrating AI into existing toolchains or being disrupted by new paradigms. While generating playable worlds via natural language is already feasible, challenges remain in areas like computational speed, cost, meta-systems, latency, determinism, memory, and updates—issues that may prove fundamental to the world model paradigm. This suggests short-term constraints for incumbents but genuine long-term threats.
Autonomous Driving and Robotics: Practical Applications in Virtual Testing

In autonomous driving, world models enable large-scale virtual testing of dangerous, rare, or expensive edge cases. Waymo's use of Genie 3 for billions of miles of virtual driving tests helps train and validate system performance in scenarios difficult or risky to encounter in real-world conditions.
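A back-of-envelope calculation shows why rare edge cases motivate virtual testing. The event rate below is an assumed figure for illustration, not a number from the report:

```python
# Assumed, illustrative figure: one hazardous edge-case event per
# million miles of ordinary driving.
EVENT_RATE_PER_MILE = 1e-6

real_miles = 100_000
expected_real_events = EVENT_RATE_PER_MILE * real_miles  # ~0.1 events

# At 100k real miles you would most likely observe zero such events.
# In simulation, the scenario is generated on demand, as many times
# and in as many variations as training and validation require.
simulated_events = [
    {"scenario": "occluded_pedestrian", "variation": i}
    for i in range(10_000)
]
```

This is the asymmetry the report points to: real-world exposure to rare hazards scales with miles driven, while simulated exposure scales with compute.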
For robotics, world models address two key challenges: training data volume and pre-execution reasoning. Studies show that robots trained on world model-generated data can perform comparably to those trained on real interaction data. Still, Morgan Stanley notes that, in the near term, world models and synthetic data will likely complement rather than replace real-world data pipelines.
Critical hurdles involve fine physical details like contact and friction—subtle forces from fingers, actuator wear, surface friction, material properties, and joint static friction—which can cause significant discrepancies between simulation and reality.
Key Challenges: Long-Term Stability and Controllability

The report outlines major obstacles:
- Error accumulation and temporal drift: longer interactions increase risks of object drift, geometric deformation, and deviations from physical laws. Even advanced models like Genie 3 currently support only minutes of continuous interaction.
- Limited controllability: rich visuals offer limited value if action spaces are restricted to basic movements.
- Multi-agent and social dynamics: interactions among multiple entities are far more complex than single-agent navigation.
- Data scale and diversity: real sensor data collection remains expensive and slow, especially in robotics.
- Lack of unified benchmarks: long-term interaction quality lacks standardized metrics, relying instead on demos and task-based evaluations.
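The error-accumulation obstacle has a simple compounding structure worth making explicit: if each autoregressive step preserves only a fraction of the true signal, fidelity decays geometrically with rollout length, so a per-step accuracy that sounds excellent still collapses over minutes. The numbers below are illustrative, not measurements of any model:

```python
def rollout_fidelity(per_step_fidelity: float, steps: int) -> float:
    """Fidelity remaining after `steps` autoregressive predictions,
    assuming (simplistically) independent multiplicative decay."""
    return per_step_fidelity ** steps

# 99.9% per-frame fidelity sounds strong -- but at 30 predicted
# frames per second the compounding is brutal:
one_second = rollout_fidelity(0.999, 30)        # ~0.97
one_minute = rollout_fidelity(0.999, 30 * 60)   # ~0.17
```

This geometry is why the report treats minutes-long coherent interaction as a frontier result rather than a solved problem: stretching rollouts further requires driving per-step error toward zero, not merely improving it incrementally.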
These constraints suggest that world models will likely first proliferate in fault-tolerant, fast-iterating digital content domains before gradually penetrating industries requiring strict physical consistency.
Fei-Fei Li’s Bet: Enabling AI to "See" 3D Space

World Labs, founded by Fei-Fei Li and team in 2023, represents the "consistent 3D world generation" approach. Its flagship product, Marble, launched in November 2025, aims to generate persistent, explorable 3D environments from text, images, short videos, or rough 3D inputs, supporting editing and expansion.
The platform functions as a production-oriented toolkit, allowing object modification, coarse-to-detailed modeling via "Chisel," selective expansion, composition of multiple worlds into larger scenes, export to external 3D software/engines, and developer APIs. It also integrates with industry tools like Unreal Engine and Unity, connects with simulation platforms such as NVIDIA Isaac Sim, and demonstrates use cases in architectural design and robotics simulation.
Citing PitchBook data, the report notes World Labs has raised approximately $1.29 billion, with a post-money valuation of around $5.4 billion after a February 2026 funding round.
Yann LeCun’s Alternative: Predicting Structure, Not Rendering Pixels

AMI Labs, emerging from stealth in March 2026 with involvement from Yann LeCun, follows a research-oriented path aligned with the JEPA framework. Instead of reconstructing pixels, it predicts latent representations of occluded or future states, learning world dynamics through abstract structures. Morgan Stanley classifies it under "abstract representation/non-generative models," highlighting potential in reasoning, planning, and physical AI systems like robotics.
While specific product details are scarce, possible applications include robotics, autonomous driving, video understanding/analysis, and camera-equipped AR/VR and smart assistants. AMI Labs debuted with over $1 billion in seed funding and a post-money valuation exceeding $4.5 billion.
Capital and Talent Converge: Competition in Spatial Intelligence Intensifies

Beyond technical specs or demos, Morgan Stanley’s report signals a shifting landscape: world models are becoming the "common language" of AI’s next phase, embraced by giants and startups alike. They explain potential productivity leaps in gaming, film, and design, as well as the migration of training, validation, and planning for autonomous systems and robotics into virtual environments.
World models are not plug-and-play solutions. The report concludes with a roadmap: viable applications are emerging, but fundamental challenges—long-term stability, controllability, multi-agent dynamics, physical detail, and evaluation frameworks—remain. The key differentiator will be which players can engineer closed-loop solutions to these hard problems, determining how far the journey from digital to physical reality can go.