
AI Model Intuitively Grasps How the Physical World Works

Originally published on December 8, 2025
Summary

– Meta’s V-JEPA is an AI model that learns about the world by watching videos and demonstrates a form of “surprise” when events contradict its learned expectations.
– The model develops an intuitive understanding of how the world works without being pre-programmed with any assumptions about physics.
– Unlike many AI systems, V-JEPA does not process videos in “pixel space,” where every pixel is treated as equally important.
– Pixel-space models have limitations, as they can become distracted by irrelevant details and miss critical information in a scene.
– Researchers find the model’s ability to learn intuitive concepts, similar to a child’s development of object permanence, to be a plausible and noteworthy advance.

The ability to understand how objects persist and interact is a fundamental human skill, and new research shows that artificial intelligence is beginning to develop a similar intuitive grasp of the physical world. By learning from video alone, without any pre-programmed rules, an advanced AI model can now demonstrate a sense of surprise when events violate its learned expectations, mirroring the developmental milestones observed in young children. This breakthrough moves beyond systems that simply classify pixels and toward models that build abstract, conceptual understandings of their environment.

Consider a simple test often used with infants. If you show a baby a glass of water, hide it behind a board, and then sweep the board through the space the glass should occupy, as if it had passed straight through, a one-year-old will typically show surprise: they have developed an intuitive understanding of object permanence. Researchers have now created an AI system that learns a comparable form of common-sense physics purely by watching videos. The model, known as Video Joint Embedding Predictive Architecture (V-JEPA), starts with no assumptions about how the world works. Through observation, it builds an internal model that allows it to predict what should happen next. When shown a video where an object seems to disappear or pass through another, the model’s response indicates it finds the event unexpected or implausible based on its accumulated knowledge.
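
The article does not describe how this “surprise” is measured, but the natural reading is prediction error: the distance between what the model expected, in its internal representation space, and what it actually encoded from the next frames. The toy sketch below illustrates that idea; the `encoder`, `predictor`, and hand-built centroid feature are hypothetical stand-ins for learned networks, not V-JEPA’s actual components.

```python
import numpy as np

def encoder(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a learned encoder: maps raw pixels to an abstract
    state. This toy extracts (presence, centroid row, centroid column)
    of a single bright object; a real encoder is learned end to end."""
    if frame.sum() == 0:
        return np.zeros(3)                      # empty scene
    rows, cols = np.nonzero(frame)
    return np.array([1.0, rows.mean(), cols.mean()])

def predictor(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """Stand-in for a learned predictor: assumes roughly constant
    velocity in the abstract state space."""
    return curr + (curr - prev)

def surprise(predicted: np.ndarray, observed: np.ndarray) -> float:
    """Surprise as prediction error, measured in representation
    space rather than pixel space."""
    return float(np.linalg.norm(predicted - observed))

def frame_with_object(col: int) -> np.ndarray:
    """A 32x32 frame containing one 4x4 bright square."""
    frame = np.zeros((32, 32))
    frame[10:14, col:col + 4] = 1.0
    return frame

# Plausible clip: the object drifts smoothly to the right.
states = [encoder(frame_with_object(c)) for c in (2, 4)]
for col in (6, 8):
    predicted = predictor(states[-2], states[-1])
    observed = encoder(frame_with_object(col))
    print(f"smooth motion:   surprise = {surprise(predicted, observed):.2f}")
    states.append(observed)

# Implausible event: the object suddenly vanishes mid-clip.
predicted = predictor(states[-2], states[-1])
observed = encoder(np.zeros((32, 32)))
print(f"object vanishes: surprise = {surprise(predicted, observed):.2f}")
```

Run as written, the smooth clip yields near-zero surprise while the vanishing object produces a large spike, qualitatively matching the behavior reported for the model.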

This approach represents a significant shift from traditional computer vision models. Most systems designed to interpret video work in what’s termed “pixel space,” treating every single pixel with equal importance. This method has clear limitations. A model analyzing a street scene, for example, might fixate on the flickering of leaves on a tree while missing the crucial change of a traffic light from red to green. The pixel-level details are often irrelevant to the higher-level understanding of the scene. The V-JEPA model avoids this pitfall by learning to focus on abstract representations of objects and their relationships, filtering out unnecessary visual noise. This allows it to form more robust and generalizable concepts about how elements in a video should behave.
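
A toy comparison makes the pixel-space limitation concrete. In the sketch below, which is purely illustrative and not drawn from the research, the average pixel difference between frames is dominated by flickering foliage, while a simple abstract feature isolates the one change that matters; `street_frame` and `light_state` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def street_frame(light_is_green: bool) -> np.ndarray:
    """Toy 64x64 street scene: a patch of 'leaves' that flickers
    randomly from frame to frame, plus a small traffic light."""
    frame = np.zeros((64, 64))
    frame[16:48, 16:48] = rng.random((32, 32))    # flickering foliage
    frame[2:4, 2:4] = 1.0 if light_is_green else 0.2
    return frame

def light_state(frame: np.ndarray) -> float:
    """Toy 'abstract' feature: the traffic-light region's brightness.
    A learned encoder would have to discover such features itself."""
    return float(frame[2:4, 2:4].mean())

red_a = street_frame(light_is_green=False)
red_b = street_frame(light_is_green=False)   # same light, new flicker
green = street_frame(light_is_green=True)    # the light changed

# Pixel space: flicker dominates, and the light change barely registers.
print(np.abs(red_b - red_a).mean())   # large "change", yet nothing happened
print(np.abs(green - red_b).mean())   # nearly identical, yet the light changed

# Representation space: the irrelevant flicker is filtered out.
print(abs(light_state(red_b) - light_state(red_a)))   # ~0.0
print(abs(light_state(green) - light_state(red_b)))   # ~0.8
```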

Experts in cognitive science and AI find the results compelling. The model’s ability to develop an intuitive, predictive understanding without explicit supervision suggests a path toward more sophisticated and reliable artificial intelligence. The core achievement is that the system learns to fill in missing information in a video, predicting what happens in masked or obscured portions of a scene based on context. This requires building a mental model of object permanence and basic physical interactions. While still a long way from human cognition, this research points toward AI systems that can learn about the world in a more human-like, efficient manner, potentially leading to advancements in robotics, autonomous vehicles, and other technologies that require a nuanced understanding of a dynamic physical environment.
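
The “filling in missing information” objective can be sketched compactly. The snippet below is a minimal, hypothetical illustration of a JEPA-style training step, written in PyTorch with toy linear layers standing in for the real encoders and predictor; the dimensions, masking scheme, and pooling are invented for the example. The point it demonstrates is that the prediction loss is computed between embeddings, never between pixels.

```python
import torch
import torch.nn as nn

# Toy dimensions; real models use transformer encoders over many patches.
PATCHES, PATCH_DIM, EMBED_DIM = 16, 48, 32

context_encoder = nn.Linear(PATCH_DIM, EMBED_DIM)   # sees visible patches
target_encoder = nn.Linear(PATCH_DIM, EMBED_DIM)    # sees the full clip
predictor = nn.Linear(EMBED_DIM, EMBED_DIM)         # fills in the blanks

patches = torch.randn(PATCHES, PATCH_DIM)           # flattened video patches
mask = torch.zeros(PATCHES, dtype=torch.bool)
mask[5:9] = True                                    # hide a contiguous block

with torch.no_grad():
    targets = target_encoder(patches)               # embeddings of every patch

visible = context_encoder(patches[~mask])           # encode only what is visible
context = visible.mean(dim=0)                       # crude pooled summary
predicted = predictor(context).expand(int(mask.sum()), -1)

# The loss lives in representation space, not pixel space: the model is
# never asked to repaint pixels, only to predict abstract features.
loss = nn.functional.mse_loss(predicted, targets[mask])
loss.backward()
print(loss.item())
```

In a pixel-space alternative, the loss would instead compare reconstructed frames to raw video, which is precisely what forces such models to spend capacity on irrelevant detail.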

(Source: Wired)
