GeoPhys: The Geometry of Physical Plausibility

¹Bielefeld University   ²University of Oxford   ³Cold Spring Harbor Laboratory   ⁴MBZUAI   ⁵Independent   ⁶Honda Research Institute EU   ⁷New York University   ⁸Flatiron Institute, Simons Foundation

Abstract

While humans can identify physically implausible events within milliseconds, machine learning approaches addressing the same problem are extremely slow and expensive. They either rely on external multimodal-LLM judges or require ad-hoc modifications to the training procedure. In this work, we argue that indicators of physical plausibility are implicitly captured by five geometric properties of the per-frame embeddings produced by frozen image encoders. In aggregate, we call them GeoPhys.

TL;DR: Physical plausibility is a geometric property of feature trajectories through frozen image encoders. No video pretraining, no physics supervision, no learned ranker. GeoPhys sets a new detection SoTA and steers video generation as a cheap verifier.

98.3%
Physics-violation detection on LikePhys (OR ensemble). GeoPhys beats all twelve SoTA diffusion models.
93.3%
Detection on IntPhys2 (OR ensemble), within 3.1 points of human performance (96.4%). The best prior baselines (V-JEPA 2, GPT-4o, and Gemini) sit near chance.
50.0 to 64.5
PhysicsIQ via best-of-N on MAGI-1 24B, at 4.65× lower memory than a world-model verifier.

The Setup

A frozen backbone maps each frame to a pooled feature; stacking them across time gives a trajectory. Plausible videos trace smooth, locally linear paths. Violations bend.
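As a rough illustration of the setup, the sketch below stacks pooled per-frame features into a trajectory. The encoder here is a toy stand-in (a fixed random projection of image patches), and both `encode_tokens` and mean pooling are assumptions made for illustration; the page does not specify the pooling used with the real backbones.

```python
import numpy as np

def pooled_trajectory(frames, encode_tokens):
    """Stack mean-pooled per-frame features into a (T, D) trajectory.

    `encode_tokens` is a hypothetical callable mapping one frame to a
    (num_tokens, D) array; a real frozen backbone would take its place.
    """
    feats = [np.asarray(encode_tokens(f), dtype=np.float64).mean(axis=0)
             for f in frames]
    return np.stack(feats, axis=0)

# Toy stand-in for a frozen backbone: fixed random projection of 4x4 patches.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))

def toy_encoder(frame):
    h, w = frame.shape
    patches = (frame.reshape(h // 4, 4, w // 4, 4)
                    .transpose(0, 2, 1, 3)
                    .reshape(-1, 16))
    return patches @ W  # (num_patches, 8)

video = rng.random((10, 16, 16))   # 10 grayscale frames of 16x16 pixels
traj = pooled_trajectory(video, toy_encoder)
print(traj.shape)  # (10, 8)
```

Every signal described later operates only on an array of this shape, so the backbone can be swapped without touching the downstream geometry.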

Method walkthrough

The five signals. Each measures one way a trajectory can betray a physics violation, computed directly on the per-frame features with no learned parameters.

Figure 1. GeoPhys signals of plausible vs. violated dynamics in frozen feature space. A frozen backbone maps each frame to a pooled feature, tracing a trajectory. Plausible videos stay smooth; violations bend and jump, read out by the five signals below.

Speed variation

How far the feature point travels between consecutive frames. Steady motion keeps these steps even; an object that teleports, vanishes, or stutters makes the step sizes spike.
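A minimal sketch of this signal, assuming the trajectory is a `(T, D)` NumPy array. Summarising the step sizes by their coefficient of variation is an assumption; the page only says that the statistic runs over time and that larger means less regular.

```python
import numpy as np

def speed_variation(traj):
    """Coefficient of variation of step lengths along a (T, D) trajectory."""
    steps = np.diff(traj, axis=0)               # (T-1, D) step vectors
    speeds = np.linalg.norm(steps, axis=1)      # (T-1,) step lengths
    return speeds.std() / (speeds.mean() + 1e-12)

t = np.linspace(0.0, 1.0, 20)[:, None]
smooth = np.hstack([t, 2.0 * t])    # constant-velocity path
jumpy = smooth.copy()
jumpy[10:] += 5.0                   # the point "teleports" halfway through

print(speed_variation(smooth) < speed_variation(jumpy))  # True
```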

Curvature

How sharply the path turns at each step. Good representations keep natural motion locally straight; a wall-pass or sudden reversal bends the trajectory.

Angle consistency

Whether those turns are uniform or erratic. A consistent path bends by similar amounts each step; violations scatter the turning angles.
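Curvature and angle consistency can both be read off the same sequence of turning angles. The sketch below assumes a `(T, D)` trajectory and uses the mean angle for curvature and the standard deviation of angles for (in)consistency; those two aggregates are illustrative choices, not confirmed by the page.

```python
import numpy as np

def turning_angles(traj, eps=1e-12):
    """Angle (radians) between consecutive step vectors of a (T, D) trajectory."""
    v = np.diff(traj, axis=0)
    a, b = v[:-1], v[1:]
    denom = np.maximum(np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1), eps)
    cos = np.sum(a * b, axis=1) / denom
    return np.arccos(np.clip(cos, -1.0, 1.0))

def curvature(traj):
    return turning_angles(traj).mean()

def angle_inconsistency(traj):
    return turning_angles(traj).std()

forward = np.outer(np.arange(6.0), [1.0, 0.0])        # straight line
reversal = np.vstack([forward, forward[-2::-1]])      # out and straight back
phi = np.linspace(0.0, np.pi, 20)
arc = np.stack([np.cos(phi), np.sin(phi)], axis=1)    # uniform turning

print(curvature(forward) < curvature(reversal))              # True
print(angle_inconsistency(arc) < angle_inconsistency(reversal))  # True
```

The arc shows why the two signals are kept separate: it has non-zero curvature but nearly perfectly consistent angles, while the reversal has one isolated sharp turn.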

Acceleration

The change in the step vector itself, the second difference. It catches abrupt shifts in speed or direction that smooth dynamics avoid.
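The second difference is a one-liner in NumPy. This sketch averages its norm over time, which is an assumed aggregate; a maximum or another statistic would fit the description equally well.

```python
import numpy as np

def acceleration(traj):
    """Mean norm of the second difference of a (T, D) trajectory."""
    acc = np.diff(traj, n=2, axis=0)            # (T-2, D): v[t+1] - v[t]
    return np.linalg.norm(acc, axis=1).mean()

t = np.linspace(0.0, 1.0, 20)[:, None]
smooth = np.hstack([t, 2.0 * t])    # constant velocity, zero acceleration
jumpy = smooth.copy()
jumpy[10:] += 5.0                   # abrupt shift in position

print(acceleration(smooth) < acceleration(jumpy))  # True
```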

Prediction residual

How far the next point strays from a linear predictor fit on the recent past. A violation breaks the local low-dimensional structure and the residual jumps.
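One simple reading of this signal is sketched below: fit a straight line in time to the last few points by least squares, extrapolate one step, and measure the miss. The window length of 4 and the line-in-time predictor are assumptions; the page says only that the predictor is linear and fit on the recent past.

```python
import numpy as np

def prediction_residuals(traj, window=4):
    """Norm of the one-step extrapolation error of a local linear fit."""
    T = traj.shape[0]
    tau = np.arange(window, dtype=np.float64)
    A = np.stack([np.ones(window), tau], axis=1)   # design matrix for z = a + b*tau
    nxt = np.array([1.0, float(window)])           # query row for the next time step
    res = []
    for t in range(window, T):
        coef, *_ = np.linalg.lstsq(A, traj[t - window:t], rcond=None)  # (2, D)
        res.append(np.linalg.norm(traj[t] - nxt @ coef))
    return np.array(res)

t = np.linspace(0.0, 1.0, 20)[:, None]
smooth = np.hstack([t, 2.0 * t])
jumpy = smooth.copy()
jumpy[10:] += 5.0

print(prediction_residuals(smooth).max() < prediction_residuals(jumpy).max())  # True
```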

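The signals are computed on the trajectory $z_1, \dots, z_T$, with step $v_t = z_{t+1} - z_t$, turning angle $\theta_t = \arccos\big(\langle v_t, v_{t+1}\rangle / (\|v_t\|\,\|v_{t+1}\|)\big)$, acceleration $a_t = v_{t+1} - v_t$, and residual $r_t = \|z_{t+1} - \hat{z}_{t+1}\|$ of a linear predictor $\hat{z}_{t+1}$ fit on the recent past $z_{\le t}$. All statistics run over $t$; larger values mean less regular dynamics.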

Four frozen backbones. The score runs on self-supervised transformers and explicit models of primate visual cortex. None is trained on video or physics. For each backbone we pick the readout layer that maximises the plausible-versus-violated curvature gap on a held-out split. Backbones are complementary: each specialises in a different signal and wins a different physics domain.
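The layer-selection rule can be sketched as follows, assuming trajectories have already been extracted per layer for a held-out split. The layer names, the dictionary layout, and the use of mean turning angle as the curvature statistic are all hypothetical choices for illustration.

```python
import numpy as np

def mean_turning_angle(traj, eps=1e-12):
    v = np.diff(traj, axis=0)
    a, b = v[:-1], v[1:]
    denom = np.maximum(np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1), eps)
    cos = np.sum(a * b, axis=1) / denom
    return np.arccos(np.clip(cos, -1.0, 1.0)).mean()

def pick_readout_layer(plausible, violated):
    """Return the layer with the largest violated-minus-plausible curvature gap.

    Both arguments map a layer name to a list of (T, D) held-out trajectories.
    """
    gaps = {}
    for layer in plausible:
        viol = np.mean([mean_turning_angle(z) for z in violated[layer]])
        plaus = np.mean([mean_turning_angle(z) for z in plausible[layer]])
        gaps[layer] = float(viol - plaus)
    return max(gaps, key=gaps.get), gaps

# Toy data: only the "late" layer separates the two conditions.
line = np.outer(np.arange(8.0), [1.0, 0.5])
kinked = line.copy()
kinked[5:, 1] = kinked[4, 1] - 0.5 * np.arange(1, 4)   # path bends at frame 4

best, gaps = pick_readout_layer(
    plausible={"early": [line], "late": [line]},
    violated={"early": [line], "late": [kinked]},
)
print(best)  # late
```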

[Figure 2 of the paper: GeoPhys pipeline]
Pipeline. A single video feeds each backbone unchanged. The pooled per-frame embedding is stacked across time into a trajectory, on which the five signals operate.

It tracks the brain

A useful signal should reflect a real phenomenon, not an artefact of pretraining statistics. We compare per-frame GeoPhys signals against human EEG on matched violation-of-expectation stimuli.

On two object-permanence scenarios, Create (one object enters an occluder, two emerge) and Vanish (two enter, one emerges), CORnet-S IT speed follows the contralateral delay activity, an EEG marker of visual working memory. Both rise and stay elevated after the violation, and the GeoPhys signal scales with the number of tracked objects.

[Figure 4 of the paper: model and brain comparison]
Model and brain. GeoPhys CORnet-S IT speed and the human EEG contralateral delay activity are both elevated in the invalid condition after occlusion offset, and scale with object number.
Takeaway 1

GeoPhys signals align with human EEG responses to object-permanence violations and scale with object number, consistent with physical-violation perception.

It detects violations at scale

Pairwise physics-violation accuracy (%). Main-set average per benchmark. Higher is better.
Verifier                                LikePhys   IntPhys2
Best video diffusion model (Hunyuan)    56.4       -
GPT-4o                                  -          53.8
Gemini 2.5 Flash                        -          55.6
V-JEPA 2 (1.1B, video-pretrained)       -          57.5
DINOv2 (L12)                            78.6       58.8
DINOv3 (L18)                            80.8       60.5
CORnet-S (IT / V1)                      78.2       61.1
VOneNet (V1)                            77.6       61.7
GeoPhys · Majority vote                 90.9       77.5
GeoPhys · OR ensemble                   98.3       93.3
Human                                   -          96.4

A leave-one-scene-out linear probe on the same DINOv3 features reaches only 62.4% on LikePhys, so the gain comes from trajectory geometry, not the backbone alone.
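For readers unfamiliar with the metric, the sketch below computes pairwise accuracy and two ways of combining backbones on made-up scores. It assumes a pair counts as correct when the violated clip scores higher than its plausible twin. The page does not define the ensemble rules, so the strict-majority rule and the "any backbone is correct" reading of the OR ensemble are both assumptions, and the numbers are invented.

```python
import numpy as np

def pairwise_correct(plausible, violated):
    """Boolean (num_backbones, num_pairs): True where the violated clip scores higher."""
    return np.asarray(violated) > np.asarray(plausible)

# Invented irregularity scores for 3 backbones on 4 matched pairs.
plaus = np.array([[0.1, 0.2, 0.3, 0.4]] * 3)
viol = np.array([[0.5, 0.1, 0.6, 0.2],
                 [0.5, 0.5, 0.1, 0.2],
                 [0.0, 0.5, 0.6, 0.5]])

correct = pairwise_correct(plaus, viol)
per_backbone = correct.mean(axis=1)
# Strict majority of backbones must rank the pair correctly (assumed rule).
majority = (correct.sum(axis=0) * 2 > correct.shape[0]).mean()
# Assumed OR reading: at least one backbone ranks the pair correctly.
any_hit = correct.any(axis=0).mean()

print(per_backbone, majority, any_hit)
```

On this toy data each backbone alone is right on half to three quarters of the pairs, while the combined rules do better because the backbones err on different pairs, which mirrors the complementarity described for the real backbones.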

[Bar chart: pairwise accuracy (%) on LikePhys, axis 0 to 100, dashed line at chance]
Baselines: AnimateDiff 39.2, AnimateDiff-SDXL 44.0, ZeroScope 46.7, ModelScope 47.1, Mochi 48.1, CogVideoX-5B 50.2, CogVideoX-2B 51.8, Wan2.1-1B 52.0, LTX-Video 55.3, CogVideoX-1.5 56.2, Wan2.1-14B 56.2, Hunyuan 56.4
Control: Linear probe 62.4, GeoPhys (V-JEPA 2) 78.3
Backbone: DINOv2 78.6, DINOv3 80.8, CORnet-S 78.2, VOneNet 77.6
Ensemble: GeoPhys Majority 90.9, GeoPhys OR ensemble 98.3
Detection on LikePhys. Pairwise accuracy across all twelve video-diffusion baselines, the feature controls (linear probe, V-JEPA 2), and GeoPhys. Every GeoPhys backbone beats every baseline; the OR ensemble reaches 98.3. Dashed line marks chance.
[Bar chart: pairwise accuracy (%) on IntPhys 2, axis 0 to 100, dashed line at chance]
Baselines: Cosmos 49.4, Qwen-VL 52.3, Gemini 1.5 52.3, GPT-4o 53.8, VideoMAEv2 53.8, V-JEPA 53.8, Gemini 2.5 Flash 55.6, V-JEPA 2 57.5
Control: Linear probe 55.5, GeoPhys (V-JEPA 2) 59.5
Backbone: DINOv2 58.8, DINOv3 60.5, CORnet-S 61.1, VOneNet 61.7
Ensemble: GeoPhys Majority 77.5, GeoPhys OR ensemble 93.3
Human: 96.4
Detection on IntPhys 2. All baselines (MLLM judges, video encoders, world models), controls, the four GeoPhys backbones, ensembles, and the human reference. OR ensemble 93.3, within 3.1 points of human.
Takeaway 2

GeoPhys unlocks physics-plausibility detection: every backbone surpasses all state-of-the-art baselines on both benchmarks, and the OR ensemble reaches 98.3% on LikePhys and 93.3% on IntPhys2.

It steers video generation

Beyond passive measurement, GeoPhys transfers to active control. As a best-of-N verifier it reranks a generator's candidates by plausibility, with no learned ranker and no PhysicsIQ-specific tuning.
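Best-of-N selection with a fixed score needs very little machinery, as the sketch below suggests. It works directly on toy feature trajectories and uses mean second-difference norm as a stand-in score; in practice the candidates would be generated videos passed through a frozen backbone, and the score would combine the GeoPhys signals. The function names are hypothetical.

```python
import numpy as np

def roughness(traj):
    """Stand-in irregularity score: mean norm of the second difference."""
    return float(np.linalg.norm(np.diff(traj, n=2, axis=0), axis=1).mean())

def best_of_n(candidates, score_fn):
    """Return the index of the lowest-scoring (most regular) candidate and all scores."""
    scores = np.array([score_fn(c) for c in candidates])
    return int(np.argmin(scores)), scores

t = np.linspace(0.0, 1.0, 12)[:, None]
smooth = np.hstack([t, t])
rng = np.random.default_rng(0)
noisy = smooth + 0.1 * rng.normal(size=smooth.shape)
jumpy = smooth.copy()
jumpy[6:] += 2.0

best, scores = best_of_n([noisy, smooth, jumpy], roughness)
print(best)  # 1
```

Because nothing is learned, the same selection loop applies unchanged to any generator, which is consistent with the claim of no PhysicsIQ-specific tuning.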

On PhysicsIQ, GeoPhys outscores every inference-time verifier and closes more of the gap to the oracle ceiling than a billion-parameter world model.

[Bar chart: PhysicsIQ score (%) per verifier and generator setting, axis 0 to 75]
Verifier      MAGI-1 4.5B I2V   Wan2.1 14B I2V   CogVideoX-5B I2V   MAGI-1 24B I2V   MAGI-1 24B V2V
Baseline      18.8              18.9             19.9               29.5             50.0
VideoMAE      15.0              11.0             11.4               23.5             52.6
Qwen2.5-VL    20.3              21.0             21.3               29.7             50.6
Qwen3-VL      30.1              30.7             30.9               32.3             55.1
WMReward      25.6              33.1             33.2               34.3             62.3
GeoPhys       34.0              34.0             34.1               38.6             64.5
Oracle        40.2              39.7             41.3               40.2             72.9
Generation results · PhysicsIQ best-of-N (N=16). Every inference-time verifier across all five generator settings. GeoPhys is the strongest real verifier on each, and on MAGI-1 24B V2V reaches 64.5, closing most of the gap from baseline (50.0) to the oracle (72.9).
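To make "closing most of the gap" concrete, the fraction of the baseline-to-oracle gap recovered on MAGI-1 24B V2V can be worked out from the reported scores. The helper name `gap_closed` is introduced here for illustration; the WMReward score of 62.3 is taken from the same chart.

```python
def gap_closed(score, baseline, oracle):
    """Fraction of the baseline-to-oracle gap recovered by a verifier."""
    return (score - baseline) / (oracle - baseline)

geophys = gap_closed(64.5, baseline=50.0, oracle=72.9)
wmreward = gap_closed(62.3, baseline=50.0, oracle=72.9)
print(f"GeoPhys {geophys:.1%}, WMReward {wmreward:.1%}")
# GeoPhys 63.3%, WMReward 53.7%
```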

PhysicsIQ score (%, higher is better). I2V settings score lower than the V2V setting; GeoPhys leads the real verifiers in every column.

Test-time scaling. More candidates, better physics. As the best-of-N budget grows, GeoPhys keeps climbing toward the oracle ceiling while cheaper verifiers and the no-verifier baseline plateau, so spending compute at inference pays off only with the right plausibility signal.

Figure 8. Test-time scaling: PhysicsIQ score versus the best-of-N budget.

Results in motion. Pick a physics family, then drag the divider to compare the no-verifier baseline against the GeoPhys best-of-N selection.


Same scenario, same candidate pool. The baseline takes a random sample; GeoPhys reranks by plausibility. Δ is the per-scenario PhysicsIQ gain over baseline.

Cheaper. Frozen image encoders cost a fraction of a video-pretrained world model.

Verifier           Backbone                     GPU memory (GB)   Latency (s/video)
GeoPhys single     DINOv3 ViT-L (0.3B)          1.2               0.25
GeoPhys ensemble   Four backbones (0.7B)        2.0               1.0
WMReward           V-JEPA 2 ViT-giant (1.1B)    9.3               1.5

GeoPhys backbones are frozen, with no training; WMReward is video-pretrained.

1.5× lower wall-clock and 4.65× lower memory than the world-model verifier. At a fixed budget that is roughly 5× more candidates.
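The headline ratios can be reproduced from the reported memory and latency figures; as the short check below shows, they correspond to the GeoPhys ensemble, and the single-backbone configuration is cheaper still.

```python
# (GPU memory in GB, latency in s/video), as reported above.
costs = {
    "GeoPhys single": (1.2, 0.25),
    "GeoPhys ensemble": (2.0, 1.0),
    "WMReward": (9.3, 1.5),
}

ref_mem, ref_lat = costs["WMReward"]
ratios = {name: (ref_mem / mem, ref_lat / lat) for name, (mem, lat) in costs.items()}

for name, (mem_ratio, lat_ratio) in ratios.items():
    print(f"{name}: {mem_ratio:.2f}x less memory, {lat_ratio:.2f}x lower latency")
```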

Takeaway 3

Test-time compute scales physically plausible generation, but only with the right verifier. GeoPhys closes the gap to the oracle faster than world-model and other verifiers, at 4.65× less memory, using only frozen image encoders.

What GeoPhys is, and is not

We do not claim frozen backbones represent or simulate physics. GeoPhys is a correlate of plausibility, not an implementation. The signals are time-symmetric, and the best-of-N result is bounded by candidate diversity. We see GeoPhys as a diagnostic for current generators, not a substitute for a genuine world model.

Citation (BibTeX)

If you find this work useful, please cite it.

@misc{interno2026geophys,
  title   = {GeoPhys: The Geometry of Physical Plausibility},
  author  = {Intern\`{o}, Christian and Pondaven, Alexander and Issa, Habon
             and Pizzati, Fabio and Pinto, Francesco and Olhofer, Markus
             and Laptev, Ivan and Torr, Philip and Simoncelli, Eero P.
             and Hammer, Barbara and Klindt, David},
  year    = {2026},
  url     = {https://christianinterno.github.io/GeoPhys/}
}