GeoPhys: The Geometry of Physical Plausibility

Internò, Christian; Pondaven, Alexander; Issa, Habon; Pizzati, Fabio; Pinto, Francesco; Olhofer, Markus; Laptev, Ivan; Torr, Philip; Simoncelli, Eero P.; Hammer, Barbara; Klindt, David

Abstract

While humans can identify physically implausible events within milliseconds, machine learning approaches addressing the same problem are extremely slow and expensive. They either rely on external multimodal-LLM judges or require ad-hoc modifications to the training procedure. In this work, we argue that indicators of physical plausibility are implicitly captured by five geometric properties of the per-frame embeddings produced by frozen image encoders. In aggregate, we call them GeoPhys.

TL;DR: Physical plausibility is a geometric property of feature trajectories through frozen image encoders. No video pretraining, no physics supervision, no learned ranker. GeoPhys sets a new detection SoTA and steers video generation as a cheap verifier.

LikePhys: 98.3% pairwise physics-violation detection accuracy (OR ensemble). GeoPhys beats all twelve SoTA video diffusion models.

IntPhys2: 93.3% detection accuracy (OR ensemble), within 3.1 points of human performance (96.4%). The best prior baselines (V-JEPA 2, GPT-4o and Gemini 2.5 Flash) sit near chance.

PhysicsIQ: 50.0 → 64.5 via best-of-N on MAGI-1 24B, at 4.65× lower memory than a world-model verifier.

The Setup

A frozen backbone maps each frame to a pooled feature; stacking them across time gives a trajectory. Plausible videos trace smooth, locally linear paths. Violations bend.
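The setup can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: `build_trajectory`, `encode_frame` and `toy_encoder` are hypothetical names, mean pooling over tokens is an assumption (the page only says "pooled"), and the toy encoder merely stands in for a real frozen backbone.

```python
import numpy as np

def build_trajectory(frames, encode_frame):
    """Stack one pooled feature per frame into a (T, D) trajectory.

    `encode_frame` is a hypothetical callable standing in for a frozen
    backbone; it is assumed to return a (num_tokens, D) array per frame.
    """
    pooled = []
    for frame in frames:
        tokens = np.asarray(encode_frame(frame), dtype=float)
        pooled.append(tokens.mean(axis=0))  # mean pooling: an assumption
    return np.stack(pooled, axis=0)

def toy_encoder(frame, patch=4):
    """Stand-in 'encoder': cut a (H, W) frame into flattened patch tiles."""
    h, w = frame.shape
    tiles = frame.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch * patch)

rng = np.random.default_rng(0)
video = rng.random((8, 16, 16))      # 8 greyscale frames of 16 x 16 pixels
traj = build_trajectory(video, toy_encoder)
print(traj.shape)                    # (8, 16): T frames, D feature dimensions
```

Swapping `toy_encoder` for a real frozen image encoder would leave the rest unchanged, since the trajectory only needs one vector per frame.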

Method walkthrough

The five signals. Each measures one way a trajectory can betray a physics violation, computed directly on the per-frame features with no learned parameters.

**Figure 1. GeoPhys signals of plausible vs. violated dynamics in frozen feature space.** A frozen backbone maps each frame to a pooled feature, tracing a trajectory. Plausible videos stay smooth; violations bend and jump, read out by the five signals below.

Speed variation

ϕ_{speed} = std_{t} ∥ v_{t} ∥

How much the distance the feature point travels between consecutive frames varies. Steady motion keeps these steps even; an object that teleports, vanishes, or stutters makes the step sizes spike.

Curvature

ϕ_{curv} = mean_{t} θ_{t}

How sharply the path turns at each step. Good representations keep natural motion locally straight; a wall-pass or sudden reversal bends the trajectory.

Angle consistency

ϕ_{ang} = std_{t} θ_{t}

Whether those turns are uniform or erratic. A consistent path bends by similar amounts each step; violations scatter the turning angles.

Acceleration

ϕ_{accel} = mean_{t} ∥ a_{t} ∥^{2}

The change in the step vector itself, the second difference. It catches abrupt shifts in speed or direction that smooth dynamics avoid.

Prediction residual

ϕ_{perr} = mean_{t} ∥ ε_{t} ∥

How far the next point strays from a linear predictor fit on the recent past. A violation breaks the local low-dimensional structure and the residual jumps.

On the trajectory $\Gamma_{\theta}(V) = (\bar{z}_{1}, \dots, \bar{z}_{T})$, with step $v_{t} = \bar{z}_{t+1} - \bar{z}_{t}$, turning angle $\theta_{t} = \arccos \frac{\langle v_{t}, v_{t+1} \rangle}{\| v_{t} \| \, \| v_{t+1} \|}$, acceleration $a_{t} = \bar{z}_{t+2} - 2 \bar{z}_{t+1} + \bar{z}_{t}$, and residual $\varepsilon_{t} = \bar{z}_{t+1} - \hat{\bar{z}}_{t+1}$ of a linear predictor $\hat{\bar{z}}_{t+1}$ on $\bar{z}_{1:t}$. All statistics run over $t$; larger values mean less regular dynamics.
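The five definitions above translate almost directly into NumPy. The sketch below is one possible reading, not the reference implementation. In particular, the page does not specify the linear predictor, so the code assumes a per-dimension least-squares line fitted to the last `window` points and extrapolated one step. `geophys_signals`, `window` and `eps` are hypothetical names.

```python
import numpy as np

def geophys_signals(z, window=4, eps=1e-12):
    """Five trajectory statistics for a (T, D) feature trajectory, T > window."""
    z = np.asarray(z, dtype=float)
    v = np.diff(z, axis=0)                        # steps v_t, shape (T-1, D)
    step = np.linalg.norm(v, axis=1)              # step lengths ||v_t||
    cos = np.sum(v[:-1] * v[1:], axis=1) / np.maximum(step[:-1] * step[1:], eps)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))    # turning angles theta_t
    a = z[2:] - 2.0 * z[1:-1] + z[:-2]            # second differences a_t

    # Assumed predictor: fit z ~ slope * time + intercept on the last
    # `window` points, then extrapolate to the next time index.
    design = np.stack([np.arange(window), np.ones(window)], axis=1)
    resid = []
    for t in range(window, len(z)):
        coef, *_ = np.linalg.lstsq(design, z[t - window:t], rcond=None)
        pred = coef[0] * window + coef[1]
        resid.append(np.linalg.norm(z[t] - pred))

    return {
        "speed": float(np.std(step)),                      # std_t ||v_t||
        "curv": float(np.mean(theta)),                     # mean_t theta_t
        "ang": float(np.std(theta)),                       # std_t theta_t
        "accel": float(np.mean(np.sum(a ** 2, axis=1))),   # mean_t ||a_t||^2
        "perr": float(np.mean(resid)),                     # mean_t ||eps_t||
    }

# Toy check: a straight path versus the same path with one sideways jump.
smooth = np.outer(np.arange(10), [1.0, 0.0, 0.0])
jumpy = smooth.copy()
jumpy[5:] += np.array([0.0, 3.0, 0.0])
s, j = geophys_signals(smooth), geophys_signals(jumpy)
print({k: j[k] > s[k] for k in s})
```

On the straight path all five values are numerically zero, and the single jump raises every one of them, which is the behaviour the definitions describe.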

Four frozen backbones. The score runs on self-supervised transformers and explicit models of primate visual cortex. None is trained on video or physics. For each backbone we pick the readout layer that maximises the plausible-versus-violated curvature gap on a held-out split. Backbones are complementary: each specialises in a different signal and wins a different physics domain.
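The layer-selection rule can be illustrated as follows. This is a hedged sketch: it assumes the "curvature gap" is the difference between the mean curvature of violated and plausible held-out videos, which the page does not state explicitly. The function name, layer names and numbers are made up for illustration.

```python
import numpy as np

def select_readout_layer(curv_plausible, curv_violated):
    """Pick the layer with the largest violated-minus-plausible curvature gap.

    Both arguments map a layer name to the per-video curvature values
    measured on a held-out split (hypothetical data layout).
    """
    gaps = {
        layer: float(np.mean(curv_violated[layer]) - np.mean(curv_plausible[layer]))
        for layer in curv_plausible
    }
    best = max(gaps, key=gaps.get)
    return best, gaps

# Toy held-out curvatures (illustrative numbers, not results from the paper).
curv_plausible = {"block_6": [0.5, 0.6], "block_12": [0.4, 0.5]}
curv_violated = {"block_6": [0.7, 0.8], "block_12": [0.9, 1.0]}
best_layer, gaps = select_readout_layer(curv_plausible, curv_violated)
print(best_layer, gaps)
```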

DINOv2: ViT-L/14 · self-supervised
DINOv3: ViT-L · SSL + registers
CORnet-S: recurrent ventral CNN
VOneNet: Gabor V1 + ResNet-50

**Pipeline.** A single video feeds each backbone unchanged. The pooled per-frame embedding is stacked across time into a trajectory, on which the five signals operate.

It tracks the brain

A useful signal should reflect a real phenomenon, not an artefact of pretraining statistics. We compare per-frame GeoPhys signals against human EEG on matched violation-of-expectation stimuli.

On two object-permanence scenarios, Create (one object enters an occluder, two emerge) and Vanish (two enter, one emerges), CORnet-S IT speed follows the contralateral delay activity, an EEG marker of visual working memory. Both rise and stay elevated after the violation, and the GeoPhys signal scales with the number of tracked objects.

**Model and brain.** GeoPhys CORnet-S IT speed and the human EEG contralateral delay activity are both elevated in the invalid condition after occlusion offset, and scale with object number.

Takeaway 1

GeoPhys signals align with human EEG responses to object-permanence violations and scale with object number, consistent with physical-violation perception.

It detects violations at scale

LikePhys (detection)

650 matched Blender pairs across 12 scenarios in four physics domains: rigid-body, continuum, fluid and optical.

arXiv:2510.11512

IntPhys 2 (detection)

506 photorealistic Unreal Engine pairs probing four core-knowledge properties: permanence, solidity, continuity, immutability.

arXiv:2506.09849

PhysicsIQ (generation)

396 real-world videos over 66 scenarios in solid mechanics, fluid dynamics, optics, thermodynamics and magnetism.

physics-iq.github.io

Pairwise physics-violation accuracy (%). Main-set average per benchmark. Higher is better.
Verifier	LikePhys	IntPhys2
Best video diffusion model (Hunyuan)	56.4	–
GPT-4o	–	53.8
Gemini 2.5 Flash	–	55.6
V-JEPA 2 (1.1B, video-pretrained)	–	57.5
DINOv2 (L12)	78.6	58.8
DINOv3 (L18)	80.8	60.5
CORnet-S (IT / V1)	78.2	61.1
VOneNet (V1)	77.6	61.7
GeoPhys · Majority vote	90.9	77.5
GeoPhys · OR ensemble	98.3	93.3
Human	–	96.4

A leave-one-scene-out linear probe on the same DINOv3 features reaches only 62.4% on LikePhys, so the gain comes from trajectory geometry, not the backbone alone.
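For readers who want to reproduce the metric in the table above, here is a rough sketch of pairwise accuracy and the two ensembles. The page does not spell out how the ensembles combine backbones, so the code adopts one plausible reading as an assumption: a pair counts as correct for a backbone when the violated video scores higher than its plausible twin, the majority vote requires a strict majority of backbones to be correct, and the OR ensemble requires at least one. All function names and numbers are hypothetical.

```python
import numpy as np

def pairwise_correct(score_plausible, score_violated):
    """True where the violated video gets the higher (less regular) score."""
    return np.asarray(score_violated) > np.asarray(score_plausible)

def ensemble_accuracies(correct):
    """`correct` is a (num_backbones, num_pairs) boolean array."""
    correct = np.asarray(correct, dtype=bool)
    per_backbone = correct.mean(axis=1)
    # Assumed rule: strictly more than half of the backbones are correct.
    majority = float((correct.sum(axis=0) > correct.shape[0] / 2).mean())
    # Assumed rule: at least one backbone is correct on the pair.
    any_correct = float(correct.any(axis=0).mean())
    return per_backbone, majority, any_correct

# Toy scores for 4 backbones on 4 matched pairs (illustrative only).
plausible = np.array([[0.1, 0.2, 0.5, 0.4],
                      [0.1, 0.6, 0.2, 0.4],
                      [0.1, 0.6, 0.5, 0.4],
                      [0.3, 0.2, 0.5, 0.4]])
violated = np.array([[0.2, 0.3, 0.4, 0.3],
                     [0.2, 0.5, 0.3, 0.3],
                     [0.2, 0.5, 0.4, 0.3],
                     [0.2, 0.3, 0.4, 0.3]])
correct = pairwise_correct(plausible, violated)
per_backbone, majority_acc, or_acc = ensemble_accuracies(correct)
print(per_backbone, majority_acc, or_acc)
```

Under this reading the OR ensemble can never score below the best single backbone, which matches the ordering in the table but should not be taken as confirmation of the authors' exact rule.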

Detection on LikePhys. Pairwise accuracy across all twelve video-diffusion baselines, the feature controls (linear probe, V-JEPA 2), and GeoPhys. Every GeoPhys backbone beats every baseline; the OR ensemble reaches 98.3. Dashed line marks chance.

Detection on IntPhys 2. All baselines (MLLM judges, video encoders, world models), controls, the four GeoPhys backbones, ensembles, and the human reference. OR ensemble 93.3, within 3.1 points of human.

Takeaway 2

GeoPhys unlocks physics-plausibility detection: every backbone surpasses all state-of-the-art baselines on both benchmarks, and the OR ensemble reaches 98.3% on LikePhys and 93.3% on IntPhys2.

It steers video generation

Beyond passive measurement, GeoPhys transfers to active control. As a best-of-N verifier it reranks a generator's candidates by plausibility, with no learned ranker and no PhysicsIQ-specific tuning.
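Best-of-N reranking itself is only a few lines. The sketch below assumes the verifier reduces each candidate to one scalar where lower means more plausible. How the five signals are combined into that scalar is not described on this page, so the example uses mean squared acceleration alone as a stand-in. `best_of_n` and `mean_sq_acceleration` are hypothetical names, and the candidates are synthetic trajectories rather than generated videos.

```python
import numpy as np

def mean_sq_acceleration(z):
    """Stand-in plausibility score: mean squared second difference."""
    z = np.asarray(z, dtype=float)
    a = z[2:] - 2.0 * z[1:-1] + z[:-2]
    return float(np.mean(np.sum(a ** 2, axis=1)))

def best_of_n(candidates, score_fn):
    """Return the index of the lowest-scoring candidate and all scores."""
    scores = np.array([score_fn(c) for c in candidates], dtype=float)
    return int(np.argmin(scores)), scores

# Three synthetic candidate trajectories: the same straight path with
# different amounts of jitter (a toy proxy for implausible dynamics).
rng = np.random.default_rng(0)
line = np.outer(np.arange(12), np.ones(4))
candidates = [line + rng.normal(scale=s, size=line.shape) for s in (0.5, 0.05, 1.0)]
best, scores = best_of_n(candidates, mean_sq_acceleration)
print(best, scores)
```

Because selection is a plain argmin over precomputed scores, the verifier adds no trainable parameters; only the cost of encoding each candidate grows with N.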

On PhysicsIQ, GeoPhys outscores every inference-time verifier and closes more of the gap to the oracle ceiling than a billion-parameter world model.

Generation results · PhysicsIQ best-of-N (N=16). Every inference-time verifier across all five generator settings. GeoPhys is the strongest real verifier on each, and on MAGI-1 24B V2V reaches 64.5, closing most of the gap from baseline (50.0) to the oracle (72.9).

PhysicsIQ score (%, higher is better). I2V settings score lower than the V2V setting; GeoPhys leads the real verifiers in every column.
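As a quick worked check of "closing most of the gap", the fraction of the baseline-to-oracle gap recovered on MAGI-1 24B V2V can be computed from the three scores quoted above. `gap_closed` is a hypothetical helper name.

```python
def gap_closed(baseline, verifier, oracle):
    """Fraction of the baseline-to-oracle gap recovered by a verifier."""
    return (verifier - baseline) / (oracle - baseline)

# Scores quoted for MAGI-1 24B V2V: baseline 50.0, GeoPhys 64.5, oracle 72.9.
fraction = gap_closed(50.0, 64.5, 72.9)
print(f"{fraction:.1%}")  # 63.3%
```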

Test-time scaling. More candidates, better physics. As the best-of-N budget grows, GeoPhys keeps climbing toward the oracle ceiling while cheaper verifiers and the no-verifier baseline plateau, so spending compute at inference pays off only with the right plausibility signal.

**Figure 8.** Test-time scaling: PhysicsIQ score versus the best-of-N budget.

Results in motion. Side-by-side comparison, per physics family, of the no-verifier baseline against the GeoPhys best-of-N selection.

Same scenario, same candidate pool. The baseline takes a random sample; GeoPhys reranks by plausibility. Δ is the per-scenario PhysicsIQ gain over baseline.

Cheaper. Frozen image encoders cost a fraction of a video-pretrained world model.

GPU memory and latency per verifier. Lower is better.
Verifier	Backbone	GPU memory (GB)	Latency (s / video)
GeoPhys · single	DINOv3 ViT-L (0.3B), frozen, no training	1.2	0.25
GeoPhys · ensemble	Four backbones (0.7B)	2.0	1.0
WMReward	V-JEPA 2 ViT-giant (1.1B), video-pretrained	9.3	1.5

The GeoPhys ensemble runs at 1.5× lower wall-clock (1.0 vs. 1.5 s/video) and 4.65× lower memory (2.0 vs. 9.3 GB) than the world-model verifier. At a fixed budget that is roughly 5× more candidates.
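The ratios follow directly from the reported memory and latency figures. The short check below recomputes them. The headline 4.65× and 1.5× match the four-backbone ensemble, while the single-backbone ratios are derived here from the same figures and are not quoted on the page. `costs` and `savings` are hypothetical names.

```python
# Reported cost figures: GPU memory in GB and latency in seconds per video.
costs = {
    "GeoPhys single": {"mem_gb": 1.2, "sec_per_video": 0.25},
    "GeoPhys ensemble": {"mem_gb": 2.0, "sec_per_video": 1.0},
    "WMReward": {"mem_gb": 9.3, "sec_per_video": 1.5},
}

def savings(name, reference="WMReward"):
    """Memory and latency ratios of `reference` relative to `name`."""
    ref, cur = costs[reference], costs[name]
    return ref["mem_gb"] / cur["mem_gb"], ref["sec_per_video"] / cur["sec_per_video"]

for name in ("GeoPhys single", "GeoPhys ensemble"):
    mem, lat = savings(name)
    print(f"{name}: {mem:.2f}x less memory, {lat:.1f}x lower latency")
```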

Takeaway 3

Test-time compute scales physically plausible generation, but only with the right verifier. GeoPhys closes the gap to the oracle faster than world-model and other verifiers, at 4.65× less memory, using only frozen image encoders.

What GeoPhys is, and is not

We do not claim frozen backbones represent or simulate physics. GeoPhys is a correlate of plausibility, not an implementation. The signals are time-symmetric, and the best-of-N result is bounded by candidate diversity. We see GeoPhys as a diagnostic for current generators, not a substitute for a genuine world model.