Seeking Physics in Diffusion Noise

¹Brown University    ²University of Edinburgh    ³Massachusetts Institute of Technology

"A ray of light is shining diagonally on a plastic cup in the dark, with the shadow of the plastic cup appearing at the bottom"

Baseline

Ours

"A piece of wood block is gently placed on the surface of a bowl filled with water"

Baseline

Ours

"A plastic ruler is slowly submerged into a glass of water, highlighting the reflections as the ruler interacts with the liquid"

Baseline

Ours

"A vibrant, elastic basketball is thrown forcefully towards the ground, capturing its dynamic interaction with the surface upon impact"

Baseline

Ours

Abstract

Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving comparable results to Best-of-K sampling with substantially fewer denoising steps.
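The selection loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `denoise_step`, `extract_features`, and `verifier_score` are hypothetical stand-ins for the video diffusion backbone, the frozen mid-layer DiT features, and the trained physics verifier, and the checkpoint schedule, keep fraction, and candidate count are assumed values chosen only for illustration. The sketch shows how pruning low-scoring candidates at a few checkpoints reduces the total number of denoising steps compared with running every candidate to completion.

```python
import numpy as np


def denoise_step(latents, step, rng):
    """Hypothetical stand-in for one denoising step of a video diffusion model."""
    return 0.9 * latents + 0.1 * rng.standard_normal(latents.shape)


def extract_features(latents):
    """Hypothetical stand-in for frozen mid-layer features (here: flattened latents)."""
    return latents.reshape(latents.shape[0], -1)


def verifier_score(features, weights):
    """Hypothetical lightweight verifier: a logistic score on frozen features."""
    return 1.0 / (1.0 + np.exp(-(features @ weights)))


def progressive_selection(num_candidates=8, num_steps=50, checkpoints=(10, 20, 30),
                          keep_fraction=0.5, latent_shape=(4, 8, 8), seed=0):
    """Run parallel trajectories, pruning low-scoring ones at each checkpoint.

    Returns the surviving latent and the total number of per-trajectory
    denoising steps that were executed.
    """
    rng = np.random.default_rng(seed)
    latents = rng.standard_normal((num_candidates, *latent_shape))
    # Random weights stand in for a verifier trained on frozen features.
    weights = rng.standard_normal(int(np.prod(latent_shape)))

    steps_used = 0
    for step in range(1, num_steps + 1):
        latents = denoise_step(latents, step, rng)
        steps_used += latents.shape[0]  # one step per surviving trajectory

        if step in checkpoints and latents.shape[0] > 1:
            scores = verifier_score(extract_features(latents), weights)
            keep = max(1, int(np.ceil(latents.shape[0] * keep_fraction)))
            best = np.argsort(scores)[::-1][:keep]  # highest scores first
            latents = latents[best]

    # If several candidates survive all checkpoints, keep the best-scoring one.
    if latents.shape[0] > 1:
        scores = verifier_score(extract_features(latents), weights)
        latents = latents[[int(np.argmax(scores))]]

    return latents, steps_used


if __name__ == "__main__":
    final, used = progressive_selection()
    print(final.shape, used, "vs. Best-of-K:", 8 * 50)
```

With these assumed settings (8 candidates, 50 steps, halving at steps 10, 20, and 30), the toy loop executes 160 trajectory-steps, whereas denoising all 8 candidates to completion would take 400. The actual savings depend on the schedule and keep fraction used in practice.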

Pipeline Overview

Experimental Results

Cross-Backbone Results on PhyGenBench

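| Backbone | Method | Overall | S1 | S2 | S3 | Mech | Opti | Ther | Mate | Win % |
|---|---|---|---|---|---|---|---|---|---|---|
| CogVideoX-2B | Baseline | 0.370 | - | - | - | 0.38 | 0.43 | 0.34 | 0.39 | - |
| CogVideoX-2B | Ours | 0.515 | 1.98 | 0.91 | 1.69 | 0.49 | 0.58 | 0.47 | 0.49 | 66.1% |
| CogVideoX-5B | Baseline | 0.363 | 1.54 | 0.58 | 1.21 | 0.283 | 0.493 | 0.322 | 0.308 | - |
| CogVideoX-5B | Ours | 0.365 | 1.52 | 0.53 | 1.30 | 0.292 | 0.456 | 0.256 | 0.408 | 62.5% |
| Wan 2.1-14B | Baseline | 0.569 | 2.05 | 1.28 | 1.79 | 0.525 | 0.740 | 0.489 | 0.458 | - |
| Wan 2.1-14B | Ours | 0.612 | 2.09 | 1.46 | 1.86 | 0.600 | 0.767 | 0.533 | 0.492 | - |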

S1: VQAScore (single-frame), S2: multi-frame physics (GPT-4o), S3: naturalness (GPT-4o). Mech: mechanics, Opti: optics, Ther: thermal, Mate: material properties. Win %: pairwise preference judged by GPT-4o, excluding ties.
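As a small illustration of the "excluding ties" convention, the sketch below computes a win percentage from pairwise outcomes. It assumes the usual reading, wins divided by wins plus losses, with ties dropped from the denominator. The function name and the counts in the usage line are made up for illustration and are not taken from the paper.

```python
def win_rate_excluding_ties(wins: int, losses: int, ties: int = 0) -> float:
    """Return the win percentage over decided comparisons only.

    Ties are accepted for bookkeeping but do not enter the denominator.
    """
    decided = wins + losses
    if decided == 0:
        raise ValueError("no decided comparisons; win rate is undefined")
    return 100.0 * wins / decided


if __name__ == "__main__":
    # Hypothetical counts: 3 wins, 1 loss, 4 ties.
    print(win_rate_excluding_ties(wins=3, losses=1, ties=4))
```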


BibTeX

@misc{tang2026seekingphysicsdiffusionnoise,
    title={Seeking Physics in Diffusion Noise},
    author={Chujun Tang and Lei Zhong and Fangqiang Ding},
    year={2026},
    eprint={2603.14294},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2603.14294},
}