Seeking Physics in Diffusion Noise

Chujun Tang¹, Lei Zhong², Fangqiang Ding³

¹Brown University ²University of Edinburgh ³Massachusetts Institute of Technology

"A ray of light generated by a laser pointer is passing through smoke"

[Video comparison: Baseline vs. Ours]

"A kite is soaring above a smooth and tranquil pond, with the reflection of the kite"

[Video comparison: Baseline vs. Ours]

"A timelapse captures the transformation as steam in a kitchen comes into contact with the window"

[Video comparison: Baseline vs. Ours]

"A tennis ball is gently placed on the surface of a bucket filled with water"

[Video comparison: Baseline vs. Ours]

Abstract

Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting that frozen DiT features carry recoverable physics-related cues. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving results comparable to Best-of-K sampling with substantially fewer denoising steps.

Pipeline Overview

Progressive trajectory selection with a physics verifier. Given a text prompt, we sample N denoising trajectories from different seeds and score each partially denoised sample at intermediate timesteps (e.g., t=600, 400) using a lightweight physics verifier applied to frozen DiT features.
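The selection loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `init_state`, `denoise_to`, and `verifier_score` are hypothetical callables standing in for the sampler and the physics verifier, and the `keep_fraction` pruning rule is an assumption, since the text only states that low-scoring candidates are pruned early.

```python
import math
from typing import Callable, Dict, Sequence, TypeVar

State = TypeVar("State")


def progressive_select(
    seeds: Sequence[int],
    init_state: Callable[[int], State],             # hypothetical: seed -> initial noise
    denoise_to: Callable[[State, int], State],      # hypothetical: advance a sample to timestep t
    verifier_score: Callable[[State, int], float],  # hypothetical: higher = more plausible
    checkpoints: Sequence[int] = (600, 400),        # example timesteps from the text
    keep_fraction: float = 0.5,                     # assumed pruning rule
) -> Dict[int, State]:
    """Denoise all seeds in parallel and prune low scorers at each checkpoint."""
    states = {seed: init_state(seed) for seed in seeds}
    for t in checkpoints:
        # Advance every surviving trajectory to the checkpoint timestep.
        states = {seed: denoise_to(x, t) for seed, x in states.items()}
        # Score the partially denoised samples with the verifier.
        scores = {seed: verifier_score(x, t) for seed, x in states.items()}
        # Keep the top fraction, always retaining at least one trajectory.
        n_keep = max(1, math.ceil(len(states) * keep_fraction))
        kept = sorted(states, key=lambda s: scores[s], reverse=True)[:n_keep]
        states = {seed: states[seed] for seed in kept}
    # Only the survivors pay for the remaining denoising steps.
    return {seed: denoise_to(x, 0) for seed, x in states.items()}


# Toy stand-ins so the sketch runs without a diffusion model:
# a "state" is (seed, timestep) and the score is simply the seed value.
survivors = progressive_select(
    seeds=[0, 1, 2, 3],
    init_state=lambda seed: (seed, 1000),
    denoise_to=lambda x, t: (x[0], t),
    verifier_score=lambda x, t: float(x[0]),
)
```

Because pruned trajectories stop consuming denoising steps after their last checkpoint, the total step count falls below that of running every seed to completion, which is where the cost saving relative to Best-of-K would come from under these assumptions.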

Given a generated video and its prompt, we encode the video with a VAE, add diffusion noise at timestep t, and run a frozen diffusion transformer. We extract hidden states at layer ℓ, remove text-conditioning tokens, and spatially mean-pool video tokens to obtain per-frame features.
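The token-pooling step can be illustrated with a small sketch. It assumes a layout the text does not specify: text-conditioning tokens come first in the sequence, followed by video tokens in frame-major order with the same number of spatial tokens per frame. The function name and arguments are hypothetical, and hidden states are plain nested lists to keep the example dependency-free.

```python
from typing import List, Sequence


def per_frame_features(
    hidden: Sequence[Sequence[float]],  # (num_tokens, dim) hidden states from layer l
    num_text_tokens: int,               # assumed: text tokens occupy the first positions
    num_frames: int,                    # number of (latent) frames in the video
) -> List[List[float]]:
    """Drop text tokens, then mean-pool the spatial tokens of each frame."""
    video = hidden[num_text_tokens:]
    tokens_per_frame, remainder = divmod(len(video), num_frames)
    if tokens_per_frame == 0 or remainder != 0:
        raise ValueError("video tokens do not divide evenly into frames")
    features = []
    for f in range(num_frames):
        frame = video[f * tokens_per_frame:(f + 1) * tokens_per_frame]
        # Average each feature dimension over the frame's spatial tokens.
        features.append([sum(col) / tokens_per_frame for col in zip(*frame)])
    return features


# One text token followed by two frames of two spatial tokens each.
toy_hidden = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
toy_features = per_frame_features(toy_hidden, num_text_tokens=1, num_frames=2)
```

The result has one vector per frame, which matches the "per-frame features" that the verifier consumes according to the description above.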

Experimental Results

Cross-Backbone Results on PhyGenBench

Backbone	Method	Overall	S1	S2	S3	Mech	Opti	Ther	Mate	Win %
CogVideoX-2B	Baseline	0.370	—	—	—	0.38	0.43	0.34	0.39	—
CogVideoX-2B	Ours	0.515	1.98	0.91	1.69	0.49	0.58	0.47	0.49	66.1%
CogVideoX-5B	Baseline	0.363	1.54	0.58	1.21	0.283	0.493	0.322	0.308	—
CogVideoX-5B	Ours	0.365	1.52	0.53	1.30	0.292	0.456	0.256	0.408	62.5%
Wan 2.1-14B	Baseline	0.569	2.05	1.28	1.79	0.525	0.740	0.489	0.458	—
Wan 2.1-14B	Ours	0.612	2.09	1.46	1.86	0.600	0.767	0.533	0.492	—

S1: VQAScore (single-frame), S2: multi-frame physics (GPT-4o), S3: naturalness (GPT-4o). Mech: mechanics, Opti: optics, Ther: thermal, Mate: material properties. Win %: pairwise preference judged by GPT-4o, excluding ties.
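One plausible reading of "excluding ties" is that the win rate is computed over decided comparisons only, that is, wins / (wins + losses). The sketch below shows that reading. The outcome labels and function name are hypothetical, and the exact formula used for the table is not stated in the text.

```python
from typing import Iterable, Optional


def win_rate_excluding_ties(outcomes: Iterable[str]) -> Optional[float]:
    """Fraction of decided pairwise comparisons won; ties are ignored."""
    wins = losses = 0
    for outcome in outcomes:
        if outcome == "win":
            wins += 1
        elif outcome == "loss":
            losses += 1
        # Any other label (e.g. "tie") is left out of the denominator.
    decided = wins + losses
    return wins / decided if decided else None


# Toy judgments, not data from the paper.
toy_rate = win_rate_excluding_ties(["win", "win", "tie", "loss"])
```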

BibTeX

@misc{tang2026seekingphysicsdiffusionnoise,
    title={Seeking Physics in Diffusion Noise},
    author={Chujun Tang and Lei Zhong and Fangqiang Ding},
    year={2026},
    eprint={2603.14294},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2603.14294},
}