Seeking Physics in Diffusion Noise
Abstract
Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Experiments on PhyGenBench show that our method improves physical consistency while reducing inference cost, achieving results comparable to Best-of-K sampling with substantially fewer denoising steps.
Pipeline Overview
Progressive trajectory selection with a physics verifier. Given a text prompt, we sample N denoising trajectories from different seeds and score each partially denoised sample at intermediate timesteps (e.g., t = 600 and t = 400) using a lightweight physics verifier applied to frozen DiT features. Low-scoring trajectories are pruned early, so only the remaining candidates are denoised further.
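The selection loop described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `denoise` and `verifier` callables, the `keep_ratio` pruning rule, the starting timestep of 1000, and the cost measure (timestep units summed over live trajectories) are all assumptions introduced for the example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

State = TypeVar("State")


def progressive_select(
    states: List[State],
    denoise: Callable[[State, int, int], State],  # hypothetical: (state, t_from, t_to) -> state
    verifier: Callable[[State, int], float],      # hypothetical: (state, t) -> physics score
    checkpoints: Sequence[int] = (600, 400),      # example timesteps from the caption
    keep_ratio: float = 0.5,                      # assumed pruning rule: keep the top half
    t_start: int = 1000,                          # assumed initial noise level
) -> Tuple[List[State], int]:
    """Denoise candidates in parallel, pruning low scorers at each checkpoint.

    Returns the fully denoised survivors and a rough cost, counted as
    timestep units traversed across all live trajectories.
    """
    alive = list(states)
    t_cur, cost = t_start, 0
    for t in checkpoints:
        # Advance every surviving trajectory to the next checkpoint.
        cost += len(alive) * (t_cur - t)
        alive = [denoise(s, t_cur, t) for s in alive]
        t_cur = t
        # Score the partially denoised samples and keep the best ones.
        ranked = sorted(alive, key=lambda s: verifier(s, t), reverse=True)
        alive = ranked[: max(1, int(len(ranked) * keep_ratio))]
    # Finish denoising only the survivors.
    cost += len(alive) * t_cur
    return [denoise(s, t_cur, 0) for s in alive], cost
```

Under these assumed defaults, four candidates cost 2400 timestep units in this toy accounting, compared with 4000 for denoising all four to completion as in Best-of-4 sampling.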
Given a generated video and its prompt, we encode the video with a VAE, add diffusion noise at timestep t, and run a frozen diffusion transformer. We extract hidden states at layer ℓ, remove text-conditioning tokens, and spatially mean-pool video tokens to obtain per-frame features.
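The token-removal and pooling step can be illustrated with a short NumPy sketch. It assumes a token layout that the text does not specify: text-conditioning tokens come first in the sequence, and video tokens follow in frame-major order. The function name and arguments are hypothetical, and `frames` refers to frames in the VAE latent space.

```python
import numpy as np


def per_frame_features(
    hidden: np.ndarray,      # (num_text_tokens + frames * height * width, dim)
    num_text_tokens: int,
    frames: int,
    height: int,
    width: int,
) -> np.ndarray:
    """Drop text tokens and spatially mean-pool video tokens per frame.

    Assumes text tokens precede video tokens and that video tokens are
    ordered frame by frame (both are assumptions of this sketch).
    """
    video = hidden[num_text_tokens:]
    expected = frames * height * width
    if video.shape[0] != expected:
        raise ValueError(f"expected {expected} video tokens, got {video.shape[0]}")
    dim = video.shape[-1]
    # Group the tokens of each frame, then average over spatial positions.
    grid = video.reshape(frames, height * width, dim)
    return grid.mean(axis=1)  # (frames, dim)
```

The resulting array has one feature vector per latent frame, which is the kind of input a lightweight verifier could consume.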
Experimental Results
Cross-Backbone Results on PhyGenBench
| Backbone | Method | Overall | S1 | S2 | S3 | Mech | Opti | Ther | Mate | Win % |
|---|---|---|---|---|---|---|---|---|---|---|
| CogVideoX-2B | Baseline | 0.370 | — | — | — | 0.38 | 0.43 | 0.34 | 0.39 | — |
| CogVideoX-2B | Ours | 0.515 | 1.98 | 0.91 | 1.69 | 0.49 | 0.58 | 0.47 | 0.49 | 66.1% |
| CogVideoX-5B | Baseline | 0.363 | 1.54 | 0.58 | 1.21 | 0.283 | 0.493 | 0.322 | 0.308 | — |
| CogVideoX-5B | Ours | 0.365 | 1.52 | 0.53 | 1.30 | 0.292 | 0.456 | 0.256 | 0.408 | 62.5% |
| Wan 2.1-14B | Baseline | 0.569 | 2.05 | 1.28 | 1.79 | 0.525 | 0.740 | 0.489 | 0.458 | — |
| Wan 2.1-14B | Ours | 0.612 | 2.09 | 1.46 | 1.86 | 0.600 | 0.767 | 0.533 | 0.492 | — |
S1: VQAScore (single-frame), S2: multi-frame physics (GPT-4o), S3: naturalness (GPT-4o). Mech: mechanics, Opti: optics, Ther: thermal, Mate: material properties. Win %: pairwise preference judged by GPT-4o, excluding ties.
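A win rate that excludes ties is conventionally the share of wins among decisive comparisons. The sketch below shows that arithmetic under the assumption that this is how Win % is computed here. The function name and the counts in the usage note are illustrative and are not taken from the paper.

```python
from typing import Optional


def win_rate_excluding_ties(wins: int, losses: int, ties: int = 0) -> Optional[float]:
    """Percentage of wins among decisive (non-tied) pairwise comparisons.

    `ties` is accepted only to make explicit that tied comparisons are ignored.
    Returns None when every comparison is a tie.
    """
    decisive = wins + losses
    if decisive == 0:
        return None
    return 100.0 * wins / decisive
```

For example, 6 wins, 2 losses, and 2 ties give 75.0, since the two ties are dropped from the denominator.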
BibTeX
@misc{tang2026seekingphysicsdiffusionnoise,
title={Seeking Physics in Diffusion Noise},
author={Chujun Tang and Lei Zhong and Fangqiang Ding},
year={2026},
eprint={2603.14294},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.14294},
}