
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Authors: Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu

TODO

  • ✅ Release inference code and model weights
  • ⬜ Release training code

Abstract

Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to their reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX 4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP (Zhang et al., 2024), it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130×. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, making it the first diffusion VSR method suitable for low-latency online deployment.
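
To make the pipeline concrete, the sketch below mimics the streaming loop the abstract describes: each incoming frame is denoised in a few steps, conditioned only on past outputs (causal, auto-regressive). All modules here are trivial stand-ins for illustration; they are not the paper's actual networks or this repository's API.

import torch
import torch.nn as nn

class DummyDenoiser(nn.Module):
    # Stand-in for the four-step distilled latent denoiser.
    def forward(self, latent, t, cue):
        return latent - 0.25 * (latent - cue)  # toy update pulling the latent toward the cue

class DummyARTG(nn.Module):
    # Stand-in for Auto-regressive Temporal Guidance (ARTG): in the paper this
    # injects motion-aligned cues from the previous output; here it simply
    # reuses the previous latent (zeros for the first frame).
    def forward(self, prev_latent, cur_latent):
        return prev_latent if prev_latent is not None else torch.zeros_like(cur_latent)

@torch.no_grad()
def stream_vsr(lr_latents, denoiser, artg, num_steps=4):
    outputs, prev = [], None
    for z_lr in lr_latents:                    # frames arrive one at a time (online)
        z = torch.randn_like(z_lr)             # each frame starts from noise
        for t in reversed(range(num_steps)):   # few-step denoising
            cue = artg(prev, z)                # guidance from past frames only
            z = denoiser(z + z_lr, t, cue)     # condition on the LR frame and the cue
        prev = z                               # auto-regressive state for the next frame
        outputs.append(z)
    return outputs

frames = [torch.randn(1, 4, 32, 32) for _ in range(3)]
print(len(stream_vsr(frames, DummyDenoiser(), DummyARTG())))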

Usage

Environment

The code is based on Python 3.9, CUDA 11, and diffusers; development and testing were conducted primarily on Ubuntu 24.04 LTS.

Conda setup

git clone https://github.com/jamichss/Stream-DiffVSR.git
cd Stream-DiffVSR
conda env create -f requirements.yml
conda activate stream-diffvsr

Users with RTX 6000 Pro or RTX 50-series GPUs may need to set up their environment differently by following the instructions below. For more details, please refer to Issue #10. We thank medienbueroleipzig, tpc2233, and b00nwin for testing and providing the detailed instructions!

## Conda setup for RTX 6000 Pro / RTX 50-Series GPUs
git clone https://github.com/jamichss/Stream-DiffVSR.git
cd Stream-DiffVSR
# 1. Create conda environment
conda create --prefix ./Diff_env python==3.10 -y
# 2. Activate the environment
conda activate ./Diff_env
# 3. Install pip dependencies
pip install -r requirements-cu12.txt
pip install --upgrade transformers peft diffusers accelerate
pip install xformers==0.0.32.post2
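
Whichever setup you used, a quick sanity check (a generic snippet, not part of this repository) confirms that PyTorch sees your GPU and that the key libraries import:

python - <<'EOF'
import torch, diffusers, transformers
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers", diffusers.__version__, "| transformers", transformers.__version__)
EOF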

Pretrained models

Pretrained models are available here. You do not need to download them manually; they are fetched automatically by the inference code.

Inference

You can run inference directly with the following command; the script automatically fetches the necessary checkpoints.

python inference.py \
    --model_id 'Jamichsu/Stream-DiffVSR' \
    --out_path 'YOUR_OUTPUT_PATH' \
    --in_path 'YOUR_INPUT_PATH' \
    --num_inference_steps 4

The expected file structure for the inference input data is outlined below. The model processes individual video sequences contained within subdirectories.

YOUR_INPUT_PATH/
├── seq1/
│   ├── frame_0001.png
│   ├── frame_0002.png
│   └── ...
├── seq2/
│   ├── frame_0001.png
│   ├── frame_0002.png
│   └── ...

For additional acceleration with NVIDIA TensorRT, run the command below. Note that TensorRT may slightly degrade output quality in exchange for significant performance gains. The image_height and image_width parameters are required when using TensorRT and are otherwise not needed.

Note: TensorRT mainly improves speed/throughput, while xFormers helps reduce GPU memory usage. The two are currently incompatible, so xFormers-based memory optimizations are unavailable when TensorRT is enabled; this may significantly increase GPU memory usage and lead to OOM issues at higher resolutions.

python inference.py \
    --model_id 'Jamichsu/Stream-DiffVSR' \
    --out_path 'YOUR_OUTPUT_PATH' \
    --in_path 'YOUR_INPUT_PATH' \
    --num_inference_steps 4 \
    --enable_tensorrt \
    --image_height <YOUR_OUTPUT_HEIGHT> \
    --image_width <YOUR_OUTPUT_WIDTH>

The first time you run the TensorRT command at a new output resolution, expect a long wait while the dedicated TensorRT engine is built. This compilation is a one-time setup step for that specific resolution; subsequent runs at the same setting reuse the engine for accelerated inference.

Citation

If you find this work useful, please consider citing:

@article{shiu2025stream,
  title={Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion},
  author={Shiu, Hau-Shiang and Lin, Chin-Yang and Wang, Zhixiang and Hsiao, Chi-Wei and Yu, Po-Fan and Chen, Yu-Chih and Liu, Yu-Lun},
  journal={arXiv preprint arXiv:2512.23709},
  year={2025}
}

About

The official repository of the paper "Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion".
