Abstract
Diffusion-based generative models for speech enhancement often struggle to balance enhancement quality against inference cost. To address this, we propose Variance-Preserving Velocity-guided Interpolant Diffusion (VPVID), a novel framework that achieves competitive enhancement performance while remaining computationally efficient. Our approach builds on a scalable interpolant framework that reformulates the reverse diffusion process in terms of velocity terms and state variables. Instead of a traditional score-matching objective, we train with a velocity-based loss that directly estimates the instantaneous rate of change, which yields more stable training and more efficient learning of the data distribution. We further combine stochastic diffusion sampling with probability flow ordinary differential equations (ODEs), augmented by an adaptive corrector mechanism, giving a flexible sampling strategy that balances quality and efficiency. Extensive experiments on the VoiceBank-DEMAND and WSJ0-CHiME3 datasets show that VPVID significantly outperforms existing baselines across multiple metrics and is particularly strong at noise separation, with SI-SIR improvements of up to 4.7 dB. VPVID also runs up to 7× faster at inference than existing diffusion-based methods while maintaining excellent speech enhancement and dereverberation performance.
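To make the idea of a velocity-based objective and probability flow ODE sampling more concrete, here is a minimal, generic sketch in NumPy. It is not the VPVID formulation: the cosine/sine schedule, the noise-to-clean interpolant, and all function names (`vp_coeffs`, `target_velocity`, `velocity_loss`, `euler_ode_sample`) are assumptions chosen for illustration, and the conditioning on the noisy input that a speech enhancement model needs is left out.

```python
import numpy as np

def vp_coeffs(t):
    """Assumed variance-preserving schedule: alpha(t)^2 + sigma(t)^2 = 1."""
    alpha = np.cos(0.5 * np.pi * t)
    sigma = np.sin(0.5 * np.pi * t)
    return alpha, sigma

def vp_coeff_derivs(t):
    """Time derivatives of the assumed schedule."""
    d_alpha = -0.5 * np.pi * np.sin(0.5 * np.pi * t)
    d_sigma = 0.5 * np.pi * np.cos(0.5 * np.pi * t)
    return d_alpha, d_sigma

def interpolant(x0, eps, t):
    """State x_t = alpha(t) * x0 + sigma(t) * eps (clean data at t=0, noise at t=1)."""
    alpha, sigma = vp_coeffs(t)
    return alpha * x0 + sigma * eps

def target_velocity(x0, eps, t):
    """Instantaneous rate of change d x_t / dt of the interpolant."""
    d_alpha, d_sigma = vp_coeff_derivs(t)
    return d_alpha * x0 + d_sigma * eps

def velocity_loss(v_pred, x0, eps, t):
    """Mean squared error between a predicted velocity and the target velocity."""
    return float(np.mean((v_pred - target_velocity(x0, eps, t)) ** 2))

def euler_ode_sample(velocity_fn, x_start, n_steps=50):
    """Deterministic sampling: integrate dx/dt = v(x, t) from t=1 down to t=0 with Euler steps."""
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    x = x_start
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur)
    return x

# Sanity check with an oracle velocity (a trained network would replace this).
rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)    # stand-in for a clean signal
eps = rng.standard_normal(16)   # Gaussian noise
oracle = lambda x, t: target_velocity(x0, eps, t)
x_hat = euler_ode_sample(oracle, interpolant(x0, eps, 1.0), n_steps=1000)
```

In this toy setup the oracle velocity carries the state from pure noise back to `x0`, so `x_hat` approximates `x0` up to the Euler discretization error. Fewer steps give faster but coarser sampling, which is the kind of quality and efficiency trade-off a sampler has to manage.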
Speech Enhancement Experiments (WSJ0-CHiME3)
Audio samples demonstrating noise reduction performance on the WSJ0-CHiME3 dataset.
Sample 1: 051o0211
Clean (Target)

Noisy (Input)

SGMSEP

VPIDM

FLOWSE

VPVID (Ours)

VPVID-PC (Ours)

VPVID-ODE (Ours)

Sample 2: 22ga010f
Clean (Target)

Noisy (Input)

SGMSEP

VPIDM

FLOWSE

VPVID (Ours)

VPVID-PC (Ours)

VPVID-ODE (Ours)

Sample 3: 422c020o
Clean (Target)

Noisy (Input)

SGMSEP

VPIDM

FLOWSE

VPVID (Ours)

VPVID-PC (Ours)

VPVID-ODE (Ours)

Sample 4: 423o0304
Clean (Target)

Noisy (Input)

SGMSEP

VPIDM

FLOWSE

VPVID (Ours)

VPVID-PC (Ours)

VPVID-ODE (Ours)

Speech Dereverberation Experiments (WSJ0-Reverb)
Audio samples demonstrating reverberation removal performance on the WSJ0-Reverb dataset.
Sample 1: 441c0208
Anechoic (Target)

Reverb (Input)

SGMSEP

VPIDM

FLOWSE

VPVID (Ours)

VPVID-PC (Ours)

VPVID-ODE (Ours)

Sample 2: 441o030y
Anechoic (Target)

Reverb (Input)

SGMSEP

VPIDM

FLOWSE

VPVID (Ours)

VPVID-PC (Ours)

VPVID-ODE (Ours)

Sample 3: 442o0301
Anechoic (Target)

Reverb (Input)

SGMSEP

VPIDM

FLOWSE

VPVID (Ours)

VPVID-PC (Ours)

VPVID-ODE (Ours)

Sample 4: 447c020s_597_1.07_-8.1
Anechoic (Target)

Reverb (Input)

SGMSEP

VPIDM

FLOWSE

VPVID (Ours)

VPVID-PC (Ours)

VPVID-ODE (Ours)
