VPVID: Variance-Preserving Velocity-Guided Interpolant Diffusion for Speech Enhancement and Dereverberation

Abstract

Diffusion-based generative models for speech enhancement often struggle to balance enhancement quality against inference efficiency. To address this, we propose Variance-Preserving Velocity-guided Interpolant Diffusion (VPVID), a novel framework that achieves competitive enhancement performance while remaining computationally efficient. Our approach builds on a scalable interpolant framework that reconstructs the reverse diffusion process from velocity terms and state variables. In place of a traditional score-matching objective, we train with a velocity-based loss that directly estimates the instantaneous rate of change of the state, which yields more stable training and more efficient learning of the data distribution. We further combine stochastic diffusion sampling with probability flow ordinary differential equations (ODEs), augmented by an adaptive corrector mechanism, to obtain a flexible sampling strategy that balances quality and efficiency. Extensive experiments on the VoiceBank-DEMAND and WSJ0-CHiME3 datasets demonstrate that VPVID significantly outperforms existing baselines across multiple metrics and excels in particular at noise separation, with an SI-SIR improvement of up to 4.7 dB. Furthermore, VPVID achieves up to 7× faster inference than existing diffusion-based methods while maintaining excellent speech enhancement and dereverberation performance.
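To make the idea of a velocity-based objective concrete, the following is a minimal sketch of a generic variance-preserving interpolant and its velocity target. It is not the VPVID formulation itself: the trigonometric schedule, the function names (vp_coefficients, interpolant_and_velocity, velocity_loss), and the use of plain Gaussian noise as the second endpoint are all assumptions made for illustration, and the conditioning on the noisy speech signal is omitted.

```python
import math


def vp_coefficients(t):
    """Assumed trigonometric schedule with alpha(t)^2 + sigma(t)^2 = 1.

    Returns alpha, sigma and their time derivatives at time t in [0, 1].
    """
    half_pi_t = 0.5 * math.pi * t
    alpha, sigma = math.cos(half_pi_t), math.sin(half_pi_t)
    d_alpha = -0.5 * math.pi * sigma  # d/dt cos(pi t / 2)
    d_sigma = 0.5 * math.pi * alpha   # d/dt sin(pi t / 2)
    return alpha, sigma, d_alpha, d_sigma


def interpolant_and_velocity(x0, eps, t):
    """Return the state x_t and its instantaneous rate of change v_t.

    x_t = alpha(t) * x0 + sigma(t) * eps
    v_t = alpha'(t) * x0 + sigma'(t) * eps
    Here x0 stands for a clean sample and eps for a noise sample
    (hypothetical choice of endpoints).
    """
    alpha, sigma, d_alpha, d_sigma = vp_coefficients(t)
    x_t = [alpha * a + sigma * b for a, b in zip(x0, eps)]
    v_t = [d_alpha * a + d_sigma * b for a, b in zip(x0, eps)]
    return x_t, v_t


def velocity_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)
```

In training, a network would receive x_t and t (and, for enhancement, the noisy observation) and be regressed onto v_t with velocity_loss. Because the coefficients satisfy alpha^2 + sigma^2 = 1, the state keeps unit variance whenever x0 and eps are independent with unit variance, which is the sense in which such a schedule is variance preserving.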