Region-of-Interest Aware Diffusion Models for Controllable Video Editing

Ahsan Raza Siddiqui; Muhammad Taha Qureshi

Authors

Ahsan Raza Siddiqui, Department of Information Technology, Karakoram International University, University Road, Gilgit 15100, Gilgit-Baltistan, Pakistan
Muhammad Taha Qureshi, Department of Software Engineering, Mohammad Ali Jinnah University, Street 10, Phase II, Gulshan-e-Iqbal, Karachi 75300, Pakistan

Abstract

Diffusion-based generative models have recently become a prominent approach for controllable image and video synthesis, enabling a range of applications in creative production, content retargeting, and post-processing workflows. These models typically operate over high-dimensional spatiotemporal tensors and rely on iterative denoising processes guided by conditioning signals such as text or exemplars. However, most existing approaches treat the video volume in a spatially uniform manner, which limits their ability to perform localized, semantically meaningful edits that preserve contextual consistency outside a user-specified region. This is a significant limitation in practical video editing scenarios, where users often require precise modifications within a region of interest while maintaining global coherence. This paper investigates region-of-interest aware diffusion models for controllable video editing, in which user-specified spatial or spatiotemporal regions guide the evolution of the denoising process. The proposed formulation treats regions of interest as first-class conditioning objects that influence sampling dynamics, attention patterns, and loss weighting. A tensorial representation of region masks is integrated into the diffusion process to jointly regulate spatial focus, temporal consistency, and identity preservation outside the edited areas. The study explores both training-time and sampling-time mechanisms for region control, including weighted reconstruction objectives and region-aware score fields. Experimental analyses on diverse editing tasks, including object replacement, attribute modification, and localized stylization, indicate that region-of-interest aware diffusion provides controllable behavior while maintaining temporal stability and content preservation in non-edited regions.
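
As a rough illustration of two ideas named in the abstract, a weighted reconstruction objective and preservation of content outside the edited region, the following minimal sketch may help. It is not the authors' implementation: the function names, the tensor layout (T, H, W, C), the binary mask convention, and the weights w_in and w_out are all assumptions made for this example.

```python
import numpy as np

def roi_weighted_mse(pred_noise, true_noise, mask, w_in=1.0, w_out=0.1):
    """Mask-weighted mean squared error over a video tensor (illustrative only).

    pred_noise, true_noise: arrays of shape (T, H, W, C)
    mask: array of shape (T, H, W), 1 inside the region of interest, 0 outside
    w_in, w_out: hypothetical weights for errors inside / outside the region
    """
    weights = w_in * mask + w_out * (1.0 - mask)   # per-pixel weight, (T, H, W)
    sq_err = (pred_noise - true_noise) ** 2        # (T, H, W, C)
    weighted = weights[..., None] * sq_err         # broadcast over channels
    # Normalize by total weight so the loss scale does not depend on mask size.
    return weighted.sum() / (weights.sum() * pred_noise.shape[-1])

def blend_outside_roi(edited, source, mask):
    """Keep edited content inside the mask and source content outside it."""
    m = mask[..., None]                            # broadcast over channels
    return m * edited + (1.0 - m) * source
```

In this sketch, a larger ratio of w_in to w_out concentrates the training signal on the region of interest, and the blending function is one simple way to leave pixels outside the mask identical to the source video. The paper's own formulation may differ on both points.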
