This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics, such as non-rigid subject motion and complex camera movements, that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train \textbf{InstructMove}, a model capable of complex, instruction-based manipulations that are difficult to achieve with synthetically generated datasets. Our model achieves state-of-the-art performance on tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
Our data construction pipeline. (a) We begin by sampling suitable frame pairs from videos, ensuring realistic and moderate transformations. (b) These frame pairs are used to prompt Multimodal Large Language Models (MLLMs) to generate detailed editing instructions. (c) This process results in a large-scale dataset with realistic image pairs and precise editing instructions.
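To make stages (a) and (b) concrete, below is a minimal Python sketch of frame-pair sampling followed by MLLM prompting. The gap sizes, similarity thresholds, `similarity` scorer, and prompt wording are illustrative assumptions, not the exact settings used in the paper.

```python
# Hedged sketch of the data pipeline: sample a frame pair with a moderate temporal
# gap, filter out scene cuts and near-duplicates, then ask an MLLM for an instruction.
import random

def sample_frame_pair(frames, min_gap=8, max_gap=48, sim_low=0.5, sim_high=0.95,
                      similarity=None):
    """Pick two frames separated by a moderate temporal gap, keeping the pair only if
    they share content (similarity above sim_low) yet differ enough to imply a
    meaningful transformation (similarity below sim_high)."""
    assert len(frames) > max_gap, "video too short for the chosen gap"
    i = random.randrange(0, len(frames) - max_gap)
    j = i + random.randint(min_gap, max_gap)
    src, tgt = frames[i], frames[j]
    if similarity is not None:
        s = similarity(src, tgt)              # e.g., a CLIP or SSIM score in [0, 1]
        if not (sim_low <= s <= sim_high):
            return None                       # reject scene cuts and near-duplicates
    return src, tgt

def build_instruction_prompt():
    """Hypothetical prompt handed to the MLLM together with the sampled frame pair."""
    return ("You are given a source image and a target image taken from the same video. "
            "Write a concise editing instruction describing how to transform the source "
            "into the target (e.g., changes in pose, object position, or camera viewpoint).")
```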
Overview of the proposed model architecture for instruction-based image editing. The source and target images are first encoded into latent representations $z_s$ and $z_e$ using a pretrained encoder. The target latent $z_e$ is then transformed into a noisy latent $z_e^t$ through the forward diffusion process. We concatenate the source latent and the noisy target latent along the width dimension to form the model input, which is fed into the denoising U-Net $\epsilon_\theta$ to predict a noise map. The right half of the output, corresponding to the noisy target input, is cropped and compared with the noise added during the forward process to compute the training loss.
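The following is a minimal PyTorch sketch of this training step under the description above. Here `encoder`, `unet`, and `scheduler` are stand-ins for the pretrained VAE encoder, the denoising U-Net $\epsilon_\theta$, and a diffusion noise scheduler exposing `add_noise` and `num_train_timesteps`; the shapes and calling conventions are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def training_step(encoder, unet, scheduler, src_img, tgt_img, text_emb):
    # Encode both images into latent space with the frozen pretrained encoder.
    with torch.no_grad():
        z_s = encoder(src_img)   # source latent, shape [B, C, h, w]
        z_e = encoder(tgt_img)   # target latent, shape [B, C, h, w]

    # Forward diffusion: add noise to the target latent at a random timestep.
    noise = torch.randn_like(z_e)
    t = torch.randint(0, scheduler.num_train_timesteps, (z_e.shape[0],),
                      device=z_e.device)
    z_e_t = scheduler.add_noise(z_e, noise, t)

    # Spatial conditioning: concatenate along the width axis -> [B, C, h, 2w].
    model_input = torch.cat([z_s, z_e_t], dim=-1)
    pred = unet(model_input, t, text_emb)   # predicted noise map, shape [B, C, h, 2w]

    # Only the right half corresponds to the noisy target; compare it with the added noise.
    pred_target = pred[..., z_s.shape[-1]:]
    return F.mse_loss(pred_target, noise)
```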
Existing methods struggle with complex edits such as non-rigid transformations (e.g., changes in pose and expression), object repositioning, or viewpoint adjustments. They often either fail to follow the editing instructions or produce images with inconsistencies, such as identity shifts. In contrast, our method, trained on real video frames with naturalistic transformations, successfully handles these edits while maintaining consistency with the original input images.
Utilizing local masks and additional controls for localized and more precise edits. (a) Our model can utilize a mask to specify which part of the image to edit, enabling localized adjustments and resolving ambiguities in the instructions. (b) When combined with a pretrained ControlNet, our model can accept additional inputs, such as human poses or rough sketches, to achieve precise edits in subject poses or object positioning. This level of control is not possible with previous methods.
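The caption states that a mask can restrict edits to a region, but does not spell out the mechanism. One common way to enforce such a mask at inference time is latent blending during sampling, sketched below under that assumption; `denoise_step`, `scheduler`, and the timestep handling are placeholders, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def masked_sampling(denoise_step, scheduler, z_s, mask, timesteps):
    # Start from pure noise at the latent resolution of the source image.
    z = torch.randn_like(z_s)
    for t in timesteps:
        # One model-guided denoising step (wraps the U-Net and scheduler update).
        z = denoise_step(z, z_s, t)
        # Re-noise the source latent to the current noise level and keep it outside
        # the mask, so unmasked regions stay identical to the source.
        # (The exact timestep alignment is simplified for this sketch.)
        z_s_t = scheduler.add_noise(z_s, torch.randn_like(z_s), t)
        z = mask * z + (1.0 - mask) * z_s_t   # mask is 1 where edits are allowed
    return z
```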
We can finetune a larger, more powerful text-to-image (T2I) model (e.g., FLUX.dev) on our dataset using the proposed Spatial Conditioning strategy to achieve higher-quality, high-resolution image edits.
@misc{cao2024instructionbasedimagemanipulationwatching,
title={Instruction-based Image Manipulation by Watching How Things Move},
author={Mingdeng Cao and Xuaner Zhang and Yinqiang Zheng and Zhihao Xia},
year={2024},
eprint={2412.12087},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.12087},
}