Customizing Text-to-Image Diffusion with Camera Viewpoint Control


Our method adds control over the camera viewpoint of a custom object in images generated by text-to-image diffusion models, such as Stable Diffusion. Given a 360-degree multi-view dataset (~50 images), we fine-tune FeatureNeRF blocks in the intermediate feature space of the diffusion model to condition generation of the object on a target camera pose.


Model customization introduces new concepts to existing text-to-image models, enabling generation of the new concept in novel contexts. However, such methods lack accurate camera view control with respect to the object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task: enabling explicit control of the camera viewpoint for model customization. This allows us to modify object properties amongst various background scenes via text prompts, all while incorporating the target camera pose as additional control. This new task presents significant challenges in merging a 3D representation from the multi-view images of the new concept with a general, 2D text-to-image model. To bridge this gap, we propose to condition the 2D diffusion process on rendered, view-dependent features of the new object. During training, we jointly adapt the 2D diffusion modules and 3D feature predictions to reconstruct the object's appearance and geometry while reducing overfitting to the input multi-view images. Our method performs on par with or better than existing image editing and model personalization baselines in preserving the custom object's identity while following the input text prompt and camera pose for the object.


Given multi-view images of a target concept (e.g., a teddy bear), our method customizes a text-to-image diffusion model with that concept, with the target camera pose as an additional condition. We modify a subset of transformer layers to be pose-conditioned by adding a new FeatureNeRF block in the intermediate feature space of each such layer. We fine-tune the new weights on the multi-view dataset while keeping the pre-trained model weights frozen.
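The conditioning described above can be sketched as a residual injection of rendered, pose-dependent feature tokens into the frozen layer's intermediate features. The sketch below is a toy stand-in, not the actual implementation: `render_feature_nerf` replaces the real FeatureNeRF volume renderer with a small random-weight MLP, and all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_feature_nerf(pose, n_tokens, dim, weights):
    """Toy stand-in for a FeatureNeRF block: maps a 4x4 camera-to-world
    pose to a grid of view-dependent feature tokens via a tiny MLP.
    (The real block renders features from a 3D representation.)"""
    x = pose.reshape(-1)                       # flatten the camera matrix
    h = np.tanh(x @ weights["w1"])             # hidden activation
    return (h @ weights["w2"]).reshape(n_tokens, dim)

def pose_conditioned_layer(features, pose, weights):
    """Condition a (frozen) transformer layer's intermediate features
    on the target pose by adding the rendered tokens as a residual."""
    rendered = render_feature_nerf(pose, *features.shape, weights)
    return features + rendered

# Illustrative shapes: 16 spatial tokens of dimension 8, hidden size 32.
n_tokens, dim, hidden = 16, 8, 32
weights = {
    "w1": rng.normal(size=(16, hidden)) * 0.1,
    "w2": rng.normal(size=(hidden, n_tokens * dim)) * 0.1,
}
feats = rng.normal(size=(n_tokens, dim))       # intermediate diffusion features
pose = np.eye(4)                               # identity camera-to-world pose
out = pose_conditioned_layer(feats, pose, weights)
print(out.shape)  # (16, 8)
```

During training, only the FeatureNeRF weights (here `w1`, `w2`) would receive gradients; the surrounding transformer weights stay frozen.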

Comparison to Baselines

All our results are based on the SDXL model. We customize the model on various categories of multi-view images, e.g., car, teddy bear, toys, and motorcycle. We compare our method with image-editing methods, 3D NeRF editing methods, and customization methods. Specifically, we compare with SDEdit, LEDITS++, ViCA-NeRF, and our proposed baseline of LoRA + camera pose, where we concatenate the camera matrix with the text prompt.
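The LoRA + camera pose baseline can be sketched as appending a projected camera matrix to the text-conditioning sequence. This is a minimal illustration under assumed shapes; the projection `proj` stands in for a learned layer and is a hypothetical name, not part of any released code.

```python
import numpy as np

def concat_pose_to_text(text_tokens, camera_matrix, proj):
    """Baseline conditioning: project the flattened 4x4 camera matrix
    to the text-embedding dimension and append it as one extra token."""
    pose_token = camera_matrix.reshape(1, -1) @ proj   # (1, dim)
    return np.concatenate([text_tokens, pose_token], axis=0)

rng = np.random.default_rng(1)
dim = 8                                      # toy embedding dimension
text_tokens = rng.normal(size=(5, dim))      # e.g., 5 prompt-token embeddings
proj = rng.normal(size=(16, dim)) * 0.1      # stand-in for a learned projection
pose = np.eye(4)
cond = concat_pose_to_text(text_tokens, pose, proj)
print(cond.shape)  # (6, 8)
```

The diffusion model's cross-attention then attends to this extended sequence, so the pose influences generation only through a single global token, which is one reason this baseline offers coarser view control than rendered per-location features.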


We show results of varying the target camera pose and text prompt for each custom concept. More results are shown in the Gallery.


We show that our method can be used in various applications, such as generating the object in varying poses while keeping the same background (using SDEdit) or generating panoramic scenes. We can also compose multiple instances of the object in FeatureNeRF space to generate scenes with a desired relative camera pose between the two objects.
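Composing two instances requires specifying each instance's camera pose; the second pose can be derived from the first and a desired relative transform by matrix composition. A minimal sketch, with an assumed camera-to-world convention and identity rotations for clarity:

```python
import numpy as np

def translate_pose(tx, ty, tz):
    """Camera-to-world pose with identity rotation and given translation."""
    pose = np.eye(4)
    pose[:3, 3] = [tx, ty, tz]
    return pose

# Place the second instance at a desired pose relative to the first:
# right-multiplying by the relative transform composes the two poses.
pose_a = translate_pose(0.0, 0.0, 2.0)       # first instance's camera pose
relative = translate_pose(1.0, 0.0, 0.0)     # shift 1 unit along x
pose_b = pose_a @ relative
print(pose_b[:3, 3])  # [1. 0. 2.]
```

Each instance's FeatureNeRF would then be rendered under its own pose before the features are merged; the merge strategy itself is not shown here.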


Our method still has several limitations. It fails when we extrapolate the target camera pose far from the seen training views; the SDXL model has a bias toward generating front-facing objects in such scenarios. It also sometimes fails to incorporate the text prompt, especially when composing with another object.

Concurrent Works