Model customization introduces new concepts to existing text-to-image models, enabling generation of the new concept in novel contexts. However, such methods lack accurate camera view control with respect to the object, and users must resort to prompt engineering (e.g., adding ``top-view'') to achieve coarse view control. In this work, we introduce a new task -- enabling explicit control of the camera viewpoint for model customization. This allows us to modify object properties and place the object in various background scenes via text prompts, all while conditioning on a target camera pose as an additional control. This new task presents significant challenges in merging a 3D representation built from the multi-view images of the new concept with a general, 2D text-to-image model. To bridge this gap, we propose to condition the 2D diffusion process on rendered, view-dependent features of the new object. During training, we jointly adapt the 2D diffusion modules and 3D feature predictions to reconstruct the object's appearance and geometry while reducing overfitting to the input multi-view images. Our method performs on par with or better than existing image editing and model personalization baselines in preserving the custom object's identity while following the input text prompt and camera pose for the object.
Given multi-view images of a target concept (e.g., a Teddy bear), our method customizes a text-to-image diffusion model with that concept, additionally conditioned on a target camera pose. We modify a subset of transformer layers to be pose-conditioned by adding a new FeatureNeRF block in the intermediate feature space of each such layer. We fine-tune the new weights on the multi-view dataset while keeping the pre-trained model weights frozen.
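To make the layer-level design concrete, below is a minimal PyTorch sketch of a pose-conditioned transformer block, not the authors' implementation: the `FeatureNeRF` placeholder, the flattened 4x4 pose embedding, and the residual merge are illustrative assumptions; only the wrapping of a frozen pre-trained block with a new trainable module follows the description above.

```python
import torch
import torch.nn as nn


class FeatureNeRF(nn.Module):
    """Placeholder for a 3D feature field rendered at the target camera pose."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # A tiny MLP standing in for the volumetric, view-dependent feature predictor.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 16, feat_dim), nn.SiLU(), nn.Linear(feat_dim, feat_dim)
        )

    def forward(self, tokens: torch.Tensor, pose_embed: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) intermediate features; pose_embed: (B, 16) flattened camera pose.
        pose = pose_embed[:, None, :].expand(-1, tokens.shape[1], -1)
        return self.mlp(torch.cat([tokens, pose], dim=-1))


class PoseConditionedBlock(nn.Module):
    def __init__(self, pretrained_block: nn.Module, feat_dim: int):
        super().__init__()
        self.block = pretrained_block              # pre-trained transformer layer
        self.feature_nerf = FeatureNeRF(feat_dim)  # new, trainable weights
        for p in self.block.parameters():
            p.requires_grad_(False)                # keep base model weights frozen

    def forward(self, x: torch.Tensor, pose_embed: torch.Tensor) -> torch.Tensor:
        h = self.block(x)
        # Inject rendered, view-dependent features as a residual in feature space.
        return h + self.feature_nerf(h, pose_embed)
```

During fine-tuning, only the `FeatureNeRF` parameters would receive gradients, so the customized model keeps the general-purpose prior of the frozen backbone.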
We show that our method can be used in various applications, such as generating the object in varying poses while keeping the same background using SDEdit, or generating panorama scenes. We can also compose multiple instances of the object in FeatureNeRF space to generate scenes with a desired relative camera pose between the two instances.
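As a rough illustration of the compositing idea, the sketch below reuses the hypothetical `FeatureNeRF` placeholder from above; the relative-pose handling and the simple additive merge of the two feature residuals are assumptions for illustration, not the paper's exact compositing rule.

```python
import torch


def compose_instances(h, nerf_a, nerf_b, pose_a, rel_pose):
    """Render two object instances into the same intermediate feature map.

    h:         (B, N, C) intermediate transformer features
    pose_a:    (B, 4, 4) camera pose for instance A
    rel_pose:  (B, 4, 4) desired relative pose of instance B w.r.t. A
    """
    pose_b = torch.bmm(pose_a, rel_pose)    # place instance B relative to A
    feats_a = nerf_a(h, pose_a.flatten(1))  # (B, N, C) view-dependent features of A
    feats_b = nerf_b(h, pose_b.flatten(1))  # (B, N, C) view-dependent features of B
    return h + feats_a + feats_b            # merge both residuals into the feature map
```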
Our method still has several limitations. It fails when the target camera pose is extrapolated far from the seen training views; in such scenarios, the SDXL model is biased toward generating front-facing objects. It also sometimes fails to incorporate the text prompt, especially when composing the object with another object.