L-MAGIC: Language Model Assisted Generation of Images with Coherence

Zhipeng Cai, Matthias Mueller, Reiner Birkl, Diana Wofk, Shao-Yen Tseng,
Junda Cheng, Gabriela Ben-Melech Stan, Vasudev Lal, Michael Paulitsch
Intel Labs
Corresponding author (zhipeng.cai@intel.com)

Abstract

In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC, a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360° panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language models without fine-tuning, ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view rendering quality compared to related works, with >70% preference in human evaluations. Combined with conditional diffusion models, L-MAGIC can accept various input modalities, including but not limited to text, depth maps, sketches, and colored scripts. Applying depth estimation further enables 3D point cloud generation and dynamic scene exploration with fluid camera motion. Code is available at https://github.com/IntelLabs/MMPano.

1 Introduction

[Figure 1]

Diffusion models have achieved state-of-the-art performance in image generation. However, generating a 360° panoramic scene from a single perspective image remains a challenge. This is an important problem in many computer vision applications, such as architecture design, movie scene creation, and virtual reality (VR).

Training a model to directly generate panoramic images is challenging due to the lack of diverse large-scale datasets. Hence, most existing works separate panoramic scenes into multiple perspective views and inpaint them using pre-trained diffusion models. To ensure generalization, the diffusion model is either frozen without any architecture change [11] or combined with extra modules trained on small datasets for integrating multi-view information [25].

A common approach to encode the scene information during multi-view inpainting is to provide a text-conditioned diffusion model with a description of the input image, generated by a user or an image captioning model [13]. Though effective for extending local scene content, such approaches suffer from incoherence in the overall scene layout. Specifically, using the same text description for diffusing different views along the 360° panorama leads to artifacts and unnecessarily repeated objects. Current inpainting methods have no mechanism to leverage global scene information in individual views.

In this work, we show that state-of-the-art (vision) language models, without fine-tuning, can be used to control multi-view diffusion and effectively address the above problem. We propose L-MAGIC (Fig. 1), a novel framework leveraging large language models to enable the automatic generation of diverse yet coherent 360° views from a given input image. L-MAGIC relies on iterative warping-and-inpainting. Pre-trained (vision) language models are used to: (1) generate layout descriptions of different views that are used in text-conditioned inpainting, (2) automatically determine whether salient objects should be repeated or not for a specific scene, and (3) monitor the inpainting outputs to avoid challenging cases where diffusion models violate the text guidance. A key contribution is the prompt design for language and diffusion models to make L-MAGIC fully automatic. In addition, smooth multi-view fusion and super-resolution techniques are used to ensure high resolution and quality when producing the final panorama.

Experiments show that L-MAGIC generates 360° panoramic scenes with higher quality and more coherent layouts compared to state-of-the-art methods. Not relying on model fine-tuning makes L-MAGIC effective on in-the-wild images, and extendable, using conditional diffusion models [19, 29], to other types of inputs such as text, depth maps, sketch drawings, and color scripts/segmentation masks. Applying depth estimation further enables the creation of 3D point clouds and immersive scene fly-through experiences with both camera rotation and translation.

2 Related Work

Diffusion models. Diffusion models learn to generate data by inverting the diffusion process, i.e., removing the noise in the data (see [26] for a detailed survey). By separating the data generation process into multiple noise removal steps (reverse process) [10], diffusion models learn image synthesis much more effectively than GANs [9]. Recently, latent diffusion models [19] have been proposed to improve the training speed and image synthesis quality by performing the diffusion in latent space. By training the model on large-scale image-caption pairs [21], they achieve remarkable quality and robustness in text-conditioned image synthesis. Further fine-tuning of latent diffusion models using large-mask strategies [24] produces robust text-conditioned inpainting models, which are used in this work.

Panoramic scene generation. Various approaches have been proposed for panoramic scene generation. Some of them [8, 22] treat the panorama as a single equirectangular image and generate it in a single forward pass. However, such approaches struggle to close the loop at both ends of the generated equirectangular image, even with purposely designed spherical positional embeddings [8]. Meanwhile, the lack of large-scale training data makes it impractical to train a generalizable image-to-panorama model and limits its robustness, resulting in outputs inconsistent with the input descriptions.

Some recent methods [11, 25] create panoramas by generating multiple perspective views using robust pre-trained diffusion models trained on large-scale perspective data. Text2room [11] generates 10 rotated views of a panorama using Stable Diffusion v2 inpainting without layout control and focuses on indoor scenes and mesh generation. MVDiffusion [25] ensures multi-view local texture consistency with extra multi-view attention modules fine-tuned on small datasets. Though applicable to both text-to-panorama and image-to-panorama, these methods struggle to generate diverse 360° views. Specifically, there is no mechanism to encode the global scene layout into the generation or inpainting of different views. Hence, conditioning only on the input image or text results in salient objects (e.g., beds in a bedroom) being generated repeatedly across views. In this work, we guide the multi-view diffusion process with large language models to automatically generate panoramic scenes with coherent and diverse 360° layouts.

Language models. Recent language model advancements have enabled many important applications (see [30] for details). Trained on large-scale data with humans in the loop, ChatGPT [4] has demonstrated super-human performance on various language-based tasks. In this work, we utilize ChatGPT to automatically generate coherent multi-view scene descriptions for consumption by a pre-trained diffusion model that generates multiple perspective views of a panoramic scene. Leveraging multi-modal data, vision language models [28] have further enhanced language models to understand visual inputs. In this work, we utilize pre-trained VQA models [13] to automatically generate a scene description for the input image and to avoid unnecessarily repeated objects across the generated panoramic scene.

[Figure 2]

3 Methodology

The goal of this work is to generate a coherent 360° panoramic scene given a single (perspective) image. Note that this setting is very general, since the input image could either be captured in the real world or synthesized. For example, using conditional diffusion models [19, 29], one can synthesize the input image from inputs such as text descriptions, sketches, depth images, and so on.

As shown in Fig. 2, L-MAGIC generates the panoramic scene with iterative warping-and-inpainting. The warping step generates an incomplete perspective view and the mask of the missing region (Sec. 3.1). The inpainting step completes the masked region with assistance from language models (Sec. 3.2). The final panorama is created by fusing the generated views with some post-processing to enhance the quality and resolution (Sec. 3.3).

3.1 Warping

At each warping step, we project all completed views onto a unit sphere representing the panoramic scene, and then render the next incomplete view to inpaint based on the relative camera pose. To project an image to the unit sphere, we first construct a mesh by defining vertices $\mathcal{V}$ at each image pixel and creating edges $\mathcal{E}$ between adjacent pixels. Then, we project the vertices onto the unit sphere by

$$\mathbf{v}_{\text{sp}} = \frac{\mathbf{K}^{-1}\mathbf{v}}{\|\mathbf{K}^{-1}\mathbf{v}\|},\qquad(1)$$

where $\mathbf{K}$ is the intrinsic matrix, $\mathbf{v}\in\mathcal{V}$ is the homogeneous coordinate [23] of a pixel, and $\mathbf{v}_{\text{sp}}$ is the projected location. To warp a completed view $A$ to a novel view $B$, where $\mathbf{R}$ is the rotation matrix from $A$ to $B$, we rotate each projected vertex $\mathbf{v}_{\text{sp}}$ of $A$ by $\mathbf{v}_{\text{rot}}=\mathbf{R}\mathbf{v}_{\text{sp}}$, and then perform rasterization-based rendering [18], $(\mathcal{I},\mathcal{M})=\text{rasterize}(\mathcal{V}_{\text{rot}},\mathcal{E},\mathbf{K}')$, where $\mathcal{V}_{\text{rot}}$ is the set of rotated vertices and $\mathbf{K}'$ is the intrinsic matrix of image $B$. The output $\mathcal{I}$ is the warped image and $\mathcal{M}$ is a binary mask indicating whether inpainting is required for each pixel (obtained by checking whether ray casting hits a valid mesh face for that pixel).
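To make the warping step concrete, the following is a minimal NumPy sketch of Eq. (1) and the rotation-based warp. It uses nearest-neighbour splatting in place of the mesh rasterization of [18], and all names are illustrative rather than the actual implementation.

```python
import numpy as np

def warp_view(image, K, K_prime, R, out_hw):
    """Warp a completed view to a novel view related by rotation R (cf. Eq. 1).
    Nearest-neighbour splatting stands in for mesh rasterization."""
    h, w = image.shape[:2]
    # Homogeneous pixel coordinates of the source view.
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    v = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=0)   # 3 x HW
    # Eq. (1): back-project pixels onto the unit sphere.
    rays = np.linalg.inv(K) @ v
    v_sp = rays / np.linalg.norm(rays, axis=0, keepdims=True)
    # Rotate onto the novel view and project with its intrinsics K'.
    v_rot = R @ v_sp
    proj = K_prime @ v_rot
    valid = proj[2] > 1e-6                      # keep rays in front of the camera
    x = np.round(proj[0, valid] / proj[2, valid]).astype(int)
    y = np.round(proj[1, valid] / proj[2, valid]).astype(int)
    out_h, out_w = out_hw
    inside = (x >= 0) & (x < out_w) & (y >= 0) & (y < out_h)
    warped = np.zeros((out_h, out_w, 3), dtype=image.dtype)
    mask = np.ones((out_h, out_w), dtype=bool)  # True = needs inpainting
    src = image.reshape(-1, 3)[valid][inside]
    warped[y[inside], x[inside]] = src
    mask[y[inside], x[inside]] = False
    return warped, mask
```

The returned mask plays the role of $\mathcal{M}$: it marks the region that the inpainting model must fill in the next step.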

To ensure the local inpainting consistency of each perspective view, we use a large field of view (FoV) and adjust the rotation angles so that both known and unknown regions are reasonably large after warping. In practice, a FoV of 100 degrees with roughly 40 degrees of rotation between adjacent views works well. To further reduce iterative error accumulation, we expand the scene alternately from both sides of the input image, rather than expanding in a single direction. To ensure a smooth 360° loop closure, we tune the rotation angles so that the final view has a large incomplete region at its center, resulting in a sequence of views with rotation angles of $\{0^\circ, 41^\circ, -41^\circ, 82^\circ, -82^\circ, 123^\circ, 200.5^\circ \text{ (for loop closure)}\}$.

3.2 Inpainting with Language Model Assistance

The inpainting step completes a warped view with a consistent local style and a coherent 360° scene layout. We utilize the Stable Diffusion v2 inpainting model [19], which can effectively extrapolate the large missing region of each warped view while maintaining local style consistency. However, naive inpainting without any prior generates severe artifacts. One common prior explored before [11, 25] is a user-provided text description of the scene or input image. Yet, using the same description for different views may generate duplicate objects, such as multiple beds in a bedroom (see Fig. 4), since perspective inpainting methods have no mechanism to split the layout across views. Please refer to Sec. 4.3 for a detailed ablation study.

Algorithm 1
1: Input: Initial image $\mathcal{I}$, intrinsic matrix $\mathbf{K}$ of $\mathcal{I}$, intrinsics $\mathbf{K}'$ and poses $\mathcal{P}=\{\mathbf{R}_i\}_{i=1}^{N}$ of warped views, vision language model $\mathcal{L}_{\text{v}}(\cdot)$, language model $\mathcal{L}(\cdot)$, text-conditioned inpainting model $f_{\text{inpaint}}(\cdot)$.
2: $d_{\mathcal{I}} \leftarrow \mathcal{L}_{\text{v}}(\mathcal{I}, \text{'Description of input image'})$
3: $d_{360} \leftarrow \mathcal{L}(d_{\mathcal{I}}, \text{'Layout of individual views'})$
4: $d_{\text{scene}} \leftarrow \mathcal{L}(d_{\mathcal{I}}, \text{'Remove object-level descriptions'})$
5: $d_{\text{repeat}} \leftarrow \mathcal{L}(d_{\mathcal{I}}, \text{'Avoid duplicated objects'})$
6: $\mathcal{W}_1, \mathcal{M}_1 \leftarrow \text{warp}(\mathcal{C}, \mathbf{K}, \mathbf{K}', \mathbf{R}_1)$
7: $\mathcal{I}_1 \leftarrow f_{\text{inpaint}}(\mathcal{W}_1, \mathcal{M}_1, d_{\text{scene}})$   ▷ expand the FoV of $\mathcal{I}$
8: $\mathcal{C} \leftarrow \{\mathcal{I}_1\}$
9: for $i = 2$ to $N$ do
10:    $c \leftarrow 0$
11:    $\mathcal{W}_i, \mathcal{M}_i \leftarrow \text{warp}(\mathcal{C}, \mathbf{K}, \mathbf{K}', \mathbf{R}_i)$
12:    $d_i \leftarrow \text{generate\_prompt}(d_{360}, d_{\text{scene}}, d_{\text{repeat}}, i)$
13:    $\mathcal{I}_i \leftarrow f_{\text{inpaint}}(\mathcal{W}_i, \mathcal{M}_i, d_i)$
14:    for each object $\mathcal{O}$ in $d_{\text{repeat}}$ do
15:        if $\mathcal{L}_{\text{v}}(\mathcal{I}_i, \text{'Is } \mathcal{O} \text{ in } \mathcal{I}_i\text{?'}) = \text{'yes'}$ and $c < 20$ then
16:            $c \leftarrow c + 1$
17:            Go to line 13.
18:        end if
19:    end for
20:    $\mathcal{C} \leftarrow \mathcal{C} \cup \{\mathcal{I}_i\}$
21: end for
22: return merge($\mathcal{C}$)

To address these problems, we use a vision language model $\mathcal{L}_{\text{v}}(\cdot)$ (BLIP-2 [13]) and a language model $\mathcal{L}(\cdot)$ (ChatGPT4 [4]) to guide the inpainting process. In the following, we describe our method (Alg. 1) in detail. The exact prompts used in Alg. 1 to interact with the language and diffusion models are provided in Appendix A.

Before warping and inpainting, we first prompt $\mathcal{L}_{\text{v}}(\cdot)$ to generate the description $d_{\mathcal{I}}$ for the input image $\mathcal{I}$ (line 2). We ask two questions so that $d_{\mathcal{I}}$ contains both coarse and fine levels of detail. Next, we ask $\mathcal{L}(\cdot)$ to imagine the global scene layout $d_{360}$ (line 3) based on $d_{\mathcal{I}}$, where each line of $d_{360}$ corresponds to the description of a specific view. To avoid duplicate objects, we request a compact description of each individual view without mentioning objects in other views.

$d_{360}$ mostly describes the objects in individual views. Using such descriptions as the inpainting prompt can lead to inconsistent style across distant views. Hence, we ask $\mathcal{L}(\cdot)$ to remove objects from $d_{\mathcal{I}}$ and obtain the final scene-level description $d_{\text{scene}}$, e.g., 'a bedroom with a wooden bed' becomes 'a bedroom' (line 4). $d_{\text{scene}}$ is later used together with $d_{360}$ to ensure a consistent multi-view style.

Though $d_{\text{scene}}$ ensures multi-view style consistency, the training data bias of diffusion models may still result in objects commonly associated with a particular scene being generated, even if they are not explicitly mentioned in $d_{360}$. For example, a bed is often generated when the word 'bedroom' is in the prompt, resulting in duplicate beds across views. To avoid this problem, we further let $\mathcal{L}(\cdot)$ automatically determine whether there are objects in the scene that require repetition avoidance (line 5).

After each warping step, we use the outputs from lines 2 to 5 to automatically generate the prompt for text-conditioned inpainting (line 12). Specifically, for the warped view with 0° rotation (i = 1), we use $d_{\text{scene}}$ as the prompt ($d_i$) for text-conditioned inpainting (line 7). For other views (line 13), if there is no object in $d_{\text{repeat}}$, i.e., no repetition avoidance is required, we perform inpainting with the prompt 'a peripheral view of <$d_{\text{scene}}$> where we see <the corresponding description in $d_{360}$>'. If any object exists in $d_{\text{repeat}}$, we use the positive prompt 'a peripheral view of <$d_{\text{scene}}$> where we only see <the corresponding description in $d_{360}$>', and the negative prompt 'any type of <the object in $d_{\text{repeat}}$>' (one sentence for each object in $d_{\text{repeat}}$). The positive prompt template prevents Stable Diffusion from generating common objects of an environment (e.g., the bed in a bedroom). The negative prompt template avoids duplication of objects mentioned in $d_{\text{repeat}}$.
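A minimal sketch of how these templates could be assembled into the per-view prompt $d_i$ (line 12 of Alg. 1) is given below; the dictionary layout, the function name, and the example descriptions are assumptions for illustration, not the exact implementation.

```python
def generate_prompt(d_360, d_scene, d_repeat, i):
    """Assemble positive/negative inpainting prompts for view i (i >= 2)."""
    view_desc = d_360[i]                      # per-view layout description from ChatGPT
    if not d_repeat:                          # no repetition avoidance needed
        positive = f"a peripheral view of {d_scene} where we see {view_desc}"
        negative = ""
    else:
        positive = f"a peripheral view of {d_scene} where we only see {view_desc}"
        negative = ". ".join(f"any type of {obj}" for obj in d_repeat)
    return positive, negative

# Example with hypothetical descriptions:
pos, neg = generate_prompt(
    d_360={2: "a wooden desk near a window"},
    d_scene="a bedroom",
    d_repeat=["bed"],
    i=2,
)
# pos: 'a peripheral view of a bedroom where we only see a wooden desk near a window'
# neg: 'any type of bed'
```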

Bias exists in the training data of diffusion models: an image captioned 'a bedroom' usually contains a bed. Therefore, repeated objects can still be generated even with constraints from the prompt. To further alleviate this problem, we use $\mathcal{L}_{\text{v}}(\cdot)$ to check whether each inpainted image $\mathcal{I}_i$ contains objects mentioned in $d_{\text{repeat}}$ (line 15). If the answer is 'yes', we re-run inpainting until the answer becomes 'no' or the maximum number of trials $c$ is reached.
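This duplicate-object check (lines 13-19 of Alg. 1) can be sketched as a simple retry loop around the inpainting call. The trial cap of 20 follows the algorithm, while `inpaint` and `vqa_yes_no` are placeholder callables standing in for the diffusion and VQA models.

```python
def inpaint_with_repetition_check(warped, mask, prompt_pos, prompt_neg,
                                  d_repeat, inpaint, vqa_yes_no, max_trials=20):
    """Re-run inpainting until no object from d_repeat is detected, or the trial cap is hit."""
    for _ in range(max_trials + 1):
        view = inpaint(warped, mask, prompt_pos, prompt_neg)
        detected = [obj for obj in d_repeat
                    if vqa_yes_no(view, f"Is {obj} in the image?") == "yes"]
        if not detected:
            return view          # accepted: no forbidden object present
    return view                  # trial budget exhausted; keep the last sample
```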

3.3 Quality and Resolution Enhancement

Several techniques are also proposed to enhance the quality and resolution of the final panorama.

As shown in Appendix B, adjacent pixels at the center of an image have a larger angular distance than those at the side of an image. When warping a completed view to a novel view, the original central region becomes a side region, making the rendered image blurry due to interpolation. Meanwhile, since the resolution of the Stable Diffusion output is 512×512, a panorama created directly from these images has low resolution. To address both problems, we apply super-resolution [3] to the output $\mathcal{I}_i$ of each inpainting step, increasing the resolution of $\mathcal{I}_i$ to 2048×2048. Then, we warp the high-resolution image to a low-resolution novel view so that no (strong) interpolation is required. After performing all warping and inpainting steps, we simply fuse the super-resolved images to generate a high-resolution panorama.

During warping and panorama generation, multiple perspective images might overlap in the same region. To avoid sharp boundaries when merging them, we perform a weighted average, i.e., given multiple warped pixels at the same location with colors $\mathbf{c}_i$, the final merged pixel color is $\mathbf{c}_{\text{merge}} = \frac{\sum_i w_i \mathbf{c}_i}{\sum_i w_i}$, where the weight $w_i$ is computed as the distance to the nearest image boundary in the original view $i$. This strategy effectively down-weights pixels near the warping boundaries, ensuring a smooth transition during multi-view fusion.
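A sketch of this boundary-distance weighting is shown below. Here `warped_views`, `warped_weights`, and `valid_masks` are assumed to be per-view images, weight maps, and validity masks already warped into the shared target frame; the names are illustrative.

```python
import numpy as np

def boundary_weight(h, w):
    """Per-pixel weight = distance (in pixels) to the nearest boundary of the original view."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.minimum.reduce([xs + 1, w - xs, ys + 1, h - ys]).astype(np.float32)

def fuse_views(warped_views, warped_weights, valid_masks):
    """Weighted average of overlapping warped views; the weight maps are warped with the images."""
    num = np.zeros(warped_views[0].shape, dtype=np.float32)
    den = np.zeros(warped_views[0].shape[:2], dtype=np.float32)
    for img, wgt, valid in zip(warped_views, warped_weights, valid_masks):
        w = wgt * valid                       # zero weight where the view has no content
        num += img.astype(np.float32) * w[..., None]
        den += w
    return num / np.maximum(den[..., None], 1e-6)
```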

To create the final panorama (line 22), we first project each view onto the unit sphere as in Sec. 3.1. Then, we perform the equirectangular projection [5] to warp the projected views onto the same equirectangular plane, and merge them into a single equirectangular image.
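For reference, a sketch of mapping unit-sphere directions to equirectangular pixel coordinates (the projection of [5]) follows; the axis convention (longitude along the image width, latitude along the height) is one common choice and an assumption here.

```python
import numpy as np

def sphere_to_equirect(v_sp, width, height):
    """Map unit-sphere directions (3 x N) to equirectangular pixel coordinates."""
    x, y, z = v_sp
    lon = np.arctan2(x, z)                    # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(y, -1.0, 1.0))    # latitude in [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * (width - 1)
    v = (lat / np.pi + 0.5) * (height - 1)
    return u, v
```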

3.4 Discussion

L-MAGIC is fully automatic; no human interaction is required to link language models and diffusion models. This is realized by 1) careful prompt engineering, which enables language models to output text that can be automatically converted into the inpainting prompt, and 2) handling the edge cases where language models or diffusion models do not generate outputs that satisfy the requirements in the input prompt. For example, ChatGPT sometimes outputs the layout description $d_{360}$ in an erroneous format, making automatic prompt extraction fail catastrophically. We automatically detect such cases and re-run line 3 to keep the algorithm flowing; see Appendix A for more details.

L-MAGIC requires no model fine-tuning, which ensures zero-shot performance and makes it capable of accepting other types of inputs by leveraging conditional generative models (see Sec. 4 for results). This advantage also allows individual modules to be replaced by future methods to enhance performance, e.g., swapping BLIP-2+ChatGPT for GPT-4V [1], or using other inpainting models.

4 Experiments

We describe our experimental setup in Sec. 4.1 and compare our method with other panorama generation methods in Sec. 4.2. We then analyze the contribution of individual components of our method in Sec. 4.3. Finally, we demonstrate several downstream applications, such as scene fly-through and 3D scene generation, in Sec. 4.4.

4.1 Experimental Setup

Baselines. We evaluate our method on both image-to-panorama and text-to-panorama. For image-to-panorama, we compare against:

  1. Stable Diffusion v2 [19]: we use the prompt '360 degree panorama of <scene description>' to inpaint the panorama image in a single diffusion process.

  2. Text2room [11]: we take the panorama generation component (steps 11-20 of the pipeline) for comparison.

  3. MVDiffusion image-to-panorama model [25].

To enable text conditioning in these methods, we use BLIP-2 to obtain the description of the input image. For text-to-panorama, we compare against:

  1. Text2light [8]: GAN-based text-to-panorama model.

  2. Stable Diffusion v2 [19]: we use the prompt '360 degree panorama of <input prompt>' to generate the panoramic image in a single diffusion process.

  3. LDM3D [22] panorama model: we only use the output RGB image and the prompt '360 degree panorama of <input prompt>' to generate the panoramic image in a single diffusion process.

  4. Text2room panorama generation module [11].

  5. MVDiffusion text-to-panorama model [25].

Implementation Details. L-MAGIC is implemented with PyTorch [15] and the official releases of BLIP-2 [13], Stable Diffusion [19], and the ChatGPT4 API [4]. It takes 2-5 minutes to generate a 360° panorama, depending on the number of repetitions at line 17 of Alg. 1. We use the official code and models for all other methods. In text-to-panorama, L-MAGIC uses Stable Diffusion v2 conditioned on the given text prompt to generate the input image.

Data. To evaluate in-the-wild performance, we collect data for both tasks that does not overlap with the training data of any method. For image-to-panorama, we use 20 indoor and 20 outdoor images from Tanks and Temples [12] and RealEstate10K [31]. For text-to-panorama, we use ChatGPT to generate 20 random scene descriptions (10 indoor and 10 outdoor, see Appendix C).

Metrics. To evaluate different methods with respect to quality and multi-view diversity, we compute the Inception Score (IS) [20] over the perspective views of the panorama. Since existing quantitative metrics do not capture all aspects of human perception of quality [7], we follow existing works [11, 8, 14] with a complementary human evaluation. To this end, we set up a voting web page that shows side-by-side panoramic scenes generated from the same input, one with our method and one with a baseline. We ask 15 anonymous voters to choose which panorama has higher quality and better scene structure (see Appendix D for an example voting page). To minimize voting bias, we randomly shuffle the order of the side-by-side panoramas and hide the method names. We use the votes to compute a preference score from 0 to 1 for our method compared to each baseline; this score is simply the percentage of votes for our method with respect to quality and structure.

4.2 Main Results

[Figures 3, 4, and 5]

As shown in Fig. 3, our method performs better for both image-to-panorama and text-to-panorama, even compared to task-specific methods. To further understand the performance of different methods, we provide qualitative results for both tasks in Fig. 4 and Fig. 5. Stable Diffusion v2 treats a panorama as a single image. It cannot close the 360° loop, since the equirectangular projection [5] splits the loop-closure area onto the two sides of the image (moved to the middle of the rendered panorama for better visualization). Meanwhile, due to the lack of large-scale panorama training data, it still generates unnecessarily repeated objects such as multiple beds in a bedroom. Text2room and MVDiffusion separate a panorama into multiple perspective views; inpainting them with the same prompt results in unreasonably repeated objects across views. Due to its limited panorama training data, Text2light cannot fully understand the zero-shot scene descriptions generated by ChatGPT, producing scenes that are inconsistent with the input prompt. Similar to Stable Diffusion v2, treating the panorama as a single image makes it fail at loop closure. LDM3D is fine-tuned on top of a perspective latent diffusion model. Though better than Text2light, it still cannot close the loop and sometimes fails to generate scenes consistent with the details of the prompt (e.g., generating a non-modern living room when asked for a modern one). Our method works robustly on various inputs, generating panoramic scenes with high perspective rendering quality and reasonable 360° scene layouts (see the supplementary videos for more details).

4.3 Analysis

[Figure 6]

We further analyze the different components of our method in this section. The analysis covers three aspects: 1) the scene prior, 2) the inpainting method, and 3) the quality enhancement techniques. For each aspect, we remove a component or replace it with other methods, and perform the same evaluation as in the main experiments, using the same data as the image-to-panorama main experiment. The results are reported in Fig. 6. Please refer to Appendix E for visual comparisons.

For the scene prior, we remove the prior from ChatGPT (no GPT) and all text guidance (no prompt), respectively. Without the global layout prior from ChatGPT, the structure of the outputs becomes worse. Without any prompt guidance, both the scene layout and the perspective view rendering quality degrade severely.

For the inpainting method, we replace the Stable Diffusion v2 model with three state-of-the-art text-conditioned inpainting methods, namely Blended Latent Diffusion (BLD inpaint) [6], Deep Floyd (DF inpaint) [2], and Stable Diffusion XL (SDXL inpaint) [16]. Interestingly, though some of these methods [16] were published later than Stable Diffusion v2, their capacity for large-mask inpainting on in-the-wild images is limited, resulting in worse performance in terms of both scene layout and rendering quality. Note that the Inception Scores of some methods (e.g., BLD inpaint) are higher than ours despite much worse human evaluation results. This is caused by adversarial samples generated by these methods (see Appendix F for examples), whose local patches are not consistent with the input image yet inflate the diversity measured by the Inception Score. Similar issues have been observed in other problems [7], which underlines the importance of human evaluation.

For the quality enhancement techniques, we remove the super-resolution (no SR) and smoothing (no smooth) techniques from Sec. 3.3, respectively. Though the scene layout does not degrade much, the perspective view rendering quality is lower due to blur or artifacts.

4.4 Applications

[Figures 7 and 8]

Combined with mature computer vision techniques, our pipeline supports a wide range of applications.

Anything-to-panorama. Conditional diffusion models [29, 19] can now generate an image from diverse types of inputs. The strong zero-shot performance of L-MAGIC makes it possible to generate panoramic scenes from potentially any input compatible with conditional diffusion models. As shown in Fig. 7, we can use [29] to generate a single image from 1) a depth map, 2) a sketch drawing, or 3) a color script or segmentation image. This generated image can then be used in L-MAGIC to produce realistic panoramic scenes. This flexibility makes L-MAGIC beneficial to a wide range of design applications.

3D scene generation. Applying state-of-the-art depth estimation models, we can further generate 3D scenes from the output of L-MAGIC. Specifically, we compute the depth map for multiple perspective views, merge the depth maps into the equirectangular image plane by aligning all views to the initial view, and then convert the resulting panoramic depth map to a 3D point cloud. We use Metric3D [27] and DPT-hybrid [17] to estimate depth for indoor and outdoor scenes, respectively. The alignment is done by optimizing the scale and shift of each depth map so that the depths from multiple views agree at the same pixels. Fig. 8 shows sampled results. Despite some artifacts caused by the limitations of monocular depth models, L-MAGIC can generate diverse indoor and outdoor point clouds from various types of inputs.
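The per-view scale-and-shift alignment can be sketched as a least-squares fit against a reference depth at overlapping pixels (e.g., the already aligned initial view). The closed-form fit below is a standard formulation under that assumption; the variable names are illustrative.

```python
import numpy as np

def align_scale_shift(depth, reference, mask):
    """Find s, t minimizing || s * depth + t - reference ||^2 over overlapping pixels."""
    d = depth[mask].ravel()
    r = reference[mask].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)          # N x 2 design matrix
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return s * depth + t, s, t
```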

Immersive video. One can also render immersive videos from our panorama. Specifically, we first generate a panorama using our pipeline and a series of camera poses for the individual video frames, and then warp the panorama to each frame view according to its camera pose. To further enable scene fly-through with camera translation, we apply depth-based warping whenever a frame involves translation, and inpaint the missing regions introduced by the translation. See Appendix G for implementation details and the supplementary videos for the results.

5 Conclusion

We have proposed L-MAGIC, a novel method that can generate 360° panoramic scenes from a single input image. L-MAGIC leverages large (vision) language models to guide diffusion models to smoothly extend the local scene content with a coherent 360° layout. We have also proposed techniques to enhance the quality and resolution of the generated panorama. Extensive experiments demonstrate the effectiveness of L-MAGIC, which outperforms state-of-the-art methods for image-to-panorama and text-to-panorama across metrics. Combined with state-of-the-art computer vision techniques such as conditional diffusion models and depth estimation models, our method can consume various types of inputs (text, sketch drawings, depth maps, etc.) and generate outputs beyond a single panoramic image (videos with camera translation, 3D point clouds, etc.). See Appendix H for a discussion of limitations and future work.

References

  • [1] https://openai.com/blog/chatgpt-can-now-see-hear-and-speak.
  • [2] https://github.com/deep-floyd/if.
  • [3] https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler.
  • [4] https://openai.com/blog/introducing-chatgpt-and-whisper-apis.
  • [5] https://en.wikipedia.org/wiki/equirectangular_projection.
  • [6] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. ACM Transactions on Graphics (TOG), 42(4):1–11, 2023.
  • [7] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
  • [8] Zhaoxi Chen, Guangcong Wang, and Ziwei Liu. Text2light: Zero-shot text-driven HDR panorama generation. ACM Transactions on Graphics (TOG), 41(6):1–16, 2022.
  • [9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
  • [10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • [11] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting textured 3D meshes from 2D text-to-image models. arXiv preprint arXiv:2303.11989, 2023.
  • [12] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (TOG), 36(4):1–13, 2017.
  • [13] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  • [14] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
  • [15] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  • [16] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • [17] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.
  • [18] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501, 2020.
  • [19] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • [20] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.
  • [21] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • [22] Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, et al. LDM3D: Latent diffusion model for 3D. arXiv preprint arXiv:2305.10853, 2023.
  • [23] Peter Sturm. Multi-view geometry for general camera models. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pages 206–212. IEEE, 2005.
  • [24] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with Fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2149–2159, 2022.
  • [25] Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. MVDiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. arXiv preprint arXiv:2307.01097, 2023.
  • [26] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 2022.
  • [27] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3D: Towards zero-shot metric 3D prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9043–9053, 2023.
  • [28] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685, 2023.
  • [29] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • [30] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • [31] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.

Appendix A L-MAGIC Prompts

In Sec. 3.2 we briefly described how language models are used in L-MAGIC. Here, we provide more details on our prompt design when applying language models.

For line 2 of Alg. 1, we ask the BLIP-2 model ($\mathcal{L}_{\text{v}}(\cdot)$) the following two questions:

  1. Q1_BLIP: Question: What is this place (describe with fewer than 5 words)? Answer:

  2. Q2_BLIP: Question: Describe the foreground and background in detail and separately? Answer:

These two questions make the model output coarse and fine scene-level descriptions without focusing on centralized objects, which is beneficial for inferring the global scene layout at line 3. The final $d_{\mathcal{I}}$ consists of the answers to both questions.

To obtain the scene layout descriptions $d_{360}$ of individual views, we ask ChatGPT ($\mathcal{L}(\cdot)$) the following question at line 3:

  1. Q1_GPT: Given a scene with <answer of Q1_BLIP>, where in font of us we see <answer of Q2_BLIP>. Generate 6 rotated views to describe what else you see in this place, where the camera of each view rotates 60 degrees to the right (you dont need to describe the original view, i.e., the first view of the 6 views you need to describe is the view with 60 degree rotation angle). Dont involve redundant details, just describe the content of each view. Also don't repeat the same object in different views. Don't refer to previously generated views. Generate concise (< 10 words) and diverse contents for each view. Each sentence starts with: View xxx (view number, from 1-6): We see…

As mentioned in Sec. 3.4, ChatGPT sometimes cannot fully follow the format requested in Q1_GPT, which makes automatic prompt generation fail. To avoid this catastrophic failure, we check whether the output of Q1_GPT has the required number of lines (6), and whether each line starts with 'View XXX (line number): We see'. We re-run line 3 if either condition is violated. This ensures that ChatGPT understands our question and satisfies all our format requests.
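A sketch of this format check, assuming the raw ChatGPT reply is a plain string, could look as follows; the exact parsing in L-MAGIC may differ.

```python
import re

def layout_format_ok(reply, num_views=6):
    """Check that the reply has one line per view, each starting with 'View <k> ...: We see'."""
    lines = [l.strip() for l in reply.strip().splitlines() if l.strip()]
    if len(lines) != num_views:
        return False
    return all(re.match(rf"^View\s*{k}.*:\s*We see", line, re.IGNORECASE)
               for k, line in enumerate(lines, start=1))
```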

To remove object-level information at line 4, we ask:

  1. Q2_GPT: Modify the sentence: <answer of Q1_BLIP> so that we remove all the objects from the description (e.g., 'a bedroom with a bed' would become 'a bedroom'. Do not change the sentence if the description is only an object). Just output the modified sentence.

To adaptively judge whether we should avoid repeated objects, we ask the following two questions at line 5:

  1. Q3_GPT: Given a scene with <answer of Q1_BLIP>, where in font of us we see <answer of Q2_BLIP>. What would be the two major foreground objects that we see? Use two lines to describe them where each line is in the format of "We see: xxx (one object, dont describe details, just one word for the object. Start from the most possible object. Don't mention background objects like things on the wall, ceiling or floor.)"

  2. Q4_GPT: Do we often see multiple <each object in the answer of Q3_GPT> in a scene with <answer of Q1_BLIP>? Just say 'yes' or 'no' with all lower case letters.

The final $d_{\text{repeat}}$ is the set of objects from Q3_GPT for which the corresponding answer to Q4_GPT is 'no'.
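Putting Q3_GPT and Q4_GPT together, $d_{\text{repeat}}$ can be built with a simple filter; `ask_gpt` below is a placeholder for the ChatGPT call and the wrapper is illustrative only.

```python
def build_d_repeat(objects, scene_desc, ask_gpt):
    """Keep only objects that usually appear once in such a scene (answer to Q4_GPT is 'no')."""
    d_repeat = []
    for obj in objects:                       # the two objects returned by Q3_GPT
        q = (f"Do we often see multiple {obj} in a scene with {scene_desc}? "
             "Just say 'yes' or 'no' with all lower case letters.")
        if ask_gpt(q).strip() == "no":
            d_repeat.append(obj)
    return d_repeat
```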

Appendix B Blur During Warping

[Figures 9 and 10]

As mentioned in Sec. 3.3, adjacent pixels at the center of an image have a larger angular distance than those at the side of an image, which causes blur in the warped image. The cause of this issue lies in the image formation process. Specifically, let $x$ be the horizontal coordinate of a pixel on the image plane, and let $f_x$ and $c_x$ be, respectively, the focal length and the principal point (camera center on the image plane) of the camera in the horizontal direction. Then, the horizontal angular distance between the camera rays of $x$ and $x+1$ is

$$\alpha = \left|\arctan\left(\frac{|x+1-c_x|}{f_x}\right) - \arctan\left(\frac{|x-c_x|}{f_x}\right)\right| \qquad (2)$$

Fig. 9 shows how $\alpha$ changes with $|x-c_x|$, where a large $|x-c_x|$ means $x$ is at the side of the image and a small $|x-c_x|$ means $x$ is at the center. We can see that the angular distance $\alpha$ is larger for centered pixels regardless of the focal length $f_x$. Hence, within the same viewing angle, there are more pixels at the side of an image than at its center. This means that when warping the central region of an image to another view, interpolation is required since more pixels must be created in the corresponding warped region. This phenomenon causes the blurry warping mentioned in Sec. 3.3; see Fig. 10 for an example.
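Eq. (2) is easy to verify numerically; the short check below, with an assumed focal length and principal point for a 512-pixel-wide view, shows that $\alpha$ shrinks as the pixel moves from the image center towards the side.

```python
import numpy as np

def angular_step(x, cx, fx):
    """Horizontal angular distance (Eq. 2) between the rays of pixels x and x+1."""
    return abs(np.arctan(abs(x + 1 - cx) / fx) - np.arctan(abs(x - cx) / fx))

fx, cx = 256.0, 256.0                          # assumed intrinsics for a 512-pixel-wide view
print(angular_step(256, cx, fx))               # center pixel: largest angular step
print(angular_step(500, cx, fx))               # side pixel: noticeably smaller step
```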

Appendix C Text Inputs for Text-to-panorama

In order to evaluate the performance of different algorithms on in-the-wild inputs, we ask ChatGPT to generate 20 random scene descriptions (10 indoor and 10 outdoor) for the main text-to-panorama experiment (Sec. 4.2). The resulting text prompts are:

  1. Autumn maple forest path.

  2. Tropical beach at sunset.

  3. Snowy mountain peak view.

  4. Tuscan vineyard in summer.

  5. Desert under starlit sky.

  6. Sakura blossom park, Kyoto.

  7. Rustic Provencal lavender fields.

  8. Underwater coral reef scene.

  9. Ancient Mayan jungle ruins.

  10. Manhattan skyline at night.

  11. Victorian-era library.

  12. Rustic Italian kitchen.

  13. Minimalist Scandinavian bedroom.

  14. Moorish-styled bathroom.

  15. Vintage record store interior.

  16. Luxurious Hollywood dressing room.

  17. Industrial loft-style office.

  18. Art Deco hotel lobby.

  19. Japanese Zen meditation room.

  20. Modern living room with a sofa and a TV.

Appendix D Voting Web Page

[Figure 11]

As mentioned in Sec. 4.1, we use a voting web page for the human evaluation. Fig. 11 shows example pages for both image-to-panorama and text-to-panorama. In each vote, we show for each method the panorama and a perspective video rendered from it, so that the user can inspect the 360 degree layout and loop closure in the panorama and the rendering quality in the perspective video. Besides voting for one of the two results, we also allow voting for both results when there is no obvious winner for a certain criterion.

Appendix E Ablation visualizations

[Figure 12]

In Sec. 4.3, we conducted ablation studies and reported quantitative results. Here we further show visualizations of the ablation experiments in Fig. 12. Consistent with the quantitative results, changing L-MAGIC components hurts the visual quality of the output panorama.

Appendix F Bias in Quantitative Metrics

[Figure 13]

As mentioned in Sec. 4.3, the Inception Score (IS) sometimes cannot fully reflect the preference from human evaluations. Fig. 13 shows 'adversarial' examples where the panorama has poor quality and multi-view coherence yet a higher Inception Score than the result preferred in the human evaluation. This shows the importance of human evaluation in the experiments.

Appendix G Video generation

[Figure 14]

When generating video frames with pure camera rotation, we follow the strategy of Sec. 3.1: we project the panorama onto a unit sphere and render each frame according to its rotation matrix and camera intrinsics.

To generate immersive videos with camera translation, we apply depth-based warping and inpaint, using Stable Diffusion v2, the small missing regions caused by occlusion. For depth-based warping, we first apply pre-trained depth estimation models [17] to perspective views of the generated panorama, and warp them to the corresponding frame of the video. Naive mesh-based warping following Sec. 3.1 may generate mesh faces between different objects, which is not ideal. Hence, we rely on point-based warping. To avoid sparsely distributed, grid-shaped missing pixels (Fig. 14, left) and to ensure the sharpness of the warped image, we apply a super-resolution-based approach similar to the strategy in Sec. 3.3. Specifically, we enlarge the resolution of the depth map from 512×512 to 2048×2048 and then warp the high-resolution depth map to each frame at a resolution of 512×512 (Fig. 14, right).

To achieve super-resolution of the depth map, we first perform super-resolution on the RGB image, increasing its resolution to 2048×2048. Since state-of-the-art depth estimation models are not effective on high-resolution images, instead of directly estimating the depth of the high-resolution image, we split it into 13×13 patches of resolution 512×512 with overlap between neighbouring patches, perform depth estimation on the individual patches, and align the depth map of each patch with that of the low-resolution image to ensure a smooth depth transition across patches and a reasonable object geometry.
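A sketch of this per-patch scheme follows: tile the 2048×2048 image into overlapping 512×512 patches, estimate depth per patch, and fit a scale and shift against the upsampled low-resolution depth. Here `estimate_depth` is a placeholder for the monocular depth model, and the uniform tiling stride is an assumption.

```python
import numpy as np

def tile_coords(size=2048, patch=512, tiles=13):
    """Top-left corners of overlapping tiles covering the high-resolution image."""
    return np.linspace(0, size - patch, tiles).round().astype(int)

def hires_depth(hires_rgb, lowres_depth_up, estimate_depth, patch=512, tiles=13):
    """Per-patch depth, aligned (scale + shift) to the upsampled low-resolution depth."""
    size = hires_rgb.shape[0]
    out = np.zeros((size, size), dtype=np.float32)
    weight = np.zeros((size, size), dtype=np.float32)
    for y0 in tile_coords(size, patch, tiles):
        for x0 in tile_coords(size, patch, tiles):
            d = estimate_depth(hires_rgb[y0:y0 + patch, x0:x0 + patch])
            ref = lowres_depth_up[y0:y0 + patch, x0:x0 + patch]
            A = np.stack([d.ravel(), np.ones(d.size)], axis=1)
            (s, t), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
            out[y0:y0 + patch, x0:x0 + patch] += s * d + t
            weight[y0:y0 + patch, x0:x0 + patch] += 1.0
    return out / np.maximum(weight, 1.0)     # average where tiles overlap
```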

Appendix H Limitation and Future Work

In terms of limitations, L-MAGIC currently relies on the input prompt to encode the global scene layout information. Designing a fine-grained layout encoding mechanism that ensures multi-view coherence at a more detailed level is an important and interesting direction for future work.
