A collaborative approach to image generation

We introduce PASTA, a reinforcement learning agent that refines text-to-image output over multiple turns of interaction with a user by learning their unique preferences. This process is made possible by a novel user simulation technique.

You have a perfect image in your mind. You enter a prompt, hit generate, and the result is close to what you were thinking, but not quite right. You try refining the prompt, adding more detail, but you can't seem to bridge the gap between your idea and the final image.

This is a common experience. While text-to-image (T2I) models are incredibly powerful, they often struggle to capture the nuance and specificity of an individual's unique creative intent given just a single prompt. What if we could turn image generation into a collaborative conversation?

In this post, we describe our research “Preference Adaptive and Sequential Text-to-image Agent” (PASTA), a reinforcement learning (RL) agent that collaborates with users to progressively refine T2I results. This approach eliminates the need for users to rely on trial-and-error prompt refinement to reach a desirable image. Through human evaluations, we created a novel dataset of sequential preferences, which we then used to compare PASTA with a baseline state-of-the-art model. The results demonstrated that PASTA, trained with our mix of real and simulated data, consistently produced images that users rated as more satisfying. We’ve also released our foundational dataset with a collection of over 7,000 human rater interactions with PASTA.

How PASTA works

To effectively train an AI agent to adapt to a user's individual preferences, a large, diverse set of interaction data is needed. However, gathering this data from real users is challenging due to several factors, including user privacy. To address this, we trained PASTA using a two-stage strategy that combines real human feedback with large-scale user simulation.

First, we collected a high-quality foundational dataset with over 7,000 raters' sequential interactions. These interactions included prompt expansions generated by a Gemini Flash large multimodal model and corresponding images generated by a Stable Diffusion XL (SDXL) T2I model. This initial seed of authentic preference data was then used to train a user simulator, designed to generate additional data that replicate real human choices and preferences.

At the heart of our method is a user model, comprising two key components: 1) a utility model that predicts the degree to which a user will like any set of images, and 2) a choice model that predicts which set of images they will select when presented with several sets. We constructed the user model using pre-trained CLIP encoders and added user-specific components. We trained the model using an expectation-maximization algorithm that allows us to simultaneously learn the specifics of user preferences while also discovering latent “user types,” that is, clusters of users with similar tastes (e.g., tendencies to prefer images with animals, scenic views, or abstract art).

The trained user simulator can provide feedback and express preferences on generated images, and make selections from sets of proposed images. This allows us to generate over 30,000 simulated interaction trajectories.. Our approach does more than just create more data; it gives us a controlled environment in which to explore a vast range of user behaviors so we can train the PASTA agent to effectively collaborate with users.

With this robust, data-driven foundation, the PASTA agent is trained to effectively engage with arbitrary users to generate images that match their preferences. The agent itself is a value-based reinforcement learning model that learns to select the best "slate" of prompt expansions (i.e., elaborations of the current prompt used to generate subsequent images) to show the user at each turn. Its goal is to maximize the user's cumulative satisfaction over the entire interaction.

Once PASTA is trained and deployed, a user initiates the engagement with an initial prompt. PASTA first uses a candidate generator (a large multimodal model) to create a diverse set of potential prompt expansions. Then, a candidate selector (our trained RL agent) selects the optimal slate of four such expansions, which are used to generate corresponding images to present to the user. The user selects the image that is closest to their vision, which provides feedback that guides PASTA's next set of suggestions. This collaborative back-and-forth allows the model to learn the user's preferences on the fly, steering the creative process toward their ideal goal with each step.

Putting PASTA to the test

To evaluate our approach, we trained PASTA as a value-based reinforcement learning agent using implicit Q-learning (IQL). We specifically wanted to see how the use of different training data impacted performance. We created three versions of the agent: 1) trained only on the real volunteer-rater data, 2) trained only on the simulated data, and 3) trained on a combination of real and simulated datasets.

We then ran a series of human evaluations comparing these agents to a baseline model (i.e., base Gemini Flash and SDXL models with no additional training) across four metrics: accuracy over the Pick-a-Pic dataset, Spearman’s rank correlation, choice model accuracy, and cross turn accuracy. Pick-a-Pic accuracy and Spearman's rank correlation assess the model's ability to predict user preferences and rankings on existing, large-scale, single-turn datasets. Choice model accuracy and cross-turn accuracy measure the model's ability to predict which image a user will choose at a given turn and whether the selected images are an improvement over the previous turn, respectively.

The results demonstrated that training PASTA on synthetic data alone didn't beat the baseline and while the agent trained on real human data showed significant improvement, it also didn’t outperform the baseline. However, the agent trained on the combination of both real and simulated data offered the best performance, confirming that our user simulation successfully captures key dynamics of human interaction while providing the scale needed for robust RL training.

When we asked raters to directly compare the final images from our best-performing agent against the baseline, 85% preferred PASTA's generated images. The difference is especially striking with abstract prompts. Starting with a simple idea like "an image of love", PASTA adapted to different user types to create a wide variety of results, from tender portraits to abstract, geometric art.

What's next?

PASTA shows that the future of generative AI can be more interactive, preference adaptive, and collaborative. The methods we developed, particularly the use of robust user simulators, can be applied to many other generative tasks to create AI that better aligns and adapts to human users.

To help spur further research, we have open-sourced our sequential rater dataset and our simulated user data. We can't wait to see what the community builds with it.

Acknowledgements

The author list is: Ofir Nabati, Guy Tennenholtz, ChihWei Hsu, Moonkyung Ryu, Deepak Ramachandran, Yinlam Chow, Xiang Li, and Craig Boutilier. Special thanks to Mark Simborg for his help crafting this blog post and Kimberly Schwede for creating the figures in this post.

OC

A collaborative approach to image generation

How PASTA works

Putting PASTA to the test

What's next?

Acknowledgements

关于《A collaborative approach to image generation》的评论

发表评论

摘要

相关新闻

相关讨论