Computer Science > Computer Vision and Pattern Recognition

[Submitted on 13 Apr 2022]

Title: Hierarchical Text-Conditional Image Generation with CLIP Latents

Authors: Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
Abstract: Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
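
The two-stage structure described in the abstract (caption -> image embedding -> image) can be illustrated with a minimal sketch. This is not the authors' implementation: toy_prior and toy_decoder are hypothetical stand-ins that use seeded random numbers in place of trained models, and EMBED_DIM and the 4x4 output size are arbitrary assumptions. The sketch only shows how the stages compose and why fixing the embedding while resampling the decoder yields variations of one image.

```python
import hashlib
import random
from typing import List

EMBED_DIM = 8  # arbitrary toy size, not taken from the paper


def _seed_from(text: str) -> int:
    """Derive a stable integer seed from a caption (illustrative helper)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def toy_prior(caption: str, sample_seed: int = 0) -> List[float]:
    """Stage 1 stand-in: sample a unit-norm 'image embedding' given a caption."""
    rng = random.Random(_seed_from(caption) + sample_seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec]


def toy_decoder(embedding: List[float], sample_seed: int = 0,
                size: int = 4) -> List[List[float]]:
    """Stage 2 stand-in: sample a size x size 'image' conditioned on an embedding."""
    rng = random.Random(sample_seed)
    base = sum(embedding) / len(embedding)  # crude conditioning signal
    return [[base + 0.1 * rng.gauss(0.0, 1.0) for _ in range(size)]
            for _ in range(size)]


def generate(caption: str, prior_seed: int = 0,
             decoder_seed: int = 0) -> List[List[float]]:
    """Compose the two stages: caption -> embedding -> image."""
    return toy_decoder(toy_prior(caption, prior_seed), decoder_seed)


if __name__ == "__main__":
    z = toy_prior("a corgi playing a trumpet")
    # Same embedding, different decoder samples: "variations" of one image.
    variation_a = toy_decoder(z, sample_seed=1)
    variation_b = toy_decoder(z, sample_seed=2)
    print(variation_a != variation_b)
```

In this toy setup, changing prior_seed changes the embedding itself, while changing only decoder_seed keeps the embedding fixed and alters the details it does not encode, which mirrors the variation behavior the abstract describes.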
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2204.06125 [cs.CV]
  (or arXiv:2204.06125v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2204.06125
arXiv-issued DOI via DataCite

Submission history

From: Aditya Ramesh
[v1] Wed, 13 Apr 2022 01:10:33 UTC (41,596 KB)