Text-to-Image Machine Learning

NOTE: Make sure you also see this post that has a summary of all the Text-to-Image scripts supported by Visions of Chaos with example images.

Text-to-Image

Input a short phrase or sentence into a neural network and see what image it creates.

I am using Big Sleep from Phil Wang (@lucidrains).

Phil used the code/models from Ryan Murdock (@advadnoun). Ryan has a blog post explaining the basics of how all the parts connect up here. Ryan has some newer Text-to-Image experiments but they are behind a Patreon paywall, so I have not played with them. Hopefully he (or anyone) releases the colabs publicly sometime in the future. I don’t want to experiment with a Text-to-Image system that I cannot share with everyone, otherwise it is just a tease.

The simplest explanation is that BigGAN generates images that try to satisfy CLIP, which rates how closely each image matches the input phrase. BigGAN creates an image and CLIP looks at it and says “sorry, that does not look like a cat to me, try again”. With each iteration the inputs fed into BigGAN are adjusted so that the next image is a closer match to the phrase.
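To make that feedback loop concrete, here is a toy sketch in plain Python. It is not Big Sleep's code: `generate` is a tiny fixed linear map standing in for BigGAN, `score` is a cosine similarity standing in for CLIP's rating, and `optimise_latent`, `WEIGHTS` and `TARGET` are made-up names for illustration. The only thing that changes from step to step is the latent vector fed to the generator, which gets nudged in whichever direction raises the score.

```python
import math

# Hypothetical stand-in for BigGAN: a fixed linear map from 2 latent
# numbers to 3 "image" features. These weights never change.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical stand-in for the text prompt: the feature vector the
# scorer is looking for.
TARGET = [0.0, 1.0, 1.0]


def generate(latent):
    """Turn a latent vector into toy image features."""
    return [sum(w * z for w, z in zip(row, latent)) for row in WEIGHTS]


def score(image, target=TARGET):
    """Cosine similarity: 1.0 means the image points exactly at the target."""
    dot = sum(a * b for a, b in zip(image, target))
    norms = math.sqrt(sum(a * a for a in image)) * math.sqrt(sum(b * b for b in target))
    return dot / norms


def optimise_latent(latent, steps=300, lr=0.1, eps=1e-5):
    """Nudge the latent uphill on the score; return it with the score history."""
    latent = list(latent)
    history = []
    for _ in range(steps):
        current = score(generate(latent))
        history.append(current)
        # Finite-difference estimate of how the score reacts to each latent value.
        grad = []
        for i in range(len(latent)):
            bumped = list(latent)
            bumped[i] += eps
            grad.append((score(generate(bumped)) - current) / eps)
        # Step in the direction that raises the score ("try again").
        latent = [z + lr * g for z, g in zip(latent, grad)]
    history.append(score(generate(latent)))
    return latent, history


if __name__ == "__main__":
    final_latent, history = optimise_latent([1.0, 0.0])
    print(f"first score {history[0]:.3f}, last score {history[-1]:.3f}")
```

Running it prints the first and last scores so you can watch the match improve. The real thing follows the same idea at a vastly larger scale, with PyTorch computing the gradients instead of the finite differences used here.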

Big Sleep Examples

Big Sleep uses a seed number, which means you can have thousands or millions of different outputs for the same input phrase. Note that there is an issue with the seed not reliably reproducing the same image, though. From my testing, setting the torch_deterministic flag to True and setting the CUDA environment variable does not help. Every time Big Sleep is called it generates a different image for the same seed. That means you will never be able to reproduce the same output in the future.
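For reference, this is roughly what the usual PyTorch reproducibility settings look like. It is a minimal sketch, not the code Big Sleep or Visions of Chaos runs: the helper name `seed_everything` is made up, and because the exact CUDA variable is not named above, `CUBLAS_WORKSPACE_CONFIG` is an assumption taken from PyTorch's reproducibility notes. As described above, settings like these were not enough to make Big Sleep repeat an image.

```python
import os
import random


def seed_everything(seed):
    """Seed Python's RNG and, if PyTorch is installed, ask it to be deterministic.

    Returns True when the PyTorch settings were applied, False otherwise.
    """
    # Assumption: this is the cuBLAS setting PyTorch documents for
    # deterministic CUDA matrix multiplies. It must be set before CUDA is used.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)

    try:
        import torch
    except ImportError:
        return False  # PyTorch not installed; only Python's RNG was seeded

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Newer PyTorch versions can raise an error on operations that have no
    # deterministic implementation, which helps locate the source of variation.
    if hasattr(torch, "use_deterministic_algorithms"):
        torch.use_deterministic_algorithms(True)
    return True


if __name__ == "__main__":
    applied = seed_everything(42)
    print("PyTorch settings applied:", applied)
```

Even with all of this in place, some GPU operations have no deterministic implementation, so identical seeds can still drift apart.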

These images are 512×512 pixels square (the largest resolution Big Sleep supports) and took 4 minutes each to generate on an RTX 3090 GPU. The same code takes 6 minutes 45 seconds per image on an older 2080 Super GPU.
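As a quick sanity check on those timings, the snippet below works out the relative speed and how long a batch would take on each card. The batch size of 8 is only an example, and the helper name `batch_minutes` is made up for illustration.

```python
# Per-image render times quoted above, converted to seconds.
RTX_3090_SECONDS = 4 * 60
RTX_2080_SUPER_SECONDS = 6 * 60 + 45

# How many times longer the 2080 Super takes per image.
SLOWDOWN = RTX_2080_SUPER_SECONDS / RTX_3090_SECONDS


def batch_minutes(seconds_per_image, images):
    """Total minutes to render a batch at a fixed per-image time."""
    return seconds_per_image * images / 60


if __name__ == "__main__":
    batch = 8  # example batch size, not a figure from the post
    print(f"2080 Super takes {SLOWDOWN:.2f}x as long per image")
    print(f"3090: {batch_minutes(RTX_3090_SECONDS, batch):.0f} min for {batch} images")
    print(f"2080 Super: {batch_minutes(RTX_2080_SUPER_SECONDS, batch):.0f} min for {batch} images")
```

So the 2080 Super needs roughly 1.7 times as long per image, which adds up quickly when running larger batches to cherry pick from.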

Also note that these are the “cherry picked” best results. Big Sleep is not going to create awesome art every time. For these examples or when experimenting with new phrases I usually run a batch of multiple images and then manually select the best 4 or 8 to show off (4 or 8 because that satisfies one or two tweets).

To start, these next four images were created from the prompt phrase “Gandalf and the Balrog”

Big Sleep - Gandalf and the Balrog (4 images)

Here are results from “disturbing flesh”. These are like early David Cronenberg nightmare visuals.

Big Sleep - Disturbing Flesh (4 images)

A suggestion from @MatthewKafker on Twitter “spatially ambiguous water lillies painting”

Big Sleep - Spatially Ambiguous Water Lillies Painting (8 images)

“stormy seascape”

Big Sleep - Stormy Seascape (4 images)

After experimenting with acrylic pour painting in the past, I wanted to see what Big Sleep could generate from “acrylic pour painting”.

Big Sleep - Acrylic Pour Painting (4 images)

I have always enjoyed David Lynch movies, so let’s see what “david lynch visuals” results in. This one produced a lot of surprises and worked great. These images really capture the feeling of a Lynchian cinematic look. A lot of them came out fairly dark, so I have tweaked the exposure in GIMP.

Big Sleep - David Lynch Visuals (8 images)

More from “david lynch visuals” but these are more portraits. The famous hair comes through.

Big Sleep - David Lynch Visuals (4 images)

“H.R.Giger”

Big Sleep - H.R.Giger (8 images)

“metropolis”

Big Sleep - Metropolis (4 images)

“surrealism”

Big Sleep - Surrealism (4 images)

“colorful surrealism”

Big Sleep - Colorful Surrealism (4 images)

Availability

I have now added a simple GUI front end for Big Sleep into Visions of Chaos, so once you have installed all the prerequisites you can run these models on any prompt phrase you feed into them. The following image shows Big Sleep in the process of generating an image for the prompt text “cyberpunk aesthetic”.

Text-to-Image GUI

After spending a lot of time experimenting with Big Sleep over the last few days, I highly encourage anyone with a decent GPU to try it. The results are truly fascinating. This page says a 2070 8GB or better is required, but Martin in the comments managed to generate a 128×128 image on a 1060 6GB GPU after 26 (!!) minutes.

Jason.

10 responses to “Text-to-Image Machine Learning”

  1. hello
    dear professor, your work is so great and beautiful, thank you very much. I want to know how to export the 3D image as bmp format pictures in the form of slices along a direction (e.g. the z axis)
    thanks very much.
    Warm regards
    jmk2021

  2. Thank you. These are absolutely fascinating! Now if only I could put my hands on a decent GPU (RTX 3080 or +).

  3. In the “2070 8GB or greater required” link, one post said:

    I’ve been able to get VRAM usage down near 6GB and even 4GB by lowering image_size and num_cutouts parameters.
    --num-cutouts=16 and --image-size=128 should work on a 4GB card, but I haven’t tested yet.

    Is this something you can do with your implementation?

    • I do provide the option for image size so you can try 128×128. If that fails I could add a “low memory” checkbox that adds the num cutouts option.

  4. Finally got it all working. I can confirm that a 10x0 class GPU card with 6GB is able to create pictures at the 128×128 size. My particular 1060 card takes about 26 mins per image. They are small but perfectly formed.

  5. just a general point – due to the sheer number of different options in Machine Learning you keep adding (p.s. please don’t stop!) – I have come across a few “module not found” errors – perhaps a general note in the setup that if one comes across this, a simple “pip install” in the command prompt will resolve these issues 🙂 (latest was for CLIP captioning: ‘nltk’ was missing)

    • I have added checking for the next version to detect “ModuleNotFoundError” and point the user to the instructions page.
