NOTE: Make sure you also see this post that has a summary of all the Text-to-Image scripts supported by Visions of Chaos with example images.
Input a short phrase or sentence into a neural network and see what image it creates.
I am using Big Sleep from Phil Wang (@lucidrains).
Phil used the code/models from Ryan Murdock (@advadnoun). Ryan has a blog post here explaining the basics of how all the parts connect. Ryan has some newer Text-to-Image experiments, but they are behind a Patreon paywall, so I have not played with them. Hopefully he (or anyone) releases the colabs publicly sometime in the future. I don’t want to experiment with a Text-to-Image system that I cannot share with everyone; otherwise it is just a tease.
The simplest explanation is that BigGAN generates images and CLIP rates how closely each image matches the input phrase. BigGAN creates an image, CLIP looks at it and says “sorry, that does not look like a cat to me, try again”, and with each repeated iteration BigGAN gets better at generating an image that matches the desired phrase text.
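To make that generate-score-repeat loop concrete, here is a toy, dependency-free Python sketch. It is not the real system: the actual Big Sleep optimizes BigGAN's latent vector by gradient descent against CLIP's similarity score, whereas this stand-in uses a trivial "generator", a made-up score function in place of CLIP, and simple random-search hill climbing. All names and numbers below are illustrative only.

```python
import random

# Pretend embedding of the prompt text (illustrative values only).
TARGET = [0.2, 0.8, 0.5]

def generate(latent):
    # A real BigGAN would decode the latent vector into an image;
    # here the "image" is just the latent itself.
    return latent

def clip_score(image):
    # Stand-in for CLIP: higher is better. Uses negative squared
    # distance to the target embedding as the "how cat-like is this" rating.
    return -sum((a - b) ** 2 for a, b in zip(image, TARGET))

def dream(iterations=2000, step=0.05, seed=0):
    """Generate-score-repeat: keep nudging the latent while the score improves."""
    rng = random.Random(seed)
    latent = [rng.uniform(-1, 1) for _ in TARGET]
    best = clip_score(generate(latent))
    for _ in range(iterations):
        # Perturb the latent; keep the change only if the scorer rates it higher.
        candidate = [x + rng.uniform(-step, step) for x in latent]
        score = clip_score(generate(candidate))
        if score > best:
            latent, best = candidate, score
    return latent, best

latent, score = dream()
print(score)  # approaches 0 as the "image" matches the prompt embedding
```

The structure is the same as the real thing: one model proposes, the other judges, and the proposal is refined until the judge is satisfied.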
Big Sleep Examples
Big Sleep uses a seed number, which means you can get thousands/millions of different outputs for the same input phrase. Note, though, that there is an issue with the seed not always reproducing the same images. From my testing, even setting the torch_deterministic flag to True and setting the CUDA environment variable does not help. Every time Big Sleep is called it will generate a different image from the same seed, which means you will never be able to reproduce the same output in the future.
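For context on why this is surprising: seeding a pseudo-random generator is normally enough for exact, bit-for-bit reproducibility, as this small Python illustration shows. Big Sleep breaks that expectation because some of the CUDA GPU kernels it relies on are non-deterministic even when every seed is fixed. This snippet is a general illustration only, not Big Sleep's actual code.

```python
import random

def seeded_samples(seed, n=5):
    """Draw n samples from a generator initialised with the given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# With an ordinary CPU-side RNG, the same seed always replays the same stream...
assert seeded_samples(123) == seeded_samples(123)

# ...and different seeds diverge, which is why one prompt can yield
# thousands of distinct images. On the GPU, however, some CUDA kernels
# are non-deterministic, so Big Sleep can produce different images
# between runs even when the seed is identical.
assert seeded_samples(123) != seeded_samples(456)
```

So the seed still controls which image you get on a given run; it just cannot be relied on to recreate that exact image later.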
These images are 512×512 pixels square (the largest resolution Big Sleep supports) and took 4 minutes each to generate on an RTX 3090 GPU. The same code takes 6 minutes 45 seconds per image on an older 2080 Super GPU.
Also note that these are the “cherry picked” best results. Big Sleep is not going to create awesome art every time. For these examples, or when experimenting with new phrases, I usually run a batch of multiple images and then manually select the best 4 or 8 to show off (4 or 8 because that fills one or two tweets).
To start, these next four images were created from the prompt phrase “Gandalf and the Balrog”
Here are results from “disturbing flesh”. These are like early David Cronenberg nightmare visuals.
A suggestion from @MatthewKafker on Twitter “spatially ambiguous water lillies painting”
After experimenting with acrylic pour painting in the past, I wanted to see what Big Sleep could generate from “acrylic pour painting”.
I have always enjoyed David Lynch movies, so let’s see what “david lynch visuals” results in. This one was full of surprises and worked great. These images really capture the feeling of a Lynchian cinematic look. A lot of them came out fairly dark, so I have tweaked the exposure in GIMP.
More from “david lynch visuals” but these are more portraits. The famous hair comes through.
I have now added a simple GUI front end for Big Sleep to Visions of Chaos, so once you have installed all the pre-requisites you can run these models on any prompt phrase you feed into them. The following image shows Big Sleep in the process of generating an image for the prompt text “cyberpunk aesthetic”.
After spending a lot of time experimenting with Big Sleep over the last few days, I highly encourage anyone with a decent GPU to try these. The results are truly fascinating. This page says at least a 2070 8GB or greater is required, but Martin in the comments managed to generate a 128×128 image on a 1060 6GB GPU after 26 (!!) minutes.
Dear professor, your work is so great and beautiful, thank you very much. I want to know how to export the 3D image as BMP format pictures in the form of slices along a direction (e.g. the z axis).
Thanks very much.
These are only 2D images, so no 3D to export.
Thank you. These are absolutely fascinating! Now if only I could put my hands on a decent GPU (RTX 3080 or +).
Yes, from what I can find BigSleep needs a 2070 8GB GPU or better to run.
In the “2070 8GB or greater required” link, one post said:
I’ve been able to get VRAM usage down near 6GB and even 4GB by lowering image_size and num_cutouts parameters.
--num-cutouts=16 and --image-size=128 should work on a 4GB card, but I haven’t tested yet.
Is this something you can do with your implementation?
I do provide the option for image size so you can try 128×128. If that fails I could add a “low memory” checkbox that adds the num cutouts option.
Finally got it all working. I can confirm that a 10×0 class GPU card with 6GB is able to create pictures at the 128×128 size. My particular 1060 card takes about 26 mins per image. They are small but perfectly formed.
Thanks for letting me (and others) know.
Just a general point – due to the sheer number of different options in Machine Learning you keep adding (p.s. please don’t stop!) – I have come across a few “module not found” errors – perhaps a general note in the setup that if one comes across this, a simple “pip install” in the command prompt will resolve these issues 🙂 (latest was for CLIP captioning: ‘nltk’ was missing)
I have added checking in the next version to detect “ModuleNotFoundError” and point the user to the instructions page.