Text-to-Image Summary – Part 2

This is Part 2. See Part 1 for the previous post.

This post continues listing the Text-to-Image scripts included with Visions of Chaos and some example outputs from each script.


Name: VQGAN Gumbel
Author: Eleiber
Original script: https://colab.research.google.com/drive/1tim3xTsZXafK-A2rOUsevckdl4OitIiw
Time for 512×512 on a 3090: 3 minutes 27 seconds
Maximum resolution on a 24 GB 3090: 768×768 or 1120×480
Description: Variation using the gumbel-8192 model. Results are a bit rougher than others.

'a childs drawing of a space nebula' VQGAN Gumbel Text-to-Image
a childs drawing of a space nebula

'a movie monster in the style of Edvard Munch' VQGAN Gumbel Text-to-Image
a movie monster in the style of Edvard Munch

'a raytraced image of the Amazon Rainforest' VQGAN Gumbel Text-to-Image
a raytraced image of the Amazon Rainforest

'a tropical beach in the style of Polock' VQGAN Gumbel Text-to-Image
a tropical beach in the style of Polock

'digital art of a rose' VQGAN Gumbel Text-to-Image
digital art of a rose


Name: OpenAI DVAE+CLIP
Author: Katherine Crowson
Original script: https://colab.research.google.com/drive/10DzGECHlEnL4oeqsN-FWCkIe_sq3wVqt
Time for 512×512 on a 3090: 3 minutes 07 seconds
Maximum resolution on a 24 GB 3090: 768×768 or 1120×480
Description: Results are very colorful and more abstract. By default it produces noisier output images, but this can be disabled if you prefer.

'a dragon' OpenAI DVAE+CLIP Text-to-Image
a dragon

'a hyperrealistic painting of planets' OpenAI DVAE+CLIP Text-to-Image
a hyperrealistic painting of planets

'a mountain cabin' OpenAI DVAE+CLIP Text-to-Image
a mountain cabin

'a woodcut of a mountain range in the style of Marvel Comics' OpenAI DVAE+CLIP Text-to-Image
a woodcut of a mountain range in the style of Marvel Comics

'an angry person' OpenAI DVAE+CLIP Text-to-Image
an angry person


Name: Aphantasia
Author: Vadim Epstein
Original script: https://github.com/eps696/aphantasia
Time for 512×512 on a 3090: 1 minute 5 seconds
Maximum resolution on a 24 GB 3090: 4096×4096 or 2520×1080
Description: Different, messier, pastel abstract “Turneresque” output. I spent a few hours trying many different combinations of settings to get more coherent output with deeper colors. The following samples are as good as I could push it. I give up for now. If you can do better, let me know. It does support creating larger 1280×720 resolution images on a 3090 GPU.

'a marble sculpture of a computer' Aphantasia Text-to-Image
a marble sculpture of a computer

'an eyeball' Aphantasia Text-to-Image
an eyeball

'an octopus' Aphantasia Text-to-Image
an octopus

'digital art of frogs in the style of Dr Seuss' Aphantasia Text-to-Image
digital art of frogs in the style of Dr Seuss

'medusa' Aphantasia Text-to-Image
medusa


Name: Text2Image VQGAN
Author: Vadim Epstein
Original script: https://colab.research.google.com/github/eps696/aphantasia/blob/master/CLIP_VQGAN.ipynb
Time for 512×512 on a 3090: 2 minutes 8 seconds
Maximum resolution on a 24 GB 3090: 768×768 or 1120×480
Description: Allows larger sized 480p images (854×480) on a 3090 GPU.

'a digital painting of the Las Vegas strip' Text2Image VQGAN Text-to-Image
a digital painting of the Las Vegas strip

'a midnineteenth century engraving of a cute monster' Text2Image VQGAN Text-to-Image
a midnineteenth century engraving of a cute monster

'a skeleton' Text2Image VQGAN Text-to-Image
a skeleton

'an ultrafine detailed painting of a crying person' Text2Image VQGAN Text-to-Image
an ultrafine detailed painting of a crying person

'puppies' Text2Image VQGAN Text-to-Image
puppies


Name: MSE VQGAN+CLIP z+quantize
Author: jbusted
Original script: https://colab.research.google.com/drive/1gFn9u3oPOgsNzJWEFmdK-N9h_y65b8fj
Time for 512×512 on a 3090: 6 minutes 19 seconds
Maximum resolution on a 24 GB 3090: 768×768 or 1120×480
Description: Awesome crisp results. Allows larger sized 480p images (854×480) on a 3090 GPU. One of the best scripts in this list and well worth exploring.

'a charcoal drawing of a country town' MSE VQGAN+CLIP z+quantize Text-to-Image
a charcoal drawing of a country town

'a hyperrealistic painting of an ugly creature' MSE VQGAN+CLIP z+quantize Text-to-Image
a hyperrealistic painting of an ugly creature

'a landscape made of mist' MSE VQGAN+CLIP z+quantize Text-to-Image
a landscape made of mist

'a mosaic of christmas' MSE VQGAN+CLIP z+quantize Text-to-Image
a mosaic of christmas

'an octopus in the style of Vincent van Gogh' MSE VQGAN+CLIP z+quantize Text-to-Image
an octopus in the style of Vincent van Gogh

MSE VQGAN+CLIP z+quantize allows an image to be specified as the starting point. If you take the output and repeatedly feed it back in as the input, with some minor image stretching each frame, you get a movie zooming into the Text-to-Image output. There is no blending of frames or optical flow for this one, just the 854×480 resolution frames combined directly into a movie. The VQGAN model was “vqgan_imagenet_f16_16384” and the CLIP model was “ViT-B/32”. The prompts for this movie were “hyperrealistic homer simpson”, “hyperrealistic marge simpson”, “hyperrealistic bart simpson”, “hyperrealistic lisa simpson” and “hyperrealistic maggie simpson”. The original 480p upload looked terrible after YouTube compressed it, so I upscaled the 480p to 2160p (4K) in DaVinci Resolve and reuploaded it to YouTube. Their compression did a better job with the 4K version, so the movie is now watchable.
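The feedback loop described above can be sketched roughly as follows. This is only an illustration and not the code Visions of Chaos uses: `zoom_crop_box`, `zoom_movie`, `stretch` and `generate` are hypothetical names, with `generate` standing in for one run of the Text-to-Image script seeded with an input image.

```python
def zoom_crop_box(width, height, zoom):
    """Return the centred (left, top, right, bottom) box that magnifies
    the frame by `zoom` once it is resized back to width x height."""
    crop_w = width / zoom
    crop_h = height / zoom
    left = (width - crop_w) / 2
    top = (height - crop_h) / 2
    return (left, top, left + crop_w, top + crop_h)


def zoom_movie(first_frame, stretch, generate, frame_count):
    """Build a list of frames where each frame is generated from a
    slightly stretched copy of the previous frame.

    `stretch` and `generate` are caller-supplied (hypothetical) callables:
    stretch(frame) -> stretched frame, generate(seed_frame) -> new frame.
    """
    frames = [first_frame]
    for _ in range(frame_count - 1):
        seed = stretch(frames[-1])     # minor stretch of the last output
        frames.append(generate(seed))  # feed it back in as the input image
    return frames


# Crop box for an 854x480 frame zoomed in by 2%.
print(zoom_crop_box(854, 480, 1.02))
```

The frames returned by `zoom_movie` are used as they are, matching the "no blending, no optical flow" approach of this movie.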

This next example shows how MSE VQGAN+CLIP z+quantize interprets various common human phobias. Text prompts were “a hyperrealistic painting depicting acrophobia” etc. To try to smooth out the “flickering” when zooming I started using ImageMagick for the zooming, because it supports sub-pixel image resizing. This movie was also originally 480p and upscaled to 4K in DaVinci Resolve before uploading.
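I have not listed the exact ImageMagick options here, so the following is only one possible way to do a fractional zoom. It assumes ImageMagick 7 (the `magick` command) and its `-distort SRT` (scale-rotate-translate) operator, which accepts non-integer scale factors and keeps the output the same size as the input. The helper name `imagemagick_zoom_command` and the file names are made up for the example.

```python
import subprocess


def imagemagick_zoom_command(src, dst, scale, angle=0.0):
    """Build an ImageMagick command line that zooms `src` about its centre
    by a fractional `scale` (and optionally rotates it by `angle` degrees).

    Assumption: ImageMagick 7 is installed as `magick`
    (ImageMagick 6 uses `convert` instead).
    """
    return ["magick", src, "-distort", "SRT", f"{scale:.4f} {angle:.4f}", dst]


cmd = imagemagick_zoom_command("frame_0001.png", "frame_0002_seed.png", 1.0125)
print(" ".join(cmd))
# To actually run it (requires ImageMagick and the input file to exist):
# subprocess.run(cmd, check=True)
```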

I have also added some basic scripting support to Visions of Chaos (scripting as in automating a series of steps, rather than a Python .py script). Scripting allows the prompt, zoom speed, rotation and panning to be changed during the movie, with smooth interpolation between the settings each frame.

Text-to-Image Script GUI
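As an illustration of the per-frame interpolation mentioned above, here is a minimal sketch that linearly interpolates a setting such as zoom speed or rotation between keyframes. The keyframe format and the `interpolate` function are assumptions made for this example. They are not the actual Visions of Chaos script format, and the real interpolation may use a smoother curve than a straight line.

```python
from bisect import bisect_right


def interpolate(keyframes, frame):
    """Linearly interpolate a setting at `frame`.

    `keyframes` is a list of (frame_number, value) pairs sorted by
    frame_number. Frames before the first or after the last keyframe
    hold the nearest keyframe's value.
    """
    numbers = [number for number, _ in keyframes]
    if frame <= numbers[0]:
        return keyframes[0][1]
    if frame >= numbers[-1]:
        return keyframes[-1][1]
    i = bisect_right(numbers, frame)          # index of the next keyframe
    (f0, v0), (f1, v1) = keyframes[i - 1], keyframes[i]
    t = (frame - f0) / (f1 - f0)              # 0.0 at f0, 1.0 at f1
    return v0 + (v1 - v0) * t


# Hypothetical script: zoom speeds up, rotation swings there and back.
zoom_keys = [(0, 1.00), (100, 1.02)]
rotation_keys = [(0, 0.0), (50, 2.0), (100, 0.0)]
for frame in (0, 25, 50, 75, 100):
    print(frame, interpolate(zoom_keys, frame), interpolate(rotation_keys, frame))
```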

The following video is a test of the scripting. This video is a Powers of Ten homage with zooming in from the largest scales to the smallest scales.

Another recent addition is the ability to use a series of images as “seed images” that are processed one at a time and then combined into a movie. The following GIF of the Alien chestburster scene is an example of this. The Text-to-Image prompt was “impasto oil painting”.

This next example movie shows a “Self-Driven” zoom movie. As in a regular zoom movie, the output frames are slightly stretched and fed back into the system each frame. The difference with a self-driven movie is that the Text-to-Image prompt text is automatically changed every 2 seconds by CLIP detecting what it “sees” in the current frame. This way the movie subjects are changed and steered in new directions in a totally automated way. There is no human control except me setting the initial “Rainbow colored blobs” prompt. After that it was fully automated.
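The control flow of a self-driven movie can be sketched like this. It is a simplified illustration only: `render` and `caption` are hypothetical stand-ins for the Text-to-Image step and the CLIP captioning step, and the frame rate is an assumed parameter.

```python
def self_driven_prompts(initial_prompt, frame_count, fps, caption, render,
                        seconds_per_prompt=2):
    """Return the prompt used for each frame of a self-driven movie.

    render(prompt, previous_frame) -> new frame   (hypothetical)
    caption(frame) -> text describing the frame   (hypothetical)
    Every `seconds_per_prompt` seconds of movie time the prompt is
    replaced by a caption of the most recent frame.
    """
    interval = max(1, round(fps * seconds_per_prompt))
    prompt = initial_prompt
    frame = None
    prompts = []
    for n in range(frame_count):
        if n > 0 and n % interval == 0:
            prompt = caption(frame)      # let the captioner pick the next subject
        frame = render(prompt, frame)    # previous frame seeds the next one
        prompts.append(prompt)
    return prompts


# Toy stand-ins so the sketch runs without any models.
demo = self_driven_prompts(
    "Rainbow colored blobs", frame_count=6, fps=1,
    caption=lambda frame: "caption of " + frame,
    render=lambda prompt, previous: "frame<" + prompt + ">",
)
print(demo)
```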

By default the CLIP Image Captioning script is very good at detecting what is in an image. Using the default accuracy resulted in zoom movies that got stuck on a single topic or subject. One got stuck on slight variations of a prompt dealing with kites, so as the zoom movie went deeper it only showed kites. Luckily, after tweaking and decreasing the accuracy of the CLIP captioning, the predictions allow the resulting subjects to drift to new topics during the movie.
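One possible way to "decrease the accuracy" of a captioner is to pick at random from the few best-scoring captions instead of always taking the top one. The sketch below shows that idea only. It is an assumption for illustration, not necessarily the tweak used here, and `pick_caption` and the scores are made up.

```python
import random


def pick_caption(scored_captions, top_k, rng):
    """Pick one caption at random from the `top_k` highest-scoring ones.

    `scored_captions` is a list of (caption, score) pairs. With top_k=1
    this always returns the best match, while larger values let the
    subject drift instead of locking onto one topic.
    """
    ranked = sorted(scored_captions, key=lambda item: item[1], reverse=True)
    return rng.choice(ranked[:top_k])[0]


candidates = [("kites", 0.91), ("birds", 0.62), ("clouds", 0.55), ("cars", 0.08)]
rng = random.Random(1234)  # seeded so runs are repeatable
print([pick_caption(candidates, 3, rng) for _ in range(5)])
```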


Name: Monster Maker
Author: P_Hoep
Original script: https://colab.research.google.com/drive/1ZbLnt5fLS_BDfpQY-9Dh_T40pLjfqSAC
Time for 512×512 on a 3090: 2 minutes 01 seconds
Description: No longer available. I was contacted by the author who does not want it shared publicly. The colab link no longer works.

'a black and white photo of a library in the style of Rembrandt' Monster Maker Text-to-Image
a black and white photo of a library in the style of Rembrandt

'a forest fire' Monster Maker Text-to-Image
a forest fire

'a forest path' Monster Maker Text-to-Image
a forest path

'a heart made of feathers' Monster Maker Text-to-Image
a heart made of feathers

'a surrealist painting of the Las Vegas strip' Monster Maker Text-to-Image
a surrealist painting of the Las Vegas strip


Name: CLIP Guided Diffusion
Author: Katherine Crowson
Original script: https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj
Time for 256×256 on a 3090: 1 minute 35 seconds
Maximum resolution on a 24 GB 3090: Locked to 256×256
Description: This one gives very unique results compared to the other scripts. Locked to 256×256 resolution. Some of the results can be very detailed and interesting, but a lot of the time it is hit and miss to get a result that reliably matches the input phrase. The following samples were hand-picked from a large batch run of random phrases.

'a clown' CLIP Guided Diffusion Text-to-Image
a clown

'a hyperrealistic painting of a witch' CLIP Guided Diffusion Text-to-Image
a hyperrealistic painting of a witch

'a sea monster' CLIP Guided Diffusion Text-to-Image
a sea monster

'a surrealist sculpture of an android' CLIP Guided Diffusion Text-to-Image
a surrealist sculpture of an android

'Brad Pitt' CLIP Guided Diffusion Text-to-Image
Brad Pitt

'New York City' CLIP Guided Diffusion Text-to-Image
New York City


Name: CLIP Guided Diffusion v2
Author: afiaka87
Original script: https://colab.research.google.com/github/afiaka87/clip-guided-diffusion/blob/main/colab_clip_guided_diff_hq.ipynb
Time for 256×256 on a 3090: 2 minutes 38 seconds
Maximum resolution on a 24 GB 3090: Locked to 256×256
Description: Modified CLIP Guided Diffusion with more options. This one gives very unique results compared to the other scripts. Locked to 256×256 resolution. Hopefully larger resolution versions of this script will appear in the future. Some of the results can be very detailed and interesting, but a lot of the time it is hit and miss to get a result that reliably matches the input phrase. The following samples were hand-picked from a large batch run of random phrases.

'a digital painting of a crying person' CLIP Guided Diffusion v2 Text-to-Image
a digital painting of a crying person

'a fine art painting of heaven in the style of Edvard Munch' CLIP Guided Diffusion Text-to-Image
a fine art painting of heaven in the style of Edvard Munch

'a flemish baroque of an angry person' CLIP Guided Diffusion v2 Text-to-Image
a flemish baroque of an angry person

'a flemish baroque of hell' CLIP Guided Diffusion v2 Text-to-Image
a flemish baroque of hell

'a surrealist painting of a witch' CLIP Guided Diffusion v2 Text-to-Image
a surrealist painting of a witch

'the australian outback' CLIP Guided Diffusion v2 Text-to-Image
the australian outback


Name: CLIPRGB
Author: Jonathan Whitaker
Original script: https://colab.research.google.com/drive/1MiKaFFgau6V5QhIed5tpNdLUiSbof4nI
Time for 512×512 on a 3090: 4 minutes 51 seconds
Maximum resolution on a 24 GB 3090: 4096×4096
Description: Very early 0.1 version shows a lot of potential. Can render huge resolution images up to 4096×4096 on a 3090 so I am really looking forward to future versions of this code with sharper details.

'a digital painting of a wizard' CLIPRGB
a digital painting of a wizard

'a forest path' CLIPRGB
a forest path

'a tattoo of planets' CLIPRGB
a tattoo of planets

'a vampire' CLIPRGB
a vampire


Name: CLIP Guided Diffusion v3
Author: Michael Friesen
Original script: https://colab.research.google.com/drive/1Fl2SZvLv23MVSAHxkoiNdxPeAZwibvu1
Time for 512×512 on a 3090: 2 minutes 23 seconds
Maximum resolution on a 24 GB 3090: Locked to 512×512
Description: Modified CLIP Guided Diffusion that generates larger 512×512 images. Some of the results can be very detailed and interesting, but a lot of the time it is hit and miss to get a result that reliably matches the input phrase. The following samples were hand-picked from a large batch run of random phrases.

'a cubist painting of a castle' CLIP Guided Diffusion v2 Text-to-Image
a cubist painting of a castle

'a human made of vines' CLIP Guided Diffusion Text-to-Image
a human made of vines

'a rough seascape' CLIP Guided Diffusion v2 Text-to-Image
a rough seascape

'frogs' CLIP Guided Diffusion v2 Text-to-Image
frogs

'h r giger' CLIP Guided Diffusion v2 Text-to-Image
h r giger

'a matte painting of a landscape' CLIP Guided Diffusion v2 Text-to-Image
a matte painting of a landscape


Name: Zoetrope 5
Author: Bearsharktopusdev
Original script: https://colab.research.google.com/drive/1LpEbICv1mmta7Qqic1IcRTsRsq7UKRHM
Time for 512×512 on a 3090: 2 minutes 36 seconds
Maximum resolution on a 24 GB 3090: 768×768 or 1280×720
Description: Nice crisp results. Can generate up to 720p (1280×720) resolution images on a 3090. Includes a lot of new ideas from multiple people to help improve the outputs.

'a detailed painting of a Pixar character' Zoetrope 5 Text-to-Image
a detailed painting of a Pixar character

'a futuristic city' Zoetrope 5 Text-to-Image
a futuristic city

'a planet' Zoetrope 5 Text-to-Image
a planet

'a surrealist sculpture of a sea monster' Zoetrope 5 Text-to-Image
a surrealist sculpture of a sea monster

'an art deco scultpture of a policeman' Zoetrope 5 Text-to-Image
an art deco scultpture of a policeman

'cyberpunk art of a forest fire in the style of Edvard Munch' Zoetrope 5 Text-to-Image
cyberpunk art of a forest fire in the style of Edvard Munch


Name: CLIP RGB Optimization
Author: hotgrits
Original script: https://cdn.discordapp.com/attachments/730484623028519072/871624258260987934/CLIP__RGB_Optimization_v0_3.ipynb
Time for 512×512 on a 3090: 2 minutes 50 seconds
Maximum resolution on a 24 GB 3090: 4096×4096
Description: Another CLIP RGB based script without the pixelated artefacts of the CLIPRGB script. Can render huge resolution images up to 4096×4096 on a 3090. This script gives more impressionistic textures. By default the output was a bit too dark for my liking so I have added options to tweak the gamma and contrast of the output images in the script. The gamma and contrast tweaks are only at the display stage and do not change the internal image being generated.
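To illustrate what "display stage only" means, here is a minimal sketch of a gamma and contrast adjustment applied to a copy of the pixel values while the internal image is left untouched. The formulas and the names `adjust` and `display_copy` are assumptions for the example. The actual options in the script may be implemented differently.

```python
def adjust(value, gamma, contrast):
    """Apply gamma then contrast to one channel value in the range 0..1.

    gamma > 1 brightens mid-tones; contrast > 1 pushes values away
    from mid-grey. The result is clamped back into 0..1.
    """
    value = value ** (1.0 / gamma)
    value = (value - 0.5) * contrast + 0.5
    return min(1.0, max(0.0, value))


def display_copy(pixels, gamma, contrast):
    """Return an adjusted copy for display; `pixels` itself is not modified."""
    return [adjust(v, gamma, contrast) for v in pixels]


internal = [0.0, 0.25, 1.0]           # values still being optimised
shown = display_copy(internal, gamma=2.0, contrast=1.0)
print(shown)                          # brightened copy that gets saved/displayed
print(internal)                       # unchanged internal image
```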

'a babbling brook' CLIP RGB Optimization
a babbling brook

'a movie monster' CLIP RGB Optimization
a movie monster

'an amusement park' CLIP RGB Optimization
an amusement park

'Chewbacca' CLIP RGB Optimization
Chewbacca

'Freddy Kruger in the style of Rembrandt' CLIP RGB Optimization
Freddy Kruger in the style of Rembrandt


Name: MSE Regulized VQGAN+CLIP
Author: jbusted
Original script: https://colab.research.google.com/drive/1hf1seGOZctOJUznkhJNblLluXHbWLKZh
Time for 512×512 on a 3090: 3 minutes 16 seconds
Maximum resolution on a 24 GB 3090: 768×768 or 1120×480
Description: Generates good images, but they tend to be surrounded by a grey/purple void border.

'a bronze sculpture of a heart' MSE Regulized VQGAN+CLIP
a bronze sculpture of a heart

'a cubist painting of Buzz Lightyear' MSE Regulized VQGAN+CLIP
a cubist painting of Buzz Lightyear

'a house made of string' MSE Regulized VQGAN+CLIP
a house made of string

'an art deco sculpture of a vampire' MSE Regulized VQGAN+CLIP
an art deco sculpture of a vampire

'chalk art of C-3PO' MSE Regulized VQGAN+CLIP
chalk art of C-3PO


Name: Sequential VQGAN+CLIP
Author: Jakeukalane and Avengium
Original script: https://colab.research.google.com/drive/1CcibxlLDng2yzcjLwwwSADRcisc1qVCs
Time for 512×512 on a 3090: 1 minute 41 seconds
Maximum resolution on a 24 GB 3090: 768×768 or 1120×480
Description: Really nice results and fast.

'a campfire in the style of Vincent van Gogh' Sequential VQGAN+CLIP
a campfire in the style of Vincent van Gogh

'a colorful parrot' Sequential VQGAN+CLIP
a colorful parrot

'a hyperrealistic painting of C-3PO' Sequential VQGAN+CLIP
a hyperrealistic painting of C-3PO

'an impressionist painting of Buzz Lightyear made of paper' Sequential VQGAN+CLIP
an impressionist painting of Buzz Lightyear made of paper

'New York City' Sequential VQGAN+CLIP
New York City


Name: CLIPRGB ImStack
Author: Jonathan Whitaker
Original script: https://colab.research.google.com/drive/1MCC2IwAaRNCTBUzghuG41ypAkxjJvGtq
Time for 512×512 on a 3090: 2 minutes 07 seconds
Maximum resolution on a 24 GB 3090: 2048×2048
Description: Another CLIP RGB variation. Nice results after some brightness, contrast and sharpness tweaks to the generated images. Could still be a bit sharper.

'a fine art painting of an angry person' CLIPRGB ImStack
a fine art painting of an angry person

'a fireplace in the style of Claude Monet' CLIPRGB ImStack
a fireplace in the style of Claude Monet

'a frog in the style of Beksinski' CLIPRGB ImStack
a frog in the style of Beksinski

'a nightmare creature in the style of H R Giger' CLIPRGB ImStack
a nightmare creature in the style of H R Giger

'a pointalism painting of a vampire made of copper' CLIPRGB ImStack
a pointalism painting of a vampire made of copper


Continued in Part 3 and Part 4.

Jason.
