Study: Text-to-image

by providing an interface to visual content based on natural language. This study explores the usage patterns and creative interactions that field experts tend to have while using text-to-image technology. The study was conducted using a Discord server over the course of 4 months, gathering 76 participants ranging from different related fields such as graphic design, photography and digital illustration. The participants were presented with a text-to-image bot and given a private space to use the technology for their own purpose. Qualitative data, collected through interviews and meetings, was complemented by quantitative measures extracted from server-gathered interaction data, including usage, prompts used, and types of operations (text-to-image, image-to-image, prompt interpolation). The study results reveal that the distribution of images generated per user (usage) follows an exponential pattern, and there is a negative correlation between expectations of text-image alignment and the Tolerance for Ambiguity (TA) personality trait (Norton 1975; Furnham and Marks 2013; Zenasni, Besançon, and Lubart 2008; Herman et al. 2010). These findings constitute some preliminary evidence suggesting that the adoption pattern of this new technology is dependent on social factors. Additionally, the findings suggest that expectations regarding this technology could be linked to TA, potentially resulting from an individual’s attunement to a specific TOC.

6.1 Scope

Text-to-image technology made a substantial impact in the field of computer vision since its inception. On Jan 6, 2021, OpenAI released CLIP, a model that was trained on 400 million image-text pairs, able to create a joint latent space combining visual and textual information (Radford et al. 2021). CLIP works in both directions, in the sense that it is not only capable of identifying objects in a picture, but also to guide the generation of images starting from text. Since the public release of CLIP, many open-source implementations of image generators based on its capabilities have been released, and their generated output has started to flood the internet.

CLIP and, more generally, the idea that we can control a generative model by conditioning its output on text (or, more precisely, token embeddings) constitutes a dramatic interface improvement. As it should be evident from all the TOCs reviewed in Section 2.1, language seems to play central role when dealing with concepts. CLIP effectively provides a latent space that unifies lexical and visual concepts. This multi-modality binds the compositional elements of language to the visual domain, allowing the generation of novel content obtained through the sophisticated combination making capabilities afforded by the pre-trained token embeddings.

Naturally, the limitations that neural networks generally encounter when dealing with analytical concepts are still in place. For example, study participants quickly find out that quantifiers have little effect on image generation: the prompt yields a random number of circles (see Figure 6.1). This should not be surprising, given the nature of the D[] operator, yet the technological opacity might conceal the reasons of this behavior. Our habitual way of thinking about intelligent systems perhaps creates the expectation that they must be analytical, yet in this case we are dealing with something completely different.

Referring back to the birth of generative art and the experiments with programmable plotters, it seems that R[Program] and D[Language] are very different, yet complementary, ways to manipulate visual concepts. Ultimately, both methods make use of randomness to generate variety, but in the former probabilities are defined by the programmer, whereas in the latter they are determined during training. For the sake of completeness, the post-phenomenological form of these two mediated interactions might be summarized as follows (from Section 3.1):

6.2 A clash of paradigms

r0.42

After the release of CLIP, the ML community began exploring its capabilities in various directions. Ryan Murdock combined CLIP with BigGAN (Brock, Donahue, and Simonyan 2018) to create BigSleep¹, which was one of the first publicly available tools in this space. Katherine Crowson experimented with CLIP guidance in a novel method for image synthesis known as diffusion, as proposed by (Dhariwal and Nichol 2021). As 2021 drew to a close, numerous forums and user groups devoted to generative art began discussing image generation and their workflows as Python scripts that could be run on Google’s Colab, a platform that provides free (or nearly free) GPU computing time. Many of these Colab scripts utilized Crowson’s work, with Disco Diffusion² proving particularly popular among the pioneers. As the community continued to experiment with CLIP and its potential applications, its impact on the field of ML and beyond remained a topic of ongoing interest and discussion.

As a member of the Facebook group , I witnessed first hand the wave of controversial AI Art that took over the group in the months of April and May 2022. Around that time, images generated with Disco Diffusion were pretty much the only content being posted in that group, to the point that some members started to complain about it vigorously (e.g. Figure [fig:aiartcomplain]). Official statistics for the group are not available, but I was able to scrape the public posts and the trend is evident from the collected data, presented in Figure 6.2. The moderators of the group implemented a new strategy to handle the overwhelming volume of posts related to AI-generated images. They created a new dedicated group called and stopped publishing these types of images in the original group. As a result, there was a noticeable decrease in the number of posts around May, but this did not indicate a decline in interest. Rather, it was a deliberate effort to reduce the workload of the page admins who were struggling to keep up with the moderation demands.

6.3 Stable Diffusion

On August 22nd 2022, Stability.ai released to the public Stable Diffusion (SD) v1.4 (Rombach et al. 2021), a LDM that was trained on LAION-5B, a . While OpenAI, Google, Meta, and Microsoft have all made public releases of their research and development efforts, it is rare to see them share pre-trained models with the public. These companies defend their position by citing the potential dangers of unmanaged use, and at the same time safeguard their investments from the competition. Training models at large scale is undoubtedly a big expense that is hardly justifiable if it does not produce returns. Stability.ai, however, made the bold move of releasing their model to the public allowing anyone to use it without any cost. Since its release, SD has taken the creative industry by storm.

SD outperforms the generation speed of other openly-available models by several orders of magnitude because it conducts the diffusion process in latent space, unlike previous solutions, such as Disco Diffusion, that operate at pixel-level. It is able to do so because it uses a VAE to encode and decode images and token embeddings to and from latent space³. This significant speed increase, coupled with the large-scale dataset it was trained on, make SD one of the most exciting technologies that we have seen in this field⁴. Indeed, it is remarkable that such a small file (less than 4GB) can encode the information of almost 6 billion images that humans have shared on the internet.

The opportunities that this technology has opened up for the general public are immense. Many developers around the world have spontaneously developed user interfaces, extensions, performance improvements and also trained alternative models that can be used to generate specific subjects or in specific styles, all in less than a year since its public release.

In September 2022, it became possible to observe this technology in action in a study that could accommodate a large group of participants simultaneously. Unfortunately, the popular interfaces of SD (official webui⁵, automatic1111 webui⁶, invokeAI⁷) are not designed for group use. Projects in this space often use Discord as a platform to let users try the technology in a social context. For example, Midjourney, an image generation service based on SD, is a tool that existed only as Discord bot up until very recently (at the time of this writing, Midjourney’s API access was just announced). Discord’s interface enables programmatic interaction through bots, which can respond to predefined commands and execute arbitrary code. As SD was released, many open source solutions featuring image generation pipelines via Discord became available on GitHub⁸ and with all these components ready, a study observing interactions with SD became possible at a larger scale.

6.4 Method

A Discord server named MediaBots was opened on September 2nd 2022 to host the study. Participants were selected among design visual arts practitioners, photographers and design students. They were invited to a 2-hour workshop, during which they had the opportunity to use text-to-image technology. During the study, 76 people from 5 different workshops accessed the server. The workshop and participant details are summarized in Table [tab:workshops].

The server was setup so that participants would be required to join the study in order to see the content on the server. To join the study they were asked to confirm they have read the information sheet and then complete a questionnaire⁹ measuring:

The purpose of including these measures was to explore whether any of these factors was correlated with the usage patterns observed in the Discord server. In particular the TA scale is hypothesized to be a proxy for a participant’s affinity towards a non-analytical approach. The intuition is that ambiguity is more characteristic of D[] than R[] so this personality trait might have some influence in the way the tools are used and perceived.

Once the participants completed the questionnaire, they were shown how to generate images through the bot’s slashcommands, essentially special messages that begin with a / that are recognized by the bot as structured instructions.

The SD bot was named , a wordplay in homage of Sandro Botticelli which translates from Italian to roughly (much like Dall-E¹⁰ is an homage to Salvador Dalí and the Pixar animation character Wall-E). Under the hood, Botticello is using yasd-discord-bot¹¹ and dalle-flow¹² to interpret the commands and handle the generation queue. The choice of this architecture was made because the jina¹³ interface offered by the back-end supports queuing, making it possible for several users to use the same bot simultaneously, a crucial feature in a workshop setting.

6.5 Results and discussion

The analysis relative to the data collected through the questionnaires during workshops and tool usage data logged by the bot is presented as follows:

Compared to earlier studies (Herman et al. 2010; Furnham and Marks 2013), the TA scale was found to be slightly less reliable, which may be partly attributed to the impact of the COVID-19 pandemic. For instance, responses to certain items such as and may have deviated from the rest of the scale due to travel and gathering restrictions that have been in place in recent years.

The two outliers identified also reveal an interesting story. After an informal conversation with one of them, they explained the situation. They are a couple who run an Instagram page and the reason they had generated so many images was because they were posting content on social media and generating images on behalf of their friends.

The discovery of the exponential distribution in user usage suggests that a small number of individuals are responsible for the majority of the interaction. Despite the diverse range of participants, it is noteworthy that the Q-Q plot appears as a near-perfect straight line, as depicted on the right side of Figure 6.8. The exact cause of this phenomenon is not entirely clear. However, one possible explanation, based on the observation of two outliers, is that network effects play a crucial role in predicting tool usage. The relationship between the Person and Press elements could provide added incentive for individuals who are well-connected in a social network. It is worth noting that there is no significant correlation between usage and any other measured variables in the data collected, indicating that external factors may be influencing participants’ use of the bot.

The data reveals a significant correlation between the Expectation of Alignment item, , and the TA scale variable. Additionally, the Expectation of Human Compatibility item, , exhibits a significant correlation with EA, but not with TA. This indicates that TA and EHC may exert independent effects on EA, a relationship that warrants further investigation. To evaluate this interaction, a linear regression analysis was conducted using SPSS, with the findings illustrated in Figure 6.9.

The negative coefficient attributed to TA is particularly noteworthy concerning the potential impact of this personality trait on the perception of text-to-image technology. A plausible interpretation of this result is that individuals with high TA may exhibit lower expectations regarding image alignment, as they already anticipate and accept the inherent ambiguity present in both language and images. Conversely, those who display intolerance to uncertainty are more likely to expect a clear alignment between language and images, projecting this expectation onto the technology. Nonetheless, none of the three variables exhibit a significant correlation with log(usage), implying that these expectations do not ultimately influence the inclination to use the technology.

In conclusion, we may summarize the findings according to the ACASIA modules as follows:

6.6 Limitations and further research

Overall, this study’s limitations are related primarily to the lack of a structured methodology to address this novel type of interaction. In fact, the concept of image alignment with prompts is unique to this new form of generation and calls for further investigation. Alignment in this context has both objective and subjective components and its operationalization is dependent on the TOC that a person might be attuned with. In a creative context, Alignment perhaps is not critical or even desirable. A user could be looking for novel misaligned associations and combinations. In this sense, the non-analytic nature of D[] can be considered an asset, echoing the hypothesis that D[] is an inherently creative operator, as (Hoorn 2023) suggests.

Moreover, the initial evidence discovered in this study suggesting the hypothesis of a socially driven adoption of this tool, remains inconclusive. The absence of correlation between number of images generated per user (tool usage) and any of the measured variables appears to rule out factors related to personal aspects, such as technology acceptance (TA) and expectations (EHC and EA), as drivers of use. However, the qualitative evidence emerged from this study in support of Press-driven adoption is not definitive and can only indicate a direction for further research.

In conclusion, the relationship between EA, TA and EHC constitutes an unexpected but relevant finding of this research. It points to a possible link between personality traits such as TA and expectations about the behavior of text-to-image technology. Further research is needed to investigate whether this connection is generalizable to all technologies adopting D[]. This finding also suggest that there might be individual differences in how we expect concepts to behave, implying that humans develop an unconscious preference for a TOC, which they then tend to ascribe to the technology they use.

is never a process that begins from scratch: to design is always to redesign. There is always something that exists first as a given, as an issue, as a problem.

Source code available at: https://github.com/lucidrains/big-sleep ↩︎
Source code available at: https://github.com/alembics/disco-diffusion ↩︎
For an image of resolution 512x512, the pixel space is 3 × 512 × 512, which is then encoded/decoded by the VAE to a 4 × 64 × 64 latent space where the diffusion process takes place more efficiently.↩︎
At the time of this writing.↩︎
Project page: https://beta.dreamstudio.ai/generate ↩︎
Project page: https://github.com/AUTOMATIC1111/stable-diffusion-webui ↩︎
Project page: https://github.com/invoke-ai/InvokeAI ↩︎
GitHub (https://github.com) is a web-based platform that provides version control and collaboration services using Git, a distributed version control system. It allows developers to create, manage, and collaborate on repositories (projects) that contain source code, documentation, and other files.↩︎
See Appendix 10 for the full list of items in the questionnaire.↩︎
DALL-E is a machine learning-based image generation system created by OpenAI.↩︎
Source code available at: https://github.com/AmericanPresidentJimmyCarter/yasd-discord-bot ↩︎
Source code available at: https://github.com/jina-ai/dalle-flow ↩︎
Source code available at: https://github.com/jina-ai/jina ↩︎

6 Study: Text-to-image