Concept Formation in Computational Creativity Download PDF

6 Study: Text-to-image

by providing an interface to visual content based on natural language. This study explores the usage patterns and creative interactions that field experts tend to have while using text-to-image technology. The study was conducted using a Discord server over the course of 4 months, gathering 76 participants ranging from different related fields such as graphic design, photography and digital illustration. The participants were presented with a text-to-image bot and given a private space to use the technology for their own purpose. Qualitative data, collected through interviews and meetings, was complemented by quantitative measures extracted from server-gathered interaction data, including usage, prompts used, and types of operations (text-to-image, image-to-image, prompt interpolation). The study results reveal that the distribution of images generated per user (usage) follows an exponential pattern, and there is a negative correlation between expectations of text-image alignment and the Tolerance for Ambiguity (TA) personality trait (Norton 1975; Furnham and Marks 2013; Zenasni, Besançon, and Lubart 2008; Herman et al. 2010). These findings constitute some preliminary evidence suggesting that the adoption pattern of this new technology is dependent on social factors. Additionally, the findings suggest that expectations regarding this technology could be linked to TA, potentially resulting from an individual’s attunement to a specific TOC.

6.1 Scope

Text-to-image technology made a substantial impact in the field of computer vision since its inception. On Jan 6, 2021, OpenAI released CLIP, a model that was trained on 400 million image-text pairs, able to create a joint latent space combining visual and textual information (Radford et al. 2021). CLIP works in both directions, in the sense that it is not only capable of identifying objects in a picture, but also to guide the generation of images starting from text. Since the public release of CLIP, many open-source implementations of image generators based on its capabilities have been released, and their generated output has started to flood the internet.

CLIP and, more generally, the idea that we can control a generative model by conditioning its output on text (or, more precisely, token embeddings) constitutes a dramatic interface improvement. As it should be evident from all the TOCs reviewed in Section 2.1, language seems to play central role when dealing with concepts. CLIP effectively provides a latent space that unifies lexical and visual concepts. This multi-modality binds the compositional elements of language to the visual domain, allowing the generation of novel content obtained through the sophisticated combination making capabilities afforded by the pre-trained token embeddings.

Naturally, the limitations that neural networks generally encounter when dealing with analytical concepts are still in place. For example, study participants quickly find out that quantifiers have little effect on image generation: the prompt yields a random number of circles (see Figure 6.1). This should not be surprising, given the nature of the D[] operator, yet the technological opacity might conceal the reasons of this behavior. Our habitual way of thinking about intelligent systems perhaps creates the expectation that they must be analytical, yet in this case we are dealing with something completely different.

A collection of images generated using Stable Diffusion v1.5 and Automatic1111 UI, based on the prompt (Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 3828270218, Size: 512x512, and Model hash: cc6cb27103). Despite the prompt, the images display a random number of circles, indicating that the image generation model does not accurately capture cardinality. The figure reveals the limitations of LDMs in generating content that strictly adheres to prompts that contain analytic concepts such as numbers, highlighting the need for further improvements in understanding and representing cardinality in image generation models.

Referring back to the birth of generative art and the experiments with programmable plotters, it seems that R[Program] and D[Language] are very different, yet complementary, ways to manipulate visual concepts. Ultimately, both methods make use of randomness to generate variety, but in the former probabilities are defined by the programmer, whereas in the latter they are determined during training. For the sake of completeness, the post-phenomenological form of these two mediated interactions might be summarized as follows (from Section 3.1):

6.2 A clash of paradigms

r0.42 image

After the release of CLIP, the ML community began exploring its capabilities in various directions. Ryan Murdock combined CLIP with BigGAN (Brock, Donahue, and Simonyan 2018) to create BigSleep1, which was one of the first publicly available tools in this space. Katherine Crowson experimented with CLIP guidance in a novel method for image synthesis known as diffusion, as proposed by (Dhariwal and Nichol 2021). As 2021 drew to a close, numerous forums and user groups devoted to generative art began discussing image generation and their workflows as Python scripts that could be run on Google’s Colab, a platform that provides free (or nearly free) GPU computing time. Many of these Colab scripts utilized Crowson’s work, with Disco Diffusion2 proving particularly popular among the pioneers. As the community continued to experiment with CLIP and its potential applications, its impact on the field of ML and beyond remained a topic of ongoing interest and discussion.

Bar chart illustrating the number of posts per day in the Facebook group (https://www.facebook.com/groups/procgenart), with a noticeable explosion of posts attributed to the trend of AI-generated images. The chart highlights the increase in group activity related to AI-generated images in April and May 2022, followed by an abrupt decline as moderators took countermeasures to direct this type of content to a dedicated group named . The figure demonstrates the impact of AI-generated images on the group’s activity and the moderation challenges faced by the online community.

As a member of the Facebook group , I witnessed first hand the wave of controversial AI Art that took over the group in the months of April and May 2022. Around that time, images generated with Disco Diffusion were pretty much the only content being posted in that group, to the point that some members started to complain about it vigorously (e.g. Figure [fig:aiartcomplain]). Official statistics for the group are not available, but I was able to scrape the public posts and the trend is evident from the collected data, presented in Figure 6.2. The moderators of the group implemented a new strategy to handle the overwhelming volume of posts related to AI-generated images. They created a new dedicated group called and stopped publishing these types of images in the original group. As a result, there was a noticeable decrease in the number of posts around May, but this did not indicate a decline in interest. Rather, it was a deliberate effort to reduce the workload of the page admins who were struggling to keep up with the moderation demands.

6.3 Stable Diffusion

On August 22nd 2022, Stability.ai released to the public Stable Diffusion (SD) v1.4 (Rombach et al. 2021), a LDM that was trained on LAION-5B, a . While OpenAI, Google, Meta, and Microsoft have all made public releases of their research and development efforts, it is rare to see them share pre-trained models with the public. These companies defend their position by citing the potential dangers of unmanaged use, and at the same time safeguard their investments from the competition. Training models at large scale is undoubtedly a big expense that is hardly justifiable if it does not produce returns. Stability.ai, however, made the bold move of releasing their model to the public allowing anyone to use it without any cost. Since its release, SD has taken the creative industry by storm.

SD outperforms the generation speed of other openly-available models by several orders of magnitude because it conducts the diffusion process in latent space, unlike previous solutions, such as Disco Diffusion, that operate at pixel-level. It is able to do so because it uses a VAE to encode and decode images and token embeddings to and from latent space3. This significant speed increase, coupled with the large-scale dataset it was trained on, make SD one of the most exciting technologies that we have seen in this field4. Indeed, it is remarkable that such a small file (less than 4GB) can encode the information of almost 6 billion images that humans have shared on the internet.

The opportunities that this technology has opened up for the general public are immense. Many developers around the world have spontaneously developed user interfaces, extensions, performance improvements and also trained alternative models that can be used to generate specific subjects or in specific styles, all in less than a year since its public release.

In September 2022, it became possible to observe this technology in action in a study that could accommodate a large group of participants simultaneously. Unfortunately, the popular interfaces of SD (official webui5, automatic1111 webui6, invokeAI7) are not designed for group use. Projects in this space often use Discord as a platform to let users try the technology in a social context. For example, Midjourney, an image generation service based on SD, is a tool that existed only as Discord bot up until very recently (at the time of this writing, Midjourney’s API access was just announced). Discord’s interface enables programmatic interaction through bots, which can respond to predefined commands and execute arbitrary code. As SD was released, many open source solutions featuring image generation pipelines via Discord became available on GitHub8 and with all these components ready, a study observing interactions with SD became possible at a larger scale.

6.4 Method

A Discord server named MediaBots was opened on September 2nd 2022 to host the study. Participants were selected among design visual arts practitioners, photographers and design students. They were invited to a 2-hour workshop, during which they had the opportunity to use text-to-image technology. During the study, 76 people from 5 different workshops accessed the server. The workshop and participant details are summarized in Table [tab:workshops].

The server was setup so that participants would be required to join the study in order to see the content on the server. To join the study they were asked to confirm they have read the information sheet and then complete a questionnaire9 measuring:

The purpose of including these measures was to explore whether any of these factors was correlated with the usage patterns observed in the Discord server. In particular the TA scale is hypothesized to be a proxy for a participant’s affinity towards a non-analytical approach. The intuition is that ambiguity is more characteristic of D[] than R[] so this personality trait might have some influence in the way the tools are used and perceived.

Once the participants completed the questionnaire, they were shown how to generate images through the bot’s slashcommands, essentially special messages that begin with a / that are recognized by the bot as structured instructions.

The SD bot was named , a wordplay in homage of Sandro Botticelli which translates from Italian to roughly (much like Dall-E10 is an homage to Salvador Dalí and the Pixar animation character Wall-E). Under the hood, Botticello is using yasd-discord-bot11 and dalle-flow12 to interpret the commands and handle the generation queue. The choice of this architecture was made because the jina13 interface offered by the back-end supports queuing, making it possible for several users to use the same bot simultaneously, a crucial feature in a workshop setting.

An example of the /image slashcommand used in the interface.
An example of the output generated by the /image command.

The commands offered by Botticello are:

6.5 Results and discussion

The analysis relative to the data collected through the questionnaires during workshops and tool usage data logged by the bot is presented as follows:

  1. For general knowledge, intuition and expectation items, distribution bar charts are presented in Figure 6.6. As both understanding and intuition are skewed towards higher values, it appears that most participants have an idea of how this technology works or at least they can intuitively grasp it.

  2. TA scale shows good reliability when all items were included (N = 10, Cronbach’s alpha = 0.797). Constructed scale follows normal distribution (see Q-Q plot in Figure 6.7).

  3. Number of images generated per user (or tool usage) is not normally distributed. Q-Q plot and Kolmogorov–Smirnov confirm exponential distribution (see 6.8). Two outlier cases visible in the charts have been removed with a filter on number of generated images with condition  <  = 400.

image image
image image

In order from top to bottom, left to right: distribution of understanding, intuition, expectation of human compatibility, and expectation of alignment.

Compared to earlier studies (Herman et al. 2010; Furnham and Marks 2013), the TA scale was found to be slightly less reliable, which may be partly attributed to the impact of the COVID-19 pandemic. For instance, responses to certain items such as and may have deviated from the rest of the scale due to travel and gathering restrictions that have been in place in recent years.

The two outliers identified also reveal an interesting story. After an informal conversation with one of them, they explained the situation. They are a couple who run an Instagram page and the reason they had generated so many images was because they were posting content on social media and generating images on behalf of their friends.

The discovery of the exponential distribution in user usage suggests that a small number of individuals are responsible for the majority of the interaction. Despite the diverse range of participants, it is noteworthy that the Q-Q plot appears as a near-perfect straight line, as depicted on the right side of Figure 6.8. The exact cause of this phenomenon is not entirely clear. However, one possible explanation, based on the observation of two outliers, is that network effects play a crucial role in predicting tool usage. The relationship between the Person and Press elements could provide added incentive for individuals who are well-connected in a social network. It is worth noting that there is no significant correlation between usage and any other measured variables in the data collected, indicating that external factors may be influencing participants’ use of the bot.

image image

Q-Q plots for tolerance for ambiguity. Left: normal distribution. Right (values adjusted to be in a positive range after recoding): exponential distribution.

image image

Q-Q plots for total number of generated images per user. Left: normal distribution. Right: exponential distribution.

The data reveals a significant correlation between the Expectation of Alignment item, , and the TA scale variable. Additionally, the Expectation of Human Compatibility item, , exhibits a significant correlation with EA, but not with TA. This indicates that TA and EHC may exert independent effects on EA, a relationship that warrants further investigation. To evaluate this interaction, a linear regression analysis was conducted using SPSS, with the findings illustrated in Figure 6.9.

Results for linear regression on Expectation of Alignment with Tolerance of Ambiguity and Expectation of Human Compatibility as independent variables.

The negative coefficient attributed to TA is particularly noteworthy concerning the potential impact of this personality trait on the perception of text-to-image technology. A plausible interpretation of this result is that individuals with high TA may exhibit lower expectations regarding image alignment, as they already anticipate and accept the inherent ambiguity present in both language and images. Conversely, those who display intolerance to uncertainty are more likely to expect a clear alignment between language and images, projecting this expectation onto the technology. Nonetheless, none of the three variables exhibit a significant correlation with log(usage), implying that these expectations do not ultimately influence the inclination to use the technology.

In conclusion, we may summarize the findings according to the ACASIA modules as follows:

6.6 Limitations and further research

Overall, this study’s limitations are related primarily to the lack of a structured methodology to address this novel type of interaction. In fact, the concept of image alignment with prompts is unique to this new form of generation and calls for further investigation. Alignment in this context has both objective and subjective components and its operationalization is dependent on the TOC that a person might be attuned with. In a creative context, Alignment perhaps is not critical or even desirable. A user could be looking for novel misaligned associations and combinations. In this sense, the non-analytic nature of D[] can be considered an asset, echoing the hypothesis that D[] is an inherently creative operator, as (Hoorn 2023) suggests.

Moreover, the initial evidence discovered in this study suggesting the hypothesis of a socially driven adoption of this tool, remains inconclusive. The absence of correlation between number of images generated per user (tool usage) and any of the measured variables appears to rule out factors related to personal aspects, such as technology acceptance (TA) and expectations (EHC and EA), as drivers of use. However, the qualitative evidence emerged from this study in support of Press-driven adoption is not definitive and can only indicate a direction for further research.

In conclusion, the relationship between EA, TA and EHC constitutes an unexpected but relevant finding of this research. It points to a possible link between personality traits such as TA and expectations about the behavior of text-to-image technology. Further research is needed to investigate whether this connection is generalizable to all technologies adopting D[]. This finding also suggest that there might be individual differences in how we expect concepts to behave, implying that humans develop an unconscious preference for a TOC, which they then tend to ascribe to the technology they use.

is never a process that begins from scratch: to design is always to redesign. There is always something that exists first as a given, as an issue, as a problem.