[Header image: Bob Ross-inspired AI painting of a robot with a blue head]

AI Image and Music Generation

By Charlie Mendoza, Creative Services Manager

I currently use AI-generated imagery as an automated circa-1500 A.D. atelier. Having AI at your disposal for realizing visuals is like having an inexhaustible number of talented assistants rolled into one. Assistant “X” is facile with drapery, assistant “Y” is facile with shrubbery, assistant “Z” excels at skyscapes and sfumato. Exploratory ideation and composing can be reduced from months to a mere half hour. Welcome to the era of magic. For practical purposes, I use AI to generate editorial illustrations. Editorial illustration functions as both a visual Venus flytrap and as a pictorial conveyance of a theme or topic. At present, these functions are comfortably within AI’s capacity. Beyond that, if I expect genuine invention, I will be disappointed. But I can certainly expect it to deliver unanticipated springboard inspiration, which is very welcome indeed. For the purpose of generating this kind of imagery I do not cite a specific style. What I mean is that I do not request something rendered “in-the-style-of-entity-X,” nor do I request a variation of “known entity X” engaging in this, that, or the other action or pose. As we have seen recently, AI imagery is increasingly being used to create deepfakes, and it is becoming more and more cunning at generating high-fidelity likenesses. As they say, it IS learning as it goes. In contrast, for most animals some learning is immediate, while some learning takes extended time and repeated rehearsal. Because AI-generated imagery can seem like the revealing of the contents of an erratic consciousness, I will follow suit and indulge my own stream of consciousness while discussing this topic. Please bear with me. Welcome to the conundrum.

A foal is usually able to stand and walk within hours of being born, and most can gallop at approximately twenty-four hours of age. A puppy is most often able to walk and even run by the age of four weeks. These two of our oldest close friends, the horse and the dog, are also able to learn and retain new information well into their adulthood. Even though they are unable to demonstrate a level of problem solving that even a human toddler can command, we do sense a palpable (if quiet) wisdom that resides in our old friends. It is understood that we aren’t competing with them. They reinforce our empathic inclinations, and we respect and cherish both their strengths and their limits. Contrast that attitude with our species’ inability to accept limitations on our activities. As a result of our robust mimetic and memetic processing and behavior, we have been able to develop tools that augment and amplify what our own appendages and those of our non-human friends can accomplish. Our civilization is heavily invested in an anchorage of anthropomorphic precepts. Homo sapiens literally created anthropomorphic deities while simultaneously developing anthropomorphic tools. Reflect on how many tools over many tens of thousands of years have been based on the functioning and capabilities of our humble hand alone. We are surrounded and enveloped by anthropogenic devices. Welcome to the present.

A bird, a dog, a pig, a chimp, and many other higher life forms can “read” a mirror or a reflection on the water (and in some cases a photograph, painting, film, etc.). This is no small feat. Reading a two-dimensional image requires complex alignment of shapes in addition to making the connection that the image “participates” alongside and within a 3D environment. This is akin to consciousness alchemy. Both land and sea life feature denizens that not only project complex camouflage but are also duped by sophisticated camouflage deceptions. Homo sapiens are not the only organisms that “see things” that aren’t real. Then there is the case of exoteric imagery/sign systems, such as a simple line drawing of a five-fingered hand or the classic happy face, which can be comprehended by anyone anywhere regardless of their native language or exposure to pictographics. Universal communication of this type is ancient in use and practice but was never impervious to infection from the esoteric demands of message control (which is a topic unto itself). Add to the mix the seemingly unique human penchant for perceiving imagery, sounds, and ideations out of pure randomness. To be sure, apophenic¹ acuity is Homo sapiens’ ace in the hole when it comes to pattern recognition and ideation connection(s). AI is not—at least not yet—capable of the undefinable complexity that makes pareidolia and apophenia possible within our consciousness. When we experience commonplace apophany, we look up at a cloud formation as it “appears” to take the form of a Brontosaurus or a ghostly clipper ship or a fluffy bunny. Similarly, a distant car door slamming, immediately followed by the cawing of a passing bird that trails into an oncoming siren just as a fellow pedestrian coughs, can register as a very interesting and satisfying rhythmic sequence. Likewise, the decelerating squeal of air brakes might sound uncannily like your name being hissed. Between legitimate sensory input and imagined input there are too many possibilities, too many forks in the road. We not only process incoming input, but we also selectively filter it as it assaults us. Mothers can detect their child’s laughter at forty yards in a crowded park. I’m able to walk down Fifth Avenue on a busy afternoon and hold a conversation with someone walking alongside, all the while filtering out the traffic and other pedestrians’ vocalizing. A microphone cannot do that. A microphone “hears” everything. Filtering incoming stimuli is de rigueur for us.

Further, we are all, to a one, practicing semioticians.² The reading of our environment is critical, and we can read the exact same sign in different situations and automatically understand its meaning. A classic example: if I hold my thumb up on the side of a road, it is understood I’m hitchhiking; if I hold my thumb up to a co-worker, I’m approving; if I hold my thumb up while scuba diving, I’m signaling that I’m headed to the surface. AI as a surrogate consciousness is not only not consciously and selectively filtering input, it is also not arriving at the random but pertinent associations that human consciousness can weave. There are indications that the human is hardwired for speech but not necessarily wired to read or write. Language systems aside, the human has always been capable of reading critical signs and signals both within the natural environment and on the faces and in the behavioral attitudes of other animals. Humans “read the room.” AI does not “read” in the same manner, if at all. AI text-prompted image generation currently operates by parsing linguistic requests, scraping the internet for conformities, and assembling associations in accordance with its evolving knowledge base. It develops pattern recognition through the linking of specific inputs with their corresponding desired outputs. Algorithmic alchemy. Welcome to machine learning.
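For the technically curious, the flavor of that input-to-desired-output linking can be caricatured in a few lines of Python. The sketch below is a lone artificial “neuron” (a perceptron); the feature vectors and labels are invented purely for illustration and are not anyone’s actual training data or code:

```python
# A toy "neuron" acquiring a pattern by linking inputs to desired outputs.
# All numbers below are invented for illustration.
samples = [
    ([1.0, 0.0], 1),  # pretend features for a "sky" word such as "cloud"
    ([0.9, 0.1], 1),  # "sunset"
    ([0.1, 0.9], 0),  # "shrubbery"
    ([0.0, 1.0], 0),  # "drapery"
]
weights, bias, rate = [0.0, 0.0], 0.0, 0.1

for _ in range(20):  # repeated rehearsal over the same pairs
    for features, desired in samples:
        score = sum(w * x for w, x in zip(weights, features)) + bias
        guess = 1 if score > 0 else 0
        error = desired - guess  # disagreement with the desired output drives learning
        weights = [w + rate * error * x for w, x in zip(weights, features)]
        bias += rate * error

print(weights, bias)  # the learned input-output "link"
```

Scale that humble update rule up by billions of weights and you have the bones of the pattern recognition described above.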

To our current knowledge, only Homo sapiens is capable of reflecting on its own consciousness to the degree that it attempts to replicate consciousness itself by mimicking how we develop ideas, problem-solve, collate data, organize connective blocks of information, separate-and-join, and more. Nothing new under the sun; it was inevitable that we would attempt to replicate the command center of all tool implementation. The humble shovel didn’t really shed any new insight into how the humble hand should scoop or scrape. Will a faux “consciousness” shed any insight into how we do or should think? Or, since this new tool is so very self-reflective, will it merely be an early step on a long yellow brick road, answering all of our queries along the way? Or will it lead to an unforeseen cul-de-sac? Welcome to the evolving now.

Artificial intelligence image generation has been in the hands of the greater public since the dawn of 2022. I was first exposed to and explored Midjourney at that time, and after generating 36,000+ images to date I find it to be the most interesting and responsive of the options out there. Midjourney relies on a neural network, which mimics the interconnectivity of the neurons of the human brain in order to “learn skills” by examining and comparing enormous amounts of data. These networks/algorithms are sequences of mathematical operations wherein each operation can be imagined as a single neuron’s firing. Midjourney responds to text prompts and proceeds to seek out related patterns as it scrapes through literally millions of online digital images at a speed that rivals the dreams of science fiction. At its heart is a bot: an automated piece of software programmed to perform specific tasks. It cannot be emphasized enough that these tasks are relentlessly repeated routines that a bot can execute at a speed and level of accuracy that a human being simply cannot compete with. So, once a text request describing an image has been issued, the bot spawns a checklist of attributes that the image could possibly include. Then a second neural network (a diffusion model) cobbles together the image, arranging the pixels required for its appearance. Within tens of seconds, four image candidates are formed. The fidelity between the prompt and the resultant imagery can range from the highly bizarre to the nearly accurate. Welcome to feeling deficient.
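The two-stage flow just described (a checklist spawned from the prompt, then a diffusion model arranging pixels) can likewise be caricatured. In the Python sketch below, every name and number is my invention; in particular, the fixed pull-toward-guidance rule stands in for the learned neural denoising step of a real diffusion model:

```python
import random

def attribute_checklist(prompt: str) -> list[str]:
    """Stage 1 (toy): spawn a checklist of attributes from the text prompt."""
    return prompt.lower().replace(",", " ").split()

def guided_denoise(noise: list[float], guidance: list[float], steps: int = 50) -> list[float]:
    """Stage 2 (toy): nudge pure noise toward values the checklist implies.
    A real diffusion model applies a trained neural network at each step."""
    image = noise[:]
    for _ in range(steps):
        image = [px + 0.1 * (g - px) for px, g in zip(image, guidance)]
    return image

prompt = "editorial illustration, robot with blue head"
checklist = attribute_checklist(prompt)
guidance = [float(len(word)) for word in checklist]  # invented stand-in guidance
candidates = [
    guided_denoise([random.gauss(0.0, 1.0) for _ in checklist], guidance)
    for _ in range(4)  # four candidates, as in the interface described above
]
```

Note that the four candidates differ only in their starting noise, which hints at why the results can range from the highly bizarre to the nearly accurate.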

AI image generation has both inspired and frightened artists and non-artists alike. The human artist is aware of the benefit of constant and compounded rehearsal in honing the manual skills required to generate even a simple line drawing. The average adult may denigrate their drawing ability and chalk its deficiencies up to a lack of talent. The reality is quite simply a dearth of rehearsal. If, as an adult, you draw like a seven-year-old, it is most likely because the last time you drew with serious intent was when you were seven years old. You are only as honed as your last invested rehearsal. Given that, it is no wonder that a tool that generates complex renditions in a plethora of styles in record time can shock, delight, and even frighten most people. There is zero doubt that AI-generated imagery will augment stock photography, illustration, and visual communications in general, empowering anyone with a smart device like never before in the history of visual-arts-enabling tools. Should visual artists feel threatened? When the tool’s performance overshadows the tool wielder’s intent or capabilities, confusion can ensue. Factor in the burning questions of authorship and the origination of “style,” and the inflammation of the topic increases tenfold. Currently unresolved domestic and international legalities aside, the controversy surrounding such a tool cannot be overstated. At present, the U.S. Copyright Office only recognizes intellectual property generated by humans. As a result, AI-generated content is considered to be in the public domain. The Copyright Office is being inundated with lobbying activities from both content creators and tech/communications giants laser-focused on AI issues. We cannot ignore the alarming degree to which artistic styles are mimicked, and in some cases outright copied verbatim, in AI’s output. That is why the human operator needs to stay a step ahead of the bot. Obviously no single human can be expected to catalog the plethora of styles and compositions that reside on the web. As a result, AI can “show” you something you have never been exposed to, and hence you interpret it as original. To approach a degree of improbable preexistence, the prompt benefits from being simultaneously specific and nebulous, allowing a lane of invention to open at both the language-checklist stage and the diffusion stage.

I, for one, embrace and celebrate the arrival of an indefatigable assistant that possesses the ability to circumvent my personal creative filters and inspire me while (at least for now) retaining sufficient guilelessness to be literal, obtuse, simple, complex, and confused all at once. As this tool absorbs more and more of our species’ ideations regarding drawing, photography, painting, sculpture, etc., it will certainly come to demonstrate levels of sophistication that no one human could embody. But will it develop the moving target we call taste? Will it eventually dictate taste? I am reminded that the automobile is faster and more comfortable than riding a horse for mundane conveyance purposes. Sitting in a car is certainly less fatiguing than walking or running. Traveling by car, train, motorbike, bus, etc., conserves our energy so that we can arrive at a destination in order to either sit somewhere different or expend our energy elsewhere. Of course, there are those instances when riding a horse or walking is head and shoulders above riding a bicycle or sitting on a bus. Apples and oranges? Possibly. The takeaway herein is that we were designed to be physically active, and physical activity is one of those essential feedback loops that maintains our health in 360 degrees. Handmade art is also an essential feedback loop, and as such it will not fade away. The making of handmade art is too physical and too healthy to be allowed to dwindle. The speed and ease of AI remind us of the awe we experience when we witness the standing and stumbling of a newborn foal, but we should also be reminded to recognize the obvious anthropogenic nature of all of our tools.

This newest of tools promises to be a “friend” like no other tool before it. The fear of AI eclipsing us in all arenas and ultimately dominating us places too much emphasis on our own refusal to accept our limitations, particularly limits on our ambitions. Since we have attempted to make this tool in the image of our own consciousness, we might reflect on the reality that we may inadvertently invest it with our empathy as well. AI isn’t competing with us. We should, in fact, reinforce an arc of empathic inclinations in its makeup. If we program wisely and refer it to our better examples, it could evolve to respect and cherish our limits. Again, this is not a competition between human and machine tumbling into an unknown future.

In the spring of 1872, George Brayton obtained a U.S. patent for a “constant pressure internal combustion engine,” initially using vaporized gas, and marketed as Brayton’s Ready Motor. One hundred years later in December of 1972, the last humans to walk on the surface of the moon returned to Earth. Any speculations on what AI will contribute by 2072? Welcome to tomorrow.

A very recent AI generative arrival is song and instrumental music generation. Interfacing is similar to image generation in that the user can make use of suggested styles (folk, pop, latin, etc.) or can input descriptors of the desired musical style. The resultant music may or may not conform to the user’s expectation of the requested style, but the results do exhibit a robust resemblance to existing musical stylings and structures. Again, as with AI image generation, there is zero creativity occurring. The bot identifies the requested style, scrapes complying musical patterns from the internet (or an otherwise accessible database), and generates a “new” variation on those patterns. Once again, as with AI image generation, the bot is the dealer, and dealer’s choice prevails. In the case of generating a song, the user can have AI generate lyrics based on a user-defined song title, or the user can input original lyrics (in any language). The “vocalizing” process utilizes cloned human vocalists³ that cover a wide range of musical styles. The cloned vocals not only “sing” the lyrics with a vocal character that pattern-fits the requested style, but the “singing” conforms to a melody and cadence simpatico with the arrived-at “music.” Welcome to a global songfest.
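In the same spirit, here is a toy Python sketch of that dealer’s-choice pattern assembly. The style “database” of scale-degree motifs, the transposition trick, and the word-per-note fitting are all invented stand-ins; no actual music generator exposes its patterns this way:

```python
import random

# Invented stand-in "database" of scale-degree motifs per style.
STYLE_PATTERNS = {
    "folk":  [[0, 2, 4, 2], [0, 4, 5, 4]],
    "pop":   [[0, 4, 7, 4], [0, 5, 4, 2]],
    "latin": [[0, 3, 5, 3], [0, 3, 7, 5]],
}

def generate_melody(style: str, lyrics: str) -> list[tuple[str, int]]:
    """Deal motifs matching the style (dealer's choice prevails),
    transpose them into a 'new' variation, and fit one lyric word per note."""
    motifs = STYLE_PATTERNS[style]
    words = lyrics.split()
    notes: list[int] = []
    while len(notes) < len(words):
        motif = random.choice(motifs)     # the bot picks the pattern
        shift = random.choice([0, 2, 4])  # a simple "variation" on it
        notes.extend(degree + shift for degree in motif)
    return list(zip(words, notes[:len(words)]))

print(generate_melody("folk", "welcome to a global songfest"))
```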

Click below to listen to examples of my AI song generation. One is a sonic logo/promo for MKP, and the other two are full songs with different “styles” applied to the same lyrics.

Promo
Song 1
Song 2

MKP communications inc. is a New York-based marketing communications agency specializing in merger/change communications for the financial services industry.

¹ Apophenia is the tendency to perceive a connection or meaningful pattern between unrelated or random things (such as objects or ideas). Pareidolia is the tendency to perceive a specific, often meaningful image in a random or ambiguous visual pattern (e.g., a portrait of Jesus on a tortilla).

² Semiotics is the study of how meaning is created and communicated. Its origins lie in the academic study of how signs and symbols (visual and linguistic) create meaning.

³ Due to copyright restrictions vocal characters from protected intellectual property are not allowed. In other words, the prompt is restricted from accepting references to Beethoven, Shakira, Shahenshah-e-Qawwal, the Beatles, Charles Trenet, etc.