A level-headed evaluation of generative AI's true potential

At every step and every level there's an abundance of detail with material consequences.
- John Salvatier, Reality has a surprising amount of detail

This week, OpenAI revealed a new text-to-video tool called Sora. The demos were quite impressive. This release, of course, resulted in a new wave of fearmongering about how AI is scary and will result in all of us losing our jobs to robots. So I think now is a great time to have a level-headed talk about why that's not the case.

Creative Prompting

I want to discuss the use of generative AI for creative content such as images, video, and, in general, fictional content.

Prompting a generative AI model is like babysitting a malevolent child whose only objective is to piss you off by taking everything you say literally. You must be precise with your words, not wasting a single character on unimportant details that could skew the results in a completely different direction. But you also need to provide as much detail as possible so the model can't possibly misunderstand your desires.

This... sucks. There's nothing that can get me to bang my head against a wall for three straight hours like having a one-way conversation with a computer and trying to understand why it's not producing the results I want. I have to do this a lot in my career as a software engineer, but generative AI is worse. It's not a predictable programming language, where a function can be deterministic and return the same output for the same input. Instead, it's a black box that can produce completely different results given the same input, practically every time.
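To make that contrast concrete, here's a toy sketch in Python. It assumes nothing about how any real model works: `toy_generate`, `COMPLETIONS`, and `WEIGHTS` are made-up stand-ins that only mimic the idea of sampling an answer instead of computing one.

```python
import random


def add_tax(price_cents: int) -> int:
    """Ordinary code: the same input always produces the same output."""
    return price_cents + price_cents * 8 // 100


# Hypothetical stand-in for a generative model. Instead of computing one
# fixed answer, it samples a completion from a weighted set of options.
COMPLETIONS = ["a sunset over a city", "a sunset over the sea", "a sunset on Mars"]
WEIGHTS = [0.5, 0.3, 0.2]


def toy_generate(prompt: str, rng: random.Random) -> str:
    """Ignore the prompt's fine print and sample, like a black box would."""
    return rng.choices(COMPLETIONS, weights=WEIGHTS, k=1)[0]


rng = random.Random()  # unseeded on purpose, so every run differs
outputs = {toy_generate("a sunset", rng) for _ in range(200)}

print(add_tax(1000), add_tax(1000))  # identical every time
print(sorted(outputs))               # several different answers to one prompt
```

Real tools often expose knobs like a seed or a temperature that narrow this spread, but in my experience they rarely make it go away.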

Even when your prompt shows promise of consistency, it may be a false positive. Billion-dollar companies are suffering from this reality. Recently, an Air Canada AI chatbot lied to a passenger about bereavement discounts, and the company was required to honor the discount!

Speed

Despite its flaws in consistency, generative AI can help churn out results quickly. The image at the start of this blog entry was created using generative AI, and it only took a few seconds to make. It's probably true that I can generate an image faster by using generative AI than it would take for an artist to create it themselves.

Sora gives people the ability to do the same but with entire videos. We can now generate complex media faster than ever before! What seems to follow is the idea that we don't need nearly as many artists, software engineers, copywriters, etc. because one person can now do the job of several, right? ...right?

An Abundance of Detail

When using generative AI to create art, you might find yourself thinking, "No, that's not quite what I wanted." I see this come up a lot in forums discussing generative AI, and the folks coming to the defense of the tool tend to suggest that the prompt lacks detail and that you need to be more specific. That's a fair point, especially when you zoom out of generative AI and look at how the world works today.

Detail is what separates my generated picture from a professional motion picture that makes over $100 million in revenue at the box office; my personal blog from the blog of a professional full-time writer; my 2014 GameMaker implementation of Wingman Sam from a 30-hour AAA video game title. Detail separates the hobbyist from the professional.

This is what helps me sleep at night.

If you want to use generative AI to create work that's as impressive as professional content, you're going to need a prompt with an insane amount of detail. It will also take a lot of trial and error to finally get the model to produce exactly what you want. Generative AI is just a tool, similar to autocomplete, a compiler, a spell checker, or any other assistive tool. It's not going to do any work for you without being its own chore.

As an example, let's talk about Sora again. Sora is just generating silent videos. They have no audio, no dialogue, and none of the subjects can speak or make any noise. There's [already a concern](https://community.openai.com/t/sora-wait-but-does-it-come-with-sound/632185/3) about this. But what if you want to make a movie? How do I use generative AI to:

Generate a script?
Read the script?
Generate a video to go with the script?

You know, aspects of modern film that actually make a film interesting? Suddenly you need to integrate a bunch of different generative AI tools to get the job done. OpenAI has a music generation tool called Jukebox, which could add music to the video, but not dialogue. Adding dialogue to generated videos sounds like a tall order. Not only do you need to generate a video of a specific scene, but you also need to get the characters in that scene to move their mouths in sync with the words spoken from the script. The generated audio of the characters talking must be input as well because the video generator needs to know how the words are being spoken. Not just tone, but pace. Imagine how many people you're going to need just to get these details ironed out.

As you can see, detail becomes quite difficult. I included a quote at the start of this blog post:

At every step and every level there's an abundance of detail with material consequences.

This is from a blog post I recently read, and I love it. It's called "Reality has a surprising amount of detail." The title is exactly what I want to communicate about highly detailed prompting. Reality is infinite in detail, but your AI model has a token limit. It can only take you so far before you must pivot to relying on other people to get the job done, and so far, people are still the ones doing the heavy lifting.

Less Detail, More Variety

Focusing on detail sounds exhausting, so what happens if you opt to keep it simple? What if you just provide shorter descriptions that only take a few seconds to write?

You end up with extremely diverse results, most of which don't align with what you had in mind. Now you have to search through hundreds, maybe even thousands of generated artifacts to eventually find the one you're looking for because the prompt is simply too generic. You may never find the output you want. When you opt for brevity when prompting, you're asking the model to figure out the details for you (and the model has a wild imagination). It can come up with seemingly infinite permutations for you to sift through.

Try generating an image with the prompt "a man." You'll get all sorts of men. But I wanted a man with blue hair, brown eyes, and a mustache, wearing a black jacket with a hood on. How long is it going to take me to find that exact output if I just input "a man" as the prompt? Probably forever.
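Here's a back-of-the-envelope sketch of why "probably forever" isn't much of an exaggeration. The probabilities below are numbers I made up for illustration, and the sketch assumes each detail shows up independently of the others, which a real model won't guarantee.

```python
from fractions import Fraction
from math import prod

# Hypothetical odds that a generic "a man" prompt happens to include each
# detail I wanted. These are illustrative guesses, not measurements.
detail_odds = {
    "blue hair": Fraction(1, 100),
    "brown eyes": Fraction(2, 5),
    "mustache": Fraction(1, 5),
    "black hooded jacket": Fraction(1, 50),
}

# Assuming independence, the chance of getting every detail at once is the
# product of the individual chances.
match_probability = prod(detail_odds.values())

# For repeated independent attempts, the expected number of tries until the
# first success is 1 / p (the mean of a geometric distribution).
expected_tries = 1 / match_probability

print(match_probability)  # 1/62500
print(expected_tries)     # 62500
```

Every extra detail you leave out of the prompt multiplies the pile you have to sift through.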

There's an equilibrium between depth and breadth where you'll be able to minimize your efforts. It will still take lots of critical thought about details, and lots of searching.

Job Impact

Okay, enough yapping about prompts. How does AI actually impact our jobs? Why won't it be replacing us?

Businesses don't solve small problems like generating images or video clips; businesses solve large, complex problems for stakeholders with conflicting interests.

The hardest part of solving these problems is the logistics. Tackling large, complex problems requires the time and effort of a large group of problem solvers, all of whom need to be organized by other problem solvers to keep things moving.

Grab two people at random, and allow them to self-organize their work in isolation. Odds are, their standards and schedules are not aligned with one another. If you ask them to work together, you might need to set the standards so that they're working at the same time and producing the same quality of results. Cohesion is defined as a "force"; you need to do work to align independent units.

Teams and components can run smoothly in isolation, but a lot of conflicts arise when you introduce several teams/components into a single environment. When multiple co-dependent units are working toward a goal, some level of cohesion is required to get anything meaningful done.

This work to achieve cohesion is complex and frequently involves resolving interpersonal conflicts, which are often illogical or emotional. I wouldn't put a robot in charge of resolving those types of conflicts.

To wrap this up in a neat little bow:

Businesses make money by solving big problems for lots of people (or other businesses that are made up of people)
Big problems are hard to solve because there are a lot of moving parts
Moving parts need to be orchestrated to resolve interpersonal conflicts and maintain cohesion
People are the most reliable resource when it comes to resolving interpersonal conflicts.
Robots suck at this, no matter how much your data-stealing AI girlfriend indicates otherwise.

Conclusion

When I look at the current state of generative AI, I don't see any way for jobs to be automated out of existence. The prompting interface and the model's lack of actual autonomy or critical thinking are the key limitations. Generative AI will continue to empower people to be more efficient at their jobs, but that's as far as it can go.

There's one statement I keep hearing that I can agree with: AI is not going to take your job. The biggest threat to your job security is someone who is using AI to be more efficient than you. Generative AI is a great resource to help you move faster, but it won't be completely automating your job any time soon. Add it to your toolbox to stay ahead of the curve.