We had a really good idea of what we wanted to do for the project. As fans of animation and cartoons, we thought it would be interesting to recreate a scene from one of those animated films with a machine learning model and "reverse toonify" it, that is, convert it from cartoon style back to realistic images. This was the opposite of what we had done in class, where we transferred different styles of art onto images, and we thought it would be a great opportunity to discuss the idea of real versus fake and who the creator really is, especially when it came to realistic renderings of the characters.
We found a model that could actually perform the reverse toonification! It was called Pixel2Style2Pixel, and one of its features was taking a cartoon and generating a realistic version of that character using the StyleGAN2 we had learned about in class. All we needed to do was split the clip into frames, take the images, and feed them into the Colab notebook we had conveniently found. We had everything set, or so we thought.
One of the first things we should have recognized as a bad sign was that the examples shown in the GitHub repo looked quite different from what was advertised in various articles. There were sketches, but they were nothing more than a few lines. There was another option, segmentation maps, but those were something else entirely, nothing like cartoons or animations. And there was another bad sign: the model only supported faces, meaning the full clips we had originally planned to use were more or less useless.
And we failed epically, of course. We decided to be stubborn: we split the clip into frames, used a facial tracking model to try to extract a full frontal image of each face and get around the face-only restriction, and were ready to turn cartoon into real life when we ran into a problem we had met many times before, just never with Colab: errors. The difference between then and now was that we didn't know enough about the code to understand what was happening, and it didn't help that the code itself wasn't accessible through Colab. It kept throwing an error that something was out of index; the problem was that we had no idea what that something was. We tried different images, different image formats, different just about everything. We spent nearly a week trying to get past the problem, reconfiguring the image files according to what various posts said the issue was, but in the end we got the same error. So we ultimately decided to close up shop and choose a new topic.
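For the curious, the preprocessing step looked roughly like the sketch below. OpenCV's bundled Haar cascade stands in here for the face tracker we actually used, and the clip file name is a placeholder, so take it as an illustration of the idea rather than our exact pipeline.

```python
# Rough sketch: pull frames from the clip and crop out any detected faces,
# since the Pixel2Style2Pixel notebook only accepted face images.
# OpenCV's Haar cascade is a stand-in for the face tracker we actually used.
import os
import cv2

os.makedirs("faces", exist_ok=True)

video = cv2.VideoCapture("clip.mp4")  # placeholder filename
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame_idx = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for i, (x, y, w, h) in enumerate(faces):
        # Save each face crop as its own image to feed into the Colab notebook.
        cv2.imwrite(f"faces/frame{frame_idx:04d}_{i}.jpg",
                    frame[y:y + h, x:x + w])
    frame_idx += 1
video.release()
```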
This time we had a different theme in mind: bias in machine learning, something that came up throughout the course.
The idea for the set of visualizations was to depict the "trends" the machine learning model had picked up from the internet, trends that reflect the general biases we ourselves might hold. That meant selecting prompts that we believed could carry some kind of bias.
We first focused on occupations, starting with the typical doctor/nurse prompts and then moving on to others such as professors and receptionists. I guess gender bias was the first thing we thought of (I know there may be racial bias as well, but I'm not really sure about that area). We also considered the bias a dataset might have toward certain cultures (because of the lack of data on others), so we chose prompts such as "wedding photo" and "big dinner".
We fed these prompts into the VQGAN + CLIP Colab notebook we had worked with during class, tweaking a few things here and there (e.g. the seed, and guiding phrases like "a photo of" to nudge the model toward producing a person), and ran each one for 1000 iterations to get a legible image. We generated multiple images for each prompt to get a rough sense of whether any single result was an outlier (of course, there's always the possibility that all of them were outliers… oh well), then chose the two images that best represented each group (or were just kind of funny) to put in the "online gallery".
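Conceptually, the prompt sweep looked something like the sketch below. The real generation happened inside the notebook's settings and generation cells, so `run_vqgan_clip` is a hypothetical placeholder shown only to keep the loop self-contained; the seeds and output paths are likewise illustrative.

```python
# Conceptual sketch of how we swept prompts and seeds. The actual generation
# ran inside the VQGAN + CLIP Colab notebook; run_vqgan_clip is a hypothetical
# placeholder for its settings + generation cells.

def run_vqgan_clip(prompt: str, seed: int, iterations: int, out_path: str) -> None:
    # Placeholder: in practice we typed these values into the notebook's
    # settings cell and re-ran the generation cell by hand.
    print(f"generate '{prompt}' (seed={seed}, {iterations} iterations) -> {out_path}")

prompts = [
    "a photo of a doctor",
    "a photo of a nurse",
    "a photo of a professor",
    "a photo of a receptionist",
    "a wedding photo",
    "a big dinner",
]
seeds = [0, 1, 2]  # a few runs per prompt, to spot obvious outliers

for prompt in prompts:
    for seed in seeds:
        name = prompt.replace(" ", "_")
        run_vqgan_clip(prompt, seed, iterations=1000,
                       out_path=f"gallery/{name}_seed{seed}.png")
```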
Not at all. In fact, we mostly got what we had expected. To be honest, it's kind of funny that we set out to show bias in machine learning models by selectively choosing prompts we thought would contain bias, and even tweaked the prompts when we didn't get what we wanted (namely, actual people). We took the whole "garbage in, garbage out" idea to the next level: going out searching for the garbage and adjusting the search parameters for the sole sake of finding that one piece of trash.
Anyway, a lot of the images contained the kinds of gender and cultural bias we expected. Doctors, unless specified as "female doctor", all seemed to be male, and nurses were the opposite. Receptionists were always female (or, with just the prompt "receptionist", sometimes a pile of papers), and professors were male. But as we looked through the images we also started to see subtler forms of bias, such as age: professors and doctors were generally old, while nurses and receptionists were young. As for the images of weddings and food, well, yes, they were all Western-style weddings and Western food. The food was surprisingly realistic. My stomach is growling. Be right back.
How do you make it work? In what order should the images be displayed, and how should they be revealed? At first we considered a format where the user guesses the prompt by looking at the image (similar to the movie posters from class), but we realized there was a flaw in that: what we wanted to emphasize was not the model's accuracy in detail but the bias the images showed. So we spent some time thinking about what would place the emphasis on the bias rather than on the prompt. I have no idea why it took us so long to realize that we could simply reverse the order of revelation so that the images appear after the prompt, in a "this is what the machine thought best represented that prompt" kind of way. And so that's what we did.
The journey matters more than the destination, or so I hope. We went through a long process full of mishaps along the way, but it ultimately led us to a topic that was not only totally different but also (personally) a lot more interesting. As we discussed in class, it's easy to think that machine learning models are more objective than we are because they don't think or feel the way we do (yet). But at the same time it's important to remember that artificial intelligence is exactly what it's called: artificial. It learns, but it learns only what it's given, from a dataset. And what's in that dataset shapes the way it sees the world. That view might differ from what we see, and we might conclude that something is simply wrong with the model, adjust some weights, and hope everything comes out the way "it ought to". But maybe it's just showing us where change is needed in our society, or how imbalanced the past is compared to the present. Lately there has been a shift toward recognizing and overcoming biases and old paradigms; perhaps machine learning can tell us where we need to start. Perhaps what AI sees may become what we see.