The Implementor
March 2020
Creating a face and looking into adversarial training
To build a helpful and likeable synthetic colleague, we’re going to need a friendly and familiar face. Unfortunately, none of us is particularly artistic, and we’re not sure anyone would be willing to lend their face to a robot. Luckily, we have a solution: We can generate a face from a database of real Implementers!
Generative image models first became popular on the web in 2015, when Google showed off Deep Dream, or, as it was known at the time, Inceptionism.
To many of us, this was also our first introduction to artificial intelligence, and it sparked our interest as we looked at these surreal images, which deeply conflicted with what we thought machines were capable of. A computer could follow logical rules, but these pictures had concepts in them.
The landscapes weren’t real; they clearly weren’t images retrieved from a database, nor did they originate from any kind of rule set that a human would produce. However, they did make sense as illustrations of the concept of, say, “landscape” (first picture to the left). This is what an effective representation looks like – to the degree that we can visualise such a thing. The machine had learnt the essence of a landscape.
Since then, generative image models have developed rapidly, and it’s common to use their progress as shorthand for just how far artificial intelligence has come in the last few years, as in the tweet below from renowned AI researcher Ian Goodfellow (currently at Apple, formerly of the Google Brain team).
GANs, or Generative Adversarial Networks, are the type of model behind most of today’s breakthroughs in image generation. Fundamentally, a GAN is two models: a generator and a discriminator.
The discriminator is an image classification model that receives images and tries to guess whether they are “real” or “fake”.
The generator is an image generation model that tries to make images that the discriminator will misclassify as real.
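To make the two roles concrete, here is a minimal sketch of the setup in PyTorch. The layer sizes, the 100-dimensional noise vector and the flattened 64x64 greyscale images are our own illustrative assumptions, not the exact setup we trained; the point is only the shape of the game: noise in, image out on one side; image in, real-or-fake score out on the other.

```python
# A minimal sketch of the two-model setup (illustrative sizes, not our exact model).
import torch
import torch.nn as nn

LATENT_DIM = 100          # size of the random noise vector fed to the generator
IMG_PIXELS = 64 * 64      # a 64x64 greyscale image, flattened to a vector

# The generator: random noise in, an image out.
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, IMG_PIXELS),
    nn.Tanh(),            # pixel values scaled to [-1, 1]
)

# The discriminator: an image in, a single "probability of being real" out.
discriminator = nn.Sequential(
    nn.Linear(IMG_PIXELS, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),
)

noise = torch.randn(32, LATENT_DIM)   # a batch of 32 noise vectors
fake_images = generator(noise)        # 32 generated "images"
verdict = discriminator(fake_images)  # 32 real/fake guesses between 0 and 1
```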
At each step of the training process, we take a batch of images (in our case, 32 random pictures of people from Implement) and let the generator generate 32 images of its own. These will initially just be random white noise. The 64 images are shuffled and shown to the discriminator, which then tries to guess which are real and which are fake.
Initially, the task is very easy: 32 of the images are of people, and the other 32 are just noise. On the other hand, the discriminator starts out with no idea what it’s looking for – on the very first round, it has no concept of what “real” means, so it’s also very dumb.
As training progresses, the discriminator quickly learns to look for face-like things in the images, but just as quickly, the generator learns that it can trick the discriminator more easily by producing more face-like images.
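Below is a sketch of what one round of this game can look like in code, continuing from the generator and discriminator defined above. Two things are hedged here: in practice the real and fake halves are usually passed through the discriminator separately rather than literally shuffled into one pile (the effect is the same), and the loss function and learning rates are standard defaults, not necessarily the settings we used for the Implementor.

```python
# One round of the adversarial game, assuming the `generator`, `discriminator`,
# LATENT_DIM and imports from the previous sketch.
bce = nn.BCELoss()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

def training_round(real_images):
    """real_images: a batch of 32 flattened photos of Implementers."""
    batch = real_images.size(0)

    # Discriminator turn: label photos "real" (1) and generated images "fake" (0).
    noise = torch.randn(batch, LATENT_DIM)
    fake_images = generator(noise).detach()   # don't update the generator on this turn
    d_loss = (bce(discriminator(real_images), torch.ones(batch, 1))
              + bce(discriminator(fake_images), torch.zeros(batch, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator turn: produce images that the discriminator will call "real".
    noise = torch.randn(batch, LATENT_DIM)
    g_loss = bce(discriminator(generator(noise)), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

Each call to training_round corresponds to one round of the game described above: 32 real and 32 generated images are judged, the discriminator gets a little better at spotting fakes, and the generator gets a little better at producing convincing ones.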
We’ve allowed this process to run for about 5 days on a very powerful laptop. This means that the game has been played for about 175,000 rounds, and more than 10 million images have been analysed. In the video below, you can see how the generated faces start out as noise, then quickly become a vague outline of a human face and finally improve slowly over the course of time. Scrub around to see the difference more clearly – password: implementor.
If you, at this point, are a little disappointed that the results aren’t on par with the 2018 state of the art, I feel you. There are two reasons for this. First, training a generative model of that complexity takes far more computing resources than we have in our personal, but powerful, laptop. Producing state-of-the-art results takes about six days on a DGX-1 computer, which costs around DKK 1 million. Granted, a lot of people report decent success in replicating the results by spending a bit more time on much cheaper hardware (approx. DKK 10,000–20,000), but hardly less than that will do.
The second reason for the less-than-state-of-the-art result is the amount of data. In order to learn something general about – or the essence of – what an Implementer is, we need a lot of examples. We used our CV database to grab images of Implementers, and in there we can find about 1,000 images. That is enough to learn a bit, but it won’t provide a complete picture of what an Implementer is. The Implementor has just about learnt what a shirt looks like, it picks up rough facial features like eyes, nose and mouth, and it makes a few attempts at different haircuts, but it never really gets to anything too convincing: 1,000 images are not enough to learn the details of eyes or glasses, or stable representations of different haircuts. We also see that it struggles to produce female characteristics.
So, what have we gained? First off, the Implementor has a face! It’s pretty small, and it might not be that pretty, but we’d argue that it looks a little like a consultant. We might be on to something. Besides that, we now have a set of trained layers. Let’s look at the model.
The model we used is closely based on the DCGAN (Deep Convolutional GAN) architecture from 2015. We scaled a few things up, adding some extra neurons, and we used LeakyReLU activations, some average pooling and batch normalisation.
The image above shows the generator side; the discriminator side is similar, just mirrored. It has five layers (the blue cubes). Without going into the nuances of artificial neural networks, think of them as stacked statistical models which, as explained in the first post, progressively learn more advanced features as you move up the layers.
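For readers who want something more concrete than blue cubes, here is a rough sketch of what a DCGAN-style generator with five up-sampling stages, LeakyReLU activations and batch normalisation can look like. The channel counts and kernel sizes are illustrative guesses, not our exact architecture, and the average pooling mentioned above would sit on the mirrored discriminator side, where the image is shrunk back down to a single real/fake score.

```python
# A rough DCGAN-style generator sketch (illustrative sizes, not our exact model).
import torch.nn as nn

def up_block(in_ch, out_ch):
    """One up-sampling stage: transposed convolution, batch norm, LeakyReLU."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )

dcgan_generator = nn.Sequential(
    # Stage 1: project a (100, 1, 1) noise tensor to a small 4x4 feature map.
    nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0),
    nn.BatchNorm2d(512),
    nn.LeakyReLU(0.2),
    up_block(512, 256),   # stage 2: 4x4   -> 8x8
    up_block(256, 128),   # stage 3: 8x8   -> 16x16
    up_block(128, 64),    # stage 4: 16x16 -> 32x32
    # Stage 5: final up-sampling to a 64x64, 3-channel image.
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
    nn.Tanh(),            # RGB pixel values in [-1, 1]
)
```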
If we – as our next project – were to do some kind of image analysis of the Implementor, we could use our generative model as a starting point: instead of starting from pure white noise (like we saw this model do in the beginning), we could leverage some of the layers from this model, since they already contain information on what Implementers look like.
The above is a simplification of the true process, and in practice most researchers have found effective shortcuts, which means that they don’t have to train a full generative model on expensive, specialised hardware.
The general field of this kind of research is called transfer learning, i.e. using information from one task to improve performance on another – also known as pre-training.
While very few people use GANs specifically for pre-training or transfer learning within image analysis, the ideas are used constantly, and some of the biggest breakthroughs in recent years come from training an image model on one simple, cheap task and then using what it learnt there to boost performance on a more complex, expensive one.
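As a concrete, and entirely hypothetical, example of how this could look, the sketch below reuses the trained discriminator from the first snippet as a frozen feature extractor and trains only a small new head on top for some future image-analysis task. The ten-class task and the layer sizes are made up purely for illustration.

```python
# Transfer-learning sketch, assuming the trained `discriminator` from the
# first snippet; the downstream 10-class task is hypothetical.
import torch.nn as nn

# Drop the final real/fake scoring layers (Linear + Sigmoid) and keep the rest
# as a feature extractor that already "knows" what Implementers look like.
feature_extractor = nn.Sequential(*list(discriminator.children())[:-2])
for param in feature_extractor.parameters():
    param.requires_grad = False          # freeze the pre-trained layers

# Train only a small new classification head on top for the new task.
classifier = nn.Sequential(
    feature_extractor,
    nn.Flatten(),
    nn.Linear(256, 10),   # 256 pre-trained features in, 10 hypothetical classes out
)
```

The design choice is the same one driving the recent breakthroughs mentioned above: the expensive, general knowledge is learnt once on a cheap task, and only a small, task-specific layer has to be trained from scratch.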
That’s it for now! The Implementor now has a face, and we just need to write a CV to get it onto a project!