Deepfake technology uses deep neural networks to convincingly replace one face with another in a video. The technology has obvious potential for abuse and is becoming ever more widely accessible. Many good articles have been written about the important social and political implications of this trend.
This isn’t one of those articles. Instead, in classic Ars Technica fashion, I’m going to take a close look at the technology itself: how does deepfake software work? How hard is it to use—and how good are the results?
I thought the best way to answer these questions would be to create a deepfake of my own. My Ars overlords gave me a few days to play around with deepfake software and a $1,000 cloud computing budget. A couple of weeks later, I have my result, which you can see above. I started with a video of Mark Zuckerberg testifying before Congress and replaced his face with that of Lieutenant Commander Data (Brent Spiner) from Star Trek: The Next Generation. Total spent: $552.
The video isn’t perfect. It doesn’t quite capture the full details of Data’s face, and if you look closely you can see some artifacts around the edges.
Still, what’s remarkable is that a neophyte like me can create a fairly convincing video so quickly and for so little money. And there’s every reason to think deepfake technology will continue to get better, faster, and cheaper in the coming years.
In this article I’ll take you with me on my deepfake journey. I’ll explain each step required to create a deepfake video. Along the way, I’ll explain how the underlying technology works and explore some of its limitations.
Deepfakes need a lot of computing power and data
We call them deepfakes because they use deep neural networks. Over the last decade, computer scientists have discovered that neural networks become more and more powerful as you add more layers of neurons (see the first installment of this series for a general introduction to neural networks). But to unlock the full power of these deeper networks, you need a lot of data and a whole lot of computing power.
That’s certainly true of deepfakes. For this project, I rented a virtual machine with four beefy graphics cards. Even with all that horsepower, it took almost a week to train my deepfake model.
I also needed a heap of images of both Mark Zuckerberg and Mr. Data. My final video above is only 38 seconds long, but I needed to gather a lot more footage of both Zuckerberg and Data for training.
To do this, I downloaded a bunch of videos containing their faces: 14 videos with clips from Star Trek: The Next Generation and nine videos featuring Mark Zuckerberg. My Zuckerberg videos included formal speeches, a couple of television interviews, and even footage of Zuckerberg smoking meat in his backyard.
I loaded all of these clips into iMovie and deleted sections that didn’t contain Zuckerberg’s or Data’s face. I also cut down longer sequences. Deepfake software doesn’t just need a huge number of images; it needs a huge number of different images. It needs to see a face from different angles, with different expressions, and in different lighting conditions. An hour-long video of Mark Zuckerberg giving a speech may not provide much more value than a five-minute segment of the same speech, because it just shows the same angles, lighting conditions, and expressions over and over again. So I trimmed several hours of footage down to 9 minutes of Data and 7 minutes of Zuckerberg.
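For a rough sense of scale, the trimmed footage can be converted into an approximate count of still frames. The Python sketch below is a back-of-the-envelope illustration only: the 30 frames-per-second rate, the stride of 15, and the helper names `frame_count` and `sample_indices` are assumptions made for the example, not details of this project or of any particular deepfake tool.

```python
def frame_count(minutes: float, fps: float = 30.0) -> int:
    """Number of video frames in `minutes` of footage at `fps` frames per second."""
    return int(minutes * 60 * fps)


def sample_indices(total_frames: int, stride: int) -> list[int]:
    """Indices of every `stride`-th frame, starting from frame 0.

    Neighboring frames are nearly identical, so keeping only every
    `stride`-th one is a crude way to favor variety over sheer volume.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(range(0, total_frames, stride))


# ASSUMPTION: 30 fps. Real source footage may well be 24 or 25 fps.
ASSUMED_FPS = 30.0

data_frames = frame_count(9, ASSUMED_FPS)        # 9 minutes of Data
zuckerberg_frames = frame_count(7, ASSUMED_FPS)  # 7 minutes of Zuckerberg
print(data_frames, zuckerberg_frames)

# Hypothetical thinning step: keep two frames per second of Data footage.
kept = sample_indices(data_frames, stride=15)
print(len(kept))
```

Under those assumptions, a few minutes of video already amounts to thousands of frames per face, which is why trimming repetitive footage costs little: what matters is how different the frames are from one another, not how many there are.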