- S-Lab, Nanyang Technological University
TL;DR: In this work, we focus on detecting DeepFake manipulation sequences rather than binary labels.
Abstract
Since photorealistic faces can now be readily generated by facial manipulation technologies,
the potential malicious abuse of these technologies has raised great concern, and numerous deepfake
detection methods have been proposed. However, existing methods focus only on detecting one-step facial
manipulation. With the emergence of easily accessible facial editing applications, people can
manipulate facial components using multi-step operations in a sequential manner. This new
threat requires us to detect a sequence of facial manipulations, which is vital both for detecting
deepfake media and for recovering the original faces afterwards. Motivated by this observation, we emphasize
the need for, and propose, a novel research problem called Detecting Sequential DeepFake Manipulation
(Seq-DeepFake). Unlike the existing deepfake detection task, which only demands a binary label prediction,
detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation
operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, where face
images are manipulated sequentially with corresponding annotations of sequential facial manipulation vectors.
Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence
(e.g., image captioning) task and propose a concise yet effective Seq-DeepFake Transformer
(SeqFakeFormer). Moreover, we build a comprehensive benchmark and set up rigorous evaluation protocols
and metrics for this new research problem. Extensive experiments demonstrate the effectiveness of SeqFakeFormer,
and several valuable observations are revealed to facilitate future research on broader deepfake detection problems.
Seq-DeepFake Dataset
Seq-DeepFake is the first large-scale dataset for Sequential DeepFake Manipulation Detection. It consists of
85k sequentially manipulated face images, each with a ground-truth sequence annotation. The dataset
covers highly diverse manipulation sequences with lengths from 0 to 5, and is generated with two different
facial manipulation methods:
- Sequential facial components manipulation
- 35,166 face images
- 28 types of manipulation sequences
- Sequential facial attributes manipulation
- 49,920 face images
- 26 types of manipulation sequences
Some sample images and their manipulation sequence annotations are shown below.
[Sample annotations: Bangs-Smiling; eyebrow-nose; Eyeglasses; eye-lip-nose-eyebrow-hair; eye; Beard-Bangs-Eyeglasses-Young; lip-nose-eye; Young]
For more information about the data structure, annotation details, and other properties of the dataset,
please refer to our GitHub page.
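For a concrete picture of the annotations, here is one hypothetical way a sample and its sequence label could be represented in code; the field names, path, and operation strings below are illustrative assumptions rather than the dataset's actual schema, which is documented on the GitHub page.

from dataclasses import dataclass

@dataclass
class SeqFakeSample:
    # Hypothetical schema for illustration only.
    image_path: str
    manipulation_sequence: list  # ordered operations, length 0 (pristine) to 5

sample = SeqFakeSample(
    image_path="facial_components/000001.jpg",   # hypothetical path
    manipulation_sequence=["eyebrow", "nose"],   # i.e. the "eyebrow-nose" label
)
assert 0 <= len(sample.manipulation_sequence) <= 5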
Method
Proposed SeqFakeFormer
The figure below shows the architecture of the proposed Seq-DeepFake Transformer (SeqFakeFormer).
We first feed the face image into a CNN to learn features of spatial manipulation regions,
and extract their spatial relations via the self-attention modules in the encoder.
Sequential relations among these spatial features are then modeled by the cross-attention modules
in the decoder, which detect the facial manipulation sequence with an auto-regressive mechanism.
A spatially enhanced cross-attention module is integrated into the decoder to make the cross-attention more effective.
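To make this image-to-sequence design concrete, below is a minimal PyTorch sketch. It is not the official SeqFakeFormer implementation: the ResNet-50 backbone, layer sizes, token vocabulary size, and the omission of positional encodings are simplifying assumptions for illustration only.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class SeqFakeSketch(nn.Module):
    # Illustrative encoder-decoder model; NOT the official SeqFakeFormer.
    def __init__(self, num_ops=30, d_model=256, nhead=8):
        super().__init__()
        # CNN backbone learns features of spatial manipulation regions.
        backbone = resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # Encoder self-attention models spatial relations; decoder
        # cross-attention models sequential relations over them.
        # (Positional encodings are omitted for brevity.)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.op_embed = nn.Embedding(num_ops, d_model)  # ops + <sos>/<eos>
        self.head = nn.Linear(d_model, num_ops)

    def forward(self, img, tgt_tokens):
        # img: (B, 3, H, W); tgt_tokens: (B, T) shifted-right op indices.
        feats = self.proj(self.cnn(img))            # (B, d, h, w)
        memory = feats.flatten(2).transpose(1, 2)   # (B, h*w, d)
        tgt = self.op_embed(tgt_tokens)             # (B, T, d)
        # Causal mask enforces the auto-regressive decoding order.
        mask = self.transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(img.device)
        out = self.transformer(memory, tgt, tgt_mask=mask)
        return self.head(out)                       # (B, T, num_ops)

model = SeqFakeSketch()
logits = model(torch.randn(2, 3, 224, 224), torch.zeros(2, 5, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 5, 30])

At inference time, decoding would run auto-regressively: starting from a start token, the most probable operation is appended at each step until an end token, or the maximum sequence length of 5, is reached.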
Benchmark results
We tabulate the first benchmark for detecting sequential facial manipulation in Tables 1-3.
SeqFakeFormer outperforms all state-of-the-art deepfake detection methods on both manipulation types,
under both evaluation metrics.
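As a rough illustration of how sequence-level metrics can be scored, the sketch below compares a predicted operation sequence with the ground truth in two ways, once over a fixed padded length and once over only the manipulated positions. The no-op padding token and both scoring functions are assumptions for illustration and may not match the paper's exact metric definitions.

NO_OP = 0  # hypothetical token for "no manipulation"

def pad(seq, length):
    # Right-pad a token sequence with NO_OP up to the given length.
    return list(seq) + [NO_OP] * (length - len(seq))

def fixed_accuracy(pred, gt, max_len=5):
    # Per-position accuracy after padding both sequences to a fixed length.
    p, g = pad(pred, max_len), pad(gt, max_len)
    return sum(a == b for a, b in zip(p, g)) / max_len

def adaptive_accuracy(pred, gt):
    # Per-position accuracy over only the manipulated positions.
    n = max(len(pred), len(gt), 1)
    p, g = pad(pred, n), pad(gt, n)
    return sum(a == b for a, b in zip(p, g)) / n

# Ground truth "eyebrow -> nose" (tokens 3, 5); prediction "eyebrow -> lip" (3, 7).
print(fixed_accuracy([3, 7], [3, 5]))     # 0.8 (three padded no-ops also match)
print(adaptive_accuracy([3, 7], [3, 5]))  # 0.5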
Recovery results
Examples below compare the recovery results obtained with the correct inverse sequences
against those obtained with wrong sequences.
The experimental results show that the sequence order matters for recovering the original images,
which underscores the importance of detecting DeepFake manipulation sequences.
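As a toy sketch of why the detected order matters for recovery, the snippet below undoes edits by applying inverse operations in reverse order of the detected sequence; inverse_ops is a hypothetical mapping from each manipulation name to a function that undoes it, standing in for the actual recovery models.

def recover(image, detected_sequence, inverse_ops):
    # Undo the most recent manipulation first; a wrongly ordered
    # sequence applies the wrong inverse at each step and degrades
    # the recovered face.
    for op in reversed(detected_sequence):
        image = inverse_ops[op](image)
    return image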
Bibtex
@inproceedings{shao2022seqdeepfake,
title={Detecting and Recovering Sequential DeepFake Manipulation},
author={Shao, Rui and Wu, Tianxing and Liu, Ziwei},
booktitle={European Conference on Computer Vision (ECCV)},
year={2022}
}
Acknowledgement
We referred to the project page of AvatarCLIP when creating this
project page.