SeqDeepFake: Detecting and Recovering Sequential DeepFake Manipulation

  • S-Lab, Nanyang Technological University

TL;DR: In this work, we focus on detecting DeepFake manipulation sequences rather than binary labels.

Since photorealistic faces can now be readily generated by facial manipulation technologies, the potential malicious abuse of these technologies has raised serious concerns, and numerous deepfake detection methods have been proposed. However, existing methods only focus on detecting one-step facial manipulation. With the emergence of easily accessible facial editing applications, people can manipulate facial components using multi-step operations in a sequential manner. This new threat requires us to detect a sequence of facial manipulations, which is vital both for detecting deepfake media and for recovering the original faces afterwards. Motivated by this observation, we emphasize the need for, and propose, a novel research problem called Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task, which only demands a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, where face images are manipulated sequentially with corresponding annotations of sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence (e.g. image captioning) task and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). Moreover, we build a comprehensive benchmark and set up rigorous evaluation protocols and metrics for this new research problem. Extensive experiments demonstrate the effectiveness of SeqFakeFormer. Several valuable observations are also revealed to facilitate future research on broader deepfake detection problems.

Seq-DeepFake Dataset

Seq-DeepFake is the first large-scale dataset for Sequential DeepFake Manipulation Detection. It consists of 85k sequentially manipulated face images, each with a ground-truth sequence annotation. The dataset covers highly diverse manipulation sequences with lengths from 0 to 5, and is generated based on two different facial manipulation methods.
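One simple way to think about these annotations is as a sequence of manipulation operations padded to the maximum length of 5. The sketch below is illustrative only: the operation names, label IDs, and padding token are assumptions, not the dataset's actual schema (see the GitHub page for the real annotation format).

```python
# Hypothetical sketch of encoding a Seq-DeepFake annotation as a
# fixed-length label vector. Operation names, IDs, and the NO-OP
# padding token are illustrative assumptions, not the real schema.

NO_OP = 0    # padding token for steps with no manipulation
MAX_LEN = 5  # maximum manipulation sequence length in the dataset

# Assumed example vocabulary of facial-component operations.
VOCAB = {"NO-OP": 0, "eye": 1, "eyebrow": 2, "lip": 3, "hair": 4, "nose": 5}

def encode_sequence(ops):
    """Map a list of operation names to a fixed-length label vector."""
    ids = [VOCAB[op] for op in ops]
    return ids + [NO_OP] * (MAX_LEN - len(ids))

print(encode_sequence(["nose", "eye", "lip"]))  # -> [5, 1, 3, 0, 0]
print(encode_sequence([]))                      # -> [0, 0, 0, 0, 0] (pristine face)
```

A length-0 sequence (all padding) then corresponds to an unmanipulated face, so binary deepfake detection is a special case of this formulation.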

Some sample images and their annotations are shown below (Mouse Over: the original image). For more information about the data structure, annotation details, and other properties of the dataset, please refer to our GitHub page.

Proposed SeqFakeFormer

The figure below shows the architecture of the proposed Seq-DeepFake Transformer (SeqFakeFormer). We first feed the face image into a CNN to learn features of spatial manipulation regions, and extract their spatial relations via self-attention modules in the encoder. Sequential relations are then modeled on top of these spatial-relation features through cross-attention modules in the decoder with an auto-regressive mechanism, detecting the sequence of facial manipulations. A spatially enhanced cross-attention module is integrated into the decoder, contributing to more effective cross-attention.
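The auto-regressive mechanism can be sketched as a greedy decoding loop: at each step, the decoder conditions on the image features and the operations predicted so far, and emits the next operation until an end token. The `score_next_op` stub below stands in for the real cross-attention decoder and is purely illustrative.

```python
# Greedy auto-regressive decoding sketch. `score_next_op` is a toy
# stand-in for SeqFakeFormer's cross-attention decoder; a real model
# would condition on CNN/encoder features rather than this rule.

SOS, EOS = "<sos>", "<eos>"
OPS = ["eye", "eyebrow", "lip", "hair", "nose", EOS]

def decode(image_features, score_next_op, max_len=5):
    """Predict a manipulation sequence one operation at a time."""
    seq = [SOS]
    for _ in range(max_len):
        # Pick the highest-scoring next operation given the current prefix.
        op = max(OPS, key=lambda o: score_next_op(image_features, seq, o))
        if op == EOS:
            break
        seq.append(op)
    return seq[1:]  # drop the start token

# Toy scorer: pretend the image was manipulated as nose -> eye.
def toy_scorer(feats, prefix, op):
    target = ["nose", "eye"]
    step = len(prefix) - 1
    if step < len(target):
        return 1.0 if op == target[step] else 0.0
    return 1.0 if op == EOS else 0.0

print(decode(None, toy_scorer))  # -> ['nose', 'eye']
```

Because each step conditions on the previously predicted operations, the model captures the order of manipulations rather than just the set of edited components.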

Benchmark results

We tabulate the first benchmark for detecting sequential facial manipulation in Tables 1-3. SeqFakeFormer outperforms all state-of-the-art deepfake detection methods on both manipulation types, under both evaluation metrics.
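To make the evaluation concrete, two natural ways to score a predicted sequence against the ground truth are per-step accuracy over the padded vectors and strict exact-sequence match. These are illustrative approximations; the benchmark's exact protocol and metric definitions are given in the paper.

```python
# Illustrative sequence-evaluation sketch: per-step accuracy over
# padded sequences vs. exact sequence match. These approximate how
# sequential predictions can be scored; the paper's exact protocol
# may differ in detail.

def step_accuracy(pred, gt):
    """Fraction of positions where the padded sequences agree."""
    assert len(pred) == len(gt)
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)

def exact_match(pred, gt):
    """1.0 only if the entire manipulation sequence is correct."""
    return float(pred == gt)

pred = [5, 1, 3, 0, 0]
gt   = [5, 1, 2, 0, 0]
print(step_accuracy(pred, gt))  # -> 0.8
print(exact_match(pred, gt))    # -> 0.0
```

Exact match is much stricter: a single wrong operation anywhere in the sequence makes the whole prediction count as a failure.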

Recovery results

Examples comparing recovery results obtained with the correct inverse sequences (Mouse Out) and with wrong sequences (Mouse Over). Experimental results show that the sequence order matters for recovering the original images, which demonstrates the importance of detecting DeepFake manipulation sequences.
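The intuition behind this is that sequential edits compose like functions, so their inverses must be applied in reverse order. The toy example below makes this concrete with arithmetic "edits" that are purely illustrative stand-ins for facial manipulations.

```python
# Toy demonstration that recovery must invert edits in reverse order.
# The arithmetic operations stand in for facial manipulations and are
# purely illustrative.

add3 = lambda x: x + 3          # "edit 1"
dbl  = lambda x: x * 2          # "edit 2"
undo_add3 = lambda x: x - 3
undo_dbl  = lambda x: x / 2

original = 10.0
manipulated = dbl(add3(original))           # edit 1, then edit 2 -> 26.0

correct = undo_add3(undo_dbl(manipulated))  # invert in reverse order
wrong   = undo_dbl(undo_add3(manipulated))  # invert in forward order

print(correct)  # -> 10.0 (original recovered)
print(wrong)    # -> 11.5 (recovery fails)
```

Because the edits do not commute, recovering the original requires knowing not just which manipulations were applied but also their order, which is exactly what Seq-DeepFake detection predicts.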

Citation

@inproceedings{shao2022detecting,
    title={Detecting and Recovering Sequential DeepFake Manipulation},
    author={Shao, Rui and Wu, Tianxing and Liu, Ziwei},
    booktitle={European Conference on Computer Vision (ECCV)},
    year={2022}
}

We referred to the project page of AvatarCLIP when creating this project page.