FACE ANTI-SPOOFING USING 3-ViT ARCHITECTURE WITH A CNN BACKBONE
Priyanshu Sahani
Paper Contents
Abstract
Enhancing the security of facial recognition systems against presentation attacks (PAs) necessitates robust face anti-spoofing (FAS) mechanisms. While traditional FAS methods relying on handcrafted features have shown limitations in real-world scenarios, deep learning-based approaches have demonstrated significant performance improvements. In this research, we propose an ensemble vision transformer (ViT)-based FAS system that leverages both CNN-based feature extraction and transformer-based global representation learning. Our architecture incorporates depth-based auxiliary supervision and fuses predictions from multiple ViT branches (Base, Small, and Tiny) using learnable weights, improving robustness against diverse attack types. We evaluate the proposed method on the CelebA-Spoof dataset using intra-dataset testing protocols and standard metrics such as ACER, APCER, and BPCER. Experimental results show that our ensemble model outperforms individual ViT models and achieves competitive performance with strong generalization capabilities. This work provides insights into the integration of depth cues and ensemble strategies, contributing to the advancement of practical FAS solutions for secure biometric authentication.
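As a rough illustration of two ideas named in the abstract, the sketch below fuses per-branch spoof scores with weights and then computes APCER, BPCER, and ACER at a fixed threshold. It is not the paper's implementation. The function names `fuse_scores` and `fas_metrics`, the softmax normalisation of the fusion weights, the 0.5 threshold, and the label convention (1 = attack, 0 = bona fide) are all assumptions made for this example. The metrics also pool every attack type into a single class, which is a simplification of the per-attack-type APCER used in some protocols.

```python
import math


def fuse_scores(branch_scores, fusion_logits):
    """Weighted average of per-branch spoof probabilities.

    branch_scores: one spoof probability per ViT branch (e.g. Base, Small, Tiny).
    fusion_logits: one unnormalised weight per branch (learned in training;
                   here they are plain numbers supplied by the caller).
    Softmax normalisation is an assumption of this sketch, not the paper's
    stated method; it keeps the weights positive and summing to 1.
    """
    peak = max(fusion_logits)
    exps = [math.exp(w - peak) for w in fusion_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * s for w, s in zip(weights, branch_scores))


def fas_metrics(labels, scores, threshold=0.5):
    """Return (APCER, BPCER, ACER) for binary spoof detection.

    labels: 1 = attack presentation, 0 = bona fide presentation (assumed).
    scores: predicted spoof probability per sample.
    """
    attacks = [s for y, s in zip(labels, scores) if y == 1]
    bona_fide = [s for y, s in zip(labels, scores) if y == 0]
    # APCER: share of attack presentations accepted as bona fide.
    apcer = sum(s < threshold for s in attacks) / len(attacks)
    # BPCER: share of bona fide presentations rejected as attacks.
    bpcer = sum(s >= threshold for s in bona_fide) / len(bona_fide)
    # ACER: mean of the two error rates.
    acer = (apcer + bpcer) / 2
    return apcer, bpcer, acer


# Hypothetical scores for one image from three branches, equal weights.
fused = fuse_scores([0.9, 0.6, 0.3], [0.0, 0.0, 0.0])

# Hypothetical toy evaluation set: four attacks, two bona fide samples.
labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2]
apcer, bpcer, acer = fas_metrics(labels, scores)
print(apcer, bpcer, acer)  # 0.25 0.0 0.125
```

In the toy set, one of the four attacks scores below the threshold and both bona fide samples are accepted, so APCER is 0.25, BPCER is 0, and ACER is their mean.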
Copyright
Copyright © 2025 Priyanshu Sahani. This is an open access article distributed under the Creative Commons Attribution License.