Comprehensive Evaluation of Deep Learning Architectures for Static American Sign Language Recognition: From CNNs to Hybrid Sequential Models
Navdeep Doriya
Abstract
This work presents a thorough comparison of five machine learning models for static American Sign Language (ASL) recognition on a dataset of 8,784 high-resolution (128×128 RGB) images spanning 26 letter classes. We compare: (1) MobileNetV2 (97.00% accuracy), (2) a MobileNetV2+RNN hybrid (96.51%), (3) a custom CNN (85.69%), (4) an LSTM (87.99%), and (5) a Random Forest (91.50%). Our findings yield three results:

1. Spatial features predominate: the plain MobileNetV2 outperforms its RNN-augmented hybrid (97.00% vs. 96.51%), indicating that convolutional feature extraction is more important than sequential modeling for static ASL.
2. Surprising LSTM viability: the baseline LSTM reaches 87.99% accuracy by treating raw pixel rows as sequences, demonstrating that static images retain patterns amenable to sequential modeling.
3. Practical significance:
   - Highest accuracy: MobileNetV2 (97.00% at 25 ms)
   - Best speed-accuracy trade-off: custom CNN (85.69% at 10 ms)
   - Fastest inference: Random Forest (91.50% at 5 ms)

We publish complete implementations, including the 4-layer custom CNN and the MobileNetV2+GRU hybrid, for reproducibility. This work offers actionable guidance for choosing ASL recognition architectures based on accuracy, latency, and hardware constraints.
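The abstract's LSTM baseline treats each static image as a sequence of pixel rows. A minimal NumPy sketch of that input transformation (the 128×128 RGB shape comes from the paper; the specific reshaping shown here is an assumption about how such a baseline is typically fed, and the LSTM itself would then consume the resulting sequence):

```python
import numpy as np

# Assumed input: one 128x128 RGB image with values in [0, 1].
image = np.random.rand(128, 128, 3).astype(np.float32)

# Treat each of the 128 pixel rows as one timestep;
# each timestep is then a 128 * 3 = 384-dimensional feature vector.
sequence = image.reshape(128, 128 * 3)

print(sequence.shape)  # (128, 384)
```

An LSTM over this representation scans the image top to bottom, one row per step, which is why the paper can speak of "sequential" patterns in a static image.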
Copyright
Copyright © 2025 Navdeep Doriya. This is an open access article distributed under the Creative Commons Attribution License.