Speech-to-Text Technology: Advances, Challenges, and Ethical Considerations
Jayant Tyagi
Abstract
Speech-to-text (STT) technology converts spoken words into text, enabling tools such as voice assistants, lecture transcription apps, and cultural archives. Despite recent advances, STT still struggles with noisy environments and low-resource languages, and it raises ethical concerns around fairness and privacy. We developed a CNN-Transformer model optimized for Indian languages and real-time applications, incorporating data augmentation and model compression techniques. The model achieves an 18% reduction in Word Error Rate (WER) compared to baselines when tested across diverse Indian dialects and noisy settings such as GLBITM fests and Greater Noida markets. We use transfer learning to extend the model to low-resource languages such as Bhojpuri and Kannada, and we propose ethical guidelines to ensure equitable access. Results, supported by diagrams, demonstrate robustness and low latency (140 ms). For GLBITM students, this enables transcription of AI lectures, while Greater Noida communities can archive Diwali speeches or Holi songs. Future work includes expanding to more dialects, integrating gesture recognition, and developing mobile apps for rural education, with the aim of fostering AI-driven learning and cultural preservation.
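For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of reference words. The sketch below shows that standard definition in Python. It is not the evaluation code used in this work. The function names are hypothetical, and the `relative_wer_reduction` helper assumes the reported 18% is a relative reduction, which the abstract does not specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Assumption: 'reduction' means (baseline - model) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative strings only, not data from the paper.
    print(word_error_rate("turn on the light", "turn of the lights"))
    # Illustrative WER values only, not results from the paper.
    print(relative_wer_reduction(baseline_wer=0.25, model_wer=0.205))
```

Under this reading, a baseline WER of 0.25 falling to 0.205 would count as an 18% reduction. If the figure were instead absolute (percentage points), the comparison would be a simple difference of the two WER values.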
Copyright
Copyright © 2025 Jayant Tyagi. This is an open access article distributed under the Creative Commons Attribution License.