DATA QUALITY AND CLEANING TECHNIQUES FOR BIG DATA
Dr. A. Antony Prakash
Abstract
In the big data era, the volume, velocity, and variety of data being produced make maintaining data quality increasingly difficult. Because low-quality data can lead to inaccurate insights, distorted projections, and suboptimal decision-making, data cleaning is an essential step in the data analysis pipeline. This paper examines the data quality problems that accompany big data, including noisy data, outliers, missing values, and inconsistencies. It surveys state-of-the-art data cleaning methods and tools designed for large-scale datasets, from conventional procedures such as imputation and normalization to more sophisticated machine learning-based strategies such as anomaly detection and outlier handling. The study also highlights how data processing frameworks such as Hadoop and Apache Spark help address data quality problems at scale, discusses the difficulties of cleaning unstructured data (text, images, etc.), and presents strategies for managing complex data types. By providing an overview of data cleaning best practices, current trends, and emerging technologies, this paper aims to equip researchers and practitioners with the knowledge needed to ensure high-quality data for effective big data analytics.
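Two of the conventional procedures the abstract names, imputation and normalization, can be illustrated with a minimal sketch; the functions, column values, and the choice of mean imputation with min-max scaling are illustrative assumptions, not techniques prescribed by the paper.

```python
def impute_mean(values):
    """Mean imputation: replace missing entries (None) with the
    mean of the observed values in the column."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Min-max normalization: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical column with missing values.
ages = [23, None, 31, 27, None, 45]
cleaned = impute_mean(ages)          # mean of observed values is 31.5
scaled = min_max_normalize(cleaned)  # all values now lie in [0, 1]
```

In practice the same two steps would be expressed through a large-scale framework (e.g. Spark DataFrame transformations) so that they run in parallel across partitions, but the underlying logic is the same.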
Copyright
Copyright © 2025 Dr. A. Antony Prakash. This is an open access article distributed under the Creative Commons Attribution License.