DATA QUALITY AND CLEANING TECHNIQUES FOR BIG DATA

Dr. A. Antony Prakash

Abstract

In the big data era, the volume, velocity, and variety of data being produced make maintaining data quality more difficult. Because low-quality data can lead to inaccurate insights, distorted projections, and suboptimal decisions, data cleaning is an essential step in the data analysis process. This paper examines the data quality problems that accompany big data, including noisy data, outliers, missing values, and inconsistencies. It surveys state-of-the-art data cleaning methods and tools designed for large-scale datasets, from conventional procedures such as imputation and normalization to machine learning-based strategies such as anomaly detection and outlier handling. The study also highlights how data preparation systems such as Hadoop and Apache Spark can help address data quality problems at scale. It further discusses the difficulties of cleaning unstructured data (text, images, etc.) and presents strategies for managing complex data types. By giving an overview of data cleaning best practices, current trends, and emerging technologies, the paper aims to give academics and practitioners the information they need to ensure high-quality data for effective big data analytics.
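
As a rough illustration of the conventional procedures named in the abstract (imputation, outlier handling, normalization), the following is a minimal sketch using only the Python standard library. It is not taken from the paper: the function names, the median fill, the 1.5 × IQR rule, the min-max scaling, and the sample `readings` are all assumptions chosen for illustration, and a large-scale pipeline would more likely run equivalent steps on a distributed engine.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outlier_flags(values, k=1.5):
    """Return True for each value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sensor readings with one missing value and one extreme value.
readings = [10.0, 12.0, None, 11.0, 13.0, 250.0, 12.0]

filled = impute_median(readings)            # step 1: fill the gap
flags = iqr_outlier_flags(filled)           # step 2: flag outliers
cleaned = [v for v, is_outlier in zip(filled, flags) if not is_outlier]
scaled = min_max_normalize(cleaned)         # step 3: normalize what remains
```

The median is used for the fill because it is less sensitive to the extreme reading than the mean would be; dropping flagged values is only one possible way of handling outliers, and capping or separate review are common alternatives.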

Copyright

Paper ID: IJPREMS50900014743

ISSN: 2321-9653

Publisher: IJPREMS