Dataset Poisoning Detection Toolkit for Ensuring Machine Learning Model Integrity
Vicky Kumar, Sunny Kumar, Varun Kumar, Sudhanshu Kumar, Ahtesham Farooqui
Paper Contents
Abstract
This paper introduces the Dataset Poisoning Detection Toolkit, a lightweight system for identifying malicious or corrupted records in machine learning datasets before model training. Dataset poisoning, in which attackers insert fake, mislabeled, or anomalous samples, can significantly reduce model accuracy. The toolkit addresses this by performing early-stage checks such as duplicate detection, outlier analysis, and label imbalance evaluation. Developed in Java, it uses OpenCSV for data processing and Apache Commons Math for statistical analysis. An optional JavaFX interface provides simple visualization for easier inspection. Designed to be modular, easy to use, and resource-friendly, the toolkit is especially suitable for beginners, students, and small-scale ML projects that require dataset validation but cannot rely on complex security frameworks. By generating a clear cleanliness report and highlighting suspicious patterns, the system helps ensure better data quality, reduces training errors, and improves basic security awareness in machine learning workflows.
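The three early-stage checks named in the abstract can be illustrated with a minimal Java sketch. This is not the toolkit's source code: the class and method names are hypothetical, the z-score rule is an assumed choice of outlier test (the abstract does not say which method is used), and only the Java standard library is used here in place of OpenCSV and Apache Commons Math so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; names and thresholds are assumptions, not the toolkit's API. */
final class PoisoningChecks {

    /** Returns the indices of rows that exactly repeat an earlier row. */
    static List<Integer> duplicateRows(List<String[]> rows) {
        Set<List<String>> seen = new HashSet<>();
        List<Integer> duplicates = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            // Arrays.asList gives content-based equals/hashCode for the row.
            if (!seen.add(Arrays.asList(rows.get(i)))) {
                duplicates.add(i);
            }
        }
        return duplicates;
    }

    /** Returns the indices whose z-score magnitude exceeds the given threshold. */
    static List<Integer> zScoreOutliers(double[] values, double threshold) {
        List<Integer> outliers = new ArrayList<>();
        int n = values.length;
        if (n < 2) {
            return outliers; // not enough data to estimate spread
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;
        double squaredDeviations = 0.0;
        for (double v : values) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(squaredDeviations / (n - 1)); // sample standard deviation
        if (sd == 0.0) {
            return outliers; // all values identical, nothing to flag
        }
        for (int i = 0; i < n; i++) {
            if (Math.abs(values[i] - mean) / sd > threshold) {
                outliers.add(i);
            }
        }
        return outliers;
    }

    /** Returns (largest class count) / (smallest class count); 1.0 means perfectly balanced. */
    static double imbalanceRatio(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        if (counts.isEmpty()) {
            return 1.0; // no labels: treat as balanced
        }
        int max = Collections.max(counts.values());
        int min = Collections.min(counts.values());
        return (double) max / min;
    }
}
```

In a sketch like this, each check returns plain indices or a single ratio, so the results could be merged into one cleanliness report. A real pipeline would first parse the CSV file into rows, choose the outlier threshold per feature, and decide what ratio counts as a suspicious imbalance; those choices are left open here.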
Copyright
Copyright © 2025 Vicky Kumar, Sunny Kumar, Varun Kumar, Sudhanshu Kumar, Ahtesham Farooqui. This is an open access article distributed under the Creative Commons Attribution License.