Predicting wine quality from physicochemical features using a clean ML pipeline: preprocessing, feature selection, and model comparison.
Wine quality is traditionally assessed by human experts, but many production and quality control decisions depend on measurable physicochemical properties: acidity, sugar, sulfur dioxide, alcohol content, and more. The question is: how far can we get by predicting human-rated quality scores from those numeric features alone?
This project uses a public wine quality dataset to build a supervised learning model that predicts a discrete quality score. The focus is not just on reaching a single metric, but on building a clean and reusable ML pipeline: from data exploration and preprocessing to feature selection, model training, and simple deployment.
Exploratory data analysis (EDA) was used to understand feature distributions, detect potential outliers, and inspect correlations between physicochemical properties and quality labels.
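The three EDA checks above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the data is synthetic, and the column names, the 1.5 * IQR outlier rule, and the Spearman correlation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (assumption: the real dataset is not loaded here).
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.5, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.15, n)
df = pd.DataFrame({
    "alcohol": alcohol,
    "volatile_acidity": volatile_acidity,
    "residual_sugar": rng.lognormal(0.8, 0.5, n),
})

# Toy quality score loosely driven by two features, rounded and clipped to 3..8.
raw = 5.5 + 0.8 * (alcohol - 10.5) - 3.0 * (volatile_acidity - 0.5) + rng.normal(0, 0.7, n)
df["quality"] = raw.round().clip(3, 8).astype(int)

features = df.columns.drop("quality")

# 1) Feature distributions: count, mean, std, and quartiles per feature.
summary = df[features].describe().T

# 2) Potential outliers: values outside 1.5 * IQR of the quartiles.
q1 = df[features].quantile(0.25)
q3 = df[features].quantile(0.75)
iqr = q3 - q1
outlier_mask = (df[features] < q1 - 1.5 * iqr) | (df[features] > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

# 3) Rank correlation of each feature with the quality label.
corr_with_quality = (
    df.corr(method="spearman")["quality"].drop("quality").sort_values()
)

print(summary[["mean", "std", "min", "max"]])
print(outlier_counts)
print(corr_with_quality)
```

Spearman correlation is used here because quality is an ordinal score; Pearson would be an equally reasonable first look.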
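The project is structured as a clean ML pipeline rather than just a single "fit" call. The main stages are data exploration, preprocessing, feature selection, model training and comparison, and simple deployment.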
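As a rough sketch of what such a pipeline might look like in scikit-learn, the example below chains scaling, univariate feature selection, and a classifier, then compares two candidate models by cross-validation. The synthetic data, the choice of `SelectKBest` with `k=8`, and the two candidate models are assumptions for illustration; the project's actual steps and hyperparameters may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 11 numeric features, 3 quality classes (assumption).
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=5, n_redundant=2,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def build_pipeline(model):
    """Preprocessing, feature selection, and model as one estimator."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif, k=8)),
        ("model", model),
    ])


# Hypothetical candidates for the model comparison step.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Cross-validate each full pipeline on the training split only, so that
# scaling and feature selection are refit inside every fold.
cv_scores = {
    name: cross_val_score(
        build_pipeline(model), X_train, y_train, cv=5, scoring="accuracy"
    ).mean()
    for name, model in candidates.items()
}

best_name = max(cv_scores, key=cv_scores.get)
best_pipeline = build_pipeline(candidates[best_name]).fit(X_train, y_train)
test_accuracy = best_pipeline.score(X_test, y_test)

print(cv_scores, best_name, round(test_accuracy, 3))
```

Keeping the scaler and selector inside the `Pipeline` means they are fitted only on training folds, which avoids leaking test information into preprocessing.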
The chosen model achieves solid performance on the held-out test split, capturing the main patterns between physicochemical properties and perceived quality. While there is still noise due to subjective ratings and overlapping feature distributions, the model is able to distinguish clearly low-quality from clearly high-quality wines.
Beyond the exact numbers, the project demonstrates a complete workflow for building a robust classifier on a real dataset with noisy labels.
The deployment decouples the front-end UI from the underlying scikit-learn pipeline, making it easy to swap or retrain models without changing the interface.
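One way this decoupling could look in code is sketched below: the UI only ever calls a small wrapper with a plain dictionary of feature values, while the fitted pipeline lives in a file that can be replaced after retraining. The `QualityPredictor` class, the `FEATURE_ORDER` list, the file name, and the toy training data are all hypothetical; the project's actual interface is not shown in this write-up.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature contract shared by the UI and the model.
FEATURE_ORDER = ["alcohol", "volatile_acidity", "sulphates"]


class QualityPredictor:
    """Thin wrapper: the UI depends on this class, not on scikit-learn."""

    def __init__(self, model_path):
        # Any fitted estimator with a .predict method can be stored here.
        self._pipeline = joblib.load(model_path)

    def predict(self, features):
        # Arrange the dict values in the column order used for training.
        row = np.array([[features[name] for name in FEATURE_ORDER]], dtype=float)
        return int(self._pipeline.predict(row)[0])


# Training side (toy data and labels, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] - X[:, 1] > 0).astype(int) + 5  # toy quality labels: 5 or 6
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X, y)

model_path = os.path.join(tempfile.mkdtemp(), "wine_quality.joblib")
joblib.dump(pipeline, model_path)

# UI side: knows only the wrapper and the feature names.
predictor = QualityPredictor(model_path)
score = predictor.predict(
    {"alcohol": 1.2, "volatile_acidity": -0.5, "sulphates": 0.0}
)
print(score)
```

Retraining then amounts to writing a new file at the same path, as long as the new pipeline accepts the same feature order.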
This project reinforced the value of building end-to-end pipelines and thinking about how a model will be used, not just how it scores on a benchmark metric.