🍷 Wine Quality Prediction

Predicting wine quality from physicochemical features using a clean ML pipeline: preprocessing, feature selection, and model comparison.

Role

ML pipeline design and implementation (solo project)

Timeline

2024 · Personal / coursework-style project

Tech

Python, scikit-learn, Pandas, Matplotlib, Hugging Face Spaces


Figure: Wine quality prediction plots and interface

TL;DR

Built an end-to-end ML pipeline to predict wine quality from physicochemical measurements (acidity, alcohol, pH, etc.).
Implemented preprocessing, feature exploration, and model comparison using classical machine learning algorithms.
Evaluated multiple classifiers and selected a model that balances accuracy with robustness, rather than the one that merely fits the training data best.
Deployed the final model as an interactive demo where users can adjust inputs and see the predicted wine quality.

Problem & Context

Wine quality is traditionally assessed by human experts, but many production and quality control decisions depend on measurable physicochemical properties: acidity, sugar, sulfur dioxide, alcohol content, and more. The question is: how far can we get by predicting human-rated quality scores from those numeric features alone?

This project uses a public wine quality dataset to build a supervised learning model that predicts a discrete quality score. The focus is not just on optimizing a single metric, but on building a clean and reusable ML pipeline: from data exploration and preprocessing to feature selection, model training, and simple deployment.

Data & Inputs

Tabular dataset of wines with physicochemical attributes such as fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, sulfur dioxide levels, density, pH, sulphates, and alcohol.
A discrete quality label (e.g., 0–10) assigned by human tasters, treated as a classification target.
Standard preprocessing: handling missing values (if any), scaling numeric features, and train/validation/test splitting.

Exploratory data analysis (EDA) was used to understand feature distributions, detect potential outliers, and inspect correlations between physicochemical properties and quality labels.
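The data preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is randomly generated so the snippet runs on its own, the column names (including the split of "sulfur dioxide levels" into free and total) are assumptions, and `make_synthetic_wine` and `split_data` are hypothetical helper names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Column names follow the attributes listed above; the exact names used in
# the project's dataset are an assumption.
FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol",
]


def make_synthetic_wine(n_rows=600, seed=0):
    """Random stand-in data so the sketch runs without the real dataset."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.normal(size=(n_rows, len(FEATURES))), columns=FEATURES)
    # Imbalanced labels: the highest score is deliberately rare.
    df["quality"] = rng.choice([5, 6, 7], size=n_rows, p=[0.4, 0.45, 0.15])
    return df


def split_data(df, target="quality", seed=42):
    """Stratified split into roughly 60% train, 20% validation, 20% test."""
    X, y = df[FEATURES], df[target]
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, stratify=y_trainval,
        random_state=seed,
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


df = make_synthetic_wine()

# Basic EDA checks: missing values per column and correlation with the label.
missing_per_column = df.isna().sum()
quality_corr = df.corr()["quality"].drop("quality").sort_values()

train, val, test = split_data(df)
print("rows:", len(train[0]), len(val[0]), len(test[0]))
print(quality_corr)
```

Stratifying on the label keeps the class proportions similar across the three splits, which matters when some quality scores are rare.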

Approach & Pipeline

The project is structured as a clean ML pipeline rather than just a single “fit” call. The main stages are:

EDA & feature understanding: correlation matrix, pair plots, and summary statistics for each feature.
Preprocessing: standardization of numeric features (e.g., StandardScaler) so that scale-sensitive algorithms such as logistic regression and SVMs converge reliably and are not dominated by features with large numeric ranges.
Feature selection / importance: simple techniques such as univariate tests and tree-based feature importances to see which variables matter most.
Model comparison: trained and compared several scikit-learn models (e.g., logistic regression, random forest, gradient boosting) using cross-validation.
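The feature ranking and model comparison stages listed above might look something like the sketch below. It runs on synthetic data from `make_classification` rather than the real wine dataset, and the specific hyperparameters, the five-fold setup, and the choice of the ANOVA F-test as the univariate test are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data: 11 features, 3 quality classes.
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# Feature ranking, two ways: univariate ANOVA F-test and tree-based importances.
f_scores, p_values = f_classif(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rank_by_f = np.argsort(f_scores)[::-1]
rank_by_forest = np.argsort(forest.feature_importances_)[::-1]
print("top features (F-test):", rank_by_f[:5])
print("top features (forest):", rank_by_forest[:5])

# Candidate models. Scaling lives inside the pipeline so it is refit on each
# training fold and never sees the held-out fold.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=50, random_state=0
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the standard deviation across folds alongside the mean gives a rough sense of how stable each model is, which is one way to weigh robustness against raw accuracy.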

Results & Evaluation

The chosen model achieves solid performance on the held-out test split, capturing the main patterns between physicochemical properties and perceived quality. Some noise remains because ratings are subjective and feature distributions overlap across classes, but the model distinguishes clearly low-quality wines from clearly high-quality ones.

Evaluated performance with accuracy, confusion matrix, and per-class metrics (precision/recall).
Compared simpler models (logistic regression) against more flexible ones (e.g., random forest and gradient boosting).
Observed that models with too much capacity can overfit; cross-validation was used to choose stable hyperparameters.
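As a rough illustration of the evaluation steps above, the sketch below computes accuracy, a confusion matrix, and per-class precision/recall on a held-out split, and compares train and test accuracy as a simple overfitting check. The data is synthetic and the random forest is only a placeholder; neither reflects the project's actual model or results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three classes (labels 0, 1, 2).
X, y = make_classification(
    n_samples=400, n_features=11, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Placeholder model; the project's chosen model may differ.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

test_acc = accuracy_score(y_test, y_pred)
train_acc = accuracy_score(y_train, model.predict(X_train))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(cm)
# Per-class precision, recall, and F1.
print(classification_report(y_test, y_pred, zero_division=0))
```

A large gap between train and test accuracy is a quick signal that a model has more capacity than the data supports.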

Beyond the exact numbers, the project demonstrates a complete workflow for building a robust classifier on a real dataset with noisy labels.

Implementation

Implemented in Python using scikit-learn and Pandas for data handling and modeling.
Code organized into clear steps: data loading, preprocessing, model training, evaluation, and deployment.
Deployed as a simple web demo (using a lightweight framework such as Gradio or Streamlit) on Hugging Face Spaces, allowing users to input feature values and see predictions.

The deployment decouples the front-end UI from the underlying scikit-learn pipeline, making it easy to swap or retrain models without changing the interface.
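One way to get this kind of decoupling is to keep prediction in a plain function and build the UI around it separately. The sketch below assumes Gradio (the source names Gradio or Streamlit only as examples), trains a throwaway pipeline on random data in place of the real trained model, and uses hypothetical names (`train_placeholder_pipeline`, `predict_quality`, `build_demo`) and assumed feature names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed feature names, in the order the pipeline expects them.
FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol",
]


def train_placeholder_pipeline(seed=0):
    """Fit a stand-in pipeline on random data; a real app would load the
    trained pipeline from disk instead."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(200, len(FEATURES)))
    y = rng.choice([5, 6, 7], size=200)
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return pipeline.fit(X, y)


PIPELINE = train_placeholder_pipeline()


def predict_quality(*values):
    """UI-independent prediction: one numeric value per feature, in order."""
    row = np.asarray(values, dtype=float).reshape(1, -1)
    return int(PIPELINE.predict(row)[0])


def build_demo():
    """Wrap predict_quality in a Gradio interface (imported lazily so the
    prediction code can be used and tested without Gradio installed)."""
    import gradio as gr

    inputs = [gr.Number(label=name, value=0.0) for name in FEATURES]
    return gr.Interface(
        fn=predict_quality,
        inputs=inputs,
        outputs=gr.Number(label="Predicted quality"),
    )


print(predict_quality(*([0.0] * len(FEATURES))))
```

Calling `build_demo().launch()` would start the demo. Because the interface only depends on `predict_quality`, replacing `PIPELINE` with a retrained model requires no change to the UI code.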

Challenges & Lessons Learned

Real-world labels (like wine quality) are noisy and subjective; even a strong model will not reach 100% accuracy.
Model evaluation must consider class imbalance and the cost of misclassifying near-neighbor classes (e.g., 6 vs 7).
A clear, modular pipeline (preprocessing → modeling → deployment) makes it easier to iterate than a single large notebook.
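To make the point about class imbalance and near-neighbor errors concrete, the sketch below compares plain accuracy with a few metrics that account for them. The labels are made up for illustration, and the metric choices (balanced accuracy, within-one accuracy, quadratic weighted kappa) are suggestions rather than what the project necessarily used.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score

# Hand-made example labels (not project results): score 5 is the most
# common class and score 3 appears only once.
y_true = np.array([5, 5, 5, 5, 6, 6, 6, 7, 7, 3])
y_pred = np.array([5, 5, 6, 5, 6, 6, 7, 6, 7, 5])

# Plain accuracy counts a 6-vs-7 mistake the same as a 3-vs-5 mistake.
exact_acc = float(np.mean(y_true == y_pred))

# Within-one accuracy treats predictions off by a single score as acceptable.
within_one_acc = float(np.mean(np.abs(y_true - y_pred) <= 1))

# Balanced accuracy averages recall over classes, so rare classes count
# as much as common ones.
balanced_acc = balanced_accuracy_score(y_true, y_pred)

# Quadratic weighted kappa penalizes errors more the further apart the
# true and predicted scores are.
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")

print(f"exact accuracy:      {exact_acc:.3f}")
print(f"within-one accuracy: {within_one_acc:.3f}")
print(f"balanced accuracy:   {balanced_acc:.3f}")
print(f"quadratic kappa:     {qwk:.3f}")
```

In this toy example most errors are off by one score, so within-one accuracy is much higher than exact accuracy, while balanced accuracy drops because the single rare-class wine is misclassified.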

This project reinforced the value of building end-to-end pipelines and thinking about how a model will be used, not just how it scores on a benchmark metric.

Links

Live Demo on Hugging Face Spaces · GitHub Repository · Back to all projects