# ML Playground

An interactive Streamlit + Scikit‑Learn web application that allows anyone to upload a dataset, automatically identify whether the problem is classification or regression, train multiple ML models interactively, visualize the results, and export the trained model, all from a single browser interface.
## Table of Contents

- Overview
- Architecture
- Key Features
- Data Flow Diagram
- Design Decisions
- Preprocessing Pipeline
- Supported Models
- Metrics Explained
- User Guide
- Deployment Guide
- Troubleshooting
- Performance Tips
- Roadmap
## Overview

ML Playground provides a practical and educational environment for machine learning experimentation. It bridges the gap between Jupyter notebooks and full production pipelines, ideal for data analysts, ML students, educators, and developers who want to quickly prototype or demonstrate machine learning behavior on real data.
Goals:
- Automate tedious parts of ML workflow (data cleaning, encoding, scaling)
- Detect task type and prevent invalid configurations (classifier on continuous targets)
- Provide instant, intuitive visual feedback
- Maintain compatibility across Streamlit and Scikit‑Learn versions (e.g., API renames between releases)
## Architecture

```mermaid
flowchart TD
    A[CSV Upload / Sample Data] --> B[Preprocessing Layer]
    B --> C[Task Detection Engine]
    C --> D[Model Selector]
    D --> E[Training Engine]
    E --> F[Metrics & Visualization]
    F --> G[Export / Download]
```
| Component | Purpose |
|---|---|
| Frontend (Streamlit) | Renders user interface, handles file uploads, parameter selection, and visualization. |
| Task Detection Engine | Analyzes target column distribution to classify problem type (classification vs regression). |
| Preprocessing Layer | Builds ColumnTransformer with numeric scaling and categorical one-hot encoding. |
| Training Engine | Constructs scikit-learn Pipeline, trains model, evaluates metrics. |
| Visualization Engine | Uses Plotly for interactive confusion matrix, ROC, and residual plots. |
| Export Layer | Serializes the trained pipeline using Joblib for reuse in production scripts. |
## Key Features

**Automatic task detection:** determines whether the target column represents a classification or regression task using data-type and unique-value heuristics.
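The heuristic can be sketched as follows. This is a minimal illustration, not the app's exact implementation; the function name `detect_task` and the `max_classes` cutoff are hypothetical.

```python
import pandas as pd

def detect_task(target: pd.Series, max_classes: int = 20) -> str:
    """Hypothetical sketch: non-numeric targets are classification;
    integer targets with few unique values are likely class labels;
    everything else is treated as regression."""
    if not pd.api.types.is_numeric_dtype(target):
        return "classification"
    if pd.api.types.is_integer_dtype(target) and target.nunique() <= max_classes:
        return "classification"
    return "regression"

print(detect_task(pd.Series(["cat", "dog", "cat"])))   # classification
print(detect_task(pd.Series([0, 1, 1, 0])))            # classification
print(detect_task(pd.Series([3.2, 1.7, 8.9, 4.4])))    # regression
```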
**Model selection:** choose from built‑in algorithms:
- Classification: Logistic Regression, Random Forest, XGBoost (optional)
- Regression: Linear Regression, Random Forest, XGBoost (optional)
**Visualization:**
- Real‑time metrics panel
- Confusion matrix & ROC curve for classification
- Residual and feature-importance plots for regression
**Model export:** save the full pipeline as `.joblib` for production deployment.
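A minimal sketch of exporting and reloading a trained pipeline with Joblib. The toy dataset and pipeline here are illustrative, not the app's actual models:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

# Serialize preprocessing + model as a single artifact.
joblib.dump(pipe, "model_pipeline.joblib")

# Later, in a production script:
restored = joblib.load("model_pipeline.joblib")
print(restored.predict(X[:3]))
```

Because the whole `Pipeline` is serialized, the consumer never needs to re-apply scaling or encoding by hand.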
**Version compatibility:** supports scikit‑learn 0.24+ by auto‑detecting encoder parameters (`sparse_output` vs `sparse`).
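One way to implement this detection is to inspect the `OneHotEncoder` constructor signature at runtime. This is a sketch of the technique; the helper name `make_ohe` is hypothetical:

```python
import inspect
from sklearn.preprocessing import OneHotEncoder

def make_ohe(**kwargs):
    """Build a dense OneHotEncoder on any supported sklearn version.
    sklearn >= 1.2 renamed the `sparse` parameter to `sparse_output`."""
    params = inspect.signature(OneHotEncoder).parameters
    if "sparse_output" in params:
        kwargs["sparse_output"] = False
    else:
        kwargs["sparse"] = False
    return OneHotEncoder(handle_unknown="ignore", **kwargs)

enc = make_ohe()
print(enc.fit_transform([["red"], ["blue"], ["red"]]))
```

Signature inspection avoids pinning a specific sklearn version in `requirements.txt`.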
## Data Flow Diagram

```mermaid
sequenceDiagram
    participant User
    participant StreamlitApp
    participant Preprocessor
    participant Model
    participant PlotlyCharts
    User->>StreamlitApp: Upload CSV / Choose sample
    StreamlitApp->>Preprocessor: Analyze data types
    Preprocessor->>Model: Provide processed feature matrix
    Model->>StreamlitApp: Return predictions + metrics
    StreamlitApp->>PlotlyCharts: Render visuals interactively
    User->>StreamlitApp: Download model.joblib / predictions.csv
```
## Design Decisions

| Decision | Rationale |
|---|---|
| ColumnTransformer-based preprocessing | Guarantees clean numeric and categorical handling without data leakage. |
| Streamlit caching (`@st.cache_data`) | Improves performance on repeated dataset loads. |
| Pipeline encapsulation | Ensures model reproducibility and allows joblib export. |
| Responsive Plotly charts | Future-proof visualization compatible with Streamlit 2025+. |
| Auto-hyperparameter UI | Simplifies model experimentation for non-programmers. |
## Preprocessing Pipeline

Numeric features:
- Standard scaling (zero mean, unit variance)

Categorical features:
- OneHotEncoder (`sparse_output=False` for sklearn ≥ 1.2, `sparse=False` otherwise)
Pipeline Example:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)
model = RandomForestClassifier()
pipe = Pipeline([("preprocess", preprocess), ("model", model)])
```

## Supported Models

### Classification

| Model | When to Use | Key Parameters |
|---|---|---|
| Logistic Regression | Simple linear separable data | C, max_iter |
| RandomForestClassifier | Nonlinear problems, tabular data | n_estimators, max_depth |
| XGBClassifier | Large-scale data or high-dimensional | learning_rate, n_estimators, max_depth |
### Regression

| Model | When to Use | Key Parameters |
|---|---|---|
| Linear Regression | Simple linear relationships | None |
| RandomForestRegressor | Nonlinear regression | n_estimators, max_depth |
| XGBRegressor | Complex relationships, feature interactions | learning_rate, max_depth |
## Metrics Explained

| Metric | Formula | Task | Interpretation |
|---|---|---|---|
| Accuracy | Correct / Total | Classification | Higher is better; best for balanced classes |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Classification | Balances precision vs recall |
| ROC‑AUC | ∫ TPR d(FPR) | Classification (binary) | Measures separability; 0.5 = random |
| R² | 1 − SSR/SST | Regression | Proportion of variance explained |
| MAE | Σ\|y−ŷ\| / n | Regression | Average absolute error in target units |
| RMSE | √(Σ(y−ŷ)² / n) | Regression | Penalizes large errors more |
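These metrics map directly onto `sklearn.metrics`. A quick worked example on toy data (RMSE is computed as the square root of MSE, which works on every sklearn version without the `squared` keyword):

```python
from math import sqrt
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Classification: 3 of 4 predictions correct
y_true, y_pred = [0, 1, 1, 0], [0, 1, 0, 0]
print(accuracy_score(y_true, y_pred))  # 0.75
print(f1_score(y_true, y_pred))        # 2*(1.0*0.5)/(1.0+0.5) ≈ 0.667

# Regression: one prediction off by 1
y, yhat = [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]
print(mean_absolute_error(y, yhat))       # 1/3 ≈ 0.333
print(sqrt(mean_squared_error(y, yhat)))  # RMSE ≈ 0.577
print(r2_score(y, yhat))                  # 0.5
```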
## User Guide

1. Upload your CSV or use the built-in samples.
2. Choose which column to predict; the system auto‑detects the task type.
3. Click **Train Model** and adjust hyperparameters if needed.
4. Review the interactive Plotly charts: metrics, confusion matrix, ROC, or residuals.
5. Download `model_pipeline.joblib` and `predictions.csv` for reuse.
## Deployment Guide

### Local

```shell
streamlit run app_streamlit_clean.py
```

### Streamlit Cloud

- Push your repository to GitHub.
- Deploy via share.streamlit.io: select the Streamlit template, upload your repository → done!

### Docker

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
EXPOSE 8501
CMD ["streamlit", "run", "app_streamlit_clean.py", "--server.port=8501", "--server.address=0.0.0.0"]
```
## Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| `OneHotEncoder` got unexpected keyword `'sparse'` | scikit‑learn ≥ 1.2 uses `sparse_output` | Auto‑detected; upgrade sklearn if needed. |
| `Unknown label type: continuous` | Classifier chosen for a numeric target | Auto‑detection now prevents this. |
| `mean_squared_error` got unexpected kw `'squared'` | sklearn < 1.0 | Automatically handled. |
| `DeprecationWarning: use_container_width` | Streamlit 2025 update | Already replaced with `width='stretch'`. |
## Performance Tips

- For large CSVs, use preprocessed subsets.
- Use RandomForest for fast, robust tabular modeling.
- Enable XGBoost only if you have sufficient memory.
- Limit categorical cardinality to avoid excessive OHE expansion.
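The cardinality tip can be checked up front with a small pandas helper. This is an illustrative sketch; the function name and the threshold of 50 are arbitrary choices, not app defaults:

```python
import pandas as pd

def high_cardinality_columns(df: pd.DataFrame, threshold: int = 50):
    """List object/categorical columns whose one-hot encoding
    would expand into more than `threshold` dummy columns."""
    cats = df.select_dtypes(include=["object", "category"]).columns
    return [c for c in cats if df[c].nunique() > threshold]

df = pd.DataFrame({
    "city": [f"city_{i}" for i in range(100)],  # 100 unique values
    "size": ["S", "M", "L", "M"] * 25,          # 3 unique values
})
print(high_cardinality_columns(df))  # ['city']
```

Columns flagged this way can be dropped, bucketed, or target-encoded before training.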
## Roadmap

- Add cross‑validation & Optuna tuning
- Integrate SHAP explainability
- Support time‑series & NLP preprocessing
- Multi‑page dashboards & reporting
- REST API for live predictions
- Continuous Integration with GitHub Actions