📦 1. Import Libraries
Import the core libraries for data analysis, visualization, and machine-learning models.
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
# sklearn modules for dataset loading, splitting, and modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score
📊 2. Load and Inspect Dataset
Load the dataset and inspect its basic structure and summary statistics.
Python
# Example: load your own CSV file
data = pd.read_csv("your_dataset.csv")
# View first few rows
print(data.head())
# View column names and types
print(data.info())
# Get summary statistics for numeric columns
print(data.describe())
🔍 3. Select Features and Target
Define the input features (X) and target variable (y).
Python
# Example: assume 'target' column is what we want to predict
X = data.drop(columns=['target'])
y = data['target']
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
📈 4. Visualizing Data
Use Seaborn and Matplotlib to explore variable relationships.
Python
# Scatter plot between one feature and target
feature = 'your_feature_name' # replace with actual column name
plt.figure(figsize=(8, 6))
plt.scatter(X[feature], y)
plt.xlabel(feature)
plt.ylabel("Target Value")
plt.title(f"{feature} vs Target")
plt.show()
# Correlation heatmap (numeric columns only)
plt.figure(figsize=(8, 6))
sb.heatmap(data.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()
🏠 1. Load Data & Create a Binary Target
Load the California Housing data and convert the continuous target into a binary classification problem (above/below the median price).
Python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load data and convert target
data = fetch_california_housing()
X = data.data
y = data.target
medianPrice = np.median(y)
binary_y = (y > medianPrice).astype(int) # Target 0 or 1
X_train, X_test, y_train, y_test = train_test_split(X, binary_y, test_size=0.2, random_state=42)
🏃 2. KNN (Unscaled Data)
KNN without scaling often performs poorly if features have vastly different ranges.
Python
est = KNeighborsClassifier()
est.fit(X_train, y_train)
pred = est.predict(X_test)
print("Accuracy (Unscaled):", accuracy_score(y_test, pred))
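To make the claim about feature ranges concrete, here is a minimal self-contained sketch on synthetic data (hypothetical, not the housing set): inflating one feature's scale lets it dominate the Euclidean distance, and min-max scaling restores the other feature's influence.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Two informative features; then blow one feature's scale up 1000x
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X[:, 1] *= 1000  # this feature now dominates Euclidean distances
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

acc_raw = KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)

scaler = MinMaxScaler().fit(X_train)  # fit on training data only
acc_scaled = KNeighborsClassifier().fit(scaler.transform(X_train), y_train) \
                                   .score(scaler.transform(X_test), y_test)
print(f"Accuracy raw: {acc_raw:.3f}, scaled: {acc_scaled:.3f}")
```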
📏 3. KNN with Feature Scaling (MinMaxScaler)
Scaling is crucial for distance-based algorithms. Fit the scaler on training data ONLY, then transform both train and test sets.
Python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Choose a scaler (MinMax or Standard)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN on scaled data
est_scaled = KNeighborsClassifier()
est_scaled.fit(X_train_scaled, y_train)
pred_scaled = est_scaled.predict(X_test_scaled)
print("Accuracy (Scaled):", accuracy_score(y_test, pred_scaled))
🎨 4. Visualizing Misclassified Points
Plot two features of the test set, coloring points by true class (`c=y_test`) and marking errors with red edges (`edgecolors='r'` is visible only where `linewidths` is nonzero, i.e., on misclassified points).
Python
import matplotlib.pyplot as plt
# Plot with unscaled test data so the axes keep their original units
pred_unscaled = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)
plt.figure(figsize=(8, 6))
plt.scatter(X_test[:, 0], X_test[:, 1],
            c=y_test,
            linewidths=(y_test != pred_unscaled) * 1.5,
            edgecolors='r')
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
plt.title("KNN Test Predictions (Errors Highlighted in Red)")
plt.colorbar()
plt.show()
🎲 1. Generate & Visualize Synthetic Data
Create sample data using `make_blobs` for binary classification demonstration.
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
n = 1000
X, y = make_blobs(n, centers=2, cluster_std=[2,4], random_state=42)
plt.scatter(X[:,0], X[:,1], c=y)
plt.title("Synthetic Blob Data")
plt.colorbar(label="Class")
plt.show()
🧠 2. Train the Logistic Regression Model
Split the data, fit a `LogisticRegression` model, and evaluate its accuracy.
Python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
est = LogisticRegression()
est.fit(X_train, y_train)
pred = est.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print("Coefficients:", est.coef_)
print("Intercept:", est.intercept_)
📈 3. Visualize the Decision Boundary
The decision boundary is the line where the model predicts the probability of class 1 is 0.5.
Python
def dividingLine(x):
    b = est.intercept_[0]
    w1, w2 = est.coef_.T
    # Equation derived from w1*x1 + w2*x2 + b = 0
    c = -b / w2
    m = -w1 / w2
    return m * x + c
x_range = np.array([np.min(X[:,0]), np.max(X[:,0])])
# Highlight misclassified points (y_test != pred) with red edges
plt.scatter(X_test[:,0], X_test[:,1], c=y_test,
            linewidths=(y_test != pred) * 2, edgecolors='r')
plt.plot(x_range, dividingLine(x_range), color='black', linestyle='--')
plt.title("Logistic Regression Decision Boundary")
plt.axis('equal')
plt.show()
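As a sanity check on the 0.5-probability claim, points constructed to lie exactly on the fitted line should get a predicted class-1 probability of 0.5. A minimal self-contained sketch on the same kind of blob data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(1000, centers=2, cluster_std=[2, 4], random_state=42)
est = LogisticRegression().fit(X, y)

# Solve w1*x1 + w2*x2 + b = 0 for x2 to place points exactly on the boundary
b = est.intercept_[0]
w1, w2 = est.coef_[0]
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 5)
x2 = -(w1 * x1 + b) / w2

# Probability of class 1 at each boundary point: should be 0.5 up to float error
proba = est.predict_proba(np.c_[x1, x2])[:, 1]
print(proba)
```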
🌸 1. Load and Inspect Iris Data (Multiclass)
Load a standard multiclass dataset and visualize its structure using Pairplots and Correlation Heatmaps.
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
ds = load_iris()
X = ds.data
y = ds.target
# Detailed Feature Visualization (Pairplot)
df = load_iris(as_frame=True)['frame']
sns.pairplot(df, hue='target', palette='colorblind')
plt.show()
# Correlation Heatmap
cor = np.corrcoef(X, rowvar=False)
plt.figure(figsize=(7, 6))
sns.heatmap(cor, annot=True, xticklabels=ds.feature_names, yticklabels=ds.feature_names)
plt.title("Feature Correlation")
plt.show()
🔪 2. Cross-Validation Split Strategies
Common ways to split data for robust evaluation: K-Fold (most common), Leave-One-Out (for small datasets), ShuffleSplit (random sampling).
Python
from sklearn.model_selection import KFold, ShuffleSplit
X_demo = np.arange(20).reshape(-1,1)
# Use KFold for fixed splits
cv = KFold(n_splits=4, shuffle=True, random_state=42)
print("K-Fold Splits (n_splits=4):")
for i, (train_index, test_index) in enumerate(cv.split(X_demo)):
    print(f"  Fold {i+1}: Train count={len(train_index)}, Test count={len(test_index)}")
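The other strategies mentioned above follow the same split-loop pattern; a brief sketch of Leave-One-Out (one test sample per split) and ShuffleSplit (independent random splits):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, ShuffleSplit

X_demo = np.arange(20).reshape(-1, 1)

# Leave-One-Out: as many splits as samples, each with a single test point
loo = LeaveOneOut()
print("LeaveOneOut split count:", loo.get_n_splits(X_demo))

# ShuffleSplit: each iteration draws a fresh random train/test partition
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=42)
for i, (tr, te) in enumerate(ss.split(X_demo)):
    print(f"  Split {i+1}: Train count={len(tr)}, Test count={len(te)}")
```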
🎯 3. Evaluating Model with `cross_val_score`
Evaluate a base model (e.g., KNN with k=3) across all folds defined by the CV strategy.
Python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# X and y from Iris data, loaded previously
est = KNeighborsClassifier(n_neighbors=3)
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(est, X, y, scoring='accuracy', cv=cv_strategy)
print(f"Average Score: {np.mean(scores):.4f}")
print(f"Standard Deviation: {np.std(scores):.4f}")
🔍 4. Hyperparameter Tuning with `GridSearchCV`
Search for optimal hyperparameters (`n_neighbors` and distance metric `p`) by exhaustively testing every combination defined in `param_grid`.
Python
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
est = KNeighborsClassifier()
# StratifiedShuffleSplit maintains class proportions across folds
cv_grid = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
param_grid = {
    'n_neighbors': range(3, 11),  # k from 3 to 10
    'p': range(1, 6)              # Minkowski power (p=1: Manhattan, p=2: Euclidean)
}
search = GridSearchCV(est, param_grid, scoring='accuracy', cv=cv_grid, n_jobs=-1)
search.fit(X, y)
print("Best CV Score:", search.best_score_)
print("Best Parameters:", search.best_params_)
print("Best Model:", search.best_estimator_)
📊 5. Heatmap of Grid Search Scores
Visualize the `mean_test_score` for every combination of hyperparameters using a heatmap for easy comparison.
Python
# Assuming 'search' object was created in the previous step
results = pd.DataFrame(search.cv_results_)
# Pivot the results table to get (p x n_neighbors) matrix
pivot_table = results.pivot_table(
    index='param_p',
    columns='param_n_neighbors',
    values='mean_test_score'
)
plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, fmt=".4f", cmap="YlGnBu")
plt.title("Grid Search Accuracy Scores")
plt.xlabel("n_neighbors (k)")
plt.ylabel("p (Distance Metric)")
plt.show()
📄 Structured ML Assignment Workflow Prompts
Prompt 1: Initial Assignment Setup
You will complete a full Machine Learning assignment on a CSV dataset by working in 4 structured phases, one after another. The final objective is to explore the data, build regression and classification models, and visualize a decision boundary with misclassified points, following a clean, human-like workflow. At every phase, you must write short explanations, make simple reasoning-based choices, and maintain readable, well-commented code. Your work will follow these four phases:

Phase 1 — EDA & Preprocessing: Understand the dataset using pairplots, correlation heatmaps, and summary statistics. Prepare the data by splitting and scaling.

Phase 2 — Regression: Build and evaluate SLR, MLR, Ridge, and Lasso models using R², comparing their performance and interpreting results.

Phase 3 — Classification: Convert the target to a binary label using a median split, train Logistic Regression and KNN classifiers, and compare their accuracy.

Phase 4 — Decision Boundary + Final Summary: Plot the decision boundary, highlight misclassified points, overlay a y = mx + c line, and write a final insight-oriented summary.

After each phase, you must stop and wait for the next instruction. If ready, respond: “Ready. Waiting for Phase 1.”
Prompt 2: Phase 1 Instructions (EDA & Preprocessing)
PHASE 1 — EDA & Preprocessing

Load & Inspect the Dataset: Load the CSV "DATASET_NAME.csv" using pandas and display .head(), .info(), .describe(), and the dataset shape. Write a short paragraph describing what the dataset looks like, including the target column and any early observations.

Visual EDA: Generate a pairplot of numeric features and discuss visible trends or clusters in 4–6 sentences. Then compute a correlation heatmap with annotations and describe the strongest correlations, especially those related to the target.

Preprocessing Steps: Select the target column, separate X and y, perform a train–test split (test_size=0.2, fixed random_state), and apply Min-Max scaling to the features. Briefly explain why scaling is important.

When finished, do not move to modeling. End with: “Phase 1 complete. Waiting for Phase 2.”
Prompt 3: Phase 2 Instructions (Regression)
PHASE 2 — Regression Modeling (SLR, MLR, Ridge, Lasso) In this phase, you will model the continuous target and evaluate performance using R². Work step-by-step and write short explanations after each result.

Simple Linear Regression (SLR): Select one meaningful feature based on Phase 1 insights (pairplot or correlations). Train an SLR model and report R², coefficient(s), and intercept. Add a concise interpretation of whether the single feature seems predictive.

Multiple Linear Regression (MLR): Train MLR using all features. Report R², coefficients, and intercept, then compare MLR vs SLR in 3–5 sentences. Mention whether adding features improved the model and why that is expected.

Ridge & Lasso Regression: Train Ridge and Lasso, tuning alpha over a small set (e.g., [0.01, 0.1, 1, 10]). Report the best R² for each model and briefly describe how regularization affects coefficients. Write 3–6 sentences comparing MLR, Ridge, and Lasso in terms of fit and general behavior.

When Phase 2 is finished, stop and say: “Phase 2 complete. Waiting for Phase 3.”
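For reference, a minimal sketch of the Ridge/Lasso alpha loop this prompt asks for. Since the assignment CSV is unspecified, scikit-learn's bundled diabetes dataset is used here purely as a stand-in:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Stand-in dataset; replace with the assignment's X and y
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

alphas = [0.01, 0.1, 1, 10]
results = {}
for Model in (Ridge, Lasso):
    # Score each alpha on the held-out test set and keep the best
    scores = {a: r2_score(y_test, Model(alpha=a).fit(X_train, y_train).predict(X_test))
              for a in alphas}
    best_alpha = max(scores, key=scores.get)
    results[Model.__name__] = (best_alpha, scores[best_alpha])
    print(f"{Model.__name__}: best alpha={best_alpha}, R^2={scores[best_alpha]:.4f}")
```

A higher alpha shrinks coefficients more aggressively; Lasso can zero some out entirely, which is worth mentioning in the comparison write-up.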
Prompt 4: Phase 3 Instructions (Classification)
PHASE 3 — Classification (Median Split, Logistic Regression, KNN) In this phase, convert the regression problem into classification and compare model performance using accuracy.

Binary Target Creation (Median Split): Convert the continuous target into a binary label (1 if above the median, 0 otherwise). Perform a fresh train–test split and re-apply scaling.

Logistic Regression: Train a Logistic Regression model and report accuracy. Print coefficients and add a short explanation of how well it separated the classes.

K-Nearest Neighbors (KNN): Train a KNN classifier using multiple k values (such as 3, 5, 7). Select the best k based on accuracy and explain your choice in 3–5 sentences. Then compare KNN vs Logistic Regression briefly.

When complete, stop and say: “Phase 3 complete. Waiting for Phase 4.” Do not draw decision boundaries yet — that happens in the next phase.
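For reference, a minimal sketch of the median split plus KNN k-selection loop described above (again using the bundled diabetes data as a stand-in for the assignment CSV):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Stand-in dataset; replace with the assignment's X and continuous y
X, y = load_diabetes(return_X_y=True)
y_bin = (y > np.median(y)).astype(int)  # median split: 1 if above median

X_train, X_test, y_train, y_test = train_test_split(X, y_bin, test_size=0.2, random_state=42)
scaler = MinMaxScaler().fit(X_train)  # re-apply scaling after the fresh split
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

scores = {}
for k in (3, 5, 7):
    scores[k] = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train) \
                                                   .score(X_test_s, y_test)
best_k = max(scores, key=scores.get)
print("Accuracies:", scores, "-> best k:", best_k)
```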
Prompt 5: Phase 4 Instructions (Decision Boundary & Final Summary)
PHASE 4 — Decision Boundary Visualization & Final Summary In this final phase, you will visualize classification behavior and wrap up the assignment clearly and professionally.

Mandatory Decision Boundary Plot: Pick two meaningful features (based on correlations or model coefficients). Train your chosen classifier (Logistic Regression or the best KNN) on these two features and: plot the decision boundary using a meshgrid; plot the test points colored by true class; highlight misclassified points; overlay a reference line y = mx + c for visual separation.

Interpret the Plot: Write 6–10 sentences discussing separability, misclassification patterns, and what the boundary visually tells us about the model.

Final Summary: Concisely recap the key EDA findings, the best regression model (R²), the best classification model (accuracy), and one clear insight from the decision boundary.

End with: “All phases completed.”
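The meshgrid technique is not shown earlier in this guide, so here is a minimal self-contained sketch on synthetic blob data (a stand-in for the assignment's two chosen features): classify every point of a dense grid, then shade the resulting regions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-feature data standing in for the assignment CSV
X, y = make_blobs(n_samples=500, centers=2, cluster_std=2.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
est = LogisticRegression().fit(X_train, y_train)
pred = est.predict(X_test)

# Dense grid covering the feature space, classified point by point
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = est.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)  # shaded decision regions
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
            linewidths=(y_test != pred) * 1.5, edgecolors='r')
plt.title("Decision Regions via Meshgrid (Errors Edged in Red)")
plt.show()
```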