📦 1. Import Libraries
Import the core libraries for data analysis, visualization, and machine-learning models.
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
# sklearn modules for dataset loading, splitting, and modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score
📊 2. Load and Inspect Dataset
Load the dataset and inspect its basic structure and summary statistics.
Python
# Example: load your own CSV file
data = pd.read_csv("your_dataset.csv")
# View first few rows
print(data.head())
# View column names and types
print(data.info())
# Get summary statistics for numeric columns
print(data.describe())
🔍 3. Select Features and Target
Define the input features (X) and target variable (y).
Python
# Example: assume 'target' column is what we want to predict
X = data.drop(columns=['target'])
y = data['target']
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
📈 4. Visualizing Data
Use Seaborn and Matplotlib to explore variable relationships.
Python
# Scatter plot between one feature and target
feature = 'your_feature_name' # replace with actual column name
plt.figure(figsize=(8, 6))
plt.scatter(X[feature], y)
plt.xlabel(feature)
plt.ylabel("Target Value")
plt.title(f"{feature} vs Target")
plt.show()
# Correlation heatmap (numeric columns only)
plt.figure(figsize=(8, 6))
sb.heatmap(data.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()
🏠 1. Load Data & Create a Binary Target
Load the California Housing data and convert the continuous target into a binary classification problem (above/below the median price).
Python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load data and convert target
data = fetch_california_housing()
X = data.data
y = data.target
medianPrice = np.median(y)
binary_y = (y > medianPrice).astype(int) # Target 0 or 1
X_train, X_test, y_train, y_test = train_test_split(X, binary_y, test_size=0.2, random_state=42)
🏃 2. KNN (Unscaled Data)
KNN without scaling often performs poorly if features have vastly different ranges.
Python
est = KNeighborsClassifier()
est.fit(X_train, y_train)
pred = est.predict(X_test)
print("Accuracy (Unscaled):", accuracy_score(y_test, pred))
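To make the claim about feature ranges concrete, here is a minimal self-contained sketch on synthetic data (hypothetical, not the housing set): inflating one feature's scale lets it dominate the Euclidean distance, and min-max scaling restores the other feature's influence.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Two informative features; then blow one feature's scale up 1000x
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X[:, 1] *= 1000  # this feature now dominates Euclidean distances
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

acc_raw = KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)

scaler = MinMaxScaler().fit(X_train)  # fit on training data only
acc_scaled = KNeighborsClassifier().fit(scaler.transform(X_train), y_train) \
                                   .score(scaler.transform(X_test), y_test)
print(f"Accuracy raw: {acc_raw:.3f}, scaled: {acc_scaled:.3f}")
```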
📏 3. KNN with Feature Scaling (MinMaxScaler)
Scaling is crucial for distance-based algorithms. Fit the scaler on training data ONLY, then transform both train and test sets.
Python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Choose a scaler (MinMax or Standard)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN on scaled data
est_scaled = KNeighborsClassifier()
est_scaled.fit(X_train_scaled, y_train)
pred_scaled = est_scaled.predict(X_test_scaled)
print("Accuracy (Scaled):", accuracy_score(y_test, pred_scaled))
🎨 4. Visualizing Misclassified Points
Plot two features of the test set, coloring points by true class (`c=y_test`) and marking errors with red edges (`edgecolors='r'` is visible only where `linewidths` is nonzero, i.e., on misclassified points).
Python
import matplotlib.pyplot as plt
# Plot with unscaled test data so the axes keep their original units
pred_unscaled = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)
plt.figure(figsize=(8, 6))
plt.scatter(X_test[:, 0], X_test[:, 1],
            c=y_test,
            linewidths=(y_test != pred_unscaled) * 1.5,
            edgecolors='r')
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
plt.title("KNN Test Predictions (Errors Highlighted in Red)")
plt.colorbar()
plt.show()
🎲 1. Generate & Visualize Synthetic Data
Create sample data using `make_blobs` for binary classification demonstration.
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
n = 1000
X, y = make_blobs(n, centers=2, cluster_std=[2,4], random_state=42)
plt.scatter(X[:,0], X[:,1], c=y)
plt.title("Synthetic Blob Data")
plt.colorbar(label="Class")
plt.show()
🧠 2. Train the Logistic Regression Model
Split the data, fit a `LogisticRegression` model, and evaluate its accuracy.
Python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
est = LogisticRegression()
est.fit(X_train, y_train)
pred = est.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print("Coefficients:", est.coef_)
print("Intercept:", est.intercept_)
📈 3. Visualize the Decision Boundary
The decision boundary is the line where the model predicts the probability of class 1 is 0.5.
Python
def dividingLine(x):
    b = est.intercept_[0]
    w1, w2 = est.coef_.T
    # Equation derived from w1*x1 + w2*x2 + b = 0
    c = -b / w2
    m = -w1 / w2
    return m * x + c
x_range = np.array([np.min(X[:,0]), np.max(X[:,0])])
# Highlight misclassified points (y_test != pred) with red edges
plt.scatter(X_test[:,0], X_test[:,1], c=y_test,
            linewidths=(y_test != pred) * 2, edgecolors='r')
plt.plot(x_range, dividingLine(x_range), color='black', linestyle='--')
plt.title("Logistic Regression Decision Boundary")
plt.axis('equal')
plt.show()
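As a sanity check on the 0.5-probability claim, points constructed to lie exactly on the fitted line should get a predicted class-1 probability of 0.5. A minimal self-contained sketch on the same kind of blob data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(1000, centers=2, cluster_std=[2, 4], random_state=42)
est = LogisticRegression().fit(X, y)

# Solve w1*x1 + w2*x2 + b = 0 for x2 to place points exactly on the boundary
b = est.intercept_[0]
w1, w2 = est.coef_[0]
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 5)
x2 = -(w1 * x1 + b) / w2

# Probability of class 1 at each boundary point: should be 0.5 up to float error
proba = est.predict_proba(np.c_[x1, x2])[:, 1]
print(proba)
```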
🌸 1. Load and Inspect Iris Data (Multiclass)
Load a standard multiclass dataset and visualize its structure using Pairplots and Correlation Heatmaps.
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
ds = load_iris()
X = ds.data
y = ds.target
# Detailed Feature Visualization (Pairplot)
df = load_iris(as_frame=True)['frame']
sns.pairplot(df, hue='target', palette='colorblind')
plt.show()
# Correlation Heatmap
cor = np.corrcoef(X, rowvar=False)
plt.figure(figsize=(7, 6))
sns.heatmap(cor, annot=True, xticklabels=ds.feature_names, yticklabels=ds.feature_names)
plt.title("Feature Correlation")
plt.show()
🔪 2. Cross-Validation Split Strategies
Common ways to split data for robust evaluation: K-Fold (most common), Leave-One-Out (for small datasets), ShuffleSplit (random sampling).
Python
from sklearn.model_selection import KFold, ShuffleSplit
X_demo = np.arange(20).reshape(-1,1)
# Use KFold for fixed splits
cv = KFold(n_splits=4, shuffle=True, random_state=42)
print("K-Fold Splits (n_splits=4):")
for i, (train_index, test_index) in enumerate(cv.split(X_demo)):
    print(f"  Fold {i+1}: Train count={len(train_index)}, Test count={len(test_index)}")
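The other strategies mentioned above follow the same split-loop pattern; a brief sketch of Leave-One-Out (one test sample per split) and ShuffleSplit (independent random splits):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, ShuffleSplit

X_demo = np.arange(20).reshape(-1, 1)

# Leave-One-Out: as many splits as samples, each with a single test point
loo = LeaveOneOut()
print("LeaveOneOut split count:", loo.get_n_splits(X_demo))

# ShuffleSplit: each iteration draws a fresh random train/test partition
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=42)
for i, (tr, te) in enumerate(ss.split(X_demo)):
    print(f"  Split {i+1}: Train count={len(tr)}, Test count={len(te)}")
```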
🎯 3. Evaluating Model with `cross_val_score`
Evaluate a base model (e.g., KNN with k=3) across all folds defined by the CV strategy.
Python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# X and y from Iris data, loaded previously
est = KNeighborsClassifier(n_neighbors=3)
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(est, X, y, scoring='accuracy', cv=cv_strategy)
print(f"Average Score: {np.mean(scores):.4f}")
print(f"Standard Deviation: {np.std(scores):.4f}")
🔍 4. Hyperparameter Tuning with `GridSearchCV`
Search for optimal hyperparameters (`n_neighbors` and distance metric `p`) by exhaustively testing every combination defined in `param_grid`.
Python
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
est = KNeighborsClassifier()
# StratifiedShuffleSplit maintains class proportions across folds
cv_grid = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
param_grid = {
    'n_neighbors': range(3, 11),  # k from 3 to 10
    'p': range(1, 6)              # Minkowski power (p=1: Manhattan, p=2: Euclidean)
}
search = GridSearchCV(est, param_grid, scoring='accuracy', cv=cv_grid, n_jobs=-1)
search.fit(X, y)
print("Best CV Score:", search.best_score_)
print("Best Parameters:", search.best_params_)
print("Best Model:", search.best_estimator_)
📊 5. Heatmap of Grid Search Scores
Visualize the `mean_test_score` for every combination of hyperparameters using a heatmap for easy comparison.
Python
# Assuming 'search' object was created in the previous step
results = pd.DataFrame(search.cv_results_)
# Pivot the results table to get (p x n_neighbors) matrix
pivot_table = results.pivot_table(
    index='param_p',
    columns='param_n_neighbors',
    values='mean_test_score'
)
plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, fmt=".4f", cmap="YlGnBu")
plt.title("Grid Search Accuracy Scores")
plt.xlabel("n_neighbors (k)")
plt.ylabel("p (Distance Metric)")
plt.show()
📄 Structured ML Assignment Workflow Prompts
Prompt 1: Initial Assignment Setup
You will complete a full Machine Learning assignment on a CSV dataset by working in 4 structured phases, one after another. The final objective is to explore the data, build regression and classification models, and visualize a decision boundary with misclassified points, following a clean, human-like workflow. At every phase, you must write short explanations, make simple reasoning-based choices, and maintain readable, well-commented code. Your work will follow these four phases:

Phase 1 — EDA & Preprocessing: Understand the dataset using pairplots, correlation heatmaps, and summary statistics. Prepare the data by splitting and scaling.

Phase 2 — Regression: Build and evaluate SLR, MLR, Ridge, and Lasso models using R², comparing their performance and interpreting results.

Phase 3 — Classification: Convert the target to a binary label using a median split, train Logistic Regression and KNN classifiers, and compare their accuracy.

Phase 4 — Decision Boundary + Final Summary: Plot the decision boundary, highlight misclassified points, overlay a y = mx + c line, and write a final insight-oriented summary.

After each phase, you must stop and wait for the next instruction. If ready, respond: “Ready. Waiting for Phase 1.”
Prompt 2: Phase 1 Instructions (EDA & Preprocessing)
PHASE 1 — EDA & Preprocessing

Load & Inspect the Dataset: Load the CSV "DATASET_NAME.csv" using pandas and display .head(), .info(), .describe(), and the dataset shape. Write a short paragraph describing what the dataset looks like, including the target column and any early observations.

Visual EDA: Generate a pairplot of numeric features and discuss visible trends or clusters in 4–6 sentences. Then compute a correlation heatmap with annotations and describe the strongest correlations, especially those related to the target.

Preprocessing Steps: Select the target column, separate X and y, perform a train–test split (test_size=0.2, fixed random_state), and apply Min-Max scaling to the features. Briefly explain why scaling is important.

When finished, do not move to modeling. End with: “Phase 1 complete. Waiting for Phase 2.”
Prompt 3: Phase 2 Instructions (Regression)
PHASE 2 — Regression Modeling (SLR, MLR, Ridge, Lasso) In this phase, you will model the continuous target and evaluate performance using R². Work step-by-step and write short explanations after each result.

Simple Linear Regression (SLR): Select one meaningful feature based on Phase 1 insights (pairplot or correlations). Train an SLR model and report R², coefficient(s), and intercept. Add a concise interpretation of whether the single feature seems predictive.

Multiple Linear Regression (MLR): Train MLR using all features. Report R², coefficients, and intercept, then compare MLR vs SLR in 3–5 sentences. Mention whether adding features improved the model and why that is expected.

Ridge & Lasso Regression: Train Ridge and Lasso, tuning alpha over a small set (e.g., [0.01, 0.1, 1, 10]). Report the best R² for each model and briefly describe how regularization affects coefficients. Write 3–6 sentences comparing MLR, Ridge, and Lasso in terms of fit and general behavior.

When Phase 2 is finished, stop and say: “Phase 2 complete. Waiting for Phase 3.”
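For reference, a minimal sketch of the Ridge/Lasso alpha loop this prompt asks for. Since the assignment CSV is unspecified, scikit-learn's bundled diabetes dataset is used here purely as a stand-in:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Stand-in dataset; replace with the assignment's X and y
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

alphas = [0.01, 0.1, 1, 10]
results = {}
for Model in (Ridge, Lasso):
    # Score each alpha on the held-out test set and keep the best
    scores = {a: r2_score(y_test, Model(alpha=a).fit(X_train, y_train).predict(X_test))
              for a in alphas}
    best_alpha = max(scores, key=scores.get)
    results[Model.__name__] = (best_alpha, scores[best_alpha])
    print(f"{Model.__name__}: best alpha={best_alpha}, R^2={scores[best_alpha]:.4f}")
```

A higher alpha shrinks coefficients more aggressively; Lasso can zero some out entirely, which is worth mentioning in the comparison write-up.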
Prompt 4: Phase 3 Instructions (Classification)
PHASE 3 — Classification (Median Split, Logistic Regression, KNN) In this phase, convert the regression problem into classification and compare model performance using accuracy.

Binary Target Creation (Median Split): Convert the continuous target into a binary label (1 if above the median, 0 otherwise). Perform a fresh train–test split and re-apply scaling.

Logistic Regression: Train a Logistic Regression model and report accuracy. Print coefficients and add a short explanation of how well it separated the classes.

K-Nearest Neighbors (KNN): Train a KNN classifier using multiple k values (such as 3, 5, 7). Select the best k based on accuracy and explain your choice in 3–5 sentences. Then compare KNN vs Logistic Regression briefly.

When complete, stop and say: “Phase 3 complete. Waiting for Phase 4.” Do not draw decision boundaries yet — that happens in the next phase.
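For reference, a minimal sketch of the median split plus KNN k-selection loop described above (again using the bundled diabetes data as a stand-in for the assignment CSV):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Stand-in dataset; replace with the assignment's X and continuous y
X, y = load_diabetes(return_X_y=True)
y_bin = (y > np.median(y)).astype(int)  # median split: 1 if above median

X_train, X_test, y_train, y_test = train_test_split(X, y_bin, test_size=0.2, random_state=42)
scaler = MinMaxScaler().fit(X_train)  # re-apply scaling after the fresh split
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

scores = {}
for k in (3, 5, 7):
    scores[k] = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train) \
                                                   .score(X_test_s, y_test)
best_k = max(scores, key=scores.get)
print("Accuracies:", scores, "-> best k:", best_k)
```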
Prompt 5: Phase 4 Instructions (Decision Boundary & Final Summary)
PHASE 4 — Decision Boundary Visualization & Final Summary In this final phase, you will visualize classification behavior and wrap up the assignment clearly and professionally.

Mandatory Decision Boundary Plot: Pick two meaningful features (based on correlations or model coefficients). Train your chosen classifier (Logistic Regression or the best KNN) on these two features and: plot the decision boundary using a meshgrid; plot the test points colored by true class; highlight misclassified points; overlay a reference line y = mx + c for visual separation.

Interpret the Plot: Write 6–10 sentences discussing separability, misclassification patterns, and what the boundary visually tells us about the model.

Final Summary: Concisely recap the key EDA findings, the best regression model (R²), the best classification model (accuracy), and one clear insight from the decision boundary.

End with: “All phases completed.”
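The meshgrid technique is not shown earlier in this guide, so here is a minimal self-contained sketch on synthetic blob data (a stand-in for the assignment's two chosen features): classify every point of a dense grid, then shade the resulting regions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-feature data standing in for the assignment CSV
X, y = make_blobs(n_samples=500, centers=2, cluster_std=2.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
est = LogisticRegression().fit(X_train, y_train)
pred = est.predict(X_test)

# Dense grid covering the feature space, classified point by point
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = est.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)  # shaded decision regions
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
            linewidths=(y_test != pred) * 1.5, edgecolors='r')
plt.title("Decision Regions via Meshgrid (Errors Edged in Red)")
plt.show()
```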