Exercise X: Common Mistakes and Domain-Relevant Insights#

This exercise is designed to help you identify and address common mistakes in machine learning workflows, with a specific focus on issues relevant to neuroscience data. You’ll explore two key topics:

  1. Cross-Validation and Avoiding Data Leakage

  2. Handling Class Imbalance

Part 1: Cross-Validation and Avoiding Data Leakage#

Introduction#

Cross-validation is a powerful technique to evaluate model performance by splitting the data into training and testing sets multiple times. However, improper handling of preprocessing steps can lead to data leakage, where information from the test set influences the training process, resulting in overly optimistic performance estimates.

Task 1: Identifying Data Leakage#

This exercise demonstrates the proper and improper ways to apply cross-validation in a classification task, focusing on feature selection. You’ll see how performing feature selection on the entire dataset before cross-validation (“the wrong way”) can lead to misleadingly optimistic results, and how integrating feature selection within each cross-validation fold (“the correct way”) provides a realistic error estimate.

Scenario#

  • Number of samples (N): 50

  • Number of predictors (p): 5000

  • Number of classes: 2 (balanced)

  • Predictors are independent of class labels (no real signal).

  • True error rate of any classifier: 50%

We will:

  1. Simulate the “wrong way”: Feature selection on the entire dataset before cross-validation.

  2. Simulate the “correct way”: Feature selection within each cross-validation fold.

import numpy as np

# Generate synthetic data
N, p, num_predictors, K = 50, 5000, 100, 10
np.random.seed(0)
y = np.array([0] * (N // 2) + [1] * (N // 2))
np.random.shuffle(y)
X = np.random.randn(N, p)

n_simulations = 500  # number of repeated cross-validation runs (the "correct way" simulation below takes a while)

Step 1: Simulate the Wrong Way#

from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

def simulate_wrong(X: np.ndarray, y: np.ndarray, num_predictors: int, K: int) -> float:
    # Feature selection on the ENTIRE dataset: the test folds influence which
    # features are kept, which is exactly the leak we want to demonstrate
    selector_wrong = SelectKBest(score_func=f_classif, k=num_predictors)
    selector_wrong.fit(X, y)
    X_selected_wrong = selector_wrong.transform(X)

    # Initialize cross-validation
    kf = KFold(n_splits=K, shuffle=True, random_state=None)
    errors_wrong = []
    knn_wrong = KNeighborsClassifier(n_neighbors=1)

    for train_index, test_index in kf.split(X_selected_wrong):
        X_train, X_test = X_selected_wrong[train_index], X_selected_wrong[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Train the classifier
        knn_wrong.fit(X_train, y_train)

        # Predict on the test set
        y_pred = knn_wrong.predict(X_test)

        # Calculate error rate
        error = np.mean(y_pred != y_test)
        errors_wrong.append(error)

    avg_error_wrong = np.mean(errors_wrong)
    return avg_error_wrong

wrong_estimates = [simulate_wrong(X, y, num_predictors, K) for _ in range(n_simulations)]
print(
    f"Average error rate (wrong way): {np.mean(wrong_estimates):.3f} ± {np.std(wrong_estimates):.3f}"
)
Average error rate (wrong way): 0.044 ± 0.012

Question:

  • Why does feature selection on the entire dataset before cross-validation lead to an overly optimistic error rate?

Solution

When feature selection is performed on the entire dataset before cross-validation, the test set indirectly influences the selection of features. This means that information from the test set leaks into the training process, leading to an overfit model that performs unrealistically well during validation. As a result, the estimated error rate is lower than the true error rate because the model has effectively “seen” part of the test data during training.

Step 2: Simulate the Correct Way#

def simulate_right(X: np.ndarray, y: np.ndarray, num_predictors: int, K: int) -> float:
    # Initialize cross-validation
    errors_correct = []
    kf_correct = KFold(n_splits=K, shuffle=True, random_state=None)

    for train_index, test_index in kf_correct.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Feature selection on the training data only
        selector_correct = SelectKBest(score_func=f_classif, k=num_predictors)
        selector_correct.fit(X_train, y_train)
        X_train_selected = selector_correct.transform(X_train)
        X_test_selected = selector_correct.transform(X_test)

        # Train the classifier
        knn_correct = KNeighborsClassifier(n_neighbors=1)
        knn_correct.fit(X_train_selected, y_train)

        # Predict on the test set
        y_pred = knn_correct.predict(X_test_selected)

        # Calculate error rate
        error = np.mean(y_pred != y_test)
        errors_correct.append(error)

    avg_error_correct = np.mean(errors_correct)
    return avg_error_correct

right_estimates = [simulate_right(X, y, num_predictors, K) for _ in range(n_simulations)]
print(
    f"Average error rate (correct way): {np.mean(right_estimates):.3f} ± {np.std(right_estimates):.3f}"
)

Step 3: Compare Results#

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('bmh')
plt.rcParams.update({'font.size': 14})

true_error = 50.0  # True error rate (no signal in data)
avg_error_wrong = np.mean(wrong_estimates)
avg_error_correct = np.mean(right_estimates)
print("========================================")
print(f"True Error:   {true_error:.2f}%")
print(f"Wrong Way:    {avg_error_wrong * 100:.2f}%")
print(f"Correct Way:  {avg_error_correct * 100:.2f}%")
print("========================================")

print("Conclusion:")
print("The 'Wrong Way' of performing feature selection before cross-validation leads to an overly optimistic error rate.")
print("The 'Correct Way' of integrating feature selection within each cross-validation fold provides an unbiased estimate of the true error rate.")

# plot the distribution of error rates and the true error rate
fig, ax = plt.subplots(figsize=(10,10))
sns.boxplot(data=[wrong_estimates, right_estimates], ax=ax)
# add a horizontal line for the true error rate
ax.axhline(0.5, color='red', linestyle='--', label='True error rate')
ax.set_xticklabels(["Wrong way", "Right way"])
ax.set_ylabel("Error rate")
ax.set_title("Error rates with the wrong and right way of cross-validation")
plt.legend()
plt.show()
========================================
True Error:   50.00%
Wrong Way:    4.37%
Correct Way:  58.75%
========================================
Conclusion:
The 'Wrong Way' of performing feature selection before cross-validation leads to an overly optimistic error rate.
The 'Correct Way' of integrating feature selection within each cross-validation fold provides a realistic estimate of the true error rate.
[Figure: boxplots of the cross-validated error rates for the "wrong way" and the "right way", with the true error rate (0.5) shown as a dashed red line.]

Question:

  • How does performing feature selection within each fold of cross-validation provide a more accurate error estimate?

Solution

Performing feature selection within each fold ensures that the test set remains completely unseen during the training process. This avoids data leakage, as the feature selection is based solely on the training data within that fold. By doing this, the evaluation provides a realistic estimate of the model’s performance on truly unseen data, preventing over-optimistic error rates.

Additional Notes#

The ‘Correct Way’ often produces error rates higher than the true error rate (e.g., ~60%) in this scenario. This happens because:

  1. The predictors are randomly generated and independent of the labels, so selected features correlate with the labels purely by chance.

  2. High dimensionality and small sample size lead to overfitting, especially with inconsistent feature sets across folds.

  3. The 1-nearest-neighbor classifier is especially sensitive to noisy, irrelevant features, which amplifies the problem.

To further investigate:#

  • Inspect the features selected in each fold (a sketch follows this list).

  • Reduce the number of predictors (e.g., from 5000 to 500) and observe the results.

  • Use a different classifier (e.g., logistic regression) to reduce sensitivity to noise.
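As a starting point for the first suggestion, here is a minimal sketch (reusing X, y, num_predictors, and K from above) that records which features SelectKBest keeps on each training fold and checks how often the same features reappear; with purely random predictors, the overlap across folds should be small.

from collections import Counter

# Record which features SelectKBest keeps on each training fold
kf_inspect = KFold(n_splits=K, shuffle=True, random_state=0)
selected_per_fold = []
for train_index, _ in kf_inspect.split(X):
    selector = SelectKBest(score_func=f_classif, k=num_predictors)
    selector.fit(X[train_index], y[train_index])
    selected_per_fold.append(set(np.flatnonzero(selector.get_support())))

# Count in how many folds each feature was selected
fold_counts = Counter(f for fold in selected_per_fold for f in fold)
n_everywhere = sum(1 for c in fold_counts.values() if c == K)
print(f"{len(fold_counts)} distinct features selected across {K} folds; "
      f"{n_everywhere} of them appear in every fold")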

Other Sources of Data Leakage#

Feature selection is just one example of data leakage. Other preprocessing steps can also inadvertently introduce leakage if applied to the entire dataset before splitting into training and testing sets. Examples include:

  • Normalization or Standardization: Scaling features based on the entire dataset rather than only on the training set can allow information from the test set to influence the training process (see the sketch after this list).

  • Dimensionality Reduction: Methods like PCA, when applied before splitting the data, use information from the test set to define components, leading to over-optimistic evaluations.

  • Target Encoding for Categorical Variables: If encoding schemes that depend on the target variable are applied globally, the test set’s target values influence the training process.

  • Imputation of Missing Data: Imputing missing values using statistics (e.g., mean, median) computed over the entire dataset can leak test set information into the training set.
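For example, the leak-free pattern for standardization looks like this. The sketch below uses synthetic placeholder data (X_demo and y_demo are made up purely for illustration); the key point is that the scaler's mean and standard deviation are estimated from the training split only and then reused for the test split.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Placeholder data, purely for illustration
X_demo = np.random.randn(100, 3)
y_demo = np.random.randint(0, 2, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Leaky: StandardScaler().fit(X_demo) would estimate the mean/std from all rows,
# including the test rows.
# Leak-free: fit on the training split only, then apply the same transform to both.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)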

Summary of Data Leakage#

Data leakage occurs whenever information from the test set influences the training process. It leads to overly optimistic error estimates and unreliable models. To avoid data leakage:

  • Perform all preprocessing steps (e.g., scaling, feature selection, dimensionality reduction) separately within each fold of cross-validation; scikit-learn pipelines make this straightforward (see the sketch below).

  • Treat the test set as truly unseen data, ensuring no information flows from the test set to the training process.
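One convenient way to enforce this in scikit-learn is to wrap the preprocessing and the classifier in a Pipeline and pass the pipeline to cross_val_score; each fold then re-fits SelectKBest on its own training portion. A minimal sketch, reusing the objects defined above (it should give an error rate in the same range as the "correct way" simulation):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# The selector is re-fit inside every training fold, so no test information leaks in
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=num_predictors)),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=K, shuffle=True, random_state=0))
print(f"Cross-validated error rate: {1 - scores.mean():.3f}")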


Part 2: Handling Imbalanced Data#

Introduction#

Class imbalance is a common issue in machine learning, where one class significantly outnumbers the other(s). This can result in biased models that favor the majority class, leading to poor performance on the minority class, which may be the class of interest.

In this part of the tutorial, you will:

  1. Observe the effects of class imbalance on model performance.

  2. Apply techniques like class weighting and SMOTE (Synthetic Minority Oversampling Technique) to address imbalance.

  3. Evaluate and compare the improvements in model performance.

Step 1: Loading an Imbalanced Dataset#

We will use a credit card transactions dataset (creditcard.csv) in which fraudulent transactions (Class = 1) make up only a small minority of all records.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv("creditcard.csv")

# Display class distribution
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(x="Class", data=df, ax=ax)
ax.set_title("Class Distribution (Fraud vs Non-Fraud)")
ax.set_yscale("log")
ax.set_ylabel("Log(Count)")
ax.set_xticklabels(["Non-Fraud", "Fraud"])
plt.show()

# Split the dataset into features and target
X = df.drop(columns=["Class"])
y = df["Class"]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
[Figure: bar chart of the class distribution (Non-Fraud vs Fraud) with a log-scaled count axis.]

Step 2: Training Logistic Regression on Imbalanced Data#

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train a logistic regression model
model_imbalanced = LogisticRegression(solver="liblinear", random_state=42)
model_imbalanced.fit(X_train, y_train)

# Predict
y_pred_imbalanced = model_imbalanced.predict(X_test)
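Before moving on, it can be instructive to look at per-class metrics for this baseline; plain accuracy would look excellent simply because non-fraud transactions dominate. A quick check using the classification_report imported above (output omitted here):

# Per-class precision, recall, and F1 for the baseline model
print(classification_report(y_test, y_pred_imbalanced, target_names=["Non-Fraud", "Fraud"]))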

Step 3: Using Class Weighting to Handle Imbalance#

# Train a logistic regression model with class weighting
model_weighted = LogisticRegression(
    solver="liblinear", class_weight="balanced", random_state=42
)
model_weighted.fit(X_train, y_train)

# Predict
y_pred_weighted = model_weighted.predict(X_test)
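With class_weight="balanced", scikit-learn reweights each class by n_samples / (n_classes * n_samples_in_class), so errors on the rare fraud class cost the model much more during training. A small sketch to inspect the weights it would use (output omitted):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weights used by class_weight="balanced": n_samples / (n_classes * count per class)
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
print(dict(zip(classes, weights)))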

Step 4: Oversampling with SMOTE and Other Sampling Techniques#

Descriptions of Sampling Methods#

  • SMOTE (Synthetic Minority Oversampling Technique): creates new synthetic minority-class samples by interpolating between a minority sample and its nearest minority-class neighbors.

  • Random Oversampling: duplicates randomly chosen minority-class samples until the classes are balanced.

  • Random Undersampling: discards randomly chosen majority-class samples until the classes are balanced.

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Apply SMOTE to balance the dataset
smote = SMOTE(sampling_strategy='minority', random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Apply Random Oversampling
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

# Apply Random Undersampling
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

# Train logistic regression models
model_smote = LogisticRegression(solver="liblinear", random_state=42)
model_smote.fit(X_train_smote, y_train_smote)

model_ros = LogisticRegression(solver="liblinear", random_state=42)
model_ros.fit(X_train_ros, y_train_ros)

model_rus = LogisticRegression(solver="liblinear", random_state=42)
model_rus.fit(X_train_rus, y_train_rus)

# Predict with SMOTE
y_pred_smote = model_smote.predict(X_test)

# Predict with Random Oversampling
y_pred_ros = model_ros.predict(X_test)

# Predict with Random Undersampling
y_pred_rus = model_rus.predict(X_test)
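A quick sanity check is to compare the class counts before and after each resampling strategy (output omitted): SMOTE and random oversampling grow the minority class to match the majority, while random undersampling shrinks the majority class instead.

from collections import Counter

# Class counts in the training data before and after resampling
print("Original:             ", Counter(y_train))
print("SMOTE:                ", Counter(y_train_smote))
print("Random oversampling:  ", Counter(y_train_ros))
print("Random undersampling: ", Counter(y_train_rus))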

Step 5: Comparing Results#

Choosing Metrics for Comparison#

  • ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Measures the model’s ability to distinguish between classes across various thresholds. A higher AUC indicates better discrimination.

    • Useful for evaluating overall model performance when false positives and false negatives are equally important.

  • PR-AUC (Precision-Recall AUC): Focuses on the precision-recall tradeoff, particularly valuable for imbalanced datasets. PR-AUC highlights the model’s performance in predicting the minority class.

    • More informative than ROC-AUC when the positive class is rare.

  • F1-Score: Balances precision and recall, making it suitable for cases where the costs of false positives and false negatives are both significant.

    • Provides a single score to gauge model performance when class distributions are uneven.

By comparing these metrics, you gain a comprehensive understanding of model performance for both majority and minority classes, ensuring the chosen model aligns with your use case.
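As a concrete check, the F1-score is the harmonic mean of precision and recall. Plugging in the precision and recall that the imbalanced model achieves in the results table further below:

# F1 as the harmonic mean of precision and recall (values for the imbalanced model)
p, r = 0.738, 0.649
f1 = 2 * p * r / (p + r)
print(f"F1 = {f1:.2f}")  # ≈ 0.69, matching the table below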

from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import precision_recall_curve, auc


# Function to plot ROC curves
def plot_roc_curves(y_test: np.ndarray, models: list, labels: list, ax: plt.Axes) -> None:
    """
    Plot ROC curves for multiple models.

    Parameters
    ----------
    y_test : np.array
        True labels
    models : fitted models
        List of fitted models
    labels : list
        List of model names
    ax : matplotlib axes
        Axes to plot the ROC curves
    """
    for model, label in zip(models, labels):
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        roc_auc = roc_auc_score(y_test, y_pred_proba)
        ax.plot(fpr, tpr, label=f"{label} (AUC = {roc_auc:.2f})")

    # Plot diagonal line
    ax.plot([0, 1], [0, 1], color="navy", linestyle="--")
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("ROC-AUC Comparison")
    ax.legend(loc="lower right")

# function to plot PR curves
def plot_pr_curves(y_test: np.ndarray, models: list, labels: list, ax: plt.Axes) -> None:
    """
    Plot precision-recall curves for multiple models.

    Parameters
    ----------
    y_test : np.array
        True labels
    models : fitted models
        List of fitted models
    labels : list
        List of model names
    ax : matplotlib axes
        Axes to plot the PR curves
    """
    for model, label in zip(models, labels):
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
        pr_auc = auc(recall, precision)
        ax.plot(recall, precision, label=f"{label} (AUC = {pr_auc:.2f})")

    # set horizontal line for the baseline
    ax.axhline(y=sum(y_test) / len(y_test), color="navy", linestyle="--")
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    ax.set_title("Precision-Recall Comparison")
    ax.legend(loc="lower left")
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
plot_roc_curves(y_test, [model_imbalanced, model_weighted, model_smote, model_ros, model_rus],
                ["Imbalanced", "Weighted", "SMOTE", "Random Oversampling", "Random Undersampling"], axes[0])
plot_pr_curves(y_test, [model_imbalanced, model_weighted, model_smote, model_ros, model_rus],
                ["Imbalanced", "Weighted", "SMOTE", "Random Oversampling", "Random Undersampling"], axes[1])
[Figure: ROC curves (left) and precision-recall curves (right) for the five models, with AUC values in the legends.]
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    average_precision_score,
)

metrics = {
    "Technique": [
        "Imbalanced",
        "Class Weighting",
        "SMOTE",
        "Random Oversampling",
        "Random Undersampling",
    ],
    "Precision": [
        precision_score(y_test, y_pred_imbalanced, zero_division=0),
        precision_score(y_test, y_pred_weighted, zero_division=0),
        precision_score(y_test, y_pred_smote, zero_division=0),
        precision_score(y_test, y_pred_ros, zero_division=0),
        precision_score(y_test, y_pred_rus, zero_division=0),
    ],
    "Recall": [
        recall_score(y_test, y_pred_imbalanced),
        recall_score(y_test, y_pred_weighted),
        recall_score(y_test, y_pred_smote),
        recall_score(y_test, y_pred_ros),
        recall_score(y_test, y_pred_rus),
    ],
    "F1-Score": [
        f1_score(y_test, y_pred_imbalanced, zero_division=0),
        f1_score(y_test, y_pred_weighted, zero_division=0),
        f1_score(y_test, y_pred_smote, zero_division=0),
        f1_score(y_test, y_pred_ros, zero_division=0),
        f1_score(y_test, y_pred_rus, zero_division=0),
    ],
    "ROC-AUC": [
        roc_auc_score(y_test, model_imbalanced.predict_proba(X_test)[:, 1]),
        roc_auc_score(y_test, model_weighted.predict_proba(X_test)[:, 1]),
        roc_auc_score(y_test, model_smote.predict_proba(X_test)[:, 1]),
        roc_auc_score(y_test, model_ros.predict_proba(X_test)[:, 1]),
        roc_auc_score(y_test, model_rus.predict_proba(X_test)[:, 1]),
    ],
    "PR-AUC": [
        average_precision_score(y_test, model_imbalanced.predict_proba(X_test)[:, 1]),
        average_precision_score(y_test, model_weighted.predict_proba(X_test)[:, 1]),
        average_precision_score(y_test, model_smote.predict_proba(X_test)[:, 1]),
        average_precision_score(y_test, model_ros.predict_proba(X_test)[:, 1]),
        average_precision_score(y_test, model_rus.predict_proba(X_test)[:, 1]),
    ],
}

metrics_df = pd.DataFrame(metrics)
print(metrics_df)

# Visualize metrics
melted_metrics = pd.melt(
    metrics_df, ["Technique"], var_name="Metric", value_name="Score"
)
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(
    x="Technique", y="Score", hue="Metric", data=melted_metrics, palette="tab10", ax=ax
)
ax.set_title("Model Performance Metrics Comparison")
ax.set_ylabel("Score")
ax.set_ylim(0, 1)
ax.legend(title="Metric")
plt.xticks(rotation=45)
plt.show()
              Technique  Precision    Recall  F1-Score   ROC-AUC    PR-AUC
0            Imbalanced   0.738462  0.648649  0.690647  0.903861  0.581640
1       Class Weighting   0.066872  0.878378  0.124283  0.967929  0.702332
2                 SMOTE   0.080278  0.858108  0.146821  0.952065  0.662726
3   Random Oversampling   0.055411  0.864865  0.104150  0.950780  0.658294
4  Random Undersampling   0.051231  0.871622  0.096774  0.959640  0.628496
[Figure: grouped bar chart comparing precision, recall, F1-score, ROC-AUC, and PR-AUC across the five techniques.]

Wrap-Up#

  • There are many techniques for handling imbalanced data, and the choice of method depends on the problem and dataset characteristics.

  • An important question to ask is: How do we define a “good” model? The definition of “good” depends on the specific application and the costs of false positives and false negatives.

    • Example: In a medical diagnosis context, recall (sensitivity) might be prioritized to ensure all potential cases are identified, even at the cost of false positives.

    • Example: In fraud detection, precision might be more critical to avoid flagging legitimate transactions; one way to tune this tradeoff without resampling is to adjust the decision threshold, as sketched below.
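One knob the comparison above does not explore is the decision threshold: every model here uses the default 0.5 cutoff on predicted probabilities. Below is a minimal sketch (the target precision of 0.5 is an arbitrary, illustrative requirement) that scans the precision-recall curve of the class-weighted model for the smallest threshold meeting that precision, trading away some recall in return.

import numpy as np

# Scan thresholds of the class-weighted model for a desired precision level
proba = model_weighted.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)

target_precision = 0.5  # illustrative requirement, not a recommendation
candidates = np.where(precision[:-1] >= target_precision)[0]
if candidates.size:
    i = candidates[0]
    print(f"Threshold {thresholds[i]:.3f}: precision {precision[i]:.2f}, recall {recall[i]:.2f}")
else:
    print("No threshold reaches the requested precision")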

Rules of Thumb#

  1. Use PR-AUC over ROC-AUC when the dataset is highly imbalanced, as PR-AUC focuses more on the minority class.

  2. Evaluate multiple metrics (e.g., F1-Score, ROC-AUC, PR-AUC) to get a comprehensive view of model performance.

  3. Test different techniques (e.g., SMOTE, class weighting, undersampling) and assess their impact on both minority and majority class performance.