Python / Python Mathematical Intuition and Scikit Learn Interview Questions
Why does fitting a scaler or transformer on the entire dataset (before train/test split) cause data leakage, mathematically?
Data leakage occurs when information from outside the training set improperly influences the model. If you fit a StandardScaler on the full dataset before splitting, the mean and standard deviation are computed over the test rows as well as the training rows. Every scaled training value, z = (x - mean) / std, is therefore a function of the test set's distribution. No test labels are involved, but the inputs the model trains on have still been shaped by test data it should never have seen.
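To make the "mathematically" part of the question concrete, the dependence on the test rows can be written out for a single feature. This is a sketch under assumed notation that does not appear in the original answer: n_tr and n_te are the number of training and test rows, mu and sigma with matching subscripts are their per-split means and standard deviations, and variances are population variances (ddof = 0), which is what StandardScaler uses.

```latex
% Assumed notation: n = n_tr + n_te; "full" = statistics fitted on all rows.
\[
\mu_{\text{full}}
  = \frac{n_{\text{tr}}\,\mu_{\text{tr}} + n_{\text{te}}\,\mu_{\text{te}}}{n},
\qquad
\sigma_{\text{full}}^{2}
  = \frac{n_{\text{tr}}\,\sigma_{\text{tr}}^{2} + n_{\text{te}}\,\sigma_{\text{te}}^{2}}{n}
  + \frac{n_{\text{tr}}\,n_{\text{te}}}{n^{2}}
    \left(\mu_{\text{tr}} - \mu_{\text{te}}\right)^{2}
\]

% A training row x_i scaled with the leaked statistics:
\[
z_i = \frac{x_i - \mu_{\text{full}}}{\sigma_{\text{full}}}
\]
```

Both statistics contain test-set terms, so each scaled training value z_i changes whenever the test rows change. Fitting on the training split alone replaces them with the training mean and standard deviation, which removes that dependence.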
This violates the assumption underlying generalisation estimates: the test set should stand in for unseen future data, whose statistics you cannot access at training time. The leakage from this specific mistake is often small in magnitude, but it systematically biases test performance to look better than true generalisation performance, and the bias grows when more elaborate preprocessing or feature engineering (for example feature selection or target encoding) is fitted on the full dataset.
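A small numeric check can show how much the scaler's statistics shift when test rows are included. The sketch below uses synthetic data and variable names (`leaky`, `clean`, `blend`) that are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative (not from the original question).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

leaky = StandardScaler().fit(X)        # statistics include the test rows
clean = StandardScaler().fit(X_train)  # statistics from training rows only

# The gap between the two sets of statistics is the leaked information.
mean_gap = np.abs(leaky.mean_ - clean.mean_)
scale_gap = np.abs(leaky.scale_ - clean.scale_)
print("mean gap per feature: ", mean_gap)
print("scale gap per feature:", scale_gap)

# The full-data mean is the size-weighted blend of the train and test means.
n_tr, n_te = len(X_train), len(X_test)
blend = (n_tr * X_train.mean(axis=0) + n_te * X_test.mean(axis=0)) / (n_tr + n_te)
print("matches leaky mean:", np.allclose(leaky.mean_, blend))
```

On well-behaved data like this the gaps tend to be small, which is why this form of leakage is easy to overlook.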
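from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# X, y: feature matrix and labels, assumed to be loaded already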
# WRONG: fit scaler on all data, THEN split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # leakage! mean/std include test rows
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# CORRECT: split first, fit scaler ONLY on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on train only
X_test_scaled = scaler.transform(X_test) # transform test with train's stats
# BEST PRACTICE: use a Pipeline — guarantees correct fit/transform separation
# automatically, especially important inside cross-validation loops
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipeline.fit(X_train, y_train) # scaler only sees X_train internally
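The point about cross-validation loops can be sketched as follows. This example assumes a synthetic dataset from `make_classification` and 5 folds; neither comes from the original answer. Passing the whole pipeline to `cross_val_score` means the scaler is refitted on the training portion of each fold, so no fold's held-out rows contribute to the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative (not from the original question).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# The pipeline is cloned and fitted once per fold, on that fold's
# training portion only; the held-out portion is only transformed.
scores = cross_val_score(pipeline, X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Scaling X once up front and then cross-validating only the model would reintroduce the leakage in every fold, which is the main reason to prefer the pipeline form.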
