Beyond scaling, why must feature selection methods also be included inside a cross-validation pipeline rather than applied beforehand?
Feature selection methods like SelectKBest choose features based on a statistical test (e.g. ANOVA F-value, mutual information) computed between each feature and the target on whatever data they are fitted to. If you perform feature selection on the entire dataset before cross-validation, the features are chosen using rows that will later serve as both training and validation folds. No model has been fit yet, but the choice of which features matter already encodes the relationship between X and y in the validation folds.
This is a particularly insidious form of leakage because it doesn't involve fitting a predictive model, yet it still systematically inflates cross-validated performance estimates: the selected feature subset is implicitly tuned to perform well on the data used to select it, including the validation folds. The correct procedure performs feature selection independently within each CV fold, using only that fold's training data, exactly mirroring how a Pipeline correctly handles scaling.
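As a minimal sketch of what "independently within each CV fold" means in practice, the snippet below fits a selector-plus-classifier Pipeline with cross_validate and prints which feature indices each fold's selector kept. The synthetic dataset from make_classification, its sizes, and max_iter=1000 are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 200 rows, 50 features, 5 informative
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

pipeline = Pipeline([
    ("selector", SelectKBest(score_func=f_classif, k=10)),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# return_estimator=True keeps the pipeline fitted on each fold's training data
cv_results = cross_validate(pipeline, X, y, cv=5, return_estimator=True)

# Indices of the features each fold's selector kept
selected_per_fold = [
    est.named_steps["selector"].get_support(indices=True)
    for est in cv_results["estimator"]
]
for fold, idx in enumerate(selected_per_fold):
    print(f"fold {fold}: {idx.tolist()}")
```

The selected subsets may or may not differ between folds. The point is only that each one was computed without seeing that fold's validation rows.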
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in data so the example runs (assumption, not from the original)
    X, y = make_classification(n_samples=200, n_features=500,
                               n_informative=5, random_state=0)

    # WRONG: select features using ALL data, then cross-validate
    selector = SelectKBest(score_func=f_classif, k=10)
    X_selected = selector.fit_transform(X, y)  # leakage: uses all of y
    wrong_scores = cross_val_score(LogisticRegression(), X_selected, y, cv=5)
    # This score is optimistically biased!

    # CORRECT: feature selection inside the pipeline, refit per fold
    pipeline = Pipeline([
        ('selector', SelectKBest(score_func=f_classif, k=10)),
        ('classifier', LogisticRegression()),
    ])
    correct_scores = cross_val_score(pipeline, X, y, cv=5)
    # Each fold independently selects its own top-10 features
    # using only that fold's training data

    print('Leaked estimate: ', wrong_scores.mean())    # often higher
    print('Honest estimate: ', correct_scores.mean())  # more realistic
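One way to see how large the bias can get is to run the same comparison on pure noise, where no feature carries any real signal and an honest estimate should sit near chance (about 0.5 for two balanced classes). The sketch below is illustrative only: the sample size, feature count, k, and random seed are assumptions chosen to make the effect visible, and the exact scores will vary with them.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Pure noise (assumption): features and labels are generated independently
X_noise = rng.normal(size=(100, 5000))
y_noise = rng.integers(0, 2, size=100)

# Leaky: pick the 20 features most correlated with y using ALL rows
X_leaky = SelectKBest(score_func=f_classif, k=20).fit_transform(X_noise, y_noise)
leaked = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5)

# Honest: selection is refit inside each fold on training rows only
honest_pipeline = Pipeline([
    ("selector", SelectKBest(score_func=f_classif, k=20)),
    ("classifier", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(honest_pipeline, X_noise, y_noise, cv=5)

print("Leaked estimate on noise:", leaked.mean())
print("Honest estimate on noise:", honest.mean())
```

With thousands of noise features and few rows, some features correlate with the labels by chance. Selecting them on the full dataset carries that chance correlation into the validation folds, which is why the leaked score can look far better than chance even though there is nothing to learn.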
