Python / Python Mathematical Intuition and Scikit Learn Interview Questions
Why does using simple label encoding (integers) for nominal categorical features mislead most machine learning models, mathematically?
Label encoding assigns each category an arbitrary integer, e.g. Red=0, Green=1, Blue=2. The problem is that most models (linear regression, logistic regression, distance-based methods, and even tree splitting algorithms, which treat the codes as ordered thresholds) implicitly assume numeric features have a meaningful order and magnitude. A linear model learns a single coefficient β for this feature, so relative to Red the contribution of Blue (2β) is forced to be exactly twice that of Green (1β), and the effect of going from Red to Green is identical in size to going from Green to Blue. A distance-based method likewise sees Red and Blue as twice as far apart as Red and Green. For a nominal (unordered) category like colour, these numeric relationships are meaningless and introduce a false signal.
One-hot encoding solves this by representing each category as a separate binary indicator column, removing any implied ordering or magnitude relationship: the model learns an independent coefficient for each category, with no false constraint linking them. The mathematical price is increased dimensionality: k categories become k separate columns instead of one (or k-1 with drop='first', which avoids the dummy variable trap, i.e. perfect collinearity between the indicator columns and the intercept).
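The constraint described above can be checked numerically. The sketch below uses hypothetical toy data (the per-colour means 10, 50, 20 are assumptions chosen so that the target is not monotone in the integer codes) and fits the same linear model on both encodings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: integer codes Red=0, Green=1, Blue=2, with assumed
# per-colour target means that are NOT monotone in the code.
codes = np.array([0, 1, 2] * 4)
true_means = np.array([10.0, 50.0, 20.0])  # Red, Green, Blue (illustrative)
y = true_means[codes]

# Label-encoded: a single coefficient, so predictions for codes 0, 1, 2
# must lie on one straight line (equal spacing between categories).
lin_label = LinearRegression().fit(codes.reshape(-1, 1), y)
pred_label = lin_label.predict(np.array([[0], [1], [2]]))

# One-hot (built by hand with an identity matrix): one coefficient per
# category, no intercept, so each category gets its own fitted level.
one_hot_X = np.eye(3)[codes]
lin_ohe = LinearRegression(fit_intercept=False).fit(one_hot_X, y)
pred_ohe = lin_ohe.predict(np.eye(3))

print(pred_label)  # equally spaced values, cannot reach 10 / 50 / 20
print(pred_ohe)    # recovers the per-category means
```

On this toy data the label-encoded fit is forced into equal steps between consecutive codes, while the one-hot fit is free to recover each category's mean separately.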
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import numpy as np
colors = np.array(['Red', 'Green', 'Blue', 'Green']).reshape(-1, 1)
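# Label encoding: LabelEncoder sorts classes alphabetically, so the
# implied (false) ordering here is Red=2 > Green=1 > Blue=0
# (LabelEncoder is intended for targets; OrdinalEncoder is the feature-side equivalent)
le = LabelEncoder()
label_encoded = le.fit_transform(colors.ravel())
print(label_encoded) # [2 1 0 1], a numerically meaningless order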
# One-hot encoding: no implied order, each category is independent
ohe = OneHotEncoder(sparse_output=False, drop='first')
one_hot = ohe.fit_transform(colors)
print(one_hot)
# Columns are [Green, Red]; Blue (first alphabetically) is dropped
# [[0. 1.]   # Red   (Green=0, Red=1)
#  [1. 0.]   # Green
#  [0. 0.]   # Blue  (baseline, dropped category)
#  [1. 0.]]  # Green
# EXCEPTION: tree-based models can sometimes handle label-encoded
# nominal features reasonably well since they split on thresholds
# rather than assuming linear magnitude relationships, but one-hot
# encoding (or target encoding) is still generally safer
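The tree-based exception noted in the comments above can be illustrated with a small sketch. It reuses the same hypothetical toy data (codes Red=0, Green=1, Blue=2 and assumed means 10, 50, 20); a shallow decision tree isolates the middle category with two threshold splits, which a single linear coefficient cannot do.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same hypothetical toy data: label-encoded colour and a non-monotone target.
codes = np.array([0, 1, 2] * 4).reshape(-1, 1)
y = np.array([10.0, 50.0, 20.0])[codes.ravel()]

# Two threshold splits (code <= 0.5, then code <= 1.5) are enough to give
# every category its own leaf, so no linear magnitude is imposed.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(codes, y)
pred_tree = tree.predict(np.array([[0], [1], [2]]))

print(pred_tree)  # one fitted level per category
```

This works cleanly with three categories; with many categories the tree needs more splits to separate groups that are not adjacent in the arbitrary integer order, which is one reason one-hot or target encoding is still often preferred.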
