Python Mathematical Intuition and Scikit Learn Interview Questions
Mathematically, why does stochastic gradient descent (SGD) scale to large datasets better than batch gradient descent?
Batch gradient descent computes the exact gradient of the loss using all n training examples before taking a single parameter update step: ∇L(θ) = (1/n)Σᵢ ∇Lᵢ(θ). This requires O(n) computation per update — for datasets with millions of examples, even one update step becomes expensive, and you typically need many updates to converge.
SGD instead estimates the gradient from a single randomly sampled example, ∇Lᵢ(θ) for a uniformly random i, or from the average over a small mini-batch. This is an unbiased estimator of the true gradient: because i is drawn uniformly, E[∇Lᵢ(θ)] = (1/n)Σᵢ ∇Lᵢ(θ) = ∇L(θ), though each individual estimate carries added variance. The key insight is that SGD can take many more update steps for the same amount of computation, since the cost of each step is O(1) or O(batch_size) in the number of examples instead of O(n). Despite the noisier individual steps, the overall trajectory still makes progress because the noise has zero mean and averages out over many iterations; with a suitably decaying learning rate the iterates converge, whereas with a fixed rate they settle into a small neighbourhood of the optimum. For very large datasets, this tradeoff strongly favours SGD: you typically reach a good solution faster in wall-clock time even though each individual step is less precise.
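To make the unbiasedness and the cost tradeoff concrete, here is a minimal NumPy sketch on a synthetic least-squares problem. The problem sizes, learning rate, batch size, and helper names (`full_gradient`, `minibatch_gradient`) are illustrative assumptions, not part of the original answer. It first checks that averaging many mini-batch gradients approaches the full gradient, then compares one full-batch step against one epoch of mini-batch SGD, both of which touch every row once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem (sizes are illustrative assumptions)
n, d = 20_000, 5
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

def loss(theta):
    """Full-data loss L(theta) = (1/n) * sum_i 0.5 * (x_i . theta - y_i)^2."""
    return 0.5 * np.mean((X @ theta - y) ** 2)

def full_gradient(theta):
    """Exact gradient: touches all n rows, so O(n * d) work."""
    return X.T @ (X @ theta - y) / n

def minibatch_gradient(theta, idx):
    """Gradient estimate from the rows in idx: O(len(idx) * d) work."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

theta0 = np.zeros(d)
batch_size = 32

# 1) Unbiasedness: the average of many mini-batch gradients
#    approaches the full gradient at the same theta.
estimates = [
    minibatch_gradient(theta0, rng.integers(0, n, batch_size))
    for _ in range(5_000)
]
full = full_gradient(theta0)
rel_err = np.linalg.norm(np.mean(estimates, axis=0) - full) / np.linalg.norm(full)

# 2) Equal budget of one pass over the data:
#    a single full-batch step versus n / batch_size mini-batch steps.
lr = 0.05
theta_batch = theta0 - lr * full_gradient(theta0)

theta_sgd = theta0.copy()
perm = rng.permutation(n)
for start in range(0, n, batch_size):
    idx = perm[start:start + batch_size]
    theta_sgd -= lr * minibatch_gradient(theta_sgd, idx)

loss_batch = loss(theta_batch)
loss_sgd = loss(theta_sgd)
print(f"relative error of averaged estimate: {rel_err:.4f}")
print(f"loss after one batch step:  {loss_batch:.4f}")
print(f"loss after one SGD epoch:   {loss_sgd:.4f}")
```

Under these assumptions the SGD run should end with a much lower loss, because it made hundreds of small corrections for the same number of row reads that bought the batch method a single step.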
from sklearn.linear_model import SGDRegressor, LinearRegression
import numpy as np
import time
# Simulating a large dataset
n_samples = 1_000_000
X = np.random.randn(n_samples, 10)
y = X @ np.random.randn(10) + np.random.randn(n_samples) * 0.1
# SGDRegressor updates the weights one sample at a time, so the cost of
# each update does not depend on n; max_iter is the number of passes (epochs)
start = time.time()
sgd = SGDRegressor(max_iter=5, tol=1e-3)
sgd.fit(X, y)
print(f'SGD time: {time.time() - start:.3f}s')
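# LinearRegression solves ordinary least squares directly. The solution is
# (X^T X)^-1 X^T y, although scikit-learn uses a least-squares solver
# rather than forming the inverse explicitly.
# Cost scales roughly as O(n*d^2 + d^3): fine here, but problematic
# for very high-dimensional data or for n too large to hold in memory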
start = time.time()
lr = LinearRegression().fit(X, y)
print(f'Closed-form time: {time.time() - start:.3f}s')
# partial_fit allows incremental learning on streaming/chunked data —
# impossible with the closed-form or full-batch approach
sgd2 = SGDRegressor()
for chunk_start in range(0, n_samples, 10000):
    chunk = slice(chunk_start, chunk_start + 10000)
    sgd2.partial_fit(X[chunk], y[chunk])
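The loop above slices an array that already sits in memory. As a rough sketch of the streaming case, the version below pulls chunks from a generator so that only one chunk exists at a time. The `stream_chunks` generator, the ground-truth weights, and the chunk sizes are hypothetical stand-ins for a real data source such as files on disk.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def stream_chunks(n_chunks, chunk_size, true_w, seed=0):
    """Hypothetical data source: yields (X, y) chunks one at a time,
    standing in for reads from disk or a network stream."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.standard_normal((chunk_size, true_w.size))
        y = X @ true_w + 0.1 * rng.standard_normal(chunk_size)
        yield X, y

true_w = np.array([1.5, -2.0, 0.5, 3.0])  # illustrative ground-truth weights
model = SGDRegressor(random_state=0)

# Only the current chunk is held in memory; each partial_fit call
# makes a single pass over that chunk and keeps the learned weights.
for X_chunk, y_chunk in stream_chunks(n_chunks=20, chunk_size=10_000, true_w=true_w):
    model.partial_fit(X_chunk, y_chunk)

max_abs_error = np.max(np.abs(model.coef_ - true_w))
print(f"largest coefficient error: {max_abs_error:.4f}")
```

The features here are already on a unit scale. With real data, SGD is sensitive to feature scaling, so standardising each chunk (for example with a scaler that is itself fitted incrementally) is usually worth considering.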
