Python / Data Science Essentials Interview Questions
NumPy (Numerical Python) is the foundational library for scientific computing in Python. At its core it provides the ndarray — an N-dimensional array of a single, fixed data type stored in a contiguous block of memory. That single design decision is the source of almost all of NumPy's performance advantage over Python lists.
Python lists store references to Python objects scattered around the heap. Each arithmetic operation on a list requires Python to look up each object, check its type, extract the value, compute, and then box the result back into a new Python object. A million-element loop pays that overhead a million times.
NumPy sidesteps the overhead in two ways. First, all elements in an ndarray share the same dtype (e.g., float64, int32), so there is no per-element type check and no boxing. Second, NumPy operations are implemented as compiled C (and sometimes Fortran) routines that operate on the raw memory buffer in tight loops — this is called vectorisation. The Python interpreter is invoked once for the whole array, not once per element.
import numpy as np
import time
n = 10_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.float64)
t0 = time.perf_counter()
py_result = [x * 2.5 for x in py_list]
print(f'List loop : {time.perf_counter()-t0:.3f}s')
t0 = time.perf_counter()
np_result = np_arr * 2.5 # vectorised — no Python loop
print(f'NumPy : {time.perf_counter()-t0:.3f}s')
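# Typical ratio: 50x–200x faster for NumPy
Memory is also more compact. A Python integer object takes ~28 bytes, and the list holds an 8-byte pointer to it on top of that; a NumPy int64 element takes exactly 8 bytes. For a million-element array that is the difference between at least 28 MB of integer objects (plus about 8 MB of list pointers) and 8 MB.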
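The memory comparison can be measured directly. This is a rough sketch that assumes a 64-bit CPython build (exact object sizes vary between versions); the variable names are illustrative only.

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list stores pointers; each int is a separate heap object on top of that.
pointer_bytes = sys.getsizeof(py_list)
object_bytes = sum(sys.getsizeof(x) for x in py_list)
list_bytes = pointer_bytes + object_bytes

# An ndarray stores the raw values back to back: n * itemsize bytes.
array_bytes = np_arr.nbytes

print(f"list : {list_bytes / 1e6:.1f} MB")
print(f"array: {array_bytes / 1e6:.1f} MB")
```

sys.getsizeof reports per-object overhead only, so the list figure is an estimate rather than an exact footprint.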
Knowing the idiomatic array-creation functions is a baseline NumPy skill. Each function is designed for a specific situation and picking the right one keeps code readable and avoids unnecessary copies.
import numpy as np
# From Python sequences
a = np.array([1, 2, 3, 4]) # 1-D, dtype inferred (int64)
b = np.array([[1, 2], [3, 4]], dtype=np.float32) # 2-D, explicit dtype
# Pre-filled arrays
np.zeros((3, 4)) # 3×4 array of 0.0
np.ones((2, 2)) # 2×2 array of 1.0
np.full((3, 3), 7) # 3×3 array filled with 7
np.eye(4) # 4×4 identity matrix
# Ranges
np.arange(0, 10, 2) # [0 2 4 6 8] — like range() but returns ndarray
np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1.] — N evenly spaced points
# Random arrays (use default_rng for reproducibility)
rng = np.random.default_rng(seed=42)
rng.random((3, 3)) # uniform [0, 1)
rng.standard_normal(1000) # standard normal distribution
rng.integers(0, 100, size=10) # random ints in [0, 100)
# From existing data without copying
np.asarray([1.0, 2.0, 3.0]) # a list is still copied; the copy is skipped only when the input is already an ndarray of the requested dtype
np.frombuffer(b'\x01\x02\x03', dtype=np.uint8) # read-only view over raw bytes
np.linspace is preferred over np.arange for floating-point ranges because arange with a float step can produce unexpected element counts due to floating-point rounding. linspace guarantees exactly N points.
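A small sketch of the arange/linspace difference. The specific range (1.0 to 1.3 in steps of 0.1) is an illustrative choice, and the arange length is printed rather than asserted because it depends on floating-point rounding.

```python
import numpy as np

# arange derives its length from ceil((stop - start) / step), so rounding
# error in that division can add or drop an element.
stepped = np.arange(1.0, 1.3, 0.1)
print(len(stepped), stepped)

# linspace takes the number of points as an argument instead.
points = np.linspace(1.0, 1.3, 4)
print(len(points), points)
```

With linspace the count and both endpoints are fixed by the call itself, which is why it is the safer choice for float ranges.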
Every NumPy array has a shape attribute — a tuple giving the size along each dimension. Shape is fundamental because most NumPy operations depend on it, and shape mismatches are the most common source of errors in numerical code.
import numpy as np
a = np.arange(24)
print(a.shape) # (24,)
# reshape — change shape without copying data
b = a.reshape(4, 6) # 4 rows, 6 columns
c = a.reshape(2, 3, 4) # 3-D: 2 blocks of 3×4
# -1 means 'infer this dimension'
d = a.reshape(6, -1) # (6, 4) — NumPy works out the 4
print(b.shape) # (4, 6)
print(b.ndim) # 2
print(b.size) # 24 — total number of elements
# Axes: axis=0 is rows (down), axis=1 is columns (across)
m = np.array([[1, 2, 3],
[4, 5, 6]])
print(m.sum(axis=0)) # [5 7 9] — sum down each column
print(m.sum(axis=1)) # [6 15] — sum across each row
print(m.sum()) # 21 — grand total
# Flatten and ravel
m.flatten() # always returns a copy
m.ravel() # returns view if possible (faster)
A view shares memory with the original array: modifying the view modifies the original. reshape usually returns a view; flatten always returns a copy. Use np.shares_memory(a, b) to check.
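The view/copy distinction is easy to verify. A minimal sketch, with a small array chosen purely for illustration:

```python
import numpy as np

a = np.arange(6)
view = a.reshape(2, 3)     # same buffer, new shape
copy = view.flatten()      # independent buffer

view[0, 0] = 99            # writes through to `a`

print(a[0])                          # 99
print(copy[0])                       # 0
print(np.shares_memory(a, view))     # True
print(np.shares_memory(a, copy))     # False
```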
Broadcasting is the set of rules NumPy uses to perform element-wise operations on arrays of different but compatible shapes, without physically copying data to make them the same size. It is one of the most powerful and often misunderstood NumPy features.
The rules, applied dimension by dimension starting from the trailing (rightmost) axis:
- If the arrays have different numbers of dimensions, prepend 1s to the shape of the smaller-dimensional array.
- Dimensions of size 1 are stretched to match the other array's size in that dimension.
- If the sizes in any dimension differ and neither is 1, a ValueError is raised.
import numpy as np
# Scalar broadcast over array
a = np.array([1, 2, 3])
print(a * 10) # [10 20 30]
# (2, 3) and (2, 1): subtract each row's minimum from every element of that row
matrix = np.array([[10, 20, 30],
[40, 50, 60]])
row_min = matrix.min(axis=1, keepdims=True) # shape (2, 1)
normalised = matrix - row_min # broadcasts: (2,3) - (2,1) -> (2,3)
print(normalised)
# [[ 0 10 20]
# [ 0 10 20]]
# Outer product via broadcasting
col = np.array([[1], [2], [3]]) # shape (3, 1)
row = np.array([10, 20, 30]) # shape (3,) -> treated as (1, 3)
print(col * row)
# [[10 20 30]
# [20 40 60]
# [30 60 90]]
Broadcasting avoids the memory cost of explicit np.tile or np.repeat calls. The stretched values are never physically written; NumPy just iterates as if they were. For large arrays this can mean the difference between fitting in RAM and running out of memory.
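The broadcasting rules and the "never physically written" claim can both be inspected from Python. This sketch assumes NumPy 1.20 or later (for np.broadcast_shapes); the shapes are arbitrary examples.

```python
import numpy as np

# Ask NumPy what shape a broadcast would produce, without doing the math.
print(np.broadcast_shapes((2, 3), (2, 1)))   # (2, 3)
print(np.broadcast_shapes((3, 1), (3,)))     # (3, 3)

# Incompatible trailing dimensions (3 vs 2) raise ValueError.
try:
    np.ones((2, 3)) + np.ones(2)
    raised = False
except ValueError:
    raised = True
print(raised)                                 # True

# broadcast_to exposes the trick: a stride of 0 re-reads the same row.
row = np.arange(3)
stretched = np.broadcast_to(row, (1000, 3))
print(stretched.shape, stretched.strides[0])  # (1000, 3) 0
print(np.shares_memory(row, stretched))       # True
```

The zero stride along the first axis means all 1000 "rows" point at the same three stored values.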
Beyond basic integer indexing, NumPy supports two advanced selection mechanisms that are essential for data-cleaning and filtering tasks.
Boolean masking: A comparison on an array produces a boolean array of the same shape. Passing that boolean array back as an index selects only the True positions.
import numpy as np
scores = np.array([88, 45, 72, 91, 60, 33, 95])
# Boolean mask
mask = scores >= 70
print(mask) # [True False True True False False True]
passing = scores[mask]
print(passing) # [88 72 91 95]
# Compound conditions
mid_range = scores[(scores >= 60) & (scores < 90)]
print(mid_range) # [88 72 60] — use & | ~ not and/or
# Assign through a mask
scores[scores < 50] = 50 # clamp low scores to 50
print(scores) # [88 50 72 91 60 50 95]
# np.where — vectorised if/else
grades = np.where(scores >= 70, 'Pass', 'Fail')
print(grades) # ['Pass' 'Fail' 'Pass' 'Pass' 'Fail' 'Fail' 'Pass']
Fancy indexing: Pass an integer array (or list) as an index to select arbitrary elements in any order. Unlike slicing, fancy indexing always returns a copy, not a view.
data = np.array([10, 20, 30, 40, 50])
idx = np.array([4, 1, 4, 0]) # can repeat indices
print(data[idx]) # [50 20 50 10]
# 2-D fancy indexing
m = np.arange(16).reshape(4, 4)
rows = [0, 2]; cols = [1, 3]
print(m[rows, cols]) # m[0,1] and m[2,3]: [1 11]
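The copy-versus-view behaviour, and the difference between paired indices and a row/column cross product, can be sketched as follows. np.ix_ is not mentioned in the text above; it is added here as one common way to select a full sub-block.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])
sliced = data[1:4]          # basic slice -> view
fancy = data[[1, 2, 3]]     # integer-array index -> copy

sliced[0] = -1              # changes data[1]
fancy[1] = -2               # leaves data untouched
print(data)                 # [10 -1 30 40 50]

m = np.arange(16).reshape(4, 4)
pairs = m[[0, 2], [1, 3]]            # elements (0,1) and (2,3)
block = m[np.ix_([0, 2], [1, 3])]    # rows 0,2 crossed with cols 1,3
print(pairs.tolist())                # [1, 11]
print(block.tolist())                # [[1, 3], [9, 11]]
```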
NumPy ships a comprehensive set of universal functions (ufuncs) — compiled, vectorised operations that apply element-wise across the full array without Python loops. Knowing these avoids writing slow manual loops for standard computations.
import numpy as np
a = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
# Element-wise math
np.sqrt(a) # [1. 2. 3. 4. 5.]
np.log(a) # natural log
np.log2(a) # base-2 log
np.log10(a) # base-10 log
np.exp(a) # e^x
np.abs(np.array([-3, 4, -1])) # [3 4 1]
# Aggregation
a.sum() # 55.0
a.mean() # 11.0
a.std() # standard deviation
a.var() # variance
a.min(); a.max() # extremes
a.argmin(); a.argmax() # INDEX of min/max
np.median(a) # 9.0
np.percentile(a, 75) # 75th percentile
# Linear algebra
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
np.dot(A, B) # matrix multiplication (also A @ B in Python 3.5+)
np.linalg.inv(A) # matrix inverse
np.linalg.det(A) # determinant
vals, vecs = np.linalg.eig(A) # eigenvalues and eigenvectors
# Sorting
unsorted = np.array([3, 1, 4, 1, 5])
np.sort(unsorted) # returns sorted copy: [1 1 3 4 5]
np.argsort(unsorted) # indices that would sort: [1 3 0 2 4]
A Pandas DataFrame is a two-dimensional, labelled data structure — think of it as a spreadsheet or a SQL table in memory. Rows and columns both have labels (the index and the column names), and each column can hold a different data type. A Series is the single-column equivalent.
| Feature | NumPy ndarray | Pandas DataFrame |
|---|---|---|
| Dimensions | N-dimensional | Always 2-D (rows × columns) |
| Data type | Single dtype per array | Each column has its own dtype |
| Labels | Integer positions only | Named row index + column headers |
| Missing values | No native support (use np.nan) | First-class NaN / NaT / pd.NA |
| Primary use | Numerical computation | Tabular data: ETL, analysis, SQL-like ops |
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol'],
'age': [30, 25, 35],
'salary': [95000.0, 72000.0, np.nan],
})
print(df.dtypes)
# name object
# age int64
# salary float64
print(df.shape) # (3, 3)
print(df.index) # RangeIndex(start=0, stop=3, step=1)
print(df.columns) # Index(['name', 'age', 'salary'], dtype='object')
DataFrames are built on top of NumPy arrays: each column is essentially a NumPy array wrapped with extra metadata. When computation speed is paramount you often drop down to df.to_numpy() (the recommended accessor; df.values is the older equivalent) to get the raw array and run NumPy operations on it.
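A short sketch of dropping from a DataFrame to a raw array. It reuses the three-row frame from the example; selecting only the numeric columns before converting is a choice made here to keep a numeric dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [30, 25, 35],
    'salary': [95000.0, 72000.0, np.nan],
})

# Select the numeric columns first; mixing in 'name' would force dtype=object.
numeric = df[['age', 'salary']].to_numpy()
print(numeric.dtype, numeric.shape)     # float64 (3, 2)

# NaN-aware NumPy reductions skip the missing salary.
mean_salary = np.nanmean(numeric[:, 1])
print(mean_salary)                      # 83500.0
```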
Pandas has a family of pd.read_* functions that handle virtually every common data format. Getting data in is usually the first step of any data science workflow, so these functions deserve close attention.
import pandas as pd
# --- CSV ---
df = pd.read_csv('sales.csv')
# Common options:
df = pd.read_csv(
'sales.csv',
sep=';', # custom delimiter (semicolon, tab, etc.)
header=0, # row to use as column names (0 = first row)
index_col='order_id', # use this column as the row index
usecols=['date', 'amount', 'region'], # read only these columns
dtype={'amount': 'float32'}, # explicit dtype
parse_dates=['date'], # auto-parse date strings
na_values=['N/A', '--', ''], # treat as NaN
nrows=1000, # read only first 1000 rows (useful for large files)
encoding='utf-8',
)
# --- Excel ---
df_xl = pd.read_excel('report.xlsx', sheet_name='Q1', skiprows=2)
# --- JSON ---
df_j = pd.read_json('data.json', orient='records')
# orient='records' expects [{...}, {...}] — the common API response shape
# --- SQL ---
import sqlite3
conn = sqlite3.connect('mydb.sqlite')
df_sql = pd.read_sql('SELECT * FROM orders WHERE amount > 100', conn)
# Always inspect after reading
print(df.shape)
print(df.head())
print(df.dtypes)
df.info() # prints non-null counts per column (it returns None, so no print() is needed)
For very large CSVs that do not fit in memory, pass chunksize=100_000 to read_csv, which then returns an iterator of DataFrames, each containing that many rows. Process and aggregate chunk by chunk without loading the full file.
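Chunked reading can be sketched without a real file. The in-memory CSV text and the tiny chunksize of 4 are stand-ins chosen for illustration; a real pipeline would pass a file path and a much larger chunk size.

```python
import io
import pandas as pd

# Stand-in for a large file on disk: ten rows of in-memory CSV text.
csv_text = "region,amount\n" + "\n".join(
    f"{'East' if i % 2 else 'West'},{i}" for i in range(10)
)

total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Each chunk is an ordinary DataFrame of up to 4 rows.
    total += chunk['amount'].sum()
    n_chunks += 1

print(n_chunks, total)   # 3 45
```

Only one chunk is held in memory at a time, so the running total is the only state that grows with the file.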
This distinction is tested in almost every Pandas interview. The short version: loc selects by label; iloc selects by integer position. They look similar but behave very differently, especially when the DataFrame index is not a default RangeIndex.
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'Dave'],
'score': [88, 72, 95, 61],
'city': ['NYC', 'LA', 'NYC', 'Chicago'],
}, index=[10, 20, 30, 40]) # non-default index!
# --- loc: label-based ---
df.loc[20] # row with index label 20 (Bob)
df.loc[10:30] # rows 10, 20, 30 — INCLUSIVE stop
df.loc[10, 'name'] # single value: 'Alice'
df.loc[[10, 40], ['name', 'score']] # multiple rows and columns
df.loc[df['score'] >= 80] # boolean mask selection
# --- iloc: position-based ---
df.iloc[0] # first row (Alice) — positional 0
df.iloc[0:2] # rows 0 and 1 — EXCLUSIVE stop (like Python slicing)
df.iloc[0, 1] # row 0, column 1: 88
df.iloc[-1] # last row (Dave)
df.iloc[:, 0] # entire first column
# --- [] shorthand ---
df['name'] # single column as Series
df[['name', 'city']] # multiple columns as DataFrame
df[df['score'] > 80] # boolean filtering (rows only)
The classic trap: loc stop is inclusive; iloc stop is exclusive. This asymmetry trips up even experienced developers. When in doubt, prefer explicit loc or iloc over the [] shorthand to avoid ambiguity.
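The inclusive/exclusive asymmetry shows up as soon as the index is not 0..n-1. A minimal sketch using the same non-default index as the example:

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Carol', 'Dave']},
    index=[10, 20, 30, 40],
)

by_label = df.loc[10:30]       # labels 10, 20, 30 (stop included)
by_position = df.iloc[0:2]     # positions 0, 1 (stop excluded)
print(len(by_label), len(by_position))   # 3 2

# Label 0 does not exist, even though position 0 does.
try:
    df.loc[0]
    missing_label = False
except KeyError:
    missing_label = True
print(missing_label)           # True
```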
Missing values are represented in Pandas as NaN (float Not-a-Number from NumPy), NaT (Not-a-Time for datetime columns), or pd.NA (the newer nullable integer/string missing marker). Handling them correctly is the most time-consuming step of real-world data cleaning.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': ['Alice', 'Bob', None, 'Dave'],
'age': [30, np.nan, 35, 28],
'score': [88, 72, np.nan, np.nan],
})
# --- Detection ---
df.isnull() # boolean DataFrame — True where NaN
df.isnull().sum() # count NaNs per column
df.isnull().sum() / len(df) * 100 # % missing per column
df.notnull() # inverse of isnull
# --- Dropping ---
df.dropna() # drop rows with ANY NaN
df.dropna(how='all') # drop rows where ALL values are NaN
df.dropna(subset=['age']) # drop rows with NaN only in 'age'
df.dropna(axis=1, thresh=3) # drop columns with fewer than 3 non-NaN values
# --- Filling ---
df['score'].fillna(df['score'].mean()) # fill with column mean
df['score'].ffill() # forward fill (propagate last valid); fillna(method='ffill') is deprecated since pandas 2.1
df['score'].bfill() # backward fill
df.fillna({'age': 0, 'name': 'Unknown'}) # column-specific fills
# --- Interpolation ---
df['score'].interpolate(method='linear') # linear interpolation between values
# --- Replace specific sentinel values ---
df.replace(-999, np.nan) # treat -999 as missing
Choosing between dropping and filling requires domain knowledge. Dropping rows is acceptable when missing data is rare (a common rule of thumb is below about 5%) and appears randomly. Filling with mean/median is common for numerical features; filling with mode or a sentinel ('Unknown') is common for categoricals. For time series, forward fill carries the last observed value forward, so it never uses information from the future.
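The strategies give visibly different results on the same gap. A small sketch on a hypothetical four-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.fillna(s.mean()).tolist())   # [1.0, 2.5, 2.5, 4.0]
print(s.ffill().tolist())            # [1.0, 1.0, 1.0, 4.0]
print(s.interpolate().tolist())      # [1.0, 2.0, 3.0, 4.0]
print(s.dropna().tolist())           # [1.0, 4.0]
```

Mean filling flattens the gap, forward fill repeats the last observation, and linear interpolation assumes a steady trend between the known points.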
Row filtering is one of the most frequent DataFrame operations. Pandas provides several syntaxes, each with different readability and performance trade-offs.
import pandas as pd
df = pd.DataFrame({
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
'revenue': [120, 85, 200, 55, 140],
'category': ['A', 'B', 'A', 'C', 'B'],
})
# Boolean indexing — most common
df[df['revenue'] > 100]
# Compound conditions — use & | ~ (not and/or)
df[(df['city'] == 'NYC') & (df['revenue'] > 100)]
# isin — membership test
df[df['city'].isin(['NYC', 'LA'])]
# between — inclusive range
df[df['revenue'].between(80, 150)]
# str methods for text filtering
df[df['city'].str.startswith('N')]
df[df['city'].str.contains('C', case=False)]
# query() — string-based, readable for complex conditions
df.query('city == "NYC" and revenue > 100')
df.query('revenue > @threshold', local_dict={'threshold': 100})
# @ prefix references a Python variable inside query string
# filter() — filter columns or index labels (NOT rows by content)
df.filter(like='rev') # columns whose name contains 'rev'
df.filter(regex='^c') # columns starting with 'c'
query() is readable for ad-hoc analysis and can be slightly faster on very large DataFrames when the optional numexpr engine is installed, because compound conditions are evaluated without materialising every intermediate boolean array. However, it does not support all Python expressions and can be harder to debug. For production pipelines, explicit boolean indexing is easier to test and debug.
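The two filtering styles select the same rows. A minimal sketch, with `threshold` as an illustrative local variable picked up by the @ prefix:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
    'revenue': [120, 85, 200, 55, 140],
})
threshold = 100

via_mask = df[(df['city'] == 'NYC') & (df['revenue'] > threshold)]
via_query = df.query('city == "NYC" and revenue > @threshold')

print(via_mask.equals(via_query))   # True
print(via_query.index.tolist())     # [0, 2]
```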
GroupBy is the Pandas implementation of the split-apply-combine pattern: split the DataFrame into groups by one or more column values, apply an aggregation or transformation to each group, and combine the results into a new DataFrame. It is the primary tool for summary statistics on tabular data.
import pandas as pd
sales = pd.DataFrame({
'region': ['East','East','West','West','East','West'],
'product': ['A','B','A','B','A','A'],
'revenue': [100, 200, 150, 80, 120, 90],
'units': [10, 20, 15, 8, 12, 9],
})
# Single-column groupby with single aggregation
sales.groupby('region')['revenue'].sum()
# East    420
# West    320
# Multiple aggregations on one column
sales.groupby('region')['revenue'].agg(['sum', 'mean', 'count', 'std'])
# Different aggregations per column
sales.groupby('region').agg(
total_revenue=('revenue', 'sum'),
avg_units =('units', 'mean'),
num_orders =('revenue', 'count'),
)
# Multi-column groupby
sales.groupby(['region', 'product'])['revenue'].sum()
# transform — returns same-length Series aligned with original index
# (useful for adding group statistics back as a new column)
sales['region_total'] = sales.groupby('region')['revenue'].transform('sum')
# filter — keep only groups satisfying a condition
big_regions = sales.groupby('region').filter(lambda g: g['revenue'].sum() > 400)
transform vs agg: agg reduces each group to a scalar, returning a smaller DataFrame; transform keeps the original shape, broadcasting the group result back to each row. Use transform when you want to add a group statistic as a feature column without losing row-level detail.
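The shape difference between agg and transform can be seen side by side. This sketch uses a trimmed copy of the sales data; the `share` column is a hypothetical feature added for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'revenue': [100, 200, 150, 80, 120, 90],
})

per_region = sales.groupby('region')['revenue'].agg('sum')       # one row per group
per_row = sales.groupby('region')['revenue'].transform('sum')    # one row per input row
print(len(per_region), len(per_row))    # 2 6

# Group total as a feature: each order's share of its region's revenue.
sales['share'] = sales['revenue'] / per_row
print(sales.groupby('region')['share'].sum().round(6).tolist())  # [1.0, 1.0]
```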
Real-world data lives in multiple tables. Pandas merge() implements SQL-style joins, and concat() stacks DataFrames. Choosing the right join type prevents silently losing or duplicating rows.
| how= | Keeps rows from | Missing matches become |
|---|---|---|
| 'inner' | Only keys present in both DataFrames (intersection) | Nothing: unmatched rows are dropped, so no NaN is introduced |
| 'left' | All of left, matched from right | NaN in right columns |
| 'right' | All of right, matched from left | NaN in left columns |
| 'outer' | Both (union) | NaN on whichever side has no match |
import pandas as pd
customers = pd.DataFrame({
'cust_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Carol', 'Dave'],
})
orders = pd.DataFrame({
'order_id': [101, 102, 103],
'cust_id': [1, 2, 9], # cust_id 9 has no match; Dave has no order
'amount': [200, 150, 80],
})
# Inner join — only rows that match in both
pd.merge(customers, orders, on='cust_id', how='inner') # 2 rows
# Left join — all customers, NaN where no order
pd.merge(customers, orders, on='cust_id', how='left') # 4 rows
# Different key names in each table
pd.merge(customers, orders,
left_on='cust_id', right_on='cust_id') # same here, but shows syntax
# Merge on index
pd.merge(customers.set_index('cust_id'), orders,
left_index=True, right_on='cust_id')
# concat — stack vertically (rows) or horizontally (columns)
# df1 and df2 are placeholders for any two DataFrames
pd.concat([df1, df2], axis=0, ignore_index=True) # stack rows
pd.concat([df1, df2], axis=1) # add columns side by side
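One way to catch silently lost or duplicated rows is to audit the join. The sketch below uses two merge() options not covered above, indicator= and validate=, on the same customers/orders tables.

```python
import pandas as pd

customers = pd.DataFrame({'cust_id': [1, 2, 3, 4],
                          'name': ['Alice', 'Bob', 'Carol', 'Dave']})
orders = pd.DataFrame({'order_id': [101, 102, 103],
                       'cust_id': [1, 2, 9],
                       'amount': [200, 150, 80]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
audit = pd.merge(customers, orders, on='cust_id', how='outer', indicator=True)
counts = audit['_merge'].value_counts()
print(counts['both'], counts['left_only'], counts['right_only'])   # 2 2 1

# validate= raises MergeError if the key relationship is not what you expect.
checked = pd.merge(customers, orders, on='cust_id', how='left',
                   validate='one_to_many')
print(len(checked))   # 4
```

Here the audit shows two matched customers, two customers without orders (Carol and Dave), and one order whose cust_id has no customer.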
apply() runs a Python function on every row or column of a DataFrame. It is the most flexible transformation tool in Pandas but also the slowest because it falls back to a Python-level loop under the hood.
import pandas as pd
import numpy as np
df = pd.DataFrame({'price': [10.5, 20.0, 8.75, 35.0],
'qty': [3, 5, 2, 1 ]})
# --- Slow: apply with a Python lambda ---
df['revenue'] = df.apply(lambda row: row['price'] * row['qty'], axis=1)
# --- Fast: vectorised arithmetic (always prefer this) ---
df['revenue'] = df['price'] * df['qty']
# --- apply on a single column (Series.apply) ---
df['price_cat'] = df['price'].apply(lambda x: 'high' if x > 20 else 'low')
# --- Faster alternative: np.where ---
df['price_cat'] = np.where(df['price'] > 20, 'high', 'low')
# --- Multi-condition: np.select ---
conditions = [df['price'] > 25, df['price'] > 15, df['price'] > 0]
choices = ['premium', 'mid', 'budget']
df['tier'] = np.select(conditions, choices, default='unknown')
# When apply is genuinely needed:
# — calling a function that returns a list/dict/Series per row
# — complex multi-column logic that cannot be expressed as vectorised ops
df.apply(lambda row: pd.Series({'x': row['price']+1, 'y': row['qty']*2}), axis=1)
Performance hierarchy for transformations (fastest to slowest): vectorised arithmetic > NumPy ufuncs > df.eval() string expressions > map() on a Series > apply() > explicit Python for-loop. Use apply only when no vectorised alternative exists; for simple conditions always use np.where or boolean indexing instead.
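The gap between apply and vectorised arithmetic can be timed on synthetic data. The row count and random columns below are arbitrary assumptions, and the timings are printed rather than quoted because they depend on the machine.

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({'price': rng.uniform(1, 50, n),
                   'qty': rng.integers(1, 10, n)})

t0 = time.perf_counter()
via_apply = df.apply(lambda row: row['price'] * row['qty'], axis=1)
t_apply = time.perf_counter() - t0

t0 = time.perf_counter()
via_vector = df['price'] * df['qty']
t_vector = time.perf_counter() - t0

print(f'apply     : {t_apply:.4f}s')
print(f'vectorised: {t_vector:.4f}s')
print(np.allclose(via_apply, via_vector))   # True: same numbers either way
```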
pd.pivot_table reshapes and aggregates a DataFrame simultaneously, producing a cross-tabulation — exactly like a spreadsheet pivot table. It is the go-to function for producing summary reports broken down by two categorical dimensions.
import pandas as pd
sales = pd.DataFrame({
'region': ['East','East','West','West','East','West','West'],
'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q1', 'Q2'],
'product': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
'revenue': [100, 120, 90, 110, 80, 70, 95],
})
# Basic pivot: total revenue by region (rows) and quarter (columns)
pt = pd.pivot_table(
sales,
values='revenue',
index='region',
columns='quarter',
aggfunc='sum', # sum, mean, count, np.median, list, ...
fill_value=0, # replace NaN with 0
margins=True, # add row/column totals (labelled 'All' unless margins_name is set)
margins_name='Total',
)
print(pt)
# quarter Q1 Q2 Total
# region
# East 180 120 300
# West 160 205 365
# Total 340 325 665
# Multiple values and multiple aggregations
pd.pivot_table(sales, values='revenue', index='region',
columns='product', aggfunc=['sum', 'count'])
The inverse operation, converting a wide pivot back to long form, is pd.melt(). df.stack() and df.unstack() do similar reshape operations on the index levels directly.
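A sketch of the wide-to-long round trip on the same sales data. Note that melt returns the aggregated cells in long form, not the original seven rows, since the pivot has already summed them.

```python
import pandas as pd

sales = pd.DataFrame({
    'region':  ['East', 'East', 'West', 'West', 'East', 'West', 'West'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q1', 'Q2'],
    'revenue': [100, 120, 90, 110, 80, 70, 95],
})

wide = sales.pivot_table(values='revenue', index='region',
                         columns='quarter', aggfunc='sum')

# melt: one row per (region, quarter) pair again.
long = wide.reset_index().melt(id_vars='region', var_name='quarter',
                               value_name='revenue')
print(long.shape)                # (4, 3)
print(long['revenue'].sum())     # 665

# stack does the same reshape but keeps region/quarter in a MultiIndex.
stacked = wide.stack()
print(stacked.loc[('West', 'Q2')])   # 205
```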
Pandas exposes string methods through the .str accessor on object-dtype Series. These operations are vectorised over the whole column — no explicit loop needed — and handle NaN values gracefully (they propagate as NaN rather than raising an error).
import pandas as pd
df = pd.DataFrame({'name': [' Alice Smith ', 'bob jones', 'CAROL LEE', None],
'email': ['alice@corp.com', 'BOB@CORP.COM', 'carol@other.org', None]})
# Case normalisation
df['name'].str.strip().str.title() # 'Alice Smith', 'Bob Jones', 'Carol Lee', NaN
# Split into multiple columns
df[['first', 'last']] = df['name'].str.strip().str.split(' ', expand=True)
# Contains / startswith / endswith
df[df['email'].str.endswith('@corp.com', na=False)]
# Extract patterns with regex
df['domain'] = df['email'].str.extract(r'@(.+)$', expand=False) # captures text after @; expand=False returns a Series
# Replace with regex
df['email'].str.lower().str.replace(r'[^a-z0-9@._]', '', regex=True)
# Count occurrences
df['name'].str.count('l') # 1, 0, 0, NaN (case-sensitive: 'CAROL LEE' has no lowercase 'l')
# Length
df['name'].str.len()
# Padding / justification
pd.Series(['42', '1234']).str.zfill(6) # zero-pad to width 6: '000042', '001234'
df['name'].str.ljust(20, '-') # left-justify, pad with dashes
The na=False argument in methods like str.contains and str.startswith is important: without it, NaN values produce NaN in the boolean mask, which causes issues in filtering. Passing na=False returns False for NaN rows, keeping them out of the filtered result cleanly.
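The effect of na=False can be sketched on a three-element Series. The explicit dtype=object is an assumption made here so the example behaves the same across pandas versions with different default string dtypes.

```python
import pandas as pd

# dtype=object is set explicitly so the example does not depend on the
# default string dtype of the installed pandas version.
emails = pd.Series(['alice@corp.com', None, 'carol@other.org'], dtype=object)

raw_mask = emails.str.endswith('@corp.com')             # missing entry stays missing
safe_mask = emails.str.endswith('@corp.com', na=False)  # missing entry becomes False

print(safe_mask.tolist())           # [True, False, False]
print(emails[safe_mask].tolist())   # ['alice@corp.com']
```

The safe mask is a plain boolean Series, so it can be used for indexing or combined with & and | without special handling of missing entries.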
Time-series data is everywhere in data science — sales by day, sensor readings by second, user activity by hour. Pandas has first-class datetime support built on NumPy's datetime64 type and Python's datetime module.
import pandas as pd
df = pd.DataFrame({
'date_str': ['2024-01-15', '2024-02-20', '2024-03-05'],
'value': [100, 200, 150],
})
# Parse string dates — always specify format for speed and correctness
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d')
# Extract components via .dt accessor
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name() # 'Monday', 'Tuesday', ...
df['quarter'] = df['date'].dt.quarter
# Date arithmetic
df['days_since'] = (pd.Timestamp.today() - df['date']).dt.days
df['next_month'] = df['date'] + pd.DateOffset(months=1)
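# Set as index for time-series resampling (keep only the numeric column so sum/mean are well defined)
ts = df.set_index('date')[['value']]
ts.resample('ME').sum() # sum by month ('ME' = month end; use 'M' on pandas < 2.2)
ts.resample('W').mean() # mean by week
ts.resample('QE').agg({'value': ['sum', 'count']}) # quarterly stats (use 'Q' on pandas < 2.2)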
# Filtering date ranges
df[df['date'] >= '2024-02-01']
df[df['date'].between('2024-01-01', '2024-03-01')]
Always parse dates explicitly with format= rather than relying on format inference (the infer_datetime_format flag is deprecated since pandas 2.0): inference is slower and occasionally wrong for ambiguous formats like 01/02/03. For production pipelines, parse at read time using parse_dates=['date_col'] in pd.read_csv.
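The ambiguity of a string like 01/02/03 is easy to demonstrate: the same text yields three different dates depending on the format passed. The three format strings below are illustrative choices.

```python
import pandas as pd

raw = '01/02/03'

day_first = pd.to_datetime(raw, format='%d/%m/%y')
month_first = pd.to_datetime(raw, format='%m/%d/%y')
year_first = pd.to_datetime(raw, format='%y/%m/%d')

print(day_first.date())     # 2003-02-01
print(month_first.date())   # 2003-01-02
print(year_first.date())    # 2001-02-03
```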
Matplotlib is Python's foundational plotting library, originally modelled after MATLAB's plotting API. Many other Python visualisation tools, including Seaborn and Pandas .plot(), wrap Matplotlib or use it as their rendering backend.
Understanding the object hierarchy is essential for customising plots beyond the defaults:
| Object | What it is | Created by |
|---|---|---|
| Figure | The entire canvas / window | plt.figure() or plt.subplots() |
| Axes | One coordinate system (plot area) inside a Figure | fig.add_subplot() or plt.subplots() |
| Axis | The X or Y axis of an Axes (note: Axes ≠ Axis) | Exists on every Axes |
| Artist | Every visible element — lines, patches, text, legends | plot(), bar(), text(), etc. |
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 2 * np.pi, 300)
# Object-oriented interface (recommended for complex plots)
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, np.sin(x), label='sin(x)', color='steelblue', linewidth=2)
ax.plot(x, np.cos(x), label='cos(x)', color='tomato', linestyle='--')
ax.set_title('Sine and Cosine', fontsize=14)
ax.set_xlabel('x (radians)')
ax.set_ylabel('Amplitude')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 2 * np.pi)
fig.tight_layout() # prevent label clipping
plt.savefig('trig.png', dpi=150, bbox_inches='tight')
plt.show()
The pyplot (plt.*) interface is a state-machine shorthand that implicitly manages the current Figure and Axes. It is convenient for quick interactive plots but problematic in scripts and notebooks that create multiple figures. Use the object-oriented fig, ax = plt.subplots() style for anything beyond a single simple chart.
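A sketch of why explicit Axes help once more than one figure exists. The helper `plot_wave` is hypothetical, and the Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')   # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np

def plot_wave(ax, func, title):
    """Draw func(x) on the Axes that is passed in; no hidden 'current' state."""
    x = np.linspace(0, 2 * np.pi, 200)
    ax.plot(x, func(x))
    ax.set_title(title)
    return ax

fig1, ax1 = plt.subplots()
fig2, ax2 = plt.subplots()

# Order of calls does not matter: each one targets an explicit Axes.
plot_wave(ax2, np.cos, 'cos')
plot_wave(ax1, np.sin, 'sin')

print(ax1.get_title(), ax2.get_title())   # sin cos
plt.close('all')
```

With plt.plot() and plt.title() the same calls would all land on whichever figure was created last.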
Choosing the right chart type communicates data clearly; choosing the wrong one obscures it. Here are the workhorses of exploratory data analysis:
import matplotlib.pyplot as plt
import numpy as np
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
# 1. Line chart — trends over time or ordered x-axis
ax = axes[0, 0]
ax.plot([1, 2, 3, 4], [10, 15, 13, 18])
ax.set_title('Line: trends')
# 2. Bar chart — comparing discrete categories
ax = axes[0, 1]
ax.bar(['A', 'B', 'C'], [30, 45, 20])
ax.set_title('Bar: categories')
# 3. Scatter plot — relationship between two continuous variables
ax = axes[0, 2]
x = np.random.randn(100); y = x * 0.8 + np.random.randn(100) * 0.5
ax.scatter(x, y, alpha=0.5, c='steelblue')
ax.set_title('Scatter: correlation')
# 4. Histogram — distribution of one continuous variable
ax = axes[1, 0]
ax.hist(np.random.randn(1000), bins=30, color='salmon', edgecolor='white')
ax.set_title('Histogram: distribution')
# 5. Box plot — distribution summary with outliers
ax = axes[1, 1]
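ax.boxplot([np.random.randn(100) for _ in range(3)])
ax.set_xticklabels(['G1', 'G2', 'G3']) # works across versions; the labels= keyword was renamed tick_labels in Matplotlib 3.9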
ax.set_title('Box: spread & outliers')
# 6. Heatmap via imshow — 2-D matrix data (e.g., correlation matrix)
ax = axes[1, 2]
data = np.random.rand(4, 4)
im = ax.imshow(data, cmap='viridis')
plt.colorbar(im, ax=ax)
ax.set_title('Heatmap: 2-D matrix')
fig.tight_layout()
plt.show()
Rule of thumb: line for temporal/ordered data, bar for nominal comparisons, scatter for two-variable relationships, histogram for single-variable distributions, box for group comparisons with outlier context, heatmap for correlation matrices and confusion matrices.
Multi-panel figures are standard in data science reports — comparing multiple variables or time periods side by side. Matplotlib provides several ways to arrange subplots.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 200)
# --- Regular grid ---
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
# sharex=True links x-axis zoom/pan across all subplots
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title('sin')
axes[0, 1].plot(x, np.cos(x), color='tomato')
axes[0, 1].set_title('cos')
axes[1, 0].plot(x, np.tan(x))
axes[1, 0].set_title('tan')
axes[1, 1].set_visible(False) # hide unused subplot
fig.suptitle('Trig functions', fontsize=16)
fig.tight_layout(rect=[0, 0, 1, 0.95]) # leave room for suptitle
# --- Flatten for iteration (df is assumed to be an existing DataFrame with numeric columns) ---
fig, axes = plt.subplots(2, 3, figsize=(14, 6))
for ax, col in zip(axes.flatten(), df.select_dtypes('number').columns):
ax.hist(df[col].dropna(), bins=20)
ax.set_title(col)
# --- GridSpec for irregular layouts ---
from matplotlib.gridspec import GridSpec
fig = plt.figure(figsize=(12, 6))
gs = GridSpec(2, 3, figure=fig)
ax1 = fig.add_subplot(gs[0, :2]) # spans first two columns of row 0
ax2 = fig.add_subplot(gs[0, 2]) # third column of row 0
ax3 = fig.add_subplot(gs[1, :]) # entire row 1
axes.flatten() is the standard idiom when you want to loop over a 2-D grid of Axes objects as if they were a 1-D list. fig.tight_layout() automatically adjusts spacing to prevent labels overlapping between subplots; call it before plt.show() or fig.savefig().
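A self-contained version of the flatten idiom. The random five-column DataFrame is hypothetical data, and the Agg backend is an assumption so the sketch runs headless; hiding the leftover panel is an extra step added here.

```python
import matplotlib
matplotlib.use('Agg')   # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: five numeric columns to spread over a 2x3 grid.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list('abcde'))

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
flat = axes.flatten()

for ax, col in zip(flat, df.columns):
    ax.hist(df[col], bins=20)
    ax.set_title(col)

# zip stops at the shorter input, so hide whatever Axes were left over.
for ax in flat[len(df.columns):]:
    ax.set_visible(False)

fig.tight_layout()
visible = sum(ax.get_visible() for ax in flat)
print(axes.shape, visible)   # (2, 3) 5
```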
Seaborn is a high-level statistical visualisation library built on top of Matplotlib. Where Matplotlib gives you full control over every pixel, Seaborn provides opinionated, attractive defaults and plot types designed specifically for statistical exploration — with far less boilerplate code.
| Aspect | Matplotlib | Seaborn |
|---|---|---|
| Level | Low-level — explicit control | High-level — declarative |
| Defaults | Functional but plain | Publication-quality themes out of the box |
| DataFrame integration | Manual (extract arrays) | Direct: pass data=df and column names |
| Statistical plots | Manual calculation required | Built-in (regression, KDE, violin, pair) |
| Customisation | Unlimited | Matplotlib calls needed for fine-tuning |
import seaborn as sns
import matplotlib.pyplot as plt
# Load a built-in example dataset
tips = sns.load_dataset('tips')
# Seaborn: one line to create a scatter with a regression line and confidence band
sns.regplot(data=tips, x='total_bill', y='tip')
# Matplotlib equivalent would require:
# 1. Compute regression manually
# 2. Plot scatter
# 3. Plot fitted line
# 4. Shade confidence interval — ~15 lines total
# Themes and contexts
sns.set_theme(style='whitegrid', context='notebook', palette='muted')
# styles: darkgrid, whitegrid, dark, white, ticks
# contexts: paper, notebook, talk, poster (scale font/line sizes)
Seaborn plots return Matplotlib Axes objects, so all standard Matplotlib customisation still applies after the Seaborn call: ax = sns.scatterplot(...); ax.set_title('My Title'). Seaborn does not replace Matplotlib; it is a complement that handles the tedious parts of statistical plotting.
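The "returns an Axes" point can be checked in a few lines. The four-row `bills` frame is a hypothetical stand-in for the tips dataset so that nothing needs to be downloaded, and the Agg backend is assumed for headless use.

```python
import matplotlib
matplotlib.use('Agg')   # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the tips dataset, so no download is needed.
bills = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0, 40.0],
                      'tip': [1.5, 3.0, 4.0, 6.5]})

fig, ax = plt.subplots()
returned = sns.scatterplot(data=bills, x='total_bill', y='tip', ax=ax)

# Seaborn hands back the same Matplotlib Axes it drew on.
returned.set_title('Tip vs bill')
returned.set_xlabel('Total bill ($)')
print(returned is ax, ax.get_title())   # True Tip vs bill
```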
Seaborn divides its plots into relational (relationship between variables), distributional (distribution of a single variable), and categorical (comparison across categories). Knowing when to use each makes EDA far more efficient.
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
# --- Relational ---
# Scatter with colour encoding
sns.scatterplot(data=tips, x='total_bill', y='tip',
hue='smoker', size='size', palette='Set1')
# Regression line + scatter
sns.regplot(data=tips, x='total_bill', y='tip', ci=95)
# --- Distributional ---
# Histogram + KDE
sns.histplot(data=tips, x='total_bill', hue='sex', kde=True, bins=20)
# KDE only
sns.kdeplot(data=tips, x='total_bill', hue='sex', fill=True)
# ECDF — empirical cumulative distribution
sns.ecdfplot(data=tips, x='total_bill', hue='day')
# --- Categorical ---
# Box plot
sns.boxplot(data=tips, x='day', y='total_bill', hue='smoker', palette='pastel')
# Violin — box + KDE combined
sns.violinplot(data=tips, x='day', y='tip', inner='quartile')
# Bar chart with error bars (95% CI by default)
sns.barplot(data=tips, x='day', y='tip', estimator='mean', errorbar='ci')
# Strip plot — all individual points
sns.stripplot(data=tips, x='day', y='tip', jitter=True, alpha=0.4)
# --- Multi-variable overview ---
# Pair plot — scatter matrix of all numeric column pairs
sns.pairplot(tips, hue='sex', diag_kind='kde')
# Heatmap — great for correlation matrices
corr = tips.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
A correlation heatmap is one of the first plots every data scientist makes on a new dataset. It shows the Pearson (or other) correlation coefficient between every pair of numeric features as a colour-coded grid, immediately revealing which variables move together and which do not.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load example dataset
df = sns.load_dataset('penguins').select_dtypes('number')
# Compute correlation matrix
corr = df.corr() # Pearson by default; method='spearman' for ranked
print(corr)
# --- Basic heatmap ---
fig, ax = plt.subplots(figsize=(7, 5))
sns.heatmap(
corr,
annot=True, # show values inside each cell
fmt='.2f', # 2 decimal places
cmap='coolwarm', # blue = negative, red = positive
vmin=-1, vmax=1, # fix colour scale to [-1, 1]
linewidths=0.5, # add grid lines between cells
ax=ax,
)
ax.set_title('Feature Correlation Matrix')
fig.tight_layout()
# --- Mask upper triangle (remove redundancy) ---
import numpy as np
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', vmin=-1, vmax=1)
Interpreting the output: values close to +1 mean strong positive linear correlation (both variables increase together), values close to -1 mean strong negative correlation (one increases as the other decreases), and values near 0 indicate little to no linear relationship. The diagonal is always 1.0 (a variable is perfectly correlated with itself). Masking the upper triangle removes the mirror image and makes the chart less cluttered.
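The phrase "little to no linear relationship" matters: a Pearson value near 0 does not rule out a strong non-linear dependence. The sketch below uses made-up columns (x, x_squared, exp_x are illustrative names, not from any dataset in this article) to show the difference between Pearson and Spearman on curved relationships.

```python
import numpy as np
import pandas as pd

# Hypothetical data: x is symmetric around 0, so y = x**2 depends on x
# perfectly but has no linear trend.
x = np.linspace(-1, 1, 201)
demo = pd.DataFrame({
    'x': x,
    'x_squared': x ** 2,      # non-linear and non-monotonic in x
    'exp_x': np.exp(3 * x),   # non-linear but monotonic in x
})

pearson = demo.corr()                     # linear association
spearman = demo.corr(method='spearman')   # rank (monotonic) association

r_square = pearson.loc['x', 'x_squared']
r_exp = pearson.loc['x', 'exp_x']
rho_exp = spearman.loc['x', 'exp_x']

print(r_square)   # close to 0 despite full dependence
print(r_exp)      # below 1 because the relationship is curved
print(rho_exp)    # 1 up to rounding, because the ranks agree exactly
```

When a heatmap cell is near zero, a quick scatter plot of that pair is a cheap way to check whether the variables are truly unrelated or just not linearly related.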
FacetGrid is Seaborn's mechanism for trellis/small-multiples plots — the same chart repeated across different subsets of the data, defined by one or more categorical columns. It is one of Seaborn's most powerful features for exploring interaction effects between variables.
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
# --- FacetGrid manually ---
g = sns.FacetGrid(tips, col='time', row='sex', height=3, aspect=1.2)
g.map_dataframe(sns.histplot, x='total_bill', bins=15, kde=True)
g.add_legend()
g.set_titles(col_template='{col_name} service', row_template='Sex: {row_name}')
g.set_axis_labels('Total Bill ($)', 'Count')
# --- Figure-level functions (wrap FacetGrid automatically) ---
# relplot — relational
sns.relplot(data=tips, x='total_bill', y='tip',
col='smoker', hue='sex', kind='scatter', height=4)
# displot — distributional
sns.displot(data=tips, x='total_bill',
col='sex', row='time', kind='kde', fill=True)
# catplot — categorical
sns.catplot(data=tips, x='day', y='tip',
            col='sex', kind='violin', height=5, aspect=0.8)
The figure-level functions (relplot, displot, catplot) return a FacetGrid object, not an Axes. To customise them after creation you call FacetGrid methods like g.set_titles(), g.set_axis_labels(), or iterate over g.axes.flatten() to access individual Axes objects and apply standard Matplotlib customisation.
Descriptive statistics summarise the central tendency, spread, and shape of a dataset. Pandas df.describe() is the starting point for any exploratory analysis, but knowing the individual methods gives you more precise control.
import pandas as pd
import numpy as np
df = pd.read_csv('housing.csv')
# --- df.describe() ---
# Numeric columns: count, mean, std, min, 25%, 50%, 75%, max
df.describe()
# Include object columns too
df.describe(include='all')
# --- Individual statistics ---
df['price'].mean() # arithmetic mean
df['price'].median() # 50th percentile — robust to outliers
df['price'].mode()[0] # most frequent value (returns Series)
df['price'].std() # standard deviation (ddof=1 by default)
df['price'].var() # variance
df['price'].skew() # skewness: >0 right-skewed, <0 left-skewed
df['price'].kurt() # excess kurtosis (0 = normal dist)
df['price'].quantile(0.90) # 90th percentile
df['price'].quantile([0.25, 0.5, 0.75]) # multiple quantiles
# IQR — interquartile range (robust measure of spread)
Q1, Q3 = df['price'].quantile(0.25), df['price'].quantile(0.75)
IQR = Q3 - Q1
# Outlier detection via IQR fence
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['price'] < lower) | (df['price'] > upper)]
print(f'{len(outliers)} outliers detected ({len(outliers)/len(df)*100:.1f}%)')
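DataFrames loaded from CSV often use unnecessarily large dtypes: 64-bit integers for values that fit in 8 bits, or the generic object dtype for repeated string categories. Choosing smaller dtypes can often reduce memory by 4–8×, enabling analysis of larger datasets within available RAM. Integer downcasting and category conversion are lossless; float32 keeps only about 7 significant digits, so confirm that precision is sufficient before downcasting floats.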
import pandas as pd
import numpy as np
df = pd.read_csv('large.csv')
print(f'Memory before: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB')
# --- Integer downcasting ---
for col in df.select_dtypes('int64').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
# downcast picks the smallest of int8, int16, int32 that fits the value range
# --- Float downcasting ---
for col in df.select_dtypes('float64').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')  # float32
# --- Categorical: object columns with low cardinality ---
# If unique values are < 5% of the row count, Categorical saves memory
for col in df.select_dtypes('object').columns:
    n_unique = df[col].nunique()
    if n_unique / len(df) < 0.05:  # less than 5% cardinality
        df[col] = df[col].astype('category')
print(f'Memory after : {df.memory_usage(deep=True).sum() / 1e6:.1f} MB')
# Categorical also speeds up groupby on low-cardinality columns
# because grouping enumerates integers rather than comparing strings
The Categorical dtype stores repeated strings as integer codes internally: a column with 5 unique city names in a million-row dataset stores one integer per row rather than one full string per row. This speeds up groupby, sort_values, and value_counts in addition to saving memory.
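To make the integer-code storage concrete, here is a small sketch with a made-up city column (the city names and row count are illustrative assumptions). It inspects the codes and categories through the .cat accessor and compares memory before and after the conversion.

```python
import pandas as pd

# Hypothetical column: 5 city names repeated over 100,000 rows.
cities = ['Lagos', 'Lima', 'Oslo', 'Pune', 'Quito']
as_object = pd.Series(cities * 20_000, dtype='object')
as_category = as_object.astype('category')

print(as_category.cat.categories.tolist())   # the 5 distinct labels, stored once
print(as_category.cat.codes.dtype)           # small integer dtype for the per-row codes
print(as_category.cat.codes.head().tolist())

mem_object = as_object.memory_usage(deep=True)
mem_category = as_category.memory_usage(deep=True)
print(f'object: {mem_object / 1e6:.2f} MB, category: {mem_category / 1e6:.2f} MB')
```

The saving shrinks as cardinality grows, which is why the loop above only converts columns where unique values are a small fraction of the rows.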
Reproducibility is a core requirement of data science — experiments, train/test splits, and simulations must produce the same result every run so that results can be verified and shared. NumPy's random number generation is the building block for all of this.
import numpy as np
# --- Legacy API (still common in older code) ---
np.random.seed(42)
np.random.rand(3) # [0.374, 0.951, 0.732] — same every time
# --- Modern API: Generator (preferred since NumPy 1.17) ---
rng = np.random.default_rng(seed=42)
# A Generator carries its own state (no hidden global) and has better statistical properties
rng.random(5) # uniform [0, 1)
rng.standard_normal(5) # N(0, 1)
rng.normal(loc=170, scale=10, size=1000) # N(mean, std)
rng.integers(0, 100, size=10) # random ints in [0, 100)
rng.choice(['a','b','c'], size=5, replace=True) # random sampling
arr = np.arange(10) # example array to shuffle
rng.shuffle(arr) # in-place shuffle
rng.permutation(arr) # shuffled copy
# --- Distributions used in ML simulations ---
rng.binomial(n=10, p=0.3, size=100) # number of successes in n trials
rng.poisson(lam=5, size=100) # events per interval
rng.exponential(scale=2, size=100) # time between Poisson events
rng.uniform(low=0, high=10, size=100) # uniform distribution
The modern Generator API (np.random.default_rng) is preferred over np.random.seed because the generator is a first-class object you can pass around rather than hidden global state, each thread or worker can be given its own independent generator, and it defaults to the PCG64 algorithm, which passes more statistical tests than the Mersenne Twister used by the legacy API.
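A minimal sketch of what "a first-class object you can pass around" looks like in practice. The bootstrap_mean helper is hypothetical and only serves to show a function that receives its generator as an argument; the second half derives independent child streams with SeedSequence.spawn, one common way to seed parallel workers.

```python
import numpy as np

def bootstrap_mean(data, rng, n_resamples=200):
    """Hypothetical helper: bootstrap estimates of the mean using the rng passed in."""
    idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
    return data[idx].mean(axis=1)

data = np.arange(50, dtype=float)

# Same seed, same stream: reproducible without touching any global state.
run_a = bootstrap_mean(data, np.random.default_rng(42))
run_b = bootstrap_mean(data, np.random.default_rng(42))
print(np.array_equal(run_a, run_b))

# Independent child streams for parallel workers, derived from one root seed.
children = np.random.SeedSequence(42).spawn(3)
worker_rngs = [np.random.default_rng(s) for s in children]
draws = [r.random(3) for r in worker_rngs]
print(draws)
```

Passing rng explicitly also makes tests simpler, because a test can hand the function a generator with a known seed.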
Categorical columns are understood by counting their frequencies and cross-tabulating them against other variables. These two tools answer the questions 'what values exist and how often?' and 'how are two categorical variables related?'
import pandas as pd
tips = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
# --- value_counts ---
tips['day'].value_counts()
# Sat 87, Sun 76, Thur 62, Fri 19 (sorted by count, descending)
tips['day'].value_counts(normalize=True).round(3)
# proportions: Sat 0.357, Sun 0.311, Thur 0.254, Fri 0.078
tips['day'].value_counts(dropna=False) # includes NaN count if any
# Count unique values
tips['day'].nunique() # 4
# Histogram of numeric with bins
pd.cut(tips['total_bill'], bins=5).value_counts().sort_index()
# --- pd.crosstab ---
# Frequency cross-table: how many smokers vs non-smokers per day
ct = pd.crosstab(tips['day'], tips['smoker'])
# smoker No Yes
# day
# Fri 4 15
# Sat 45 42
# Sun 57 19
# Thur 45 17
# Proportions within rows (what % of each day are smokers)
pd.crosstab(tips['day'], tips['smoker'], normalize='index').round(3)
# With aggregation (mean tip by day and smoker)
pd.crosstab(tips['day'], tips['smoker'],
values=tips['tip'], aggfunc='mean').round(2)
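The normalize argument changes which question the table answers, which is easy to mix up. The sketch below uses a tiny made-up survey table (plan and churned are hypothetical columns) so the numbers can be checked by hand.

```python
import pandas as pd

# Hypothetical survey data, small enough to verify manually.
survey = pd.DataFrame({
    'plan':    ['free', 'free', 'free', 'pro', 'pro', 'pro', 'pro', 'team'],
    'churned': ['yes',  'no',   'no',   'no',  'no',  'yes', 'no',  'no'],
})

counts = pd.crosstab(survey['plan'], survey['churned'], margins=True)
row_share = pd.crosstab(survey['plan'], survey['churned'], normalize='index')
col_share = pd.crosstab(survey['plan'], survey['churned'], normalize='columns')

print(counts)               # raw counts plus an 'All' row and column
print(row_share.round(2))   # each row sums to 1: churn rate within a plan
print(col_share.round(2))   # each column sums to 1: plan mix among churners
```

normalize='index' answers "what share of each plan churned?", while normalize='columns' answers "which plans do the churners come from?".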
The default Matplotlib style is functional but plain. For presentations and reports you need publication-quality output — chosen colour palettes, correct font sizes, no chart junk, and lossless or high-resolution raster output.
import matplotlib.pyplot as plt
import numpy as np
# --- Using a style sheet ---
plt.style.use('seaborn-v0_8-whitegrid') # clean grid background
# Other useful styles: 'ggplot', 'fivethirtyeight', 'bmh', 'dark_background'
print(plt.style.available) # list all available styles
# --- Common appearance tweaks via rcParams ---
plt.rcParams.update({
'font.size': 12,
'axes.labelsize': 13,
'axes.titlesize': 14,
'legend.fontsize': 11,
'figure.dpi': 100,
'lines.linewidth': 2,
})
# --- Figure construction ---
fig, ax = plt.subplots(figsize=(8, 5))
x = np.linspace(0, 10, 200)
ax.plot(x, np.sin(x), color='#2E86AB', label='sin(x)')
ax.fill_between(x, np.sin(x), 0, alpha=0.15, color='#2E86AB')
ax.axhline(0, color='black', linewidth=0.8, linestyle='--')
ax.set_title('Sine Wave with Fill', pad=12)
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
ax.legend(loc='upper right')
ax.spines[['top', 'right']].set_visible(False) # remove chart junk
fig.tight_layout()
# --- Saving ---
fig.savefig('output.png', dpi=300, bbox_inches='tight') # raster
fig.savefig('output.pdf', bbox_inches='tight') # vector
fig.savefig('output.svg', bbox_inches='tight') # web/edit
Use bbox_inches='tight' whenever saving; it prevents axis labels being clipped at the edges. For publications use PDF or SVG (vector formats that scale without pixelation). For web and slides, PNG at 150–300 DPI is standard.
np.where is NumPy's vectorised if/else for arrays. In its three-argument form it returns a new array built element-by-element: where the condition is True, use values from x; where False, use values from y. It is the correct alternative to writing a Python loop with an if-statement inside.
import numpy as np
scores = np.array([88, 45, 72, 91, 60, 33, 95])
# Classify into Pass / Fail without a loop
labels = np.where(scores >= 70, 'Pass', 'Fail')
# ['Pass' 'Fail' 'Pass' 'Pass' 'Fail' 'Fail' 'Pass']
# Apply a discount: over 80 gets 20% off, rest gets 5% off
prices = np.array([100.0, 200.0, 50.0, 150.0])
discounted = np.where(prices > 80, prices * 0.80, prices * 0.95)
# [ 80. 160. 47.5 120. ]
# Chain multiple conditions using np.select
conditions = [
scores >= 90,
(scores >= 70) & (scores < 90),
scores < 70,
]
choices = ['A', 'B', 'C']
grades = np.select(conditions, choices, default='F')
# ['B' 'C' 'B' 'A' 'C' 'C' 'A']
# One-argument form: returns indices where condition is True
failing_indices = np.where(scores < 70)
# (array([1, 4, 5]),) — tuple of index arrays
failing_scores = scores[failing_indices]
# [45 60 33]
np.select generalises np.where to multiple conditions: the first matching condition wins. Use it whenever you have more than two output categories; chaining nested np.where calls quickly becomes unreadable.
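Because the first matching condition wins, the conditions passed to np.select may overlap as long as they are ordered from most to least specific. A short sketch, reusing the scores array from the example above, with the nested np.where equivalent for comparison:

```python
import numpy as np

scores = np.array([88, 45, 72, 91, 60, 33, 95])

# Conditions overlap on purpose: a score of 91 satisfies both tests,
# and np.select keeps the choice for the first one that is True.
conditions = [scores >= 90, scores >= 70]
choices = ['A', 'B']
grades = np.select(conditions, choices, default='C')
print(grades)

# The same logic written as nested np.where calls.
nested = np.where(scores >= 90, 'A', np.where(scores >= 70, 'B', 'C'))
print(np.array_equal(grades, nested))
```

Reversing the order of the two conditions would label every score of 70 or more as 'B', so the ordering is part of the logic.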
Method chaining is the style of writing data transformations as a single expression where each step's result is the input to the next. It avoids creating intermediate variables, reads like a pipeline, and makes the data flow explicit from top to bottom.
import numpy as np
import pandas as pd
# --- Without chaining (intermediate variables) ---
df1 = pd.read_csv('raw.csv')
df2 = df1.rename(columns={'rev': 'revenue'})  # rename first so 'revenue' exists
df3 = df2.dropna(subset=['revenue'])
df4 = df3[df3['revenue'] > 0]
df5 = df4.assign(log_revenue=lambda d: np.log1p(d['revenue']))
result = df5.groupby('region')['log_revenue'].mean()
# --- With method chaining ---
result = (
    pd.read_csv('raw.csv')
    .rename(columns={'rev': 'revenue'})
    .dropna(subset=['revenue'])
    .query('revenue > 0')
    .assign(log_revenue=lambda d: np.log1p(d['revenue']))
    .groupby('region')['log_revenue']
    .mean()
)
# --- df.pipe() for custom functions ---
def remove_outliers(df, col, n_std=3):
    mean, std = df[col].mean(), df[col].std()
    return df[(df[col] - mean).abs() < n_std * std]
def add_rank(df, col):
    df = df.copy()
    df['rank'] = df[col].rank(ascending=False)
    return df
result = (
pd.read_csv('raw.csv')
.pipe(remove_outliers, col='revenue')
.pipe(add_rank, col='revenue')
)
# pipe passes the DataFrame as the first argument to the function
df.pipe(func, *args, **kwargs) calls func(df, *args, **kwargs), inserting the DataFrame at the front of the argument list. This lets you write standalone functions and use them inline in a method chain without breaking the fluent style.
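The equivalence between df.pipe(func, ...) and func(df, ...) can be checked directly. The sketch below uses two hypothetical helpers (clip_column and summarise) and also shows the tuple form of pipe, which handles functions that do not take the DataFrame as their first argument.

```python
import pandas as pd

def clip_column(df, col, lower, upper):
    """Hypothetical step: return a copy with one column clipped to [lower, upper]."""
    return df.assign(**{col: df[col].clip(lower, upper)})

def summarise(by, df, col):
    """Hypothetical step that does NOT take the DataFrame first."""
    return df.groupby(by)[col].mean()

sales = pd.DataFrame({
    'region': ['N', 'N', 'S', 'S'],
    'revenue': [10.0, 500.0, 20.0, 40.0],
})

piped = sales.pipe(clip_column, 'revenue', lower=0, upper=100)
direct = clip_column(sales, 'revenue', lower=0, upper=100)
print(piped.equals(direct))

# Tuple form: (callable, keyword) names the argument that receives the DataFrame.
means = (
    sales
    .pipe(clip_column, 'revenue', 0, 100)
    .pipe((summarise, 'df'), by='region', col='revenue')
)
print(means)
```

Keeping each step a pure function that returns a new DataFrame, as clip_column does with assign, avoids surprising mutations of the caller's data.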
EDA is the first thing you do with a new dataset before any modelling. The goal is to understand the data's structure, quality, and relationships, and to spot problems (wrong dtypes, missing values, outliers, data leakage) before they propagate into a model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Load and inspect
df = pd.read_csv('housing.csv')
print(df.shape) # (rows, cols)
print(df.dtypes) # types per column
print(df.head()) # first 5 rows
df.info() # dtypes + non-null counts (prints directly and returns None)
print(df.describe()) # summary stats for numeric cols
# 2. Missing value audit
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])
# 3. Duplicate rows
print(df.duplicated().sum())
df = df.drop_duplicates()
# 4. Distribution of each numeric column
df.select_dtypes('number').hist(bins=30, figsize=(16, 10))
plt.tight_layout(); plt.show()
# 5. Correlation heatmap
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix'); plt.show()
# 6. Target variable distribution
target = 'price'
sns.histplot(df[target], kde=True)
print(f'Skewness: {df[target].skew():.2f}')
# 7. Categorical breakdown
for col in df.select_dtypes('object').columns:
    print(df[col].value_counts())
# 8. Outlier detection
for col in df.select_dtypes('number').columns:
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    n_out = ((df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)).sum()
    if n_out > 0:
        print(f'{col}: {n_out} outliers')
EDA is iterative: findings in step 4 send you back to step 2, and insights in the correlation matrix raise questions answered by group analysis. Keep a notebook with your observations alongside the code so you and your team can understand what was found and why certain preprocessing decisions were made.
Combining and splitting arrays is a frequent operation in data preprocessing — assembling feature matrices from multiple sources, or splitting a dataset into folds for cross-validation.
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
# --- Concatenating along existing axes ---
np.concatenate([a, b], axis=0) # stack rows (vertical)
# [[1 2]
# [3 4]
# [5 6]
# [7 8]]
np.concatenate([a, b], axis=1) # stack columns (horizontal)
# [[1 2 5 6]
# [3 4 7 8]]
# --- Convenience stacking functions ---
np.vstack([a, b]) # vertical stack — same as axis=0
np.hstack([a, b]) # horizontal stack — same as axis=1 for 2-D
np.dstack([a, b]) # depth stack (creates a 3rd axis)
# stack — creates a NEW axis (different from concatenate!)
np.stack([a, b], axis=0) # shape (2, 2, 2)
np.stack([a, b], axis=2) # shape (2, 2, 2) — depth
# --- Splitting ---
big = np.arange(12).reshape(6, 2)
parts = np.vsplit(big, 3) # split into 3 equal arrays along axis 0
# [array([[0, 1], [2, 3]]), array([[4, 5], [6, 7]]), array([[8, 9], [10, 11]])]
# Split at specific indices
parts = np.split(big, [2, 4], axis=0) # [0:2], [2:4], [4:]
# Tile — repeat an array
np.tile(a, (2, 3)) # repeat a 2 times along rows, 3 times along cols
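The difference between joining along an existing axis and creating a new one is easiest to see from the result shapes. The sketch below reuses a and b from above; the 1-D arrays u and v and the array_split call are added for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# concatenate joins along an axis that already exists: the result stays 2-D.
rows = np.concatenate([a, b], axis=0)
cols = np.concatenate([a, b], axis=1)
print(rows.shape, cols.shape)          # (4, 2) (2, 4)

# stack inserts a new axis at the given position: the result becomes 3-D.
stacked = np.stack([a, b], axis=0)
print(stacked.shape)                   # (2, 2, 2)
same_as_dstack = np.array_equal(np.stack([a, b], axis=2), np.dstack([a, b]))
print(same_as_dstack)

# For 1-D inputs the helpers differ: hstack stays 1-D, vstack builds rows.
u, v = np.array([1, 2, 3]), np.array([4, 5, 6])
print(np.hstack([u, v]).shape)         # (6,)
print(np.vstack([u, v]).shape)         # (2, 3)

# array_split tolerates uneven splits, e.g. 10 items into 3 folds.
folds = np.array_split(np.arange(10), 3)
print([len(f) for f in folds])         # [4, 3, 3]
```

np.split and np.vsplit raise an error when the array cannot be divided evenly, so np.array_split is the safer choice for cross-validation folds of arbitrary size.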
Duplicate rows silently inflate counts, distort means, and can cause data leakage between training and test sets. Pandas provides duplicated() and drop_duplicates() for systematic duplicate management.
import pandas as pd
df = pd.DataFrame({
'order_id': [1, 2, 2, 3, 4, 4],
'product': ['A', 'B', 'B', 'C', 'D', 'D'],
'amount': [100, 200, 200, 150, 80, 90], # last pair differs!
})
# --- Detecting duplicates ---
df.duplicated() # True for repeats after the first occurrence
df.duplicated(keep='last') # True for repeats before the last occurrence
df.duplicated(keep=False) # True for ALL occurrences of a duplicated row
print(df.duplicated().sum()) # count of duplicate rows
# Duplicate check on a subset of columns only
df.duplicated(subset=['order_id', 'product'])
# True where order_id AND product are repeated (ignores amount diff)
# --- Removing duplicates ---
df.drop_duplicates() # removes all but first occurrence
df.drop_duplicates(keep='last') # keeps last occurrence
df.drop_duplicates(keep=False) # removes all occurrences of any duplicate
# Subset-based deduplication — keep first by order_id
df.drop_duplicates(subset=['order_id'], keep='first')
# Sort before deduplicating to control which row is 'first'
# (e.g., keep the highest amount per order)
df.sort_values('amount', ascending=False).drop_duplicates(subset=['order_id'])
When deduplicating on a subset of columns, think carefully about which row to keep. Sorting the DataFrame first (by timestamp, version, or a quality metric) ensures drop_duplicates(keep='first') retains the most appropriate record, not just whatever happened to be first in the file.
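A common variant of the sort-then-deduplicate pattern is keeping the most recent record per key. The sketch below assumes a made-up customer table with an updated_at column; all names and values are illustrative.

```python
import pandas as pd

# Hypothetical customer records with an update timestamp.
records = pd.DataFrame({
    'customer_id': [1, 2, 1, 3, 2],
    'email': ['a@old.example', 'b@example.org', 'a@new.example',
              'c@example.org', 'b@example.org'],
    'updated_at': pd.to_datetime(['2024-01-05', '2024-01-07', '2024-03-01',
                                  '2024-02-10', '2024-01-02']),
})

latest = (
    records
    .sort_values('updated_at')                            # oldest first
    .drop_duplicates(subset='customer_id', keep='last')   # keep newest per customer
    .sort_values('customer_id')
    .reset_index(drop=True)
)
print(latest)
```

Sorting ascending and keeping the last row is equivalent to sorting descending and keeping the first; pick whichever reads more clearly in the chain.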
Colour is one of the most impactful design decisions in a chart. Used correctly it encodes information; used poorly it confuses or misleads. Both Matplotlib and Seaborn give you fine-grained control.
import matplotlib.pyplot as plt
import seaborn as sns
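import numpy as np
# Example data so the snippets below run
x = np.linspace(0, 10, 100)
y = np.sin(x)
matrix = np.random.default_rng(0).random((10, 10))
df = sns.load_dataset('tips')
corr = df.select_dtypes('number').corr()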
# --- Matplotlib colour specifications ---
# Named CSS colours
plt.plot(x, y, color='steelblue')
# Hex string
plt.plot(x, y, color='#2E86AB')
# RGB tuple (values 0-1)
plt.plot(x, y, color=(0.18, 0.52, 0.67))
# Grayscale string
plt.plot(x, y, color='0.5') # 50% grey
# --- Colormaps for continuous data ---
im = plt.imshow(matrix, cmap='viridis') # perceptually uniform
plt.colorbar(im)
# Other good cmaps: 'plasma', 'inferno', 'magma' (sequential)
# 'RdBu', 'coolwarm', 'bwr' (diverging — centred on 0)
# 'tab10', 'Set1', 'Set2' (categorical)
# --- Seaborn palettes ---
# Categorical (qualitative)
sns.barplot(data=df, x='day', y='tip', palette='Set2')
# Sequential (one colour family)
sns.barplot(data=df, x='day', y='tip', palette='Blues_d')
# Diverging (two colour families around a midpoint)
sns.heatmap(corr, cmap='coolwarm', vmin=-1, vmax=1, center=0)
# Custom palette
custom = ['#E63946', '#457B9D', '#1D3557', '#A8DADC']
sns.barplot(data=df, x='day', y='tip', palette=custom)
# Preview a palette
sns.palplot(sns.color_palette('husl', 8))
Always use perceptually uniform colormaps (viridis, plasma) for continuous data; rainbow/jet maps are misleading because they are not perceptually linear (the eye perceives the yellow band as brighter than the blue or red bands, creating false visual contrast). For diverging data (correlation matrices, residuals) use a diverging colormap centred on zero.
Window functions compute statistics over a sliding or expanding subset of rows, essential for time-series smoothing, trend detection, and feature engineering. Unlike groupby aggregations, window functions return a result for every row, preserving the original index.
import pandas as pd
import numpy as np
ts = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=10, freq='D'),
'sales': [100, 120, 90, 150, 200, 130, 110, 180, 160, 140],
})
ts = ts.set_index('date')
# --- Rolling window (fixed-size, slides one step at a time) ---
ts['ma3'] = ts['sales'].rolling(window=3).mean() # 3-day moving avg
ts['std3'] = ts['sales'].rolling(window=3).std()
ts['min3'] = ts['sales'].rolling(window=3).min()
# First window-1 values are NaN (not enough history)
# min_periods: require fewer observations before computing
ts['ma3_mp'] = ts['sales'].rolling(window=3, min_periods=1).mean()
# --- Expanding window (grows to include all rows so far) ---
ts['cum_max'] = ts['sales'].expanding().max()
ts['cum_mean'] = ts['sales'].expanding().mean()
# --- Exponentially weighted moving average (more weight on recent data) ---
ts['ewma'] = ts['sales'].ewm(span=3).mean()
# --- Lag / shift features (common in time-series forecasting) ---
ts['lag1'] = ts['sales'].shift(1) # yesterday's sales
ts['lag7'] = ts['sales'].shift(7) # last week's sales
ts['pct_change'] = ts['sales'].pct_change() # % change from previous row
Moving averages (rolling mean) smooth out noise to reveal trends. Exponentially weighted moving averages give more influence to recent observations, making them responsive to recent changes while still smoothing. Lag features turn a time-series prediction problem into a supervised learning problem where past values predict future ones.
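To see how an exponentially weighted average weights recent observations, it helps to reproduce it by hand. The sketch below uses a short made-up sales series and the recursive definition that pandas applies when adjust=False; the default adjust=True uses a slightly different weighting for the first rows.

```python
import numpy as np
import pandas as pd

sales = pd.Series([100, 120, 90, 150, 200, 130], dtype=float)
span = 3
alpha = 2 / (span + 1)   # span-to-alpha conversion used by pandas: 0.5 here

# Recursive form: y[t] = (1 - alpha) * y[t-1] + alpha * x[t], with y[0] = x[0].
manual = [sales.iloc[0]]
for value in sales.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * value)
manual = pd.Series(manual)

# adjust=False selects that recursion; the default adjust=True renormalises
# the weights, so the two versions differ most at the start of the series.
recursive = sales.ewm(span=span, adjust=False).mean()
default = sales.ewm(span=span).mean()

print(np.allclose(manual, recursive))
print(pd.DataFrame({'recursive': recursive, 'default': default}).round(2))
```

A smaller span means a larger alpha, so the average reacts faster to new values and smooths less.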
When you have more than one numeric variable, the next step after individual histograms is to understand relationships between pairs. Seaborn's jointplot and pairplot automate this exploration with minimal code.
import seaborn as sns
import matplotlib.pyplot as plt
penguins = sns.load_dataset('penguins').dropna()
# --- jointplot: one pair of variables ---
# Scatter + marginal histograms
sns.jointplot(data=penguins, x='bill_length_mm', y='bill_depth_mm',
hue='species', height=6)
# Regression + 95% confidence interval
sns.jointplot(data=penguins, x='flipper_length_mm', y='body_mass_g',
kind='reg', height=6)
# Hex bins — better than scatter for large datasets with overplotting
sns.jointplot(data=penguins, x='flipper_length_mm', y='body_mass_g',
kind='hex', height=6)
# KDE — smooth 2-D density
sns.jointplot(data=penguins, x='bill_length_mm', y='bill_depth_mm',
kind='kde', fill=True, height=6)
# --- pairplot: all pairs + diagonal histograms ---
# Standard scatter matrix
sns.pairplot(penguins, hue='species',
diag_kind='kde', # diagonal: KDE instead of histogram
plot_kws={'alpha': 0.5}, # semi-transparent points
height=2.5)
plt.suptitle('Penguin Feature Pairs', y=1.02)
plt.show()
# Subset of columns only
cols = ['bill_length_mm', 'flipper_length_mm', 'body_mass_g']
sns.pairplot(penguins[cols + ['species']], hue='species')
Use jointplot when you want to focus deeply on one specific pair of variables with marginal distributions visible. Use pairplot for a broad overview of all pairwise relationships in a dataset with up to ~10 variables; beyond that the grid becomes too small to read meaningfully.
NumPy is fast by default, but a few common mistakes can undermine that speed. Knowing these patterns makes the difference between code that runs in seconds and code that runs in minutes.
import numpy as np
rng = np.random.default_rng(42)
n = 10_000_000
arr = rng.random(n)
# 1. AVOID Python loops: always prefer ufuncs
# Slow:
result = [x**2 for x in arr] # Python loop, ~3s
# Fast:
result = arr ** 2 # NumPy ufunc, ~0.03s
# 2. Pre-allocate output arrays instead of growing them
chunks = np.array_split(arr, 1000)
# Slow:
out = np.array([])
for chunk in chunks:
    out = np.append(out, chunk.sum()) # np.append copies the whole array on every call
# Fast:
out = np.empty(len(chunks))
for i, chunk in enumerate(chunks):
    out[i] = chunk.sum()
# 3. Use views instead of copies when slicing
sub = arr[1000:2000] # view — no memory allocation
sub2 = arr[1000:2000].copy() # explicit copy — only when mutation safety needed
# 4. Choose the right dtype — float32 vs float64
a64 = np.ones(n, dtype=np.float64) # 80 MB
a32 = np.ones(n, dtype=np.float32) # 40 MB — also faster on many ops
# 5. Use out= argument to avoid temporary arrays
np.add(a32, a32, out=a32) # in-place: no temporary intermediate created
# 6. np.einsum for complex multi-dimensional contractions
A = rng.random((100, 200))
B = rng.random((200, 300))
C = np.einsum('ij,jk->ik', A, B) # same result as A @ B, written explicitly; for a plain matrix product A @ B is usually faster
The most impactful optimisation in almost every case is the first: eliminating Python loops. After that, reducing the number of temporary arrays (using out= or in-place operators like +=) and choosing smaller dtypes are the next biggest wins.
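Whether an operation returns a view or a copy, and whether it reuses an existing buffer, can be verified rather than guessed. A small sketch using np.shares_memory and the buffer address exposed by __array_interface__:

```python
import numpy as np

arr = np.arange(10, dtype=np.float64)

view = arr[2:6]             # basic slice: shares memory with arr
copy = arr[2:6].copy()      # independent buffer
fancy = arr[[2, 3, 4, 5]]   # integer-array indexing returns a copy

print(np.shares_memory(arr, view))    # True
print(np.shares_memory(arr, copy))    # False
print(np.shares_memory(arr, fancy))   # False

view[:] = 0                 # writes through to arr
print(arr)

# In-place arithmetic reuses the existing buffer instead of allocating a result.
before = arr.__array_interface__['data'][0]   # address of the data buffer
arr *= 2
np.add(arr, 1, out=arr)
after = arr.__array_interface__['data'][0]
print(before == after)
```

The write-through behaviour of views is the reason to call .copy() whenever a slice will be modified independently of its source.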
After fitting any regression model, visualising the residuals (actual minus predicted values) is mandatory. Patterns in the residuals reveal violations of model assumptions: non-linearity, heteroscedasticity, or non-normality of errors.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Simulate some data with a non-linear relationship
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2 * x + 0.5 * x**2 + rng.normal(0, 3, 200)
df = pd.DataFrame({'x': x, 'y': y})
# 1. Scatter + regression line (with confidence interval)
sns.regplot(data=df, x='x', y='y', ci=95, scatter_kws={'alpha': 0.4})
plt.title('Scatter with OLS Regression Line')
plt.show()
# 2. Residual plot, built into Seaborn (lowess=True needs statsmodels installed)
sns.residplot(data=df, x='x', y='y', lowess=True,
scatter_kws={'alpha': 0.4})
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs x (lowess smoothed trend)')
plt.show()
# A horizontal band around 0 = good; a curve = model is missing non-linearity
# 3. Manual residuals (after sklearn model)
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(df[['x']], df['y'])
df['predicted'] = model.predict(df[['x']])
df['residual'] = df['y'] - df['predicted']
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(df['predicted'], df['residual'], alpha=0.4)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set(xlabel='Fitted Values', ylabel='Residuals',
title='Residuals vs Fitted')
sns.histplot(df['residual'], kde=True, ax=axes[1])
axes[1].set_title('Residual Distribution')
plt.tight_layout(); plt.show()
The two most diagnostic residual plots are: (1) Residuals vs Fitted: should be a random horizontal band; any curve indicates missing predictors or a need for feature transformation. (2) Residual histogram: should be approximately normal; heavy tails suggest outliers or a non-Gaussian error structure.
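The "curve in the residuals" signal can also be checked numerically. The sketch below regenerates data of the same form as the simulation above and compares a straight-line fit with a quadratic fit using np.polyfit; the correlation with a centred squared term is only a crude stand-in for eyeballing the plot, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2 * x + 0.5 * x**2 + rng.normal(0, 3, 200)

# Straight-line fit: the residuals keep the curvature the model missed.
linear_resid = y - np.polyval(np.polyfit(x, y, deg=1), x)

# Quadratic fit: equivalent to adding x**2 as a feature.
quad_resid = y - np.polyval(np.polyfit(x, y, deg=2), x)

# Correlate residuals with a centred x**2 term to detect leftover curvature.
curve = (x - x.mean()) ** 2
lin_corr = np.corrcoef(linear_resid, curve)[0, 1]
quad_corr = np.corrcoef(quad_resid, curve)[0, 1]

print(f'linear fit   : corr with curve = {lin_corr:.2f}, resid std = {linear_resid.std():.2f}')
print(f'quadratic fit: corr with curve = {quad_corr:.2f}, resid std = {quad_resid.std():.2f}')
```

Once the squared term is in the model, the residuals are uncorrelated with it by construction and their spread drops towards the noise level.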
When a CSV is larger than available RAM, loading it with a plain pd.read_csv can raise a MemoryError. Pandas provides three strategies: chunking, selective loading, and dtype optimisation.
import pandas as pd
import numpy as np
# --- Strategy 1: Read only necessary columns and rows ---
df = pd.read_csv(
'big_log.csv',
usecols=['timestamp', 'user_id', 'event', 'amount'], # skip unneeded cols
dtype={'user_id': 'int32', 'amount': 'float32'}, # smaller dtypes
parse_dates=['timestamp'],
nrows=500_000, # read a sample first for exploration
)
# --- Strategy 2: Process in chunks ---
chunk_size = 100_000
results = []
for chunk in pd.read_csv('big_log.csv', chunksize=chunk_size,
                         usecols=['user_id', 'amount']):
    # Process each chunk independently
    summary = chunk.groupby('user_id')['amount'].sum()
    results.append(summary)
# Combine partial results
final = pd.concat(results).groupby(level=0).sum()
# --- Strategy 3: Filter while reading with chunksize ---
high_value_chunks = []
for chunk in pd.read_csv('big_log.csv', chunksize=chunk_size):
    filtered = chunk[chunk['amount'] > 1000]
    high_value_chunks.append(filtered)
high_value_df = pd.concat(high_value_chunks, ignore_index=True)
# --- Alternative: Parquet format (much faster than CSV) ---
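# Convert once (df here is the sample from Strategy 1; a file that does not fit in RAM must be converted chunk by chunk):
df.to_parquet('big_log.parquet', index=False)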
# Then read efficiently — Parquet supports column projection and row filters
import pyarrow.parquet as pq
table = pq.read_table('big_log.parquet',
columns=['user_id', 'amount'],
                      filters=[('amount', '>', 1000)])
For truly large-scale work (tens of GB), consider switching from CSV to Parquet (columnar, compressed, fast column projection) and using Dask or Polars instead of Pandas. Both can build lazy computation graphs (Polars through its lazy API) and process data in pieces rather than loading everything into memory at once.
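The chunked aggregation in Strategy 2 relies on the fact that a sum of partial sums equals the overall sum. The sketch below checks this on a small in-memory CSV (a generated stand-in, not the big_log.csv file above) by comparing the chunked result with a single full load.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = 'user_id,amount\n' + '\n'.join(
    f'{i % 7},{(i * 37) % 1500}' for i in range(1_000)
)

# Reference answer: load everything at once.
full = pd.read_csv(io.StringIO(csv_text)).groupby('user_id')['amount'].sum()

# Chunked answer: aggregate each chunk, then aggregate the partial sums.
partials = [
    chunk.groupby('user_id')['amount'].sum()
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=150)
]
chunked = pd.concat(partials).groupby(level=0).sum()

print(len(partials))
print(chunked.equals(full))
```

This two-stage pattern works directly for sums, counts, minima, and maxima. A mean needs the per-chunk sum and count carried separately and divided at the end, because a mean of chunk means is wrong when chunks contribute different numbers of rows per group.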
Annotations turn a chart into a story — highlighting a key data point, marking a threshold, or labelling significant events on a timeline. Matplotlib provides ax.annotate() for arrow-and-text annotations and ax.text() for free-form text placement.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 200)
y = np.sin(x) * np.exp(-x / 5)
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(x, y, color='steelblue', linewidth=2)
# Find and annotate the maximum
peak_idx = np.argmax(y)
px, py = x[peak_idx], y[peak_idx]
ax.annotate(
f'Peak: ({px:.2f}, {py:.2f})',
xy=(px, py), # point to annotate
xytext=(px + 1.5, py), # where the text goes
arrowprops=dict(
arrowstyle='->',
color='darkred',
lw=1.5,
),
fontsize=11,
color='darkred',
)
# Free-form text label
ax.text(0.5, 0.9, 'Damped oscillation',
transform=ax.transAxes, # axes-relative coords (0–1)
fontsize=12, ha='center',
bbox=dict(boxstyle='round,pad=0.3', fc='lightyellow', ec='grey'))
# Threshold line with label
ax.axhline(y=0.5, color='orange', linestyle='--', linewidth=1)
ax.text(9.5, 0.52, 'threshold=0.5', color='orange', ha='right', fontsize=9)
ax.set(title='Annotated Damped Sine', xlabel='x', ylabel='y')
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout(); plt.show()
The two coordinate systems matter: xy in annotate uses data coordinates by default (values from your actual data range). Passing transform=ax.transAxes to ax.text() switches to axes-fraction coordinates (0,0 = bottom-left, 1,1 = top-right), which is useful for fixed-position labels that stay put when the data range changes.
During EDA you often need to inspect extremes (the highest-revenue customers, the worst-performing products) or draw a random sample for quick analysis. Pandas provides concise methods for each of these.
import pandas as pd
import numpy as np
rng = np.random.default_rng(42)
df = pd.DataFrame({
'product': [f'P{i}' for i in range(100)],
'revenue': rng.integers(1_000, 100_000, 100),
'returns': rng.integers(0, 500, 100),
})
# --- Top and bottom N rows ---
df.nlargest(5, 'revenue') # 5 highest revenue products
df.nsmallest(5, 'revenue') # 5 lowest revenue products
# Multiple columns — break ties by second column
df.nlargest(5, ['revenue', 'returns'])
# --- Random sampling ---
df.sample(n=10, random_state=42) # 10 random rows
df.sample(frac=0.1, random_state=42) # 10% of rows
df.sample(n=10, replace=True) # with replacement (bootstrapping)
# Stratified sample — same proportion from each category
df['tier'] = pd.cut(df['revenue'], bins=3, labels=['low','mid','high'])
stratified = df.groupby('tier', observed=True).sample(frac=0.1, random_state=42)
# --- Head, tail, every Nth row ---
df.head(10) # first 10 rows
df.tail(10) # last 10 rows
df.iloc[::5] # every 5th row, useful for thinning large datasets
nlargest and nsmallest are usually faster than sort_values(...).head(n) on large DataFrames because they select the n extreme rows without fully sorting the data, avoiding the O(N log N) cost of a full sort. Use them whenever you only need the extremes, not a fully sorted result.
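A quick check that nlargest returns the same rows as the sort-then-head approach, plus the keep='all' option for ties at the cut-off. The data is generated for illustration; a permutation is used so that revenues are unique and tie handling cannot affect the comparison.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'product': [f'P{i}' for i in range(1_000)],
    # A permutation gives unique revenues, so ties cannot change the ordering.
    'revenue': rng.permutation(1_000),
})

top_fast = df.nlargest(5, 'revenue')
top_sorted = df.sort_values('revenue', ascending=False).head(5)
print(top_fast.equals(top_sorted))

# keep='all' returns every row tied at the cut-off instead of truncating.
tied = pd.DataFrame({'score': [9, 7, 7, 7, 3]})
n_default = len(tied.nlargest(2, 'score'))
n_all = len(tied.nlargest(2, 'score', keep='all'))
print(n_default, n_all)
```

With tied values, the two approaches can legitimately return different rows, so compare on data with unique keys or use keep='all' when ties matter.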
Linear algebra underpins almost all of machine learning — from computing gradients to PCA to solving systems of equations. NumPy's linalg submodule provides production-grade implementations of the core operations.
import numpy as np
# --- Solving a system of linear equations: Ax = b ---
# 2x + y = 8
# x + 3y = 11
A = np.array([[2, 1], [1, 3]])
b = np.array([8, 11])
x = np.linalg.solve(A, b)
print(x) # [2.6 2.8] — verify: A @ x ≈ b
# --- Matrix decompositions ---
M = np.array([[3, 1], [1, 3]], dtype=float)
# Eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eig(M)
# eigenvalues = [4. 2.], eigenvectors (columns) = principal directions
# Singular Value Decomposition — used in PCA, recommendation systems
X = np.random.default_rng(42).random((100, 5)) # 100 samples, 5 features
X -= X.mean(axis=0) # centre
U, S, Vt = np.linalg.svd(X, full_matrices=False)
# S = singular values (square roots of eigenvalues of X^T X)
# Vt rows = principal components
# Project onto first 2 components:
X_pca = X @ Vt[:2].T # shape (100, 2)
# --- Norms ---
v = np.array([3.0, 4.0])
np.linalg.norm(v) # 5.0 — L2 norm
np.linalg.norm(v, ord=1) # 7.0 — L1 norm
# --- Matrix rank, determinant, inverse ---
np.linalg.matrix_rank(A)
np.linalg.det(A)
np.linalg.inv(A) # only for square non-singular matrices
np.linalg.pinv(A) # Moore-Penrose pseudoinverse; also works for non-square matrices
SVD is the engine behind PCA: the right singular vectors (rows of Vt) are the principal components, and the squared singular values are proportional to the variance each component explains. Using full_matrices=False (economy SVD) is essential for tall matrices; it skips computing the large, unused portions of U.
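The link between singular values and explained variance can be verified against the covariance matrix. The sketch below uses synthetic data with deliberately different feature scales (an assumption for illustration) and checks that S**2 / (n - 1) matches the eigenvalues of the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 100 samples, 5 features with different scales.
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)                     # centre each feature

U, S, Vt = np.linalg.svd(X, full_matrices=False)

n_samples = X.shape[0]
explained_variance = S**2 / (n_samples - 1)
explained_ratio = explained_variance / explained_variance.sum()
print(explained_ratio.round(3))

# Cross-check against the eigenvalues of the sample covariance matrix.
cov_eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending
matches = np.allclose(explained_variance, cov_eigenvalues)
print(matches)
```

The cumulative sum of explained_ratio is the usual basis for choosing how many components to keep.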
Comparing how a numeric variable's distribution differs across groups is one of the most common analytical tasks. Seaborn's categorical plot family gives you progressively more information from left to right: bar (mean only) → box (five-number summary) → violin (full distribution shape) → strip/swarm (individual points).
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Bar plot — mean + 95% CI error bars
sns.barplot(data=tips, x='day', y='tip', hue='sex',
palette='Set2', ax=axes[0, 0])
axes[0, 0].set_title('Mean Tip by Day and Sex')
# Box plot — median, IQR, whiskers, outlier dots
sns.boxplot(data=tips, x='day', y='total_bill', hue='smoker',
palette='pastel', ax=axes[0, 1])
axes[0, 1].set_title('Total Bill Distribution by Day and Smoker')
# Violin plot — box + KDE combined
sns.violinplot(data=tips, x='day', y='tip',
inner='quartile', # show quartile lines inside
palette='muted', ax=axes[1, 0])
axes[1, 0].set_title('Tip Violin by Day')
# Strip + box overlay — all points + summary
sns.boxplot(data=tips, x='time', y='tip', color='lightblue',
ax=axes[1, 1], width=0.4)
sns.stripplot(data=tips, x='time', y='tip', color='navy',
alpha=0.4, jitter=True, ax=axes[1, 1])
axes[1, 1].set_title('Tip by Time — Box + All Points')
plt.tight_layout(); plt.show()
# Figure-level catplot for easy faceting
sns.catplot(data=tips, x='day', y='tip', hue='sex',
col='time', kind='violin', height=5, aspect=0.8)
When to use each: bar plots are fine for comparing means but hide distributional information. Box plots add spread and outliers. Violin plots reveal multi-modality (two bumps suggesting two subgroups within a category). Strip/swarm overlays add the individual points, which matters for small samples (roughly n < 30), where a box plot's quartiles can be misleading.
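To see why a box plot can hide multi-modality, here is a minimal numeric sketch. The two-group sample is synthetic and assumed purely for illustration; it is not the tips dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bimodal sample: two well-separated groups of equal size
sample = np.concatenate([
    rng.normal(2.0, 0.3, 500),
    rng.normal(6.0, 0.3, 500),
])

# What a box plot draws: quartiles and median
q1, median, q3 = np.percentile(sample, [25, 50, 75])

# What a violin's density would reveal: how many points sit near the median
near_median = int(np.sum(np.abs(sample - median) < 0.25))

print(f"Q1={q1:.2f}  median={median:.2f}  Q3={q3:.2f}")
print(f"points within 0.25 of the median: {near_median}")
```

The box would span the region between the two groups and mark a median where almost no data lives, while a violin or strip plot of the same sample would show the gap immediately.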
Combining all three libraries in a coherent pipeline is what data science interviews and take-home assignments test. Below is a realistic miniature pipeline that demonstrates the key integration points.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style='whitegrid', context='notebook')
# --- 1. Load ---
df = pd.read_csv('customer_orders.csv', parse_dates=['order_date'])
# --- 2. Audit ---
df.info() # prints its report directly and returns None, so no print() is needed
print(df.isnull().sum())
print(df.describe())
# --- 3. Clean ---
df = (
df
.drop_duplicates(subset=['order_id'])
.dropna(subset=['customer_id', 'amount'])
.assign(
amount=lambda d: pd.to_numeric(d['amount'], errors='coerce'),
category=lambda d: d['category'].str.strip().str.title().astype('category'),
year=lambda d: d['order_date'].dt.year,
month=lambda d: d['order_date'].dt.month,
)
.dropna(subset=['amount'])
.query('amount > 0')
)
# --- 4. Feature engineering (NumPy) ---
amounts = df['amount'].to_numpy()
df['log_amount'] = np.log1p(amounts) # log1p avoids log(0)
df['amount_zscore'] = (amounts - amounts.mean()) / amounts.std() # NumPy std uses ddof=0; pandas .std() defaults to ddof=1
# --- 5. Aggregate ---
monthly = (
df.groupby(['year', 'month', 'category'], observed=True) # observed=True: skip category combinations that never occur
.agg(total=('amount', 'sum'), orders=('order_id', 'count'))
.reset_index()
)
# --- 6. Visualise ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Revenue distribution by category
sns.boxplot(data=df, x='category', y='log_amount', ax=axes[0])
axes[0].set(title='Log Revenue by Category', xlabel='', ylabel='log(1+amount)')
# Monthly trend
df['period'] = df['order_date'].dt.to_period('M').astype(str)
trend = df.groupby('period')['amount'].sum().reset_index()
axes[1].plot(trend['period'], trend['amount'], marker='o', linewidth=2)
axes[1].tick_params(axis='x', rotation=45)
axes[1].set(title='Monthly Revenue Trend', xlabel='Month', ylabel='Revenue')
plt.tight_layout()
plt.savefig('dashboard.png', dpi=150, bbox_inches='tight')
plt.show()
The key integration patterns here: Pandas for all tabular operations (load, clean, aggregate), NumPy for numerical transformations on raw arrays (.to_numpy() → vectorised ops), and Seaborn/Matplotlib for visualisation. The method-chain style in the cleaning step makes the transformations readable as a pipeline.
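Since customer_orders.csv is not included here, the cleaning chain can be exercised on a small in-memory frame instead. The rows below are hypothetical, made up so that each cleaning step has something to remove; the sketch covers only the clean and feature-engineering steps.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for customer_orders.csv
raw = pd.DataFrame({
    "order_id":    [1, 1, 2, 3, 4, 5],
    "customer_id": ["a", "a", "b", None, "c", "d"],
    "amount":      ["10.5", "10.5", "oops", "7", "-3", "20"],
    "category":    [" books", " books", "toys ", "toys", "Books", "TOYS"],
    "order_date":  pd.to_datetime(
        ["2024-01-05", "2024-01-05", "2024-01-09",
         "2024-02-01", "2024-02-10", "2024-02-15"]),
})

clean = (
    raw
    .drop_duplicates(subset=["order_id"])        # drops the repeated order 1
    .dropna(subset=["customer_id", "amount"])    # drops order 3 (no customer)
    .assign(
        amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
        category=lambda d: d["category"].str.strip().str.title().astype("category"),
    )
    .dropna(subset=["amount"])                   # drops order 2 ("oops" became NaN)
    .query("amount > 0")                         # drops order 4 (negative amount)
)

# NumPy step: vectorised transform on the raw array
amounts = clean["amount"].to_numpy()
clean = clean.assign(log_amount=np.log1p(amounts))

print(clean[["order_id", "amount", "category", "log_amount"]])
```

Running the chain on a toy frame like this is a cheap way to confirm that each step removes exactly the rows you expect before pointing it at real data.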
