
Python / Data Science Essentials Interview Questions

1. What is NumPy and why is it significantly faster than plain Python lists for numerical work?
2. What are the main ways to create NumPy arrays?
3. How do NumPy array shape, reshape, and axis work?
4. What is NumPy broadcasting and how does it work?
5. How does NumPy boolean masking and fancy indexing work?
6. What are the most commonly used NumPy mathematical functions in data science?
7. What is a Pandas DataFrame and how does it differ from a NumPy array?
8. How do you read CSV, Excel, and JSON files into a Pandas DataFrame?
9. What is the difference between df.loc[] and df.iloc[] in Pandas?
10. How do you detect, handle, and fill missing values in a Pandas DataFrame?
11. What are the different ways to filter rows in a Pandas DataFrame?
12. How does Pandas groupby work and what aggregation patterns are most useful?
13. How do you merge and join DataFrames in Pandas, and what do the different join types mean?
14. When should you use df.apply() versus vectorised Pandas operations?
15. How do you use pd.pivot_table to summarise data?
16. How do you perform string operations on Pandas DataFrame columns?
17. How do you work with dates and times in Pandas?
18. What is Matplotlib and what are the key components of a figure?
19. What are the most common chart types in Matplotlib and when do you use each?
20. How do you create multi-panel figures with Matplotlib subplots?
21. What is Seaborn and how does it differ from Matplotlib?
22. What are the most important Seaborn plot types for exploratory data analysis?
23. How do you create and interpret a correlation heatmap with Seaborn?
24. What is Seaborn's FacetGrid and how does it enable multi-panel statistical plots?
25. How do you compute descriptive statistics on a Pandas DataFrame?
26. How do you reduce a Pandas DataFrame's memory usage through dtype optimisation?
27. How do you generate reproducible random data with NumPy?
28. How do you use value_counts() and pd.crosstab() to understand categorical data?
29. How do you style Matplotlib figures and save them for reports?
30. What is np.where and how is it used for conditional array creation?
31. What is Pandas method chaining and how does df.pipe() support it?
32. What does a typical exploratory data analysis (EDA) workflow look like in Python?
33. How do you stack, concatenate, and split NumPy arrays?
34. How do you detect and remove duplicate rows in a Pandas DataFrame?
35. How do you control colours and colour palettes in Matplotlib and Seaborn?
36. How do rolling and expanding window functions work in Pandas?
37. How do Seaborn jointplot and pairplot help explore multivariate relationships?
38. What are the key performance tips when using NumPy for large-scale data processing?
39. How do you visualise regression results and residuals using Seaborn and Matplotlib?
40. How do you process large CSV files that don't fit in memory using Pandas?
41. How do you add annotations and text to Matplotlib charts?
42. How do you quickly extract top/bottom rows and random samples from a Pandas DataFrame?
43. How is NumPy linear algebra used in data science applications?
44. How do you compare distributions across categories using Seaborn categorical plots?
45. How do you build an end-to-end data cleaning and visualisation pipeline with NumPy, Pandas, and Seaborn?

1. What is NumPy and why is it significantly faster than plain Python lists for numerical work?

NumPy (Numerical Python) is the foundational library for scientific computing in Python. At its core it provides the ndarray — an N-dimensional array of a single, fixed data type stored in a contiguous block of memory. That single design decision is the source of almost all of NumPy's performance advantage over Python lists.

Python lists store references to Python objects scattered around the heap. Each arithmetic operation on a list requires Python to look up each object, check its type, extract the value, compute, and then box the result back into a new Python object. A million-element loop pays that overhead a million times.

NumPy sidesteps the overhead in two ways. First, all elements in an ndarray share the same dtype (e.g., float64, int32), so there is no per-element type check and no boxing. Second, NumPy operations are implemented as compiled C (and sometimes Fortran) routines that operate on the raw memory buffer in tight loops — this is called vectorisation. The Python interpreter is invoked once for the whole array, not once per element.

import numpy as np
import time

n = 10_000_000
py_list = list(range(n))
np_arr  = np.arange(n, dtype=np.float64)

t0 = time.perf_counter()
py_result = [x * 2.5 for x in py_list]
print(f'List loop : {time.perf_counter()-t0:.3f}s')

t0 = time.perf_counter()
np_result = np_arr * 2.5   # vectorised — no Python loop
print(f'NumPy     : {time.perf_counter()-t0:.3f}s')
# Typical ratio: 50x–200x faster for NumPy

Memory is also more compact. A small Python integer object takes about 28 bytes on 64-bit CPython, and the list stores an additional 8-byte pointer to it; a NumPy int64 element takes exactly 8 bytes. For a million elements that is roughly 36 MB against 8 MB.
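
The size difference can be measured directly. The sketch below assumes 64-bit CPython; the per-object size reported by sys.getsizeof is an implementation detail and may differ on other interpreters or versions.

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list is an array of pointers; every int is a separate heap object.
list_slots = sys.getsizeof(py_list)                    # pointer array only
int_objects = sum(sys.getsizeof(x) for x in py_list)   # the boxed integers
print(f"list   : ~{(list_slots + int_objects) / 1e6:.0f} MB")

# The ndarray stores raw 8-byte values back to back.
print(f"ndarray: {np_arr.nbytes / 1e6:.0f} MB")        # itemsize * size
```

The nbytes attribute counts only the data buffer; the ndarray header adds a small constant on top of it.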

What is the primary reason NumPy operations are faster than equivalent Python list loops?
How many bytes does a single float64 element occupy in a NumPy array?
2. What are the main ways to create NumPy arrays?

Knowing the idiomatic array-creation functions is a baseline NumPy skill. Each function is designed for a specific situation and picking the right one keeps code readable and avoids unnecessary copies.

import numpy as np

# From Python sequences
a = np.array([1, 2, 3, 4])              # 1-D, dtype inferred (int64 on most platforms)
b = np.array([[1, 2], [3, 4]], dtype=np.float32)  # 2-D, explicit dtype

# Pre-filled arrays
np.zeros((3, 4))          # 3×4 array of 0.0
np.ones((2, 2))           # 2×2 array of 1.0
np.full((3, 3), 7)        # 3×3 array filled with 7
np.eye(4)                 # 4×4 identity matrix

# Ranges
np.arange(0, 10, 2)       # [0 2 4 6 8]  — like range() but returns ndarray
np.linspace(0, 1, 5)      # [0.  0.25 0.5 0.75 1.]  — N evenly spaced points

# Random arrays (use default_rng for reproducibility)
rng = np.random.default_rng(seed=42)
rng.random((3, 3))        # uniform [0, 1)
rng.standard_normal(1000) # standard normal distribution
rng.integers(0, 100, size=10)  # random ints in [0, 100)

# From existing data without copying
np.asarray(a)                 # no copy: a is already an ndarray (a list input is always copied)
np.frombuffer(b'\x01\x02\x03', dtype=np.uint8)  # from raw bytes

np.linspace is preferred over np.arange for floating-point ranges because arange with a float step can produce unexpected element counts due to floating-point rounding. linspace guarantees exactly N points.
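
A small comparison makes the rounding issue concrete. This is an illustrative sketch; the start, stop, and step values are arbitrary choices, and the exact arange count depends on how the division rounds on your platform.

```python
import numpy as np

# Float step: the element count depends on how (stop - start) / step rounds.
stepped = np.arange(1.0, 1.3, 0.1)
print(len(stepped), stepped)     # the count can come out one higher than expected

# linspace: the count is an explicit argument and the endpoint is included.
spaced = np.linspace(1.0, 1.3, 4)
print(len(spaced), spaced)
```

When the number of points matters more than the exact step, stating the count up front removes the ambiguity.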

Which NumPy function creates exactly N evenly spaced values between start and stop inclusive?
What does np.eye(4) create?
3. How do NumPy array shape, reshape, and axis work?

Every NumPy array has a shape attribute — a tuple giving the size along each dimension. Shape is fundamental because most NumPy operations depend on it, and shape mismatches are the most common source of errors in numerical code.

import numpy as np

a = np.arange(24)
print(a.shape)   # (24,)

# reshape — change shape without copying data
b = a.reshape(4, 6)    # 4 rows, 6 columns
c = a.reshape(2, 3, 4) # 3-D: 2 blocks of 3×4
# -1 means 'infer this dimension'
d = a.reshape(6, -1)   # (6, 4)  — NumPy works out the 4

print(b.shape)  # (4, 6)
print(b.ndim)   # 2
print(b.size)   # 24  — total number of elements

# Axes: axis=0 is rows (down), axis=1 is columns (across)
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m.sum(axis=0))  # [5 7 9]   — sum down each column
print(m.sum(axis=1))  # [6 15]    — sum across each row
print(m.sum())        # 21        — grand total

# Flatten and ravel
m.flatten()  # always returns a copy
m.ravel()    # returns view if possible (faster)

A view shares memory with the original array — modifying the view modifies the original. reshape usually returns a view; flatten always returns a copy. Use np.shares_memory(a, b) to check.
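
The view/copy distinction can be checked with a short experiment. The variable names below are illustrative only.

```python
import numpy as np

a = np.arange(6)
v = a.reshape(2, 3)      # view of a contiguous array
f = v.flatten()          # independent copy

v[0, 0] = 99             # write through the view
f[1] = -1                # write to the copy only

print(a)                          # first element is now 99; second is untouched
print(np.shares_memory(a, v))     # True
print(np.shares_memory(a, f))     # False
```

If a function must not modify its input, calling .copy() explicitly is safer than relying on whether a given operation happens to return a view.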

What does m.sum(axis=0) compute for a 2-D array m?
Which reshape argument tells NumPy to infer the size of a dimension automatically?
4. What is NumPy broadcasting and how does it work?

Broadcasting is the set of rules NumPy uses to perform element-wise operations on arrays of different but compatible shapes, without physically copying data to make them the same size. It is one of the most powerful and often misunderstood NumPy features.

The rules, applied dimension by dimension starting from the trailing (rightmost) axis:

  1. If the arrays have different numbers of dimensions, prepend 1s to the shape of the smaller-dimensional array.
  2. Dimensions of size 1 are stretched to match the other array's size in that dimension.
  3. If any dimension neither matches nor is 1, a ValueError is raised.
import numpy as np

# Scalar broadcast over array
a = np.array([1, 2, 3])
print(a * 10)         # [10 20 30]

# (2, 3) minus (2, 1): subtract each row's minimum from every element of that row
matrix = np.array([[10, 20, 30],
                   [40, 50, 60]])
row_min = matrix.min(axis=1, keepdims=True)  # shape (2, 1)
normalised = matrix - row_min  # broadcasts: (2,3) - (2,1) -> (2,3)
print(normalised)
# [[ 0 10 20]
#  [ 0 10 20]]

# Outer product via broadcasting
col = np.array([[1], [2], [3]])  # shape (3, 1)
row = np.array([10, 20, 30])    # shape (3,) -> treated as (1, 3)
print(col * row)
# [[10 20 30]
#  [20 40 60]
#  [30 60 90]]

Broadcasting avoids the memory cost of explicit np.tile or np.repeat calls. The stretched values are never physically written — NumPy just iterates as if they were. For large arrays this can mean the difference between fitting in RAM and running out of memory.
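
The rules can be explored without doing any arithmetic. The following sketch assumes NumPy 1.20 or later, where np.broadcast_shapes is available; the shapes are arbitrary examples.

```python
import numpy as np

# Result shape only, no data allocated
print(np.broadcast_shapes((3, 1), (1, 4)))   # (3, 4)
print(np.broadcast_shapes((2, 3), (3,)))     # (2, 3)

# Incompatible trailing dimensions: 3 vs 4, and neither is 1
try:
    np.ones(3) + np.ones(4)
except ValueError as err:
    print("ValueError:", err)

# broadcast_to returns a read-only view; the stretched axis has stride 0,
# so the same bytes are reused instead of being copied
stretched = np.broadcast_to(np.array([[1], [2], [3]]), (3, 4))
print(stretched.shape, stretched.strides[1])
```

A stride of 0 is how NumPy "repeats" a value along an axis without writing it to memory more than once.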

What shape does NumPy broadcast (3, 1) with (1, 4) to produce?
What does keepdims=True do when passed to an aggregation like np.sum?
5. How does NumPy boolean masking and fancy indexing work?

Beyond basic integer indexing, NumPy supports two advanced selection mechanisms that are essential for data-cleaning and filtering tasks.

Boolean masking: A comparison on an array produces a boolean array of the same shape. Passing that boolean array back as an index selects only the True positions.

import numpy as np

scores = np.array([88, 45, 72, 91, 60, 33, 95])

# Boolean mask
mask = scores >= 70
print(mask)         # [True False True True False False True]
passing = scores[mask]
print(passing)      # [88 72 91 95]

# Compound conditions
mid_range = scores[(scores >= 60) & (scores < 90)]
print(mid_range)    # [88 72 60]  — use & | ~ not and/or

# Assign through a mask
scores[scores < 50] = 50   # clamp low scores to 50
print(scores)       # [88 50 72 91 60 50 95]

# np.where — vectorised if/else
grades = np.where(scores >= 70, 'Pass', 'Fail')
print(grades)       # ['Pass' 'Fail' 'Pass' 'Pass' 'Fail' 'Fail' 'Pass']

Fancy indexing: Pass an integer array (or list) as an index to select arbitrary elements in any order. Unlike slicing, fancy indexing always returns a copy, not a view.

data = np.array([10, 20, 30, 40, 50])
idx  = np.array([4, 1, 4, 0])          # can repeat indices
print(data[idx])   # [50 20 50 10]

# 2-D fancy indexing
m = np.arange(16).reshape(4, 4)
rows = [0, 2]; cols = [1, 3]
print(m[rows, cols])  # m[0,1] and m[2,3]: [1 11]
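
The copy-versus-view behaviour of the two indexing styles can be verified side by side. This is a minimal sketch with illustrative variable names.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])

sliced = data[1:4]          # basic slice: a view
picked = data[[1, 2, 3]]    # fancy index: a copy of the same elements

sliced[0] = -1              # changes data[1]
picked[1] = -2              # leaves data untouched

print(data)                 # [10 -1 30 40 50]
print(np.shares_memory(data, sliced), np.shares_memory(data, picked))
```

Assignment through a fancy index on the left-hand side, such as data[[1, 2]] = 0, still modifies the original; only the result of reading with a fancy index is a copy.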
Why must you use & instead of 'and' when combining NumPy boolean masks?
Does fancy indexing with an integer array return a view or a copy?
6. What are the most commonly used NumPy mathematical functions in data science?

NumPy ships a comprehensive set of universal functions (ufuncs) — compiled, vectorised operations that apply element-wise across the full array without Python loops. Knowing these avoids writing slow manual loops for standard computations.

import numpy as np

a = np.array([1.0, 4.0, 9.0, 16.0, 25.0])

# Element-wise math
np.sqrt(a)          # [1.  2.  3.  4.  5.]
np.log(a)           # natural log
np.log2(a)          # base-2 log
np.log10(a)         # base-10 log
np.exp(a)           # e^x
np.abs(np.array([-3, 4, -1]))  # [3 4 1]

# Aggregation
a.sum()             # 55.0
a.mean()            # 11.0
a.std()             # standard deviation
a.var()             # variance
a.min(); a.max()    # extremes
a.argmin(); a.argmax()  # INDEX of min/max
np.median(a)        # 9.0
np.percentile(a, 75)   # 75th percentile

# Linear algebra
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
np.dot(A, B)        # matrix multiplication (also A @ B in Python 3.5+)
np.linalg.inv(A)    # matrix inverse
np.linalg.det(A)    # determinant
vals, vecs = np.linalg.eig(A)  # eigenvalues and eigenvectors

# Sorting
unsorted = np.array([3, 1, 4, 1, 5])
np.sort(unsorted)   # returns sorted copy: [1 1 3 4 5]
np.argsort(unsorted)  # indices that would sort: [1 3 0 2 4]
What does np.argmax(a) return?
Which operator performs matrix multiplication between two NumPy arrays in Python 3.5+?
7. What is a Pandas DataFrame and how does it differ from a NumPy array?

A Pandas DataFrame is a two-dimensional, labelled data structure — think of it as a spreadsheet or a SQL table in memory. Rows and columns both have labels (the index and the column names), and each column can hold a different data type. A Series is the single-column equivalent.

DataFrame vs NumPy Array
Feature | NumPy ndarray | Pandas DataFrame
Dimensions | N-dimensional | Always 2-D (rows × columns)
Data type | Single dtype per array | Each column has its own dtype
Labels | Integer positions only | Named row index + column headers
Missing values | No native support (use np.nan) | First-class NaN / NaT / pd.NA
Primary use | Numerical computation | Tabular data: ETL, analysis, SQL-like ops
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol'],
    'age':    [30, 25, 35],
    'salary': [95000.0, 72000.0, np.nan],
})
print(df.dtypes)
# name       object
# age         int64
# salary    float64

print(df.shape)   # (3, 3)
print(df.index)   # RangeIndex(start=0, stop=3, step=1)
print(df.columns) # Index(['name', 'age', 'salary'], dtype='object')

DataFrames are built on top of NumPy arrays — each column is essentially a NumPy array wrapped with extra metadata. When computation speed is paramount you often drop down to df.values or df.to_numpy() to get the raw array and run NumPy operations on it.
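
One practical consequence of per-column dtypes shows up when converting to NumPy. The sketch below reuses the example frame; selecting only the numeric columns before conversion is a suggestion, not the only option.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol'],
    'age':    [30, 25, 35],
    'salary': [95000.0, 72000.0, np.nan],
})

mixed = df.to_numpy()                       # strings + numbers: common dtype is object
numeric = df[['age', 'salary']].to_numpy()  # int64 + float64: upcast to float64

print(mixed.dtype, numeric.dtype)
print(np.nanmean(numeric, axis=0))          # column means, ignoring NaN
```

An object array loses most of NumPy's speed advantage, so it is usually worth narrowing to the numeric columns first.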

What is the key structural difference between a Pandas DataFrame and a NumPy 2-D array?
Which method converts a DataFrame to a raw NumPy array?

8. How do you read CSV, Excel, and JSON files into a Pandas DataFrame?

Pandas has a family of pd.read_* functions that handle virtually every common data format. Getting data in is usually the first step of any data science workflow, so these functions deserve close attention.

import pandas as pd

# --- CSV ---
df = pd.read_csv('sales.csv')
# Common options:
df = pd.read_csv(
    'sales.csv',
    sep=';',              # custom delimiter (semicolon, tab, etc.)
    header=0,             # row to use as column names (0 = first row)
    index_col='order_id', # use this column as the row index
    usecols=['order_id', 'date', 'amount', 'region'],  # read only these columns (must include index_col)
    dtype={'amount': 'float32'},            # explicit dtype
    parse_dates=['date'],                   # auto-parse date strings
    na_values=['N/A', '--', ''],            # treat as NaN
    nrows=1000,           # read only first 1000 rows (useful for large files)
    encoding='utf-8',
)

# --- Excel ---
df_xl = pd.read_excel('report.xlsx', sheet_name='Q1', skiprows=2)

# --- JSON ---
df_j = pd.read_json('data.json', orient='records')
# orient='records' expects [{...}, {...}] — the common API response shape

# --- SQL ---
import sqlite3
conn = sqlite3.connect('mydb.sqlite')
df_sql = pd.read_sql('SELECT * FROM orders WHERE amount > 100', conn)

# Always inspect after reading
print(df.shape)
print(df.head())
print(df.dtypes)
df.info()          # prints non-null counts per column (returns None, so no print() needed)

For very large CSVs that do not fit in memory, pass chunksize=100_000 to read_csv — it returns an iterator of DataFrames, each containing that many rows. Process and aggregate chunk by chunk without loading the full file.
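
A chunked aggregation might look like the sketch below. An in-memory string stands in for a large file so the example is self-contained; the column names and the tiny chunk size are hypothetical.

```python
import io
import pandas as pd

# Stand-in for a large file on disk (hypothetical data)
csv_text = "region,amount\nEast,100\nWest,50\nEast,25\nWest,75\nEast,10\n"

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Reduce each chunk to a small per-region summary before moving on
    partials.append(chunk.groupby("region")["amount"].sum())

# Combine the per-chunk summaries into the final totals
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This pattern works for aggregations that can be combined from partial results (sum, count, min, max); a median, for example, cannot be computed this way.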

Which parameter tells pd.read_csv to read only the first 500 rows of a file?
What does orient='records' mean in pd.read_json?
9. What is the difference between df.loc[] and df.iloc[] in Pandas?

This distinction is tested in almost every Pandas interview. The short version: loc selects by label; iloc selects by integer position. They look similar but behave very differently, especially when the DataFrame index is not a default RangeIndex.

import pandas as pd

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol', 'Dave'],
    'score':  [88, 72, 95, 61],
    'city':   ['NYC', 'LA', 'NYC', 'Chicago'],
}, index=[10, 20, 30, 40])   # non-default index!

# --- loc: label-based ---
df.loc[20]               # row with index label 20 (Bob)
df.loc[10:30]            # rows 10, 20, 30 — INCLUSIVE stop
df.loc[10, 'name']       # single value: 'Alice'
df.loc[[10, 40], ['name', 'score']]   # multiple rows and columns
df.loc[df['score'] >= 80]             # boolean mask selection

# --- iloc: position-based ---
df.iloc[0]               # first row (Alice) — positional 0
df.iloc[0:2]             # rows 0 and 1 — EXCLUSIVE stop (like Python slicing)
df.iloc[0, 1]            # row 0, column 1: 88
df.iloc[-1]              # last row (Dave)
df.iloc[:, 0]            # entire first column

# --- [] shorthand ---
df['name']               # single column as Series
df[['name', 'city']]     # multiple columns as DataFrame
df[df['score'] > 80]     # boolean filtering — OK for rows only

The classic trap: loc stop is inclusive; iloc stop is exclusive. This asymmetry trips up even experienced developers. When in doubt, prefer explicit loc or iloc over the [] shorthand to avoid ambiguity.

For df with index [10, 20, 30, 40], what does df.iloc[0:2] return?
Which accessor should you use to select rows by a boolean condition in Pandas?
10. How do you detect, handle, and fill missing values in a Pandas DataFrame?

Missing values are represented in Pandas as NaN (float Not-a-Number from NumPy), NaT (Not-a-Time for datetime columns), or pd.NA (the missing marker used by the newer nullable integer, boolean, and string dtypes). Handling them correctly is often one of the most time-consuming steps of real-world data cleaning.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', None, 'Dave'],
    'age':    [30, np.nan, 35, 28],
    'score':  [88, 72, np.nan, np.nan],
})

# --- Detection ---
df.isnull()               # boolean DataFrame — True where NaN
df.isnull().sum()         # count NaNs per column
df.isnull().sum() / len(df) * 100  # % missing per column
df.notnull()              # inverse of isnull

# --- Dropping ---
df.dropna()               # drop rows with ANY NaN
df.dropna(how='all')      # drop rows where ALL values are NaN
df.dropna(subset=['age']) # drop rows with NaN only in 'age'
df.dropna(axis=1, thresh=3)  # drop columns with fewer than 3 non-NaN values

# --- Filling ---
df['score'].fillna(df['score'].mean())   # fill with column mean
df['score'].ffill()                      # forward fill (propagate last valid); replaces the deprecated fillna(method='ffill')
df['score'].bfill()                      # backward fill
df.fillna({'age': 0, 'name': 'Unknown'})  # column-specific fills

# --- Interpolation ---
df['score'].interpolate(method='linear')  # linear interpolation between values

# --- Replace specific sentinel values ---
df.replace(-999, np.nan)  # treat -999 as missing

Choosing between dropping and filling requires domain knowledge. Dropping rows is acceptable when missing data is rare (a common rule of thumb is below 5%) and appears to be missing at random. Filling with the mean or median is common for numerical features, and filling with the mode or a sentinel ('Unknown') is common for categoricals. For time series, forward fill carries the last observed value forward without using future data.
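
Put together, a per-column strategy could be sketched as follows. The choice of median for numeric columns and the 'Unknown' label are assumptions made for illustration, not a universal recipe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name':  ['Alice', 'Bob', None, 'Dave'],
    'age':   [30, np.nan, 35, 28],
    'score': [88, 72, np.nan, np.nan],
})

pct_missing = df.isna().mean() * 100     # share of missing values per column

filled = df.copy()
for col in ['age', 'score']:             # numeric: median is robust to outliers
    filled[col] = filled[col].fillna(filled[col].median())
filled['name'] = filled['name'].fillna('Unknown')   # categorical: sentinel label

print(pct_missing)
print(filled)
```

Working on a copy keeps the raw data available, which makes it easier to compare imputation strategies later.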

Which Pandas method counts the number of missing values per column?
What does df['col'].fillna(method='ffill') do?
11. What are the different ways to filter rows in a Pandas DataFrame?

Row filtering is one of the most frequent DataFrame operations. Pandas provides several syntaxes, each with different readability and performance trade-offs.

import pandas as pd

df = pd.DataFrame({
    'city':     ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
    'revenue':  [120, 85, 200, 55, 140],
    'category': ['A', 'B', 'A', 'C', 'B'],
})

# Boolean indexing — most common
df[df['revenue'] > 100]

# Compound conditions — use & | ~ (not and/or)
df[(df['city'] == 'NYC') & (df['revenue'] > 100)]

# isin — membership test
df[df['city'].isin(['NYC', 'LA'])]

# between — inclusive range
df[df['revenue'].between(80, 150)]

# str methods for text filtering
df[df['city'].str.startswith('N')]
df[df['city'].str.contains('C', case=False)]

# query() — string-based, readable for complex conditions
df.query('city == "NYC" and revenue > 100')
threshold = 100
df.query('revenue > @threshold')
# @ prefix references a Python variable inside the query string

# filter() — filter columns or index labels (NOT rows by content)
df.filter(like='rev')    # columns whose name contains 'rev'
df.filter(regex='^c')   # columns starting with 'c'

query() is readable for ad-hoc analysis and can be somewhat faster on very large DataFrames, because the expression may be evaluated by numexpr without materialising every intermediate boolean array. However, it does not support all Python expressions and can be harder to debug. For production pipelines, plain boolean indexing is easier to test and to step through.
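
Both styles select the same rows, which can be confirmed on the example data. The threshold variable below is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'city':    ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
    'revenue': [120, 85, 200, 55, 140],
})
threshold = 100

via_mask = df[(df['city'] == 'NYC') & (df['revenue'] > threshold)]
via_query = df.query('city == "NYC" and revenue > @threshold')

print(via_mask.equals(via_query))   # True
```

Because the results are identical, the choice between the two comes down to readability and tooling rather than correctness.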

Why must you use & instead of 'and' when writing compound Pandas boolean filters?
Which Pandas method checks if each row's value belongs to a given list?
12. How does Pandas groupby work and what aggregation patterns are most useful?

GroupBy is the Pandas implementation of the split-apply-combine pattern: split the DataFrame into groups by one or more column values, apply an aggregation or transformation to each group, and combine the results into a new DataFrame. It is the primary tool for summary statistics on tabular data.

import pandas as pd

sales = pd.DataFrame({
    'region':  ['East','East','West','West','East','West'],
    'product': ['A','B','A','B','A','A'],
    'revenue': [100, 200, 150, 80, 120, 90],
    'units':   [10, 20, 15, 8, 12, 9],
})

# Single-column groupby with single aggregation
sales.groupby('region')['revenue'].sum()
# East    420   West    320

# Multiple aggregations on one column
sales.groupby('region')['revenue'].agg(['sum', 'mean', 'count', 'std'])

# Different aggregations per column
sales.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_units    =('units',   'mean'),
    num_orders   =('revenue', 'count'),
)

# Multi-column groupby
sales.groupby(['region', 'product'])['revenue'].sum()

# transform — returns same-length Series aligned with original index
# (useful for adding group statistics back as a new column)
sales['region_total'] = sales.groupby('region')['revenue'].transform('sum')

# filter — keep only groups satisfying a condition
big_regions = sales.groupby('region').filter(lambda g: g['revenue'].sum() > 400)

transform vs agg: agg reduces each group to a scalar, returning a smaller DataFrame; transform keeps the original shape, broadcasting the group result back to each row. Use transform when you want to add a group statistic as a feature column without losing row-level detail.
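
The shape difference is easy to see on the sales data. The 'share' column below is a hypothetical feature added for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'region':  ['East', 'East', 'West', 'West', 'East', 'West'],
    'revenue': [100, 200, 150, 80, 120, 90],
})
grouped = sales.groupby('region')['revenue']

per_group = grouped.agg('sum')        # one value per region
per_row = grouped.transform('sum')    # one value per original row

print(per_group.shape, per_row.shape)            # (2,) (6,)
sales['share'] = sales['revenue'] / per_row      # each row's share of its region total
print(sales)
```

Because the transform result is aligned with the original index, it can be used directly in column arithmetic without a merge.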

What does groupby.transform('sum') return compared to groupby.agg('sum')?
How do you apply different aggregations to different columns in a single groupby call?
13. How do you merge and join DataFrames in Pandas, and what do the different join types mean?

Real-world data lives in multiple tables. Pandas merge() implements SQL-style joins, and concat() stacks DataFrames. Choosing the right join type prevents silently losing or duplicating rows.

Pandas Join Types
how= | Keeps rows from | Missing matches become
'inner' | Both DataFrames (intersection of keys) | Dropped (no NaN is introduced)
'left' | All of left, matched from right | NaN in right columns
'right' | All of right, matched from left | NaN in left columns
'outer' | Both (union of keys) | NaN on whichever side has no match
import pandas as pd

customers = pd.DataFrame({
    'cust_id': [1, 2, 3, 4],
    'name':    ['Alice', 'Bob', 'Carol', 'Dave'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'cust_id':  [1, 2, 9],     # cust_id 9 has no match; Dave has no order
    'amount':   [200, 150, 80],
})

# Inner join — only rows that match in both
pd.merge(customers, orders, on='cust_id', how='inner')   # 2 rows

# Left join — all customers, NaN where no order
pd.merge(customers, orders, on='cust_id', how='left')    # 4 rows

# Different key names in each table
pd.merge(customers, orders,
         left_on='cust_id', right_on='cust_id')  # same here, but shows syntax

# Merge on index
pd.merge(customers.set_index('cust_id'), orders,
         left_index=True, right_on='cust_id')

# concat — stack vertically (rows) or horizontally (columns)
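df1, df2 = customers.iloc[:2], customers.iloc[2:]   # two pieces with the same columns
pd.concat([df1, df2], axis=0, ignore_index=True)   # stack rows
pd.concat([customers, orders], axis=1)              # add columns side by side (aligned on the index)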
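
To audit which rows matched, merge accepts indicator and validate arguments. The sketch below reuses the example tables; the variable name audit is illustrative.

```python
import pandas as pd

customers = pd.DataFrame({'cust_id': [1, 2, 3, 4],
                          'name': ['Alice', 'Bob', 'Carol', 'Dave']})
orders = pd.DataFrame({'order_id': [101, 102, 103],
                       'cust_id': [1, 2, 9],
                       'amount': [200, 150, 80]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
audit = pd.merge(customers, orders, on='cust_id', how='outer', indicator=True)
print(audit['_merge'].value_counts())

# validate raises MergeError if cust_id is duplicated on the left side
checked = pd.merge(customers, orders, on='cust_id', how='left',
                   validate='one_to_many')
print(len(checked))
```

Checking the _merge counts before choosing a join type is a cheap way to catch keys that would otherwise be dropped or duplicated silently.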
A left join with customers (left) and orders (right) on cust_id will…
What does pd.concat([df1, df2], axis=0) do?
14. When should you use df.apply() versus vectorised Pandas operations?

apply() runs a Python function on every row or column of a DataFrame. It is the most flexible transformation tool in Pandas but also the slowest because it falls back to a Python-level loop under the hood.

import pandas as pd
import numpy as np

df = pd.DataFrame({'price': [10.5, 20.0, 8.75, 35.0],
                   'qty':   [3,    5,    2,    1   ]})

# --- Slow: apply with a Python lambda ---
df['revenue'] = df.apply(lambda row: row['price'] * row['qty'], axis=1)

# --- Fast: vectorised arithmetic (always prefer this) ---
df['revenue'] = df['price'] * df['qty']

# --- apply on a single column (Series.apply) ---
df['price_cat'] = df['price'].apply(lambda x: 'high' if x > 20 else 'low')

# --- Faster alternative: np.where ---
df['price_cat'] = np.where(df['price'] > 20, 'high', 'low')

# --- Multi-condition: np.select ---
conditions  = [df['price'] > 25, df['price'] > 15, df['price'] > 0]
choices     = ['premium', 'mid', 'budget']
df['tier']  = np.select(conditions, choices, default='unknown')

# When apply is genuinely needed:
# — calling a function that returns a list/dict/Series per row
# — complex multi-column logic that cannot be expressed as vectorised ops
df.apply(lambda row: pd.Series({'x': row['price']+1, 'y': row['qty']*2}), axis=1)

Rough performance ordering for transformations (fastest to slowest): vectorised arithmetic and NumPy ufuncs > df.eval() string expressions > map() on a Series > apply() > explicit Python for-loop. Use apply only when no vectorised alternative exists; for simple conditions prefer np.where or boolean indexing.

Why is df['price'] * df['qty'] faster than df.apply(lambda row: row['price']*row['qty'], axis=1)?
Which NumPy function is the recommended vectorised replacement for a simple if/else transformation on a column?
15. How do you use pd.pivot_table to summarise data?

pd.pivot_table reshapes and aggregates a DataFrame simultaneously, producing a cross-tabulation — exactly like a spreadsheet pivot table. It is the go-to function for producing summary reports broken down by two categorical dimensions.

import pandas as pd

sales = pd.DataFrame({
    'region':  ['East','East','West','West','East','West','West'],
    'quarter': ['Q1',  'Q2',  'Q1',  'Q2',  'Q1',  'Q1',  'Q2'],
    'product': ['A',   'A',   'A',   'A',   'B',   'B',   'B'],
    'revenue': [100,   120,   90,    110,   80,    70,    95],
})

# Basic pivot: total revenue by region (rows) and quarter (columns)
pt = pd.pivot_table(
    sales,
    values='revenue',
    index='region',
    columns='quarter',
    aggfunc='sum',       # sum, mean, count, np.median, list, ...
    fill_value=0,        # replace NaN with 0
    margins=True,        # add row/column totals (labelled 'All' unless margins_name is set)
    margins_name='Total',
)
print(pt)
# quarter  Q1   Q2  Total
# region
# East    180  120    300
# West    160  205    365
# Total   340  325    665

# Several aggregations at once, broken down by product
pd.pivot_table(sales, values='revenue', index='region',
               columns='product', aggfunc=['sum', 'count'])

The inverse operation — converting a wide pivot back to long form — is pd.melt(). df.stack() and df.unstack() do similar reshape operations on the index levels directly.
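
A round trip from long to wide and back could be sketched like this. The small frame below is already aggregated to keep the example short; the variable names are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    'region':  ['East', 'East', 'West', 'West'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [180, 120, 160, 205],
})

# Long -> wide: one row per region, one column per quarter
wide = sales.pivot_table(values='revenue', index='region',
                         columns='quarter', aggfunc='sum').reset_index()

# Wide -> long: one row per (region, quarter) pair again
long = wide.melt(id_vars='region', var_name='quarter', value_name='revenue')
print(long)
```

Long form is generally what groupby and most plotting functions expect, while wide form is easier to read in a report.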

What does margins=True add to a pd.pivot_table result?
Which Pandas function is the inverse of pivot_table, reshaping wide data back to long form?
16. How do you perform string operations on Pandas DataFrame columns?

Pandas exposes string methods through the .str accessor on object-dtype Series. These operations are vectorised over the whole column — no explicit loop needed — and handle NaN values gracefully (they propagate as NaN rather than raising an error).

import pandas as pd

df = pd.DataFrame({'name': ['  Alice Smith  ', 'bob jones', 'CAROL LEE', None],
                   'email': ['alice@corp.com', 'BOB@CORP.COM', 'carol@other.org', None]})

# Case normalisation
df['name'].str.strip().str.title()     # 'Alice Smith', 'Bob Jones', 'Carol Lee', NaN

# Split into multiple columns
df[['first', 'last']] = df['name'].str.strip().str.split(' ', expand=True)

# Contains / startswith / endswith
df[df['email'].str.endswith('@corp.com', na=False)]

# Extract patterns with regex
df['domain'] = df['email'].str.extract(r'@(.+)$', expand=False)  # text after @, returned as a Series

# Replace with regex
df['email'].str.lower().str.replace(r'[^a-z0-9@._]', '', regex=True)

# Count occurrences
df['name'].str.count('l')   # 1, 0, 0, NaN (case-sensitive: 'CAROL LEE' has no lowercase l)

# Length
df['name'].str.len()

# Padding / justification
pd.Series(['7', '42']).str.zfill(6)   # zero-pad to width 6: '000007', '000042'
df['name'].str.ljust(20, '-')  # left-justify, pad with dashes

The na=False argument in methods like str.contains and str.startswith matters: without it, missing values can come through as NaN in the mask, and indexing with a mask that contains NaN raises a ValueError. Passing na=False returns False for those rows, so they are simply excluded from the filtered result.
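
A minimal illustration of the guarded form, using the example email values:

```python
import pandas as pd

emails = pd.Series(['alice@corp.com', 'BOB@CORP.COM', 'carol@other.org', None])

# na=False maps missing entries to False, so the mask is strictly boolean
mask = emails.str.contains('@corp.com', na=False)
print(mask.tolist())       # [True, False, False, False]
print(emails[mask])        # only alice@corp.com (the match is case-sensitive)
```

Adding case=False would also match the upper-case address, if that is the intent.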

What does df['col'].str.extract(r'(\d+)') do?
Why pass na=False to df['email'].str.contains('@corp.com')?
17. How do you work with dates and times in Pandas?

Time-series data is everywhere in data science — sales by day, sensor readings by second, user activity by hour. Pandas has first-class datetime support built on NumPy's datetime64 type and Python's datetime module.

import pandas as pd

df = pd.DataFrame({
    'date_str': ['2024-01-15', '2024-02-20', '2024-03-05'],
    'value':    [100, 200, 150],
})

# Parse string dates — always specify format for speed and correctness
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d')

# Extract components via .dt accessor
df['year']    = df['date'].dt.year
df['month']   = df['date'].dt.month
df['day']     = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()  # 'Monday', 'Tuesday', ...
df['quarter'] = df['date'].dt.quarter

# Date arithmetic
df['days_since'] = (pd.Timestamp.today() - df['date']).dt.days
df['next_month'] = df['date'] + pd.DateOffset(months=1)

# Set as index for time-series resampling
ts = df.set_index('date')
ts.resample('M').sum()   # sum by month (pandas 2.2+ prefers the alias 'ME')
ts.resample('W').mean()  # mean by week
ts.resample('Q').agg({'value': ['sum', 'count']})  # quarterly stats ('QE' in pandas 2.2+)

# Filtering date ranges
df[df['date'] >= '2024-02-01']
df[df['date'].between('2024-01-01', '2024-03-01')]

Always parse dates explicitly with format= rather than relying on format inference (the infer_datetime_format=True option is deprecated as of pandas 2.0). Inference can be slow and is occasionally wrong for ambiguous formats like 01/02/03. For production pipelines, parse at read time using parse_dates=['date_col'] in pd.read_csv.
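
The ambiguity is easy to demonstrate: the same string yields two different dates depending on the format supplied. This is an illustrative sketch.

```python
import pandas as pd

raw = '01/02/03'

day_first = pd.to_datetime(raw, format='%d/%m/%y')    # 1 February 2003
month_first = pd.to_datetime(raw, format='%m/%d/%y')  # 2 January 2003

print(day_first.date(), month_first.date())
```

Stating the format also makes malformed rows fail loudly instead of being parsed into a plausible but wrong date.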

Which Pandas accessor exposes datetime components like .year, .month, and .day_name() on a datetime Series?
What does ts.resample('M').sum() do on a time-indexed DataFrame?
18. What is Matplotlib and what are the key components of a figure?

Matplotlib is Python's foundational plotting library, originally modelled after MATLAB's plotting API. Several other Python visualisation tools, including Seaborn and Pandas' .plot(), are built on top of it and return Matplotlib objects.

Understanding the object hierarchy is essential for customising plots beyond the defaults:

Matplotlib Object Hierarchy

Object | What it is | Created by
Figure | The entire canvas / window | plt.figure() or plt.subplots()
Axes | One coordinate system (plot area) inside a Figure | fig.add_subplot() or plt.subplots()
Axis | The X or Y axis of an Axes (note: Axes ≠ Axis) | Exists on every Axes
Artist | Every visible element: lines, patches, text, legends | plot(), bar(), text(), etc.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 300)

# Object-oriented interface (recommended for complex plots)
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, np.sin(x), label='sin(x)', color='steelblue', linewidth=2)
ax.plot(x, np.cos(x), label='cos(x)', color='tomato', linestyle='--')
ax.set_title('Sine and Cosine', fontsize=14)
ax.set_xlabel('x (radians)')
ax.set_ylabel('Amplitude')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 2 * np.pi)
fig.tight_layout()        # prevent label clipping
fig.savefig('trig.png', dpi=150, bbox_inches='tight')
plt.show()

The pyplot (plt.*) interface is a state-machine shorthand that implicitly manages the current Figure and Axes. It is convenient for quick interactive plots but problematic in scripts and notebooks that create multiple figures — use the object-oriented fig, ax = plt.subplots() style for anything beyond a single simple chart.
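
The difference shows up as soon as two figures exist. The sketch below assumes a non-interactive backend (Agg) so it runs without a display; the figure names are illustrative.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig_a, ax_a = plt.subplots()
fig_b, ax_b = plt.subplots()   # fig_b is now pyplot's "current" figure

# Explicit: each call names the Axes it draws on
ax_a.plot([0, 1, 2], [0, 1, 4])
ax_a.set_title('Figure A')

# Implicit: plt.* targets whichever figure was created or selected last
plt.plot([0, 1, 2], [4, 1, 0])
plt.title('Lands on figure B')

print(len(ax_a.lines), len(ax_b.lines))
```

With the object-oriented calls the target is visible in the code; with plt.* it depends on the order in which figures were created.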

In Matplotlib, what is the difference between 'Axes' and 'Axis'?
Which Matplotlib function returns both a Figure and an Axes object in one call?
19. What are the most common chart types in Matplotlib and when do you use each?

Choosing the right chart type communicates data clearly; choosing the wrong one obscures it. Here are the workhorses of exploratory data analysis:

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# 1. Line chart — trends over time or ordered x-axis
ax = axes[0, 0]
ax.plot([1, 2, 3, 4], [10, 15, 13, 18])
ax.set_title('Line: trends')

# 2. Bar chart — comparing discrete categories
ax = axes[0, 1]
ax.bar(['A', 'B', 'C'], [30, 45, 20])
ax.set_title('Bar: categories')

# 3. Scatter plot — relationship between two continuous variables
ax = axes[0, 2]
x = np.random.randn(100)
y = x * 0.8 + np.random.randn(100) * 0.5
ax.scatter(x, y, alpha=0.5, c='steelblue')
ax.set_title('Scatter: correlation')

# 4. Histogram — distribution of one continuous variable
ax = axes[1, 0]
ax.hist(np.random.randn(1000), bins=30, color='salmon', edgecolor='white')
ax.set_title('Histogram: distribution')

# 5. Box plot — distribution summary with outliers
ax = axes[1, 1]
# Note: Matplotlib 3.9+ renames labels= to tick_labels=
ax.boxplot([np.random.randn(100) for _ in range(3)], labels=['G1','G2','G3'])
ax.set_title('Box: spread & outliers')

# 6. Heatmap via imshow — 2-D matrix data (e.g., correlation matrix)
ax = axes[1, 2]
data = np.random.rand(4, 4)
im = ax.imshow(data, cmap='viridis')
plt.colorbar(im, ax=ax)
ax.set_title('Heatmap: 2-D matrix')

fig.tight_layout()
plt.show()

Rule of thumb: line for temporal/ordered data, bar for nominal comparisons, scatter for two-variable relationships, histogram for single-variable distributions, box for group comparisons with outlier context, heatmap for correlation matrices and confusion matrices.

Which chart type is best for showing the correlation between two continuous numerical variables?
What information does a box plot display that a histogram does not?
20. How do you create multi-panel figures with Matplotlib subplots?

Multi-panel figures are standard in data science reports — comparing multiple variables or time periods side by side. Matplotlib provides several ways to arrange subplots.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)

# --- Regular grid ---
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
# sharex=True links x-axis zoom/pan across all subplots
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title('sin')
axes[0, 1].plot(x, np.cos(x), color='tomato')
axes[0, 1].set_title('cos')
axes[1, 0].plot(x, np.tan(x))
axes[1, 0].set_title('tan')
axes[1, 1].set_visible(False)   # hide unused subplot
fig.suptitle('Trig functions', fontsize=16)
fig.tight_layout(rect=[0, 0, 1, 0.95])   # leave room for suptitle

# --- Flatten for iteration ---
# Assumes df is an existing DataFrame with up to six numeric columns
fig, axes = plt.subplots(2, 3, figsize=(14, 6))
for ax, col in zip(axes.flatten(), df.select_dtypes('number').columns):
    ax.hist(df[col].dropna(), bins=20)
    ax.set_title(col)

# --- GridSpec for irregular layouts ---
from matplotlib.gridspec import GridSpec
fig = plt.figure(figsize=(12, 6))
gs  = GridSpec(2, 3, figure=fig)
ax1 = fig.add_subplot(gs[0, :2])   # spans first two columns of row 0
ax2 = fig.add_subplot(gs[0, 2])    # third column of row 0
ax3 = fig.add_subplot(gs[1, :])    # entire row 1

axes.flatten() is the standard idiom when you want to loop over a 2-D grid of Axes objects as if they were a 1-D list. fig.tight_layout() automatically adjusts spacing to prevent labels overlapping between subplots — call it before plt.show() or fig.savefig().
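
A common follow-up is a grid with more panels than series. The sketch below uses made-up data (five synthetic features in a 2x3 grid) and hides the leftover panel; the Agg backend is assumed so it runs headless.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for the sketch
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: five named series for a 2x3 grid (one panel left over)
series = {f'feature_{i}': rng.normal(size=200) for i in range(5)}

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
flat_axes = axes.flatten()

for ax, (name, values) in zip(flat_axes, series.items()):
    ax.hist(values, bins=20)
    ax.set_title(name)

# zip stops at the shorter input, so hide whatever panels were not used
for ax in flat_axes[len(series):]:
    ax.set_visible(False)

fig.tight_layout()
```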

What does sharex=True do when passed to plt.subplots()?
Which method converts a 2-D array of Axes (from plt.subplots(2,3)) into a 1-D array for easy iteration?
21. What is Seaborn and how does it differ from Matplotlib?

Seaborn is a high-level statistical visualisation library built on top of Matplotlib. Where Matplotlib gives you full control over every pixel, Seaborn provides opinionated, attractive defaults and plot types designed specifically for statistical exploration — with far less boilerplate code.

Matplotlib vs Seaborn

Aspect | Matplotlib | Seaborn
Level | Low-level, explicit control | High-level, declarative
Defaults | Functional but plain | Publication-quality themes out of the box
DataFrame integration | Manual (extract arrays) | Direct: pass data=df and column names
Statistical plots | Manual calculation required | Built-in (regression, KDE, violin, pair)
Customisation | Unlimited | Matplotlib calls needed for fine-tuning

import seaborn as sns
import matplotlib.pyplot as plt

# Load a built-in example dataset
tips = sns.load_dataset('tips')

# Seaborn: one line draws the scatter, the fitted regression line, and its confidence band
# (regplot has no hue parameter; use sns.lmplot when you need hue)
sns.regplot(data=tips, x='total_bill', y='tip')

# Matplotlib equivalent would require:
# 1. Compute regression manually
# 2. Plot scatter
# 3. Plot fitted line
# 4. Shade confidence interval — ~15 lines total

# Themes and contexts
sns.set_theme(style='whitegrid', context='notebook', palette='muted')
# styles: darkgrid, whitegrid, dark, white, ticks
# contexts: paper, notebook, talk, poster (scale font/line sizes)

Axes-level Seaborn functions such as scatterplot, histplot, and boxplot return Matplotlib Axes objects, so all standard Matplotlib customisation still applies after the Seaborn call: ax = sns.scatterplot(...); ax.set_title('My Title'). Figure-level functions such as relplot return a FacetGrid instead. Seaborn does not replace Matplotlib; it is a complement that handles the tedious parts of statistical plotting.

What object do most Seaborn plotting functions return?
Which Seaborn function sets the global theme (background grid, font scale, colour palette) for all subsequent plots?
22. What are the most important Seaborn plot types for exploratory data analysis?

Seaborn divides its plots into relational (relationship between variables), distributional (distribution of a single variable), and categorical (comparison across categories). Knowing when to use each makes EDA far more efficient.

import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')

# --- Relational ---
# Scatter with colour encoding
sns.scatterplot(data=tips, x='total_bill', y='tip',
                hue='smoker', size='size', palette='Set1')

# Regression line + scatter
sns.regplot(data=tips, x='total_bill', y='tip', ci=95)

# --- Distributional ---
# Histogram + KDE
sns.histplot(data=tips, x='total_bill', hue='sex', kde=True, bins=20)

# KDE only
sns.kdeplot(data=tips, x='total_bill', hue='sex', fill=True)

# ECDF — empirical cumulative distribution
sns.ecdfplot(data=tips, x='total_bill', hue='day')

# --- Categorical ---
# Box plot
sns.boxplot(data=tips, x='day', y='total_bill', hue='smoker', palette='pastel')

# Violin — box + KDE combined
sns.violinplot(data=tips, x='day', y='tip', inner='quartile')

# Bar chart with error bars (95% CI by default)
sns.barplot(data=tips, x='day', y='tip', estimator='mean', errorbar='ci')

# Strip plot — all individual points
sns.stripplot(data=tips, x='day', y='tip', jitter=True, alpha=0.4)

# --- Multi-variable overview ---
# Pair plot — scatter matrix of all numeric column pairs
sns.pairplot(tips, hue='sex', diag_kind='kde')

# Heatmap — great for correlation matrices
corr = tips.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
What does a violin plot combine that a regular box plot does not show?
Which Seaborn function creates a scatter matrix of all numeric column pairs in a DataFrame?
23. How do you create and interpret a correlation heatmap with Seaborn?

A correlation heatmap is one of the first plots every data scientist makes on a new dataset. It shows the Pearson (or other) correlation coefficient between every pair of numeric features as a colour-coded grid, immediately revealing which variables move together and which do not.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset
df = sns.load_dataset('penguins').select_dtypes('number')

# Compute correlation matrix
corr = df.corr()   # Pearson by default; method='spearman' for ranked
print(corr)

# --- Basic heatmap ---
fig, ax = plt.subplots(figsize=(7, 5))
sns.heatmap(
    corr,
    annot=True,           # show values inside each cell
    fmt='.2f',            # 2 decimal places
    cmap='coolwarm',      # blue = negative, red = positive
    vmin=-1, vmax=1,      # fix colour scale to [-1, 1]
    linewidths=0.5,       # add grid lines between cells
    ax=ax,
)
ax.set_title('Feature Correlation Matrix')
fig.tight_layout()

# --- Mask upper triangle (remove redundancy) ---
import numpy as np
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', vmin=-1, vmax=1)

Interpreting the output: values close to +1 mean strong positive linear correlation (both variables increase together), values close to -1 mean strong negative correlation (one increases as the other decreases), and values near 0 indicate little to no linear relationship. The diagonal is always 1.0 (a variable is perfectly correlated with itself). Masking the upper triangle removes the mirror image and makes the chart less cluttered.
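
The same lower-triangle idea can be used to list the strongest pairs as numbers rather than colours. This is a sketch on synthetic data (columns a, b, c, d are invented for illustration), not part of the penguins example above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b tracks a, c moves against a, d is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=500),
    'c': -a + rng.normal(scale=0.1, size=500),
    'd': rng.normal(size=500),
})

corr = df.corr()

# Keep only the strict lower triangle so each pair appears exactly once
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
pairs = lower.stack().dropna()   # Series indexed by (row, column)

# Rank by absolute strength; the sign still shows the direction
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```

Pairs involving d end up at the bottom of the ranking, matching the "near 0 means little linear relationship" reading of the heatmap.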

What does a correlation value of -0.87 between two features indicate?
What does masking the upper triangle of a correlation heatmap achieve?
24. What is Seaborn's FacetGrid and how does it enable multi-panel statistical plots?

FacetGrid is Seaborn's mechanism for trellis/small-multiples plots — the same chart repeated across different subsets of the data, defined by one or more categorical columns. It is one of Seaborn's most powerful features for exploring interaction effects between variables.

import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')

# --- FacetGrid manually ---
g = sns.FacetGrid(tips, col='time', row='sex', height=3, aspect=1.2)
g.map_dataframe(sns.histplot, x='total_bill', bins=15, kde=True)
g.add_legend()
g.set_titles(col_template='{col_name} service', row_template='Sex: {row_name}')
g.set_axis_labels('Total Bill ($)', 'Count')

# --- Figure-level functions (wrap FacetGrid automatically) ---
# relplot — relational
sns.relplot(data=tips, x='total_bill', y='tip',
            col='smoker', hue='sex', kind='scatter', height=4)

# displot — distributional
sns.displot(data=tips, x='total_bill',
            col='sex', row='time', kind='kde', fill=True)

# catplot — categorical
sns.catplot(data=tips, x='day', y='tip',
            col='sex', kind='violin', height=5, aspect=0.8)

The figure-level functions (relplot, displot, catplot) return a FacetGrid object, not an Axes. To customise them after creation you call FacetGrid methods like g.set_titles(), g.set_axis_labels(), or iterate over g.axes.flatten() to access individual Axes objects and apply standard Matplotlib customisation.
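
A short sketch of that customisation pattern, using a small synthetic DataFrame in place of a real dataset (the column names x, y, and group are assumptions made for the example, and the Agg backend is set so it runs headless):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for the sketch
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Hypothetical data standing in for a real dataset
df = pd.DataFrame({
    'x': rng.normal(size=100),
    'y': rng.normal(size=100),
    'group': np.repeat(['a', 'b'], 50),
})

g = sns.relplot(data=df, x='x', y='y', col='group', kind='scatter', height=3)

# Grid-level customisation through FacetGrid methods
g.set_axis_labels('X value', 'Y value')
g.set_titles(col_template='group = {col_name}')

# Per-panel customisation through the underlying Matplotlib Axes
for ax in g.axes.flatten():
    ax.axhline(0, color='grey', linewidth=0.8)
```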

What is the purpose of Seaborn's FacetGrid?
Which figure-level Seaborn function is used to create distributional plots across facets?
25. How do you compute descriptive statistics on a Pandas DataFrame?

Descriptive statistics summarise the central tendency, spread, and shape of a dataset. Pandas df.describe() is the starting point for any exploratory analysis, but knowing the individual methods gives you more precise control.

import pandas as pd
import numpy as np

df = pd.read_csv('housing.csv')

# --- df.describe() ---
# Numeric columns: count, mean, std, min, 25%, 50%, 75%, max
df.describe()
# Include object columns too
df.describe(include='all')

# --- Individual statistics ---
df['price'].mean()      # arithmetic mean
df['price'].median()    # 50th percentile — robust to outliers
df['price'].mode()[0]   # most frequent value (returns Series)
df['price'].std()       # standard deviation (ddof=1 by default)
df['price'].var()       # variance
df['price'].skew()      # skewness: >0 right-skewed, <0 left-skewed
df['price'].kurt()      # excess kurtosis (0 = normal dist)
df['price'].quantile(0.90)  # 90th percentile
df['price'].quantile([0.25, 0.5, 0.75])  # multiple quantiles

# IQR — interquartile range (robust measure of spread)
Q1, Q3 = df['price'].quantile(0.25), df['price'].quantile(0.75)
IQR = Q3 - Q1

# Outlier detection via IQR fence
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['price'] < lower) | (df['price'] > upper)]
print(f'{len(outliers)} outliers detected ({len(outliers)/len(df)*100:.1f}%)')
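
To see why the median and the IQR fence are called robust, here is a tiny worked sketch on six invented prices, one of them extreme:

```python
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([10, 12, 11, 13, 12, 100])

print(prices.mean())     # pulled upward by the single extreme value
print(prices.median())   # stays with the bulk of the data

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())
```

The mean lands above every typical value, while the median does not move, and the fence flags only the extreme observation.
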
Why is the median preferred over the mean as a measure of central tendency for highly skewed data?
What does a positive skewness value indicate about a distribution?
26. How do you reduce a Pandas DataFrame's memory usage through dtype optimisation?

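DataFrames loaded from CSV often use unnecessarily large dtypes: 64-bit integers for values that fit in 8 bits, generic object dtype for repeated string categories. Choosing smaller dtypes can reduce memory by as much as 4–8×, enabling analysis of larger datasets within available RAM. Integer downcasting and conversion to category are lossless; float64 to float32 keeps only about 7 significant digits, so use it only where that precision is sufficient.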

import pandas as pd
import numpy as np

df = pd.read_csv('large.csv')
print(f'Memory before: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB')

# --- Integer downcasting ---
for col in df.select_dtypes('int64').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
    # downcast tries int8 -> int16 -> int32 depending on value range

# --- Float downcasting ---
for col in df.select_dtypes('float64').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')  # float32

# --- Categorical: object columns with low cardinality ---
# If a column has < 5% unique values, Categorical saves memory
for col in df.select_dtypes('object').columns:
    n_unique = df[col].nunique()
    if n_unique / len(df) < 0.05:    # less than 5% cardinality
        df[col] = df[col].astype('category')

print(f'Memory after : {df.memory_usage(deep=True).sum() / 1e6:.1f} MB')

# Categorical also speeds up groupby on low-cardinality columns
# because grouping enumerates integers rather than comparing strings

The Categorical dtype stores repeated strings as integer codes internally — a column with 5 unique city names in a million-row dataset stores one integer per row rather than one full string per row. This speeds up groupby, sort_values, and value_counts in addition to saving memory.
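
The saving is easy to measure. The sketch below builds a synthetic 100,000-row column from five invented city names and compares the two representations; exact byte counts depend on the pandas version, so none are quoted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cities = ['Paris', 'Berlin', 'Madrid', 'Rome', 'Vienna']  # hypothetical values

as_text = pd.Series(rng.choice(cities, size=100_000))
as_category = as_text.astype('category')

text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f'text: {text_bytes:,} bytes, category: {category_bytes:,} bytes')

# The integer codes are what groupby and value_counts actually work on
print(as_category.cat.codes.dtype)

# Integer downcasting picks the smallest signed type that holds every value
small_ints = pd.to_numeric(pd.Series([1, 2, 3], dtype='int64'), downcast='integer')
print(small_ints.dtype)
```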

Which Pandas dtype should you use for a column with only 10 distinct string values repeated across a million rows?
What does pd.to_numeric(col, downcast='integer') do?
27. How do you generate reproducible random data with NumPy?

Reproducibility is a core requirement of data science — experiments, train/test splits, and simulations must produce the same result every run so that results can be verified and shared. NumPy's random number generation is the building block for all of this.

import numpy as np

# --- Legacy API (still common in older code) ---
np.random.seed(42)
np.random.rand(3)       # [0.375, 0.951, 0.732] (same every time)

# --- Modern API: Generator (preferred since NumPy 1.17) ---
rng = np.random.default_rng(seed=42)
# A Generator owns its state (no hidden global) and defaults to the PCG64 bit generator

rng.random(5)                # uniform [0, 1)
rng.standard_normal(5)       # N(0, 1)
rng.normal(loc=170, scale=10, size=1000)  # N(mean, std)
rng.integers(0, 100, size=10)  # random ints in [0, 100)
rng.choice(['a','b','c'], size=5, replace=True)  # random sampling
arr = np.arange(10)           # example array to shuffle
rng.shuffle(arr)              # in-place shuffle
rng.permutation(arr)          # shuffled copy

# --- Distributions used in ML simulations ---
rng.binomial(n=10, p=0.3, size=100)      # number of successes in n trials
rng.poisson(lam=5, size=100)             # events per interval
rng.exponential(scale=2, size=100)       # time between Poisson events
rng.uniform(low=0, high=10, size=100)    # uniform distribution

The modern Generator API (np.random.default_rng) is preferred over np.random.seed for three reasons: the generator is a first-class object you can pass around rather than hidden global state, each component or thread can own its own independent stream, and its default PCG64 bit generator passes more statistical tests than the Mersenne Twister used by the legacy API.
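
A sketch of what "pass the generator around" looks like in practice. The function simulate_heights is a hypothetical helper written for this example; SeedSequence.spawn is one way to derive independent streams for parallel workers.

```python
import numpy as np

def simulate_heights(rng, n):
    """Draw n heights from the generator that was passed in (no global state)."""
    return rng.normal(loc=170, scale=10, size=n)

# Same seed, same stream: two separately created generators agree
first  = simulate_heights(np.random.default_rng(42), 5)
second = simulate_heights(np.random.default_rng(42), 5)
print(np.array_equal(first, second))

# Independent but reproducible streams, e.g. one per parallel worker
child_seeds = np.random.SeedSequence(42).spawn(2)
worker_a, worker_b = (np.random.default_rng(s) for s in child_seeds)
draw_a, draw_b = worker_a.random(3), worker_b.random(3)
```

Because nothing is shared, calling simulate_heights elsewhere in a program cannot change the numbers another component sees.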

Why is np.random.default_rng(seed) preferred over np.random.seed() for production code?
What distribution does rng.standard_normal() draw from?
28. How do you use value_counts() and pd.crosstab() to understand categorical data?

Categorical columns are understood by counting their frequencies and cross-tabulating them against other variables. These two tools answer the questions 'what values exist and how often?' and 'how are two categorical variables related?'

import pandas as pd
tips = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')

# --- value_counts ---
tips['day'].value_counts()
# Sat 87, Sun 76, Thur 62, Fri 19 (sorted by count, descending)

tips['day'].value_counts(normalize=True).round(3)
# proportions: Sat 0.357, Sun 0.311, Thur 0.254, Fri 0.078

tips['day'].value_counts(dropna=False)  # includes NaN count if any

# Count unique values
tips['day'].nunique()   # 4

# Histogram of numeric with bins
pd.cut(tips['total_bill'], bins=5).value_counts().sort_index()

# --- pd.crosstab ---
# Frequency cross-table: how many smokers vs non-smokers per day
ct = pd.crosstab(tips['day'], tips['smoker'])
# smoker   No  Yes
# day
# Fri       4   15
# Sat      45   42
# Sun      57   19
# Thur     45   17

# Proportions within rows (what % of each day are smokers)
pd.crosstab(tips['day'], tips['smoker'], normalize='index').round(3)

# With aggregation (mean tip by day and smoker)
pd.crosstab(tips['day'], tips['smoker'],
            values=tips['tip'], aggfunc='mean').round(2)
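
For a version that runs without downloading anything, here is the same pattern on a small invented table (the plan and churned columns are hypothetical):

```python
import pandas as pd

# Hypothetical survey data
df = pd.DataFrame({
    'plan':    ['free', 'free', 'free', 'pro', 'pro', 'pro', 'pro', 'free'],
    'churned': ['yes',  'no',   'no',   'no',  'no',  'yes', 'no',  'yes'],
})

shares = df['plan'].value_counts(normalize=True)      # fractions that sum to 1
counts = pd.crosstab(df['plan'], df['churned'])       # raw frequency table
row_pct = pd.crosstab(df['plan'], df['churned'], normalize='index')

print(counts)
print(row_pct)
```

With normalize='index' each row sums to 1, so the table reads as "of the customers on this plan, what share churned".
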
What does value_counts(normalize=True) return?
What does pd.crosstab(df['A'], df['B'], normalize='index') produce?
29. How do you style Matplotlib figures and save them for reports?

The default Matplotlib style is functional but plain. For presentations and reports you need publication-quality output — chosen colour palettes, correct font sizes, no chart junk, and lossless or high-resolution raster output.

import matplotlib.pyplot as plt
import numpy as np

# --- Using a style sheet ---
plt.style.use('seaborn-v0_8-whitegrid')  # clean grid background
# Other useful styles: 'ggplot', 'fivethirtyeight', 'bmh', 'dark_background'
print(plt.style.available)   # list all available styles

# --- Common appearance tweaks via rcParams ---
plt.rcParams.update({
    'font.size':        12,
    'axes.labelsize':   13,
    'axes.titlesize':   14,
    'legend.fontsize':  11,
    'figure.dpi':       100,
    'lines.linewidth':  2,
})

# --- Figure construction ---
fig, ax = plt.subplots(figsize=(8, 5))
x = np.linspace(0, 10, 200)
ax.plot(x, np.sin(x), color='#2E86AB', label='sin(x)')
ax.fill_between(x, np.sin(x), 0, alpha=0.15, color='#2E86AB')
ax.axhline(0, color='black', linewidth=0.8, linestyle='--')
ax.set_title('Sine Wave with Fill', pad=12)
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
ax.legend(loc='upper right')
ax.spines[['top', 'right']].set_visible(False)  # remove chart junk
fig.tight_layout()

# --- Saving ---
fig.savefig('output.png', dpi=300, bbox_inches='tight')  # raster
fig.savefig('output.pdf', bbox_inches='tight')            # vector
fig.savefig('output.svg', bbox_inches='tight')            # web/edit

Use bbox_inches='tight' whenever saving — it prevents axis labels being clipped at the edges. For publications use PDF or SVG (vector formats that scale without pixelation). For web and slides, PNG at 150–300 DPI is standard.
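
savefig also accepts file-like objects, which is handy when the image goes straight into a report generator or web response. A minimal sketch, assuming the Agg backend and in-memory buffers instead of files on disk:

```python
import io
import matplotlib
matplotlib.use('Agg')  # headless backend for the sketch
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel('x')
ax.set_ylabel('y')

png_buffer = io.BytesIO()
svg_buffer = io.BytesIO()
fig.savefig(png_buffer, format='png', dpi=300, bbox_inches='tight')  # raster
fig.savefig(svg_buffer, format='svg', bbox_inches='tight')           # vector

print(len(png_buffer.getvalue()), len(svg_buffer.getvalue()))
```

When writing to a buffer there is no filename extension to infer from, so format= has to be given explicitly.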

What does bbox_inches='tight' do when saving a Matplotlib figure?
Which file format should you use when you need a Matplotlib plot that scales without pixelation in a report?
30. What is np.where and how is it used for conditional array creation?

np.where is NumPy's vectorised if/else for arrays. In its three-argument form it returns a new array built element-by-element: where the condition is True, use values from x; where False, use values from y. It is the correct alternative to writing a Python loop with an if-statement inside.

import numpy as np

scores = np.array([88, 45, 72, 91, 60, 33, 95])

# Classify into Pass / Fail without a loop
labels = np.where(scores >= 70, 'Pass', 'Fail')
# ['Pass' 'Fail' 'Pass' 'Pass' 'Fail' 'Fail' 'Pass']

# Apply a discount: over 80 gets 20% off, rest gets 5% off
prices = np.array([100.0, 200.0, 50.0, 150.0])
discounted = np.where(prices > 80, prices * 0.80, prices * 0.95)
# [ 80.  160.   47.5 120. ]

# Chain multiple conditions using np.select
conditions = [
    scores >= 90,
    (scores >= 70) & (scores < 90),
    scores < 70,
]
choices = ['A', 'B', 'C']
grades = np.select(conditions, choices, default='F')
# ['B' 'C' 'B' 'A' 'C' 'C' 'A']

# One-argument form: returns indices where condition is True
failing_indices = np.where(scores < 70)
# (array([1, 4, 5]),)   — tuple of index arrays
failing_scores = scores[failing_indices]
# [45 60 33]

np.select generalises np.where to multiple conditions — the first matching condition wins. Use it whenever you have more than two output categories; chaining nested np.where calls quickly becomes unreadable.
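
Because the first match wins, later conditions do not need an upper bound. The sketch below reuses the scores array from above and compares a simplified np.select with the equivalent nested np.where:

```python
import numpy as np

scores = np.array([88, 45, 72, 91, 60, 33, 95])

# First matching condition wins, so 'scores >= 70' only applies below 90
grades_select = np.select(
    [scores >= 90, scores >= 70],
    ['A', 'B'],
    default='C',
)

# The same logic as nested np.where calls, which gets harder to read as tiers grow
grades_nested = np.where(scores >= 90, 'A', np.where(scores >= 70, 'B', 'C'))

print(grades_select)
```

Both forms give identical results; the np.select version stays flat when a fourth or fifth tier is added.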

What does np.where(scores < 70) return when called with only one argument?
Which NumPy function is the idiomatic replacement for chaining multiple np.where conditions?
31. What is Pandas method chaining and how does df.pipe() support it?

Method chaining is the style of writing data transformations as a single expression where each step's result is the input to the next. It avoids creating intermediate variables, reads like a pipeline, and makes the data flow explicit from top to bottom.

import numpy as np
import pandas as pd

# --- Without chaining (intermediate variables) ---
df1 = pd.read_csv('raw.csv')
df2 = df1.rename(columns={'rev': 'revenue'})   # rename first so later steps can use 'revenue'
df3 = df2.dropna(subset=['revenue'])
df4 = df3[df3['revenue'] > 0]
df5 = df4.assign(log_revenue=lambda d: np.log1p(d['revenue']))
result = df5.groupby('region')['log_revenue'].mean()

# --- With method chaining ---
result = (
    pd.read_csv('raw.csv')
    .rename(columns={'rev': 'revenue'})
    .dropna(subset=['revenue'])
    .query('revenue > 0')
    .assign(log_revenue=lambda d: np.log1p(d['revenue']))
    .groupby('region')['log_revenue']
    .mean()
)

# --- df.pipe() for custom functions ---
def remove_outliers(df, col, n_std=3):
    mean, std = df[col].mean(), df[col].std()
    return df[(df[col] - mean).abs() < n_std * std]

def add_rank(df, col):
    df = df.copy()
    df['rank'] = df[col].rank(ascending=False)
    return df

result = (
    pd.read_csv('raw.csv')
    .pipe(remove_outliers, col='revenue')
    .pipe(add_rank, col='revenue')
)
# pipe passes the DataFrame as the first argument to the function

df.pipe(func, *args, **kwargs) calls func(df, *args, **kwargs), inserting the DataFrame at the front of the argument list. This lets you write standalone functions and use them inline in a method chain without breaking the fluent style.
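
A self-contained sketch of that equivalence. The helpers add_margin and keep_top and the orders table are invented for illustration:

```python
import pandas as pd

def add_margin(df, cost_col, price_col):
    """Return a copy with a 'margin' column (hypothetical helper)."""
    return df.assign(margin=df[price_col] - df[cost_col])

def keep_top(df, col, n=2):
    """Return the n rows with the largest values in col (hypothetical helper)."""
    return df.nlargest(n, col)

orders = pd.DataFrame({'cost': [5, 8, 2, 9], 'price': [9, 10, 7, 10]})

chained = (
    orders
    .pipe(add_margin, cost_col='cost', price_col='price')
    .pipe(keep_top, col='margin', n=2)
)

# pipe is only call syntax: this nested form is the same computation
nested = keep_top(add_margin(orders, 'cost', 'price'), 'margin', n=2)
print(chained)
```

The chained form reads in execution order, while the nested form has to be read inside-out. Both helpers return new objects, so the original orders frame is left unchanged.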

What does df.pipe(my_func, extra_arg=5) do?
What is the main readability advantage of method chaining over using intermediate variables?
32. What does a typical exploratory data analysis (EDA) workflow look like in Python?

EDA is the first thing you do with a new dataset before any modelling. The goal is to understand the data's structure, quality, and relationships, and to spot problems (wrong dtypes, missing values, outliers, data leakage) before they propagate into a model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load and inspect
df = pd.read_csv('housing.csv')
print(df.shape)          # (rows, cols)
print(df.dtypes)         # types per column
print(df.head())         # first 5 rows
df.info()                # dtypes + non-null counts (prints directly, returns None)
print(df.describe())     # summary stats for numeric cols

# 2. Missing value audit
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

# 3. Duplicate rows
print(df.duplicated().sum())
df = df.drop_duplicates()

# 4. Distribution of each numeric column
df.select_dtypes('number').hist(bins=30, figsize=(16, 10))
plt.tight_layout(); plt.show()

# 5. Correlation heatmap
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix'); plt.show()

# 6. Target variable distribution
target = 'price'
sns.histplot(df[target], kde=True)
print(f'Skewness: {df[target].skew():.2f}')

# 7. Categorical breakdown
for col in df.select_dtypes('object').columns:
    print(df[col].value_counts())

# 8. Outlier detection
for col in df.select_dtypes('number').columns:
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    n_out = ((df[col] < Q1-1.5*IQR)|(df[col] > Q3+1.5*IQR)).sum()
    if n_out > 0: print(f'{col}: {n_out} outliers')

EDA is iterative — findings in step 4 send you back to step 2, insights in the correlation matrix raise questions answered by group analysis. Keep a notebook with your observations alongside the code so you and your team can understand what was found and why certain preprocessing decisions were made.

What is the first thing you should check after loading a dataset into a DataFrame?
Why should you investigate a highly skewed target variable before training a regression model?
33. How do you stack, concatenate, and split NumPy arrays?

Combining and splitting arrays is a frequent operation in data preprocessing — assembling feature matrices from multiple sources, or splitting a dataset into folds for cross-validation.

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# --- Concatenating along existing axes ---
np.concatenate([a, b], axis=0)  # stack rows (vertical)
# [[1 2]
#  [3 4]
#  [5 6]
#  [7 8]]

np.concatenate([a, b], axis=1)  # stack columns (horizontal)
# [[1 2 5 6]
#  [3 4 7 8]]

# --- Convenience stacking functions ---
np.vstack([a, b])   # vertical stack — same as axis=0
np.hstack([a, b])   # horizontal stack — same as axis=1 for 2-D
np.dstack([a, b])   # depth stack (creates a 3rd axis)

# stack — creates a NEW axis (different from concatenate!)
np.stack([a, b], axis=0)   # shape (2, 2, 2)
np.stack([a, b], axis=2)   # shape (2, 2, 2) — depth

# --- Splitting ---
big = np.arange(12).reshape(6, 2)
parts = np.vsplit(big, 3)    # split into 3 equal arrays along axis 0
# [array([[0, 1], [2, 3]]), array([[4, 5], [6, 7]]), array([[8, 9], [10, 11]])]

# Split at specific indices
parts = np.split(big, [2, 4], axis=0)  # [0:2], [2:4], [4:]

# Tile — repeat an array
np.tile(a, (2, 3))   # repeat a 2 times along rows, 3 times along cols
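
The shape difference between joining and stacking, plus a split that tolerates a remainder, in one short sketch. np.array_split is used here as an addition to the functions listed above, for the case where the array does not divide evenly.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

joined  = np.concatenate([a, b], axis=0)  # extends an existing axis
stacked = np.stack([a, b], axis=0)        # adds a new leading axis
print(joined.shape, stacked.shape)

# np.split / np.vsplit need an exact division; np.array_split tolerates a remainder
folds = np.array_split(np.arange(10), 3)
print([len(f) for f in folds])
```

After np.stack, stacked[0] and stacked[1] are the original a and b, which is convenient when the inputs are separate samples rather than pieces of one table.
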
What is the key difference between np.stack([a,b]) and np.concatenate([a,b])?
Which function splits a 2-D array into N equal parts along axis 0?
34. How do you detect and remove duplicate rows in a Pandas DataFrame?

Duplicate rows silently inflate counts, distort means, and can cause data leakage between training and test sets. Pandas provides duplicated() and drop_duplicates() for systematic duplicate management.

import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, 2, 3, 4, 4],
    'product':  ['A', 'B', 'B', 'C', 'D', 'D'],
    'amount':   [100, 200, 200, 150, 80, 90],   # last pair differs!
})

# --- Detecting duplicates ---
df.duplicated()               # True for each repeat after the first occurrence
df.duplicated(keep='last')    # True for each repeat before the last occurrence
df.duplicated(keep=False)     # True for every row that has a duplicate

print(df.duplicated().sum())  # count of duplicate rows

# Duplicate check on a subset of columns only
df.duplicated(subset=['order_id', 'product'])
# True where order_id AND product are repeated (ignores amount diff)

# --- Removing duplicates ---
df.drop_duplicates()          # removes all but first occurrence
df.drop_duplicates(keep='last')  # keeps last occurrence
df.drop_duplicates(keep=False)   # removes all occurrences of any duplicate

# Subset-based deduplication — keep first by order_id
df.drop_duplicates(subset=['order_id'], keep='first')

# Sort before deduplicating to control which row is 'first'
# (e.g., keep the highest amount per order)
df.sort_values('amount', ascending=False).drop_duplicates(subset=['order_id'])

When deduplicating on a subset of columns, think carefully about which row to keep. Sorting the DataFrame first (by timestamp, version, or a quality metric) ensures drop_duplicates(keep='first') retains the most appropriate record, not just whatever happened to be first in the file.
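
As a sketch of the sort-then-deduplicate pattern with a timestamp, using an invented customer table (column names are assumptions for the example):

```python
import pandas as pd

# Hypothetical customer records with repeated ids
records = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3],
    'email':       ['old@a.com', 'new@a.com', 'b@b.com', 'b2@b.com', 'c@c.com'],
    'updated_at':  pd.to_datetime(
        ['2024-01-01', '2024-03-01', '2024-02-10', '2024-01-05', '2024-02-01']
    ),
})

latest = (
    records
    .sort_values('updated_at', ascending=False)   # newest first
    .drop_duplicates(subset='customer_id', keep='first')
    .sort_values('customer_id')
    .reset_index(drop=True)
)
print(latest)
```

Each customer keeps the most recently updated row, regardless of where that row sat in the original file.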

What does df.duplicated(keep=False) return?
How do you keep only the row with the highest amount for each order_id when deduplicating?
35. How do you control colours and colour palettes in Matplotlib and Seaborn?

Colour is one of the most impactful design decisions in a chart. Used correctly it encodes information; used poorly it confuses or misleads. Both Matplotlib and Seaborn give you fine-grained control.

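import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Sample data so the snippets below run (any arrays, matrix, or DataFrame would do)
x = np.linspace(0, 10, 100)
y = np.sin(x)
matrix = np.random.rand(5, 5)
df = sns.load_dataset('tips')
corr = df.select_dtypes('number').corr()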

# --- Matplotlib colour specifications ---
# Named CSS colours
plt.plot(x, y, color='steelblue')
# Hex string
plt.plot(x, y, color='#2E86AB')
# RGB tuple (values 0-1)
plt.plot(x, y, color=(0.18, 0.52, 0.67))
# Grayscale string
plt.plot(x, y, color='0.5')   # 50% grey

# --- Colormaps for continuous data ---
im = plt.imshow(matrix, cmap='viridis')   # perceptually uniform
plt.colorbar(im)
# Other good cmaps: 'plasma', 'inferno', 'magma' (sequential)
# 'RdBu', 'coolwarm', 'bwr' (diverging — centred on 0)
# 'tab10', 'Set1', 'Set2' (categorical)

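# --- Seaborn palettes ---
# (seaborn >= 0.13 warns when palette= is passed without hue=;
#  there, add hue='day', legend=False to the barplot calls below)
# Categorical (qualitative)
sns.barplot(data=df, x='day', y='tip', palette='Set2')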

# Sequential (one colour family)
sns.barplot(data=df, x='day', y='tip', palette='Blues_d')

# Diverging (two colour families around a midpoint)
sns.heatmap(corr, cmap='coolwarm', vmin=-1, vmax=1, center=0)

# Custom palette
custom = ['#E63946', '#457B9D', '#1D3557', '#A8DADC']
sns.barplot(data=df, x='day', y='tip', palette=custom)

# Preview a palette
sns.palplot(sns.color_palette('husl', 8))

Always use perceptually uniform colormaps (viridis, plasma) for continuous data — rainbow/jet maps are misleading because they are not perceptually linear (the eye perceives the yellow band as brighter than the blue or red bands, creating false visual contrast). For diverging data (correlation matrices, residuals) use a diverging colormap centred on zero.

Why should you avoid the 'jet' (rainbow) colormap for continuous data?
Which type of colormap should you use for a correlation matrix where values range from -1 to +1?
36. How do rolling and expanding window functions work in Pandas?

Window functions compute statistics over a sliding or expanding subset of rows, essential for time-series smoothing, trend detection, and feature engineering. Unlike groupby aggregations, window functions return a result for every row, preserving the original index.

import pandas as pd
import numpy as np

ts = pd.DataFrame({
    'date':  pd.date_range('2024-01-01', periods=10, freq='D'),
    'sales': [100, 120, 90, 150, 200, 130, 110, 180, 160, 140],
})
ts = ts.set_index('date')

# --- Rolling window (fixed-size, slides one step at a time) ---
ts['ma3']   = ts['sales'].rolling(window=3).mean()  # 3-day moving avg
ts['std3']  = ts['sales'].rolling(window=3).std()
ts['min3']  = ts['sales'].rolling(window=3).min()

# First window-1 values are NaN (not enough history)
# min_periods: require fewer observations before computing
ts['ma3_mp'] = ts['sales'].rolling(window=3, min_periods=1).mean()

# --- Expanding window (grows to include all rows so far) ---
ts['cum_max']  = ts['sales'].expanding().max()
ts['cum_mean'] = ts['sales'].expanding().mean()

# --- Exponentially weighted moving average (more weight on recent data) ---
ts['ewma'] = ts['sales'].ewm(span=3).mean()

# --- Lag / shift features (common in time-series forecasting) ---
ts['lag1'] = ts['sales'].shift(1)   # yesterday's sales
ts['lag7'] = ts['sales'].shift(7)   # last week's sales
ts['pct_change'] = ts['sales'].pct_change()  # % change from previous row

Moving averages (rolling mean) smooth out noise to reveal trends. Exponentially weighted moving averages give more influence to recent observations, making them responsive to recent changes while still smoothing. Lag features turn a time-series prediction problem into a supervised learning problem where past values predict future ones.
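To make the weighting concrete, here is a small hand check on the first five sales values. It assumes the pandas defaults for ewm (adjust=True), where span=3 corresponds to alpha = 2 / (span + 1) = 0.5 and each older observation gets its weight multiplied by (1 - alpha).

```python
import numpy as np
import pandas as pd

sales = pd.Series([100, 120, 90, 150, 200], dtype=float)

# Rolling mean: the third row is the plain average of the first three values
roll = sales.rolling(window=3).mean()
manual_roll = (100 + 120 + 90) / 3

# EWMA with span=3 -> alpha = 0.5 (pandas default adjust=True)
span = 3
alpha = 2 / (span + 1)
ewma = sales.ewm(span=span).mean()

# Manual EWMA for the last row: the newest value has weight 1,
# the one before it (1 - alpha), then (1 - alpha)**2, and so on.
weights = (1 - alpha) ** np.arange(len(sales))[::-1]
manual_ewma_last = (weights * sales.to_numpy()).sum() / weights.sum()

print(roll.tolist())
print(ewma.iloc[-1], manual_ewma_last)
```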

Why do the first (window - 1) rows of a rolling() calculation contain NaN?
What does ts['sales'].shift(1) produce?
37. How do Seaborn jointplot and pairplot help explore multivariate relationships?

When you have more than one numeric variable, the next step after individual histograms is to understand relationships between pairs. Seaborn's jointplot and pairplot automate this exploration with minimal code.

import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset('penguins').dropna()

# --- jointplot: one pair of variables ---
# Scatter + marginal distributions (with hue, the marginals are drawn as KDE curves)
sns.jointplot(data=penguins, x='bill_length_mm', y='bill_depth_mm',
              hue='species', height=6)

# Regression + 95% confidence interval
sns.jointplot(data=penguins, x='flipper_length_mm', y='body_mass_g',
              kind='reg', height=6)

# Hex bins — better than scatter for large datasets with overplotting
sns.jointplot(data=penguins, x='flipper_length_mm', y='body_mass_g',
              kind='hex', height=6)

# KDE — smooth 2-D density
sns.jointplot(data=penguins, x='bill_length_mm', y='bill_depth_mm',
              kind='kde', fill=True, height=6)

# --- pairplot: all pairs + diagonal histograms ---
# Standard scatter matrix
sns.pairplot(penguins, hue='species',
             diag_kind='kde',         # diagonal: KDE instead of histogram
             plot_kws={'alpha': 0.5},  # semi-transparent points
             height=2.5)
plt.suptitle('Penguin Feature Pairs', y=1.02)
plt.show()

# Subset of columns only
cols = ['bill_length_mm', 'flipper_length_mm', 'body_mass_g']
sns.pairplot(penguins[cols + ['species']], hue='species')

Use jointplot when you want to focus deeply on one specific pair of variables with marginal distributions visible. Use pairplot for a broad overview of all pairwise relationships in a dataset with up to ~10 variables — beyond that the grid becomes too small to read meaningfully.

What does the diag_kind='kde' argument to sns.pairplot() control?
When is kind='hex' in sns.jointplot() preferred over kind='scatter'?
38. What are the key performance tips when using NumPy for large-scale data processing?

NumPy is fast by default, but a few common mistakes can undermine that speed. Knowing these patterns makes the difference between code that runs in seconds and code that runs in minutes.

import numpy as np

n = 10_000_000
rng = np.random.default_rng(42)  # seeded generator, reused for A and B below
arr = rng.random(n)

# 1. AVOID Python loops — always prefer ufuncs
# Slow:
result = [x**2 for x in arr]        # Python loop, ~3s
# Fast:
result = arr ** 2                    # NumPy ufunc, ~0.03s

# 2. Pre-allocate output arrays instead of growing them
chunks = np.array_split(arr, 100)    # 100 equal-sized views into arr
# Slow:
out = np.array([])
for chunk in chunks:
    out = np.append(out, chunk.sum())  # np.append copies the whole array each time
# Fast:
out = np.empty(len(chunks))
for i, chunk in enumerate(chunks):
    out[i] = chunk.sum()

# 3. Use views instead of copies when slicing
sub = arr[1000:2000]   # view — no memory allocation
sub2 = arr[1000:2000].copy()  # explicit copy — only when mutation safety needed

# 4. Choose the right dtype — float32 vs float64
a64 = np.ones(n, dtype=np.float64)  # 80 MB
a32 = np.ones(n, dtype=np.float32)  # 40 MB — also faster on many ops

# 5. Use out= argument to avoid temporary arrays
np.add(a32, a32, out=a32)   # in-place: no temporary intermediate created

# 6. np.einsum for complex multi-dimensional contractions
A = rng.random((100, 200))
B = rng.random((200, 300))
C = np.einsum('ij,jk->ik', A, B)  # equivalent to A @ B but explicit

The most impactful optimisation in almost every case is the first: eliminating Python loops. After that, reducing the number of temporary arrays (using out= or in-place operators like +=) and choosing smaller dtypes are the next biggest wins.
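The view, dtype, and out= points can be checked directly rather than taken on trust. This is a minimal sketch with a smaller array than the example above so it runs quickly; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random(1_000_000)

# Slicing gives a view that shares memory with the original; .copy() does not
view = arr[1000:2000]
copy = arr[1000:2000].copy()
shares_view = np.shares_memory(arr, view)
shares_copy = np.shares_memory(arr, copy)

# Memory footprint by dtype, in MB (1e6 bytes)
mb64 = np.ones(1_000_000, dtype=np.float64).nbytes / 1e6
mb32 = np.ones(1_000_000, dtype=np.float32).nbytes / 1e6

# out= writes the result into an existing array and returns that same array
a = np.ones(5, dtype=np.float32)
result = np.add(a, a, out=a)
same_object = result is a

print(shares_view, shares_copy, mb64, mb32, same_object)
```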

What is the most impactful NumPy performance optimisation in most cases?
What does the out= parameter do in np.add(a, b, out=a)?
39. How do you visualise regression results and residuals using Seaborn and Matplotlib?

After fitting a regression model, always inspect the residuals (actual minus predicted values). Patterns in the residuals reveal violations of the model's assumptions: non-linearity, heteroscedasticity, or non-normality of errors.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Simulate some data with a non-linear relationship
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2 * x + 0.5 * x**2 + rng.normal(0, 3, 200)
df = pd.DataFrame({'x': x, 'y': y})

# 1. Scatter + regression line (with confidence interval)
sns.regplot(data=df, x='x', y='y', ci=95, scatter_kws={'alpha': 0.4})
plt.title('Scatter with OLS Regression Line')
plt.show()

# 2. Residual plot, built into seaborn (lowess=True needs statsmodels installed)
sns.residplot(data=df, x='x', y='y', lowess=True,
              scatter_kws={'alpha': 0.4})
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs x (lowess smoothed trend)')
plt.show()
# A horizontal band around 0 = good; a curve = model is missing non-linearity

# 3. Manual residuals (after sklearn model)
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(df[['x']], df['y'])
df['predicted'] = model.predict(df[['x']])
df['residual']  = df['y'] - df['predicted']

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(df['predicted'], df['residual'], alpha=0.4)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set(xlabel='Fitted Values', ylabel='Residuals',
            title='Residuals vs Fitted')
sns.histplot(df['residual'], kde=True, ax=axes[1])
axes[1].set_title('Residual Distribution')
plt.tight_layout(); plt.show()

The two most diagnostic residual plots are: (1) Residuals vs Fitted — should be a random horizontal band; any curve indicates missing predictors or a need for feature transformation. (2) Residual histogram — should be approximately normal; heavy tails suggest outliers or a non-Gaussian error structure.
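The same diagnosis can be made numerically. The sketch below reuses the simulated data from the example and swaps in np.polyfit for the scikit-learn model purely to keep it dependency-free. Residuals from a straight-line fit are uncorrelated with x by construction, but they still correlate with x squared, which is the curve you would see in the plot.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2 * x + 0.5 * x**2 + rng.normal(0, 3, 200)

# Straight-line fit: misses the quadratic term
lin_coef = np.polyfit(x, y, deg=1)
resid_lin = y - np.polyval(lin_coef, x)

# Quadratic fit: matches the true structure of the simulated data
quad_coef = np.polyfit(x, y, deg=2)
resid_quad = y - np.polyval(quad_coef, x)

corr_with_x = np.corrcoef(resid_lin, x)[0, 1]      # ~0 by construction of OLS
corr_with_x2 = np.corrcoef(resid_lin, x**2)[0, 1]  # leftover structure

print(f"linear fit residual std:    {resid_lin.std():.2f}")
print(f"quadratic fit residual std: {resid_quad.std():.2f}")
print(f"corr(resid, x) = {corr_with_x:.3f}, corr(resid, x^2) = {corr_with_x2:.3f}")
```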

What pattern in a residuals-vs-fitted plot indicates the model is missing non-linear structure?
What does the lowess=True argument add to sns.residplot()?
40. How do you process large CSV files that don't fit in memory using Pandas?

When a CSV is larger than available RAM, loading it with a plain pd.read_csv can raise a MemoryError or stall the machine. Pandas offers three strategies: chunking, selective loading, and dtype optimisation.

import pandas as pd
import numpy as np

# --- Strategy 1: Read only necessary columns and rows ---
df = pd.read_csv(
    'big_log.csv',
    usecols=['timestamp', 'user_id', 'event', 'amount'],  # skip unneeded cols
    dtype={'user_id': 'int32', 'amount': 'float32'},       # smaller dtypes
    parse_dates=['timestamp'],
    nrows=500_000,   # read a sample first for exploration
)

# --- Strategy 2: Process in chunks ---
chunk_size = 100_000
results = []

for chunk in pd.read_csv('big_log.csv', chunksize=chunk_size,
                          usecols=['user_id', 'amount']):
    # Process each chunk independently
    summary = chunk.groupby('user_id')['amount'].sum()
    results.append(summary)

# Combine partial results
final = pd.concat(results).groupby(level=0).sum()

# --- Strategy 3: Filter while reading with chunksize ---
high_value_chunks = []
for chunk in pd.read_csv('big_log.csv', chunksize=chunk_size):
    filtered = chunk[chunk['amount'] > 1000]
    high_value_chunks.append(filtered)
high_value_df = pd.concat(high_value_chunks, ignore_index=True)

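# --- Alternative: Parquet format (usually much faster to read than CSV) ---
# Convert once. Note: df here is the 500k-row sample from Strategy 1;
# a file larger than RAM has to be converted in pieces or with another tool.
df.to_parquet('big_log.parquet', index=False)
# Then read efficiently: Parquet supports column projection and row filters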
import pyarrow.parquet as pq
table = pq.read_table('big_log.parquet',
                       columns=['user_id', 'amount'],
                       filters=[('amount', '>', 1000)])

For truly large-scale work (tens of GB), consider switching from CSV to Parquet (columnar, compressed, fast column projection) and using Dask or Polars instead of Pandas. Both offer lazy execution (Dask by default, Polars through its lazy API) and can process data in pieces rather than loading everything into memory at once.
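A quick way to convince yourself that chunked aggregation gives the same answer as loading everything is to try it on a small file. The sketch below writes a throwaway CSV to a temporary directory (the file name and sizes are arbitrary) and compares the two results.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
full = pd.DataFrame({
    "user_id": rng.integers(1, 6, 1_000),
    "amount": rng.uniform(0, 100, 1_000).round(2),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_log.csv")  # throwaway file for the demo
    full.to_csv(path, index=False)

    # Each chunk is an ordinary DataFrame of at most 250 rows
    with pd.read_csv(path, chunksize=250) as reader:
        partials = [chunk.groupby("user_id")["amount"].sum() for chunk in reader]

# A user can appear in several chunks, so the partial sums are summed again
chunked_total = pd.concat(partials).groupby(level=0).sum()
direct_total = full.groupby("user_id")["amount"].sum()

print(np.allclose(chunked_total.to_numpy(), direct_total.to_numpy()))
```

This two-stage pattern works for sums, counts, minima, and maxima. A mean needs the sum and the count carried separately and divided at the end.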

What does chunksize=100_000 do when passed to pd.read_csv?
Why is the Parquet format preferable to CSV for large analytical datasets?
41. How do you add annotations and text to Matplotlib charts?

Annotations turn a chart into a story — highlighting a key data point, marking a threshold, or labelling significant events on a timeline. Matplotlib provides ax.annotate() for arrow-and-text annotations and ax.text() for free-form text placement.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
y = np.sin(x) * np.exp(-x / 5)

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(x, y, color='steelblue', linewidth=2)

# Find and annotate the maximum
peak_idx = np.argmax(y)
px, py   = x[peak_idx], y[peak_idx]

ax.annotate(
    f'Peak: ({px:.2f}, {py:.2f})',
    xy=(px, py),              # point to annotate
    xytext=(px + 1.5, py),    # where the text goes
    arrowprops=dict(
        arrowstyle='->',
        color='darkred',
        lw=1.5,
    ),
    fontsize=11,
    color='darkred',
)

# Free-form text label
ax.text(0.5, 0.9, 'Damped oscillation',
        transform=ax.transAxes,   # axes-relative coords (0–1)
        fontsize=12, ha='center',
        bbox=dict(boxstyle='round,pad=0.3', fc='lightyellow', ec='grey'))

# Threshold line with label
ax.axhline(y=0.5, color='orange', linestyle='--', linewidth=1)
ax.text(9.5, 0.52, 'threshold=0.5', color='orange', ha='right', fontsize=9)

ax.set(title='Annotated Damped Sine', xlabel='x', ylabel='y')
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout(); plt.show()

The two coordinate systems matter: xy in annotate uses data coordinates by default (values from your actual data range). Passing transform=ax.transAxes to ax.text() switches to axes-fraction coordinates (0,0 = bottom-left, 1,1 = top-right) — useful for fixed-position labels that stay put when the data range changes.
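The difference between the two coordinate systems shows up when the axis limits change. This sketch converts one axes-fraction point and one data point to pixel positions before and after widening the x-range; the specific points and limits are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot([0, 10], [0, 1])
ax.set_xlim(0, 10)
ax.set_ylim(0, 1)

axes_point = (0.5, 0.9)   # axes-fraction coordinates
data_point = (5.0, 0.5)   # data coordinates

axes_px_before = ax.transAxes.transform(axes_point)
data_px_before = ax.transData.transform(data_point)

ax.set_xlim(0, 100)       # widen the data range

axes_px_after = ax.transAxes.transform(axes_point)
data_px_after = ax.transData.transform(data_point)
plt.close(fig)

print("axes-fraction point moved:", not np.allclose(axes_px_before, axes_px_after))
print("data point moved:         ", not np.allclose(data_px_before, data_px_after))
```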

What does transform=ax.transAxes do when passed to ax.text()?
In ax.annotate(), what do the xy and xytext arguments control?
42. How do you quickly extract top/bottom rows and random samples from a Pandas DataFrame?

During EDA you often need to inspect extremes (the highest-revenue customers, the worst-performing products) or draw a random sample for quick analysis. Pandas provides concise methods for each of these.

import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'product': [f'P{i}' for i in range(100)],
    'revenue': rng.integers(1_000, 100_000, 100),
    'returns': rng.integers(0, 500, 100),
})

# --- Top and bottom N rows ---
df.nlargest(5, 'revenue')   # 5 highest revenue products
df.nsmallest(5, 'revenue')  # 5 lowest revenue products

# Multiple columns — break ties by second column
df.nlargest(5, ['revenue', 'returns'])

# --- Random sampling ---
df.sample(n=10, random_state=42)        # 10 random rows
df.sample(frac=0.1, random_state=42)    # 10% of rows
df.sample(n=10, replace=True)           # with replacement (bootstrapping)

# Stratified sample — same proportion from each category
df['tier'] = pd.cut(df['revenue'], bins=3, labels=['low','mid','high'])
stratified = df.groupby('tier', group_keys=False, observed=True).apply(
    lambda g: g.sample(frac=0.1, random_state=42)
)

# --- Head, tail, every Nth row ---
df.head(10)       # first 10 rows
df.tail(10)       # last 10 rows
df.iloc[::5]      # every 5th row — useful for large datasets

nlargest and nsmallest are typically faster than sort_values(...).head(n) on large DataFrames when n is small, because they only have to select the n extreme rows rather than order every row. A partial selection costs at most about O(N log k), versus O(N log N) for a full sort. Use them whenever you only need the extremes, not a fully sorted result.
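The two approaches return the same rows, which a small check confirms. The data here is synthetic, with unique revenue values so that ties do not affect the comparison.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "product": [f"P{i}" for i in range(1_000)],
    "revenue": rng.permutation(1_000) * 10,   # unique values, so no ties
})

top_select = df.nlargest(5, "revenue")
top_sorted = df.sort_values("revenue", ascending=False).head(5)

same = top_select.equals(top_sorted)
print(same)
print(top_select["revenue"].tolist())
```

To compare speed on your own data, time both lines with timeit; the gap depends on the number of rows and on n.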

Why is df.nlargest(10, 'revenue') faster than df.sort_values('revenue', ascending=False).head(10) on a large DataFrame?
Which argument in df.sample() sets the proportion of rows to return?
43. How is NumPy linear algebra used in data science applications?

Linear algebra underpins almost all of machine learning — from computing gradients to PCA to solving systems of equations. NumPy's linalg submodule provides production-grade implementations of the core operations.

import numpy as np

# --- Solving a system of linear equations: Ax = b ---
# 2x + y = 8
# x + 3y = 11
A = np.array([[2, 1], [1, 3]])
b = np.array([8, 11])
x = np.linalg.solve(A, b)
print(x)   # [2.6  2.8]  — verify: A @ x ≈ b

# --- Matrix decompositions ---
M = np.array([[3, 1], [1, 3]], dtype=float)

# Eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eig(M)
# eigenvalues = [4. 2.], eigenvectors (columns) = principal directions

# Singular Value Decomposition — used in PCA, recommendation systems
X = np.random.default_rng(42).random((100, 5))   # 100 samples, 5 features
X -= X.mean(axis=0)                               # centre
U, S, Vt = np.linalg.svd(X, full_matrices=False)
# S = singular values (square roots of eigenvalues of X^T X)
# Vt rows = principal components
# Project onto first 2 components:
X_pca = X @ Vt[:2].T    # shape (100, 2)

# --- Norms ---
v = np.array([3.0, 4.0])
np.linalg.norm(v)        # 5.0 — L2 norm
np.linalg.norm(v, ord=1) # 7.0 — L1 norm

# --- Matrix rank, determinant, inverse ---
np.linalg.matrix_rank(A)
np.linalg.det(A)
np.linalg.inv(A)   # only for square non-singular matrices
np.linalg.pinv(A)  # Moore-Penrose pseudoinverse; also handles non-square or singular matrices

SVD is the engine behind PCA: the right singular vectors (rows of Vt) are the principal components, and the squared singular values, divided by n - 1 for n samples, give the variance each component explains. Using full_matrices=False (economy SVD) matters for tall matrices because it skips computing the large, unused portion of U.
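The link between singular values and explained variance can be verified against the covariance matrix. A minimal sketch on the same kind of random data as above: the squared singular values divided by n - 1 should match the eigenvalues of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((100, 5))            # 100 samples, 5 features
Xc = X - X.mean(axis=0)             # centre each feature

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance explained by each component, and its share of the total
explained_var = S**2 / (Xc.shape[0] - 1)
explained_ratio = explained_var / explained_var.sum()

# Cross-check: eigenvalues of the sample covariance matrix, largest first
cov_eigs = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

print(np.allclose(explained_var, cov_eigs))
print(explained_ratio.round(3))
```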

In PCA implemented via SVD, what do the rows of the Vt matrix represent?
Which NumPy function solves the linear system Ax = b without computing the inverse of A?
44. How do you compare distributions across categories using Seaborn categorical plots?

Comparing how a numeric variable's distribution differs across groups is one of the most common analytical tasks. Seaborn's categorical plot family gives you progressively more information from left to right: bar (mean only) → box (five-number summary) → violin (full distribution shape) → strip/swarm (individual points).

import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Bar plot — mean + 95% CI error bars
sns.barplot(data=tips, x='day', y='tip', hue='sex',
            palette='Set2', ax=axes[0, 0])
axes[0, 0].set_title('Mean Tip by Day and Sex')

# Box plot — median, IQR, whiskers, outlier dots
sns.boxplot(data=tips, x='day', y='total_bill', hue='smoker',
            palette='pastel', ax=axes[0, 1])
axes[0, 1].set_title('Total Bill Distribution by Day and Smoker')

# Violin plot — box + KDE combined
sns.violinplot(data=tips, x='day', y='tip',
               inner='quartile',   # show quartile lines inside
               palette='muted', ax=axes[1, 0])
axes[1, 0].set_title('Tip Violin by Day')

# Strip + box overlay — all points + summary
sns.boxplot(data=tips, x='time', y='tip', color='lightblue',
            ax=axes[1, 1], width=0.4)
sns.stripplot(data=tips, x='time', y='tip', color='navy',
              alpha=0.4, jitter=True, ax=axes[1, 1])
axes[1, 1].set_title('Tip by Time — Box + All Points')

plt.tight_layout(); plt.show()

# Figure-level catplot for easy faceting
sns.catplot(data=tips, x='day', y='tip', hue='sex',
            col='time', kind='violin', height=5, aspect=0.8)

When to use each: bar plots are fine for comparing means but hide distributional information. Box plots add spread and outliers. Violin plots reveal multi-modality (two bumps indicating two groups within a category). Strip/swarm overlays add individual points, essential for small datasets where a box plot can be misleading with n < 30.
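A tiny numeric example shows why a bar of means can hide what a box or violin would reveal. The two groups below are invented so that their means are identical while their spreads are very different.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    # A is tightly clustered around 5; B is split between low and high values
    "value": [4.5, 5, 5, 5, 5.5, 1, 1, 5, 9, 9],
})

# describe() gives the numbers a box plot draws: quartiles, median, min, max
summary = df.groupby("group")["value"].describe()
iqr = summary["75%"] - summary["25%"]

print(summary[["mean", "25%", "50%", "75%"]])
print(iqr)
```

A bar plot of these two groups would show two equal bars; the interquartile range makes the difference obvious.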

What does inner='quartile' display inside a Seaborn violin plot?
Why is overlaying a strip plot on a box plot particularly useful for small datasets?
45. How do you build an end-to-end data cleaning and visualisation pipeline with NumPy, Pandas, and Seaborn?

Combining all three libraries in a coherent pipeline is what data science interviews and take-home assignments test. Below is a realistic miniature pipeline that demonstrates the key integration points.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style='whitegrid', context='notebook')

# --- 1. Load ---
df = pd.read_csv('customer_orders.csv', parse_dates=['order_date'])

# --- 2. Audit ---
print(df.info())
print(df.isnull().sum())
print(df.describe())

# --- 3. Clean ---
df = (
    df
    .drop_duplicates(subset=['order_id'])
    .dropna(subset=['customer_id', 'amount'])
    .assign(
        amount=lambda d: pd.to_numeric(d['amount'], errors='coerce'),
        category=lambda d: d['category'].str.strip().str.title().astype('category'),
        year=lambda d: d['order_date'].dt.year,
        month=lambda d: d['order_date'].dt.month,
    )
    .dropna(subset=['amount'])
    .query('amount > 0')
)

# --- 4. Feature engineering (NumPy) ---
amounts = df['amount'].to_numpy()
df['log_amount']  = np.log1p(amounts)       # log1p avoids log(0)
df['amount_zscore'] = (amounts - amounts.mean()) / amounts.std()

# --- 5. Aggregate ---
monthly = (
    df.groupby(['year', 'month', 'category'])
    .agg(total=('amount', 'sum'), orders=('order_id', 'count'))
    .reset_index()
)

# --- 6. Visualise ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Revenue distribution by category
sns.boxplot(data=df, x='category', y='log_amount', ax=axes[0])
axes[0].set(title='Log Revenue by Category', xlabel='', ylabel='log(1+amount)')

# Monthly trend
df['period'] = df['order_date'].dt.to_period('M').astype(str)
trend = df.groupby('period')['amount'].sum().reset_index()
axes[1].plot(trend['period'], trend['amount'], marker='o', linewidth=2)
axes[1].tick_params(axis='x', rotation=45)
axes[1].set(title='Monthly Revenue Trend', xlabel='Month', ylabel='Revenue')

plt.tight_layout()
plt.savefig('dashboard.png', dpi=150, bbox_inches='tight')
plt.show()

The key integration patterns here: Pandas for all tabular operations (load, clean, aggregate), NumPy for numerical transformations on raw arrays (.to_numpy() → vectorised ops), and Seaborn/Matplotlib for visualisation. The method-chain style in the cleaning step makes the transformations readable as a pipeline.
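Two details of the pipeline are easy to check in isolation: how log1p behaves at zero, and what reset_index() changes after a grouped aggregation. The small frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# log1p(x) = log(1 + x): defined at 0, and expm1 inverts it
amounts = np.array([0.0, 9.0, 99.0])
logged = np.log1p(amounts)
restored = np.expm1(logged)

# After groupby().agg(), the group keys live in the index...
orders = pd.DataFrame({
    "year": [2024, 2024, 2024],
    "category": ["Books", "Books", "Toys"],
    "order_id": [1, 2, 3],
    "amount": [10.0, 15.0, 7.5],
})
grouped = orders.groupby(["year", "category"]).agg(
    total=("amount", "sum"), orders=("order_id", "count")
)
# ...and reset_index() turns them back into ordinary columns
flat = grouped.reset_index()

print(logged)
print(grouped.index.names, list(flat.columns))
```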

Why is np.log1p(x) preferred over np.log(x) for monetary or count data?
In the pipeline above, what does .reset_index() do after groupby().agg()?