Python / Data Science Essentials Interview Questions

1. What is NumPy and why is it significantly faster than plain Python lists for numerical work?
2. What are the main ways to create NumPy arrays?
3. How do NumPy array shape, reshape, and axis work?
4. What is NumPy broadcasting and how does it work?
5. How does NumPy boolean masking and fancy indexing work?
6. What are the most commonly used NumPy mathematical functions in data science?
7. What is a Pandas DataFrame and how does it differ from a NumPy array?
8. How do you read CSV, Excel, and JSON files into a Pandas DataFrame?
9. What is the difference between df.loc[] and df.iloc[] in Pandas?
10. How do you detect, handle, and fill missing values in a Pandas DataFrame?
11. What are the different ways to filter rows in a Pandas DataFrame?
12. How does Pandas groupby work and what aggregation patterns are most useful?
13. How do you merge and join DataFrames in Pandas, and what do the different join types mean?
14. When should you use df.apply() versus vectorised Pandas operations?
15. How do you use pd.pivot_table to summarise data?
16. How do you perform string operations on Pandas DataFrame columns?
17. How do you work with dates and times in Pandas?
18. What is Matplotlib and what are the key components of a figure?
19. What are the most common chart types in Matplotlib and when do you use each?
20. How do you create multi-panel figures with Matplotlib subplots?
21. What is Seaborn and how does it differ from Matplotlib?
22. What are the most important Seaborn plot types for exploratory data analysis?
23. How do you create and interpret a correlation heatmap with Seaborn?
24. What is Seaborn's FacetGrid and how does it enable multi-panel statistical plots?
25. How do you compute descriptive statistics on a Pandas DataFrame?
26. How do you reduce a Pandas DataFrame's memory usage through dtype optimisation?
27. How do you generate reproducible random data with NumPy?
28. How do you use value_counts() and pd.crosstab() to understand categorical data?
29. How do you style Matplotlib figures and save them for reports?
30. What is np.where and how is it used for conditional array creation?
31. What is Pandas method chaining and how does df.pipe() support it?
32. What does a typical exploratory data analysis (EDA) workflow look like in Python?
33. How do you stack, concatenate, and split NumPy arrays?
34. How do you detect and remove duplicate rows in a Pandas DataFrame?
35. How do you control colours and colour palettes in Matplotlib and Seaborn?
36. How do rolling and expanding window functions work in Pandas?
37. How do Seaborn jointplot and pairplot help explore multivariate relationships?
38. What are the key performance tips when using NumPy for large-scale data processing?
39. How do you visualise regression results and residuals using Seaborn and Matplotlib?
40. How do you process large CSV files that don't fit in memory using Pandas?
41. How do you add annotations and text to Matplotlib charts?
42. How do you quickly extract top/bottom rows and random samples from a Pandas DataFrame?
43. How is NumPy linear algebra used in data science applications?
44. How do you compare distributions across categories using Seaborn categorical plots?
45. How do you build an end-to-end data cleaning and visualisation pipeline with NumPy, Pandas, and Seaborn?


1. What is NumPy and why is it significantly faster than plain Python lists for numerical work?

NumPy (Numerical Python) is the foundational library for scientific computing in Python. At its core it provides the ndarray — an N-dimensional array of a single, fixed data type stored in a contiguous block of memory. That single design decision is the source of almost all of NumPy's performance advantage over Python lists.

Python lists store references to Python objects scattered around the heap. Each arithmetic operation on a list requires Python to look up each object, check its type, extract the value, compute, and then box the result back into a new Python object. A million-element loop pays that overhead a million times.

NumPy sidesteps the overhead in two ways. First, all elements in an ndarray share the same dtype (e.g., float64, int32), so there is no per-element type check and no boxing. Second, NumPy operations are implemented as compiled C (and sometimes Fortran) routines that operate on the raw memory buffer in tight loops — this is called vectorisation. The Python interpreter is invoked once for the whole array, not once per element.

import numpy as np
import time

n = 10_000_000
py_list = list(range(n))
np_arr  = np.arange(n, dtype=np.float64)

t0 = time.perf_counter()
py_result = [x * 2.5 for x in py_list]
print(f'List loop : {time.perf_counter()-t0:.3f}s')

t0 = time.perf_counter()
np_result = np_arr * 2.5   # vectorised — no Python loop
print(f'NumPy     : {time.perf_counter()-t0:.3f}s')
# NumPy is typically tens of times faster here; the exact ratio depends on hardware

Memory is also more compact. On 64-bit CPython a small integer object takes about 28 bytes, and the list holds a further 8-byte pointer to each one; a NumPy int64 element takes exactly 8 bytes. For a million elements that is roughly 36 MB for the list against 8 MB for the array.
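The per-element sizes can be checked directly. The sketch below is illustrative only: sys.getsizeof reports CPython-specific object sizes, so the printed list figures are an assumption about a typical 64-bit CPython build, while the NumPy figures follow from the dtype.

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# Size of one boxed Python int (CPython-specific)
print(sys.getsizeof(py_list[-1]))

# The list object itself additionally stores one pointer per element
print(sys.getsizeof(py_list))

# NumPy stores raw 8-byte values back to back
print(np_arr.itemsize)   # bytes per element
print(np_arr.nbytes)     # total bytes in the data buffer
```

The list's true footprint is the sum of the pointer array and all the integer objects it points to; nbytes already covers the whole NumPy data buffer.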

What is the primary reason NumPy operations are faster than equivalent Python list loops?

✗ NumPy runs on the GPU by default
✓ NumPy stores homogeneous data in contiguous memory and executes operations in compiled C, avoiding Python's per-element overhead
✗ NumPy uses multiple CPU threads automatically
✗ NumPy skips bounds checking entirely

How many bytes does a single float64 element occupy in a NumPy array?

✗ 4
✓ 8

2. What are the main ways to create NumPy arrays?

Knowing the idiomatic array-creation functions is a baseline NumPy skill. Each function is designed for a specific situation and picking the right one keeps code readable and avoids unnecessary copies.

import numpy as np

# From Python sequences
a = np.array([1, 2, 3, 4])              # 1-D, dtype inferred (int64 on most platforms)
b = np.array([[1, 2], [3, 4]], dtype=np.float32)  # 2-D, explicit dtype

# Pre-filled arrays
np.zeros((3, 4))          # 3×4 array of 0.0
np.ones((2, 2))           # 2×2 array of 1.0
np.full((3, 3), 7)        # 3×3 array filled with 7
np.eye(4)                 # 4×4 identity matrix

# Ranges
np.arange(0, 10, 2)       # [0 2 4 6 8]  — like range() but returns ndarray
np.linspace(0, 1, 5)      # [0.  0.25 0.5 0.75 1.]  — N evenly spaced points

# Random arrays (use default_rng for reproducibility)
rng = np.random.default_rng(seed=42)
rng.random((3, 3))        # uniform [0, 1)
rng.standard_normal(1000) # standard normal distribution
rng.integers(0, 100, size=10)  # random ints in [0, 100)

# From existing data, avoiding copies where possible
np.asarray([1.0, 2.0, 3.0])   # a list is always copied; an existing ndarray of matching dtype is returned as-is
np.frombuffer(b'\x01\x02\x03', dtype=np.uint8)  # from raw bytes

np.linspace is preferred over np.arange for floating-point ranges because arange with a float step can produce unexpected element counts due to floating-point rounding. linspace guarantees exactly N points.
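A small sketch of the difference, using arbitrary example values (1.0 to 1.3). The arange length depends on floating-point rounding, so it is printed rather than assumed; the linspace length is whatever you ask for.

```python
import numpy as np

# arange derives its length from ceil((stop - start) / step), so rounding
# error in a float step can add or drop an element near the end point.
a = np.arange(1.0, 1.3, 0.1)
print(len(a), a)

# linspace takes the number of points directly, so the length is never in doubt.
b = np.linspace(1.0, 1.3, 4)
print(len(b), b)
```

If a float step is unavoidable, one common workaround is to build an integer range and scale it, for example np.arange(10, 13) / 10.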

Which NumPy function creates exactly N evenly spaced values between start and stop inclusive?

✗ np.arange(start, stop, N)
✓ np.linspace(start, stop, N)
✗ np.range(start, stop, N)
✗ np.spacing(start, stop, N)

What does np.eye(4) create?

✗ A 4-element array of 1s
✗ A 4×4 array filled with 4
✓ A 4×4 identity matrix
✗ A 4×4 array of zeros

3. How do NumPy array shape, reshape, and axis work?

Every NumPy array has a shape attribute — a tuple giving the size along each dimension. Shape is fundamental because most NumPy operations depend on it, and shape mismatches are the most common source of errors in numerical code.

import numpy as np

a = np.arange(24)
print(a.shape)   # (24,)

# reshape — change shape without copying data
b = a.reshape(4, 6)    # 4 rows, 6 columns
c = a.reshape(2, 3, 4) # 3-D: 2 blocks of 3×4
# -1 means 'infer this dimension'
d = a.reshape(6, -1)   # (6, 4)  — NumPy works out the 4

print(b.shape)  # (4, 6)
print(b.ndim)   # 2
print(b.size)   # 24  — total number of elements

# Axes: axis=0 is rows (down), axis=1 is columns (across)
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m.sum(axis=0))  # [5 7 9]   — sum down each column
print(m.sum(axis=1))  # [6 15]    — sum across each row
print(m.sum())        # 21        — grand total

# Flatten and ravel
m.flatten()  # always returns a copy
m.ravel()    # returns view if possible (faster)

A view shares memory with the original array — modifying the view modifies the original. reshape usually returns a view; flatten always returns a copy. Use np.shares_memory(a, b) to check.
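The view/copy distinction can be verified with a short experiment. This is a minimal sketch on a small contiguous array, where reshape is able to return a view.

```python
import numpy as np

a = np.arange(6)
view = a.reshape(2, 3)   # contiguous input, so reshape can return a view
copy = view.flatten()    # flatten always allocates new memory

view[0, 0] = 99          # writes through to a
copy[1] = -1             # does not affect a

print(a)                              # first element is now 99, second is unchanged
print(np.shares_memory(a, view))      # True
print(np.shares_memory(a, copy))      # False
```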

What does m.sum(axis=0) compute for a 2-D array m?

✗ The sum of all elements
✗ The sum of each row
✓ The sum down each column
✗ The cumulative sum

Which reshape argument tells NumPy to infer the size of a dimension automatically?

✗ 0
✓ -1
✗ None
✗ auto

4. What is NumPy broadcasting and how does it work?

Broadcasting is the set of rules NumPy uses to perform element-wise operations on arrays of different but compatible shapes, without physically copying data to make them the same size. It is one of the most powerful and often misunderstood NumPy features.

The rules, applied dimension by dimension starting from the trailing (rightmost) axis:

If the arrays have different numbers of dimensions, prepend 1s to the shape of the smaller-dimensional array.
Dimensions of size 1 are stretched to match the other array's size in that dimension.
If any dimension neither matches nor is 1, a ValueError is raised.
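The rules above can be tried out without allocating any data. The sketch assumes NumPy 1.20 or later, where np.broadcast_shapes is available; the shapes are arbitrary examples.

```python
import numpy as np

# Trailing axes are compared first; size-1 axes stretch to match.
print(np.broadcast_shapes((3, 1), (1, 4)))   # (3, 4)
print(np.broadcast_shapes((2, 3), (3,)))     # (2, 3): (3,) is treated as (1, 3)

# Sizes 3 and 4 on the same axis neither match nor equal 1.
try:
    np.broadcast_shapes((3,), (4,))
    compatible = True
except ValueError:
    compatible = False
print(compatible)                            # False
```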

import numpy as np

# Scalar broadcast over array
a = np.array([1, 2, 3])
print(a * 10)         # [10 20 30]

# (2, 3) minus (2, 1): subtract each row's minimum from every element of that row
matrix = np.array([[10, 20, 30],
                   [40, 50, 60]])
row_min = matrix.min(axis=1, keepdims=True)  # shape (2, 1)
normalised = matrix - row_min  # broadcasts: (2,3) - (2,1) -> (2,3)
print(normalised)
# [[ 0 10 20]
#  [ 0 10 20]]

# Outer product via broadcasting
col = np.array([[1], [2], [3]])  # shape (3, 1)
row = np.array([10, 20, 30])    # shape (3,) -> treated as (1, 3)
print(col * row)
# [[10 20 30]
#  [20 40 60]
#  [30 60 90]]

Broadcasting avoids the memory cost of explicit np.tile or np.repeat calls. The stretched values are never physically written — NumPy just iterates as if they were. For large arrays this can mean the difference between fitting in RAM and running out of memory.
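To make the memory point concrete, the sketch below compares a broadcast view with an explicit np.tile copy. The row count of one million is an arbitrary illustration.

```python
import numpy as np

row = np.arange(3, dtype=np.float64)               # 3 elements, 24 bytes
stretched = np.broadcast_to(row, (1_000_000, 3))   # read-only broadcast view

print(stretched.shape)                    # logical shape (1000000, 3)
print(stretched.strides)                  # stride 0 on the stretched axis: the same row is reused
print(np.shares_memory(row, stretched))   # True, no new data buffer

tiled = np.tile(row, (1_000_000, 1))      # physically copies the data
print(tiled.nbytes)                       # 24000000 bytes
```

Arithmetic such as matrix - row performs the same stretching implicitly, so np.broadcast_to is rarely needed in practice; it is used here only to expose the strides.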

What shape does NumPy broadcast (3, 1) with (1, 4) to produce?

✗ (3, 1)
✗ (1, 4)
✓ (3, 4)
✗ ValueError (incompatible)

What does keepdims=True do when passed to an aggregation like np.sum?

✗ It keeps a copy of the original array
✓ It preserves the reduced dimension as size 1 so broadcasting still works
✗ It prevents the result from being flattened
✗ It doubles the computation time

5. How does NumPy boolean masking and fancy indexing work?

Beyond basic integer indexing, NumPy supports two advanced selection mechanisms that are essential for data-cleaning and filtering tasks.

Boolean masking: A comparison on an array produces a boolean array of the same shape. Passing that boolean array back as an index selects only the True positions.

import numpy as np

scores = np.array([88, 45, 72, 91, 60, 33, 95])

# Boolean mask
mask = scores >= 70
print(mask)         # [True False True True False False True]
passing = scores[mask]
print(passing)      # [88 72 91 95]

# Compound conditions
mid_range = scores[(scores >= 60) & (scores < 90)]
print(mid_range)    # [88 72 60]  — use & | ~ not and/or

# Assign through a mask
scores[scores < 50] = 50   # clamp low scores to 50
print(scores)       # [88 50 72 91 60 50 95]

# np.where — vectorised if/else
grades = np.where(scores >= 70, 'Pass', 'Fail')
print(grades)       # ['Pass' 'Fail' 'Pass' 'Pass' 'Fail' 'Fail' 'Pass']

Fancy indexing: Pass an integer array (or list) as an index to select arbitrary elements in any order. Unlike slicing, fancy indexing always returns a copy, not a view.

data = np.array([10, 20, 30, 40, 50])
idx  = np.array([4, 1, 4, 0])          # can repeat indices
print(data[idx])   # [50 20 50 10]

# 2-D fancy indexing
m = np.arange(16).reshape(4, 4)
rows = [0, 2]; cols = [1, 3]
print(m[rows, cols])  # m[0,1] and m[2,3]: [1 11]
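The copy-versus-view behaviour matters as soon as you write to the result. A minimal sketch contrasting a basic slice with fancy indexing on the same array:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])

sliced = data[1:4]        # basic slice: a view
picked = data[[1, 2, 3]]  # fancy indexing: a copy

sliced[0] = -1            # visible in data
picked[1] = -2            # not visible in data

print(data)               # only the write through the slice shows up
print(np.shares_memory(data, sliced), np.shares_memory(data, picked))
```

Assigning through a fancy index on the left-hand side, as in data[[1, 2]] = 0, still modifies data in place; only the object returned by a fancy-index read is a copy.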

Why must you use & instead of 'and' when combining NumPy boolean masks?

✗ 'and' is slower
✓ 'and' operates on the truth value of the whole array and raises an error; & applies element-wise
✗ & is shorthand for np.logical_and which is required
✗ There is no difference

Does fancy indexing with an integer array return a view or a copy?

✗ Always a view
✗ A view if the indices are sorted
✓ Always a copy
✗ A copy only if there are repeated indices

6. What are the most commonly used NumPy mathematical functions in data science?

NumPy ships a comprehensive set of universal functions (ufuncs) — compiled, vectorised operations that apply element-wise across the full array without Python loops. Knowing these avoids writing slow manual loops for standard computations.

import numpy as np

a = np.array([1.0, 4.0, 9.0, 16.0, 25.0])

# Element-wise math
np.sqrt(a)          # [1.  2.  3.  4.  5.]
np.log(a)           # natural log
np.log2(a)          # base-2 log
np.log10(a)         # base-10 log
np.exp(a)           # e^x
np.abs(np.array([-3, 4, -1]))  # [3 4 1]

# Aggregation
a.sum()             # 55.0
a.mean()            # 11.0
a.std()             # standard deviation
a.var()             # variance
a.min(); a.max()    # extremes
a.argmin(); a.argmax()  # INDEX of min/max
np.median(a)        # 9.0
np.percentile(a, 75)   # 75th percentile

# Linear algebra
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
np.dot(A, B)        # matrix multiplication (also A @ B in Python 3.5+)
np.linalg.inv(A)    # matrix inverse
np.linalg.det(A)    # determinant
vals, vecs = np.linalg.eig(A)  # eigenvalues and eigenvectors

# Sorting
unsorted = np.array([3, 1, 4, 1, 5])
np.sort(unsorted)   # returns sorted copy: [1 1 3 4 5]
np.argsort(unsorted)  # indices that would sort: [1 3 0 2 4]

What does np.argmax(a) return?

✗ The maximum value in a
✓ The index of the maximum value in a
✗ A boolean array marking the maximum
✗ The cumulative maximum

Which operator performs matrix multiplication between two NumPy arrays in Python 3.5+?

✗ *
✓ @

7. What is a Pandas DataFrame and how does it differ from a NumPy array?

A Pandas DataFrame is a two-dimensional, labelled data structure — think of it as a spreadsheet or a SQL table in memory. Rows and columns both have labels (the index and the column names), and each column can hold a different data type. A Series is the single-column equivalent.

DataFrame vs NumPy Array
Feature	NumPy ndarray	Pandas DataFrame
Dimensions	N-dimensional	Always 2-D (rows × columns)
Data type	Single dtype per array	Each column has its own dtype
Labels	Integer positions only	Named row index + column headers
Missing values	No native support (use np.nan)	First-class NaN / NaT / pd.NA
Primary use	Numerical computation	Tabular data: ETL, analysis, SQL-like ops

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol'],
    'age':    [30, 25, 35],
    'salary': [95000.0, 72000.0, np.nan],
})
print(df.dtypes)
# name       object
# age         int64
# salary    float64

print(df.shape)   # (3, 3)
print(df.index)   # RangeIndex(start=0, stop=3, step=1)
print(df.columns) # Index(['name', 'age', 'salary'], dtype='object')

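DataFrames are built on top of NumPy arrays: most numeric columns are backed by a NumPy array wrapped with extra metadata. When computation speed is paramount you can drop down to df.to_numpy() (preferred over the older df.values) to get the raw array and run NumPy operations on it. Keep in mind that a DataFrame with mixed column types converts to a slower object-dtype array, so select the numeric columns first.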
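A short sketch of dropping down to NumPy, reusing the example DataFrame from above. It shows why the choice of columns matters for the dtype of the resulting array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol'],
    'age':    [30, 25, 35],
    'salary': [95000.0, 72000.0, np.nan],
})

# Numeric columns only: int64 and float64 are upcast to a common float64
numeric = df[['age', 'salary']].to_numpy()
print(numeric.dtype, numeric.shape)

# Mixing strings and numbers forces a generic object array,
# which loses most of NumPy's speed advantage
mixed = df.to_numpy()
print(mixed.dtype)

# NumPy reductions then work on the raw array; nanmean skips the NaN
print(np.nanmean(numeric[:, 1]))
```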

What is the key structural difference between a Pandas DataFrame and a NumPy 2-D array?

✗ DataFrames are always sorted; arrays are not
✓ DataFrames support labelled axes and mixed column dtypes; arrays require a single dtype
✗ DataFrames store data column-major; arrays are row-major
✗ DataFrames only support string data

Which method converts a DataFrame to a raw NumPy array?

✗ df.to_array()
✓ df.to_numpy()
✗ df.values_array()
✗ np.from_dataframe(df)

8. How do you read CSV, Excel, and JSON files into a Pandas DataFrame?

Pandas has a family of pd.read_* functions that handle virtually every common data format. Getting data in is usually the first step of any data science workflow, so these functions deserve close attention.

import pandas as pd

# --- CSV ---
df = pd.read_csv('sales.csv')
# Common options:
df = pd.read_csv(
    'sales.csv',
    sep=';',              # custom delimiter (semicolon, tab, etc.)
    header=0,             # row to use as column names (0 = first row)
    index_col='order_id', # use this column as the row index
    usecols=['order_id', 'date', 'amount', 'region'],  # read only these columns (must include index_col)
    dtype={'amount': 'float32'},            # explicit dtype
    parse_dates=['date'],                   # auto-parse date strings
    na_values=['N/A', '--', ''],            # treat as NaN
    nrows=1000,           # read only first 1000 rows (useful for large files)
    encoding='utf-8',
)

# --- Excel ---
df_xl = pd.read_excel('report.xlsx', sheet_name='Q1', skiprows=2)

# --- JSON ---
df_j = pd.read_json('data.json', orient='records')
# orient='records' expects [{...}, {...}] — the common API response shape

# --- SQL ---
import sqlite3
conn = sqlite3.connect('mydb.sqlite')
df_sql = pd.read_sql('SELECT * FROM orders WHERE amount > 100', conn)

# Always inspect after reading
print(df.shape)
print(df.head())
print(df.dtypes)
df.info()          # prints non-null counts per column (returns None, so no print() needed)

For very large CSVs that do not fit in memory, pass chunksize=100_000 to read_csv — it returns an iterator of DataFrames, each containing that many rows. Process and aggregate chunk by chunk without loading the full file.
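A minimal sketch of chunked aggregation. To stay self-contained it reads from an in-memory io.StringIO buffer with made-up data standing in for a large file on disk, and uses a deliberately tiny chunksize of 4.

```python
import io
import pandas as pd

# Stand-in for a large file on disk (hypothetical data)
csv_text = "region,amount\n" + "\n".join(
    f"{'East' if i % 2 else 'West'},{i}" for i in range(10)
)

totals = {}
# chunksize makes read_csv yield DataFrames of at most 4 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partial = chunk.groupby('region')['amount'].sum()
    for region, amount in partial.items():
        totals[region] = totals.get(region, 0) + amount

print(totals)
```

This pattern works for aggregations that combine across chunks, such as sums and counts. A mean needs the running sum and count tracked separately, and a median cannot be combined this way at all.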

Which parameter tells pd.read_csv to read only the first 500 rows of a file?

✗ head=500
✗ max_rows=500
✓ nrows=500
✗ limit=500

What does orient='records' mean in pd.read_json?

✗ Each row is a list
✓ JSON is a list of row-dictionaries [{col: val}, ...]
✗ JSON keys map to column names only
✗ Records are sorted by index

9. What is the difference between df.loc[] and df.iloc[] in Pandas?

This distinction is tested in almost every Pandas interview. The short version: loc selects by label; iloc selects by integer position. They look similar but behave very differently, especially when the DataFrame index is not a default RangeIndex.

import pandas as pd

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol', 'Dave'],
    'score':  [88, 72, 95, 61],
    'city':   ['NYC', 'LA', 'NYC', 'Chicago'],
}, index=[10, 20, 30, 40])   # non-default index!

# --- loc: label-based ---
df.loc[20]               # row with index label 20 (Bob)
df.loc[10:30]            # rows 10, 20, 30 — INCLUSIVE stop
df.loc[10, 'name']       # single value: 'Alice'
df.loc[[10, 40], ['name', 'score']]   # multiple rows and columns
df.loc[df['score'] >= 80]             # boolean mask selection

# --- iloc: position-based ---
df.iloc[0]               # first row (Alice) — positional 0
df.iloc[0:2]             # rows 0 and 1 — EXCLUSIVE stop (like Python slicing)
df.iloc[0, 1]            # row 0, column 1: 88
df.iloc[-1]              # last row (Dave)
df.iloc[:, 0]            # entire first column

# --- [] shorthand ---
df['name']               # single column as Series
df[['name', 'city']]     # multiple columns as DataFrame
df[df['score'] > 80]     # boolean filtering — OK for rows only

The classic trap: loc stop is inclusive; iloc stop is exclusive. This asymmetry trips up even experienced developers. When in doubt, prefer explicit loc or iloc over the [] shorthand to avoid ambiguity.
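The asymmetry is easy to see side by side. A minimal sketch with the same non-default index as the example above:

```python
import pandas as pd

df = pd.DataFrame({'score': [88, 72, 95, 61]}, index=[10, 20, 30, 40])

by_label = df.loc[10:30]    # labels 10, 20, 30: stop label included
by_position = df.iloc[0:2]  # positions 0 and 1: stop position excluded

print(by_label.index.tolist())     # [10, 20, 30]
print(by_position.index.tolist())  # [10, 20]

# A label that is not in the index raises KeyError with loc,
# even though position 0 is valid for iloc
try:
    df.loc[0]
except KeyError:
    print('label 0 not found')
```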

For df with index [10, 20, 30, 40], what does df.iloc[0:2] return?

✗ Rows with labels 0 and 1 (KeyError)
✓ Rows with labels 10 and 20 (positions 0 and 1, exclusive stop)
✗ Rows with labels 10, 20, and 30 (inclusive stop)
✗ An empty DataFrame

Which accessor should you use to select rows by a boolean condition in Pandas?

✗ df.iloc[condition]
✓ df.loc[condition]
✗ df.sel[condition]
✗ df.query_bool[condition]

10. How do you detect, handle, and fill missing values in a Pandas DataFrame?

Missing values are represented in Pandas as NaN (float Not-a-Number from NumPy), NaT (Not-a-Time for datetime columns), or pd.NA (the missing-value marker used by the newer nullable integer, boolean, and string dtypes). Handling them correctly is often one of the most time-consuming steps of real-world data cleaning.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', None, 'Dave'],
    'age':    [30, np.nan, 35, 28],
    'score':  [88, 72, np.nan, np.nan],
})

# --- Detection ---
df.isnull()               # boolean DataFrame — True where NaN
df.isnull().sum()         # count NaNs per column
df.isnull().sum() / len(df) * 100  # % missing per column
df.notnull()              # inverse of isnull

# --- Dropping ---
df.dropna()               # drop rows with ANY NaN
df.dropna(how='all')      # drop rows where ALL values are NaN
df.dropna(subset=['age']) # drop rows with NaN only in 'age'
df.dropna(axis=1, thresh=3)  # drop columns with fewer than 3 non-NaN values

# --- Filling ---
df['score'].fillna(df['score'].mean())   # fill with column mean
df['score'].ffill()                      # forward fill (propagate last valid); fillna(method='ffill') is deprecated
df['score'].bfill()                      # backward fill
df.fillna({'age': 0, 'name': 'Unknown'})  # column-specific fills

# --- Interpolation ---
df['score'].interpolate(method='linear')  # linear interpolation between values

# --- Replace specific sentinel values ---
df.replace(-999, np.nan)  # treat -999 as missing

Choosing between dropping and filling requires domain knowledge. Dropping rows is usually acceptable when missing data is rare (a common rule of thumb is below about 5%) and missing at random. Filling with the mean or median is common for numerical features; categoricals are typically filled with the mode or a sentinel such as 'Unknown'. For time series, forward fill uses only past observations, so it does not leak future values into earlier rows.
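One way to combine these strategies per column is sketched below, on the example DataFrame from this section. The specific choices (median for age, a sentinel for name, forward fill for score) are illustrative assumptions, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name':  ['Alice', 'Bob', None, 'Dave'],
    'age':   [30, np.nan, 35, 28],
    'score': [88, 72, np.nan, np.nan],
})

cleaned = df.assign(
    # Median is less sensitive to outliers than the mean
    age=df['age'].fillna(df['age'].median()),
    # Explicit sentinel for a missing category
    name=df['name'].fillna('Unknown'),
    # Forward fill carries the last observed score forward
    score=df['score'].ffill(),
)
print(cleaned)
```

assign returns a new DataFrame, so the original df keeps its missing values for comparison.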

Which Pandas method counts the number of missing values per column?

✗ df.count_nan()
✓ df.isnull().sum()
✗ df.missing()
✗ df.nan_count()

What does df['col'].ffill() (the replacement for the deprecated fillna(method='ffill')) do?

✗ Fills NaNs with the column mean
✗ Fills NaNs with 0
✓ Propagates the last valid value forward to fill NaNs
✗ Fills NaNs with the next valid value

11. What are the different ways to filter rows in a Pandas DataFrame?

Row filtering is one of the most frequent DataFrame operations. Pandas provides several syntaxes, each with different readability and performance trade-offs.

import pandas as pd

df = pd.DataFrame({
    'city':     ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
    'revenue':  [120, 85, 200, 55, 140],
    'category': ['A', 'B', 'A', 'C', 'B'],
})

# Boolean indexing — most common
df[df['revenue'] > 100]

# Compound conditions — use & | ~ (not and/or)
df[(df['city'] == 'NYC') & (df['revenue'] > 100)]

# isin — membership test
df[df['city'].isin(['NYC', 'LA'])]

# between — inclusive range
df[df['revenue'].between(80, 150)]

# str methods for text filtering
df[df['city'].str.startswith('N')]
df[df['city'].str.contains('C', case=False)]

# query() — string-based, readable for complex conditions
df.query('city == "NYC" and revenue > 100')
threshold = 100
df.query('revenue > @threshold')
# @ prefix references a Python variable inside the query string

# filter() — filter columns or index labels (NOT rows by content)
df.filter(like='rev')    # columns whose name contains 'rev'
df.filter(regex='^c')   # columns starting with 'c'

query() is readable for ad-hoc analysis and can be faster on very large DataFrames when the optional numexpr package is installed, because the expression is evaluated without materialising every intermediate boolean array. However, it does not support all Python expressions and can be harder to debug. For production pipelines, boolean indexing is more explicit and easier to test.
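The two styles select the same rows, which a quick check confirms. In this sketch, threshold is a hypothetical cut-off chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city':    ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
    'revenue': [120, 85, 200, 55, 140],
})

threshold = 100  # hypothetical cut-off

# Boolean indexing: each comparison builds a boolean Series, combined with &
via_mask = df[(df['city'] == 'NYC') & (df['revenue'] > threshold)]

# query: the same condition as a string; @threshold reads the local variable
via_query = df.query('city == "NYC" and revenue > @threshold')

print(via_mask.equals(via_query))   # True
```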

Why must you use & instead of 'and' when writing compound Pandas boolean filters?

✗ 'and' is not defined for DataFrames
✓ 'and' evaluates the truth value of the whole Series which is ambiguous; & applies element-wise
✗ & runs faster due to C-level optimisation
✗ There is no practical difference

Which Pandas method checks if each row's value belongs to a given list?

✗ df['col'].contains()
✓ df['col'].isin()
✗ df['col'].includes()
✗ df['col'].matches()

12. How does Pandas groupby work and what aggregation patterns are most useful?

GroupBy is the Pandas implementation of the split-apply-combine pattern: split the DataFrame into groups by one or more column values, apply an aggregation or transformation to each group, and combine the results into a new DataFrame. It is the primary tool for summary statistics on tabular data.

import pandas as pd

sales = pd.DataFrame({
    'region':  ['East','East','West','West','East','West'],
    'product': ['A','B','A','B','A','A'],
    'revenue': [100, 200, 150, 80, 120, 90],
    'units':   [10, 20, 15, 8, 12, 9],
})

# Single-column groupby with single aggregation
sales.groupby('region')['revenue'].sum()
# East    420   West    320

# Multiple aggregations on one column
sales.groupby('region')['revenue'].agg(['sum', 'mean', 'count', 'std'])

# Different aggregations per column
sales.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_units    =('units',   'mean'),
    num_orders   =('revenue', 'count'),
)

# Multi-column groupby
sales.groupby(['region', 'product'])['revenue'].sum()

# transform — returns same-length Series aligned with original index
# (useful for adding group statistics back as a new column)
sales['region_total'] = sales.groupby('region')['revenue'].transform('sum')

# filter — keep only groups satisfying a condition
big_regions = sales.groupby('region').filter(lambda g: g['revenue'].sum() > 400)

transform vs agg: agg reduces each group to a scalar, returning a smaller DataFrame; transform keeps the original shape, broadcasting the group result back to each row. Use transform when you want to add a group statistic as a feature column without losing row-level detail.
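The shape difference is what makes transform useful for feature columns. A minimal sketch that computes each order's share of its region's revenue, on a trimmed version of the sales data:

```python
import pandas as pd

sales = pd.DataFrame({
    'region':  ['East', 'East', 'West', 'West', 'East', 'West'],
    'revenue': [100, 200, 150, 80, 120, 90],
})

per_group = sales.groupby('region')['revenue'].agg('sum')        # one value per region
per_row = sales.groupby('region')['revenue'].transform('sum')    # one value per original row

# Because per_row lines up with sales, it can be used directly in arithmetic
sales['share_of_region'] = sales['revenue'] / per_row

print(len(per_group), len(per_row))   # 2 6
print(sales)
```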

What does groupby.transform('sum') return compared to groupby.agg('sum')?

✗ Both return the same result
✓ transform returns a same-length Series aligned with the original index; agg returns one row per group
✗ transform is deprecated in favour of agg
✗ agg keeps the original index; transform produces a smaller one

How do you apply different aggregations to different columns in a single groupby call?

✓ Pass a dict to .agg() mapping column names to aggregation functions
✗ Chain multiple .groupby() calls
✗ Use .pivot() after groupby
✗ Loop over columns and call .sum() individually

13. How do you merge and join DataFrames in Pandas, and what do the different join types mean?

Real-world data lives in multiple tables. Pandas merge() implements SQL-style joins, and concat() stacks DataFrames. Choosing the right join type prevents silently losing or duplicating rows.

Pandas Join Types
how=	Keeps rows from	Missing matches become
'inner'	Both DataFrames (intersection)	Dropped (only matched rows are kept, so no NaN is introduced)
'left'	All of left, matched from right	NaN in right columns
'right'	All of right, matched from left	NaN in left columns
'outer'	Both (union)	NaN on whichever side has no match

import pandas as pd

customers = pd.DataFrame({
    'cust_id': [1, 2, 3, 4],
    'name':    ['Alice', 'Bob', 'Carol', 'Dave'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'cust_id':  [1, 2, 9],     # cust_id 9 has no match; Dave has no order
    'amount':   [200, 150, 80],
})

# Inner join — only rows that match in both
pd.merge(customers, orders, on='cust_id', how='inner')   # 2 rows

# Left join — all customers, NaN where no order
pd.merge(customers, orders, on='cust_id', how='left')    # 4 rows

# Different key names in each table
pd.merge(customers, orders,
         left_on='cust_id', right_on='cust_id')  # same here, but shows syntax

# Merge on index
pd.merge(customers.set_index('cust_id'), orders,
         left_index=True, right_on='cust_id')

# concat — stack vertically (rows) or horizontally (columns)
pd.concat([customers, customers], axis=0, ignore_index=True)   # stack rows
pd.concat([customers, orders], axis=1)                         # add columns side by side, aligned on the index
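To catch silently lost or duplicated rows, merge offers two checks: indicator=True labels where each row came from, and validate asserts the expected key relationship. The sketch below applies both to the customers and orders tables from this section.

```python
import pandas as pd

customers = pd.DataFrame({'cust_id': [1, 2, 3, 4],
                          'name': ['Alice', 'Bob', 'Carol', 'Dave']})
orders = pd.DataFrame({'order_id': [101, 102, 103],
                       'cust_id': [1, 2, 9],
                       'amount': [200, 150, 80]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
audit = pd.merge(customers, orders, on='cust_id', how='outer', indicator=True)
print(audit['_merge'].value_counts())

# validate raises MergeError if the key relationship is not as declared,
# here: each cust_id appears at most once on the left side
checked = pd.merge(customers, orders, on='cust_id', how='left',
                   validate='one_to_many')
print(len(checked))   # 4
```

Here the audit shows two matched rows, two customers without orders, and one order whose cust_id has no customer record.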

A left join with customers (left) and orders (right) on cust_id will…

✗ Drop customers with no orders
✓ Keep all customers and fill order columns with NaN for customers with no order
✗ Keep only customers who have at least two orders
✗ Return an error if a customer ID appears in orders but not customers

What does pd.concat([df1, df2], axis=0) do?

✗ Adds df2's columns to df1
✓ Stacks df2's rows below df1's rows
✗ Merges on the common index
✗ Performs an outer join on all columns

14. When should you use df.apply() versus vectorised Pandas operations?

apply() runs a Python function on every row or column of a DataFrame. It is the most flexible transformation tool in Pandas but also one of the slowest, because with axis=1 it calls the function once per row in a Python-level loop.

import pandas as pd
import numpy as np

df = pd.DataFrame({'price': [10.5, 20.0, 8.75, 35.0],
                   'qty':   [3,    5,    2,    1   ]})

# --- Slow: apply with a Python lambda ---
df['revenue'] = df.apply(lambda row: row['price'] * row['qty'], axis=1)

# --- Fast: vectorised arithmetic (always prefer this) ---
df['revenue'] = df['price'] * df['qty']

# --- apply on a single column (Series.apply) ---
df['price_cat'] = df['price'].apply(lambda x: 'high' if x > 20 else 'low')

# --- Faster alternative: np.where ---
df['price_cat'] = np.where(df['price'] > 20, 'high', 'low')

# --- Multi-condition: np.select ---
conditions  = [df['price'] > 25, df['price'] > 15, df['price'] > 0]
choices     = ['premium', 'mid', 'budget']
df['tier']  = np.select(conditions, choices, default='unknown')

# When apply is genuinely needed:
# — calling a function that returns a list/dict/Series per row
# — complex multi-column logic that cannot be expressed as vectorised ops
df.apply(lambda row: pd.Series({'x': row['price']+1, 'y': row['qty']*2}), axis=1)

Rough performance hierarchy for transformations (fastest to slowest): vectorised arithmetic and NumPy ufuncs > df.eval() string expressions > map() on a Series > apply() > explicit Python for-loop. The exact ordering depends on data size and dtype, so measure when it matters. Use apply only when no vectorised alternative exists; for simple conditions prefer np.where or boolean indexing.

Why is df['price'] * df['qty'] faster than df.apply(lambda row: row['price']*row['qty'], axis=1)?apply() copies the DataFrame first

✗ Try again.

apply() iterates row-by-row in Python; vectorised arithmetic calls compiled C operations on the whole array at once

✓ Correct! Well done.

The * operator uses multiple threads

✗ Try again.

apply() sorts the DataFrame before processing

✗ Try again.

Which NumPy function is the recommended vectorised replacement for a simple if/else transformation on a column?

np.apply()

✗ Try again.

np.choose()

✗ Try again.

np.where()

✓ Correct! Well done.

np.map()

✗ Try again.

15. How do you use pd.pivot_table to summarise data?

pd.pivot_table reshapes and aggregates a DataFrame simultaneously, producing a cross-tabulation — exactly like a spreadsheet pivot table. It is the go-to function for producing summary reports broken down by two categorical dimensions.

import pandas as pd

sales = pd.DataFrame({
    'region':  ['East','East','West','West','East','West','West'],
    'quarter': ['Q1',  'Q2',  'Q1',  'Q2',  'Q1',  'Q1',  'Q2'],
    'product': ['A',   'A',   'A',   'A',   'B',   'B',   'B'],
    'revenue': [100,   120,   90,    110,   80,    70,    95],
})

# Basic pivot: total revenue by region (rows) and quarter (columns)
pt = pd.pivot_table(
    sales,
    values='revenue',
    index='region',
    columns='quarter',
    aggfunc='sum',       # sum, mean, count, np.median, list, ...
    fill_value=0,        # replace NaN with 0
    margins=True,        # add row/column totals (default label is 'All')
    margins_name='Total',  # override the default label
)
print(pt)
# quarter  Q1   Q2  Total
# region
# East    180  120    300
# West    160  205    365
# Total   340  325    665

# One value column, multiple aggregations
pd.pivot_table(sales, values='revenue', index='region',
               columns='product', aggfunc=['sum', 'count'])

The inverse operation — converting a wide pivot back to long form — is pd.melt(). df.stack() and df.unstack() do similar reshape operations on the index levels directly.
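A small round trip can make the wide/long relationship concrete. The sketch below uses a made-up four-row table (an assumption for illustration) and relies on each region/quarter pair appearing once, so that pivoting does not merge rows.

```python
import pandas as pd

# Made-up long-form data: one row per (region, quarter) pair
sales = pd.DataFrame({
    'region':  ['East', 'East', 'West', 'West'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [180, 120, 160, 205],
})

# Long -> wide: one row per region, one column per quarter
wide = sales.pivot_table(values='revenue', index='region',
                         columns='quarter', aggfunc='sum')

# Wide -> long: the index must become a regular column before melting
long_again = wide.reset_index().melt(id_vars='region',
                                     var_name='quarter',
                                     value_name='revenue')
print(long_again)
```

Note that melt only recovers the original rows when the pivot did not aggregate several rows into one cell; after a real aggregation, the long form contains the aggregated values.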

What does margins=True add to a pd.pivot_table result?

Borders around cells in the HTML output

✗ Try again.

Row and column subtotals/totals

✓ Correct! Well done.

Percentage breakdowns alongside absolute values

✗ Try again.

An extra 'margins' column for outlier detection

✗ Try again.

Which Pandas function is the inverse of pivot_table, reshaping wide data back to long form?

pd.wide_to_long()

✗ Try again.

pd.unstack()

✗ Try again.

pd.melt()

✓ Correct! Well done.

pd.reshape()

✗ Try again.

16. How do you perform string operations on Pandas DataFrame columns?

Pandas exposes string methods through the .str accessor on object-dtype and string-dtype Series. These operations apply to the whole column with no explicit loop, and they handle NaN values gracefully (missing values propagate as NaN rather than raising an error).

import pandas as pd

df = pd.DataFrame({'name': ['  Alice Smith  ', 'bob jones', 'CAROL LEE', None],
                   'email': ['alice@corp.com', 'BOB@CORP.COM', 'carol@other.org', None]})

# Case normalisation
df['name'].str.strip().str.title()     # 'Alice Smith', 'Bob Jones', 'Carol Lee', NaN

# Split into multiple columns
df[['first', 'last']] = df['name'].str.strip().str.split(' ', expand=True)

# Contains / startswith / endswith
df[df['email'].str.endswith('@corp.com', na=False)]

# Extract patterns with regex
df['domain'] = df['email'].str.extract(r'@(.+)$')  # captures text after @

# Replace with regex
df['email'].str.lower().str.replace(r'[^a-z0-9@._]', '', regex=True)

# Count occurrences
df['name'].str.count('l')   # 1, 0, 0, NaN (case-sensitive: 'CAROL LEE' has no lowercase 'l')

# Length
df['name'].str.len()

# Padding / justification
pd.Series(['7', '42', '1234']).str.zfill(6)   # zero-pad to width 6: '000007', '000042', '001234'
df['name'].str.ljust(20, '-')  # left-justify, pad with dashes

The na=False argument in methods like str.contains, str.startswith, and str.endswith is important. Without it, NaN values produce NaN in the result, and on object-dtype columns pandas raises a ValueError when such a mask is used for filtering because it is no longer purely boolean. Passing na=False returns False for NaN rows, keeping them out of the filtered result cleanly.
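The difference between the two masks is easy to see on a three-element Series. This is a minimal sketch with made-up email addresses; the object dtype is set explicitly so the behaviour matches the DataFrame above.

```python
import pandas as pd

emails = pd.Series(['alice@corp.com', None, 'carol@other.org'], dtype=object)

# Without na=, the missing entry stays missing in the result
raw_mask = emails.str.endswith('@corp.com')

# With na=False, the missing entry is treated as "no match"
clean_mask = emails.str.endswith('@corp.com', na=False)

# The clean mask is purely boolean and safe to filter with
matches = emails[clean_mask]
print(raw_mask.tolist())
print(clean_mask.tolist())
```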

What does df['col'].str.extract(r'(\d+)') do?

Removes all digit characters from each string

✗ Try again.

Extracts the first sequence of digits from each string into a new column

✓ Correct! Well done.

Tests whether each string contains a digit

✗ Try again.

Splits the string on digit boundaries

✗ Try again.

Why pass na=False to df['email'].str.contains('@corp.com')?

It makes the search case-insensitive

✗ Try again.

It makes NaN values return False instead of NaN, preventing issues in boolean filtering

✓ Correct! Well done.

It enables regex mode

✗ Try again.

It searches the entire DataFrame instead of one column

✗ Try again.

17. How do you work with dates and times in Pandas?

Time-series data is everywhere in data science — sales by day, sensor readings by second, user activity by hour. Pandas has first-class datetime support built on NumPy's datetime64 type and Python's datetime module.

import pandas as pd

df = pd.DataFrame({
    'date_str': ['2024-01-15', '2024-02-20', '2024-03-05'],
    'value':    [100, 200, 150],
})

# Parse string dates — always specify format for speed and correctness
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d')

# Extract components via .dt accessor
df['year']    = df['date'].dt.year
df['month']   = df['date'].dt.month
df['day']     = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()  # 'Monday', 'Tuesday', ...
df['quarter'] = df['date'].dt.quarter

# Date arithmetic
df['days_since'] = (pd.Timestamp.today() - df['date']).dt.days
df['next_month'] = df['date'] + pd.DateOffset(months=1)

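# Set as index for time-series resampling; keep only the numeric column,
# since sum/mean fail on the string and datetime columns added above
ts = df.set_index('date')[['value']]
ts.resample('M').sum()   # sum by month ('ME' in pandas 2.2+, where 'M' is deprecated)
ts.resample('W').mean()  # mean by week
ts.resample('Q').agg({'value': ['sum', 'count']})  # quarterly stats ('QE' in pandas 2.2+)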

# Filtering date ranges
df[df['date'] >= '2024-02-01']
df[df['date'].between('2024-01-01', '2024-03-01')]

Always parse dates explicitly with format= rather than relying on format inference: inference is slower and can guess wrong for ambiguous strings like 01/02/03 (the old infer_datetime_format=True argument is deprecated since pandas 2.0). For production pipelines, parse at read time using parse_dates=['date_col'] in pd.read_csv.
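The ambiguity is concrete: the same string parses to two different dates depending on the format you state. The sketch below also shows parsing at read time; the CSV text and its column names are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# The same string means different dates under different conventions
ambiguous = pd.Series(['01/02/03'])
as_us = pd.to_datetime(ambiguous, format='%m/%d/%y')   # month first
as_eu = pd.to_datetime(ambiguous, format='%d/%m/%y')   # day first
print(as_us[0].date(), as_eu[0].date())

# Parsing at read time; csv_text is a hypothetical stand-in for a file on disk
csv_text = 'date_col,value\n2024-01-15,100\n2024-02-20,200\n'
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date_col'])
print(df.dtypes)
```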

Which Pandas accessor exposes datetime components like .year, .month, and .day_name() on a datetime Series?

.date

✗ Try again.

.dt

✓ Correct! Well done.

.time

✗ Try again.

.calendar

✗ Try again.

What does ts.resample('M').sum() do on a time-indexed DataFrame?

Resamples to milliseconds

✗ Try again.

Groups rows by calendar month and sums each group

✓ Correct! Well done.

Randomly samples rows monthly

✗ Try again.

Computes monthly moving averages

✗ Try again.

18. What is Matplotlib and what are the key components of a figure?

Matplotlib is Python's foundational plotting library, originally modelled after MATLAB's plotting API. Many other Python visualisation tools, including Seaborn and the Pandas .plot() method, wrap Matplotlib or use it as their default rendering backend.

Understanding the object hierarchy is essential for customising plots beyond the defaults:

Matplotlib Object Hierarchy
Object	What it is	Created by
Figure	The entire canvas / window	plt.figure() or plt.subplots()
Axes	One coordinate system (plot area) inside a Figure	fig.add_subplot() or plt.subplots()
Axis	The X or Y axis of an Axes (note: Axes ≠ Axis)	Exists on every Axes
Artist	Every visible element — lines, patches, text, legends	plot(), bar(), text(), etc.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 300)

# Object-oriented interface (recommended for complex plots)
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, np.sin(x), label='sin(x)', color='steelblue', linewidth=2)
ax.plot(x, np.cos(x), label='cos(x)', color='tomato', linestyle='--')
ax.set_title('Sine and Cosine', fontsize=14)
ax.set_xlabel('x (radians)')
ax.set_ylabel('Amplitude')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 2 * np.pi)
fig.tight_layout()        # prevent label clipping
plt.savefig('trig.png', dpi=150, bbox_inches='tight')
plt.show()

The pyplot (plt.*) interface is a state-machine shorthand that implicitly manages the current Figure and Axes. It is convenient for quick interactive plots but problematic in scripts and notebooks that create multiple figures — use the object-oriented fig, ax = plt.subplots() style for anything beyond a single simple chart.
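The following sketch illustrates why explicit handles help once more than one figure exists. It assumes a headless environment (hence the Agg backend), and the variable names are illustrative.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Two independent figures, each addressed through its own handles
fig_sin, ax_sin = plt.subplots()
fig_cos, ax_cos = plt.subplots()

# Creating fig_cos made it the "current" figure for plt.* calls,
# but explicit handles still let us draw on the first one safely
ax_sin.plot(x, np.sin(x))
ax_sin.set_title('sin')

ax_cos.plot(x, np.cos(x))
ax_cos.set_title('cos')

# A bare plt.plot(...) here would land on fig_cos, not fig_sin
current_is_cos = plt.gcf() is fig_cos
```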

In Matplotlib, what is the difference between 'Axes' and 'Axis'?

They are synonyms

✗ Try again.

Axes is the entire plot area (coordinate system); Axis is the individual X or Y axis line

✓ Correct! Well done.

Axis is the entire canvas; Axes is one subplot

✗ Try again.

Axes handles 2-D plots; Axis handles 3-D

✗ Try again.

Which Matplotlib function returns both a Figure and an Axes object in one call?

plt.figure()

✗ Try again.

plt.plot()

✗ Try again.

plt.subplots()

✓ Correct! Well done.

plt.draw()

✗ Try again.

19. What are the most common chart types in Matplotlib and when do you use each?

Choosing the right chart type communicates data clearly; choosing the wrong one obscures it. Here are the workhorses of exploratory data analysis:

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# 1. Line chart — trends over time or ordered x-axis
ax = axes[0, 0]
ax.plot([1, 2, 3, 4], [10, 15, 13, 18])
ax.set_title('Line: trends')

# 2. Bar chart — comparing discrete categories
ax = axes[0, 1]
ax.bar(['A', 'B', 'C'], [30, 45, 20])
ax.set_title('Bar: categories')

# 3. Scatter plot — relationship between two continuous variables
ax = axes[0, 2]
x = np.random.randn(100)
y = x * 0.8 + np.random.randn(100) * 0.5
ax.scatter(x, y, alpha=0.5, c='steelblue')
ax.set_title('Scatter: correlation')

# 4. Histogram — distribution of one continuous variable
ax = axes[1, 0]
ax.hist(np.random.randn(1000), bins=30, color='salmon', edgecolor='white')
ax.set_title('Histogram: distribution')

# 5. Box plot — distribution summary with outliers
ax = axes[1, 1]
ax.boxplot([np.random.randn(100) for _ in range(3)], labels=['G1','G2','G3'])  # labels is renamed tick_labels in Matplotlib 3.9+
ax.set_title('Box: spread & outliers')

# 6. Heatmap via imshow — 2-D matrix data (e.g., correlation matrix)
ax = axes[1, 2]
data = np.random.rand(4, 4)
im = ax.imshow(data, cmap='viridis')
plt.colorbar(im, ax=ax)
ax.set_title('Heatmap: 2-D matrix')

fig.tight_layout()
plt.show()

Rule of thumb: line for temporal/ordered data, bar for nominal comparisons, scatter for two-variable relationships, histogram for single-variable distributions, box for group comparisons with outlier context, heatmap for correlation matrices and confusion matrices.

Which chart type is best for showing the correlation between two continuous numerical variables?

Bar chart

✗ Try again.

Pie chart

✗ Try again.

Scatter plot

✓ Correct! Well done.

Line chart

✗ Try again.

What information does a box plot display that a histogram does not?

The distribution shape

✗ Try again.

Outliers, quartiles, and the median all in one compact view

✓ Correct! Well done.

The mean value

✗ Try again.

The count of values in each bin

✗ Try again.

20. How do you create multi-panel figures with Matplotlib subplots?

Multi-panel figures are standard in data science reports — comparing multiple variables or time periods side by side. Matplotlib provides several ways to arrange subplots.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)

# --- Regular grid ---
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
# sharex=True links x-axis zoom/pan across all subplots
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title('sin')
axes[0, 1].plot(x, np.cos(x), color='tomato')
axes[0, 1].set_title('cos')
axes[1, 0].plot(x, np.tan(x))
axes[1, 0].set_title('tan')
axes[1, 1].set_visible(False)   # hide unused subplot
fig.suptitle('Trig functions', fontsize=16)
fig.tight_layout(rect=[0, 0, 1, 0.95])   # leave room for suptitle

# --- Flatten for iteration ---
# df is assumed to be an existing DataFrame with up to six numeric columns
fig, axes = plt.subplots(2, 3, figsize=(14, 6))
for ax, col in zip(axes.flatten(), df.select_dtypes('number').columns):
    ax.hist(df[col].dropna(), bins=20)
    ax.set_title(col)

# --- GridSpec for irregular layouts ---
from matplotlib.gridspec import GridSpec
fig = plt.figure(figsize=(12, 6))
gs  = GridSpec(2, 3, figure=fig)
ax1 = fig.add_subplot(gs[0, :2])   # spans first two columns of row 0
ax2 = fig.add_subplot(gs[0, 2])    # third column of row 0
ax3 = fig.add_subplot(gs[1, :])    # entire row 1

axes.flatten() is the standard idiom when you want to loop over a 2-D grid of Axes objects as if they were a 1-D list. fig.tight_layout() automatically adjusts spacing to prevent labels overlapping between subplots — call it before plt.show() or fig.savefig().
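One detail the loop idiom leaves open is what happens when there are fewer columns than panels. The sketch below, using a hypothetical four-column DataFrame of random numbers on a 2×3 grid, hides the panels that received no data.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: four numeric columns, so two of the six panels stay empty
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
flat_axes = axes.flatten()
numeric_cols = df.select_dtypes('number').columns

for ax, col in zip(flat_axes, numeric_cols):
    ax.hist(df[col].dropna(), bins=20)
    ax.set_title(col)

# zip stops at the shorter input, so hide panels that received no data
for ax in flat_axes[len(numeric_cols):]:
    ax.set_visible(False)

fig.tight_layout()
```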

What does sharex=True do when passed to plt.subplots()?

Copies the x-axis values from the first subplot to all others

✗ Try again.

Links the x-axis range so zooming or panning one subplot affects all

✓ Correct! Well done.

Forces all subplots to display the same data

✗ Try again.

Shares the x-axis label across subplots

✗ Try again.

Which method converts a 2-D array of Axes (from plt.subplots(2,3)) into a 1-D array for easy iteration?

axes.ravel() or axes.flatten()

✓ Correct! Well done.

axes.reshape(1, -1)

✗ Try again.

np.concat(axes)

✗ Try again.

axes.tolist()

✗ Try again.

21. What is Seaborn and how does it differ from Matplotlib?

Seaborn is a high-level statistical visualisation library built on top of Matplotlib. Where Matplotlib gives you full control over every pixel, Seaborn provides opinionated, attractive defaults and plot types designed specifically for statistical exploration — with far less boilerplate code.

Matplotlib vs Seaborn
Aspect	Matplotlib	Seaborn
Level	Low-level — explicit control	High-level — declarative
Defaults	Functional but plain	Publication-quality themes out of the box
DataFrame integration	Manual (extract arrays)	Direct: pass data=df and column names
Statistical plots	Manual calculation required	Built-in (regression, KDE, violin, pair)
Customisation	Unlimited	Matplotlib calls needed for fine-tuning

import seaborn as sns
import matplotlib.pyplot as plt

# Load a built-in example dataset
tips = sns.load_dataset('tips')

# Seaborn: one line to create a scatter with a regression line and confidence band
sns.regplot(data=tips, x='total_bill', y='tip')

# Matplotlib equivalent would require:
# 1. Compute regression manually
# 2. Plot scatter
# 3. Plot fitted line
# 4. Shade confidence interval — ~15 lines total

# Themes and contexts
sns.set_theme(style='whitegrid', context='notebook', palette='muted')
# styles: darkgrid, whitegrid, dark, white, ticks
# contexts: paper, notebook, talk, poster (scale font/line sizes)

Axes-level Seaborn functions such as scatterplot, boxplot, and heatmap return Matplotlib Axes objects, so all standard Matplotlib customisation still applies after the Seaborn call: ax = sns.scatterplot(...); ax.set_title('My Title'). Seaborn does not replace Matplotlib; it complements it by handling the tedious parts of statistical plotting.

What object do most Seaborn plotting functions return?

A Seaborn Figure object

✗ Try again.

A Matplotlib Axes object

✓ Correct! Well done.

A Pandas DataFrame

✗ Try again.

A PNG image

✗ Try again.

Which Seaborn function sets the global theme (background grid, font scale, colour palette) for all subsequent plots?

sns.theme()

✗ Try again.

sns.style()

✗ Try again.

sns.set_theme()

✓ Correct! Well done.

sns.configure()

✗ Try again.

22. What are the most important Seaborn plot types for exploratory data analysis?

Seaborn divides its plots into relational (relationship between variables), distributional (distribution of a single variable), and categorical (comparison across categories). Knowing when to use each makes EDA far more efficient.

import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')

# --- Relational ---
# Scatter with colour encoding
sns.scatterplot(data=tips, x='total_bill', y='tip',
                hue='smoker', size='size', palette='Set1')

# Regression line + scatter
sns.regplot(data=tips, x='total_bill', y='tip', ci=95)

# --- Distributional ---
# Histogram + KDE
sns.histplot(data=tips, x='total_bill', hue='sex', kde=True, bins=20)

# KDE only
sns.kdeplot(data=tips, x='total_bill', hue='sex', fill=True)

# ECDF — empirical cumulative distribution
sns.ecdfplot(data=tips, x='total_bill', hue='day')

# --- Categorical ---
# Box plot
sns.boxplot(data=tips, x='day', y='total_bill', hue='smoker', palette='pastel')

# Violin — box + KDE combined
sns.violinplot(data=tips, x='day', y='tip', inner='quartile')

# Bar chart with error bars (95% CI by default)
sns.barplot(data=tips, x='day', y='tip', estimator='mean', errorbar='ci')

# Strip plot — all individual points
sns.stripplot(data=tips, x='day', y='tip', jitter=True, alpha=0.4)

# --- Multi-variable overview ---
# Pair plot — scatter matrix of all numeric column pairs
sns.pairplot(tips, hue='sex', diag_kind='kde')

# Heatmap — great for correlation matrices
corr = tips.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)

What does a violin plot combine that a regular box plot does not show?

The mean and 95% confidence interval

✗ Try again.

The full KDE distribution shape alongside the quartile summary

✓ Correct! Well done.

Individual data points plotted as dots

✗ Try again.

A second y-axis for normalised values

✗ Try again.

Which Seaborn function creates a scatter matrix of all numeric column pairs in a DataFrame?

sns.scatterplot()

✗ Try again.

sns.pairplot()

✓ Correct! Well done.

sns.jointplot()

✗ Try again.

sns.relplot()

✗ Try again.

23. How do you create and interpret a correlation heatmap with Seaborn?

A correlation heatmap is one of the first plots every data scientist makes on a new dataset. It shows the Pearson (or other) correlation coefficient between every pair of numeric features as a colour-coded grid, immediately revealing which variables move together and which do not.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset
df = sns.load_dataset('penguins').select_dtypes('number')

# Compute correlation matrix
corr = df.corr()   # Pearson by default; method='spearman' for ranked
print(corr)

# --- Basic heatmap ---
fig, ax = plt.subplots(figsize=(7, 5))
sns.heatmap(
    corr,
    annot=True,           # show values inside each cell
    fmt='.2f',            # 2 decimal places
    cmap='coolwarm',      # blue = negative, red = positive
    vmin=-1, vmax=1,      # fix colour scale to [-1, 1]
    linewidths=0.5,       # add grid lines between cells
    ax=ax,
)
ax.set_title('Feature Correlation Matrix')
fig.tight_layout()

# --- Mask upper triangle (remove redundancy) ---
import numpy as np
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', vmin=-1, vmax=1)

Interpreting the output: values close to +1 mean strong positive linear correlation (both variables increase together), values close to -1 mean strong negative correlation (one increases as the other decreases), and values near 0 indicate little to no linear relationship. The diagonal is always 1.0 (a variable is perfectly correlated with itself). Masking the upper triangle removes the mirror image and makes the chart less cluttered.
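These three cases can be reproduced with synthetic data where the relationships are known in advance. The sketch below is an illustration only: the column names, slopes, and noise levels are assumptions chosen to produce one strongly positive, one strongly negative, and one near-zero correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)

df = pd.DataFrame({
    'x': x,
    'up': 2 * x + rng.normal(scale=0.5, size=500),     # rises with x
    'down': -3 * x + rng.normal(scale=0.5, size=500),  # falls as x rises
    'noise': rng.normal(size=500),                     # unrelated to x
})

# Pearson correlation matrix: symmetric, with ones on the diagonal
corr = df.corr()
print(corr.round(2))
```

Reading the 'x' row of the printed matrix shows a value near +1 for 'up', near -1 for 'down', and near 0 for 'noise', which is the pattern the heatmap colours encode.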

What does a correlation value of -0.87 between two features indicate?

The features are nearly uncorrelated

✗ Try again.

As one feature increases, the other strongly tends to decrease

✓ Correct! Well done.

The features have a strong positive relationship

✗ Try again.

The data contains 87% missing values

✗ Try again.

What does masking the upper triangle of a correlation heatmap achieve?

Hides correlations above 0

✗ Try again.

Removes the redundant mirror image since the matrix is symmetric

✓ Correct! Well done.

Shows only statistically significant correlations

✗ Try again.

Highlights the diagonal for clarity

✗ Try again.

24. What is Seaborn's FacetGrid and how does it enable multi-panel statistical plots?

FacetGrid is Seaborn's mechanism for trellis/small-multiples plots — the same chart repeated across different subsets of the data, defined by one or more categorical columns. It is one of Seaborn's most powerful features for exploring interaction effects between variables.

import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')

# --- FacetGrid manually ---
g = sns.FacetGrid(tips, col='time', row='sex', height=3, aspect=1.2)
g.map_dataframe(sns.histplot, x='total_bill', bins=15, kde=True)
g.add_legend()
g.set_titles(col_template='{col_name} service', row_template='Sex: {row_name}')
g.set_axis_labels('Total Bill ($)', 'Count')

# --- Figure-level functions (wrap FacetGrid automatically) ---
# relplot — relational
sns.relplot(data=tips, x='total_bill', y='tip',
            col='smoker', hue='sex', kind='scatter', height=4)

# displot — distributional
sns.displot(data=tips, x='total_bill',
            col='sex', row='time', kind='kde', fill=True)

# catplot — categorical
sns.catplot(data=tips, x='day', y='tip',
            col='sex', kind='violin', height=5, aspect=0.8)

The figure-level functions (relplot, displot, catplot) return a FacetGrid object, not an Axes. To customise them after creation you call FacetGrid methods like g.set_titles(), g.set_axis_labels(), or iterate over g.axes.flatten() to access individual Axes objects and apply standard Matplotlib customisation.

What is the purpose of Seaborn's FacetGrid?

To render plots in a grid of pixels for image export

✗ Try again.

To create the same chart for different subsets of data defined by categorical variables

✓ Correct! Well done.

To apply a faceted colour palette across a single chart

✗ Try again.

To merge multiple DataFrames before plotting

✗ Try again.

Which figure-level Seaborn function is used to create distributional plots across facets?

sns.distplot()

✗ Try again.

sns.displot()

✓ Correct! Well done.

sns.histgrid()

✗ Try again.

sns.facethist()

✗ Try again.

25. How do you compute descriptive statistics on a Pandas DataFrame?

Descriptive statistics summarise the central tendency, spread, and shape of a dataset. Pandas df.describe() is the starting point for any exploratory analysis, but knowing the individual methods gives you more precise control.

import pandas as pd
import numpy as np

df = pd.read_csv('housing.csv')

# --- df.describe() ---
# Numeric columns: count, mean, std, min, 25%, 50%, 75%, max
df.describe()
# Include object columns too
df.describe(include='all')

# --- Individual statistics ---
df['price'].mean()      # arithmetic mean
df['price'].median()    # 50th percentile — robust to outliers
df['price'].mode()[0]   # most frequent value (returns Series)
df['price'].std()       # standard deviation (ddof=1 by default)
df['price'].var()       # variance
df['price'].skew()      # skewness: >0 right-skewed, <0 left-skewed
df['price'].kurt()      # excess kurtosis (0 = normal dist)
df['price'].quantile(0.90)  # 90th percentile
df['price'].quantile([0.25, 0.5, 0.75])  # multiple quantiles

# IQR — interquartile range (robust measure of spread)
Q1, Q3 = df['price'].quantile(0.25), df['price'].quantile(0.75)
IQR = Q3 - Q1

# Outlier detection via IQR fence
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['price'] < lower) | (df['price'] > upper)]
print(f'{len(outliers)} outliers detected ({len(outliers)/len(df)*100:.1f}%)')
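Since the snippet above depends on a housing.csv file, the following self-contained sketch may be easier to run. It substitutes synthetic lognormal draws for real prices (an assumption made purely for illustration) to show how the mean, median, skewness, and IQR fence behave on right-skewed data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Lognormal draws give a right-skewed, price-like distribution (synthetic data)
prices = pd.Series(rng.lognormal(mean=12, sigma=0.6, size=5_000))

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()

# IQR fence on the high side, as in the outlier rule above
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
n_high = (prices > upper_fence).sum()

print(f'mean={mean_price:,.0f}  median={median_price:,.0f}  skew={skewness:.2f}')
print(f'{n_high} values above the upper fence')
```

On data like this the mean sits above the median because the long right tail pulls it upward, which is why the median is usually the safer summary for skewed columns.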

Why is the median preferred over the mean as a measure of central tendency for highly skewed data?

The median is always larger than the mean

✗ Try again.

The median is not affected by extreme outlier values; the mean is pulled toward them

✓ Correct! Well done.

The median is faster to compute

✗ Try again.

The mean is undefined for skewed distributions

✗ Try again.

What does a positive skewness value indicate about a distribution?

The distribution is symmetric

✗ Try again.

The distribution has a long right tail (most values clustered on the left)

✓ Correct! Well done.

The distribution has a long left tail

✗ Try again.

The distribution is bimodal

✗ Try again.

26. How do you reduce a Pandas DataFrame's memory usage through dtype optimisation?

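DataFrames loaded from CSV often use unnecessarily large dtypes: 64-bit integers for values that fit in 8 bits, generic object dtype for repeated string categories. Choosing tighter dtypes can often reduce memory by 4–8×, enabling analysis of larger datasets within available RAM. Integer downcasting and conversion to category are lossless; downcasting float64 to float32 keeps only about 7 significant digits, so check that the reduced precision is acceptable.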

import pandas as pd
import numpy as np

df = pd.read_csv('large.csv')
print(f'Memory before: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB')

# --- Integer downcasting ---
for col in df.select_dtypes('int64').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
    # downcast tries int8 -> int16 -> int32 depending on value range

# --- Float downcasting ---
for col in df.select_dtypes('float64').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')  # float32

# --- Categorical: object columns with low cardinality ---
# If a column has < 5% unique values, Categorical saves memory
for col in df.select_dtypes('object').columns:
    n_unique = df[col].nunique()
    if n_unique / len(df) < 0.05:    # less than 5% cardinality
        df[col] = df[col].astype('category')

print(f'Memory after : {df.memory_usage(deep=True).sum() / 1e6:.1f} MB')

# Categorical also speeds up groupby on low-cardinality columns
# because grouping enumerates integers rather than comparing strings

The Categorical dtype stores repeated strings as integer codes internally — a column with 5 unique city names in a million-row dataset stores one integer per row rather than one full string per row. This speeds up groupby, sort_values, and value_counts in addition to saving memory.
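The saving can be measured directly. The sketch below builds a hypothetical 100,000-row column from five made-up city names and compares the two representations; exact byte counts depend on the platform and pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cities = ['Berlin', 'Lagos', 'Lima', 'Osaka', 'Perth']  # hypothetical values
as_object = pd.Series(rng.choice(cities, size=100_000), dtype=object)
as_category = as_object.astype('category')

# deep=True counts the Python string objects, not just the pointers to them
bytes_object = as_object.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)

# Per-row storage is a small integer code pointing into the category list
codes_dtype = as_category.cat.codes.dtype

print(f'object:   {bytes_object / 1e6:.2f} MB')
print(f'category: {bytes_category / 1e6:.2f} MB (codes stored as {codes_dtype})')
```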

Which Pandas dtype should you use for a column with only 10 distinct string values repeated across a million rows?

object

✗ Try again.

string

✗ Try again.

category

✓ Correct! Well done.

str8

✗ Try again.

What does pd.to_numeric(col, downcast='integer') do?

Converts the column to float

✗ Try again.

Finds the smallest integer dtype (int8, int16, etc.) that fits all values and casts to it

✓ Correct! Well done.

Removes all non-numeric values

✗ Try again.

Rounds values to the nearest integer

✗ Try again.

27. How do you generate reproducible random data with NumPy?

Reproducibility is a core requirement of data science — experiments, train/test splits, and simulations must produce the same result every run so that results can be verified and shared. NumPy's random number generation is the building block for all of this.

import numpy as np

# --- Legacy API (still common in older code) ---
np.random.seed(42)
np.random.rand(3)       # [0.375, 0.951, 0.732] (rounded), same every time

# --- Modern API: Generator (preferred since NumPy 1.17) ---
rng = np.random.default_rng(seed=42)
# A Generator is an explicit object (no hidden global state) with better statistical properties

rng.random(5)                # uniform [0, 1)
rng.standard_normal(5)       # N(0, 1)
rng.normal(loc=170, scale=10, size=1000)  # N(mean, std)
rng.integers(0, 100, size=10)  # random ints in [0, 100)
rng.choice(['a','b','c'], size=5, replace=True)  # random sampling
arr = np.arange(10)           # example array to shuffle
rng.shuffle(arr)              # in-place shuffle
rng.permutation(arr)          # shuffled copy

# --- Distributions used in ML simulations ---
rng.binomial(n=10, p=0.3, size=100)      # number of successes in n trials
rng.poisson(lam=5, size=100)             # events per interval
rng.exponential(scale=2, size=100)       # time between Poisson events
rng.uniform(low=0, high=10, size=100)    # uniform distribution

The modern Generator API (np.random.default_rng) is preferred over np.random.seed because the generator is a first-class object you can pass around rather than hidden global state, each thread or worker can be given its own independent generator, and it uses the PCG64 algorithm, which passes more statistical tests than the Mersenne Twister used by the legacy API.
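A short sketch of what "passing the generator around" looks like in practice is given below. The simulate helper and the choice of two workers are hypothetical; SeedSequence.spawn is used here as one way to derive independent streams from a single seed.

```python
import numpy as np

# Same seed, separate Generator objects: identical streams
a = np.random.default_rng(42).random(3)
b = np.random.default_rng(42).random(3)

def simulate(rng, n):
    """Hypothetical helper: receives the generator explicitly, no global state."""
    return rng.normal(loc=170, scale=10, size=n)

# Independent child streams derived from one seed, e.g. one per worker
children = np.random.SeedSequence(42).spawn(2)
worker_rngs = [np.random.default_rng(child) for child in children]
draws = [simulate(worker_rng, 5) for worker_rng in worker_rngs]

print(np.array_equal(a, b))
```

Because each worker owns its generator, the results do not depend on the order in which workers happen to run, which is what makes the whole simulation reproducible.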

Why is np.random.default_rng(seed) preferred over np.random.seed() for production code?

default_rng is faster for large arrays

✗ Try again.

default_rng returns a Generator object that is thread-safe and not global state; seed() sets a global state that can cause subtle bugs in parallel code

✓ Correct! Well done.

seed() is deprecated and removed in NumPy 2.0

✗ Try again.

default_rng supports more distributions

✗ Try again.

What distribution does rng.standard_normal() draw from?

Uniform [0, 1)

✗ Try again.

Standard normal N(0, 1)

✓ Correct! Well done.

Exponential with mean 1

✗ Try again.

Binomial with p=0.5

✗ Try again.

28. How do you use value_counts() and pd.crosstab() to understand categorical data?

Categorical columns are understood by counting their frequencies and cross-tabulating them against other variables. These two tools answer the questions 'what values exist and how often?' and 'how are two categorical variables related?'

import pandas as pd
tips = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')

# --- value_counts ---
tips['day'].value_counts()
# Sat     87  Sun     76  Thur    62  Fri     19   (sorted by count, descending)

tips['day'].value_counts(normalize=True).round(3)
# proportions: Sat 0.357, Sun 0.311, Thur 0.254, Fri 0.078

tips['day'].value_counts(dropna=False)  # includes NaN count if any

# Count unique values
tips['day'].nunique()   # 4

# Histogram of numeric with bins
pd.cut(tips['total_bill'], bins=5).value_counts().sort_index()

# --- pd.crosstab ---
# Frequency cross-table: how many smokers vs non-smokers per day
ct = pd.crosstab(tips['day'], tips['smoker'])
# smoker   No  Yes
# day
# Fri       4   15
# Sat      45   42
# Sun      57   19
# Thur     45   17

# Proportions within rows (what % of each day are smokers)
pd.crosstab(tips['day'], tips['smoker'], normalize='index').round(3)

# With aggregation (mean tip by day and smoker)
pd.crosstab(tips['day'], tips['smoker'],
            values=tips['tip'], aggfunc='mean').round(2)
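For a version that runs without network access and can be checked by hand, the sketch below uses a tiny made-up six-row dataset (an assumption for illustration, not the real tips data) to show what the two normalisation options return.

```python
import pandas as pd

# Tiny made-up dataset so the numbers can be verified by hand
df = pd.DataFrame({
    'day':    ['Sat', 'Sat', 'Sat', 'Sun', 'Sun', 'Fri'],
    'smoker': ['Yes', 'No',  'No',  'No',  'No',  'Yes'],
})

# Share of rows per day: 3 of 6 rows are Sat, so Sat is 0.5
shares = df['day'].value_counts(normalize=True)

# Row-wise proportions: within each day, the No/Yes shares sum to 1
by_row = pd.crosstab(df['day'], df['smoker'], normalize='index')

print(shares)
print(by_row)
```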

What does value_counts(normalize=True) return?

Counts sorted by value

✗ Try again.

Proportions — each count divided by the total number of non-null values

✓ Correct! Well done.

Normalised z-scores of the counts

✗ Try again.

Counts after removing outliers

✗ Try again.

What does pd.crosstab(df['A'], df['B'], normalize='index') produce?

Raw counts per cell

✗ Try again.

Row proportions — each row sums to 1.0

✓ Correct! Well done.

Column proportions — each column sums to 1.0

✗ Try again.

The chi-squared statistic for independence

✗ Try again.

29. How do you style Matplotlib figures and save them for reports?

The default Matplotlib style is functional but plain. For presentations and reports you need publication-quality output — chosen colour palettes, correct font sizes, no chart junk, and lossless or high-resolution raster output.

import matplotlib.pyplot as plt
import numpy as np

# --- Using a style sheet ---
plt.style.use('seaborn-v0_8-whitegrid')  # clean grid background
# Other useful styles: 'ggplot', 'fivethirtyeight', 'bmh', 'dark_background'
print(plt.style.available)   # list all available styles

# --- Common appearance tweaks via rcParams ---
plt.rcParams.update({
    'font.size':        12,
    'axes.labelsize':   13,
    'axes.titlesize':   14,
    'legend.fontsize':  11,
    'figure.dpi':       100,
    'lines.linewidth':  2,
})

# --- Figure construction ---
fig, ax = plt.subplots(figsize=(8, 5))
x = np.linspace(0, 10, 200)
ax.plot(x, np.sin(x), color='#2E86AB', label='sin(x)')
ax.fill_between(x, np.sin(x), 0, alpha=0.15, color='#2E86AB')
ax.axhline(0, color='black', linewidth=0.8, linestyle='--')
ax.set_title('Sine Wave with Fill', pad=12)
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
ax.legend(loc='upper right')
ax.spines[['top', 'right']].set_visible(False)  # remove chart junk
fig.tight_layout()

# --- Saving ---
fig.savefig('output.png', dpi=300, bbox_inches='tight')  # raster
fig.savefig('output.pdf', bbox_inches='tight')            # vector
fig.savefig('output.svg', bbox_inches='tight')            # web/edit

Use bbox_inches='tight' whenever saving — it prevents axis labels being clipped at the edges. For publications use PDF or SVG (vector formats that scale without pixelation). For web and slides, PNG at 150–300 DPI is standard.
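
As a rough illustration of how dpi and bbox_inches affect the saved file, the sketch below renders one small figure into in-memory PNG buffers and reads back the pixel dimensions. The helper name saved_png_shape and the figure content are illustrative assumptions, and the non-interactive Agg backend is chosen only so the snippet runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless rendering, no GUI needed
import matplotlib.pyplot as plt


def saved_png_shape(dpi, tight):
    """Save a 4x3-inch figure as PNG in memory; return (height_px, width_px)."""
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot([0, 1, 2], [0, 1, 4])
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=dpi,
                bbox_inches="tight" if tight else None)
    plt.close(fig)
    buf.seek(0)
    return plt.imread(buf).shape[:2]


# Without 'tight', pixel size is simply figsize (inches) * dpi
low = saved_png_shape(dpi=100, tight=False)
high = saved_png_shape(dpi=200, tight=False)

# With 'tight', the canvas is re-cropped around everything that was drawn,
# so the size no longer follows figsize exactly
tight = saved_png_shape(dpi=100, tight=True)

print(low, high, tight)
```

Doubling dpi doubles both pixel dimensions, which is why 300 DPI is a common choice for print-quality raster output, while vector formats such as PDF and SVG avoid the question altogether.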

What does bbox_inches='tight' do when saving a Matplotlib figure?

Adds a tight border around the figure

✗ Try again.

Ensures axis labels and titles are not clipped when saving

✓ Correct! Well done.

Reduces file size by cropping whitespace only

✗ Try again.

Applies JPEG compression to the output

✗ Try again.

Which file format should you use when you need a Matplotlib plot that scales without pixelation in a report?

PNG

✗ Try again.

BMP

✗ Try again.

PDF or SVG

✓ Correct! Well done.

GIF

✗ Try again.

30. What is np.where and how is it used for conditional array creation?

np.where is NumPy's vectorised if/else for arrays. In its three-argument form, np.where(condition, x, y), it returns a new array built element by element: where the condition is True it takes the value from x, otherwise the value from y. It is the idiomatic, faster alternative to a Python loop with an if-statement inside.

import numpy as np

scores = np.array([88, 45, 72, 91, 60, 33, 95])

# Classify into Pass / Fail without a loop
labels = np.where(scores >= 70, 'Pass', 'Fail')
# ['Pass' 'Fail' 'Pass' 'Pass' 'Fail' 'Fail' 'Pass']

# Apply a discount: over 80 gets 20% off, rest gets 5% off
prices = np.array([100.0, 200.0, 50.0, 150.0])
discounted = np.where(prices > 80, prices * 0.80, prices * 0.95)
# [ 80.  160.   47.5 120. ]

# Chain multiple conditions using np.select
conditions = [
    scores >= 90,
    (scores >= 70) & (scores < 90),
    scores < 70,
]
choices = ['A', 'B', 'C']
grades = np.select(conditions, choices, default='F')
# ['B' 'C' 'B' 'A' 'C' 'C' 'A']

# One-argument form: returns indices where condition is True
failing_indices = np.where(scores < 70)
# (array([1, 4, 5]),)   — tuple of index arrays
failing_scores = scores[failing_indices]
# [45 60 33]

np.select generalises np.where to multiple conditions — the first matching condition wins. Use it whenever you have more than two output categories; chaining nested np.where calls quickly becomes unreadable.
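
The "first matching condition wins" rule can be checked with a small sketch. It reuses the scores array from the example above but deliberately makes the conditions overlap; the variable names grades, nested, and shadowed are illustrative.

```python
import numpy as np

scores = np.array([88, 45, 72, 91, 60, 33, 95])

# Overlapping conditions, ordered from strictest to loosest.
# A score of 95 satisfies both, but the first match ('A') wins.
conditions = [scores >= 90, scores >= 70]
grades = np.select(conditions, ['A', 'B'], default='C')

# The same logic written as nested np.where calls
nested = np.where(scores >= 90, 'A', np.where(scores >= 70, 'B', 'C'))

# Reversing the order changes the result: 'B' now shadows 'A'
shadowed = np.select(conditions[::-1], ['B', 'A'], default='C')

print(grades)
print(np.array_equal(grades, nested))
print(shadowed)
```

Because order decides the outcome, listing conditions from most specific to most general lets you drop the explicit upper bounds that mutually exclusive conditions would need.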

What does np.where(scores < 70) return when called with only one argument?

A boolean array

✗ Try again.

An array of zeros and ones

✗ Try again.

A tuple of index arrays where the condition is True

✓ Correct! Well done.

The count of True values

✗ Try again.

Which NumPy function is the idiomatic replacement for chaining multiple np.where conditions?

np.choose()

✗ Try again.

np.piecewise()

✗ Try again.

np.select()

✓ Correct! Well done.

np.ifelse()

✗ Try again.

31. What is Pandas method chaining and how does df.pipe() support it?

Method chaining is the style of writing data transformations as a single expression where each step's result is the input to the next. It avoids creating intermediate variables, reads like a pipeline, and makes the data flow explicit from top to bottom.

import numpy as np
import pandas as pd

# --- Without chaining (intermediate variables) ---
df1 = pd.read_csv('raw.csv')
df2 = df1.rename(columns={'rev': 'revenue'})   # rename first so 'revenue' exists
df3 = df2.dropna(subset=['revenue'])
df4 = df3[df3['revenue'] > 0]
df5 = df4.assign(log_revenue=lambda d: np.log1p(d['revenue']))
result = df5.groupby('region')['log_revenue'].mean()

# --- With method chaining ---
result = (
    pd.read_csv('raw.csv')
    .rename(columns={'rev': 'revenue'})
    .dropna(subset=['revenue'])
    .query('revenue > 0')
    .assign(log_revenue=lambda d: np.log1p(d['revenue']))
    .groupby('region')['log_revenue']
    .mean()
)

# --- df.pipe() for custom functions ---
def remove_outliers(df, col, n_std=3):
    mean, std = df[col].mean(), df[col].std()
    return df[(df[col] - mean).abs() < n_std * std]

def add_rank(df, col):
    df = df.copy()
    df['rank'] = df[col].rank(ascending=False)
    return df

result = (
    pd.read_csv('raw.csv')
    .pipe(remove_outliers, col='revenue')
    .pipe(add_rank, col='revenue')
)
# pipe passes the DataFrame as the first argument to the function

df.pipe(func, *args, **kwargs) calls func(df, *args, **kwargs), inserting the DataFrame at the front of the argument list. This lets you write standalone functions and use them inline in a method chain without breaking the fluent style.
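
A minimal sketch of that equivalence on an in-memory DataFrame follows. The functions add_margin and scale and the column names are hypothetical. The second call uses the (callable, keyword) tuple form that pipe accepts when the function does not take the DataFrame as its first argument.

```python
import pandas as pd


def add_margin(df, cost_col, price_col):
    """Return a copy of df with a 'margin' column (hypothetical helper)."""
    return df.assign(margin=df[price_col] - df[cost_col])


def scale(factor, data, col):
    """Hypothetical helper whose DataFrame argument is NOT first."""
    return data.assign(**{col: data[col] * factor})


df = pd.DataFrame({'cost': [1.0, 2.0], 'price': [3.0, 5.0]})

# These two lines do the same thing
via_pipe = df.pipe(add_margin, cost_col='cost', price_col='price')
direct = add_margin(df, cost_col='cost', price_col='price')

# Tuple form: pass the DataFrame as the keyword argument named 'data'
scaled = df.pipe((scale, 'data'), 10, col='price')

print(via_pipe.equals(direct))
print(scaled['price'].tolist())
```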

What does df.pipe(my_func, extra_arg=5) do?

Filters df using my_func as a condition

✗ Try again.

Calls my_func(df, extra_arg=5) and returns the result

✓ Correct! Well done.

Applies my_func element-wise like apply()

✗ Try again.

Passes extra_arg as the first argument and df as the second

✗ Try again.

What is the main readability advantage of method chaining over using intermediate variables?

It is always faster

✗ Try again.

The data flow reads top-to-bottom as a single pipeline without naming throw-away intermediates

✓ Correct! Well done.

It allows parallel execution of steps

✗ Try again.

Pandas optimises chained calls into a single pass

✗ Try again.

32. What does a typical exploratory data analysis (EDA) workflow look like in Python?

EDA is the first thing you do with a new dataset before any modelling. The goal is to understand the data's structure, quality, and relationships, and to spot problems (wrong dtypes, missing values, outliers, data leakage) before they propagate into a model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load and inspect
df = pd.read_csv('housing.csv')
print(df.shape)          # (rows, cols)
print(df.dtypes)         # types per column
print(df.head())         # first 5 rows
df.info()                # prints dtypes + non-null counts itself (returns None)
print(df.describe())     # summary stats for numeric cols

# 2. Missing value audit
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

# 3. Duplicate rows
print(df.duplicated().sum())
df = df.drop_duplicates()

# 4. Distribution of each numeric column
df.select_dtypes('number').hist(bins=30, figsize=(16, 10))
plt.tight_layout(); plt.show()

# 5. Correlation heatmap
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix'); plt.show()

# 6. Target variable distribution
target = 'price'
sns.histplot(df[target], kde=True)
print(f'Skewness: {df[target].skew():.2f}')

# 7. Categorical breakdown
for col in df.select_dtypes('object').columns:
    print(df[col].value_counts())

# 8. Outlier detection
for col in df.select_dtypes('number').columns:
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    n_out = ((df[col] < Q1-1.5*IQR)|(df[col] > Q3+1.5*IQR)).sum()
    if n_out > 0: print(f'{col}: {n_out} outliers')

EDA is iterative — findings in step 4 send you back to step 2, insights in the correlation matrix raise questions answered by group analysis. Keep a notebook with your observations alongside the code so you and your team can understand what was found and why certain preprocessing decisions were made.

What is the first thing you should check after loading a dataset into a DataFrame?

Build the model immediately

✗ Try again.

Check shape, dtypes, missing values, and duplicate rows to understand data quality

✓ Correct! Well done.

Normalise all numeric columns

✗ Try again.

Remove all rows with any missing value

✗ Try again.

Why should you investigate a highly skewed target variable before training a regression model?

Skewed targets cause DataFrames to be slower

✗ Try again.

Linear regression inference assumes roughly normal, constant-variance errors; a heavily skewed target often produces skewed, heteroscedastic residuals, and a log transform frequently helps

✓ Correct! Well done.

Sklearn cannot handle skewed targets

✗ Try again.

Skewness only matters for classification

✗ Try again.

33. How do you stack, concatenate, and split NumPy arrays?

Combining and splitting arrays is a frequent operation in data preprocessing — assembling feature matrices from multiple sources, or splitting a dataset into folds for cross-validation.

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# --- Concatenating along existing axes ---
np.concatenate([a, b], axis=0)  # stack rows (vertical)
# [[1 2]
#  [3 4]
#  [5 6]
#  [7 8]]

np.concatenate([a, b], axis=1)  # stack columns (horizontal)
# [[1 2 5 6]
#  [3 4 7 8]]

# --- Convenience stacking functions ---
np.vstack([a, b])   # vertical stack — same as axis=0
np.hstack([a, b])   # horizontal stack — same as axis=1 for 2-D
np.dstack([a, b])   # depth stack (creates a 3rd axis)

# stack — creates a NEW axis (different from concatenate!)
np.stack([a, b], axis=0)   # shape (2, 2, 2)
np.stack([a, b], axis=2)   # shape (2, 2, 2) — depth

# --- Splitting ---
big = np.arange(12).reshape(6, 2)
parts = np.vsplit(big, 3)    # split into 3 equal arrays along axis 0
# [array([[0, 1], [2, 3]]), array([[4, 5], [6, 7]]), array([[8, 9], [10, 11]])]

# Split at specific indices
parts = np.split(big, [2, 4], axis=0)  # [0:2], [2:4], [4:]

# Tile — repeat an array
np.tile(a, (2, 3))   # repeat a 2 times along rows, 3 times along cols
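
The difference between joining along an existing axis and creating a new one is easiest to see in the resulting shapes. The following sketch reuses a and b from above; the names joined, stacked, and manual are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

joined = np.concatenate([a, b], axis=0)   # existing axis 0 grows: (4, 2)
stacked = np.stack([a, b], axis=0)        # new leading axis: (2, 2, 2)

# stack behaves like concatenate after giving each input a new length-1 axis
manual = np.concatenate([a[np.newaxis], b[np.newaxis]], axis=0)

# Splitting undoes concatenation when the pieces are equal-sized
top, bottom = np.vsplit(joined, 2)

print(joined.shape, stacked.shape)
print(np.array_equal(stacked, manual), np.array_equal(top, a))
```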

What is the key difference between np.stack([a,b]) and np.concatenate([a,b])?

stack requires 3-D arrays; concatenate works on 2-D

✗ Try again.

stack creates a NEW axis; concatenate joins along an existing axis

✓ Correct! Well done.

concatenate is always slower

✗ Try again.

They are identical; stack is just shorthand

✗ Try again.

Which function is the row-wise counterpart of np.hsplit(), splitting a 2-D array into N equal parts along axis 0?

np.split()

✗ Try again.

np.hsplit()

✗ Try again.

np.vsplit()

✓ Correct! Well done.

np.chunk()

✗ Try again.

34. How do you detect and remove duplicate rows in a Pandas DataFrame?

Duplicate rows silently inflate counts, distort means, and can cause data leakage between training and test sets. Pandas provides duplicated() and drop_duplicates() for systematic duplicate management.

import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, 2, 3, 4, 4],
    'product':  ['A', 'B', 'B', 'C', 'D', 'D'],
    'amount':   [100, 200, 200, 150, 80, 90],   # last pair differs!
})

# --- Detecting duplicates ---
df.duplicated()               # True for each repeat; the first occurrence stays False
df.duplicated(keep='last')    # True for each repeat; the last occurrence stays False
df.duplicated(keep=False)     # True for ALL occurrences

print(df.duplicated().sum())  # count of duplicate rows

# Duplicate check on a subset of columns only
df.duplicated(subset=['order_id', 'product'])
# True where order_id AND product are repeated (ignores amount diff)

# --- Removing duplicates ---
df.drop_duplicates()          # removes all but first occurrence
df.drop_duplicates(keep='last')  # keeps last occurrence
df.drop_duplicates(keep=False)   # removes all occurrences of any duplicate

# Subset-based deduplication — keep first by order_id
df.drop_duplicates(subset=['order_id'], keep='first')

# Sort before deduplicating to control which row is 'first'
# (e.g., keep the highest amount per order)
df.sort_values('amount', ascending=False).drop_duplicates(subset=['order_id'])

When deduplicating on a subset of columns, think carefully about which row to keep. Sorting the DataFrame first (by timestamp, version, or a quality metric) ensures drop_duplicates(keep='first') retains the most appropriate record, not just whatever happened to be first in the file.
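
To make the "sort first, then deduplicate" advice concrete, here is a small sketch on the same order data that keeps the highest amount per order_id. The idxmax variant is shown as one possible alternative rather than the recommended way; the variable names are illustrative.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 2, 3, 4, 4],
    'product':  ['A', 'B', 'B', 'C', 'D', 'D'],
    'amount':   [100, 200, 200, 150, 80, 90],
})

# Option 1: sort so the preferred row comes first, then deduplicate
best_by_sort = (
    orders.sort_values('amount', ascending=False)
          .drop_duplicates(subset=['order_id'], keep='first')
          .sort_values('order_id')
)

# Option 2: pick the row label of the maximum amount within each order
best_by_idxmax = orders.loc[orders.groupby('order_id')['amount'].idxmax()]

print(best_by_sort[['order_id', 'amount']].to_dict('list'))
print(best_by_idxmax[['order_id', 'amount']].to_dict('list'))
```

Both approaches return whole rows, so the product column stays consistent with the amount that was kept, unlike a column-wise groupby(...).max().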

What does df.duplicated(keep=False) return?

True only for the second occurrence of each duplicate

✗ Try again.

True for ALL occurrences of any duplicated row, not just the extras

✓ Correct! Well done.

True only for completely unique rows

✗ Try again.

A count of total duplicates

✗ Try again.

How do you keep only the row with the highest amount for each order_id when deduplicating?

df.drop_duplicates(subset=['order_id'], keep='max')

✗ Try again.

Sort descending by amount first, then drop_duplicates(subset=['order_id'], keep='first')

✓ Correct! Well done.

df.groupby('order_id').max()

✗ Try again.

df.unique(subset=['order_id'], criterion='amount')

✗ Try again.

35. How do you control colours and colour palettes in Matplotlib and Seaborn?

Colour is one of the most impactful design decisions in a chart. Used correctly it encodes information; used poorly it confuses or misleads. Both Matplotlib and Seaborn give you fine-grained control.

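import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Example data so the snippets below run (illustrative choices)
x = np.linspace(0, 10, 100)
y = np.sin(x)
matrix = np.random.default_rng(0).random((10, 10))
df = sns.load_dataset('tips')                 # columns include 'day' and 'tip'
corr = df.select_dtypes('number').corr()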

# --- Matplotlib colour specifications ---
# Named CSS colours
plt.plot(x, y, color='steelblue')
# Hex string
plt.plot(x, y, color='#2E86AB')
# RGB tuple (values 0-1)
plt.plot(x, y, color=(0.18, 0.52, 0.67))
# Grayscale string
plt.plot(x, y, color='0.5')   # 50% grey

# --- Colormaps for continuous data ---
im = plt.imshow(matrix, cmap='viridis')   # perceptually uniform
plt.colorbar(im)
# Other good cmaps: 'plasma', 'inferno', 'magma' (sequential)
# 'RdBu', 'coolwarm', 'bwr' (diverging — centred on 0)
# 'tab10', 'Set1', 'Set2' (categorical)

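# --- Seaborn palettes ---
# Categorical (qualitative)
# Note: seaborn 0.13+ warns when palette is given without hue;
# there, add hue='day', legend=False to each barplot call below.
sns.barplot(data=df, x='day', y='tip', palette='Set2')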

# Sequential (one colour family)
sns.barplot(data=df, x='day', y='tip', palette='Blues_d')

# Diverging (two colour families around a midpoint)
sns.heatmap(corr, cmap='coolwarm', vmin=-1, vmax=1, center=0)

# Custom palette
custom = ['#E63946', '#457B9D', '#1D3557', '#A8DADC']
sns.barplot(data=df, x='day', y='tip', palette=custom)

# Preview a palette
sns.palplot(sns.color_palette('husl', 8))

Always use perceptually uniform colormaps (viridis, plasma) for continuous data — rainbow/jet maps are misleading because they are not perceptually linear (the eye perceives the yellow band as brighter than the blue or red bands, creating false visual contrast). For diverging data (correlation matrices, residuals) use a diverging colormap centred on zero.

Why should you avoid the 'jet' (rainbow) colormap for continuous data?

It is not available in older Matplotlib versions

✗ Try again.

It is perceptually non-uniform — the eye perceives some colours as brighter, creating false visual contrast

✓ Correct! Well done.

It only works with seaborn themes

✗ Try again.

It cannot represent negative values

✗ Try again.

Which type of colormap should you use for a correlation matrix where values range from -1 to +1?

Sequential (single hue)

✗ Try again.

Qualitative (distinct colours)

✗ Try again.

Diverging (two hue families around a midpoint)

✓ Correct! Well done.

Cyclic (wrapping hue)

✗ Try again.

36. How do rolling and expanding window functions work in Pandas?

Window functions compute statistics over a sliding or expanding subset of rows, essential for time-series smoothing, trend detection, and feature engineering. Unlike groupby aggregations, window functions return a result for every row, preserving the original index.

import pandas as pd
import numpy as np

ts = pd.DataFrame({
    'date':  pd.date_range('2024-01-01', periods=10, freq='D'),
    'sales': [100, 120, 90, 150, 200, 130, 110, 180, 160, 140],
})
ts = ts.set_index('date')

# --- Rolling window (fixed-size, slides one step at a time) ---
ts['ma3']   = ts['sales'].rolling(window=3).mean()  # 3-day moving avg
ts['std3']  = ts['sales'].rolling(window=3).std()
ts['min3']  = ts['sales'].rolling(window=3).min()

# First window-1 values are NaN (not enough history)
# min_periods: require fewer observations before computing
ts['ma3_mp'] = ts['sales'].rolling(window=3, min_periods=1).mean()

# --- Expanding window (grows to include all rows so far) ---
ts['cum_max']  = ts['sales'].expanding().max()
ts['cum_mean'] = ts['sales'].expanding().mean()

# --- Exponentially weighted moving average (more weight on recent data) ---
ts['ewma'] = ts['sales'].ewm(span=3).mean()

# --- Lag / shift features (common in time-series forecasting) ---
ts['lag1'] = ts['sales'].shift(1)   # yesterday's sales
ts['lag7'] = ts['sales'].shift(7)   # last week's sales
ts['pct_change'] = ts['sales'].pct_change()  # % change from previous row

Moving averages (rolling mean) smooth out noise to reveal trends. Exponentially weighted moving averages give more influence to recent observations, making them responsive to recent changes while still smoothing. Lag features turn a time-series prediction problem into a supervised learning problem where past values predict future ones.
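
The arithmetic behind these smoothers can be reproduced by hand on a short series. The sketch below is only an illustration: it uses adjust=False, which is not the setting in the example above (pandas defaults to adjust=True), because that variant follows a simple recursion that is easy to verify.

```python
import numpy as np
import pandas as pd

sales = pd.Series([100.0, 120.0, 90.0, 150.0])

# Rolling mean: average of the current row and the two before it
rolling = sales.rolling(window=3).mean()

# EWMA with adjust=False follows the recursion
#   y[0] = x[0];  y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
# where alpha = 2 / (span + 1), so span=3 gives alpha = 0.5
ewma = sales.ewm(span=3, adjust=False).mean()

alpha = 2 / (3 + 1)
manual = [sales.iloc[0]]
for value in sales.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * value)

print(rolling.tolist())
print(ewma.tolist(), manual)
```

The rolling mean forgets an observation completely once it leaves the window, whereas the exponentially weighted mean keeps every past value with geometrically shrinking weight.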

Why do the first (window - 1) rows of a rolling() calculation contain NaN?

NaN is used as a placeholder for future values

✗ Try again.

There are not enough prior observations to fill the window

✓ Correct! Well done.

rolling() requires at least 10 observations to start

✗ Try again.

NaN indicates the minimum was used instead of the mean

✗ Try again.

What does ts['sales'].shift(1) produce?

Each row contains the next row's sales value (a 1-step lead)

✗ Try again.

Each row contains the previous row's sales value (a 1-step lag)

✓ Correct! Well done.

Sales values rounded to 1 decimal

✗ Try again.

A percentage change from the previous value

✗ Try again.

37. How do Seaborn jointplot and pairplot help explore multivariate relationships?

When you have more than one numeric variable, the next step after individual histograms is to understand relationships between pairs. Seaborn's jointplot and pairplot automate this exploration with minimal code.

import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset('penguins').dropna()

# --- jointplot: one pair of variables ---
# Scatter + marginal distributions (drawn as KDE curves when hue is set)
sns.jointplot(data=penguins, x='bill_length_mm', y='bill_depth_mm',
              hue='species', height=6)

# Regression + 95% confidence interval
sns.jointplot(data=penguins, x='flipper_length_mm', y='body_mass_g',
              kind='reg', height=6)

# Hex bins — better than scatter for large datasets with overplotting
sns.jointplot(data=penguins, x='flipper_length_mm', y='body_mass_g',
              kind='hex', height=6)

# KDE — smooth 2-D density
sns.jointplot(data=penguins, x='bill_length_mm', y='bill_depth_mm',
              kind='kde', fill=True, height=6)

# --- pairplot: all pairs + diagonal histograms ---
# Standard scatter matrix
sns.pairplot(penguins, hue='species',
             diag_kind='kde',         # diagonal: KDE instead of histogram
             plot_kws={'alpha': 0.5},  # semi-transparent points
             height=2.5)
plt.suptitle('Penguin Feature Pairs', y=1.02)
plt.show()

# Subset of columns only
cols = ['bill_length_mm', 'flipper_length_mm', 'body_mass_g']
sns.pairplot(penguins[cols + ['species']], hue='species')

Use jointplot when you want to focus deeply on one specific pair of variables with marginal distributions visible. Use pairplot for a broad overview of all pairwise relationships in a dataset with up to ~10 variables — beyond that the grid becomes too small to read meaningfully.

What does the diag_kind='kde' argument to sns.pairplot() control?

The kind of scatter plot in off-diagonal cells

✗ Try again.

The type of plot drawn on the diagonal cells of the pair grid

✓ Correct! Well done.

The kernel used for density estimation in all cells

✗ Try again.

Whether to draw a KDE contour on all scatter plots

✗ Try again.

When is kind='hex' in sns.jointplot() preferred over kind='scatter'?

When the dataset has fewer than 100 rows

✗ Try again.

When there is heavy overplotting — many points share similar coordinates and a scatter plot becomes a solid blob

✓ Correct! Well done.

When the x and y variables have different units

✗ Try again.

When the relationship is expected to be non-linear

✗ Try again.

38. What are the key performance tips when using NumPy for large-scale data processing?

NumPy is fast by default, but a few common mistakes can undermine that speed. Knowing these patterns makes the difference between code that runs in seconds and code that runs in minutes.

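import numpy as np

rng = np.random.default_rng(42)
n = 10_000_000
arr = rng.random(n)

# 1. AVOID Python loops: prefer ufuncs (timings are rough and machine-dependent)
# Slow:
result = [x**2 for x in arr]        # Python loop, on the order of seconds
# Fast:
result = arr ** 2                    # NumPy ufunc, on the order of tens of milliseconds

# 2. Pre-allocate output arrays instead of growing them
chunks = np.array_split(arr, 100)    # example input: 100 sub-arrays
# Slow:
out = np.array([])
for chunk in chunks:
    out = np.append(out, chunk.sum())   # np.append copies the whole array each time
# Fast:
out = np.empty(len(chunks))
for i, chunk in enumerate(chunks):
    out[i] = chunk.sum()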

# 3. Use views instead of copies when slicing
sub = arr[1000:2000]   # view — no memory allocation
sub2 = arr[1000:2000].copy()  # explicit copy — only when mutation safety needed

# 4. Choose the right dtype — float32 vs float64
a64 = np.ones(n, dtype=np.float64)  # 80 MB
a32 = np.ones(n, dtype=np.float32)  # 40 MB — also faster on many ops

# 5. Use out= argument to avoid temporary arrays
np.add(a32, a32, out=a32)   # in-place: no temporary intermediate created

# 6. np.einsum for complex multi-dimensional contractions
A = rng.random((100, 200))
B = rng.random((200, 300))
C = np.einsum('ij,jk->ik', A, B)  # equivalent to A @ B but explicit

The most impactful optimisation in almost every case is the first: eliminating Python loops. After that, reducing the number of temporary arrays (using out= or in-place operators like +=) and choosing smaller dtypes are the next biggest wins.
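
Timings vary by machine, but the memory behaviour behind tips 3 to 5 can be verified directly. This sketch checks that slicing returns a view, that float32 halves the footprint, and that out= reuses the existing buffer. Reading the buffer address through __array_interface__ is just one convenient way to show this.

```python
import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

# Basic slicing returns a view: same memory, no allocation
view = arr[1000:2000]
copy = arr[1000:2000].copy()
print(np.shares_memory(arr, view), np.shares_memory(arr, copy))

# float32 stores each element in 4 bytes instead of 8
arr32 = arr.astype(np.float32)
print(arr.nbytes, arr32.nbytes)

# out= writes the result into an existing buffer instead of a new array
buffer_before = arr32.__array_interface__['data'][0]   # memory address
np.multiply(arr32, 2, out=arr32)
buffer_after = arr32.__array_interface__['data'][0]
print(buffer_before == buffer_after)
```

Because a view shares memory with its parent, writing to view also changes arr; take an explicit copy when that would be a bug.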

What is the most impactful NumPy performance optimisation in most cases?

Using np.einsum instead of @

✗ Try again.

Replacing Python loops with vectorised NumPy operations

✓ Correct! Well done.

Always copying arrays before modifying them

✗ Try again.

Using float16 instead of float32

✗ Try again.

What does the out= parameter do in np.add(a, b, out=a)?

Adds a safety check before writing

✗ Try again.

Writes the result directly into array a without allocating a temporary output array

✓ Correct! Well done.

Sends output to a file

✗ Try again.

out= is ignored in NumPy — it is a no-op

✗ Try again.

39. How do you visualise regression results and residuals using Seaborn and Matplotlib?

After fitting any regression model, visualising the residuals (actual minus predicted values) is an essential check. Patterns in the residuals reveal violated model assumptions: non-linearity, heteroscedasticity, or non-normal errors.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Simulate some data with a non-linear relationship
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2 * x + 0.5 * x**2 + rng.normal(0, 3, 200)
df = pd.DataFrame({'x': x, 'y': y})

# 1. Scatter + regression line (with confidence interval)
sns.regplot(data=df, x='x', y='y', ci=95, scatter_kws={'alpha': 0.4})
plt.title('Scatter with OLS Regression Line')
plt.show()

# 2. Residual plot, built into seaborn (lowess=True requires statsmodels)
sns.residplot(data=df, x='x', y='y', lowess=True,
              scatter_kws={'alpha': 0.4})
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs x (lowess smoothed trend)')
plt.show()
# A horizontal band around 0 = good; a curve = model is missing non-linearity

# 3. Manual residuals (after sklearn model)
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(df[['x']], df['y'])
df['predicted'] = model.predict(df[['x']])
df['residual']  = df['y'] - df['predicted']

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(df['predicted'], df['residual'], alpha=0.4)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set(xlabel='Fitted Values', ylabel='Residuals',
            title='Residuals vs Fitted')
sns.histplot(df['residual'], kde=True, ax=axes[1])
axes[1].set_title('Residual Distribution')
plt.tight_layout(); plt.show()

The two most diagnostic residual plots are: (1) Residuals vs Fitted — should be a random horizontal band; any curve indicates missing predictors or a need for feature transformation. (2) Residual histogram — should be approximately normal; heavy tails suggest outliers or a non-Gaussian error structure.
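
The "curve in the residuals" signal can also be quantified without a plot. As a rough sketch, the code below regenerates the simulated data from above, fits polynomials with np.polyfit (a stand-in for the sklearn model, chosen to keep the example dependency-free), and measures how strongly the residuals correlate with a curvature proxy. The helper residuals and the proxy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2 * x + 0.5 * x**2 + rng.normal(0, 3, 200)


def residuals(degree):
    """Residuals of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, deg=degree)
    return y - np.polyval(coeffs, x)


# Proxy for curvature: squared distance from the centre of the x range
curvature = (x - x.mean()) ** 2

# A straight-line fit leaves the quadratic part of the signal in the residuals
corr_linear = np.corrcoef(residuals(1), curvature)[0, 1]
# Adding the squared term removes that leftover structure
corr_quadratic = np.corrcoef(residuals(2), curvature)[0, 1]

print(f'linear fit:    {corr_linear:.2f}')
print(f'quadratic fit: {corr_quadratic:.2f}')
```

A clearly non-zero correlation for the linear fit corresponds to the U-shaped band in the residual plot; once the squared term is in the model, the correlation drops to essentially zero.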

What pattern in a residuals-vs-fitted plot indicates the model is missing non-linear structure?

A random scatter of points around the zero line

✗ Try again.

A curved or systematic trend in the residuals

✓ Correct! Well done.

Residuals that are all positive

✗ Try again.

Residuals with equal variance across all fitted values

✗ Try again.

What does the lowess=True argument add to sns.residplot()?

A linear regression line through the residuals

✗ Try again.

A locally weighted smoothed trend line to reveal systematic patterns

✓ Correct! Well done.

A 95% confidence band around zero

✗ Try again.

Logarithmic scaling of the y-axis

✗ Try again.

40. How do you process large CSV files that don't fit in memory using Pandas?

When a CSV is larger than available RAM, loading it with a plain pd.read_csv can raise a MemoryError. Pandas offers three strategies: chunking, selective loading, and dtype optimisation.

import pandas as pd
import numpy as np

# --- Strategy 1: Read only necessary columns and rows ---
df = pd.read_csv(
    'big_log.csv',
    usecols=['timestamp', 'user_id', 'event', 'amount'],  # skip unneeded cols
    dtype={'user_id': 'int32', 'amount': 'float32'},       # smaller dtypes
    parse_dates=['timestamp'],
    nrows=500_000,   # read a sample first for exploration
)

# --- Strategy 2: Process in chunks ---
chunk_size = 100_000
results = []

for chunk in pd.read_csv('big_log.csv', chunksize=chunk_size,
                          usecols=['user_id', 'amount']):
    # Process each chunk independently
    summary = chunk.groupby('user_id')['amount'].sum()
    results.append(summary)

# Combine partial results
final = pd.concat(results).groupby(level=0).sum()

# --- Strategy 3: Filter while reading with chunksize ---
high_value_chunks = []
for chunk in pd.read_csv('big_log.csv', chunksize=chunk_size):
    filtered = chunk[chunk['amount'] > 1000]
    high_value_chunks.append(filtered)
high_value_df = pd.concat(high_value_chunks, ignore_index=True)

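# --- Alternative: Parquet format (much faster than CSV) ---
# Convert once (df here is the 500,000-row sample loaded in Strategy 1;
# a file larger than RAM has to be converted chunk by chunk):
df.to_parquet('big_log.parquet', index=False)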
# Then read efficiently — Parquet supports column projection and row filters
import pyarrow.parquet as pq
table = pq.read_table('big_log.parquet',
                       columns=['user_id', 'amount'],
                       filters=[('amount', '>', 1000)])

For truly large-scale work (tens of GB), consider switching from CSV to Parquet (columnar, compressed, fast column projection) and using Dask or Polars instead of Pandas. Dask and the lazy API in Polars build a computation graph first, which lets them stream data rather than load everything into memory at once.
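
The chunked aggregation pattern can be tried without a large file. In this sketch a tiny in-memory CSV (an assumption made so the example is self-contained) stands in for the big log, and the chunked result is compared with a full load.

```python
import io

import pandas as pd

# Assumption: a small in-memory CSV stands in for a file too large for RAM
csv_text = "user_id,amount\n" + "\n".join(
    f"{i % 4},{i * 10}" for i in range(20)
)

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=6):
    # Each chunk is an ordinary DataFrame of at most 6 rows
    partials.append(chunk.groupby('user_id')['amount'].sum())

# The same user can appear in several chunks, so aggregate the partial sums again
chunked_total = pd.concat(partials).groupby(level=0).sum()

# Reference result from loading everything at once (only feasible for small data)
full_total = pd.read_csv(io.StringIO(csv_text)).groupby('user_id')['amount'].sum()

print(chunked_total.tolist())
print(full_total.tolist())
```

Re-aggregating partial results works for sums, counts, minima, and maxima. A mean needs the per-chunk sum and count carried separately, and a median cannot be combined this way at all.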

What does chunksize=100_000 do when passed to pd.read_csv?

Limits the CSV to 100,000 rows

✗ Try again.

Returns an iterator that yields DataFrames of 100,000 rows at a time

✓ Correct! Well done.

Reads 100,000 bytes per batch

✗ Try again.

Compresses the CSV into 100,000-byte blocks

✗ Try again.

Why is the Parquet format preferable to CSV for large analytical datasets?

Parquet is human-readable like CSV

✗ Try again.

Parquet is columnar and compressed, enabling fast column-level reads without loading the full file

✓ Correct! Well done.

Parquet is the only format Pandas supports

✗ Try again.

CSV files cannot store numerical data

✗ Try again.

41. How do you add annotations and text to Matplotlib charts?

Annotations turn a chart into a story — highlighting a key data point, marking a threshold, or labelling significant events on a timeline. Matplotlib provides ax.annotate() for arrow-and-text annotations and ax.text() for free-form text placement.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
y = np.sin(x) * np.exp(-x / 5)

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(x, y, color='steelblue', linewidth=2)

# Find and annotate the maximum
peak_idx = np.argmax(y)
px, py   = x[peak_idx], y[peak_idx]

ax.annotate(
    f'Peak: ({px:.2f}, {py:.2f})',
    xy=(px, py),              # point to annotate
    xytext=(px + 1.5, py),    # where the text goes
    arrowprops=dict(
        arrowstyle='->',
        color='darkred',
        lw=1.5,
    ),
    fontsize=11,
    color='darkred',
)

# Free-form text label
ax.text(0.5, 0.9, 'Damped oscillation',
        transform=ax.transAxes,   # axes-relative coords (0–1)
        fontsize=12, ha='center',
        bbox=dict(boxstyle='round,pad=0.3', fc='lightyellow', ec='grey'))

# Threshold line with label
ax.axhline(y=0.5, color='orange', linestyle='--', linewidth=1)
ax.text(9.5, 0.52, 'threshold=0.5', color='orange', ha='right', fontsize=9)

ax.set(title='Annotated Damped Sine', xlabel='x', ylabel='y')
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout(); plt.show()

The two coordinate systems matter: xy in annotate uses data coordinates by default (values from your actual data range). Passing transform=ax.transAxes to ax.text() switches to axes-fraction coordinates (0,0 = bottom-left, 1,1 = top-right) — useful for fixed-position labels that stay put when the data range changes.
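
To see why axes-fraction labels "stay put", the following sketch converts the same axes-fraction point to data coordinates before and after the x-range changes. The helper axes_to_data is hypothetical, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless rendering, no GUI needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(-1, 1)


def axes_to_data(ax, point):
    """Convert an axes-fraction point to data coordinates (hypothetical helper)."""
    display = ax.transAxes.transform(point)             # axes fraction -> pixels
    return ax.transData.inverted().transform(display)   # pixels -> data


before = axes_to_data(ax, (0.5, 0.9))   # middle of the x-range, 90% up the y-range

ax.set_xlim(0, 100)                      # change the data range
after = axes_to_data(ax, (0.5, 0.9))    # same spot on the axes, different data x

print(before, after)
plt.close(fig)
```

The label's position on the axes is unchanged, but the data value under it moves from x = 5 to x = 50. A label placed in data coordinates would instead slide across the plot when the limits change.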

What does transform=ax.transAxes do when passed to ax.text()?

Rotates the text by the axis angle

✗ Try again.

Positions the text using axes-fraction coordinates (0–1) rather than data coordinates

✓ Correct! Well done.

Applies a logarithmic transform to the text position

✗ Try again.

Mirrors the text along the y-axis

✗ Try again.

In ax.annotate(), what do the xy and xytext arguments control?

xy is the arrowhead position (point annotated); xytext is where the text label is placed

✓ Correct! Well done.

xy is the text position; xytext is the arrowhead

✗ Try again.

Both specify the same point — xytext is ignored

✗ Try again.

xy sets the font size; xytext sets the offset

✗ Try again.

42. How do you quickly extract top/bottom rows and random samples from a Pandas DataFrame?

During EDA you often need to inspect extremes (the highest-revenue customers, the worst-performing products) or draw a random sample for quick analysis. Pandas provides concise methods for each of these.

import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'product': [f'P{i}' for i in range(100)],
    'revenue': rng.integers(1_000, 100_000, 100),
    'returns': rng.integers(0, 500, 100),
})

# --- Top and bottom N rows ---
df.nlargest(5, 'revenue')   # 5 highest revenue products
df.nsmallest(5, 'revenue')  # 5 lowest revenue products

# Multiple columns — break ties by second column
df.nlargest(5, ['revenue', 'returns'])

# --- Random sampling ---
df.sample(n=10, random_state=42)        # 10 random rows
df.sample(frac=0.1, random_state=42)    # 10% of rows
df.sample(n=10, replace=True)           # with replacement (bootstrapping)

# Stratified sample: same proportion from each category (GroupBy.sample, pandas >= 1.1)
df['tier'] = pd.cut(df['revenue'], bins=3, labels=['low','mid','high'])
stratified = df.groupby('tier', observed=True).sample(frac=0.1, random_state=42)

# --- Head, tail, every Nth row ---
df.head(10)       # first 10 rows
df.tail(10)       # last 10 rows
df.iloc[::5]      # every 5th row — useful for large datasets

nlargest and nsmallest are typically faster than sort_values(...).head(n) on large DataFrames because they do not sort every row: they select the k extreme values first and order only those, which costs far less than the O(N log N) full sort when k is much smaller than N. Use them whenever you only need the extremes, not a fully sorted result.
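
A quick sanity check that the two approaches return the same rows is sketched below. The data is synthetic, and a permutation is used so that revenue values are distinct and no tie-breaking is involved.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'product': [f'P{i}' for i in range(1_000)],
    # A permutation guarantees distinct values, so there are no ties to break
    'revenue': rng.permutation(1_000),
})

top_fast = df.nlargest(5, 'revenue')
top_sorted = df.sort_values('revenue', ascending=False).head(5)

print(top_fast['revenue'].tolist())
print(top_fast.index.equals(top_sorted.index))
```

When values can tie, the keep argument of nlargest ('first', 'last', or 'all') controls which of the tied rows are returned.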

Why is df.nlargest(10, 'revenue') faster than df.sort_values('revenue', ascending=False).head(10) on a large DataFrame?

nlargest skips NaN values

✗ Try again.

nlargest selects only the top k rows instead of performing a full O(N log N) sort of all N rows

✓ Correct! Well done.

head() requires the full sort to complete first anyway

✗ Try again.

They have identical performance

✗ Try again.

Which argument in df.sample() sets the proportion of rows to return?

n=

✗ Try again.

size=

✗ Try again.

frac=

✓ Correct! Well done.

pct=

✗ Try again.

43. How is NumPy linear algebra used in data science applications?

Linear algebra underpins almost all of machine learning — from computing gradients to PCA to solving systems of equations. NumPy's linalg submodule provides production-grade implementations of the core operations.

import numpy as np

# --- Solving a system of linear equations: Ax = b ---
# 2x + y = 8
# x + 3y = 11
A = np.array([[2, 1], [1, 3]])
b = np.array([8, 11])
x = np.linalg.solve(A, b)
print(x)   # [2.6  2.8]  — verify: A @ x ≈ b

# --- Matrix decompositions ---
M = np.array([[3, 1], [1, 3]], dtype=float)

# Eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eig(M)
# eigenvalues = [4. 2.], eigenvectors (columns) = principal directions

# Singular Value Decomposition — used in PCA, recommendation systems
X = np.random.default_rng(42).random((100, 5))   # 100 samples, 5 features
X -= X.mean(axis=0)                               # centre
U, S, Vt = np.linalg.svd(X, full_matrices=False)
# S = singular values (square roots of eigenvalues of X^T X)
# Vt rows = principal components
# Project onto first 2 components:
X_pca = X @ Vt[:2].T    # shape (100, 2)

# --- Norms ---
v = np.array([3.0, 4.0])
np.linalg.norm(v)        # 5.0 — L2 norm
np.linalg.norm(v, ord=1) # 7.0 — L1 norm

# --- Matrix rank, determinant, inverse ---
np.linalg.matrix_rank(A)
np.linalg.det(A)
np.linalg.inv(A)   # only for square non-singular matrices
np.linalg.pinv(A)  # Moore-Penrose pseudoinverse; also works for non-square or singular matrices

SVD is the engine behind PCA: the right singular vectors (rows of Vt) are the principal components, and the squared singular values, divided by n - 1 for n samples, give the variance each component explains. Using full_matrices=False (economy SVD) matters for tall matrices: it skips computing the large, unused portion of U.
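The link between singular values and explained variance can be made concrete with a short sketch. It reuses the same random data as the example above and cross-checks the result against the eigenvalues of the sample covariance matrix; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((100, 5))          # 100 samples, 5 features
X = X - X.mean(axis=0)            # centre each feature

U, S, Vt = np.linalg.svd(X, full_matrices=False)

n_samples = X.shape[0]
explained_variance = S**2 / (n_samples - 1)                      # variance along each component
explained_ratio = explained_variance / explained_variance.sum()  # fractions that sum to 1

# Cross-check: eigenvalues of the sample covariance matrix, largest first
cov_eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print(np.allclose(explained_variance, cov_eigenvalues))
print(explained_ratio.round(3))
```

With economy SVD, U has shape (100, 5) rather than (100, 100), which is where the saving for tall matrices comes from.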

In PCA implemented via SVD, what do the rows of the Vt matrix represent?

A. The mean of each feature
B. The principal components (directions of maximum variance)
C. The projected data in the reduced space
D. The explained variance ratios

Answer: B

Which NumPy function solves the linear system Ax = b without computing the inverse of A?

A. np.linalg.inv(A) @ b
B. np.linalg.solve(A, b)
C. np.dot(A, b)
D. np.linalg.lstsq(A, b)[0]

Answer: B
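The difference between the first two options in the question above can be checked directly. This is a minimal sketch using the same 2x2 system as the earlier example: both routes give the same answer here, but np.linalg.solve factorises A instead of forming its inverse, which is generally cheaper and numerically safer for larger or ill-conditioned systems.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([8.0, 11.0])

x_solve = np.linalg.solve(A, b)   # factorises A; the inverse is never formed
x_inv = np.linalg.inv(A) @ b      # forms the inverse explicitly, then multiplies

print(x_solve)
print(np.allclose(A @ x_solve, b))   # the solution satisfies the system
print(np.allclose(x_solve, x_inv))   # both routes agree on this small system
```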

44. How do you compare distributions across categories using Seaborn categorical plots?

Comparing how a numeric variable's distribution differs across groups is one of the most common analytical tasks. Seaborn's categorical plot family gives you progressively more information from left to right: bar (mean only) → box (five-number summary) → violin (full distribution shape) → strip/swarm (individual points).

import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Bar plot — mean + 95% CI error bars
sns.barplot(data=tips, x='day', y='tip', hue='sex',
            palette='Set2', ax=axes[0, 0])
axes[0, 0].set_title('Mean Tip by Day and Sex')

# Box plot — median, IQR, whiskers, outlier dots
sns.boxplot(data=tips, x='day', y='total_bill', hue='smoker',
            palette='pastel', ax=axes[0, 1])
axes[0, 1].set_title('Total Bill Distribution by Day and Smoker')

# Violin plot — box + KDE combined
sns.violinplot(data=tips, x='day', y='tip',
               inner='quartile',   # show quartile lines inside
               palette='muted', ax=axes[1, 0])
axes[1, 0].set_title('Tip Violin by Day')

# Strip + box overlay — all points + summary
sns.boxplot(data=tips, x='time', y='tip', color='lightblue',
            ax=axes[1, 1], width=0.4)
sns.stripplot(data=tips, x='time', y='tip', color='navy',
              alpha=0.4, jitter=True, ax=axes[1, 1])
axes[1, 1].set_title('Tip by Time — Box + All Points')

plt.tight_layout(); plt.show()

# Figure-level catplot for easy faceting
sns.catplot(data=tips, x='day', y='tip', hue='sex',
            col='time', kind='violin', height=5, aspect=0.8)

When to use each: bar plots are fine for comparing means but hide distributional information. Box plots add spread and outliers. Violin plots reveal multi-modality (two bumps indicating two groups within a category). Strip/swarm overlays add individual points, which matters for small samples (roughly n < 30), where a box plot's summary statistics can be misleading.
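The point about multi-modality can be illustrated without plotting. The sketch below uses two synthetic groups (the means, spreads, and seed are assumptions chosen for illustration): one unimodal, one bimodal. A box plot of either group shows a single box centred near zero, yet the bimodal group has almost no observations near its own median, which is the structure a violin or strip plot would expose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical groups with the same centre and a similar overall spread
unimodal = rng.normal(loc=0.0, scale=2.0, size=1_000)
bimodal = np.concatenate([
    rng.normal(loc=-2.0, scale=0.5, size=500),
    rng.normal(loc=2.0, scale=0.5, size=500),
])

def five_number_summary(values):
    """Return min, Q1, median, Q3, max: the numbers a box plot draws."""
    return np.percentile(values, [0, 25, 50, 75, 100])

def share_near_centre(values, half_width=0.5):
    """Fraction of points lying within half_width of zero."""
    return np.mean(np.abs(values) < half_width)

print(five_number_summary(unimodal).round(2))
print(five_number_summary(bimodal).round(2))
print(share_near_centre(unimodal), share_near_centre(bimodal))
```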

What does inner='quartile' display inside a Seaborn violin plot?

A. Scatter points for each observation
B. Horizontal lines marking the quartiles within the violin
C. A box plot drawn inside the violin
D. The mean and standard deviation

Answer: B

Why is overlaying a strip plot on a box plot particularly useful for small datasets?

A. It makes the box plot render faster
B. Individual points reveal the actual n and distribution shape that summary statistics alone can misrepresent
C. Strip plots are required for correct IQR calculation
D. It automatically removes outliers

Answer: B

45. How do you build an end-to-end data cleaning and visualisation pipeline with NumPy, Pandas, and Seaborn?

Combining all three libraries in a coherent pipeline is what data science interviews and take-home assignments test. Below is a realistic miniature pipeline that demonstrates the key integration points.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style='whitegrid', context='notebook')

# --- 1. Load ---
df = pd.read_csv('customer_orders.csv', parse_dates=['order_date'])

# --- 2. Audit ---
df.info()   # prints its own report and returns None, so no print() wrapper
print(df.isnull().sum())
print(df.describe())

# --- 3. Clean ---
df = (
    df
    .drop_duplicates(subset=['order_id'])
    .dropna(subset=['customer_id', 'amount'])
    .assign(
        amount=lambda d: pd.to_numeric(d['amount'], errors='coerce'),
        category=lambda d: d['category'].str.strip().str.title().astype('category'),
        year=lambda d: d['order_date'].dt.year,
        month=lambda d: d['order_date'].dt.month,
    )
    .dropna(subset=['amount'])
    .query('amount > 0')
)

# --- 4. Feature engineering (NumPy) ---
amounts = df['amount'].to_numpy()
df['log_amount']  = np.log1p(amounts)       # log1p avoids log(0)
df['amount_zscore'] = (amounts - amounts.mean()) / amounts.std()

# --- 5. Aggregate ---
monthly = (
    df.groupby(['year', 'month', 'category'], observed=True)  # skip unused category combinations
    .agg(total=('amount', 'sum'), orders=('order_id', 'count'))
    .reset_index()
)

# --- 6. Visualise ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Revenue distribution by category
sns.boxplot(data=df, x='category', y='log_amount', ax=axes[0])
axes[0].set(title='Log Revenue by Category', xlabel='', ylabel='log(1+amount)')

# Monthly trend
df['period'] = df['order_date'].dt.to_period('M').astype(str)
trend = df.groupby('period')['amount'].sum().reset_index()
axes[1].plot(trend['period'], trend['amount'], marker='o', linewidth=2)
axes[1].tick_params(axis='x', rotation=45)
axes[1].set(title='Monthly Revenue Trend', xlabel='Month', ylabel='Revenue')

plt.tight_layout()
plt.savefig('dashboard.png', dpi=150, bbox_inches='tight')
plt.show()

The key integration patterns here: Pandas for all tabular operations (load, clean, aggregate), NumPy for numerical transformations on raw arrays (.to_numpy() → vectorised ops), and Seaborn/Matplotlib for visualisation. The method-chain style in the cleaning step makes the transformations readable as a pipeline.
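Because customer_orders.csv is not available here, the cleaning chain can be tried on a small in-memory stand-in. The sketch below assumes a made-up seven-row export (the rows, IDs, and values are hypothetical) containing one duplicate order, one missing amount, one non-numeric amount, one negative amount, and untidy category labels, then applies the same steps as the pipeline above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical raw export standing in for customer_orders.csv
raw_csv = io.StringIO(
    "order_id,customer_id,order_date,amount,category\n"
    "1,c1,2024-01-05,100.0, books \n"
    "1,c1,2024-01-05,100.0, books \n"
    "2,c2,2024-01-20,,toys\n"
    "3,c3,2024-02-03,abc,TOYS\n"
    "4,c4,2024-02-10,-5,books\n"
    "5,c5,2024-02-11,0.5,toys\n"
    "6,c6,2024-03-01,250,Books\n"
)
df = pd.read_csv(raw_csv, parse_dates=["order_date"])

clean = (
    df
    .drop_duplicates(subset=["order_id"])            # drops the repeated order 1
    .dropna(subset=["customer_id", "amount"])        # drops order 2 (missing amount)
    .assign(
        amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
        category=lambda d: d["category"].str.strip().str.title().astype("category"),
    )
    .dropna(subset=["amount"])                       # drops order 3 ('abc' became NaN)
    .query("amount > 0")                             # drops order 4 (negative amount)
    .assign(log_amount=lambda d: np.log1p(d["amount"].to_numpy()))
)

summary = (
    clean.groupby("category", observed=True)
    .agg(total=("amount", "sum"), orders=("order_id", "count"))
    .reset_index()
)

print(clean[["order_id", "amount", "category", "log_amount"]])
print(summary)
```

Only orders 1, 5, and 6 survive the cleaning steps, and the summary has one row per tidied category.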

Why is np.log1p(x) preferred over np.log(x) for monetary or count data?

A. log1p is faster
B. log1p(x) = log(1+x), which handles x=0 without returning -inf or raising an error
C. log1p returns integers; log returns floats
D. log1p applies to arrays; log only works on scalars

Answer: B

In the pipeline above, what does .reset_index() do after groupby().agg()?

A. Resets row numbers to 0, 1, 2... and moves the groupby keys back to regular columns
B. Removes all index labels
C. Sorts the DataFrame by the first groupby column
D. Creates a backup copy of the original index

Answer: A
