7  Loading and handling data

Download notebook.

In this chapter we learn how to load data from various sources and how to handle tabular data using pandas. Before we start, we describe what a dataframe is, since it is the building block of many data science routines.

7.1 Dataframes (pandas)

pandas is the standard library for handling tabular data in Python. Its central data structure is the DataFrame: essentially a two-dimensional table with rows and columns. Each column has a name and a data type (e.g., integers, floats, strings), and each row represents one observation or record. You can think of a dataframe as an in-memory spreadsheet that comes with powerful methods for selecting, filtering, transforming, and aggregating data. A single column of a DataFrame is a Series, a one-dimensional array with an index. A Series behaves like a list but supports vectorized operations (e.g., series + 1 adds 1 to every element) and has many built-in methods such as .mean(), .sum(), or .value_counts().

# Execute once for installing pandas
# !pip3 install pandas
import pandas as pd
import numpy as np
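As a quick illustration of the Series behavior described above (a toy Series invented here for illustration):

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2])

print(s + 1)             # vectorized: adds 1 to every element -> 4, 2, 4, 3
print(s.mean())          # 2.25
print(s.value_counts())  # 3 occurs twice, 1 and 2 once each
```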

Here are the most important routines, where df is a DataFrame:

  • pd.DataFrame(dict): create a DataFrame from a dictionary (keys become column names, values must be lists of the same length).
  • df["col"]: select a single column (returns a Series, which is similar to a list).
  • df[["col1", "col2"]]: select multiple columns.
  • df.iloc[i] or df.iloc[i:j]: select row(s) by integer position i (the slice i:j goes up to j-1).
  • df.loc[i, "col"] or df.loc[i:j, ["col1", "col2"]]: select row(s) by label for column "col" (resp. columns "col1", "col2"); note that the label slice i:j includes j (unlike .iloc).
  • df[df["col"] > value]: filter rows by a condition (here df["col"] > value).
  • df.head(n) / df.tail(n): first / last n rows.
  • df.describe(): summary statistics (count, mean, std, min, quartiles, max) for all numeric columns.
  • df["col"].count(), df["col"].mean(), df["col"].std(), df["col"].min(), df["col"].max(): individual summary statistics for a column.
  • df["col"].median(): median of a column.
  • df["col"].quantile(q): the q-th quantile (e.g., q=0.25 for the first quartile).
  • df["col"].sum(): sum of all values in a column.
  • df.sort_values("col"): sort by a column.
  • df.groupby("col").mean(): group by a column and compute means.
  • df.isna(): boolean mask of missing values.
  • df.dropna(): drop rows with missing values.
  • df.fillna(value): fill missing values.
  • df.drop(columns=["col"]): remove a column.
  • df.to_numpy(): convert to a NumPy array (but column names get lost).
df = pd.DataFrame({
    "age": [22, 27, 31, 45, 38, 29],
    "income": [42000, 51000, 64000, 83000, 72000, 56000],
    "hours_studied": [10, 12, 7, 4, 6, 9],
    "passed": [1, 1, 0, 0, 1, 1]
})
df
   age  income  hours_studied  passed
0   22   42000             10       1
1   27   51000             12       1
2   31   64000              7       0
3   45   83000              4       0
4   38   72000              6       1
5   29   56000              9       1
# selecting columns
print(df["age"])
0    22
1    27
2    31
3    45
4    38
5    29
Name: age, dtype: int64
print(df[["age", "income"]])
   age  income
0   22   42000
1   27   51000
2   31   64000
3   45   83000
4   38   72000
5   29   56000
# selecting rows
print(df.iloc[0])
age                 22
income           42000
hours_studied       10
passed               1
Name: 0, dtype: int64
print(df.loc[0:2, ["age", "passed"]])
   age  passed
0   22       1
1   27       1
2   31       0
# filtering
filtered = df[df["income"] > 55000]
filtered
   age  income  hours_studied  passed
2   31   64000              7       0
3   45   83000              4       0
4   38   72000              6       1
5   29   56000              9       1
# summary statistics
print(df.describe())
             age        income  hours_studied    passed
count   6.000000      6.000000       6.000000  6.000000
mean   32.000000  61333.333333       8.000000  0.666667
std     8.246211  14827.901627       2.898275  0.516398
min    22.000000  42000.000000       4.000000  0.000000
25%    27.500000  52250.000000       6.250000  0.250000
50%    30.000000  60000.000000       8.000000  1.000000
75%    36.250000  70000.000000       9.750000  1.000000
max    45.000000  83000.000000      12.000000  1.000000
# individual statistics for a column
print("mean income:", df["income"].mean())
print("std income:", df["income"].std())
print("median income:", df["income"].median())
print("25th percentile:", df["income"].quantile(0.25))
print("total hours studied:", df["hours_studied"].sum())
mean income: 61333.333333333336
std income: 14827.901627225163
median income: 60000.0
25th percentile: 52250.0
total hours studied: 48
# sorting
df_sorted = df.sort_values("income", ascending=False)
print(df_sorted)
   age  income  hours_studied  passed
3   45   83000              4       0
4   38   72000              6       1
2   31   64000              7       0
5   29   56000              9       1
1   27   51000             12       1
0   22   42000             10       1
# groupby — numeric_only=True restricts the computation to numeric columns,
# ignoring string columns (which cannot be averaged)
grouped = df.groupby("passed").mean(numeric_only=True)
print(grouped)
         age   income  hours_studied
passed                              
0       38.0  73500.0           5.50
1       29.0  55250.0           9.25
# missing values
df_missing = df.copy()
df_missing.loc[2, "income"] = np.nan
df_missing.loc[4, "hours_studied"] = np.nan

print(df_missing)
print()
print(df_missing.isna().sum())
print()
print(df_missing.dropna())
   age   income  hours_studied  passed
0   22  42000.0           10.0       1
1   27  51000.0           12.0       1
2   31      NaN            7.0       0
3   45  83000.0            4.0       0
4   38  72000.0            NaN       1
5   29  56000.0            9.0       1

age              0
income           1
hours_studied    1
passed           0
dtype: int64

   age   income  hours_studied  passed
0   22  42000.0           10.0       1
1   27  51000.0           12.0       1
3   45  83000.0            4.0       0
5   29  56000.0            9.0       1
# fill missing values with column means
df_filled = df_missing.fillna(df_missing.mean(numeric_only=True))
print(df_filled)
   age   income  hours_studied  passed
0   22  42000.0           10.0       1
1   27  51000.0           12.0       1
2   31  60800.0            7.0       0
3   45  83000.0            4.0       0
4   38  72000.0            8.4       1
5   29  56000.0            9.0       1
# from pandas to numpy
clean = df_missing.dropna()
X = clean[["age", "income", "hours_studied"]].to_numpy(dtype=float)
y = clean["passed"].to_numpy(dtype=float)
print("X.shape =", X.shape)
print("y =", y)
X.shape = (4, 3)
y = [1. 1. 0. 1.]

7.2 Adding and renaming columns (assign, rename)

Adding new columns and renaming existing ones are among the most frequent pandas operations.

  • df["new_col"] = values: add a new column (or overwrite an existing one) by direct assignment.
  • df.assign(col=expr): return a new DataFrame with an additional column (does not modify the original).
  • df.rename(columns={"old": "new"}): rename one or more columns.
  • df.insert(loc, column, value): insert a column at a specific position.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "weight_kg": [62, 85, 74],
    "height_m": [1.65, 1.80, 1.75]
})
df
      name  weight_kg  height_m
0    Alice         62      1.65
1      Bob         85      1.80
2  Charlie         74      1.75
# adding a computed column
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df
      name  weight_kg  height_m        bmi
0    Alice         62      1.65  22.773186
1      Bob         85      1.80  26.234568
2  Charlie         74      1.75  24.163265
# assign returns a new DataFrame, leaving df unchanged
df2 = df.assign(height_cm=df["height_m"] * 100)
df2
      name  weight_kg  height_m        bmi  height_cm
0    Alice         62      1.65  22.773186      165.0
1      Bob         85      1.80  26.234568      180.0
2  Charlie         74      1.75  24.163265      175.0
# renaming columns
df_renamed = df.rename(columns={"weight_kg": "weight", "height_m": "height"})
df_renamed
      name  weight  height        bmi
0    Alice      62    1.65  22.773186
1      Bob      85    1.80  26.234568
2  Charlie      74    1.75  24.163265

Note that direct assignment (df["col"] = ...) modifies the DataFrame in place, while assign and rename return new DataFrames by default.
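df.insert from the list above is not shown in the examples; here is a minimal sketch, re-creating the same small table (the "id" values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "weight_kg": [62, 85, 74],
    "height_m": [1.65, 1.80, 1.75]
})

# insert an "id" column at position 0; like direct assignment (and unlike
# assign), this modifies df in place
df.insert(0, "id", [101, 102, 103])
print(df.columns.tolist())  # ['id', 'name', 'weight_kg', 'height_m']
```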

7.3 Merging and joining DataFrames (merge, concat)

Assume you have two DataFrames that share one column (called the key below; merging works best when the key entries are unique). pd.merge joins the two DataFrames into one, using the shared column as anchor. A given key may appear only in the left DataFrame, only in the right, or in both, which leads to the four modes of pd.merge(left, right, on="key", how="left/right/inner/outer"):

  how       meaning
  "left"    keep all rows from left
  "right"   keep all rows from right
  "inner"   keep only keys present in both
  "outer"   keep all keys from both DataFrames

In contrast, pd.concat does not require a shared column: it simply stacks two (or more) DataFrames vertically or horizontally.

  • pd.merge(left, right, on="key"): inner join on a shared column key.
  • pd.merge(left, right, on="key", how="left"): left join (keep all rows from the left DataFrame). Other options: "right", "inner", "outer".
  • pd.concat([df1, df2]): stack DataFrames vertically (row-wise).
  • pd.concat([df1, df2], axis=1): stack DataFrames horizontally (column-wise).
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "Diana"]
})

grades = pd.DataFrame({
    "student_id": [1, 2, 3, 5],
    "grade": [1.3, 2.0, 1.7, 2.7]
})

print(students)
print()
print(grades)
   student_id     name
0           1    Alice
1           2      Bob
2           3  Charlie
3           4    Diana

   student_id  grade
0           1    1.3
1           2    2.0
2           3    1.7
3           5    2.7
# inner join: only students present in both tables
merged = pd.merge(students, grades, on="student_id")
# same as merged = pd.merge(students, grades, on="student_id", how="inner")
print(merged)
   student_id     name  grade
0           1    Alice    1.3
1           2      Bob    2.0
2           3  Charlie    1.7
# left join: keep all students, NaN where no grade exists
merged_left = pd.merge(students, grades, on="student_id", how="left")
print(merged_left)
   student_id     name  grade
0           1    Alice    1.3
1           2      Bob    2.0
2           3  Charlie    1.7
3           4    Diana    NaN
# concatenating vertically
df_a = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})
df_b = pd.DataFrame({"name": ["Charlie", "Diana"], "score": [78, 91]})
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
      name  score
0    Alice     85
1      Bob     92
2  Charlie     78
3    Diana     91

When concatenating vertically, use ignore_index=True to reset the index; otherwise both DataFrames keep their original indices, which may lead to duplicate index values.
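To complete the picture of the four join modes, an outer join on the students/grades tables from above keeps every key from both sides, filling in NaN where one side has no match:

```python
import pandas as pd

students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "Diana"]
})
grades = pd.DataFrame({
    "student_id": [1, 2, 3, 5],
    "grade": [1.3, 2.0, 1.7, 2.7]
})

# outer join: student 4 (no grade) and student 5 (no name) both survive
merged_outer = pd.merge(students, grades, on="student_id", how="outer")
print(merged_outer)  # 5 rows; Diana has NaN grade, id 5 has NaN name
```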

7.4 Applying custom functions (apply, map)

Built-in methods like .mean() or .sum() cover many cases, but sometimes you need to apply an arbitrary Python function to each row, column, or element.

  • df.apply(func): apply func to each column (default) or each row (axis=1).
  • series.map(func): apply func element-wise to a Series. (This is usually applied to a column of a DataFrame.)
  • df.map(func): apply func element-wise to every cell of a DataFrame.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78],
    "bonus": [5, 3, 8]
})
df
      name  score  bonus
0    Alice     85      5
1      Bob     92      3
2  Charlie     78      8
# apply a function to each column (default axis=0)
print(df[["score", "bonus"]].apply(np.mean))
score    85.000000
bonus     5.333333
dtype: float64
# apply a function row-wise (axis=1)
df["total"] = df.apply(lambda row: row["score"] + row["bonus"], axis=1)
df
      name  score  bonus  total
0    Alice     85      5     90
1      Bob     92      3     95
2  Charlie     78      8     86
# map: element-wise transformation of a Series
df["grade"] = df["total"].map(lambda x: "pass" if x >= 85 else "fail")
df
      name  score  bonus  total grade
0    Alice     85      5     90  pass
1      Bob     92      3     95  pass
2  Charlie     78      8     86  pass

Use apply with axis=1 when the computation depends on multiple columns in the same row. For simple element-wise transformations on a single column, map is more readable. For best performance on large DataFrames, prefer vectorized operations (e.g., df["score"] + df["bonus"]) over apply whenever possible.

7.5 Iterating through a DataFrame (iterrows, itertuples)

Sometimes you need to loop over the rows of a DataFrame, for example to perform row-wise logic that is hard to express as a vectorized operation.

  • df.iterrows(): iterate over rows as (index, Series) pairs.
  • df.itertuples(): iterate over rows as named tuples (faster than iterrows).
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78],
    "bonus": [5, 3, 8]
})

# iterrows: each row is a Series
for idx, row in df.iterrows():
    print(f"Row {idx}: {row['name']} scored {row['score'] + row['bonus']} in total")
Row 0: Alice scored 90 in total
Row 1: Bob scored 95 in total
Row 2: Charlie scored 86 in total
# itertuples: each row is a named tuple (faster)
for row in df.itertuples():
    print(f"{row.name}: score={row.score}, bonus={row.bonus}")
Alice: score=85, bonus=5
Bob: score=92, bonus=3
Charlie: score=78, bonus=8

Prefer itertuples over iterrows when performance matters, since it avoids creating a Series for each row. However, for most tasks, vectorized operations (e.g., df["score"] + df["bonus"]) or apply are faster than explicit loops and should be preferred whenever possible.
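The vectorized alternative mentioned in the note computes the same totals as the loop above, without any explicit iteration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78],
    "bonus": [5, 3, 8]
})

# one vectorized expression instead of a row-by-row loop
df["total"] = df["score"] + df["bonus"]
print(df["total"].tolist())  # [90, 95, 86]
```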

7.6 Value counts and unique values (value_counts, unique)

When exploring a dataset, you often want to know which values occur in a column and how frequently. This is especially useful for categorical data.

  • df["col"].value_counts(): count how often each value appears, sorted by frequency. (df["col"].value_counts(normalize=True) returns relative frequencies instead of absolute counts.)
  • df["col"].unique(): return an array of all distinct values.
  • df["col"].nunique(): return the number of distinct values.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"],
    "major": ["Math", "CS", "Math", "CS", "Physics", "Math"],
    "level": ["BSc", "MSc", "MSc", "BSc", "BSc", "MSc"]
})
df
      name    major level
0    Alice     Math   BSc
1      Bob       CS   MSc
2  Charlie     Math   MSc
3    Diana       CS   BSc
4      Eve  Physics   BSc
5    Frank     Math   MSc
# how often does each major appear?
print(df["major"].value_counts())
major
Math       3
CS         2
Physics    1
Name: count, dtype: int64
# relative frequencies
print(df["major"].value_counts(normalize=True))
major
Math       0.500000
CS         0.333333
Physics    0.166667
Name: proportion, dtype: float64
# unique values and their count
print("Unique majors:", df["major"].unique())
print("Number of majors:", df["major"].nunique())
Unique majors: <StringArray>
['Math', 'CS', 'Physics']
Length: 3, dtype: str
Number of majors: 3

value_counts() is one of the first things to try when getting to know a new dataset. For numeric columns, consider using df["col"].describe() instead.

7.7 Handling duplicates (duplicated, drop_duplicates)

Real-world data often contains duplicate rows — for example from repeated data imports or logging errors. pandas makes it easy to detect and remove them.

  • df.duplicated(): return a boolean Series that is True for duplicate rows.
  • df.duplicated(subset=["col1", "col2"]): check for duplicates based on specific columns only.
  • df.drop_duplicates(): remove duplicate rows (keeps the first occurrence by default).
  • df.drop_duplicates(subset=["col"], keep="last"): keep the last occurrence instead.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "score": [85, 92, 85, 78, 95]
})
df
      name  score
0    Alice     85
1      Bob     92
2    Alice     85
3  Charlie     78
4      Bob     95
# which rows are duplicates?
print(df.duplicated())
0    False
1    False
2     True
3    False
4    False
dtype: bool
# remove exact duplicate rows
print(df.drop_duplicates())
      name  score
0    Alice     85
1      Bob     92
3  Charlie     78
4      Bob     95
# duplicates based on a subset of columns
print(df.duplicated(subset=["name"]))
0    False
1    False
2     True
3    False
4     True
dtype: bool
# keep only the last entry per name
print(df.drop_duplicates(subset=["name"], keep="last"))
      name  score
2    Alice     85
3  Charlie     78
4      Bob     95

Note that drop_duplicates() returns a new DataFrame by default. To modify in place, pass inplace=True. When checking duplicates on a subset of columns, the first (or last) complete row is kept, not just the key columns.

7.8 Loading data from files (csv, xlsx, json)

Data often comes as files. The most common formats are CSV (comma-separated values), Excel (.xlsx), and JSON (JavaScript Object Notation). A CSV file is like an Excel table in which the cell contents are separated by commas (or some other delimiter). A JSON file has essentially the same structure as a Python dictionary (or a list of dictionaries). We use the following example files throughout this section: person.csv, person.json, and person.db. They contain data on 50 fictional persons with names, addresses, dates of birth, and hobbies.


  • pd.read_csv(path): reads a CSV file into a DataFrame.
  • df.to_csv(path, index=False): writes a DataFrame to a CSV file.
  • pd.read_json(path): reads a JSON file into a DataFrame. (The parsed JSON is turned into a DataFrame just as a dictionary is in pd.DataFrame(dict).)
  • df.to_json(orient="records"): converts a DataFrame to a JSON string. The orient parameter controls the format; "records" produces a list of dictionaries.
  • pd.read_excel(path): reads an Excel file into a DataFrame (requires the openpyxl package).
# Reading a CSV file
df_csv = pd.read_csv("misc/person.csv")
df_csv.head()
  first_name  last_name                               street        town     zip                     country date_of_birth                          hobbies
0      André    Pieters                  Paseo Chucho Roca 1         Rey   86379  Kingdom of the Netherlands    1993-01-12                          reading
1     Ilaria      Jäggi                          Veerlepad 1      Naters    5078                      France    1950-03-12        chess;writing;photography
2    Melissa     Pollak  Urbanización María Teresa Mancebo 4    Lopesdan    3553  Kingdom of the Netherlands    1983-08-09                           skiing
3   Vittorio   Cárdenas                     Rua de Sousa, 40  Lobenstein  4395PF                 Switzerland    1954-03-25                           baking
4    Gioffre    van Dam                Eichenbergerstrasse 1      Techer   07045                      France    1969-04-09  cycling;reading;hiking;swimming
# The hobbies column is a semicolon-separated string — we can inspect it
print(df_csv["hobbies"].head())
0                            reading
1          chess;writing;photography
2                             skiing
3                             baking
4    cycling;reading;hiking;swimming
Name: hobbies, dtype: str
# Reading a JSON file
df_json = pd.read_json("misc/person.json")
df_json.head()
  first_name  last_name                               street        town     zip                     country date_of_birth                               hobbies
0      André    Pieters                  Paseo Chucho Roca 1         Rey   86379  Kingdom of the Netherlands    1993-01-12                             [reading]
1     Ilaria      Jäggi                          Veerlepad 1      Naters    5078                      France    1950-03-12         [chess, writing, photography]
2    Melissa     Pollak  Urbanización María Teresa Mancebo 4    Lopesdan    3553  Kingdom of the Netherlands    1983-08-09                              [skiing]
3   Vittorio   Cárdenas                     Rua de Sousa, 40  Lobenstein  4395PF                 Switzerland    1954-03-25                              [baking]
4    Gioffre    van Dam                Eichenbergerstrasse 1      Techer   07045                      France    1969-04-09  [cycling, reading, hiking, swimming]
# In the JSON version, hobbies is a list
print(df_json["hobbies"].head())
0                               [reading]
1           [chess, writing, photography]
2                                [skiing]
3                                [baking]
4    [cycling, reading, hiking, swimming]
Name: hobbies, dtype: object
# Converting a DataFrame back to JSON
print(df_csv.head(3).to_json(orient="records", indent=2))
[
  {
    "first_name":"Andr\u00e9",
    "last_name":"Pieters",
    "street":"Paseo Chucho Roca 1",
    "town":"Rey",
    "zip":"86379",
    "country":"Kingdom of the Netherlands",
    "date_of_birth":"1993-01-12",
    "hobbies":"reading"
  },
  {
    "first_name":"Ilaria",
    "last_name":"J\u00e4ggi",
    "street":"Veerlepad 1",
    "town":"Naters",
    "zip":"5078",
    "country":"France",
    "date_of_birth":"1950-03-12",
    "hobbies":"chess;writing;photography"
  },
  {
    "first_name":"Melissa",
    "last_name":"Pollak",
    "street":"Urbanizaci\u00f3n Mar\u00eda Teresa Mancebo 4",
    "town":"Lopesdan",
    "zip":"3553",
    "country":"Kingdom of the Netherlands",
    "date_of_birth":"1983-08-09",
    "hobbies":"skiing"
  }
]

Note how CSV and JSON handle the hobbies column differently: CSV stores it as a flat string (here semicolon-separated), while JSON preserves it as a list.
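df.to_csv from the list above was not demonstrated. A minimal round trip, written to the system's temporary directory (the file name scores.csv and the toy data are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})

# index=False avoids writing the row index as an extra column
path = os.path.join(tempfile.gettempdir(), "scores.csv")
df.to_csv(path, index=False)

df_back = pd.read_csv(path)
print(df_back)
```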

7.9 Loading data from online resources (requests)

The requests library allows downloading data from the internet.

  • requests.get(url): sends an HTTP GET request (i.e., downloads the resource at url) and returns a response object.
  • response.status_code: HTTP status code (200 means success).
  • response.text: the response body as a string.
  • response.json(): parse the response body as JSON.
import requests

url = "https://jsonplaceholder.typicode.com/todos/1"
response = requests.get(url)
print("status code:", response.status_code)
print("content:", response.json())
status code: 200
content: {'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}

As a concrete example, the World Bank Indicators API provides demographic and economic data. The API offers thousands of indicators; you can search them programmatically:

# search for population-related indicators
url = "https://api.worldbank.org/v2/indicator?format=json&per_page=20000"
indicators = requests.get(url).json()[1]
df_ind = pd.DataFrame(indicators)
df_ind[df_ind["name"].str.contains("population", case=False)][["id", "name"]].head()
                                id                                               name
24      1.1_ACCESS.ELECTRICITY.TOT      Access to electricity (% of total population)
39    1.2_ACCESS.ELECTRICITY.RURAL      Access to electricity (% of rural population)
40    1.3_ACCESS.ELECTRICITY.URBAN      Access to electricity (% of urban population)
161             2.1_ACCESS.CFT.TOT  Access to Clean Fuels and Technologies for coo...
1551                 allsa.cov_pop  Coverage of social safety net programs (% of p...

Here we fetch total population data (SP.POP.TOTL) for all countries and display the ten most populous ones. The API returns a list with two elements: metadata (page info) at index 0 and the actual data records at index 1. Since the response also includes regional aggregates (like “Arab World” or “Euro area”), we first fetch country metadata to identify and exclude them.

# step 1: find out which codes are regional aggregates (not actual countries)
meta = requests.get(
    "https://api.worldbank.org/v2/country",
    params={"format": "json", "per_page": 300}
).json()[1]
aggregates = {e["iso2Code"] for e in meta if e["region"]["id"] == "NA"}

# step 2: fetch population data and keep only actual countries
response = requests.get(
    "https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL",
    params={"date": "2023", "format": "json", "per_page": 300}
)
data = response.json()[1]  # index 0 is metadata

df_pop = pd.DataFrame([{
    "country": entry["country"]["value"],
    "population": entry["value"]
} for entry in data
    if entry["country"]["id"] not in aggregates and entry["value"] is not None])

df_pop.sort_values("population", ascending=False).head(10)
                country  population
89                India  1438069596
41                China  1410710000
206       United States   336806231
90            Indonesia   281190067
149            Pakistan   247504495
144             Nigeria   227882945
26               Brazil   211140729
15           Bangladesh   171466990
161  Russian Federation   143826130
127              Mexico   129739759

7.10 Working with dates and times (to_datetime, Timestamp)

pandas can parse and work with dates. We use the person dataset from Section 7.8 to illustrate this.

  • pd.to_datetime(...): convert strings to datetime. The format codes are the same as for the datetime module, see Section 5.3.
  • series.dt.year / .dt.month / .dt.day_of_week: extract date components.
  • pd.Timestamp(...): create a single datetime value.
  • dt.strftime(format): format a datetime as a string. The format codes (e.g., %Y, %m, %d, %H, %M) are the same as for datetime.datetime.strftime from the datetime module covered in Section 5.3.
df = pd.read_csv("misc/person.csv")
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])
df["date_of_birth"].head()
0   1993-01-12
1   1950-03-12
2   1983-08-09
3   1954-03-25
4   1969-04-09
Name: date_of_birth, dtype: datetime64[us]
# extract year and month of birth
df["birth_year"] = df["date_of_birth"].dt.year
df["birth_month"] = df["date_of_birth"].dt.month
df[["first_name", "last_name", "birth_year", "birth_month"]].head()
  first_name  last_name  birth_year  birth_month
0      André    Pieters        1993            1
1     Ilaria      Jäggi        1950            3
2    Melissa     Pollak        1983            8
3   Vittorio   Cárdenas        1954            3
4    Gioffre    van Dam        1969            4
# compute age in completed years (approximate: // 365 ignores leap days)
today = pd.Timestamp.now()
print(today)
# Subtracting two datetimes gives a timedelta; .dt.days extracts the number of days from it
df["age"] = (today - df["date_of_birth"]).dt.days // 365
df[["first_name", "last_name", "date_of_birth", "age"]].head()
2026-04-06 13:42:31.398160
  first_name  last_name date_of_birth  age
0      André    Pieters    1993-01-12   33
1     Ilaria      Jäggi    1950-03-12   76
2    Melissa     Pollak    1983-08-09   42
3   Vittorio   Cárdenas    1954-03-25   72
4    Gioffre    van Dam    1969-04-09   57
# format dates as strings (%-d drops the leading zero on Linux/macOS; on Windows use %#d)
df["dob_formatted"] = df["date_of_birth"].dt.strftime("%-d %B %Y")
df[["first_name", "last_name", "dob_formatted"]].head()
  first_name  last_name    dob_formatted
0      André    Pieters  12 January 1993
1     Ilaria      Jäggi    12 March 1950
2    Melissa     Pollak    9 August 1983
3   Vittorio   Cárdenas    25 March 1954
4    Gioffre    van Dam     9 April 1969
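When dates do not come in ISO format, or some entries are malformed, pd.to_datetime accepts an explicit format string and an error policy (toy data invented for illustration):

```python
import pandas as pd

s = pd.Series(["12.01.1993", "09.08.1983", "not a date"])

# format="%d.%m.%Y" parses day.month.year;
# errors="coerce" turns unparseable entries into NaT (the datetime analogue of NaN)
parsed = pd.to_datetime(s, format="%d.%m.%Y", errors="coerce")
print(parsed)
```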

7.11 SQL databases (sqlite3)

Python has built-in support for SQLite databases via the sqlite3 module. This is useful for working with structured data that lives in a database.

  • sqlite3.connect(path): connect to a database (use ":memory:" for an in-memory database).
  • cursor.execute(sql): execute an SQL statement.
  • cursor.fetchall(): fetch all results.
  • pd.read_sql_query(sql, conn): run an SQL query and return the result as a DataFrame.
  • df.to_sql("table_name", conn, index=False): write a DataFrame into an SQL table.
import sqlite3

# connect to the person SQLite database
conn = sqlite3.connect("misc/person.db")
# querying with raw SQL
cursor = conn.cursor()
cursor.execute("SELECT first_name, last_name, country FROM persons WHERE country = 'Germany'")
for row in cursor.fetchall():
    print(row)
('Natalie', 'Burtscher', 'Germany')
('Katherine', 'Gehringer', 'Germany')
('Raimondo', 'van der Heijden', 'Germany')
('Idriz', 'Kitzmann', 'Germany')
('Denis', 'Egli', 'Germany')
('Ekaterina', 'Ledoux', 'Germany')
('Isaac', 'Käfer', 'Germany')
# reading SQL results directly into a pandas DataFrame
df_french = pd.read_sql_query(
    "SELECT first_name, last_name, town FROM persons WHERE country = 'France'",
    conn
)
df_french
  first_name  last_name             town
0     Ilaria      Jäggi           Naters
1    Gioffre    van Dam           Techer
2     Filipa     Backer       Nördlingen
3      Leyre  de Vroege       Rudolstadt
4       Tygo    Seiwald  Benoit-la-Forêt
5   Roswitha     Göschl   Alcara Li Fusi
6       Tito      Mader  Groß-Enzersdorf
# aggregation in SQL
df_counts = pd.read_sql_query(
    "SELECT country, COUNT(*) AS n_persons FROM persons GROUP BY country ORDER BY n_persons DESC",
    conn
)
df_counts
                      country  n_persons
0                 Switzerland         12
1  Kingdom of the Netherlands         10
2                     Germany          7
3                      France          7
4                     Austria          5
5                    Portugal          4
6                       Italy          3
7                       Spain          2
conn.close()
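df.to_sql from the list above was not demonstrated; a sketch using a throwaway in-memory database and invented toy data:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})

conn = sqlite3.connect(":memory:")
df.to_sql("scores", conn, index=False)  # creates the table and inserts the rows

df_back = pd.read_sql_query("SELECT * FROM scores WHERE score > 90", conn)
print(df_back)  # only Bob
conn.close()
```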

7.12 Exercises

All exercises use the person dataset from Section 7.8. Start by loading it:

df = pd.read_csv("misc/person.csv")

Exercise 1 Add a column age to df that contains each person’s age in completed years (as an integer). Hint: use pd.to_datetime to convert date_of_birth, then compute the difference to today’s date. Then use map to add a column age_group that maps each age to "young" (under 30), "middle" (30–59), or "senior" (60+).

# Exercise 1

Exercise 2 Add a column birth_year extracted from date_of_birth. Then use df.describe() and df["birth_year"].value_counts() to explore the distribution of birth years.

# Exercise 2

Exercise 3 Using the generations dictionary from Exercise 6 below, write a function get_generation(year) that returns the generation name (or "Other"). Add a column generation to df using this function. Then write a function filter_generation(df, gen_name) that returns only the rows belonging to a given generation. Use it to display all millennials. How many are there?

# Exercise 3

Exercise 4 Consider the following two DataFrames where the city column contains duplicate values:

shops = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Paris", "Paris"],
    "shop": ["BookWorld", "TechStore", "Café Lune", "ModeParis"]
})

deliveries = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Paris"],
    "item": ["books", "laptops", "croissants"]
})

What does pd.merge(shops, deliveries, on="city") return? Run it and explain the result. How many rows does the output have, and why?

# Exercise 4

Exercise 5 Create a DataFrame with columns name, exam1, exam2, exam3 for 10 students (you can invent the data). Add a column average that contains the mean of the three exams. Filter to show only students with an average above 70.

# Exercise 5

Exercise 6 The following dictionary defines generational cohorts by birth year ranges:

generations = {
    "Silent Generation": (1928, 1945),
    "Baby Boomer": (1946, 1964),
    "Generation X": (1965, 1980),
    "Millennial": (1981, 1996),
    "Generation Z": (1997, 2012),
}

Write a function get_generation(year) that takes a birth year and returns the generation name (or "Other" if the year does not fall into any range). Then use map or apply to add a column generation to df. How many persons belong to each generation?

# Exercise 6

Exercise 7 The hobbies column in the CSV is a semicolon-separated string. The .str accessor on a pandas Series gives access to string methods, so df["hobbies"].str.split(";") splits each entry into a Python list. Use this to split the hobbies. Then find out: what is the most common hobby across all persons? Hint: series.explode() turns a Series of lists into individual rows (one row per list element); then use value_counts.

# Exercise 7

Exercise 8 Download a CSV file from the internet using pd.read_csv(url) (e.g. from this collection). Use describe(), groupby(), and a histogram to explore the data. Write a short summary of what you find.

# Exercise 8

Exercise 9 Create a DataFrame with some missing values (np.nan). Compare three strategies for handling them: (a) drop rows with any missing value, (b) fill with the column mean, (c) fill with the column median. How do the summary statistics (describe()) change in each case?

# Exercise 9

Exercise 10 Write a function csv_to_sqlite(csv_path, db_path, table_name) that reads a CSV file into a pandas DataFrame and writes it into an SQLite database. Then write a function query_db(db_path, sql) that runs an SQL query and returns the result as a DataFrame. Test with a CSV file of your choice.

# Exercise 10

Exercise 11 Fetch Germany’s total population from the World Bank API for the years 1960–2023 (see Section 7.9 for how to use the API; the country code is DEU, the indicator is SP.POP.TOTL). Convert the year column to a proper datetime using pd.to_datetime(df["year"], format="%Y"). Then: (a) Add a column decade that contains the decade of each year (e.g. 1960, 1970, …). Hint: use integer division // 10 * 10. (b) Compute the mean population per decade using groupby. (c) Add a column growth that contains the year-over-year population change. Hint: df["population"].diff(). (d) In which year was the population growth the highest? In which year was it negative for the first time? (e) Resample the data to 5-year intervals using df.set_index("date").resample("5YE").mean() and plot the result.

# Exercise 11