7  Loading and handling data

Download notebook.

In this chapter we learn how to load data from various sources and how to handle tabular data using pandas. Before we start, we describe what a dataframe is, since it is the building block of many data science routines.

7.1 Dataframes (pandas)

pandas is the standard library for handling tabular data in Python. Its central data structure is the DataFrame: essentially a two-dimensional table with rows and columns. Each column has a name and a data type (e.g., integers, floats, strings), and each row represents one observation or record. You can think of a dataframe as an in-memory spreadsheet that comes with powerful methods for selecting, filtering, transforming, and aggregating data. A single column of a DataFrame is a Series, a one-dimensional array with an index. A Series behaves like a list but supports vectorized operations (e.g., series + 1 adds 1 to every element) and has many built-in methods such as .mean(), .sum(), or .value_counts().

# Execute once for installing pandas
# !pip3 install pandas
import pandas as pd
import numpy as np
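As a quick illustration of the Series behavior described above (a toy Series invented here for illustration):

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2])

print(s + 1)             # vectorized: adds 1 to every element -> 4, 2, 4, 3
print(s.mean())          # 2.25
print(s.value_counts())  # 3 occurs twice, 1 and 2 once each
```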

Here are the most important routines, where df is a DataFrame:

  • pd.DataFrame(dict): create a DataFrame from a dictionary (keys become column names, values must be lists of the same length).
  • df["col"]: select a single column (returns a Series, which is similar to a list).
  • df[["col1", "col2"]]: select multiple columns.
  • df.iloc[i] or df.iloc[i:j]: select row(s) by integer position i (the slice i:j goes up to j-1).
  • df.loc[i, "col"] or df.loc[i:j, ["col1", "col2"]]: select row(s) by label for column "col" (resp. columns "col1", "col2"); note that the label slice i:j includes j (unlike .iloc).
  • df[df["col"] > value]: filter rows by a condition (here df["col"] > value).
  • df.head(n) / df.tail(n): first / last n rows.
  • df.describe(): summary statistics (count, mean, std, min, quartiles, max) for all numeric columns.
  • df["col"].count(), df["col"].mean(), df["col"].std(), df["col"].min(), df["col"].max(): individual summary statistics for a column.
  • df["col"].median(): median of a column.
  • df["col"].quantile(q): the q-th quantile (e.g., q=0.25 for the first quartile).
  • df["col"].sum(): sum of all values in a column.
  • df.sort_values("col"): sort by a column.
  • df.groupby("col").mean(): group by a column and compute means.
  • df.isna(): boolean mask of missing values.
  • df.dropna(): drop rows with missing values.
  • df.fillna(value): fill missing values.
  • df.drop(columns=["col"]): remove a column.
  • df.to_numpy(): convert to a NumPy array (but column names get lost).
df = pd.DataFrame({
    "age": [22, 27, 31, 45, 38, 29],
    "income": [42000, 51000, 64000, 83000, 72000, 56000],
    "hours_studied": [10, 12, 7, 4, 6, 9],
    "passed": [1, 1, 0, 0, 1, 1]
})
df
   age  income  hours_studied  passed
0   22   42000             10       1
1   27   51000             12       1
2   31   64000              7       0
3   45   83000              4       0
4   38   72000              6       1
5   29   56000              9       1
# selecting columns
print(df["age"])
0    22
1    27
2    31
3    45
4    38
5    29
Name: age, dtype: int64
print(df[["age", "income"]])
   age  income
0   22   42000
1   27   51000
2   31   64000
3   45   83000
4   38   72000
5   29   56000
# selecting rows
print(df.iloc[0])
age                 22
income           42000
hours_studied       10
passed               1
Name: 0, dtype: int64
print(df.loc[0:2, ["age", "passed"]])
   age  passed
0   22       1
1   27       1
2   31       0
# filtering
filtered = df[df["income"] > 55000]
filtered
   age  income  hours_studied  passed
2   31   64000              7       0
3   45   83000              4       0
4   38   72000              6       1
5   29   56000              9       1
# summary statistics
print(df.describe())
             age        income  hours_studied    passed
count   6.000000      6.000000       6.000000  6.000000
mean   32.000000  61333.333333       8.000000  0.666667
std     8.246211  14827.901627       2.898275  0.516398
min    22.000000  42000.000000       4.000000  0.000000
25%    27.500000  52250.000000       6.250000  0.250000
50%    30.000000  60000.000000       8.000000  1.000000
75%    36.250000  70000.000000       9.750000  1.000000
max    45.000000  83000.000000      12.000000  1.000000
# individual statistics for a column
print("mean income:", df["income"].mean())
print("std income:", df["income"].std())
print("median income:", df["income"].median())
print("25th percentile:", df["income"].quantile(0.25))
print("total hours studied:", df["hours_studied"].sum())
mean income: 61333.333333333336
std income: 14827.901627225163
median income: 60000.0
25th percentile: 52250.0
total hours studied: 48
# sorting
df_sorted = df.sort_values("income", ascending=False)
print(df_sorted)
   age  income  hours_studied  passed
3   45   83000              4       0
4   38   72000              6       1
2   31   64000              7       0
5   29   56000              9       1
1   27   51000             12       1
0   22   42000             10       1
# groupby — numeric_only=True restricts the computation to numeric columns,
# ignoring string columns (which cannot be averaged)
grouped = df.groupby("passed").mean(numeric_only=True)
print(grouped)
         age   income  hours_studied
passed                              
0       38.0  73500.0           5.50
1       29.0  55250.0           9.25
# missing values
df_missing = df.copy()
df_missing.loc[2, "income"] = np.nan
df_missing.loc[4, "hours_studied"] = np.nan

print(df_missing)
print()
print(df_missing.isna().sum())
print()
print(df_missing.dropna())
   age   income  hours_studied  passed
0   22  42000.0           10.0       1
1   27  51000.0           12.0       1
2   31      NaN            7.0       0
3   45  83000.0            4.0       0
4   38  72000.0            NaN       1
5   29  56000.0            9.0       1

age              0
income           1
hours_studied    1
passed           0
dtype: int64

   age   income  hours_studied  passed
0   22  42000.0           10.0       1
1   27  51000.0           12.0       1
3   45  83000.0            4.0       0
5   29  56000.0            9.0       1
# fill missing values with column means
df_filled = df_missing.fillna(df_missing.mean(numeric_only=True))
print(df_filled)
   age   income  hours_studied  passed
0   22  42000.0           10.0       1
1   27  51000.0           12.0       1
2   31  60800.0            7.0       0
3   45  83000.0            4.0       0
4   38  72000.0            8.4       1
5   29  56000.0            9.0       1
# from pandas to numpy
clean = df_missing.dropna()
X = clean[["age", "income", "hours_studied"]].to_numpy(dtype=float)
y = clean["passed"].to_numpy(dtype=float)
print("X.shape =", X.shape)
print("y =", y)
X.shape = (4, 3)
y = [1. 1. 0. 1.]

7.2 Adding and renaming columns (assign, rename)

Adding new columns and renaming existing ones are among the most frequent pandas operations.

  • df["new_col"] = values: add a new column (or overwrite an existing one) by direct assignment.
  • df.assign(col=expr): return a new DataFrame with an additional column (does not modify the original).
  • df.rename(columns={"old": "new"}): rename one or more columns.
  • df.insert(loc, column, value): insert a column at a specific position.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "weight_kg": [62, 85, 74],
    "height_m": [1.65, 1.80, 1.75]
})
df
      name  weight_kg  height_m
0    Alice         62      1.65
1      Bob         85      1.80
2  Charlie         74      1.75
# adding a computed column
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df
      name  weight_kg  height_m        bmi
0    Alice         62      1.65  22.773186
1      Bob         85      1.80  26.234568
2  Charlie         74      1.75  24.163265
# assign returns a new DataFrame, leaving df unchanged
df2 = df.assign(height_cm=df["height_m"] * 100)
df2
      name  weight_kg  height_m        bmi  height_cm
0    Alice         62      1.65  22.773186      165.0
1      Bob         85      1.80  26.234568      180.0
2  Charlie         74      1.75  24.163265      175.0
# renaming columns
df_renamed = df.rename(columns={"weight_kg": "weight", "height_m": "height"})
df_renamed
      name  weight  height        bmi
0    Alice      62    1.65  22.773186
1      Bob      85    1.80  26.234568
2  Charlie      74    1.75  24.163265

Note that direct assignment (df["col"] = ...) modifies the DataFrame in place, while assign and rename return new DataFrames by default.
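df.insert from the list above is not shown in the examples; here is a minimal sketch, re-creating the same small table (the "id" values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "weight_kg": [62, 85, 74],
    "height_m": [1.65, 1.80, 1.75]
})

# insert an "id" column at position 0; like direct assignment (and unlike
# assign), this modifies df in place
df.insert(0, "id", [101, 102, 103])
print(df.columns.tolist())  # ['id', 'name', 'weight_kg', 'height_m']
```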

7.3 Merging and joining DataFrames (merge, concat)

Assume you have two DataFrames that share one column (called the key below; merging works best when the key entries are unique). pd.merge joins the two DataFrames into one, using the shared column as anchor. A given key may appear only in the left DataFrame, only in the right, or in both, which leads to the four modes of pd.merge(left, right, on="key", how="left/right/inner/outer"):

  how       meaning
  "left"    keep all rows from left
  "right"   keep all rows from right
  "inner"   keep only keys present in both
  "outer"   keep all keys from both DataFrames

In contrast, pd.concat does not require a shared column: it simply stacks two (or more) DataFrames vertically or horizontally.

  • pd.merge(left, right, on="key"): inner join on a shared column key.
  • pd.merge(left, right, on="key", how="left"): left join (keep all rows from the left DataFrame). Other options: "right", "inner", "outer".
  • pd.concat([df1, df2]): stack DataFrames vertically (row-wise).
  • pd.concat([df1, df2], axis=1): stack DataFrames horizontally (column-wise).
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "Diana"]
})

grades = pd.DataFrame({
    "student_id": [1, 2, 3, 5],
    "grade": [1.3, 2.0, 1.7, 2.7]
})

print(students)
print()
print(grades)
   student_id     name
0           1    Alice
1           2      Bob
2           3  Charlie
3           4    Diana

   student_id  grade
0           1    1.3
1           2    2.0
2           3    1.7
3           5    2.7
# inner join: only students present in both tables
merged = pd.merge(students, grades, on="student_id")
# same as merged = pd.merge(students, grades, on="student_id", how="inner")
print(merged)
   student_id     name  grade
0           1    Alice    1.3
1           2      Bob    2.0
2           3  Charlie    1.7
# left join: keep all students, NaN where no grade exists
merged_left = pd.merge(students, grades, on="student_id", how="left")
print(merged_left)
   student_id     name  grade
0           1    Alice    1.3
1           2      Bob    2.0
2           3  Charlie    1.7
3           4    Diana    NaN
# concatenating vertically
df_a = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})
df_b = pd.DataFrame({"name": ["Charlie", "Diana"], "score": [78, 91]})
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
      name  score
0    Alice     85
1      Bob     92
2  Charlie     78
3    Diana     91

When concatenating vertically, use ignore_index=True to reset the index; otherwise both DataFrames keep their original indices, which may lead to duplicate index values.
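To complete the picture of the four join modes, an outer join on the students/grades tables from above keeps every key from both sides, filling in NaN where one side has no match:

```python
import pandas as pd

students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "Diana"]
})
grades = pd.DataFrame({
    "student_id": [1, 2, 3, 5],
    "grade": [1.3, 2.0, 1.7, 2.7]
})

# outer join: student 4 (no grade) and student 5 (no name) both survive
merged_outer = pd.merge(students, grades, on="student_id", how="outer")
print(merged_outer)  # 5 rows; Diana has NaN grade, id 5 has NaN name
```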

7.4 Applying custom functions (apply, map)

Built-in methods like .mean() or .sum() cover many cases, but sometimes you need to apply an arbitrary Python function to each row, column, or element.

  • df.apply(func): apply func to each column (default) or each row (axis=1).
  • series.map(func): apply func element-wise to a Series. (This is usually applied to a column of a DataFrame.)
  • df.map(func): apply func element-wise to every cell of a DataFrame.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78],
    "bonus": [5, 3, 8]
})
df
      name  score  bonus
0    Alice     85      5
1      Bob     92      3
2  Charlie     78      8
# apply a function to each column (default axis=0)
print(df[["score", "bonus"]].apply(np.mean))
score    85.000000
bonus     5.333333
dtype: float64
# apply a function row-wise (axis=1)
df["total"] = df.apply(lambda row: row["score"] + row["bonus"], axis=1)
df
      name  score  bonus  total
0    Alice     85      5     90
1      Bob     92      3     95
2  Charlie     78      8     86
# map: element-wise transformation of a Series
df["grade"] = df["total"].map(lambda x: "pass" if x >= 85 else "fail")
df
      name  score  bonus  total grade
0    Alice     85      5     90  pass
1      Bob     92      3     95  pass
2  Charlie     78      8     86  pass

Use apply with axis=1 when the computation depends on multiple columns in the same row. For simple element-wise transformations on a single column, map is more readable. For best performance on large DataFrames, prefer vectorized operations (e.g., df["score"] + df["bonus"]) over apply whenever possible.

7.5 Iterating through a DataFrame (iterrows, itertuples)

Sometimes you need to loop over the rows of a DataFrame, for example to perform row-wise logic that is hard to express as a vectorized operation.

  • df.iterrows(): iterate over rows as (index, Series) pairs.
  • df.itertuples(): iterate over rows as named tuples (faster than iterrows).
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78],
    "bonus": [5, 3, 8]
})

# iterrows: each row is a Series
for idx, row in df.iterrows():
    print(f"Row {idx}: {row['name']} scored {row['score'] + row['bonus']} in total")
Row 0: Alice scored 90 in total
Row 1: Bob scored 95 in total
Row 2: Charlie scored 86 in total
# itertuples: each row is a named tuple (faster)
for row in df.itertuples():
    print(f"{row.name}: score={row.score}, bonus={row.bonus}")
Alice: score=85, bonus=5
Bob: score=92, bonus=3
Charlie: score=78, bonus=8

Prefer itertuples over iterrows when performance matters, since it avoids creating a Series for each row. However, for most tasks, vectorized operations (e.g., df["score"] + df["bonus"]) or apply are faster than explicit loops and should be preferred whenever possible.
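The vectorized alternative mentioned in the note computes the same totals as the loop above, without any explicit iteration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78],
    "bonus": [5, 3, 8]
})

# one vectorized expression instead of a row-by-row loop
df["total"] = df["score"] + df["bonus"]
print(df["total"].tolist())  # [90, 95, 86]
```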

7.6 Value counts and unique values (value_counts, unique)

When exploring a dataset, you often want to know which values occur in a column and how frequently. This is especially useful for categorical data.

  • df["col"].value_counts(): count how often each value appears, sorted by frequency. (df["col"].value_counts(normalize=True) returns relative frequencies instead of absolute counts.)
  • df["col"].unique(): return an array of all distinct values.
  • df["col"].nunique(): return the number of distinct values.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"],
    "major": ["Math", "CS", "Math", "CS", "Physics", "Math"],
    "level": ["BSc", "MSc", "MSc", "BSc", "BSc", "MSc"]
})
df
      name    major level
0    Alice     Math   BSc
1      Bob       CS   MSc
2  Charlie     Math   MSc
3    Diana       CS   BSc
4      Eve  Physics   BSc
5    Frank     Math   MSc
# how often does each major appear?
print(df["major"].value_counts())
major
Math       3
CS         2
Physics    1
Name: count, dtype: int64
# relative frequencies
print(df["major"].value_counts(normalize=True))
major
Math       0.500000
CS         0.333333
Physics    0.166667
Name: proportion, dtype: float64
# unique values and their count
print("Unique majors:", df["major"].unique())
print("Number of majors:", df["major"].nunique())
Unique majors: <StringArray>
['Math', 'CS', 'Physics']
Length: 3, dtype: str
Number of majors: 3

value_counts() is one of the first things to try when getting to know a new dataset. For numeric columns, consider using df["col"].describe() instead.

7.7 Handling duplicates (duplicated, drop_duplicates)

Real-world data often contains duplicate rows — for example from repeated data imports or logging errors. pandas makes it easy to detect and remove them.

  • df.duplicated(): return a boolean Series that is True for duplicate rows.
  • df.duplicated(subset=["col1", "col2"]): check for duplicates based on specific columns only.
  • df.drop_duplicates(): remove duplicate rows (keeps the first occurrence by default).
  • df.drop_duplicates(subset=["col"], keep="last"): keep the last occurrence instead.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "score": [85, 92, 85, 78, 95]
})
df
      name  score
0    Alice     85
1      Bob     92
2    Alice     85
3  Charlie     78
4      Bob     95
# which rows are duplicates?
print(df.duplicated())
0    False
1    False
2     True
3    False
4    False
dtype: bool
# remove exact duplicate rows
print(df.drop_duplicates())
      name  score
0    Alice     85
1      Bob     92
3  Charlie     78
4      Bob     95
# duplicates based on a subset of columns
print(df.duplicated(subset=["name"]))
0    False
1    False
2     True
3    False
4     True
dtype: bool
# keep only the last entry per name
print(df.drop_duplicates(subset=["name"], keep="last"))
      name  score
2    Alice     85
3  Charlie     78
4      Bob     95

Note that drop_duplicates() returns a new DataFrame by default. To modify in place, pass inplace=True. When checking duplicates on a subset of columns, the first (or last) complete row is kept, not just the key columns.

7.8 Loading data from files (csv, xlsx, json)

Data often comes as files. The most common formats are CSV (comma-separated values), Excel (.xlsx), and JSON (JavaScript Object Notation). A CSV file is like an Excel table in which the cell contents are separated by commas (or some other delimiter). A JSON file has essentially the same structure as a Python dictionary (or a list of dictionaries). We use the following example files throughout this section: person.csv, person.json, and person.db. They contain data on 50 fictional persons with names, addresses, dates of birth, and hobbies.


  • pd.read_csv(path): reads a CSV file into a DataFrame.
  • df.to_csv(path, index=False): writes a DataFrame to a CSV file.
  • pd.read_json(path): reads a JSON file into a DataFrame. (The parsed JSON is turned into a DataFrame just as a dictionary is in pd.DataFrame(dict).)
  • df.to_json(orient="records"): converts a DataFrame to a JSON string. The orient parameter controls the format; "records" produces a list of dictionaries.
  • pd.read_excel(path): reads an Excel file into a DataFrame (requires the openpyxl package).
# Reading a CSV file
df_csv = pd.read_csv("misc/person.csv")
df_csv.head()
  first_name  last_name                               street        town     zip                     country date_of_birth                          hobbies
0      André    Pieters                  Paseo Chucho Roca 1         Rey   86379  Kingdom of the Netherlands    1993-01-12                          reading
1     Ilaria      Jäggi                          Veerlepad 1      Naters    5078                      France    1950-03-12        chess;writing;photography
2    Melissa     Pollak  Urbanización María Teresa Mancebo 4    Lopesdan    3553  Kingdom of the Netherlands    1983-08-09                           skiing
3   Vittorio   Cárdenas                     Rua de Sousa, 40  Lobenstein  4395PF                 Switzerland    1954-03-25                           baking
4    Gioffre    van Dam                Eichenbergerstrasse 1      Techer   07045                      France    1969-04-09  cycling;reading;hiking;swimming
# The hobbies column is a semicolon-separated string — we can inspect it
print(df_csv["hobbies"].head())
0                            reading
1          chess;writing;photography
2                             skiing
3                             baking
4    cycling;reading;hiking;swimming
Name: hobbies, dtype: str
# Reading a JSON file
df_json = pd.read_json("misc/person.json")
df_json.head()
  first_name  last_name                               street        town     zip                     country date_of_birth                               hobbies
0      André    Pieters                  Paseo Chucho Roca 1         Rey   86379  Kingdom of the Netherlands    1993-01-12                             [reading]
1     Ilaria      Jäggi                          Veerlepad 1      Naters    5078                      France    1950-03-12         [chess, writing, photography]
2    Melissa     Pollak  Urbanización María Teresa Mancebo 4    Lopesdan    3553  Kingdom of the Netherlands    1983-08-09                              [skiing]
3   Vittorio   Cárdenas                     Rua de Sousa, 40  Lobenstein  4395PF                 Switzerland    1954-03-25                              [baking]
4    Gioffre    van Dam                Eichenbergerstrasse 1      Techer   07045                      France    1969-04-09  [cycling, reading, hiking, swimming]
# In the JSON version, hobbies is a list
print(df_json["hobbies"].head())
0                               [reading]
1           [chess, writing, photography]
2                                [skiing]
3                                [baking]
4    [cycling, reading, hiking, swimming]
Name: hobbies, dtype: object
# Converting a DataFrame back to JSON
print(df_csv.head(3).to_json(orient="records", indent=2))
[
  {
    "first_name":"Andr\u00e9",
    "last_name":"Pieters",
    "street":"Paseo Chucho Roca 1",
    "town":"Rey",
    "zip":"86379",
    "country":"Kingdom of the Netherlands",
    "date_of_birth":"1993-01-12",
    "hobbies":"reading"
  },
  {
    "first_name":"Ilaria",
    "last_name":"J\u00e4ggi",
    "street":"Veerlepad 1",
    "town":"Naters",
    "zip":"5078",
    "country":"France",
    "date_of_birth":"1950-03-12",
    "hobbies":"chess;writing;photography"
  },
  {
    "first_name":"Melissa",
    "last_name":"Pollak",
    "street":"Urbanizaci\u00f3n Mar\u00eda Teresa Mancebo 4",
    "town":"Lopesdan",
    "zip":"3553",
    "country":"Kingdom of the Netherlands",
    "date_of_birth":"1983-08-09",
    "hobbies":"skiing"
  }
]

Note how CSV and JSON handle the hobbies column differently: CSV stores it as a flat string (here semicolon-separated), while JSON preserves it as a list.
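df.to_csv from the list above was not demonstrated. A minimal round trip, written to the system's temporary directory (the file name scores.csv and the toy data are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})

# index=False avoids writing the row index as an extra column
path = os.path.join(tempfile.gettempdir(), "scores.csv")
df.to_csv(path, index=False)

df_back = pd.read_csv(path)
print(df_back)
```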

7.9 Loading data from online resources (requests)

The requests library allows downloading data from the internet.

  • requests.get(url): sends an HTTP GET request (i.e., downloads the resource at url) and returns a response object.
  • response.status_code: HTTP status code (200 means success).
  • response.text: the response body as a string.
  • response.json(): parse the response body as JSON.
import requests

url = "https://jsonplaceholder.typicode.com/todos/1"
response = requests.get(url)
print("status code:", response.status_code)
print("content:", response.json())
status code: 200
content: {'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}

As a concrete example, the World Bank Indicators API provides demographic and economic data. The API offers thousands of indicators; you can search them programmatically:

# search for population-related indicators
url = "https://api.worldbank.org/v2/indicator?format=json&per_page=20000"
indicators = requests.get(url).json()[1]
df_ind = pd.DataFrame(indicators)
df_ind[df_ind["name"].str.contains("population", case=False)][["id", "name"]].head()
                                id                                               name
24      1.1_ACCESS.ELECTRICITY.TOT      Access to electricity (% of total population)
39    1.2_ACCESS.ELECTRICITY.RURAL      Access to electricity (% of rural population)
40    1.3_ACCESS.ELECTRICITY.URBAN      Access to electricity (% of urban population)
161             2.1_ACCESS.CFT.TOT  Access to Clean Fuels and Technologies for coo...
1551                 allsa.cov_pop  Coverage of social safety net programs (% of p...

Here we fetch total population data (SP.POP.TOTL) for all countries and display the ten most populous ones. The API returns a list with two elements: metadata (page info) at index 0 and the actual data records at index 1. Since the response also includes regional aggregates (like “Arab World” or “Euro area”), we first fetch country metadata to identify and exclude them.

# step 1: find out which codes are regional aggregates (not actual countries)
meta = requests.get(
    "https://api.worldbank.org/v2/country",
    params={"format": "json", "per_page": 300}
).json()[1]
aggregates = {e["iso2Code"] for e in meta if e["region"]["id"] == "NA"}

# step 2: fetch population data and keep only actual countries
response = requests.get(
    "https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL",
    params={"date": "2023", "format": "json", "per_page": 300}
)
data = response.json()[1]  # index 0 is metadata

df_pop = pd.DataFrame([{
    "country": entry["country"]["value"],
    "population": entry["value"]
} for entry in data
    if entry["country"]["id"] not in aggregates and entry["value"] is not None])

df_pop.sort_values("population", ascending=False).head(10)
                country  population
89                India  1438069596
41                China  1410710000
206       United States   336806231
90            Indonesia   281190067
149            Pakistan   247504495
144             Nigeria   227882945
26               Brazil   211140729
15           Bangladesh   171466990
161  Russian Federation   143826130
127              Mexico   129739759

7.10 Working with dates and times (to_datetime, Timestamp)

pandas can parse and work with dates. We use the person dataset from Section 7.8 to illustrate this.

  • pd.to_datetime(...): convert strings to datetime. The format codes are the same as for the datetime module, see Section 5.3.
  • series.dt.year / .dt.month / .dt.day_of_week: extract date components.
  • pd.Timestamp(...): create a single datetime value.
  • dt.strftime(format): format a datetime as a string. The format codes (e.g., %Y, %m, %d, %H, %M) are the same as for datetime.datetime.strftime from the datetime module covered in Section 5.3.
df = pd.read_csv("misc/person.csv")
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])
df["date_of_birth"].head()
0   1993-01-12
1   1950-03-12
2   1983-08-09
3   1954-03-25
4   1969-04-09
Name: date_of_birth, dtype: datetime64[us]
# extract year and month of birth
df["birth_year"] = df["date_of_birth"].dt.year
df["birth_month"] = df["date_of_birth"].dt.month
df[["first_name", "last_name", "birth_year", "birth_month"]].head()
  first_name  last_name  birth_year  birth_month
0      André    Pieters        1993            1
1     Ilaria      Jäggi        1950            3
2    Melissa     Pollak        1983            8
3   Vittorio   Cárdenas        1954            3
4    Gioffre    van Dam        1969            4
# compute age in completed years (approximate: // 365 ignores leap days)
today = pd.Timestamp.now()
print(today)
# Subtracting two datetimes gives a timedelta; .dt.days extracts the number of days from it
df["age"] = (today - df["date_of_birth"]).dt.days // 365
df[["first_name", "last_name", "date_of_birth", "age"]].head()
2026-04-06 13:42:31.398160
  first_name  last_name date_of_birth  age
0      André    Pieters    1993-01-12   33
1     Ilaria      Jäggi    1950-03-12   76
2    Melissa     Pollak    1983-08-09   42
3   Vittorio   Cárdenas    1954-03-25   72
4    Gioffre    van Dam    1969-04-09   57
# format dates as strings (%-d drops the leading zero on Linux/macOS; on Windows use %#d)
df["dob_formatted"] = df["date_of_birth"].dt.strftime("%-d %B %Y")
df[["first_name", "last_name", "dob_formatted"]].head()
  first_name  last_name    dob_formatted
0      André    Pieters  12 January 1993
1     Ilaria      Jäggi    12 March 1950
2    Melissa     Pollak    9 August 1983
3   Vittorio   Cárdenas    25 March 1954
4    Gioffre    van Dam     9 April 1969
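When dates do not come in ISO format, or some entries are malformed, pd.to_datetime accepts an explicit format string and an error policy (toy data invented for illustration):

```python
import pandas as pd

s = pd.Series(["12.01.1993", "09.08.1983", "not a date"])

# format="%d.%m.%Y" parses day.month.year;
# errors="coerce" turns unparseable entries into NaT (the datetime analogue of NaN)
parsed = pd.to_datetime(s, format="%d.%m.%Y", errors="coerce")
print(parsed)
```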

7.11 SQL databases (sqlite3)

Python has built-in support for SQLite databases via the sqlite3 module. This is useful for working with structured data that lives in a database.

  • sqlite3.connect(path): connect to a database (use ":memory:" for an in-memory database).
  • cursor.execute(sql): execute an SQL statement.
  • cursor.fetchall(): fetch all results.
  • pd.read_sql_query(sql, conn): run an SQL query and return the result as a DataFrame.
  • df.to_sql("table_name", conn, index=False): write a DataFrame into an SQL table.
import sqlite3

# connect to the person SQLite database
conn = sqlite3.connect("misc/person.db")
# querying with raw SQL
cursor = conn.cursor()
cursor.execute("SELECT first_name, last_name, country FROM persons WHERE country = 'Germany'")
for row in cursor.fetchall():
    print(row)
('Natalie', 'Burtscher', 'Germany')
('Katherine', 'Gehringer', 'Germany')
('Raimondo', 'van der Heijden', 'Germany')
('Idriz', 'Kitzmann', 'Germany')
('Denis', 'Egli', 'Germany')
('Ekaterina', 'Ledoux', 'Germany')
('Isaac', 'Käfer', 'Germany')
# reading SQL results directly into a pandas DataFrame
df_french = pd.read_sql_query(
    "SELECT first_name, last_name, town FROM persons WHERE country = 'France'",
    conn
)
df_french
  first_name  last_name             town
0     Ilaria      Jäggi           Naters
1    Gioffre    van Dam           Techer
2     Filipa     Backer       Nördlingen
3      Leyre  de Vroege       Rudolstadt
4       Tygo    Seiwald  Benoit-la-Forêt
5   Roswitha     Göschl   Alcara Li Fusi
6       Tito      Mader  Groß-Enzersdorf
# aggregation in SQL
df_counts = pd.read_sql_query(
    "SELECT country, COUNT(*) AS n_persons FROM persons GROUP BY country ORDER BY n_persons DESC",
    conn
)
df_counts
                      country  n_persons
0                 Switzerland         12
1  Kingdom of the Netherlands         10
2                     Germany          7
3                      France          7
4                     Austria          5
5                    Portugal          4
6                       Italy          3
7                       Spain          2
conn.close()
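df.to_sql from the list above was not demonstrated; a sketch using a throwaway in-memory database and invented toy data:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})

conn = sqlite3.connect(":memory:")
df.to_sql("scores", conn, index=False)  # creates the table and inserts the rows

df_back = pd.read_sql_query("SELECT * FROM scores WHERE score > 90", conn)
print(df_back)  # only Bob
conn.close()
```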

7.12 Exercises

All exercises use the person dataset from Section 7.8. Start by loading it:

df = pd.read_csv("misc/person.csv")

Exercise 1 Add a column age to df that contains each person’s age in completed years (as an integer). Hint: use pd.to_datetime to convert date_of_birth, then compute the difference to today’s date. Then use map to add a column age_group that maps each age to "young" (under 30), "middle" (30–59), or "senior" (60+).

# Exercise 1

Exercise 2 Add a column birth_year extracted from date_of_birth. Then use df.describe() and df["birth_year"].value_counts() to explore the distribution of birth years.

# Exercise 2

Exercise 3 Using the generations dictionary from Exercise 6 below, write a function get_generation(year) that returns the generation name (or "Other"). Add a column generation to df using this function. Then write a function filter_generation(df, gen_name) that returns only the rows belonging to a given generation. Use it to display all millennials. How many are there?

# Exercise 3

Exercise 4 Consider the following two DataFrames where the city column contains duplicate values:

shops = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Paris", "Paris"],
    "shop": ["BookWorld", "TechStore", "Café Lune", "ModeParis"]
})

deliveries = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Paris"],
    "item": ["books", "laptops", "croissants"]
})

What does pd.merge(shops, deliveries, on="city") return? Run it and explain the result. How many rows does the output have, and why?

# Exercise 4

Exercise 5 Create a DataFrame with columns name, exam1, exam2, exam3 for 10 students (you can invent the data). Add a column average that contains the mean of the three exams. Filter to show only students with an average above 70.

# Exercise 5

Exercise 6 The following dictionary defines generational cohorts by birth year ranges:

generations = {
    "Silent Generation": (1928, 1945),
    "Baby Boomer": (1946, 1964),
    "Generation X": (1965, 1980),
    "Millennial": (1981, 1996),
    "Generation Z": (1997, 2012),
}

Write a function get_generation(year) that takes a birth year and returns the generation name (or "Other" if the year does not fall into any range). Then use map or apply to add a column generation to df. How many persons belong to each generation?

# Exercise 6

Exercise 7 The hobbies column in the CSV is a semicolon-separated string. The .str accessor on a pandas Series gives access to string methods, so df["hobbies"].str.split(";") splits each entry into a Python list. Use this to split the hobbies. Then find out: what is the most common hobby across all persons? Hint: series.explode() turns a Series of lists into individual rows (one row per list element); then use value_counts.

# Exercise 7

Exercise 8 Download a CSV file from the internet using pd.read_csv(url) (e.g. from this collection). Use describe(), groupby(), and a histogram to explore the data. Write a short summary of what you find.

# Exercise 8

Exercise 9 Create a DataFrame with some missing values (np.nan). Compare three strategies for handling them: (a) drop rows with any missing value, (b) fill with the column mean, (c) fill with the column median. How do the summary statistics (describe()) change in each case?

# Exercise 9

Exercise 10 Write a function csv_to_sqlite(csv_path, db_path, table_name) that reads a CSV file into a pandas DataFrame and writes it into an SQLite database. Then write a function query_db(db_path, sql) that runs an SQL query and returns the result as a DataFrame. Test with a CSV file of your choice.

# Exercise 10

Exercise 11 Fetch Germany’s total population from the World Bank API for the years 1960–2023 (see Section 7.9 for how to use the API; the country code is DEU, the indicator is SP.POP.TOTL). Convert the year column to a proper datetime using pd.to_datetime(df["year"], format="%Y"). Then: (a) Add a column decade that contains the decade of each year (e.g. 1960, 1970, …). Hint: use integer division // 10 * 10. (b) Compute the mean population per decade using groupby. (c) Add a column growth that contains the year-over-year population change. Hint: df["population"].diff(). (d) In which year was the population growth the highest? In which year was it negative for the first time? (e) Resample the data to 5-year intervals using df.set_index("date").resample("5YE").mean() and plot the result.

# Exercise 11