# Execute once for installing pandas
# !pip3 install pandas
import pandas as pd
import numpy as np
7 Loading and handling data
In this chapter we learn how to load data from various sources and how to handle tabular data using pandas. Before we start, we describe what a dataframe is, since it is the building block of many data science routines.
7.1 Dataframes (pandas)
pandas is the standard library for handling tabular data in Python. Its central data structure is the DataFrame, which is a two-dimensional, tabular data structure — essentially a table with rows and columns. Each column has a name and a data type (e.g., integers, floats, strings), and each row represents one observation or record. You can think of a dataframe as an in-memory spreadsheet that comes with powerful methods for selecting, filtering, transforming, and aggregating data. A single column of a DataFrame is a Series — a one-dimensional array with an index. A Series behaves like a list but supports vectorized operations (e.g., series + 1 adds 1 to every element) and has many built-in methods such as .mean(), .sum(), or .value_counts().
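The Series behaviour just described can be sketched in a few lines (the values are invented for illustration):

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2])
print(s + 1)             # vectorized: adds 1 to every element
print(s.mean())          # arithmetic mean of all elements
print(s.value_counts())  # frequency of each distinct value
```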
Here are the most important routines, where df is a DataFrame:
- pd.DataFrame(dict): create a DataFrame from a dictionary (keys become column names, values must be lists of the same length).
- df["col"]: select a single column (returns a Series, which is similar to a list).
- df[["col1", "col2"]]: select multiple columns.
- df.iloc[i] or df.iloc[i:j]: select row(s) i (up to j-1) by position.
- df.loc[i, "col"] or df.loc[i:j, ["col1", "col2"]]: select row(s) i (up to and including j — unlike iloc, loc slices include the end label) for column "col" (resp. "col1", "col2").
- df[df["col"] > value]: filter rows by a condition (here df["col"] > value).
- df.head(n) / df.tail(n): first / last n rows.
- df.describe(): summary statistics (count, mean, std, min, quartiles, max) for all numeric columns.
- df["col"].count(), df["col"].mean(), df["col"].std(), df["col"].min(), df["col"].max(): individual summary statistics for a column.
- df["col"].median(): median of a column.
- df["col"].quantile(q): the q-th quantile (e.g., q=0.25 for the first quartile).
- df["col"].sum(): sum of all values in a column.
- df.sort_values("col"): sort by a column.
- df.groupby("col").mean(): group by a column and compute means.
- df.isna(): boolean mask of missing values.
- df.dropna(): drop rows with missing values.
- df.fillna(value): fill missing values.
- df.drop(columns=["col"]): remove a column.
- df.to_numpy(): convert to a NumPy array (column names are lost).
df = pd.DataFrame({
"age": [22, 27, 31, 45, 38, 29],
"income": [42000, 51000, 64000, 83000, 72000, 56000],
"hours_studied": [10, 12, 7, 4, 6, 9],
"passed": [1, 1, 0, 0, 1, 1]
})
df
| | age | income | hours_studied | passed |
|---|---|---|---|---|
| 0 | 22 | 42000 | 10 | 1 |
| 1 | 27 | 51000 | 12 | 1 |
| 2 | 31 | 64000 | 7 | 0 |
| 3 | 45 | 83000 | 4 | 0 |
| 4 | 38 | 72000 | 6 | 1 |
| 5 | 29 | 56000 | 9 | 1 |
# selecting columns
print(df["age"])
0 22
1 27
2 31
3 45
4 38
5 29
Name: age, dtype: int64
print(df[["age", "income"]])
 age income
0 22 42000
1 27 51000
2 31 64000
3 45 83000
4 38 72000
5 29 56000
# selecting rows
print(df.iloc[0])
age 22
income 42000
hours_studied 10
passed 1
Name: 0, dtype: int64
print(df.loc[0:2, ["age", "passed"]])
 age passed
0 22 1
1 27 1
2 31 0
# filtering
filtered = df[df["income"] > 55000]
filtered
| | age | income | hours_studied | passed |
|---|---|---|---|---|
| 2 | 31 | 64000 | 7 | 0 |
| 3 | 45 | 83000 | 4 | 0 |
| 4 | 38 | 72000 | 6 | 1 |
| 5 | 29 | 56000 | 9 | 1 |
# summary statistics
print(df.describe())
 age income hours_studied passed
count 6.000000 6.000000 6.000000 6.000000
mean 32.000000 61333.333333 8.000000 0.666667
std 8.246211 14827.901627 2.898275 0.516398
min 22.000000 42000.000000 4.000000 0.000000
25% 27.500000 52250.000000 6.250000 0.250000
50% 30.000000 60000.000000 8.000000 1.000000
75% 36.250000 70000.000000 9.750000 1.000000
max 45.000000 83000.000000 12.000000 1.000000
# individual statistics for a column
print("mean income:", df["income"].mean())
print("std income:", df["income"].std())
print("median income:", df["income"].median())
print("25th percentile:", df["income"].quantile(0.25))
print("total hours studied:", df["hours_studied"].sum())
mean income: 61333.333333333336
std income: 14827.901627225163
median income: 60000.0
25th percentile: 52250.0
total hours studied: 48
# sorting
df_sorted = df.sort_values("income", ascending=False)
print(df_sorted)
 age income hours_studied passed
3 45 83000 4 0
4 38 72000 6 1
2 31 64000 7 0
5 29 56000 9 1
1 27 51000 12 1
0 22 42000 10 1
# groupby — numeric_only=True restricts the computation to numeric columns,
# ignoring string columns (which cannot be averaged)
grouped = df.groupby("passed").mean(numeric_only=True)
print(grouped)
 age income hours_studied
passed
0 38.0 73500.0 5.50
1 29.0 55250.0 9.25
# missing values
df_missing = df.copy()
df_missing.loc[2, "income"] = np.nan
df_missing.loc[4, "hours_studied"] = np.nan
print(df_missing)
print()
print(df_missing.isna().sum())
print()
print(df_missing.dropna())
 age income hours_studied passed
0 22 42000.0 10.0 1
1 27 51000.0 12.0 1
2 31 NaN 7.0 0
3 45 83000.0 4.0 0
4 38 72000.0 NaN 1
5 29 56000.0 9.0 1
age 0
income 1
hours_studied 1
passed 0
dtype: int64
age income hours_studied passed
0 22 42000.0 10.0 1
1 27 51000.0 12.0 1
3 45 83000.0 4.0 0
5 29 56000.0 9.0 1
# fill missing values with column means
df_filled = df_missing.fillna(df_missing.mean(numeric_only=True))
print(df_filled)
 age income hours_studied passed
0 22 42000.0 10.0 1
1 27 51000.0 12.0 1
2 31 60800.0 7.0 0
3 45 83000.0 4.0 0
4 38 72000.0 8.4 1
5 29 56000.0 9.0 1
# from pandas to numpy
clean = df_missing.dropna()
X = clean[["age", "income", "hours_studied"]].to_numpy(dtype=float)
y = clean["passed"].to_numpy(dtype=float)
print("X.shape =", X.shape)
print("y =", y)
X.shape = (4, 3)
y = [1. 1. 0. 1.]
7.2 Adding and renaming columns (assign, rename)
Adding new columns and renaming existing ones are among the most frequent pandas operations.
- df["new_col"] = values: add a new column (or overwrite an existing one) by direct assignment.
- df.assign(col=expr): return a new DataFrame with an additional column (does not modify the original).
- df.rename(columns={"old": "new"}): rename one or more columns.
- df.insert(loc, column, value): insert a column at a specific position.
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"weight_kg": [62, 85, 74],
"height_m": [1.65, 1.80, 1.75]
})
df
| | name | weight_kg | height_m |
|---|---|---|---|
| 0 | Alice | 62 | 1.65 |
| 1 | Bob | 85 | 1.80 |
| 2 | Charlie | 74 | 1.75 |
# adding a computed column
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df
| | name | weight_kg | height_m | bmi |
|---|---|---|---|---|
| 0 | Alice | 62 | 1.65 | 22.773186 |
| 1 | Bob | 85 | 1.80 | 26.234568 |
| 2 | Charlie | 74 | 1.75 | 24.163265 |
# assign returns a new DataFrame, leaving df unchanged
df2 = df.assign(height_cm=df["height_m"] * 100)
df2
| | name | weight_kg | height_m | bmi | height_cm |
|---|---|---|---|---|---|
| 0 | Alice | 62 | 1.65 | 22.773186 | 165.0 |
| 1 | Bob | 85 | 1.80 | 26.234568 | 180.0 |
| 2 | Charlie | 74 | 1.75 | 24.163265 | 175.0 |
# renaming columns
df_renamed = df.rename(columns={"weight_kg": "weight", "height_m": "height"})
df_renamed
| | name | weight | height | bmi |
|---|---|---|---|---|
| 0 | Alice | 62 | 1.65 | 22.773186 |
| 1 | Bob | 85 | 1.80 | 26.234568 |
| 2 | Charlie | 74 | 1.75 | 24.163265 |
Note that direct assignment (df["col"] = ...) modifies the DataFrame in place, while assign and rename return new DataFrames by default.
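Among the routines above, insert is the only one not demonstrated so far. A minimal sketch (data invented for the example) — note that, like direct assignment, it modifies the DataFrame in place:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "height_m": [1.65, 1.80]})
# insert a new column at position 1, i.e. between name and height_m
df.insert(1, "weight_kg", [62, 85])
print(df.columns.tolist())
```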
7.3 Merging and joining DataFrames (merge, concat)
Assume you have two DataFrames that share one column (called key below; the join works best when its entries are unique). Then pd.merge joins the two DataFrames into one, using the shared column as anchor. Note, however, that a given key may appear only in the left DataFrame, only in the right one, or in both. This leads to the four modes in which pd.merge(left, right, on="key", how="left/right/inner/outer") operates:
| how | meaning |
|---|---|
| "left" | keep all rows from left |
| "right" | keep all rows from right |
| "inner" | keep only joint keys |
| "outer" | keep all keys from both DataFrames |
In contrast, in pd.concat, you have two DataFrames which do not need to share a column, and you simply stack them vertically or horizontally.
- pd.merge(left, right, on="key"): inner join on a shared column key.
- pd.merge(left, right, on="key", how="left"): left join (keep all rows from the left DataFrame). Other options: "right", "inner", "outer".
- pd.concat([df1, df2]): stack DataFrames vertically (row-wise).
- pd.concat([df1, df2], axis=1): stack DataFrames horizontally (column-wise).
students = pd.DataFrame({
"student_id": [1, 2, 3, 4],
"name": ["Alice", "Bob", "Charlie", "Diana"]
})
grades = pd.DataFrame({
"student_id": [1, 2, 3, 5],
"grade": [1.3, 2.0, 1.7, 2.7]
})
print(students)
print()
print(grades)
 student_id name
0 1 Alice
1 2 Bob
2 3 Charlie
3 4 Diana
student_id grade
0 1 1.3
1 2 2.0
2 3 1.7
3 5 2.7
# inner join: only students present in both tables
merged = pd.merge(students, grades, on="student_id")
# same as merged = pd.merge(students, grades, on="student_id", how="inner")
print(merged)
 student_id name grade
0 1 Alice 1.3
1 2 Bob 2.0
2 3 Charlie 1.7
# left join: keep all students, NaN where no grade exists
merged_left = pd.merge(students, grades, on="student_id", how="left")
print(merged_left)
 student_id name grade
0 1 Alice 1.3
1 2 Bob 2.0
2 3 Charlie 1.7
3 4 Diana NaN
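For completeness, an outer join on the same two tables keeps every student and every grade, filling both directions with NaN (self-contained sketch repeating the data from above):

```python
import pandas as pd

students = pd.DataFrame({"student_id": [1, 2, 3, 4],
                         "name": ["Alice", "Bob", "Charlie", "Diana"]})
grades = pd.DataFrame({"student_id": [1, 2, 3, 5],
                       "grade": [1.3, 2.0, 1.7, 2.7]})
# outer join: keep all keys from both tables; unmatched cells become NaN
merged_outer = pd.merge(students, grades, on="student_id", how="outer")
print(merged_outer)
```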
# concatenating vertically
df_a = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})
df_b = pd.DataFrame({"name": ["Charlie", "Diana"], "score": [78, 91]})
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
 name score
0 Alice 85
1 Bob 92
2 Charlie 78
3 Diana 91
When concatenating vertically, use ignore_index=True to reset the index; otherwise both DataFrames keep their original indices, which may lead to duplicate index values.
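Horizontal concatenation (axis=1) was listed above but not shown; a minimal sketch with invented data:

```python
import pandas as pd

names = pd.DataFrame({"name": ["Alice", "Bob"]})
scores = pd.DataFrame({"score": [85, 92]})
# axis=1 places the DataFrames side by side; rows are matched by index,
# so mismatched indices would again produce NaN cells
side_by_side = pd.concat([names, scores], axis=1)
print(side_by_side)
```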
7.4 Applying custom functions (apply, map)
Built-in methods like .mean() or .sum() cover many cases, but sometimes you need to apply an arbitrary Python function to each row, column, or element.
- df.apply(func): apply func to each column (default) or each row (axis=1).
- series.map(func): apply func element-wise to a Series. (This is usually applied to a column of a DataFrame.)
- df.map(func): apply func element-wise to every cell of a DataFrame.
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"score": [85, 92, 78],
"bonus": [5, 3, 8]
})
df
| | name | score | bonus |
|---|---|---|---|
| 0 | Alice | 85 | 5 |
| 1 | Bob | 92 | 3 |
| 2 | Charlie | 78 | 8 |
# apply a function to each column (default axis=0)
print(df[["score", "bonus"]].apply(np.mean))
score 85.000000
bonus 5.333333
dtype: float64
# apply a function row-wise (axis=1)
df["total"] = df.apply(lambda row: row["score"] + row["bonus"], axis=1)
df
| | name | score | bonus | total |
|---|---|---|---|---|
| 0 | Alice | 85 | 5 | 90 |
| 1 | Bob | 92 | 3 | 95 |
| 2 | Charlie | 78 | 8 | 86 |
# map: element-wise transformation of a Series
df["grade"] = df["total"].map(lambda x: "pass" if x >= 85 else "fail")
df
| | name | score | bonus | total | grade |
|---|---|---|---|---|---|
| 0 | Alice | 85 | 5 | 90 | pass |
| 1 | Bob | 92 | 3 | 95 | pass |
| 2 | Charlie | 78 | 8 | 86 | pass |
Use apply with axis=1 when the computation depends on multiple columns in the same row. For simple element-wise transformations on a single column, map is more readable. For best performance on large DataFrames, prefer vectorized operations (e.g., df["score"] + df["bonus"]) over apply whenever possible.
7.5 Iterating through a DataFrame (iterrows, itertuples)
Sometimes you need to loop over the rows of a DataFrame, for example to perform row-wise logic that is hard to express as a vectorized operation.
- df.iterrows(): iterate over rows as (index, Series) pairs.
- df.itertuples(): iterate over rows as named tuples (faster than iterrows).
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"score": [85, 92, 78],
"bonus": [5, 3, 8]
})
# iterrows: each row is a Series
for idx, row in df.iterrows():
    print(f"Row {idx}: {row['name']} scored {row['score'] + row['bonus']} in total")
Row 0: Alice scored 90 in total
Row 1: Bob scored 95 in total
Row 2: Charlie scored 86 in total
# itertuples: each row is a named tuple (faster)
for row in df.itertuples():
    print(f"{row.name}: score={row.score}, bonus={row.bonus}")
Alice: score=85, bonus=5
Bob: score=92, bonus=3
Charlie: score=78, bonus=8
Prefer itertuples over iterrows when performance matters, since it avoids creating a Series for each row. However, for most tasks, vectorized operations (e.g., df["score"] + df["bonus"]) or apply are faster than explicit loops and should be preferred whenever possible.
7.6 Value counts and unique values (value_counts, unique)
When exploring a dataset, you often want to know which values occur in a column and how frequently. This is especially useful for categorical data.
- df["col"].value_counts(): count how often each value appears, sorted by frequency. (df["col"].value_counts(normalize=True) returns relative frequencies instead of absolute counts.)
- df["col"].unique(): return an array of all distinct values.
- df["col"].nunique(): return the number of distinct values.
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"],
"major": ["Math", "CS", "Math", "CS", "Physics", "Math"],
"level": ["BSc", "MSc", "MSc", "BSc", "BSc", "MSc"]
})
df
| | name | major | level |
|---|---|---|---|
| 0 | Alice | Math | BSc |
| 1 | Bob | CS | MSc |
| 2 | Charlie | Math | MSc |
| 3 | Diana | CS | BSc |
| 4 | Eve | Physics | BSc |
| 5 | Frank | Math | MSc |
# how often does each major appear?
print(df["major"].value_counts())
major
Math 3
CS 2
Physics 1
Name: count, dtype: int64
# relative frequencies
print(df["major"].value_counts(normalize=True))
major
Math 0.500000
CS 0.333333
Physics 0.166667
Name: proportion, dtype: float64
# unique values and their count
print("Unique majors:", df["major"].unique())
print("Number of majors:", df["major"].nunique())
Unique majors: <StringArray>
['Math', 'CS', 'Physics']
Length: 3, dtype: str
Number of majors: 3
value_counts() is one of the first things to try when getting to know a new dataset. For numeric columns, consider using df["col"].describe() instead.
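For numeric data, value_counts also accepts a bins parameter that groups the values into equal-width intervals before counting — a middle ground between raw counts and describe(). A sketch with invented ages:

```python
import pandas as pd

ages = pd.Series([22, 27, 31, 45, 38, 29])
# bins=2 splits the value range into 2 equal-width intervals and counts
# how many values fall into each; sort=False keeps the bins in order
print(ages.value_counts(bins=2, sort=False))
```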
7.7 Handling duplicates (duplicated, drop_duplicates)
Real-world data often contains duplicate rows — for example from repeated data imports or logging errors. pandas makes it easy to detect and remove them.
- df.duplicated(): return a boolean Series that is True for duplicate rows.
- df.duplicated(subset=["col1", "col2"]): check for duplicates based on specific columns only.
- df.drop_duplicates(): remove duplicate rows (keeps the first occurrence by default).
- df.drop_duplicates(subset=["col"], keep="last"): keep the last occurrence instead.
df = pd.DataFrame({
"name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
"score": [85, 92, 85, 78, 95]
})
df
| | name | score |
|---|---|---|
| 0 | Alice | 85 |
| 1 | Bob | 92 |
| 2 | Alice | 85 |
| 3 | Charlie | 78 |
| 4 | Bob | 95 |
# which rows are duplicates?
print(df.duplicated())
0 False
1 False
2 True
3 False
4 False
dtype: bool
# remove exact duplicate rows
print(df.drop_duplicates())
 name score
0 Alice 85
1 Bob 92
3 Charlie 78
4 Bob 95
# duplicates based on a subset of columns
print(df.duplicated(subset=["name"]))
0 False
1 False
2 True
3 False
4 True
dtype: bool
# keep only the last entry per name
print(df.drop_duplicates(subset=["name"], keep="last"))
 name score
2 Alice 85
3 Charlie 78
4 Bob 95
Note that drop_duplicates() returns a new DataFrame by default. To modify in place, pass inplace=True. When checking duplicates on a subset of columns, the first (or last) complete row is kept, not just the key columns.
7.8 Loading data from files (csv, xlsx, json)
Data often comes as files. The most common formats are CSV (comma-separated values), Excel (.xlsx), and JSON (JavaScript Object Notation). A CSV file is like an Excel table, where contents of cells are separated by a comma (or some other delimiter). A JSON file has the same structure as a Python dictionary. We use the following example files throughout this section: person.csv, person.json, and person.db. These contain data on 50 fictional persons with names, addresses, dates of birth, and hobbies.
- pd.read_csv(path): reads a CSV file into a DataFrame.
- df.to_csv(path, index=False): writes a DataFrame to a CSV file.
- pd.read_json(path): reads a JSON file into a DataFrame. (So the JSON file is interpreted like a dict passed to pd.DataFrame(dict).)
- df.to_json(orient="records"): converts a DataFrame to a JSON string. The orient parameter controls the format; "records" produces a list of dictionaries.
- pd.read_excel(path): reads an Excel file into a DataFrame (requires the openpyxl package).
# Reading a CSV file
df_csv = pd.read_csv("misc/person.csv")
df_csv.head()
| | first_name | last_name | street | town | zip | country | date_of_birth | hobbies |
|---|---|---|---|---|---|---|---|---|
| 0 | André | Pieters | Paseo Chucho Roca 1 | Rey | 86379 | Kingdom of the Netherlands | 1993-01-12 | reading |
| 1 | Ilaria | Jäggi | Veerlepad 1 | Naters | 5078 | France | 1950-03-12 | chess;writing;photography |
| 2 | Melissa | Pollak | Urbanización María Teresa Mancebo 4 | Lopesdan | 3553 | Kingdom of the Netherlands | 1983-08-09 | skiing |
| 3 | Vittorio | Cárdenas | Rua de Sousa, 40 | Lobenstein | 4395PF | Switzerland | 1954-03-25 | baking |
| 4 | Gioffre | van Dam | Eichenbergerstrasse 1 | Techer | 07045 | France | 1969-04-09 | cycling;reading;hiking;swimming |
# The hobbies column is a semicolon-separated string — we can inspect it
print(df_csv["hobbies"].head())
0 reading
1 chess;writing;photography
2 skiing
3 baking
4 cycling;reading;hiking;swimming
Name: hobbies, dtype: str
# Reading a JSON file
df_json = pd.read_json("misc/person.json")
df_json.head()
| | first_name | last_name | street | town | zip | country | date_of_birth | hobbies |
|---|---|---|---|---|---|---|---|---|
| 0 | André | Pieters | Paseo Chucho Roca 1 | Rey | 86379 | Kingdom of the Netherlands | 1993-01-12 | [reading] |
| 1 | Ilaria | Jäggi | Veerlepad 1 | Naters | 5078 | France | 1950-03-12 | [chess, writing, photography] |
| 2 | Melissa | Pollak | Urbanización María Teresa Mancebo 4 | Lopesdan | 3553 | Kingdom of the Netherlands | 1983-08-09 | [skiing] |
| 3 | Vittorio | Cárdenas | Rua de Sousa, 40 | Lobenstein | 4395PF | Switzerland | 1954-03-25 | [baking] |
| 4 | Gioffre | van Dam | Eichenbergerstrasse 1 | Techer | 07045 | France | 1969-04-09 | [cycling, reading, hiking, swimming] |
# In the JSON version, hobbies is a list
print(df_json["hobbies"].head())
0 [reading]
1 [chess, writing, photography]
2 [skiing]
3 [baking]
4 [cycling, reading, hiking, swimming]
Name: hobbies, dtype: object
# Converting a DataFrame back to JSON
print(df_csv.head(3).to_json(orient="records", indent=2))
[
{
"first_name":"Andr\u00e9",
"last_name":"Pieters",
"street":"Paseo Chucho Roca 1",
"town":"Rey",
"zip":"86379",
"country":"Kingdom of the Netherlands",
"date_of_birth":"1993-01-12",
"hobbies":"reading"
},
{
"first_name":"Ilaria",
"last_name":"J\u00e4ggi",
"street":"Veerlepad 1",
"town":"Naters",
"zip":"5078",
"country":"France",
"date_of_birth":"1950-03-12",
"hobbies":"chess;writing;photography"
},
{
"first_name":"Melissa",
"last_name":"Pollak",
"street":"Urbanizaci\u00f3n Mar\u00eda Teresa Mancebo 4",
"town":"Lopesdan",
"zip":"3553",
"country":"Kingdom of the Netherlands",
"date_of_birth":"1983-08-09",
"hobbies":"skiing"
}
]
Note how CSV and JSON handle the hobbies column differently: CSV stores it as a flat string (here semicolon-separated), while JSON preserves it as a list.
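Writing a DataFrame back to disk with to_csv completes the round trip; a minimal sketch (the filename scores.csv and the data are invented for the example):

```python
import pandas as pd

df_out = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})
# index=False prevents pandas from writing the row index as an extra column
df_out.to_csv("scores.csv", index=False)
df_back = pd.read_csv("scores.csv")
print(df_back)
```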
7.9 Loading data from online resources (requests)
The requests library allows downloading data from the internet.
- requests.get(url): sends a GET request (i.e. tries to download a website) and returns a response object.
- response.status_code: HTTP status code (200 means success).
- response.text: the response body as a string.
- response.json(): parse the response body as JSON.
import requests
url = "https://jsonplaceholder.typicode.com/todos/1"
response = requests.get(url)
print("status code:", response.status_code)
print("content:", response.json())
status code: 200
content: {'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}
As a concrete example, the World Bank Indicators API provides demographic and economic data. The API offers thousands of indicators; you can search them programmatically:
# search for population-related indicators
url = "https://api.worldbank.org/v2/indicator?format=json&per_page=20000"
indicators = requests.get(url).json()[1]
df_ind = pd.DataFrame(indicators)
df_ind[df_ind["name"].str.contains("population", case=False)][["id", "name"]].head()
| | id | name |
|---|---|---|
| 24 | 1.1_ACCESS.ELECTRICITY.TOT | Access to electricity (% of total population) |
| 39 | 1.2_ACCESS.ELECTRICITY.RURAL | Access to electricity (% of rural population) |
| 40 | 1.3_ACCESS.ELECTRICITY.URBAN | Access to electricity (% of urban population) |
| 161 | 2.1_ACCESS.CFT.TOT | Access to Clean Fuels and Technologies for coo... |
| 1551 | allsa.cov_pop | Coverage of social safety net programs (% of p... |
Here we fetch total population data (SP.POP.TOTL) for all countries and display the ten most populous ones. The API returns a list with two elements: metadata (page info) at index 0 and the actual data records at index 1. Since the response also includes regional aggregates (like “Arab World” or “Euro area”), we first fetch country metadata to identify and exclude them.
# step 1: find out which codes are regional aggregates (not actual countries)
meta = requests.get(
"https://api.worldbank.org/v2/country",
params={"format": "json", "per_page": 300}
).json()[1]
aggregates = {e["iso2Code"] for e in meta if e["region"]["id"] == "NA"}
# step 2: fetch population data and keep only actual countries
response = requests.get(
"https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL",
params={"date": "2023", "format": "json", "per_page": 300}
)
data = response.json()[1] # index 0 is metadata
df_pop = pd.DataFrame([{
"country": entry["country"]["value"],
"population": entry["value"]
} for entry in data
if entry["country"]["id"] not in aggregates and entry["value"] is not None])
df_pop.sort_values("population", ascending=False).head(10)
| | country | population |
|---|---|---|
| 89 | India | 1438069596 |
| 41 | China | 1410710000 |
| 206 | United States | 336806231 |
| 90 | Indonesia | 281190067 |
| 149 | Pakistan | 247504495 |
| 144 | Nigeria | 227882945 |
| 26 | Brazil | 211140729 |
| 15 | Bangladesh | 171466990 |
| 161 | Russian Federation | 143826130 |
| 127 | Mexico | 129739759 |
7.10 Working with dates and times (to_datetime, Timestamp)
pandas can parse and work with dates. We use the person dataset from Section 7.8 to illustrate this.
- pd.to_datetime(...): convert strings to datetime. The format codes are the same as for the datetime module, see Section 5.3.
- series.dt.year / .dt.month / .dt.day_of_week: extract date components.
- pd.Timestamp(...): create a single datetime value.
- dt.strftime(format): format a datetime as a string. The format codes (e.g., %Y, %m, %d, %H, %M) are the same as for datetime.datetime.strftime from the datetime module covered in Section 5.3.
df = pd.read_csv("misc/person.csv")
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])
df["date_of_birth"].head()
0 1993-01-12
1 1950-03-12
2 1983-08-09
3 1954-03-25
4 1969-04-09
Name: date_of_birth, dtype: datetime64[us]
# extract year and month of birth
df["birth_year"] = df["date_of_birth"].dt.year
df["birth_month"] = df["date_of_birth"].dt.month
df[["first_name", "last_name", "birth_year", "birth_month"]].head()
| | first_name | last_name | birth_year | birth_month |
|---|---|---|---|---|
| 0 | André | Pieters | 1993 | 1 |
| 1 | Ilaria | Jäggi | 1950 | 3 |
| 2 | Melissa | Pollak | 1983 | 8 |
| 3 | Vittorio | Cárdenas | 1954 | 3 |
| 4 | Gioffre | van Dam | 1969 | 4 |
# compute age in completed years
today = pd.Timestamp.now()
print(today)
# Subtracting two datetimes gives a timedelta; .dt.days extracts the number of days from it
df["age"] = (today - df["date_of_birth"]).dt.days // 365
df[["first_name", "last_name", "date_of_birth", "age"]].head()
2026-04-06 13:42:31.398160
| | first_name | last_name | date_of_birth | age |
|---|---|---|---|---|
| 0 | André | Pieters | 1993-01-12 | 33 |
| 1 | Ilaria | Jäggi | 1950-03-12 | 76 |
| 2 | Melissa | Pollak | 1983-08-09 | 42 |
| 3 | Vittorio | Cárdenas | 1954-03-25 | 72 |
| 4 | Gioffre | van Dam | 1969-04-09 | 57 |
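Dividing the day count by 365 ignores leap years and can be off by one near a birthday. A sketch of an exact completed-years computation, comparing month and day directly (dates invented, and the reference date hard-coded so the result is reproducible):

```python
import pandas as pd

dob = pd.to_datetime(pd.Series(["1993-01-12", "1950-03-12", "1990-12-01"]))
today = pd.Timestamp("2026-04-06")  # fixed date for reproducibility
# the birthday has passed this year if the birth month is earlier,
# or it is the same month and the day has been reached
birthday_passed = (dob.dt.month < today.month) | (
    (dob.dt.month == today.month) & (dob.dt.day <= today.day))
age = today.year - dob.dt.year - (~birthday_passed).astype(int)
print(age.tolist())
```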
# format dates as strings
df["dob_formatted"] = df["date_of_birth"].dt.strftime("%-d %B %Y")  # %-d drops the leading zero (Linux/macOS; on Windows use %#d)
df[["first_name", "last_name", "dob_formatted"]].head()
| | first_name | last_name | dob_formatted |
|---|---|---|---|
| 0 | André | Pieters | 12 January 1993 |
| 1 | Ilaria | Jäggi | 12 March 1950 |
| 2 | Melissa | Pollak | 9 August 1983 |
| 3 | Vittorio | Cárdenas | 25 March 1954 |
| 4 | Gioffre | van Dam | 9 April 1969 |
7.11 SQL databases (sqlite3)
Python has built-in support for SQLite databases via the sqlite3 module. This is useful for working with structured data that lives in a database.
- sqlite3.connect(path): connect to a database (use ":memory:" for an in-memory database).
- cursor.execute(sql): execute an SQL statement.
- cursor.fetchall(): fetch all results.
- pd.read_sql_query(sql, conn): run an SQL query and return the result as a DataFrame.
- df.to_sql("table_name", conn, index=False): write a DataFrame into an SQL table.
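The last two routines round-trip nicely; a minimal sketch using an in-memory database (the table name scores and the data are invented for the example):

```python
import sqlite3

import pandas as pd

df_scores = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})
conn = sqlite3.connect(":memory:")             # throwaway in-memory database
df_scores.to_sql("scores", conn, index=False)  # write the DataFrame as a table
result = pd.read_sql_query("SELECT name FROM scores WHERE score > 90", conn)
print(result)
conn.close()
```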
import sqlite3
# connect to the person SQLite database
conn = sqlite3.connect("misc/person.db")
# querying with raw SQL
cursor = conn.cursor()
cursor.execute("SELECT first_name, last_name, country FROM persons WHERE country = 'Germany'")
for row in cursor.fetchall():
    print(row)
('Natalie', 'Burtscher', 'Germany')
('Katherine', 'Gehringer', 'Germany')
('Raimondo', 'van der Heijden', 'Germany')
('Idriz', 'Kitzmann', 'Germany')
('Denis', 'Egli', 'Germany')
('Ekaterina', 'Ledoux', 'Germany')
('Isaac', 'Käfer', 'Germany')
# reading SQL results directly into a pandas DataFrame
df_french = pd.read_sql_query(
"SELECT first_name, last_name, town FROM persons WHERE country = 'France'",
conn
)
df_french
| | first_name | last_name | town |
|---|---|---|---|
| 0 | Ilaria | Jäggi | Naters |
| 1 | Gioffre | van Dam | Techer |
| 2 | Filipa | Backer | Nördlingen |
| 3 | Leyre | de Vroege | Rudolstadt |
| 4 | Tygo | Seiwald | Benoit-la-Forêt |
| 5 | Roswitha | Göschl | Alcara Li Fusi |
| 6 | Tito | Mader | Groß-Enzersdorf |
# aggregation in SQL
df_counts = pd.read_sql_query(
"SELECT country, COUNT(*) AS n_persons FROM persons GROUP BY country ORDER BY n_persons DESC",
conn
)
df_counts
| | country | n_persons |
|---|---|---|
| 0 | Switzerland | 12 |
| 1 | Kingdom of the Netherlands | 10 |
| 2 | Germany | 7 |
| 3 | France | 7 |
| 4 | Austria | 5 |
| 5 | Portugal | 4 |
| 6 | Italy | 3 |
| 7 | Spain | 2 |
conn.close()
7.12 Exercises
All exercises use the person dataset from Section 7.8. Start by loading it:
df = pd.read_csv("misc/person.csv")
Exercise 1 Add a column age to df that contains each person’s age in completed years (as an integer). Hint: use pd.to_datetime to convert date_of_birth, then compute the difference to today’s date. Then use map to add a column age_group that maps each age to "young" (under 30), "middle" (30–59), or "senior" (60+).
# Exercise 1
Exercise 2 Add a column birth_year extracted from date_of_birth. Then use df.describe() and df["birth_year"].value_counts() to explore the distribution of birth years.
# Exercise 2
Exercise 3 Using the generations dictionary from Exercise 6, write a function get_generation(year) that returns the generation name (or "Other"). Add a column generation to df using this function. Then write a function filter_generation(df, gen_name) that returns only the rows belonging to a given generation. Use it to display all millennials. How many are there?
# Exercise 3
Exercise 4 Consider the following two DataFrames where the city column contains duplicate values:
shops = pd.DataFrame({
"city": ["Berlin", "Berlin", "Paris", "Paris"],
"shop": ["BookWorld", "TechStore", "Café Lune", "ModeParis"]
})
deliveries = pd.DataFrame({
"city": ["Berlin", "Berlin", "Paris"],
"item": ["books", "laptops", "croissants"]
})
What does pd.merge(shops, deliveries, on="city") return? Run it and explain the result. How many rows does the output have, and why?
# Exercise 4
Exercise 5 Create a DataFrame with columns name, exam1, exam2, exam3 for 10 students (you can invent the data). Add a column average that contains the mean of the three exams. Filter to show only students with an average above 70.
# Exercise 5
Exercise 6 The following dictionary defines generational cohorts by birth year ranges:
generations = {
"Silent Generation": (1928, 1945),
"Baby Boomer": (1946, 1964),
"Generation X": (1965, 1980),
"Millennial": (1981, 1996),
"Generation Z": (1997, 2012),
}
Write a function get_generation(year) that takes a birth year and returns the generation name (or "Other" if the year does not fall into any range). Then use map or apply to add a column generation to df. How many persons belong to each generation?
# Exercise 6
Exercise 7 The hobbies column in the CSV is a semicolon-separated string. The .str accessor on a pandas Series gives access to string methods, so df["hobbies"].str.split(";") splits each entry into a Python list. Use this to split the hobbies. Then find out: what is the most common hobby across all persons? Hint: series.explode() turns a Series of lists into individual rows (one row per list element); then use value_counts.
# Exercise 7
Exercise 8 Download a CSV file from the internet using pd.read_csv(url) (e.g. from this collection). Use describe(), groupby(), and a histogram to explore the data. Write a short summary of what you find.
# Exercise 8
Exercise 9 Create a DataFrame with some missing values (np.nan). Compare three strategies for handling them: (a) drop rows with any missing value, (b) fill with the column mean, (c) fill with the column median. How do the summary statistics (describe()) change in each case?
# Exercise 9
Exercise 10 Write a function csv_to_sqlite(csv_path, db_path, table_name) that reads a CSV file into a pandas DataFrame and writes it into an SQLite database. Then write a function query_db(db_path, sql) that runs an SQL query and returns the result as a DataFrame. Test with a CSV file of your choice.
# Exercise 10
Exercise 11 Fetch Germany’s total population from the World Bank API for the years 1960–2023 (see Section 7.9 for how to use the API; the country code is DEU, the indicator is SP.POP.TOTL). Convert the year column to a proper datetime using pd.to_datetime(df["year"], format="%Y"). Then: (a) Add a column decade that contains the decade of each year (e.g. 1960, 1970, …). Hint: use integer division // 10 * 10. (b) Compute the mean population per decade using groupby. (c) Add a column growth that contains the year-over-year population change. Hint: df["population"].diff(). (d) In which year was the population growth the highest? In which year was it negative for the first time? (e) Resample the data to 5-year intervals using df.set_index("date").resample("5YE").mean() and plot the result.
# Exercise 11