# Uncomment the next line and run the cell for installation.
# This only needs to be done once for every package.
# If pip3 does not work, use pip instead.
# !pip3 install numpy

5 Some useful libraries/packages/modules
In this chapter, we take a little break. Although there is a lot to learn, the material covered here is meant to give you an overview of the Python ecosystem and what is possible, rather than to teach the details of the various libraries. The first section, however, is central and will be needed in any Python project you start.
5.1 Loading modules (import)
As in all programming languages, not all Python programs come in a single file. If a Python project spans multiple files, an import becomes necessary. In Python, there is syntax for imports as follows:
| Syntax | Meaning |
|---|---|
| import module | full module |
| import module as name | alias |
| from module import x | specific objects |
| from module import * | everything (avoid) |
| import module.submodule | submodule |
Generally, there are packages (folders), modules (a .py-file within a package, or a single py-file), and submodules. Every submodule is a module. In any case, the import loads the (sub-)module, and executes all top-level code inside it (i.e. everything which is not inside a function).
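As a small illustration of the import forms listed in the table, the following sketch applies each of them to modules from the standard library (math and os.path are chosen here only as examples; they are not prescribed by the text):

```python
# Each import form from the table, applied to standard-library modules.
import math              # full module: names need the prefix "math."
import math as m         # alias: the same module under a shorter name
from math import sqrt    # specific object: usable without a prefix
import os.path           # submodule: reachable as os.path

print(math.sqrt(16))                        # 4.0
print(m.sqrt(16))                           # 4.0
print(sqrt(16))                             # 4.0
print(os.path.basename("folder/one.py"))    # one.py

# The alias and the original name refer to the very same module object.
print(m is math)                            # True
```

Note that importing the same module several times is harmless: the module is loaded only once, and every further import just binds another name to it.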
There are three cases:
- You wrote some code in file one.py (e.g. some functions) which you want to use in file two.py. (Note, by the way, that the name of a module you want to import must not start with a digit.) Assume one.py is as follows:
# File: one.py
def myConst():
    return 42
Then, the import in two.py is either import one (both files must be in the same folder), or from one import myConst, or (not recommended) from one import *. In the first case, you must prefix myConst() with one.:
# File: two.py
import one
print(f"My constant is {one.myConst()}.")
In the second case, there is no prefix:
from one import myConst
# or
from one import *
print(f"My constant is {myConst()}.")
Observe that the latter option comes with the disadvantage that you cannot tell from the call myConst() whether it is defined in two.py or in one.py, so it becomes harder to keep track of your code as things become more complicated.
If you have a folder myPackage which holds one.py, you can write import myPackage, import myPackage.one or import myPackage.one as myName, with differing ways to use myConst():
import myPackage
# works only if myPackage/__init__.py itself imports one (e.g. via "from . import one")
print(f"My constant is {myPackage.one.myConst()}.")
import myPackage.one
print(f"My constant is {myPackage.one.myConst()}.")
import myPackage.one as myName
print(f"My constant is {myName.myConst()}.")
- You want to use functions from a built-in library. Then, a simple import module is often the best choice. We will see examples in Section 5.2.
- You want to use functions from a library which needs to be installed. Here, before import module even works, you need to install the module in your Python environment. In a terminal, within your Python environment, use
pip3 install numpy
(or pip install numpy if pip3 does not work), and then import numpy will work. We will see examples for this e.g. in Section 7.8.
5.2 Mathematics (math)
The mathematical library in Python is loaded with import math. Then, you have e.g. the following functions, which are mostly self-explanatory: math.sqrt(), math.exp(), math.log() (the natural logarithm), math.sin(), math.cos(), and constants like math.pi and math.e.
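A short sketch of how these functions behave in practice; the particular arguments are chosen only for illustration. Since floats are approximations, the last lines compare with a tolerance via math.isclose instead of ==:

```python
import math

root = math.sqrt(2)
print(root)                       # about 1.4142
print(math.log(math.exp(3.0)))    # log is the inverse of exp, so about 3
print(math.cos(0.0))              # 1.0

# sin(pi) should be 0, but math.pi is only an approximation of pi.
almost_zero = math.sin(math.pi)
print(almost_zero)                # about 1.2e-16, not exactly 0
print(almost_zero == 0.0)         # False
print(math.isclose(almost_zero, 0.0, abs_tol=1e-12))   # True
```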
5.3 Dates and times of day (datetime)
In data science, we often have to handle dates. One difficulty is that there is no single standard way to write them as strings. In Python, this and various other problems around dates and times are solved by the datetime library.
Submodules (in fact, these are classes in this case) of this package are
- date: handling dates
- time: handling times
- datetime.datetime: handling datetimes, i.e. dates and times together
- timedelta: time differences
- tzinfo and timezone: handling timezones
We will here use import datetime, at the cost that datetime-objects are only accessible by saying datetime.datetime (the first datetime being the package, the second datetime being the class/submodule).
Let us create a datetime-object:
dt = datetime.datetime(2026, 4, 21, 12, 15)
print(dt)
print(f"year: {dt.year}")
print(f"month: {dt.month}")
print(f"day: {dt.day}")
print(f"hour: {dt.hour}")
print(f"minutes: {dt.minute}")
print(f"second: {dt.second}")
print(f"microsecond: {dt.microsecond}")
print(f"date: {dt.date()}")
print(f"time: {dt.time()}")

2026-04-21 12:15:00
year: 2026
month: 4
day: 21
hour: 12
minutes: 15
second: 0
microsecond: 0
date: 2026-04-21
time: 12:15:00
This is the starting time of the first session in this course. As we see here, we can extract year, …, microsecond from the datetime-object.
Often, we are given strings such as "21.4.26, 12:15" (meaning 21 April 2026), and want to convert them into a datetime object. For this, we have datetime.datetime.strptime:
s = "21.4.26, 12:15"
dt = datetime.datetime.strptime(s, "%d.%m.%y, %H:%M")
print(s + f" is converted to {dt}.")

21.4.26, 12:15 is converted to 2026-04-21 12:15:00.
(The format codes must appear in the same order as day, month and year in the string; with "%y.%m.%d", the same string would be read as 26 April 2021. For the correct way to place %, y, m, d etc., it is in fact best to consult your favourite AI.)
Conversely, we may want to print a datetime-object in a certain way. For this, we have datetime.datetime.strftime:
print(dt.strftime("%-m/%-d/%y, %H:%M"))

4/21/26, 12:15
which is the US-style output. (The codes %-m and %-d, which drop the leading zero, are platform-dependent and may fail on Windows.)
Next, assume we want to compute the same time 12 weeks later. For this, we use datetime.timedelta.
print(dt + datetime.timedelta(days=84))

2026-07-14 12:15:00
Here are some more remarks:
- datetime.date is not used very often. The reason is that you can recover a date from a datetime, but not the other way round. datetime.time has the same limitations.
- We will not cover time-zones here, but if you have global data, they will become important!
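Besides adding a timedelta to a datetime, one can also subtract two datetimes, which gives a timedelta. The following is a minimal sketch; the variable names start, end and diff are chosen here for illustration only:

```python
import datetime

start = datetime.datetime(2026, 4, 21, 12, 15)
end = start + datetime.timedelta(weeks=12)   # weeks=12 is the same as days=84

diff = end - start                 # subtracting two datetimes gives a timedelta
print(diff)                        # 84 days, 0:00:00
print(diff.days)                   # 84
print(diff.total_seconds())        # 7257600.0
print(start < end)                 # True: datetimes can be compared
print(end.date())                  # 2026-07-14
```

A timedelta stores only days, seconds and microseconds, which is why there is no attribute for months or years: these do not have a fixed length.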
5.4 Decimal numbers (decimal)
We already saw that:
print(f"Rounding error: 0.2 + 0.1 = {0.2 + 0.1}.")

Rounding error: 0.2 + 0.1 = 0.30000000000000004.
The reason is that computers in general use base 2 for all computations, and numbers such as 0.1, 0.2 and 0.3 have no exact representation in this system. While this small rounding error is often acceptable, it is not in some contexts (above all, finance). In this case, we can use the decimal library. Within decimal, base 10 is used, but calculations are slower than for floats.
from decimal import Decimal
print(f"Exact calculation: 0.2 + 0.1 = {Decimal('0.2') + Decimal('0.1')}.")

Exact calculation: 0.2 + 0.1 = 0.3.
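One detail is easy to miss in the example above: the numbers are passed to Decimal as strings. The following sketch shows what happens if a float is passed instead, and how a Decimal can be rounded to cents with quantize. The value 2.675 and the rounding rule ROUND_HALF_UP are assumptions chosen for illustration:

```python
from decimal import Decimal, ROUND_HALF_UP

from_string = Decimal("0.1")   # exactly one tenth
from_float = Decimal(0.1)      # exact value of the binary float closest to 0.1

print(from_string)                    # 0.1
print(from_float == from_string)      # False: the float was already inexact
print(Decimal("0.2") + Decimal("0.1") == Decimal("0.3"))   # True
print(0.2 + 0.1 == 0.3)               # False

# Rounding to two decimal places (cents), rounding halves up.
price = Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(price)                          # 2.68
```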
5.5 Regular expressions (re)
Sometimes, you need to find patterns in a string. In data science, this happens if some field contains unstructured text, and you want to extract information. As an example, assume you load data from a csv-file, and one column contains all email addresses of the corresponding item. If you read this, you have something like "muller@gmail.com; muller@yahoo.com". However, you want to separate the two addresses, and store them in a list. One way to approach this is using a regular expression (which in this case reads "\S+@\S+\.\S+"):
import re
text = "muller@gmail.com; muller@yahoo.com"
emails = re.findall(r"\S+@\S+\.\S+", text)
print(emails)

['muller@gmail.com;', 'muller@yahoo.com']
Let us quickly break this down to some extent, looking at "\S+@\S+\.\S+":
- \S: any non-whitespace character
- +: one or more of the previous thing
- @: exactly the @ symbol
- \.: exactly the dot . character (separating e.g. gmail from com)
We can agree that this is roughly what an email address looks like. re.findall() finds all places in text which match this pattern and extracts them, putting them into a list. Since \S matches any non-whitespace character, the semicolon after the first address is part of the match. Note that there are more patterns to find than \S, + and exact matches. For this reason, it is advised to use AI when it comes to regular expressions.
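In the output above, the first address still carries a trailing semicolon, because \S also matches ";". As a sketch of two possible ways around this (neither is claimed to be a complete email validator), one can exclude the semicolon in the pattern, or avoid regular expressions altogether and split the string:

```python
import re

text = "muller@gmail.com; muller@yahoo.com"

# Variant 1: [^\s;] means "any character except whitespace and semicolon".
emails = re.findall(r"[^\s;]+@[^\s;]+\.[^\s;]+", text)
print(emails)         # ['muller@gmail.com', 'muller@yahoo.com']

# Variant 2: no regular expression, split at ";" and strip surrounding blanks.
emails_split = [part.strip() for part in text.split(";")]
print(emails_split)   # ['muller@gmail.com', 'muller@yahoo.com']
```

The second variant relies on the addresses always being separated by semicolons, while the first also finds addresses embedded in other text.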
5.6 Storing and restoring objects (pickle)
Sometimes, you want to store intermediate objects. One way is to write them out to files yourself, but you can also use the pickle library in order to pickle (i.e. serialize) Python objects to disk. Here is an example:
import pickle
data = {"name": "Alice", "age": 30}
# Save to file (pickle)
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)
# Load from file (unpickle)
with open("data.pkl", "rb") as f:
    loaded_data = pickle.load(f)
print(loaded_data)

{'name': 'Alice', 'age': 30}
5.7 Testing code (pytest)
# Uncomment the next line and run the cell for installation.
# If pip3 does not work, use pip instead.
# !pip3 install pytest

When starting to code, producing reliable programs is both important and difficult. One way to approach this, though not the best, is to rely on the program until something breaks. It is better to test the code, i.e. to write tests. This means that you state in code which behavior you expect from your program.
A simple and powerful library for this is pytest. When working with .py-files, you usually keep such tests in a separate file. Every test case lives in a function whose name starts with test_. In a terminal, you can run pytest on that file, which performs all tests and reports any error.
However, we are working with Jupyter notebooks here, so we cannot run pytest on separate files. Although there are workarounds (the package ipytest), we will only go through some basics here, without importing pytest. This means we are introducing assert: the statement assert condition does nothing if condition is True, and raises an AssertionError if it is False.
x = 5
y = -1
assert 2 + 2 == 4
assert 2 + 2 == 5

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[12], line 4
      2 y = -1
      3 assert 2 + 2 == 4
----> 4 assert 2 + 2 == 5

AssertionError:
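To connect assert with the test_ naming convention mentioned above, here is a minimal sketch of what test functions can look like. The function my_const and both test names are hypothetical and only serve as an illustration; pytest would discover and call the test functions on its own, whereas in a notebook we call them by hand:

```python
def my_const():
    """Hypothetical function under test."""
    return 42

def test_my_const_value():
    assert my_const() == 42

def test_my_const_type():
    # The optional message after the comma is shown if the assertion fails.
    assert isinstance(my_const(), int), "my_const() should return an int"

# Call the tests manually; nothing happens if all assertions hold.
test_my_const_value()
test_my_const_type()
print("All tests passed.")
```

If one of the assertions were wrong, the corresponding call would stop with an AssertionError, just as in the cell above.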
5.8 Using the operating system (os)
For the kind of projects we envision, we often must access files, which happens via the operating system. We might also want to see all files in some directory, access them one by one, etc. Here, folder is a string with the name of a folder, and file is a string with the name of a file.
- os.getcwd(): returns the current working directory.
- os.chdir(folder): this is the Python command for cd.
- os.path.join(folder, file): gives a complete path from the folder and file name.
- os.listdir(folder): returns a list of the names of all entries (files and subfolders) within the folder.
- os.path.exists(file): returns True or False, depending on whether the file exists or not.
- os.walk(folder): yields triples (root, dirs, files), where root is the path of a folder (starting with folder itself), dirs are all direct subfolders in that folder, and files are the files in that folder. If folder has subfolders, one such triple is produced for each of them as well.
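The first functions from this list can be tried out safely in a temporary folder. The following is a sketch under the assumption that creating a temporary directory is allowed on your system; the file name notes.txt is made up for the example:

```python
import os
import tempfile

# Work in a fresh temporary folder so that no existing files are touched.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "notes.txt")   # hypothetical file name
    existed_before = os.path.exists(path)      # False: not created yet

    with open(path, "w") as f:
        f.write("hello")

    existed_after = os.path.exists(path)       # True
    entries = os.listdir(folder)               # ['notes.txt']

    start = os.getcwd()                        # remember where we are
    os.chdir(folder)                           # like "cd folder"
    found_relative = os.path.exists("notes.txt")   # True: relative to new cwd
    os.chdir(start)                            # go back before the folder is removed

print(existed_before, existed_after, entries, found_relative)
```

Using os.path.join instead of gluing strings together with "/" keeps the code independent of the operating system's path separator.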
The string method .endswith(suffix) checks whether a string ends with the given suffix and returns True or False.
import os
# Print all ipynb-files in the current folder
for file in os.listdir("."):
    if file.endswith(".ipynb"):
        print(file)

The built-in function enumerate(iterable) returns pairs (index, element) for each element, which is useful when you need both the position and the value during iteration.
# Print folders and files (only first 5 entries of os.walk)
for i, (root, dirs, files) in enumerate(os.walk(".")):
    if i >= 5:
        print("... (truncated)")
        break
    print("ROOT:", root)
    print("DIRS:", dirs[:5], "..." if len(dirs) > 5 else "")
    print("FILES:", files[:5], "..." if len(files) > 5 else "")
    print()

ROOT: .
DIRS: ['latex', 'misc', 'site_libs', '.claude', 'index_files'] ...
FILES: ['data.pkl', 'index.qmd', '06_numerics.qmd', '.gitignore', 'render_notebooks.sh'] ...
ROOT: ./latex
DIRS: []
FILES: ['gcd.log', 'gcd.aux', 'gcd.pdf', 'gcd.tex']
ROOT: ./misc
DIRS: []
FILES: ['fake_data.py', 'person.csv', 'person.db', 'person.json']
ROOT: ./site_libs
DIRS: ['quarto-html', 'bootstrap', 'clipboard', 'quarto-nav', 'quarto-search']
FILES: []
ROOT: ./site_libs/quarto-html
DIRS: ['tabsets']
FILES: ['quarto.js', 'anchor.min.js', 'tippy.umd.min.js', 'popper.min.js', 'quarto-syntax-highlighting-067f8ab80d092ae3595f7402c2c9b667.css'] ...
... (truncated)
5.9 Starting other programs (subprocess)
Sometimes, you want to start some piece of software (a command line tool) from within your Python program, and collect the result. You could use a script for this, which starts the command line tool and runs your Python script afterwards, but you can as well use Python for the whole setting. We only give a basic example:
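import subprocess
filename = "file.tex"
try:
    result = subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", filename],
        capture_output=True,  # capture stdout and stderr output instead of printing
        text=True,            # returns output as string, not binary
        check=True            # raises an error if the command fails
    )
    print(result.stdout)
except FileNotFoundError:
    # raised when the program itself (here pdflatex) cannot be found
    print("The program pdflatex was not found.")
except subprocess.CalledProcessError as e:
    # raised because of check=True, e.g. if the tex-file is missing or has errors
    print(f"pdflatex failed on {filename} with exit code {e.returncode}.")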
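Since pdflatex may not be installed everywhere, here is a sketch of the same idea with a program that is certainly available: the Python interpreter itself, whose path is stored in sys.executable. The small one-line programs passed via -c are made up for this illustration:

```python
import subprocess
import sys

# Start a second Python process which prints the result of 6 * 7.
result = subprocess.run(
    [sys.executable, "-c", "print(6 * 7)"],
    capture_output=True,   # collect stdout and stderr
    text=True,             # decode the output to str
    check=True,            # raise CalledProcessError on a non-zero exit code
)
print(result.returncode)       # 0 means success
print(result.stdout.strip())   # 42

# A command which exits with a non-zero code triggers the exception.
exit_code = None
try:
    subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"], check=True)
except subprocess.CalledProcessError as e:
    exit_code = e.returncode
print(f"Second command failed with exit code {exit_code}.")
```

Passing the command as a list of strings, rather than one long string, avoids problems with spaces and quoting in file names.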
5.10 Exercises
Exercise 1 What is your age in seconds/minutes/hours/days/months?
# Exercise 1
Exercise 2 Write a function days_until_birthday(birthday_str) that takes a date string like "1990-03-15" and returns the number of days until the next occurrence of that birthday. (Hint: use datetime.date.)
# Exercise 2
Exercise 3 Write a function extract_emails(text) that uses a regular expression to find all email addresses in a string. Test it on "Contact us at info@example.com or support@example.org, but not at wrong-email.". (Hint: a simple pattern like r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" is a good starting point.)
# Exercise 3
Exercise 4 Create some txt-files in the current working directory. Then, write a Python program that iterates through all such files and creates all.txt, to which the content of all your txt-files is appended. If all.txt already exists, the program should give a warning that this file is overwritten. (Hint: you can generate fake text files using the faker library, e.g. from faker import Faker; fake = Faker(); fake.text().)
# Exercise 4
Exercise 5 Below are some sales dates and quantities.
- Write a function, with start and end as datetime as input, which gives you the total amount of sales during this time. (You will need to convert date as given in the data to datetime. Note that <, >, <=, >=, == work for datetime as expected.)
- Write a function, with a date date and a number of days days as input, which returns the average amount of sales per day during the last days days before date.
data = [
("2026-01-01", 120),
("2026-01-02", 135),
("2026-01-03", 128),
("2026-01-08", 150),
("2026-02-01", 210),
("2026-02-03", 215)]
# Exercise 5
Exercise 6 Write a function find_large_files(directory, min_size_mb=10) that uses os.walk to find all files in directory (and subdirectories) that are larger than min_size_mb megabytes. Return a list of (path, size_mb) tuples, sorted by size descending.
# Exercise 6
Exercise 7 A bank manager has an idea: Interest should be computed daily (as usual), but rounded to cents (round up if the last digit is 5 or above) at the end of the day. (So, sub-cents are not kept for computing the account balance.) Assume someone has invested 1 Euro on January 1st, 0 (birth of Christ) with an interest rate of 5% per year.
- Compute the account balance as usual, i.e. keeping sub-cents on all days.
- Compute the account balance using the above strategy, i.e. ignoring sub-cents at the end of each day.
# Exercise 7