5 Some useful libraries/packages/modules


In this chapter, we take a little break. Although there is a lot to learn, the purpose of the material covered here is to give you an overview of the Python ecosystem and of what is possible, rather than to teach the details of the various libraries. The first section, however, is central and will be needed in any Python project you start.

5.1 Loading modules (`import`)

As in most programming languages, not all Python programs come in a single file. If a Python project spans multiple files, imports become necessary. The syntax for imports in Python is as follows:

Syntax	Meaning
`import module`	full module
`import module as name`	alias
`from module import x`	specific objects
`from module import *`	everything (avoid)
`import module.submodule`	submodule

Generally, there are packages (folders), modules (a .py-file within a package, or a single .py-file), and submodules (modules inside a package). Every submodule is a module. In any case, the import loads the (sub-)module and executes all top-level code inside it (i.e. everything which is not inside a function body).
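As a small illustration of the table above, the following sketch applies the import forms to modules from the standard library. The modules math and os.path are chosen here only as examples.

```python
import math                # full module: names need the prefix math.
import math as m           # alias: the same module under a shorter name
from math import sqrt      # specific object: usable without any prefix
import os.path             # submodule: reached as os.path

print(math.sqrt(16.0))                        # 4.0
print(m.sqrt(16.0))                           # 4.0
print(sqrt(16.0))                             # 4.0
print(os.path.basename("folder/file.txt"))    # file.txt
print(m is math)                              # True: an alias is not a copy
```

All three ways of calling sqrt reach the same function; they only differ in how much of the origin is visible at the place of the call.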

There are three cases:

You wrote some code in file one.py (e.g. some functions) which you want to use in file two.py. (Note, btw, that the first character of the file name one.py must be a letter or an underscore, since the module name must be a valid Python identifier.) Assume one.py is as follows:

# File: one.py

def myConst():
    return 42

Then, the import in two.py is either import one (both files must be in the same folder), or from one import myConst, or (not recommended) from one import *. In the first case, you must prefix myConst() by one:

# File: two.py

import one
print(f"My constant is {one.myConst()}.")

In the second case, there is no prefix:

from one import myConst
# or 
from one import *
print(f"My constant is {myConst()}.")

Observe that the second case comes with the disadvantage that you cannot see from the call of myConst() whether it is defined in two.py or in one.py, so it becomes harder to keep track of your code if things become more complicated.

If you have a folder myPackage which holds one.py, you can write import myPackage.one, from myPackage import one, or import myPackage.one as myName, with differing ways to use myConst(). (A plain import myPackage does not load one.py by itself, unless the file myPackage/__init__.py imports it.)

import myPackage.one
print(f"My constant is {myPackage.one.myConst()}.")

from myPackage import one
print(f"My constant is {one.myConst()}.")

import myPackage.one as myName
print(f"My constant is {myName.myConst()}.")

You want to use functions from the standard library, i.e. from a library that ships with Python. Then, a simple import module is often the best choice. We will see examples in Section 5.2.
You want to use functions from a library which needs to be installed. Here, before import module even works, you need to install the library in your Python environment. In a notebook cell (or in a terminal, without the leading !), within your Python environment, use

# Uncomment the next line and run the cell for installation.
# This only needs to be done once for every package.
# If pip3 does not work, use pip instead. 
# !pip3 install numpy

and then, import numpy will work. We will see examples for this e.g. in Section 7.8.

5.2 Mathematics (`math`)

The mathematical library of Python is loaded by import math. Then, you have e.g. the following functions, most of which are self-explanatory: math.sqrt(), math.exp(), math.log() (the natural logarithm), math.sin(), math.cos(), and constants like math.pi and math.e.
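A few of these functions in action, as a minimal sketch (the variables radius and area are only chosen for illustration):

```python
import math

radius = 2.0
area = math.pi * radius ** 2          # area of a circle
print(f"{area:.4f}")                  # 12.5664
print(math.sqrt(16))                  # 4.0
print(math.exp(0))                    # 1.0
print(math.cos(0))                    # 1.0
# math.log is the natural logarithm; floats are compared with a tolerance
print(math.isclose(math.log(math.e), 1.0))                   # True
print(math.isclose(math.sin(math.pi), 0.0, abs_tol=1e-12))   # True
```

Since floats carry small rounding errors, math.isclose() is usually a safer comparison than == for results of such computations.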

5.3 Dates and times of day (`datetime`)

In data science, we often have to handle dates. Their disadvantage is that there is no single way of writing them as strings which everybody uses. In Python, this and various other problems around dates and times are solved by the datetime library.

The most important classes of this module are

date: handling dates
time: handling times
datetime.datetime: handling datetimes, i.e. dates and times together
timedelta: time differences
tzinfo and timezone: handling timezones

We will here use

import datetime

at the cost that datetime-objects are only accessible by saying datetime.datetime (the first datetime being the module, the second datetime being the class).

Let us create a datetime-object:

dt = datetime.datetime(2026, 4, 21, 12, 15)
print(dt)
print(f"year: {dt.year}")
print(f"month: {dt.month}")
print(f"day: {dt.day}")
print(f"hour: {dt.hour}")
print(f"minutes: {dt.minute}")
print(f"second: {dt.second}")
print(f"microsecond: {dt.microsecond}")
print(f"date: {dt.date()}")
print(f"time: {dt.time()}")

2026-04-21 12:15:00
year: 2026
month: 4
day: 21
hour: 12
minutes: 15
second: 0
microsecond: 0
date: 2026-04-21
time: 12:15:00

which is the starting time of the first session in this course. As we see here, we can extract everything from the year down to the microsecond from the datetime-object.

Often, we are given strings such as "21.4.26, 12:15", and want to convert this into a datetime object. For this, we have datetime.datetime.strptime:

s = "21.4.26, 12:15"
# The string reads day.month.year, so the format lists %d, %m, %y in this order.
dt = datetime.datetime.strptime(s, "%d.%m.%y, %H:%M")
print(s + f" is converted to {dt}.")

21.4.26, 12:15 is converted to 2026-04-21 12:15:00.

(For the correct way to place %, y, m, d etc., it is best to consult the table of format codes in the documentation of datetime, or your favourite AI.)

Conversely, we may want to print a datetime-object in a certain way. For this, we have datetime.datetime.strftime:

print(dt.strftime("%-m/%-d/%y, %H:%M"))

4/21/26, 12:15

which is the US-style output. (The codes %-m and %-d, which drop the leading zero, work on Linux and macOS, but not on Windows.)

Next, assume we want to compute the same time of day 12 weeks later. For this, we use datetime.timedelta.

print(dt + datetime.timedelta(days=84))

2026-07-14 12:15:00

Here are some more remarks:

datetime.date is not used very often. The reason is that you can recover a date from a datetime, but not the other way round.
datetime.time has the same limitations.
We will not cover time-zones here, but if you have global data, they will become important!
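As a complement to these remarks, here is a small sketch of two further operations: the difference of two datetimes, and gluing a date and a time back together. The variable names start, end, diff and combined are chosen for illustration only.

```python
import datetime

start = datetime.datetime(2026, 4, 21, 12, 15)
end = start + datetime.timedelta(weeks=12)

# Subtracting two datetimes gives a timedelta
diff = end - start
print(diff)                    # 84 days, 0:00:00
print(diff.days)               # 84
print(diff.total_seconds())    # 7257600.0

# A date and a time can be combined into a datetime again
combined = datetime.datetime.combine(start.date(), start.time())
print(combined == start)       # True
```

So a date alone does not determine a datetime, but a date together with a time does.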

5.4 Decimal numbers (`decimal`)

We already saw that:

print(f"Rounding error: 0.2 + 0.1 = {0.2 + 0.1}.")

Rounding error: 0.2 + 0.1 = 0.30000000000000004.

The reason is that computers in general use base 2 for all computations, and numbers such as 0.1, 0.2 and 0.3 have no exact finite representation in this system. While this small rounding error is often acceptable, it is not in some contexts (above all, finance). In these cases, we can use the decimal library. Within decimal, base 10 is used, but calculations are slower than for floats.

from decimal import Decimal
print(f"Exact calculation: 0.2 + 0.1 = {Decimal('0.2') + Decimal('0.1')}.")

Exact calculation: 0.2 + 0.1 = 0.3.
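Two details are easy to get wrong here, so here is a hedged sketch: a Decimal should be created from a string (not from a float), and rounding to a fixed number of digits can be done with quantize. The amount 2.675 is an arbitrary example value.

```python
from decimal import Decimal, ROUND_HALF_UP

print(0.1 + 0.2 == 0.3)                                    # False
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True

# A Decimal created from a float inherits the binary rounding error
print(Decimal(0.1) == Decimal("0.1"))                      # False

# Rounding to two digits (cents), rounding up from 5
amount = Decimal("2.675")
cents = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(cents)                                               # 2.68
print(round(2.675, 2))                                     # 2.67 (float)
```

The last line shows that the float 2.675 is in fact stored as a number slightly below 2.675, which is why round gives 2.67.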

5.5 Regular expressions (`re`)

Sometimes, you need to find patterns in a string. In data science, this happens if some field contains unstructured text, and you want to extract information. As an example, assume you load data from a csv-file, and one column contains all email addresses of the corresponding item. If you read this, you have something like "muller@gmail.com; muller@yahoo.com". However, you want to separate the two addresses, and store them in a list. One way to approach this is using a regular expression (which in this case reads "\S+@\S+\.\S+"):

import re

text = "muller@gmail.com; muller@yahoo.com"

emails = re.findall(r"\S+@\S+\.\S+", text)
print(emails)

['muller@gmail.com;', 'muller@yahoo.com']

Let us quickly break this down to some extent, looking at "\S+@\S+\.\S+":

\S: any non-whitespace character
+: one or more of the previous thing
@: exactly the @ symbol
\.: exactly the dot . character (separating e.g. gmail from com)

We can agree that this is roughly what an email address looks like. The function re.findall() finds all places in text which match this pattern and extracts them, putting them into a list. Note, however, that the first match still carries the trailing semicolon, since ; is a non-whitespace character as well. There are many more patterns than \S, + and exact matches. For this reason, it is advised to use AI when it comes to regular expressions.
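The first match in the output above ends with a semicolon. Here is a sketch of two possible ways to avoid this; the modified pattern is only one of several options, not the canonical pattern for email addresses.

```python
import re

text = "muller@gmail.com; muller@yahoo.com"

# Option 1: forbid whitespace *and* semicolons inside a match
emails = re.findall(r"[^\s;]+@[^\s;]+\.[^\s;]+", text)
print(emails)    # ['muller@gmail.com', 'muller@yahoo.com']

# Option 2: split the string at each semicolon (plus optional whitespace)
parts = re.split(r";\s*", text)
print(parts)     # ['muller@gmail.com', 'muller@yahoo.com']
```

Here, [^\s;] means any character which is neither whitespace nor a semicolon, and \s* means zero or more whitespace characters.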

5.6 Storing and restoring objects (`pickle`)

Sometimes, you want to store intermediate objects. One way is to write them out to text files, but you can also use the pickle library in order to pickle (i.e. serialize) Python objects to disk. Here is an example:

import pickle

data = {"name": "Alice", "age": 30}

# Save to file (pickle)
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

# Load from file (unpickle)
with open("data.pkl", "rb") as f:
    loaded_data = pickle.load(f)

print(loaded_data)

{'name': 'Alice', 'age': 30}
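Pickling also works without a file and for objects which are not plain text. A minimal sketch, assuming a made-up dictionary record which contains a date and a list:

```python
import datetime
import pickle

record = {
    "name": "Alice",
    "joined": datetime.date(2026, 4, 21),
    "scores": [1.5, 2.0],
}

blob = pickle.dumps(record)       # serialize to a bytes object in memory
restored = pickle.loads(blob)     # rebuild the object from the bytes

print(type(blob).__name__)        # bytes
print(restored == record)         # True: same content
print(restored is record)         # False: a new, independent object
```

The functions dumps and loads (with s for string of bytes) are the in-memory counterparts of dump and load.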

5.7 Testing code (`pytest`)

# Uncomment the next line and run the cell for installation.
# If pip3 does not work, use pip instead. 
# !pip3 install pytest

When starting to code, producing reliable programs is both important and difficult. One – not the best – way to approach this is to rely on the program until something breaks. It is actually better to test the code, i.e. to write tests. This means that you tell your program that you expect a certain behavior.

A simple and powerful library for this is pytest. When working with .py-files, you usually put such tests in a separate file. Every test case is a function whose name starts with test_. In a terminal, you can run pytest on that file, which performs all tests and reports any error.

However, we are working with Jupyter notebooks here, so we cannot run pytest on separate files. Although there are workarounds (the package ipytest), we will only go through some basics here, without importing pytest. This means we are introducing the assert statement, on which such tests are built: assert condition does nothing if the condition is true, and raises an AssertionError otherwise.

x = 5
y = -1
assert 2 + 2 == 4
assert 2 + 2 == 5

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[12], line 4
      1 x = 5
      2 y = -1
      3 assert 2 + 2 == 4
----> 4 assert 2 + 2 == 5

AssertionError:
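To give an impression of what a pytest-style test looks like, here is a minimal sketch. The function add and the test test_add are hypothetical examples. In a .py-file, pytest would find and run test_add by its name; in a notebook, we call it by hand.

```python
def add(a, b):
    """Function under test (hypothetical example)."""
    return a + b

def test_add():
    # Each assert states an expected behavior of add
    assert add(2, 2) == 4
    assert add(-1, 1) == 0, "opposite numbers should add up to 0"

test_add()                    # raises AssertionError if any check fails
print("All tests passed.")
```

The optional string after the comma is shown together with the AssertionError, which helps to find out which expectation was violated.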

5.8 Using the operating system (`os`)

For the kind of projects we envision, we often must access files, which happens via the operating system. We might also want to see all files in some directory, access them one by one, etc. Here, folder is the string of the name of a folder, and file is the string for the name of a file.

os.getcwd(): returns the current working directory.
os.chdir(folder): changes the current working directory; this is the Python counterpart of cd.
os.path.join(folder, file): gives a complete path from the folder and file name.
os.listdir(folder): returns a list of the names of all files and subfolders within the folder.
os.path.exists(file): returns True if the file exists, and False otherwise.
os.walk(folder): yields triples (root, dirs, files), where root is the path of a folder (starting with folder itself), dirs are the names of all direct subfolders of root, and files are the names of the files in root. If folder has subfolders, one such triple is produced for each of them as well, so os.walk is typically used in a for loop.
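Some of the functions from the list can be tried out safely in a temporary folder. This is a sketch under the assumption that the standard library module tempfile is used for creating that folder; the file name notes.txt is made up.

```python
import os
import tempfile

# The temporary folder is deleted again at the end of the with-block
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "notes.txt")   # folder + separator + file name
    exists_before = os.path.exists(path)
    with open(path, "w") as f:
        f.write("hello")
    exists_after = os.path.exists(path)
    names = os.listdir(folder)

print(exists_before)    # False
print(exists_after)     # True
print(names)            # ['notes.txt']
```

Using os.path.join instead of gluing strings with "/" keeps the code independent of the operating system.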

The string method .endswith(suffix) checks whether a string ends with the given suffix and returns True or False.

import os
# Print all ipynb-files in the current folder
for file in os.listdir("."):
    if file.endswith(".ipynb"):
        print(file)

The built-in function enumerate(iterable) returns pairs (index, element) for each element, which is useful when you need both the position and the value during iteration.

# Print folders and files (only first 5 entries of os.walk)
for i, (root, dirs, files) in enumerate(os.walk(".")):
    if i >= 5:
        print("... (truncated)")
        break
    print("ROOT:", root)
    print("DIRS:", dirs[:5], "..." if len(dirs) > 5 else "")
    print("FILES:", files[:5], "..." if len(files) > 5 else "")
    print()

ROOT: .
DIRS: ['exam', '03_loops_and_conditions_files', '.quarto', 'docs', '02_simple_data_types_files'] ...
FILES: ['07_handling_data.qmd', '02_simple_data_types.html', '04_container_data_types.quarto_ipynb', '03_loops_and_conditions.html', '02_simple_data_types.qmd'] ...

ROOT: ./exam
DIRS: [] 
FILES: ['generate_exam.py', 'exam_questions.md'] 

ROOT: ./03_loops_and_conditions_files
DIRS: ['figure-pdf'] 
FILES: [] 

ROOT: ./03_loops_and_conditions_files/figure-pdf
DIRS: [] 
FILES: [] 

ROOT: ./.quarto
DIRS: ['_freeze', 'cites', 'idx', 'xref', 'quarto-session-tempc1bb626afa028a84'] ...
FILES: [] 

... (truncated)

5.9 Starting other programs (`subprocess`)

Sometimes, you want to start some piece of software (a command line tool) from within your Python program, and collect the result. You could use a shell script for this, which starts the command line tool and runs your Python script afterwards, but you can just as well do everything from Python. We only give a basic example:

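import subprocess
filename = "file.tex"
try:
    result = subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", filename],
        capture_output=True, # capture stdout and stderr instead of printing them
        text=True, # return output as string, not bytes
        check=True # raise CalledProcessError if the command fails
    )
    print(result.stdout)
except FileNotFoundError:
    # raised if the program pdflatex itself cannot be found
    print("The program pdflatex was not found.")
except subprocess.CalledProcessError as e:
    # raised because of check=True, e.g. if file.tex is missing or has errors
    print(f"pdflatex failed on {filename} with exit code {e.returncode}.")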
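Since pdflatex is not installed everywhere, here is a variant of the same idea which should run on any machine with Python. As an assumption made only for this sketch, the program we start is the Python interpreter itself (its path is available as sys.executable), and it is asked to run a one-line program.

```python
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-c", "print(2 + 3)"],  # program and its arguments
    capture_output=True,  # collect stdout and stderr
    text=True,            # decode the output to strings
    check=True,           # raise CalledProcessError on a non-zero exit code
)
print(result.returncode)        # 0
print(result.stdout.strip())    # 5
```

The command is given as a list, one entry per argument, so no quoting of spaces or special characters is needed.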

5.10 Exercises

Exercise 1 What is your age in seconds/minutes/hours/days/months?

# Exercise 1

Exercise 2 Write a function days_until_birthday(birthday_str) that takes a date string like "1990-03-15" and returns the number of days until the next occurrence of that birthday. (Hint: use datetime.date.)

# Exercise 2

Exercise 3 Write a function extract_emails(text) that uses a regular expression to find all email addresses in a string. Test it on "Contact us at info@example.com or support@example.org, but not at wrong-email.". (Hint: a simple pattern like r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" is a good starting point.)

# Exercise 3

Exercise 4 Create some txt-files in the current working directory. Then, write a Python program that iterates through all such files and creates all.txt, to which the content of all your txt-files is appended. If all.txt already exists, the program should give a warning that this file is overwritten. (Hint: you can generate fake text files using the faker library, e.g. from faker import Faker; fake = Faker(); fake.text().)

# Exercise 4

Exercise 5 Below are some sales dates and quantities.

Write a function, with start and end as datetime as input, which gives you the total amount of sales during this time. (You will need to convert date as given in the data to datetime. Note that <, >, <=, >=, == work for datetime as expected.)
Write a function, with a date date and a number of days days as input, which returns the average amount of sales per day during the last days days before date.

data = [
        ("2026-01-01", 120),
        ("2026-01-02", 135),
        ("2026-01-03", 128),
        ("2026-01-08", 150),
        ("2026-02-01", 210),
        ("2026-02-03", 215)]

# Exercise 5

Exercise 6 Write a function find_large_files(directory, min_size_mb=10) that uses os.walk to find all files in directory (and subdirectories) that are larger than min_size_mb megabytes. Return a list of (path, size_mb) tuples, sorted by size descending.

# Exercise 6

Exercise 7 A bank manager has an idea: Interest should be computed daily (as usual), but rounded to cents (round up if the last digit is 5 or above) at the end of the day. (So, sub-cents are not kept for computing the account balance.) Assume someone has invested 1 Euro on January 1st, 0 (birth of Christ) with an interest rate of 5% per year.

Compute the account balance as usual, i.e. keeping sub-cents on all days.
Compute the account balance using the above strategy, i.e. dropping sub-cents at the end of each day.

# Exercise 7
