Automation and Make#
As we work on a project, we often encounter certain commands and operations that we end up running multiple times. Many of these operations concern the behaviour of programs that we execute from the terminal. For example, so far in this course we have been:

- Managing files: creating files and folders.
- Running code from Python scripts and Jupyter notebooks that performs certain analyses, reading data and generating outputs.
- Creating virtual environments, activating them, installing new packages, and creating an IPython kernel.
- Creating a JupyterBook.

As our workflow grows, these operations become more complex and dependent on each other. Make allows us not just to automate the execution of programs, but also to keep track of the network of commands between the different parts of our project.
0. Setup#
Let’s consider the following piece of code inside our Eratosthenes project. Create a new Python script called calculate_prime.py
with the following content:
```python
# calculate_prime.py
import sys
import math

import numpy as np


def sieve(nmax):
    """
    Function to compute prime numbers.
    Arguments:
        - nmax: integer. Upper bound for prime search.
    Outputs:
        - all_primes: list. List with all the prime numbers smaller than nmax.
    """
    all_primes = []
    if nmax == 2:
        all_primes = [2]
    else:
        primes_head = [2]
        first = 3
        primes_tail = np.arange(first, nmax + 1, 2)
        while first <= round(math.sqrt(primes_tail[-1])):
            first = primes_tail[0]
            primes_head.append(first)
            non_primes = first * primes_tail
            primes_tail = np.array([n for n in primes_tail[1:]
                                    if n not in non_primes])
        all_primes = primes_head + primes_tail.tolist()
    return all_primes


if __name__ == '__main__':
    n = int(sys.argv[1])
    print(sieve(n))
```
The last part of calculate_prime.py
includes the __main__
block. This is what allows us to run the script and read arguments directly from the terminal. Now, from the terminal, we can run sieve()
with

```bash
python calculate_prime.py 10
```

which should print the list [2, 3, 5, 7].
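As a quick sanity check, we can verify this expected output with a straightforward trial-division primality test (this snippet is only a reference check, not part of the project):

```python
# Reference check: trial division confirms the primes below 10
def is_prime(n):
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

print([n for n in range(2, 11) if is_prime(n)])  # [2, 3, 5, 7]
```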
Warning
Remember to check which environment you are running this code in! If you do this from the base
environment it won’t work, since numpy
is not installed there. You can activate the notebook
environment or use the environment you created for the Eratosthenes project in Lab 04.
Now, let’s move things around a little. Instead of passing arguments through the terminal and then printing the outputs, let’s read a list of arguments from an input.txt
file and save the results to an output.txt
file. We can achieve this by modifying the previous script to include
```python
if __name__ == '__main__':
    input_file = sys.argv[1]
    output_file = sys.argv[2]

    # Read each line of the file
    with open(input_file) as file:
        lines = file.read().splitlines()

    results = []
    for n in lines:
        results.append(sieve(int(n)))

    # Save values
    with open(output_file, 'w') as output:
        for i, res in enumerate(results):
            output.write("{} {}\n".format(lines[i], res))
```
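To make the file format concrete, here is a self-contained sketch of the same read/compute/write loop, using a temporary directory instead of data/ and results/; the name fake_sieve is a stand-in for sieve() used only for illustration:

```python
import os
import tempfile

def fake_sieve(n):
    # Stand-in for sieve(): trial division, returns primes up to n
    return [p for p in range(2, n + 1)
            if all(p % d for d in range(2, int(p ** 0.5) + 1))]

with tempfile.TemporaryDirectory() as tmp:
    input_file = os.path.join(tmp, "input.txt")
    output_file = os.path.join(tmp, "output.txt")

    # One integer per line, as in data/input.txt
    with open(input_file, "w") as f:
        f.write("10\n5\n")

    # Read each line, compute, and write "<n> <primes>" per line
    with open(input_file) as f:
        lines = f.read().splitlines()
    with open(output_file, "w") as out:
        for n in lines:
            out.write("{} {}\n".format(n, fake_sieve(int(n))))

    with open(output_file) as f:
        report = f.read()

print(report)
# 10 [2, 3, 5, 7]
# 5 [2, 3, 5]
```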
Now create a data/input.txt
file with one integer number per line, create a folder called results
, and execute

```bash
python calculate_prime.py data/input.txt results/output.txt
```

This will create the file output.txt
inside the folder results
with the printed outputs.
Running from IPython
You can also run the previous command directly from an IPython cell inside a Jupyter notebook instead of a terminal by using the %run
magic command:

```python
%run calculate_prime.py data/input.txt results/output.txt
```
1. Automation with Bash#
Now, if we want to perform one simple operation, we can run commands individually from the terminal. However,

- This doesn’t look fully reproducible.
- It doesn’t scale very well when our analysis requires executing multiple commands.
- It doesn’t generalize very well to cases with different input/output files.

Notice that the workflow introduced in the previous section required at least three steps: activating the correct conda environment, creating the output folder, and executing the Python script.
A first solution to some of these problems is to create a Bash script that executes all these operations:
```bash
#!/bin/bash
conda activate notebook
mkdir results
python calculate_prime.py data/input.txt results/output.txt
```
The header of the file has the shebang #!
, which indicates that this is an executable file. You will probably need to change the permissions of the file in order to execute it. Explore the chmod
command in bash for doing this.
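For instance, the following sketch creates a small demo script (the name demo.sh is illustrative), grants it execute permission, and runs it:

```shell
# Create a small demo script, grant execute permission, and run it
printf '#!/bin/bash\necho "hello from bash"\n' > demo.sh
chmod +x demo.sh   # add the execute bit
./demo.sh          # prints: hello from bash
```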
Warning
This doesn’t actually activate the environment, since conda is not recognized from within the bash script.
2. Our first Makefile#
Now, instead of keeping all these instructions in a bash script, let’s use Make. A Makefile is a build file. Although similar to a bash script, they are not the same. Let’s begin with something simple and create a file called Makefile
with the following content
```makefile
# Compute prime numbers
results/output.txt : data/input.txt
	python calculate_prime.py data/input.txt results/output.txt
```
and now, from the terminal, let’s execute just the command make:

```bash
make
```

This executes the Python script and generates the respective output in results/output.txt
.
Warning
It is important that the indentation inside the Makefile
uses tabs instead of spaces. If you are working from JupyterLab, you can change this configuration in Settings > Text Editor Indentation
.
Make Syntax
The basic syntax inside the Makefile can be described as follows:

```makefile
# Comments
<TARGETS> : <DEPENDENCIES>
	<PROGRAMS>
```

The #
is used for comments. The programs section can include multiple lines of script with increasing levels of complexity, for example conditional statements. The important thing to know is that inside <PROGRAMS>
, you are running bash code.
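For instance, a minimal hypothetical rule following this syntax (the file names here are illustrative) could be:

```makefile
# Count the lines of the input file and store the result
results/line_count.txt : data/input.txt
	wc -l data/input.txt > results/line_count.txt
```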
2.1. Re-executing code#
One of the things that makes Make special is that it doesn’t re-execute operations whose dependencies haven’t changed. For example, in the previous example, results/output.txt
depends on both the input data data/input.txt
and the Python script calculate_prime.py
. If we don’t change these two files and execute make
one more time, you will observe no change, and make will print a message similar to this one:

```
!make
make: Nothing to be done for 'outputs'.
```

Now, if we make the smallest change to any of the dependency files, then Make will execute the program again. For example, if you just update the timestamp of any of the files (touch data/input.txt
) and run make
again, you will see that the Python code is executed again and the timestamp of results/output.txt
is updated too.
```
%%bash
touch data/input.txt
make
```
One advantage of this built-in memory system of Makefiles is that we avoid repeating unnecessary operations as we run a dataflow.
2.2. Special characters#
As you can see from the Makefile
, we are being redundant about the names of both the dependency and target files. Instead of

```makefile
results/output.txt : data/input.txt
	python calculate_prime.py data/input.txt results/output.txt
```

we can instead write

```makefile
results/output.txt : data/input.txt
	python calculate_prime.py $^ $@
```

The symbols beginning with $
are special characters in Make that have special meanings and can be used as shortcuts. You can find a full list of them here. Some of the most useful ones include
- $^
: The list of all dependencies
- $<
: The name of the first dependency
- $@
: The name of the target
- $*
: The stem matched by the % wildcard (see Section 3.3)
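As an illustration (the file names here are hypothetical), a rule that concatenates two inputs could use these shortcuts:

```makefile
# $^ expands to "input1.txt input2.txt", $@ to results/combined.txt
results/combined.txt : input1.txt input2.txt
	cat $^ > $@
```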
3. Adding more functions to our Makefile#
Let’s explore some other commands we can add to our Makefile that will be useful as we automate and execute more code.
3.1. Cleaning#
We may be interested in removing all existing output files so we can recreate them from scratch.

```makefile
.PHONY : clean
clean :
	rm -f results/*
```

and then run just the cleaning command with

```bash
make clean
```
Phony target
A phony target is one that is not really the name of a file. It is just a name for some commands to be executed when you make an explicit request. There are two reasons to use a phony target: to avoid a conflict with a file of the same name, and to improve performance (see here for more information).
3.2. Grouping operations#
Now, we can combine multiple operations under the same group. By creating the target outputs
, we can create all the output files using make outputs
:

```makefile
.PHONY : outputs
outputs : results/output1.txt results/output2.txt

results/output1.txt : input1.txt
	python calculate_prime.py $^ $@

results/output2.txt : input2.txt
	python calculate_prime.py $^ $@

.PHONY : clean
clean :
	rm -f results/*
```

Since outputs
doesn’t refer to the name of any target file, we add it as a .PHONY
target, just as we did with clean
.
Make will automatically run the first target inside our Makefile
. That means that if we place the following lines

```makefile
.PHONY : outputs
outputs : results/output1.txt results/output2.txt
```

at the top of our Makefile
, then the make
command will execute make outputs
by default.
3.3. Wildcard#
In our Makefile, there are two ways of using wildcards. For targets and dependencies (the <TARGETS> : <DEPENDENCIES>
part of our Makefile) we can use the generic wildcard %
. This is used to match patterns and automate their processing. For example, we can simplify the last two rules into one by using

```makefile
results/output%.txt : input%.txt
	python calculate_prime.py $^ $@
```

Inside the commands, we can then use the placeholder $*
to refer to the stem matched by %
. For example, an equivalent way of writing the previous rule would be

```makefile
results/output%.txt : input%.txt
	python calculate_prime.py $< results/output$*.txt
```
3.4. Working directory setup#
You can also use Make to set up your working directory, running the same operations you would run from the terminal but in a more automatic way.

```makefile
.PHONY : setup
setup :
	mkdir -p results
```
3.5. Make for creating a new environment#
You can use commands in Make to create and manipulate conda environments. You can do this in such a way that you automate the process of creating an environment from a .yml
file, installing new dependencies (e.g. ipython
) and then creating the corresponding kernel for the environment.
Warning
Unfortunately, it is not possible to activate environments using make. A workaround solution to this problem is to have separate make commands to create your environment and install the required dependencies. By doing this, you can execute the full creation and deletion of a conda environment with the bash commands

```bash
make create_environment
conda activate <myenv>
make update_environment
make delete_environment
```
You can find more information about this workflow in this talk.
Another solution is to include the following commands in your Makefile:

```makefile
.ONESHELL:
SHELL = /bin/bash
```

By default, every line in a recipe of a make command is executed in a different process. The .ONESHELL:
directive allows us to run all the commands inside an operation in the same shell. The line SHELL = /bin/bash
makes explicit which shell to use, allowing us to write
```makefile
.ONESHELL:
SHELL = /bin/bash

create_environment :
	source /srv/conda/etc/profile.d/conda.sh
	conda env create -f environment.yml
	conda activate notebook
	conda install ipykernel
	python -m ipykernel install --user --name make-env --display-name "IPython - Make"
```
3.6. Self-documenting Makefile#
A very useful feature we can add to our Makefile is documentation for the different operations we write. A simple hack for doing this automatically consists of including a commented line starting with ##
on top of each operation, for example,

```makefile
## clean : Remove output files
.PHONY : clean
clean :
	rm -f results/*
```

and then including the following command in your Makefile:

```makefile
.PHONY : help
help : Makefile
	@sed -n 's/^##//p' $<
```
Now, the next time you execute make help
from the terminal you will see

```
!make help
 clean : Remove output files
```
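To see what the sed command is doing, we can run it by hand on a small file (the name Makefile.demo is illustrative): sed -n suppresses the default output, and the trailing p prints only the lines where the leading ## was stripped.

```shell
# Write a two-line demo Makefile and extract its ## documentation lines
printf '## clean : Remove output files\n.PHONY : clean\n' > Makefile.demo
sed -n 's/^##//p' Makefile.demo   # prints: " clean : Remove output files"
```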