Analysis pipelines with Python

Resources and parallelism

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How do I scale a pipeline across multiple cores?

  • How do I manage access to resources while working in parallel?

Objectives
  • Modify your pipeline to run in parallel.

After the exercises at the end of our last lesson, our Snakefile looks something like this:

# our zipf analysis pipeline
DATS = glob_wildcards('books/{book}.txt').book

rule all:
    input:
        'zipf_analysis.tar.gz'

# delete everything so we can re-run things
rule clean:
    shell:  
        '''
        rm -rf results dats plots
        rm -f results.txt zipf_analysis.tar.gz
        '''

# count words in one of our "books"
rule count_words:
    input: 	
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    shell: 	'python {input.wc} {input.book} {output}'

# create a plot for each book
rule make_plot:
    input:
        plotcount='plotcount.py',
        book='dats/{file}.dat'
    output: 'plots/{file}.png'
    shell:  'python {input.plotcount} {input.book} {output}'

# generate summary table
rule zipf_test:
    input:  
        zipf='zipf_test.py',
        books=expand('dats/{book}.dat', book=DATS)
    output: 'results.txt'
    shell:  'python {input.zipf} {input.books} > {output}'

# create an archive with all of our results
rule make_archive:
    input:
        expand('plots/{book}.png', book=DATS),
        expand('dats/{book}.dat', book=DATS),
        'results.txt'
    output: 'zipf_analysis.tar.gz'
    shell: 'tar -czvf {output} {input}'

At this point, we have a complete data analysis pipeline. Very cool. But how do we make it run as efficiently as possible?

Running in parallel

Up to this point, Snakemake has printed out an interesting message whenever we run our pipeline.

Provided cores: 1
Rules claiming more threads will be scaled down.

So far, Snakemake has been run with one core. Let’s scale up our pipeline to run in parallel. The only change we need to make is to run Snakemake with the -j argument. -j indicates the number of CPU cores available and, on a cluster, the maximum number of jobs (we’ll get to that part later).

snakemake clean
snakemake -j 4    # 4 cores is usually a safe assumption when working on a laptop/desktop
Provided cores: 4
Rules claiming more threads will be scaled down.
# more output follows

Our pipeline ran in parallel and finished roughly 4 times faster! The takeaway here is that all we need to do to scale up from a serial pipeline is to run snakemake with the -j option.

How many CPUs does your computer have?

Now that we can have our pipeline use multiple CPUs, how do we know how many CPUs to provide to the -j option? Note that for all of these options, it’s best to use CPU cores, and not CPU threads.

Linux - You can use the lscpu command.

All platforms - Python’s psutil module can be used to fetch the number of cores in your computer. Using logical=False returns the number of true CPU cores. logical=True gives the number of CPU threads on your system.

import psutil
psutil.cpu_count(logical=False)
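If psutil is not installed, Python’s standard library offers a rough equivalent, with the caveat that it counts logical CPUs (threads) rather than physical cores:

```python
import os

# os.cpu_count() reports logical CPUs (threads), so on a machine with
# hyper-threading it is typically double the physical core count.
print(os.cpu_count())
```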

Managing CPUs

Each rule has a number of optional keywords aside from the usual input, output, and shell/run. The threads keyword is used to specify how many CPU cores a rule needs while executing. Though in reality CPU threads are not quite the same as CPU cores, the two terms are interchangeable when working with Snakemake.

Let’s pretend that our count_words rule is actually very CPU-intensive. We’ll say that it needs a whopping 4 CPUs per run. We can specify this with the threads keyword in our rule. We will also modify the rule to print out the number of threads it thinks it is using. Please note that just giving something 4 threads in Snakemake does not make it run in parallel! In this case wordcount.py is actually still running with 1 core, we are simply using it as a demonstration of how to go about running something with multiple cores.

rule count_words:
    input: 	
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    threads: 4
    shell:
        '''
        echo "Running {input.wc} with {threads} cores."
        python {input.wc} {input.book} {output}
        '''

Now, when we run snakemake -j 4, the count_words jobs are run one at a time, so as to give each execution the resources it needs. All of our other rules will still run in parallel. Unless otherwise specified with the threads keyword, rules use 1 core by default.

Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	4	count_words
	1	make_archive
	4	make_plot
	1	zipf_test
	11

rule count_words:
    input: wordcount.py, books/last.txt
    output: dats/last.dat
    jobid: 3
    wildcards: file=last
    threads: 4

Running wordcount.py with 4 cores.
Finished job 3.
1 of 11 steps (9%) done

# other output follows

What happens when we don’t have 4 cores available? What if we tell Snakemake to run with 2 cores instead?

snakemake -j 2
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	4	count_words
	1	make_archive
	4	make_plot
	1	zipf_test
	11

rule count_words:
    input: wordcount.py, books/last.txt
    output: dats/last.dat
    jobid: 6
    wildcards: file=last
    threads: 2

Running wordcount.py with 2 cores.
Finished job 6.
1 of 11 steps (9%) done

# more output below

The key bit of output is Rules claiming more threads will be scaled down. When Snakemake doesn’t have enough cores to run a rule (as specified by its threads value), it runs that rule with the maximum available number of cores instead. After all, Snakemake’s job is to get our workflow done. It automatically scales our workload to match the number of cores available, without us editing the Snakefile.
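The scaling rule itself can be sketched in a line of Python (a toy model for illustration, not Snakemake’s actual code): each job receives the smaller of what it asked for and what was provided.

```python
# Toy model of Snakemake's thread scaling (not the real implementation):
# a rule asking for more threads than -j provides is capped at -j.
def effective_threads(requested: int, provided_cores: int) -> int:
    return min(requested, provided_cores)

print(effective_threads(4, 2))  # count_words with -j 2 runs with 2 threads
print(effective_threads(4, 8))  # with -j 8 it gets the full 4 it asked for
```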

Chaining multiple commands

Up until now, all of our commands have fit on one line. To execute multiple bash commands, the only modification we need to make is to use a Python multiline string (begin and end with ''').

One important addition we should be aware of is the && operator. && is a bash operator that runs commands as part of a chain. If the first command fails, the remaining steps are not run. This is safer than bash’s default behavior of hitting an error and carrying on with the remaining commands. After all, if the first command failed, it’s unlikely the other steps will work.

# count words in one of our "books"
rule count_words:
    input: 	
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    threads: 4
    shell:
        '''
        echo "Running {input.wc} with {threads} cores on {input.book}." &&
            python {input.wc} {input.book} {output}
        '''
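To see the short-circuit behavior of && in isolation, here is a small Python sketch that runs such a chain through the shell (assuming a POSIX shell, where false always exits with a non-zero status):

```python
import subprocess

# 'false' exits with a non-zero status, so '&&' skips the echo entirely.
result = subprocess.run(
    "false && echo 'this is never printed'",
    shell=True, capture_output=True, text=True,
)
print(result.returncode)    # non-zero: the chain stopped at 'false'
print(repr(result.stdout))  # '' because echo never ran
```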

Managing other types of resources

Not all compute resources are CPUs. Examples might include limited amounts of RAM, number of GPUs, database locks, or perhaps we simply don’t want multiple processes writing to the same file at once. All non-CPU resources are handled using the resources keyword.

For our example, let’s pretend that creating a plot with plotcount.py requires dedicated access to a GPU (it doesn’t), and only one GPU is available. How do we indicate this to Snakemake so that it knows to give dedicated access to a GPU for rules that need it? Let’s modify the make_plot rule as an example:

# create a plot for each book
rule make_plot:
    input:
        plotcount='plotcount.py',
        book='dats/{file}.dat'
    output: 'plots/{file}.png'
    resources: gpu=1
    shell:  'python {input.plotcount} {input.book} {output}'

We can execute our pipeline using the following (using 8 cores and 1 gpu):

snakemake clean
snakemake -j 8 --resources gpu=1
Provided cores: 8
Rules claiming more threads will be scaled down.
Provided resources: gpu=1
# other output removed for brevity

Resources are entirely arbitrary - like wildcards, they can be named anything. Snakemake knows nothing about them aside from the fact that they have a name and a value. In this case gpu indicates simply that there is a resource called gpu used by make_plot. We provided 1 gpu to the workflow, and the gpu is considered in use as long as the rule is running. Once the make_plot rule completes, the gpu it consumed is added back to the pool of available gpus. To be extra clear: gpu in this case does not actually represent a GPU, it is an arbitrary limit used to prevent multiple tasks that use a gpu from executing at the same time.
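The bookkeeping Snakemake does here can be sketched with a toy model (an illustration only, not Snakemake’s actual code): a job may start only if its claim, added to what running jobs already hold, still fits in the provided pool.

```python
# Toy model of resource accounting: a job starts only if its claim,
# plus what running jobs already hold, fits within what was provided.
def can_start(claim, provided, running):
    for name, amount in claim.items():
        in_use = sum(job.get(name, 0) for job in running)
        if in_use + amount > provided.get(name, 0):
            return False
    return True

provided = {"gpu": 1}
running = []

print(can_start({"gpu": 1}, provided, running))  # True: the gpu is free
running.append({"gpu": 1})                       # a make_plot job starts
print(can_start({"gpu": 1}, provided, running))  # False: gpu is in use
running.pop()                                    # the job finishes
print(can_start({"gpu": 1}, provided, running))  # True again
```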

But what happens if we run our pipeline without specifying the number of GPUs?

snakemake clean
snakemake -j 8
Provided cores: 8
Rules claiming more threads will be scaled down.
Unlimited resources: gpu

If you have specified that a rule needs a certain resource, but do not specify how many you have, Snakemake will assume that the resources in question are unlimited.

Other uses for resources

Resources do not have to correspond to actual compute resources. Perhaps one rule is particularly I/O heavy, and it’s best if only a limited number of these jobs run at a time. Or maybe a type of rule uses a lot of network bandwidth as it downloads data. In cases like these, resources can be used to constrain access to arbitrary compute resources so that each rule can run at its most efficient. Snakemake will run your rules in such a way as to maximize throughput given your resource constraints.

Key Points