Analysis pipelines with Python: Discussion

This lesson is heavily skewed towards teaching basic Python syntax and analysis pipelines using Snakemake. This raises a couple of questions: given the limited teaching time for these courses, why teach Snakemake over other Python concepts or tools? And why teach Python at all for high-performance computing tasks?

Why not other Python topics?

For a workshop on general data analysis or basic coding in Python, we recommend checking out one of Software Carpentry’s other workshops that focus on NumPy or pandas instead.

The goal of this workshop is to teach Python in the context of high-performance computing. Of course, Python is not a fast language. Code written in an interpreted language like Python typically runs orders of magnitude slower than the equivalent code in a compiled language like C++, Fortran, or even Java. Though it’s possible to improve Python’s performance with tools like PyPy, Cython, etc., the level of knowledge required to use these tools effectively is far beyond what can be taught in a one-day workshop. Python isn’t the right tool for the job if fast or parallel computing is required. Instructors looking to teach heavy-duty performance and/or parallelization topics should check out our Chapel lesson instead.

So why teach Python at all?

In most scientific fields, there is a major need for automation. Workflows where the same computation needs to be repeated for thousands of input files are commonplace. This is especially true for fields like bioinformatics, where researchers need to run dozens of pre-existing programs to process a piece of data, and then repeat this process for dozens, if not hundreds or thousands, of input files. Running these types of high-throughput workflows is a significant amount of work, made even more complex by the scripting required to use an HPC cluster’s scheduler effectively.
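The "same computation, many files" pattern above can be sketched in a few lines of plain Python. The `data/` directory, the `*.txt` pattern, and the line-counting computation are all hypothetical placeholders for whatever a real analysis would do:

```python
from pathlib import Path

def process(path: Path) -> Path:
    """Hypothetical per-file computation: count lines, write a summary file."""
    out = path.with_suffix(".count")
    with path.open() as fh:
        n = sum(1 for _ in fh)
    out.write_text(f"{path.name}\t{n}\n")
    return out

# Apply the same computation to every matching input file.
inputs = sorted(Path("data").glob("*.txt"))
outputs = [process(p) for p in inputs]
```

A loop like this works fine on a laptop, but it offers no parallelism, no resume-after-failure, and no scheduler integration, which is exactly the gap a workflow manager fills.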

Python is a great scripting language, and used in combination with a workflow management tool like Snakemake, it is very simple to script the execution of these types of high-throughput, complex workflows. The goal of this workshop is to teach students how to automate their work with Python and make their workflows reproducible. Importantly, this also covers how to use Snakemake to automate submission of jobs to an HPC scheduler in a reasonable manner: no runaway submission of tens of thousands of jobs, errors stop the workflow safely without losing completed work, and logfiles and output are handled appropriately.
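To make this concrete, here is a minimal sketch of what such a workflow looks like in Snakemake. The sample names, paths, and the `some_tool` command are hypothetical; the point is the shape of the file, where each rule declares its inputs, outputs, and log, and Snakemake works out the execution order and which jobs can run in parallel:

```snakemake
# Snakefile (sketch) -- rule names, paths, and the tool invoked are placeholders
SAMPLES = ["a", "b", "c"]

rule all:
    input:
        expand("results/{sample}.out", sample=SAMPLES)

rule process:
    input:
        "data/{sample}.txt"
    output:
        "results/{sample}.out"
    log:
        "logs/{sample}.log"
    shell:
        "some_tool {input} > {output} 2> {log}"
```

Running `snakemake --jobs 10` caps the number of concurrently running jobs, and because outputs are declared, an interrupted run resumes from the last completed file rather than starting over. Depending on the Snakemake version, jobs can be handed to an HPC scheduler with an option like `--cluster "sbatch ..."` or via an executor profile.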

Why not other workflow/pipeline tools?

There are lots of other pipeline/workflow management tools out there (in fact, this lesson was adapted from Software Carpentry’s GNU Make lesson). Why teach Snakemake instead of these other tools?