Analysis pipelines with Python

Python is one of the most versatile programming languages available. One of its most useful features is its ability to tie tools together and automate the execution of other programs.
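As a small illustration of this "glue" role, Python's standard library can launch another program and capture its output. The snippet below is just an example (it runs the Python interpreter itself so that it works on any system):

```python
import subprocess
import sys

# Run another program and capture its output as text.
# check=True raises an error if the program exits with a failure code.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # → hello from a subprocess
```

Tools like Snakemake build on exactly this idea: each step of a pipeline is a command to be run, and Python code decides when and how to run it.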

This tutorial focuses on using Python in high-performance computing environments to automate data analysis pipelines with Snakemake (for a detailed discussion of why we are teaching Snakemake, see this lesson's discussion page). We'll start with the basics and cover everything you need to get started. Some elements of writing performance-oriented code will be covered, but that is not the main focus. There is no prerequisite knowledge for this tutorial, although some prior experience with the command line or a compute cluster will be very helpful.

By the end of this lesson, you will know how to write and run Python programs, and how to use Snakemake to build reproducible analysis pipelines and scale them across multiple cores and an HPC cluster.

NOTE: This is the draft HPC Carpentry release. Comments and feedback are welcome.

Setup

You will want to have Python 3 and your favorite Python editor installed and ready to go. If you don't know where to get things or what to install, just install Anaconda (the Python 3 version) from https://www.continuum.io/downloads. Anaconda is an extremely comprehensive Python distribution that comes with Python 3 and a large collection of commonly used packages and tools preinstalled.
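Whichever distribution you choose, you can verify the installation from a terminal. Note that on some systems the command is `python` rather than `python3`:

```shell
# Print the interpreter version; any output starting with "Python 3" is fine
python3 --version
```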

To install Snakemake, run the following in a command-line terminal:

pip install --user snakemake

The files used in this lesson can be downloaded here.

Schedule

Setup Download files required for the lesson
00:00 1. Basic syntax Where do I start?
00:30 2. Scripts and imports What is a Python program?
01:00 3. Numpy arrays and lists How do we store large amounts of data?
01:30 4. Storing data with dicts How do I store structured data?
02:00 5. Functions and Conditions How do I write functions?
02:30 6. Introduction to parallel computing How do I run code in parallel?
03:00 7. Introduction to Snakemake How can I make my results easier to reproduce?
03:30 8. Snakefiles How do I write a simple workflow?
04:00 9. Wildcards How can I abbreviate the rules in my pipeline?
04:45 10. Pattern Rules How can I define rules to operate on similar files?
05:00 11. Snakefiles are Python code How can I automatically manage dependencies and outputs? How can I use Python code to add features to my pipeline?
05:45 12. Resources and parallelism How do I scale a pipeline across multiple cores? How do I manage access to resources while working in parallel?
06:30 13. Scaling a pipeline across a cluster How do I run my workflow on an HPC system?
07:15 14. Final notes What are some tips and tricks I can use to make this easier?
07:45 Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
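As a preview of where the lesson is headed, a Snakefile for a minimal two-step pipeline might look like the sketch below. The filenames and commands here are hypothetical stand-ins, not part of the lesson's actual files:

```python
# A rule naming the final result we want Snakemake to produce
rule all:
    input:
        "results/summary.txt"

# Count the words in an input file
rule count_words:
    input:
        "data/book.txt"
    output:
        "results/counts.txt"
    shell:
        "wc -w {input} > {output}"

# Summarize the counts into the final result
rule summarize:
    input:
        "results/counts.txt"
    output:
        "results/summary.txt"
    shell:
        "sort -n {input} > {output}"
```

Snakemake reads a file like this, works out which steps are needed to produce the requested outputs, and runs only the commands whose results are missing or out of date.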