Tips and Tricks

This guide takes you through some advanced features and patterns that can be used in gwf. Remember that gwf is just a way of generating workflows using the Python programming language, so many of these patterns simply use plain Python code to abstract and automate things.

Iterating Over a Parameter Space

Say that you have a workflow that runs a program with many different combinations of parameters, e.g. the parameters xs, ys, and zs. Each parameter can take multiple values:

xs = [0, 1, 2, 4, 5]
ys = ['cold', 'warm']
zs = [0.1, 0.2, 0.3, 0.4, 0.5]

We now want to run our program simulate with all possible combinations of these parameters. To do this, we’ll use the Python function itertools.product() to create an iterator over all combinations of the parameters:

import itertools

parameter_space = itertools.product(xs, ys, zs)

We can then iterate over the parameter space:

from gwf import Workflow

gwf = Workflow()

for x, y, z in parameter_space:
    gwf.target(
        name='sim_{}_{}_{}'.format(x, y, z),
        inputs=['input.txt'],
        outputs=['output_{}_{}_{}.txt'.format(x, y, z)],
    ) << """
    ./simulate {} {} {}
    """.format(x, y, z)

Combining itertools.product() with gwf’s map() function can be even nicer, since it lets you express the target as a reusable template.
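
Here is a minimal sketch of that approach, assuming that map() accepts a list of dicts and unpacks each dict as keyword arguments to the template function (see the gwf documentation on function templates and map() for the exact semantics):

import itertools

from gwf import AnonymousTarget, Workflow

gwf = Workflow()

def simulate(x, y, z):
    # A function template: builds an anonymous target for one combination.
    inputs = ['input.txt']
    outputs = ['output_{}_{}_{}.txt'.format(x, y, z)]
    options = {}
    spec = './simulate {} {} {}'.format(x, y, z)
    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec)

# One target per parameter combination; gwf derives the target names.
gwf.map(simulate, [dict(x=x, y=y, z=z) for x, y, z in itertools.product(xs, ys, zs)])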

Dynamically Generating a Workflow

We can make our workflows more reusable by generating them dynamically. For example, we may wish to make it easy for others to change the inputs to our workflow, or to let users specify a different output directory. When generating a workflow dynamically you can parameterize it in essentially any way you want, and in combination with including workflows in other workflows, this allows for extremely powerful composition.

To dynamically generate a workflow, we simply create a function which builds the workflow and returns it:

import os.path
from gwf import Workflow

def my_fancy_workflow(output_dir='outputs/'):
    # Create an empty workflow object.
    w = Workflow()

    # Add targets to the workflow object, respecting the value of `output_dir`.
    foo_output = os.path.join(output_dir, 'output1.txt')
    w.target(
        name='Foo',
        inputs=['input.txt'],
        outputs=[foo_output],
    ) << """
    ./run_foo > {}
    """.format(foo_output)

    bar_output = os.path.join(output_dir, 'output2.txt')
    w.target(
        name='Bar',
        inputs=[foo_output],
        outputs=[bar_output],
    ) << """
    ./run_bar {} > {}
    """.format(foo_output, bar_output)

    # Now return the workflow.
    return w

You can put this function in a file next to your workflow, or in any other place from which you can import it. In this case, let’s put it in a file called fancy.py next to workflow.py.

In workflow.py we can then use the workflow as follows:

from fancy import my_fancy_workflow

gwf = my_fancy_workflow()

We can now run the workflow as usual:

$ gwf run

However, we can now easily change the output directory:

from fancy import my_fancy_workflow

gwf = my_fancy_workflow(output_dir='new_outputs/')
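
Since my_fancy_workflow() returns an ordinary Workflow object, it can also be composed into a larger workflow, as mentioned above. Here is a minimal sketch, assuming that Workflow.include() accepts another Workflow instance (check the gwf documentation for the exact semantics):

from fancy import my_fancy_workflow
from gwf import Workflow

gwf = Workflow()

# Pull the targets of the dynamically generated workflow into this one.
# Sketch only: assumes include() accepts a Workflow instance.
gwf.include(my_fancy_workflow(output_dir='fancy_outputs/'))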

Parameterizing the workflow can also let the user deactivate parts of it. For example, imagine that Bar generates summary files that may not always be needed. In this case, we can let the user choose to leave it out:

import os.path
from gwf import Workflow

def my_fancy_workflow(output_dir='outputs/', summarize=True):
    # Create an empty workflow object.
    w = Workflow()

    # Add targets to the workflow object, respecting the value of `output_dir`.
    foo_output = os.path.join(output_dir, 'output1.txt')
    w.target(
        name='Foo',
        inputs=['input.txt'],
        outputs=[foo_output],
    ) << """
    ./run_foo > {}
    """.format(foo_output)

    # Only create target `Bar` if we want to summarize the data.
    if summarize:
        bar_output = os.path.join(output_dir, 'output2.txt')
        w.target(
            name='Bar',
            inputs=[foo_output],
            outputs=[bar_output],
        ) << """
        ./run_bar {} > {}
        """.format(foo_output, bar_output)

    # Now return the workflow.
    return w

In workflow.py we can then use the workflow as follows:

from fancy import my_fancy_workflow

gwf = my_fancy_workflow(summarize=False)

External Configuration of Workflows

In the previous section we saw how we can parameterize workflows. However, in some cases we may want to let the user of our workflow specify the parameters without touching any Python code at all. That is, we want an external configuration file.

The configuration format could be anything, but in this example we’ll use JSON. This is what our configuration file is going to look like:

{
    "output_dir": "some_output_directory/",
    "summarize": true
}

We put this file next to workflow.py, e.g. as config.json. We can now read the configuration using the Python json module in workflow.py:

import json
from fancy import my_fancy_workflow

with open('config.json') as config_file:
    config = json.load(config_file)

gwf = my_fancy_workflow(
    output_dir=config['output_dir'],
    summarize=config['summarize'],
)

We can now change the values in config.json and run the workflow as usual.
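
If you would rather let missing keys in config.json fall back to the defaults declared by my_fancy_workflow(), you can unpack the configuration as keyword arguments instead; this works as long as every key in the file matches a parameter name:

import json

from fancy import my_fancy_workflow

with open('config.json') as config_file:
    config = json.load(config_file)

# Keys absent from config.json fall back to the function's defaults;
# keys that do not match a parameter name will raise a TypeError.
gwf = my_fancy_workflow(**config)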

Large Workflows

While gwf can handle quite large workflows without any problems, a few things may cause significant pain when working with very large workflows, especially workflows with many (> 50000) targets producing many files. How much pain depends hugely on your filesystem, since most scalability problems are caused by the time it takes gwf to access the filesystem when scheduling targets.

In this section we will show a few tricks for handling very large workflows.

I have to run the same pipeline for a lot of files and running gwf status is very slow.

In this case gwf is probably slow because computing the dependency graph for your entire workflow takes a while and because gwf needs to access the filesystem for each input and output file in the workflow to check if any targets should be re-run.

One solution to this problem is to dynamically generate individual workflows for each input file, as shown here:

from gwf import Workflow

# In practice the sample names might be discovered dynamically, e.g. with
# glob.glob(), instead of being hardcoded.
data_files = ['Sample1', 'Sample2', 'Sample3']
for input_file in data_files:
    workflow_name = 'Analyse.{}'.format(input_file)

    wf = Workflow(name=workflow_name)
    wf.target('{}.Filter'.format(input_file), inputs=[input_file], outputs=[...]) << """..."""
    wf.target('{}.ComputeSummaries'.format(input_file), ...) << """..."""

    # Expose the workflow under its name at module level so that gwf can
    # find it with `gwf -f workflow.py:<workflow name>`.
    globals()[workflow_name] = wf

You can now run the workflow for a single sample by specifying the name of the workflow:

$ gwf -f workflow.py:Analyse.Sample1 run

This will only run the targets associated with Sample1. While this means that running all workflows in one go involves a bit more work (see below), it also means that gwf only has to compute the dependency graph and check timestamps for the targets associated with the selected sample.
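
If you do want to run all of the per-sample workflows in one go, a simple shell loop over the sample names will do (using the three samples from the sketch above):

$ for sample in Sample1 Sample2 Sample3; do gwf -f workflow.py:Analyse.$sample run; done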