Tutorial

In this tutorial we will explore various concepts in gwf. We will define workflows and see how gwf can help us keep track of the progress of workflow execution, the output of targets and dependencies between targets. Have fun!

First, let’s install gwf in its own conda environment. Create a new environment for your project; we’ll call it myproject.

$ conda create -n myproject gwf
$ conda activate myproject

You should now be able to run the following command.

$ gwf --help

This should show you the commands and options available through gwf.

Caution

You may see an error similar to this when you try running gwf:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2020' in
position 477: character maps to <undefined>

This error occurs because your environment isn’t configured to use UTF-8 as the default encoding. To fix the error, insert the following lines into your .bashrc file:

export LANG=en_US.utf8
export LC_ALL=en_US.utf8

If you’re not in the US you may want to set it to something else. For example, if you’re in Denmark you may want to use the following configuration:

export LANG=da_DK.utf8
export LC_ALL=da_DK.utf8

A Minimal Workflow

To get started we must define a workflow file containing a workflow to which we can add targets. Unless told otherwise, gwf assumes that the workflow file is called workflow.py and that the workflow is called gwf:

from gwf import Workflow

gwf = Workflow()

gwf.target('MyTarget', inputs=[], outputs=[]) << """
echo hello world
"""

In the example above we define a workflow and then add a target called MyTarget. A target is a single unit of computation that uses zero or more files (inputs) and produces zero or more files (outputs).

The target defined above does not use any files and doesn’t produce any files either. It does run a single command (echo hello world), but the output of the command is thrown away. Let’s fix that! Change the target definition to this:

gwf.target('MyTarget', inputs=[], outputs=['greeting.txt']) << """
echo hello world
"""

This tells gwf that the target will create a file called greeting.txt when it is run. However, the target does not actually create the file yet. Let’s fix that too:

gwf.target('MyTarget', inputs=[], outputs=['greeting.txt']) << """
echo hello world > greeting.txt
"""

There you go! We have now declared a workflow with one target and that target creates the file greeting.txt with the line hello world in it. Now let’s try to run our workflow…

Running Your First Workflow

First, let’s make a directory for our project. We’ll call the directory myproject. Now create an empty file called workflow.py in the project directory and paste the workflow specification into it:

from gwf import Workflow

gwf = Workflow()

gwf.target('MyTarget', inputs=[], outputs=['greeting.txt']) << """
echo hello world > greeting.txt
"""

We’re now ready to run our workflow. However, gwf does not actually execute the targets in a workflow itself; it only schedules them using a backend. This may sound cumbersome, but it enables gwf to run workflows in very different environments: anything from your laptop to a cluster with thousands of cores available.

For this tutorial we just want to run our workflows locally. To do this we can use the built-in local backend. Essentially, this backend lets you run workflows utilizing all cores of your computer, which makes it very useful for small workflows that don’t require a lot of resources.

Note

If you’re running gwf on a cluster you may want to use a backend that can submit targets to your cluster’s queueing system/workload manager, such as Slurm. For example, to use the Slurm backend, run the command:

$ gwf config set backend slurm

If you’re using the Slurm backend you don’t have to start any workers, so you can skip the next step.

First, open another terminal window and navigate to the myproject directory. Then run the command:

$ gwf workers
Started 4 workers, listening on port 12345

This will start a pool of workers that gwf can now submit targets to. Switch back to the other terminal and then run:

$ gwf run
Scheduling target MyTarget
Submitting target MyTarget

gwf schedules and then submits MyTarget to the pool of workers you started in the other terminal window. The output says that gwf considered the target for execution and decided to submit it to the backend, because the output file, greeting.txt, does not already exist.

Within a few seconds you should see greeting.txt in the project directory. Try to open it in your favorite text editor!

Now try the same command again:

$ gwf run
Scheduling target MyTarget

This time, gwf considers the target for submission, but decides not to submit it since all of the output files (only one in this case) exist.

Note

When you’ve completed this tutorial, you’ll probably want to stop the local workers. To do this, switch to the terminal where you started the workers and press Control-c.

Setting the Default Verbosity

Maybe you got tired of seeing this much output from gwf all the time, despite the pretty colors. We can change the verbosity (how chatty gwf is) using the -v/--verbose flag:

$ gwf -v warning run

Now gwf only prints warnings. However, it quickly gets annoying to type this again and again, so let’s configure gwf to make warning the default verbosity level.

$ gwf config set verbose warning
$ gwf run

As we’d expect, gwf produces the same output as before, but this time we didn’t have to pass the -v warning flag!

We can configure other aspects of gwf through the config command. For more details, refer to the Configuration page.
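
For example, to inspect a setting and then change it (a sketch; the get subcommand is assumed alongside set, which we used above):

$ gwf config get backend
$ gwf config set backend slurm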

Debugging a Workflow

If your workflow doesn’t look right or you find that e.g. gwf status doesn’t show the right thing, you may need to debug your workflow. The first thing to try is to increase the verbosity level:

$ gwf -v debug status

This will show you exactly what gwf is thinking about each target in your workflow. Why should it run? Why should it not run?

To investigate further you can always use print() in your code and run your workflow.py as a normal Python script:

$ python workflow.py
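
For example, with the minimal workflow from earlier, you might print a target’s attributes to check that they are what you expect (a sketch; it assumes, as the examples further below suggest, that gwf.target(...) << ... returns the target object with attributes like outputs and options):

from gwf import Workflow

gwf = Workflow()

# Keep a reference to the target so that we can inspect it.
target = gwf.target('MyTarget', inputs=[], outputs=['greeting.txt']) << """
echo hello world > greeting.txt
"""

# Printed when running `python workflow.py`.
print(target.outputs)
print(target.options)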

In the end, there’s not a big difference between debugging a gwf workflow and normal Python code.

Defining Targets with Dependencies

Targets in gwf represent isolated units of work. However, we can declare dependencies between targets to construct complex workflows. A target B that depends on a target A will only run when A has been run successfully (that is, if all of the output files of A exist).

In gwf, dependencies are declared through file dependencies. This is best understood through an example:

from gwf import Workflow

gwf = Workflow()

gwf.target('TargetA', inputs=[], outputs=['x.txt']) << """
echo "this is x" > x.txt
"""

gwf.target('TargetB', inputs=[], outputs=['y.txt']) << """
echo "this is y" > y.txt
"""

gwf.target('TargetC', inputs=['x.txt', 'y.txt'], outputs=['z.txt']) << """
cat x.txt y.txt > z.txt
"""

In this workflow, TargetA and TargetB each produce a file. TargetC declares that it needs two files as inputs. Since the file names match the file names produced by TargetA and TargetB, TargetC depends on these two targets.

Let’s try to run this workflow:

$ gwf run
Scheduling target TargetC
Scheduling dependency TargetA of TargetC
Submitting target TargetA
Scheduling dependency TargetB of TargetC
Submitting target TargetB
Submitting target TargetC

(If you made warning the default verbosity in the previous section, run gwf -v info run to see this output.)

Notice that gwf first attempts to submit TargetC. However, because of the file dependencies, it first schedules each dependency and submits those to the backend. It then submits TargetC and makes sure that it will only run when both TargetA and TargetB have been run. If we decided that we needed to re-run TargetC, but not TargetA and TargetB, we could just delete z.txt and run gwf run again. gwf will automatically figure out that it only needs to run TargetC again and submit it to the backend.
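
To see this in action, delete z.txt and run the workflow again:

$ rm z.txt
$ gwf run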

What happens if we do something nonsensical like declaring a cyclic dependency? Let’s try:

from gwf import Workflow

gwf = Workflow()

gwf.target('TargetA', inputs=['x.txt'], outputs=['x.txt']) << """
echo "this is x" > x.txt
"""

Run this workflow. You should see the following:

Error: Target TargetA depends on itself.

Named Inputs and Outputs

Added in version 1.6.0: Prior versions only allowed lists of inputs and outputs.

The inputs and outputs arguments can be a string, a list, or a dictionary. If a dictionary is given, the keys act as names for the files, and the values may be either strings or lists of strings:

foo = gwf.target(
    name='foo',
    inputs={'A': ['a1', 'a2'], 'B': 'b'},
    outputs={'C': ['a1b', 'a2b'], 'D': 'd'},
)

This is especially useful for referring to the outputs of a target:

bar = gwf.target(
    name='bar',
    inputs=foo.outputs['C'],
    outputs='result',
)

Using named inputs and outputs also makes the workflow more readable since associated files can be grouped and named.

Specifying Target Resources

It’s a good idea to specify the resources required by your target. Backends like Slurm will use these resource limits to allocate a suitable node for you, prioritize work, and cancel your targets if they exceed the given limits.

The resources you can specify depend on the backend. For example, the local backend does not support any target options and will ignore them completely. The slurm backend supports a number of target options, which are listed in its documentation under the Target options header.

For example, if you are using the slurm backend you can specify that you need 8 cores and 64 GB of memory like this:

foo = gwf.target(
    name='foo',
    inputs={'A': ['a1', 'a2'], 'B': 'b'},
    outputs={'C': ['a1b', 'a2b'], 'D': 'd'},
    cores=8,
    memory='64gb',
)

print(foo.options)
# => {'cores': 8, 'memory': '64gb'}

Target options can also be given defaults that apply to the whole workflow. To request 8 cores for all targets in your workflow, pass the defaults argument when initializing your workflow:

gwf = Workflow(defaults={'cores': 8})

foo = gwf.target(
    name='foo',
    inputs={'A': ['a1', 'a2'], 'B': 'b'},
    outputs={'C': ['a1b', 'a2b'], 'D': 'd'},
)

print(foo.options)
# => {'cores': 8}

Observing Target Execution

As workflows get larger, they may take a very long time to run. With gwf it’s easy to see how many targets have been completed, how many failed, and how many are still running using the gwf status command. We’ll modify the workflow from earlier to fake that each target takes some time to run:

from gwf import Workflow

gwf = Workflow()

gwf.target('TargetA', inputs=[], outputs=['x.txt']) << """
sleep 20 && echo "this is x" > x.txt
"""

gwf.target('TargetB', inputs=[], outputs=['y.txt']) << """
sleep 30 && echo "this is y" > y.txt
"""

gwf.target('TargetC', inputs=['x.txt', 'y.txt'], outputs=['z.txt']) << """
sleep 10 && cat x.txt y.txt > z.txt
"""

Now run gwf status (remember to remove x.txt, y.txt, and z.txt first, otherwise gwf will not submit the targets again). You should see something like this, but with pretty colors:

⨯ TargetA      0.00%    spec has changed
⨯ TargetB      0.00%    spec has changed
⨯ TargetC      0.00%    a dependency was scheduled

Each target in the workflow is shown on a separate line. We can see the status of the target (⨯ meaning that the target is incomplete) and its percentage of completion. In this case, gwf also tells us that the first two targets must run because their spec changed (since it’s the first time running the workflow).

The percentage tells us how many dependencies of the target have been completed. If all dependencies of the target, and the target itself, have been completed, the percentage will be 100%.

Let’s try to run the workflow and see what happens.

$ gwf run
$ gwf status
↻ TargetA      0.00%    is running
- TargetB      0.00%    has been submitted
- TargetC      0.00%    has been submitted

Now the first target is running and the other targets have been submitted and are waiting to run. Running the status command again after some time should show something like this.

✓ TargetA    100.00%    not scheduled because it is a source
✓ TargetB    100.00%    not scheduled because it is a source
↻ TargetC     66.67%    is running

Now the first two targets have completed and TargetC is running. We’re also told that, if we run gwf run again, TargetA and TargetB will not be submitted because they’re both “sources”, that is, they don’t have any input files.

After a while, all targets should have completed.

✓ TargetA    100.00%    not scheduled because it is a source
✓ TargetB    100.00%    not scheduled because it is a source
✓ TargetC    100.00%    is up-to-date

Here are a few neat things you should know about the status command:

  • If you only want to see endpoints (targets that no other targets depend on), you can use the --endpoints flag.

  • You can use wildcards in target names. For example, gwf status 'Foo*' will list all targets beginning with Foo. You can specify multiple targets/patterns by separating them with a space. This also works in the cancel and clean commands (but remember the quotes around the pattern)!

  • Only want to see which targets are running? You can filter targets by their status using e.g. gwf status -s running. You can also combine filters, e.g. gwf status --endpoints --status running 'Align*' to show all endpoints that are running and whose name starts with Align.
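
A few concrete invocations combining these options (the target names are illustrative):

$ gwf status --endpoints
$ gwf status 'Foo*' 'Bar*'
$ gwf status --endpoints --status running 'Align*'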

For more details, you can always refer to the built-in help with gwf status --help.

What Happens When a Target Fails?

We all make mistakes. Sometimes there’s a mistake in your target specification which causes the target execution to fail. The target could also fail because it exceeded the allocated resource limits or took too long to run, exceeding the defined walltime.

When a target fails, there are two possible outcomes:

  • the target did not create all of its output files, or

  • the target created all of its output files, but the output is incomplete.

In the first case, gwf will notice that the output files still do not exist and show the target status as shouldrun. The second case is harder, since gwf will think that the target completed successfully (because all of the output files exist and are newer than the input files). In this case you will need to remove the incomplete output files and re-run the workflow.

You can prevent the second outcome from ever happening by only creating your output files at the very end of your target. For example (a sketch reconstructing this pattern; the target name, filter_data.py, and the file names are illustrative):
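
gwf.target('Example', inputs=['a.txt'], outputs=['b.txt']) << """
python filter_data.py a.txt > b.txt.tmp
mv b.txt.tmp b.txt
"""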

We write the output to a temporary file. If filter_data.py fails or the target is killed, b.txt will not exist, and gwf will correctly show that Example should run again. If the script succeeds and the target is not killed, the temporary file is renamed to b.txt and gwf will show the target as completed.

Reusable Targets with Templates

Often you will want to reuse a target definition for a lot of different files. For example, you may have two files with reads that you need to map to a reference genome. The mapping is the same for the two files, so it would be annoying to repeat it in the workflow specification.

Instead, gwf allows us to define a template which can be used to generate one or more targets easily. In general, a template is just a function which returns four things:

  1. the input files,

  2. the output files,

  3. a dictionary with options for the target that is to be generated, for example how many cores the target needs,

  4. a string containing the specification of the target that is to be generated.

Templates are great because they allow you to reuse functionality and encapsulate target creation logic. Let’s walk through the example above.

Note

Code and data files for this example are available here. To get started, follow these steps:

  1. Change your working directory to the readmapping directory.

  2. Run conda env create to create a new environment called readmapping. This will install all required packages, including gwf itself, samtools and bwa.

  3. Activate the environment with source activate readmapping.

  4. Open another terminal and navigate to the same directory.

  5. Activate the environment in this terminal too, using the same command as above.

  6. Start a pool with two workers with gwf workers -n 2.

  7. Jump back to the first terminal. Configure gwf to use the local backend for this project using gwf config set backend local.

  8. You should now be able to run gwf status and all of the other gwf commands used in this tutorial.

Our reference genome is stored in ponAbe2.fa.gz, so we’ll need to unzip it first. Let’s write a template that unpacks files.

from gwf import AnonymousTarget


def unzip(inputfile, outputfile):
    """A template for unzipping files."""
    inputs = [inputfile]
    outputs = [outputfile]
    options = {
        'cores': 1,
        'memory': '2g',
    }

    spec = '''
    gzcat {} > {}
    '''.format(inputfile, outputfile)

    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec)

This is just a normal Python function that returns an AnonymousTarget (imported from gwf). The function takes two arguments, the name of the input file and the name of the output file. In the function we define the input and output files, a dictionary defining the options of targets created with this template, and a string describing the action of the template.

We can now create a concrete target using this template:

gwf.target_from_template(
    name='UnzipGenome',
    template=unzip(
        inputfile='ponAbe2.fa.gz',
        outputfile='ponAbe2.fa'
    )
)

You could run the workflow now. The UnzipGenome target would be scheduled and submitted, and after a few seconds you should have a ponAbe2.fa file in the project directory.

Let’s now define another template for indexing a genome.

def bwa_index(ref_genome):
    """Template for indexing a genome with `bwa index`."""
    inputs = ['{}.fa'.format(ref_genome)]
    outputs = ['{}.amb'.format(ref_genome),
               '{}.ann'.format(ref_genome),
               '{}.pac'.format(ref_genome),
               '{}.bwt'.format(ref_genome),
               '{}.sa'.format(ref_genome),
               ]
    options = {
        'cores': 16,
        'memory': '1g',
    }

    spec = """
    bwa index -p {ref_genome} -a bwtsw {ref_genome}.fa
    """.format(ref_genome=ref_genome)

    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec)

This template looks more complicated, but really it’s the same thing as before. We define the inputs and outputs, a dictionary with options and a string with the command that will be executed.

Let’s use this template to create a target for indexing the reference genome:

gwf.target_from_template(
    name='IndexGenome',
    template=bwa_index(
        ref_genome='ponAbe2'
    )
)

Finally, we’ll create a template for actually mapping the reads to the reference.

def bwa_map(ref_genome, r1, r2, bamfile):
    """Template for mapping reads to a reference genome with `bwa` and `samtools`."""
    inputs = [r1, r2,
              '{}.amb'.format(ref_genome),
              '{}.ann'.format(ref_genome),
              '{}.pac'.format(ref_genome),
             ]
    outputs = [bamfile]
    options = {
        'cores': 16,
        'memory': '1g',
    }

    spec = '''
    bwa mem -t 16 {ref_genome} {r1} {r2} | \
    samtools sort | \
    samtools rmdup -s - {bamfile}
    '''.format(ref_genome=ref_genome, r1=r1, r2=r2, bamfile=bamfile)

    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec)

This is much the same as the previous template. Here’s how we’re going to use it:

gwf.target_from_template(
    name='MapReads',
    template=bwa_map(
        ref_genome='ponAbe2',
        r1='Masala_R1.fastq.gz',
        r2='Masala_R2.fastq.gz',
        bamfile='Masala.bam'
    )
)

As you can see, templates are just normal Python functions and thus they can be inspected and manipulated in much the same way. Also, templates can be put into modules and imported into your workflow files to facilitate reuse. It’s all up to you!
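
For example, the unzip template could live in its own module and be imported into the workflow file (a sketch; the module name templates.py is arbitrary):

# templates.py
from gwf import AnonymousTarget


def unzip(inputfile, outputfile):
    """A template for unzipping files."""
    options = {'cores': 1, 'memory': '2g'}
    spec = 'gzcat {} > {}'.format(inputfile, outputfile)
    return AnonymousTarget(
        inputs=[inputfile], outputs=[outputfile], options=options, spec=spec
    )

# workflow.py
from gwf import Workflow
from templates import unzip

gwf = Workflow()

gwf.target_from_template(
    name='UnzipGenome',
    template=unzip(inputfile='ponAbe2.fa.gz', outputfile='ponAbe2.fa'),
)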

Viewing Logs

We may be curious about what the MapReads target wrote to the console when it ran, to see if there were any warnings. If a target failed, it’s also valuable to see its output to diagnose the problem. Luckily, gwf makes this very easy.

$ gwf logs MapReads

When you run this command you’ll see nothing. This is because gwf logs by default only shows what the target wrote to stdout, not stderr, and this target apparently wrote nothing to stdout. Let’s take a look at stderr instead by applying the --stderr flag (or the short version -e).

$ gwf logs --stderr MapReads
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 15000 sequences (1500000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 65, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (313, 369, 429)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (81, 661)
[M::mem_pestat] mean and std.dev: (372.88, 86.21)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 777)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 15000 reads in 1.945 CPU sec, 0.678 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -t 16 ponAbe2 Masala_R1.fastq.gz Masala_R2.fastq.gz
[main] Real time: 0.877 sec; CPU: 2.036 sec

We can do this for any target in our workflow. The logs shown are always the most recent ones since gwf does not archive logs from old runs of targets.

Cleaning Up

Now that we have run our workflow we may wish to remove intermediate files to save disk space. In gwf we can use the gwf clean command for this:

$ gwf clean

By default, this command only removes files produced by targets that are not endpoints (an endpoint is a target which no other target depends on), leaving the final outputs of your workflow intact:

$ gwf clean
Will delete 1.3MiB of files!
Deleting output files of IndexGenome
Deleting file "/Users/das/Code/gwf/examples/readmapping/ponAbe2.amb" from target "IndexGenome"
Deleting file "/Users/das/Code/gwf/examples/readmapping/ponAbe2.ann" from target "IndexGenome"
Deleting file "/Users/das/Code/gwf/examples/readmapping/ponAbe2.pac" from target "IndexGenome"
Deleting file "/Users/das/Code/gwf/examples/readmapping/ponAbe2.bwt" from target "IndexGenome"
Deleting file "/Users/das/Code/gwf/examples/readmapping/ponAbe2.sa" from target "IndexGenome"
Deleting output files of UnzipGenome
Deleting file "/Users/das/Code/gwf/examples/readmapping/ponAbe2.fa" from target "UnzipGenome"

We can tell gwf to remove all files, including those produced by endpoints, by running gwf clean --all.

Protecting Files From Being Cleaned Up

You can protect output files from being cleaned up:

gwf.target('TargetA', inputs=['a'], outputs=['b', 'c', 'd'], protect=['d']) << """
...
"""

Now, when running gwf clean, the file d will not be deleted.

A Note About Reproducibility

Reproducibility is an important part of research. Since gwf workflows describe every step of your computation, how the steps are connected, and the files produced in each step, gwf is a valuable tool for making your workflows reproducible. In combination with the conda package manager and its concept of environments, you can build completely reproducible workflows in a declarative, flexible fashion.

Consider the read mapping example used above. Since we included a specification of the complete environment through an environment.yml file, which even included samtools, bwa and gwf itself, we were able to easily create a working environment with exactly the right software versions for our workflow. The whole workflow could also easily be copied to a cluster and run through e.g. the Slurm backend, since we can exactly reproduce the environment used locally.
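
The environment.yml for this example might look roughly like this (a sketch; the channels and the unpinned package list are assumptions, and pinning exact versions gives stronger reproducibility):

name: readmapping
channels:
  - conda-forge
  - bioconda
dependencies:
  - gwf
  - bwa
  - samtools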