.. _patterns:

===============
Tips and Tricks
===============

This guide takes you through some advanced features and patterns that can be
used in *gwf*. Remember that *gwf* is just a way of generating workflows using
the Python programming language, so many of these patterns simply use plain
Python code to abstract and automate things.

Iterating Over a Parameter Space
--------------------------------

Say that you have a workflow that runs a program with many different
combinations of parameters, e.g. the parameters *xs*, *ys*, and *zs*. Each
parameter can take multiple values:

.. code-block:: python

    xs = [0, 1, 2, 4, 5]
    ys = ['cold', 'warm']
    zs = [0.1, 0.2, 0.3, 0.4, 0.5]

We now want to run our program ``simulate`` with all possible combinations of
these parameters. To do this, we'll use the Python function
:func:`itertools.product()` to create an iterator over all combinations of the
parameters:

.. code-block:: python

    import itertools

    parameter_space = itertools.product(xs, ys, zs)

We can then iterate over the parameter space:

.. code-block:: python

    from gwf import Workflow

    gwf = Workflow()

    for x, y, z in parameter_space:
        gwf.target(
            name='sim_{}_{}_{}'.format(x, y, z),
            inputs=['input.txt'],
            outputs=['output_{}_{}_{}.txt'.format(x, y, z)],
        ) << """
        ./simulate {} {} {}
        """.format(x, y, z)

Using :func:`itertools.product()` together with ``map()`` is even nicer; a
sketch of that combination is shown below.
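Here is a minimal sketch of this approach. It assumes a template function that
returns an ``AnonymousTarget`` and the ``Workflow.map()`` method; see the
documentation on templates for the exact interface, and note that the template
name ``simulate_template`` is only illustrative.

.. code-block:: python

    import itertools

    from gwf import AnonymousTarget, Workflow

    xs = [0, 1, 2, 4, 5]
    ys = ['cold', 'warm']
    zs = [0.1, 0.2, 0.3, 0.4, 0.5]

    gwf = Workflow()


    def simulate_template(x, y, z):
        # One anonymous target per parameter combination.
        inputs = ['input.txt']
        outputs = ['output_{}_{}_{}.txt'.format(x, y, z)]
        options = {}
        spec = """
        ./simulate {} {} {}
        """.format(x, y, z)
        return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec)


    # Each dict in the list is passed to the template as keyword arguments.
    parameter_space = [
        dict(x=x, y=y, z=z)
        for x, y, z in itertools.product(xs, ys, zs)
    ]

    gwf.map(simulate_template, parameter_space)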
Dynamically Generating a Workflow
---------------------------------

We can make our workflows more reusable by generating them dynamically. For
example, we may wish to make it easy for others to change the inputs to our
workflow, or let users specify a different output directory.

When generating workflows dynamically you can essentially parameterize the
workflow in any way you want. In combination with inclusion of workflows into
other workflows, this allows for extremely powerful composition.

To dynamically generate a workflow, we simply create a function which builds
the workflow and returns it:

.. code-block:: python

    import os.path

    from gwf import Workflow


    def my_fancy_workflow(output_dir='outputs/'):
        # Create an empty workflow object.
        w = Workflow()

        # Add targets to the workflow object, respecting the value of `output_dir`.
        foo_output = os.path.join(output_dir, 'output1.txt')
        w.target(
            name='Foo',
            inputs=['input.txt'],
            outputs=[foo_output],
        ) << """
        ./run_foo > {}
        """.format(foo_output)

        bar_output = os.path.join(output_dir, 'output2.txt')
        w.target(
            name='Bar',
            inputs=[foo_output],
            outputs=[bar_output],
        )

        # Now return the workflow.
        return w

You can put this function in a file next to your workflow, or in any other
place from which you can import the function. In this case, let's put it in a
file called ``fancy.py`` next to ``workflow.py``. In ``workflow.py`` we can
then use the workflow as follows:

.. code-block:: python

    from fancy import my_fancy_workflow

    gwf = my_fancy_workflow()

We can now run the workflow as usual:

.. code-block:: shell

    $ gwf run

However, we can now easily change the output directory:

.. code-block:: python

    from fancy import my_fancy_workflow

    gwf = my_fancy_workflow(output_dir='new_outputs/')

Parameterizing the workflow can also let the user choose to deactivate parts
of the workflow. For example, imagine that ``Bar`` generates summary files
that may not always be needed. In this case, we can let the user choose to
leave it out:

.. code-block:: python

    import os.path

    from gwf import Workflow


    def my_fancy_workflow(output_dir='outputs/', summarize=True):
        # Create an empty workflow object.
        w = Workflow()

        # Add targets to the workflow object, respecting the value of `output_dir`.
        foo_output = os.path.join(output_dir, 'output1.txt')
        w.target(
            name='Foo',
            inputs=['input.txt'],
            outputs=[foo_output],
        ) << """
        ./run_foo > {}
        """.format(foo_output)

        # Only create target `Bar` if we want to summarize the data.
        if summarize:
            bar_output = os.path.join(output_dir, 'output2.txt')
            w.target(
                name='Bar',
                inputs=[foo_output],
                outputs=[bar_output],
            )

        # Now return the workflow.
        return w

In ``workflow.py`` we can then use the workflow as follows:

.. code-block:: python

    from fancy import my_fancy_workflow

    gwf = my_fancy_workflow(summarize=False)

External Configuration of Workflows
-----------------------------------

In the previous section we saw how we can parameterize workflows. However, in
some cases we may want to let the user of our workflow specify the parameters
without touching any Python code at all. That is, we want an external
configuration file. The configuration format could be anything, but in this
example we'll use JSON as the configuration format.

First, this is what our configuration file is going to look like:

.. code-block:: json

    {
        "output_dir": "some_output_directory/",
        "summarize": true
    }

We put this file next to ``workflow.py``, e.g. as ``config.json``. We can now
read the configuration using the Python ``json`` module in ``workflow.py``:

.. code-block:: python

    import json

    from fancy import my_fancy_workflow

    with open('config.json') as config_file:
        config = json.load(config_file)

    gwf = my_fancy_workflow(
        output_dir=config['output_dir'],
        summarize=config['summarize'],
    )

We can now change the values in ``config.json`` and run the workflow as usual.

Large Workflows
---------------

While *gwf* can handle quite large workflows without any problems, some things
may cause significant pain when working with very, very large workflows,
especially when the workflow has many (> 50000) targets producing many files.
However, the problems depend hugely on your filesystem, since most scalability
problems are caused by the time it takes *gwf* to access the filesystem when
scheduling targets. In this section we will show a few tricks for handling
very large workflows.

I have to run the same pipeline for *a lot* of files and running ``gwf status``
is very slow.

In this case *gwf* is probably slow because computing the dependency graph for
your entire workflow takes a while, and because *gwf* needs to access the
filesystem for each input and output file in the workflow to check whether any
targets should be re-run.

One solution to this problem is to dynamically generate individual workflows
for each input file, as shown here:

.. code-block:: python

    from gwf import Workflow

    data_files = ['Sample1', 'Sample2', 'Sample3']

    for input_file in data_files:
        workflow_name = 'Analyse.{}'.format(input_file)
        wf = Workflow(name=workflow_name)
        wf.target('{}.Filter'.format(input_file), inputs=[input_file], outputs=[...]) << """..."""
        wf.target('{}.ComputeSummaries'.format(input_file), ...) << """..."""
        # Expose each workflow as a module-level name so that gwf can find it.
        globals()[workflow_name] = wf

You can now run the workflow for a single sample by specifying the name of the
workflow:

.. code-block:: console

    $ gwf -f workflow.py:Analyse.Sample1 run

This will only run the targets associated with ``Sample1``. While this means
that running *all* workflows in one go involves a bit more work, it also means
that *gwf* will only have to compute the dependency graph and check timestamps
for the targets associated with the selected sample.
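If you do want to run all of the per-sample workflows in one go, one option is
a small shell loop over the sample names. This is just a sketch using the
sample names from the example above:

.. code-block:: console

    $ for sample in Sample1 Sample2 Sample3; do gwf -f workflow.py:Analyse.$sample run; done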