Ask Hjorth Larsen*, Mikael J. Kuisma*, Tara M. Boland, Fredrik A. Nilsson and Kristian S. Thygesen
CAMD, Computational Atomic-Scale Materials Design, Department of Physics, Technical University of Denmark, Kgs. Lyngby, 2800, Denmark. E-mail: asklarsen@gmail.com
First published on 1st August 2025
We introduce Taskblaster, a generic and lightweight Python framework for composing, executing, and managing computational workflows with automated error handling. Taskblaster supports dynamic workflows including flow control using branches and iteration, making the system Turing complete. Taskblaster aims to promote modular designs, where workflows are composed of reusable sub-workflows, and to simplify data maintenance as projects evolve and change. We discuss the main design elements including workflow syntax, a storage model based on intuitively named tasks in a nested directory tree, and command-line tools to automate and control the execution of the tasks. Tasks are executed by worker processes that may run directly in a terminal or be submitted using a queueing system, allowing for task-specific resource control. We provide a library (ASR-lib) of workflows for common materials simulations employing the Atomic Simulation Environment and the GPAW electronic structure code, but Taskblaster can equally well be used with other computational codes.
On modern hardware, it is possible to create immense amounts of computational data in a short time. As a computational project progresses, both code and parameters will change: new computations must be done, code needs adaptation to support additional parameters, or underlying computational tools change. Many such changes cause computed results to be outdated with respect to the project code, and thus either the code must be updated or results must be patched or recomputed. Rather than computation time, the bottleneck quickly becomes the ability of researchers to maintain the generated data.
Here, we introduce Taskblaster – a Python framework for executing computational workflows. Taskblaster (TB) workflows are defined using Python code. The workflow code defines a number of tasks, where each task encodes a future call to a Python function with particular inputs. Executing the workflow generates tasks and associated metadata as nodes of a directed acyclic graph (DAG) whose edges are the dependencies. Tasks can then be inspected or manipulated before configuring and launching parallel worker processes to run them. TB workflows support branching, iteration, and dynamic generation of tasks, i.e., generation of tasks that depends on the outcome of other tasks.
Projects can customise certain behaviours using a plug-in mechanism. Most importantly, this includes how TB integrates with a parallel Python environment and how custom datatypes are encoded when saving inputs and outputs.
TB adds to a growing set of workflow management tools39 of which some originate from the materials science community.20,40–49 These tools differ in many aspects including data storage and representation (e.g. database servers versus local files), protocols used for determining data equivalence/conflicts (e.g. should a piece of calculated data be recalculated or is it consistent with the current inputs?), the type of logic operations supported, the handling of dynamic tasks, the way in which the resources are allocated on the compute system, and the way computational tasks are submitted.
Given the pivotal role that (big) data will play in the future, the importance of workflow control software cannot be overstated, and its continued development should be a priority alongside conventional simulation codes. In this regard, a heterogeneous set of workflow codes can lead to cross-fertilization and help identify the most promising concepts and approaches.
Over the next sections we will discuss different aspects of Taskblaster and finally highlight features that we believe to be special. The article is structured as follows: Section 2 explains the overall design goals of TB. Section 3 describes features of TB in detail: tasks, static and dynamic workflows, data storage, configurable worker processes, input validation, and error handling. Section 4 describes ASR-lib, a library of TB workflows for atomistic high-throughput projects. Section 5 highlights specific notable features. Section 6 is a brief conclusion.
At the end of a typical computational project, there will be an immense collection of scripts and utilities, along with associated output data, tailored to that specific project. Some data may be subtly outdated due to the gradual evolution of the code. In spite of high-quality publications, it may not be clear how to reproduce the results, even if both data and code still exist. Finally, the process for reproducing the data, should someone attempt to do so, is likely to depend on many manual steps, since the original project evolved manually as well.
For a small project, that may not be an issue. However, projects with large valuable datasets are likely intended to be maintained and extended with new computations in the long term. Such projects will see generations of PhD students and postdocs making extensions and adaptations, and this requires a much higher standard for structure, transparency and documentation.
The goal of TB is to solve the problems described above. To that end, TB is designed to:
• Organize the project intuitively as a directory tree of meaningfully named tasks and workflows.
• Abstract the passage of data and files between tasks to avoid excessive coupling to filesystem paths or machine specific information.
• Work with large selections of tasks and achieve a high level of automation.
• Keep track of the task dependency tree in a way that makes it easy to see if any tasks are outdated with respect to the workflows that generated them.
Another goal of TB is to be easy to use. New projects should be easy enough to set up that researchers will not feel the temptation to develop large collections of custom project scripts, as discussed earlier. Furthermore, TB is a lightweight utility which requires no database services, network connections, or monitoring daemon processes, and works much the same whether on a laptop or a supercomputer.
However, there are also trade-offs: the desire to formally keep track of dependencies somewhat restricts the freedom to perform arbitrary processing inside workflows, since TB must be able to see any information passed between tasks in order to build the dependency tree and guarantee consistency. Hence, special constructs are needed for advanced workflow-level control flow, which otherwise might have been “ordinary” for-loops and if-statements.
The next step is to define a main workflow. In principle, the main workflow defines every computation that will happen; in practice, it is gradually written as the project progresses. The main workflow can specify tasks, which are individual computations, and it can call other workflows, or subworkflows, which may likewise specify tasks and further subworkflows. A workflow also connects outputs from tasks to inputs of other tasks, defining the DAG.
Tasks and workflows are always assigned names. When subworkflows are nested, names are likewise nested. If a workflow named A defines a subworkflow, B, which defines a task, C, then the final name of that task will be the nested name A/B/C, and its files will be stored in the correspondingly nested subdirectory under the root directory of the repository. The name of a task is therefore a global identifier for that task.
Operations on a repository are generally carried out using the TB command-line utilities, for example commands to run a workflow, to list tasks, and to run tasks. Most commands take a list of task names as input. This can include shell wildcards (glob patterns), which makes it easy to run operations on large selections of tasks. Once tasks are generated by a workflow, they can be run on the command line or submitted via myqueue43 to an HPC job manager such as Slurm50 or Torque.51 TB runs tasks from worker processes that can be configured to pick up specific sets of tasks depending on the resources required. Once tasks run, they may succeed or fail, and workers keep picking up new tasks as long as there is time left and there are compatible tasks available.
TB provides commands to remove tasks or “unrun” them. Removing a task deletes all its associated data and removes it from the DAG, whereas unrunning it only removes its output so that it can run again. Such commands work recursively on the dependency tree affecting all dependent tasks in topological order. Daily work often involves testing and revision of task implementations using many run/unrun cycles.
Furthermore, TB provides commands to list or view tasks in different levels of detail, to submit or manipulate workers, and to invoke actions that visualise or export data.
The inputs of a task can be either specific objects such as numbers or arrays, or abstract references to the outputs of other tasks, or any nested structure (lists and dictionaries for example) involving both. Representing the input as a reference to a future output allows TB to construct large parts of the dependency tree without executing the tasks. The tree can thus be freely visualized and inspected, and the user can later choose to run tasks one at a time or in arbitrary groups. TB will automatically ensure that they run in topological order following the DAG. For example, if the user tries to run task B that depends on task A that has not yet been run, TB will first run task A before running task B.
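To illustrate the idea (with made-up class and field names; this is not TB's internal representation), a reference to a future output can be pictured as a small record naming the producing task, embedded inside an otherwise ordinary input structure and resolved only when the consuming task runs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FutureRef:
    """Illustrative stand-in for a reference to another task's not-yet-computed output."""
    task_name: str        # e.g. 'material/relax' (hypothetical task name)
    attribute: str = ''   # optional attribute/index path into that task's output

# A task input may mix concrete values with references; each reference implies an
# edge in the DAG, and its value is loaded only when the dependent task actually runs.
inputs = {'atoms': FutureRef('material/relax'), 'kpts': [8, 8, 8]}
```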
Once tasks are generated by a workflow, they can be manipulated using the command-line interface. A newly generated task starts in the “new” state, which means it is eligible to run once all its inputs are available. Issuing a command will change its state to “run”, after which it may change to “done” or “fail” depending on success. Fig. 1 shows the most important task states and how tasks transition between them via commands. Tasks can also go into a “partial” state in connection with error handling, or a “queue” state to signify that it may be picked up by a worker.
Some tasks may require runtime information about the machine or parallelization that is neither a global constant nor a proper input parameter. TB provides a Python decorator to inject such information into tasks without affecting (and hence invalidating upon change) the stored input. This includes MPI communicators and hardware flags such as whether to use a GPU. In addition, TB provides syntax and command-line tools to tag tasks according to which computational resources they need.
Tasks can be equipped with error handlers that can run in multiple stages. The special “partial” state is used when a task did not succeed, but may yet succeed if it has error handlers that did not run yet, and might recover from the failure.
Fig. 2a shows an example of how a static workflow is defined using Python syntax. The workflow is a class; each task is a decorated method that returns a node of the DAG shown in Fig. 2b. The decorator can be used to specify rules for computational resources and error handling. Note how the workflow specifies the routing between tasks, i.e., which outputs from which tasks to connect to which inputs of others. The inputs must match the call signature of the target function: the relaxation job implies that there is a function named relax which takes an input named atoms. These inputs need not exist yet when the workflow runs; instead, inputs referring to other tasks are future references which specify that the parameter is to be loaded and passed to the target function when the task runs. Additionally, the syntax supports indexing, attribute access, and method calls into the outputs of other tasks. Such an expression in an input specification does not actually perform a function call, but rather saves an encoded representation of the call so that it can be evaluated when the corresponding task runs.
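A minimal sketch in the spirit of Fig. 2a is shown below. The decorator and helper names (tb.workflow, tb.var, tb.task, tb.node) follow the style used in the TB documentation, but the exact API spelling and the task functions relax and groundstate are assumptions made for illustration and should be checked against the current documentation:

```python
import taskblaster as tb  # assumed import name

@tb.workflow
class MaterialWorkflow:
    # External input of the (sub)workflow, supplied when it is instantiated.
    atoms = tb.var()

    @tb.task
    def relax(self):
        # Encodes a future call to a function named 'relax' taking an 'atoms' input.
        return tb.node('relax', atoms=self.atoms)

    @tb.task
    def groundstate(self):
        # 'self.relax' is a future reference to the output of the relax task above;
        # it is resolved only when this task runs and defines the DAG edge relax -> groundstate.
        return tb.node('groundstate', atoms=self.relax)
```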
Fig. 2 (a) Workflow with two tasks and a subworkflow with a further three tasks. This example is based on real computational workflows, but with the complexity and number of parameters greatly reduced.
Running the workflow builds the tree of tasks. Fig. 2c shows a screenshot of the output of the task-listing command, including the state of each task, its dependencies (done/total), requested resources, and its location in the directory tree.
The workflow syntax bears similarities to the Workflow Definition Language OpenWDL53,54 in terms of subworkflows as well as routing of inputs and outputs. The TB syntax, being written in Python, provides convenience for projects that are written in Python and can simplify object serialisation.
Once a task runs, it is assigned a directory on the disk where its outputs are stored along with its input specification as JSON. This provides a level of redundancy which allows the registry to be reconstructed in case of corruption due to power outages, bugs, or user errors. Tasks may also leave arbitrary files in their directory, which is useful for storing larger outputs from computations that would be inefficient to encode as JSON, or which are not usefully represented directly as Python objects. Tasks can return path objects pointing to files they generated so that other tasks can access those paths via their input. TB takes care of storing these paths in a way that is robust with respect to moving a repository, and automatically ensures that the path points to the correct location when used in subsequent tasks, even though those tasks run in a different directory.
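As a minimal sketch of this pattern (the function name, file name, and return structure are invented for illustration; only the idea of returning a pathlib.Path from a task is taken from the text above), a task might write a large output file and pass the path on to downstream tasks:

```python
from pathlib import Path

def compute_spectrum(atoms, npoints=1000):
    """Hypothetical task function: writes a large result to disk and returns its path."""
    outfile = Path('spectrum.dat')  # created inside the task's own directory
    outfile.write_text('\n'.join(str(i) for i in range(npoints)))  # placeholder for real output
    # Returning the Path lets TB record it robustly, so dependent tasks receive a
    # path that stays valid even if the repository is later moved.
    return {'spectrum_file': outfile}
```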
To save and load Python objects, TB must be able to serialize them. TB itself supports only basic objects. A custom JSON encoder can be supplied via a plug-in: the TB repository is configured to point to a plug-in module which defines the encoder. For example, ASR-lib uses this to integrate with the ASE encoder and hence supports commonly used objects including numpy arrays and ASE Atoms objects. Custom classes written by a user can also be supported by adding an encoding hook.
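The sketch below shows the general shape of such an encoding hook using only the standard json module and numpy; it is purely illustrative and does not show TB's actual plug-in registration mechanism, which is described in the TB documentation:

```python
import json
import numpy as np

class ArrayAwareEncoder(json.JSONEncoder):
    """Illustrative encoder hook: turn numpy arrays into a tagged, JSON-safe form."""
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return {'__ndarray__': obj.tolist(), 'dtype': str(obj.dtype)}
        return super().default(obj)

# Example: a task output containing an array becomes plain JSON.
print(json.dumps({'energies': np.linspace(0.0, 1.0, 5)}, cls=ArrayAwareEncoder))
```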
The user can configure multiple kinds of workers in a special configuration file. This facilitates the specification of computational resources, encompassing the number of cores, Slurm partition, walltime, and more. Multiple configured workers can then be submitted simultaneously with a single command. Submitting a worker is, in principle, no different from submitting a command with specific settings. TB submits workers via myqueue.43
Tasks and workers can be equipped with configurable resource tags, like the relaxation task in Fig. 2a which has the “long” tag. Workers only pick up tasks with matching tags. For example, there can be one kind of worker intended for lightweight processing, another for heavy computations, and a third for computations that require particular hardware such as a GPU. Multiple “subworkers” can be configured to run inside a single HPC job, which allows better sharing of resources when tasks require fewer resources than a whole node.
To handle situations where input parameters change after tasks have already been generated or run, TB alerts the user to “conflicts”. When changing a parameter, a user would adapt the workflow and then rerun it. This generates the same tasks that already exist, but TB detects that an input has changed. The affected tasks are then marked as conflicted, which “freezes” them along with every task that depends on them, preventing those tasks from running. The user must either unrun them or mark the conflicts as “resolved” in order to unfreeze the frozen tasks, indicating that the conflict does not affect data integrity. A resolved conflict is simply a way to tell the workflow that the task's results are to be kept as they are even though the inputs have changed. Both the original and the new (conflicting) inputs are saved, and TB can show a diff highlighting the specific changes in input parameters.
For error handling, TB provides a system called the warden. If a task fails, the user can implement an error handler and modify the workflow to equip the task with that handler. The handler can execute arbitrary code to recover from the error; often this means rerunning the target function of the task with modified inputs or restarting from a checkpoint. Other tasks of the same type will automatically have the error handler as well. Multiple such handlers can execute in succession or in response to different errors, interacting with the warden through a dedicated programming interface.
Overall, the execution of error handlers is considered part of a task (as opposed to being represented as a succession of different tasks) and corresponds closely to typical “try/except” exception handling as supported in modern programming languages. Error handlers do not change the task inputs as stored, as that would cause a conflict. Instead, they have access to call the target function with a modified set of inputs after the original function fails.
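Conceptually (this is a plain-Python analogy, not the warden's actual interface), staged error handling amounts to wrapping the target function in a try/except block and retrying it with inputs adjusted by each handler in turn:

```python
def run_with_handlers(target, inputs, handlers):
    """Conceptual analogy of staged error handling; not TB's warden API."""
    try:
        return target(**inputs)
    except Exception as error:
        for handler in handlers:
            # Each handler may propose modified inputs (e.g. restart from a checkpoint
            # or loosen a convergence criterion) or decline by returning None.
            new_inputs = handler(error, inputs)
            if new_inputs is None:
                continue
            try:
                # The stored task inputs are unchanged; only this call uses modified inputs.
                return target(**new_inputs)
            except Exception as next_error:
                error = next_error
        raise error
```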
The workflows in ASR-lib are written in a general style and can be used for any type of material, independent of dimensionality and composition. Consequently, the workflows in ASR-lib can be used as initial templates when producing more project-specific workflows. Currently, ASR-lib contains workflows for many different operations and calculations, and is continuously being developed. Most of these employ the GPAW57,58 electronic structure code as a calculator, but it is straightforward to generalise to other types of calculators as long as they have a Python interface, e.g. via ASE. Below, we mention a few examples of workflows in ASR-lib, highlighting features that are enabled by TB.
For example, the GW and Phonon workflows utilize generators to generate q-point and displacement parallelisations at the task level. The structure relaxation workflow contains several branches and a while loop related to searching for the lowest energy magnetic configuration. The crystal defect workflows can dynamically generate various types of point defects using a generator and subsequently proceed with nested generators to classify their properties (formation energy, charge transition levels, etc.) by means of DFT calculations. ASR-lib also contains examples of large-scale data processing of existing trees, like evaluation of the energy above the convex hull. These so-called “from tree” methods can be used to collect data from TB repositories, perform analysis, and spawn projects with a new focus.
ASR-lib is currently used in a number of ongoing high-throughput projects related to 2D materials and point defects. In addition to ASR-lib, TB has also been independently used for workflows based on the FHI-Aims code.59
Other distinguishing features of TB are:
• Low infrastructure requirements: TB runs in any Python environment and does not require persistent network connections or database processes.
• Intuitive data storage: workflows and tasks are organised in a directory tree where nested subdirectories serve as namespaces.
• Automatic I/O: TB automates the loading and saving of Python objects as inputs and outputs and works to reduce filesystem path clutter throughout the code.
• General purpose: TB is a generic workflow tool and is not linked to any domain-specific simulation software.
• Plug-ins: users can facilitate work with domain-specific simulation software by writing a plug-in as in the case of ASR-lib.
• Configurable computational resources: tasks are executed by configurable worker processes, where each worker process can run any set of tasks. The logical division of a workflow into tasks is independent of the number or type of actual HPC jobs that run the tasks. Additionally, machine-specific configuration can be kept separate from the main project code.
In general, the top-level workflow encodes every computation that is going to happen. The command-line interface cannot itself add computations or change any result; it only provides a way for the user to choose what, when, and how to run. When running a workflow, TB eagerly adds as many tasks as possible to the DAG without executing any of them. This allows the user to “see into the future” and better assess the required resources, or to experiment with a subset of tasks using the characteristic “run/unrun” pattern. TB can generate parts of the DAG that depend on a dynamic workflow even though that workflow has not yet run. This is possible because TB can use fixed-point tasks of the dynamic workflow to infer the existence of subsequent tasks. The fragments are then connected into the final DAG once the dynamic workflow runs.
TB aims to bridge the gap between small and large projects: it can act as a simple tool to automate processing steps locally on a laptop, or be used in large projects that need to scale and adapt over time. Major design features of TB are: intuitive organisation of data using a directory tree; a usage model which minimises infrastructure requirements by emphasising local data storage and interactive work in a terminal, avoiding the need for heavy-weight database connections; and a strict representation of task dependencies as a persistent DAG.
We have found that this combination facilitates an efficient “unrun/rerun”-based approach to practical experimentation, which is often required in the development phase of new computational projects.
Most core design elements of TB are unlikely to change in their main structure, so future TB development will increasingly focus on smaller improvements to user experience, helper functionality for data migration and other tools that prove useful as the projects using TB mature further.
This journal is © The Royal Society of Chemistry 2025