9 Workflows
9.1 Learning Objectives
- Execute a workflow on the Fred Hutch Cluster
- Modify a workflow and test it on a compute node
- Utilize a container and test inside it on a compute node
9.2 Getting today’s exercises
In your bash_for_bio/ folder, try doing a git pull. Or
copy these files into your own bash_for_bio directory:
cp /home/tladera2/bash_for_bio/scripts/week4/* ~/bash_for_bio/scripts/week4/
9.3 Using a Container
In Section 8.2, we learned a little bit about using Apptainer to run a Docker container. Let’s try to pull a common container, samtools from BioContainers, and run things inside the container.
The first thing we need to do is load Apptainer:

module load Apptainer/1.1.6

Then we can pull the docker container:

apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1

We can check if we have pulled the docker image by using:

apptainer cache list

Okay, now we have confirmed that we downloaded the apptainer image. Now we can try to execute things with it.
1 apptainer exec --bind /fh/fast \
2   samtools_v1.9-4-deb_cv1.sif \
3   samtools --version

1. Bind path (see Section 10.3.3)
2. Docker image we have downloaded
3. samtools command to run
It’s worth trying this once to make sure you understand how all of the pieces are connected. In general, I do recommend using a workflow runner (Section 9.4) instead, because it helps manage all of these details, and it makes reading files easier.
9.4 Running Workflows
So far, we’ve been batch processing using job arrays in SLURM. However, when you graduate to processing in multiple steps, you should consider using a workflow manager for processing files.
A workflow manager takes a workflow, which can consist of multiple tasks, and lets you batch process files through it.
A good workflow manager will allow you to:
- Restart failed subjobs in the workflow
- Allow you to customize where intermediate and final outputs go
- Swap and customize modules in your workflow
- Adapt to different computing architectures (HPC/cloud/etc)
Many bioinformaticians have used workflow managers to process and manage hundreds or thousands of files at a time. They are well worth the extra effort it takes to learn them.
Here is an overview of some of the common bioinformatics workflow managers. We will be using miniWDL, which runs on WDL files.
| Manager Software | Workflow Formats | Notes |
|---|---|---|
| Cromwell | WDL/CWL | Made for HPC jobs |
| Sprocket | WDL | Made for HPC jobs |
| miniWDL | WDL | Used for local testing of workflows |
| DNAnexus | WDL/CWL | Used for systems such as All of Us |
| Nextflow | .nf files | Owned by Seqera |
| Snakemake | make files | |
MiniWDL is written to run locally on your own computer for testing. We don’t recommend running it on rhino for general work; for one, it is not set up to take advantage of SLURM.
Cromwell is the suggested way to run workflows on rhino. We have been working on a web app called PROOF that launches a Cromwell server and lets you manage your workflows and look at individual jobs.
9.4.1 Specifying File Inputs
To run a WDL workflow, you will need to specify the inputs for that workflow. These inputs can be reference genomes, the files we want to process, and any other information the workflow requires.
Our WDL workflow specifies the following inputs:
input {
File ref_fasta
File ref_bwt
File ref_pac
File ref_amb
File ref_ann
File ref_sa
File fastq_r1
String name
}
Note that most of the entries in this input have to do with the reference, including the files that are generated by bwa index. This is because bwa mem needs to have these files accessible, even though our bwa mem script only runs with the ref_fasta input.
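Since bwa mem quietly depends on those sibling index files, it can help to confirm they all sit next to the FASTA before launching a run. A small sketch (the touch lines create demo files; with real data, point ref at your genome FASTA and drop them):

```shell
# demo FASTA plus the companion files bwa index would generate alongside it
ref="genome.fa"
touch "$ref" "$ref.bwt" "$ref.pac" "$ref.amb" "$ref.ann" "$ref.sa"

# verify every companion index file is present next to the FASTA
missing=0
for ext in bwt pac amb ann sa; do
  if [ ! -e "$ref.$ext" ]; then
    echo "missing index file: $ref.$ext" >&2
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "all index files present"
```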
Here is an example of inputs for our workflow of interest.
#| filename: scripts/week4/input.json
{
"bwa_samtools_workflow.ref_fasta": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa",
"bwa_samtools_workflow.ref_bwt": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.bwt",
"bwa_samtools_workflow.ref_pac": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.pac",
"bwa_samtools_workflow.ref_amb": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.amb",
"bwa_samtools_workflow.ref_ann": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.ann",
"bwa_samtools_workflow.ref_sa": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.sa",
"bwa_samtools_workflow.fastq_r1": "./data/CALU1_combined_final.fastq",
"bwa_samtools_workflow.name": "CALU1_combined_final"
}

9.4.2 Setting up for running a WDL workflow
Say someone has given us a WDL file - how do we set it up to run on our own data?
We’ll use miniWDL to run our WDL workflow. It is accessible via the cirro module.
module load cirro
module load Apptainer/1.1.6
export MINIWDL__SCHEDULER__CONTAINER_BACKEND=singularity

Because Docker is not available on rhino, we need to set the container backend to be singularity (the original name for Apptainer). This will enable MiniWDL to launch Docker containers.
If we run
which miniwdl
We will get the response:
/app/software/cirro/1.7.0-foss-2024a/bin/miniwdl
Now we’re set up to run our WDL workflow.
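Before launching, it can be worth confirming the backend override actually took effect, since miniwdl otherwise falls back to its default Docker backend. A minimal sketch:

```shell
# the override miniwdl needs on rhino, where Docker is unavailable
export MINIWDL__SCHEDULER__CONTAINER_BACKEND=singularity

# confirm the setting is in place before launching a workflow
if [ "$MINIWDL__SCHEDULER__CONTAINER_BACKEND" = "singularity" ]; then
  echo "container backend: singularity"
else
  echo "backend not set -- miniwdl would look for Docker" >&2
fi
```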
9.4.3 Running a WDL workflow
We’ll use miniwdl run to execute our workflow. Again, in our bash_for_bio directory:
miniwdl run -i scripts/week4/input.json scripts/week4/ww-bwa-samtools.wdl

2025-11-03 12:40:35.106 miniwdl-run no configuration file found
2025-11-03 12:40:35.998 wdl.w:bwa_samtools_workflow workflow start :: name: "bwa_samtools_workflow", source: "scripts/week4/ww-bwa-samtools.wdl", line: 2, column: 1, dir: "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow"
2025-11-03 12:40:36.015 wdl.w:bwa_samtools_workflow miniwdl :: version: "v1.13.0", uname: "Linux rhino02 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64"
2025-11-03 12:40:36.015 wdl.w:bwa_samtools_workflow task thread pool initialized :: task_concurrency: 72
2025-11-03 12:40:36.049 wdl.w:bwa_samtools_workflow visit :: node: "decl-fastq_r1", values: {"fastq_r1": "/home/tladera2/bash_for_bio/data/CALU1_combined_final.fastq"}
2025-11-03 12:40:36.049 wdl.w:bwa_samtools_workflow visit :: node: "decl-name", values: {"name": "CALU-test"}
[....]
2025-11-03 12:40:57.915 wdl.w:bwa_samtools_workflow finish :: job: "call-samtools_process"
2025-11-03 12:40:57.916 wdl.w:bwa_samtools_workflow output :: job: "call-samtools_process", values: {"sorted_bai": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bai/CALU-test_sorted.bam.bai", "sorted_bam": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bam/CALU-test_sorted.bam"}
2025-11-03 12:40:57.916 wdl.w:bwa_samtools_workflow visit :: node: "output-sorted_bai", values: {"sorted_bai": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bai/CALU-test_sorted.bam.bai"}
2025-11-03 12:40:57.916 wdl.w:bwa_samtools_workflow visit :: node: "output-sorted_bam", values: {"sorted_bam": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bam/CALU-test_sorted.bam"}
2025-11-03 12:40:57.916 wdl.w:bwa_samtools_workflow visit :: node: "outputs", values: {"sorted_bai": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bai/CALU-test_sorted.bam.bai", "sorted_bam": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bam/CALU-test_sorted.bam"}
2025-11-03 12:40:57.942 wdl.w:bwa_samtools_workflow done
Success! We got our WDL to run. Let’s look at the output.
9.4.4 Exploring the WDL output
The output of our WDL workflow is as follows:
output {
File sorted_bam = samtools_process.sorted_bam
File sorted_bai = samtools_process.sorted_bai
}
So we should expect a sorted_bam and a sorted_bai file as outputs. Where are they?
MiniWDL will generate a folder per workflow run. You can always see the latest workflow run by referencing _LAST/:
ls -l _LAST/

total 248
drwxrwx--- 4 tladera2 g_tladera2 325 Nov 3 12:40 call-bwa_align
drwxrwx--- 4 tladera2 g_tladera2 325 Nov 3 12:40 call-samtools_process
-rw-rw---- 1 tladera2 g_tladera2 911 Nov 3 12:40 inputs.json
drwxrwx--- 4 tladera2 g_tladera2 56 Nov 3 12:40 out
-rw-rw---- 1 tladera2 g_tladera2 293 Nov 3 12:40 outputs.json
-rw-rw---- 1 tladera2 g_tladera2 127 Nov 3 12:40 rerun
drwxrwx--- 2 tladera2 g_tladera2 37 Nov 3 12:40 wdl
-rw-rw---- 1 tladera2 g_tladera2 14348 Nov 3 12:40 workflow.log
In the _LAST/out/ folder we have:
ls -l _LAST/out

drwxrwx--- 2 tladera2 g_tladera2 42 Nov 3 12:40 sorted_bai
drwxrwx--- 2 tladera2 g_tladera2 38 Nov 3 12:40 sorted_bam
These folders contain symbolic links to our output files.
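Because the files under out/ are symlinks, tools that refuse to follow links may need the real path. A throwaway demo of resolving one (demo file names, not a real run folder):

```shell
# demo: a symlink like the ones miniWDL places under out/
mkdir -p demo_run/task_out
echo "data" > demo_run/task_out/result.bam
ln -sf "$PWD/demo_run/task_out/result.bam" result_link.bam

# readlink -f resolves the link to the real file location
readlink -f result_link.bam
```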
9.5 Looking at a WDL workflow
A WDL file is divided into a workflow and individual tasks. There are two tasks in our workflow, which we can tell from the two calls in the workflow: bwa_align (2) and samtools_process (3).
Each workflow and task requires an input and an output block.
#| eval: false
#| filename: scripts/week4/ww-bwa-samtools.wdl
version 1.0
workflow bwa_samtools_workflow {
1 input {
File ref_fasta
File ref_bwt
File ref_pac
File ref_amb
File ref_ann
File ref_sa
File fastq_r1
String name
}
2 call bwa_align {
input:
reference_bwt = ref_bwt,
reference_ann = ref_ann,
reference_pac = ref_pac,
reference_amb = ref_amb,
reference_sa = ref_sa,
reference_fasta = ref_fasta,
reads = fastq_r1,
name = name
}
3 call samtools_process {
input:
sam_file = bwa_align.bwa_output,
name = name
}
4 output {
File sorted_bam = samtools_process.sorted_bam
File sorted_bai = samtools_process.sorted_bai
}
}
5 task bwa_align {
  ...
}

6 task samtools_process {
  ...
}

1. Inputs to workflow
2. Call bwa_align task using inputs
3. Take outputs of bwa_align task and process them with samtools_process
4. Outputs of workflow
5. bwa_align task definition
6. samtools_process task definition
You can see that the structure of the WDL file is a lot like a shell script. The inputs have particular expected formats, and it calls bwa_align first. Once the outputs have been generated from the bwa_align task, they are used as inputs to the samtools_process task, which produces the final outputs of the workflow.
If we look at the task definition called bwa_align it looks like this:
task bwa_align {
meta {
description: "Aligns paired-end reads to a reference using BWA-MEM"
outputs: {
bwa_output: "Output SAM file containing aligned reads"
}
}
parameter_meta {
reference_fasta: "Reference genome FASTA file"
reads: "FASTQ of forward (R1) reads"
name: "Sample name for output SAM file"
}
input {
File reference_fasta
File reference_bwt
File reference_ann
File reference_pac
File reference_amb
File reference_sa
File reads
String name
}
# Get name of reference fasta file within 'bwa_index' folder
String ref_name = reference_fasta
1 command <<<
# BWA alignment using 8 threads
bwa mem -t 8 "~{ref_name}" "~{reads}" > "~{name}.sam"
>>>
output {
File bwa_output = "~{name}.sam"
}
runtime {
docker: "getwilds/bwa:0.7.17"
cpu: 8
memory: "16 GB"
}
}

1. Contains the bash code.
9.5.1 How does this connect to shell scripting?
You’ll notice that everything between the command <<< and >>> delimiters is essentially a shell script, but adapted to work with the WDL inputs.
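To see the correspondence, here is roughly what that command section boils down to as plain bash, with hypothetical values standing in for the WDL inputs (WDL’s ~{var} placeholders play the role of shell variable expansion):

```shell
# hypothetical stand-ins for the WDL task inputs
ref_name="genome.fa"
reads="sample.fastq"
name="sample"

# the command the task would run, with ~{...} replaced by shell expansion
cmd="bwa mem -t 8 \"$ref_name\" \"$reads\" > \"$name.sam\""
echo "$cmd"
```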
9.6 scatter(): how we divide up tasks.
We can batch process our fastq files in data/ by using the scatter function in WDL. There are two things we need to do to modify our workflow:
- Accept an Array of Files called fastq_r1
- Use scatter to iterate over the files by wrapping our two tasks, bwa_align and samtools_process
Our input definition for the workflow looks like this:
input {
File ref_fasta
File ref_bwt
File ref_pac
File ref_amb
File ref_ann
File ref_sa
Array[File] fastq_r1
}
This is nearly identical to our other WDL, but fastq_r1 is now an Array[File]. The corresponding input in our JSON file looks like this:
"bwa_samtools_workflow.fastq_r1": [
"data/CALU1_combined_final.fastq",
"data/HCC4006_final.fastq",
"data/MOLM13_combined_final.fastq"
]
This is a JSON list of the files we want to process.
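Rather than typing the list by hand, you can generate it from whatever is in data/ with standard tools. A sketch (the touch lines create demo files; with real data, drop them):

```shell
# demo fastq files standing in for real data
mkdir -p data
touch data/CALU1_combined_final.fastq data/HCC4006_final.fastq

# emit a JSON array of every fastq in data/, comma-separated except the last
{
  printf '"bwa_samtools_workflow.fastq_r1": [\n'
  ls data/*.fastq | sed 's/^/  "/; s/$/",/' | sed '$ s/,$//'
  printf ']\n'
} > fastq_list.txt

cat fastq_list.txt
```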
When we run:

miniwdl run -i scripts/week4/scatter.json scripts/week4/ww-bwa-samtools-scatter.wdl

We get the following output:
{
"dir": "/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow",
"outputs": {
"bwa_samtools_workflow.sorted_bai": [
"/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bai/0/CALU1_combined_final.fastq.sam_sorted.bam.bai",
"/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bai/1/HCC4006_final.fastq.sam_sorted.bam.bai",
"/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bai/2/MOLM13_combined_final.fastq.sam_sorted.bam.bai"
],
"bwa_samtools_workflow.sorted_bam": [
"/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bam/0/CALU1_combined_final.fastq.sam_sorted.bam",
"/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bam/1/HCC4006_final.fastq.sam_sorted.bam",
"/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bam/2/MOLM13_combined_final.fastq.sam_sorted.bam"
]
}
}
Examining the last workflow run directory:

ls -l _LAST/out/sorted_bam/

We see that our output has 3 subdirectories:
total 120
drwxrwx--- 2 tladera2 g_tladera2 59 Nov 4 10:17 0
drwxrwx--- 2 tladera2 g_tladera2 52 Nov 4 10:17 1
drwxrwx--- 2 tladera2 g_tladera2 60 Nov 4 10:17 2
Each of these directories contains a symbolic link to the output files.
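A common follow-up is gathering the scattered outputs into one flat folder. A sketch using a throwaway demo layout in place of _LAST/out/ (cp -L copies the file behind each symlink rather than the link itself):

```shell
# demo stand-in for a scatter run's output layout: shard dirs holding symlinks
mkdir -p demo_call/shard0 demo_out/sorted_bam/0
echo "bam-bytes" > demo_call/shard0/CALU1_sorted.bam
ln -sf "$PWD/demo_call/shard0/CALU1_sorted.bam" demo_out/sorted_bam/0/CALU1_sorted.bam

# gather every shard's BAM into one folder, dereferencing the symlinks
mkdir -p collected_bams
for link in demo_out/sorted_bam/*/*.bam; do
  cp -L "$link" collected_bams/
done

ls collected_bams/
```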
9.6.1 More about scatter
Here is the scatter block in the workflow:

1 scatter (fq in fastq_r1) {
2   call bwa_align {
      input:
        ...
        reads = fq,
        ...
    }
3   call samtools_process {
      ...
    }
  }

1. scatter over our Array[File] fastq_r1. We use similar syntax as for loops to iterate over each file with a placeholder fq.
2. Utilize the fq placeholder in the call bwa_align task.
3. The rest of the workflow works as usual.
The output section of the workflow looks like this:
output {
Array[File] sorted_bam = samtools_process.sorted_bam
Array[File] sorted_bai = samtools_process.sorted_bai
}
Note that the scatter generates an Array of files as well.
9.6.2 For more information
I recommend that you read the Developing WDL Workflows book that we have written for much more information about building your own workflows.
You don’t have to start from scratch, either. There is the WILDS WDL library that has a number of different bioinformatics tasks that are already built for use in your workflows.
9.7 Working with file manifests in WDL
Another strategy you can use to cycle through a list of files is to use a file manifest to process a list of files.
There is an example workflow that shows how to work with a file manifest. This can be helpful for those who aren’t necessarily comfortable working with JSON.
This workflow has a single input, which is the location of the file manifest. It will then cycle through the manifest, line by line.
This is what the file manifest contains. Notice there are three named columns for this file: sampleName, bamLocation, and bedLocation.
| sampleName | bamLocation | bedLocation |
|---|---|---|
| smallTestData | /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/unpaired-panel-consensus-variants-human/smallTestData.unmapped.bam | /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed |
| smallTestData-reference | /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/paired-panel-consensus-variants-human/smallTestData-reference.unmapped.bam | /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed |
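Cycling through the manifest line by line boils down to skipping the header and unpacking each row. In plain bash that looks roughly like this (a tiny demo manifest is created here; substitute your real file and paths):

```shell
# demo manifest with the three named columns (tab-separated)
printf 'sampleName\tbamLocation\tbedLocation\n' > manifest.tsv
printf 'smallTestData\t/path/to/smallTestData.unmapped.bam\t/path/to/panel.bed\n' >> manifest.tsv

# skip the header, then unpack each row into named variables
tail -n +2 manifest.tsv | while IFS=$'\t' read -r sampleName bamLocation bedLocation; do
  echo "processing $sampleName ($bamLocation)"
done
```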
9.8 Where Next?
Now that you understand the basics of working with Bash and WDL, you are ready to try testing and profiling your jobs.