9  Workflows

9.1 Learning Objectives

  • Execute a workflow on the Fred Hutch Cluster
  • Modify a workflow and test it on a compute node
  • Utilize a container and test inside it on a compute node

9.2 Getting today’s exercises

In your bash_for_bio/ folder, try doing a git pull. Or copy these files into your own bash_for_bio directory:

cp /home/tladera2/bash_for_bio/scripts/week4/* ~/bash_for_bio/scripts/week4/

9.3 Using a Container

In Section 8.2, we learned a little bit about using Apptainer to run a Docker container. Let’s try pulling a commonly used container, samtools, and running commands inside it.

The first thing we need to do is load Apptainer:

module load Apptainer/1.1.6

Then we can pull the docker container:

apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1

We can check whether we have pulled the Docker image by using

apptainer cache list

Okay, now we have confirmed that we downloaded the Apptainer image. Now we can try executing commands with it.

apptainer exec \
1    --bind /path/to/data:/data \
2    docker://biocontainers/samtools:v1.9-4-deb_cv1 \
3    samtools view -c /data/$1 > /data/$1.counts.txt
1
Bind path: mounts /path/to/data on the host as /data inside the container (see Section 10.3.3)
2
Docker image we have downloaded
3
samtools command to run.

It’s worth trying this once to make sure you understand how all of the pieces are connected. In general, I do recommend using a workflow runner (Section 9.4) instead, because it helps manage all of these details, and it makes reading files easier.
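To see how the pieces connect, here is one way to wrap the exec call above into a small script, so the $1 positional argument gets filled in when you call it. This is just a sketch of ours (count_reads.sh is a made-up name, and the bind path is a placeholder); we only syntax-check it here, since Apptainer isn’t available on every machine:

```shell
# count_reads.sh is a name we made up; the bind path is a placeholder
cat > count_reads.sh <<'EOF'
#!/bin/bash
set -euo pipefail
module load Apptainer/1.1.6
apptainer exec \
  --bind /path/to/data:/data \
  docker://biocontainers/samtools:v1.9-4-deb_cv1 \
  samtools view -c "/data/$1" > "/data/$1.counts.txt"
EOF

# We can't run Apptainer here, but we can at least syntax-check the script
bash -n count_reads.sh && echo "syntax OK"
```

On rhino, you would then call it with something like ./count_reads.sh sample.bam.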

9.4 Running Workflows

So far, we’ve been batch processing using job arrays in SLURM. However, when you graduate to processing in multiple steps, you should consider using a workflow manager for processing files.

A workflow manager takes a workflow, which can consist of multiple tasks, and allows you to batch process files through it.

A good workflow manager will allow you to:

  • Restart failed subjobs in the workflow
  • Allow you to customize where intermediate and final outputs go
  • Swap and customize modules in your workflow
  • Adapt to different computing architectures (HPC/cloud/etc)

Many bioinformaticians have used workflow managers to process and manage hundreds or thousands of files at a time. They are well worth the extra effort it takes to learn.

Here is an overview of some of the common bioinformatics workflow managers. We will be using miniWDL, which runs on WDL files.

Manager Software   Workflow Formats   Notes
Cromwell           WDL/CWL            Made for HPC jobs
Sprocket           WDL                Made for HPC jobs
MiniWDL            WDL                Used for local testing of workflows
DNAnexus           WDL/CWL            Used for systems such as All of Us
Nextflow           .nf files          Owned by Seqera
Snakemake          make files
Note: MiniWDL is mostly for demonstration

MiniWDL is written to run locally on your own computer for testing. We don’t recommend running it on rhino for production work; for one, it is not set up to take advantage of SLURM.

Cromwell is the suggested way to run workflows on rhino. We have been working on a web app called PROOF that launches a Cromwell server and lets you manage your workflows and look at individual jobs.

9.4.1 Specifying File Inputs

To run a WDL workflow, you will need to specify the inputs for that workflow. These inputs can be reference genomes, the files we want to process, and any other information the workflow requires.

Our WDL workflow specifies the following inputs:

input {
    File ref_fasta
    File ref_bwt
    File ref_pac
    File ref_amb
    File ref_ann
    File ref_sa
    File fastq_r1
    String name
  }

Note that most of the entries in this input have to do with the reference, including the files that are generated by bwa index. This is because bwa mem needs to have these files accessible, even though our bwa mem script only runs with the ref_fasta input.

Here is an example of inputs for our workflow of interest.

#| filename: scripts/week4/input.json
{
  "bwa_samtools_workflow.ref_fasta": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa",
  "bwa_samtools_workflow.ref_bwt": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.bwt",
  "bwa_samtools_workflow.ref_pac": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.pac",
  "bwa_samtools_workflow.ref_amb": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.amb",
  "bwa_samtools_workflow.ref_ann": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.ann",
  "bwa_samtools_workflow.ref_sa": "/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.sa",
  "bwa_samtools_workflow.fastq_r1": "./data/CALU1_combined_final.fastq",
  "bwa_samtools_workflow.name": "CALU1_combined_final"
}
Note: Input Files as JSON

If you want to work with the JSON format (Section 8.1), there is a trick to generating a template .json file for a workflow.

miniwdl input-template scripts/week4/ww-bwa-samtools.wdl > ww-bwa-inputs.json

This will generate a file called ww-bwa-inputs.json that contains all of the inputs:

{
  "bwa_samtools_workflow.ref_fasta": "File",
  "bwa_samtools_workflow.ref_bwt": "File",
  "bwa_samtools_workflow.ref_pac": "File",
  "bwa_samtools_workflow.ref_amb": "File",
  "bwa_samtools_workflow.ref_ann": "File",
  "bwa_samtools_workflow.ref_sa": "File",
  "bwa_samtools_workflow.fastq_r1": "File",
  "bwa_samtools_workflow.name": "String"
}

This can be a good head start to making your .json files.
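As one way to build on that head start, you can swap a placeholder for a real value with sed. The demo-inputs.json below is fabricated so the snippet runs on its own; in practice you would run sed on the generated ww-bwa-inputs.json:

```shell
# A fabricated demo template so this snippet is self-contained; in practice,
# point sed at the ww-bwa-inputs.json generated by miniwdl input-template
cat > demo-inputs.json <<'EOF'
{
  "bwa_samtools_workflow.name": "String"
}
EOF

# Swap the "String" placeholder for a real sample name
sed -i 's|"bwa_samtools_workflow.name": "String"|"bwa_samtools_workflow.name": "CALU1_combined_final"|' demo-inputs.json
cat demo-inputs.json
```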

For Cromwell, there is an additional utility called WOMtool that will generate a template inputs file from a workflow file.

9.4.2 Setting up for running a WDL workflow

Say someone has given us a WDL file - how do we set it up to run on our own data?

We’ll use miniWDL to run our WDL workflow. It is accessible via the cirro module.

module load cirro
module load Apptainer/1.1.6
export MINIWDL__SCHEDULER__CONTAINER_BACKEND=singularity

Because Docker is not available on rhino, we need to set the container backend to singularity (the original name for Apptainer). This enables MiniWDL to run Docker container images through Apptainer.

If we run

which miniwdl

We will get the response:

/app/software/cirro/1.7.0-foss-2024a/bin/miniwdl

Now we’re set up to run our WDL workflow.

9.4.3 Running a WDL workflow

We’ll use miniwdl run to execute our workflow. Again, in our bash_for_bio directory:

miniwdl run -i scripts/week4/input.json scripts/week4/ww-bwa-samtools.wdl
2025-11-03 12:40:35.106 miniwdl-run no configuration file found
2025-11-03 12:40:35.998 wdl.w:bwa_samtools_workflow workflow start :: name: "bwa_samtools_workflow", source: "scripts/week4/ww-bwa-samtools.wdl", line: 2, column: 1, dir: "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow"
2025-11-03 12:40:36.015 wdl.w:bwa_samtools_workflow miniwdl :: version: "v1.13.0", uname: "Linux rhino02 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64"
2025-11-03 12:40:36.015 wdl.w:bwa_samtools_workflow task thread pool initialized :: task_concurrency: 72
2025-11-03 12:40:36.049 wdl.w:bwa_samtools_workflow visit :: node: "decl-fastq_r1", values: {"fastq_r1": "/home/tladera2/bash_for_bio/data/CALU1_combined_final.fastq"}
2025-11-03 12:40:36.049 wdl.w:bwa_samtools_workflow visit :: node: "decl-name", values: {"name": "CALU-test"}

[....]

2025-11-03 12:40:57.915 wdl.w:bwa_samtools_workflow finish :: job: "call-samtools_process"
2025-11-03 12:40:57.916 wdl.w:bwa_samtools_workflow output :: job: "call-samtools_process", values: {"sorted_bai": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bai/CALU-test_sorted.bam.bai", "sorted_bam": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bam/CALU-test_sorted.bam"}
2025-11-03 12:40:57.916 wdl.w:bwa_samtools_workflow visit :: node: "output-sorted_bai", values: {"sorted_bai": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bai/CALU-test_sorted.bam.bai"}
2025-11-03 12:40:57.916 wdl.w:bwa_samtools_workflow visit :: node: "output-sorted_bam", values: {"sorted_bam": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bam/CALU-test_sorted.bam"}
2025-11-03 12:40:57.916 wdl.w:bwa_samtools_workflow visit :: node: "outputs", values: {"sorted_bai": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bai/CALU-test_sorted.bam.bai", "sorted_bam": "/home/tladera2/bash_for_bio/20251103_124035_bwa_samtools_workflow/call-samtools_process/out/sorted_bam/CALU-test_sorted.bam"}
2025-11-03 12:40:57.942 wdl.w:bwa_samtools_workflow done

Success! We got our WDL to run. Let’s look at the output.

9.4.4 Exploring the WDL output

The output of our WDL workflow is as follows:

  output {
    File sorted_bam = samtools_process.sorted_bam
    File sorted_bai = samtools_process.sorted_bai
  }

We should expect a sorted_bam and a sorted_bai file as outputs. So where are they?

MiniWDL will generate a folder per workflow run. You can always see the latest workflow run by referencing _LAST/:

ls -l _LAST/
total 248
drwxrwx--- 4 tladera2 g_tladera2   325 Nov  3 12:40 call-bwa_align
drwxrwx--- 4 tladera2 g_tladera2   325 Nov  3 12:40 call-samtools_process
-rw-rw---- 1 tladera2 g_tladera2   911 Nov  3 12:40 inputs.json
drwxrwx--- 4 tladera2 g_tladera2    56 Nov  3 12:40 out
-rw-rw---- 1 tladera2 g_tladera2   293 Nov  3 12:40 outputs.json
-rw-rw---- 1 tladera2 g_tladera2   127 Nov  3 12:40 rerun
drwxrwx--- 2 tladera2 g_tladera2    37 Nov  3 12:40 wdl
-rw-rw---- 1 tladera2 g_tladera2 14348 Nov  3 12:40 workflow.log

In the _LAST/out/ folder we have:

ls -l _LAST/out
drwxrwx--- 2 tladera2 g_tladera2 42 Nov  3 12:40 sorted_bai
drwxrwx--- 2 tladera2 g_tladera2 38 Nov  3 12:40 sorted_bam

Which will contain our output files symbolically linked.
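If you want the real location of an output file rather than the symlink, readlink -f will resolve it. Here is a self-contained sketch, with made-up paths that mimic miniwdl’s run folder layout:

```shell
# Made-up directory layout mimicking a miniwdl run folder
mkdir -p demo_run/call-task/out demo_run/out/sorted_bam
touch demo_run/call-task/out/sample_sorted.bam

# miniwdl links outputs into out/; readlink -f resolves the link
ln -sf "$PWD/demo_run/call-task/out/sample_sorted.bam" demo_run/out/sorted_bam/sample_sorted.bam
readlink -f demo_run/out/sorted_bam/sample_sorted.bam
```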

9.5 Looking at a WDL workflow

WDL files are divided into workflows and individual tasks. There are two tasks in our workflow, which we can tell from the two calls in the workflow: bwa_align (2) and samtools_process (3).

Each workflow and task requires an input and output.

#| eval: false
#| filename: scripts/week4/ww-bwa-samtools.wdl
version 1.0
workflow bwa_samtools_workflow {
1  input {
    File ref_fasta
    File ref_bwt
    File ref_pac
    File ref_amb
    File ref_ann
    File ref_sa
    File fastq_r1
    String name
  }
2  call bwa_align {
    input:
      reference_bwt = ref_bwt,
      reference_ann = ref_ann,
      reference_pac = ref_pac,
      reference_amb = ref_amb,
      reference_sa = ref_sa,
      reference_fasta = ref_fasta,
      reads = fastq_r1,
      name = name
  }

3  call samtools_process {
    input:
      sam_file = bwa_align.bwa_output,
      name = name
  }
4  output {
    File sorted_bam = samtools_process.sorted_bam
    File sorted_bai = samtools_process.sorted_bai
  }
}
5  task bwa_align {
  ...
}
6  task samtools_process {
  ...
}
1
Inputs to workflow
2
Call bwa_align task using inputs
3
Take outputs of bwa_align task and process them with samtools_process
4
Outputs of workflow
5
bwa_align task definition
6
samtools_process task definition.

You can see that the structure of the WDL file is a lot like a shell script. The inputs have particular expected formats, and it calls bwa_align first. Once the outputs have been generated from the bwa_align task, they are used as inputs to the samtools_process task, which produces the final outputs of the workflow.

If we look at the task definition called bwa_align it looks like this:

task bwa_align {
  meta {
    description: "Aligns paired-end reads to a reference using BWA-MEM"
    outputs: {
      bwa_output: "Output SAM file containing aligned reads"
    }
  }
  parameter_meta {
    reference_fasta: "Reference genome FASTA file"
    reads: "FASTQ of forward (R1) reads"
    name: "Sample name for output SAM file"
  }
  input {
    File reference_fasta
    File reference_bwt
    File reference_ann
    File reference_pac
    File reference_amb
    File reference_sa
    File reads
    String name
  }
  # Get name of reference fasta file within 'bwa_index' folder
  String ref_name = reference_fasta
1  command <<<
    # BWA alignment using 8 threads
    bwa mem -t 8 "~{ref_name}" "~{reads}" > "~{name}.sam"
  >>>
  output {
    File bwa_output = "~{name}.sam"
  }
  runtime {
    docker: "getwilds/bwa:0.7.17"
    cpu: 8
    memory: "16 GB"
  }
}
1
Contains the bash code.

9.5.1 How does this connect to shell scripting?

You’ll notice that everything between the:

command <<<

>>>

Is essentially a shell script, but adapted to work with the WDL inputs.

Note: miniwdl check

miniwdl has a built-in check that makes sure your WDL has correct syntax (in fact, it uses shellcheck on the bash portions of your scripts). You can run it by using:

miniwdl check scripts/week4/ww-bwa-samtools.wdl

The output of it is:

ww-bwa-samtools.wdl
    workflow bwa_samtools_workflow
        call bwa_align
        call samtools_process
    task bwa_align
        (Ln 60, Col 5) UnusedDeclaration, nothing references File reference_bwt
        (Ln 61, Col 5) UnusedDeclaration, nothing references File reference_ann
        (Ln 62, Col 5) UnusedDeclaration, nothing references File reference_pac
        (Ln 63, Col 5) UnusedDeclaration, nothing references File reference_amb
        (Ln 64, Col 5) UnusedDeclaration, nothing references File reference_sa
    task samtools_process

In our case, you can see that it is complaining that we aren’t “using” the .bwt and other index files. This is ok, because we just need to have the files present to run bwa even if we aren’t directly using them in our script.

9.6 scatter(): how we divide up tasks

We can batch process our fastq files in data/ by using the scatter construct in WDL. There are two things we need to do to modify our workflow:

  1. Accept an Array of Files called fastq_r1
  2. Use scatter to iterate over the files by wrapping our two tasks, bwa_align and samtools_process.

Our input definition for the workflow looks like this:

  input {
    File ref_fasta
    File ref_bwt
    File ref_pac
    File ref_amb
    File ref_ann
    File ref_sa
    Array[File] fastq_r1
  }

Which is identical to our other WDL, except that fastq_r1 is now an Array[File] instead of a single File. The corresponding input in our JSON file looks like this:

  "bwa_samtools_workflow.fastq_r1": [
    "data/CALU1_combined_final.fastq",
    "data/HCC4006_final.fastq",
    "data/MOLM13_combined_final.fastq"
  ]

Which is a JSON list of the files we want to process.
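If you have many fastq files, you don’t have to type this list by hand. Here is a sketch of ours (the demo_data/ folder and file names are made up) that generates the JSON array from a directory listing with a bash loop:

```shell
# Make a couple of empty demo files so the sketch is self-contained
mkdir -p demo_data
touch demo_data/CALU1_combined_final.fastq demo_data/HCC4006_final.fastq

# Emit a JSON array entry, comma-separating all but the last path
{
  printf '  "bwa_samtools_workflow.fastq_r1": [\n'
  first=true
  for fq in demo_data/*.fastq; do
    "$first" || printf ',\n'
    printf '    "%s"' "$fq"
    first=false
  done
  printf '\n  ]\n'
} > fastq_list.txt
cat fastq_list.txt
```

You can then paste the result into your inputs .json file.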

When we run

miniwdl run -i scripts/week4/scatter.json scripts/week4/ww-bwa-samtools-scatter.wdl

We get the following output:

{
  "dir": "/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow",
  "outputs": {
    "bwa_samtools_workflow.sorted_bai": [
      "/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bai/0/CALU1_combined_final.fastq.sam_sorted.bam.bai",
      "/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bai/1/HCC4006_final.fastq.sam_sorted.bam.bai",
      "/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bai/2/MOLM13_combined_final.fastq.sam_sorted.bam.bai"
    ],
    "bwa_samtools_workflow.sorted_bam": [
      "/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bam/0/CALU1_combined_final.fastq.sam_sorted.bam",
      "/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bam/1/HCC4006_final.fastq.sam_sorted.bam",
      "/home/tladera2/bash_for_bio/20251104_101700_bwa_samtools_workflow/out/sorted_bam/2/MOLM13_combined_final.fastq.sam_sorted.bam"
    ]
  }
}

Examining the last workflow run directory,

ls -l _LAST/out/sorted_bam/

We see that our output has 3 subdirectories:

total 120
drwxrwx--- 2 tladera2 g_tladera2 59 Nov  4 10:17 0
drwxrwx--- 2 tladera2 g_tladera2 52 Nov  4 10:17 1
drwxrwx--- 2 tladera2 g_tladera2 60 Nov  4 10:17 2

Each of these directories contains a symbolic link to the output files.

9.6.1 More about scatter

Here is the scatter block in the workflow:

1  scatter (fq in fastq_r1){
    call bwa_align {
        input:
            reference_bwt = ref_bwt,
            reference_ann = ref_ann,
            reference_pac = ref_pac,
            reference_amb = ref_amb,
            reference_sa = ref_sa,
            reference_fasta = ref_fasta,
2            reads = fq
        }

   call samtools_process {
    input:
3      sam_file = bwa_align.bwa_output
        }
  }
1
scatter over our Array[File] fastq_r1. The syntax is similar to a for loop: we iterate over each file using the placeholder fq.
2
Use the fq placeholder in the call to the bwa_align task.
3
The rest of the workflow works as usual.

The output section of the workflow looks like this:

output {
    Array[File] sorted_bam = samtools_process.sorted_bam
    Array[File] sorted_bai = samtools_process.sorted_bai
  }

Note that the scatter generates an Array of files as well.

9.6.2 For more information

I recommend that you read the Developing WDL Workflows book that we have written for much more information about building your own workflows.

You don’t have to start from scratch, either. There is the WILDS WDL library that has a number of different bioinformatics tasks that are already built for use in your workflows.

9.7 Working with file manifests in WDL

Another strategy you can use to cycle through a list of files is to use a file manifest to process a list of files.

There is an example workflow that shows how to work with a file manifest. This can be helpful for those who aren’t comfortable working with JSON.

This workflow has a single input, which is the location of the file manifest. It will then cycle through the manifest, line by line.

This is what the file manifest contains. Notice there are three named columns for this file: sampleName, bamLocation, and bedLocation.

sampleName bamLocation bedLocation
smallTestData /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/unpaired-panel-consensus-variants-human/smallTestData.unmapped.bam /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed
smallTestData-reference /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/paired-panel-consensus-variants-human/smallTestData-reference.unmapped.bam /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed
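To get a feel for what “cycling through the manifest” means, here is the shell equivalent as a self-contained sketch (the WDL workflow does this internally). We assume the manifest is tab-delimited with a header row; the sample names and paths below are made up:

```shell
# Build a tiny tab-delimited demo manifest so this sketch runs on its own
printf 'sampleName\tbamLocation\tbedLocation\n' >  demo_manifest.tsv
printf 'sampleA\t/path/to/sampleA.bam\t/path/to/panel.bed\n' >> demo_manifest.tsv
printf 'sampleB\t/path/to/sampleB.bam\t/path/to/panel.bed\n' >> demo_manifest.tsv

# Skip the header row, then read each line into named variables
tail -n +2 demo_manifest.tsv | while IFS=$'\t' read -r sampleName bamLocation bedLocation; do
  echo "would process ${sampleName}: ${bamLocation} (${bedLocation})"
done
```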

9.8 Where Next?

Now that you understand the basics of working with Bash and WDL, you are ready to try testing and profiling your jobs.