
TP 3: Job array

Objective: Speed up a job by splitting it into independent jobs running on many nodes.


Understand the job-array

This practice is designed to help you understand how job arrays work. We will start from the script we wrote during TP 2 and transform it, step by step, into a kind of template script applied to a list of files.

Setting a test environment

We copy the script blastx.sh from TP 2 and rename it as blastx_pe.sh.

blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task=1
#SBATCH --mem=200M
#SBATCH --time=00:05:00
#SBATCH -o blastx_pe.out

INPUT="contigs.fasta"

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $INPUT.blast \
    -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK
sleep 20

For testing and development purposes, we made the following modifications:

  • We set 1 cpu per job, a low memory requirement and a short walltime to ensure that our job starts quickly.
  • We wrote the log output into the blastx_pe.out file, a predictable filename, as we will look at it many times.
  • We added the echo command before the blastx command. This way, instead of running blastx, the command line will be written to the file blastx_pe.out.
  • Finally, we made the script wait for 20s before ending, with the sleep command. This will allow us to catch it with the squeue command.

First try

A job array is run by using sbatch with the option --array <subjob-numbers>.

Question

Run the current blastx_pe.sh and observe what happens with the command squeue -u <my-login> (where you replace <my-login> with your login).

sbatch --array 0-9 blastx_pe.sh

Once the jobs have ended, look at the log file blastx_pe.out. What do you observe?

Solution

When checking the jobs with the squeue command, you should see 10 jobs (with suffixes _0 to _9) related to blastx_pe.sh appear and run in parallel.

When the jobs end, the file blastx_pe.out should contain 10 times the following line:

blastx -db ensembl_danio_rerio_pep -query contigs.fasta -out contigs.fasta.blast -evalue 10e-10 -num_threads 1

If we had really run blastx instead of displaying the command, we would have run the same command 10 times in parallel, all writing to the same file!

Prepare data

We need to split the data in order to avoid running the same command many times on the same data.

Split input files

Question

Split the fasta file into 10 fasta files in a directory called contigs_split.

Tip

The fastasplit program from exonerate can be used for this purpose. Here is the command-line pattern fastasplit expects.

module load bioinfo/Exonerate/2.2.0
fastasplit \
    --fasta <my-fasta-file> \
    --output <my-output-dir> \
    --chunk <number-of-split-files>

Solution

mkdir contigs_split
module load bioinfo/Exonerate/2.2.0
fastasplit --fasta contigs.fasta --chunk 10 --output contigs_split

Check the number of split files

Question

Check the number of files obtained in the previous step.

Solution

ls contigs_split/* | wc -l
10

It must return 10, which is the number of split files.

Check the content of the split files

Question

Check that the number of sequences in the contigs.fasta file is the same as the total number of sequences in the split files.

Solution

# The number of sequences in 'contigs.fasta'
grep -c ">" contigs.fasta
# The number of sequences in the split files
grep ">" contigs_split/* | wc -l

Second try

Now that we have split the fasta file, we want to apply the blastx_pe.sh script to each chunk.

We modify blastx_pe.sh accordingly.

blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task=1
#SBATCH --mem=200M
#SBATCH --time=00:05:00
#SBATCH -o blastx_pe.out

INPUT="contigs_split/*" # (1)!

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $INPUT.blast \
    -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK
sleep 20
  1. We use the * pattern in order to catch all files in the contigs_split directory
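Why all files end up in a single command can be reproduced in any shell: an unquoted variable holding a glob pattern is expanded when it is used, not when it is assigned. A minimal sketch with a throwaway directory (filenames are illustrative):

```shell
# Create a throwaway directory with two dummy files
DEMO=$(mktemp -d)
touch "$DEMO/a.fa" "$DEMO/b.fa"

INPUT="$DEMO/*"   # the pattern is stored literally in the variable
echo $INPUT       # unquoted use: the shell expands the glob to both files
echo "$INPUT"     # quoted use: the pattern stays literal
```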

Question

As in the first try, run blastx_pe.sh and observe what happens with the command squeue -u <my-login>.

sbatch --array 0-9 blastx_pe.sh

Once the jobs have ended, look at the log file blastx_pe.out. What do you observe?

Solution

When checking the jobs with the squeue command, you should see 10 jobs related to blastx_pe.sh appear and run in parallel.

When the jobs end, the file blastx_pe.out should contain 10 times the following line:

blastx_pe.out
blastx -db ensembl_danio_rerio_pep -query contigs_split/contigs.fasta_chunk_0000000 contigs_split/contigs.fasta_chunk_0000001 contigs_split/contigs.fasta_chunk_0000002 contigs_split/contigs.fasta_chunk_0000003 contigs_split/contigs.fasta_chunk_0000004 contigs_split/contigs.fasta_chunk_0000005 contigs_split/contigs.fasta_chunk_0000006 contigs_split/contigs.fasta_chunk_0000007 contigs_split/contigs.fasta_chunk_0000008 contigs_split/contigs.fasta_chunk_0000009 -out contigs_split/*.fasta_chunk*.blast -evalue 10e-10 -num_threads 1
...

If we had really run blastx instead of displaying the command, we would have got the same error 10 times, as all 10 files are passed as input to blastx!

One file per job (SLURM_ARRAY_TASK_ID)

A simple solution

Our split sequence files are named contigs.fasta_chunk_0000000, contigs.fasta_chunk_0000001, and so on up to contigs.fasta_chunk_0000009.

By using the variable SLURM_ARRAY_TASK_ID, we can select a file this way.

blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task=1
#SBATCH --mem=200M
#SBATCH --time=00:05:00
#SBATCH -o blastx_pe.out

INPUT="contigs_split/contigs.fasta_chunk_000000$SLURM_ARRAY_TASK_ID"

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $INPUT.blast \
    -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK
sleep 20

Question

  1. Edit the script blastx_pe.sh and run it as a job array. When it ends, check the log file blastx_pe.out. What can you conclude?
  2. Try to run the script again with sbatch --array 1-10 blastx_pe.sh. What do you observe in the filenames?
Solution
  1. The script works: each file is correctly selected.
    The variable SLURM_ARRAY_TASK_ID takes different values, matching the values given to the --array option of the sbatch command (here 0 to 9).
    For the nth subjob, it allows computing the filename of the nth file.
  2. We observe that the name built for task 10 gets an extra digit and no longer fits the naming scheme:
    contigs.fasta_chunk_0000009
    contigs.fasta_chunk_00000010
    
    • If we had more split files, this would be a problem.
    • Moreover, if we have a list of files without predictable names, this solution doesn't work.
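One way around the extra-digit issue, assuming fastasplit pads chunk numbers to 7 digits as above, is to format the task id with printf instead of concatenating it. A quick sketch:

```shell
# Naive concatenation vs zero-padded formatting of the task id
for n in 9 10; do
    echo "naive : contigs.fasta_chunk_000000$n"
    printf 'padded: contigs.fasta_chunk_%07d\n' "$n"
done
# For n=10 the naive name has 8 digits, while %07d keeps 7: contigs.fasta_chunk_0000010
```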

A more robust solution

Filenames are not always predictable and computable. Instead of using the filename pattern approach, we can select the nth file in the list of input files. There are many ways to achieve that in bash. One of them is using bash arrays.

blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task=1
#SBATCH --mem=200M
#SBATCH --time=00:05:00
#SBATCH -o blastx_pe.out

FILE_LIST=(contigs_split/*) # (1)!
INPUT=${FILE_LIST[$SLURM_ARRAY_TASK_ID]} # (2)!

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $INPUT.blast \
    -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK
sleep 20
  1. We create a bash array containing all the files we want to blast
  2. We select the nth file in the list of files, with n=SLURM_ARRAY_TASK_ID
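The array indexing can be tried directly in a terminal, with made-up filenames (note that arrays are a bash feature, so this requires a bash shell):

```shell
FILE_LIST=(alpha.fa beta.fa gamma.fa)  # made-up filenames
echo "${#FILE_LIST[@]}"   # number of elements: 3
echo "${FILE_LIST[0]}"    # bash arrays are 0-indexed: alpha.fa
echo "${FILE_LIST[2]}"    # gamma.fa
```

This 0-indexing is also why the script is submitted with --array 0-9: the first array element has index 0.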

Question

Edit the script blastx_pe.sh and run it as a job array. When it ends, check the log file blastx_pe.out

Alternative ways

awk: In this case the submission command will be sbatch --array 1-10 blastx_pe.sh, as line numbers in awk start from 1 instead of 0.

blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task=1
#SBATCH --mem=200M
#SBATCH --time=00:05:00
#SBATCH -o blastx_pe.out

INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")" # (1)!

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $INPUT.blast \
    -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK
sleep 20
  1. Some explanations:
    • $() is command substitution (run in a subshell): run the command and get back its output.
    • NR means 'Number of Records' in awk; with default settings it is the current line number.
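The NR-based selection can be checked on any input, independently of Slurm:

```shell
printf 'line1\nline2\nline3\n' | awk 'NR==2'   # prints line2
```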

sed: In this case the submission command will be sbatch --array 1-10 blastx_pe.sh, as line numbers in sed start from 1 instead of 0.

blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task=1
#SBATCH --mem=200M
#SBATCH --time=00:05:00
#SBATCH -o blastx_pe.out

INPUT="$(ls contigs_split/*.fasta_chunk* | sed -n "$SLURM_ARRAY_TASK_ID{p;q}")" # (1)!

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $INPUT.blast \
    -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK
sleep 20
  1. Some explanations:
    • $() is command substitution (run in a subshell): run the command and get back its output.
    • The option -n means "suppress automatic output", whereas p means "print the selected line" and q means "quit after that line".
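The sed variant can be checked the same way:

```shell
printf 'line1\nline2\nline3\n' | sed -n '2{p;q}'   # prints line2, then stops reading
```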

Run the job for real

Question

Remove the echo before the blastx command and run the script blastx_pe.sh again as a job array.

When running, check the jobs status.

Solution

First, modify the script blastx_pe.sh to run the blast efficiently:

blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task=8
#SBATCH --mem=1G
#SBATCH --time=00:20:00
#SBATCH -o blastx_pe.out

FILE_LIST=(contigs_split/*)
INPUT=${FILE_LIST[$SLURM_ARRAY_TASK_ID]}
OUTPUT="blastx_split/$(basename "$INPUT").blast" # (1)!
mkdir -p blastx_split # create the output directory if it does not exist yet

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
blastx -db ensembl_danio_rerio_pep -query $INPUT -out $OUTPUT \
    -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK
# sleep 20
  1. Some explanations:
    • $() is command substitution (run in a subshell): run the command and get back its output.
    • basename is a command that extracts the filename (e.g. contigs.fasta_chunk_0000000) from a path (e.g. contigs_split/contigs.fasta_chunk_0000000). It can also remove an extension from the filename if needed.
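Concretely, basename behaves as follows (the optional second argument strips a suffix):

```shell
basename contigs_split/contigs.fasta_chunk_0000000            # contigs.fasta_chunk_0000000
basename contigs_split/contigs.fasta_chunk_0000000 _0000000   # contigs.fasta_chunk
```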

Then, get the number of files to process (there are 10 files):

ls contigs_split/*.fasta_chunk* | wc -l
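To avoid hard-coding the bounds, the count can be turned into the --array range. A sketch that only prints the submission command, assuming the same file layout as above:

```shell
# Derive the 0..N-1 range from the number of chunk files
N=$(ls contigs_split/*.fasta_chunk* | wc -l)
echo "sbatch --array 0-$((N-1)) blastx_pe.sh"   # with 10 files: sbatch --array 0-9 blastx_pe.sh
```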

Finally, run the array of jobs on all files:

sbatch --array 0-9 blastx_pe.sh

Don't forget to check the running jobs and the logs:

sq_long -u "$(whoami)"

Throttling the subjobs

Question

Run the array of jobs again on the first 4 split files, limiting the job to 2 simultaneously running subjobs.

Check the running jobs

Solution

sbatch --array 0-3%2 blastx_pe.sh

Merge results

Question

Concatenate all blast results obtained from the job array into one file.

Solution

cat blastx_split/*.blast > result.blast

The Genotoul-bioinfo sarray wrapper

We provide a wrapper called sarray that helps you run job arrays.

Given a file containing one job per line, it will run them as a job array. Each job is a list of commands.

We create a script named generate_blastx_array_cmds.sh that will generate such a file.

generate_blastx_array_cmds.sh
#!/bin/sh

NB_CPUS=2

for INPUT in contigs_split/*.fasta_chunk*; do
    OUTPUT="blastx_split/$(basename "$INPUT").blast"
    echo "module purge \
       && module load bioinfo/NCBI_Blast+/2.10.0+ \
       && blastx -db ensembl_danio_rerio_pep -query $INPUT -out $OUTPUT \
          -evalue 10e-10 -num_threads $NB_CPUS"
done

Then, we run the script generate_blastx_array_cmds.sh in an interactive session in order to generate blastx_array.cmds:

bash generate_blastx_array_cmds.sh > blastx_array.cmds
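The resulting blastx_array.cmds should contain one line per chunk, each chaining the module setup and the blastx call (exact whitespace may vary), of the form:

```shell
module purge && module load bioinfo/NCBI_Blast+/2.10.0+ && blastx -db ensembl_danio_rerio_pep -query contigs_split/contigs.fasta_chunk_0000000 -out blastx_split/contigs.fasta_chunk_0000000.blast -evalue 10e-10 -num_threads 2
```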

Finally, we run the array of jobs in blastx_array.cmds with the sarray command, with a maximum of 4 tasks in parallel.

sarray -J blastx --cpus-per-task 2 --%=4 blastx_array.cmds

where the options are the same as slurm's, with the exception of --%:

  • -J is the job name
  • --cpus-per-task is the number of cpus reserved by each task
  • --% is the maximum number of tasks running in parallel