TP 3: Job array
Objective: Speed up a job by splitting it into independent jobs running on many nodes.

Understand the job array
This practical is designed to help you understand how job arrays work. We will start from the script we wrote during TP 2 and transform it step by step into a kind of template script that will be applied to a list of files.
Setting up a test environment
We copy the script blastx.sh from TP 2 and rename it as blastx_pe.sh.
Listing: blastx_pe.sh
For testing and development purposes, we made the following modifications:
- We set 1 CPU per job, a low memory requirement and a short walltime to ensure that our job starts quickly.
- We write the log output into the blastx_pe.out file, a predictable filename, as we will look at it many times.
- We add the echo command before the blastx command. This way, instead of running blastx, the command line is written to the file blastx_pe.out.
- Finally, we make the script wait for 20 s before ending, using the sleep command. This allows us to catch it with the squeue command.
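A minimal sketch of what such a test script could look like, assuming the modifications listed above. The job name, the memory and time values, the database name my_protein_db and the blastx options are assumptions made for illustration; use the ones from your own TP 2 script. The script is written through a here-document so that the example can be pasted into a terminal as is.

```shell
# Write a test version of blastx_pe.sh in the current directory.
# Job name, memory, time, database and blastx options are assumptions.
cat > blastx_pe.sh <<'EOF'
#!/bin/bash
#SBATCH -J blastx_pe
#SBATCH -o blastx_pe.out
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00

# echo prints the command line instead of running blastx
echo blastx -query contigs.fasta -db my_protein_db -num_threads 1 -out contigs.blastx

# Keep the job alive for 20 s so that it can be seen with squeue
sleep 20
EOF

# Check the syntax of the generated script without executing it
bash -n blastx_pe.sh && echo "blastx_pe.sh is syntactically valid"
```

Because the #SBATCH lines are plain comments for bash, the same file can be run locally for a quick check before being submitted with sbatch.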
First try
A job array is run by using sbatch with the option --array <subjob-numbers>.
Question
Run the current blastx_pe.sh and observe what happens with the command squeue -u <my-login> (where you replace <my-login> with your login).
When it has ended, look at the log file blastx_pe.out. What do you observe?
Solution
When checking the job with the squeue command, you should see 10 jobs (with IDs ending in _0 to _9) related to blastx_pe.sh appear and run in parallel.
When the jobs end, the file blastx_pe.out should contain the same blastx command line 10 times.
If we had really run blastx instead of displaying the command, we would have run the same command 10 times in parallel, all writing to the same file!
Prepare data
We need to split the data in order to avoid running the same command many times on the same data.
Split input files
Question
Split the FASTA file into 10 FASTA files in a directory called contigs_split.
Tip
The fastasplit program from exonerate can be used for this purpose. It expects an input FASTA file, an output directory and a number of chunks.
Solution
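One possible way to do the split is sketched below. It assumes that fastasplit is already available in your PATH (on a cluster this usually means loading the relevant module first) and that contigs.fasta is in the current directory. The output directory is created beforehand to be on the safe side, and the guard only exists so that the snippet does not fail where the tool or the input file is missing.

```shell
# Create the output directory before splitting
mkdir -p contigs_split

# Split contigs.fasta into 10 chunks, only if the tool and the input are available
if command -v fastasplit >/dev/null 2>&1 && [ -f contigs.fasta ]; then
    # -f: input FASTA file, -o: output directory, -c: number of chunks
    fastasplit -f contigs.fasta -o contigs_split -c 10
else
    echo "fastasplit or contigs.fasta not found: nothing was split" >&2
fi
```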
Check the number of split files
Question
Check the number of files obtained in the previous step.
Solution
The count should be 10, which is the number of split files.
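A simple way to count the files is to list the directory and count the lines. The sketch below first builds a toy contigs_split directory in a temporary location (an assumption made only so that the example runs anywhere); on the cluster, only the ls | wc -l line is needed, pointed at your real directory.

```shell
# Toy stand-in for the real directory, so that the example runs anywhere
WORKDIR=$(mktemp -d)
mkdir "$WORKDIR/contigs_split"
for i in 0 1 2 3 4 5 6 7 8 9; do
    touch "$WORKDIR/contigs_split/contigs.fasta_chunk_000000$i"
done

# When piped, ls prints one entry per line, so wc -l gives the number of files
NB_FILES=$(ls "$WORKDIR/contigs_split" | wc -l)
echo "$NB_FILES"
```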
Check the content of split files
Question
Check that the number of sequences in the contigs.fasta file equals the total number of sequences in the split files.
Solution
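Since every FASTA record starts with a header line beginning with >, counting those lines gives the number of sequences. The sketch below uses a tiny made-up dataset in a temporary directory (three sequences split into two chunks) purely for illustration; with real data, apply the two grep commands to contigs.fasta and contigs_split/*.

```shell
# Toy data: 3 sequences in the original file, split into 2 chunks
WORKDIR=$(mktemp -d)
mkdir "$WORKDIR/contigs_split"
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$WORKDIR/contigs.fasta"
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$WORKDIR/contigs_split/contigs.fasta_chunk_0000000"
printf '>seq3\nTTAA\n' > "$WORKDIR/contigs_split/contigs.fasta_chunk_0000001"

# Count header lines in the original file
NB_ORIGINAL=$(grep -c '^>' "$WORKDIR/contigs.fasta")

# Count header lines over all chunks at once
NB_SPLIT=$(cat "$WORKDIR"/contigs_split/* | grep -c '^>')

echo "original: $NB_ORIGINAL, split: $NB_SPLIT"
```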
Second try
Now that we have split the FASTA file, we want to apply the blastx_pe.sh script to each chunk.
We modify blastx_pe.sh in order to use them.
Listing: blastx_pe.sh
- We use the * pattern to match all files in the contigs_split directory.
Question
As in the first try, run blastx_pe.sh and observe what happens with the command squeue -u <my-login>.
When it has ended, look at the log file blastx_pe.out. What do you observe?
Solution
When checking the job with the squeue command, you should see 10 jobs related to blastx_pe.sh appear and run in parallel.
When the jobs end, the file blastx_pe.out should contain the same command line 10 times, with all 10 split files listed as input.
If we had really run blastx instead of displaying the command, we would have got an error 10 times, as 10 files are passed as input to blastx!
One file per job (SLURM_ARRAY_TASK_ID)
A simple solution
Our split sequence files are named contigs.fasta_chunk_0000000, contigs.fasta_chunk_0000001, and so on up to contigs.fasta_chunk_0000009.
By using the variable SLURM_ARRAY_TASK_ID, we can select a file this way.
Listing: blastx_pe.sh
Question
- Edit the script blastx_pe.sh and run it as a job array. When it ends, check the log file blastx_pe.out. What can you conclude?
- Try to run the script again with sbatch --array 1-10 blastx_pe.sh. What do you observe about the filenames?
Solution
- The script works: each file is correctly selected. The variable SLURM_ARRAY_TASK_ID takes different values, matching the values given to the --array option of the sbatch command (here 0 to 3). For the nth subjob, it allows us to compute the filename of the nth file.
- We observe that the filename for subjob 10 overflows on the left: after contigs.fasta_chunk_0000009 we get contigs.fasta_chunk_00000010 instead of contigs.fasta_chunk_0000010.
  - If we had more split files, it could be a problem.
  - Moreover, if we have a list of files without predictable names, this solution doesn't work.
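The overflow can be reproduced, and avoided for predictable names, with a few lines of bash. This sketch assumes that the simple solution builds the name by gluing the task ID after a fixed run of zeros; the printf variant with a 7-digit zero-padded field is shown as one possible fix, not as the method used in the original script. The default value of 10 is only there so that the snippet runs outside Slurm.

```shell
# Outside Slurm, pretend to be subjob 10
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-10}

# Naive concatenation: a fixed run of six zeros, then the task ID
NAIVE="contigs.fasta_chunk_000000${SLURM_ARRAY_TASK_ID}"

# Zero-padding to a fixed width of 7 digits with printf
PADDED=$(printf 'contigs.fasta_chunk_%07d' "$SLURM_ARRAY_TASK_ID")

echo "$NAIVE"
echo "$PADDED"
```

With a task ID of 10, the naive name has eight digits while the padded one keeps seven, matching the names produced by the split.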
A more robust solution
Filenames are not always predictable and computable. Instead of using a filename pattern, we can select the nth file in the list of input files. There are many ways to achieve that in bash. One of them is to use bash arrays.
Listing: blastx_pe.sh
- We create a bash array containing all the files we want to blast.
- We select the nth file in the list of files, with n = SLURM_ARRAY_TASK_ID.
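The two steps above can be sketched as follows. The toy directory and the default task ID of 3 are assumptions that only make the snippet runnable outside Slurm; in the real script the array would be built from contigs_split/* and the task ID would be set by Slurm.

```shell
# Toy stand-in for the real directory, so that the example runs anywhere
WORKDIR=$(mktemp -d)
mkdir "$WORKDIR/contigs_split"
for i in 0 1 2 3 4 5 6 7 8 9; do
    touch "$WORKDIR/contigs_split/contigs.fasta_chunk_000000$i"
done

# Outside Slurm, pretend to be subjob 3
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-3}

# Bash array holding all input files; the glob expands in sorted order
FILES=("$WORKDIR"/contigs_split/*)

# Bash arrays are indexed from 0, which matches --array 0-9
INPUT=${FILES[$SLURM_ARRAY_TASK_ID]}

# Dry run: print the command instead of running blastx
echo blastx -query "$INPUT"
```

Since the selection relies only on the position in the sorted list, it keeps working when the filenames follow no regular pattern.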
Question
Edit the script blastx_pe.sh and run it as a job array. When it ends, check the log file blastx_pe.out.
Alternative ways
awk:
In this case the submission command will be sbatch --array 1-10 blastx_pe.sh, as line numbers in awk start from 1 instead of 0.
Listing: blastx_pe.sh
Some explanations:
- $() is a command substitution: it runs the command in a subshell and returns its output.
- NR means 'Number of Records' in awk, i.e. the number of the line currently being read.
sed:
In this case the submission command will be sbatch --array 1-10 blastx_pe.sh, as line numbers in sed start from 1 instead of 0.
Listing: blastx_pe.sh
Some explanations:
- $() is a command substitution: it runs the command in a subshell and returns its output.
- The option -n means "suppress automatic output", whereas p means "print the matching line" and q means "quit after this line".
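Both alternatives can be tried side by side on a toy directory. The exact commands of the original listings are not reproduced here; the sketch below is one plausible way to write them, with the task ID defaulting to 1 so that it runs outside Slurm.

```shell
# Toy stand-in for the real directory, so that the example runs anywhere
WORKDIR=$(mktemp -d)
mkdir "$WORKDIR/contigs_split"
for i in 0 1 2 3 4 5 6 7 8 9; do
    touch "$WORKDIR/contigs_split/contigs.fasta_chunk_000000$i"
done

# Outside Slurm, pretend to be subjob 1 of --array 1-10
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-1}

# awk: keep the line whose record number equals the task ID
INPUT_AWK=$(ls "$WORKDIR"/contigs_split/* | awk -v n="$SLURM_ARRAY_TASK_ID" 'NR == n')

# sed: print line n, then quit without reading the remaining lines
INPUT_SED=$(ls "$WORKDIR"/contigs_split/* | sed -n "${SLURM_ARRAY_TASK_ID}{p;q;}")

echo "$INPUT_AWK"
echo "$INPUT_SED"
```

With task ID 1, both commands select the first file of the list, which is why the array has to be submitted with indices 1 to 10.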
Run the job for real
Question
Remove the echo before the blastx command and run the script blastx_pe.sh again as a job array.
While it is running, check the status of the jobs.
Solution
First, modify the script blastx_pe.sh to run the blast efficiently:
Listing: blastx_pe.sh
Some explanations:
- $() is a command substitution: it runs the command in a subshell and returns its output.
- basename is a command that extracts the filename (e.g. contigs.fasta_chunk_0000000) from a path (e.g. contigs_split/contigs.fasta_chunk_0000000). It can also remove the extension from the filename if needed.
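As an illustration of how basename can give each subjob its own output file, here is a small sketch. The input path is the example from the explanation above; the .blastx suffix for the output name is an assumption.

```shell
# Example input path, as selected for one subjob
INPUT=contigs_split/contigs.fasta_chunk_0000000

# Strip the directory part to get the bare filename
NAME=$(basename "$INPUT")

# One output file per subjob (the .blastx suffix is an assumption)
OUTPUT="${NAME}.blastx"
echo "$OUTPUT"

# basename can also strip a suffix given as a second argument
basename contigs.fasta .fasta
```

Deriving the output name from the input name means that parallel subjobs never write to the same result file.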
Then, get the number of files to process (there are 10 files):
Finally, run the array of jobs on all files:
Don't forget to check the running jobs and the logs:
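These three steps could look like the sketch below. It is a dry run on a toy directory: the sbatch command is only printed, and the squeue command is left as a comment, so nothing here needs a Slurm installation. It assumes the bash array version of the script, whose indices start at 0.

```shell
# Toy stand-in for the real directory, so that the example runs anywhere
WORKDIR=$(mktemp -d)
mkdir "$WORKDIR/contigs_split"
for i in 0 1 2 3 4 5 6 7 8 9; do
    touch "$WORKDIR/contigs_split/contigs.fasta_chunk_000000$i"
done

# Number of files to process
NB_FILES=$(ls "$WORKDIR/contigs_split" | wc -l)

# Array indices go from 0 to NB_FILES - 1, like bash array indices
LAST=$((NB_FILES - 1))

# Dry run: remove "echo" to really submit on the cluster
echo sbatch --array=0-"$LAST" blastx_pe.sh

# Then, on the cluster, follow the subjobs with:
#   squeue -u <my-login>
```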
Throttling the subjobs
Question
Run the job array again on the first 4 split files, limiting it to 2 simultaneously running subjobs.
Check the running jobs.
Solution
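In Slurm, the number of simultaneously running subjobs is limited by appending % and a maximum to the --array range. The sketch below only builds and prints the submission command (a dry run), assuming the 0-based indices of the bash array version of the script.

```shell
# First 4 files: indices 0 to 3, with at most 2 subjobs running at once
FIRST=0
LAST=3
MAX_RUNNING=2

# Slurm syntax for a throttled array: <first>-<last>%<max simultaneous>
ARRAY_SPEC="${FIRST}-${LAST}%${MAX_RUNNING}"

# Dry run: remove "echo" to really submit on the cluster
echo sbatch --array="$ARRAY_SPEC" blastx_pe.sh
```

While the array is running, squeue should show at most two subjobs in the running state, with the remaining ones pending.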
Merge results
Question
Concatenate all blast results obtained from the job array into one file.
Solution
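A plain cat over the result files is enough to merge them. In the sketch below, the blastx_results directory, the .blastx suffix and the file contents are made up for the example; adapt the glob to wherever your subjobs wrote their output.

```shell
# Toy results from two subjobs (names and contents are made up)
WORKDIR=$(mktemp -d)
mkdir "$WORKDIR/blastx_results"
echo "hit_from_chunk_0" > "$WORKDIR/blastx_results/contigs.fasta_chunk_0000000.blastx"
echo "hit_from_chunk_1" > "$WORKDIR/blastx_results/contigs.fasta_chunk_0000001.blastx"

# Concatenate in sorted filename order; the merged file is written outside
# the input directory so that it can never match the input glob
cat "$WORKDIR"/blastx_results/*.blastx > "$WORKDIR/contigs.blastx"

# Number of lines in the merged file
wc -l < "$WORKDIR/contigs.blastx"
```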
The Genotoul-bioinfo sarray wrapper
We provide a wrapper called sarray that helps you to run job arrays.
Given a file containing one job per line, it will run them as a job array. A job is a list of commands.
We create a script named generate_blastx_array_cmds.sh that will generate such a file.
Listing: generate_blastx_array_cmds.sh
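A generator of this kind can be as simple as a loop that prints one command line per input file. The sketch below is an assumption about its general shape, not the original script: the database name my_protein_db and the blastx options are hypothetical, and a toy directory with three files is created so that it runs anywhere.

```shell
#!/bin/bash
# Toy stand-in for the real directory, so that the example runs anywhere
WORKDIR=$(mktemp -d)
mkdir "$WORKDIR/contigs_split"
for i in 0 1 2; do
    touch "$WORKDIR/contigs_split/contigs.fasta_chunk_000000$i"
done
cd "$WORKDIR" || exit 1

# Write one job (one command line) per input file.
# my_protein_db and the blastx options are hypothetical.
for f in contigs_split/*; do
    echo "blastx -query $f -db my_protein_db -num_threads 1 -out $(basename "$f").blastx"
done > blastx_array.cmds

cat blastx_array.cmds
```

If a job needs several commands, such as loading a module before running blastx, they can be chained on the same line with ; or &&.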
Then, in an interactive session, we run the script generate_blastx_array_cmds.sh in order to generate the file blastx_array.cmds:
Finally, we run the jobs listed in blastx_array.cmds as a job array with the sarray command, with a maximum of 4 tasks in parallel.
The options are the same as the Slurm options, with the exception of --%:
- -J is the job name
- --cpus-per-task is the number of CPUs reserved for each task
- --% is the maximum number of tasks running in parallel