TP 3.2: Data mining from files

Prerequisites

TP 3.1 must be done beforehand.

You must be in the directory ~/save/tp_linux.

/home/<username>
├── save
│   └── tp_linux                   <- you are here
│       ├── blast_result
│       └── data
└── work

For this exercise, we will use blastn. It can be run with the following commands:

module load bioinfo/NCBI_Blast+/2.10.0+
blastn -query data/ab005233.fasta -db ensembl_arabidopsis_thaliana_cdna \
    -outfmt 7

Use "blast" to produce tabular files

Question

Run the blastn command and redirect the results to a file named ab005233.blast.

Solution

module load bioinfo/NCBI_Blast+/2.10.0+
blastn -query data/ab005233.fasta -db ensembl_arabidopsis_thaliana_cdna -outfmt 7 > ab005233.blast
cat ab005233.blast

Sort result file

Question

Sort the file ab005233.blast by % identity, in descending order. To find the right column, display the beginning of the file first.

Remember to skip the first 5 lines (the header comments) with the tail command.

Solution

tail -n +6 ab005233.blast | sort -k 3 -r -n
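The solution above assumes exactly five header lines. In the -outfmt 7 layout, comment lines start with #, and a closing comment line typically appears at the end of the file as well, so filtering on # can be more robust than counting lines. Here is a minimal sketch of that variant; the table is invented for illustration (it is not real BLAST output) and reduced to three columns, with % identity in column 3 as in the exercise.

```shell
# Build a tiny, invented tabular file that mimics the -outfmt 7 layout:
# comment lines start with '#', data columns are tab-separated.
demo=$(mktemp)
printf '# Fields: query acc.ver, subject acc.ver, %% identity\n' > "$demo"
printf 'Q1\tsubjA\t87.50\n' >> "$demo"
printf 'Q1\tsubjB\t100.00\n' >> "$demo"
printf 'Q1\tsubjC\t92.31\n' >> "$demo"
printf '# BLAST processed 1 queries\n' >> "$demo"

# Drop every comment line, then sort on column 3 only (-k3,3),
# numerically (n) and in descending order (r).
sorted=$(grep -v '^#' "$demo" | sort -t "$(printf '\t')" -k3,3nr)
printf '%s\n' "$sorted"
rm -f "$demo"
```

Writing the key as -k3,3 restricts the comparison to the third column; with -k 3 alone the key runs from column 3 to the end of the line.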

Display some columns

Question

By using the same blast file, display only the subject names.

Solution

head ab005233.blast
tail -n +6 ab005233.blast | cut -f 2
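A subject can appear on several lines when it has more than one alignment with the query, so the list produced by cut may contain repeats. As a possible follow-up, the sketch below keeps each name once. The hit table is invented for illustration and only has three columns.

```shell
# Invented hit table: subjA appears twice, as it would with two alignments.
hits=$(printf 'Q1\tsubjA\t99.0\nQ1\tsubjB\t95.0\nQ1\tsubjA\t90.0\n')

# Column 2 holds the subject name; sort -u keeps one copy of each name.
unique_subjects=$(printf '%s\n' "$hits" | cut -f 2 | sort -u)
printf '%s\n' "$unique_subjects"
```

On the real file, the same idea would be tail -n +6 ab005233.blast | cut -f 2 | sort -u.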

Concatenate data files

Question

Go to the directory ~/save/tp_linux/data and concatenate the fasta files matching ab005*.fasta into a new file called mes_sequences.fasta.

/home/<username>
├── save
│   └── tp_linux
│       ├── blast_result
│       └── data                        <- you are here
│           ├── ab005*.fasta            <- concatenate them ...
│           └── mes_sequences.fasta     <- ... into this file
└── work

Count the number of sequences in the new file.

Solution

cd ~/save/tp_linux/data
cat ab005*.fasta > mes_sequences.fasta
grep -c ">" mes_sequences.fasta

Question

Append the sequence from ab017070.fasta to the file mes_sequences.fasta.

Solution

cat ab017070.fasta >> mes_sequences.fasta
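The difference between > and >> matters here: > creates or overwrites the target file, while >> appends to it. The sketch below shows both on two invented one-sequence files in a temporary directory, so that no real data is touched; the file names are hypothetical.

```shell
# Work in a throwaway directory so no real file is touched.
tmpdir=$(mktemp -d)
printf '>seq1\nacgt\n' > "$tmpdir/one.fasta"   # invented FASTA record
printf '>seq2\nttga\n' > "$tmpdir/two.fasta"   # invented FASTA record

cat "$tmpdir/one.fasta" >  "$tmpdir/all.fasta"   # '>' creates or overwrites
cat "$tmpdir/two.fasta" >> "$tmpdir/all.fasta"   # '>>' appends, seq1 is kept

n_lines=$(wc -l < "$tmpdir/all.fasta" | tr -d ' ')
first_header=$(head -n 1 "$tmpdir/all.fasta")
echo "$n_lines lines, first header: $first_header"
rm -rf "$tmpdir"
```

Using > in the second cat would have replaced the content of all.fasta with seq2 only.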

Display page by page

Question

Display the file mes_sequences.fasta page by page.

Search for the string AB017070 to check that the sequence was added correctly. Type / in the pager to start a search.

Solution

First, run a pager:

less mes_sequences.fasta

Then type / followed by the search string, and press Enter to search inside the document.

When finished, press q to quit.

Count the number of sequences

Question

Count the number of sequences using the grep command.

Solution

grep -c ">" mes_sequences.fasta
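Each FASTA record begins with a header line that starts with >, which is why counting those lines counts the sequences. The quotes are needed because an unquoted > would be read by the shell as a redirection. A slightly stricter variant anchors the pattern to the start of the line; the sketch below applies it to an invented two-record file and also lists the identifiers.

```shell
# Invented two-record FASTA file, written to a temporary location.
fasta=$(mktemp)
printf '>seq1 first made-up record\nacgtacgt\nacgt\n' > "$fasta"
printf '>seq2 second made-up record\nttgacc\n' >> "$fasta"

# '^>' matches only lines that begin with '>', i.e. FASTA header lines.
n_seq=$(grep -c '^>' "$fasta")
echo "$n_seq"

# List the identifiers: keep header lines, strip the leading '>' and
# everything after the first space.
ids=$(grep '^>' "$fasta" | cut -c 2- | cut -d ' ' -f 1)
printf '%s\n' "$ids"
rm -f "$fasta"
```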

Compare two files

Question

Using the meld command, compare the file ab106670.fasta with the file /save/user/formation/tp_linux/ab106670_bis.fasta.

Solution

meld ab106670.fasta /save/user/formation/tp_linux/ab106670_bis.fasta
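meld is a graphical tool, so it needs a display. When only a terminal is available, the standard diff command is a text-mode alternative (a different tool from the one asked for in the exercise). The sketch below runs it on two invented files that differ by a single base; the file names and contents are hypothetical.

```shell
tmpdir=$(mktemp -d)
printf '>ab_demo\nacgtacgt\nttgacca\n' > "$tmpdir/orig.fasta"   # invented content
printf '>ab_demo\nacgtacgt\nttgtcca\n' > "$tmpdir/bis.fasta"    # one base changed

# diff exits with status 1 when the files differ, so '|| true' keeps a
# script going. Lines starting with '<' come from the first file,
# lines starting with '>' from the second.
changes=$(diff "$tmpdir/orig.fasta" "$tmpdir/bis.fasta" || true)
printf '%s\n' "$changes"
rm -rf "$tmpdir"
```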

Search in a directory

Question

Search the fasta files for sequences that contain the pattern ttatatatc.

Solution

grep "ttatatatc" *.fasta
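Two grep options can be handy with this kind of search: -l prints only the names of the matching files, and -i ignores case, which helps when some files store sequences in upper case. Note that grep works line by line, so a pattern split across two sequence lines is not found. The sketch below uses three invented files in a temporary directory.

```shell
tmpdir=$(mktemp -d)
printf '>s1\nggttatatatcgg\n' > "$tmpdir/a.fasta"   # contains the pattern
printf '>s2\nGGTTATATATCGG\n' > "$tmpdir/b.fasta"   # same bases, upper case
printf '>s3\nacgtacgtacgt\n'  > "$tmpdir/c.fasta"   # no match

# -l prints only the names of files with at least one matching line.
exact=$(cd "$tmpdir" && grep -l "ttatatatc" *.fasta)
# -i ignores case, so the upper-case sequence is found as well.
nocase=$(cd "$tmpdir" && grep -il "ttatatatc" *.fasta)

printf 'case-sensitive:\n%s\n' "$exact"
printf 'case-insensitive:\n%s\n' "$nocase"
rm -rf "$tmpdir"
```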