TP 3.2: Data mining from files
Prerequisites
TP 3.1 must be completed beforehand.
You must be in the directory ~/save/tp_linux.
/home/<username>
├── save
│   └── tp_linux       <- you are here
│       ├── blast_result
│       └── data
└── work
For this practical, we will use blastn. It can be run with the following command:
Use "blast" to produce tabulated files
Question
Run the blastn command and redirect the results to a file named ab005233.blast.
Solution
Sort the result file
Question
Sort the file ab005233.blast by % identity, in reverse (descending) order. To find the right column, display the beginning of the file first.
Remember to remove the first 5 lines with the tail command.
Solution
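One possible approach, as a minimal sketch. It assumes the file is tab-separated, starts with 5 comment lines, and holds the % identity in column 3; check these assumptions against the output of head on your own file. The sample data below is hypothetical and is written to a scratch directory.

```shell
# Work in a scratch directory so the real ab005233.blast is left untouched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample: 5 comment lines, then tab-separated hits
# (query, subject, % identity, alignment length).
printf '%s\n' \
  '# BLASTN' \
  '# Query: AB005233' \
  '# Database: sample_db' \
  '# Fields: query id, subject id, % identity, alignment length' \
  '# 3 hits found' > ab005233.blast
printf 'AB005233\tsubjA\t87.50\t120\n'  >> ab005233.blast
printf 'AB005233\tsubjB\t100.00\t300\n' >> ab005233.blast
printf 'AB005233\tsubjC\t93.20\t210\n'  >> ab005233.blast

# Display the beginning of the file to locate the % identity column.
head -n 6 ab005233.blast

# tail -n +6 starts printing at line 6, i.e. it skips the first 5 lines.
# sort: -t sets the tab as field separator, -k3,3 sorts on column 3 only,
# n = numeric comparison, r = reverse order.
sorted=$(tail -n +6 ab005233.blast | sort -t "$(printf '\t')" -k3,3nr)
printf '%s\n' "$sorted"
```

On the real file, only the last pipeline is needed, with the column number adjusted if % identity sits elsewhere.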
Display some columns
Question
Using the same blast file, display only the subject names.
Solution
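A possible sketch with cut, assuming the subject name is in column 2 of a tab-separated file and that the first 5 lines are comments. The sample file is hypothetical.

```shell
# Scratch directory with a hypothetical sample file.
workdir=$(mktemp -d)
cd "$workdir"
printf '# header line %s\n' 1 2 3 4 5 > ab005233.blast
printf 'AB005233\tsubjA\t87.50\n'  >> ab005233.blast
printf 'AB005233\tsubjB\t100.00\n' >> ab005233.blast

# cut splits on tabs by default; -f 2 keeps only the second column.
subjects=$(tail -n +6 ab005233.blast | cut -f 2)
printf '%s\n' "$subjects"
```

Several columns can be kept at once with a list such as -f 2,3.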
Concatenate data files
Question
Go into the directory ~/save/tp_linux/data and concatenate the fasta files matching ab005*.fasta into a new file called mes_sequences.fasta.
/home/<username>
├── save
│   └── tp_linux
│       ├── blast_result
│       └── data       <- you are here
│           ├── ab005*.fasta          <- concatenate them ...
│           └── mes_sequences.fasta   <- ... into this file
└── work
Count the number of sequences in the new file.
Solution
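A minimal sketch of the concatenation and the count. The fasta files created here are hypothetical stand-ins for the real ones in ~/save/tp_linux/data, and the count assumes that every sequence starts with a header line beginning with ">".

```shell
# Scratch directory with hypothetical stand-ins for the real fasta files.
workdir=$(mktemp -d)
cd "$workdir"
printf '>AB005001\nacgtacgt\n' > ab005001.fasta
printf '>AB005002\nttgacca\n'  > ab005002.fasta
printf '>AB017070\nggcatt\n'   > ab017070.fasta   # does not match ab005*

# The shell expands ab005*.fasta to the matching file names;
# > creates the target file (or overwrites it if it already exists).
cat ab005*.fasta > mes_sequences.fasta

# One header line per sequence, so counting header lines counts sequences.
count=$(grep '^>' mes_sequences.fasta | wc -l)
echo "$count"
```

In the real data directory, the cat line and the grep line are the only two commands needed.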
Question
Append the sequence from ab017070.fasta to the file mes_sequences.fasta.
Solution
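A sketch of the append step, using hypothetical file contents. The key point is the >> redirection, which adds to the end of the file, whereas a single > would overwrite it.

```shell
# Scratch directory with hypothetical file contents.
workdir=$(mktemp -d)
cd "$workdir"
printf '>AB005001\nacgt\n>AB005002\nttga\n' > mes_sequences.fasta
printf '>AB017070\nggcatt\n' > ab017070.fasta

# >> appends the output of cat to the end of mes_sequences.fasta.
cat ab017070.fasta >> mes_sequences.fasta

# The last header in the file should now be the appended sequence.
last_header=$(grep '^>' mes_sequences.fasta | tail -n 1)
echo "$last_header"
```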
Display page by page
Question
Display the file mes_sequences.fasta page by page.
Search for the string AB017070 to check that the sequence was added correctly. Press / in the pager to start a search.
Solution
First, run a pager:
Then press / to start a search inside the document.
When finished, press Q to quit.
Count the number of sequences
Question
Count the number of sequences using the grep command.
Solution
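A possible sketch with grep alone, on a hypothetical sample file. It assumes one header line starting with ">" per sequence.

```shell
# Scratch directory with a hypothetical three-sequence file.
workdir=$(mktemp -d)
cd "$workdir"
printf '>AB005001\nacgt\n>AB005002\nttga\n>AB017070\nggcatt\n' > mes_sequences.fasta

# -c prints the number of matching lines instead of the lines themselves.
# ^ anchors the match at the start of the line. The quotes are required:
# an unquoted > would be interpreted by the shell as a redirection.
nb_seq=$(grep -c '^>' mes_sequences.fasta)
echo "$nb_seq"
```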
Compare two files
Question
Use the meld command to compare the file ab106670.fasta with the file /save/user/formation/tp_linux/ab106670_bis.fasta.
Solution
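meld is a graphical tool that takes the two files to compare as arguments. For a session without a display, the same comparison can be approximated in text mode with diff; this is a substitute, not the tool the question asks for. A minimal sketch on two hypothetical files that differ by one base:

```shell
# Scratch directory with two hypothetical files differing on line 3.
workdir=$(mktemp -d)
cd "$workdir"
printf '>AB106670\nacgtacgt\nttgacca\n' > ab106670.fasta
printf '>AB106670\nacgtacgt\nttgacga\n' > ab106670_bis.fasta

# Lines prefixed with < come from the first file, > from the second.
# diff exits with status 1 when the files differ, so "|| true" keeps
# a script running under "set -e" from stopping here.
differences=$(diff ab106670.fasta ab106670_bis.fasta || true)
printf '%s\n' "$differences"
```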
Search in a directory
Question
Search the fasta files for sequences that contain the pattern ttatatatc.
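One possible approach with grep, sketched on hypothetical files. The -i option is an assumption added here in case some files store the sequence in upper case; drop it for a strictly case-sensitive search.

```shell
# Scratch directory with hypothetical fasta files.
workdir=$(mktemp -d)
cd "$workdir"
printf '>seq1\nacgttatatatcgga\n' > ab005001.fasta
printf '>seq2\nacgtacgtacgt\n'    > ab005002.fasta
printf '>seq3\nggTTATATATCaa\n'   > ab017070.fasta

# -l lists only the names of the files containing a match; -i ignores case.
matching_files=$(grep -l -i 'ttatatatc' *.fasta)
printf '%s\n' "$matching_files"

# Without -l, grep prints each matching line prefixed by its file name.
grep -i 'ttatatatc' *.fasta
```

grep works line by line, so a pattern that straddles a line break inside a wrapped sequence will not be found this way.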