# How to get my favorite SRA data?
- Let’s say I’m interested in the skin metagenome data set in Bioproject 266117 (prjna266117)
- I can access the data at https://www.ncbi.nlm.nih.gov/sra by searching for SRA bioproject for prjna266117 or SRA bioproject for 266117, which will lead me to: https://www.ncbi.nlm.nih.gov/bioproject/?term=(prjna266117)%20AND%20bioproject_sra[filter] %20NOT%20bioproject_gap[filter]
OR
- I will download and install the SRA toolkit to smoothly download the data (https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/)
# How to use SRA toolkit to get the data?
fastq-dump sequence_file_name—fasta --gzip --outdir bioproject_266117; done [bioproject_266117 is an output folder that I created]
YOU CAN ALSO download in fastq format
# Diamond commands:
If your database is not made yet, especially if it’s a custom one, you will want to make the database from a fasta file (here antimicrobial resistance proteins):
diamond makedb --in db.faa -d databasename
e.g., diamond makedb --in /data/resistome/blactamases_2019.fasta -d blaDB #creates a database for all beta lactamases that I had collected and curated (with student help) and named it here blaDB for future reference
# How to DIAMOND?
diamond blastx -t externalـdrive -d diamond_database -q fastq filtered query file -o output.f6 -k 1
# -k or —max-target-seqs: maximum number of target sequences to report alignments for
e.g.,
mkdir myAMR_analysis
diamond blastx -t ./myAMR_analysis -d myAMR_analysis/blaDB -q /metagenome_files/SRR5535731.fastq -o SRR5535731.vs.blaDB -k 1
# To see the alignments for the BLAST hits
diamond blastx -t ./myAMR_analysis -d myAMR_analysis/blaDB -q /metagenome_files/SRR5535731.fastq -o SRR5535731.vs.blaDB -k 1 --outfmt 0 #output format (-f or --outfmt) with the code 0 will give full alignments
# To change the statistical threshold (Expect value or E-value), add -e or --evalue
# How to copy files into my directory (folder)?
cd or cd ~ #changes the directory to your user directory, from wherever you are
mkdir skinMGs #creates a directory in your user directory
cp /data/MGs/skin/SRRxxxx ./skinMGs/ # (SRRxxx = your assigned file)
or, if you're ambitious ...
cp /data/MGs/skin/*fq skinMGs/ #copies all files that end with 'fq' and that are in the subdirectory 'skin' (under subdirectory 'MGs', which is in the public directory 'data') to the directory you just created
To check if the files were properly copied:
ls #lists all the files in the directory you are in
So, to have a list of the files you just copied, do either:
ls /skinMGs
or
ls /skinMGs/*
or
cd skinMGs
ls
For more details on the files:
ls -l #lists all the files, with all their details, including their size in bytes
or
ls -lh #this will list the files as above, but will give you the sizes in a "human readable" way (whence the -h), i.e., will tell you 18 M (megabytes) instead of 18202538
If the list is long (which it could be, since there are 60+ files), you can read it page by page if you pipe to the program 'less'
ls -l | less
one way to count these files is:
ls | wc #gives you the line, word, and character count by using the word count command
ADVANCED
#If you want to run several files
# So, with default E-value (0.001):
for i in $(ls SRR*sub.fq); do diamond blastx -b 1 -t /data/MGs/gut/Subsampled_60K -d blaDB -q $i -o $i.f6.txt -k 1; done
# With a permissive E-value (1)
for i in $(ls SRR*sub.fq); do diamond blastx -b 1 -t /data/MGs/gut/Subsampled_60K -d blaDB -q $i -o $i.f6.e_1.txt -k 1 -e 1; done
# With a more restrictive E-value (1e-10)
for i in $(ls SRR*sub.fq; do diamond blastx -b 1 -t /data/MGs/gut/Subsampled_60K -d blaDB -q $i -o $i.f6.e_10.txt -k 1 --evalue e-10; done #--evalue is the same as -e
#To do all the above in an analysis directory in your user directory
cd
mkdir myAMR_analysis #(if you have not yet created that directory)
cd myAMR_analysis
ls #just to verify you the directory is there
# Next
cd ~/skinMGs
for i in $(ls SRR*sub.fq); do diamond blastx -b 1 -t myAMR_analysis -d ~/blaDB -q $i -o myAMR_analysis/$i.f6.txt -k 1; done
and so on...
e.g.,
for i in $(ls SRR*sub.fq); do diamond blastx -b 1 -t myAMR_analysis -d ~/blaDB -q $i -o myAMR_analysis/$i.f6.e_10.txt -k 1 -e 1e-10; done #changes E-value threshold
for i in $(ls SRR*sub.fq); do diamond blastx -b 1 -t myAMR_analysis -d ~/blaDB -q $i -o myAMR_analysis/$i.f0.txt -k 1 -f 0; done #changes output format from default (6) to aligned (0)
for i in $(ls SRR*sub.fq); do diamond blastx -b 1 -t myAMR_analysis -d ~/cardDB -q $i -o myAMR_analysis/$i.f6.txt -k 1; done #changes database from blaDB to cardDB
# To combine all best hits in one file:
for i in $(ls *f6.txt*_05.out); do cat $i; done >MyCombinedAMR_Results.txt