Difference between revisions of "SampleData PairedEndSequencesWithQuality processing"
Line 23: | Line 23: | ||
Let me remind you that this sequencing data has already been demultiplexed, meaning that the sequencing data has been sorted based on the indexes/barcodes; this is done by the HiSeq instrument. | Let me remind you that this sequencing data has already been demultiplexed, meaning that the sequencing data has been sorted based on the indexes/barcodes; this is done by the HiSeq instrument. | ||
− | ====Command set 1==== | + | ====Command set 1 - Activate Qiime==== |
The following commands must be executed using a terminal window/shell from the directory immediately above the ‘reads’ directory. The following assumes that Qiime2 has been installed on your computer. If you are just opening a terminal window, you must enter into the Qiime2 environment by typing: | The following commands must be executed using a terminal window/shell from the directory immediately above the ‘reads’ directory. The following assumes that Qiime2 has been installed on your computer. If you are just opening a terminal window, you must enter into the Qiime2 environment by typing: | ||
Revision as of 16:10, 7 July 2022
Introduction
Data sent to us by Frank Oliaro (foliaro@sheddaquarium.org) from Shedd Aquarium has the input format of the “Casava 1.8 paired-end demultiplexed fastq.” This data (.fastq vs .fast) also contains “Quality” information. In this format there are two data files per sample, one for the forward read (R1) and one for the reverse read (R2); the files are compressed and have the XXX_fastq.gz extension. Qiime2 can work with compressed files, so no need to convert the XXX.fastq.gz to XXX.fastq. This file format is described on the Qiime2 website (see importing link below) as:
“In Casava 1.8 demultiplexed (paired-end) format, there are two fastq.gz files for each sample in the study, each containing the forward or reverse reads for that sample. The file name includes the sample identifier. The forward and reverse read file names for a single sample might look like L2S357_15_L001_R1_001.fastq.gz and L2S357_15_L001_R2_001.fastq.gz, respectively. The underscore-separated fields in this file name are:
- the sample identifier, - the barcode sequence or a barcode identifier, - the lane number, - the direction of the read (i.e. R1 or R2), and - the set number.”
The amplicon to be sequenced was for the variable region 4 (V4) of the 16S gene which contains 254 base pairs. The HiSeq Lumina sequncing requires additional DNA referred to as adapters, sequencing binding sites, indices (or barcodes), and complements to the flow cell oligios. Therefore the DNA sequence has an overall length of >400 base pairs. Since the HiSeq instrument works best with ~200 base pairs, sequencing forward and reverse generates these two sequences which can be analyzed (ie. combined/joined together) to generate a final sequence containing the entire 254 (V4) base pairs.
Analyzing the Data
STEP 1
The first step in the processing of the two forward and reverse files (XXX_R1_YYY.fastq.gz and XXX_R2_YYY.fastq.gz) is to import then into qiime2. Prior to doing any qiime2 work we must place all data files in a single directory. For example, Frank O sent us a link to the data that was compressed into a single zip file. When downloaded and unzipped, the resulting folder contained 11 subfolders each with the R1 and R2 data file. A new folder was made called ‘reads’ and all R1/R2 data sets were put directly (ie, no subfolders) in this folder. The resulting ‘reads’ folder now contains 22 files with sequencing data for 11 samples. When importing data into qiime2, all data in the ‘reads’ folder will be processed at the same time.
STEP 2
Importing this data will be outlined below, but can be found on the Qiime2 website: https://docs.qiime2.org/2021.8/tutorials/importing/ …scroll down under the heading of Casava 1.8 paired-end demultiplexed fastq
Let me remind you that this sequencing data has already been demultiplexed, meaning that the sequencing data has been sorted based on the indexes/barcodes; this is done by the HiSeq instrument.
Command set 1 - Activate Qiime
The following commands must be executed using a terminal window/shell from the directory immediately above the ‘reads’ directory. The following assumes that Qiime2 has been installed on your computer. If you are just opening a terminal window, you must enter into the Qiime2 environment by typing:
For VR laptops...
conda activate qiime2-2022.2
For other computers...you need to know what version of qiime you installed...
conda activate qiime2-202X.X <—where the X.X is your installed version; this is qiime2-2021.4 or qiime2-2022.2 (MAC), qiime2-2021.8 on the PC in CSB 360.
- - the above command will then result in the terminal window showing a (quine2-2021.4) in the command line.
Command set 2
Now navigate to the proper directory; fyi, when working in a “window” environment we usually refer to folders, but when working in a terminal window/shell we often use the term directory. These terms have the exact same meaning. The UNIX commands cd XYZ for change directory to XYZ and ls for list directory contents are necessary to navigate within a terminal window.
Once in the proper directory, ie if you type ls you will see the reads directory listed, the following series of commands will be executed (do not type the word <enter> just hit the enter/return key):
qiime tools import \ <enter> <— this will launch the qiime application and the \ means wait, then --type 'SampleData[PairedEndSequencesWithQuality]' \ <enter> <— this command identified the data type --input-path reads/ \ <enter> --input-format CasavaOneEightSingleLanePerSampleDirFmt \ <enter> --output-path demux-paired-end.qza
Command set 3
Once the set of commands above have run successfully, you need to exam the sequence quality (QC) to determine how the data sequences should be trimmed. There is a plugin for graphing sequence quality that is run by the following commands (this command is run from the directory containing the .qza file):
qiime demux summarize \ --i-data demux-paired-end.qza \ --o-visualization demux-paired-end_viz.qzv
This output file, demux-paired-end_vis.qzv is then “dragged and dropped” into the https://view.qiime2.org window and data will be displayed.
Interpreting the .qzv data The view.qiime.org webpage will display a window with 2 tabs:
- Overview - Interactive Quality Plot
The information in the Interactive Quality Plot shows that the quality score (blk data) is between 40 and 20 and this data looks really good (some quality scores are in the 2-10 range). Based on this plot we will trim the front by 15 basepairs and the end at 230. Keep in mind that the V4 region is 256 base pairs and since we have ~ 220 forward reads and ~220 reverse reads, we will have a significant amount of end-overlap which will be discarded once we join the two sets of reads.
Command set 4
To do this trimming, execute the following commands:
qiime dada2 denoise-paired \ --i-demultiplexed-seqs demux-paired-end.qza \ --p-trim-left-f 15 \ --p-trim-left-r 15 \ --p-trunc-len-f 220 \ --p-trunc-len-r 220 \ --o-representative-sequences rep-seqs-dada2.qza \ --o-table table-dada2.qza \ --o-denoising-stats stats-dada2.qza
The above command generates 3 .qza files.
Command set 5
Next, we want to visualize the stats data (stats-dada2.qza) by entering the following command:
qiime metadata tabulate \ --m-input-file stats-dada2.qza \ --o-visualization stats-dada2.qzv
The resulting .qzv files contains information about how the data was processes…there is no real info to pull for this at this time.
Next, we want to visualize the FeatureTable data (table-dada2.qza) by entering the following command:
qiime feature-table summarize \ --i-table table-dada2.qza \ --o-visualization table.qzv
the following command was listed in the tutorial, but was not used??? We do not have a metadata file…CORRECTION…the metafile .TSV file can be downloaded from the view.qiime.org…i don’t recall what visulization files showed this link???
--m-sample-metadata-file sample-metadata.tsv
Command set 6
Next, we want to visualize the seq data (rep-seqs-dada2.qza) by entering the following command:
qiime feature-table tabulate-seqs \ --i-data rep-seqs-dada2.qza \ --o-visualization rep-seqs-dada2.qzv
Command set 7
Now we are ready to generate a tree for phylogenetic diversity using the following commands:
qiime phylogeny align-to-tree-mafft-fasttree \ --i-sequences rep-seqs-dada2.qza \ --o-alignment align-rep-seqs.qza \ --o-masked-alignment masked-aligned-rep-seqs.qza \ --o-tree unroot-tree.qza \ --o-rooted-tree rooted-tree.qza
not totally sure what the above commands have done for us, but…
Command set 8
For the next command you will need to acquire a "classifier" file from here_1 or here_2 <-- this file needs to be saved/relocated into the main working folder.
On to taxonomy…to determine the taxonomic composition of the samples use the following commands:
qiime feature-classifier classify-sklearn \ --i-classifier gg-13-8-99-515-806-nb-classifier.qza \ <-- us this line if using gg-13-8-99-515-806-nb-classifier.qza --i-classifier gg-13-8-99-nb-classifier.qza \ <-- us this line if using gg-13-8-99-nb-classifier.qza --i-reads rep-seqs-dada2.qza \ --o-classification taxonomy.qza
qiime metadata tabulate \ --m-input-file taxonomy.qza \ --o-visualization taxonomy.qzv
Command set 9
Now we want to visualize this data using the barplot…using these commands:
qiime taxa barplot \ --i-table table-dada2.qza \ --i-taxonomy taxonomy.qza \ --m-metadata-file sample-metadata.tsv \ <-- skip this line --o-visualization taxa-bar-plots.qzv
Visualizing the barplot
Now drag and drop the taxa-bar-plots.qzv file into the view.qiime2.org/ window.