Difference between revisions of "SampleData PairedEndSequencesWithQuality processing"

Revision as of 22:52, 7 July 2022

Introduction

Data sent to us by Frank Oliaro (foliaro@sheddaquarium.org) from Shedd Aquarium has the input format of the “Casava 1.8 paired-end demultiplexed fastq.” This data (.fastq vs .fast) also contains “Quality” information. In this format there are two data files per sample, one for the forward read (R1) and one for the reverse read (R2); the files are compressed and have the XXX_fastq.gz extension. Qiime2 can work with compressed files, so no need to convert the XXX.fastq.gz to XXX.fastq. This file format is described on the Qiime2 website (see importing link below) as:

“In Casava 1.8 demultiplexed (paired-end) format, there are two fastq.gz files for each sample in the study, each containing the forward or reverse reads for that sample. The file name includes the sample identifier. The forward and reverse read file names for a single sample might look like L2S357_15_L001_R1_001.fastq.gz and L2S357_15_L001_R2_001.fastq.gz, respectively. The underscore-separated fields in this file name are:

- the sample identifier,

- the barcode sequence or a barcode identifier,

- the lane number,

- the direction of the read (i.e. R1 or R2), and

- the set number.”

The amplicon to be sequenced was for the variable region 4 (V4) of the 16S gene which contains 254 base pairs. The HiSeq Lumina sequncing requires additional DNA referred to as adapters, sequencing binding sites, indices (or barcodes), and complements to the flow cell oligios. Therefore the DNA sequence has an overall length of >400 base pairs.

Analyzing the Data

STEP 1

The first step in the processing of the two forward and reverse files (XXX_R1_YYY.fastq.gz and XXX_R2_YYY.fastq.gz) is to import then into qiime2. Prior to doing any qiime2 work we must place all data files in a single directory. For example, Frank O sent us a link to the data that was compressed into a single zip file. When downloaded and unzipped, the resulting folder contained 11 sample-subfolders each with the R1 and R2 data file. A new folder called ‘reads’ was made and all R1/R2 data sets were put directly (ie, no subfolders) in this folder. The resulting ‘reads’ folder now contains 22 files with sequencing data for 11 samples. When importing data into qiime2, all data in the ‘reads’ folder will be processed at the same time.

NOTE: when processing the data, generated output files will be put into the same directory/folder as the 'reads' directory/folder, so it is a good idea to make a sample specific folder like Sample_01 which contains the 'reads' directory/folder.

STEP 2

Importing this data will be outlined below, but can be found on the Qiime2 website: https://docs.qiime2.org/2022.2/tutorials/importing/#sequence-data-with-sequence-quality-information-i-e-fastq …scroll down under the heading of Casava 1.8 paired-end demultiplexed fastq

Let me remind you that this sequencing data has already been demultiplexed, meaning that the sequencing data has been sorted based on the indexes/barcodes; this is done by the HiSeq instrument.

Command set 1 - Activate Qiime

The following commands must be executed using a terminal window/shell from the directory immediately above the ‘reads’ directory. The following assumes that Qiime2 has been installed on your computer. If you are just opening a terminal window, you must enter into the Qiime2 environment by typing:

For VR laptops...as of June 2022...

conda activate qiime2-2022.2

For other computers...you need to know what version of qiime you installed...if you don't know the version then type in a terminal window

conda env list

- this will list the conda environments that are installed on your computer.

conda activate qiime2-202X.X

- where the X.X is your installed version.

- the above command will then result in the terminal window showing a (quine2-2021.4) in the command line.

Command set 2 - Import Data

Now navigate to the proper directory; fyi, when working in a “window” environment we usually refer to folders, but when working in a terminal window/shell we often use the term directory. These terms have the exact same meaning. The UNIX commands cd XYZ for change directory to XYZ and ls for list directory contents are necessary to navigate within a terminal window.

Once in the proper directory, ie if you type ls you will see the reads directory listed, the following series of commands will be executed (do not type the word <enter> just hit the enter/return key):

 qiime tools import \ <enter>
 --type 'SampleData[PairedEndSequencesWithQuality]' \ <enter>
 --input-path reads/ \ <enter>
 --input-format CasavaOneEightSingleLanePerSampleDirFmt \ <enter>
 --output-path demux-paired-end.qza

Command set 3 - visualizing the Quality Data

Once the set of commands above have run successfully, you need to exam the sequence quality (QC) to determine how the data sequences should be trimmed. There is a plugin for graphing sequence quality that is run by the following commands (this command is run from the directory containing the .qza file):

 qiime demux summarize \ <enter>
 --i-data demux-paired-end.qza \ <enter>
 --o-visualization demux-paired-end_viz.qzv <enter>

This output file, demux-paired-end_vis.qzv is then “dragged and dropped” into the https://view.qiime2.org window to view the Quality data by clicking on the "Interactive Quality Plot" tab.

Interpreting the displayed quality data...the view.qiime.org webpage will display a window with 2 tabs:

- Overview

- Interactive Quality Plot

The information in the Interactive Quality Plot shows that the quality score (blk data) for the imported data. It is best to use data with a quality score between 40 and 20. In the next set of commands we are going to trim off data (<20 score) from the forward and reverse reads and truncate the ends of the forward and reverse reads.

Command set 4 - trimming and truncating data with low quality scores

To do this trimming, execute the following commands:

 qiime dada2 denoise-paired \ <enter>
 --i-demultiplexed-seqs demux-paired-end.qza \ <enter>
 --p-trim-left-f 15 \ <enter>
 --p-trim-left-r 15 \ <enter>
 --p-trunc-len-f 220 \ <enter>
 --p-trunc-len-r 220 \ <enter>
 --o-representative-sequences rep-seqs-dada2.qza \ <enter>
 --o-table table-dada2.qza \ <enter>
 --o-denoising-stats stats-dada2.qza <enter>

The above command generates 3 .qza files.

Command set 5 - visualize "stats" data (Optional)

Next, we want to visualize the stats data (stats-dada2.qza) by entering the following command:

  qiime metadata tabulate \ <enter>
  --m-input-file stats-dada2.qza \ <enter>
  --o-visualization stats-dada2.qzv <enter>

The resulting stats-dada2.qzv file contains "metadata" (the TSV file) information about how the data was processes. It is not clear what information is to be drawn from this metadata/TSV file.

Command set 6 - visualize the "FeatureTable" data (Optional)

Next, we want to visualize the FeatureTable data (table-dada2.qza) by entering the following command:

  qiime feature-table summarize \ <enter>
  --i-table table-dada2.qza \ <enter>
  --o-visualization table.qzv <enter>

Command set 7 - SKIP

The following command was listed in the tutorial, but was not used??? The .TSV file can be downloaded from the view.qiime2.org/table.qzv if we need it. Skip the following for now:

  --m-sample-metadata-file sample-metadata.tsv

Command set 8 - visualizing the sequence data (Optional)

Next, we want to visualize the sequence data (rep-seqs-dada2.qza) by entering the following command:

  qiime feature-table tabulate-seqs \ <enter>
  --i-data rep-seqs-dada2.qza \ <enter>
  --o-visualization rep-seqs-dada2.qzv <enter>

Dragging and dropping this file into view.qiime2.org allows you to see the sequence for all the bacteria in your sample. There is no immediate need to view this data.

Command set 9 - (SKIP/Optional)

Now we are ready to generate a tree for phylogenetic diversity using the following commands:

  qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs-dada2.qza \
  --o-alignment align-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unroot-tree.qza \
  --o-rooted-tree rooted-tree.qza

not totally sure what the above commands have done for us...

Command set 8

For the next command you will need to acquire a "classifier" file from here_1 or here_2 <-- this file needs to be saved/relocated into the main working folder.

On to taxonomy…to determine the taxonomic composition of the samples use the following commands:

  qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \ <-- us this line if using gg-13-8-99-515-806-nb-classifier.qza
  --i-classifier gg-13-8-99-nb-classifier.qza \ <-- us this line if using gg-13-8-99-nb-classifier.qza
  --i-reads rep-seqs-dada2.qza \
  --o-classification taxonomy.qza

  qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

Command set 9

Now we want to visualize this data using the barplot…using these commands:

  qiime taxa barplot \
  --i-table table-dada2.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file sample-metadata.tsv \ <-- skip this line
  --o-visualization taxa-bar-plots.qzv

Visualizing the barplot

Now drag and drop the taxa-bar-plots.qzv file into the view.qiime2.org/ window.

@@ Line 109: / Line 109: @@
 Dragging and dropping this file into view.qiime2.org allows you to see the sequence for all the bacteria in your sample. There is no immediate need to view this data.
-====Command set 7====
+====Command set 9 - (SKIP/Optional)====
 Now we are ready to generate a tree for phylogenetic diversity using the following commands:
@@ Line 119: / Line 119: @@
     --o-rooted-tree rooted-tree.qza
-not totally sure what the above commands have done for us, but…
+not totally sure what the above commands have done for us...
 ====Command set 8====