Multiplexed single-end FASTQ with barcodes sequence

Introduction

Data sent to us by the University of Illinois Biotech Center has the input format of "MultiplexedPairedEndBarcodeInSequence" or “Multiplexed paired-end FASTQ with barcodes in sequence." all sample information is contained in 3 data files: 1) the forward reads, 2) the reverse reads, and 3) a metafile containing barcode assignments. This data is in a .fastq format. This file format is described on the Qiime2 website (see importing link below) as:

"Users with multiplexed paired-end barcodes in sequence reads should have:

- one forward.fastq.gz file, containing forward reads from multiple samples,

- one reverse.fastq.gz file, containing reverse reads from the same samples,

- one metadata file with a column of per-sample barcodes for use in FASTQ demultiplexing (or two columns of dual-index barcodes)

In this format, sequence data is still multiplexed (i.e. you have only one forward and one reverse fastq.gz file containing raw data for all of your samples).

Because you are importing a multi-file directory, the filenames forward.fastq.gz and reverse.fastq.gz are required.

The order of the records in the fastq.gz files defines the association between forward and reverse sequence reads, so a correct order must be preserved. Barcodes in the metadata mapping file are not required to be in any specific order.

The amplicon to be sequenced was for the variable region 4 (V4) of the 16S gene which contains 254 base pairs. The data was collected using Illumina sequencing combined with Fluidigm sample preparation at the University of Illinois Keck Center.

Included with the sequence data, but in a separate email, the following info was provided:

"1) Primer Sorted data: This .bz2 file contains smaller fastq files, generated by sorting the raw data by the PCR-specific primers. There are three fastq files per primer set--the read 1 (5' sequencing read), the index read, and the read 2 (3' sequencing read) data for that primer set. Note that each primer-set file is still a mix all the samples--they have not been demultiplexed by index. PCR-specific primer sequence has not been trimmed from these reads. PhiX control reads have been removed."

Analyzing the Data

STEP 1 - directory and file renaming/compressing

The first step in processing this data is to create a directory/folder (name it as being associated with the data) and then a sub directory/folder called muxed-reads. Place the forward and reverse reads data files in the muxed-reads subdirectory. Now you must rename the files...as an example our first set of Univ Of IL data was in a file called V4_515F_New_V4_806R_New_moore_R1.fastq and V4_515F_New_V4_806R_New_moore_R2.fastq. The R1 is the forward reads and the R2 is the reverse reads. These files need to be renamed to exactly --> forward.fastq and reverse.fastq. These files then need to be compressed into .gz files to ultimately have the names forward.fastq.gz and reverse.fastq.gz. It is recommended to compress these files within the terminal window by typing the following command at the subdirectory level (ie...ls shows the *.fastq file names).

gzip *.fastq

Let me remind you that this sequencing data has NOT already been demultiplexed, meaning that the sequencing data is all contained in one forward and 1 reverse reads file.

STEP 2

Command set 1 - Activate Qiime2

The following command must be executed using a terminal window/shell. The following assumes that Qiime2 has been installed on your computer. If you are just opening a terminal window, you must enter into the Qiime2 environment by typing:

For VR laptops...as of June 2022...

conda activate qiime2-2022.2

For other computers...you need to know what version of qiime you installed...if you don't know the version then type in a terminal window

conda env list

- this will list the conda environments that are installed on your computer.

conda activate qiime2-202X.X

- where the X.X is your installed version.

- the above command will then result in the terminal window showing a (quine2-2021.4) in the command line.

STEP 3

Command set 2 - Import Data

Now navigate to the proper directory. The UNIX commands cd XYZ for change directory to XYZ and ls -l for list directory contents are necessary to navigate within a terminal window.

Once in the proper directory, ie if you type ls -l you will see the "muxed-reads" directory listed, the following series of commands will be executed (do not type the word <enter> just hit the enter/return key):

 qiime tools import \ <enter>
 --type MultiplexedPairedEndBarcodeInSequence \ <enter>
 --input-path muxed_reads/ \ <enter>
 --output-path multplexed_seqs.qza

END HERE...

Command set 3 - visualizing the Quality Data

Once the set of commands above have run successfully, you need to exam the sequence quality (QC) to determine how the data sequences should be trimmed. There is a plugin for graphing sequence quality that is run by the following commands (this command is run from the directory containing the .qza file):

 qiime demux summarize \ <enter>
 --i-data demux-paired-end.qza \ <enter>
 --o-visualization demux-paired-end_viz.qzv <enter>

This output file, demux-paired-end_vis.qzv is then “dragged and dropped” into the https://view.qiime2.org window to view the Quality data by clicking on the "Interactive Quality Plot" tab.

Interpreting the displayed quality data...the view.qiime.org webpage will display a window with 2 tabs:

- Overview

- Interactive Quality Plot

The information in the Interactive Quality Plot shows that the quality score (blk data) for the imported data. It is best to use data with a quality score between 40 and 20. In the next set of commands we are going to trim off data (<20 score) from the forward and reverse reads and truncate the ends of the forward and reverse reads.

Command set 4 - trimming and truncating data with low quality scores

To do this trimming, execute the following commands:

 qiime dada2 denoise-paired \ <enter>
 --i-demultiplexed-seqs demux-paired-end.qza \ <enter>
 --p-trim-left-f 15 \ <enter>
 --p-trim-left-r 15 \ <enter>
 --p-trunc-len-f 220 \ <enter>
 --p-trunc-len-r 220 \ <enter>
 --o-representative-sequences rep-seqs-dada2.qza \ <enter>
 --o-table table-dada2.qza \ <enter>
 --o-denoising-stats stats-dada2.qza <enter>

The above command generates 3 .qza files.

Command set 5 - visualize "stats" data (Optional)

Next, we want to visualize the stats data (stats-dada2.qza) by entering the following command:

  qiime metadata tabulate \ <enter>
  --m-input-file stats-dada2.qza \ <enter>
  --o-visualization stats-dada2.qzv <enter>

The resulting stats-dada2.qzv file contains "metadata" (the TSV file) information about how the data was processes. It is not clear what information is to be drawn from this metadata/TSV file.

Command set 6 - visualize the "FeatureTable" data (Optional)

Next, we want to visualize the FeatureTable data (table-dada2.qza) by entering the following command:

  qiime feature-table summarize \ <enter>
  --i-table table-dada2.qza \ <enter>
  --o-visualization table.qzv <enter>

Command set 7 - SKIP

The following command was listed in the tutorial, but was not used??? The .TSV file can be downloaded from the view.qiime2.org/table.qzv if we need it. Skip the following for now:

  --m-sample-metadata-file sample-metadata.tsv

Command set 8 - visualizing the sequence data (Optional)

Next, we want to visualize the sequence data (rep-seqs-dada2.qza) by entering the following command:

  qiime feature-table tabulate-seqs \ <enter>
  --i-data rep-seqs-dada2.qza \ <enter>
  --o-visualization rep-seqs-dada2.qzv <enter>

Dragging and dropping this file into view.qiime2.org allows you to see the sequence for all the bacteria in your sample. There is no immediate need to view this data.

Command set 9 - (SKIP/Optional)

Now we are ready to generate a tree for phylogenetic diversity using the following commands:

  qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs-dada2.qza \
  --o-alignment align-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unroot-tree.qza \
  --o-rooted-tree rooted-tree.qza

not totally sure what the above commands have done for us...

Command set 10 - classifying the bacterial sequences

For the next command you will need to acquire a "classifier" file from here_515-806 or here_all 1500 <-- this file needs to be saved/relocated into the main working folder.

On to taxonomy…to determine the taxonomic composition of the samples use the following commands:

  qiime feature-classifier classify-sklearn \ <enter>
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \ <enter> OR
  --i-classifier gg-13-8-99-nb-classifier.qza \ <enter>
  --i-reads rep-seqs-dada2.qza \ <enter>
  --o-classification taxonomy.qza <enter>

Command set 11 - visualizing the bacteria names

  qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

Command set 12 - visualizing the bar plot

Now we want to visualize this data using the barplot…using these commands:

  qiime taxa barplot \ <enter>
  --i-table table-dada2.qza \ <enter>
  --i-taxonomy taxonomy.qza \ <enter>
  --m-metadata-file sample-metadata.tsv \ <-- skip this line
  --o-visualization taxa-bar-plots.qzv <enter>

Now drag and drop the taxa-bar-plots.qzv file into the view.qiime2.org/ window.