Researchers at SSBS Pune, NCMR Pune and Reliance Lifesciences Pvt Ltd, Mumbai, conducted a collaborative study benchmarking 16S rRNA gene databases using known strain sequences. They took authentic, validly published 16S rRNA gene sequences of type strains and analyzed them with QIIME, a widely used open-source bioinformatics pipeline, using different OTU clustering parameters and QIIME-compatible databases. Limited differences were observed in the analysis of the reference data set when using partial versus full-length 16S rRNA gene sequences. The analysis highlighted common discrepancies at various taxonomic levels across the methods and databases tested.
16S rRNA gene analysis is the most convenient and widely used method for microbiome studies. Inaccurate taxonomic assignment of bacterial strains can distort the results, because all downstream analyses rely heavily on accurate assessment of microbial taxonomy. The large number of databases and tools available for classifying 16S rRNA gene sequences and assigning their taxonomy makes it challenging to select the best-suited method for a particular dataset. This study was done to benchmark the 16S rRNA gene databases using known strain sequences.
Next-Generation Sequencing (NGS) techniques are capable of generating high-quality, comparable data, and different methods are used to overcome the limitations of 16S rRNA gene analysis. Mock microbial communities serve the purpose of estimating sequencing errors, but they mostly represent minimal diversity and thus cannot be used as a standard for taxonomic identification by analysis pipelines and databases. It is therefore necessary to validate a 16S rRNA gene analysis pipeline using a standard data set with known taxonomic identification.
The sample dataset used in this study was obtained from an authentic database and comprised 5,395 16S rRNA gene sequences. The researchers used authentic, validly published full-length and partial 16S rRNA gene sequences of type strains, obtained from the RDP database because it offers the option to download the dataset in bulk. These sequences were compared against various databases with the QIIME pipeline, which incorporates algorithms for quality control, clustering of similar sequences, taxonomy assignment, calculation of diversity measures and visualization. Three databases used for 16S rRNA gene-based microbiome studies were tested: Greengenes, SILVA, and EzTaxon.
Comparative analysis showed that, for each database combination, more OTUs were obtained at a 99% identity threshold than at a 97% identity threshold. A total discrepancy of 18.78% and 10.53% was observed at the genus level for the full-length and partial sequences, respectively, which is high. The discrepancy at each taxonomic level can be calculated and used to assess the quality of the data present in a database. Given these discrepancies in taxonomic assignment, databases, pipelines, and algorithms should be selected very carefully and according to the needs of the study. Databases should also be validated, and discrepancies corrected in successive updates.
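To make the idea of a per-level discrepancy concrete, the following is a minimal Python sketch, not the study's actual code. It assumes the discrepancy at a rank is the percentage of reference sequences whose database-assigned taxon differs from the known type strain taxon at that rank. The function name `discrepancy_by_rank`, the rank list and the four toy sequences are hypothetical and chosen only for illustration. How the study treated unassigned or ambiguous taxa is not stated, so this sketch simply counts any non-identical label as a mismatch.

```python
# Illustrative sketch only: the names and toy data below are assumptions,
# not taken from the study.
RANKS = ["phylum", "class", "order", "family", "genus"]


def discrepancy_by_rank(expected, assigned, ranks=RANKS):
    """Return {rank: percent of sequences whose assigned taxon differs
    from the expected (known) taxon at that rank}.

    Both arguments map a sequence ID to a tuple of taxon names ordered
    like `ranks`. Only IDs present in both mappings are compared.
    """
    shared = sorted(set(expected) & set(assigned))
    if not shared:
        raise ValueError("no sequence IDs in common")
    report = {}
    for i, rank in enumerate(ranks):
        mismatches = sum(
            1 for seq_id in shared if expected[seq_id][i] != assigned[seq_id][i]
        )
        report[rank] = 100.0 * mismatches / len(shared)
    return report


# Known taxonomy of four made-up reference sequences.
expected = {
    "seq1": ("Firmicutes", "Bacilli", "Bacillales", "Bacillaceae", "Bacillus"),
    "seq2": ("Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
             "Enterobacteriaceae", "Escherichia"),
    "seq3": ("Firmicutes", "Bacilli", "Lactobacillales", "Lactobacillaceae",
             "Lactobacillus"),
    "seq4": ("Firmicutes", "Bacilli", "Bacillales", "Staphylococcaceae",
             "Staphylococcus"),
}

# Hypothetical assignments returned by a database: seq2 is wrong at the
# genus level, seq3 is wrong at both family and genus level.
assigned = dict(expected)
assigned["seq2"] = ("Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
                    "Enterobacteriaceae", "Shigella")
assigned["seq3"] = ("Firmicutes", "Bacilli", "Lactobacillales",
                    "Streptococcaceae", "Lactococcus")

report = discrepancy_by_rank(expected, assigned)
for rank in RANKS:
    print(f"{rank}: {report[rank]:.2f}% discrepancy")
```

Running the same comparison once per database and per sequence type (full-length or partial) would give a table of discrepancies that can be compared across databases, which is the kind of comparison the study describes.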
The primary goal of all microbial studies is to identify the bacteria that constitute complex communities. A valid and reliable method is a must for identifying them. The purpose of this study was to validate widely used databases such as EzTaxon, SILVA and Greengenes, and data analysis pipelines such as QIIME.