TY - GEN
T1 - Resolving read assignment ambiguities in metagenomic clustering
AU - Nihalani, Rahul
AU - Zola, Jaroslaw
AU - Aluru, Srinivas
PY - 2013
Y1 - 2013
N2 - Clustering is a popular technique for analyzing metagenomic data. Specifically, it is used to assign metagenomic reads to clusters, each cluster representing a species or a higher-level taxonomic unit. Due to the difficulty of distinguishing between homologous subsequences common to multiple species and the lack of a perfect similarity measure between reads, it is not possible to deduce a correct assignment of reads to clusters. Thus, metagenomic clustering methods must either resort to ambiguity or make the best available choice at each read assignment stage, which could lead to incorrect clusters and potentially cascading errors. In this paper, we argue for first generating an ambiguous clustering and then resolving the ambiguities collectively by analyzing the ambiguous clusters. We propose a rigorous formulation of this problem and show that it is NP-hard. We then propose an efficient heuristic to solve it in practice. We validate our approach on several synthetically generated datasets and a metagenomic dataset consisting of 16S rRNA sequences from the gut microbiome.
UR - https://www.scopus.com/pages/publications/84883618326
M3 - Conference contribution
SN - 9781622769711
T3 - 5th International Conference on Bioinformatics and Computational Biology 2013, BICoB 2013
SP - 73
EP - 80
BT - 5th International Conference on Bioinformatics and Computational Biology 2013, BICoB 2013
T2 - 5th International Conference on Bioinformatics and Computational Biology 2013, BICoB 2013
Y2 - 4 March 2013 through 6 March 2013
ER -