
Parallel metagenomic sequence clustering via sketching and maximal quasi-clique enumeration on map-reduce clouds

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

22 Scopus citations

Abstract

Taxonomic clustering of species is an important and frequently arising problem in metagenomics. High-throughput next-generation sequencing is facilitating the creation of large metagenomic samples, while at the same time making the clustering problem harder because the supported sequence lengths are short and the sampled species are unknown. In this paper, we present a parallel algorithm for hierarchical taxonomic clustering of large metagenomic samples with support for overlapping clusters. We adapt the sketching techniques originally developed for web document clustering to deduce significant similarities between pairs of sequences without resorting to expensive all-vs-all alignments. We formulate the metagenomics classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud-based implementation. Apart from solving an important problem in metagenomics, this work demonstrates the applicability of the map-reduce framework in relatively complicated algorithmic settings.
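The two ideas named in the abstract, sketching and quasi-cliques, can be illustrated with a minimal single-machine sketch. It is not the paper's implementation. It assumes min-wise hashing over k-mers as the sketching scheme, and a degree-based quasi-clique definition in which every vertex is adjacent to at least a fraction gamma of the other vertices. The function names, the k-mer length, the number of hash functions, and the choice of BLAKE2b as the hash are all hypothetical choices made for illustration.

```python
import hashlib


def kmers(seq, k=5):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash of a k-mer (illustrative choice of hash function)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(seq, k=5, num_hashes=64):
    """Min-wise sketch: for each seeded hash, keep the minimum over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of matching sketch positions, an estimate of Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def is_quasi_clique(vertices, edges, gamma):
    """Degree-based test (an assumed definition): each vertex must be adjacent
    to at least gamma * (n - 1) of the other vertices in the set."""
    vs = set(vertices)
    needed = gamma * (len(vs) - 1)
    for v in vs:
        degree = sum(1 for u in vs if u != v and frozenset((u, v)) in edges)
        if degree < needed:
            return False
    return True
```

In a pipeline of this shape, pairs of sequences whose estimated similarity exceeds a threshold would become edges of the similarity graph, and raising or lowering that threshold would yield the different levels of the hierarchy.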

Original language: English
Title of host publication: Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
Pages: 1223-1233
Number of pages: 11
State: Published - 2011
Event: 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011 - Anchorage, AK, United States
Duration: May 16 2011 - May 20 2011

Keywords

  • MapReduce
  • cloud computing
  • metagenomics
  • next generation sequencing
  • quasi clique enumeration
  • sequence clustering
  • sketching
