Parallel Data Laboratory

DISC-Finder: Identifying Clusters of Astronomical Objects

We are developing data-intensive algorithms for identifying halos, that is, clusters of astronomical objects, based on the results of massive sky surveys and cosmological simulations. This project is part of the work on the astronomy applications of data-intensive super computing (Astro-DISC).

PROBLEM

We are developing a distributed version of the Friends-of-Friends technique, an algorithm for identifying clusters of galaxies. The technique links pairs of closely located galaxies, which are gravitationally attracted to each other, and then finds the connected components of the resulting undirected graph of pairwise attractions; these components roughly correspond to gravitationally bound clusters. While this method is an approximation and may not identify exact cluster boundaries, it has proved effective in studying the structure of the universe. Astronomers have used it since the early 1980s as one of their key tools for analyzing both sky surveys and computer simulations of the evolving universe, and have developed a suite of sequential Friends-of-Friends algorithms. However, recent massive sky surveys and simulations have raised the scalability requirements for Friends-of-Friends analysis. Astronomers now need to process datasets with billions of galaxies, and the traditional sequential algorithms are often inadequate at that scale.
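The Friends-of-Friends idea described above can be illustrated with a minimal sequential sketch. It is not the project's code: the function names, the `linking_length` parameter (the distance below which two galaxies count as "friends"), and the brute-force pairwise comparison are assumptions made for illustration. Production implementations typically use a spatial index rather than comparing every pair.

```python
from itertools import combinations
from math import dist


def friends_of_friends(points, linking_length):
    """Return one cluster label per point (brute-force O(n^2) sketch)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of points no farther apart than the linking length.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= linking_length:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Points with the same root belong to the same connected component.
    return [find(i) for i in range(len(points))]


def group_clusters(labels):
    """Collect point indices by cluster label, in a stable sorted order."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Hypothetical 3-D positions; units are arbitrary.
    galaxies = [(0, 0, 0), (0.5, 0, 0), (1.0, 0, 0),
                (5, 5, 5), (5.25, 5, 5), (9, 9, 9)]
    print(group_clusters(friends_of_friends(galaxies, 0.6)))
```

Note that the first and third galaxies end up in the same cluster even though they are farther apart than the linking length, because both are friends of the second one. This transitive linking is what makes the result a set of connected components.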

RESULTS

We have developed an architecture for a distributed Friends-of-Friends computation under Hadoop. Specifically, we have designed a Map-Reduce "wrapper" that distributes a set of galaxies among multiple cores, runs a sequential Friends-of-Friends algorithm on each core, and then merges the results of the local computations. It treats the sequential Friends-of-Friends procedure as a black box; that is, it does not rely on any specific properties of that procedure. We can plug in any sequential Friends-of-Friends algorithm and run its distributed version without changing the wrapper. The resulting distributed computation can process trillions of galaxies, which makes it sufficiently powerful for all modern astronomy datasets.
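The wrapper structure described above (distribute, run a black-box sequential finder locally, merge) can be sketched in a single Python process. This is an illustration under stated assumptions, not the project's Hadoop implementation: the slab partitioning along the x axis, the overlap of one linking length between neighboring slabs, and the names `local_fof`, `distributed_fof`, and `slab_width` are all hypothetical. The `finder` argument stands in for any sequential Friends-of-Friends procedure.

```python
from collections import defaultdict
from math import dist, floor


def local_fof(points, linking_length):
    """Stand-in sequential finder: returns groups of indices into `points`."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= linking_length:
                parent[find(j)] = find(i)
    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    return list(groups.values())


def distributed_fof(points, linking_length, slab_width, finder=local_fof):
    """Partition, run `finder` on each partition, then merge the results."""
    # Distribute: slab k covers k*w <= x < (k+1)*w + linking_length, so any
    # two points within the linking length share at least one slab.
    slabs = defaultdict(list)
    for index, point in enumerate(points):
        first = floor((point[0] - linking_length) / slab_width)
        last = floor(point[0] / slab_width)
        for k in range(first, last + 1):
            slabs[k].append(index)

    # Merge: a union-find over global indices. Points in an overlap region
    # appear in groups from two slabs, which joins those groups.
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for members in slabs.values():
        # Local step: the finder sees only this slab's points (black box).
        local_points = [points[i] for i in members]
        for group in finder(local_points, linking_length):
            root = find(members[group[0]])
            for local_index in group[1:]:
                parent[find(members[local_index])] = root

    clusters = defaultdict(list)
    for i in range(len(points)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Hypothetical positions; the first three form a chain across two slabs.
    galaxies = [(0.875, 0, 0), (1.0625, 0, 0), (1.25, 0, 0),
                (3.0, 0, 0), (3.125, 0, 0), (7.0, 0, 0)]
    print(distributed_fof(galaxies, linking_length=0.25, slab_width=1.0))
```

Because the merge step only looks at which global indices each local group contains, the sketch never inspects how `finder` works internally. Swapping in a different sequential procedure requires no change to `distributed_fof`, which mirrors the black-box property described above.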

CHALLENGES

Future challenges include generalizing the developed algorithms and applying them to other closely related problems in astronomy and cosmology, such as accounting for masses and relative velocities of galaxies, identifying halos, and analyzing properties of dark-matter simulations.

PEOPLE

FACULTY
Eugene Fink
Garth Gibson
Julio López

GRADUATE STUDENTS
Bin Fu
Kai Ren

EXTERNAL COLLABORATORS
Tiziana Di Matteo (Physics, Carnegie Mellon University)
Rupert Croft (Physics, Carnegie Mellon University)

PUBLICATIONS

DiscFinder: A data-intensive scalable cluster finder for astrophysics. Bin Fu, Kai Ren, Julio López, Eugene Fink, and Garth Gibson. In Proceedings of the ACM International Symposium on High Performance Distributed Computing (HPDC), Chicago, IL. June, 2010.
Abstract / PDF [372K]
An extended version is available as Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-104.
PDF [393K]