PARALLEL DATA LAB

DISC-Quasars:

Identifying Distant Quasars in Sky Surveys

We are developing data-mining techniques for detection of distant quasars in sky-survey datasets, which will help astronomers to distinguish distant quasars from other celestial objects, such as stars, galaxies, and nearby quasars, based on analysis of telescope images made through different color filters.



PROBLEM

A quasar is a galaxy with an unusually massive black hole in its center. This black hole causes compression and heating of matter around it under the impact of its gravity, which leads to emission of massive radiation, thus increasing the brightness of the galaxy. A quasar may be hundreds of times as bright as a regular galaxy, which makes it visible at very large distances. Modern telescopes allow detection of quasars billions of light years away, at the edge of the observable universe.  Astronomers and cosmologists use distant quasars as "beacons," which allow charting remote regions of the universe and studying its expansion.

Accurate identification of quasars is a hard problem, since regular telescope images do not provide explicit distance data, and quasars look like "shiny dots," not much different from stars and regular galaxies. To distinguish objects of different types, astronomers compare images made through five different color filters, and study the distribution of each object's brightness over these colors. While researchers have developed algorithms for accurate identification of nearby quasars, the problem of identifying distant quasars has turned out to be more challenging. We are working on application of data mining to this problem, specifically, finding quasars that are more than twelve billion light years away.

RESULTS

We represent a celestial object by five numeric values, which show its brightness in images made through five color filters, and apply supervised learning to identify distant quasars based on these values. We have experimented with a variety of learning techniques, including decision trees, support vector machines, clustering, and nearest neighbors.

The most effective among the techniques evaluated so far is the majority-vote combination of the C4.5 decision trees, support vector machines with RBF kernel, and 11 nearest neighbors. Its precision is about 81%, which means that 81% of the objects identified by the system are true quasars. It is about twice as accurate as the earlier methods developed by astronomers. Its main drawback is low recall, which is about 30%; that is, the system detects only 30% of distant quasars and misses the other 70%. Intuitively, it achieves high precision by being conservative and rejecting objects in case of uncertainty.

More details: Summary of the algorithms and empirical results

CHALLENGES

We are working on improvements to the developed technique, and implementing its distributed version, which will be able to process datasets with hundreds of millions of objects. We are also testing applicability of other data-mining algorithms to this problem.

PEOPLE

FACULTY
Eugene Fink
Garth Gibson
Julio López

GRADUATE STUDENTS
Bin Fu

EXTERNAL COLLABORATORS
Joel Welling (Pittsburgh Supercomputing Center)
Michael Wood-Vasey (Physics and Astronomy, University of Pittsburgh)