DiscFinder: A Data-Intensive Scalable Cluster Finder for Astrophysics

Copyright (c) 2010 Bin Fu, Kai Ren and Julio Lopez. All Rights Reserved.
You may use this code without fee for educational and research purposes.
Any for-profit use requires written consent of the copyright holders.

Version: 0.1
Date: 2010-09-10
Main Contact: Bin Fu (binf@cs.cmu.edu)

1. General information

DiscFinder is a scalable cluster finder for large astrophysics datasets, built on Hadoop and written entirely in Java. The details of DiscFinder can be found in the following paper:

  Bin Fu, Kai Ren, Julio Lopez, Eugene Fink and Garth A. Gibson.
  DiscFinder: A Data-Intensive Scalable Cluster Finder for Astrophysics.
  ACM International Symposium on High Performance Distributed Computing
  (HPDC) 2010, Chicago, Illinois, USA.

For questions on DiscFinder, please contact the authors.

2. Environment

DiscFinder can run on any Linux or Unix machine that supports Hadoop; the shell scripts and code-packaging scripts are written for Linux/Unix. DiscFinder needs the following software to be installed on the system:

- Hadoop 0.20.1 or a higher version, from http://hadoop.apache.org/
- Ant
- Java 1.6.x, preferably from Sun
- GCC and Make
- NFS or another distributed file system

3. Source code information

1) List of source files

There is one main program in DiscFinder:
- DiscFinder

The source files are in the following locations:

src/DiscFinder.java --- The core file of the DiscFinder pipeline, implemented in Java on Apache Hadoop. It calls a couple of external components, including uf.java and fof.

src/uf.java --- Our Java implementation of the union-find algorithm, used in the merging step of the DiscFinder pipeline.

lib/fof --- The sequential FOF program from the University of Washington (http://www-hpcc.astro.washington.edu/tools), implemented in C. We have adjusted the code in order to integrate it into DiscFinder; please contact us for more details.

2) How to build the code

Since the jar file discfinder-0.9.jar is already provided, you normally do not need to build the code again. The fof program is also pre-compiled; however, if you move to another platform (e.g., a 64-bit machine), you may want to recompile it with a C compiler for that platform. The instructions below apply when you modify the source code and need to rebuild it.

Before building the code, you need to specify the directory where the hadoop-core.jar file is located. Edit build.xml: find the line that sets this directory and change it to point at the directory containing hadoop-core.jar. You can hard-code the path, or change the variables ${hadoop.dir} and ${hadoop.version}. For example, to change the ${hadoop.dir} variable, edit the corresponding line and modify its value based on the Hadoop installation directory. ${basedir} refers to the current directory, and the Hadoop installation directory is the one that contains the 'hadoop-0.20.1-core.jar' file.

For example, suppose DiscFinder and hadoop-0.20.1 are installed in the following directories:

  DiscFinder:    /home/user/DiscFinder
  Hadoop-0.20.1: /home/user/hadoop-0.20.1

Then, the ${hadoop.dir} line should be changed to point at /home/user/hadoop-0.20.1.
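The exact markup depends on the build.xml that ships with the code; as an illustration only (the property layout below is assumed, not copied from the file), the edited lines might look like:

  <!-- Illustrative sketch: the real build.xml may name or structure these
       properties differently; check the file for the exact line. -->
  <property name="hadoop.version" value="0.20.1"/>
  <property name="hadoop.dir"     value="/home/user/hadoop-0.20.1"/>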
After editing the build.xml file, build the code by executing make. The Makefile first invokes ant to compile the Java sources and generate the jar file; it then compiles the fof program in lib/ and puts the resulting binaries into bin/.

4. How to run the program

1) Set up the correct parameters

To run DiscFinder, use the script discfinder.sh in the bin/ directory. First, you may want to change some parameters in discfinder.sh:

BOTTOM=0
HIGH=400000
  --- BOTTOM and HIGH are the lower and upper bounds of the coordinate values of the input points.

SAMPLEFOLDER=./sampleoutput/

NUMPARTITIONS=1024
  --- How many partitions (cubes) DiscFinder splits the input domain into. More partitions usually means higher parallelism, but also more system overhead. In the current version, NUMPARTITIONS should be a power of 2.

NFSPATH=/h/kair/
  --- Where the fof binary is placed. You need to guarantee that every node in your cluster also mounts an NFS in addition to HDFS.

LOCALPATH=/l/a2/scratch/kair/

TEMPPATH=/user/kair/temp

INPUTPATH=/user/kair/input
  --- The input path in HDFS for DiscFinder.

OUTPUTPATH=/user/kair/output
  --- The output path in HDFS for DiscFinder.

SAMPLERATE=16001
  --- A prime integer that controls how many input points are sampled: 1/SAMPLERATE of the input points are sampled to construct the kd-tree.

EPSILON=0.1
  --- The linking length (tau) of the FOF algorithm.

MAXPOINTEF=35000
  --- Upper bound on the number of points in each partition (cube). For example, with 10 million input points and 1024 partitions, each partition receives about 10,000 points on average; to leave headroom for load imbalance, MAXPOINTEF should be set to roughly 12,000 in that case.

NUMSLOT=124
  --- Number of mapper/reducer slots in the MapReduce configuration. This parameter sets the number of mappers and/or reducers for some of the MapReduce jobs. In our experiments it turned out to be a crucial parameter for efficiency; the most notable case is the number of reducers in the second MapReduce job.

To learn more about the parameters, please refer to our HPDC paper.

2) Execute DiscFinder

In a shell, under the bin/ directory, run "sh ./discfinder.sh run". The script automatically submits the three MapReduce stages and generates the result (see the end-to-end sketch at the end of this file).

3) Remove temporary and result files

In the shell, run "sh ./discfinder.sh remove".
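Putting the steps together, a typical session might look like the sketch below. The commands come from the sections above; the parameter values shown in the comments are illustrative only and must be adapted to your own cluster and HDFS layout.

  # Build (only needed after modifying the sources; see Section 3)
  cd /home/user/DiscFinder
  make

  # Edit the parameters in bin/discfinder.sh for your cluster, e.g.
  #   INPUTPATH=/user/you/input      (HDFS directory holding the input points)
  #   OUTPUTPATH=/user/you/output    (HDFS directory for the results)
  #   NFSPATH=/shared/discfinder/    (NFS directory, mounted on every node, holding the fof binary)

  # Run the three-stage MapReduce pipeline
  cd bin/
  sh ./discfinder.sh run

  # Clean up temporary and result files afterwards
  sh ./discfinder.sh remove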