
This example demonstrates how to create a temporary Hadoop/YARN execution instance that runs within the NOTS cluster, using the SLURM batch scheduler.  It retrieves the text of Herman Melville's Moby Dick and passes it as input to the example program, which uses Apache Spark map and reduce functions to perform a word frequency count.  Please note that throughout this tutorial you will want to substitute your own username and project areas for the ones used in the example.

Prerequisites

  • If you do not have an account or require basic documentation, see our introduction to Research Computing resources and NOTS: 

Research Computing

Getting Started on NOTS

  • Hadoop requires passwordless SSH between nodes.  Please see the following for instructions on setting up passwordless SSH:

Setting Up Passwordless SSH (SSH Keys) on the Clusters


1) Log in to NOTS using your account:
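For example (joeuser is a placeholder username, and nots.rice.edu is assumed to be the login address; use the hostname given in the Getting Started guide):

    ssh joeuser@nots.rice.edu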

2) Acquire the Linux Hadoop binary package and install it in your projects directory.  For this example we will use Hadoop version 2.6.5.  The wget command pulls the install package from the web, and the tar command extracts it into your projects directory.

PLEASE NOTE THAT THIS EXAMPLE IS DESIGNED TO BE RUN IN YOUR PROJECTS DIRECTORY.  If you use /home or /work it will not run correctly.  For information on creating and using your /projects area, see:
Using The $PROJECTS File System
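For example, the following pulls the Hadoop 2.6.5 package from the Apache release archive into your projects area (the project path is a placeholder for your own; the archive.apache.org URL shown is the standard location for older Hadoop releases):

    cd $PROJECTS/your-project-area
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz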

Now untar the downloaded package, then verify that its contents look correct:
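For example (the directory listing will vary slightly by release):

    tar -xzf hadoop-2.6.5.tar.gz
    ls hadoop-2.6.5
    # expect subdirectories such as bin, etc, sbin, and share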

3) Acquire myhadoop, a framework for deploying Hadoop inside traditional HPC clusters (including clusters that use the SLURM scheduler), using git clone:
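For example, from the top level of your projects area:

    git clone https://github.com/glennklockwood/myhadoop.git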

(See https://github.com/glennklockwood/myhadoop for more information on configuration, installation and a user guide)

4)  Patch your Hadoop installation to utilize the myhadoop framework:
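A sketch of this step, assuming the patch file distributed in the myhadoop repository applies to Hadoop 2.x; the patch file name and -p level shown are assumptions, so confirm them against the myhadoop user guide before running:

    cd hadoop-2.6.5
    # apply the myhadoop patch shipped with the repository (file name is an assumption)
    patch -p1 < ../myhadoop/myhadoop-2.2.0.patch
    cd ..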

5)  Ensure that Java is loaded correctly:
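On a module-based system such as NOTS this is typically a module load followed by a version check; the module name below is a placeholder, so use module avail to find the Java module your cluster provides (Spark 2.2 expects Java 8):

    module avail java        # list the Java modules available on the cluster
    module load Java         # placeholder module name; load the Java 8 module offered
    java -version            # confirm that java is now on your PATH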

6)  Install Apache Spark 2.2 in the top level of your projects directory.  Use the pre-built tarball compiled with Hadoop 2.6 support, then verify that the package contents were installed correctly.
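For example, assuming the Spark 2.2.0 release (any 2.2.x tarball pre-built for Hadoop 2.6 should work the same way):

    cd $PROJECTS/your-project-area
    wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.6.tgz
    tar -xzf spark-2.2.0-bin-hadoop2.6.tgz
    ls spark-2.2.0-bin-hadoop2.6
    # expect subdirectories such as bin, conf, jars, and python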

7)  You are now ready to create a SLURM sbatch script to submit your first jobs.  The following example is based on Glenn Lockwood’s SLURM batch script (https://github.com/glennklockwood/myhadoop/blob/master/examples/slurm.sbatch) but modified to use the newer Hadoop 2.6.5 commands and startup scripts.

The example shown below runs a word count algorithm on the text of Herman Melville's Moby Dick by spinning up a Hadoop cluster inside NOTS and using Hadoop dfs commands to copy input and output files into and out of the virtual HDFS filesystem.

First, create a run directory and cd into it.  Download a copy of Melville's text, then untar and rename it:
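A sketch of this step; the run-directory name, file names, and the download source (Project Gutenberg's plain-text edition) are illustrative, and if your copy arrives as a tarball, untar it before renaming:

    mkdir hadoop_run && cd hadoop_run
    # Project Gutenberg hosts a plain-text Moby Dick; the exact URL and file name may differ
    wget https://www.gutenberg.org/files/2701/2701-0.txt
    mv 2701-0.txt mobydick.txt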

Now, using the editor of your choice, create a file called wordcount.py.  This is the Python script that pyspark will execute for our example:
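A minimal PySpark word count along these lines will work; this is a sketch rather than necessarily the exact script from the original page, and it takes the input path and output directory as command-line arguments to match the spark-submit invocation sketched in the next step:

    #!/usr/bin/env python
    # wordcount.py - count word frequencies in a text file with PySpark
    from __future__ import print_function
    import sys
    from pyspark import SparkConf, SparkContext

    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: wordcount.py <input> <output_dir>", file=sys.stderr)
            sys.exit(1)

        conf = SparkConf().setAppName("MobyDickWordCount")
        sc = SparkContext(conf=conf)

        # read the text, split each line into words, and sum a count of 1 per word
        counts = (sc.textFile(sys.argv[1])
                    .flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

        # write the (word, count) pairs out as text; on YARN this lands in HDFS
        counts.saveAsTextFile(sys.argv[2])
        sc.stop()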

Next, we create the sbatch script, hadoop_py_wc.sbatch, which sets up the Hadoop environment and starts the Hadoop cluster with HDFS/YARN using the myhadoop configuration modifications.  Please note the comments in the script, which explain each step of the job's configuration.  Also note the spark-submit command; this is the Spark tool used to launch applications on the Hadoop/YARN cluster.  When the spark-submit job is complete, the script retrieves the output files from HDFS and *then* spins down the Hadoop cluster.  A cleanup script is then run, and the SLURM job finishes.
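A sketch of such a script, following the structure of Lockwood's example, is shown below.  The partition name, node count, module name, and install paths are placeholders for your own environment, and the myhadoop-configure.sh invocation (including any options it needs, such as a node-local scratch directory) should be checked against the myhadoop user guide rather than taken as definitive:

    #!/bin/bash
    #SBATCH --job-name=hadoop_py_wc
    #SBATCH --partition=commons          # placeholder partition; use one you have access to
    #SBATCH --nodes=3
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=00:30:00

    # Load Java (placeholder module name; use the module your cluster provides)
    module load Java

    # Paths to the software installed in the earlier steps (placeholders for your own)
    export HADOOP_HOME=$PROJECTS/your-project-area/hadoop-2.6.5
    export SPARK_HOME=$PROJECTS/your-project-area/spark-2.2.0-bin-hadoop2.6
    export MH_HOME=$PROJECTS/your-project-area/myhadoop
    export PATH=$MH_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$PATH

    # Per-job Hadoop configuration directory, written by myhadoop
    export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID

    # 1) Generate a Hadoop configuration for the nodes SLURM allocated to this job.
    #    Additional options (e.g. a scratch directory) may be required; see the
    #    myhadoop user guide for your installation.
    myhadoop-configure.sh

    # 2) Start HDFS and YARN with the Hadoop 2.x startup scripts
    $HADOOP_HOME/sbin/start-dfs.sh
    $HADOOP_HOME/sbin/start-yarn.sh

    # 3) Copy the input text into the per-job HDFS filesystem
    hdfs dfs -mkdir -p /user/$USER
    hdfs dfs -put mobydick.txt /user/$USER/mobydick.txt

    # 4) Run the PySpark word count on the YARN cluster
    spark-submit --master yarn --deploy-mode client \
        wordcount.py /user/$USER/mobydick.txt /user/$USER/output_directory

    # 5) Retrieve the results from HDFS *before* tearing the cluster down
    hdfs dfs -get /user/$USER/output_directory ./output_directory

    # 6) Stop YARN/HDFS and clean up the per-job Hadoop state
    $HADOOP_HOME/sbin/stop-yarn.sh
    $HADOOP_HOME/sbin/stop-dfs.sh
    myhadoop-cleanup.sh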

8) Now, execute the sbatch script created above, observe the queue with the squeue command, and wait for the job to finish.  When it completes, examine the files in output_directory (retrieved from the HDFS filesystem prior to shutting down the Hadoop cluster).

Start the job:
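For example:

    sbatch hadoop_py_wc.sbatch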

You can then check in on the progress of your active jobs by querying with squeue:
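For example (substitute your own username):

    squeue -u joeuser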

Monitor the directory for output.  When the job finishes, you should see a directory called output_directory.  Examine the contents of the output files using grep:
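For example, to look for a specific word in the results (the search word is illustrative; Spark writes its output as part-* files inside output_directory):

    ls output_directory
    grep -i "whale" output_directory/part-*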

Review the log files created by the job and by the Hadoop cluster if necessary:
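For example, assuming the default SLURM output file name slurm-<jobid>.out and the per-job configuration directory written by myhadoop (the job ID shown is a placeholder):

    less slurm-1234567.out       # SLURM job output, including Hadoop/Spark startup messages
    ls hadoop-conf.1234567       # per-job Hadoop configuration generated by myhadoop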

 

Additional Resources:

Spark cluster configuration:
https://spark.apache.org/docs/latest/cluster-overview.html

Running Spark on Hadoop/YARN:
https://spark.apache.org/docs/latest/running-on-yarn.html

Submitting Spark applications (spark-submit):
https://spark.apache.org/docs/latest/submitting-applications.html


 
