NOTS (Night Owls Time-Sharing Service) is a batch-scheduled HTC cluster running on the Rice Big Research Data (BiRD) cloud infrastructure. The system consists of 156 dual-socket Intel compute nodes housed as compute blades within HPE SL230s and HPE Apollo 2000 chassis. All nodes are interconnected with a 10 GigE network. In addition, the Apollo chassis are connected with a high-speed Omni-Path interconnect for message passing applications. A 160 TB Lustre filesystem is connected to the compute nodes via 10 GbE. The system can support various workloads, including single-node, parallel, large-memory multithreaded, and GPU jobs.
|Hardware||Nodes||CPU||Cores||Hyperthreaded||RAM||Disk||High Speed Interconnect|
|HPE SL230s||128||Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz||16||Yes||varies: 32GB to 128 GB||4 TB/node||No|
|HPE Apollo 2000||28||Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz||24||Yes||varies: 32GB to 128 GB||200 GB/node||Yes|
|HPE SL230s||2||Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz||32||128 GB||4 x Tesla K80||4 TB/node|
Prerequisite for Using This System
All of the clusters that make up our shared computing resources run the Linux operating system (not Windows or MacOS operating systems). In order to effectively utilize the clusters you must have knowledge of Linux, how to navigate the filesystem, how to create, edit, rename, and delete files, and how to run basic commands and write small scripts. If you need assistance in this area please review the tutorials that are available on our web site.
Logging Into the Cluster
The cluster login nodes can be accessed through Secure Shell (SSH) from any machine on the Rice campus network. You will need an active NetID and password in order to log in (unless otherwise instructed).
If you need off-campus access, please visit our Off-Campus Access Guide.
To log in to the system from a Linux or Unix machine, use the ssh command:
To transfer files into the cluster from a Linux or Unix machine, use the scp command:
For more information about using Secure Shell, please see our Using SSH to Login and Copy Files Guide.
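The ssh and scp commands mentioned above can be sketched as follows. This is only an illustration: the login host name is a placeholder (this page does not list it), and `your_netid` and `input.dat` are hypothetical. The commands are printed rather than executed so the snippet is safe to try on any machine.

```shell
# Placeholders: substitute your own NetID and the login host name you were given.
NETID="your_netid"
LOGIN_HOST="login.cluster.example.edu"   # hypothetical, not the real host name

# Log in to one of the login nodes.
ssh_cmd="ssh ${NETID}@${LOGIN_HOST}"

# Copy a local file into your home directory on the cluster.
scp_cmd="scp input.dat ${NETID}@${LOGIN_HOST}:~/"

# Print the commands instead of running them.
echo "$ssh_cmd"
echo "$scp_cmd"
```

Once the real values are filled in, the two printed lines can be run directly at your local prompt.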
Once you are logged in to the system, you are logged into one of several login nodes as shown in the diagram below. These nodes are intended for users to compile software, prepare data files, and submit jobs to the job queue. They are not intended for running compute jobs. Please run all compute jobs in one of the job queues described later in this document.
Diagram courtesy of Chris Hunter, Rice University
Data and Quotas
A summary of all filesystems available to all users is presented in the following table:
Accessed via environment variable
Group Project directories
100 GB per group
|Work storage space||$WORK||/storage/hpc/work||214 TB||2 TB per group||NFS||none|
Shared Scratch high performance I/O
Local Scratch on each node
at the end of each job
Research Data Compliance
Due to recent changes in NSF, NIH, DOD, and other government granting agencies, Research Data Management has become an important area of growth for Rice and is a critical factor in both conducting and funding research. The onus of maintaining and preserving research data generated by funded research is placed squarely upon the research faculty, post docs, and graduate students conducting the research. It is imperative that you are aware of your compliance responsibilities so as not to jeopardize the ability of Rice University to receive federal funding. We will help in any way possible to provide you the information and assistance you need, but the best place to start is the campus research data management website.
To see your current quota and your disk usage for your home directory, run this command:
To see the quota and usage for the $PROJECTS directories for all groups that you belong to, run this command:
To see the quota and usage for the $WORK directories on DAVinCI, BIOU, and PO for the primary group to which you belong, run this command:
The clustered file system $SHARED_SCRATCH provides fast, high-bandwidth I/O for running jobs. Though not limited by quotas, $SHARED_SCRATCH is intended for in-flight data being used as input and output for running jobs, and may be periodically cleaned through voluntary and involuntary means as use and abuse dictate.
For information on how to use $PROJECTS, please see our FAQ.
Environment and Shells
The default shell on all the CRC clusters is bash. Other popular shells are available. To have your account's default shell changed from bash to one of these, please file a help request and specify the cluster, username, and desired shell in the ticket. Once your shell is changed, the change applies to all clusters to which you have access. Any login sessions that are active when your shell is changed must be terminated for the change to take effect.
Due to the nature of high performance applications and the batch scheduling system used on CRC clusters, managing your shell environment variables properly is vital.
Customizing Your Environment With the Module Command
Each user can customize their environment using the module command. This command lets you select software and will source the appropriate paths and libraries. All the requested user applications are located under the /opt/apps directory.
To list what applications are available, use the spider subcommand:
To see a description of a specific package, use the spider subcommand again:
To load the module for OpenMPI built with the GCC compilers, for example, use the load subcommand:
To see a list of modules that you have loaded, use this command:
To change to the Intel compiler build of OpenMPI, use the swap subcommand:
To unload all of your modules, use this command:
To make sure a set of modules is loaded automatically at login, use the module save subcommand:
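The module subcommands described above can be sketched as one short session. The module names `GCC`, `OpenMPI`, and `iccifort` are taken from the compiler section of this page; the exact arguments to `swap` are an assumption, and `list` and `purge` are the standard Lmod subcommands for showing and unloading modules. The small `try_module` helper is hypothetical: it echoes each command and only runs it where the `module` command exists.

```shell
# Echo a module subcommand, and run it only if the module command is available.
try_module() {
  echo "+ module $*"
  if command -v module >/dev/null 2>&1; then
    module "$@" || echo "  (module $1 failed; check the module names on your cluster)"
  fi
}

try_module spider                 # list all available applications
try_module spider OpenMPI         # describe one package
try_module load GCC OpenMPI       # OpenMPI built with the GCC compilers
try_module list                   # show the modules currently loaded
try_module swap GCC iccifort      # assumed arguments: switch to the Intel build
try_module purge                  # unload all modules
try_module load GCC OpenMPI       # reload the set you want to keep
try_module save                   # save the loaded set as your default collection
```

On a login node you would normally type the `module ...` commands directly; the wrapper is only there so the sketch does not fail on machines without the module system.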
The Job Scheduler
The batch job scheduling system implemented on this system uses SLURM. SLURM is responsible for resource management, job scheduling, and monitoring.
Fairshare Scheduling Policy
We implement the SLURM Fairshare feature to provide fair utilization of the available resources. This is accomplished by incorporating historical resource utilization into job feasibility and priority decisions. Fairshare is normally the most significant component of a job's priority, which ultimately determines the position of the job in the queue. We do not use a FIFO (First-In-First-Out) scheduler. Your job's priority is determined by your utilization over the past seven days (sliding window), with high utilization resulting in lower priority for new jobs.
Backfill Scheduling Policy
This is a scheduling optimization which allows SLURM to make better use of available resources by running jobs out of order. Using job data such as walltime and resources requested, the scheduler can start other, lower-priority jobs so long as they do not delay the highest priority jobs. Because of the way it works, essentially filling in holes in node space, backfill tends to favor smaller and shorter running jobs more than larger and longer running ones.
Automatic Queue Routing
Each of our compute resources has a pre-defined default queue. If you submit your job without specifying a queue, your job will be automatically routed to the default queue. Therefore, be aware of which queue you intend for the job to run in and specify this queue in your SLURM batch script.
Available Partitions and System Load
The queues are defined as follows:
commons - intended for jobs that need one node for up to 24 hours.
interactive - intended for short jobs, debugging sessions, and interactive jobs. See our FAQ for information on interactive jobs.
scavenge - intended for jobs that need one node for up to 4 hours. These jobs take advantage of idle nodes, possibly shortening your wait time.
Use the following command to determine the partitions to which you have access. Note that the value in the Account column of the output must be provided in your batch script, in addition to the partition.
Determining Partition Status
A good way to obtain the status of all partitions and their current usage is to run the following SLURM command:
Here is a brief description of the relevant fields:
PARTITION: Name of a partition. Note that the suffix "*" identifies the default partition.
AVAIL: Partition state: up or down.
TIMELIMIT: Maximum time limit for a user job in days-hours:minutes:seconds.
NODES: Count of nodes with this particular configuration by node state in the form "[A]llocated/[I]dle/[O]ther/[T]otal".
STATE: State of the nodes.
NODELIST: Names of nodes associated with this configuration/partition.
See the sinfo man page for more information.
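As a sketch of how the partition queries fit together: `sacctmgr show assoc user=...` is the command this page gives for listing your accounts and partitions, and `sinfo` reports partition status. Using the `-s` (summary) and `-p` (single partition) options here is an assumption about what is most useful, not necessarily the exact invocation this page originally showed. The `show_partitions` function name is hypothetical.

```shell
show_partitions() {
  if command -v sinfo >/dev/null 2>&1; then
    # Accounts and partitions associated with your user name.
    sacctmgr show assoc user="${USER:-$(id -un)}"
    # One line per partition, with node counts by state.
    sinfo -s
    # Per-state detail for a single partition (commons is described above).
    sinfo -p commons
  else
    echo "SLURM commands not found; run this on a cluster login node."
  fi
}

show_partitions
```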
Submitting Jobs with SLURM on NOTS
Once you have an executable program and are ready to run it on the compute nodes, you must create a job script that performs the following functions:
- Use job batch options to request the resources that will be needed (e.g., number of processors, run time, etc.), and
- Use commands to prepare for execution of the executable (e.g., cd to the working directory, source shell environment files, copy input data to a scratch location, copy needed output off of the scratch location, clean up scratch files, etc.).
After the job script has been constructed you must submit it to the job scheduler for execution. The remainder of this section will describe the anatomy of a job script and how to submit and monitor jobs.
SLURM Batch Script Options
All jobs must be submitted via a SLURM batch script or by invoking sbatch at the command line. See the table below for SLURM submission options.
|#SBATCH --job-name=YourJobName||Recommended: Assigns a job name. The default is the name of the SLURM job script.|
|#SBATCH --partition=PartitionName||Recommended: Specify the name of the partition (queue) to use. Use this to specify the default partition or a special partition (i.e. a non-condo partition) to which you have access.|
|#SBATCH --ntasks=1||Required: The maximum number of tasks per job. Usually used for MPI jobs.|
|#SBATCH --cpus-per-task=16||Recommended: The number of CPUs per task. Usually used for OpenMP or multi-threaded jobs.|
|#SBATCH --time=00:30:00||Required: The maximum run time needed for this job to run, in days-hh:mm:ss.|
|#SBATCH --mem-per-cpu=1G||Recommended: The maximum amount of physical memory used by any single process of the job, in [M]ega-, [G]iga-, or [T]era-bytes. See our FAQ for more details.|
|#SBATCH --mail-user=YourEmailAddress||Recommended: Email address for job status messages.|
|#SBATCH --mail-type=ALL||Recommended: SLURM will notify the user via email when the job reaches the following states: BEGIN, END, FAIL, or REQUEUE.|
|#SBATCH --nodes=1 --exclusive||Optional: Using both of these options will give your job exclusive access to a node such that no other jobs can share the node. This combination of arguments will assign eight tasks to your job and will give it exclusive access to all of the resources (i.e. memory) of the entire node without interference from other jobs. Please see our FAQ for more details on exclusive access.|
|#SBATCH --output=mypath||Optional: The full path for the standard output (stdout) and standard error (stderr) "slurm-%j.out" file, where the "%j" is replaced by the job ID. Current working directory is the default.|
|#SBATCH --error=mypath||Optional: The full path for the standard error (stderr) "slurm-%j.out" files. Use this only when you want to separate stderr from stdout. Current working directory is the default.|
|#SBATCH --export=ALL||Optional: Exports all environment variables to the job. See our FAQ for details.|
|#SBATCH --account=AccountName||You need to specify the name of the condo account to use a condo on the cluster. Use the command sacctmgr show assoc user=netID to show the accounts and partitions to which you have access.|
Serial Job Script
A job script may consist of SLURM directives, comments and executable statements. A SLURM directive provides a way of specifying job attributes in addition to the command line options. For example, we could create a myjob.slurm script this way:
This example script will submit a job to the default partition using 1 task, 1GB of memory per processor core, with a maximum run time of 30 minutes.
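A script matching that description might look like the following sketch. The job name and the executable `./myprogram` are hypothetical; the resource values (1 task, 1 GB per core, 30 minutes) come from the description above. The snippet writes the script to `myjob.slurm` and checks its shell syntax without submitting anything.

```shell
# Write a minimal serial job script to myjob.slurm.
cat > myjob.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=serial-sketch
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=00:30:00

echo "Job ${SLURM_JOB_ID} running on $(hostname)"

# Launch the (hypothetical) executable through srun.
srun ./myprogram
EOF

# Check the shell syntax only; nothing is submitted here.
bash -n myjob.slurm && echo "myjob.slurm: syntax OK"
```

No partition is named, so the job would be routed to the default partition, as the text describes.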
If you need to debug your program and want to run in interactive mode, the same request above could be constructed like this (via the srun command):
For more details on interactive jobs, please see our FAQ on this topic.
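As a rough sketch of the interactive form, the same resources can be passed to srun on the command line. The `--pty` option attaches your terminal to the shell started on the compute node. The exact option set your site expects is an assumption here (for instance, you may also need to name the interactive partition with `--partition`). The command is printed rather than executed so the snippet does not try to open a session.

```shell
# Same request as the batch example: 1 task, 1 GB per core, 30 minutes.
interactive_cmd=(srun --pty --ntasks=1 --mem-per-cpu=1G --time=00:30:00 "${SHELL:-/bin/bash}")

# Print the command; paste it at a login-node prompt to start the session.
echo "${interactive_cmd[*]}"
```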
SLURM Environment Variables in Job Scripts
When you submit a job, it will inherit several environment variables that are automatically set by SLURM. These environment variables can be useful in your job submission scripts as seen in the examples above. A summary of the most important variables is presented in the table below.
|$SHARED_SCRATCH||Location of shared scratch space. See our FAQ for more details.|
|$LOCAL_SCRATCH||Location of local scratch space on each node.|
|$SLURM_JOB_NODELIST||Environment variable containing a list of all nodes assigned to the job.|
|$SLURM_SUBMIT_DIR||Path from where the job was submitted.|
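The following sketch shows one way these variables might be combined to stage data through shared scratch, following the steps listed earlier (copy input to scratch, run, copy output back, clean up). The per-job directory layout under `$SHARED_SCRATCH`, the file names, and the executable `myprogram` are all assumptions for illustration.

```shell
# Write a job script that stages data through shared scratch.
cat > scratch_job.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=scratch-sketch
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=00:30:00

# Stop on errors and on unset variables, so the cleanup below cannot
# run against an unexpected path.
set -eu

# Per-job working directory on shared scratch (this layout is an assumption).
WORKDIR="${SHARED_SCRATCH}/${USER}/job_${SLURM_JOB_ID}"
mkdir -p "$WORKDIR"

# Stage input from the directory where sbatch was invoked.
cp "${SLURM_SUBMIT_DIR}/input.dat" "$WORKDIR/"
cd "$WORKDIR"

echo "Nodes assigned to this job: ${SLURM_JOB_NODELIST}"
srun "${SLURM_SUBMIT_DIR}/myprogram" input.dat > output.dat

# Copy results back, then clean up scratch.
cp output.dat "${SLURM_SUBMIT_DIR}/"
cd "${SLURM_SUBMIT_DIR}"
rm -rf "$WORKDIR"
EOF

# Check the shell syntax only; nothing is submitted here.
bash -n scratch_job.slurm && echo "scratch_job.slurm: syntax OK"
```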
Job Launchers (srun)
For all jobs run on the cluster, we require that you use srun to launch your job. The job launcher's purpose is to spawn copies of your executable across the resources allocated to your job. By default, srun only needs your executable; the rest of the information will be extracted from SLURM.
The following is an example of how to use srun inside your SLURM batch script. This example will run myMPIprogram as a parallel MPI code on all of the processors allocated to your job by SLURM:
This example script will submit a job to the default partition using 16 processor cores per node, 1GB of memory per processor core, with a maximum run time of 30 minutes.
The following example will run myMPIprogram on only four processors even if your batch script requested more than four.
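The two srun invocations described above can be sketched in a single batch script. The resource values follow the description (16 cores per node, 1 GB per core, 30 minutes); the node count of 1 and the job name are assumptions, and the module names are the ones listed in the compiler section of this page.

```shell
# Write an MPI job script to mpijob.slurm.
cat > mpijob.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=mpi-sketch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=1G
#SBATCH --time=00:30:00

# Load the same MPI stack the program was built with.
module load GCC OpenMPI

# Launch one copy of the program per allocated task.
srun ./myMPIprogram

# Launch on only four processors, even though more were requested.
srun -n 4 ./myMPIprogram
EOF

# Check the shell syntax only; nothing is submitted here.
bash -n mpijob.slurm && echo "mpijob.slurm: syntax OK"
```

In practice a script would contain one of the two srun lines, not both; they are shown together only for comparison.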
Submitting and Monitoring Jobs
Once your job script is ready, use sbatch to submit it as follows:
This will return a jobID number while the output and error stream of the job will be saved to one file inside the directory where the job was submitted, unless you specified otherwise.
The status of the job can be obtained using SLURM commands. See the table below for a list of commands:
Show a detailed list of all submitted jobs.
|squeue -j jobID||Show a detailed description of the job given by jobID.|
|squeue --start -j jobID||Gives an estimate of the expected start time of the job given by jobID.|
There are variations to these commands that can also be useful. They are described below:
Show a list of all running jobs.
|squeue -u username||Show a list of all jobs in queue owned by the user specified by username.|
|scontrol show job jobID||To get a verbose description of the job given by jobID. The output can be used as a template when you are attempting to modify a job.|
There are many different states that a job can be in after submission: BOOT_FAIL (BF), CANCELLED (CA), COMPLETED (CD), CONFIGURING (CF), COMPLETING (CG), FAILED (F), NODE_FAIL (NF), PENDING (PD), PREEMPTED (PR), RUNNING (R), SUSPENDED (S), TIMEOUT (TO), or SPECIAL_EXIT (SE). The squeue command with no arguments will list all jobs in their current state. The most common states are described below.
Running (R): These are jobs that are running.
Pending (PD): These jobs are eligible to run, but there are simply not enough resources to allocate to them at this time.
A job can be deleted by using the scancel command as follows:
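The submit, monitor, and cancel steps can be sketched end to end as follows. This assumes a job script named `myjob.slurm` exists in the current directory. The helper `jobid_from_sbatch` is hypothetical; it relies on sbatch's usual "Submitted batch job <jobID>" message and keeps only the last field.

```shell
# sbatch normally prints "Submitted batch job <jobID>"; keep only the number.
jobid_from_sbatch() { awk '{print $NF}'; }

if command -v sbatch >/dev/null 2>&1; then
  jobid=$(sbatch myjob.slurm | jobid_from_sbatch)
  echo "Submitted job ${jobid}"

  squeue -j "$jobid"            # current state of this job
  squeue --start -j "$jobid"    # estimated start time
  scontrol show job "$jobid"    # verbose description

  scancel "$jobid"              # delete the job
else
  echo "SLURM commands not found; run this on a cluster login node."
fi
```

To list only running jobs, `squeue -t RUNNING` is one option; whether that is the form this page originally showed is an assumption.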
Compiling and Optimizing
Several programming models are supported on this system. Programs that are sequential, parallel (within a node), or distributed can be run. Sequential programs require one processor to run. Parallel and distributed programs use multiple processors concurrently. Parallel programs are a subset of distributed programs. Generally speaking, distributed computing involves parametric sweeps, task farming, and similar workloads, while message-passing and threaded applications generally fall under parallel computing. SPMD (single program, multiple data) is one of the most popular methods of parallelism, in which copies of a single executable each work on their own data.
The supported compilers on this system are Intel and GCC. The MPI implementations from OpenMPI and Intel are available for both compilers and can be loaded upon demand using the module command. The preferred compiler for this system is Intel.
Compiling Serial Code
To compile serial code you must first load the appropriate compiler environment module. To load the Intel compiler, execute this command:
Once the environment is set, you can compile your program with one of the following (using Intel compiler as an example):
When invoked as described above, the compiler will perform the preprocessing, compilation, assembly, and linking stages in a single step. The output file (or executable) is specified by executablename and the source code file is specified by sourcecode.f77, for example. Omitting the -o executablename option will result in the executable being named a.out by default. For additional instructions and advanced options, please view the online manual pages for each compiler (e.g., execute the command man ifort).
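A single-step compile of this kind might look like the sketch below. It uses a tiny C program with the hypothetical file name `hello.c`, prefers the Intel C compiler (`icc`) when it is on the PATH, and falls back to `gcc` otherwise. The `-O2` optimization flag is an illustrative choice, not a site recommendation. The Fortran form named in the text would be analogous: `ifort -o executablename sourcecode.f77`.

```shell
# A tiny C program to compile (hypothetical file name).
cat > hello.c <<'EOF'
#include <stdio.h>

int main(void) {
    printf("hello from a serial program\n");
    return 0;
}
EOF

# Prefer the Intel C compiler; fall back to gcc where icc is not on the PATH.
CC_BIN=$(command -v icc || command -v gcc || true)

if [ -n "$CC_BIN" ]; then
  # Preprocess, compile, assemble, and link in one step, then run the result.
  "$CC_BIN" -O2 -o hello hello.c && ./hello
else
  echo "No C compiler found; load a compiler module first."
fi
```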
Compiling Parallel Code
To compile a parallel version of your code that has MPI library calls, use the appropriate MPI library. Again, use the module command to load the appropriate compiler environment as follows (Intel versions highly recommended):
|module load GCC OpenMPI||For gcc compiled OpenMPI|
|module load iccifort OpenMPI||For Intel compiled OpenMPI|
|module load GCC impi||For gcc compiled Intel MPI|
|module load iccifort impi||For Intel compiled Intel MPI|
To compile your code you will have to use the MPI compiler wrappers that are currently in your default path. The MPI wrappers are responsible for invoking the compiler, linking your program with the MPI library, and setting the MPI include paths.
Once the environment is set, you can compile your program with one of the following:
When invoked as described above, the compiler will perform the preprocessing, compilation, assembly, and linking stages in a single step. The output file (or executable) is specified by executablename and the source code file is specified by mpi_sourcecode.f77, for example. Omitting the -o executablename option will result in the executable being named a.out by default. For additional instructions and advanced options, please view the online manual pages for each compiler (e.g., execute the command man mpif77).
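As an illustration of the wrapper-based build, the sketch below writes a minimal MPI program in C (hypothetical file name `mpi_hello.c`) and compiles it with `mpicc`, the C wrapper provided by both OpenMPI and Intel MPI. It only builds the program. Running it belongs inside a job, launched with srun. The Fortran wrapper named in the text (`mpif77`) is used the same way.

```shell
# A minimal MPI program: every rank reports its rank and the total count.
cat > mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

if command -v mpicc >/dev/null 2>&1; then
  # The wrapper adds the MPI include and library paths for you.
  mpicc -o mpi_hello mpi_hello.c && echo "built mpi_hello; launch it with srun inside a job"
else
  echo "mpicc not found; load an MPI module first (for example: module load GCC OpenMPI)."
fi
```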
The GNU compiler is installed as part of the Red Hat Enterprise Linux distribution. Use man gcc to view the online manual for the C and C++ compiler, and man gfortran to view the online manual for the Fortran compiler.