1. Introduction

This document provides a brief summary of the information you need to get started quickly on the Supercomputing Outpost (SCOUT). For more detailed information, see the SCOUT User Guide.

2. Hardware

The system contains 22 nodes (tra[001-022]) for machine learning training workloads, each with two IBM POWER9 processors, 512 GB of system memory, 6 NVIDIA V100 GPUs with 32 GB of high-bandwidth memory each, and 12 TB of local solid-state storage. SCOUT also has 128 GPGPU-accelerated nodes for inferencing workloads (inf[001-128]), each with two IBM POWER9 processors, 4 NVIDIA T4 GPUs, 256 GB of system memory, and 3.3 TB of local solid-state storage. There are also 2 visualization nodes, each with two IBM POWER9 processors, 512 GB of system memory, 2 NVIDIA V100 GPUs, and 3.3 TB of local solid-state storage.

3. Get a Kerberos Ticket

For security purposes, you must have a current Kerberos ticket on your computer before attempting to connect to SCOUT. A Kerberos client kit must be installed on your desktop to enable you to get a Kerberos ticket. Information about installing Kerberos clients on your Windows desktop can be found at HPC Centers: Kerberos & Authentication.
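Once a client kit is installed, obtaining and verifying a ticket generally follows the pattern below; the exact commands and realm depend on the kit you install, so follow the HPC Centers instructions:

% kinit username@REALM
% klist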

4. Connect to SCOUT

SCOUT can be accessed via Kerberized ssh as follows:

% ssh scout.arl.hpc.mil

5. Home, Working, and Center-wide Directories

Each user has file space in the $HOME, $WORKDIR, and $CENTER directories. The $HOME, $WORKDIR, and $CENTER environment variables are predefined for you and point to the appropriate locations in the file systems. You are strongly encouraged to use these variables in your scripts.
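For example, a script can reference these variables rather than hard-coded paths (the subdirectory and file names below are illustrative):

% mkdir -p $WORKDIR/my_project
% cp $HOME/inputs/config.dat $WORKDIR/my_project/
% cp $WORKDIR/my_project/results.dat $CENTER/my_project/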

NOTE: $WORKDIR is a "scratch" file system, and $CENTER is a center-wide file system that is accessible to all center production machines. Neither of these file systems is backed up. You are responsible for managing files in your $WORKDIR and $CENTER directories by backing up files to the archive system and deleting unneeded files. Currently, $WORKDIR files that have not been accessed in 21 days and $CENTER files that have not been accessed in 120 days are subject to being purged.

If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, you WILL NOT be notified prior to deletion. You are responsible for monitoring your workspace to prevent data loss.
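For example, you can list files in your scratch space that have not been accessed within the 21-day purge threshold using a standard find command:

% find $WORKDIR -type f -atime +21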

6. Login, Inference, Training, and Visualization Nodes

SCOUT has four types of nodes: login, inference, training, and visualization. When you log into the system, you are placed on a login node. These nodes are typically used for small tasks such as editing and compiling code. When a batch script or interactive job runs, the resulting shell runs on the nodes you requested.

The following "compute nodes" are accessed via the mpirun command, if done interactively. The mpiexec command is typically issued from within an LFS job script. Otherwise, you will not have any compute nodes allocated and your parallel job will run on the login node. If this happens, your job will interfere with (and be interfered with by) other users' login node tasks.

Inference, training, and visualization nodes are named with the prefixes inf, tra, and vis, respectively.
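For example, an interactive session on a compute node can be requested through LSF. The sketch below uses the interactive queue listed in section 9; the exact options required on SCOUT may differ, so check the SCOUT User Guide:

% bsub -Is -q interactive -n 1 /bin/bash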

7. Transfer Files and Data to SCOUT

File transfers to DSRC systems must be performed using Kerberized versions of the following tools: scp, ftp, sftp, and mpscp. For example, the command below uses secure copy (scp) to copy a local file into a destination directory on a SCOUT login node.

% scp local_file scout.arl.hpc.mil:/target_dir
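To copy a file from SCOUT back to your local system, reverse the source and destination (the file names shown are illustrative):

% scp scout.arl.hpc.mil:/target_dir/remote_file local_dir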

For additional information on file transfers to and from SCOUT, see the File Transfers section of the SCOUT User Guide.

8. Submit Jobs to the Batch Queue

The IBM Load Sharing Facility (LSF) is the workload management system for SCOUT. To submit a batch job, use the following command:

% bsub < myjob.lsf

where myjob.lsf is the name of the file containing your batch script.
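As a starting point, a minimal batch script might look like the following sketch. The queue, resource counts, wall-clock time, and program name are illustrative; adapt them to your project and compare against the examples in $SAMPLES_HOME:

#!/bin/bash
#BSUB -J my_job                # job name
#BSUB -q standard              # queue (see section 9)
#BSUB -n 4                     # number of slots requested
#BSUB -W 1:00                  # wall-clock limit (hours:minutes)
#BSUB -o my_job.%J.out         # output file; %J expands to the job ID

cd $WORKDIR/my_project
mpiexec ./my_program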

For more information on job scripts, see the SCOUT User Guide or the sample script examples found in the $SAMPLES_HOME directory on SCOUT.

9. Batch Queues

The following table describes the LSF queues available on SCOUT:

Queue Descriptions and Limits on SCOUT
Queues are listed in order of decreasing priority, from highest (transfer) to lowest (background).

Priority  Queue Name   Max Wall Clock Time  Max Cores Per Job  Description
Highest   transfer     48 Hours             N/A                Data transfer for user jobs. See the ARL DSRC Archive Guide, section 5.2.
          urgent       96 Hours             N/A                Jobs belonging to DoD HPCMP Urgent Projects
          debug        1 Hour               N/A                Time/resource-limited for user testing and debug purposes
          high         168 Hours            N/A                Jobs belonging to DoD HPCMP High Priority Projects
          frontier     168 Hours            N/A                Jobs belonging to DoD HPCMP Frontier Projects
          HIE          24 Hours             N/A                Rapid response for interactive work. For more information, see the HPC Interactive Environment (HIE) User Guide.
          interactive  12 Hours             N/A                Interactive jobs
          standard     168 Hours            N/A                Standard jobs
Lowest    background   24 Hours             N/A                User jobs that are not charged against the project allocation

10. Monitoring Your Job

You can monitor your batch jobs on SCOUT using the bjobs command.

The bjobs -u all command lists all jobs in the queue. The bjobs command with no options shows only jobs owned by the user, as follows:

% bjobs -u all

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9985    username RUN   normal     login02     6*tra014    *_training Feb 19 0:07
9986    username RUN   normal     login02     6*tra002    *_training Feb 19 0:07
9987    username RUN   normal     login02     6*tra004    *_training Feb 19 0:07
9994    username RUN   normal     login02     4*inf033    *inference Feb 19 0:08
9995    username RUN   normal     login02     4*inf019    *inference Feb 19 0:08
9996    username RUN   normal     login02     4*inf034    *inference Feb 19 0:08
9997    username RUN   normal     login02     4*inf048    *inference Feb 19 0:08

% bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9985    username RUN   normal     login02     6*tra014    *_training Feb 19 0:07

Notice that the output contains the Job ID for each job. This ID can be used with the bkill, bjobs, and bstop commands.

To delete a job, use "bkill JOB#".

To delete all of your jobs, use "bkill -u username".

To get full information on your job(s), use "bjobs -l [JOB#]".

To view a partially completed output file, use the "bpeek JOB#" command.
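For example, using the job ID from the listing above:

% bjobs -l 9985
% bpeek 9985
% bkill 9985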

11. Archiving Your Work

When your job is finished, you should archive any important data to prevent automatic deletion by the purge scripts.

To copy one or more files to the archive system:

% archive put file1

To copy one or more files from the archive system:

% archive get my_data/file1

For more information on archiving your files, see the Archive Guide.

12. Modules

Software modules are a very convenient way to set needed environment variables and include necessary directories in your path so that commands for particular applications can be found. SCOUT uses "modules" to initialize your environment with system commands and libraries, compiler suites, environment variables, and batch system commands.

A number of modules are loaded automatically as soon as you log in. To see the modules that are currently loaded, run "module list". To see the entire list of available modules, run "module avail". You can modify the configuration of your environment by loading and unloading modules. For complete information on how to do this, see the Modules User Guide.
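For example, to check your environment and load an additional package (the module name shown is illustrative; use "module avail" to see what is actually installed on SCOUT):

% module list
% module load gcc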

13. Available Software

A list of software on SCOUT is available on the software page.