SCOUT Quick Start Guide

1. Introduction

This document provides a brief summary of information that you'll need to know to quickly get started working on Supercomputing Outpost (SCOUT). For more detailed information, see the SCOUT User Guide.

2. Hardware

The system contains 22 nodes (tra[001-022]) for machine learning training workloads, each with two IBM POWER9 processors, 512 GB of system memory, six NVIDIA V100 GPUs with 32 GB of high-bandwidth memory each, and 15 TB of local solid-state storage. SCOUT also has 128 GPGPU-accelerated nodes (inf[001-128]) for inferencing workloads, each with two IBM POWER9 processors, four NVIDIA T4 GPUs, 256 GB of system memory, and 4 TB of local solid-state storage. There are also two visualization nodes, each with two IBM POWER9 processors, 512 GB of system memory, two NVIDIA V100 GPUs, and 4 TB of local solid-state storage.

3. Get a Kerberos Ticket

For security purposes, you must have a current Kerberos ticket on your computer before attempting to connect to SCOUT. A Kerberos client kit must be installed on your desktop to enable you to get a Kerberos ticket. Information about installing Kerberos clients on your Windows desktop can be found at HPC Centers: Kerberos & Authentication.

4. Connect to SCOUT

SCOUT can be accessed via Kerberized ssh as follows:

% ssh scout.arl.hpc.mil
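
Because the connection requires a current ticket, it can help to check the ticket cache before running ssh. The following is a minimal sketch, assuming an MIT-style Kerberos client whose klist command supports the -s (silent) option. The helper name has_valid_ticket is illustrative and is not part of the Kerberos client kit.

```shell
# Hypothetical helper: succeeds only when klist reports a readable,
# unexpired ticket cache. Assumes an MIT-style klist with the -s option.
has_valid_ticket() {
    command -v klist >/dev/null 2>&1 && klist -s
}

if has_valid_ticket; then
    echo "Kerberos ticket found; ready to run: ssh scout.arl.hpc.mil"
else
    echo "No valid Kerberos ticket found; run kinit before connecting." >&2
fi
```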

5. Home, Working, and Center-wide Directories

Each user has file space in the $HOME, $WORKDIR, and $CENTER directories. The $HOME, $WORKDIR, and $CENTER environment variables are predefined for you and point to the appropriate locations in the file systems. You are strongly encouraged to use these variables in your scripts.
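
As an illustration of using these variables in a script, the sketch below creates a per-run directory under $WORKDIR instead of hard-coding a path. The helper name make_run_dir and the directory and file names in the usage comment are placeholders, not SCOUT conventions.

```shell
# Hypothetical helper: create a per-run directory under $WORKDIR and print
# its path. Aborts with a message if WORKDIR is not defined (for example,
# when the script is run on a machine other than SCOUT).
make_run_dir() {
    run_dir="${WORKDIR:?WORKDIR is not set}/$1"
    mkdir -p "$run_dir" && printf '%s\n' "$run_dir"
}

# Example use inside a script ("my_run" and "input.dat" are placeholders):
#   cd "$(make_run_dir my_run)" && cp "$HOME/input.dat" .
```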

NOTE: $WORKDIR is a "scratch" file system, and $CENTER is a center-wide file system that is accessible to all center production machines. The $WORKDIR file system is not backed up. You are responsible for managing files in your $WORKDIR directories by backing up files to the archive system and deleting unneeded files. Currently, $WORKDIR files that have not been accessed in 21 days and $CENTER files that have not been accessed in 120 days are subject to being purged.

If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, you WILL NOT be notified prior to deletion. You are responsible for monitoring your workspace to prevent data loss.
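
One way to monitor your workspace is to list files whose last access time is approaching the purge threshold. The sketch below uses the standard find command. The helper name list_stale_files and the 14-day warning threshold are assumptions chosen for illustration, and the result depends on the file system recording access times.

```shell
# Hypothetical helper: list regular files under DIR selected by find's
# "-atime +DAYS" test. find rounds ages down to whole days, so +14 matches
# files last accessed at least 15 days ago.
# Usage: list_stale_files DIR DAYS
list_stale_files() {
    find "$1" -type f -atime +"$2" -print
}

# Example: files in $WORKDIR that are nearing the 21-day purge window.
#   list_stale_files "$WORKDIR" 14
```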

6. Login, Inference, Training, Visualization and Compute Nodes

SCOUT has four types of nodes: login, inference, training, and visualization. When you log into the system, you are placed on a login node. Login nodes are typically used for small tasks such as editing and compiling code. When a batch script or interactive job runs, the resulting shell runs on the requested node.

The "compute nodes" (inference, training, and visualization) are accessed via the mpirun command when working interactively, while the mpiexec command is typically issued from within an LSF job script. If you launch a parallel program outside of an LSF job, you will not have any compute nodes allocated and your parallel job will run on the login node. If this happens, your job will interfere with (and be interfered with by) other users' login node tasks.

The host names of inference, training, and visualization nodes begin with inf, tra, and vis, respectively.
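
The naming convention makes it easy for a script to tell which kind of node it is running on. The sketch below is illustrative only: the helper name node_type is hypothetical, and the "login" prefix is an assumption inferred from the host name login02 that appears in the sample bjobs output in this guide.

```shell
# Hypothetical helper: map a SCOUT host name to its node type using the
# inf/tra/vis prefixes. The "login" prefix is an assumption.
node_type() {
    case "$1" in
        inf*)   echo inference ;;
        tra*)   echo training ;;
        vis*)   echo visualization ;;
        login*) echo login ;;
        *)      echo unknown ;;
    esac
}

# Report the type of the node this shell is running on.
node_type "$(uname -n)"
```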

7. Transfer Files and Data to SCOUT

File transfers to DSRC systems must be performed using Kerberized versions of the following tools: scp, ftp, sftp, and mpscp. For example, the command below uses secure copy (scp) to copy a local file into a destination directory on a SCOUT login node.

% scp local_file scout.arl.hpc.mil:/target_dir

8. Submit Jobs to the Batch Queue

The IBM Load Sharing Facility (LSF) is the workload management system for SCOUT. To submit a batch job, use the following command:

% bsub < my_job_script

where my_job_script is the name of the file containing your batch script.

For more information on job scripts, see the sample script examples found in the $SAMPLES_HOME directory on SCOUT.
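
For orientation, the sketch below writes a minimal batch script and submits it if bsub is available. It is not taken from $SAMPLES_HOME: the job name, queue choice, resource values, and the executable my_program are all placeholders, and your project may require additional directives, so treat the samples on SCOUT as the authoritative reference.

```shell
# Write a minimal LSF batch script. All values below are placeholders.
cat > my_job_script <<'EOF'
#!/bin/bash
# Job name
#BSUB -J my_training
# Queue name (see the Batch Queues section)
#BSUB -q debug
# Wall-clock limit as hh:mm (the debug queue allows up to 1 hour)
#BSUB -W 00:30
# Number of tasks
#BSUB -n 1
# Standard output and error files; %J expands to the job ID
#BSUB -o my_training.%J.out
#BSUB -e my_training.%J.err

# Run from the scratch file system; my_program is a hypothetical executable.
cd "$WORKDIR"
mpiexec ./my_program
EOF

# Submit only where bsub is available (that is, on a SCOUT login node).
if command -v bsub >/dev/null 2>&1; then
    bsub < my_job_script
else
    echo "bsub not found; wrote my_job_script without submitting."
fi
```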

9. Batch Queues

The following table describes the LSF queues available on SCOUT:

Queue Descriptions and Limits on SCOUT
Queues are listed in order of decreasing priority.

Priority  Queue Name   Max Wall Clock Time  Max Cores Per Job  Comments
Highest   transfer     48 Hours             N/A                Data transfer jobs
          urgent       96 Hours             N/A                Designated urgent jobs by DoD HPCMP
          debug        1 Hour               N/A                User diagnostic jobs
          high         168 Hours            N/A                Designated high-priority projects by service/agency
          frontier     168 Hours            N/A                Frontier projects only
          cots         96 Hours             N/A                Abaqus and Fluent jobs
          HIE          24 Hours             N/A                Rapid response for interactive work
          interactive  12 Hours             N/A                Interactive jobs
          standard     168 Hours            N/A                Normal user jobs
Lowest    background   24 Hours             N/A                User jobs that will not be charged against the project allocation

10. Monitoring Your Job

You can monitor your batch jobs on SCOUT using the bjobs command.

With no options, bjobs lists only your own unfinished jobs. To list the jobs of all users, use "bjobs -u all", as follows:

% bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9985    username RUN   normal     login02     6*tra014    *_training Feb 19 0:07

% bjobs -u all
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9985    username RUN   normal     login02     6*tra014    *_training Feb 19 0:07
9986    username RUN   normal     login02     6*tra002    *_training Feb 19 0:07
9987    username RUN   normal     login02     6*tra004    *_training Feb 19 0:07
9994    username RUN   normal     login02     4*inf033    *inference Feb 19 0:08
9995    username RUN   normal     login02     4*inf019    *inference Feb 19 0:08
9996    username RUN   normal     login02     4*inf034    *inference Feb 19 0:08
9997    username RUN   normal     login02     4*inf048    *inference Feb 19 0:08

Notice that the output contains the JobID for each job. This ID can be used with the bkill, bjobs, and bstop commands.

To delete a job, use "bkill JOB#".

To delete all of your jobs, use "bkill -u username 0". The job ID 0 selects every job that matches the other options.

To get full information on your job(s), use "bjobs -l [JOB#]".

To view a partially completed output file, use the "bpeek JOB#" command.
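
When many jobs are listed, a short summary by job state can be easier to read than the full table. The sketch below assumes the default bjobs column layout shown above, with STAT as the third column. The helper name count_jobs_by_stat is hypothetical.

```shell
# Hypothetical helper: read bjobs-style output on standard input and print
# the number of jobs in each state (column 3, STAT), one state per line.
# The header line (first field "JOBID") and blank lines are skipped.
count_jobs_by_stat() {
    awk 'NF > 0 && $1 != "JOBID" { n[$3]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# On SCOUT this would typically be used as:
#   bjobs -u all | count_jobs_by_stat
```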

11. Archiving Your Work

When your job is finished, you should archive any important data to prevent automatic deletion by the purge scripts.

Copy one or more files to the archive system:
% archive put file1

Copy one or more files from the archive system:
% archive get my_data/file1

For more information on archiving your files, see the Archive Guide.
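
If a job produces many output files, one option is to pack them into a single tar file before calling archive put. This is a sketch of that approach rather than documented SCOUT practice: the helper name bundle_results and the directory name my_run are placeholders, and the Archive Guide remains the reference for how the archive command should be used.

```shell
# Hypothetical helper: pack a results directory into one compressed tar file
# so that a single file can be passed to "archive put".
# Usage: bundle_results DIR   (DIR without a trailing slash)
# Writes DIR.tar.gz next to DIR and prints the name of the tar file.
bundle_results() {
    tar -czf "$1.tar.gz" -C "$(dirname "$1")" "$(basename "$1")" &&
        printf '%s\n' "$1.tar.gz"
}

# On SCOUT, with a placeholder directory name:
#   cd "$WORKDIR" && archive put "$(bundle_results my_run)"
```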

12. Modules

Software modules are a convenient way to set needed environment variables and include necessary directories in your path so that commands for particular applications can be found. SCOUT uses Lmod, a Lua-based module system, to initialize your environment with system commands and libraries, compiler suites, environment variables, and LSF batch system commands.

A number of modules are loaded automatically as soon as you log in. To see the modules that are currently loaded, run "module list". To see the entire list of available modules, run "module avail". You can modify the configuration of your environment by loading and unloading modules. For complete information on how to do this, see the Modules User Guide.

13. Available Software

A list of software on SCOUT will be available on the software page when the system goes into production.