SCOUT Quick Start Guide
Table of Contents
- 1. Introduction
- 2. Hardware
- 3. Get a Kerberos Ticket
- 4. Connect to SCOUT
- 5. Home, Working, and Center-wide Directories
- 6. Login, Inference, Training, and Visualization Nodes
- 7. Transfer Files and Data to SCOUT
- 8. Submit Jobs to the Batch Queue
- 9. Batch Queues
- 10. Monitoring Your Job
- 11. Archiving Your Work
- 12. Modules
- 13. Available Software
1. Introduction
This document provides a brief summary of information that you'll need to know to quickly get started working on the Supercomputing Outpost (SCOUT). For more detailed information, see the SCOUT User Guide.
2. Hardware
The system contains 22 nodes (tra[001-022]) for machine learning training workloads, each with two IBM POWER9 processors, 512 GB of system memory, 6 NVIDIA V100 GPUs with 32 GB of high-bandwidth memory each, and 12 TB of local solid-state storage. SCOUT also has 128 GPGPU-accelerated nodes for inferencing workloads (inf[001-128]), each with two IBM POWER9 processors, 4 NVIDIA T4 GPUs, 256 GB of system memory, and 3.3 TB of local solid-state storage. There are also 2 visualization nodes, each with two IBM POWER9 processors, 512 GB of system memory, 2 NVIDIA V100 GPUs, and 3.3 TB of local solid-state storage.
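As a quick check on these figures, the per-node GPU counts multiply out as shown in this short sketch. It is plain shell arithmetic on the numbers quoted in this section; the variable names are illustrative, and the visualization figures are assumed to be per node.

```shell
# GPU totals implied by the node counts in this section.
# Variable names are illustrative; visualization GPUs are assumed to be per node.
train_gpus=$((22 * 6))    # 22 training nodes x 6 V100 GPUs each
infer_gpus=$((128 * 4))   # 128 inference nodes x 4 T4 GPUs each
vis_gpus=$((2 * 2))       # 2 visualization nodes x 2 V100 GPUs each

echo "V100 GPUs on training nodes:      $train_gpus"
echo "T4 GPUs on inference nodes:       $infer_gpus"
echo "V100 GPUs on visualization nodes: $vis_gpus"
```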
3. Get a Kerberos Ticket
For security purposes, you must have a current Kerberos ticket on your computer before attempting to connect to SCOUT. A Kerberos client kit must be installed on your desktop to enable you to get a Kerberos ticket. Information about installing Kerberos clients on your Windows desktop can be found at HPC Centers: Kerberos & Authentication.
4. Connect to SCOUT
SCOUT can be accessed via Kerberized ssh as follows:
% ssh scout.arl.hpc.mil
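Sections 3 and 4 can be combined into a small helper that refuses to connect when no valid ticket is present. The following is a minimal sketch, not a SCOUT-provided tool: scout_login is a hypothetical function name, and it assumes an MIT-style klist in which "klist -s" prints nothing and returns a non-zero status when the credentials cache is missing or expired.

```shell
# Sketch: connect to SCOUT only if a valid Kerberos ticket exists.
# Assumes "klist -s" exits non-zero when there is no usable ticket.
SCOUT_HOST="scout.arl.hpc.mil"   # host name from this guide

scout_login() {
    # $1: your user name on SCOUT (argument introduced for this sketch)
    if ! klist -s; then
        echo "No valid Kerberos ticket found; run kinit first." >&2
        return 1
    fi
    ssh "$1@$SCOUT_HOST"
}
```

After defining the function in your local shell, run "scout_login username". If it reports a missing ticket, obtain one with kinit using the principal and realm supplied with your Kerberos client kit, then try again.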
5. Home, Working, and Center-wide Directories
Each user has file space in the $HOME, $WORKDIR, and $CENTER directories. The $HOME, $WORKDIR, and $CENTER environment variables are predefined for you and point to the appropriate locations in the file systems. You are strongly encouraged to use these variables in your scripts.
NOTE: $WORKDIR is a "scratch" file system, and $CENTER is a center-wide file system that is accessible to all center production machines. Neither of these file systems is backed up. You are responsible for managing files in your $WORKDIR directories by backing up files to the archive system and deleting unneeded files. Currently, $WORKDIR files that have not been accessed in 21 days and $CENTER files that have not been accessed in 120 days are subject to being purged.
If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, you WILL NOT be notified prior to deletion. You are responsible for monitoring your workspace to prevent data loss.
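One way to monitor your workspace is to list files whose last access time is approaching the purge limit. The sketch below uses the standard find command; purge_candidates is a hypothetical helper name, and the 21-day default simply mirrors the $WORKDIR limit quoted above. How the purge scripts themselves select files is not described in this guide, so treat the result as an estimate.

```shell
# Sketch: list files not accessed for at least a given number of days.
# purge_candidates is a hypothetical helper, not a SCOUT command.
purge_candidates() {
    dir="${1:-$WORKDIR}"   # directory to scan (defaults to $WORKDIR)
    days="${2:-21}"        # age threshold in days (defaults to 21)
    # find -atime +N matches files last accessed more than N whole days ago,
    # so N = days - 1 selects files untouched for at least $days days.
    find "$dir" -type f -atime +"$((days - 1))" -print
}
```

For example, "purge_candidates "$WORKDIR" 14" gives roughly a week of warning before the 21-day limit, and "purge_candidates "$CENTER" 120" applies the same check to the center-wide file system.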
6. Login, Inference, Training, and Visualization Nodes
SCOUT has four types of nodes: login, inference, training, and visualization. When you log into the system, you are placed on a login node. Login nodes are typically used for small tasks such as editing and compiling code. When a batch script or interactive job runs, the resulting shell runs on the type of node requested.
The compute nodes (inference, training, and visualization) are accessed via the mpirun command when working interactively. Within an LSF job script, the mpiexec command is typically used instead. If you launch a parallel job outside of an LSF batch or interactive job, no compute nodes are allocated and the job runs on the login node. If this happens, your job will interfere with (and be interfered with by) other users' login node tasks.
Inference, training, and visualization nodes are named with the prefixes inf, tra, and vis, respectively.
7. Transfer Files and Data to SCOUT
File transfers to DSRC systems must be performed using Kerberized versions of the following tools: scp, ftp, sftp, and mpscp. For example, the command below uses secure copy (scp) to copy a local file into a destination directory on a SCOUT login node.
% scp local_file scout.arl.hpc.mil:/target_dir
For additional information on file transfers to and from SCOUT, see the File Transfers section of the SCOUT User Guide.
8. Submit Jobs to the Batch Queue
The IBM Load Sharing Facility (LSF) is the workload management system for SCOUT. To submit a batch job, use the following command:
bsub < myjob.lsf
where myjob.lsf is the name of the file containing your batch script.
For more information on job scripts, see the SCOUT User Guide or the sample script examples found in the $SAMPLES_HOME directory on SCOUT.
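To give a sense of what such a file contains, the sketch below writes a minimal batch script. The #BSUB options used (-J, -P, -q, -W, -n, -o, -e) are standard LSF options, but the job name, project ID, and ./my_app executable are placeholders, and write_job_script is a hypothetical helper. The samples in $SAMPLES_HOME remain the authoritative reference for SCOUT-specific settings.

```shell
# Sketch: write a minimal LSF batch script to the file named in $1.
# Job name, project ID, and ./my_app are placeholders, not SCOUT defaults.
write_job_script() {
    cat > "${1:-myjob.lsf}" <<'EOF'
#!/bin/bash
# Job name (placeholder)
#BSUB -J my_job
# Project or allocation ID (placeholder)
#BSUB -P MY_PROJECT_ID
# Queue name, taken from the Batch Queues table
#BSUB -q debug
# Wall clock limit in hours:minutes
#BSUB -W 00:30
# Number of job slots (tasks)
#BSUB -n 1
# Output and error files; %J expands to the job ID
#BSUB -o myjob.%J.out
#BSUB -e myjob.%J.err

# Run from scratch space rather than $HOME
cd "$WORKDIR"
# Placeholder executable
mpiexec ./my_app
EOF
}
```

Running "write_job_script myjob.lsf" creates the file, which can then be submitted with "bsub < myjob.lsf" as described above.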
9. Batch Queues
The following table describes the LSF queues available on SCOUT:
| Priority | Queue Name | Max Wall Clock Time | Max Cores Per Job | Description |
| Highest | transfer | 48 Hours | N/A | Data transfer for user jobs. See the ARL DSRC Archive Guide, section 5.2. |
| | urgent | 96 Hours | N/A | Designated urgent jobs by DoD HPCMP |
| | debug | 1 Hour | N/A | User diagnostic jobs |
| | high | 168 Hours | N/A | Designated high-priority projects by service/agency |
| | frontier | 168 Hours | N/A | Frontier projects only |
| | HIE | 24 Hours | N/A | Rapid response for interactive work. For more information see the HPC Interactive Environment (HIE) User Guide. |
| | interactive | 12 Hours | N/A | Interactive jobs |
| | standard | 168 Hours | N/A | Normal user jobs |
| Lowest | background | 24 Hours | N/A | User jobs that will not be charged against the project allocation |
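When choosing a queue, it can help to check your expected run time against these limits before submitting. The following sketch encodes the wall clock limits from the table in a lookup function; queue_max_hours and fits_queue are hypothetical helper names, and the values are only as current as the table itself.

```shell
# Sketch: wall clock limits (in hours) copied from the queue table.
# Helper names are illustrative, not SCOUT commands.
queue_max_hours() {
    case "$1" in
        debug)                  echo 1 ;;
        interactive)            echo 12 ;;
        HIE|background)         echo 24 ;;
        transfer)               echo 48 ;;
        urgent)                 echo 96 ;;
        high|frontier|standard) echo 168 ;;
        *) echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# Succeeds if a job needing $2 hours fits within the limit of queue $1.
fits_queue() {
    limit=$(queue_max_hours "$1") || return 1
    [ "$2" -le "$limit" ]
}
```

For example, "fits_queue debug 2 || echo "too long for debug"" reports that a two-hour job exceeds the one-hour debug limit.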
10. Monitoring Your Job
You can monitor your batch jobs on SCOUT using the bjobs command.
The bjobs -u all command lists all jobs in the queue. The bjobs command with no options shows only jobs owned by the user, as follows:
% bjobs -u all
JOBID  USER      STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
9985   username  RUN   normal  login02    6*tra014   *_training  Feb 19 0:07
9986   username  RUN   normal  login02    6*tra002   *_training  Feb 19 0:07
9987   username  RUN   normal  login02    6*tra004   *_training  Feb 19 0:07
9994   username  RUN   normal  login02    4*inf033   *inference  Feb 19 0:08
9995   username  RUN   normal  login02    4*inf019   *inference  Feb 19 0:08
9996   username  RUN   normal  login02    4*inf034   *inference  Feb 19 0:08
9997   username  RUN   normal  login02    4*inf048   *inference  Feb 19 0:08
% bjobs
JOBID  USER      STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
9985   username  RUN   normal  login02    6*tra014   *_training  Feb 19 0:07
Notice that the output contains the Job ID for each job. This ID can be used with the bkill, bjobs, and bstop commands.
To delete a job, use "bkill JOB#".
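To delete all of your jobs, use "bkill -u username 0". The job ID 0 tells bkill to act on every job that matches the other options; without it, only the most recently submitted job is killed.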
To get full information on your job(s), use "bjobs -l [JOB#]".
To view a partially completed output file, use the "bpeek JOB#" command.
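Because the Job ID is the first column and the job state is the third, the default bjobs listing is easy to filter in a script. The sketch below assumes the column layout shown in the sample output above and one job per line; jobs_in_state is a hypothetical helper name.

```shell
# Sketch: print the job IDs of jobs in a given state from bjobs-style output.
# Reads the listing on standard input; jobs_in_state is a hypothetical helper.
jobs_in_state() {
    # Skip the header line (NR > 1); column 1 is JOBID, column 3 is STAT.
    awk -v state="$1" 'NR > 1 && $3 == state { print $1 }'
}
```

For example, "bjobs | jobs_in_state RUN" prints the IDs of your running jobs, one per line, which can then be passed to bpeek or bkill.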
11. Archiving Your Work
When your job is finished, you should archive any important data to prevent automatic deletion by the purge scripts.
Copy one or more files to the archive system:
archive put file1
Copy one or more files from the archive system:
archive get my_data/file1
For more information on archiving your files, see the Archive Guide.
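A directory of results is often easier to manage as a single file, so you may want to bundle it with tar before archiving. This is a minimal sketch using the standard tar command; bundle_for_archive and the file names in the example are illustrative and not part of the archive tooling.

```shell
# Sketch: bundle a results directory into one compressed tar file.
# bundle_for_archive is a hypothetical helper, not a SCOUT command.
bundle_for_archive() {
    src_dir="$1"   # directory to bundle
    tarball="$2"   # tar file to create
    # -C changes to the parent directory so the tar file stores relative paths.
    tar -czf "$tarball" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}
```

For example, run "bundle_for_archive "$WORKDIR/run42" "$WORKDIR/run42.tar.gz"", then change to $WORKDIR and run "archive put run42.tar.gz".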
12. Modules
Software modules are a convenient way to set needed environment variables and include necessary directories in your path so that commands for particular applications can be found. SCOUT uses "modules" to initialize your environment with system commands and libraries, compiler suites, environment variables, and batch system commands.
A number of modules are loaded automatically as soon as you log in. To see the modules that are currently loaded, run "module list". To see the entire list of available modules, run "module avail". You can modify the configuration of your environment by loading and unloading modules. For complete information on how to do this, see the Modules User Guide.
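In a job script it is common to load the modules a job needs at the top, and to stop early if one is missing. The following sketch wraps "module load" for that purpose; load_modules is a hypothetical helper, and it assumes the module command is defined in the current shell, as it is after logging in to SCOUT.

```shell
# Sketch: load a list of modules, stopping at the first failure.
# load_modules is a hypothetical helper built on the standard "module load".
load_modules() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    for m in "$@"; do
        module load "$m" || return 1
    done
}
```

Call it with names taken from "module avail", for example "load_modules name1 name2", where name1 and name2 stand for real module names on the system.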
13. Available Software
A list of software on SCOUT is available on the software page.