Research Computing Team and Service
Here to support research(ers)
Provide training
Support users of Grid and Cloud Computing platforms
Provide consultancy
To develop project proposals
To help recruit people with specialist skills
To work directly on research projects
For details please see our Website
Contact us via the IT Service Desk
If you follow a link, a new tab will open in the web browser alongside the Slide View.
This tab will need closing, or the Slide View re-activating as the main tab in the browser.
Maybe point out some sections/features on the Research Computing Website.
We support the use of these resources and develop new ones.
Purpose of HPC1
Introducing Research Computing and the HPCs at Leeds
Hands on with Linux and ARC4
Running code
Batch and Interactive jobs
Data management
Joys of parallel jobs
Advanced job submissions
Training
Introductions and Motivations
Who are you and why are you here?
What problems are you encountering with your computational work now?
Why / how do you think HPC will help?
Key Concepts
High Performance Computing (HPC)
High Throughput Computing (HTC)
“Supercomputing”
Applications
Terminology
Node : the physical machine/server. In current systems, a node would typically include one or more processors, as well as memory and other hardware.
Processor : the central processing unit (CPU) inside the node, which contains one or more cores.
Core : Refers to the basic computation unit of the CPU. This is the unit that carries out the actual computations.
Leeds facilities
ARC3 brought into service in 2017
ARC4 brought into service in 2019
A Supercomputer isn’t…
Single computer vs grid of computers
Serial and parallel programs
Serial programs run on a single CPU core, solving one problem at a time.
Parallel programs run across multiple CPU cores, splitting the workload between them and solving the problem faster.
Serial Program
Parallel Program
Amdahl’s Law
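For reference, Amdahl's Law gives the maximum speedup S(N) on N cores when a fraction p of the work can be parallelised:
S(N) = \frac{1}{(1 - p) + p/N}
As N grows this approaches 1/(1 - p), so the serial fraction limits the achievable speedup; for example p = 0.9 caps the speedup at 10x, however many cores are used.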
Basic parallel machine
Differences from Desktop computing?
We don’t log on to compute nodes directly
submit jobs via a batch scheduling system
Not a GUI-based environment
System is shared with many other users
Resources more tightly monitored and controlled
Memory
CPU usage (‘cores’)
Time
Benefits of using HPC
Speed
Volume
Cost
Efficiency
Convenience
Parallel Paradigms
From a systems perspective:
Shared memory parallelism
Distributed memory parallelism
Unless you are writing your own codes, the software developer takes care of this.
Basic HPC system layout
Exercise 1.1
What do the following Linux commands do? How might they be used on the HPC service?
Exercise 1.2
On the HPC service, you have a ‘HOME’ directory of 10GB and can create a directory on the /nobackup drive.
Using the man pages (or Google…) investigate how you could use the following commands to manage your storage:
Exercise 1.3
Linux systems include a number of file compression routines.
Find out which ones are available on the cluster and use them to create a compressed archive of a directory and its contents.
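One possible approach, as a sketch (my_results is a placeholder directory name; the cluster may also offer other tools such as zip or bzip2):
# create a gzip-compressed tar archive of a directory
tar -czvf my_results.tar.gz my_results/
# list the archive contents without extracting
tar -tzvf my_results.tar.gz
# extract it again
tar -xzvf my_results.tar.gz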
Exercise 1.4
How could you read a PDF file or an HTML document on the cluster?
HPC at Leeds
ARC4 is the latest incarnation of central HPC at Leeds.
Central HPC currently comprises two facilities, ARC3 and ARC4
All Faculties are shareholders, so it is important that everyone who can benefit from these facilities makes use of them.
ARC3
2 x login nodes: 24 cores and 128GB RAM
252 x standard compute nodes: 24 cores and 128GB RAM (=6048 cores); 100GB SSD
4 x high memory nodes: 24 cores and 768GB RAM
6 x P100 GPU nodes: 24 cores, 128GB RAM and 4 x NVIDIA P100
2 x K80 GPU nodes: 24 cores, 128GB RAM and 2 x NVIDIA K80
2 x Intel Xeon Phi (Knights Landing) vector processor nodes
836TB of high speed storage: /nobackup
ARC4
2 x login nodes: 40 cores and 192GB RAM
149 x standard compute nodes: 40 cores and 192GB RAM (=5960 cores); 100GB SSD
2 x high memory nodes: 40 cores and 768GB RAM
3 x V100 GPU nodes: 40 cores, 192GB RAM and 4 x NVIDIA V100
1.2PB of high speed storage: /nobackup
Exercise 2
We’re going to download some files to have a play with:
git clone https://github.com/ARCTraining/hpc1-files.git
SGE
SGE is a sophisticated scheduler:
Can define usage policies.
Control maximum limits.
Fair distribution of resources.
Produces detailed usage accounting information.
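For orientation, a minimal SGE job script looks something like this sketch (the runtime value is illustrative; see arcdocs for the full set of options such as memory requests):
#!/bin/bash
# run the job from the directory it was submitted from
#$ -cwd
# export the current environment (including loaded modules) to the job
#$ -V
# request one hour of runtime
#$ -l h_rt=1:00:00
# the actual work
echo "Running on $(hostname)"
Submit it with qsub script.sh and check its progress with qstat.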
Submit some serial R jobs
We’re going to submit a job from the 1_R directory
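A sketch of the workflow (run_R.sh is a hypothetical name; use the actual job script provided in the 1_R directory of the downloaded repository):
cd hpc1-files/1_R
# look at the job script before submitting it
cat run_R.sh
# submit it to the scheduler
qsub run_R.sh
# check on its progress
qstat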
Scheduling notes
Our clusters adopt a “fair share” policy
Jobs are prioritised according to each Faculty’s current and previous usage. The same applies when comparing users within the same Faculty.
The lower the usage, the higher the priority (and vice versa).
“Backfilling” is used to fit smaller jobs in between the top priority jobs. All jobs have a specified run time, so the scheduler will run lower priority jobs if they can start and finish before the highest priority jobs are scheduled to start. Giving a realistic runtime therefore makes short jobs eligible to be backfilled, potentially shortening their wait time.
Submit a serial Python job
Can you now do the same for a Python job in the 2_Python directory?
To run this Python code you do not need any modules loaded and can run it with:
python example1.py
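A minimal submission script for this might look like the following sketch (the file name submit_python.sh is hypothetical):
#!/bin/bash
# run from the current directory and keep the environment
#$ -cwd
#$ -V
# request 15 minutes of runtime
#$ -l h_rt=0:15:0
# run the example Python script
python example1.py
Save it in the 2_Python directory and submit with qsub submit_python.sh.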
Normal end of Part 1
Questions, recap
Drives and Directories
HPC users have access to two storage areas:
A HOME directory
Space on /nobackup
Home Directory
This is:
Private to you
Backed up
Limited to 10GB storage (ARC)
Shared between machines
/nobackup
Each HPC cluster has its own high speed storage service called /nobackup
You need to make your own directory (using mkdir)
Nothing is backed up
Files expire after 90 days of not being used
Need to set permissions to make files private on ARC3 but not on ARC4.
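For example (naming the directory after your username is a convention, not a requirement):
# create your own directory on the high speed storage
mkdir /nobackup/$USER
# on ARC3, restrict access so only you can read and write it
chmod 700 /nobackup/$USER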
Local storage
Each compute node has a small SSD.
1GB allocated per job by default via $TMPDIR
Typically much faster than other storage available
Can be increased if required:
Limits vary depending on node type, but are at least 100GB
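A sketch of using the per-job SSD inside a job script (file names are placeholders; the resource used to request more local disk is documented in arcdocs):
# copy input data onto the fast local SSD allocated to this job
cp input.dat $TMPDIR/
cd $TMPDIR
# ... run the work against the local copy ...
# copy results back before the job ends - $TMPDIR is deleted when the job finishes
cp results.dat $SGE_O_WORKDIR/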
More on local storage
Transferring files and data
scp or rsync command line utilities
wget (to download from a remote server)
git (version control)
smbclient to copy from local M:/ and N:/ drives on campus
Google Drive and OneDrive (via rclone)
graphical programs like Cyberduck or Filezilla (or indeed MobaXterm)
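For example, with scp or rsync run from your own machine (username, paths and the login address are placeholders; use the address given in arcdocs):
# copy a single file to your home directory on ARC4
scp results.csv username@arc4.leeds.ac.uk:~/
# mirror a whole directory to /nobackup, only transferring changed files
rsync -av my_project/ username@arc4.leeds.ac.uk:/nobackup/username/my_project/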
Module system
module avail - what software could I add
module list - show what is active
module add|load - enable software
module rm|unload - disable software
module help - show details of software
module swap|switch - swap modules
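A typical sequence, as a sketch (the module name and version are examples only and may differ on the cluster):
# see what software is available
module avail
# load a particular package
module load R/4.1.0
# confirm what is now active
module list
# swap one version for another
module switch R/4.1.0 R/4.0.5
# remove it again
module unload R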
Shared vs Distributed memory jobs
Submit some parallel jobs
Let’s look at and compare a few submissions (key scheduler directives are sketched after this list):
serial
threaded
distributed
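Indicative sketches only - program names are placeholders, and the exact parallel environment and resource names should be checked in arcdocs:
# serial: one core, just request a runtime
#$ -l h_rt=1:00:00
./my_serial_program

# threaded (shared memory): several cores on one node
#$ -l h_rt=1:00:00
#$ -pe smp 8
./my_threaded_program

# distributed (MPI): processes spread across nodes, launched with mpirun
#$ -l h_rt=1:00:00
#$ -l np=80
mpirun ./my_mpi_program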
GPUs
Three different types of GPU available
ARC3:
2xK80 (1 node)
4xP100 (6 nodes)
ARC4:
4 x V100 per node (3 nodes)
Some extras in private queues
Submitting a GPU job
ARC4:
ARC3:
or
Should not ask for memory or CPU cores
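Indicative resource requests (names as used in arcdocs when these slides were written; check there for the current form):
# ARC4: request one V100
#$ -l coproc_v100=1
# ARC3: request one P100
#$ -l coproc_p100=1
# or one K80
#$ -l coproc_k80=1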
More on GPGPU
Large memory nodes
ARC4:
#$ -l node_type=40core-768G
ARC3:
#$ -l node_type=24core-768G
Also allows jobs to run for up to 96hrs
Interactive jobs
General advice: don’t use them unless you have to.
qrsh -l h_rt=0:15:0 -pty y bash -i
More on interactive jobs
Task arrays
When you want to run lots of similar jobs
# Run 100 jobs from 1-100
#$ -t 1-100
# Don't run more than two at a time
#$ -tc 2
if [ "$SGE_TASK_ID" -eq "$SGE_TASK_FIRST" ]; then
    echo "I am the first job"
fi
echo "I am job $SGE_TASK_ID"
if [ "$SGE_TASK_ID" -eq "$SGE_TASK_LAST" ]; then
    echo "I am the last job"
fi
More on task arrays
Restartable jobs
For when 48hrs isn’t enough
At its simplest, just finish with a return code of 99 from the last line of your code, and the job will be rescheduled:
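A minimal sketch of the idea (my_program, checkpoint.dat and finished.flag are hypothetical; your code must save and resume from its own checkpoints):
#!/bin/bash
#$ -cwd -V
#$ -l h_rt=48:00:00
# run the application for as long as this job allows, resuming from the last checkpoint
./my_program --resume-from checkpoint.dat
# if the work is not yet complete, exit with code 99 so the scheduler requeues the job
if [ ! -f finished.flag ]; then
    exit 99
fi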
More on restartable jobs
Questions and time for recap
Something we’ve not covered that you’d like a look at
Anything we have covered but you’d like to go over more
A further look at arcdocs
Thank you
If you have any questions or would like to learn more about Research Computing, please do not hesitate to get in touch with us.
We are always here to assist you!