HPC 1

Introduction to High Performance Computing

Research Computing Team and Service

  • Here to support research(ers)
    • Provide training
    • Support users of Grid and Cloud Computing platforms
    • Provide consultancy
      • To develop project proposals
      • To help recruit people with specialist skills
      • To work directly on research projects
  • For details please see our Website
  • Contact us via the IT Service Desk

Purpose of HPC 1

  • Introducing Research Computing and the HPCs at Leeds
  • Hands on with Linux and ARC4
  • Running code
  • Batch and Interactive jobs
  • Data management
  • Joys of parallel jobs
  • Advanced job submissions

Training

ARC/HPC Training Portfolio

https://arc.leeds.ac.uk/training

Introductions and Motivations

  • Who are you and why are you here?
  • What problems are you encountering with your computational work now? 
  • Why / how do you think HPC will help?

Key Concepts

  • High Performance Computing (HPC)
  • High Throughput Computing (HTC)
  • “Supercomputing”

Applications

Terminology

  • Node: the physical machine/server. In current systems, a node would typically include one or more processors, as well as memory and other hardware.

  • Processor: the central processing unit (CPU) inside the node, which contains one or more cores.

  • Core: the basic computational unit of the CPU; this is the part that carries out the actual computations.

Leeds facilities

  • ARC3 brought into service in 2017
  • ARC4 brought into service in 2019

A Supercomputer isn’t…

Single computer vs grid of computers

Serial and parallel programs

  • Serial programs run on a single CPU core, solving one problem at a time.
  • Parallel programs run across multiple CPU cores, splitting the workload between them and solving the problem faster.

Serial Program

Parallel Program

Amdahl’s Law
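
If a fraction p of a program can be parallelised and it is run on N cores, Amdahl's Law gives the maximum possible speedup as 1 / ((1 - p) + p/N). Even with unlimited cores the speedup is capped at 1 / (1 - p), so the serial fraction of a program ultimately limits how much parallel hardware can help.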

Basic parallel machine

Differences from Desktop computing?

  • We don’t log on to compute nodes directly
    • submit jobs via a batch scheduling system
  • Not a GUI-based environment
  • System is shared with many other users
  • Resources more tightly monitored and controlled
    • Memory
    • CPU usage (‘cores’)
    • Time

Benefits of using HPC

  • Speed
  • Volume
  • Cost
  • Efficiency
  • Convenience

Parallel Paradigms

From a systems perspective:

  • Shared memory parallelism
  • Distributed memory parallelism

Unless you are writing your own code, the software developer takes care of this for you.

Basic HPC system layout

Exercise 1.1

What do the following Linux commands do? How might they be used on the HPC service?

  • ls
  • pwd
  • mkdir
  • cp
  • wget
  • rm
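
As a quick sketch of how these might be combined to set up a working area on the cluster (the directory, file name and URL are just examples):

pwd                                  # show which directory you are currently in
ls -l                                # list the files here, with details
mkdir test_dir                       # create a new directory
cp notes.txt test_dir/               # copy a file into it
wget https://example.com/data.csv    # download a file from a remote server
rm test_dir/notes.txt                # delete the copied file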

Exercise 1.2

On the HPC service, you have a ‘HOME’ directory with a 10GB quota and can create your own directory on the /nobackup filesystem.

Using the man pages (or Google…) investigate how you could use the following commands to manage your storage:

  • quota
  • df
  • du
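
For example (the directory path is just an illustration; output formats vary between systems):

quota -s                   # your home directory quota and usage, in human-readable units
df -h /nobackup            # total, used and free space on the /nobackup filesystem
du -sh ~/my_project        # total size of a single directory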

Exercise 1.3

Linux systems include a number of file compression routines.

Find out which ones are available on the cluster and use them to create a compressed archive of a directory and its contents.
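
For example, tar combined with gzip is available on most Linux systems (the directory name is an example):

tar -czf mydata.tar.gz mydata/    # create a gzip-compressed archive of the directory
tar -tzf mydata.tar.gz            # list the contents of the archive
tar -xzf mydata.tar.gz            # extract it again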

Exercise 1.4

How could you read a PDF file or an HTML document on the cluster?
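
One possible approach, assuming the usual text-mode tools are installed on the cluster (check with which, or look through module avail):

less page.html                     # view the raw HTML source
lynx page.html                     # render the HTML in the terminal
pdftotext report.pdf report.txt    # convert the PDF to plain text...
less report.txt                    # ...then page through it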

HPC at Leeds

ARC4 is the latest incarnation of central HPC at Leeds.

The central HPC service currently comprises two facilities, ARC3 and ARC4.

All Faculties are shareholders, so it is important that everyone who can benefit from these facilities does so.

ARC3

2 x login nodes: 24 cores and 128GB RAM

252 x standard compute nodes: 24 cores and 128GB RAM (=6048 cores); 100GB SSD

4 x high memory nodes: 24 cores and 768GB RAM

6 x P100 GPU nodes: 24 cores, 128GB RAM and 4 x NVIDIA P100

2 x K80 GPU nodes: 24 cores, 128GB RAM and 2 x NVIDIA K80

2 x Intel Xeon Phi (Knights Landing) vector processor nodes

836TB of high speed storage: /nobackup

ARC4

2 x login nodes: 40 cores and 192GB RAM

149 x standard compute nodes: 40 cores and 192GB RAM (=5960 cores); 100GB SSD

2 x high memory nodes: 40 cores and 768GB RAM

3 x V100 GPU nodes: 40 cores, 192GB RAM and 4 x NVIDIA V100

1.2PB of high speed storage: /nobackup

Exercise 2

We’re going to download some files to have a play with:

git clone https://github.com/ARCTraining/hpc1-files.git

SGE

SGE is a sophisticated scheduler:

  • Defines usage policies.
  • Controls maximum limits.
  • Distributes resources fairly.
  • Produces detailed usage accounting information.

Queue tools

  • qsub: submit a batch job
  • qrsh: run an interactive job
  • qdel: delete a job
  • qstat: Details on queued or running jobs
  • qacct: Details on previous completed jobs
  • qsched: A hint as to when your job might run
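
Typical usage looks like this (the script name and job ID are examples):

qsub myjob.sh        # submit the job script myjob.sh
qstat                # list your queued and running jobs
qstat -j 123456      # detailed information about one job
qdel 123456          # delete that job
qacct -j 123456      # accounting details once the job has finished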

Submit some serial R jobs

We’re going to submit a job from the 1_R directory
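
For reference, a minimal serial job script has roughly this shape (a sketch only: the module name and R script name are assumptions, and the real files for this exercise are in the 1_R directory):

#!/bin/bash
#$ -cwd                  # run the job from the directory it was submitted from
#$ -V                    # export the current environment to the job
#$ -l h_rt=00:15:00      # request 15 minutes of runtime
module load R            # the exact module name/version may differ
Rscript my_script.R      # run the R script

Submit it with qsub followed by the script name.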

Scheduling notes

Our clusters adopt a “fair share” policy

  • Jobs are prioritised according to each Faculty's current and previous usage; the same applies when comparing users within the same Faculty.
  • The lower the usage, the higher the priority (and vice versa).
  • “Backfilling” is used to fit smaller jobs in between the top priority jobs. Every job has a specified run time, so the scheduler will start a lower priority job if it will finish before the highest priority jobs are due to start. Giving a realistic runtime therefore makes short jobs eligible for backfilling, which can shorten their wait time.

Submit a serial Python job

Can you now do the same for a Python job in the 2_Python directory?

To run this Python code you do not need any modules loaded and can run it with:

python example1.py
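
Wrapped in a job script, that might look like the sketch below (the runtime request is illustrative):

#!/bin/bash
#$ -cwd                  # run from the submission directory
#$ -l h_rt=00:15:00      # request 15 minutes of runtime
python example1.py       # run the Python script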

Normal end of Part 1

Questions, recap

Drives and Directories

HPC users have access to two storage areas:

  • A HOME directory
  • Space on /nobackup

Home Directory

This is:

  • Private to you
  • Backed up
  • Limited to 10GB storage (ARC)
  • Shared between machines

/nobackup

  • Each HPC cluster has its own high speed storage service called /nobackup
  • You need to make your own directory (using mkdir)
  • Nothing is backed up
  • Files will expire after 90 days of not being used
  • You need to set permissions to make files private on ARC3, but not on ARC4

Local storage

Each compute node has a small SSD.

  • 1GB allocated per job by default, available via $TMPDIR

  • Typically much faster than other storage available

  • Can be increased if required:

    #$ -l disk=10G
  • Limits vary depending on node type, but are at least 100GB
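
A common pattern is to stage data onto the local SSD, do the I/O-heavy work there, and copy the results back before the job ends (a sketch; the file and directory names are examples):

cp /nobackup/$USER/input.dat $TMPDIR/    # stage input onto the node-local SSD
cd $TMPDIR
./my_program input.dat                   # work against the fast local disk
cp results.dat /nobackup/$USER/          # copy results back before the job finishes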

More on local storage

Transferring files and data

  • scp or rsync command line utility
  • wget (to download from a remote server)
  • git (version control)
  • smbclient to copy from local M:/ and N:/ drives on campus
  • Google Drive and OneDrive (via rclone)
  • graphical programs like Cyberduck or Filezilla (or indeed MobaXterm)
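
For example, run from your own machine (the username, hostname and paths are placeholders; check the ARC documentation for the correct login address):

scp results.tar.gz username@arc4.leeds.ac.uk:/nobackup/username/
rsync -av my_project/ username@arc4.leeds.ac.uk:/nobackup/username/my_project/

rsync only transfers files that have changed, which makes it a good choice for repeated copies of large directories.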

Module system

module

  • avail - what software could I add
  • list - show what is active
  • add|load - enable software
  • rm|unload - disable software
  • help - show details of software
  • swap|switch - swap modules
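
Typical usage (the module names shown are only examples; use module avail to see what is installed):

module avail             # what software could I add?
module list              # what is currently loaded?
module load R            # enable a package
module swap intel gnu    # swap one module for another
module unload R          # disable it again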

Shared vs Distributed memory jobs

Submit some parallel jobs

Let’s look at and compare a few submissions:

  • serial
  • threaded
  • distributed
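
As a rough sketch of how the resource requests differ (the parallel environment names and core counts are illustrative; check the ARC documentation for the exact names to use):

# Serial: one core, no parallel environment requested
#$ -l h_rt=1:00:00

# Threaded (shared memory): several cores on a single node
#$ -l h_rt=1:00:00
#$ -pe smp 8

# Distributed (e.g. MPI): cores spread across several nodes
#$ -l h_rt=1:00:00
#$ -pe ib 80
mpirun ./my_mpi_program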

GPUs

Three different types of GPU available

ARC3:

  • 2xK80 (1 node)
  • 4xP100 (6 nodes)

ARC4:

  • 4xV100 (3 nodes)

Some extras in private queues

Submitting a GPU job

ARC4:

#$ -l coproc_v100=1

ARC3:

#$ -l coproc_p100=1

or

#$ -l coproc_k80=1

You should not also ask for memory or CPU cores.
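
Putting that together, a minimal GPU job script for ARC4 might look like this sketch (the module name and executable are assumptions):

#!/bin/bash
#$ -cwd
#$ -l h_rt=01:00:00
#$ -l coproc_v100=1      # request one V100 GPU (do not also request cores or memory)
module load cuda         # exact module name/version may differ
./my_gpu_program         # placeholder for your GPU-enabled executable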

More on GPGPU

Large memory nodes

ARC4:

#$ -l node_type=40core-768G

ARC3:

#$ -l node_type=24core-768G

These nodes also allow jobs to run for up to 96 hours.

Interactive jobs

General advice: don’t use them unless you have to.

qrsh -l h_rt=0:15:0 -pty y bash -i

More on interactive jobs

Task arrays

When you want to run lots of similar jobs

# Run 100 jobs from 1-100
#$ -t 1-100
# Don't run more than two at a time
#$ -tc 2

if [ $SGE_TASK_ID == $SGE_TASK_FIRST ]; then
  echo I am the first job
fi

echo I am job $SGE_TASK_ID

if [ $SGE_TASK_ID == $SGE_TASK_LAST ]; then
  echo I am the last job
fi
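
A common use of the task ID is to select a different input file for each task (a sketch; the file naming is an assumption):

./my_program input_${SGE_TASK_ID}.dat > output_${SGE_TASK_ID}.txt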

More on task arrays

Restartable jobs

For when 48hrs isn’t enough

At its simplest, just finish with a return code of 99 from the last line of your code, and the job will be rescheduled:

exit 99
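
A minimal sketch of a restartable job script, assuming your program writes its own checkpoint files and creates a marker file once it has genuinely finished:

./my_long_simulation              # resumes from its own checkpoints each time it starts
if [ ! -f finished.flag ]; then
  exit 99                         # not finished yet: ask the scheduler to requeue the job
fi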

More on restartable jobs

Questions and time for recap

  • Something we’ve not covered that you’d like a look at
  • Anything we have covered but you’d like to go over more
  • A further look at arcdocs

Thank you

If you have any questions or would like to learn more about Research Computing, please do not hesitate to get in touch with us.


We are always here to assist you!