Best Practices and Troubleshooting

Optimizing Your HPC Workflows

Why Best Practices Matter

Without best practices:

  • Wasted compute resources
  • Unreproducible results
  • Difficult debugging
  • Poor collaboration
  • Security vulnerabilities

With best practices:

  • Efficient resource usage
  • Reproducible workflows
  • Easier troubleshooting
  • Better collaboration
  • Secure computing

Note

Best practices save time, resources, and frustration for everyone!

Pre-Submission Checklist

Before Submitting Jobs

  1. Test interactively first - Debug on login nodes or in interactive sessions
  2. Validate input data - Check that files exist and are accessible
  3. Estimate resources - Base requests on small test runs
  4. Choose the right partition - Standard, GPU, high-memory, etc.
  5. Set realistic time limits - Add a buffer, but don’t over-request
  6. Specify module versions - For reproducibility
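
A minimal job script header that puts several of these checks into practice might look like the following sketch (the partition name and resource values are placeholders; use whatever suits your cluster and workload):

#!/bin/bash
#SBATCH --job-name=test_run
#SBATCH --partition=standard      # placeholder partition; pick the one that fits the job
#SBATCH --time=02:00:00           # estimated from a small test run, plus a modest buffer
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Pin module versions for reproducibility
module load python/3.13.0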

Job Script Quality

Error Handling:

#!/bin/bash
set -e  # Exit on any error

# Check input files exist
if [ ! -f "input.txt" ]; then
    echo "Error: input.txt not found"
    exit 1
fi

Good Organization:

#SBATCH --job-name=meaningful_name
#SBATCH --output=logs/job_%j.out
#SBATCH --error=logs/job_%j.err

# Note: logs/ must already exist when the job starts, or Slurm cannot
# write job_%j.out/.err - run "mkdir -p logs" before submitting
mkdir -p logs

# Document what you're doing
echo "Starting analysis at $(date)"

Tip

Use meaningful names and organize your output files!

Right-Sizing Resources

# Check actual usage after jobs complete
sacct -j JOBID --format=JobID,MaxRSS,ReqMem,Elapsed,ReqCPUS

# Example output:
# JobID     MaxRSS   ReqMem   Elapsed  ReqCPUS
# 12345     2.5G     8G       00:15:23    4

Warning

Over-requesting wastes resources and delays scheduling!

In this example: requested 8G memory, only used 2.5G → reduce future requests
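
A revised request for the next run might look like this sketch (4G leaves headroom over the 2.5G actually used; adjust to your own measurements):

#SBATCH --mem=4G              # was 8G; MaxRSS was only 2.5G
#SBATCH --cpus-per-task=4     # unchanged; check AveCPU before reducing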

Storage Best Practices

Storage Type   Use For                     Don’t Use For
Home           Scripts, configs, results   Large datasets, temp files, job results
Scratch        Working data, temp files    Long-term storage
Flash          High I/O during jobs        Permanent storage
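
A common pattern is to do the heavy I/O in scratch and copy back only the results worth keeping. A minimal sketch, assuming scratch is available under /scratch/$USER (check the actual path and the analysis command for your own system and job):

# Stage data into scratch for fast, temporary I/O
WORKDIR=/scratch/$USER/$SLURM_JOB_ID     # assumed scratch location; adjust for your cluster
mkdir -p "$WORKDIR"
cp input.txt "$WORKDIR"
cd "$WORKDIR"

./run_analysis input.txt > results.txt   # hypothetical analysis step

# Copy the results back to the submission directory afterwards
cp results.txt "$SLURM_SUBMIT_DIR/"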

Performance Optimization: CPU

# Check if you're using all requested CPUs
sacct -j JOBID --format=JobID,AveCPU,ReqCPUS,Elapsed,State

Common Issues:

  • Not using all requested CPUs
  • I/O bottlenecks
  • Inefficient algorithms
  • Wrong parallelization strategy
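
A frequent cause of the first issue above is a thread count that does not match the CPUs requested. For an OpenMP-style threaded program, one common fix is to tie the two together (the program name here is hypothetical):

# Match the thread count to the CPUs Slurm allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # set when the job uses --cpus-per-task
./my_threaded_program input.txt               # hypothetical threaded program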

Performance Optimization: I/O

❌ Bad: Many small writes

for i in {1..1000}; do
    echo "data $i" >> output.txt
done

✅ Good: Batch writes

{
    for i in {1..1000}; do
        echo "data $i"
    done
} > output.txt

Common Issues & Solutions

Module Problems

Module Not Found:

# Check availability
module avail python
module spider python

# Load specific version
module load python/3.13.0

Python Package Issues:

# Use conda environments
module load miniforge/24.3.0
conda create -n myenv python=3.13
conda activate myenv
conda install numpy pandas

Common Issues & Solutions

Job Failures

Job Killed (Out of Memory):

  • Increase --mem or --mem-per-cpu
  • Profile memory usage
  • Optimize data structures

Job Timeout:

  • Increase --time limit
  • Optimize algorithms
  • Split into smaller jobs
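
In both cases the first step is usually to adjust the job directives; a job array is one way to split long work into smaller pieces. A rough sketch (the new values, array size, and script name are placeholders):

#SBATCH --mem=16G             # raised after an out-of-memory kill
#SBATCH --time=08:00:00       # raised after a timeout, with a modest buffer
#SBATCH --array=1-10          # split the work into 10 independent tasks

# Each array task processes its own chunk (hypothetical naming scheme)
./process_chunk input_${SLURM_ARRAY_TASK_ID}.txt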

Debugging Techniques

Check Log Files First

# Look at error output
cat job_12345.err

# Check standard output
cat job_12345.out

# Look for common error patterns
grep -i "error\|failed\|killed" job_12345.err

Use Slurm Diagnostics

# Why is my job pending?
squeue -u $USER --long

# Check job details
scontrol show job 12345

# View job accounting
sacct -j 12345 --format=JobID,State,ExitCode,MaxRSS,Elapsed

Interactive Debugging

Start Interactive Session

# Request interactive resources
srun --pty --partition=test --time=1:00:00 --mem=4G bash

# Or use salloc for longer sessions
salloc --partition=test --time=2:00:00 --mem=8G

Debug Step by Step

  1. Load same modules as your job script
  2. Navigate to same directory (cd $SLURM_SUBMIT_DIR)
  3. Run commands manually one by one
  4. Check intermediate outputs
  5. Fix issues before resubmitting batch job
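
A session following these steps might look roughly like this (module and script names are taken from earlier examples; substitute your own):

# Inside the interactive session
module load python/3.13.0          # same modules as the job script
cd $SLURM_SUBMIT_DIR               # same directory the job was submitted from
ls -lh input.txt                   # check input and intermediate files
python analysis.py                 # run the failing step by hand and watch the output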

Tip

Interactive debugging saves time versus repeated batch submissions!

Reproducibility Best Practices

Version Control Everything

# Track your scripts
git init
git add job_script.sh analysis.py
git commit -m "Initial analysis setup"

Document Your Environment

# Record module versions
module list > modules_used.txt

# Save package versions
pip freeze > requirements.txt
# or
conda env export > environment.yml

Organize Your Workflow

project/
├── scripts/          # Job scripts and analysis code
├── data/            # Input data (or symlinks)
├── results/         # Output data and figures  
├── logs/            # Job output and error files
└── docs/            # Documentation and notes
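
This layout can be created in one go; a quick sketch:

mkdir -p project/{scripts,data,results,logs,docs}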

Collaboration Best Practices

Sharing Code

# Make scripts readable and executable by others (755); use 750 to limit access to your group
chmod 755 my_script.sh

# Share via version control
git remote add origin https://github.com/user/project.git
git push origin main

Documentation

# Include usage information in scripts
#!/bin/bash
# Purpose: Analyze genomic data using BWA and samtools
# Usage: sbatch alignment_job.sh <input_fastq> <reference_genome>
# Author: Your Name
# Date: 2024-10-21

File Permissions

# Private files (default)
chmod 700 private_directory/

# Group readable
chmod 750 shared_analysis/

# World readable (public data)
chmod 755 public_results/

Security Best Practices

Protect Your Credentials

Never Do This:

  • Share your login credentials
  • Leave passwords in scripts
  • Store/process sensitive data on Aire
  • Leave interactive sessions running

Use SSH Keys

# Generate SSH key pair (on your local machine)
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

# Copy public key to HPC system
ssh-copy-id username@system

Data Security

# Set appropriate permissions
chmod 700 sensitive_data/
chmod 600 private_key.pem

# Check who can access your files
ls -la

Troubleshooting Workflow

flowchart TD
    A[Job Fails] --> B[Check Error Files]
    B --> C[Look for Obvious Errors]
    C --> D{Error Found?}
    D -->|Yes| E[Fix and Resubmit]
    D -->|No| F[Check Slurm Status]
    F --> G[Use sacct for Details]
    G --> H[Try Interactive Debug]
    H --> I[Search Documentation]
    I --> J[Ask for Help]
    J --> K[Include: Job ID, Error Messages...]

When seeking help, include this information:

  • Job ID and submission command
  • Error messages (exact text)
  • Expected vs actual behavior
  • Steps to reproduce the problem
  • System information (which cluster, when)

Where to Get Help:

Tip

Search documentation first - many common issues are already covered!

Best Practices Checklist

Quick Reference Checklist

Before Jobs:

  • Test interactively and validate input data
  • Estimate resources from small test runs
  • Choose the right partition and a realistic time limit

Job Quality:

  • Use meaningful job names and organized log directories
  • Add error handling and document what the script does
  • Pin module and package versions

Resource Management:

  • Check actual usage with sacct after jobs complete
  • Right-size memory, CPU, and time requests
  • Use the appropriate storage for each stage of the workflow

Troubleshooting:

  • Check error and output logs first
  • Use squeue, scontrol, and sacct to diagnose problems
  • Debug interactively before resubmitting batch jobs

Summary

Key Takeaways

  • Test and validate workflows before large-scale submission
  • Right-size resource requests to avoid waste and improve scheduling
  • Use appropriate storage for different types of data and I/O patterns
  • Monitor job performance with sacct and system tools
  • Debug systematically using log files and Slurm commands
  • Follow security practices to protect credentials and data
  • Document solutions for reproducibility and future troubleshooting

Remember: Good practices benefit everyone in the HPC community!