Session 6: Best Practices and Troubleshooting
Optimizing Your HPC Workflows
Session content
Session aims
By the end of this session, you will be able to:
- Apply best practices for resource management and job planning
- Optimize file I/O and data management workflows
- Troubleshoot common HPC problems and job failures
- Monitor and analyze job performance effectively
- Follow proper HPC etiquette and community guidelines
- Develop efficient and reproducible computational workflows
View Interactive Slides: Best Practices and Troubleshooting
In this final session, we’ll cover best practices for using HPC systems effectively and how to troubleshoot common problems you might encounter.
HPC Best Practices
Job Planning and Resource Management
Right-Size Your Resource Requests
- Start small: Test with minimal resources
- Profile first: Run small tests to understand resource needs
- Scale gradually: Increase resources based on actual usage
- Monitor usage: Use sacct to check what you actually used
# Check actual resource usage after job completes
sacct -j JOBID --format=JobID,Elapsed,MaxRSS,AveCPU,State
Efficient File Management
- Use appropriate storage: Home for code, scratch for data processing (see the sketch after this list)
- Clean up regularly: Remove temporary files from scratch
- Organize your work: Use clear directory structures
- Backup important results: Don’t rely on scratch storage
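As an illustration, here is a minimal data-staging sketch assuming a $SCRATCH variable and hypothetical file and directory names; adapt the paths to your own project.
#!/bin/bash
#SBATCH --job-name=stage_data
# Create a per-job working directory on scratch (paths are placeholders)
WORKDIR=$SCRATCH/$SLURM_JOB_ID
mkdir -p $WORKDIR
# Stage input data from home to scratch
cp $HOME/project/input.txt $WORKDIR/
cd $WORKDIR
# Run the analysis on scratch
python analysis.py
# Copy results back to home, then clean up scratch
cp results.txt $HOME/project/results/
rm -rf $WORKDIR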
Writing Robust Job Scripts
Error Handling
#!/bin/bash
#SBATCH --job-name=robust_job
#SBATCH --output=%j.out
#SBATCH --error=%j.err
# Exit immediately if any command fails
set -e
# Print commands as they execute (for debugging)
set -x
# Check if required files exist
if [ ! -f "input.txt" ]; then
echo "ERROR: input.txt not found"
exit 1
fi
# Load modules with error checking
module load python/3.13.0 || {
echo "ERROR: Failed to load Python module"
exit 1
}
# Run your program
python analysis.py
# Check if output was created
if [ ! -f "results.txt" ]; then
echo "ERROR: results.txt was not created"
exit 1
fi
echo "Job completed successfully"Environment Variables
Use Slurm environment variables in your scripts:
#!/bin/bash
#SBATCH --job-name=env_demo
echo "Job Information:"
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Node: $SLURMD_NODENAME"
echo "CPUs per task: $SLURM_CPUS_PER_TASK"
echo "Memory per node: $SLURM_MEM_PER_NODE"
echo "Submit directory: $SLURM_SUBMIT_DIR"
# Use variables for parallel processing
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
Troubleshooting Common Issues
Job Submission Problems
Job Won’t Submit
Error: sbatch: error: Invalid partition name specified
Solution: Check available partitions
sinfo
sinfo --summarize
Error: sbatch: error: Memory specification can not be satisfied
Solution: Reduce memory request or check node limits
sinfo -o "%P %l %D %c %m %N"  # Show partition limits
Job Execution Problems
Job Stays in Pending State
Check why your job isn’t starting:
squeue -u $USER --long
squeue -j JOBID --start  # Estimate start time
Common reasons and solutions:
| Reason | Meaning | Solution |
|---|---|---|
| Resources | Not enough free resources | Wait, or reduce requests |
| Priority | Other jobs have higher priority | Wait for fair-share to adjust |
| QOSMaxJobsPerUserLimit | Too many jobs running | Wait for some jobs to finish |
| PartitionNodeLimit | Requesting too many nodes | Reduce node count |
Job Killed or Failed
Check job exit status and resource usage:
sacct -j JOBID --format=JobID,State,ExitCode,MaxRSS,Elapsed
Exit Codes:
- 0 = Success
- 1 = General error
- 125 = Out of memory
- 130 = Job cancelled by user
Memory Issues
Out of Memory (OOM) Errors
Symptoms: Job killed with OutOfMemory or exit code 125
Diagnosis:
# Check memory usage
sacct -j JOBID --format=JobID,MaxRSS,ReqMem,State
# Check job log files for OOM messages
grep -i "memory\|oom\|killed" job_output.err
Solutions:
- Increase memory request: #SBATCH --mem=32G
- Optimize your code to use less memory
- Process data in smaller chunks (see the sketch after this list)
- Use memory profiling tools
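As an example of chunked processing, this sketch splits a large input file and works through one piece at a time, so only a fraction of the data needs to be in memory at once; split is a standard tool, while process_chunk.py is a hypothetical script standing in for your own code.
# Split a large input file into 100,000-line chunks
split -l 100000 big_input.txt chunk_
# Process each chunk separately, appending to a combined result
for chunk in chunk_*; do
    python process_chunk.py "$chunk" >> results.txt
done
# Remove the temporary chunks
rm chunk_*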
Memory Profiling
Monitor memory usage during development:
#!/bin/bash
# Add to your job script
/usr/bin/time -v python my_script.py
# Or use a memory profiler
module load valgrind
valgrind --tool=massif python my_script.py
Performance Issues
Job Running Slowly
Check CPU utilization:
sacct -j JOBID --format=JobID,AveCPU,ReqCPUS,Elapsed,State
Common causes (a quick efficiency check is sketched after this list):
- Not using all requested CPUs
- I/O bottlenecks
- Inefficient algorithms
- Wrong parallelization strategy
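One way to spot under-used CPUs is to compare AveCPU against the elapsed time multiplied by the requested CPU count. A rough sketch follows; seff is a common Slurm add-on that summarizes efficiency, but it may not be installed on every cluster.
# Compare CPU time actually used against what was requested
sacct -j JOBID --format=JobID,AveCPU,Elapsed,ReqCPUS,State
# If available on your cluster, seff summarizes CPU and memory efficiency
seff JOBID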
I/O Performance
Use appropriate storage:
- Small files: Home directory
- Large temporary files: $SCRATCH
- High I/O during job: $TMP_SHARED (flash storage)
Optimize I/O patterns:
# Bad: Many small writes
for i in {1..1000}; do
echo "data $i" >> output.txt
done
# Good: Batch writes
{
for i in {1..1000}; do
echo "data $i"
done
} > output.txt
Module and Software Issues
Module Not Found
# Check module availability
module avail python
module spider python # More detailed search
# Load required modules
module load python/3.13.0
Python Package Issues
Use conda environments for better package management:
# In your job script
module load miniforge/24.3.0
source activate myenv
python analysis.py
Version Conflicts
Always specify exact versions:
# Bad - version may change
module load gcc
module load python
# Good - reproducible
module load gcc/14.2.0
module load python/3.13.0
Performance Optimization
Parallel Programming
MPI (Distributed Memory)
#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --nodes=2
module load openmpi/4.1.4
mpirun ./my_mpi_program
Benchmarking and Scaling
Test how your job scales with resources:
# Test different core counts
for cores in 1 2 4 8 16; do
sbatch --cpus-per-task=$cores benchmark.sh
done
Create scaling plots to find optimal resource usage.
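One hedged way to gather the timings for such a plot is to pull elapsed times out of sacct once the benchmark jobs have finished; the job IDs below are placeholders you would record from the sbatch output.
# Collect elapsed time per core count into a CSV for plotting
# (replace the job IDs with the ones reported by sbatch)
echo "cores,elapsed" > scaling.csv
for jobid in 12345 12346 12347 12348 12349; do
    elapsed=$(sacct -j $jobid --format=Elapsed --noheader -X | tr -d ' ')
    cores=$(sacct -j $jobid --format=ReqCPUS --noheader -X | tr -d ' ')
    echo "$cores,$elapsed" >> scaling.csv
done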
Monitoring and Debugging
Real-time Job Monitoring
While job is running:
# Check job status
squeue -j JOBID
# SSH to compute node (if allowed)
ssh $(squeue -h -j JOBID -o %N)
top -u $USER
Post-job Analysis
After job completion:
# Detailed accounting
sacct -j JOBID --format=ALL
# Custom format for specific metrics
sacct -j JOBID --format=JobID,JobName,Elapsed,State,MaxRSS,AveCPU,ReqCPUS,ReqMem
Log File Analysis
Organize your log files:
#!/bin/bash
#SBATCH --output=logs/job_%j_%x.out
#SBATCH --error=logs/job_%j_%x.err
# Create logs directory if it doesn't exist
mkdir -p logs
Collaboration and Reproducibility
Getting Help
Before Asking for Help
- Check the logs: Look at .out and .err files
- Check resource usage: Use sacct to see what happened
- Test interactively: Try running parts of your script interactively (see the sketch after this list)
- Search documentation: Check Aire documentation and forums
- Google the error: Many HPC issues are common
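A minimal sketch of testing a failing step in an interactive session with srun; the time, memory, and CPU values here are assumptions, so check your cluster's documentation for sensible limits and any required partition options.
# Request a short interactive session on a compute node
srun --time=00:30:00 --mem=4G --cpus-per-task=1 --pty bash
# Then run the failing step by hand, for example:
module load python/3.13.0
python analysis.py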
Where to Get Help
- Research Computing Team: Submit a query
- Aire Documentation: https://arcdocs.leeds.ac.uk/aire/
- Training Courses: https://arc.leeds.ac.uk/courses/
- Community Forums: Slurm user groups, Stack Overflow
When Reporting Issues
Include this information:
- Job ID and submission command
- Error messages (exact text)
- Expected behavior vs actual behavior
- Steps to reproduce the problem
- System information (which cluster, when it happened)
Security Best Practices
Protect Your Credentials
- Never share your login credentials
- Use SSH keys instead of passwords when possible (see the sketch after this list)
- Log out when finished
- Don’t leave interactive sessions running
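If your cluster supports key-based login, here is a minimal sketch of setting up an SSH key; the hostname is a placeholder for your cluster's login node.
# Generate a key pair on your own machine (accept the defaults or set a passphrase)
ssh-keygen -t ed25519
# Copy the public key to the cluster login node (hostname is a placeholder)
ssh-copy-id username@login.example.ac.uk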
Data Security
- Don’t store sensitive data on shared systems unnecessarily
- Use appropriate file permissions: chmod 700 private_dir/ (see the sketch after this list)
- Be aware of who can access your data
- Follow institutional data management policies
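A short sketch of tightening and then checking permissions on a private directory; the directory and file names are placeholders.
# Restrict a directory so only you can read, write, or enter it
chmod 700 private_dir/
# Restrict individual files to owner read/write only
chmod 600 private_dir/*.csv
# Verify the resulting permissions
ls -ld private_dir/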
Summary Checklist
Before submitting jobs:
- Test with minimal resources and check that input files exist
- Load required modules with exact versions
Job script quality:
- Use error handling (set -e) and check that expected outputs were created
- Send output and error logs to an organized directory
Resource management:
- Right-size CPU, memory, and time requests and review them with sacct
- Use home for code, scratch for data processing, and clean up temporary files
Troubleshooting:
- Read the .out and .err files and check sacct before asking for help
- Report issues with the job ID, exact error text, and steps to reproduce
Summary
- Test and validate workflows before large-scale submission
- Right-size resource requests to avoid waste and improve scheduling
- Use appropriate storage for different types of data and I/O patterns
- Monitor job performance with sacct and system tools
- Debug systematically using log files and Slurm commands
- Follow security practices to protect credentials and data
- Document solutions for reproducibility and future troubleshooting
Next Steps
Congratulations! You’ve learned the fundamentals of using HPC systems. Let’s move on to Session 7: Wrap Up for final thoughts and next steps.