CCF (Core Computational Facility) @ UQ run by ITS / SMP
SMP getafix cluster
The SMP getafix cluster is accessed through
getafix.smp.uq.edu.au
(which resolves to two login nodes to share the load).
The cluster has new CPU nodes with 512GB of RAM, purchased in 2017/2018 by SMP, along with former CPU-based systems from dogmatix. It also has 9 Nvidia Tesla P100s (see the Nvidia GPU howto for more info); the more than 80 GPUs of asterix and ghost will remain accessible via the dogmatix system.
- Dell FX2 blade chassis with FC430 compute nodes and FD332 disk nodes. (needs updating)
- Master node has been virtualised with VMware and is running the SLURM queuing system. An additional two VMs running as login nodes have been installed to allow better utilisation of the existing resources.
- In total 1372 CPU cores on 84 compute nodes with 13376GB memory. (needs updating)
- Total of 10 x Intel Phi MICs
- Located in Prentice DC1
- Running Rocks 7.X
The cluster is divided into two SLURM partitions:
- smp - the default partition, including the new nodes and the former dogmatix cluster nodes.
- gpu - contains the new nodes with Nvidia Tesla P100s.
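As a sketch, a batch script targeting the gpu partition might start like this. The --gres line uses SLURM's generic GPU request syntax and assumes the local resource is simply named gpu (the job name and resource values are placeholders):

```shell
#!/bin/bash
#SBATCH --partition=gpu      # the partition with the Tesla P100 nodes
#SBATCH --gres=gpu:1         # request one GPU (generic SLURM syntax; local naming may differ)
#SBATCH --time=0-01:00:00    # max run time: 1 hour
#SBATCH --mem=16G            # memory for the whole job

# Placeholder payload: replace with your actual GPU program (e.g. a CUDA binary).
gpus_requested=1
echo "requested $gpus_requested GPU"
```

Submit it with sbatch, e.g. `sbatch gpujob.sh`.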
- /home/uqname/ - your home directory on the NAS. The default quota for the home directory is 50GB. Do not run calculations from this directory (always use /data/uqname/).
- /data/uqname/ - your data directory on the NAS. The default quota for the data directory is 500GB.
System status
You can see the CPU loads on the cluster via the webpage
http://faculty-cluster.hpc.net.uq.edu.au/ganglia/
This page can only be accessed from a computer within the UQ domain (or after starting a UQ VPN session).
Software
(needs updating)
System documentation
The getafix system documentation is found below, but see also the various howto pages, starting with the SLURM queue.
Connecting to getafix
The login host of the cluster is getafix.smp.uq.edu.au. You should be able to log in to this via ssh with your UQ login name and password, once you have contacted ITS to get an account. If you're off campus, you can either (a) ssh in on port 2022, or (b) run a UQ VPN.
NOTE: The most effective way to use the cluster is to log in to getafix.smp.uq.edu.au, submit your job to the batch queue and log out again. If you need a persistent login (Remote Desktop, screen or tmux) you'll need to connect explicitly to one of the two login nodes to ensure you can return to the same session later: either getafix1.smp.uq.edu.au or getafix2.smp.uq.edu.au.
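For example (with uqname standing in for your own UQ login name):

```shell
# On campus, or with a UQ VPN session running:
ssh uqname@getafix.smp.uq.edu.au

# Off campus without a VPN: connect on port 2022 instead
ssh -p 2022 uqname@getafix.smp.uq.edu.au

# For a session you want to return to (screen/tmux), pin a specific login node:
ssh uqname@getafix1.smp.uq.edu.au
```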
Slave node types
There are XX compute/slave nodes in total in the getafix cluster.
If you want your job to run on a high memory node (max 512GB, less what the system itself reserves), specify the amount of memory you need with --mem=memory. If you want your job to run on a specific type of hardware, specify the hardware type as a constraint.
Recommendations
All jobs should be submitted through the SLURM queueing system to ensure optimal use of resources.
Always specify a maximum run time (--time=[days-]hh:mm:ss). The default run time is unlimited to stop jobs from being killed prematurely, but the scheduler will favour jobs with a shorter maximum run time.
Always specify the memory you need (--mem=memory). The default memory allocation is small to ensure optimal use of resources, and your job will stall or fail if it requires more memory than requested.
If you specify an appropriate memory limit, your job will likely run sooner and it keeps the large memory nodes free for jobs that really need large memory.
If you want your job to run on a specific type of hardware, specify the hardware to run on with --constraint=hardware.
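Putting the recommendations above together, a minimal batch script might look like the following. The job name, run time, memory value and the 'FC430' feature name are placeholders to adjust for your own job (in particular, 'FC430' is a hypothetical constraint, not a verified feature tag on getafix):

```shell
#!/bin/bash
#SBATCH --job-name=myjob        # placeholder job name
#SBATCH --partition=smp         # the default CPU partition
#SBATCH --ntasks=1              # a single-task (serial) job
#SBATCH --time=0-04:00:00       # max run time: 4 hours ([days-]hh:mm:ss)
#SBATCH --mem=8G                # memory for the whole job
#SBATCH --constraint=FC430      # optional: hardware type ('FC430' is a hypothetical example)

# Placeholder payload: replace with your actual program.
result=$((6 * 7))
echo "result: $result"
```

Submit with `sbatch myjob.sh` and check its state in the queue with `squeue -u $USER`.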
Storage
It is important to remember that the files stored on the getafix cluster are NOT BACKED UP! This means that you need to keep a backup copy of any important data that you have on the cluster.
You have space to store your files in your home and data directories, as listed above. Note that to get access to more space under /data/ you need to specifically request it from ITS and provide a brief justification.
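Since nothing on the cluster is backed up, it is worth pulling important results back to your own machine regularly. One way to do this is rsync over ssh, run from your own computer (uqname and the results/ paths below are placeholders):

```shell
# Mirror a results directory from getafix to a local backup folder:
rsync -av uqname@getafix.smp.uq.edu.au:/data/uqname/results/ ./getafix-backup/results/

# Off campus without a VPN, go through the ssh port 2022 instead:
rsync -av -e 'ssh -p 2022' uqname@getafix.smp.uq.edu.au:/data/uqname/results/ ./getafix-backup/results/
```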
This page last updated 16th July 2019.