Getafix

SMP getafix cluster

The current SMP computing cluster is getafix. Several older clusters have been either retired or integrated into getafix; they are described on the History page.

The SMP getafix cluster is accessed through getafix.smp.uq.edu.au (which resolves to two login nodes to share the load). The cluster has new CPU nodes with 512GB of RAM, purchased in 2017/2018 by SMP, along with former CPU-based systems from dogmatix. It also has 9 Nvidia Tesla P100s (see Howto.GPU for more info; cf. the more than 80 GPUs of asterix and ghost).

  • Dell FX2 blade chassis with FC430 compute nodes and FD332 disk nodes. (needs updating)
  • Master node has been virtualised with VMware and runs Slurm. An additional two VMs have been installed as login nodes to allow better utilisation of the existing resources.
  • In total 1372 CPU cores on 84 compute nodes with 13376GB memory. (needs updating)
  • Total of 10 x Intel Phi MICs
  • Located in Prentice DC1

System status

You can see the CPU loads on the cluster at
http://faculty-cluster.hpc.net.uq.edu.au/ganglia/
This page can only be accessed from a computer within the UQ domain (or after starting a UQ VPN session).

Software

 (needs updating)

System documentation

getafix system documentation is found below, but see also the various howto pages, starting with the SLURM queue.

Connecting to getafix

The login host of the cluster is getafix.smp.uq.edu.au. Once you have contacted ITS to get an account, you should be able to log in to it via ssh with your UQ login name and password. If you're off campus, you can either (a) ssh in on port 2022, or (b) connect through a UQ VPN first.
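The two connection routes can be sketched as a small wrapper. This is only an illustration: the function name getafix_ssh, the DRY_RUN switch and the login name uqname are placeholders invented here, and the on-campus case simply relies on the default port of your ssh client.

```shell
# Sketch: open an ssh session to getafix, choosing the port by location.
# "getafix_ssh" and DRY_RUN are hypothetical names used for illustration.
getafix_ssh() {
    user="$1"
    where="${2:-oncampus}"
    if [ "$where" = "offcampus" ]; then
        # Off campus without the VPN: the page says to use port 2022.
        set -- ssh -p 2022 "${user}@getafix.smp.uq.edu.au"
    else
        # On campus or over the UQ VPN: default ssh port.
        set -- ssh "${user}@getafix.smp.uq.edu.au"
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"      # only print the command
    else
        "$@"           # actually connect
    fi
}
```

For example, `getafix_ssh uqname offcampus` would connect on port 2022; setting DRY_RUN=1 beforehand prints the command instead of running it.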

To connect to getafix from Windows, there is a guide to installing an ssh client here: https://www.linode.com/docs/guides/connect-to-server-over-ssh-on-windows/

NOTE: The most effective way to use the cluster is to log in to getafix.smp.uq.edu.au, submit your job to the batch queue and log out again. If you need a persistent login (Remote Desktop, screen or tmux), you'll need to connect explicitly to one of the two login nodes to ensure you can return to the same session later: either getafix1.smp.uq.edu.au or getafix2.smp.uq.edu.au.
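As a hedged sketch of the persistent-login case, the snippet below builds a command that reconnects to a named tmux session on one fixed login node. It assumes tmux is installed on the login nodes and recent enough to support `new-session -A`; the login name uqname and the session name work are placeholders.

```shell
# Sketch: always use the same login node so the session can be found again.
LOGIN_NODE="getafix1.smp.uq.edu.au"   # or getafix2.smp.uq.edu.au
SESSION="work"                        # hypothetical session name

# ssh -t allocates a terminal; "tmux new-session -A -s NAME" attaches to the
# session NAME if it already exists and creates it otherwise.
RECONNECT_CMD="ssh -t uqname@${LOGIN_NODE} tmux new-session -A -s ${SESSION}"

# Print the command; run it yourself from your own machine.
echo "${RECONNECT_CMD}"
```

Running the printed command a second time, after detaching or losing the connection, should return you to the same session, provided you target the same login node.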

Slave node types

There are XX compute/slave nodes in total in the getafix cluster.

smp
the default partition, including the new nodes and the former dogmatix cluster nodes.
gpu
contains the new nodes with Nvidia Tesla P100s.

If you want your job to run on a high memory node (at most 512GB, less what the system itself uses), specify the amount of memory you need with: --mem=memory. If you want your job to run on a specific type of hardware, specify the hardware type as a constraint.
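The hardware types that can be used as constraints are not listed on this page. As a sketch, assuming the standard Slurm `sinfo` client is available on the login nodes, the following prints each partition with its nodes, memory and feature tags. The function name show_node_features is made up for illustration.

```shell
# Sketch: list partitions, node names, memory per node and feature tags.
# Assumes the standard Slurm "sinfo" command is on the PATH (e.g. on a login node).
show_node_features() {
    if command -v sinfo >/dev/null 2>&1; then
        # %P = partition, %N = node list, %m = memory per node (MB), %f = features
        sinfo -o "%P %N %m %f"
    else
        echo "sinfo not found: run this on a getafix login node"
    fi
}

show_node_features
```

In Slurm, the entries in the feature column are the values that a constraint is matched against, so this is one way to find out which hardware types exist before requesting one.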

Recommendations

All jobs should be submitted through the SLURM queueing system to ensure optimal use of resources.

Always specify a maximum run time (--time=[days-]hh:mm:ss). The default run time is unlimited, so that jobs are not killed prematurely, but the scheduler will favour jobs with a shorter maximum run time.

Always specify the memory you need (--mem=memory). The default memory allocation is small to ensure optimal use of resources, and your job will stall or fail if it requires more memory than requested. If you specify an appropriate memory limit, your job will likely run sooner, and the large memory nodes stay free for jobs that really need large memory.

If you want your job to run on a specific type of hardware, specify the hardware to run on with: --constraint=hardware.
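The recommendations above can be combined in one batch script. The following is a minimal sketch rather than a site-approved template: the job name, the two-hour limit, the 8G memory request and the file name example_job.sh are arbitrary example values, and the constraint line is left disabled because the valid hardware names are not given here.

```shell
# Sketch: write a small Slurm batch script to example_job.sh.
cat > example_job.sh <<'EOF'
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=example
# Default partition named on this page.
#SBATCH --partition=smp
# Maximum run time in [days-]hh:mm:ss form (here: 2 hours).
#SBATCH --time=0-02:00:00
# Memory the job needs (example value).
#SBATCH --mem=8G
# Disabled placeholder: replace FEATURE with a real hardware type and
# remove one leading '#' to enable it.
##SBATCH --constraint=FEATURE

echo "Job ${SLURM_JOB_ID:-unknown} running on $(hostname)"
EOF

echo "Wrote example_job.sh; submit it with: sbatch example_job.sh"
```

After submitting with `sbatch example_job.sh`, the job's state can be checked with `squeue -u uqname` (uqname being your own login name).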

Storage

It is important to remember that the files stored on the getafix cluster are NOT BACKED UP! This means that you need to keep a backup copy of any important data that you have on the cluster.
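One common way to keep such a copy is to pull files to your own machine with rsync over ssh. The sketch below assumes rsync is installed on both ends; the function name backup_from_getafix, the DRY_RUN switch, the login name uqname and the example paths are all placeholders.

```shell
# Sketch: copy a directory from getafix to the local machine with rsync.
# Run this on your own computer, not on the cluster.
backup_from_getafix() {
    user="$1"
    src="$2"     # directory on getafix, e.g. /data/uqname/project
    dest="$3"    # local destination directory
    # -a preserves permissions and timestamps, -v lists transferred files;
    # the trailing slash on the source copies the directory's contents.
    set -- rsync -av "${user}@getafix.smp.uq.edu.au:${src}/" "${dest}/"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"      # only print the command
    else
        "$@"           # run the transfer
    fi
}
```

For example, `backup_from_getafix uqname /data/uqname/project ./backup` would mirror that project directory into ./backup; re-running it later transfers only files that have changed.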

You have space to store your files in:

  • /home/uqname/ - your home directory on the NAS. The default quota for the home directory is 50GB. Do not run calculations from this directory (always use /data/uqname/).
  • /data/uqname/ - your data directory on the NAS. The default quota for the data directory is 500GB.

Note that to get access to more space under /data/ you need to specifically request it from ITS and provide a brief justification.
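Before requesting more space, it can help to see how close a directory is to its default quota. The following is a rough sketch using `du`; the function name usage_report is invented here, a gigabyte is taken as 1024 x 1024 KB, and the figure may differ from the accounting the NAS itself uses.

```shell
# Sketch: compare a directory's size with a quota given in GB.
# Note: du can be slow on large directory trees.
usage_report() {
    dir="$1"
    quota_gb="$2"
    # du -sk prints "<size in KB><TAB><path>"; keep only the size.
    used_kb=$(du -sk "$dir" | cut -f1)
    quota_kb=$((quota_gb * 1024 * 1024))
    pct=$((used_kb * 100 / quota_kb))
    echo "${dir}: ${used_kb} KB used, ${pct}% of ${quota_gb}GB quota"
}
```

On the cluster this could be called as `usage_report /home/uqname 50` or `usage_report /data/uqname 500`, matching the default quotas listed above.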

Page last modified on March 10, 2022, at 12:51 AM