Chippie Cluster

Overview

The Chippie cluster is a compute facility for the use of CPI members, affiliates, and researchers sponsored by them. It features password-less (SSH key-based) accounts for security, a reservation system for exclusive use of compute resources, and multiple GPU systems for fast computation.

Hardware

The current hardware of the Chippie cluster is:

chippie.cs.uwaterloo.ca
  • CPU: 2x AMD EPYC 7302 16-Core
  • RAM: 512GB
  • Storage: 7TB all-flash cluster, NFS-shared for users
  • Network: gigabit WAN
  • GPU: 6x NVIDIA A100, 40GB VRAM per device
  • Other: Reservable LXC containers, each with exclusive use of one GPU: chippie01-chippie06. Note: chippie07 and chippie08 do not have GPUs at this time.

chippie-100.cs.uwaterloo.ca
  • CPU: 2x AMD EPYC 7272 12-Core
  • RAM: 512GB
  • Storage: 5TB all-flash cluster, NFS-shared scratch/static data
  • Network: gigabit WAN
  • GPU: 8x NVIDIA A40, 48GB VRAM per device
  • Other: Reservable LXC containers, each with exclusive use of one GPU: chippie09-chippie16

Using the cluster

Account Creation

Users of the Chippie cluster need an account sponsored by an authorized CPI member. To obtain an account, please send the following information to contact-cpi@uwaterloo.ca and CC your advisor.

Data to provide, with comments:

  • Full name of user: First, last, full name as appropriate.

  • Email address: Users will be added to the chippie-users mail list for announcements on outages, updates and other cluster news.

  • userid: Linux-appropriate userid (lowercase, no spaces). UW users use their WatIAm userid.

  • Supervisor/Sponsor: CPI member recommending your chippie account.

  • Purpose for account: PhD, URA, MMath, project name as appropriate.

  • Public ssh key: The public half of an ssh keypair. ed25519 preferred. Create one by following: https://uwaterloo.atlassian.net/wiki/spaces/ISTKB/pages/1548878163/SSH+Key+Generation

Once an account has been approved and set up, the user will receive an email from the cluster administrator.

Accessing the cluster and its resources

Using the private key that matches the public key provided at account creation, users can ssh to chippie.cs.uwaterloo.ca. This system is the login node of the cluster. Here you have access to your cluster-global home directory, which is shared across all cluster systems you access. On the login node you can upload and download data to and from the cluster and do light processing of results and light computation. Note that resource usage on the login node is monitored and limited.
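
For convenience, the login details can be stored in an SSH client configuration entry. The following is a minimal sketch rather than an official cluster recipe: the alias chippie, the placeholder your_userid and the key path ~/.ssh/id_ed25519 are assumptions and should be replaced with your own values.

```shell
# Append a host entry for the Chippie login node to the SSH client config.
# Assumptions: "your_userid" and the key path are placeholders; edit them first.
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
cat >> "$HOME/.ssh/config" <<'EOF'
Host chippie
    HostName chippie.cs.uwaterloo.ca
    User your_userid
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes
EOF
# The config file should not be writable by other users.
chmod 600 "$HOME/.ssh/config"
```

With this entry in place, "ssh chippie" opens a session on the login node, and "scp results.tar.gz chippie:" copies a file into the cluster-global home directory. Run the block only once, since it appends to the file each time.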

To make full use of the cluster, users must reserve compute nodes at: https://chippie-reserve.cs.uwaterloo.ca/
Access to that site requires a password/token. Users can generate a token at any time on the login node with the command:
$ get-token
Tokens are valid for three days.

At the reservation site, users can reserve free compute nodes for any time in the future. As a rule, reserve only the time a project requires, keep each reservation to one day or less, and cover longer periods by repeating shorter reservations. This gives flexibility to interrupt and rejoin reservations in case of outages on the cluster.

When a reservation becomes active, the user can ssh from the login node directly to the reserved compute node, e.g.
$ ssh chippie05
Only users included in a reservation can access its reserved resources. Access is revoked when the reservation ends.

Details of the usage of the reservation site can be found here: https://cs.uwaterloo.ca/twiki/view/CF/Paper

Important guidance for making reservations

  • Reservations should be no more than a day long.

  • Longer term reservations should consist of repeated day-long reservations. This gives users flexibility to cancel remaining time in long reservations (as they should) if they finish their work early.

  • Reserving resources for convenience, without a particular task in mind, is discouraged.

  • Cluster administrators reserve the right to audit usage and terminate idle reservations.

Conda environment on compute systems

Compute systems have Anaconda installed for managing user environments, including Python, e.g.

$ conda create --name test-env python=3.7
$ conda activate test-env
$ conda install tensorflow-gpu=2.1

The OSU Open-CE repo https://osuosl.org/services/powerdev/opence/ is useful for installing common deep learning tools:
$ conda config --prepend channels https://ftp.osuosl.org/pub/open-ce/current/

Install pytorch LTS:

The A100 GPUs in the cluster require a PyTorch version later than 1.7.1. To install the LTS release (1.8.1 at time of writing), create and activate an environment as shown above, then populate it:

$ conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch-lts -c nvidia

Additional options for pytorch installation are available at the Start Locally page:
https://pytorch.org/get-started/locally/
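
After installing, it can be worth confirming that the reserved container sees its GPU and that PyTorch can use it. The sketch below is one way to do that. The function name check_gpu is hypothetical, and the sketch assumes that nvidia-smi and python are on the PATH of the compute node, which is not stated above.

```shell
# check_gpu: report the GPU visible to this host and whether PyTorch can use it.
# Hypothetical helper; intended to be run on a reserved compute node with the
# conda environment activated.
check_gpu() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Name and total memory of each visible GPU, in CSV form
        nvidia-smi --query-gpu=name,memory.total --format=csv
    else
        echo "nvidia-smi not found (are you on a reserved compute node?)"
    fi
    if command -v python >/dev/null 2>&1 && python -c "import torch" >/dev/null 2>&1; then
        # Prints the PyTorch version, then True or False for CUDA availability
        python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
    else
        echo "PyTorch is not importable in the current environment"
    fi
}

check_gpu
```

On an A100 node, a reported PyTorch version of 1.7.1 or lower, or False for CUDA availability, suggests the environment should be rebuilt.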

Conda cheatsheet

https://docs.conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf

Shared scratch filesystem

Each user has a shared scratch filesystem available at /share/$userid. By default, this filesystem is readable by all cluster users but writable only by its owner. Users can change the permissions on this filesystem as required. Working data that is used throughout the cluster should be stored here.
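
The default permissions can be tightened or restored with chmod. The helpers below are a sketch: the function names are hypothetical, and the usage example after the block assumes that your login name ($USER) matches your cluster userid.

```shell
# Make a directory tree private: remove all group and other access.
scratch_private() {
    chmod -R go-rwx "$1"
}

# Restore the default described above: readable by everyone, writable only
# by the owner. The capital X adds execute (search) permission only to
# directories and to files that are already executable.
scratch_shared_read() {
    chmod -R u+rwX,go+rX,go-w "$1"
}
```

For example, scratch_private "/share/$USER" hides your scratch data from other users, and scratch_shared_read "/share/$USER" makes it readable again.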

Additional notes

Notes on keeping TensorFlow from allocating all GPU RAM (which slows job execution):
https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory/55541385#55541385
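
One way to get this behaviour without changing any Python code is the TF_FORCE_GPU_ALLOW_GROWTH environment variable documented in the TensorFlow GPU guide. As a sketch, assuming a TensorFlow version that honours this variable, it can be set in the shell before a job is launched:

```shell
# Ask TensorFlow to allocate GPU memory on demand instead of claiming
# nearly all of it when the process starts.
export TF_FORCE_GPU_ALLOW_GROWTH=true

# Confirm the setting is visible to programs started from this shell.
echo "TF_FORCE_GPU_ALLOW_GROWTH=${TF_FORCE_GPU_ALLOW_GROWTH}"
```

Adding the export line to ~/.bashrc, or to the script that launches the job, makes the setting persistent.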