Overview
The Chippie cluster is a compute facility for CPI members, affiliates, and researchers sponsored by them. It features password-less (SSH-key-based) accounts for security, a reservation system granting exclusive use of compute resources, and multiple GPU systems for fast computation.
Hardware
The current hardware of the Chippie cluster is:
Hostname | CPU | RAM | Storage | Network | GPU | Other |
---|---|---|---|---|---|---|
chippie-base.cs.uwaterloo.ca | 2x AMD EPYC 7302 16-Core | 512GB | 7TB all-flash, NFS-shared cluster-wide for users | gigabit WAN | 6x NVIDIA A100 with 40GB VRAM each, 2x NVIDIA A100 with 80GB VRAM each | Reservable LXC containers, each with exclusive use of one GPU: chippie01-chippie08 |
chippie-100.cs.uwaterloo.ca | 2x AMD EPYC 7272 12-Core | 512GB | 5TB all-flash, NFS-shared cluster-wide for scratch/static data | gigabit WAN | 8x NVIDIA A40 with 48GB VRAM each | Reservable LXC containers, each with exclusive use of one GPU: chippie09-chippie16 |
chippie-800.cs.uwaterloo.ca | 8x Intel Xeon Platinum 8276 28-Core | 6TB | 40TB all-flash, NFS-shared cluster-wide for general-use data | gigabit WAN | (no GPU) | Reservable LXC container for use of 224 CPU cores and 6TB RAM: chippie808 |
Using the cluster
Account Creation
Users of the Chippie cluster need to have an account sponsored by an authorized CPI member. To obtain an account, please send the following information to contact-cpi@uwaterloo.ca and CC your advisor.
Data | Comment |
---|---|
Full name of user | First, last, full name as appropriate |
Email address | Users will be added to the chippie-users mail list for announcements on outages, updates, and other cluster news |
userid | Linux-appropriate userid (lowercase, no spaces). UW users use their WatIAM userid |
Supervisor/Sponsor | CPI member recommending your chippie account |
Purpose for account | PhD, URA, MMath, project name as appropriate |
Public ssh key | Public sibling of your ssh keypair; ed25519 preferred. Create one following https://uwaterloo.atlassian.net/wiki/spaces/ISTKB/pages/1548878163/SSH+Key+Generation (a minimal example follows the table). Note: please send the SSH key in plain text as part of the email, not as a .pub file attachment |
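If you do not already have a keypair, a minimal sketch of generating one on your own machine (Linux/macOS; the file path shown is the ssh-keygen default) is:
$ ssh-keygen -t ed25519
$ cat ~/.ssh/id_ed25519.pub   # copy this public key into your account-request email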
Once an account has been approved and set up, the user will receive an email from the cluster administrator.
Accessing the cluster and its resources
Using the private key corresponding to the public key provided at account creation, users can ssh to chippie.cs.uwaterloo.ca, the login node of the cluster. On the login node you have access to your cluster-global home directory, which is shared across all cluster systems you access. Use the login node to upload and download data to and from the cluster and for light processing of results; note that resource usage on the login node is monitored and limited.
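For example (a sketch; substitute your own userid, and the -i option is only needed if your key is not in the default location):
$ ssh -i ~/.ssh/id_ed25519 userid@chippie.cs.uwaterloo.ca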
To make full use of the cluster, users must reserve compute nodes at: https://chippie-reserve.cs.uwaterloo.ca/
Access to that site requires a password/token. Users can generate a token at any time on the login node with the command:
get-token
Tokens are good for three days.
At the reservation site, users can reserve free compute nodes for any time in the future. In general, limit reservations to the time required for a project and to at most one day, and achieve longer reservation periods by repeating shorter-term reservations. This gives flexibility to interrupt and rejoin reservations in case of outages on the cluster.
When a reservation becomes active, a user can ssh from the login node directly to the reserved compute node, e.g.
ssh chippie05
The reservation mechanism grants access to reserved resources only to users included in the reservation; access is revoked when the reservation ends.
Details on using the reservation site can be found here: https://cs.uwaterloo.ca/twiki/view/CF/Paper
Important guidance for making reservations
- Reservations should be no more than a day long.
- Longer-term reservations should consist of repeated day-long reservations. This gives users the flexibility to cancel the remaining time (as they should) if they finish their work early.
- Reserving resources for convenience, without a particular task, is discouraged.
- Cluster administrators reserve the right to audit usage and terminate idle reservations.
Conda environment on compute systems
Compute systems have Anaconda installed for managing user environments, including Python, e.g.
$ conda create --name test-env python=3.7
$ conda activate test-env
$ conda install tensorflow-gpu=2.1
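To confirm that the GPU in your reserved container is visible to TensorFlow, a quick check (assuming the environment created above is active) is:
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"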
The OSU Open-CE repo https://osuosl.org/services/powerdev/opence/ is useful for installing common deep learning tools:
$ conda config --prepend channels https://ftp.osuosl.org/pub/open-ce/current/
Install pytorch LTS:
The A100 GPUs in the cluster require pytorch version >1.7.1. To install the LTS (1.8.1 at time of writing), create an environment and populate it:
$ conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch-lts -c nvidia
Additional options for pytorch installation are available at the Start Locally page:
https://pytorch.org/get-started/locally/
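After installation, a quick sanity check on a reserved GPU node (a sketch; run inside the activated environment, and the reported device name depends on the node you reserved) is:
$ python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"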
Conda cheatsheet
Shared scratch filesystem
Each user has a shared scratch filesystem available at /share/$userid. By default, this filesystem is readable by all cluster users but only writable by the owner. Users can change the permissions on this filesystem as required. Working data that is used throughout the cluster should be stored here.
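For example, assuming your shell's $USER matches your cluster userid, restricting the directory to yourself could look like this (the modes shown are illustrative; the second command restores the default described above):
$ chmod 700 /share/$USER   # owner-only access
$ chmod 755 /share/$USER   # world-readable, owner-writable (the default)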
Additional notes
Notes on keeping TensorFlow from allocating all GPU RAM (which slows job execution):
https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory/55541385#55541385
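One approach along the lines of the linked answer is to enable on-demand GPU memory growth; in TensorFlow 2.x this can also be set without code changes via an environment variable, e.g. (the script name is a placeholder for your own program):
$ export TF_FORCE_GPU_ALLOW_GROWTH=true
$ python train.py   # your own training script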