Chippie Cluster | Cybersecurity and Privacy Institute

Overview

The Chippie cluster is a compute facility for the use of CPI members, affiliates, and researchers sponsored by them. It features password-less accounts for security, a reservation system for exclusive use of compute resources and multiple GPU systems for fast computation.

Hardware

The current hardware of the Chippie cluster is:

Hostname	CPU	RAM	Storage	Network	GPU	Other
chippie-base.cs.uwaterloo.ca	2x AMD EPYC 7302 16-Core	512GB	7TB all-flash cluster NFS-shared for users	gigabit WAN	6x NVIDIA A100 each with 40GB VRAM, 2x NVIDIA A100 each with 80GB VRAM	Reservable LXC containers each with exclusive use of one GPU: chippie01-chippie08
chippie-100.cs.uwaterloo.ca	2x AMD EPYC 7272 12-Core	512GB	5TB all-flash cluster NFS-shared scratch/static data	gigabit WAN	8x NVIDIA A40: 48GB VRAM per device	Reservable LXC containers each with exclusive use of one GPU: chippie09-chippie16
chippie-800.cs.uwaterloo.ca	8x Intel Xeon Platinum 8276	6TB	40TB all-flash cluster NFS-shared general-use data	gigabit WAN	(no GPU)	Reservable LXC container for use of 224 CPU cores and 6TB RAM: chippie808

Using the cluster

Account Creation

Users of the Chippie cluster need an account sponsored by an authorized CPI member. To obtain an account, please send the following information to contact-cpi@uwaterloo.ca and CC your advisor.

Data	Comment
Full name of user	First, last, full name as appropriate
Email address	Users will be added to chippie-users mail list for announcements on outages, updates and other cluster news
userid	Linux-appropriate userid (lowercase, no spaces). UW users use WatIAm userid
Supervisor/Sponsor	CPI member recommending your chippie account
Purpose for account	PhD, URA, MMath, project name as appropriate
Public ssh key	Public half of an ssh keypair; ed25519 preferred. Create one following: https://uwaterloo.atlassian.net/wiki/spaces/ISTKB/pages/1548878163/SSH+Key+Generation Note: Please send the ssh key as plain text in the body of this email (do not send it as a .pub file).

Once an account has been approved and set up, the user will receive an email from the cluster administrator.

Accessing the cluster and its resources

Using the private key that matches the public key provided at account creation, users can ssh to chippie.cs.uwaterloo.ca, the login node of the cluster. Here you have access to your cluster-global home directory, which is shared across all cluster systems you access. On the login node you can upload and download data, and do light computing and processing of results. Note that resource usage on the login node is monitored and limited.
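As a rough sketch of the login and transfer steps above, the helpers below wrap ssh and scp for use from your own machine. The userid, key path, and function names are placeholders chosen for illustration; they are not part of the cluster's tooling.

```shell
# Placeholders: set these to your own userid and private key path.
CHIPPIE_USER="userid"
CHIPPIE_KEY="$HOME/.ssh/id_ed25519"
CHIPPIE_LOGIN="chippie.cs.uwaterloo.ca"

# Print the user@host string used by the helpers below.
chippie_target() {
    printf '%s@%s\n' "$CHIPPIE_USER" "$CHIPPIE_LOGIN"
}

# Open a shell on the login node.
chippie_login() {
    ssh -i "$CHIPPIE_KEY" "$(chippie_target)"
}

# Upload a local file or directory: chippie_upload LOCAL REMOTE
chippie_upload() {
    scp -i "$CHIPPIE_KEY" -r "$1" "$(chippie_target):$2"
}

# Download a remote file or directory: chippie_download REMOTE LOCAL
chippie_download() {
    scp -i "$CHIPPIE_KEY" -r "$(chippie_target):$1" "$2"
}
```

For example, `chippie_upload ./dataset dataset` would copy a local dataset directory into your cluster home directory, since scp resolves relative remote paths against the home directory.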

To make full use of the cluster, users must reserve compute nodes at: https://chippie-reserve.cs.uwaterloo.ca/
Access to that site requires a password/token. Users can generate a token at any time on the login node with the command:
get-token
Tokens are good for three days.

At the reservation site, users can reserve free compute nodes for any time in the future. Reservations should cover only the time a project requires and should be no longer than one day; longer periods are best achieved by repeating day-long reservations. This gives flexibility to interrupt and rejoin reservations in case of outages on the cluster.

When a reservation becomes active, the user can ssh from the login node directly to the reserved compute node, e.g.
ssh chippie05
Only users included in a reservation can access the reserved resources. Access is revoked when the reservation ends.
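Once logged in to a reserved GPU container, it can be worth confirming that the GPU is visible before starting a long job. The sketch below assumes the NVIDIA driver tools (nvidia-smi) are installed on the compute nodes, which this page does not state; gpu_check is a hypothetical helper name.

```shell
# Report the GPU(s) visible in the current session, if any.
gpu_check() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Name and total memory of each visible GPU, in CSV form.
        nvidia-smi --query-gpu=name,memory.total --format=csv \
            || echo "nvidia-smi is installed but could not query a GPU"
    else
        echo "nvidia-smi not found: no GPU tools on this host"
    fi
}

gpu_check
```

On a container with exclusive use of one GPU, a single device row would be expected.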

Details of the usage of the reservation site can be found here: https://cs.uwaterloo.ca/twiki/view/CF/Paper

Important guidance for making reservations

Reservations should be no more than a day long.
Longer term reservations should consist of repeated day-long reservations. This gives users flexibility to cancel remaining time in long reservations (as they should) if they finish their work early.
Reserving resources for convenience, without a particular task in mind, is discouraged.
Cluster administrators reserve the right to audit usage and terminate idle reservations.
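To plan a multi-day job as repeated day-long reservations, it can help to list the individual windows before entering them at the reservation site. The helper below is only a planning aid with a hypothetical name (day_windows); it does not talk to the reservation system, and it assumes GNU date as found on Linux.

```shell
# Print consecutive day-long reservation windows covering a longer job.
# Usage: day_windows START_DATE NUM_DAYS   (START_DATE as YYYY-MM-DD)
day_windows() {
    start="$1"
    days="$2"
    i=0
    while [ "$i" -lt "$days" ]; do
        # -u avoids daylight-saving shifts in the date arithmetic.
        from=$(date -u -d "$start +$i days" +%Y-%m-%d)
        to=$(date -u -d "$start +$((i + 1)) days" +%Y-%m-%d)
        echo "reservation $((i + 1)): $from -> $to"
        i=$((i + 1))
    done
}

# A three-day job starting on 2024-03-04 becomes three reservations.
day_windows 2024-03-04 3
```

If the work finishes early, the remaining windows can simply be cancelled.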

Conda environment on compute systems

Compute systems have Anaconda installed for managing user environments, including Python, e.g.

$ conda create --name test-env python=3.7
$ conda activate test-env
$ conda install tensorflow-gpu=2.1

The OSU Open-CE repo https://osuosl.org/services/powerdev/opence/ is useful for installing common deep learning tools:
$ conda config --prepend channels https://ftp.osuosl.org/pub/open-ce/current/

Install pytorch LTS:

The A100 GPUs in the cluster require a pytorch version newer than 1.7.1. To install the LTS (1.8.1 at time of writing), create and activate an environment, then populate it:

$ conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch-lts -c nvidia
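After installing, the version requirement can be checked from the shell. This is a minimal sketch, assuming python and torch are available in the active conda environment and that GNU sort (for the -V option) is present; version_gt is a hypothetical helper name.

```shell
# Succeed if version $1 is strictly greater than version $2.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Read the installed version, dropping any local suffix such as "+cu111".
if command -v python >/dev/null 2>&1 &&
    torch_version=$(python -c 'import torch; print(torch.__version__.split("+")[0])' 2>/dev/null); then
    if version_gt "$torch_version" "1.7.1"; then
        echo "pytorch $torch_version is newer than 1.7.1"
    else
        echo "pytorch $torch_version is too old for the A100 GPUs"
    fi
else
    echo "pytorch is not importable in this environment"
fi
```

The comparison uses version ordering rather than string ordering, so 1.10.0 is correctly treated as newer than 1.7.1.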

Additional options for pytorch installation are available at the Start Locally page:
https://pytorch.org/get-started/locally/

Conda cheatsheet

https://docs.conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf

Shared scratch filesystem

Each user has a shared scratch filesystem available at /share/$userid. By default, this filesystem is readable by all cluster users but writable only by the owner. Users can change the permissions on this filesystem as required. Working data that is used throughout the cluster should be stored here.
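Changing those permissions can be done with standard chmod. The sketch below assumes your login name matches your cluster userid, so that your scratch space is /share/ followed by the output of `id -un`; the function names and the example subdirectory are hypothetical.

```shell
# Assumed location of your scratch space.
scratch="/share/$(id -un)"

# Remove all access for other users; the owner keeps full access.
make_private() {
    chmod -R go-rwx "$1"
}

# Restore the default described above: readable by all, writable by owner.
make_world_readable() {
    chmod -R u+rwX,go+rX,go-w "$1"
}

# Example, to be run on the cluster:
#   make_private "$scratch/unpublished-results"
#   ls -ld "$scratch/unpublished-results"
```

The capital X in the chmod modes grants execute permission only on directories and on files that are already executable, which keeps data files from being marked executable.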

Additional notes

Notes on keeping tensorflow from using all GPU RAM (which slows job execution):
https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory/55541385#55541385
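One option that needs no code changes is TensorFlow's TF_FORCE_GPU_ALLOW_GROWTH environment variable, which asks TensorFlow to allocate GPU memory as needed rather than all at once. A minimal sketch follows; train.py is a hypothetical script name.

```shell
# Ask TensorFlow to grow its GPU memory use on demand
# instead of reserving nearly all of it at startup.
export TF_FORCE_GPU_ALLOW_GROWTH=true

# Jobs started from this shell now inherit the setting, e.g.:
#   python train.py
# It can also be set for a single command only:
#   TF_FORCE_GPU_ALLOW_GROWTH=true python train.py
```

The setting applies only to processes started after the export, so it is best placed near the top of a job script.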