
Cluster computing

There are two examples shown here, one using R and the other Python. The R example does not produce any output of particular usefulness, but it will help you get acquainted with the conventions of cluster computing. The Python example is closer to useful code, but again it is mostly there to provide an example of how to use Python on Falcon.

First step - log into Falcon, either through ondemand or the SSH jump box. If you do not already have an account, you can request one here.

The Getting Started Guide has more information about loading modules and the partitions on Falcon.
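
For example, a quick way to look around from a Falcon shell (module names and partition limits vary, so treat this as a sketch and check the guide for specifics):

module avail          # list the software modules installed on the cluster
module load r         # load a module (here, R) into the current session
module list           # show which modules are currently loaded
sinfo -s              # summarize the SLURM partitions (queues) and their nodes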


R Example

In this example we'll simulate thousands of monkeys typing randomly, see how many real words are produced, and create a literary masterpiece along the way.

Log into the ondemand interface and choose Clusters -> _staging Shell Access (best) or Clusters -> _Falcon Shell Access.

Load the R module and install R packages stringi and foreach.

boswald.ui@ondemand ~ >  module load r
boswald.ui@ondemand ~ >  R
R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
  Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> install.packages("stringi",repos='https://ftp.osuosl.org/pub/cran/')

... some time later...

> install.packages("foreach",repos='https://ftp.osuosl.org/pub/cran/')
...

> quit(save="no")

You only need to install the R packages once, and it must be done from staging or ondemand.c3plus3.org - the cluster nodes do not have access to the internet.
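
A quick way to confirm the packages installed correctly is to try loading them non-interactively from a staging or ondemand shell (a minimal check, assuming the R module is loaded):

Rscript -e 'library(stringi); library(foreach); cat("packages OK\n")'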

Next, upload a list of words. Download it from here or here.

Use the Files->Home Directory interface in your browser to create a new folder called 'workshop' and upload the data file.
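
If you prefer the command line, roughly the same setup can be done from a staging or ondemand shell (the placeholder URL below stands in for one of the download links above):

mkdir -p ~/workshop
cd ~/workshop
wget -O engmix.txt "<word-list-url>"    # replace with one of the links above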

The R script (monkey.R)

#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)

if (length(args)==0) {
  stop("You must provide an output file name", call.=FALSE)
}

library(stringi)
library(foreach)

#generate nwords random strings of each length 1..maxlength
nwords = 1000
maxlength = 10
wlist = foreach(i=1:maxlength) %do% stri_rand_strings(nwords, i, '[a-z]')

#load word dictionary from file
real_words = read.table("~/workshop/engmix.txt", header = FALSE, sep = "", dec = ".")

#test if the words in wlist are real words
found_words = c()
for(wl in 2:maxlength){
    for(wi in 1:nwords){
     if (any(real_words$V1 == wlist[[wl]][wi]))
     {
        found_words = append(found_words, wlist[[wl]][wi])
     }
    }
}
#open a file to save found words
fileConn<-file(args[1])
writeLines(found_words, fileConn)
close(fileConn)
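
Before submitting it, you can give the script a quick test run from a staging or ondemand shell (fine for a small sanity check, but real runs should go through SLURM):

module load r
Rscript --vanilla monkey.R test.txt    # writes any found words to test.txt
cat test.txt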

The SLURM script (monkey.slurm)

#!/bin/bash

#SBATCH -p short

cd $SLURM_SUBMIT_DIR
module load r

Rscript --vanilla monkey.R m$SLURM_JOB_ID.txt

Submit the script

sbatch monkey.slurm

Note: If you get an error about line endings, run the command 'dos2unix monkey.slurm' to convert Windows-style line endings to Unix/Linux-style line endings.

Check to see if your job is running

squeue --me
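
When the job finishes it drops out of the squeue listing. SLURM writes the job's console output to slurm-<jobid>.out in the directory you submitted from, and the script writes its found words to m<jobid>.txt, so (replacing <jobid> with your job id):

ls slurm-*.out m*.txt    # job log and the monkey's output file
cat m<jobid>.txt         # the words this monkey happened to type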

It's easy enough to start a few monkeys typing - just repeat the sbatch command - but what if we want a thousand monkeys? Time for an array job. We just need to modify the SLURM submit script (monkeys.slurm):

#!/bin/bash

#SBATCH -p tiny

cd $SLURM_SUBMIT_DIR
module load r

Rscript --vanilla monkey.R m$SLURM_ARRAY_JOB_ID.$SLURM_ARRAY_TASK_ID.txt

Now submit a thousand monkeys at once:

sbatch -a 1-1000 monkeys.slurm
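
Each task shows up in squeue as <jobid>_<taskid>. If a thousand simultaneous monkeys is more than the cluster should bear, SLURM's array syntax can also cap how many tasks run at once (the limit of 50 below is only an example):

squeue --me                           # array tasks appear as <jobid>_<taskid>
sbatch -a 1-1000%50 monkeys.slurm     # run at most 50 tasks at a time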

Retrieve our Shakespeare-esque work (replace the job id below with yours) with a little Bash magic; this will grab a random word from each output file.

for fn in {1..1000}; do printf "%s " $(shuf -n1 m48290.$fn.txt); done

Save this masterpiece to a file (again replace the job id below with yours):

for fn in {1..1000}; do printf "%s " $(shuf -n1 m48290.$fn.txt) >> shakespeare.txt; done

Python Example

Create a Python virtual environment

To use Python, start by creating a virtual environment and activating it.

boswald.ui@ondemand ~ * virtualenv one-o-one
Using base prefix '/usr'
New python executable in /lfs/boswald.ui/one-o-one/bin/python3.6
Also creating executable in /lfs/boswald.ui/one-o-one/bin/python
Installing setuptools, pip, wheel...done.
boswald.ui@ondemand ~ * source one-o-one/bin/activate

Now install the needed packages:

(one-o-one) boswald.ui@ondemand ~ * pip3 install tensorflow tensorflow_datasets numpy matplotlib
Collecting tensorflow
  Downloading tensorflow-2.6.2-cp36-cp36m-manylinux2010_x86_64.whl (458.3 MB)
...
much output later
...
  Building wheel for wrapt (setup.py) ... done
  Created wheel for wrapt: filename=wrapt-1.12.1-cp36-cp36m-linux_x86_64.whl size=73055 sha256=f1c3c0250657f61b5693308939dd5fbabd127c691c9b403df6c1e3a115aa361b
  Stored in directory: /lfs/boswald.ui/.cache/pip/wheels/32/42/7f/23cae9ff6ef66798d00dc5d659088e57dbba01566f6c60db63
Successfully built clang termcolor wrapt
Installing collected packages: urllib3, pyasn1, idna, charset-normalizer, certifi, zipp, typing-extensions, six, rsa, requests, pyasn1-modules, oauthlib, cachetools, requests-oauthlib, importlib-metadata, google-auth, dataclasses, werkzeug, tensorboard-plugin-wit, tensorboard-data-server, protobuf, numpy, markdown, grpcio, google-auth-oauthlib, cached-property, absl-py, wrapt, termcolor, tensorflow-estimator, tensorboard, python-dateutil, pyparsing, pillow, opt-einsum, kiwisolver, keras-preprocessing, keras, h5py, google-pasta, gast, flatbuffers, cycler, clang, astunparse, tensorflow, matplotlib
Successfully installed absl-py-0.15.0 astunparse-1.6.3 cached-property-1.5.2 cachetools-4.2.4 certifi-2022.12.7 charset-normalizer-2.0.12 clang-5.0 cycler-0.11.0 dataclasses-0.8 flatbuffers-1.12 gast-0.4.0 google-auth-1.35.0 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.48.2 h5py-3.1.0 idna-3.4 importlib-metadata-4.8.3 keras-2.6.0 keras-preprocessing-1.1.2 kiwisolver-1.3.1 markdown-3.3.7 matplotlib-3.3.4 numpy-1.19.5 oauthlib-3.2.2 opt-einsum-3.3.0 pillow-8.4.0 protobuf-3.19.6 pyasn1-0.4.8 pyasn1-modules-0.2.8 pyparsing-3.0.9 python-dateutil-2.8.2 requests-2.27.1 requests-oauthlib-1.3.1 rsa-4.9 six-1.15.0 tensorboard-2.6.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-2.6.2 tensorflow-estimator-2.6.0 termcolor-1.1.0 typing-extensions-3.7.4.3 urllib3-1.26.14 werkzeug-2.0.3 wrapt-1.12.1 zipp-3.6.0
(one-o-one)

All installed! Now make a directory to keep things organized.

(one-o-one) boswald.ui@ondemand ~ * mkdir one
(one-o-one) boswald.ui@ondemand ~ * cd one

Create Python files

The Python script to train the model, saved as 'sentiment.train.py', follows the TensorFlow example here. The script file can be created through the console by copy/pasting or through the ondemand interface. Update the data_dir path to the TensorFlow datasets so it points at your own directory.

#!/bin/python3

import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf
import argparse
import os
import matplotlib.pyplot as plt

parser = argparse.ArgumentParser(description="Trains and saves a tensorflow-keras model for sentiment analysis")
parser.add_argument("-j","--jobid",help="the slurm jobid or other unique number",required=False,default="00000")
args = parser.parse_args()

tfds.disable_progress_bar()

dataset, info = tfds.load('imdb_reviews', data_dir='/lfs/boswald.ui/tensorflow_datasets', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

VOCAB_SIZE = 1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))


model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])


model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=3,
                    validation_data=test_dataset,
                    validation_steps=30)

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

model.save_weights("sentiment.ckpt"+args.jobid)

#export a plot of the training

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

plt.savefig(os.getcwd()+"/training_results.png", format='png', dpi=150)

#this saves, but is buggy and can't load the model again:
model.save('sentiment.'+args.jobid)

exit()

Download data

Cluster nodes do not have internet access, so you need to download any data prior to submitting the job.

(one-o-one) boswald.ui@ondemand ~/one * python3
Python 3.6.8 (default, Apr 12 2022, 06:55:39) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-10)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow_datasets as tfds
2023-02-07 10:47:21.440650: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-02-07 10:47:21.440686: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
>>> 
>>> BATCH_SIZE = 64
>>> train_ds = tfds.load('imdb_reviews', split='train[:80%]', batch_size=BATCH_SIZE, shuffle_files=True, as_supervised=True)
>>> exit()
(one-o-one) boswald.ui@ondemand ~/one * 
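
The same download can be scripted instead of typed interactively; a minimal sketch, run from ondemand (which has internet access) with the virtual environment active:

source ~/one-o-one/bin/activate
# caches imdb_reviews under the default ~/tensorflow_datasets directory;
# make sure this matches the data_dir used in sentiment.train.py
python3 -c "import tensorflow_datasets as tfds; tfds.load('imdb_reviews')"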

Submit job to SLURM

First, create the job submission script (sentiment.slurm):

#!/bin/bash
#SBATCH -p short

cd $SLURM_SUBMIT_DIR

hostname

source ~/one-o-one/bin/activate
START=$(date +%s)

python3 sentiment.train.py -j $SLURM_JOBID

let RUNTIME=$(date +%s)-$START
echo "Training time: $RUNTIME"

echo "*--done--*"
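
If training needs more than the partition defaults, common SLURM directives such as these can be added near the top of the script; the values below are only illustrative, and the right limits depend on Falcon's partition configuration:

#SBATCH --cpus-per-task=8    # extra CPU cores for TensorFlow's input pipeline
#SBATCH --mem=16G            # request memory explicitly
#SBATCH --time=02:00:00      # wall-clock limit (HH:MM:SS)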

Now submit the job

(one-o-one) boswald.ui@ondemand ~/one * sbatch sentiment.slurm 
Submitted batch job 10092
(one-o-one) boswald.ui@ondemand ~/one * squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             10092     short sentimen boswald.  R       0:07      1 r3i5n0
(one-o-one) boswald.ui@ondemand ~/one * 

Inference

Now let's use the model we trained. First, create the inference file (inference.py):

#!/bin/python3

import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf
import argparse
import code

parser = argparse.ArgumentParser(description="Loads a saved sentiment model and runs interactive inference")
parser.add_argument("-j","--jobid",help="the slurm jobid or other unique number",required=False,default="00000")
args = parser.parse_args()

tfds.disable_progress_bar()

dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)

VOCAB_SIZE = 1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))


model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])


model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])


model.load_weights("sentiment.ckpt"+args.jobid)

print("#--------------------------------------------------------------------#")
print("\nUse the infer function to analyze some text.  For example:\ninfer('this is the text to analyze sentiment in',model) \n negative numbers indicate negative sentiment, positive numbers positive sentiment\n")
def infer(thetext, mdl):
    predicts = mdl.predict(np.array([thetext]))
    print("sentiment: "+str(predicts[0]))


code.interact(local=locals())

exit()

Run the file, using the job number of the job that trained the model:

(one-o-one) boswald.ui@ondemand ~/one * ls
checkpoint  sentiment.10096  sentiment.ckpt10096.data-00000-of-00001  sentiment.ckpt10096.index  sentiment.slurm  sentiment.train.py  slurm-10092.out  slurm-10095.out  slurm-10096.out  training_results.png
(one-o-one) boswald.ui@ondemand ~/one * nano inference.py
(one-o-one) boswald.ui@ondemand ~/one * python3 inference.py -j 10096
2023-02-07 11:09:47.506250: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-02-07 11:09:47.506289: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-02-07 11:09:53.298806: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-02-07 11:09:53.299341: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2023-02-07 11:09:53.299401: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ondemand): /proc/driver/nvidia/version does not exist
2023-02-07 11:09:53.302200: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-07 11:09:54.181238: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
#--------------------------------------------------------------------#

Use the infer function to analyze some text.  For example:
infer('this is the text to analyze sentiment in',model) 
 negative numbers indicate negative sentiment, positive numbers positive sentiment

Python 3.6.8 (default, Apr 12 2022, 06:55:39) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-10)] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> infer('some text to analyze here',model)
sentiment: [-0.15224136]
>>> infer('happy day, a good movie, fun for all',model)
sentiment: [0.9714721]
>>> infer('i hate apples, they taste like sand',model)
sentiment: [-0.13639668]
>>> infer('puppies are cute - especially when playing with a ball',model)
sentiment: [0.36142796]
>>> exit()
(one-o-one) boswald.ui@ondemand ~/one *