IBM PowerAI machine learning framework

The IBM PowerAI machine learning framework is available on Mio's Power8 GPU-enabled nodes. PowerAI release 3.4 provides software packages for several deep learning frameworks, along with supporting libraries and tools.

Information about this framework can be found at: https://www.ibm.com/us-en/marketplace/deep-learning-platform.





Quick Start

Follow the directions below to quickly run some examples. The examples and scripts are described in more detail in the sections that follow.

  1. In a new directory, download the examples
    wget http://geco.mines.edu/prototype/Show_me_Machine_Learning_examples/power/fullset.tgz
  2. Uncompress it
    tar -xzf fullset.tgz
  3. Run the batch scripts
    1. sbatch sbatch_simple
      Runs an example on ppc002 using the native Ubuntu OS.




Running on ppc002 or ppc001

We will look at running natively on ppc002, both interactively and with a batch script.

If you are logged into Mio, the following command will give you an interactive session on ppc002 (to use ppc001 instead, change the --nodelist option):

srun -N 1 --tasks-per-node=1 -p ppc --time=1:00:00 --gres=gpu:kepler:4 --nodelist=ppc002 --pty bash -l

Once the session starts, we can see that we are running Ubuntu:

joeuser@ppc002:~/scratch/ml$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
joeuser@ppc002:~/scratch/ml$ 

In order to use the packages in the framework, some environment variables need to be set. This is done by sourcing the activate script for the framework you want to use; the scripts live under /opt/DL/. For example:

joeuser@ppc002:~/scratch/ml$ source /opt/DL/tensorflow/bin/tensorflow-activate
joeuser@ppc002:~/scratch/ml$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>> 
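
Once the activate script has been sourced, a quick way to confirm that TensorFlow can see the node's GPUs is to list its local devices. The short check below is our own illustration (it is not part of the example set) and relies on an internal TensorFlow 1.x helper:

# List the devices TensorFlow will use on this node; the GPU entries should
# appear alongside /cpu:0. Illustrative snippet, not part of fullset.tgz.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name)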

We have two TensorFlow examples, which are slight modifications of code from the TensorFlow tutorials, plus a small timing helper:

fully_connected_feed.py
mnist.py
Handwritten digit classification
Original Source: https://www.tensorflow.org/get_started/mnist/mechanics
tf_test.py
Trainable linear regression model
Original Source: https://www.tensorflow.org/get_started/get_started
tymer.py
Simple timing routine
./tymer.py -h will show usage

After you have an interactive session on ppc002 and have sourced the activate script as discussed above, you can run these examples. For example:

joeuser@ppc002:~/scratch/ml$ source /opt/DL/tensorflow/bin/tensorflow-activate
joeuser@ppc002:~/scratch/ml$ python tf_test.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
...
...
W: [-0.99999911] b: [ 0.99999744] loss: 4.20641e-12
47.2702050209
joeuser@ppc002:~/scratch/ml$
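
tf_test.py is included in fullset.tgz and is not reproduced here, but the tutorial it is based on (linked above) trains a model along the following lines. This is only a minimal sketch using the TensorFlow 1.x API; the variable names and constants are illustrative rather than taken from tf_test.py:

# Minimal trainable linear regression (illustrative sketch based on the
# TensorFlow getting-started tutorial, not the tf_test.py from fullset.tgz).
import tensorflow as tf

# Trainable model parameters and placeholders for the inputs and targets
W = tf.Variable([0.3], dtype=tf.float32)
b = tf.Variable([-0.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

linear_model = W * x + b
loss = tf.reduce_sum(tf.square(linear_model - y))  # sum of squared errors
train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Training data that fits y = -x + 1
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for _ in range(1000):
    sess.run(train, {x: x_train, y: y_train})

print(sess.run([W, b, loss], {x: x_train, y: y_train}))

Training a model like this drives W toward -1 and b toward 1, which is consistent with the W and b values printed in the run above.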




Batch script for running natively on ppc002

sbatch_simple is a simple script for running a TensorFlow example natively on ppc002:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --partition=ppc
#SBATCH --overcommit
#SBATCH --exclusive
#SBATCH --nodelist=ppc002
#SBATCH --gres=gpu:4
##SBATCH --ntasks=1
#SBATCH --export=ALL
#SBATCH --out=%J.out
#SBATCH --err=%J.msg

# Go to the directory from which our job was launched
cd $SLURM_SUBMIT_DIR

#set up our environment
source /etc/profile
module purge

#activate the TensorFlow framework
source /opt/DL/tensorflow/bin/tensorflow-activate
ulimit -c 0
#raising the user process limit may help jobs run better
ulimit -u 4096


#run a test
python tf_test.py

By default, the program fully_connected_feed.py reads its input data from /tmp and writes its logs there as well. We want it to use our local directory instead, which can be done with command-line arguments, so the command line to run this example is:

python fully_connected_feed.py --input_data_dir=./tmp --log_dir=./tmp 
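
The --input_data_dir flag matters because the example pulls the MNIST data into that directory (downloading it if it is not already there) using the input_data helper bundled with TensorFlow 1.x. As a rough illustration (this sketch is ours, not code from fullset.tgz), the download step looks something like:

# Fetch the MNIST data into ./tmp if it is not already present
# (illustrative sketch; fully_connected_feed.py handles this via its flags).
from tensorflow.examples.tutorials.mnist import input_data

data_sets = input_data.read_data_sets("./tmp")
print(data_sets.train.num_examples)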




Notes

When TensorFlow starts, it writes a copious amount of information describing the platform on which it is running to stderr (the lines beginning with I in the output above). You can redirect this to a file by adding 2>file to your command line, for example: python tf_test.py 2> startup.log





Files:

fullset.tgz
Contains all of the files listed below.
fully_connected_feed.py
Handwritten digit classification, part one.
mnist.py
Handwritten digit classification, part two.
sbatch_ppc
Batch script for running on the Power nodes with GPUs.
sbatch_x86
Batch script for running on x86 nodes with GPUs.
sbatch_con
Batch script for running with the container.
sbatch_simple
A simple script for running a TensorFlow example natively on ppc002.
tf_test.py
Trainable linear regression model.
tymer.py
Simple timing routine.
sbatch_script
A more elaborate script that runs with or without the container.
bash_script
Not used by the other scripts; it could be used if you want to call a script from within the container.