This will (hopefully) be a quick guide for students to start running parallel MATLAB jobs on the high performance computing (HPC) cluster Flux. This can be especially useful for:
- Parameter sweeps: spread parameter values over nodes!
- Monte Carlo simulations: spread Monte Carlo trials over nodes!
We'll show the process for an example related to Wigner's semicircle law, a very cool result from random matrix theory. Let's jump in!

Remark: there are many ways to set this up - we'll just focus on one here.
We want to explore the average histogram of eigenvalues for the real symmetric random matrix X = (Y + Y')/2, where Y is a square matrix with independent standard normal entries.
More specifically, we want to:

- generate many instances of the random matrix X,
- compute the eigenvalues for each instance, and
- make a histogram of the eigenvalues collected from all the instances.
Flux allows us to spread this work over nodes. :)
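For intuition, here's a sketch of a serial version of the whole experiment (the 100×100 matrix size matches the simulation code later in this guide; the 1000 trials are an illustrative choice):

```matlab
% Serial version: run all Monte Carlo trials on one machine
e_all = [];
for trial = 1:1000
    y = randn(100);          % matrix with i.i.d. standard normal entries
    x = 1/2*(y + y');        % symmetrize to get the random matrix
    e_all = [e_all; eig(x)]; % collect the eigenvalues
end
histogram(e_all)
```

With many trials this loop gets slow, and that's exactly the work we'll spread over nodes.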
Step 0: Prerequisites

You need a few things before we start:

- MToken. See instructions here!
- Flux user account. Sign up here!
- Flux allocation access. Ask your advisor for this. Your user account needs to be granted access to their allocation, and you need to know the name of their allocation.
Step 1: Write the simulation program

We'll have each node generate one instance of X and compute its eigenvalues. Here's a MATLAB function to do that!
```matlab
% simulation.m
function simulation(jobid)

% Parallel configuration
rng(jobid);
outfile = sprintf('data/sim%g.mat', jobid);
if exist(outfile, 'file') ~= 0
    fprintf('File %s already exists! Simulation %g skipped.\n', outfile, jobid);
    return
end

% Run simulation
y = randn(100);
x = 1/2*(y + y');
e = eig(x);

% Save outputs
save(outfile);

end
```
When we submit this to Flux, we'll provide an array of "job IDs". For each ID, Flux will allocate a node to us and run our MATLAB function on it with that ID as input.
Note that we:

- seed the random number generator using `jobid` so that nodes don't generate the same random numbers,
- save the results in an output file corresponding to `jobid`, and
- check if the output file already exists. If Flux goes down before all the nodes finish, we'll submit the job again and won't want to waste time redoing runs that already completed.
Remark: for parameter sweeps, `jobid` is a great way to select the parameters to use for each node.
Step 2: Write the PBS script

A PBS script describes the job we want to run so that Flux can schedule and run it.
```bash
# script.pbs

## PBS directives (configuration)

# Job description and messaging (i.e., notifications)
#PBS -N eigrand
#PBS -M [your email here]
#PBS -m abe

# Account information
#PBS -A [allocation name here]
#PBS -l qos=flux
#PBS -q flux

# Requested resources and environment
#PBS -l nodes=1:ppn=1,pmem=1gb
#PBS -l walltime=15:00
#PBS -V

# Job array (1 to 10 with at most 5 running at once)
#PBS -t 1-10%5

# Location for log files (stdout and stderr)
#PBS -o logs/
#PBS -e logs/

## Script
cd $PBS_O_WORKDIR
matlab -nodisplay -r "simulation($PBS_ARRAYID)"
```
The PBS script is actually just a normal bash script with "PBS directives" at the top. The script gets run on each node with access to some special environment variables like:

- `$PBS_O_WORKDIR`: the directory we submitted the job from
- `$PBS_ARRAYID`: the job ID assigned to that node

Note that this script is what runs our MATLAB function above with the job ID as input.
Each PBS directive starts with `#PBS` and tells Flux about our job:

- `#PBS -N eigrand` sets the name of the job.
- `#PBS -M [your email here]` sets the email you want to use for messages from Flux.
- `#PBS -m abe` configures Flux to email you when each job ID aborts, begins, and ends.
- `#PBS -A [allocation name here]` sets the allocation you are using.
- `#PBS -l qos=flux` sets the quality of service (this should be `flux` unless told otherwise).
- `#PBS -q flux` sets the queue. It generally matches the allocation name suffix (e.g., an allocation whose name ends in `flux` has `flux` as the queue).
- `#PBS -l nodes=1:ppn=1,pmem=1gb` (approximately) requests that each job ID get 1 node with 1 processor per node and 1 GB of physical memory.
- `#PBS -l walltime=15:00` requests 15 minutes for each job ID to complete. Once this time is up, Flux kills our program even if it's still running.
- `#PBS -V` tells Flux to copy the environment variables from where we submit the job to each node. This is important because we'll need to put MATLAB in the `PATH` and we'll need that to be applied to all the nodes.
- `#PBS -t 1-10%5` sets the array of job IDs to be 1, 2, …, 10. It also tells Flux to only run 5 job IDs at a time (that way you don't hog all the nodes available in the allocation!).
- `#PBS -o logs/` and `#PBS -e logs/` tell Flux where to store the stdout and stderr streams from each run.
Step 3: Submit the job

We now have all the files we need ready! Time to upload them to Flux and submit the job.
Uploading to Flux

Upload `simulation.m` and `script.pbs` to your directory in `/scratch` using the transfer server.

Don't know what your directory in `/scratch` is? Ask your advisor. It's likely something like `/scratch/[allocation name here]/[your uniqname here]`.
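The upload might be done with `scp` (a sketch; the transfer server hostname and the bracketed paths are placeholders you'd fill in from your cluster documentation):

```shell
# Copy the two files to your /scratch directory via the transfer server
scp simulation.m script.pbs \
    [your uniqname here]@[transfer server]:/scratch/[allocation name here]/[your uniqname here]/
```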
Submitting the job

Sign in to the login server `flux-login.engin.umich.edu` and run commands that do the following.

Note: this must be done from the university network (i.e., you'll need to be on the network, VPN in, or go through another on-campus server first) and you'll need to use your MToken to authenticate.
- move us into the directory where we have put our files,
- create directories for the output files,
- add MATLAB 2015a to the `PATH`, and
- submit the job to Flux.
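Those commands might look like the following sketch (the bracketed paths are placeholders, and the module name `matlab/R2015a` is an assumption - check `module avail matlab` for the exact name on your system):

```shell
# Move into the directory where we put our files
cd /scratch/[allocation name here]/[your uniqname here]

# Create directories for the output and log files
mkdir -p data logs

# Add MATLAB 2015a to the PATH
module load matlab/R2015a

# Submit the job to Flux
qsub script.pbs
```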
Keeping track of the job

You'll receive an email from Flux when each job ID begins, ends, and aborts because of the PBS directive `#PBS -m abe`.
To check the current status, run the following command on the login node. You'll see something like the sample output, where each line corresponds to a job ID.
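The status check might look like this sketch (`qstat` is the standard PBS status command; the `-t` flag expands the job array so each array entry appears on its own line, though the exact flags and output format on your system may vary):

```shell
qstat -t -u [your uniqname here]
```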
Step 4: Download output files and merge the results

Once all job IDs are completed, download the directories containing output files. Once again, use the transfer server.
Now we need to merge the results from the many files (one for each job ID) into a single data file. A good way is to write a program like this.
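A sketch of such a merge program (the function name `merge_data` and the variable `e_all` are assumptions; it relies on each output file storing its eigenvalues in a variable `e`, as `simulation.m` above does):

```matlab
% merge_data.m
function merge_data(jobids)

% Collect the eigenvalues from each simulation output file
e_all = [];
for jobid = jobids
    infile = sprintf('data/sim%g.mat', jobid);
    S = load(infile, 'e');
    e_all = [e_all; S.e]; %#ok<AGROW>
end

% Save the merged eigenvalues in a single data file
save('data/sim-merged.mat', 'e_all');

end
```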
In MATLAB, run this program with the job ID list we had (1, 2, …, 10) with the command.
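Assuming the merge program is a function `merge_data.m` taking a vector of job IDs (a hypothetical name for illustration), the invocation would be:

```matlab
merge_data(1:10)
```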
This will generate a new file `data/sim-merged.mat` that we can use to make our histogram as follows.
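A sketch of the plotting step (assuming the merged file stores the collected eigenvalues in a vector `e_all`; `histogram` with the `'Normalization'` option is available in MATLAB R2014b and later):

```matlab
% Load the merged eigenvalues and plot their normalized histogram
load('data/sim-merged.mat', 'e_all');
histogram(e_all, 'Normalization', 'pdf');
xlabel('eigenvalue');
ylabel('density');
```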
After you run this, you should get the following histogram.

Turns out the histogram is a semicircle! :)
To run a different parallel MATLAB job, modify `simulation.m` with the code you want to run for each job ID and adjust `script.pbs`. You'll want to remember to change the name, the requested resources (especially the physical memory and walltime), and the job ID array.