Contents

1 Curie's advanced usage manual
2 Optimization
  2.1 Compilation options
    2.1.1 Intel
      2.1.1.1 Intel Sandy Bridge processors
    2.1.2 GNU
3 Submission
  3.1 Choosing or excluding nodes
4 MPI
  4.1 Embarrassingly parallel jobs and MPMD jobs
  4.2 BullxMPI
    4.2.1 MPMD jobs
    4.2.2 Tuning BullxMPI
    4.2.3 Optimizing with BullxMPI
    4.2.4 Debugging with BullxMPI
5 Process distribution, affinity and binding
  5.1 Introduction
    5.1.1 Hardware topology
    5.1.2 Definitions
    5.1.3 Process distribution
    5.1.4 Why is affinity important for improving performance?
    5.1.5 CPU affinity mask
  5.2 SLURM
    5.2.1 Process distribution
      5.2.1.1 Curie hybrid node
    5.2.2 Process binding
  5.3 BullxMPI
    5.3.1 Process distribution
    5.3.2 Process binding
    5.3.3 Manual process management
6 Using GPU
  6.1 Two sequential GPU runs on a single hybrid node
7 Profiling
  7.1 PAPI
  7.2 VampirTrace/Vampir
    7.2.1 Basics
    7.2.2 Tips
    7.2.3 Vampirserver
    7.2.4 CUDA profiling
  7.3 Scalasca
    7.3.1 Standard utilization
    7.3.2 Scalasca + Vampir
    7.3.3 Scalasca + PAPI
  7.4 Paraver
    7.4.1 Trace generation
    7.4.2 Converting traces to Paraver format
    7.4.3 Launching Paraver

Curie's advanced usage manual

If you have suggestions or remarks, please contact us: hotline.tgcc@cea.fr

Optimization

Compilation options

Compilers provide many options to optimize a code. These options are described in the following sections.

Intel

-opt_report : generates a report describing the optimizations performed, written to stderr (-O3 required).
-ip, -ipo : inter-procedural optimizations (single-file and multi-file). The command xiar must be used instead of ar to generate a static library from objects compiled with -ipo.
-fast : default high optimization level (-O3 -ipo -static). Careful: this option cannot be used with MPI, because the MPI context needs to call libraries which only exist as dynamic versions, and this is incompatible with -static. Replace -fast by -O3 -ipo.
-ftz : flushes denormalized numbers to zero at runtime.
-fp-relaxed : relaxed optimization of mathematical functions. Leads to a small loss of accuracy.
-pad : enables the modification of memory positions (padding; ifort only).

Some options enable specific instruction sets of Intel processors in order to optimize the code. These options are compatible with most Intel processors; the compiler generates these instructions only if the processor supports them.

-xSSE4.2 : may generate Intel SSE4 Efficient Accelerated String and Text Processing instructions, as well as Intel SSE4 Vectorizing Compiler and Media Accelerator, SSSE3, SSE3, SSE2 and SSE instructions.
-xSSE4.1 : may generate Intel SSE4 Vectorizing Compiler and Media Accelerator, SSSE3, SSE3, SSE2 and SSE instructions.
-xSSSE3 : may generate Intel SSSE3, SSE3, SSE2 and SSE instructions.
-xSSE3 : may generate Intel SSE3, SSE2 and SSE instructions.
-xSSE2 : may generate Intel SSE2 and SSE instructions.
-xHost : applies one of the previous options depending on the processor on which the compilation is performed. This option is recommended for optimizing your code.

None of these options are used by default. The SSE instructions use the vectorization capability of Intel processors.
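For example, a typical optimized build could combine the generic optimization options with -xHost (the source and binary names below are only placeholders):

# Intel Fortran: high optimization plus the vectorization level of the build host
ifort -O3 -xHost -ip -o prog.exe prog.f90
# MPI code: replace -fast by -O3 -ipo as explained above
mpif90 -O3 -ipo -o prog_mpi.exe prog_mpi.f90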
Intel Sandy Bridge processors

Curie thin nodes use the latest Intel processors, based on the Sandy Bridge architecture. This architecture provides new vectorization instructions called AVX (Advanced Vector eXtensions). The option -xAVX generates code specific to the Curie thin nodes. Be careful: a code generated with the -xAVX option runs only on Intel Sandy Bridge processors. Otherwise, you will get this error message:

Fatal Error: This program was not built to run in your system.
Please verify that both the operating system and the processor support Intel(R) AVX.

Curie login nodes are Curie large nodes with Nehalem-EX processors. AVX code can be generated on these nodes through cross-compilation by adding the -xAVX option. On Curie large nodes, the -xHost option will not generate AVX code. If you need to compile with -xHost, or if the installation requires some tests (like autotools/configure), you can submit a job which compiles on the Curie thin nodes.

GNU

Some options enable specific instruction sets of Intel processors in order to optimize the code's behavior. These options are compatible with most Intel processors; the compiler uses these instructions only if the processor supports them.

-mmmx / -mno-mmx : switch on or off the usage of the given instruction set.
-msse / -mno-sse : idem.
-msse2 / -mno-sse2 : idem.
-msse3 / -mno-sse3 : idem.
-mssse3 / -mno-ssse3 : idem.
-msse4.1 / -mno-sse4.1 : idem.
-msse4.2 / -mno-sse4.2 : idem.
-msse4 / -mno-sse4 : idem.
-mavx / -mno-avx : idem, for the Curie thin nodes partition only.
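For example (file names are again placeholders), a build intended for the Curie thin nodes partition can enable AVX explicitly, while a build for the other partitions would stop at SSE4.2:

# GNU compiler, thin nodes partition only: enable AVX
gcc -O2 -mavx -o prog.exe prog.c
# other partitions: restrict the generated code to SSE4.2
gcc -O2 -msse4.2 -o prog.exe prog.c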
Submission

Choosing or excluding nodes

SLURM provides the possibility to choose or exclude specific nodes in the reservation for your job.

To choose nodes:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '-w curie[1000-1003]'    # Include 4 nodes (curie1000 to curie1003)

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out

To exclude nodes:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '-x curie[1000-1003]'    # Exclude 4 nodes (curie1000 to curie1003)

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out

MPI

Embarrassingly parallel jobs and MPMD jobs

An embarrassingly parallel job is a job which launches independent processes. These processes need few or no communications.

An MPMD job is a parallel job which launches different executables over the processes. An MPMD job can be parallel with MPI and can perform many communications.

These two concepts are distinct, but we present them together because the way to launch them on Curie is similar. A simple example was already given on the Curie info page. In the following example, we use ccc_mprun to launch the job; srun can be used too.

We want to launch bin0 on MPI rank 0, bin1 on MPI rank 1 and bin2 on MPI rank 2. We first have to write a shell script which describes the topology of our job:

launch_exe.sh:

#!/bin/bash
if [ $SLURM_PROCID -eq 0 ]
then
  ./bin0
fi
if [ $SLURM_PROCID -eq 1 ]
then
  ./bin1
fi
if [ $SLURM_PROCID -eq 2 ]
then
  ./bin2
fi

We can then launch our job with 3 processes:

ccc_mprun -n 3 ./launch_exe.sh

The script launch_exe.sh must have execute permission. When ccc_mprun launches the job, it initializes some environment variables. Among them, SLURM_PROCID defines the current MPI rank.

BullxMPI

MPMD jobs

BullxMPI (or OpenMPI) jobs can be launched with the mpirun launcher. In this case, there are other ways to launch MPMD jobs (see the embarrassingly parallel jobs section). We take the same example as in that section. There are then two ways of launching MPMD jobs.

We do not need launch_exe.sh anymore; the job can be launched directly with the mpirun command:

mpirun -np 1 ./bin0 : -np 1 ./bin1 : -np 1 ./bin2

Alternatively, in launch_exe.sh, we can replace SLURM_PROCID by OMPI_COMM_WORLD_RANK:

launch_exe.sh:

#!/bin/bash
if [ ${OMPI_COMM_WORLD_RANK} -eq 0 ]
then
  ./bin0
fi
if [ ${OMPI_COMM_WORLD_RANK} -eq 1 ]
then
  ./bin1
fi
if [ ${OMPI_COMM_WORLD_RANK} -eq 2 ]
then
  ./bin2
fi

We can then launch our job with 3 processes:

mpirun -np 3 ./launch_exe.sh

Tuning BullxMPI

BullxMPI is based on OpenMPI. It can be tuned with parameters. The command ompi_info -a gives a list of all parameters and their descriptions:

curie50$ ompi_info -a
(...)
MCA mpi: parameter "mpi_show_mca_params" (current value: <none>, data source: default value)
         Whether to show all MCA parameter values during MPI_INIT or not (good for reproducibility
         of MPI jobs for debug purposes). Accepted values are all, default, file, api, and
         environment - or a comma delimited combination of them
(...)

These parameters can be modified with environment variables set before the ccc_mprun command. The corresponding environment variable has the form OMPI_MCA_xxxxx, where xxxxx is the parameter name.

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id
#MSUB -A paxxxx                   # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
export OMPI_MCA_mpi_show_mca_params=all
ccc_mprun ./a.out

Optimizing with BullxMPI

You can try these parameters in order to optimize BullxMPI:

export OMPI_MCA_mpi_leave_pinned=1

This setting improves the communication bandwidth if the code reuses the same buffers for communication during the execution.

export OMPI_MCA_btl_openib_use_eager_rdma=1

This parameter optimizes the latency for short messages on the InfiniBand network, but the code will use more memory.

Be careful: these parameters are not set by default, and they can influence the behaviour of your code.
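Before enabling these settings for a production run, you can check their default values in the ompi_info output and then export them in your submission script before the launch command. A minimal sketch (a.out stands for your own binary):

# check the current default value of a parameter
ompi_info -a | grep mpi_leave_pinned

# in the submission script, before the launch command
export OMPI_MCA_mpi_leave_pinned=1
export OMPI_MCA_btl_openib_use_eager_rdma=1
ccc_mprun ./a.out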
Debugging with BullxMPI

Sometimes, BullxMPI codes can hang in a collective communication for large jobs. If you find yourself in this case, you can try this parameter:

export OMPI_MCA_coll="^ghc,tuned"

This setting disables the optimized collective communications: it can slow down your code if it uses many collective operations.

Process distribution, affinity and binding

Introduction

Hardware topology

[Figure: hardware topology of a Curie fat node]

The hardware topology is the organization of cores, processors, sockets and memory in a node. The previous image was created with hwloc. You can have access to hwloc on Curie with the command module load hwloc.

Definitions

We define here some vocabulary:

Binding: a Linux process can be bound (or stuck) to one or many cores. It means that a process and its threads can run only on a given selection of cores. For example, a process which is bound to a socket of a Curie fat node can run on any of the 8 cores of that processor.

Affinity: the policy of resource management (cores and memory) for processes.

Distribution: the distribution of MPI processes describes how these processes are spread across the cores, sockets or nodes.

On Curie, the default behaviour for distribution, affinity and binding is managed by SLURM, more precisely by the ccc_mprun command.

Process distribution

We present here some examples of MPI process distributions.

block or round: this is the standard distribution. From the SLURM manpage: the block distribution method distributes tasks to a node such that consecutive tasks share a node. For example, consider an allocation of two nodes, each with 8 cores. A block distribution request distributes the tasks with tasks 0 to 7 on the first node and tasks 8 to 15 on the second node.

[Figure: block distribution by core]

cyclic by socket: from the SLURM manpage, the cyclic distribution method distributes tasks to a socket such that consecutive tasks are distributed over consecutive sockets (in a round-robin fashion). For example, consider an allocation of two nodes, each with 2 sockets of 4 cores. A cyclic distribution by socket places tasks 0,2,4,6 on the first socket and tasks 1,3,5,7 on the second socket. In the following image, the distribution is cyclic by socket and block by node.

[Figure: cyclic distribution by socket]

cyclic by node: from the SLURM manpage, the cyclic distribution method distributes tasks to a node such that consecutive tasks are distributed over consecutive nodes (in a round-robin fashion). For example, consider an allocation of two nodes, each with 2 sockets of 4 cores. A cyclic distribution by node places tasks 0,2,4,6,8,10,12,14 on the first node and tasks 1,3,5,7,9,11,13,15 on the second node. In the following image, the distribution is cyclic by node and block by socket.

[Figure: cyclic distribution by node (block by socket)]
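The topology and distribution figures above were produced with hwloc. To inspect the topology of a node yourself, a minimal sketch (assuming the hwloc module provides the standard lstopo tool):

module load hwloc
# print the topology of the current node: sockets, cores, caches and NUMA nodes
lstopo
# or run it on a compute node of the target partition, e.g. the hybrid queue
ccc_mprun -q hybrid -n 1 lstopo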
Why is affinity important for improving performance?

Curie nodes are NUMA (Non-Uniform Memory Access) nodes. It means that it takes longer to access some regions of memory than others, because all memory regions are not physically on the same bus.

[Figure: NUMA node - Curie hybrid node]

In this picture, we can see that if a piece of data is in memory module 0, a process running on the second socket (like the 4th process) takes more time to access it. We can introduce the notion of local data vs remote data: if we consider a process running on socket 0, a piece of data is local if it is in memory module 0 and remote if it is in memory module 1.

We can then deduce the reasons why tuning the process affinity is important:

Data locality improves performance. If your code uses shared memory (like pthreads or OpenMP), the best choice is to regroup your threads on the same socket. The shared data should be local to the socket and, moreover, the data will potentially stay in the processor's cache.

System processes can interrupt your process running on a core. If your process is not bound to a core or to a socket, it can be moved to another core or to another socket. In this case, all the data of this process has to be moved with it, which can take some time.

MPI communications are faster between processes which are on the same socket. If you know that two processes communicate a lot, you can bind them to the same socket.

On Curie hybrid nodes, the GPUs are connected to buses which are local to a socket. A process takes longer to access a GPU which is not connected to its socket.

[Figure: NUMA node - Curie hybrid node with GPU]

For all these reasons, it is better to know the NUMA configuration of the Curie nodes (fat, hybrid and thin). In the following sections, we present some ways to tune the process affinity of your jobs.

CPU affinity mask

The affinity of a process is defined by a mask. A mask is a binary value whose length is the number of cores available on a node. For example, Curie hybrid nodes have 8 cores, so the binary mask has 8 digits. Each digit is 0 or 1, and the process can run only on the cores whose digit is 1. A binary mask must be read from right to left. For example, a process which runs on cores 0, 4, 6 and 7 has the binary affinity mask:

11010001

SLURM and BullxMPI use these masks, but converted to hexadecimal. To convert a binary value to hexadecimal:

$ echo "obase=16;ibase=2;11010001" | bc
D1

To convert a hexadecimal value to binary:

$ echo "obase=2;ibase=16;D1" | bc
11010001

The numbering of the cores is the PU number in the hwloc output.
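As a quick cross-check of the example above, the mask of an arbitrary set of cores can be computed directly in the shell (a small illustrative helper, not a Curie tool):

# build the affinity mask for cores 0, 4, 6 and 7
mask=0
for core in 0 4 6 7; do
  mask=$(( mask | (1 << core) ))
done
printf "binary: %s  hexadecimal: 0x%x\n" "$(echo "obase=2;$mask" | bc)" "$mask"
# prints: binary: 11010001  hexadecimal: 0xd1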
SLURM

SLURM is the default launcher for jobs on Curie. SLURM manages the processes even for sequential jobs, and we recommend you to use ccc_mprun. By default, SLURM binds each process to a core, and the distribution is block by node and by core.

The option -E '--cpu_bind=verbose' for ccc_mprun gives you a report about the binding of the processes before the run:

$ ccc_mprun -E '--cpu_bind=verbose' -q hybrid -n 8 ./a.out
cpu_bind=MASK - curie7054, task  3  3 [3534]: mask 0x8 set
cpu_bind=MASK - curie7054, task  0  0 [3531]: mask 0x1 set
cpu_bind=MASK - curie7054, task  1  1 [3532]: mask 0x2 set
cpu_bind=MASK - curie7054, task  2  2 [3533]: mask 0x4 set
cpu_bind=MASK - curie7054, task  4  4 [3535]: mask 0x10 set
cpu_bind=MASK - curie7054, task  5  5 [3536]: mask 0x20 set
cpu_bind=MASK - curie7054, task  7  7 [3538]: mask 0x80 set
cpu_bind=MASK - curie7054, task  6  6 [3537]: mask 0x40 set

In this example, we can see that process 5 has 0x20 as hexadecimal mask, i.e. 00100000 in binary: the 5th process will run only on core 5.

Process distribution

To change the default distribution of processes, you can use the option -E '-m' for ccc_mprun. With SLURM, there are two levels of process distribution: node and socket.

Node block distribution:

ccc_mprun -E '-m block' ./a.out

Node cyclic distribution:

ccc_mprun -E '-m cyclic' ./a.out

By default, the distribution over the sockets is block. In the following examples of socket distribution, the node distribution is block.

Socket block distribution:

ccc_mprun -E '-m block:block' ./a.out

Socket cyclic distribution:

ccc_mprun -E '-m block:cyclic' ./a.out

Curie hybrid node

On a Curie hybrid node, each GPU is connected to a socket (see the previous picture). It takes longer for a process to access a GPU if the process is not on the same socket as the GPU. By default, the distribution is block by core: MPI rank 0 is then located on the first socket and MPI rank 1 is on the first socket too. Most GPU codes assign GPU 0 to MPI rank 0 and GPU 1 to MPI rank 1; in this case, the bandwidth between MPI rank 1 and GPU 1 is not optimal. If your code does this, you should, in order to obtain the best performance:

use the block:cyclic distribution;
or, if you intend to use only 2 MPI processes per node, reserve 4 cores per process with the directive #MSUB -c 4. The two processes will then be placed on two different sockets (see the sketch below).
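For instance, a GPU code running one MPI process per GPU on a hybrid node could reserve 4 cores per process, so that each rank sits on its own socket next to its GPU. This is only a sketch: the request name, project ID and binary name (gpu_code.exe) are placeholders.

#!/bin/bash
#MSUB -r GpuJob                   # Request name
#MSUB -n 2                        # 2 MPI processes, one per GPU
#MSUB -N 1                        # on a single hybrid node
#MSUB -c 4                        # 4 cores per process: one process per socket
#MSUB -q hybrid                   # hybrid (GPU) partition
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o gpu_example_%I.o         # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./gpu_code.exe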
Process binding

By default, processes are bound to a core. For multi-threaded jobs, processes create threads, and these threads are bound to the assigned core. To allow these threads to use other cores, SLURM provides the option -c to assign several cores to a process.

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 8                        # Number of tasks to use
#MSUB -c 4                        # Assign 4 cores per process
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID

export OMP_NUM_THREADS=4
ccc_mprun ./a.out

In this example, our hybrid OpenMP/MPI code runs on 8 MPI processes and each process uses 4 OpenMP threads.

We give here an example of the output with the verbose binding option:

$ ccc_mprun ./a.out
cpu_bind=MASK - curie1139, task  5  5 [18761]: mask 0x40404040 set
cpu_bind=MASK - curie1139, task  0  0 [18756]: mask 0x1010101 set
cpu_bind=MASK - curie1139, task  1  1 [18757]: mask 0x10101010 set
cpu_bind=MASK - curie1139, task  6  6 [18762]: mask 0x8080808 set
cpu_bind=MASK - curie1139, task  4  4 [18760]: mask 0x4040404 set
cpu_bind=MASK - curie1139, task  3  3 [18759]: mask 0x20202020 set
cpu_bind=MASK - curie1139, task  2  2 [18758]: mask 0x2020202 set
cpu_bind=MASK - curie1139, task  7  7 [18763]: mask 0x80808080 set

We can see that the MPI rank 0 process is launched over cores 0, 8, 16 and 24 of the node. These cores are all located on the node's first socket.

Remark: with the -c option, SLURM tries to gather the cores as much as possible to obtain the best performance. In the previous example, all the cores of an MPI process are located on the same socket.

Another example:

$ ccc_mprun -n 1 -c 32 -E '--cpu_bind=verbose' ./a.out
cpu_bind=MASK - curie1017, task  0  0 [34710]: mask 0xffffffff set

Here the process is not bound to a single core and can run over all the cores of the node.

BullxMPI

BullxMPI has its own process management policy. To use it, you first have to disable SLURM's process management policy by adding the directive #MSUB -E '--cpu_bind=none'. You can then use the BullxMPI launcher mpirun:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun -np 32 ./a.out

Note: in this example, the BullxMPI process management policy can only act on the 32 cores allocated by SLURM.

The default BullxMPI process management policy is:

the processes are not bound;
the processes can run on all cores;
the default distribution is block by core and by node.

The option --report-bindings gives you a report about the binding of the processes before the run:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --report-bindings --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out
And here is the output:

+ mpirun --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],3] to socket 1 cpus 22222222
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],4] to socket 2 cpus 44444444
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],5] to socket 2 cpus 44444444
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],6] to socket 3 cpus 88888888
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],7] to socket 3 cpus 88888888
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],0] to socket 0 cpus 11111111
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],1] to socket 0 cpus 11111111
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],2] to socket 1 cpus 22222222

In the following paragraphs, we present the different possibilities of process distribution and binding. These options can be mixed (when possible).

Remark: the following examples use a whole Curie fat node. We reserve 32 cores with #MSUB -n 32 and #MSUB -x in order to have all the cores of the node and to do what we want with them. These are only examples for simple cases; in other cases, there may be conflicts with SLURM.

Process distribution

Block distribution by core:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bycore -np 32 ./a.out

Cyclic distribution by socket:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bysocket -np 32 ./a.out

Cyclic distribution by node:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -N 16                       # Number of nodes
#MSUB -x                          # Require exclusive nodes
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bynode -np 32 ./a.out

Process binding

No binding:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bind-to-none -np 32 ./a.out

Core binding:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bind-to-core -np 32 ./a.out
Socket binding (the process and its threads can run on all cores of a socket):

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bind-to-socket -np 32 ./a.out

You can specify the number of cores to assign to each MPI process:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out

Here we assign 4 cores per MPI process.

Manual process management

BullxMPI gives you the possibility to assign your processes manually through a hostfile and a rankfile. An example:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

hostname > hostfile.txt
echo "rank 0=${HOSTNAME} slot=0,1,2,3"     > rankfile.txt
echo "rank 1=${HOSTNAME} slot=8,10,12,14"  >> rankfile.txt
echo "rank 2=${HOSTNAME} slot=16,17,22,23" >> rankfile.txt
echo "rank 3=${HOSTNAME} slot=19,20,21,31" >> rankfile.txt
mpirun --hostfile hostfile.txt --rankfile rankfile.txt -np 4 ./a.out

There are several steps in this example:

You create a hostfile (here hostfile.txt) which contains the hostnames of all the nodes your run will use (for a multi-node job, see the sketch below).
You create a rankfile (here rankfile.txt) which assigns to each MPI rank the cores it can run on. In our example, the rank 0 process will have cores 0, 1, 2 and 3 as affinity, and so on. Be careful: the numbering of the cores differs from the hwloc output; on a Curie fat node, the first eight cores are on the first socket (socket 0), and so on.
You launch mpirun, specifying the hostfile and the rankfile.
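The example above runs on a single node, so hostname is enough to build the hostfile. For a multi-node reservation, a possible sketch (assuming the standard scontrol command and the SLURM_NODELIST variable are available inside the job) expands the allocated node list instead:

# expand the compressed SLURM node list (e.g. curie[1000-1003]) into one hostname per line
scontrol show hostnames ${SLURM_NODELIST} > hostfile.txt
cat hostfile.txt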
Using GPU

Two sequential GPU runs on a single hybrid node

To launch two separate sequential GPU runs on a single hybrid node, you have to set the environment variable CUDA_VISIBLE_DEVICES, which selects the GPUs to be used.

First, create a script to launch the binaries:

$ cat launch_exe.sh
#!/bin/bash
set -x
# the first process will see only the first GPU and the second process only the second GPU
export CUDA_VISIBLE_DEVICES=${SLURM_PROCID}
if [ $SLURM_PROCID -eq 0 ]
then
  ./bin_1 > job_${SLURM_PROCID}.out
fi
if [ $SLURM_PROCID -eq 1 ]
then
  ./bin_2 > job_${SLURM_PROCID}.out
fi

/!\ To work correctly, the two binaries have to be sequential (not using MPI).

Then run your script, making sure to submit two MPI processes with 4 cores per process:

$ cat multi_jobs_gpu.sh
#!/bin/bash
#MSUB -r jobs_gpu
#MSUB -n 2                        # 2 tasks
#MSUB -N 1                        # 1 node
#MSUB -c 4                        # each task takes 4 cores
#MSUB -q hybrid
#MSUB -T 1800
#MSUB -o multi_jobs_gpu_%I.out
#MSUB -e multi_jobs_gpu_%I.out

set -x
cd $BRIDGE_MSUB_PWD
export OMP_NUM_THREADS=4
# -E '--wait=0' tells SLURM not to kill the job if one of the two processes terminates before the other
ccc_mprun -E '--wait=0' -n 2 -c 4 ./launch_exe.sh

This way, the first process is located on the first CPU socket and the second process on the second CPU socket (each socket is linked to a GPU).

$ ccc_msub multi_jobs_gpu.sh

Profiling

PAPI

PAPI is an API which allows you to retrieve hardware counters from the CPU. Here is an example in Fortran which counts the floating point operations of a matrix DAXPY:

program main
  implicit none
  include 'f90papi.h'

  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 10
  double precision, dimension(size,size) :: A,B,C
  integer :: i,j,n
  ! PAPI variables
  integer, parameter :: max_event = 1
  integer, dimension(max_event) :: event
  integer :: num_events, retval
  integer(kind=8), dimension(max_event) :: values

  ! Init PAPI
  call PAPIf_num_counters( num_events )
  print *, 'Number of hardware counters supported: ', num_events
  call PAPIf_query_event(PAPI_FP_INS, retval)
  if (retval .NE. PAPI_OK) then
    event(1) = PAPI_TOT_INS
  else
    ! Total floating point operations
    event(1) = PAPI_FP_INS
  end if

  ! Init matrices
  do i=1,size
    do j=1,size
      C(i,j) = real(i+j,8)
      B(i,j) = -i+0.1*j
    end do
  end do

  ! Set up counters
  num_events = 1
  call PAPIf_start_counters( event, num_events, retval)
  ! Clear the counter values
  call PAPIf_read_counters(values, num_events, retval)

  ! DAXPY
  do n=1,ntimes
    do i=1,size
      do j=1,size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do

  ! Stop the counters and put the results in the array values
  call PAPIf_stop_counters(values, num_events, retval)

  ! Print results
  if (event(1) .EQ. PAPI_TOT_INS) then
    print *, 'TOT Instructions: ', values(1)
  else
    print *, 'FP Instructions: ', values(1)
  end if

end program main

To compile, you have to load the PAPI module:

bash-4.00 $ module load papi/4.1.3
bash-4.00 $ ifort -I${PAPI_INC_DIR} papi.f90 ${PAPI_LIBS}
bash-4.00 $ ./a.out
 Number of hardware counters supported:            7
 FP Instructions:              10046163

To get the list of available hardware counters, you can type the papi_avail command.
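Before instrumenting a code, it can be useful to check that the counter you want is actually available on the node. A minimal sketch filtering the papi_avail output:

module load papi/4.1.3
# list the floating point related events and whether they are available
papi_avail | grep -i FP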
This library can also retrieve the MFLOPS of a given region of your code:

program main
  implicit none
  include 'f90papi.h'

  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 100
  double precision, dimension(size,size) :: A,B,C
  integer :: i,j,n
  ! PAPI variables
  integer :: retval
  real(kind=4) :: proc_time, mflops, real_time
  integer(kind=8) :: flpins

  ! Init PAPI
  retval = PAPI_VER_CURRENT
  call PAPIf_library_init(retval)
  if ( retval .NE. PAPI_VER_CURRENT) then
    print*, 'PAPI_library_init', retval
  end if
  call PAPIf_query_event(PAPI_FP_INS, retval)

  ! Init matrices
  do i=1,size
    do j=1,size
      C(i,j) = real(i+j,8)
      B(i,j) = -i+0.1*j
    end do
  end do

  ! Set up counter
  call PAPIf_flips( real_time, proc_time, flpins, mflops, retval )

  ! DAXPY
  do n=1,ntimes
    do i=1,size
      do j=1,size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do

  ! Collect the data into the variables passed in
  call PAPIf_flips( real_time, proc_time, flpins, mflops, retval )

  ! Print results
  print *, 'Real_time: ', real_time
  print *, ' Proc_time: ', proc_time
  print *, ' Total flpins: ', flpins
  print *, ' MFLOPS: ', mflops

end program main

And the output:

bash-4.00 $ module load papi/4.1.3
bash-4.00 $ ifort -I${PAPI_INC_DIR} papi_flops.f90 ${PAPI_LIBS}
bash-4.00 $ ./a.out
 Real_time:   6.1250001E-02
 Proc_time:   5.1447589E-02
 Total flpins:             100056592
 MFLOPS:    1944.826

If you need more details, you can contact us or visit the PAPI website.

VampirTrace/Vampir

VampirTrace is a library which lets you profile your parallel code by collecting traces during the execution of the program. We present here an introduction to Vampir/VampirTrace.

Basics

First, you must compile your code with the VampirTrace compiler wrappers. In order to use VampirTrace, you need to load the vampirtrace module:

bash-4.00 $ module load vampirtrace
bash-4.00 $ vtcc -c prog.c
bash-4.00 $ vtcc -o prog.exe prog.o

The available compilers are:

vtcc : C compiler
vtc++, vtCC and vtcxx : C++ compilers
vtf77 and vtf90 : Fortran compilers

To compile an MPI code, you should type:

bash-4.00 $ vtcc -vt:cc mpicc -g -c prog.c
bash-4.00 $ vtcc -vt:cc mpicc -g -o prog.exe prog.o

For the other languages you have:

vtcc -vt:cc mpicc : MPI C compiler
vtc++ -vt:cxx mpic++, vtCC -vt:cxx mpiCC and vtcxx -vt:cxx mpicxx : MPI C++ compilers
vtf77 -vt:f77 mpif77 and vtf90 -vt:f90 mpif90 : MPI Fortran compilers

By default, the VampirTrace wrappers use the Intel compilers. To switch to another compiler, you can use the same method as for MPI:

bash-4.00 $ vtcc -vt:cc gcc -O2 -c prog.c
bash-4.00 $ vtcc -vt:cc gcc -O2 -o prog.exe prog.o

To profile an OpenMP or a hybrid OpenMP/MPI application, you should add the corresponding OpenMP option of the compiler:

bash-4.00 $ vtcc -openmp -O2 -c prog.c
bash-4.00 $ vtcc -openmp -O2 -o prog.exe prog.o

Then you can submit your job. Here is an example of a submission script:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./prog.exe

At the end of the execution, the program generates several profiling files:

bash-4.00 $ ls
a.out  a.out.0.def.z  a.out.1.events.z ... a.out.otf

To visualize those files, you must load the vampir module:

bash-4.00 $ module load vampir
bash-4.00 $ vampir a.out.otf

[Figure: Vampir window]

If you need more information, you can contact us.

Tips

VampirTrace allocates a buffer to store its profiling information. If the buffer is full, VampirTrace flushes the buffer to disk.
By default, the size of this buffer is 32MB per process and the maximum number of flushes is one. You can increase (or reduce) the size of the buffer, keeping in mind that your code will then use more memory. To change the size, you have to set an environment variable:

export VT_BUFFER_SIZE=64M
ccc_mprun ./prog.exe

In this example, the buffer is set to 64 MB. We can also increase the maximum number of flushes:

export VT_MAX_FLUSHES=10
ccc_mprun ./prog.exe

If the value of VT_MAX_FLUSHES is 0, the number of flushes is unlimited.

By default, VampirTrace first stores the profiling information in a directory (/tmp) local to each process. These files can be very large and fill up that directory, so you should redirect this local directory to another location:

export VT_PFORM_LDIR=$SCRATCHDIR

There are more VampirTrace variables which can be used; see the User Manual for more details.
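These variables are simply exported in the submission script before launching the instrumented binary. A sketch combining the three settings above (prog.exe and the chosen values are only examples):

#!/bin/bash
#MSUB -r MyJob_VT                 # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
export VT_BUFFER_SIZE=64M            # per-process trace buffer
export VT_MAX_FLUSHES=10             # allow up to 10 flushes to disk
export VT_PFORM_LDIR=$SCRATCHDIR     # write intermediate traces to the scratch file system
ccc_mprun ./prog.exe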
Vampirserver

Traces generated by VampirTrace can be very large, and Vampir can be very slow when visualizing them. Vampir provides Vampirserver: a parallel program which uses compute power to accelerate the Vampir visualization.

First, you have to submit a job which launches Vampirserver on Curie nodes:

$ cat vampirserver.sh
#!/bin/bash
#MSUB -r vampirserver             # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o vampirserver_%I.o        # Standard output. %I is the job id
#MSUB -e vampirserver_%I.e        # Error output. %I is the job id

ccc_mprun vngd

$ module load vampir
$ ccc_msub vampirserver.sh

When the job is running, you will obtain this output:

$ ccc_mpp
USER  ACCOUNT  BATCHID  NCPU  QUEUE  PRIORITY  STATE  RLIM   RUN/START  SUSP  OLD   NAME          NODES
toto  genXXX   234481   32    large  210332    RUN    30.0m  1.3m             1.3m  vampirserver  curie1352

$ ccc_mpeek 234481
Found license file: /usr/local/vampir-7.3/bin/lic.dat
Running 31 analysis processes... (abort with Ctrl-C or vngd-shutdown)
Server listens on: curie1352:30000

In our example, the Vampirserver master node is curie1352 and the port to connect to is 30000. You can then launch Vampir on a front node. Instead of clicking on Open, click on Remote Open:

[Figure: connecting to Vampirserver]

Fill in the server and the port; you will be connected to Vampirserver. You can then open an OTF file and visualize it.

Notes:

You can ask for any number of processors: the more you use, the faster the analysis will be if your profiling files are big. But be careful, it consumes your computing time.
Do not forget to delete the Vampirserver job after your analysis.

CUDA profiling

VampirTrace can collect profiling data from CUDA programs. As previously, you have to replace the compilers by the VampirTrace wrappers; the NVCC compiler should be replaced by vtnvcc. Then, when you run your program, you have to set an environment variable:

export VT_CUDARTTRACE=yes
ccc_mprun ./prog.exe

Scalasca

Scalasca is a set of software which lets you profile your parallel code by collecting traces during the execution of the program. This software is a kind of parallel gprof with more information. We present here an introduction to Scalasca.

Standard utilization

First, you must compile your code by adding the scalasca command in front of your usual compiler call. In order to use Scalasca, you need to load the scalasca module:

bash-4.00 $ module load scalasca
bash-4.00 $ scalasca -instrument mpicc -c prog.c
bash-4.00 $ scalasca -instrument mpicc -o prog.exe prog.o

or, for Fortran:

bash-4.00 $ module load scalasca
bash-4.00 $ scalasca -instrument mpif90 -c prog.f90
bash-4.00 $ scalasca -instrument mpif90 -o prog.exe prog.o

You can compile OpenMP programs:

bash-4.00 $ scalasca -instrument ifort -openmp -c prog.f90
bash-4.00 $ scalasca -instrument ifort -openmp -o prog.exe prog.o

You can also profile hybrid programs:

bash-4.00 $ scalasca -instrument mpif90 -openmp -O3 -c prog.f90
bash-4.00 $ scalasca -instrument mpif90 -openmp -O3 -o prog.exe prog.o

Then you can submit your job. Here is an example of a submission script:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
export SCAN_MPI_LAUNCHER=ccc_mprun
scalasca -analyze ccc_mprun ./prog.exe

At the end of the execution, the program generates a directory which contains the profiling files:

bash-4.00 $ ls
epik_* ...

To visualize those files, you can type:

bash-4.00 $ scalasca -examine epik_*

[Figure: Scalasca]

If you need more information, you can contact us.

Scalasca + Vampir

Scalasca can generate OTF trace files in order to visualize them with Vampir. To activate the traces, add the -t option to scalasca when you launch the run. Here is the previous script, modified:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
scalasca -analyze -t mpirun ./prog.exe

At the end of the execution, the program generates a directory which contains the profiling files:

bash-4.00 $ ls
epik_* ...

You can examine these files as previously. To generate the OTF trace files, you can type:

bash-4.00 $ ls
epik_*
bash-4.00 $ elg2otf epik_*

This generates an OTF file under the epik_* directory. To visualize it, you can load Vampir:

bash-4.00 $ module load vampir
bash-4.00 $ vampir epik_*/a.otf

Scalasca + PAPI

Scalasca can retrieve the hardware counters with PAPI. For example, if you want to retrieve the number of floating point operations:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
export EPK_METRICS=PAPI_FP_OPS
scalasca -analyze mpirun ./prog.exe

The number of floating point operations will then appear in the profile when you visualize it. You can retrieve at most 3 hardware counters at the same time on Curie.
The syntax is:

export EPK_METRICS="PAPI_FP_OPS:PAPI_TOT_CYC"

Paraver

Paraver is a flexible performance visualization and analysis tool that can be used to analyze MPI, OpenMP, MPI+OpenMP, hardware counter profiles, operating system activity and many other things you may think of. In order to use the Paraver tools, you need to load the paraver module:

bash-4.00 $ module load paraver
bash-4.00 $ module show paraver
-------------------------------------------------------------------
/usr/local/ccc_users_env/modules/development/paraver/4.1.1:

module-whatis    Paraver
conflict         paraver
prepend-path     PATH /usr/local/paraver-4.1.1/bin
prepend-path     PATH /usr/local/extrae-2.1.1/bin
prepend-path     LD_LIBRARY_PATH /usr/local/paraver-4.1.1/lib
prepend-path     LD_LIBRARY_PATH /usr/local/extrae-2.1.1/lib
module           load papi
setenv           PARAVER_HOME /usr/local/paraver-4.1.1
setenv           EXTRAE_HOME /usr/local/extrae-2.1.1
setenv           EXTRAE_LIB_DIR /usr/local/extrae-2.1.1/lib
setenv           MPI_TRACE_LIBS /usr/local/extrae-2.1.1/lib/libmpitrace.so
-------------------------------------------------------------------

Trace generation

The simplest way to activate MPI instrumentation of your code is to load the tracing library dynamically before execution. This can be done by adding the following line to your submission script:

export LD_PRELOAD=$LD_PRELOAD:$MPI_TRACE_LIBS

The instrumentation process is managed by Extrae and also needs a configuration file in XML format. You have to add the next line to your submission script:

export EXTRAE_CONFIG_FILE=./extrae_config_file.xml

All the details about how to write a configuration file are available in Extrae's manual, which you can find at $EXTRAE_HOME/doc/user-guide.pdf. You will also find many example scripts in the $EXTRAE_HOME/examples/LINUX file tree.

You can also add some manual instrumentation to your code to record specific user events. This is mandatory if you want to see your own functions in the Paraver timelines.

If the trace generation succeeds during the computation, you will find in your working directory a directory named set-0 containing some .mpit files, as well as a TRACE.mpits file which lists all these files.

Converting traces to Paraver format

Extrae provides a tool named mpi2prv to convert the mpit files into a .prv file which can be read by Paraver. Since this can be a long operation, we recommend the parallel version of this tool, mpimpi2prv. You will need fewer processes than previously used for the computation. An example script is given below:

bash-4.00$ cat rebuild.sh
#!/bin/bash
#MSUB -r merge
#MSUB -n 8
#MSUB -T 1800

set -x
cd $BRIDGE_MSUB_PWD
ccc_mprun mpimpi2prv -syn -e path_to_your_binary -f TRACE.mpits -o file_to_be_analysed.prv

Launching Paraver

You now just have to launch "paraver file_to_be_analysed.prv". As Paraver may require a lot of memory and CPU, it may be better to launch it through a submission script (do not forget then to activate the -X option of ccc_msub). For analyzing your data you will need some configuration files, available in Paraver's browser under the $PARAVER_HOME/cfgs directory.

[Figure: Paraver window]
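A sketch of such a submission script (the trace name is the one produced in the previous step; remember to submit it with the -X option so that the display is forwarded):

$ cat paraver.sh
#!/bin/bash
#MSUB -r paraver                  # Request name
#MSUB -n 1                        # Paraver itself runs on a single core
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o paraver_%I.o             # Standard output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
module load paraver
paraver file_to_be_analysed.prv

$ ccc_msub -X paraver.sh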