Contents

1 Curie's advanced usage manual
2 Optimization
  2.1 Compilation options
    2.1.1 Intel
      2.1.1.1 Intel Sandy Bridge processors
    2.1.2 GNU
3 Submission
  3.1 Choosing or excluding nodes
4 MPI
  4.1 Embarrassingly parallel jobs and MPMD jobs
  4.2 BullxMPI
    4.2.1 MPMD jobs
    4.2.2 Tuning BullxMPI
    4.2.3 Optimizing with BullxMPI
    4.2.4 Debugging with BullxMPI
5 Process distribution, affinity and binding
  5.1 Introduction
    5.1.1 Hardware topology
    5.1.2 Definitions
    5.1.3 Process distribution
    5.1.4 Why is affinity important for improving performance?
    5.1.5 CPU affinity mask
  5.2 SLURM
    5.2.1 Process distribution
      5.2.1.1 Curie hybrid node
    5.2.2 Process binding
  5.3 BullxMPI
    5.3.1 Process distribution
    5.3.2 Process binding
    5.3.3 Manual process management
6 Using GPU
  6.1 Two sequential GPU runs on a single hybrid node
7 Profiling
  7.1 PAPI
  7.2 VampirTrace/Vampir
    7.2.1 Basics
    7.2.2 Tips
    7.2.3 Vampirserver
    7.2.4 CUDA profiling
  7.3 Scalasca
    7.3.1 Standard utilization
    7.3.2 Scalasca + Vampir
    7.3.3 Scalasca + PAPI
  7.4 Paraver
    7.4.1 Trace generation
    7.4.2 Converting traces to Paraver format
    7.4.3 Launching Paraver

Curie's advanced usage manual

If you have suggestions or remarks, please contact us: hotline.tgcc@cea.fr

Optimization

Compilation options

Compilers provide many options to optimize a code. These options are described in the following sections.

Intel

-opt_report : generates a report describing the optimizations performed, written to stderr (-O3 required).
-ip, -ipo : inter-procedural optimizations (single-file and multi-file). The command xiar must be used instead of ar to generate a static library from objects compiled with -ipo.
-fast : default high optimization level (-O3 -ipo -static). Careful: this option cannot be used with MPI, because the MPI context needs to call libraries which only exist as dynamic versions, and this is incompatible with -static. Replace -fast by -O3 -ipo.
-ftz : flushes denormalized numbers to zero at runtime.
-fp-relaxed : relaxed optimization of mathematical functions. Leads to a small loss of accuracy.
-pad : enables the modification of memory positions (padding; ifort only).

Some options enable specific instruction sets of Intel processors in order to optimize the code. These options are compatible with most Intel processors; the compiler generates these instructions only if the processor supports them.

-xSSE4.2 : may generate Intel SSE4 Efficient Accelerated String and Text Processing instructions, as well as Intel SSE4 Vectorizing Compiler and Media Accelerator, SSSE3, SSE3, SSE2 and SSE instructions.
-xSSE4.1 : may generate Intel SSE4 Vectorizing Compiler and Media Accelerator, SSSE3, SSE3, SSE2 and SSE instructions.
-xSSSE3 : may generate Intel SSSE3, SSE3, SSE2 and SSE instructions.
-xSSE3 : may generate Intel SSE3, SSE2 and SSE instructions.
-xSSE2 : may generate Intel SSE2 and SSE instructions.
-xHost : applies one of the previous options depending on the processor on which the compilation is performed. This option is recommended for optimizing your code.

None of these options are used by default. The SSE instructions use the vectorization capability of Intel processors.
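For example, a typical optimized build could combine the generic optimization options with -xHost (the source and binary names below are only placeholders):

# Intel Fortran: high optimization plus the vectorization level of the build host
ifort -O3 -xHost -ip -o prog.exe prog.f90
# MPI code: replace -fast by -O3 -ipo as explained above
mpif90 -O3 -ipo -o prog_mpi.exe prog_mpi.f90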
Intel Sandy Bridge processors

Curie thin nodes use the latest Intel processors, based on the Sandy Bridge architecture. This architecture provides new vectorization instructions called AVX (Advanced Vector eXtensions). The option -xAVX generates code specific to the Curie thin nodes. Be careful: a code generated with the -xAVX option runs only on Intel Sandy Bridge processors. Otherwise, you will get this error message:

Fatal Error: This program was not built to run in your system.
Please verify that both the operating system and the processor support Intel(R) AVX.

Curie login nodes are Curie large nodes with Nehalem-EX processors. AVX code can be generated on these nodes through cross-compilation by adding the -xAVX option. On Curie large nodes, the -xHost option will not generate AVX code. If you need to compile with -xHost, or if the installation requires some tests (like autotools/configure), you can submit a job which compiles on the Curie thin nodes.

GNU

Some options enable specific instruction sets of Intel processors in order to optimize the code's behavior. These options are compatible with most Intel processors; the compiler uses these instructions only if the processor supports them.

-mmmx / -mno-mmx : switch on or off the usage of the given instruction set.
-msse / -mno-sse : idem.
-msse2 / -mno-sse2 : idem.
-msse3 / -mno-sse3 : idem.
-mssse3 / -mno-ssse3 : idem.
-msse4.1 / -mno-sse4.1 : idem.
-msse4.2 / -mno-sse4.2 : idem.
-msse4 / -mno-sse4 : idem.
-mavx / -mno-avx : idem, for the Curie thin nodes partition only.
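For example (file names are again placeholders), a build intended for the Curie thin nodes partition can enable AVX explicitly, while a build for the other partitions would stop at SSE4.2:

# GNU compiler, thin nodes partition only: enable AVX
gcc -O2 -mavx -o prog.exe prog.c
# other partitions: restrict the generated code to SSE4.2
gcc -O2 -msse4.2 -o prog.exe prog.c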
Submission

Choosing or excluding nodes

SLURM provides the possibility to choose or exclude specific nodes in the reservation for your job.

To choose nodes:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '-w curie[1000-1003]'    # Include 4 nodes (curie1000 to curie1003)

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out

To exclude nodes:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '-x curie[1000-1003]'    # Exclude 4 nodes (curie1000 to curie1003)

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out

MPI

Embarrassingly parallel jobs and MPMD jobs

An embarrassingly parallel job is a job which launches independent processes. These processes need few or no communications.

An MPMD job is a parallel job which launches different executables over the processes. An MPMD job can be parallel with MPI and can perform many communications.

These two concepts are distinct, but we present them together because the way to launch them on Curie is similar. A simple example was already given on the Curie info page. In the following example, we use ccc_mprun to launch the job; srun can be used too.

We want to launch bin0 on MPI rank 0, bin1 on MPI rank 1 and bin2 on MPI rank 2. We first have to write a shell script which describes the topology of our job:

launch_exe.sh:

#!/bin/bash
if [ $SLURM_PROCID -eq 0 ]
then
  ./bin0
fi
if [ $SLURM_PROCID -eq 1 ]
then
  ./bin1
fi
if [ $SLURM_PROCID -eq 2 ]
then
  ./bin2
fi

We can then launch our job with 3 processes:

ccc_mprun -n 3 ./launch_exe.sh

The script launch_exe.sh must have execute permission. When ccc_mprun launches the job, it initializes some environment variables. Among them, SLURM_PROCID defines the current MPI rank.

BullxMPI

MPMD jobs

BullxMPI (or OpenMPI) jobs can be launched with the mpirun launcher. In this case, there are other ways to launch MPMD jobs (see the embarrassingly parallel jobs section). We take the same example as in that section. There are then two ways of launching MPMD jobs.

We do not need launch_exe.sh anymore; the job can be launched directly with the mpirun command:

mpirun -np 1 ./bin0 : -np 1 ./bin1 : -np 1 ./bin2

Alternatively, in launch_exe.sh, we can replace SLURM_PROCID by OMPI_COMM_WORLD_RANK:

launch_exe.sh:

#!/bin/bash
if [ ${OMPI_COMM_WORLD_RANK} -eq 0 ]
then
  ./bin0
fi
if [ ${OMPI_COMM_WORLD_RANK} -eq 1 ]
then
  ./bin1
fi
if [ ${OMPI_COMM_WORLD_RANK} -eq 2 ]
then
  ./bin2
fi

We can then launch our job with 3 processes:

mpirun -np 3 ./launch_exe.sh

Tuning BullxMPI

BullxMPI is based on OpenMPI. It can be tuned with parameters. The command ompi_info -a gives a list of all parameters and their descriptions:

curie50$ ompi_info -a
(...)
MCA mpi: parameter "mpi_show_mca_params" (current value: <none>, data source: default value)
         Whether to show all MCA parameter values during MPI_INIT or not (good for reproducibility
         of MPI jobs for debug purposes). Accepted values are all, default, file, api, and
         environment - or a comma delimited combination of them
(...)

These parameters can be modified with environment variables set before the ccc_mprun command. The corresponding environment variable has the form OMPI_MCA_xxxxx, where xxxxx is the parameter name.

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id
#MSUB -A paxxxx                   # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
export OMPI_MCA_mpi_show_mca_params=all
ccc_mprun ./a.out

Optimizing with BullxMPI

You can try these parameters in order to optimize BullxMPI:

export OMPI_MCA_mpi_leave_pinned=1

This setting improves the communication bandwidth if the code reuses the same buffers for communication during the execution.

export OMPI_MCA_btl_openib_use_eager_rdma=1

This parameter optimizes the latency for short messages on the InfiniBand network, but the code will use more memory.

Be careful: these parameters are not set by default, and they can influence the behaviour of your code.
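Before enabling these settings for a production run, you can check their default values in the ompi_info output and then export them in your submission script before the launch command. A minimal sketch (a.out stands for your own binary):

# check the current default value of a parameter
ompi_info -a | grep mpi_leave_pinned

# in the submission script, before the launch command
export OMPI_MCA_mpi_leave_pinned=1
export OMPI_MCA_btl_openib_use_eager_rdma=1
ccc_mprun ./a.out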
Debugging with BullxMPI

Sometimes, BullxMPI codes can hang in a collective communication for large jobs. If you find yourself in this case, you can try this parameter:

export OMPI_MCA_coll="^ghc,tuned"

This setting disables the optimized collective communications: it can slow down your code if it uses many collective operations.

Process distribution, affinity and binding

Introduction

Hardware topology

[Figure: hardware topology of a Curie fat node]

The hardware topology is the organization of cores, processors, sockets and memory in a node. The previous image was created with hwloc. You can have access to hwloc on Curie with the command module load hwloc.

Definitions

We define here some vocabulary:

Binding: a Linux process can be bound (or stuck) to one or many cores. It means that a process and its threads can run only on a given selection of cores. For example, a process which is bound to a socket of a Curie fat node can run on any of the 8 cores of that processor.

Affinity: the policy of resource management (cores and memory) for processes.

Distribution: the distribution of MPI processes describes how these processes are spread across the cores, sockets or nodes.

On Curie, the default behaviour for distribution, affinity and binding is managed by SLURM, more precisely by the ccc_mprun command.

Process distribution

We present here some examples of MPI process distributions.

block or round: this is the standard distribution. From the SLURM manpage: the block distribution method distributes tasks to a node such that consecutive tasks share a node. For example, consider an allocation of two nodes, each with 8 cores. A block distribution request distributes the tasks with tasks 0 to 7 on the first node and tasks 8 to 15 on the second node.

[Figure: block distribution by core]

cyclic by socket: from the SLURM manpage, the cyclic distribution method distributes tasks to a socket such that consecutive tasks are distributed over consecutive sockets (in a round-robin fashion). For example, consider an allocation of two nodes, each with 2 sockets of 4 cores. A cyclic distribution by socket places tasks 0,2,4,6 on the first socket and tasks 1,3,5,7 on the second socket. In the following image, the distribution is cyclic by socket and block by node.

[Figure: cyclic distribution by socket]

cyclic by node: from the SLURM manpage, the cyclic distribution method distributes tasks to a node such that consecutive tasks are distributed over consecutive nodes (in a round-robin fashion). For example, consider an allocation of two nodes, each with 2 sockets of 4 cores. A cyclic distribution by node places tasks 0,2,4,6,8,10,12,14 on the first node and tasks 1,3,5,7,9,11,13,15 on the second node. In the following image, the distribution is cyclic by node and block by socket.

[Figure: cyclic distribution by node (block by socket)]
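The topology and distribution figures above were produced with hwloc. To inspect the topology of a node yourself, a minimal sketch (assuming the hwloc module provides the standard lstopo tool):

module load hwloc
# print the topology of the current node: sockets, cores, caches and NUMA nodes
lstopo
# or run it on a compute node of the target partition, e.g. the hybrid queue
ccc_mprun -q hybrid -n 1 lstopo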
Why is affinity important for improving performance?

Curie nodes are NUMA (Non-Uniform Memory Access) nodes. It means that it takes longer to access some regions of memory than others, because all memory regions are not physically on the same bus.

[Figure: NUMA node - Curie hybrid node]

In this picture, we can see that if a piece of data is in memory module 0, a process running on the second socket (like the 4th process) takes more time to access it. We can introduce the notion of local data vs remote data: if we consider a process running on socket 0, a piece of data is local if it is in memory module 0 and remote if it is in memory module 1.

We can then deduce the reasons why tuning the process affinity is important:

Data locality improves performance. If your code uses shared memory (like pthreads or OpenMP), the best choice is to regroup your threads on the same socket. The shared data should be local to the socket and, moreover, the data will potentially stay in the processor's cache.

System processes can interrupt your process running on a core. If your process is not bound to a core or to a socket, it can be moved to another core or to another socket. In this case, all the data of this process has to be moved with it, which can take some time.

MPI communications are faster between processes which are on the same socket. If you know that two processes communicate a lot, you can bind them to the same socket.

On Curie hybrid nodes, the GPUs are connected to buses which are local to a socket. A process takes longer to access a GPU which is not connected to its socket.

[Figure: NUMA node - Curie hybrid node with GPU]

For all these reasons, it is better to know the NUMA configuration of the Curie nodes (fat, hybrid and thin). In the following sections, we present some ways to tune the process affinity of your jobs.

CPU affinity mask

The affinity of a process is defined by a mask. A mask is a binary value whose length is the number of cores available on a node. For example, Curie hybrid nodes have 8 cores, so the binary mask has 8 digits. Each digit is 0 or 1, and the process can run only on the cores whose digit is 1. A binary mask must be read from right to left. For example, a process which runs on cores 0, 4, 6 and 7 has the binary affinity mask:

11010001

SLURM and BullxMPI use these masks, but converted to hexadecimal. To convert a binary value to hexadecimal:

$ echo "obase=16;ibase=2;11010001" | bc
D1

To convert a hexadecimal value to binary:

$ echo "obase=2;ibase=16;D1" | bc
11010001

The numbering of the cores is the PU number in the hwloc output.
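As a quick cross-check of the example above, the mask of an arbitrary set of cores can be computed directly in the shell (a small illustrative helper, not a Curie tool):

# build the affinity mask for cores 0, 4, 6 and 7
mask=0
for core in 0 4 6 7; do
  mask=$(( mask | (1 << core) ))
done
printf "binary: %s  hexadecimal: 0x%x\n" "$(echo "obase=2;$mask" | bc)" "$mask"
# prints: binary: 11010001  hexadecimal: 0xd1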
SLURM

SLURM is the default launcher for jobs on Curie. SLURM manages the processes even for sequential jobs, and we recommend you to use ccc_mprun. By default, SLURM binds each process to a core, and the distribution is block by node and by core.

The option -E '--cpu_bind=verbose' for ccc_mprun gives you a report about the binding of the processes before the run:

$ ccc_mprun -E '--cpu_bind=verbose' -q hybrid -n 8 ./a.out
cpu_bind=MASK - curie7054, task  3  3 [3534]: mask 0x8 set
cpu_bind=MASK - curie7054, task  0  0 [3531]: mask 0x1 set
cpu_bind=MASK - curie7054, task  1  1 [3532]: mask 0x2 set
cpu_bind=MASK - curie7054, task  2  2 [3533]: mask 0x4 set
cpu_bind=MASK - curie7054, task  4  4 [3535]: mask 0x10 set
cpu_bind=MASK - curie7054, task  5  5 [3536]: mask 0x20 set
cpu_bind=MASK - curie7054, task  7  7 [3538]: mask 0x80 set
cpu_bind=MASK - curie7054, task  6  6 [3537]: mask 0x40 set

In this example, we can see that process 5 has 0x20 as hexadecimal mask, i.e. 00100000 in binary: the 5th process will run only on core 5.

Process distribution

To change the default distribution of processes, you can use the option -E '-m' for ccc_mprun. With SLURM, there are two levels of process distribution: node and socket.

Node block distribution:

ccc_mprun -E '-m block' ./a.out

Node cyclic distribution:

ccc_mprun -E '-m cyclic' ./a.out

By default, the distribution over the sockets is block. In the following examples of socket distribution, the node distribution is block.

Socket block distribution:

ccc_mprun -E '-m block:block' ./a.out

Socket cyclic distribution:

ccc_mprun -E '-m block:cyclic' ./a.out

Curie hybrid node

On a Curie hybrid node, each GPU is connected to a socket (see the previous picture). It takes longer for a process to access a GPU if the process is not on the same socket as the GPU. By default, the distribution is block by core: MPI rank 0 is then located on the first socket and MPI rank 1 is on the first socket too. Most GPU codes assign GPU 0 to MPI rank 0 and GPU 1 to MPI rank 1; in this case, the bandwidth between MPI rank 1 and GPU 1 is not optimal. If your code does this, you should, in order to obtain the best performance:

use the block:cyclic distribution;
or, if you intend to use only 2 MPI processes per node, reserve 4 cores per process with the directive #MSUB -c 4. The two processes will then be placed on two different sockets (see the sketch below).
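For instance, a GPU code running one MPI process per GPU on a hybrid node could reserve 4 cores per process, so that each rank sits on its own socket next to its GPU. This is only a sketch: the request name, project ID and binary name (gpu_code.exe) are placeholders.

#!/bin/bash
#MSUB -r GpuJob                   # Request name
#MSUB -n 2                        # 2 MPI processes, one per GPU
#MSUB -N 1                        # on a single hybrid node
#MSUB -c 4                        # 4 cores per process: one process per socket
#MSUB -q hybrid                   # hybrid (GPU) partition
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o gpu_example_%I.o         # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./gpu_code.exe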
Process binding

By default, processes are bound to a core. For multi-threaded jobs, processes create threads, and these threads are bound to the assigned core. To allow these threads to use other cores, SLURM provides the option -c to assign several cores to a process.

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 8                        # Number of tasks to use
#MSUB -c 4                        # Assign 4 cores per process
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID

export OMP_NUM_THREADS=4
ccc_mprun ./a.out

In this example, our hybrid OpenMP/MPI code runs on 8 MPI processes and each process uses 4 OpenMP threads.

We give here an example of the output with the verbose binding option:

$ ccc_mprun ./a.out
cpu_bind=MASK - curie1139, task  5  5 [18761]: mask 0x40404040 set
cpu_bind=MASK - curie1139, task  0  0 [18756]: mask 0x1010101 set
cpu_bind=MASK - curie1139, task  1  1 [18757]: mask 0x10101010 set
cpu_bind=MASK - curie1139, task  6  6 [18762]: mask 0x8080808 set
cpu_bind=MASK - curie1139, task  4  4 [18760]: mask 0x4040404 set
cpu_bind=MASK - curie1139, task  3  3 [18759]: mask 0x20202020 set
cpu_bind=MASK - curie1139, task  2  2 [18758]: mask 0x2020202 set
cpu_bind=MASK - curie1139, task  7  7 [18763]: mask 0x80808080 set

We can see that the MPI rank 0 process is launched over cores 0, 8, 16 and 24 of the node. These cores are all located on the node's first socket.

Remark: with the -c option, SLURM tries to gather the cores as much as possible to obtain the best performance. In the previous example, all the cores of an MPI process are located on the same socket.

Another example:

$ ccc_mprun -n 1 -c 32 -E '--cpu_bind=verbose' ./a.out
cpu_bind=MASK - curie1017, task  0  0 [34710]: mask 0xffffffff set

Here the process is not bound to a single core and can run over all the cores of the node.

BullxMPI

BullxMPI has its own process management policy. To use it, you first have to disable SLURM's process management policy by adding the directive #MSUB -E '--cpu_bind=none'. You can then use the BullxMPI launcher mpirun:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun -np 32 ./a.out

Note: in this example, the BullxMPI process management policy can only act on the 32 cores allocated by SLURM.

The default BullxMPI process management policy is:

the processes are not bound;
the processes can run on all cores;
the default distribution is block by core and by node.

The option --report-bindings gives you a report about the binding of the processes before the run:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --report-bindings --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out
And here is the output:

+ mpirun --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],3] to socket 1 cpus 22222222
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],4] to socket 2 cpus 44444444
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],5] to socket 2 cpus 44444444
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],6] to socket 3 cpus 88888888
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],7] to socket 3 cpus 88888888
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],0] to socket 0 cpus 11111111
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],1] to socket 0 cpus 11111111
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],2] to socket 1 cpus 22222222

In the following paragraphs, we present the different possibilities of process distribution and binding. These options can be mixed (when possible).

Remark: the following examples use a whole Curie fat node. We reserve 32 cores with #MSUB -n 32 and #MSUB -x in order to have all the cores of the node and to do what we want with them. These are only examples for simple cases; in other cases, there may be conflicts with SLURM.

Process distribution

Block distribution by core:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bycore -np 32 ./a.out

Cyclic distribution by socket:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bysocket -np 32 ./a.out

Cyclic distribution by node:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -N 16                       # Number of nodes
#MSUB -x                          # Require exclusive nodes
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bynode -np 32 ./a.out

Process binding

No binding:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bind-to-none -np 32 ./a.out

Core binding:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bind-to-core -np 32 ./a.out
Socket binding (the process and its threads can run on all cores of a socket):

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bind-to-socket -np 32 ./a.out

You can specify the number of cores to assign to each MPI process:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

mpirun --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out

Here we assign 4 cores per MPI process.

Manual process management

BullxMPI gives you the possibility to assign your processes manually through a hostfile and a rankfile. An example:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -x                          # Require an exclusive node
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID
#MSUB -E '--cpu_bind=none'        # Disable default SLURM binding

hostname > hostfile.txt
echo "rank 0=${HOSTNAME} slot=0,1,2,3"     > rankfile.txt
echo "rank 1=${HOSTNAME} slot=8,10,12,14"  >> rankfile.txt
echo "rank 2=${HOSTNAME} slot=16,17,22,23" >> rankfile.txt
echo "rank 3=${HOSTNAME} slot=19,20,21,31" >> rankfile.txt
mpirun --hostfile hostfile.txt --rankfile rankfile.txt -np 4 ./a.out

There are several steps in this example:

You create a hostfile (here hostfile.txt) which contains the hostnames of all the nodes your run will use (for a multi-node job, see the sketch below).
You create a rankfile (here rankfile.txt) which assigns to each MPI rank the cores it can run on. In our example, the rank 0 process will have cores 0, 1, 2 and 3 as affinity, and so on. Be careful: the numbering of the cores differs from the hwloc output; on a Curie fat node, the first eight cores are on the first socket (socket 0), and so on.
You launch mpirun, specifying the hostfile and the rankfile.
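The example above runs on a single node, so hostname is enough to build the hostfile. For a multi-node reservation, a possible sketch (assuming the standard scontrol command and the SLURM_NODELIST variable are available inside the job) expands the allocated node list instead:

# expand the compressed SLURM node list (e.g. curie[1000-1003]) into one hostname per line
scontrol show hostnames ${SLURM_NODELIST} > hostfile.txt
cat hostfile.txt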
Using GPU

Two sequential GPU runs on a single hybrid node

To launch two separate sequential GPU runs on a single hybrid node, you have to set the environment variable CUDA_VISIBLE_DEVICES, which selects the GPUs to be used.

First, create a script to launch the binaries:

$ cat launch_exe.sh
#!/bin/bash
set -x
# the first process will see only the first GPU and the second process only the second GPU
export CUDA_VISIBLE_DEVICES=${SLURM_PROCID}
if [ $SLURM_PROCID -eq 0 ]
then
  ./bin_1 > job_${SLURM_PROCID}.out
fi
if [ $SLURM_PROCID -eq 1 ]
then
  ./bin_2 > job_${SLURM_PROCID}.out
fi

/!\ To work correctly, the two binaries have to be sequential (not using MPI).

Then run your script, making sure to submit two MPI processes with 4 cores per process:

$ cat multi_jobs_gpu.sh
#!/bin/bash
#MSUB -r jobs_gpu
#MSUB -n 2                        # 2 tasks
#MSUB -N 1                        # 1 node
#MSUB -c 4                        # each task takes 4 cores
#MSUB -q hybrid
#MSUB -T 1800
#MSUB -o multi_jobs_gpu_%I.out
#MSUB -e multi_jobs_gpu_%I.out

set -x
cd $BRIDGE_MSUB_PWD
export OMP_NUM_THREADS=4
# -E '--wait=0' tells SLURM not to kill the job if one of the two processes terminates before the other
ccc_mprun -E '--wait=0' -n 2 -c 4 ./launch_exe.sh

This way, the first process is located on the first CPU socket and the second process on the second CPU socket (each socket is linked to a GPU).

$ ccc_msub multi_jobs_gpu.sh

Profiling

PAPI

PAPI is an API which allows you to retrieve hardware counters from the CPU. Here is an example in Fortran which counts the floating point operations of a matrix DAXPY:

program main
  implicit none
  include 'f90papi.h'

  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 10
  double precision, dimension(size,size) :: A,B,C
  integer :: i,j,n
  ! PAPI variables
  integer, parameter :: max_event = 1
  integer, dimension(max_event) :: event
  integer :: num_events, retval
  integer(kind=8), dimension(max_event) :: values

  ! Init PAPI
  call PAPIf_num_counters( num_events )
  print *, 'Number of hardware counters supported: ', num_events
  call PAPIf_query_event(PAPI_FP_INS, retval)
  if (retval .NE. PAPI_OK) then
    event(1) = PAPI_TOT_INS
  else
    ! Total floating point operations
    event(1) = PAPI_FP_INS
  end if

  ! Init matrices
  do i=1,size
    do j=1,size
      C(i,j) = real(i+j,8)
      B(i,j) = -i+0.1*j
    end do
  end do

  ! Set up counters
  num_events = 1
  call PAPIf_start_counters( event, num_events, retval)
  ! Clear the counter values
  call PAPIf_read_counters(values, num_events, retval)

  ! DAXPY
  do n=1,ntimes
    do i=1,size
      do j=1,size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do

  ! Stop the counters and put the results in the array values
  call PAPIf_stop_counters(values, num_events, retval)

  ! Print results
  if (event(1) .EQ. PAPI_TOT_INS) then
    print *, 'TOT Instructions: ', values(1)
  else
    print *, 'FP Instructions: ', values(1)
  end if

end program main

To compile, you have to load the PAPI module:

bash-4.00 $ module load papi/4.1.3
bash-4.00 $ ifort -I${PAPI_INC_DIR} papi.f90 ${PAPI_LIBS}
bash-4.00 $ ./a.out
 Number of hardware counters supported:            7
 FP Instructions:              10046163

To get the list of available hardware counters, you can type the papi_avail command.
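Before instrumenting a code, it can be useful to check that the counter you want is actually available on the node. A minimal sketch filtering the papi_avail output:

module load papi/4.1.3
# list the floating point related events and whether they are available
papi_avail | grep -i FP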
This library can also retrieve the MFLOPS of a given region of your code:

program main
  implicit none
  include 'f90papi.h'

  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 100
  double precision, dimension(size,size) :: A,B,C
  integer :: i,j,n
  ! PAPI variables
  integer :: retval
  real(kind=4) :: proc_time, mflops, real_time
  integer(kind=8) :: flpins

  ! Init PAPI
  retval = PAPI_VER_CURRENT
  call PAPIf_library_init(retval)
  if ( retval .NE. PAPI_VER_CURRENT) then
    print*, 'PAPI_library_init', retval
  end if
  call PAPIf_query_event(PAPI_FP_INS, retval)

  ! Init matrices
  do i=1,size
    do j=1,size
      C(i,j) = real(i+j,8)
      B(i,j) = -i+0.1*j
    end do
  end do

  ! Set up counter
  call PAPIf_flips( real_time, proc_time, flpins, mflops, retval )

  ! DAXPY
  do n=1,ntimes
    do i=1,size
      do j=1,size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do

  ! Collect the data into the variables passed in
  call PAPIf_flips( real_time, proc_time, flpins, mflops, retval )

  ! Print results
  print *, 'Real_time: ', real_time
  print *, ' Proc_time: ', proc_time
  print *, ' Total flpins: ', flpins
  print *, ' MFLOPS: ', mflops

end program main

And the output:

bash-4.00 $ module load papi/4.1.3
bash-4.00 $ ifort -I${PAPI_INC_DIR} papi_flops.f90 ${PAPI_LIBS}
bash-4.00 $ ./a.out
 Real_time:   6.1250001E-02
 Proc_time:   5.1447589E-02
 Total flpins:             100056592
 MFLOPS:    1944.826

If you need more details, you can contact us or visit the PAPI website.

VampirTrace/Vampir

VampirTrace is a library which lets you profile your parallel code by collecting traces during the execution of the program. We present here an introduction to Vampir/VampirTrace.

Basics

First, you must compile your code with the VampirTrace compiler wrappers. In order to use VampirTrace, you need to load the vampirtrace module:

bash-4.00 $ module load vampirtrace
bash-4.00 $ vtcc -c prog.c
bash-4.00 $ vtcc -o prog.exe prog.o

The available compilers are:

vtcc : C compiler
vtc++, vtCC and vtcxx : C++ compilers
vtf77 and vtf90 : Fortran compilers

To compile an MPI code, you should type:

bash-4.00 $ vtcc -vt:cc mpicc -g -c prog.c
bash-4.00 $ vtcc -vt:cc mpicc -g -o prog.exe prog.o

For the other languages you have:

vtcc -vt:cc mpicc : MPI C compiler
vtc++ -vt:cxx mpic++, vtCC -vt:cxx mpiCC and vtcxx -vt:cxx mpicxx : MPI C++ compilers
vtf77 -vt:f77 mpif77 and vtf90 -vt:f90 mpif90 : MPI Fortran compilers

By default, the VampirTrace wrappers use the Intel compilers. To switch to another compiler, you can use the same method as for MPI:

bash-4.00 $ vtcc -vt:cc gcc -O2 -c prog.c
bash-4.00 $ vtcc -vt:cc gcc -O2 -o prog.exe prog.o

To profile an OpenMP or a hybrid OpenMP/MPI application, you should add the corresponding OpenMP option of the compiler:

bash-4.00 $ vtcc -openmp -O2 -c prog.c
bash-4.00 $ vtcc -openmp -O2 -o prog.exe prog.o

Then you can submit your job. Here is an example of a submission script:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./prog.exe

At the end of the execution, the program generates several profiling files:

bash-4.00 $ ls
a.out  a.out.0.def.z  a.out.1.events.z ... a.out.otf

To visualize those files, you must load the vampir module:

bash-4.00 $ module load vampir
bash-4.00 $ vampir a.out.otf

[Figure: Vampir window]

If you need more information, you can contact us.

Tips

VampirTrace allocates a buffer to store its profiling information. If the buffer is full, VampirTrace flushes the buffer to disk.
By default, the size of this buffer is 32MB per process and the maximum number of flushes is one. You can increase (or reduce) the size of the buffer, keeping in mind that your code will then use more memory. To change the size, you have to set an environment variable:

export VT_BUFFER_SIZE=64M
ccc_mprun ./prog.exe

In this example, the buffer is set to 64 MB. We can also increase the maximum number of flushes:

export VT_MAX_FLUSHES=10
ccc_mprun ./prog.exe

If the value of VT_MAX_FLUSHES is 0, the number of flushes is unlimited.

By default, VampirTrace first stores the profiling information in a directory (/tmp) local to each process. These files can be very large and fill up that directory, so you should redirect this local directory to another location:

export VT_PFORM_LDIR=$SCRATCHDIR

There are more VampirTrace variables which can be used; see the User Manual for more details.
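These variables are simply exported in the submission script before launching the instrumented binary. A sketch combining the three settings above (prog.exe and the chosen values are only examples):

#!/bin/bash
#MSUB -r MyJob_VT                 # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -A paxxxx                   # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
export VT_BUFFER_SIZE=64M            # per-process trace buffer
export VT_MAX_FLUSHES=10             # allow up to 10 flushes to disk
export VT_PFORM_LDIR=$SCRATCHDIR     # write intermediate traces to the scratch file system
ccc_mprun ./prog.exe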
Vampirserver

Traces generated by VampirTrace can be very large, and Vampir can be very slow when visualizing them. Vampir provides Vampirserver: a parallel program which uses compute power to accelerate the Vampir visualization.

First, you have to submit a job which launches Vampirserver on Curie nodes:

$ cat vampirserver.sh
#!/bin/bash
#MSUB -r vampirserver             # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o vampirserver_%I.o        # Standard output. %I is the job id
#MSUB -e vampirserver_%I.e        # Error output. %I is the job id

ccc_mprun vngd

$ module load vampir
$ ccc_msub vampirserver.sh

When the job is running, you will obtain this output:

$ ccc_mpp
USER  ACCOUNT  BATCHID  NCPU  QUEUE  PRIORITY  STATE  RLIM   RUN/START  SUSP  OLD   NAME          NODES
toto  genXXX   234481   32    large  210332    RUN    30.0m  1.3m             1.3m  vampirserver  curie1352

$ ccc_mpeek 234481
Found license file: /usr/local/vampir-7.3/bin/lic.dat
Running 31 analysis processes... (abort with Ctrl-C or vngd-shutdown)
Server listens on: curie1352:30000

In our example, the Vampirserver master node is curie1352 and the port to connect to is 30000. You can then launch Vampir on a front node. Instead of clicking on Open, click on Remote Open:

[Figure: connecting to Vampirserver]

Fill in the server and the port; you will be connected to Vampirserver. You can then open an OTF file and visualize it.

Notes:

You can ask for any number of processors: the more you use, the faster the analysis will be if your profiling files are big. But be careful, it consumes your computing time.
Do not forget to delete the Vampirserver job after your analysis.

CUDA profiling

VampirTrace can collect profiling data from CUDA programs. As previously, you have to replace the compilers by the VampirTrace wrappers; the NVCC compiler should be replaced by vtnvcc. Then, when you run your program, you have to set an environment variable:

export VT_CUDARTTRACE=yes
ccc_mprun ./prog.exe

Scalasca

Scalasca is a set of software which lets you profile your parallel code by collecting traces during the execution of the program. This software is a kind of parallel gprof with more information. We present here an introduction to Scalasca.

Standard utilization

First, you must compile your code by adding the scalasca command in front of your usual compiler call. In order to use Scalasca, you need to load the scalasca module:

bash-4.00 $ module load scalasca
bash-4.00 $ scalasca -instrument mpicc -c prog.c
bash-4.00 $ scalasca -instrument mpicc -o prog.exe prog.o

or, for Fortran:

bash-4.00 $ module load scalasca
bash-4.00 $ scalasca -instrument mpif90 -c prog.f90
bash-4.00 $ scalasca -instrument mpif90 -o prog.exe prog.o

You can compile OpenMP programs:

bash-4.00 $ scalasca -instrument ifort -openmp -c prog.f90
bash-4.00 $ scalasca -instrument ifort -openmp -o prog.exe prog.o

You can also profile hybrid programs:

bash-4.00 $ scalasca -instrument mpif90 -openmp -O3 -c prog.f90
bash-4.00 $ scalasca -instrument mpif90 -openmp -O3 -o prog.exe prog.o

Then you can submit your job. Here is an example of a submission script:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
export SCAN_MPI_LAUNCHER=ccc_mprun
scalasca -analyze ccc_mprun ./prog.exe

At the end of the execution, the program generates a directory which contains the profiling files:

bash-4.00 $ ls
epik_* ...

To visualize those files, you can type:

bash-4.00 $ scalasca -examine epik_*

[Figure: Scalasca]

If you need more information, you can contact us.

Scalasca + Vampir

Scalasca can generate OTF trace files in order to visualize them with Vampir. To activate the traces, add the -t option to scalasca when you launch the run. Here is the previous script, modified:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
scalasca -analyze -t mpirun ./prog.exe

At the end of the execution, the program generates a directory which contains the profiling files:

bash-4.00 $ ls
epik_* ...

You can examine these files as previously. To generate the OTF trace files, you can type:

bash-4.00 $ ls
epik_*
bash-4.00 $ elg2otf epik_*

This generates an OTF file under the epik_* directory. To visualize it, you can load Vampir:

bash-4.00 $ module load vampir
bash-4.00 $ vampir epik_*/a.otf

Scalasca + PAPI

Scalasca can retrieve the hardware counters with PAPI. For example, if you want to retrieve the number of floating point operations:

#!/bin/bash
#MSUB -r MyJob_Para               # Request name
#MSUB -n 32                       # Number of tasks to use
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o example_%I.o             # Standard output. %I is the job id
#MSUB -e example_%I.e             # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
export EPK_METRICS=PAPI_FP_OPS
scalasca -analyze mpirun ./prog.exe

The number of floating point operations will then appear in the profile when you visualize it. You can retrieve at most 3 hardware counters at the same time on Curie.
The syntax is:

export EPK_METRICS="PAPI_FP_OPS:PAPI_TOT_CYC"

Paraver

Paraver is a flexible performance visualization and analysis tool that can be used to analyze MPI, OpenMP, MPI+OpenMP, hardware counter profiles, operating system activity and many other things you may think of. In order to use the Paraver tools, you need to load the paraver module:

bash-4.00 $ module load paraver
bash-4.00 $ module show paraver
-------------------------------------------------------------------
/usr/local/ccc_users_env/modules/development/paraver/4.1.1:

module-whatis    Paraver
conflict         paraver
prepend-path     PATH /usr/local/paraver-4.1.1/bin
prepend-path     PATH /usr/local/extrae-2.1.1/bin
prepend-path     LD_LIBRARY_PATH /usr/local/paraver-4.1.1/lib
prepend-path     LD_LIBRARY_PATH /usr/local/extrae-2.1.1/lib
module           load papi
setenv           PARAVER_HOME /usr/local/paraver-4.1.1
setenv           EXTRAE_HOME /usr/local/extrae-2.1.1
setenv           EXTRAE_LIB_DIR /usr/local/extrae-2.1.1/lib
setenv           MPI_TRACE_LIBS /usr/local/extrae-2.1.1/lib/libmpitrace.so
-------------------------------------------------------------------

Trace generation

The simplest way to activate MPI instrumentation of your code is to load the tracing library dynamically before execution. This can be done by adding the following line to your submission script:

export LD_PRELOAD=$LD_PRELOAD:$MPI_TRACE_LIBS

The instrumentation process is managed by Extrae and also needs a configuration file in XML format. You have to add the next line to your submission script:

export EXTRAE_CONFIG_FILE=./extrae_config_file.xml

All the details about how to write a configuration file are available in Extrae's manual, which you can find at $EXTRAE_HOME/doc/user-guide.pdf. You will also find many example scripts in the $EXTRAE_HOME/examples/LINUX file tree.

You can also add some manual instrumentation to your code to record specific user events. This is mandatory if you want to see your own functions in the Paraver timelines.

If the trace generation succeeds during the computation, you will find in your working directory a directory named set-0 containing some .mpit files, as well as a TRACE.mpits file which lists all these files.

Converting traces to Paraver format

Extrae provides a tool named mpi2prv to convert the mpit files into a .prv file which can be read by Paraver. Since this can be a long operation, we recommend the parallel version of this tool, mpimpi2prv. You will need fewer processes than previously used for the computation. An example script is given below:

bash-4.00$ cat rebuild.sh
#!/bin/bash
#MSUB -r merge
#MSUB -n 8
#MSUB -T 1800

set -x
cd $BRIDGE_MSUB_PWD
ccc_mprun mpimpi2prv -syn -e path_to_your_binary -f TRACE.mpits -o file_to_be_analysed.prv

Launching Paraver

You now just have to launch "paraver file_to_be_analysed.prv". As Paraver may require a lot of memory and CPU, it may be better to launch it through a submission script (do not forget then to activate the -X option of ccc_msub). For analyzing your data you will need some configuration files, available in Paraver's browser under the $PARAVER_HOME/cfgs directory.

[Figure: Paraver window]
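A sketch of such a submission script (the trace name is the one produced in the previous step; remember to submit it with the -X option so that the display is forwarded):

$ cat paraver.sh
#!/bin/bash
#MSUB -r paraver                  # Request name
#MSUB -n 1                        # Paraver itself runs on a single core
#MSUB -T 1800                     # Elapsed time limit in seconds
#MSUB -o paraver_%I.o             # Standard output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
module load paraver
paraver file_to_be_analysed.prv

$ ccc_msub -X paraver.sh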