Channel: Intel® C++ Composer XE

The Chronicles of Phi - part 1 The Hyper-Thread Phalanx


Recently I have been blessed with the opportunity to have a workstation with dual Intel® Xeon Phi™ 5110P coprocessors.  One of the design goals of the Xeon Phi™ was to provide a programming environment that is nearly the same as that on the host system. The Intel engineers based the design on an extension of their existing architecture and did a marvelous job of integrating this into a package complete with drivers and language extensions.

As much as there is a desire to make programming the coprocessor the same as programming the host, the practical reality is that it is a different “beastie” with respect to code optimization. Several factors play into optimization, principal amongst them:

o In-order core execution
o Distributed L1/L2 caches
o Large number of memory banks
o Substantial number of pending hardware prefetches
o Four-wide Hyper-Threading

Other than for a “look-see” at the two past IDF programming labs, I have to admit that I have had no prior programming experience with Xeon Phi. Ordinarily this lack of experience would be a disadvantage for someone writing a technical article. However, the intent of this article is to present a learning experience. The journey is more important to the programmer than the destination. For the journey provides a means to the end, the end being a better program.

In search of a relatively simple but functional application, I chose to borrow someone else’s program that is representative of real work. What I chose was an example from chapter 4 of Intel® Xeon Phi™ Coprocessor High-Performance Programming, by Jim Jeffers and James Reinders (Morgan Kaufmann). You can find this on Amazon.com as well as at other book outlets.

Since the hardware on my system is slightly different from that in the book, I reran the test programs to get my own baseline results.  My host system has one Xeon E5-2620 in a motherboard with the x79 chipset. Two Xeon Phi™ 5110P coprocessors are installed; one was used for the test. The book's Xeon Phi™ coprocessor was a pre-production version that had 61 cores, whereas 5110P’s have 60 cores each. Therefore the tests run will use slightly different thread and core counts from those used in the examples in the book.

This article will not duplicate the content of chapter 4 of the book; rather I take the end of chapter 4 as a starting point.

The nature of the program under investigation is to simulate diffusion of a solute through a volume of liquid over time within a 3D container such as a cube. The original sample code is available from http://www.lotsofcores.com, copyright © Naoya Maruyama. The full copyright can be found both in the aforementioned book and in the sample code. This blog is written with the assumption that you have a Xeon Phi™ coprocessor and have already downloaded the sample programs.

Starting from the end of the chapter with the results data:

This chart is not a cut and paste from the book. Instead, it is derived from running the sample programs configured for my system (60 cores not 61 cores).  The chart data represents the average of three runs of the program. This was done to smooth out any adverse interaction caused by the system O/S.

Note, the numbers represent the ratio of the Mega-Flops of the book's various threaded programs versus the Mega-Flops of a single thread run of the base program. The chart is not a scaling chart, because we are comparing the performance as we tune the implementation, not the performance as we change the number of cores and/or threads.

base is the single thread version of the program
omp is the simplified conversion to a parallel program
ompvect adds simd vectorization directives to the inner loop
peel removes unneeded code from the inner loop
tiled partitions the work to improve L1 and L2 cache hit ratios

The chart above illustrates the typical progress of the optimization process:

o Start with a functioning algorithm
o Introduce parallel code
o Improve vectorization (interchangeable with adding parallel code)
o Remove unnecessary code (improve efficiency)
o Improve cache hit ratios

After you finish the above optimization process there is usually not much left to optimize. That is, unless you’ve overlooked something important.

N.B. In the book, the results for “tiled” showed approximately 426x the performance of the baseline program using 61 cores and 183 threads. However, the book had a typo indicating a 435x improvement; the reported numbers, 106752.742 MFlops vs. 250.763 MFlops, yield 425.71x. Setting the typo aside, my system performed better, showing a 455.52x improvement of tiled over base.

At the end of the chapter, the authors suggest that additional performance could be achieved with additional effort. That additional effort will be the subject of this blog.

Prior to listing the code changes, I think it is important to emphasize that given the nature of the Xeon Phi™ you would not use it to run a single threaded application. Therefore a better chart to use as a frame of reference is one that illustrates the speedup against the simple parallel OpenMP implementation (omp on charts).

The above chart clearly illustrates the benefits you get from following the advice in chapter 4 of Intel® Xeon Phi™ Coprocessor High-Performance Programming. Now let’s get down to business to see where we can go from here.

First, let me state that minor edits to the original code were made, none of which affects the intent of the code. The changes were made to make it easier to run the tests.

The first change was to permit an optional -DNX=nnn on the C++ compiler command line to define the NX macro. This macro specifies the dimensions of the cube (nx=NX, ny=NX, nz=NX), which sets the problem size. When the macro is not defined as a compiler command line option, the original value of 256 is used; should NX be defined on the command line, the command-line value is used. This was done to permit the Makefile to build a “small” model (256x256x256) and a large model (512x512x512) without having to edit the sources.
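As a minimal sketch of how such a default is typically guarded (my assumption about the mechanism, not a copy of the sample source), the macro pattern looks like this:

#ifndef NX
#define NX (256)                // default edge length when -DNX=nnn is not supplied
#endif

static const int nx = NX, ny = NX, nz = NX;   // problem size: NX x NX x NX cells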

The compute section of the book's tiled code is as follows:

#pragma omp parallel 
{ 
    REAL *f1_t = f1; 
    REAL *f2_t = f2; 
    int mythread; 
    for (int i = 0; i < count; ++i) { 
#define YBF 16 
#pragma omp for collapse(2) 
      for (int yy = 0; yy < ny; yy += YBF) { 
      for (int z = 0; z < nz; z++) { 
        int ymax = yy + YBF; 
        if (ymax >= ny) ymax = ny; 
        for (int y = yy; y < ymax; y++) { 
          int x; 
          int c, n, s, b, t; 
          x = 0; 
          c =  x + y * NXP + z * NXP * ny; 
          n = (y == 0)    ? c : c - NXP; 
          s = (y == ny-1) ? c : c + NXP; 
          b = (z == 0)    ? c : c - NXP * ny; 
          t = (z == nz-1) ? c : c + NXP * ny; 
          f2_t[c] = cc * f1_t[c] + cw * f1_t[c] + ce * f1_t[c+1] 
              + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t]; 
#pragma simd   
          for (x = 1; x < nx-1; x++) { 
            ++c; 
            ++n; 
            ++s; 
            ++b; 
            ++t; 
            f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c+1] 
                + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t]; 
          } 
          ++c; 
          ++n; 
          ++s; 
          ++b; 
          ++t; 
          f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c] 
              + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t]; 
        } // tile ny 
      } // tile nz 
      } // block ny 
      REAL *t = f1_t; 
      f1_t = f2_t; 
      f2_t = t; 
    } // count 
  } // parallel 

What can be done to improve the above code?

The compute intensive loop contains a single floating point expression with a handful of integer increments. The compiler optimization -O3 is going to unroll this loop and reduce the number of (x < nx-1) tests, as well as reduce the number of ++ operations through the use of offsets. More importantly, the compiler will vectorize this loop. The strength of the Xeon Phi™ is in its wide vector unit.

This loop handles nx-2 indices of dimension x (254 in the small problem case). Whereas the expressions preceding and following the loop handle the remaining 2 indices of x (x=0 and x=nx-1). With NX=256, the inner loop accounts for 254/256 = 99.22% of the work.

The compiler does a good job of deciding where and when to insert prefetches. You can look at the disassembly code if you wish. I experimented with different #pragma prefetch ... options, all of which produced lower performance than the compiler-generated prefetches. Curiously, I also noted that for this example using #pragma noprefetch produced marginally faster code (<1% faster). This is likely because the loop is simple enough that the hardware prefetcher will perform the prefetching for you. Each unnecessary prefetch instruction eliminated from the loop gains back at least one clock cycle.
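For reference, the experiments above amounted to toggling pragmas of the following form on the inner loop. This is a minimal, self-contained sketch with an illustrative kernel (not the diffusion loop itself), built with the Intel compiler:

// restrict requires -restrict or -std=c99 with icc
void scale(float *restrict dst, const float *restrict src, int n, float a)
{
#pragma noprefetch              // suppress compiler-generated prefetch instructions
// #pragma prefetch src         // (the alternative tried: compiler-directed prefetch of src)
#pragma simd
  for (int i = 0; i < n; ++i)
    dst[i] = a * src[i];
}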

The authors of the book found the best performance was achieved using KMP_AFFINITY=scatter and OMP_NUM_THREADS=183 (three threads from each core). On my machine, with one fewer core, 180 threads were used. The tiling (YBF=16) was used to improve cache hit probability, and we assume the authors of the book tuned this factor to yield best performance.

I ask again:

What can be done to improve the above code?

Well, there are some problems with the code. These problems are not obvious when reading the code and relating it to the computational characteristics of the Intel® Xeon Phi™'s many in-order cores, each with four hardware threads. The book authors found that using three of the four hardware threads per core gave optimal performance (of this program) for use with the tiled algorithm. My test runs produced comparable numbers.

The first optimization oversight made by the authors of the book – The Hyper-Thread Phalanx

The term phalanx is derived from a military formation used by the ancient Greeks and Romans. The formation generally involved soldiers lining up shoulder to shoulder, shield to shield multiple rows deep. The formation would advance in unison becoming “an irresistible force”. I use the term Hyper-Thread Phalanx to refer to the Hyper-Thread siblings of a core being aligned shoulder-to-shoulder and advancing forward.

I feel I must preface this section with the presumption that the authors of the book had a publishing deadline as well as had additional programming examples to write, test, edit and insert into the manuscript, and therefore had little time to explore this method.

The first optimization technique I employed was to gang the hardware threads of each core into a cohesive working unit (phalanx) thus enhancing the L1 and/or L2 cache hit ratio. In the book’s code, each thread worked in a different section of the 3D problem space. Looking down the X direction (view from Z/Y plane), we can illustrate each thread’s L1 and/or L2 cache activity as it processes each column of X.

For illustrative purposes, the left image represents a view of the Z/Y plane, with X going into the page. Also for illustrative purposes, we show blue to indicate a cold cache miss and yellow to indicate a partial cache hit on the first traversal of X. As computation progresses to x=nx-1, the c column has three reads (c+1, c, c-1). With the c+1 taking the memory hit, the c and c-1 reads will benefit from a cache hit. At the end of the loop, all five columns c, n, s, t, b will be located in L1 and/or L2 cache (as will the f1_t[c+1] and f1_t[c-1] entries, being in the same column as f1_t[c]).

The right image illustrates the next traversal along x (++y, x=0:nx-1). Note now that two of the five columns of x (c and t) are now hot (red) in L1 and/or L2 cache, while three are cold (blue), meaning not in cache. Additionally, the [c-1] and [c+1] references will be in cache. This same activity is progressing for all the threads, including the HT siblings of each core. Counting 3 hits for the center column, we can estimate the cache hit ratio as 3/7 (42.86%). This hit ratio is estimated without prefetches, and assuming the x depth does not exceed the cache capacity.

What is non-optimal about this strategy?

First, all the HT siblings of a core share the L1 and L2 cache of that core. Each core has separate L1 and L2 caches. Therefore, each of the three HT siblings of the original tiled code (which used three out of the four available HT siblings) could effectively only use 1/3rd of the core’s shared L1 and/or L2 cache. Additionally, depending on the start points, the HT siblings may experience false-sharing and evict each other’s hot cache-lines. This is not an efficient situation.

Moving on from the best settings for the original tiled code, this author chose a different tactic to perform tiling.

Hyper-Thread Phalanx

The Hyper-Thread Phalanx is introduced by changing the outer computation loop such that the HT threads within each core process neighboring cells in z, advancing together through y and down x (vector by vector). The technique for doing so will be discussed later, but first let’s examine the reasoning behind this choice of access pattern. For this specific computational loop, the Hyper-Thread Phalanx design permits higher L1/L2 cache hit ratios.

Expanding upon the prior illustration of the activity of a single thread we will look at an illustration of 2-wide, 3-wide, and 4-wide Hyper-Thread Phalanxes:

The left image of each pair illustrates the cold cache penetration of x. Yellow indicates one of the threads incurs a cache miss, while all the adjacent threads accessing the same cell experience a cache hit. More importantly, moving on to the next drill down of x (right illustration of the paired illustrations), we can now estimate the cache hit ratios: 10/14 (71.43%), 16/21 (76.19%), and 22/28 (78.57%). These are significant improvements over the single thread layout value of 3/7 (42.86%).
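One way to summarize these estimates (my own restatement, not a formula from the book): for a $W$-wide phalanx the estimated steady-state hit ratio fits

$$\frac{6W-2}{7W} \;\Rightarrow\; \frac{10}{14},\ \frac{16}{21},\ \frac{22}{28} \quad\text{for } W=2,3,4,$$

which approaches $6/7 \approx 85.7\%$ as $W$ grows, versus $3/7$ for a lone thread.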

A second benefit is that each of the HT siblings within a core now shares all of the L1 and L2 cache of that core. This effectively triples each thread’s working set in these caches over the prior technique of maintaining spatial separation of the work performed by each thread within a core.

A third benefit is elimination of intra-core thread cache evictions due to false sharing.

The programming changes to implement this were a little out of the ordinary for the OpenMP programmer. To provide for the use of Hyper-Thread Phalanx you remove the #pragma omp for collapse(2) code structure, and insert hand partitioning of the iteration space by core and by HT sibling within core.  This amounted to removing 5 lines of code and inserting 13 lines of code.

The question then becomes:

How to determine thread binding to core and HT within core?

I will let you think about this while you wait for part 2 of this blog.

Jim Dempsey
Consultant
QuickThread Programming, LLC


    A variable in a blank common cannot be specified with the OFFLOAD:TARGET attribute.

    Thank you for your interest in this diagnostic message. We are still in the process of documenting this specific diagnostic.

    Please let us know of your experience with this diagnostic message by posting a comment below. Your interest in this diagnostic will help us prioritize the order we document diagnostics.


    How Intel® AVX Improves Performance on Server Application


    You can see a performance boost with the latest Intel® Xeon® E7-4890 v2 and Intel® Advanced Vector Extensions (Intel® AVX) optimized code.  For existing vectorized code that uses floating-point operations, you can gain a potential performance boost just by re-compiling your code for Intel® AVX. Or you can use the optimized Intel® Math Kernel Library (Intel® MKL) to run on the latest Intel® Xeon® E7-4890 v2. For other applications migrating from Intel® Streaming SIMD Extensions (Intel® SSE) code to Intel AVX code, you can write the equivalent assembly instructions or use intrinsic instructions for high-level languages. In this blog, I will share a simple experiment using the Intel® Optimized LINPACK benchmark to demonstrate the performance gain of three different workloads (30K, 40K, and 75K) from Intel AVX running on Windows* and Linux*. I will also share the list of Intel AVX instructions that were executed and the equivalent Intel SSE instructions for developers who are interested in coding.

    I used the following platform for the experiment:

    CPU & Chipset

    Model/Speed/Cache:  E7-4890 v2 QFJY 2.8 GHz, 37.5 MB cache, 155 W TDP, D1

    • # of cores per chip: 15
    • # of CPU sockets: 4
    • Chipset: Patsburg SSB-J C1
    • System bus: 8GT/s QPI

    Platform

    Brand/model: Intel SDP S4TR1SY2B Brickland IVT-EX Qual MM # 931237

    • Chassis: Intel 4U Rackable
    • Baseboard: Intel CRB baseboard codenamed Thunder Ridge
    • BIOS: BIVTSDP1.86B.0042.R04.1309061422, BMC 70.06.r5145 w/ Closed Chassis SDR, ME 2.3.0, FRU D.00, CPLD 1.06
    • Dimm slots: 96
    • PCI slots: 1 x4, 7 x8, 4 x16
    • Drive controller: LSI SAS9217-8i (with custom FW)
    • Power supply: 2x1200W NON-REDUNDANT (+2 empty slots)
    • CD ROM: TEAC Slim
    • Network (nic): 1x Intel Ethernet Converged Network Adapter x540-T2 "Twin Pond" (OEM-GEN)

    Memory

    Memory Size: 256GB (32 x 8GB) - 4 DIMMS per memory riser card (Slot A0, B0, C0, D0)

    Brand/model: Samsung M393B1K70DH0-YK0 1309

    DIMM info: 8GB 2Rx4 PC3L-1200R-11-11-E2-P2

    Mass storage

    Brand & model: Intel SSD DC S3700 Series SSDSC2BA800G3

    Number/Size/RPM/Cache: 2 in RAID 0 / 800 GB

    Operating system

    Windows Server 2012 R2 / SLES 11 SP3

    Procedure for running LINPACK: 

    1. Download and install the following:
      1. Intel® Math Kernel Library – LINPACK Download
        http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
      2. Intel® Math Kernel Library (MKL)
        http://software.intel.com/en-us/intel-math-kernel-library-evaluation-options
    2. Create three different input files for 30K, 40K, and 75K from the “...\linpack” directory
    3. For AVX runs update files as follows:
      1. For Windows, update the runme_xeon64.bat file to take the new input files that you have created.  For Linux, update the runme_xeon64 shell script file to take the new input files.
      2. The results will be in GFlops, similar to Table 2.
    4. For Intel SSE runs, you will need a processor with Intel AVX disabled; repeat the two sub-steps in step 3.

    What are the Intel AVX and the equivalent Intel SSE instructions that were executed? 

    Table 1 has a list of Intel AVX instructions that were executed during the Intel AVX runs. I have provided the equivalent Intel SSE instructions for those developers who are thinking of moving their existing Intel SSE code to Intel AVX. There are four different ways to take advantage of Intel AVX. For more information, see the references below:

    1. Use the Intel AVX intrinsic instructions. For high-level language (such as C or C++) developers, you can use Intel® intrinsic instructions to make the call and recompile your code (a short sketch follows this list).  See the Intel® Intrinsics Guide and Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
    2. Code in assembly instructions directly. For low-level language (assembly) developers, you can use the equivalent Intel AVX instructions in place of your existing Intel SSE code.  See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
    3. Use the Intel® compiler with the proper Intel AVX switch. If you don’t want to write code, you can take advantage of the Intel® compiler by using the proper Intel AVX switch to compile your existing Intel SSE code.  See Intel® AVX State Transitions: Migrating SSE Code to AVX.
    4. Use an Intel AVX optimized library such as the Intel® Math Kernel Library (Intel MKL).  For a statically built application, you need to re-link the application with the new library.  For a dynamically built application, you need to replace the dynamic library on the system.
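    To make option 1 concrete, here is a hedged, minimal sketch (my own, not from the LINPACK source): the same double-precision add written with SSE2 intrinsics (which compile to ADDPD) and with AVX intrinsics (which compile to VADDPD). The function names are illustrative and n is assumed to be a multiple of four.

    #include <immintrin.h>

    void add_sse(const double *a, const double *b, double *r, int n)
    {
        for (int i = 0; i < n; i += 2) {              // 2 doubles per 128-bit XMM register
            __m128d va = _mm_loadu_pd(a + i);
            __m128d vb = _mm_loadu_pd(b + i);
            _mm_storeu_pd(r + i, _mm_add_pd(va, vb)); // ADDPD
        }
    }

    void add_avx(const double *a, const double *b, double *r, int n)
    {
        for (int i = 0; i < n; i += 4) {              // 4 doubles per 256-bit YMM register
            __m256d va = _mm256_loadu_pd(a + i);
            __m256d vb = _mm256_loadu_pd(b + i);
            _mm256_storeu_pd(r + i, _mm256_add_pd(va, vb)); // VADDPD
        }
    }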

    Intel AVX Instructions from the LINPACK Runs | Equivalent Intel SSE Instructions (SSE/SSE2/SSE3/SSE4) | Definitions

    VADDPD | ADDPD | Add Packed Double-Precision Floating-Point Values
    VBLENDPD | BLENDPD | Blend Packed Double-Precision Floating-Point Values
    VBROADCASTSD | N/A – AVX and AVX2 only | Copy a 32-bit, 64-bit or 128-bit memory operand to all elements of a XMM or YMM vector register.
    VDIVSD | DIVSD | Divide Scalar Double-Precision Floating-Point Values
    VEXTRACTF128 | N/A – AVX and AVX2 only | Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
    VINSERTF128 | N/A – AVX and AVX2 only | Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
    VMOVAPD | MOVAPD | Move Aligned Packed Double-Precision Floating-Point Values
    VMOVAPS | MOVAPS | Move Aligned Packed Single-Precision Floating-Point Values
    VMOVDDUP | MOVDDUP | Move One Double-FP and Duplicate
    VMOVDQU | MOVDQU | Move Unaligned Double Quadword
    VMOVHPD | MOVHPD | Move High Packed Double-Precision Floating-Point Value
    VMOVLPD | MOVLPD | Move Low Packed Double-Precision Floating-Point Value
    VMOVSD | MOVSD | Move Scalar Double-Precision Floating-Point Value
    VMOVUPD | MOVUPD | Move Unaligned Packed Double-Precision Floating-Point Values
    VMOVUPS | MOVUPS | Move Unaligned Packed Single-Precision Floating-Point Values
    VMULPD | MULPD | Multiply Packed Double-Precision Floating-Point Values
    VMULSD | MULSD | Multiply Scalar Double-Precision Floating-Point Values
    VPERM2F128 | N/A – AVX only | Shuffle the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector.
    VPERMILPD | N/A – AVX only | Shuffle the 32-bit or 64-bit vector elements of one input operand with an immediate operand as selector.
    VPXOR | PXOR | Logical Exclusive OR
    VSUBPD | SUBPD | Subtract Packed Double-Precision Floating-Point Values
    VSUBSD | SUBSD | Subtract Scalar Double-Precision Floating-Point Values
    VUCOMISD | UCOMISD | Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS
    VUNPCKHPD | UNPCKHPD | Unpack and Interleave High Packed Double-Precision Floating-Point Values
    VUNPCKLPD | UNPCKLPD | Unpack and Interleave Low Packed Double-Precision Floating-Point Values
    VXORPD | XORPD | Bitwise Logical XOR for Double-Precision Floating-Point Values
    VZEROUPPER | N/A – AVX and AVX2 only | Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use.

    Table 1 – Intel AVX and Equivalent Intel SSE Instructions

     

    The list in Table 1 is just a subset.  The full list can be obtained from the Intel® 64 and IA-32 Architectures Optimization Reference Manual.

     

    What is the performance gain for running the LINPACK benchmark with Intel AVX vs. Intel SSE enabled on the Intel Xeon E7 4890 v2 server?

    Table 2 shows the results from the three different workloads running on Windows* and Linux*.  In the Ratio column, the numbers show that the LINPACK benchmark produces ~1.6x-1.7x better performance when running with the combination of an Intel AVX optimized LINPACK and an Intel AVX capable processor. This is just an example of the potential performance boost for LINPACK.  For other applications, the performance gain will vary depending on the optimized code and the hardware environment.

    Workload | Intel AVX (GFlops) | Intel SSE (GFlops) | Ratio: Intel AVX/Intel SSE

    Windows*
    LINPACK 30K v11.1.1 | 631.8 | 400.3 | 1.6
    LINPACK 40K v11.1.1 | 756.4 | 480.6 | 1.6
    LINPACK 75K v11.1.1 | 829.3 | 514.3 | 1.6

    Linux*
    LINPACK 30K v11.1.1 | 913.6 | 534.3 | 1.7
    LINPACK 40K v11.1.1 | 1023.5 | 621.2 | 1.6
    LINPACK 75K v11.1.1 | 1128.8 | 657.0 | 1.7

    Table 2 – Results and Performance Gain from the LINPACK benchmark

    Conclusion

    From the LINPACK experiment, there is a performance boost of ~1.6x-1.7x from Intel AVX on the Intel Xeon E7-4890 v2. With the latest Intel® processors, software developers can migrate their Intel SSE code to Intel AVX for better performance. The reference materials cited above can help developers learn how to migrate existing Intel SSE code to Intel AVX code.


    Cilk plus implicit threshold

    Hi,

    I'm new to Cilk, and I wanted to ask whether it has an implicit threshold for task creation in recursive computations like fib.

    If so, is it based on the number of tasks created, or on the depth of the computation?
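    For reference, the kind of recursive spawn pattern the question refers to looks like this (a minimal sketch, not code from the original post):

    #include <cilk/cilk.h>

    long fib(int n)
    {
        if (n < 2) return n;
        long x = cilk_spawn fib(n - 1);   // child task; an idle worker may steal the continuation
        long y = fib(n - 2);              // continuation work in the current worker
        cilk_sync;                        // wait for the spawned child before using x
        return x + y;
    }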

     

    Thanks!

    The Chronicles of Phi - part 2 Hyper-Thread Phalanx – tiled_HT1

    In the first part I discussed the diffusion problem and the proposed strategy to address the performance issue through use of a Hyper-Thread Phalanx. I left you with the dangling question:

    How to determine thread binding to core and HT within core?

    Let's begin with an illustration of 2-wide, 3-wide, and 4-wide Hyper-Thread Phalanxes:

    View of the y/z plane of an x/y/z space, with x going into the page. Computation order is into the page along x, then stepping down the page in the +y direction.

    The left image of each pair illustrates the cold cache penetration of x. Yellow indicates one of the threads incurs a cache miss, while all the adjacent threads accessing the same cell experience a cache hit. More importantly, moving on to the next drill down of x (right illustration of the paired illustrations), we can now estimate the cache hit ratios: 10/14 (71.43%), 16/21 (76.19%), and 22/28 (78.57%). These are significant improvements over the single thread layout value of 3/7 (42.86%).

    The question then becomes:

    How to determine thread binding to core and HT within core?

    One could specify affinity and core placement through environment variables external to the program, but this may not be suitable or reliable. It is better to place the fewest constraints and requirements on the environment variables. While one set of affinity bindings may be best for this function, your overall application may benefit from a different arrangement of thread bindings. Therefore, it is necessary to have the program determine the affinity bindings applied by the environment.

    The following header HyperThreadPhalanx.h  and utility code HyperThreadPhalanx.c  were used for the improved performance test programs added to the sample program folder. The original test programs were written in C. Therefore, this version of the utility code is also written in C. As an exercise for the reader, you may modify the code for use with C++.

    The primary goal of the HyperThreadPhalanx.c  utility function is to:

    o Determine the number of OpenMP threads in the outermost region of the application
    o Compute a logical core number (zero based and contiguous) for each thread
    o Compute a logical HT number within the core (zero based and contiguous) for each thread
    o Compute the number of logical cores
    o Compute number of HTs per core as used in the working set

    Notes:

    The programmer (operator of the program) must specify some form of realistic affinity binding. They are free to choose almost any strategy that is reasonable for the remainder (the non-HyperThreadPhalanx’ed part) of the application: KMP_AFFINITY=compact, KMP_AFFINITY=scatter, as well as either of these combined with KMP_PLACE_THREADS=nnC,mmT,oO. The only “reasonable” requirement is for each core used to have the same number of working threads. If they do not, the current code will choose the smallest number (though testing of adverse configurations has not been strenuously performed).

    The header file:

    // HyperThreadPhalanx.h
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>
    #include <omp.h>
    #include <assert.h>
    
    // types:
    struct HyperThreadPhalanxThreadInfo_t
    {
      int APICid;
      int PhysicalCore;
      int PhysicalHT;
      int LogicalCore;
      int LogicalHT;
    };
    
    struct HyperThreadPhalanx_t
    {
      int isIntel;
      union {
      char ProcessorBrand[48];
      unsigned int ProcessorBrand_uint32[12];
      };
      int nHTsPerCore;// hardware
      int nThreads;   // omp_get_num_threads() {in parallel region, no nesting}
      int nCores;     // number of core derived therefrom
      int nHTs;       // smallest number of HT's in mapped cores (logical HTs/core)
      struct HyperThreadPhalanxThreadInfo_t* ThreadInfo; // allocated to nThreads
    };
    
    // global variables:
    extern struct HyperThreadPhalanx_t HyperThreadPhalanx;
    
    
    // global thread private variables:
    #if defined(__linux)
    // logical core (may be subset of physical cores and not necessarily core(0))
    extern __thread int myCore; 
    // logical Hyper-Thread within core
    // (may be subset of hw threads in core and not necessarily hwThread(0) in core)
    extern __thread int myHT;
    #else
    // logical core (may be subset of physical cores and not necessarily core(0))
    extern __declspec(thread) int myCore;
    // logical Hyper-Thread within core
    // (may be subset of hw threads in core and not necessarily hwThread(0) in core)
    extern __declspec(thread) int myHT;
    #endif
    
    // functions:
    int HyperThreadPhalanxInit();
    

    The header introduces into your namespace the HyperThreadPhalanx object and two Thread Local Storage variables myCore and myHT. Other than for the two TLS variables, the user is free to use the post-HyperThreadPhalanxInit() values if they wish to do so.
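    A minimal usage sketch (mine, not part of the sample code): call HyperThreadPhalanxInit() once from serial code, then read the thread-local values inside a parallel region.

    #include "HyperThreadPhalanx.h"

    int main()
    {
      if(HyperThreadPhalanxInit() != 0)
        return 1;   // unsupported processor, call from inside a parallel region, or oversubscription

    #pragma omp parallel
      {
        // myCore and myHT were filled in per thread by HyperThreadPhalanxInit()
        printf("OpenMP thread %d -> logical core %d, HT %d (of %d cores x %d HTs)\n",
               omp_get_thread_num(), myCore, myHT,
               HyperThreadPhalanx.nCores, HyperThreadPhalanx.nHTs);
      }
      return 0;
    }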

    The current code was kept brief and is only functional for Intel processors (P4 and later). The code uses the CPUID intrinsic and instruction. Information on the CPUID instruction can be found in Intel® Processor Identification and the CPUID Instruction, Application Note 485.

    The code now follows:

    // HyperThreadPhalanx.c
    
    #include "HyperThreadPhalanx.h"
    
    struct HyperThreadPhalanx_t HyperThreadPhalanx;
    
    #if defined(__linux)
    // logical core (may be subset of physical cores and not necessarily core(0))
    __thread int myCore = -1;
    // logical Hyper-Thread within core
    // (may be subset of hw threads in core and not necessarily hwThread(0) in core)
    __thread int myHT = -1;
    #else
    // logical core (may be subset of physical cores and not necessarily core(0))
    __declspec(thread) int myCore = -1;
    // logical Hyper-Thread within core
    // (may be subset of hw threads in core and not necessarily hwThread(0) in core)
    __declspec(thread) int myHT = -1;
    #endif
    
    void __cpuidEX(int cpuinfo[4], int func_a, int func_c)
    {
     int eax, ebx, ecx, edx;
     __asm__ __volatile__ ("cpuid":\
     "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (func_a), "c" (func_c));
     cpuinfo[0] = eax;
     cpuinfo[1] = ebx;
     cpuinfo[2] = ecx;
     cpuinfo[3] = edx;
    } // void __cpuidEX(int cpuinfo[4], int func_a, int func_c)
    
    void InitProcessor()
    {
      unsigned int CPUinfo[4];
      __cpuid(CPUinfo, 0); // This code requires at least support of CPUID
      HyperThreadPhalanx.ProcessorBrand_uint32[0] = CPUinfo[1];
      HyperThreadPhalanx.ProcessorBrand_uint32[1] = CPUinfo[3]; // note order different
      HyperThreadPhalanx.ProcessorBrand_uint32[2] = CPUinfo[2];
      HyperThreadPhalanx.ProcessorBrand_uint32[3] = 0;
      HyperThreadPhalanx.isIntel =
        (strcmp(HyperThreadPhalanx.ProcessorBrand, "GenuineIntel") == 0);
    }
    
    int HyperThreadPhalanxInit()
    {
      InitProcessor();
      if(!HyperThreadPhalanx.isIntel)
      {
        printf("Not Intel processor. Add code to handle this processor.\n");
        return 1;
      }
      if(omp_in_parallel())
      {
        printf("HyperThreadPhalanxInit() must be called from outside parallel .\n");
        return 2;
      }
    
    #pragma omp parallel
      {
        // use omp_get_num_threads() NOT omp_get_max_threads()
        int nThreads = omp_get_num_threads();
        int iThread = omp_get_thread_num();
        unsigned int CPUinfo[4];
    
    #pragma omp master
        {
          HyperThreadPhalanx.nThreads = nThreads;
          HyperThreadPhalanx.ThreadInfo =
           malloc(nThreads * sizeof(struct HyperThreadPhalanxThreadInfo_t));
          __cpuidEX(CPUinfo, 4, 0);
          HyperThreadPhalanx.nHTsPerCore = ((CPUinfo[0] >> 14) & 0x3F) + 1;
          // default logical HT's per core to physical (may change later)
          HyperThreadPhalanx.nHTs = HyperThreadPhalanx.nHTsPerCore;
    
        }
    #pragma omp barrier
        // master region finished, see if allocation succeeded
        if(HyperThreadPhalanx.ThreadInfo)
        {
          __cpuidEX(CPUinfo, 1, 0); // get features
          if(CPUinfo[2] & (1 << 21))
          {
            // processor has x2APIC
            __cpuidEX(CPUinfo, 0x0B, 0);
     // get thread's APICid
            HyperThreadPhalanx.ThreadInfo[iThread].APICid = CPUinfo[3];
          }
          else
          {
            // older processor without x2APIC
            __cpuidEX(CPUinfo, 1, 0);
     // get thread's APICid
            HyperThreadPhalanx.ThreadInfo[iThread].APICid = (CPUinfo[1] >> 24) & 0xFF;
          }
          // Use thread's APICid to determine physical core and physical HT number within core
          HyperThreadPhalanx.ThreadInfo[iThread].PhysicalCore =
            HyperThreadPhalanx.ThreadInfo[iThread].APICid
             / HyperThreadPhalanx.nHTsPerCore;
          HyperThreadPhalanx.ThreadInfo[iThread].PhysicalHT =
            HyperThreadPhalanx.ThreadInfo[iThread].APICid
              % HyperThreadPhalanx.nHTsPerCore;
          // for now indicate LogicalCore and LogicalHT not assigned
          HyperThreadPhalanx.ThreadInfo[iThread].LogicalCore = -1;
          HyperThreadPhalanx.ThreadInfo[iThread].LogicalHT = -1;
        }
    #pragma omp barrier
        // At this point, all the HyperThreadPhalanx.ThreadInfo[iThread].APICid,
        // PhysicalCore and PhysicalHT have been filled-in.
        // However, the logical core number may differ from physical core number
        // no different than OpenMP thread number differing from logical processor number
        // The logical core numbers are 0-based, without gaps
    #pragma omp master
        {
          int NextLogicalCore = 0;
          for(;;)
          {
            int iLowest = -1; // none found
            for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
            {
              // see if unassigned core
              if(HyperThreadPhalanx.ThreadInfo[i].LogicalCore == -1)
              {
                if(iLowest < 0)
                {
                  // first unassigned is lowest
                  iLowest = i;
                }
                else
                {
                  if(HyperThreadPhalanx.ThreadInfo[i].APICid < HyperThreadPhalanx.ThreadInfo[iLowest].APICid)
                   iLowest = i; // new lowest
                }
              } // if(HyperThreadPhalanx.ThreadInfo[i].LogicalCore < 0)
            } // for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
            if(iLowest < 0)
              break;
            if(HyperThreadPhalanx.ThreadInfo[iLowest].PhysicalHT != 0)
            {
              // unable to use core
              for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
              {
                if(HyperThreadPhalanx.ThreadInfo[i].PhysicalCore == HyperThreadPhalanx.ThreadInfo[iLowest].PhysicalCore)
                  HyperThreadPhalanx.ThreadInfo[i].LogicalCore = -2; // mark as unavailable
              } // for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
            }
            else
            {
              // able to use core
              int NextLogicalHT = 0;
              for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
              {
                if(HyperThreadPhalanx.ThreadInfo[i].PhysicalCore == HyperThreadPhalanx.ThreadInfo[iLowest].PhysicalCore)
                {
                  HyperThreadPhalanx.ThreadInfo[i].LogicalCore = NextLogicalCore;
                  HyperThreadPhalanx.ThreadInfo[i].LogicalHT = NextLogicalHT++;
                }
              } // for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
              ++NextLogicalCore;
              if(NextLogicalHT < HyperThreadPhalanx.nHTs)
                HyperThreadPhalanx.nHTs = NextLogicalHT; // reduce
            }
          } // for(;;)
          HyperThreadPhalanx.nCores = NextLogicalCore;
        } // omp master
    #pragma omp barrier
        // master is finished
        myCore = HyperThreadPhalanx.ThreadInfo[iThread].LogicalCore;
        myHT = HyperThreadPhalanx.ThreadInfo[iThread].LogicalHT;
      } // omp parallel
      
      for(int i = 1; i < HyperThreadPhalanx.nThreads; ++i)
      {
        for(int j = 0; j < i; ++ j)
        {
          if(HyperThreadPhalanx.ThreadInfo[j].APICid == HyperThreadPhalanx.ThreadInfo[i].APICid)
          {
            printf("Oversubscription of threads\n");
            printf("Multiple SW threads assigned to same HW thread\n"); 
            return 4;
          }
        } // for(int j = 0; j < i; ++ j)
      }
      return 0;
    } // void HyperThreadPhalanxInit()
    

    Next we can integrate the above function into the sample code, which I will cover in the next part of this blog.

    Jim Dempsey
    Consultant
    QuickThread Programming, LLC

     

    Question about steal-continuation semantics in Cilk Plus, Global counter slowing down computation, return value of functions

    1)
    What I understood about steal-continuation is that an idle thread does not actually steal the spawned work, but the continuation, which generates a new work item.
    Does that mean that inter-spawn execution time is crucial? If 2 threads are idle at the same time, from what I understand only one can steal the continuation and create its work unit, so the other thread stays idle during that time?!

    2)
    As a debugging artefact, I had a global counter incremented on every function call of a function used within every working item.

    I expect this value to be wrong (e.g. lost update), as it is not protected by a lock. What I didn't expect was the execution time being 50% longer. Can someone tell me why this is the case?

    3)
    Do I assume correctly that a cilk-spawned function can never (directly) return a result, as the continuation might continue in the meantime and one would never know when the return value is actually written?

    developer documents for Cilk Plus

    Hi,

    First I would like to thank you all for the awesome Cilk Plus tools you have open-sourced in GCC and LLVM.

    I am trying to study the runtime library and finding it a bit difficult to follow the execution in a sample application.

    Are there any developer documents available? A wiki perhaps.

    Specifically, I am trying to trace the execution path for cilk_spawn, which is a keyword. Any helpful links to get me started would be really great!

    Thanks,

    Arya


    The Chronicles of Phi - part 3 Hyper-Thread Phalanx – tiled_HT1 continued

    The prior part (2) of this blog provided a header and a set of functions that can be used to determine the logical core and logical Hyper-Thread number within the core. This determination is to be used in an optimization strategy called the Hyper-Thread Phalanx.

    The term phalanx is derived from a military formation used by the ancient Greeks and Romans. The formation generally involved soldiers lining up shoulder to shoulder, shield to shield multiple rows deep. The formation would advance in unison becoming “an irresistible force”. I use the term Hyper-Thread Phalanx to refer to the Hyper-Thread siblings of a core being aligned shoulder-to-shoulder and advancing forward.

    Note, the Hyper-Thread Phalanx code provided in part 2 of this blog allows you to experiment with different thread teaming scenarios. We intend to run with 180 threads (3 threads per core) and 240 threads (4 threads per core).

    Additionally, the code also works on the host processor(s) with 2 threads per core (as well as 1 thread per core, should you disable HT).

    The Hyper-Thread Phalanx can be attained by a relatively simple loop hand partitioning technique:

    Code for main computation function in tiled_HT1:

    void diffusion_tiled(REAL *restrict f1, REAL *restrict f2, int nx, int ny, int nz,
                  REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
                  REAL cb, REAL cc, REAL dt, int count) {
      #pragma omp parallel
      {
        REAL *f1_t = f1;
        REAL *f2_t = f2;
    
        // number of Squads (singles/doublets/triplets/quadruples) across z dimension
        int nSquadsZ = (nz + nHTs - 1) / nHTs;
        // number of full (and partial) singles/doublets/triads/quads on z-y face
        int nSquadsZY = nSquadsZ * ny;
        int nSquadsZYPerCore = (nSquadsZY + nCores - 1) / nCores;
        // Determine this thread's triads/quads (TLS init setup myCore and myHT)
        int SquadBegin = nSquadsZYPerCore * myCore;
        int SquadEnd = SquadBegin + nSquadsZYPerCore; // 1 after last Squad for core
        if(SquadEnd > nSquadsZY)
          SquadEnd = nSquadsZY; // truncate if necessary
    
        // benchmark timing loop
        for (int i = 0; i < count; ++i) {
          // restrict current thread to its subset of Squads on the Z/Y face.
          for(int iSquad = SquadBegin; iSquad < SquadEnd; ++iSquad) {
        // home z for 0'th team member for next Squad
            int z0 = (iSquad / ny) * nHTs;
            int z = z0 + myHT;  // z for this team member
            int y = iSquad % ny;
            // last double/triad/quad along z may be partially filled
            // assure we are within z
            if(z < nz)
            {
                // determine the center cells and cells about the center
                int x = 0;
                int c, n, s, b, t;
                c =  x + y * nx + z * nx * ny;
                n = (y == 0)    ? c : c - nx;
                s = (y == ny-1) ? c : c + nx;
                b = (z == 0)    ? c : c - nx * ny;
                t = (z == nz-1) ? c : c + nx * ny;
                // c runs through x, n and s through y, b and t through z
                // x=0 special (no f1_t[c-1])
                f2_t[c] = cc * f1_t[c] + cw * f1_t[c] + ce * f1_t[c+1]
                    + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
                // interior x's faster
    #pragma noprefetch
    #pragma simd  
                for (x = 1; x < nx-1; x++) {
                  ++c;
                  ++n;
                  ++s;
                  ++b;
                  ++t;
                  f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c+1]
                      + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
                } // for (x = 1; x < nx-1; x++)
                // final x special (f1_t[c+1])
                ++c;
                ++n;
                ++s;
                ++b;
                ++t;
                f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c]
                    + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
            } // if(z < nz)
          } // for(int iSquad = SquadBegin; iSquad < SquadEnd; ++iSquad)
    // barrier required because we removed implicit barrier of #pragma omp for collapse(2)
              #pragma omp barrier
    #if defined(VERIFY)
              #pragma omp master
              diffusion_baseline_verify(f1_t, f2_t, nx, ny, nz,
                       ce, cw, cn, cs, ct,
                       cb, cc);
              #pragma omp barrier
    #endif
    
          REAL *t = f1_t;
          f1_t = f2_t;
          f2_t = t;
        } // count
      } // parallel
      return;
    }

    Effectively, we removed 5 lines of code relating to the OpenMP loop control and added 13 lines for hand control (a net difference of 8 lines of code). This tiled_HT1 code also changed how the blocking factor was handled: due to the data flow, blocking was removed in favor of data flow and hardware prefetching.

    In making three runs of each problem size of the original tiled code and the newer tiled_HT1 code we find:

    export KMP_AFFINITY=scatter
    export OMP_NUM_THREADS=180
     
    ./diffusion_tiled_xphi
    118771.945 
    123131.672 
    122726.906 
     121543.508
     
    ./diffusion_tiled_Large_xphi
    114972.258 
    114524.977 
    116626.805 
     115374.680
     
    export KMP_AFFINITY=compact
    unset OMP_NUM_THREADS
     
    ./diffusion_tiled_HT1_xphi
    134904.891 
    131310.906 
    133888.688 
     133368.162
     
    ./diffusion_tiled_HT1_Large_xphi
    118476.734 
    118078.930 
    118157.188 
     118237.617

    The last number in each group is the average of the three runs. The ganging strategy did show some improvement in the small model, but not a similar improvement in the large model. Furthermore, the small model improvement was not as large as anticipated. The chart including the tiled_HT1 code:

    The improvement to the small model looks good, but something isn’t right with the large model. Let’s discover what it is.

    Sidebar:

    The earlier draft of this article progressed making optimizations from here. However, discoveries made later caused me to revisit this code. At this point it is important for me to take you on a slight divergence so that you can learn from my experience.

    My system is dual boot for the host. I have both CentOS Linux and Windows 7 installed on different hard drives. I was curious to see if there was any impact of running the native coprocessor code dependent on which host operating system was running. The expectation was that there should be no noticeable difference.

    I configured the environment variables to run the 3-wide phalanx and ran tests on both the CentOS and Windows 7 hosts (the chart above is for the 4-wide phalanx). To my surprise, when I ran the same code (the same file, in fact) with the Windows host and compared against the results with the Linux host, the relative performance figures on the coprocessor were reversed! What was faster with a Linux host was slower with the Windows host, and what was slower became faster. This didn’t make sense.

    One of the thoughts that came to mind was that there may be a memory alignment issue between the allocations of the arrays. This has been my experience on Intel64 and IA32 platforms. So I added a printf to display the addresses of the buffers. The two programs, tiled and tiled_HT, both had 16-byte alignment and approximately the same offset within the page, so data alignment differences could not be the cause. Curiously, by adding the printf of the addresses of the two buffers, the performance figures flipped again. The results of the runs were:

    [Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
    f1 = 0x7fbbf3945010 f2 = 0x7fbbef944010  printf
    FLOPS        : 122157.898 (MFlops) With printf
    [Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
    f1 = 0x7ffaf7c0b010 f2 = 0x7ffaf3c0a010  printf
    FLOPS        : 123543.602 (MFlops) With printf
    [Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
    f1 = 0x7f3afa480010 f2 = 0x7f3af647f010  printf
    FLOPS        : 123908.375 (MFlops) With printf
    Average with printf: 123203.292 MFlops
    
    [Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
    FLOPS        : 114380.531 (MFlops) Without printf
    [Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
    FLOPS        : 121105.062 (MFlops) Without printf
    [Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
    FLOPS        : 116298.797 (MFlops) Without printf
    Average without printf 117261.463 Mflops
    With printf +5941.829 MFlops (+5.1% over without printf)
    

    Clearly there is a shift of 5% in performance due to code shift (shift in load position of code).

    What this means is that the relative performance difference measured between the original tiled code and the newer tiled_HT code (earlier versions) may be completely obscured by the fortuitous placement, or lack thereof, of the code. One version might be +5% and the other -5%, yielding a comparative uncertainty of up to 10%. This difference is specific to this particular sample of code. Do not assume that all code exhibits this amount of performance difference due to code placement. However, this experience suggests that you be vigilant in your tuning process and look for code alignment issues.
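    One way to keep an eye on this (my illustration, not code from the sample programs) is to print where the hot code and the buffers landed, so runs of different builds can be compared:

    #include <stdio.h>
    #include <stdlib.h>

    void diffusion_kernel(void) { /* stand-in for the hot function */ }

    int main(void)
    {
      size_t bytes = 256u * 256u * 256u * sizeof(float);   // small-model sized buffers
      float *f1 = (float *)malloc(bytes);
      float *f2 = (float *)malloc(bytes);
      printf("code = %p  f1 = %p  f2 = %p\n", (void *)&diffusion_kernel, (void *)f1, (void *)f2);
      free(f1);
      free(f2);
      return 0;
    }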

    End sidebar

    Added to the above concerns, the expected performance improvement was not observed. We achieved only a 9.7% improvement for the small model and a 2.5% improvement for the large model.

    Applying Occam’s razor: If we did not observe an increase in L1 hit ratio – it didn’t happen.

    The estimated improvement in the L1 hit ratio was based on the amount of data we calculated should be in the L1  cache for the specified algorithm.

    Three-wide, small data:  8 x 256 x  4 = 8KB low side, 11 x 256 x 4 = 11KB high side
    Four-wide, small data: 10 x 256 x 4 = 10KB low side, 14 x 256 x 4 = 14KB high side

    Both calculations indicate plenty of room in the 32KB L1 cache. Something else must be going on and will need some investigation (deferred to some other time).

    Regardless of this, the performance is in the range of 133 GFlops, far short of the capability of the Xeon Phi™.

    Now I must ask myself:

    What is non-optimal about this strategy?
    And: What can be improved?

    You can think about it while you wait for part 4

    Jim Dempsey
    Consultant
    QuickThread Programming, LLC

    - Part 1 - The Chronicles of Phi - part 1 The Hyper-Thread Phalanx

    - Part 2 - The Chronicles of Phi - part 2 Hyper-Thread Phalanx – tiled_HT1

     

     


    Explicit Vector Programming – Best Known Methods

    Why do we care about vectorizing applications? The simple answer: Vectorizing improves performance, and achieving high performance can save power. The faster an application can compute CPU-intensive regions, the faster the CPU can be set to a lower power state.

    How does vectorizing compare to scalar operations with regard to performance and power? Vectorizing consumes less power than equivalent scalar operations because it performs better: scalar operations process several times less data per cycle and require more instructions and more cycles to complete.

    The introduction of wider vector registers in x86 platforms and the increasing number of cores that support single instruction multiple data (SIMD) and threading parallelism now make vectorization an optimization consideration for developers. This is because vector performance gains are applied per core, so multiplicative application performance gains become possible for more applications. In the past, many developers relied heavily on the compiler to auto-vectorize some loops, but serial constraints of programming languages have hindered the compiler’s ability to vectorize many different kinds of loops. The need arose for explicit vector programming methods to extend vectorization capability to support reductions and to vectorize:

    • Outer loops

    • Loops with user defined functions

    • Loops that the compiler assumes to have data dependencies, but which the developer knows to be benign (a small sketch follows this list).
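    As an illustrative sketch (not from this article) of the last two cases, a user-defined function can be declared SIMD-enabled and the loop given an explicit independence hint. OpenMP 4.0 syntax is shown; the Intel-specific #pragma simd and __declspec(vector)/__attribute__((vector)) forms are equivalent.

    #pragma omp declare simd            // ask the compiler to generate a vector version of this function
    float damp(float x) { return 0.5f * x; }

    void apply(float *restrict out, const float *restrict in, int n)
    {
    #pragma omp simd                    // assert the iterations are independent
      for (int i = 0; i < n; ++i)
        out[i] = damp(in[i]);
    }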

    In summary: achieving high performance can also save power.

    (An excellent web reference is “Programming and Compiling for Intel® Many Integrated Core Architecture”. While the focus is on Intel® Xeon Phi™ coprocessor optimization, much of the content is also applicable to Intel® Xeon® and Intel® Core™ processors.)

    This document describes high-level best known methods (BKMs) for using explicit vector programming to improve the performance of CPU-bound applications on modern processors with vector processing units.  In many cases, it is advisable to consider structural changes that accommodate both thread-level parallelism and SIMD-level parallelism as you pursue your optimization strategy.

    Note: To determine whether your application is CPU-bound or memory-bound, see About Performance Analysis with VTune™ Amplifier and Detecting Memory Bandwidth Saturation in Threaded Applications. Using hotspot analysis, find the parts of your application that are CPU-bound.

    The following steps are applicable to CPU-bound applications:

    1. Measure baseline application performance.
    2. Run hotspots and general exploration report analysis with the Intel® VTune™ Amplifier XE.
    3. Determine hot loop/functions candidates to see if they are qualified for SIMD parallelism.
    4. Implement SIMD parallelism using explicit vector programming techniques.
    5. Measure SIMD performance.
    6. [Optional for advanced developers] Generate assembly code and inspect.
    7. Repeat!

    Step 1.  Measure Baseline Application Performance

    You first need to have a baseline for your application's existing performance level to determine whether your vectorization changes are effective. In addition, you need to have a baseline to measure your progress and final application performance relative to your starting point. Understanding this provides some guidance about when to stop optimizing.

    Use a release build of your application for the initial baseline instead of a debug build. A release build contains all the optimizations in your final application. This is important because you need to understand where the loops or “hotspots” in your application are spending significant time.

    A release baseline provides symbol information, and has all optimizations turned on except simd (explicit vectorization) and vec (auto-vectorization). To explicitly turn off simd and auto-vectorization, use the compiler switches -no-simd and -no-vec. (See the Intel® C++ Compiler User Reference Guide 14.0.)

    Compare the baseline’s performance against the vectorized version to get a sense of how well your vectorization tuning approaches theoretical maximum speedup.

    It is best to compare the performance of specific loops in the baseline and vectorized version using tools such as the Intel® VTune™ Amplifier XE or embedded print statements.
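
    For instance, here is a minimal sketch of such an embedded measurement using OpenMP's portable timer (the loop body and problem size are placeholders; compile with your compiler's OpenMP option so omp_get_wtime is available):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>
    
    int main(void)
    {
      int n = 10000000;
      float *a = malloc(n * sizeof(float));
      float *b = malloc(n * sizeof(float));
      for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    
      double t0 = omp_get_wtime();   // timestamp before the candidate loop
      float sum = 0.0f;
      for (int i = 0; i < n; i++)
        sum += a[i] * b[i];          // loop under study
      double t1 = omp_get_wtime();   // timestamp after the candidate loop
    
      printf("loop time %.6f s, result %f\n", t1 - t0, sum);
      free(a);
      free(b);
      return 0;
    }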

    Step 2. Run hotspots and general exploration report analysis with Intel® VTune™ Amplifier XE

    You can use the Intel® VTune™ Amplifier XE to find the most time-consuming functions in your application. The “Hotspots” analysis type is recommended, although “Lightweight Hotspots” (which profiles the whole system, as opposed to just your application) works as well.

    Identifying which areas of your application are taking the most time allows you to focus your optimization efforts in those areas where performance improvements will have the most effect. Generally, you want to focus only on the top few hotspots or functions taking at least 10% of your application’s total runtime. Make note of the hotspots you want to focus on for the next step. (Tutorial: Finding Hotspots.)

    The general exploration report can provide information about:

    • TLB misses (consider compiler profile guided optimization),

    • L1 Data cache misses (consider cache locality and using streaming stores),

    • Split loads and split stores (consider data alignment for targeted architecture),

    • Memory bandwidth,

    • Memory latency demanded by the application (consider streaming stores and prefetching).

    This higher level analysis can help you determine whether it is profitable to pursue vectorization tuning.

    Step 3: Determine Whether Hot Loop/Function Candidates Are Qualified for SIMD Parallelism

    One key suitability ingredient for choosing loops to vectorize is whether the memory references in the loop are independent of each other. (See Memory Disambiguation inside vector-loops and Requirements for Vectorizable Loops.)

    The Intel® Compiler vectorization report (or -vec-report) can tell you whether each loop in your code was vectorized.  Ensure that you are using the compiler optimization level 2 or 3 (-O2 or –O3) to enable the auto-vectorizer. Run the vectorization report and look at the output for the hotspots determined from Step 2. If there are loops in these hotspots that did not vectorize, check whether they have math, data processing, or string calculations on data in parallel (for instance in an array). If they do, they might benefit from vectorization. Move to Step 4 if any vectorization candidates are found.

    Data alignment

    Data alignment is another key ingredient for getting the most out of your vectorization efforts.  If the Intel® VTune™ Amplifier reports split loads and stores, then the application is using unaligned data. Data alignment forces the compiler to create data objects in memory on specific byte boundaries.  There are two aspects of data alignment that you must be aware of:

    1. Create arrays with certain byte alignment properties.

    2. Insert alignment pragmas/directives and clauses in performance critical regions.

    Alignment increases the efficiency of data loads and stores to and from the processor. When targeting Intel® Streaming SIMD Extensions 2 (Intel® SSE2) platforms, use 16-byte alignment to facilitate the use of aligned SSE load instructions. When targeting the Intel® Advanced Vector Extensions (Intel® AVX) instruction set, try to align data on a 32-byte boundary. (See Improving Performance by Aligning Data.) For Intel® Xeon Phi™ coprocessors, memory movement is optimal on 64-byte boundaries. (See Data Alignment to Assist Vectorization.)
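
    As an illustration, here is a minimal sketch of both aspects using _mm_malloc/_mm_free and the Intel compiler's __assume_aligned keyword (the 64-byte alignment shown targets Intel® Xeon Phi™; substitute 32 for Intel® AVX or 16 for Intel® SSE; the function and variable names are illustrative only):

    #include <xmmintrin.h>   // declares _mm_malloc and _mm_free
    
    // 2. Assert alignment in the performance-critical loop.
    void scale(float *restrict a, const float *restrict b, int n)
    {
      __assume_aligned(a, 64);   // Intel compiler alignment assertion
      __assume_aligned(b, 64);
      for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];
    }
    
    void example(int n)
    {
      // 1. Create the arrays on a 64-byte boundary.
      float *a = (float *)_mm_malloc((size_t)n * sizeof(float), 64);
      float *b = (float *)_mm_malloc((size_t)n * sizeof(float), 64);
      if (a && b)
        scale(a, b, n);
      _mm_free(a);
      _mm_free(b);
    }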

    Unit stride

    Consider using unit stride memory (also known as address sequential memory) access and structure of arrays (SoA) rather than arrays of structures (AoS) or other algorithmic optimizations to assist vectorization. (See Memory Layout Transformations.)

    As a general rule, it is best to try to access data in a unit-stride fashion when referencing memory, because this is often good for vectorization and other parallel programming techniques. (See Improving Discrete Cosine Transform performance using Intel(R) Cilk(TM) Plus.)
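
    To make the distinction concrete, here is a minimal sketch contrasting the two layouts (the point/coordinate fields are purely illustrative):

    #define N 1024
    
    // Array of Structures (AoS): the x, y, z of one point are adjacent in memory,
    // so a loop over p[i].x strides by sizeof(struct PointAoS) bytes.
    struct PointAoS { float x, y, z; };
    struct PointAoS p[N];
    
    // Structure of Arrays (SoA): all x values are contiguous,
    // so a loop over pts.x[i] is unit stride and vectorizes well.
    struct PointsSoA { float x[N], y[N], z[N]; };
    struct PointsSoA pts;
    
    void shift_x(float dx)
    {
      for (int i = 0; i < N; i++)
        pts.x[i] += dx;   // unit-stride access (SoA)
      for (int i = 0; i < N; i++)
        p[i].x += dx;     // strided access (AoS)
    }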

    Successful vectorization may hinge on the application of other loop optimizations, such as loop interchange (see information on cache locality), and loop unroll.

    It may be worth experimenting to see if inlining a function using –ip or –ipo allows vectorization to proceed for loops with embedded, user-defined functions. This is one alternative approach to using simd-enabled functions; there may be tradeoffs between using one or the other.

    Note:

    If the algorithm is computationally bound when performing hotspot analysis, continue pursuing the strategy described in this paper. If the algorithm is memory-latency bound or memory-bandwidth bound, then vectorization will not help. In such cases, consider strategies like cache optimizations or other memory-related optimizations, or even rethink the algorithm entirely. High-level loop optimizations, such as –O3, can look for loop interchange optimizations that might help cache locality issues. Cache blocking can also help improve cache locality when applicable. (See Cache Blocking Techniques, which is specific to the Intel® Many Integrated Core Architecture (Intel® MIC Architecture), but the technique applies to the Intel® Xeon® processor as well.)

    Step 4: Implement SIMD Parallelism Using Explicit Vector Programming Techniques

    Explicit vector programming includes features such as the Intel® Cilk™ Plus or OpenMP* 4.0 vectorization directives. These optimizations provide a very powerful and portable way to express vectorization potential in C/C++ applications. OpenMP* 4.0 vectorization directives are also applicable to Fortran applications. These explicit vector programming techniques give you the means to specify which targeted loops to vectorize. Candidate loops for vectorization directives include loops that have too many memory references for the compiler to put in place dependency checks, loops with reductions, loops with user-defined functions, outer loops, among others.

    (See Best practices for using Intel® Cilk™ Plus for recommendations on the Intel® Cilk™ Plus methodology, and “Enabling SIMD in program using OpenMP4.0” for how to enable SIMD features in an application using the OpenMP* 4.0 methodology.)

    See also the webinar Introducing Intel® Cilk™ Plus and two video training series detailing vectorization essentials with explicit vector programming using Intel® Cilk™ Plus and OpenMP* 4.0 vectorization techniques.

    Here are some common components of explicit vector programming.

    SIMD-enabled Functions (Intel® Cilk™ Plus and OpenMP* 4.0 Methodologies)

    User creation of SIMD-enabled functions is a capability provided in both the Intel® Cilk™ Plus and OpenMP* 4.0 methodologies. SIMD-enabled functions explicitly describe the SIMD behavior of user-defined functions, including how SIMD behavior is altered due to call site dependence. (See Call site dependence for SIMD-enabled functions in C++, which explains why the compiler sometimes uses a vector version of a function in some call sites, but not others. It also describes what you can do to extend the types of call sites for which the compiler can provide vector versions.  Learn more about SIMD-enabled functions in  Usage of linear and uniform clause in Elemental function (SIMD-enabled function).)
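
    For example, here is a minimal sketch of a SIMD-enabled function in the OpenMP* 4.0 syntax (the function and its clauses are illustrative; the Intel® Cilk™ Plus form uses __declspec(vector) with equivalent clauses):

    // The compiler generates a scalar version and one or more vector versions of this function.
    #pragma omp declare simd uniform(v, scale) linear(i:1)
    float damp(const float *v, int i, float scale)
    {
      return v[i] * scale;
    }
    
    void apply(float *restrict out, const float *restrict in, float scale, int n)
    {
      #pragma omp simd
      for (int i = 0; i < n; i++)
        out[i] = damp(in, i, scale);   // a call site inside a SIMD loop can use the vector version
    }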

    SIMD Loops (Intel® Cilk™ Plus and OpenMP* 4.0 Methodologies)

    Both the Intel® Cilk™ Plus and OpenMP* 4.0 methodologies provide SIMD loops. The principle with SIMD loops is to explicitly describe the SIMD behavior of a loop, including descriptions of variable usage and any idioms such as reductions. (See Requirements for Vectorizing Loops with #pragma SIMD.) For a quick introduction to #pragma simd, see the corresponding topics for Intel® Cilk™ Plus and OpenMP* 4.0.
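
    As a small example, here is a minimal sketch of an explicitly marked SIMD loop containing a reduction idiom (names are illustrative):

    float dot(const float *a, const float *b, int n)
    {
      float sum = 0.0f;
    
      // OpenMP* 4.0 form; the Intel® Cilk™ Plus form would be "#pragma simd reduction(+:sum)"
      #pragma omp simd reduction(+:sum)
      for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    
      return sum;
    }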

    Traditionally, only inner loops have been targeted for vectorization. One unique application of the Cilk Plus #pragma simd or OpenMP* 4.0 #pragma omp simd is that it can be applied to an outer loop.

    (See Outer Loop Vectorization and Outer Loop Vectorization via Intel® Cilk™ Plus Array Notations, which describe using #pragma simd in outer loops.)

    Intel® Cilk™ Plus Array Notation (Intel® Cilk™ Plus Methodology)

    Array Notation is an Intel-specific language extension that is a part of the Intel® Cilk™ Plus methodology supported by the Intel® C++ Compiler. Array Notation provides a way to express a data parallel operation on ordinary declared C/C++ arrays. Array Notation is also compatible with OpenMP* 4.0 and Intel® Cilk™ Plus SIMD-enabled functions. It provides a concise way of replacing loops operating on arrays with a clean array notation syntax that the Intel® Compiler identifies as being vectorizable.
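
    As a brief illustration, here is a minimal sketch using Array Notation (section syntax is array[start:length]); it assumes the Intel® C++ Compiler with Intel® Cilk™ Plus support:

    // Equivalent to: for (int i = 0; i < n; i++) y[i] += a * x[i];
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
      y[0:n] += a * x[0:n];
    }
    
    // __sec_reduce_add() sums the elements of the section expression.
    float dot_product(const float *x, const float *y, int n)
    {
      return __sec_reduce_add(x[0:n] * y[0:n]);
    }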

    Step 5: Measure SIMD performance

    Measure the runtime performance of your application's vectorized build. If you are satisfied, you are done! Otherwise, inspect the -vec-report6 output to get a SIMD vectorization summary report (to check alignment, unit stride, use of SoA versus AoS, interaction with other loop optimizations, etc.).

    (For a deeper exploration on measuring performance, see How to Benchmark Code Execution Times on Intel®  IA-32 and IA-64 Instruction Set Architectures.)

    Another approach is to use the family of compiler switches of the form –profile-xxxx. (These switches are described in ‘Profile Function or Loop Execution Time’.) Using the instrumentation method to profile function or loop execution time makes it easy to view where cycles are being spent in your application. The Intel® Compiler inserts instrumentation code into your application to collect the time spent in various locations. This data helps identify hotspots that may be candidates for optimization tuning or parallelization.

    Another method to measure performance is to re-run the Intel® VTune™ Amplifier XE hotspot analysis after the optimizations are made and compare results.

    Optional Step 6 (for advanced developers): Generate assembly code and inspect it

    For those who want to see the assembly code that the compiler generates, and inspect that code to gain insight into how well applications were vectorized, use the compiler switch –S to compile to assembly (.s) without invoking a link step.

    Step 7: Repeat!
    Repeat as needed until you achieve the desired performance or no good candidates remain.

     

    Other considerations are applicable for applications that are memory latency-bound or memory bandwidth-bound:

    Other considerations: Prefetching and Streaming Stores

    Prefetching

    Data prefetching is a method for a compiler or a developer to request that data be pulled into a cache line from main memory prior to it being used. Prefetching is more applicable for Intel® MIC Architecture.   Explicit control of prefetching can be an important performance factor to investigate. (See  Prefetching on Intel® MIC Architecture.)
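
    As an illustration, here is a minimal sketch of an explicit software prefetch using the _mm_prefetch intrinsic (the distance of 64 elements ahead is an assumption that would need tuning for a real loop):

    #include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
    
    void accumulate(float *restrict out, const float *restrict in, int n)
    {
      const int dist = 64;   // assumed prefetch distance, in elements
      for (int i = 0; i < n; i++) {
        if (i + dist < n)
          _mm_prefetch((const char *)&in[i + dist], _MM_HINT_T0);   // request a line ahead of its use
        out[i] += in[i];
      }
    }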

    Streaming stores

    Streaming stores are a method of writing data explicitly to main memory, bypassing all intermediate caches, in instances where you are sure that the data being written will not be needed from cache any time soon. Strictly speaking, bypassing all caches is only applicable on Intel® Xeon® processors. For Intel® Xeon Phi™ coprocessors, streaming-store evict instructions are provided to evict data only from a specific cache. (See Intel® Xeon Phi™ coprocessor specific support of streaming stores or Compiler-based Data Prefetching and Streaming Non-temporal Store Generation for Intel Xeon Phi Coprocessor (May 2013). Vectorization support describes the use of the VECTOR NONTEMPORAL compiler directive for addressing streaming stores.)
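
    Here is a minimal sketch of requesting non-temporal (streaming) stores from the Intel compiler for a large fill loop, assuming dst will not be re-read any time soon:

    void fill(float *restrict dst, float value, int n)
    {
      // Ask the compiler to use streaming stores for this loop's stores.
      #pragma vector nontemporal
      for (int i = 0; i < n; i++)
        dst[i] = value;
    }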

    Other considerations: Scatter, gather, and compress structures:

    Many applications benefit from explicit vector programming efforts. In many cases performance increases over scalar performance can be commensurate with the number of available vector lanes on a given platform. However, some types of coding patterns or idioms limit vectorization performance to a large degree.

    Gather and Scatter codes

    A[i] = B[Index[i]];    // Gather
    A[Index[i]] = B[i];    // Scatter

    While gather/scatter vectorization is available on Intel® MIC Architecture and recent Intel® Xeon® platforms, the performance gains from vectorization relying on gather/scatter are often much inferior to the use of unit-strided loads and stores inside vector loops. If there are not enough other profitably vectorized operations (such as multiply, divide, or math calls) inside such vector loops, performance may even be lower than serial performance in some cases. The only possible workaround for such issues is to look at alternative algorithms altogether to avoid using gathers and scatters.

    Compress and Expand structures

    Compress and expand structures are generally problematic. On Intel® Xeon Phi™ coprocessors, the Intel® Compiler can automatically vectorize loops that contain simple forms of compress/expand idioms. An example of a compress idiom is as follows:

    do i = 1, n
       if (b(i) > 0) then
           x = x + 1
           a(x) = b(i)
       endif
    enddo

    In this example, the variable x is updated under a condition. Note that it is incorrect to use #pragma simd for such compress structures, but using #pragma ivdep is okay.
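
    For reference, here is a minimal sketch of the same compress idiom rendered in C; as noted above, #pragma ivdep (rather than #pragma simd) is the appropriate hint:

    // Compress: copy only the positive elements of b to the front of a.
    int compress_positive(float *restrict a, const float *restrict b, int n)
    {
      int x = 0;
      #pragma ivdep   // per the note above, ivdep (not simd) may be applied to this idiom
      for (int i = 0; i < n; i++) {
        if (b[i] > 0.0f)
          a[x++] = b[i];
      }
      return x;       // number of elements kept
    }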

    Improve performance of such vectorized loops on Intel® MIC Architecture using the -opt-assume-safe-padding compiler option. (See Common Vectorization Tips.)

    Currently, full vectorization of compress structures is reserved for future platforms that provide hardware support for them.

    Reference Materials:

    Compiler diagnostic messages

    • Intel® Fortran Vectorization Diagnostics– Diagnostic messages from the vectorization report produced by the Intel® Fortran Compiler. To obtain a vectorization report in Intel® Fortran, use the option -vec-report[n] (Linux* and OS X* platforms) or /Qvec-report[:n] (Windows* platform).

    • Vectorization Diagnostics for Intel® C++ Compiler– Diagnostic messages from the vectorization report produced by the Intel® C++ Compiler. To obtain a vectorization report with the Intel® C++ Compiler, use option -vec-report[n] (Linux* and OS X* platforms) or /Qvec-report[:n] (Windows* platform).

    Intel® C++ Compiler Videos

    Webinars

    Introduction to Vectorization using Intel® Cilk™ Plus Extensions

    Articles

     

    Cilk, Intel, the Intel logo, Intel Xeon Phi, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

    * Other names and brands may be claimed as the property of others.

    The Chronicles of Phi - part 4 - Hyper-Thread Phalanx – tiled_HT2


    The prior part (3) of this blog showed the effects of the first-level implementation of the Hyper-Thread Phalanx. The change in programming yielded a 9.7% improvement in performance for the small model, and little to no improvement in the large model. This left part 3 of this blog with the questions:

    What is non-optimal about this strategy?
    And: What can be improved?

    There are two things, one is obvious, and the other is not so obvious.

    Data alignment

    The obvious thing, which I will now show you, is that vectorization improves with aligned data. Most compiler optimizations will examine the code of the loop and, when necessary, insert preamble code that tests for alignment and executes up until alignment is reached (this is called peeling), then insert code that executes more efficiently with the now-aligned data. Finally, post-amble code is inserted to handle any remainder that may be present.

    This sounds rather straightforward until you look at the inner loop:

    #pragma simd   
              for (x = 1; x < nx-1; x++) { 
                ++c; 
                ++n; 
                ++s; 
                ++b; 
                ++t; 
                f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c+1] 
                    + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t]; 
              } 

    In the preceding code, the two terms: f1_t[c-1] , and , f1_t[c+1] will muck up the vector alignment tests since [c-1], [c] and [c+1] can never all be aligned at the same time.

    Are the compiler writers smart enough to offer some measure of optimization for such a loop?

    As it turns out, they are.

    Due to initial unknowns, the code has to do more work in the preamble and post-amble sections, and performs a reduced number of iterations in the fastest interior body loop of the code.

    Take particular note that the input array f1_t is indexed in seven different ways. This means that the preamble code that determines alignment may have to work on minor permutations of the seven references in an attempt to narrow in on the time when the largest number of references are vector aligned. This is non-trivial for the compiler code generation, as well as a potential area for additional overhead.

    What can be improved?

    Use aligned data when possible

    This is addressed in an additional improvement to the coding of the tiled_HT2 program.

    First, we require that the dimension NX be a multiple of the number of REALs that fill a cache line. This is not an unreasonable requirement. The value of 256 was used in the original example code. It is not too much of a restriction to require that NX be a multiple of 16 for floats, or 8 for doubles.

    To assure alignment, I changed the malloc calls that allocate the arrays to use _mm_malloc with an alignment of the cache line size (64). This is a relatively simple change. (This will be shown later, after the next optimization tip that also affects allocation.)
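
    The code below also uses two constants, CACHE_LINE_SIZE and N_REALS_PER_CACHE_LINE, that are not shown in these excerpts. A minimal sketch of plausible definitions, assuming 64-byte cache lines and REAL typedef'd to float or double as in the original sample:

      #define CACHE_LINE_SIZE 64   // bytes per cache line on the coprocessor
      // 16 REALs per cache line for float, 8 for double
      #define N_REALS_PER_CACHE_LINE (CACHE_LINE_SIZE / sizeof(REAL))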

    Next, now that I know that NX is an even multiple of cache lines, and the arrays are cache line aligned, I can construct a function to process the innermost loop with the foreknowledge that six of the array references are cache aligned and two are not (the extra reference is the output array). The two that are not aligned are the references to [c-1] and [c+1]. The compiler, knowing beforehand what is aligned and what is not aligned, does not have to insert code to make this determination; i.e. the compiler can reduce, or completely remove, the preamble and post-amble code.

    The second improvement (non-obvious improvement):

    Redundancy can be good for you

    Additional optimization can be achieved by redundantly processing x=0 and x=nx-1 as if these cells were in the interior of the parallel pipette being processed. This means that the preamble and post-amble code for unaligned loops can be bypassed, and that the elements x=1:15 can be processed directly as an aligned vector (as opposed to one-by-one computation or unaligned vector computation). The same is done for the 16 elements where the last element (x=nx-1) computes differently than the other elements of the vector. This does mean that after calculating the incorrect values (for free) for x=0 and x=nx-1, we then have to perform a scalar calculation to insert the correct values into the x column. Essentially, you are exchanging two scalar loops of 16 iterations each for two of (one 16-wide vector operation + one scalar operation), where the scalar operations hit L1 cache.

    Adding the redundancy change necessitated allocating the arrays two vectors' worth of elements larger than the actual array requirement, and returning the address of the 2nd vector for the array pointer. Additionally, this requires zeroing one element preceding and one element following the working array size. The allocation then provides one vector of addressable memory on each side that is never used as valid data. Not doing so could result in a page fault, depending on the location and extent of the allocation.

    Change to allocations:

      // align the allocations to cache line
      // increase allocation size by 2 cache lines
      REAL *f1_padded = (REAL *)_mm_malloc(
        sizeof(REAL)*(nx*ny*nz + N_REALS_PER_CACHE_LINE*2),
        CACHE_LINE_SIZE);
    
      // assure allocation succeeded
      assert(f1_padded != NULL);
      
      // advance one cache line into buffer
      REAL *f1 = f1_padded + N_REALS_PER_CACHE_LINE;
      
      f1[-1] = 0.0;       // assure cell prior to array not Signaling NaN
      f1[nx*ny*nz] = 0.0; // assure cell following array not Signaling NaN
    
      // align the allocations to cache line
      // increase allocation size by 2 cache lines
      REAL *f2_padded = (REAL *)_mm_malloc(
        sizeof(REAL)*(nx*ny*nz + N_REALS_PER_CACHE_LINE*2),
        CACHE_LINE_SIZE);
    
      // assure allocation succeeded
      assert(f2_padded != NULL);
      
      // advance one cache line into buffer
      REAL *f2 = f2_padded + N_REALS_PER_CACHE_LINE;
      
      f2[-1] = 0.0;       // assure cell prior to array not Signaling NaN
      f2[nx*ny*nz] = 0.0; // assure cell following array not Signaling NaN
    

    As an additional benefit the compiler can now generate more code using Fused Multiply and Add (FMA) instructions.

    The tiled_HT2 code follows:

    void diffusion_tiled_aligned(
                    REAL*restrict f2_t_c, // aligned
                    REAL*restrict f1_t_c, // aligned
                    REAL*restrict f1_t_w, // not aligned
                    REAL*restrict f1_t_e, // not aligned
                    REAL*restrict f1_t_s, // aligned
                    REAL*restrict f1_t_n, // aligned
                    REAL*restrict f1_t_b, // aligned
                    REAL*restrict f1_t_t, // aligned
                    REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
                    REAL cb, REAL cc, int countX, int countY) {
    
      __assume_aligned(f2_t_c, CACHE_LINE_SIZE);
      __assume_aligned(f1_t_c, CACHE_LINE_SIZE);
      __assume_aligned(f1_t_s, CACHE_LINE_SIZE);
      __assume_aligned(f1_t_n, CACHE_LINE_SIZE);
      __assume_aligned(f1_t_b, CACHE_LINE_SIZE);
      __assume_aligned(f1_t_t, CACHE_LINE_SIZE);
      // countY is number of squads along Y axis
      for(int iY = 0; iY < countY; ++iY) {
        // perform the x=0:N_REALS_PER_CACHE_LINE-1 as one cache line operation
        // On Phi, the following reduces to vector with one iteration
        // On AVX two iterations
        // On SSE four iterations
        #pragma noprefetch
        #pragma simd  
        for (int i = 0; i < N_REALS_PER_CACHE_LINE; i++) {
          f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i] + ce * f1_t_e[i]
                       + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];
        } // for (int i = 0; i < N_REALS_PER_CACHE_LINE; i++)
        
        // now overstrike x=0 with correct value
        // x=0 special (no f1_t[c-1])
        f2_t_c[0] = cc * f1_t_c[0] + cw * f1_t_w[1] + ce * f1_t_e[0]
                    + cs * f1_t_s[0] + cn * f1_t_n[0] + cb * f1_t_b[0] + ct * f1_t_t[0];
        // Note, while we could overstrike x=[0] and [nx-1] after processing the entire depth of nx
        // doing so will result in the x=0th cell being evicted from L1 cache.
    
        // do remainder of countX run including incorrect value for i=nx-1 (countX-1)
        #pragma vector nontemporal
        #pragma noprefetch
        #pragma simd  
        for (int i = N_REALS_PER_CACHE_LINE; i < countX; i++) {
            f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i] + ce * f1_t_e[i]
                     + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];
        } // for (int i = N_REALS_PER_CACHE_LINE; i < countX; i++)
    
        // now overstrike x=nx-1 with correct value
        // x=nx-1 special (no f1_t[c+1])
        int i = countX-1;
        f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i-1] + ce * f1_t_e[i]
                       + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];
    
        // advance one step along Y
        f2_t_c += countX;
        f1_t_c += countX;
        f1_t_w += countX;
        f1_t_e += countX;
        f1_t_s += countX;
        f1_t_n += countX;
        f1_t_b += countX;
        f1_t_t += countX;
      } // for(int iY = 0; iY < countY; ++iY)
    } // void diffusion_tiled_aligned(
    
    void diffusion_tiled(REAL *restrict f1, REAL *restrict f2, int nx, int ny, int nz,
                  REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
                  REAL cb, REAL cc, REAL dt, int count) {
    
    #pragma omp parallel
      {
    
        REAL *f1_t = f1;
        REAL *f2_t = f2;
    
        int nSquadsZ = (nz + nHTs - 1) / nHTs; // place squads across z dimension
        int nSquadsZY = nSquadsZ * ny;  // number of full (and partial) squads on z-y face
        int nSquadsZYPerCore = (nSquadsZY + nCores - 1) / nCores;
    
        // Determine this thread's squads
        int SquadBegin = nSquadsZYPerCore * myCore;
        int SquadEnd = SquadBegin + nSquadsZYPerCore; // 1 after last squad for core
        if(SquadEnd > nSquadsZY) SquadEnd = nSquadsZY;
        for (int i = 0; i < count; ++i) {
          int nSquads;
          // restrict current thread to its subset of squads on the Z/Y face.
          for(int iSquad = SquadBegin; iSquad < SquadEnd; iSquad += nSquads) {
            // determine nSquads for this pass
            if(iSquad % ny == 0)
              nSquads = 1; // at y==0 boundary
            else
            if(iSquad % ny == ny - 1)
              nSquads = 1;  // at y==ny-1 boundary
            else
            if(iSquad / ny == (SquadEnd - 1) / ny)
              nSquads = SquadEnd - iSquad;  // within (inclusive) 1:ny-1
            else
              nSquads = ny - (iSquad % ny) - 1; // restrict from iSquad%ny to ny-1
            int z0 = (iSquad / ny) * nHTs; // home z for 0'th team member of Squad
            int z = z0 + myHT;  // z for this team member
            int y = iSquad % ny;
            // last squad along z may be partially filled
            // assure we are within z
            if(z < nz)
            {
              int x = 0;
              int c, n, s, b, t;
              c =  x + y * nx + z * nx * ny;
              n = (y == 0)    ? c : c - nx;
              s = (y == ny-1) ? c : c + nx;
              b = (z == 0)    ? c : c - nx * ny;
              t = (z == nz-1) ? c : c + nx * ny;
              diffusion_tiled_aligned(
       &f2_t[c], // aligned
       &f1_t[c], // aligned
       &f1_t[c-1], // unaligned
       &f1_t[c+1], // unaligned
       &f1_t[s], // aligned
       &f1_t[n], // aligned
       &f1_t[b], // aligned
       &f1_t[t], // aligned
                            ce, cw, cn, cs, ct, cb, cc, nx, nSquads);
            } // if(z < nz)
          } // for(int iSquad = SquadBegin; iSquad < SquadEnd; iSquad += nSquads)
    // barrier required because we removed implicit barrier of #pragma omp for collapse(2)
          #pragma omp barrier
          // swap buffer pointers
          REAL *t = f1_t;
          f1_t = f2_t;
          f2_t = t;
        } // count
      } // parallel
      return;
    }
    

    The performance chart below incorporates the two new programs, tiled_HT1 and tiled_HT2.

    The above chart clearly illustrates that tiled_HT2 is starting to make some real progress, at least for the small model, with another 9.5% improvement. Be mindful that code alignment may still be an issue, and the above chart does not take this into consideration.

    What else can be improved?

    Think about it while you await part 5.

    Jim Dempsey
    Consultant
    QuickThread Programming, LLC

     

     

    How to use the cilkview?


    I have a C search application on a CentOS 6.x 64-bit Linux server on which I just installed the Cilk Plus compiler to take advantage of more CPUs/cores. I've added cilk_spawn to some recursive scanning functions in my program. After re-compiling the search application with the Cilk Plus gcc compiler, the search program is working as intended without any seg faults or other errors.

    My question is: how do I use the cilkview analyzer? I want to know if Cilk Plus/spawning is helping my search application and, if so, by how much?

    Thanks!

    Lawrence

     

     

     

     

    Intel MPI Library and Composer XE Compatibility


    The following table lists all supported versions of the Intel® MPI Library and the Intel® Composer XE.  Use this as a reference on the cross-compatibility between the library and associated compiler.

    Compatibility Matrix
    Intel® MPI Library Version    Supported compiler versions
    3.1, 3.1 U1                   Intel® Compiler XE 8.1, 9.0, 9.1, 10.0, 10.1
    3.2, 3.2 U1, 3.2 U2           Intel® Compiler XE 9.1, 10.0, 10.1, 11.0
    4.0, 4.0 U1, 4.0 U2           Intel® Compiler XE 10.1, 11.0, 11.1
    4.0 U3                        Intel® Compiler XE 11.1; Intel® Composer XE 2011
    4.1, 4.1 U1                   Intel® Compiler XE 11.1; Intel® Composer XE 2011, 2013
    4.1 U2, 4.1 U3                Intel® Composer XE 2013 (up to Update 2), 2013 SP1

    NOTE: Any older versions of the Intel® MPI Library may work with newer versions of the Intel® Compiler XE but compatibility is not guaranteed. If you have concerns or see any issues, please let us know by submitting a ticket at the Intel® Premier Support site.

    Question on reducers


    In my search application there are global variables defined outside any function that I would like to use Cilk reducers on.

    Specifically I have code like this:

    #include "search.h"
    
    static int total_users = 0;
    static int total_matches = 0;

    These total_x variables are incremented throughout the application on different functions.

    I tried adding the following for total_users and received the following error:

    cilk::reducer_opadd<static int> total_users;

    cilk plus error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘:’ token

    What am I doing wrong here?

     

     

    Issue with gather & scatter operations


    Hi,

    I read in the doc that array notation can be used for array indices in both cases:

    C[:] = A[B[:]] and A[B[:]] = C[:]

    I tried to use this notation for the left and right operands at the same time, but it gives me wrong results.

    Here is my problem:

    double tmp[VEC_SIZE]; // Already initialized
    int index[VEC_SIZE];  // Already initialized
    
    tab[index[:]] = tab[index[:]] + tmp[:];       // This line gives wrong result
    
    for (int i = 0; i < VEC_SIZE; i++) {
        tab[index[i]] = tab[index[i]] + tmp[i];   // While this loop gives the correct result
    }

    To me, these two versions of the code are supposed to be equivalent; am I wrong?

    Can we use array notation for array indices in the left and right operands at the same time?

    Thanks


    Question on cilk_sort


    Are the cilk_sort functions parallel drop-in replacements for the C qsort function?

    Intel® Software Development Tools 2015 Beta


    What's New in the 2015 Beta

    This suite of products brings together exciting new technologies along with improvements to Intel’s existing software development tools:

    • Get guidance on how to boost performance safely without creating threading bugs using the Intel® Advisor XE 2015 Beta.  These improvements include scaling to a larger number of processors and improved viewing and advanced modeling of suitability information on both Intel® Xeon® and Intel® Xeon Phi™ processors.
      • Suitability modeling for Intel® Xeon Phi™ processors is available as an experimental feature by setting the environment variable ADVIXE_EXPERIMENTAL=suitability_xeon_phi_modeling
    • Ever wonder what calling sequence led to the code that caused all those cache misses? Find out with the new Intel® VTune™ Amplifier XE 2015 Beta! Added support for remote collection from the GUI, new GPU and Intel® Transactional Synchronization Extensions analysis, and the ability to view your results on computers running OS X*.
    • Now you can debug memory and threading errors with Intel® Inspector XE 2015 Beta! For thread checking, take advantage of 3X performance improvement and reduction in memory overhead. For memory checking take advantage of advancements in the on-demand leak detection and memory growth controls as well as the brand new memory usage graph.
    • Now utilize new Parallel direct sparse Solvers for clusters (CPARDISO) and optimizations for the latest Intel® Architectures with the Intel® Math Kernel Library 11.2 Beta! Get insight into Intel® MKL’s settings via new verbose mode and take advantage of the Intel® MKL Cookbook to help assemble the correct routines for solving complex problems.
    • Leverage the latest language features in Intel® Composer XE 2015 Beta, including full language support for C++11 (/Qstd=c++11 or -std=c++11) and Fortran 2003, Fortran 2008 Blocks, and OpenMP* 4.0 (except user-defined reductions). Gain new insights into optimization opportunities such as vectorization or inlining with redesigned optimization reports (/Qopt-report or -opt-report). Exploit Intel® Graphics Technology for additional performance with new offload computing APIs. Use the new "icl" and "icl++" compilers on OS X* for improved compatibility with the clang/LLVM* toolchain.  The Intel® Integrated Performance Primitives has added support for the Intel® Xeon® Phi™ co-processor.
      • Existing users of optimization reports (opt-report, vec-report, openmp-report, or par-report) or the Intel® C++ Compiler for Linux (-ansi-alias is now default) should refer to the product release notes for more information
    • Get highly-optimized out-of-the-box performance for your MPI applications with the Intel® MPI Library 5.0 Beta! Now’s the time to take advantage of the new MPI-3 functionality, such as non-blocking collectives, and fast one-sided communication, as well as test binary compatibility with existing codes.
    • Extract complete insight into your distributed memory application and quickly find performance bottlenecks and MPI issues using the Intel® Trace Analyzer and Collector 9.0 Beta! In addition to support for the latest MPI-3 features, the new Performance Assistant automatically detects common MPI performance issues and quickly provides resolution tips.

    A detailed description of the new features in the 2015 Beta products is available in the Intel® Software Development Tools 2015 Beta Program: What's New document.

    Details

    This beta program is available for IA-32 architecture-based processors and Intel® 64 architecture-based processors for Linux* and Windows*. The Intel beta compilers and libraries for OS X* are also included in this beta program.

    During this Beta period, you will be provided access to the Intel® Cluster Studio XE 2015 Beta package – a superset containing all Intel® Software Development Tools. At the time of download, you can select to install individual products or the full suite.

    Early access to some components of the 2015 Beta will be available the first week of April.  These components will include the Intel® C++ Composer XE 2015 Beta and the Intel® Fortran Composer XE 2015 Beta.

    The full Intel® Software Development Products 2015 Beta packages (including the Intel® Cluster Studio XE 2015 Beta files) will be available in mid-April, once the 2015 Beta program commences.

    Frequently Asked Questions

    A complete list of FAQs regarding the 2015 Beta can be found in the Intel® Software Development Tools 2015 Beta Program: Frequently Asked Questions document.

    Beta duration

    The beta program officially ends July 11th, 2014. The beta license provided will expire September 25th, 2014. At the conclusion of the beta program, you will be asked to complete a survey regarding your experience with the beta software.

    Support

    Technical support will be provided via Intel® Premier Support. The Intel® Registration Center will be used to provide updates to the component products during this beta period.

    To enroll in this beta program

    Complete the pre-beta survey at the registration link

    • Information collected from the pre-beta survey will be used to evaluate beta testing coverage. Here is a link to the Intel Privacy Policy.
    • Keep the beta product serial number provided for future reference
    • After registration, you will be taken to the Intel Registration Center to download the product
    • After registration, you will be able to download all available beta products at any time by returning to the Intel Registration Center

    Note: At the end of the beta program you should uninstall the beta product software.

    Your next steps:

    • Review the Intel® Software Development Tools 2015 Beta What's New document and FAQ
    • Install the Intel® Software Development Tools 2015 Beta product(s)
    • Try it out and share your experience with us!

    Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
    * Other names and brands may be claimed as the property of others.
    Copyright © 2014, Intel Corporation. All rights reserved.

    Open Source Downloads


    This article makes available third-party libraries, executables and sources that were used in the creation of Intel® Software Development Products or are required for their operation. Intel provides this software pursuant to their applicable licenses.

     

    Required for Operation of Intel® Software Development Products

    The following products require additional third-party software for operation.

    Intel® Composer XE 2015 for Windows*:
    The following binutils package is required for operation with Intel® Graphics Technology:
    Download: binutils_setup.zip
    Please see Release Notes of the product for detailed instructions on using the binutils package.

    The above binutils package is subject to various licenses. Please see the corresponding sources for more information:
    Download: binutils_src.zip
     

    Used within Intel® Software Development Products

    The following products contain Intel® Application Debugger, Intel® Many Integrated Core Architecture Debugger and/or Intel® JTAG Debugger tools which are using third party libraries as listed below.

    Products and Versions:

    Intel® Composer XE 2013 SP1 for Linux*

    • Intel® C++ Composer XE 2013 SP1 for Linux*/Intel® Fortran Composer XE 2013 SP1 for Linux*
      (Initial Release and higher; 13.0 Intel® Application Debugger)

    Intel® Composer XE 2013 for Linux*

    • Intel® C++ Composer XE 2013 for Linux*/Intel® Fortran Composer XE 2013 for Linux*
      (Initial Release and higher; 13.0 Intel® Application Debugger)

    Intel® Composer XE 2011 for Linux*

    • Intel® C++ Composer XE 2011 for Linux*/Intel® Fortran Composer XE 2011 for Linux*
      (Update 6 and higher; 12.1 Intel® Application Debugger)
    • Intel® C++ Composer XE 2011 for Linux*/Intel® Fortran Composer XE 2011 for Linux*
      (Initial Release and up to Update 5; 12.0 Intel® Application Debugger)

    Intel® Compiler Suite Professional Edition for Linux*

    • Intel® C++ Compiler for Linux* 11.1/Intel® Fortran Compiler for Linux* 11.1
    • Intel® C++ Compiler for Linux* 11.0/Intel® Fortran Compiler for Linux* 11.0
    • Intel® C++ Compiler for Linux* 10.1/Intel® Fortran Compiler for Linux* 10.1

    Intel® Embedded Software Development Tool Suite for Intel® Atom™ Processor:

    • Version 2.3 (Initial Release and up to Update 2)
    • Version 2.2 (Initial Release and up to Update 2)
    • Version 2.1
    • Version 2.0

    Intel® Application Software Development Tool Suite for Intel® Atom™ Processor:

    • Version 2.2 (Initial Release and up to Update 2)
    • Version 2.1
    • Version 2.0

    Intel® C++ Software Development Tool Suite for Linux* OS supporting Mobile Internet Devices (Intel® MID Tools):

    • Version 1.1
    • Version 1.0

    Intel AppUp™ SDK Suite for MeeGo*

    • Initial Release (Version 1.0)

    Used third-party libraries:
    Please see the attachments for a complete list of third-party libraries.

    Note: The packages posted here are unmodified copies from the respective distributor/owner and are made available for ease of access. Download or installation of those is not required to operate any of the Intel® Software Development Products. The packages are provided as is, without warranty or support.

    The Chronicles of Phi - part 2 Hyper-Thread Phalanx – tiled_HT1


    In the first part I discussed the diffusion problem and the proposed strategy to address the performance issue through use of a Hyper-Thread Phalanx. I left you with the dangling question:

    How to determine thread binding to core and HT within core?

    Let's begin with an illustration of 2-wide, 3-wide, and 4-wide Hyper-Thread Phalanxes:

    View of the y/z plane of an x/y/z space, x going into the page. Computation order is into the page along x, then stepping down the page in the +y direction.

    The left image of each pair illustrates the cold-cache penetration of x. Yellow indicates one of the threads incurs a cache miss, while all the adjacent threads accessing the same cell experience a cache hit. More importantly, moving on to the next drill down of x (right illustration of each pair), we can now estimate the cache hit ratios: 10/14 (71.43%), 16/21 (76.19%), and 22/28 (78.57%). These are significant improvements over the single-thread layout value of 3/7 (42.86%).

    The question then becomes:

    How to determine thread binding to core and HT within core

    One could specify affinity and core placement through use of environment variables external to the program, but this may not be suitable or reliable. It is better to place the fewest restrictions and requirements on the environment variables. While one set of affinity bindings may be best for this function, your overall application may benefit from a different arrangement of thread bindings. Therefore, the program must determine the affinity bindings applied by the environment.

    The following header HyperThreadPhalanx.h  and utility code HyperThreadPhalanx.c  were used for the improved performance test programs added to the sample program folder. The original test programs were written in C. Therefore, this version of the utility code is also written in C. As an exercise for the reader, you may modify the code for use with C++.

    The primary goal of the HyperThreadPhalanx.c  utility function is to:

    o Determine the number of OpenMP threads in the outer most region of the application
    o Compute a logical core number (zero based and contiguous) for each thread
    o Compute a logical HT number within the core (zero based and contiguous) for each thread
    o Compute the number of logical cores
    o Compute number of HTs per core as used in the working set

    Notes:

    The programmer (operator of the program) must specify some form of realistic affinity binding. They are free to choose almost any strategy that is reasonable for the remainder (the non-HyperThreadPhalanx’ed part) of the application, for example KMP_AFFINITY=compact or KMP_AFFINITY=scatter, as well as combining either with KMP_PLACE_THREADS=nnC,mmT,oO. The only “reasonable” requirement is for each core used to have the same number of working threads. If they do not, the current code will choose the smallest number (though testing of adverse configurations has not been strenuously performed).

    The header file:

    // HyperThreadPhalanx.h
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>
    #include <omp.h>
    #include <assert.h>
    
    // types:
    struct HyperThreadPhalanxThreadInfo_t
    {
      int APICid;
      int PhysicalCore;
      int PhysicalHT;
      int LogicalCore;
      int LogicalHT;
    };
    
    struct HyperThreadPhalanx_t
    {
      int isIntel;
      union {
      char ProcessorBrand[48];
      unsigned int ProcessorBrand_uint32[12];
      };
      int nHTsPerCore;// hardware
      int nThreads;   // omp_get_num_threads() {in parallel region, no nesting}
      int nCores;     // number of core derived therefrom
      int nHTs;       // smallest number of HT's in mapped cores (logical HTs/core)
      struct HyperThreadPhalanxThreadInfo_t* ThreadInfo; // allocated to nThreads
    };
    
    // global variables:
    extern struct HyperThreadPhalanx_t HyperThreadPhalanx;
    
    
    // global thread private variables:
    #if defined(__linux)
    // logical core (may be subset of physical cores and not necessarily core(0))
    extern __thread int myCore; 
    // logical Hyper-Thread within core
    // (may be subset of hw threads in core and not necessarily hwThread(0) in core)
    extern __thread int myHT;
    #else
    // logical core (may be subset of physical cores and not necessarily core(0))
    extern __declspec(thread) int myCore;
    // logical Hyper-Thread within core
    // (may be subset of hw threads in core and not necessarily hwThread(0) in core)
    extern __declspec(thread) int myHT;
    #endif
    
    // functions:
    int HyperThreadPhalanxInit();
    

    The header introduces into your namespace the HyperThreadPhalanx object and two Thread Local Storage variables, myCore and myHT. In addition to the two TLS variables, the user is free to use the other post-HyperThreadPhalanxInit() values in the HyperThreadPhalanx structure if they wish to do so.
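
    To illustrate how a caller might use this interface, here is a minimal sketch (not part of the original sample code):

    #include "HyperThreadPhalanx.h"
    
    int main()
    {
      if (HyperThreadPhalanxInit())
        return 1;   // a diagnostic has already been printed
    
      printf("%d threads: %d logical cores x %d HTs per core\n",
             HyperThreadPhalanx.nThreads,
             HyperThreadPhalanx.nCores,
             HyperThreadPhalanx.nHTs);
    
    #pragma omp parallel
      {
        // myCore and myHT are thread-local; they were set for this OpenMP thread by the init call
    #pragma omp critical
        printf("OpenMP thread %d -> logical core %d, HT %d\n",
               omp_get_thread_num(), myCore, myHT);
      }
      return 0;
    }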

    The current code was kept brief and is only functional for Intel processors (P4 and later). The code uses the CPUID intrinsic and instruction. Information on the CPUID instruction can be found in Intel® Processor Identification and the CPUID Instruction, Application Note 485.

    The code now follows:

    // HyperThreadPhalanx.c
    
    #include "HyperThreadPhalanx.h"
    
    struct HyperThreadPhalanx_t HyperThreadPhalanx;
    
    #if defined(__linux)
    // logical core (may be subset of physical cores and not necessarily core(0))
    __thread int myCore = -1;
    // logical Hyper-Thread within core
    // (may be subset of hw threads in core and not necessarily hwThread(0) in core)
    __thread int myHT = -1;
    #else
    // logical core (may be subset of physical cores and not necessarily core(0))
    __declspec(thread) int myCore = -1;
    // logical Hyper-Thread within core
    // (may be subset of hw threads in core and not necessarily hwThread(0) in core)
    __declspec(thread) int myHT = -1;
    #endif
    
    void __cpuidEX(int cpuinfo[4], int func_a, int func_c)
    {
     int eax, ebx, ecx, edx;
     __asm__ __volatile__ ("cpuid":\
     "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (func_a), "c" (func_c));
     cpuinfo[0] = eax;
     cpuinfo[1] = ebx;
     cpuinfo[2] = ecx;
     cpuinfo[3] = edx;
    } // void __cpuidEX(int cpuinfo[4], int func_a, int func_c)
    
    void InitProcessor()
    {
      unsigned int CPUinfo[4];
      __cpuid(CPUinfo, 0); // This code requires at least support of CPUID
      HyperThreadPhalanx.ProcessorBrand_uint32[0] = CPUinfo[1];
      HyperThreadPhalanx.ProcessorBrand_uint32[1] = CPUinfo[3]; // note order different
      HyperThreadPhalanx.ProcessorBrand_uint32[2] = CPUinfo[2];
      HyperThreadPhalanx.ProcessorBrand_uint32[3] = 0;
      HyperThreadPhalanx.isIntel =
        (strcmp(HyperThreadPhalanx.ProcessorBrand, "GenuineIntel") == 0);
    }
    
    int HyperThreadPhalanxInit()
    {
      InitProcessor();
      if(!HyperThreadPhalanx.isIntel)
      {
        printf("Not Intel processor. Add code to handle this processor.\n");
        return 1;
      }
      if(omp_in_parallel())
      {
        printf("HyperThreadPhalanxInit() must be called from outside parallel .\n");
        return 2;
      }
    
    #pragma omp parallel
      {
        // use omp_get_num_threads() NOT omp_get_max_threads()
        int nThreads = omp_get_num_threads();
        int iThread = omp_get_thread_num();
        unsigned int CPUinfo[4];
    
    #pragma omp master
        {
          HyperThreadPhalanx.nThreads = nThreads;
          HyperThreadPhalanx.ThreadInfo =
           malloc(nThreads * sizeof(struct HyperThreadPhalanxThreadInfo_t));
          __cpuidEX(CPUinfo, 4, 0);
          HyperThreadPhalanx.nHTsPerCore = ((CPUinfo[0] >> 14) & 0x3F) + 1;
          // default logical HT's per core to physical (may change later)
          HyperThreadPhalanx.nHTs = HyperThreadPhalanx.nHTsPerCore;
    
        }
    #pragma omp barrier
        // master region finished, see if allocation succeeded
        if(HyperThreadPhalanx.ThreadInfo)
        {
          __cpuidEX(CPUinfo, 1, 0); // get features
          if(CPUinfo[2] & (1 << 21))
          {
            // processor has x2APIC
            __cpuidEX(CPUinfo, 0x0B, 0);
     // get thread's APICid
            HyperThreadPhalanx.ThreadInfo[iThread].APICid = CPUinfo[3];
          }
          else
          {
            // older processor without x2APIC
            __cpuidEX(CPUinfo, 1, 0);
     // get thread's APICid
            HyperThreadPhalanx.ThreadInfo[iThread].APICid = (CPUinfo[1] >> 24) & 0xFF;
          }
          // Use thread's APICid to determine physical core and physical HT number within core
          HyperThreadPhalanx.ThreadInfo[iThread].PhysicalCore =
            HyperThreadPhalanx.ThreadInfo[iThread].APICid
             / HyperThreadPhalanx.nHTsPerCore;
          HyperThreadPhalanx.ThreadInfo[iThread].PhysicalHT =
            HyperThreadPhalanx.ThreadInfo[iThread].APICid
              % HyperThreadPhalanx.nHTsPerCore;
          // for now indicate LogicalCore and LogicalHT not assigned
          HyperThreadPhalanx.ThreadInfo[iThread].LogicalCore = -1;
          HyperThreadPhalanx.ThreadInfo[iThread].LogicalHT = -1;
        }
    #pragma omp barrier
        // At this point, all the HyperThreadPhalanx.ThreadInfo[iThread].APICid,
        // PhysicalCore and PhysicalHT have been filled-in.
        // However, the logical core number may differ from physical core number
        // no different than OpenMP thread number differing from logical processor number
        // The logical core numbers are 0-based, without gaps
    #pragma omp master
        {
          int NextLogicalCore = 0;
          for(;;)
          {
            int iLowest = -1; // none found
            for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
            {
              // see if unassigned core
              if(HyperThreadPhalanx.ThreadInfo[i].LogicalCore == -1)
              {
                if(iLowest < 0)
                {
                  // first unassigned is lowest
                  iLowest = i;
                }
                else
                {
                  if(HyperThreadPhalanx.ThreadInfo[i].APICid < HyperThreadPhalanx.ThreadInfo[iLowest].APICid)
                   iLowest = i; // new lowest
                }
              } // if(HyperThreadPhalanx.ThreadInfo[i].LogicalCore < 0)
            } // for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
            if(iLowest < 0)
              break;
            if(HyperThreadPhalanx.ThreadInfo[iLowest].PhysicalHT != 0)
            {
              // unable to use core
              for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
              {
                if(HyperThreadPhalanx.ThreadInfo[i].PhysicalCore == HyperThreadPhalanx.ThreadInfo[iLowest].PhysicalCore)
                  HyperThreadPhalanx.ThreadInfo[i].LogicalCore = -2; // mark as unavailable
              } // for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
            }
            else
            {
              // able to use core
              int NextLogicalHT = 0;
              for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
              {
                if(HyperThreadPhalanx.ThreadInfo[i].PhysicalCore == HyperThreadPhalanx.ThreadInfo[iLowest].PhysicalCore)
                {
                  HyperThreadPhalanx.ThreadInfo[i].LogicalCore = NextLogicalCore;
                  HyperThreadPhalanx.ThreadInfo[i].LogicalHT = NextLogicalHT++;
                }
              } // for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
              ++NextLogicalCore;
              if(NextLogicalHT < HyperThreadPhalanx.nHTs)
                HyperThreadPhalanx.nHTs = NextLogicalHT; // reduce
            }
          } // for(;;)
          HyperThreadPhalanx.nCores = NextLogicalCore;
        } // omp master
    #pragma omp barrier
        // master is finished
        myCore = HyperThreadPhalanx.ThreadInfo[iThread].LogicalCore;
        myHT = HyperThreadPhalanx.ThreadInfo[iThread].LogicalHT;
      } // omp parallel
     
      for(int i = 1; i < HyperThreadPhalanx.nThreads; ++i)
      {
        for(int j = 0; j < i; ++ j)
        {
          if(HyperThreadPhalanx.ThreadInfo[j].APICid == HyperThreadPhalanx.ThreadInfo[i].APICid)
          {
            printf("Oversubscription of threads\n");
            printf("Multiple SW threads assigned to same HW thread\n");
            return 4;
          }
        } // for(int j = 0; j < i; ++ j)
      }
      return 0;
    } // void HyperThreadPhalanxInit()
    

    Next we can integrate the above function into the sample code, which I will cover in the next part of this blog.

    Jim Dempsey
    Consultant
    QuickThread Programming, LLC

     

    Question about steal-continuation semantics in Cilk Plus, Global counter slowing down computation, return value of functions


    1)
    What I understood about steal-continuation is that an idle thread does not actually steal a work item, but rather the continuation, which generates a new work item.
    Does that mean that inter-spawn execution time is crucial? If 2 threads are idle at the same time, from what I understand only one can steal the continuation and create its work unit, while the other thread stays idle during that time?!

    2)
    As a debugging artefact, I had a global counter incremented on every function call of a function used within every working item.

    I expect this value to be wrong (e.g. lost update), as it is not protected by a lock. What I didn't expect was the execution time being 50% longer. Can someone tell me why this is the case?

    3)
    Do I assume correctly that a cilk-spawned function can never (directly) return a result, as the continuation might run in the meantime and one would never know when the return value is actually written?
