developer documents for Cilk Plus

Intel® Parallel Studio XE

Intel® Advanced Vector Extensions

Introduction to Vectorization using Intel® Cilk™ Plus Extensions

Server

↧

Explicit Vector Programming – Best Known Methods

March 18, 2014, 9:33 am

Latest and popular articles on Intel Technologies

≫ Next: The Chronicles of Phi - part 4 - Hyper-Thread Phalanx – tiled_HT2

≪ Previous: The Chronicles of Phi - part 3 Hyper-Thread Phalanx – tiled_HT1 continued

Explicit Vector Programming – Best Known Methods

Why do we care about vectorizing applications? The simple answer: Vectorizing improves performance, and achieving high performance can save power. The faster an application can compute CPU-intensive regions, the faster the CPU can be set to a lower power state.

How does vectorizing compare to scalar operations with regard to performance and power? Vectorizing consumes less power than equivalent scalar operations because it performs better: Scalar operations process less several times data per cycle and require more instructions and more cycles to complete.

The introduction of wider vector registers in x86 platforms and the increasing number of cores that support single instruction multiple data (SIMD) and threading parallelism now make vectorization an optimization consideration for developers. This is because vector performance gains are applied per core, so multiplicative application performance gains become possible for more applications. In the past, many developers relied heavily on the compiler to auto-vectorize some loops, but serial constraints of programming languages have hindered the compiler’s ability to vectorize many different kinds of loops. The need arose for explicit vector programming methods to extend vectorization capability for supporting reductions, vectorizing:

Outer loops
Loops with user defined functions
Loops that the compiler assumes to have data dependencies, but on developer were understood to benign.

In summary: achieving high performance can also save power.

(An excellent web reference is the “Programming and Compiling for Intel® Many Integrated Core Architecture”. While the focus is on Intel® Xeon™ Phi coprocessor optimization, much of the content is also applicable to Intel Xeon® and Intel® Core™ processors.)

This document describes high-level best known methods (BKMs) for using explicit vector programming to improve the performance of CPU-bound applications on modern processors with vector processing units. In many cases, it is advisable to consider structural changes that accommodate both thread-level parallelism and as SIMD-level parallelism as you pursue your optimization strategy.

Note: To determine whether your application is CPU-bound or memory-bound, see About Performance Analysis with VTune™ Amplifier and Detecting Memory Bandwidth Saturation in Threaded Applications. Using hotspot analysis, find the parts of your application that are CPU-bound.

The following six steps are applicable for CPU-bound applications:

Measure baseline application performance.
Run hotspots and general exploration report analysis with the Intel® VTune™ Amplifier XE.
Determine hot loop/functions candidates to see if they are qualified for SIMD parallelism.
Implement SIMD parallelism using explicit vector programming techniques.
Measure SIMD performance.
[Optional for advanced developers] Generate assembly code and inspect.
Repeat!

Step 1. Measure Baseline Application Performance

You first need to have a baseline for your application’s existing performance level to if your vectorization changes are effective. In addition, you need have a baseline to measure your progress and final application performance relative to your starting point. Understanding this provides some guidance about when to stop optimizing.

Use a release build of your application for the initial baseline instead of a debug build. A release build contains all the optimizations in your final application. This is important because you need to understand the loops or “hotspots” in your application are spending significant time.

A release baseline provides symbol information, and has all optimizations turned on except simd (explicit vectorization) and vec (auto-vectorization). To explicitly turn off simd and auto-vectorization use the following compiler switches -no-simd–no-vec. (See Intel® C++ Compiler User Reference Guide 14.0)

Compare the baseline’s performance against the vectorized version to get a sense of how well your vectorization tuning approaches theoretical maximum speedup.

It is best to compare the performance of specific loops in the baseline and vectorized version using tools such as the Intel® VTune™ Amplifier XE or embedded print statements.

Step 2. Run hotspots and general exploration report analysis with Intel® VTune™ Amplifier XE

You can use the Intel® VTune™ Amplifier XE to find the most time-consuming functions in your application. The “Hotspots” analysis type is recommended, although “Lightweight Hotspots” (which profiles the whole system, as opposed to just your application) works as well

Identifying which areas of your application are taking the most time allows you to focus your optimization efforts in those areas where performance improvements will have the most effect. Generally, you want to focus only on the top few hotspots or functions taking at least 10% of your application’s total runtime. Make note of the hotspots you want to focus on for the next step. (Tutorial: Finding Hotspots.)

The general exploration report can provide information about:

TLB misses (consider compiler profile guided optimization),
L1 Data cache misses (consider cache locality and using streaming stores),
Split loads and split stores (consider data alignment for targeted architecture),
Memory bandwidth,
Memory latency (consider streaming stores and prefetching) demanded by the application.

This higher level analysis can help you determine whether it is profitable to pursue vectorization tuning.

Step 3: Determine Hot Loop/Functions Candidates Are Qualified for SIMD Parallelism

One key suitability ingredient for choosing loops to vectorize is whether the memory references in the loop are independent of each other. (See Memory Disambiguation inside vector-loops and Requirements for Vectorizable Loops.)

The Intel® Compiler vectorization report (or -vec-report) can tell you whether each loop in your code was vectorized. Ensure that you are using the compiler optimization level 2 or 3 (-O2 or –O3) to enable the auto-vectorizer. Run the vectorization report and look at the output for the hotspots determined from Step 2. If there are loops in these hotspots that did not vectorize, check whether they have math, data processing, or string calculations on data in parallel (for instance in an array). If they do, they might benefit from vectorization. Move to Step 4 if any vectorization candidates are found.

Data alignment

Data alignment is another key ingredient for getting the most out of your vectorization efforts. If the Intel® VTune™ Amplifier reports split loads and stores, then the application is using unaligned data. Data alignment forces the compiler to create data objects in memory on specific byte boundaries. There are two aspects of data alignment that you must be aware of:

Create arrays with certain byte alignment properties.
Insert alignment pragmas/directives and clauses in performance critical regions.

Alignment increases the efficiency of data loads and stores to and from the processor. When targeting the Intel® Supplemental Streaming Extensions 2 (Intel® SSE 2) platforms, use 16-byte alignment that facilitates the use of SSE-aligned load instructions. When targeting the Intel® Advanced Vector Extensions (Intel® AVX) instruction set, try to align data on a 32-byte boundary. (See Improving Performance by Aligning Data.) For Intel® Xeon Phi™ coprocessors, memory movement is optimal on 64-byte boundaries. (See Data Alignment to Assist Vectorization.)

Unit stride

Consider using unit stride memory (also known as address sequential memory) access and structure of arrays (SoA) rather than arrays of structures (AoS) or other algorithmic optimizations to assist vectorization. (See Memory Layout Transformations.)

As a general rule, it is best to try to access data in a unit stride fashion when referencing memory. Because this is often good for vectorization and other parallel programming techniques. (See Improving Discrete Cosine Transform performance using Intel(R) Cilk(TM) Plus.)

Successful vectorization may hinge on the application of other loop optimizations, such as loop interchange (see information on cache locality), and loop unroll.

It may be worth experimenting to see if inlining a function using –ip or –ipo allows vectorization to proceed for loops with embedded, user-defined functions. This is one alternative approach to using simd-enabled functions; there may be tradeoffs between using one or the other.

Note:

If the algorithm is computationally bound when performing hotspot analysis, continue pursuing the strategy described in this paper. If the algorithm is memory-latency bound or memory-bandwidth bound, then vectorization will not help. In such cases, consider strategies like cache optimizations or other memory-related optimizations, or even rethink the algorithm entirely. High level loop optimizations, such as –O3, can look for loop interchange optimizations that might help cache locality issues. Cache blocking, can also help improve cache locality when applicable. (See Cache Blocking Techniques which is specific to the Intel® Many Integrated Core Architecture (Intel® MIC Architecture), but the technique applies to the Intel® Xeon® processor as well.)

Step 4: Implement SIMD Parallelism Using Explicit Vector Programming Techniques

Explicit vector programming includes features such as the Intel® Cilk™ Plus or OpenMP* 4.0 vectorization directives. These optimizations provide a very powerful and portable way to express vectorization potential in C/C++ applications. OpenMP* 4.0 vectorization directives are also applicable to Fortran applications. These explicit vector programming techniques give you the means to specify which targeted loops to vectorize. Candidate loops for vectorization directives include loops that have too many memory references for the compiler to put in place dependency checks, loops with reductions, loops with user-defined functions, outer loops, among others.

(See Best practices for using Intel® Cilk™ Plus for recommendations for using the Intel® Cilk™ Plus methodology and Enabling SIMD in program using OpenMP4.0” for how to enable SIMD features in an application using the OpenMP* 4.0 methodology.)

See also the webinar Introducing Intel® Cilk™ Plus and two video training series detailing vectorization essentials with explicit vector programming using Intel® Cilk™ Plus and OpenMP* 4.0 vectorization techniques.

Here are some common components of explicit vector programming.

SIMD-enabled Functions (Intel® Cilk™ Plus and OpenMP* 4.0 Methodologies)

User creation of SIMD-enabled functions is a capability provided in both the Intel® Cilk™ Plus and OpenMP* 4.0 methodologies. SIMD-enabled functions explicitly describe the SIMD behavior of user-defined functions, including how SIMD behavior is altered due to call site dependence. (See Call site dependence for SIMD-enabled functions in C++, which explains why the compiler sometimes uses a vector version of a function in some call sites, but not others. It also describes what you can do to extend the types of call sites for which the compiler can provide vector versions. Learn more about SIMD-enabled functions in Usage of linear and uniform clause in Elemental function (SIMD-enabled function).)

SIMD Loops (Intel® Cilk™ Plus and OpenMP* 4.0 Methodologies)

Both the Intel® Cilk™ Plus and OpenMP* 4.0 methodologies provide SIMD loops. The principle with SIMD loops is to explicitly describe the SIMD behavior of a loop, including descriptions of variable usage and any idioms such as reductions. (See Requirements for Vectorizing Loops with #pragma SIMD.) For a quick introduction to #pragma simd, see the corresponding topic for Intel® Cilk™ Plus and OpenMP* 4.0.)

Traditionally, only inner loops have been targeted for vectorization. One unique application of the Cilk Plus #pragma simd or OpenMP* 4.0 #pragma omp simd is that it can be applied to an outer loops.

(See loops Outer Loop Vectorization , and Outer Loop Vectorization via Intel® Cilk™ Plus Array Notations, which describe using #pragma simd in outer loops).

Intel® Cilk™ Plus Array Notation (Intel® Cilk™ Plus Methodology)

Array Notation is an Intel-specific language extension that is a part of the Intel® Cilk™ Plus methodology supported by the Intel® C++ Compiler. Array Notation provides a way to express a data parallel operation on ordinary declared C/C++ arrays. Array Notation is also compatible with OpenMP* 4.0 and Intel® Cilk™ Plus SIMD-enabled functions. It provides a concise way of replacing loops operating on arrays with a clean array notation syntax that the Intel® Compiler identifies as being vectorizable.

Step 5: Measure SIMD performance

Measure your application’s build configuration runtime performance. If you are satisfied, you are done! Otherwise, inspect -vec-report6 to get a SIMD vectorization summary report (to check alignment, unit-stride and using (SoA) versus (AoS), interaction with other loop optimizations, etc.).

(For a deeper exploration on measuring performance, see How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures.)

Another approach is to use a family of compiler switches with the template –profile-xxxx. (These switches are described ‘Profile Function or Loop Execution Time.) Using the instrumentation method to profile function or loop execution time makes it easy to view where cycles are being spent in your application. The Intel® Compiler inserts instrumentation code into your application to collect the time spent in various locations. (data for identifying hotspots that may be candidates for optimization tuning or targeting parallelization.).

Another method to measure performance is to re-run the Intel® VTune™ Amplifier XE hotspot analysis after the optimizations are made and compare results.

Optional Step 6: For Advanced Developers -Generate assembly code and do inspection

For those who want to see the assembly code that the compiler generates, and inspect that code to gain insight into how well applications were vectorized, use the compiler switch –S to compile to assembly (.s) without invoking a link step.

Step 7: Repeat!
Repeat as needed until you achieve the desired performance or no good candidates remain.

Other considerations are applicable for applications that are memory latency-bound or memory bandwidth-bound:

Other considerations: Prefetching and Streaming Stores

Prefetching

Data prefetching is a method for a compiler or a developer to request that data be pulled into a cache line from main memory prior to it being used. Prefetching is more applicable for Intel® MIC Architecture. Explicit control of prefetching can be an important performance factor to investigate. (See Prefetching on Intel® MIC Architecture.)

Streaming stores

Streaming stores are a method of writing data explicitly to main memory bypassing all intermediate caches in instances where you are sure that data being written will not be needed from cache any time soon. Strictly speaking, bypassing all caches is only applicable on Intel® Xeon® processors. For Intel® Xeon Phi™ coprocessors, streaming stores evict instructions are provided to evict data only from a specific cache. (See Intel® Xeon Phi™ coprocessor specific support of streaming stores or Compiler-based Data Prefetching and Streaming Non-temporal Store Generation for Intel Xeon Phi Coprocessor (May 2013). Vectorization support describes the use of the VECTOR NONTEMPORAL compiler directive for addressing streaming stores.)

Other considerations: Scatter, gather, and compress structures:

Many applications benefit from explicit vector programming efforts. In many cases performance increases over scalar performance can be commensurate with the number of available vector lanes on a given platform. However, some types of coding patterns or idioms limit vectorization performance to a large degree.

Gather and Scatter codes

A[I] = B[Index[i]]; //Gather
A[Index[i]] = b[i]; //Scatter

While gather/scatter vectorization is available on Intel® MIC Architecture and recent Intel® Xeon® platforms, the performance gains from vectorization relying on gather/scatter is often much inferior to use of unit-strided loads and stores inside vector-loops. If there are not enough other profitably vectorized operations (such as multiple, divide, or math calls, …) inside such vector loops, performance may even be lower than serial performance in some cases. The only possible workaround for such issues is to look at alternative algorithms all together to avoid using gathers and scatters.

Compress and Expand structures

Compress and expand structures are generally problematic. On Intel® Xeon™ Phi coprocessors, the Intel® Compiler can automatically vectorize loops that contain simple forms of compress/expand idioms. An example of a compress idiom is as follows:

do I =1,N
   if (B(I)>0)
       x= x+1
       A(X) = B(I)
   endif
enddo

In this example, the variable x is updated under a condition. Note that it is incorrect to use the #pragma simd for such compress structures but using #pragma ivdep is okay.

Improve performance of such vectorized loops on Intel® MIC Architecture using the -opt-assume-safe-padding compiler option. (See Common Vectorization Tips.)

Currently vectorization of compress structures is only for future platforms that support compress structures.

Reference Materials:

Compiler diagnostic messages

Intel® Fortran Vectorization Diagnostics– Diagnostic messages from the vectorization report produced by the Intel® Fortran Compiler. To obtain a vectorization report in Intel® Fortran, use the option -vec-report[n] (Linux* and OS X* platforms) or /Qvec-report[:n] (Windows* platform).
Vectorization Diagnostics for Intel® C++ Compiler– Diagnostic messages from the vectorization report produced by the Intel® C++ Compiler. To obtain a vectorization report with the Intel® C++ Compiler, use option -vec-report[n] (Linux* and OS X* platforms) or /Qvec-report[:n] (Windows* platform).

Intel® C++ Compiler Videos

Vectorization Essentials– Ten videos covering the motivation for explicit vector programming and SIMD concepts with the Intel® Cilk™ Plus methodology.
Performance Essentials with OpenMP 4.0 Vectorization– Seven videos covering the motivation for explicit vector programming and SIMD concepts with the OpenMP* 4.0 methodology.

Webinars

Articles

Compiler Switches for Intel® Parallel Amplifier– The Intel® Parallel Amplifier can analyze many native binaries.
Vectorization and Optimization Reports– Using compiler option -vec-report to determine what is (or is not) vectorizing and why (or why not) in your application.
Data Alignment to Assist Vectorization– New features in the Intel® Compiler 14.0 that support the data alignment critical for vectorization.
Pointer Aliasing and Vectorization– How to tell the Intel® Compiler that pointers are not aliasing the same data.
Requirements for Vectorizable Loops– Requirements for loop vectorization, code snippets, examples, and an advice section.
Vectorization for C or C++ Users with Intel® Cilk™ Plus Array Notations and Elemental Functions– Elemental vectorization functions with the Intel® Cilk™ Plus array notation.
Memory Layout Transformations – Moving from data organized in an AoS to an organization of SoA.
Usage of linear and uniform clause in Elemental function (SIMD-enabled function)– Linear and uniform clauses specifically for the Intel® Cilk™ Plus methodology but with almost direct application for OpenMP* 4.0 #pragma omp declare simd linear and uniform clauses.
Call site dependence for SIMD-enabled functions in C++– Relationship of function call site to multiple SIMD-enabled functions in a header file.
Best practices for using Intel® Cilk™ Plus– Step-by-step approach to enable an application with task and data parallelism using the Intel® Cilk™ Plus methodology.
Data Alignment to Assist Vectorization– Data alignment is a method to force the compiler to create data objects in memory on specific byte boundaries. In addition to creating the data on aligned boundaries, the compiler is able to make optimizations when the data is known to be aligned by 64-bytes.
Array Notation Tradeoffs– Rewriting array notation syntax (one way to express parallelism that helps the compiler with vectorization) with shorter vectors to avoid cache overflow.
Improving Discrete Cosine Transform performance using Intel® Cilk™ Plus– Improving the performance of Discrete Cosine Transforms using the Intel® Cilk™ Plus methodology and Array Notation.
Improving Averaging Filter performance using Intel® Cilk™ Plus – Improving the performance of an Averaging Filter in image processing using the Intel® Cilk™ Plus methodology, and using task parallelism, Array Notation, and SIMD-enabled functions to express data parallelism.
Outer Loop Vectorization– Moving the vectorization from an inner level to an outer level using a combination of elemental functions and pragma/directive SIMD.
Outer Loop Vectorization via Intel® Cilk™ Plus Array Notations – Using C++ Array Notation with the Intel® Cilk™ Plus methodology.
Tradeoffs between array-notation long-vector and short-vector coding – Using C++ Array Notation with the Intel® Cilk™ Plus methodology.
The Importance of Vectorization for Intel® Many Integrated Core Architecture (Intel® MIC Architecture) (Fortran Example)– Using the Intel® Fortran Compiler vectorizer to get good performance through effective use of the SIMD hardware and the benefits of threading over many cores.
Vectorizing Intel® Threading Building Blocks (Intel® TBB) parallel_for block– Writing vector-friendly code inside an Intel® Threading Building Blocks (Intel® TBB) parallel_for bloc.
Intel® Xeon Phi™ coprocessor specific support of streaming stores– Speeding up performance of vector-aligned unmasked stores in streaming kernels
Molecular Dynamics Optimization on Intel® Many Integrated Core Architecture (Intel® MIC)– Optimizing a molecular dynamics program using Intel® MIC Architecture.
Cache Blocking Techniques– Cache Blocking is a technique to rearrange data access to pull subsets (blocks) of data into cache and to operate on block to avoid repeatedly fetching data from memory.
Memory Layout Transformations– Reorganizing data into an SoA organization for real world applications.
Large Page Considerations– Using mmap to allocate data directly in 2MB pages, using libraries such as libhugetlbfs to allocate all malloc-ed data and static data.
Common Vectorization Tips– User-defined function-calls inside vector-loops, unit-stride accesses inside elemental functions, and memory disambiguation inside vector loops.
Program Optimization through Loop Vectorization– The Intel® Compilers provide many ways to generate well-optimized vector instructions. Good vectorization is of fundamental importance to take full advantage of SIMD-based data parallelism. High-level coding can take advantage of auto-vectorization.
Intel Guide for Developing Multithreaded Applications– More specific tuning-related information applicable to thread synchronization and memory management.
Large Page Considerations– Compiler Methodology for Intel® MIC Architecture.
Element-wise Alignment Requirements for Data Accesses– Compiler Methodology for Intel® MIC Architecture.

Cilk, Intel, the Intel logo, Intel Xeon Phi, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Explicit Vector Programming

OpenMP* 4.0 Vectorization

Intel(R) Cilk(TM) Plus

Intel® Many Integrated Core Architektur

Leistungsverbesserung

↧

The Chronicles of Phi - part 4 - Hyper-Thread Phalanx – tiled_HT2

March 24, 2014, 2:09 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel® Software Development Tools 2015 Beta

≪ Previous: Explicit Vector Programming – Best Known Methods

The prior part (3) of this blog showed the effects of the first-level implementation of the Hyper-Thread Phalanx. The change in programming yielded 9.7% improvement in performance for the small model, and little to no improvement in the large model. This left part 3 of this blog with the questions:

What is non-optimal about this strategy?
And: What can be improved?

There are two things, one is obvious, and the other is not so obvious.

Data alignment

The obvious thing, which I will now show you, is that vectorization improves with aligned data. Most compiler optimizations will examine the code of the loop, and when necessary, insert preamble code that tests for alignment and executes up until an alignment is reached (this is called peeling), then insert code that executes more efficiently with the now aligned data. Finally post-amble code is inserted to complete any remainder that may be present.

This sounds rather straightforward until you look at the inner loop:

#pragma simd  
          for (x = 1; x < nx-1; x++) {
            ++c;
            ++n;
            ++s;
            ++b;
            ++t;
            f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c+1]
                + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
          }

In the preceding code, the two terms: f1_t[c-1] , and , f1_t[c+1] will muck up the vector alignment tests since [c-1], [c] and [c+1] can never all be aligned at the same time.

Are the compiler writers smart enough to offer some measure of optimization for such a loop?

As it turns out, they are able to offer some measure of optimization for such a loop.

Due to initial unknowns, the code has to do more work in the preamble and post-amble sections, as well as a reduced number of iterations of work in the fastest interior body loop of code.

Take particular note that the input array f1_t is indexed in seven different ways. This means that the preamble code that determines alignment may have to work on minor permutations of the seven references in an attempt to narrow in on the time when the largest number of references are vector aligned. This is non-trivial for the compiler code generation, as well as a potential area for additional overhead.

What can be improved?

Use aligned data when possible

This is addressed in an additional improvement to the coding of the tiled_HT2 program.

First, we require that the dimension NX be a multiple of REALs that fill a cache line. This is not an unreasonable requirement. The value of 256 was used in the original example code. It is not too much of a restriction to require that NX must be a multiple of 16 for floats, or 8 for doubles.

To assure alignment, I changed the malloc calls to allocate the arrays to use _mm_malloc with an alignment of cache line size (64). This is a relatively simple change. (This will be shown later after the next optimization tip that also affects allocation.)

Next, now that I know that NX is an even multiple of cache lines, and the arrays are cache line aligned, I can construct a function to process the innermost loop with the foreknowledge that six of the array references are cache aligned and two are not (the extra reference is the output array). The two that are not aligned are the references to [c-1] and [c+1]. The compiler, knowing beforehand what is aligned and what is not aligned does not have to insert code to make this determination. i.e. the compiler can reduce, or completely remove the preamble and post-amble code.

The second improvement (non-obvious improvement):

Redundancy can be good for you

Additional optimization can be achieved by redundantly process x=0 and x=nx-1 as if these cells were at the interior of the parallel pipette being processed. This means that the preamble and post-amble code for unaligned loops can be bypassed, and that the elements x=1:15 could be directly processed as an aligned vector (as opposed to one-by-one computation or unaligned vector computation). The same is done for the 16 elements where the last element (x=nx-1) computes differently than the other elements of the vector. This does mean that after calculating the incorrect values (for free) for x=0 and x=nx-1, we have to then perform a scalar calculation to insert the correct values into the x column. Essentially you exchanging two scalar loops of 16 iterations for two of (one 16-wide vector operation + one scalar operation) where the scalar operations are in L1 cache.

Adding the redundancy change necessitated allocating the arrays two vectors worth of elements larger than the actual array requirement, and returning the address of the 2^nd vector for the array pointer. Additionally, this requires zeroing one element preceding and following the working array size. The allocation then provides for one vector of addressable memory (and not used as valid data). Not doing so, could result in a page fault depending on location and extent of allocation.

Change to allocations:

  // align the allocations to cache line
  // increase allocation size by 2 cache lines
  REAL *f1_padded = (REAL *)_mm_malloc(
    sizeof(REAL)*(nx*ny*nz + N_REALS_PER_CACHE_LINE*2),
    CACHE_LINE_SIZE);

  // assure allocation succeeded
  assert(f1_padded != NULL);
 
  // advance one cache line into buffer
  REAL *f1 = f1_padded + N_REALS_PER_CACHE_LINE;
 
  f1[-1] = 0.0;       // assure cell prior to array not Signaling NaN
  f1[nx*ny*nz] = 0.0; // assure cell following array not Signaling NaN

  // align the allocations to cache line
  // increase allocation size by 2 cache lines
  REAL *f2_padded = (REAL *)_mm_malloc(
    sizeof(REAL)*(nx*ny*nz + N_REALS_PER_CACHE_LINE*2),
    CACHE_LINE_SIZE);

  // assure allocation succeeded
  assert(f2_padded != NULL);
 
  // advance one cache line into buffer
  REAL *f2 = f2_padded + N_REALS_PER_CACHE_LINE;
 
  f2[-1] = 0.0;       // assure cell prior to array not Signaling NaN
  f2[nx*ny*nz] = 0.0; // assure cell following array not Signaling NaN

As an additional benefit the compiler can now generate more code using Fused Multiply and Add (FMA) instructions.

The tiled_HT2 code follows:

void diffusion_tiled_aligned(
                REAL*restrict f2_t_c, // aligned
                REAL*restrict f1_t_c, // aligned
                REAL*restrict f1_t_w, // not aligned
                REAL*restrict f1_t_e, // not aligned
                REAL*restrict f1_t_s, // aligned
                REAL*restrict f1_t_n, // aligned
                REAL*restrict f1_t_b, // aligned
                REAL*restrict f1_t_t, // aligned
                REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
                REAL cb, REAL cc, int countX, int countY) {

  __assume_aligned(f2_t_c, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_c, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_s, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_n, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_b, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_t, CACHE_LINE_SIZE);
  // countY is number of squads along Y axis
  for(int iY = 0; iY < countY; ++iY) {
    // perform the x=0:N_REALS_PER_CACHE_LINE-1 as one cache line operation
    // On Phi, the following reduces to vector with one iteration
    // On AVX two iterations
    // On SSE four iterations
    #pragma noprefetch
    #pragma simd 
    for (int i = 0; i < N_REALS_PER_CACHE_LINE; i++) {
      f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i] + ce * f1_t_e[i]
                   + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];
    } // for (int i = 0; i < N_REALS_PER_CACHE_LINE; i++)
   
    // now overstrike x=0 with correct value
    // x=0 special (no f1_t[c-1])
    f2_t_c[0] = cc * f1_t_c[0] + cw * f1_t_w[1] + ce * f1_t_e[0]
                + cs * f1_t_s[0] + cn * f1_t_n[0] + cb * f1_t_b[0] + ct * f1_t_t[0];
    // Note, while we could overstrike x=[0] and [nx-1] after processing the entire depth of nx
    // doing so will result in the x=0th cell being evicted from L1 cache.

    // do remainder of countX run including incorrect value for i=nx-1 (countX-1)
    #pragma vector nontemporal
    #pragma noprefetch
    #pragma simd 
    for (int i = N_REALS_PER_CACHE_LINE; i < countX; i++) {
        f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i] + ce * f1_t_e[i]
                 + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];
    } // for (int i = 0; i < N_REALS_PER_CACHE_LINE; i++)

    // now overstrike x=nx-1 with correct value
    // x=nx-1 special (no f1_t[c+1])
    int i = countX-1;
    f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i-1] + ce * f1_t_e[i]
                   + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];

    // advance one step along Y
    f2_t_c += countX;
    f1_t_c += countX;
    f1_t_w += countX;
    f1_t_e += countX;
    f1_t_s += countX;
    f1_t_n += countX;
    f1_t_b += countX;
    f1_t_t += countX;
  } // for(int iY = 0; iY < countY; ++iY)
} // void diffusion_tiled_aligned(

diffusion_tiled(REAL *restrict f1, REAL *restrict f2, int nx, int ny, int nz,
              REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
              REAL cb, REAL cc, REAL dt, int count) {

#pragma omp parallel
  {

    REAL *f1_t = f1;
    REAL *f2_t = f2;

    int nSquadsZ = (nz + nHTs - 1) / nHTs; // place squads across z dimension
    int nSquadsZY = nSquadsZ * ny;  // number of full (and partial) squads on z-y face
    int nSquadsZYPerCore = (nSquadsZY + nCores - 1) / nCores;

    // Determine this thread's squads
    int SquadBegin = nSquadsZYPerCore * myCore;
    int SquadEnd = SquadBegin + nSquadsZYPerCore; // 1 after last squad for core
    if(SquadEnd > nSquadsZY) SquadEnd = nSquadsZY;
    for (int i = 0; i < count; ++i) {
      int nSquads;
      // restrict current thread to its subset of squads on the Z/Y face.
      for(int iSquad = SquadBegin; iSquad < SquadEnd; iSquad += nSquads) {
        // determine nSquads for this pass
        if(iSquad % ny == 0)
          nSquads = 1; // at y==0 boundary
        else
        if(iSquad % ny == ny - 1)
          nSquads = 1;  // at y==ny-1 boundary
        else
        if(iSquad / ny == (SquadEnd - 1) / ny)
          nSquads = SquadEnd - iSquad;  // within (inclusive) 1:ny-1
        else
          nSquads = ny - (iSquad % ny) - 1; // restrict from iSquad%ny to ny-1
        int z0 = (iSquad / ny) * nHTs; // home z for 0'th team member of Squad
        int z = z0 + myHT;  // z for this team member
        int y = iSquad % ny;
        // last squad along z may be partially filled
        // assure we are within z
        if(z < nz)
        {
          int x = 0;
          int c, n, s, b, t;
          c =  x + y * nx + z * nx * ny;
          n = (y == 0)    ? c : c - nx;
          s = (y == ny-1) ? c : c + nx;
          b = (z == 0)    ? c : c - nx * ny;
          t = (z == nz-1) ? c : c + nx * ny;
          diffusion_tiled_aligned(
   &f2_t[c], // aligned
   &f1_t[c], // aligned
   &f1_t[c-1], // unaligned
   &f1_t[c+1], // unaligned
   &f1_t[s], // aligned
   &f1_t[n], // aligned
   &f1_t[b], // aligned
   &f1_t[t], // aligned
                        ce, cw, cn, cs, ct, cb, cc, nx, nSquads);
        } // if(z < nz)
      } // for(int iSquad = SquadBegin; iSquad < SquadEnd; iSquad += nSquads)
// barrier required because we removed implicit barrier of #pragma omp for collapse(2)
      #pragma omp barrier
      // swap buffer pointers
      REAL *t = f1_t;
      f1_t = f2_t;
      f2_t = t;
    } // count
  } // parallel
  return;
}

The performance chart below incorporates the two new programs tiled_HT and tiled_HT2.

The above chart clearly illustrates that the tiledHT2 is starting to make some real progress, at least for the small model with another 9.5% improvement. Be mindful that code alignment may still be an issue. And the above chart does not take this into consideration.

What else can be improved?

Think about it while you await part 5.

Jim Dempsey
Consultant
QuickThread Programming, LLC

Symbol-Bild:

Technical Article

Intel® C++-Compiler

Intel® Parallel Studio XE

Intel® Advanced Vector Extensions

Environment not set correctly for non-Intel® Composer XE products in compiler command prompts

↧

Intel® Software Development Tools 2015 Beta

March 27, 2014, 3:13 pm

Latest and popular articles on Intel Technologies

≫ Next: How to use the cilkview?

≪ Previous: The Chronicles of Phi - part 4 - Hyper-Thread Phalanx – tiled_HT2

What's New in the 2015 Beta

This suite of products brings together exciting new technologies along with improvements to Intel’s existing software development tools:

Get guidance on how to boost performance safely without creating threading bugs using the Intel® Advisor XE 2015 Beta. These improvements include scaling to a larger number of processors and improved viewing and advanced modeling of suitability information on both Intel® Xeon® and Intel® Xeon Phi™ processors.
- Suitability modeling for Intel® Xeon Phi™ processors is available as an experimental feature by setting the environment variable ADVIXE_EXPERIMENTAL=suitability_xeon_phi_modeling
Profiling Advances - Improved hardware support for the Intel® Graphics Technology and Intel® Transactional Synchronization Extensions (Intel® TSX) analysis, OS X* view capability, and remote collection for Linux* systems with the new Intel® VTune™ Amplifier XE 2015 Beta!
Now you can debug memory and threading errors with Intel® Inspector XE 2015 Beta! For thread checking, take advantage of 3X performance improvement and reduction in memory overhead. For memory checking take advantage of advancements in the on-demand leak detection and memory growth controls as well as the brand new memory usage graph.
Now utilize new Parallel direct sparse Solvers for clusters (CPARDISO) and optimizations for the latest Intel® Architectures with the Intel® Math Kernel Library (Intel® MKL) 11.2 Beta! Get insight into Intel® MKL’s settings via new verbose mode and take advantage of the Intel® MKL Cookbook to help assemble the correct routines for solving complex problems.
Leverage the latest language features in Intel® Composer XE 2015 Beta, including full language support for C++11 (/Qstd=c++11 or -std=c++11) and Fortran 2003, Fortran 2008 Blocks, and OpenMP* 4.0 (except user-defined reductions). Gain new insights into optimization opportunities such as vectorization or inlining with redesigned optimization reports (/Qopt-report or -opt-report). Exploit Intel® Graphics Technology for additional performance with new offload computing APIs. Use the new "icl" and "icl++" compilers on OS X* for improved compatibility with the clang/LLVM* toolchain. The Intel® Integrated Performance Primitives has added support for the Intel® Xeon Phi™ co-processor.
- Existing users of optimization reports (opt-report, vec-report, openmp-report, or par-report) or the Intel® C++ Compiler for Linux (-ansi-alias is now default) should refer to the product release notes for more information
Get highly-optimized out-of-the-box performance for your MPI applications with the Intel® MPI Library 5.0 Beta Update 1! Now’s the time to take advantage of the new MPI-3 functionality, such as non-blocking collectives and fast one-sided communication. This release ensures binary compatibility with existing codes.
Extract complete insight into your distributed memory application and quickly find performance bottlenecks and MPI issues using the Intel® Trace Analyzer and Collector 9.0 Beta Update 1! In addition to support for the latest MPI-3 features, the new Performance Assistant automatically detects common MPI performance issues and quickly provides resolution tips.

A detailed description of the new features in the 2015 Beta products is available in the Intel® Software Development Tools 2015 Beta Program: What's New document.

Details

This beta program is available for IA-32 architecture-based processors and Intel® 64 architecture-based processors for Linux* and Windows*. The Intel beta compilers and libraries for OS* X are also included in this beta program.

During this Beta period, you will be provided access to the Intel® Cluster Studio XE 2015 Beta package – a superset containing all Intel® Software Development Tools. At the time of download, you can select to install individual products or the full suite.

Early access to some components of the 2015 Beta will be available the first week of April. These components will include the Intel® C++ Composer XE 2015 Beta and the Intel® Fortran Composer XE 2015 Beta.

The full Intel® Software Development Products 2015 Beta packages (including the Intel® Cluster Studio XE 2015 Beta files) will be available in mid-April, once the 2015 Beta program commences.

Frequently Asked Questions

A complete list of FAQs regarding the 2015 Beta can be found in the Intel® Software Development Tools 2015 Beta Program: Frequently Asked Questions document.

Beta duration

The beta program officially ends July 11th, 2014. The beta license provided will expire September 25th, 2014. At the conclusion of the beta program, you will be asked to complete a survey regarding your experience with the beta software.

Support

Technical support will be provided via Intel® Premier Support. The Intel® Registration Center will be used to provide updates to the component products during this beta period.

How to enroll in the Beta program

Complete the pre-beta survey at the registration link

Information collected from the pre-beta survey will be used to evaluate beta testing coverage. Here is a link to the Intel Privacy Policy.
Keep the beta product serial number provided for future reference
After registration, you will be taken to the Intel Registration Center to download the product
After registration, you will be able to download all available beta products at any time by returning to the Intel Registration Center

Note: At the end of the beta program you should uninstall the beta product software.

Beta Webinars

Want to know more about the 2015 Beta features in the Intel® Software Development Tools? Attend one of these webinars to learn more.

Times indicated are Pacific time. PST: Standard (UTC/GMT -8 hours), PDT: Daylight Savings (UTC/GMT -7 hours)

Date	Title	Description	Presenter	REGISTER
Apr 8 9:00 A.M. Pacific	Quickly discover performance issues with the Intel® Trace Analyzer and Collector 9.0 Beta	The Intel® Trace Analyzer and Collector has a long-standing reputation as a profiler that helps you understand MPI application behavior, and effectively visualize bottlenecks in your code. The new 9.0 Beta release introduces an even easier way to identify performance issues in your code via a brand new tool called the Performance Assistant. Join us as we discuss how the Performance Assistant is able to analyze your code, determine potential performance problems, and suggests solutions. We will discuss the technology and ideas behind how MPI performance bottlenecks are detected, and how you can easily implement solutions based on the information provided.	Gergana Slavova	REGISTER
Apr 22 9:00 A.M. Pacific	What's New in the Intel® Software Tools 2015 Beta releases	Join the technical experts at Intel as they provide you with details on the new features in the Intel® Software Tools 2015 Beta releases that access the newest Intel multicore processors, manycore coprocessors, and Intel® Graphics Technology. This technical presentation will cover new Beta features for the most recent standards: OpenMP4.0, MPI-3, Fortran 2003 and 2008, and C++ 11 running on the newest Linux, Windows, and OS X* operating systems. Covered will be a new C/C++ compiler driver "icl" for OS X*. Intel® Math Kernel Library adds Cluster PARDISO, Airmont, and Goldmont Atom optimizations, and certain tunings for the Haswell and Broadwell architectures. Intel® VTune™ Amplifier XE additionally boast an improved ease-of-use via changes in the Summary Pane and General Exploration. Intel® Trace Analyzer and Collector offers a brand-new performance assistant to aid in removing the bottlenecks in your MPI code. Intel Advisor XE presents a dramatically improved suitability view. Intel Inspector XE has generated a 3X performance improvement for threading analysis and improvements to memory growth and on-demand leak detection.	Gergana Slavova	REGISTER
May 1 9:00 A.M. Pacific	Getting the most out of your compiler with the new Optimization Reports	Intel® Composer XE 2015 has dramatically overhauled the reporting features for such crucial optimizations as inlining, vectorization, parallelization, and memory access and cache usage optimizations, replacing the current opt-report, vec-report, par-report, and openmp-report reporting functionality. A new consolidated optimization report provides improved presentation, content, and precision of the information provided so that users better understand what optimizations were performed by the compiler, and how they may be tuned to yield the best performance. In this webinar, we’ll show you how to use compiler options to target the exact optimization information you’re looking for and how to use this information to speed up your application.	Brandon Hewitt	REGISTER
May 7 9:00 A.M. Pacific	Intel® MKL 11.2 Beta Webinar - Introducing New features	Intel® Math Kernel Library (Intel® MKL) is a highly optimized, extensively threaded, and thread-safe library of mathematical functions for engineering, scientific, and financial applications that require maximum performance. Intel® MKL 11.2 Beta is Offered as component of Intel® Composer XE 2015 Beta. In this Webinar, we are introducing latest features of MKL 11.2 Beta and the topics include Cluster Pardiso, Verbose Mode, GEMM improvements on small matrix sizes and MKL Cookbook recipes.	Sridevi Allam	REGISTER
May 13 9:00 A.M. Pacific	Find Bugs Quickly and Easily in Your Fortran Application Using Intel® Inspector XE	This webinar will present the debugging and analysis capabilities of Intel® Inspector XE with a focus on Fortran development. Quickly detect and locate threading and memory issues in your application, and correlate those issues to the exact line of source code causing the problem. The presentation will include specific examples of common errors and how Intel® Inspector XE can greatly aid in the debugging process.	Jackson Marusarz	REGISTER
May 14 9:00 A.M. Pacific	What’s New in the Intel® VTune™ Amplifier XE 2015 Beta release	Join us for a look at all the new features arriving in the 2015 Beta release of VTune Amplifier XE. View the efficiency of your code that utilizes the new Intel® Transactional Synchronization Extensions (TSX). Observe memory transfers and compute queues on Intel® graphics. We’ll look at new capabilities such as remote collection via the graphical user interface and a results viewer for MAC OS X* systems. Those topics and many improvements to the user interface will be covered.	Dave Anderson	REGISTER
Jun 10 9:00 A.M. Pacific	Intel MPI library implementation of a new MPI3.0 standard - new features and performance benchmarks.	Introduction into implementation of a new MPI-3 standard by the latest Intel MPI library 5.0. MPI 3.0 standard introduced many new features such as new one-sided (Remote Memory Access (RMA)) communication semantics, non-blocking and neighborhood collectives, improvements in Fortran bindings and fault tolerance. New MPI 3.0 standard targets to improve performance, reliability and ease of use of HPC cluster applications. In this webinar we will cover MPI 3.0 features implemented in Intel MPI 5.0 library (beta) illustrated by small examples codes. Complementing release of Intel MPI 5.0, we also release a new version of Intel micro-benchmarks library IMB 4.0 containing the benchmarks for non-blocking collectives and new RMA interface. To observe performance benefits with these benchmarks, the asynchronous progress support in multi-threaded version of Intel MPI 5.0 library was implemented . The preliminary performance results based on IMB 4.0 show 2x performance advantage of non-blocking collectives for medium and large message sizes. We will also demonstrate performance advantages of truly passive RMA put function invocation in IMB 4.0 test-suite. Finally, a small stencil kernel will be used to demo a new shared memory MPI API functions that can compete with hybrid MPI and OpenMP applications.	Mark Lubin	REGISTER
Jun 24 9:00 A.M. Pacific	Remodel your Code with Intel® Advisor XE	Thanks to the multi-core era, it has become imperative for software developers to exploit parallelism inherent in their applications. Intel® Advisor XE helps make incorporating threading into applications easier by allowing developers to model parallelism. It inculcates in software developers a disciplined approach to exploiting parallelism. Inte® Advisor XE obviates guesswork and trial-and-error based approaches, and instead guides developers to confidently model and transform serial portions of code to parallelized versions in a step by step methodical fashion. The presenter will introduce the Intel® Advisor XE tool and demonstrate a structured approach to exploiting parallelism that Intel® Advisor XE facilitates. Attendees of this webinar will gain: Understanding of the importance of approaching parallelization problems based on measured data rather than guesswork The importance of performing parallelism modeling using Intel® Advisor XE annotations and analyses so that correct portions of your software can be judiciously selected and parallelized Knowledge of resources, including a clear step-by-step process, for implementing threading in software	Holly Wilper	REGISTER

This is a small subset of our Technical Webinar series. For more webinars and to view the archives, visit the main Intel Software Tools Technical Webinar Series page.

Known Issues and Special Features

This section contains information on known issues (plus associated fixes) and special features of the 2015 Beta versions of the Intel® Software Development Tools. Check back often for updates.

Next Steps

Review the Intel® Software Development Tools 2015 Beta What's New document and FAQ
Register for the Beta program and install the Intel® Software Development Tools 2015 Beta product(s)
Try it out and share your experience with us!

Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
* Other names and brands may be claimed as the property of others.
Copyright © 2014, Intel Corporation. All rights reserved.

Apple iOS*

Apple OS X*

Linux*

Intel® Trace-Analyzer und -Collector

C/C++

Fortran

Intel® C++-Compiler

Intel® Cilk™ Plus

Intel® Visual Fortran Composer XE

Intel® Debugger

Intel® Inspector

Intel® VTune™ Amplifier

Intel® Integrated-Performance-Primitives

Intel® Math Kernel Library

Intel® MPI Library

Intel® Threading Building Blocks

Intel® C++ Studio XE

Intel® Cluster Studio

Intel® Cluster Studio XE

Intel® Fortran Studio XE

Intel® Parallel Studio XE

Intel® Advisor XE

Cluster-Computing

Entwicklungstools

Intel® Core™ Prozessoren

Intel® Many Integrated Core Architektur

Parallel Computing

Vektorisierung

↧

How to use the cilkview?

March 31, 2014, 11:44 am

Latest and popular articles on Intel Technologies

≫ Next: Intel MPI Library and Composer XE Compatibility

≪ Previous: Intel® Software Development Tools 2015 Beta

I have a C search application on a centos 6.x 64 bit linux server that I just installed the cilkplus compiler on to take advantage of more cpu/cores. I've added the cilk_spawn function to some recursive scanning functions in my program. After re-compiling the search application with the cilkplus gcc compiler, the search program is working as intended without any seg faults or any other errors.

My question is how do I use the cilkview analyzer? I want to if cilkplus/spawning is helping my search application and if so by how much?

Thanks!

Lawrence

↧

Intel MPI Library and Composer XE Compatibility

March 31, 2014, 1:28 pm

Latest and popular articles on Intel Technologies

≫ Next: Question on reducers

≪ Previous: How to use the cilkview?

The following table lists all supported versions of the Intel® MPI Library and the Intel® Composer XE. Use this as a reference on the cross-compatibility between the library and associated compiler.

Compatibility Matrix
Intel® MPI Library Version	Intel® Compiler XE 8.1	Intel® Compiler XE 9.0	Intel® Compiler XE 9.1	Intel® Compiler XE 10.0	Intel® Compiler XE 10.1	Intel® Compiler XE 11.0	Intel® Compiler XE 11.1	Intel® Composer XE 2011	Intel® Composer XE 2013	Intel® Composer XE 2013 SP1
*3.1*	X	X	X	X	X
*3.1 U1*	X	X	X	X	X
*3.2*			X	X	X	X
*3.2 U1*			X	X	X	X
*3.2 U2*			X	X	X	X
*4.0*					X	X	X
*4.0 U1*					X	X	X
*4.0 U2*					X	X	X
*4.0 U3*							X	X
*4.1*							X	X	X
*4.1 U1*							X	X	X
*4.1 U2*									Up to Update 2	X
*4.1 U3*									Up to Update 2	X

NOTE: Any older versions of the Intel® MPI Library may work with newer versions of the Intel® Compiler XE but compatibility is not guaranteed. If you have concerns or see any issues, please let us know by submitting a ticket at the Intel® Premier Support site.

C/C++

Fortran

Intel® Visual Fortran Composer XE

Intel® MPI Library

Intel® Cluster Studio XE

Message Passing Interface

↧

Question on reducers

March 31, 2014, 3:52 pm

Latest and popular articles on Intel Technologies

≫ Next: Issue with gather & scatter operations

≪ Previous: Intel MPI Library and Composer XE Compatibility

In my search application there are globally variables defined outside any function that I would like to use the cilk reducers on.

Specifically I have code like this:

#include "search.h"

static int total_users = 0;
static int total_matches = 0;

These total_x variables are incremented throughout the application on different functions.

I tried adding the following for total_users and received the following error:

cilk::reducer_opadd<static int> total_users;

cilk plus error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘:’ token

What am I doing wrong here?

↧

Issue with gather & scatter operations

April 2, 2014, 2:05 am

Latest and popular articles on Intel Technologies

≫ Next: Question on cilk_sort

≪ Previous: Question on reducers

Hi,

I read on the doc that array notation can be used for array indicies in both cases :

C[:] = A[B[:]] and A[B[:]] = C[:]

I try to use this notation for left & right operands at the same time but it gives me wrong results.

Here is my problem:

double tmp[VEC_SIZE]; // Already initialized
int index[VEC_SIZE];  // Already initialized

tab[index[:]] = tab[index[:]] + tmp[:];       // This line gives wrong result

for (int i = 0; i < VEC_SIZE; i++) {
    tab[index[i]] = tab[index[i]] + tmp[i];   // While this loop gives the correct result
}

To me, these two versions of the code are supposed to be equivalent, am I wrong ?

Can we use array notation for array indicies in left & right operands at the same time ?

Thanks

↧

Question on cilk_sort

April 2, 2014, 12:36 pm

Latest and popular articles on Intel Technologies

≫ Next: Open Source Downloads

≪ Previous: Issue with gather & scatter operations

Is cilk_sort functions parallel drop in replacements for the C qsort function?

↧

Open Source Downloads

April 4, 2014, 2:49 am

Latest and popular articles on Intel Technologies

≫ Next: Memory Movement and Initialization: Optimization and Control

≪ Previous: Question on cilk_sort

This article makes available third-party libraries, executables and sources that were used in the creation of Intel® Software Development Products or are required for operation of those. Intel provides this software pursuant to their applicable licenses.

Required for Operation of Intel® Software Development Products

The following products require additional third-party software for operation.

Intel® Composer XE 2015 for Windows*:
The following binutils package is required for operation with Intel® Graphics Technology:
Téléchargement binutils_setup.zip
Please see Release Notes of the product for detailed instructions on using the binutils package.

The above binutils package is subject to various licenses. Please see the corresponding sources for more information:
Téléchargement binutils_src.zip

Used within Intel® Software Development Products

The following products contain Intel® Application Debugger, Intel® Many Integrated Core Architecture Debugger and/or Intel® JTAG Debugger tools which are using third party libraries as listed below.

Products and Versions:

Intel® Composer XE 2013 SP1 for Linux*

Intel® C++ Composer XE 2013 SP1 for Linux*/Intel® Fortran Composer XE 2013 SP1 for Linux*
(Initial Release and higher; 13.0 Intel® Application Debugger)

Intel® Composer XE 2013 for Linux*

Intel® C++ Composer XE 2013 for Linux*/Intel® Fortran Composer XE 2013 for Linux*
(Initial Release and higher; 13.0 Intel® Application Debugger)

Intel® Composer XE 2011 for Linux*

Intel® C++ Composer XE 2011 for Linux*/Intel® Fortran Composer XE 2011 for Linux*
(Update 6 and higher; 12.1 Intel® Application Debugger)
Intel® C++ Composer XE 2011 for Linux*/Intel® Fortran Composer XE 2011 for Linux*
(Initial Release and up to Update 5; 12.0 Intel® Application Debugger)

Intel® Compiler Suite Professional Edition for Linux*

Intel® C++ Compiler for Linux* 11.1/Intel® Fortran Compiler for Linux* 11.1
Intel® C++ Compiler for Linux* 11.0/Intel® Fortran Compiler for Linux* 11.0
Intel® C++ Compiler for Linux* 10.1/Intel® Fortran Compiler for Linux* 10.1

Intel® Embedded Software Development Tool Suite for Intel® Atom™ Processor:

Version 2.3 (Initial Release and up to Update 2)
Version 2.2 (Initial Release and up to Update 2)
Version 2.1
Version 2.0

Intel® Application Software Development Tool Suite for Intel® Atom™ Processor:

Version 2.2 (Initial Release and up to Update 2)
Version 2.1
Version 2.0

Intel® C++ Software Development Tool Suite for Linux* OS supporting Mobile Internet Devices (Intel® MID Tools):

Version 1.1
Version 1.0

Intel AppUp™ SDK Suite for MeeGo*

Initial Release (Version 1.0)

Used third-party libraries:
Please see the attachments for a complete list of third-party libraries.

Note: The packages posted here are unmodified copies from the respective distributor/owner and are made available for ease of access. Download or installation of those is not required to operate any of the Intel® Software Development Products. The packages are provided as is, without warranty or support.

Eclipse

EPL

third-party

Intel(R) Software Development Products

Intel® Graphics Technology

Intel® Application Debugger

Intel® Many Integrated Core Architecture Debugger & Intel® JTAG Debugger

Intel AppUp® Developer

Linux*

Compiler Methodology for Intel^® Many Integrated Core (Intel^®MIC) Architecture

Intel® Atom™ Prozessoren

Open Source

↧

Memory Movement and Initialization: Optimization and Control

April 16, 2014, 2:01 pm

Latest and popular articles on Intel Technologies

≫ Next: Оптимизировали, оптимизировали, да не выоптимизировали!

≪ Previous: Open Source Downloads

Overview

Are you initializing data or copying blocks of data from one variable to another in your application? Probably so. Moving or setting blocks of data is very common. So how to best optimize these operations for Intel^® Xeon Phi™ Coprocessors?

Job #1 - For Phi, Parallelize the initialization!

A single Phi core cannot saturate the bandwidth available on Phi. So if only 1 core is initializing your large arrays you will notice a significant slowness compared to Xeon (due to the relatively slow clock speed of the Phi cores). Therefore, on Phi it is necessary to get many cores involved in the memory initialization to insure that the memory subsystem is driven at or near maximum bandwidth.

For example, if you have something like this:

do i=1,N
  arr1(i) = 1.1_dp
end do

you can parallelize the do loop:

!DIR$ vector nontemporal
!$OMP PARALLEL DO
do i=1,N
  arr1(i) = 1.1_dp
end do

mem*() calls in libc

The mem* family of functions in libc can take significant amounts of time in many applications. These include memcpy(), memset() and memmove() functions. C programmers may call these directly in their code. In addition to directly calling these functions, Fortran and C applications with data initalizations or data copy statements may IMPLICITLY call these functions when a compiler translates the data set/move/copy statements into calls to these libc mem*() functions. In addition, Fortran may hide direct calls to libc mem*() functions in the Fortran Runtime Libraries which often "wrap" calls to libc mem*() functions.

Applications compiled with the Intel Compilers: because these libc mem*() functions are so common, the Intel Compilers provide optimized versions of memset and memcpy in the Intel Compiler provided library 'libirc'. This library and specifically these functions are intended to replace the calls to mem*() functions with a more optimized version of the mem*() functions. The Intel replacement libraries have symbol names "_intel_fast_memset" and "_intel_fast_memcpy".

Some examples showing how the compiler will translate some calls into :

Fortran:

more memset.f90
program memsetter
  integer, parameter :: N=10000
  real :: a(N)
  integer :: i

  do i=1,N
    a(i) = 0.0
  end do
!...or with array syntax
  a = 0.0

 print*, a(1:10) !...if you don't use array a, the loop above
                  !...is completely optimized away
end program memsetter

# now compile the code at O2 or greater, use 'nm' to dump the symbols
$ ifort -c -g -O2 memset.f90
$ nm memset.o
0000000000000000 N .debug_info_seg
0000000000000000 T MAIN__
...U _intel_fast_memset
...
0000000000000000 b memsetter_$A.0.1

#include <stdio.h>
float a[1000000];
float b[1000000];
int main() {
int i, n;
n=1000000;
for (i=0; i<n; i++) {
a[i] = b[i];
}
printf("%f", a[1]);
}

$ icc -g -c -O2 memset.c
rwgreen@dpdknf01:~/projects$ nm memset.o
0000000000000000 N .debug_info_seg
...
U _intel_fast_memcpy

_intel_fast_mem* function calls and how to control their use?

“memcpy” calls in user-code (explicit and implicit) get translated to intel_fast_memcpy UNLESS user uses non-default options such as:

C++: -ffreestanding (option means user provides possibly their own version of library entry-points, so compiler is NOT free to translate mem* calls to other versions).
Fortran: -nolib-inline ( option disables inline expansion of standard library or intrinsic functions, and prevents the compiler from translating mem* functions to their intel_fast_mem* equivalents )

So depending on the options used for compilation, you may be getting glibc memcpy (or user’s own version) OR intel_fast_memcpy.

Streaming Stores - Nontemporal writes for data:

Many High-Performance Computing applications need to move data in huge blocks. Normally during write operations the application will move data through the data cache(s) with the assumption that data may be reused again soon ( known as a 'write through cache'). However, in many cases an HPC application will completely overwrite cache contents (first level, second level - the whole cache hierarchy) in the process of moving data that are much larger than the cache size. This wipes out any 'useful' data that may be cached, effectively flushing their contents. To avoid this, the programmer may specify to use 'streaming stores.' Streaming store instructions on the Intel microarchitecture code name Knights Corner do not perform a read for ownership (RFO) for the target cache line before the actual store, thus saving memory bandwidth. The data remain cached in L2 (This is in contrast to the streaming stores on Intel^® Xeon^® processors where the on-chip cache hierarchy is bypassed and the data get combined in a separate write-combining buffer). See the article here for more details: Intel^® MIC Architecture Streaming Stores.

To control use of non-temporal streaming store instructions, the Intel compilers provide the -opt-streaming-stores (Linux*, OS* X) , /Qopt-streaming-stores (WIndows*) option. The syntax is:

Linux and OS X

-opt-streaming-stores keyword

Windows:

/Qopt-streaming-stores:keyword

Arguments

keyword

Specifies whether streaming stores are generated. Possible values are:

always	Enables generation of streaming stores for optimization. The compiler optimizes under the assumption that the application is memory bound.
never	Disables generation of streaming stores for optimization. Normal stores are performed.
auto	Lets the compiler decide which instructions to use.

Default

-opt-streaming-stores auto
or/Qopt-streaming-stores:auto

The compiler decides whether to use streaming stores or normal stores.

Description

This option enables generation of streaming stores for optimization. This method stores data with instructions that use a non-temporal buffer, which minimizes memory hierarchy pollution.

This option may be useful for applications that can benefit from streaming stores.

Control Streaming Store with Pragma/Directive:

C and Fortran: Add “simd” pragma to suppress conversion-to-mem*. Add another “vector nontemporal” pragma/directive to generate non-temporal stores. Examples:

!DIR$ vector nontemporal
!DIR$ simd
do i=1,n
   a(i) = 0
enddo

#pragma vector nontemporal
#pragma simd
  for (i=0; i<n; i++) {
    a[i] = b[i];
  }
}

Advanced Notes:

Inside intel_fast_memcpy() (library function that resides in libirc.a library that gets shipped with the Intel compiler), uses non-temporal stores for memcpy IF the copy-size is > 256K. For smaller sizes you will still get vector-code, but it will not use non-temporal stores.

The Intel compilers and libraries do NOT automatically parallelize the mem* calls (The execution will happen in a single thread unless the memcpy/loop resides inside a user-parallelized code-region).

In some specialized uses of memcpy, the application has extra knowledge of the cache-behavior of the src/dest arrays and their cache-locality at a bigger scope than what the library-code sees from just one invocation of the memcpy. In such cases, you may be able to do smarter optimizations (such as different prefetching techniques that are not just based on the input-size) in a loop-version (or a smarter user-version of specialized memcpy) that may lead to better behavior for your application.

For stream-copy, the src-code does not use memcpy directly, but it has a copy-loop. Under default options, compiler translates the loop into a call to intel_fast_memcpy that then takes the path executing the stores using non-temporal stores. In the best performing stream-copy version though, you can get slightly better performance (~14% better) using the C++ options "-opt-streaming-stores always -opt-prefetch-distance=64,8" OR “-ffreestanding -opt-prefetch-distance=64,8” due to the better prefetching behavior in the loop-version of the code vectorized by the compiler (driven by the compiler prefetching options and no translation to memcpy library call).

In general, small-size memcpy performance is expected to be slower on Intel MIC Archiecture compared to a host processor (when it is NOT bound by bandwidth - meaning small sizes plus cache-resident data) due to the slower single-threaded clock speed on the coprocessor.

Take Aways

Memory movement operations can either explicitly or implicitly call memcpy() or memset() functions to move or set blocks of data. These functions can be linked to routines provided by the resident libc provided by your OS. The Intel compilers in certain conditions will replace the slower libc calls with faster versions in the Intel compiler runtime libraries such as _intel_fast_memcpy and _intel_fast_memset, which are optimized for Intel architecture.

Moving large data sets through the cache hierarchy can flush useful data out of cache. Streaming stores can be used to improve memory bandwidth on Intel^® Xeon Phi™ Coprocessors. The opt-streaming-store compiler option can be used or the pragma/directive nontemporal can be used for finer grain control.

NEXT STEPS

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel^® Xeon Phi™ Coprocessors. The paths provided in this guide reflect the steps necessary to discover best possible application performance.

Back to Advanced Optimizations chapter

intel_fast_memcpy

intel_fast_memset

VECTOR TEMPORAL

-opt-streaming-stores

Entwicklungstools

Compiler-Themen

↧

Оптимизировали, оптимизировали, да не выоптимизировали!

October 22, 2012, 2:09 am

Latest and popular articles on Intel Technologies

≫ Next: Cilk view got error with program compiled by GCC 4.9.0

≪ Previous: Memory Movement and Initialization: Optimization and Control

Оптимизация? Конечно, каждый сталкивался с данной задачей при разработке своих, сколь-нибудь значительных, требующих определённых вычислений, приложений. При этом способов оптимизировать код существует огромное множество, и, как следствие, различных путей сделать это в автоматическом режиме с помощью опций компилятора. Вот здесь и возникает проблема – как выбрать то, что нужно нам и не запутаться?

Начнём с того, что оптимизации в современном компиляторе компании Intel® делятся на feature-specific (функциональные) и processor-specific (процессорные). К первым относят, например, оптимизации, подключаемые с помощью наиболее часто используемых ключиков O1, O2, O3.

Другая история – оптимизация с помощью опций компилятора под конкретное «железо» и набор инструкций, который оно поддерживает. Хотелось бы раскрыть именно этот вопрос более детально.

Итак, ключей для генерации «специфичного» под какое-либо семейство процессоров несколько. Что же «специфичного» мы получаем? В первую очередь, использование SIMD инструкций поддерживаемой архитектуры. Рассмотрим возможные ключи:

1)-x<feature> (Linux), /Qx<feature> (Windows)

Если мы хотим, чтобы компилятор оптимизировал наш код под конкретный, заранее известный тип интеловского процессора, и, соответственно, поддерживаемый им набор инструкций и функциональностей, то можно использовать данный ключ. При этом мы не собираемся запускать приложение на системах, которые не поддерживают данный набор инструкций.

Скажем, есть у нас система с процессором архитектуры SandyBridge, на которой мы собираемся запускать приложение. Данный процессор поддерживает набор инструкций AVX, и именно это значение мы можем указать с помощью ключика –xAVX (Linux) или –QxAVX (Windows). Таким образом, мы получим оптимизированный бинарник, который при запуске будет проверять, поддерживает ли наш процессор AVX инструкции. Если да, то наше приложение успешно запуститься и раскроет все преимущества использования процессорной оптимизации. Если же нет, то вылетит ошибка о том, что наш процессор не поддерживает данную функциональность. Приложение выполняться не будет!

Список значений, которые можно прописать с помощью данного ключика следующий:

SSE2, SSE3, SSSE3, SSSE3_ATOM, SSE3_ATOM, SSE4.1, SSE4.2, AVX, CORE-AVX-I, CORE-AVX2

Важно, что наиболее «свежие» версии поддерживают предыдущие, поэтому, возвращаясь к нашему примеру, очевидно, что код, скомпилированный с опцией –xSSE4.1, будет успешно выполнен и на системах, поддерживающих SSE4.2, AVX и выше. Но не наоборот. Специфичные AVX инструкции не смогут выполняться на более старых системах их не поддерживающих, что логично.

2)-m<feature> (Linux), /arch: <feature> (Windows)

Данный ключ позволит получить приложение, которое будет содержать оптимизацию для всех типов процессоров (не только интеловских), поддерживающих заданный набор инструкций. При этом никакой проверки в main функцию на тип процессора добавляться не будет, а это значит, что если мы станем запускать приложение на системе с процессором, не поддерживающим заданный набор инструкций, произойдёт неприятность… в отличие от проверки в случае с флагом –x, где просто выдаётся сообщение о несовместимости, в данном случае программа «упадёт», пытаясь выполнить конкретную инструкцию.

3)-ax<feature> (Linux), /Qax<feature> (Windows)

Ещё одна опция компилятора, позволяющая создать несколько версий (code paths), оптимизированных под разные архитектуры. Мы можем указать сразу несколько архитектур через запятую, например –axSSE4.2,AVX. Понятно, что в данном случае увеличится размер выходного файла (он так и будет один). При этом создаются как минимум два варианта – базовый (baseline) и оптимизированный под заданную архитектуру код. Есть одно но… компилятор сам проверяет, будет ли выигрыш в производительности от создания дополнительной, оптимизированной ветки. Если выигрыш есть, сгенерирует несколько версий, и во время выполнения программы, в зависимости от используемого процессора, будет выбрана нужная. С использованием этого ключика мы получим возможность создавать приложения, извлекающие максимум возможностей на наиболее свежих процессорах, при этом так же вполне валидно работающих на более старых и не интеловских процессорах (будет выполняться базовая ветка).

Интересны дополнительные комбинации перечисленных ключиков. Скажем, использование опций –ax и –x одновременно приводит к тому, что последней будет задана базовая версия, которая по умолчанию выставляется как SSE2. Поэтому, если есть необходимость её изменить с SSE2 на что-то другое, используем сочетание этих опций. Равно как и сочетание –ax и –m приведёт к тому, что в базовой версии будет оптимизированный код, который выполняется не только на интеловских процессорах.

Есть ещё один интересный флажок -mia32. Только он сможет создать приложение, способное выполняться на процессорах, более древних, чем Pentium 4 (не поддерживающих SSE2). Скажем, комбинация опций -mia32 -axSSE2 (Linux) или /arch:IA32 /QaxSSE2 (Windows) создаст приложение, которое будет выполняться на любых процессорах с архитектурой IA-32, плюс ещё и оптимизированная под SSE2 ветка.

И напоследок… если мы не знаем, какой набор инструкций поддерживает наша текущая система, но хотим оптимизировать именно под неё, нам поможет флаг –xHost(Linux) и /QxHost (Windows). Она скажет компилятору генерировать код, оптимизированный под поддерживаемую нашим процессором архитектуру. На другой машине наше приложение может и не заработает, но есть вероятность, что этого нам и не нужно.

А теперь немного практики. Будем компилировать программку перемножения матриц с разными ключиками на Windows. Тесты я провожу на системе с процессором архитектуры SandyBridge, поэтому собрав программку с флагом -QxCORE-AVX2, при запуске получаю следующую ошибку:

Fatal Error: This program was not built to run in your system.

Please verify that both the operating system and the processor support Intel(R) AVX2, BMI, LZCNT, HLE, RTM and FMA instructions.

Меняю ключик на –QaxAVX,CORE-AVX2, и программа запускается успешно, при этом выполняются специфичныe AVX инструкции. Можно было бы написать и просто –QaxCORE-AVX2 – как мы помним, в дефолтной baseline версии будет код, оптимизированный с использованием SSE2 инструкций.

Кстати, интересно глянуть и на ASM листинг. Добавив опцию –S (к опциям –QaxAVX,CORE-AVX2), получаем соответствующий файл с расширением *.asm на выходе. В нём видно, что сгенерировалось три версии инструкций, причём эти версии находятся в разных местах листинга:

lea       r8, QWORD PTR [rax+rcx*8]                     ;76.2
vcvtdq2pd ymm15, xmm13                                  ;76.2
add       rcx, 16                                       ;76.2
vextracti128 xmm14, ymm13, 1                            ;76.2
vcvtdq2pd ymm14, xmm14                                  ;76.2
vpaddd    ymm13, ymm13, ymm4                            ;76.2
vfmadd213pd ymm15, ymm5, ymm0                           ;76.2
...
lea       r8, QWORD PTR [rax+rcx*8]                     ;76.2
vpaddd    xmm5, xmm5, xmm1                              ;76.2
vmulpd    ymm15, ymm3, ymm15                            ;76.2
add       rcx, 16                                       ;76.2
vaddpd    ymm15, ymm2, ymm15                            ;76.2
vmovupd   YMMWORD PTR [imagerel(a)+rbp+r8], ymm15       ;76.2
vcvtdq2pd ymm15, xmm5                                   ;76.2
...
lea       r8, QWORD PTR [rax+rcx*8]                     ;76.2
movaps    XMMWORD PTR [imagerel(a)+rbp+r8], xmm15       ;76.2
add       rcx, 8                                        ;76.2
cvtdq2pd  xmm15, xmm14                                  ;76.2
mulpd     xmm15, xmm1                                   ;76.2
addpd     xmm15, xmm5                                   ;76.2
paddd     xmm14, xmm3                                   ;76.2

Вот как-то так работают processor-specific оптимизации в интеловском компиляторе. Надеюсь было интересно и познавательно. До новых встреч!

Symbol-Bild:

Intel® Visual Fortran Composer XE

Intel® Parallel Composer

Intel® Streaming SIMD Extensions

Apple OS X*

↧

Cilk view got error with program compiled by GCC 4.9.0

April 26, 2014, 1:25 pm

Latest and popular articles on Intel Technologies

≫ Next: Cilk Plus and Graphics Application

≪ Previous: Оптимизировали, оптимизировали, да не выоптимизировали!

Hi,

Recently, I have been compiling my Cilk plus program with GCC 4.9.0. Then I ran Cilk view to measure its parallelism and got this error:

Cilkview: Generating scalability data Cilkview Scalability Analyzer V2.0.0, Build 3229 C:Tool (or Pin) caused signal 11 at PC 0x7f9124cd1dd3

Interestingly, when I compiled the program with GCC 4.8.2 and ran it with Cilk view again, it ran fine. So does it only happen with GCC 4.9.0? My guess is that GCC 4.9.0 isn't stable enough or not compatible with Cilk view at this time.

Thanks,

↧

Cilk Plus and Graphics Application

April 27, 2014, 9:57 am

Latest and popular articles on Intel Technologies

≫ Next: Question on the Status of Cilk Plus Support in GCC Mainline

≪ Previous: Cilk view got error with program compiled by GCC 4.9.0

Hi,

I am trying to parallelize a c++ raytracing engine which renders pixels to screen using cilk plus. How can I convert a windows application to a console application so that I can use the cilk keywords in cilk_main() or is there perhaps another alternative way to using cilk keywords with the WinMain() method. I tried using OpenGL calls but am having trouble with that, you can see my post on the link http://stackoverflow.com/questions/23289466/how-do-i-display-the-result-of-a-raytracing-engine-in-using-opengl

I will appreciate any help I can get

Thanks

↧

Question on the Status of Cilk Plus Support in GCC Mainline

April 28, 2014, 10:30 am

Latest and popular articles on Intel Technologies

≫ Next: Parallel Search With Cilk Plus

≪ Previous: Cilk Plus and Graphics Application

Hi,

I read from the GCC 4.9.0 release notes that http://gcc.gnu.org/gcc-4.9/changes.html

Support for Cilk Plus has been added and can be enabled with the -fcilkplus option. Cilk Plus is an extension to the C and C++ languages to support data and task parallelism. The present implementation follows ABI version 1.2; all features but _Cilk_for have been implemented.

I wonder if _Cilk_for is the only feature that is missing from the GCC 4.9.0 release as well as the mainline branch.

Another question is about the libcilkplus in the GCC. For patches to libcilkplus in GCC, is http://www.cilkplus.org/submit-cilk-contribution still the right place to submit a patch? If yes, how often is the GCC libcilkplus synced with the upstream version?

Thanks,

Yufeng

↧

Parallel Search With Cilk Plus

May 1, 2014, 8:08 am

Latest and popular articles on Intel Technologies

≫ Next: Intel® C++ Composer XE 2013 SP1 for OS X*, Update 3

≪ Previous: Question on the Status of Cilk Plus Support in GCC Mainline

Hi everyone , I have a program that generating random number if it does not exist in ( allocated custom size) array then add array. But if custom size is very big ( 1 million ) after a period search is very slowing down. I did learn cilk_for and reducers.I want to paralleize but I could not decide what reducer is suitable for array. Is there someone who can help me ?

(Sorry for my english if you do not understand my problem you can write my e-mail "03011241@st.meliksah.edu.tr" )

↧

Intel® C++ Composer XE 2013 SP1 for OS X*, Update 3

April 24, 2014, 4:07 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel® C++ Composer XE 2013 SP1 for Linux*, Update 3

≪ Previous: Parallel Search With Cilk Plus

Intel® C++ Composer XE 2013 SP1 Update 3 includes the latest Intel C/C++ compilers and performance libraries for IA-32 and Intel® 64 architecture systems. This new product release now includes: Intel® C++ Compiler XE Version 14.0.3, GNU* Project Debugger (GDB*) 7.5, Intel® Math Kernel Library (Intel® MKL) Version 11.1 Update 3, Intel® Integrated Performance Primitives (Intel® IPP) Version 8.1 Update 1, Intel® Threading Building Blocks (Intel® TBB) Version 4.2 Update 4.

New in this release:

Intel® C++ Compiler 14.0.3
Intel® Math Kernel Library 11.1 update 3
Intel® Integrated Performance Primitives 8.1 Update 1
Intel® Threading Building Blocks 4.2 update 4
Xcode* 5.1 supported
-stdlib default now –stdlib=libc++
Corrections to reported problems
- Compiler fix list
- Intel® MKL fix list

Note: For more information on the changes listed above, please read the individual component release notes. See the previous releases's ReadMe to see what was new in that release.

Resources

Intel® Composer XE (Click on desired product)
Intel® Composer XE 2013 SP1 Checksums

Contents
File: m_ccompxe_2013_sp1.3.166.dmg
Product for developing 32-bit and 64-bit applications

File: m_ccompxe_redist_2013_sp1.3.166.dmg
Redistributable Libraries

Intel® Integrated-Performance-Primitives

Intel® Debugger

Intel® Math Kernel Library

Intel® Threading Building Blocks

↧

Intel® C++ Composer XE 2013 SP1 for Linux*, Update 3

April 30, 2014, 1:29 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel® C++ Composer XE 2013 SP1 for Windows*, Update 3

≪ Previous: Intel® C++ Composer XE 2013 SP1 for OS X*, Update 3

Intel® C++ Composer XE 2013 SP1 Update 3 includes the latest Intel C/C++ compilers and performance libraries for IA-32, Intel® 64, and Intel® Many Integrated Core (Intel® MIC) architecture systems. This new product release now includes: Intel® C++ Compiler XE Version 14.0 Update 3, GNU* Project Debugger (GDB*) 7.5, Intel® Debugger 13.0, Intel® Math Kernel Library (Intel® MKL) Version 11.1 Update 3, Intel® Integrated Performance Primitives (Intel® IPP) Version 8.1 Update 1, Intel® Threading Building Blocks (Intel® TBB) Version 4.2 Update 4.

New in this release:

Intel® C++ Compiler XE 14.0.3
Intel® Math Kernel Library 11.1 update 3
Intel® Integrated Performance Primitives 8.1 Update 1
Intel® Threading Building Blocks 4.2 update 4
Corrections to reported problems
- Compiler fix list
- Intel® MKL fix list

Note:

For more information on the changes listed above, please read the individual component release notes. See the previous releases's ReadMe to see what was new in that release.

Resources

Intel® Composer XE (Click on desired product)
Intel® Composer XE 2013 SP1 Checksums

Contents
File: l_ccompxe_online_2013_sp1.3.174.sh
Product for developing 32-bit and 64-bit applications

File: l_ccompxe_2013_sp1.3.174.tgz
Product for developing 32-bit and 64-bit applications

File: l_ccompxe_2013_sp1.3.174_redist.tgz
Redistributable Libraries

File: get-crypto-library.htm
Cryptography Library

Intel® Integrated-Performance-Primitives

Intel® Debugger

Intel® Math Kernel Library

Intel® Threading Building Blocks