Channel: Intel® C++ Composer XE

Mapping Intel® Compiler and Composer Update version numbers to compiler version numbers


Introduction: Mapping Intel Compiler or Composer Update numbers to specific compiler versions and packages
 

Intel® Composer XE 2013 SP1 (released September 2013)

Composer XE 2013 SP1 | Intel Registration Center activation date (yr.mo.day) | Windows version / build | Linux version / build | Mac OS version / build
Update 3 | 2014.05.05 | 14.0.3.202 / 20140422 | 14.0.3.174 / 20140422 | 14.0.3.166 / 20140421
Update 2 | 2014.02.12 | 14.0.2.176 / 20140130 | 14.0.2.144 / 20140120 | 14.0.2.139 / 20140121
Update 1 | 2013.10.18 | 14.0.1.139 / 20131008 | 14.0.1.106 / 20131008 | 14.0.1.103 / 20131010
Initial release | 2013.09.04 | 14.0.0.103 / 20130728 | 14.0.0.080 / 20130728 | 14.0.0.074 / 20130716

 

Intel® Composer XE 2013 (released September 2012)

 

Composer XE 2013 | Intel Registration Center activation date | Windows version / build | Linux version / build | Mac OS version / build
Update 5 | 2013/06/20 | 13.1.3.198 / 20130607 | 13.1.3.192 / 20130607 | 13.0.3.198 / 20130606
Update 4 | 2013/05/24 | 13.1.2.190 / 20130514 | 13.1.2.183 / 20130514 | no Update 4 for OS X
Update 3 | 2013/03/26 | 13.1.1.171 / 20130314 (removed from downloads due to a bug; upgrade to Update 4 or later) | 13.1.1.163 / 20130313 (removed from downloads due to a bug; upgrade to Update 4 or later) | 13.0.2.171 / 20130314
Update 2* | 2013/01/31 | 13.1.0.149 / 20130118 | 13.1.0.146 / 20130121 | no package for OS X* (OS X skips this update)
Update 1 | 2012/10/23 | 13.0.1.119 / 20121008 | 13.0.1.117 / 20121010 | 13.0.1.119 / 20121010
Initial release | 2012/09/05 | 13.0.0.089 / 20120731 | 13.0.0.079 / 20120731 | 13.0.0.088 / 20120731

* Update 2 is a minor update and the first release of version 13.1 (except on the Mac).

Intel Composer XE 2011 (aka 12.0) and Intel Composer XE 2011 SP1 (aka 12.1) edition mappings.
Notes: 


Intel Composer XE 2011 is also known as version 12.0. It comprises the initial release and Updates 1 through 5.

Intel Composer XE 2011 SP1 is also known as version 12.1. This version first appeared as Update 6, and subsequent updates (7, 8, etc.) are all version 12.1 compilers.


 
Composer XE | Reg Center activation date | Windows version / build | Linux version / build | Mac OS version / build
Composer XE 2011 (12.0) initial release | 11/09/2010 | 12.0.0.104 / 20101006 | 12.0.0.084 / 20101006 | 12.0.0.085 / 20101006
Composer XE 2011, Reg. Center Update 1 | 12/02/2010 | 12.0.1.128 / 20101116 | 12.0.1.108 / 20101116 | 12.0.1.122 / 20101110
Composer XE 2011, Reg. Center Update 2 | 01/29/2011 | 12.0.2.154 / 20110112 | 12.0.2.137 / 20110117 | 12.0.2.142 / 20110112
Composer XE 2011, Reg. Center Update 3 | 03/24/2011 | 12.0.3.175 / 20110309 | 12.0.3.174 / 20110309 (Japanese: 12.0.3.175) | 12.0.3.167 / 20110309
Composer XE 2011, Reg. Center Update 4 | 05/09/2011 | 12.0.4.196 / 20110427 | 12.0.4.191 / 20110427 | 12.0.4.184 / 20110503
Composer XE 2011, Reg. Center Update 5 | 07/29/2011 | 12.0.5.221 / 20110719 | 12.0.5.220 / 20110719 | 12.0.5.209 / 20110719
Composer XE 2011 SP1 (aka 12.1), Reg. Center Update 6 | 08/24/2011 | 12.1.0.233 / 20110811 | 12.1.0.233 / 20110811 | 12.1.0.038 / 20110817
Composer XE 2011 SP1 Update 1 (Reg. Center Update 7) | 10/21/2011 | 12.1.1.258 / 20111011 | 12.1.1.256 / 20111011 | 12.1.1.246 / 20111011
Composer XE 2011 SP1 Update 2 (Reg. Center Update 8) | 12/16/2011 | 12.1.2.278 / 20111128 | 12.1.2.273 / 20111128 | 12.1.2.269 / 20111207
Composer XE 2011 SP1 Update 3 (Reg. Center Update 9) | 02/09/2012 | 12.1.3.300 / 20120130 | 12.1.3.293 / 20120212 | 12.1.3.289 / 20120130
Composer XE 2011 SP1 Update 4 (Reg. Center Update 10) | 04/30/2012 | 12.1.4.325 / 20120410 | 12.1.4.319 / 20120410 | 12.1.4.328 / 20120423
Composer XE 2011 SP1 Update 5 (Reg. Center Update 11) | 06/26/2012 | 12.1.5.344 / 20120612 | 12.1.5.339 / 20120612 | 12.1.5.344 / 20120612
Composer XE 2011 SP1 Update 6 (Reg. Center Update 12) | 09/10/2012 | 12.1.6.369 / 20120821 | 12.1.6.361 / 20120821 | 12.1.6.371 / 20120821
Composer XE 2011 SP1 Update 7 (Reg. Center Update 13) | 10/08/2012 | 12.1.7.371 / 20120928 | 12.1.7.367 / 20120928 | 12.1.7.380 / 20120928





Intel Compiler Professional Edition / Intel Visual Fortran edition mappings

Compiler Pro | Reg Center post date | Windows version / build | Linux version / build | Mac OS version / build
11.1 initial release | 06/11/2009 | **11.1.035 / 20090511 (removed) | 11.1.038 / 20090511 | 11.1.046 / 20090511
11.1 Update 1 | 07/14/2009 | **11.1.038 / 20090624 (removed) | 11.1.046 / 20090630 | 11.1.058 / 20090624
11.1 Update 2 | 09/14/2009 | **11.1.046 / 20090903 (removed) | 11.1.056 / 20090827 | 11.1.067 / 20090910
**11.1 Windows Update 2 revised | 10/07/2009 | 11.1.048 / 20090930 | n/a | n/a
11.1 Update 3 | 10/21/2009 | 11.1.051 / 20091012 | 11.1.059 / 20091012 | 11.1.076 / 20091029
11.1 Update 4 | 12/15/2009 | 11.1.054 / 20091130 | 11.1.064 / 20091130 | 11.1.080 / 20091130
11.1 Update 5 | 02/18/2010 | 11.1.060 / 20100203 | 11.1.069 / 20100203 | 11.1.084 / 20100203
11.1 Update 6 | 04/22/2010 | 11.1.065 / 20100414 | 11.1.072 / 20100414 | 11.1.088 / 20100401
11.1 Update 7 | 08/20/2010 | 11.1.067 / 20100806 | 11.1.073 / 20100806 | 11.1.089 / 20100806
11.1 Update 8 | 12/09/2010 | 11.1.070 / 20101201 | 11.1.075 / 20101201 | 11.1.091 / 20101201
11.1 Update 9 | 7/13/2011 | 11.1.072 / 20110708 | 11.1.080 / 20110708 | no Mac release


** NOTE: For Compiler Professional Edition for Windows 11.1, the initial release and Updates 1 and 2 were pulled from the Intel Registration Center. 11.1.048 was posted out of cycle to replace these early versions and was referred to as "11.1 Update 2 revised". These versions were pulled because of a bug; details are here: /en-us/articles/program-crashes-or-hangs-on-some-systems. PLEASE UPDATE to 11.1.048 or newer if you have any of the affected compiler versions listed above. Contact support at premier.intel.com for assistance.
 

 


Intel Parallel Composer 2011 Releases
 

Intel Parallel Composer 2011 | Reg Center post date | Windows version / build
Initial release | 08/13/2010 | 2.0.0.063 / 20100721
Update 1 | 12/03/2010 | 2.0.1.096 / 20101203
Update 2 | 01/20/2011 | 2.0.2.114 / 20110113
Update 3 | 03/25/2011 | 2.0.3.132 / 20110309
Update 4 | 05/11/2011 | 2.0.4.147 / 20110427
Update 5 | 08/03/2011 | 2.0.5.172 / 20110722
SP1 (Update 6) | 08/11/2011 | 2.1.6.043 / 20110811
SP1 (Update 7) | 10/19/2011 | 2.1.7.062 / 20111011
SP1 (Update 8) | 10/19/2011 | 2.1.8.079 / 20111128
SP1 (Update 9) | 02/14/2012 | 2.1.9.102 / 20120130
SP1 (Update 10) | 04/23/2012 | 2.1.10.122 / 20120410
SP1 (Update 11) | 06/26/2012 | 2.1.11.142 / 20120612
SP1 (Update 12) | 09/07/2012 | 2.1.12.172 / 20120731


Intel Parallel Composer Releases
 

Intel Parallel Composer | Reg Center post date | Windows version / build
Initial release *** | 05/01/2009 | composer.061 / 20090421 (removed)
Update 1 *** | 06/26/2009 | composer_update1.063 / 20090624 (removed)
Update 2 *** | 09/15/2009 | composer_update2.066 / 20090903 (removed)
Update 3 *** | 10/08/2009 | composer_update3.068 / 20090930 (removed)
Update 3 revised | 11/02/2009 | composer_update3.071 / 20091012
Update 4 | 12/14/2009 | composer_update4.072 / 20091130
Update 5 | 02/16/2010 | composer_update5.078 / 20100203
Update 6 | 04/22/2010 | Composer_update6.082 / 20100419



*** NOTE: The Intel Parallel Composer initial release and Updates 1, 2, and 3 were also pulled from the Intel Registration Center due to the same bug that affected the early Intel Compiler Professional Edition 11.1 releases; details are here: /en-us/articles/program-crashes-or-hangs-on-some-systems. PLEASE UPDATE to the current Update 3 (.071) or newer if you have any of the affected compiler versions listed above. Contact support at premier.intel.com for assistance.


 

  • Developers
  • Apple OS X*
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
  • C/C++
  • Fortran
  • Beginners
  • Intel® C++ Compiler
  • Intel® C++ Composer XE
  • Intel® Fortran Compiler
  • Intel® Fortran Composer XE
  • Intel® Parallel Composer
  • Intel® Visual Fortran Composer XE
  • Development Tools

  • Cilk worker scheduling


    Hello,

    I would like to understand better how Cilk scheduling works. 

    I am not sure how to phrase this question so I give it my best.

    I have downloaded the latest Intel Cilk runtime release (cilkplus-rtl-003365 -  released 3-May-2013).

    I use the classical Fibonacci example in Cilk.

    I wanted to know on what CPU core each worker executes.

    To the Fibonacci example, I added a function that checks the CPU affinity of every worker, as described here:

    http://linux.die.net/man/3/pthread_getaffinity_np

    The "printf" is located in "int fib(int n)" of the Fibonacci sample code.

    I get the WORKER ID using "__cilkrts_get_worker_number()".

    While the program runs, I print each WORKER's ID and the CPU core affinity of each worker.
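
    For reference, a minimal sketch of this kind of instrumentation (my reconstruction, not the poster's actual code; it uses sched_getcpu() to report the core a worker is currently on, which matches the printed format, rather than iterating the pthread_getaffinity_np mask):

    #define _GNU_SOURCE        /* for sched_getcpu() */
    #include <sched.h>
    #include <stdio.h>
    #include <cilk/cilk.h>
    #include <cilk/cilk_api.h> /* for __cilkrts_get_worker_number() */

    int fib(int n)
    {
        if (n < 2) return n;
        /* Report which worker executes this strand and on which core. */
        printf("***** WORKER ID: %d on CPU core: %d *****\n",
               __cilkrts_get_worker_number(), sched_getcpu());
        int x = cilk_spawn fib(n - 1);
        int y = fib(n - 2);
        cilk_sync;
        return x + y;
    }

    int main(void)
    {
        printf("fib(30) = %d\n", fib(30));
        return 0;
    }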

    However, the result surprises me.

    I expected that some of the workers would run on different CPU cores but it seems that all workers are running on the same exact CPU core. 

    For example, I get this for every “printf” when running “./fib 30”:

    ***** WORKER ID: 0 on CPU core: 7 *****

    ***** WORKER ID: 3 on CPU core: 7 *****

    ...

    ...(repeats as long as the binary is executing) ...

    ...

    ***** WORKER ID: 0 on CPU core: 7 *****

    I have repeated these tests using PBBS (http://www.cs.cmu.edu/~pbbs/) and get the same results.

    I did similar experiments using OpenMP (Windows and Linux) and I saw threads executing on different CPUs.

    My system is:

    Debian GNU/Linux 7.0 running kernel 3.2.49

    Available CPU schedulers are: noop, deadline, cfq - the system is running cfq

    AMD FX-8150 (8-core)

    I think I might be missing or misunderstanding something here.

    Is there a “preferred” Linux CPU scheduler for Intel Cilk runtime?

    Do you have any advice or suggestion? 

    Thank you for your time,

    Haris

    Efficient prefix scan library in Cilk Plus and accessible from C?


    Is there any efficient prefix scan library for Cilk Plus accessible from C?

    I was not able to find any and my implementation can hardly compete with the sequential version :-)

    An interface similar to the reducers will work nicely.

    Thank you.

    Internal compiler error 010101_239


    Hi guys,

    I condensed our project down to a piece of code that lets you reproduce the following issue.
    When I compile this in Release configuration (Debug works), I get this compiler error:

    1>------ Build started: Project: ng-gtest, Configuration: Release x64 ------
    1> CilkTest.cpp
    1>" : error : 010101_239
    1>
    1> compilation aborted for General\CilkTest.cpp (code 4)
    ========== Build: 0 succeeded, 1 failed, 3 up-to-date, 0 skipped ==========

    This is our compiler: Intel(R) C++ Intel(R) 64 Compiler XE for Intel(R) 64, version 14.0.3 Package ID: w_ccompxe_2013_sp1.3.202
    OS: Windows 7, x64.

    This is the code:

    #include <math.h>

    const int VecSize = 8;

    const short* acdata;
    const short* lowdata;

    const unsigned short* meas_data;
    const unsigned short* rdval;

    short trident[2 * VecSize];
    short speed[2 * VecSize];
    float spdfact[VecSize];
    float spdfact2[VecSize];
    float tdat[VecSize];
    float array1[VecSize];
    float array2[VecSize];

    const float *input_01;
    const float *input_02;
    float val0;
    float vvv9;
    float agn;

    void get_g(float ag[VecSize], const float ae[VecSize], const float* pp)
    {
        float a01[VecSize];
        a01[:] = ae[:];
        if (a01[:] >= 360.0f)
            a01[:] -= 360.0f;
        if (a01[:] >= 360.0f)
            a01[:] -= 360.0f;
        if (a01[:] < 0.0f)
            a01[:] += 360.0f;
        if (a01[:] < 0.0f)
            a01[:] += 360.0f;

        int i0[VecSize], i1[VecSize];
        i1[:] = static_cast<int>(a01[:]);
        i0[:] = i1[:] + 1;

        float g0[VecSize], g1[VecSize];
        g1[:] = pp[i1[:]];
        g0[:] = pp[i0[:]];

        ag[:] = g0[:] - (g1[:] - g0[:]) * (a01[:] - static_cast<float>(i0[:]));
    }

    void f(float prlo[VecSize], const int cntr)
    {
        float cop[VecSize];
        short cod[2 * VecSize];
        short maxm[2 * VecSize];

        cod[0:VecSize:2] = acdata[cntr:VecSize];
        cod[1:VecSize:2] = lowdata[cntr:VecSize];
        maxm[0:VecSize:2] = lowdata[cntr:VecSize];
        maxm[1:VecSize:2] = acdata[cntr:VecSize];
        cop[:] = (1.0f / float(255 * 255)) * static_cast<float>(
            cod[0:VecSize:2] * trident[0:VecSize:2] + cod[1:VecSize:2] * trident[1:VecSize:2]);
        float music[VecSize];
        music[:] = (1.0f / float(16383 * 16383)) * static_cast<float>(
            maxm[0:VecSize:2] * speed[0:VecSize:2] + maxm[1:VecSize:2] * speed[1:VecSize:2]);

        float velo2[VecSize];
        float brigh[VecSize];
        velo2[:] = static_cast<float>(rdval[cntr:VecSize]) * spdfact[:];
        brigh[:] = asinf(velo2[:]);

        float denom[VecSize];
        denom[:] = cop[:] * array1[:] + array2[:] / brigh[:];

        float accel[VecSize];
        accel[:] = atanf(music[:] / denom[:]);
        accel[:] = atan2f(accel[:], velo2[:]);

        bool haMask[VecSize];
        haMask[:] = accel[:] < 0.0f;
        if (haMask[:] & (music[:] > 0.0f))
            accel[:] += float(9.81 / 2);
        if (!haMask[:] & (music[:] < 0.0f))
            accel[:] -= float(9.81 / 2);

        float diff[VecSize];
        diff[:] = array1[:] * brigh[:] - array2[:] * cop[:] * velo2[:];
        float prod[VecSize];
        prod[:] = sinf(accel[:]) * diff[:];

        float accel2[VecSize];
        accel2[:] = atanf(prod[:] / music[:]);

        float halter[VecSize] = { 0.0f };
        if (cop[:] <= 0.0f)
            halter[:] = 2.8182963f;
        float valter[VecSize];
        valter[:] = cop[:] > 0 ? velo2[:] - tdat[:] : velo2[:] + tdat[:];

        if (music[:] == 0)
        {
            accel[:] = halter[:];
            accel2[:] = valter[:];
        }
        accel[:] *= float(9.81);
        accel2[:] *= float(9.81);

        float v8[VecSize];
        v8[:] = 25.7385f - accel2[:];
        float hxx[VecSize];
        hxx[:] = fabsf(accel[:]) * float(1.38e-23);
        float hnn[VecSize];
        hnn[:] = -accel[:];

        float vgg[VecSize], gx2[VecSize], hms[VecSize], sv[VecSize];
        get_g(vgg, accel2, input_02);
        get_g(gx2, v8, input_02);
        get_g(hms, hnn, input_01);
        sv[:] = ((1.0f - hxx[:]) * (val0 - vgg[:])) + (hxx[:] * (vvv9 - gx2[:]));

        float p0[VecSize];
        p0[:] = hms[:] - sv[:];

        prlo[:] = static_cast<float>(meas_data[cntr:VecSize]) * spdfact2[:] - p0[:] - agn;
    }

    GCC* 4.9 OpenMP code cannot be linked with Intel® OpenMP runtime


    GCC* 4.9 was released on April 22, 2014. This release now supports version 4.0 of the OpenMP* specification for the C and C++ compilers. The interface between the compilers and the GCC OpenMP runtime library (libgomp) was changed as part of this development. As a result, code compiled by GCC 4.9 using the -fopenmp compiler option cannot be successfully linked with the Intel® OpenMP runtime library (libiomp5), even if it uses no new OpenMP 4.0 features. Linking may fail with a message such as "undefined reference to `GOMP_parallel'", or (if both libiomp5.so and libgomp.so are linked in) linking may appear to succeed but the executable may crash at runtime, since having two different OpenMP runtimes linked into the same process is fatal.

    Intel is working to restore the ability to link OpenMP code compiled by GCC 4.9 with OpenMP code compiled by the Intel compilers. In the meantime, we recommend against using GCC 4.9 -fopenmp to compile code that you plan to link with anything containing Intel-compiled OpenMP code, including the Intel® Math Kernel Library (MKL) and Intel® Integrated Performance Primitives (IPP) performance libraries.
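
    A minimal reproduction of this failure mode (a sketch: the gcc-4.9 driver name and the bare -liomp5 link line are assumptions about the local setup; any program with an OpenMP parallel region behaves the same):

    /* omp_hello.c
     *
     * Compile with GCC 4.9, then link against the Intel OpenMP runtime
     * instead of libgomp:
     *
     *   gcc-4.9 -fopenmp -c omp_hello.c
     *   gcc-4.9 omp_hello.o -liomp5
     *   # => undefined reference to `GOMP_parallel'
     */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        printf("hello from thread %d\n", omp_get_thread_num());
        return 0;
    }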

  • OpenMP GCC libiomp5 libgomp
  • Entwickler
  • Linux*
  • Server
  • C/C++
  • Fortgeschrittene
  • Intel® C++ Composer XE
  • OpenMP*
  • Parallel Computing
  • Server
  • Desktop
  • URL
  • Submissions open: High Performance Parallelism Gems


    Hi everyone,

    We have all had our little discoveries and triumphs in identifying new and innovative approaches that increased the performance of our applications. Usually, they are small, though important, but occasionally we find something more, something that could also help others, an innovative gem. Perhaps it is a method of analysis, or an unconventional use of the memory hierarchy, or simply the dogged application of techniques that achieves remarkable speedups. Yet, we rarely have a means of making these innovations available outside of our immediate colleagues.

    You now have an opportunity to broadcast your successes more widely to the benefit of our community.

    And we’re not referring only to triumphs specific to pure processor performance. Perhaps your innovation solves an I/O bottleneck issue, answers a particularly important multi-body problem, or succeeds in reducing the energy footprint of a suite of applications. These are all important to the community at large.

    Of course, I and the editors are from Intel, so we are focusing on the use of Intel® Xeon® and Intel® Xeon Phi™ processors. But this focus isn’t too limiting as Intel® architectures are everywhere.

    So here is your chance to share your triumphs. Do you know a unique way of exploiting multicore caches? An innovative algorithm that allows scaling to greater than 200 cores? Or a unique application of OpenMP* in conjunction with MPI in an Intel Xeon cluster? Consider letting the broader community know by submitting a proposal to the editors.

    =============

    PLEASE PASS AROUND TO ANYONE WHO MAY BE INTERESTED

    You are invited to submit a proposal to a contribution-based book, working title, “High Performance Parallelism Gems – Successful Approaches for Multicore and Many-core Programming” that will focus on practical techniques for Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor parallel computing.

    Submissions are due by May 29, 2014 in order to be guaranteed for consideration for publication in the first (2014) volume.

    Please submit your proposal now. We'll work with you to refine it as needed.

    If you would like to contribute, please fill out the form completely and click SUBMIT.

    Visit http://lotsofcores.com/gems to send us your ideas now.

    You may email us at hppg2014@easychair.org with questions (please read http://lotsofcores.com/gems first). Please submit by May 29.

    Thank you,

    James Reinders and Jim Jeffers

    P.S. Many of you will think “Intel Xeon Phi gems,” but we actually expect “the gems” will show great ways to scale on both Intel Xeon Phi coprocessors and Intel Xeon processors, hence the working title for the book.

  • server
  • Parallel Programming
  • Jim Jeffers
  • James Reinders
  • KNC
  • KNL
  • Knights
  • Knights Landing
  • Xeon
  • Intel Xeon Phi Coprocessor
  • MIC
  • Knights Corner
  • manycore
  • Many Core
  • Cloud Computing
  • Cluster Computing
  • Development Tools
  • Education
  • Financial Services Industry
  • Game Development
  • Intel® Many Integrated Core Architecture
  • Optimization
  • Parallel Computing
  • Porting
  • Power Efficiency
  • Threading
  • Vectorization
  • Cluster Tools
  • Intel® Cluster Toolkit
  • Intel® MPI Benchmarks
  • Intel® C++ Compiler
  • Intel® C++ Composer XE
  • Intel® Cilk™ Plus
  • Intel® Composer XE
  • Intel® Fortran Compiler
  • Intel® Fortran Composer XE
  • Intel® Parallel Composer
  • Intel® Visual Fortran Composer XE
  • Intel® VTune™ Amplifier
  • Intel® Integrated Performance Primitives
  • Intel® Math Kernel Library
  • Intel® MPI Library
  • Intel® SDK for OpenCL™ Applications
  • Intel® Threading Building Blocks
  • Intel® C++ Studio XE
  • Intel® Cluster Studio
  • Intel® Cluster Studio XE
  • Intel® Fortran Studio XE
  • Intel® Parallel Studio
  • Intel® Parallel Studio XE
  • Intel® Parallel Amplifier
  • Intel® VTune™ Amplifier XE
  • Intel® VTune™ Performance Analyzer
  • Intel® Advanced Vector Extensions
  • Intel® Streaming SIMD Extensions
  • Message Passing Interface
  • OpenCL*
  • OpenMP*
  • C#
  • C/C++
  • Fortran
  • Java*
  • Cloud Services
  • Server
  • Server
  • Desktop
  • Developers
  • Professors
  • Students
  • intel cilk plus cilkscreen and tbb/scalable_allocator


    Dear friends,

    The following simple code seems to run just fine; however, cilkscreen is shouting "Race condition"!

    Shall I trust it? Or is it just false sharing?

    So, what scalable memory allocator is fast and thread-safe to use with Intel Cilk Plus?

    #include <cilk/cilk.h>
    #include "tbb/scalable_allocator.h"
    
    char * array[10000000];
    
    int main(int argc, char **argv) {
    
      cilk_for (int i = 0; i < 10000000; i++) {
        array[i] = (char *) scalable_malloc(1);
      }
    
      cilk_for (int i = 0; i < 10000000; i++) {
        scalable_free(array[i]);
      }
    
      return 0;
    }
    

    I compile it with

    icc -lcilkrts -ltbbmalloc -o example -O3 -std=c99 example.c

    but 

    $ /usr/pkg/intel/bin/cilkscreen ./example
    Cilkscreen Race Detector V2.0.0, Build 3566
    
    Race condition on location 0x7fc83fd4ae90
      write access at 0x7fc83fb0fd5c: (/tmp/tbb.MXm12595/1.0/build/fxtcarvm024icc13_0_64_gcc4_6_cpp11_release/../../src/tbbmalloc/tbbmalloc_internal.h:913, rml::internal::TLSKey::createTLS+0xec)
      read access at 0x7fc83fb0de78: (/tmp/tbb.MXm12595/1.0/build/fxtcarvm024icc13_0_64_gcc4_6_cpp11_release/../../src/tbbmalloc/tbbmalloc_internal.h:918, scalable_malloc+0x18)
        called by 0x400cb1: (/home/nikos/projects/reducer/example.c:18, __$U0+0x41)
        called by 0x400c3c: (/home/nikos/projects/reducer/example.c:17, main+0x4c)
    ...

    The system and compiler are:

    $ uname -a
    Linux leibniz4 3.5.0-44-generic #67~precise1-Ubuntu SMP Wed Nov 13 16:16:57 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
    $ icc --version
    icc (ICC) 14.0.1 20131008
    Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.

    Thank you,

    N

    Cilk_for returns wrong data in array.


    Hello everyone. I am new to multithreaded programming. Recently I got a project in which I apply cilk_for. Here is the code:

    #include <emmintrin.h>   // for __m128i and _mm_and_si128 (added; the original snippet omitted its includes)
    #include <cilk/cilk.h>   // for cilk_for

    // N_LOOP1, N_LOOP2, and mym128i are defined elsewhere in the poster's project.
    void myfunction(short *myarray)
    {
        __m128i *array = (__m128i *) myarray;   // was "m128i" and missing the ';'
        cilk_for (int i = 0; i < N_LOOP1; i++)
        {
            for (int z = 0; z < N_LOOP2; z += 8)
            {
                array[z]     = _mm_and_si128(array[z],     mym128i);
                array[z + 1] = _mm_and_si128(array[z + 1], mym128i);
                array[z + 2] = _mm_and_si128(array[z + 2], mym128i);
                array[z + 3] = _mm_and_si128(array[z + 3], mym128i);
                array[z + 4] = _mm_and_si128(array[z + 4], mym128i);
                array[z + 5] = _mm_and_si128(array[z + 5], mym128i);
                array[z + 6] = _mm_and_si128(array[z + 6], mym128i);
                array[z + 7] = _mm_and_si128(array[z + 7], mym128i);
                array += 8;   // note: every cilk_for iteration advances this same pointer
            }
        }
    }

    After the above code runs, something ridiculous happens: the data in the array isn't updated correctly. For example, if I have an array with 1000 elements, there is a chance that the array will be updated correctly (all 1000 elements are AND-ed). But there is also a chance that some parts of the array will be omitted (the first through 300th elements are AND-ed, the 301st through 505th aren't, the 506th through 707th are, etc.). These omitted parts are random in each individual run, so I think the problem here is about cache misses. Am I right? Please tell me, any help is appreciated. :)

     


    Applying Vectorization Techniques for B-Spline Surface Evaluation


    Abstract

    In this paper we analyze the relevance of vectorization for the evaluation of Non-Uniform Rational B-Spline (NURBS) surfaces, broadly used in the Computer Aided Design (CAD) industry to describe free-form surfaces. NURBS evaluation (i.e., computation of surface 3D points and derivatives for given u, v parameters) is a core component of numerous CAD algorithms and can have a significant performance impact. We achieved up to 5.8x speedup using Intel® Advanced Vector Extensions (Intel® AVX) instructions generated by the Intel® C/C++ compiler, and up to 16x speedup including minor algorithmic refactoring, which demonstrates the high potential the vectorization technique offers for NURBS evaluation.

    Introduction

    Vectorization, or Single Instruction Multiple Data (SIMD), is a parallelization technique available on modern processors that applies the same computational operation (e.g., addition or multiplication) to several data elements at once. For example, with a 128-bit register a single addition operation can add 4 pairs of integers (32 bits each) or 2 pairs of doubles (64 bits each). With the help of vectorization one can speed up computations by reducing the time required to process the same data sets. SIMD was introduced with Intel® Architecture processors back in the 1990s, with MMX™ technology as its first generation.
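
    As an illustration of the 128-bit case above, here is a four-integer addition written with SSE2 intrinsics (an illustrative sketch; the speedups in this paper come from compiler-generated Intel® AVX code rather than hand-written intrinsics):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        /* Two vectors of four 32-bit integers each, held in 128-bit registers. */
        __m128i a = _mm_set_epi32(4, 3, 2, 1);
        __m128i b = _mm_set_epi32(40, 30, 20, 10);
        __m128i sum = _mm_add_epi32(a, b);   /* one instruction, four additions */

        int out[4];
        _mm_storeu_si128((__m128i *)out, sum);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
        return 0;
    }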

    In this paper we analyze the relevance of vectorization for the evaluation of NURBS surfaces [1]. NURBS is a standard method used in the CAD industry to describe free-form surfaces, e.g., car bodies, ship hulls, aircraft wings, consumer products, and so on. Examples of 3D models (from [3]) containing NURBS surfaces are shown in Fig. 1:

    NURBS evaluation (i.e. a computation of surface 3D points and derivatives) is a core component of numerous CAD algorithms. For instance...

    (For further reading please refer to the attached pdf document)

  • vectorization
  • NURBS
  • B-Spline
  • CAD
  • Developers
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
  • Server
  • Windows*
  • C/C++
  • Experts
  • Intermediate
  • Compiler
  • Intel® C++ Compiler
  • Intel® C++ Composer XE
  • Intel® VTune™ Amplifier
  • Intel® Parallel Studio XE
  • Intel® Advanced Vector Extensions
  • Intel® Streaming SIMD Extensions
  • Notebook
  • Server
  • Desktop
  • URL
  • Compiler Topics
  • Performance Improvement
  • Libraries
  • Multithreaded Development

    Parallel Studio XE 2013 Now Available


    Today we announced Parallel Studio XE 2013 (available now) and Cluster Studio XE 2013 (shipping in Q4 2012).

    For more details, see issue 11 of Parallel Universe Magazine, which covers the "top ten new features", the Pointer Checker capability, and conditional numerical reproducibility. Visit Parallel Studio XE 2013 and Cluster Studio XE 2013 to learn more, including how to evaluate and how to buy.

    New features in these products include:

    (1) Support for new processors and coprocessors, including optimizations for the Ivy Bridge microarchitecture, the Haswell microarchitecture (first production in 2013), and the Intel Xeon Phi coprocessor (more than 50 cores, produced in 2012, codenamed Knights Corner, using the MIC architecture). Tools targeting the Intel Many Integrated Core (MIC) architecture used in the Intel® Xeon Phi™ coprocessor come automatically with the compilers, libraries, debugging, and performance analysis tools. Haswell support includes intrinsics for TSX.

    (2) Extended support for the latest standards, including MPI 2.2, C++11, and Fortran 2008. We are committed to leading support for industry standards.

    (3) Exciting new customer-driven capabilities, including conditional numerical reproducibility, the Pointer Checker feature, sleep-state analysis (for power), and memory heap growth analysis.

    These tools support easily obtaining performance through simple recompiling and relinking with the Intel libraries, and they also offer plenty of "deep" capability for programmers who want to go deep, such as tracking down the causes of TLB misses, data races, or deadlocks in parallel code. Intel offers many options for accessing performance for C, C++, and Fortran programmers.

    These software development products give C, C++, and Fortran developers a comprehensive tool suite for achieving high performance. They include compilers, libraries, parallel programming models, design assistance, debugging, and performance analysis tools. Intel Parallel Studio XE 2013 primarily targets shared-memory machines such as standard data-center systems, workstations, and PCs. Intel Cluster Studio XE 2013 includes the tools in Intel Parallel Studio XE 2013 plus special capabilities for distributed-memory machines, such as those in clusters and supercomputers programmed with MPI (Message Passing Interface). Intel offers outstanding MPI support.

    If your support is current, the update to Parallel Studio XE 2013 is free (support is included for one year with the initial purchase or with the purchase of a support package).

    Visit Parallel Studio XE 2013 for more information, including how to request a free evaluation copy to try it yourself, and read issue 11 of Parallel Universe Magazine to learn more.

  • Intel Xeon Phi Coprocessor
  • News
  • Cluster Computing
  • Debugging
  • Development Tools
  • Intel® Many Integrated Core Architecture
  • Optimization
  • Parallel Computing
  • Threading
  • Vectorization
  • Cluster Tools
  • Intel® Cluster Checker
  • Intel® MPI Benchmarks
  • Intel® Trace Analyzer and Collector
  • Intel® C++ Composer XE
  • Intel® Composer XE
  • Intel® Fortran Composer XE
  • Intel® Visual Fortran Composer XE
  • Intel® Integrated Performance Primitives
  • Intel® Math Kernel Library
  • Intel® MPI Library
  • Intel® Threading Building Blocks
  • Product Suites
  • Intel® Cluster Studio XE
  • Intel® Parallel Studio XE
  • Intel® Inspector XE
  • C/C++
  • Fortran
  • Java*
  • Business Client
  • Server
  • Server
  • Developers
  • Professors
  • Students
  • Apple OS X*
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Cilk Tools error while loading shared libraries


    I have successfully compiled the cilkplus gcc (4.8 branch) on Ubuntu 14.04 LTS and compiled the example program fib from the cilkplus website. I would like to run cilkview and cilkscreen on it, so I downloaded the cilk tools from the website as well. However, when I try to run cilkview, I get the following error:

    Cilkview: Generating scalability data
    -t: error while loading shared libraries: -t: cannot open shared object file: No such file or directory

    I've tried changing the environment variables $LIBRARY_PATH and $LD_LIBRARY_PATH to point to the libraries in the cilk tools directory, but I still come up with the same error. I also noticed that the cilk tools downloads for Linux include an extra set of libraries (libelf and libdwarf), which I have also installed on my system. I tried looking at the dependencies for cilkview, but I couldn't find anything unusual there. Here is the output:

    $ ldd cilkview
        linux-gate.so.1 =>  (0xf7735000)
        libm.so.6 => /lib32/libm.so.6 (0xf76d1000)
        libstdc++.so.6 => /usr/lib32/libstdc++.so.6 (0xf75e8000)
        libgcc_s.so.1 => /usr/lib32/libgcc_s.so.1 (0xf75cb000)
        libc.so.6 => /lib32/libc.so.6 (0xf741f000)
        libdl.so.2 => /lib32/libdl.so.2 (0xf741a000)

    I would particularly like to run cilkview, but the error persists for both cilkview and cilkscreen. It looks like the error occurs while loading libraries; however, I have no idea why it is looking for a library called -t. Any help with this would be much appreciated.

    Exception when running project in debug mode using cilk_for


    Dear all,

    I have used Cilk Plus to add parallel processing to my source code with the Visual Studio 2008 IDE.

    But when I build it in debug mode, the project throws the exception below:

    "Run-Time Check Failure #0 - The value of ESP was not properly saved across a function call. This is usually a result of calling a function pointer declared with a different calling convention"

    How can I resolve this so that debug mode works?

    Thanks of all,

    Tam Nguyen

     

    Less performance on 16 cores than on 4?!


    Hi there,

    I evaluated my Cilk application using "taskset -c 0-(x-1) MYPROGRAM" to analyze its scaling behavior.

     

    I was very surprised to see that the performance increases up to a certain number of cores but decreases afterwards.

    For 2 cores, I gain a speedup of 1.85; for 4, I gain 3.15; for 8, 4.34 - but with 12 cores the performance drops to a speedup close to that gained by 2 cores (1.99). 16 cores perform slightly better (2.11).

    How is such behaviour possible? Either an idle thread can steal work or it can't?! Or maybe the work packets are too coarse-grained and the stealing overhead destroys the performance with too many cores in use?!

    Downloaded Composer XE Linux Online Installer Bootstrap may not have Execute Permission

    Converting Cilkview data to seconds


    Hey folks,

    I'm working with a system that needs the work and span of the programs I'm running expressed in seconds or nanoseconds in order to run properly, and running Cilkview on them gives me work and span in processor instructions. Does anyone know of a way to convert that data from instructions to a unit of time?
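
    One rough conversion, assuming you can estimate instructions per cycle (IPC) and know the clock rate (both are values you must supply; Cilkview itself reports only instruction counts): cycles = instructions / IPC, and seconds = cycles / clock rate.

    #include <stdio.h>

    /* Rough sketch: convert an instruction count to seconds from an
       estimated IPC and a known clock rate in Hz. */
    static double instructions_to_seconds(double instructions, double ipc, double hz)
    {
        return instructions / (ipc * hz);
    }

    int main(void)
    {
        /* e.g. 1e9 instructions at IPC ~ 1.0 on a 3 GHz core is about 0.33 s */
        printf("%g s\n", instructions_to_seconds(1e9, 1.0, 3e9));
        return 0;
    }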

    Thanks!

    Matt


    Set Worker on Windows with Intel Core i3


    Hi all,

    I have used Cilk Plus to parallelize my code. But my PC runs Windows XP SP3 with an Intel Core i3; how many workers should I set to get the best performance from my code?
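
    For reference, the worker count can be set either with the CILK_NWORKERS environment variable or programmatically before the runtime starts; a sketch (the value "4" is only an example; a typical Core i3 has 2 cores and 4 hardware threads):

    #include <stdio.h>
    #include <cilk/cilk_api.h>

    int main(void)
    {
        /* Must run before the Cilk runtime starts, i.e. before the first
           cilk_spawn / cilk_for; equivalent to setting CILK_NWORKERS.
           __cilkrts_set_param returns 0 on success. */
        if (__cilkrts_set_param("nworkers", "4") != 0)
            printf("could not set the worker count\n");

        printf("running with %d workers\n", __cilkrts_get_nworkers());
        return 0;
    }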

    Thanks of all,

    Tam Nguyen

    sec_implicit_index


    I've been trying to understand what the __sec_implicit_index intrinsic may be intended for. It's tricky to get adequate performance from it, and apparently not possible in some of the more obvious contexts (unless the goal is only to get a positive vectorization report).

    It seems to be competitive for setting up an identity matrix.
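
    For example, an identity matrix can be set up with array notation roughly like this (my sketch of that usage):

    /* __sec_implicit_index(k) yields the implicit index along rank k of the
       array-section expression on the left-hand side. */
    #define N 8
    float a[N][N];

    void make_identity(void)
    {
        /* 1.0 where the row index equals the column index, 0.0 elsewhere */
        a[:][:] = (__sec_implicit_index(0) == __sec_implicit_index(1)) ? 1.0f : 0.0f;
    }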

    In the context of dividing its result by 2, different treatments are required on MIC and host:

    #ifdef __MIC__

          a[2:i__2-1] = b[2:i__2-1] + c[((unsigned)__sec_implicit_index(0)>>1)+1] * d__[2:i__2-1];

    #else

          a[2:i__2-1] = b[2:i__2-1] + c[__sec_implicit_index(0)/2+1] * d__[2:i__2-1];

    #endif

    That is, the unsigned right shift is several times as fast as the divide on MIC (and not much slower than plain C code), while the signed divide by 2 is up to 60% faster on host (but not as fast as C code).

    The only advantage in it seems to be the elimination of a for(), if in fact that is considered to be an advantage.

    I didn't see it documented anywhere that it is of int data type, although the opt-report shows it. I can't see how it could be anything other than positive integers, so the (unsigned) cast seems valid. I guess >>1U would have the same effect with less space taken up compared with (unsigned). The notation is already cryptic from my point of view.

    How the Serial Equivalent of a Cilk™ Plus Parallel Program Executes


    In recent years the trend in the C++ community has been to add functionality through more libraries rather than through language keywords, for example the Threading Building Blocks and Parallel Patterns libraries. Unlike this mainstream trend, Intel's Cilk™ Plus takes the latter route and adds functionality through language keywords; this article analyzes why.

    One major reason is that a language extension is translated by the compiler, and the compiler can therefore provide certain guarantees, such as serial-equivalence semantics.

    Every Cilk™ Plus program that expresses parallelism through keywords has serial semantics defined by the compiler implementation. By replacing every cilk_spawn and cilk_sync with nothing and every cilk_for with the plain for keyword, the compiler reduces a parallel Cilk™ Plus program to a valid serial C/C++ program. A race occurs when two logically parallel strands access the same memory location and at least one access is a write; if a Cilk™ Plus parallel program is free of races, it produces the same result as its serial equivalent. How does the compiler guarantee this serial equivalence? Consider the following code:

    int foo()
    {
        int x1 = func1();
        int d1 = 0;
        int x2 = cilk_spawn child1(bar1(), bar2()); // spawn a child whose arguments are two function calls
        int x3 = cilk_spawn child2(&d1);            // pass a stack variable to a child
        int x4 = func3();                           // func3 can run concurrently with child1 and child2
        cilk_sync;                                  // wait for both children so that their results can be used
        return x1 + x2 + x3 + x4 + d1;
    }

    The execution can be read as follows:

    1. foo spawns a child task for child1 so that it can execute concurrently with the rest of foo (including func3);

    2. the cilk_sync statement makes execution wait at that point until child1 and child2 return, so that the code after it can use their return values.

    In generating code for this example, the compiler performs three main transformations:

    1. bar1 and bar2, the two function calls whose results are passed to child1, are first evaluated sequentially by the parent strand, i.e., the thread running foo.

    2. Before foo's return statement, the compiler inserts a cilk_sync; it does this implicitly for every function containing a cilk_spawn. This matches the fork-join parallel model and makes program behavior easier to reason about;

    the implicit cilk_sync also guarantees that foo's stack frame stays visible to the children it spawns for the entire execution of foo. In this example, foo passes the address of d1 to child2, which guarantees that the memory location child2 writes to is still live on foo's stack.

    3. When the thread executing foo reaches the cilk_spawn of child1, the code after the cilk_spawn (the continuation) is queued so that it can either be continued by the current thread or executed by another worker; this style is called 'parent stealing'. In a library-based implementation, work stealing is instead done by 'child stealing': in this example, child stealing means that child1 itself is enqueued and later stolen and executed by spawned worker threads.
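
    The serial elision mentioned at the beginning can be pictured as the following substitutions (a sketch of the idea; Cilk™ Plus ships a stub header, cilk/cilk_stub.h, for this purpose):

    /* Serial elision: with these substitutions, a race-free Cilk™ Plus
       program becomes a plain serial C/C++ program with the same result. */
    #define cilk_spawn
    #define cilk_sync
    #define cilk_for for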

    The following is a piece of parallel recursive code. In this example's ternary tree, every node points to left, middle, and right children and carries a color attribute. A single recursive traversal of the tree builds a linked list of all red nodes. When parallelizing this recursion, note that pushing several nodes onto the global list at the same time can race. The recommended approach is therefore to access the list through the Cilk™ Plus hyperobject mechanism: internally, a hyperobject keeps per-strand views of the global variable or data structure, so no race occurs even when multiple threads access the global location at the same time:

    cilk::reducer_list_append<terntreenode *> root;

    void find_reds_par(terntreenode *p)
    {
        if (p->color == red) {
            root.push_back(p);
        }
        if (p->left) cilk_spawn find_reds_par(p->left);
        if (p->middle) cilk_spawn find_reds_par(p->middle);
        if (p->right) find_reds_par(p->right);
    }

    The parallel traversal in Cilk™ Plus is guaranteed to produce the same list, with the same internal node ordering, as the serial traversal.

     

    For more on the Cilk™ Plus specification and its implementation, visit https://software.intel.com/en-us/intel-cilk-plus.

  • Developers
  • C/C++
  • Intel® C++ Composer XE
  • Intel® Cilk™ Plus
  • Server
  • URL
  • Code Sample
  • Using Cilk™ Plus Reducers to Solve Races and Ordering Problems in Parallel Programs


    Parallelizing a program with Cilk™ Plus is much easier than creating and managing threads with the traditional Pthreads approach. In the common case, the keywords cilk_spawn, cilk_sync, and cilk_for make serial code easy to rewrite as parallel code, although in more complex situations these keywords alone do not solve a program's inherent data races or coordinate and manage the parallel threads.

    Note that Cilk™ Plus is not just keywords: the Cilk™ Plus library also contains features for dealing with races, locks, and thread coordination in parallel programs. This article covers the Cilk reducer, which helps eliminate the data races and cross-thread ordering problems found in common accumulation-style algorithms.

    1. Accumulation-style algorithms repeatedly add updates into one variable, as in the following code:

    #include <iostream>

    int main()
    {
        unsigned long accum = 0;
        for (int i = 0; i != 1000; i++) {   // cilk_for
            accum += i*i;
        }
        std::cout << accum << "\n";
    }

    A common mistake when parallelizing the code above with Cilk™ Plus is to simply substitute cilk_for for for while ignoring correctness: here accum += i*i; would have several threads updating the global accum at the same time.
    However, protecting the global variable with a lock, in the traditional multithreaded style, costs a great deal of performance. The Cilk reducer solves exactly this concurrent-update problem:

    #include <iostream>
    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>
    
    int main()
    {
        cilk::reducer_opadd<unsigned long> accum(0);
        cilk_for (int i = 0; i != 1000; i++) {
            *accum += i*i;
        }
        std::cout << accum.get_value() << "\n";
    }

    The only changes relative to the serial code are:

    • replace the accum global variable with a "reducer" (cilk::reducer_opadd<unsigned long>);
    • treat the reducer as a pointer to the real variable (*accum += i*i;);
    • when the computation finishes, read the final accumulated value out of the reducer (accum.get_value()).

    2. Using a reducer to order a parallel computation; in many cases this problem appears even more often than the data conflict above:

    #include <string>
    #include <iostream>
    #include <cilk/cilk.h>
    
    int main()
    {
        std::string alphabet;
        cilk_for(char letter = 'A'; letter <= 'Z'; ++letter) {
            alphabet += letter;
        }
        std::cout << alphabet << "\n";
    }

    The code above also races when several strands append to the same string at once, and if you merely add locking and hope appends reach the string in order, the result is still wrong; for example, it can print scrambled output such as KPFGLABCUHQRSVWXDMTYZMEIJO.

    Although locking guarantees that each individual update happens correctly and independently of the others, it cannot guarantee that all updates happen in the correct global order. The Cilk reducer implementation avoids this problem: during a parallel computation, a reducer guarantees that all inputs are combined in the same order as in the serial program. After turning the global variable into a Cilk reducer:

    #include <string>
    #include <iostream>
    #include <cilk/cilk.h>
    #include <cilk/reducer_string.h>
    
    int main()
    {
        cilk::reducer_string alphabet;
        cilk_for(char letter = 'A'; letter <= 'Z'; ++letter) {
            *alphabet += letter;
        }
        std::cout << alphabet.get_value() << "\n";
    }

    Compiling and running shows that this parallel program prints ABCDEFGHIJKLMNOPQRSTUVWXYZ, identical to the serial result.
    3. Cilk reducers are not limited to loops and simple accumulations. In the example below, filter_tree() is called with a binary tree and a key; it finds all nodes whose key matches and returns a linked list of the matching nodes' values. The values are ordered left subtree, root, right subtree; the helper filter_and_collect() inside filter_tree() recursively traverses the tree to build the list:

    #include <cilk/cilk.h>
    #include <cilk/reducer_list.h>
    #include <list>          // added: filter_tree returns std::list

    // Definition of a tree node
    //
    template <typename Key, typename Value>
    struct TreeNode {
        TreeNode* left_subtree;
        TreeNode* right_subtree;   // was "right_subtreee"
        Key key;
        Value value;
    };

    // Worker function: traverses a subtree and appends the values of all
    // nodes whose key matches to a reducer
    //
    template <typename Key, typename Value>
    void filter_and_collect(const TreeNode<Key, Value>* subtree,
                            const Key& key,
                            cilk::reducer_list_append<Value>& list)
    {
        if (!subtree) return;
        cilk_spawn filter_and_collect(subtree->left_subtree, key, list);
        if (subtree->key == key) {
            list.push_back(subtree->value);
        }
        filter_and_collect(subtree->right_subtree, key, list);
    }

    // Main function: collects the values matching the given input key
    //
    template <typename Key, typename Value>
    std::list<Value> filter_tree(const TreeNode<Key, Value>* tree,
                                 const Key& key)
    {
        cilk::reducer_list_append<Value> list;
        filter_and_collect(tree, key, list);
        return list.get_value();
    }

    4. Key points when parallelizing a program:

    • Replacing a global variable used for accumulation with a reducer parallelizes more efficiently;
    • use cilk_for or cilk_spawn to parallelize loops or recursion;
    • when the computation finishes, read the final value of the shared variable with the reducer's get_value(); the reducer implementation frees you from worrying about data races and cross-thread ordering, guaranteeing the same result as the serial program.

     

    For more on the Cilk™ Plus specification and its implementation, visit https://software.intel.com/en-us/intel-cilk-plus.

  • Developers
  • C/C++
  • Intel® C++ Composer XE
  • Intel® Cilk™ Plus
  • Server
  • URL
  • Code Sample
  • Optimized Pseudo Random Number Generators with AVX2


    Intel® Math Kernel Library includes powerful and versatile random number generators that have been optimized to take full advantage of Intel® Advanced Vector Extensions 2 (aka Intel® AVX2) introduced with the Haswell CPUs.

    In this post, I’ll explain how to use a random number generator that benefits from Intel® AVX2 and how easy it is for developers to use it in C++11 without having to learn specialized instructions but still taking full advantage of the new instructions introduced in Haswell. I’ll provide an example with Intel® Parallel Studio XE 2013 SP1 and Visual Studio 2013.

    Both Big Data and the Internet of Things (aka IoT) are increasing the total amount of data we need to process. The additional instructions are extremely powerful: they process multiple data elements with a single instruction and perform operations that used to require dozens of instructions with a single one. Thus, these instructions are very useful for optimizing code that has to run as fast as possible in projects related to Big Data and IoT. You can boost your code's performance when you take advantage of the latest additions to Intel® CPUs.

    Since 1996, I've been explaining how the instruction-set improvements introduced in successive Intel® CPUs improve code performance in different application domains, following how IT trends have evolved. So, as you might guess, I'm a big fan of using new instruction sets to boost performance. Intel® AVX2 instructions follow the same programming model introduced by their predecessor, Intel® AVX.

    The generation of pseudorandom numbers is a very common requirement in number crunching applications. The good news is that you don’t need to learn the details about the new instruction set to write code that generates random numbers taking full advantage of Intel® AVX2 instructions in C++. In fact, you don’t need to write your own optimized algorithm. You can take advantage of the MRG32k3a pseudorandom number generator included in Intel® Math Kernel Library, a component of Intel® Parallel Studio XE 2013. MRG32k3a is a combined multiple recursive pseudorandom generator with two components of order 3 that is highly optimized for Haswell CPUs and uses Intel® AVX2 instructions.

    With a few lines of code, you can take advantage of the most modern SIMD instructions introduced in Intel® CPUs. Because Intel® Parallel Studio XE and Intel® Math Kernel Library have very frequent updates, you can rest assured the algorithms are going to be improved to take advantage of future micro-architecture features and instruction sets. Thus, you can focus on using the generated pseudo random numbers in your application domain. You can think of the highly optimized pseudo random generator as your silver bullet.

    Before moving to the code, let me dive a bit deeper into Intel® Math Kernel Library (aka Intel® MKL). Intel® Vector Statistical Library (aka VSL) is a component within MKL that provides optimized routines implementing pseudo-random and quasi-random number generators with continuous and discrete distributions. Thus, the code will use the MRG32k3a pseudo random generator included in VSL. You can read more information about the MRG32k3a pseudo random generator here.

    Notice that Intel® Vector Statistical Library provides a wide range of Basic Random Number Generators (aka BRNG). You can use them to obtain random numbers of various statistical distributions and you should choose the appropriate Basic Random Number Generator based on your application requirements. In this case, I’m using the MRG32k3a pseudo random generator because it includes specific optimizations that take advantage of Intel® AVX2. However, depending on your application requirements, other Basic Random Number Generators might be more suitable.

    The following steps allow you to create a project that uses MKL and compiles with Intel® C++ Compiler in Visual Studio 2013. The great integration that Intel® Parallel Studio XE 2013 has with Visual Studio 2013 makes it really easy to start working with Intel® MKL with just a few clicks.

    1. Use the Launch Intel® Parallel Studio XE 2013 with VS 2013 shortcut to launch the IDE.

    2. Create a Windows console application.

    3. Select Project | Intel® Composer XE 2013 SP1 | Use Intel® C++ Compiler.

    4. Now, right click on the project name in Solution Explorer and select Properties.

    5. Select Configuration Properties | Intel® Performance Libraries. Click on the dropdown at the right-hand side of Use Intel® MKL, under Intel® Math Kernel Library. Select the desired working mode based on your needs. In my case, I’ve selected Parallel to use parallel Intel® MKL libraries. See the following figure.

    Selecting the desired working mode for Intel® Math Kernel Library in Visual Studio 2013.

    I will use a very useful include file, errcheck.inc, that is part of the Intel® Math Kernel Library samples. This file defines the CheckVslError function that receives the int status code returned by any Intel® MKL function call and displays a message explaining the problem with that call when something went wrong. In order to access this file, you have to decompress the examples_core.zip file located in the mkl\examples folder within the Intel® Composer XE 2013 installation folder. So, for example, if you are working with a 64-bit Windows version, the default installation folder for Intel® Composer XE 2013 will be C:\Program Files (x86)\Intel\Composer XE 2013 SP1, and the full path for examples_core.zip will be C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\examples. It is usually a good idea to copy this zip file to another folder and decompress it. Once you decompress the file, you will find the errcheck.inc file within the vslc\source folder. For example, if you decompressed examples_core.zip in C:\mkl_samples, you will find errcheck.inc in C:\mkl_samples\vslc\source. It requires a few steps, I know, but believe me, errcheck.inc is very useful when you work with Intel® MKL.

    The following lines show C++11 code that generates 1,000 pseudo random numbers by using the MRG32k3a pseudo random generator with the BOXMULLER2 method. This method generates normally distributed random numbers. You can read more information about the different methods and their related formulas here. The Intel® MKL functions are C-style calls, but as I cannot stop using C++11 features, I've made the C-style calls in a C++ Windows console application that uses some C++11 features to display all the generated pseudo random numbers.

    #include <iostream>
    #include <stdio.h>
    
    #include "mkl.h"
    #include "mkl_vsl.h"
    // Replace with your own path to errcheck.inc
    #include "C:\mkl_samples\vslc\source\errcheck.inc"
    
    #define SEED  7777777
    #define RANDOM_NUMBERS     1000
    
    using namespace std;
    
    int main()
    
    {
           // Buffer for RANDOM_NUMBERS pseudo random numbers
           float pseudorandom[RANDOM_NUMBERS];
    
           VSLStreamStatePtr stream;
    
           // Initialize the stream
           // Generate the stream and initialize it specifying the 32-bit input integer parameter seed
           auto status = vslNewStream(&stream, VSL_BRNG_MRG32K3A, SEED);
           CheckVslError(status);
    
           // Mean value
           float mean = 0.0f;
    
           // Standard deviation
           float sigma = 1.0f;
    
           // Generate normally distributed random numbers
           status = vsRngGaussian(VSL_METHOD_SGAUSSIAN_BOXMULLER2,
                  stream, RANDOM_NUMBERS, pseudorandom, mean, sigma);
    
           CheckVslError(status);
    
           // Delete the stream
           status = vslDeleteStream(&stream);
           CheckVslError(status);
    
           cout << "Pseudo random numbers:\n";
           for (auto n : pseudorandom) {
                  cout << n << '\n';
           }
    
           return 0;
    }
    

     

    If your CPU doesn't support the instructions required by the chosen generator, the status will be equal to VSL_ERROR_CPU_NOT_SUPPORTED and CheckVslError will display an appropriate message. The code is very easy to understand, and the generator takes advantage of Intel® AVX2.

    First, the code declares a buffer to hold the number of float pseudo random numbers defined in RANDOM_NUMBERS: 1,000. Then, a call to the vslNewStream function generates the stream and initializes it specifying the generator, VSL_BRNG_MRG32K3A, and the seed defined in SEED. Notice that each Intel® MKL function call is followed by a call to CheckVslError with the status returned by the Intel® MKL function call as an argument.

    Then, the call to vsRngGaussian generates normally distributed random numbers with the BOXMULLER2 method (VSL_METHOD_SGAUSSIAN_BOXMULLER2). The mean value is 0 and the standard deviation (sigma) is 1. The random numbers are stored in the previously declared pseudorandom buffer. Finally, the code deletes the stream and displays all the generated pseudo random numbers.

    As you can learn from this small example, it is extremely easy to work with Intel® Math Kernel Library in Visual Studio 2013 thanks to the great integration that Intel® Parallel Studio XE 2013 provides with this IDE. With just a few lines of code, you can start taking full advantage of the Intel® AVX2 instructions introduced in Haswell CPUs in your C++ applications.

    Intel® Math Kernel Library is a commercial product, but you can download a free 30-day evaluation version here.

     

  • Intel MKL support for Intel(R) AVX2
  • AVX2
  • optimization
  • Haswell support in intel MKL
  • intel math kernel library
  • Vectorization
  • Intel® C++ Compiler
  • Intel® C++ Composer XE
  • Intel® Composer XE
  • Intel® Math Kernel Library
  • Intel® Advanced Vector Extensions
  • Intel® Streaming SIMD Extensions
  • C/C++
  • Server
  • Windows*
  • Notebook
  • Server
  • Tablet PC
  • Desktop
  • Developers
  • Students
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
  • Server