Channel: Intel® C++ Composer XE

Mapping Intel® Compiler and Composer Update version numbers to compiler version numbers


Introduction: Mapping Intel Compiler or Composer Update numbers to specific compiler versions and packages
 

Intel® Composer XE 2013 SP1 (released September 2013)

Composer XE 2013 SP1 | Intel Registration Center activation date (yr.mo.day) | Windows version / build | Linux version / build | Mac OS version / build
Update 3 | 2014.05.05 | 14.0.3.202 / 20140422 | 14.0.3.174 / 20140422 | 14.0.3.166 / 20140421
Update 2 | 2014.02.12 | 14.0.2.176 / 20140130 | 14.0.2.144 / 20140120 | 14.0.2.139 / 20140121
Update 1 | 2013.10.18 | 14.0.1.139 / 20131008 | 14.0.1.106 / 20131008 | 14.0.1.103 / 20131010
Initial release | 2013.09.04 | 14.0.0.103 / 20130728 | 14.0.0.080 / 20130728 | 14.0.0.074 / 20130716

 

Intel® Composer XE 2013 (released September 2012)

 

Composer XE 2013 | Intel Registration Center activation date | Windows version / build | Linux version / build | Mac OS version / build
Update 5 | 2013/06/20 | 13.1.3.198 / 20130607 | 13.1.3.192 / 20130607 | 13.0.3.198 / 20130606
Update 4 | 2013/05/24 | 13.1.2.190 / 20130514 | 13.1.2.183 / 20130514 | no Update 4 for OS X
Update 3 | 2013/03/26 | 13.1.1.171 / 20130314 (removed from downloads due to a bug; upgrade to Update 4 or later) | 13.1.1.163 / 20130313 (removed from downloads due to a bug; upgrade to Update 4 or later) | 13.0.2.171 / 20130314
Update 2* | 2013/01/31 | 13.1.0.149 / 20130118 | 13.1.0.146 / 20130121 | no package for OS X* (OS X skips this update)
Update 1 | 2012/10/23 | 13.0.1.119 / 20121008 | 13.0.1.117 / 20121010 | 13.0.1.119 / 20121010
Initial release | 2012/09/05 | 13.0.0.089 / 20120731 | 13.0.0.079 / 20120731 | 13.0.0.088 / 20120731

* Update 2 is a minor update and the first release of version 13.1 (except on the Mac).

Intel Composer XE 2011 (aka 12.0) and Intel Composer XE 2011 SP1 (aka 12.1) edition mappings.
Notes: 


Intel Composer XE 2011 is also known as version 12.0. It comprises the initial release and Updates 1 through 5.

Intel Composer XE 2011 SP1 is also known as version 12.1. This version first appeared as Update 6, and subsequent updates (7, 8, etc.) are all version 12.1 compilers.


 
Composer XE | Reg Center activation date | Windows version / build | Linux version / build | Mac OS version / build
Composer XE 2011 (12.0) initial release | 11/09/2010 | 12.0.0.104 / 20101006 | 12.0.0.084 / 20101006 | 12.0.0.085 / 20101006
Composer XE 2011, Reg. Center Update 1 | 12/02/2010 | 12.0.1.128 / 20101116 | 12.0.1.108 / 20101116 | 12.0.1.122 / 20101110
Composer XE 2011, Reg. Center Update 2 | 01/29/2011 | 12.0.2.154 / 20110112 | 12.0.2.137 / 20110117 | 12.0.2.142 / 20110112
Composer XE 2011, Reg. Center Update 3 | 03/24/2011 | 12.0.3.175 / 20110309 | 12.0.3.174 / 20110309 (Japanese: 12.0.3.175) | 12.0.3.167 / 20110309
Composer XE 2011, Reg. Center Update 4 | 05/09/2011 | 12.0.4.196 / 20110427 | 12.0.4.191 / 20110427 | 12.0.4.184 / 20110503
Composer XE 2011, Reg. Center Update 5 | 07/29/2011 | 12.0.5.221 / 20110719 | 12.0.5.220 / 20110719 | 12.0.5.209 / 20110719
Composer XE 2011 SP1 (aka 12.1), Reg. Center Update 6 | 08/24/2011 | 12.1.0.233 / 20110811 | 12.1.0.233 / 20110811 | 12.1.0.038 / 20110817
Composer XE 2011 SP1 Update 1 (Reg. Center Update 7) | 10/21/2011 | 12.1.1.258 / 20111011 | 12.1.1.256 / 20111011 | 12.1.1.246 / 20111011
Composer XE 2011 SP1 Update 2 (Reg. Center Update 8) | 12/16/2011 | 12.1.2.278 / 20111128 | 12.1.2.273 / 20111128 | 12.1.2.269 / 20111207
Composer XE 2011 SP1 Update 3 (Reg. Center Update 9) | 02/09/2012 | 12.1.3.300 / 20120130 | 12.1.3.293 / 20120212 | 12.1.3.289 / 20120130
Composer XE 2011 SP1 Update 4 (Reg. Center Update 10) | 04/30/2012 | 12.1.4.325 / 20120410 | 12.1.4.319 / 20120410 | 12.1.4.328 / 20120423
Composer XE 2011 SP1 Update 5 (Reg. Center Update 11) | 06/26/2012 | 12.1.5.344 / 20120612 | 12.1.5.339 / 20120612 | 12.1.5.344 / 20120612
Composer XE 2011 SP1 Update 6 (Reg. Center Update 12) | 09/10/2012 | 12.1.6.369 / 20120821 | 12.1.6.361 / 20120821 | 12.1.6.371 / 20120821
Composer XE 2011 SP1 Update 7 (Reg. Center Update 13) | 10/08/2012 | 12.1.7.371 / 20120928 | 12.1.7.367 / 20120928 | 12.1.7.380 / 20120928





Intel Compiler Professional Edition / Intel Visual Fortran edition mappings

Compiler Pro | Reg Center post date | Windows version / build | Linux version / build | Mac OS version / build
11.1 initial release | 06/11/2009 | **11.1.035 / 20090511 (removed) | 11.1.038 / 20090511 | 11.1.046 / 20090511
11.1 Update 1 | 07/14/2009 | **11.1.038 / 20090624 (removed) | 11.1.046 / 20090630 | 11.1.058 / 20090624
11.1 Update 2 | 09/14/2009 | **11.1.046 / 20090903 (removed) | 11.1.056 / 20090827 | 11.1.067 / 20090910
**11.1 Windows Update 2 revised | 10/07/2009 | 11.1.048 / 20090930 | n/a | n/a
11.1 Update 3 | 10/21/2009 | 11.1.051 / 20091012 | 11.1.059 / 20091012 | 11.1.076 / 20091029
11.1 Update 4 | 12/15/2009 | 11.1.054 / 20091130 | 11.1.064 / 20091130 | 11.1.080 / 20091130
11.1 Update 5 | 02/18/2010 | 11.1.060 / 20100203 | 11.1.069 / 20100203 | 11.1.084 / 20100203
11.1 Update 6 | 04/22/2010 | 11.1.065 / 20100414 | 11.1.072 / 20100414 | 11.1.088 / 20100401
11.1 Update 7 | 08/20/2010 | 11.1.067 / 20100806 | 11.1.073 / 20100806 | 11.1.089 / 20100806
11.1 Update 8 | 12/09/2010 | 11.1.070 / 20101201 | 11.1.075 / 20101201 | 11.1.091 / 20101201
11.1 Update 9 | 7/13/2011 | 11.1.072 / 20110708 | 11.1.080 / 20110708 | no Mac release


** NOTE: For Compiler Professional Edition for Windows 11.1, the initial release and Updates 1 and 2 were pulled from the Intel Registration Center. 11.1.048 was posted out of cycle to replace these early versions and was referred to as "11.1 Update 2 revised". These versions were pulled because of a bug; details are here: /en-us/articles/program-crashes-or-hangs-on-some-systems. PLEASE UPDATE to 11.1.048 or newer if you have any of the affected compiler versions listed above. Contact support at premier.intel.com for assistance.
 

 


Intel Parallel Composer 2011 Releases
 

Intel Parallel Composer 2011 | Reg Center post date | Windows version / build
Initial release | 08/13/2010 | 2.0.0.063 / 20100721
Update 1 | 12/03/2010 | 2.0.1.096 / 20101203
Update 2 | 01/20/2011 | 2.0.2.114 / 20110113
Update 3 | 03/25/2011 | 2.0.3.132 / 20110309
Update 4 | 05/11/2011 | 2.0.4.147 / 20110427
Update 5 | 08/03/2011 | 2.0.5.172 / 20110722
SP1 (Update 6) | 08/11/2011 | 2.1.6.043 / 20110811
SP1 (Update 7) | 10/19/2011 | 2.1.7.062 / 20111011
SP1 (Update 8) | 10/19/2011 | 2.1.8.079 / 20111128
SP1 (Update 9) | 02/14/2012 | 2.1.9.102 / 20120130
SP1 (Update 10) | 04/23/2012 | 2.1.10.122 / 20120410
SP1 (Update 11) | 06/26/2012 | 2.1.11.142 / 20120612
SP1 (Update 12) | 09/07/2012 | 2.1.12.172 / 20120731


Intel Parallel Composer Releases
 

Intel Parallel Composer | Reg Center post date | Windows version / build
Initial release *** | 05/01/2009 | composer.061 / 20090421 (removed)
Update 1 *** | 06/26/2009 | composer_update1.063 / 20090624 (removed)
Update 2 *** | 09/15/2009 | composer_update2.066 / 20090903 (removed)
Update 3 *** | 10/08/2009 | composer_update3.068 / 20090930 (removed)
Update 3 revised | 11/02/2009 | composer_update3.071 / 20091012
Update 4 | 12/14/2009 | composer_update4.072 / 20091130
Update 5 | 02/16/2010 | composer_update5.078 / 20100203
Update 6 | 04/22/2010 | Composer_update6.082 / 20100419



*** NOTE: The Intel Parallel Composer initial release and Updates 1, 2, and 3 were also pulled from the Intel Registration Center due to the same bug that affected the early Intel Compiler Professional Edition 11.1 releases; details are here: /en-us/articles/program-crashes-or-hangs-on-some-systems. PLEASE UPDATE to the current Update 3 (.071) or newer if you have any of the affected compiler versions listed above. Contact support at premier.intel.com for assistance.


 

  • Developers
  • Apple OS X*
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
  • C/C++
  • Fortran
  • Beginners
  • Intel® C++ Compiler
  • Intel® C++ Composer XE
  • Intel® Fortran Compiler
  • Intel® Fortran Composer XE
  • Intel® Parallel Composer
  • Intel® Visual Fortran Composer XE
  • Development Tools

  • Cilk worker scheduling


    Hello,

    I would like to understand better how Cilk scheduling works. 

    I am not sure how to phrase this question so I give it my best.

    I have downloaded the latest Intel Cilk runtime release (cilkplus-rtl-003365 -  released 3-May-2013).

    I use the classical Fibonacci example in Cilk.

    I wanted to know on what CPU core each worker executes.

    To the Fibonacci example, I added a function that checks the CPU affinity of every worker, as described here:

    http://linux.die.net/man/3/pthread_getaffinity_np

    The "printf" is located in "int fib(int n)" of the Fibonacci sample code.

    I get the WORKER ID using "__cilkrts_get_worker_number()".

    While the program runs, I print each WORKER's ID and the CPU core affinity of each worker.
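
    For reference, a minimal sketch of this kind of instrumentation (my reconstruction, not the poster's actual code; it uses sched_getcpu() to report the core a worker is currently on, which matches the printed format, rather than iterating the pthread_getaffinity_np mask):

    #define _GNU_SOURCE        /* for sched_getcpu() */
    #include <sched.h>
    #include <stdio.h>
    #include <cilk/cilk.h>
    #include <cilk/cilk_api.h> /* for __cilkrts_get_worker_number() */

    int fib(int n)
    {
        if (n < 2) return n;
        /* Report which worker executes this strand and on which core. */
        printf("***** WORKER ID: %d on CPU core: %d *****\n",
               __cilkrts_get_worker_number(), sched_getcpu());
        int x = cilk_spawn fib(n - 1);
        int y = fib(n - 2);
        cilk_sync;
        return x + y;
    }

    int main(void)
    {
        printf("fib(30) = %d\n", fib(30));
        return 0;
    }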

    However, the result surprises me.

    I expected that some of the workers would run on different CPU cores but it seems that all workers are running on the same exact CPU core. 

    For example, I get this for every “printf” when running “./fib 30”:

    ***** WORKER ID: 0 on CPU core: 7 *****

    ***** WORKER ID: 3 on CPU core: 7 *****

    ...

    ...(repeats as long as the binary is executing) ...

    ...

    ***** WORKER ID: 0 on CPU core: 7 *****

    I have repeated these tests using PBBS (http://www.cs.cmu.edu/~pbbs/) and get the same results.

    I did similar experiments using OpenMP (Windows and Linux) and I saw threads executing on different CPUs.

    My system is:

    Debian GNU/Linux 7.0 running kernel 3.2.49

    Available CPU schedulers are: noop, deadline, cfq - the system is running cfq

    AMD FX-8150 (8-core)

    I think I might be missing or misunderstanding something here.

    Is there a “preferred” Linux CPU scheduler for Intel Cilk runtime?

    Do you have any advice or suggestion? 

    Thank you for your time,

    Haris

    Efficient prefix scan library in Cilk Plus and accessible from C?


    Is there any efficient prefix scan library for Cilk Plus accessible from C?

    I was not able to find any and my implementation can hardly compete with the sequential version :-)

    An interface similar to the reducers will work nicely.

    Thank you.

    Internal compiler error 010101_239


    Hi guys,

    I condensed our project down to a piece of code that lets you reproduce the following issue.
    When I compile this in Release configuration (Debug works), I get this compiler error:

    1>------ Build started: Project: ng-gtest, Configuration: Release x64 ------
    1> CilkTest.cpp
    1>" : error : 010101_239
    1>
    1> compilation aborted for General\CilkTest.cpp (code 4)
    ========== Build: 0 succeeded, 1 failed, 3 up-to-date, 0 skipped ==========

    This is our compiler: Intel(R) C++ Intel(R) 64 Compiler XE for Intel(R) 64, version 14.0.3 Package ID: w_ccompxe_2013_sp1.3.202
    OS: Windows 7, x64.

    This is the code:

    #include <math.h>

    const int VecSize = 8;

    const short* acdata;
    const short* lowdata;

    const unsigned short* meas_data;
    const unsigned short* rdval;

    short trident[2 * VecSize];
    short speed[2 * VecSize];
    float spdfact[VecSize];
    float spdfact2[VecSize];
    float tdat[VecSize];
    float array1[VecSize];
    float array2[VecSize];

    const float *input_01;
    const float *input_02;
    float val0;
    float vvv9;
    float agn;

    void get_g(float ag[VecSize], const float ae[VecSize], const float* pp)
    {
        float a01[VecSize];
        a01[:] = ae[:];
        if (a01[:] >= 360.0f)
            a01[:] -= 360.0f;
        if (a01[:] >= 360.0f)
            a01[:] -= 360.0f;
        if (a01[:] < 0.0f)
            a01[:] += 360.0f;
        if (a01[:] < 0.0f)
            a01[:] += 360.0f;

        int i0[VecSize], i1[VecSize];
        i1[:] = static_cast<int>(a01[:]);
        i0[:] = i1[:] + 1;

        float g0[VecSize], g1[VecSize];
        g1[:] = pp[i1[:]];
        g0[:] = pp[i0[:]];

        ag[:] = g0[:] - (g1[:] - g0[:]) * (a01[:] - static_cast<float>(i0[:]));
    }

    void f(float prlo[VecSize], const int cntr)
    {
        float cop[VecSize];
        short cod[2 * VecSize];
        short maxm[2 * VecSize];

        cod[0:VecSize:2] = acdata[cntr:VecSize];
        cod[1:VecSize:2] = lowdata[cntr:VecSize];
        maxm[0:VecSize:2] = lowdata[cntr:VecSize];
        maxm[1:VecSize:2] = acdata[cntr:VecSize];
        cop[:] = (1.0f / float(255 * 255)) * static_cast<float>(
            cod[0:VecSize:2] * trident[0:VecSize:2] + cod[1:VecSize:2] * trident[1:VecSize:2]);
        float music[VecSize];
        music[:] = (1.0f / float(16383 * 16383)) * static_cast<float>(
            maxm[0:VecSize:2] * speed[0:VecSize:2] + maxm[1:VecSize:2] * speed[1:VecSize:2]);

        float velo2[VecSize];
        float brigh[VecSize];
        velo2[:] = static_cast<float>(rdval[cntr:VecSize]) * spdfact[:];
        brigh[:] = asinf(velo2[:]);

        float denom[VecSize];
        denom[:] = cop[:] * array1[:] + array2[:] / brigh[:];

        float accel[VecSize];
        accel[:] = atanf(music[:] / denom[:]);
        accel[:] = atan2f(accel[:], velo2[:]);

        bool haMask[VecSize];
        haMask[:] = accel[:] < 0.0f;
        if (haMask[:] & (music[:] > 0.0f))
            accel[:] += float(9.81 / 2);
        if (!haMask[:] & (music[:] < 0.0f))
            accel[:] -= float(9.81 / 2);

        float diff[VecSize];
        diff[:] = array1[:] * brigh[:] - array2[:] * cop[:] * velo2[:];
        float prod[VecSize];
        prod[:] = sinf(accel[:]) * diff[:];

        float accel2[VecSize];
        accel2[:] = atanf(prod[:] / music[:]);

        float halter[VecSize] = { 0.0f };
        if (cop[:] <= 0.0f)
            halter[:] = 2.8182963f;
        float valter[VecSize];
        valter[:] = cop[:] > 0 ? velo2[:] - tdat[:] : velo2[:] + tdat[:];

        if (music[:] == 0)
        {
            accel[:] = halter[:];
            accel2[:] = valter[:];
        }
        accel[:] *= float(9.81);
        accel2[:] *= float(9.81);

        float v8[VecSize];
        v8[:] = 25.7385f - accel2[:];
        float hxx[VecSize];
        hxx[:] = fabsf(accel[:]) * float(1.38e-23);
        float hnn[VecSize];
        hnn[:] = -accel[:];

        float vgg[VecSize], gx2[VecSize], hms[VecSize], sv[VecSize];
        get_g(vgg, accel2, input_02);
        get_g(gx2, v8, input_02);
        get_g(hms, hnn, input_01);
        sv[:] = ((1.0f - hxx[:]) * (val0 - vgg[:])) + (hxx[:] * (vvv9 - gx2[:]));

        float p0[VecSize];
        p0[:] = hms[:] - sv[:];

        prlo[:] = static_cast<float>(meas_data[cntr:VecSize]) * spdfact2[:] - p0[:] - agn;
    }

    GCC* 4.9 OpenMP code cannot be linked with Intel® OpenMP runtime


    GCC* 4.9 was released on April 22, 2014. This release now supports version 4.0 of the OpenMP* specification for the C and C++ compilers. The interface between the compilers and the GCC OpenMP runtime library (libgomp) was changed as part of this development. As a result, code compiled by GCC 4.9 using the -fopenmp compiler option cannot be successfully linked with the Intel® OpenMP runtime library (libiomp5), even if it uses no new OpenMP 4.0 features. Linking may fail with a message such as "undefined reference to `GOMP_parallel'", or (if both libiomp5.so and libgomp.so are linked in) linking may appear to succeed but the executable may crash at runtime, since having two different OpenMP runtimes linked into the same process is fatal.

    Intel is working to restore the ability to link OpenMP code compiled by GCC 4.9 with OpenMP code compiled by the Intel compilers. In the meantime, we recommend against using GCC 4.9 -fopenmp to compile code that you plan to link with anything containing Intel-compiled OpenMP code, including the Intel® Math Kernel Library (MKL) and Intel® Integrated Performance Primitives (IPP) performance libraries.
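
    A minimal reproduction of this failure mode (a sketch: the gcc-4.9 driver name and the bare -liomp5 link line are assumptions about the local setup; any program with an OpenMP parallel region behaves the same):

    /* omp_hello.c
     *
     * Compile with GCC 4.9, then link against the Intel OpenMP runtime
     * instead of libgomp:
     *
     *   gcc-4.9 -fopenmp -c omp_hello.c
     *   gcc-4.9 omp_hello.o -liomp5
     *   # => undefined reference to `GOMP_parallel'
     */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        printf("hello from thread %d\n", omp_get_thread_num());
        return 0;
    }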

  • OpenMP GCC libiomp5 libgomp
  • Entwickler
  • Linux*
  • Server
  • C/C++
  • Fortgeschrittene
  • Intel® C++ Composer XE
  • OpenMP*
  • Parallel Computing
  • Server
  • Desktop
  • URL
  • Submissions open: High Performance Parallelism Gems


    Hi everyone,

    We have all had our little discoveries and triumphs in identifying new and innovative approaches that increased the performance of our applications. Usually, they are small, though important, but occasionally we find something more, something that could also help others, an innovative gem. Perhaps it is a method of analysis, or an unconventional use of the memory hierarchy, or simply the dogged application of techniques that achieves remarkable speedups. Yet, we rarely have a means of making these innovations available outside of our immediate colleagues.

    You now have an opportunity to broadcast your successes more widely to the benefit of our community.

    And we’re not referring only to triumphs specific to pure processor performance. Perhaps your innovation solves an I/O bottleneck issue, answers a particularly important multi-body problem, or succeeds in reducing the energy footprint of a suite of applications. These are all important to the community at large.

    Of course, I and the editors are from Intel, so we are focusing on the use of Intel® Xeon® and Intel® Xeon Phi™ processors. But this focus isn’t too limiting as Intel® architectures are everywhere.

    So here is your chance to share your triumphs. Do you know a unique way of exploiting multicore caches? An innovative algorithm that allows scaling to greater than 200 cores? Or a unique application of OpenMP* in conjunction with MPI in an Intel Xeon cluster? Consider letting the broader community know by submitting a proposal to the editors.

    =============

    PLEASE PASS AROUND TO ANYONE WHO MAY BE INTERESTED

    You are invited to submit a proposal to a contribution-based book, working title, “High Performance Parallelism Gems – Successful Approaches for Multicore and Many-core Programming” that will focus on practical techniques for Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor parallel computing.

    Submissions are due by May 29, 2014 in order to be guaranteed for consideration for publication in the first (2014) volume.

    Please submit your proposal now. We'll work with you to refine it as needed.

    If you would like to contribute, please fill out the form completely and click SUBMIT.

    Visit http://lotsofcores.com/gems to send us your ideas now.

    You may email us at hppg2014@easychair.org with questions (please read http://lotsofcores.com/gems first). Please submit by May 29.

    Thank you,

    James Reinders and Jim Jeffers

    P.S. Many of you will think “Intel Xeon Phi gems,” but we actually expect “the gems” will show great ways to scale on both Intel Xeon Phi coprocessors and Intel Xeon processors, hence the working title for the book.

  • server
  • Parallel Programming
  • Jim Jeffers
  • James Reinders
  • KNC
  • KNL
  • Knights
  • Knights Landing
  • Xeon
  • Intel Xeon Phi Coprocessor
  • MIC
  • Knights Corner
  • manycore
  • Many Core
  • Cloud Computing
  • Cluster Computing
  • Development Tools
  • Education
  • Financial Services Industry
  • Game Development
  • Intel® Many Integrated Core Architecture
  • Optimization
  • Parallel Computing
  • Porting
  • Power Efficiency
  • Threading
  • Vectorization
  • Cluster Tools
  • Intel® Cluster Toolkit
  • Intel® MPI Benchmarks
  • Intel® C++ Compiler
  • Intel® C++ Composer XE
  • Intel® Cilk™ Plus
  • Intel® Composer XE
  • Intel® Fortran Compiler
  • Intel® Fortran Composer XE
  • Intel® Parallel Composer
  • Intel® Visual Fortran Composer XE
  • Intel® VTune™ Amplifier
  • Intel® Integrated Performance Primitives
  • Intel® Math Kernel Library
  • Intel® MPI Library
  • Intel® SDK for OpenCL™ Applications
  • Intel® Threading Building Blocks
  • Intel® C++ Studio XE
  • Intel® Cluster Studio
  • Intel® Cluster Studio XE
  • Intel® Fortran Studio XE
  • Intel® Parallel Studio
  • Intel® Parallel Studio XE
  • Intel® Parallel Amplifier
  • Intel® VTune™ Amplifier XE
  • Intel® VTune™ Performance Analyzer
  • Intel® Advanced Vector Extensions
  • Intel® Streaming SIMD Extensions
  • Message Passing Interface
  • OpenCL*
  • OpenMP*
  • C#
  • C/C++
  • Fortran
  • Java*
  • Cloud Services
  • Server
  • Server
  • Desktop
  • Developers
  • Professors
  • Students
  • intel cilk plus cilkscreen and tbb/scalable_allocator


    Dear friends,

    The following simple code seems to run just fine; however, cilkscreen is shouting "Race condition"!

    Shall I trust it? Or is it just false sharing?

    So, what scalable memory allocator is fast and thread-safe to use with Intel Cilk Plus?

    #include <cilk/cilk.h>
    #include "tbb/scalable_allocator.h"
    
    char * array[10000000];
    
    int main(int argc, char **argv) {
    
      cilk_for (int i = 0; i < 10000000; i++) {
        array[i] = (char *) scalable_malloc(1);
      }
    
      cilk_for (int i = 0; i < 10000000; i++) {
        scalable_free(array[i]);
      }
    
      return 0;
    }
    

    I compile it with

    icc -lcilkrts -ltbbmalloc -o example -O3 -std=c99 example.c

    but 

    $ /usr/pkg/intel/bin/cilkscreen ./example
    Cilkscreen Race Detector V2.0.0, Build 3566
    
    Race condition on location 0x7fc83fd4ae90
      write access at 0x7fc83fb0fd5c: (/tmp/tbb.MXm12595/1.0/build/fxtcarvm024icc13_0_64_gcc4_6_cpp11_release/../../src/tbbmalloc/tbbmalloc_internal.h:913, rml::internal::TLSKey::createTLS+0xec)
      read access at 0x7fc83fb0de78: (/tmp/tbb.MXm12595/1.0/build/fxtcarvm024icc13_0_64_gcc4_6_cpp11_release/../../src/tbbmalloc/tbbmalloc_internal.h:918, scalable_malloc+0x18)
        called by 0x400cb1: (/home/nikos/projects/reducer/example.c:18, __$U0+0x41)
        called by 0x400c3c: (/home/nikos/projects/reducer/example.c:17, main+0x4c)
    ...

    The system and compiler are:

    $ uname -a
    Linux leibniz4 3.5.0-44-generic #67~precise1-Ubuntu SMP Wed Nov 13 16:16:57 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
    $ icc --version
    icc (ICC) 14.0.1 20131008
    Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.

    Thank you,

    N

    Cilk_for returns wrong data in array.


    Hello everyone. I am new to multithreaded programming. Recently I got a project in which I apply cilk_for. Here is the code:

    #include <emmintrin.h>   // for __m128i and _mm_and_si128 (added; the original snippet omitted its includes)
    #include <cilk/cilk.h>   // for cilk_for

    // N_LOOP1, N_LOOP2, and mym128i are defined elsewhere in the poster's project.
    void myfunction(short *myarray)
    {
        __m128i *array = (__m128i *) myarray;   // was "m128i" and missing the ';'
        cilk_for (int i = 0; i < N_LOOP1; i++)
        {
            for (int z = 0; z < N_LOOP2; z += 8)
            {
                array[z]     = _mm_and_si128(array[z],     mym128i);
                array[z + 1] = _mm_and_si128(array[z + 1], mym128i);
                array[z + 2] = _mm_and_si128(array[z + 2], mym128i);
                array[z + 3] = _mm_and_si128(array[z + 3], mym128i);
                array[z + 4] = _mm_and_si128(array[z + 4], mym128i);
                array[z + 5] = _mm_and_si128(array[z + 5], mym128i);
                array[z + 6] = _mm_and_si128(array[z + 6], mym128i);
                array[z + 7] = _mm_and_si128(array[z + 7], mym128i);
                array += 8;   // note: every cilk_for iteration advances this same pointer
            }
        }
    }

    After the above code runs, something ridiculous happens: the data in the array isn't updated correctly. For example, if I have an array with 1000 elements, there is a chance that the array will be updated correctly (all 1000 elements are AND-ed). But there is also a chance that some parts of the array will be omitted (the first through 300th elements are AND-ed, the 301st through 505th aren't, the 506th through 707th are, etc.). These omitted parts are random in each individual run, so I think the problem here is about cache misses. Am I right? Please tell me, any help is appreciated. :)

     


    Applying Vectorization Techniques for B-Spline Surface Evaluation


    Abstract

    In this paper we analyze the relevance of vectorization for the evaluation of Non-Uniform Rational B-Spline (NURBS) surfaces, broadly used in the Computer Aided Design (CAD) industry to describe free-form surfaces. NURBS evaluation (i.e., computation of surface 3D points and derivatives for given u, v parameters) is a core component of numerous CAD algorithms and can have a significant performance impact. We achieved up to 5.8x speedup using Intel® Advanced Vector Extensions (Intel® AVX) instructions generated by the Intel® C/C++ compiler, and up to 16x speedup including minor algorithmic refactoring, which demonstrates the high potential the vectorization technique offers for NURBS evaluation.

    Introduction

    Vectorization, or Single Instruction Multiple Data (SIMD), is a parallelization technique available on modern processors that applies the same computational operation (e.g., addition or multiplication) to several data elements at once. For example, with a 128-bit register a single addition operation can add 4 pairs of integers (32 bits each) or 2 pairs of doubles (64 bits each). With the help of vectorization one can speed up computations by reducing the time required to process the same data sets. SIMD was introduced with Intel® Architecture processors back in the 1990s, with MMX™ technology as its first generation.
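
    As an illustration of the 128-bit case above, here is a four-integer addition written with SSE2 intrinsics (an illustrative sketch; the speedups in this paper come from compiler-generated Intel® AVX code rather than hand-written intrinsics):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        /* Two vectors of four 32-bit integers each, held in 128-bit registers. */
        __m128i a = _mm_set_epi32(4, 3, 2, 1);
        __m128i b = _mm_set_epi32(40, 30, 20, 10);
        __m128i sum = _mm_add_epi32(a, b);   /* one instruction, four additions */

        int out[4];
        _mm_storeu_si128((__m128i *)out, sum);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
        return 0;
    }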

    In this paper we analyze the relevance of vectorization for the evaluation of NURBS surfaces [1]. NURBS is a standard method used in the CAD industry to describe free-form surfaces, e.g., car bodies, ship hulls, aircraft wings, consumer products, and so on. Examples of 3D models (from [3]) containing NURBS surfaces are shown in Fig. 1:

    NURBS evaluation (i.e. a computation of surface 3D points and derivatives) is a core component of numerous CAD algorithms. For instance...

    (For further reading please refer to the attached pdf document)

  • vectorization
  • NURBS
  • B-Spline
  • CAD
  • Developers
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
  • Server
  • Windows*
  • C/C++
  • Experts
  • Intermediate
  • Compiler
  • Intel® C++ Compiler
  • Intel® C++ Composer XE
  • Intel® VTune™ Amplifier
  • Intel® Parallel Studio XE
  • Intel® Advanced Vector Extensions
  • Intel® Streaming SIMD Extensions
  • Notebook
  • Server
  • Desktop
  • URL
  • Compiler Topics
  • Performance Improvement
  • Libraries
  • Multithreaded Development

    Parallel Studio XE 2013 Now Available


    Today we announced Parallel Studio XE 2013 (available now) and Cluster Studio XE 2013 (shipping in Q4 2012).

    For more details, see issue 11 of Parallel Universe Magazine, which covers the "top ten new features", the Pointer Checker capability, and conditional numerical reproducibility. Visit Parallel Studio XE 2013 and Cluster Studio XE 2013 to learn more, including how to evaluate and how to buy.

    New features in these products include:

    (1) Support for new processors and coprocessors, including optimizations for the Ivy Bridge microarchitecture, the Haswell microarchitecture (first production in 2013), and the Intel Xeon Phi coprocessor (more than 50 cores, produced in 2012, codenamed Knights Corner, using the MIC architecture). Tools targeting the Intel Many Integrated Core (MIC) architecture used in the Intel® Xeon Phi™ coprocessor come automatically with the compilers, libraries, debugging, and performance analysis tools. Haswell support includes intrinsics for TSX.

    (2) Extended support for the latest standards, including MPI 2.2, C++11, and Fortran 2008. We are committed to leading support for industry standards.

    (3) Exciting new customer-driven capabilities, including conditional numerical reproducibility, the Pointer Checker feature, sleep-state analysis (for power), and memory heap growth analysis.

    These tools support easily obtaining performance through simple recompiling and relinking with the Intel libraries, and they also offer plenty of "deep" capability for programmers who want to go deep, such as tracking down the causes of TLB misses, data races, or deadlocks in parallel code. Intel offers many options for accessing performance for C, C++, and Fortran programmers.

    These software development products give C, C++, and Fortran developers a comprehensive tool suite for achieving high performance. They include compilers, libraries, parallel programming models, design assistance, debugging, and performance analysis tools. Intel Parallel Studio XE 2013 primarily targets shared-memory machines such as standard data-center systems, workstations, and PCs. Intel Cluster Studio XE 2013 includes the tools in Intel Parallel Studio XE 2013 plus special capabilities for distributed-memory machines, such as those in clusters and supercomputers programmed with MPI (Message Passing Interface). Intel offers outstanding MPI support.

    If your support is current, the update to Parallel Studio XE 2013 is free (support is included for one year with the initial purchase or with the purchase of a support package).

    Visit Parallel Studio XE 2013 for more information, including how to request a free evaluation copy to try it yourself, and read issue 11 of Parallel Universe Magazine to learn more.

  • Intel Xeon Phi Coprocessor
  • News
  • Cluster Computing
  • Debugging
  • Development Tools
  • Intel® Many Integrated Core Architecture
  • Optimization
  • Parallel Computing
  • Threading
  • Vectorization
  • Cluster Tools
  • Intel® Cluster Checker
  • Intel® MPI Benchmarks
  • Intel® Trace Analyzer and Collector
  • Intel® C++ Composer XE
  • Intel® Composer XE
  • Intel® Fortran Composer XE
  • Intel® Visual Fortran Composer XE
  • Intel® Integrated Performance Primitives
  • Intel® Math Kernel Library
  • Intel® MPI Library
  • Intel® Threading Building Blocks
  • Product Suites
  • Intel® Cluster Studio XE
  • Intel® Parallel Studio XE
  • Intel® Inspector XE
  • C/C++
  • Fortran
  • Java*
  • Business Client
  • Server
  • Server
  • Developers
  • Professors
  • Students
  • Apple OS X*
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Cilk Tools error while loading shared libraries


    I have successfully compiled the cilkplus gcc (4.8 branch) on Ubuntu 14.04 LTS and compiled the example program fib from the cilkplus website. I would like to run cilkview and cilkscreen on it, so I downloaded the cilk tools from the website as well. However, when I try to run cilkview, I get the following error:

    Cilkview: Generating scalability data
    -t: error while loading shared libraries: -t: cannot open shared object file: No such file or directory

    I've tried changing the environment variables $LIBRARY_PATH and $LD_LIBRARY_PATH to point to the libraries in the cilk tools directory, but I still come up with the same error. I also noticed that the cilk tools downloads for Linux include an extra set of libraries (libelf and libdwarf), which I have also installed on my system. I tried looking at the dependencies for cilkview, but I couldn't find anything unusual there. Here is the output:

    $ ldd cilkview
        linux-gate.so.1 =>  (0xf7735000)
        libm.so.6 => /lib32/libm.so.6 (0xf76d1000)
        libstdc++.so.6 => /usr/lib32/libstdc++.so.6 (0xf75e8000)
        libgcc_s.so.1 => /usr/lib32/libgcc_s.so.1 (0xf75cb000)
        libc.so.6 => /lib32/libc.so.6 (0xf741f000)
        libdl.so.2 => /lib32/libdl.so.2 (0xf741a000)

    I would particularly like to run cilkview, but the error persists for both cilkview and cilkscreen. It looks like the error occurs while loading libraries; however, I have no idea why it is looking for a library called -t. Any help with this would be much appreciated.

    Exception when running project in debug mode using cilk_for


    Dear all,

    I have used Cilk Plus to add parallel processing to my source code with the Visual Studio 2008 IDE.

    But when I build it in debug mode, the project throws the exception below:

    "Run-Time Check Failure #0 - The value of ESP was not properly saved across a function call. This is usually a result of calling a function pointer declared with a different calling convention"

    How can I resolve this so that debug mode works?

    Thanks of all,

    Tam Nguyen

     

    Less performance on 16 cores than on 4?!


    Hi there,

    I evaluated my Cilk application using "taskset -c 0-(x-1) MYPROGRAM" to analyze its scaling behavior.

     

    I was very surprised to see that the performance increases up to a certain number of cores but decreases afterwards.

    For 2 cores, I gain a speedup of 1.85; for 4, I gain 3.15; for 8, 4.34 - but with 12 cores the performance drops to a speedup close to that gained by 2 cores (1.99). 16 cores perform slightly better (2.11).

    How is such behaviour possible? Either an idle thread can steal work or it can't?! Or maybe the work packets are too coarse-grained and the stealing overhead destroys the performance with too many cores in use?!

    Downloaded Composer XE Linux Online Installer Bootstrap may not have Execute Permission

    Converting Cilkview data to seconds


    Hey folks,

    I'm working with a system that needs the work and span of the programs I'm running expressed in seconds or nanoseconds in order to run properly, and running Cilkview on them gives me work and span in processor instructions. Does anyone know of a way to convert that data from instructions to a unit of time?
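
    One rough conversion, assuming you can estimate instructions per cycle (IPC) and know the clock rate (both are values you must supply; Cilkview itself reports only instruction counts): cycles = instructions / IPC, and seconds = cycles / clock rate.

    #include <stdio.h>

    /* Rough sketch: convert an instruction count to seconds from an
       estimated IPC and a known clock rate in Hz. */
    static double instructions_to_seconds(double instructions, double ipc, double hz)
    {
        return instructions / (ipc * hz);
    }

    int main(void)
    {
        /* e.g. 1e9 instructions at IPC ~ 1.0 on a 3 GHz core is about 0.33 s */
        printf("%g s\n", instructions_to_seconds(1e9, 1.0, 3e9));
        return 0;
    }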

    Thanks!

    Matt


    Set Worker on Windows with Intel Core i3


    Hi all,

    I have used Cilk Plus to parallelize my code. But my PC runs Windows XP SP3 with an Intel Core i3; how many workers should I set to get the best performance from my code?
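
    For reference, the worker count can be set either with the CILK_NWORKERS environment variable or programmatically before the runtime starts; a sketch (the value "4" is only an example; a typical Core i3 has 2 cores and 4 hardware threads):

    #include <stdio.h>
    #include <cilk/cilk_api.h>

    int main(void)
    {
        /* Must run before the Cilk runtime starts, i.e. before the first
           cilk_spawn / cilk_for; equivalent to setting CILK_NWORKERS.
           __cilkrts_set_param returns 0 on success. */
        if (__cilkrts_set_param("nworkers", "4") != 0)
            printf("could not set the worker count\n");

        printf("running with %d workers\n", __cilkrts_get_nworkers());
        return 0;
    }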

    Thanks of all,

    Tam Nguyen

    sec_implicit_index


    I've been trying to understand what the __sec_implicit_index intrinsic may be intended for. It's tricky to get adequate performance from it, and apparently not possible in some of the more obvious contexts (unless the goal is only to get a positive vectorization report).

    It seems to be competitive for setting up an identity matrix.
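
    For example, an identity matrix can be set up with array notation roughly like this (my sketch of that usage):

    /* __sec_implicit_index(k) yields the implicit index along rank k of the
       array-section expression on the left-hand side. */
    #define N 8
    float a[N][N];

    void make_identity(void)
    {
        /* 1.0 where the row index equals the column index, 0.0 elsewhere */
        a[:][:] = (__sec_implicit_index(0) == __sec_implicit_index(1)) ? 1.0f : 0.0f;
    }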

    In the context of dividing its result by 2, different treatments are required on MIC and host:

    #ifdef __MIC__

          a[2:i__2-1] = b[2:i__2-1] + c[((unsigned)__sec_implicit_index(0)>>1)+1] * d__[2:i__2-1];

    #else

          a[2:i__2-1] = b[2:i__2-1] + c[__sec_implicit_index(0)/2+1] * d__[2:i__2-1];

    #endif

    That is, the unsigned right shift is several times as fast as the divide on MIC (and not much slower than plain C code), while the signed divide by 2 is up to 60% faster on host (but not as fast as C code).

    The only advantage in it seems to be the elimination of a for(), if in fact that is considered to be an advantage.

    I didn't see it documented anywhere that it is of int data type, although the opt-report shows it. I can't see how it could be anything other than positive integers, so the (unsigned) cast seems valid. I guess >>1U would have the same effect with less space taken up compared with (unsigned). The notation is already cryptic from my point of view.

    How the Serial Equivalent of a Cilk™ Plus Parallel Program Executes


    In recent years the trend in the C++ community has been to add functionality through more libraries rather than through language keywords, for example the Threading Building Blocks and Parallel Patterns libraries. Unlike this mainstream trend, Intel's Cilk™ Plus takes the latter route and adds functionality through language keywords; this article analyzes why.

    One major reason is that a language extension is translated by the compiler, and the compiler can therefore provide certain guarantees, such as serial-equivalence semantics.

    Every Cilk™ Plus program that expresses parallelism through keywords has serial semantics defined by the compiler implementation. By replacing every cilk_spawn and cilk_sync with nothing and every cilk_for with the plain for keyword, the compiler reduces a parallel Cilk™ Plus program to a valid serial C/C++ program. A race occurs when two logically parallel strands access the same memory location and at least one access is a write; if a Cilk™ Plus parallel program is free of races, it produces the same result as its serial equivalent. How does the compiler guarantee this serial equivalence? Consider the following code:

    int foo()
    {
        int x1 = func1();
        int d1 = 0;
        int x2 = cilk_spawn child1(bar1(), bar2()); // spawn a child whose arguments are two function calls
        int x3 = cilk_spawn child2(&d1);            // pass a stack variable to a child
        int x4 = func3();                           // func3 can run concurrently with child1 and child2
        cilk_sync;                                  // wait for both children so that their results can be used
        return x1 + x2 + x3 + x4 + d1;
    }

    The execution can be read as follows:

    1. foo spawns a child task for child1 so that it can execute concurrently with the rest of foo (including func3);

    2. the cilk_sync statement makes execution wait at that point until child1 and child2 return, so that the code after it can use their return values.

    In generating code for this example, the compiler performs three main transformations:

    1. bar1 and bar2, the two function calls whose results are passed to child1, are first evaluated sequentially by the parent strand, i.e., the thread running foo.

    2. Before foo's return statement, the compiler inserts a cilk_sync; it does this implicitly for every function containing a cilk_spawn. This matches the fork-join parallel model and makes program behavior easier to reason about;

    the implicit cilk_sync also guarantees that foo's stack frame stays visible to the children it spawns for the entire execution of foo. In this example, foo passes the address of d1 to child2, which guarantees that the memory location child2 writes to is still live on foo's stack.

    3. When the thread executing foo reaches the cilk_spawn of child1, the code after the cilk_spawn (the continuation) is queued so that it can either be continued by the current thread or executed by another worker; this style is called 'parent stealing'. In a library-based implementation, work stealing is instead done by 'child stealing': in this example, child stealing means that child1 itself is enqueued and later stolen and executed by spawned worker threads.
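
    The serial elision mentioned at the beginning can be pictured as the following substitutions (a sketch of the idea; Cilk™ Plus ships a stub header, cilk/cilk_stub.h, for this purpose):

    /* Serial elision: with these substitutions, a race-free Cilk™ Plus
       program becomes a plain serial C/C++ program with the same result. */
    #define cilk_spawn
    #define cilk_sync
    #define cilk_for for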

    The following is a piece of parallel recursive code. In this example's ternary tree, every node points to left, middle, and right children and carries a color attribute. A single recursive traversal of the tree builds a linked list of all red nodes. When parallelizing this recursion, note that pushing several nodes onto the global list at the same time can race. The recommended approach is therefore to access the list through the Cilk™ Plus hyperobject mechanism: internally, a hyperobject keeps per-strand views of the global variable or data structure, so no race occurs even when multiple threads access the global location at the same time:

    cilk::reducer_list_append<terntreenode *> root;

    void find_reds_par(terntreenode *p)
    {
        if (p->color == red) {
            root.push_back(p);
        }
        if (p->left) cilk_spawn find_reds_par(p->left);
        if (p->middle) cilk_spawn find_reds_par(p->middle);
        if (p->right) find_reds_par(p->right);
    }

    The parallel traversal in Cilk™ Plus is guaranteed to produce the same list, with the same internal node ordering, as the serial traversal.

     

    For more on the Cilk™ Plus specification and its implementation, visit https://software.intel.com/en-us/intel-cilk-plus.

  • Developers
  • C/C++
  • Intel® C++ Composer XE
  • Intel® Cilk™ Plus
  • Server
  • URL
  • Code Sample
  • Using Cilk™ Plus Reducers to Solve Races and Ordering Problems in Parallel Programs


    Parallelizing a program with Cilk™ Plus is much easier than creating and managing threads with the traditional Pthreads approach. In the common case, the keywords cilk_spawn, cilk_sync, and cilk_for make serial code easy to rewrite as parallel code, although in more complex situations these keywords alone do not solve a program's inherent data races or coordinate and manage the parallel threads.

    Note that Cilk™ Plus is not just keywords: the Cilk™ Plus library also contains features for dealing with races, locks, and thread coordination in parallel programs. This article covers the Cilk reducer, which helps eliminate the data races and cross-thread ordering problems found in common accumulation-style algorithms.

    1. Accumulation-style algorithms repeatedly add updates into one variable, as in the following code:

    #include <iostream>

    int main()
    {
        unsigned long accum = 0;
        for (int i = 0; i != 1000; i++) {   // cilk_for
            accum += i*i;
        }
        std::cout << accum << "\n";
    }

    A common mistake when parallelizing the code above with Cilk™ Plus is to simply substitute cilk_for for for while ignoring correctness: here accum += i*i; would have several threads updating the global accum at the same time.
    However, protecting the global variable with a lock, in the traditional multithreaded style, costs a great deal of performance. The Cilk reducer solves exactly this concurrent-update problem:

    #include <iostream>
    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>
    
    int main()
    {
        cilk::reducer_opadd<unsigned long> accum(0);
        cilk_for (int i = 0; i != 1000; i++) {
            *accum += i*i;
        }
        std::cout << accum.get_value() << "\n";
    }

    The only changes relative to the serial code are:

    • replace the accum global variable with a "reducer" (cilk::reducer_opadd<unsigned long>);
    • treat the reducer as a pointer to the real variable (*accum += i*i;);
    • when the computation finishes, read the final accumulated value out of the reducer (accum.get_value()).

    2. Using a reducer to order a parallel computation; in many cases this problem appears even more often than the data conflict above:

    #include <string>
    #include <iostream>
    #include <cilk/cilk.h>
    
    int main()
    {
        std::string alphabet;
        cilk_for(char letter = 'A'; letter <= 'Z'; ++letter) {
            alphabet += letter;
        }
        std::cout << alphabet << "\n";
    }

    The code above also races when several strands append to the same string at once, and if you merely add locking and hope appends reach the string in order, the result is still wrong; for example, it can print scrambled output such as KPFGLABCUHQRSVWXDMTYZMEIJO.

    Although locking guarantees that each individual update happens correctly and independently of the others, it cannot guarantee that all updates happen in the correct global order. The Cilk reducer implementation avoids this problem: during a parallel computation, a reducer guarantees that all inputs are combined in the same order as in the serial program. After turning the global variable into a Cilk reducer:

    #include <string>
    #include <iostream>
    #include <cilk/cilk.h>
    #include <cilk/reducer_string.h>
    
    int main()
    {
        cilk::reducer_string alphabet;
        cilk_for(char letter = 'A'; letter <= 'Z'; ++letter) {
            *alphabet += letter;
        }
        std::cout << alphabet.get_value() << "\n";
    }

    Compiling and running shows that this parallel program prints ABCDEFGHIJKLMNOPQRSTUVWXYZ, identical to the serial result.
    3. Cilk reducers are not limited to loops and simple accumulations. In the example below, filter_tree() is called with a binary tree and a key; it finds all nodes whose key matches and returns a linked list of the matching nodes' values. The values are ordered left subtree, root, right subtree; the helper filter_and_collect() inside filter_tree() recursively traverses the tree to build the list:

    #include <cilk/cilk.h>
    #include <cilk/reducer_list.h>
    #include <list>          // added: filter_tree returns std::list

    // Definition of a tree node
    //
    template <typename Key, typename Value>
    struct TreeNode {
        TreeNode* left_subtree;
        TreeNode* right_subtree;   // was "right_subtreee"
        Key key;
        Value value;
    };

    // Worker function: traverses a subtree and appends the values of all
    // nodes whose key matches to a reducer
    //
    template <typename Key, typename Value>
    void filter_and_collect(const TreeNode<Key, Value>* subtree,
                            const Key& key,
                            cilk::reducer_list_append<Value>& list)
    {
        if (!subtree) return;
        cilk_spawn filter_and_collect(subtree->left_subtree, key, list);
        if (subtree->key == key) {
            list.push_back(subtree->value);
        }
        filter_and_collect(subtree->right_subtree, key, list);
    }

    // Main function: collects the values matching the given input key
    //
    template <typename Key, typename Value>
    std::list<Value> filter_tree(const TreeNode<Key, Value>* tree,
                                 const Key& key)
    {
        cilk::reducer_list_append<Value> list;
        filter_and_collect(tree, key, list);
        return list.get_value();
    }

    4. Key points when parallelizing a program:

    • Replacing a global variable used for accumulation with a reducer parallelizes more efficiently;
    • use cilk_for or cilk_spawn to parallelize loops or recursion;
    • when the computation finishes, read the final value of the shared variable with the reducer's get_value(); the reducer implementation frees you from worrying about data races and cross-thread ordering, guaranteeing the same result as the serial program.

     

    For more on the Cilk™ Plus specification and its implementation, visit https://software.intel.com/en-us/intel-cilk-plus.

  • Developers
  • C/C++
  • Intel® C++ Composer XE
  • Intel® Cilk™ Plus
  • Server
  • URL
  • Code Sample
  • Optimized Pseudo Random Number Generators with AVX2


    Intel® Math Kernel Library includes powerful and versatile random number generators that have been optimized to take full advantage of Intel® Advanced Vector Extensions 2 (aka Intel® AVX2) introduced with the Haswell CPUs.

    In this post, I’ll explain how to use a random number generator that benefits from Intel® AVX2 and how easy it is for developers to use it in C++11 without having to learn specialized instructions but still taking full advantage of the new instructions introduced in Haswell. I’ll provide an example with Intel® Parallel Studio XE 2013 SP1 and Visual Studio 2013.

    Both Big Data and the Internet of Things (aka IoT) are increasing the total amount of data we need to process. The additional instructions are extremely powerful: they process multiple data elements with a single instruction and perform operations that used to require dozens of instructions with a single one. Thus, these instructions are very useful for optimizing code that has to run as fast as possible in projects related to Big Data and IoT. You can boost your code's performance when you take advantage of the latest additions to Intel® CPUs.

    Since 1996, I've been explaining how the instruction-set improvements introduced in successive Intel® CPUs improve code performance in different application domains, following how IT trends have evolved. So, as you might guess, I'm a big fan of using new instruction sets to boost performance. Intel® AVX2 instructions follow the same programming model introduced by their predecessor, Intel® AVX.

    The generation of pseudorandom numbers is a very common requirement in number crunching applications. The good news is that you don’t need to learn the details about the new instruction set to write code that generates random numbers taking full advantage of Intel® AVX2 instructions in C++. In fact, you don’t need to write your own optimized algorithm. You can take advantage of the MRG32k3a pseudorandom number generator included in Intel® Math Kernel Library, a component of Intel® Parallel Studio XE 2013. MRG32k3a is a combined multiple recursive pseudorandom generator with two components of order 3 that is highly optimized for Haswell CPUs and uses Intel® AVX2 instructions.

    With a few lines of code, you can take advantage of the most modern SIMD instructions introduced in Intel® CPUs. Because Intel® Parallel Studio XE and Intel® Math Kernel Library have very frequent updates, you can rest assured the algorithms are going to be improved to take advantage of future micro-architecture features and instruction sets. Thus, you can focus on using the generated pseudo random numbers in your application domain. You can think of the highly optimized pseudo random generator as your silver bullet.

    Before moving to the code, let me dive a bit deeper into Intel® Math Kernel Library (aka Intel® MKL). Intel® Vector Statistical Library (aka VSL) is a component within MKL that provides optimized routines implementing pseudo-random and quasi-random number generators with continuous and discrete distributions. Thus, the code will use the MRG32k3a pseudo random generator included in VSL. You can read more information about the MRG32k3a pseudo random generator here.

    Notice that Intel® Vector Statistical Library provides a wide range of Basic Random Number Generators (aka BRNG). You can use them to obtain random numbers of various statistical distributions and you should choose the appropriate Basic Random Number Generator based on your application requirements. In this case, I’m using the MRG32k3a pseudo random generator because it includes specific optimizations that take advantage of Intel® AVX2. However, depending on your application requirements, other Basic Random Number Generators might be more suitable.

    The following steps allow you to create a project that uses MKL and compiles with Intel® C++ Compiler in Visual Studio 2013. The great integration that Intel® Parallel Studio XE 2013 has with Visual Studio 2013 makes it really easy to start working with Intel® MKL with just a few clicks.

    1. Use the Launch Intel® Parallel Studio XE 2013 with VS 2013 shortcut to launch the IDE.

    2. Create a Windows console application.

    3. Select Project | Intel® Composer XE 2013 SP1 | Use Intel® C++ Compiler.

    4. Now, right click on the project name in Solution Explorer and select Properties.

    5. Select Configuration Properties | Intel® Performance Libraries. Click on the dropdown at the right-hand side of Use Intel® MKL, under Intel® Math Kernel Library. Select the desired working mode based on your needs. In my case, I’ve selected Parallel to use parallel Intel® MKL libraries. See the following figure.

    Selecting the desired working mode for Intel® Math Kernel Library in Visual Studio 2013.

    I will use a very useful include file, errcheck.inc, that is part of the Intel® Math Kernel Library samples. This file defines the CheckVslError function that receives the int status code returned by any Intel® MKL function call and displays a message explaining the problem with that call when something went wrong. In order to access this file, you have to decompress the examples_core.zip file located in the mkl\examples folder within the Intel® Composer XE 2013 installation folder. So, for example, if you are working with a 64-bit Windows version, the default installation folder for Intel® Composer XE 2013 will be C:\Program Files (x86)\Intel\Composer XE 2013 SP1, and the full path for examples_core.zip will be C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\examples. It is usually a good idea to copy this zip file to another folder and decompress it. Once you decompress the file, you will find the errcheck.inc file within the vslc\source folder. For example, if you decompressed examples_core.zip in C:\mkl_samples, you will find errcheck.inc in C:\mkl_samples\vslc\source. It requires a few steps, I know, but believe me, errcheck.inc is very useful when you work with Intel® MKL.

    The following lines show C++11 code that generates 1,000 pseudo random numbers by using the MRG32k3a pseudo random generator with the BOXMULLER2 method. This method generates normally distributed random numbers. You can read more information about the different methods and their related formulas here. The Intel® MKL functions are C-style calls, but as I cannot stop using C++11 features, I've made the C-style calls in a C++ Windows console application that uses some C++11 features to display all the generated pseudo random numbers.

    #include <iostream>
    #include <stdio.h>
    
    #include "mkl.h"
    #include "mkl_vsl.h"
    // Replace with your own path to errcheck.inc
    #include "C:\mkl_samples\vslc\source\errcheck.inc"
    
    #define SEED  7777777
    #define RANDOM_NUMBERS     1000
    
    using namespace std;
    
    int main()
    
    {
           // Buffer for RANDOM_NUMBERS pseudo random numbers
           float pseudorandom[RANDOM_NUMBERS];
    
           VSLStreamStatePtr stream;
    
           // Initialize the stream
           // Generate the stream and initialize it specifying the 32-bit input integer parameter seed
           auto status = vslNewStream(&stream, VSL_BRNG_MRG32K3A, SEED);
           CheckVslError(status);
    
           // Mean value
           float mean = 0.0f;
    
           // Standard deviation
           float sigma = 1.0f;
    
           // Generate normally distributed random numbers
           status = vsRngGaussian(VSL_METHOD_SGAUSSIAN_BOXMULLER2,
                  stream, RANDOM_NUMBERS, pseudorandom, mean, sigma);
    
           CheckVslError(status);
    
           // Delete the stream
           status = vslDeleteStream(&stream);
           CheckVslError(status);
    
           cout << "Pseudo random numbers:\n";
           for (auto n : pseudorandom) {
                  cout << n << '\n';
           }
    
           return 0;
    }
    

     

    If your CPU doesn't support the instructions required by the chosen generator, the status will be equal to VSL_ERROR_CPU_NOT_SUPPORTED and CheckVslError will display an appropriate message. The code is very easy to understand, and the generator takes advantage of Intel® AVX2.

    First, the code declares a buffer to hold the number of float pseudo random numbers defined in RANDOM_NUMBERS: 1,000. Then, a call to the vslNewStream function generates the stream and initializes it specifying the generator, VSL_BRNG_MRG32K3A, and the seed defined in SEED. Notice that each Intel® MKL function call is followed by a call to CheckVslError with the status returned by the Intel® MKL function call as an argument.

    Then, the call to vsRngGaussian generates normally distributed random numbers with the BOXMULLER2 method (VSL_METHOD_SGAUSSIAN_BOXMULLER2). The mean value is 0 and the standard deviation (sigma) is 1. The random numbers are stored in the previously declared pseudorandom buffer. Finally, the code deletes the stream and displays all the generated pseudo random numbers.

    As you can learn from this small example, it is extremely easy to work with Intel® Math Kernel Library in Visual Studio 2013 thanks to the great integration that Intel® Parallel Studio XE 2013 provides with this IDE. With just a few lines of code, you can start taking full advantage of the Intel® AVX2 instructions introduced in Haswell CPUs in your C++ applications.

    Intel® Math Kernel Library is a commercial product, but you can download a free 30-day evaluation version here.

     

  • Intel MKL support for Intel(R) AVX2
  • AVX2
  • optimization
  • Haswell support in intel MKL
  • intel math kernel library
  • Vectorization
  • Intel® C++ Compiler
  • Intel® C++ Composer XE
  • Intel® Composer XE
  • Intel® Math Kernel Library
  • Intel® Advanced Vector Extensions
  • Intel® Streaming SIMD Extensions
  • C/C++
  • Server
  • Windows*
  • Notebook
  • Server
  • Tablet PC
  • Desktop
  • Developers
  • Students
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
  • Server