Intel SIMD Intrinsics

SIMD Intrinsics

  • SIMD: Single Instruction, Multiple Data

  • Intrinsics: low-level, compiler-specific instructions that allow programmers to leverage SIMD capabilities without writing assembly code. They provide a higher-level interface for SIMD programming.

  • Motivation:

    • Critical for performance optimization in certain applications

    • Harness the power of modern CPUs

  • Characteristics:

    • Hardware-specific

    • Compiler-specific

Intel Intrinsics

Overview

  • Only available on Intel processors

  • Available in C/C++ and Fortran

  • Evolution

    • Multimedia Extensions (MMX)

    • Streaming SIMD Extensions (SSE)

    • Advanced Vector Extensions (AVX)

  • Operations

    • Arithmetic

    • Logical

    • Data manipulation

    • Conversion

    • Shuffle

SSE4 Intrinsics in C language

  • Environment setup

    • Compiler: gcc

      • -march=native flag on a computer with intel CPU

    • Hardware: Intel CPUs

  • Include header file

    • #include <immintrin.h>

SSE4 Example

Matrix multiplication of a 4x4 matrix by a size-4 vector using SSE4 intrinsics is a fundamental operation. For larger matrices, this example serves as a building block for block-wise matrix multiplication, where the multiplication is efficiently computed in smaller chunks. This approach enhances computational efficiency and scalability when dealing with matrices of larger dimensions.

Link to code example

  • __m128

    The __m128 type is used to represent a 128-bit register that can hold 4 float numbers. The number of variables you can declare of this type is limited by the number of registers available on the CPU.

  • The _mm_loadu_ps intrinsic

    This intrinsic loads 4 float numbers from an unaligned memory address into a __m128 register. The _mm_load_ps intrinsic can be used to load 4 float numbers from an aligned memory address.

  • The _mm_storeu_ps intrinsic

    This intrinsic stores 4 float numbers from a __m128 register into an unaligned memory address. The _mm_store_ps intrinsic can be used to store 4 float numbers from a __m128 register into an aligned memory address.

  • The _mm_add_ps intrinsic

    This intrinsic adds two __m128 registers element-wise and returns the result in a __m128 register.

  • The _mm_hadd_ps intrinsic

    This intrinsic adds two __m128 registers horizontally and returns the result in a __m128 register. This horizontal addition adds the neighboring two elements from the two registers to create a new register. The first two elements of the result register are the sums from the first register, and the last two elements of the result register are the sums from the second register.

    Demonstration on horizontal addition
  • The _mm_mul_ps intrinsic

    This intrinsic multiplies two __m128 registers element-wise and returns the result in a __m128 register.

  • All intrinsics used in the example are the unaligned versions. The aligned versions are faster but require the memory addresses to be aligned to 16 bytes.