What is SSE (Streaming SIMD Extensions)?
SSE, short for Streaming SIMD (Single Instruction, Multiple Data) Extensions, is a set of processor instructions aimed mainly at multimedia programs. It first appeared in the Intel Pentium III processors.
Streaming SIMD Extensions is a processor technology that, as the name suggests, applies one instruction to multiple data elements at once.
- Streaming SIMD Extensions is a technology used extensively by modern processors to enhance multimedia performance.
- Over time, SSE was extended through newer versions: SSE2, SSE3, SSSE3 and SSE4, each adding instructions and features.
- SSE was introduced with the Intel Pentium III processors in 1999, and later processors shipped with the newer versions.
- Because one instruction operates on several values, SSE can reduce execution time and increase speed for data-parallel work, and it can simplify the code needed for such work.
- SSE provides both packed and scalar floating-point instructions covering arithmetic, comparison, logical operations, data shuffling, data movement, and cache and memory management, among others.
Understanding Streaming SIMD Extensions (SSE)
Streaming SIMD Extensions is a processor technology that handles several data sets with only one instruction.
It was initially known as Internet Streaming SIMD Extensions or ISSE. It was first included in the Intel Pentium III processor in 1999.
On earlier processors, each instruction could process only a single data element.
With SSE, a processor can handle several data elements in one instruction instead of issuing a separate instruction for each.
This offers significant benefits such as:
- It reduces operational time
- It enhances the performance of the CPU by increasing its speed
- It reduces design complexity
- It enables processing more data within a short amount of time
- It enhances multimedia performance of the system.
SSE was designed as the successor to MultiMedia eXtensions (MMX) technology.
However, MMX and SSE instructions can be mixed, because SSE extends MMX rather than discarding it.
Such a combination of instructions does not have an adverse effect on the performance of the system.
SSE adds a set of instructions and registers to Intel processors.
These registers are 128 bits wide and hold several floating-point values (and, from SSE2 onward, integers) that are processed at the same time.
Across its versions, SSE can handle the common numeric data types, including the following:
- Single-precision floating point
- Double-precision floating point (from SSE2)
- Integers from 8 bits to 64 bits per element, packed into 128-bit registers (from SSE2)
Initially, the Intel Pentium III processors that first included SSE had eight 128-bit registers along with 70 new instructions.
Over time, the design evolved: later versions added further instructions, and 64-bit processors gained eight more registers.
Here is a breakdown of the versions of SSE, with the number of instructions each one added and its main features.
Each iteration brought in new instructions, which resulted in enhanced performance.
- SSE – Launched in 1999 with 70 instructions, featuring single-precision floating-point vectors.
- SSE2 – Launched in 2000 with 144 instructions, adding double-precision vectors and 128-bit integer vectors.
- SSE3 – Launched in 2004 with 13 instructions, aimed at complex arithmetic and video encoding.
- SSSE3 – Launched in 2006 with 32 instructions, aimed at tasks such as MPEG decoding.
- SSE4 – Launched in 2007 with 54 instructions in total (47 in SSE4.1 and 7 in SSE4.2), targeting graphics building blocks, video and coprocessor acceleration.
To find out which versions of Streaming SIMD Extensions your system supports, if any, you can use the following programs:
- Intel Processor Identification Utility
- lscpu on most Linux distributions.
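The same check can also be made from inside a program. The following is a minimal sketch, assuming an x86 target and a GCC or Clang compiler, since it relies on their `__builtin_cpu_supports` builtin. The function name `highest_sse_version` is made up for this example.

```c
/* Returns the name of the newest SSE version the running CPU reports.
   __builtin_cpu_supports is a GCC/Clang builtin for x86 targets and
   needs a string literal, hence one call per feature name. */
const char *highest_sse_version(void) {
    if (__builtin_cpu_supports("sse4.2")) return "SSE4.2";
    if (__builtin_cpu_supports("sse4.1")) return "SSE4.1";
    if (__builtin_cpu_supports("ssse3"))  return "SSSE3";
    if (__builtin_cpu_supports("sse3"))   return "SSE3";
    if (__builtin_cpu_supports("sse2"))   return "SSE2";
    if (__builtin_cpu_supports("sse"))    return "SSE";
    return "none";
}
```

Calling `puts(highest_sse_version());` from `main` (with `<stdio.h>` included) prints the result. On any 64-bit x86 processor the answer should be at least SSE2, because SSE2 is part of the 64-bit baseline.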
A processor with SSE support can typically decode MPEG-2 (Moving Picture Experts Group) video, the scheme used for playing DVD video discs, in software.
This eliminates the need for a separate decoder card.
What are SSE Instructions Used for?
Streaming SIMD Extensions technology is used in a wide range of compute-intensive applications such as 3D graphics and animation. It is mainly used for video encoding and decoding.
In addition to that, the SSE instructions also have some specific use cases such as:
- The SSE conversion instructions convert individual and packed doubleword integers into scalar and packed single-precision floating-point values, and back.
- SSE instructions process several data elements in parallel, which speeds up 3D graphics and other compute-intensive applications.
- The later versions of SSE also support double-precision operations, which help significantly in content creation as well as engineering, financial, and scientific applications.
- The SSE2 instructions give software developers the flexibility to use different algorithms that improve the performance of software for applications such as MP3, MPEG-2, and 3D graphics.
With their support for higher dynamic range and flexible computation, the SSE2 instructions allow arithmetic operations on a variety of data types such as doublewords and quadwords.
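The conversion use case in the list above can be sketched with compiler intrinsics. This is an illustrative example, assuming an x86 compiler that provides `<xmmintrin.h>`. The three function names are invented for the sketch, while the intrinsics map to the CVTSI2SS, CVTSS2SI and CVTTSS2SI instructions.

```c
#include <xmmintrin.h>

/* CVTSI2SS: convert a 32-bit integer to a scalar single-precision float. */
float int_to_float(int value) {
    __m128 v = _mm_cvtsi32_ss(_mm_setzero_ps(), value);
    return _mm_cvtss_f32(v);
}

/* CVTSS2SI: convert using the current MXCSR rounding mode
   (round to nearest even unless the program changed it). */
int float_to_int_rounded(float value) {
    return _mm_cvtss_si32(_mm_set_ss(value));
}

/* CVTTSS2SI: convert by truncating toward zero, as a C cast does. */
int float_to_int_truncated(float value) {
    return _mm_cvttss_si32(_mm_set_ss(value));
}
```

The difference between the last two matters in practice: with the default rounding mode, 2.7 converts to 3 with the rounding form and to 2 with the truncating form.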
SSE Instruction Examples
The SSE instructions extend the Single Instruction Multiple Data model introduced with MMX technology and can be divided into three major groups: floating-point instructions, integer instructions and miscellaneous instructions.
These major groups can be further divided into subgroups such as:
- SIMD single-precision floating-point instructions, which typically work on the XMM registers
- 64-bit SIMD integer instructions, which typically work on the MMX registers
- MXCSR state management instructions
- Instructions that help in prefetch and cache control along with instruction ordering functionality.
Some common examples of SSE instructions in each category are listed below.
They are denoted by their Intel/AMD mnemonics, and each performs a different function.
- Data Transfer Instructions – This category includes MOVAPS, MOVHLPS, MOVHPS, MOVLHPS, MOVLPS, MOVMSKPS, MOVSS, and MOVUPS.
- Packed and Scalar Arithmetic Instructions – This category includes ADDPS, ADDSS, DIVPS, DIVSS, MAXPS, MAXSS, MINPS, MINSS, MULPS, MULSS, RCPPS, RCPSS, RSQRTPS, RSQRTSS, SQRTPS, SQRTSS, SUBPS, and SUBSS.
- Compare Instructions – This category includes CMPPS, CMPSS, COMISS, and UCOMISS.
- Logical Instructions – This category includes ANDNPS, ANDPS, ORPS, and XORPS.
- Shuffle and Unpack Instructions – This category includes SHUFPS, UNPCKHPS, and UNPCKLPS.
- Conversion Instructions – This category includes CVTPI2PS, CVTPS2PI, CVTSI2SS, CVTSS2SI, CVTTPS2PI, and CVTTSS2SI.
- MXCSR Status/Control Instructions – This category includes LDMXCSR and STMXCSR.
- 64-bit SIMD Integer Instructions – This category includes PAVGB, PAVGW, PEXTRW, PINSRW, PMAXSW, PMAXUB, PMINSW, PMINUB, PMOVMSKB, PMULHUW, PSADBW, and PSHUFW.
- Miscellaneous Instructions – This category includes MASKMOVQ, MOVNTPS, MOVNTQ, PREFETCHNTA, PREFETCHT0, PREFETCHT1, PREFETCHT2, and SFENCE.
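In C, these mnemonics are usually reached through compiler intrinsics rather than written by hand. The sketch below assumes an x86 compiler with `<xmmintrin.h>`. The function names are invented for illustration, and the comments name the instruction each intrinsic normally compiles to.

```c
#include <xmmintrin.h>

/* Sum the four floats at p using data transfer, shuffle and
   arithmetic instructions from the categories above. */
float horizontal_sum4(const float *p) {
    __m128 v    = _mm_loadu_ps(p);                               /* MOVUPS:  [a b c d]           */
    __m128 swap = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* SHUFPS:  [b a d c]           */
    __m128 sums = _mm_add_ps(v, swap);                           /* ADDPS:   [a+b a+b c+d c+d]   */
    __m128 high = _mm_movehl_ps(sums, sums);                     /* MOVHLPS: c+d moved to slot 0 */
    return _mm_cvtss_f32(_mm_add_ss(sums, high));                /* ADDSS:   a+b+c+d             */
}

/* Compare four pairs at once. Bit i of the result is set
   when a[i] < b[i]. */
int less_than_mask4(const float *a, const float *b) {
    __m128 cmp = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); /* CMPPS    */
    return _mm_movemask_ps(cmp);                                 /* MOVMSKPS */
}
```

A compiler is free to pick a different but equivalent instruction, so the mnemonics in the comments describe the typical mapping rather than a guarantee.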
What is SSE Optimization?
Streaming SIMD Extensions has both scalar and vector instructions. The vector instructions apply a single mathematical or logical operation to multiple values at the same time.
These instructions are especially helpful for speeding up matrix and vector math functions.
With some experimentation and programming effort, compiled code can approach the speed of pure assembly without naming particular vector instructions explicitly.
However, there are some tradeoffs in terms of portability involved.
For example, code written with the GCC vector extensions will work on non-Intel architectures such as ARM and PowerPC, but not with other compilers.
On the other hand, C code written with Intel intrinsics, which reads much like assembly, can be built with other compilers but will not work on other architectures.
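The portability tradeoff can be sketched by writing the same four-float multiply twice, once with the GCC/Clang vector extension and once with Intel intrinsics. This is an illustrative example only: the type name `float4` and both function names are made up, and the first function assumes a compiler that understands `__attribute__((vector_size))`.

```c
#include <string.h>
#include <xmmintrin.h>

/* GCC/Clang vector extension: a 16-byte vector of four floats.
   float4 is a name chosen for this sketch. */
typedef float float4 __attribute__((vector_size(16)));

/* Builds for any architecture GCC/Clang supports, but relies on
   a compiler-specific language extension. */
void mul4_vector_ext(const float *a, const float *b, float *out) {
    float4 va, vb, vr;
    memcpy(&va, a, sizeof va);   /* load without assuming alignment */
    memcpy(&vb, b, sizeof vb);
    vr = va * vb;                /* element-wise multiply; the compiler picks the instruction */
    memcpy(out, &vr, sizeof vr);
}

/* Builds with any x86 compiler that ships xmmintrin.h, but is
   tied to SSE hardware. */
void mul4_intrinsics(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);              /* MOVUPS */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));   /* MULPS  */
}
```

Both functions produce the same four products. The difference lies only in which toolchains and processors can build and run them.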
A compiler can target an instruction set as part of its optimization effort, but it will not usually restructure the code to make the best use of it.
This means that you either write the SSE code by hand or use a library such as Intel Performance Primitives to take full advantage of it.
The main idea behind SSE optimization is simply to perform the same operation on four 32-bit values, or in some cases two 64-bit values, at once.
So instead of a conventional add instruction, which adds the values from two separate 32-bit registers, you are better off using a vector add instruction.
This uses the special 128-bit registers, each holding four 32-bit values, and adds all four pairs in a single operation.
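As a rough illustration of this idea, the loop below adds two float arrays four elements at a time and finishes any leftover elements with ordinary scalar adds. It is a sketch under the assumption of an x86 compiler with `<xmmintrin.h>`, and the function name `add_arrays` is invented for the example.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* out[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;

    /* Vector part: one ADDPS handles four 32-bit floats. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail: fewer than four elements remain. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The unaligned load and store intrinsics are used so the sketch works on any pointer. Code that can guarantee 16-byte alignment could use `_mm_load_ps` and `_mm_store_ps` instead.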
How Many SSE Registers are There?
In 64-bit mode there are 16 SSE registers, referred to as XMM0, XMM1, XMM2 and so on through XMM15.
These registers are 128 bits wide and can be used for a variety of operations on data of different types and sizes.
Moreover, unlike the MMX registers, the SSE registers are not aliased onto the x87 floating-point stack.
Initially, SSE came with only eight new 128-bit registers, referred to as XMM0, XMM1, and so on through XMM7.
The AMD64 extensions from AMD, which were initially called x86-64, added a further eight registers, named XMM8, XMM9 and so on through XMM15.
This extension was reproduced in the Intel 64 architecture. However, registers XMM8 and higher are only accessible in 64-bit operating mode.
In addition, there is a new 32-bit control and status register called the MXCSR.
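The MXCSR can be read and written from C through intrinsics. The following minimal sketch, assuming an x86 compiler with `<xmmintrin.h>`, reads and changes the rounding-mode field. The two function names are made up for the example, while the `_MM_GET_ROUNDING_MODE` and `_MM_SET_ROUNDING_MODE` helpers come from that header.

```c
#include <xmmintrin.h>

/* Read the rounding-mode bits of the MXCSR (uses STMXCSR). */
unsigned int current_rounding_mode(void) {
    return _MM_GET_ROUNDING_MODE();
}

/* Install a new rounding mode (uses LDMXCSR) and return the
   previous one so the caller can restore it afterwards. */
unsigned int swap_rounding_mode(unsigned int mode) {
    unsigned int previous = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(mode);
    return previous;
}
```

Valid modes are `_MM_ROUND_NEAREST`, `_MM_ROUND_DOWN`, `_MM_ROUND_UP` and `_MM_ROUND_TOWARD_ZERO`. The register belongs to the current thread, so it is good practice to restore the previous mode once the work that needed the change is done.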
The data types held in the registers varied across versions of SSE. The original SSE used the XMM registers for a single data type: four 32-bit single-precision floating-point numbers.
In comparison, SSE2 expanded the use of the XMM registers to include the following:
- Two 64-bit integers or
- Two 64-bit double-precision floating-point numbers or
- Four 32-bit integers or
- Eight 16-bit short integers or
- Sixteen 8-bit bytes or characters.
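The same 128-bit register can be viewed through any of these layouts, depending on the instruction used. The sketch below shows three of them, assuming an x86 compiler with `<emmintrin.h>` (the SSE2 header). The function names are invented for illustration.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Four 32-bit integers per register (PADDD). */
void add_i32x4(const int32_t *a, const int32_t *b, int32_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* Two 64-bit double-precision floats per register (ADDPD). */
void add_f64x2(const double *a, const double *b, double *out) {
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}

/* Sixteen unsigned bytes per register, clamped at 255 instead of
   wrapping around (PADDUSB). */
void add_u8x16_saturating(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}
```

The saturating byte add is typical of the multimedia focus of these instructions: adding two pixel values of 200 and 100 yields 255 rather than wrapping around to 44.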
However, these registers are disabled by default, because the 128-bit registers are additional machine state that the operating system must preserve when switching tasks.
Therefore, in order to use them, the operating system has to enable them explicitly.
This means that the operating system must know how to use the FXSAVE and FXRSTOR instructions.
This extended pair of instructions saves and restores the x87, MMX and SSE register state in one step.
This support was promptly added to all major IA-32 operating systems.
SSE vs AVX
- Wider vectors do not guarantee faster code. Depending on the version, memory and processor used, SSE can outperform AVX, for example running a piece of image-processing code in about 180 ms as opposed to 200 ms for AVX.
- The compiled SSE version of a routine may have as many as 111 instructions, compared with 67 instructions in the AVX version.
- An SSE instruction can process four single-precision floating-point values, whereas an AVX instruction can process eight.
- SSE and AVX belong to different generations. AVX is the newer one and comes with wider vectors that support more parallelism.
- While SSE processes 128 bits in each instruction, AVX processes 256 bits.
- SSE instructions use only the XMM registers, whereas AVX instructions can use both the XMM registers and the wider YMM registers.
- An SSE instruction generally overwrites one of its input operands. AVX instructions are non-destructive because the VEX (Vector Extension) prefix encodes an additional destination operand.
- The SSE instruction sets, being older than AVX, are available on more hardware.
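The width difference in the list above can be sketched by adding eight floats both ways. This example assumes GCC or Clang on x86, because it uses the `target("avx")` function attribute and `__builtin_cpu_supports` so that the file builds without special flags and only runs AVX code on processors that report it. All three function names are made up.

```c
#include <immintrin.h>

/* SSE: two 128-bit ADDPS operations, four floats each. */
void add8_sse(const float *a, const float *b, float *out) {
    _mm_storeu_ps(out,     _mm_add_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b)));
    _mm_storeu_ps(out + 4, _mm_add_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4)));
}

/* AVX: one 256-bit VADDPS on YMM registers covers all eight floats.
   The attribute lets this one function use AVX without -mavx. */
__attribute__((target("avx")))
void add8_avx(const float *a, const float *b, float *out) {
    _mm256_storeu_ps(out, _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
}

/* Pick the widest version the running CPU supports. */
void add8(const float *a, const float *b, float *out) {
    if (__builtin_cpu_supports("avx")) {
        add8_avx(a, b, out);
    } else {
        add8_sse(a, b, out);
    }
}
```

Both paths give the same result. Whether the AVX path is actually faster depends on the processor and on memory behavior, as the timing figures above suggest.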
This article has covered Streaming SIMD Extensions, a technology used by modern CPUs.
By handling several data elements with a single instruction, and with features added over successive versions, it speeds up a wide range of data-heavy workloads.