What is Superscalar? Works, Example, Pros, Cons & More

December 21, 2022

What is Superscalar?

Superscalar refers to a microprocessor architecture that has the ability to handle more than one instruction per clock cycle. It reached x86 CPUs with the Intel Pentium processors, although superscalar designs existed earlier.

Technically, the design of these processors consists of a set of mechanisms that allow the Central Processing Unit (CPU) of a computer to execute multiple instructions in a single cycle while making the results appear as if the instructions had run one after another, in program order.

KEY TAKEAWAYS

A superscalar processor is typically implemented to offer better performance by handling several instructions from a sequential instruction stream in the same clock cycle.
As opposed to the working process of a scalar processor, which executes only one instruction in a clock cycle, a superscalar processor executes more than one instruction in a single clock cycle.
These processors are sometimes described as a combination of scalar and vector processors because each instruction works on one data item while several instructions are handled in one clock cycle.
The execution units of a superscalar processor are not separate processors but resources within a single CPU. Superscalar designs are often described as second-generation RISC, operating faster with a reduced instruction set.
The throughput of a superscalar processor is significantly higher than that of a scalar processor at the same clock rate, mainly because it has a higher number of ALUs, FPUs and SIMD units.

Understanding Superscalar

A superscalar processor implements a form of parallelism called Instruction Level Parallelism, or ILP, within a single processor.

However, this requires analyzing the instructions to be carried out and using different execution units to carry them out.

By dispatching several instructions to the different execution units available on the processor, a superscalar processor handles more than one instruction in a clock cycle.

The design of a superscalar processor typically emphasizes improving the instruction dispatcher so that it keeps the several execution units busy at all times.

The early designs of such processors had two ALUs or Arithmetic Logic Units and one FPU or Floating Point Unit.

However, the later designs included two more ALUs, one more FPU, and two additional SIMD or Single Instruction Multiple Data units. Typically, these CPUs maintain a steady execution rate.

However, merely processing more than one instruction per machine cycle does not make an architecture superscalar.

Multiprocessor, multi-core and pipelined architectures also work on several instructions at once, although by different methods.

In a superscalar design, the dispatcher reads the instructions from memory and at the same time decides which of them can be executed in parallel.

It also has to dispatch the instructions to several different execution units in the CPU.

In simple words, a superscalar processor has several parallel pipelines that simultaneously process instructions from a single instruction thread.

Traditionally, a superscalar design is identified by a few specific characteristics:

Instructions are issued from a sequential instruction stream
Data dependencies between instructions are checked dynamically by the hardware at runtime rather than by software at compile time
Ability to carry out multiple instructions in each clock cycle
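
The characteristics above can be illustrated with a minimal Python sketch of an in-order, dual-issue dispatcher. The `Instr` type, the register names and the `issue_in_order` function are hypothetical names chosen for this illustration, and the dependency check is deliberately simplified: it only looks at registers written earlier in the same issue group.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction

def issue_in_order(program, width=2):
    """Group instructions into issue packets, keeping program order.

    An instruction joins the current packet only if a slot is free and it
    neither reads nor writes a register written earlier in that packet.
    """
    packets, current, written = [], [], set()
    for ins in program:
        depends = any(s in written for s in ins.srcs) or ins.dest in written
        if len(current) == width or depends:
            packets.append(current)          # close the packet, start a new cycle
            current, written = [], set()
        current.append(ins)
        written.add(ins.dest)
    if current:
        packets.append(current)
    return packets

program = [
    Instr("r1", ("r2", "r3")),   # r1 = r2 + r3
    Instr("r4", ("r5", "r6")),   # independent of the first: same cycle
    Instr("r7", ("r1", "r4")),   # needs r1 and r4: next cycle
    Instr("r8", ("r7",)),        # needs r7: one more cycle
]
packets = issue_in_order(program)
print([len(p) for p in packets])   # [2, 1, 1]
```

Four instructions take three issue cycles here instead of four, and the gain comes entirely from the one independent pair. Real dispatchers also track instruction latencies and execution unit availability, which this sketch leaves out.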

The superscalar processor architecture emerged in three successive phases:

At first, the concept was envisaged
Then a few architecture proposals along with a few prototype machines came out
Finally, the commercial products were released on the market.

The superscalar RISC, or Reduced Instruction Set Computer, processors emerged through two approaches:

Some of these processors emerged by converting an existing scalar RISC line into a superscalar line, such as the Intel 960 and the MC 88000
Others were conceived as an entirely new architecture that was superscalar from the very start, such as the IBM RS/6000.

The performance improvements gained from superscalar processors are, however, dependent on three specific areas:

The level of intrinsic parallelism within the instruction stream, which is limited when instructions need the same computing resources from the processor
The complexity and time cost of the register renaming circuit and of the dependency checking logic
The branch instruction processing ability
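
As a rough illustration of the first point, the intrinsic parallelism of an instruction sequence can be estimated by dividing the number of instructions by the length of its longest dependency chain. The sketch below assumes every instruction takes one cycle and that execution units are unlimited. The `ideal_ilp` function and the dependency-list format are hypothetical, introduced only for this example.

```python
def ideal_ilp(deps):
    """Estimate the upper bound on instructions per cycle.

    deps[i] lists the indices of earlier instructions that instruction i
    depends on. Assumes unit latency and unlimited execution units.
    """
    depth = []
    for preds in deps:
        # An instruction sits one level below its deepest predecessor.
        depth.append(1 + max((depth[p] for p in preds), default=0))
    critical_path = max(depth, default=0)
    return len(deps) / critical_path if critical_path else 0.0

print(ideal_ilp([[], [], [], []]))     # 4.0: four independent instructions
print(ideal_ilp([[], [0], [1], [2]]))  # 1.0: a pure dependency chain
```

A wide superscalar core gains nothing on the second sequence, because no two of its instructions can ever run in the same cycle.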

A superscalar processor is sometimes described as a combination of a scalar and a vector processor: every instruction processes one data item at a time, but the multiple execution units allow the processor to handle separate data items at the same time.

Superscalar Processor Working Process

A superscalar processor executes the instructions of a sequential program, though there is no universal agreement on exactly how parallel instruction handling should be implemented.

Typically, it uses specific superscalar design techniques to function, which include:

Parallel instruction decoding
Speculative execution
Parallel register renaming
Out-of-Order Execution or OoOE

Along with these techniques, there are also a few other specific types of procedures typically employed to complement the superscalar design. These techniques include:

Caching
Pipelining
Branch prediction

Out-of-Order Execution and register renaming are central to how a superscalar processor executes instructions.

They help the processor to find and increase parallelism at the instruction level, which, in turn, allows it to handle several instructions in a clock cycle.

During operation, the superscalar processor makes the best use of the pipelining technique by pipelining instruction fetch, branch prediction logic and decode.

A pipelined superscalar processor works in several stages:

Fetch – This stage is divided into two or more sub-stages because the main memory is slow. This helps it to fetch more instructions in parallel.
Decode – In this stage, instructions are pre-decoded and divided into micro-ops, or micro-instructions. This allows faster completion of the stage.
Dispatch – This is the stage when the instructions are dispatched to different execution units.
Execution – In this stage, the execution units run the instructions in parallel, possibly out of order.
Complete – In this stage, the set of instructions is considered to have finished execution and the results are available.
Retire or Writeback – At this stage the results of the execution are written back to the registers.
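
To see why width matters in a pipeline like this, the following Python sketch counts cycles for an idealized pipeline with the six stages listed above. It assumes no stalls, no dependencies and no branch mispredictions, so it gives a best case rather than a model of any real processor. `ideal_cycles` is a hypothetical helper name.

```python
import math

STAGES = ["fetch", "decode", "dispatch", "execute", "complete", "retire"]

def ideal_cycles(n_instructions, width, depth=len(STAGES)):
    """Cycles to run n instructions on a stall-free pipeline.

    The first group of `width` instructions needs `depth` cycles to pass
    through every stage; each later group finishes one cycle after the
    previous one.
    """
    if n_instructions == 0:
        return 0
    groups = math.ceil(n_instructions / width)
    return depth + groups - 1

for width in (1, 2, 4):
    print(width, ideal_cycles(12, width))
# 1 17
# 2 11
# 4 8
```

Going from one to four instructions per cycle cuts the count from 17 to 8 rather than by a factor of four, because the time needed to fill the pipeline does not shrink with width.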

Superscalar Processor Examples

The first commercial superscalar microprocessors were the Intel i960CA, introduced in 1989, and the AMD 29050 of the 29000 series, introduced in 1990.

However, the IBM System/360 Model 91 mainframe of 1967 is also an example of a superscalar computer. The P5 Pentium was the first superscalar x86 processor.

Some of the most prominent examples of RISC lines where an existing scalar line was converted into a superscalar line include:

The Intel 960
The HP Precision Architecture or PA
The MC 88000 of Motorola
The MC 88100
The Sun SPARC
The AMD Am29000

An example of the other approach to microprocessor design, where the architecture is completely new, is the RS/6000 processor announced by IBM in 1990, which was later renamed POWER1.

IBM's POWER and PowerPC superscalar processors include:

The PowerPC 600
The PowerPC 601
The PowerPC 620
The PowerPC 604
The POWER1
The POWER2
The POWER3
The PowerPC 7xx
The PowerPC G4

The list of superscalar processors also includes a few CPUs from the MIPS R series:

R4000
R5000
R8000
R10000

Some other examples of superscalar processors, listed roughly alphabetically, are:

Alpha 21064
Alpha 21164
Alpha 21264
Alpha 21364
Alpha 21464
AMD K6
AMD K6-2
AMD K6-III
Athlon
Cyrix 6x86
Motorola 68060
MP6
P6 microarchitecture
PA-7100
PA-7100LC
PA-7200
PA-8000
Pentium II
Pentium III
Pentium Pro
HAL SPARC64
SPARC64 V
UltraSPARC
UltraSPARC II
UltraSPARC III
UltraSPARC IV
VEGA microprocessors

Most out-of-order CPUs are superscalar processors by nature.

Most RISC-line processors are also considered to be superscalar CPUs, having arrived there through one of the two design approaches described earlier.

Modern x86 processors are superscalar CPUs as well, and they perform out-of-order execution.

The ARM Cortex-R52 CPUs belong to the in-order, mid-performance superscalar processor family. They are mainly used in industrial and automotive applications.

What are the Superscalar Processor Design Characteristics?

The main characteristic of a superscalar processor design is the ability to execute multiple instructions in a clock cycle by establishing parallelism in the processor at the instruction level.

It is pipelined and has multiple processing units, so that independent instructions in the sequence do not have to wait for one another.

It is the Instruction Set Architecture, or ISA, and its implementation techniques that make these processors different from others, both in terms of hardware and software.

The ISA acts as an abstraction between the programs and the hardware implementations.

Some of the notable characteristics of the ISA are:

It ensures portability
It defines the set of instructions, which are expressed in an assembly language
It serves as the dynamic-static interface

The ISA itself acts like a specification for the hardware developers.

Superscalar processors handle different types of instructions, such as:

Arithmetic operations
Load/store or data movement operations between the cache, memory and registers
Branch Instructions

The performance of a processor is typically measured in CPI, or Cycles Per Instruction. Superscalar processors lower the CPI, and with it the execution time, by following two specific techniques:

Using a deeper pipeline so that several instructions are in progress in each cycle
Shifting complexity into the hardware, which may in turn increase the cycle time.
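
The trade-off between CPI and cycle time can be made concrete with the classic performance equation, in which execution time equals instruction count times CPI times cycle time. The Python sketch below uses made-up numbers purely for illustration: the instruction count, CPI values and clock rates are assumptions, not measurements of any processor.

```python
def execution_time(instruction_count, cpi, clock_hz):
    """Performance equation: time = instructions x CPI x cycle time."""
    return instruction_count * cpi / clock_hz

N = 1_000_000_000  # hypothetical program of one billion instructions

scalar = execution_time(N, cpi=1.0, clock_hz=2.0e9)
superscalar = execution_time(N, cpi=0.5, clock_hz=2.0e9)
slower_clock = execution_time(N, cpi=0.5, clock_hz=1.6e9)

print(scalar, superscalar, slower_clock)  # 0.5 0.25 0.3125
```

In this example, halving the CPI halves the run time. If the extra hardware complexity forces the clock down from 2.0 GHz to 1.6 GHz, part of that gain is lost, but the superscalar design still finishes sooner than the scalar one.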

Branch prediction is an important aspect of the performance of the superscalar processors.

These processors predict branches using techniques such as:

Branch Target Speculation – Here, the Branch Target Buffer, or the BTB, which is a small and typically associative cache, stores the target addresses of previously taken branches and is looked up the next time the branch is fetched.
Branch Conditional Speculation – Here, the branch predictors predict whether a branch will be taken, either from a hint supplied by the compiler or, more usually, from the branch's recent history as tracked by Finite State Machine, or FSM-based, predictors.
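
A common textbook example of an FSM-based predictor is the 2-bit saturating counter, sketched below in Python together with a dictionary standing in for the Branch Target Buffer. This is a minimal illustration, not the scheme of any particular processor. The class name, the initial counter state and the branch addresses are assumptions made for the example.

```python
class TwoBitPredictor:
    """2-bit saturating counters: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter state (0..3)
        self.btb = {}       # branch address -> last taken target (BTB stand-in)

    def predict(self, pc):
        taken = self.counters.get(pc, 1) >= 2   # unseen branches start at state 1
        target = self.btb.get(pc) if taken else None
        return taken, target

    def update(self, pc, taken, target=None):
        state = self.counters.get(pc, 1)
        # Move one step toward the actual outcome, saturating at 0 and 3.
        self.counters[pc] = min(state + 1, 3) if taken else max(state - 1, 0)
        if taken:
            self.btb[pc] = target

predictor = TwoBitPredictor()
outcomes = [True] * 9 + [False]   # a loop branch: taken 9 times, then falls through
correct = 0
for taken in outcomes:
    guess, _ = predictor.predict(0x400)
    correct += (guess == taken)
    predictor.update(0x400, taken, target=0x3F0)

print(correct, "of", len(outcomes))  # 8 of 10
```

The predictor misses only the first iteration and the loop exit. Because the counter needs two wrong guesses in a row to change its prediction, a single loop exit does not make it mispredict the next time the loop is entered.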

Register renaming is controlled by the scheduler and the reorder buffer. It eliminates false data dependencies.

These arise when later instructions reuse the same architectural registers even though there is no actual data dependency between them.
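
The idea behind register renaming can be sketched in a few lines of Python. The sketch assumes an unlimited supply of physical registers and ignores the reorder buffer and the freeing of registers, so it shows only the mapping step. The `rename` function and the `p0`, `p1`, ... register names are hypothetical.

```python
from itertools import count

def rename(program, arch_regs):
    """Map architectural registers to fresh physical registers.

    program is a list of (dest, srcs) pairs. Every write gets a new physical
    register, so only true (read-after-write) dependencies remain.
    """
    fresh = count()
    mapping = {r: f"p{next(fresh)}" for r in arch_regs}
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(mapping[s] for s in srcs)  # read the current mappings first
        mapping[dest] = f"p{next(fresh)}"           # then allocate a new destination
        renamed.append((mapping[dest], new_srcs))
    return renamed

program = [
    ("r1", ("r2", "r3")),  # r1 = r2 + r3
    ("r2", ("r1",)),       # r2 = r1 * 2   (true dependency on r1)
    ("r1", ("r3",)),       # r1 = r3 + 1   (reuses r1, no real dependency)
]
for line in rename(program, ["r1", "r2", "r3"]):
    print(line)
# ('p3', ('p1', 'p2'))
# ('p4', ('p3',))
# ('p5', ('p2',))
```

After renaming, the third instruction writes `p5` instead of overwriting the register that the second instruction reads, so it no longer has to wait for the first two and can execute in parallel with them.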

Is Superscalar Multicore?

A superscalar processor can be either single core or multicore. What differentiates a superscalar core from a multicore processor is that it has only one program counter, or instruction counter.

It therefore keeps track of several instructions in flight, but all of them come from a single program.

In a multicore processor, several instruction streams may be executed simultaneously because each core of the CPU has its own separate program counter. Each of those cores can itself be superscalar.

This means that every single process can be executed more quickly, a trait which is characteristic of superscalar processors.

How Do Superscalar Processors Exploit Parallelism?

Superscalar processors exploit parallelism by fetching and executing several instructions simultaneously, which reduces the average number of clock cycles per instruction.

Instruction Level Parallelism in a superscalar processor is exploited in two specific ways:

It can be done statically by the compiler or
It can be done dynamically by the hardware.

Improvements made in the architecture of superscalar processors help them to exploit more parallelism and to exploit it more effectively.

This architecture allows pipelined execution of instructions, which is a necessity for parallel processing, so that work on the next instruction can start without waiting for the previous one to complete.

The architecture also allows Out-of-Order Execution, or OoOE, with which a superscalar machine extracts ILP dynamically from the scalar instruction stream and achieves more parallelism.
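
The benefit of out-of-order execution can be shown with a small Python sketch in which, every cycle, the hardware issues up to `width` instructions whose inputs are ready, regardless of their position in the program. It assumes every instruction takes one cycle and that the processor can look at the whole sequence at once. `schedule_out_of_order` and the dependency-list format are hypothetical names for this illustration.

```python
def schedule_out_of_order(deps, width=2):
    """Return the issue cycle of each instruction.

    deps[i] lists the indices of earlier instructions that instruction i
    depends on. An instruction is ready once all of them issued in an
    earlier cycle (unit latency assumed).
    """
    issue_cycle = {}
    cycle = 0
    while len(issue_cycle) < len(deps):
        ready = [i for i in range(len(deps))
                 if i not in issue_cycle
                 and all(issue_cycle.get(p, cycle) < cycle for p in deps[i])]
        for i in ready[:width]:       # oldest ready instructions go first
            issue_cycle[i] = cycle
        cycle += 1
    return [issue_cycle[i] for i in range(len(deps))]

# i0; i1 depends on i0; i2 is independent; i3 depends on i2
deps = [[], [0], [], [2]]
print(schedule_out_of_order(deps))  # [0, 1, 0, 1]
```

Here the independent instruction `i2` is pulled ahead of `i1`, so the four instructions issue in two cycles. A dual-issue machine that sticks strictly to program order would need three, because `i1` cannot issue alongside `i0`.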

Is Superscalar SIMD?

The answer to this question is not simple. Flynn’s taxonomy classifies processors by the number of instruction streams and data streams they handle.

A single-core superscalar processor executes one instruction stream, so it is classified as a SISD or Single Instruction Single Data processor, even though it handles a number of instructions at a time. All of those instructions come from the same stream.

If that single core can also execute short vector operations, it can be considered a SIMD processor.

A multi-core superscalar processor, on the other hand, runs several instruction streams at once and is therefore a MIMD or Multiple Instruction Multiple Data processor.

The pipelined architecture does not add to the number of instruction streams that are processed simultaneously. The single stream merely flows through a channel which is just longer, as it were.

So, superscalar and SIMD are separate ideas. A superscalar processor is SIMD only to the extent that it supports vector instructions.

Advantages

Thoughtful selection and ordering of instructions by the compiler
Operational hazards and delays are reduced
Interleaving of floating point and integer instructions
Floating point and integer units are kept busy most of the time by the dispatch unit
Faster execution, with several instructions completed in one cycle on separate data in parallel
Better analysis of instructions
High performance achieved
Backward compatibility
Better arrangement of program instructions
Optimal usage of the hardware units available
Out-of-Order Execution, branch prediction and speculative execution allow more parallelism

Disadvantages

Scheduling problems can occur
Analysis of executable instructions is not free
The analysis process is time consuming and needs additional transistors
Higher power usage
Unsuitable for use in the smaller embedded systems due to higher energy consumption
Speculative execution increases the possibility of new side channels being opened to attack

Scalar vs Superscalar

A superscalar processor can issue multiple instructions at the same time, each operating on its own piece of data. In comparison, a scalar processor issues a single instruction at a time, acting on a single piece of data.
A superscalar processor can produce a much larger throughput than a scalar processor at the same clock rate.
The architecture of the superscalar processors allows them to use ILP and redundant functional units to handle multiple instructions. In comparison, a scalar processor in its simplest form uses integer instructions and fixed point operands.
The superscalar processors are much more powerful and faster than the scalar processors.
A superscalar processor can start an operation without waiting for the previous instructions to complete, whereas a scalar processor has to wait.
Each instruction in a superscalar processor can be dispatched to its own execution unit, which enhances its eventual performance as opposed to the scalar processors.

What is Superscalar Implementation?

Superscalar execution is most commonly applied to common instructions such as loads and stores, integer and floating-point arithmetic, and conditional branches.

These instructions can be initiated simultaneously and executed independently.

What Happens to a Superscalar Processor without Multithreading Support?

If a superscalar processor has no multithreading support, many of its instruction issue slots can go unused.

This is because, without multithreading, the processor can draw only on the limited parallelism available within a single thread.

And, in the case of a Level 3 cache miss or any other similar type of long-lasting stall, the processor cannot switch to another thread, so its resources sit idle and the pipeline may stall completely.

Conclusion

A superscalar processor uses instruction-level parallelism and a more elaborate architecture than a scalar processor to perform faster, handling multiple instructions at a time.

The features of the superscalar design make these processors more powerful and more efficient at managing instructions, which results in higher throughput.