How the Nehalem Microprocessor Microarchitecture Works

By: Jonathan Strickland

Nehalem Branches and Loops

The Core i7 chip with the heatspreader removed.
Courtesy Intel

In a microprocessor, everything runs on clock cycles. A clock cycle is the chip's basic unit of time: every instruction takes one or more cycles to complete. Clock speed is the number of cycles the chip runs through each second. All else being equal, the faster the clock speed, the more instructions the microprocessor will be able to handle per second.
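To put rough numbers on that relationship, here is a minimal sketch. The function name and the clock speeds are made up for illustration, and real chips finish a varying number of instructions per cycle, so treat this as back-of-the-envelope arithmetic rather than a benchmark.

```python
def instructions_per_second(clock_hz, instructions_per_cycle):
    """Rough throughput: cycles per second times instructions finished per cycle."""
    return clock_hz * instructions_per_cycle

# Hypothetical figures chosen for illustration only.
slow_chip = instructions_per_second(2.0e9, 1.0)     # 2 GHz, one instruction per cycle
fast_chip = instructions_per_second(3.0e9, 1.0)     # 3 GHz, one instruction per cycle
stalled_chip = instructions_per_second(3.0e9, 0.5)  # 3 GHz, but half its cycles are wasted

print(slow_chip, fast_chip, stalled_chip)
```

The third line hints at why clock speed alone doesn't tell the whole story: a fast chip that wastes cycles can end up behind a slower chip that doesn't.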

One way microprocessors like the Core i7 try to increase efficiency is to guess which instructions are coming next based on how earlier instructions behaved. It's called branch prediction. When branch prediction works, the microprocessor completes instructions more efficiently. But if a prediction turns out to be inaccurate, the microprocessor has to throw away the work it started on the wrong instructions and go back for the right ones. This means wasted clock cycles, which translates into slower performance.
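A small simulation can make the cost of a bad guess concrete. The sketch below uses a two-bit saturating counter, a classic textbook predictor, and not a description of Nehalem's actual prediction hardware. The 15-cycle penalty is an assumed figure for illustration.

```python
def simulate_two_bit_predictor(outcomes, mispredict_penalty=15):
    """Predict each branch with a 2-bit saturating counter.

    outcomes: list of booleans, True if the branch was actually taken.
    mispredict_penalty: assumed cycles lost per wrong guess (illustrative,
    not a measured Nehalem value).
    Returns (mispredictions, wasted_cycles).
    """
    counter = 2  # 0 or 1 means predict "not taken"; 2 or 3 means predict "taken"
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Nudge the counter toward what actually happened.
        if taken:
            counter = min(counter + 1, 3)
        else:
            counter = max(counter - 1, 0)
    return mispredictions, mispredictions * mispredict_penalty

# A loop branch: taken nine times, then not taken once when the loop exits.
loop_branch = [True] * 9 + [False]
# A branch that flips direction every time it runs.
alternating_branch = [True, False] * 5

print(simulate_two_bit_predictor(loop_branch))
print(simulate_two_bit_predictor(alternating_branch))
```

The steady loop branch fools this simple predictor only once, at the exit, while the alternating branch fools it half the time. Predictable code wastes fewer cycles.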


Nehalem has two branch target buffers (BTB). These buffers keep a record of where recent branches led, so the processor can start loading the instructions it expects to need next. Assuming the prediction is correct, the processor doesn't have to pause while it works out where the program is headed. Nehalem's two buffers allow it to keep track of more branches, decreasing the lag time when the first buffer doesn't have an answer.
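As a rough picture of how a pair of buffers might cooperate, here is a toy model. The class name, the table sizes and the direct-mapped layout are all assumptions made for this sketch; they are not taken from Intel's design.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: remembers where each branch jumped last time."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = [None] * num_entries  # each slot holds (branch_address, target)

    def predict(self, branch_address):
        """Return the remembered target, or None if this branch isn't in the table."""
        slot = self.entries[branch_address % self.num_entries]
        if slot is not None and slot[0] == branch_address:
            return slot[1]
        return None

    def update(self, branch_address, actual_target):
        """Record the real target once the branch has executed."""
        self.entries[branch_address % self.num_entries] = (branch_address, actual_target)


def two_level_predict(small, large, branch_address):
    """Check the small buffer first and fall back to the large one on a miss."""
    target = small.predict(branch_address)
    if target is None:
        target = large.predict(branch_address)
    return target


# Hypothetical sizes, chosen only to force a collision in the small table.
small_btb = BranchTargetBuffer(4)
large_btb = BranchTargetBuffer(64)
for address, target in [(0x40, 0x10), (0x44, 0x20)]:
    small_btb.update(address, target)
    large_btb.update(address, target)

# 0x40 and 0x44 share a slot in the small table, so 0x40 was overwritten there,
# but the large table still remembers it.
print(small_btb.predict(0x40), two_level_predict(small_btb, large_btb, 0x40))
```

In this sketch the second buffer acts as a safety net: a branch pushed out of the small table can still be predicted from the large one.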

Another efficiency improvement involves software loops. A loop is a string of instructions that the software repeats as it executes. It may come up at regular intervals or intermittently. With loops, branch prediction becomes unnecessary -- one instance of a particular loop should execute the same way as every other. Intel designed Nehalem chips to recognize loops and handle them differently from other instructions.

Microprocessors without loop stream detection tend to have a hardware pipeline that begins with branch predictors, then moves to hardware designed to retrieve -- or fetch -- instructions, decode the instructions and execute them. Loop stream detection can identify repeated instructions, bypassing some of this process.

Intel used loop stream detection in its Penryn microprocessors. Penryn's loop stream detection hardware sits between the fetch and decode components. When the Penryn chip's detector discovers a loop, the microprocessor can shut down the branch prediction and fetch components. This makes the pipeline shorter. But Nehalem goes a step further. Nehalem's loop stream detector sits after the decode stage. When it sees a loop, the microprocessor can shut down branch prediction, fetch and decode, leaving the loop stream detector to send the already-decoded instructions out from its buffer.
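To see how much work each approach saves, consider this simplified tally. It is a sketch under stated assumptions, not a model of the real chips: the function and stage names are invented here, the detector is assumed to lock onto the loop after one full pass, and the loop is assumed to fit in the detector's buffer.

```python
def front_end_work(loop_length, iterations, detector="none"):
    """Count how many instructions pass through each stage while a loop runs.

    detector: "none", "before_decode" (Penryn-like) or "after_decode" (Nehalem-like).
    Assumes the detector captures the loop during the first pass and replays it
    for every later pass.
    """
    if detector not in ("none", "before_decode", "after_decode"):
        raise ValueError("unknown detector: " + detector)
    total = loop_length * iterations
    first_pass = loop_length if iterations > 0 else 0
    work = {"predict_and_fetch": total, "decode": total, "execute": total}
    if detector in ("before_decode", "after_decode"):
        # The detector replays the loop, so prediction and fetch can rest.
        work["predict_and_fetch"] = first_pass
    if detector == "after_decode":
        # The buffer holds decoded instructions, so the decoders can rest too.
        work["decode"] = first_pass
    return work

for design in ("none", "before_decode", "after_decode"):
    print(design, front_end_work(10, 100, design))
```

For a 10-instruction loop that runs 100 times, every design still executes 1,000 instructions. The difference is in how often the earlier stages have to do anything, which is where the savings come from.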

The improvements to branch prediction and loop stream detection are all part of Intel's "tock" strategy. The transistors in Nehalem chips are the same size as Penryn's, but Nehalem's design makes more efficient use of the hardware.

Next, we'll take a look at how Nehalem microprocessors handle data streams.