Introduction
Wide Dynamic Execution, Advanced Digital Media Boost, Smart Memory Access and Advanced Smart Cache; those are the technologies that according to the marketing people at Intel enable Intel to build the high performance, low energy CPUs using the new Core architecture.Of course, as an AnandTech Reader, you couldn't care less about which Hyper Super Advanced Label the marketing folks glue on their CPUs. "Extend the digital lifestyle by combining robust performance with low power consumption" could have been another marketing claim for the new Core architecture, but VIA already cornered that sentence for its C7 CPUs. The marketing slogans for Intel's Core and VIA's C7 are almost the same; the architectures are however vastly different.
No, let us find out what is really behind all this marketing hyper-talk, and preferably compare it with the AMD "K8" (Athlon 64, Opteron) architecture of Intel's NetBurst and Pentium M processors. That is what this article is all about. We talked to Jack Doweck, the engineer who designed the completely new Memory Reorder Buffer and Memory disambiguation system. Jack Doweck is one of the Intel Israel Development Center (IDC) architects.
The Intel "P8"
Intel marketing states that Core is a blend of P-M techniques and NetBurst architecture. However, Core is clearly a descendant of the Pentium Pro, or the P6 architecture. It is very hard to find anything "Pentium 4" or "NetBurst" in the Core architecture. While talking to Jack Doweck, it became clear that only the prefetching was inspired by experiences with the Pentium 4. Everything else is an evolution of "Yonah" (Core Duo), which was itself an improvement of Dothan and Banias. Those CPUs inherited the bus of the Pentium 4, but are still clearly children of the hugely successful P6 architecture. In a sense, you could call Core the "P8" architecture, with Banias/Dothan being based on the "P7" architecture. (Note that the architecture of Banias/Dothan was never given an official name, so we will refer to it as "P-M" for simplicity's sake.)Of course this doesn't mean that Intel's engineers just bolted a few functional units and a few decoders on Yonah and called it a day. Jack told us that Woodcrest/Conroe/Merom are indeed based on Yonah, but that almost 80% of both the architecture and circuit design had to be redone.
CPU architecture in a nutshell
For those of you who are not so familiar with CPUs, we'll start with a crash course in CPU architectures. To understand CPU design, you must first look at the instructions that are sent to the CPU, and thus we start with the software.Typical x86 software code consists of about 50% stores and loads, and there are about twice as many loads as there are stores. Of the remainder, about 15 to 20% of the instructions are branches (If, Then, Else), and the rest are mostly "ADD" (addition) and "MUL" (multiply) instructions. Only a very small percentage of code consists of more exotic instructions such as DIV (divisions), SQRT (square root), or other higher order math (e.g. trigonometric functions).
All these instructions are processed in a typical "Von Neuman" pipeline: Fetch, Decode, Operand Fetch, Execute, Retire.
Instructions are fetched based on the instruction pointer register, and initially they are nothing but long bit patterns to the CPU. It's only after the CPU starts decoding the bits that the instructions "start to make sense" to the CPU. Addresses and opcodes are decoded out of the instructions, and the addresses are used for the next step: the operand fetch. As you don't want the CPU to perform calculations with the addresses but rather on the content of these addresses - the "operands" - the CPU has to fetch the right data out of the data cache. Once these operands are put in the registers, the ALU is steered by the "opcode" (which has been decoded) to perform the right calculation on the operands in the registers.
The results are written to the architecture register file, the registers which can be used by the compiler. The results must also be written to the caches and the main memory, so that these are also up to date. That is the final phase, the retire phase. That is the basically how processing works in all CPUs.
The main challenge for the CPU designer today is the average memory latency the CPU sees. A Pentium 4 3.6 GHz with DDR-400 runs no less than 18 times faster than the base clock of the RAM (200 MHz). Every cycle the memory is being accessed, a minimum of 18 cycles pass on the CPU. At the same time, it takes several cycles to even send a request, and it takes a few cycles to send a request back. (We discussed this in the past in our overview of memory technology article.) The result is that wait times of 200 to 300 cycles are not uncommon on the Pentium 4. The goal of CPU cache is to avoid accessing RAM, but even if the CPU only has to go to system memory 4% of the time, that 4% of the time can lower performance significantly.
87 Comments
View All Comments
thestain - Friday, May 5, 2006 - link
Larger Cache and an extra decoder were bound to help Conroe in the small and simple tesing done by most benchmarks.But, what about applications that are a bit larger than Conroe's cache size or those that are complex causing the simple decoders to not be able to be used that much while placing the single complex decoder on the Conroe into short supply?
Mike
IntelUser2000 - Friday, May 5, 2006 - link
LOL. You crack me up. Go see how much doubling of L2 caches help to increase performance. I guess the last 5 years of Netburst screwed people's mental abilities. Sure the caches will help Conroe, but if the CPU doesn't really need the extra cache, then it will be a waste. Kinda like how doubling L2 caches on Pentium D doesn't help a lot. Kinda like how doubling L2 caches on Athlon 64's don't help either. It's why Semprons excel.
About decoders. Guess you are still in the old ages where one of the reasons K7 was better than P6 was because it has the ability to decode complex instructions in all decoders. If you read about Conroe, more of the instructions that USED to go to the complex decoders can now go to the simple decoders.
And which benchmark would that be.
Guess there is gonna be a lot of AMD fanboys that are gonna cry when Conroe is shown.
stopkidding - Tuesday, May 2, 2006 - link
Did anyone notice that this comment thread is virtually free of the usual "Intel-this, AMD-that" comments that usually are seen on this site. The "fanbois" have nothing to bitch about as their little brains can't comprehend whats written in this article! :-)Reynod - Wednesday, May 3, 2006 - link
Which is a sigh of relief I must say. I can swallow hard facts and interpret code ... my 4400+ looks like going in as my new server box ... and my next gaming box looks like being an OC'd Conroe. I just won't buy an Intel Mobo ... heh heh.mino - Tuesday, May 2, 2006 - link
IMHO not, the article is extremely well written AND there are NO benchmarks => Intelman is happy from the text; AMD-man is hoping the real numbere won't be so bad...On topic, article is written in a very good style for general public.
On thing I'am afraid of is the moment code is optimized for Core, any other irchitecture would take a performance hit. K8 the smallest one, PM/K7 the small one, P4/P6 the big one and all older plus C7 e pretty huge hits.
That bothers me.
Except that, AMD will live for a long time (Opteron alone would survive them for 5+ yrs.) and X2's will be finally cheaper. What else to pray for :)
Best regards.
mino - Tuesday, May 2, 2006 - link
addennum:"the article is extremely well written FOR GENERAL AT AUDIENCE"
Otherwise job well done Johan.
nullpointerus - Tuesday, May 2, 2006 - link
Right, but then why do they respond to the other articles?JustAnAverageGuy - Tuesday, May 2, 2006 - link
Another top notch article, as always, Johan- JaAG
dguy6789 - Tuesday, May 2, 2006 - link
Thank you for writing this article. You have cleared up a large quantity of questions that I had in relation to the Core architecture.Betwon - Tuesday, May 2, 2006 - link
sub eax,[edi+ebx+79]There are 3 registers used: eax, edi, ebx
For Core duo, it decodes to one fusion-micro-op.
In the reservation station (RS), only one entry is needed to be allocated. There are three registers spaces in one RS entry at least. And the results of address(edi+ebx+79) can be w rited back into the same position of one register in this RS entry.(A replace method)
For K7/K8, it decodes to one macro-op?
In the reservation station (RS), only one entry is needed to be allocated? There are three registers spaces in one RS entry?
It can take one entry in ROB.
But I don't believe AMD. It may take two entrys in RS, because there are only two registers spaces in one RS entry of K8. K8 hasn't three registers spaces in one RS entry.
K8's RS is up to 8X3 macro-op, but not means that one macro-op can always take one entry in the RS.
I say that I don't believe AMD.
Of course, Other people have on need to believe me too.