AMD Core Counts and Bulldozer: Preparing for an APU World
by Anand Lal Shimpi on November 30, 2009 12:00 AM EST- Posted in
- CPUs
The New Way to Count Cores
Henceforth AMD is referring to the number of integer cores on a processor when it counts cores. So a quad-core Zambezi is made up of four integer cores, or two Bulldozer modules. An eight-core would be four Bulldozer modules.
A hypothetical quad-core Bulldozer. Presumably the L3 cache would be shared by both modules.
A hypothetical eight-core Bulldozer. Presumably the L3 cache would be shared by all four modules.
It's a distinct shift from AMD's (and Intel's) current method of counting cores. A quad-core Phenom II X4 is literally four Phenom II cores on a single die, if you disabled three you would be left with a single core Phenom II. The same can't be said about a quad-core Bulldozer. The smallest functional block there is a module, which is two cores according to AMD.
Better than Hyper Threading?
Intel doesn't take, at least today, quite aggressive of a step towards multithreading. Nehalem uses SMT to send two threads to a single core, resulting in as much as a 30% increase in performance:
The added die area to enable HT on Nehalem is very small, far less than 5%.
AMD claims that the performance benefit from the second integer core on a single Bulldozer module is up to 80% on threaded code. That's more than what AMD could get through something like Hyper Threading, but as we've recently found out the impact to die size is not negligible. It really boils down to the sorts of workloads AMD will be running on Bulldozer. If they are indeed mostly integer, then the performance per die area will be quite good and the tradeoff worth it. Part of the integer/FP balance does depend on how quickly the world embraces computing on the GPU however...
According to AMD's roadmaps, Zambezi will use either 4 or 8 Bulldozer cores (that's 2 or 4 modules). The quad-core Zambezi should have roughly 10 - 35% better integer performance than a similarly clocked quad-core Phenom II. An eight-core Zambezi will be a threaded monster.
No GPU, for Now
The first APU from AMD will be Llano, but based on existing Phenom II cores. The move to a new manufacturing process combined with the first monolithic CPU/GPU is enough to do at once, there's no need to toss in a brand new microarchitecture at the same time.
AMD did add that eventually, in a matter of 3 - 5 years, most floating point workloads would be moved off of the CPU and onto the GPU. At that point you could even argue against including any sort of FP logic on the "CPU" at all. It's clear that AMD's design direction with Bulldozer is to prepare for that future.
In recent history AMD's architectural decisions have predicted, earlier than Intel, where the the microprocessor industry was headed. The K8 embraced 64-bit computing, a move that Intel eventually echoed some years later. Phenom was first to migrate to the 3 level cache hierarchy that we have today, with private L2 caches. Nehalem mimicked and improved on that philosophy. Bulldozer appears to be similarly ahead of its time, ready for world where heterogenous CPU/GPU computing is commonplace. I wonder if we'll see a similar architecture from Intel in a few years.
94 Comments
View All Comments
Calin - Monday, November 30, 2009 - link
Only on instructions per clock - the K6-3 was available at 400 and 450 MHz, while the Pentium !!! was available at (much later) up to 1300 MHz.However, the K6-3 was in the competition against Pentium !!! at up to 550-600 MHz, as the original K7 appeared around those times.
mino - Sunday, January 17, 2010 - link
K6-2+ and K6-3 were PII competitors that were able to outperform PII and even early PIII purely thanks to ON-DIE L2 and 3Dnow.L3 was on motherboard back then, was slow, and had little to do with K6-2+/3 performance gains.
Also K6-2+/3 was the top AMD CPU for a very short time as K7 came righ afterward.
K6-2+/3 was the notebook & low cost bussiness desktop solution of the times while K6-2 (without cache) was the budget solution.
medi01 - Monday, November 30, 2009 - link
Was it shared?Zool - Monday, November 30, 2009 - link
Any info on the amount of transistors for each module ?At least they can make a decent notebook cpu from the modules. Mobile nehalem and phenons are everything with 4 cores just not low power notebook cpu-s.
Zool - Monday, November 30, 2009 - link
Something like 1 module no L3 cache for netbooks,notebooks.2 modules with L3 cache for average notebooks and 4 or more modules for desktop replacement notebooks. They could play with the cache sizes too.
The 1 module no L3 cache for example would kill atom in performance. And the power usage could be still quite good. There is no meaning for 2-4 W slow cpu in netbook when the mainboard,hdd,display eats several times more electric power together than the cpu itself.
JVLebbink - Monday, November 30, 2009 - link
I am wondering about the shared FPU. Does one thread really have to be purely integer for the other thread to use the 2 FMACs at the same time, or can one thread use the 2 FMACs if the other one is currently not sending FPU instructions. "if one thread is purely integer, the other can use all of the FP execution resources to itself." sounds like the first, but that would (1) waste FPU resources, (2) give problems with threads switching cores. (how does the not witched thread know it can now use all of the FPU resources?)My presumption is that AMD chose the former and that if 2 (FMAC) instructions of different 'cores' reach the FPU (at that point in time) they will be executed using an FMAC each and if 2 instructions of one core reach the FPU without the other core sending any (at that point in time) they will be executed in parallel using both FMACs.
If my presumption is correct AMD decided not to HyperThread their ALUs but did HyperThread their FPU.
Kiijibari - Monday, November 30, 2009 - link
You do not have to switch anything.The shared FPU has ONE common scheduler. Both threads can issue Ops into the scheduler queue. If there are no FPU Ops from the first thread then - of course - the second thread has the power of the whole FPU.
Very simple ... it's like a chat room. Several people type in messages and you can see it serialized in the chat window (that would be equivalent to the queue).
You will read the messages one after another, according to their posted time / when they were issued. It is the same for the Bulldozer FPU.
You do not need to switch from one chat member to another to read their messages. Neither does the FPU has to switch ;-)
kobblestown - Monday, November 30, 2009 - link
I hope AMD regain their common sense and use the term "core" in amore conventional sense. According to their definition a quad core will only have 2 FP pipelines. What I see in a quad core Bulldozer is dual core with 2 Int and 1 FP pipeline each. I wish them the very best in their effort to regain the performance crown but abusing existing terminology will not help in that.Zool - Monday, November 30, 2009 - link
OS , drivers, API layers trashing the cpu constantly are usualy integer loads. For average math in code the curent fpus are fast enough. Things that realy need paralel FP performance (multimedia things,graphic) are using SIMD SSE units and those things should run much faster on gpus anyway. For 5% extra die area the extra integer pipeline rocks.fitten - Tuesday, December 1, 2009 - link
Except it isn't 5% additional die area.