Investigations into Socket 939 Athlon 64 Overclocking
by Jarred Walton on October 3, 2005 4:35 PM EST - Posted in CPUs
RAM BIOS Settings
The final area that will need attention with overclocking is the RAM. The type of RAM used will dictate the approach to overclocking that RAM.
Winbond BH5/CH5 can usually be set to 2-2-2-7 1T timings and then you can simply adjust the voltage levels to try and make it stable. (Technically, the RAM should be able to run 2-2-2-5 1T, but Wesley's tests have shown 7 for tRAS to be the optimal setting, so that's what we used.) You can also increase the timings above 2-2-2-7 1T, though most Winbond users are more interested in keeping "tight" timings. Samsung TCCx will usually run at 2-2-2-7 1T at DDR400 speeds, but increasing the memory speed will definitely require looser timings and possibly more voltage as well. The above screenshot shows the settings that we used for 9x300. You can refer to our memory articles for more detail on how individual brands of RAM perform. If you want to know a bit more about what the timings actually mean, we covered that in the past as well. The simple fact of life is that memory is variable in quality, so while most RAM can run at the specified timings and speed, how far above that you can go is not going to be the same from DIMM to DIMM. It will require time and effort to determine how fast your RAM can run and what timings and voltages will be required - and the motherboard will affect this as much as anything.
Even if you select the slowest possible timings, some DIMMs won't be able to run much above DDR400 speeds. Value RAM in particular is often what is left after all the high speed blanks are binned out, which is why it can be as cheap as half the cost of quality RAM. The above BIOS screens show our settings for 9x300 with the value and performance RAM - you can see that we had to drop to DDR266 vs. DDR333, since not even the OCZ ran optimally at DDR600. (At least, that's what we needed on this particular board with this particular set of DIMMs. 3-4-4-8 with 2T command rate worked, but was slower than DDR333 2.5-3-3-8 with 1T.) If you're going for top performance, value RAM is a poor choice. If, on the other hand, you want to save money, $85 RAM is a lot more attractive than $150+ RAM. The trick to using value RAM is that you basically have to keep it at or below DDR400 speeds, and this is accomplished by the use of memory ratios. There's a stigma against using higher memory ratios, because it "negates the performance gain." That can be true on some platforms, but it doesn't seem to affect Athlon 64 quite as much due to the integrated memory controller. Let's talk about why.
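As a rough illustration of how a memory ratio sets the actual RAM speed at 9x300, here is a minimal Python sketch. It rests on an assumption that this page does not spell out: that the memory clock on Athlon 64 is the CPU clock divided by a whole-number divider, rounded up so the RAM never runs above the nominal ratio. The function name `k8_memory_clock` and the 5/6 and 2/3 ratios for the DDR333 and DDR266 settings are hypothetical choices for illustration, not values taken from the BIOS screens.

```python
from fractions import Fraction
from math import ceil


def k8_memory_clock(htt_mhz, cpu_multiplier, ratio):
    """Return (divider, memory clock in MHz) for a given memory ratio.

    Assumption (not stated in the article): the memory clock is the CPU
    clock divided by an integer, rounded up from multiplier / ratio.
    """
    cpu_mhz = htt_mhz * cpu_multiplier
    divider = ceil(Fraction(cpu_multiplier) / Fraction(ratio))
    return divider, cpu_mhz / divider


# Hypothetical ratio table: BIOS memory setting -> fraction of the CPU bus clock.
RATIOS = {
    "DDR400": Fraction(1, 1),
    "DDR333": Fraction(5, 6),
    "DDR266": Fraction(2, 3),
}

for setting, ratio in RATIOS.items():
    divider, mem_mhz = k8_memory_clock(300, 9, ratio)
    print(f"9x300 at {setting} setting: CPU/{divider} = {mem_mhz:.1f} MHz "
          f"(DDR{2 * mem_mhz:.0f})")
```

Under this model, the DDR333 setting at 9x300 lands the RAM a little below DDR500 and the DDR266 setting a little below DDR400, which is in line with keeping value RAM at or below DDR400 speeds.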
Above is a rough diagram of a typical Pentium 4 motherboard. The CPU talks over the FSB (Front Side Bus) to the NB (North Bridge of the chipset), and the NB then talks over the memory bus to the RAM. The memory controller is part of the NB, and it runs at the system bus speed. With an 800FSB, the base clock speed is actually only 200MHz, but data is sent four times per clock cycle. With modern CPUs running at four or more times the FSB speed (i.e. 3.2 GHz and above), making the CPU wait for data from memory can severely impact performance. It may help to think of the quad-pumping of the FSB as a four-car train leaving every cycle, with the caveat that the train must be either completely full or completely empty. If the CPU is waiting for data from RAM and only three parts are ready, it has to wait an extra FSB cycle. At 200MHz, an FSB cycle could be as many as 19 CPU cycles (for the 580/680)! Here's where memory ratios come into play.
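The arithmetic behind the "extra FSB cycle" penalty can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the 3.8 GHz clock is an assumption, chosen because it is the speed implied by 19 CPU cycles per 200MHz FSB cycle.

```python
def effective_fsb_rate(fsb_base_mhz=200, transfers_per_clock=4):
    """Marketing FSB rating: base clock times transfers per clock (quad-pumped)."""
    return fsb_base_mhz * transfers_per_clock


def cpu_cycles_per_fsb_cycle(cpu_clock_mhz, fsb_base_mhz=200):
    """How many CPU clock cycles pass during one base FSB clock cycle."""
    return cpu_clock_mhz / fsb_base_mhz


print(effective_fsb_rate())            # 200MHz base, quad-pumped
print(cpu_cycles_per_fsb_cycle(3200))  # a 3.2 GHz CPU on a 200MHz base FSB
print(cpu_cycles_per_fsb_cycle(3800))  # assumed 3.8 GHz CPU
```

The faster the CPU runs relative to the FSB base clock, the more CPU cycles are lost each time a "train" has to wait one more FSB cycle.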
On the P4 chipsets, the NB talks to the CPU at one speed, and it can talk to RAM at a slower or faster speed. If it talks to RAM faster than it talks to the CPU (i.e. DDR2-533 or DDR2-667 with FSB800), there will be fewer "empty trains" going to the CPU. If it talks slower to the RAM than the FSB, however, there will be more empty slots on the FSB. When we look at Pentium overclocking in the future, we'll cover this more, but the basic idea is that you want the RAM to run at FSB speed or faster if at all possible. Dropping even to DDR333 (which is actually DDR320 for most Intel chipsets) can cause a 5 to 10% performance loss, and DDR266 would be even worse. So, why is Athlon 64 different?
The key is in the integrated memory controller. All modern processors run at much faster clock speeds than memory, which is why we have such things as L1 and L2 cache on the CPUs. Even with cache, though, getting data from RAM faster will improve performance substantially. By eliminating the NB from the CPU to RAM path, latencies can be reduced significantly, and this is the key change that AMD made from the K7 to K8 cores (besides the 64-bit extensions that still go unused by the majority of PC users, of course). Clock for clock, the integrated memory controller of K8 makes the Athlon 64 roughly 25% faster than Athlon XP, which is a significant performance increase given the relatively small amount of die space that was required. The integrated memory controller also affects performance when running at other RAM ratios, though. Since there is no NB to CPU delay when going from RAM to CPU, and since the memory controller runs at CPU core speed, once the data arrives from RAM to the memory controller, it is available. On a P4, the use of lower asynchronous memory speeds only serves to increase the delays in accessing RAM.
An interesting corollary to the above discussion is in the Northbridge link of AMD vs. Intel. On AMD, the Northbridge is connected via the HyperTransport link. Running at 1000 MHz base (double-pumped makes it equivalent to 2000 MHz), the HT bus is 16-bits wide (2 bytes) with dedicated upstream and downstream links. All told, that gives:
2000 MHz * 2 bytes = 4000 MB/s
The total HT bandwidth is thus 8000 MB/s - 4000 MB/s in each direction. Here's the catch, though: this bandwidth is separate from the memory bandwidth. If you've ever tried overclocking the HT bus speed and found the results to be of little help, performance-wise, that should hopefully clear things up. There is rarely (if ever) 4 GB/s of data in either direction between the CPU and Northbridge. It's like putting an eight-lane highway through a rural farming community; since there's very little traffic to begin with, the extra lanes (bandwidth) won't help much. That's not to say this is a poor design decision - better to have too much available bandwidth than not enough! The P4 design doesn't suffer from a lack of bandwidth either; the problem is that the time in which it takes to get data from the RAM to the CPU is quite a bit longer - in other words, higher latencies.
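The HyperTransport figures above can be reproduced with a short Python sketch. The function name and parameter names are ours; the default values are the ones quoted in the text (1000 MHz base, double-pumped, 16 bits wide, separate upstream and downstream links).

```python
def ht_bandwidth_mb_s(base_clock_mhz=1000, transfers_per_clock=2, link_width_bits=16):
    """Return (per-direction, total) HyperTransport bandwidth in MB/s."""
    effective_mhz = base_clock_mhz * transfers_per_clock  # double-pumped
    bytes_per_transfer = link_width_bits // 8             # 16 bits = 2 bytes
    per_direction = effective_mhz * bytes_per_transfer
    # Upstream and downstream links are dedicated, so the total is twice that.
    return per_direction, 2 * per_direction


per_direction, total = ht_bandwidth_mb_s()
print(f"{per_direction} MB/s each way, {total} MB/s total")
```

Changing `base_clock_mhz` shows how an HT overclock scales the link bandwidth, even though, as noted above, that extra bandwidth rarely translates into performance.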
The net result is that while a lower RAM clock speed can still impact performance on Athlon 64 systems, it will not do so as much as on a P4 style configuration. That's the theory, anyway, and we hope to support it with results from several planned overclocking articles.
To illustrate the above points, we've created a brief list of cycle rates and timings of RAM. We'll start with some common RAM speeds, but the actual speed of your RAM will depend on the CPU multiplier and CPU bus speed, among other things. With the clock speeds and timings, we can also calculate the latency and estimate the total memory latency. Remember that the RAM timings are relative to the base bus speed, not the doubled speed, so DDR400 has a 200MHz base clock speed. There is also latency involved internally in the CPU (typically at least a single RAM cycle each way is spent in the memory controller), as well as in sending the address request across the memory bus and receiving the data back across the bus to the CPU (we guessed at two RAM cycles each way). Command rate adds another cycle, and maybe we're wrong or missed some other potential delays.
Everything takes time, and the memory timings generally only reflect the delays caused internally by the RAM. Different clock speeds and timings should produce a different total latency, and we estimate the total memory subsystem latency below. We'll check the actual latency in our benchmarks later. In practice, going from DDR200 to DDR400 RAM will not cut the RAM latency in half, even at the same timings. For reference, our estimated latency is calculated as:
7 * (Cycle Latency) + CL + 0.1 * tRCD + 0.05 * tRP
It would be different for other platforms, and it's probably off by as much as 20%, we'd guess. The "7" represents the command rate, CPU to memory controller delays, and memory controller to RAM delays. We did use some tests to try to come up with a good estimate, but take the estimated latency in the following table with a serious dose of skepticism. For a P4 platform, the major change is that the inclusion of the NB will change the "7" factor to something much larger - perhaps 14 to 20. (We'll look at that in a future article.)
RAM Speeds, Timings, and Theoretical Latencies
RAM Rating | Base Speed (MHz) | ns per cycle | CAS | tRCD | tRP | tRAS | CL (ns) | tRCD (ns) | tRP (ns) | tRAS (ns) | Estimated Latency (ns) |
DDR266 | 133.33 | 7.5 | 2 | 2 | 2 | 5 | 15 | 15 | 15 | 37.5 | 69.75 |
DDR266 | 133.33 | 7.5 | 2.5 | 3 | 3 | 7 | 18.75 | 22.5 | 22.5 | 52.5 | 74.63 |
DDR266 | 133.33 | 7.5 | 3 | 4 | 4 | 8 | 22.5 | 30 | 30 | 60 | 79.5 |
DDR333 | 166.67 | 6 | 2 | 2 | 2 | 5 | 12 | 12 | 12 | 30 | 55.8 |
DDR333 | 166.67 | 6 | 2.5 | 3 | 3 | 7 | 15 | 18 | 18 | 42 | 59.7 |
DDR333 | 166.67 | 6 | 3 | 4 | 4 | 8 | 18 | 24 | 24 | 48 | 63.6 |
DDR400 | 200 | 5 | 2 | 2 | 2 | 5 | 10 | 10 | 10 | 25 | 46.5 |
DDR400 | 200 | 5 | 2.5 | 3 | 3 | 7 | 12.5 | 15 | 15 | 35 | 49.75 |
DDR400 | 200 | 5 | 3 | 4 | 4 | 8 | 15 | 20 | 20 | 40 | 53 |
DDR450 | 225 | 4.44 | 2 | 2 | 2 | 5 | 8.89 | 8.89 | 8.89 | 22.22 | 41.33 |
DDR450 | 225 | 4.44 | 2.5 | 3 | 3 | 7 | 11.11 | 13.33 | 13.33 | 31.11 | 44.22 |
DDR450 | 225 | 4.44 | 3 | 4 | 4 | 8 | 13.33 | 17.78 | 17.78 | 35.56 | 47.11 |
DDR500 | 250 | 4 | 2 | 2 | 2 | 5 | 8 | 8 | 8 | 20 | 37.2 |
DDR500 | 250 | 4 | 2.5 | 3 | 3 | 7 | 10 | 12 | 12 | 28 | 39.8 |
DDR500 | 250 | 4 | 3 | 4 | 4 | 8 | 12 | 16 | 16 | 32 | 42.4 |
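The estimated latency column in the table can be reproduced with the short Python sketch below. It applies the formula given earlier with every term converted to nanoseconds, which is how the table values work out (for example, DDR400 at 2-2-2 gives 46.5). The function names are ours, and the "7" overhead factor is the article's own rough guess, so treat the output with the same skepticism as the table.

```python
def timing_ns(cycles, base_clock_mhz):
    """Convert a timing given in RAM clock cycles to nanoseconds."""
    return cycles * 1000.0 / base_clock_mhz


def estimated_latency_ns(base_clock_mhz, cas, trcd, trp, overhead_cycles=7):
    """7 * cycle time + CL + 0.1 * tRCD + 0.05 * tRP, all in nanoseconds."""
    return (timing_ns(overhead_cycles, base_clock_mhz)
            + timing_ns(cas, base_clock_mhz)
            + 0.1 * timing_ns(trcd, base_clock_mhz)
            + 0.05 * timing_ns(trp, base_clock_mhz))


# Base clocks use exact thirds for DDR266/DDR333 (133.33 and 166.67 MHz).
SPEEDS = {"DDR266": 400 / 3, "DDR333": 500 / 3, "DDR400": 200, "DDR450": 225, "DDR500": 250}
TIMINGS = [(2, 2, 2), (2.5, 3, 3), (3, 4, 4)]  # (CAS, tRCD, tRP)

for rating, base in SPEEDS.items():
    for cas, trcd, trp in TIMINGS:
        latency = estimated_latency_ns(base, cas, trcd, trp)
        print(f"{rating} {cas}-{trcd}-{trp}: {latency:.2f} ns")
```

Passing a larger `overhead_cycles` value (the text suggests perhaps 14 to 20 for a P4 platform) gives a feel for how much the extra Northbridge hop could add.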
Even with all this information, we haven't even come close to touching on every BIOS option or memory timing that's available. That is, simply put, beyond the scope of this article. There are over a dozen memory timings listed in the DFI BIOS, and we left the majority of them set to Auto. The only items that we changed are CAS, tRCD, tRP, tRAS, and CMD. We may try to look at the other options in the future, but most of the extra settings are not available on the majority of the motherboards, and we want the information contained in this OC Guide to be applicable to more than just the "elite" motherboards.
101 Comments
Crassus - Tuesday, October 4, 2005 - link
First of all, thank you for such a long article. I appreciate the work you put into this. What I'd really like to see in one of the planned articles would be an in-depth coverage of the options an enthusiast-grade mainboard BIOS offers nowadays for the RAM timings (and maybe PCIe) - beyond the standard timings covered in this article.
PrinceGaz - Tuesday, October 4, 2005 - link
The finer memory-timings offered by enthusiast mobos are generally vendor specific so your best bet is to check a forum or other site dedicated to your motherboard. For DFI mobos for instance, you can find a thread which gives detailed coverage of memory settings on DFI-Street forums here: http://www.dfi-street.com/forum/showthread.php?t=2...
CheesePoofs - Tuesday, October 4, 2005 - link
Why stability test with 3dmark (an app that tries to stress the CPU as little as possible) and pcmark (an ok pc-stressing app) instead of the combo of memtest86+, superpi, and prime95? Seems to me that if you want to find out whether your CPU really is stable, you'd want to stress it as hard as possible (which those three will do).
Also, from what I've read from Zebo's thread in the CPU forums, 2T really doesn't have a significant impact on performance. Could you clarify this?
JarredWalton - Tuesday, October 4, 2005 - link
I've seen systems that run Prime95 and SuperPi 100% stable crash under 3DMark looping, as well as under PCMark. I imagine 2.80 GHz will crash under those if I run them all concurrently. My personal experience is that SuperPi and Prime95 only stress a few paths of the CPU, hence the inclusion of benchmarks with 11 different applications that can all fail with an unstable overclock. 3DMark GPU tests are not as demanding of the CPU, but the CPU tests are very demanding IMO. (That's part of why the top scores on the 3DMark ORB never include the CPU tests.)
2T command rate, as you can see in quite a few instances, really killed performance. Perhaps tweaking other special timings beyond CL, tRCD, tRP, and tRAS might make the impact less, but you could likely tweak the same things with 1T at a lower memory speed. Command rate comes into play on every single memory access, so doubling that delay will certainly have an impact on performance.
fitten - Tuesday, October 4, 2005 - link
Good answer. Most have no clue as to how a CPU actually works. Ideally, a synchronous circuit is rated at a clock speed that the longest path will function properly (give correct results). There may be 1000s of pathways that can run at higher frequencies but that one can hold it back. Running the clock rate up may cause that one pathway not to be able to meet something like a data setup and hold time on one line (of the 32 or 64) in the data path and now you have an unstable setup that you may not detect. As always with overclocking, a crash is the best result you can get because you know you've pushed too far. Unless you are testing pretty much every instruction with every possible data against a control to compare against (some pathways can take longer depending on the data that it is being operated on), there are many errors that you may not detect... and all it takes is one, out of the possible billions, to make your machine not stable. Sure, it may be a rarely seen case of instruction+data but it exists.
Programs like the Pi calculators and such do make your CPU work a lot, but the calculations are fairly repetitive and hardly a broad sample of the ISA.
I'm all for doing whatever you want with your own machine. Heck, I used to overclock all the time, too. I just find all of the lack of knowledge in synchronous circuits... interesting... when people talk about overclocking.
Saist - Monday, October 3, 2005 - link
for those who read this portion here:
****
Because of the GPU limitation, we're going to be testing at 640x480, 800x600, and 1024x768. We'll also test many of the titles with 4xAA enabled, which should serve as a reality check. Even with a super fast CPU, many games are going to be completely GPU limited with the X800 Pro when we run 4xAA, especially at resolutions 1024x768 and above. Frankly, we wouldn't bother enabling 4xAA unless you can at least reach 1024x768 anyway.
****
Did anyone else think... okay.. lets stick a Radeon 9600, GeforceFX, or XGI Volari in there so that we actually will be limited? I mean... please. X800 alone goes above what most users have in their systems today. If we are buying "new" components, then yeah, the X800 is on my short list, but how about doing some reviews over hardware people actually have in their hands.
OvErHeAtInG - Tuesday, October 4, 2005 - link
If you're overclocking a new A64 Venice... somehow I think you're not still running your XGI Volari for games. Remember bench numbers are really only useful if they reflect framerates you would actually want to play with.
JarredWalton - Tuesday, October 4, 2005 - link
The reason I used an X800 Pro is because I feel it's a good match for the chip, RAM, and motherboard. I can toss in a 7800GTX to show what the CPU on its own is capable of, but you can get cards that pretty much equal the X800 Pro for under $200. X800 GTO and GTO2 can match and even beat the X800 Pro.
I view overclocking (and computer building in general) from a bang-for-the-buck perspective. It doesn't make sense to me to spend $100 upgrading from the 3000+ to the 3500+ if I'm going to be completely GPU limited. $200 on a graphics card is not that much money, when you really get down to it. 180 million transistor chip with 256MB of 980MHz RAM, all mounted on a large PCB? At least I can feel I'm getting a lot of stuff for $200. A CPU is far cheaper to produce (though more expensive to design). Profit margins on CPUs are notoriously high.... Personally, the X800 Pro is a decent card, but I really want something faster these days. Same goes for the 6800GT. But then, not everyone feels that way.
---------
Thought #2 (for Saist): If X800 is above what most people have, other than those buying new computers... well, what about the motherboard and processor? Socket 939 with nForce4 is a more recent configuration than X800/6800 cards. Not to mention Venice has only been out for something like 8 months.
If you're looking to spend $120+ on a new Venice chip and you've only got a 9600 Pro (or even a 9800 Pro), you're wasting your money on the wrong part (at least from a gaming perspective). A socket 754 Sempron with an X800 Pro would be far better for gaming than a Venice core with anything less than an X800/6800. Outside of gaming... well, graphics don't matter outside of gaming much, which is why Winstones, PCMark, and AutoGK are included.
Honestly, I'm not entirely sure if you were complaining about the use of a GPU that was too fast, or that it wasn't fast enough. For frequent gaming, I wouldn't recommend anyone go lower than about the X800 GTO these days. 6600GT is (IMO) now relegated to the budget/moderate-gaming setup, as many games are simply unplayable above 1024x768. I really don't like to drop below 1280x1024/1280x960 if I can avoid it. If I've misunderstood your complaint, let me know; if we simply have a difference of opinion... well, there's not much to do about that. :)
yanman - Tuesday, October 4, 2005 - link
any chance you can add in benches for 7800GT/GTX? after all, in your discussion you correctly assert that money is much better spent on high spec'd GPU to match the cpu speed that you've managed to overclock to - having used bargain rate ram and venice.
i have a venice 3000+ clocked at 2686mhz, 7800gt and 2x1gb sticks of average ram (legend/hynix). until i upgraded the ram a few weeks ago i had it running for perhaps a month and a half totally solid with 2x512mb sticks of same type, at 2696mhz (337x8, ram at 225mhz (2:3) 2.5-3-4-7-1T)
the reason i ask for 7800GT and GTX is 2 fold, so we can see it from an nvidia side too (different cpu scaling maybe?), and also to show the scaling for a top-end card even if only as a reference point. It just seems a bit one-dimensional only using 1 card.
One last thing, well done to Zebo who made the excellent "Quick and dirty A64 overclocking guide" (used to be sticky in the forums) which I and many people I know used to overclock their venices with.. i'd be stuck without it!
JarredWalton - Tuesday, October 4, 2005 - link
I'm planning on doing 7800GTX testing with an X2 3800+ OC article. For gaming, it will perform identically to the 3200+ Venice. Hopefully, I'll be done in the next ~week or so.