We have just acquired a shiny new FAS6280 in MetroCluster configuration, with 2x512GB PAM per controller and a total of 336 FC 15k disks, hoping for awesome performance, but now that we have set it up we are quite disappointed by its NFS performance.
We already have a MetroCluster FAS3160 (our first NetApp), and I have to say that we were surprised by its performance: it reaches 35-40k NFS v3 IOPS per controller (our application profile is 80% metadata, 12% reads, 8% writes on hundreds of millions of files) while saturating the CPU (its bottleneck under our profile), with very low latency (disks and bandwidth were OK).
We use just the NFS protocol, nothing more, nothing less, but we use it heavily and we need very high IOPS capacity (we use the storage for real, unlike many customers who never touch its limits), so we decided to move to the new FAS6280 hoping for a huge improvement in CPU performance (a powerful hexa-core X5670 versus a tiny old dual-core AMD Opteron 2218). If you look at CPU benchmark websites (like http://www.cpubenchmark.net) you can see something like 6x the raw CPU power, so we hoped for at least a 3x performance increase, knowing that the bottleneck in our configuration was just the CPU.
A bad surprise: with the same application profile, a single controller seems to barely reach 75-80k IOPS before hitting 100% CPU busy. So from our point of view, just 2x the performance of a FAS3160. The bad thing is that a FAS6280 obviously doesn't cost 2x a FAS3160...
So we tried to investigate.
The real surprise is how badly (let me say it again: badly) the CPU is employed on the FAS6280. For comparison, here is sysstat -m from our FAS3160 running near 100% CPU busy:
sysstat -m 1
ANY AVG CPU0 CPU1 CPU2 CPU3
100% 86% 70% 96% 90% 90%
As you can see, CPU usage is quite well balanced and all cores are employed. Good.
Now sysstat -m on the FAS6280 with the same application profile:
sysstat -m 1
ANY AVG CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11
99% 21% 6% 0% 0% 0% 0% 0% 0% 0% 48% 48% 58% 93%
As you can see, the FAS6280 barely uses 4-5 CPU cores out of the 12 available, and CPU11 in particular hits 100% busy very soon. We tried to optimize everything: mounting with NFS v2, v3 and v4, creating more aggregates, more volumes, FlexVols, traditional vols and so on, but nothing let us increase performance.
From my personal point of view, it seems that the FAS6280 hardware is far more advanced than the ONTAP version (8.0.1), which probably can't take advantage of the higher core count of this new family of filers, so it ends up using an advanced CPU like the X5670 as little more than an old dual-core. The X5670 core is simply faster than a 2218 core, so it obtains better performance, but far from what it could do.
I read that upcoming ONTAP upgrades should unlock more NVRAM (it can currently use just 2GB out of the 8GB installed) and cache (currently 48GB out of 96GB), and should bring better multithreading. Will these upgrades unlock some power from our FAS6280?
Another strange thing is that the filer reaches 100% CPU busy with extremely low latencies. I read that it is not recommended to push the filer above 90% CPU busy, because latencies could then increase very fast and in an unpredictable way. This sounds reasonable, but having the application at 200us latency is not useful to us at all; we just need to stay under our 12ms limit. For instance, once we touch 100% CPU busy, is it reasonable to keep increasing the load until we reach, say, 8ms average latency for the slowest op? Or could latency really explode in an unpredictable way and cause problems?
What do you think? I sincerely don't know what else to refine; I think we have tried almost everything. Can we balance CPU utilization better? What do you suggest?
Of course I can post more diag commands output if you need it.
Thanks in advance,
Welcome to the community!
Let me first tell you that you should never judge a NetApp system's load by CPU usage. It's always a bad idea to conclude that a system is busy just by looking at CPU usage. To get the best out of your NetApp system, always use multithreaded operations with the recommended settings and look at the latency values, which in your case are very good at 200us.
ONTAP ties every type of operation to a stack called a domain. Every domain is coded so that it first uses its explicitly assigned CPU, and once it saturates that CPU it shifts its load to the next one. Every domain has its own hard-coded priority; for instance, a housekeeping domain will always have lower priority than the NFS/CIFS domain, and ONTAP always makes sure that user requests take priority over internal system work.
One more thing: please don't look at the 'ANY' counter in the CPU stats; always look at 'AVG', as the 'ANY' counter always gives me a mild heart attack.
Lastly, I would say that the issue you are looking at is cosmetic. However, if you think you have a real performance problem, run a benchmark with SIO_ontap and then you can see the system's capacity, or open a support case, but I am sure you will hear the same thing from support as well.
Thanks for your answer! I considered not the ANY counter (I've already understood that it is a meaningless parameter) but the sysstat -x N 'CPU busy' parameter (I haven't posted it, but it was at 99%). Is this parameter any better for understanding a storage system's load, or is it meaningless too? How can I tell when the storage is reaching its limit? Do I just push until it dies, or is there a predictable way to know?
I mean, you are saying that the cores are employed one after another (a cool way to reduce CPU context switching!) as they get saturated. So could I use the number of cores in use as a meter of when the storage is at the end of its performance? The FAS has 12 cores; when I see 11 of them in use, can I say it's time to buy another one? Right now just 4-5 are really used, so does this mean we are far from saturation?
In a NetApp whitepaper I've read that going above the 90% CPU busy mark can have unpredictable effects on latency, which could increase explosively. Is this right? Of course we have to monitor the storage, and we need to know at least one or two months in advance when we have to add another system, because acquiring new storage is not fast.
For benchmarking purposes we prefer to install some servers recreating our application environment, to really understand how the storage performs. In the past we saw very different behaviour between benchmark results and real-world results, so we prefer this more traditional approach to avoid (already experienced) surprises.
Currently we have installed 20 dual-Xeon quad-core servers to stress the storage, and we reached 100% CPU busy on the filer. As I said, latency continues to be exceptionally low, but we need to know the real limits of the storage to understand for how many months it will give us sufficient performance. We came from EMC Celerras, and on those appliances everything is much more linear: when you reach 100% Datamover CPU you are dead and everything stops working well, so it is quite easy to calculate the limits and predict when it will saturate. Should I keep adding test servers until I reach our latency limit? How reliable is this approach? I mean, we won't make latency explode with just 1% more load on the storage, will we?
PS: Bosko, you can reset your filer's NFS statistics with nfsstat -z, run your application for a while, then run nfsstat again to see the NFS operation percentages and sizes.
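For reference, that sequence on the filer console could look like the sketch below (7-mode command names as I know them; verify on your ONTAP release):

```shell
nfsstat -z     # zero the cumulative NFS counters
# ... run the application workload for a representative period ...
nfsstat        # per-op counts and percentages: lookup, getattr, read, write, ...
```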
We see the same problem as described by Lorenzo in our environment. And if your domains really are "shifting load to the next CPU", the shifting for sure stops at a maximum of 4 (used) CPUs.
To "open a case" is a waste of time. NetApp support asks for lots of different logs and things, just to tell you at the end "it works as designed" and "all will be good with ONTAP xyz".
I was calmed down by Lovik's answer, and now I'm returning to my depression.
So how do you manage the CPU limit? Do you increase the pressure on your storage until you reach the desired latency (I must admit this approach gives me the creeps), or do you keep CPU busy steadily near 90% and avoid going higher?
If your application doesn't mind, I'd recommend disabling all the atime updates on all volumes. This should help a bit.
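On 7-mode this is a per-volume option; a sketch from the filer console (the volume name "vol_mail" is illustrative):

```shell
vol options vol_mail no_atime_update on   # stop access-time updates on this volume
vol status -v vol_mail                    # verify the option took effect
```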
In theory, and in the best case, you could kick out the MetroCluster and maybe even the cluster and use only single heads. The write overhead/penalty of a MetroCluster is there when you are really pushing those filers to the edge... well, theory, as I said.
Unfortunately, this is not an option with our (main) application. So what we did in the meantime is introduce (stats-based) latency monitoring and spread the load of the worst application, which we know causes all the trouble (synchronous I/Os), over multiple NetApp clusters. The new FAS6280 will host other stuff for now. We hope NetApp will come up with a better ONTAP release very soon, because the FAS6080 has to replace some of the other filers soon.
We started to use jumbo frames and 10GigE... but this may not free up enough CPU cycles in your case either... and we disabled dedup on all performance-relevant NetApp filers.
We also started to change our "well-known" troublemaking application. In the future we will try to use local disks for the application logs and similar things as much as possible, and write less synchronous data to the filer.
Some good answers on this topic can be found here, for instance:
http://nfs.sourceforge.net/nfs-howto/ar01s05.html (see "5.9. Synchronous vs. Asynchronous Behavior in NFS")
http://milek.blogspot.com/2010/05/zfs-synchronous-vs-asynchronous-io.html (ZFS - synchronous vs. asynchronous IO)
http://kerneltrap.org/mailarchive/linux-fsdevel/2009/8/27/6359123/thread (Linux kernel: adding proper O_SYNC/O_DSYNC)
I still don't have a 100% satisfying answer from NetApp as to why they don't spread their "domains" over all the CPU resources they have, or just give up that kind of design. Maybe there are deep WAFL secrets involved... and maybe the new SSD shelves could help then, if you can afford them.
I've already disabled atime on all vols and disabled dedup, and unfortunately, like you, I can't disable the MetroCluster and cluster (NetApp was chosen for its HA and MetroCluster features).
Unfortunately we can't change the application right now either. We are planning for it, but it's a long, long process. The application accesses nearly a billion files (millions of users) and this generates a lot of metadata traffic (lookups at 60%).
We also tried enabling jumbo frames on an LACP trunk of two 10Gig Ethernet ports, but it doesn't seem to help at all.
Our problem seems to be just the CPU; the disks are barely used (5-10% from sysstat), so I think SSD technology is not very useful for us. We just need to use all the CPU we bought.
I can't understand why NetApp decided to use hexa-core CPUs if they really use only 4-5 cores out of 12.
We actually experience a similar issue (high CPU load) with small sequential sync writes (database log writes). I'm told that the CPU load in sysstat is not the "real" CPU usage, but no one can explain to me what is shown there either... Enabling jumbo frames for smaller data blocks and disabling VLAN tagging and LACP didn't help. Disk usage 10%, CPU 100%. While our FAS is just the entry-level one and can't even match your unit's performance, the problem remains the same, so I'm looking forward to your conclusions.
Sounds like someone is a clock watcher! For starters, latency is generally the most important predictor of performance issues on a storage array. If latency rises and IOPS drop, you certainly have a problem. As it stands now, with your sub-millisecond latency, I would say you are performing exceptionally well! I understand your concern about not wanting to push your CPUs to the point where the system actually becomes slow, but don't forget about FlexShare! Take a look at TR-3459. FlexShare lets you prioritize your workload once your system becomes heavily loaded. You can give priority to the more critical operations, and the best part is you already have it on your system! If you don't, contact your local SE and they will get you the license, since it is free!
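As a rough sketch of what TR-3459 describes (the volume name and level are illustrative; check the priority command reference on your release for the exact syntax):

```shell
priority on                                   # enable FlexShare
priority set volume vol_mail level=VeryHigh   # favor the critical volume under load
priority show volume vol_mail                 # confirm the setting
```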
Metadata is mainly composed of lookup (56%), getattr (9%) and setattr (7%); then there are some creates, some removes and so on, but they are just a small part. Our application needs to access something like a billion files divided over dozens of mountpoints (our new FAS6280 is starting to be part of this). This probably causes the client lookup cache to be invalidated before the same file is recalled, which could explain the high lookup rate. Fortunately, lookup is one of the lighter operations on the storage, so this is not a problem for now. Unfortunately we can't change the application at this time: it is of course planned, but it is a long process and will take a lot of time.
Looking at how unevenly the load is distributed, I would guess that something in the IRQ (interrupt) handling isn't as good as it should be.
Is all the IO load going through one or a few network interfaces?
I've seen too many cases, especially on older Linux (pre-MSI interrupt handling), where this would load just one CPU.
Also, even with modern MSI interrupt handling, I notice that some cards (QLogic 4Gb/s) still load just the first CPU in certain cases, although all the cores are available.
In that case I would try to redistribute the load across different network cards and see if the CPU load distribution moves around; that way you could find the most optimal distribution.
You might want to (or be forced to) use a trunked connection for that, depending on your setup and your requirement to address one or a few NFS IP addresses.
The second thing is to enable NFSv4, if possible, and look at enabling NFSv4 delegations, again if possible/supported by your client OS.
Read and write delegations are disabled by default.
Read about delegations in this paper: http://www.nfsconf.com/pres04/baker.pdf
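On the filer, enabling v4 and delegations should look something like this sketch (7-mode option names as I recall them; verify with "options nfs.v4" on your release):

```shell
options nfs.v4.enable on             # enable the NFSv4 protocol
options nfs.v4.read_delegation on    # allow read delegations to clients
options nfs.v4.write_delegation on   # allow write delegations to clients
```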
You didn't mention whether you also have high throughput on the NICs.
One thing that really P-O's me is that NetApp decided to drop TCP offloading (TOE).
So now I see a much higher CPU load with NFS on my 10G NICs when doing the exact same tests compared to 4Gb/s FC.
And I doubt that the protocol differences and block size make up for the almost 30-50% higher CPU load I see on 10Gb/s NFS, while I still get higher raw throughput on 2x4Gb/s FC.
NetApp reintroduced TCP TSO (segmentation offloading) in ONTAP 8, but that is just a small part of the whole process of handling network traffic.
The reasoning I hear is that TOE doesn't add any performance compared to today's CPUs, and adds overhead.
But IMHO the CPU should do other things than calculate checksums, which even the cheapest NICs handle at almost wire speed nowadays. And I suppose NetApp doesn't use the cheapest NICs out there.
What if the CPU is busy, like in your case? Why not switch on TOE then, if you care about performance?
Or at least leave it to us, the customers, to decide what the optimal way is.
Re the TOE thing, I heard there were driver dramas. Again, it's all just hearsay. We've been using TOE on our 6080s for quite a while; for us it was the difference between bottoming out at about 400MB/s and sustaining 700MB/s with line-speed peaks. The biggest pain with TOE was the lack of trunk support; we used round-robin DNS to work around it, which wasn't too bad.
We're still trying to see what's going on with 8.0.1 and stateless offload. It seems interesting, but we can't really measure the improvement. To be honest, we haven't tried, but we probably should one day.
We have gone further with our tests, and we have probably begun to better understand how this filer works.
First of all, Lovik seems to be right.
We pushed CPU utilization further, and what we saw is that the harder you press, the more cores come into play. We reached 7 cores in use:
ANY AVG CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11
100% 33% 19% 0% 0% 0% 0% 0% 41% 38% 66% 68% 71% 90%
Pushing further, we began to use CPU7, then CPU6. Latency increased, but only slightly: average latency was 0.5 ms (the slowest operation was read at 5ms, then mkdir at 2.4ms, then the others at under 1ms).
IOPS have doubled:
CPU Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk
ops/s in out read write read write age hit time ty util
99% 160143 152620 300862 15218 127792 0 0 3 99% 100% : 12%
PAM (1TB) seems to work well, avoiding a lot of disk reads:
Usage Hit Meta Miss Hit Evict Inval Insert Chain Blocks Chain Blocks Replaced
% /s /s /s % /s /s /s /s /s /s /s /s
90 3947 3700 4254 48 105 0 6430 3887 3901 100 6425 3887
The strange thing is that CPU busy was 100% at 80,000 IOPS and is still 100% at 160,000 IOPS, which tells me that CPU utilization is not a good indicator of storage usage.
Now the 20 dual-Xeon test servers are saturated: their load average is at 300 and they cannot spit out more CPU cycles. Even so, I/O wait on the clients is still low, near 4%, which seems to indicate that we haven't reached the storage limit yet.
At this point the only reliable indicator of storage usage seems to be latency, as Lovik said. We'll add test servers to find where the limit is, and to understand whether pressing on the storage makes latency explode or not (we want at most 12ms latency).
Thank you all,
One question from me (I am also having very serious performance problems whose source we don't know): you wrote that you disabled dedup, but it didn't help.
Does that mean you only used the command sis off /vol, or did you use the sis undo command (a real disable of dedup)?
We only disabled dedup (without undoing it), and later on we couldn't undo it (due to not enough space).
Just run sis status: if it shows "Disabled", the system still tracks changed blocks on your volume. Try to undo the A-SIS volumes.
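A sketch of the difference between the two (the volume path is illustrative; sis undo needs advanced privilege and roughly as much free space as dedup saved):

```shell
sis status /vol/myvol   # "Disabled" still differs from no entry at all
sis off /vol/myvol      # only stops scheduled dedup runs on the volume
priv set advanced       # sis undo is an advanced-mode command
sis undo /vol/myvol     # rewrite shared blocks, really backing dedup out
priv set                # return to admin privilege
```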
It's probably more correct to say that dedup was never turned on, and we keep it disabled.
If I run sis status I get "No status entry found."
Is dedup causing excessive overhead for you? NetApp claims the overhead is very low, but I've never tried it and I'm interested in real customer experience.
In my case I know dedup is wrong in many ways. We made some mistakes earlier, like disabling dedup (without undo), and later on we couldn't undo due to not enough space (over-deduplicated).
I also sometimes get a call that a customer sees a performance issue, and when I check there are, let's say, 3 concurrent dedup tasks running (some in verification mode); when I disable them, the problem seems to be solved.
That is why, in a new setup with a clustered (non-metro) FAS3160 with PAM II, I won't use deduplication, because of the performance implications.
In my case I could never get help from NetApp, as they always said that I have misaligned VMs.
May I ask, by the way, whether you are using your filers for VMware as well, or just physical boxes for the website? And did anything change after going from 1Gbit to 10G?
This FAS6280 is planned to host only our mail application. We are planning a test with our web hosting service, but not for now, and no virtual machine services at all.
I've checked for differences between a 1Gbit link and a 10Gbit link, but what I found is that if you don't need the bandwidth, CPU usage doesn't change very much.
I've compared the FAS3160 (linked at 2 x 1Gbit/s) and the FAS6280 (linked at 2 x 10Gbit/s). Using ps -c 1 in advanced mode, I see about 8% CPU usage for each network thread in both the 1Gbit and 10Gbit configurations, so no difference.
A slight improvement in the 10Gbit config is in the link thread: e0X_qidX (10Gbit) seems to use less CPU than Gb_Enet/eXX (1Gbit), but I'm talking about 1% versus 3%, so a very, very small difference and probably too small to be considered reliable.
I think the network processing power needed is low, and you should upgrade to 10Gig only if you need the bandwidth... but this is just my opinion, of course.
Thanks for your input, but why then, if I set up a normal (physical) Windows 2008 server and connect from another server via \\servername\c$ and read a file (or write one), do I constantly get 100-120MB/s? I often get that question from my customers, and I can't really explain the behavior.
Do you have a valid explanation for that? I would appreciate it; maybe I would understand this better.
This might be a matter of the window size or packet size of your LAN connection or the NIC itself. Test your network connection with iperf using smaller and larger window sizes (parameters -w4K and -w128K). With 4K you get around 300Mbps, while with 128K you get almost the full bandwidth. Moreover, non-server (e.g. laptop) 1Gbps NICs usually can't do better than 400Mbps. BTW, the interesting thing is that you achieve 100MB/s with Windows 2008; we can't get that much with 2003.
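The iperf comparison could be run like this sketch (hostname is illustrative; you need two hosts, one as receiver and one as sender):

```shell
# on the receiving host:
iperf -s
# on the sending host, compare a small vs a large TCP window:
iperf -c server1 -w 4K   -t 10   # throughput capped at roughly window / RTT
iperf -c server1 -w 128K -t 10   # should approach line rate on a clean 1GbE link
```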
Well, the dd command is pretty common:
dd if=/dev/zero of=/tmp/testfile bs=1M count=2000
which creates a 2GB file.
The NFS mount is from an ESX host, so basically the default one, but we also did the same test on the following NFS mount (physical Linux server):
netapp:/vol/nfs1 /vhosts nfs defaults,noatime,hard,intr 0 0
Hi Marek, I've launched 17 streams of dd from 17 different servers against the 6280, and I reached 408MByte/s writing and 621MByte/s reading. I have to say that our FAS is not optimized for throughput but for random I/O, and it doesn't have many disk loops (the two controllers are installed in different places, so we need to reduce the fiber patches). This caused a "B" (back-to-back consistency points) in the sysstat CP type field.
Probably with a higher number of loops, throughput could be increased.
I am glad you have started understanding NetApp systems.
We've been playing around with the 6280s for 3 or 4 months, and I have to say we've seen pretty spectacular NFS performance with them (a peak of 2.5-3GB/s off disk, much higher off cache) and about 1GB/s straight onto disk (144 spindles).
A couple of things. Firstly, sysstat -M 1 doesn't really work anymore; I'm assuming it's because a lot of the subsystems now reside in BSD, things like networking, RAID and WAFL (I believe, but don't quote me on that).
Effectively, with the way NVRAM is deployed, you'll only get the full benefit if you run single systems: I've been told that with clustering you get 2GB per head because of the mirroring, and this will stay the same with the upgrade; you'll have 4GB to play with on a single node.
With 1TB of PAM in there, access to another 48GB of memory probably isn't going to make a major difference to throughput.
A statit will give you much better and more accurate information about how the cores are loaded up.
We personally haven't been able to break it (other than a few software bugs we found in testing). We find the performance pretty linear even with a couple of thousand workstations hammering a single node (deliberately trying to kill it); obviously the total single-node throughput decreases, but the response times remained very good.
A perfstat and/or a statit would help.
Our application profile doesn't produce high throughput; instead it generates a lot of small IOPS, especially metadata, causing high CPU utilization on the filer.
At this time I can say that, with 40 dual-Xeon quad-core servers running our application, we reached the FAS limit, seeing high latency (which for us means NFS reads > 12ms) and high I/O wait (>20%) on the servers. To confirm that we had reached the end of the 6280, we launched a simple ls -R on the same mount point from another, unloaded server, and observed that the response was very, very slow but, I have to say, constant (about one directory listing every five seconds).
Under this particularly heavy load we reached 110-120k IOPS, with 500Mbit/s of bandwidth usage and 40% disk utilization (160 spindles in two plexes).
Do you know a reliable way to tell when the storage will be at its end? I mean, now I can run tests, but once the storage is in production I will need to know exactly how it performs and anticipate when it will run out of performance capacity. (Automated tests are giving me estimates that I hope are as close as possible to production, but of course millions of real users are different!)
As I said, with EMC Celerras it is quite straightforward to understand storage utilization, because 100% CPU is a real limit: if you break it, you are in trouble. With NetApp, I was at 100% CPU busy when the FAS was giving me 60k IOPS, and I was at 100% CPU busy at 100k IOPS too, with good performance in both cases (as I said, I don't need 0.2ms latencies, and unfortunately the latency increase is not linear at all).
Yes, statit is cool, but from what I've seen it is most useful when you are in trouble and want to understand where the problem is. I'm not an expert on statit output and probably can't understand it completely. What are the main parameters to read from statit output that can help me anticipate problems and make predictions?
I've read that in the 8.1 release NetApp should fix some threading issues to better utilize modern processors with high core counts, like the X5670 in the FAS6280.
Does anyone know when it should be released?
But I have one more question. I often get questions from customers: "I am copying (or dd-ing) from/to the NetApp and I get like 15-20MB/s; it's slow", etc. I know it's nonsense, because their applications do not use dd/copy as their main workload, but I still can't find a good answer for them. Can you suggest something?
And a last question: how did you know that your filer is optimized for random I/O rather than sequential?
dd is single-threaded, hence the slow performance. Open more consoles and run more dds, or even dds from different machines, so you have a degree of parallelism. dd always reads a block, then writes a block, plain and simple. Windows copying is faster since it reads a few blocks, caches them, and writes them out after a certain amount of data has been cached. dd is not, never was, and probably never will be a proper method of measuring performance.
You might also play around with bs (4MB is usually a better value) and google for "why is dd so slow?"; you will find plenty of discussions. Some suggest using cat instead of dd, but you will see, as I said, that it is a rather poor performance measuring tool.
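A minimal sketch of the "more dds in parallel" idea (paths are local and sizes tiny just to show the pattern; point DIR at the NFS mount and raise the sizes for a real test):

```shell
#!/bin/sh
DIR=$(mktemp -d)    # for a real test, use a directory on the NFS mount
STREAMS=4
for i in $(seq 1 $STREAMS); do
    # each stream is an independent sequential writer running in the background
    dd if=/dev/zero of="$DIR/testfile.$i" bs=1M count=8 2>/dev/null &
done
wait                # block until every stream has finished
ls -l "$DIR"/testfile.*
rm -rf "$DIR"
```

With one dd you measure a single request in flight at a time; with several streams the client can overlap requests, which is much closer to how a real multi-user workload hits the filer.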
OK, I can understand this, but what should I tell customers (who sometimes come from another environment where they had "better copying performance")? That it works as designed? There is a huge difference between 20MB/s and 100MB/s, and that's my point.
Would you say that to your customers? "Don't worry about copying speeds, that's normal..."
You should not only play around with "bs", but maybe also with "iflag=direct" and "oflag=direct" for direct I/O, and with "oflag=sync" and "oflag=dsync" for synchronous tests.
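A quick local sketch of those flags (GNU dd; oflag=direct can fail on filesystems without O_DIRECT support, so only the dsync variant is shown here):

```shell
#!/bin/sh
F=$(mktemp)
# buffered: blocks land in the page cache, dd returns before they reach disk
dd if=/dev/zero of="$F" bs=1M count=16 2>&1 | tail -n1
# dsync: every 1M block must reach stable storage before the next is issued,
# which is usually dramatically slower and closer to a sync-NFS workload
dd if=/dev/zero of="$F" bs=1M count=16 oflag=dsync 2>&1 | tail -n1
rm -f "$F"
```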
I personally also don't really like dd as a performance testing tool. Nevertheless, we have UNIX guys here who always use it for their quick performance tests. They use different options, yes, and since their "dd test suite" has been the same for years, they claim the results do say something.
Btw, here is a maybe interesting "pro dd" read:
To understand what may happen to your "dd results" when Linux has cached the files, and how to prevent this with newer kernels, you may also find this interesting: