Hi all,
I'm deploying a new filer and am having some trouble with SnapDrive 4.0 for Linux, specifically on CentOS 5.1 x86_64 (fully patched).
snapdrived starts up ok and I can interact with it to the extent of setting the root password for the filer. When I try to perform a filer operation, however, things don't go so well. To start,
[root@db2 log]# snapdrive storage list -all
Status call to SDU daemon failed
[root@db2 log]# ps -ef | grep snapdri
root 7587 1 0 Jul24 ? 00:00:00 snapdrived start
root 11283 7587 0 13:40 ? 00:00:00 [snapdrived] <defunct>
Each repeated run of a snapdrive storage command spawns a new defunct process. Commands such as "snapdrive config show" run fine.
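To watch whether each failed command really leaves another zombie behind, a small helper along these lines may be useful. It is only a sketch: the function name count_defunct is made up for illustration, and it assumes the column layout produced by `ps -eo pid,ppid,stat,comm` on Linux.

```shell
# Count zombie (defunct) processes whose command name starts with the given
# name. Expects `ps -eo pid,ppid,stat,comm` style text on stdin.
# count_defunct is a hypothetical helper name, not part of SnapDrive.
count_defunct() {
    awk -v name="$1" '
        # Skip the header line; a STAT value starting with Z marks a zombie.
        NR > 1 && $3 ~ /^Z/ && index($4, name) == 1 { n++ }
        END { print n + 0 }
    '
}

# Usage on a live host:
#   ps -eo pid,ppid,stat,comm | count_defunct snapdrived
```

Running it before and after a failing "snapdrive storage list -all" should show the count going up by one if the behaviour described above holds.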
And in sd-trace.log:
13:43:06 07/25/08 [f7f7cb90]?,2,2,Job tag: bEogRP90xw
13:43:06 07/25/08 [f7f7cb90]?,2,2,snapdrive storage list -all
13:43:06 07/25/08 [f7f7cb90]v,2,6,FileSpecOperation::FileSpecOperation: 12
13:43:06 07/25/08 [f7f7cb90]v,2,6,StorageOperation::StorageOperation: 12
13:43:06 07/25/08 [f7f7cb90]i,2,2,Job tag bEogRP90xw
13:43:06 07/25/08 [f7f7cb90]i,2,6,Operation::setUserCred user id from soap context: root
13:43:06 07/25/08 [f7f7cb90]i,2,6,Operation::setUserCred uid:0 gid:0 userName:root
13:43:06 07/25/08 [f7f7cb90]F,0,0,Fatal error: Assertion detected in production code: ../sbl/StorageOperation.cpp:182: Test 'osAssistants.size() == 1' failed
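For anyone comparing trace logs across hosts, the fatal entries can be pulled out with a filter like the one below. This is a sketch that assumes the line layout seen in the excerpt above (time, date, a bracketed hex id, then a one-letter severity where F means fatal); fatal_lines is a made-up helper name, and the log path in the usage comment is a guess based on the shell prompt shown earlier.

```shell
# Print only the fatal entries from sd-trace.log style input.
# Assumed layout (from the excerpt only): HH:MM:SS MM/DD/YY [hexid]F,n,n,message
# Reads the named files, or stdin when no file is given.
fatal_lines() {
    grep -E '^[0-9:]+ [0-9/]+ \[[0-9a-f]+\]F,' "$@"
}

# Usage (path is an assumption):
#   fatal_lines /var/log/sd-trace.log
```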
When I strace the snapdrive process I see things conclude with:
connect(3, {sa_family=AF_INET, sin_port=htons(4094), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
send(3, "POST / HTTP/1.1\r\nHost: localhost"..., 1555, 0) = 1555
recv(3, "HTTP/1.1 200 OK\r\nServer: gSOAP/2"..., 65536, 0) = 1722
shutdown(3, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected)
close(3) = 0
write(2, "Status call to SDU daemon failed"..., 33) = 33
munmap(0xf7f7d000, 135168) = 0
exit_group(104) = ?
Which follows what I see on the packet capture side of things where the snapdrived port sends RSTs (no doubt after the child process has gone defunct) after a very limited exchange:
POST / HTTP/1.1
Host: localhoHTTP/1.1 200 OK
Server: gSOAP
Any input appreciated.
Thanks in advance.
Hello Frans,
I have seen problems like this in the past, so I'm going to offer a few suggestions in the hope that we can get this taken care of.
1. Check the current length of the LUN names. If they're excessively long, this could be part of your problem.
2. Make sure you have no stale snapdrive daemons and that the snapdrive ports are not in use:
a.) ps -ae | grep snap
b.) netstat -an | grep 4094
3. Try enabling TCP low latency to keep delayed ACK from kicking in.
Check with:
sysctl -a| grep net.ipv4.tcp_low_latency
Should report:
net.ipv4.tcp_low_latency = 0
Enable with:
sysctl -w net.ipv4.tcp_low_latency=1
4. And above all, if the troubleshooting steps through step 3 do not return any significant results, contact support at 888-463-8277 (888-4NETAPP).
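The port check in step 2 can be scripted roughly as follows. This is a sketch, assuming the usual Linux `netstat -an` column layout (protocol first, local address fourth, state last); port_listening is a hypothetical helper name.

```shell
# Succeed if the given TCP port appears as a local LISTEN socket in
# `netstat -an` style text read from stdin.
# port_listening is a hypothetical helper name.
port_listening() {
    awk -v port="$1" '
        # $4 is the local address, for example 127.0.0.1:4094
        $1 ~ /^tcp/ && $NF == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Usage on a live host (4094 is the daemon port seen in the strace above):
#   netstat -an | port_listening 4094 && echo "port 4094 is in use"
#   sysctl -n net.ipv4.tcp_low_latency    # prints the current value only
```

If the port is reported in use while no snapdrived process shows up in the ps output, something else is holding it.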
Let us know if this helps Frans!
Thanks,
Christopher
Hi Chris,
I still get defunct SnapDrive processes after turning tcp_low_latency on. I'll open a ticket with NetApp.
Thanks,
Frans
Thanks for the update Frans,
I look forward to a speedy resolution to your problem!
Christopher
Hello Frans!
Did you find a solution for this problem?
I have the same problem on my lab system....
Regards
Helge
Hi Helge,
No, unfortunately I have not found a solution. NetApp does not support CentOS but advised I try an older rev of SnapDrive. If, by chance, you are using RHEL and have support with NetApp, could you open a ticket?
Cheers,
Frans
I'm getting the same issue here. I also opened a ticket but had no luck getting a better response.
It would seem there is a clear demand from the NetApp community for CentOS support. Frans did some good work digging into this as much as an end user can. What must we do to get NetApp Engineers to look into this? The error message states clearly what code this is puking on.
Help your loyal customers out NetApp, please!
To add more information here, most of which was included in my ticket to NetApp support:
OS: CentOS 5.2 (Also tried with Fedora 7 with same results)
Filer: FAS 3070
Connection: iSCSI
SnapDrive Version: 4.0, 3.0 and 2.2.1
sanlun version: 3.2.79.2486
Snapdrive v4.0:
All 'snapdrive config *' commands work, nothing else appears to work. Mainly:
#snapdrive storage list -all
Status call to SDU daemon failed
Snapdrive v.3.0:
Nothing here really appears to work. The common error I get is:
0001-877 Admin error: HBA assistant not found. Commands involving LUNs should fail.
The most success I have had was with Snapdrive v 2.2.1
Snapdrive v 2.2.1:
'snapdrive config' works
I have had success with 'snapdrive snap create -fs [path_to_mounted_LUN]'
Doing a 'snapdrive snap restore' from the snap does NOT work, however I successfully tested making a FlexClone from the Snap and mounting it.
As I found out just tonight, Snapdrive v2.2.1 does NOT work with multipathing, which is a requirement for production use, IMHO.
Use " snapdrive storage show -all "
Also check /etc/hosts file for host and filer ip/alias
~Nikhil
Neither 'snapdrive storage show -all' nor 'snapdrive storage list -all' works. They seem to be similar commands anywho.
A host entry exists for the filers and works, otherwise simply getting a login to the filer would fail (you should not be able to 'snapdrive config set [filer] root' without this existing).
The point of my addition to this post was to show that there is a need and demand for snapdrive to work on CentOS, and that others are trying to make it work with very little success.
1) "Status call to SDU daemon failed": I got this error when the filer DNS entry was removed.
Looks good in your case.
2) Admin Error: HBA assistant not found.
Check (a) sanlun lun show => does it work on your host?
(b) sanlun fcp show adapter -v => does it show HBA information on the host?
"HBA assistant not found" means SnapDrive is not able to recognize the host HBA driver.
(a) 'sanlun lun show'
This is working correctly. Results are showing filers, LUNs that are mapped to the igroup the initiator belongs to, lun-pathnames are correct as well as device filenames.
filer:    lun-pathname              device filename  adapter  protocol  lun size               lun state
fas-001:  /vol/vol_test4/lun_test4  /dev/sde         host1    iSCSI     400.0g (429523992576)  GOOD
fas-001:  /vol/vol_test4/lun_test4  /dev/sdc         host2    iSCSI     400.0g (429523992576)  GOOD
fas-001:  /vol/vol_test3/lun_test3  /dev/sdd         host1    iSCSI     400.0g (429523992576)  GOOD
fas-001:  /vol/vol_test3/lun_test3  /dev/sdb         host2    iSCSI     400.0g (429523992576)  GOOD
(b) 'sanlun fcp show adapter -v'
Unable to locate /usr/lib/libHBAAPI.so library
Make sure the package installing the library is installed & loaded
I am using iSCSI and I believe the original poster was as well. Should this still return some sort of result?
I found this KB article: https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb41496
which seems to indicate Linux Host Utils should install that library, but it does not exist on my systems. Is it installed with iSCSI Host Utils also or just FC Host Utils?
(a) and (b) look good for iSCSI, since you are using the iSCSI HU.
1) What are the values of the parameters below in your snapdrive.conf file?
"snapdrive config show"
E.g.
default-transport="iscsi" # Transport type to use for storage provisioning, when a decision is needed
multipathing-type="none" # Multipathing software to use when more than one multipathing solution is available
fstype="ext3" # File system to use when more than one file system is available
vmtype="lvm" # Volume manager to use when more than one volume manager is available
use-https-to-filer=off # Communication with filer done via HTTPS instead of HTTP
below obtained from active config, using grep to single out lines:
default-transport="iscsi" # Transport type to use for storage provisioning, when a decision is need
multipathing-type="none" # Multipathing software to use when more than one multipathing solution is available
fstype did not exist, so I added it by hand (this did not change anything)
vmtype did not exist either. We're not using LVM so I have not added this to the config
use-https-to-filer=on # Communication with filer done via HTTPS instead of HTTP
I have disabled all HTTP transactions on the filer end so I have this enabled. I can re-enable HTTP mode if you think this will make a difference and re-test.
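The grep I used to single out those lines can be wrapped up roughly like this. It is a sketch: show_params is a made-up helper name, the file path in the usage comment is an assumption about the default install location, and it treats only lines that start with the parameter name (not commented out) as active.

```shell
# Print the active line for each requested parameter from a
# snapdrive.conf style file, or note that the parameter is not set.
# First argument: config file; remaining arguments: parameter names.
# show_params is a hypothetical helper name.
show_params() {
    conf=$1
    shift
    for name in "$@"; do
        # Lines starting with '#' are treated as commented-out defaults.
        grep -E "^[[:space:]]*${name}=" "$conf" || echo "${name}: not set"
    done
}

# Usage (path is an assumption):
#   show_params /opt/NetApp/snapdrive/snapdrive.conf \
#       default-transport multipathing-type fstype vmtype use-https-to-filer
```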
It should be noted that I am using snapdrive 2.2.1 for these, as that is the version I have had the most success with. Thanks,
Jesse
Also, can you please let us know which snapdrive restore command you are using? Is it a live or dead filespec?
'snapdrive snap restore -fs [local fs path] -snapname [snap_created]'
I've tried this with the LUN device both mounted and unmounted, with the same results. In my case, the restore command is less of an issue. I have successfully used a SNAP to create a FlexVol without issues. It's slightly annoying to have to get on the filer to do this, but not really a problem, especially given how rarely a SNAP is or will be restored.
Since you are not using a VM type, does that mean you created a raw LUN, made a filesystem on it, and used snapdrive to create the snapshot?
That is right
In a condensed, pseudo-command fashion:
filer> create lun, map to igroup, etc
server> iscsiadm resync
server> fdisk to create partition on device (following KB article http://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb8190 to ensure proper alignment)
server> mke2fs w/ journal
server> mount LUN device partition
server> snapdrive snap create -fs [LUN device partition] -snapname test
Check the NetApp support matrix for snapdrive with CentOS.
Since this is an Assistant error, it may require a snapdrive code fix for CentOS.
The support matrix doesn't have CentOS which is why calling up NetApp support doesn't help at all.
CentOS should be a 99.9% match to RHEL so a code fix shouldn't be too hard. Heck, the sd-trace file tells you exactly where it is bombing out. See OP for message from sd-trace.
What would be the correct route to try and get NetApp to have an engineer look at this? It would seem a small fix could make plenty of customers happy.
It would be nice if Netapp at least reviewed the problem without offering official support but don't hold your breath.
What would be interesting would be to see a strace from a true RHEL system running the same kernel and RPMs. I'm afraid if we're going to solve this one we're going to solve it on our own.
I agree completely. I was strung along by support for a few weeks with common setup questions. Once it came down to the nitty-gritty they just referred back to the support matrix. They didn't even have the courtesy to check this forum posting, even when I put it in the ticket.
I happen to have a RHEL system available; my VAR obtained a copy for me to use to prove that SnapDrive does in fact work. Funny thing is, it still doesn't work. It gets hung up in a different area, but doesn't work nonetheless. I will get the exact same RPM versions installed tonight, and as close a kernel as possible, and get the strace for the app. I'll post the results here in a few hours.
Please provide the NetApp support Ticket Number.
Thanks
~Nikhil
Case 2000172643
Thanks
Here is the strace from a successful snap using RHEL 5.2:
connect(3, {sa_family=AF_INET, sin_port=htons(4094), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
send(3, "POST / HTTP/1.1\r\nHost: localhost"..., 1555, 0) = 1555
recv(3, "HTTP/1.1 200 OK\r\nServer: gSOAP/2"..., 65536, 0) = 1688
shutdown(3, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected)
close(3) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, , 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0}) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
setsockopt(3, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
open("/etc/hosts", O_RDONLY) = 4
I have also attached the full strace as a text file.
Hello...
I have similar problems.
I need to install Snap Drive for Unix (SDU) on Oracle Enterprise Linux 5.2 (RedHat 5.2 indeed) and iSCSI HBA (qlogic 4050c) and FCP HBA (HP FC2243 (Emulex LP11002 rebrand))
I used Netapp FCP (and iSCSI) Host Utilities for Linux 3.0
The most popular error on all systems is:
admin Error: HBA assistant not found
sanlun fcp show adapter says
sanlun fcp show
WARNING: libHBAAPI.so not found in /usr/lib
Unable to load HBA control library
Similar case is described here for Qlogic FC HBA:
https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb45308
But there is no libHBAAPI.so in the Qlogic qlaiscsi package for the 4050C. I installed the Qlogic drivers and SANsurfer CLI, but that library is not included.
Even for Emulex HBA:
https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb40902
I installed the described tools and host utilities, and 'sanlun fcp show adapter all' no longer asks for /usr/lib/libHBAAPI.so
but says "No supported adapters present"
In both cases SnapDrive refuses to work and reports the HBA assistant error.
SnapDrive config checker says the following:
Detected Intel/AMD x64 Architecture
Detected Linux OS
Detected Host OS Oracle Enterprise Linux 5
Detected Host OS Oracle Enterprise Linux 5
Detected NFS FileSystem on Linux
Detected FCP on Linux
Detected Ext3 File System
Detected Linux Native LVM1
Detected Linux Native LVM2
Detected Linux Native MPIO
Did not find any supported cluster solutions.
Detected Software iSCSI Linux Initiator Support Kit 3.0
Supported Configurations on this host by SDU Version 4.1
-------------------------------------------------------------
Linux NFS Configuration
Interesting that the checker does not support Host Utilities 4.1 and 4.2.
Did anybody have similar problems with SDU and SAN HBAs? The only config where SDU works is SW iSCSI.
Thanks.
This is due to iSCSI Hardware initiator. SDU currently does not support iSCSI Hardware initiator in Linux (Please refer to NetApp InterOperability Matrix). iSCSI Hardware initiator is supported only on Solaris platform.
Regards,
Senthil
Thank you.
Probably I'm not very good with new Netapp Compatibility Matrix tool.
And what about FCP? I have the same problems with FCP even with HBAlib installed?
Nick
It should work with FCP. It could be that some settings are not configured properly. Please make sure you have followed SnapDrive for Unix Installation and Administration Guide properly. Please contact NetApp support, if you still have issues.
Poked around in the snapdrived binary. Apparently it behaves differently depending on whether it is running on a Red Hat 4 or a Red Hat 5 distribution. It determines what kind of host it is by using /etc/redhat-release.
Changing the contents in that file to:
Red Hat Enterprise Linux Server release 4 (Tikanga)
Makes everything magically work. This is with SDU 4.1 using NFS on a CentOS 5.2 with recent updates. SnapManager for Oracle with Oracle 10gR2 works like a charm too!
This is of course an unsupported lab environment.
Thanks Michael,
your post made my day. Now I have my proof of concept running, using CentOS 5.1 as a Xen host with VMs (also CentOS 5.1) running a simulator and two Oracle 11g RAC nodes, all controlled by SnapManager for Oracle. Tested all combinations (NFS and DNFS, both directly using the simulator's NFS shares and with ASM on top of NFS shares).
Works great.
Mark
This post is directed primarily to the NetApp engineers that check these threads.
I was having the same problems as stated above. I have installed SnapDrive for Unix (Linux) 4.1 on a RHEL 5.2 server. The install works fine and the SDU daemon appears to be working properly. However, when running the majority of the snapdrive commands, I receive the following error:
Status call to SDU daemon failed
I tried troubleshooting this problem for a few days, but didn't have much luck. Just today though, I saw the post from Michael Mattsson stating that changing the /etc/redhat-release file to read "Red Hat Enterprise Linux Server release 4 (Tikanga)" makes everything magically work. I tried the fix myself, and sure enough, he is correct.
I have been messing around with SnapDrive in our lab, but we are getting ready to deploy it in our production environments. I don't want to have to make this change across thousands of servers, especially since it seems to be a rather "dirty" fix.
Is there any supported fix that is being developed or a patch in a later release? This seems to be a large bug in the SnapDrive code since it claims to be compatible up to RHEL 5.3.
Please let me know.
Thanks.
SnapDrive for UNIX v4.1 works with Redhat v5.2.
What was the original value in the /etc/redhat-release file? What were the commands that were failing?
I do not suspect this to be an issue with the /etc/redhat-release file. We need to analyze the trace logs and system configuration to understand the problem.
Kindly file a ticket with NetApp and provide them with the output of "snapdrive.dc" and linux_info diagnostic scripts.
Good luck with any type of solid support from NetApp regarding SnapDrive for Unix. I've had a ticket open with them for over 2 weeks with absolutely no hint of resolution.
Funny thing is, the snapdrive utility actually creates the iSCSI LUN on the filer as well as the iGroup but then fails to 'discover' the new LUNs after they have been created:
mapping new lun(s) ... done
discovering new lun(s) ... *failed*
Not so fun times with NetApp.......
Please provide the ticket number.
It's great to see so many other users in the last 6 months that have also wanted this. I wonder if NetApp will choose to support its users any time soon.
We have RHEL 5.3 with Emulex HBA, Host Utils 5.0, and SnapDrive 4.1, and were running into the same issue as above. NetApp has released a fix to make this work, but it does not work for us. The workaround above did fix the problem, but it is a hack and we need a supportable fix. I have heard NetApp is working on officially releasing the "broken" fix soon. We are trying to figure out what is wrong with our environment that the fix does not work, as they have not released the docs yet. We are supposed to be on the line with the product managers tonight.
I'll keep you updated.
Mark,
Good luck with the fix. Luckily RHEL is a supported distribution so you're getting much better support with SnapDrive than those of us using CentOS. I look forward to hearing how the resolution turns out and more about the 'hack' they suggested you use and their final fix for you. Perhaps it can shed some light for CentOS users. Thanks for updating the community, it's interesting to hear this problem is now coming up in a supported distribution.
Jesse
OK I think I may have found the solution.
After digging through the snapdrive and snapdrived binaries I noticed that the OS checks for RHEL4 and RHEL5 are different.
For RHEL4 it does a 'cat /etc/redhat-release' to get the OS version information.
For RHEL5 it does a 'cat /etc/issue' for the OS information.
I have no idea why this changed. Just modify '/etc/issue' to read:
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Kernel \r on an \m
Don't forget to set '/etc/redhat-release' back to the original for your system. For RHEL5 it is:
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Restart snapdrived.
I tested with 'snapdrive storage list -all' and it gave me the correct output.
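The file change described above can be sketched as follows, keeping a backup so it is easy to undo. replace_with_backup is a made-up helper name, this is meant for a lab host only, and restarting snapdrived afterwards is left as a manual step.

```shell
# Replace a file's contents with text from stdin, keeping a .orig copy of
# the first version seen so the change can be rolled back.
# replace_with_backup is a hypothetical helper name.
replace_with_backup() {
    target=$1
    # Only take the backup once, so repeated runs keep the true original.
    [ -e "$target.orig" ] || cp -p "$target" "$target.orig" || return 1
    cat > "$target"
}

# Usage, as root on a lab host (release strings are the ones quoted above):
#   printf '%s\n' 'Red Hat Enterprise Linux Server release 5.3 (Tikanga)' \
#       'Kernel \r on an \m' | replace_with_backup /etc/issue
#   printf '%s\n' 'Red Hat Enterprise Linux Server release 5.3 (Tikanga)' \
#       | replace_with_backup /etc/redhat-release
# Then restart snapdrived by hand.
```

Using printf with '%s\n' keeps the \r and \m sequences literal, which is what /etc/issue expects.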
Hope this helps.