25 Replies Latest reply: Jan 30, 2014 9:58 PM by pulkitkaul

LSF Performance Plugins

pulkitkaul

Hi Team,

 

Last week I heard about performance plugins that NetApp has developed for handling the LSF job-related issues that users generally face in CAD environments.

I have heard there are two plugins: (1) the NetApp scheduler plugin and (2) the hot job plugin.

 

I am very interested in deploying these in my environment, so could someone help guide me on how these plugins work and what is required to make them operational?

 

 

 

 

Best Regards

 

Pulkit Kaul.

  • Re: LSF Performance Plugins
    bikash

    Pulkit,

     

    It is nice to know that ST Micro is interested in trying out the NetApp LSF scheduler plugin. I am working with Bakshana to get the internal process in place before sending out the implementation guide and the plugin.

     

    Bikash

    • Re: LSF Performance Plugins
      pulkitkaul

      Hi Bikash,

       

      I have implemented the scheduler plugin on a test LSF master host, but while executing this script:

       

      python2.7 ontapmon.py config.ini

       

      I am getting the errors below in the "ontapmon_error.log" file:

       

      File "/usr/local/lib/python2.7/httplib.py", line 1045, in getresponse

          response.begin()

        File "/usr/local/lib/python2.7/httplib.py", line 409, in begin

          version, status, reason = self._read_status()

        File "/usr/local/lib/python2.7/httplib.py", line 365, in _read_status

          line = self.fp.readline(_MAXLINE + 1)

        File "/usr/local/lib/python2.7/socket.py", line 476, in readline

          data = self._sock.recv(self._rbufsize)

      KeyboardInterrupt

      2013-11-07 13:06:38,983 ERROR main(): ipaddr_get() unable to get ip for filer napfs001b

      2013-11-07 13:07:52,275 ERROR ipaddr_get(): Failed agrfs006a:SNMP walk failed

      2013-11-07 13:07:52,275 ERROR main(): ipaddr_get() unable to get ip for filer agrfs006a

       

       

      Could you please help me confirm why this script is not getting the IPs for the above-mentioned filers?

       

       

      Best Regards

       

      Pulkit Kaul.

      • Re: LSF Performance Plugins
        bikash

        Pulkit,

         

        Our engineer who wrote the code checked that Python 2.7 works with the script, even though we had recommended Python 2.6. The error message ending with "KeyboardInterrupt" is just Python printing a stack trace of where the script was interrupted with Ctrl+C. It does not indicate a problem. You should allow the script to run to completion instead of interrupting it partway through.
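        Bikash's point about the traceback can be illustrated with a minimal sketch (hypothetical code, not the actual ontapmon.py): a long-running poll loop that catches Ctrl+C and logs a clean shutdown message instead of spilling a stack trace into the error log.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def poll_filers(iterations=3, interval=0.01):
    """Hypothetical polling loop; the real ontapmon.py loop differs."""
    completed = 0
    try:
        for _ in range(iterations):
            time.sleep(interval)  # stand-in for one SNMP/HTTP polling pass
            completed += 1
    except KeyboardInterrupt:
        # Ctrl+C lands here: one clean log line rather than a raw traceback.
        logging.info("Interrupted by user after %d passes; shutting down.", completed)
    return completed

poll_filers()
```

        Run uninterrupted, the loop simply completes; pressing Ctrl+C mid-run would produce a single log line instead of a traceback.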

         

        It looks like the three errors that you have listed towards the end -

        2013-11-07 13:06:38,983 ERROR main(): ipaddr_get() unable to get ip for filer napfs001b

        2013-11-07 13:07:52,275 ERROR ipaddr_get(): Failed agrfs006a:SNMP walk failed

        2013-11-07 13:07:52,275 ERROR main(): ipaddr_get() unable to get ip for filer agrfs006a

         

        could be mainly due to the DFM/Operations Manager server having controllers (napfs001b, agrfs006a) that are either offline, are cDOT controllers, or are not reachable on the network. This should not be much of a concern unless you want those specific controllers to be monitored.
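        One quick pre-check before re-running ontapmon.py is to confirm that each filer hostname resolves from the plugin host at all; this is only a sketch (the ipaddr_get() internals are not shown in the thread), and the hostnames in the loop are placeholders:

```python
import socket

def resolvable(hostname):
    """Return True if the host resolves via DNS or /etc/hosts; a resolution
    failure would doom any ipaddr_get()-style lookup before SNMP is even tried."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.error:
        return False

# Replace with the controllers from the error log: napfs001b, agrfs006a, ...
for filer in ["localhost", "no-such-filer.invalid"]:
    print(filer, resolvable(filer))
```

        If a name fails to resolve here, SNMP against it will fail too, independently of the plugin.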

         

        • Re: LSF Performance Plugins
          pulkitkaul

          Hi Bikash,

           

          Thanks for the prompt reply.

           

          Just want to add:

           

          This time my script is able to fetch the information from the NetApp filers.

          However, it fetched the information for all the NetApp filers configured in DFM.

           

          [root@dlhadminsb DIRLOC]# pwd

          /data/logs/plugin/DIRLOC

          [root@dlhadminsb DIRLOC]# ls -l

          total 1380

          -rw-r--r--  1 root root    9394 Nov  7 18:52 agrfs008a.xml

          -rw-r--r--  1 root root   16018 Nov  7 18:52 agrfs008b.xml

          -rw-r--r--  1 root root    6629 Nov  7 18:53 agrfsb0.xml

          -rw-r--r--  1 root root    5770 Nov  7 18:54 bngfs02.xml

          -rw-r--r--  1 root root    2820 Nov  7 18:55 c2na01.xml

          -rw-r--r--  1 root root    2508 Nov  7 18:55 c2na02.xml

          -rw-r--r--  1 root root    6461 Nov  7 18:37 crx604.xml

          -rw-r--r--  1 root root    6815 Nov  7 18:56 crx605.xml

          -rw-r--r--  1 root root    3451 Nov  7 18:36 crx606.xml

          -rw-r--r--  1 root root    4438 Nov  7 18:38 crx607.xml

          -rw-r--r--  1 root root   11735 Nov  7 18:38 ctofs003a.xml

          -rw-r--r--  1 root root    9989 Nov  7 18:39 ctofs006a.xml

          -rw-r--r--  1 root root    4973 Nov  7 18:40 dlhfs01.xml

          -rw-r--r--  1 root root    5936 Nov  7 18:40 dlhfs02.xml

          -rw-r--r--  1 root root    3440 Nov  7 18:38 dlhfs03.xml

          -rw-r--r--  1 root root    4900 Nov  7 17:27 dlhfs04.xml

          -rw-r--r--  1 root root    1872 Nov  7 18:38 dlhfs05.xml

          -rw-r--r--  1 root root    3316 Nov  7 18:38 dlhfs06.xml

          -rw-r--r--  1 root root    4460 Nov  7 18:40 dlhfs07.xml

          -rw-r--r--  1 root root    2995 Nov  7 18:42 dlhfs08.xml

          -rw-r--r--  1 root root    3591 Nov  7 18:38 dlhfs11.xml

          -rw-r--r--  1 root root    3915 Nov  7 18:40 dlhfs12.xml

          -rw-r--r--  1 root root    3557 Nov  7 18:39 dlhfs13.xml

          -rw-r--r--  1 root root    3989 Nov  7 18:40 dlhfs14.xml

          -rw-r--r--  1 root root    4184 Nov  7 18:40 dlhfs15.xml

          -rw-r--r--  1 root root    4746 Nov  7 18:40 dlhfs16.xml

          -rw-r--r--  1 root root    3477 Nov  7 18:13 dlhfs17.xml

          -rw-r--r--  1 root root    5347 Nov  7 18:41 dlhfs18.xml

          -rw-r--r--  1 root root   10669 Nov  7 18:42 dlhr200.xml

          -rw-r--r--  1 root root   24190 Nov  7 18:43 gnx5948.xml

          -rw-r--r--  1 root root   16410 Nov  7 18:43 gnx5949.xml

          -rw-r--r--  1 root root    1749 Nov  7 18:41 napfs001a.xml

          -rw-r--r--  1 root root    2148 Nov  7 18:41 napfs001b.xml

          -rw-r--r--  1 root root 1116634 Nov  7 18:56 ontapmon_error.log

          -rw-r--r--  1 root root    4083 Nov  7 18:41 rbaux190.xml

          -rw-r--r--  1 root root    4276 Nov  7 18:34 rbaux191.xml

          -rw-r--r--  1 root root    1868 Nov  7 17:31 tunna03.xml

           

          It looks like the scheduler plugin is not configured properly, because instead of the specific filer that I have configured in the "ntapplugin.conf" file, the "ontapmon.py" script is fetching the information for all the filers.

           

           

           

          # First section lists the available filers filesystems
          #
          Begin ExportNames
          /data/test      dlhfs06:/vol/vol10
          #test2  fas6280c-svl12:/vol/nfsDS
          End ExportNames

          #
          # The second section lists the GLOBAL filer utilization thresholds
          # and file system space thresholds.
          #
          Begin PluginPolicy
          Max_DiskBusy    =       10
          Max_NEDomain    =       75
          Max_AvgVolLatency =     10
          Min_AvailFiles  =       1000
          Min_AvailSize   =       1000
          End PluginPolicy

          #
          # Section where one can define Volume and Filer specific parameters
          # and policies.  IF Filer or volume are not specified then, global
          # thresholds will be used.
          #
          Begin FilerPolicy
          #fas6280c-svl11:/vol/volTest    Min_AvailFiles = 1000
          #fas6280c-svl11:/volnfsDS       Min_AvailSize = 1000
          #fas6280c-svl11                 Max_DiskBusy =  50
          #fas6280c-svl11                 Max_AvgVolLatency = 10
          dlhfs06:/vol/vol10              Min_AvailSize = 800
          End FilerPolicy

          #
          # Parameter section controlling plugin
          # behaviour.
          Begin Parameters
          Debug yes
          Work_Dir /data/logs/plugin/netapp-log
          Counter_Dir /data/logs/plugin/DIRLOC
          XMLReread 60
          DryRunMode no
          End Parameters

           

           

          Also, I have observed that the "ntapplugin.log" file has not been created, as mentioned in the implementation guide on page 24.

           



           

          13. The ntapplugin.log file will show all the threshold values that are set globally on the NetApp storage and also the values set on each volume. It will also indicate whether the schmod_netapp.so plugin is configured correctly. This information will be logged in the ntapplugin.log file if debug mode is set in the ntapplugin.conf file. Debug mode is enabled by default.



          [root@ibmx3650-svl50 netapp-log]# tail -f ntapplugin.log


          Mar 20 20:10:20:804066 9881 parse_policies(): Max_NEDomain 1000.000 configured


          Mar 20 20:10:20:804096 9881 parse_filerpolicies(): Max_NEDomain 1000.000 configured


          Mar 20 20:10:20:804108 9881 parse_filerpolicies(): Max_NEDomain 1000.000 configured


          Mar 20 20:10:20:804117 9881 parse_filerpolicies(): Max_DiskBusy 50.000 configured


          Mar 20 20:10:20:804126 9881 parse_filerpolicies(): Max_AvgVolLat_Busy 15.000 configured


          Mar 20 20:10:20:804134 9881 filertab->size 269


          Mar 20 20:10:20:804140 9881 146: data volume fas6280c-svl05:/vol/USERVOL_16d_rg 50.000000 0.000000 15.000000 1000.000000 1000.000000


          Mar 20 20:10:20:804160 9881 read_conf():

           

          schmod_ntap.so plugin configured all right

           

           

          However, when I checked the LSF master server logs, I found that the module schmod_netapp.so was loaded successfully.

           

           

          Nov  8 13:43:10 2013 1410 6 9.1.1 Customer scheduling
          module(/sw/platform/lsf_test/9.1/linux2.6-glibc2.3-x86_64/lib/schmod_netapp.so)loaded successfully.

           

           

          Also, the hot job plugin file is not available in the "netapp_lsf_plugin_v2.0" folder.

           

          Can you please check the above-mentioned configuration and help us resolve this issue?

           

          Thanks for your Cooperation.

           

           

          Best Regards

           

          Pulkit Kaul

           

           

           

          • Re: LSF Performance Plugins
            bikash

            Pulkit,

             

            In the ntapplugin.conf file you have not specified the proper tag as mentioned in the document. The tag does not take a "/".

            You have in your ntapplugin.conf file - /data/test      dlhfs06:/vol/vol10

             

            In the TR it is listed as -

            lsf_storage     fas6280c-svl05:/vol/USERVOL_16d_rg

            Bikash

            • Re: LSF Performance Plugins
              pulkitkaul

              Hi Bikash,

               

              We don't have any file system named lsf_storage created. Do I first need to create this, or how does the Linux machine identify this as a file system for submitting the jobs?

               

              In our setup we are using a NIS server for defining file systems through the automounter. In this case, how will I define this tag as a file system, or do we need to create some mapping between this tag and the file system?

               

               

              Please confirm.

               

              Best Regards

               

              Pulkit Kaul.

              • Re: LSF Performance Plugins
                bikash

                Pulkit,

                 

                "lsf_storage" is not a file system. It is just a tag. It does not have to be "lsf_storage"; the tag could be any other name but should not have a "/" in it. You will use this tag along with the "bsub" command as mentioned in step 11 of section 8.3 of the TR. Anytime you make a change to the ntapplugin.conf file, make sure you repeat steps 9 and 10 of section 8.3. Please refer to the README file for more details.
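                That naming rule (any tag name, as long as it contains no "/") is easy to check mechanically; a small validator sketch (hypothetical, not part of the shipped plugin):

```python
def valid_tag(tag):
    """A usable ExportNames tag per the rule above: non-empty, no '/'."""
    return bool(tag) and "/" not in tag

print(valid_tag("lsf_storage"))  # True: a plain tag name
print(valid_tag("/data/test"))   # False: a mount path, not a tag
```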

                 

                Bikash

                • Re: LSF Performance Plugins
                  pulkitkaul

                  Hi Bikash,

                   

                  I have made all the required changes in the "ntapplugin.conf" file as mentioned by you:

                   

                  [root@dlhadminsb conf]# more ntapplugin.conf
                  #
                  # $Id: filerplugin.conf,v 1.1 2010/12/30 22:46:23 david Exp $
                  #
                  # /etc/filesystemtags
                  #
                  # Interface defining the filers and file systems to the scheduling plugin
                  # this file has to be managed by the site system administrator.
                  # This file has two sections:
                  #
                  # First section lists the available filers filesystems
                  #
                  Begin ExportNames
                  test    dlhfs06:/vol/vol10
                  #test2  fas6280c-svl12:/vol/nfsDS
                  End ExportNames

                  #
                  # The second section lists the GLOBAL filer utilization thresholds
                  # and file system space thresholds.
                  #
                  Begin PluginPolicy
                  Max_DiskBusy    =       10
                  Max_NEDomain    =       75
                  Max_AvgVolLatency =     10
                  Min_AvailFiles  =       1000
                  Min_AvailSize   =       1000
                  End PluginPolicy

                  #
                  # Section where one can define Volume and Filer specific parameters
                  # and policies.  IF Filer or volume are not specified then, global
                  # thresholds will be used.
                  #
                  Begin FilerPolicy
                  #fas6280c-svl11:/vol/volTest    Min_AvailFiles = 1000
                  #fas6280c-svl11:/volnfsDS       Min_AvailSize = 1000
                  #fas6280c-svl11                 Max_DiskBusy =  50
                  #fas6280c-svl11                 Max_AvgVolLatency = 10
                  dlhfs06:/vol/vol10              Min_AvailSize = 800
                  End FilerPolicy

                  #
                  # Parameter section controlling plugin
                  # behaviour.
                  Begin Parameters
                  Debug yes
                  Work_Dir /data/logs/plugin/netapp-log
                  Counter_Dir /data/logs/plugin/DIRLOC
                  XMLReread 60
                  DryRunMode no
                  End Parameters
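                  The Begin <Section> ... End <Section> layout used above is simple enough to sanity-check with a few lines of Python; this sketch (not the plugin's own parser, whose behavior may differ) splits such a file into sections, which makes a misplaced or malformed entry easy to spot:

```python
def parse_sections(text):
    """Split ntapplugin.conf-style text into {section: [entry lines]}.
    Comment (#) and blank lines are dropped. Sketch only."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("Begin "):
            current = line[len("Begin "):]
            sections[current] = []
        elif line.startswith("End "):
            current = None
        elif current is not None:
            sections[current].append(line)
    return sections

conf = """
Begin ExportNames
test    dlhfs06:/vol/vol10
End ExportNames
Begin Parameters
Debug yes
End Parameters
"""
print(parse_sections(conf))
```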

                   

                  After making these changes I restarted the server and followed steps 9 and 10 mentioned in section 8.3, but "ntapplugin.log" has still not been created.

                  Can you please check and help us resolve this issue?

                   

                  Also, the folder "netapp_lsf_plugin_v2.0" does not contain the LSF hot job plugin.

                  If possible, can you please provide this plugin so that we can test it in our environment?

                   

                   

                  Best Regards

                   

                  Pulkit Kaul.

                    • Re: LSF Performance Plugins
                      pulkitkaul

                      Hi Bikash,

                       

                      Happy New Year!

                       

                      Just to inform you: we have tested the scheduler plugin and it is working as per our understanding.

                      But with the hot job plugin we are facing some problems:

 

                      We have configured the hot job plugin, but it is not able to generate proper reporting. In fact, I have tested it with the volume TEST created on filer dlhfs07, for which we have configured a disk busy threshold value of 10%.

                      After enabling the "elim.netapp_compute" and "systemtap" daemons, the plugin successfully monitored the disk busy threshold and generated a report for it.

                       

                      Below are the plugin processes that are currently running:

                      ps -ef | grep netapp

                      root      9588  9584  0 Jan08 ?        00:05:47 /usr/bin/python /sw/platform/lsf_test/9.1/linux2.6-glibc2.3-x86_64/etc/elim.netapp_compute

                      root     13769  7616  0 15:13 pts/0    00:00:00 /usr/bin/python26 ./netapp_lsf_hot_job_detector.py

                      root     32557  9588  0 14:03 ?        00:00:00 /usr/libexec/systemtap/stapio /sw/platform/lsf_test/netapp_nfsmon_2_6_32_358_el6_x86_64.ko

                      root      2069  7616  0 14:13 pts/0    00:00:31 python26 ontapmon.py config.ini

                       

                      But the report is not showing any LSF information (like job ID 2578, which is currently running and for which a job report file has been created).

                       

                      Below is the output of the "/var/log/netapp_lsf_hot_job_detector.log" file:

                       

                       

                      2014-01-10 14:30:56,392 DEBUG Checking XML file /data/logs/plugin/DIRLOC/dlhfs07.xml.
                      2014-01-10 14:30:56,403 INFO Found performance problems on the following controllers: dlhfs07
                      2014-01-10 14:30:56,407 DEBUG Discovered text files in job report directory /data/logs/plugin/job_reports/:  ['/data/logs/plugin/job_reports/2575-lsftest.txt',

                      '/data/logs/plugin/job_reports/2471-lsftest.txt', '/data/logs/plugin/job_reports/2576-lsftest.txt', '/data/logs/plugin/job_reports/2577-lsftest.txt',

                      '/data/logs/plugin/job_reports/2573-lsftest.txt', '/data/logs/plugin/job_reports/2574-lsftest.txt', '/data/logs/plugin/job_reports/2572-lsftest.txt',

                      '/data/logs/plugin/job_reports/2578-lsftest.txt'].
                      2014-01-10 14:30:56,408 DEBUG Read job reports: []
                      2014-01-10 14:30:56,408 INFO Created performance report:
                      ['Controller: dlhfs07\n\n\t-Aggregate fs07ag1 has a disk that has exceeded the threshold for acceptable maximum disk busy. Threshold: 10.00, Value: 18.43.\n\n']
                      2014-01-10 14:30:56,409 INFO Calling script: python netapp_lsf_hot_job_email.py ./NetApp_LSF_Hot_Job_Detector_Report_-_2014-01-10_14.30.56.txt

                       

                       

                      Also, after some time the LSF hot job detector script automatically gets killed, as shown below:

                       

                      [root@dlhl0571 lsf_test]# Traceback (most recent call last):
                        File "./netapp_lsf_hot_job_detector.py", line 1224, in <module>
                          hotJobDetector.run()
                        File "./netapp_lsf_hot_job_detector.py", line 344, in run
                          self.monitorPerformanceData()
                        File "./netapp_lsf_hot_job_detector.py", line 406, in monitorPerformanceData
                          self.findAffectingJobs(errorDocuments)
                        File "./netapp_lsf_hot_job_detector.py", line 615, in findAffectingJobs
                          consolidatedJobReports = self.readAndConsolidateAllJobReports()
                        File "./netapp_lsf_hot_job_detector.py", line 873, in readAndConsolidateAllJobReports
                          consolidatedJobReport = self.consolidateJobReport(content[1:], since=(currentTime - numSeconds))
                        File "./netapp_lsf_hot_job_detector.py", line 912, in consolidateJobReport
                          raise Exception('Invalid job report format.')
                      Exception: Invalid job report format.

                      [5]+  Exit 1                  ./netapp_lsf_hot_job_detector.py
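                      The "Invalid job report format." exception suggests that a single malformed report file is enough to kill the whole detector. A defensive variant would log and skip the bad file instead; this is only a sketch against a hypothetical one-line report format (the real job report layout is not shown in the thread):

```python
import logging

def consolidate_reports(reports):
    """Parse 'jobid:host:volume' lines (hypothetical format); skip
    malformed entries instead of raising and killing the daemon."""
    parsed = []
    for name, content in reports:
        fields = content.strip().split(":")
        if len(fields) != 3:
            logging.warning("Skipping malformed job report %s", name)
            continue
        parsed.append({"job": fields[0], "host": fields[1], "volume": fields[2]})
    return parsed

reports = [("2578-lsftest.txt", "2578:dlhl0571:/vol/vol10"),
           ("2577-lsftest.txt", "not-a-report")]
print(consolidate_reports(reports))
```

                      With this approach, one corrupt file in /data/logs/plugin/job_reports/ costs a warning, not the daemon.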

                       

                       

                       

                      Could you please check and help us resolve this issue?

                       

                       

                      Best Regards

                       

                      Pulkit Kaul.

                      • Re: LSF Performance Plugins
                        pulkitkaul

                        Also, below is the output of the report file:

                         

                        [root@dlhl0571 lsf_test]# more NetApp_LSF_Hot_Job_Detector_Report_-_2014-01-10_14.30.56.txt
                        **********
                        Controller: dlhfs07

                                -Aggregate fs07ag1 has a disk that has exceeded the threshold for acceptable maximum disk busy. Threshold: 10.00, Value: 18.43.

                        **********

                         

                        This shows there is a problem on filer dlhfs07, aggregate fs07ag1, because of the global threshold value of 10 that we have configured for the disk busy option.

                        But this report does not provide any LSF information, like the jobs currently running on this filer that might have generated this disk load.

                         

                        Best Regards

                        Pulkit Kaul

                        • Re: LSF Performance Plugins
                          bikash

                          Can you please check if the Systemtap runtime is running on all the nodes? I think the text files are not generated for each and every job that is running. The e-mail report that you are getting just checks the XML file's current values and compares them against the threshold values in the configuration file.

                           

                          Can you please verify if Systemtap is running on the nodes as documented in the paper?

                           

                          Thanks,

                           

                          Bikash

                          • Re: LSF Performance Plugins
                            pulkitkaul

                            Hi Bikash,

                             

                             

                            Thanks for the reply.

                            We have created a test LSF setup of three nodes (dlhl0571, dlhl0573 and dlhl0549) on which we have executed a very limited number of jobs, for which report files have been created with their job IDs, as shown below:

                             

                            [root@dlhl0573 ~]# ls -ltr /data/logs/plugin/job_reports/

                            total 32

                            -rw-r--r-- 1 nfsnobody nfsnobody 2644 Jan  8 18:45 2471-lsftest.txt

                            -rw-r--r-- 1 nfsnobody nfsnobody 2837 Jan  8 18:56 2572-lsftest.txt

                            -rw-r--r-- 1 nfsnobody nfsnobody 3238 Jan  8 19:06 2575-lsftest.txt

                            -rw-r--r-- 1 nfsnobody nfsnobody 2947 Jan  8 19:17 2574-lsftest.txt

                            -rw-r--r-- 1 nfsnobody nfsnobody 3331 Jan  8 19:28 2576-lsftest.txt

                            -rw-r--r-- 1 root      root      3215 Jan  8 19:30 2577-lsftest.txt

                            -rw-r--r-- 1 root      root      2893 Jan  8 19:31 2573-lsftest.txt

                            -rw-r--r-- 1 nfsnobody nfsnobody 1694 Jan 10 15:06 2578-lsftest.txt

                             

                             

                            Also, I have checked the systemtap runtime, which I believe is running fine on all three nodes, as shown below:

                             

                            [root@dlhl0571 ~]# ps -ef |grep neta

                            root     20232     1  0 11:20 ?        00:00:07 /usr/bin/python26 ./netapp_lsf_hot_job_detector.py

                            root     24650 24640  0 11:44 ?        00:00:24 /usr/bin/python /sw/platform/lsf_test/9.1/linux2.6-glibc2.3-x86_64/etc/elim.netapp_compute

                            root     24798 24650  0 11:44 ?        00:00:01 /usr/libexec/systemtap/stapio /sw/platform/lsf_test/netapp_nfsmon_2_6_32_358_el6_x86_64.ko

                             

                            [root@dlhl0573 ~]# ps -ef |grep netapp

                            root     19606 19557  0 15:09 pts/0    00:00:00 grep netapp

                            root     28314 28069  0 11:45 ?        00:00:14 /usr/bin/python /sw/platform/lsf_test/9.1/linux2.6-glibc2.3-x86_64/etc/elim.netapp_compute

                            root     28324 28314  0 11:45 ?        00:00:01 /usr/libexec/systemtap/stapio /sw/platform/lsf_test/netapp_nfsmon_2_6_32_358_el6_x86_64.ko

                             

                            [root@dlhl0549 ~]# ps -ef | grep netapp

                            root     10580 18837  0 15:07 pts/0    00:00:00 grep netapp

                            root     19018 19013  0 11:43 ?        00:00:09 /usr/bin/python /sw/platform/lsf_test/9.1/linux2.6-glibc2.3-x86_64/etc/elim.netapp_compute

                            root     19151 19018  0 11:43 ?        00:00:00 /usr/libexec/systemtap/stapio /sw/platform/lsf_test/netapp_nfsmon_2_6_32_358_el6_x86_64.ko

                             

                            Because of this, the job launched with job ID 2578 executed, and the report file "2578-lsftest.txt" was created successfully. But after the threshold value was crossed, the hot job script was not able to detect the jobs that may have pushed it past the threshold.

                             

                            Below is the test case that I ran using the volume TEST created on filer dlhfs07. Using this volume, we launched a job with job ID 2578, for which the report file "/data/logs/plugin/job_reports/2578-lsftest.txt" was created, as shown in the logs below. But after the "disk busy" threshold value, which I have configured as 10%, was reached, the hot job script triggered the alarm and created a report in which no job-related information is available.

                             

                             

                            2014-01-10 14:30:56,392 DEBUG Checking XML file /data/logs/plugin/DIRLOC/dlhfs07.xml.

                            2014-01-10 14:30:56,403 INFO Found performance problems on the following controllers: dlhfs07

                            2014-01-10 14:30:56,407 DEBUG Discovered text files in job report directory /data/logs/plugin/job_reports/:  ['/data/logs/plugin/job_reports/2575-lsftest.txt', '/data/logs/plugin/job_reports/2471-lsftest.txt', '/data/logs/plugin/job_reports/2576-lsftest.txt', '/data/logs/plugin/job_reports/2577-lsftest.txt', '/data/logs/plugin/job_reports/2573-lsftest.txt', '/data/logs/plugin/job_reports/2574-lsftest.txt', '/data/logs/plugin/job_reports/2572-lsftest.txt', '/data/logs/plugin/job_reports/2578-lsftest.txt'].

                            2014-01-10 14:30:56,408 DEBUG Read job reports: []

                            2014-01-10 14:30:56,408 INFO Created performance report:

                            ['Controller: dlhfs07\n\n\t-Aggregate fs07ag1 has a disk that has exceeded the threshold for acceptable maximum disk busy. Threshold: 10.00, Value: 18.43.\n\n']

                            2014-01-10 14:30:56,409 INFO Calling script: python netapp_lsf_hot_job_email.py ./NetApp_LSF_Hot_Job_Detector_Report_-_2014-01-10_14.30.56.txt

                             

                            Also, instead of generating data for the specific filer/volume, the "ontapmon" script is generating XML files for all the filers configured in DFM, and it later automatically stops updating those files, after which we need to re-run the script. Could you please check and confirm why this script is not regularly updating the XML file for the specific filer/volume configured in the "ntapplugin.conf" file, and why it stops updating these files?

                             

                            Could you please help us resolve these issues.

                             

                            I also have one query: is it possible to disable the "PEND" job option in the scheduler plugin when the defined threshold is met, so that we can use only the hot job scheduler plugin feature, which provides the list of running jobs that may impact performance on the storage controller?

                             

                             

                             

                            Best Regards

                             

                            Pulkit Kaul

                            • Re: LSF Performance Plugins
                              bikash
                              Currently Being Moderated

                              Pulkit,

                               

                              Please follow the instructions documented in step #5 on page 13 or step #6 on page 27 to check whether "systemtap" is running on the nodes.

                              Why is the user and group information showing up as "nfsnobody"/"nfsnobody"? What is the Linux version (uname -a)? What is the "id" of "nfsnobody"? Is that genuine user and group information?

                              Can you also send me the "nfsstat -m" output for the NFS shares that are mounted on the clients for the LSF plugin?

                               

                              Also send me  the "ntapplugin.conf" file. I would like to check the configuration.

                               

                              No, you cannot disable the PEND in the scheduler plugin. You cannot use the "hot job detection module" independently. The ontapmon.py script is part of the NetApp Scheduler plugin.

                               

                              Bikash

                              • Re: LSF Performance Plugins
                                pulkitkaul
                                Currently Being Moderated

                                Hi Bikash,

                                 

                                 

                                Thanks for the prompt reply.

                                 

                                I have already followed step #5 and step #6 on pages 13 and 27, and I believe they are working fine.

                                 

                                Below is the output of step #5 of page 13 for all the three hosts:

                                 

                                Master Host: dlhl0571

                                 

                                # stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'

                                Pass 1: parsed user script and 97 library script(s) using 205468virt/33680res/3048shr/31400data kb, in 320usr/40sys/357real ms.

                                Pass 2: analyzed script: 1 probe(s), 1 function(s), 3 embed(s), 0 global(s) using 435420virt/133356res/8256shr/123368data kb, in 1540usr/120sys/1669real ms.

                                Pass 3: using cached /root/.systemtap/cache/3d/stap_3da1faca192a692856ab52a4f6a4a277_1471.c

                                Pass 4: using cached /root/.systemtap/cache/3d/stap_3da1faca192a692856ab52a4f6a4a277_1471.ko

                                Pass 5: starting run.

                                read performed

                                Pass 5: run completed in 10usr/60sys/377real ms.

                                 

                                Execution Hosts: dlhl0573, dlhl0549

                                 

                                # stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'

                                Pass 1: parsed user script and 97 library script(s) using 205468virt/33680res/3048shr/31400data kb, in 290usr/40sys/322real ms.

                                Pass 2: analyzed script: 1 probe(s), 1 function(s), 3 embed(s), 0 global(s) using 435420virt/133356res/8256shr/123368data kb, in 1480usr/110sys/1590real ms.

                                Pass 3: using cached /root/.systemtap/cache/5f/stap_5f4bb2a244f141ffc44fc8aee29e71cb_1471.c

                                Pass 4: using cached /root/.systemtap/cache/5f/stap_5f4bb2a244f141ffc44fc8aee29e71cb_1471.ko

                                Pass 5: starting run.

                                read performed

                                Pass 5: run completed in 10usr/60sys/373real ms.

                                 

                                # stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'

                                Pass 1: parsed user script and 97 library script(s) using 205460virt/33688res/3056shr/31392data kb, in 170usr/20sys/403real ms.

                                Pass 2: analyzed script: 1 probe(s), 1 function(s), 3 embed(s), 0 global(s) using 435412virt/133356res/8264shr/123360data kb, in 1100usr/120sys/3273real ms.

                                Pass 3: using cached /root/.systemtap/cache/62/stap_62cf96d07a8158df2ae97bbf75d56607_1471.c

                                Pass 4: using cached /root/.systemtap/cache/62/stap_62cf96d07a8158df2ae97bbf75d56607_1471.ko

                                Pass 5: starting run.

                                read performed

                                Pass 5: run completed in 0usr/30sys/333real ms.

                                 

                                 

                                I also checked the "ps -ef | grep stap" output on all three servers, as mentioned in step #6 on page 27.

                                 

                                 

                                # ps -ef | grep stap

                                root     12278 11441  0 11:57 pts/1    00:00:00 grep stap

                                root     17172 24650  0 06:32 ?        00:00:01 /usr/libexec/systemtap/stapio /sw/platform/lsf_test/netapp_nfsmon_2_6_32_358_el6_x86_64.ko

                                 

                                # ps -ef | grep stap

                                root     12194 28314  0 02:37 ?        00:00:02 /usr/libexec/systemtap/stapio /sw/platform/lsf_test/netapp_nfsmon_2_6_32_358_el6_x86_64.ko

                                root     13210 13162  0 11:57 pts/0    00:00:00 grep stap

                                 

                                # ps -ef | grep stap

                                root      4652 19018  0 Jan15 ?        00:00:12 /usr/libexec/systemtap/stapio /sw/platform/lsf_test/netapp_nfsmon_2_6_32_358_el6_x86_64.ko

                                root     12969 12920  0 11:57 pts/0    00:00:00 grep stap

                                 

                                 

                                I believe the user "nfsnobody" is created by default on all three servers, and it was the user under which the script created these job report files:

                                 

                                # more /etc/passwd | grep nfsnobody

                                nfsnobody:x:65534:65534:Anonymous NFS User:/var/lib/nfs:/sbin/nologin

                                 

                                [root@dlhl0573 ~]# more /etc/passwd | grep nfsnobody

                                nfsnobody:x:65534:65534:Anonymous NFS User:/var/lib/nfs:/sbin/nologin

                                 

                                # more /etc/passwd | grep nfsnobody

                                nfsnobody:x:65534:65534:Anonymous NFS User:/var/lib/nfs:/sbin/nologin

                                 

                                Also attached are all the required command outputs and files that you requested.

                                 

                                Kindly check and confirm the next course of action.

                                 

                                Thanks for your support and cooperation.

                                 

                                Best Regards

                                 

                                Pulkit Kaul

                                 

                                 

                                 

                                 

                                 

                                 

                                 

                                 


                              • Re: LSF Performance Plugins
                                pulkitkaul
                                Currently Being Moderated

                                Hi Bikash,

                                 

                                Any update on my queries below?

                                 

                                Please confirm so that we can plan our next steps accordingly.

                                 

                                Best Regards

                                 

                                Pulkit Kaul.

                                 


                                • Re: LSF Performance Plugins
                                  bikash
                                  Currently Being Moderated

                                  Pulkit,

                                   

                                  I am checking internally with Engineering on this issue and will get back to you shortly.

                                   

                                  Thanks for your patience.

                                   

                                  Bikash

                                  • Re: LSF Performance Plugins
                                    zulanch
                                    Currently Being Moderated

                                    Hi Pulkit,

                                     

                                    Can you check to make sure the clocks are in sync among the three LSF nodes? If the clocks are out of sync, the hot job detector won't be able to match LSF job reports to the detected performance issue.

                                     

                                    -Ben

                                    • Re: LSF Performance Plugins
                                      pulkitkaul
                                      Currently Being Moderated

                                      Hi Ben,

                                       

                                      I have checked; all three nodes are in sync with our time server.

                                       

                                      Best Regards

                                       

                                      Pulkit Kaul.

                                       


                                    • Re: LSF Performance Plugins
                                      pulkitkaul
                                      Currently Being Moderated

                                      Hi Ben,

                                       

                                      I have checked; all three nodes are in sync with our time server:

                                        root@del99fp[21]> ssh dlhl0571 date

                                      Fri Jan 24 09:29:09 IST 2014

                                      root@del99fp[22]> ssh dlhl0573 date

                                      Fri Jan 24 09:29:12 IST 2014

                                      root@del99fp[23]> ssh dlhl0549 date

                                      Fri Jan 24 09:28:52 IST 2014

                                       

                                      Best Regards

                                       

                                      Pulkit Kaul.

                                       

                                      Date: Wed, 22 Jan 2014 07:09:57 -0800

                                      From: xdl-communities@communities.netapp.com

                                      To: pulkitkaul@hotmail.com

                                      Subject: Re: LSF Performance Plugins - Re: LSF Performance Plugins

                                                                                                                      Re: LSF Performance Plugins

                                       

                                       

                                       

                                          created by zulanch in High-Tech Industry User Group - View the full discussion

                                       

                                       

                                       

                                       

                                       

                                       

                                      Hi Pulkit,

                                      Can you check to make sure the clocks are in sync among the three LSF nodes? If the clocks are out of sync, the hot job detector won't be able to match LSF job reports to the detected performance issue.

                                      -Ben

                                       

                                       

                                       

                                       

                                           Reply to this message by replying to this email -or- go to the message on NetApp Community

                                       

                                           Start a new discussion in High-Tech Industry User Group by email or at NetApp Community

                                      • Re: LSF Performance Plugins
                                        zulanch
                                        Currently Being Moderated

                                        When the hot job detection script detects a performance problem in the XML file, it checks the last modified time of the job report files and reads the data in any that have been modified in the last two minutes. The log you posted indicates that the script sees the job report text files, but none of them were modified recently.

                                         

                                        Are you running the hot job detector too long after the performance problem occurred? Or are you running it on a host other than those three nodes (with a clock that might be out of sync)? It still seems like something is going wrong with the time comparison in your setup.

                                         

                                        -Ben
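The two-minute last-modified-time window described above can be sketched as follows. This is a minimal illustration assuming a flat directory of `*.txt` report files (as seen in the logs earlier in the thread), not the hot job detector's actual code:

```python
import os
import time

def recent_job_reports(report_dir, window_seconds=120):
    """Return paths of job report .txt files modified within the window.

    Illustrative mtime-based filter like the one described above; the
    real hot job detector's implementation may differ.
    """
    now = time.time()
    recent = []
    for name in sorted(os.listdir(report_dir)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(report_dir, name)
        # Keep only reports touched within the last `window_seconds`.
        if now - os.path.getmtime(path) <= window_seconds:
            recent.append(path)
    return recent
```

Note that a report written by a host whose clock is ahead of or behind the detector's host can fall outside this window even though the job just ran, which is why clock sync matters here.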

                                        • Re: LSF Performance Plugins
                                          pulkitkaul
                                          Currently Being Moderated

                                          Hi Ben,

                                           

                                          Currently we are facing two problems :

                                           

                                          1. The performance data we are getting from the "ontapmon.py" script covers all the filers configured on DFM, even though we have currently configured monitoring parameters only for filer "dlhfs07", as shown below:

                                           

                                          #
                                          # $Id: filerplugin.conf,v 1.1 2010/12/30 22:46:23 david Exp $
                                          #
                                          # /etc/filesystemtags
                                          #
                                          # Interface defining the filers and file systems to the scheduling plugin
                                          # this file has to be managed by the site system administrator.
                                          # This file has two sections:
                                          #
                                          # First section lists the available filers filesystems
                                          #
                                          Begin ExportNames
                                          #test1  fas6280c-svl11:/vol/volTest,fas6280c-svl11:/vol/lsfvol
                                          test1   dlhfs07:/vol/TEST
                                          #test2  fas6280c-svl12:/vol/nfsDS
                                          End ExportNames
                                          #
                                          # The second section lists the GLOBAL filer utilization thresholds
                                          # and file system space thresholds.
                                          #
                                          Begin PluginPolicy
                                          Max_DiskBusy    =       10
                                          Max_NEDomain    =       75
                                          Max_AvgVolLatency =     1
                                          Min_AvailFiles  =       100
                                          Min_AvailSize   =       1000
                                          End PluginPolicy
                                          #
                                          # Section where one can define Volume and Filer specific parameters
                                          # and policies.  IF Filer or volume are not specified then, global
                                          # thresholds will be used.
                                          #
                                          Begin FilerPolicy
                                          #fas6280c-svl11:/vol/volTest    Min_AvailFiles = 1000
                                          dlhfs07:/vol/TEST               Min_AvailFiles = 10
                                          #fas6280c-svl11:/volnfsDS       Min_AvailSize = 1000
                                          #dlhfs06:/vol/vol10             Min_AvailSize = 800
                                          #dlhfs06:/vol/vol10             Max_DiskBusy =  4
                                          #dlhfs06:/vol/vol10             Max_AvgVolLatency = 1
                                          End FilerPolicy
                                          #
                                          # Parameter section controlling plugin
                                          # behaviour.
                                          Begin Parameters
                                          Debug yes
                                          Work_Dir /data/logs/plugin/netapp-log
                                          Counter_Dir /data/logs/plugin/DIRLOC
                                          XMLReread 60

                                           

                                          Is this normal behaviour, or is something wrong in our configuration?
                                          The same applies to the hot job plugin, which starts monitoring the job reports for all the filers whose data is gathered by the "ontapmon.py" script.
                                          Also, after some time the XML files stop being updated by "ontapmon.py", and I have to rerun the script.

                                          2. To test the hot job plugin, I created a test volume "TEST" on filer dlhfs07 and configured the global threshold "diskbusy = 10". The plugin detects when this threshold is exceeded, but while it generates XML files and resource reports for all the filers, it does not collect the LSF job details for the test jobs currently running on the "TEST" volume of filer dlhfs07.

                                          Please look into this and help us resolve this issue as soon as possible.

                                           

                                           

                                          Best Regards

                                           

                                          Pulkit Kaul

                                           


                                          • Re: LSF Performance Plugins
                                            zulanch
                                            Currently Being Moderated

                                            1. The performance data we are getting from the "ontapmon.py" script is for all the filers configured on DFM, even though we have configured monitoring parameters only for filer "dlhfs07", as mentioned below:

                                            Unfortunately, in the current version of the plugin, nothing can be done about this. The ontapmon.py script can only collect data for all filers configured on DFM regardless of the exports in the scheduler config. Limiting the collection to specific filers will likely be included in a future release.

                                             

                                            Also, after some time the XML files stop being updated by the "ontapmon.py" script, and I have to rerun it.

                                            Can you check the log file and console after the script stops updating the XML files and post your findings here so I can help debug?

                                             

                                            2. To test the hot job plugin, I created a test volume "TEST" on filer dlhfs07 and configured the global threshold "diskbusy = 10". The plugin detects when this threshold is exceeded, but while it generates XML files and resource reports for all the filers, it does not collect the LSF job details for the test jobs currently running on the "TEST" volume of filer dlhfs07.

                                            That was my understanding of the problem, which I believe is most likely caused by a time synchronization issue somewhere after looking at the log file you posted earlier in this thread. I'll copy my previous reply here:

                                             

                                            When the hot job detection script detects a performance problem in the XML file, it checks the last modified time of the job report files and reads the data in any that have been modified in the last two minutes. The log you posted indicates that the script sees the job report text files, but none of them were modified recently.

                                             

                                            Are you running the hot job detector too long after the performance problem has occurred? Or are you running it on another host besides those three nodes (with a time that might be out of sync)? It still seems like something is going wrong with the time detection in your setup.

                                             

                                            -Ben

                                            • Re: LSF Performance Plugins
                                              pulkitkaul
                                              Currently Being Moderated

                                              Hi Ben,

                                               

                                              Thanks for the prompt reply.

                                               

                                              I have restarted the entire setup to try to gather information that will help us diagnose this issue.
                                              After the restart the same problem reoccurred, and "ontapmon.py" stopped updating the XML files.
                                              This time I have not used the hot job plugin, because I want to diagnose this issue first; in your previous reply you said the hot job plugin problem may be caused by it.

                                              I have attached the log files "ontapmon_error.log" and "ntapplugin.log", which may help you find the root cause of this issue.

                                              These are the only three nodes we are using in this test setup.

                                              The hot job plugin keeps running in the background and creates report files for all the filers configured on DFM, even though we have configured the threshold only for the selected volume (TEST); the old log file "netapp_lsf_hot_job_detector.log.2014-01-10" is attached.

                                              Please check and help us resolve this issue as soon as possible.

                                               

                                               

                                              Best Regards

                                               

                                              Pulkit Kaul.

                                               


                                            • Re: LSF Performance Plugins
                                              pulkitkaul
                                              Currently Being Moderated

                                              Attached are the log files.

                                               

                                               

                                              Please check and confirm.

                                               

                                               

                                              Best Regards

                                               

                                              Pulkit Kaul.

                                               

