6 Replies. Latest reply: Feb 25, 2014 12:50 AM by mark.schuren

Intercluster XDP (SnapVault) performance over 10G

mark.schuren CertifiedPlus Sprinter

Hi all,

 

I'm having performance issues with SnapMirror XDP relationships between two clusters (primary: 4-node 3250; secondary: 2-node 3220; all running cDOT 8.2P5).

 

The replication LIFs on both sides are pure 10G (private replication VLAN), flat network (no routing), using jumbo frames (I also tested without jumbo frames and the problem persists).

 

The vaulting works in general, but average throughput for a single node/relationship never goes beyond 1 Gbit/s; most of the time it is much slower (300-500 Mbit/s or less).

 

I verified that neither the source node(s) nor the destination node(s) are CPU or disk bound during the transfers (at least not all of the source nodes).

I also verified that the SnapMirror traffic is definitely going through the 10G interfaces.

There is no compressed volume involved, only dedupe (on source).

The dedupe schedules are not within the same time window.

I also checked the physical network interface counters on the switch and the NetApps: no errors or drops, all clean.

 

However, the replication is SLOW, no matter what I try.

The customer's impression is that it got even slower over time: throughput of a source node was ~1 Gbit/s when the relationships were initialized (not as high as expected) and dropped to ~500 Mbit/s after some months of operation and regular updates.

Meanwhile the daily update (of all relationships) sums up to ~1.4 TB per day, and it takes almost the whole night to finish :-(
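
To put the numbers above in proportion, here is a rough back-of-the-envelope sketch of how long the daily update would take at the rates mentioned. It assumes the ~1.4 TB is decimal (1.4e12 bytes) and treats all relationships as one aggregate stream at a constant rate, which is a simplification, since real updates run in parallel across nodes. The function name is made up for illustration.

```python
# Rough transfer-time estimate for the nightly update described above.
# Assumptions: 1.4 TB means 1.4e12 bytes, and the whole update moves at
# one constant aggregate rate (a simplification of parallel transfers).

DAILY_BYTES = 1.4e12  # ~1.4 TB per day, figure taken from the post


def transfer_hours(total_bytes: float, rate_mbit_per_s: float) -> float:
    """Hours needed to move total_bytes at a sustained rate in Mbit/s."""
    bits = total_bytes * 8
    seconds = bits / (rate_mbit_per_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # 300-500 Mbit/s are the observed rates; 1000 and 10000 are for comparison.
    for rate in (300, 500, 1000, 10000):
        hours = transfer_hours(DAILY_BYTES, rate)
        print(f"{rate:>6} Mbit/s -> {hours:5.1f} h")
```

At an aggregate 300-500 Mbit/s this works out to roughly 6 to 10 hours, which fits the "almost the whole night" observation; at full 10 Gbit/s line rate the same volume would need well under an hour.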

 

So the question is: how to tune that?

Is anyone having similar issues regarding Snapmirror XDP throughput over 10Gbit Ethernet?

 

Are there any configurable parameters (network compression? TCP window size? TCP delayed ACKs? Anything I haven't thought of?) on the source/destination side?
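
On the TCP window question: for a single TCP connection, the window size and round-trip time together put an upper bound on throughput (at most one window per round trip). The sketch below only illustrates that relationship. The RTT and window values are hypothetical, not measured in this environment, and whether ONTAP exposes such a tunable for SnapMirror is not established in this thread.

```python
# Bandwidth-delay product sketch. The RTT and window sizes used here are
# assumptions for illustration, not measurements from these clusters.


def max_throughput_mbit(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound for one TCP connection: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e6


def window_needed_bytes(rate_mbit_per_s: float, rtt_seconds: float) -> float:
    """Window size needed to keep a link of the given rate full."""
    return rate_mbit_per_s * 1e6 / 8 * rtt_seconds


if __name__ == "__main__":
    rtt = 0.0005  # hypothetical 0.5 ms round trip on a flat LAN
    for window in (64 * 1024, 256 * 1024, 1024 * 1024):
        limit = max_throughput_mbit(window, rtt)
        print(f"{window // 1024:>5} KiB window -> {limit:8.0f} Mbit/s max")
    needed_kib = window_needed_bytes(10000, rtt) / 1024
    print(f"Filling 10 Gbit/s at this RTT needs ~{needed_kib:.0f} KiB")
```

If the effective window or the real RTT differs from these assumed values, the bound shifts accordingly, so measuring the actual RTT between the intercluster LIFs would be the first step before drawing any conclusion.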

 

Thankful for all ideas / comments,

Mark

  • Re: Intercluster XDP (SnapVault) performance over 10G
    mark.schuren CertifiedPlus Sprinter

    No one?

     

    Is anyone using SnapMirror XDP actually? In a 10gE environment?

     

    What throughput do you see?

  • Re: Intercluster XDP (SnapVault) performance over 10G
    officeworks Novice

    When we had vol move issues between cluster nodes (via 10Gb), we hit BUG 768028 after opening a performance case. This may or may not be related to what you see, but it also impacted SnapMirror relationships.

     

    http://support.netapp.com/NOW/cgi-bin/bugrellist?bugno=768028

     

    The workaround was to stop the redirect scanners from running, or you can disable free_space_realloc completely.

    storage aggregate modify -aggregate <aggr_name> -free-space-realloc no_redirect



  • Re: Intercluster XDP (SnapVault) performance over 10G
    TWIELGOS2 Novice

    We have been having this problem for months with plain old snapmirror.  We have a 10G connection between two different clusters, and snapmirror performance has been calculated at around 100Mb/sec - completely unacceptable.

     

    We disabled reallocate per bug 768028, no help.  We disabled throttling, and that helped, but not enough - and as a debug flag, disabling throttling comes with a production performance cost.

     

    We used this to disable the throttle:

      • node run local -command "priv set diag; setflag repl_throttle_enable 0;"
    • Re: Intercluster XDP (SnapVault) performance over 10G
      mark.schuren CertifiedPlus Sprinter

      Thanks for the tips.

       

      I tried both settings (disabling free-space realloc on the source and destination aggrs, as well as setflag repl_throttle_enable 0 on the source and destination nodes), but this made at most a slight difference.

       

      Meanwhile I experimented a bit more and migrated all intercluster interfaces of my source nodes to gigabit ports (instead of VLAN-tagged 10G interfaces).

      Although unexpected, this helped quite a lot! Going to slower NICs on the source side doubled my overall throughput.

       

      The next step is to finally open a performance case.

      • Re: Intercluster XDP (SnapVault) performance over 10G
        officeworks Novice

        It smells like flow control on the 10Gb links to me.

        Ask the network guys if they see a lot of RX/TX pause packets on the switch ports. You may as well check for switch port errors while you are at it, or work with them to see if there are a lot of retransmits.

         

        This is the kind of situation where I wish NetApp had network performance tools like iperf, so we could validate infrastructure and throughput when the system is put in.

        • Re: Intercluster XDP (SnapVault) performance over 10G
          mark.schuren CertifiedPlus Sprinter

          Checked that already. Switch stats look clean (don't have them anymore).

           

          Ifstat on netapp side looks good:

           

           

          -- interface  e1b  (54 days, 8 hours, 12 minutes, 26 seconds) --

          RECEIVE
          Frames/second:    3690  | Bytes/second:    18185k | Errors/minute:       0
          Discards/minute:     0  | Total frames:    30935m | Total bytes:     54604g
          Total errors:        0  | Total discards:      0  | Multi/broadcast:     7
          No buffers:          0  | Non-primary u/c:     0  | Tag drop:            0
          Vlan tag drop:       0  | Vlan untag drop:     0  | Vlan forwards:       0
          Vlan broadcasts:     0  | Vlan unicasts:       0  | CRC errors:          0
          Runt frames:         0  | Fragment:            0  | Long frames:         0
          Jabber:              0  | Bus overruns:        0  | Queue drop:          0
          Xon:                 0  | Xoff:                0  | Jumbo:               0
          TRANSMIT
          Frames/second:    3200  | Bytes/second:    12580k | Errors/minute:       0
          Discards/minute:     0  | Total frames:    54267m | Total bytes:       343t
          Total errors:        0  | Total discards:      0  | Multi/broadcast: 78699
          Queue overflows:     0  | No buffers:          0  | Xon:                 6
          Xoff:              100  | Jumbo:            3285m | Pktlen:              0
          Timeout:             0  | Timeout1:            0
          LINK_INFO
          Current state:       up | Up to downs:         2  | Speed:           10000m
          Duplex:            full | Flowcontrol:       full

           

          So there are some Xon/Xoff packets, but very few.

           

          The very same link (same node) transports NFS I/O at 500 MByte/sec (via a different LIF), so I don't think the interface has any problems. However, SnapMirror average throughput remains below 50 MByte/sec, no matter what I try.
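
To quantify "very few" for the pause frames in the ifstat output quoted above, a small sketch can turn the abbreviated counters into plain numbers and compute the share of Xoff frames. It assumes the k/m/g/t suffixes are decimal multipliers, which is a guess at how ifstat abbreviates; the helper names are made up for illustration.

```python
# Assumption: ifstat suffixes k/m/g/t are decimal multipliers (1e3 .. 1e12).
SUFFIX = {"k": 1e3, "m": 1e6, "g": 1e9, "t": 1e12}


def parse_counter(text: str) -> float:
    """Convert an ifstat-style counter such as '54267m' to a plain number."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIX:
        return float(text[:-1]) * SUFFIX[text[-1]]
    return float(text)


def pause_ratio(xoff: str, total_frames: str) -> float:
    """Xoff pause frames per transmitted frame."""
    return parse_counter(xoff) / parse_counter(total_frames)


if __name__ == "__main__":
    # TRANSMIT values from the e1b output quoted in the post above.
    ratio = pause_ratio("100", "54267m")
    print(f"Xoff per transmitted frame: {ratio:.2e}")
```

With the quoted values this is on the order of two Xoff frames per billion transmitted frames, or roughly two per day over the 54-day counter period, which supports the reading that flow control is not obviously the bottleneck on this interface.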
