
Troubleshooting iSCSI connection on the FlashArray

The purpose of this document is to provide troubleshooting tips to isolate physical, logical, or performance iSCSI issues. It is broken down into these sections: 

  • Physical
  • Logical
  • Performance (SAN Latency)

Physical

Begin troubleshooting at the physical layer when possible. We do not always have someone onsite, so using the following tools will assist you in your physical troubleshooting remotely. 

Tools / greps

Common Failure Points

  • Customer neglected to plug in the port, or the port is plugged into the array but not the switch.
  • The switch port or array port is disabled. 
  • The cable, HBA, switch module, or Array/Switch SFP being used is bad. 

Troubleshooting 

Checking the link status:

  • Ca_link > open playback link > select System > System Health
    • In the GUI we can see whether the port in question has a green light indicating the physical link is up, as seen below. If it does not, make sure it is plugged in at both the array and the switch. If there is still no link, it is worthwhile making sure the switch port is not shut down. It is also possible to have a link light on one side but not the other with a bad cable. 

(Screenshot: System > System Health view showing the port's green link indicator.)

ethtool can do more than provide information about Ethernet interfaces. Some of its commands could be impactful, so use them wisely. 

  • We can also use ethtool live on the array (ethtool -d ethX) in the CLI or search for it in the diagnostics.log file. As seen below we run the following command. 
$less diagnostics.log-2017111313.gz
/ethtool -d eth8
Nov 13 13:17:34 ethtool -d eth8
------------------------------------------------------------------------
0x042A4: LINKS (Link Status register)                 0x79081113
       Link Status:                                   up
       Link Speed:                                    10G
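
If you would rather not page through the file interactively, the same information can be pulled out with a grep. This is a minimal sketch; the file name is the example from above, and it assumes the diagnostics log records the ethtool output in the layout shown.
# Print the "ethtool -d ethX" headers plus the Link Status/Speed lines; -n line numbers
# let you match each status line back to the interface header above it.
zgrep -n -E 'ethtool -d eth|Link Status|Link Speed' diagnostics.log-2017111313.gz | less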
  • The following command shows the negotiated link speed of the different Ethernet ports, live on the array or in fuse. Here's the command syntax and output, where we can see which links are up. Notice eth1 is at 0.00 b/s, which is normal for a down link. If the speed shows 0.00 b/s but the link is green in the GUI or the status is up, then most likely the switch was unable to negotiate a speed, possibly due to a bad cable or a logical issue. 
purehw list |grep ETH
CT0.ETH0   ok             -         -     0      1.00 Gb/s   -            -        eth_port       PLATSASB_PCTFL1650015D_0000:0b:00.0_p0   PLATSASB_PCTFL1650015D
CT0.ETH1   ok             -         -     1      0.00 b/s    -            -        eth_port       PLATSASB_PCTFL1650015D_0000:0a:00.0_p1   PLATSASB_PCTFL1650015D
CT0.ETH2   ok             -         -     2      10.00 Gb/s  -            -        eth_port       PLATSASB_PCTFL1650015D_0000:81:00.0_p2   PLATSASB_PCTFL1650015D
CT0.ETH3   ok             -         -     3      10.00 Gb/s  -            -        eth_port       PLATSASB_PCTFL1650015D_0000:81:00.1_p3   PLATSASB_PCTFL1650015D
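
As a quick filter (a sketch that relies on the column layout shown above), list only the Ethernet ports reporting no negotiated speed:
# Ports showing 0.00 b/s have no negotiated link speed.
purehw list | grep ETH | grep '0.00 b/s'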
  • port_counter --deltas --quiet --eth is a good way to view port errors on iSCSI. Because these counters are historical, it is important to make sure the errors were incrementing while the event was occurring. The --deltas option shows the changes over 24 hours. 
port_counter --deltas --quiet --eth
CT1.ETH4:
+---------------------+--------------+-----------------+-------------------+------------------+---------------------+---------------------+--------------------+
|           Timestamp |   Collisions |   Rx Crc Errors |   Rx Frame Errors |   Rx Over Errors |   Tx Aborted Errors |   Tx Carrier Errors |   Tx Timeout Count |
|---------------------+--------------+-----------------+-------------------+------------------+---------------------+---------------------+--------------------|
| 2017-09-26 01:00:00 |            0 |            8339 |                 0 |                0 |                   0 |                   0 |                  0 |
| 2017-09-26 02:00:00 |            0 |            8340 |                 0 |                0 |                   0 |                   0 |                  0 |
| 2017-09-26 03:00:00 |            0 |            8340 |                 0 |                0 |                   0 |                   0 |                  0 |
...
| 2017-09-26 12:00:00 |            0 |            8340 |                 0 |                0 |                   0 |                   0 |                  0 |
| 2017-09-26 13:00:00 |            0 |            8340 |                 0 |                0 |                   0 |                   0 |                  0 |
...
| 2017-09-26 20:00:00 |            0 |            8340 |                 0 |                0 |                   0 |                   0 |                  0 |
| 2017-09-26 21:00:00 |            0 |            8345 |                 0 |                0 |                   0 |                   0 |                  0 |
| 2017-09-26 22:00:00 |            0 |            8345 |                 0 |                0 |                   0 |                   0 |                  0 |
| 2017-09-26 23:00:00 |            0 |            8345 |                 0 |                0 |                   0 |                   0 |                  0 |
+---------------------+--------------+-----------------+-------------------+------------------+---------------------+---------------------+--------------------+
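
To spot the hours in which a counter actually moved, you can compare successive rows. The awk sketch below assumes the pipe-separated table layout shown above and watches the Rx Crc Errors column (the fourth pipe-delimited field); adjust the field number for other counters.
# Print only the hours where Rx Crc Errors changed from the previous row.
port_counter --deltas --quiet --eth | awk -F'|' '
    NF > 8 && $2 ~ /20[0-9][0-9]-/ {     # data rows have a timestamp in field 2
        gsub(/^ +| +$/, "", $2)
        gsub(/ /, "", $4)                # field 4 is the Rx Crc Errors column
        if (prev != "" && $4 != prev) print $2, "Rx CRC errors:", prev, "->", $4
        prev = $4
    }'
In the table above this would flag the 02:00 and 21:00 rows, where the counter incremented.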
  • The above was due to an incompatible SFP (see ES-32020). This was diagnosed by looking in the syslog. As a recommendation, use less in FUSE on both controllers when viewing their syslogs, then search by entering /unsupported SFP+ and review the log lines both before and after the unsupported error. (See the JIRA to find out what was done to resolve the issue.)
/var/log/syslog-2017092612.gz:Sep 26 11:27:56 pure1-ct0 kernel: [16960442.085352,14] ixgbe 0000:82:00.1 eth4: NIC Link is Down
/var/log/syslog-2017092612.gz:Sep 26 11:27:56 pure1-ct0 kernel: [16960442.085466,14] ixgbe 0000:82:00.1 eth4: initiating reset to clear Tx work after link loss
/var/log/syslog-2017092612.gz:Sep 26 11:27:57 pure1-ct0 kernel: [16960443.563884,06] ixgbe 0000:82:00.1 eth4: Reset adapter
/var/log/syslog-2017092612.gz:Sep 26 11:27:58 pure1-ct0 kernel: [16960444.081298,06] ixgbe 0000:82:00.1 eth4: detected SFP+: 6
/var/log/syslog-2017092612.gz:Sep 26 11:28:02 pure1-ct0 kernel: [16960448.121850,18] ixgbe 0000:82:00.1: removed PHC on eth4
/var/log/syslog-2017092612.gz:Sep 26 11:28:02 pure1-ct0 kernel: [16960448.490613,1d] ixgbe 0000:82:00.1: registered PHC device on eth4
/var/log/syslog-2017092612.gz:Sep 26 11:28:03 pure1-ct0 kernel: [16960448.666722,1d] ixgbe 0000:82:00.1 eth4: detected SFP+: 6
/var/log/syslog-2017092612.gz:Sep 26 11:35:20 pure1-ct0 kernel: [16960885.298616,14] ixgbe 0000:82:00.1: failed to initialize because an unsupported SFP+ module type was detected.
/var/log/syslog-2017092612.gz:Sep 26 11:35:20 pure1-ct0 kernel: [16960885.298637,14] ixgbe 0000:82:00.1: Reload the driver after installing a supported module.
/var/log/syslog-2017092612.gz:Sep 26 11:35:20 pure1-ct0 kernel: [16960885.298946,14] ixgbe 0000:82:00.1: removed PHC on eth4
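
The same search works non-interactively across all of the compressed syslogs; a small sketch:
# Find SFP-related driver complaints, with a little surrounding context.
zgrep -i -B 2 -A 2 'unsupported SFP' syslog-*.gz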

For additional information on physical troubleshooting, see Cisco's Troubleshooting Switch Port and Interface Problems.

Logical

Use the following tools to help troubleshoot remotely:

Tools / greps

Common failure points

The following are common failure points:

  • Host initiator port IP addresses are in a different subnet than the array target ports.
  • Host best practices are not set. 
  • A common issue with an MTU of 9000 (jumbo frames) is that the target and initiator are configured correctly but the switch port is not. 
  • Incompatible or buggy NIC Driver Version or NIC Firmware Version.
  • Purity bug, search our wiki release history inside Purity Release Center
  • The host is connected to the array correctly but has not been given permission to access a volume.

Troubleshooting

Per the SAN Guidelines for Maximizing Pure Performance, we see the following recommendations.

  • Do not route iSCSI. 
    • To check this, compare the iSCSI IP address and subnet from purenetwork list to the IP addresses seen in pureport list --initiator. If an IP address is not in the same subnet, the traffic would need to be routed for the two devices to communicate. As a recommendation, use a subnet calculator to see which IP addresses are in the same subnet; for more information on IP subnets see the wiki article Subnetworking. Having half of the initiator and target IP addresses in one subnet and the other half in another is acceptable. 
From purenetwork list 
ct0.eth4  True     -       static  10.10.17.78     255.255.255.128  -             9000  90:e2:ba:ca:8c:b1  10.00 Gb/s  iscsi        -
ct0.eth5  True     -       static  10.10.25.79     255.255.255.128  -             9000  90:e2:ba:ca:8c:b0  10.00 Gb/s  iscsi 

From pureport list --initiator
-              10.10.17.63:15186   iqn.1998-01.com.vmware:stl10r14c017-18edb0cc  CT0.ETH4  -           10.10.17.78:3260   iqn.2010-06.com.purestorage:flasharray.73ecbb5f9334d8bb  False
-              10.10.25.62:56176   iqn.1998-01.com.vmware:stl10r14c016-5e5d7cd1  CT0.ETH5  -           10.10.25.79:3260   iqn.2010-06.com.purestorage:flasharray.73ecbb5f9334d8bb  False

The array port ct0.eth4 (10.10.17.78) and the host port 10.10.17.63 are in the same subnet. Ct0.eth5 (10.10.25.79) and the host port 10.10.25.62 are in a different subnet than ct0.eth4. For ct0.eth4 (10.10.17.78) to communicate with host 10.10.25.62, the traffic would need to be routed. With a subnet mask of 255.255.255.128, 10.10.25.79 has an address range of 10.10.25.1 - 10.10.25.126, and 10.10.17.78 has a range of 10.10.17.1 - 10.10.17.126; any IP outside those ranges needs to be routed. 
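
The same check can be scripted; below is a minimal bash sketch using the example addresses and mask from above (purely illustrative; a subnet calculator does the same job).
# Bitwise-AND each octet with the mask to get the network address, then compare.
same_subnet() {
    local ip1=$1 ip2=$2 mask=$3 a b c d m1 m2 m3 m4 n1 n2
    IFS=. read -r m1 m2 m3 m4 <<< "$mask"
    IFS=. read -r a b c d <<< "$ip1"
    n1="$((a & m1)).$((b & m2)).$((c & m3)).$((d & m4))"
    IFS=. read -r a b c d <<< "$ip2"
    n2="$((a & m1)).$((b & m2)).$((c & m3)).$((d & m4))"
    [ "$n1" = "$n2" ] && echo "same subnet ($n1)" || echo "different subnets ($n1 vs $n2)"
}
same_subnet 10.10.17.78 10.10.17.63 255.255.255.128   # ct0.eth4 and its host: same subnet
same_subnet 10.10.17.78 10.10.25.62 255.255.255.128   # ct0.eth4 and the eth5 host: would need routing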

  • VLAN tagging is only supported in Purity 4.6.0+.
From purenetwork list --all
------------------------------------------------------------------------
Name Enabled Subnet Method Address Mask Gateway MTU MAC Speed Services Slaves
ct0.eth0 True - static 10.10.8.37 255.255.248.0 10.10.8.1 1500 24:a9:37:01:dc:42 1.00 Gb/s management -
ct0.eth1 False - static - - - 1500 24:a9:37:01:dc:43 1.00 Gb/s management -
ct0.eth4 True - static - - - 9000 90:e2:ba:cd:a3:f9 10.00 Gb/s iscsi -
ct0.eth4.157 True VLAN157-iSCSI static 10.10.141.195 255.255.252.0 10.10.140.5 9000 90:e2:ba:cd:a3:f9 10.00 Gb/s iscsi -
ct0.eth5 True - static - - - 9000 90:e2:ba:cd:a3:f8 10.00 Gb/s iscsi -
ct0.eth5.157 True VLAN157-iSCSI static 10.10.141.197 255.255.252.0 10.10.140.5 9000 90:e2:ba:cd:a3:f8 10.00 Gb/s iscsi -
ct1.eth0 True - static 10.10.8.38 255.255.248.0 10.10.8.1 1500 24:a9:37:01:ca:56 1.00 Gb/s management -
ct1.eth1 False - static - - - 1500 24:a9:37:01:ca:57 1.00 Gb/s management -
ct1.eth4 True - static - - - 9000 90:e2:ba:cc:25:05 10.00 Gb/s iscsi -
ct1.eth4.157 True VLAN157-iSCSI static 10.10.141.196 255.255.252.0 10.10.140.5 9000 90:e2:ba:cc:25:05 10.00 Gb/s iscsi -
ct1.eth5 True - static - - - 9000 90:e2:ba:cc:25:04 10.00 Gb/s iscsi -
ct1.eth5.157 True VLAN157-iSCSI static 10.10.141.198 255.255.252.0 10.10.140.5 9000 90:e2:ba:cc:25:04 10.00 Gb/s iscsi -

Use an MTU of 9000 across the entire path.

 

  • To check whether the MTU is working from the iSCSI target (FlashArray) to the host, use "ping -Mdo -I eth4 -s 8888 ipaddress", where eth4 is the iSCSI interface of the port you want to test, run live on the controller you are logged into.

 

Normal ping:
ping -I eth4 10.13.51.10
PING 10.13.51.10 (10.13.51.10) from 10.11.51.10 replbond: 56(84) bytes of data.
64 bytes from 10.13.51.10: icmp_seq=1 ttl=59 time=0.695 ms
64 bytes from 10.13.51.10: icmp_seq=2 ttl=59 time=0.657 ms
64 bytes from 10.13.51.10: icmp_seq=3 ttl=59 time=0.671 ms
64 bytes from 10.13.51.10: icmp_seq=4 ttl=59 time=0.761 ms
64 bytes from 10.13.51.10: icmp_seq=5 ttl=59 time=0.690 ms
64 bytes from 10.13.51.10: icmp_seq=6 ttl=59 time=0.666 ms


Ping test with jumbo frames:
ping -Mdo -I eth4 -s 8888 10.13.51.10
PING 10.13.51.10 (10.3.51.10) from 10.11.51.10 replbond: 8888(8916) bytes of data.
...
--- 10.13.51.10 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3005ms
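
The -s value is only the ICMP payload; the IP and ICMP headers add 28 bytes, so 8888 becomes 8916 bytes on the wire, which fits within a 9000-byte MTU. A small sketch for exercising the full frame size (interface and address are the example values from above):
# 8972 + 28 header bytes = exactly 9000 bytes, the jumbo-frame MTU.
ping -Mdo -I eth4 -s 8972 -c 4 10.13.51.10   # should succeed if jumbo frames work end to end
# One byte over the MTU; with -Mdo (don't fragment) this should fail immediately.
ping -Mdo -I eth4 -s 8973 -c 4 10.13.51.10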
  • Use all of the FlashArray's interfaces (critical for iSCSI performance).
    • host_connectivity provides an easy-to-read output showing whether all iSCSI interfaces are being used.
  • Verify all paths are clean; address any CRCs or similar errors.
    • port_counter --eth shows most interface errors, including CRC errors. It is important to run this command over multiple hours, since the counters are historical and you need to confirm they were incrementing while the issue was occurring.
  • Create at least 8 sessions per host (or, again, use all interfaces on Pure)
    • pureport list --initiator shows all of your hosts and which array ports they are connected to. The Initiator IQN column usually contains the name of the host, or you can look at purehost list to see what the host's IQN is. Each connection to a port is a session; a quick way to count them per host is sketched below. 
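
A sketch for counting sessions per initiator, assuming the pureport list --initiator column layout shown earlier (the third column is the initiator IQN):
# One line of pureport output per session; count them per initiator IQN.
pureport list --initiator | awk '$3 ~ /^iqn\./ {count[$3]++} END {for (h in count) print count[h], h}'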

If the fuse tool port_counter doesn’t reveal incrementing errors, we then need to look at what symptoms of the issue are recorded in the array’s logs. To see these symptoms we need to know the names of the hosts, as seen in purehost list, and a timeline of when the issue occurred.

Use purehost list to see what the IQN in question is, then use pureport list --initiator to see what the different initiator IP addresses are. Then search for "iSCSI login" or the initiator IP address in the appropriate syslogs on both controllers to see what traffic the array is logging, using the following commands.

  • To see a host iSCSI initiator logging in to the array target port, run the following against the syslogs:
zgrep 'iSCSI login' syslog-*.gz
syslog-2017080203.gz:Aug  2 02:40:26 pure1-00109-ct0 kernel: [4843141.851391,27] Received iSCSI/TCP iSCSI login request from 10.99.17.147 Network Portal 10.155.17.207:3260
syslog-2017080207.gz:Aug  2 06:48:12 pure1-00109-ct0 kernel: [4857992.708759,4f] Received iSCSI/TCP iSCSI login request from 10.99.17.147 Network Portal 10.155.17.207:3260
syslog-2017080212.gz:Aug  2 12:14:41 pure1-00109-ct0 kernel: [4877562.598442,4f] Received iSCSI/TCP iSCSI login request from 10.99.17.147 Network Portal 10.155.17.207:3260
  • Alternatively, to see the iSCSI host gracefully log out, use:
zgrep 'logout request' syslog-*
syslog-2017111408.gz:Nov 14 07:59:55 pure1-ct0 kernel: [14925578.732493,09] Received iSCSI logout request CLOSESESSION from iqn.1998-01.com.vmware:NEIESX08-32c7a70e on CID: 0 for SID: 213.
syslog-2017111408.gz:Nov 14 08:00:05 pure1-ct0 kernel: [14925588.740377,09] Received iSCSI logout request CLOSESESSION from iqn.1998-01.com.vmware:NEIESX08-32c7a70e on CID: 0 for SID: 214.
  • We can also see when an initiator successfully logs in using the following grep.
zgrep successful syslog-*
syslog-2017111408.gz:Nov 14 07:59:55 pure1-ct0 kernel: [14925578.730638,27] iscsi_np/6927: iSCSI connection (zero tsih) #1 from iqn.1998-01.com.vmware:NEIESX08-32c7a70e SID: 213 successful on CID: 0 from 192.168.150.8 to 192.168.150.1:3260,1 sess ffff882ff9f4e3f0 conn ffff88201f2be880 bitmap=8
syslog-2017111408.gz:Nov 14 08:00:05 pure1-ct0 kernel: [14925588.738489,27] iscsi_np/6927: iSCSI connection (zero tsih) #1 from iqn.1998-01.com.vmware:NEIESX08-32c7a70e SID: 214 successful on CID: 0 from 192.168.150.8 to 192.168.150.1:3260,1 sess ffff882ff9f4e480 conn ffff88202630cd00 bitmap=8

  • Searching on a single host initiator IP address is a good way to see whether that port has multiple login requests without a successful login to one or more array ports; we can also see whether a login was successful. 
pure1-ct1/2017_11_14$ zgrep 192.168.151.8 syslog-*
syslog-2017111408.gz:Nov 14 08:02:24 pure1-ct0 kernel: [14925727.626757,13] Received iSCSI/TCP iSCSI login request from 192.168.151.8 Network Portal 192.168.151.1:3260
syslog-2017111408.gz:Nov 14 08:05:16 pure1-ct0 kernel: [14925900.247393,13] Received iSCSI/TCP iSCSI login request from 192.168.151.8 Network Portal 192.168.151.1:3260
syslog-2017111408.gz:Nov 14 08:05:16 pure1-ct0 kernel: [14925900.247520,13] iscsi_np/6928: iSCSI connection (zero tsih) #1 from iqn.1998-01.com.vmware:NEIESX08-32c7a70e SID: 223 successful on CID: 0 from 192.168.151.8 to 192.168.151.1:3260,1 sess ffff88300b4cd870 conn ffff88201f2bee00 bitmap=8
syslog-2017111408.gz:Nov 14 08:05:32 pure1-ct0 kernel: [14925915.257590,09] Received iSCSI/TCP iSCSI login request from 192.168.151.8 Network Portal 192.168.151.1:3260
syslog-2017111408.gz:Nov 14 08:05:32 pure1-ct0 kernel: [14925915.257707,09] iscsi_np/6928: iSCSI connection (zero tsih) #1 from iqn.1998-01.com.vmware:NEIESX08-32c7a70e SID: 224 successful on CID: 0 from 192.168.151.8 to 192.168.151.1:3260,1 sess ffff88086a7a03f0 conn ffff88201f2bf900 bitmap=8
pure1-ct0/2017_11_14$ zgrep 192.168.151.8 syslog-*
syslog-2017111408.gz:Nov 14 08:02:24 pure1-ct1 kernel: [15023204.291118,0e] Received iSCSI/TCP iSCSI login request from 192.168.151.8 Network Portal 192.168.151.2:3260
syslog-2017111408.gz:Nov 14 08:05:35 pure1-ct1 kernel: [15023394.974086,13] Received iSCSI/TCP iSCSI login request from 192.168.151.8 Network Portal 192.168.151.2:3260
syslog-2017111409.gz:Nov 14 08:52:20 pure1-ct1 kernel: [15026200.098652,13] Received iSCSI/TCP iSCSI login request from 192.168.151.8 Network Portal 192.168.151.2:3260
syslog-2017111409.gz:Nov 14 08:52:20 pure1-ct1 kernel: [15026200.098767,13] iscsi_np/7078: iSCSI connection (zero tsih) #1 from iqn.1998-01.com.vmware:NEIESX08-32c7a70e SID: 372 successful on CID: 0 from 192.168.151.8 to 192.168.151.2:3260,1 sess ffff8830143dfc60 conn ffff882006c93180 bitmap=9
  • Searching for failing iSCSI connections can reveal problems on the customer's SAN when port_counter does not show related errors incrementing at that time. The following shows log entries related to a failing connection event.
zgrep 'failing connection' syslog-*
May 11 03:07:46 pure1-ct0 kernel: [21292805.140438,01] iscsi-68: Did not receive response to NOPIN on CID: 1 on SID: 21726, failing connection.
May 11 03:07:46 pure1-ct0 kernel: [21292805.140466,17] iscsi_trx-68/28013: rx_loop: -512 total_rx: 0 expected: 48
May 11 03:07:46 pure1-ct0 kernel: [21292805.140478,17] Closing iSCSI connection CID 1 on SID: 21726
May 11 03:07:46 pure1-ct0 kernel: [21292805.140506,17] iscsi_free_thread_set - iscsi_ttx-68/28012: release ts ffff880101703240/68 for iscsi_target_tx_thread+0x156/0x2b0 [iscsi_target_mod] (thread count 2 tx ffff880c46bdc440 rx ffff880c46bd8000)
May 11 03:07:46 pure1-ct0 kernel: [21292805.140564,17] iscsi_trx-68/28013: resetting connection, state: 4
May 11 03:07:46 pure1-ct0 kernel: [21292805.445159,17] Closing iSCSI session SID 21726 to iqn.1991-05.com.microsoft:crbedw1.cincyreds.com
  • The following are also important to be aware of.
zgrep 'conn_error' syslog-*
Aug  2 02:40:23 pure1-00109-ct0 kernel: [4843139.082282,3b] iSCSI post login state change: setting conn_error
Aug  2 02:40:23 pure1-00109-ct0 kernel: [4843139.082296,3b] rx_work(ffff881af5c8a5c0): signal_pending 0 conn_error 1 ret 0
Aug  2 02:40:23 pure1-00109-ct0 kernel: [4843139.082355,00] iscsi-cls-2/58096: Closing iSCSI connection CID 0 on SID 1660 bitmap 2
Aug  2 02:40:23 pure1-00109-ct0 kernel: [4843139.082392,00] iscsi-cls-2/58096: stopping tx_work        (null) SID 1660 bitmap 2
Aug  2 02:40:23 pure1-00109-ct0 kernel: [4843139.082399,00] iscsi-cls-2/58096: stopping rx_work        (null) SID 1660 bitmap 2
Aug  2 02:40:23 pure1-00109-ct0 kernel: [4843139.082405,00] iscsi-cls-2/58096: kernel_sock_shutdown: -107, state: 7
Aug  2 02:40:23 pure1-00109-ct0 kernel: [4843139.082407,00] iscsi-cls-2/58096: resetting connection, state: 7
Aug  2 02:40:23 pure1-00109-ct0 kernel: [4843139.332704,06] iscsi-cls-2: Closing iSCSI session SID 1660 to iqn.1998-01.com.vmware:stl10r13c005-7bcaae73
zgrep 'Closing iSCSI' syslog-*
syslog-2017080203.gz:Aug  2 02:40:23 pure1-00109-ct0 kernel: [4843139.082355,00] iscsi-cls-2/58096: Closing iSCSI connection CID 0 on SID 1660 bitmap 2
syslog-2017080203.gz:Aug  2 02:40:23 pure1-00109-ct0 kernel: [4843139.332704,06] iscsi-cls-2: Closing iSCSI session SID 1660 to iqn.1998-01.com.vmware:stl10r13c005-7bcaae73
syslog-2017080207.gz:Aug  2 06:48:09 pure1-00109-ct0 kernel: [4857989.941460,00] iscsi-cls-2/69033: Closing iSCSI connection CID 0 on SID 1661 bitmap 2
syslog-2017080207.gz:Aug  2 06:48:09 pure1-00109-ct0 kernel: [4857990.174381,06] iscsi-cls-2: Closing iSCSI session SID 1661 to iqn.1998-01.com.vmware:stl10r13c005-7bcaae73
syslog-2017080212.gz:Aug  2 12:14:38 pure1-00109-ct0 kernel: [4877559.833140,00] iscsi-cls-2/78696: Closing iSCSI connection CID 0 on SID 1662 bitmap 2
syslog-2017080212.gz:Aug  2 12:14:38 pure1-00109-ct0 kernel: [4877560.099753,22] iscsi-cls-2: Closing iSCSI session SID 1662 to iqn.1998-01.com.vmware:stl10r13c005-7bcaae73
  • If you are not seeing any of the above iSCSI traffic for that day, it is easy to search multiple days. The syslog files are small compared to core.log files, which makes it fast to search through several days. From the array logs, do the following:
robm@i-02fefdb7c55b7a985:/logs/pure.com/pure01-ct0/2017_11_15$ cd ..
robm@i-02fefdb7c55b7a985:/logs/pure.com/pure01-ct0$ tgrep 'conn_error' 2017_11/syslog-*|tail
  • If we can narrow down a timestamp of when the issue occurred, it is useful to read the logs around that time. The “Slow IO op” messages are misleading and tend to be an average of other IOs as well. The following grep can help filter out noise when looking at the syslogs at a particular time to see what we logged.
2017_11_17$ zgrep -v -E "0x85|Slow IO op|callbacks suppressed|slow ZCCD releases" syslog-201711170[7-9].gz| less
  • The host is connected to the array correctly but has not been given permission to access a volume. To check this, run the following command in the fuse logs or live on the array; you should see the name of the host listed. 
 purevol list --connect
------------------------------------------------------------------------
Name                        Size  LUN  Host Group      Host
esx-CL1--01  4T    254  CL1-pure       purea101
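
To check one host quickly (the host name here is the one from the example output above):
# Confirm the host appears in the volume connection list.
purevol list --connect | grep -i purea101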
  • If we are completely at a loss and need to show extra effort to explain the issue, we can capture a tcpdump from both the target and the initiator at the same time; see our tcpdump KB (a capture sketch also follows the filters below). Here are some useful filters for collecting SCSI SBC SRT statistics for a specific iscsi/ifcp/fcip host, and for filtering traffic from a particular IP (use the IP of an array or host iSCSI port):
    • tshark -z scsi,srt,0,ip.addr==1.2.3.4 -r nameof.pcap
    • Or, tshark -z scsi,srt,0 -r nameof.pcap
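
A minimal capture sketch for the filters above, assuming eth4 is the iSCSI interface and 10.10.17.63 is the initiator IP (both are example values; adjust the interface, filter, and output path):
# Capture full frames for one initiator's iSCSI traffic (TCP port 3260) on eth4.
tcpdump -i eth4 -s 0 -w /tmp/iscsi_eth4.pcap 'host 10.10.17.63 and port 3260'
# Then feed the pcap to the tshark SRT filter above:
# tshark -z scsi,srt,0,ip.addr==10.10.17.63 -r /tmp/iscsi_eth4.pcap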
  • Important things to note: we don't support MC/S (Multiple Connections Per Session), so every connection should be on CID 0; a CID of 1 or higher indicates MC/S is in use, as in the example below (see also the grep after the example). For example, when grepping for the host IP address in the syslogs:
Jul 28 16:15:04 Pure-Array02-ct0 kernel: [252946.296206,0f] Received iSCSI/TCP iSCSI login request from 192.168.3.175 Network Portal 192.168.3.90:3260
Jul 28 16:15:04 Pure-Array02-ct0 kernel: [252946.296564,0f] kworker/15:0/17657: iSCSI connection (zero tsih) #1 from iqn.1991-05.com.microsoft:hw-it-pvmh-hv75.healthwise.org SID: 1564 successful on CID: 1 from 192.168.3.175 to 192.168.3.90:3260,1 sess ffff881586cc8a80
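
A quick check for MC/S across the syslogs (a sketch; it simply looks for any connection that was set up on a CID other than 0):
# Any hit here is a connection on CID 1 or higher, i.e. MC/S is in use.
zgrep 'successful on CID' syslog-* | grep -v 'CID: 0 '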

Performance (SAN Latency) 

Tools / greps

Use the following tools to troubleshoot remotely:

Common failure points

The following are common failure points:

  • SAN errors due to a bad SFP, cable, or switch or array port. 
  • Over utilized iSCSI SAN. 
  • Purity bug, search our wiki release history inside Purity Release Center.

Troubleshooting

SAN errors can be caused by a bad SFP, cable, or switch or array port. If error counters are incrementing, it usually means the host(s) are seeing some amount of latency. We can see errors using the following. 

port_counter --eth was created to display errors on Ethernet and Fibre Channel ports that may cause poor performance or interrupted connectivity. From fuse, run the following command on the array in question. We want to look for counters incrementing while the issue was occurring. If you do see counters incrementing, see the port_counter --eth page and research the associated errors in KBs/JIRA/wiki.

port_counter --eth hours 6 

We can also observe incrementing errors by comparing the ethtool and netstat sections inside diagnostics.log-yyyymmddhh.gz between the different hours they were logged. When reviewing, be mindful of what percentage of the recent traffic the errors represent. Use diagnostics files from previous hours or days to figure out how much total rx or tx traffic incremented compared to the errors. These commands can also be used live on the array as follows.

  • netstat -s
    • Have a quick look over this section for anything that stands out to get a general idea of what is going on for this controller and what is incrementing. The following sections are good for seeing what the specific interfaces are doing.

Warning: ethtool can do more than provide information about Ethernet interfaces; some of its commands can be impactful, so use them wisely. 

  • ethtool -S ethX 
    • -S is for statistics. Pay attention to errors, err, and other values that increment. Check tx/rx_packets from previous hours to see the number of packets sent or received, then work out how much an error counter incremented over the same period and what percentage of the traffic that represents. Less than 1% may not be significant. 
  • ethtool -d ethX
    • -d is for register-dump. Link Status and Link Speed are useful. 
  • ifconfig (live on the array, not in fuse) has the following information.
eth6      Link encap:Ethernet  HWaddr 90:e2:ba:a3:97:cd  
          inet addr:10.0.121.14  Bcast:10.0.121.255  Mask:255.255.255.0
          inet6 addr: fe80::92e2:baff:fea3:97cd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3173000331 errors:8 dropped:66 overruns:0 frame:8
          TX packets:12638779171 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1326651543428 (1.3 TB)  TX bytes:18017524964918 (18.0 TB)
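
Using the eth6 counters above as an example, you can sanity-check whether the errors are a meaningful fraction of the traffic; 8 RX errors against roughly 3.17 billion RX packets is far below the 1% threshold mentioned earlier.
# Express interface errors as a percentage of received packets (values from the eth6 example above).
awk 'BEGIN { printf "%.7f%%\n", 8 / 3173000331 * 100 }'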
  • watch -n 1 "ss -anp | grep 10.124"
    • If a customer's iSCSI network seems slow in responding between Pure and the initiator, run the above to watch the sockets carrying the payload; 10.124 here stands in for the iSCSI subnet prefix. Connections whose Send-Q keeps growing suggest the network is not draining the data.

When looking for an over-utilized iSCSI SAN, it is important to review the number of dropped or discarded packets in the previous commands while the issue is occurring. We can also look at the switch logs for potential issues that could be causing the latency. 

Next Steps:

If we cannot resolve the iSCSI issues with the above troubleshooting, then we should request host and switch logs from the customer. We will need to know which ports the array and hosts are connected to, since we cannot see connected MAC addresses. 

Microsoft Windows Servers

Troubleshooting: Collect Windows Server Failover Cluster Logs and System Information is a good guide for collecting Windows Server host logs. If the Windows server is not part of a cluster, do not ask for the cluster information. 

VMware

For host logs from VMware hosts, see the KB How to capture VMware Support Bundles for Troubleshooting.

Linux

For Red Hat Linux host logs, gather an sosreport; for other Linux distributions, gather the equivalent.
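
On a Red Hat-family host this is typically just the following, run as root (a sketch; the package name and options vary by version):
# Generate a support archive covering system logs and configuration.
sosreport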

iSCSI Switch logs

Do the following to address iSCSI switch logs:

  1. Have the customer let us know which ports the Array and Hosts are plugged into. 
  2. Ask the customer to clear the switch logs, or collect one set before reproducing the issue and another after. Switch logs are historical; if we find errors on the switch, we may not be able to identify when they occurred. The FC switch log steps are the same. 
  3. Have the customer SSH to their Brocade or Cisco switch. For Brocade run "supportshow" and for Cisco run "show tech-support details". If the switch is not a Brocade or Cisco switch, consult the manufacturer's CLI documentation, then do the following. 
  4. When the command completes, save the text of the terminal session (for example, in the macOS Terminal use ⌘S, or go to the Shell menu and click "Export Text As...").
  5. Have the customer FTP the files to us or use Box through PureStorage Okta.