Author Topic: Remote IO still not Rock Solid - BX-P-ECOMEX Woes. (Read 2616 times)

jcb · « **on:** January 08, 2021, 07:59:47 AM »

This is still a work in progress but the project has been up and running for a few months now.

The BXDMIO has always been a little on the touchy side and we had a couple of times where everything dropped because of some network anomaly.

I band-aided that by upping the timeout on the Remote IO to 500ms and then put additional timers on the IO points that could fault/shut down the main motors.

I'm assuming a network broadcast from another device on our PLCnet is to blame. We use subnetting to separate a few networks and prevent routing out from the PLCnet, but no VLANs or routers yet, although we are quickly heading that way.

Starting last week, one of the operators said his controls seemed "laggy" (I guess that's the gamer generation stepping into the seat). Sure enough, the discrete DMIO, which consists of 4 32bit input cards and 4 32bit output cards and sits in the same rack as the main PLC on the same switch, was showing retry errors on the IO monitor but not disconnecting. There was also a chain, where we use a lug to trigger a prox to stop it, that was "missing" the input and going on to the next lug.

All of this has been running fine for over 2 months now so this is a new issue.

I isolated the PLC and remote IO to a single switch and disconnected it from the rest of the PLC net and everything went back to normal. (This was going to be my first PLCnet2. Add a router at the PLC net uplink and isolate everything inside.)

Then I saw the BX-P-ECOMEX and was like here is my answer. I'll add an isolated remote IO network to the PLC and problem solved. The only other thing on the Remote IO runs is the operator touch screen.

Last night I reconfigured the DMIO modules and touchscreens to a new subnet, added the ECOMEX, moved everything to an isolated switch, and checked the Ethernet IO. Everything went online, but the two Ethernet runs that had operator controls, which go out to Stride switches with just the DMIO (analog and discrete inputs/outputs) and touchscreens, were throwing retry errors like crazy: an error every 2 seconds or so with an update rate of 25ms. The discrete IO on the same switch as the ECOMEX showed no errors.

Also, on a side note, I tried to connect my Ethernet/IP VFDs into the switch and could not figure out how to tell my EIPMSG to use the second interface for comms.

Anyway, I reconfigured everything back to the PLCnet subnet and the onboard Ethernet port. I left the DMIO and PLC on their own switch and re-uplinked everything to the PLCnet. I get no errors. I changed the poll rate from 25ms down to 10ms and everything is still rock solid.

At this point it was the end of a 14-hour shift, so I scratched my head and went to bed.

Got up this morning and came back in at 4 before we fire up and everything is rock solid no errors quick update times.

I ran an Advanced IP Scanner sweep on both subnets just to get a current layout and see if anything new got plugged in, and that was when I saw the DMIO retry counts climb. The weird thing is that even when I scanned the second PLCnet subnet, I still got retry errors on the DMIO on the PLCnet.

So this weekend I am going to reconfigure the subnet on the entire machine center (PLC, Remote IO, touchscreens, VFDs, motion controller), add a router at the PLCnet subnet uplink, and then use the BX-P-ECOMEX on the PLCnet for programming and the data logging uplink. I might see if I can NATD/VPN through/into the router (Ubiquiti EdgeMAX ER-X) for touchscreen updates and motion controller tuning so I don't have to take my laptop out to the plant for that.

So after that entire breakdown, here are my questions:

1. Why didn't the BX-P-ECOMEX work better than the onboard Ethernet for ISOLATED RemoteIO?
2. How would I use the BX-P-ECOMEX to address/route to the Ethernet/IP devices I created for the VFDs I control with my EIPMSG instructions?
3. What would cause the retry counts on the DMIO? Is it some kind of broadcast traffic?

Thanks
John.

BobO · « **Reply #1 on:** January 08, 2021, 10:55:23 AM »

Quote from: jcb on January 08, 2021, 07:59:47 AM

I band aided that by upping the timeout on the Remote IO to 500ms and then put additional timers on the IO points that could fault/shutdown the main motors.

Increasing the timeout makes retry problems more obvious, not less. The timeout should be as short as it can be, and should reflect the worst case response time of the target plus network infrastructure delays. In the case of Ethernet I/O, the DMIOs, EBCs, or EDRVs should have a very short response time, on the order of a few milliseconds. If you are on a wired network, the infrastructure delays should be sub 1ms. If you are having packet loss, the answer is increased retries, not increased timeouts.

Quote from: jcb on January 08, 2021, 07:59:47 AM

I'm assuming a network broadcast from another device on our PLCnet is to blame. We use subnetting to separate a few networks and prevent routing out from the PLCnet but no VLANS or routers yet although we are quickly heading that way.

Broadcast traffic can be a problem, yes. When we were doing the original testing of the Ethernet I/O Master, we kept seeing retries that we didn't think we should see. We were doing the testing on the company network, but knowing that others would too, we wanted to test that way. After much head scratching we finally identified a switch that was doing some kind of promiscuous ARP that was bursting out 20-30 ARP requests with almost zero delay. It was filling the packet FIFO on the PLC's MAC, causing packets to be lost. Once we stopped the switch from doing that, the packet loss stopped.

Quote from: jcb on January 08, 2021, 07:59:47 AM

1. Why didn't the BX-P-ECOMEX work better than the onboard Ethernet for ISOLATED RemoteIO?

Maybe I missed it, but I didn't see any details about how many devices are connected. I saw references to HMI and DMIO, but not the total number of units. Another consideration is PLC scan time. The ECOMEX doesn't use an interrupt, so it will start losing packets after the MAC's FIFO fills up. The FIFO is pretty deep...if memory serves, I'm thinking it is 30-100 packets depending on how big the packets are. If there is a lot of activity and a high scan time, you can get past that. The only way to know exactly what is going on is to snoop the PLC's port with a port mirroring switch. I'm happy to look over any Wireshark traces you can pull from that.

Quote from: jcb on January 08, 2021, 07:59:47 AM

2. How would I use the BX-P-ECOMEX to address/route to the Ethernet/IP devices I created for the VFDs I control with my EIPMSG instructions?

The IP stack knows what address you are targeting. If you specify an address on the ECOMEX's subnet, it'll route to it. If it's on the internal port's subnet, it'll route to it. If it isn't on either subnet, it'll route to the first gateway configured.

In cases where broadcasts are required (which doesn't allow for routing), we allow you to pick the port.
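To make the port selection described above concrete, here is a minimal Python sketch of the same decision. It is only a model of the behavior, not the PLC's actual stack code. The interface names, the addresses, and which port owns the first gateway are all made-up assumptions for illustration.

```python
import ipaddress

# Hypothetical addressing, for illustration only.
INTERFACES = {
    "ecomex": ipaddress.ip_interface("192.168.20.5/24"),
    "onboard": ipaddress.ip_interface("192.168.10.5/24"),
}
# Assumption: the first configured gateway sits on the onboard port.
FIRST_GATEWAY_PORT = "onboard"


def select_port(dest: str) -> str:
    """Return the port a packet to `dest` would leave from in this model."""
    addr = ipaddress.ip_address(dest)
    for name, iface in INTERFACES.items():
        if addr in iface.network:
            return name  # destination is directly reachable on this subnet
    # Not on either local subnet: hand it to the first configured gateway.
    return f"gateway via {FIRST_GATEWAY_PORT}"


print(select_port("192.168.20.40"))  # a VFD addressed on the ECOMEX subnet
print(select_port("192.168.10.40"))  # a device on the onboard subnet
print(select_port("10.1.1.1"))       # anything else goes to the gateway
```

In practice this means an EIPMSG goes out the ECOMEX simply because the VFD's IP address is on the ECOMEX's subnet; there is nothing to select in the instruction itself.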

Quote from: jcb on January 08, 2021, 07:59:47 AM

3. What would cause the retry counts on the DMIO? Is it some kind of broadcast traffic?

Excess traffic or noise. You mentioned VFDs. They are absolute comm killers.

Last thought...if I wanted to keep the I/O responsive in a less than perfect environment, I would reduce the timeout to the lowest stable setting (per my description above), possibly increase retries beyond the default 4, keep the scan interval low or 0, and make sure the PLC's scan time is as low as possible. Your retries may be high, but the I/O throughput will be higher. Timeouts significantly higher than normal response times don't improve packet delivery, they just delay the retry.

We're also happy to work more directly with you. Contact support at hosteng.com.
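The timeout-versus-retries trade-off can be put into rough numbers. The sketch below is a simplified model, not a description of the Ethernet I/O Master's internals: it assumes each attempt is lost independently with the same probability, and the 1% loss rate, the 2ms response time, and the 10ms/8-retry alternative are illustrative values. The 500ms timeout and the default of 4 retries come from this thread.

```python
def retry_budget(loss_prob, timeout_ms, response_ms, retries):
    """Rough timing model for one I/O update under packet loss."""
    attempts = retries + 1  # the first try plus the configured retries
    return {
        # Chance that every attempt is lost and the slave is flagged as failed.
        "p_all_attempts_lost": loss_prob ** attempts,
        # How stale the data gets when a single packet is lost.
        "delay_after_one_loss_ms": timeout_ms + response_ms,
        # How long the master waits before giving up entirely.
        "give_up_after_ms": attempts * timeout_ms,
    }


long_timeout = retry_budget(loss_prob=0.01, timeout_ms=500, response_ms=2, retries=4)
short_timeout = retry_budget(loss_prob=0.01, timeout_ms=10, response_ms=2, retries=8)

for label, stats in (("500ms / 4 retries", long_timeout),
                     ("10ms / 8 retries", short_timeout)):
    print(label, stats)
```

In this model the short timeout with more retries recovers from a single lost packet in about 12ms instead of about 502ms, and it is also less likely to exhaust all attempts, which is the point about timeouts only delaying the retry.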

jcb · « **Reply #2 on:** January 08, 2021, 11:30:38 AM »

Okay,

Thank you for the info. I will adjust the timeout values back down and increase the retries. When I got the drops before the error code was 4 in the DM.

Scan Time average is 7ms on the PLC.

I had 3 DMIO Modules / 2 Touchscreens on the ECOMEX network.

I did move the VFD to the correct subnet and was still getting timeouts on my EIPMSG instructions. I will try this again on a shop system.

The network is 100% wired 100mbit. All Ethernet runs are in dedicated I/O conduit runs. I don't think VFD interference is the problem.

I have been monitoring $EthMissedFrames all morning and it is still at 0. I'm going to proceed with the network isolation and then run it for a few weeks that way to see if the problem comes back.

Thanks Again.
John.

BobO · « **Reply #3 on:** January 08, 2021, 12:12:15 PM »

Quote from: jcb on January 08, 2021, 11:30:38 AM

Okay,

Thank you for the info. I will adjust the timeout values back down and increase the retries. When I got the drops before the error code was 4 in the DM.

Scan Time average is 7ms on the PLC.

I had 3 DMIO Modules / 2 Touchscreens on the ECOMEX network.

I did move the VFD to the correct subnet and was still getting timeouts on my EIPMSG instructions. I will try this again on a shop system.

The network is 100% wired 100mbit. All Ethernet runs are in dedicated I/O conduit runs. I don't think VFD interference is the problem.

I have been monitoring $EthMissedFrames all morning and it is still at 0. I'm going to proceed with the network isolation and then run it for a few weeks that way to see if the problem comes back.

Thanks Again.
John.

A 7ms scan time isn't a killer, but it's not super low either. What is your max spiking to?

I'd watch $EthMissedFrames, $EthDroppedPackets, and $EthStoppedIntr for the main port, and $Eth2DroppedPackets for the ECOMEX.

One other thing I forgot to mention. The PLC protects itself from excess comms. One of the ways it does that is to limit the number of packets it queues. That process occurs in 3 stages. For stage 1, it will queue the first 16 packets in a scan without regard to how fast they come in. After hitting 16, it switches to stage 2, where it allows 1 packet for each 100us of accrued scan time. So at 1.7ms it will allow one more packet, but will drop packets until 1.8ms, then accept another. $EthDroppedPackets is the count of packets that were rejected due to stage 2 protection. When a total of 32 packets have been queued, it will shut off the Ethernet interrupt, allowing all packets to hit the floor until next scan. $EthStoppedIntr is the number of times the PLC hit stage 3. At 7 ms (and higher presumably), you could easily be seeing the interrupt shut off.
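The three stages above can be modeled in a few lines of Python to see how a given traffic pattern would move the counters. This is a sketch built only from the description in this post, not firmware code. One assumption is mine: packets that arrive after the interrupt is shut off are tallied separately in `lost_after_shutoff`, since the post does not say which counter, if any, they land in.

```python
from dataclasses import dataclass

STAGE1_LIMIT = 16    # packets queued per scan regardless of arrival rate
STAGE3_LIMIT = 32    # total queued packets that shuts off the interrupt
US_PER_PACKET = 100  # stage 2: one packet per 100us of accrued scan time


@dataclass
class ScanResult:
    queued: int = 0
    dropped: int = 0             # plays the role of $EthDroppedPackets
    stopped_intr: int = 0        # plays the role of $EthStoppedIntr
    lost_after_shutoff: int = 0  # assumption: tracked separately here


def simulate_scan(arrival_times_us):
    """Model one PLC scan given packet arrival times in us since scan start."""
    result = ScanResult()
    interrupt_on = True
    for t in sorted(arrival_times_us):
        if not interrupt_on:
            result.lost_after_shutoff += 1  # stage 3: packets hit the floor
            continue
        if result.queued < STAGE1_LIMIT:
            result.queued += 1              # stage 1: accept unconditionally
        elif t >= (result.queued + 1) * US_PER_PACKET:
            result.queued += 1              # stage 2: enough scan time accrued
        else:
            result.dropped += 1             # stage 2: arrived too early
            continue
        if result.queued >= STAGE3_LIMIT:
            interrupt_on = False            # enter stage 3 until next scan
            result.stopped_intr += 1
    return result


# Hypothetical load: one packet every 100us across a 7ms scan.
steady = simulate_scan(range(0, 7000, 100))
print(steady)
```

With that hypothetical load the model queues 32 packets, rejects 1 in stage 2, and shuts the interrupt off once, so a steady stream over a 7ms scan is enough to reach stage 3.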

Another couple of flags of interest are $QueuesFlushed and $CommStackOverrn. They are bits indicating that the TCP/IP stack was having trouble allocating packet buffers. That's bad. It shouldn't ever happen, but there are a few cases where it can: sometimes due to high burst traffic, and other times due to a programming error.

If you are doing a lot of TCP traffic (like Modbus or EIPMSG) that results in opening and closing TCP sessions very fast, you can cause issues there as well. There are a finite number of sockets supported by the TCP stack, and each open and closed socket is kept by the stack for some time before it is cleaned up. I have shortened the timeout on that process, but it is possible to start falling behind. That tends to do a couple of things. It ties up stack resources, which are finite, and it starts slowing the scan down more and more. The way you might see that is to use the same EIP Client or Modbus/TCP Client to rapidly hit different devices. If you are only talking to a single device with the client, or if the interval is slow enough, you won't see that.

jcb · « **Reply #4 on:** January 08, 2021, 01:03:04 PM »

I will add those to my data logger.

All of the following are currently at 0 and it has been running like a top this morning with everything on the PLCnet and internal Ethernet the way I originally installed it.

$EthMissedFrames = 0
$EthDroppedPkts = 0
$EthStoppedIntr = 0

I do have a lot of comms but have created a device for every connection: Modbus/TCP (motion controller), Ethernet/IP (VFDs), and Do-more to a second PLC (slow interval, 1 second updates, just to call for a second air compressor to kick on).

There is a code block with stage programming that I wrote to exchange data back and forth with the motion controller via Modbus. It has separate devices for reads and writes and uses a lot of indirect addressing, and another code block with stage programming handles the indirect addressing pointers and memcopy to the UDTs. I'm sure I could clean my code up a little, but it hasn't been an issue so far. I have metrics built into it for the update times, and it is pretty consistent: a full data exchange is 60ms according to the timers. On a new CPU without remote IO it was 9-10ms, so I think it is pretty streamlined.

Is there any way to tell the scan time of the different code blocks?

Thanks Again
John.

BobO · « **Reply #5 on:** January 08, 2021, 02:54:17 PM »

Quote from: jcb on January 08, 2021, 01:03:04 PM

I will add those to my data logger.

All of the following are currently at 0 and it has been running like a top this morning with everything on the PLCnet and internal Ethernet the way i originally installed it.

$EthMissedFrames = 0
$EthDroppedPkts = 0
$EthStoppedIntr = 0

I do have a lot of comms but have created a device for every connection. Modbus/TCP (Motion Controller), Ethernet/IP (VFD's), Do-More to a second PLC(Slow interval 1 second updates just to call for a second air compressor to kick on).

As long as none of those counters are moving up, packet loss is probably not load related. That suggests either infrastructure or noise. You might consider trending/logging slave retries to see if they vary with time or environment. Trending/logging scan time can also be interesting.
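If the data logger only stores the cumulative retry counts, the trend is easier to read after converting them to a rate per interval. A minimal Python sketch of that conversion follows. The sample values are invented for illustration, and the (timestamp, count) layout is an assumption about how the log might look.

```python
def retry_rates(samples):
    """Turn (timestamp_s, cumulative_retries) samples into retries per second."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates


# Invented log: quiet, then a burst (for example during a network scan), then quiet.
log = [(0, 0), (60, 0), (120, 30), (180, 30)]
for timestamp, rate in retry_rates(log):
    print(f"t={timestamp}s  retries/s={rate:.2f}")
```

Lining those rates up against scan time and known events (shift changes, IP scans, motors starting) is what shows whether the retries follow load, time of day, or something in the environment.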

Quote from: jcb on January 08, 2021, 01:03:04 PM

Is there any way to tell the scan time of the different code blocks?

Yes, kinda. You'll need to know the execution order (you can get that from the Project Browser). Stick a MATH box with a TICKus() instruction at the top of each block to take a timestamp, and calculate the difference from the value acquired in the previous code block. Keep the high-water value with a MAX. It'll only work if you don't have yielding code in that block. Alternatively, you can add dummy tasks between the code blocks you want to profile, and do the work in them. Right-click on Control Logic in the Project Browser and select Change Execution Order to set the right order.
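For anyone who wants to see the bookkeeping spelled out, here is the same timestamp, difference, and high-water idea as a small Python analogue. It is not Do-more code. The `BlockProfiler` class, the block names, and the injectable clock are hypothetical, and `time.perf_counter_ns` stands in for TICKus().

```python
import time


class BlockProfiler:
    """Timestamp at the top of each block; the delta is the previous block's time."""

    def __init__(self, clock=time.perf_counter_ns):
        self.clock = clock
        self.last_ns = None
        self.high_water_us = {}

    def mark(self, finished_block=None):
        """Call at the top of a block, naming the block that just finished."""
        now = self.clock()
        if finished_block is not None and self.last_ns is not None:
            elapsed_us = (now - self.last_ns) // 1000
            previous_max = self.high_water_us.get(finished_block, 0)
            # Keep only the worst case seen so far, like the MAX in the post.
            self.high_water_us[finished_block] = max(previous_max, elapsed_us)
        self.last_ns = now


# Usage with stand-in blocks: the work being timed sits between marks.
profiler = BlockProfiler()
for _ in range(3):                     # three simulated scans
    profiler.mark()                    # top of the first block
    sum(range(10_000))                 # stand-in for block A's logic
    profiler.mark("block_a")           # top of the second block
    sum(range(1_000))                  # stand-in for block B's logic
    profiler.mark("block_b")           # top of a trailing dummy block
print(profiler.high_water_us)
```

The trailing mark plays the role of the dummy task: without something after the last block there is no later timestamp to subtract from.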

jcb · « **Reply #6 on:** January 08, 2021, 03:13:42 PM »

Great info. Always appreciated.

Thanks Again.
John.
