  • September 17, 2021, 01:58:34 AM

Recent Posts

Pages: [1] 2 3 ... 10
1
Can you change the PLC to pump run signal to three-wire?  Then you could survive a REBOOT.
2
Code for the exception, not the rule  :P

Yeah, well, you have to be careful with that. When I say exception I don't mean rare event, I mean gravity quit working or the speed of light changed...things that must never happen. The only real recourse to such a thing is to reboot, which may be fine for some, but probably very bad for others. Your PLC is still functioning even though you've had a failure. For most that is preferable to rebooting. We put the REBOOT instruction in there for exactly these cases, if that's determined to be the best action and safe to do.

The real right answer is to fix the bug and eliminate the exception. We'll do our best.
3
Scan times are high; we found our own bug in one of our pointer-based subroutines.  This PLC hasn't received an update yet, but when the BRX gets an invalid pointer (block of 24, request sent for 25...) the scan times shoot way up.  We had a unit at ~90ms; after fixing the logic it is now below 30ms!

Code for the exception, not the rule  :P

Will add ST145 & 146 to monitors and see if we can create an automatically invoked REBOOT, or at least a user-initiated one (with failsafes).


--
Edit:
Thanks, hope we can find something.
-----
4
We're formulating a plan to build a random/repetitive hardware failure stress test that would approximate the behavior of bad networking hardware to see if we can force a similar environment to bad radios. Hopefully we can learn something quickly.
5
I appreciate the quick and honest answer, and I do understand the difficulty of recreating the problem.

The remote PLC is on a wireless link that used to be very good, but now has some issues (New buildings built in the last year or two).
While the site generally works well, it does get latency and dropped packets. So the low quality of the link is very likely the cause of the problem.
The other four remote units (all DL-06 w/ ECOM100) are also wireless, but have not seen the same loss of network.

As for the ability to cycle power on the unit, this is not a solution that is reasonable.  This unit has a hardwired connection for controlling a water well/pump, so the outputs dropping out will also interrupt that process.

For me, finding the reason the Success/Error bits do not update would be the first check. On a device failure/hang, the Error bit should at least be on.  When this happens, both remain false.  We changed our communications health detection to use a rising edge on Success plus a heartbeat from the remote unit (a pseudorandom value and an analog input, to make sure the PLC is in run mode).

DST40 & 41 (we added all the $Eth items) are now on our list of things to watch, but would these count packets from ALL Ethernet devices? Currently, both have been at 0 for the last 10 hours.

80ms scan time? That's pretty stout. It shouldn't hurt anything, but high scan times are never helpful for comm. Is it always that high?

It wouldn't require power cycling, just invoking the REBOOT instruction, but it will interrupt the outputs.

The instruction is clearly not working right if it's never timing out. It's a joke here at Host, and I mean no disrespect with this, but when there is a failure in the controller and the mode of failure doesn't satisfy someone, my response is generally "what about 'exception condition' do you not understand?" The bottom line is it shouldn't be happening, and we make every effort to keep things from happening, but bugs are bugs and they don't always behave the way we want them to behave...cause they're bugs.

Given the unstable link, I'm now at 90% confidence this is a network stack resource issue. Wondering if some of the low level networking code isn't accepting a missing/unavailable resource as a valid basis for timing out. That isn't great and shouldn't happen, but when is it ever OK to run out of memory? Never. Exception. If it is that, I would expect ST145 and/or ST146 to be on. They indicate that when the Ethernet hardware is trying to inject a packet into the stack, there were no free buffers. That generally should never happen. While that by itself shouldn't cause something else to hang, I'm wondering if there is some pregnant moment in the TCP stack that's causing a low level hang up.

DST40 and DST41 count up in response to excessive packet loads. They aren't problems necessarily, but they give a glimpse into whether there are packet bursts happening that could be causing issues.
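For anyone logging these counters from a SCADA or a script, here is a rough sketch of turning raw DST40/DST41 samples into per-interval increases so a burst stands out. This is only an illustration: the `CounterWatch` class is hypothetical, how the values are read out of the PLC is left to you, and treating a lower value as a counter reset (for example after a REBOOT) is an assumption on my part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterWatch:
    """Tracks one ever-increasing diagnostic counter (hypothetical helper)."""
    name: str
    last: Optional[int] = None

    def update(self, value: int) -> int:
        """Return how much the counter grew since the previous sample."""
        if self.last is None or value < self.last:
            # First sample, or the counter went backwards (assumed reset):
            # take a new baseline and report no growth for this interval.
            delta = 0
        else:
            delta = value - self.last
        self.last = value
        return delta


dropped = CounterWatch("DST40")
for sample in (0, 0, 5, 5):
    print(dropped.name, "grew by", dropped.update(sample))
```

A nonzero increase lining up in time with a comm hang would be more useful evidence than the absolute count.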

I'll put some thought into this to see if I can come up with reasons this might be happening and hopefully some diagnostic checks that would let us see what might be going on.
6
I appreciate the quick and honest answer, and I do understand the difficulty of recreating the problem.

The remote PLC is on a wireless link that used to be very good, but now has some issues (New buildings built in the last year or two).
While the site generally works well, it does get latency and dropped packets. So the low quality of the link is very likely the cause of the problem.
The other four remote units (all DL-06 w/ ECOM100) are also wireless, but have not seen the same loss of network.

As for the ability to cycle power on the unit, this is not a solution that is reasonable.  This unit has a hardwired connection for controlling a water well/pump, so the outputs dropping out will also interrupt that process.

For me, finding the reason the Success/Error bits do not update would be the first check. On a device failure/hang, the Error bit should at least be on.  When this happens, both remain false.  We changed our communications health detection to use a rising edge on Success plus a heartbeat from the remote unit (a pseudorandom value and an analog input, to make sure the PLC is in run mode).

DST40 & 41 (we added all the $Eth items) are now on our list of things to watch, but would these count packets from ALL Ethernet devices? Currently, both have been at 0 for the last 10 hours.
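
The health check described above (rising edge on Success plus a changing heartbeat) could be sketched roughly like this. It is written in Python only to show the logic, not as the actual ladder; the `CommHealth` class, the 10 second timeout, and the idea of requiring both signals inside the same window are assumptions for illustration.

```python
class CommHealth:
    """Link is healthy only while Success keeps pulsing AND the heartbeat keeps changing."""

    def __init__(self, now: float, timeout_s: float = 10.0):
        self.timeout_s = timeout_s      # assumed window, tune per site
        self.prev_success = False
        self.prev_heartbeat = None
        self.last_edge = now            # time of last Success rising edge
        self.last_beat = now            # time of last heartbeat change

    def update(self, now: float, success: bool, heartbeat: int) -> bool:
        # A rising edge proves a NEW transaction completed, not a stale bit.
        if success and not self.prev_success:
            self.last_edge = now
        # A changed heartbeat value proves the remote PLC is still scanning.
        if self.prev_heartbeat is not None and heartbeat != self.prev_heartbeat:
            self.last_beat = now
        self.prev_success = success
        self.prev_heartbeat = heartbeat
        return (now - self.last_edge <= self.timeout_s
                and now - self.last_beat <= self.timeout_s)
```

Because it only trusts edges and changes, this also catches the case where Success and Error both sit at false forever: no edge arrives, so the window expires.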

7
It sounds like it is running out of IP stack resources. That could be due to a memory leak (that I don't know about) or a flood of packets from a bad actor. It would have been helpful to know whether $QueuesFlushed (ST145) or $CommStackOverrn (ST146) were active. It is also helpful to know what is in $EthDroppedPkts (DST40) and $EthStoppedIntr (DST41). These give some insight into what might be failing.

Unless we can dupe this, we can't begin to fix it. I understand your frustration, but problems like this are the absolute hardest, and other than a description of the failure mode, this gives us very little to go on. Even with your program, unless we dupe all of the comm activity, we aren't likely to see the issue.

It isn't a good answer, and it may not be viable, but if you are able to detect (from the PLC) that the PLC comms are hung, you could have the PLC reboot itself.
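
As a rough sketch of what "reboot itself, with failsafes" might look like in logic terms: only ask for a reboot after comms have been down continuously for a while, only when the process can tolerate the outputs dropping, and never more often than some minimum interval. The `RebootGuard` class, its thresholds, and the `safe_to_interrupt` permissive are all hypothetical; in the PLC this decision would gate the REBOOT instruction.

```python
class RebootGuard:
    """Decides when a self-reboot is allowed (illustrative logic only)."""

    def __init__(self, hung_after_s: float = 300.0, min_interval_s: float = 3600.0):
        self.hung_after_s = hung_after_s        # assumed: how long comms must stay down
        self.min_interval_s = min_interval_s    # assumed: minimum time between reboots
        self.unhealthy_since = None
        self.last_reboot = None

    def should_reboot(self, now: float, comms_healthy: bool, safe_to_interrupt: bool) -> bool:
        if comms_healthy:
            self.unhealthy_since = None         # any recovery restarts the timer
            return False
        if self.unhealthy_since is None:
            self.unhealthy_since = now
        hung = now - self.unhealthy_since >= self.hung_after_s
        rate_ok = (self.last_reboot is None
                   or now - self.last_reboot >= self.min_interval_s)
        if hung and rate_ok and safe_to_interrupt:
            self.last_reboot = now
            self.unhealthy_since = None
            return True
        return False
```

The rate limit matters: if the reboot does not clear the fault, the unit should not sit there cycling its outputs.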
8
Latest available firmware.
9

MWX/MRX is master/client side. The session counts and structure are related to the Modbus/TCP slave/server. So some of these recommendations don't apply to your situation.

One thing that might be interesting would be to create a second Modbus/TCP client and add an MRX to the same server. When and if it happens again, switch on the second connection and see if it talks.
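
The same "does a fresh connection talk?" check can also be done from a PC on the same network. Below is a minimal sketch using only the Python standard library: it builds a Modbus/TCP Read Holding Registers request (function 0x03) and sends it over a brand-new socket. The function names, the choice of function code, address 0, and unit ID 1 are assumptions for illustration and may not match what your MRX is configured to read.

```python
import socket
import struct


def read_holding_registers_request(transaction_id: int, unit_id: int,
                                   start_addr: int, quantity: int) -> bytes:
    """Build a Modbus/TCP frame: 7-byte MBAP header + function 0x03 PDU."""
    if not 1 <= quantity <= 125:
        raise ValueError("quantity must be 1..125 for function 0x03")
    pdu = struct.pack(">BHH", 0x03, start_addr, quantity)
    # MBAP: transaction id, protocol id (always 0), length (unit id + PDU), unit id
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def probe(host: str, port: int = 502, timeout_s: float = 3.0) -> bool:
    """Open a fresh TCP connection and report whether the server answers."""
    request = read_holding_registers_request(1, 1, 0, 1)
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(request)
            reply = sock.recv(260)  # sketch: assumes the short reply arrives in one read
    except OSError:
        return False
    # Same transaction id echoed back and no exception bit on the function code.
    return len(reply) >= 8 and reply[:2] == request[:2] and reply[7] == 0x03
```

If `probe()` succeeds while the original device connection is still stuck, that points at the state of that one connection rather than the link or the server.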

have a new topic created... Followed your suggestion.
10
I have created a topic and a support ticket about this before, but the problem still remains. Now it seems worse.

We got an alert the comms for a specific site had failed (from the BRX).
Site is running perfectly fine, an old DL-06 with ECOM100.

Cannot connect with DmD from the local PC over Ethernet. (Connects, reads program, fails on system config)
Responds to Modbus TCP (Server) from local PC (SCADA)
NetEdit3 can connect, make changes from same local PC
Disconnect/close SCADA connections, allowed the DmD to connect.
Device will not show Success/Error status.  The other Modbus TCP devices are working just fine!
Created a new device while it was online, connected the instructions (MWX/MRX) to new device, came up instantly.

This can't keep happening every 2 months...