Host Engineering Forum
General Category => Do-more CPUs and Do-more Designer Software => Topic started by: CReese on March 18, 2015, 02:40:26 PM
-
Hello,
I noticed in an otherwise 100% operational system that the CPU was starting up in Program mode. When I got into the PLC, it was displaying errors on a couple of cards, an F2-08AD-2 and an F2-04AD-2. The error says that the channels have failed, when they are clearly still functional.
What are the possible causes for this error, and is there anything to do except replace the cards?
Thanks,
Colin
-
Re-seat all the cards in the base.
-
Definitely tried with the offending cards. Will give it another shot. Couldn't hurt.
-
Also make sure you have the latest firmware in the CPU.
-
Are you using all of the channels on the cards? I had an error on an analog input card when I left one of the input channels unhooked. Don't know if that may be your problem but the error did not make the CPU come up in program mode. I saw it when updating the program.
JW
-
Do-more CPUs come up in program mode when there is a change to the I/O configuration, which could be due to a glitchy backplane connection or module. You might check $IOVerifyLastTO (DST27) to see if it contains anything other than 0. If so, it means that it got the wrong answer for at least 1 scan while verifying the module.
They can also come up in program mode if they have watchdogged repeatedly. If the CPU has watchdogged it will bump $WatchdogReboots (DST385). When it hits 10 (I think) it will drop into program mode on restart. This is to prevent the CPU from winding up in an unrecoverable crash loop.
Failed channels and 24vdc are status bits read from the module. The CPU has no control over it...we just report it.
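The two startup rules described above can be sketched in a few lines of Python. This is only an illustration of the decision as described in this post, not the actual firmware logic. The function name, the mode strings, and the limit of 10 (the post itself says "I think") are assumptions.

```python
# Hypothetical sketch of the startup rules described above.
# The limit of 10 is taken from the post's "10 (I think)" and is an assumption.
WATCHDOG_REBOOT_LIMIT = 10


def startup_mode(requested_mode, io_config_changed, watchdog_reboots,
                 limit=WATCHDOG_REBOOT_LIMIT):
    """Return the mode the controller would come up in under the two rules."""
    if io_config_changed:
        # Rule 1: any change to the I/O configuration forces program mode.
        return "PROGRAM"
    if watchdog_reboots >= limit:
        # Rule 2: too many watchdog reboots forces program mode.
        return "PROGRAM"
    return requested_mode


if __name__ == "__main__":
    print(startup_mode("RUN", io_config_changed=False, watchdog_reboots=0))
    print(startup_mode("RUN", io_config_changed=True, watchdog_reboots=0))
    print(startup_mode("RUN", io_config_changed=False, watchdog_reboots=10))
```

In a real system the two inputs would come from the status registers named above; here they are plain function arguments.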
-
This is interesting. I am getting the error:
Master 0: DL205 Local I/O Master
Base 0: DL205 Base
Slot 0: 8 AI [X0-7/WX0-7]
The following channels have failed on this module: 1, 2, 3, 4, 5, 6, 7, 8
I get this error any time I plug an F2-08AD-1 into any base. I've swapped bases, bought new cards and inserted them with no connectors, and still the error.
This error alternates with the "There is an error with the module" error.
These cards were in slots 2 & 4. With no point terminations connected, I swapped 2 & 3. This gave errors on modules 2,3, & 4. I then swapped them back, and showed errors on 2, 3, 4, & 5. Interestingly, while these errors were reported, the PLC did come up in Run when I power cycled it repeatedly.
This seems like some sort of firmware compatibility issue. Any ideas which direction to go?
Colin
-
The 4-20mA input modules will report broken transmitter alarms for any channel that is enabled and is NOT connected to a current source. Assuming all 8 channels are enabled (the default jumper setting), when you plugged in modules without the connectors you would get 8 broken transmitter alarms.
If you look at the I/O Mappings page of the System Config, you'll see there are 8 X inputs along with the 8 WX inputs for that module. The X bit associated with each channel will be ON when a broken transmitter is detected for that channel.
If this is the problem you're having, in Do-more Designer you should be seeing "Warning" on the status bar with a yellow background. Click on "Warning" to open the System Information dialog. You should then see the "Open I/O System View" button, also with a yellow background. Click that button to open the I/O System View, which will in turn display all of the current errors and warnings for all of the I/O modules. That's where you'll see the channels that are reporting broken transmitters.
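As a rough illustration of how the per-channel X bits relate to the "channels have failed" list, here is a small Python sketch. It assumes the first X bit of the module corresponds to channel 1 and so on in order, which the post does not state explicitly. The function names and the message wording are made up for the example.

```python
# Hypothetical sketch: turn a module's 8 status bits into a channel list.
# Assumption: bit index 0 maps to channel 1, index 1 to channel 2, and so on.

def failed_channels(x_bits):
    """Return the 1-based channel numbers whose status bit is ON."""
    return [index + 1 for index, bit in enumerate(x_bits) if bit]


def format_warning(channels):
    """Build a human-readable line from the list of flagged channels."""
    if not channels:
        return "No broken transmitters reported"
    return "Broken transmitter on channels: " + ", ".join(str(c) for c in channels)


if __name__ == "__main__":
    # A module plugged in with no connector and all 8 channels enabled.
    unplugged = [True] * 8
    print(format_warning(failed_channels(unplugged)))
```

With every enabled channel left unconnected, all eight channels end up in the list, which matches the behavior described above.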
-
That is all fine and good, but my CPU should not tell me that it is a failed channel input. This seems to imply a hardware problem, not a sensor problem. This is a VERY important distinction that should be clear from the error message.
Finally, this sort of error should not prevent the CPU from starting in the mode it was in when it was powered off. I'm not sure it ever did, now that it's starting up with no problems. Another lost question mark.
C
-
The CPU reports what the module tells it. The bits we are reporting are channel failure bits. Not sure what else to call them.
Channel fail bits will not prevent the CPU from returning to RUN, but general module failures do and any change to the I/O config does. The rules are well defined and well tested. If there is a deviation from those rules we certainly need to know about it and will be happy to fix it.
-
'channel signal loss', 'channel disconnected', 'channel current loss', or even 'channel current fault' are all more accurate.
I understand if this is a general condition that could denote actual hardware failure at the card, but as it exists it does not clearly indicate the range of problems that could exist. When I see "channel has failed", this in no way indicates that I could simply have it disconnected. Sure, RTFM, but I think this could be better indicated in DMD.
C
-
The term 'channel failure' comes from the module manufacturer's documentation, and yes, it is a very generic term.
Having the entire module fail while still reporting its module ID properly is very rare. These modules have a single A-to-D converter that gets multiplexed across the channels, so having a single channel fail in the module hardware while the other channels work is also extremely rare. What typically happens when the A2D fails is that you always get 0's for channel values but no broken wire indications.
By far the most common cause of these errors is a broken wire on one or more of the channels.
All that said, we will be changing the "channel failure" indication to "broken transmitter" in the I/O System View to hopefully point more directly at the problem being the external wiring and not the module itself.
-
Can you help me understand what 'watchdogged repeatedly' means? I do see that watchdog reboots happened ten times before it came up in program mode.
Thanks,
C
-
'Watchdogged' = watchdog timer rebooted a crashed controller. 'Repeatedly' = more than once, in this case, repeatedly = 10. Since your counter is at 10, it crashed and restarted 10 times, which is why it dropped into program mode. A crashing/rebooting PLC can't do comm, so dropping back to program mode improves your chances of being able to fix the issue from off site.
There were some networking issues fixed several firmware revisions ago. I would update and clear the reboot counter. If you don't ever want it to drop into program mode, just clear the reboot counter in your first-scan code.
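To make the counter behavior concrete, here is a toy Python simulation of a crash/reboot loop. It is a sketch under assumptions, not a model of the real firmware: the limit of 10 comes from this thread, and the simulation assumes the first-scan logic gets a chance to clear the counter before the next crash.

```python
# Hypothetical simulation of the watchdog reboot counter described above.

def simulate_crash_loop(crashes, limit=10, clear_on_first_scan=False):
    """Return (final counter value, mode after the last restart)."""
    counter = 0
    mode = "RUN"
    for _ in range(crashes):
        counter += 1  # each watchdog reboot bumps the counter
        if counter >= limit:
            mode = "PROGRAM"  # the controller stops trying to run
            break
        if clear_on_first_scan:
            counter = 0  # user logic resets the counter on the first scan
    return counter, mode


if __name__ == "__main__":
    print(simulate_crash_loop(25))
    print(simulate_crash_loop(25, clear_on_first_scan=True))
```

Under these assumptions, clearing the counter on every first scan means the limit is never reached, so a controller that keeps crashing would keep rebooting rather than settle into program mode.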
-
So the obvious next question is: what is causing the crash in the controller? How do I debug?
Colin
-
Update the firmware. It's probably already fixed.
-
Message is "PLC rebooted following hardware watchdog timeout".
This can't be related to those AIN cards, can it?
Colin
-
Doubtful.
You are doing some custom TCP comms, right?
-
Oh wait...not TCP...but EMAIL?
-
Not really. Three MB TCP/IP clients, an internal serial client, and an unused SERIO card.
Firmware was all up to date.
The reboots happened every five seconds. If I reset the timer, won't it just continue to reboot every five seconds indefinitely?
C
-
No email here.
-
Log.
-
So it's still doing it when you put it back in RUN mode?
-
This smells a bit like a bug we fixed. SERIO card maybe...let me look up the details.
-
It happens only sporadically. It has now decided to throw dozens of comm errors with no other change in configuration or connectivity on either side.
I'll see if it has the issue after I get it back online.
C
-
It's in RUN mode and working fine now. That's the frustrating part. Putting it back in RUN mode or power-cycling typically fixes the error immediately.
For now, I'm just going to have to leave the switch in RUN position as it's a critical PLC. I'm just curious what will happen when it watchdogs next time -- whether it will just continuously reboot or what.
C
-
There are some post mortem registers located at DST400-DST409. Throw those into a data view and send me a screen cap.
-
Here you are.
-
This contains the last 4 reboot states and the current state.
518 = normal operation following a program update
806 = normal operation following a transition to run mode
413 = normal operation following a boot to program
801 = failed running user logic during the first scan
The top three are normal and expected. When you make a program change, you'll see 518. When you switch to run mode, you'll see 806. When you power up in program mode, you'll see 413.
The 801 is where the problem occurred. It failed during a transition to run mode, while running user logic the first time.
What I/O modules do you have in each slot?
What module configs do you have listed?
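The reboot-state codes listed above lend themselves to a simple lookup. The Python sketch below covers only the four codes given in this post. The function names are hypothetical, and any other code is reported as unknown rather than guessed at.

```python
# Hypothetical lookup for the four reboot-state codes listed in this post.
REBOOT_STATES = {
    518: "normal operation following a program update",
    806: "normal operation following a transition to run mode",
    413: "normal operation following a boot to program",
    801: "failed running user logic during the first scan",
}

# Codes the post describes as normal and expected.
NORMAL_CODES = {518, 806, 413}


def describe(code):
    """Return the description for a code, or a placeholder if it is not listed."""
    return REBOOT_STATES.get(code, "unknown code (not listed in this thread)")


def not_known_normal(codes):
    """Return the codes that are not in the known-normal set, in order."""
    return [code for code in codes if code not in NORMAL_CODES]


if __name__ == "__main__":
    history = [518, 806, 413, 801]
    for code in history:
        print(code, "=", describe(code))
    print("needs a closer look:", not_known_normal(history))
```

The list of codes is passed in directly because the exact layout of the post mortem registers is not spelled out here.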
-
Here are some data. The device driver error is one I don't recognize.
-
Do you think it's possible this is defective hardware? We've replaced everything except the CPU (I think).
C
-
I'd like to see you clean up these messages first. Every one of these messages represents something that isn't quite right. If we aren't recovering correctly from those conditions, it could be causing the crash/reboot cycle, which clears itself after going through the program/run transition. Before you tweak anything, make sure you have a good backup, because if this clears it up, I'm gonna want to study the program to try to understand what isn't working right.
The "A multiscan instruction..." message is due to something like an MRX/MWX still being active when termination code is run. Termination code is executed any time a stage is disabled due to jump or reset, a program is stopped through exit or halt, or a task terminates. If a stage/task/program contains a multiscan instruction, you should *always* use the success/failed bits to enable the transition. The best way to think about what is happening is like killing the power to your PC vs doing a menu shutdown.
The string buffer overruns aren't too big a deal, but as a rule, you really should be using the correct buffer sizes for whatever you are doing. When this message happens it is due to something getting truncated.
The array index out of bounds can be of concern because we prevented you from doing something your code was trying to do. Both the code being out of bounds and the PLC ignoring the request are bad. I would want to understand what is happening.
It may well be that these are all non-issues...things that happened while developing and simply haven't been cleared. Hit the Clear All button at the bottom and see if they re-occur. If they do not, it's likely not the issue. If they do, I'd clean them up.
As for the hardware, it's possible, but I'm leaning to 'no' at this point.
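The advice above about multiscan instructions (only leave the stage once the success or failed bit is set) can be sketched as a small Python state machine. FakeRead is a made-up stand-in for an instruction such as MRX that takes several scans to finish. The stage names are hypothetical and this is not ladder code.

```python
# Hypothetical sketch: gate a stage transition on an operation's completion bits.

class FakeRead:
    """Stand-in for a multiscan instruction that needs several scans."""

    def __init__(self, scans_needed, will_succeed=True):
        self.remaining = scans_needed
        self.will_succeed = will_succeed
        self.success = False
        self.failed = False

    def scan(self):
        """Advance the operation by one scan; set a completion bit when done."""
        if self.success or self.failed:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.success = self.will_succeed
            self.failed = not self.will_succeed


def run_stage(op, max_scans=100):
    """Scan until the operation completes; return (next stage, scans used)."""
    for scan_no in range(1, max_scans + 1):
        op.scan()
        if op.success:
            return "NEXT", scan_no   # transition enabled by the success bit
        if op.failed:
            return "RETRY", scan_no  # transition enabled by the failed bit
    return "STILL_BUSY", max_scans   # never jump away while still active


if __name__ == "__main__":
    print(run_stage(FakeRead(3)))
    print(run_stage(FakeRead(2, will_succeed=False)))
```

The point of the sketch is that the stage never transitions while the operation is mid-flight, which is the "menu shutdown" rather than "killing the power" case.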
-
And the device driver error?
-
Could just be comm errors in your Modbus. May be related to the abnormal termination of the Modbus instructions. Any time a failed bit is set the device driver error flag gets set too.
-
Ok, well let me run this down. I've looked at these before and not been able to nail them down.
1. Attempted string operation was longer than the target string and was truncated ...
The operation is a STRGETB, getting 40 bytes from a string struct that is 80 characters long and putting it into an MHR memory location that exists.
There are four such errors.
2. The index error was an easy programming fix.
3. The forcibly terminated was indeed a jump in parallel with an MRX operation. I fixed that.
I'd still like advice on the above. I'll see if the thing still kicks over out of run.
Colin
-
I am super confused. That message is a generic message generated by the driver layer when it doesn't have detailed parameter information...STRGETB doesn't generate that message.
There was a bounds check bug in STRGETB in Rel 1.0 firmware. It was fixed in 1.1.
If you are willing, please email your program to me.
-
Never mind, I figured it out. DmD takes the system error code and generates a message from it. It's wrong in this case.
There was definitely a bug in 1.0 firmware and it would wrongly say the buffer was too short.
-
So it is safe to ignore?
-
No. It didn't work right in 1.0 (the instruction failed), but if you are getting it in anything since then, it should be a valid error. Truncating isn't the worst thing if there was nothing critical in the cut off part, but it indicates that you are trying to write something somewhere it won't fit.
Or...the instruction is still broken, and if so, I very much want to know how.
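For readers unfamiliar with what a truncation warning means in general, here is a generic Python sketch of copying bytes into a fixed-size target. It is not how STRGETB is implemented. The function name and the capacities are assumptions chosen to mirror the 40-byte and 80-character figures mentioned earlier in the thread.

```python
# Hypothetical sketch of a bounded copy that reports truncation.

def copy_bytes(source, count, target_capacity):
    """Copy up to `count` bytes into a target that holds `target_capacity` bytes.

    Returns (data actually stored, truncated flag).
    """
    requested = source[:count]
    truncated = len(requested) > target_capacity
    return requested[:target_capacity], truncated


if __name__ == "__main__":
    source = b"x" * 80
    # Target large enough: nothing is cut off.
    print(copy_bytes(source, 40, 40))
    # Target too small: the tail is dropped and the flag is raised.
    print(copy_bytes(source, 40, 32))
```

If the flag comes back set, something was cut off, which is the situation the warning is meant to point out.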
-
Alright, we saw it again. Here's what I get:
In Warning
------------------------------
Buffer overflow
Device driver error
One or more IO masters are indicating a problem with a module
In event logs
------------------------------
Many cycles of :
PLC rebooted following hardware timeout
System was turned off
System was turned on
Then most recently, about twenty of the error:
Critical error occurred in the IO system
In I/O System
------------------------------
Channel failure warnings on unused analog channels in slots 2 and 4
What is this critical error in the IO system?
Colin
-
Can be multiple things, but it's probably a bad module ID. As we scan the base, we check module IDs to make sure they haven't been pulled or failed or whatever. If I see the wrong ID (or no ID) it gets flagged. This type of I/O error is considered critical and takes the controller out of RUN mode.
This feels like a very systemic issue...bad base, bad hardware, bad power, horrible bit of electrical noise...
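The module ID check described above amounts to comparing what each slot reports against what the configuration expects. A minimal Python sketch follows, with hypothetical function names and with module IDs represented as plain strings for illustration. A slot that returns nothing is modeled as None.

```python
# Hypothetical sketch of a per-scan module ID check.

def mismatched_slots(expected, observed):
    """Return the slots whose observed module ID differs from the expected one.

    `expected` and `observed` map slot number -> module ID string.
    A missing or None entry in `observed` means the slot gave no ID.
    """
    return [slot for slot, module_id in sorted(expected.items())
            if observed.get(slot) != module_id]


def is_critical(expected, observed):
    """A wrong or missing ID in any slot counts as a critical I/O error."""
    return bool(mismatched_slots(expected, observed))


if __name__ == "__main__":
    expected = {0: "F2-08AD-2", 1: "F2-04AD-2"}
    glitch = {0: "F2-08AD-2", 1: None}  # slot 1 did not answer this scan
    print(mismatched_slots(expected, glitch))
    print(is_critical(expected, glitch))
```

A single scan with a missing answer is enough to trip the check in this sketch, which is why a noisy event or a brief power dip could look like a module problem.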
-
Ok, so what sort of IO issue could create the critical error? Overvoltage, perhaps a dip in power supply?
Colin
-
I would look for a noisy event or flaky power.
Has all the hardware been replaced?
-
Yes, except the CPU. I may replace it today.
The good part is we are about to fire up an identical system, so if it's systemic, we'll see it. If it's hardware or a miswire, I'd guess we don't. You don't often get such good data.
C
-
I don't know where this would go, but there should definitely be a warning or error for attempting to jump to a non-existent stage. It is a very annoying thing to debug. Program is running, but no active stages ...
-
Check your output window. Program check warning W201...