"Am I still working okay?" asked the micro controller...

Say we have a micro controller with limited memory. Say it will perform some realtime control of something.

How do you write software for a microcontroller so that, in addition to its normal operation (controlling something), it also checks itself from time to time to see whether it is working correctly? How can a program test itself? Can someone suggest an intelligent method (other than a watchdog)?

Reply to
SelfTest

That's called a 'watchdog' timer and is standard in most microcontrollers. It's basically a countdown timer which the program running on the microcontroller must reset several times per second to keep it from reaching zero. When it reaches zero, the microcontroller is reset. So when a program 'hangs', it stops resetting the watchdog countdown timer and the microcontroller is reset.

Reply to
Uddo Graaf

Ultimately, you can't. A CPU can no more meaningfully ask itself "Am I still working OK?" than you can ask yourself meaningfully "Have I fallen asleep yet?"

You can use watchdogs or internal consistency checking, to some extent, to determine the general health of the software. Assertions can be inserted into the code, i.e. conditions that you know must hold true at all times, because otherwise something is fatally wrong.

But there's often little or no point trying to detect hardware faults --- if the hardware does break, you're quite probably toast anyway. You can't usually fix such a problem from the software side, and by The Usual Kind of Luck, the faults that do occur will be exactly those you can't, or at least didn't, test for. And that's before you consider that such tests mean more code in total, and thus more opportunities for bugs.

Moral: if you don't know what to do with the answer, don't ask the question.

Reply to
Hans-Bernhard Broeker

You are correct in identifying watchdog timers as one form of COP (computer operating properly test). Other things I've often used are:

1) Background checksum on code and constant/initializer areas of memory.

2) Flags and timers which indicate that critical routines and interrupts are running at about the right rate, usually checked in the watchdog timer interrupt.

3) Guard words between stacks and other memory, and regular checks that these have not been compromised (again, often in the watchdog timer interrupt).

4) Feedback of critical output signals to ensure the hardware is working correctly (the hardware is much more likely to suffer random failures than the software).

5) A decent watchdog timer with an algorithmic stimulus and response (e.g. the watchdog processor supplies a pseudo-random number and the main processor replies with the next pseudo-random number in the sequence). Much better than the primitive kick-within-a-certain-time style of watchdog, which is prone to failing to detect runaway software that happens to include a kick.

6) One I haven't used, but have seen used on a critical PLC-style system, is an odd number of redundant processors (3 in this case) which vote on the state of an output (the output follows the state of two agreeing inputs).

Of course, the next question you should ask is "What do I do when I detect a failure". If it is a safety critical system (e.g. the something you're controlling is a train, nuclear reactor or gas furnace rather than a lego windmill) there's a whole other set of questions you should ask even before asking the first one.

hth, Alf

Reply to
Unbeliever

Without special hardware support, you can't.

It can't.

Redundant hardware running independently developed software, with majority voting on the outputs.

Reply to
Grant Edwards

You also need to consider the likelihood of a problem occurring in the first place - time spent designing the hardware to be reliable (e.g. EM/ESD immunity) is time much better spent than trying to second-guess what might go wrong and then hope you can do something useful about it.

For example, in the old days when systems typically comprised separate MCU/RAM/ROM chips, it made sense to test SRAM and checksum ROM, as these involved many interconnections and sockets which could fail. It makes much less sense to do it on a single-chip MCU, where the sort of failures that are plausible on a separate-chip system just don't happen.

Reply to
Mike Harrison

Most of the microcontrollers I've seen that are intended for applications like this have a built-in watchdog timer (I'm assuming when you say "other than watch dog" you mean "other than external watchdog"). In the case of the processor I know best, the HC11, it's called the COP (Computer Operating Properly) timer. The idea here is your software has to reset it occasionally; if the timer ever goes off, it's because your control program has gotten itself wedged.

Reply to
Joe Pfeiffer

And the probability that your program will still be able to run and do predictable things when there is a failure in the MCU is also small.

Multiply the probability of MCU failure by the probability your program will run with such a failure, and you get a number sufficiently close to zero yadda, yadda, ...

Reply to
Grant Edwards

...and adding to that list: the external pulse-maintained relay. This device has to be fed a change of polarity on its input signal at a regular rate in order to keep a relay in its energised state. If any single component fails, the power supply goes off, or the input stops changing, then the relay just de-energises and opens its contacts. The pulse drive for such a circuit should be driven from the internal sanity checks that your software is performing (all checks OK, so change the state of the output). This device can elevate a single processor from SIL0 to SIL1 with very little effort.

Further, your microcontroller may be communicating with other systems in order to perform its control. Doing sanity checks on the communication link and checking its integrity in operation will give a good idea of sub-system health. You will need checksums and/or CRCs on all messages between systems.

An integral step-wise walking memory test, and other walking sanity checks, can detect potential failure points quite early on.

There are a number of others.

You should do an evaluation of what the system safe state is going to be (off, bypassed or gracefully degrading). Then your design efforts should always lean the system toward achieving those safe states unless it is continuing to work properly.

Reply to
Paul E. Bennett

If you have access to a decent library, check out one these standards before you choose which hardware to use:

ANSI/AAMI SW68, Medical Device Software - Software Life-Cycle Processes

ANSI UL1998, the Standard for Safety of Software in Programmable Systems

EN/IEC 60601-1-4, the Collateral Standard for Programmable Electrical Medical Systems

Best regards, Spehro Pefhany

Reply to
Spehro Pefhany

What do you plan to have the microcontroller do if the answer is "no"?

Reply to
Guy Macon

I worked on a project substantially larger than a single microcontroller but the idea we applied might be appropriate. We took a very hard line on this and the charter of the group was that there were going to be no bugs delivered to the customers. In some of the functions that we wrote it was feasible to write one, or a small number, of "sanity checks", small tests that would evaluate whether arguments being passed and/or state variables had values that were appropriate at the moment.

If a sanity check failed we displayed "Fatal Error nnnnn", where nnnnn was the program counter at the point where the check failed, and then we halted the processor.

This had a number of interesting and sometimes unexpected consequences. The first was that it quickly became the case that nobody wanted to be the one responsible for passing bad data to someone else's sanity check. That seemed to result in people being much more careful not to pass bad data. Secondly, it became very popular for people to carefully craft these checks to keep themselves from being responsible for a failure.

Thirdly, in an embedded environment when everyone is in a panic to get all the work done, and the box just locks up and you know it is going to take hours to figure out what just happened, it seems much more reasonable to just hit the reset button and get on with your own work. But when "Fatal Error nnnnn" pops up and in seconds you can look at the build file and tell exactly where the error happened and which sanity check failed, you are much more likely to yell "FATAL ERROR NNNNN!" over the wall. Everybody in the team would cringe, hoping it wasn't them who had just called that function with bad data. And the person who had just observed this, plus the person who had inserted that sanity check, were both "the good guys." This soon led to adding sanity checks whenever we would find the box crashed in some strange way and it took hours to realize we hadn't caught some bad case.

But this then led us to being able to test in a novel way. We wrote some code on a test harness that would hammer the box with random input. It would poke buttons, send in commands, and present data, pretty much completely randomly, but at 100 commands/second! Within seconds of trying this a check blew up and we had another Fatal Error nnnnn. But that let us find and fix an oversight quickly. After a number of iterations we got to the point where this would run all weekend with zero failures.

Then the decision was made: we were going to leave all these checks live in the code when we shipped it. Another team working across the wall with a similar product was horrified, "You don't want your customers to know you have BUGS, DO YOU?!?!" And our reply was that they were going to know one way or the other. We shipped. And we waited. And we waited. All the checks apparently had made us find almost all the bugs before it went out the door.

One afternoon I did get a call from the marketing rep. He had a message from the marketing secretary. She had a message from the receptionist. She had a call from Hughes. They had been using this and it had popped up "Fatal Error nnnnn" and just locked up. They were so astonished that they went over to another building, got a camera, brought it back and took a picture. Then they called. And I got nnnnn from 1500 miles away. In 30 seconds I knew which check had failed, knew that it was a single variable, knew it must have been out of range and I could now hammer the box until I could figure out a way to find and fix that. I did.

After 18 months and with 2000 of the product in the field being used by people pretty much full time we had 3 Fatal Errors found, and I thought that was pretty much all of them that were ever seen because in the manual it told them that if they ever saw this to call this phone number and tell us that number so we could fix it for them. I found and fixed those 3 and a number of others that I knew about but no customer would likely ever see.

The guys across the wall, they had ten times the support team and didn't even bother about bugs that didn't just crash the box, and if it did, they just cycled the power and went on. I even tried to get marketing to offer a campaign, I'd PAY customers for the first Fatal Error found. They squashed that, it would have made the other team look bad.

One other item that helped with the sanity checks, we filled all memory with 0xAAAA initially, and even when some memory was released. That oddball value was unlikely to be a reasonable value for most state variables and helped us fail more sanity checks.

Reply to
Don Taylor

There are some applications where instead of having a watchdog reset the system when it goes astray you can simply reset the system again and again with a periodic reset. This can be the output of an oscillator or even the push of a button (a common way of designing toys).

Reply to
Guy Macon

[snip]

Don, may I have permission to put your story up on my web page?

Here is another technique which I use:

Start with "finished" and "debugged" code.

Have one programmer insert N bugs in another programmer's code, keeping careful records of what and where. The idea is to put in errors typical of the errors that the person writing the code normally makes.

Have the author of the code debug and fix all bugs that he can find, stopping when he can't find any more bugs. Keep record of all bugs fixed. Don't tell him which are his or how many were inserted.

Let's say that we inserted 20 bugs, he found 10 of them, and he found 20 of his own bugs. Since he caught half of the seeded bugs, he has presumably caught about half of his own as well, which tells us that there are around 20 of his own bugs still undiscovered.

The psychology is interesting. The programmers write code with far fewer bugs and do a far better job of testing before saying that they are done. The programmer who finds all of the inserted bugs and no new bugs is a hero. (I reinforce that with bonuses and with specific mention in writing of this accomplishment during performance reviews.)

Reply to
Guy Macon

As SelfTest hasn't come back yet to give any more info or comments, I am looking at his "(other than watch dog)" and wondering if the question is really "Is my micro still running and going about its normal business?"

Usually the first thing any programmer learns is how to flash an LED. By adding an LED and resistor to an output pin, you can call "turn LED on" and "turn LED off" routines in a sequence; say, flashing 4 times on power-up means OK.

Extending this further, you can test for certain I/O operations taking place correctly with a set number of flashes.

Many companies use 7-segment LEDs on their products, where such things as "system alive" can mean the 7-segment LED running around in a figure 8.

Power up, self test, and real time diagnostics can be performed from a simple single LED, right up to multiple computer systems to monitor the operations.

I believe that anybody who designs a useful lump of hardware should have at least one LED that can be pulsed under program control for this purpose.

Cheers Don...

Reply to
Don McKenzie

On the Amiga computer one of the testing packages used 0xDEADBEEF to fill unused memory. ;-)

It also added guard-band areas around allocated memory and then checked them after the free, to be sure you didn't write outside of your allocated area.

That second idea would work best if you had an OS or at least memory management code.

Reply to
Gerald Bonnstetter

Anyone who enables the watchdog timer is advertising:

1) My code is dodgy.
2) My hardware is EMC-prone.
3) I have a new source of error: the watchdog itself.

Cheers Robin

Reply to
robin.pain

For any non-trivial application, all three are true.

Reply to
Dave VanHorn

Robin should stick to Lego and not electronics.

Reply to
Captain Bly

What a pile of bullshit. There are more reasons for an embedded system to fail than you can even begin to imagine. Not using watchdogs (in a sensible way, of course) is totally irresponsible in my opinion.

Reply to
Guillaume
