Say we have a microcontroller with limited memory.
Say it will perform some real-time control of something.
How can we write software for a microcontroller that, in addition to its
normal operation (control of something), also checks from time to time
whether it is doing okay or not? How can a program test itself? Can
someone suggest any intelligent method (other than a watchdog)?
That's called a 'watchdog' timer and is standard in most microcontrollers.
It's basically a countdown timer which the program running on the
microcontroller has to reload x times per second to prevent it reaching
zero. When it reaches zero the microcontroller is reset. So when a program
'hangs', it stops reloading the watchdog countdown timer and the
microcontroller is reset.
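As a rough illustration of that mechanism, here is a small software model of the countdown in C. It does not use any real chip's registers: the names `wdt_t`, `wdt_init`, `wdt_kick` and `wdt_tick` are made up for this sketch, and on real hardware the counting and the reset are done by the watchdog peripheral itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a hardware watchdog countdown. */
typedef struct {
    uint32_t reload;   /* value loaded on every kick, in timer ticks */
    uint32_t counter;  /* current countdown value */
} wdt_t;

void wdt_init(wdt_t *w, uint32_t reload_ticks)
{
    w->reload = reload_ticks;
    w->counter = reload_ticks;
}

/* Called by the application: restart the countdown. */
void wdt_kick(wdt_t *w)
{
    w->counter = w->reload;
}

/* Called once per timer tick (done by hardware on a real part).
 * Returns true when the countdown has expired, i.e. when the
 * microcontroller would be reset. */
bool wdt_tick(wdt_t *w)
{
    if (w->counter > 0u) {
        w->counter--;
    }
    return w->counter == 0u;
}
```

A healthy main loop calls `wdt_kick()` more often than the timeout; a hung program never does, so `wdt_tick()` eventually reports expiry.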
[OP forgot to limit F'up2; fixed. Removed non-existent c.a.e.piclist
In comp.arch.embedded SelfTest <SelfTEst> wrote:
Ultimately, you can't. A CPU can no more meaningfully ask itself "Am
I still working OK?" than you can ask yourself meaningfully "Have I
fallen asleep yet?"
You can use watchdogs or internal consistency checking to some extent
to determine general health of the software. Assertions can be
inserted into the code, i.e. conditions that you know must come out
true at all times, because otherwise something's fatally wrong.
But there's often little or no point trying to detect hardware faults
--- if the hardware does break you're quite probably toast anyway.
You can't usually fix such a problem from the software side, and by
The Usual Kind of Luck, the faults that do occur will be exactly those
you can't, or at least didn't test for. And that's before you
consider that such tests mean more code in total, and thus more
opportunities for bugs.
Moral: if you don't know what to do with the answer, don't ask the
question.
Hans-Bernhard Broeker ( email@example.com)
Even if all the snow were burnt, ashes would remain.
You are correct in identifying watchdog timers as one form of COP (computer
operating properly test). Other things I've often used are:
1) Background checksum on code and constant/initializer areas of memory
2) Flags and timers which indicate that critical routines and interrupts
are running at about the right rate, usually checked in the watchdog timer
3) Guardwords between stacks and other memory and regular checks that
these have not been compromised (again often in the watchdog timer
4) Feedback of critical output signals to ensure the hardware is working
correctly (the hardware is much more likely to suffer random failures than
5) A decent watchdog timer with an algorithmic stimulus and response
(e.g. watchdog processor supplies a pseudorandom number and main processor
replies with next pseudo-random number in a sequence). Much better than the
primitive kick within a certain time style of watchdog, which is prone to
failure to detect runaway software which includes a kick.
6) One I haven't used but have seen used on a critical PLC-style system is an
odd number of redundant processors (3 in this case) which vote on the state
of an output (output follows the state of two agreeing inputs).
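Item 5 in the list above can be sketched roughly as follows. The post does not say which pseudo-random sequence was used, so this example assumes a 16-bit Galois LFSR shared by both sides; the function names are hypothetical. The watchdog processor sends a challenge, the main processor answers with the next value in the sequence, and the watchdog accepts only that answer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Next value of a 16-bit Galois LFSR (feedback mask 0xB400).
 * Zero is a fixed point of the sequence, so challenges must be nonzero. */
uint16_t prng_next(uint16_t x)
{
    uint16_t lsb = (uint16_t)(x & 1u);
    x = (uint16_t)(x >> 1);
    if (lsb) {
        x = (uint16_t)(x ^ 0xB400u);
    }
    return x;
}

/* Main processor side: compute the reply to a challenge. In a real
 * system this would be called only after the software's own checks
 * have passed. */
uint16_t main_cpu_reply(uint16_t challenge)
{
    return prng_next(challenge);
}

/* Watchdog processor side: accept only the correct next value. */
bool wd_reply_ok(uint16_t challenge, uint16_t reply)
{
    return reply == prng_next(challenge);
}
```

Because the expected reply changes with every challenge, runaway code that merely keeps writing a fixed "kick" value will not satisfy the watchdog.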
Of course, the next question you should ask is "What do I do when I detect a
failure". If it is a safety critical system (e.g. the something you're
controlling is a train, nuclear reactor or gas furnace rather than a lego
windmill) there's a whole other set of questions you should ask even before
asking the first one.
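The voting logic from item 6 in the list above is simple enough to show. In the system described, the vote is between three separate processors, probably in hardware; this sketch only illustrates the 2-out-of-3 rule itself, applied bit by bit to three copies of an output byte. The function names are hypothetical.

```c
#include <stdint.h>

/* Bitwise 2-out-of-3 majority: each output bit follows the value
 * that at least two of the three inputs agree on. */
uint8_t vote_2oo3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

/* Returns a mask of the bits where any channel disagrees with the
 * voted result, so that the disagreement can be logged or acted on. */
uint8_t vote_disagreement(uint8_t a, uint8_t b, uint8_t c)
{
    uint8_t v = vote_2oo3(a, b, c);
    return (uint8_t)((a ^ v) | (b ^ v) | (c ^ v));
}
```

Reporting the disagreement matters as much as the vote: once one channel has failed, the system has no redundancy left.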
..and adding to that list. External Pulse Maintained relay. This device has
to be fed a change of polarity of its input signal at a regular rate in
order for it to maintain a relay in its energised state. If any single
component fails, the power supply goes off or the input does not change
then the relay just de-energises and opens its contacts. The pulse drive
for such a circuit should be driven from the processor internal sanity
checks that your software is performing (all checks OK so change the state
of the output). This device can elevate a single processor from SIL0 to
SIL1 with very little effort.
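One way the software side of such a relay drive might look is sketched below. The check names and the `relay_drive_t` structure are invented for the example, and `pin_level` stands in for the real output pin. The point is that the output changes state only when every sanity check has reported in since the last period.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical health flags, each set by the code it monitors. */
#define CHECK_CONTROL_LOOP  (1u << 0)
#define CHECK_COMMS         (1u << 1)
#define CHECK_MEMORY        (1u << 2)
#define CHECK_ALL (CHECK_CONTROL_LOOP | CHECK_COMMS | CHECK_MEMORY)

typedef struct {
    uint8_t passed;     /* checks that reported OK this period */
    bool    pin_level;  /* stands in for the real output pin */
} relay_drive_t;

/* Called by each monitored routine when its own check passes. */
void relay_report_ok(relay_drive_t *r, uint8_t check)
{
    r->passed = (uint8_t)(r->passed | check);
}

/* Call once per period. Toggles the output only if every check
 * reported OK since the last call, then clears the flags so each
 * check has to report again. Returns true if the pin was toggled. */
bool relay_drive_service(relay_drive_t *r)
{
    bool all_ok = (r->passed & CHECK_ALL) == CHECK_ALL;
    r->passed = 0u;
    if (all_ok) {
        r->pin_level = !r->pin_level;
    }
    return all_ok;
}
```

If any check stops reporting, the pin stops changing and the external circuit lets the relay drop out without further help from the software.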
Further, your microcontroller may be communicating with other systems in
order to perform its control. Doing sanity checks on the communication link
and checking its integrity in operation will yield a good idea of
sub-system health. You will need checksums and/or CRCs on all messages
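As an example of a message CRC, here is a bitwise CRC-16 in C. The post does not name a particular CRC; this sketch assumes the common CCITT variant with polynomial 0x1021 and initial value 0xFFFF, and a hypothetical frame layout with the CRC in the last two bytes, high byte first.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-16, polynomial 0x1021, initial value 0xFFFF, no reflection. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc = (uint16_t)(crc ^ ((uint16_t)data[i] << 8));
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Assumed frame layout: payload followed by a 2-byte CRC, high byte
 * first. Returns true if the received CRC matches the payload. */
bool frame_crc_ok(const uint8_t *frame, size_t len)
{
    if (len < 2u) {
        return false;
    }
    uint16_t received =
        (uint16_t)(((uint16_t)frame[len - 2u] << 8) | frame[len - 1u]);
    return crc16_ccitt(frame, len - 2u) == received;
}
```

On a part with little flash the bitwise loop is usually acceptable; a table-driven version trades 512 bytes of ROM for speed.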
Integral step-wise walking memory test and other walking sanity checks.
This can detect potential failure points quite early on.
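A step-wise memory test might look something like the sketch below: each call tests one byte with a walking-one pattern (and its complement), restores the original content, and moves on to the next byte, so the test can be spread over many passes of the main loop. The `ramtest_t` structure and function name are hypothetical, and this is only one reading of "walking" test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    volatile uint8_t *base;  /* start of the region under test */
    size_t length;           /* region size in bytes, must be > 0 */
    size_t next;             /* index of the next byte to test */
} ramtest_t;

/* Test one byte with a walking-one pattern and its complement, then
 * restore it. On a real target, interrupts that use the byte must be
 * masked around this call (not shown). Returns false on a mismatch. */
bool ramtest_step(ramtest_t *t)
{
    volatile uint8_t *p = &t->base[t->next];
    uint8_t saved = *p;
    bool ok = true;

    for (unsigned bit = 0; bit < 8u; bit++) {
        uint8_t pattern = (uint8_t)(1u << bit);
        uint8_t inverse = (uint8_t)~pattern;
        *p = pattern;
        if (*p != pattern) {
            ok = false;
        }
        *p = inverse;
        if (*p != inverse) {
            ok = false;
        }
    }
    *p = saved;
    t->next = (t->next + 1u) % t->length;
    return ok;
}
```

This only catches stuck or shorted data bits in the byte under test; address-line faults need a different pattern across several locations.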
There are a number of others.
You should do an evaluation of what the system safe state is going to be
(off, bypassed or gracefully degrading). Then your design efforts should
always lean the system toward achieving those safe states unless it is
continuing to work properly.
Paul E. Bennett ....................<email://peb@a...>
On Wed, 19 May 2004 23:05:23 +1000, "SelfTest" <SelfTEst> wrote:
You also need to consider the likelihood of a problem occurring in the first
place - time spent designing the hardware to be reliable (e.g. EM/ESD
immunity) is time much better spent than trying to second-guess what might
go wrong and then hope you can do something useful
For example, in the old days when systems typically comprised separate
MCU/RAM/ROM chips, it made sense to test SRAM and checksum ROM, as these
involved many interconnections and sockets which could fail. It makes much
less sense to do it on a single-chip MCU, where the sort of failures that
are plausible on a separate-chip system just don't happen.
And the probability that your program will still be able to run
and do predictable things when there is a failure in the MCU is
Multiply the probability of MCU failure by the probability your
program will run with such a failure, and you get a number
sufficiently close to zero yadda, yadda, ...
Grant Edwards grante Yow! Spreading peanut
at butter reminds me of
Most of the microcontrollers I've seen that are intended for
applications like this have a built-in watchdog timer (I'm assuming
when you say "other than watch dog" you mean "other than external
watchdog"). In the case of the processor I know best, the HC11, it's
called the COP (Computer Operating Properly) timer. The idea here is
your software has to reset it occasionally; if the timer ever goes
off, it's because your control program has gotten itself wedged.
Joseph J. Pfeiffer, Jr., Ph.D. Phone -- (505) 646-1605
Department of Computer Science FAX -- (505) 646-1002
On Wed, 19 May 2004 23:05:23 +1000, the renowned "SelfTest" <SelfTEst>
If you have access to a decent library, check out one of these standards
before you choose which hardware to use:
ANSI/AAMI SW68, Medical Device Software - Software Life-Cycle
ANSI UL1998, the Standard for Safety of Software in Programmable
EN/IEC 60601-1-4, the Collateral Standard for Programmable Electrical
"it's the network..." "The Journey is the reward"
firstname.lastname@example.org Info for manufacturers: http://www.trexon.com
I worked on a project substantially larger than a single microcontroller
but the idea we applied might be appropriate. We took a very hard line
on this and the charter of the group was that there were going to be no
bugs delivered to the customers. In some of the functions that we wrote
it was feasible to write one, or a small number, of "sanity checks",
small tests that would evaluate whether arguments being passed and/or
state variables had values that were appropriate at the moment.
If a sanity check failed we displayed "Fatal Error nnnnn", where nnnnn
was the program counter at the point where the check failed, and then
we halted the processor.
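A sanity check in that style might be sketched like this. The original reported the program counter, which plain C cannot read portably, so this example assumes a hand-assigned check number instead; `fatal_error`, `SANITY_CHECK`, `set_motor_speed` and the limits are all made up for illustration. On a real target `fatal_error` would drive the display and halt; here it only records the failure so the sketch can run on a host.

```c
#include <stdbool.h>
#include <stdio.h>

/* Last failure, kept where a display driver or debugger can read it. */
char fatal_message[32];
unsigned fatal_id = 0;
bool fatal_raised = false;

/* Report a failed sanity check. A real target would show the message
 * and halt the processor here instead of returning. */
void fatal_error(unsigned id)
{
    snprintf(fatal_message, sizeof fatal_message, "Fatal Error %05u", id);
    fatal_id = id;
    fatal_raised = true;
}

/* Evaluates to true if cond holds; otherwise reports id and is false. */
#define SANITY_CHECK(cond, id) ((cond) ? true : (fatal_error(id), false))

#define MAX_RPM 6000  /* hypothetical limit */

/* Example function guarded by a sanity check on its argument. */
bool set_motor_speed(int rpm, int *speed_out)
{
    if (!SANITY_CHECK(rpm >= 0 && rpm <= MAX_RPM, 1042u)) {
        return false;
    }
    *speed_out = rpm;
    return true;
}
```

Giving each check a unique number is what makes the report from the field useful: the number alone identifies which check failed.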
This had a number of interesting and sometimes unexpected consequences.
The first was that it quickly became the case that nobody wanted to be
the one responsible for passing bad data to someone else's sanity check.
That seemed to result in people being much more careful that they would
not pass bad data. Secondly, it became a very popular thing for people
to carefully craft these checks to keep themselves from being responsible
for a failure. Thirdly, in an embedded environment where everyone is in
a panic to get all the work done, when the box just locks up and you
know it is going to take hours to figure out what just happened, it
seems much more reasonable to just hit the reset button and try to get
on with your own work. But when "Fatal Error nnnnn" pops up
and in seconds you can look at the build file and tell exactly where
the error happened and what sanity check failed you are much more likely
to yell "FATAL ERROR NNNNN!" over the wall. Everybody in the team would
cringe, hoping it wasn't them who had just called that function with
bad data. And the person who had just observed this, plus the person
who had inserted that sanity check were both "the good guys." This soon
led to adding sanity checks when we would find the box crashed in some
strange way and it took hours to realize we hadn't caught some bad case.
But this then led us to being able to test in a novel way. We wrote
some code on a test harness that would hammer the box with random input.
It would poke buttons and send in commands and present data, pretty
much completely randomly, but at 100 commands/second! Within seconds
of trying this a check blew up and we had another Fatal Error nnnnn.
But that let us find and fix an oversight quickly. After a number of
iterations we were to the point where this would run all weekend with
Then the decision was made: we were going to leave all these checks in the
code, live, when we shipped it. Another team working across the
wall with a similar product was horrified, "You don't want your
customers to know you have BUGS, DO YOU?!?!" And our reply was that
they were going to know one way or the other. We shipped. And we
waited. And we waited. All the checks apparently had made us find
almost all the bugs before it went out the door.
One afternoon I did get a call from the marketing rep. He had a message
from the marketing secretary. She had a message from the receptionist.
She had a call from Hughes. They had been using this and it had popped
up "Fatal Error nnnnn" and just locked up. They were so astonished that
they went over to another building, got a camera, brought it back and
took a picture. Then they called. And I got nnnnn from 1500 miles away.
In 30 seconds I knew which check had failed, knew that it was a single
variable, knew it must have been out of range and I could now hammer
the box until I could figure out a way to find and fix that. I did.
After 18 months and with 2000 of the product in the field being used by
people pretty much full time we had 3 Fatal Errors found, and I thought
that was pretty much all of them that were ever seen because in the
manual it told them that if they ever saw this to call this phone number
and tell us that number so we could fix it for them. I found and fixed
those 3 and a number of others that I knew about but no customer would
likely ever see.
The guys across the wall, they had ten times the support team and didn't
even bother about bugs that didn't just crash the box, and if it did,
they just cycled the power and went on. I even tried to get marketing
to offer a campaign, I'd PAY customers for the first Fatal Error found.
They squashed that, it would have made the other team look bad.
One other item that helped with the sanity checks, we filled all memory
with 0xAAAA initially, and even when some memory was released. That
oddball value was unlikely to be a reasonable value for most state
variables and helped us fail more sanity checks.
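The fill idea is easy to sketch. The 0xAAAA value is from the story above; the helper names and the example state variable (a mode that must be 0 to 3) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FILL_PATTERN 0xAAAAu

/* Fill a block of 16-bit words with the pattern. Used at start-up
 * and again whenever a block is released. */
void fill_words(uint16_t *mem, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        mem[i] = FILL_PATTERN;
    }
}

/* Hypothetical sanity check for a state variable whose only valid
 * values are 0..3. Memory still holding the fill pattern fails it. */
bool mode_is_valid(uint16_t mode)
{
    return mode <= 3u;
}
```

A zero fill would pass many range checks by accident; an oddball value makes reads of uninitialised or released memory show up as failures.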
Don, may I have permission to put your story up on my web page?
Here is another technique which I use:
Start with "finished" and "debugged" code.
Have one programmer insert N bugs in another programmer's code, keeping
careful records of what and where. The idea is to put in errors typical
of the errors that the person writing the code normally makes.
Have the author of the code debug and fix all bugs that he can find,
stopping when he can't find any more bugs. Keep a record of all bugs
fixed. Don't tell him which are his or how many were inserted.
Let's say that we inserted 20 bugs, he found 10 of them, and he found
20 of his own bugs. That tells us that there are around 20 of his
own bugs still undiscovered.
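The arithmetic behind that estimate: finding 10 of 20 inserted bugs suggests a detection rate of 50%, so the 20 own bugs found imply about 40 in total, leaving about 20. A small C helper (the name and the -1 convention are this sketch's own) makes the calculation explicit.

```c
/* Estimate how many of the author's own bugs remain undiscovered,
 * assuming own bugs are found at the same rate as inserted ones.
 * Returns -1 if the inputs give no basis for an estimate.
 * Integer division rounds the estimated total down. */
long estimate_remaining_bugs(long seeded, long seeded_found, long own_found)
{
    if (seeded_found <= 0 || seeded_found > seeded || own_found < 0) {
        return -1;
    }
    /* own_total is approximately own_found * seeded / seeded_found */
    long own_total = (own_found * seeded) / seeded_found;
    return own_total - own_found;
}
```

The estimate is only as good as the assumption that the inserted bugs resemble the author's real ones, which is why the inserted errors should be typical of that author.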
The psychology is interesting. The programmers write code with far
fewer bugs and do a far better job of testing before saying that they
are done. The programmer who finds all of the inserted bugs and no
new bugs is a hero. (I reinforce that with bonuses and with specific
mention in writing of this accomplishment during performance reviews.)
Guy Macon, Electronics Engineer & Project Manager for hire.
Remember Doc Brown from the _Back to the Future_ movies? Do you
On the Amiga computer one of the testing packages used 0xDEADBEEF to
fill unused memory. ;-)
It also added guard band areas around allocated memory and then checked
those after the free to be sure you didn't write outside of your
That second idea would work best if you had an OS or at least memory
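A guard band scheme along those lines can be sketched on top of `malloc`, assuming a heap is available. This is not how the Amiga tool worked internally (the thread does not say); the names, band size and fill byte are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HEADER_BYTES 16u   /* holds the block size; keeps alignment */
#define GUARD_BYTES  16u   /* guard band on each side of the block */
#define GUARD_FILL   0xA5u /* arbitrary fill value for the bands */

/* Allocate `size` usable bytes with a guard band before and after. */
void *guarded_alloc(size_t size)
{
    uint8_t *raw = malloc(HEADER_BYTES + GUARD_BYTES + size + GUARD_BYTES);
    if (raw == NULL) {
        return NULL;
    }
    memcpy(raw, &size, sizeof size);
    memset(raw + HEADER_BYTES, GUARD_FILL, GUARD_BYTES);
    memset(raw + HEADER_BYTES + GUARD_BYTES + size, GUARD_FILL, GUARD_BYTES);
    return raw + HEADER_BYTES + GUARD_BYTES;
}

static bool band_intact(const uint8_t *band)
{
    for (size_t i = 0; i < GUARD_BYTES; i++) {
        if (band[i] != GUARD_FILL) {
            return false;
        }
    }
    return true;
}

/* Free a block from guarded_alloc(). Returns false if either guard
 * band was overwritten while the block was in use. */
bool guarded_free(void *user)
{
    uint8_t *p = user;
    uint8_t *raw = p - HEADER_BYTES - GUARD_BYTES;
    size_t size;
    memcpy(&size, raw, sizeof size);
    bool ok = band_intact(raw + HEADER_BYTES) && band_intact(p + size);
    free(raw);
    return ok;
}
```

On a system without a heap, the same check can be applied to fixed guard words placed between stacks and buffers and tested periodically.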
There are some applications where instead of having a watchdog reset
the system when it goes astray you can simply reset the system again
and again with a periodic reset. This can be the output of an
oscillator or even the push of a button (a common way of designing
Guy Macon, Electronics Engineer & Project Manager for hire.
As SelfTest hasn't come back yet to give any more info or comments, I am
looking at his "(other than watch dog)" and wondering if the question is
really "Is my micro still running and going about its normal business?"
Usually the first thing any programmer learns is how to flash an LED.
By adding an LED and resistor to an output pin, you can call a "turn LED
on", and "turn LED off" in a sequence, say flash 4 times on power up
Extending this further, you can test for certain I/O operations taking
place correctly with a set number of flashes.
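A flash-code scheme like that could be sketched as below. The 4 flashes on power up come from the post; the other codes, the timings and the `led_set` / `delay_ms` hooks are assumptions, and here the hooks only record activity so the sketch runs on a host.

```c
#include <stdbool.h>

/* Hypothetical hardware hooks. Replace with real pin and timer
 * access on the target. */
unsigned led_on_count = 0;
bool led_state = false;

void led_set(bool on)
{
    led_state = on;
    if (on) {
        led_on_count++;
    }
}

void delay_ms(unsigned ms)
{
    (void)ms;  /* no real delay in this host-side sketch */
}

/* Flash the LED the given number of times. */
void led_flash(unsigned times)
{
    for (unsigned i = 0; i < times; i++) {
        led_set(true);
        delay_ms(200);
        led_set(false);
        delay_ms(200);
    }
}

/* Diagnostic codes expressed as a number of flashes. */
typedef enum {
    DIAG_IO_OK    = 2,  /* assumed value */
    DIAG_POWER_UP = 4,  /* "flash 4 times on power up" */
    DIAG_IO_FAIL  = 6   /* assumed value */
} diag_code_t;

void report_diag(diag_code_t code)
{
    led_flash((unsigned)code);
    delay_ms(1000);  /* gap so consecutive codes can be told apart */
}
```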
Many companies use 7 segment LEDs on their products, and such things as
"system alive" can mean the 7 segment LED running around in a figure 8.
Power up, self test, and real time diagnostics can be performed from a
simple single LED, right up to multiple computer systems to monitor the
I believe that anybody that designs a useful lump of hardware should
have at least one LED that can be pulsed under program control for this
E-Mail Contact Page: http://www.e-dotcom.com/ecp.php?un=Dontronics
What a pile of bullshit.
There are more reasons for an embedded system to fail than you
can even begin to imagine. Not using watchdogs (in a sensible
way, of course) is totally irresponsible in my opinion.