"Am I still working okay?" asked the microcontroller...

Jack Ganssle wrote a GREAT article on why you should use watchdogs, and why they are so tricky to use properly.
http://www.ganssle.com/watchdogs.htm

--
- Alan Kilian <alank(at)timelogic.com>
Director of Bioinformatics, TimeLogic Corporation 763-449-7622
I had already read most points he talks about in other articles, but this is great nevertheless.
Anyone with a concern for safety and reliability should read this - and then some.

There is a lot of interesting detail about space-craft software and the claim that a WDT could have saved the mission is no more or less true than fixing the original floating point exception that caused it.
The article then gives an example of crashing cooker-hood-fan firmware and assumes the WDT had *not* been used. He cannot know this. If the firmware is poor, then the WDT was likely poorly implemented too.
Here is a quote from the article:-
<start of quote> "Well-designed watchdog timers fire off a lot, daily and quietly saving systems and lives without the esteem offered to other, human, heroes. Perhaps the developers producing such reliable WDTs deserve a parade. Poorly-designed WDTs fire off a lot, too, sometimes saving things, sometimes making them worse." <end of quote>
I disagree. When the WDT fires, it is a disaster that needs fixing and if it goes off "a lot" and especially "quietly" it is a cover-up where the developers *should* be paraded.
Cheers Robin
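
For what it's worth, one way to keep a watchdog from firing "quietly" is to count every watchdog reset and leave a breadcrumb saying where the code last was, so each firing can be investigated rather than covered up. The following is only a minimal sketch under stated assumptions: `reset_log_t`, `reset_log_update` and `reset_log_checkpoint` are made-up names, the record is assumed to sit in RAM that survives a reset (that placement is toolchain-specific and not shown), and how the chip reports a watchdog reset is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOG_MAGIC 0x57445431u   /* arbitrary marker, ASCII "WDT1" */

/* Hypothetical record; assumed to live in memory that is NOT cleared on reset. */
typedef struct {
    uint32_t magic;            /* valid only when equal to LOG_MAGIC */
    uint32_t wdt_resets;       /* watchdog resets seen since power-up */
    uint32_t last_checkpoint;  /* last breadcrumb written before the reset */
} reset_log_t;

/* Call early in startup. was_wdt_reset comes from the chip's reset-cause
 * flags, which are device-specific and not modelled here. */
void reset_log_update(reset_log_t *log, bool was_wdt_reset)
{
    if (log->magic != LOG_MAGIC) {
        /* Cold power-up: the record holds garbage, so start clean. */
        log->magic = LOG_MAGIC;
        log->wdt_resets = 0;
        log->last_checkpoint = 0;
    }
    if (was_wdt_reset) {
        log->wdt_resets++;     /* last_checkpoint is kept for the post-mortem */
    }
}

/* Drop a breadcrumb at interesting points in the main loop or tasks. */
void reset_log_checkpoint(reset_log_t *log, uint32_t id)
{
    log->last_checkpoint = id;
}
```

A nonzero `wdt_resets` read back in the field or during soak testing is then a defect report, not a success story.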
On 27 May 2004 07:04:12 -0700, the renowned snipped-for-privacy@tesco.net ( snipped-for-privacy@tesco.net) wrote:

You don't understand.
Best regards, Spehro Pefhany
--
"it's the network..." "The Journey is the reward"
snipped-for-privacy@interlog.com Info for manufacturers: http://www.trexon.com
" snipped-for-privacy@tesco.net" wrote:

... snip ...

Here is a counter-example. The hardware is operating in a noisy environment. This induces dropped bits, etc. The software can handle most of the data errors, but has a few problems when the IC is altered and it is driven off to executing random data. Time for the three fingered salute, administered by the faithful hound.
--
Chuck F ( snipped-for-privacy@yahoo.com) ( snipped-for-privacy@worldnet.att.net)
Available for consulting/temporary embedded and systems.
wrote:

Let me "requote" some of that, so I can respond to it here:

Putting the discussion of WDT's aside for a moment, I find it inexcusable (engineering-wise) that such a simple application as the cooker-hood-fan would crash or fail (maybe in development, but certainly not in production), whether it's from (a) firmware bug(s) or susceptibility to static discharge. OTOH, I can see where a marketing person might play with it for two minutes (before adequate testing is done), declare to management in the heat of time-to-market pressures "It works, let's ship it" and a bad/untested design goes out the door, perhaps even over the protestations of the person(s) who designed it.

WDT's ARE valuable, but certainly not for the reasoning given above. What it SHOULD have said (IMHO) is:
Well-designed watchdog timers in well-designed systems RARELY if EVER fire off, but like an airbag and seat belts in a car accident, when they do fire off they save systems that would otherwise, perhaps literally as well as figuratively, be "lost in space."

I certainly agree that WDT's should RARELY if ever fire. It helps to have it turned off for general development, but there should be a testing time where it's on (and the timer reset point should of course be carefully thought out as part of the design), and any reset generated should be investigated for its cause (this is where an emulator and logic analyzer are really worth their rental fees) and a correction put into place. I've read and enjoyed some of Jack Ganssle's articles before, but Robin points out very well that Jack misses the mark on this one. Has anyone emailed him about this thread yet?
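
To make "the timer reset point should be carefully thought out" a bit more concrete, here is a rough sketch of one common arrangement: a single place services the watchdog, and only when every task has checked in since the last service. Everything named here is an assumption for illustration: the task IDs, `wdt_task_checkin`, `wdt_service`, and especially `wdt_hw_kick`, which stands in for whatever chip-specific register write a real part needs and here just bumps a counter so the code runs on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task IDs; a real system would list its own tasks. */
enum { TASK_SENSOR = 0, TASK_CONTROL = 1, TASK_COMMS = 2, TASK_COUNT = 3 };

#define ALL_TASKS_MASK ((1u << TASK_COUNT) - 1u)

static uint32_t checkin_flags;  /* one bit per task that has reported in */
static uint32_t hw_kick_count;  /* stand-in for the real WDT service register */

/* Placeholder for the chip-specific "service the watchdog" write. */
static void wdt_hw_kick(void)
{
    hw_kick_count++;
}

/* Each task calls this once per healthy pass through its loop. */
void wdt_task_checkin(unsigned task_id)
{
    if (task_id < TASK_COUNT) {
        checkin_flags |= (1u << task_id);
    }
}

/* The single kick point. Returns true if the watchdog was serviced. */
bool wdt_service(void)
{
    if (checkin_flags != ALL_TASKS_MASK) {
        return false;           /* a task is missing: let the timer expire */
    }
    checkin_flags = 0;          /* every task must report again next period */
    wdt_hw_kick();
    return true;
}
```

The point of the design is that a hung task cannot be masked by a timer interrupt or an idle loop that keeps kicking the dog regardless.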

This is an example where the hardware isn't shielded well enough from the environment, or isn't robust enough or rad-hard enough to operate reliably in the environment. Fix that, then go for long-term testing to see if the WDT ever fires.
Having a WDT reset the hardware doesn't make a system reliable. It is only a protection against rare, worst-case conditions. And I mean TRULY rare conditions, not "rare" as the word is (ab)used on eBay.
Here, I'll frame it for you. Print it, cut it out and paste it on your monitor:
 _________________________________________________________________
/                                                                 \
| Having a WDT reset the hardware doesn't make a system reliable. |
\_________________________________________________________________/
----- http://mindspring.com/~benbradley
Ben Bradley wrote:

... snip ...

I am glad you have unlimited funds to spend on your productions. A few pounds of lead around the system is always welcome, and encourages sales. Some of us believe in engineering the product to fit the desired use.
--
fix (vb.): 1. to paper over, obscure, hide from public view; 2.
to work around, in a way that produces unintended consequences
wrote:

It appears that you think the proper way to design a product is to make a complete product and then start to wonder how to get it through the EMC and other tests, hoping that a ferrite bead here and a bypass capacitor there will solve the problems. Then you spend a lot of time trying, usually with several iterations, to get the device to just pass the test, and still wonder about random lockups and justify the use of the WDT.
EMC design should be part of the whole design cycle. You should design the RF filter return paths and static electricity discharge paths so that they do not go through any sensitive areas, since the tracks will have a significant inductance and thus a high reactance (or even resonate) at high frequencies, or generate quite a high voltage when a high current from a static discharge passes through them. This does not necessarily cost very much as a whole, since it is done in the design phase.
A metallic (or at least conductive) box may also be required, or extra ground planes on the PCB. This may of course cost some extra, but it reduces support cost in the field.
A system designed for good EMC performance should also be quite immune to "unexplained" crashes or lockups and thus reduce the need for WDT.

"Desired use" seems to be to get the product sold, and not to care if the customer has to throw it away as useless. I just wonder whether that customer is going to buy anything else with the same brand name in the future. I am glad that the CE requirements removed at least some of the worst trash from the European market.
Paul
CBFalconer wrote:

Protecting the hardware is not really a costly exercise. Most of the time it involves little more than appropriate filtering of the inputs, maybe a thin metal can over sensitive circuitry, and using metal boxes instead of plastic ones. Look at it as developing boxes within boxes and using appropriate barrier techniques at the barrier boundaries. The total cost can often be less than not doing these simple things.
--
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
wrote:

Lead? You're afraid of cosmic rays? Is not magnetic induction more of a risk?
Robin
On 28 May 2004 08:45:14 -0700, snipped-for-privacy@tesco.net ( snipped-for-privacy@tesco.net) wrote:

Whatever the cause of the problem, a WDT won't fix it, though it may cover it up for a while. I suspect CB was angered that I pointed out a flaw in his counter-example, so he came back with something mean-spirited. I didn't mean my response as a personal attack, but this is Usenet and I can't take responsibility for how others read my posts.

----- http://mindspring.com/~benbradley
Ben Bradley wrote:

Hardly. The particulars do not matter. The point is that, whatever the product, there is a limit to the practical production cost. You need the best bang for the buck. Random external events may require prodigious efforts to block. You, not I, brought up radiation shielding, and I only mentioned a means of blocking such. (To robin: cosmics are only one of a wide range of radiation extant. They are extremely hard to block.)
You need to face reality, in that something is going to fail. When it does, you need a means of avoiding further damage and/or effecting recovery. If you think you can build anything that is failure, damage, and idiot proof you have delusions of grandeur.
--
fix (vb.): 1. to paper over, obscure, hide from public view; 2.
to work around, in a way that produces unintended consequences

...or you are in management. "Our company policy states that all of our products are failure, damage, and idiot proof."
--
Guy Macon, Electronics Engineer & Project Manager for hire.
Remember Doc Brown from the _Back to the Future_ movies? Do you
Guy Macon <http://www.guymacon.com> wrote:

As colleague DW said, " ... idiot proof. It proves we're idiots." He was kidding, of course.
Regards. Mel.

Well, you have me there, I can only think of four (ignoring <hardware failure>):-
<firmware bug> <spontaneous alpha particle emission> <brown-out> <lightning strike>
Cheers Robin
On 27 May 2004 01:59:00 -0700, snipped-for-privacy@tesco.net ( snipped-for-privacy@tesco.net) wrote:

The causes could be numerous - static discharge (not just the effects of lightning strikes), radio interference, other forms of radiation, electrical shorts due to fluid spillage, inappropriate scope of device usage (I don't consider it a software bug here) --- all these faults could leave the device in a state where the software can't run.
The reason that it is used in the medical field is that it provides a cost-effective mitigation for many ailments. Designing equipment to operate in a room full of X-Ray, MRI, etc equipment - some dating back a few decades, can be a very daunting exercise. Of course there is a minimum standard EMC requirement that medical equipment conform to.
Also I disagree with the notion that using a watchdog "advertises" some deficiency of the device (paraphrasing here). For me its use does suggest that the developers have applied due diligence and have used it as a mitigation against faults which they've arrived at through some analysis.
Ken.

+====================================+ I hate junk email. Please direct any genuine email to: kenlee at hotpop.com
: Well, you have me there, I can only think of four (ignoring <hardware failure>):-
I would think hardware failure is a good enough reason in and of itself, and in fact that is the usual reason I thought watchdogs were for.
If your code PROM/EPROM/EEPROM/flash fails and the MCU starts executing random memory as code, you want to make sure your motors, pumps, X-ray tube, etc. shut down.
--
==========================================================
Chris Candreva -- snipped-for-privacy@westnet.com -- (914) 967-7816
On Fri, 28 May 2004 15:45:20 GMT, "Christopher X. Candreva"

If it appears that the hardware is falling apart, how could you trust that it makes any sensible decisions? Of course, if each output individually falls into a fail-safe state when not refreshed by the processor, then it makes sense to halt the processor immediately if something suspicious happens. Trying to do something after a watchdog reset will usually just worsen the situation if the hardware is suspect.

In any really safety critical system, you should use double or triple (voting) redundant system, not watchdogs.
Paul
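
As a rough illustration of what "triple (voting) redundant" means at the bit level, here is a minimal 2-of-3 majority voter in C. It is a sketch only: `vote3` and `channels_agree` are made-up names, the three inputs are assumed to be the same quantity computed or read by three independent channels, and a real voting system also has to worry about the voter itself being a single point of failure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bitwise 2-of-3 majority: each output bit takes the value that at
 * least two of the three input channels hold for that bit. */
uint8_t vote3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

/* True when all three channels agree. A disagreement is masked by
 * vote3() but should still be reported, since one channel has failed. */
bool channels_agree(uint8_t a, uint8_t b, uint8_t c)
{
    return (a == b) && (b == c);
}
```

For example, `vote3(0x0F, 0x0F, 0xF0)` yields `0x0F`: the one corrupted channel is outvoted, and `channels_agree` returning false is the cue to flag the fault.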
: If it appears that the hardware is falling apart, how could you trust
: that it makes any sensible decisions ? Of course, if each output
You've changed the situation -- 'the hardware is falling apart' is hardly the same as a single hardware failure.
Generally, an MCU on reset sets the outputs to a known value -- all 0 or all 1. If you design fail-safe, then a hardware reset, in the face of some failing hardware, will at least make sure everything is off.
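
A small sketch of that fail-safe idea in C, under loudly stated assumptions: the output latch is simulated by a plain variable rather than a real port register, the drivers are assumed active-high so that all-zero means "everything off", and the names `outputs_force_safe`, `outputs_enable`, `outputs_read` and the `OUT_*` bits are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated output latch; on real hardware this would be a port register. */
static volatile uint8_t output_port;

#define OUT_MOTOR   (1u << 0)
#define OUT_PUMP    (1u << 1)
#define OUT_HEATER  (1u << 2)
#define OUT_ALL     (OUT_MOTOR | OUT_PUMP | OUT_HEATER)
#define SAFE_STATE  0x00u   /* assumes active-high drivers: zero = all off */

/* Call first thing after any reset, watchdog or otherwise. */
void outputs_force_safe(void)
{
    output_port = SAFE_STATE;
}

/* Energise outputs only if the start-up self-test passed;
 * otherwise stay in (or return to) the safe state. */
bool outputs_enable(uint8_t requested, bool self_test_ok)
{
    if (!self_test_ok) {
        output_port = SAFE_STATE;
        return false;
    }
    output_port = (uint8_t)(requested & OUT_ALL);
    return true;
}

/* Read back the current latch value. */
uint8_t outputs_read(void)
{
    return output_port;
}
```

The firmware half only helps if the hardware half cooperates: the safe level has to be the one the pins take on their own during and after reset.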
: In any really safety critical system, you should use double or triple
: (voting) redundant system, not watchdogs.
There is a WHOLE class of problems for which that is complete overkill. Take an arcade game, or vending machine, or any machine that is going to take physical punishment and need regular maintenance.
People are going to beat on a soda machine. Do you want to put triple-redundancy memory on that, or just design it such that when it breaks it just sits there resetting itself, so no one can get free soda?
Arcade games use watchdogs because there is a very small window where they will make money. (Or used to, when it was dedicated hardware; now it's largely PC-level hardware, but I digress.) Competition means getting the thing out the door relatively quickly, and cheap enough to sell.
You want to get every bug, but if you wait too long, you'll be into the next generation. The watchdog means that if there IS a bug, the machine will just reset and keep earning money, instead of not earning money until an op gets to it.
Fail-safe means that WHEN the thing fails, you try your best to make sure it's in a 'safe' condition.
--
==========================================================
Chris Candreva -- snipped-for-privacy@westnet.com -- (914) 967-7816
Christopher X. Candreva wrote:

Which brings up Robin's original point about "dodgy code". Like it or not, code defects will occasionally make their way into any non-trivial project produced in the real world. In the face of difficult deadlines, compromises will occasionally get made, people may screw up, QA may fall down on the job.
Anyone who claims NEVER, EVER to have unwittingly released "dodgy code", or to have been part of a team that did so, is either:
1) lying
2) never had to code under pressure (time and cost constraints)
3) lying -- to themselves
4) not been coding for very long, or never on a project with much complexity
As another poster put it, watchdogs are one facet of an entire process of due diligence, which should also encompass code reviews, sane coding and design techniques, thorough QA, etc. In general, not implementing watchdogs where it might make sense to do so is, frankly, foolish.
--
(Replies: cleanse my address of the Mark of the Beast!)

Teleoperate a roving mobile robot from the web:
