"Am I still working okay?" asked the micro controller...

Jack Ganssle wrote a GREAT article on why you should use watchdogs, and why they are so tricky to use properly.


Reply to
Alan Kilian

I had already read most points he talks about in other articles, but this is great nevertheless.

Anyone with a concern for safety and reliability should read this - and then some.

Reply to
Guillaume

Well, you have me there, I can only think of four (ignoring ):-

Cheers Robin

Reply to
robin.pain

There is a lot of interesting detail about space-craft software and the claim that a WDT could have saved the mission is no more or less true than fixing the original floating point exception that caused it.

The article then gives an example of crashing cooker-hood-fan firmware and assumes the WDT had *not* been used. He cannot know this. If the firmware is poor, then the WDT was likely poorly implemented too.

Here is a quote from the article:-

"Well-designed watchdog timers fire off a lot, daily and quietly saving systems and lives without the esteem offered to other, human, heroes. Perhaps the developers producing such reliable WDTs deserve a parade. Poorly-designed WDTs fire off a lot, too,sometimes saving things, sometimes making them worse."

I disagree. When the WDT fires, it is a disaster that needs fixing and if it goes off "a lot" and especially "quietly" it is a cover-up where the developers *should* be paraded.

Cheers Robin

Reply to
robin.pain

You don't understand.

Best regards, Spehro Pefhany

Reply to
Spehro Pefhany

... snip ...

Here is a counter-example. The hardware is operating in a noisy environment. This induces dropped bits, etc. The software can handle most of the data errors, but has a few problems when the IC is altered and it is driven off to executing random data. Time for the three fingered salute, administered by the faithful hound.

Reply to
CBFalconer

Let me "requote" some of that, so I can respond to it here:

Putting the discussion of WDT's aside for a moment, I find it inexcusable (engineering-wise) that such a simple application as the cooker-hood-fan would crash or fail (maybe in development, but certainly not in production), whether it's from (a) firmware bug(s) or susceptibility to static discharge. OTOH, I can see where a marketing person might play with it for two minutes (before adequate testing is done), declare to management in the heat of time-to-market pressures "It works, let's ship it" and a bad/untested design goes out the door, perhaps even over the protestations of the person(s) who designed it.

WDT's ARE valuable, but certainly not for the reasoning given above. What it SHOULD have said (IMHO) is:

Well-designed watchdog timers in well-designed systems RARELY if EVER fire off, but like an airbag and seat belts in a car accident, when they do fire off they save systems that would otherwise, perhaps literally as well as figuratively, be "lost in space."

I certainly agree that WDT's should RARELY if ever fire. It helps to have it turned off for general development, but there should be a testing time where it's on (and the timer reset point should of course be carefully thought out as part of the design), and any reset generated should be investigated for its cause (this is where an emulator and logic analyzer are really worth their rental fees) and a correction put into place. I've read and enjoyed some of Jack Ganssle's articles before, but Robin points out very well that Jack misses the mark on this one. Has anyone emailed him about this thread yet?
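
One way to make sure no watchdog reset goes uninvestigated during testing is to record the reset cause at every boot. A minimal sketch in C, assuming a hypothetical reset-status register: the `RESET_FLAG_*` bit layout and all function names here are invented for illustration, real parts arrange these bits differently, and on real hardware the counter would have to live in memory that survives a reset.

```c
#include <stdint.h>

/* Hypothetical reset-status bits; real MCUs use their own layout. */
#define RESET_FLAG_POWER_ON  (1u << 0)
#define RESET_FLAG_EXTERNAL  (1u << 1)
#define RESET_FLAG_WATCHDOG  (1u << 2)

typedef enum {
    RESET_CAUSE_UNKNOWN,
    RESET_CAUSE_POWER_ON,
    RESET_CAUSE_EXTERNAL,
    RESET_CAUSE_WATCHDOG
} reset_cause_t;

/* Assumption: on a real target this sits in no-init RAM or non-volatile storage. */
static uint32_t wdt_reset_count;

/* Decode the (assumed) status flags. The watchdog bit is checked first
 * so that it is never masked by other bits set at the same time. */
reset_cause_t classify_reset(uint32_t flags)
{
    if (flags & RESET_FLAG_WATCHDOG) return RESET_CAUSE_WATCHDOG;
    if (flags & RESET_FLAG_POWER_ON) return RESET_CAUSE_POWER_ON;
    if (flags & RESET_FLAG_EXTERNAL) return RESET_CAUSE_EXTERNAL;
    return RESET_CAUSE_UNKNOWN;
}

/* Call once at boot with the value read from the status register. */
reset_cause_t record_reset(uint32_t flags)
{
    reset_cause_t cause = classify_reset(flags);
    if (cause == RESET_CAUSE_WATCHDOG) {
        wdt_reset_count++;   /* every one of these deserves an investigation */
    }
    return cause;
}

uint32_t wdt_resets_seen(void)
{
    return wdt_reset_count;
}
```

During a long soak test, a non-zero count from `wdt_resets_seen()` is the cue to get out the emulator and logic analyzer.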

This is an example where the hardware isn't shielded well enough from the environment, or isn't robust enough or rad-hard enough to operate reliably in the environment. Fix that, then go for long-term testing to see if the WDT ever fires.

Having a WDT reset the hardware doesn't make a system reliable. It is only a protection against rare, worst-case conditions. And I mean TRULY rare conditions, not "rare" as the word is (ab)used on eBay.

Here, I'll frame it for you. Print it, cut it out and paste it on your monitor:

 _________________________________________________________________
/                                                                 \
| Having a WDT reset the hardware doesn't make a system reliable. |
\_________________________________________________________________/


Reply to
Ben Bradley

The causes could be numerous - static discharge (not just the effects of lightning strikes), radio interference, other forms of radiation, electrical shortages due to fluid spillage, inappropriate scope of device usage (I don't consider it a software bug here) --- all these faults could leave the device in a state where the software can't run.

The reason that it is used in the medical field is that it provides a cost-effective mitigation for many ailments. Designing equipment to operate in a room full of X-Ray, MRI, etc equipment - some dating back a few decades, can be a very daunting exercise. Of course there is a minimum standard EMC requirement that medical equipment conform to.

Also I disagree with the notion that using a watchdog "advertises" some deficiency of the device (paraphrasing here). For me its use does suggest that the developers have applied due diligence and have used it as a mitigation against faults which they've arrived at through some analysis.

Ken.

+====================================+ I hate junk email. Please direct any genuine email to: kenlee at hotpop.com
Reply to
Ken Lee

... snip ...

I am glad you have unlimited funds to spend on your productions. A few pounds of lead around the system is always welcome, and encourages sales. Some of us believe in engineering the product to fit the desired use.

Reply to
CBFalconer

It appears that you are thinking that the proper way to design a product is to make a complete product and then start to wonder how to get it through the EMC and other tests, hoping that a ferrite bead here and a bypass capacitor there will solve the problems. Then you spend a lot of time trying, usually with several iterations, to get the device to just pass the test, and still wonder about random lockups and justify the use of the WDT.

EMC design should be part of the whole design cycle. You should design the RF filter return paths and static electricity discharge paths so that they do not go through any sensitive areas, since the tracks will have a significant inductance and thus have a high reactance (or even resonate) at high frequencies, or generate quite a high voltage when a high current from a static discharge passes through them. This does not necessarily cost very much as a whole, since it is done in the design phase.

A metallic (or at least conductive) box may also be required, or extra ground planes on the PCB. This of course may cost some extra, but it reduces support cost in the field.

A system designed for good EMC performance should also be quite immune to "unexplained" crashes or lockups and thus reduce the need for WDT.

"Desired use" seems to be get the product sold, but not care, if the customer has to throw it away as useless. Just wondering, if the customer is going to buy anything else with the same brand name in the future. I am glad that the CE requirements removed at least some the worst trash from the European market.

Paul

Reply to
Paul Keinanen

Protecting the hardware is not really a costly exercise. Most of the time it involves little more than appropriate filtering of the inputs, maybe a thin metal can over sensitive circuitry, using metal boxes instead of plastic ones. Look at it as developing boxes within boxes and using appropriate barrier techniques at the barrier boundaries. The total cost can often be less than not doing these simple things.

Reply to
Paul E. Bennett

Lead? You're afraid of cosmic rays? Is not magnetic induction more of a risk?

Robin

Reply to
robin.pain

: Well, you have me there, I can only think of four (ignoring ):-

I would think hardware failure is a good enough reason in and of itself, and in fact that is the usual reason I thought watchdogs were for.

If your code PROM/EPROM/EEPROM/flash fails and the mcu starts executing random memory as code, you want to make sure your motors, pumps, X-ray tube, etc shut down.

Reply to
Christopher X. Candreva

Whatever the cause of the problem, a WDT won't fix it, though it may cover it up for a while. I suspect CB was angered that I pointed out a flaw in his counter-example, so he came back with something mean-spirited. I didn't mean my response as a personal attack, but this is Usenet and I can't take responsibility for how others read my posts.


Reply to
Ben Bradley

If it appears that the hardware is falling apart, how could you trust that it makes any sensible decisions? Of course, if each output individually falls into a fail-safe state when it is not refreshed by the processor, then it makes sense to halt the processor immediately if something suspicious happens. Trying to do something after a watchdog reset will usually just worsen the situation if the hardware is suspect.
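
The idea of an output that falls back to its safe state unless the processor keeps refreshing it can be modelled roughly as follows. This is only a software sketch in C with hypothetical names (`refreshed_output_t`, `output_refresh`, `output_tick`); in a real design the timeout would normally be implemented in hardware that does not depend on the suspect processor.

```c
#include <stdbool.h>
#include <stdint.h>

/* One output that must be refreshed periodically to stay on. */
typedef struct {
    bool     on;          /* current drive state; false is the safe state */
    uint32_t ticks_left;  /* ticks remaining before the output drops */
    uint32_t timeout;     /* refresh window in ticks (assumed value) */
} refreshed_output_t;

void output_init(refreshed_output_t *out, uint32_t timeout_ticks)
{
    out->on = false;              /* start in the safe state */
    out->ticks_left = 0;
    out->timeout = timeout_ticks;
}

/* Called by the processor while it believes all is well. */
void output_refresh(refreshed_output_t *out, bool requested_state)
{
    out->on = requested_state;
    out->ticks_left = out->timeout;
}

/* Called from an independent time base. Once refreshes stop,
 * the window runs out and the output drops to the safe state. */
void output_tick(refreshed_output_t *out)
{
    if (out->ticks_left > 0) {
        out->ticks_left--;
    }
    if (out->ticks_left == 0) {
        out->on = false;
    }
}
```

With this arrangement, halting the processor is enough to shut everything down: no code has to run correctly after the fault for the outputs to go safe.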

In any really safety critical system, you should use double or triple (voting) redundant system, not watchdogs.
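
For readers unfamiliar with the voting arrangement, the core of a triple-redundant scheme is a 2-of-3 majority vote over the three channels. A minimal sketch in C, with hypothetical function names; a real voter would usually be independent hardware and would also have to deal with channel timing and common-mode faults, which this ignores.

```c
#include <stdint.h>

/* Bitwise 2-of-3 majority: each result bit takes the value
 * held by at least two of the three inputs. */
uint8_t vote3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

/* Returns nonzero when the three channels are not unanimous,
 * so that a disagreement can be reported even though it was outvoted. */
int vote3_disagree(uint8_t a, uint8_t b, uint8_t c)
{
    return (a != b) || (b != c);
}
```

Reporting the disagreement matters as much as masking it: a channel that is silently outvoted is a fault waiting for a second one.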

Paul

Reply to
Paul Keinanen

Hardly. The particulars do not matter. The point is that, whatever the product, there is a limit to the practical production cost. You need the best bang for the buck. Random external events may require prodigious efforts to block. You, not I, brought up radiation shielding, and I only mentioned a means of blocking such. (To robin: cosmics are only one of a wide range of radiation extant. They are extremely hard to block.)

You need to face reality, in that something is going to fail. When it does, you need a means of avoiding further damage and/or effecting recovery. If you think you can build anything that is failure, damage, and idiot proof you have delusions of grandeur.

Reply to
CBFalconer

: If it appears that the hardware is falling apart, how could you trust
: that it makes any sensible decisions ? Of course, if each output

You've changed the situation -- 'the hardware is falling apart' is hardly the same as a single hardware failure.

Generally, an MCU on reset sets the outputs to a known value -- all 0 or all 1. If you design fail-safe, then a hardware reset, in the face of some failing hardware, will at least make sure everything is off.

: In any really safety critical system, you should use double or triple
: (voting) redundant system, not watchdogs.

There is a WHOLE class of problems for which that is complete overkill. Take an arcade game, or vending machine, or any machine that is going to take physical punishment and need regular maintenance.

People are going to beat on a soda machine. Do you want to put triple-redundancy memory on that, or just design it such that when it breaks it just sits there resetting itself, so no one can get free soda?

Arcade games use watchdogs because there is a very small window where they will make money. (Or used, when it was dedicated hardware, now it's largely PC level hardware, but I digress) Competition means getting the thing out the door relatively quickly, and cheap enough to sell.

You want to get every bug, but if you wait too long, you'll be into the next generation. The watchdog means that if there IS a bug, the machine will just reset and keep earning money, instead of not earning money until an op gets to it.

Fail-safe means that WHEN the thing fails, you try your best to make sure it's in a 'safe' condition.

Reply to
Christopher X. Candreva

Which brings up Robin's original point about "dodgy code". Like it or not, code defects will occasionally make their way into any non-trivial project produced in the real world. In the face of difficult deadlines, compromises will occasionally get made, people may screw up, QA may fall down on the job.

Anyone who claims NEVER, EVER to have unwittingly released "dodgy code", or to have been part of a team that did so is either:

1) lying
2) never had to code under pressure (time and cost constraints)
3) lying -- to themselves
4) not been coding for very long, or never on a project with much complexity

As another poster put it, watchdogs are one facet of an entire process of due diligence, which should also encompass code reviews, sane coding and design techniques, thorough QA, etc. In general, not implementing watchdogs where it might make sense to do so is, frankly, foolish.

Reply to
The Artist Formerly Known as K

...or you are in management. "Our company policy states that all of our products are failure, damage, and idiot proof."

Reply to
Guy Macon

Double or triple redundancy is not always the answer for Safety Critical Systems. Sometimes just a different logical processor (or even a relay based interlocking scheme) will provide the protection. Sometimes you have to even consider fully mechanical interlocking as part of the system. Whatever mitigation scheme you need to use should be based on the risk assessment arising from a fully discovered HAZOP study.

Having watched over a lot of the responses, I am in the camp that is aimed at getting the code as correct as you possibly can before you begin to worry about turning the watchdog on. However, I also use a separate Pulse Maintained Relay circuit that has to be kept energised by a correctly responding system. This relay automatically signals unhealthy if it de-energises due to a system failing to kick it properly or by a failure in its own circuitry (see my Reading and Writing the World articles on my website).
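
A rough illustration of what "kept energised by a correctly responding system" can mean in firmware: the kick line is toggled only when every monitored task has reported in since the last kick, so a single hung task is enough to stop the pulses. This is a sketch in C under stated assumptions, not the circuit or code described in the post; `TASK_COUNT`, the task numbering, and all function names are hypothetical, and `kick_level` merely stands in for a real output pin.

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_COUNT 3u                          /* assumed number of monitored tasks */
#define ALL_TASKS  ((1u << TASK_COUNT) - 1u)

static uint32_t checked_in;   /* one bit per task that has reported healthy */
static bool     kick_level;   /* stand-in for the pin driving the relay circuit */
static uint32_t kick_edges;   /* counts toggles, for illustration only */

/* Each task calls this from its own loop once it has done useful work. */
void task_check_in(uint32_t task_id)
{
    if (task_id < TASK_COUNT) {
        checked_in |= (1u << task_id);
    }
}

/* Toggle the kick line only when every task has reported since the
 * last kick. Returns true if a pulse edge was produced. */
bool health_service(void)
{
    if (checked_in != ALL_TASKS) {
        return false;         /* no edge: the relay drops out if this persists */
    }
    checked_in = 0;           /* every task must report again before the next kick */
    kick_level = !kick_level;
    kick_edges++;
    return true;
}

uint32_t health_kick_edges(void)
{
    return kick_edges;
}
```

Because the relay needs edges rather than a steady level, a processor stuck with the pin high or low fails to the unhealthy state in the same way as one that has stopped altogether.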

Reply to
Paul E. Bennett
