CAN bus reply problems

Hi folks!

We are developing a system that uses a CAN bus as the network connecting several nodes. A PC requests data (the node status) from the nodes, which have to answer the request immediately. To ask a node for its status, we send a "remote frame" message with a specific ID onto the CAN bus. The relevant node has to answer with the relevant data in a "data frame" message. Each node sits in a while loop, reading a buffer and sending back data when necessary.

Usually everything goes well, but sometimes one of the nodes does not answer the PC's request, even though the request was sent on the bus (it is seen by another node, and it can be seen on an oscilloscope connected to the CAN bus lines). It seems the node does not see the message: it misses the interrupt that updates the buffer...

We usually send a sequence of "remote frame" messages, waiting for the answer each time: send, wait for answer, send, wait, ... Even if we insert a sleep between one send and the next, a node sometimes misses a message... We changed the baud rate (from 500 kbit/s to 20 kbit/s), but that did not solve the problem. We are using a T89C51CC03 microcontroller from Atmel.

Have you ever experienced this problem? Any suggestion?

Thank you in advance for any help!

Cheers, Ska

Reply to
Ska

1: This is either a problem with your microprocessor or with your code. 2: I have no experience with Atmel & CAN. 2a: The TMS320F2812 has been rock solid for me. 3: No protocol should trust external nodes 100% to receive something -- you should always have a timeout & retry mechanism.
Reply to
Tim Wescott

I cannot answer your specific question; in other words, I don't know which part of your software or hardware is responsible. It could be the driver, a misconfiguration of the CAN controllers, or the cabling. But you should consider switching your node monitoring from the master/slave principle you are using now to something else. Your current implementation looks exactly like the _old_ CANopen Node Guarding mechanism. CANopen switched to Heartbeat years ago, where each node is an autonomous Heartbeat producer and can be monitored by every node that wishes to do so. The benefits are more flexibility and reduced bandwidth for node monitoring. It can still happen that one of the Heartbeat consumers misses a Heartbeat from one of the producers. In that case, increase the rate or accept that one or more heartbeats go missing.
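The consumer side of such a heartbeat scheme could look roughly like the sketch below. It is not CANopen's actual object model; the class name `HeartbeatMonitor`, the `allowed_misses` parameter and the idea of passing timestamps in by hand are assumptions made only for illustration. It shows the "accept that one or more heartbeats go missing" option: a producer is reported lost only after several periods of silence.

```python
class HeartbeatMonitor:
    """Hypothetical heartbeat consumer that tolerates a few missed beats."""

    def __init__(self, period_s, allowed_misses=2):
        # A node counts as lost once it has been silent for longer than
        # (allowed_misses + 1) heartbeat periods.
        self.deadline_s = period_s * (allowed_misses + 1)
        self.last_seen = {}  # node id -> time of the last heartbeat

    def on_heartbeat(self, node_id, now):
        """Record a heartbeat from node_id received at time now (seconds)."""
        self.last_seen[node_id] = now

    def lost_nodes(self, now):
        """Return the ids of all known nodes that have been silent too long."""
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.deadline_s
        )
```

Any node on the bus can run its own monitor like this, which is where the extra flexibility compared with master-driven polling comes from.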

Regards Heinz

Reply to
Heinz-Jürg

If it does not respond within a set timeout, then simply use the last read info, go on to the next one, and come back to the problem one when it's its turn again. Each time you get no response you should increment an error byte, and decrement the byte when it's OK. If the error byte reaches, let's say, 5, then you can assume that the node may have a serious problem.

It's very possible that a CAN node gets into a critical state where it can't answer the request at that moment.
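A minimal sketch of that polling scheme on the PC side, assuming a hypothetical `request(node_id)` callback that sends the remote frame and returns the reply, or `None` once the timeout has expired. The names `NodePoller` and `FAULT_THRESHOLD` are made up for the example; the threshold of 5 is the value suggested above.

```python
FAULT_THRESHOLD = 5  # "let's say 5", as suggested in the post above


class NodePoller:
    """Keeps the last known status and an error counter for each node."""

    def __init__(self, node_ids, request):
        # request(node_id) is a hypothetical callback: it returns the reply
        # bytes, or None if the node did not answer within the timeout.
        self.request = request
        self.last_status = {node_id: None for node_id in node_ids}
        self.errors = {node_id: 0 for node_id in node_ids}

    def poll_once(self):
        """Poll every node once; return the ids that look seriously faulty."""
        for node_id in self.last_status:
            reply = self.request(node_id)
            if reply is None:
                # No answer in time: keep the old status and count the miss.
                self.errors[node_id] = min(self.errors[node_id] + 1, FAULT_THRESHOLD)
            else:
                # Good answer: store it and let the counter recover by one.
                self.last_status[node_id] = reply
                self.errors[node_id] = max(self.errors[node_id] - 1, 0)
        return [n for n, count in self.errors.items() if count >= FAULT_THRESHOLD]
```

Because the counter goes down again on every good answer, an occasional lost reply never reaches the threshold, while a node that stays silent does so after five consecutive polls.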

Reply to
Jamie

Hello Tim, hello Heinz, hello everybody

Thank you for your mails.

What you are both saying is that "No protocol should trust external nodes 100% to receive something -- you should always have a timeout & retry mechanism"! This is exactly what we are doing now, but it is something I don't like so much... :( We set a maximum number of retries (say 10), and it sometimes happens that the attempts exceed this threshold! In this case we reset and restart the CAN bus, but, as I said, it is something we don't like so much...

...mmm...

Regards, Ska

Reply to
Ska

[Massive quote without actual referral snipped. Please don't do that.]

What you're observing appears to be a rate of failure to receive CAN messages that is quite a lot beyond expectations of the protocol, unless you were operating in a pathologically noisy environment --- but you didn't mention anything like that.

What this hints at is a genuine bug in the receiving end, but I'm afraid you didn't reveal enough of its details for anybody out here to be able to remote-diagnose it more precisely. So I'll just bombard you with some questions:

Did you test this with only two nodes on the bus, and check if the receiving one ACKs the transmission?

What *is* the rate of failure, anyway, i.e. one in how many messages gets lost? What is the rate of transmissions with CRC or other failures, on the same network?

Do you have any way of debugging into the receiving CAN controller's register banks after a failed reception, to distinguish whether the message actually failed to arrive in the message box, or just failed to raise the IRQ it's configured to raise? (There's a bug like that in another 8051 derivative with integrated CAN...)

Do you have a storage scope that would let you record the exact signalling up to the point of failure, so you could go look for any differences between successful and failing transmissions, on physical level?

Reply to
Hans-Bernhard Broeker

Actually, this is not necessary for CAN. The beginning of the frame contains a node ID that possible recipients filter through their match/accept registers. Active receivers calculate the CRC as the frame bytes clock in and then compare it to the CRC at the end of the frame. If they match, the accepting receiver drives the bus active (low) for one bit in a designated trailing window. This lets the master, or sender of the frame, know that someone received it.

Use your scope to look at the bus for this ACK bit. If you see it, but the receiver doesn't process the frame, you've missed the interrupt. If you don't see the ACK bit, then the receiver didn't match the node ID or the CRC, or it's in Bus Off mode for error containment.

Also be sure you have both ends properly terminated; I've seen wild behavior on DeviceNET packets at 125, 250 and 500 kb/s.

Dan

Reply to
Dan Danknick

"Node ID" is only meaningful for some higher-level protocols, such as CANopen; it does not make any sense in simple CAN bus systems, which rely entirely on message identifiers.

Unless the receiver is in the "bus off" or "error passive" mode, _all_ receivers should monitor the bus and signal ACK or error frame accordingly.

> accepting receiver drives the bus

The ACK bit is sent by _any_ active (also "nonaddressed") device. Also if _any_ receiver detects a CRC or other error, it will send the error flag, which mutilates the message and no device will accept it.

This is usable only with two devices (sender and receiver) on the bus. With more than two devices, someone else will acknowledge it. Instead of using an oscilloscope, you should also be able to tell from the transmitter status registers whether someone ACKed the transmitted frame.

Or you have configured the mask registers incorrectly.

The identifier match should not affect the appearance of the ACK.

It should be possible to determine from the _transmitter_ status registers, if the frame was ACKed or an error flag generated by the receiving device.

Paul

Reply to
Paul Keinanen

Wrong.

[... CAN ACK mechanism...]

No. It only lets the sender know that someone *could* have received it, if he had been interested in it. The crux being that ACK is flagged even by nodes who won't actually do anything with this message, because it wasn't meant for them.

... or the ID didn't match the mask set in the receiver.

Reply to
Hans-Bernhard Broeker

In article , Ska writes

Having read a number of articles and threads recently on this subject, it seems to me that despite CAN's excellent hardware based acknowledgement & retry system, the above statement is probably true once you have three or more processors on the bus and certain types of message being sent.

Consider a system with processors P1, P2 and P3.

P1 wants to send a message to P2. The message is not one of the often quoted "nice" CAN bus examples whereby P1 is constantly spewing out repeated readings of a sensor so that P2 or anyone else may "consume" them; the loss of a message in this scenario isn't so important as the next reading will usually suffice. Instead, the message is an instruction for P2 to perform something, such as turn an I/O line on, or write some data to an LCD, and it is therefore 100% essential that P2 receives this message or the product fails.

So, P1 sends the message, and gets the hardware ACK. But the ACK came from P3, who isn't interested in consuming the message.

From what I understand, although P2 "should" generate an error to destroy the ACK if it detects an error, there are a number of circumstances where it may not and P2 may "lose" a message.

  1. A software bug in P2.
  2. A receive overflow in P2.
  3. Errata in the P2 CAN controller.
  4. P2 has gone error-passive or bus-off.
  5. Are there any other reasons?

Admittedly, (1) should be fixed and would be a problem even in a two-node system, but (2) may be unavoidable on certain smaller CAN controllers with limited FIFOs, (3) is unavoidable unless you change to another processor/CAN device, and (4) is actually designed to happen. I'd truly like to know if there is a (5).

So, it would seem in this situation that despite the hardware based ACK system present in the CAN controllers, you must still produce a high level protocol which provides a software based mechanism for acknowledge, timeout and retry.
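One possible shape for such a software acknowledge, timeout and retry layer on the sending side is sketched below. The callbacks `send(seq, payload)` and `wait_ack(timeout_s)` are hypothetical wrappers around whatever CAN driver is in use, and the sequence number, timeout and retry count are assumptions rather than part of any standard protocol.

```python
from typing import Callable, Optional


def send_reliable(
    send: Callable[[int, bytes], None],
    wait_ack: Callable[[float], Optional[int]],
    seq: int,
    payload: bytes,
    timeout_s: float = 0.05,
    max_retries: int = 3,
) -> bool:
    """Send one command and wait for an application-level acknowledge.

    wait_ack returns the sequence number acknowledged by the consumer,
    or None if nothing arrived within timeout_s.
    """
    for _ in range(1 + max_retries):
        send(seq, payload)
        if wait_ack(timeout_s) == seq:
            return True  # the intended consumer confirmed this command
    return False  # retries exhausted; the caller has to escalate
```

The acknowledge here is an ordinary data frame sent by P2 itself, so unlike the hardware ACK bit it cannot be supplied by an uninterested P3.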

Such lost messages may only be one in a billion, but if my product sends a billion or more messages per week and it doesn't include a high-level acknowledge, timeout and retry mechanism, then I'll have a product MTBF of a week or less which is totally unacceptable.

I'd be interested in the opinion of others here. I'm in the process of firmware development on my first CAN based system and only have one of the nodes up and running in loopback mode for now so I can't assess reliability on a three-or-more-node system. But based on the fact that the possibility of a message going missing isn't completely zero, I'm taking the view that I must implement the additional high-level ACK mechanism. The general view I sense from reading CAN articles is that although CAN's error mechanism is extremely robust, it's not 100%, and stuff does occasionally go missing.

Reply to
Stephen

You understand that incorrectly. No CAN node can possibly "destroy" an ACK being flagged by some other node. And "generating an error" (by which I assume you mean "sending an error frame") for reasons not already diagnosed by the CAN protocol itself would be a layer model violation. Application layer errors have no business generating transport/link layer errors. That's also the reason why CAN controllers typically don't support sending error frames on purpose: if an error frame needs to be sent, the controller will do that all by itself.

Reply to
Hans-Bernhard Broeker

One thing to watch for that hasn't been pointed out is that a CAN node may receive multiple valid copies of the same message. This has two consequences. The first is that toggling a state based on message receipt is a bad idea. The second is that any acknowledge/retry scheme has to be able to recognize and discard duplicates if necessary.
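Both consequences can be handled on the receiving side along the lines of this sketch. It assumes, purely for illustration, that each command carries a sender id and a sequence number in its payload; `CommandReceiver` and its fields are made-up names.

```python
class CommandReceiver:
    """Hypothetical consumer that tolerates duplicated command frames."""

    def __init__(self):
        self.last_seq = {}  # sender id -> last sequence number handled
        self.output_on = False

    def handle(self, sender, seq, turn_on):
        """Apply a 'set output' command; return True only if it was new."""
        if self.last_seq.get(sender) == seq:
            # Same sequence number again: a duplicate, so do not act twice.
            return False
        self.last_seq[sender] = seq
        # The command carries the absolute state ("on" or "off") instead of
        # "toggle", so a duplicate that slipped through would be harmless.
        self.output_on = turn_on
        return True
```

If an application-level acknowledge is in use, the receiver should normally still acknowledge a discarded duplicate, otherwise the sender keeps retrying.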

Robert

Reply to
R Adsett

Apart from some strange networks with multiple store-and-forward repeaters, it is hard to imagine how such situations could happen.

Basically this would require that the transmitter has recognised an error (missing ACK or error frame) and thus resends the message. However, your node did not detect that something was wrong and accepted the message at the first time.

A properly working receiver should check the CRC, the ACK fields _and_ check that at least six recessive bits are received in the End Of Frame field.

If your receiver is satisfied as soon as the frame you are interested in has passed the CRC check, and immediately accepts the message without checking the ACK and EOF fields, you are going to get duplicates if the transmitter works according to the standard. Another node may have generated the error frame, which the transmitter detects and answers with a retransmission, but your receiver is content with the first copy.

Paul

Reply to
Paul Keinanen

That's indeed what happens, if a bit error hits exactly the wrong bit in the CAN message: the last bit of the end-of-frame field. This bit is checked by the transmitter, but not by the receiver(s). So, if this bit is struck by an error, the transmitter will detect this as a "form error", and re-send, but the receiver will not have noticed any problem.

No. That's not the actual definition of a properly working receiver. A proper CAN receiver will *not* look at the last bit of the EOF field.

Reply to
Hans-Bernhard Broeker

I've seen it happen. Not frequently but far more than could be ignored even if you were inclined to do so.

Robert

Reply to
R Adsett

At least you'll be forewarned. Non-systematic errors will supposedly hit any bit of a CAN message randomly, at equal probability. So this particular error will occur at most 1/50 as often as the other types of error, which both transmitter and receiver notice --- less if you use longer CAN messages.
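As a rough cross-check of that order of magnitude, the nominal length of a classic CAN data frame can be counted field by field. The sketch below ignores stuff bits and the interframe space, both of which make a real frame on the wire somewhat longer, so the shares it gives are slightly on the high side; the function names are invented for the example.

```python
def nominal_frame_bits(data_bytes, extended=False):
    """Bits in a classic CAN data frame before bit stuffing, SOF through EOF."""
    if not 0 <= data_bytes <= 8:
        raise ValueError("classic CAN carries 0 to 8 data bytes")
    # Standard format: SOF 1 + ID 11 + RTR 1 + IDE 1 + r0 1 + DLC 4
    #   + CRC 15 + CRC delimiter 1 + ACK slot 1 + ACK delimiter 1 + EOF 7 = 44
    # Extended format adds SRR 1, 18 more identifier bits and r1 1, giving 64.
    overhead = 64 if extended else 44
    return overhead + 8 * data_bytes


def last_eof_bit_share(data_bytes, extended=False):
    """Share of uniformly distributed bit errors that hit the last EOF bit."""
    return 1.0 / nominal_frame_bits(data_bytes, extended)
```

An empty standard frame comes out at 44 bits and a frame with 8 data bytes at 108 bits, so under the equal-probability assumption the last EOF bit takes somewhere between about 1/44 and 1/108 of the hits, in line with the rough 1/50 figure and falling as messages get longer.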

Keeping an eye on overall error-induced frame retransmission rates thus provides a handle on how often to expect this particular error. Combined with the requirements of the communication at hand, one can design the amount of countermeasures to match the risk.

Reply to
Hans-Bernhard Broeker
