Immunity Aware Programming

Task and objectives

For devices containing a microcontroller, the firmware design affects the electromagnetic compatibility (EMC) performance of the system. Minor and inexpensive alterations to the firmware can make the system more robust.

The firmware of embedded systems is usually not considered a source of emissions. Emissions are often caused by harmonics of the system clock and by switching currents of output drivers and registers with short rise and fall times. Badly designed PCBs (printed circuit boards) amplify this effect. It can be countered by improved microcontroller output drivers and by switching off system components.

The microcontroller is the most important part of a system and, due to its complexity, also the part most susceptible to faults. Here software design can yield inexpensive improvements without hardware alterations. Methods for improving the software design are described in the following sections.

Possible Interferences of Microcontroller Based Systems

CMOS microcontrollers have specific weak spots that must be considered when hardening a system against interference. An immunity analysis of the system and its requirements is necessary in any case.

The Power Supply

A slowly changing supply voltage does not cause significant disturbances, but rapid changes can induce considerable transient disturbances. If the voltage exceeds the parameters specified in the data sheet by 150 percent, an input port or output port can get stuck in one state (CMOS latch-up ^[1]). Without current limiting this can thermally destroy the microcontroller. There is no software solution to this problem. In general the supply must be well grounded and decoupled with capacitors and inductors close to the microcontroller (typical values: 100 µF and 0.1 µF in parallel).

Unstable power can cause serious malfunctions in most microcontrollers. For the CPU to decode and execute instructions successfully, the supply voltage must not drop below the minimum voltage level. When it does, the CPU may start to execute some instructions incorrectly. The result is unexpected activity on the internal data and control lines, which may cause:

CPU Register Corruption
I/O Register Corruption
I/O Pin Random Toggling
SRAM Corruption
EEPROM Corruption

Brown-out detection can prevent most of these problems.

The oscillator

The input ports of oscillators have a high impedance and are therefore very susceptible to transient disturbances. According to Ohm's law, even a small disturbance current produces a large voltage difference across a high impedance. Bursts can shorten clock periods and lead to incorrect data access or command execution. The result is corrupted memory content and/or a corrupted program pointer. Some of these effects can be handled by appropriate software design.

The I/O-Ports

I/O ports (including address lines and data lines) that are connected to long lines or external peripherals are the main entry point for disturbances. EMI can lead to incorrect data and addresses on the lines. Strong fluctuations may corrupt the contents of I/O registers and hence cause read errors or disable communication on the port. Read errors can be limited as described later. ESD can destroy ports or cause malfunctions.

Most microcontroller pins are high-impedance inputs or mixed inputs/outputs. High-impedance input pins are sensitive to noise and can register false levels if they are not properly terminated. Pins without internal termination need a resistor (e.g. 4.7 kΩ or 10 kΩ) connecting them to ground or to the supply, which ensures a known logic state.

Corrective Actions

An analysis of possible errors before correction is very important. The figure provides an overview of causes and effects of disturbances to the system.

MISRA^[2] [www.misra.org.uk] identifies the required steps in case of an error as follows:

Inform or warn the user.
Store the faulty data until a defined reset can be carried out.
Keep the system in a defined state until the error can be corrected.

Instruction Pointer (IP) Error Management

A disturbed instruction pointer can lead to serious errors such as an undefined jump to an arbitrary point in memory, where illegal instructions are read. The state of the system is then undefined. IP errors can be handled by software-based solutions such as Function Tokens and NOP Fills.

Many processors (such as the Motorola 680x0 and 68300) feature a hardware trap that is taken when an illegal instruction is encountered. Instead of a random instruction, a well-defined one, specified in the trap vector, is executed. Traps can handle a larger range of errors than Function Tokens and NOP Fills. In addition to illegal instructions, hardware traps reliably handle memory access violations, overflows, and division by zero.

Token Passing (Function Token)

Noise immunity can be improved by a form of execution flow control known as Token Passing. The figure shows the functional principle schematically. This method deals with program flow errors caused by a disturbed IP. The implementation is simple and efficient. Every function is tagged with a unique function ID. When a function is called, its function ID is saved in a global variable. The function body is only executed if the ID in the global variable matches the ID of the function. If the IDs do not match, an instruction pointer error has occurred and specific corrective actions can be taken. A sample implementation of Token Passing in C using a global variable is given in the following source listing. Function tokens increase the program code size by 10 to 20% and reduce performance.
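The original listing is not reproduced here. The following is a minimal sketch of the principle, assuming a single protected function; the names g_token, ip_error_handler, read_sensor and call_read_sensor as well as the ID values are hypothetical.

```c
#include <stdint.h>

/* Hypothetical function IDs; any unique non-zero values would do. */
enum { ID_NONE = 0, ID_READ_SENSOR = 0x5A };

static volatile uint8_t g_token = ID_NONE;  /* token set by the caller */
static unsigned g_ip_errors = 0;            /* number of detected IP errors */

/* Placeholder for the corrective action (report, safe state, reset, ...). */
static void ip_error_handler(void)
{
    g_ip_errors++;
}

/* Protected function: runs its body only if the token matches its own ID. */
int read_sensor(void)
{
    if (g_token != ID_READ_SENSOR) {  /* entered without a valid token */
        ip_error_handler();
        return -1;
    }
    g_token = ID_NONE;                /* consume the token */
    return 42;                        /* stand-in for the real work */
}

/* Regular call path: announce the callee, then call it. */
int call_read_sensor(void)
{
    g_token = ID_READ_SENSOR;
    return read_sensor();
}
```

A stray jump into read_sensor() finds no matching token and ends in the error handler instead of executing the function body. Consuming the token on entry is a design choice of this sketch; it prevents a stale token from validating a later erroneous entry.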

To improve the implementation, the function ID can be passed as an argument in the function header instead of through a global variable, as shown in the code sample below.
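The original code sample is not reproduced here. A minimal sketch of the argument-based variant could look as follows; set_valve, ID_SET_VALVE and the error counter are hypothetical names chosen for illustration.

```c
#include <stdint.h>

enum { ID_SET_VALVE = 0x3C };     /* hypothetical function ID */

static unsigned g_ip_errors = 0;  /* number of detected IP errors */

/* The token travels with the call. Returns 0 on success and -1 if the
 * function was entered with a wrong token; in that case the output
 * is left untouched. */
int set_valve(uint8_t token, int position, int *valve)
{
    if (token != ID_SET_VALVE) {
        g_ip_errors++;
        return -1;
    }
    *valve = position;
    return 0;
}
```

A regular caller writes set_valve(ID_SET_VALVE, position, &valve). An erroneous entry is unlikely to find the correct ID in the argument location, so it is rejected.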

NOP-Fills

NOP-Fills can, in some cases, improve the reliability of a system when the instruction pointer is disturbed. All program memory that is not used by the program code is filled with No-Operation (NOP) instructions. In machine code a NOP instruction is often represented by 0x00 (e.g. Intel 8051, ATmega16). The system is thereby kept in a defined state. At the end of the physical program memory an instruction pointer error handler (IPEH) has to be implemented. In some cases this can be a simple reset.

If an instruction pointer error occurs during execution and the program jumps into a memory segment filled with NOP instructions, the processor runs through the NOPs until it reaches the handler, so the error is recognized.

Two methods of implementing NOP-Fills are applicable.

In the first method the unused physical memory is set to 0x00 manually by search and replace in the (HEX) program file. The drawback of this method is that this has to be done after every compilation.

The second method is to include a corresponding number of NOP assembler directives directly in the program code.

With the CodevisionAVR C compiler, NOP fills can be implemented easily: the chip programmer offers a feature to edit the program flash and EEPROM and fill them with a specific value. On an Atmel ATmega16 no jump to the reset address 0x00 needs to be implemented, because an overflow of the instruction pointer automatically sets its value to 0x00. Unfortunately a restart caused by such an overflow is not equivalent to an intentional reset. During an intentional reset all necessary microcontroller registers are reset by hardware, which does not happen on a jump to 0x00. This method is therefore not applied in the following tests.
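The first method can also be automated as a post-build step on the development host. The following sketch pads a raw binary image held in memory; it assumes a one-byte NOP encoded as 0x00 (as on the 8051) and a handler placed in the last bytes of the image. The function name nop_fill and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOP_OPCODE 0x00u  /* assumed NOP encoding, as on the 8051 */

/* Fills image[used .. total-1] with NOPs and places the error handler
 * code in the last handler_len bytes. Returns 0 on success, -1 if the
 * handler does not fit into the unused area. */
int nop_fill(uint8_t *image, size_t used, size_t total,
             const uint8_t *handler, size_t handler_len)
{
    if (used > total || handler_len > total - used)
        return -1;
    memset(image + used, NOP_OPCODE, total - used - handler_len);
    memcpy(image + total - handler_len, handler, handler_len);
    return 0;
}
```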

I/O Register Errors

Microcontroller architecture requires the I/O leads to be placed at the outer edge of the silicon die. I/O contacts are therefore strongly affected by transient disturbances on their way to the silicon core, and I/O registers are among the most vulnerable parts of the microcontroller. Incorrectly read I/O registers may lead to an incorrect system state. The most serious errors can occur at the reset port and at interrupt input ports. Disturbed data direction registers (DDR) may inhibit writing to the bus.

These disturbances can be prevented as follows:

1. Cyclic update of the most important registers

Errors can be reduced by cyclically rewriting the most important registers, including the data direction registers, at the shortest possible intervals. A wrongly set bit is then corrected before it can have negative effects.

2. Multiple read of input registers

A further method of filtering disturbances is to read input registers several times. The values read are then checked for consistency. If they are consistent, they can be considered valid. Defining a valid value range and/or calculating a mean value can improve the results for some applications.
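Both measures can be sketched in a few lines of C. The sketch assumes that the critical registers are reachable through pointers and that their intended values are kept in a constant table; reg_shadow, refresh_registers and samples_agree are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per critical register: its address and the value it must hold. */
struct reg_shadow {
    volatile uint8_t *reg;
    uint8_t expected;
};

/* Measure 1: rewrites every listed register with its expected value and
 * returns how many registers were found corrupted. */
size_t refresh_registers(const struct reg_shadow *table, size_t n)
{
    size_t corrupted = 0;
    for (size_t i = 0; i < n; i++) {
        if (*table[i].reg != table[i].expected)
            corrupted++;
        *table[i].reg = table[i].expected;
    }
    return corrupted;
}

/* Measure 2: accepts n successive samples of an input register only if
 * all of them agree. Returns 0 and stores the value, or -1 otherwise. */
int samples_agree(const uint8_t *samples, size_t n, uint8_t *out)
{
    if (n == 0)
        return -1;
    for (size_t i = 1; i < n; i++) {
        if (samples[i] != samples[0])
            return -1;
    }
    *out = samples[0];
    return 0;
}
```

On a real device the table would live in program memory, refresh_registers() would be called from the main loop or a timer, and the samples would come from successive reads of the port input register.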

Side Effect: Increased Activity

A drawback is the increased activity caused by the permanent updates and read-outs of peripherals. This activity may itself cause additional emissions and failures.

External Interrupt Ports; Stack Overflow

External interrupts are triggered by falling/rising edges or high/low potential at the interrupt port leading to an interrupt request (IRQ) in the controller. Hardware interrupts are divided into maskable interrupts and non-maskable interrupts (NMI). The triggering of maskable interrupts can be stopped in some time-critical functions. If an interrupt is called the current instruction pointer (IP) is saved on the stack and the stack pointer (SP) is decremented. The address of the interrupt service routine (ISR) is read from the interrupt vector table and loaded to the IP register and the ISR is executed as a consequence.

If interrupts—due to disturbances—are generated faster than processed, the stack grows until all memory is used. Data on the stack or other data might be overwritten. A defensive software strategy can be applied. The stack pointer (SP) can be watched. The growing of the stack beyond a defined address can then be stopped. The value of the stack pointer can be checked at the start of the interrupt service routine. If the SP points to an address outside the defined stack limits a reset can be executed.
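The check at the start of the interrupt service routine can be sketched as follows. The stack limits are hypothetical example values, and the stack pointer is passed in as a parameter so that the logic can be tried on a host; on a real device it would be read from the stack pointer register, and the flag would be replaced by the platform's reset mechanism.

```c
#include <stdint.h>

/* Hypothetical stack region: the stack grows downwards from STACK_TOP
 * and must never reach below STACK_LIMIT. */
#define STACK_TOP   0x045Fu
#define STACK_LIMIT 0x0400u

static volatile int g_reset_requested = 0;  /* stand-in for a real reset */

/* Returns 1 if sp lies inside the permitted stack region, else 0. */
int sp_in_bounds(uint16_t sp)
{
    return sp >= STACK_LIMIT && sp <= STACK_TOP;
}

/* To be called first in every ISR. Returns 0 if the ISR may continue,
 * -1 if the stack has left its region and a reset was requested. */
int isr_entry_check(uint16_t sp)
{
    if (!sp_in_bounds(sp)) {
        g_reset_requested = 1;
        return -1;
    }
    return 0;
}
```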

Data Redundancy

In systems without error detection and correction units, the reliability of the system can be improved by providing protection through software. Protecting the entire memory (code and data) in software may not be practical, as it causes an unacceptable amount of overhead, but software protection is a low-cost solution for selected code segments.

Another elementary requirement of digital systems is the faultless transmission of data. Communication with other components can be the weak point of a system and a source of errors. A well thought-out transmission protocol is very important. The techniques described below can also be applied to transmitted data, thereby increasing transmission reliability.

Cyclic Redundancy and Parity Check

A cyclic redundancy check (CRC) is a type of hash function that produces a checksum, a small integer, from a large block of data such as network traffic or computer files. CRCs are calculated before and after transmission or duplication and compared to confirm that they are equal. With a suitable generator polynomial, a CRC detects all one-bit and two-bit errors, all errors with an odd number of flipped bits, all burst errors shorter than the CRC, and most longer burst errors.

Parity checks can be applied to single characters (VRC, Vertical Redundancy Check), resulting in an additional parity bit, or to a block of data (LRC, Longitudinal Redundancy Check), resulting in a block check character. Both methods can be implemented rather easily with an XOR operation. The trade-off is that fewer errors are detected than with a CRC. Parity checks only detect odd numbers of flipped bits; even numbers of bit errors stay undetected. A possible improvement is to use both VRC and LRC, which is called Double Parity.

Some microcontrollers feature a hardware CRC unit.
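Where no hardware unit is available, the checks can be computed in software. The sketch below shows a bitwise CRC-16 and the two parity checks. The CRC parameters (polynomial 0x1021, initial value 0xFFFF, no reflection, often called CRC-16/CCITT-FALSE) are one common choice and an assumption of this example, not something prescribed by the text above.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16: polynomial 0x1021, initial value 0xFFFF, no reflection. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* VRC: parity bit of one byte, 1 if the byte has an odd number of ones. */
uint8_t parity_bit(uint8_t b)
{
    b ^= (uint8_t)(b >> 4);
    b ^= (uint8_t)(b >> 2);
    b ^= (uint8_t)(b >> 1);
    return b & 1u;
}

/* LRC: block check character, the XOR of all bytes in the block. */
uint8_t lrc(const uint8_t *data, size_t len)
{
    uint8_t bcc = 0;
    for (size_t i = 0; i < len; i++)
        bcc ^= data[i];
    return bcc;
}
```

For the nine ASCII characters "123456789" this CRC variant yields 0x29B1, the usual check value. A byte with two flipped bits has the same parity bit as the original, which illustrates why even numbers of bit errors escape the parity check.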

Different Kinds of Duplication

Duplication is a specific form of data redundancy. It can be applied in several ways, as described in the following sections.

Data Duplication

To cope with the corruption of data, multiple copies of important registers and variables can be stored. Consistency checks between memory locations storing the same values, or voting techniques can then be performed when accessing the data.

Two different modifications to the source code need to be implemented.

The first one corresponds to duplicating some or all of the program variables in order to introduce data redundancy and modifying all the operators to manage the introduced replica of the variables.

The second modification introduces consistency checks in the control flow so that consistency between the two copies of each variable is verified.

When the data is read out the two sets of data are compared. A disturbance is detected if the two data sets are not equal. An error can be reported. If both sets of data are corrupted a significant error can be reported and the system can react accordingly.

In most cases, safety-critical applications have strict constraints on memory occupation and system performance. Duplicating the whole set of variables and introducing a consistency check before every read operation is the optimum choice from the fault coverage point of view, because it enables this software redundancy technique to cover an extremely high percentage of faults. On the other hand, by duplicating a smaller percentage of the variables one can trade off the obtained fault coverage against the CPU time overhead.

The experimental result shows that duplicating only 50% of the variables is enough to cover 85% of faults with a CPU time overhead of just 28%.

Attention should also be paid to the implementation of the consistency check, as it is usually carried out after each read operation or at the end of each variable's lifetime. A careful implementation of this check can minimize the CPU time and code size overhead.
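A minimal sketch of a duplicated variable with a consistency check on every read is shown below. The type dup_u16 and the functions dup_write and dup_read are hypothetical names; in a real system the two copies would ideally be placed in different memory areas.

```c
#include <stdint.h>

/* A protected variable: the value and its replica. */
struct dup_u16 {
    volatile uint16_t copy0;
    volatile uint16_t copy1;
};

/* Every write operation updates both copies. */
void dup_write(struct dup_u16 *d, uint16_t v)
{
    d->copy0 = v;
    d->copy1 = v;
}

/* Every read operation compares the copies. Returns 0 and stores the
 * value if they agree, -1 if a disturbance was detected. */
int dup_read(const struct dup_u16 *d, uint16_t *out)
{
    uint16_t a = d->copy0;
    uint16_t b = d->copy1;
    if (a != b)
        return -1;
    *out = a;
    return 0;
}
```

With only two copies a mismatch can be detected but not corrected; a voting scheme needs at least three copies.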

Function Parameter Duplication

Because errors in data are detected by duplicating all variables and adding consistency checks after every read operation, special consideration has to be given to procedure interfaces. Parameters passed to procedures, as well as return values, are considered to be variables. Hence every procedure parameter is duplicated, and so are the return values. A procedure is still called only once, but it returns two results, which must hold the same value. The source listing to the right shows a sample implementation of function parameter duplication.
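The original listing is not reproduced here. The following sketch illustrates the idea with a hypothetical function scale that receives each parameter twice and returns two results, together with a caller that compares them.

```c
#include <stdint.h>

/* Both results of a single call. */
struct dup_result {
    int32_t r0;
    int32_t r1;
};

/* Every parameter arrives twice; one result is computed from each set
 * of copies. */
struct dup_result scale(int32_t x0, int32_t x1, int32_t k0, int32_t k1)
{
    struct dup_result res;
    res.r0 = x0 * k0;
    res.r1 = x1 * k1;
    return res;
}

/* Caller-side consistency check: returns 0 and stores the value only if
 * both results hold the same value, -1 otherwise. */
int checked_scale(int32_t x, int32_t k, int32_t *out)
{
    struct dup_result res = scale(x, x, k, k);
    if (res.r0 != res.r1)
        return -1;
    *out = res.r0;
    return 0;
}
```

An optimizing compiler may merge the two computations; in practice the copies are often declared volatile or optimization is restricted for such code.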

Test Duplication

Duplicating a test is one of the most robust methods for generic soft error detection. A drawback is that no strict assumption can be made about the cause of the errors (EMI, ESD etc.) or about the type of errors to expect (errors affecting the control flow, errors affecting data etc.). The errors considered are erroneous bit changes in data bytes while these are stored in memory, cache or a register, or transmitted on a bus. The data bytes can be operation codes (instructions), memory addresses, or data. The method is therefore able to detect a wide range of faults and is not limited to a specific fault model. With this method, memory use increases about 4 times and execution time is about 2.5 times as long as for the same program without test duplication. The source listing to the right shows a sample implementation of the duplication of test conditions.
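The original listing is not reproduced here. One possible reading of the technique is sketched below: the condition is evaluated on the first copy of a duplicated variable and cross-checked on the second copy inside both branches. This combination with data duplication, and the names over_limit and fault_detected, are assumptions of the example.

```c
#include <stdint.h>

static unsigned g_faults = 0;  /* number of detected inconsistencies */

/* Placeholder for the error reaction (report, safe state, reset, ...). */
static void fault_detected(void)
{
    g_faults++;
}

/* temp0 and temp1 are the two copies of the same variable. */
int over_limit(int16_t temp0, int16_t temp1, int16_t limit)
{
    if (temp0 > limit) {
        if (!(temp1 > limit))   /* cross-check inside the "true" branch */
            fault_detected();
        return 1;
    } else {
        if (temp1 > limit)      /* cross-check inside the "false" branch */
            fault_detected();
        return 0;
    }
}
```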

Branching Duplication

In Test Duplication one condition is cross-checked; in Branching Duplication the condition itself is duplicated.

For every conditional test in the program, the condition and the resulting jump should be re-evaluated as shown in the figure. The jump is executed only if the condition is met again; otherwise an error has occurred.

Instruction Duplication and Diversity in Implementation

What is the benefit of duplicating data, tests, and branches if the result itself is calculated incorrectly? One solution is to duplicate an instruction sequence entirely but to implement the two versions differently. Two different programs with the same functionality, but with different sets of data and different implementations, are executed. Their outputs are compared and must be equal. This method covers not just bit flips or processor faults but also programming errors (bugs). If the intention is specifically to handle hardware (CPU) faults, the software can be implemented so that it uses different parts of the hardware; for example, one implementation uses the hardware multiplier and the other multiplies by shifting and adding. This causes a significant overhead (more than a factor of 2 in code size). On the other hand, the results are outstandingly accurate.
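The multiplication example can be sketched as follows. Whether the first variant really uses a hardware multiplier depends on the target and the compiler; the function names are hypothetical.

```c
#include <stdint.h>

/* Implementation 1: the multiply operator (hardware multiplier if present). */
uint32_t mul_hw(uint16_t a, uint16_t b)
{
    return (uint32_t)a * b;
}

/* Implementation 2: shift-and-add, which avoids the multiplier. */
uint32_t mul_shift_add(uint16_t a, uint16_t b)
{
    uint32_t acc = 0;
    uint32_t addend = a;
    while (b != 0) {
        if (b & 1u)
            acc += addend;
        addend <<= 1;
        b >>= 1;
    }
    return acc;
}

/* Runs both implementations and compares their outputs. Returns 0 and
 * stores the product if they agree, -1 otherwise. */
int checked_mul(uint16_t a, uint16_t b, uint32_t *out)
{
    uint32_t r1 = mul_hw(a, b);
    uint32_t r2 = mul_shift_add(a, b);
    if (r1 != r2)
        return -1;
    *out = r1;
    return 0;
}
```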

Ports

Reset Ports and Interrupt Ports

Reset ports and interrupts are very important as they can be triggered by rising/falling edges or high/low potential at the interrupt port. Transient disturbances can lead to unwanted resets, or trigger interrupts and thus cause the entire system to crash. For every triggered interrupt the instruction pointer is saved on the stack and the stack pointer is decremented.

The number of edge-triggered interrupts should be reduced where possible. If interrupts can only be triggered by a level, noise on an interrupt pin is less likely to cause an undesired operation. It must be kept in mind that level-triggered interrupts can lead to repeated interrupts as long as the level stays active. The implementation must take this into account: repeated unwanted interrupts must be disabled in the ISR. If level triggering is not possible, a software check of the pin level immediately on entry to an edge-triggered interrupt should suffice.

For all unused interrupts an error handling routine has to be implemented to keep the system in a defined state after an unintended interrupt.

Unintentional resets disturb the correct program execution and are not acceptable for extensive applications or safety critical systems.

Reset differentiation (Cold/Warm start)

A frequent system requirement is the automatic resumption of work after a disturbance/disruption. It can be useful to record the state of a system at shut down and to save the data in a non-volatile memory. At startup the system can evaluate if the system restarts due to disturbance or failure (warm start) and the system status can be restored or an error can be indicated. In case of a cold start the saved data in the memory can be considered valid.
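One way to distinguish the two cases is a small status record in non-volatile memory that is marked as "running" at startup and as "cleanly shut down" at shutdown. The sketch below operates on a plain struct so that it can be tried on a host; the record layout, the magic value and all names are hypothetical, and a real system would read and write the record through its EEPROM or flash driver.

```c
#include <stdint.h>

#define STATE_MAGIC 0xC0DEu  /* hypothetical marker for a written record */

struct saved_state {
    uint16_t magic;
    uint16_t clean_shutdown;  /* 1 after an orderly shut down, else 0 */
    uint16_t mode;            /* example of application state */
    uint16_t checksum;        /* XOR of the three fields above */
};

enum start_kind { START_COLD, START_WARM, START_INVALID };

static uint16_t state_checksum(const struct saved_state *s)
{
    return (uint16_t)(s->magic ^ s->clean_shutdown ^ s->mode);
}

/* Called during operation whenever the state changes. */
void mark_running(struct saved_state *s, uint16_t mode)
{
    s->magic = STATE_MAGIC;
    s->clean_shutdown = 0;
    s->mode = mode;
    s->checksum = state_checksum(s);
}

/* Called as the last step of an orderly shut down. */
void mark_shutdown(struct saved_state *s)
{
    s->clean_shutdown = 1;
    s->checksum = state_checksum(s);
}

/* Called at startup, before the record is modified again. */
enum start_kind classify_start(const struct saved_state *s)
{
    if (s->magic != STATE_MAGIC || s->checksum != state_checksum(s))
        return START_INVALID;
    return s->clean_shutdown ? START_COLD : START_WARM;
}
```

A record that still says "running" at startup means the previous run ended without an orderly shut down, so the start is treated as a warm start.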

External Current Consumption Measurement

This method combines hardware and software. It proposes a simple circuit that detects electromagnetic interference using the device's own resources. Most microcontrollers, such as the ATmega16, integrate analog-to-digital converters (ADC), which can be used to detect unusual power supply fluctuations caused by interference.

When the software detects interference, the microcontroller can enter a safe state and wait for the disturbance to pass. During this safe state, no critical operations are allowed. The graphic presents how interference detection can be performed. This technique can easily be used with any microcontroller that features an A/D converter.
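The software side of such a detector can be sketched as a small state machine that is fed with ADC samples of the supply rail. The nominal value, the tolerance window and the names are hypothetical; suitable numbers depend on the measurement circuit.

```c
#include <stdint.h>

/* Hypothetical window for a 10-bit ADC reading of the supply rail. */
#define SUPPLY_NOMINAL   512
#define SUPPLY_TOLERANCE 40

enum sys_mode { MODE_NORMAL, MODE_SAFE };

struct emi_monitor {
    enum sys_mode mode;
    unsigned calm;         /* consecutive in-window samples while in safe state */
    unsigned calm_needed;  /* samples required before leaving the safe state */
};

/* Processes one ADC sample. The safe state is entered as soon as a
 * sample leaves the window and is left only after calm_needed
 * consecutive samples inside the window. */
enum sys_mode emi_update(struct emi_monitor *m, int adc_sample)
{
    int dev = adc_sample - SUPPLY_NOMINAL;
    if (dev < 0)
        dev = -dev;
    if (dev > SUPPLY_TOLERANCE) {
        m->mode = MODE_SAFE;
        m->calm = 0;
    } else if (m->mode == MODE_SAFE && ++m->calm >= m->calm_needed) {
        m->mode = MODE_NORMAL;
    }
    return m->mode;
}
```

Critical operations would be executed only while the monitor reports MODE_NORMAL.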

Watchdog

Watchdog^[3] timers significantly improve the reliability of a microcontroller in an electromagnetically disturbed environment. They are often integrated on the chip. A watchdog timer is an independent component that runs in parallel and reliably detects software errors and hardware errors.

The software informs the watchdog at regular intervals that it is still working properly. If the watchdog is not informed, the software is no longer working as specified, and the watchdog resets the system to a defined state. During the reset the device cannot process data and does not react to calls.

The Atmel ATmega16 has a separate clock for the watchdog. For operational safety, the watchdog can only be disabled by setting three specific bits within four clock cycles.

Because the strategy for resetting the watchdog timer is very important, two requirements have to be met:

The watchdog may only be reset if all routines work properly.

The reset must be executed as quickly as possible.

Simply activating the watchdog and resetting the timer regularly does not make optimal use of it. For best results the refresh cycle of the timer must be set as short as possible and the refresh must be called from the main() function, so that a reset can be performed before damage is caused or a further error occurs. If a microcontroller does not have an internal watchdog, similar functionality can be implemented with a timer interrupt.
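The first requirement can be sketched with a set of check-in flags: every routine reports that it has run, and the watchdog is refreshed from the main loop only when all flags are set. The task names and the counter that stands in for the hardware refresh are hypothetical; on an AVR the refresh would be the WDR instruction.

```c
#include <stdint.h>

/* Hypothetical routines that must all report in during each cycle. */
enum {
    TASK_SENSOR  = 1u << 0,
    TASK_CONTROL = 1u << 1,
    TASK_COMMS   = 1u << 2,
    ALL_TASKS    = TASK_SENSOR | TASK_CONTROL | TASK_COMMS
};

static volatile uint8_t g_alive = 0;  /* check-in flags of the current cycle */
static unsigned g_kicks = 0;          /* stand-in for the hardware refresh */

/* Called by each routine after it has completed successfully. */
void task_checkin(uint8_t task)
{
    g_alive = (uint8_t)(g_alive | task);
}

/* Placeholder for the real watchdog refresh. */
static void watchdog_kick(void)
{
    g_kicks++;
}

/* Called once per pass of the main loop. Refreshes the watchdog only if
 * every routine has checked in; returns 1 if it did, 0 otherwise. */
int watchdog_service(void)
{
    if (g_alive != ALL_TASKS)
        return 0;
    g_alive = 0;
    watchdog_kick();
    return 1;
}
```

If any routine stops checking in, the refresh is withheld and the hardware watchdog eventually resets the system.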

Brown-Out

A brown-out circuit monitors the VCC level during operation by comparing it to a fixed trigger level. The ATmega16 has an on-chip Brown-out Detection (BOD) circuit. The trigger level for the BOD can be selected by the fuse BODLEVEL to be 2.7 V (BODLEVEL unprogrammed) or 4.0 V (BODLEVEL programmed). When VCC decreases to a value below the trigger level, the brown-out reset is activated immediately. When VCC increases above the trigger level again, the delay counter starts the microcontroller after the time-out period has elapsed. The BOD circuit only detects a drop in VCC if the voltage stays below the trigger level for longer than 2 µs. The trigger level has a hysteresis to ensure spike-free brown-out detection.

If the Brown-Out Detector is not needed in the application this module should be turned off. If the Brown-out Detector is enabled it will be enabled in all sleep modes, and hence, always consume power. In the deeper sleep modes, this will contribute significantly to the total current consumption.

Notes

^[1] Latch-up, also known as Single Event Latch-up (SEL), is a short circuit between VDD (positive power supply) and VSS (negative power supply). It is caused by parasitic transistors of CMOS circuits (transistors that cannot be activated under normal operating conditions). Strong transient disturbances can activate these transistors and thermally destroy the device.
^[2] MISRA (The Motor Industry Software Reliability Association) provides assistance to the automotive industry in the application and creation of safe and reliable software within vehicle systems.
^[3] Watchdog: In general the term watchdog is used for a system component that observes the function of other components. In particular it ensures that microcontroller-controlled devices do not fail completely if a software error occurs. A watchdog can be implemented in software or in hardware. Hardware watchdogs are usually based on a monostable timer and can be subdivided into internal and external watchdogs.