Problems with MCU hardware self-test-EEWORLD


Our company uses an ATmega128 microcontroller with an external RAM chip. The microcontroller's firmware can check the external RAM circuit: for example, if any of the RAM's D0~D7 pins is poorly soldered, the microcontroller issues a warning. My question is: how does the microcontroller determine that a pin is poorly soldered? Does the RAM chip drive D0~D7 onto the microcontroller's data bus, and the microcontroller then decides that a joint is bad because one of the lines is open or high impedance? I have just started doing hardware maintenance and have never figured this out.

　　Also, if one of the RAM address lines is left floating, will the data output definitely be in high impedance?
　　If anyone knows, please help me explain. Thank you!

Before a hardware system leaves the factory, it must be tested; before an embedded system starts work, it generally runs a self-test, in which ROM and RAM tests are essential. However, many people misunderstand the purpose, reason and method of these tests. Why do we need to test ROM and RAM, and how should we test them? The common view is that ROM and RAM chips might be damaged, so the quality of these two chips should be verified before shipment and before use. RAM is tested by writing and reading back each memory unit to check whether it can be written correctly; ROM is tested by accumulating the values of all storage units and comparing the sum with a checksum. This understanding is not wrong, but it is somewhat superficial, and a test program written on this basis is incomplete. Generally speaking, ROM and RAM chips themselves are unlikely to be damaged, and the probability of receiving defective parts is relatively small. The real problems mostly lie in other parts of the hardware. Therefore, testing the ROM and RAM chips themselves is often not the real purpose.

ROM testing
The real purpose of testing ROM is to ensure program integrity.
Embedded software and startup code are stored in ROM, and ROM cannot guarantee long-term stability and reliability, because hardware is inherently unreliable. Taking flash ROM as an example, program contents can be lost or corrupted for two main reasons:
1. Radiation. The device works in a radiation environment or is exposed to radiation during transportation (for example, inspection by an X-ray machine when passing through customs).
2. Long-term storage causes storage failure, and some bits flip between 0 and 1 spontaneously.
In any case, programs stored in hardware are unreliable. If the program cannot run at all, the loss is limited. The real danger is that the program still runs, but some key data or key code segments are damaged, causing fatal errors. For this reason, before the program starts normal operation, software must confirm that the program about to run is undamaged and is exactly the one originally written. There are many ways to ensure program integrity, such as a CRC check (CRC-16 or CRC-32) or a cumulative sum check (shift accumulation) over the whole program. As long as the probability of an undetected error is mathematically extremely low, the program can be considered intact for engineering purposes.
Passing the program integrity test also proves that the ROM is not damaged. That is, testing whether the ROM is damaged is only a by-product of the test, not the main purpose.
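As an illustration of the CRC check mentioned above, here is a minimal sketch in C. It assumes the CRC-16/CCITT-FALSE variant (polynomial 0x1021, initial value 0xFFFF), that the program image can be read through an ordinary pointer, and that the expected CRC was recorded at build time. The function names crc16_ccitt and rom_image_ok are made up for this example. On an AVR part the flash bytes would normally have to be fetched with the pgm_read_byte family of macros instead of a plain pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, MSB first. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Returns 1 if the image matches the CRC recorded at build time, else 0.
   expected_crc is a hypothetical build-time constant. */
int rom_image_ok(const uint8_t *image, size_t len, uint16_t expected_crc)
{
    return crc16_ccitt(image, len) == expected_crc;
}
```

A startup routine could call rom_image_ok() once and refuse to continue if it returns 0. A CRC catches every single-bit flip, which a plain sum does not always guarantee for multi-bit damage.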

RAM testing
The real purpose of testing RAM is to ensure the reliability of the hardware system.
RAM really does not fail easily. So far I have not seen a system abnormality caused by a damaged RAM chip. However, most hardware problems show up in a RAM test. Think carefully about what kinds of errors can occur when the hardware is manufactured or inserted into a backplane. The board you made is the more likely place for problems. Consider the following points:
1. The production process is not up to standard: vias are misaligned, the spacing to adjacent signal lines violates the design rules, or a via even touches a trace.
2. Signal lines are shorted together by solder bridges.
3. Poor contact caused by cold solder joints or missed solder joints.
4. Handling rules are not followed, leaving fingerprints on high-frequency traces.
5. The board is dirty, was not blown clean, and is covered with a layer of dust (containing metal particles).
...
These phenomena are quite interesting. A few examples:
1. Address lines A0 and A1 are shorted together. The three bytes at XXX00, XXX01, and XXX10 then hold exactly the same data.
2. Data lines D0 and D1 are shorted together. As long as one of D0 and D1 is 0, both lines read 0.
3. Poor contact. Sometimes it works, sometimes it does not.
4. The device surface was not cleaned properly and flux residue remains. Low-speed access is normal, but high-speed access under heavy load crashes frequently.
In short, the boards we make have many chances to go wrong during production and use, so they must be tested before leaving the factory and self-checked before use. (Of course, if you are building laboratory samples rather than actual products, the steps can be simplified.)
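The first two examples above (shorted address lines and shorted data lines) can be modelled in a few lines of C. This is only a sketch of the fault model described in the text, where two shorted lines read 0 as soon as either of them is driven to 0. Real shorts may behave differently depending on the drivers involved, and the function name short_bits is invented for this example.

```c
#include <stdint.h>

/* Model of two shorted lines a and b on an 8-bit bus:
   if either bit is 0, both bits read as 0 (wired-AND behaviour). */
uint8_t short_bits(uint8_t value, unsigned a, unsigned b)
{
    uint8_t mask = (uint8_t)((1u << a) | (1u << b));
    if ((value & mask) != mask)
        value &= (uint8_t)~mask;
    return value;
}
```

Applied to an address with a = 0 and b = 1, addresses ending in 00, 01 and 10 all map to the unit ending in 00, which is why those three bytes hold the same data. Applied to a data byte, any value with only one of D0 and D1 set comes back with both bits cleared.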
How should RAM be tested? Simply writing one value and reading it back obviously cannot detect all problems. A single test pattern cannot cover everything that needs testing, let alone locate the cause of an error (bad RAM, shorted address/data lines, poor contact). A good test should cover shorted lines, bad RAM, and the board's high-frequency characteristics as far as possible.
The method I have arrived at is as follows (taking a RAM with address units 00H to FFH as an example):
First, test the address lines:
1. '0' sliding: choose an arbitrary value such as 55H or AAH and write it to address units FEH, FDH, FBH, F7H, EFH, DFH, BFH, 7FH in turn. Written in binary, the addresses show a single 0 bit sliding from the low end to the high end of the address bus, hence the name '0' sliding. The purpose is to test whether each address line is stable when it changes to 0 in turn. When a line changes from 1 to 0, undershoot occurs. If the undershoot is not well controlled, it causes errors at high frequencies. The address lines on a board are not necessarily the same length, so their undershoot is not identical. Therefore, each line is tested for undershoot separately.
2. '1' sliding: choose an arbitrary value such as 55H or AAH and write it to address units 1H, 2H, 4H, 8H, 10H, 20H, 40H, 80H in turn. Written in binary, the addresses show a single 1 bit sliding from the low end to the high end of the address bus, hence the name '1' sliding. The purpose is to test whether each address line is stable when it changes to 1 in turn. When a line changes from 0 to 1, overshoot occurs. If the overshoot is not well controlled, it causes errors at high frequencies. The address lines on a board are not necessarily the same length, so their overshoot is not identical. Therefore, each line is tested for overshoot separately. Overshoot and undershoot are different indicators and should be measured separately.
3. "All 0 to all 1": choose an arbitrary value such as 55H or AAH, write it to the FFH unit, then to the 00H unit, and then to the FFH unit again. Written in binary, the address lines change from all '0' to all '1'. According to signal processing theory, a voltage step contains an infinitely wide spectrum, whose high-frequency part radiates outward. These radiated signals are interference sources and mainly affect adjacent lines. Address lines are generally routed as a bundle, so simultaneous transitions cause the greatest interference. When the address lines change from all '0' to all '1', interference, overshoot, and fan-out current are at their worst.
4. "All 1 to all 0": immediately after the previous step, choose an arbitrary value such as 55H or AAH and write it to the 00H unit. Written in binary, the address lines change from all '1' to all '0', generating the maximum undershoot interference.
5. "Stick test": write different data to different address units in turn, then read it back and check, for example 1, 2, 3, 4, ... This step also tests the quality of the RAM itself. Note that you must not write the same data everywhere, otherwise shorted lines cannot be detected.
6. Optionally, "all 0 / all 1 continuous high-speed change". The purpose is to simulate the worst case (large fan-out current, strong interference, overshoot/undershoot).
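The address-line steps above can be sketched in C as follows. This is a simplified illustration, not a complete factory test: it assumes an 8-bit address space (units 00H to FFH) reachable through a pointer named ram, and the helper names are invented for this example. It only checks logical correctness; whether overshoot and undershoot actually cause errors depends on running the accesses at full bus speed on the real board.

```c
#include <stddef.h>
#include <stdint.h>

#define ADDR_BITS 8u  /* assumption: 8 address lines, units 00H to FFH */

/* '1' sliding address sequence: 01H, 02H, 04H, ... 80H. */
void sliding_one_addrs(uint8_t addrs[ADDR_BITS])
{
    for (unsigned i = 0; i < ADDR_BITS; i++)
        addrs[i] = (uint8_t)(1u << i);
}

/* '0' sliding address sequence: FEH, FDH, FBH, ... 7FH. */
void sliding_zero_addrs(uint8_t addrs[ADDR_BITS])
{
    for (unsigned i = 0; i < ADDR_BITS; i++)
        addrs[i] = (uint8_t)~(1u << i);
}

/* Writes pattern to each listed address and reads it straight back.
   Returns -1 if every unit held the pattern, else the index of the
   first address in the list that failed. */
int write_and_check(volatile uint8_t *ram, const uint8_t *addrs,
                    unsigned count, uint8_t pattern)
{
    for (unsigned i = 0; i < count; i++) {
        ram[addrs[i]] = pattern;
        if (ram[addrs[i]] != pattern)
            return (int)i;
    }
    return -1;
}

/* Stick test: write a different value (address + 1) to every unit, then
   read everything back. Returns -1 on success, else the first bad address.
   The values are only unique for sizes up to 256 units. */
int stick_test(volatile uint8_t *ram, size_t size)
{
    for (size_t a = 0; a < size; a++)
        ram[a] = (uint8_t)(a + 1);
    for (size_t a = 0; a < size; a++)
        if (ram[a] != (uint8_t)(a + 1))
            return (int)a;
    return -1;
}
```

The stick test writes all units before reading any of them back. If two addresses alias to the same physical unit, the later write overwrites the earlier one and the read-back pass catches the mismatch, which a write-then-immediately-read loop would miss.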
Then, test the data lines (the principle is the same as for the address lines; steps 1 and 2 also detect shorted data lines):
1. '0' sliding: write FEH, FDH, FBH, F7H, EFH, DFH, BFH, 7FH to a fixed address in turn, reading back and checking each value.
2. '1' sliding: write 1H, 2H, 4H, 8H, 10H, 20H, 40H, 80H to a fixed address in turn, reading back and checking each value.
3. "All 0 to all 1": set all units to all 1s (clear first, then set to all 1s, read back and check).
4. "All 1 to all 0": clear all units (clear, read back and check).
5. Optionally, "all 0 / all 1 continuous high-speed change": alternately write all '0's and all '1's to one unit at high speed, ending with all '0's.
At this point, the RAM test is completed and all storage units are cleared.
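Steps 1 and 2 of the data-line test can be sketched as a single C function. This is a minimal illustration under the assumption that one fixed RAM unit is reachable through a pointer; the name data_line_test and the choice to return a bit mask are mine, not part of the original method.

```c
#include <stdint.h>

/* Slides a single '1' and a single '0' across the 8 data lines at one
   fixed unit. Returns 0 on success, or a mask of the data bits that
   read back wrong at least once. */
uint8_t data_line_test(volatile uint8_t *cell)
{
    uint8_t bad = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint8_t one = (uint8_t)(1u << i);  /* 01H, 02H, ... 80H */
        uint8_t zero = (uint8_t)~one;      /* FEH, FDH, ... 7FH */
        *cell = one;
        bad |= (uint8_t)(*cell ^ one);
        *cell = zero;
        bad |= (uint8_t)(*cell ^ zero);
    }
    *cell = 0x00;                          /* leave the unit cleared */
    return bad;
}
```

With D0 and D1 shorted as described earlier, writing 01H would read back 00H, so bit 0 would appear in the returned mask, and writing 02H would flag bit 1 in the same way.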
The factory test program leaves plenty of room for extension, such as adding error-location code that automatically reports the cause and location of an error.
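One possible form of such error-location code is sketched below in C. It is a heuristic under stated assumptions, not a definitive diagnosis: the data-line helper simply reports which bits differ, and the address-line helper assumes a stick test that wrote the value (address + 1) to every unit of a 256-byte RAM, so that a wrong value reveals which other address overwrote the unit. Both function names are invented for this example.

```c
#include <stdint.h>

/* Data-bus bits that differ between the value written and the value read. */
uint8_t suspect_data_lines(uint8_t written, uint8_t read_back)
{
    return (uint8_t)(written ^ read_back);
}

/* Assumes every unit of a 256-byte RAM was written with (address + 1).
   If unit addr holds another unit's value, the address bits in which the
   two addresses differ are the suspect address lines. Returns 0 when the
   unit holds its own value. */
uint8_t suspect_addr_lines(uint8_t addr, uint8_t read_back)
{
    uint8_t source_addr = (uint8_t)(read_back - 1u);
    return (uint8_t)(addr ^ source_addr);
}
```

For example, with A0 and A1 shorted, units 00H, 01H and 02H share one physical cell, so unit 00H reads back 3 (written through address 02H) and suspect_addr_lines(0x00, 3) returns 02H, pointing at A1. A data-line fault can produce the same symptom, so the result is best combined with the data-line test before blaming a specific trace.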
The high-frequency characteristics of each board differ because of production process variations (board fabrication, materials, soldering, assembly, etc.) and operating conditions. Even the same board behaves differently under different conditions.
In summary, apart from testing the quality of the RAM itself, most of the code tests the reliability of the board hardware.
If you do not care about high-frequency characteristics, the traditional test method is nearly sufficient (although shorted data lines may go undetected if the test data is poorly chosen). Still, it should be recognized that the main object of a RAM test is not the quality of the RAM itself, but the board hardware and the lines connected to the RAM.

The above is a summary of my practical work experience, written up to share with you. If anything is wrong, please correct me!

Source program (pseudo code)
//TEST ROM
TestROM()
{//Use shift accumulation and checksum
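sum=0;
for(i=0;i<ROM_SIZE;i++){ //ROM_SIZE: number of ROM bytes (assumed name; the bound is missing in the listing)
sum=sum+rom[i]; //rom[i]: byte i of the ROM image
sum=sum>>1;
}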
if(sum==CHECKSUM) printf("ROM test OK!\n");
else printf("ROM test ERROR!\n");
}

//TEST RAM
TestRAM()
{
//Address line test
'0' slide;
'1' slide;
"All 0 becomes all 1";
"All 1 becomes all 0";
"Stick test";
optional "All 0 all 1 continuous high-speed change";

//Data line test
'0' slide;
'1' slide;
"All 0 becomes all 1";
"All 1 becomes all 0";
optional "All 0 all 1 continuous high-speed change"
}
