Intel and Alibaba Cloud work together to improve the reliability of DDR5 memory
Background
In Alibaba Cloud data centers, memory failure is one of the main challenges facing stable server operation. Memory failure in large-scale data centers not only reduces server reliability, but also may interrupt data center services and affect server performance. Therefore, memory reliability has become a key factor in server reliability, availability, and serviceability (RAS) in data centers.
The next-generation memory standard DDR5 has higher bandwidth, lower power consumption, and higher density. However, it also brings new challenges to memory reliability, including:
- DDR5 introduces a new architecture and signaling method, which require more complex circuit design and optimization;
- DDR5 memory modules have a larger capacity, which also increases the risk of failure;
- Although in-DRAM error correction code (ECC) can correct single-bit errors inside the memory device, it also hides those corrections from the host, which makes errors harder to observe.
To address these challenges, Alibaba Cloud worked with Intel to improve the reliability of DDR5 memory. Specific measures include:
1. Unified out-of-band (OOB) memory error data collection through the baseboard management controller (BMC): The BMC collects memory error data in a unified way, providing a data basis for subsequent analysis.
2. Built-in artificial intelligence-assisted (AI-assisted) fault analysis: AI-assisted analysis is integrated into the BMC to predict and analyze memory faults in real time.
3. Intel® Memory Resilience Technology (Intel® MRT): Intel® MRT has been deployed in Alibaba Cloud data centers to give early warning of potential memory failures and help prevent them.
4. Integration with the Alibaba Cloud Cruiser system: Memory health assessment and prediction alerts are integrated with Alibaba Cloud's server monitoring system to ensure business stability.
Challenges of memory reliability
Memory failures can be caused by many different types of low-level memory errors, such as single-bit errors (SBE), row-type errors, column-type errors, multi-array errors, and memory module (DIMM) errors. Each error type has its own characteristic frequency and impact. For example, some error types occur sporadically or intermittently, which makes them difficult to track, while others report errors continuously. Some error types carry a higher risk of uncorrectable errors (UE) and require immediate RAS measures, while others have a relatively low risk of triggering a UE but may cause a large number of correctable errors (CE) in a short period of time, which affects system performance. No single solution covers all memory errors.
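To make the idea of error types concrete, the following minimal sketch groups the correctable-error records of one DIMM into a coarse pattern based on where the errors fall. The record layout (`CeRecord`), the rule order, and the labels are assumptions made for illustration; they are not the taxonomy used by Intel® MRT or Alibaba Cloud.

```python
from collections import namedtuple

# Hypothetical record layout; real BMC logs carry more fields (rank, device, bit).
CeRecord = namedtuple("CeRecord", ["dimm", "bank", "row", "col"])


def classify_fault(records):
    """Classify the CE records of ONE DIMM into a coarse fault pattern.

    The rules below are illustrative only.
    """
    if not records:
        return "none"
    cells = {(r.bank, r.row, r.col) for r in records}
    if len(cells) == 1:
        return "single-cell"   # every error hits the same cell
    if len({r.bank for r in records}) > 1:
        return "multi-bank"    # errors spread across banks
    if len({r.row for r in records}) == 1:
        return "row"           # one bank, one row, several columns
    if len({r.col for r in records}) == 1:
        return "column"        # one bank, one column, several rows
    return "bank"              # one bank, several rows and columns
```

For example, two errors in the same bank and row but different columns are labeled `"row"`, while the same cell reported repeatedly is labeled `"single-cell"`.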
One traditional solution is to replace the failed DIMM after an uncorrectable error (UE) is observed. However, this approach cannot avoid the cost of the system crash that the UE has already caused. Another approach is to predict memory failures with a count-based correctable error (CE) rating strategy. This strategy is less effective at predicting complex memory failures because the occurrence of CE and UE depends not only on the state of the hardware fault, but also on the implicit runtime context, the ECC correction capability, and memory-specific failure modes. As a result, memory errors are highly uncertain and predicting UE is very difficult.
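For reference, a count-based CE strategy can be sketched as a sliding-window counter. This is a generic illustration of the technique, not the specific mechanism used in any production system; the class name, threshold, and window length are arbitrary assumptions.

```python
from collections import deque


class CeRateMonitor:
    """Flags a DIMM when more than `threshold` CEs arrive within `window_s` seconds."""

    def __init__(self, threshold=10, window_s=3600.0):
        self.threshold = threshold
        self.window_s = window_s
        self._times = deque()

    def record(self, timestamp):
        """Record one CE at `timestamp` (seconds); return True if the alarm fires."""
        self._times.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while self._times and timestamp - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) > self.threshold
```

The monitor sees only how many errors arrived, not where they occurred or under what conditions, which is why two DIMMs with very different fault patterns look identical to it.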
Although there is no universal solution, we can explore smarter ways to handle memory failures, for example by combining machine learning with real-time monitoring to predict the occurrence of UE and CE more accurately. Memory errors are a complex and critical issue, and optimizing system reliability and performance requires weighing multiple factors together.
BMC-based AI-assisted fault analysis helps improve the reliability of DDR5 memory
Alibaba Cloud and Intel jointly researched and developed a memory failure prediction and prevention solution for DDR5. The solution uses the BMC to collect memory error data in a unified manner, providing a data basis for subsequent analysis. Intel® MRT is integrated into the BMC to provide AI-assisted real-time prediction and analysis of memory failures, so that potential failures can be flagged and prevented in advance. Data collection, fault analysis, and warning are integrated with Alibaba Cloud's server monitoring system (Alibaba Cloud Cruiser System), providing Alibaba Cloud's data centers with fast and comprehensive hardware monitoring services to ensure business stability.
Figure 1. Solution architecture diagram
Key features of this solution include:
- BMC-based fine-grained memory fault collection
- Error analysis based on micro-level memory fault types
- AI-assisted fault analysis
An AI model was trained with machine learning methods on a large volume of DDR5 memory logs to predict memory failures. The pre-trained memory failure prediction model was integrated into the baseboard management controller (BMC), which provides real-time prediction and analysis of memory failures for servers, thereby reducing server downtime in large-scale data centers.
- Integration with the Alibaba Cloud Cruiser hardware fault detection system
Real-time memory health assessment and prediction alerts have been integrated with the Alibaba Cloud Cruiser system, providing fast and comprehensive hardware monitoring services for physical servers in Alibaba Cloud data centers.
Intel® Memory Resilience Technology
Intel® Memory Resilience Technology (Intel® MRT) is a technology designed to improve memory reliability in data centers. It enables data center operators to proactively predict potential memory failure risks and ensure the continuity of data center operations and workloads. The following are the key features of this technology:
1. Out-of-band fine-grained memory fault data collection: Collects fine-grained memory error data in a unified way and provides a data basis for subsequent analysis.
2. Memory fault analysis and location: Locates and analyzes memory faults at the lowest level.
3. Predictive failure alert: Detects possible memory failures in advance.
4. Prediction-based memory page offline: Based on the prediction, affected memory pages are taken offline to prevent the impact of potential failures.
5. Prediction-based memory fault area isolation: Based on the prediction and the corresponding RAS configuration of the system, the faulty memory area is isolated to avoid potential memory errors.
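As an illustration of feature 4, the sketch below shows one generic way a predicted faulty physical address range could be turned into page offline requests on Linux. The article does not describe which mechanism Intel® MRT actually uses, so this is an assumption: the example relies on the kernel's `soft_offline_page` sysfs file, which needs root privileges and a kernel built with memory failure handling. The function names and the 4 KiB page size are also assumptions.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_in_range(start_addr, end_addr, page_size=PAGE_SIZE):
    """Return the page-aligned physical addresses covering [start_addr, end_addr)."""
    if end_addr <= start_addr:
        return []
    first = start_addr - (start_addr % page_size)
    return list(range(first, end_addr, page_size))


def soft_offline(addresses, write=None):
    """Request soft offlining of each page; `write` can be injected for testing."""
    if write is None:
        def write(text):
            # Linux sysfs interface; requires root and memory failure support.
            with open("/sys/devices/system/memory/soft_offline_page", "w") as f:
                f.write(text)
    for addr in addresses:
        write(hex(addr))
```

Passing a custom `write` callable, such as `list.append`, allows a dry run that records the requests without touching the system.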
Intel® Memory Resilience Technology uses multidimensional models and artificial intelligence algorithms to detect memory failures at the micro level. It assigns a health score to each DIMM and detects potential failures in real time. By optimizing the memory failure prediction model through artificial intelligence analysis of massive memory error logs, the technology can accurately locate potential problems and identify and prevent memory failures before they occur.
While there is no universal solution to address all memory errors, Intel® Memory Resilience Technology provides data centers with an intelligent and comprehensive approach to optimize system reliability and performance.
Using BDAT data to diagnose hardware failures
Intel BIOS reference code implements system validation capabilities that can generate comprehensive system data, including memory margin data. This data is exposed through the BIOS Data ACPI Table (BDAT), which is published as a standard ACPI table. BDAT is a basic facility of the system BIOS: the data is generated throughout the BIOS boot process and integrated into the ACPI RSDT table. Analyzing BDAT data makes diagnosing and debugging problems on production systems considerably more efficient.
Results and analysis
Alibaba Cloud has deployed Intel® Memory Resilience Technology on thousands of platforms powered by 4th Generation Intel® Xeon® Scalable processors in Alibaba Cloud data centers under different workloads, and is in the process of upgrading the platforms to 5th Generation Intel® Xeon® Scalable processors.
The new generation of processors has more reliable performance and better energy efficiency. It can achieve significant performance gains per watt when running various workloads, and also has better performance and total cost of ownership (TCO) in AI, data centers, networks, and scientific computing. Compared with the previous generation, the 5th Generation Intel® Xeon® Scalable Processor provides higher computing power and faster memory within the same power consumption range. In addition, it is compatible with the software and platform of the previous generation, so the testing and verification work can be greatly reduced when deploying a new system.
Figure 2. The fifth generation Intel® Xeon® Scalable processor has more powerful performance
Preliminary results show that the solution can effectively predict uncorrectable errors (UE) before they occur, and can raise alerts on correctable error (CE) storms before the traditional count-based CE storm detection mechanism is triggered. The prediction lead time for UE and CE storm alerts varies from minutes to hours or even days, depending on the underlying fault model. After iteration, the solution is expected to predict 57% of UE and 74% of CE storms with the optimized DDR5 model (see footnote 6).
In addition to effective UE and CE storm prediction, the memory error data collected out of band (OOB) through the BMC is essential for further diagnosing and troubleshooting memory and system issues.
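The article does not state how the prediction percentages and lead times were computed. One common way to define them is recall over observed failures, where a failure counts as predicted only if an alert for the same DIMM was raised before it. The sketch below follows that assumed definition; the function name and data layout are hypothetical.

```python
def evaluate_predictions(alerts, failures):
    """Compute recall and per-DIMM lead time.

    alerts, failures: dicts mapping a DIMM id to a timestamp in seconds.
    A failure is counted as predicted only if an alert for the same DIMM
    was raised strictly before the failure.
    """
    lead_times = {}
    for dimm, failed_at in failures.items():
        alerted_at = alerts.get(dimm)
        if alerted_at is not None and alerted_at < failed_at:
            lead_times[dimm] = failed_at - alerted_at
    recall = len(lead_times) / len(failures) if failures else 0.0
    return recall, lead_times
```

Under this definition, an alert that arrives after the failure, or an alert on a DIMM that never fails, does not contribute to the predicted share.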
Figure 3. Efficient UE and CE storm prediction
Conclusion
Integrating Intel® Memory Resilience Technology into the BMC effectively improves the reliability of DDR5 memory in Alibaba Cloud data centers. For Alibaba Cloud, improving the total cost of ownership (TCO) of the entire data center is crucial. Intel and Alibaba Cloud are working together to develop the next generation of DDR5 fault prediction technology and to provide methods for new memory technologies.
1 Average performance improvement over 4th Generation Intel® Xeon® processors as measured by geometric mean of SPEC CPU rate, STREAM Triad, and LINPACK. See [G1]: 5th Generation Intel® Xeon® Scalable processors at intel.com/processorclaims. Results may vary.
2 1.19x to 1.42x performance improvement over 4th Gen Intel® Xeon® processors (ResNet50v1.5, BERT-Large, SSD-ResNet34, RNN-T (BF16 only), Resnext101 32x16d, MaskRCNN (BF16 only), DistilBERT). See [A15-A16]: 5th Gen Intel® Xeon® Scalable processors at intel.com/processorclaims. Results may vary.
3 See [G12]: 5th Generation Intel® Xeon® Scalable Processors at intel.com/processorclaims. Results may vary.
4 See [G11]: 5th Generation Intel® Xeon® Scalable Processors at intel.com/processorclaims. Results may vary.
5 Measured with 1.46x to 10.6x better performance per watt on AI, data, and network workloads using built-in accelerators. See [A19-A25], [D1], [D2], [D5], and [N16]: 5th Generation Intel® Xeon® Scalable Processors at intel.com/processorclaims. Results may vary.
6 Data cited from internal test results as of November 2023. Test configuration: Intel® Xeon® 8475B processor and Intel® Xeon® 8575C processor, 32*64GB, 16*64GB, 16*32GB, 2*240GB SSD.
Intel does not control or audit third-party data. Please review the content, consult other sources, and confirm that the data mentioned is accurate.
Actual performance will vary based on usage, configuration, and other factors. See www.Intel.com/PerformanceIndex for more information. Performance results are based on testing as of the date shown in the configuration information and may not reflect all publicly available security updates. See configuration information disclosure for more information. No product or component can be absolutely secure.
Specific costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
INTEL DISCLAIMS ALL EXPRESS AND IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT, AND ANY WARRANTIES ARISING FROM COURSE OF PERFORMANCE, COURSE OF DEALING, OR USAGE OF TRADE.
© Intel Corporation. Intel, the Intel logo, and other Intel trademarks are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.