A 10,000-word deep dive into Apple's A12 chip
Source: This article is translated from AnandTech by the public account Semiconductor Industry Observer (ID: icbank).
Over the past few years, Apple's chip design team has been at the forefront of architecture design and manufacturing processes. The Apple A12 is another generational leap for the company, as it claims to be the first commercial 7nm chip.
When talking about process nodes, smaller numbers generally mean smaller transistors. While the link between a node's name and actual physical feature sizes has lost much of its meaning in recent years, a new node still represents a leap in density, allowing vendors to pack more transistors into the same die area.
Thanks to TechInsights publicly sharing images of the Apple A12 die, we were subsequently able to publish our first analysis and review of the die photo:
Apple A12 chip die photo (Source: TechInsights)
In this article I revisit the A12 and add my own labels and interpretations to the die photo. The new A12 largely follows Apple's established SoC layout (rotated 90 degrees compared to most past die shots).
On the right we see the GPU complex, with the four GPU cores and their shared logic in the middle. The CPU complex sits at the bottom: the two big Vortex CPU cores are separated by a large L2 cache, and the four small CPU cores with their own L2 cache sit next to them, to the left of center.
The 4 large blocks of SRAM in the middle are part of the system cache, an SoC-wide cache layer that sits between the memory controller and the internal system interconnect and the memory subsystems of the various blocks. Apple uses this block as a power-saving feature: since memory transactions to DRAM are very expensive in terms of energy, keeping data on-chip saves a lot of power, with the added benefit of potentially improving performance thanks to data locality.
The A12's system cache sees by far its biggest change since the block was first introduced with the Apple A7. The big change in layout also suggests a big change in the block's functionality, as we now clearly see it split into 4 distinct sections. In previous Apple SoCs such as the A11 and A10, the system cache looked more like one logical block with what appeared to be two sections. The doubling of sections could point to a big change in the memory performance of this block, something I'll analyze in more detail later in this article.
The last major introduction in the A12 is a big improvement in the neural network accelerator IP. Apple says it has moved from the dual-core design of the A11 to a new 8-core design. It's worth noting that Apple never said during the presentation that this was an in-house design, even though the marketing materials are always quick to point out the other IP blocks of the SoC.
Last year's design was rumored to be CEVA IP, but we never got full confirmation because Apple didn't want it to be known. The A12 moves to an 8-core design quoted at a 4x performance increase, though the actual increase is closer to 8x, going from 600 GigaOPs on the A11 to 5 TeraOPs on the A12. In the die photo we see 8 MAC engines surrounding a large central cache, with what is likely shared logic on top for fixed-function and fully connected layer processing.
Looking at how the sizes of the various blocks change from the A11 to the A12, we see the benefits of TSMC's new 7nm process node. It's worth noting that almost all of the IP blocks have changed, so a straight A11 vs A12 comparison isn't a valid way to determine how much density has improved with the new node. That said, a single GPU core is a likely candidate (since the structure we see is essentially the same), and here the A12's core is 37% smaller than the A11's. It's clear that the new node enabled Apple to add an additional GPU core, yet in absolute terms the A12's GPU block is still smaller.
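A rough back-of-the-envelope check of that last point, ignoring the shared logic (my own arithmetic based on the 37% figure above, not a measured number):

\[
4 \times (1 - 0.37) \approx 2.5 \ \text{A11-core equivalents} \;<\; 3,
\]

so even with a fourth core, the A12's GPU cores together occupy roughly 16% less area than the A11's three cores did.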
Bigger CPUs and massive cache hierarchies
Source: TechInsights’ Apple A12 die photo, ChipRebel’s Apple A11 die photo
Moving on to the CPU complex, and specifically the new big CPU cores, we now see what may be the biggest change in CPU layout across Apple's chip generations. In particular, the L1 data cache of the new Vortex CPUs has doubled from 64KB to 128KB. On the front end we also see twice the SRAM blocks, which I attribute to the L1 instruction cache; I now believe it must have doubled to 128KB as well. Interestingly, even now, several years later, we still don't really understand what the A10 introduced in its front-end block: there we saw a new, very large cache block whose exact function is still unclear.
A big question for many years has been what Apple's cache hierarchy actually looks like. Plotting memory latency against test depth, we can see distinct jumps at different depths. I haven't labeled the latency figures here because we'll see them again later in the non-log version of this graph.
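The kind of curve discussed here comes from a pointer-chasing latency test: the working set is filled with a randomly shuffled chain of pointers, and the average time per hop reveals which cache level the set still fits into. Below is a minimal sketch of the technique in C - an illustration of the method, not the actual tool used for the article's measurements (buffer sizes and iteration counts are arbitrary).

```c
/* Minimal pointer-chasing latency sketch (not the article's actual tool).
 * For each test depth, a buffer is filled with a random cyclic chain of
 * pointers; chasing the chain defeats prefetchers, so the time per hop
 * approximates the load-to-use latency of whichever cache level the
 * working set fits into. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t bytes, size_t hops)
{
    size_t n = bytes / sizeof(void *);
    void **buf = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));

    for (size_t i = 0; i < n; i++)            /* identity permutation */
        idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)            /* build one cyclic chain */
        buf[idx[i]] = &buf[idx[(i + 1) % n]];

    void **p = &buf[idx[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++)         /* the actual pointer chase */
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    free(buf); free(idx);
    return p ? ns / hops : 0.0;               /* use p so the chase isn't dead code */
}

int main(void)
{
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        printf("%6zu KB : %.2f ns per access\n",
               kb, chase(kb * 1024, 10 * 1000 * 1000));
    return 0;
}
```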
On the big-core side we clearly see the L1$ jump from 64KB to 128KB, and I think there is no doubt about the increase here. Moving into the L2 cache, however, we see some odd latency characteristics: beyond roughly 3MB, latency slowly keeps rising until around 6MB. It's worth noting that this slowly increasing latency past 3MB only shows up with fully random access patterns; with smaller access windows, latency stays flat all the way to 6MB.
We'll leave that aside for now and move on to the 6MB+ region served by the system cache. It's hard to make out at first because of the skew caused by the overall low latency, but the latency curve keeps rising for roughly another 4MB of depth before we hit DRAM latency. This is consistent with what we actually see on the die: the new system cache not only doubles its number of sections, it also doubles in capacity, from 4MB to 8MB.
We move on to the little cores, and things get a little more complicated. At first glance, you'd believe that the A11's little core L2 was limited to 512KB, while the A12 is up to 1.5MB, however I think we're being tricked by the cache power management strategy. Looking at the A11 Mistral core latency, we can see clear jumps at 768KB and 1MB. A similar jump can be seen at 2MB for the A12 cores.
At this point, it's best to go back to the die photo and do some pixel math, which gives us the following table:
The big-core L2 shows no structural change between the A11 and A12; both have 128 SRAM macro instances, divided into two groups. The open question is that if the L2 really were only 6MB, each SRAM macro would have to hold an odd 48KB.
Looking at the little cores, we see they use the same SRAM macros. The A12's little-core L2 has grown from 16 instances to 32, so there must be a doubling here. However, since the measured L2 depth appears to have grown by at least a factor of three, something else must be going on. The measured figures are clearly not representative of the physical hardware, and we can confirm this by running the latency test in a special way that makes power management believe it is only a small workload: in that case, the A12's Tempest cores appear to have only 512KB available.
The conclusion is that Apple powers down parts of the cache at a per-bank granularity. On the A12 each little-core L2 bank is 512KB, while on the A11 it is 256KB. This makes me more confident that there is physically 2MB on the A12 and 1MB on the A11; our test simply may not meet the policy requirements for the full cache to be powered up.
In turn, since this points to each SRAM instance being 64KB, we can go back and revisit our assumptions about the big-core L2. Again, one would think it stops at 6MB, but looking closely, especially on the A12, the characteristics change again at around 8MB. So the big cores may likewise have 8MB of physical cache, with access behavior changing significantly once we approach the full capacity.
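Taking the working assumption above that each SRAM macro is 64KB, the die-photo counts line up neatly with the capacities argued here:

\[
\begin{aligned}
\text{big-core L2:} &\quad 128 \times 64\,\text{KB} = 8\,\text{MB} \\
\text{A12 little-core L2:} &\quad 32 \times 64\,\text{KB} = 2\,\text{MB} \\
\text{A11 little-core L2:} &\quad 16 \times 64\,\text{KB} = 1\,\text{MB}
\end{aligned}
\]

whereas a 6MB big-core L2 would require awkward 48KB macros.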
The point here is that Apple's caches are huge, and the A12 goes even further in this regard, doubling the system cache size. In practice, we have about 16MB of cache hierarchy available on the big CPU cores - a massive amount that simply dwarfs the memory and cache subsystems of competing SoCs.
Evolved GPU
On the GPU side, we had big expectations for the A12, not only in terms of performance but also architecture. Last year Imagination issued a press release saying that Apple had informed it that Apple planned to stop using its IP in new products within 15 to 24 months. This ultimately led to a collapse in Imagination's share price, and the company was subsequently sold to an equity firm.
So, despite Apple's claim that the A11 GPU is an in-house design, it still looks like an Imagination-derived design: its block layout is very similar to the previous Rogue generation, the biggest difference being that each so-called core is now a larger structure encompassing what used to be two cores. It is, in fact, still a TBDR (Tile-Based Deferred Renderer), an approach on which IMG holds many patents, and importantly Apple still openly supports PVRTC (PowerVR Texture Compression, a proprietary format), which suggests the GPU may still be tied to IMG's IP. We might therefore still consider it an architecture-licensed design rather than what we'd usually call a "clean" in-house design.
Source: TechInsights’ Apple A12 die photo, ChipRebel’s Apple A11 die photo
Moving on to the A12 GPU, model numbered G11P, we see some very clear similarities with last year's A11 GPU. The various functional blocks appear to be largely located in the same locations and constructed in a similar manner.
I think the biggest advance in the Apple A12 GPU is support for memory compression. I was very surprised to hear this at the launch event because it implies two things at once: previous Apple SoCs and GPUs evidently did not have memory compression, and adding it alone is enough to significantly improve the new GPU's performance.
By memory compression I specifically mean transparent framebuffer compression between the GPU and main memory. In the desktop space, vendors like Nvidia and AMD have had this feature for years; it improves GPU performance without requiring any increase in raw memory bandwidth. Smartphone GPUs need memory compression even more, not only because of the limited bandwidth of mobile SoCs, but above all because of the power savings that come with lowering bandwidth requirements. ARM's AFBC has been the most publicly discussed mechanism in the mobile space, but others such as Qualcomm and indeed Imagination have their own implementations.
Apple appears to be late to introduce this feature with the A12, but it also means that the A12 will benefit from a huge generational boost in efficiency and performance, which is significant considering Apple's claims of a major increase in the new GPU.
A12 Vortex CPU Tour
When talking about the Vortex microarchitecture, the first thing we need to discuss is the frequencies we see on Apple's new SoCs. Over the past few generations, Apple has been steadily increasing the frequencies of its big cores, while also improving the IPC of the microarchitecture. I did a quick test of the frequency characteristics of the A12 and A11 and came up with the following table:
The maximum frequencies of the A11 and A12 are actually single-threaded boost clocks: 2380MHz for the A11's Monsoon cores and 2500MHz for the A12's new Vortex cores. For single-threaded workloads this is only a 5% frequency increase. When a second big thread is added, the A11 and A12 drop to 2325MHz and 2380MHz respectively. When threads also run on the small cores at the same time, the two SoCs diverge: the A11 drops further to 2083MHz, while the A12 stays at 2380MHz until it eventually hits its thermal limits and has to throttle.
On the small-core side, the new Tempest cores are actually clocked more conservatively than the previous Mistral. On the A11, when only one small core is active its maximum frequency is boosted to 1694MHz; this boost is gone on the A12, where the maximum is 1587MHz. With all four small cores loaded, the frequency drops further to 1538MHz.
Greatly improved memory latency
As mentioned above, it is clear that Apple has put a lot of work into the cache hierarchy and memory subsystem of the A12. Returning to the linear latency graph, we can see that the completely random latency for the big cores and the little cores has the following characteristics:
The Vortex cores have only seen a 5% frequency increase over the Monsoon cores, yet the absolute L2 memory latency has been reduced from 11.5ns to 8.8ns, a roughly 23% reduction. This means the new Vortex cores' L2 cache now completes operations in fewer cycles. On the Tempest side, L2 cycle latency appears unchanged, but there have been big changes in L2 partitioning and power management that allow access to a larger portion of the physical L2.
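As a rough conversion using the peak clocks quoted earlier (2380MHz for Monsoon, 2500MHz for Vortex) - my own arithmetic, not a figure from the measurements:

\[
11.5\,\text{ns} \times 2.38\,\text{GHz} \approx 27\ \text{cycles} \quad\longrightarrow\quad 8.8\,\text{ns} \times 2.5\,\text{GHz} = 22\ \text{cycles},
\]

i.e. the Vortex L2 appears to shave roughly five cycles off the load-to-use latency despite the higher clock.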
I've only tested depths up to 64MB, and it's clear the latency curve hasn't fully flattened out within that data set, but DRAM latency has visibly improved. The memory controller likely doesn't reach its maximum DVFS state when only the small cores are active, which could explain the larger variance in DRAM access latency on the Tempest cores - they fare better when big threads are running on the big cores at the same time.
The A12's system cache has changed dramatically in its behavior. While bandwidth to this part of the cache hierarchy has been reduced compared to the A11, latency has been greatly improved. Much of that can be attributed to the L2 prefetchers, and I also see the possibility of prefetchers on the system-cache side: both the latency behavior and the number of streams being prefetched have improved.
Instruction throughput and latency
To characterize the Vortex back-end we tested instruction throughput and latency: throughput is bounded by the number of execution units available for a given operation, while latency is a result of how each unit is designed.
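The method behind such tests is simple: a long chain of dependent operations exposes latency, while several independent chains expose throughput. Below is a minimal sketch of the idea in C - an illustration of the technique, not the harness actually used for the article (a real harness would write the hot loops in inline assembly and check the disassembly to keep the compiler from reordering or folding them; iteration counts are arbitrary).

```c
/* Latency vs. throughput microbenchmark sketch.
 * Latency: one dependent multiply-add chain, so every operation must wait
 * for the previous result. Throughput: four independent chains that an
 * out-of-order core can execute in parallel. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 200000000ULL

static double now(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

int main(void)
{
    uint64_t a = 1, b = 2, c = 3, d = 4, e = 5;

    /* latency: ns per op approximates the dependent-chain latency */
    double t0 = now();
    for (uint64_t i = 0; i < ITERS; i++)
        a = a * 3 + i;                       /* depends on previous a */
    double lat_ns = (now() - t0) / ITERS * 1e9;

    /* throughput: independent chains expose how many pipelines can run at once */
    t0 = now();
    for (uint64_t i = 0; i < ITERS; i++) {
        b = b * 3 + i;                       /* four independent chains */
        c = c * 3 + i;
        d = d * 3 + i;
        e = e * 3 + i;
    }
    double thr_ns = (now() - t0) / (ITERS * 4) * 1e9;

    printf("dependent chain : %.2f ns/op (latency-bound)\n", lat_ns);
    printf("independent ops : %.2f ns/op (lower = more usable pipelines)\n", thr_ns);
    return (int)((a ^ b ^ c ^ d ^ e) & 1);   /* keep results live */
}
```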
The Vortex core looks very similar to the previous Monsoon (A11), except that it appears to have gained new divider units: divide latency has been cut by 2 cycles on both the integer and FP sides, and FP divide throughput has doubled.
Monsoon (A11) was itself a significant microarchitectural update in the core's mid-section and back-end. It was there that Apple widened the Hurricane (A10) microarchitecture from 6-wide to 7-wide decode, and the most important back-end change was the addition of two integer ALU pipelines, going from 4 to 6 units.
Monsoon (A11) and Vortex (A12) are extremely wide machines - with six integer execution pipelines (two of which are complex units), two load/store units, two branch ports, and three FP/vector pipelines, giving an estimated 13 execution ports, far wider than ARM's upcoming Cortex-A76 and wider than Samsung's M3. In fact, assuming no atypical port sharing, Apple's microarchitecture appears to be far wider than anything else out there, including desktop CPUs.
SPEC2006 performance: reaching desktop level
It's been a while since we last ran SPEC on iOS devices - for various reasons we weren't able to keep it going over the past few years. I know many of you were hoping we could pick up where we left off, and I'm happy to report that I've spent some time getting SPEC2006 up and running again.
SPEC2006 is an important industry-standard benchmark suite that differs from other workloads in that it processes much larger and more complex data sets. Although GeekBench 4 has become a popular industry benchmark - and I applaud the effort to build a truly cross-platform benchmark - we must keep in mind that its program and data footprints are still comparatively small. SPEC2006 is therefore a better benchmark for exposing more details of a given microarchitecture, particularly the performance of the memory subsystem.
The following SPEC numbers are estimates, as they have not been submitted to and formally verified by SPEC. The benchmarks were compiled with the following settings:
- Android: Toolchain: NDK r16 LLVM compiler; Flags: -Ofast, -mcpu=cortex-a53
- iOS: Toolchain: Xcode 10; Flags: -Ofast
On iOS, 429.mcf is a problem because the kernel memory allocator frequently refuses the single large ~1.8GB allocation the program needs (even on the new 4GB iPhones). I modified the benchmark to use only half the arcs, which reduced the memory footprint to about 1GB. I measured the resulting reduction in runtime on several platforms and applied a corresponding scaling factor to the iOS score, which I estimate to be accurate to within ±5%. The remaining workloads were verified manually and confirmed to execute correctly.
Performance measurements were run in a controlled setup (i.e. with a desktop fan cooling the phone), ensuring that thermals were not an issue during the 1-2 hours it takes to complete a full set of runs.
In terms of numbers, I looked to articles from earlier this year, such as our evaluation of the Snapdragon 845 and Exynos 9810 in our Galaxy S9 review.
When reading the performance and efficiency results, there are three metrics to keep in mind. Benchmark performance (based on runtime) is shown on the right axis, growing from the right: the longer the bar, the better the SoC/CPU performs. The label shows the estimated SPECspeed score.
On the left axis, the bars represent the energy used for a given workload. Longer bars mean the platform used more energy; shorter bars mean a more energy-efficient platform. The labels show average power (in Watts), an important secondary metric for thermally constrained devices, and total energy (in Joules), which is the primary efficiency metric.
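To make the relationship between those two labels concrete (the figures in this example are made up purely for illustration): total energy is just average power multiplied by runtime,

\[
E = \bar{P} \times t, \qquad \text{e.g.}\ 3.6\,\text{W} \times 500\,\text{s} = 1800\,\text{J},
\]

so a chip that draws more power can still end up the more efficient one if it finishes the workload proportionally faster.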
The data is arranged in the order shown in the legend, with different colors representing different SoC vendors and different generations. I have listed data for the Apple A12, A11, Exynos 9810 (2.7 and 2.3GHz), Exynos 8895, Snapdragon 845, and Snapdragon 835. This gives us an overview of all relevant CPU microarchitectures over the past two years.
We start with the SPECint2006 workload:
The A12 is clocked 5% higher than the A11 in most workloads, but we have to remember that we can't really lock the frequencies on an iOS device, so this is just an assumption of the runtime clocks during benchmarking. In SPECint2006, the A12 performs 24% better than the A11 on average.
The smallest increases were seen in 456.hmmer and 464.h264ref - the two tests in the suite most heavily bottlenecked on execution throughput. Since the A12 doesn't appear to have changed much in this regard, the small gains come mainly from the higher frequency and the improvements to the cache hierarchy.
The improvement in 445.gobmk is quite large at 27% - this workload is characterized by bottlenecks on store-address events as well as branch mispredictions. I did measure some significant changes in how the A12 handles stores to cache lines, while branch prediction accuracy didn't change significantly.
403.gcc, 429.mcf, 471.omnetpp, 473.astar, and 483.xalancbmk are all sensitive to the memory subsystem, and here the A12 posts staggering gains of 30% to 42%. Clearly the new cache hierarchy and memory subsystem paid off in a big way, giving Apple one of its biggest performance jumps in recent chip generations.
Measuring power efficiency, we see the A12 is 12% better overall, but remember that this is a 12% reduction in energy consumption at peak performance while also delivering 24% higher performance - the two SoCs sit at quite different points on their performance/power curves.
In the benchmarks with the largest performance gains (i.e. the memory-bound workloads mentioned above), we also see a noticeable rise in power consumption. So despite the power savings promised by the 7nm process, Apple chose to spend more power than the new node saved, and average power in SPECint2006 rose from 3.36W on the A11 to 3.64W on the A12.
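Those figures are mutually consistent; as a quick sanity check of my own (not a measured result): with performance up 24%, runtime shrinks to about $1/1.24 \approx 0.81$ of the A11's, so

\[
\frac{E_{\text{A12}}}{E_{\text{A11}}} \approx \frac{3.64\,\text{W}}{3.36\,\text{W}} \times 0.81 \approx 0.87,
\]

which matches the roughly 12% energy reduction quoted above.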
Next we move on to SPECfp2006, where we only run the C and C++ benchmarks: Xcode provides no Fortran compiler, and getting one working on Android is complicated since Fortran isn't part of the NDK, whose bundled GCC is deprecated.
SPECfp2006 contains far more memory-intensive tests: of the 7 tests we run, only 444.namd, 447.dealII, and 453.povray avoid major performance penalties when the memory subsystem isn't up to par.
This naturally plays to the A12's strengths, as the average gain in SPECfp is 28%. 433.milc stands out with a 75% performance increase. This benchmark is characterized by being limited by memory stores, which again points to the big improvements Vortex has made in that area. The same analysis applies to 450.soplex, where the combination of the excellent cache hierarchy and better store performance yields a 42% gain.
470.lbm is an interesting workload for the Apple CPUs, showing a multi-factor performance advantage over the ARM and Samsung cores. Curiously, Qualcomm's Snapdragon 820 Kryo CPU still outperforms more recent Android SoCs here. 470.lbm is characterized by large loops in its hottest code. A microarchitecture can optimize such workloads with a (larger) instruction loop buffer: on each loop iteration the core bypasses the decode stages and fetches the instructions from the buffer instead. Apple's microarchitecture appears to have some such mechanism. Another explanation lies in the vector execution performance of the Apple cores - the hot loops in lbm make heavy use of SIMD, and Apple's 3x execution throughput advantage there could also be a major contributor.
Similar to SPECint, the SPECfp workloads that saw the biggest performance jump also saw an increase in power consumption: 433.milc went from 2.7W to 4.2W, while also improving performance by 75%.
Overall, power consumption jumped from 3.65W to 4.27W. Energy efficiency still improved in every test except 482.sphinx3, where the A12 reached its highest power draw of any SPEC workload at 5.35W. Across SPECfp2006, the A12's total energy consumption was 10% lower than the A11's.
I didn't have time to go back and measure power on the A10 and A9, but they were generally around 3W in SPEC. I did run the performance benchmarks, and here's a comprehensive performance overview of the A9 through A12 alongside the latest Android SoCs, for those of you looking to compare past Apple generations.
Overall, the new A12 Vortex cores, along with architectural improvements to the SoC's memory subsystem, give Apple's new chip a much larger performance advantage than Apple's marketing materials suggest. Apple's advantage over the best Android SoCs is significant, both in performance and power efficiency. Apple's SoC is more power efficient than all recent Android SoCs, and has a nearly 2x performance advantage. If we normalize for energy use, I wouldn't be surprised to see Apple's performance efficiency lead reach 3x.
This also puts this year's Samsung M3 core into perspective: spending more energy only makes sense if it buys commensurate performance while total energy stays under control. Here, the Exynos 9810 consumes twice as much energy as last year's A11 while trailing it in performance by 55%.
Meanwhile, ARM’s Cortex A76 is slated to make its way into the Kirin 980 in a few weeks as part of the Huawei Mate 20. I promise we’ll get proper testing done for the new flagship and add it to our current SoC performance and efficiency charts.
Surprisingly, Apple's A11 and A12 are getting very close to current desktop CPUs. I haven't yet had the chance to run the workloads in a more directly comparable manner, but based on the latest data provided by our site editor Johan De Gelas earlier this summer, the A12 outperforms a moderately fast Skylake CPU in single-threaded performance. Of course, compiler factors and various frequency issues have to be taken into account, but we are now talking about very small margins before Apple's mobile SoCs outperform the fastest desktop CPUs in single-threaded performance. It will be interesting to gather more accurate data on this in the coming months.
System Performance
While synthetic benchmark performance is one thing - and hopefully SPEC covers that ground well - interactive performance in real-world usage behaves differently, and software can play a big role in how performance is perceived.
I must admit that our suite of iOS system performance tests looks pretty thin: we are left with only web browser tests, as iOS lacks a meaningful equivalent of PCMark on Android.
Speedometer 2.0 is the latest industry-standard JavaScript benchmark that tests the performance of the most common and modern JS frameworks.
The A12 represents a massive 31% performance jump over the A11, again suggesting that the figures in Apple's marketing materials understate what the new chip actually delivers.
We also see a small boost on previous-generation devices running iOS 12. This is due not only to changes in how the iOS scheduler handles load, but also to continued improvements in Apple's ever-evolving JavaScript engine.
WebXPRT 3 is also a browser test, but its workload is more extensive and diverse, including a lot of processing tests. Here, the iPhone XS shows an 11% advantage over the iPhone X, which is slightly smaller than the advantage in the Speedometer 2.0 test.
Previous devices also saw steady gains: the iPhone X's score rose from 134 to 147, about 10 percent, while the iPhone 7's A10 improved by a massive 33 percent - something we'll look at in more detail below.
iOS 12 scheduler load ramp analysis
Apple is promising significant performance gains in iOS 12 thanks to the way their new scheduler calculates the load of individual tasks. The OS's kernel scheduler tracks the execution time of threads and aggregates it into a utilization metric which is then used by mechanisms such as DVFS. The algorithm that determines how this load changes over time is usually a simple software decision - it can be tuned and designed as the vendor sees fit.
Because the iOS kernel is closed source, we can't really see what the changes are, but we can measure their effects. A relatively simple way to do this is to track the frequency from idle to maximum performance in a workload. I ran this test on iPhones 6 to X (and XS) before and after the iOS 12 system upgrade.
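The measurement idea is straightforward: starting from idle, execute fixed chunks of work back to back and timestamp each one; as the scheduler's load tracking drives DVFS up, the chunks complete faster, and the curve of chunk time versus elapsed time reveals the ramp. Below is a minimal sketch of such a probe in C - an illustration of the method, not the tool used for the article's figures (chunk size and iteration counts are arbitrary and would need tuning per device).

```c
/* Minimal DVFS ramp probe sketch. Starting from idle, repeatedly run a fixed
 * chunk of work and timestamp it; as the scheduler's load tracking ramps the
 * CPU frequency up, each chunk completes faster. Plotting chunk time against
 * elapsed time shows how long the governor takes to reach peak. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define CHUNK 2000000ULL   /* iterations per work chunk (tune per device) */

static double now_ms(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1e3 + t.tv_nsec * 1e-6;
}

static uint64_t work(uint64_t seed)
{
    /* dependent multiply-add chain: runtime scales with 1/frequency */
    for (uint64_t i = 0; i < CHUNK; i++)
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
    return seed;
}

int main(void)
{
    double start = now_ms();
    uint64_t sink = 1;
    for (int n = 0; n < 200; n++) {
        double t0 = now_ms();
        sink = work(sink);
        double t1 = now_ms();
        /* elapsed time since idle vs. how long this chunk took */
        printf("%8.1f ms  chunk %6.2f ms\n", t1 - start, t1 - t0);
    }
    return (int)(sink & 1);   /* keep the result live */
}
```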
We start with an iPhone 6 with an A8 chipset, and I got some strange results on iOS11 because the ramp-up characteristics from idle to top performance were very unusual. I repeated this a few times, but the results were the same. The A8 CPU was at 400MHz at idle and stayed there for 110ms until it jumped to 600MHz, then stayed there for another 10ms before going to 1400MHz at top performance.
The iOS 12 system exhibits a more step-by-step behavior, starting to rise earlier and reaching peak performance after 90ms.
The iPhone 6S has significantly different ramp-up characteristics on iOS 11, and the A9 chip has a very slow DVFS. Here, the CPU takes a total of 435ms to reach its maximum frequency. With the update to iOS 12, this time has been drastically reduced to 80ms, greatly improving performance under shorter interactive workloads.
I was surprised to see how slow the scheduler was before, and this is a problem with current Samsung Exynos chipsets and other Android SoCs that don't optimize the scheduler. While the hardware performance may be there, it doesn't show up in short interactive workloads because the scheduler load tracking algorithm is so slow.
The A10 has similar shortcomings to the A9, taking over 400ms to reach peak performance. Under iOS 12 the iPhone 7 cuts this time roughly in half, to around 210ms. It's odd that the A10 is more conservative here than the A9, but this may have something to do with its small cores.
In this graph we can also see the frequencies of the small Zephyr cores, which start at 400MHz and peak at 1100MHz. The curve then drops back to 758MHz because at this point the workload migrates to the big cores, whose frequency then continues to rise until maximum performance is reached.
On the Apple A11, I didn’t see any significant changes, and in fact any differences are likely random noise from measuring different firmwares. In both iOS11 and iOS12, the A11 ramps up to full frequency in about 105ms. Note that the x-axis in this graph is much shorter than in the previous graph.
Finally, on the iPhone XS's A12 chipset we couldn't measure any before-and-after difference, since the iPhone XS ships with iOS 12. Here we again see peak performance reached after about 108ms, and we can see the workload migrate from the Tempest cores over to the Vortex cores.
All in all, I hope this is the best and clearest demonstration yet of the performance difference iOS 12 brings to older devices.
As far as the iPhone XS is concerned, I have no complaints about its performance - it is fast. I have to admit I'm still an Android user, and I turn animations off completely on my phones because I find they get in the way of how fast a device feels. iOS can't fully disable animations, and while this is purely my subjective opinion, I find they noticeably hold back the phone's perceived real-world speed. In workloads that aren't gated by animations, the iPhone XS simply powered through the tests without any issues or anomalies.
GPU Performance
The performance improvement of the A12's GPU was one of the biggest highlights of the presentation: a 50% gain over the A11's GPU. Apple achieved this by "simply" adding a fourth GPU core to the A11's three-core configuration, and by introducing memory compression on the GPU. I believe memory compression is the single biggest contributor to the GPU's microarchitectural performance gain, as it is essentially a huge one-time step improvement that admittedly took Apple a long time to introduce.
Before getting into the benchmarks, I want to mention that peak performance and peak power consumption have become an issue with the latest Apple GPUs. Apple has gone from being quite consistent in its performance to being one of the worst offenders in terms of how far performance drops from its peak under sustained load. There are reasons for this, which I'll get to shortly.
The 3DMark Physics test is primarily a CPU limited test that also stresses the overall platform power limits while the GPU is working. We see that the iPhone XS and A12 have made great strides compared to last year's iPhones. This is a test that has been particularly problematic for Apple CPUs in the past, however this microarchitectural hiccup seems to have been resolved in the A11 and Monsoon cores. The Vortex cores and the always improving SoC power efficiency have further improved performance, finally matching ARM's cores in this particular test.
In the graphics portion of the 3DMark test, the iPhone XS delivers a 41 percent improvement in sustained performance over last year's iPhone X. In this particular test, the OnePlus 6's more generous thermals still let the Snapdragon 845 perform better than the newer chip.
In terms of peak performance I ran into some big issues in 3DMark: I was completely unable to complete a single run on a cool iPhone XS or XS Max. If the device is cool enough, the GPU clocks up to such high performance that it actually crashes, and I could reproduce this consistently, over and over. Measuring power during these attempts, the platform averaged 7-8 watts, with instantaneous figures likely going above 8W in a way my measurement method cannot capture. My suspicion for the GPU crashes is that during a run the power delivery cannot supply the necessary transient currents, and the resulting voltage droop causes the GPU to crash.
Only by repeating the test several times and heating up the SoC, so that it started the run at a lower GPU frequency, was I able to get the test to complete successfully.
GFXBench Test
Kishonti recently released the new GFXBench 5 Aztec Ruins test, which brings a newer, more modern, and more complex workload to our test suite. In an ideal world, we would test real games, but this is incredibly difficult on mobile devices because basically no games have built-in benchmark modes. There are tools that can collect FPS values, but the biggest problem here is the repeatability of the workload when playing the game manually, which is also a big problem with many online games today.
I still think synthetic benchmarks have a very solid place here, as long as you understand the nature of each benchmark. Kishonti's GFXBench has been the industry standard for years, and the new Aztec test gives us a different kind of workload. The new test is much more shader-heavy and uses more complex effects to stress the GPU's processing power. Although the data in the table above was collected on a Mali G72 GPU, it still provides a general expectation for other architectures. The new test is also very bandwidth-hungry thanks to its larger textures.
Generally speaking, how a game relates to a benchmark depends on its mix of graphics workloads: whether it is heavy on fill rate or texturing, whether it has complex geometry, or whether it simply uses increasingly complex shading effects that demand more processing power from the GPU.
Aztec Ruins Normal mode is a new, less demanding test where the new Apple A12 phones demonstrated extremely high peak performance, up 51 percent from last year’s iPhones.
In terms of sustained performance, the numbers drop off quickly after a few minutes and stabilize further afterwards. At this point, the iPhone XS performs 61% better than the iPhone X. The Apple A12 is also able to beat the current leader, the Snapdragon 845 in the OnePlus 6, in sustained performance by 45%.
In Advanced mode in Aztec Ruins, we see a strikingly similar performance ranking. Peak performance is again excellent for the iPhone XS, but it’s the sustained scores that matter. At this point, the iPhone XS outperforms the iPhone X by 61%. The performance differential for the OnePlus 6’s Snapdragon 845 drops to 31% here, a bit lower than in Normal mode, and we may be hitting some bottlenecks in some aspects of the microarchitecture.
GPU Power
Platform and GPU power on Apple devices is something I've wanted to publish for a while, but it has been complicated to measure. I was able to get reasonable data for the new iPhone XS, but figures for older SoCs will have to wait for another opportunity.
I didn't have time to test Aztec Ruins on a variety of devices, so we still rely on the standard Manhattan 3.1 and T-Rex. First, let's list the test results:
In Manhattan 3.1, the new iPhone XS delivers 75% higher performance than the iPhone X. The improvement here comes not only from the GPU's microarchitectural changes, the extra core, and the SoC's new process node, but also from the new memory compression, which reduces the power consumed by external DRAM - in bandwidth-heavy 3D workloads, DRAM can account for 20-30% of system power. The power saved on DRAM is headroom the GPU and SoC can spend instead, which improves performance.
The power figures here are the system's active power: total device power minus idle power (which includes the screen) for the given workload.
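In other words (the numbers in this example are hypothetical and only illustrate the bookkeeping):

\[
P_{\text{active}} = P_{\text{total}} - P_{\text{idle}}, \qquad \text{e.g.}\ 7.2\,\text{W} - 1.2\,\text{W} = 6.0\,\text{W},
\]

so the screen and the rest of the always-on platform are subtracted out, and what remains is attributed to the workload under test.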
At peak performance, with the device cooled at a 22°C ambient temperature, the Apple A12 GPU is very power hungry, reaching 6W. And this isn't even the highest figure we saw for the GPU: as mentioned earlier, 3DMark reached around 7.5W (before crashing).
Even at this high power draw, the A12 is more efficient than every other SoC out there. Still, it's important to highlight Apple's throttling behavior. After just 3 minutes, or 3 benchmark runs, the phone throttles by about 25%, reaching what I call the "warm" state in the efficiency chart, with power coming in at a more reasonable 3.79W. It's notable that power efficiency doesn't improve massively in this state - only 16% better than at peak - which suggests the platform is sitting relatively low on its power curve and performance is limited mostly by thermals.
Moving on to T-Rex, the iPhone XS again demonstrated a 61% sustained performance gain.
T-Rex power consumption is consistent with Manhattan: peak power on a cool device reaches a little over 6W, and after three runs it drops back below 4W while performance falls by 28%. Efficiency again improves only slightly, once more indicating that the chip is sitting relatively low on its power curve.
It's important to note that the power figure for the "warm" state is not representative of fully sustained operation; I simply wanted an additional data point next to the peak figures. Most devices end up with sustained power in the 3-3.5W range.
Why does Apple allow such a huge gap between peak and sustained performance? Sustained performance used to be one of Apple's main talking points back when the iPhone 6 and A8 were released. The change comes down to how everyday GPU use cases have evolved, and to how Apple uses the GPU for workloads that have nothing to do with 3D.
Apple makes heavy use of GPU compute for a variety of purposes, such as general hardware acceleration in apps and camera image processing. In these use cases sustained performance matters little, because they are transactional workloads: fixed amounts of work that simply need to be completed as quickly as possible.
Android GPU compute has been a complete disaster over the past few years, and I've mostly blamed Google for not supporting OpenCL in AOSP, which has made OpenCL support very patchy among vendors. RenderScript never gained much traction because it didn't guarantee performance. The fragmentation of Android devices and SoCs means that GPU compute in third-party apps is basically non-existent (please correct me if I'm wrong!)
Apple's vertical integration and tight control over the API stack means that GPU computing is a reality, and transactional GPU peak performance is a metric worth considering.
Now, while this explains the throttling, I still think Apple could do some thermal optimization. I played Fortnite on my iPhone XS and the phone got really hot, which I wasn't a fan of. At the very least, there should be some mechanism for games and apps with sustained-performance characteristics to cap the GPU at a sustainable performance state.
Thermal and peak performance considerations aside, iPhone XS and XS Max, thanks to the new A12 SoC, demonstrate industry-leading performance and efficiency and are currently the best mobile gaming platforms.