
Running 64-bit Linux Applications on an 8-core ARM Cortex-A53 CPAK


I’m excited to introduce the most complex Carbon Performance Analysis Kit (CPAK) created by Carbon: an 8-core ARM Cortex-A53 system running 64-bit Linux with full Swap & Play support. This is also the first dual-cluster Linux CPAK available on Carbon System Exchange. It’s an important milestone for Carbon and for SoC Designer users because it enables system performance analysis for 64-bit multi-core Linux applications.

 

Here are the highlights of the system:

  • Dual-cluster, quad-core Cortex-A53 for a total of 8 cores
  • ARM CoreLink CCI-400 providing coherency between clusters
  • Fully configured GIC-400 interrupt controller delivering interrupts to all cores
  • New Global System Counter connected to A53 Generic Timers

 

Here is a diagram of the system.

[Figure: block diagram of the 8-core Cortex-A53 system]

The design also supports fully automatic mapping to ARM Fast Models.

 

I would like to introduce some of the new functionality in this CPAK.

 

Dual Cluster System


The Cortex-A53 model supports the CLUSTERIDAFF inputs to set the Cluster ID, and this value shows up for software in the MPIDR register. Cluster ID values of 0 and 1 are used for the two clusters, and each cluster has four cores. This means that CPU 3 in Cluster 1 has an MPIDR value of 0x80000103, as shown in the screenshot below.

 

[Figure: register view showing the MPIDR value 0x80000103 for CPU 3 in Cluster 1]
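As a quick illustration of how software sees these fields (a sketch, not taken from the CPAK itself, and runnable only at EL1 or above, for example from the boot wrapper), the affinity fields of MPIDR_EL1 can be read and decoded like this:

    #include <stdint.h>

    /* Sketch only: MPIDR_EL1 is not directly readable from user space, so this
     * would run in the boot wrapper or kernel. Aff1 holds the Cluster ID and
     * Aff0 holds the core number within the cluster. */
    static inline uint64_t read_mpidr(void)
    {
        uint64_t mpidr;
        __asm__ volatile("mrs %0, mpidr_el1" : "=r"(mpidr));
        return mpidr;
    }

    static inline unsigned int cluster_id(void) { return (read_mpidr() >> 8) & 0xff; } /* Aff1 */
    static inline unsigned int core_id(void)    { return read_mpidr() & 0xff; }        /* Aff0 */

For CPU 3 in Cluster 1 this returns cluster_id() == 1 and core_id() == 3, matching the 0x80000103 value shown above.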


Global System Counter

 

Another requirement for a multi-cluster system is the use of a Global System Counter. A new model is now available in SoC Designer which is connected to the CNTVALUEB input of each A53. This ensures that the Generic Timer in each processor has the same counter values for software, even when the frequency of the processors may be different. This model also enables Swap & Play systems to work correctly by saving the counter value from the Fast Model simulation and restoring it in the Cycle Accurate simulation.

 

Generic Timer to GIC Connections


To create a multi-cluster system the GIC-400 is used as the interrupt controller, and the A53 Generic Timers are used as the system timers. This requires the connection of the Generic Timer signals from the A53 to the GIC-400. All of these signals start with nCNT and are wired to the GIC. When a Generic Timer generates an interrupt it leaves the CPU by way of the appropriate nCNT signal, goes to the GIC, and then back to the CPU using the appropriate nIRQ signal.

 

As I wrote in my ARM Techcon Blog, 64-bit Linux uses nCNTPNSIRQ, but all signals are connected for completeness.

 

Event Connections

 

Additional signals which fall into the category of power management and connect between the two clusters are EVENTI and EVENTO. These signals are used for event communication using the WFE (wait for event) and SEV (send event) instructions. For a single cluster system all of the communication happens inside the processor, but for the multi-cluster system these signals must be connected.

WFE and SEV communication is used during the Linux boot. All 7 of the secondary cores execute a WFE and wait until the primary core wakes them up using the SEV instruction at the appropriate time. If the EVENTI and EVENTO signals are not connected the secondary cores will not wake up and run.

 

Boot Wrapper Modifications

 

The good news is that all of the software used in the 8-core CPAK is easily downloadable in source code format. A small boot wrapper is used to take care of starting the cores and doing a minimal amount of hardware configuration that Linux assumes to be already done. Sometimes there is additional hardware programming that is needed for proper cycle accurate operation that is not needed in a Fast Model system. These are similar to issues I covered in another article titled Sometimes Hardware Details Matter in ARM Embedded Systems Programming.

 

SMP Enable

 

Although not specific to multi-cluster operation, the A53 contains a bit in the CPUECTLR register named SMPEN which must be set to 1 to enable hardware management of data coherency with the other cores in the cluster. This bit was not set in the original boot wrapper from kernel.org, and because the Linux kernel assumes it has already been done, it was added to the boot wrapper during development.
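A minimal sketch of what that boot wrapper addition looks like is shown below. The register encoding is the implementation-defined S3_1_C15_C2_1 used by Cortex-A53; the actual boot wrapper patch is written in assembly and may differ in detail, and the helper name here is illustrative.

    /* Set CPUECTLR_EL1.SMPEN (bit 6) on each core before caches and the MMU
     * are enabled. Sketch only; must run at EL1 or above. */
    static inline void enable_smp_coherency(void)
    {
        unsigned long val;

        __asm__ volatile("mrs %0, s3_1_c15_c2_1" : "=r"(val));   /* CPUECTLR_EL1 */
        val |= (1UL << 6);                                        /* SMPEN */
        __asm__ volatile("msr s3_1_c15_c2_1, %0" :: "r"(val));
        __asm__ volatile("isb");
    }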

 

CCI Snoop Configuration

 

Another hardware programming task which is assumed by the Linux kernel is the enabling of snoop requests and responses between the clusters. The Snoop Control Register for each CCI-400 slave port is set to 0xc0000003 to enable coherency. This was also added to the boot wrapper during development of the CPAK.
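A sketch of that boot wrapper addition is shown below. The CCI-400 base address is a placeholder and the slave port numbers depend on which ports the clusters are connected to, so both are assumptions here rather than values from the CPAK.

    #include <stdint.h>

    #define CCI400_BASE          0x2c090000UL            /* placeholder base address */
    #define CCI_STATUS_OFFSET    0x000cUL                /* bit 0: change pending */
    #define CCI_SLAVE_SNOOP(n)   (0x1000UL * ((n) + 1))  /* Snoop Control Register of slave port n */

    /* Enable snoop and DVM requests on one CCI-400 slave port, then wait for
     * the change to take effect. 0xc0000003 is the value used in the CPAK. */
    static void cci_enable_snoops(unsigned int slave_port)
    {
        *(volatile uint32_t *)(CCI400_BASE + CCI_SLAVE_SNOOP(slave_port)) = 0xc0000003;
        while (*(volatile uint32_t *)(CCI400_BASE + CCI_STATUS_OFFSET) & 1)
            ;  /* spin until the snoop enable change is no longer pending */
    }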

The gaps between the boot wrapper functionality and Linux assumptions are somewhat expected since the boot wrapper was developed for ARM Fast Models and these details are not needed to run Linux on Fast Models, but nevertheless they make it challenging to create a functioning cycle accurate system. These changes are provided as a patch file in the CPAK so they can be easily applied to the original source code.

 

CPAK Contents

 

The CPAK comes with an application note which covers the construction of the Linux image.

 

The following items are configured to match the minimal hardware system design, and can be extended as the hardware design is modified.

  • File System: Custom file system configured and created using Buildroot
  • Kernel Image: Linux 3.14.0 configured to use the minimal hardware
  • Device Tree Blob:  Based on Versatile Express device tree for ARM Fast Models
  • Boot Wrapper: Small assembly boot wrapper available from kernel.org

 

A single executable (.axf) file containing all of the above items is compiled. This single image holds all of the artifacts and is loaded and executed in SoC Designer.

One of the amazing things is there are no kernel source code changes required. It demonstrates how far Linux has come in the ARM world and the flexibility it now has in supporting a wide variety of hardware configurations.

 

Summary


An octa-core A53 Linux CPAK is now available which supports Swap & Play. The ability to boot the Linux kernel using Fast Models and migrate the simulation to cycle accurate execution enables system performance analysis for 64-bit multi-core systems running Linux applications.

 

Also, make sure to check out the other new CPAKs for 32-bit and 64-bit Linux for Cortex-A53 now available on Carbon System Exchange.

 

The “Brought up 8 CPUs” message below tells it all. A number of 64-bit Linux applications are provided in the file system, but users can easily add their favorite programs and run them by following the instructions in the app note.

 

[Figure: Linux boot log showing the "Brought up 8 CPUs" message]


Optimization of Systems Containing the ARM CoreLink CCN-504 Cache Coherent Network


The first Carbon Performance Analysis Kit (CPAK) demonstrating the AMBA 5 CHI protocol has been released on Carbon System Exchange. The design features the ARM Cortex-A57 configured for AMBA 5 CHI and the ARM CoreLink CCN-504 Cache Coherent Network. The design is a modest system with a single core running 64-bit bare-metal software with memory and a PL011 UART, but for anybody who digs into the details there is a lot to learn.

 

Here is a diagram of the system:

 

[Figure: block diagram of the Cortex-A57 and CCN-504 system]


AMBA 5 CHI Introduction

 

Engineers who have been working with ARM IP for some time will quickly realize AMBA 5 CHI is not an extension of any previous AMBA specifications. AMBA 5 CHI is both more and less complex compared to AMBA 4. CHI is more complex at the protocol layer, but less complex at the physical layer. AXI and ACE use Masters and Slaves, but CHI uses Request Nodes, Home Nodes, Slave Nodes, and Miscellaneous Nodes. All of these nodes are referenced using shorthand abbreviations as shown in the table below.

 

[Table: AMBA 5 CHI node types and their abbreviations]


Building the A57 with CHI

 

The latest r1p3 A57 is now available on Carbon IP Exchange. CHI can be selected as the external memory interface. The relevant section from the IP Exchange configuration form is shown below.

 

[Figure: Cortex-A57 memory interface selection on Carbon IP Exchange]

 

The CHI memory interface relies on the System Address Map (SAM) signals. All of the A57 input signals starting with SAM* are important in constructing a working system. These values are available as parameters on the A57 model, and are configured appropriately in the CPAK to work with the CCN-504.

 

Configuring the CCN-504


The CCN-504 Cache Coherent Network provides the connection between the A57 and memory. The CPAK uses two SN-F interfaces, since support for dual memory controllers is one of the key features of the IP. A similar set of SAM* parameters is available on the CCN-504 to configure the system address map. Like other ARM IP, the CCN uses the concept of PERIPHBASE to set the address of the internal, software programmable registers.

 

Programming Highlights

 

The CCN-504 includes an integrated level 3 cache. The CPAK demonstrates the use of the L3 cache.

The CPAK startup assembly code also demonstrates other CCN-504 configuration, including setting up barrier termination, loading node ID lists, programming the system address map control registers, and more.


AMBA 5 CHI Waveforms

 

One of the best ways to start learning about AMBA 5 CHI is looking at the waveforms between the A57 and the CCN-504. The latest SoC Designer 7.15.5 supports CHI waveforms and displays Flits, the basic unit of transfer in the AMBA 5 CHI link layer.

 

[Figure: AMBA 5 CHI Flit waveform between the A57 and the CCN-504]

Summary


A new CPAK by Carbon Design Systems running 64-bit bare-metal software on the Cortex-A57 processor with CHI memory interface connected to the CCN-504 and memory is now available. It demonstrates the AMBA 5 CHI protocol, serves as a starting point for optimization of CCN-based systems, and is a valuable learning tool for projects considering AMBA 5 CHI.

Three Tips for Using Linux Swap & Play with ARM Cortex-A Systems


Today, I have three tips for using Swap & Play with Linux systems.

 

  1. Launching benchmark software automatically on boot
  2. Setting application breakpoints for Swap & Play checkpoints
  3. Adding markers in benchmark software to track progress


With the availability of the Cortex-A15 and Cortex-A53 Swap & Play models as well as the upcoming release of the Cortex-A57 Swap & Play model, Carbon users are able to run Linux benchmark applications for system performance analysis. This enables users to create, validate, and analyze the combination of hardware and software using cycle accurate virtual prototypes running realistic software workloads. Combine this with access to models of candidate IP, and the result is a unique flow which delivers cycle accurate ARM system models to design teams.


[Figure: cycle accurate virtual prototype flow]

Swap & Play Overview

 

Carbon Swap & Play technology enables high-performance simulation (based on ARM Fast Models) to be executed up to user-specified breakpoints, and the state of the simulation to be resumed using a cycle accurate virtual prototype. One of the most common uses of Swap & Play is to run Linux benchmark applications to profile how the software executes on a given hardware design. Linux can be booted quickly and then the benchmark run using the cycle accurate virtual prototype. These tips make it easier to automate the entire process and get to the system performance analysis.


Launch Benchmarks on Boot


The first tip is to automatically launch the benchmark when Linux is booted. Carbon Linux CPAKs on System Exchange use a single executable file (.axf) for each system with the following artifacts linked into the image:

  • Minimal Boot loader
  • Kernel image
  • Device Tree
  • RAM-based File System with applications

To customize and automate the execution of a desired Linux benchmark application, a Linux device tree entry can be created to select the application to run after boot.

 

The device tree support for “include” can be used to include a .dtsi file containing the kernel command line, which launches the desired Linux application.

 

Below is the top of the device tree source file from an A15 CPAK. If one of the benchmarks to be run is the bw_pipe test from the LMbench suite, a .dtsi file is included in the device tree.

 

[Figure: top of the device tree source file showing the include line]

 

The include line pulls in a description of the kernel command line. For example, if the bw_pipe benchmark from LMbench is to be run, the include file contains the kernel arguments shown below:

 

[Figure: included .dtsi file containing the kernel command line for bw_pipe]

 

The rdinit kernel command line parameter is used to launch a script that executes the Linux application to be run automatically. The bw_pipe.sh script can then run the bw_pipe executable with the desired command line arguments.

 

Scripting or manually editing the device tree can be used to modify the include line for each benchmark to be run, and a unique .axf file can be created for each Linux application. This gives an easy-to-use .axf file that automatically launches the benchmark without the need for any interactive typing. Having unique .axf files for each benchmark also makes it easy to hand off to other engineers, since they don’t need to know anything about how to run the benchmark; just load the .axf file and the application will automatically run.

 

I also recommend creating an .axf image which runs /bin/bash, to use for testing new benchmark applications in the file system. I normally run all of the benchmarks manually from the shell first on the ARM Fast Model to make sure they are working correctly.


Setting Application Breakpoints

 

Once benchmarks are automatically running after boot, the next step is to set application breakpoints to use for Swap & Play checkpoints. Linux uses virtual memory which can make it difficult to set breakpoints in user space. While there are application-aware debuggers and other techniques to debug applications, most are either difficult to automate or overkill for system performance analysis.

 

One way to easily locate breakpoints is to call from the application into the Linux kernel, where it is much easier to put a breakpoint. Any system call which is unused by the benchmark application can be utilized for locating breakpoints. Preferably, the chosen system call would not have any other side effects that would impact the benchmark results.

 

To illustrate how to do this, consider a benchmark application to be run automatically on boot. Let’s say the first checkpoint should be taken when main() begins. Place a call to the sched_yield() function as the first action in main(). Make sure to include the header file sched.h in the C program. This will call into the Linux kernel. A breakpoint can be placed in the Linux kernel file kernel/sched/core.c at the entry point for the sched_yield system call.
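As a sketch, the start of an instrumented benchmark might look like the following; the actual bw_pipe source is larger, but the added lines follow the same idea.

    #include <sched.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        /* Marker for Swap & Play: calling into the kernel here means a breakpoint
         * on the sched_yield system call entry catches the start of main(). */
        sched_yield();

        /* ... benchmark work goes here ... */
        printf("benchmark running\n");

        /* A second call can mark the end of the region of interest. */
        sched_yield();
        return 0;
    }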

 

Here is the bw_pipe benchmark with the added system call.

 

[Figure: bw_pipe benchmark source with the added sched_yield() call]

 

Put a breakpoint in the Linux kernel at the system call and when the breakpoint is hit save the Swap & Play checkpoint. Here is the code in the Linux kernel.

 

[Figure: entry point of the sched_yield system call in kernel/sched/core.c]

 

The same technique can be used to easily identify other locations in the benchmark application, including the end of the benchmark to stop simulation and gather results for analysis.

 

The sched_yield system call yields the current processor to other threads, but in the controlled environment of a benchmark application it is not likely to do any rescheduling at the start or at the end of a program. If used in the middle of a multi-threaded benchmark it may impact the scheduler.

 

Tracking Benchmark Progress


From time to time it is nice to see that a benchmark is proceeding as expected and be able to estimate how long it will take to finish. Using print statements is one way to do this, but adding too many print statements can negatively impact performance analysis. Amazingly, even a simple printf() call in a C program to a UART under Linux is a somewhat complex sequence involving the C library, some system calls, UART device driver activations, and 4 or 5 interrupts for an ordinary length string.

 

A lighter-weight way to get some feedback about benchmark application progress is to bypass all of the printf() overhead and make a system call directly from the benchmark application, using very short strings which can be processed with one interrupt and fit in the UART FIFO.

 

Below is a C program showing how to do it.

 

[Figure: C program making a direct write system call to print short progress markers]
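A minimal sketch of the idea is shown here; the marker() helper and the one-character tags are illustrative, not the exact code from the screenshot.

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Emit a very short progress marker directly via the write system call,
     * bypassing printf() and stdio buffering. */
    static void marker(const char *tag, unsigned int len)
    {
        syscall(SYS_write, 1, tag, (size_t)len);   /* 1 = stdout */
    }

    int main(void)
    {
        marker("A\n", 2);
        /* ... first phase of the benchmark ... */
        marker("B\n", 2);
        /* ... second phase ... */
        marker("C\n", 2);
        return 0;
    }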

 

By using short strings which are just a few characters, it’s easy to insert some markers in the benchmark application to track progress without getting in the way of benchmark results. This is also a great tool to really learn what happens when a Linux system call is invoked by tracing the activity in the kernel from the start of the system call to the UART driver.

 

Summary

 

Hopefully these 3 tips will help Swap & Play users run benchmark applications and get the most benefit when doing system performance analysis. I’m sure readers have other ideas about how to best automate the running of Linux application benchmarks as well as how to locate Swap & Play breakpoints, but these should get the creative ideas flowing.

 

Jason Andrews

System Performance Analysis and the ARM Performance Monitor Unit (PMU)


Carbon cycle accurate models of ARM CPUs enable system performance analysis by providing access to the Performance Monitor Unit (PMU). Carbon models instrument the PMU registers and record PMU events into the Carbon System Analyzer database without any software programming. Contrast this non-intrusive PMU event collection with other common ways of running software:

 

  • ARM Fast Models focus on speed and have limited ability to access PMU events
  • Simulating or emulating CPU RTL does not provide automatic instrumentation and event collection
  • Silicon requires software programming to enable and collect events from the PMU


The ARM Cortex-A53 is a good example to demonstrate the features of SoC Designer. The A53 PMU implements the PMUv3 architecture and gathers statistics on the processor and memory system. It provides six counters which can count any of the available events.


The Carbon A53 model instruments the PMU events to gather statistics without any software programming. This means all of the PMU events (not just six) can be captured from a single simulation.


The A53 PMU Events can be found in the Technical Reference Manual (TRM) in Chapter 12. Below is a partial list of PMU events just to provide some flavor of the types of events that are collected. The TRM details all of the events the PMU contains.


[Table: partial list of Cortex-A53 PMU events from the TRM]

 

Profiling can be enabled by right-clicking on a CPU model and selecting the Profiling menu. Any or all of the PMU events can be enabled. Any simulation done with profiling enabled will write the selected PMU events into the Carbon System Analyzer database.

[Figure: Profiling menu on a CPU model]



Bare Metal Software

 

The automatic instrumentation of PMU events is ideal for bare metal software since it requires no programming and will automatically cover the entire timeline of the software test or benchmark. Full control is available to enable the PMU events at any time by stopping the simulator and enabling or disabling profiling.

 

All of the profiling data from the PMU events, as well as the bus transactions, and the software profiling information end up in the Carbon Analyzer database. The picture below shows a section of the Carbon Analyzer GUI loaded with PMU events, bus activity, and software activity.

 

 

[Figure: Carbon Analyzer GUI showing PMU events, bus activity, and software activity]


The Carbon Analyzer provides many out-of-the-box calculations of interesting metrics as well as a complete API which allows plugins to be written to compute additional system or application specific metrics.


Linux Performance Analysis

 

Things get more interesting in a Linux environment. A common use case is to run Linux benchmarks to profile how the software executes on a given hardware design. Linux can be booted quickly and then a benchmark can be run using a cycle accurate virtual prototype by making use of Swap & Play.

 

Profiling enables events to be collected in the analyzer database, but the user doesn’t have the ability to understand which events apply to each Linux process or to differentiate events from the Linux kernel vs. those from user space programs. It’s also more difficult to determine when to start and stop event collection for a Linux application. Control can be improved by using techniques from Three Tips for Using Linux Swap & Play with ARM Cortex-A Systems.


Using PMU Counters from User Space

 

Since the PMU can be used for Linux benchmarks, the first thing that comes to mind is to write some initialization code to set up the PMU, enable counters, run the test, and collect the PMU events at the end. This strategy works pretty well for those willing to get their hands dirty writing system control coprocessor instructions.


Enable User Space Access

 

The first step to being able to write a Linux application which accesses the PMU is to enable user mode access. This needs to be done from the Linux kernel. It's very easy to do, but requires a kernel module to be loaded or compiled into the kernel. All that is needed is to set bit 0 of the PMUSERENR register to 1. It takes only one instruction, but it must be executed from within the kernel. The main section of code is shown below.

 

[Figure: kernel module code that sets PMUSERENR to enable user space access to the PMU]
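For reference, a hedged sketch of such a module for a 64-bit (AArch64) kernel is shown below; the module in the CPAK may be organized differently, and the module name and helper names here are illustrative.

    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/smp.h>

    /* Write PMUSERENR_EL0 on every CPU; bit 0 (EN) grants user space access
     * to the PMU registers. */
    static void set_pmuserenr(void *info)
    {
        unsigned long val = (unsigned long)info;
        asm volatile("msr pmuserenr_el0, %0" :: "r"(val));
    }

    static int __init enable_pmu_init(void)
    {
        on_each_cpu(set_pmuserenr, (void *)1UL, 1);   /* set EN */
        pr_info("enable_pmu: user space PMU access enabled\n");
        return 0;
    }

    static void __exit enable_pmu_exit(void)
    {
        on_each_cpu(set_pmuserenr, (void *)0UL, 1);   /* restore the default */
        pr_info("enable_pmu: user space PMU access disabled\n");
    }

    module_init(enable_pmu_init);
    module_exit(enable_pmu_exit);
    MODULE_LICENSE("GPL");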

 

Building a kernel module requires a source tree for the running kernel. If you are using a Carbon Performance Analysis Kit (CPAK), this source tree is available in the CPAK or can easily be downloaded by using the CPAK scripts.

 

A source code example as well as a Makefile to build it can be obtained by registering here.

 

The module can either be loaded dynamically into a running kernel or added to the static kernel build. When working with CPAKs it’s easier for me to just add it to the kernel. When I’m working with a board where I can natively compile it on the machine it’s easier to dynamically load it using:


$ sudo insmod enable_pmu.ko


Remember to use the lsmod command to see which modules are loaded and the rmmod command to unload it when finished.


The exit function of the module returns the user mode enable bit back to 0 to restore the original value.


PMU Application

 

Once user mode access to the PMU has been granted, benchmark programs can take advantage of the PMU to count events such as cycles and instructions. One possible flow from a user space program is:

  • Reset count values
  • Select which of the six PMU counter registers to use
  • Set the event to be counted, such as instructions executed
  • Enable the counters to start counting

Once this is done, the benchmark application can read the current values, run the code of interest, and then read the values again to determine how many events occurred during the code of interest.

 

[Figure: user space application code that configures and reads the PMU counters]

 

The cycle counter is distinct from the six event count registers and is read from a separate system control register. For this example, event 0x08 (instruction architecturally executed) is monitored using event count register 0. Please take a look at the source code for the simple test application used to count cycles and instructions of a simple printf() call.
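A hedged AArch64 sketch along the same lines is shown below; the register names and bit positions are from the ARMv8 PMU architecture, the helper names are illustrative, and it assumes the kernel module described earlier has already set PMUSERENR.

    #include <stdio.h>
    #include <stdint.h>

    static inline uint64_t read_cycles(void)
    {
        uint64_t v;
        __asm__ volatile("mrs %0, pmccntr_el0" : "=r"(v));
        return v;
    }

    static inline uint64_t read_event0(void)
    {
        uint64_t v;
        __asm__ volatile("mrs %0, pmevcntr0_el0" : "=r"(v));
        return v;
    }

    static void pmu_setup(void)
    {
        /* Event counter 0 counts event 0x08: instruction architecturally executed. */
        __asm__ volatile("msr pmevtyper0_el0, %0" :: "r"((uint64_t)0x08));
        /* Enable the cycle counter (bit 31) and event counter 0 (bit 0). */
        __asm__ volatile("msr pmcntenset_el0, %0" :: "r"((uint64_t)((1UL << 31) | 1UL)));
        /* PMCR_EL0: reset event counters (bit 1), reset cycle counter (bit 2), enable (bit 0). */
        __asm__ volatile("msr pmcr_el0, %0" :: "r"((uint64_t)0x7));
        __asm__ volatile("isb");
    }

    int main(void)
    {
        pmu_setup();
        uint64_t c0 = read_cycles(), i0 = read_event0();

        printf("hello PMU\n");                         /* region of interest */

        uint64_t c1 = read_cycles(), i1 = read_event0();
        printf("cycles: %llu  instructions: %llu\n",
               (unsigned long long)(c1 - c0), (unsigned long long)(i1 - i0));
        return 0;
    }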

 

Summary

 

This article provided an introduction to using the Carbon Analyzer to automatically gather information on ARM PMU events for bare metal and Linux software workloads. Carbon models provide full access to all PMU events during a single simulation with no software changes and no limitations on the number of events captured.

 

It also explained how additional control can be achieved by writing software to access the PMU directly from a Linux test program or benchmark application. This can be done with no kernel changes, but does require the PMU to be enabled from user mode and is limited to the number of counters available in the PMU; six for CPUs such as the Cortex-A15 and A57.

 

Next time I will look at an alternative approach to use the ARM Linux PMU driver and a system call to collect PMU events. 

Using the ARM Performance Monitor Unit (PMU) Linux Driver


The Linux kernel provides an ARM PMU driver for counting events such as cycles, instructions, and cache metrics. My previous article covered how to access data from the PMU automatically within SoC Designer by enabling hardware profiling events. It also discussed how to enable access from a Linux application so the application can directly access the PMU information. This article covers how to use the ARM Linux PMU driver to gather performance information. In the previous article, the Linux application was accessing the PMU hardware directly using system control coprocessor instructions, but this time a device driver and a system call will be used. As before, I used a Carbon Performance Analysis Kit (CPAK) for a Cortex-A53 system running 64-bit Linux.

 

The steps covered are:

  • Configure Linux kernel for profiling
  • Confirm the device tree entry for the ARM PMU driver is included in the kernel
  • Insert system calls into the Linux application to access performance information


Kernel Configuration


The first step is to enable profiling in the Linux kernel. It’s not always easy to identify the minimal set of values to enable kernel features, but in this case I enabled “Kernel performance events and counters”, which is found under “General setup”.


[Figure: kernel configuration menu with “Kernel performance events and counters” enabled]


I also enabled “Profiling support” on the “General setup” menu.


[Figure: kernel configuration menu with “Profiling support” enabled]


Once these options are enabled recompile the kernel as usual by following the instructions provided in the CPAK.


Device Tree Entry


Below is the device tree entry for the PMU driver. All Carbon Linux CPAKs for Cortex-A53 and Cortex-A57 include this entry so no modification is needed. If you are working with your own Linux configuration confirm the pmu entry is present in the device tree.


[Figure: device tree entry for the ARM PMU driver]


When the kernel boots the driver prints out a message:


hw perfevents: enabled with arm/armv8-pmuv3 PMU driver, 7 counters available


If this message is not in the kernel boot log check both the PMU driver device tree entry and the kernel configuration parameters listed above. If any of them are not correct the driver message will not appear.


Performance Information from a Linux Application


One way to get performance information from a Linux application is to use the perf_event_open system call. This system call does not have a glibc wrapper, so it is called directly using syscall. Most of the available examples create a wrapper function, including the one shown in the man page, to make usage easier.


[Figure: wrapper function for the perf_event_open system call]
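That wrapper is essentially the one from the perf_event_open man page:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* perf_event_open has no glibc wrapper, so invoke it via syscall(). */
    static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
    }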


The process is similar to many other Linux system calls: first, get a file descriptor, and then use the file descriptor for other operations such as ioctl() and read(). The perf_event_open system call uses a number of parameters to configure the events to be counted. Sticking with the simple case of instruction count, the perf_event_attr data structure needs to be filled in with the desired details.


It contains information about:

  • Start enabled or disabled
  • Trace child processes or not
  • Include hypervisor activity or not
  • Include kernel activity or not

 

Other system call arguments include which event to trace (such as instructions), the process id to trace, and which CPUs to trace on.

 

A setup function to count instructions could look like this:

 

[Figure: setup function using perf_event_open to count instructions]
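A hedged sketch of such a setup function is shown below; the function name is illustrative, and the field choices mirror the list earlier (start disabled, count only user space activity for the calling process).

    #include <string.h>
    #include <linux/perf_event.h>

    /* Returns a perf event file descriptor configured to count retired
     * instructions for the calling process on any CPU, starting disabled.
     * perf_event_open() is the syscall wrapper shown earlier. */
    static int setup_instruction_count(void)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;            /* start disabled; enable around the region of interest */
        attr.exclude_kernel = 1;      /* do not count kernel activity */
        attr.exclude_hv = 1;          /* do not count hypervisor activity */

        /* pid = 0: this process; cpu = -1: any CPU; no group leader; no flags. */
        return perf_event_open(&attr, 0, -1, -1, 0);
    }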

 

At the end of the test or interesting section of code it’s easy to disable the instruction count and read the current value. In this code example, get_totaltime() uses a Linux timer to time the interesting work and this is combined with the instruction count from the PMU driver to print some metrics at the end of the test.

 

[Figure: code that disables the counter, reads the instruction count, and prints metrics]
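A usage sketch for the enable/disable/read sequence is shown below; the function name is illustrative, and the CPAK's get_totaltime() timing helper and benchmark body are replaced here by a placeholder busy loop.

    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/perf_event.h>

    /* Count instructions around a region of interest using the fd returned
     * by setup_instruction_count() above. */
    static int run_region(int fd)
    {
        uint64_t count = 0;
        volatile int i;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        for (i = 0; i < 1000000; i++)      /* placeholder for the benchmark body */
            ;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &count, sizeof(count)) != sizeof(count))
            return -1;

        printf("instructions: %llu\n", (unsigned long long)count);
        return 0;
    }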

 

Conclusion


The ARM PMU driver and perf_event_open system call provide a far more robust solution for accessing the ARM PMU from Linux applications than programming the PMU registers directly. The driver takes care of all of the accounting, event counter overflow handling, and provides many flexible options for tracing.

 

For situations where tracing many events is required, it may be overly cumbersome to use the perf_event_open system call. One of the features of perf_event_open is the ability to use a group file descriptor to create groups of events with one group leader and other group members with all events being traced as a group. While all of this is possible it may be helpful to look at the perf command, which comes with the Linux kernel and provides the ability to control the counters for entire applications.

 

Jason Andrews

EDA Containers


Linux containers provide a way to build, ship, and run applications such as the EDA tools used in SoC Design and Verification. EDA Containers is a LinkedIn Group to explore and discover the possibilities of using container technology for EDA application development and deployment. Personally, I work in Virtual Prototyping doing simulation of ARM systems and software. This is a challenging area because it involves not only hardware simulation tools, but also software development tools to compile and run software. We are looking for other engineers interested in exploring containers as they relate to EDA tools and the embedded software development process. If you are interested in learning, or have expertise to share related to Docker, LXC, LXD, or Red Hat containers, please join us. The group is not specific to any EDA company or product. The members are from various companies who just happen to be interested in learning and exploring what can be done with Linux containers.

 

If you are interested please join the group or feel free to discuss related topics here in the ARM Community!

 

[Image: EDA Containers banner]

System Address Map (SAM) Configuration for AMBA 5 CHI Systems with CCN-504


In late 2014, Carbon released the first Carbon Performance Analysis Kit (CPAK) utilizing the ARM CoreLink CCN-504 Cache Coherent Network. Today, the CCN-504 can be built on Carbon IP Exchange with a wide range of configuration options. There are now four CPAKs utilizing the CCN-504 on Carbon System Exchange. The largest design includes sixteen Cortex-A57 processors, the most processors ever included in a Carbon CPAK.

 

At the same time SoC Designer has added new AMBA 5 CHI features including support for monitors, breakpoints, Carbon Analyzer support, and a CHI stub component for testing.

Introduction to AMBA 5 CHI

To get a good introduction to AMBA 5 CHI I recommend the article, "What is AMBA 5 CHI and how does it help?".

 

Another interesting ARM Community article is “5 things you might not know about AMBA 5 CHI”.

 

Although the cache coherency builds on AMBA 4 ACE and is likely familiar, some of the aspects of CHI are quite different.

[Table: comparison of AMBA 5 CHI and AMBA 4 ACE]

CCN-504 Configuration

Configuring the CCN-504 on Carbon IP Exchange is similar to all Carbon models. Select the desired interface types, node population, and other hardware details and click the "Build It" button to compile a model.

 

[Figure: CCN-504 configuration options on Carbon IP Exchange]

Understanding the Memory Map

One of the challenges of configuring CHI systems is to make sure the System Address Map (SAM) is correctly defined. As indicated in the table above, the process is more complex compared to a simple memory map with address ranges assigned to ports.

 

The network layer of the protocol is responsible for routing packets between nodes. Recall from the previous article that CHI is a layered protocol consisting of nodes of various types. Each node has a unique Network ID and each packet specifies a Target ID to send the packet to and a Source ID to be able to route the response.

 

For a system with A57 CPUs and a CCN-504 each Request Node (RN), such as a CPU, has a System Address Map (SAM) which is used to determine where to send packets. There are three possible node types a message could be sent to: Miscellaneous Node (MN), Home Node I/O coherent (HN-I), or Home Node Fully coherent (HN-F). DVM and Barrier messages are always sent to the MN so the challenge is to determine which of the possible Home Nodes an address is destined for.

 

To determine which HN-F is targeted, the RN uses an address hash function, which can be found in the CCN-504 TRM.

[Figure: HN-F address hash function from the CCN-504 TRM]

Each CCN has a different hashing function depending on how many HN-F partitions are being used.

 

The hashing function calculates the HN-F to be used, but this is still not a Network ID. Additional configuration signals provide the mapping from HN-F number to Node ID.

 

All of this means there are a number of SAM* parameters for the A57 and the CCN-504 which must be set correctly for the memory map to function. It also means that a debugging tool which makes use of back-door memory access needs to understand the hashing function to know where to find the correct data for a given address. SoC Designer takes all of this into consideration to provide system debugging.

 

As you can see, setting up a working memory map is more complex compared to routing addresses to ports.

 

Carbon models use configuration parameters to perform the following tasks:

  • Associate each address region with HN-Fs or HN-Is
  • Specify the Node ID values of Home Nodes and the Miscellaneous Node
  • Define the number of Home Nodes
  • Specify the Home Nodes as Fully Coherent or I/O Coherent

 

The parameters for the A57 CPU model are shown below:

 

[Figure: SAM-related parameters of the Cortex-A57 model]

 

The parameters for the CCN-504 model are similar, a list of SAMADDRMAP* values and SAM*NODEID values.

 

It’s key to make sure the parameters are correctly set for the system to function properly.

 

Cheat Sheet

Sometimes it’s helpful to have a picture of all of the parts of a CCN system. The cheat sheet below has been a tremendous help for Carbon engineers to keep track of the node types and node id values in a system.

 

[Figure: cheat sheet of CCN node types and node ID values]

SoC Designer Features

With the introduction of AMBA 5 CHI, SoC Designer has been enhanced to provide CHI breakpoints, monitors, and profiling information.

 

Screenshots of CHI transactions and CHI profiling are shown below. The Target ID and the Source ID for each transaction are shown. This is from the single-core A57 CPAK so the SourceID values are always 1. Multi-core CPAKs will create transactions with different SourceID values.

 

The CCN-504 has a large number of PMU events which can be used to understand performance.

 

[Figure: CHI transactions in the Carbon Analyzer showing Target ID and Source ID]

 

[Figure: CCN-504 PMU events in the Carbon Analyzer]

Summary

AMBA 5 CHI is targeted at systems with larger numbers of coherent masters. The AMBA 5 CHI system memory map is more complex compared to ACE systems. A number of System Address Map parameters are required to build a working system, both for the CPU and for the interconnect.

 

Carbon SoC Designer is a great way to experiment and learn how CHI systems work. Pre-configured Carbon Performance Analysis Kits (CPAKs), which can be downloaded and run, are available on Carbon System Exchange; they demonstrate the hardware configuration as well as the software programming needed to initialize a CHI system. Just like the address map, the initialization software is more complex compared to an ACE system with a CCI-400 or CCI-500.

Comparing ARM Cortex-A72 and ARM Cortex-A57


The latest high-performance ARMv8-A processor is the Cortex-A72. The press release reports that the A72 delivers CPU performance that is 50x greater than leading smartphones from five years ago and will be the anchor in premium smartphones for 2016. The Cortex-A72 delivers 3.5x the sustained performance compared to an ARM Cortex-A15 design from 2014. Last week ARM began providing more details about the Cortex-A72 architecture. AnandTech has a great summary of the A72 details.

 

[Image: ARM Cortex-A72]

 

The Carbon model of the A72 is now available on Carbon IP Exchange along with 10 Carbon Performance Analysis Kits (CPAKs). Since current design projects may be considering the A72, it’s a good time to highlight some of the differences between the Cortex-A72 and the Cortex-A57.

 

Carbon IP Exchange Portal Changes

IP Exchange enables users to configure, build, and download models for ARM IP. There are a few differences between the A57 and the A72. The first difference is the L2 cache size. The A57 can be configured with 512 KB, 1 MB, or 2 MB L2 cache, but the A72 can be configured with a fourth option of 4MB.

 

Another new configuration which is available on IP Exchange for the A72 is the ability to disable the GIC CPU interface. Many designs continue to use version 2 of the ARM GIC architecture with IP such as the GIC-400. These designs can take advantage of excluding the GIC CPU interface.

 

The A72 also offers an option to include or exclude the ACP (Accelerator Coherency Port) interface.

 

The last new configuration option is the number of FEQ (Fill/Evict Queue) entries, which on the A72 has been increased to options of 20, 24, and 28, compared to the A57 which offers 16 or 20 entries. This feature has been important to Carbon users doing performance analysis and studying the impact of various L2 cache parameters.

 

The Cortex-A72 configuration from IP Exchange is shown below.

 

[Figure: Cortex-A72 configuration options on Carbon IP Exchange]

ACE Interface Changes

The main change to the A72 interface is that the width of the transaction ID signals has been increased from 6 bits to 7 bits. The wider *IDM signals only apply when the A72 is configured with an ACE interface. The main impact occurs when connecting an A72 to a CCI-400 which was used with A53 or A57. Since those CPUs have the 6-bit wide *IDM signals, the CCI-400 will need to be reconfigured for 7-bit wide *IDM signals. All of the A72 CPAKs which use the CCI-400 have this change made to them so they operate properly, but it’s something to watch if upgrading existing systems to A72.

 

This applies to the following signals for A72:

  • AWIDM[6:0]
  • WIDM[6:0]
  • BIDM[6:0]
  • ARIDM[6:0]
  • RIDM[6:0]

System Register Changes

A number of system registers are updated with new values to reflect the A72.  The primary part number field in the Main ID register (MIDR) for A72 is 0xD08 vs the A57 value of 0xD07 and the A53 value of 0xD03. Clearly, the 8 was chosen well before the A72 number was assigned. A number of other ID registers change value from 7 on the A57 to 8 on the A72.
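As a small illustration (not from the CPAKs), the primary part number sits in bits [15:4] of MIDR, so code running at EL1 or above can identify the CPU like this:

    #include <stdio.h>
    #include <stdint.h>

    /* Sketch only: MIDR_EL1 is readable at EL1 and above, so this would run
     * bare metal or in the kernel rather than as a normal user space program.
     * Expected part numbers: 0xD03 = Cortex-A53, 0xD07 = Cortex-A57, 0xD08 = Cortex-A72. */
    int main(void)
    {
        uint64_t midr;

        __asm__ volatile("mrs %0, midr_el1" : "=r"(midr));
        printf("MIDR = 0x%llx, part number = 0x%llx\n",
               (unsigned long long)midr,
               (unsigned long long)((midr >> 4) & 0xfff));
        return 0;
    }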

New PMU Events

There are a number of new events tracked by the Cortex-A72 Performance Monitor Unit (PMU). All of the new events have event numbers 0x100 and greater. There are three main sections covering:

  • Branch Prediction
  • Queues
  • Cache

The screenshots below from the Carbon Analyzer show the PMU events. All of these are automatically instrumented by the Carbon model and are recorded without any software programming.

 

[Figures: Cortex-A72 PMU events listed in the Carbon Analyzer]

 

The A72 contains many micro-architecture updates for incremental performance improvement. The most obvious one, described above, is the larger L2 FEQ, and there are certainly many more in the branch prediction, caches, TLB, pre-fetch, and floating point units. As an example, I ran an A57 CPAK and an A72 CPAK with the exact same software program. Both CPUs reported about 21,500 instructions retired. This is the instruction count if the program were viewed as a sequential instruction stream. Of course, both CPUs do a number of speculative operations. The A57 reported about 37,000 instructions speculatively executed and the A72 reported 35,700.

 

The screenshots of the instruction events are shown below, first A72 followed by A57. All of the micro-architecture improvements of the A72 combine to provide the highest performance CPU created by ARM to date.

 

[Figures: instruction event counts for the A72 (top) and the A57 (bottom)]

Summary

Carbon users can easily run the A57, A53, and now the A72 with various configuration options and directly compare and contrast the performance results using their own software and systems. The CPAKs available from Carbon System Exchange provide a great starting point and can be easily modified to investigate system performance topics.


Migrating ARM Linux from CoreLink CCI-400 Systems to CoreLink CCN-504


Recently, Carbon released the first ARMv8 Linux CPAK utilizing the ARM CoreLink CCN-504 Cache Coherent Network on Carbon System Exchange. The CCN family of interconnect offers a wide range of high bandwidth, low latency options for networking and data center infrastructure.

[Image: ARM CoreLink CCN-504]

The new CPAK uses an ARM Cortex-A57 octa-core configuration to run Linux on a system with AMBA 5 CHI. Switching the Cortex-A57 configuration from ACE to CHI on Carbon IP Exchange is as easy as changing a pull-down menu item on the model build page. After that, a number of configuration parameters must be set to enable the CHI protocol correctly. Many of them were discussed in a previous article covering usage of the CCN-504. Using native AMBA 5 CHI for the CPU interface coupled with the CCN-504 interconnect provides high-frequency, non-blocking data transfers. Linux is commonly used in many infrastructure products such as set-top boxes, networking equipment, and servers, so the Linux CPAK is applicable for many of these system designs.

 

Selecting AMBA 5 CHI for the memory interface makes the system drastically different at the hardware level compared to a Linux CPAK using the ARM CoreLink CCI-400 Cache Coherent Interconnect, but the software stack is not significantly different.

 

From the software point of view, a change in interconnect usually requires some change in initial system configuration. It also impacts performance analysis as each interconnect technology has different solutions for monitoring performance metrics. An interconnect change can also impact other system construction issues such as interrupt configuration and connections.

 

Some of the details involved in migrating a multi-cluster Linux CPAK from CCI to CCN are covered below.

 

Software Configuration

Special configuration for the CCN-504 is done using the Linux boot wrapper which runs immediately after reset. The CPAK doesn’t include the boot wrapper source code, but instead uses git to download it from kernel.org and then patches in the changes needed for CCN configuration. The added code performs the following tasks:

  • Set the SMP enable bit in the A57 Extended Control Register (ECR)
  • Terminate barriers at the HN-I
  • Enable multi-cluster snooping
  • Program HN-F SAM control registers

 

The most critical software task is to make sure multi-cluster snooping is operational. Without this, Linux will not run properly. If you are designing a new multi-cluster CCN-based system it is worth running a bare metal software program to verify snooping across clusters is working correctly. It’s much easier to debug the system with bare metal software, and there are a number of multi-cluster CCN CPAKs available with bare metal software which can be used.

 

I always recommend a similar approach for other hardware specific programming. Many times users have hardware registers that need to be programmed before starting Linux, and it’s easy to put this code into the boot wrapper; it is also less error prone compared to using simulator scripts to force register values. The Linux device tree provided with the CPAK also contains a device tree entry for the CCN-504. The device tree entry has a base address which must match the PERIPHBASE parameter on the CCN-504 model. In this case the PERIPHBASE is set to 0x30, which means the address in the device tree is 0x30000000.

[Figure: device tree entry for the CCN-504 with base address 0x30000000]

All Linux CPAKS come with an application note which provides details on how to configure and compile Linux to generate a single .axf file.

 

GIC-400 Identification of CPU Accesses

One of the new things in the CPAK is the method used to get the CPU Core ID and Cluster ID information to the GIC-400.

 

The GIC-400 requires the AWUSER and ARUSER bits on AXI be used to indicate the CPU which is making an access to the GIC. A number between 0 and 7 must be driven on these signals so the GIC knows which CPU is reading or writing, but getting the proper CPU number on the AxUSER bits can be a challenge.

 

In Linux CPAKs with CCI, this is done by the GIC automatically by inspecting the AXI transaction ID bits and then setting the AxUSER bits as input to the GIC-400. Each CPU will indicate the CPU number within the core (0-3) and the CCI will add information about which slave port received the transaction to indicate the cluster.

 

Users don’t need to add any special components in the design because the mapping is done inside the Carbon model of the GIC-400 using a parameter called “AXI User Gen Rule”. This parameter has a default value which assumes a 2 cluster system in which each cluster has 4 cores. This is a standard 8 core configuration which uses all of the ports of the GIC-400. The parameter can be adjusted for other configurations as needed. 

 

The User Gen Rule does even more because the ARM Fast Model for the GIC-400 uses the concept of Cluster ID to indicate which CPU is accessing the GIC. The Cluster ID concept is familiar for software reading the MPIDR register and exists in hardware as a CPU configuration input, but is not present in each bus transaction coming from a CPU and has no direct correlation to the CCI functionality of adding to the ID bits based on slave port.

 

To create systems which use cycle accurate models and can also be mapped to ARM Fast Models the User Gen Rule includes all of the following information for each of the 8 CPUs supported by the GIC:

  • Cluster ID value which is used to create the Fast Model system
  • CCI Port which determines the originating cluster in the Cycle Accurate system
  • Core ID which determines the CPU within a cluster for both Fast Model and Cycle Accurate systems

 

With all of this information Linux can successfully run on multi-cluster systems with the GIC-400.

 

AMBA 5 CHI Systems

In a system with CHI the Cluster ID and the CPU ID values must also be presented to the GIC in the same way as in ACE systems. For CHI systems, the CPU will use the RSVDC signals to indicate the Core ID. The new CCN-504 CPAK introduces a SoC Designer component to add Cluster ID information. This component is a CHI-to-CHI pass-through which has a parameter for Cluster ID and adds the given Cluster ID into the RSVDC bits.

 

For CCN configurations with AXI master ports to memory, the CCN will automatically drive the AxUSER bits correctly for the GIC-400. For systems which bridge CHI to AXI using the SoC Designer CHI-AXI converter, this converter takes care of driving the AxUSER bits based on the RSVDC inputs. In both cases, the AxUSER bits are driven to the GIC. The main difference for CHI systems is the GIC User Gen Rule parameter must be disabled by setting the “AXI4 Enable Change USER” parameter to false so no additional modification is done by the Carbon model of the GIC-400.

 

Conclusion

All of this may be a bit confusing, but demonstrates the value of Carbon CPAKs. All of the system requirements needed to put various models together to form a running Linux system have already been figured out so users don’t need to know it all if they are not interested. For engineers who are interested, CPAKs offer a way to confirm the expected behavior read in the documentation by using a live simulation and actual waveforms.

Address Space Layout Randomization


System optimization involves running Linux applications and understanding the impact on the hardware and other software in a system. It would be great if system optimization could be done by running benchmarks one time and gathering all needed information to fully understand the system, but anybody who has ever done it knows it takes numerous runs to understand system behavior and a fair amount of experimentation to identify corner cases and stress points.

 

Running any benchmark should be:

  • Repeatable for the person running it
  • Reproducible by others who want to run it
  • Reporting relevant data to be able to make decisions and improvements

 

Determinism is one of the challenges that must be understood in order to make reliable observations and guarantee system improvements have the intended impact. For this reason, it’s important to create an environment which is as repeatable as possible. Sometimes this is easy and sometimes it’s more difficult.

 

Traditionally, some areas of system behavior are not deterministic. For example, networking traffic of a system connected to a network is hard to predict and control if there are uncontrolled machines connected to the network. Furthermore, even in a very controlled environment the detailed timing of the individual networking packets will always have some timing variance related to when they arrive at the system under analysis.

 

Another source of nondeterministic behavior could be something as simple as entering Linux commands at the command prompt. The timing of how fast a user is typing will vary from person to person and from run to run when multiple test runs are required to compare performance. A solution for this could be an automated script which automatically launches a benchmark upon Linux boot so there is no human input needed.

 

Understanding the variables which can be controlled and countering any variables which cannot be controlled is required to obtain consistent results. Sometimes new things occur which were not expected. Recently, I was made aware of a new source of non-determinism, ASLR.

 

Address Space Layout Randomization

 

Address Space Layout Randomization (ASLR) has nothing to do with system I/O, but with the internals of the Linux kernel itself. ASLR is a security feature which randomizes where various parts of a Linux application are loaded into memory. One of the things it can do is to change the load address of the C library. When ASLR is enabled the C library will be loaded into a different address of memory each time the program is run. This is great for security, but is a hindrance for somebody trying to perform system analysis by keeping track of the executed instructions for the purpose of making performance improvements.
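A tiny illustration of the effect (not taken from the original benchmarks): printing the address of a C library function and of a heap allocation shows values that change from run to run while ASLR is enabled, and stay fixed once it is disabled.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        void *heap = malloc(16);

        /* With ASLR on, both addresses vary between runs of the same binary. */
        printf("printf is at %p\n", (void *)&printf);
        printf("heap block at %p\n", heap);

        free(heap);
        return 0;
    }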

 

The good news is ASLR can be disabled in Linux during benchmarking activities so that programs will generate the same address traces.

 

A simple command can be used to disable ASLR.

 

$ echo 0 > /proc/sys/kernel/randomize_va_space

 

The default value is 2. The Linux documentation on sysctl is a good place to find information on randomize_va_space:

This option can be used to select the type of process address space randomization that is used in the system, for architectures that support this feature.

0 - Turn the process address space randomization off. This is the default for architectures that do not support this feature anyways, and kernels that are booted with the "norandmaps" parameter.

1 - Make the addresses of mmap base, stack and VDSO page randomized. This, among other things, implies that shared libraries will be loaded to random addresses.  Also for PIE-linked binaries, the location of code start is randomized. This is the default if the CONFIG_COMPAT_BRK option is enabled.

2 - Additionally enable heap randomization. This is the default if CONFIG_COMPAT_BRK is disabled.


There is a file /proc/[pid]/maps for each process which has the address ranges where the .so files are loaded.

 

Launching a program and printing the maps file shows the addresses where the libraries are loaded.

 

For example, if the benchmark being run is sysbench run it like this:

 

$ sysbench & cat /proc/`echo $!`/maps

 

Without setting randomize_va_space to 0, different addresses will be printed each time the benchmark is run, but after setting randomize_va_space to 0 the same addresses are used from run to run.


Below is example output from the maps file.

 

[Figure: example output from the maps file for sysbench]

 

If you ever find that your benchmarking activity starts driving you crazy because the programs you are tracing keep moving around in memory it might be worth looking into ASLR and turning it off!

Using ARM Cycle Models to Understand the Cortex-R8


ARM Cycle Models have long been used to perform design tasks such as:

 

  • IP Evaluation
  • System Architecture Exploration
  • Software Development
  • Performance Optimization

 

In October 2015, ARM acquired the assets of Carbon Design Systems with the primary goal of enabling earlier availability of cycle accurate models for ARM processors and system IP. The announcement of the ARM® Cortex®-R8 is the first step in demonstrating the benefits of early Cycle Model availability. Another goal is to provide Cycle Models which can be used in SystemC simulation environments. The Cortex-R8 model is the first Cycle Model available for use in the Accellera SystemC environment right from the start.

 

The Cortex-R8 model has been available to lead partners since the beginning of 2016 and will be generally available on ARM IP Exchange this month.

 

Earlier cycle accurate model availability has led to increased focus on using Cycle Models to understand new processors. This article describes some of the ways the Cycle Model has been used by ARM silicon partners to understand the Cortex-R8.

 

Prior to early availability of Cycle Models these tasks would have been performed using RTL simulation or FPGA boards. RTL simulation can be cumbersome, especially for software engineers doing benchmarking tasks, and it lacks software debugging and performance analysis features. FPGA boards are familiar to software engineers, but lack the ability to change CPU build-time parameters such as cache and TCM sizes.

 

The examples below provide more insight on how Cycle Models are being used.

 

Benchmarking

 

A common activity for a new processor such as Cortex-R8 is to run various benchmarks and measure how many cycles are required for various C functions. SoC Designer provides an integrated disassembly view which can be used to set breakpoints to run from point A to point B and measure cycle counts.

 

[Figure: SoC Designer disassembly view used to set breakpoints]

 

DS-5 can also be connected to the Cortex-R8 for a full source code view of the software.

 

[Figure: DS-5 source-level view connected to the Cortex-R8]

 

The cycle count is always visible on the toolbar of SoC Designer.

 

[Figure: cycle count shown on the SoC Designer toolbar]

 

Many times a simple subtraction is all that is needed to measure cycle count between breakpoints.

 

After the first round of benchmarking is done, the code can be moved from external memory to TCM and execution repeated. The Cortex-R8 cycle model will boot from ITCM when the INITRAM parameters are set to true. Right-clicking on the Cortex-R8 model and setting parameters makes it easy to change between external memory and TCM.

 

[Figure: Cortex-R8 model parameters, including the INITRAM settings]

 

In addition to just counting cycles, SoC Designer provides additional analysis features. One useful feature is a transaction view.

 

The transaction monitor can be used to make sure the expected transactions are occurring on the bus. For example, when running out of TCM little or no bus activity is expected on the AXI interface, and if there is activity it usually indicates incorrect configuration. Below shows a transaction view of the activity on the AXI interface when running from external memory. Each transaction has a start and end time to indicate how long it takes.

 

[Figure: transaction view of activity on the AXI interface]

 

All PMU events are instrumented and can be automatically captured in Cycle Models. These are viewed by enabling the profiling feature and looking at the results using the analyzer view. The hex values to the left of each event correspond to the event codes in the Technical Reference Manual. In addition to raw values, graphs of events over time can be created to identify hotspots.

 

[Figure: Cortex-R8 PMU events in the analyzer view]

 

The analysis tools also provide information about bus utilization, latency, transaction counts, retired instructions, branch prediction, and cache metrics as shown below.  Custom reports can also be generated.

 

[Figure: analysis view showing bus utilization, latency, transaction counts, and cache metrics]

 

After observing a benchmark in external memory and TCM, it’s common to change TCM sizes and cache sizes. Models with different cache sizes and TCM sizes can easily be configured and created using ARM IP Exchange and the impact on the benchmark observed. The IP configuration page is shown below. Generating a new model is as simple as selecting new values on the web page and pushing the build button. After the compilation is done the new model is ready for download and can replace the current Cortex-R8 model.

 

[Figure: Cortex-R8 configuration page on ARM IP Exchange]

 

Cache and Memory Latency

 

Another use of the Cortex-R8 Cycle Model is to analyze the performance impact of adding the PL310 L2 cache controller. There is a Cycle Model of the PL310 available from ARM IP Exchange. It can be added into a system and enabled by programming the registers of the cache controller. The register view is shown below.

 

[Figure: PL310 L2 cache controller register view]

 

SoC Designer provides ideal memory models which can be configured for various wait states and delays. Performance of memory accesses using these memory models can be compared with adding the PL310 into the system. The same analysis tools can be used to determine latency values from the L2 cache and the overall performance impact of adding the L2 cache. Right clicking on the PL310 and enabling the profiling features will generate latency and throughput information for the analysis view.

 

Example systems using the Cortex-R8 and software to configure the system and run various programs are available from ARM System Exchange. The systems serve as a quick start by providing cycle accurate IP models, fully configured and initialized systems, and software source code. Most users take an example system as a starting point and then modify and customize it to meet particular design tasks.

 

Conclusion

 

Previously, the only ways to evaluate performance and understand the details of a new ARM processor were RTL simulation or FPGA boards with fixed configurations. ARM Cycle Models have become the new standard for IP evaluation, early benchmarking, and performance analysis. The Cortex-R8 Cycle Model is available for use in SoC Designer and SystemC simulation. Example systems and software are available, models of different configurations can be easily generated using ARM IP Exchange, and the software debugging and performance analysis features make Cycle Models an easy-to-use environment for evaluating IP and making informed selection decisions.

Do more, earlier: Development tools for ARM Cortex-R52

$
0
0

The ARM® Cortex®-R52 processor is the most advanced processor for functional safety and the first implementation of the ARMv8-R architecture. Along with the announcement of the Cortex-R52, ARM offers a number of development tools to help partners speed up their path to market. This is especially helpful for a new architecture which highlights software separation for safety and security. This article summarizes the available tools and explains what’s new for the Cortex-R52.

 

About the Cortex-R52         

 

The Cortex-R52 is the first ARMv8-R processor. ARMv8-R brings real-time virtualization to Cortex-R in the form of a new privilege level, EL2, which provides exclusive access to a second-stage Memory Protection Unit (MPU). This enables bare-metal hypervisors to maintain software separation between multiple operating systems and tasks.

 

Many partners will be interested in the differences between the Cortex-R52 and previous designs, such as the Cortex-R5 and Cortex-R7. The Cortex-R52 evolves the Cortex-R5 microarchitecture by providing fast, deterministic interrupt response and low latency at a performance level which is better than Cortex-R7 on some real-time workloads. Cortex-R52 also offers a dedicated, read-only, low-latency flash interface which conforms to the AXI4 specification.

 

ARM Fast Models and Cycle Models enable virtual prototyping for software partners to develop solutions for the new Cortex-R52 before silicon is available.

 

DS-5 Development Studio

 

DS-5 Development Studio is the ARM tool suite for embedded C/C++ software development on any ARM-based SoC. DS-5 features the ARM Compiler, DS-5 Debugger, and Streamline system profiler. Also included is a comprehensive and intuitive IDE, based on the popular and widely-used Eclipse platform.

 

The DS-5 Debugger is developed in close cooperation with ARM processor and subsystem IP development. DS-5 is used inside ARM as part of the development and verification cycle and is extensively tested against models, early FPGA implementations, and (as soon as it is available) silicon. The DS-5 Debugger provides early-access debug and trace support to ARM lead partners working with leading-edge IP. This enables mature, stable, validated debug and trace support for Cortex-R52 to be included in the upcoming DS-5 release, version 5.26.

 

ARM Compiler 6

 

ARM Compiler 6 is the latest compilation toolchain for the ARM architecture, and is included in the DS-5 Development Studio. ARM Compiler 6 brings together the modern LLVM compiler infrastructure and the highly optimized ARM C libraries to produce performance and power optimized embedded software for the ARM architecture.

 

ARM Compiler 6 is developed closely with ARM IP and provides early-access support to lead partners. As with core support in all compilers, code generation, performance, and code size improve over time, driven by experience and feedback from real-world use cases. The upcoming release of ARM Compiler 6, version 6.6, will feature full support for link time optimization and enhanced instruction scheduling, giving an improvement of nearly 10 percent for Cortex-R52 in key benchmark scores. Combined with significant improvements in code size, ARM Compiler 6 is a comprehensive choice for the Cortex-R52.

 

Cortex-R52 provides a compelling opportunity for users to migrate from ARM Compiler 5 to ARM Compiler 6. The ARM Compiler migration and compatibility guide aids the evaluation process by comparing the command line options, source code differences, assembly syntax, and other topics of interest.

 

If existing code needs to be updated from ARM Compiler 5 to ARM Compiler 6, the first step is to get the code to compile successfully. This generally takes a combination of Makefile changes to invoke the new compiler and source code adaptations.

 

First, compiler invocation needs to be switched from armcc to armclang. Other tools like armasm and armlink are included in ARM Compiler 6 and can continue to be used.

For example, when changing from Cortex-R7 to Cortex-R52, a few compiler command-line option changes are required:

 

ARM Compiler 5               ARM Compiler 6
armcc                        armclang
--cpu=Cortex-R7              --target=armv8r-arm-none-eabi -mcpu=cortex-r52
--fpu=VFPv3                  -mfpu=neon-fp-armv8
-Ospace                      -Os / -Oz
-Onum (default is 2)         -Onum (default is 0)

 

The migration guide provides further details related to specific switches, but these are the basics to get going. Some compiler switches are specific to armcc and can simply be removed; for example, --apcs /interwork and --no_inline are not needed with armclang.
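Beyond the command line, a few source-level idioms typically change as well. The fragments below are illustrative only and the identifiers are hypothetical; the migration guide covers the full list. They show how armcc-specific keywords map to the GNU-style attributes used by armclang.

    /* Before: armcc (ARM Compiler 5) style declarations */
    __align(8) char dma_buffer[64];
    __irq void timer_handler(void);

    /* After: the same declarations rewritten for armclang (ARM Compiler 6),
       using GNU-style attributes */
    __attribute__((aligned(8))) char dma_buffer[64];
    __attribute__((interrupt("IRQ"))) void timer_handler(void);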

 

Fast Models

 

Fast Models are accurate, flexible programmer's view models of ARM IP, allowing you to develop software such as drivers, firmware, operating systems, and applications prior to silicon availability. They allow full control over the simulation, including profiling, debug, and trace. Fast Models can be exported to SystemC, allowing integration into the wider SoC design process.

 

Fast Models typical use cases:

  • Functional software debugging
  • Software profiling and optimization
  • Software validation and continuous integration

 

The Fast Model for the Cortex-R52 is being released in late September as part of Fast Models 10.1.

 

Cycle Models

 

Cycle Models are compiled directly from ARM RTL and retain complete functional accuracy. This enables users to confidently make architecture decisions, optimize performance, or develop bare metal software. Cycle Models run in SoC Designer or any SystemC simulator, including the Accellera reference simulator and simulators from EDA partners.

 

Cycle Models typical use cases:

  • IP selection and configuration
  • Analysis of HW/SW interaction
  • Benchmarking and system optimization

 

The Cortex-R52 SystemC Cycle Model supports a number of features which help with performance analysis:

  • SystemC signal interface
  • SystemC ARM TLM interface
  • Instruction trace (tarmac)
  • PMU event trace
  • Waveform generation

 

The Cycle Model for the Cortex-R52 is being released in late September and will be available on ARM IP Exchange.

 

Frequently asked questions about getting started with Cortex-R52

 

Does Cortex-R52 require DS-5 Ultimate Edition?

Yes, DS-5 Ultimate Edition is required for debugging with Cortex-R52.

 

What are the switches for ARM Compiler 6 to select Cortex-R52?

For ARM Compiler 6.6 use the armclang switches: --target=armv8r-arm-none-eabi -mcpu=cortex-r52

For ARM Compiler 6.5 use the armclang switches: --target=armv8r-arm-none-eabi -mcpu=kite

 

Can DS-5 be used for software debugging with a simulation model?

Yes, before silicon is available DS-5 can be used to develop and debug software using the Cortex-R52 Fast Model. The Fast Model can be used for functional software development, checking compliance with the ARMv8-R architecture, and software optimization. DS-5 with the Fast Model makes an ideal development environment for hypervisors, schedulers, real-time operating systems, and communication stacks.

 

Is there a model available for verification which works in EDA simulators?

Yes, the Cortex-R52 Cycle Model can be used in any EDA simulator. It has a SystemC wrapper which can be instantiated in a Verilog or VHDL design. It provides 100% cycle accuracy.

 

Is there a way to run benchmarks to compare Cortex-R52 to another core such as Cortex-R5 or Cortex-R7?

Yes, the Cortex-R52 Cycle Model instruments the Performance Monitor Unit (PMU) and provides a tarmac trace to run benchmarks and evaluate performance.

 

The Cortex-R52 CPU model doesn’t seem to start running after reset, is the model broken?

No, the most common cause is that the CPUHALTx input is asserted, which stops the core from running.

 

Do Cortex-R52 models allow simulation of booting from TCM?

Yes, both the Fast Model and the Cycle Model can boot from TCM. The CFGTCMBOOTx input enables the ATCM from reset on the Cycle Model, and the Fast Model provides the tcm.a.enable parameter to do the same thing.

 

Summary

 

A full suite of development tools is available for the Cortex-R52, enabling developers to do more, earlier, with the most advanced ARM processor for functional safety and to learn about the ARMv8-R architecture. Please refer to developer.arm.com for more information on ARM Development Tools.

Cycle Accurate Models at ARM Techcon: before and after

$
0
0

This week is ARM Techcon, the largest ARM technical conference of the year. It also marks the one year anniversary of the Cycle Model team joining ARM from Carbon Design Systems. These two events are best understood by describing what it was like to attend ARM Techcon before and after joining ARM.

 

Before the team was part of ARM, we always attended Techcon, participating in the exhibit hall with demos and giving technical talks on virtual prototyping. As Carbon Design Systems we typically learned about new IP when it was announced at Techcon. After product announcements were made, cycle accurate model users would immediately ask for models of the newly announced IP. It took time to create models and release them for use in virtual prototyping projects. This resulted in a 12 to 18 month gap between the time lead partners started engaging with ARM on new IP and when cycle accurate models were available. By this time most of the IP evaluation, configuration choices, and system design work was complete.

 

After joining ARM a year ago, the Cycle Model team is now creating models well in advance of new IP announcements. The initial goal was to produce the first cycle accurate models at or soon after beta. During 2016 the team has done this for a number of new models which were ready just after beta, and in one case a model was produced for early IP evaluation even before beta. Another recent example is the Cortex-R52, announced in September: the Cycle Model was available and being used by ARM partners before the announcement, and it is now available to all users on ARM IP Exchange.

 

ARM Techcon is now much different for the Cycle Model team. Instead of learning about new IP at the conference, Cycle Models are available when the IP announcements are made. This week ARM announced the Cortex-M33 and Cortex-M23 CPUs and the Cycle Models are part of the announcement.

 

Early model availability has opened up new opportunities for Cycle Models to be used in IP evaluation, configuration, and early SoC design. For ARM partners considering licensing new IP, ARM Fast Models and Cycle Models are a great way to get hands-on experience to understand the architecture and performance improvements. In the future, models will play a bigger role in helping ARM demonstrate new IP to partners and in making it easier for partners to select the right IP for new projects.

 

ARM Cycle Models are 100% cycle accurate models and are ideal for hardware design and performance analysis. Typical usage:

  • IP selection and configuration
  • Analysis of HW/SW interaction
  • Benchmarking and optimization

 

ARM Fast Models are instruction accurate models running at speeds up to 200 MHz and are ideal for software development and testing. Typical usage:

  • Functional software debugging
  • Software profiling and optimization
  • Software validation and continuous integration testing

ARM SystemC Cycle Models now available

$
0
0

CM2.png

We have some exciting updates to share related to ARM Cycle Models.

 

New SystemC Models

 

ARM has released new SystemC Cycle Models on ARM IP Exchange. This marks the first time cycle accurate models are available from ARM for SystemC simulation. Along with the new models, a number of example systems are available on ARM System Exchange. Users now have the choice of using SoC Designer or SystemC for creating virtual prototypes. This also gives EDA partners who offer SystemC-based simulation and virtual prototyping another option for ARM models.

 

Cycle Models are compiled directly from ARM RTL and retain complete functional and cycle accuracy. This enables users to confidently make architecture decisions, optimize performance or develop bare metal software. Cycle Models are used for IP evaluation, system architecture, performance optimization, and software development.

 

Improving Model Availability and Choice

 

One of the key Cycle Model improvements made in 2016 has been model availability. Recently, I summarized the difference between attending ARM Techcon before and after joining ARM. During 2016, more models were made available to ARM partners much sooner in the design process. This makes it possible to use models and virtual prototyping tools during the IP evaluation phase of a project to understand the behavior and performance of the IP. Cycle Models aim to meet the requirements needed to analyze the combination of hardware and software.

 

 

Requirements for pre-silicon performance analysis and benchmarking:

  • Cycle accurate simulation
  • Access to models of candidate IP
  • Ability to generate multiple configurations of candidate IP
  • Tools to quickly assemble multiple systems for exploration
  • Ability to run and debug software
  • Analysis tools to understand performance and make decisions based on results

 

 

SystemC Models

 

SystemC models are the second major improvement for 2016. ARM is committed to providing models to both silicon and systems partners for virtual prototyping. Cycle Models are important to ARM partners for a number of reasons. They provide an easier way to work with ARM IP and are flexible enough to work in a variety of simulation environments.

The following SystemC Cycle Models are now available on ARM IP Exchange:

 

ARM IP Exchange also has a new Request Model menu item to request models which are not currently available.

 

SoC Designer 9.0.0

SoC Designer version 9.0.0 is now available on ARM IP Exchange and supports the latest ARM Fast Models, version 10.2.

Cycle Model Studio 9.0.0

Cycle Model Studio version 9.0.0 is also available on ARM IP Exchange.

 

With this release, the Cycle Model Compiler includes partial support for SystemVerilog interfaces. For details about supported and unsupported SystemVerilog interface features, please refer to the Cycle Model Compiler User Manual.

 

Licensing Changes

One very important note: Carbon licenses will not work with this release and new ARM licenses are required.

 

As part of the transition of Cycle Model products from Carbon to ARM, all Cycle Model products have transitioned from using the carbond license daemon to the armlmd license daemon. This change applies to Cycle Model Studio, SoC Designer, and all Cycle Models for SoC Designer and SystemC. This transition aligns Cycle Model products with other ARM software development tools, simplifying the license generation process and management of licenses.

 

For more information: http://www.armipexchange.com/NOTICE_%20ARM%20Cycle%20Model%20License%20Transition.html

 

Branding Changes

All products have updated branding as part of the transition from Carbon to ARM. No functional changes are expected, but products look different from previous versions. The installation file names and some documentation file names have also changed.

 

Summary

 

Improved model availability for Cycle Models means more partners can take advantage of models earlier in the design cycle. The addition of SystemC models has given users more choice for simulation and virtual prototyping products.

Measuring the impact of branch prediction for Cortex-R7 and Cortex-R8

$
0
0

The ARM® Cortex®-R7 and Cortex-R8 processors are the most advanced processors for modem and storage designs. One of the great things about the ARM architecture is the software compatibility between different cores. This is great for software reuse, but it can be challenging to identify the differences between the various Cortex-R family members. For example, consider a design using the Cortex-R4 or Cortex-R5 and upgrading to the Cortex-R7 or Cortex-R8. The compatibility of the ARMv7-R architecture means the software will be compatible, but there will be some important differences. An example of this is branch prediction. ARM Cycle Models provide a good way to look at the impact of branch prediction on the Cortex-R8.

 

Introduction to branch prediction

 

Branch prediction is used in microprocessors to anticipate program flow and keep the pipeline running efficiently. There are many ways to implement branch prediction, and there is usually a trade-off between better prediction results and the additional hardware needed to do the prediction.

 

The Cortex-R8 uses both static and dynamic branch prediction techniques. Static branch prediction is based on decoding instructions and can be used on new code that has not been executed before, but because decisions come later in the pipeline there is not as much time to speculatively fetch code at the branch target address.

 

Dynamic branch prediction can be used when code has been executed before. The Cortex-R8 will speculatively fetch instructions based on execution history. Dynamic branch prediction will predict:

  • Whether there is a branch instruction at a given address
  • Unconditional and conditional branches
  • Loops, function calls, and function returns
  • Target address and/or state of destination (ARM/Thumb)
  • Direction of conditional branch (taken or not taken)

 

Dynamic branch prediction uses an extra memory buffer to keep track of history.

 

Branch prediction is generally a good thing, and like caches, the hardware provides improved performance without software needing to pay much attention. The main task of system software is to make sure the maximum benefit of the hardware is realized.

 

Enabling branch prediction

 

One of the key things to understand when writing software for a CPU is what kind of programming is required to get the best performance. Some hardware features for performance improvement are automatically enabled, but can be disabled by software. Other features start disabled, and can be enabled to improve performance. Many times features may start disabled because special care is required to enable them, such as invalidating a cache.

 

In the early days of providing cycle accurate models for ARM CPUs, the most common user question was how to enable caches. Architects were interested in doing performance analysis and just wanted to run the CPU at its maximum performance and see what happened. They didn't always have the experience to figure out how to do the low-level programming required to configure the CPU for best performance. Branch prediction is similar, and this is why all ARM Cycle Models come with example software to help with the hardware configuration.

 

The branch prediction feature in the Cortex-R4 and Cortex-R5 is automatically enabled at reset. There is nothing software needs to do to get the maximum performance from this feature. However, the Cortex-R7 and Cortex-R8 do not enable branch prediction automatically at reset. This means software must enable branch prediction to get the maximum hardware performance.

 

A good place to get the details of hardware programming is in the DS-5 examples/ directory. The file Bare-metal_examples_ARMv7.zip contains example startup code for many different CPUs. Extracting the file and looking at the code for each CPU provides insight into how to make sure the maximum performance is realized. The file startup_Cortex-R8/startup.s shows how to set the Z bit to enable branch prediction.

 

The Z bit is located in the system control register, SCTLR, and is bit 11.

z-bit.PNG

 

The example code to enable branch prediction is below.

 

bp-src.PNG
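The screenshot shows the assembly sequence from startup.s. For reference, an equivalent sketch in C using inline assembly (the function name is mine, not from the example code) looks like this:

    /* Set SCTLR.Z (bit 11) to enable branch prediction on Cortex-R7/Cortex-R8. */
    static inline void enable_branch_prediction(void)
    {
        unsigned int sctlr;
        __asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));  /* read SCTLR */
        sctlr |= (1u << 11);                                         /* set the Z bit */
        __asm volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(sctlr));  /* write SCTLR */
        __asm volatile("isb");                                       /* ensure the update takes effect */
    }

A call like this belongs early in the reset handler, before the code being measured runs, so the entire benchmark benefits from prediction.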

 

This means it is important to make sure branch prediction is enabled when a product which used the Cortex-R4 or Cortex-R5 migrates to the Cortex-R7 or Cortex-R8. Although this sounds easy, it's not obvious for two reasons. First, without any action the software will "just work" across different CPUs. Second, there is no single place to see "what's different" when comparing CPU A to CPU B. Such a comparison is only meaningful relative to the specific CPUs involved, and because many ARM CPUs make incremental improvements it can be a bit of a hunt to identify the differences.

 

Benchmarking results

 

The impact of branch prediction can be evaluated using Cycle Models. The general technique for measuring performance with Cycle Models is to run to a breakpoint marking an interesting starting point and enable profiling, then continue execution to another breakpoint marking the end of the interesting code. At the second breakpoint, turn off profiling and study the results.

 

All of this can be scripted and automated if needed to gather results using different compiler optimization switches.

 

To start, build the software with and without branch prediction enabled. Run the Cortex-R8 CPAK and use a software debugger to set a breakpoint at the interesting starting point. Cycle Models allow connections from any debugger which supports CADI, including Model Debugger and DS-5.

 

bm1.PNG

 

At the end of the interesting section of code, look at the metrics. The simplest metric for measuring the impact of branch prediction is the number of simulation cycles required to run the same software. If branch prediction is helping, the cycle count will be lower.

 

A Cortex-R8 quad-core CPAK was used to experiment with branch prediction for the CPU. It can be run on either Windows or Linux using SoC Designer. For more details of this and other CPAKs visit ARM System Exchange.

 

r8-sys.PNG

 

To evaluate the impact of branch prediction, Dhrystone, CoreMark, and Whetstone were run with and without branch prediction enabled. The table below shows the number of cycles required to run the same code in each case.

 

 

                          Branch prediction off    Branch prediction on
Dhrystone cycle count     380,240 cycles           155,675 cycles
CoreMark cycle count      1,123,107 cycles         580,431 cycles
Whetstone cycle count     22,143,393 cycles        15,535,307 cycles
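In relative terms, enabling branch prediction reduces the Dhrystone cycle count to roughly 41% of the baseline (155,675 / 380,240), CoreMark to about 52%, and Whetstone to about 70%.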

 

SoC Designer offers easy, non-intrusive performance analysis features to inspect PMU events, bus activity, and software execution.

 

Below are two views of the branch prediction PMU events. The first is for CoreMark with branch prediction off and the second is the same CoreMark run with branch prediction on. The number of mispredicts is very high when the Z bit is 0 and very low when the Z bit is set to 1. The numbers in the "Total" column represent the total number of events for the entire simulation.

 

bp-off.PNG

 

bp-on.PNG

Many other events can also be captured using the profiling features of SoC Designer.

 

Conclusion

 

ARM Cycle Models can be used to study a variety of system changes. Common uses are comparing different CPUs, experimenting with cache size or branch prediction, analyzing memory subsystem design, and more. The impact of Cortex-R8 branch prediction on three benchmarks was demonstrated and the importance of making sure all hardware features are enabled for best performance was highlighted.