Channel: Jive Syndication Feed

Unified Virtual Prototypes: Fast and Accurate?


Over the last five years, one of the concerning things about virtual prototyping has been the separation between speed and accuracy. The lack of any meaningful connections between models built for speed and models built for accuracy motivated me to look for better ways to address this gap. I joined Carbon because I believe the company is in the best position to address both architectural analysis and embedded software in a unified way.

 

My Background

 

When I first started working in the area, I was focused on virtual prototypes for early software development. I thought the only practical way to run operating systems, device drivers, and test applications was to use abstract models built for speed.

 

When I attended a conference, such as DAC or a SystemC User Group meeting, I noticed a common pattern. About half of the presentations were about early software development, and the other half were about architectural analysis. There was rarely a connection between the two, even though the most commonly-asked questions concerned running software and doing architectural analysis based on the hardware/software combination.

 

Somewhere along the way, I decided performance analysis wasn’t as interesting to me, not because I don't like computer architecture or hardware design, but rather because SystemC TLM-2.0 AT (approximately timed) models didn't seem like a practical way to create accurate models in a timely manner. I thought the accuracy solution should always be to use the RTL itself – an approach that has always worked well for emulation and FPGA prototyping.

 

Besides building virtual prototypes for early software development, I have done RTL simulation and emulation and spent time running software on RTL processors. Although these environments are cycle accurate, there had never been an effective way to debug software using a non-intrusive software debugger, see transaction-level views of hardware accesses, and observe a good programmer’s view of registers and memory. I have seen attempts to debug software by simulating the behavior of a processor’s debug access port, or by running an abstract shadow model to provide the software debugging connection; neither approach struck me as the right solution.

 

This dichotomy between the two use cases for virtual prototyping made the sales process more difficult than it already was. I always ended up in a room full of engineers who introduced themselves as “system architect,” “bus architect,” “memory controller designer,” and various other hardware-related titles. All important jobs, but nothing to do with software, which had always been my focus. I concluded that instead of trying to focus on just one use case, it was necessary to provide a unified approach that provides speed when you want it and accuracy when you need it. If this could be done, the market for virtual prototyping could be larger, and the flow for users could be simplified.

 

Over time I realized that a better approach was needed to address the full spectrum of a project’s needs, rather than segmenting projects into architectural analysis and early software development. Could the three kinds of virtual prototypes (see Gary Smith’s ESL Behavior Design diagram) be mapped to a single, comprehensive, effective solution?

 

Why Carbon?

 

I came to Carbon to join the company’s quest to provide a virtual prototype solution that addresses both architectural analysis and embedded software in a unified way: when accuracy is needed, by using models based on the actual design RTL, and when speed is needed, by implementing abstract models.

 

Accurate models shouldn’t be a bunch of HDL files with signals, wires, and registers that nobody can understand. Instead, they should look more like the simplified view I’m used to - abstract models with instrumented views for important things like registers, memories, and pipelines. It should be possible to run fast when required, and to run with accuracy when required. Additionally, you shouldn’t have to start over with a different simulator, or lose your non-intrusive software debugging features when presented with a new CPU model that has a different debugger interface.

 

The second reason I came to Carbon was to work with the latest IP from all of the excellent Carbon partners. Last week I was watching a recorded Google+ Hangout with Peter Greenhalgh, the Lead Architect for the ARM Cortex-A53 processor. It was interesting to hear him discuss the details of the design. He shared what kind of architecture he would use if he had the chance to design an SoC for a mobile device. He also answered a long set of questions about the Cortex-A53.

 

 

The amazing thing is I can easily run a simulation of a system using an A53 ARM Fast Model and see how it actually works. I can run a software program, see the new instruction set, see the 64-bit wide registers, see the exception levels, and more. I can also build my own configuration of the A53 on Carbon IP Exchange and run a simulation and REALLY see how the A53 works at a cycle accurate level. Although the Cortex-A53 is a complex processor, the diagram below shows how easy it is to instantiate and run.

 

[Figure: Cortex-A53 system instantiated in SoC Designer]

 

Although I don’t have all of the answers to the challenges that are ahead after just one month, I’m excited to continue the quest to help users build high quality, optimized chips and systems by making use of virtual prototypes for performance analysis and early software development.

 

Jason Andrews


Running the Latest Linux Kernel on a Minimal ARM Cortex-A15 System


Linux® has become a popular operating system in embedded products, and as a result is experiencing increased usage for system design and verification. This means that a new set of people, who are not traditional embedded software engineers, need to learn to work with Linux to accomplish daily tasks.

 

For system design tasks, Carbon users generally proceed from bare-metal software benchmarks such as Dhrystone and CoreMark to running benchmark applications on an operating system. If the design uses ARM® Cortex™-A series processors, the most common OS choice is Linux. Linux has come a long way with respect to the ARM architecture since Linus Torvalds made his famous complaint about ARM support back in March 2011. The Linux Device Tree has made it very easy for those of us who work with simulated hardware, and often partial systems, to run Linux with almost no changes to the kernel source code.

 

As I mentioned in my previous blog, I recently joined Carbon.  I decided the best way to learn would be to use the tools the way a customer would.  I’ll talk below about the steps I took to get Linux running on a minimal system using the ARM Cortex-A15 processor.  If you’re using a Cortex-A15, you’ll get even more benefit as this platform is available for download as a Cortex-A15 CPAK.


Why a Minimal System?

 

It’s easier to start with a small system and iteratively add complexity – add peripherals, enable drivers, and verify at each phase that the system functions as expected. Starting off with a larger system introduces the difficulty of extracting hardware; it can be tricky to identify the software that relies on the hardware you want to remove.

 

It has also been my observation that many engineers designing and using leading edge SoCs are keenly interested in the performance details of key parts of the chip, such as the processors, interconnect, memory controller, and GPU. These processor sub-system architects don’t always want or need a lot of slower peripherals that are typically assumed to be present in systems running Linux.


Initial Challenges

 

The primary goal is to identify the minimum hardware needed to run Linux on an ARM Cortex-A15. It should be possible to run Linux with nothing more than the A15 CPU, memory, and a UART, so I set out to try it.

 

A secondary goal is to use the latest kernel from kernel.org and change as little of the Linux source code as possible. Minimizing source code changes makes it easier to update to new versions of Linux as they are released.


Methodology

 

Simulation speed is far more important than hardware accuracy when experimenting with Linux configurations to confirm a working kernel. This is an ideal situation in which to generate an ARM Fast Model design from the Carbon SoC Designer Plus canvas. After the kernel is working with the Fast Model design, it is easy to run the Carbon cycle accurate simulator, sdsim, for benchmarking and utilize Swap & Play to confirm a fully working virtual prototype that is both fast and accurate.

 

To meet the goal of changing as little kernel source code as possible, I started from a currently-supported platform, the ARM Versatile™ Express, then configured the kernel to use only the minimal hardware. New kernel versions will continue to support Versatile Express, making future upgrades easy to do.


Hardware Design

 

I decided to call my new hardware design the a15mini to indicate a Cortex-A15 system with minimal hardware. The Versatile Express design requires specifying the memory map and interrupt connections. The memory for the Cortex-A series Versatile Express is from 0x80000000 to 0xffffffff. The first PL011 UART is located at 0x1c090000 and uses interrupt 5 on the A15 IRQS[n:0] input request lines connected to the internal Generic Interrupt Controller (GIC). The Cortex-A15 needs to have the base address for the internal memory mapped peripherals (PERIPHBASE) set to 0x2c000000. The only other relevant information is that the UART runs from a 24 MHz reference clock.
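For quick reference, the memory map described above, which the rest of this article relies on, is:

0x80000000 - 0xffffffff   DRAM
0x2c000000                Cortex-A15 PERIPHBASE (internal GIC)
0x1c090000                PL011 UART0, interrupt 5, 24 MHz reference clock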

 

Creating the a15mini with SoC Designer consists of instantiating the models and connecting them on the canvas using sdcanvas. Using cycle accurate models means more detail is needed to create the design. Instead of just the CPU, simple address decoder, and memory, I used the ARM CCI-400 Cache Coherent Interconnect and the NIC-301 interconnect. These models, along with the Cortex-A15, are built using Carbon IP Exchange, the web-based portal that builds models directly from ARM RTL code. It’s pretty amazing to think that I answer a few questions or submit an XML file from AMBA Designer and get back a simple-to-use model in the form of a .so file, knowing that it was generated from a very complex RTL design of a processor like the Cortex-A15. It’s as if I’m using millions of lines of Verilog code and I never see any of it.

 

The design is shown below:

 

[Screenshot: ARM Cortex-A15 Minimal Linux CPAK design in SoC Designer]

Linux Preparation

 

Creating a Linux image for the a15mini is a little more complex than the hardware design procedure.

There are many ways to prepare Linux, but at a minimum the following items are needed:

  • Boot Loader
  • Kernel Image
  • Device Tree Blob
  • File System

For this experiment I decided to make things as easy to work with as possible. The most straightforward way was to use a single executable file (ELF file) containing all of the above items; anybody who wants to run the platform needs only this one file to represent all of the artifacts. A drawback of this approach is that the file must be regenerated when any of the items changes, but it creates a generic solution that can be run on any kind of simulator.


Kernel Image


I downloaded Linux 3.13.1 from kernel.org as the starting point. This was the latest kernel at the time I did the initial simulation, but new versions are released frequently.


Kernel Configuration


The default configuration for the Versatile Express is found in the Linux source tree at arch/arm/configs/vexpress_defconfig. First, I use the Versatile Express configuration as the baseline by running:


$ make ARCH=arm vexpress_defconfig


File System


To get up and running quickly, I cheated - I copied the file new-buildroot-rootfs.cpio.gz from the Carbon ARM Cortex-A9 Linux CPAK and renamed it fs.cpio.gz. (A future article may cover the various ways to make file system images.)


Customizing Kernel Configuration


To create the single executable file with all of the needed artifacts, I needed to embed the file system image in the kernel, and append the Device Tree Blob to the end of the kernel image. To embed the file system image in the kernel, you can use any of the Linux configuration interfaces. I tend to use menuconfig:


$ make ARCH=arm menuconfig


I navigated to the General Setup menu (see the image below), scrolled down to “Initramfs sources file(s)” and added the name of the file system image, fs.cpio.gz. I put this file at the top of the Linux source tree so no additional path is needed.


[Screenshot: General Setup menu with the Initramfs source file(s) entry set to fs.cpio.gz]


To append the Device Tree Blob to the end of the kernel image, access the Boot options menu item “Use appended device tree blob to zImage (EXPERIMENTAL).” I enabled this to append the .dtb file to the end of the zImage file (see the image below).


[Screenshot: Boot options menu with the appended device tree blob option enabled]


While I was in the Boot options menu I also set the Default kernel command string by adding root=/dev/ram and earlyprintk. This specifies a RAM-based root file system, the only possible choice since no other storage is included in the hardware design. There are many ways to set the default kernel command string, but this approach works well for this application, in which we want to link everything into a single ELF file.
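If you prefer to check the resulting .config directly, the three settings described above correspond to these kernel configuration symbols (the file system name and command string are the ones used in this article):

CONFIG_INITRAMFS_SOURCE="fs.cpio.gz"
CONFIG_ARM_APPENDED_DTB=y
CONFIG_CMDLINE="root=/dev/ram earlyprintk"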


Source Code Changes


The bad news is I wasn’t able to run the 3.13.1 kernel on the a15mini without any kernel source code changes. The good news is I came within 1 line of the goal. I needed to edit the file arch/arm/mach-vexpress/v2m.c to remove the line that configures the kernel scheduler clock to read a time value from the Versatile Express System Registers (this peripheral has a register which provides a time value).  To achieve my goal of including a minimum of hardware, I wanted to forgo the System Registers.


The line I removed is in the function v2m_dt_init_early() and is line number 423, the call to versatile_sched_clock_init().


[Screenshot: the removed call to versatile_sched_clock_init() in arch/arm/mach-vexpress/v2m.c]
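As a sketch of the edit (the surrounding kernel code is elided and the exact context depends on the kernel version; check your own copy of v2m.c):

static void __init v2m_dt_init_early(void)
{
        ...
        /* Removed for the a15mini: the scheduler clock would otherwise be
           read from the VE System Registers, which this design omits. */
        /* versatile_sched_clock_init(...); */
}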


Compilation


Now the source tree could be compiled. I work on Ubuntu 13.10 and use the cross-compiler with the GNU prefix arm-linux-gnueabi, so my compile command was:


$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- -j 4


The most important compilation result is the kernel image file arch/arm/boot/zImage.


Device Tree Blob


The Versatile Express device trees for ARM Fast Models are available from linux-arm.org. The easiest way to get the files is to use git:


$ git clone git://linux-arm.org/arm-dts.git


The file that is the closest match for the a15mini is rtsm_ve-cortex_a15x1.dts in the fast_models/ directory.


This file also includes a .dtsi file named rtsm_ve-motherboard.dtsi. The main work to support the a15mini hardware system was to modify these two device tree files so they match the hardware available by removing all of the hardware that doesn’t exist.


So first, I edited rtsm_ve-motherboard.dtsi to remove all of the peripherals that are gone from the a15mini, such as flash, Ethernet, the keyboard/mouse controller, three extra UARTs, the watchdog timer, and more. I left the original structure, but shrunk the file all the way down to just the UART!
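For reference, the surviving UART node looks roughly like the sketch below. Keep the exact property values from your copy of rtsm_ve-motherboard.dtsi; the label, clock handles, and addresses here are only illustrative.

v2m_serial0: uart@090000 {
        compatible = "arm,pl011", "arm,primecell";
        reg = <0x090000 0x1000>;
        interrupts = <5>;
        clocks = <&v2m_clk24mhz>, <&v2m_clk24mhz>;
        clock-names = "uartclk", "apb_pclk";
};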


Architected Timer Support


Linux needs a timer to run. I removed the SP804 timers that are present in the Versatile Express and the System Register block that provides the time value for the scheduler clock. In their place, I wanted to use the internal timer in the A15, sometimes called the Architected timer, to keep the hardware design as small as possible.


Support for the ARM Architected timer is already present in rtsm_ve-cortex_a15x1.dts, but I found that it didn’t work right away when I commented out the call to versatile_sched_clock_init(). The kernel crashed at the point of setting up the Architected timer, and printed a message that there was no frequency available. By looking through other device tree files, I found that some armv7-timer entries had a clock-frequency attached to them, while the Versatile Express entry did not. After adding the clock-frequency to the timer in rtsm_ve-cortex_a15x1.dts, the timer worked, and Linux works as expected with the internal timer. No external timer is needed!


[Screenshot: armv7-timer device tree entry with the added clock-frequency property]
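In device tree terms, the change amounts to adding a clock-frequency property to the armv7-timer node, something like the sketch below. The interrupt lines are typical architected-timer PPIs (keep whatever your .dts already has), and the frequency value is illustrative; it must match your platform.

timer {
        compatible = "arm,armv7-timer";
        interrupts = <1 13 0xf08>,
                     <1 14 0xf08>,
                     <1 11 0xf08>,
                     <1 10 0xf08>;
        clock-frequency = <100000000>;  /* added: timer frequency in Hz */
};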


I have attached the two device tree files so you can take a look at them as needed.


Using the Device Tree Compiler


After the two device tree files have been shrunk down to support only the hardware available on the a15mini, they can be compiled into the device tree blob using the device tree compiler, dtc, which is available in the scripts/dtc directory of the kernel source tree.


I did not add anything to my PATH to find dtc; I just reached over into the kernel source tree and ran:

 

$ ../linux-3.13.1/scripts/dtc/dtc -O dtb -o rtsm_ve-cortex_a15x1.dtb fast_models/rtsm_ve-cortex_a15x1.dts


Adding Device Tree to Kernel


To use the feature that appends the device tree to the Linux kernel image, I simply copied the zImage from the kernel tree.


$ cp linux-3.13.1/arch/arm/boot/zImage  .


Then concatenated the device tree blob to the end of the kernel image:


$ cat arm-dts/rtsm_ve-cortex_a15x1.dtb >> zImage


Boot Loader


There are many ways to create a boot loader, but to meet the goal of a single, all-inclusive executable file, a small assembly boot loader is the best choice. A great example is available in the ARM Fast Models ThirdParty IP package. This is an add-on to ARM Fast Models which contains the open source software used by the Fast Models examples. The majority of the package is the Linux source trees and file system images for the different examples.


I selected two really useful files from the RTSM_Linux source code that comes with the ThirdParty IP package:

  • boot.S - The assembly file that serves as the boot loader. It’s small and easy to understand, and much easier to work with in cases where a full boot loader like u-boot is not needed.
  • model.lds - A linker script that specifies how to link boot.o (compiled boot.S) and zImage into a single executable file.


The addition of these two files, combined with the modified zImage (including embedded file system and concatenated device tree) means that everything is now present in the single executable file.


I also made some minor adjustments to the Makefile, inserting my paths, compiler name, and file names so it would generate the file a15-linux.axf as the final output that will be used in simulation. The updated Makefile is shown below:


[Screenshot: updated boot loader Makefile]
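Since the screenshot may not render here, conceptually the Makefile boils down to two steps. The commands below are my own illustration rather than the actual rules (which ship with the Fast Models ThirdParty IP package); they assume the same arm-linux-gnueabi- toolchain used earlier and that model.lds takes care of placing the zImage.

$ arm-linux-gnueabi-gcc -c -o boot.o boot.S

$ arm-linux-gnueabi-ld -o a15-linux.axf boot.o --script=model.lds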


Running Simulation


Because the interconnect is more complex than a simple address decoder, there are two ways to confirm the design is correct.

  • Run Linux on a fast version of the design
  • Run a small software program in cycle accurate mode

 

I would recommend both methods to make sure the design is functioning properly and there are no errors in connecting the interrupt, setting CPU model parameters, or other common mistakes made during design construction. My most common mistake is forgetting to set the PERIPHBASE address of the A15 to 0x2c000000.

 

Since I’m focusing on Linux I will describe how to run a fast version of the design.

 

From the Tools menu in sdcanvas select FastModel System…

 

[Screenshot: sdcanvas Tools menu with the FastModel System… item]

 

The Fast Model System Creator dialog will appear as shown below. Clicking the Create button will generate a Fast Model equivalent system automatically from the cycle accurate system.

 

[Screenshot: Fast Model System Creator dialog]

 

This Fast Model system can be run in sdsim using the Simulation -> Simulate System … menu (or F5). When the simulator starts, specify the a15-linux.axf file that was created in the last section and watch Linux boot in a matter of seconds. The terminal below shows Linux 3.13.1 with the machine type reported as Versatile Express.

 

[Screenshot: Linux 3.13.1 booting in the sdsim Fast Model simulation]


Swap & Play

 

Now we have established a working cycle accurate simulation and a Fast Model simulation for the same design, both of which can be run with the SoC Designer simulator, sdsim.

 

Working with Linux on a cycle accurate simulation is a great way to study all of the details of the hardware/software combination, such as bus utilization, cache metrics, cache snooping, barrier transactions, and much more. It’s exciting until you realize that Linux takes about 300M instructions to boot and well over a billion clock cycles.

 

One alternative to waiting for a full cycle accurate simulation is to create Swap & Play checkpoints at various points of interest and then load the checkpoints into the cycle accurate simulation. To create a checkpoint run the Fast Model simulation and then stop the simulator using either a breakpoint or just hitting the Stop button at the place you want to stop, such as the Linux prompt. Use the File -> Save As menu in sdsim and select Swap & Play Checkpoint (*.mxc) as shown below:

 

[Screenshot: Save As dialog with Swap & Play Checkpoint (*.mxc) selected]

 

The next dialog will ask for a name of the checkpoint and a location for the file. Enter any name and hit OK to save the checkpoint.

 

To load the checkpoint into the cycle accurate simulation load the design into sdsim and then use the File -> Restore checkpoint view…

 

Select the checkpoint saved in the Fast Model simulation and it will load into the cycle accurate simulation. There is no need to even load an image file for this case since it will be restored by the checkpoint. The disassembly window and the register window will show the same location that was saved from the fast model simulation. I saved a checkpoint at the Linux prompt and restored it into the cycle accurate simulation and I can even see I’m sitting at the WFI instruction in the Linux idle loop.

 

[Screenshot: disassembly and register windows after restoring the checkpoint, stopped at the WFI instruction]

 

A debugger such as RealView or DS-5 can also be connected to start source level debugging.

 

An alternative trick I use is the addr2line utility to find out what code the Disassembly or Register window is showing. This is useful to just take a peek at the current location without starting the full debugger. For the disassembly window above I do:

 

$ arm-linux-gnueabi-addr2line -e vmlinux 0x80014524
/home/cds/jasona/kernel.org/linux-x1/linux-3.13.1/arch/arm/mm/proc-v7.S:73

 

Now I can see the source code for the current location as a quick check. Sure enough, I’m sitting at the Linux idle loop as expected since nothing is happening sitting in the shell at the prompt. Here is the code:

 

[Screenshot: source code at arch/arm/mm/proc-v7.S line 73, the Linux idle loop]

 

It’s easy to imagine using swap & play to load checkpoints and do cycle accurate debugging as well as performance analysis for benchmarks. Carbon users typically run benchmarks such as Dhrystone, CoreMark, and LMbench as Linux applications.

 

So far, this was done using a single core A15 CPU. It’s interesting but there are no actual systems that use a single core A15. Next time I will show how to extend the a15mini for multi-core simulation by moving to the ARM Cortex-A15x2 CPU and running SMP Linux, again with as little hardware as possible. The dual-core A15 matches one of my favorite machines, the Samsung Chromebook.

 

As I mentioned at the beginning, all of the work that I’ve done here porting Linux and setting up a system which can be used with Swap & Play is available as a CPAK on Carbon IP Exchange.  You can use this system to port your own version of Linux or customize the hardware configuration to match that of your own design.

 

Jason Andrews

Cycle Accurate ARM Cortex-A53 and Cortex-A57 Models Support AArch64


We (Carbon Design Systems) have just completed our first major release of 2014. It includes significant new content and many bug fixes to all products. Today, I would like to highlight the updated models of the ARM® Cortex™-A53 and Cortex-A57.

 

The latest ARM CPU models provide access to the ARMv8-A architecture. Since this is the first time Carbon models have supported the 64-bit architecture (also called AArch64 execution state), this is a great time to start learning some of the basics of the new architecture and get familiar with running and debugging software. The ARMv8 architecture is the largest change in the history of ARM!

 

It has been nearly 10 years since I wrote the book Co-Verification of Hardware and Software for ARM SoC Design. At that time, the ARM926EJ-S was a popular CPU for mobile devices (it’s now called a Classic Processor) and AMBA AHB was the main bus on many chips. Tremendous change over this 10 year period has brought us multi-core, multi-cluster systems utilizing Cortex-A7 and Cortex-A15 with ACE. Even with all of this change, certain things from 10 years ago still apply to an A15 system. For example, when address 0x10 appears on the bus and the simulation stops doing useful things I know it's a data abort and it probably means the software accessed a non-existent address in the system. This was the same in the ARM7TDMI as it is for the Cortex-A15. Now, the changes are more significant with the 64-bit ARM architecture (AArch64), and there is a lot to learn.

 

This is a great time to start learning the new architecture because many companies are currently working on new designs with A53 and A57 processors. One good place to start is to review some of the ARM presentations that are available. The ARMv8 Technology Preview is a good example of one I have looked at. Today, I’ll cover the basics of compiling and running bare metal software using the Carbon Performance Analysis Kit (CPAK) for the Cortex-A53.


Compiling Programs

 

The first thing to learn about v8 is how to compile programs. A new instruction set requires a new compiler, or at least a new compiler backend to generate the code. Many Carbon users will start with bare metal software programs and compile them using the ARM Compiler toolchain, also commonly called armcc. The programs in the CPAK were compiled with DS-5 version 5.17.1.

 

64-bit processors such as the A53 can also run in AArch32 state, so a wide variety of choices is available when compiling a bare metal program. The --cpu flag is used to select the desired target architecture for armcc. A complete list is provided in the ARM Compiler Reference Guide.

 

Some of the programs that are included in the Carbon A53 CPAK use the --cpu=8-A.64.no_neon flag in the Makefile to select the target architecture. This CPAK is a good place to start to see the details of how to compile bare metal software.
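As a minimal illustration (the file name is hypothetical), compiling a single AArch64 object with ARM Compiler 5 looks like this:

$ armcc --cpu=8-A.64.no_neon -c hello.c -o hello.o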

 

In the Applications/ directory of the CPAK some examples are compiled for both AArch64 and AArch32. The Linux file command can be used to show the difference:

$ file DecrDataMP/decrDataMP.axf DecrDataMP_v8/decrDataMP_v8.axf

DecrDataMP/decrDataMP.axf:       ELF 32-bit LSB executable, ARM, version 1 (SYSV), statically linked, not stripped

DecrDataMP_v8/decrDataMP_v8.axf: ELF 64-bit LSB executable, version 1 (SYSV), statically linked, not stripped

 

It looks like the file command on my Ubuntu 13.10 machine doesn’t know the 64-bit program is for an ARM processor, but don't worry it will simulate fine.


Debugging Programs      

 

The most immediate difference when loading and debugging a 64-bit program is in the general purpose registers. All of the registers are 64 bits wide and are named Xn instead of Rn. Previously, R0 was 32 bits; now X0 is 64 bits. Below is a screenshot of the SoC Designer Plus register window for a program running on an A53 in AArch64 state. The registers shown are in the AArch64_Core group.

 

[Screenshot: SoC Designer Plus register window, AArch64_Core group]

 

You will also notice there is a tab on the register window for the AArch32_Core group. These are the familiar 32-bit register values and correspond to the lower half of the AArch64_Core group. A screenshot from SoC Designer is shown below.

 

[Screenshot: SoC Designer register window, AArch32_Core group]

 

The ARM modeldebugger is an easy to use debugger which supports debugging 64-bit programs running on SoC Designer models. I like how the modeldebugger puts a space in the 64-bit register values.

 

[Screenshot: ARM modeldebugger showing 64-bit register values]

 

It's also useful to recognize the references to Wn registers as shown in the disassembly below. Some instructions work on 32-bit operands and will refer to the lower 32-bits of a 64-bit register using the Wn notation.

 

[Screenshot: disassembly with Wn register references]
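A two-line illustration of the difference (not taken from the CPAK): an instruction that names a Wn register operates on the lower 32 bits of the corresponding Xn register, and writing a Wn register zeroes the upper 32 bits.

ADD  w0, w1, w2    // 32-bit add: result in W0, upper half of X0 is cleared
ADD  x0, x1, x2    // 64-bit add: uses the full 64-bit registers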

 

One benefit of AArch64 is that the PC is easy to find; there is no need to remember that the PC is R15.

 

The other useful register for basic debugging is the stack pointer. The bare metal examples included in the A53 and A57 CPAKs run in exception level 3 (EL3) so the register holding the stack pointer is SP_EL3.

 

I hope this brief introduction to compiling and running bare metal software in the AArch64 execution state using the newly updated Carbon CPAKs will help you understand the basics and provide some motivation to learn more about working with 64-bit programs on ARM Cortex-A53 and Cortex-A57 models.

 

Jason Andrews

SMP Linux on a Minimal Dual-Core ARM Cortex-A15 System


Previously, I explained how to create a minimal, single-core ARM® Cortex™-A15 system running Linux®. In this article I will update the hardware design to use a dual-core ARM Cortex-A15 CPU and run SMP (Symmetric Multiprocessing) Linux. While the first system was interesting, I’m fairly certain no single-core A15 systems have ever been built, and most engineers will require multi-core systems for common design and verification tasks.

 

Changes to the Hardware Design

 

Two hardware design changes are needed to enable SMP Linux. First, the CPU must be updated to the dual-core A15. This is a matter of simply updating the CPU model to the A15x2 CPU. For those who have not built models using Carbon IP Exchange, the web-based portal that builds models directly from RTL code, there is a simple configuration page used to select the A15 model parameters. The key value is to make sure the “Number of CPU Cores” is set to 2 as shown in the picture below.

 

[Screenshot: Carbon IP Exchange Cortex-A15 configuration page with Number of CPU Cores set to 2]

 

Once this is configured, and the model is created, it can be instantiated on the SoC Designer canvas in place of the single-core A15 model and connected to the interrupt from the PL011 UART and to the CCI-400 in the same way.

 

The second change is the addition of an extra memory which is used to communicate the starting address to the secondary core. This memory takes the place of the Versatile Express System Registers. Recall in the previous article I explained that 1 line of source code had to be changed to avoid using a timer value from the System Registers. For SMP Linux there is another register (offset 0x30) which is used to pass the jump address to the secondary CPU. For the minimal system, the only behavior needed is a simple memory at the base address of the System Registers, which is system address 0x1c010000. Only offsets 0x30 and 0x34 are used, and the values must be initialized to 0 because the secondary core waits for a non-zero value. When the secondary CPU sees a non-zero value it will jump to the address contained at 0x1c010030. If this address is not 0 at startup the system will not boot properly. SoC Designer simple memory models initialize to 0, so no special handling is needed.

                                                

Linux Changes


The Linux image needs to be recompiled with SMP support. This is done using the normal kernel configuration process.


$ make ARCH=arm menuconfig

 

The option to enable SMP is under the Kernel Features menu. Make sure Symmetric Multi-Processing is selected as shown below.

 

[Screenshot: Kernel Features menu with Symmetric Multi-Processing enabled]

 

After enabling SMP, rebuild the kernel as before, adjusting CROSS_COMPILE to match the prefix of your ARM cross compiler:


$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- -j 4

 

This will create a new zImage which is ready for the dual-core A15 design. The remainder of the steps to prepare the final software image is the same. For the dual-core case I named the final product a15x2-linux.axf and loaded this file into the simulator.

 

Starting the Secondary CPU


The main challenge in running SMP Linux is getting the secondary CPU started. After reset, both CPUs will start running the code in boot.S which is located at 0x80000000. The first step is to determine if the code is running on CPU0 or CPU1. This is done by reading the CPU ID register located in co-processor 15 (CP15). This register is also referred to as the Multiprocessor Affinity Register, MPIDR. It provides information about which core of an MPCore processor the code is running on, and which cluster of a multi-cluster system the code is running on. In this case we have only a single cluster and two cores, so the code simply identifies CPU ID 0 as the primary core and CPU ID 1 as the secondary.

 

[Screenshot: boot code reading the MPIDR to identify the CPU]
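A sketch of the idea (not the actual boot.S from the CPAK) looks like this:

        MRC     p15, 0, r0, c0, c0, 5    @ read the MPIDR
        ANDS    r0, r0, #0x0f            @ Aff0 = core number within the cluster
        BEQ     primary_boot             @ CPU0 continues the normal boot
        B       secondary_wait           @ CPU1 waits to be released by Linux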

 

The primary core finishes the boot loader and immediately starts running Linux, while the secondary core waits in the boot loader for a jump address to be provided at address 0x1c010030.

 

The picture below shows the code for the secondary CPU.

 

[Screenshot: boot loader code executed by the secondary CPU]
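Again as a sketch rather than the real boot loader code, the secondary CPU's wait loop amounts to polling the mailbox until Linux writes a non-zero jump address:

secondary_wait:
        LDR     r1, =0x1c010030          @ mailbox in the System Registers memory
poll:
        LDR     r0, [r1]                 @ Linux writes the jump address here
        CMP     r0, #0
        BEQ     poll                     @ still zero: not released yet
        BX      r0                       @ jump to the kernel's secondary startup code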

 

The primary CPU, which is running Linux, is responsible for releasing the secondary CPU by writing the jump address and sending an interrupt. In the Linux 3.13.1 kernel, this is found in arch/arm/mach-vexpress/platsmp.c at line 225. Putting a breakpoint on this line of code and single stepping will reveal the details. The underlying code will write the jump address and take care of all the details to start the secondary CPU. The well-commented last line in the screenshot below gives the details.

 

[Screenshot: arch/arm/mach-vexpress/platsmp.c code that releases the secondary CPU]

 

Below is a screenshot of the memory contents for the System Registers address range. The primary CPU actually wrote 0xffffffff into address 0x1c010034 and then the 32-bit jump address into 0x1c010030. Because this is a memory view from the perspective of the CPU, I entered the virtual address into the memory viewer window, which is 0xf8010000.

 

[Screenshot: memory view of the System Registers address range at virtual address 0xf8010000]

 

If everything works correctly, there should be some messages in the boot log showing that 2 CPUs are running.

 

[Screenshot: boot log showing both CPUs brought up]

 

As a final check on the platform, I booted the SMP Linux and created a Swap & Play checkpoint. I restored the checkpoint into a cycle-accurate simulation and issued the command:


$ cat /proc/cpuinfo                  

 

The terminal output confirms the state of the simulation was successfully transferred to the cycle accurate simulation and both cores are reported to be running.

 

[Screenshot: /proc/cpuinfo output showing both cores]


Summary

 

In summary, by adding an extra memory model at the location of the Versatile Express System Registers and changing the Linux kernel configuration to enable SMP, we are up and running with a dual-core ARM Cortex-A15 minimal hardware design. This design supports all of the same Swap & Play and benchmarking features as the single-core design and can be extended to add more hardware detail for specific design tasks as needed.

 

As before, all of the work that I’ve done here to prepare Linux and the hardware design is available as a CPAK on Carbon IP Exchange in the Cortex-A15 CPAK area.

 

Jason Andrews

Understanding ARM Bare Metal Benchmark Application Startup Time


One of the benefits of simulation with virtual prototypes is the added control and visibility of memory. It’s easy to load data into memory models, dump data from memory to a file, and change memory values without running any simulation. After I gave up debugging hardware in a lab and decided I would rather spend my time simulating, some of my first lessons were related to assumptions software makes about memory at power on. When a computer powers on, software must assume the data in memories such as SRAM and DRAM is unknown. I recall being somewhat amazed to find out that initialization software would commonly clear large memory ranges before doing anything useful. I also recall learning that startup software would figure out how much memory was present in a system by writing the first byte or word to some value, reading it back, and concluding that memory must be present if the written value was read back. To determine the memory size, the software would keep incrementing the address until the read value no longer matched the expected value.

 

Recently, I was working on a bare metal program and simulating execution of an ARM Cortex-A15 system. Carbon Performance Analysis Kits (CPAKs) come with example systems and initialization software to help shorten the ramp up time for users. Generally, people don’t pay much attention to the initialization code unless it’s either broken or they need to configure specific hardware features related to caches, MMU, or VFP and Neon hardware.

 

Today, I’ll provide some insight into some of the things the initialization code does, specifically what happens between the end of the assembly code which initializes the hardware and the start of a C program.

 

The program I was running had the following code at the top of main.c

 

[Screenshot: top of main.c with the memspace array declaration]
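The screenshot shows something along these lines; this is a reconstruction, not the exact benchmark source, and only the array name and size are taken from the article:

#define MEMSPACE_SIZE 200000

char memspace[MEMSPACE_SIZE];   /* statically allocated, so it ends up in .bss */

int main(void)
{
    /* ... benchmark code that uses memspace ... */
    return 0;
}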

There is an array named memspace with a #define to set the size of the array. When running new software it’s a good idea to understand as much as possible as quickly as possible by getting through the program the first time. One way to do this is to cut down the number of iterations, data size, or whatever else is needed to complete the program and gain confidence it’s running correctly. This avoids wasting time thinking the program is running correctly when it’s not. I normally put a few breakpoints in the software and just feel my way through the program to see where it goes and how it runs.

 

I like to put a breakpoint at the end of the initial assembly code to make sure nothing has gone wrong with the basic setup. Next, I like to put a breakpoint at main() to make sure the program gets started, and then stop at interesting looking C functions to track progress. For this particular program I shrunk the size of the memspace array to 200 bytes for the first pass through the test.

 

After I understood the basics of the program, I put the array size back to the original value of 200000 bytes. When I did this I noticed a strange phenomenon. The simulation took much longer to get to main() when the array was larger, about 8 times longer as shown in the table below.

 

Array size    Cycles to reach main()
200           4860
200000        39174

 

One of the purposes of this article is to shed some light on what happens between the end of the startup assembly code and main(). Obviously, there is something related to the size of the memory array that influences this section of code.

 

Readers that have the A15 Bare Metal CPAK can follow along with similar code to what I used for the benchmark by looking in Applications/Dhrystone/Init.s

 

There are two parts to jumping from the initial assembly code to the main() function in C. First, save the address of __main in r12 as shown below.

 

[Screenshot: Init.s code saving the address of __main in r12]

 

Next, jump to __main at the end of the assembly code by using the BX r12 instruction. After the BX instruction the program goes into a section of code provided by the compiler (for which there is no source to debug) but if all goes well it comes out again at main().
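In sketch form (the real Init.s contains much more hardware setup between these two instructions):

        LDR     r12, =__main         @ save the C library entry point
        ...                          @ remaining hardware initialization
        BX      r12                  @ enter __main, which eventually calls main()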

 

The code starting from __main performs the following tasks:

  • Copies the execution regions from their load addresses to their execution addresses. This is a memory setup task for the case where the code is not loaded in the location it will run from or if the code is compressed and needs to be decompressed.
  • Zeros memory that needs to be cleared based on the C standard that says statically-allocated objects without explicit initializers are initialized to zero.
  • Branches to __rt_entry


Once the memory is ready, the code starting from __rt_entry sets up the runtime environment by doing the following tasks:

  • Sets up the stack and heap
  • Initializes library functions
  • Calls main()
  • Calls exit() after main() completes

 

If anything goes wrong between the assembly code and main(), the most common cause is the stack and heap setup. I always recommend taking a look at this if your program doesn’t make it to main().

 

You may have guessed by now that the simulation time difference I described is caused by the time required to zero a larger array compared to a smaller array. As I mentioned at the start of the article, writing zero to large blocks of memory that is already zero (or can easily be made zero using a simulator command) is a waste of time. Carbon memory models already initialize memory contents to zero by default. Some people prefer a more pessimistic approach and initialize memory to some non-zero value to make sure the code will work on the real hardware, but for users more interested in performance analysis it seems helpful to avoid wasted simulation cycles and get on to the interesting work.

 

The Linux size command is a good way to confirm the larger array impacts the bss section of the code. The zero initialized (ZI) data and bss refer to the same segment. With the 200 byte array:

 

-bash-3.2$ size main.axf

   text        data         bss         dec         hex     filename

  71276          16     721432     792724       c1894     main.axf

 

With the 200000 byte array:

 

-bash-3.2$ size main.axf

   text        data         bss         dec         hex     filename

  71276          16     921232     992524       f250c     main.axf


Alternatives to Save Simulation Time

 

There are multiple ways to avoid executing instructions to write zero to memory that is already zero. It turns out to be a popular question. Search for something like “avoid bss static object zero”.

 

One way is to use linker scripts or compiler directives to put the array into a different section of memory that is not automatically initialized to 0.

 

For the program I was working on I decided to investigate a couple alternatives.

 

One solution is to just skip __main altogether and go directly to __rt_entry since __main doesn’t do anything useful for this program and the change is simple.

 

To skip __main just replace the load of __main into r12 with a load of __rt_entry into r12. Now when the program runs __main will be skipped altogether.

 

[Screenshot: Init.s code loading __rt_entry into r12 instead of __main]
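The one-line change is simply (sketch):

        LDR     r12, =__rt_entry     @ was: LDR r12, =__main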

 

Here are the new results with __main skipped.

 

Array size    Cycles to reach main()
200           4355
200000        4362

 

As expected the number of cycles to reach main() is about the same with both array sizes, and much less than zeroing the large array. Although the difference may seem small for the benchmark I have shown here, the problem gets much bigger when a larger and more complex software program is run. I checked a larger software program and found it was taking more than 10 million instructions to zero memory.

 

I wouldn't recommend just blindly applying this technique, especially on larger software programs, as debugging improperly initialized global variables is not fun.

 

Another possibility to avoid initializing large global variables is to use a compiler pragma. The ARM compiler, armcc, has a section pragma to move the large array into a section which is not automatically initialized to zero. To use it, put the pragma around the array declaration as shown below.

 

[Screenshot: memspace array declaration wrapped in the arm section pragma]
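A sketch of the change using the armcc section pragma; the section name is mine, and whatever name you pick must also be used in the scatter file:

#pragma arm section zidata = "uninit_memspace"
char memspace[MEMSPACE_SIZE];       /* no longer zero-initialized by the C library */
#pragma arm section zidata          /* restore the default section */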

 

After putting in the pragma, one more step is needed. The scatter file for the linker must be aware of this new section. More info is available in the documentation under "How to prevent uninitialized data from being initialized to zero".

 

In my linker scatter file I added one more section:

 

[Screenshot: linker scatter file with the added uninitialized section]
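The added execution region looks something like this sketch; the region name and placement are illustrative, but the UNINIT attribute and the input section selector are the important parts:

    RAM_UNINIT +0 UNINIT
    {
        *(uninit_memspace)
    }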

 

Executing the program with the pragma is a much safer solution, especially when the software is going to write the memory anyway and the initial zero values are not being assumed. With the pragma the number of cycles to reach main is the same with both sizes of the array.

 

Array size    Cycles to reach main()
200           4411
200000        4411

 

The pragma is a good solution if there are a few large arrays that can be found and instrumented with the pragma.

 

Hopefully this article provided some understanding of what happens between the initial assembly code and main(). Although there is no one-size-fits-all solution, it definitely helps to understand and improve application startup time. Next time you find yourself with a program that appears stuck before reaching main(), this just might be the cause.

 

Jason Andrews

Sometimes Hardware Details Matter in ARM Embedded Systems Programming


Last week, I received the call for papers for the Embedded World Conference for 2015. The list of topics is a good reminder of how broad the world of embedded systems is. It also reminded me how overloaded the term “embedded” has become. The term may evoke thoughts of a system made for a specific purpose to perform a dedicated function, or visions of invisible processors and software hidden in a product like a car. When I think of embedded, I tend to think about the combination of hardware and software and learning how they work together, and the challenge of building and debugging a system running software that interacts with hardware. Some people call this hardware dependent software, firmware, or device drivers. Whatever it is called, it’s always a challenge to construct and debug both hardware and software and find out what the problems are. One of the great things about working at Carbon is the variety of the latest ARM IP combined with a spectrum of different types of software. We commonly work with software ranging from small bare-metal C programs to Linux running on multiple ARM cores. We also work with a mix of cycle accurate models and abstract models.

 

If you are interested in this area, I would encourage you to learn as much as possible about the topics below. Amazingly, the most popular programming language is still C, and being able to read assembly language also helps.


  • Cross Compilers and Debuggers
  • CPU Register Set
  • Instruction Pipeline
  • Cache
  • Interrupts and Interrupt Handlers
  • Timers
  • Co-Processors
  • Bus Protocols
  • Performance Monitors


I could write articles about how project X at company Y used Carbon products to optimize system performance or shrink time to market and lived happily ever after, but I prefer to write about what users can learn from virtual prototypes. Finding out new things via hands-on experience is the exciting part of what embedded systems are for me.


Today, I will provide two examples of what working with embedded systems is all about. The first demonstrates why embedded systems programming is different from general purpose C programming because working with hardware requires paying attention to extended details. The second example relates to a question many people at Carbon are frequently asked, “Why are accurate models important?” Carbon has become the standard for simulation with accurate models of ARM IP, but it’s not always easy to see why or when the additional accuracy makes a difference, especially for software development. Since some software development tasks can be done with abstract models, I will share a situation where accuracy makes a difference. Both of the examples in this article looked perfectly fine on the surface, but didn’t actually work.


GIC-400 Programming Example


Recently, I was working with some software that had been used on an ARM Cortex-A9 system. I ported it to a Cortex-A15 system, and was working on running it on a new system that used the GIC-400 instead of the internal GIC of the A15.


People that have worked with me know I have two rules for system debugging:

  1. Nothing ever works the first time
  2. When things don’t work, guessing is not allowed

When I ran the new system with the external GIC-400, the software failed to start up correctly. One of the challenges in debugging such problems is that the software jumps off to bad places after things don’t work and there is little or no trail of when the software went off the path. Normally, I try to use software breakpoints to close in on the problem. Another technique is to use the Carbon Analyzer to trace bus transactions and software execution to spot a wrong turn. In this particular case I was able to spot an abort and I traced it to a normal looking access to one of the GIC-400 registers.


I was able to find the instruction that was causing the abort. The challenge was that it looked perfectly fine. It was a read of the GIC Distributor Control Register to see if the GIC is enabled. It’s one of the easiest things that could be done, and would be expected to work fine as long as the GIC is present in the system. Here is the source code:

[Code screenshot: C function that reads the GIC Distributor Control Register]
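A reconstruction of the kind of code involved; the register address and function name are assumptions for illustration, and the real CPAK code differs in detail:

#define GICD_CTLR_ADDR  0x2c001000u   /* assumed GIC Distributor base address */

int gic_distributor_enabled(void)
{
    /* A byte-sized read compiles to LDRB, which the GIC-400 rejects */
    volatile unsigned char *ctlr = (volatile unsigned char *)GICD_CTLR_ADDR;
    return (*ctlr & 1);
}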

The load instruction which was aborting was the second one in the function, the LDRB:

[Code screenshot: generated assembly with the aborting LDRB instruction]

The puzzling thing was that the instruction looked fine and I was certain I ran this function on other systems containing the Cortex-A9 and Cortex-A15 internal GIC.

 

After some pondering, I recalled reading that the GIC-400 had some restrictions on access size for specific registers. Sure enough, the aborting instruction was a load byte. It’s not easy to find a clear statement specifying a byte access to this register is bad, but I'm sure it's in the documentation somewhere. I decided it was easier to just re-code the function to create a word access and try again.

 

There are probably many ways to change the code to avoid the byte read, but I tried the function this way since the enable bit is the only bit used in the register:

[Code screenshot: reworked C function that generates a word access]
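In reconstructed form, the fix simply makes the access 32 bits wide so the compiler emits LDR instead of LDRB (same caveats as above about the names and address):

int gic_distributor_enabled(void)
{
    volatile unsigned int *ctlr = (volatile unsigned int *)GICD_CTLR_ADDR;
    return (*ctlr & 1);              /* bit 0 is the GIC enable bit */
}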

Sure enough, the compiler now generated a load word instruction and it worked as expected.

 

This example demonstrates a few principles of embedded systems. The first is that the ability to understand ARM assembly language is a big help in debugging, especially when tracing loads and stores to hardware such as the GIC-400. Another is that the code a C compiler generates sometimes matters. Most of the time when using C there is no need to look at the generated code, but in this case there is a connection between the C code and how the hardware responds to the generated instructions. Understanding how to modify the C code to generate different instructions was needed to solve the problem.

 

Mysterious Interrupt Handler

 

The next example demonstrates another situation where details matter. This was a bare-metal software program installing an interrupt handler for the Cortex-A15 processor for the nIRQ interrupt by putting a jump to the address of the handler at address 0x18. This occurs during program startup by writing an instruction into memory which will jump to the C function (irq_handler) to handle the interrupt. The important code looked like this, VECTOR_BASE is 0:

[Code screenshot: C code installing the IRQ handler at address 0x18]
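A reconstruction of the idea; the real CPAK code may differ, and the LDR-from-a-nearby-word trick shown here is just one standard way to vector to a C handler:

extern void irq_handler(void);
#define VECTOR_BASE 0x00000000u

void install_irq_handler(void)
{
    volatile unsigned int *irq_vector  = (unsigned int *)(VECTOR_BASE + 0x18);
    volatile unsigned int *irq_address = (unsigned int *)(VECTOR_BASE + 0x38);

    *irq_address = (unsigned int)irq_handler;  /* handler address, 0x20 past the vector */
    *irq_vector  = 0xe59ff018;                 /* LDR pc, [pc, #0x18] -> loads from 0x38 */
}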

The code looked perfectly fine and worked when simulated with abstract models, but didn’t work as expected when run on a cycle accurate simulation. Initially, it was very hard to tell why. The simulation would appear to just hang, and when the simulation was stopped it was sitting in weird places that didn’t seem like code that should have been running. Using the instruction and transaction traces it looked like an interrupt was occurring, but the program didn’t go to the interrupt handler as expected. To debug, I first placed a hardware breakpoint on a change on the interrupt signal, then I placed a software breakpoint on address 0x18 so the simulation would stop when the first interrupt occurred. The expected instruction was there, but when I single stepped to the next instruction the PC just advanced one word to address 0x1c, with no jump. Subsequent step commands just incremented the PC. In this case there was no code at any address except 0x18, so the CPU was executing instructions that were all 0.

 

This problem was pretty mysterious considering the debugger showed the proper instruction at the right place, but it was as if it wasn’t there at all. Finally, it hit me that the only possible explanation was that the instruction really wasn’t there.

 

What if the cache line containing address 0x18 was already in the instruction cache when the jump instruction was written by the above code? When the interrupt occurred, the PC would jump to 0x18 but get the value from the instruction cache and never see the new value that had been written.

 

The solution was to invalidate the cache line after writing the instruction to memory using a system control register instruction with 0x18 in r0:

[Code screenshot: system control register instruction invalidating the cache line at 0x18]
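In sketch form, with r0 already holding 0x18:

        MCR     p15, 0, r0, c7, c5, 1   @ ICIMVAU: invalidate I-cache line by address
        DSB                             @ ensure the invalidation completes
        ISB                             @ flush the pipeline before the next fetch

If the data cache were enabled for this region, a clean of the corresponding D-cache line would also be needed before the invalidate so the newly written instruction reaches the point of unification.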

Although cache details are mostly handled automatically by hardware and cache modelling is not always required for software development, this example shows that sometimes more detailed models are required to fully test software. In hindsight, experienced engineers would recognize this as self-modifying code and the need to pay attention to caching, but it does demonstrate a situation where using detailed models matters.

 

Summary

 

Although you may never encounter the exact problems described here, they demonstrate typical challenges embedded systems engineers face, and remind us to keep watch for hardware details. These examples also point out another key principle of embedded software, old code lives forever. This often means that while code may have worked on one system, it won’t automatically work on a new system, even if they seem similar. If these examples sound familiar, it might be time to look into virtual prototypes for your embedded software development.

 

Jason Andrews

Using ARM Compiler 6 with Carbon Performance Analysis Kits (CPAKs)


ARM has released DS-5 version 5.19 including the Ultimate Edition for ARMv8 to compile and debug 64-bit software. The Carbon Performance Analysis Kits (CPAKs) for the ARM Cortex-A57 and Cortex-A53 demonstrate 64-bit bare metal software examples which can be modified and compiled with DS-5. The software in the currently available CPAKs is compiled with ARM Compiler 5, better known as armcc, and not yet configured for ARM Compiler 6, also known as armclang. Fortunately, only a few changes are needed to move from armcc to armclang.

 

Today, I will provide some tips for using ARM Compiler 6 for those who would like to use the latest compiler from ARM with CPAK example software. In the future, all CPAKs will be updated for ARM Compiler 6, but now is a good time to give it a try and learn about the compiler.

 

ARM Compiler 6 is based on Clang and the LLVM Compiler Framework, and provides best in class code generation for the ARM Architecture. There are various articles covering the details, but the key takeaway is that ARM Compiler 6 is based on open source which has a flexible license and allows commercial products to be created without making the source code available.


Migration Guide

 

A good place to understand the differences between armcc and armclang is the ARM Compiler Migration Guide. It explains the command line differences between the two compilers and how to map switches from the old compiler to the new compiler. The migration guide also covers two additional tools provided to aid in switching compilers:

  • Source Compatibility Checker
  • Command-line Translation Wrapper

 

The compatibility checker helps find issues in the source code that is being migrated, while the translation wrapper provides an automatic way to call armcc as before, but invisibly calls armclang with the equivalent options. I didn’t spend too much time with either tool, but they are worth checking out.


The key point is that migration will involve new compiler invocation and switches, but it may also involve source code changes for things such as pragmas and attributes that are different between the compilers.


CPAK HOWTO

 

Let’s look at the practical steps to use ARM Compiler 6 on a Cortex-A53 CPAK software example. For this exercise I selected the DecrDataMP_v8/ example in the Applications/ directory of the CPAK. The system is a dual-cluster A53 where each cluster has 1 core. It also includes the CCI-400 to demonstrate cache coherency between clusters and the NIC-400 for connecting peripherals. The block diagram is shown below.

[Figure: dual-cluster Cortex-A53 system with CCI-400 and NIC-400]

Setting up DS-5 is very easy. I use Linux and bash, so I just add the bin/ directory of DS-5 to my PATH environment variable. Adjust the path to match your installation.

 

$ export PATH=$PATH:/o/tools/linux/ARM/DS5_5.19/64bit/bin

 

Only the 64-bit version of DS-5 includes ARM Compiler 6; it’s not included in the 32-bit version of DS-5, so make sure you install the 64-bit version and run it on a 64-bit Linux machine.

 

The first step to using ARM Compiler 6 is to edit the Makefile and replace armcc with armclang to compile the C files. Any assembly files can continue to be compiled by armasm and linking done with armlink remains mostly the same. It is possible to compile assembly files and link with armclang, but for this case I decided to leave the flow as is to learn the basics of making the compiler transition.

 

The Makefile specifies the compiler with the CC variable, so change it to CC=armclang.

 

The next important change is the specification of the target CPU. With armcc the --cpu option is used. You will see --cpu=8-A.64.no_neon in the Makefile. One tip is to use the command below to get a list of possible targets.

 

$ armcc --cpu list

 

With armclang the target CPU selection is done using the -target option. To select AArch64 use -target aarch64-arm-none-eabi in place of the --cpu option.

 

The invocation command and the target CPU selection are the main differences needed to switch from armcc to armclang.


Other Switches

 

This particular CPAK software uses --c90 to specify the version of the C standard. For armclang the equivalent option is -xc -std=c90, so make this change in the Makefile as well.

 

The next issue is the use of the --dwarf3 option. It is not supported by armclang; DWARF4 appears to be the only option with armclang.

 

The Makefile also uses -Ospace as an option to shrink the program size at the possible expense of runtime speed. For armclang this should be changed to -Os.

 

The last difference relates to armlink. The armlink commands need --force_scanlib to tell armlink to include the ARM libraries. From the documentation, this option is mandatory when running armlink directly. Add this flag to the armlink commands and the compilation will complete successfully and generate .axf files!

 

Here is a table summarizing the differences.

 

ARM Compiler 5            | ARM Compiler 6
--------------------------|-------------------------------
Invoke using armcc        | Invoke using armclang
--cpu=8-A.64.no_neon      | -target aarch64-arm-none-eabi
--dwarf3                  | None
-Ospace                   | -Os
(none)                    | --force_scanlib (for armlink)

 

I encountered one other quirk when migrating this example to ARM Compiler 6: a compilation error caused by the following include in the source file retarget.c

 

  #include <rt_misc.h>

 

For now I just commented out this line and the application compiled and ran fine. It’s probably something to look into on a rainy day.


Creating an Eclipse Project for DS-5

 

It wouldn’t be DS-5 if we didn’t use the Eclipse environment to compile the example. It’s very easy to do, so I’ll include a quick tutorial for those who haven’t used it before. Since a Makefile already exists for the software, I used a new Makefile project.

 

First, launch eclipse using

 

$ eclipse &

 

Once eclipse is started, use the menu File -> New -> Makefile Project with Existing Code

 

Pick a name for the project and fill it into the dialog box, browse to the location of the code, and select ARM Compiler 6 as the Toolchain for indexer settings.

 

[Screenshot: New Makefile Project dialog with ARM Compiler 6 selected as the toolchain]

 

There are many ways to get the build to start, but once the project is setup I use the Project menu item called Build Project and the code will be compiled.

 

There is a lot more to explore with DS-5, but this is enough information to get going in the right direction.


ARM Techcon

 

Now is a great time to start making plans to attend ARM TechCon, October 1-3 at the Santa Clara Convention Center. The schedule has just been published and registration is open. I will present Performance Optimization for an ARM Cortex-A53 System using Software Workloads and Cycle Accurate Models on Friday afternoon.

 

Jason Andrews

Using ARM DS-5 Ultimate Edition with Accurate Virtual Prototypes


Last month I covered the details of using ARM Compiler 6 to compile bare metal software included in Carbon Performance Analysis Kits (CPAKs) for ARMv8 processors such as the Cortex-A53. This time I will outline the flow used to connect ARM DS-5 version 5.19 to debug software. DS-5 is a full set of tools for end-to-end software development for ARM processors and includes the ability to connect to cycle accurate models running in SoC Designer and debug software.

 

Overview of DS-5 to SoC Designer Connection

 

Setting up SoC Designer for use with DS-5 involves the following steps:

  1. Use a DS-5 supplied program called cdbimporter to create a DS-5 configuration database.
  2. Add the configuration database to the list of systems to which DS-5 can connect.
  3. Create a DS-5 Debug Configuration which specifies the system to connect to and the software to debug.
  4. Connect to a SoC Designer simulation using the Debug Configuration to perform software debugging tasks.

 

This article presents a summary of the flow. The last two steps are common to debugging with eclipse, but the first two steps may be new to eclipse users. There is an application note in the SoC Designer release which provides additional information about cdbimporter. SoC Designer users can look at the file $MAXSIM_HOME/doc/DS5_App_Note.pdf

 

Generating a DS-5 Configuration Database

 

The database is created with a DS-5 utility called cdbimporter, which provides a number of useful features to help automate the process such as:

  • Query the host machine for running CADI servers
  • Identify the simulation of interest
  • Identify the cores included in the simulation
  • Generate the configuration database

 

I will use a Cortex-A53 Baremetal CPAK with SoC Designer 7.14.6 for reference.

 

[Figure: single-core Cortex-A53 CPAK system diagram]

To set up the DS-5 configuration database start the simulation as specified in the CPAK README file and load the software. Do not start running the simulation.

 

Run the simulation:

 

$ sdsim -b A53-cpak.conf -b $MAXSIM_PROTOCOLS/etc/Protocols.conf Systems/A53v8-MP1-CCI400.mxp

 

Then load the sorts_v8.axf file from the Applications/ directory and go to the shell prompt to run cdbimporter.

 

Setup the DS-5 shell environment using the path to your DS-5 installation:

 

$ /o/tools/linux/ARM/DS5_5.19/64bit/bin/suite_exec_ac6 bash

 

Now cdbimporter should be in the PATH and ready to run.

 

The output from cdbimporter is a directory containing the DS-5 configuration database which will be used in the next step.


Add the Configuration Database to DS-5

 

The next step is to make DS-5 aware of the new configuration database. Start by launching eclipse and set the workspace directory.

 

Import the database generated by cdbimporter as follows:
Open the Preferences page from the menu: Windows -> Preferences.
From the Preferences popup, select DS-5-> Configuration Database -> Add. The Add configuration database location dialog opens:

 

[Screenshot: Add configuration database location dialog]

 

Use the Browse button to navigate to the directory which contains the configuration database created in the first step.

 

Use the Name field to name this database for later identification in the Preferences (it can also be left blank).
Click OK to close the Add configuration database location dialog.
Click OK again to close the Preference window. You will see a dialog box indicating it is adding the new database.

 

The configuration database is now included in the DS-5 list of target systems.

 

Create a DS-5 Debug Configuration

 

The next step is to create a DS-5 Debug Configuration that creates the connection to the system to be debugged.

 

Start the DS-5 debug perspective as follows:

 

From the top menu, select Windows -> open perspective -> DS5 Debug.

 

From the top menu, select Run > Debug Configurations… (or right-click from the upper left “Debug Control” view and select debug configurations -> DS-5 Debugger).

 

From the “Debug Configurations” pop up, right-click on DS5-Debugger and select New.

 

On the Connection tab, close default groups, then scroll down and select the imported system; e.g., Carbon > A53-MP1-CCI400.mxp > Bare Metal Debug > Debug Cortex-A53_0

 

[Screenshot: Debug Configuration Connection tab with the imported Cortex-A53 target selected]


On the Files tab select the .axf file of the application, in this case sorts_v8.axf.

 

Load symbols from file assumes you already loaded the application to debug from SoC Designer. To load the application from DS-5 instead, enter the name in the Application on host to download box and the application will be written to memory when the debugger connects.

 

[Screenshot: Debug Configuration Files tab with sorts_v8.axf selected]

 

On the Debugger tab, in the Run Control panel, select "Connect only."

 

[Screenshot: Debug Configuration Debugger tab with "Connect only" selected]


Connect DS-5 to a SoC Designer Simulation and Debug Software

 

The final step is to launch the Debugger Configuration and connect to the SoC Designer simulation. You can do this immediately using the Debug button on the lower right of the Debug Configurations dialog.

 

You can also use the Debug Control tab by right-clicking on the connection and selecting “Connect to Target” from the context menu (shown below).

 

[Screenshot: Debug Control view with the "Connect to Target" context menu]

 

After the debugger is connected the status is shown as “connected”:

 

[Screenshot: Debug Control view showing the target as "connected"]


After connection, DS-5 can be used to debug the software running on the SoC Designer simulation.


Simulation on Another Machine or Port

 

If multiple simulations are starting and stopping there is a chance the port number the simulator is using is different than it was when the configuration database was created. The first simulation is normally port 7000, then the next is 7001. The numbers will recycle as simulations start and stop. It’s not necessary to start from scratch and create a new configuration database. There is an environment variable that is the easiest way to adjust to a new port number:

 

For bash I use:

 

$ export CADI_TARGET_MACHINE=localhost:7001

 

This environment variable can also be used to connect to a simulator on another machine by using the name or IP address of the machine which has the running simulation.


Conclusion

 

This summary covered the steps to connect DS-5 to SoC Designer. DS-5 is the best debugger for handling bare metal software for the AArch64 state compiled with ARM Compiler 6 and DWARF4 debug information. Make sure to review the application note provided with each SoC Designer release for the latest information.

 

Jason Andrews


Getting Ready for ARM Techcon 2014


This week is the 10th year for ARM Techcon, which has evolved into the best place for all things related to ARM technology. I will be attending this year, and giving a presentation on Friday at 3:30 titled “Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models”.

 

Based on the agenda this year, ARMv8 will be one of the primary topics. For the past few years there have been presentations about ARMv8, but it’s clear many people now have hands-on experience and are ready to share it at the conference. To get warmed up for ARM Techcon, I will share a couple of fun facts about 64-bit Linux on ARMv8.


Swap & Play Technology

 

One of the differentiating technologies of Carbon SoC Designer is Swap & Play, which enables a system to be simulated with ARM Fast Models to a breakpoint and saved as a checkpoint. The simulation checkpoint can be restored into a cycle-accurate simulator constructed with models built directly from the RTL of the IP. The most common use case for this technology is running benchmarks on top of the Linux operating system. Swap & Play is attractive for this application because the Linux boot is only a means for setting up the benchmark, and the accuracy is critical to the benchmark results and the system performance analysis. It may seem strange to simulate Linux using cycle accurate models because it requires billions of instructions, but there are times when being able to run Linux benchmarks on accurate hardware models is invaluable. In fact, this is probably required before a chip can complete functional verification.

 

[Figure: Swap & Play flow from ARM Fast Model simulation to cycle accurate simulation]

 

One of the useful features of the ARM® Cortex®-A50 series processors is the backward compatibility with ARMv7 software. I have had good results running software binaries from A15 designs directly on A53 with no changes. Mobile devices have even started appearing with A53 processors that are running Android in 32-bit mode which have the possibility of upgrading to 64-bit Android in the future.

 

One of the reasons we always focus on the system at Carbon is because today's IP is complex and configurable, and this can lead to integration pitfalls which were not anticipated. Take for instance 64-bit Linux on ARMv8. It would be a reasonable assumption that if a design has A53 cores and is successfully running 32-bit Linux, it should be able to run 64-bit Linux just by changing the software.

 

Below are a couple of fun facts related to migrating from 32-bit Linux to 64-bit Linux on ARMv8 to get warmed up for ARM Techcon 2014.


Generic Timer Usage

 

The A15 and A53 offer a similar set of four Generic Timers. Many multi-cluster A15 (and even some single cluster) designs have used the GIC-400 as an interrupt controller instead of the internal A15 interrupt controller, so the update to the A53 seems straightforward: change the CPU and run the same A15 software on the A53 in AArch32 state.

 

It turns out that 32-bit Linux uses the Virtual Generic Timer and 64-bit Linux uses the Non-Secure Physical Timer as the primary Linux timer. From a hardware design view, this probably doesn’t matter much as long as all of the nCNT* signals are connected from the CPU to the GIC, but understanding this difference helps when doing system debugging or building minimal Linux systems for System Performance Analysis. As I wrote in previous articles, architects doing System Performance Analysis are typically not interested in the same level of netlist detail that the RTL integration engineer works with, so knowing the minimal set of connections between key components in the CPU subsystem needed to run a system benchmark is useful.

 

Below is a comparison of the Generic Timers in 32-bit and 64-bit Linux. The CNTV registers are used in 32-bit Linux and the CNTP registers are used in 64-bit Linux. The CTL register shows which timer is enabled, and a non-zero CVAL register indicates the active timer.

 

[Figure: Generic Timer register comparison for 32-bit and 64-bit Linux]
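
For those who want to reproduce this comparison on their own system, a minimal sketch of the check is shown below. It assumes AArch64 code running at EL1 (or at EL0 with counter and timer access enabled in CNTKCTL_EL1) and simply dumps the CTL and CVAL registers of both timers; it is an illustration, not code from a CPAK.

/* Minimal sketch: print the state of the virtual and non-secure physical
 * timers to see which one the kernel is actually using. */
#include <stdint.h>
#include <stdio.h>

#define READ_SYSREG(name) ({ uint64_t _v;                       \
        asm volatile("mrs %0, " name : "=r" (_v)); _v; })

void print_timer_state(void)
{
    printf("CNTV_CTL=%llx CNTV_CVAL=%llx\n",
           (unsigned long long)READ_SYSREG("cntv_ctl_el0"),
           (unsigned long long)READ_SYSREG("cntv_cval_el0"));
    printf("CNTP_CTL=%llx CNTP_CVAL=%llx\n",
           (unsigned long long)READ_SYSREG("cntp_ctl_el0"),
           (unsigned long long)READ_SYSREG("cntp_cval_el0"));
}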


Processor Mode

 

The reason for the different timers is likely because 32-bit Linux runs in supervisor mode in the secure state, and 64-bit Linux runs in normal (non-secure) mode. I first learned about these processor modes when I was experimenting with running kvm on my Samsung Chromebook, which contains the Exynos 5 dual-core A15. I found out that to run a hypervisor like kvm I had to start Linux in the hypervisor mode, and the default configuration is to run in supervisor mode. After some changes to the bootloader setup, I was able to get Linux running in hypervisor mode and run kvm.

 

It may seem like the differences between the various modes are minor and unlikely to make any difference to the system design beyond the processors, but consider the following scenario.

 

Running 32-bit Linux on A53 in AArch32 state runs fine using CCI-400, NIC-400, and GIC-400 combined with some additional peripherals. The exact same system would be expected to run 64-bit Linux without any changes. What if, however, the slave port of the NIC-400 which receives data from the CCI-400 was configured in AMBA Designer for secure access? This is one of the three possible configuration choices for slave ports. Here are the descriptions of AMBA Designer choices for the slave port:

[Figure: AMBA Designer security options for the NIC-400 slave port]

 

If secure were selected, the system would run fine with 32-bit Linux, but would fail when running 64-bit Linux: the non-secure transactions from the A53 would be presented as secure transactions to the GIC (because of the NIC-400 configuration) and would result in reading wrong values from GIC registers such as the Interrupt Acknowledge Register (IAR) when trying to determine which peripheral is signaling an interrupt. The result would be a difficult-to-debug looping behavior in which the kernel is unable to service the proper interrupt. All of this because of a single NIC-400 configuration parameter. For more information on the NIC-400 design flow, a recording of the recent Carbon webinar is available.


Summary

 

As you can see, seemingly minor differences in the processor operating mode between 32-bit and 64-bit Linux can impact IP configuration as well as connections between IP. These are just two small examples of why ARM Techcon 2014 should be an exciting conference as the ARM community shares experiences with ARMv8.

 

Make sure to stop by the Carbon booth for the latest product updates and information about Carbon System Exchange, the new portal for pre-built virtual prototypes.

 

Jason Andrews

Running 64-bit Linux Applications on an 8-core ARM Cortex-A53 CPAK


I’m excited to introduce the most complex Carbon Performance Analysis Kit (CPAK) created by Carbon: an 8-core ARM Cortex-A53 system running 64-bit Linux with full Swap & Play support. This is also the first dual-cluster Linux CPAK available on Carbon System Exchange. It’s an important milestone for Carbon and for SoC Designer users because it enables system performance analysis for 64-bit multi-core Linux applications.

 

Here are the highlights of the system:

  • Dual-cluster, quad-core Cortex-A53 for a total of 8 cores
  • ARM CoreLink CCI-400 providing coherency between clusters
  • Fully configured GIC-400 interrupt controller delivering interrupts to all cores
  • New Global System Counter connected to A53 Generic Timers

 

Here is a diagram of the system.

[Figure: octa-core Cortex-A53 CPAK system diagram]

The design also supports fully automatic mapping to ARM Fast Models.

 

I would like to introduce some of the new functionality in this CPAK.

 

Dual Cluster System


The Cortex-A53 model supports the CLUSTERIDAFF inputs to set the Cluster ID. This value shows up for software in the MPIDR register. Values of 0 and 1 are used for each cluster, and each cluster has four cores. This means that CPU 3 in Cluster 1 has an MPIDR value of 0x80000103 as shown in the screenshot below.

 

[Screenshot: MPIDR value 0x80000103 for CPU 3 in Cluster 1]
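
As a quick illustration of how software sees these values, the sketch below reads MPIDR_EL1 and extracts the affinity fields. It is a hypothetical helper rather than code from the CPAK, and assumes a configuration where Aff0 is the core number within the cluster and Aff1 is the Cluster ID driven on the CLUSTERIDAFF input.

/* Minimal sketch: decode CPU and cluster numbers from MPIDR_EL1.
 * For example, 0x80000103 decodes to cluster 1, CPU 3. */
#include <stdint.h>

static inline uint64_t read_mpidr(void)
{
    uint64_t v;
    asm volatile("mrs %0, mpidr_el1" : "=r" (v));
    return v;
}

/* Aff0 (bits [7:0]) is the CPU number within the cluster */
static inline unsigned int cpu_id(void)     { return read_mpidr() & 0xff; }

/* Aff1 (bits [15:8]) is the Cluster ID */
static inline unsigned int cluster_id(void) { return (read_mpidr() >> 8) & 0xff; }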


Global System Counter

 

Another requirement for a multi-cluster system is the use of a Global System Counter. A new model is now available in SoC Designer which is connected to the CNTVALUEB input of each A53. This ensures that the Generic Timer in each processor has the same counter values for software, even when the frequency of the processors may be different. This model also enables Swap & Play systems to work correctly by saving the counter value from the Fast Model simulation and restoring it in the Cycle Accurate simulation.

 

Generic Timer to GIC Connections


To create a multi-cluster system the GIC-400 is used as the interrupt controller, and the A53 Generic Timers are used as the system timers. This requires the connection of the Generic Timer signals from the A53 to the GIC-400. All of these signals start with nCNT and are wired to the GIC. When a Generic Timer generates an interrupt it leaves the CPU by way of the appropriate nCNT signal, goes to the GIC, and then back to the CPU using the appropriate nIRQ signal.

 

As I wrote in my ARM Techcon Blog, 64-bit Linux uses nCNTPNSIRQ, but all signals are connected for completeness.

 

Event Connections

 

Additional signals which fall into the category of power management and connect between the two clusters are EVENTI and EVENTO. These signals are used for event communication using the WFE (wait for event) and SEV (send event) instructions. For a single cluster system all of the communication happens inside the processor, but for the multi-cluster system these signals must be connected.

WFE and SEV communication is used during the Linux boot. All 7 of the secondary cores execute a WFE and wait until the primary core wakes them up using the SEV instruction at the appropriate time. If the EVENTI and EVENTO signals are not connected the secondary cores will not wake up and run.

 

Boot Wrapper Modifications

 

The good news is that all of the software used in the 8-core CPAK is easily downloadable in source code format. A small boot wrapper is used to take care of starting the cores and doing a minimal amount of hardware configuration that Linux assumes to be already done. Sometimes there is additional hardware programming that is needed for proper cycle accurate operation that is not needed in a Fast Model system. These are similar to issues I covered in another article titled Sometimes Hardware Details Matter in ARM Embedded Systems Programming.

 

SMP Enable

 

Although not specific to multi-cluster operation, the A53 contains a bit in the CPUECTLR register named SMPEN which must be set to 1 to enable hardware management of data coherency with the other cores in the cluster. Initially, this was not set in the boot wrapper from kernel.org; because the Linux kernel assumes it has already been done, the setting was added to the boot wrapper during development.
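
A minimal sketch of this fix is shown below. It assumes the A53 CPUECTLR_EL1 register is accessed through its implementation defined encoding S3_1_C15_C2_1 and that SMPEN is bit 6; the actual boot wrapper change is written in assembly, so this C version with inline assembly is only an illustration of the idea.

/* Minimal sketch: set SMPEN before enabling caches or starting Linux. */
#include <stdint.h>

static inline void enable_smp_coherency(void)
{
    uint64_t v;
    asm volatile("mrs %0, s3_1_c15_c2_1" : "=r" (v));   /* CPUECTLR_EL1 */
    v |= (1UL << 6);                       /* SMPEN: join the coherency domain */
    asm volatile("msr s3_1_c15_c2_1, %0" :: "r" (v));
    asm volatile("isb");
}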

 

CCI Snoop Configuration

 

Another hardware programming task which is assumed by the Linux kernel is the enabling of snoop requests and responses between the clusters. The Snoop Control Register for each CCI-400 slave port is set to 0xc0000003 to enable coherency. This was also added to the boot wrapper during development of the CPAK.
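
The sketch below illustrates the snoop enable step. The base address and the assumption that the two clusters sit on the fully coherent slave ports S3 and S4 (register blocks at offsets 0x4000 and 0x5000) are placeholders for illustration only; check the CCI-400 TRM and your own system configuration for the real values.

/* Minimal sketch of enabling CCI-400 snoops for both cluster ports. */
#include <stdint.h>

#define CCI400_BASE        0x2c090000UL            /* assumption for illustration */
#define CCI_SNOOP_CTRL(n)  (CCI400_BASE + 0x1000UL * ((n) + 1))

static inline void write32(uintptr_t addr, uint32_t val)
{
    *(volatile uint32_t *)addr = val;
}

static void enable_cci_snoops(void)
{
    /* Value used in the CPAK to enable snoop and DVM requests. */
    write32(CCI_SNOOP_CTRL(3), 0xc0000003);
    write32(CCI_SNOOP_CTRL(4), 0xc0000003);
    /* A barrier (or a read-back of the status register) should follow
     * before relying on coherency; omitted here for brevity. */
}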

The gaps between the boot wrapper functionality and Linux assumptions are somewhat expected since the boot wrapper was developed for ARM Fast Models and these details are not needed to run Linux on Fast Models, but nevertheless they make it challenging to create a functioning cycle accurate system. These changes are provided as a patch file in the CPAK so they can be easily applied to the original source code.

 

CPAK Contents

 

The CPAK comes with an application note which covers the construction of the Linux image.

 

The following items are configured to match the minimal hardware system design, and can be extended as the hardware design is modified.

  • File System: Custom file system configured and created using Buildroot
  • Kernel Image: Linux 3.14.0 configured to use the minimal hardware
  • Device Tree Blob:  Based on Versatile Express device tree for ARM Fast Models
  • Boot Wrapper: Small assembly boot wrapper available from kernel.org

 

A single executable (.axf) file containing all of the above items is compiled. This single image contains all of the artifacts and is loaded and executed in SoC Designer.

One of the amazing things is there are no kernel source code changes required. It demonstrates how far Linux has come in the ARM world and the flexibility it now has in supporting a wide variety of hardware configurations.

 

Summary


An octa-core A53 Linux CPAK is now available which supports Swap & Play. The ability to boot the Linux kernel using Fast Models and migrate the simulation to cycle accurate execution enables system performance analysis for 64-bit multi-core systems running Linux applications.

 

Also, make sure to check out the other new CPAKs for 32-bit and 64-bit Linux for Cortex-A53 now available on Carbon System Exchange.

 

The “Brought up 8 CPUs” message below tells it all. A number of 64-bit Linux applications are provided in the file system, but users can easily add their favorite programs and run them by following the instructions in the app note.

 

[Screenshot: Linux boot log with the "Brought up 8 CPUs" message]

Optimization of Systems Containing the ARM CoreLink CCN-504 Cache Coherent Network


The first Carbon Performance Analysis Kit (CPAK) demonstrating the AMBA 5 CHI protocol has been released on Carbon System Exchange. The design features the ARM Cortex-A57 configured for AMBA 5 CHI and the ARM CoreLink CCN-504 Cache Coherent Network. The design is a modest system with a single core running 64-bit bare-metal software with memory and a PL011 UART, but for anybody who digs into the details there is a lot to learn.

 

Here is a diagram of the system:

 

[Figure: Cortex-A57 and CCN-504 CPAK system diagram]


AMBA 5 CHI Introduction

 

Engineers who have been working with ARM IP for some time will quickly realize AMBA 5 CHI is not an extension of any previous AMBA specifications. AMBA 5 CHI is both more and less complex compared to AMBA 4. CHI is more complex at the protocol layer, but less complex at the physical layer. AXI and ACE use Masters and Slaves, but CHI uses Request Nodes, Home Nodes, Slave Nodes, and Miscellaneous Nodes. All of these nodes are referenced using shorthand abbreviations as shown in the table below.

 

[Table: AMBA 5 CHI node types and their abbreviations]


Building the A57 with CHI

 

The latest r1p3 A57 is now available on Carbon IP Exchange. CHI can be selected as the external memory interface. The relevant section from the IP Exchange configuration form is shown below.

 

[Screenshot: Carbon IP Exchange A57 configuration with CHI selected as the external memory interface]

 

The CHI memory interface relies on the System Address Map (SAM) signals. All of the A57 input signals starting with SAM* are important in constructing a working system. These values are available as parameters on the A57 model, and are configured appropriately in the CPAK to work with the CCN-504.

 

Configuring the CCN-504


The CCN-504 Cache Coherent Network provides the connection between the A57 and memory. The CPAK uses two SN-F interfaces, since dual memory controllers are one of the key features of the IP. A similar set of SAM* parameters is available on the CCN-504 to configure the system address map. Like other ARM IP, the CCN uses the concept of PERIPHBASE to set the address of the internal, software programmable registers.

 

Programming Highlights

 

The CCN-504 includes an integrated level 3 cache. The CPAK demonstrates the use of the L3 cache.

The CPAK startup assembly code also demonstrates other CCN-504 configuration including how to setup barrier termination, load node ID lists, programming system address map control registers, and more.


AMBA 5 CHI Waveforms

 

One of the best ways to start learning about AMBA 5 CHI is looking at the waveforms between the A57 and the CCN-504. The latest SoC Designer 7.15.5 supports CHI waveforms and displays flits, the basic units of transfer in the AMBA 5 CHI link layer.

 

[Screenshot: AMBA 5 CHI flit waveforms between the A57 and the CCN-504]

Summary


A new CPAK by Carbon Design Systems running 64-bit bare-metal software on the Cortex-A57 processor with CHI memory interface connected to the CCN-504 and memory is now available. It demonstrates the AMBA 5 CHI protocol, serves as a starting point for optimization of CCN-based systems, and is a valuable learning tool for projects considering AMBA 5 CHI.

Three Tips for Using Linux Swap & Play with ARM Cortex-A Systems


Today, I have three tips for using Swap & Play with Linux systems.

 

  1. Launching benchmark software automatically on boot
  2. Setting application breakpoints for Swap & Play checkpoints
  3. Adding markers in benchmark software to track progress


With the availability of the Cortex-A15 and Cortex-A53 Swap & Play models as well as the upcoming release of the Cortex-A57 Swap & Play model, Carbon users are able to run Linux benchmark applications for system performance analysis. This enables users to create, validate, and analyze the combination of hardware and software using cycle accurate virtual prototypes running realistic software workloads. Combine this with access to models of candidate IP, and the result is a unique flow which delivers cycle accurate ARM system models to design teams.


[Figure: cycle accurate virtual prototype flow]

Swap & Play Overview

 

Carbon Swap & Play technology enables high-performance simulation (based on ARM Fast Models) to be executed up to user-specified breakpoints, and the state of the simulation to be resumed using a cycle accurate virtual prototype. One of the most common uses of Swap & Play is to run Linux benchmark applications to profile how the software executes on a given hardware design. Linux can be booted quickly and then the benchmark run using the cycle accurate virtual prototype. These tips make it easier to automate the entire process and get to the system performance analysis.


Launch Benchmarks on Boot


The first tip is to automatically launch the benchmark when Linux is booted. Carbon Linux CPAKs on System Exchange use a single executable file (.axf) for each system, with the following artifacts linked into the image:

  • Minimal Boot loader
  • Kernel image
  • Device Tree
  • RAM-based File System with applications

To customize and automate the execution of a desired Linux benchmark application, a Linux device tree entry can be created to select the application to run after boot.

 

The device tree support for “include” can be used to include a .dtsi file containing the kernel command line, which launches the desired Linux application.

 

Below is the top of the device tree source file from an A15 CPAK. If one of the benchmarks to be run is the bw_pipe test from the LMBench suite a .dtsi file is included in the device tree.

 

[Listing: top of the A15 CPAK device tree source showing the include line]

 

The include line pulls in a description of the kernel command line. For example, if the bw_pipe benchmark from LMbench is to be run, the include file contains the kernel arguments shown below:

 

[Listing: kernel command line in the included .dtsi file]

 

The rdinit kernel command line parameter is used to launch a script that executes the Linux application to be run automatically. The bw_pipe.sh can then run the bw_pipe executable with the desired command line arguments.

 

Scripting or manually editing the device tree can be used to modify the include line for each benchmark, and a unique .axf file can be created for each Linux application. The result is an easy-to-use .axf file that launches the benchmark automatically without any interactive typing. Having a unique .axf file per benchmark also makes it easy to hand off to other engineers: they don’t need to know anything about how to run the benchmark; they just load the .axf file and the application runs automatically.

 

I also recommend creating an .axf image which runs /bin/bash, to use for testing new benchmark applications in the file system. I normally run all of the benchmarks manually from the shell on the ARM Fast Model first to make sure they are working correctly.


Setting Application Breakpoints

 

Once benchmarks are automatically running after boot, the next step is to set application breakpoints to use for Swap & Play checkpoints. Linux uses virtual memory which can make it difficult to set breakpoints in user space. While there are application-aware debuggers and other techniques to debug applications, most are either difficult to automate or overkill for system performance analysis.

 

One way to easily locate breakpoints is to call from the application into the Linux kernel, where it is much easier to put a breakpoint. Any system call which is unused by the benchmark application can be utilized for locating breakpoints. Preferably, the chosen system call would not have any other side effects that would impact the benchmark results.

 

To illustrate how to do this, consider a benchmark application to be run automatically on boot. Let’s say the first checkpoint should be taken when main() begins. Place a call to the sched_yield() function as the first action in main(). Make sure to include the header file sched.h in the C program. This will call into the Linux kernel. A breakpoint can be placed in the Linux kernel file kernel/sched/core.c at the entry point for the sched_yield system call.
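
A minimal sketch of the idea looks like the following; it is a hypothetical benchmark skeleton rather than the actual bw_pipe source.

/* Minimal sketch of placing kernel-visible markers around the measured region. */
#include <sched.h>

int main(void)
{
    /* Enters the kernel at the sched_yield entry point in kernel/sched/core.c,
     * where a breakpoint can be set to save a Swap & Play checkpoint. */
    sched_yield();

    /* ... benchmark work to be measured with the cycle accurate model ... */

    /* A second call can mark the end of the measured region. */
    sched_yield();
    return 0;
}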

 

Here is the bw_pipe benchmark with the added system call.

 

[Listing: bw_pipe benchmark source with the added sched_yield() call]

 

Put a breakpoint in the Linux kernel at the system call and when the breakpoint is hit save the Swap & Play checkpoint. Here is the code in the Linux kernel.

 

[Listing: sched_yield system call entry point in kernel/sched/core.c]

 

The same technique can be used to easily identify other locations in the benchmark application, including the end of the benchmark, to stop the simulation and gather results for analysis.

 

The sched_yield system call yields the current processor to other threads, but in the controlled environment of a benchmark application it is not likely to do any rescheduling at the start or at the end of a program. If used in the middle of a multi-threaded benchmark it may impact the scheduler.

 

Tracking Benchmark Progress


From time to time it is nice to see that a benchmark is proceeding as expected and to be able to estimate how long it will take to finish. Using print statements is one way to do this, but adding too many print statements can negatively impact performance analysis. Amazingly, even a simple printf() call in a C program to a UART under Linux is a somewhat complex sequence involving the C library, some system calls, UART device driver activations, and 4 or 5 interrupts for an ordinary-length string.

 

A lighter-weight way to get some feedback about benchmark application progress is to bypass all of the printf() overhead and make a system call directly from the benchmark application, using very short strings which fit in the UART FIFO and can be processed with a single interrupt.

 

Below is a C program showing how to do it.

 

[Listing: C program writing short marker strings with a direct system call]
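
In case the screenshot is not visible, a minimal sketch of the idea is shown below. It assumes the standard Linux write system call invoked through syscall(); it is an illustration of the pattern, not the exact program from the original post.

/* Minimal sketch: print short progress markers while bypassing printf(). */
#include <unistd.h>
#include <sys/syscall.h>

static void marker(const char *s, unsigned int len)
{
    /* One short write: fits in the UART FIFO and needs a single interrupt. */
    syscall(SYS_write, 1, s, len);
}

int main(void)
{
    marker("1\n", 2);
    /* ... first phase of the benchmark ... */
    marker("2\n", 2);
    /* ... second phase ... */
    return 0;
}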

 

By using short strings which are just a few characters, it’s easy to insert some markers in the benchmark application to track progress without getting in the way of benchmark results. This is also a great tool to really learn what happens when a Linux system call is invoked by tracing the activity in the kernel from the start of the system call to the UART driver.

 

Summary

 

Hopefully these three tips will help Swap & Play users run benchmark applications and get the most benefit from system performance analysis. I’m sure readers have other ideas on how to automate the running of Linux application benchmarks and how to locate Swap & Play breakpoints, but these should get the creative ideas flowing.

 

Jason Andrews

System Performance Analysis and the ARM Performance Monitor Unit (PMU)


Carbon cycle accurate models of ARM CPUs enable system performance analysis by providing access to the Performance Monitor Unit (PMU). Carbon models instrument the PMU registers and record PMU events into the Carbon System Analyzer database without any software programming. Contrast this non-intrusive PMU event collection with other common ways of running software:

 

  • ARM Fast Models focus on speed and have limited ability to access PMU events
  • Simulating or emulating CPU RTL does not provide automatic instrumentation and event collection
  • Silicon requires software programming to enable and collect events from the PMU


The ARM Cortex-A53 is a good example to demonstrate the features of SoC Designer. The A53 PMU implements the PMUv3 architecture and gathers statistics on the processor and memory system. It provides six counters which can count any of the available events.


The Carbon A53 model instruments the PMU events to gather statistics without any software programming. This means all of the PMU events (not just six) can be captured from a single simulation.


The A53 PMU Events can be found in the Technical Reference Manual (TRM) in Chapter 12. Below is a partial list of PMU events just to provide some flavor of the types of events that are collected. The TRM details all of the events the PMU contains.


[Table: partial list of Cortex-A53 PMU events]

 

Profiling can be enabled by right-clicking on a CPU model and selecting the Profiling menu. Any or all of the PMU events can be enabled. Any simulation done with profiling enabled will write the selected PMU events into the Carbon System Analyzer database.



Bare Metal Software

 

The automatic instrumentation of PMU events is ideal for bare metal software since it requires no programming and will automatically cover the entire timeline of the software test or benchmark. Full control is available to enable the PMU events at any time by stopping the simulator and enabling or disabling profiling.

 

All of the profiling data from the PMU events, as well as the bus transactions, and the software profiling information end up in the Carbon Analyzer database. The picture below shows a section of the Carbon Analyzer GUI loaded with PMU events, bus activity, and software activity.

 

 

[Screenshot: Carbon Analyzer with PMU events, bus activity, and software activity]


The Carbon Analyzer provides many out-of-the-box calculations of interesting metrics as well as a complete API which allows plugins to be written to compute additional system- or application-specific metrics.


Linux Performance Analysis

 

Things get more interesting in a Linux environment. A common use case is to run Linux benchmarks to profile how the software executes on a given hardware design. Linux can be booted quickly and then a benchmark can be run using a cycle accurate virtual prototype by making use of Swap & Play.

 

Profiling enables events to be collected in the analyzer database, but the user doesn’t have the ability to understand which events apply to each Linux process or to differentiate events from the Linux kernel vs. those from user space programs. It’s also more difficult to determine when to start and stop event collection for a Linux application. Control can be improved by using techniques from Three Tips for Using Linux Swap & Play with ARM Cortex-A Systems.


Using PMU Counters from User Space

 

Since the PMU can be used for Linux benchmarks, the first thing that comes to mind is to write some initialization code to setup the PMU, enable counters, run the test, and collect the PMU events at the end. This strategy works pretty well for those willing to get their hands dirty writing system control coprocessor instructions.


Enable User Space Access

 

The first step in writing a Linux application which accesses the PMU is to enable user mode access. This needs to be done from the Linux kernel. It's very easy to do, but requires a kernel module to be loaded or compiled into the kernel. All that is needed is to set bit 0 of the PMUSERENR register to 1. It takes only one instruction, but it must be executed from within the kernel. The main section of code is shown below.

 

[Listing: main section of the kernel module enabling user mode PMU access]
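
For reference, a minimal sketch of such a module is shown below. It is an AArch64 illustration, not the exact module provided with the CPAK: it assumes bit 0 of PMUSERENR_EL0 is the enable bit and uses on_each_cpu() so every core is covered.

/* Minimal sketch of a kernel module enabling user mode PMU access. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/smp.h>

static void enable_user_access(void *unused)
{
    u64 val = 1;    /* EN bit: allow EL0 access to the PMU registers */
    asm volatile("msr pmuserenr_el0, %0" :: "r" (val));
}

static int __init enable_pmu_init(void)
{
    on_each_cpu(enable_user_access, NULL, 1);
    return 0;
}

static void __exit enable_pmu_exit(void)
{
    u64 val = 0;    /* restore the reset value, disabling EL0 access */
    asm volatile("msr pmuserenr_el0, %0" :: "r" (val));
}

module_init(enable_pmu_init);
module_exit(enable_pmu_exit);
MODULE_LICENSE("GPL");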

 

Building a kernel module requires a source tree for the running kernel. If you are using a Carbon Performance Analysis Kit (CPAK), this source tree is available in the CPAK or can easily be downloaded by using the CPAK scripts.

 

A source code example as well as a Makefile to build it can be obtained by registering here.

 

The module can either be loaded dynamically into a running kernel or added to the static kernel build. When working with CPAKs it’s easier for me to just add it to the kernel. When I’m working with a board where I can natively compile it on the machine it’s easier to dynamically load it using:


$ sudo insmod enable_pmu.ko


Remember to use the lsmod command to see which modules are loaded and the rmmod command to unload it when finished.


The exit function of the module returns the user mode enable bit back to 0 to restore the original value.


PMU Application

 

Once user mode access to the PMU has been granted, benchmark programs can take advantage of the PMU to count events such as cycles and instructions. One possible flow from a user space program is:

  • Reset count values
  • Select which of the six PMU counter registers to use
  • Set the event to be counted, such as instructions executed
  • Enable the counters to start counting

Once this is done, the benchmark application can read the current values, run the code of interest, and then read the values again to determine how many events occurred during the code of interest.

 

[Listing: user space code programming the PMU counters]
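
A minimal sketch of this flow is shown below. It assumes user mode access has already been enabled as described above, uses event 0x8 (instructions architecturally executed) in event counter 0 plus the cycle counter, and is an AArch64 illustration rather than the exact CPAK test application.

/* Minimal sketch: count cycles and instructions around a region of interest. */
#include <stdio.h>
#include <stdint.h>

static inline void pmu_setup(void)
{
    uint64_t v;
    /* Program event counter 0 to count event 0x08 (instructions executed) */
    asm volatile("msr pmevtyper0_el0, %0" :: "r" ((uint64_t)0x08));
    /* Enable the cycle counter (bit 31) and event counter 0 (bit 0) */
    asm volatile("msr pmcntenset_el0, %0" :: "r" ((1UL << 31) | 1UL));
    /* PMCR_EL0: reset the counters (P and C bits) and set the global enable (E) */
    asm volatile("mrs %0, pmcr_el0" : "=r" (v));
    asm volatile("msr pmcr_el0, %0" :: "r" (v | 0x7));
}

static inline uint64_t read_cycles(void)
{
    uint64_t v;
    asm volatile("mrs %0, pmccntr_el0" : "=r" (v));
    return v;
}

static inline uint64_t read_instructions(void)
{
    uint64_t v;
    asm volatile("mrs %0, pmevcntr0_el0" : "=r" (v));
    return v;
}

int main(void)
{
    pmu_setup();
    uint64_t c0 = read_cycles(), i0 = read_instructions();
    printf("hello from the PMU example\n");          /* code of interest */
    uint64_t c1 = read_cycles(), i1 = read_instructions();
    printf("cycles: %llu instructions: %llu\n",
           (unsigned long long)(c1 - c0), (unsigned long long)(i1 - i0));
    return 0;
}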

 

The cycle counter is distinct from the other six event count registers. It is read from a separate CP15 system control register. For this example, event 0x8 (instruction architecturally executed) is monitored using event count register 0. Please take a look at the source code for the simple test application used to count the cycles and instructions of a simple printf() call.

 

Summary

 

This article provided an introduction to using the Carbon Analyzer to automatically gather information on ARM PMU events for bare metal and Linux software workloads. Carbon models provide full access to all PMU events during a single simulation with no software changes and no limitations on the number of events captured.

 

It also explained how additional control can be achieved by writing software to access the PMU directly from a Linux test program or benchmark application. This can be done with no kernel changes, but does require the PMU to be enabled from user mode and is limited to the number of counters available in the PMU; six for CPUs such as the Cortex-A15 and A57.

 

Next time I will look at an alternative approach to use the ARM Linux PMU driver and a system call to collect PMU events. 

Using the ARM Performance Monitor Unit (PMU) Linux Driver


The Linux kernel provides an ARM PMU driver for counting events such as cycles, instructions, and cache metrics. My previous article covered how to access data from the PMU automatically within SoC Designer by enabling hardware profiling events. It also discussed how to enable access from a Linux application so the application can directly access the PMU information. This article covers how to use the ARM Linux PMU driver to gather performance information. In the previous article, the Linux application was accessing the PMU hardware directly using system control coprocessor instructions, but this time a device driver and a system call will be used. As before, I used a Carbon Performance Analysis Kit (CPAK) for a Cortex-A53 system running 64-bit Linux.

 

The steps covered are:

  • Configure Linux kernel for profiling
  • Confirm the device tree entry for the ARM PMU driver is included in the kernel
  • Insert system calls into the Linux application to access performance information


Kernel Configuration


The first step is to enable profiling in the Linux kernel. It’s not always easy to identify the minimal set of values to enable kernel features, but in this case I enabled “Kernel performance events and counters” which is found under “General setup" then under "Kernel Performance Events and Counters".


[Screenshot: kernel configuration for Kernel performance events and counters]


I also enabled “Profiling support” on the “General setup” menu.


[Screenshot: kernel configuration with Profiling support enabled]


Once these options are enabled recompile the kernel as usual by following the instructions provided in the CPAK.


Device Tree Entry


Below is the device tree entry for the PMU driver. All Carbon Linux CPAKs for Cortex-A53 and Cortex-A57 include this entry so no modification is needed. If you are working with your own Linux configuration confirm the pmu entry is present in the device tree.


[Listing: device tree entry for the ARM PMU driver]


When the kernel boots the driver prints out a message:


hw perfevents: enabled with arm/armv8-pmuv3 PMU driver, 7 counters available


If this message is not in the kernel boot log check both the PMU driver device tree entry and the kernel configuration parameters listed above. If any of them are not correct the driver message will not appear.


Performance Information from a Linux Application


One way to get performance information from a Linux application is to use the perf_event_open system call. This system call does not have a glibc wrapper, so it is called directly using syscall. Most of the available examples, including the one shown in the man page, create a wrapper function to make usage easier.


[Listing: perf_event_open wrapper function]


The process is similar to many other Linux system calls: first get a file descriptor, then use the file descriptor for other operations such as ioctl() and read(). The perf_event_open system call uses a number of parameters to configure the events to be counted. Sticking with the simple case of instruction count, the perf_event_attr data structure needs to be filled in with the desired details.


It contains information about:

  • Start enabled or disabled
  • Trace child processes or not
  • Include hypervisor activity or not
  • Include kernel activity or not

 

Other system call arguments include which event to trace (such as instructions), the process id to trace, and which CPUs to trace on.

 

A setup function to count instructions could look like this:

 

[Listing: setup function to count instructions with perf_event_open]

 

At the end of the test or interesting section of code it’s easy to disable the instruction count and read the current value. In this code example, get_totaltime() uses a Linux timer to time the interesting work and this is combined with the instruction count from the PMU driver to print some metrics at the end of the test.

 

[Listing: disabling the counter and reading the instruction count]
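
Putting the pieces together, a minimal self-contained sketch of the flow is shown below. It follows the pattern from the perf_event_open man page (instructions retired around a region of interest) and is not the exact code from the CPAK; the timer-based get_totaltime() part is omitted.

/* Minimal sketch: count instructions around a region of interest. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    /* No glibc wrapper exists, so invoke the system call directly. */
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    long long count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;          /* start disabled */
    attr.exclude_kernel = 1;    /* ignore kernel activity */
    attr.exclude_hv = 1;        /* ignore hypervisor activity */

    fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("work of interest goes here\n");

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));
    printf("instructions: %lld\n", count);
    close(fd);
    return 0;
}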

 

Conclusion


The ARM PMU driver and perf_event_open system call provide a far more robust solution for accessing the ARM PMU from Linux applications. The driver takes care of all of the accounting, event counter overflow handling, and provides many flexible options for tracing.

 

For situations where tracing many events is required, it may be overly cumbersome to use the perf_event_open system call. One of the features of perf_event_open is the ability to use a group file descriptor to create groups of events with one group leader and other group members with all events being traced as a group. While all of this is possible it may be helpful to look at the perf command, which comes with the Linux kernel and provides the ability to control the counters for entire applications.

 

Jason Andrews

EDA Containers


Linux containers provide a way to build, ship, and run applications such as the EDA tools used in SoC design and verification. EDA Containers is a LinkedIn group to explore and discover the possibilities of using container technology for EDA application development and deployment. Personally, I work in virtual prototyping, doing simulation of ARM systems and software. This is a challenging area because it involves not only hardware simulation tools, but also software development tools to compile and run software. We are looking for other engineers interested in exploring containers as they relate to EDA tools and the embedded software development process. If you are interested in learning, or have expertise to share, about Docker, LXC, LXD, or Red Hat containers, please join us. The group is not specific to any EDA company or product. The members are from various companies who just happen to be interested in learning and exploring what can be done with Linux containers.

 

If you are interested please join the group or feel free to discuss related topics here in the ARM Community!

 



System Address Map (SAM) Configuration for AMBA 5 CHI Systems with CCN-504


In late 2014, Carbon released the first Carbon Performance Analysis Kit (CPAK) utilizing the ARM CoreLink CCN-504 Cache Coherent Network. Today, the CCN-504 can be built on Carbon IP Exchange with a wide range of configuration options. There are now four CPAKs utilizing the CCN-504 on Carbon System Exchange. The largest design includes sixteen Cortex-A57 processors, the most processors ever included in a Carbon CPAK.

 

At the same time SoC Designer has added new AMBA 5 CHI features including support for monitors, breakpoints, Carbon Analyzer support, and a CHI stub component for testing.

Introduction to AMBA 5 CHI

To get a good introduction on AMBA 5 CHI I recommend the article, "What is AMBA 5 CHI and how does it help?".

 

Another interesting ARM Community article is “5 things you might not know about AMBA 5 CHI”.

 

Although the cache coherency builds on AMBA 4 ACE and is likely familiar, some of the aspects of CHI are quite different.

[Table: comparison of AMBA 4 ACE and AMBA 5 CHI]

CCN-504 Configuration

Configuring the CCN-504 on Carbon IP Exchange is similar to all Carbon models. Select the desired interface types, node population, and other hardware details and click the "Build It" button to compile a model.

 

[Screenshot: CCN-504 configuration options on Carbon IP Exchange]

Understanding the Memory Map

One of the challenges of configuring CHI systems is to make sure the System Address Map (SAM) is correctly defined. As indicated in the table above, the process is more complex compared to a simple memory map with address ranges assigned to ports.

 

The network layer of the protocol is responsible for routing packets between nodes. Recall from the previous article that CHI is a layered protocol consisting of nodes of various types. Each node has a unique Network ID and each packet specifies a Target ID to send the packet to and a Source ID to be able to route the response.

 

For a system with A57 CPUs and a CCN-504 each Request Node (RN), such as a CPU, has a System Address Map (SAM) which is used to determine where to send packets. There are three possible node types a message could be sent to: Miscellaneous Node (MN), Home Node I/O coherent (HN-I), or Home Node Fully coherent (HN-F). DVM and Barrier messages are always sent to the MN so the challenge is to determine which of the possible Home Nodes an address is destined for.

 

To determine which HN-F is targeted, the RN uses an address hash function, which can be found in the CCN-504 TRM.

[Figure: HN-F address hashing function from the CCN-504 TRM]

Each CCN has a different hashing function depending on how many HN-F partitions are being used.

 

The hashing function calculates the HN-F to be used, but this is still not a Network ID. Additional configuration signals provide the mapping from HN-F number to Node ID.

 

All of this means there are a number of SAM* parameters for the A57 and the CCN-504 which must be set correctly for the memory map to function. It also means that a debugging tool which makes use of back-door memory access needs to understand the hashing function to know where to find the correct data for a given address. SoC Designer takes all of this into consideration to provide system debugging.

 

As you can see, setting up a working memory map is more complex compared to routing addresses to ports.

 

Carbon models use configuration parameters to perform the following tasks:

  • Associate each address region with HN-Fs or HN-Is
  • Specify the Node ID values of Home Nodes and the Miscellaneous Node
  • Define the number of Home Nodes
  • Specify the Home Nodes as Fully Coherent or I/O Coherent

 

The parameters for the A57 CPU model are shown below:

 

[Screenshot: SAM* parameters on the Cortex-A57 model]

 

The parameters for the CCN-504 model are similar, a list of SAMADDRMAP* values and SAM*NODEID values.

 

It’s key to make sure the parameters are correctly set for the system to function properly.

 

Cheat Sheet

Sometimes it’s helpful to have a picture of all of the parts of a CCN system. The cheat sheet below has been a tremendous help for Carbon engineers to keep track of the node types and node id values in a system.

 

[Figure: cheat sheet of CCN node types and node ID values]

SoC Designer Features

With the introduction of AMBA 5 CHI, SoC Designer has been enhanced to provide CHI breakpoints, monitors, and profiling information.

 

Screenshots of CHI transactions and CHI profiling are shown below. The Target ID and the Source ID for each transaction are shown. This is from the single-core A57 CPAK so the SourceID values are always 1. Multi-core CPAKs will create transactions with different SourceID values.

 

The CCN-504 has a large number of PMU events which can be used to understand performance.

 

[Screenshot: AMBA 5 CHI transactions in the Carbon Analyzer]

 

[Screenshot: CCN-504 PMU events in the Carbon Analyzer]

Summary

AMBA 5 CHI is targeted at systems with larger numbers of coherent masters. The AMBA 5 CHI system memory map is more complex compared to ACE systems. A number of System Address Map parameters are required to build a working system, both for the CPU and for the interconnect.

 

Carbon SoC Designer is a great way to experiment and learn how CHI systems work. Pre-configured Carbon Performance Analysis Kits (CPAKs), available on Carbon System Exchange, can be downloaded and run to demonstrate the hardware configuration as well as the software programming needed to initialize a CHI system. Just like the address map, the initialization software is more complex than for an ACE system with a CCI-400 or CCI-500.

Comparing ARM Cortex-A72 and ARM Cortex-A57


The latest high-performance ARMv8-A processor is the Cortex-A72. The press release reports that the A72 delivers CPU performance that is 50x greater than leading smartphones from five years ago and will be the anchor in premium smartphones for 2016. The Cortex-A72 delivers 3.5x the sustained performance of an ARM Cortex-A15 design from 2014. Last week ARM began providing more details about the Cortex-A72 architecture. AnandTech has a great summary of the A72 details.

 

[Image: ARM Cortex-A72]

 

The Carbon model of the A72 is now available on Carbon IP Exchange along with 10 Carbon Performance Analysis Kits (CPAKs). Since current design projects may be considering the A72, it’s a good time to highlight some of the differences between the Cortex-A72 and the Cortex-A57.

 

Carbon IP Exchange Portal Changes

IP Exchange enables users to configure, build, and download models for ARM IP. There are a few differences between the A57 and the A72. The first difference is the L2 cache size. The A57 can be configured with 512 KB, 1 MB, or 2 MB L2 cache, but the A72 can be configured with a fourth option of 4MB.

 

Another new configuration which is available on IP Exchange for the A72 is the ability to disable the GIC CPU interface. Many designs continue to use version 2 of the ARM GIC architecture with IP such as the GIC-400. These designs can take advantage of excluding the GIC CPU interface.

 

The A72 also offers an option to include or exclude the ACP (Accelerator Coherency Port) interface.

 

The last new configuration option is the number of FEQ (Fill/Evict Queue) entries, which on the A72 has been increased to offer 20, 24, or 28 entries, compared to 16 or 20 on the A57. This feature has been important to Carbon users doing performance analysis and studying the impact of various L2 cache parameters.

 

The Cortex-A72 configuration from IP Exchange is shown below.

 

[Screenshot: Cortex-A72 configuration options on Carbon IP Exchange]

ACE Interface Changes

The main change to the A72 interface is the width of the transaction ID signals has been increased from 6 bits to 7 bits. The wider *IDM signals only apply when the A72 is configured with an ACE interface. The main impact occurs when connecting an A72 to a CCI-400 which was used with A53 or A57. Since those CPUs have the 6-bit wide *IDM signals the CCI-400 will need to be reconfigured for 7-bit wide *IDM signals. All of the A72 CPAKs which use CCI-400 have this change made to them so they operate properly, but it’s something to watch if upgrading existing systems to A72.

 

This applies to the following signals for A72:

  • AWIDM[6:0]
  • WIDM[6:0]
  • BIDM[6:0]
  • ARIDM[6:0]
  • RIDM[6:0]

System Register Changes

A number of system registers are updated with new values to reflect the A72. The primary part number field in the Main ID register (MIDR) for the A72 is 0xD08, versus 0xD07 for the A57 and 0xD03 for the A53. Clearly, the 0xD08 part number was assigned well before the A72 product name was chosen. A number of other ID registers change value from 7 on the A57 to 8 on the A72.

New PMU Events

There are a number of new events tracked by the Cortex-A72 Performance Monitor Unit (PMU). All of the new events have event numbers 0x100 and greater. There are three main sections covering:

  • Branch Prediction
  • Queues
  • Cache

The screenshots below from the Carbon Analyzer show the PMU events. All of these are automatically instrumented by the Carbon model and are recorded without any software programming.

 

[Screenshots: new Cortex-A72 PMU events in the Carbon Analyzer]

 

The A72 contains many micro-architecture updates for incremental performance improvement. The most obvious one which was described is the L2 FEQ size, and there are certainly many more in the branch prediction, caches, TLB, pre-fetch, and floating point units. As an example, I ran an A57 CPAK and an A72 CPAK with the exact same software program. Both CPUs reported about 21,500 instructions retired. This is the instruction count if the program were viewed as a sequential instruction stream. Of course, both CPUs do a number of speculative operations. The A57 reported about 37,000 instructions speculatively executed and the A72 reported 35,700.

 

The screenshots of the instruction events are shown below, first A72 followed by A57. All of the micro-architecture improvements of the A72 combine to provide the highest performance CPU created by ARM to date.

 

[Screenshots: instruction events for the Cortex-A72 (top) and Cortex-A57 (bottom)]

Summary

Carbon users easily can run the A57, A53, and now the A72 with various configuration options and directly compare and contrast the performance results using their own software and systems. The CPAKs available from Carbon System Exchange provide a great starting point and can be easily modified to investigate system performance topics.

Migrating ARM Linux from CoreLink CCI-400 Systems to CoreLink CCN-504


Recently, Carbon released the first ARMv8 Linux CPAK utilizing the ARM CoreLink CCN-504 Cache Coherent Network on Carbon System Exchange. The CCN family of interconnect offers a wide range of high bandwidth, low latency options for networking and data center infrastructure.

[Image: ARM CoreLink CCN-504]

The new CPAK uses an ARM Cortex-A57 octa-core configuration to run Linux on a system with AMBA 5 CHI. Switching the Cortex-A57 configuration from ACE to CHI on Carbon IP Exchange is as easy as changing a pull-down menu item on the model build page. After that, a number of configuration parameters must be set to enable the CHI protocol correctly. Many of them were discussed in a previous article covering usage of the CCN-504. Using native AMBA 5 CHI for the CPU interface coupled with the CCN-504 interconnect provides high-frequency, non-blocking data transfers. Linux is commonly used in many infrastructure products such as set-top boxes, networking equipment, and servers, so the Linux CPAK is applicable to many of these system designs.

 

Selecting AMBA 5 CHI for the memory interface makes the system drastically different at the hardware level compared to a Linux CPAK using the ARM CoreLink CCI-400 Cache Coherent Interconnect, but the software stack is not significantly different.

 

From the software point of view, a change in interconnect usually requires some change in initial system configuration. It also impacts performance analysis as each interconnect technology has different solutions for monitoring performance metrics. An interconnect change can also impact other system construction issues such as interrupt configuration and connections.

 

Some of the details involved in migrating a multi-cluster Linux CPAK from CCI to CCN are covered below.

 

Software Configuration

Special configuration for the CCN-504 is done in the Linux boot wrapper, which runs immediately after reset. The CPAK doesn't include the boot wrapper source code; instead it uses git to download it from kernel.org and then patches in the changes needed for CCN configuration. The added code performs the following tasks (a minimal sketch of the first task follows the list):

  • Set the SMP enable bit in the A57 CPU Extended Control Register (CPUECTLR)
  • Terminate barriers at the HN-I
  • Enable multi-cluster snooping
  • Program HN-F SAM control registers
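As an illustration of the first task only, the fragment below sets the SMPEN bit using C with inline assembly. This is a minimal sketch, not the CPAK's boot wrapper code; the register encoding (CPUECTLR_EL1 = S3_1_C15_C2_1) and bit position (bit 6) come from the Cortex-A57 documentation, and the function name is my own.

    static inline void enable_smp_bit(void)
    {
        unsigned long ectlr;

        /* CPUECTLR_EL1 (S3_1_C15_C2_1): SMPEN is bit 6 on Cortex-A57 */
        __asm__ volatile("mrs %0, S3_1_C15_C2_1" : "=r"(ectlr));
        ectlr |= (1UL << 6);                      /* enable hardware coherency */
        __asm__ volatile("msr S3_1_C15_C2_1, %0" : : "r"(ectlr));
        __asm__ volatile("isb");                  /* ensure the write takes effect */
    }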

 

The most critical software task is to make sure multi-cluster snooping is operational; without it, Linux will not run properly. If you are designing a new multi-cluster CCN-based system, it is worth running a bare metal software program first to verify that snooping across clusters is working correctly. It's much easier to debug the system with bare metal software, and there are a number of multi-cluster CCN CPAKs available with bare metal software which can be used.
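A cross-cluster snoop check can be as simple as one core writing a cacheable shared variable and a core in the other cluster polling for it. The sketch below is only an outline, assuming both clusters already have SMPEN set, caches and MMU enabled, and the variable mapped as shareable normal memory; the names and value are made up.

    volatile unsigned int snoop_flag;        /* cacheable, shareable memory */

    void writer_core(void)                   /* run on a core in cluster 0 */
    {
        snoop_flag = 0xC0FFEEu;              /* the write lands in the local cache */
        __asm__ volatile("dsb sy");
        __asm__ volatile("sev");             /* wake any core waiting in WFE */
    }

    void reader_core(void)                   /* run on a core in cluster 1 */
    {
        while (snoop_flag != 0xC0FFEEu)      /* the value must arrive via a snoop */
            __asm__ volatile("wfe");
        /* reaching this point means cross-cluster snooping is working */
    }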

 

I always recommend a similar approach for other hardware-specific programming. Users often have hardware registers that need to be programmed before starting Linux, and putting this code into the boot wrapper is easier and less error prone than using simulator scripts to force register values. The Linux device tree provided with the CPAK also contains an entry for the CCN-504. The device tree entry has a base address which must match the PERIPHBASE parameter on the CCN-504 model; in this case PERIPHBASE is set to 0x30, which means the address in the device tree is 0x30000000.

[Device tree entry for the CCN-504 with base address 0x30000000]

All Linux CPAKs come with an application note which provides details on how to configure and compile Linux to generate a single .axf file.

 

GIC-400 Identification of CPU Accesses

One of the new things in the CPAK is the method used to get the CPU Core ID and Cluster ID information to the GIC-400.

 

The GIC-400 requires the AWUSER and ARUSER bits on AXI to indicate which CPU is accessing the GIC. A number between 0 and 7 must be driven on these signals so the GIC knows which CPU is reading or writing, but getting the proper CPU number onto the AxUSER bits can be a challenge.

 

In Linux CPAKs with the CCI, this is handled automatically by the GIC model, which inspects the AXI transaction ID bits and derives the AxUSER values at its input. Each CPU indicates the CPU number within the cluster (0-3), and the CCI adds information about which slave port received the transaction to indicate the cluster.

 

Users don't need to add any special components to the design because the mapping is done inside the Carbon model of the GIC-400 using a parameter called "AXI User Gen Rule". The parameter's default value assumes a two-cluster system in which each cluster has four cores, a standard 8-core configuration which uses all of the ports of the GIC-400. The parameter can be adjusted for other configurations as needed.

 

The User Gen Rule does even more, because the ARM Fast Model of the GIC-400 uses the concept of a Cluster ID to indicate which CPU is accessing the GIC. The Cluster ID concept is familiar to software reading the MPIDR register and exists in hardware as a CPU configuration input, but it is not present in each bus transaction coming from a CPU and has no direct correlation to the CCI's behavior of modifying the ID bits based on slave port.

 

To create systems which use cycle accurate models and can also be mapped to ARM Fast Models, the User Gen Rule includes all of the following information for each of the eight CPUs supported by the GIC (a simple illustration follows the list):

  • Cluster ID value which is used to create the Fast Model system
  • CCI Port which determines the originating cluster in the Cycle Accurate system
  • Core ID which determines the CPU within a cluster for both Fast Model and Cycle Accurate systems
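To make the mapping more concrete, the sketch below shows one plausible way these fields could resolve to a GIC-400 CPU index. It is only an illustration of the default two-cluster, four-core arrangement; the structure and formula are assumptions, not the actual implementation inside the Carbon model.

    struct user_gen_rule_entry {
        unsigned cluster_id;   /* Cluster ID used when mapping to a Fast Model system */
        unsigned cci_port;     /* CCI slave port identifying the cluster (cycle accurate) */
        unsigned core_id;      /* CPU number within the cluster, 0-3 */
    };

    /* Assumed default: cluster 0 cores map to GIC CPUs 0-3, cluster 1 cores to 4-7 */
    static unsigned gic_cpu_index(const struct user_gen_rule_entry *e)
    {
        return (e->cluster_id * 4u) + e->core_id;
    }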

 

With all of this information Linux can successfully run on multi-cluster systems with the GIC-400.

 

AMBA 5 CHI Systems

In a system with CHI, the Cluster ID and CPU ID values must be presented to the GIC in the same way as in ACE systems. For CHI systems, the CPU uses the RSVDC signals to indicate the Core ID. The new CCN-504 CPAK introduces a SoC Designer component to add the Cluster ID information: a CHI-to-CHI pass-through with a Cluster ID parameter that inserts the given Cluster ID into the RSVDC bits.

 

For CCN configurations with AXI master ports to memory, the CCN automatically drives the AxUSER bits correctly for the GIC-400. For systems which bridge CHI to AXI using the SoC Designer CHI-AXI converter, the converter takes care of driving the AxUSER bits based on the RSVDC inputs. In both cases the AxUSER bits are driven to the GIC. The main difference for CHI systems is that the GIC User Gen Rule must be disabled by setting the "AXI4 Enable Change USER" parameter to false so no additional modification is done by the Carbon model of the GIC-400.

 

Conclusion

All of this may be a bit confusing, but it demonstrates the value of Carbon CPAKs. All of the system requirements needed to put the various models together into a running Linux system have already been worked out, so users don't need to know every detail if they are not interested. For engineers who are interested, CPAKs offer a way to confirm the expected behavior described in the documentation by using a live simulation and actual waveforms.

Understanding ARM Bare Metal Benchmark Application Startup Time


One of the benefits of simulation with virtual prototypes is the added control and visibility of memory. It's easy to load data into memory models, dump data from memory to a file, and change memory values without running any simulation. After I gave up debugging hardware in a lab and decided I would rather spend my time simulating, some of my first lessons were about the assumptions software makes about memory at power-on. When a computer powers on, software must assume the contents of memories such as SRAM and DRAM are unknown. I recall being somewhat amazed to find that initialization software commonly clears large memory ranges before doing anything useful. I also recall learning that startup software would figure out how much memory was present in a system by writing the first byte or word to some value and reading it back; if the written value was read back, memory must be present. The memory size was determined by incrementing the address until the read value no longer matched the expected value.
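The probing trick described above reads roughly like the sketch below. It is only an illustration of the idea; the function name, probe step, and pattern are arbitrary.

    #include <stddef.h>
    #include <stdint.h>

    /* Walk memory in 'step' byte increments until a write no longer reads back */
    static size_t probe_memory_size(volatile uint32_t *base, size_t max_bytes, size_t step)
    {
        size_t offset;

        for (offset = 0; offset < max_bytes; offset += step) {
            volatile uint32_t *p = base + offset / sizeof(uint32_t);
            uint32_t saved = *p;

            *p = 0xA5A5A5A5u;             /* write a known pattern */
            if (*p != 0xA5A5A5A5u)        /* no read-back means no memory here */
                break;
            *p = saved;                   /* restore the original contents */
        }
        return offset;                    /* bytes of memory found */
    }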

 

Recently, I was working on a bare metal program and simulating execution of an ARM Cortex-A15 system. Carbon Performance Analysis Kits (CPAKs) come with example systems and initialization software to help shorten the ramp-up time for users. Generally, people don't pay much attention to the initialization code unless it's either broken or they need to configure specific hardware features related to the caches, MMU, or VFP and NEON hardware.

 

Today, I'll provide some insight into what the initialization code does, specifically what happens between the end of the assembly code that initializes the hardware and the start of a C program.

 

The program I was running had the following code at the top of main.c:

 

[Code listing: top of main.c showing the memspace array]
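For readers following along without the CPAK, the relevant part of main.c looks roughly like the sketch below; only the memspace array and its size come from the original, while the macro name and surrounding code are assumptions.

    #define MEM_SIZE 200000               /* reduced to 200 for the first pass */

    static char memspace[MEM_SIZE];       /* no initializer, so it lands in .bss */

    int main(void)
    {
        /* benchmark code exercising memspace goes here */
        return 0;
    }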

There is an array named memspace with a #define to set the size of the array. When running new software, it's a good idea to learn as much as possible as quickly as possible by getting through the program once. One way to do this is to cut down the number of iterations, the data size, or whatever else is needed to complete the program and gain confidence it's running correctly; this avoids wasting time thinking the program is running correctly when it's not. I normally put a few breakpoints in the software and feel my way through the program to see where it goes and how it runs.

 

I like to put a breakpoint at the end of the initial assembly code to make sure nothing has gone wrong with the basic setup. Next, I like to put a breakpoint at main() to make sure the program gets started, and then stop at interesting looking C functions to track progress. For this particular program I shrunk the size of the memspace array to 200 bytes for the first pass through the test.

 

After I understood the basics of the program, I put the array size back to the original value of 200000 bytes. When I did this I noticed a strange phenomenon: the simulation took much longer to reach main() with the larger array, about 8 times longer, as shown in the table below.

 

   Array size    Cycles to reach main()
   200           4860
   200000        39174

 

One of the purposes of this article is to shed some light on what happens between the end of the startup assembly code and main(). Obviously, there is something related to the size of the memory array that influences this section of code.

 

Readers who have the A15 Bare Metal CPAK can follow along with code similar to what I used for the benchmark by looking at Applications/Dhrystone/Init.s.

 

There are two parts to jumping from the initial assembly code to the main() function in C. First, save the address of __main in r12 as shown below.

 

[Init.s excerpt: loading the address of __main into r12]

 

Next, the code jumps to __main at the end of the assembly sequence using the BX r12 instruction. After the BX instruction the program enters a section of code provided by the compiler (for which there is no source to debug), but if all goes well it comes out again at main().

 

The code starting from __main performs the following tasks:

  • Copies the execution regions from their load addresses to their execution addresses. This is a memory setup task for the case where the code is not loaded in the location it will run from or if the code is compressed and needs to be decompressed.
  • Zeros memory that needs to be cleared based on the C standard that says statically-allocated objects without explicit initializers are initialized to zero.
  • Branches to __rt_entry


Once the memory is ready, the code starting from __rt_entry sets up the runtime environment by doing the following tasks:

  • Sets up the stack and heap
  • Initializes library functions
  • Calls main()
  • Calls exit() after main() completes

 

If anything goes wrong between the assembly code and main(), the most common cause is the stack and heap setup. I always recommend taking a look at this first if your program doesn't make it to main().

 

You may have guessed by now that the simulation time difference I described is caused by the time required to zero the larger array. As I mentioned at the start of the article, writing zero to a large block of memory that is already zero (or can easily be made zero using a simulator command) is a waste of time. Carbon memory models initialize memory contents to zero by default. Some people prefer a more pessimistic approach and initialize memory to a non-zero value to make sure the code will work on the real hardware, but for users more interested in performance analysis it seems helpful to avoid wasted simulation cycles and get on to the interesting work.

 

The Linux size command is a good way to confirm that the larger array impacts the bss section of the image; the zero-initialized (ZI) data and bss refer to the same segment. With the 200 byte array:

 

-bash-3.2$ size main.axf

   text        data         bss         dec         hex     filename

  71276          16     721432     792724       c1894     main.axf

 

With the 200000 byte array:

 

-bash-3.2$ size main.axf

   text        data         bss         dec         hex     filename

  71276          16     921232     992524       f250c     main.axf


Alternatives to Save Simulation Time

 

There are multiple ways to avoid executing instructions that write zero to memory which is already zero. It turns out to be a popular question; search for something like "avoid bss static object zero".

 

One way is to use linker scripts or compiler directives to put the array into a different section of memory that is not automatically initialized to 0.

 

For the program I was working on I decided to investigate a couple of alternatives.

 

One solution is to just skip __main altogether and go directly to __rt_entry since __main doesn’t do anything useful for this program and the change is simple.

 

To skip __main just replace the load of __main into r12 with a load of __rt_entry into r12. Now when the program runs __main will be skipped altogether.

 

[Init.s excerpt: loading the address of __rt_entry into r12 instead of __main]

 

Here are the new results with __main skipped.

 

   Array size    Cycles to reach main()
   200           4355
   200000        4362

 

As expected, the number of cycles to reach main() is about the same for both array sizes, and much less than with the zeroing of the large array. Although the difference may seem small for the benchmark shown here, the problem gets much bigger when a larger and more complex software program is run. I checked a larger software program and found it was taking more than 10 million instructions to zero memory.

 

I wouldn't recommend just blindly applying this technique, especially on larger software programs, as debugging improperly initialized global variables is not fun.

 

Another possibility to avoid initializing large global variables is to use a compiler pragma. The ARM compiler, armcc, has a section pragma to move the large array into a section which is not automatically initialized to zero. To use it, put the pragma around the array declaration as shown below.

 

[Code listing: array declaration wrapped in the armcc section pragma]
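A hedged sketch of the pragma usage is shown below; the section name is arbitrary and simply has to match whatever the scatter file references.

    #pragma arm section zidata = "non_init"   /* place following ZI data in "non_init" */
    static char memspace[200000];             /* no longer zeroed by __main */
    #pragma arm section zidata                /* revert to the default ZI section */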

 

After putting in the pragma, one more step is needed: the linker scatter file must be made aware of the new section. More information is available in the ARM documentation under "How to prevent uninitialized data from being initialized to zero".

 

In my linker scatter file I added one more section:

 

[Scatter file excerpt showing the added section]

 

Executing the program with the pragma is a much safer solution, especially when the software is going to write the memory anyway and does not rely on the initial zero values. With the pragma, the number of cycles to reach main() is the same for both sizes of the array.

 

   Array size    Cycles to reach main()
   200           4411
   200000        4411

 

The pragma is a good solution if there are a few large arrays that can be found and instrumented with the pragma.

 

Hopefully this article provided some understanding of what happens between the initial assembly code and main(). Although there is no one-size-fits-all solution, it definitely helps to understand, and where possible improve, application startup time. Next time you find yourself with a program that appears stuck before reaching main(), this just might be the cause.

 

Jason Andrews

Sometimes Hardware Details Matter in ARM Embedded Systems Programming


Last week, I received the call for papers for the Embedded World Conference for 2015. The list of topics is a good reminder of how broad the world of embedded systems is. It also reminded me how overloaded the term "embedded" has become. The term may invoke thoughts of a system made for a specific purpose to perform a dedicated function, or visions of invisible processors and software hidden in a product like a car. When I think of embedded, I tend to think about the combination of hardware and software, learning how they work together, and the challenge of building and debugging a system running software that interacts with hardware. Some people call this hardware-dependent software, firmware, or device drivers. Whatever it is called, it's always a challenge to construct and debug both hardware and software and find out where the problems are. One of the great things about working at Carbon is the variety of the latest ARM IP combined with a spectrum of different types of software. We commonly work with software ranging from small bare-metal C programs to Linux running on multiple ARM cores. We also work with a mix of cycle accurate models and abstract models.

 

If you are interested in this area, I encourage you to learn as much as possible about the topics below. Amazingly, the most popular programming language for this work is still C, and being able to read assembly language also helps.


  • Cross Compilers and Debuggers
  • CPU Register Set
  • Instruction Pipeline
  • Cache
  • Interrupts and Interrupt Handlers
  • Timers
  • Co-Processors
  • Bus Protocols
  • Performance Monitors


I could write articles about how project X at company Y used Carbon products to optimize system performance or shrink time to market and lived happily ever after, but I prefer to write about what users can learn from virtual prototypes. Finding out new things via hands-on experience is the exciting part of embedded systems for me.


Today, I will provide two examples of what working with embedded systems is all about. The first demonstrates why embedded systems programming is different from general purpose C programming because working with hardware requires paying attention to extended details. The second example relates to a question many people at Carbon are frequently asked, “Why are accurate models important?” Carbon has become the standard for simulation with accurate models of ARM IP, but it’s not always easy to see why or when the additional accuracy makes a difference, especially for software development. Since some software development tasks can be done with abstract models, I will share a situation where accuracy makes a difference. Both of the examples in this article looked perfectly fine on the surface, but didn’t actually work.


GIC-400 Programming Example


Recently, I was working with some software that had been used on an ARM Cortex-A9 system. I ported it to a Cortex-A15 system, and was working on running it on a new system that used the GIC-400 instead of the internal GIC of the A15.


People that have worked with me know I have two rules for system debugging:

  1. Nothing ever works the first time
  2. When things don’t work, guessing is not allowed

When I ran the new system with the external GIC-400, the software failed to start up correctly. One of the challenges in debugging such problems is that the software jumps off to bad places after things go wrong, and there is little or no trail of where the software went off the path. Normally, I try to use software breakpoints to close in on the problem. Another technique is to use the Carbon Analyzer to trace bus transactions and software execution to spot a wrong turn. In this particular case I was able to spot an abort and trace it to a normal-looking access to one of the GIC-400 registers.


I was able to find the instruction that was causing the abort. The challenge was that it looked perfectly fine: it was a read of the GIC Distributor Control Register to see if the GIC is enabled. It's one of the simplest things the software could do, and it would be expected to work as long as the GIC is present in the system. Here is the source code:

[Code listing: C function reading the GIC Distributor Control Register]
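Because the original listing is an image, here is a hedged illustration of the failure mode rather than the CPAK's actual source; the base address and function name are assumptions, and the original code reached the byte load through its own register definitions.

    #define GICD_CTLR_ADDR  0x2C001000u   /* distributor base address is an assumption */

    int gic_enabled(void)
    {
        /* A byte-sized access compiles to LDRB, which the GIC-400 aborts for this register */
        return *(volatile unsigned char *)GICD_CTLR_ADDR & 0x1;
    }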

The load instruction which was aborting was the second one in the function, the LDRB:

[Disassembly excerpt: the aborting LDRB instruction]

The puzzling thing was that the instruction looked fine, and I was certain I had run this function on other systems containing the Cortex-A9 and Cortex-A15 internal GIC.

 

After some pondering, I recalled reading that the GIC-400 has some restrictions on access size for specific registers. Sure enough, the aborting instruction was a load byte. It's not easy to find a clear statement specifying that a byte access to this register is not allowed, but I'm sure it's in the documentation somewhere. I decided it was easier to just re-code the function to generate a word access and try again.

 

There are probably many ways to change the code to avoid the byte read, but since the enable bit is the only bit used in the register I rewrote the function this way:

[Code listing: reworked function using a word access]
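Again as a hedged sketch rather than the original source, the reworked check reads the full 32-bit register and masks the enable bit, which leads the compiler to emit a word load:

    #define GICD_CTLR_ADDR  0x2C001000u   /* same assumed base address as above */

    int gic_enabled(void)
    {
        /* A full-word read (LDR) of GICD_CTLR is accepted by the GIC-400 */
        return (int)(*(volatile unsigned int *)GICD_CTLR_ADDR & 0x1u);
    }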

Sure enough, the compiler now generated a load word instruction and it worked as expected.

 

This example demonstrates a few principles of embedded systems. The first is that the ability to understand ARM assembly language is a big help in debugging, especially when tracing loads and stores to hardware such as the GIC-400. Another is that the code a C compiler generates sometimes matters. Most of the time when using C there is no need to look at the generated code, but in this case there is a connection between the C source and how the hardware responds to the generated instructions. Understanding how to modify the C code to generate different instructions was needed to solve the problem.

 

Mysterious Interrupt Handler

 

The next example demonstrates another situation where details matter. A bare-metal software program was installing an interrupt handler for the Cortex-A15 nIRQ interrupt by placing a jump to the handler at address 0x18. During program startup the code writes an instruction into memory which jumps to the C function (irq_handler) that handles the interrupt. The important code looked like this, with VECTOR_BASE set to 0:

[Code listing: writing the jump instruction for irq_handler to address 0x18]
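The listing is an image, so here is a hedged reconstruction of the installation step; the exact instruction used by the CPAK code may differ, but a classic pattern writes an LDR pc, [pc, #offset] instruction at the vector and places the handler address where that load will find it.

    #define VECTOR_BASE   0x00000000u
    #define IRQ_VECTOR    (VECTOR_BASE + 0x18u)

    extern void irq_handler(void);

    void install_irq_handler(void)
    {
        /* 0xE59FF018 is LDR pc, [pc, #0x18]: fetched from 0x18, it loads the
           target from 0x38 because the PC reads as the vector address plus 8 */
        *(volatile unsigned int *)IRQ_VECTOR          = 0xE59FF018u;
        *(volatile unsigned int *)(IRQ_VECTOR + 0x20) = (unsigned int)irq_handler;
    }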

The code looked perfectly fine and worked when simulated with abstract models, but didn't work as expected when run on a cycle accurate simulation. Initially, it was very hard to tell why: the simulation would appear to hang, and when it was stopped the CPU was sitting in weird places that didn't seem like code that should have been running. Using the instruction and transaction traces it looked like an interrupt was occurring, but the program didn't go to the interrupt handler as expected. To debug, I first placed a hardware breakpoint on a change of the interrupt signal, then a software breakpoint at address 0x18 so the simulation would stop when the first interrupt occurred. The expected instruction was there, but when I single-stepped to the next instruction the PC just advanced one word to address 0x1c, with no jump. Subsequent step commands simply incremented the PC. There was no code at any address other than 0x18, so the CPU was executing instructions that were all zeros.

 

This problem was pretty mysterious considering the debugger showed the proper instruction at the right place, but it was as if it wasn’t there at all. Finally, it hit me that the only possible explanation was that the instruction really wasn’t there.

 

What if the cache line containing address 0x18 was already in the instruction cache when the jump instruction was written by the above code? When the interrupt occurred the PC jumps to 0x18 but would get the value from the instruction cache and never see the new value that had been written.

 

The solution was to invalidate the cache line after writing the instruction to memory using a system control register instruction with 0x18 in r0:

[Code listing: cache maintenance instruction invalidating the line containing 0x18]
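The image shows the essential operation; below is a hedged, slightly fuller sequence written as C with inline assembly. The CP15 encodings (DCCMVAU and ICIMVAU) are standard ARMv7 cache maintenance operations, but the CPAK code may perform only the invalidate described in the text.

    static void sync_written_vector(unsigned int addr)   /* addr is 0x18 here */
    {
        __asm__ volatile("mcr p15, 0, %0, c7, c11, 1" : : "r"(addr)); /* DCCMVAU: clean D-cache line to PoU */
        __asm__ volatile("dsb sy");
        __asm__ volatile("mcr p15, 0, %0, c7, c5, 1"  : : "r"(addr)); /* ICIMVAU: invalidate I-cache line */
        __asm__ volatile("dsb sy");
        __asm__ volatile("isb");                                      /* flush the pipeline */
    }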

Although cache details are mostly handled automatically by hardware, and cache modeling is not always required for software development, this example shows that sometimes more detailed models are required to fully test software. In hindsight, an experienced engineer would recognize this as self-modifying code and know to pay attention to caching, but it demonstrates a situation where using detailed models does matter.

 

Summary

 

Although you may never encounter the exact problems described here, they demonstrate typical challenges embedded systems engineers face, and they remind us to keep watch for hardware details. These examples also point out another key principle of embedded software: old code lives forever. Code that worked on one system won't automatically work on a new system, even if the two seem similar. If these examples sound familiar, it might be time to look into virtual prototypes for your embedded software development.

 

Jason Andrews
