
Understanding ARM Bare Metal Benchmark Application Startup Time


One of the benefits of simulation with virtual prototypes is the added control and visibility of memory. It’s easy to load data into memory models, dump data from memory to a file, and change memory values without running any simulation. After I gave up debugging hardware in a lab and decided I would rather spend my time simulating, some of my first lessons were about the assumptions software makes about memory at power on. When a computer powers on, software must assume the contents of memories such as SRAM and DRAM are unknown. I recall being somewhat amazed to find out that initialization software would commonly clear large memory ranges before doing anything useful. I also recall learning how startup software would figure out how much memory was present in a system: write the first byte or word to some value, read it back, and if the written value came back there must be memory present. The software would keep incrementing the address until the read value no longer matched the written value, and conclude that this address was the size of the memory.
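That sizing trick is easy to sketch in C. Here is a minimal simulation of it; the simulated bus, the probe pattern, and the 4 KB memory size are my own assumptions for illustration, not code from any CPAK:

```c
#include <stdint.h>
#include <stddef.h>

#define REAL_MEM_SIZE 4096   /* the "installed" memory in this toy model */

static uint8_t sim_mem[REAL_MEM_SIZE];

/* Writes beyond the installed memory are silently dropped, and reads beyond
   it return 0xFF, roughly how an open bus can behave. */
static void bus_write(size_t addr, uint8_t value) {
    if (addr < REAL_MEM_SIZE)
        sim_mem[addr] = value;
}

static uint8_t bus_read(size_t addr) {
    return (addr < REAL_MEM_SIZE) ? sim_mem[addr] : 0xFF;
}

/* The probe described above: write a pattern, read it back, and keep
   incrementing the address until the value no longer matches. */
size_t probe_memory_size(size_t max_addr) {
    for (size_t addr = 0; addr < max_addr; addr++) {
        bus_write(addr, 0xA5);
        if (bus_read(addr) != 0xA5)
            return addr;   /* first address that fails marks the memory size */
    }
    return max_addr;
}
```

Real firmware typically probes in larger strides (words or megabytes) rather than byte by byte, but the read-back-and-compare idea is the same.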

 

Recently, I was working on a bare metal program and simulating execution of an ARM Cortex-A15 system. Carbon Performance Analysis Kits (CPAKs) come with example systems and initialization software to help shorten the ramp-up time for users. Generally, people don’t pay much attention to the initialization code unless it’s either broken or they need to configure specific hardware features related to caches, MMU, or VFP and Neon hardware.

 

Today, I’ll provide some insight into some of the things the initialization code does, specifically what happens between the end of the assembly code which initializes the hardware and the start of a C program.

 

The program I was running had the following code at the top of main.c:

 

[Screenshot: the top of main.c, showing the memspace array and its size #define]
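For readers without the screenshot, the top of main.c looks roughly like this (the macro name MEMSPACE_SIZE is my guess; the array name memspace and the 200000-byte size come from the article):

```c
#include <stddef.h>

#define MEMSPACE_SIZE 200000           /* reduced to 200 for the first pass */

static char memspace[MEMSPACE_SIZE];   /* no initializer: ends up in the ZI/bss region */

size_t memspace_size(void) { return sizeof memspace; }
```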

There is an array named memspace with a #define to set the size of the array. When running new software, it’s a good idea to learn as much as possible as quickly as possible by getting through the program once, end to end. One way to do this is to cut down the number of iterations, the data size, or whatever else is needed to complete the program and gain confidence it’s running correctly. This avoids wasting time thinking the program is running correctly when it’s not. I normally put a few breakpoints in the software and feel my way through the program to see where it goes and how it runs.

 

I like to put a breakpoint at the end of the initial assembly code to make sure nothing has gone wrong with the basic setup. Next, I like to put a breakpoint at main() to make sure the program gets started, and then stop at interesting looking C functions to track progress. For this particular program I shrunk the size of the memspace array to 200 bytes for the first pass through the test.

 

After I understood the basics of the program, I put the array size back to the original value of 200000 bytes. When I did this I noticed a strange phenomenon. The simulation took much longer to get to main() when the array was larger, about 8 times longer as shown in the table below.

 

Array size    Cycles to reach main()
       200                      4860
    200000                     39174

 

One of the purposes of this article is to shed some light on what happens between the end of the startup assembly code and main(). Obviously, there is something related to the size of the memory array that influences this section of code.

 

Readers who have the A15 Bare Metal CPAK can follow along with code similar to what I used for the benchmark by looking at Applications/Dhrystone/Init.s.

 

There are two parts to jumping from the initial assembly code to the main() function in C. First, save the address of __main in r12 as shown below.

 

[Screenshot: Init.s loading the address of __main into r12]

 

Next, jump to __main at the end of the assembly code by using the BX r12 instruction. After the BX instruction the program goes into a section of code provided by the compiler (for which there is no source to debug) but if all goes well it comes out again at main().

 

The code starting from __main performs the following tasks:

  • Copies the execution regions from their load addresses to their execution addresses. This handles the case where the code is not loaded at the location it will run from, or where the code is compressed and needs to be decompressed.
  • Zeroes the memory that needs to be cleared, based on the C standard rule that statically allocated objects without explicit initializers are initialized to zero.
  • Branches to __rt_entry.
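The zero-initialization step exists because of that C language rule, which is easy to demonstrate; the names below are my own, not from the CPAK:

```c
/* Statically allocated objects with no explicit initializer must read as
   zero, which is why __main zeroes the ZI/bss region before main() runs. */
static int zeroed[1000];      /* no initializer: placed in bss, must be zero */
static int initialized = 42;  /* explicit initializer: placed in the data section */

int bss_is_zeroed(void) {
    for (int i = 0; i < 1000; i++)
        if (zeroed[i] != 0)
            return 0;
    return 1;
}

int data_value(void) { return initialized; }
```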


Once the memory is ready, the code starting from __rt_entry sets up the runtime environment by performing the following tasks:

  • Sets up the stack and heap
  • Initializes library functions
  • Calls main()
  • Calls exit() after main() completes

 

If anything goes wrong between the assembly code and main(), the most common cause is the stack and heap setup. I always recommend taking a look at this first if your program doesn’t make it to main().

 

You may have guessed by now that the simulation time difference I described is caused by the time required to zero the larger array. As I mentioned at the start of the article, writing zero to large blocks of memory that are already zero (or can easily be made zero using a simulator command) is a waste of time. Carbon memory models already initialize memory contents to zero by default. Some people prefer a more pessimistic approach and initialize memory to some non-zero value to make sure the code will work on the real hardware, but for users more interested in performance analysis it seems helpful to avoid wasted simulation cycles and get on to the interesting work.

 

The Linux size command is a good way to confirm that the larger array affects the bss section of the program. The zero-initialized (ZI) data and bss refer to the same segment. With the 200-byte array:

 

-bash-3.2$ size main.axf
   text        data         bss         dec         hex     filename
  71276          16      721432      792724       c1894     main.axf

 

With the 200000-byte array:

 

-bash-3.2$ size main.axf
   text        data         bss         dec         hex     filename
  71276          16      921232      992524       f250c     main.axf
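The numbers from size are internally consistent: dec is just text + data + bss, and the bss growth between the two builds matches the array growth exactly. A quick check, using the values from the output above:

```c
/* dec = text + data + bss for both builds, and the bss delta equals the
   199800 extra bytes of memspace (200000 - 200). */
long dec_total(long text, long data, long bss) {
    return text + data + bss;
}

long bss_delta(void) {
    return 921232L - 721432L;  /* large-array bss minus small-array bss */
}
```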


Alternatives to Save Simulation Time

 

There are multiple ways to avoid executing instructions that write zero to memory that is already zero. It turns out to be a popular question; search for something like “avoid bss static object zero”.

 

One way is to use linker scripts or compiler directives to put the array into a different section of memory that is not automatically initialized to 0.

 

For the program I was working on, I decided to investigate a couple of alternatives.

 

One solution is to just skip __main altogether and go directly to __rt_entry since __main doesn’t do anything useful for this program and the change is simple.

 

To skip __main, just replace the load of __main into r12 with a load of __rt_entry into r12. Now when the program runs, __main will be skipped altogether.

 

[Screenshot: Init.s loading the address of __rt_entry into r12 instead of __main]

 

Here are the new results with __main skipped.

 

Array size    Cycles to reach main()
       200                      4355
    200000                      4362

 

As expected, the number of cycles to reach main() is about the same for both array sizes, and much lower than with the zeroing of the large array. Although the difference may seem small for the benchmark shown here, the problem gets much bigger when a larger and more complex software program is run. I checked a larger program and found it was spending more than 10 million instructions zeroing memory.

 

I wouldn't recommend just blindly applying this technique, especially on larger software programs, as debugging improperly initialized global variables is not fun.

 

Another way to avoid initializing large global variables is to use a compiler pragma. The ARM compiler, armcc, has a section pragma that moves the large array into a section which is not automatically initialized to zero. To use it, put the pragma around the array declaration as shown below.

 

[Screenshot: the #pragma arm section directives around the memspace declaration]
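A hedged reconstruction of what that pragma usage looks like with armcc (the section name non_init and the macro name are my assumptions; the screenshot shows the actual code):

```c
#define MEMSPACE_SIZE 200000

/* armcc: place the following zero-initialized data in a named section so the
   C library startup code does not zero it. Other compilers simply ignore the
   unknown pragma, so this still compiles elsewhere. */
#pragma arm section zidata = "non_init"
static char memspace[MEMSPACE_SIZE];
#pragma arm section zidata   /* restore the default ZI section */

unsigned long memspace_bytes(void) { return (unsigned long)sizeof memspace; }
```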

 

After putting in the pragma, one more step is needed: the linker scatter file must be made aware of the new section. More information is available in the ARM documentation under “How to prevent uninitialized data from being initialized to zero”.

 

In my linker scatter file I added one more section:

 

[Screenshot: the added scatter file section]
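For readers without the screenshot, a hypothetical scatter-file fragment along these lines (the region names, addresses, and the section name non_init are placeholders, not the CPAK’s actual values) would look like:

```
LOAD_REGION 0x80000000
{
    APP 0x80000000
    {
        * (+RO, +RW, +ZI)        ; normal code and data
    }
    NON_INIT +0 UNINIT           ; UNINIT: do not zero this region at startup
    {
        * (non_init)             ; the section named in the pragma
    }
}
```

The key part is the UNINIT attribute, which tells the linker that the execution region should not be zero-initialized.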

 

Executing the program with the pragma is a much safer solution, especially when the software is going to write the memory anyway and does not assume initial zero values. With the pragma, the number of cycles to reach main() is the same with both sizes of the array.

 

Array size    Cycles to reach main()
       200                      4411
    200000                      4411

 

The pragma is a good solution if there are a few large arrays that can be found and instrumented with the pragma.

 

Hopefully this article provided some understanding of what happens between the initial assembly code and main(). Although there is no one-size-fits-all solution, it definitely helps to understand and improve application startup time. Next time you find yourself with a program that appears stuck before reaching main(), this just might be the cause.

 

Jason Andrews

