
Understanding ARM Bare Metal Benchmark Application Startup Time


One of the benefits of simulation with virtual prototypes is the added control and visibility of memory. It’s easy to load data into memory models, dump data from memory to a file, and change memory values without running any simulation. After I gave up debugging hardware in a lab and decided I would rather spend my time simulating, some of my first lessons were about the assumptions software makes about memory at power on. When a computer powers on, software must assume the contents of memories such as SRAM and DRAM are unknown. I recall being somewhat amazed to find out that initialization software would commonly clear large memory ranges before doing anything useful. I also recall learning how startup software would figure out how much memory was present in a system: write the first byte or word to some value, read it back, and if the written value came back there must be memory present. The software would keep incrementing the address until the read value no longer matched the written value, and conclude that this address was the size of the memory.
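That sizing trick is easy to sketch in C. Here is a minimal simulation of it; the simulated bus, the probe pattern, and the 4 KB memory size are my own assumptions for illustration, not code from any CPAK:

```c
#include <stdint.h>
#include <stddef.h>

#define REAL_MEM_SIZE 4096   /* the "installed" memory in this toy model */

static uint8_t sim_mem[REAL_MEM_SIZE];

/* Writes beyond the installed memory are silently dropped, and reads beyond
   it return 0xFF, roughly how an open bus can behave. */
static void bus_write(size_t addr, uint8_t value) {
    if (addr < REAL_MEM_SIZE)
        sim_mem[addr] = value;
}

static uint8_t bus_read(size_t addr) {
    return (addr < REAL_MEM_SIZE) ? sim_mem[addr] : 0xFF;
}

/* The probe described above: write a pattern, read it back, and keep
   incrementing the address until the value no longer matches. */
size_t probe_memory_size(size_t max_addr) {
    for (size_t addr = 0; addr < max_addr; addr++) {
        bus_write(addr, 0xA5);
        if (bus_read(addr) != 0xA5)
            return addr;   /* first address that fails marks the memory size */
    }
    return max_addr;
}
```

Real firmware typically probes in larger strides (words or megabytes) rather than byte by byte, but the read-back-and-compare idea is the same.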

 

Recently, I was working on a bare metal program and simulating execution of an ARM Cortex-A15 system. Carbon Performance Analysis Kits (CPAKs) come with example systems and initialization software to help shorten the ramp-up time for users. Generally, people don’t pay much attention to the initialization code unless it’s either broken or they need to configure specific hardware features related to caches, MMU, or VFP and Neon hardware.

 

Today, I’ll provide some insight into some of the things the initialization code does, specifically what happens between the end of the assembly code which initializes the hardware and the start of a C program.

 

The program I was running had the following code at the top of main.c:

 

[Screenshot: the top of main.c, showing the memspace array and its size #define]
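For readers without the screenshot, the top of main.c looks roughly like this (the macro name MEMSPACE_SIZE is my guess; the array name memspace and the 200000-byte size come from the article):

```c
#include <stddef.h>

#define MEMSPACE_SIZE 200000           /* reduced to 200 for the first pass */

static char memspace[MEMSPACE_SIZE];   /* no initializer: ends up in the ZI/bss region */

size_t memspace_size(void) { return sizeof memspace; }
```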

There is an array named memspace with a #define to set the size of the array. When running new software, it’s a good idea to learn as much as possible as quickly as possible by getting through the program once, end to end. One way to do this is to cut down the number of iterations, the data size, or whatever else is needed to complete the program and gain confidence it’s running correctly. This avoids wasting time thinking the program is running correctly when it’s not. I normally put a few breakpoints in the software and feel my way through the program to see where it goes and how it runs.

 

I like to put a breakpoint at the end of the initial assembly code to make sure nothing has gone wrong with the basic setup. Next, I like to put a breakpoint at main() to make sure the program gets started, and then stop at interesting looking C functions to track progress. For this particular program I shrunk the size of the memspace array to 200 bytes for the first pass through the test.

 

After I understood the basics of the program, I put the array size back to the original value of 200000 bytes. When I did this I noticed a strange phenomenon. The simulation took much longer to get to main() when the array was larger, about 8 times longer as shown in the table below.

 

Array size    Cycles to reach main()
       200                      4860
    200000                     39174

 

One of the purposes of this article is to shed some light on what happens between the end of the startup assembly code and main(). Obviously, there is something related to the size of the memory array that influences this section of code.

 

Readers who have the A15 Bare Metal CPAK can follow along with code similar to what I used for the benchmark by looking at Applications/Dhrystone/Init.s.

 

There are two parts to jumping from the initial assembly code to the main() function in C. First, save the address of __main in r12 as shown below.

 

[Screenshot: Init.s loading the address of __main into r12]

 

Next, jump to __main at the end of the assembly code by using the BX r12 instruction. After the BX instruction the program goes into a section of code provided by the compiler (for which there is no source to debug) but if all goes well it comes out again at main().

 

The code starting from __main performs the following tasks:

  • Copies the execution regions from their load addresses to their execution addresses. This handles the case where the code is not loaded at the location it will run from, or where the code is compressed and needs to be decompressed.
  • Zeroes the memory that needs to be cleared, based on the C standard rule that statically allocated objects without explicit initializers are initialized to zero.
  • Branches to __rt_entry.
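The zero-initialization step exists because of that C language rule, which is easy to demonstrate; the names below are my own, not from the CPAK:

```c
/* Statically allocated objects with no explicit initializer must read as
   zero, which is why __main zeroes the ZI/bss region before main() runs. */
static int zeroed[1000];      /* no initializer: placed in bss, must be zero */
static int initialized = 42;  /* explicit initializer: placed in the data section */

int bss_is_zeroed(void) {
    for (int i = 0; i < 1000; i++)
        if (zeroed[i] != 0)
            return 0;
    return 1;
}

int data_value(void) { return initialized; }
```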


Once the memory is ready, the code starting from __rt_entry sets up the runtime environment by performing the following tasks:

  • Sets up the stack and heap
  • Initializes library functions
  • Calls main()
  • Calls exit() after main() completes

 

If anything goes wrong between the assembly code and main(), the most common cause is the stack and heap setup. I always recommend taking a look at this first if your program doesn’t make it to main().

 

You may have guessed by now that the simulation time difference I described is caused by the time required to zero the larger array. As I mentioned at the start of the article, writing zero to large blocks of memory that are already zero (or can easily be made zero using a simulator command) is a waste of time. Carbon memory models already initialize memory contents to zero by default. Some people prefer a more pessimistic approach and initialize memory to some non-zero value to make sure the code will work on the real hardware, but for users more interested in performance analysis it seems helpful to avoid wasted simulation cycles and get on to the interesting work.

 

The Linux size command is a good way to confirm that the larger array affects the bss section of the program. The zero-initialized (ZI) data and bss refer to the same segment. With the 200-byte array:

 

-bash-3.2$ size main.axf
   text        data         bss         dec         hex     filename
  71276          16      721432      792724       c1894     main.axf

 

With the 200000-byte array:

 

-bash-3.2$ size main.axf
   text        data         bss         dec         hex     filename
  71276          16      921232      992524       f250c     main.axf
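The numbers from size are internally consistent: dec is just text + data + bss, and the bss growth between the two builds matches the array growth exactly. A quick check, using the values from the output above:

```c
/* dec = text + data + bss for both builds, and the bss delta equals the
   199800 extra bytes of memspace (200000 - 200). */
long dec_total(long text, long data, long bss) {
    return text + data + bss;
}

long bss_delta(void) {
    return 921232L - 721432L;  /* large-array bss minus small-array bss */
}
```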


Alternatives to Save Simulation Time

 

There are multiple ways to avoid executing instructions that write zero to memory that is already zero. It turns out to be a popular question; search for something like “avoid bss static object zero”.

 

One way is to use linker scripts or compiler directives to put the array into a different section of memory that is not automatically initialized to 0.

 

For the program I was working on, I decided to investigate a couple of alternatives.

 

One solution is to just skip __main altogether and go directly to __rt_entry since __main doesn’t do anything useful for this program and the change is simple.

 

To skip __main, just replace the load of __main into r12 with a load of __rt_entry into r12. Now when the program runs, __main will be skipped altogether.

 

[Screenshot: Init.s loading the address of __rt_entry into r12 instead of __main]

 

Here are the new results with __main skipped.

 

Array size    Cycles to reach main()
       200                      4355
    200000                      4362

 

As expected, the number of cycles to reach main() is about the same for both array sizes, and much lower than with the zeroing of the large array. Although the difference may seem small for the benchmark shown here, the problem gets much bigger when a larger and more complex software program is run. I checked a larger program and found it was spending more than 10 million instructions zeroing memory.

 

I wouldn't recommend just blindly applying this technique, especially on larger software programs, as debugging improperly initialized global variables is not fun.

 

Another way to avoid initializing large global variables is to use a compiler pragma. The ARM compiler, armcc, has a section pragma that moves the large array into a section which is not automatically initialized to zero. To use it, put the pragma around the array declaration as shown below.

 

[Screenshot: the #pragma arm section directives around the memspace declaration]
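A hedged reconstruction of what that pragma usage looks like with armcc (the section name non_init and the macro name are my assumptions; the screenshot shows the actual code):

```c
#define MEMSPACE_SIZE 200000

/* armcc: place the following zero-initialized data in a named section so the
   C library startup code does not zero it. Other compilers simply ignore the
   unknown pragma, so this still compiles elsewhere. */
#pragma arm section zidata = "non_init"
static char memspace[MEMSPACE_SIZE];
#pragma arm section zidata   /* restore the default ZI section */

unsigned long memspace_bytes(void) { return (unsigned long)sizeof memspace; }
```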

 

After putting in the pragma, one more step is needed: the linker scatter file must be made aware of the new section. More information is available in the ARM documentation under “How to prevent uninitialized data from being initialized to zero”.

 

In my linker scatter file I added one more section:

 

[Screenshot: the added scatter file section]
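For readers without the screenshot, a hypothetical scatter-file fragment along these lines (the region names, addresses, and the section name non_init are placeholders, not the CPAK’s actual values) would look like:

```
LOAD_REGION 0x80000000
{
    APP 0x80000000
    {
        * (+RO, +RW, +ZI)        ; normal code and data
    }
    NON_INIT +0 UNINIT           ; UNINIT: do not zero this region at startup
    {
        * (non_init)             ; the section named in the pragma
    }
}
```

The key part is the UNINIT attribute, which tells the linker that the execution region should not be zero-initialized.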

 

Executing the program with the pragma is a much safer solution, especially when the software is going to write the memory anyway and does not assume initial zero values. With the pragma, the number of cycles to reach main() is the same with both sizes of the array.

 

Array size    Cycles to reach main()
       200                      4411
    200000                      4411

 

The pragma is a good solution if there are a few large arrays that can be found and instrumented with the pragma.

 

Hopefully this article provided some understanding of what happens between the initial assembly code and main(). Although there is no one-size-fits-all solution, it definitely helps to understand and improve application startup time. Next time you find yourself with a program that appears stuck before reaching main(), this just might be the cause.

 

Jason Andrews

