MB Slow speed - Possible dynamic linker issues

I'm running a Kalman filter on the MicroBlaze.

I can run it either as a standalone application,
or as a cFS application that is loaded via the dynamic loader.

When it runs as a cFS application, the CPU usage of my Kalman filter task is 5x higher.

If I run it in standalone mode, my Kalman filter runs 5x faster.

I suspect this is due to the dynamic loader.
A coworker looked at cache hit statistics.
There did not appear to be a measurable difference between the two builds when looking at the performance counters.
https://docs.amd.com/r/en-US/ug984-vivado-microblaze-ref/Performance-Monitoring

Any other recommendations on Event Counter Events I should look at?

Any best next steps to try would be appreciated.

I had AI look at the RTEMS dynamic linker code for the MicroBlaze.

• Most likely this is not an RTEMS tasking issue, but a MicroBlaze libdl codegen/ABI issue.

The strongest reason is that RTEMS’s MicroBlaze runtime-linker backend is still very minimal. In cpukit/libdl/rtl-mdreloc-microblaze.c:42 the arch-specific
section handling is basically a stub, .sdata/.sbss handling is absent, trampolines are disabled at cpukit/libdl/rtl-mdreloc-microblaze.c:83, and only a subset
of relocations is implemented. By contrast, PowerPC has explicit small-data support in cpukit/libdl/rtl-mdreloc-powerpc.c:64 through
cpukit/libdl/rtl-mdreloc-powerpc.c:182. That points to MicroBlaze loaded modules missing ABI optimizations that static links can use.

The second likely reason is linker relaxation. RTEMS libdl loads relocatable ELF objects, not shared libraries: cpukit/libdl/rtl-elf.c:1597 rejects ET_DYN and
cpukit/libdl/rtl-elf.c:1602 rejects program headers. So a dynamically loaded module does not go through a final linker relaxation pass the way a statically
linked image does. On MicroBlaze that matters a lot, because the backend itself notes R_MICROBLAZE_32_NONE is “only used for relaxing” in
cpukit/libdl/rtl-mdreloc-microblaze.c:227. My inference is that loaded code keeps conservative multi-instruction address/call sequences that the static final
link can shrink, and on MicroBlaze that can easily become a large slowdown if the task is global-access-heavy or call-heavy.

I do not think instruction-cache coherency is the primary cause. dlopen() does synchronize the cache after load in cpukit/libdl/rtl.c:593 and
cpukit/libdl/rtl-obj.c:755.

A related warning sign is that the BSP startup still has the small-data anchor setup commented out in bsps/microblaze/shared/start/start.S:97. I would treat
that as supporting evidence that MicroBlaze small-data/GP handling is not fully wired for the dynamic-load path.

If you want, I can do the next step and make this concrete by comparing the static and loadable builds with microblaze-*-objdump, looking specifically for:

  • many imm + addik/lwi/swi sequences in the loaded build,
  • R_MICROBLAZE_32_NONE and related relax-only relocations,
  • .sdata/.sbss usage or forced -G0 style codegen.

Are you able to reduce the code to something that can exhibit the issues you see?

The RTL does not support shared libraries, so it is reasonable not to have any related types in the back end.

Any extra passes a back end needs have to be part of that back end. I do not know what they mean by linker relaxation and what it needs. Is it some form of call proximity optimization?

Does the MicroBlaze have small data? How do small data and small bss data differ?

Trampolines are only needed if the back end needs to bounce short local jumps out to the full address space. I do not think that is the problem, or nothing would work.

Does incremental or relocatable linking work on the MicroBlaze? Support can be hit or miss. If it does work, see whether the linker does any extra processing.

I think that to reduce the code, I first need to identify which part of it is causing the extra CPU usage.

I will research how to get CPU metrics for this, possibly by logging the context-switch PC / stack trace of that thread.

I’ll try porting the Whetstone and Dhrystone tests to the dynamic library tests as well. That might make a nice comparison between dynamic and static runs.

Good idea. Let’s avoid premature optimization.

Static link, normal test run

  • dhrystone: 2,110,497.8 Dhrystones/sec, 0.5 microseconds/run, 1201.19 DMIPS
  • whetstone: 131.4 MIPS, with Loops: 10000, Iterations: 1, Duration: 7.612319 sec

Working on building out linker tests next for these two.

dl14.txt (27.6 KB)
dl15.txt (24.9 KB)

Struggling to get these tests to load.

The output attached is for the RAP format. Are you using the RAP format?

I think the format needs some love. It is missing TLS support. I have not looked at that format in many years.

Benchmark                      Static build   Dynamic build (libdl)
Dhrystone, Dhrystones/sec      2,110,497.8    166,580.3
Dhrystone, microseconds/run    0.5            6.0
Dhrystone, DMIPS               1201.19        94.81
Whetstone, MIPS                131.4          63.6
Whetstone, Loops               10000          10000
Whetstone, Iterations          1              1
Whetstone, Duration (sec)      7.612319       15.719940

Log references:

  • Static Dhrystone: tester/results/logs/dhrystone.log:230, tester/results/logs/dhrystone.log:231, tester/results/logs/dhrystone.log:232
  • Dynamic Dhrystone: tester/results/logs/dl14.log:67, tester/results/logs/dl14.log:68, tester/results/logs/dl14.log:69
  • Static Whetstone: tester/results/logs/whetstone.log:9, tester/results/logs/whetstone.log:10
  • Dynamic Whetstone: tester/results/logs/dl15.log:11, tester/results/logs/dl15.log:12

I’m a little out of my depth here.
– AI guess –
• Yes. The main cause looks like a QEMU/runtime-linker placement issue, not a compiler-flag mismatch.

What I checked:

  • The compile flags are basically the same.
  • The disassembly delta is too small to explain the slowdown by itself.
    • Dhrystone hot-function total: static 1052 instructions vs dynamic 1084
    • Whetstone hot-function total: static 724 vs dynamic 805
  • MicroBlaze libdl is not using trampolines here. In src/rtems/cpukit/libdl/rtl-mdreloc-microblaze.c, rtems_rtl_elf_relocate_tramp_max_size() returns 0.

The stronger evidence came from the loaded object memory map:

  • dl14 in tester/results/logs/dl14.log:17
    • text base : 0x800a5f74
    • const base : 0x800a594c
    • bss base : 0x800a3128
  • dl15 in tester/results/logs/dl15.log:17
    • text base : 0x800feefc
    • const base : 0x800fede8
    • bss base : 0x800fed90

In both cases, the loaded text and writable bss land on the same 4 KB guest page:

  • dl14: both touch page 0x800a5000
  • dl15: both touch page 0x800fe000

That matters because these benchmarks write globals heavily:

  • Dhrystone updates Arr_1_Glob, Arr_2_Glob, Int_Glob, Bool_Glob, etc.
  • Whetstone updates T, T1, T2, E1, J, K, L

On QEMU/TCG, writes to a page that also contains translated code can invalidate translation blocks for that page. That fits the results very well:

  • Dhrystone is very global-write and call heavy, so it gets hit hardest
  • Whetstone also writes globals, but less aggressively, so the slowdown is smaller

So my conclusion is:

  • not primarily an -O/flag mismatch
  • not a MicroBlaze trampoline issue
  • most likely a QEMU artifact caused by libdl placing writable benchmark data and executable benchmark text on the same guest page

If you want, the next fix to try is page-separating loaded text from data/bss in the loader allocation path and rerunning the same comparison.

• Yes. The page-sharing hypothesis was right.

I changed the heap-backed libdl allocator in src/rtems/cpukit/libdl/rtl-alloc-heap.c so RTEMS_RTL_ALLOC_READ_EXEC allocations are page-aligned, while keeping
free/resize working through allocator-local metadata. ./waf build -j30 still passes.

The confirmation is in the loader layout dump from the direct QEMU runs:

  • Before, dl14 had text base 0x800a5f74 and bss base 0x800a3128, both touching page 0x800a5000.
  • After, dl14 has text base 0x800a7000 and bss base 0x800a32e8, so text is on its own page. See tester/results/logs/dl14.log.
  • Before, dl15 had text base 0x800feefc and bss base 0x800fed90, both on page 0x800fe000.
  • After, dl15 has text base 0x80100000 and bss base 0x800fef90, again separated. See tester/results/logs/dl15.log.

The timings moved the way you’d expect if QEMU was invalidating translated code because writable benchmark globals shared a page with loaded text:

  • dl14 improved from about 166,580.3 Dhrystones/sec (6.0 us/run) to about 217,065.8 Dhrystones/sec (4.6 us/run). The probe run logged 216,553.1 / 4.6 us/run.
  • dl15 improved from 63.6 MIPS (15.719940 sec) to about 120.8 MIPS (8.278298 sec). The probe run logged 114.8 MIPS with the extra RTL dump overhead.

So the main time delta was not a compile-flag mismatch. It was the libdl heap allocator placing loaded executable text too close to writable sections, which is
especially bad under MicroBlaze QEMU/TCG for these benchmarks.

Current source diff is only the allocator change. I reverted the temporary rtl list probe from the test sources after confirming the layout.

• Here’s the updated comparison table using the fresh clean dynamic runs after the allocator fix:

Benchmark                      Static build   Dynamic build (libdl)   Delta
Dhrystone, Dhrystones/sec      2,110,497.8    221,753.8               -89.5%
Dhrystone, microseconds/run    0.5            4.5                     9.0x slower
Dhrystone, DMIPS               1201.19        126.21                  -89.5%
Whetstone, MIPS                131.4          123.6                   -5.9%
Whetstone, Loops               10000          10000                   same
Whetstone, Iterations          1              1                       same
Whetstone, Duration (sec)      7.612319       8.088988                +6.3%

So after separating loaded text onto its own page, Whetstone is now much closer to static, while Dhrystone is still noticeably slower but far better than before.

Does this mean figures on real hardware are needed to prove these changes are valid?

As stated in Discord, alignment is a factor of the ELF section.

Hardware Whetstone results on KCU105:

Case                                       Static                     Dynamic (dl15)
microblaze-rtl-debug with page-align fix   7.3 MIPS, 137.115663 sec   7.1 MIPS, 141.366792 sec
microblaze-debug without page-align fix    7.3 MIPS, 137.115660 sec   7.2 MIPS, 138.373369 sec

Does not appear to be an issue here. I’m going to switch to adding the MicroBlaze performance counters to my full software stack.

It’s possible I’m seeing more cache misses? Will update here when I figure out what’s slowing down my performance.

I am confused.

Are both tests with libdl? The “without page-align” result does not say -rtl-?

I ran 4 tests:
dynamic vs static, with / without page alignment, on the KCU105 hardware.
No noticeable difference between them. Maybe static is slightly faster.

The static version really wouldn’t change with page / non-page alignment, so I guess that’s 3 distinct tests.

I did see a difference on QEMU, but I’m not really sure how it does its emulation. Still running into lab issues; if I find anything interesting with my full app, I will report back.

Thanks for the update and it is good to know libdl performs as intended.

Static linking may be slightly faster if the linker has an optimization pass around some call sites, but that is normally specific to an instruction set.