[fpc-devel] Optimization theory

J. Gareth Moreton

2018-12-04 22:11:19 UTC

Not sure if this was intended for the group mailing list or not, but was
only e-mailed to me.

One quesion though... in your example, what level of your optimisation are
you compiling under? I thought the peephole optimizer already changes
things like "lea (%ecx,%eax),%eax" to "add %ecx,%eax".Â Nevertheless, my
Deep Optimizer is able to spot the unnecessary assignments to %eax.Â I did
submit a patch, but because it was a prototype and tied into the post
peephole optimizer, it hasn't received that much attention.Â It's
something I'm still working on though and something I'd like to submit as
an -O4 option.
Gareth aka Kit

----- Original Message -----
From: David Pethes ***@satd.sk
To: ***@moreton-family.com
Sent: Tue 04/12/18 19:45
Subject: Fwd: Re: [fpc-devel] Optimization theory
Hi Gareth,
nice to see that you are still working on compiler improvements. If I
may, can I ask you to look at the assembly output of this routine (it's
a link to compiler explorer):

https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:0.8957951999999999,j:1,lang:pascal,source:'unit+output%3B%0A%0Ainterface%0A%0Aprocedure+core_4x4(block:+psmallint)%3B%0A%0Aimplementation%0A%0Atype%0A++matrix_t+%3D+array%5B0..3,+0..3%5D+of+smallint%3B+%0A%0Aprocedure+core_4x4(block:+psmallint)%3B%0Avar%0A++m:+matrix_t%3B%0A++e,+f,+g,+h:+array%5B0..3%5D+of+smallint%3B%0A%0Avar%0A++i:+longword%3B%0Abegin%0A++move(block%5E,+m,+16*2)%3B%0A%0A++for+i+:%3D+0+to+3+do+begin%0A++++++e%5Bi%5D+:%3D+m%5Bi%5D%5B0%5D+%2B+m%5Bi%5D%5B3%5D%3B%0A++++++f%5Bi%5D+:%3D+m%5Bi%5D%5B0%5D+-+m%5Bi%5D%5B3%5D%3B%0A++++++g%5Bi%5D+:%3D+m%5Bi%5D%5B1%5D+%2B+m%5Bi%5D%5B2%5D%3B%0A++++++h%5Bi%5D+:%3D+m%5Bi%5D%5B1%5D+-+m%5Bi%5D%5B2%5D%3B%0A++end%3B%0A%0Aend%3B%0A+++++++++%0Aend.%0A'),l:'5',n:'0',o:'Pascal+source+%231',t:'0')),k:51.379872662717624,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:fpc304,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',trim:'1'),lang:pascal,libs:!(),options:'-O4',source:1),l:'5',n:'0',o:'x86-64+fpc+3.0.4+(Editor+%231,+Compiler+%231)+Pascal',t:'0')),header:(),k:48.6201273372824,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',m:100,n:'0',o:'',t:'0')),version:4

As you can see, there are a lot of duplicated index/load calculations,
yet there are only a couple of registers used, so the results could be
reused. I wonder if this is somehow covered in your optimization passes
as well, or if you have some idea where would I start if I wanted to
optimize the code generation for this?
Thanks again for your efforts,

David

Hah, thanks.Â In a strange way, I find it quite fun.Â The culmination

my efforts has been the first stage of a "Deep Optimizer", which
searches around MOV commands to see if it can minimise pipeline stalls
and sometimes can remove MOV commands entirely if it determines that the
two registers are identical in value at that point.
As an example where inlining helped (actually, just copying and pasting
the function's contents virtually verbatim because everything was in
assembly language) is the new Int and Frac functions on x86-64, where I
inserted Int's contents directly into Frac.Â While Int had a branch and
a few safety checks to prevent an integer overflow, it was much faster
when inlined compared to fiddling around with register moves and doing a
function call, especially as the rest of Frac was very fast already, and
it also allowed the entirety of Frac to have no stack frame.
It's probably a bit insulting to Free Pascal, but my initial drive was
to reduce the size of the compiled binaries, without sacrificing

speed.Â

Of course, improving speed is a major focus as well!
Gareth
Hi, one less thing to worry about then :) Unless in a quite tight loop,
it shouldn't make much difference - or any at all, due to OoO
execution.
By the way, this Intel code analyzer could be interesting to you - I

https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
[2]

Good luck with you optimizations!
David

Aah, I was in error. I used agner as my
source, but even that says 5 cycles for a
near call and 17-33 for a far call, so I'm
not sure where I got 50 from. My
apologies. I have probably been avoiding
function calls more than necessary!
Gareth aka. Kit

Links:
------
[1] mailto:***@satd.sk
[2]
https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/