Discussion:
Optimization theory
(too old to reply)
J. Gareth Moreton
2018-06-16 21:21:44 UTC
Permalink
Raw Message
Note that I speak mostly from an x86_64 perspective, since this is where I
have almost universal exposure.

So I've been pondering a few things after researching Florian's prototype
patch for optimisations done prior to register allocation, when the
pre-compiled assembly language utilises imaginary (virtual) registers
pretty much everywhere other than where distinct registers are required
(e.g. function parameters).  My question is... how much can be moved to
the pre-allocation stage?  I believe it can make the outcome far more
efficient, especially with my initial work on the deep optimizer as a few
people have stated already.  Unless I'm mistaken, from what I've observed,
a lot of the peephole optimizations don't actually care what registers
they're playing with.

There are also situations in optimization where one might detect a repeat
calculation that the programmer cannot eliminate themselves, but which the
best optimization is to store the result and re-use it later... the most
obvious situation where this arises is with div and mod with the same
numerator and denominator.  Currently, the compiler doesn't know any
better and has to calculate the division twice, a relatively expensive
operation, even though DIV returns both the quotient and the remainder in
RAX and RDX respectively.  I believe storing the mod result in a virtual
register for later use will be far easier to manage when the registers have
not yet been allocated, especially if it's determined that a new register
has to be preserved in the function prologue, or there are no free
registers at all and it has to be put on the stack, or if it's at all
possible to use RDX itself as that temporary storage (the most ideal
outcome in both speed and size), something that would be near impossible if
RDX has been allocated for something in between.

Speaking of the deep optimizer, I do have a patch ready to submit once the
backlog of other patches are analysed (it relies on a few of them), and
I'll be working on porting elements of it to the pre-allocation stage. 
Some bits are a little difficult because of how imaginary registers are
tracked... if you delete an instruction, you have to be careful you don't
leave a dangling pointer in the "live range".

This is just thinking out loud and pondering.
Gareth aka. Kit
Florian Klämpfl
2018-06-17 08:56:23 UTC
Permalink
Raw Message
Note that I speak mostly from an x86_64 perspective, since this is where I have almost universal exposure.
So I've been pondering a few things after researching Florian's prototype patch for optimisations done prior to register
allocation, when the pre-compiled assembly language utilises imaginary (virtual) registers pretty much everywhere other
than where distinct registers are required (e.g. function parameters).  My question is... how much can be moved to the
pre-allocation stage?
A lot, basically everything which reduced register pressure. The only problem is, at this stage, the code contains a lot
of moves (compile with -sr to see how it looks like). So the optimizer must be able to handle this. It might be even
possible to build a generic optimizer pass at this stage. Example:

A typical sequence FPC often generates is:

mov %src1,%dest1
add %dest1,%src2,%dest2

If src1 is no released after mov but dest1 is release, src1 and dest1 still cannot be coalesced as they interfere, so an
extra register is allocated. The move will be remove by the peephole optimizer, but register was allocated and increase
register pressure. Such optimizations could be done generic (for all CPUs): if the destination of a mov is only read
afterwards (this information is already generically available), the mov can be removed and in this case dest1 can be
replaced by src1.
_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman
Sven Barth via fpc-devel
2018-06-17 11:06:33 UTC
Permalink
Raw Message
Post by J. Gareth Moreton
Note that I speak mostly from an x86_64 perspective, since this is where
I have almost universal exposure.
Post by J. Gareth Moreton
So I've been pondering a few things after researching Florian's
prototype patch for optimisations done prior to register
Post by J. Gareth Moreton
allocation, when the pre-compiled assembly language utilises imaginary
(virtual) registers pretty much everywhere other
Post by J. Gareth Moreton
than where distinct registers are required (e.g. function parameters).
My question is... how much can be moved to the
Post by J. Gareth Moreton
pre-allocation stage?
A lot, basically everything which reduced register pressure. The only
problem is, at this stage, the code contains a lot
of moves (compile with -sr to see how it looks like). So the optimizer
must be able to handle this. It might be even
mov %src1,%dest1
add %dest1,%src2,%dest2
If src1 is no released after mov but dest1 is release, src1 and dest1
still cannot be coalesced as they interfere, so an
extra register is allocated. The move will be remove by the peephole
optimizer, but register was allocated and increase
register pressure. Such optimizations could be done generic (for all
CPUs): if the destination of a mov is only read
afterwards (this information is already generically available), the mov
can be removed and in this case dest1 can be
replaced by src1.
Though only if the type of src1 is the same as that of dest1, right? (e.g.
int vs. address register)

Regards,
Sven
Christo Crause
2018-06-17 09:08:53 UTC
Permalink
Raw Message
The penalty for separate div & mod operations for your scenario is severe
on targets that require software implementation (e.g. 8 bit AVR). There is
a DivMod implementation in unit math, would be useful if this is available
in system.
Post by J. Gareth Moreton
There are also situations in optimization where one might detect a repeat
calculation that the programmer cannot eliminate themselves, but which the
best optimization is to store the result and re-use it later... the most
obvious situation where this arises is with div and mod with the same
numerator and denominator. Currently, the compiler doesn't know any better
and has to calculate the division twice, a relatively expensive operation,
even though DIV returns both the quotient and the remainder in RAX and RDX
respectively. I believe storing the mod result in a virtual register for
later use will be far easier to manage when the registers have not yet been
allocated, especially if it's determined that a new register has to be
preserved in the function prologue, or there are no free registers at all
and it has to be put on the stack, or if it's at all possible to use RDX
itself as that temporary storage (the most ideal outcome in both speed and
size), something that would be near impossible if RDX has been allocated
for something in between.
Gareth aka. Kit
_______________________________________________
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
J. Gareth Moreton
2018-06-17 10:16:13 UTC
Permalink
Raw Message
Indeed, so if it can be done only once, that's a major saving.  Even if
div and mod are separate assembler instructions on the target platform,
it's possible to make a saving if multiplication and subtraction aren't too
expensive, as then the modulus can be calculated via r = n - qd.

The issue with DivMod is that the saving isn't that great because of the
penalty of calling a subroutine, which amounts to around 50 cycles on
x86-64, about equal to DIV.  This could be alleviated if the function is
able to be inlined or, when dealing with constants, can be identified as a
'pure' function (I wrote a little request article about this:
http://wiki.freepascal.org/Feature_Ideas#.22pure.3B.22_function_directive
[1] )

Gareth aka. Kit

On Sun 17/06/18 10:08 , "Christo Crause" ***@gmail.com sent:
The penalty for separate div border-left:1px #ccc solid;padding-left:1ex">

There are also situations in optimization where one might detect a repeat
calculation that the programmer cannot eliminate themselves, but which the
best optimization is to store the result and re-use it later... the most
obvious situation where this arises is with div and mod with the same
numerator and denominator.  Currently, the compiler doesn't know any
better and has to calculate the division twice, a relatively expensive
operation, even though DIV returns both the quotient and the remainder in
RAX and RDX respectively.  I believe storing the mod result in a virtual
register for later use will be far easier to manage when the registers have
not yet been allocated, especially if it's determined that a new register
has to be preserved in the function prologue, or there are no free
registers at all and it has to be put on the stack, or if it's at all
possible to use RDX itself as that temporary storage (the most ideal
outcome in both speed and size), something that would be near impossible if
RDX has been allocated for something in between.
Gareth aka. Kit
_______________________________________________
fpc-devel maillist  -  fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel [2]


Links:
------
[1]
http://wiki.freepascal.org/Feature_Ideas#.22pure.3B.22_function_directive
[2] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
J. Gareth Moreton
2018-06-17 18:36:57 UTC
Permalink
Raw Message
Aah, I was in error. I used agner as my source, but even that says 5
cycles for a near call and 17-33 for a far call, so I'm not sure where I
got 50 from. My apologies. I have probably been avoiding function calls
more than necessary!

Gareth aka. Kit

On Sun 17/06/18 18:06 , David Pethes ***@satd.sk sent:
Hi,
Post by J. Gareth Moreton
The issue with DivMod is that the saving isn't that great because of the
penalty of calling a subroutine, which amounts to around 50 cycles on
x86-64, about equal to DIV.
That number sounds awfully high, what's your source?

David
J. Gareth Moreton
2018-06-17 19:54:45 UTC
Permalink
Raw Message
That's where the first stage of my deep optimizer might be able to help,
since it explicitly starts at a MOV command (say, mov %reg_source,
%reg_dest) and scans forward to see if %reg_dest can be replaced with
%reg_source - it stops when it hits a jump, call, non-skippable label, when
the registers change value or if %reg_dest cannot be replaced.  If all
references to %reg_dest were replaced prior to its value being completely
overwritten, then it can then delete the original MOV.

While it will only be appropriate to run at -O2 and -O3, the restrictions
surrounding the register replacement, and specially where it stops
searching, help ensure that it runs relatively quickly.
By the way, now that the debug strip patch has now being approved (thanks
Florian!), I can now safely submit my prototype for the Deep Optimizer!
https://bugs.freepascal.org/view.php?id=33871 - as specifid in the notes,
compile the compiler with the DEBUG_AOPTCPU directive and compile projects
with the -a flag if you wish to see where the Deep Optimizer has made
savings in the intermediate assemblies.

Note that this version works during post-peephole optimisations.  I'm
looking at ways to move it to the preallocation phase in a way that's
extensible, especially in regards to tracking virtual registers.  It's a
little tricker because it's very easy to leave a dangling pointer.

Gareth aka. Kit
Post by J. Gareth Moreton
Note that I speak mostly from an x86_64 perspective, since this is where
I have almost universal exposure.
Post by J. Gareth Moreton
So I've been pondering a few things after researching Florian's
prototype patch for optimisations done prior to register
Post by J. Gareth Moreton
allocation, when the pre-compiled assembly language utilises imaginary
(virtual) registers pretty much everywhere other
Post by J. Gareth Moreton
than where distinct registers are required (e.g. function parameters). 
My question is... how much can be moved to the
Post by J. Gareth Moreton
pre-allocation stage?
A lot, basically everything which reduced register pressure. The only
problem is, at this stage, the code contains a lot
of moves (compile with -sr to see how it looks like). So the optimizer
must be able to handle this. It might be even
possible to build a generic optimizer pass at this stage. Example:

A typical sequence FPC often generates is:

mov %src1,%dest1
add %dest1,%src2,%dest2

If src1 is no released after mov but dest1 is release, src1 and dest1
still cannot be coalesced as they interfere, so an
extra register is allocated. The move will be remove by the peephole
optimizer, but register was allocated and increase
register pressure. Such optimizations could be done generic (for all
CPUs): if the destination of a mov is only read
afterwards (this information is already generically available), the mov
can be removed and in this case dest1 can be
replaced by src1.
_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org [1]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[2]">http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Links:
------
[1] mailto:fpc-***@lists.freepascal.org
[2] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
J. Gareth Moreton
2018-06-17 10:36:38 UTC
Permalink
Raw Message
That's where the first stage of my deep optimizer might be able to help,
since it explicitly starts at a MOV command (say, mov %reg_source,
%reg_dest) and scans forward to see if %reg_dest can be replaced with
%reg_source - it stops when it hits a jump, call, non-skippable label, when
the registers change value or if %reg_dest cannot be replaced.  If all
references to %reg_dest were replaced prior to its value being completely
overwritten, then it can then delete the original MOV.

While it will only be appropriate to run at -O2 and -O3, the restrictions
surrounding the register replacement, and specially where it stops
searching, help ensure that it runs relatively quickly.
Find attached a test patch of the deep optimizer - it was intertwined with
a number of other peephole changes, so I hope these were removed
successfully and won't cause a compiler error.  (This is what I intend to
submit to the bug reporter as a patch eventually)  Running the compiler
with DEBUG_AOPTCPU enabled will show where the deep optimizer has made
savings in the .s files.

Gareth aka. Kit
Post by J. Gareth Moreton
Note that I speak mostly from an x86_64 perspective, since this is where
I have almost universal exposure.
Post by J. Gareth Moreton
So I've been pondering a few things after researching Florian's
prototype patch for optimisations done prior to register
Post by J. Gareth Moreton
allocation, when the pre-compiled assembly language utilises imaginary
(virtual) registers pretty much everywhere other
Post by J. Gareth Moreton
than where distinct registers are required (e.g. function parameters). 
My question is... how much can be moved to the
Post by J. Gareth Moreton
pre-allocation stage?
A lot, basically everything which reduced register pressure. The only
problem is, at this stage, the code contains a lot
of moves (compile with -sr to see how it looks like). So the optimizer
must be able to handle this. It might be even
possible to build a generic optimizer pass at this stage. Example:

A typical sequence FPC often generates is:

mov %src1,%dest1
add %dest1,%src2,%dest2

If src1 is no released after mov but dest1 is release, src1 and dest1
still cannot be coalesced as they interfere, so an
extra register is allocated. The move will be remove by the peephole
optimizer, but register was allocated and increase
register pressure. Such optimizations could be done generic (for all
CPUs): if the destination of a mov is only read
afterwards (this information is already generically available), the mov
can be removed and in this case dest1 can be
replaced by src1.
_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org [1]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[2]">http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Links:
------
[1] mailto:fpc-***@lists.freepascal.org
[2] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
J. Gareth Moreton
2018-06-18 13:28:56 UTC
Permalink
Raw Message
Ignore my last e-mail - I sent it yesterday and thought the attachment was
too big (it's also slightly faulty).  See here instead:
https://bugs.freepascal.org/view.php?id=33871

Gareth
Tomas Hajny
2018-06-18 15:02:14 UTC
Permalink
Raw Message
Post by J. Gareth Moreton
Ignore my last e-mail - I sent it yesterday and thought the attachment
https://bugs.freepascal.org/view.php?id=33871
It was indeed held for moderation due to its size. Since it wasn't much
bigger compared to the limit, I decided to let it through.

Tomas (one of FPC mailing list moderators)


_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinf
J. Gareth Moreton
2018-06-18 14:11:12 UTC
Permalink
Raw Message
Aah, okay. Thanks Tomas.
It turns out the patch had a couple of faults anyway - thanks for letting
it through though.  I can never guarantee the size of .patch files.
Gareth
Post by J. Gareth Moreton
Ignore my last e-mail - I sent it yesterday and thought the attachment
https://bugs.freepascal.org/view.php?id=33871 [1]
It was indeed held for moderation due to its size. Since it wasn't much
bigger compared to the limit, I decided to let it through.

Tomas (one of FPC mailing list moderators)

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org [2]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[3]">http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Links:

Loading...