Discussion:
Kit's ambitions!
J. Gareth Moreton
2018-06-07 22:46:00 UTC
So a progress update.

I've tied in part of my deep optimiser
into the peephole optimiser, specifically
PostPeepholeOptMov, and it's had some
unexpected benefits. One of the things it
does is start with a MOV command that
copies a register's contents into another,
then looks at subsequent reference
addresses to see if it can swap out one
register for another, to reduce the chance
of a pipeline stall. There are cases where
it's noticed that all such registers have
been switched in a certain block and hence
safely removes the original MOV command.

What this means is that as well as
reducing the chances of a pipeline stall,
it's removing unnecessary assignments.

My main test case has been compiling the
compiler, since it's sufficiently complex
and easy to crash if incorrect machine
code is produced, and it also gives plenty
of examples of optimisation. As a very
brief example, in
compiler/x86_64/symcpu.pas in
TCPUProcDef.ppuload_platform, the first
four lines are:

movq %rcx,%rax
movq %rdx,%rsi
movq %rax,%rbx
movq %rbx,%rcx

The deep optimiser changes this to:

movq %rcx,%rax
movq %rdx,%rsi
movq %rcx,%rbx

For the third MOV, it determines that it can
read from %rcx instead of %rax, which removes
the dependency on the first MOV and so reduces
the chance of a pipeline stall. It then knows
that %rbx and %rcx contain the same value, so
it can remove the 4th MOV completely. Given
that modern processors usually have at least
3 ALUs and the interdependencies have been
removed, this will likely give a speed
increase of one cycle over these few
commands.
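
The rewrite described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in PostPeepholeOptMov: the function name `propagate_copies` and the (source, destination) tuple format are assumptions made for the example, and it models register-to-register MOVs only, ignoring every other instruction that could read or write a register.

```python
def propagate_copies(movs):
    """Forward-propagate register copies through a straight-line run of MOVs.

    movs is a list of (source, destination) pairs in AT&T operand order.
    """
    copy_of = {}   # register -> older register known to hold the same value
    result = []
    for src, dst in movs:
        root = copy_of.get(src, src)        # read from the oldest equivalent register
        if copy_of.get(dst, dst) == root:   # dst already holds this value,
            continue                        # so the MOV is redundant
        # dst is about to be overwritten: forget every fact that mentions it.
        copy_of = {r: s for r, s in copy_of.items() if r != dst and s != dst}
        copy_of[dst] = root
        result.append((root, dst))
    return result


example = [("rcx", "rax"), ("rdx", "rsi"), ("rax", "rbx"), ("rbx", "rcx")]
print(propagate_copies(example))
# [('rcx', 'rax'), ('rdx', 'rsi'), ('rcx', 'rbx')]
```

The third MOV now reads %rcx directly, and the fourth disappears because %rcx is already known to hold the value being copied into it.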

Before I go submitting patches though, I
still need to test it under Linux and
i386.

Kit
_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listin
David Pethes
2018-06-11 19:27:16 UTC
Hi,
nice work.
Post by J. Gareth Moreton
movq %rcx,%rax
movq %rdx,%rsi
movq %rcx,%rbx
It determines, for the third MOV, it can
change %rax for %rcx to minimise a
pipeline stall, and then knows that %rbx
and %rcx contain the same value, so can
remove the 4th MOV completely. Given that
modern processors usually have at least 3
ALUs and the interdependencies have been
removed, this will likely give a speed
increase of one cycle over these few
commands.
Note that modern CPUs can use move elimination for reg-to-reg moves, so
such a move doesn't cost any execution resources (it's "free"). Despite
that, it's still a win, because it saves both bytes in the I-cache and
decoder bandwidth (which can indirectly save some cycle(s) in other
places).
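
To put a rough number on the I-cache saving, here is a small sketch that encodes a 64-bit register-to-register MOV by hand. It assumes the assembler picks the `REX.W + 89 /r` form of MOV and covers only the eight legacy registers; the helper names `encode_mov_rr` and `total_size` are made up for the example.

```python
# Register numbers used in the ModRM byte for the eight legacy 64-bit registers.
REG64 = {"rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
         "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7}


def encode_mov_rr(src, dst):
    """Encode `movq %src,%dst` (AT&T operand order) as machine code bytes."""
    rex_w = 0x48                                   # REX prefix, W=1: 64-bit operands
    opcode = 0x89                                  # MOV r/m64, r64
    modrm = 0xC0 | (REG64[src] << 3) | REG64[dst]  # mod=11: register-direct
    return bytes([rex_w, opcode, modrm])


def total_size(movs):
    """Total encoded size in bytes of a list of (src, dst) MOVs."""
    return sum(len(encode_mov_rr(src, dst)) for src, dst in movs)


before = [("rcx", "rax"), ("rdx", "rsi"), ("rax", "rbx"), ("rbx", "rcx")]
after = [("rcx", "rax"), ("rdx", "rsi"), ("rcx", "rbx")]
print(encode_mov_rr("rcx", "rax").hex())      # 4889c8
print(total_size(before), total_size(after))  # 12 9
```

Each of these MOVs is three bytes, so dropping the fourth one saves three bytes of code in this sequence.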

David
J. Gareth Moreton
2018-06-11 20:07:18 UTC
Thanks David,

I'm still learning some of the nuances of the Intel and AMD processors,
but most of it is just logical analysis.  Admittedly my main drive has
been to shrink down the size of the binary, since Delphi and Free Pascal
have always been a little bit bloated in comparison.  Not that it is
necessarily a bad thing, but saving space without sacrificing performance
can only be a good thing, especially for those with limited bandwidth or
for saving those few precious bytes when burning files to a CD or DVD.

There have been a few instances in the compiled compiler (my main test
case) where an entire register is freed up due to my deep optimisation, and
that means the corresponding "push" and "pop" at either end of the
procedure can be removed (along with the corresponding stack unwinding
information), although I haven't started programming that yet.
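
As a purely hypothetical sketch of that follow-up step (nothing like this exists in the patch yet), the idea could look as follows in Python. It assumes the Win64 set of callee-saved integer registers, represents instructions as simple tuples of a mnemonic followed by register names, and ignores stack alignment and unwind information, which a real implementation would have to adjust.

```python
# Callee-saved general-purpose registers under the Win64 convention (assumed here).
CALLEE_SAVED = {"rbx", "rbp", "rdi", "rsi", "r12", "r13", "r14", "r15"}


def drop_unused_saves(proc):
    """Remove push/pop of callee-saved registers the body no longer touches.

    proc is a list of tuples such as ("mov", "rcx", "rax") or ("push", "rbx").
    """
    # Collect every register mentioned outside of push/pop instructions.
    body_regs = set()
    for mnemonic, *operands in proc:
        if mnemonic not in ("push", "pop"):
            body_regs.update(operands)
    # Keep everything except saves and restores of registers the body never uses.
    return [
        ins for ins in proc
        if not (ins[0] in ("push", "pop")
                and ins[1] in CALLEE_SAVED
                and ins[1] not in body_regs)
    ]


proc = [
    ("push", "rbx"), ("push", "rsi"),
    ("mov", "rcx", "rax"), ("mov", "rdx", "rsi"),
    ("pop", "rsi"), ("pop", "rbx"),
    ("ret",),
]
for ins in drop_unused_saves(proc):
    print(*ins)
```

Here %rbx is no longer referenced by the body, so its push and pop are dropped, while the save and restore of %rsi stay.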

I am ready to submit this part of my deep optimiser as a patch.  I'm just
waiting for Florian's acceptance or rejection of my debug strip patch -
https://bugs.freepascal.org/view.php?id=33798 (the 3rd attempt!) - only
because it shares some debugging code with said patch (it was useful to
monitor how the registers inside references were changed).  If it's
rejected, it just means I'll have to change some of that debugging code a
bit.

Gareth aka. Kit

n***@gmail.com
2018-06-12 21:45:50 UTC
Post by J. Gareth Moreton
Thanks David,
I'm still learning some of the nuances of the Intel and AMD
processors, but most of it is just logical analysis. Admittedly my
main drive has been to shrink down the size of the binary, since
Delphi and Free Pascal have always been a little bit bloated in
comparison. Not that it is necessarily a bad thing, but saving space
without sacrificing performance can only be a good thing, especially
for those with limited bandwidth or for saving those few precious
bytes when burning files to a CD or DVD.
There have been a few instances in the compiled compiler (my main
test case) where an entire register is freed up due to my deep
optimisation, and that means the corresponding "push" and "pop" at
either end of the procedure can be removed (along with the
corresponding stack unwinding information), although I haven't
started programming that yet.
Isn't it better to perform this optimization before register
allocation? Then, when this happens, the corresponding "push" and "pop"
wouldn't even be emitted by the compiler, because the register wouldn't
have to be spilled.

Nikolay
Florian Klämpfl
2018-06-13 19:23:53 UTC
Post by n***@gmail.com
Isn't it better to perform this optimization before register
allocation. Then, when this happens, the corresponding "push" and "pop"
wouldn't even be put by the compiler, because the register wouldn't
have to be spilled.
Yes, this is what I already started once: a peephole optimizer pass that can be run before register allocation
and that performs, in particular, optimizations which reduce register usage.
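
A small sketch of why running such a pass on virtual registers helps: if copies are removed before allocation, fewer values are live at the same time, so the allocator needs fewer physical registers. The virtual register names (v0, v1, ...), the (destination, sources) tuple format and the function `peak_live_registers` are assumptions for this illustration and do not reflect FPC's internal representation.

```python
def peak_live_registers(instrs, live_out):
    """Return the largest number of registers live at once in straight-line code.

    instrs is a list of (destination, [sources]) in program order;
    live_out is the set of registers still needed after the last instruction.
    """
    live = set(live_out)
    peak = len(live)
    for dst, srcs in reversed(instrs):
        live.discard(dst)    # dst is defined here, so it is not live above this point
        live.update(srcs)    # sources must be live on entry to the instruction
        peak = max(peak, len(live))
    return peak


# v1 := v0; v2 := v1; v3 := op(v2, v0)
before = [("v1", ["v0"]), ("v2", ["v1"]), ("v3", ["v2", "v0"])]
# After copy propagation both copies are dead: v3 := op(v0, v0)
after = [("v3", ["v0", "v0"])]

print(peak_live_registers(before, {"v3"}))  # 2
print(peak_live_registers(after, {"v3"}))   # 1
```

With the copies gone before allocation, the second register is never requested, so there is nothing to save and restore around the procedure.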
J. Gareth Moreton
2018-06-12 21:27:13 UTC
Ideally yes, but this occurs after peephole optimisations where all of the
register allocations have already been made.  Doing the peephole and deep
optimisations while the registers are still in a virtual state would be
better overall, but may require a huge overhaul of the compiler that might
be asking for too much trouble.  There's also the issue that some commands
only work with certain registers, and optimisations have to be careful of
that fact.

Gareth
Florian Klämpfl
2018-06-13 19:29:58 UTC
Post by J. Gareth Moreton
Ideally yes, but this occurs after peephole optimisations where all of the register allocations have already been made.
Doing the peephole and deep optimisations while the registers are still in a virtual state would be better overall, but
may require a huge overhaul of the compiler that might be asking for too much trouble.  There's also the issue that some
commands only work with certain registers, and optimisations have to be careful of that fact.
This is not that hard actually. The only difference is how register allocations are handled. Just look at the scheduler
pass for ARM: it also works before register allocation (and afterwards).
J. Gareth Moreton
2018-06-13 18:50:07 UTC
I haven't fully uncovered the secrets of
the compiler yet, but I did notice a
"pre-peephole pass" under x86, although I
think the only thing it touched was one of
the bit shifts. Does this occur before
register allocation or was it just
something that had to be done before Pass
1?

Gareth

Florian Klämpfl
2018-06-14 20:58:43 UTC
Post by J. Gareth Moreton
I haven't fully uncovered the secrets of
the compiler yet, but I did notice a
"pre-peephole pass" under x86, although I
think the only thing it touched was one of
the bit shifts. Does this occur before
register allocation or was it just
something that had to be done before Pass
1?
It is only before pass 1.

I attached a patch I once started which shows the idea.
J. Gareth Moreton
2018-06-14 20:06:22 UTC
Thanks. I'll have a study of this and potentially move my initial deep
optimisation component to this stage.

I've made some more peephole optimisations in the meantime, but I'm going
to hold off on posting them because they're starting to conflict with my
other submissions.  Besides, I've given you far too many patches already!

Gareth
J. Gareth Moreton
2018-06-14 21:49:53 UTC
Hi Florian,
I don't know if you have any answers, but I'm unable to apply any patches
I receive. I can view them and see the changes, and manually apply them via
copy+paste if I have to, but using the "Apply Patch" option ends up not
doing anything.  Is there a fix to this, or does it error out because I
only have read access to SVN (even though the patch should only modify my
local files)?

Looking at the design though, I can definitely experiment to see how the
deep optimiser performs in the preallocation block.  It will certainly
have the advantage of being able to handle registers that may end up being
stored on the stack due to the lack of free actual registers.  If needs
be, I'll submit the current deep optimiser that does all of its work after
the peephole optimisation, and can change it to pre register allocation
later on.  I will need to see whether it performs better or worse at the
earlier stage, and whether it causes other optimisations to be missed
because of MOVs being changed or removed.

Fun times ahead!  Thanks for the patch.
Gareth
Florian Klämpfl
2018-06-15 15:23:04 UTC
Post by J. Gareth Moreton
Hi Florian,
I don't know if you have any answers, but I'm unable to apply any patches I receive. I can view them and see the
changes, and manually apply them via copy+paste if I have to, but using the "Apply Patch" option ends up not doing
anything.  Is there a fix to this, or does it error out because I only have read access to SVN (even though the patch
should only modify my local files)?
Did you try running the patch.exe that comes with FPC from the command line?
J. Gareth Moreton
2018-06-15 14:25:57 UTC
Oh! I'm still a beginner with version
control, it seems!

J. Gareth Moreton
2018-06-15 16:17:15 UTC
Not much luck for me - the file won't patch without options or
modifications, and using -p 1 to remove the "a/" and "b/" from the start
of the file names causes an assertion failure in patch.exe.  Back to doing
it manually for now!
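
For reference, this is roughly what the -p option does to the file names in a patch header: it strips the given number of leading path components, treating runs of slashes as one. The sketch below only mimics that name handling and is not how patch.exe is implemented; the function name `strip_components` and the "a/"-prefixed example path are assumptions, since the contents of the attached patch are not shown in this thread.

```python
import re


def strip_components(path, count):
    """Drop the first `count` slash-separated components from a patch file name."""
    parts = re.split(r"/+", path)   # adjacent slashes count as a single separator
    return "/".join(parts[count:])


# A git-style header names files as a/<path> and b/<path>.
print(strip_components("a/compiler/x86_64/symcpu.pas", 1))
print(strip_components("a/compiler/x86_64/symcpu.pas", 0))
```

With a count of 1 the "a/" or "b/" prefix goes away and the remaining path is resolved relative to the directory patch is run from; with a count of 0 the name is used unchanged.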

Gareth
Florian Klämpfl
2018-06-15 19:03:28 UTC
Post by J. Gareth Moreton
Not much luck for me - the file won't patch without options or modifications, and using -p 1 to remove the "a/" and "b/"
from the starts of the files causes an assertion in patch.exe.
Sorry, my bad. The patch has Unix line endings (LF), and this crashes patch.exe for Windows. Try again with the attached
one or convert the line endings of the first one to Windows ones (CRLF).
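
If no conversion tool is at hand, the line endings can be converted with a few lines of Python. This is just one possible way to do it; the function name `to_crlf` and the file names in the comment are placeholders for the example.

```python
def to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CRLF without doubling existing CRLF pairs."""
    # Normalise everything to LF first, then expand each LF to CRLF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Typical use (file names are placeholders):
#   with open("idea.patch", "rb") as f:
#       converted = to_crlf(f.read())
#   with open("idea-crlf.patch", "wb") as f:
#       f.write(converted)
print(to_crlf(b"line one\nline two\r\n"))
```

Working on bytes rather than text avoids any automatic newline translation by the runtime, so the result is the same on Windows and Unix.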
J. Gareth Moreton
2018-06-15 20:11:32 UTC
Something tells me that we should write our own patch.exe at some point to
alleviate these shortcomings!  Thanks for the patch again.

Any word on what I've submitted so far? I ask because I found some new
peephole optimisations that can make some good speed and size savings, but
one of them requires a new Pass 1 function that will have to be merged
into either the binary search list or the large case block, so I can't
submit it until I know which way the source tree will go.
Gareth aka. Kit
J. Gareth Moreton
2018-06-15 21:25:29 UTC
Sorry, I just realised that was unfairly impatient of me.  I've still got
little things I can work on, but I'm worried about creating a large
backlog.

Gareth
