Discussion:
[fpc-devel] x86_64 Optimizer Overhaul
J. Gareth Moreton
2018-12-01 15:06:00 UTC
Permalink
Hi everyone,

So for the fast few weeks, I've attempted to do some work on the peephole
optimizer, specifically to reduce the number of passes required to produce
optimised code, and I can finally submit a patch:
https://bugs.freepascal.org/view.php?id=34628

This is only for x86_64, and I've tried to keep i386 separate until I know
x86_64 works.  I hope it works and is to a high standard.  The bug report
should explain most of the logic behind the changes.

I've had problems testing it under Linux due to configuration
difficulties, so if anyone is willing to try out "make all", I'll be most
grateful.

Thank you.

Gareth aka. Kit
J. Gareth Moreton
2018-12-02 04:39:12 UTC
Permalink
Following advice from Florian, I've split my submission into five separate
patches so they are easier to test.  It also now compiles under
x86_64-linux.  It seems that there's an apparent fault with one of the MOV
optimisations that was causing incorrect code to be generated in some
instances.  I have a good idea as to what's going on and can try to fix
this at another time.

Hopefully now it's stable enough for time metrics to be taken and to
confirm it doesn't break other platforms.
Some more refactoring should be performed down the line; I plan to do this
once my code is confirmed reasonable and I begin adapting it for i386,
where there's a bounty for speed gains!

Find all the new patch files over here:
https://bugs.freepascal.org/view.php?id=34628 - note that some of the
patches require others to work; prerequisite information is given in the
first note.

Gareth aka. Kit
J. Gareth Moreton
2018-12-02 20:30:11 UTC
Permalink
Thanks for the feedback.  Do you have a reproducible case, and does it
fail on Linux or Windows?  I'll have a look for the infinite loops in the
meantime.

Gareth aka. Kit
Post by J. Gareth Moreton
I've had problems testing it under Linux due to configuration
difficulties, so if anyone is willing to try out "make all", I'll be most
grateful. 

"make all" work well on linux.

Compiler options -O3 and -O4 are broken.
It was possible to compile my program, but program at some point went into
never ending loop - cpu usage 100% and response zero.

Compiling my speed test program using -O2, optimizations made by Overhaul,
was speed lose by 2% comparing to current trunk.  I guess, optimizations
is good for compiler itself, but no so much for user programs.

margers
       
Marģers . via fpc-devel
2018-12-02 23:21:06 UTC
Permalink
_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
J. Gareth Moreton
2018-12-02 22:24:47 UTC
Permalink
That's interesting. Thanks for that. Time to get fixing.

In the meantime I'm also fixing up the buggy optimisation that caused the
original crash on Linux... nothing against the contributor, but it looks
like some badly-copied code from the MovXX routine... it even still
mentions "movsx" in the comments!

Hopefully this effort on the overhaul won't all be for naught.

Gareth aka. Kit

On Sun 02/12/18 23:21 , "Marģers ." ***@inbox.lv sent:
I run it no linux. Problem code part.

type PLongData = ^TLongData;
      TLongData = array [0..100] of longint;

function binarySearchLong ( sortedArray:PLongData; nLen,
toFind:longint):longint;
var low, high, mid, l, h, m : longint;
begin
    { Returns index of toFind in sortedArray, or -1 if not found}
    low := 0;
    high := nLen - 1;

    l := sortedArray^[low];
    h := sortedArray^[high];

    while ((l = toFind)) do
    begin
         mid := (low + high) shr 1;   { var "low" in register
r8d }
         m := sortedArray^[mid];

         if (m < toFind) then
         begin
              low := mid + 1;
              l := sortedArray^[low];

        { asm code generated
-- with trunk
        lea     r8d,
[r11d+1H]                          
    mov  esi, r8d
--end trunk
-- with overhaul   it never set r8d to new value, but should
        lea     esi,
[r11d+1H]                          
-- end  overhaul

        mov     r10d, dword
[rdi+rsi*4]                 
        jmp    
?_00144                                 

        }
         end else
         if (m > toFind) then
         begin
              high := mid - 1;
              h := sortedArray^[high];
         end else
         begin
            binarySearchLong:=mid;
            exit;
         end;
         
    end;

    if (sortedArray^[low] = toFind) then
    begin
         binarySearchLong:=low;
    end else
        binarySearchLong := -1; { Not found}
end;

      ----- Reply to message -----
Subject: Re: [fpc-devel] x86_64 Optimizer Overhaul
Date: 2018. gada 2. decembris 23:32:36
From: J. Gareth Moreton
To: FPC developers' list Thanks for the feedback.  Do you have a
reproducible case, and does it fail on Linux or Windows?  I'll have a look
for the infinite loops in the meantime.   Gareth aka. Kit      
Marģers . via fpc-devel
2018-12-02 20:54:39 UTC
Permalink
_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
J. Gareth Moreton
2018-12-06 15:55:13 UTC
Permalink
I believed I've fixed the bug.  Thanks for your help.

I had misunderstood one of the internal methods and, as a result, it
wasn't resetting the register allocation usage with each iteration of the
loop (and to add insult to injury, caused a memory leak!).  By sheer
coincidence, this wasn't a problem under Windows because of some additional
code that skipped over the function prologue, but got triggered under
Linux.

I've updated all of the patch files in the bug report and added an
additional one, since one function in particular got a bigger rework than
everything else (overhaul-mov-refactor).

I haven't had a chance to re-test the timings yet, although I've tried to
provide a couple of additional savings for -O1 and -O2.

Gareth aka. Kit

P.S. Note that the code is very messy with functions being split between
i386 and x86_64. This is for testing and control cases.  If x86_64 is
successful, I intend to remove the distinctions and have i386 and x86_64
share the same overhaul.  One platform at a time though!

On Sun 02/12/18 23:21 , "Marģers ." ***@inbox.lv sent:
I run it no linux. Problem code part.

type PLongData = ^TLongData;
      TLongData = array [0..100] of longint;

function binarySearchLong ( sortedArray:PLongData; nLen,
toFind:longint):longint;
var low, high, mid, l, h, m : longint;
begin
    { Returns index of toFind in sortedArray, or -1 if not found}
    low := 0;
    high := nLen - 1;

    l := sortedArray^[low];
    h := sortedArray^[high];

    while ((l = toFind)) do
    begin
         mid := (low + high) shr 1;   { var "low" in register
r8d }
         m := sortedArray^[mid];

         if (m < toFind) then
         begin
              low := mid + 1;
              l := sortedArray^[low];

        { asm code generated
-- with trunk
        lea     r8d,
[r11d+1H]                          
    mov  esi, r8d
--end trunk
-- with overhaul   it never set r8d to new value, but should
        lea     esi,
[r11d+1H]                          
-- end  overhaul

        mov     r10d, dword
[rdi+rsi*4]                 
        jmp    
?_00144                                 

        }
         end else
         if (m > toFind) then
         begin
              high := mid - 1;
              h := sortedArray^[high];
         end else
         begin
            binarySearchLong:=mid;
            exit;
         end;
         
    end;

    if (sortedArray^[low] = toFind) then
    begin
         binarySearchLong:=low;
    end else
        binarySearchLong := -1; { Not found}
end;

      ----- Reply to message -----
Subject: Re: [fpc-devel] x86_64 Optimizer Overhaul
Date: 2018. gada 2. decembris 23:32:36
From: J. Gareth Moreton
To: FPC developers' list Thanks for the feedback.  Do you have a
reproducible case, and does it fail on Linux or Windows?  I'll have a look
for the infinite loops in the meantime.   Gareth aka. Kit      
J. Gareth Moreton
2018-12-06 22:11:13 UTC
Permalink
Got some new timing metrics for compiling Lazarus under Windows. I haven't
got them for Linux because I only have it on a virtual machine, which
significantly skews the performance:

----

-O3: Trunk

[125.383] 1285571 lines compiled, 125.4 sec, 9137600 bytes code, 788740
bytes data
[122.078] 1285571 lines compiled, 122.1 sec, 9137600 bytes code, 788740
bytes data
[119.125] 1285571 lines compiled, 119.1 sec, 9137600 bytes code, 788740
bytes data

Avg. 122.195 sec. Binary size = 19,325,952 bytes

-O3: Overhaul

[103.133] 1285571 lines compiled, 103.1 sec, 9133968 bytes code, 788740
bytes data
[105.234] 1285571 lines compiled, 105.2 sec, 9133968 bytes code, 788740
bytes data
[104.906] 1285571 lines compiled, 104.9 sec, 9133968 bytes code, 788740
bytes data

Avg. 104.424 sec. Binary size = 19,322,368 bytes

Time improvement: 14.5%
Size improvement: 0.0185%

----

-O2: Trunk

[118.852] 1285571 lines compiled, 118.9 sec, 9103760 bytes code, 788996
bytes data
[120.266] 1285571 lines compiled, 120.3 sec, 9103760 bytes code, 788996
bytes data
[116.531] 1285571 lines compiled, 116.5 sec, 9103760 bytes code, 788996
bytes data

Avg. 118.550 sec. Binary size = 19,292,672 bytes

-O2: Overhaul

[100.875] 1285571 lines compiled, 100.9 sec, 9100096 bytes code, 788996
bytes data
[100.922] 1285571 lines compiled, 100.9 sec, 9100096 bytes code, 788996
bytes data
[101.813] 1285571 lines compiled, 101.8 sec, 9100096 bytes code, 788996
bytes data

Avg. 101.203 sec. Binary size = 19,289,088 bytes

Time improvement: 14.6%
Size improvement: 0.0186%

----

-O1: Trunk

[114.641] 1285571 lines compiled, 114.6 sec, 10196576 bytes code, 788996
bytes data
[112.734] 1285571 lines compiled, 112.7 sec, 10196576 bytes code, 788996
bytes data
[113.516] 1285571 lines compiled, 113.5 sec, 10196576 bytes code, 788996
bytes data

Avg. 113.630 sec. Binary size = 20,370,432 bytes

-O1: Overhaul

[99.711] 1285571 lines compiled, 99.7 sec, 10193536 bytes code, 788996
bytes data
[102.375] 1285571 lines compiled, 102.4 sec, 10193536 bytes code, 788996
bytes data
[102.211] 1285571 lines compiled, 102.2 sec, 10193536 bytes code, 788996
bytes data

Avg. 101.432 sec. Binary size = 20,370,360 bytes

Time improvement: 10.7%
Size improvement: 0.000353%

----

Note there are no actual new optimisations, so to speak. The size
improvements come from more intelligent elimination of dead labels and the
removal of unnecessary alignment hints, which allows some new branch
optimisations to be found.

One thing that is a little bit mysterious and might warrant further
investigation... the binary size is larger on O3 than it is for O2 on both
the trunk and the overhaul patches.  It might be that some of the
optimisations sacrifice code size for speed, but over 3 kilobytes feels
rather significant.  I may be wrong though.
Does anyone have other test projects to compile that would give more
coverage for the timing metrics?

Gareth aka. Kit

On Thu 06/12/18 15:55 , "J. Gareth Moreton" ***@moreton-family.com
sent:
I believed I've fixed the bug.  Thanks for your help.

I had misunderstood one of the internal methods and, as a result, it
wasn't resetting the register allocation usage with each iteration of the
loop (and to add insult to injury, caused a memory leak!).  By sheer
coincidence, this wasn't a problem under Windows because of some additional
code that skipped over the function prologue, but got triggered under
Linux.

I've updated all of the patch files in the bug report and added an
additional one, since one function in particular got a bigger rework than
everything else (overhaul-mov-refactor).

I haven't had a chance to re-test the timings yet, although I've tried to
provide a couple of additional savings for -O1 and -O2.

Gareth aka. Kit

P.S. Note that the code is very messy with functions being split between
i386 and x86_64. This is for testing and control cases.  If x86_64 is
successful, I intend to remove the distinctions and have i386 and x86_64
share the same overhaul.  One platform at a time though!

On Sun 02/12/18 23:21 , "Marģers ." ***@inbox.lv sent:
I run it no linux. Problem code part.

type PLongData = ^TLongData;
      TLongData = array [0..100] of longint;

function binarySearchLong ( sortedArray:PLongData; nLen,
toFind:longint):longint;
var low, high, mid, l, h, m : longint;
begin
    { Returns index of toFind in sortedArray, or -1 if not found}
    low := 0;
    high := nLen - 1;

    l := sortedArray^[low];
    h := sortedArray^[high];

    while ((l = toFind)) do
    begin
         mid := (low + high) shr 1;   { var "low" in register
r8d }
         m := sortedArray^[mid];

         if (m < toFind) then
         begin
              low := mid + 1;
              l := sortedArray^[low];

        { asm code generated
-- with trunk
        lea     r8d,
[r11d+1H]                          
    mov  esi, r8d
--end trunk
-- with overhaul   it never set r8d to new value, but should
        lea     esi,
[r11d+1H]                          
-- end  overhaul

        mov     r10d, dword
[rdi+rsi*4]                 
        jmp    
?_00144                                 

        }
         end else
         if (m > toFind) then
         begin
              high := mid - 1;
              h := sortedArray^[high];
         end else
         begin
            binarySearchLong:=mid;
            exit;
         end;
         
    end;

    if (sortedArray^[low] = toFind) then
    begin
         binarySearchLong:=low;
    end else
        binarySearchLong := -1; { Not found}
end;

      ----- Reply to message -----
Subject: Re: [fpc-devel] x86_64 Optimizer Overhaul
Date: 2018. gada 2. decembris 23:32:36
From: J. Gareth Moreton
To: FPC developers' list Thanks for the feedback.  Do you have a
reproducible case, and does it fail on Linux or Windows?  I'll have a look
for the infinite loops in the meantime.   Gareth aka. Kit      
_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org [1]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[2]">http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Links:
------
[1] mailto:fpc-***@lists.freepascal.org
[2] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Ryan Joseph
2018-12-07 01:26:32 UTC
Permalink
Does anyone have other test projects to compile that would give more coverage for the timing metrics?
Sure. How do I download and build? Are you just relying the FPC standard output for timing or are there are special switches to show compiles times more accurate?

Regards,
Ryan Joseph

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi
J. Gareth Moreton
2018-12-07 01:21:25 UTC
Permalink
The patches in question can be found here:
https://bugs.freepascal.org/view.php?id=34628 - just get the latest source
files from the SVN trunk and apply the patches - the order shouldn't
matter, but be careful you don't accidentally apply the same patch twice. 
After that, you just "make clean all install" as normal.  If you've never
built FPC before, you might want to hunt around the website for
instructions.

For the most accurate timing information, specify the "-vs" flag and it
will put timestamps at the front of all of the messages.

Hope this helps.

Gareth aka. Kit
Post by J. Gareth Moreton
Does anyone have other test projects to compile that would give more
coverage for the timing metrics?

Sure. How do I download and build? Are you just relying the FPC standard
output for timing or are there are special switches to show compiles times
more accurate?

Regards,
Ryan Joseph

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org [2]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[3]">http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Links:
J. Gareth Moreton
2018-12-09 02:02:26 UTC
Permalink
(This should probably be on the mailing list because it's helpful to
everyone)

Hmmm, I'm not sure about that one - those shouldn't be affected.  Just
the standard "make clean all" should work.

However, the document here contains everything about building FPC:
http://www.stack.nl/~marcov/buildfaq.pdf [1] - not sure why your packages
aren't in the default directories now though... the patches only change a
few source files.

Gareth aka. Kit

On Sun 09/12/18 02:56 , Ryan Joseph ***@thealchemistguild.com sent:
got it compiling but I need a better way to specify the changed
rtl/package units. I just did a “make clean all” but I need to specify
the new location of the rtf/packages units which are no longer in the
default locations.

Is there a better way than adding tons of -Fu’s in the command line?

Regards,
Ryan Joseph



Links:
------
[1] http://www.stack.nl/~marcov/buildfaq.pdf
Ryan Joseph
2018-12-09 03:13:45 UTC
Permalink
(This should probably be on the mailing list because it's helpful to everyone)
Hmmm, I'm not sure about that one - those shouldn't be affected. Just the standard "make clean all" should work.
However, the document here contains everything about building FPC: http://www.stack.nl/~marcov/buildfaq.pdf - not sure why your packages aren't in the default directories now though... the patches only change a few source files.
Gareth aka. Kit
The reason is that PPU versions changed so I can’t use the units at the default system locations. Not sure how the compiler decides but it was looking for units at the 3.1.1 install location which is an older PPU version.

There should be a single command to specify a top-level directory that wins out over default locations.

Regards,
Ryan Joseph

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org
Ryan Joseph
2018-12-09 01:36:44 UTC
Permalink
Couldn’t figure out the patching. I tried a dry run but it doesn’t seem to find the file.

Downloaded from svn

cd trunk
patch < /Users/ryanjoseph/Downloads/overhaul-64-32-split.patch --dry-run

(Stripping trailing CRs from patch.)
can't find file to patch at input line 5
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--------------------------
|Index: compiler/x86/aoptx86.pas
|===================================================================
|--- compiler/x86/aoptx86.pas (revision 40472)
|+++ compiler/x86/aoptx86.pas (working copy)
--------------------------
Had any luck with this?
Regards,
Ryan Joseph

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/
Ryan Joseph
2018-12-09 01:56:44 UTC
Permalink
I was stupid and didn’t use the right options. They work now except this one:

sudo patch -p0 < /Users/ryanjoseph/Downloads/overhaul-base.patch
(Stripping trailing CRs from patch.)
patching file compiler/aopt.pas
patch: **** malformed patch at line 15: Index: compiler/aoptbase.pas

did it fail?
Post by Ryan Joseph
Couldn’t figure out the patching. I tried a dry run but it doesn’t seem to find the file.
Regards,
Ryan Joseph

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal
J. Gareth Moreton
2018-12-09 02:15:04 UTC
Permalink
Hmmm, that's a shame if the time difference is so small.  Up to you if
it's worth it or not.  I hoped it would be slightly better than that,
although if it's consistently faster, especially with large projects, then
it's a winner in my eyes.  Fingers crossed!

Gareth aka. Kit

On Sun 09/12/18 03:10 , "Ryan Joseph" ***@thealchemistguild.com sent:
Got everything building finally but the time difference is so small I'll
need to make a script to compile multiple times and average all the runs.
Is it even worth the time doing that?

Regards,
Ryan Joseph
Ryan Joseph
2018-12-09 03:32:27 UTC
Permalink
Hmmm, that's a shame if the time difference is so small. Up to you if it's worth it or not. I hoped it would be slightly better than that, although if it's consistently faster, especially with large projects, then it's a winner in my eyes. Fingers crossed!
My biggest project is only ~20 seconds so it’s just not a big enough code base I think.

How do we even know how much of the time was spent on code generation vs parsing? All I looked at was the final time using -vs when it got to the linking phase.

Regards,
Ryan Joseph

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lis
J. Gareth Moreton
2018-12-09 22:39:06 UTC
Permalink
Because of how intertwined my work is, I can't easily work on something
else until I know if this overhaul is accepted or rejected.  However, in
the meantime, would anyone object if I start porting it to i386, so I can
get rid of all those horrible $ifdef's more than anything? From the little
I've observed, i386 still works as it does normally, which was the original
intention so x86_64 can be tested in isolation.

Gareth aka. Kit
Marco van de Voort
2018-12-09 11:57:19 UTC
Permalink
I'm not sure. I've always had problems
with patch.exe. I personally use "svn
patch", which works for me both under
Windows and Linux.
Cygwin patch afaik also works fine.


_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://l
J. Gareth Moreton
2018-12-09 04:08:52 UTC
Permalink
Saying that though, despite the near-identical time, what's the size of
the binary like?  It should be the same or slightly smaller, but
(hopefully) never larger.

Gareth aka. Kit
Hmmm, that's a shame if the time difference is so small. Up to you if
it's worth it or not. I hoped it would be slightly better than that,
although if it's consistently faster, especially with large projects, then
it's a winner in my eyes. Fingers crossed!

My biggest project is only ~20 seconds so it’s just not a big enough
code base I think.

How do we even know how much of the time was spent on code generation vs
parsing? All I looked at was the final time using -vs when it got to the
linking phase.

Regards,
Ryan Joseph
Ryan Joseph
2018-12-09 03:10:23 UTC
Permalink
Got everything building finally but the time difference is so small I'll need to make a script to compile multiple times and average all the runs. Is it even worth the time doing that?

Regards,
Ryan Joseph

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cg
J. Gareth Moreton
2018-12-09 02:45:11 UTC
Permalink
Hmmm, it might imply that the overhaul isn't worth it except for the
largest projects.  I guess we'll have to let Florian make that call.

I'm not sure how to time the optimisation stage separately, unless you're
able to pass in the PPU files directly.  Other factors like reading from
the disk can take up proportinally more time as well.  Thanks for your
help though - at least it's not considerably worse!

i'm just hoping my changes are successful so I can port it to i386 and
then implement some new peephole optimisations that I've found (the changes
clash slightly with the overhaul).

Gareth aka. Kit.
Hmmm, that's a shame if the time difference is so small. Up to you if
it's worth it or not. I hoped it would be slightly better than that,
although if it's consistently faster, especially with large projects, then
it's a winner in my eyes. Fingers crossed!

My biggest project is only ~20 seconds so it’s just not a big enough
code base I think.

How do we even know how much of the time was spent on code generation vs
parsing? All I looked at was the final time using -vs when it got to the
linking phase.

Regards,
Ryan Joseph
J. Gareth Moreton
2018-12-09 17:07:11 UTC
Permalink
I think patch.exe gets a bit confused if the changes have to be offset -
in my case, there are 6 patches that should be applied together, and some
of them modify the same file, hence causing movement of procedures in the
source file.
Sorry if I'm getting pushy and impatient with this change - I guess I'm a
bit too passionate for my own good!

Gareth aka. Kit
I'm not sure. I've always had problems
with patch.exe. I personally use "svn
patch", which works for me both under
Windows and Linux.
Cygwin patch afaik also works fine.

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org [1]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[2]">http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Links:
------
[1] mailto:fpc-***@lists.freepascal.org
[2] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Loading...