Discussion: Optimization theory
J. Gareth Moreton
2018-06-24 21:28:52 UTC
I figured this is as good a place as any to talk about my attempt at
improving the speed of the stage 1 optimizer in issue #33855.  On paper, a
binary search is the fastest method to locate a data pointer associated
with a key (in this case, an opcode).  Of course, in practice that isn't
always the case, due to overhead and the penalties (or lack thereof)
associated with conditional jumps.
The main problem with my binary search implementation is the extra
complexity it adds to the code and, as Markov mentioned, the fact that it's
limited to x86 rather than being more general-purpose.  The ideal
solution is to tie the binary search and the associated key/data map into
the compiler itself, specifically into case node generation, but there are
some theoretical issues that need to be resolved:
- What is the minimum number of case nodes at which a binary search should
be used over a simpler linear search?  The maximum number of iterations of
a binary search loop is equal to ceil(log2(n)), where n is the size of the
list (a sketch of such a search follows this list).  Further optimisations
with the likes of CMOVcc and loop unrolling can eliminate jumps entirely,
but there is still enough overhead that it might be faster to simply
perform a linear search, especially where the most frequent entry is at or
near the top of the list.
- How can it be detected that each case node has an identical setup?
In the optimizer, the current opcode is passed through a case block, and
if a match is found, it is passed to an object method of the form
"function(var p: tai): Boolean".  The fact that the calls are identical
means that some optimisation can be performed with common sub-expressions,
specifically passing the address of p into RDX or RSI, depending on
whether the x86-64 target is Windows or Linux.
- What happens if there is an imperfection in the source code?  A vague
question, I know, but in this case some of the functions are of the form
"function(const p: tai): Boolean", which requires a slightly different
set-up of the parameters.  This is not an error - the functions are
sometimes set up this way if p never changes (which is the case if this
particular instruction is never deleted by the optimizer) - but it breaks
the homogeneity of the case nodes and hence would make a highly-optimized
search much harder.  Should the programmer be warned or given a hint on
how to improve their code, or is it up to us to be smart enough to spot
"const p: tai" instead of "var p: tai"?

- How easy would it be to make such a binary search more general-purpose?
My particular code simply uses a map that pairs an opcode with a pointer in
a many-to-one relationship.  In this case, the algorithm doesn't care what
the pointer is - it could easily be a label address, an object or a generic
block of memory on the heap instead of a procedural address.  Granted, as
mentioned, the situation where each case node is a function call with an
identical setup is a special case because jumps can be eliminated entirely.
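To make the first question concrete, a sorted key/data map and binary
search along these lines is what I mean - a minimal sketch only, assuming
the compiler's tai and tasmop types are in scope; TOptHandler, TOptEntry
and FindHandler are placeholder names, not actual compiler code:

  type
    { matches the "function(var p: tai): Boolean" form described above }
    TOptHandler = function(var p: tai): Boolean of object;

    TOptEntry = record
      op: tasmop;            { key: the opcode }
      handler: TOptHandler;  { data: the routine to call for that opcode }
    end;

  { Table must be sorted by op in ascending order. }
  function FindHandler(const Table: array of TOptEntry; op: tasmop;
    out handler: TOptHandler): Boolean;
  var
    lo, hi, mid: Integer;
  begin
    Result := False;
    lo := Low(Table);
    hi := High(Table);
    { at most ceil(log2(n)) iterations for a table of n entries }
    while lo <= hi do
      begin
        mid := (lo + hi) div 2;
        if Table[mid].op = op then
          begin
            handler := Table[mid].handler;
            Result := True;
            Exit;
          end
        else if Table[mid].op < op then
          lo := mid + 1
        else
          hi := mid - 1;
      end;
  end;

The returned method pointer can then be called as handler(p), which is
exactly where the identical call setup from the second question pays off.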
I think that's everything anyway.  I admit my motivations have slightly
negative roots: I remember when the main advertising point of Turbo
Pascal 7.0 was the speed of its compiler, and I'm harking back to those
days - that, and the size of the binaries produced when compared to, say,
Microsoft Visual C++.  Of course, size isn't as big a concern as it was
back in the days when you were limited to a 1.44 MB floppy disk, but if
there's a place where I can eliminate or restructure inefficient or
superfluous code, I'll take it!
If someone can fill me in: how do case nodes work currently?  I know there's
a kind of linear search where the ordinal variable has a number subtracted
from it that corresponds to the next case node, and something called a
"jump table", but I'm not entirely sure how that works yet.
Gareth aka. Kit
J. Gareth Moreton
2018-06-24 22:12:43 UTC
If I had to start somewhere though... here is an extract of the
disassembly of the pass 1 optimizer from SVN head:

x86_64aoptcpu.pas:105   result:=OptPass1OP(p);
488b16                   mov    (%rsi),%rdx
4889d9                   mov    %rbx,%rcx
e8414f0000               callq  0x100189a30
4088c7                   mov    %al,%dil
e951000000               jmpq   0x100184b48

90                       nop
x86_64aoptcpu.pas:110   result:=OptPass1MOVXX(p);
4889f2                   mov    %rsi,%rdx
4889d9                   mov    %rbx,%rcx
e8cd4c0000               callq  0x1001897d0
4088c7                   mov    %al,%dil
eb40                     jmp    0x100184b48

x86_64aoptcpu.pas:112   result:=OptPass1LEA(p);
4889f2                   mov    %rsi,%rdx
4889d9                   mov    %rbx,%rcx
e8ad500000               callq  0x100189bc0
4088c7                   mov    %al,%dil
eb30                     jmp    0x100184b48

x86_64aoptcpu.pas:114   result:=OptPass1Sub(p);
4889f2                   mov    %rsi,%rdx
4889d9                   mov    %rbx,%rcx
e89d580000               callq  0x10018a3c0
4088c7                   mov    %al,%dil
eb20                     jmp    0x100184b48

There are some interesting things to note here.  OptPass1OP has "const p:
tai" rather than "var p: tai" as a parameter, and this causes an
unfortunate difference in the set-up, namely "mov (%rsi),%rdx" instead of
"mov %rsi,%rdx".  If these were all fixed to be "var p: tai" (I might
actually do this, since it's a very simple change), then all the case nodes
would be identical save for the call address, and would make a better
experimental case for optimisation improvements.  Additionally, after the
call there lies "mov %al,%dil" and an identical jump (to the end of the
case block).  The MOV could easily be moved to after the jump with some
careful code analysis, which might then allow for further optimisation,
depending on what is then done with %dil.

Gareth aka. Kit
Martin
2018-06-24 23:18:42 UTC
Post by J. Gareth Moreton
On paper, a binary search is the fastest method to locate a data
pointer associated with a key (in this case, an opcode).
Forgive me for stepping in, given that I hardly looked at your work, nor
the particular part of fpc.
But the fastest way should be a hash look up (O(n) = 1)

I might have missed something, so maybe I am completely out of line....

You have an enum. This could be used as a hash key (no need to
calculate, just take the raw value, and you have a perfect hash).

All you need to do is build the hash table once:
  lookup = Array[tasmop] of CodePointer;
and in unit initialization set the code pointer for the known opcodes.

Then you do not need to search, just:
   codepointer := lookup[taicpu(p).opcode]
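Spelled out a little more fully, that could look like the sketch below.
It uses a method-pointer type rather than a raw CodePointer so the entry
can be called directly; OptPass1Lookup, BuildOptPass1Lookup and
TSomeAsmOptimizer are placeholder names, and the exact opcode constants
(A_LEA, A_SUB) and the class the OptPass1* routines belong to are my
assumptions, not a statement about the actual compiler sources:

  type
    { matches the "function(var p: tai): Boolean" signature of the routines }
    TOptHandler = function(var p: tai): Boolean of object;

  var
    { one slot per opcode; globals start out nil, so opcodes without a
      routine simply stay unassigned }
    OptPass1Lookup: array[tasmop] of TOptHandler;

  { run once during optimizer setup; "optimizer" stands for whatever
    object the OptPass1* routines are methods of }
  procedure BuildOptPass1Lookup(optimizer: TSomeAsmOptimizer);
  begin
    OptPass1Lookup[A_LEA] := @optimizer.OptPass1LEA;
    OptPass1Lookup[A_SUB] := @optimizer.OptPass1Sub;
    { ...one assignment per opcode that has a routine }
  end;

  { in the pass 1 loop, replacing the big case statement
    (handler is a local variable of type TOptHandler): }
  handler := OptPass1Lookup[taicpu(p).opcode];
  if Assigned(handler) then
    Result := handler(p);

The Assigned test takes the place of "no matching case branch", so opcodes
without an optimisation routine cost only one array read and one compare.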
J. Gareth Moreton
2018-06-25 03:23:00 UTC
Actually you're right - that is the fastest method (it's written O(1), by
the way, if it doesn't depend on n).  The drawback is that the large
majority of instructions don't have such an optimisation procedure, so the
table would be extremely sparse.  There are 1,109 individual
instructions on x86 that FPC currently supports.  Come to think of it,
even that is not too bad as far as memory goes - you're looking at 8,872
bytes on i386 (4 bytes for the instruction and 4 bytes for a pointer) and
17,744 bytes on x86-64 (8 bytes for each field), both of which are
potentially cacheable on the CPU and hence won't be too costly to look
up.  Not sure why I overlooked that one.  Thanks Martin.

Gareth aka. Kit
Martin
2018-06-25 10:18:56 UTC
Probably not worth it, but if you want to save memory (and only a few
entries have data), use two maps:

map1: array[opcode] of byte; // I do not know how efficient reading
bytes is....
      maps the opcode to one of the (hopefully fewer than 255) active entries

map2: array[byte] of codepointer
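For what it's worth, a minimal sketch of that two-level layout might look
like the following, reusing the TOptHandler method-pointer type from the
earlier sketches; OpToIndex, IndexToHandler and LookupHandler are
placeholder names:

  var
    { first level: opcode -> small index, with 0 meaning "no routine" }
    OpToIndex: array[tasmop] of Byte;
    { second level: small index -> routine (index 0 is unused) }
    IndexToHandler: array[1..255] of TOptHandler;

  function LookupHandler(op: tasmop; out handler: TOptHandler): Boolean;
  var
    idx: Byte;
  begin
    idx := OpToIndex[op];
    Result := idx <> 0;
    if Result then
      handler := IndexToHandler[idx];
  end;

This trades one extra indirection for a much smaller footprint: one byte
per opcode plus a small second table with one entry per active routine.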
J. Gareth Moreton
2018-06-25 03:57:33 UTC
Of course, there's nothing to stop us from having a smaller table with an
actual hash, but that may introduce too much constant overhead, especially
from a modulus operation.
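As a rough illustration (all names here are hypothetical, and TOptHandler
is the same placeholder method-pointer type as before), a small
open-addressed table keyed on the opcode's ordinal value might look like
this - and if the table size is a power of two, the modulus can be replaced
by a bitwise "and", which removes most of that overhead:

  const
    HashSize = 64;  { power of two, comfortably above the number of handled opcodes }

  type
    THashSlot = record
      op: tasmop;
      handler: TOptHandler;
      used: Boolean;
    end;

  var
    HashTable: array[0..HashSize - 1] of THashSlot;

  function HashLookup(op: tasmop; out handler: TOptHandler): Boolean;
  var
    idx: Integer;
  begin
    Result := False;
    idx := Ord(op) and (HashSize - 1);  { cheap stand-in for "mod HashSize" }
    { linear probing; assumes the table is never completely full }
    while HashTable[idx].used do
      begin
        if HashTable[idx].op = op then
          begin
            handler := HashTable[idx].handler;
            Result := True;
            Exit;
          end;
        idx := (idx + 1) and (HashSize - 1);
      end;
  end;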

Gareth aka. Kit

J. Gareth Moreton
2018-06-25 08:32:51 UTC
In the meantime, I've just submitted this patch:
https://bugs.freepascal.org/view.php?id=33908 - it changes those routines
with "const p: tai" to "var p: tai".

Gareth aka. Kit

J. Gareth Moreton
2018-06-25 14:18:18 UTC
One other problem with this method, and I think it has caused problems
before - https://bugs.freepascal.org/view.php?id=32079 - is what happens if
an invalid value is passed in for the opcode.  Granted, a sanity check can
be employed.
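For illustration, the check could be as simple as the following, using the
hypothetical OptPass1Lookup table from the earlier sketch; reporting it via
the compiler's internal error mechanism is just one possible reaction, and
the error code shown is an arbitrary example:

  { guard against a corrupted or out-of-range opcode before indexing the table }
  if (taicpu(p).opcode >= Low(OptPass1Lookup)) and
     (taicpu(p).opcode <= High(OptPass1Lookup)) then
    begin
      if Assigned(OptPass1Lookup[taicpu(p).opcode]) then
        Result := OptPass1Lookup[taicpu(p).opcode](p);
    end
  else
    InternalError(2018062500);  { or simply skip the instruction }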
One would need to develop some tight timing tests to determine the fastest
method for a given number of cases.  As Florian once stated... it takes a
large number of jumps before the standard linear subtraction search (where
the ordinal difference is subtracted from each case branch in turn) becomes
prohibitive.  On modern processors that threshold can be as high as 64
cases.
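As a starting point for such timing tests, a self-contained micro-benchmark
could be as simple as the skeleton below.  The lookup being measured is a
dummy placeholder; in practice the linear, binary and table-based searches
from earlier in the thread would be swapped in:

  program OptSearchBench;
  {$mode objfpc}

  uses
    SysUtils;

  const
    Iterations = 10000000;
    OpcodeCount = 1109;  { number of x86 instructions mentioned above }

  { placeholder for whichever search strategy is being measured }
  function DummyLookup(key: Integer): Boolean;
  begin
    Result := (key and 1) = 0;
  end;

  var
    i, hits: Integer;
    t0: QWord;
  begin
    hits := 0;
    t0 := GetTickCount64;
    for i := 1 to Iterations do
      if DummyLookup(i mod OpcodeCount) then
        Inc(hits);
    WriteLn('Elapsed: ', GetTickCount64 - t0, ' ms (hits: ', hits, ')');
  end.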

Gareth aka. Kit

Martin
2018-06-25 15:29:09 UTC
How well does kcachegrind represent those timings?

Granted, you don't want to run the entire compiler in kcachegrind - that
might take very long.  But if you extract the relevant code into a
test/benchmark project, that might help.
