Discussion:
Optimization of redundant mov's
(too old to reply)
Martok
2017-03-18 22:32:55 UTC
Permalink
Raw Message
Hi all,

there has been some discussion about FPCs optimizer in #31444, prompting me to
investigate some of my own code. Generally speaking the generated assembler is
not all that bad (I like how it uses LEA for almost all integer arithmetics),
but I keep seeing sections with lots of redundant MOVs.

Example, from a SHA512 implementation:
CurrentHash is a field of the current class, compiled with anything above -O2,
-CpCOREAVX2, -Px86_64.

a:= CurrentHash[0]; b:= CurrentHash[1]; c:= CurrentHash[2]; d:= CurrentHash[3];
0000000100074943 488b8424a0020000 mov 0x2a0(%rsp),%rax
000000010007494B 4c8b5038 mov 0x38(%rax),%r10
000000010007494F 488b8424a0020000 mov 0x2a0(%rsp),%rax
0000000100074957 4c8b5840 mov 0x40(%rax),%r11
000000010007495B 488b9424a0020000 mov 0x2a0(%rsp),%rdx
0000000100074963 488b4248 mov 0x48(%rdx),%rax
0000000100074967 488b9424a0020000 mov 0x2a0(%rsp),%rdx
000000010007496F 488b6a50 mov 0x50(%rdx),%rbp

Every single one of the "mov 0x2a0(%rsp), %rxx" instructions except the first is
redundant and causes another memory round-trip. At the same time, more registers
are used, which probably makes other optimizations more difficult, especially
when something similar happens on i386.

Now, the fun part: I haven't been able to build a simple test that causes the
same issue (the self-pointer already is in %rcx and not fetched from the stack
each time), so I have a feeling this may be a side effect of some other part of
the code.

Does this sound familiar to anyone? If so, what could I do about it?


Regards,

Martok

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/f
Jonas Maebe
2017-03-19 09:25:07 UTC
Permalink
Raw Message
Post by Martok
a:= CurrentHash[0]; b:= CurrentHash[1]; c:= CurrentHash[2]; d:= CurrentHash[3];
0000000100074943 488b8424a0020000 mov 0x2a0(%rsp),%rax
000000010007494B 4c8b5038 mov 0x38(%rax),%r10
000000010007494F 488b8424a0020000 mov 0x2a0(%rsp),%rax
0000000100074957 4c8b5840 mov 0x40(%rax),%r11
000000010007495B 488b9424a0020000 mov 0x2a0(%rsp),%rdx
0000000100074963 488b4248 mov 0x48(%rdx),%rax
0000000100074967 488b9424a0020000 mov 0x2a0(%rsp),%rdx
000000010007496F 488b6a50 mov 0x50(%rdx),%rbp
Every single one of the "mov 0x2a0(%rsp), %rxx" instructions except the first is
redundant and causes another memory round-trip. At the same time, more registers
are used, which probably makes other optimizations more difficult, especially
when something similar happens on i386.
Now, the fun part: I haven't been able to build a simple test that causes the
same issue (the self-pointer already is in %rcx and not fetched from the stack
each time), so I have a feeling this may be a side effect of some other part of
the code.
It's called register spilling: once there are no registers left to hold
values, the compiler has to pick registers whose value will be kept in
memory instead. Register allocation is an NP-complete problem, so the
result will never be 100% optimal (at least if you don't want to wait
forever while the compiler checks out all possible assignments). One
possible heuristic, which is used by FPC's register allocator, is to
spill the register that conflicts with the largest number of other
registers (to minimise the number of registers spilled to memory).

There are techniques to more optimally spill (e.g. live range
splitting), and there are also other kinds of optimisations that could
be run after register allocation to make the code more optimal. CSE at
the assembler level could be used in this case. That's a very complex
undertaking for relatively little gain though. E.g. those memory loads
are probably optimised by the processor itself (not necessarily coming
even from the L1 cache, but possibly from the write-back buffer).


Jonas
_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-
Martok
2017-03-19 20:28:33 UTC
Permalink
Raw Message
Hi,
Post by Jonas Maebe
It's called register spilling: once there are no registers left to hold
values, the compiler has to pick registers whose value will be kept in
memory instead.
I thought it would be something like that...

Still, my main issue was with the repeated fetches. I'd (naively!) say that it
should be relatively easy for an assembly-level optimizer to detect that these
are repeated loads of the same thing, with nothing that could affect the outcome
inbetween. It's not even a CSE in the technical sense, not a sub-expression but
the entire thing...
Post by Jonas Maebe
E.g. those memory loads
are probably optimised by the processor itself (not necessarily coming
even from the L1 cache, but possibly from the write-back buffer).
Not as well as one might believe, manually fixing (by forcing @CurrentHash into
a register with a local variable) just those 4 lines gives a ~2% increase in
MB/s for this hash. Which is quite a lot, given this is the part *without*
actual computations.

And again, I've seen this happen more than once on i386 code, where it even
creates "fake" register pressure (by using 2 or more registers to hold exactly
the same temporary) that makes the rest of the code worse than it could be.
As a ballpark: the same change as above results in a 10% speedup by freeing up 2
registers (all-int64 operations on i386, so 2 regs needed for everything, having
one more is very noticeable...)

It just strikes me as odd to have some rather good local code but then just
pointlessly add the second-most expensive operation in between ;-)


Regards,

Martok



_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mail
Jonas Maebe
2017-03-20 07:50:42 UTC
Permalink
Raw Message
Post by Martok
Post by Jonas Maebe
It's called register spilling: once there are no registers left to hold
values, the compiler has to pick registers whose value will be kept in
memory instead.
I thought it would be something like that...
Still, my main issue was with the repeated fetches. I'd (naively!) say that it
should be relatively easy for an assembly-level optimizer to detect that these
are repeated loads of the same thing, with nothing that could affect the outcome
inbetween. It's not even a CSE in the technical sense, not a sub-expression but
the entire thing...
It is trivial to create a peephole optimization for that particular
pattern. At least if it's just two loads, because after you've optimized
the second load into a register move, the third load no longer fits the
pattern... Unless you create a special peephole optimizer pass that goes
over the code backwards to apply this specific optimization, or you
first match the pattern as many times as possible before changing it.
But then it will still fail if there is at least one other instruction
in between.

So then you have to slightly generalise it, and in the end you do end up
with a full-blown assembler CSE optimizer, like the one we removed for
3.0. I'm a staunch believer in not wasting time on stuff like that, it's
just not worth it. Especially since a better register allocator, or SSA,
can probably achieve the same thing in this case.
Post by Martok
Post by Jonas Maebe
E.g. those memory loads
are probably optimised by the processor itself (not necessarily coming
even from the L1 cache, but possibly from the write-back buffer).
a register with a local variable) just those 4 lines gives a ~2% increase in
MB/s for this hash. Which is quite a lot, given this is the part *without*
actual computations.
You cannot attribute those 2% exclusively to keeping the values in
registers. E.g. removing them can change branch target alignments. Even
adding random nops can get you 10% due to changed code layout.
Post by Martok
And again, I've seen this happen more than once on i386 code, where it even
creates "fake" register pressure (by using 2 or more registers to hold exactly
the same temporary)
That's again something that needs to be solved at the register allocator
level (with SSA). Freeing up registers anymore afterwards is useless,
since only the register allocator can keep stuff in them permanently.


Jonas
_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.fre
Florian Klämpfl
2017-03-24 17:54:26 UTC
Permalink
Raw Message
Post by Martok
And again, I've seen this happen more than once on i386 code, where it even
creates "fake" register pressure (by using 2 or more registers to hold exactly
the same temporary)
That's again something that needs to be solved at the register allocator level (with SSA).
SSA is still on my todo list :)

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/

Loading...