Discussion:
Loop unrolling question
(too old to reply)
J. Gareth Moreton
2018-08-03 19:57:14 UTC
Permalink
Hi everyone,

So I'm pondering about attempting to implement the binary search system I
devised for case blocks.  To find the correct case label, you need to
iterate a search loop a fixed maximum of ceil(log2(n)) times, where n is
the number of individual case values (excluding "else").  Given that this
is quite a low number and I managed to get the loop itself down to just 8
instructions (not including the comparison and conditional jump at the
end), this could be unwrapped for low values of ceil(log2(n))... possibly
up to 7 or 8, I'd say.  However, if an exact match is found early, how
serious is the penalty of having a conditional jump forward on each
unrolled iteration for x86-64, whether it's taken or not? (This would be
the equivalent of a Break command)

Gareth aka. Kit
J. Gareth Moreton
2018-08-03 23:25:14 UTC
Permalink
(You replied to me directly, Kirinn, not
to the mail list)

Thanks Kirinn. The loop count will likely
be only 6 iterations, at least in my test
case which is stage 1 of the peephole
optimizer in the i386 and x86-64 build of
the compiler, consisting of (currently) 36
branches that call identically-structured
functions.

In the plan I have currently, I won't need
an extra comparison because it would be
something like this:

cmp (tableaddr), %value
je @MatchFound
cmovle %midindex, %loindex
cmovge %midindex, %hiindex

I can see it maybe being a waste on the
last iteration, maybe the second-to-last
too. It would require some empirical tests
though.

Gareth aka. Kit

On Fri 03/08/18 23:07 , "Kirinn"
Each comparison and conditional jump
takes 2 instructions immediately,
which is reduced to 1 instruction
through macro-op fusion, which means a
single loop iteration becomes maybe 12%
more costly (assuming simple
instructions)... but every processor
varies, of course.
Forward jumps are normally predicted to
not be taken, so there is no
further execution penalty if it's not an
exact match.
The additional instructions do take up a
little bit of space in the
instruction cache, and the conditional
jumps may mess up the existing
branch prediction table, causing a
presumably inconsequential penalty
slightly afterward.
The penalty for a branch misprediction
is a pipeline flush. According to
Agner Fog's research
(https://www.agner.org/optimize/microarchi
tecture.pdf
[1]), the penalty varies by processor.
Pentium MMX 4-5 cycles, PPro/P2/P3
10-20 cycles, P4 25-100+ cycles, Core2
about 15 cycles, Nehalem 17+ cycles,
Sandy/Ivy Bridge 15+ cycles, Haswell+
15-20 cycles, AMD Bulldozer etc ~20
cycles, Ryzen ~18 cycles. Evidently,
processors are designed to execute 2-5
instructions per cycle, complicated a
lot by micro-op splitting and
pipelining, so we might guesstimate a
misprediction penalty to be worth
around 30-100 instructions?
Assuming case blocks tend to be
relatively small, running through without
the conditional checks is probably
usually faster. But the only way to be
sure is to test on a bunch of
processors...
~Kirinn
On 08/03/2018 08:57 PM, J. Gareth
Hi everyone,
So I'm pondering about attempting to
implement the binary search system I
devised for case blocks.  To find the
correct case label, you need to
iterate a search loop a fixed maximum of
ceil(log2(n)) times, where n is
the number of individual case values
(excluding "else").  Given that this
is quite a low number and I managed to
get the loop itself down to just 8
instructions (not including the
comparison and conditional jump at the
end), this could be unwrapped for low
values of ceil(log2(n))... possibly
up to 7 or 8, I'd say.  However, if an
exact match is found early, how
serious is the penalty of having a
conditional jump forward on each
unrolled iteration for x86-64, whether
it's taken or not? (This would be
the equivalent of a Break command)
Gareth aka. Kit
__________________________________________
_____ 
fpc-devel maillist - fpc-
http://lists.freepascal.org/cgi-
bin/mailman/listinfo/fpc-devel 
------
[1]
https://www.agner.org/optimize/microarchit
ecture.pdf
_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-deve
Benito van der Zander
2018-08-04 22:17:16 UTC
Permalink
Hi,

case of strings could also use some optimization

does it still call fpc_ansistr_compare_equal ?

fpc could hash the strings, or build a trie, even testing the length
first would help


Cheers,
Benito
Post by J. Gareth Moreton
Hi everyone,
So I'm pondering about attempting to implement the binary search system I
devised for case blocks.  To find the correct case label, you need to
iterate a search loop a fixed maximum of ceil(log2(n)) times, where n is
the number of individual case values (excluding "else").  Given that this
is quite a low number and I managed to get the loop itself down to just 8
instructions (not including the comparison and conditional jump at the
end), this could be unwrapped for low values of ceil(log2(n))... possibly
up to 7 or 8, I'd say.  However, if an exact match is found early, how
serious is the penalty of having a conditional jump forward on each
unrolled iteration for x86-64, whether it's taken or not? (This would be
the equivalent of a Break command)
Gareth aka. Kit
_______________________________________________
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
J. Gareth Moreton
2018-08-04 21:45:43 UTC
Permalink
I think the fastest approach with case of strings, especially if it's more
than a few entries, is to sort them into alphabetical order (by ASCII code)
then length, as then you only have to test one character at a time and the
moment you find no match, you can drop out completely. (e.g. searching for
"ANNUL" in a list that contains "A, ALL, AND, ANSWER"... it finds a match
in A, a mismatch in L but then sees N, but when it reaches the second N, it
find D and then S, and S is after N, so there can't be a match).  In other
words, this would be a trie, although there would have to be safeguards
against a false partial match (e.g. the non-existent "ANS" which appears as
a part of "ANSWER") such as encoding a jump to the else block.

I've done a bit more theoretical work on the binary search loop, and
managed to get the loop count down from ceil(log(n)) to floor(log(n)), with
the last iteration being much simpler than the others.  I'm tempted to
write up a Wiki article with the proposal once I've got my notes in
order.  If programmed, there would be two good test cases for it... the
x86-64 peephole optimizer, which has, as of writing, a case block with 36
branches (counting each case label separately) that all call
identically-structured functions, and the source code analyser of Lazarus
that hashes words and makes them bold if they match a list of keywords
(this is also another approach to the 'case of strings' problem).

Gareth aka. Kit

On Sat 04/08/18 23:17 , Benito van der Zander ***@benibela.de sent:
Hi,

case of strings could also use some optimization
does it still call fpc_ansistr_compare_equal ?

fpc could hash the strings, or build a trie, even testing the length first
would help

Cheers,
Benito 

Am 03.08.2018 um 21:57 schrieb J. Gareth Moreton:

 Hi everyone, 
 
 So I'm pondering about attempting to implement the binary search system I 
devised for case blocks.  To find the correct case label, you need to 
iterate a search loop a fixed maximum of ceil(log2(n)) times, where n is 
the number of individual case values (excluding "else").  Given that this 
is quite a low number and I managed to get the loop itself down to just 8 
instructions (not including the comparison and conditional jump at the 
end), this could be unwrapped for low values of ceil(log2(n))... possibly 
up to 7 or 8, I'd say.  However, if an exact match is found early, how 
serious is the penalty of having a conditional jump forward on each 
unrolled iteration for x86-64, whether it's taken or not? (This would be 
the equivalent of a Break command) 
 
 Gareth aka. Kit 
 

_______________________________________________ 
fpc-devel maillist - fpc-***@lists.freepascal.org 
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel 

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org [1]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[2]">http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Links:
------
[1] mailto:fpc-***@lists.freepascal.org
[2] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Loading...