Discussion:
[fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
Simon Kissel
2018-10-20 14:07:20 UTC
Hi,

I assume everybody here still knows who I am, so I'll drop the
introduction part.

In our products, we use FPC for a couple of targets. However, for all
of Linux x86 platforms, we still have to use Kylix (CrossKylix). This
is because for our code, FPC on these platforms compiles code that
is 25% slower than Kylix, and up to 50% when it comes to
multi-threaded stuff.

We know about a couple of bottlenecks (fpc_pushexceptaddr /
RelocateThreadVar etc) which explain FPC's terrible multi-threading
performance, but in general, FPC's code generator really is quite
a mess, which we learned the hard way a couple of years ago when we
did optimization work on the ARM target.

Due to us having to stick to Kylix, we cannot use any of the
Object Pascal language features of the last 15 years,
which is frustrating. It also prevents us from fully moving over
to Unicode.

I'd therefore like to put out a 15.000 Euro bounty for whoever
brings FPC at least on par with Kylix when it comes to executable
speed in multi-threaded scenarios, but first would like to discuss
with you guys what route should be taken (the list is not
complete and not mutually exclusive, of course):

- Complete the LLVM branch of FPC. It looks like Jonas has stopped
working on it two years ago, which is a pity.

- Rewrite the code generator, for example in a SSA-IR way

- Make Exception handling, TLS etc use the infrastructure that
libpthread is providing

The requirements for my bounty would be:

- Must bring executable speed for non-floating-point load
on both multithreaded and non-multithreaded workloads to
the speed of Kylix-compiled binaries

- Improvements should also help on ARM targets

- An LLVM-based solution must allow inline assembler for
all x86 and ARM

- Must be completed by February 2019

So, any suggestions on how to move forward on this?

Cheers,

Simon

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freep
Sven Barth via fpc-devel
2018-10-25 07:06:14 UTC
Post by Simon Kissel
- Complete the LLVM branch of FPC. It looks like Jonas has stopped
working on it two years ago, which is a pity.
I personally don't think that LLVM is the way to go. It's essentially a
moving target and adds an unnecessary dependency to the compiler.

- Rewrite the code generator, for example in a SSA-IR way
Didn't Florian work on that already? I wonder how far he is by now 🤔

Post by Simon Kissel
- Make Exception handling, TLS etc use the infrastructure that
libpthread is providing
I'm against having such a basic functionality depend on an external library
as I quite enjoy that FPC can be used without any dependencies on Linux.
However I am in favor of introducing DWARF exception handling that should
have similar benefits as SEH on Win64 if I remember correctly.
And for threadvars we could try to implement a different mechanism as well.
I think there was some experiment for that some time ago 🤔

A further problem is that not all of us have access to Kylix so that not
everyone can compare the performance.

Regards,
Sven
Florian Klaempfl
2018-10-25 16:40:42 UTC
Post by Simon Kissel
- Complete the LLVM branch of FPC. It looks like Jonas has stopped
  working on it two years ago, which is a pity.
I personally don't think that LLVM is the way to go. It's essentially a
moving target and adds an unnecessary dependency to the compiler.
Me neither :)
Post by Simon Kissel
- Rewrite the code generator, for example in a SSA-IR way
Didn't Florian work on that already? I wonder how far he is by now 🤔
Got distracted by other stuff, but also because I do not believe that it
matters much for a lot of real-world programs (small benchmarks are another
story).
Ben Grasset
2018-10-27 03:45:39 UTC
On Thu, Oct 25, 2018 at 3:06 AM Sven Barth via fpc-devel <
Post by Sven Barth via fpc-devel
Post by Simon Kissel
- Complete the LLVM branch of FPC. It looks like Jonas has stopped
working on it two years ago, which is a pity.
I personally don't think that LLVM is the way to go. It's essentially a
moving target and adds an unnecessary dependency to the compiler.
Not really. The IR format has been pretty stable since version 3.9 or so
(LLVM is current at version 8.) As far as dependencies, it would add none
whatsoever other than a copy of the LLC or LLVM-AS binaries (as in, no more
than any other target FPC supports. Just think of it as yet another
assembler format.)
Sven Barth via fpc-devel
2018-10-27 07:27:59 UTC
Post by Ben Grasset
On Thu, Oct 25, 2018 at 3:06 AM Sven Barth via fpc-devel <
Post by Sven Barth via fpc-devel
Post by Simon Kissel
- Complete the LLVM branch of FPC. It looks like Jonas has stopped
working on it two years ago, which is a pity.
I personally don't think that LLVM is the way to go. It's essentially a
moving target and adds an unnecessary dependency to the compiler.
Not really. The IR format has been pretty stable since version 3.9 or so
(LLVM is current at version 8.) As far as dependencies, it would add none
whatsoever other than a copy of the LLC or LLVM-AS binaries (as in, no more
than any other target FPC supports. Just think of it as yet another
assembler format.)
It's more than just an additional assembler format as the infrastructure
inside the compiler shows. Also there are the problems that Jonas
mentioned.
In my opinion that time is better spent optimizing our own code generator.

Regards,
Sven
Martin Schreiber
2018-10-27 07:57:37 UTC
Post by Sven Barth via fpc-devel
Post by Ben Grasset
Not really. The IR format has been pretty stable since version 3.9 or so
(LLVM is current at version 8.) As far as dependencies, it would add none
whatsoever other than a copy of the LLC or LLVM-AS binaries (as in, no
more than any other target FPC supports. Just think of it as yet another
assembler format.)
It's more than just an additional assembler format as the infrastructure
inside the compiler shows. Also there are the problems that Jonas
mentioned.
In my opinion that time is better spent optimizing our own code generator.
MSElang uses the approach to write LLVM bitcode directly without a temporary
LLVM assembler text. Building the needed LLVM lists and tracking the ssa
values is not trivial. IMO the worst aspect of LLVM is its slowness but the
resulting code is awesome.

Martin
Jonas Maebe
2018-10-27 12:46:37 UTC
Post by Ben Grasset
As far as dependencies, it would add
none whatsoever other than a copy of the LLC or LLVM-AS binaries (as in,
no more than any other target FPC supports. Just think of it as yet
another assembler format.)
You also need "opt" if you want to perform full optimizations (or just
use clang, which, among other things, combines the functionality of llc and opt).

There's one more problem I forgot to mention in my first post, and it is
probably a deal breaker for the original bounty: LLVM does not support
Borland's fastcall calling convention for i386. So you would need to add
support for Borland fastcall on i386 to LLVM if it has to support
existing i386 inline assembly routines written for FPC/Delphi.

Finally, adding support for 32 bit targets in FPC's LLVM backend would
also require some work due to how FPC's code generator is structured,
and due to the fact that we need to have two code generators in a single
binary (the native one to support the generation of entry and exit code
for pure inline assembler routines, and the LLVM one for the rest).


Jonas
Ben Grasset
2018-10-27 16:21:44 UTC
Post by Jonas Maebe
You also need "opt" if you want to perform full optimizations (or just
use clang, which a.o. combines the functionality of llc and opt).
There's one more problem I forgot to mention in my first post, and it is
probably a deal breaker for the original bounty: LLVM does not support
Borland's fastcall calling convention for i386. So you would need to add
support for Borland fastcall on i386 to LLVM if it has to support
existing i386 inline assembly routines written for FPC/Delphi.
Finally, adding support for 32 bit targets in FPC's LLVM backend would
also require some work due to how FPC's code generator is structured,
and due to the fact that need to have two code generators in a single
binary (the native one to support the generation of entry and exit code
for pure inline assembler routines, and the LLVM one for the rest).
LLC (at least now) statically links the necessary parts of LLVM and works
independently of Opt, with a simpler set of command line options (it just
has overall O1, O2, and O3 flags.)

As far as the point about assembly on 32 bit, while it does seem like that
would be a problem for the bounty requirements, would it really be the end
of the world in a more general sense? I can't imagine people who are still
using 32-bit-hardware and writing 32-bit applications would complain if the
LLVM backend was not available for 32-bit.

Anyways though, I do think code gen improvements for FPC, LLVM or not, are
likely going to be a lot more widely helpful than just rewriting exception
handling.... (not that rewriting exception handling is a bad idea.) I think
there's a lot of people who would like FPC to generate faster code than it
currently does. Can you recommend any known areas in need of improvement of
the non-platform-specific parts of the code generators that might be a good
place to start for someone who's an experienced Pascal developer but hasn't
worked with the compiler codebase before?
Florian Klämpfl
2018-10-27 16:42:59 UTC
Post by Jonas Maebe
  
You also need "opt" if you want to perform full optimizations (or just
use clang, which a.o. combines the functionality of llc and opt).
There's one more problem I forgot to mention in my first post, and it is
probably a deal breaker for the original bounty: LLVM does not support
Borland's fastcall calling convention for i386. So you would need to add
support for Borland fastcall on i386 to LLVM if it has to support
existing i386 inline assembly routines written for FPC/Delphi.
Finally, adding support for 32 bit targets in FPC's LLVM backend would
also require some work due to how FPC's code generator is structured,
and due to the fact that need to have two code generators in a single
binary (the native one to support the generation of entry and exit code
for pure inline assembler routines, and the LLVM one for the rest).
LLC (at least now) statically links the necessary parts of LLVM and works independently of Opt, with a simpler set of
command line options (it just has overall O1, O2, and O3 flags.)
As far as the point about assembly on 32 bit, while it does seem like that would be a problem for the bounty
requirements, would it really be the end of the world in a more general sense? I can't imagine people who are still
using 32-bit-hardware and writing 32-bit applications would complain if the LLVM backend was not available for 32-bit.
Anyways though, I do think code gen improvements for FPC, LLVM or not, are likely going to be a lot more widely helpful
than just rewriting exception handling....
If you read the whole thread, LLVM needs a rewritten exception handling as well. Further, a quick test
of table based exception handling on bansi1 (which is mainly a memory manager test) gives:

standard exception handling:

fpctrunk\tests\bench>pp11 bansi1 -O3

fpctrunk\tests\bench>bansi1
Test 1: 1000000 done in 0.537 sec
Test 2: 1000000 done in 0.535 sec
Test 3: 1000000 done in 0.587 sec

SEH based exception handling:

fpctrunk\tests\bench>pp11 bansi1 -O3

fpctrunk\tests\bench>bansi1
Test 1: 1000000 done in 0.456 sec
Test 2: 1000000 done in 0.457 sec
Test 3: 1000000 done in 0.446 sec

Michael Van Canneyt
2018-10-27 17:19:47 UTC
Post by Florian Klämpfl
If you read the whole thread, LLVM needs a rewritten exception handling as well. Further, a quick test
fpctrunk\tests\bench>pp11 bansi1 -O3
fpctrunk\tests\bench>bansi1
Test 1: 1000000 done in 0.537 sec
Test 2: 1000000 done in 0.535 sec
Test 3: 1000000 done in 0.587 sec
fpctrunk\tests\bench>pp11 bansi1 -O3
fpctrunk\tests\bench>bansi1
Test 1: 1000000 done in 0.456 sec
Test 2: 1000000 done in 0.457 sec
Test 3: 1000000 done in 0.446 sec
Florian, I am not sure what this is supposed to prove ?

It's 15% off the elapsed time (almost 1/6th), that seems worth spending some time on...
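
A quick way to check that figure, as a minimal sketch that uses only the timings quoted above (the variable and helper names are made up for illustration):

```python
# Timings in seconds, copied from the two bansi1 runs quoted above.
standard = [0.537, 0.535, 0.587]     # standard exception handling
table_based = [0.456, 0.457, 0.446]  # SEH (table based) exception handling


def reduction_percent(before, after):
    """Percentage by which elapsed time dropped going from before to after."""
    return (before - after) / before * 100.0


# Reduction for each of the three tests, then for the summed run time.
per_test = [reduction_percent(b, a) for b, a in zip(standard, table_based)]
overall = reduction_percent(sum(standard), sum(table_based))

for i, pct in enumerate(per_test, start=1):
    print(f"Test {i}: {pct:.1f}% less elapsed time")
print(f"All tests: {overall:.1f}% less elapsed time")
```

Tests 1 and 2 come out at roughly 15% each, while test 3 drops by about 24%, so 15% is the conservative end of the range for this single run.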

Michael.
Florian Klämpfl
2018-10-27 17:38:25 UTC
Post by Michael Van Canneyt
Post by Florian Klämpfl
If you read the whole thread, LLVM needs a rewritten exception handling as well. Further, a quick test
fpctrunk\tests\bench>pp11 bansi1 -O3
fpctrunk\tests\bench>bansi1
Test 1: 1000000 done in 0.537 sec
Test 2: 1000000 done in 0.535 sec
Test 3: 1000000 done in 0.587 sec
fpctrunk\tests\bench>pp11 bansi1 -O3
fpctrunk\tests\bench>bansi1
Test 1: 1000000 done in 0.456 sec
Test 2: 1000000 done in 0.457 sec
Test 3: 1000000 done in 0.446 sec
Florian, I am not sure what this is supposed to prove ?
That it is useful to work on table based exception handling for all targets ...
Post by Michael Van Canneyt
It's 15% off the elapsed time (almost 1/6th), that seems worth spending some time on...
Ben Grasset
2018-10-27 22:29:11 UTC
Post by Florian Klämpfl
That it is useful to work on table based exception handling for all targets
Not arguing with that at all. I was just trying to point out that I'm not a
fan of the idea that FPC's code generators are "good enough" as is.
Sven Barth via fpc-devel
2018-10-27 22:46:18 UTC
Post by Ben Grasset
Post by Florian Klämpfl
That it is useful to work on table based exception handling for all targets
Not arguing with that at all. I was just trying to point out that I'm not
a fan of the idea that FPC's code generators are "good enough" as is.
And no one said that it is. But points like table based exception handling
and section based threadvars can be achieved relatively easily and benefit
more targets, while working on the optimizer is usually per-platform work.
Except of course for optimizations that can be done on the platform
independent node tree.

Regards,
Sven
Ben Grasset
2018-10-28 00:11:18 UTC
On Sat, Oct 27, 2018 at 6:46 PM Sven Barth via fpc-devel <
Post by Sven Barth via fpc-devel
Except of course for optimizations that can be done on the platform
independent node tree.
That specifically is IMO the "key" to a higher compiler-wide level of
optimization capabilities, as shown by various more recent compilers for
other languages and also by LLVM. Target-CPU-level optimizations are
certainly still very necessary for some things, but if you pass the
assembly code generator better information to begin with they're not nearly
as relevant. I've been looking over the compiler codebase recently and
there's quite a few things that could obviously be done better IMO at the
top level before any platform-specific stuff comes into play.

There's also a number of things that would specifically help the build-time
performance of the compiler itself that I've noticed, such as there being
many, many, many, one-liner functions and procedures that should almost
certainly be marked as inline but currently are not. Also linked lists
absolutely everywhere, that would perform much better as array based lists.

If the core team is open to arbitrary/speculative patches I might try to
work out a few for what I think are the most important issues and submit
them for consideration sometime in the near future.
Ozz Nixon
2018-10-28 00:22:11 UTC
* Not arguing, but... *

Linked List faster than Array?
Unless I missed what you are talking about... I always teach programmers:

Array is the fastest collection to use, followed by Linked List, followed
by bTree, etc.

* Sorry for off topic - just that grabbed my "What did he just say?"
button...
Ozz Nixon
2018-10-28 00:22:52 UTC
SORRY - JUST RE-READ... that is what you are saying... it's late here ;-(
Post by Ozz Nixon
* Not arguing, but... *
Linked List faster than Array?
Array is the fastest collection to use, followed by Linked List, followed
by bTree, etc.
* Sorry for off topic - just that grabbed my "What did he just say?"
button...
Ben Grasset
2018-10-28 00:24:28 UTC
Post by Ozz Nixon
* Sorry for off topic - just that grabbed my "What did he just say?"
button..
Huh? I said "Also linked lists absolutely everywhere, that would perform
much better as array based lists."

Meaning, exactly the same thing you just implied. You got what I meant
completely backwards somehow.
Florian Klämpfl
2018-10-28 08:13:10 UTC
* Sorry for off topic - just that grabbed my "What did he just say?" button..
Huh? I said "Also linked lists absolutely everywhere, that would perform much better as array based lists."
Only if it does not increase memory fragmentation which is even now already a problem.

But there is another pretty simple optimization opportunity in this area: make the FPC heap manager capable of using
os-based memory reallocation. Kernel-based memory reallocation of large blocks has the big advantage that the OS can
move the memory contents only by re-mapping memory pages.
Simon Kissel
2018-10-28 11:46:45 UTC
Hi Florian,
Post by Florian Klämpfl
But there is another pretty simple optimization opportunity in this
area: make the FPC heap manager capable of using
os-based memory reallocation. Kernel-based memory reallocation of
large blocks has the big advantage that the OS can
move the memory contents only by re-mapping memory pages.
I fully agree that the memory manager for obvious reasons is
an important subject, especially for heavily multithreaded code,
and even more for any string stuff in such code. I haven't
informed myself enough to judge how well the FPC memory manager
behaves in this regard, and if it might make sense to try
to use an alternative memory manager with FPC for Linux.

However, being aware of that, we are avoiding reallocations
wherever we can and instantiate pretty much everything using
our own memory caches.

Simon

Sven Barth via fpc-devel
2018-10-28 11:52:11 UTC
Post by Simon Kissel
Hi Florian,
Post by Florian Klämpfl
But there is another pretty simple optimization opportunity in this
area: make the FPC heap manager capable of using
os-based memory reallocation. Kernel-based memory reallocation of
large blocks has the big advantage that the OS can
move the memory contents only by re-mapping memory pages.
I fully agree that the memory manager for obvious reasons is
an important subject, especially for heavily multithreaded code,
and even more for any string stuff in such code. I haven't
informed myself enough to judge how well the FPC memory manager
behaves in this regard, and if it might make sense to try
to use an alternative memory manager with FPC for Linux.
However, being aware of that, we are avoiding reallocations
wherever we can and instantiate pretty much every thing using
own memory caches.
I think Florian was talking about the memory management inside the compiler 🤔

Regards,
Sven
Florian Klämpfl
2018-10-28 08:33:30 UTC
There's also a number of things that would specifically help the build-time performance of the compiler itself that I've
noticed, such as there being many, many, many, one-liner functions and procedures that should almost certainly be marked
as inline but currently are not.
... because FPC can auto-inline if needed. However, the current auto-inline heuristic, which is pretty conservative
(read: inlines only very small subroutines), has exactly two effects: it makes the compiler executable bigger and
slower. A few bytes bigger would be ok, but slower is not acceptable, right? I can also tell you why it is slower: the
compiler is memory throughput limited, so everything which increases the memory footprint is bad. While (auto)inlining
helps very much for "normal" programs and benchmarks, for the compiler it is not a good solution.

The only thing I consider useful in this direction is to work on improving the auto inline heuristics by maybe adding
two methods: for pure size and for speed, if the program is not memory throughput limited.
Simon Kissel
2018-10-28 11:42:37 UTC
Hi Sven,
Post by Sven Barth via fpc-devel
And no one said that it is. But points like table based exception
handling and section based threadvars can be relatively easily
achieved and benefits more targets while working on the optimizer
usually is a per platform work.
I agree that this will very likely give a big boost. From what
I recall, on the oldest ARM platform we have (Marvell Kirkwood),
every access to threadvars right now involves a full CPU cache
flush (I forgot why exactly; it has been a long time).

Cheers,

Simon

Simon Kissel
2018-10-28 11:39:51 UTC
Hi Ben,
Post by Jonas Maebe
There's one more problem I forgot to mention in my first post, and it is
probably a deal breaker for the original bounty: LLVM does not support
Borland's fastcall calling convention for i386. So you would need to add
support for Borland fastcall on i386 to LLVM if it has to support
existing i386 inline assembly routines written for FPC/Delphi.
I don't see how not supporting fastcall would be a deal-breaker?
Post by Ben Grasset
As far as the point about assembly on 32 bit, while it does seem
like that would be a problem for the bounty requirements, would it
really be the end of the world in a more general sense? I can't
imagine people who are still using 32-bit-hardware and writing
32-bit applications would complain if the LLVM backend was not available for 32-bit.
We have tons of hand-tuned assembler library code for stuff
like encryption, and other libraries we use have it too, even those
that are multiplatform - think mORMot, for example.

Most of our embedded platforms sadly aren't and won't be 64bit.

Cheers,

Simon

Sven Barth via fpc-devel
2018-10-28 11:49:43 UTC
Post by Simon Kissel
Hi Ben,
Post by Jonas Maebe
There's one more problem I forgot to mention in my first post, and it is
probably a deal breaker for the original bounty: LLVM does not support
Borland's fastcall calling convention for i386. So you would need to add
support for Borland fastcall on i386 to LLVM if it has to support
existing i386 inline assembly routines written for FPC/Delphi.
I don't see how not supporting fastcall would be a deal-breaker?
You mean Jonas here I take it, not Ben.

Borland's Fastcall is more famously known as the Register calling
convention aka the default calling convention in Object Pascal. As you
admitted in your mail further down you have quite some assembly code and as
such you rely on the calling convention for parameter passing. Here
register differs significantly from cdecl or stdcall. Thus not supporting
the calling convention *will break* your code.
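
To make the difference concrete, here is a deliberately simplified model of where parameters end up under each convention on i386. It assumes only 32-bit integer or pointer parameters and ignores Self, floating point values, records and result pointers; the function and constant names are made up for illustration:

```python
# Registers used by the Borland "register" convention, in assignment order.
REGISTER_ORDER = ["EAX", "EDX", "ECX"]


def assign_parameters(convention, params):
    """Return (location, parameter) pairs; stack entries are in push order."""
    if convention == "register":
        # The first three parameters travel in EAX, EDX and ECX.
        in_regs = list(zip(REGISTER_ORDER, params))
        # Any remaining parameters go on the stack, pushed left to right.
        on_stack = [("stack", p) for p in params[len(REGISTER_ORDER):]]
        return in_regs + on_stack
    if convention in ("cdecl", "stdcall"):
        # Everything goes on the stack, pushed right to left.
        return [("stack", p) for p in reversed(params)]
    raise ValueError(f"unknown convention: {convention}")


for conv in ("register", "cdecl"):
    print(conv, assign_parameters(conv, ["a", "b", "c", "d"]))
```

An assembler routine written for register expects its first argument in EAX, so if the same routine is called as cdecl or stdcall it finds that argument on the stack instead and reads garbage from the register.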

Regards,
Sven
Simon Kissel
2018-10-28 12:04:06 UTC
Hi Sven,
Post by Sven Barth via fpc-devel
Borland's Fastcall is more famously known as the Register calling
convention aka the default calling convention in Object Pascal. As
you admitted in your mail further down you have quite some assembly
code and as such you rely on the calling convention for parameter
passing. Here register differs significantly from cdecl or stdcall.
Thus not supporting the calling convention *will break* your code. 
I don't expect that no (low-level library) code will break.
There are much bigger IFDEF hells than adapting assembler
boilerplate code to handle other calling conventions.

Just throwing a compiler error if an assembler procedure is not
decorated with a calling convention supported by the LLVM branch
would be just fine with me.

Simon

Jonas Maebe
2018-11-25 20:28:24 UTC
Post by Ben Grasset
LLC (at least now) statically links the necessary parts of LLVM and
works independently of Opt, with a simpler set of command line options
(it just has overall O1, O2, and O3 flags.)
Are you certain llc now incorporates the functionality of opt? From what
I can tell, llc still only performs codegen optimisations and no complex
IR transformations. It has always had the -O1/-O2/-O3 flags, but those
have always affected only the codegen.

All information I can find via Google also suggests you need to use
either clang or both opt and llc to get everything (e.g.
https://lists.llvm.org/pipermail/llvm-dev/2018-January/120226.html ).
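
The two-step route can be sketched as a pair of command lines. This is only an illustration: the file names are hypothetical, the helper is made up, nothing is executed, and the exact flag spellings accepted by opt and llc depend on the LLVM release in use:

```python
def llvm_pipeline(ir_file, opt_level=2):
    """Build (but do not run) the opt and llc command lines for one IR file."""
    stem = ir_file.rsplit(".", 1)[0]
    optimized = stem + ".opt.bc"  # hypothetical name for the optimized bitcode
    assembly = stem + ".s"        # hypothetical name for the generated assembly
    level = f"-O{opt_level}"
    return [
        ["opt", level, ir_file, "-o", optimized],   # IR-level transformations
        ["llc", level, optimized, "-o", assembly],  # code generation only
    ]


for command in llvm_pipeline("unit1.ll"):
    print(" ".join(command))
```

Dropping the first command leaves only the codegen-level optimisations, which is the situation described above.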


Jonas
Michael Van Canneyt
2018-10-25 07:38:08 UTC
Post by Simon Kissel
- Make Exception handling, TLS etc use the infrastructure that
libpthread is providing
TLS is handled already by libpthread. I doubt you will gain much there.

However, Exception handling is a problem. There are 2 possible ways ahead:
- DWARF exception handling as mentioned by Sven.
- Port SEH to be cross platform, this is the approach as taken by Kylix.
Kylix has a small rtlunwind library that mimics the needed run-time functionality
offered by Windows.

Conceivably, it can be duplicated. Wine probably has such a library which
can be used as an inspiration.

The needed compiler infrastructure for SEH already exists, so this is most likely
the fastest way to proceed.

Michael..
Sven Barth via fpc-devel
2018-10-25 09:18:58 UTC
Post by Michael Van Canneyt
Post by Simon Kissel
- Make Exception handling, TLS etc use the infrastructure that
libpthread is providing
TLS is handled already by libpthread. I doubt you will gain much there.
- DWARF exception handling as mentioned by Sven.
- Port SEH to be cross platform, this is the approach as taken by Kylix.
Kylix has a small rtlunwind library that mimics the needed run-time functionality
offered by Windows.
Conceivably, it can be duplicated. wine probably has such a library which
can be used as an inspiration.
The needed compiler infrastructure for SEH already exists, so this is most likely
the fastest way to proceed.
I'm against emulating SEH. Better implement DWARF exceptions. The
infrastructure that was created for SEH inside the compiler should help
nevertheless.

Regards,
Sven
Martin Schreiber
2018-10-25 09:46:51 UTC
Post by Sven Barth via fpc-devel
I'm against emulating SEH. Better implement DWARF exceptions. The
infrastructure that was created for SEH inside the compiler should help
nevertheless.
MSElang has some code for "Itanium ABI Zero-cost Exception Handling" supported
by LLVM, for example the runtime part:
https://gitlab.com/mseide-msegui/mselang/blob/master/mselang/compiler/__mla__personality.pas
Works well so far.

Martin
Michael Van Canneyt
2018-10-25 09:52:18 UTC
Post by Martin Schreiber
Post by Sven Barth via fpc-devel
I'm against emulating SEH. Better implement DWARF exceptions. The
infrastructure that was created for SEH inside the compiler should help
nevertheless.
MSElang has some code for "Itanium ABI Zero-cost Exception Handling" supported
https://gitlab.com/mseide-msegui/mselang/blob/master/mselang/compiler/__mla__personality.pas
Works well so far.
Great, thank you for this info. The more choice, the better!

Michael.
Joao Schuler
2018-10-25 11:53:30 UTC
Hello Simon - do you have code examples that provoke the problems you
are experiencing? It will be easier to measure/test improvements with test
cases. Solutions might not come from a single person/team, so I am not
sure how to apply the bounty in the most effective/fair way.
Michael Van Canneyt
2018-10-25 09:51:31 UTC
Post by Sven Barth via fpc-devel
Post by Michael Van Canneyt
Post by Simon Kissel
- Make Exception handling, TLS etc use the infrastructure that
libpthread is providing
TLS is handled already by libpthread. I doubt you will gain much there.
- DWARF exception handling as mentioned by Sven.
- Port SEH to be cross platform, this is the approach as taken by Kylix.
Kylix has a small rtlunwind library that mimics the needed run-time functionality
offered by Windows.
Conceivably, it can be duplicated. wine probably has such a library which
can be used as an inspiration.
The needed compiler infrastructure for SEH already exists, so this is most likely
the fastest way to proceed.
I'm against emulating SEH. Better implement DWARF exceptions. The
infrastructure that was created for SEH inside the compiler should help
nevertheless.
You can be against it, and you don't need to work on it,
but if someone supplies a patch, I don't think we should refuse it.

Personally I am also in favour of a more open technique instead of a
technique which is proprietary to a platform, and in this sense I understand
and endorse your point of view, but beggars can't be choosers.

There is no problem with having both techniques available. As I wrote, the SEH
is the fastest path.

So hopefully we will be able to compare and can still choose the better/faster one.

Michael.
Sven Barth via fpc-devel
2018-10-25 11:46:47 UTC
Post by Michael Van Canneyt
Post by Sven Barth via fpc-devel
Post by Michael Van Canneyt
Post by Simon Kissel
- Make Exception handling, TLS etc use the infrastructure that
libpthread is providing
TLS is handled already by libpthread. I doubt you will gain much there.
However, Exception handling is a problem. There are 2 possible ways
- DWARF exception handling as mentioned by Sven.
- Port SEH to be cross platform, this is the approach as taken by Kylix.
Kylix has a small rtlunwind library that mimics the needed run-time functionality
offered by Windows.
Conceivably, it can be duplicated. wine probably has such a library which
can be used as an inspiration.
The needed compiler infrastructure for SEH already exists, so this is most likely
the fastest way to proceed.
I'm against emulating SEH. Better implement DWARF exceptions. The
infrastructure that was created for SEH inside the compiler should help
nevertheless.
You can be against, and you don't need to work on it,
but if someone supplies a patch, I don't think we should refuse it.
I don't agree here.
Post by Michael Van Canneyt
Personally I am also in favour of a more open technique instead of a
technique which is proprietary to a platform, and in this sense I understand
and endorse your point of view, but beggars can't be choosers.
There is no problem to have both techniques available. As I wrote, the SEH
is the fastest path.
I have my doubts especially as the rtlunwind stuff of Kylix only works on
i386. The SEH mechanism between i386 and all other Windows platforms
differs significantly and I doubt that Simon only wants i386 to benefit.

Regards,
Sven
Michael Van Canneyt
2018-10-25 12:55:39 UTC
Permalink
Post by Sven Barth via fpc-devel
Post by Michael Van Canneyt
Personally I am also in favour of a more open technique instead of a
technique which is proprietary to a platform, and in this sense I understand
and endorse your point of view, but beggars can't be choosers.
There is no problem to have both techniques available. As I wrote, the SEH
is the fastest path.
I have my doubts especially as the rtlunwind stuff of Kylix only works on
i386. The SEH mechanism between i386 and all other Windows platforms
differs significantly and I doubt that Simon only wants i386 to benefit.
If 'SEH is the fastest path.' is not correct, then all the more reason to use DWARF...

Michael.
Sven Barth via fpc-devel
2018-10-25 15:06:50 UTC
Permalink
Post by Michael Van Canneyt
Post by Sven Barth via fpc-devel
Post by Michael Van Canneyt
Personally I am also in favour of a more open technique instead of a
technique which is proprietary to a platform, and in this sense I understand
and endorse your point of view, but beggars can't be choosers.
There is no problem to have both techniques available. As I wrote, the SEH
is the fastest path.
I have my doubts especially as the rtlunwind stuff of Kylix only works on
i386. The SEH mechanism between i386 and all other Windows platforms
differs significantly and I doubt that Simon only wants i386 to benefit.
If 'SEH is the fastest path.' is not correct, then all the more reason to use DWARF...
A further obstacle for SEH on non-i386: GNU AS supports the pseudo
instructions needed for SEH only for PE/COFF, but not ELF. This would mean
that we'd need to add them manually to the assembly files which would
definitely be more bothersome...

Regards,
Sven
Florian Klaempfl
2018-10-25 16:34:19 UTC
Permalink
Post by Michael Van Canneyt
Post by Simon Kissel
- Make Exception handling, TLS etc use the infrastructure that
  libpthread is providing
TLS is handled already by libpthread. I doubt you will gain much there.
- DWARF exception handling as mentioned by Sven.
- Port SEH to be cross platform, this is the approach as taken by Kylix.
Kylix has a small rtlunwind library that mimics the needed run-time
functionality offered by Windows.
Conceivably, it can be duplicated. wine probably has such a library which
can be used as an inspiration.
The needed compiler infrastructure for SEH already exists, so this
is most likely the fastest way to proceed.
I'm against emulating SEH. Better implement DWARF exceptions.
Yes.

Sven Barth via fpc-devel
2018-10-25 15:09:54 UTC
Permalink
Post by Michael Van Canneyt
Post by Simon Kissel
- Make Exception handling, TLS etc use the infrastructure that
libpthread is providing
TLS is handled already by libpthread. I doubt you will gain much there.
GCC has (depending on the platform) a faster implementation for "__thread"
variables. E.g. on x86 it uses the GS segment and the data is stored in ELF
sections. There were experiments in the past to support this in FPC as
well, so maybe we're on a good way there already.
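
For readers who have not used them: a threadvar in FPC plays the role of a "__thread" variable in C. Below is a minimal sketch, assuming a Unix target where the cthreads unit installs the thread manager. The program, type and function names are made up for illustration.

```pascal
program threadvardemo;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cthreads,{$endif}   // must be the first unit on Unix
  Classes;

threadvar
  Counter: longint;   // every thread gets its own zero-initialised copy

type
  TWorker = class(TThread)
  public
    Seen: longint;    // value of Counter as seen by the worker
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
var
  i: longint;
begin
  for i := 1 to 1000 do
    Inc(Counter);     // touches only this thread's copy
  Seen := Counter;
end;

// Runs one worker to completion and returns what it counted.
function WorkerCount: longint;
var
  W: TWorker;
begin
  W := TWorker.Create(False);
  W.WaitFor;
  Result := W.Seen;
  W.Free;
end;

begin
  Counter := 5;       // main thread's copy
  WriteLn('worker saw     ', WorkerCount);
  WriteLn('main still has ', Counter);
end.
```

The worker should report 1000 while the main thread still sees 5, since each thread has its own copy. The source is the same whichever mechanism the compiler uses to reach that copy (a helper call or a segment-relative access); only the generated code differs.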

Regards,
Sven
Michael Van Canneyt
2018-10-25 15:23:09 UTC
Permalink
Post by Sven Barth via fpc-devel
Post by Michael Van Canneyt
Post by Simon Kissel
- Make Exception handling, TLS etc use the infrastructure that
libpthread is providing
TLS is handled already by libpthread. I doubt you will gain much there.
GCC has (depending on the platform) a faster implementation for "__thread"
variables. E.g. on x86 it uses the GS segment and the data is stored in ELF
sections. There were experiments in the past to support this in FPC as
well, so maybe we're on a good way there already.
That is good news. The contours of a TODO list are becoming visible :)

But we may also need a solution for other platforms, which means the
current system should remain in place for those platforms where such a
system is not present ?

Michael.
Karoly Balogh (Charlie/SGR)
2018-10-25 15:38:13 UTC
Permalink
Hi,
Post by Michael Van Canneyt
Post by Sven Barth via fpc-devel
Post by Michael Van Canneyt
Post by Simon Kissel
- Make Exception handling, TLS etc use the infrastructure that
libpthread is providing
TLS is handled already by libpthread. I doubt you will gain much there.
GCC has (depending on the platform) a faster implementation for "__thread"
variables. E.g. on x86 it uses the GS segment and the data is stored in ELF
sections. There were experiments in the past to support this in FPC as
well, so maybe we're on a good way there already.
That is good news. The contours of a TODO list are becoming visible :)
But we may also need a solution for other platforms, which means the
current system should remain in place for those platforms where such a
system is not present ?
FPC already has some code to support section threadvars via the GS segment
on i386 at least, but it doesn't seem to be enabled by default? (Couldn't
test it, but the tf_section_threadvars target flag, which enables this, is
actually behind a define in i_linux.pas, which I couldn't find enabled
anywhere?). Also tf_section_threadvars flag has some code to support it
all over the compiler, including the x86 cg. I have some really vague
memories I actually enabled it in some experimental local version I had,
and it worked on first sight at least, but I could be completely off here.

I wonder why it was never enabled by default. Maybe to keep compatibility
to some older Linux version, which didn't support this yet?

IOW, it might be a one-line change. Can I take some of the bounty now? :P

Charlie
Florian Klaempfl
2018-10-25 16:33:12 UTC
Permalink
Karoly Balogh (Charlie/SGR)
2018-10-25 18:08:49 UTC
Permalink
Hi,
Post by Florian Klaempfl
Post by Karoly Balogh (Charlie/SGR)
Post by Michael Van Canneyt
That is good news. The contours of a TODO list are becoming visible :)
But we may also need a solution for other platforms, which means the
current system should remain in place for those platforms where such a
system is not present ?
FPC already has some code to support section threadvars via the GS segment
on i386 at least, but it doesn't seem to be enabled by default? (Couldn't
test it, but the tf_section_threadvars target flag, which enables this, is
actually behind a define in i_linux.pas, which I couldn't find enabled
anywhere?). Also tf_section_threadvars flag has some code to support it
all over the compiler, including the x86 cg. I have some really vague
memories I actually enabled it in some experimental local version I had,
and it worked on first sight at least, but I could be completely off here.
I wonder why it was never enabled by default.
The %gs based approach works only for object files linked statically to
the executable. In general there are four TLS access models on Linux and
at least three of them need to be supported, if one wants to support
dyn. libraries in a useful manner. Of course, this comes with the
requirement to offer means to control the used model. The tls.pdf by U.
Drepper describes it very well.
Ah, right. It's been a while. Ironically, it would have been enough for
the actual use case at hand, when I fiddled with it.

Charlie
Simon Kissel
2018-10-28 11:29:59 UTC
Permalink
Hi Florian,
Post by Florian Klaempfl
The %gs based approach works only for object files linked statically to
the executable. In general there are four TLS access models on linux and
at least three of them need to be supported, if one wants to support
dyn. libraries in a useful manner.
Are you talking about being able to create dynlibs in FPC,
that then are consumed by FPC, and need to be able to support
exceptions?

I know an approach is needed that FPC benefits from in a generic
way, but for my case: We don't do that. As long as I am able
to link against glibc-based stuff, I am fine.

Simon

Sven Barth via fpc-devel
2018-10-28 11:45:29 UTC
Permalink
Post by Simon Kissel
Hi Florian,
Post by Florian Klaempfl
The %gs based approach works only for object files linked statically to
the executable. In general there are four TLS access models on linux and
at least three of them need to be supported, if one wants to support
dyn. libraries in a useful manner.
Are you talking about being able to create dynlibs in FPC,
that then are consumed by FPC, and need to be able to support
exceptions?
I know an approach is needed that FPC benefits from in a generic
way, but for my case: We don't do that. As long as I am able
to link against glibc-based stuff, I am fine.
The thing is that we can't enable or disable a feature based on whether a
program links third party libraries or a unit is included in a library or
not, because we might need to work with precompiled units. So either you'll
need to enable this feature for a locally built FPC and be aware that you
can't really create libraries then, or the feature needs to be implemented
completely.

Regards,
Sven
Simon Kissel
2018-10-28 11:48:54 UTC
Permalink
Hi Sven,
Post by Sven Barth via fpc-devel
The thing is that we can't enable or disable a feature based on
whether a program links third party libraries or a unit is included
in a library or not, because we might need to work with precompiled
units. So either you'll need to enable this feature for a locally
built FPC and be aware that you can't really create libraries then
or the feature needs to be implemented completely. 
For us it would be just fine to have custom FPC builds, we
do that anyway - we use CrossFPC built by bero to be able
to target a whole lot of platforms concurrently, including
inside the Delphi and Lazarus IDEs.

But of course it means far less other users would benefit.

Simon


Jonas Maebe
2018-10-25 16:59:47 UTC
Permalink
Post by Simon Kissel
- Complete the LLVM branch of FPC. It looks like Jonas has stopped
working on it two years ago, which is a pity.
I didn't stop working on it, but I didn't make real progress anymore
either. The current state of the LLVM code generator is that everything
works on Darwin/x86-64, except for
a) exception handling in general: indeed needs DWARF-EH support in the
RTL, and also support for the LLVM exception handling intrinsics in the
code generator. I've worked on and off on this and have some local
patches, but it's not complete
b) hardware exceptions (null pointer, floating point): the LLVM versions
I worked with back then did not support any form of hardware
exceptions. If a memory access faults, the result is undefined behaviour
(even with full exception support in the LLVM IR). If a floating point
instruction throws an exception, the result is undefined (although they
have been working a bit on it since then). This is not something that
can be changed/fixed in FPC, and is quite different from how FPC's
current code generator works (I don't know how Embarcadero deals with
it in their LLVM-based code generator).
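
To make point b) concrete, here is a small sketch of code that relies on hardware exceptions. With FPC's native code generator and the SysUtils unit, the fault from the nil dereference is turned into a catchable EAccessViolation. With an LLVM backend, per the description above, the same access would be undefined behaviour. The program and function names are invented for illustration.

```pascal
program hwexcept;
{$mode objfpc}{$H+}
uses
  SysUtils;   // converts run-time errors into exception classes

// Reads through P; a nil pointer faults in hardware.
function ReadThrough(P: PLongint): string;
begin
  try
    Result := IntToStr(P^);
  except
    on E: EAccessViolation do
      Result := 'caught EAccessViolation';
  end;
end;

begin
  WriteLn(ReadThrough(nil));
end.
```

Code like this is fairly common in Delphi-style programs, which is why the missing hardware exception model matters for compatibility.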

Additionally, in the current FPC code generator global variables behave
mostly as volatile variables. With LLVM, that won't be the case (unless
we mark all of their accesses as volatile, but that would obviously
inhibit LLVM optimizations). This may break some multithreaded code that
currently works, and would probably require the introduction of a
volatile() operator (similar to the unaligned() one). On the other
hand, I already added support for tracking the volatile state of
references in the past, so that should be easy to do.
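
A sketch of the kind of multithreaded code this concerns, together with one way to avoid depending on globals behaving as volatile. It uses InterlockedExchange and InterlockedCompareExchange from the System unit rather than a plain read in the polling loop. The program and identifier names are hypothetical, and this is only one possible approach, not necessarily what the RTL or any existing code does.

```pascal
program sharedflag;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cthreads,{$endif}
  Classes, SysUtils;

var
  Done: longint = 0;   // global shared between the two threads

type
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
begin
  Sleep(10);
  InterlockedExchange(Done, 1);   // atomic store of the flag
end;

// Starts a worker, polls the flag until it is set, returns its value.
function WaitForWorker: longint;
var
  W: TWorker;
begin
  Done := 0;
  W := TWorker.Create(False);
  // Fragile variant: "while Done = 0 do ;" only works if Done is
  // re-read from memory on every iteration.
  while InterlockedCompareExchange(Done, 0, 0) = 0 do
    ThreadSwitch;                 // atomic read, then yield
  W.WaitFor;
  W.Free;
  Result := Done;
end;

begin
  WriteLn('Done = ', WaitForWorker);
end.
```

The commented-out polling loop is the pattern that happens to work today and could stop working once an optimizer is allowed to keep the global in a register.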


Jonas
Florian Klämpfl
2018-10-25 18:13:16 UTC
Permalink
Post by Simon Kissel
- Complete the LLVM branch of FPC. It looks like Jonas has stopped
   working on it two years ago, which is a pity.
I didn't stop working on it, but I didn't make real progress anymore either. The current state of the LLVM code
generator is that everything works on Darwin/x86-64, except for
a) exception handling in general: indeed needs DWARF-EH support in the RTL,
This is something I have wanted to work on for years already. So maybe it's now a good opportunity to start with it.

I started a branch for it: https://svn.freepascal.org/svn/fpc/branches/debug_eh

As a first step, I'll depend on libgcc unwinding, let's see how far we get.
Jonas Maebe
2018-10-25 18:34:44 UTC
Permalink
Post by Florian Klämpfl
Post by Simon Kissel
- Complete the LLVM branch of FPC. It looks like Jonas has stopped
   working on it two years ago, which is a pity.
I didn't stop working on it, but I didn't make real progress anymore either. The current state of the LLVM code
generator is that everything works on Darwin/x86-64, except for
a) exception handling in general: indeed needs DWARF-EH support in the RTL,
This is something I have wanted to work on for years already. So maybe it's now a good opportunity to start with it.
I started a branch for it: https://svn.freepascal.org/svn/fpc/branches/debug_eh
As a first step, I'll depend on libgcc unwinding, let's see how far we get.
Using libgcc's foreign exception support works somewhat, but is not very
usable in practice due to the limitation of having only one exception in
flight. I simply started translating all of libgcc's exception support
to Pascal, since it's also licensed under LGPL + linking exception (I
took the one from gcc 4.2.1 for the people who don't like (L)GPL3).
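
As an illustration of why a single exception in flight is too restrictive for Pascal code, the sketch below has two exception objects alive at the same time: the second is raised and handled inside a finally block while the first is still propagating. This is my reading of the limitation, and the names are made up for the example.

```pascal
program nestedexc;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Returns the messages of the exceptions in the order they are handled.
function HandlingOrder: string;
begin
  Result := '';
  try
    try
      raise Exception.Create('first');
    finally
      // 'first' is still propagating while this block runs
      try
        raise Exception.Create('second');
      except
        on E: Exception do
          Result := Result + E.Message + ';';
      end;
    end;
  except
    on E: Exception do
      Result := Result + E.Message + ';';
  end;
end;

begin
  WriteLn(HandlingOrder);
end.
```

The inner exception is handled first and the outer one afterwards, so the run-time has to keep track of both.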


Jonas

Sven Barth via fpc-devel
2018-10-25 19:33:49 UTC
Permalink
Post by Jonas Maebe
Post by Florian Klämpfl
Post by Jonas Maebe
Post by Simon Kissel
- Complete the LLVM branch of FPC. It looks like Jonas has stopped
    working on it two years ago, which is a pity.
I didn't stop working on it, but I didn't make real progress anymore
either. The current state of the LLVM code
generator is that everything works on Darwin/x86-64, except for
a) exception handling in general: indeed needs DWARF-EH support in the RTL,
This is something I have wanted to work on for years already. So maybe
it's now a good opportunity to start with it.
I started a branch for it:
https://svn.freepascal.org/svn/fpc/branches/debug_eh
As a first step, I'll depend on libgcc unwinding, let's see how far we get.
Using libgcc's foreign exception support works somewhat, but is not
very usable in practice due to the limitation of having only one
exception in flight. I simply started translating all of libgcc's
exception support to Pascal, since it's also licensed under LGPL +
linking exception (I took the one from gcc 4.2.1 for the people who
don't like (L)GPL3).
As you already started working on translating that part of libgcc, would
you please provide what you have so far? :)

Regards,
Sven
Jonas Maebe
2018-10-28 18:14:06 UTC
Permalink
Post by Sven Barth via fpc-devel
As you already started working on translating that part of libgcc, would
you please provide what you have so far? :)
I've committed it in the dwarf_eh branch. Unfortunately, an x86-64
compiler compiled with optimizations enabled crashes while compiling
this code (probably due to https://bugs.freepascal.org/view.php?id=34385
:) )


Jonas
Jonas Maebe
2018-10-28 19:50:22 UTC
Permalink
Post by Jonas Maebe
I've committed it in the dwarf_eh branch. Unfortunately, an x86-64
compiler compiled with optimizations enabled crashes while compiling
this code (probably due to https://bugs.freepascal.org/view.php?id=34385
:) )
Actually, it was due to a bug in my code! Fixed.


Jonas
Simon Kissel
2018-10-28 12:06:22 UTC
Permalink
Hi Florian,

[DWARF-EH]
Post by Florian Klämpfl
This is something I have wanted to work on for years already. So
maybe it's now a good opportunity to start with it.
*hugs*

Simon

Florian Klämpfl
2018-11-04 16:35:39 UTC
Permalink
On 25.10.2018 at 20:13, Florian Klämpfl wrote:

In case somebody wonders: as I started on TLS-based threadvars years ago, I decided to work on this first and
try to bring the code into a committable state.
Florian Klämpfl
2018-11-07 22:00:20 UTC
Permalink
Post by Florian Klämpfl
In case somebody wonders: as I started on TLS-based threadvars years ago, I decided to work on this first and
try to bring the code into a committable state.
I committed my tls-based threadvar code. It still comes with a few limitations though:
- to enable it, the compiler, rtl and packages must be built with "OPT=-dtls_threadvars -Aas"
- it works only on i386-linux
- the internal assembler does not support the necessary relocations yet: so all compilations must be done with -Aas
- threadvars in FPC-built libraries do not work yet
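
For anyone who wants to compare a stock build against one built with the options above, a rough micro-benchmark sketch follows. It only hammers a single threadvar from the main thread, so it says little about real workloads. The names and the iteration count are arbitrary choices for this example.

```pascal
program tvbench;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cthreads,{$endif}   // install a thread manager so threadvars are really thread-local
  SysUtils;

threadvar
  Slot: longint;

// Increments the threadvar Iterations times and returns its final value.
function TouchThreadvar(Iterations: longint): longint;
var
  i: longint;
begin
  Slot := 0;
  for i := 1 to Iterations do
    Inc(Slot);          // each access has to locate the thread-local copy
  Result := Slot;
end;

var
  T0: QWord;
  N: longint;
begin
  T0 := GetTickCount64;
  N := TouchThreadvar(100000000);
  WriteLn(N, ' increments in ', GetTickCount64 - T0, ' ms');
end.
```

Timings depend heavily on the machine and on optimization settings, so only the relative difference between the two builds is meaningful.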
Florian Klämpfl
2018-11-11 17:41:53 UTC
Permalink
Post by Florian Klämpfl
- threadvars in FPC built libraries do not work yet
This is fixed with r40281. It requires though that all units being part of a library are compiled with -fPIC.

Now waiting for Simon, if he reports any improvements ...
Simon Kissel
2018-11-14 13:46:08 UTC
Permalink
Hi Florian,

you are a hero. In a very artificial benchmark which just consists
of threads and exception handlers, a 32 bit Linux executable now
is *twice as fast*!
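
The artificial benchmark itself is not shown in this thread. The following is only a guess at what a test "consisting of threads and exception handlers" might look like: several threads each entering and leaving a try/except block in a tight loop without ever raising. All names and constants are hypothetical.

```pascal
program ehbench;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cthreads,{$endif}
  Classes, SysUtils;

// Enters and leaves a try/except block Iterations times.
function GuardedLoop(Iterations: longint): longint;
var
  i: longint;
begin
  Result := 0;
  for i := 1 to Iterations do
    try
      Inc(Result);      // never raises; only frame setup/teardown is exercised
    except
      Dec(Result);
    end;
end;

type
  TBenchThread = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TBenchThread.Execute;
begin
  GuardedLoop(10000000);
end;

const
  ThreadCount = 4;
var
  Threads: array[1..ThreadCount] of TBenchThread;
  i: Integer;
  T0: QWord;
begin
  T0 := GetTickCount64;
  for i := 1 to ThreadCount do
    Threads[i] := TBenchThread.Create(False);
  for i := 1 to ThreadCount do
  begin
    Threads[i].WaitFor;
    Threads[i].Free;
  end;
  WriteLn(ThreadCount, ' threads finished in ', GetTickCount64 - T0, ' ms');
end.
```

A loop like this would plausibly profit from faster threadvars even without new exception handling, presumably because the RTL keeps its per-thread exception frame bookkeeping in threadvars.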

In a real-life scenario we are "only" seeing an improvement of about
10%. But really, this is huge progress. I think everyone will
benefit from these improvements.

We have not yet tested this on ARM (does it work on ARM?).

Bero will do more testing in the next couple of days and report
back.

Cheers,

Simon

Sven Barth via fpc-devel
2018-11-14 14:24:59 UTC
Permalink
On Wed, 14 Nov 2018 at 14:46, Simon Kissel wrote:
Post by Simon Kissel
Hi Florian,
you are a hero. In a very artificial benchmark which just consists
of threads and exception handlers, a 32 bit Linux executable now
is *twice as fast*!
Up to now only thread variables are improved, the exception handling not
yet.
Post by Simon Kissel
In a real-life scenario we are "only" seeing an improvement of about
10%. But really, this is huge progress. I think everyone will
benefit from these improvements.
We have not yet tested this on ARM (does it work on ARM?).
Currently it's i386-linux only.

Regards,
Sven
Florian Klämpfl
2018-11-15 21:31:55 UTC
Permalink
Post by Simon Kissel
We have not yet tested this on ARM (does it work on ARM?).
After r40321, arm-linux works as well.
Simon Kissel
2018-11-16 19:22:52 UTC
Permalink
Hi guys,

turns out that in our real-life scenario there sadly aren't big
improvements yet. Might be due to the exception handling, but
we haven't profiled it yet. As said we have seen better improvements
in simpler benchmark code - but this benchmark here is what
really matters for us.

Please find the benchmark here - the ZIP includes a Kylix-built
binary.

https://share.nerdherrschaft.net/f/2ac772f0327e4840a533/?dl=1

Here are some results from a Dualcore i7 with 2 cores and 4 HT,
32 bit:

Kylix:
Time: 5015ms = 9770688 pkts/s = 14610 MB/s
./vipribenchmemcache_nodeps_kylix 5.06s user 0.01s system 99% cpu 5.119 total

FPC 3.0.4:
Time: 5052ms = 8016627 pkts/s = 11987 MB/s
./vipribenchmemcache 5.07s user 0.01s system 97% cpu 5.206 total

FPC 3.3.1 trunk (SVN Rev 40300):
Time: 5040ms = 8035714 pkts/s = 12016 MB/s
./vipribenchmemcache_nodeps 5.07s user 0.02s system 97% cpu 5.207 total

Benchmark results for ARM will follow.

Cheers,

Simon
Post by Florian Klämpfl
Post by Simon Kissel
We have not yet tested this on ARM (does it work on ARM?).
After r40321, arm-linux works as well.
Best regards,

Simon Kissel
--
Nerdherrschaft GmbH
Mainzer Str. 40
55411 Bingen am Rhein
Germany

Phone: +49-6721-9492994
Fax: +49-6721-9492996

***@nerdherrschaft.com
http://www.nerdherrschaft.com

Registered office/Sitz der Gesellschaft: Bingen am Rhein, Germany
CEO/Geschäftsführer: Simon Kissel
Commercial register/Handelsregister: Amtsgericht Mainz HRB43337

Florian Klämpfl
2018-11-16 21:44:57 UTC
Permalink
Post by Simon Kissel
Hi guys,
turns out that in our real-life scenario there sadly aren't big
improvements yet. Might be due to the exception handling, but
we haven't profiled it yet. As said we have seen better improvements
in simpler benchmark code - but this benchmark here is what
really matters for us.
Please find the benchmark here - the ZIP includes a Kylix-built
binary.
https://share.nerdherrschaft.net/f/2ac772f0327e4840a533/?dl=1
Here are some results from a Dualcore i7 with 2 cores and 4 HT,
32 bit:
Kylix:
Time: 5015ms = 9770688 pkts/s = 14610 MB/s
./vipribenchmemcache_nodeps_kylix 5.06s user 0.01s system 99% cpu 5.119 total
FPC 3.0.4:
Time: 5052ms = 8016627 pkts/s = 11987 MB/s
./vipribenchmemcache 5.07s user 0.01s system 97% cpu 5.206 total
FPC 3.3.1 trunk (SVN Rev 40300):
Time: 5040ms = 8035714 pkts/s = 12016 MB/s
./vipribenchmemcache_nodeps 5.07s user 0.02s system 97% cpu 5.207 total
Benchmark results for ARM will follow.
With some compiler tuning and a few tricks (two changes to the code and hand-simulated peephole optimizations, but I
think the compiler can also do these tricks):

***@ubuntu32:~$ ./vipribench
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
..............................................................................................
Time: 5005ms = 9390609 pkts/s = 14042 MB/s
***@ubuntu32:~$ ./vipribenchmemcache_nodeps_kylix
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
.............................................................................................
Time: 5018ms = 9266640 pkts/s = 13856 MB/s

;)

Jonas Maebe
2018-11-16 22:36:10 UTC
Permalink
Post by Florian Klämpfl
With some compiler tuning and a few tricks (two changes to the code and hand-simulated peephole optimizations, but I
You can improve performance further by devirtualising all method calls
using wpo. First compile it with -FWvipri.wpo -OWDEVIRTCALLS,OPTVMTS and
next with -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (at least on my machine it
gives a small boost, and makes the results also more stable).
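
To show what the two-pass build is aimed at, here is a toy program with a virtual call in a hot loop. The compile commands in the header comment use the switches quoted above; the file names demo.wpo and wpodemo.pas and all identifiers are invented for this sketch.

```pascal
program wpodemo;
{$mode objfpc}{$H+}
// Pass 1 (collect feedback):  fpc -FWdemo.wpo -OWDEVIRTCALLS,OPTVMTS wpodemo.pas
// Pass 2 (use feedback):      fpc -Fwdemo.wpo -OwDEVIRTCALLS,OPTVMTS wpodemo.pas

type
  TPacketSink = class
    function Accept(Size: longint): longint; virtual; abstract;
  end;

  // The only descendant that is ever instantiated in this program.
  TCountingSink = class(TPacketSink)
    Total: longint;
    function Accept(Size: longint): longint; override;
  end;

function TCountingSink.Accept(Size: longint): longint;
begin
  Inc(Total, Size);
  Result := Total;
end;

// Calls the virtual method Packets times and returns the last result.
function Feed(Sink: TPacketSink; Packets: longint): longint;
var
  i: longint;
begin
  Result := 0;
  for i := 1 to Packets do
    Result := Sink.Accept(1500);   // candidate for devirtualisation
end;

var
  S: TCountingSink;
begin
  S := TCountingSink.Create;
  WriteLn(Feed(S, 1000));
  S.Free;
end.
```

Whether a given call is actually turned into a direct call depends on what the compiler can prove about the whole program, so the gain has to be measured rather than assumed.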

Since I only have a preliminary llvm version (with Dwarf EH) running on
macOS, I can't provide a direct Kylix comparison. The versions below are
both x86-64. As mentioned before, a 32 bit FPC/LLVM is still quite a way
off.

* FPC 3.0.4 -MDelphi -O2 -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS:

$ time ./vipribenchmemcache_nodeps
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0,
NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
.................................................................................................
Time: 5016ms = 9669059 pkts/s = 14680 MB/s

real 0m5.137s
user 0m5.042s
sys 0m0.017s

* FPC 3.3.1 + llvm (clang from Xcode 10.1 with -O3 on FPC-generated llvm
IR) and -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (no LLVM link-time
optimization):

$ time ./vipribenchmemcache_nodeps_llvm
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0,
NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
.................................................................................................................
Time: 5018ms = 11259466 pkts/s = 17094 MB/s

real 0m5.161s
user 0m5.060s
sys 0m0.017s


Jonas
Florian Klämpfl
2018-11-16 22:41:47 UTC
Permalink
Post by Florian Klämpfl
With some compiler tuning and a few tricks (two changes to the code and hand-simulated peephole optimizations, but I
You can improve performance further by devirtualising all method calls using wpo. First compile it with -FWvipri.wpo
-OWDEVIRTCALLS,OPTVMTS and next with -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (at least on my machine it gives a small boost,
and makes the results also more stable).
Since I only have a preliminary llvm version (with Dwarf EH) running on macOS, I can't provide a direct Kylix
comparison. The versions below are both x86-64. As mentioned before, a 32 bit FPC/LLVM is still quite a way off.
$ time ./vipribenchmemcache_nodeps
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
.................................................................................................
Time: 5016ms = 9669059 pkts/s = 14680 MB/s
real    0m5.137s
user    0m5.042s
sys    0m0.017s
FPC 3.3.1 + llvm (clang from Xcode 10.1 with -O3 on FPC-generated llvm IR) and -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (no
$ time ./vipribenchmemcache_nodeps_llvm
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
.................................................................................................................
Time: 5018ms = 11259466 pkts/s = 17094 MB/s
real    0m5.161s
user    0m5.060s
sys    0m0.017s
Can you test with FPC 3.1.1 native, -O4 and the following patch:

compiler/nmem.pas | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/compiler/nmem.pas b/compiler/nmem.pas
index d5c1d85e8f..52add1fd81 100644
--- a/compiler/nmem.pas
+++ b/compiler/nmem.pas
@@ -1176,7 +1176,7 @@ implementation
begin
include(flags,nf_write);
{ see comment in tsubscriptnode.mark_write }
- if not(is_implicit_pointer_object_type(left.resultdef)) then
+ if not(is_implicit_array_pointer(left.resultdef)) then
left.mark_write;
end;

?
Florian Klämpfl
2018-11-16 22:58:35 UTC
Permalink
Post by Florian Klämpfl
Post by Florian Klämpfl
With some compiler tuning and a few tricks (two changes to the code and hand-simulated peephole optimizations, but I
You can improve performance further by devirtualising all method calls using wpo. First compile it with -FWvipri.wpo
-OWDEVIRTCALLS,OPTVMTS and next with -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (at least on my machine it gives a small boost,
and makes the results also more stable).
Since I only have a preliminary llvm version (with Dwarf EH) running on macOS, I can't provide a direct Kylix
comparison. The versions below are both x86-64. As mentioned before, a 32 bit FPC/LLVM is still quite a way off.
$ time ./vipribenchmemcache_nodeps
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
.................................................................................................
Time: 5016ms = 9669059 pkts/s = 14680 MB/s
real    0m5.137s
user    0m5.042s
sys    0m0.017s
FPC 3.3.1 + llvm (clang from Xcode 10.1 with -O3 on FPC-generated llvm IR) and -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (no
$ time ./vipribenchmemcache_nodeps_llvm
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
.................................................................................................................
Time: 5018ms = 11259466 pkts/s = 17094 MB/s
real    0m5.161s
user    0m5.060s
sys    0m0.017s
compiler/nmem.pas | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/compiler/nmem.pas b/compiler/nmem.pas
index d5c1d85e8f..52add1fd81 100644
--- a/compiler/nmem.pas
+++ b/compiler/nmem.pas
@@ -1176,7 +1176,7 @@ implementation
begin
include(flags,nf_write);
{ see comment in tsubscriptnode.mark_write }
- if not(is_implicit_pointer_object_type(left.resultdef)) then
+ if not(is_implicit_array_pointer(left.resultdef)) then
left.mark_write;
end;
?
Hmmm, it needs a few more of my changes to make it work, though it should work if used only with the benchmark.

Jonas Maebe
2018-11-17 09:13:46 UTC
Permalink
Post by Florian Klämpfl
diff --git a/compiler/nmem.pas b/compiler/nmem.pas
index d5c1d85e8f..52add1fd81 100644
--- a/compiler/nmem.pas
+++ b/compiler/nmem.pas
@@ -1176,7 +1176,7 @@ implementation
begin
include(flags,nf_write);
{ see comment in tsubscriptnode.mark_write }
- if not(is_implicit_pointer_object_type(left.resultdef)) then
+ if not(is_implicit_array_pointer(left.resultdef)) then
left.mark_write;
end;
The compiler crashes when I try to compile the program with that patch
applied (I did not do a make cycle with that patch, just applied it,
recompiled the compiler, and then tried to compile the test program with
the new compiler).


Jonas
Simon Kissel
2018-11-17 21:15:44 UTC
Permalink
Hi Jonas,

Nice results!
Post by Jonas Maebe
Since I only have a preliminary llvm version (with Dwarf EH) running on
macOS, I can't provide a direct Kylix comparison. The versions below are
both x86-64. As mentioned before, a 32 bit FPC/LLVM is still quite a way
off.
How far of a way is that? Sadly we'll have to support some 32 bit
platforms for a couple more years...

And how far away is getting this to run on Linux?

And: Any language features or RTL stuff that does not yet work
with FPC/LLVM?

Bonus question: I don't know on which layer threads and exceptions are
handled with LLVM - will you be able to make use of the improvements to
TLS and Exception handling, in other words, can we combine the best
of both worlds?

BR,

Simon

Jonas Maebe
2018-11-17 22:13:39 UTC
Permalink
Post by Simon Kissel
How far of a way is that? Sadly we'll have to support some 32 bit
platforms for a couple more years...
I really don't know. It's not something I have looked into, but I'm
afraid it will be messy.
Post by Simon Kissel
And how far away is getting this to run on Linux?
Getting it to work on Linux/x86-64 should be fairly easy. Other 64 bit
platforms (both architectures and OSes) should not be difficult either.
Post by Simon Kissel
And: Any language features or RTL stuff that does not yet work
with FPC/LLVM?
Only the ones mentioned before:
* global variables are currently not treated as volatile by the LLVM
code generator, so if you use them to share values between threads without
explicit synchronisation, that will fail (as Sven explained)

* hardware exceptions (like segmentation faults, fpu exceptions and bus
errors) because LLVM does not model them. I could try to work around
this by making all accesses to all variables potentially referenced in
try/except blocks "volatile" (both in the blocks and afterwards), but
that would prevent many optimizations and it would not even guarantee to
solve all potential problems (since the LLVM code generator would still
assume that if those instructions trap, all behaviour afterwards is
undefined and hence it can optimize as if those instructions will never
trap; marking them as volatile won't change that: even if in many cases
the end result may be the same, it's not guaranteed).

The only work done on LLVM in this regard is some experimental
support for FPU exceptions in recent versions
(https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics),
but I have not added support for that yet (nor do I know how well it
works, or on which platforms it is supported).
Post by Simon Kissel
Bonus question: I don't know on which layer threads and exceptions are
handled with LLVM - will you be able to make use of the improvements to
TLS and Exception handling, in other words, can we combine the best
of both worlds?
The only improvements to exception handling until now have been for the
LLVM target. The code I already submitted is generic though, and can be
used by non-LLVM targets as well.

TLS-based threadvar support needs to be implemented separately for LLVM,
but that should be fairly easy (it's just another way of declaring the
variable in the LLVM IR).


Jonas
Simon Kissel
2018-11-17 21:10:46 UTC
Permalink
Hi Florian,
Post by Florian Klämpfl
With some compiler tuning and a few tricks (two changes to the code
and hand-simulated peephole optimizations, but I
Nice - what changes did you do?

Changing the code of course is cheating, but there might be something
to learn for us, here.

Would be great if whatever trick you did could be part of the
compiler.

Cheers,

Simon

Florian Klämpfl
2018-11-17 21:28:12 UTC
Permalink
Post by Simon Kissel
Hi Florian,
Post by Florian Klämpfl
With some compiler tuning and a few tricks (two changes to the code
and hand-simulated peephole optimizations, but I
Nice - what changes did you do?
Changing the code of course is cheating, but there might be something
to learn for us, here.
I prevented the compiler from putting certain variables in registers by taking their address :) But I did so only to test
whether this helps, and for i386 it does, as the decision which variables go into registers is not that easy, but see below.
Post by Simon Kissel
Would be great if whatever trick you did could be part of the
compiler.
Meanwhile the compiler can do it (not yet committed). Same VM as yesterday, all rates are a little bit lower, not sure
why (probably too many VMs open :)), but this applies to all three executables.

***@ubuntu32:~$ ./vipribenchmemcache_nodeps
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
.......................................................................................
Time: 5022ms = 8661888 pkts/s = 12952 MB/s
***@ubuntu32:~$ ./vipribenchmemcache_nodeps_kylix
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
......................................................................................
Time: 5040ms = 8531746 pkts/s = 12758 MB/s
***@ubuntu32:~$ ./vipribenchmemcache_nodeps_fpc
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
.............................................................
Time: 5058ms = 6030051 pkts/s = 9017 MB/s
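
To put the three rates above in relative terms, here is a minimal sketch in C. It assumes the unsuffixed binary is the build from the patched compiler and the _fpc binary is the earlier FPC build; the helper names percent_gain and print_i386_summary are made up for illustration.

```c
#include <stdio.h>

/* Relative throughput change of `after` over `before`, in percent. */
double percent_gain(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Prints the comparison for the three i386 runs quoted above.
   The assignment of binaries to compilers is an assumption. */
void print_i386_summary(void)
{
    const double new_fpc = 8661888.0;  /* pkts/s, vipribenchmemcache_nodeps       */
    const double kylix   = 8531746.0;  /* pkts/s, vipribenchmemcache_nodeps_kylix */
    const double old_fpc = 6030051.0;  /* pkts/s, vipribenchmemcache_nodeps_fpc   */

    printf("new FPC vs old FPC: %+.1f%%\n", percent_gain(old_fpc, new_fpc));
    printf("new FPC vs Kylix:   %+.1f%%\n", percent_gain(kylix, new_fpc));
}
```

Under that assumption the patched build comes out roughly 44% ahead of the earlier FPC build and about 1.5% ahead of Kylix in this single run, which is within the noise one would expect from a VM.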
Florian Klämpfl
2018-11-18 10:08:05 UTC
Permalink
Post by Florian Klämpfl
Post by Simon Kissel
Hi Florian,
Post by Florian Klämpfl
With some compiler tuning and a few tricks (two changes to the code
and hand-simulated peephole optimizations, but I
Nice - what changes did you do?
Changing the code of course is cheating, but there might be something
to learn for us, here.
I prevented the compiler from putting certain variables in registers by taking their address :) But I did so only to test
whether this helps, and for i386 it does, as the decision which variables go into registers is not that easy, but see below.
Post by Simon Kissel
Would be great if whatever trick you did could be part of the
compiler.
Meanwhile the compiler can do it (not yet committed). Same VM as yesterday, all rates are a little bit lower, not sure
why (probably too many VMs open :)), but this applies to all three executables.
With rev. 40346 I have committed my last changes. As the code is still experimental, it needs to be activated by the
command line when building FPC:

make clean all "OPT=-Aas -dtls_threadvars -O4 -dSPILLING_NEW"

(add -Cp... -Op... options if the target system is known)

Compile the benchmark with (where fpcnew is the newly built fpc):

fpcnew -O4 -Sd -FWvipri.wpo -OWDEVIRTCALLS,OPTVMTS vipribenchmemcache_nodeps.dpr
fpcnew -O4 -Sd -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS vipribenchmemcache_nodeps.dpr

The changes also help on arm, and arm can be built using the same command line; however, at least on a Raspi3B+ the
improvement is less significant than on i386 (still the old cache flush (?) issue which is outside of the scope of FPC?).
Simon Kissel
2018-11-18 22:38:12 UTC
Permalink
Hi Florian,
Bero has confirmed, works for us as well. This rocks!
Post by Florian Klämpfl
The changes also help on arm, and arm can be built using the same
command line; however, at least on a Raspi3B+ the
improvement is less significant than on i386 (still the old cache
flush (?) issue which is outside of the scope of FPC?).
We'll try that next. And yes, on the bloody Kirkwood CPU which we use,
a context switch will result in a CPU cache flush.

Cheers,

Simon

Simon Kissel
2018-11-20 12:58:05 UTC
Permalink
Hi Florian,
Post by Florian Klämpfl
The changes also help on arm, and arm can be built using the same
command line; however, at least on a Raspi3B+ the
improvement is less significant than on i386 (still the old cache
flush (?) issue which is outside of the scope of FPC?).
Actually the changes are significant:

Before:

01-00512-00-00016:/opt/viprinet/bin # ./vipribenchmemcache_nodeps_crossfpc
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
...
Time: 5212ms = 287797 pkts/s = 430 MB/s

After:

01-00512-00-00016:/opt/viprinet/bin # ./vipribenchmemcache_nodeps_armv5te_fpc
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
....
Time: 5893ms = 339386 pkts/s = 507 MB/s

BR,

Simon
--
Nerdherrschaft GmbH
Mainzer Str. 40
55411 Bingen am Rhein
Germany

Phone: +49-6721-9492994
Fax: +49-6721-9492996

***@nerdherrschaft.com
http://www.nerdherrschaft.com

Registered office/Sitz der Gesellschaft: Bingen am Rhein, Germany
CEO/Geschäftsführer: Simon Kissel
Commercial register/Handelsregister: Amtsgericht Mainz HRB43337

Simon Kissel
2018-12-04 01:16:29 UTC
Permalink
Hi Florian,

we are currently trying to do some real-life benchmarks with our
products; however, with rev. 40346 compilation fails with the following
two showstoppers:

1.)

The assembler parser appears to be broken - the following very valid
opcodes get rejected:

SBMath.pas(1932,9) Error: Asm: [cmp imm32,imm8s] invalid combination of opcode and operands
SBMath.pas(1934,5) Error: Asm: [lea reg32,imm32] invalid combination of opcode and operands
SBMath.pas(1939,9) Error: Asm: [cmp imm32,imm8s] invalid combination of opcode and operands
SBMath.pas(1941,5) Error: Asm: [lea reg32,imm32] invalid combination of opcode and operands
SBMath.pas(1946,9) Error: Asm: [cmp imm32,imm8s] invalid combination of opcode and operands
SBMath.pas(1948,5) Error: Asm: [lea reg32,imm32] invalid combination of opcode and operands
SBMath.pas(1953,3) Error: Asm: [lea reg32,imm32] invalid combination of opcode and operands
SBMath.pas(1954,3) Error: Asm: [lea reg32,imm32] invalid combination of opcode and operands
SBMath.pas(1955,3) Error: Asm: [lea reg32,imm32] invalid combination of opcode and operands
SBMath.pas(1972,3) Error: Asm: [lea reg32,imm32] invalid combination of opcode and operands
SBMath.pas(1976,5) Error: Asm: [lea reg32,imm32] invalid combination of opcode and operands
SBMath.pas(1981,5) Error: Asm: [lea reg32,imm32] invalid combination of opcode and operands
SBMath.pas(1982,5) Error: Asm: [lea reg32,imm32] invalid combination of opcode and operands

(-Tlinux -XPi386-linux- -CpPENTIUMM -O2 -OoCSE -CfSSE2 -Ooorderfields)

2.)

On ARM, I get Internal error 200603253 at various places:

SBMath.pas(1989,1) Fatal: Internal error 200603253
(sadly the line numbers are completely off for unknown reasons, so I
cannot find the actual source line causing this)

It also happens at various other places. It is easiest to reproduce by
compiling PasZLib-SG (e.g. https://github.com/Soldat/PasZlib-SG).


Any clues?

BR,

Simon

Florian Klämpfl
2018-12-04 21:28:10 UTC
Permalink
Post by Simon Kissel
Hi Florian,
we are currently trying to do some real-life benchmarks with our
products; however, with rev. 40346 compilation fails with the following two
Do you compile with -Aas? The internal assemblers do not support TLS yet, this is WIP.
Simon Kissel
2018-12-04 22:48:15 UTC
Permalink
Hi Florian,
Post by Florian Klämpfl
Do you compile with -Aas? The internal assemblers do not support TLS yet, this is WIP.
Ah wow! -Aas does indeed help. Both the assembler errors and
the internal error are gone, both in Linux i386 and ARM.

And the created binaries even work. Nice! Thank you!

Cheers,

Simon

_______________________________________________
fpc-devel maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Simon Kissel
2018-10-28 12:00:25 UTC
Permalink
Hi Jonas,
Post by Jonas Maebe
Post by Simon Kissel
- Complete the LLVM branch of FPC. It looks like Jonas has stopped
working on it two years ago, which is a pity.
I didn't stop working on it, but I didn't make real progress anymore
either.
So, would you be interested in making progress again? :)
Post by Jonas Maebe
a) exception handling in general: indeed needs DWARF-EH support in the
RTL, and also support for the LLVM exception handling intrinsics in the
code generator. I've worked on and off on this and have some local
patches, but it's not complete
So maybe someone else could work on DWARF exceptions, which then
would enable you to progress on LLVM?
Post by Jonas Maebe
have been working a bit on it since then). This is not something that
can be changed/fixed in FPC, and is quite different from how FPC's
current code generator works (I don't know how Embarcadero deals with
it in their LLVM-based code generator).
Someone could do some reverse engineering to learn more
about how they have solved the problem (unlike actually copying
code I don't see any legal or ethical problem in learning from
reversing).

If the lone Embarcadero Russian Java-developer-turned-compiler
engineer can do it, you guys sure can, too ;)
Post by Jonas Maebe
Additionally, in the current FPC code generator global variables behave
mostly as volatile variables. With LLVM, that won't be the case (unless
we mark all of their accesses as volatile, but that would obviously
inhibit LLVM optimizations). This may break some multithreaded code that
currently works, and would probably require the introduction of a
volatile() operator (similar to the unaligned() one). On the other
hand, I already added support for tracking the volatile state of
references in the past, so that should be easy to do.
I have to admit my knowledge on this is very limited. We do
use global variables unsynchronized in multi-threaded code,
but only in single-writer multiple-reader scenarios; in these
cases we don't have any expectations for the new value to be
available "immediately". Obviously the compiler cannot know at
what point during runtime the thread gets scheduled, but what
are the rules (if any) on "how long" it takes for a volatile
variable's content to get "flushed"? Is there some scoping involved
like "on return of current function/method"?

Unlike the crap that Embarcadero has been polluting the language
with in recent years, I think that adding support for volatile()
to the language would make a lot of sense - however potentially
turning this around so that the unmodified default stays
volatile, and implementing an, erm... "non-volatile" modifier
instead, so as not to break existing code.

Cheers,

Simon

Sven Barth via fpc-devel
2018-10-28 13:22:12 UTC
Permalink
Post by Simon Kissel
Post by Jonas Maebe
Additionally, in the current FPC code generator global variables behave
mostly as volatile variables. With LLVM, that won't be the case (unless
we mark all of their accesses as volatile, but that would obviously
inhibit LLVM optimizations). This may break some multithreaded code that
currently works, and would probably require the introduction of a
volatile() operator (similar to the unaligned() one). On the other
hand, I already added support for tracking the volatile state of
references in the past, so that should be easy to do.
I have to admit my knowledge on this is very limited. We do
use global variables unsynchronized in multi-threaded code,
but only in single-writer multiple-reader scenarios; in these
cases we don't have any expectations for the new value to be
available "immediately". Obviously the compiler cannot know at
what point during runtime the thread gets scheduled, but what
are the rules (if any) on "how long" it takes for a volatile
variable's content to get "flushed"? Is there some scoping involved
like "on return of current function/method"?
What volatile means in this context is that the compiler always fetches
the global value anew when it is accessed instead of e.g. caching it in
a register which could be done if global variables would not be
considered as volatile.
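
What "fetching anew" means can be sketched in C, whose volatile qualifier has the same per-access meaning. This only illustrates the concept and says nothing about the code FPC or LLVM actually emit; the names stop_requested and poll_until_stopped are invented. Note that in C, volatile by itself gives no atomicity or ordering guarantees between threads.

```c
#include <stdbool.h>

/* Hypothetical flag: written by one thread, polled by others. */
volatile bool stop_requested = false;

/* Because stop_requested is volatile, the compiler must re-read it from
   memory on every loop iteration. Without the qualifier it would be allowed
   to read the flag once, keep it in a register, and never observe a change
   made elsewhere. */
int poll_until_stopped(int max_polls)
{
    int polls = 0;
    while (!stop_requested && polls < max_polls)
        polls++;
    return polls;
}
```

With a non-volatile flag, an optimizing compiler may hoist the load out of the loop, which is the kind of breakage in currently working multithreaded code that the quoted text warns about.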
Post by Simon Kissel
Unlike the crap that Embarcadero has been polluting the language
with in recent years, I think that adding support for volatile()
to the language would make a lot of sense - however potentially
turning this around so that the unmodified default stays
volatile, and implementing an, erm... "non-volatile" modifier
instead, so as not to break existing code.
It seems that Delphi changed the default behavior in their NextGen
compiler (probably due to the same reasons that Jonas stated for LLVM)
as they introduced a "[volatile]" compiler attribute to decorate global
variables and fields so that the compiler handles them as volatile...

Regards,
Sven
Jonas Maebe
2018-10-28 21:39:37 UTC
Permalink
Post by Simon Kissel
Hi Jonas,
[exceptions for invalid memory accesses]
Post by Jonas Maebe
have been working a bit on it since then). This is not something that
can be changed/fixed in FPC, and is quite different from how FPC's
current code generator works (I don't know how Embarcadero deals with
it in their LLVM-based code generator).
Someone could do some reverse engineering to learn more
about how they have solved the problem (unlike actually copying
code I don't see any legal or ethical problem in learning from
reversing).
Well, maybe they didn't... Optimizations based on undefined behaviour
are a major feature of LLVM.


Jonas
Jeppe Johansen
2018-10-25 18:30:21 UTC
Permalink
Post by Simon Kissel
- Must bring executable speed for non-Floating point load
on both multihreaded and non-multithreaded workloads to
the Speed of Kylix combined binaries
- Improvements should also help on ARM targets
- An LLVM-based solution must allow inline assembler for
all x86 and ARM
- Must be completed by February 2019
So, any suggestions on how to move forward on this?
Cheers,
Simon
Hi,

Can you create some benchmarks showing typical workloads that you
experience a large performance difference on?

Best Regards,
Jeppe

Simon Kissel
2018-10-28 15:04:57 UTC
Permalink
Hi,

I've packed together a minimal CrossKylix build that includes
the old Kylix 3 Open Edition, for those who wish to have a look
and/or later test the bounty test project (to be provided)
without violating any Borland (RIP) licenses.

Please note that this has only been tested using my CrossKylix
Linux emulation under Windows. The original dcc compiler may
or may not work under Linux. If it does, at the minimum you'll
probably need glibc.i686 libgcc.i686. That's also what
built executables will need on x64 platforms to work.

Here is how to use it:

- Don't install into a path containing spaces

- ckdcc.exe is the dcc compiler wrapped in a Linux syscall
emulation layer. Use it as you would use dcc.exe

- Project files need a special .conf file. See
examples/helloworld.conf, and adapt paths there.

- Afterwards ckdcc.exe examples\helloworld.dpr should work.

https://crosskylix.untergrund.net/kylix-open.zip

Simon

Adriaan van Os
2018-11-23 10:14:49 UTC
Permalink
Post by Simon Kissel
We know about a couple of bottlenecks (fpc_pushexceptaddr /
RelocateThreadVar etc) which explain FPC's terrible multi-threading
performance, but in general, FPC's code generator really is quite
a mess, which we learned the hard way a couple of years when we
did optimization work on the ARM target.
I find the phrase "FPC's terrible multi-threading performance" unjust. When I do multi-threading
with FPC, I get a near N speed improvement (on i386 and x86_64) where N is the number of cores,
including hyper-threaded cores ....

What about taking another way, having a precise look at the source code ? Did you profile it ? What
sort of work does the code do ? How are the threads synchronized ? What data structures are used ?

I don't take "the compiler is so bad" without an answer to these questions.

Regards,

Adriaan van Os

Sven Barth via fpc-devel
2018-11-23 13:12:38 UTC
Permalink
Post by Adriaan van Os
Post by Simon Kissel
We know about a couple of bottlenecks (fpc_pushexceptaddr /
RelocateThreadVar etc) which explain FPC's terrible multi-threading
performance, but in general, FPC's code generator really is quite
a mess, which we learned the hard way a couple of years when we
did optimization work on the ARM target.
I find the phrase "FPC's terrible multi-threading performance" unjust.
When I do multi-threading
with FPC, I get a near N speed improvement (on i386 and x86_64) where N is
the number of cores,
including hyper-threaded cores ....
What about taking another way, having a precise look at the source code ?
Did you profile it ? What
sort of work does the code do ? How are the threads synchronized ? What
data structures are used ?
I don't take "the compiler is so bad" without an answer to these questions.
Simon wrote that the same code performs better when compiled with Kylix, so
there definitely are things that FPC can do better and that, as Florian's
work on TLS variables showed, indeed *do* make FPC perform better. I suspect
a similar improvement with DWARF exceptions, as the setjmp/longjmp based
approach *is* more expensive when no exception occurs than marking
protected code in the metadata, as DWARF and SEH64 do.
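
Why the setjmp/longjmp scheme costs something even when nothing is raised can be sketched in C. This is a deliberately simplified, single-threaded illustration with invented names (current_frame, frames_pushed, guarded_divide), not FPC's RTL code: every entry into a protected region saves a jump buffer and links a handler frame, whether or not an error follows.

```c
#include <setjmp.h>
#include <stdlib.h>

static jmp_buf *current_frame;   /* innermost active handler (single-threaded sketch) */
int frames_pushed;               /* counts how often the setup cost was paid */

static void raise_error(void)
{
    if (current_frame == NULL)
        abort();                 /* no handler installed */
    longjmp(*current_frame, 1);  /* unwind to the innermost handler */
}

/* A "try ... except" around a division, in setjmp/longjmp style. */
int guarded_divide(int a, int b, int *result)
{
    jmp_buf frame;
    jmp_buf *outer = current_frame;
    int status;

    current_frame = &frame;      /* paid on every call, error or not */
    frames_pushed++;
    if (setjmp(frame) == 0) {
        if (b == 0)
            raise_error();
        *result = a / b;
        status = 0;
    } else {
        status = -1;             /* reached via longjmp */
    }
    current_frame = outer;       /* unlink the handler frame again */
    return status;
}
```

In a table-based scheme the protected range is described in static metadata instead, so the success path executes no setup code and the tables are consulted only once something is actually raised.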

Regards,
Sven
Simon Kissel
2018-11-23 13:36:01 UTC
Permalink
Hi Adriaan,
Post by Adriaan van Os
I find the phrase "FPC's terrible multi-threading performance"
unjust.
Well, see the complete thread to better understand what this
is about, and what progress is being made. So far a 20%
improvement has been made, which kinda is like a proof that
there was something to improve ;)
Post by Adriaan van Os
When I do multi-threading
with FPC, I get a near N speed improvement (on i386 and x86_64) where N is the number of cores,
including hyper-threaded cores ....
This isn't about FPC's code not scaling with N cores, it does.
It is about it being slow as soon as threads are used *at all*,
due to TLS stuff and exception handling. It's slow in a linear
fashion, so to say...

Best regards,

Simon

Adriaan van Os
2018-11-23 14:03:06 UTC
Permalink
Post by Simon Kissel
This isn't about FPC's code not scaling with N cores, it does.
It is about it being slow as soon as threads are used *at all*,
N cores being near N times faster than "not using threads at all".
Post by Simon Kissel
due to TLS stuff and exception handling. It's slow in a linear
fashion, so to say...
You didn't answer any of my questions. The goal is to get the code faster, isn't it? Or are you
writing an academic thesis on compilers?

Regards,

Adriaan van Os
Simon Kissel
2018-11-23 20:07:53 UTC
Permalink
Hi Adriaan,

In case you aren't just trolling and the subject really is of
interest to you, I would recommend reading the discussion
thread in full. That works much better than treating this
like a write-only system.
Post by Adriaan van Os
You didn't answer any of my questions. The goal is to get the
code faster, isn't it?
No, the goal is not to get any specific code faster. The goal
is to have the compiler and/or RTL improved so that all code
compiled benefits, and that execution speed in general gets on
par with the 15 years old Kylix/Delphi 7 compilers.

And yes, of course we have been profiling our code for years, and we
know what we are doing and talking about. Our code sadly does
not have any bottlenecks in the sense of a small number of
functions eating most of the CPU, the load is pretty evenly
distributed across all of the functions. This means that the
problem is distributed all across the code. However, there
is something sticking out, being at the very top of pretty
much all multi-threaded code we compile:

fpc_pushexceptaddr & CRelocateThreadVar.

Besides this, not everything can be uncovered by profiling,
and that part is nothing that FPC can change: On one of
the ARM platforms we use every context switch results in a
CPU cache flush, so simply by having more threads *all* of
them will become slower.

The benchmark code, like our real-life code, is able to utilize
~99% of the CPU, so no, it's also not a matter of thread
synchronization (we aren't spinlocking).

The commercial reason behind putting out a 15k bounty is that
no matter how much more money I invest into optimizing my
own code, it won't get much better than what it is today,
and that Kylix producing faster code does not compensate for it
not supporting any of the nice-to-have language features that
FPC has today.

Simon




Florian Klämpfl
2018-11-23 20:11:18 UTC
Permalink
Post by Simon Kissel
own code, it won't get much better than what it is today,
and that Kylix producing faster code does not compensate it
Well, to be fair, there is a lot of code out there where FPC is faster. Nevertheless, FPC's code can still be improved.
Simon Kissel
2018-11-27 20:35:37 UTC
Permalink
Hi guys,

that platform is not relevant for us, but to provide some motivational
boost:

CrossFPC 4.14 beta Win64:
C:\Users\BeRo\Documents\Projects\Tests\threadingtest0\aa>vipribenchmemcache_nodeps
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
...............................................................................................
Time: 5021ms = 9460267 pkts/s = 14363 MB/s

vs. Delphi 10.3 Win64:
C:\Users\BeRo\Documents\Projects\Tests\threadingtest0\aa>vipribenchmemcache_nodeps
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
..................................................
Time: 5086ms = 4915454 pkts/s = 7462 MB/s

:)

Best regards,

Simon

Adriaan van Os
2018-11-24 12:43:29 UTC
Permalink
Post by Simon Kissel
Hi Adriaan,
In case you aren't just trolling and the subject really is of
interest to you, I would recommend reading the discussion
thread in full. That works much better than treating this
like a write-only system.
In case you are just trolling, I recommend reading a book on programming, learning to write better
code.

Adriaan van Os

Tomas Hajny
2018-11-24 13:13:13 UTC
Permalink
Hello all,
Post by Adriaan van Os
In case you are just trolling, I recommend reading
a book on programming, learning to write better code.
Could we stop this, please? This is neither on topic, nor very polite,
especially after Simon explained that he had already spent effort on improving
his code, and also referenced a comparison to another compiler / RTL doing
a better job than FPC in that particular area. Simon's original sentence
about FPC's weakness might have been somewhat sharper than necessary, but
these follow-ups are useless. If you, Adriaan, believe you have discovered
an inefficiency in code posted by Simon, feel free to point it out
explicitly.

Thanks

Tomas
(one of FPC mailing list moderators)


Florian Klämpfl
2018-11-25 08:26:58 UTC
Permalink
Post by Simon Kissel
problem is distributed all across the code. However, there
is something sticking out, being at the very top of pretty
fpc_pushexceptaddr & CRelocateThreadVar.
The benchmark, however, does not reflect this.
Florian Klämpfl
2018-11-23 15:53:18 UTC
Permalink
Post by Simon Kissel
Hi Adriaan,
Post by Adriaan van Os
I find the phrase "FPC's terrible multi-threading performance"
unjust.
Well, see the complete thread to better understand what this
is about, and what progress is being made. So far a 20%
improvement has been made, which kinda is like a proof that
there was something to improve ;)
Post by Adriaan van Os
When I do multi-threading
with FPC, I get a near N speed improvement (on i386 and x86_64) where N is the number of cores,
including hyper-threaded cores ....
This isn't about FPC's code not scaling with N cores, it does.
It is about it being slow as soon as threads are used *at all*,
due to TLS stuff and exception handling. It's slow in a linear
fashion, so to say...
Actually, most of the improvements so far are not related to threading. In particular r40339 helped a lot; it was a bug
fix: the compiler assumed that a certain subexpression was written when it was not, and this prevented CSE.
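
As a generic illustration of what CSE buys (this is not the code touched by r40339, and the type and function names are invented): if the compiler wrongly believes a subexpression may have been written between two uses, it has to recompute it each time, as in the first function; once it knows the value is unchanged, it can compute it once, which the second function does by hand.

```c
struct channel {
    int *slots;    /* pointer to the packet slots */
    int  stride;   /* slots per packet */
};

/* As written in the source: the address computation appears twice. */
void store_pair_naive(struct channel *ch, int i, int v)
{
    ch->slots[i * ch->stride]     = v;
    ch->slots[i * ch->stride + 1] = v;
}

/* What CSE effectively produces: the shared subexpression
   (load of ch->slots, load of ch->stride, multiply) is evaluated once. */
void store_pair_cse(struct channel *ch, int i, int v)
{
    int *base = &ch->slots[i * ch->stride];
    base[0] = v;
    base[1] = v;
}
```

Both functions store the same values; the difference is only in how many loads and multiplies are executed, which adds up in hot loops such as a packet benchmark.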
Simon Kissel
2018-11-23 19:44:30 UTC
Permalink
Hi Florian,
Post by Florian Klämpfl
Actually, most of the improvements so far are not related to
threading. In particular r40339 helped a lot; it was a bug
fix: the compiler assumed that a certain subexpression was written
when it was not, and this prevented CSE.
Even better, that means there is still gold to be uncovered :)

In our case the bottleneck very clearly appears to be that
every call to fpc_pushexceptaddr/fpc_popaddrstack causes a
call to CRelocateThreadVar, which causes a call to
pthread_getspecific.
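
The cost difference behind this can be sketched in C. This is not FPC's RTL code, just an illustration with invented names: the first accessor goes through pthread_getspecific on every access, roughly like the call chain described above, while the second uses a compiler-supported thread-local variable (C11 _Thread_local), presumably the kind of access that TLS-based threadvars aim for.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t  counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    /* free() releases each thread's slot when the thread exits */
    pthread_key_create(&counter_key, free);
}

/* Key-based access: library calls on every single access. */
long bump_via_getspecific(void)
{
    long *slot;

    pthread_once(&counter_once, make_key);
    slot = pthread_getspecific(counter_key);
    if (slot == NULL) {                      /* first use in this thread */
        slot = calloc(1, sizeof *slot);
        if (slot == NULL)
            return -1;
        pthread_setspecific(counter_key, slot);
    }
    return ++*slot;
}

/* Native TLS: the compiler addresses the variable relative to the
   thread pointer, typically without any function call. */
static _Thread_local long tls_counter;

long bump_via_tls(void)
{
    return ++tls_counter;
}
```

If every implicit exception frame triggers an access of the first kind, the overhead is paid in each protected routine, which would match the profile described here.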

We do create our ARM production builds with {$IMPLICITEXCEPTIONS OFF}
to get acceptable speed, else it would be completely unbearable.

BR,

Simon
