Discussion:
[fpc-devel] FPC/Lazarus Rebuild performance
Adem
2010-09-10 15:43:59 UTC
Permalink
Some time ago, there was a brief mention that multi-threading FPC would be
counterproductive because the compilation process is mostly disk-IO bound
--this is what I understood, anyway.

I wanted to check to see if disk IO was really limiting FPC/Lazarus
compile performance.

The only quick way I could devise to check this was to use two disks
that differ significantly from one another in terms of performance.

And, to help with timing, I modified the mnuToolBuildLazarusClicked()
event (in \lazarus\ide\main.pp). [See below for code.]

Here is the test setup:

OS: Win7, x64, 8 GB RAM.
Disk1: SSD, OCZ Vertex, NTFS
Disk2: RAMDisk, Dataram, NTFS [
http://memory.dataram.com/products-and-services/software/ramdisk ]

To show that these disks perform significantly differently from one
another, I used 'ATTO Disk Benchmark' (default settings) and took the
max values for each disk.

SSD: Read: 200 MB/s Write: 198 MB/s
RAMDisk: Read: 2,227 MB/s Write: 1,545 MB/s

IOW, the RAMDisk is between 8 and 11 times faster than the SSD --significant
enough for my purposes.

I used this version of Lazarus: lazarus-0.9.28.2-fpc-2.2.4-win64.exe

Identical copies of (default setup of) Lazarus on both disks.
Nothing, other than the relevant paths altered by me.

Testing:

Rebuild Lazarus 10 times on each disk. Here are the values:

SSD:

01: 105,498 ms
02: 103,678 ms
03: 103,345 ms
04: 101,522 ms
05: 104,720 ms
06: 101,874 ms
07: 100,136 ms
08: 100,492 ms
09: 104,850 ms
10: 104,488 ms

RAMDisk:

01: 101,150 ms
02: 111,198 ms
03: 109,066 ms
04: 103,516 ms
05: 105,875 ms
06: 103,133 ms
07: 104,036 ms
08: 108,763 ms
09: 104,306 ms
10: 103,583 ms

Here are the average values:

SSD: 103,060 ms (1 min 43 sec)
RAMDisk: 105,463 ms (1 min 46 sec)

This doesn't make sense: FPC/Lazarus takes longer to compile on the
faster medium (albeit only by about 3 sec).

But, more than that, I can't see how FPC/Lazarus would be disk IO bound,
if it takes practically the same time on a medium that is 8-11 times
faster than the other one.

While doing the builds, I kept an eye on the other parameters too. This
is an 8-core machine, and never during this test did any core go beyond
20% load. Same with paging: it stayed practically flat.

If it isn't disk IO, if it isn't CPU load; then what is it?

RAM IO bound?

Could it be?

--------------------------------------------------------------
Here is the modified code:
--------------------------------------------------------------

procedure TMainIDE.mnuToolBuildLazarusClicked(Sender: TObject);
var
  StartTime1: TDateTime;
  EndTime1: TDateTime;
  Path1: String;
  Times1: TStringList;
  DoTime1: Boolean;
  MilliSecs1: Int64;
begin
  if MiscellaneousOptions.BuildLazOpts.ConfirmBuild then
    if MessageDlg(lisConfirmLazarusRebuild, mtConfirmation, mbYesNo, 0) <> mrYes then
      Exit;
  DoTime1 := MessageDlg('Do you also want to time the build process?',
    mtConfirmation, mbYesNo, 0) = mrYes;

  if DoTime1 then begin
    Times1 := TStringList.Create;
    StartTime1 := Now;
  end;

  try
    DoBuildLazarus([]);
  finally
    if DoTime1 then begin
      EndTime1 := Now;
      Path1 := IncludeTrailingPathDelimiter(ExtractFilePath(Application.ExeName)) +
        'BuildTime.txt';
      MilliSecs1 := MilliSecondsBetween(StartTime1, EndTime1);
      // Append this run's duration (in ms) to BuildTime.txt
      if FileExists(Path1) then
        Times1.LoadFromFile(Path1);
      Times1.Add(IntToStr(MilliSecs1));
      Times1.SaveToFile(Path1);
      Times1.Free;
    end;
  end;
end;
Jonas Maebe
2010-09-10 15:54:18 UTC
Permalink
Post by Adem
SSD: 103,060 ms (1 min 43 sec)
RAMDisk: 105,463 ms (1 min 46 sec)
This doesn't make sense. FPC/Lazarus compiles on the faster medium
longer (albeit only 3 sec.).
Everything on your SSD is cached in RAM, so it's normal that both are
about the same speed. The 3 seconds difference is probably just noise.


Jonas
Adem
2010-09-10 16:05:40 UTC
Permalink
Post by Adem
SSD: 103,060 ms (1 min 43 sec)
RAMDisk: 105,463 ms (1 min 46 sec)
This doesn't make sense. FPC/Lazarus compiles on the faster medium
longer (albeit only 3 sec.).
I am sorry, but what you've just said doesn't make sense either --at
least to me.

If that were true, wouldn't the disk benchmarks also be the same?
Jonas Maebe
2010-09-10 16:19:06 UTC
Permalink
Post by Adem
Post by Jonas Maebe
Post by Adem
SSD: 103,060 ms (1 min 43 sec)
RAMDisk: 105,463 ms (1 min 46 sec)
This doesn't make sense. FPC/Lazarus compiles on the faster medium
longer (albeit only 3 sec.).
Everything on your SSD is cached in RAM, so it's normal that both
are about the same speed. The 3 seconds difference is probably just
noise.
I am sorry, but what you've just said doesn't make sense either --at
least to me.
If that were true, wouldn't the disk benchmarks be also the same?
No. Disk benchmarks specifically disable the OS disk cache in order to
measure the actual disk throughput rather than just the speed of the
OS buffer cache.


Jonas
Daniel
2010-09-10 16:16:43 UTC
Permalink
AFAIR, the ATTO tool measures read and write bursts of single "files" of size X.

An interesting exercise is to transfer 1,000 files to a USB memory
stick in two situations:

- Compacted into a single file, the transfer runs at or near full USB speed.
- Spread out normally in a folder, it takes forever.

This happens because the time it takes to SWITCH from one file to
another is significant. Ending one operation (a single file transfer)
and beginning another takes a time slice, and summed over all files,
these start and finish operations add up to a significant share of the
total time.

I guess this kind of behaviour is exactly what you're noticing. The
raw I/O transfer speed of your SSD is already enough for your needs;
a faster channel won't help you. Some other factor accounts for your
lags.

You can't really judge a storage medium by a single kind of benchmark.
The compiler asks the storage medium to seek and switch files all the
time, and that takes time --much more time than blazing through a
single file the way ATTO does.
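The per-file overhead Daniel describes can be sketched with a small test (illustrative Python, not the thread's Pascal code; the file count and sizes are arbitrary): it writes the same total amount of data once as a single large file and once as many small files, timing both.

```python
import os
import tempfile
import time

def write_one_big_file(dirname, total_bytes):
    # One open/write/close: cost is dominated by raw throughput.
    start = time.perf_counter()
    with open(os.path.join(dirname, "big.bin"), "wb") as f:
        f.write(b"\0" * total_bytes)
    return time.perf_counter() - start

def write_many_small_files(dirname, count, size):
    # count * (open + write + close): cost is dominated by the
    # per-file "start and finish" operations, not by throughput.
    start = time.perf_counter()
    for i in range(count):
        with open(os.path.join(dirname, f"part{i}.bin"), "wb") as f:
            f.write(b"\0" * size)
    return time.perf_counter() - start

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        t_big = write_one_big_file(d, 1000 * 4096)
        t_small = write_many_small_files(d, 1000, 4096)
        print(f"1 x ~4 MB: {t_big:.4f}s, 1000 x 4 KB: {t_small:.4f}s")
```

On most filesystems the many-small-files case is noticeably slower for the same number of bytes, which is the effect described above; the exact ratio depends on the OS and filesystem.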
_______________________________________________
http://lists.freepascal.org/mailman/listinfo/fpc-devel
Adem
2010-09-11 05:25:03 UTC
Permalink
Post by Daniel
This happens because the time it takes to SWITCH between one file to
another is significant. Ending one operation (a single file transfer)
and beginning another takes a time slice. Summing up all these start
and finish ops takes a significant time slice.
I wonder if all this means compilation/build speeds cannot be improved
much by introducing faster hardware.

To see what's happening in the bigger picture, I used 'Process Monitor'
[ http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx ] and
filtered out those I believed to be irrelevant to FPC/Lazarus build
operation.

Here are the results I see in Process Monitor (an 'event' is
Create/Open/Read/Write/Delete, etc.).

I am pasting the summaries below, hoping they help someone identify a
bottleneck.

*Count of occurrences by EventClass:*
File System: 324,958
Registry: 25,165
Process: 4,577

*Count of occurrences by ProcessName (12 processes):*
ppcx64.exe: 275,003
make.exe: 27,825
rm.exe: 17,973
fpc.exe: 9,110
lazarus.exe: 6,742
gorc.exe: 6,719
conhost.exe: 5,034
gdate.exe: 3,122
pwd.exe: 1,483
startlazarus.exe: 1,102
csrss.exe: 946
cmd.exe: 786

*Count of occurrences by Result:*
SUCCESS: 300,998
NO SUCH FILE: 26,919
NAME NOT FOUND: 15,446
BUFFER OVERFLOW: 3,110
REPARSE: 2,846
FILE LOCKED WITH ONLY READERS: 2,469
NO MORE FILES: 2,308
NAME INVALID: 667
PATH NOT FOUND: 538
NO MORE ENTRIES: 327
END OF FILE: 213
INVALID PARAMETER: 32
NOT REPARSE POINT: 10
NAME COLLISION: 7
IS DIRECTORY: 5
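As a side note, tallies like the ones above can be reproduced from a Process Monitor CSV export with a short script (illustrative Python; the column names "Process Name" and "Result" are assumptions based on Procmon's default CSV export, and "Logfile.CSV" is a hypothetical file name):

```python
import csv
from collections import Counter

def count_by_column(csv_path, column):
    # Tally how often each value appears in the given column of a
    # Process Monitor CSV export. utf-8-sig tolerates a leading BOM,
    # which Windows tools commonly emit.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        return Counter(row[column] for row in csv.DictReader(f))

# Example usage (hypothetical export file):
# for name, n in count_by_column("Logfile.CSV", "Process Name").most_common():
#     print(f"{name}: {n}")
```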
Marco van de Voort
2010-09-10 16:27:57 UTC
Permalink
Post by Adem
I wanted to check to see if disk IO was really limiting FPC/Lazarus
compile performance.
The only quick way I could devise to check this was to use two different
disks which are significantly different from one another in terms of
performance.
Both are atypical, memory-technology-based devices which probably have low
seek times compared to hard disks.

With the total FPC/Lazarus sources on the order of 100 MB, this result is
to be expected; a test based on raw read/write bandwidth is not even
necessary to determine that the bottleneck is not raw read/write
performance.

This is because even a budget hard disk still does 80-100 MB/s, yet the
overall build time is nowhere near the 1-2 s magnitude that raw reading
would suggest.

The I/O bottleneck is thus more the searching for and opening of files,
as well as, on Windows, executing programs (.exe files).
Martin Schreiber
2010-09-11 06:55:14 UTC
Permalink
Post by Adem
Sometime ago, there was a brief mention of multi-threading FPC would be
counter productive because compilation process was mostly disk IO bound
--this is what I understood anyway.
I wanted to check to see if disk IO was really limiting FPC/Lazarus
compile performance.
Interestingly, Delphi 7 compiles about 10 times faster than FPC on the
same machine.
http://www.mail-archive.com/fpc-devel%40lists.freepascal.org/msg08029.html
Results with more code and FPC 2.4:
http://thread.gmane.org/gmane.comp.ide.mseide.user/18797
One would think Delphi and FPC need the same disk IO?

Martin
Jonas Maebe
2010-09-11 09:32:38 UTC
Permalink
Post by Martin Schreiber
Interesting is that Delphi 7 compiles about 10 times faster than FPC on the
same machine.
http://www.mail-archive.com/fpc-devel%40lists.freepascal.org/msg08029.html
http://thread.gmane.org/gmane.comp.ide.mseide.user/18797
One would think Delphi and FPC need the same disk IO?
First of all, they don't, unless Delphi's source/DCU searching and DCU loading logic is identical to FPC's.

Secondly, even *if* FPC (due to its design) is currently mainly limited in speed by I/O, and *if* parallelising would not help much for that reason, it can still also be slower than Delphi in other ways. Since Delphi 7 does not use parallel compilation (AFAIK), that's in fact a given.

So yes, FPC is slower than Delphi. Would parallelising FPC reduce the speed gap? Maybe (more likely for hot compiles), maybe not (more likely for cold compiles).


Jonas
Martin Schreiber
2010-09-11 10:23:36 UTC
Permalink
Post by Jonas Maebe
So yes, FPC is slower than Delphi. Would parallelising FPC reduce the speed
gap?
Because the gap is so big, I think not substantially.
Given that (please correct me if I am wrong):

- FPC's bottleneck is disk IO, not compiler logic and calculation.
- Delphi is much faster than FPC.

-> Delphi uses a file-access approach which performs much better than
FPC's.

Or it isn't true that FPC's bottleneck is disk IO. Are we absolutely sure
about the bottleneck?

Martin
Jonas Maebe
2010-09-11 10:37:14 UTC
Permalink
Post by Martin Schreiber
or it isn't true that FPC bottleneck is disk IO. Are we absolutely sure about
the bottleneck?
I'm quite certain that there are many reasons why FPC compiles more slowly than Delphi. The bottlenecks probably also vary from platform to platform and from compilation scenario to compilation scenario. I'm quite sure there is not a single thing you can "fix" and suddenly get compilation speeds in the same ballpark as Delphi 7 (how does the compilation speed of current Delphi versions compare to Delphi 7, btw? I read that Delphi XE compiles much faster than Delphi 2010 in some cases, but I did not see comparisons to Delphi 7).

One thing that could be done is adding a linear scan register allocator (it would result in slightly worse code than the current register colouring, but it executes more quickly).

Going further, general restructuring of the compiler for compilation speed reasons would only be acceptable if it does not negatively impact the maintainability. There is a reason why we can support 6 architectures and umpteen OSes in the compiler with only a handful of people.


Jonas
Sergei Gorelkin
2010-09-11 12:53:00 UTC
Permalink
Post by Jonas Maebe
Post by Martin Schreiber
or it isn't true that FPC bottleneck is disk IO. Are we absolutely sure about
the bottleneck?
I'm quite certain that there are many reasons that FPC compiles more slowly than Delphi. The bottlenecks probably also vary from platform to platform and from compilation scenario to compilation scenario. I'm quite sure there is not a single thing you can "fix" and suddenly get compilation speeds in the same ballpark as Delphi 7 (how does the compilation speed of current Delphi's compare to Delphi 7 btw? I read that Delphi XE compiles much faster than Delphi 2010 in some cases, but I did not see comparisons to Delphi 7).
One thing that could be done is adding a linear scan register allocator (it would result in slightly worse code than the current register colouring, but it executes more quickly).
Going further, general restructuring of the compiler for compilation speed reasons would only be acceptable if it does not negatively impact the maintainability. There is a reason why we can support 6 architectures and umpteen OSes in the compiler with only a handful of people.
One idea that comes up at this point is to put the PPU data directly into
the object files, so the number of output files is simply halved. The PPU
data could be placed in a section that is ignored by the linker. However,
I don't know if this is possible on all platforms.

Regards,
Sergei
Michael Van Canneyt
2010-09-11 14:06:14 UTC
Permalink
Post by Sergei Gorelkin
Post by Jonas Maebe
Post by Martin Schreiber
or it isn't true that FPC bottleneck is disk IO. Are we absolutely sure
about the bottleneck?
I'm quite certain that there are many reasons that FPC compiles more slowly
than Delphi. The bottlenecks probably also vary from platform to platform
and from compilation scenario to compilation scenario. I'm quite sure there
is not a single thing you can "fix" and suddenly get compilation speeds in
the same ballpark as Delphi 7 (how does the compilation speed of current
Delphi's compare to Delphi 7 btw? I read that Delphi XE compiles much
faster than Delphi 2010 in some cases, but I did not see comparisons to
Delphi 7).
One thing that could be done is adding a linear scan register allocator (it
would result in slightly worse code than the current register colouring,
but it executes more quickly).
Going further, general restructuring of the compiler for compilation speed
reasons would only be acceptable if it does not negatively impact the
maintainability. There is a reason why we can support 6 architectures and
umpteen OSes in the compiler with only a handful of people.
One idea that comes at this point is to put PPU data directly into object
files, so the number of output files is reduced plain twice. The PPU data
could be placed into a section that is ignored by linker. However I don't
know if this is possible for all platforms.
Not every platform may support this, so if you want to go down this path,
why not do the opposite, as Delphi does? Putting the .o data at the end
allows the compiler to simply not read that data when reading the PPU.
Only when actually linking do you need to extract all the .o data from the PPUs.

Michael.
Sergei Gorelkin
2010-09-11 14:43:05 UTC
Permalink
Post by Michael Van Canneyt
Post by Sergei Gorelkin
One idea that comes at this point is to put PPU data directly into
object files, so the number of output files is reduced plain twice.
The PPU data could be placed into a section that is ignored by linker.
However I don't know is this is possible for all platforms.
Not each platform may support this, so if you want to go down this path,
why not do the opposite, as Delphi ? Putting the .o data at the end
allows the compiler to simply not read that data when reading the PPU.
Only when actually linking do you need to extract all .o data from the ppus.
That is an option, too. But given that linking typically takes place at
every build (as opposed to compiling), I am afraid the total number of file
operations on platforms without a built-in linker would not be noticeably
lower than it is currently.


Regards,
Sergei
Juha Manninen (gmail)
2010-09-11 10:25:14 UTC
Permalink
Post by Martin Schreiber
Post by Adem
Sometime ago, there was a brief mention of multi-threading FPC would be
counter productive because compilation process was mostly disk IO bound
--this is what I understood anyway.
I wanted to check to see if disk IO was really limiting FPC/Lazarus
compile performance.
Interesting is that Delphi 7 compiles about 10 times faster than FPC on the
same machine.
http://www.mail-archive.com/fpc-devel%40lists.freepascal.org/msg08029.html
http://thread.gmane.org/gmane.comp.ide.mseide.user/18797
One would think Delphi and FPC need the same disk IO?
I read the threads. My guess is also that the slowness comes from searching
for and writing many files in big directory structures. It is slow even if
the files are cached. Also, starting a new process is slow.
These OS kernel tasks are difficult to measure, and process monitors don't
give reliable results.
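The process-startup cost mentioned above can be made concrete with a small sketch (illustrative Python, not FPC code; the absolute numbers depend entirely on OS and hardware): it spawns a trivial child process repeatedly and averages the full create/execute/exit cycle, which approximates the fixed cost paid every time make launches the compiler, rm, pwd, etc.

```python
import subprocess
import sys
import time

def measure_spawn_overhead(runs=20):
    # Spawn a trivial Python child 'runs' times and average the cost of
    # one full process create/execute/exit cycle.
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs

if __name__ == "__main__":
    avg = measure_spawn_overhead()
    print(f"average process startup+teardown: {avg * 1000:.1f} ms")
```

Even a few tens of milliseconds per spawn adds up quickly when a build launches thousands of short-lived processes, as the Process Monitor counts earlier in the thread suggest.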

Suggestion:
Create an API for integrating FPC with IDEs and special "make" programs.
The API would pass info about exact file names and locations.
It could also pass whole source buffers in memory.

Then build FPC as a dynamic shared library. There would be 2 FPC binaries
then: the traditional executable, and a shared library to be called from
external programs.

For example, the Lazarus IDE already scans a lot of information about a
project's files and directories. That info could be "easily" passed to the
compiler. Codetools in Lazarus already parses lots of code; the whole parsed
interface section could be passed to the compiler (symbol table and whatnot)
... but that is the next step; let's stick with file info for now.

Then there would be a new dedicated build program which reads all the
project info first and then calls the compile function (not a separate
process) in the shared lib for each source file.

No expensive process startups, no searching huge directory structures for
include files again for each source file.
I bet it would make a BIG difference in speed.
Delphi must be doing something like this, although I don't know the details.
After that, it would make sense to make the compiler multi-threaded. It
could scale almost linearly with CPU cores (maybe).

I haven't seen such ideas on these mailing lists. Is it possible I am the
first one to have them? I don't believe so, because the idea is so obvious.
If there is already such development with the new make tools, then sorry
for my ignorance.


Juha
Martin Schreiber
2010-09-11 10:40:26 UTC
Permalink
Post by Juha Manninen (gmail)
Post by Martin Schreiber
One would think Delphi and FPC need the same disk IO?
I read the threads. My guess is also that the slowness comes from searching
and writing many files in big directory structures. It is slow even if the
files are cached. Also starting a new process is slow.
These OS kernel tasks are difficult to measure and process monitors don't
give reliable results.
Create an API for integrating FPC with IDEs and special "make" programs.
The API would pass info about exact file names and locations.
It could also pass the whole source memory buffers.
And why does the Delphi commandline compiler (dcc32) not need this IDE
assistance?

Martin
Juha Manninen (gmail)
2010-09-11 10:55:25 UTC
Permalink
Post by Martin Schreiber
And why does the Delphi commandline compiler (dcc32) not need this IDE
assistance?
My guess is that dcc32 works as an integrated make program + compiler and
thus doesn't start external processes for each file.
Or, if it does start an external process, it can use some (hidden)
temporary file with pre-scanned info about the project, so the compiler
would only open one "info" file instead of scanning the whole search paths.

I used the IDE always when working with Delphi and don't really know dcc32.
Guessing only.

Juha
Michael Van Canneyt
2010-09-11 14:02:52 UTC
Permalink
Post by Juha Manninen (gmail)
Post by Martin Schreiber
And why does the Delphi commandline compiler (dcc32) not need this IDE
assistance?
My guess is that dcc32 works as an integrated make program + compiler and thus
doesn't start external processes for each file.
No, it does not.
dcc32 compiles one file only, but it does compile any additional units it needs.

You'll need makefiles as well if you use dcc32 (or any other build tool).

I have an extended build system using dcc32, and it easily takes up to 15
minutes to compile a 1.5-million-line project.

Michael.
Mattias Gaertner
2010-09-11 14:12:10 UTC
Permalink
On Sat, 11 Sep 2010 16:02:52 +0200 (CEST)
Post by Michael Van Canneyt
Post by Juha Manninen (gmail)
Post by Martin Schreiber
And why does the Delphi commandline compiler (dcc32) not need this IDE
assistance?
My guess is that dcc32 works as an integrated make program + compiler and thus
doesn't start external processes for each file.
No it does not.
dcc32 compiles 1 file only, but does compile any additional units it needs.
You'll need makefiles as well if you use dcc32 (or any other build tool).
I have an extended build system using dcc32, and it takes easily up to 15
minutes to compile a 1.5 million lines project.
Maybe dcc32 likes the MSEgui sources.

Martin, can you give a comparison between win32 and Linux 32?


Mattias
Graeme Geldenhuys
2010-09-11 16:37:49 UTC
Permalink
Post by Mattias Gaertner
Martin, can you give a comparison between win32 and Linux 32?
Add to that.... Martin, I know MSEgui is compilable with FPC and
Delphi. Is MSEgui compilable with Kylix 3 too? Then one could do a
Delphi 7 vs Kylix 3 comparison as well - seeing that Kylix 3 is
supposed to be on-par with Delphi 7. I've got Kylix 3 Enterprise
running in a VM here, I also have Delphi 7 available in a VM.
--
Regards,
  - Graeme -


_______________________________________________
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
Martin Schreiber
2010-09-11 17:11:29 UTC
Permalink
Post by Graeme Geldenhuys
Post by Mattias Gaertner
Martin, can you give a comparison between win32 and Linux 32?
Add to that.... Martin, I know MSEgui is compilable with FPC and
Delphi. Is MSEgui compilable with Kylix 3 too?
In theory yes; I have not tested it recently. Probably some modifications
to the Linux MSEgui are necessary by now.
Martin Schreiber
2010-09-11 17:50:41 UTC
Permalink
Post by Mattias Gaertner
Maybe dcc32 likes the MSEgui sources.
Or maybe FPC does not like MSEgui sources. ;-)
Post by Mattias Gaertner
Martin, can you give a comparison between win32 and Linux 32?
I don't have a working Kylix 3 environment at the moment. IIRC dcc32 on Linux
and Windows had about the same compiling performance.

Building MSEide with FPC fixes_2_4, about 366'000 lines:
Windows NTFS:
smartlink 50.4 sec
no smartlink 50.5 sec

Same machine, Linux ReiserFS:
smartlink 126 sec
no smartlink 62.7 sec.

Martin
Florian Klämpfl
2010-09-11 18:27:46 UTC
Permalink
Post by Martin Schreiber
Post by Mattias Gaertner
Maybe dcc32 likes the MSEgui sources.
Or maybe FPC does not like MSEgui sources. ;-)
Post by Mattias Gaertner
Martin, can you give a comparison between win32 and Linux 32?
I don't have a working Kylix 3 environment at the moment. IIRC dcc32 on Linux
and Windows had about the same compiling performance.
smartlink 50.4 sec
no smartlink 50.5 sec
What machine? Because with a hot disk cache, I just built MSEide in about
10 s (15 s cold) on W7 64-bit:

...
Linking mseidefp.exe
308574 lines compiled, 10.6 sec , 2577952 bytes code, 1618920 bytes data
196 warning(s) issued
414 note(s) issued
Martin Schreiber
2010-09-11 18:50:57 UTC
Permalink
Post by Florian Klämpfl
What machine? Because with hot disk cache, I just build MSEide in about
The same as for all other tests,
win2000, AMD Athlon XP 3000+, 1GB RAM
Post by Florian Klämpfl
...
Linking mseidefp.exe
308574 lines compiled, 10.6 sec , 2577952 bytes code, 1618920 bytes data
Then Delphi 7 probably takes about 1 sec on your machine. ;-)

Martin
Florian Klämpfl
2010-09-11 19:10:20 UTC
Permalink
Post by Martin Schreiber
Post by Florian Klämpfl
What machine? Because with hot disk cache, I just build MSEide in about
The same as for all other tests,
win2000, AMD Athlon XP 3000+, 1GB RAM
Post by Florian Klämpfl
...
Linking mseidefp.exe
308574 lines compiled, 10.6 sec , 2577952 bytes code, 1618920 bytes data
Then Delphi7 probably uses about 1 sec on your machine. ;-)
...
mseide.pas(63)
280491 lines, 2.18 seconds, 2110568 bytes code, 752073 bytes data.

Anyway, before this ends in an endless discussion: if anybody is
interested in improving FPC's compilation speed (for my needs it is
sufficient), have a look at FillChar and at FPC's unit loading
algorithm (not the actual I/O itself, but how all the symbols,
classes, etc. are restored). These two points are bottlenecks which
might help when they are improved, though it's pretty unlikely that
this will improve things by more than a few percent.
Jonas Maebe
2010-09-11 20:04:19 UTC
Permalink
Post by Florian Klämpfl
Anyway, before this ends in an endless discussion: if anybody is
interested in improving FPC's compilation speed (for my needs it is
sufficient), have a look at FillChar and at FPC's unit loading
algorithm (not the actual i/o itself but how all the symbols,
classes etc. are restored). These two points are bottlenecks which
might help when they are improved, though it's pretty unlikely that
this will improve things by more than a few percent.
Note that 2.5.1 will already be somewhat faster because of r15604 (which removes a lot of calls to fillchar from the register allocator) and r15515 (optimizations to sysfreemem_fixed).

Attached you can find the time profile of the current FPC svn trunk compiling mseide svn trunk under Mac OS X (after some minor changes to the source code to get it to compile). This is with assembling and linking disabled to minimise interference (but including the external assembler writer). I've expanded some call stacks to show where the generic routines are called from.

At the top level, the % means "x% of the execution time was spent in this routine". At the lower levels, it means "x% of the execution time spent in the top-level routine was due to calls from this routine".

As you can see, there is not really any one hot routine that can be optimised to speed up everything a lot.


Jonas
Florian Klämpfl
2010-09-11 20:59:41 UTC
Permalink
Post by Jonas Maebe
Post by Florian Klämpfl
Anyway, before this ends in an endless discussion: if anybody is
interested in improving FPC's compilation speed (for my needs it is
sufficient), have a look at FillChar and at FPC's
unit loading algorithm (not the actual i/o itself but how all the
symbols, classes etc. are restored). These two points are bottlenecks
which might help when they are improved, though it's pretty
unlikely that this will improve things by more than a few percent.
Note that in 2.5.1 it will already be somewhat faster because of
r15604 (removes a lot of calls to fillchar from the register
allocator) and r15515 (optimizations to sysfreemem_fixed).
True, but I fear that at least the generic i386 FillChar is slower than the
Mac OS X system routine __bzero. But I also agree that there won't be a
big improvement from improving FillChar.
Graeme Geldenhuys
2010-09-11 23:20:18 UTC
Permalink
On 11 September 2010 21:10, Florian Klämpfl <***@freepascal.org> wrote:

FPC
-------
Post by Florian Klämpfl
Post by Florian Klämpfl
Linking mseidefp.exe
308574 lines compiled, 10.6 sec , 2577952 bytes code, 1618920 bytes data
Delphi
---------
Post by Florian Klämpfl
mseide.pas(63)
280491 lines, 2.18 seconds, 2110568 bytes code, 752073 bytes data.
Now this is weird! Has anybody else spotted the difference? Delphi seems
to compile about 28,000 fewer lines than FPC! Florian, I presume it's the
same machine with the same MSEgui source code revision? What would be
the reason for that?

Would that (lines compiled) also account for the huge difference in
"bytes data"? What is the final executable size (not that this matters
much to me) generated by the two compilers?

Martin, do you get the same results for "lines compiled"?
--
Regards,
  - Graeme -


Martin Schreiber
2010-09-12 06:21:00 UTC
Permalink
Now this is weird! Anybody else spotted the difference? Delphi seems
to compile +-28000 lines less that FPC! Florian, I presume it's the
same machine with the same MSEgui source code revision? What would be
the reason for that?
Would that (lines compiled) also account for the huge difference in
"bytes data". What is the final executable size (not that this matters
much to me) generated by the two compilers?
Martin, do you get the same results for "lines compiled"?
The MSEide Delphi version has no DB components. In order to switch off the
non-Delphi components in FPC mode, some defines must be set.
This is with MSEide+MSEgui SVN trunk rev.3910:

Delphi:
280496 lines, 4.49 seconds, 2127360 bytes code, 752073 bytes data.
12.09.2010 07:01 3'129'856 mseide.exe

FPC 2.4.0:
281604 lines compiled, 40.2 sec , 2136496 bytes code, 1541192 bytes data
12.09.2010 07:09 3'687'228 mseidefp.exe

FPC 2.4.0 with debug info:
281604 lines compiled, 54.4 sec , 2142896 bytes code, 1541336 bytes data
12.09.2010 07:16 41'759'691 mseidefp.exe

Commandline Delphi:
dcc32 -B -I..\..\lib\common\kernel -U..\..\lib\common\kernel
-U..\..\lib\common\kernel\i386-win32 -U..\..\lib\common\image
-U..\..\lib\common\widgets -U..\..\lib\common\designutils
-U..\..\lib\common\sysutils -U..\..\lib\common\editwidgets
-U..\..\lib\common\dialogs -U..\..\lib\common\regcomponents
-U..\..\lib\common\serialcomm -U..\..\lib\common\printer
-U..\..\lib\common\ifi -U..\..\lib\common\math -dmse_no_db
-dmse_no_opengl mseide.pas

Commandline FPC:
ppc386.exe -O2 -CX -XX -Xs -B -I..\..\lib\common\kernel
-Fu..\..\lib\common\kernel -Fu..\..\lib\common\kernel\i386-win32
-Fu..\..\lib\common\image -Fu..\..\lib\common\widgets
-Fu..\..\lib\common\designutils -Fu..\..\lib\common\sysutils
-Fu..\..\lib\common\editwidgets -Fu..\..\lib\common\dialogs
-Fu..\..\lib\common\regcomponents -Fu..\..\lib\common\serialcomm
-Fu..\..\lib\common\printer -Fu..\..\lib\common\ifi
-Fu..\..\lib\common\math -dmse_no_db -dmse_no_opengl -omseidefp.exe
mseide.pas

Commandline FPC with debug info:
ppc386.exe -O2 -CX -XX -gl -B -I..\..\lib\common\kernel
-Fu..\..\lib\common\kernel -Fu..\..\lib\common\kernel\i386-win32
-Fu..\..\lib\common\image -Fu..\..\lib\common\widgets
-Fu..\..\lib\common\designutils -Fu..\..\lib\common\sysutils
-Fu..\..\lib\common\editwidgets -Fu..\..\lib\common\dialogs
-Fu..\..\lib\common\regcomponents -Fu..\..\lib\common\serialcomm
-Fu..\..\lib\common\printer -Fu..\..\lib\common\ifi
-Fu..\..\lib\common\math -dmse_no_db -dmse_no_opengl -omseidefp.exe
mseide.pas

Martin
Florian Klämpfl
2010-09-12 08:15:11 UTC
Permalink
Post by Adem
FPC
-------
Post by Florian Klämpfl
Post by Florian Klämpfl
Linking mseidefp.exe
308574 lines compiled, 10.6 sec , 2577952 bytes code, 1618920 bytes data
Delphi
---------
Post by Florian Klämpfl
mseide.pas(63)
280491 lines, 2.18 seconds, 2110568 bytes code, 752073 bytes data.
Now this is weird! Anybody else spotted the difference? Delphi seems
to compile +-28000 lines less that FPC! Florian, I presume it's the
same machine with the same MSEgui source code revision? What would be
the reason for that?
Ifdef'ed unit usage/includes?
Martin Schreiber
2010-09-12 05:33:55 UTC
Permalink
Post by Florian Klämpfl
Anyways, before this ends in an endless discussion: if anybody is
interested in improving FPC compilation speed (for my needs it is
sufficient), have a look at fillchar and at FPC's unit
loading algorithm (not the actual I/O itself, but how all the symbols,
classes etc. are restored). These two points are bottlenecks which might
help when they are improved, though it's pretty unlikely that this will
improve things by more than a few percent.
Agreed. My opinion is that before we start to implement difficult and
error-prone multi-threading into FPC, we should find out why the hell Delphi 7
can compile so much faster and produce even better code.

Martin
Mattias Gaertner
2010-09-12 07:27:58 UTC
Permalink
On Sun, 12 Sep 2010 07:33:55 +0200
Post by Martin Schreiber
Post by Florian Klämpfl
Anyways, before this ends in an endless discussion: if anybody is
interested in improving FPC compilation speed (for my needs it is
sufficient), have a look at fillchar and at FPC's unit
loading algorithm (not the actual I/O itself, but how all the symbols,
classes etc. are restored). These two points are bottlenecks which might
help when they are improved, though it's pretty unlikely that this will
improve things by more than a few percent.
Agreed. My opinion is that before we start to implement difficult and
error-prone multi-threading into FPC, we should find out why the hell Delphi 7
can compile so much faster and produce even better code.
Seeing that dcc is only 800K,
maybe it fits into the CPU cache.

Mattias
Florian Klämpfl
2010-09-12 08:12:59 UTC
Permalink
Post by Martin Schreiber
Post by Florian Klämpfl
Anyways, before this ends in an endless discussion: if anybody is
interested in improving FPC compilation speed (for my needs it is
sufficient), have a look at fillchar and at FPC's unit
loading algorithm (not the actual I/O itself, but how all the symbols,
classes etc. are restored). These two points are bottlenecks which might
help when they are improved, though it's pretty unlikely that this will
improve things by more than a few percent.
Agreed. My opinion is that before we start to implement difficult and
error-prone multi-threading into FPC, we should find out why the hell Delphi 7
can compile so much faster
For the same reason it seems to take years to port Delphi to
64 bit: different design goals. It seems speed was the only design
goal; nobody cared about maintainability or portability.
Martin Schreiber
2010-09-12 08:21:20 UTC
Permalink
Post by Florian Klämpfl
Post by Martin Schreiber
Agreed. My opinion is that before we start to implement difficult and
error-prone multi-threading into FPC we should find out why the hell
Delphi 7 can compile so much faster
For the same reason it seems to take years to port Delphi to
64 bit: different design goals. It seems speed was the only design
goal; nobody cared about maintainability or portability.
And that results in a discrepancy of a factor of 5..10? I can't believe it.

Martin
Florian Klämpfl
2010-09-12 08:29:32 UTC
Permalink
Post by Martin Schreiber
Post by Florian Klämpfl
Post by Martin Schreiber
Agreed. My opinion is that before we start to implement difficult and
error-prone multi-threading into FPC we should find out why the hell
Delphi 7 can compile so much faster
For the same reason it seems to take years to port Delphi to
64 bit: different design goals. It seems speed was the only design
goal; nobody cared about maintainability or portability.
And that results in a discrepancy of a factor of 5..10? I can't believe it.
Digging out 1.0.10 and using an extreme example:

C:\fpc\tests\webtbs>"c:\pp 1.0.10\bin\win32\ppc386.exe" tw2242 -O2
Free Pascal Compiler version 1.0.10 [2003/06/27] for i386
Copyright (c) 1993-2003 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242.pp
Linking tw2242.exe
13083 Lines compiled, 0.8 sec

C:\fpc\tests\webtbs>fpc tw2242 -O2
Free Pascal Compiler version 2.4.0 [2009/12/18] for i386
Copyright (c) 1993-2009 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242.pp
Linking tw2242.exe
13083 lines compiled, 4.7 sec , 301376 bytes code, 1864 bytes data
Martin Schreiber
2010-09-12 08:39:54 UTC
Permalink
Post by Florian Klämpfl
Post by Martin Schreiber
And that results in a discrepancy of a factor of 5..10? I can't believe it.
C:\fpc\tests\webtbs>"c:\pp 1.0.10\bin\win32\ppc386.exe" tw2242 -O2
Free Pascal Compiler version 1.0.10 [2003/06/27] for i386
Copyright (c) 1993-2003 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242.pp
Linking tw2242.exe
13083 Lines compiled, 0.8 sec
C:\fpc\tests\webtbs>fpc tw2242 -O2
Free Pascal Compiler version 2.4.0 [2009/12/18] for i386
Copyright (c) 1993-2009 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242.pp
Linking tw2242.exe
13083 lines compiled, 4.7 sec , 301376 bytes code, 1864 bytes data
Impressive. Now we can hook in. Where is the difference? What makes 2.4.0 so
much slower?
Jonas Maebe
2010-09-12 08:51:44 UTC
Permalink
Post by Martin Schreiber
Post by Florian Klämpfl
Post by Martin Schreiber
And that results in a discrepancy of a factor of 5..10? I can't believe it.
C:\fpc\tests\webtbs>"c:\pp 1.0.10\bin\win32\ppc386.exe" tw2242 -O2
Free Pascal Compiler version 1.0.10 [2003/06/27] for i386
Copyright (c) 1993-2003 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242.pp
Linking tw2242.exe
13083 Lines compiled, 0.8 sec
C:\fpc\tests\webtbs>fpc tw2242 -O2
Free Pascal Compiler version 2.4.0 [2009/12/18] for i386
Copyright (c) 1993-2009 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242.pp
Linking tw2242.exe
13083 lines compiled, 4.7 sec , 301376 bytes code, 1864 bytes data
Impressive. Now we can hook in. Where is the difference? What makes 2.4.0 so
much slower?
In the above case: primarily the register allocator (which I mentioned before).


Jonas
Florian Klämpfl
2010-09-12 09:12:07 UTC
Permalink
Post by Martin Schreiber
Post by Florian Klämpfl
Post by Martin Schreiber
And that results in a discrepancy of a factor of 5..10? I can't believe it.
C:\fpc\tests\webtbs>"c:\pp 1.0.10\bin\win32\ppc386.exe" tw2242 -O2
Free Pascal Compiler version 1.0.10 [2003/06/27] for i386
Copyright (c) 1993-2003 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242.pp
Linking tw2242.exe
13083 Lines compiled, 0.8 sec
C:\fpc\tests\webtbs>fpc tw2242 -O2
Free Pascal Compiler version 2.4.0 [2009/12/18] for i386
Copyright (c) 1993-2009 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242.pp
Linking tw2242.exe
13083 lines compiled, 4.7 sec , 301376 bytes code, 1864 bytes data
Impressive. Now we can hook in. Where is the difference? What makes 2.4.0 so
much slower?
This is a very specific example which makes the slowness of 2.x rather
simple to explain: the reason is a decision driven by maintainability
and portability. 2.x uses a so-called graph-colouring register allocator,
while 1.x used a pretty simple register allocator specifically tailored
to i386.

The 2.x register allocator is more robust (no more internalerror 10s),
it is small (basically 2k lines, compiler/rgobj.pas), and it generates
reasonable register allocations on all types of CPUs we support (remember,
FPC supports CPUs with high register pressure like i386 as well as those
with a lot of registers, like PowerPC), so we need to maintain only
one register allocator.
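[Editorial sketch] Florian's contrast between the two allocator designs is easier to picture with a toy example. The code below is illustrative only (the names and structure are hypothetical, not FPC's rgobj.pas): it runs the classic Chaitin-style simplify/select phases of graph colouring on a tiny interference graph.

```python
# Toy graph-colouring register allocation, in the spirit of what
# Florian describes for FPC 2.x. Hypothetical example, not compiler code.

def colour(interference, k):
    """Assign one of k physical registers ('colours') to each virtual
    register so that interfering registers get different colours.
    Returns None if the graph cannot be coloured (a spill is needed)."""
    # Simplify phase: repeatedly remove a node with degree < k.
    graph = {v: set(n) for v, n in interference.items()}
    stack = []
    while graph:
        node = next((v for v in graph if len(graph[v]) < k), None)
        if node is None:
            return None  # every node has degree >= k: must spill
        stack.append((node, graph.pop(node)))
        for nbrs in graph.values():
            nbrs.discard(node)
    # Select phase: pop nodes and pick the lowest colour not used by
    # any already-coloured neighbour.
    colours = {}
    for node, nbrs in reversed(stack):
        used = {colours[n] for n in nbrs if n in colours}
        colours[node] = min(c for c in range(k) if c not in used)
    return colours

# A chain of overlapping live ranges v0-v1-v2-v3: two colours suffice.
interference = {
    "v0": ["v1"], "v1": ["v0", "v2"], "v2": ["v1", "v3"], "v3": ["v2"],
}
print(colour(interference, 2))
```

The expensive part in practice is building and iterating the interference graph, which grows with the square of the number of simultaneously live virtual registers; that is what bites on huge procedures.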
Martin Schreiber
2010-09-12 12:50:59 UTC
Permalink
Post by Florian Klämpfl
The 2.x register allocator is more robust (no more internalerror 10s),
it is small (basically 2k lines, compiler/rgobj.pas), and it generates
reasonable register allocations on all types of CPUs we support (remember,
FPC supports CPUs with high register pressure like i386 as well as those
with a lot of registers, like PowerPC), so we need to maintain only
one register allocator.
I replaced the "+=" by ":= s +"; test results on the same machine as before:

Delphi 7:
E:\FPC\svn\fixes_2_4\tests\webtbf>dcc32 tw2242x.pp
Borland Delphi Version 15.0
Copyright (c) 1983,2002 Borland Software Corporation
tw2242x.pp(20009)
20010 lines, 0.19 seconds, 311088 bytes code, 1801 bytes data.

E:\FPC\svn\fixes_2_4\tests\webtbf>

FPC:
E:\FPC\svn\fixes_2_4\tests\webtbf>ppc386 tw2242x.pp
Free Pascal Compiler version 2.4.0 [2009/12/18] for i386
Copyright (c) 1993-2009 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242x.pp
tw2242x.pp(16386,7) Fatal: Procedure too complex, it requires too many
registers

Fatal: Compilation aborted

Truncated at line 16380:
Delphi 7:
E:\FPC\svn\fixes_2_4\tests\webtbf>dcc32 tw2242xtrunc.pp
Borland Delphi Version 15.0
Copyright (c) 1983,2002 Borland Software Corporation
tw2242xtrunc.pp(16382)
16383 lines, 0.16 seconds, 256684 bytes code, 1801 bytes data.

FPC:
E:\FPC\svn\fixes_2_4\tests\webtbf>ppc386 tw2242xtrunc.pp
Free Pascal Compiler version 2.4.0 [2009/12/18] for i386
Copyright (c) 1993-2009 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242xtrunc.pp
Linking tw2242xtrunc.exe
16381 lines compiled, 12.3 sec , 370736 bytes code, 1864 bytes data

Hmm. ;-)
Please take it with humor. :-)

Martin
Sergei Gorelkin
2010-09-12 14:10:45 UTC
Permalink
Post by Martin Schreiber
E:\FPC\svn\fixes_2_4\tests\webtbf>dcc32 tw2242xtrunc.pp
Borland Delphi Version 15.0
Copyright (c) 1983,2002 Borland Software Corporation
tw2242xtrunc.pp(16382)
16383 lines, 0.16 seconds, 256684 bytes code, 1801 bytes data.
E:\FPC\svn\fixes_2_4\tests\webtbf>ppc386 tw2242xtrunc.pp
Free Pascal Compiler version 2.4.0 [2009/12/18] for i386
Copyright (c) 1993-2009 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242xtrunc.pp
Linking tw2242xtrunc.exe
16381 lines compiled, 12.3 sec , 370736 bytes code, 1864 bytes data
Hmm. ;-)
Please take it with humor. :-)
Does that happen because of SSA? I mean, it looks like a new register is allocated for every
statement until the limit of 16384 is hit. At the same time, this procedure compiles into a sequence of
calls, which doesn't require registers at all (except to place function arguments).

Regards,
Sergei
Jonas Maebe
2010-09-12 14:15:35 UTC
Permalink
Post by Martin Schreiber
Hmm. ;-)
Please take it with humor. :-)
No humor is necessary. Delphi probably uses a linear-scan register allocator, which (as I mentioned before) generates somewhat worse code in the general case (assuming implementations of equal quality), but which is much faster than graph colouring (especially for very large procedures, of which tw2242 is an extreme test case).
Post by Martin Schreiber
Does that happen because of the SSA? I mean, it looks like a new register is allocated for every statement until limit of 16384 is hit.
No, that's unrelated to SSA (or even graph colouring). Also, I think the limit is 65535 rather than 16384.


Jonas
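[Editorial sketch] The linear-scan strategy Jonas suspects Delphi of using makes a single pass over live intervals sorted by start point, which is why it stays fast even on pathological procedures like tw2242. A minimal sketch under that assumption (hypothetical, not dcc32's actual algorithm):

```python
# Toy linear-scan register allocation (Poletto & Sarkar style): one
# sorted pass over live intervals, O(n log n) overall, which is why it
# scales to huge procedures where graph colouring becomes expensive.
# Illustrative only, not Delphi's real allocator.

def linear_scan(intervals, num_regs):
    """intervals: {vreg: (start, end)}. Returns {vreg: reg or 'spill'}."""
    free = list(range(num_regs))
    active = []            # (end, vreg) pairs currently holding a register
    assignment = {}
    for vreg, (start, end) in sorted(intervals.items(), key=lambda i: i[1][0]):
        # Expire intervals that ended before this one starts.
        for e, old in list(active):
            if e < start:
                active.remove((e, old))
                free.append(assignment[old])
        if free:
            assignment[vreg] = free.pop()
            active.append((end, vreg))
        else:
            assignment[vreg] = "spill"   # no heuristic: just spill the newcomer
    return assignment

# Three overlapping intervals, two registers: one of them is spilled.
print(linear_scan({"a": (0, 10), "b": (2, 4), "c": (3, 9)}, 2))
```

The trade-off Jonas describes falls out directly: no interference graph is ever built, so allocation quality suffers for tangled live ranges, but the cost per statement stays near constant.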
Jonas Maebe
2010-09-12 14:18:25 UTC
Permalink
Post by Jonas Maebe
Post by Martin Schreiber
Does that happen because of the SSA? I mean, it looks like a new register is allocated for every statement until limit of 16384 is hit.
No, that's unrelated to SSA (or even graph colouring). Also, I think the limit is 65535 rather than 16384.
Well, not entirely unrelated (more virtual registers are used with SSA), but even without SSA you run out of virtual registers after a while.


Jonas
Florian Klämpfl
2010-09-12 16:29:34 UTC
Permalink
Post by Martin Schreiber
E:\FPC\svn\fixes_2_4\tests\webtbf>ppc386 tw2242x.pp
Free Pascal Compiler version 2.4.0 [2009/12/18] for i386
Copyright (c) 1993-2009 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242x.pp
tw2242x.pp(16386,7) Fatal: Procedure too complex, it requires too many
registers
This could be fixed for another speed penalty :)
Post by Martin Schreiber
Please take it with humor. :-)
As long as the compiler itself builds on a reasonable machine in less
than 10 seconds, I'm happy :)
Martin Schreiber
2010-09-12 16:39:33 UTC
Permalink
Post by Florian Klämpfl
Post by Martin Schreiber
Please take it with humor. :-)
As long as the compiler itself builds on a reasonable machine in less
than 10 seconds, I'm happy :)
Yup, I know. But there are people who use FPC for other tasks than compiling
FPC and there are people who have no reasonable machine. ;-)

Martin
Florian Klämpfl
2010-09-12 17:16:45 UTC
Permalink
Post by Martin Schreiber
Post by Florian Klämpfl
Post by Martin Schreiber
Please take it with humor. :-)
As long as the compiler itself builds on a reasonable machine in less
than 10 seconds, I'm happy :)
Yup, I know. But there are people who use FPC for other tasks than compiling
FPC and there are people who have no reasonable machine. ;-)
Those can always use FPC 1.x or Delphi7 :) There is no free lunch ...
Hans-Peter Diettrich
2010-09-13 21:38:40 UTC
Permalink
Post by Florian Klämpfl
This is a very specific example which makes the slowness of 2.x rather
simple to explain: the reason is a decision driven by maintainability
and portability. 2.x uses a so-called graph-colouring register allocator,
while 1.x used a pretty simple register allocator specifically tailored
to i386.
Shouldn't we make the register allocator configurable, so that e.g.
non-release builds can become faster, and several replacements can be
tested easily?

The same for other parts of the compiler, where the time-per-task is the
first information required to detect real bottlenecks, and to check
alternative solutions.

DoDi
Florian Klaempfl
2010-09-14 14:59:41 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Florian Klämpfl
This is a very specific example which makes the slowness of 2.x rather
simple to explain: the reason is a decision driven by maintainability
and portability. 2.x uses a so-called graph-colouring register allocator,
while 1.x used a pretty simple register allocator specifically tailored
to i386.
Shouldn't we make the register allocator configurable, so that e.g.
non-release builds can become faster, and several replacements can be
tested easily?
Well, as usual: somebody has to implement one. Problem is also: using
e.g. a different register allocator for -O- and -O2 will result in less
testing by users of one or the other.

Hans-Peter Diettrich
2010-09-12 07:06:03 UTC
Permalink
Post by Florian Klämpfl
Anyways, before this ends in an endless discussion: if anybody is
interested in improving FPC compilation speed (for my needs it is
sufficient), have a look at fillchar
IMO FillChar is not the bottleneck; instead it's the access to newly
allocated memory in/around InitInstance, resulting in page faults.

DoDi
Mattias Gaertner
2010-09-11 18:38:33 UTC
Permalink
On Sat, 11 Sep 2010 19:50:41 +0200
Post by Martin Schreiber
Post by Mattias Gaertner
Maybe dcc32 likes the MSEgui sources.
Or maybe FPC does not like MSEgui sources. ;-)
Post by Mattias Gaertner
Martin, can you give a comparison between win32 and Linux 32?
I don't have a working Kylix 3 environment at the moment. IIRC dcc32 on Linux
and Windows had about the same compiling performance.
smartlink 50.4 sec
no smartlink 50.5 sec
smartlink 126 sec
no smartlink 62.7 sec.
Strange.
Here FPC is much faster, even on a 3-year-old Linux machine:

343520 lines compiled, 16.2 sec
258 warning(s) issued
570 note(s) issued

real 0m16.249s
user 0m14.953s
sys 0m0.904s

time /usr/lib/fpc/2.4.0/ppc386 -omseide -Fu...msegui/lib/addon/*/ -Fi...msegui/lib/addon/*/ -Fu...msegui/lib/common/kernel/i386-linux/ -Fu...msegui/lib/common/kernel/ -Fi...msegui/lib/common/kernel/ -Fu...msegui/lib/common/*/ -l -Mobjfpc -Sh -gl -O- mseide.pas


Mattias
Dimitri Smits
2010-09-11 23:31:43 UTC
Permalink
Post by Marco van de Voort
Post by Juha Manninen (gmail)
Post by Martin Schreiber
One would think Delphi and FPC need the same disk IO?
I read the threads. My guess is also that the slowness comes from searching
and writing many files in big directory structures. It is slow even if the
files are cached. Also starting a new process is slow.
These OS kernel tasks are difficult to measure and process monitors don't
give reliable results.
Create an API for integrating FPC with IDEs and special "make" programs.
The API would pass info about exact file names and locations.
It could also pass the whole source memory buffers.
And why does the Delphi commandline compiler (dcc32) not need this IDE
assistance?
It does. The Delphi IDE passes extra assumptions/directories that the commandline tool does not know about (for instance $(DELPHI)/Projects/Bpl).

Juha's idea is the way Borland did it with D7: a shared lib with an API plus a commandline tool. Unfortunately the DLL is not linked from the cmdline tool, so it seems those two are compiled separately and statically linked with 'the same' code.

Don't know for sure if that is still the case with 2010/XE.

kind regards,
Dimitri Smits
Martin Schreiber
2010-09-12 05:21:07 UTC
Permalink
Post by Dimitri Smits
Post by Martin Schreiber
And why does the Delphi commandline compiler (dcc32) not need this IDE
assistance?
it does. Delphi IDE passes extra assumptions/directories that the
commandline tool does not know about (for instance $(DELPHI)/Projects/Bpl).
The comparisons I made were with dcc32 not Delphi IDE. The commands are
here:
http://www.mail-archive.com/fpc-devel%40lists.freepascal.org/msg08029.html

Martin
Dimitri Smits
2010-09-11 23:44:41 UTC
Permalink
Post by Juha Manninen (gmail)
Post by Martin Schreiber
And why does the Delphi commandline compiler (dcc32) not need this IDE
assistance?
My guess is that dcc32 works as an integrated make program + compiler
and thus doesn't start external processes for each file.
Or, if it starts an external process then it can use some (hidden)
temporary file with pre-scanned info of the project. So the compiler
would only open one "info" file instead of scanning the whole search paths.
I used the IDE always when working with Delphi and don't really know
dcc32. Guessing only.
No, the bin directory of D7 is filled with extra .exe's (tlib, tasm32, ...) and dcc32.exe is < 800K.

kind regards,
Dimitri Smits
Dimitri Smits
2010-09-11 23:48:55 UTC
Permalink
Post by Juha Manninen (gmail)
Post by Martin Schreiber
And why does the Delphi commandline compiler (dcc32) not need this IDE
assistance?
My guess is that dcc32 works as an integrated make program + compiler
and thus doesn't start external processes for each file.
No it does not.
dcc32 compiles 1 file only, but does compile any additional units it
needs.
You'll need makefiles as well if you use dcc32 (or any other build tool).
I have an extended build system using dcc32, and it takes easily up to
15 minutes to compile a 1.5 million line project.
That is my experience as well (using 'want.exe', aka Windows 'ant').

It implies, though, that you can simply compile your .dpk and .dpr files to get everything compiled ;-)

kind regards,
Dimitri Smits
Dimitri Smits
2010-09-12 02:50:47 UTC
Permalink
Post by Florian Klämpfl
Post by Martin Schreiber
Post by Florian Klämpfl
What machine? Because with hot disk cache, I just build MSEide in about
Post by Martin Schreiber
The same as for all other tests,
win2000, AMD Athlon XP 3000+, 1GB RAM
Post by Florian Klämpfl
...
Linking mseidefp.exe
308574 lines compiled, 10.6 sec , 2577952 bytes code, 1618920 bytes data
Post by Martin Schreiber
Then Delphi7 probably uses about 1 sec on your machine. ;-)
...
mseide.pas(63)
280491 lines, 2.18 seconds, 2110568 bytes code, 752073 bytes data.
Anyways, before this ends in an endless discussion: if anybody is
interested in improving FPC compilation speed (for my needs it is
sufficient), have a look at fillchar and at FPC's unit
loading algorithm (not the actual I/O itself, but how all the symbols,
classes etc. are restored). These two points are bottlenecks which might
help when they are improved, though it's pretty unlikely that this will
improve things by more than a few percent.
I am not as intimately familiar with the compiler internals as most of you, but the writing/reading of those .ppu files looks to me like a prime candidate.

Other things that I do not know whether they are already implemented this way:
1) If the .ppu/.ppl + .o are merged and they give better results, then why not go the extra mile and implement packages (or at least the .dcp)? Although I fail to see why (except for linking) both have to be opened during compilation. With packages I mean the .dcp/.bpl equivalent.
It is my understanding that a .dcp is like a .lib plus all the stuff in the unit info (.dcu/.ppu), and of course the simpler .bpl = .dll "with extras". That way, you need to dive into far fewer files.

2) When fetching/checking the files, does it happen according to the unit & include paths, with those paths appended with target OS and architecture, for every file?
Wouldn't it be more efficient to fetch/cache some dir info first for all files in those dirs and then fetch the necessary info from there? For instance: get filenames + size + timestamp for all .ppu or .pas files first.

3) It is my understanding that the .ppu structures are written with size footprint in mind, not processing efficiency? Maybe the structures need changing too.
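[Editorial sketch] Point 2 above (read each directory once and answer lookups from the cached listing, instead of probing every candidate path per file) could look roughly like this. This is a hypothetical helper, not FPC's actual search code; the case-insensitive matching mimics Windows filename semantics.

```python
# Sketch of Dimitri's suggestion: instead of probing every candidate
# file in every unit path, list each directory once and answer lookups
# from the cached listing. Hypothetical helper, not FPC's search code.
import os

class DirCache:
    def __init__(self):
        self._listings = {}   # dir -> {lowercased name: real name}

    def find(self, search_dirs, filename):
        """Return the full path of filename in the first dir containing it."""
        want = filename.lower()
        for d in search_dirs:
            if d not in self._listings:   # one readdir per directory, ever
                try:
                    names = os.listdir(d)
                except OSError:
                    names = []
                self._listings[d] = {n.lower(): n for n in names}
            real = self._listings[d].get(want)
            if real is not None:
                return os.path.join(d, real)
        return None
```

With N unit-path directories and M units to locate, this does N directory reads instead of up to N*M per-file existence probes. A real implementation would also have to invalidate cache entries when the compiler itself writes new .ppu files.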

kind regards,
Dimitri Smits
Marco van de Voort
2010-09-12 08:25:22 UTC
Permalink
Post by Mattias Gaertner
Post by Martin Schreiber
Agreed. My opinion is that before we start to implement difficult and
error-prone multi-threading into FPC, we should find out why the hell Delphi 7
can compile so much faster and produce even better code.
maybe it fits into the cpu cache.
I assume dcc.exe uses more data than code :-)
Mattias Gaertner
2010-09-12 10:11:42 UTC
Permalink
On Sun, 12 Sep 2010 10:25:22 +0200 (CEST)
Post by Marco van de Voort
Post by Mattias Gaertner
Post by Martin Schreiber
Agreed. My opinion is that before we start to implement difficult and
error-prone multi-threading into FPC, we should find out why the hell Delphi 7
can compile so much faster and produce even better code.
maybe it fits into the cpu cache.
I assume dcc.exe uses more data than code :-)
CPU caches do not work FIFO.
If FPC does not fit into the CPU cache, then the CPU constantly has to
load code in addition to the data.

Mattias
Marco van de Voort
2010-09-12 08:43:43 UTC
Permalink
Post by Martin Schreiber
Post by Florian Klämpfl
Anyways, before this ends in an endless discussion: if anybody is
interested in improving FPC compilation speed (for my needs it is
sufficient), have a look at fillchar and at FPC's unit
loading algorithm (not the actual I/O itself, but how all the symbols,
classes etc. are restored). These two points are bottlenecks which might
help when they are improved, though it's pretty unlikely that this will
improve things by more than a few percent.
Agreed. My opinion is that before we start to implement difficult and
error-prone multi-threading into FPC, we should find out why the hell Delphi 7
can compile so much faster and produce even better code.
I partially agree with you in the fact that the exact reasons are not known.

I'm no expert on profiling the compiler, but if I read the various threads
over the years I see defensive and conflicting statements:

In discussions with Hans, it is said that I/O is not a factor, since after
one run everything is cached anyway, and then in this thread I/O is to blame
for a huge difference in speed.

The same with the fact that we use shortstring for performance in many
places where delphi in fact allows longer mangled names and is faster.

That leaves the maintainability bit. I think that is certainly true,
and it probably can be tweaked to be a bit better. It won't be easy though,
and is more likely to be a lot of things that each add a little than a few
that matter a lot. Which makes it a permanent position instead of a one-off
effort. Any takers?

As far as the I/O performance discussion goes, some observations:

- To find something interesting, probably the test has to be done for
various sizes of code. To see if maybe some startup factor (Rather than
actual processing) is a cause.
- FPC has to read at least twice the number of files (.ppu/.o). So if
opening or number of files in the dir path is a factor, it will do
worse by default.
- Possibly, a default Windows installation searches through a larger unit
path (more dirs, more files) than a default Delphi one.
- For that to be found, more has to be known about dcc.cfg during the
tests.
- This means that tests will have to be repeated with various sizes
of unitpath trees (both dirs and files) to see if this is a factor.
- Most profiling recently afaik has been done by Jonas, and thus not
on Windows. Yet the delphi comparisons are on windows.
- Actually linking should be avoided in the test. It might obscure
things.
Jonas Maebe
2010-09-12 09:41:05 UTC
Permalink
Post by Marco van de Voort
I'm no expert on profiling the compiler, but if I read the various threads
In discussions with Hans, it is said that I/O is not a factor, since after
one run everything is cached anyway, and then in this thread I/O is to blame
for a huge difference in speed.
Disk throughput doesn't really matter. Reading directory contents, getting file information and opening/closing files is another matter.
Post by Marco van de Voort
The same with the fact that we use shortstring for performance in many
places where delphi in fact allows longer mangled names and is faster.
That's a non sequitur.
Post by Marco van de Voort
- Possibily, a defaultly installed windows searches through a larger unit
path (more dirs, more file) than Delphi _default_
I've also been thinking that.
Post by Marco van de Voort
- Most profiling recently afaik has been done by Jonas, and thus not
on Windows. Yet the delphi comparisons are on windows.
There's a free profiler for Windows by AMD: http://developer.amd.com/cpu/codeanalyst/Pages/default.aspx


Jonas
Jonas Maebe
2010-09-12 09:53:53 UTC
Permalink
Post by Jonas Maebe
There's a free profiler for Windows by AMD: http://developer.amd.com/cpu/codeanalyst/Pages/default.aspx
And by Microsoft: http://msdn.microsoft.com/en-us/performance/cc825801.aspx


Jonas
Adem
2010-09-12 17:05:15 UTC
Permalink
Post by Jonas Maebe
Post by Jonas Maebe
There's a free profiler for Windows by AMD: http://developer.amd.com/cpu/codeanalyst/Pages/default.aspx
And by Microsoft: http://msdn.microsoft.com/en-us/performance/cc825801.aspx
It's a 2.5G download --running in the background as I write.

But, do you know if it takes into account processes (exe's) started by
the process (exe) being profiled?

Also, does anyone know if Intel's vTune would be useful?
Jonas Maebe
2010-09-12 17:14:04 UTC
Permalink
Post by Adem
Post by Jonas Maebe
Post by Jonas Maebe
There's a free profiler for Windows by AMD: http://developer.amd.com/cpu/codeanalyst/Pages/default.aspx
And by Microsoft: http://msdn.microsoft.com/en-us/performance/cc825801.aspx
It's a 2.5G download --running in the background as I write.
But, do you know if it takes into account processes (exe's) started by the process (exe) being profiled?
I know next to nothing about Windows development since I don't use Windows. I just googled for "profiling windows" and followed the links from the first result I got (http://stackoverflow.com/questions/67554/whats-the-best-free-c-profiler-for-windows-if-there-are). Besides, FPC on Windows does not start any other executables when compiling programs.

And note that even with a profiler you have to know what you should measure, what is relevant and what the results mean before you can draw any conclusions (just like with the disk benchmarking you did). If you don't know anything about that, read the manual/docs. For example, the profiling results I posted earlier to this list do not say anything about the influence of I/O since they were based on sampling the program code executing every 1 millisecond.


Jonas
Adem
2010-09-12 21:01:47 UTC
Permalink
Post by Jonas Maebe
Besides, FPC on Windows does not start any other executables when compiling programs
You might be making a distinction (between compiling and building) here,
but when I press 'Rebuild Lazarus' in that menu, the list of executables
called is below [numbers represent 'events'].

ppcx64.exe: 274,889
make.exe: 27,664
rm.exe: 17,968
fpc.exe: 8,992
gorc.exe: 6,718
lazarus.exe: 6,593
conhost.exe: 4,751
gdate.exe: 3,122
pwd.exe: 1,483
startlazarus.exe: 1,092
cmd.exe: 786
csrss.exe: 642

here, for example, the 'events' for make.exe:

QueryDirectory: 6,502
CreateFile: 5,800
CloseFile: 4,916
CreateFileMapping: 1,591
RegOpenKey: 1,483
QueryNameInformationFile: 727
QueryOpen: 675
RegCloseKey: 643
RegQueryValue: 591
QueryBasicInformationFile: 554
RegSetInfoKey: 552
ReadFile: 511
Load Image: 431
QueryStandardInformationFile: 367
RegQueryKey: 341
QueryFileInternalInformationFile: 318
SetBasicInformationFile: 318
QueryInformationVolume: 288
QueryAttributeInformationVolume: 276
QuerySecurityFile: 228
QueryAttributeTagFile: 205
FileSystemControl: 126
Process Create: 113
RegEnumKey: 26
RegEnumValue: 26
Process Exit: 13
Process Start: 13
Thread Create: 13
Thread Exit: 13
WriteFile: 2
SetDispositionInformationFile: 1
SetEndOfFileInformationFile: 1

I am not sure what all those do, but 'Load Image: 431' seems to mean
'make.exe' is run 431 times.
Post by Jonas Maebe
And note that even with a profiler you have to know what you should measure, what is relevant and what the results mean before you can draw any conclusions (just like with the disk benchmarking you did). If you don't know anything about that, read the manual/docs. For example, the profiling results I posted earlier to this list do not say anything about the influence of I/O since they were based on sampling the program code executing every 1 millisecond.
ATM, I am mainly interested in finding out how many other external
processes Lazarus depends on (and how many times it calls them) when
compiling/building an exe. For that, a process explorer should be
sufficient.
Jonas Maebe
2010-09-12 23:47:54 UTC
Permalink
Post by Adem
Post by Jonas Maebe
Besides, FPC on Windows does not start any other executables when compiling programs
You might be making a distinction (between compiling and building) here,
but when I press 'Rebuild Lazarus' in that menu, the list of executables called is below [numbers represent 'events'].
That's indeed not FPC starting executables, that's Lazarus invoking "make" (which in turn invokes tons of other stuff).
Post by Adem
I am not sure what all those do, but 'Load Image: 431' seems to mean 'make.exe' is run 431 times.
make indeed works by recursively executing itself. And that is a known problem on Windows, because that platform is extremely slow at starting new processes for some reason. Some people have worked on alternatives (http://fastmake.org/, http://benjamin.smedbergs.us/pymake/), but afaik none of them can currently deal with everything that appears in FPC's makefiles.


Jonas
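[Editorial sketch] The process-creation cost Jonas blames for recursive make being slow on Windows is easy to measure directly. A rough sketch; the absolute numbers vary a lot between Windows and Unix, and between cold and warm caches:

```python
# Rough measurement of process-spawn overhead: each iteration starts a
# trivial child process (the Python interpreter running "pass") and
# waits for it, then the average per-spawn cost is reported.
import subprocess
import sys
import time

def spawn_cost(runs=20):
    """Average seconds per start-and-wait of a trivial child process."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs

if __name__ == "__main__":
    print(f"~{spawn_cost() * 1000:.1f} ms per process start")
```

Multiply that per-spawn cost by the hundreds of make/rm/fpc invocations in Adem's event list above and the Windows rebuild penalty becomes plausible even before any compilation work happens.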
Adem
2010-09-12 16:59:56 UTC
Permalink
Post by Jonas Maebe
Post by Marco van de Voort
In discussions with Hans, it is said that I/O is not a factor, since after
one run everything is cached anyway, and then in this thread I/O is to blame
for a huge difference in speed.
Disk throughput doesn't really matter. Reading directory contents, getting file information and opening/closing files is another matter.
My experience seems to confirm this.

I ran the same tests with Lazarus installed on a NAS.

As the connection was 2x Gigabit (port-trunked), read/write speeds were
higher than on my local HDD.

Yet, rebuild times doubled.

It seems latency plays a more significant role than throughput.
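[Editorial sketch] That conclusion (latency over throughput) can be sanity-checked with a crude benchmark separating per-file overhead from raw sequential throughput. This is a sketch, not a rigorous measurement: on high-latency storage such as a NAS the first number explodes while the second can stay high.

```python
# Crude check of the latency-vs-throughput observation: time many
# open/stat/close operations against one large sequential read.
import os
import tempfile
import time

def per_file_overhead(directory, runs=200):
    """Average time to open+stat+close the first file in a directory."""
    name = os.path.join(directory, os.listdir(directory)[0])
    start = time.perf_counter()
    for _ in range(runs):
        with open(name, "rb") as f:
            os.fstat(f.fileno())
    return (time.perf_counter() - start) / runs

def throughput(path):
    """MB/s for one sequential read of the whole file."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        size = len(f.read())
    return size / max(time.perf_counter() - start, 1e-9) / 1e6

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "blob.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(4 * 1024 * 1024))
    print(f"open/close: {per_file_overhead(d) * 1e6:.0f} us, "
          f"read: {throughput(path):.0f} MB/s")
```

A compiler that touches thousands of small files pays the first cost thousands of times, which is exactly why the ATTO-style sequential numbers earlier in this thread did not predict rebuild times.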
Marco van de Voort
2010-09-12 08:44:54 UTC
Permalink
Post by Florian Klämpfl
C:\fpc\tests\webtbs>"c:\pp 1.0.10\bin\win32\ppc386.exe" tw2242 -O2
Free Pascal Compiler version 1.0.10 [2003/06/27] for i386
Copyright (c) 1993-2003 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242.pp
Linking tw2242.exe
13083 Lines compiled, 0.8 sec
C:\fpc\tests\webtbs>fpc tw2242 -O2
Free Pascal Compiler version 2.4.0 [2009/12/18] for i386
Copyright (c) 1993-2009 by Florian Klaempfl
Target OS: Win32 for i386
Compiling tw2242.pp
Linking tw2242.exe
13083 lines compiled, 4.7 sec , 301376 bytes code, 1864 bytes data
Do both numbers include linking?
Marco van de Voort
2010-09-12 11:21:43 UTC
Permalink
Post by Mattias Gaertner
Post by Marco van de Voort
Post by Mattias Gaertner
maybe it fits into the cpu cache.
I assume dcc.exe uses more data than code :-)
CPU caches do not work FIFO.
I assume not, since the administration overhead would be too large.
Post by Mattias Gaertner
If FPC does not fit into the CPU cache, then the CPU has to constantly
load code mem additionally to the data.
- only the hot path has to fit. It makes no sense to cache e.g. the code
for processing parameter options.
- all caches are afaik unified nowadays, so code and data dynamically
share the cache

It could be that the working set is larger for FPC, but I doubt you can see
this from the .exe size.
Dimitri Smits
2010-09-12 16:24:02 UTC
Permalink
Post by Marco van de Voort
I partially agree with you in the fact that the exact reasons are not
known.
I'm no expert on profiling the compiler, but if I read the various threads
In discussions with Hans, it is said that I/O is not a factor, since after
one run everything is cached anyway, and then in this thread I/O is to
blame for a huge difference in speed.
that may be the case for reading, not necessarily for the files being written.
in ppu.pas, everything you "put" results in a blockwrite of x bytes. Wouldn't a cached memory stream be better, not resulting in those int21h calls or windows equivalent calls?
Haven't looked at .s creation (don't know where to start looking, but I guess it works the same way?)
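Dimitri's question can be sketched outside the compiler. A toy comparison (Python; record count and format are made up, and note Florian's reply further down that the real ppu writer already buffers) between issuing one small OS-level write per "put" and accumulating the same records in an in-memory buffer flushed once:

```python
import io
import os
import struct
import tempfile
import time

N = 20_000  # number of small "put"-style records; arbitrary

def put_unbuffered(path):
    # One tiny OS-level write per record: buffering is disabled, so every
    # 4-byte record costs a system call (the int21h-style cost in question).
    with open(path, "wb", buffering=0) as f:
        for i in range(N):
            f.write(struct.pack("<I", i))

def put_buffered(path):
    # Accumulate records in memory and write the whole block in one call.
    buf = io.BytesIO()
    for i in range(N):
        buf.write(struct.pack("<I", i))
    with open(path, "wb") as f:
        f.write(buf.getvalue())

with tempfile.TemporaryDirectory() as d:
    p1, p2 = os.path.join(d, "a.ppu"), os.path.join(d, "b.ppu")
    t0 = time.perf_counter(); put_unbuffered(p1); t1 = time.perf_counter() - t0
    t0 = time.perf_counter(); put_buffered(p2); t2 = time.perf_counter() - t0
    with open(p1, "rb") as a, open(p2, "rb") as b:
        same = a.read() == b.read()

print(f"unbuffered: {t1:.3f}s, buffered: {t2:.3f}s, outputs identical: {same}")
```

The gap only appears when writes are genuinely unbuffered; with a modest write buffer (as FPC's ppu writer has) the per-"put" cost largely disappears.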
Post by Marco van de Voort
The same with the fact that we use shortstring for performance in
many
places where delphi in fact allows longer mangled names and is
faster.
That leaves the maintainability bit. I think that is certainly true,
and it probably can be tweaked to be a bit better. It won't be easy
though, and is more likely to be a lot of things that add a little than
a few that matter a lot. Which makes it a permanent position instead of
a one-off effort. Any takers?
- To find something interesting, probably the test has to be done for
various sizes of code. To see if maybe some startup factor (rather
than actual processing) is a cause.
agreed
Post by Marco van de Voort
- FPC has to read at least twice the number of files (.ppu/.o). So if
opening or number of files in the dir path is a factor, it will do
worse by default.
actually, this is a false statement. The units that come with delphi (rtl and others) are packaged in .dcp (and .bpl). That means that not only those "x2 files", but a whole factor of those "x2" can be saved. Admittedly, you probably load way too much in some cases, and it is not easy to know in which .dcp your needed unit info is included, or at least it isn't obvious.
Post by Marco van de Voort
- Possibly, a default Windows install searches through a larger unit
path (more dirs, more files) than Delphi does by _default_
- For that to be found, more has to be known about dcc.cfg during the
tests.
- This means that tests will have to be repeated with various sizes
of unitpath trees (both dirs and files) to see if this is a factor.
- Most profiling recently afaik has been done by Jonas, and thus not
on Windows. Yet the delphi comparisons are on windows.
- Actually linking should be avoided in the test. It might obscure
things.
in that case, all "external tools" should be avoided.

kind regards,
Dimitri Smits
Florian Klämpfl
2010-09-12 16:27:48 UTC
Permalink
Post by Dimitri Smits
Post by Marco van de Voort
I partially agree with you in the fact that the exact reasons are
not known.
I'm no expert on profiling the compiler, but if I read the various
In discussions with Hans, it is said that I/O is not a factor,
since after one run everything is cached anyway, and then in this
thread I/O is to blame for a huge difference in speed.
that may be the case for reading, not necessarily for the files being
written. in ppu.pas, everything you "put" results in a blockwrite of
x bytes. Wouldn't a cached memory stream be better, not resulting in
those int21h calls or windows equivalent calls?
The ppu writer uses a buffer of 16 kB, which is enough. Believe me, it's
not so simple as just using a memory stream to improve performance. If it
were, we would have done it years ago.
Marco van de Voort
2010-09-12 16:39:25 UTC
Permalink
Post by Dimitri Smits
Post by Marco van de Voort
after
one run everything is cached anyway, and then in this thread I/O is to
blame
for a huge difference in speed.
that may be the case for reading, not necessarily for the files being
written. in ppu.pas, everything you "put" results in a blockwrite of x
bytes. Wouldn't a cached memory stream be better, not resulting in those
int21h calls or windows equivalent calls? Haven't looked at .s creation
(donno where to start looking, but I guess this is the same???)
It's worth checking out (writing in general), but I don't expect the
difference to be shocking, if only because typically many more .ppu files
are read than written.
Post by Dimitri Smits
Post by Marco van de Voort
- FPC has to read at least twice the number of files (.ppu/.o). So if
opening or number of files in the dir path is a factor, it will do
worse by default.
actually, this is a false statement. The units that come with delphi (rtl
and others) are packaged in .dcp (and .bpl). That means that not only
those "x2 files", but a whole factor of those "x2" can be saved.
Admittedly, you probably load way too much in some cases, and it is not
easy to know in which .dcp your needed unit info is included, or at least
it isn't obvious.
To my best knowledge that is incorrect. Delphi uses the .dcu's (debug or
not), and only preloads the .dcp that are in the "runtime packages" list
when "compiling with packages" is checked.
Post by Dimitri Smits
Post by Marco van de Voort
of unitpath trees (both dirs and files) to see if this is a factor.
- Most profiling recently afaik has been done by Jonas, and thus not
on Windows. Yet the delphi comparisons are on windows.
- Actually linking should be avoided in the test. It might obscure
things.
in that case, all "external tools" should be avoided.
For benchmarking: yes. It would actually be interesting to learn such info,
since generally I've only regarded commands like "make all" in the past.

There, Windows (time) : Linux (time) is about 2:1, but I always blamed the
bulk of this on the slower startup time of Windows exes. Such benchmarks
could confirm (or dispel) that assumption.
Marco van de Voort
2010-09-12 16:46:18 UTC
Permalink
Post by Jonas Maebe
Post by Marco van de Voort
I'm no expert on profiling the compiler, but if I read the various threads
In discussions with Hans, it is said that I/O is not a factor, since after
one run everything is cached anyway, and then in this thread I/O is to blame
for a huge difference in speed.
Disk throughput doesn't really matter. Reading directory contents, getting
file information and opening/closing files is another matter.
Good. That's what I wanted to say too. If one excludes linking, there is not
that much left. The points at which the FPC and Delphi compilers take their
timings could of course be different too.
Post by Jonas Maebe
Post by Marco van de Voort
The same with the fact that we use shortstring for performance in many
places where delphi in fact allows longer mangled names and is faster.
That's a non sequitur.
I'm not creating a conspiracy theory here. I just want to state some facts
to avoid side discussions obscuring the main problem.

I'm not expecting somebody will go really deep into profiling, but that
doesn't mean a reasonably correct identification of the cause of the
difference would hurt. If only for the next discussion.
Post by Jonas Maebe
Post by Marco van de Voort
- Most profiling recently afaik has been done by Jonas, and thus not
on Windows. Yet the delphi comparisons are on windows.
http://developer.amd.com/cpu/codeanalyst/Pages/default.aspx
I'll see if I can make/enhance the wiki page about profiling the coming
days.
Dimitri Smits
2010-09-12 17:15:29 UTC
Permalink
Post by Mattias Gaertner
On Sun, 12 Sep 2010 10:25:22 +0200 (CEST)
Post by Marco van de Voort
Post by Mattias Gaertner
Post by Martin Schreiber
Agreed. My opinion is that before we start to implement difficult and
error-prone multi-threading into FPC we should find out why the hell
Delphi 7 can compile so much faster and produces even better code?
maybe it fits into the cpu cache.
I assume dcc.exe uses more data than code :-)
CPU caches do not work FIFO.
If FPC does not fit into the CPU cache, then the CPU has to constantly
load code mem additionally to the data.
in that case, can splitting up the .exe into .exe + more .dll's help?

aka a "compiler package", a "rtl", a "assembler package", ...?

kind regards,
Dimitri Smits
Jonas Maebe
2010-09-12 17:28:41 UTC
Permalink
Post by Dimitri Smits
Post by Mattias Gaertner
CPU caches do not work FIFO.
If FPC does not fit into the CPU cache, then the CPU has to
constantly load code mem additionally to the data.
in that case, can splitting up the .exe into .exe + more .dll's help?
Only the parts of the executable that are actually used (plus some surrounding bytes) are loaded into the caches. Splitting the used (or unused) code will not change that. You might want to read up on cpu caches: http://en.wikipedia.org/wiki/CPU_cache#Details_of_operation

Furthermore, splitting everything up will make things worse, because then you get extra glue code and/or dynamic linker fixups (i.e., it increases the code size and the program execution time).


Jonas
Mattias Gaertner
2010-09-12 17:32:42 UTC
Permalink
On Sun, 12 Sep 2010 19:15:29 +0200 (CEST)
Post by Dimitri Smits
Post by Mattias Gaertner
On Sun, 12 Sep 2010 10:25:22 +0200 (CEST)
Post by Marco van de Voort
Post by Mattias Gaertner
Post by Martin Schreiber
Agreed. My opinion is that before we start to implement difficult and
error-prone multi-threading into FPC we should find out why the hell
Delphi 7 can compile so much faster and produces even better code?
maybe it fits into the cpu cache.
I assume dcc.exe uses more data than code :-)
CPU caches do not work FIFO.
If FPC does not fit into the CPU cache, then the CPU has to constantly
load code mem additionally to the data.
in that case, can splitting up the .exe into .exe + more .dll's help?
aka a "compiler package", a "rtl", a "assembler package", ...?
No. Putting code into a dll/so can even create slower code.
The cpu cache uses cache lines, so only the relevant parts of the
machine code are loaded, no matter whether it is in a dll, exe, so or
whatever.

Mattias