Joshua Maurice
2011-07-25 07:25:34 UTC
I mentioned this little side project I had here quite a while ago in
comp.lang.c++. The context was me asking for an incrementally correct
build system.
To explain, let me define some terms.
A build system tool - is a distributed, customizable tool that let's
you create a build system. Ex: GNU make.
A build system - is a particular build system, usually built on top of
some build tool. This is the in-fact implementation that does your
build. GNU make is a build tool, which can be used to create a build
system by writing a makefile.
An incremental build - is a build on top of a pre-exsting build, where
there have been some source code changes between the old build and
now. Usually such builds do not rebuild everything and instead use
some "intelligence" to build less than the full clean rebuild, but
enough to get results /as if/ it did a full clean rebuild.
A correct incremental build - is an incremental build which produces
results "equivalent" to that of a hypothetical full clean rebuild.
An incrementally correct build system - is a build system which can
only produce incrementally correct builds. That is, there is no source
code changes which a developer may do, inadvertently or purposefully,
which can result in the build system doing an incrementally incorrect
build.
The key part to realize about this requirement is that the build
system scripts, such as makefiles, are themselves source code under
this model. At least, I claim that they ought to be considered as
source code for the purposes of determining if your incremental build
system is incrementally correct.
Now, at this point, there is a blurring of lines between build system
tool and build system. When the makefiles are both source code and
build system, it becomes hard to distinguish between them. I'm merely
trying to say that I want a tool in which it is exceptionally hard, if
not impossible short of maliciousness, to "break" the build system
such that it might produce an incrementally incorrect build.
Now, why do I want this? I hate waiting for hours each day doing a
build. Yes, I know that the source code base in question has other
problems if it takes that long to build, but unfortunately I am not
the king of the company, and so I have to make due. Besides, even
spending 10 minutes on a full clean rebuild is 10 minutes
(potentially) wasted, if another better solution exists. And even more
importantly, 4 hours spent tracking down a bug which was simply the
result of a broken incremental build is 4 hours wasted.
I claim that it's not that hard. However, as far as I can tell, very
few people have attempted such a thing. In the C++ world at least,
most ended with something that was "good enough" that caught 95% of
the cases. A good example AFAIK is the Recursive Make Considered
Harmful solution, found here:
http://miller.emu.id.au/pmiller/books/rmch/
It's good, but it still misses cases. Depending on the exact details
of implementation, it won't do a rebuild if you change compiler or
linker options, such as a new preprocessor define. It won't relink an
executable if you delete one of its associated cpp source files. It
won't recompile an object file if you introduce a new header file
which hides a previously included header file on the include path.
I'm not presumptuous enough to claim that what I'm about to link to is
fully incrementally correct, but AFAIK it's much further along than
anything else I've ever found, and it has the good potential to get
there.
Do note that you have to build trust on trust. To take a side trip,
security works by trust. You have to have some trusted agent in order
to have secure communication online. In a similar token, the
incremental correctness of a build system cannot be guaranteed if
someone "accidentally" modifies some of the make source code badly,
and installs it locally. The incremental correctness of a build system
is thus contingent on some facts, like on the correctness of some core
build system executables. It would also be dependent on the
correctness of the compiler and other such things.
My aim is simply to catch as much as I can, focusing on common
developer activities, such as changing preprocessor defines, adding
files, removing files, and ultimately using source control to sync to
a new revision of the code, which will include changes of which he is
not aware, and the incremental build system ought to just "work".
So, I'm experimenting around with ideas now. My initial plan was to do
a make clone. What I have is somewhat close to a drop in replacement
to GNU make 3.81. It's missing a lot of the bells and whistles - for
example it doesn't have a job server, nor a bunch of built-in
functions. I also hate the tab syntax, and I would be loathe to add
that back in. However, it has user definable functions, the same basic
scripting language, a bunch of the built-in functions, and simple and
recursive variables.
In fact, it has enough functionality in similar that I am prepared to
even do some speed comparisons between the two, my make, infamake, and
GNU make. On a standard Linux box, infamake parses makefiles about 20
times faster (not 20%, but 20x). On a standard Linux box, infamake
also does an "all from all" build about 8 times faster (not 8%, but
8x) than GNU make. An "all from clean" build invoking g++ on basically
empty cpp files runs about just as fast with my infamake and GNU make
- stuck on IO I presume. I have included the generators of these
tests, so that other people can look at them and confirm, or point out
weaknesses in my testing approach.
While I lack a lot of the functionality of GNU make, I believe that I
have enough of it that the missing features would not noticeably
change the results of these tests.
With a little work, I could put the rule syntax with all of its evil
tabs in, and make it an actual drop-in replacement for GNU make (as
long as you don't do things like job control, vpath, etc.).
https://sourceforge.net/projects/infamake/files/
(Documentation is sorely lacking, I know. It comes with a bunch of
unit tests, which should make it clear if you spend a lot of time with
it. I'll try to add some documentation sometime soon if anyone
cares.)
So, why did I rewrite it instead of fixing GNU make? A couple of
reasons. First, it was a fun exercise. Second, I didn't want to deal
with the existing C source code. Third, I felt that the original make
way of doing incremental builds is fundamentally broken, so I wanted
to diverge a bit from that legacy.
Let me explain why on that third point, why I think the GNU make
approach is fundamentally broken. GNU make, and in fact make in
general, uses a directed acyclic graph of nodes to control which
portions of the build are skipped on an incremental build. The nodes
usually correspond to files. This works out well in a lot of cases,
but it fails spectacularly on some common developer actions. As I
mentioned before, example include adding files, deleting files,
changing build commands. None of this is well captured by looking at a
DAG where the nodes are existing files. There are some clever hacks to
get around some of these issues, namely the .d file generation of
Recursive Make Considered Harmful, but it doesn't catch all of them.
(The src/tests2 folder includes a bunch of tests I run against my own
infamake built in C++ rule to ensure that it produces correct
results.)
Moreover, this problem is even worse for Java. Fortunately /
unfortunately, my company also uses Java, and apart from Eclipse's
compiler (which is almost fully incrementally correct for Java, but
not quite, AFAIK), there isn't anything even resembling an good
attempt an incremental Java builds. Part of my motivation is to create
a build system which handles C++ and Java, and incrementally
correctly, and with fast builds. None of this is uploaded right now,
but I have a worked out implementation lying around that I just need
to merge in.
Currently, the build logic is done largely in C++. I did this for
speed, and honestly because I'm still not quite sure if there's a good
way to generalize it to something easy to use in make's scripting
language. I wanted the full power of C++ when doing the incremental C+
+ build logic. Perhaps I can fix that later. I don't know.
So where do I go form here? I don't know. Just thought I'd post this
up, as I made some pretty large claims a while ago on comp.lang.c++,
and felt I might as well back them up. Took a bit longer than expected
for legal to get back to me to open source it.
So, any comments?
comp.lang.c++. The context was me asking for an incrementally correct
build system.
To explain, let me define some terms.
A build system tool - is a distributed, customizable tool that let's
you create a build system. Ex: GNU make.
A build system - is a particular build system, usually built on top of
some build tool. This is the in-fact implementation that does your
build. GNU make is a build tool, which can be used to create a build
system by writing a makefile.
An incremental build - is a build on top of a pre-exsting build, where
there have been some source code changes between the old build and
now. Usually such builds do not rebuild everything and instead use
some "intelligence" to build less than the full clean rebuild, but
enough to get results /as if/ it did a full clean rebuild.
A correct incremental build - is an incremental build which produces
results "equivalent" to that of a hypothetical full clean rebuild.
An incrementally correct build system - is a build system which can
only produce incrementally correct builds. That is, there is no source
code changes which a developer may do, inadvertently or purposefully,
which can result in the build system doing an incrementally incorrect
build.
The key part to realize about this requirement is that the build
system scripts, such as makefiles, are themselves source code under
this model. At least, I claim that they ought to be considered as
source code for the purposes of determining if your incremental build
system is incrementally correct.
Now, at this point, there is a blurring of lines between build system
tool and build system. When the makefiles are both source code and
build system, it becomes hard to distinguish between them. I'm merely
trying to say that I want a tool in which it is exceptionally hard, if
not impossible short of maliciousness, to "break" the build system
such that it might produce an incrementally incorrect build.
Now, why do I want this? I hate waiting for hours each day doing a
build. Yes, I know that the source code base in question has other
problems if it takes that long to build, but unfortunately I am not
the king of the company, and so I have to make due. Besides, even
spending 10 minutes on a full clean rebuild is 10 minutes
(potentially) wasted, if another better solution exists. And even more
importantly, 4 hours spent tracking down a bug which was simply the
result of a broken incremental build is 4 hours wasted.
I claim that it's not that hard. However, as far as I can tell, very
few people have attempted such a thing. In the C++ world at least,
most ended with something that was "good enough" that caught 95% of
the cases. A good example AFAIK is the Recursive Make Considered
Harmful solution, found here:
http://miller.emu.id.au/pmiller/books/rmch/
It's good, but it still misses cases. Depending on the exact details
of implementation, it won't do a rebuild if you change compiler or
linker options, such as a new preprocessor define. It won't relink an
executable if you delete one of its associated cpp source files. It
won't recompile an object file if you introduce a new header file
which hides a previously included header file on the include path.
I'm not presumptuous enough to claim that what I'm about to link to is
fully incrementally correct, but AFAIK it's much further along than
anything else I've ever found, and it has the good potential to get
there.
Do note that you have to build trust on trust. To take a side trip,
security works by trust. You have to have some trusted agent in order
to have secure communication online. In a similar token, the
incremental correctness of a build system cannot be guaranteed if
someone "accidentally" modifies some of the make source code badly,
and installs it locally. The incremental correctness of a build system
is thus contingent on some facts, like on the correctness of some core
build system executables. It would also be dependent on the
correctness of the compiler and other such things.
My aim is simply to catch as much as I can, focusing on common
developer activities, such as changing preprocessor defines, adding
files, removing files, and ultimately using source control to sync to
a new revision of the code, which will include changes of which he is
not aware, and the incremental build system ought to just "work".
So, I'm experimenting around with ideas now. My initial plan was to do
a make clone. What I have is somewhat close to a drop in replacement
to GNU make 3.81. It's missing a lot of the bells and whistles - for
example it doesn't have a job server, nor a bunch of built-in
functions. I also hate the tab syntax, and I would be loathe to add
that back in. However, it has user definable functions, the same basic
scripting language, a bunch of the built-in functions, and simple and
recursive variables.
In fact, it has enough functionality in similar that I am prepared to
even do some speed comparisons between the two, my make, infamake, and
GNU make. On a standard Linux box, infamake parses makefiles about 20
times faster (not 20%, but 20x). On a standard Linux box, infamake
also does an "all from all" build about 8 times faster (not 8%, but
8x) than GNU make. An "all from clean" build invoking g++ on basically
empty cpp files runs about just as fast with my infamake and GNU make
- stuck on IO I presume. I have included the generators of these
tests, so that other people can look at them and confirm, or point out
weaknesses in my testing approach.
While I lack a lot of the functionality of GNU make, I believe that I
have enough of it that the missing features would not noticeably
change the results of these tests.
With a little work, I could put the rule syntax with all of its evil
tabs in, and make it an actual drop-in replacement for GNU make (as
long as you don't do things like job control, vpath, etc.).
https://sourceforge.net/projects/infamake/files/
(Documentation is sorely lacking, I know. It comes with a bunch of
unit tests, which should make it clear if you spend a lot of time with
it. I'll try to add some documentation sometime soon if anyone
cares.)
So, why did I rewrite it instead of fixing GNU make? A couple of
reasons. First, it was a fun exercise. Second, I didn't want to deal
with the existing C source code. Third, I felt that the original make
way of doing incremental builds is fundamentally broken, so I wanted
to diverge a bit from that legacy.
Let me explain why on that third point, why I think the GNU make
approach is fundamentally broken. GNU make, and in fact make in
general, uses a directed acyclic graph of nodes to control which
portions of the build are skipped on an incremental build. The nodes
usually correspond to files. This works out well in a lot of cases,
but it fails spectacularly on some common developer actions. As I
mentioned before, example include adding files, deleting files,
changing build commands. None of this is well captured by looking at a
DAG where the nodes are existing files. There are some clever hacks to
get around some of these issues, namely the .d file generation of
Recursive Make Considered Harmful, but it doesn't catch all of them.
(The src/tests2 folder includes a bunch of tests I run against my own
infamake built in C++ rule to ensure that it produces correct
results.)
Moreover, this problem is even worse for Java. Fortunately /
unfortunately, my company also uses Java, and apart from Eclipse's
compiler (which is almost fully incrementally correct for Java, but
not quite, AFAIK), there isn't anything even resembling an good
attempt an incremental Java builds. Part of my motivation is to create
a build system which handles C++ and Java, and incrementally
correctly, and with fast builds. None of this is uploaded right now,
but I have a worked out implementation lying around that I just need
to merge in.
Currently, the build logic is done largely in C++. I did this for
speed, and honestly because I'm still not quite sure if there's a good
way to generalize it to something easy to use in make's scripting
language. I wanted the full power of C++ when doing the incremental C+
+ build logic. Perhaps I can fix that later. I don't know.
So where do I go form here? I don't know. Just thought I'd post this
up, as I made some pretty large claims a while ago on comp.lang.c++,
and felt I might as well back them up. Took a bit longer than expected
for legal to get back to me to open source it.
So, any comments?