Software Atrophy - An Example
Sep 20 2015
Previously, I'd put up an article describing the component object model of my SCons-based build system at Axham Games. If you haven't already read that, you might want to take a gander; it will make what is discussed here easier to follow. Somewhere along the way, as tends to happen with a lot of software systems, our build scripts began showing signs of atrophy.
Note: This is written with a view to help myself and others running development teams. This post goes into detail about poorly done software development work; the goal is not to cast blame, but to learn how to spot the signs early and make appropriate course corrections before reaching the point where a system needs to be completely rewritten.
Note 2: This was originally written in Jan 2015, but was put on the back burner because I got injured in an accident. I didn't find time till now to wrap it up and publish it.
Quick Recap
In our system, we treat a set of related libraries as a single component, and each component is represented by a Python class. E.g. all the Boost libraries share common parameters and are grouped together as a single component. Each project chooses which libraries within the set to use by specifying them in the SConstruct file and overriding them within per-project SConscript files.
Here's an example of the configuration used to link to Lua libraries.
env['LUA_DIR'] = util.join_path(externals_dir, build_mode, 'lua')
env['LUA_PLUGIN_LIBRARIES'] = set(['lua52'])
env['LUA_INCLUDE_DIR_EPILOG'] = 'include'
env['LUA_VARIANT'] = externals_variant
env['LUA_VERSION'] = '5-2'
We also rely on the developer having the correct environment variables set for paths to tools, since tools can be installed in different locations on any given machine. As can be seen from the example above, we construct all paths in an OS-agnostic manner using Python's standard facilities. One critical rule that has to be followed is "No hard-coded paths", since we compile using different toolchains (mingw GNU C++, Visual C++) and on multiple platforms (Win32, Win64, Linux, etc.).
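To make that pattern concrete, here is a minimal sketch, assuming a hypothetical AXHAM_EXTERNALS environment variable and using os.environ and os.path.join directly in place of our util.join_path helper:

import os

# Hypothetical variable; each developer points it at their own externals
# checkout, wherever it happens to live on that machine.
externals_dir = os.environ['AXHAM_EXTERNALS']

# os.path.join inserts the correct separator for the host OS, so the same
# line yields 'externals\release\lua' on Windows and 'externals/release/lua'
# on Linux.
lua_dir = os.path.join(externals_dir, 'release', 'lua')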
What Went Wrong
1. Core Libraries Component
On any given platform, there are usually core libraries like user32.lib on Windows or ws2_32.lib when using mingw. Additionally, there may be libraries such as stdc++-6 which need to be linked when compiling with the GNU C++ compiler. One way in which some core libraries differ from the others in the component is that their corresponding dynamic libs are installed either as part of the system or via an external package; an easy way to think of them is as analogous to the Visual C++ runtime libraries, the .NET runtime or the Java runtime.
Unfortunately, even though this had been explained to developers on the team, it was either forgotten or ignored. The result was that the developer set the component up to copy the core dynamic libs from their system location to the target directory. This is an absolute no-no for system libraries.
2. Hard-coded Path
The developer further compounded the error of copying system libraries to the target directory by hard-coding the path of the system directory (in this case, the mingw-msys bin directory) in the core libraries component's Python class. Worse, it was done in a way that is not easy to detect, because the path was constructed using Python's standard facilities.
dynamic_lib_path = util.join_path('c:', 'tools', 'mingw', 'bin')
Not only would this fail if mingw were installed in a different directory, it is also OS-specific: the build would break in the Linux and OS X environments.
How it was Detected
Simple answer: completely by accident. I have a couple of machines on which I do development, one of which is a laptop I was prototyping on before I founded Axham Games, so the various tools were installed in non-standard directories on it. When I ran a build on the laptop, it promptly broke, indicating that the dynamic lib at the above hard-coded path did not exist. When used properly, the build system is designed to deal with tools installed in non-standard directories, since their locations are supposed to be picked up from environment variables. Hard-coding the path broke that.
Why the developer did this is a mystery (he is no longer at the studio), especially considering that there are pre-existing examples of how to do this according to established convention, which is to have each developer define an environment variable specifying the directory. At best, this piece of work is reflective of a poor work ethic, i.e. the developer adhered only to the letter of the rule, not its spirit. At worst, it is insidious; the hard-coding of the path was "cleverly" obfuscated by using Python's standard path construction facilities, and unless one looks closely, everything seems kosher.
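For contrast, here is a rough sketch of what the conventional approach might have looked like; the MINGW_DIR variable name is an assumption for illustration, and os.path.join stands in for our util.join_path helper:

import os

# Each developer sets MINGW_DIR to wherever their mingw toolchain lives.
mingw_dir = os.environ.get('MINGW_DIR')
if mingw_dir is None:
    raise RuntimeError('MINGW_DIR is not set; point it at your mingw install')

# Works regardless of where mingw is installed, and on any OS.
dynamic_lib_path = os.path.join(mingw_dir, 'bin')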
The Fix
The correct fix is to allow any component to specify that certain libraries it links against do not (or should not) have their dynamic libraries copied over to the target directory.
To that effect, the base Component class from which each of the individual components derives needs to be modified to store the list of libraries that are not to be installed. We do this by adding one member list, _dynamic_libraries_dni (dni = do not install), and two properties, dynamic_libraries_dni and dynamic_libraries_installable, as follows:
class Component(object):
    def __init__(self):
        ...
        self._dynamic_libraries_dni = []
        ...

    ...

    @property
    def dynamic_libraries_dni(self):
        # Return a copy so callers cannot mutate the internal list.
        return self._dynamic_libraries_dni[:]

    @property
    def dynamic_libraries_installable(self):
        # Everything we link against dynamically, minus the do-not-install set.
        return list({
            dl for dl in self._dynamic_libraries
            if dl not in self._dynamic_libraries_dni
        })

    ...
These properties and the member variable propagate through to the derived classes and are used in per-project SConscript files, which invoke the dynamic_libraries_installable property to determine the libraries to install.
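To illustrate how this is consumed downstream, here is a rough sketch; the CoreComponent class, the specific library names, and the env and install_dynamic_lib helpers are illustrative assumptions rather than the actual project code:

class CoreComponent(Component):
    def __init__(self):
        super(CoreComponent, self).__init__()
        # Link against these system/toolchain libraries...
        self._dynamic_libraries = ['ws2_32', 'stdc++-6']
        # ...but never copy their dynamic libs to the target directory.
        self._dynamic_libraries_dni = ['ws2_32', 'stdc++-6']

# In a per-project SConscript, only the installable subset gets copied:
core = CoreComponent()
for lib in core.dynamic_libraries_installable:
    install_dynamic_lib(env, lib)   # hypothetical install helper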
We also remove the hard-coded path from the Core component implementation, since it was both incorrect and no longer needed.
Quick Thoughts on Atrophy Prevention
tl;dr version: This problem could have been avoided if the build system changes had been code reviewed like all other production code.
Recalling some discussions with a former colleague, software is not a write-once-and-forget sort of thing. All software is organic in nature; it's more like a tree. It needs to be nurtured, attended to, and from time to time, pruned.
All code, particularly production code, ought to be written carefully and code-reviewed as part of the check-in process, sometimes by more than one pair of eyes. But the same level of rigor is rarely applied to internal support systems like build systems and developer tools. In small teams like ours, there aren't sufficient resources to pay that level of attention to non-shipping code unless it is a significant pain point or, as in this case, breaks in one or more environments.
Looking back at this more critically, possible only with time and distance from the series of events that preceded this find, the crux of the issue was:
- a bad hire - the hack engineer who implemented the broken solution
- a manager [me] who didn't oversee the engineer's work more closely and, more importantly, didn't terminate the engineer months prior despite sufficient warning signs. Note that the termination is not for mistakes; it's for a myriad of reasons, including poor work ethic.