An analysis of NagVis 1.2.x performance with a large number of maps

Or, rather, an analysis of NagVis 1.2.x lack of performance with a “large” number of maps.

Setup

We are monitoring ca. XXX hosts with two Nagios instances populating a common NDO database.

The NDO database and NagVis itself are running on the same machine, an HP c7000 blade equipped with 2 Quad-core Xeon CPUs running at 3 GHz and 2 GB of memory. That machine is otherwise idle, as the NDO database generates no load on the machine.

The NagVis configuration tree weights 2.3 MB, with 413 maps defined. The maps have common children, this is something to keep in mind when reading through this analysis. Nothing fancy like backgrounds or the like.

We have “a lot” of maps defined so as to have per-application, per-rack, per-OS, per-brand, per-site etc. views of the infrastructure we are monitoring. This is a requirement for the systems and networks administrators using NagVis.

Problem

Displaying our main map (with 3 children) is extremely slow. It takes between 30 seconds and just under a minute to display the map. It is unbearable for our users, and it’s not only the main map exhibiting this behaviour. This is a serious usability issue.

Something somewhere must be so totally inefficient that it’s worth investigating.

First elements

Let’s go for the simple things first. Running top while loading the main map shows apache2 taking 90% of a CPU and mysqld taking another 10%.

So the problem lies in the code and not in the queries it’s running. Though there’s certainly something to do on the SQL front too, but for now let’s pretend it’s not an issue.

NagVis profiling

Given the extent of this performance issue, let’s get the big guns out right now. The tool of choice here is the XDebug Zend extension, which will produce a profile trace that can be loaded into Kcachegrind.

NagVis vanilla

Loading our main map takes between 30 seconds and just under a minute, as stated already. That translates to over 265 millions time units as Kcachegrind puts it. We’ll refer to the profiling data for timing, and then we’ll only use that for comparing the time taken by the different runs.

The call graph generated by Kcachegrind reveals that half the time is spent in the getFavicon() method. Can’t believe it? Me neither.

Top-level call graph for NagVis 1.2.x vanilla

As we can see, a cycle is identified, and it’s run through different code paths, including the getFavicon() one. There’s something fishy in there.

For now, getFavicon() is nothing more than a parasite, so let’s get it out of the way.

NagVis without getFavicon()

Commenting out one line of code rewards us with a main map that loads twice as fast as before. 145 millions time units.

The cycle identified previously is still there, as the other code paths still exist, but now we have figures that are a bit more meaningful than previously, as we’ve put aside the useless favicon feature.
The call graph reveals that NagVis spends 70% of the time reading map configs (readMapConfig()), inside this cycle.

readMapConfig()-centered call graph for NagVis 1.2.x without favicon

It also shows that most of this time (50% of the total run time) is spent inside readMapConfig() itself and not in callees.

Investigating the code for the parser shows that the parser is written in an extremely inefficient way:

  • lots of array lookups, partly due to the fact that the config file is shoved into an array using file(), partly due to the way the parser is written. For the latter, I truly hope PHP is capable of optimizing away at least some of those array lookups inside this while() loop;
  • using file() results in more memory usage than is really necessary, also requires building, then accessing, a potentially large array;
  • excessive usage of ereg() when a simpler, cheaper means exists, like checking that the first character of a line is neither ‘#’ nor ‘;’;
  • expensive functions used to check the existence of array keys, typically array_key_exists();
  • excessive usage of isset();
  • couple of other things.

Without trying very hard, I’ve obtained a 10% reduction of the total run time, falling to 130 millions time units. By being more clever in writing the parser, it’s probably possible to gain up to 20-30% of the total run time.

Looking at the code

It’s time to go a bit deeper into the code.

Map configuration

Examination of the readMapConfig() callers makes the whole thing pretty clear: map configuration is fetched into a new NagVisMapCfg object each and every time it is needed.

As you now know, creating a NagVisMapCfg object is quite expensive as it requires parsing the config file for the particular map.
Yet, these objects are created as needed and discarded after use. That’s very, very, very bad. There should be a cache for these objects, at the very least.

In our case, we have a lot of common submaps, which means we’re even more impacted by this problem, as the same map must be evaluated several times.

Map state and host state

The same happens for the map state and host state. They are recomputed each and every time they’re needed, and that involves a lot of SQL queries. In our case, we’re seeing, at that point, over 25 000 SQL queries; no wonder mysqld consumes 10% of a CPU under this load.

Conclusion

The current architecture of NagVis doesn’t scale. If you have more than a couple of hosts on a couple of maps arranged in a relatively flat graph, forget about NagVis.

At that point, NagVis should be rewritten from scratch in anything faster than PHP, in a more clever way. Its configuration file format and general configuration scheme should be redesigned, too.

And let me add: PHP and OO programming at their worst.