Introduction
I’m not the patient kind. And if you’re not the patient kind either, here’s the TLDR: I’ve released the 1.0 version of the simulator with nice performance and visualization improvements.
So, where was I? Oh yeah, not being patient… I don’t like sitting around waiting for things to happen. I also don’t like boredom. I can’t report much progress on the front of finding new software for the Crays (though I have some interesting leads – more on those in a different post). So while I was waiting, I tinkered with the simulator.
There were several things that bugged me about it. It lacked even half-decent feedback about what it was doing. Its logging infrastructure – while flexible and fast(ish) – was ugly and somewhat finicky to use. It wasn’t modern C++. And most of all, it was slow. Well, I didn’t even know how slow it was, or to put it another way, I didn’t know how fast it could be. All I knew was that I hadn’t invested much time in making it fast.
This post is about what I did about some of these.
UI
printf is ugly. std::cout isn’t much nicer either. Windows are cute but are very hard to work with in a platform-independent way. A nice middle ground is the curses library. It works on the character console, is available on all platforms – at least the ones I care about – and is relatively easy to use. It has its own quirks. It lacks the widgets of any modern UI. It has platform differences and behaviors that only work on Windows and break on Linux. Or the other way around. Still, it can do the job, plus, I’ve already been using it for the terminal emulator.
I’ve enhanced my wrapper classes to support multiple windows, even windows with frames and titles. I’ve added support for streams (my windows now inherit from std::ostream) so all the usual << operations work. In fact, you can even redirect std::cout into one of these windows so those sprinkled printed messages neatly show up in a log window.
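In case you’re curious how a window can pretend to be a stream, here’s a minimal sketch of the idea – the class names here are made up for this post, and the real wrapper classes in the simulator are more elaborate:

```cpp
#include <curses.h>
#include <iostream>
#include <streambuf>

// Sketch: a streambuf that forwards every character to a curses window.
// Class names are illustrative, not the simulator's actual ones.
class CursesWindowBuf_c : public std::streambuf
{
public:
    explicit CursesWindowBuf_c(WINDOW *aWindow): mWindow(aWindow) {}
protected:
    int overflow(int aChar) override
    {
        if (aChar != EOF) {
            waddch(mWindow, static_cast<chtype>(aChar));
            wrefresh(mWindow);
        }
        return aChar;
    }
private:
    WINDOW *mWindow;
};

// The window itself is an std::ostream, so all the usual << operators work.
class LogWindow_c : public std::ostream
{
public:
    explicit LogWindow_c(WINDOW *aWindow): std::ostream(&mBuf), mBuf(aWindow) {}
private:
    CursesWindowBuf_c mBuf;
};
```

Redirecting std::cout into such a window is then a one-liner: std::cout.rdbuf(MyLogWindow.rdbuf());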
The next step was to bolt in a reporting infrastructure, where the various components (CPUs, IOPs, disks and tapes for example) can fill in their status. For status reporting, I’ve re-used the boost property tree implementation that I’m already using for configuration file parsing. This neatly separates information (status info) from representation (characters on the display). Next I coded up some statistics, like the number of reads and writes for disks and instructions executed every second for processors.
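The shape of the idea is something like this – a hypothetical sketch, the component and key names are made up:

```cpp
#include <boost/property_tree/ptree.hpp>

// Each component fills in its corner of a shared property tree; the UI layer
// walks the tree and decides how to render it. Names here are illustrative.
void FillDiskStatus(boost::property_tree::ptree &aStatus)
{
    aStatus.put("Disk0.Reads", 12345);  // statistics gathered by the component
    aStatus.put("Disk0.Writes", 678);
    aStatus.put("Disk0.Mounted", true);
}
```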
Finally, I’ve added a rather simplistic console where the user can type in commands to – among other things – mount virtual tapes or change the amount of logging being done. The result is this nice interface:
OK, well, aesthetics is subjective. Point is: it’s better than it was.
Log 2.0
I had this great idea: develop a fast, flexible logging system. I had an implementation. I’ve used it in the simulator for a long time. However, it was ugly. It looked like streams, but it wasn’t. It was thread-safe, but then again, it kind of wasn’t. You could change the log-level in the middle of a line, which is confusing to say the least.
So, I had this even greater idea: let’s make a new version that corrects most of these issues. It was going to be Log 2.0. The implementation was going to split logging and constructing a line of log into two separate classes. This way, you could automatically capture the log-level when you created a log-line, and dump it into the actual file in one atomic operation.
The syntax wouldn’t even change all that much. You could still do:
```cpp
StreamLogger_c Logger;

Logger << setloglevel(3) << "This is a log line" << endl;
```
And it would just work. Magically. What was going to happen under the hood was this:
The first operator << would be special. Instead of returning StreamLogger_c & as such operators normally do, it would return a LogLine_c object. Not even a reference to it. A new object. This class is derived from std::stringstream so all subsequent operator << calls just work. At the end, when the temporary object goes out of scope, the overridden destructor would take action and copy all the content of the log line into the ‘Logger’ object.
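A sketch of what that special first operator could look like – note that the Level() accessor on setloglevel and GetLogLevel() on the logger are assumptions for illustration, not the actual interface:

```cpp
// Sketch only: creates the LogLine_c by value. The log-level is captured at
// this point and the line is marked 'bad' if it wouldn't log anything.
LogLine_c operator << (StreamLogger_c &aLogger, const setloglevel &aManipulator)
{
    bool MakeBad = aManipulator.Level() > aLogger.GetLogLevel(); // hypothetical accessors
    return LogLine_c(aLogger, MakeBad);
}
```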
StreamLogger_c is not related to std::ostream and would only have the operator << function for log manipulators (setloglevel and such). Because of that, this is a syntax error:
```cpp
Logger << "Invalid log line" << endl;
```
And since operator << doesn’t exist for ostreams and log manipulators, this is invalid as well:
```cpp
Logger << "Some logging" << setloglevel(1) << " and some more" << endl;
```
To speed up disabled logging, LogLine_c gets initialized to being ‘bad’ if – during its construction – it’s determined that it would not log anything. This is atomic: once the object is created, it will discard all further collection of logging, and in fact std::ostream does a decent job of early-exiting on operator << for bad streams, so it should perform well for large amounts of disabled logs.
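You can see this early-exit behavior for yourself with a few lines of standard C++:

```cpp
#include <iostream>
#include <sstream>

int main()
{
    std::stringstream Strm;
    Strm.setstate(std::ios::badbit);   // mark the stream 'bad'
    Strm << 42 << " expensive formatting"; // sentry check fails, work is skipped
    std::cout << "buffer: '" << Strm.str() << "'\n"; // prints an empty buffer
}
```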
It’s also atomic on destruction: all log entries (even if they span multiple lines) are dumped as a single entity into the real log file.
I’ve coded this all up, debugged and built things – patching up the use-cases where it was needed. It worked. It was nice. Aaaannnndddd… it was slow. Very very slow. But why? Well, that’s where I whipped out the profiler. The results were clear and troubling: 28% of the application time was spent in
```cpp
template <>
LogLine_c operator << (LoggerBase_c &aLogger, const setloglevel &aManipulator);
```
which is the first operator << invoked for every line in the log file – the one that creates the LogLine_c object. Some further digging showed that most of the time was spent in the construction of the std::stringstream object.

Apparently I was late to the party, as practically everyone except me knew that C++ stream construction is slow because of all the code-page conversion work that needs to happen. Well, now everybody knows…
So the question is: how to improve this?
It’s clear that LogLine_c can’t derive from std::stringstream. Yet, unless I wanted to re-implement all the operator << overloads, I needed a stream implementation underneath. Wherever this stream lives, it must be thread-safe, or to be more precise, instantiated per thread. There are two such places: the stack and thread-local storage. Since the stack is not really possible without frequent construction-destruction cycles, the only other option is thread-local storage. The good news is that TLS is supported by Boost. The bad news is that now you need Boost for the log library, not to mention that single-threaded apps now pay the TLS penalty as well.
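Boost’s TLS facility is boost::thread_specific_ptr. A sketch of how the per-thread context could be declared – ThreadContext_c deriving directly from std::stringstream is an assumption, chosen to match the usage below:

```cpp
#include <boost/thread/tss.hpp>
#include <sstream>

// The per-thread stream context.
class ThreadContext_c : public std::stringstream {};

class LoggerBase_c
{
public:
    ThreadContext_c &GetContext();
protected:
    // one instance per thread, created lazily on first use
    static boost::thread_specific_ptr<ThreadContext_c> smThreadContext;
};

boost::thread_specific_ptr<ThreadContext_c> LoggerBase_c::smThreadContext;
```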
In this new implementation LogLine_c would simply have a reference to a TLS-allocated string-stream object. It would in fact check whether such an object has already been constructed and only create one if it couldn’t retrieve it. This way all threads automagically get their std::stringstream instance the first time they try to log something.
Creating all the operator <<-s for LogLine_c is not as big of a headache as it first sounds: a simple template function that forwards all functionality to the underlying std::stringstream instance works well, even with custom-defined types that know how to print to a stream.
With some cruft removed, LogLine_c becomes this:
```cpp
class LogLine_c
{
public:
    LogLine_c(class LoggerBase_c &aParent, bool aMakeBad):
        mParent(aParent),
        mBad(aMakeBad)
    {}
    ~LogLine_c();

    bool good() const { return !mBad; }

    template <typename tElement>
    friend LogLine_c &operator << (
        LogLine_c &aLogger,
        const tElement &aElement
    ) {
        if (aLogger.good())
            aLogger.mParent.GetContext() << aElement;
        return aLogger;
    }

protected:
    ostream &GetRealStream();

    class LoggerBase_c &mParent;
    bool mBad;
};
```
LoggerBase_c::GetContext returns a context object (allocated in thread-local storage) that is derived from std::stringstream. This is the guts of it:
```cpp
ThreadContext_c &LoggerBase_c::GetContext()
{
    if (smThreadContext.get() == nullptr) {
        smThreadContext.reset(new ThreadContext_c);
    }
    return *smThreadContext.get();
}
```
Notice how there’s no need for any locks or synchronization: since the allocated object is per-thread, there’s really no race condition between the test for nullptr and the allocation of the new object.
In this implementation the construction of the LogLine_c object that happens for every log-line is very quick: the constructor essentially copies a few values and that’s it. All actual operations are gated by the mBad member. If it’s set, none of the stringstream operations are executed, so we don’t even try to convert things to strings (which is the time-consuming part of most operator << calls) if they won’t end up in the log file.
An actual use of this framework would still look identical to the previous version:
```cpp
StreamLogger_c Logger;
...
Logger << setloglevel(LogLevel_IoActivity) << "DoIo function: " << aFunction
       << " data: " << aData << endl;
```
The first << operator, with a StreamLogger_c on its left side and a setloglevel object on its right side, returns a LogLine_c object just as before. All the rest of the << operators see a LogLine_c object on their left side and will invoke the machinery defined above.
Finally, after the execution of the last << operator, the temporary LogLine_c object goes out of scope and gets destructed. The destructor clears the std::stringstream object in the thread-local storage and dumps its content into the log stream object contained in Logger. This is done through the mParent member of LogLine_c, which maintains a connection to the logger object this line belongs to.
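A sketch of what that destructor could look like – simplified, with GetRealStream() being the protected accessor from the class above and any locking on the output stream omitted:

```cpp
// Simplified sketch of the flush-on-destruction behavior described above.
LogLine_c::~LogLine_c()
{
    if (good()) {
        ThreadContext_c &Context = mParent.GetContext();
        GetRealStream() << Context.str(); // dump the whole line in one shot
        Context.str("");                  // clear the per-thread buffer
        Context.clear();                  // reset any error flags
    }
}
```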
It took me a while to arrive at this solution, but finally I had a functional, fast and more type-safe logger implementation than before. It’s not exactly stream-like (LogLine_c can’t inherit from std::ostream) but it gets as close as it can.
All this worked well in Visual Studio. As soon as I moved to GCC, things started to break (yay, portability). As it turns out, GCC treats temporaries on the stack as const – more precisely, it refuses to bind them to non-const references. The apparent rationale behind it is that there’s no reason for anybody to change a temporary: it gets destroyed without leaving the expression anyway. It seems no one thought of the possibility of propagating state between various parts of an expression during evaluation. Well, I did!
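For the record, this is standard behavior rather than a GCC quirk – Visual C++ accepts the binding as a language extension, which is why it compiled there. A tiny illustration:

```cpp
struct Temp_c { int mValue = 0; };

Temp_c MakeTemp() { return Temp_c(); }

void TakeRef(Temp_c &aRef) {}            // non-const lvalue reference
void TakeConstRef(const Temp_c &aRef) {} // const reference

int main()
{
    TakeConstRef(MakeTemp()); // fine: a temporary binds to a const reference
    // TakeRef(MakeTemp());   // error on GCC: cannot bind a temporary to Temp_c&
    return 0;
}
```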
This would have been a major problem with LogLine_c being a descendant of stringstream, as it would carry the half-constructed log-line string with it. It could of course be worked around with liberal use of mutable, but how can you make your base class mutable? Luckily that’s not what we’re doing here anymore. The temporary object actually doesn’t change. All the changes (the half-constructed log-line string) are carried as side effects in – thread-local – variables, so making the arguments to operator << const isn’t a problem:
```cpp
template <typename tElement>
friend const LogLine_c &operator << (
    const LogLine_c &aLogger,
    const tElement &aElement
) {
    if (aLogger.good())
        aLogger.mParent.GetContext() << aElement;
    return aLogger;
}
```
The actual implementation is a bit more complicated, but not too bad. You can check it out in the latest published sources.
Configuration file
Well, this is old news. The config file was there from the very beginning and it hasn’t changed much. New fields got added and some obsolete ones removed, but by and large it’s the same thing. So why write about it now? Because I finally took the time to document its guts so somebody other than me (that would be you, dear reader) can understand and, God forbid, change things there. You can read all about it in README_config.txt.
Modern C++
C++ these days is “Not your Father’s C++”. With the 2011 standard it gained a much-needed face-lift, so much so that for some it almost feels like a new language. It’s 2016, so even that ‘new’ standard is 5 years old. In fact, since then there was a 2014 standard, and the next one is expected to come out in 2017. Range-based for loops make iterators actually useful, and the auto keyword helps with long and complicated types – which are quite often a side-effect of iterators. Pointer wrapper classes (std::unique_ptr, std::shared_ptr and their cousins) help with memory management to the point where if you feel the need to write new or delete these days, you’re probably writing a bug. These same classes (along with move semantics) finally allow for polymorphic containers (vectors of base-class pointers) that also ‘own’ the objects, freeing them when the container gets destroyed.
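A quick sketch of that last point – the class names are made up, not the simulator’s:

```cpp
#include <memory>
#include <vector>

// Illustrative only: an owning, polymorphic container of devices.
class Device_c
{
public:
    virtual ~Device_c() = default;
    virtual void Tick() = 0;
};

class Disk_c : public Device_c
{
public:
    void Tick() override { /* ... */ }
};

int main()
{
    std::vector<std::unique_ptr<Device_c>> Devices;
    Devices.push_back(std::make_unique<Disk_c>()); // C++14; use 'new Disk_c' in C++11

    for (auto &Device : Devices)
        Device->Tick();

    return 0; // every Device_c is freed when the vector goes out of scope
}
```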
All this goodness, yet my simulator source was very much stuck in the last century – in ’98 to be specific. Of course there’s more, there’s always more to be done, but I’ve gone and modernized quite a few places in the code. For example, the old code exposed GetCpuCnt and GetCpu(int aIdx) methods on the Mainframe_c class. Now there’s a GetCpus() method that exposes an iterator-based interface, so range-based loops can be used.
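So a loop over the CPUs can now look something like this – a sketch, with the CPU type name and loop bodies made up:

```cpp
// Before (sketch):
for (int i = 0; i < Mainframe.GetCpuCnt(); ++i) {
    Cpu_c &Cpu = Mainframe.GetCpu(i); // hypothetical return type
    // ... poke at Cpu ...
}

// After: GetCpus() exposes begin()/end(), so a range-based loop just works
for (auto &Cpu : Mainframe.GetCpus()) {
    // ... poke at Cpu ...
}
```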
Performance improvements
I kept telling myself that performance in the simulator was not important. After all, the whole goal with it was to get to a point where a true HW (FPGA-based) implementation was possible. It was going to be the flexible, somewhat ugly, slow prototyping platform to figure out what the machine used to do and which aspects of the original design were important for proper execution of old code.
The good thing about plans though is that they can be changed. And as I said at the outset: I’m an impatient person. Now that I have performance displays where I can see the simulated MIPS of the processors, I got curious: how far can I reasonably push this simulation? After all, today’s machines are so much faster than the simulated old HW. Granted, simulation is slow, but how slow?
So I started profiling and finding one problem after another. And let me tell you: if anyone tells you they understand performance, or that they know how to write optimal code… all they know is how to tell a lie. Just like code that wasn’t tested isn’t working, code that wasn’t profiled isn’t optimal.
Who would have thought that calling tick() on a handful of peripherals every simulated clock cycle was a major time-sink? Or that iterating through them to see if they have a pending interrupt was slow? Who would have guessed that putting a single std::stringstream object on the stack can slow things down? Or that all the talk about jump tables in C(++) doesn’t amount to Visual C++ actually compiling switch statements into them? Or that it matters? Or that multi-threading is slower than a single-threaded implementation? Or that even the current logging is too slow with all the boolean checks and should not be compiled in for fastest execution? Or that atomic operations slow down a single-threaded application too? Or that there’s no way to force Visual C++ to inline a function – __forceinline doesn’t force an inline, it’s only a slightly more strongly worded inline?
Well, I’ll stop now. The point is: you might know about a few of these, but you can’t know about all of them. And you can’t know about the ones that are application-specific without looking into it. So, spend the time and profile your code!
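Case in point, the stringstream surprise from above is easy to reproduce with a micro-benchmark sketch like this one – the absolute numbers will vary by machine and compiler, but the per-iteration construction should lose by a wide margin:

```cpp
#include <chrono>
#include <iostream>
#include <sstream>

int main()
{
    using Clock = std::chrono::steady_clock;
    const int Iterations = 100000;

    auto Start = Clock::now();
    for (int i = 0; i < Iterations; ++i) {
        std::stringstream Strm; // pays the full construction cost every time
        Strm << i;
    }
    auto Mid = Clock::now();

    std::stringstream Strm;     // constructed once, reset every iteration
    for (int i = 0; i < Iterations; ++i) {
        Strm.str("");
        Strm.clear();
        Strm << i;
    }
    auto End = Clock::now();

    using std::chrono::duration_cast;
    using std::chrono::milliseconds;
    std::cout << "construct per iteration: "
              << duration_cast<milliseconds>(Mid - Start).count() << " ms\n";
    std::cout << "construct once:          "
              << duration_cast<milliseconds>(End - Mid).count() << " ms\n";
}
```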
To me, sampling profilers add the biggest value. They work by periodically sampling the PC of the process and correlating those locations back to source code. They are statistical for sure, but if they collect a lot of samples in a function, chances are your app spends a lot of time in that function. They’re inaccurate under optimization – they can’t always find the source line that contributes to a particular machine instruction – but they can reveal details about the implementation of functions, most importantly inlined functions, which instrumented profilers normally can’t. Instrumented profilers are intrusive, need modification of the binary and consequently can have nasty interactions with the optimizer. Your mileage may vary.
So, how much did things improve? Well, remember the first screenshot in the article? That’s from the latest 1.0 build. Here’s a screenshot from the first build where I had the CPU performance measurement routine working:
So, yes. More than 20x improvement. And it does show. The hour-long install process now only takes two minutes or so. Of course the console is much more responsive as well. I could in fact make the real-time clocks of the IOPs truly real-time – though I had to cheat on the year: COS is very much not Y2K compliant; in fact it doesn’t accept anything beyond 1999 as a date. So the RTC lies about the year, making COS think it’s 20 years before today.
It’s curious that the 64-bit builds now perform significantly better than the 32-bit ones. The easy (and wrong) explanation would be that since I’m emulating a 64-bit system, of course I get a nice gain by running the code in native 64-bit mode. Except, the big benefit is in the IOP emulation, and the IOPs are 16-bit processors. I’m guessing that the extra 8 registers available in x64 mode are responsible for the extra boost, but I’m not sure. At the same time, does it really matter? It’s fast, so I’ll take it.
Final touches
The published package for the first time contains (64-bit) Linux binaries as well. They are statically linked which is against all best practices, but hopefully it makes the binaries actually work on your system. If not, you can re-compile yours from sources.
The package is available from the download page.
So, till the next time, have fun and enjoy the speed!