On Function Call Inlining

I’ve heard many people claim that the main benefit of inlining is that the compiler can save the call and return instructions.

That’s all well and good, but it barely scratches the surface of what inlining can do.

The main reason inlining is such a powerful optimization is that, once the function is inlined, the compiler can optimize it together with the call-site. Constants can be propagated into the function body, loops can be re-arranged more efficiently, code can be hoisted, and registers can be allocated (more) optimally: essentially all the heavy artillery an optimizing compiler has can be deployed on the merged function. Furthermore, since the compiler has full information about what goes on inside the called function, it doesn’t have to make conservative assumptions about side-effects, subsequent function calls, etc.

The effects of all the above are far more impactful than the saved call and return – even when you take the potential branch-misprediction penalty and parameter save operations into consideration.

Some discussions get this right; others merely note it in passing and move on.

The down-side, of course, is that inlining can increase code-size. Over-use of inlining can generate more (instruction) cache-misses, which can hurt your performance quite a bit. My personal guidelines:

  • Small functions should be inlined
  • Large functions with only a few call-sites could be inlined
  • Large functions called from many places should not be inlined – though a few specific call-sites still could be

Of course, most of these decisions are already made by your compiler, but checking the results – and, on occasion, forcing the compiler to bend your way – can be beneficial: unless you use profile-guided optimization, the compiler has to make inlining decisions based on a static view of the code, so it can’t take, for example, execution frequency into consideration.

Finally, the biggest enemies of inlining (in C++) are virtual functions: it’s very hard for the compiler to see through a virtual function call and realize that you always (or most of the time) call the same virtual function. Providing non-virtual variants and manually calling them in cases when inlining is expected is probably the best way around this problem.


4 thoughts on “On Function Call Inlining”

  1. The countervailing design element for inlining is the increased pressure on the instruction cache, memory bandwidth, and (for embedded) the mass storage capacity of your processor. Because inlining effectively copies the function code into the calling code instead of rerunning the same code, it increases total code size. If inlining causes a critical loop to exceed the cache size, the resulting cache misses can drastically reduce overall performance. If possible always *profile* and *benchmark* your code before and after any changes to measure the impact. Remember, “if you’re not measuring, you’re not optimizing.” 🙂

    From a practical point of view one has to understand that “inline” is only a hint. A good optimizing compiler will frequently inline functions automatically without the inline hint, and will refuse to inline functions exceeding a certain complexity (for example, those with more than a single “flow control” element).

    From a design point of view inlining is a technique that encourages “cost free” encapsulation and modularity. One can take a complex function and break it into a number of small, well-contained, and easily understood functions, which inlining seamlessly integrates. It’s also a good practice as it extends well to advanced design concepts such as templates and traits, where each operation within a larger function is abstracted and encapsulated.

    For a good overview of all of these considerations see: http://www.cplusplus.com/forum/articles/20600/

    • John,

      I agree with most of what you’re saying. A couple points though:

      • In many cases instruction cache size is not a problem; in many others it is. So, as always, the answer is: it depends.
      • You’re right, ‘inline’ is only a hint – if even that. The compiler is free to inline something that wasn’t marked ‘inline’ and to not inline something that was. At the same time there usually exist other (compiler-specific) directives to force your decision on the compiler in either direction. While some compilers are smart in making inlining decisions, some aren’t.
      • Partially because of the drawbacks you mentioned, compilers might not inline something that should be inlined. Automatic inlining can be dramatically better with profile-guided optimization, if it’s available and you have good profiles to train the compiler with. Static code analysis only gets you so far.
      • Your last point is spot-on: inlining helps you structure your code and worry about optimization later.

      Thanks again for your insightful comment!


  2. Of course on platforms where there’s no instruction cache, like on many pipelined cache-less microcontrollers, the following apply:

    1. Functions called only once can ALWAYS be inlined and they will always reduce code size. At least the call and ret instructions will be removed. This is independent of anything else about the function. This optimization could be done, in a pinch, even by the linker. On typical micros this saves you 4 bytes per call (three for the call, one for the return).

    2. Functions that have no side effects (used only for their result, which means no I/O either) and are called with compile-time constant parameters can ALWAYS be executed at compilation time, with the call replaced by the returned value. It would take a very incompetent compiler for this operation to grow the code size or runtime, and my experience is that it NEVER grows code size and runtime. In fact, in MOST cases, it decreases code size and runtime. For some functions (say transcendentals on micros w/o FPU), this can result in very significant code size reductions.

    3. Even functions with side effects that are called with some compile-time constants can be specialized on those constants. In some cases, in spite of those specializations, all of the specializations taken together are still smaller than the single copy of the original function, and result in code size savings. This is easy to decide: as long as the code size doesn’t grow, the specializations can be done. So this is a perfectly safe thing to do.

    LLVM’s LTO (link-time optimization) is pretty nifty at supporting all that, and I’ve had some fun in my ongoing efforts to port LLVM to some microcontrollers that never had decent, 21st-century compiler support. I’m constantly amazed by how decent assembly can result from non-allocating C++ code that on the surface would seem like a bloated disaster. In fact, I’ve found that I can often generate smaller and faster assembly when writing in C++, not in C!

    • Thanks for the comment. I completely agree with your point, only a few notes.

      For #2, it might be hard (in C++ at least) for a compiler to determine that a function is in fact pure (has no side-effects).

      #3 of course is only a win if you have no non-specialized call-sites. If there’s even a single one (or one where the compiler can’t determine that you are in fact calling it with constant parameters), all specialized copies add to the size, as the original needs to be kept around. This can also happen any time you take the address of a function: that function pointer will have to point to the non-specialized variant, which can easily happen through virtual function tables. Finally, the compiler can only safely throw away the non-specialized variant if it can prove that no external calls to it can happen – that is, if it’s declared static. If it can’t, the linker might still determine that the generic version is unreferenced and throw it out of the binary, but (unless you do link-time codegen) by that time the optimization step has already been done, so the code-size estimation took the generic version into account.

      Especially on small micros without a cache, it’s often the case that code space is much less of a precious resource than working-set storage (huge FLASH, tiny RAM). On those architectures, it’s beneficial to grow the code with specialized copies to eliminate stack or static data – within reason, of course. This can be done, for example, by passing parameters as template arguments instead of normal function parameters. Of course this is only possible where the value of the parameter is a compile-time constant, but this way you prescribe your intent to the compiler instead of guessing what the optimizer might do to your code. I’ve had success with this approach on AVRs, getting very efficient ‘device drivers’. And yes, in this case C++ generated faster assembly than at least a typical C implementation of the same would have.
