May 8, 2017

I stumbled upon a CppCon 2016 talk on C++ performance analysis by Matt Dziubinski called “Computer Architecture, C++, and High Performance”. I thought it was an excellent talk. I’m acutely aware of how higher-level abstractions have created a bubble that we usually don’t need to leave in order to write fast code. That said, Matt makes it clear that if you really want to understand (and hopefully improve) performance, then you will probably have to wander out of your comfort zone.

He covers common bottlenecks and many of the tools available for analyzing the performance of code. Among the topics:

  1. Performance metrics
  2. Cache misses
  3. Branch (mis)prediction
  4. Instruction level parallelism (compiler optimization)
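
To make the cache-miss item a little more concrete, here is a minimal sketch of my own (it is not code from the talk, and the function names are hypothetical). Both functions add up the same square matrix, stored row by row in one flat vector. The first walks memory in order; the second jumps n elements at a time, which on large matrices tends to cause far more cache misses. How big the difference is depends on the machine, so measure before drawing conclusions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration. The n x n matrix lives in a flat
// vector in row-major order: element (r, c) is at index r * n + c.

// Visits elements in the order they are laid out in memory.
std::int64_t sum_row_major(const std::vector<int>& m, std::size_t n) {
    std::int64_t total = 0;
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t c = 0; c < n; ++c) {
            total += m[r * n + c];  // consecutive addresses
        }
    }
    return total;
}

// Visits the same elements column by column, so each access is n ints
// away from the previous one.
std::int64_t sum_col_major(const std::vector<int>& m, std::size_t n) {
    std::int64_t total = 0;
    for (std::size_t c = 0; c < n; ++c) {
        for (std::size_t r = 0; r < n; ++r) {
            total += m[r * n + c];  // stride of n elements
        }
    }
    return total;
}
```

The two functions always return the same value; only the access pattern differs, which is exactly the kind of thing the tools below can help you see.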

I think an important realization is that there will never be a C++ talk that says “This is how you write code that is fast in all scenarios,” but rather, “Write some code. If it does well, that’s great; if it’s not good enough, then let’s dig deep and see what we can improve.” If you keep that in mind when watching talks like this, I think they are more enjoyable. There will (likely) never be a std::make_my_code_uber_fast().

I thought it would be nice to summarize the tools he used or recommended for troubleshooting bottlenecks and for analyzing or benchmarking C++ code. Before I list the ones from the talk, I will quickly mention VTune from Intel, which is fairly high level and, in my experience, can be good for finding bottlenecks. Each tool is listed with a description from its website:

Performance on modern processors requires much more than optimizing single thread performance. High-performing code must be:

  • Threaded and scalable to utilize multiple CPUs
  • Vectorized for efficient use of multiple FPUs
  • Tuned to take advantage of non-uniform memory architectures and caches

With Intel® VTune™ Amplifier, you get all these advanced profiling capabilities with a single, friendly analysis interface. And for media applications, you also get powerful tools to tune OpenCL* and the GPU.

If you can’t get the gains you need from using something like VTune, then it’s time to get your hands dirty with the tools Matt mentions in his talk:

Nonius is a framework for benchmarking small snippets of C++ code. It is very heavily inspired by Criterion, a similar Haskell-based tool. It runs your code, measures the time it takes to run, and then performs some statistical analysis on those measurements. The source code can be found on GitHub.
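
The "run it, time it, then do statistics on the timings" idea can be sketched with nothing but the standard library. To be clear, this is not Nonius's API, just a rough illustration of the principle; `collect_samples`, `summarize` and `Summary` are names I made up. A real framework does a lot more (warm-up, clock resolution estimation, outlier analysis), and any harness also has to keep the optimizer from deleting the work being measured.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical types and helpers, not part of Nonius.
struct Summary {
    double mean;    // average time per run, in nanoseconds
    double stddev;  // sample standard deviation, in nanoseconds
};

// Calls f() `runs` times and records how long each call took.
template <typename F>
std::vector<double> collect_samples(F&& f, std::size_t runs) {
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        f();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(stop - start).count());
    }
    return samples;
}

// Reduces the raw timings to a mean and a sample standard deviation.
Summary summarize(const std::vector<double>& samples) {
    if (samples.empty()) {
        return {0.0, 0.0};
    }
    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / static_cast<double>(samples.size());

    double squared_diff = 0.0;
    for (double s : samples) squared_diff += (s - mean) * (s - mean);
    const double stddev =
        samples.size() > 1
            ? std::sqrt(squared_diff / static_cast<double>(samples.size() - 1))
            : 0.0;
    return {mean, stddev};
}
```

Calling `summarize(collect_samples(work, 100))` on some callable `work` gives a mean and a spread. A large spread relative to the mean is usually a hint that the measurement is noisy and should not be trusted yet.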

Intel® Memory Latency Checker
Intel® Memory Latency Checker (Intel® MLC) is a tool used to measure memory latencies and b/w, and how they change with increasing load on the system. It also provides several options for more fine-grained investigation where b/w and latencies from a specific set of cores to caches or memory can be measured as well.

perf can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing). It is capable of lightweight profiling. It is also included in the Linux kernel, under tools/perf, and is frequently updated and enhanced.

Compiler Explorer
Compiler Explorer is an interactive compiler which shows the assembly output of compiled C/C++/Rust/Go/D code with any given compiler and settings.
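
A small experiment I find useful with a tool like this ties back to the instruction-level parallelism topic above: paste in two versions of the same loop and compare the generated assembly. The sketch below is my own example, not from the talk, and the function names are hypothetical. The first version has a single accumulator, so every addition depends on the previous one. The second keeps four independent accumulators, which gives the CPU independent additions it can overlap. Note that the two can differ slightly in floating-point rounding, which is why compilers generally will not make this transformation for you unless you relax floating-point rules.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: each add has to wait for the previous add to finish.
double sum_single(const std::vector<double>& v) {
    double acc = 0.0;
    for (double x : v) acc += x;
    return acc;
}

// Four accumulators: the four adds in each iteration do not depend on
// each other, so they can be in flight at the same time.
double sum_four(const std::vector<double>& v) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) a0 += v[i];
    return (a0 + a1) + (a2 + a3);
}
```

Whether the second version is actually faster depends on the compiler, flags and CPU, so treat it as something to inspect and measure rather than a rule.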

Disasm is a browser-based application, built on Flask, that allows you to disassemble ELF files into Intel x86 assembly. The assembly and analysis is displayed in a browser so that you can click around and interact with it.

Intel® Architecture Code Analyzer

Intel® Architecture Code Analyzer helps you statically analyze the data dependency, throughput and latency of code snippets on Intel® microarchitectures. The term kernel is used throughout the rest of this document instead of code snippet.

Likwid is a simple to install and use toolsuite of command line applications for performance oriented programmers. It works for Intel and AMD processors on the Linux operating system.

Sniper is a next generation parallel, high-speed and accurate x86 simulator. This multi-core simulator is based on the interval core model and the Graphite simulation infrastructure, allowing for fast and accurate simulation and for trading off simulation speed for accuracy to allow a range of flexible simulation options when exploring different homogeneous and heterogeneous multi-core architectures.

pmu tools is a collection of tools for profile collection and performance analysis on Intel CPUs on top of Linux perf. This uses performance counters in the CPU.

Pin is a dynamic binary instrumentation framework for the IA-32, x86-64 and MIC instruction-set architectures that enables the creation of dynamic program analysis tools.