As a follow-up to my last post on C++ performance analysis, I wanted to let people know about another CppCon 2016 talk called “Want fast C++? Be nice to your hardware” by Timur Doumler. In my opinion Timur gave a great talk, covering each topic in enough depth that the audience can go and dig deeper at their own discretion if any of the particular topics appeals to them. This talk has a bit more C++ in it than the previous talk I posted, which I appreciate. Among the topics he covers:
Data and instruction cache
Cache levels (L1, L2, L3,…)
Cache lines (typically 64 bytes on desktops)
Too long didn’t watch (though I highly recommend you do!):
Be conscious of whether you’re bound by data or computation
Prefer data to be contiguous in memory
If you can’t, prefer constant strides to randomness
Keep data close together in space (e.g., put data structures that are used one after another into a single struct)
Keep accesses to the same data close together in time
Avoid dependencies between successive computations
Avoid dependencies between two iterations of a loop
Avoid hard-to-predict branches
Be aware of cache lines and alignment
Minimize the number of cache lines accessed by multiple threads
Don’t be surprised by hardware weirdness (cache associativity, denormals, etc.)
I stumbled upon a CppCon 2016 talk on C++ performance analysis by Matt Dziubinski called “Computer Architecture, C++, and High Performance”. I thought it was an excellent talk. I’m acutely aware of how higher-level abstractions have created a bubble that we usually don’t need to leave in order to write fast code. That said, Matt makes it clear that if you really want to understand (and hopefully improve) performance, then you will probably have to wander out of your comfort zone.
He provides information on common bottlenecks that can occur, along with a lot of tools available for analyzing the performance of code.
I think an important realization is that there will never be a C++ talk that says, “This is how you write code that is fast in all scenarios.” The message is rather, “Write some code. If it performs well, great; if it’s not good enough, then let’s dig deep and see what we can improve.” If you keep that in mind when watching talks like this, I think they are more enjoyable. There will (likely) never be a std::make_my_code_uber_fast();
I thought it would be nice to summarize the list of tools he has used or recommended for troubleshooting bottlenecks and analyzing/benchmarking C++ code. Before I list the ones from the talk, I will quickly mention VTune from Intel, which is fairly high level and in my experience can be good for finding bottlenecks. Each tool is listed with a description from its website:
Tuned to take advantage of non-uniform memory architectures and caches
With Intel® VTune™ Amplifier, you get all these advanced profiling capabilities with a single, friendly analysis interface. And for media applications, you also get powerful tools to tune OpenCL* and the GPU.
If you can’t get the gains you need from using something like VTune, then it’s time to get your hands dirty with the tools Matt mentions in his talk:
Nonius
Nonius is a framework for benchmarking small snippets of C++ code. It is very heavily inspired by Criterion, a similar Haskell-based tool. It runs your code, measures the time it takes to run, and then performs some statistical analysis on those measurements. The source code can be found on GitHub.
Intel® Memory Latency Checker https://software.intel.com/en-us/articles/intelr-memory-latency-checker
Intel® Memory Latency Checker (Intel® MLC) is a tool used to measure memory latencies and bandwidth, and how they change with increasing load on the system. It also provides several options for more fine-grained investigation, where bandwidth and latencies from a specific set of cores to caches or memory can be measured as well.
perf https://perf.wiki.kernel.org/index.php/Main_Page
perf can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing). It is capable of lightweight profiling. It is also included in the Linux kernel, under tools/perf, and is frequently updated and enhanced.
Compiler Explorer https://godbolt.org/
Compiler Explorer is an interactive compiler which shows the assembly output of compiled C/C++/Rust/Go/D code with any given compiler and settings.
Disasm
Disasm is a browser-based application, built on Flask, that allows you to disassemble ELF files into Intel x86 assembly. The assembly and analysis are displayed in a browser so that you can click around and interact with it.
Intel® Architecture Code Analyzer
Intel® Architecture Code Analyzer helps you statically analyze the data dependency, throughput, and latency of code snippets on Intel® microarchitectures. The term kernel is used throughout the rest of this document instead of code snippet.
Likwid
Likwid is a tool suite of command-line applications for performance-oriented programmers that is simple to install and use. It works for Intel and AMD processors on the Linux operating system.
Sniper
Sniper is a next-generation parallel, high-speed, and accurate x86 simulator. This multi-core simulator is based on the interval core model and the Graphite simulation infrastructure, allowing for fast and accurate simulation and for trading off simulation speed for accuracy to allow a range of flexible simulation options when exploring different homogeneous and heterogeneous multi-core architectures.