In follow-up to my last post on C++ performance analysis, I wanted to let people know about another CppCon 2016 talk, “Want fast C++? Be nice to your hardware” by Timur Doumler. In my opinion Timur gave a great talk: he covers each topic just enough that audience members can dig deeper on whichever ones appeal to them. This talk also has a bit more C++ in it than the previous talk I posted, which I appreciate.

Topics covered:

• Data and instruction caches
• Cache levels (L1, L2, L3, …)
• Cache lines (typically 64 bytes on desktops)
• The prefetcher
• Cache associativity
• The pipeline
• Instruction-level parallelism
• The branch predictor
• Memory alignment
• Multiple cores
• SIMD

Too long; didn’t watch (though I highly recommend you do!):

• Be conscious of whether you’re bound by data or computation
• Prefer data to be contiguous in memory
• If you can’t, prefer constant strides to randomness
• Keep data close together in space (e.g., put data structures that are used one after another into a struct)
• Keep accesses to the same data close together in time
• Avoid dependencies between successive computations
• Avoid dependencies between two iterations of a loop
• Avoid hard-to-predict branches
• Be aware of cache lines and alignment
• Minimize the number of cache lines accessed by multiple threads
• Don’t be surprised by hardware weirdness (cache associativity, denormals, etc.)

I stumbled upon a talk by Matt Dziubinski from CppCon 2016 called “Computer Architecture, C++, and High Performance”. I thought it was an excellent talk. I’m acutely aware of how higher-level abstractions have created a bubble that we usually don’t need to leave in order to write fast code. That said, Matt makes it clear that if you really want to understand (and hopefully improve) performance, then you will probably have to wander out of your comfort zone.

He provides information on common bottlenecks and on many of the tools available for analyzing code performance. Among the topics:

1. Performance metrics
2. Cache misses
3. Branch (mis)prediction
4. Instruction level parallelism (compiler optimization)

I think an important realization is that there will never be a C++ talk that says “This is how you write code that is fast in all scenarios,” but rather, “Write some code. If it performs well, that’s great; if it’s not good enough, then let’s dig deep and see what we can improve.” If you have that in mind when watching talks like this, I think they are more enjoyable. There will (likely) never be a std::make_my_code_uber_fast();

I thought it would be nice to summarize the list of tools he has used or recommended for troubleshooting bottlenecks and analyzing/benchmarking C++ code. Before I list the ones from the talk, I will quickly mention VTune from Intel, which is fairly high level and in my experience can be good for finding bottlenecks. Each tool is listed with a description from its website:

VTune
https://software.intel.com/en-us/intel-vtune-amplifier-xe
Performance on modern processors requires much more than optimizing single thread performance. High-performing code must be:

• Threaded and scalable to utilize multiple CPUs
• Vectorized for efficient use of multiple FPUs
• Tuned to take advantage of non-uniform memory architectures and caches

With Intel® VTune™ Amplifier, you get all these advanced profiling capabilities with a single, friendly analysis interface. And for media applications, you also get powerful tools to tune OpenCL* and the GPU.

If you can’t get the gains you need from something like VTune, then it’s time to get your hands dirty with the tools Matt mentions in his talk:

Nonius
https://nonius.io/
Nonius is a framework for benchmarking small snippets of C++ code. It is very heavily inspired by Criterion, a similar Haskell-based tool. It runs your code, measures the time it takes to run, and then performs some statistical analysis on those measurements. The source code can be found on GitHub.

Intel® Memory Latency Checker
https://software.intel.com/en-us/articles/intelr-memory-latency-checker
Intel® Memory Latency Checker (Intel® MLC) is a tool used to measure memory latencies and bandwidth, and how they change with increasing load on the system. It also provides several options for more fine-grained investigation, where bandwidth and latencies from a specific set of cores to caches or memory can be measured as well.

perf
https://perf.wiki.kernel.org/index.php/Main_Page
perf can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing). It is capable of lightweight profiling. It is also included in the Linux kernel, under tools/perf, and is frequently updated and enhanced.

Compiler Explorer
https://godbolt.org/
Compiler Explorer is an interactive compiler which shows the assembly output of compiled C/C++/Rust/Go/D code with any given compiler and settings.

Disasm
https://github.com/mongodb-labs/disasm
Disasm is a browser-based application, built on Flask, that allows you to disassemble ELF files into Intel x86 assembly. The assembly and analysis is displayed in a browser so that you can click around and interact with it.

Intel® Architecture Code Analyzer
https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

Intel® Architecture Code Analyzer helps you statically analyze the data dependency, throughput and latency of code snippets on Intel® microarchitectures. The term kernel is used throughout the rest of this document instead of code snippet.

Likwid
https://github.com/RRZE-HPC/likwid
Likwid is a simple to install and use toolsuite of command line applications for performance oriented programmers. It works for Intel and AMD processors on the Linux operating system.

Sniper
http://snipersim.org/w/The_Sniper_Multi-Core_Simulator
Sniper is a next generation parallel, high-speed and accurate x86 simulator. This multi-core simulator is based on the interval core model and the Graphite simulation infrastructure, allowing for fast and accurate simulation and for trading off simulation speed for accuracy to allow a range of flexible simulation options when exploring different homogeneous and heterogeneous multi-core architectures.

pmu-tools
https://github.com/andikleen/pmu-tools
pmu-tools is a collection of tools for profile collection and performance analysis on Intel CPUs on top of Linux perf. It uses the performance counters in the CPU.

Pin
https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
Pin is a dynamic binary instrumentation framework for the IA-32, x86-64 and MIC instruction-set architectures that enables the creation of dynamic program analysis tools.

If you are banging your head against the wall because Eclipse Neon is refusing to resolve C++11 functions (unresolved symbols), you’ve come to the right place. This is the combined effort of multiple Stack Overflow posts and previous experience; I made a post on the same problem for Eclipse Kepler.

First of all, if you’ve tried everything I’m about to describe and still haven’t gotten the right results: Get rid of any bad metadata by following this recommendation. This can help when you’re importing projects from previous versions of Eclipse (or overwriting the .cproject with an older version):

Another brute-force way is to close Eclipse, open your workspace directory and go to “.metadata\.plugins\org.eclipse.cdt.core” and delete everything in there.

First, update the CDT GCC Built-in Compiler Settings:

1. Project -> Properties -> C/C++ General -> Preprocessor Include Paths, Macros etc.
2. Select the Providers tab
3. Append “-std=c++11” to the compiler specs command

Second, update the project paths and symbols:

1. Project -> Properties -> C/C++ General -> Paths and Symbols
2. Add __cplusplus with a value of 201103L

Third, update the C++ development tooling:

1. Project -> Properties -> C/C++ General -> Preprocessor Include Paths, Macros etc.
2. Under the Entries tab select GNU C++ -> CDT User Setting Entries
3. Add __cplusplus with a value of 201103L, making sure “Treat as built-in” is selected.

Finish off by re-indexing your project. Hopefully that solves it for you; it’s all I needed to do.

I recently ran into a bug where an asynchronous operation, namely boost::asio::async_write from Boost.Asio, wasn’t complete before the next call to async_write was made. This means you can’t guarantee the order in which packets are received on the other end of the socket (or even whether all the data from the previous write makes it). For reference, someone else had a similar problem and posted on Stack Overflow.

Now the solution offered (and other people seem to recommend the same thing) is basically to use a queue to manage the write operations, as shown in the Boost examples. This solution is lovely, but then you have to start worrying about things like:

1. This queue could grow arbitrarily large
2. It becomes more complicated to manage memory efficiently if you have multiple clients that need to receive the same data

Okay, maybe those aren’t so hard to deal with. However, effort also has to be put into establishing a solid protocol. What happens if a client disconnects during a message? Do you cache until they reconnect? This depends on the demands of the client, but obviously things get hairy quite quickly. Questions keep coming up for edge cases and your codebase expands…

I feel like these types of problems have been solved. Where’s the library I’m not using? Or is it just too hard to abstract away problems like this without being too rigid?

A while back I started doing problems on Topcoder for practice. Yesterday at work I noticed that there was a competition this weekend, so I thought, why not? A competitive Topcoder match is 1 hour and 15 minutes long; that doesn’t take much out of the day 🙂

Everything started okay: I logged in about 10 minutes before the coding match started. I’m lucky I woke up on time, since I didn’t set an alarm… it’s the weekend! I registered and started waiting with everyone else. However, when it wouldn’t let me open a match due to timeouts, I refreshed the page, after which I couldn’t log in to Topcoder.com for somewhere between 10 and 15 minutes after the coding phase had begun. This was frustrating, but I marched forward!

Problem 1:

Given an ordered collection of coins represented by a string whose characters are either ‘H’ or ‘T’, a coin is interesting if at least one of its neighbours is in a different state. E.g., “HHH” has 0 interesting coins and “HTH” has 3 interesting coins.

Method signature:

My solution felt really clumsy; it takes a lot of code to check the edge cases, but maybe that’s necessary. The main strategy I took was to loop through the string from the 2nd element to the second-last element and compare neighbours.

Problem 2:

Two robots start in positions given by (x1, y1) and (x2, y2) respectively. You are given a string that represents movements that each robot will take; e.g., “LRU” means that each robot will move left, then right, then up. Return “Explosion” if there is the possibility that the two robots will cross paths, “Safe” otherwise.

Method signature:

My strategy was to keep a list of all positions of each robot and then check to see if Robot1 ever has a position that Robot2 also had. This passed all but one of the unit tests… I might spend some time tonight debugging it.

EDIT: Oh my… I misread the problem statement. What I said above was my abbreviation of the statement, but it turns out that the robots might skip subsets of instructions, so unfortunately the code below solves the wrong problem.

This could do with some obvious refactoring, but time constraints really push you to write code fast. Knowing that I misread the problem makes me feel better, since I was bewildered as to why one of the unit tests was failing. If I were to redo the problem, I wouldn’t iterate through every position; since every sequence defines a rectangle of possible positions, it is enough to check whether the rectangles that the two robots generate have a non-empty intersection.

Problem 3:

Given a sequence of n integers, we can compute the bitwise XOR of each pair of integers. This will give us a list of n^2 integers. Your goal in this task is to reverse this process: you will be given the list of n^2 integers and you have to count all sequences of length n that would produce it.

Method signature:

where each integer in the sequence is less than m.

Unfortunately there was only 1 minute left in the competition when I opened this problem. But I’ll give the problem-solving portion a shot. Here is an example of the integers 5 and 3, whose bitwise XOR produces 6:

0101
0011 XOR
—–
0110

but this could also be produced by:

1101
1011 XOR
—–
0110

1100
1010 XOR
—–
0110

0100
0010 XOR
—–
0110

So I think the strategy is to track the 0’s in the bitwise XOR. The integer m gives you the maximum number of bits: if m were 8, only 3 bits would be needed; for m = 16, 4 bits; etc. So for every pair, there are 2^(max bits – (number of zeros in the result of the XOR)) combinations. Store this result for every pair and multiply them together… or something along those lines 🙂 Following this logic somewhere would likely be fruitful.

Conclusion:

I was nervous trying this, but now I feel addicted and look forward to the next event!

I ran into std::recursive_mutex while trying to debug something today. First let me illustrate. I found some class methods which looked like this:

Gasp! I’m thinking to myself, “Why the heck haven’t I seen a deadlock with this before?”. Clearly function_A() is getting a lock on the mutex, and calling function_B() which locks the mutex again.

“Brock, what is ‘lock_t’?……”

Ah ha! A type of mutex I hadn’t used before. I wrote a previous post about how knowledge of the standard library is a good thing, and I stand by it, but it looks like there is a lot of pushback against using recursive locks, and for good reason:

In this case, everything worked fine, but until I checked the typedef I was questioning my whole universe! It led me down a path that I didn’t need to go down for the bug I was trying to find.

My question is the following….

How do I write code that isn’t going to derail the next programmer who is debugging something? Well, clean coding always helps, but in this case the functions above look incredibly clean… Thankfully there is an effort from the modern C++ community on the C++ Core Guidelines.

And hey, look at this guideline in the works:

• Prefer non-recursive locks (often used to work around bad reasoning, overhead)

When I started piano lessons about 7 months ago, I began to find interesting parallels between writing code and playing piano. It shouldn’t be hard to imagine that one could learn a programming language the way a musician learns an instrument. There is a set of basic tools and principles. In piano, there is the basic syntax of reading music, tools such as arpeggios and scales, and principles like the circle of fifths.

The first thing I did when I wanted to move into C++ development was write an open source library in C++ to better grasp the tools and principles of the language. By the time I was finished, I had exposure to a lot of the basic tools from the standard library and basic principles like type deduction, polymorphism, and templates, as well as more language-generic topics such as design patterns, separation of concerns, clean coding, and unit testing.

As part of my journey to expose myself to as much piano as I could, I started watching a lot of YouTube videos of pianists. Even if you don’t play piano or any instrument, I would suggest listening to this video, at least from 2:15 onwards.

There are an incredible number of parallels to learning a programming language, but the following stuck with me:

Incorporating scales into your daily practice routine is the most efficient shortcut to technical mastery and brilliance, and it’s an indissoluble part of every great pianist’s journey.

– Ilinca Vartic

Now if Ilinca was a programmer maybe she would have worded it like this:

Incorporating functions from the standard library into your coding projects is the most efficient shortcut to technical mastery and brilliance, and it’s an indissoluble part of every great programmer’s journey.

Not only do you gain fundamental knowledge of the language you’re using, but you don’t have to write unit tests for it! At some point we all realize our wheel probably isn’t as good as the wheel from the standard library. Though there are many facets of becoming an expert at a programming language, I think having intimate knowledge of its standard library is important. It’s starting to become more important to me, at least.

I’m happy to announce that the first working release of spgl1++ is now available. It is based on the Matlab implementation of Michael Friedlander and Ewout van den Berg and is meant to be a standalone library which can be included in your C++ projects.

It’s still “early” in development; as such, it only solves the regular basis pursuit problem with a real measurement matrix $A$:

$\displaystyle \min_{x \in \mathbb{R}^d}||x||_1$  subject to $Ax=b$

Features I’m currently working on, which shouldn’t be far away, include solving the weighted version with denoising and a complex measurement matrix $\Phi$:

$\displaystyle \min_{x \in \mathbb{R}^d}||x||_{1,w}$  subject to $||\Phi x-b||_2 < \epsilon$

I’ll be tweaking it and testing various matrix vector libraries. It currently works with Armadillo, a C++ linear algebra library whose syntax (API) is deliberately similar to Matlab.

Since the goal is to accommodate general matrix/vector libraries, you will have to provide spgl1++ with certain template specializations, such as how it does matrix-vector products. For example, with Armadillo:

I say “should”, since I’ve only really tested it with Armadillo and I’m sure I’ve accidentally added some dependencies in the code and will have to work on making it as general as possible. In the coming days/weeks/months I’ll be doing testing and adding documentation.

Happy sparse reconstruction!

I have been using a library called libunittest for unit testing an open source project. A couple of months ago I made a feature request for it, namely an assertion for relative approximation. It’s an incredibly easy library to use and comes with a lot of documentation. Here is one of the examples, which shows the “easy” way where you don’t have to register the test class:

After installing, link against it, compile, and execute:

This example and others are covered in the tutorial provided on the website: http://libunittest.sourceforge.net/tutorial.html

I personally like a different style that libunittest supports, which involves writing a bit more code, but I believe it’s a nice way of organizing a test suite:

For learning about the other types of assertions you can do, you may want to peruse the testing code:

http://sourceforge.net/p/libunittest/code/ci/master/tree/test/test_assertions.cpp

Happy testing!

I had a problem this weekend where the following snippet would compile within Eclipse, yet the IDE would still complain with things like “Symbol ‘array’ could not be resolved”:

There were lots of people who were able to fix this problem by simply adding the appropriate paths, but since the code compiles in my case, this wasn’t the solution. Anyway, some genius on Stack Overflow figured it out: http://stackoverflow.com/questions/17131744/eclipse-cdt-indexer-does-not-know-c11-containers

Add a symbol with the name “__cplusplus” (which in this case is apparently an override) and the value “201103L”.

Now use “Run C/C++ Code Analysis” and the red underlining from the unresolved imports goes away:

There is still that red on the left-hand side, which is just highlighting of the scope, and it bothers me. This can be removed by right-clicking on the red portion, going to Preferences -> Text Editors, and removing the “Show range indicator” check mark:

Finally something nice looking:

Note that for this example I still needed to ensure I was using the C++11 language standard: