brock.hargreaves

May 242017
 

In follow up to my last post on C++ performance analysis, I wanted to let people know about another cppcon 2016 talk called “Want fast C++? Be nice to your hardware” by Timur Doumler. Timur gave a great talk in my opinion, allowing the audience to go and dig deeper at their own discretion if any of the particular topics appeal to them. This talk has a bit more C++ than the previous talk I posted, which I appreciate.

Topics covered:

  • Data and instruction cache
  • Cache levels (L1, L2, L3,…)
  • Cache lines (typically 64 byte on desktops)
  • prefetcher
  • cache associativity
  • pipeline
  • instruction level-parallelism
  • branch predictor
  • memory alignment
  • multiple cores
  • SIMD

Too long didn’t watch (though I highly recommend you do!):

  • Be conscious whether you’re bound by data or computation
  • prefer data to be contiguous in memory
  • If you can’t, prefer constant strides to randomness
  • Keep data close together in space (e.g., putting data structures that are used one after another into a struct)
  • keep accesses to the same data close together in time
  • Avoid dependencies between successive computations
  • Avoid dependencies between two iterations of a loop
  • avoid hard-to-predict branches
  • be aware of cache lines and alignment
  • minimized the number of cache lines accessed by multiple threads
  • don’t be surprised by hardware weirdness (cache associativity, denormals, etc)
May 082017
 

I stumbled upon a talk by Matt Dziubinksi from CppCon 2016 called C++ performance analysis called “Computer Architecture, C++, and High Performance”. I thought it was an excellent talk. I’m acutely aware of how higher level abstractions have created a bubble that we usually don’t need to leave in order to write fast code. That said, Matt makes it clear that if you really want to understand (and hopefully improve) performance than you will probably have to wander out of your comfort zone.

He provides information on common bottlenecks that can occur and a lot of tools available for analyzing the performance of code. Among the topics:

  1. Performance metrics
  2. Cache misses
  3. Branch (mis)prediction
  4. Instruction level parallelism (compiler optimization)


I think an important realization is that there will never be a C++ talk that says “This is how you write code that is fast in all scenarios.” but rather, “Write some code. If it does well that’s great, if it’s not good enough then let’s dig deep and see what we can improve on.” If you have that in mind when watching talks like this, I think they are more enjoyable. There will (likely) never be a std::make_my_code_uber_fast();

I thought it would be nice to summarize the list of tools he has used or recommended when troubleshooting bottlenecks, and analyzing/benchmarking C++ code. Before I list the ones from the talk, I will quickly mention VTune from Intel which is fairly high level and in my experience can be good for finding bottlenecks. Each tool is listed with a description from their website:

VTune
https://software.intel.com/en-us/intel-vtune-amplifier-xe
Performance on modern processors requires much more than optimizing single thread performance. High-performing code must be:

  • Threaded and scalable to utilize multiple CPUs
  • Vectorized for efficient use of multiple FPUs
  • Tuned to take advantage of non-uniform memory architectures and caches

With Intel® VTune™ Amplifier, you get all these advanced profiling capabilities with a single, friendly analysis interface. And for media applications, you also get powerful tools to tune OpenCL* and the GPU.
If you can’t get the gains you need from using something like VTune, then it’s time to get your hands dirty with the tools Matt mentions in his talk:

Nonius:
https://nonius.io/
Nonius is a framework for benchmarking small snippets of C++ code. It is very heavily inspired by Criterion, a similar Haskell-based tool. It runs your code, measures the time it takes to run, and then performs some statistical analysis on those measurements. The source code can be found on GitHub.

Intel® Memory Latency Checker
https://software.intel.com/en-us/articles/intelr-memory-latency-checker
Intel® Memory Latency Checker (Intel® MLC) is a tool used to measure memory latencies and b/w, and how they change with increasing load on the system. It also provides several options for more fine-grained investigation where b/w and latencies from a specific set of cores to caches or memory can be measured as well.

perf
https://perf.wiki.kernel.org/index.php/Main_Page
perf can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing). It is capable of lightweight profiling. It is also included in the Linux kernel, under tools/perf, and is frequently updated and enhanced.

Compiler Explorer
https://godbolt.org/
Compiler Explorer is an interactive compiler which shows the assembly output of compiled C/C++/Rust/Go/D code with any given compiler and settings.

Disasm
https://github.com/mongodb-labs/disasm
Disasm is a browser-based application, built on Flask, that allows you to disassemble ELF files into Intel x86 assembly. The assembly and analysis is displayed in a browser so that you can click around and interact with it.

Intel® Architecture Code Analyzer
https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

Intel® Architecture Code Analyzer helps you statically analyze the data dependency, throughput and latency of code snippets on Intel® microarchitectures. The term kernel is used throughout the rest of this document instead of code snippet.

Likwid
https://github.com/RRZE-HPC/likwid
Likwid is a simple to install and use toolsuite of command line applications for performance oriented programmers. It works for Intel and AMD processors on the Linux operating system.

Sniper
http://snipersim.org/w/The_Sniper_Multi-Core_Simulator
Sniper is a next generation parallel, high-speed and accurate x86 simulator. This multi-core simulator is based on the interval core model and the Graphite simulation infrastructure, allowing for fast and accurate simulation and for trading off simulation speed for accuracy to allow a range of flexible simulation options when exploring different homogeneous and heterogeneous multi-core architectures.

pmu-tools
https://github.com/andikleen/pmu-tools
pmu tools is a collection of tools for profile collection and performance analysis on Intel CPUs on top of Linux perf. This uses performance counters in the CPU.

Pin
https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
Pin is a dynamic binary instrumentation framework for the IA-32, x86-64 and MIC instruction-set architectures that enables the creation of dynamic program analysis tools.

 

Jan 242017
 

If you are banging your head against the wall because Eclipse Neon is refusing to resolve C++11 functions (unresolved symbols) in Neon you’ve come to the right place. This is the combined effort of multiple Stackoverflow posts and previous experiences, for Eclipse Kepler, I made a post for the same problem.

First of all, if you’ve tried everything I’m about to describe and still haven’t gotten the right results: Get rid of any bad metadata by following this recommendation. This can help when you’re importing projects from previous versions of Eclipse (or overwriting the .cproject with an older version):

Another brute-force way is to close Eclipse, open your workspace directory and go to “.metadata\.plugins\org.eclipse.cdt.core” and delete everything in there.

First, update the CDT GCC Built-in Compiler Settings:

  1. Project -> Properties -> C/C++ General -> Preprocessors includes Paths, Macros, etc
  2. Select the Providers tab
  3. Append “-std=c++11” to the compiler specs command

-std=c++11

Second, updated the project paths and symbols:

  1. Project -> C/C++ General -> Paths and Symbols
  2. Add __cplusplus with a value of 201103L

__cplusplus

Third, update the C++ development tooling:

  1. Project -> Properties -> C/C++ General -> Preprocessors includes Paths, Macros, etc
  2. Under the Entries tab select GNU C++ -> CDT User Setting Entries
  3. Add __cplusplus with a value of 201103L, make sure “Treat as built-in” is selected.

built-in

 

Finish off by re-indexing your project. Hopefully that solves it for you, it’s all I needed to do.

Dec 252016
 

I recently ran into a bug where an asynchronous operation, namely boost::asio::async_write from Boost’s asynchronous library,  wasn’t completed before the next call to async_write was made. This results in not being able to guarantee the order in which packets are received on the other end of the socket (or even if all the data from the previous write makes it…?). For reference, someone else had a similar problem and posted on Stackoverflow.

Now the solution offered (and other people recommend the same thing it seems..) is basically to use a queue to manage the write operations, as is expressed in the Boost examples, This solution is lovely, but then you have to start worrying about things like:

  1. This queue could grow arbitrarily large
  2. It becomes more complicated to manage memory efficiently if you have multiple clients that need to receive the same data

Okay, maybe those aren’t so hard to deal with. However effort also has to be put into establishing a solid protocol. What happens if a client disconnects during a message? Do you cache until they reconnect? This depends on the demands of the client, but it’s obvious things get hairy quite quickly. Questions keep coming up for edge cases and your codebase expands….

I feel like these types of problems have been solved. Where’s the library I’m not using? Or is it just too hard to abstract away problems like this without being too rigid?

Aug 302016
 

A recent conversation I had about real time and embedded systems reminded me of a problem we encountered at my company not too long ago. Namely, cpu throttling and kernel page caching having negative side effects.

With cpu throttling, a daemon called cpuspeed will try to optimize performance and on real time applications this might actually be an inhibitor. Depending on your flavour of Linux, this daemon may or may not be installed and the location of the configuration file might change. For example, on minimal installations of Redhat the daemon is not installed. On desktop installations it is and it’s not configured for optimal performance. On Redhat you can try:

If it’s there, you can either remove it or change it’s governor to performance in /etc/sysctl.conf from, for example “ondemand” to “performance”.

Kernel page caching essentially tries to cache anything that’s read from or written to disk in RAM. If you have an application that needs an abundant amount of memory but the kernel has tied most of it up in caching, it might take too long for the kernel to relinquish that memory. In more recent versions of the Linux kernel, there is a setting that only allows the kernel a certain amount of memory to cache from. For example, to ensure the kernel leaves at least 10gb of memory available. You can can edit/etc/sysctl.conf with the following setting.

 

 

Jun 132016
 

I was watching a video series about concurrency recently and stumbled upon a function I hadn’t used before but is incredibly useful! Namely, std::call_once: http://en.cppreference.com/w/cpp/thread/call_once

It solves an interesting problem. Suppose you have some shared resource between threads for which the first time it is accessed something special should happen but only once. In the scenario presented in the video series, mutliple threads wish to access a log file to write to but the file only needs to be opened once. The funcction std::call_once and variable type std::once_flag provide an elegant way of doing that without the overhead of having to lock and unluck a mutex and check to see if the file has been opened already every time shared_print() is called.

 

Feb 062016
 

So a while back I started doing practice problems in Topcoder for practice. Yesterday at work I noticed that there was one this weekend so I thought why not? The competitive Topcoder match is 1 hour and 15 minutes long, that doesn’t take much out of the day 🙂

Everything started okay, I logged in about 10 minutes before the coding match started. I’m lucky I woke up on time, I didn’t set an alarm..it’s the weekend! I registered and started waiting with everyone else. However when it wouldn’t let me open a match due to timeouts, I refreshed the page. After which I couldn’t login in to Topcoder.com for somewhere between 10-15 minutes after the coding phase had begun. This was frustrating, but I marched forward!

Problem 1:

Given an ordered collection  of coins represented by a string whose state is either ‘H’ or ‘T’, a coin is interesting if one it’s neighbour is a different state. e.g. “HHH” has 0 interesting coins, “HTH” has 3 interesting coins.

Method signature:

My solution felt really clumsy, it takes up a lot of code to check edge cases but maybe it’s necessary. The main strategy I took was to loop through string from 2nd element to second last element and compare neighbours.

 

Problem 2:

Two robots start in positions given by (x1, x1) and (x2,y2) respectively. You are given a string that represents movements that each robot will take e.g., “LRU” means that each robot will move left and then right and then down. Return “Explosion” if there is the possibility that the two robots will cross paths, “Safe” otherwise.

Method signature:

My strategy was to keep a list of all positions of each robot and then check to see if Robot1 ever has a position that Robot2 also had. This passed all but one of the unit tests… I might spend some time tonight debugging it.

EDIT: Oh my… I misread the problem statement…What I said above was my abbreviation of the statement, but it turns out that the robot might skip subsets of instructions.. so unfortunately the code below solves the wrong problem..

This could do with some obvious refactoring, but time constraints really push you to write code fast. Knowing that I misread the problem makes me feel better, since I was bewildered as to why one of the unit tests was failing. If I were to redo the problem I wouldn’t iterate through every position but rather since every sequence defines a rectangle of possible positions, it is enough to see if the intersection of the rectangles that each robot generates has a non zero intersection.

Problem 3:

Given a sequence of n integers, we can compute the bitwise XOR of each pair of integers. This will give us a list of n^2 integers. Your goal in this task is to reverse this process: you will be given the list of n^2 integers and you have to count all sequences of length n that would produce it.

Method signature:

where each integer in the sequence is less than m.

Unfortunately there was only 1 minute left in the competition when I opened this problem. But I’ll give a shot at the problem solving portion. Here is an example of the integers 5 and 3 whose bitwise XOR produce 6

0101
0011 XOR
—–
0110

but this could also be produce by:

1101
1011 XOR
—–
0110

1100
1010 XOR
—–
0110

0100
0010 XOR
—–
0110

So I think the strategy is to track the 0’s in the bitwise XOR. The integer m will give you the maximum number of bits, so if m was 8, only 4 bits would be used, 16, 5 bits would be used etc. So for every pair, there are 2^(max bits – (number of zeros in result of xor)) combinations. Store this result for every pair and multiply them together..or something along those lines 🙂 Following this logic somewhere would likely be fruitful.

Conclusion:

I was nervous trying this, but now I feel addicted and look forward to the next event!

Jan 252016
 

I ran into std::recursive_mutex while trying to debug something today. First let me illustrate. I found some class methods which looked like this:

Gasp! I’m thinking to myself, “Why the heck haven’t I seen a deadlock with this before?”. Clearly function_A() is getting a lock on the mutex, and calling function_B() which locks the mutex again.

“Brock, what is ‘lock_t’?……”

I’m glad you asked, because look what it’s hiding:

Ah ha! A type of mutex I haven’t used before. I wrote a previous post about how knowledge of the standard library is a good thing, and I stand by it, but it looks like there is a lot of push against using recursive locks and for good reason:

http://www.zaval.org/resources/library/butenhof1.html
http://www.codingstandard.com/rule/18-3-3-do-not-use-stdrecursive_mutex/
https://cppwisdom.quora.com/Recursive-mutexes-considered-harmful

In this case, everything worked fine, but until I checked the typedef I was questioning my whole universe! It lead me down a path that I didn’t need to for the bug I was trying to find.

My question is the following….

How do I write code that isn’t going to derail the next programmer that is debugging something? Well, clean coding always helps but in this case those functions above look incredibly clean… thankfully there is an effort from the modern C++ community on Cpp Core Guidelines.

And hey, look at this guideline in the works:

  • Prefer non-recursive locks (often used to work around bad reasoning, overhead)

 

Jan 162016
 

Although IDE’s will support git, and I’ll use them for things like history comparisons, I find it faster to do a lot of my git management from the command line (merge, rebase, push, pull etc). It’s invaluable to be able to tell which branch you are on when navigating through your workspace from the terminal. In a few easy steps using an OS X software management software package called Homebrew, we can change our git terminal environment on OS X from this:

Default bash
into this:

Git bash

Install brew:

Install git and bash-completion

If you don’t have one already, create .bash_profile in your user home directory and add the following

And that’s it! Since I found myself setting up my developer environment on OS X again, I also configured some global git settings while I’m at it:

Jan 032016
 

Well there it goes, my first year of blogging. It was certainly a bit on and off earlier in the year, I wasn’t entirely sure what I was getting into or what I was going to talk about. All I knew is that I wanted to have a public place to share my thoughts and interesting pieces of literature or code.

I didn’t really start to kick it into gear until I listened to a Ruby Rogues Podcast about marketing yourself as a software developer. I had never really thought about marketing myself to be honest, it’s something I think most scientists cringe about. “Leave it to the business majors.”, is what I’m sure most of my university colleagues would say. It’s an interesting discussion if you have an hour to burn:

 

The Ruby Rogue crew bring up a very interesting point throughout the discussion on if we choose “not to play the game”. The act of not participating is an action in itself. Though it was never my goal when I started posting, it’s been one portion of motivation to contribute more often rather than let it be something that collects dust.

In any case, for anyone who managed to find their way on here and found something useful. Woohoo! To cap it off, here is a graph of my server traffic over the last year.

 

bandwidth

 

So really only HTTP traffic. My webserver generated this image so I couldn’t remove the useless data. Anyway, I’m optimistic for a continued upward trend!