Experiments using BG/Q’s Hardware Transactional Memory

Barna L. Bihari
Lawrence Livermore National Laboratory

The current and expected explosive growth in the availability of shared-memory concurrency greatly increases the burden on the programmer to effectively utilize the large number of cores and the even larger number of threads.  One of the main challenges of writing thread-safe code, however, has been the efficient resolution of memory conflicts. Transactional memory (TM) offers all of the attributes of thread safety, but in a cost-effective and scalable manner.  It is an optimistic synchronization mechanism in that a memory update is allowed to go through without locks and is checked afterwards to determine whether a conflict has occurred.  If a conflict is detected, a “rollback” ensues and the memory commit is typically retried a preset number of times before it is serialized.  When TM is implemented in hardware using a multi-versioned L2 cache, as in the IBM Blue Gene/Q (BG/Q) systems, this method can become very efficient, especially for applications with low memory contention.
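The optimistic update, rollback, bounded retry, and serialized fallback described above can be sketched in software. This is only an illustrative model, not BG/Q's hardware mechanism or IBM's API: the names VersionedCell, transactional_add, and run_demo, and the retry limit of 3, are assumptions made for the example. A version counter stands in for the hardware's conflict detection, and a lock models only the atomic snapshot and commit steps plus the serialized fallback.

```python
import threading


class VersionedCell:
    """One shared value plus a version counter (a stand-in for a cache line)."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0
        # Models only the atomic steps (snapshot, validate-and-commit) and the
        # serialized fallback; the speculative work itself runs unlocked.
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return self.version, self.value

    def try_commit(self, seen_version, new_value):
        """Commit only if nobody else committed since the snapshot."""
        with self.lock:
            if self.version != seen_version:
                return False  # conflict detected: caller must roll back
            self.value = new_value
            self.version += 1
            return True


def transactional_add(cell, delta, max_retries=3):
    """Add delta to cell optimistically; return the number of rollbacks."""
    rollbacks = 0
    for _ in range(max_retries):
        seen_version, seen_value = cell.snapshot()
        new_value = seen_value + delta  # speculative work, no lock held
        if cell.try_commit(seen_version, new_value):
            return rollbacks
        rollbacks += 1  # discard new_value and retry
    with cell.lock:  # retries exhausted: run serialized
        cell.value += delta
        cell.version += 1
    return rollbacks


def run_demo(n_threads=4, n_updates=500):
    """Update one cell from several threads; return (final value, rollbacks)."""
    cell = VersionedCell()
    counts = [0] * n_threads  # each thread writes only its own slot

    def worker(i):
        for _ in range(n_updates):
            counts[i] += transactional_add(cell, 1)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.value, sum(counts)
```

Calling run_demo(4, 500) always yields a final value of 2000, while the rollback count varies from run to run with thread scheduling. The count is expected to stay small when conflicts are rare, which is the regime the abstract targets.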

In this work, we present very recent results using IBM’s hardware transactional memory (HTM) system.  To demonstrate the utility of TM in the scientific computing world, we have built a benchmark code intended to mimic the algorithms and code behavior of some very large scientific simulations.  Our code is named BUSTM (Benchmark for UnStructured-mesh Transactional Memory) because it uses real 3-D unstructured grids and contains two frequently used algorithms that have very low, but nonzero, memory conflict probabilities.  One is the conservative finite volume method often used in CFD (Computational Fluid Dynamics) simulations, and the other is the Monte Carlo method for particle transport.
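To see why a conservative finite volume update has a low but nonzero conflict probability, consider a generic face loop in which each face subtracts its flux from one neighboring cell and adds it to the other. The sketch below is not BUSTM code: the tiny connectivity list, the flux values, and the function names are hypothetical, chosen only to show the scatter pattern. Two faces can collide only if they touch a common cell, so most face pairs are independent.

```python
from itertools import combinations


def scatter_fluxes(faces, fluxes, n_cells):
    """Accumulate face fluxes into per-cell residuals (conservative scatter)."""
    residual = [0.0] * n_cells
    for (left, right), flux in zip(faces, fluxes):
        residual[left] -= flux   # flux leaves the left cell
        residual[right] += flux  # and enters the right cell
    return residual


def conflicting_pairs(faces):
    """Count face pairs that share a cell and so could conflict if run concurrently."""
    return sum(1 for a, b in combinations(faces, 2) if set(a) & set(b))


# Hypothetical connectivity: each face joins a (left, right) pair of cells.
faces = [(0, 1), (1, 2), (2, 3), (4, 5)]
fluxes = [1.0, 2.0, 0.5, 3.0]

residual = scatter_fluxes(faces, fluxes, n_cells=6)
print(residual)                  # [-1.0, -1.0, 1.5, 0.5, -3.0, 3.0]
print(sum(residual))             # 0.0, since every flux is added once and subtracted once
print(conflicting_pairs(faces))  # 2 of the 6 face pairs share a cell
```

In a threaded version of this loop, the two residual updates are the shared writes that need protection. On a large mesh the fraction of face pairs sharing a cell becomes small, which is the situation in which optimistic synchronization is expected to pay off.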

In our experiments, we analyze the memory access and conflict patterns of both algorithms included in BUSTM, and we compare the number of rollbacks to the number of wrong answers obtained without synchronization.  Our results confirm the low-conflict property of the numerical methods presented, thereby indicating a good fit between these application scenarios and the TM features. Finally, our experiments also demonstrate clear performance advantages over more traditional thread synchronization mechanisms, such as OpenMP “critical,” and, in certain cases, even “atomic.”
