Introduction
A “Lock Free Data Structure” guarantees that at least one thread makes progress while multiple threads are using it as a shared resource. ‘Lock Free Data Structures’ and ‘Non-Blocking Algorithms’ are leading areas of research these days, and in this article we will get to know one such structure, a lock-free circular queue (lock-free stacks and linked lists are other common examples), by using it in the “Single Producer and Single Consumer” problem. Before going further, let us clarify what ‘LOCK FREE’ means and what “Single Producer and Single Consumer” is.
We call an algorithm ‘LOCK FREE’ if it guarantees overall system-wide progress in a multithreaded environment; similarly, we call a shared data structure ‘LOCK FREE’ if at least one thread accessing it is always able to make progress without using any locking primitive (mutex, semaphore or critical section). In other words, even if one thread is suspended in the middle of an operation, the remaining threads can still complete theirs. These methods are obviously meant for systems with multithreaded design requirements and come in two strengths: lock-free, if overall system-wide progress is guaranteed, and wait-free, if progress of every individual thread is also guaranteed. This article throws some light on lock-free data structures and introduces a method to implement one to solve the ‘Producer-Consumer problem’.
The ‘Producer-Consumer Problem’ (bounded-buffer problem) describes two processes or threads: one is the producer and the other is the consumer. The producer feeds data into a shared data structure, and the consumer reads from that structure, deleting each item as it is read. Think of a snack-selling window: the vendor (producer) keeps placing items on a shelf of fixed size while a customer (consumer) keeps taking them away. The problem is to make certain that the consumer does not remove an item from the shared queue while it is empty, and the producer does not try to feed more data when the queue is full.
Motivation
Traditional methods for solving the above problem involve mutexes, semaphores and critical sections, preventing simultaneous access to the shared queue by the producer and consumer. But a problem arises when acquiring the lock fails: the thread blocks until the required lock is freed, adding the extra overhead of waking the thread up later. Channels and FIFO queues are popular for these kinds of problems because they do not need end-to-end atomic synchronization; the locking primitives above can thus be quite expensive, and their cost depends on how they are implemented on the target machine.
To avoid the use of semaphores and mutexes and to improve efficiency, we can use ‘LOCK FREE’ data structures and algorithms. Thread libraries typically require semaphores or condition variables to control the sleep/wakeup of threads (as in the previous approach). In a multi-processor environment, thread sleep/wakeup occurs much less frequently than the passing of data tokens, so avoiding atomic locking operations on every data pass is beneficial, which promotes the use of “LOCK FREE DATA STRUCTURES”.
Implementation
We will implement the ‘LOCK FREE DATA STRUCTURE’ by solving a problem statement that relates to the Producer-Consumer problem and will build our understanding in the process.
Problem Statement
Design a system in which a real-time process/thread can delegate some expensive and non-critical work (for example, writing a RAM buffer out to a file on disk) to another process/thread, ensuring that the work is performed in the same order as it was delegated.
Solution
The stated problem resembles a Producer-Consumer problem, hence we will use a FIFO circular queue. The thread delegating the job is the PRODUCER and the thread performing the job is the CONSUMER. The producer can push a job onto the queue as long as the queue is not full, and the consumer can pop a job as long as it is not empty. Push and pop operations on the queue are free of locks, so there is no waiting time; both threads are free to do other work if required.
Coding Language
C++11. Use -std=c++11 in the CXXFLAGS during compilation.
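For example, assuming the test sources are in a file named main.cpp (a hypothetical name), a typical build line on Linux would be:
g++ -std=c++11 -pthread main.cpp -o spsc_test
-pthread is needed because the test application uses std::thread.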
Data Structure used
Circular queue (implemented here over an array of fixed length; a linked list could also be used)
Detailed Design
We need atomic operations for incrementing the indexes of the queue; for this we will be using the C++11 standard ‘std::atomic’ types.
A class CircularQueue will provide all operations, such as push(), pop(), is_empty(), is_full() and is_lock_free().
The implementation will be of the pop_front/push_back type: the producer thread will push at the tail and the consumer thread will pop from the head.
The two threads increment different indexes (the producer increments the tail and the consumer increments the head), so consistency is maintained.
Memory barriers will be used to ensure that each thread sees a valid snapshot of the values that need to be compared.
Before proceeding further, let us explore the significance of memory barriers in lock-free programming.
Memory Barrier
Many modern CPUs and compilers apply performance optimizations that may lead to out-of-order execution of instructions. In single-threaded execution this reordering does not affect the outcome, but in multithreaded environments, where data structures are shared among threads, reordering of memory operations can cause blunders. Memory barriers are used to prevent such reordering. A memory barrier is an instruction that forces the compiler or CPU to keep memory operations in the order specified by the programmer in the code: in simple terms, some operations are guaranteed to be performed before the barrier and the others after it.
As an example, consider two threads: one prints the value stored in a variable once a flag is set, and the other changes the value of both the variable and the flag.
Thread 1:
while (false == flag)  // constant-first ("Yoda") style: guards against accidentally typing 'flag = false'
    ;                  // busy wait
std::cout << variable;
Thread 2:
variable = 100;
flag = true;
Suppose the memory operations of thread 2 execute out of order, so that the flag becomes true before the value is put into the variable, and at that moment thread 1 is scheduled and begins executing. Normally the output of thread 1 should be 100, but in this case it would be garbage, since thread 2 has not yet assigned the value to the variable. The same problem arises if the instructions of thread 1 are reordered. To prevent this, memory barriers are required between the while loop and the print instruction in thread 1, and between the assignments to the variable and the flag in thread 2.
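A minimal sketch of how C++11 atomics express exactly these two barriers (this anticipates the acquire-release model used later in the queue; the variable names come from the example above):
#include <atomic>
#include <iostream>
#include <thread>

int variable = 0;                 // ordinary shared data
std::atomic<bool> flag(false);    // the synchronization point

void writer() {                   // thread 2
    variable = 100;                               // (1) write the data
    flag.store(true, std::memory_order_release);  // (2) publish: the write above cannot move below this store
}

void reader() {                   // thread 1
    while (false == flag.load(std::memory_order_acquire))
        ;                                         // spin until the flag is published
    std::cout << variable << std::endl;           // guaranteed to print 100
}

int main() {
    std::thread t2(writer), t1(reader);
    t1.join(); t2.join();
    return 0;
}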
C++ defines memory ordering models that can be applied to operations on atomic variables; the ones we will use are:
· Sequential model
memory_order_seq_cst
· acquire-release model
memory_order_acquire
memory_order_release
· relaxed model
memory_order_relaxed
I will explain these models during the code walkthrough.
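As a quick reference (a sketch, not taken from the queue code itself): these constants are simply passed to the load() and store() members of std::atomic, and the sequential model is what you get when no order is specified.
std::atomic<size_t> index(0);
index.load();                              // defaults to std::memory_order_seq_cst
index.load(std::memory_order_acquire);     // acquire: used on the reader side
index.store(1, std::memory_order_release); // release: used on the writer side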
Using the Code
The code for the lock-free data structure is encapsulated in a class named “CircularQueue”.
#include <atomic>
#include <cstddef>

template<typename element_type, size_t max_size>
class CircularQueue {
    /**
     * @publicsection
     */
public:
    /**
     * @enum CAPACITY
     * @brief defines the capacity of the queue; one extra slot is kept so
     * that a full queue can be distinguished from an empty one
     */
    enum CAPACITY { queue_capacity = max_size + 1 };

    /**
     * @brief Default Constructor
     */
    CircularQueue() : queue_tail(0), queue_head(0) { }

    /**
     * @brief Default Destructor
     */
    virtual ~CircularQueue() { }

    /**
     * @brief push data into the queue (called by the Producer only)
     * @param item
     * @return true on success, false if the queue is full
     */
    bool push(const element_type& item);

    /**
     * @brief pop data from the queue (called by the Consumer only)
     * @param item
     * @return true on success, false if the queue is empty
     */
    bool pop(element_type& item);

    /**
     * @brief check if queue is empty
     * @return
     */
    bool is_empty() const;

    /**
     * @brief check if queue is full
     * @return
     */
    bool is_full() const;

    /**
     * @brief check if queue is lock free
     * @return
     */
    bool is_lock_free() const;

    /**
     * @privatesection
     */
private:
    /**
     * @brief increment the passed index, wrapping around queue_capacity
     * @param index
     * @return
     */
    size_t increment(size_t index) const;

    std::atomic<size_t> queue_tail;
    element_type queue[queue_capacity];
    std::atomic<size_t> queue_head;
};
All you need to do is declare an object of this class with an appropriate maximum queue size and use the push and pop operations.
CircularQueue<std::string, 50> *MessageQ = new CircularQueue<std::string, 50>();
is_empty() and is_full() can be used to check the queue’s state, and is_lock_free() can be used to check whether the implementation is truly lock free on the target platform.
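The body of is_lock_free() is not part of the walkthrough below; a natural one-line implementation (a sketch, assuming only the members shown above) simply asks the two atomic indexes, since std::atomic itself provides an is_lock_free() member:
template<typename element_type, size_t max_size>
bool CircularQueue<element_type, max_size>::is_lock_free() const {
    return (queue_tail.is_lock_free() && queue_head.is_lock_free());
}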
This circular queue can be used in the implementation of an interrupt handler or for dumping debug information, as these kinds of applications need some other thread to process the interrupt or to dump the debug information to a file or network stream.
The test application provided shows a simple implementation of a log manager: the main thread functions as producer and a secondary thread functions as consumer, dumping the debug information passed by the producer to the standard output.
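A minimal sketch of such a log manager (not the bundled test application itself; it assumes only the CircularQueue class above):
#include <atomic>
#include <iostream>
#include <string>
#include <thread>

int main() {
    CircularQueue<std::string, 50> message_q;
    std::atomic<bool> done(false);

    // Consumer: pop messages and dump them to standard output.
    std::thread consumer([&] {
        std::string msg;
        while (!done.load() || !message_q.is_empty()) {
            if (message_q.pop(msg))
                std::cout << msg << std::endl;
        }
    });

    // Producer: delegate the logging work to the consumer thread.
    for (int i = 0; i < 100; ++i) {
        while (!message_q.push("log entry " + std::to_string(i)))
            ; // queue full: retry (a real producer could do other work here)
    }
    done.store(true);
    consumer.join();
    return 0;
}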
Code Walkthrough
The whole design is based on one basic concept: the indexes are updated only after the data has been written or read. This ensures that every snapshot of these indexes seen by either thread represents a previously completed operation. The head and tail indexes must therefore be atomic, and operations on them must be memory ordered.
Starting from an empty queue, I will proceed step by step and explain how it works:
Queue Empty
  queue_head, queue_tail
          |
          v
[   |   |   |   |   ]
At first the queue is empty, and queue_head and queue_tail both point to the same location. The queue status can be checked through the function is_empty().
/**
* @attention NOT ATOMIC
* @return true if queue is empty
*/
template<typename element_type, size_t max_size>
bool CircularQueue<element_type, max_size>::is_empty() const {
    return (queue_head.load() == queue_tail.load());
}
Queue Full
A full queue is the limiting condition for the producer, as it cannot push more data into the queue. In this condition, queue_head is exactly one position ahead of queue_tail (modulo the capacity), i.e., the slot at queue_tail is the single slot deliberately left empty. The full condition can be checked with the function is_full().
      queue_tail
          |
          v
[ x | x |   | x | x ]
              ^
              |
          queue_head
/**
* @attention NOT ATOMIC
* @return true if queue is full
*/
template<typename element_type, size_t max_size>
bool CircularQueue<element_type, max_size>::is_full() const {
    const auto next_tail = increment(queue_tail.load());
    return (next_tail == queue_head.load());
}
The increment() function in the code above increments the index passed to it in such a manner that it never exceeds the maximum size of the queue: for example, with max_size = 4 we have queue_capacity = 5, and increment(4) wraps back to 0.
/**
* @brief increment the passed index (to emulate the notion of circular queue '%' is used)
* @param index
* @return
*/
template<typename element_type, size_t max_size>
size_t CircularQueue<element_type, max_size>::increment(size_t index) const {
    return (index + 1) % queue_capacity;
}
PUSH Operation
It is important that we increment the tail index only after the element has been inserted into the queue. This ensures that any snapshot of the tail is consistent with the presence of the element in the queue.
/**
* @brief push the data into the queue at queue_tail
* @param item
* @return
*/
template<typename element_type, size_t max_size>
bool CircularQueue<element_type, max_size>::push(const element_type& item) {
    const auto current_tail = queue_tail.load();
    const auto next_tail = increment(current_tail);
    if (next_tail != queue_head.load()) {
        queue[current_tail] = item;   // write the data first...
        queue_tail.store(next_tail);  // ...then publish the new tail
        return true;
    }
    return false; // full queue
}
The load() and store() functions are used to read and write variables of type std::atomic. Let us analyse the function step by step.
1. Find the next index of the tail.
2. Check whether this next index is equal to the head index.
3. If it is, return false: the queue is full.
4. If it is not, store the item and then increment the tail index.
I have been stressing that the index must be incremented only after the data has been read or written. Let me explain what happens if it is not (the race is described after the pop operation below).
POP Operation
As stated for the push operation, the sequence of storing the data and incrementing the index is important; the same holds for the pop operation.
/**
 * @brief pop the data from the queue at queue_head
 * @param item
 * @return
 */
template<typename element_type, size_t max_size>
bool CircularQueue<element_type, max_size>::pop(element_type& item) {
    const auto current_head = queue_head.load();
    if (current_head == queue_tail.load())
        return false; // empty queue
    item = queue[current_head];                 // read the data first...
    queue_head.store(increment(current_head));  // ...then publish the new head
    return true;
}
Suppose the value of queue_head is 1 and queue_tail is 0. We perform a pop, and the increment of the index is executed first, so the new value of queue_head is 2, while the read itself has not completed yet. At this point the context is switched and the producer thread runs: since queue_head is now 2, the queue-full check fails, and data can eventually be written over the element that has not been read yet. To avoid this race condition, the increment of an index must happen only after the read or write operation, and execution must be memory ordered.
Having explained memory ordering and its importance, we will now see its different types and their use in the push and pop operations.
Sequential Memory Ordering
As the name suggests, sequential memory ordering is a strict ordering imposed on the CPU to execute the instructions in the exact order defined by the programmer. This ordering ensures the lock-free behaviour but adds some overhead to maintain the exact order, and is therefore a bit slower than the other techniques. By default (when no memory order is passed), the push and pop operations described above use sequential ordering.
Relaxed, Acquire and Release Memory Ordering
This memory ordering is a bit faster than the sequential one and imposes less overhead on the CPU for maintaining the order of execution. It is based on a simple principle: acquire before reading, release after writing. A read-acquire cannot be reordered with any read or write that follows it, and a write-release cannot be reordered with any read or write that precedes it; the memory barrier thus sits between the read-acquire and write-release operations.
Pop Operation
template<typename element_type, size_t max_size>
bool CircularQueue<element_type, max_size>::pop(element_type& item) {
    // only the consumer updates the head: load it relaxed, store it with release;
    // the tail (written by the producer) must be read with at least acquire
    const auto current_head = queue_head.load(std::memory_order_relaxed);
    if (current_head == queue_tail.load(std::memory_order_acquire))
        return false; // empty queue
    item = queue[current_head];
    queue_head.store(increment(current_head), std::memory_order_release);
    return true;
}
Push Operation
template<typename element_type, size_t max_size>
bool CircularQueue<element_type, max_size>::push(const element_type& item) {
    // only the producer updates the tail: load it relaxed, store it with release;
    // the head (written by the consumer) must be read with at least acquire
    const auto current_tail = queue_tail.load(std::memory_order_relaxed);
    const auto next_tail = increment(current_tail);
    if (next_tail != queue_head.load(std::memory_order_acquire)) {
        queue[current_tail] = item;
        queue_tail.store(next_tail, std::memory_order_release);
        return true;
    }
    return false; // full queue
}
Limitations
This piece of code is written strictly for a single producer and a single consumer; please do not use it with multiple consumers or producers. Also, the relaxed/acquire/release memory ordering requires a lot of caution, as a wrong order can corrupt the data. Please stick to sequential ordering if you are not certain about the relaxed one. The code requires the C++11 standard, and I have only compiled it on Linux with g++ 4.7. Feel free to try it on Windows, but I cannot provide any help with the compilation there.