
Low-Latency Concurrency Notes
=============================
Wait-Free Algorithms
====================
Wait-freedom means that each thread makes forward progress regardless of external factors such as contention from, or blocking by, other threads.
Each operation is executed in a bounded number of steps. It is the strongest progress guarantee for synchronization algorithms.
Wait-free algorithms usually use primitives such as
atomic_exchange and
atomic_fetch_add (InterlockedExchange, InterlockedIncrement, InterlockedExchangeAdd, __sync_fetch_and_add).
They do not contain loops that can be affected by other threads.
The atomic_compare_exchange primitive (InterlockedCompareExchange, __sync_val_compare_and_swap) is usually not used, because it is usually tied to a "repeat until success" loop - i.e. CAS-loop-based algorithms are NOT wait-free, though they ARE lock-free.
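A minimal sketch of a wait-free counter, using C++ std::atomic (function and variable names here are illustrative, not from the original notes):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Wait-free counter: every increment completes in a bounded number of
// steps (a single LOCK XADD on x86), regardless of what other threads do.
std::atomic<long> g_counter{0};

long wait_free_increment() {
    // fetch_add never loops: the hardware arbitrates contention,
    // so each caller finishes in O(1) steps.
    return g_counter.fetch_add(1, std::memory_order_relaxed);
}

long run_counter_demo(int threads, int iters) {
    g_counter.store(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([iters] {
            for (int i = 0; i < iters; ++i) wait_free_increment();
        });
    for (auto& th : pool) th.join();
    return g_counter.load();
}
```

Note there is no retry loop anywhere: that absence is exactly what makes the algorithm wait-free rather than merely lock-free.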

Lock-Free Algorithms
====================
Lock-freedom guarantees system-wide progress, but forward progress for each individual thread is not guaranteed (that is, individual threads can starve).
It is a weaker guarantee than wait-freedom. Lock-free algorithms usually use the atomic_compare_exchange primitive.
CAS loops are typically lock-free but not wait-free.
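For contrast, here is a sketch of the same increment done with a CAS loop (identifiers are illustrative). It is lock-free but not wait-free: an unlucky thread can lose the race indefinitely, though every failed CAS implies some other thread's CAS succeeded.

```cpp
#include <atomic>

// Lock-free (but not wait-free) increment via a CAS loop.
// The "repeat until success" cycle means an individual thread may take
// an unbounded number of steps; system-wide progress is still guaranteed.
int cas_increment(std::atomic<int>& x) {
    int old = x.load(std::memory_order_relaxed);
    while (!x.compare_exchange_weak(old, old + 1,
                                    std::memory_order_relaxed)) {
        // on failure, 'old' is reloaded with the current value; retry
    }
    return old;  // value observed just before our successful increment
}
```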

Measuring Performance of Synchronization Algorithms
===================================================
What is the most important factor in a synchronization algorithm's performance and scalability?
Counting atomic RMW (read-modify-write) instructions (like compare-and-swap or fetch-and-add) per operation is the wrong metric.
The most important factor is the amount of write sharing per operation.
The numerical measure of write sharing is the number of cache-line transfers per operation, and the ideal value is 0.
If there are 0 cache-line transfers per operation amortized (we are perfectly OK with amortization here), then the algorithm is perfectly scalable. Anything other than 0, even 1, is a serious problem for scalability.
- With write sharing, the system degrades ungracefully: the more threads we add, the slower it becomes.
- If there is no write sharing, the system scales linearly.
- Yes, atomic RMW operations are slower than plain stores and loads, but in themselves they do scale linearly.
- The cost of atomic RMW operations becomes smaller with each new processor generation.
- Loads are always scalable: several threads are able to read a memory location simultaneously. Read-only accesses are your best friends in a concurrent environment.

Primitive Choices for Advanced Synchronization Algorithms
=========================================================
1. CAS (compare-and-swap)
- aka LOCK CMPXCHG - done atomically at the hardware level. An example of an atomic read-modify-write (RMW).
2. Fetch-and-add
- aka LOCK XADD - also variations like fetch-and-sub, fetch-and-and, fetch-and-or, fetch-and-xor.
- Also an atomic RMW.
3. Exchange
- aka XCHG - also an atomic RMW.
4. Atomic loads and stores
- They are not RMW; they are just independent atomic loads and stores.
- Atomic loads and stores are cheaper/faster than atomic RMW operations.
5. Mutexes and company
- Why not? The most foolish thing one can do is try to implement everything in a non-blocking style.
- It is generally perfectly OK to use mutexes/condition variables/semaphores/etc. on cold paths.
- During process or thread startup/shutdown, mutexes and condition variables are the way to go.
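The first four primitives map directly onto C++ std::atomic; a small sketch (the function name is illustrative, and the x86 instructions named in the comments are the typical codegen, not a guarantee):

```cpp
#include <atomic>

// The four hardware primitives expressed via std::atomic. On x86 these
// typically compile to LOCK CMPXCHG, LOCK XADD, XCHG, and plain MOV.
void primitives_demo(std::atomic<int>& a) {
    int expected = 0;
    a.compare_exchange_strong(expected, 1);  // 1. CAS
    a.fetch_add(10);                         // 2. fetch-and-add
    a.exchange(5);                           // 3. exchange
    a.store(7);                              // 4. atomic store (not RMW)
    (void)a.load();                          //    atomic load  (not RMW)
}
```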

In the GNU world these primitives are available in the form of gcc __sync intrinsics.

Memory Model
============
A memory model in general is a very broad term, and it determines many things. For example, even such an evident thing as pointer size (8, 16, 32, 64 bits) is determined by the memory model. There are also other aspects like segmentation, paging and cache associativity.
In the context of multi-threading/concurrency there are:
3 fundamental properties: atomicity, ordering and visibility;
2 levels: compiler and hardware.

Atomicity
---------
An operation either does not happen at all or completes fully. No intermediate states and no partial effects can be observed by other threads.
2 classes of operations in the context of atomicity:
(1) read-modify-write (RMW) operations and
(2) loads and stores
RMW -
The memory model, together with the instruction set, determines which RMW operations are available and which of them are atomic. Modern processors usually provide at least a compare-and-swap (CAS) operation.
Another crucial aspect is whether atomic RMW operations can operate only on word/pointer-sized memory locations or also on double-word-sized memory locations (for example, 128-bit memory locations on a 64-bit architecture). On modern IA-32 and Intel 64 architectures double-word atomic operations are available; however, they were not available on some early 64-bit processors. (To prevent confusion, I call double-word CAS DWCAS.)
Loads & stores -
The memory model, together with the instruction set, specifies whether plain loads and stores are atomic or not. The typical guarantee on all modern commodity hardware is that aligned word-sized loads and stores are atomic - i.e. a plain aligned MOV (and MOVQ on x86-64) is atomic. Note that wider SSE accesses such as MOVDQA were historically not guaranteed to be atomic.
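Whether a given width is genuinely atomic (lock-free) on the target can be queried at compile time; a sketch (the struct name is illustrative):

```cpp
#include <atomic>
#include <cstdint>

// A 16-byte payload: atomic RMW on it needs DWCAS (CMPXCHG16B on
// Intel 64). If the compiler cannot assume that instruction, the
// library silently falls back to a lock.
struct TwoWords { std::uint64_t lo, hi; };

// C++17: compile-time answer, independent of any particular object.
constexpr bool word_lock_free   = std::atomic<std::uint64_t>::is_always_lock_free;
constexpr bool dwords_lock_free = std::atomic<TwoWords>::is_always_lock_free;
```

On any mainstream 64-bit target, word_lock_free is true; dwords_lock_free depends on whether DWCAS is assumed available (e.g. -mcx16 on gcc).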

Ordering
--------
In reality memory accesses can be performed out of program order; however, hardware masks that from a single-threaded program.
To ensure required ordering, hardware usually provides special instructions called memory fences/barriers, which prevent some types of reordering around them.
There are 2 types of memory fences:
1. bi-directional (store-store, load-load);
2. tied to memory accesses (acquire and release).
Bi-directional fences prevent one type of memory access (load or store) from "sinking below" them, and another type of memory access (which can be the same type) from "hoisting above" them.
Fences tied to memory accesses prevent all memory accesses from moving above (acquire fence) or below (release fence) the associated access; for example, a load-acquire is a load which simultaneously prevents any other memory access from hoisting above it (while any memory access can sink below it).
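The classic use of acquire/release ordering is message passing; a minimal sketch (function and variable names are illustrative):

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // publication flag

void producer() {
    payload = 42;                                  // write the message
    ready.store(true, std::memory_order_release);  // publish: the payload
                                                   // write cannot sink below
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return payload;  // acquire guarantees the payload write is visible
}

int run_message_passing() {
    payload = 0;
    ready.store(false);
    std::thread t(producer);
    int got = consumer();
    t.join();
    return got;
}
```

The release store and acquire load pair up to create a happens-before edge, so the plain write to payload is safely observed without making payload itself atomic.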

Scalability Prerequisites
-------------------------
1. No mutexes on fast paths, ever (on slow paths they are OK, and even recommended because of their simplicity of use).
2. Logically read-only operations must be implemented as physically read-only operations. During a logically read-only operation one should not do a single write to a shared memory location. Reader-writer mutexes do write to internal state in read_lock(). Due to the specifics of cache-coherence implementations in modern hardware (see the MOESI protocol), reads of shared state have 100% scalability (i.e. any number of threads can read from a memory location simultaneously), while writes to shared state have zero scalability (i.e. at most 1 thread can write to a memory location at any given moment in time).
3. No writes to centralized shared state on fast paths. Writes to shared state are generally unavoidable for most concurrent data structures. However, we can distinguish 4 kinds of shared state for our needs:
a. Mostly private state:
A statistics counter held in thread-local storage is a good example. Such a counter is frequently written by its owner thread, and only very infrequently read by some other thread.
b. Mostly read-only state:
That is state with a very high read-to-write ratio (some real-world data structures actually have read-to-write ratios of 10^7 and higher). Such state is also no danger to scalability.
c. Decentralized shared state:
That is shared state which is frequently written to, but is physically distributed. A good example is a hash map with an array of independent buckets.
d. Centralized shared state:
That is shared state which is frequently written to and is physically centralized. A typical example is a counter of elements in a container, which is mutated on every insert and remove operation. That is scalability killer number one; there is no way to make it scalable. A typical mistake is to maintain such state with atomic RMW (read-modify-write) operations (InterlockedXXX(), __sync_XXX(), atomic_XXX()) and to think that since there are no mutexes, it should be scalable. It does not work that way; just say no to centralized shared state.
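A sketch of turning a centralized counter (kind d) into a decentralized one (kind c): each thread writes only its own padded slot, and the exact total is computed only on the rare read path. Names and the slot count are illustrative.

```cpp
#include <array>
#include <atomic>

constexpr int kSlots = 8;  // assumption: at least one slot per thread

// alignas(64) keeps each slot on its own cache line: hot-path writes
// from different threads never share a line.
struct alignas(64) Slot { std::atomic<long> n{0}; };
std::array<Slot, kSlots> g_slots;

void add(int slot, long delta) {   // fast path: thread writes its own slot
    g_slots[slot].n.fetch_add(delta, std::memory_order_relaxed);
}

long total() {                     // slow path: sum all slots
    long sum = 0;
    for (auto& s : g_slots)
        sum += s.n.load(std::memory_order_relaxed);
    return sum;
}
```

The trade-off: total() is more expensive and only a snapshot, but the per-operation fast path causes zero cache-line transfers between threads.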
4. Be aware of false sharing.
For performance reasons, cache-coherence protocols work with whole cache lines rather than with separate bytes, words or C-language variables. If two variables are contained within a single cache line, to the hardware they look like a single variable, with all the implications for scalability. So everything said above must actually be extended from distinct memory locations to whole cache lines. Cache-line size is architecture dependent; there are/were architectures with cache-line sizes from 16 bytes to 4 kilobytes. However, for modern Intel x86 processors (IA-32, Intel 64) the cache-line size is fixed at 64 bytes, i.e. 16 consecutive words on IA-32 and 8 consecutive words on Intel 64. The standard mitigation is the cache-line padding approach.
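A sketch of the cache-line padding approach (struct names are illustrative; 64-byte lines assumed, per the Intel x86 figure above):

```cpp
#include <cstddef>
#include <cstdint>

// Two counters owned by two different threads.
// Bad: both land on one 64-byte line and ping-pong between cores.
struct Bad    { std::int64_t a; std::int64_t b; };

// Padded: alignas(64) forces each counter onto its own cache line.
struct Padded { alignas(64) std::int64_t a; alignas(64) std::int64_t b; };

static_assert(sizeof(Bad) <= 64, "Bad fits in one cache line");
static_assert(offsetof(Padded, b) - offsetof(Padded, a) >= 64,
              "Padded counters live on separate cache lines");
```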
5. Atomic RMW operations have some fixed associated costs.
For modern Intel x86 processors, the cost of a single atomic RMW operation (a LOCK-prefixed instruction) is around 40 cycles (it depends on the particular model, and it steadily decreases). The cost comes mostly from the embedded full memory fence, which is frequently unneeded (so if Intel ever separates atomicity from memory ordering, you may cross out this rule). However, the cost is fixed and does not affect scalability, so it is far less important than the scalability-affecting points outlined above.

If we summarize, we get the following scalability mantra:
The most important aggregate metric for a concurrent data structure is the mean number of cache-line transfers between threads per operation.
All possible means must be employed to reduce this value as much as possible.

It is worth noting that some data structures inherently cannot be implemented with linear scalability - for example, a producer-consumer queue with a strict FIFO ordering requirement (it is the 'strict FIFO ordering' part that is problematic, because it inherently requires communication between threads on every enqueue operation).

Put things that must be close to each other... close to each other. Assume the following situation: in order to complete some operation, a thread has to update variable X and variable Y. If the variables are situated far from each other (on different cache lines), then the thread has to load (from main memory, or from another processor's cache) 2 cache lines instead of the 1 it would need if the variables were situated close to each other.
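The flip side of padding, sketched below: fields that are always updated together are deliberately packed into one 64-byte line (struct name illustrative, 64-byte lines assumed):

```cpp
#include <cstdint>

// X and Y are always updated together in one operation; keeping them
// on a single 64-byte cache line means completing the operation
// touches one line, not two.
struct alignas(64) OpState {
    std::int64_t x;
    std::int64_t y;
};
static_assert(sizeof(OpState) == 64, "X and Y share one cache line");
```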

The whole idea of advanced reader-writer synchronization primitives is to make logically read-only accesses physically read-only wrt shared data and/or eliminate mutual exclusion between readers and writers. There are several techniques for that.
1. Multi-Version Concurrency Control (MVCC) -
MVCC allows several versions of an object to exist at the same time. That is, there is the "current" version and one or more previous versions. Readers acquire the current version and work with it as long as they want. There is a strong similarity with persistent data structures. However, in the context of persistent data structures we care about all versions of an object equally, while in the context of MVCC we have a distinguished "current" version, plus a few previous versions that we care about only as long as readers are reading them.
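A deliberately minimal MVCC sketch: readers grab the current version with one atomic load and keep using it even after a writer publishes a newer one. Reclamation of old versions is elided here (they simply leak); a real implementation needs epochs, hazard pointers, or reference counting. All names are illustrative.

```cpp
#include <atomic>
#include <string>

struct Config { int threshold; std::string name; };

std::atomic<const Config*> g_current{nullptr};

// Reader: physically read-only and wait-free - one atomic load.
const Config* read_current() {
    return g_current.load(std::memory_order_acquire);
}

// Writer: installs a fully constructed new version with one store.
// The old version is intentionally not freed while readers may hold it.
void publish(const Config* next) {
    g_current.store(next, std::memory_order_release);
}
```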
2. Optimistic Concurrency Control -
The idea is that a reader starts reading an object without any synchronization (optimistically hoping for success), and when it finishes, it verifies that the object was not changed under its feet (verification can be conducted periodically during reading if required). If the object was not changed, then the reader has obtained a consistent view of the object; otherwise it needs to retry the read. The technique gives very good scalability results in many cases. However, there is a caveat: a reader must be prepared to read inconsistent data.
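A well-known instance of this read-then-verify pattern is the sequence lock (seqlock); a simplified single-writer sketch for exposition (names illustrative; a production seqlock also needs atomic or fenced accesses to the protected data itself to avoid formal data races):

```cpp
#include <atomic>

std::atomic<unsigned> g_seq{0};
int g_x = 0, g_y = 0;  // protected pair; invariant: g_y == g_x + 1

void write_pair(int x) {
    g_seq.fetch_add(1, std::memory_order_acq_rel);  // odd: write in progress
    g_x = x;
    g_y = x + 1;
    g_seq.fetch_add(1, std::memory_order_release);  // even again
}

int read_sum() {
    for (;;) {
        unsigned s0 = g_seq.load(std::memory_order_acquire);
        if (s0 & 1) continue;        // writer active, retry
        int x = g_x, y = g_y;        // optimistic, unsynchronized reads
        unsigned s1 = g_seq.load(std::memory_order_acquire);
        if (s0 == s1) return x + y;  // counter unchanged: snapshot consistent
    }
}
```

Note the caveat from the text in action: inside a failed attempt the reader may have seen a torn x/y pair, so it must do nothing irreversible with the data before verification succeeds.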
3. State Distribution -
Well-distributed state (a partition per thread/processor) can be considered non-shared state for our needs. The underlying idea is that each reader thread/processor obtains a separate piece of the state and works mostly with it, while writers (hopefully episodically and infrequently) synchronize with each reader's piece.
