
It's done this way so that different cores modifying different fields won't have to bounce the cache line containing both of them between their caches. In general, for a processor to access some data in memory, the entire cache line containing it must be in that processor's local cache. If it's modifying that data, that cache entry usually must be the only copy in any cache in the system (Exclusive mode in the MESI/MOESI-style cache coherence protocols). When separate cores try to modify different data that happens to live on the same cache line, and thus waste time moving that whole line back and forth, that's known as false sharing.

In the particular example you give, one core can be enqueueing an entry (reading (shared) buffer_ and writing (exclusive) only enqueue_pos_) while another dequeues (shared buffer_ and exclusive dequeue_pos_) without either core stalling on a cache line owned by the other.

The padding at the beginning means that buffer_ and buffer_mask_ end up on the same cache line, rather than split across two lines and thus requiring double the memory traffic to access.
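
To make the layout concrete, here is a rough sketch of the kind of structure being discussed (the member names and the cacheline_pad_t typedef are assumptions based on the description above, and 64 is an assumed line size, not a guaranteed one):

#include <atomic>
#include <cstddef>

typedef char cacheline_pad_t[64];   // one assumed cache line of padding

struct bounded_queue_layout {
    cacheline_pad_t          pad0_;         // keeps buffer_/buffer_mask_ off whatever precedes the object
    void*                    buffer_;       // read-mostly, shared by producers and consumers
    std::size_t              buffer_mask_;  // read-mostly, same line as buffer_
    cacheline_pad_t          pad1_;
    std::atomic<std::size_t> enqueue_pos_;  // written only on enqueue
    cacheline_pad_t          pad2_;
    std::atomic<std::size_t> dequeue_pos_;  // written only on dequeue
    cacheline_pad_t          pad3_;
};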

cacheline_pad_t

The attribute approach would be more compiler specific, but might cut the size of this structure in half, since the padding would be limited to rounding up each element to a full cache line. That could be quite beneficial if one had a lot of these.

The same concept applies in C as well as C++.

@Novelcrat - Ok that makes a lot of sense. So what about questions 2 & 3?

@MattH: For portability, C++11 introduces std::aligned_storage, which allows you to request storage of a defined size and alignment. The default alignment for a char [N] is otherwise 1.

Why would the linker not optimize the padding variables out if they are not used?

Actually, there is no assumption that "cacheline_pad_t will itself be aligned to a 64-byte boundary"; alignment is actually not required. The padding alone guarantees the only goal, namely that the variables before and after it end up in different cache lines.

And the more modern C++11 standard has the alignas specifier to do this portably. It is supported on just about every actively developed C++ compiler.
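
For example, a minimal sketch of the alignas approach (assuming a 64-byte line; C++17 also offers std::hardware_destructive_interference_size in <new> as a portable guess for that number):

#include <atomic>
#include <cstddef>

struct positions {
    // alignas pushes each member onto its own (assumed) 64-byte line,
    // so the compiler inserts the padding instead of explicit pad arrays.
    alignas(64) std::atomic<std::size_t> enqueue_pos_;
    alignas(64) std::atomic<std::size_t> dequeue_pos_;
};

static_assert(alignof(positions) == 64, "struct starts on a cache line");
// sizeof(positions) is 128 here: each member gets its own 64-byte slot.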

c++ - How and when to align to cache line size? - Stack Overflow

c++ c caching

Have a look at Calibrator. All of the work is copyrighted, but the source code is freely available. From its documentation, its idea for calculating cache line sizes sounds much more educated than what's already been said here.

The idea underlying our calibrator tool is to have a micro benchmark whose performance only depends on the frequency of cache misses that occur. Our calibrator is a simple C program, mainly a small loop that executes a million memory reads. By changing the stride (i.e., the offset between two subsequent memory accesses) and the size of the memory area, we force varying cache miss rates.

In principle, the occurrence of cache misses is determined by the array size. Array sizes that fit into the L1 cache do not generate any cache misses once the data is loaded into the cache. Analogously, arrays that exceed the L1 cache size but still fit into L2 will cause L1 misses but no L2 misses. Finally, arrays larger than L2 cause both L1 and L2 misses.

The frequency of cache misses depends on the access stride and the cache line size. With strides equal to or larger than the cache line size, a cache miss occurs with every iteration. With strides smaller than the cache line size, a cache miss occurs only every n iterations (on average), where n is the ratio cache line size/stride.

This approach only works if memory accesses are executed purely sequentially, i.e., we have to ensure that neither two or more load instructions nor memory access and pure CPU work can overlap. We use a simple pointer chasing mechanism to achieve this: the memory area we access is initialized such that each load returns the address for the subsequent load in the next iteration. Thus, super-scalar CPUs cannot benefit from their ability to hide memory access latency by speculative execution.
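
A minimal sketch of that pointer-chasing idea (my own illustration, not the Calibrator source; the timing harness is left out):

#include <cstddef>
#include <cstdint>
#include <vector>

// Each element stores the address of the element 'stride' bytes ahead
// (wrapping around), so every load depends on the previous one and a
// super-scalar CPU cannot overlap the accesses with each other.
std::uintptr_t chase(std::size_t area_bytes, std::size_t stride_bytes, long iterations) {
    std::size_t n = area_bytes / sizeof(void*);
    std::size_t step = stride_bytes / sizeof(void*);
    if (step == 0) step = 1;
    std::vector<void*> arr(n);
    for (std::size_t i = 0; i < n; ++i)
        arr[i] = &arr[(i + step) % n];

    void** p = reinterpret_cast<void**>(&arr[0]);
    for (long i = 0; i < iterations; ++i)
        p = reinterpret_cast<void**>(*p);       // serialized, dependent loads
    return reinterpret_cast<std::uintptr_t>(p); // keep the result live
}

Timing this loop while varying area_bytes and stride_bytes gives the miss-rate curves the quoted text describes.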

To measure the cache characteristics, we run our experiment several times, varying the stride and the array size. We make sure that the stride varies at least between 4 bytes and twice the maximal expected cache line size, and that the array size varies from half the minimal expected cache size to at least ten times the maximal expected cache size.

I had to comment out #include "math.h" to get it to compile; after that it found my laptop's cache values correctly. I also couldn't view the generated PostScript files.

For my machine (Haswell), Calibrator predicts the line size incorrectly, and @AlexD's approach also doesn't work. The problem is the prefetcher, which manages to guess constant-stride patterns and spoof the experiment. I suppose this could be measured with the prefetcher disabled.

c++ - How to find the size of the L1 cache line size with IO timing me...

c++ c performance caching cpu-architecture

  • Your array is 256 bytes, so it will not fit in one 64 byte cache line.
  • Your CPU has multiple cache lines, so it's very likely that 256 bytes will fit in whatever cache you are worried about.

Would add "4. The only thing you can do that's actually useful is to ensure the start of the array is aligned on a cache line boundary; so that it takes 4 full cache lines, and doesn't take (e.g.) half a cache line then 3 full cache lines then another half a cache line."

Brendan - how can I do that (to ensure)?
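
One way to do it, as a sketch (alignas needs C11/C++11; aligned_alloc needs C11 or C++17; posix_memalign is the usual POSIX fallback and MSVC has _aligned_malloc instead):

#include <cstdint>
#include <cstdlib>

// Static/automatic storage: let the compiler place it on a 64-byte boundary.
alignas(64) static std::uint8_t table[256];

int main() {
    // Heap storage: ask for a 64-byte-aligned block; the size must be a
    // multiple of the alignment for aligned_alloc.
    void* p = std::aligned_alloc(64, 256);
    std::free(p);
    return table[0];   // keep the static array referenced
}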

c - How to store array to fit cache line size - Stack Overflow

c performance cpu cpu-cache cpu-speed

Most implementations of C memory allocation functions will store accounting information for each block, either in-line or separately.

One typical way (in-line) is to actually allocate both a header and the memory you asked for, padded out to some minimum size. So for example, if you asked for 20 bytes, the system may allocate a 48-byte block:

  • 16 bytes header/accounting area (the difference between the 48-byte block and the data area).
  • 32 bytes data area (your 20 bytes padded out to a multiple of 16).

The address then given to you is the address of the data area. Then, when you free the block, free will simply take the address you give it and, assuming you haven't stuffed up that address or the memory around it, check the accounting information immediately before it. Graphically, that would be along the lines of:

____ The allocated block ____
/                             \
+--------+--------------------+
| Header | Your data area ... |
+--------+--------------------+
          ^
          |
          +-- The address you are given

Keep in mind the size of the header and the padding are totally implementation defined (actually, the entire thing is implementation-defined (a) but the in-line accounting option is a common one).

The checksums and special markers that exist in the accounting information are often the cause of errors like "Memory arena corrupted" or "Double free" if you overwrite them or free them twice.

The padding (to make allocation more efficient) is why you can sometimes write a little bit beyond the end of your requested space without causing problems (still, don't do that, it's undefined behaviour and, just because it works sometimes, doesn't mean it's okay to do it).
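
A toy sketch of the in-line header idea (purely illustrative; real allocators keep more accounting data, free lists, checksums, etc.):

#include <cstddef>
#include <cstdlib>

struct header {               // hypothetical accounting info stored just before the data area
    std::size_t size;         // usable size of the data area
};

void* toy_malloc(std::size_t want) {
    std::size_t data = (want + 15) & ~static_cast<std::size_t>(15);   // pad to a multiple of 16
    header* h = static_cast<header*>(std::malloc(sizeof(header) + data));
    if (!h) return nullptr;
    h->size = data;
    return h + 1;             // hand out the address just past the header
}

void toy_free(void* p) {
    if (!p) return;
    header* h = static_cast<header*>(p) - 1;   // step back to the accounting info
    // h->size is how the allocator would know how much to release;
    // here we simply forward the original block to free().
    std::free(h);
}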

(a) I've written implementations of malloc in embedded systems where you got 128 bytes no matter what you asked for (that was the size of the largest structure in the system), assuming you asked for 128 bytes or less (requests for more would be met with a NULL return value). A very simple bit-mask (i.e., not in-line) was used to decide whether a 128-byte chunk was allocated or not.

Others I've developed had different pools for 16-byte chunks, 64-byte chunks, 256-byte chunks and 1K chunks, again using a bit-mask to decide which blocks were used or available.

Both these options managed to reduce the overhead of the accounting information and to increase the speed of malloc and free (no need to coalesce adjacent blocks when freeing), particularly important in the environment we were working in.

@paxdiablo Does that mean malloc doesn't allocate contiguous blocks of memory?

@user10678, the only real requirement of malloc is that it give you, for the successful case, a block of memory at least as large as what you asked for. Individual blocks are contiguous in terms of how you access elements within them, but there's no requirement that the arenas the blocks come from are contiguous.

c - How does free know how much to free? - Stack Overflow

c size pointers free

The best results I've gotten are by adding one more for loop that blocks over your N, and by rearranging the loops. I also hoisted loop-invariant code, but the compiler's optimizer should hopefully do this automatically. The block size should be the cache line size divided by sizeof(float). This got it ~50% faster than the transposed approach.

If you have to pick just one of AVX or blocking, using AVX extensions (vfmadd###ps and haddps) is still substantially faster. Using both is best and straightforward to add given that you're already testing if the block size is a multiple of 64 / sizeof(float) == 16 floats == two 256-bit AVX registers.

void matrix_mult_wiki_block(const float*A , const float* B, float* C,
                            const int N, const int M, const int K) {
    const int block_size = 64 / sizeof(float); // 64 = common cache line size
    for(int i=0; i<N; i++) {
        for(int j=0; j<K; j++) {
            C[K*i + j] = 0;
        }
    }
    for (int i0 = 0; i0 < N; i0 += block_size) {
        int imax = i0 + block_size > N ? N : i0 + block_size;

        for (int j0 = 0; j0 < M; j0 += block_size) {
            int jmax = j0 + block_size > M ? M : j0 + block_size;

            for (int k0 = 0; k0 < K; k0 += block_size) {
                int kmax = k0 + block_size > K ? K : k0 + block_size;

                for (int j1 = j0; j1 < jmax; ++j1) {
                    int sj = M * j1;

                    for (int i1 = i0; i1 < imax; ++i1) {
                        int mi = M * i1;
                        int ki = K * i1;
                        int kij = ki + j1;

                        for (int k1 = k0; k1 < kmax; ++k1) {
                            C[kij] += A[mi + k1] * B[sj + k1];
                        }
                    }
                }
            }
        }
    }
}

As for the Cannon reference, the SUMMA algorithm is a better one to follow.

In case anyone else is optimizing tall-skinny multiplications ({~1e9 x 50} x {50 x 50}, how I ended up here), the transposed approach is nearly identical in performance to the blocked approach up to n=18 (floats). n=18 is a pathological case (way worse than 17 or 19) and I don't quite see the cache access patterns that cause this. All larger n are improved with the blocked approach.

Could you please explain why the for loop "for (int j0 = 0; j0 < M; j0 += block_size) " j0 needs to < N but not K?

c - loop tiling/blocking for large dense matrix multiplication - Stack...

c performance openmp sse matrix-multiplication

Technical answers aside, I think that your professor wants to help you understand how arguments are passed to your C program, and how variables are stored in memory. The size of the memory is really just used to illustrate the point.

The key things to understand are as follows:

  • Command-line arguments are passed to a C program as null-terminated strings
  • argv[0] contains the name and/or path of the program that is being executed
  • argv[] is a NULL-terminated array (argv[argc] is always NULL)
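
For instance, a sketch of adding up the memory occupied by the argv pointer array plus the strings it points to (illustrative only; where and how the strings are actually laid out is implementation-specific):

#include <cstddef>
#include <cstdio>
#include <cstring>

int main(int argc, char* argv[]) {
    // The pointer array itself: argc entries plus the terminating NULL entry.
    std::size_t total = (argc + 1) * sizeof(char*);

    // Each argument string, including its terminating '\0'.
    for (int i = 0; i < argc; ++i)
        total += std::strlen(argv[i]) + 1;

    std::printf("approx. memory used for argv: %zu bytes\n", total);
    return 0;
}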

Exact Memory Size of argv in C - Stack Overflow

c argv

You are definitely getting what I call a cache resonance. This is similar to aliasing, but not exactly the same. Let me explain.

Caches are hardware data structures that extract one part of the address and use it as an index in a table, not unlike an array in software. (In fact, we call them arrays in hardware.) The cache array contains cache lines of data, and tags - sometimes one such entry per index in the array (direct mapped), sometimes several such (N-way set associativity). A second part of the address is extracted and compared to the tag stored in the array. Together, the index and tag uniquely identify a cache line memory address. Finally, the rest of the address bits identifies which bytes in the cache line are addressed, along with the size of the access.

Usually the index and tag are simple bitfields. So a memory address looks like

...Tag... | ...Index... | Offset_within_Cache_Line

(Sometimes the index and tag are hashes, e.g. a few XORs of other bits into the mid-range bits that are the index. Much more rarely, sometimes the index, and more rarely the tag, are things like taking cache line address modulo a prime number. These more complicated index calculations are attempts to combat the problem of resonance, which I explain here. All suffer some form of resonance, but the simplest bitfield extraction schemes suffer resonance on common access patterns, as you have found.)

So, typical values... there are many different models of "Opteron Dual Core", and I do not see anything here that specifies which one you have. Choosing one at random: the most recent manual I see on AMD's website is the BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h Models 00h-0Fh, March 12, 2012.

(Family 15h = the Bulldozer family, the most recent high-end processor - the BKDG mentions dual core, although I don't know a product number that exactly matches what you describe. But, anyway, the same idea of resonance applies to all processors; it is just that parameters like cache size and associativity may vary a bit.)

The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1 data cache with two 128- bit ports. This is a write-through cache that supports up to two 128 Byte loads per cycle. It is divided into 16 banks, each 16 bytes wide. [...] Only one load can be performed from a given bank of the L1 cache in a single cycle.

16KB/4-way => the resonance is 4KB.

I.e. with a 64-byte cache line, address bits 0-5 are the cache line offset, so the 16KB cache holds 16KB / 64B = 256 lines.

4 way associative => 256/4 = 64 indexes in the cache array. I (Intel) call these "sets".

i.e. you can consider the cache to be an array of 64 entries or sets, each entry containing 4 cache lines and their tags. (It's more complicated than this, but that's okay.)

(By the way, the terms "set" and "way" have varying definitions.)

there are 6 index bits, bits 6-11 in the simplest scheme.

This means that any cache lines that have exactly the same values in the index bits, bits 6-11, will map to the same set of the cache.

C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Loop k is the innermost loop. The base type is double, 8 bytes. If dimension=2048, i.e. 2K, then successive elements of B[dimension*k+j] accessed by the loop will be 2048 * 8 = 16K bytes apart. They will all map to the same set of the L1 cache - they will all have the same index in the cache. Which means that, instead of there being 256 cache lines in the cache available for use there will only be 4 - the "4-way associativity" of the cache.

I.e. you will probably get a cache miss every 4 iterations around this loop. Not good.
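
To make that concrete, a small sketch of the simple bitfield scheme above (64-byte lines and 64 sets assumed): addresses 16 KB apart produce the same set index.

#include <cstdint>
#include <cstdio>

// bits 0-5 = offset within the line, bits 6-11 = set index (simplest scheme)
static unsigned set_index(std::uintptr_t addr) {
    return (addr >> 6) & 63;
}

int main() {
    std::uintptr_t base = 0x100000;
    // Successive B elements in the inner loop are 16 KB apart, so they all
    // compete for the same 4 ways of the 4-way associative L1.
    std::printf("%u %u %u\n", set_index(base),
                set_index(base + 16 * 1024),
                set_index(base + 32 * 1024));
    return 0;
}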

(Actually, things are a little more complicated. But the above is a good first understanding. The addresses of the entries of B mentioned above are virtual addresses, so the physical addresses might be slightly different. Moreover, Bulldozer has a way-predictive cache, probably using virtual address bits so that it doesn't have to wait for a virtual-to-physical address translation. But, in any case: your code has a "resonance" of 16K. The L1 data cache has a resonance of 16K. Not good.)

If you change the dimension just a little bit, e.g. to 2048+1, then the addresses of array B will be spread across all of the sets of the cache. And you will get significantly fewer cache misses.

It is a fairly common optimization to pad your arrays, e.g. to change 2048 to 2049, to avoid this sort of resonance. But cache blocking is an even more important optimization: http://suif.stanford.edu/papers/lam-asplos91.pdf
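
A sketch of that padding trick (row-major doubles assumed; the only change from the naive version is indexing with the padded leading dimension):

#include <cstddef>
#include <vector>

void multiply_padded(int dimension) {
    // With dimension = 2048, the column walk of B strides 2048 * 8 = 16 KB,
    // hitting the same L1 set every time. A padded leading dimension (2049)
    // spreads those accesses across different sets.
    const int ld = dimension + 1;
    std::size_t elems = static_cast<std::size_t>(dimension) * ld;
    std::vector<double> A(elems), B(elems), C(elems);

    for (int i = 0; i < dimension; ++i)
        for (int j = 0; j < dimension; ++j)
            for (int k = 0; k < dimension; ++k)
                C[i * ld + j] += A[i * ld + k] * B[k * ld + j];   // note: ld, not dimension
}

Cache blocking, as in the linked paper, remains the bigger win; the padding just removes the pathological resonance.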

In addition to the cache line resonance, there are other things going on here. For example, the L1 cache has 16 banks, each 16 bytes wide. With dimension = 2048, successive B accesses in the inner loop will always go to the same bank. So they can't go in parallel - and if the A access happens to go to the same bank, you will lose.

I don't think, looking at it, that this is as big as the cache resonance.

And, yes, possibly, there may be aliasing going on. E.g. the STLF (Store To Load Forwarding) buffers may be comparing only using a small bitfield, and getting false matches.

(Actually, if you think about it, resonance in the cache is like aliasing, related to the use of bitfields. Resonance is caused by multiple cache lines mapping to the same set instead of being spread around. Aliasing is caused by matching based on incomplete address bits.)

Nice. little typo : 256 cache lines instead of 128.

Thanks for catching that: 2^8 = 256. I'll try to correct, but I bet I don't catch all of the dependencies. Back when I worked at Intel I wrote a little "Free Text Spreadsheet", that allowed formulae to be placed in the text: type in a new number, and the fix propagated. (I wrote that in undergrad; maybe I can revive.)

c - Matrix multiplication: Small difference in matrix size, large diff...

c performance algorithm matrix-multiplication

"Natural" alignment means aligned to it's own type width. Thus, the load/store will never be split across any kind of boundary wider than itself (e.g. page, cache-line, or an even narrower chunk size used for data transfers between different caches).

First, this assumes that the int is updated with a single store instruction, rather than writing different bytes separately. This is part of what std::atomic guarantees, but that plain C or C++ doesn't. It will normally be the case, though. The x86-64 System V ABI doesn't forbid compilers from making accesses to int variables non-atomic, even though it does require int to be 4B with a default alignment of 4B.

For code that is guaranteed not to break, use C11 stdatomic or C++11 std::atomic. Otherwise the compiler will just keep a value in a register instead of reloading it every time you read it:

std::atomic<int> shared;  // shared variable (in aligned memory)

int x;           // local variable (compiler can keep it in a register)
x = shared.load(std::memory_order_relaxed);
shared.store(x, std::memory_order_relaxed);
// shared = x;  // don't do that unless you actually need seq_cst, because MFENCE is much slower than a simple store

Thus, we just need to talk about the behaviour of an insn like mov [shared], eax.

TL;DR: The x86 ISA guarantees that naturally-aligned stores and loads are atomic, up to 64bits wide.

(Related: std::atomic<T>, _Atomic, atomic_llong, the <atomic> header, gcc -m32, g++ -m32.)
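
As a quick sanity check of what a given compiler/target gives you for these types (a sketch; the answers differ between, say, g++ -m64 and g++ -m32):

#include <atomic>
#include <cstdio>

std::atomic<int>       a32;
std::atomic<long long> a64;   // 8 bytes: needs an 8-byte atomic load/store or lock cmpxchg8b on 32-bit x86

int main() {
    std::printf("atomic<int>       lock-free: %d\n", static_cast<int>(a32.is_lock_free()));
    std::printf("atomic<long long> lock-free: %d\n", static_cast<int>(a64.is_lock_free()));
    return 0;
}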

IIRC, there were SMP 386 systems, but the current memory semantics weren't established until 486. This is why the manual says "486 and newer".

From the "Intel 64 and IA-32 Architectures Software Developer Manuals, volume 3", with my notes in italics. (see also the x86 tag wiki for links: current versions of all volumes, or direct link to page 256 of the vol3 pdf from Dec 2015)

In x86 terminology, a "word" is two 8-bit bytes. 32 bits are a double-word, or DWORD.

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

  • Reading or writing a byte
  • Reading or writing a word aligned on a 16-bit boundary

  • Reading or writing a doubleword aligned on a 32-bit boundary

That last point that I bolded is the answer to your question: This behaviour is part of what's required for a processor to be an x86 CPU (i.e. an implementation of the ISA).

The rest of the section provides further guarantees for newer Intel CPUs: Pentium widens this guarantee to 64 bits.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

  • Reading or writing a quadword aligned on a 64-bit boundary (e.g. an x87 load/store of a double, or cmpxchg8b)
  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The section goes on to point out that accesses split across cache lines (and page boundaries) are not guaranteed to be atomic, and:

"An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses."

AMD's manual agrees with Intel's about aligned 64-bit and narrower loads/stores being atomic

So 64bit x87 and MMX/SSE loads/stores up to 64b (e.g. movq, movsd, movhps, pinsrq, extractps, etc.) are atomic if the data is aligned. gcc -m32 uses movq xmm, [mem] to implement atomic 64-bit loads for things like std::atomic<int64_t>. Clang 4.0 -m32 unfortunately uses lock cmpxchg8b (bug 33109).

On some CPUs with 128b or 256b internal data paths (between execution units and L1, and between different caches), 128b and even 256b vector loads/stores are atomic, but this is not guaranteed by any standard or easily queryable at run-time, unfortunately for compilers implementing std::atomic<__int128> or 16B structs.

If you want atomic 128b across all x86 systems, you must use lock cmpxchg16b (available only in 64bit mode). (And it wasn't available in the first-gen x86-64 CPUs. You need to use -mcx16 with gcc/clang for them to emit it.)

Even CPUs that internally do atomic 128b loads/stores can exhibit non-atomic behaviour in multi-socket systems with a coherency protocol that operates in smaller chunks: e.g. AMD Opteron 2435 (K10) with threads running on separate sockets, connected with HyperTransport.

Intel's and AMD's manuals diverge for unaligned access to cacheable memory. The common subset for all x86 CPUs is the AMD rule. Cacheable means write-back or write-through memory regions, not uncacheable or write-combining, as set with PAT or MTRR regions. They don't mean that the cache-line has to already be hot in L1 cache.

  • Intel P6 and later guarantee atomicity for cacheable loads/stores up to 64 bits as long as they're within a single cache-line (64B, or 32B on very old CPUs like PentiumIII).

AMD guarantees atomicity for cacheable loads/stores that fit within a single 8B-aligned chunk. That makes sense, because we know from the 16B-store test on multi-socket Opteron that HyperTransport only transfers in 8B chunks, and doesn't lock while transferring to prevent tearing. (See above). I guess lock cmpxchg16b must be handled specially.

Possibly related: AMD uses MOESI to share dirty cache-lines directly between caches in different cores, so one core can be reading from its valid copy of a cache line while updates to it are coming in from another cache.

Intel uses MESIF, which requires dirty data to propagate out to the large shared inclusive L3 cache which acts as a backstop for coherency traffic. L3 is tag-inclusive of per-core L2/L1 caches, even for lines that have to be in the Invalid state in L3 because of being M or E in a per-core L1 cache. The data path between L3 and per-core caches is only 32B wide in Haswell/Skylake, so it must buffer or something to avoid a write to L3 from one core happening between reads of two halves of a cache line, which could cause tearing at the 32B boundary.

The relevant sections of the manuals:

The P6 family processors (and newer Intel processors since) guarantee that the following additional memory operation will always be carried out atomically:

  • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

AMD64 Manual 7.3.2 Access Atomicity: "Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor model, as are misaligned loads or stores of less than a quadword that are contained entirely within a naturally-aligned quadword"

Notice that AMD guarantees atomicity for any load smaller than a qword, but Intel only for power-of-2 sizes. 32-bit protected mode and 64-bit long mode can load a 48 bit m16:32 as a memory operand into cs:eip with far-call or far-jmp. (And far-call pushes stuff on the stack.) IDK if this counts as a single 48-bit access or separate 16 and 32-bit.

There have been attempts to formalize the x86 memory model, the latest one being the x86-TSO (extended version) paper from 2009 (link from the memory-ordering section of the x86 tag wiki). It's not usefully skimmable since they define some symbols to express things in their own notation, and I haven't tried to really read it. IDK if it describes the atomicity rules, or if it's only concerned with memory ordering.

I mentioned cmpxchg8b, but I was only talking about the load and the store each separately being atomic (i.e. no "tearing" where one half of the load is from one store, the other half of the load is from a different store).

To prevent the contents of that memory location from being modified between the load and the store, you need lock cmpxchg8b, just like you need lock inc [mem] for the entire read-modify-write to be atomic. Also note that even if cmpxchg8b without lock does a single atomic load (and optionally a store), it's not safe in general to use it as a 64b load with expected=desired. If the value in memory happens to match your expected, you'll get a non-atomic read-modify-write of that location.

The lock prefix makes even unaligned accesses that cross cache-line or page boundaries atomic, but you can't use it with mov to make an unaligned store or load atomic. It's only usable with memory-destination read-modify-write instructions like add [mem], eax.

(lock is implicit in xchg reg, [mem], so don't use xchg with mem to save code-size or instruction count unless performance is irrelevant. Only use it when you want the memory barrier and/or the atomic exchange, or when code-size is the only thing that matters, e.g. in a boot sector.)

From the insn ref manual (Intel x86 manual vol2), cmpxchg:

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processors bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

This design decision reduced chipset complexity before the memory controller was built into the CPU. It may still do so for locked instructions on MMIO regions that hit the PCI-express bus rather than DRAM. It would just be confusing for a lock mov reg, [MMIO_PORT] to produce a write as well as a read to the memory-mapped I/O register.

The other explanation is that it's not very hard to make sure your data has natural alignment, and lock store would perform horribly compared to just making sure your data is aligned. It would be silly to spend transistors on something that would be so slow it wouldn't be worth using. If you really need it (and don't mind reading the memory too), you could use xchg [mem], reg (XCHG has an implicit LOCK prefix), which is even slower than a hypothetical lock mov.

Using a lock prefix is also a full memory barrier, so it imposes a performance overhead beyond just the atomic RMW. (Fun fact: Before mfence existed, a common idiom was lock add [esp], 0, which is a no-op other than clobbering flags and doing a locked operation. [esp] is almost always hot in L1 cache and won't cause contention with any other core. This idiom may still be more efficient than MFENCE on AMD CPUs.)

Without it, software would have to use 1-byte locks (or some kind of available atomic type) to guard accesses to 32bit integers, which is hugely inefficient compared to shared atomic read access for something like a global timestamp variable updated by a timer interrupt. It's probably basically free in silicon to guarantee for aligned accesses of bus-width or smaller.

For locking to be possible at all, some kind of atomic access is required. (Actually, I guess the hardware could provide some kind of totally different hardware-assisted locking mechanism.) For a CPU that does 32bit transfers on its external data bus, it just makes sense to have that be the unit of atomicity.

Since you offered a bounty, I assume you were looking for a long answer that wandered into all interesting side topics. Let me know if there are things I didn't cover that you think would make this Q&A more valuable for future readers.

Since you linked one in the question, I highly recommend reading more of Jeff Preshing's blog posts. They're excellent, and helped me put together the pieces of what I knew into an understanding of memory ordering in C/C++ source vs. asm for different hardware architectures, and how / when to tell the compiler what you want if you aren't writing asm directly.

AMD64 Manual 7.3.2 Access Atomicity: "Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor model, as are misaligned loads or stores of less than a quadword that are contained entirely within a naturally-aligned quadword"

@bartolo-otrit: hmm, so AMD has stricter requirements for atomicity of cacheable loads/stores than Intel? That matches up with the fact that HyperTransport between sockets transfers cache lines in aligned chunks as small as 8B. I wish Intel or someone would document the common subset of functionality that is required for a CPU to be called x86.

c++ - Why is integer assignment on a naturally aligned variable atomic...

c++ c concurrency x86 atomic
Rectangle 27 18

"Natural" alignment means aligned to it's own type width. Thus, the load/store will never be split across any kind of boundary wider than itself (e.g. page, cache-line, or an even narrower chunk size used for data transfers between different caches).

First, this assumes that the int is updated with a single store instruction, rather than writing different bytes separately. This is part of what std::atomic guarantees, but that plain C or C++ doesn't. It will normally be the case, though. The x86-64 System V ABI doesn't forbid compilers from making accesses to int variables non-atomic, even though it does require int to be 4B with a default alignment of 4B.

For code that is guaranteed not to break, use C11 stdatomic or C++11 std::atomic. Otherwise the compiler will just keep a value in a register instead of reloading every time your read it

std::atomic<int> shared;  // shared variable (in aligned memory)

int x;           // local variable (compiler can keep it in a register)
x = shared.load(std::memory_order_relaxed);
shared.store(x, std::memory_order_relaxed);
// shared = x;  // don't do that unless you actually need seq_cst, because MFENCE is much slower than a simple store

Thus, we just need to talk about the behaviour of an insn like mov [shared], eax.

TL;DR: The x86 ISA guarantees that naturally-aligned stores and loads are atomic, up to 64bits wide.

std::atomic<T>
gcc -m32
_Atomic
atomic_llong
g++ -m32
std::atomic
<atomic>

IIRC, there were SMP 386 systems, but the current memory semantics weren't established until 486. This is why the manual says "486 and newer".

From the "Intel 64 and IA-32 Architectures Software Developer Manuals, volume 3", with my notes in italics. (see also the x86 tag wiki for links: current versions of all volumes, or direct link to page 256 of the vol3 pdf from Dec 2015)

In x86 terminology, a "word" is two 8-bit bytes. 32 bits are a double-word, or DWORD.

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

  • Reading or writing a byte
  • Reading or writing a word aligned on a 16-bit boundary

Reading or writing a doubleword aligned on a 32-bit boundary

That last point that I bolded is the answer to your question: This behaviour is part of what's required for a processor to be an x86 CPU (i.e. an implementation of the ISA).

The rest of the section provides further guarantees for newer Intel CPUs: Pentium widens this guarantee to 64 bits.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

double
cmpxchg8b
  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The section goes on to point out that accesses split across cache lines (and page boundaries) are not guaranteed to be atomic, and:

"An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses."

AMD's manual agrees with Intel's about aligned 64-bit and narrower loads/stores being atomic

So 64bit x87 and MMX/SSE loads/stores up to 64b (e.g. movq, movsd, movhps, pinsrq, extractps, etc) are atomic if the data is aligned. gcc -m32 uses movq xmm, [mem] to implement atomic 64-bit loads for things like std::atomic<int64_t>. Clang4.0 -m32 unfortunately uses lock cmpxchg8b bug 33109.

On some CPUs with 128b or 256b internal data paths (between execution units and L1, and between different caches), 128b and even 256b vector loads/stores are atomic, but this is not guaranteed by any standard or easily queryable at run-time, unfortunately for compilers implementing std::atomic<__int128> or 16B structs.

If you want atomic 128b across all x86 systems, you must use lock cmpxchg16b (available only in 64bit mode). (And it wasn't available in the first-gen x86-64 CPUs. You need to use -mcx16 with gcc/clang for them to emit it.)

Even CPUs that internally do atomic 128b loads/stores can exhibit non-atomic behaviour in multi-socket systems with a coherency protocol that operates in smaller chunks: e.g. AMD Opteron 2435 (K10) with threads running on separate sockets, connected with HyperTransport.

Intel's and AMD's manuals diverge for unaligned access to cacheable memory. The common subset for all x86 CPUs is the AMD rule. Cacheable means write-back or write-through memory regions, not uncacheable or write-combining, as set with PAT or MTRR regions. They don't mean that the cache-line has to already be hot in L1 cache.

  • Intel P6 and later guarantee atomicity for cacheable loads/stores up to 64 bits as long as they're within a single cache-line (64B, or 32B on very old CPUs like PentiumIII).

AMD guarantees atomicity for cacheable loads/stores that fit within a single 8B-aligned chunk. That makes sense, because we know from the 16B-store test on multi-socket Opteron that HyperTransport only transfers in 8B chunks, and doesn't lock while transferring to prevent tearing. (See above). I guess lock cmpxchg16b must be handled specially.

Possibly related: AMD uses MOESI to share dirty cache-lines directly between caches in different cores, so one core can be reading from its valid copy of a cache line while updates to it are coming in from another cache.

Intel uses MESIF, which requires dirty data to propagate out to the large shared inclusive L3 cache which acts as a backstop for coherency traffic. L3 is tag-inclusive of per-core L2/L1 caches, even for lines that have to be in the Invalid state in L3 because of being M or E in a per-core L1 cache. The data path between L3 and per-core caches is only 32B wide in Haswell/Skylake, so it must buffer or something to avoid a write to L3 from one core happening between reads of two halves of a cache line, which could cause tearing at the 32B boundary.

The relevant sections of the manuals:

The P6 family processors (and newer Intel processors since) guarantee that the following additional memory operation will always be carried out atomically:

  • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

AMD64 Manual 7.3.2 Access Atomicity Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor model, as are misaligned loads or stores of less than a quadword that are contained entirely within a naturally-aligned quadword

Notice that AMD guarantees atomicity for any load smaller than a qword, but Intel only for power-of-2 sizes. 32-bit protected mode and 64-bit long mode can load a 48 bit m16:32 as a memory operand into cs:eip with far-call or far-jmp. (And far-call pushes stuff on the stack.) IDK if this counts as a single 48-bit access or separate 16 and 32-bit.

There have been attempts to formalize the x86 memory model, the latest one being the x86-TSO (extended version) paper from 2009 (link from the memory-ordering section of the x86 tag wiki). It's not usefully skimable since they define some symbols to express things in their own notation, and I haven't tried to really read it. IDK if it describes the atomicity rules, or if it's only concerned with memory ordering.

I mentioned cmpxchg8b, but I was only talking about the load and the store each separately being atomic (i.e. no "tearing" where one half of the load is from one store, the other half of the load is from a different store).

To prevent the contents of that memory location from being modified between the load and the store, you need lock cmpxchg8b, just like you need lock inc [mem] for the entire read-modify-write to be atomic. Also note that even if cmpxchg8b without lock does a single atomic load (and optionally a store), it's not safe in general to use it as a 64b load with expected=desired. If the value in memory happens to match your expected, you'll get a non-atomic read-modify-write of that location.

The lock prefix makes even unaligned accesses that cross cache-line or page boundaries atomic, but you can't use it with mov to make an unaligned store or load atomic. It's only usable with memory-destination read-modify-write instructions like add [mem], eax.

(lock is implicit in xchg reg, [mem], so don't use xchg with mem to save code-size or instruction count unless performance is irrelevant. Only use it when you want the memory barrier and/or the atomic exchange, or when code-size is the only thing that matters, e.g. in a boot sector.)

From the insn ref manual (Intel x86 manual vol2), cmpxchg:

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processors bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

This design decision reduced chipset complexity before the memory controller was built into the CPU. It may still do so for locked instructions on MMIO regions that hit the PCI-express bus rather than DRAM. It would just be confusing for a lock mov reg, [MMIO_PORT] to produce a write as well as a read to the memory-mapped I/O register.

The other explanation is that it's not very hard to make sure your data has natural alignment, and lock store would perform horribly compared to just making sure your data is aligned. It would be silly to spend transistors on something that would be so slow it wouldn't be worth using. If you really need it (and don't mind reading the memory too), you could use xchg [mem], reg (XCHG has an implicit LOCK prefix), which is even slower than a hypothetical lock mov.

Using a lock prefix is also a full memory barrier, so it imposes a performance overhead beyond just the atomic RMW. (Fun fact: Before mfence existed, a common idiom was lock add [esp], 0, which is a no-op other than clobbering flags and doing a locked operation. [esp] is almost always hot in L1 cache and won't cause contention with any other core. This idiom may still be more efficient than MFENCE on AMD CPUs.)

Without it, software would have to use 1-byte locks (or some kind of available atomic type) to guard accesses to 32bit integers, which is hugely inefficient compared to shared atomic read access for something like a global timestamp variable updated by a timer interrupt. It's probably basically free in silicon to guarantee for aligned accesses of bus-width or smaller.

For locking to be possible at all, some kind of atomic access is required. (Actually, I guess the hardware could provide some kind of totally different hardware-assisted locking mechanism.) For a CPU that does 32bit transfers on its external data bus, it just makes sense to have that be the unit of atomicity.

Since you offered a bounty, I assume you were looking for a long answer that wandered into all interesting side topics. Let me know if there are things I didn't cover that you think would make this Q&A more valuable for future readers.

Since you linked one in the question, I highly recommend reading more of Jeff Preshing's blog posts. They're excellent, and helped me put together the pieces of what I knew into an understanding of memory ordering in C/C++ source vs. asm for different hardware architectures, and how / when to tell the compiler what you want if you aren't writing asm directly.

AMD64 Manual 7.3.2 Access Atomicity: "Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor model, as are misaligned loads or stores of less than a quadword that are contained entirely within a naturally-aligned quadword"

@bartolo-otrit: hmm, so AMD has stricter requirements for atomicity of cacheable loads/stores than Intel? That matches up with the fact that HyperTransport between sockets transfers cache lines in aligned chunks as small as 8B. I wish Intel or someone would document the common subset of functionality that is required for a CPU to be called x86.

c++ - Why is integer assignment on a naturally aligned variable atomic...

c++ c concurrency x86 atomic

It is even a lot easier; you can just multiply a, b and c by 1.2. This gives a line that is 1.2 times the length of the original line.

You mean x, y and z instead of a,b,c?

x2 = 1.2 * x1
y2 = 1.2 * y1
z2 = 1.2 * z1

Can you explain a little please?

Are you saying that if you have two points O and A, a third point B whose coordinates are a multiple of the coordinates of A will be on the line determined by O and A? That doesn't seem possible.

Given that O equals (0,0,0), it's simple vector multiplication. With a point A (x,y,z), the line OA represents a vector with direction (x,y,z). A point along the same line is a multiplication of this vector with a factor lambda. The length of the line then also multiplies with factor lambda. So 1.2 * (x,y,z) = (1.2x, 1.2y, 1.2z) and that represents a point on the line OA.
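A small sketch of that computation (illustrative names; it assumes, as above, that the line starts at the origin O): scaling A's coordinates by distance / |OA| gives the point on the line OA at the requested distance from O.

#include <cmath>
#include <cstdio>

struct Point { double x, y, z; };

// Point on the line OA at the given distance from the origin.
Point point_at_distance(Point a, double distance)
{
    double len = std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
    double k = distance / len;              // e.g. k == 1.2 scales the line by 1.2
    return { a.x * k, a.y * k, a.z * k };
}

int main()
{
    Point b = point_at_distance({3.0, 4.0, 0.0}, 6.0);   // |OA| = 5, so k = 1.2
    std::printf("(%g, %g, %g)\n", b.x, b.y, b.z);        // prints (3.6, 4.8, 0)
    return 0;
}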

Find 3D point along the line at given distance - Stack Overflow

line distance direction

#include <atomic>   // std::atomic<bool>

class C {
public:
   double getValue() {
      if (alreadyCalculated == true)
         return m_val;

      bool expected = false;
      // Exactly one thread wins this CAS and performs the calculation.
      if (calculationInProgress.compare_exchange_strong(expected, true)) {
         m_val = calculate(m_param);        // calculate() and m_param as in the question
         alreadyCalculated = true;
      // calculationInProgress = false;     // <- the bug in the original answer (see note below)
      }
      else {
     //  while (calculationInProgress == true)   // <- original (buggy) spin condition
         while (alreadyCalculated == false)
            ; // spin until the winning thread publishes the result
      }
      return m_val;
   }

private:
   double m_val;
   std::atomic<bool> alreadyCalculated {false};
   std::atomic<bool> calculationInProgress {false};
};

It's not in fact lock-free; there is a spin lock inside. But I think you cannot avoid such a lock if you don't want calculate() to be run by multiple threads.

getValue() gets more complicated here, but the important part is that once m_val is calculated, it will always return immediately in the first if statement.

For performance reasons, it might also be a good idea to pad the whole class to the cache line size.
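A minimal sketch of that padding suggestion (64 bytes is a typical x86 cache-line size; C++17 additionally offers std::hardware_destructive_interference_size where it is implemented): alignas on the class makes alignof(C) 64 and rounds sizeof(C) up to a multiple of 64, so neighbouring C objects never share a cache line.

#include <atomic>

class alignas(64) C {
    // ... getValue() and the members shown above ...
    double m_val;
    std::atomic<bool> alreadyCalculated {false};
    std::atomic<bool> calculationInProgress {false};
};

static_assert(sizeof(C) % 64 == 0, "C occupies whole cache lines");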

There was a bug in the original answer, thanks to JVApen for pointing it out (it's marked by the comments). The variable calculationInProgress would be better renamed to something like calculationHasStarted.

Also, please note that this solution assumes that calculate() does not throw an exception.

Actually, I don't really mind if calculate() runs once or twice too many, since it's not -very- expensive. I just don't want to run it more times than necessary. I've edited my question to reflect that.

It only matters in the not-yet-calculated part of getValue(); you would still need to protect the update of m_val somehow. This solution is IMO more elegant than invoking calculate() multiple times, which would bring no benefit here. The crucial part is that getValue() runs as quickly as possible once the value has been calculated.

Why does alreadyCalculated need to be atomic?

Because it is written by one thread and read by others. Without atomics, it would be a data race.

I believe so, since the alreadyCalculated = true; statement should generate a memory fence.

c++ - Lock-free cache implementation in C++11 - Stack Overflow

c++ multithreading c++11 caching lock-free

Here is a simple C implementation (without modulus or ternary operators) for the raw base64-encoded size (with standard '=' padding):

int output_size;
output_size = ((input_size - 1) / 3) * 4 + 4;

To that you will need to add any additional overhead for CRLF line breaks, if required. The standard base64 encoding (RFC 3548 or RFC 4648) allows CRLF line breaks (after every 64 or 76 characters) but does not require them. The MIME variant (RFC 2045) requires line breaks after every 76 characters.

For example, building on the above, the total encoded length using 76-character lines is:

int final_size;
final_size = output_size + (output_size / 76) * 2;
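A quick self-contained check of the two formulas (a sketch; input_size here is just an example value):

#include <cstdio>

int main()
{
    int input_size  = 100;                                    // raw bytes
    int output_size = ((input_size - 1) / 3) * 4 + 4;         // padded base64 chars
    int final_size  = output_size + (output_size / 76) * 2;   // plus CRLF per 76 chars

    std::printf("%d -> %d -> %d\n", input_size, output_size, final_size);
    // prints: 100 -> 136 -> 138
    return 0;
}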

c - Calculate the size to a Base 64 encoded message - Stack Overflow

c base64

You cannot calculate the size of an int array when all you've got is an int pointer.

In the scope where the array is declared, you can compute the element count with a macro:

#define ARRAY_SIZE( array ) ( sizeof( array ) / sizeof( array[0] ) )

This comes with all the usual caveats of macros, of course.

  • You cannot determine the number of elements initialized within an array, unless you initialize all elements to an "invalid" value first and count the "valid" values manually. If your array has been defined as having 8 elements, for the compiler it has 8 elements, no matter whether you initialized only 5 of them.
  • You cannot determine the size of an array within a function to which that array has been passed as parameter. Not directly, not through a macro, not in any way. You can only determine the size of an array in the scope it has been declared in.

The impossibility of determining the size of the array in a called function can be understood once you realize that sizeof() is a compile-time operator. It might look like a run-time function call, but it isn't: the compiler determines the size of its operand and inserts it as a constant.

In the scope the array is declared, the compiler has the information that it is actually an array, and how many elements it has.

In a function to which the array is passed, all the compiler sees is a pointer. (Consider that the function might be called with many different arrays, and remember that sizeof() is a compile-time operator.)
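A short demonstration of those two points (a sketch reusing the ARRAY_SIZE macro from above; the names are mine):

#include <cstdio>

#define ARRAY_SIZE( array ) ( sizeof( array ) / sizeof( array[0] ) )

void takes_pointer(const int* p)
{
    // Inside the callee the array has decayed to a pointer, so this is
    // sizeof(int*) -- e.g. 8 on a 64-bit system -- not the element count.
    std::printf("sizeof in callee: %zu\n", sizeof(p));
}

int main()
{
    int values[8] = {1, 2, 3, 4, 5};   // 8 elements, only 5 initialized explicitly
    std::printf("ARRAY_SIZE: %zu\n", ARRAY_SIZE(values));   // prints 8, not 5
    takes_pointer(values);
    return 0;
}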

You can switch to C++ and use <vector>. In C, you can define a struct vector plus functions handling it, but it's not really comfortable:

#include <stdlib.h>

typedef struct
{
    int *  _data;
    size_t _size;
} int_vector;

int_vector * create_int_vector( size_t size )
{
    int_vector * _vec = malloc( sizeof( int_vector ) );
    if ( _vec != NULL )
    {
        _vec->_size = size;
        _vec->_data = (int *)malloc( size * sizeof( int ) );
    }
    return _vec;
}

void destroy_int_vector( int_vector * _vec )
{
    free( _vec->_data );
    free( _vec );
}

int main()
{
    int_vector * myVector = create_int_vector( 8 );
    if ( myVector != NULL && myVector->_data != NULL )
    {
        myVector->_data[0] = ...;
        destroy_int_vector( myVector );
    }
    else if ( myVector != NULL )
    {
        free( myVector );
    }
    return 0;
}

Bottom line: C arrays are limited. You cannot calculate their length in a sub-function, period. You have to code your way around that limitation, or use a different language (like C++).

Thanks! This is good, but I would be happier with a function!

@Swanand Purankar: I can understand that, but it is just plain impossible to do in C.

@Swanand Purankar: I just saw your comment under your question... note that this macro will not give you a 5 in your example, but 8. There is no way for C to tell which / how many array elements have actually been assigned (unless you manually insert special values for uninitialized elements and do likewise manual counting of non-special values). C itself can only tell you the overall size of the array.

@DevSolar: I tried this macro, but the problem is that I cannot use it in functions to which an array is passed by pointer! Any help?

Calculate Length of Array in C by Using Function - Stack Overflow

c arrays

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit
split -l 200000 filename

which will create files each with 200000 lines named xaa xab xac ...

Another option, split by size of output file (still splits on line breaks):

split -C 20m --numeric-suffixes input_filename output_prefix
which produces files named output_prefix01, output_prefix02, output_prefix03, ...

you can also split a file by size: split -b 200m filename (m for megabytes, k for kilobytes or no suffix for bytes)

split by size and ensure files are split on line breaks: split -C 200m filename

split produces garbled output with Unicode (UTF-16) input. At least on Windows with the version I have.

Using split data.csv in OSX 10.8.4 to separate a 5k line file just produces an identical file named xaa..

@geotheory, be sure to follow LeberMac's advice earlier in the thread about first converting CR (Mac) line endings to LF (Linux) line endings using TextWrangler or BBEdit. I had the exact same problem as you until I found that piece of advice.


bash - How to split a large text file into smaller files with equal nu...

bash file unix

Don't use C++ STL strings and getline (or C's fgets); use C-style raw pointers and either read blocks in page-size chunks or mmap the file.

Then scan the block at the native word size of your system (i.e. either uint32_t or uint64_t) using one of the magic 'SIMD Within A Register' (SWAR) algorithms for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. (That code gets to around 5 cycles per input byte while matching a regex on each line of a file.)
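For illustration, here is a hedged sketch of that SWAR idea in portable C++ (not the code the link above pointed to): XOR each 64-bit word with 0x0a0a0a0a0a0a0a0a so newline bytes become zero, then detect the zero bytes with the usual bit tricks and popcount them.

#include <bit>        // std::popcount (C++20); __builtin_popcountll also works
#include <cstdint>
#include <cstring>

// Count '\n' bytes in a buffer, eight bytes at a time.
std::size_t count_newlines(const char* data, std::size_t len)
{
    constexpr std::uint64_t kHighs = 0x8080808080808080ULL;
    constexpr std::uint64_t kLows  = 0x7f7f7f7f7f7f7f7fULL;
    constexpr std::uint64_t kLF    = 0x0a0a0a0a0a0a0a0aULL;   // '\n' in every byte

    std::size_t count = 0, i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, data + i, 8);                      // safe unaligned load
        std::uint64_t x = word ^ kLF;                         // '\n' bytes become 0x00
        // High bit of each byte is set iff that byte of x is zero (exact, no false positives).
        std::uint64_t zeros = ~(((x & kLows) + kLows) | x) & kHighs;
        count += static_cast<std::size_t>(std::popcount(zeros));
    }
    for (; i < len; ++i)                                      // leftover tail bytes
        count += (data[i] == '\n');
    return count;
}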

If the file is only a few tens or a hundred or so megabytes, and it keeps growing (i.e. something keeps writing to it), then there's a good likelihood that Linux has it cached in memory, so it won't be disk-IO limited, but memory-bandwidth limited.

If the file is only ever being appended to, you could also remember the number of lines and previous length, and start from there.

It has been pointed out that you could use mmap with C++ STL algorithms, and create a functor to pass to std::for_each. I suggested that you shouldn't do it, not because you can't do it that way, but because there is no gain in writing the extra code to do so. Or you can use Boost's mmapped iterator, which handles it all for you; but for the problem the code I linked to was written for, this was much, much slower, and the question was about speed, not style.
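And for completeness, a hedged sketch of that mmap-plus-standard-algorithm approach (POSIX-only, minimal error handling; the function name is mine): map the file read-only and let std::count scan the raw bytes for '\n'.

#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

long long count_lines_mmap(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0)     { close(fd); return 0;  }   // mmap of length 0 would fail

    void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                   PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                           // the mapping outlives the fd
    if (p == MAP_FAILED)
        return -1;

    const char* begin = static_cast<const char*>(p);
    long long lines = std::count(begin, begin + st.st_size, '\n');

    munmap(p, static_cast<std::size_t>(st.st_size));
    return lines;
}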

There is no reason you can't use mmap or reading blocks in C++.

You have to jump through a lot of hoops to convert between STL containers/strings and raw mmapped memory, which often involves copying and indirection, rather than just calling the functions and using the memory directly via the C subset of C++.

There is of course no "C subset" of C++. Just because you don't use a std container or string does not make it "not C++" in some way.

And to add to what Neil said, std algorithms work perfectly fine on raw memory, using pointers as iterators. You could trivially write a functor performing the SIMD tricks specified, and run that over the file data with std::for_each. The STL is more than just the containers.

@Neil: you know full well that there is a subset of C++ which is close to idiomatic C. I've posted this exact response before and been criticised for it being C rather than C++; you can't win.

Fastest way to find the number of lines in a text (C++) - Stack Overfl...

c++ line-count