Wednesday, March 19, 2014

Speculative stores

Recently I read this wonderful article on C11 atomic variables and their possible use in the Linux kernel. This is why I admire community coding. People throng into fruitful discussions, and eventually the best comes out. The final result is a lot of learning from all corners. In the article, Mr. Corbet mentions speculative stores, one of the consequences of nasty compiler optimizations: the compiler acts on a value it assumes a variable holds instead of what is actually in memory. I wrote about how intelligently compilers handle certain cases here and here. So let's look at the following example, which is also mentioned in the article.

int y=2;

int do_some_work()
{
    y = 2;

    if (y)
        ....
    else
        ....
}


In the above code, many programmers may expect the compiler to rip out the 'else' branch, since 'y' was set to 2 just before the test. That is the dangerous part if the code belongs to kernel space. Why? 'y' lives in the data segment and can be modified by any CPU in an SMP system. Another CPU can even set it to zero between the assignment and the test. In that case, the compiler's optimization produces incorrect results. I cannot say how compilers treat such code in kernel space, since it would take a bit of time to experiment.
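If the code really must re-read 'y' from memory, one common approach is to perform the read through a volatile-qualified access, similar in spirit to the kernel's READ_ONCE() helper. The following is only a minimal sketch under assumptions: the macro name READ_INT_ONCE and the function do_some_work_checked are invented for this illustration and do not come from the article or from the kernel sources. Note that this constrains only the compiler; it gives no atomicity or ordering guarantees between CPUs.

```c
#include <assert.h>

/* Hypothetical helper (name invented for this sketch): read an int
 * through a volatile-qualified lvalue, so the compiler has to emit a
 * real load instead of reusing a value it believes it already knows. */
#define READ_INT_ONCE(var) (*(volatile int *)&(var))

int y = 2;

/* Returns 1 if the 'if' branch was taken, 0 if the 'else' branch was. */
int do_some_work_checked(void)
{
    y = 2;

    /* The volatile read forces y to be loaded from memory here, so
     * neither branch can be thrown away at compile time. */
    if (READ_INT_ONCE(y))
        return 1;
    else
        return 0;
}

/* Single-threaded sanity check: assert(do_some_work_checked() == 1); */
```

With nobody else writing to 'y', the function always takes the 'if' branch; the point is that the 'else' branch still exists in the generated code, whatever the optimization level.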

How can we try this in user space? Very simple! Run more than one thread and see what the assembly looks like. Note that this code is not thread safe.

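#include <pthread.h>
#include <stdio.h>
#include <assert.h>

int y = 0;      /* shared global, deliberately left unprotected */

void *thread_routine(void *arg);
void *thread_routine2(void *arg);

/* Stores 1 and immediately tests it; "Y is !Y" means someone else changed y. */
void *thread_routine(void *arg)
{
        y = 1;
        /* pthread_t is an unsigned long on Linux/glibc; the cast is not portable. */
        if (y)
                printf("Y is Y in thread = %lu\n", (unsigned long)pthread_self());
        else
                printf("Y is !Y in thread = %lu\n", (unsigned long)pthread_self());
        return NULL;
}

/* Stores 0 and immediately tests it; here the 'if' branch is the surprising one. */
void *thread_routine2(void *arg)
{
        y = 0;
        if (y)
                printf("Y is !Y in thread = %lu\n", (unsigned long)pthread_self());
        else
                printf("Y is Y in thread = %lu\n", (unsigned long)pthread_self());
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tid[2];
        int thread_rc = 0;

        thread_rc = pthread_create(&tid[0], NULL, thread_routine, NULL);
        assert(!thread_rc);
        thread_rc = pthread_create(&tid[1], NULL, thread_routine2, NULL);
        assert(!thread_rc);

        /* Wait for both threads; otherwise main may return before they run. */
        pthread_join(tid[0], NULL);
        pthread_join(tid[1], NULL);
        return 0;
}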


I am stripping off the unnecessary sections and retaining only the relevant part of the assembly for thread_routine. The thread_routine2 function looks similar.

thread_routine:
.LFB2:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movq    %rdi, -8(%rbp)
        movl    $1, y(%rip)
        movl    y(%rip), %eax   # <-- I know you are tricking me :D
        testl   %eax, %eax
        je      .L2

        call    pthread_self
        movq    %rax, %rsi
        movl    $.LC0, %edi
        movl    $0, %eax
        call    printf
        jmp     .L4
.L2:
        call    pthread_self
        movq    %rax, %rsi
        movl    $.LC1, %edi
        movl    $0, %eax
        call    printf

.L4:
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc


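If you glance at the assembly code, gcc looks like a smart man :D. It emits code for both the if and the else part, even though there is a straightforward assignment just before the branch. Also look at "movl y(%rip), %eax"! Instead of blindly copying the value 1 into the EAX register, the actual value of 'y' is loaded from memory and tested :-). One caution: this listing is unoptimized output (the frame setup and the spill of %rdi to the stack are typical of gcc's default -O0), where gcc does not attempt this kind of folding at all. So it does not prove that gcc treats 'y' specially, and nothing in the C language stops the compiler from reusing the known value at higher optimization levels. That is why the first assignment below is worth doing.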
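The test program above has a data race on 'y', which C11 treats as undefined behaviour. Since the article is about C11 atomics, here is a hedged sketch of how the same store-then-test pattern might look with <stdatomic.h>. The names y_atomic and store_then_check are invented for this example. Relaxed ordering is used because only the single variable matters here; this is an assumption of the sketch, not a recommendation from the article. In practice gcc currently emits a real load for atomic accesses, though the standard does not strictly promise that.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical atomic counterpart of the global 'y' (name invented here). */
atomic_int y_atomic = 0;

/* Store 'value', then branch on a fresh atomic load, mirroring the
 * store-then-test pattern of thread_routine.
 * Returns 1 when the load observed the value just stored ("Y is Y"),
 * 0 when another thread overwrote it in between ("Y is !Y"). */
int store_then_check(int value)
{
    atomic_store_explicit(&y_atomic, value, memory_order_relaxed);

    if (atomic_load_explicit(&y_atomic, memory_order_relaxed) == value)
        return 1;
    else
        return 0;
}

/* Single-threaded sanity check: assert(store_then_check(1) == 1); */
```

Called from two threads with different values, the function can legitimately return 0 in one of them; the difference from the original program is that this outcome is now well defined instead of being a data race.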

Caveat: Multithreading may not emulate an SMP scenario on Linux. Nowadays operating systems tend to keep a thread on one particular CPU rather than moving it across multiple CPUs. This is mainly to avoid the penalty incurred by cache line invalidations, especially when a global variable is involved and can be modified. Nevertheless, a thread can be pre-empted in the middle of an operation (say, between the store to 'y' and the evaluation of the if condition), and unless the accesses are protected by a lock, another thread can change 'y' in that window. Understanding SMP systems is quite intricate, but it opens up a wide variety of thoughts in the programming world. Two cores are not two brains, you know ;-). There are a lot of difficulties in handling such scenarios!
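The caveat above mentions holding a lock. As a rough sketch of what that could look like in user space, the function below does the store and the test under a pthread mutex. The names y_shared, y_lock and locked_store_then_check are invented for this illustration. The thread can still be pre-empted while it holds the lock; what the lock guarantees is that no other thread that also takes y_lock can modify y_shared in between.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared variable and the mutex that protects it. */
static pthread_mutex_t y_lock = PTHREAD_MUTEX_INITIALIZER;
static int y_shared = 0;

/* Store 'value' and test it while holding the lock.
 * Returns 1 if y_shared still held 'value' at the test, 0 otherwise.
 * As long as every writer takes y_lock, the result is always 1. */
int locked_store_then_check(int value)
{
    int unchanged;

    pthread_mutex_lock(&y_lock);
    y_shared = value;
    unchanged = (y_shared == value);
    pthread_mutex_unlock(&y_lock);

    return unchanged;
}

/* Single-threaded sanity check: assert(locked_store_then_check(1) == 1); */
```

The trade-off compared with an atomic variable is that the whole store-then-test sequence becomes one indivisible step, at the cost of serializing the threads around it.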

Finally, two short assignments ;-):

1) Examine the assembly when compiling with the -O2 switch.
2) Remove the threads and run a bare minimal program while retaining the data segment variable, and observe what gcc does!