2

C99 doesn't need function bodies, or 'VLAs are Turing complete'

 2 years ago
source link: https://lemon.rip/w/c99-vla-tricks/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Home Go Back

C99 doesn't need function bodies, or 'VLAs are Turing complete'

19 Feb 2022


Preface

The 1999 revision to the C programming language brought several interesting changes and additions to the standard. Among those are variable length arrays, or VLAs for short, which allow array types to have lengths that are determined at runtime. These can show up in different contexts, but for the purposes of this post we will be looking at VLAs as parameters to functions. For example:

void f(int n, float v[n]);

Since array parameters decay to their corresponding pointer type, that declaration is functionally equivalent, and compatible (6.7.5.3p7) to the following:

void f(int n, float *v);

The length of v in the first declaration is given by the value of a previous parameter of the function, which can only be known at runtime. However, in a function prototype (i.e. not a function definition) like the above, the length of a VLA parameter is never computed and essentially ignored (6.7.5.2p5). But in a function definition, the length is evaluated and must be greater than zero. In practice this is rather useless because the sizeof operator doesn't work as one might expect it to with array parameters (6.5.3.4-note88):

void f(int n, float param[n]) {
    float local[n];
    int a = sizeof local,
        b = sizeof param;

    printf("n = %d\n", n);
    printf("sizeof local = %d\n", a); // this will be n * sizeof(float)
    printf("sizeof param = %d\n", b); // but this will be sizeof(float *)
}

In other words, using sizeof with a VLA will evaluate the runtime-known length and calculate the array size based on that, except when that VLA is a function parameter, then, for whatever reason, it is the size of the corresponding pointer type (in this example, sizeof local === n * sizeof(float) but sizeof param == sizeof(float *)). The length of a VLA parameter may be used when e.g. computing indices when accessing multi-dimensional variable length arrays.

Alas, the standard mandates that the variable array length be computed when the function is called. Of course, the expression in between the square brackets is not limited to simple expressions like n, so one can write something like:

void f(int a, int b, int m[(a + b) / 2]) {}
void f(int x, int b[abs(x)]) {}

or even

void f(int v[getchar()]) {}

Disembodiment

The following program should give you an idea of the kind of constructs that these rules allow for (try it on compiler explorer):

int main(int argc, char *argv[printf("Hello")])
{
    printf(" world!\n");
}

The length expression of the VLA is evaluated before any of the statements in main. I couldn't find anywhere in the standard saying whether this evaluation order is well-defined but it is what clang and gcc do, and luckily, it does not matter for the sake of this article, as we will see shortly.

Let us refer to the subset of C99 where function bodies must be empty as disembodied C. You might naturally ask yourself what things can be accomplished in this subset (though you can probably guess the answer from the title of this page). In other words, what can you do in C if you are limited to just evaluating expressions and no statements?

  • Using the comma operator, we can sequence arbitrarily many expressions for their side effects, so

    void f() {
        printf("Press enter to confirm: ");
        getchar();
        printf("thanks.\n");
    }
    

    becomes

    void f(char _[(
        printf("Press enter to confirm: "),
        getchar(),
        printf("thanks.\n"),
        1
    )]) {}
    

    In a comma expression, the operands are evaluated left-to-right and the value of the last operand is the resulting value of the whole expression. The 1 at the end ensures that the evaluated array length is >0 (to avoid UB).

  • Functions in disembodied C are going to need a dummy parameter where the VLA length expression evaluation can take place. For consistency, we will denote it as char _[...] and give it the value "" (the empty string) when calling said functions (note that the value we give it doesn't actually matter, though its size should be at least as big as the computed VLA size to avoid UB).

  • If-else statements can be replaced with ternary conditional expressions, such that

    void f(int n) {
        if (n < 0)
            printf("negative!");
        else if (n > 0)
            printf("positive!");
        else
            printf("zero!");
    }
    

    becomes

    void f(int n, char _[(
        (n < 0) ?
            printf("negative!")
        : (n > 0) ?
            printf("positive!")
        :
            printf("zero!")
        , 1
    )]) {}
    
  • Remember that the VLA length expression can access previous function arguments, so parameter passing is essentially unchanged.

  • We cannot return values, but we can use out parameters by taking advantage of the fact that assignments are expressions in C, so instead of

    int imax(int a, int b) {
        return a > b ? a : b;
    }
    

    we can write

    void imax(int *out, int a, int b, char _[
        (*out = a > b ? a : b),
        1
    ]) {}
    
  • We cannot define local variables inside of expressions, but we can just add extra function parameters to use as temporaries, rewriting

    void fswapf(float *a, float *b) {
        float tmp = *a;
        *a = *b;
        *b = tmp;
    }
    
    static void fswapf_aux(float *a, float *b, float tmp, char _[(
        tmp = *a,
        *a = *b,
        *b = tmp,
        1
    )]) {}
    
    void fswapf(float *a, float *b, char _[(
        fswapf_aux(a, b, 0, ""), 1
    )]) {}
    

    Alternatively, if re-entrancy and thread-safety are disregarded, we could just use global (static) variables.

  • What about loops? Clearly we cannot use while or for inside expressions. Thankfully, they are unnecessary thanks to recursion. For example:

    float sum(float *v, size_t n) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            sum += v[i];
        return sum;
    }
    

    can be expressed as

    /* the forward declaration is necessary */
    static void sum_aux(float *out, float *v, size_t n, char *);
    static void sum_aux(float *out, float *v, size_t n, char _[(
        (n > 0) ? (
            *out += *v,
            sum_aux(out, v + 1, n - 1, ""),
            1
        ) : 1
    )]) {}
    
    void sum(float *out, float *v, size_t n, char _[(
        *out = 0.0f,
        sum_aux(out, v, n, ""),
        1
    )]) {}
    

    In fact, any imperative-style loop can be turned into an equivalent recursive loop (as any functional programmer will be happy to demonstrate), though since C lacks closures or any form of anonymous functions, it can get quite unwieldy (I hope you like auxiliary functions).

    The astute reader might point out that these two versions of sum are not equivalent because the recursive definition may cause a stack overflow for large enough values of n. This is unfortunately true, and the major hurdle for the practicality of disembodied C, but does not preclude Turing completeness (an ideal Turing machine has infinite memory at its disposal). Luckily, modern compilers are smart, and if we write our functions carefully to be tail recursive, they will often be able to perform tail call elimination, removing the risk of a stack overflow. For the above example, both gcc and clang are able to optimize the tail call (see on compiler explorer).

  • Although not relevant to standard C99, in case you had the idea of using gcc statement expressions to bypass these limitations, the compiler will stop you right on your tracks since it doesn't allow those outside of function bodies.

Limitations

After all of that, it seems that basically anything can be done in disembodied C, so, is there something that can't be expressed in this C99 subset?

With the rules that we've laid down so far, yes. In particular:

  • It is not possible in the general case to use APIs that require callbacks. For example, look at the standard qsort function:

     void qsort(void *base, size_t nmemb, size_t size,
                int (*compar)(const void *, const void *));
    

    The compar parameter is a pointer to a function that should return the result of comparing two values in the given array. Since its parameters are void pointers, we can't have the VLA argument we need to perform computations (arrays of void are not valid C), and we also cannot provide an int return value. Thus there is no way (that I know of) to use this function in standards-compliant disembodied C.

    This is not a dealbreaker since we could just reimplement qsort ;).

  • We cannot access the program arguments. The entry point of a program in disembodied C looks like this:

    int main(int argc, char *argv[ /* code here ... */ ]) {}
    

    In the VLA length expression for argv, argc is in scope, but argv is not.

    Fortunately there is a way to work around this: we can use the common extension where main can take a third argument which is a pointer to the environment. According to the standard (J.5.1), this is implemented in hosted environments. In practice, POSIX environments and Microsoft both support it. Then, our entry point looks like this instead:

    int main(int argc, char **argv, char *envp[/* ... */]) {}
    

    Now argv is in scope within the envp array length expression.

    Side note: even though we can't return, main is the exception to the rule that reaching the closing } of a function returning non-void is verboten, and is equivalent to returning zero (5.1.2.2.3). If we need to terminate with an exit code other than zero, we can use exit().

Final Words

It seems that with enough effort anything can be done with these peculiar restrictions. As an example, I wrote implementations of the MD5 and SHA256 hash functions as well as Conway's Game of Life using SDL for graphics. I was also working on a more interesting and "useful" program but I kind of lost motivation halfway through. I'll update this article if I ever come back to finish that.

In case it wasn't clear enough, there is zero practical reason to ever do any of this. I hope you'll agree it's a fun party trick though :).


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK