GitHub - spencertipping/jit-tutorial: How to write a very simple JIT compiler
source link: https://github.com/spencertipping/jit-tutorial
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
How to write a JIT compiler
First up, you probably don't want to. JIT, or more accurately "dynamic code generation," is typically not the most effective way to optimize a project, and common techniques end up trading away a lot of portability and require fairly detailed knowledge about processor-level optimization.
That said, though, writing JIT compiler is a lot of fun and a great way to learn stuff. The first thing to do is to write an interpreter.
NOTE: If you don't have solid grasp of UNIX system-level programming, you might want to read about how to write a shell, which covers a lot of the fundamentals.
MandelASM
GPUs are fine for machine learning, but serious fractal enthusiasts design their own processors to generate Mandelbrot sets. And the first step in processor design, of course, is to write an emulator for it. Our emulator will interpret the machine code we want to run and emit an image to stdout.
To keep it simple, our processor has four complex-valued registers called a
,
b
, c
, and d
, and it supports three in-place operations:
=ab
: assign registera
to registerb
+ab
: add registera
to registerb
*ab
: multiply registerb
by registera
For each pixel, the interpreter will zero all of the registers and then set a
to the current pixel's coordinates. It then iterates the machine code for up to
256 iterations waiting for register b
to "overflow" (i.e. for its complex
absolute value to exceed 2). That means that the code for a standard Mandelbrot
set is *bb+ab
.
Simple interpreter
The first thing to do is write up a bare-bones interpreter in C. It would be
simpler to use complex.h
here, but I'm going to write it in terms of
individual numbers because the JIT compiler will end up generating the longhand
logic. In production code we'd include bounds-checks and stuff, but I'm
omitting those here for simplicity.
// simple.c #include <stdio.h> #include <stdlib.h> #define sqr(x) ((x) * (x)) typedef struct { double r; double i; } complex; void interpret(complex *registers, char const *code) { complex *src, *dst; double r, i; for (; *code; code += 3) { dst = ®isters[code[2] - 'a']; src = ®isters[code[1] - 'a']; switch (*code) { case '=': dst->r = src->r; dst->i = src->i; break; case '+': dst->r += src->r; dst->i += src->i; break; case '*': r = dst->r * src->r - dst->i * src->i; i = dst->r * src->i + dst->i * src->r; dst->r = r; dst->i = i; break; default: fprintf(stderr, "undefined instruction %s (ASCII %x)\n", code, *code); exit(1); } } } int main(int argc, char **argv) { complex registers[4]; int i, x, y; char line[1600]; printf("P5\n%d %d\n%d\n", 1600, 900, 255); for (y = 0; y < 900; ++y) { for (x = 0; x < 1600; ++x) { registers[0].r = 2 * 1.6 * (x / 1600.0 - 0.5); registers[0].i = 2 * 0.9 * (y / 900.0 - 0.5); for (i = 1; i < 4; ++i) registers[i].r = registers[i].i = 0; for (i = 0; i < 256 && sqr(registers[1].r) + sqr(registers[1].i) < 4; ++i) interpret(registers, argv[1]); line[x] = i; } fwrite(line, 1, sizeof(line), stdout); } return 0; }
Now we can see the results by using display
from ImageMagick
(apt-get install imagemagick
), or by saving to a file:
$ gcc simple.c -o simple $ ./simple *bb+ab | display - # imagemagick version $ ./simple *bb+ab > output.pgm # save a grayscale PPM image $ time ./simple *bb+ab > /dev/null # quick benchmark real 0m2.369s user 0m2.364s sys 0m0.000s $
Performance analysis
In the real world, JIT is absolutely the wrong move for this problem.
Array languages like APL, Matlab, and to a large extent Perl, Python, etc, manage to achieve reasonable performance by having interpreter operations that apply over a large number of data elements at a time. We've got exactly that situation here: in the real world it's a lot more practical to vectorize the operations to apply simultaneously to a screen-worth of data at a time -- then we'd have nice options like offloading stuff to a GPU, etc.
However, since the point here is to compile stuff, on we go.
JIT can basically eliminate the interpreter overhead, which we can easily model
here by replacing interpret()
with a hard-coded Mandelbrot calculation. This
will provide an upper bound on realistic JIT performance, since we're unlikely
to optimize as well as gcc
does.
// hardcoded.c #include <stdio.h> #include <stdlib.h> #define sqr(x) ((x) * (x)) typedef struct { double r; double i; } complex; void interpret(complex *registers, char const *code) { complex *a = ®isters[0]; complex *b = ®isters[1]; double r, i; r = b->r * b->r - b->i * b->i; i = b->r * b->i + b->i * b->r; b->r = r; b->i = i; b->r += a->r; b->i += a->i; } int main(int argc, char **argv) { complex registers[4]; int i, x, y; char line[1600]; printf("P5\n%d %d\n%d\n", 1600, 900, 255); for (y = 0; y < 900; ++y) { for (x = 0; x < 1600; ++x) { registers[0].r = 2 * 1.6 * (x / 1600.0 - 0.5); registers[0].i = 2 * 0.9 * (y / 900.0 - 0.5); for (i = 1; i < 4; ++i) registers[i].r = registers[i].i = 0; for (i = 0; i < 256 && sqr(registers[1].r) + sqr(registers[1].i) < 4; ++i) interpret(registers, argv[1]); line[x] = i; } fwrite(line, 1, sizeof(line), stdout); } return 0; }
This version runs about twice as fast as the simple interpreter:
$ gcc hardcoded.c -o hardcoded $ time ./hardcoded *bb+ab > /dev/null real 0m1.329s user 0m1.328s sys 0m0.000s $
JIT design and the x86-64 calling convention
The basic strategy is to replace interpret(registers, code)
with a function
compile(code)
that returns a pointer to a function whose signature is this:
void compiled(registers*)
. The memory for the function needs to be allocated
using mmap
so we can set permission for the processor to execute it.
The easiest way to start with something like this is probably to emit the
assembly for simple.c
to see how it works:
$ gcc -S simple.c
Edited/annotated highlights from the assembly simple.s
, which is much more
complicated than what we'll end up generating:
interpret: // The stack contains local variables referenced to the "base pointer" // stored in hardware register %rbp. Here's the layout: // // double i = -8(%rbp) // double r = -16(%rbp) // src = -24(%rbp) // dst = -32(%rbp) // registers = -40(%rbp) <- comes in as an argument in %rdi // code = -48(%rbp) <- comes in as an argument in %rsi pushq %rbp movq %rsp, %rbp // standard x86-64 function header subq $48, %rsp // allocate space for six local vars movq %rdi, -40(%rbp) // registers arg -> local var movq %rsi, -48(%rbp) // code arg -> local var jmp for_loop_condition // commence loopage
Before getting to the rest, I wanted to call out the %rsi
and %rdi
stuff
and explain a bit about how calls work on x86-64. %rsi
and %rdi
seem
arbitrary, which they are to some extent -- C obeys a platform-specific calling
convention that specifies how arguments get passed in. On x86-64, up to six
arguments come in as registers; after that they get pushed onto the stack. If
you're returning a value, it goes into %rax
.
The return address is automatically pushed onto the stack by call
instructions like e8 <32-bit relative>
. So internally, call
is the same as
push ADDRESS; jmp <call-site>; ADDRESS: ...
. ret
is the same as pop %rip
,
except that you can't pop into %rip
. This means that the return address is
always the most immediate value on the stack.
Part of the calling convention also requires callees to save a couple of
registers and use %rbp
to be a copy of %rsp
at function-call-time, but our
JIT can mostly ignore this stuff because it doesn't call back into C.
for_loop_body: // (a bunch of stuff to set up *src and *dst) cmpl $43, %eax // case '+' je add_branch cmpl $61, %eax // case '=' je assign_branch cmpl $42, %eax // case '*' je mult_branch jmp switch_default // default assign_branch: // the "bunch of stuff" above calculated *src and *dst, which are // stored in -24(%rbp) and -32(%rbp). movq -24(%rbp), %rax // %rax = src movsd (%rax), %xmm0 // %xmm0 = src.r movq -32(%rbp), %rax // %rax = dst movsd %xmm0, (%rax) // dst.r = %xmm0 movq -24(%rbp), %rax // %rax = src movsd 8(%rax), %xmm0 // %xmm0 = src.i movq -32(%rbp), %rax // %rax = dst movsd %xmm0, 8(%rax) // dst.i = %xmm0 jmp for_loop_step add_branch: movq -32(%rbp), %rax // %rax = dst movsd (%rax), %xmm1 // %xmm1 = dst.r movq -24(%rbp), %rax // %rax = src movsd (%rax), %xmm0 // %xmm0 = src.r addsd %xmm1, %xmm0 // %xmm0 += %xmm1 movq -32(%rbp), %rax // %rax = dst movsd %xmm0, (%rax) // dst.r = %xmm0 movq -32(%rbp), %rax // same thing for src.i and dst.i movsd 8(%rax), %xmm1 movq -24(%rbp), %rax movsd 8(%rax), %xmm0 addsd %xmm1, %xmm0 movq -32(%rbp), %rax movsd %xmm0, 8(%rax) jmp for_loop_step mult_branch: movq -32(%rbp), %rax movsd (%rax), %xmm1 movq -24(%rbp), %rax movsd (%rax), %xmm0 mulsd %xmm1, %xmm0 movq -32(%rbp), %rax movsd 8(%rax), %xmm2 movq -24(%rbp), %rax movsd 8(%rax), %xmm1 mulsd %xmm2, %xmm1 subsd %xmm1, %xmm0 movsd %xmm0, -16(%rbp) // double r = src.r*dst.r - src.i*dst.i movq -32(%rbp), %rax movsd (%rax), %xmm1 movq -24(%rbp), %rax movsd 8(%rax), %xmm0 mulsd %xmm0, %xmm1 movq -32(%rbp), %rax movsd 8(%rax), %xmm2 movq -24(%rbp), %rax movsd (%rax), %xmm0 mulsd %xmm2, %xmm0 addsd %xmm1, %xmm0 movsd %xmm0, -8(%rbp) // double i = src.r*dst.i + src.i*dst.r movq -32(%rbp), %rax movsd -16(%rbp), %xmm0 movsd %xmm0, (%rax) // dst.r = r movq -32(%rbp), %rax movsd -8(%rbp), %xmm0 movsd %xmm0, 8(%rax) // dst.i = i jmp for_loop_step for_loop_step: addq $3, -48(%rbp) for_loop_condition: movq -48(%rbp), %rax // %rax = code (the pointer) movzbl (%rax), %eax // %eax = *code (move one byte) testb %al, %al // is %eax 0? jne for_loop_body // if no, then continue leave // otherwise rewind stack ret // pop and jmp
Compilation strategy
Most of the above is register-shuffling fluff that we can get rid of. We're
compiling the code up front, which means all of our register addresses are
known quantities and we won't need any unknown indirection at runtime. So all
of the shuffling into and out of %rax
can be replaced by a much simpler move
directly to or from N(%rdi)
-- since %rdi
is the argument that points to
the first register's real component.
If you haven't already, at this point I'd recommend downloading the Intel software developer's manual, of which volume 2 describes the semantics and machine code representation of every instruction.
NOTE: GCC uses AT&T assembly syntax, whereas the Intel manuals use Intel
assembly syntax. An important difference is that AT&T reverses the arguments:
mov %rax, %rbx
(AT&T syntax) assigns to %rbx
, whereas mov rax, rbx
(Intel
syntax) assigns to rax
. All of my code examples use AT&T, and none of this
will matter once we're working with machine code.
Example: the Mandelbrot function *bb+ab
// Step 1: multiply register B by itself movsd 16(%rdi), %xmm0 // %xmm0 = b.r movsd 24(%rdi), %xmm1 // %xmm1 = b.i movsd 16(%rdi), %xmm2 // %xmm2 = b.r movsd 24(%rdi), %xmm3 // %xmm3 = b.i movsd %xmm0, %xmm4 // %xmm4 = b.r mulsd %xmm2, %xmm4 // %xmm4 = b.r*b.r movsd %xmm1, %xmm5 // %xmm5 = b.i mulsd %xmm3, %xmm5 // %xmm5 = b.i*b.i subsd %xmm5, %xmm4 // %xmm4 = b.r*b.r - b.i*b.i movsd %xmm4, 16(%rdi) // b.r = %xmm4 mulsd %xmm0, %xmm3 // %xmm3 = b.r*b.i mulsd %xmm1, %xmm2 // %xmm2 = b.i*b.r addsd %xmm3, %xmm2 // %xmm2 = b.r*b.i + b.i*b.r movsd %xmm2, 24(%rdi) // b.i = %xmm2 // Step 2: add register A to register B movpd (%rdi), %xmm0 // %xmm0 = (a.r, a.i) addpd %xmm0, 16(%rdi) // %xmm0 += (b.r, b.i) movpd %xmm0, 16(%rdi) // (b.r, b.i) = %xmm0
The multiplication code isn't optimized for the squaring-a-register use case; instead, I left it fully general so we can use it as a template when we start generating machine code.
JIT mechanics
Before we compile a real language, let's just get a basic code generator working.
// jitproto.c #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> typedef long(*fn)(long); fn compile_identity(void) { // Allocate some memory and set its permissions correctly. In particular, we // need PROT_EXEC (which isn't normally enabled for data memory, e.g. from // malloc()), which tells the processor it's ok to execute it as machine // code. char *memory = mmap(NULL, // address 4096, // size PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, // fd (not used here) 0); // offset (not used here) if (memory == MAP_FAILED) { perror("failed to allocate memory"); exit(1); } int i = 0; // mov %rdi, %rax memory[i++] = 0x48; // REX.W prefix memory[i++] = 0x8b; // MOV opcode, register/register memory[i++] = 0xc7; // MOD/RM byte for %rdi -> %rax // ret memory[i++] = 0xc3; // RET opcode return (fn) memory; } int main() { fn f = compile_identity(); int i; for (i = 0; i < 10; ++i) printf("f(%d) = %ld\n", i, (*f)(i)); munmap(f, 4096); return 0; }
This does what we expect: we've just produced an identity function.
$ gcc jitproto.c -o jitproto $ ./jitproto f(0) = 0 f(1) = 1 f(2) = 2 f(3) = 3 f(4) = 4 f(5) = 5 f(6) = 6 f(7) = 7 f(8) = 8 f(9) = 9
TODO: explanation about userspace page mapping/permissions, and how ELF instructions tie into this (maybe also explain stuff like the FD table while we're at it)
Generating MandelASM machine code
This is where we start to get some serious mileage out of the Intel manuals. We need encodings for the following instructions:
f2 0f 11
:movsd reg -> memory
f2 0f 10
:movsd memory -> reg
f2 0f 59
:mulsd reg -> reg
f2 0f 58
:addsd reg -> reg
f2 0f 5c
:subsd reg -> reg
66 0f 11
:movpd reg -> memory
(technicallymovupd
for unaligned move)66 0f 10
:movpd memory -> reg
66 0f 58
:addpd memory -> reg
The gnarly bits: how operands are specified
Chapter 2 of the Intel manual volume 2 contains a roundabout, confusing description of operand encoding, so I'll try to sum up the basics here. (TODO)
For the operators above, we've got two ModR/M configurations:
movsd reg <-> X(%rdi)
: mod = 01, r/m = 111, disp8 = Xaddsd reg -> reg
: mod = 11
At the byte level, they're written like this:
movsd %xmm0, 16(%rdi) # f2 0f 11 47 10
# modr/m = b01 000 111 = 47
# disp = 16 = 10
addsd %xmm3, %xmm4 # f2 0f 58 e3
# modr/m = b11 100 011 = e3
A simple micro-assembler
// micro-asm.h #include <stdarg.h> typedef struct { char *dest; } microasm; // this makes it more obvious what we're doing later on #define xmm(n) (n) void asm_write(microasm *a, int n, ...) { va_list bytes; int i; va_start(bytes, n); for (i = 0; i < n; ++i) *(a->dest++) = (char) va_arg(bytes, int); va_end(bytes); } void movsd_reg_memory(microasm *a, char reg, char disp) { asm_write(a, 5, 0xf2, 0x0f, 0x11, 0x47 | reg << 3, disp); } void movsd_memory_reg(microasm *a, char disp, char reg) { asm_write(a, 5, 0xf2, 0x0f, 0x10, 0x47 | reg << 3, disp); } void movsd_reg_reg(microasm *a, char src, char dst) { asm_write(a, 4, 0xf2, 0x0f, 0x11, 0xc0 | src << 3 | dst); } void mulsd(microasm *a, char src, char dst) { asm_write(a, 4, 0xf2, 0x0f, 0x59, 0xc0 | dst << 3 | src); } void addsd(microasm *a, char src, char dst) { asm_write(a, 4, 0xf2, 0x0f, 0x58, 0xc0 | dst << 3 | src); } void subsd(microasm *a, char src, char dst) { asm_write(a, 4, 0xf2, 0x0f, 0x5c, 0xc0 | dst << 3 | src); } void movpd_reg_memory(microasm *a, char reg, char disp) { asm_write(a, 5, 0x66, 0x0f, 0x11, 0x47 | reg << 3, disp); } void movpd_memory_reg(microasm *a, char disp, char reg) { asm_write(a, 5, 0x66, 0x0f, 0x10, 0x47 | reg << 3, disp); } void addpd_memory_reg(microasm *a, char disp, char reg) { asm_write(a, 5, 0x66, 0x0f, 0x58, 0x47 | reg << 3, disp); }
Putting it all together
Now that we can write assembly-level stuff, we can take the structure from the prototype JIT compiler and modify it to compile MandelASM.
// mandeljit.c #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #include "micro-asm.h" #define sqr(x) ((x) * (x)) typedef struct { double r; double i; } complex; typedef void(*compiled)(complex*); #define offsetof(type, field) ((unsigned long) &(((type *) 0)->field)) compiled compile(char *code) { char *memory = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); microasm a = { .dest = memory }; char src_dsp, dst_dsp; char const r = offsetof(complex, r); char const i = offsetof(complex, i); for (; *code; code += 3) { src_dsp = sizeof(complex) * (code[1] - 'a'); dst_dsp = sizeof(complex) * (code[2] - 'a'); switch (*code) { case '=': movpd_memory_reg(&a, src_dsp, xmm(0)); movpd_reg_memory(&a, xmm(0), dst_dsp); break; case '+': movpd_memory_reg(&a, src_dsp, xmm(0)); addpd_memory_reg(&a, dst_dsp, xmm(0)); movpd_reg_memory(&a, xmm(0), dst_dsp); break; case '*': movsd_memory_reg(&a, src_dsp + r, xmm(0)); movsd_memory_reg(&a, src_dsp + i, xmm(1)); movsd_memory_reg(&a, dst_dsp + r, xmm(2)); movsd_memory_reg(&a, dst_dsp + i, xmm(3)); movsd_reg_reg (&a, xmm(0), xmm(4)); mulsd (&a, xmm(2), xmm(4)); movsd_reg_reg (&a, xmm(1), xmm(5)); mulsd (&a, xmm(3), xmm(5)); subsd (&a, xmm(5), xmm(4)); movsd_reg_memory(&a, xmm(4), dst_dsp + r); mulsd (&a, xmm(0), xmm(3)); mulsd (&a, xmm(1), xmm(2)); addsd (&a, xmm(3), xmm(2)); movsd_reg_memory(&a, xmm(2), dst_dsp + i); break; default: fprintf(stderr, "undefined instruction %s (ASCII %x)\n", code, *code); exit(1); } } // Return to caller (important! otherwise we'll segfault) asm_write(&a, 1, 0xc3); return (compiled) memory; } int main(int argc, char **argv) { compiled fn = compile(argv[1]); complex registers[4]; int i, x, y; char line[1600]; printf("P5\n%d %d\n%d\n", 1600, 900, 255); for (y = 0; y < 900; ++y) { for (x = 0; x < 1600; ++x) { registers[0].r = 2 * 1.6 * (x / 1600.0 - 0.5); registers[0].i = 2 * 0.9 * (y / 900.0 - 0.5); for (i = 1; i < 4; ++i) registers[i].r = registers[i].i = 0; for (i = 0; i < 256 && sqr(registers[1].r) + sqr(registers[1].i) < 4; ++i) (*fn)(registers); line[x] = i; } fwrite(line, 1, sizeof(line), stdout); } return 0; }
Now let's benchmark the interpreted and JIT-compiled versions:
$ gcc mandeljit.c -o mandeljit $ time ./simple *bb+ab > /dev/null real 0m2.348s user 0m2.344s sys 0m0.000s $ time ./mandeljit *bb+ab > /dev/null real 0m1.462s user 0m1.460s sys 0m0.000s
Very close to the limit performance of the hardcoded version. And, of course, the JIT-compiled result is identical to the interpreted one:
$ ./simple *bb+ab | md5sum 12a1013d55ee17998390809ffd671dbc - $ ./mandeljit *bb+ab | md5sum 12a1013d55ee17998390809ffd671dbc -
Further reading
Debugging JIT compilers
First, you need a good scotch; this one should work.
Once you've got that set up, gdb
can probably be scripted to do what you
need. I've used it somewhat
successfully
to debug a bunch of hand-written self-modifying machine code with no debugging
symbols -- the limitations of the approach ended up being whiskey-related
rather than any deficiency of GDB itself.
I've also had some luck using radare2 to figure out when I was generating bogus instructions.
Offline disassemblers like NASM and YASM won't help you.
Low-level
-
The Intel guides cover a lot of stuff we didn't end up using here: addressing modes, instructions, etc. If you're serious about writing JIT compilers, it's worth an in-depth read.
-
Agner Fog's guides to processor-level optimization: an insanely detailed tour through processor internals, instruction parsing pipelines, and pretty much every variant of every processor in existence.
-
The V8 source code: how JIT assemblers are actually written
-
Jonesforth: a well-documented example of low-level code generation and interpreter structure (sort of a JIT alternative)
-
Canard machine code: similar to jonesforth, but uses machine code for its data structures
Recommend
-
203
angular 4 mean starter This is a very neat and simple starter kit for an angular 4 MEAN SPA applications. Obviously this won't be perfect for every situation, but I found this really necessary to overcome the fear of the scary blank proje...
-
109
Write a C Compiler! This is a set of C test programs to help you write your own compiler. They were written to accompany this tutorial. Usage
-
22
Simpleton Simpleton is a dead simple UDP to database logging solution that just accepts UDP packets and stores them in an SQLite3 database. Per default it stores them in simpleton.db in the curren...
-
10
A very Simple and Stupid plugin system in pythonA very Simple and Stupid plugin system in python Fri 02 September 2011Two convenience functions for listing and importing python modules : # utils.py import os def plugi...
-
6
Very simple ways to tell if a website is unsafe to register on Now and again, you can hear news stories about large-scale cyber attacks. Many of those are...
-
1
According to Wikipedia GraphQL is an open-source data query and manipulation language for APIs, and a runtime for fulfilling queries with existing data.[2] GraphQL was developed internally by Facebook in 2012 before being...
-
4
Levenshtein A very simple implementation of the Levenshtein distance in Go. The levenshtein distance is a measure of the similarity between two strings. It is the minimum number of single-character edits (insertions, deletions or sub...
-
2
Instantly share code, notes, and snippets. Very simple Python script to dump all emails in an I...
-
6
I Had An AI Chatbot Write My Eulogy. It Was Very WeirdThe inability to deal with heavy topics like death is another reminder that “AI” chatbots are just glitchy funhouse mirrors.by
-
3
Social Follow us on social media for more news, content and background stories from our authors, editors and events. Share your personal experience with us.
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK