Follow

I went down a deep rabbit hole (and I'm still falling). It started out with the idea of trying to compile into LLVM IR (and maybe WASM). Now I've jumped ships mid-course and changed to C as a target; I will probably, out of sheer perverseness, do Fortran as well. So this is decompilation of a stack-based assembly language into a non-stack-based higher-level language.

It's one of these "how hard can it be?" things.

The answer in short is, it's impossible; but there are degrees of approximation, and they are increasingly difficult. So it's a lot of fun.


Β· Β· Web Β· 5 Β· 6 Β· 15

My compiler now works on the Uxntal "Hello, world" so I guess that's it, job done ^_^

Let me introduce "nito", a proof-of-concept compiler from to C.

You can find it at codeberg.org/wimvanderbauwhede.

I might still do an actual LLVM IR backend, and will almost certainly do Fortran as it is too good to let pass.

@wim_v12e Factor used an SSA compiler, so it should be somewhat possible, but if I remember correctly, the optimization guidelines say that some functions do need to use the stack because the compiler doesn't know ahead of time what they will do. πŸ€”
I guess it falls into halting problem / Rice's theorem territory?

@csepp Yes, that's it. In Uxntal you can compute the address you want to jump to. So the only way to know that value for sure is to run the program, and you can't be sure that it will every terminate. There is of course an obvious way around that which is to delegate all such constructs to runtime, by integrating the VM into the compiler. So when I say "impossible", it means "impossible to create a direct translation to a language that does not allow jumps to computed addresses".
I am a bit tempted to allow e.g. linear operations on constants, as these are mostly likely the most common ones, but I'm not sufficiently motivated to implement that.

My approach is indeed to create SSA-style registers, very much like LLVM does. But looking at LLVM IR in more detail, there is actually very little to be gained by using it as a target rather than C. And in C I don't need to insert phi functions.

@wim_v12e llvm ir and wasm are both stack based right?

@charlag LLVM IR is register based, you can't access the stack directly.
WASM is superficially stack-based. I say superficially because there are hardly any stack manipulation operations, you can only implement them via registers. Also, loops are register-based and can only return a single value. So in practice you still have to do a stack-to-register transformation for all loops.
I was a bit disappointed at WASM, also because it only supports 32-bit values, nothing smaller; and it is an odd mix of s-expressions and stack syntax.
I am definitely not targetting it directly because there is an LLVM backend for WASM.

@lanodan Ah, nice, I didn't know about that. Looks a bit like a middle ground between C and LLVM IR. It would be a great target for my compiler.

@wim_v12e It's because uxntal can self-modify, right? Because the program and the data live in the same memory and it's entirely mutable...

I thought about similar things, and i came to the conclusion that if i made something like uxn and tried to make a runtime that translated it into machine code for efficiency, i'd have to do enough static analysis to compile a lump of the code all at once and have "oops, you jumped to an unknown place, stop and recompile starting there" on the edges, or something along those lines. I'm sure it gets even more complicated/intractible fast, though.

@autumnal I think even if uxntal was not self-modifiable, you could still write tal programs that can't be compiled to a language with static labels.
It would even be the case if the program lived in a separate memory space. The fundamental issue is that in tal you can calculate at run time to which token of the program you want to jump. And such a run-time calculation is unpredictable at compile time.

You can get around this by combining the compiler with the VM: when you detect a computed jump, you could generate code that runs the VM from that address. But I don't want to do that.

That the code is self-modifiable makes it even worse of course, because even with static labels you could create new code that is only known at run time.

The only way to support that is again by integration of the VM. With my compiler, any such attempt will be a segfault because the code is not stored in the data memory.

@wim_v12e @autumnal Actually, there's a way to do it without integrating the compiler and the VM: generate every possible instruction in the host (or, rather: have a function for each), and make running the code parameterized calls to that.

It's sort of a half-way step between interpreting and recompiling. You have an interpreter, and you "compile" the machine code into interpreter calls. It trims down the biggest inefficiency with interpreters, at least, since you don't have to decode at rt.

@wim_v12e

> on_reset_while: 1; /* label needs statement */

That's horrible. I love it! :D

@pixelherodev 🀣
That's me being lazy, because I use combined declarations + statements and that is not a valid target.

@wim_v12e I'm aware! :)

I've written some transpilers targeting C for fun; my solution was, more convolutedly, to track whether a label was needed :P

@pixelherodev Ah, I should probably do too. But laziness is a virtue πŸ˜„

@wim_v12e It's a matter of axioms, really: do you intend for the output of your transpiler to be machine code that happens to be valid input to a C compiler, or to be C language code?

If the former, the "lazy" approach is actually the *correct* one. You shouldn't optimize for human readability for the C, because it's basically trash. Its only purpose is to be lowered to QBE/GIMPLE/LLVM/SDIR/etc and from there to machine code, which will be more readable than the C anyways.

@pixelherodev I prefer the former, that is why I'll probably go to LLVM IR directly (but after I've done Fortran ^_^).

@pixelherodev By the way thanks for looking at it in that level of detail!

@wim_v12e Well I wasn't going to comment _without_ at least reading the README! >_>

@pixelherodev You're way ahead of lots of people on the internet with that attitude :-D

Sign in to participate in the conversation
Cybrespace

cybrespace: the social hub of the information superhighway jack in to the mastodon fediverse today and surf the dataflow through our cybrepunk, slightly glitchy web portal support us on patreon or liberapay!