Curve25519 implementation is both the most blessed and cursed code in the world.
On one hand, it has the cleanest math and attracted the best minds to work on it. Now the code deployed in most projects have been fully optimized, and formally verified to be mathematically correct and secure. One of the biggest accomplishment in crypto applications in recent years.
On the other hand, the monstrous, machine-generated assembly or C code will give any unsuspected programmer a heart attack.
Seriously, "you are not expected to understand this."
BTW, human-understandable implementations do exist...
/me now trying to implement it in PDP-11 assembly...
This paper is even better, fully annotated commentary of TweetNaCl's X25519 code. https://martin.kleppmann.com/papers/curve25519.pdf
Now I need to find a way to do carry propagation on smaller limbs. Two papers all "cheated" by representing 256-bit bignum using 16 x 64-bit integers to ensure they never overflow, avoiding the need to carry completely, but it would be too slow on the PDP-11..
Michael Hutter & Peter Schwabe's code for 8-bit Atmel chips look like a good start, but it also features multi-thousand lines of fully-unrolled and semi-machine generated code. No macros, the same thing just repeats over and over again, not the easiest thing to read... 🙈
Just figured out the One Weird Trick to add numbers mod 2^255 - 19 for Curve25519. The actual normalization is only needed at the last moment, meanwhile, during the many calculations, you don't reduce results to mod 2^255 - 19, but to mod 2^256 - 38. Now the computer can calculate mod 2^256 automatically due to integer overflow. The only problem is the result is off by 38, so you simply add 38 if the most significant byte has overflowed. Neat.
I was staring at the screen for an entire afternoon and trying to understand why the carry propagation is performed twice in Curve25519 modular additions, no idea at all... Why is the second carry necessary?
Later, after a web search, I found a historical OpenSSL bug caused by not doing the second carry... Okay, I'm glad I'm not the only one who doesn't understand it.
Bignum with 15-bit integers work surprisingly well. It only takes 2 instructions to simulate a carry. And coincidentally, Curve25519 works on a 255-bit finite field, not 256-bit, so a field element is representable in exactly 17 words and the fast modular reduction trick still works. It's also naturally represented in octal, PDP-11's preferred number system.