Speaking as an Asm programmer for several decades: Calling conventions are stupid. They are the results of mindless stupid-compiler-oriented thinking from a time when compilers produced horrible copy-paste-replace code. The CPU itself couldn't care less which registers you use for what. So many wasted bytes on moving values between registers, just because the calling convention wanted it there, and no other reason. The only need to pay attention to calling conventions is when you're interfacing with compiler-generated code. Modern CPUs are fast, but there's still tons of inefficiency in compiler output.
lmz 4 days ago [-]
> The only need to pay attention to calling conventions is when you're interfacing with compiler-generated code.
So, the vast majority of code out there in the wild?
Spivak 4 days ago [-]
If you're not interfacing with it, say linking it as a library, then it doesn't matter what you do.
lmz 3 days ago [-]
Sure, but this sort of limits the kinds of thing you can realistically build unless you want to build everything from the ground up. Even in the case of code reuse with statically linked assembly files there would be some sort of "convention" about how to call and be called.
almostgotcaught 4 days ago [-]
vast majority doesn't even begin to describe it - i would wager 10 years of my salary that the fraction of all currently running CPU instructions that were handwritten is so small that it's within the margin of error (i.e., random bit flips) for whatever computer you use to perform the count.
RandomBK 4 days ago [-]
Depending on how you count, the ratio might not be that small. A lot of hot code is written in hand-coded inline assembly, so in terms of CPU cycles run it's probably non-negligible.
e.g. take a look at the glibc implementation of `strcmp` [0]
Now how much of that doesn't interface with compiler generated code?
almostgotcaught 4 days ago [-]
> A lot of hot code is written in hand-coded inline assembly
I know... I write GPU assembly for a living... And still I make that wager. It's not a lot. It's not even a little. It's an epsilon (overall). And it gets smaller over time.
userbinator 4 days ago [-]
I mean it only matters at the interface.
timewizard 4 days ago [-]
> The CPU itself couldn't care less which registers you use for what.
Not all registers encode as operands equivalently (implicit rdx:rax, implicit [rbx+al], limited [rbp/r13+imm8]). Some have other encoding restrictions or special purposes (rdi, rsi, rcx). When segmentation was a thing there were different default segments for each. Some are destroyed when certain opcodes are used (syscall: rcx, r11).
> So many wasted bytes on moving values between registers [...] Modern CPUs are fast
Well, they've special cased this anyways, as these will often be caught in the rename stage and not even occupy an execution slot. We've long recognized that passing these values in registers instead of the stack is far more efficient, which is why the `fastcall` convention came about and got its name way back in the x86 days.
> but there's still tons of inefficiency in compiler output.
Which is also why the 'inline' heuristic exists. In which case all of the calling conventions are fully abandoned. I mean, things like ELF dynamic symbol tables, and linux thread local storage annoy me far more than calling conventions ever have.
userbinator 3 days ago [-]
> Well, they've special cased this anyways, as these will often be caught in the rename stage and not even occupy an execution slot
They still need to be fetched and decoded, and take up space in caches and RAM that could be used for more purposeful instructions.
> Which is also why the 'inline' heuristic exists.
Inlining has its own problems too.
> I mean, things like ELF dynamic symbol tables, and linux thread local storage annoy me far more than calling conventions ever have.
Don't get me started on the whole ELF and dynamic linking situation...
almostgotcaught 4 days ago [-]
do people think this is insightful? do you?
> conventions are stupid
all conventions are stupid when examined through the lens of an isolated island dweller. you might as well be saying something like "you only need to drive on the left-hand side of the road when you're driving on public roads".
userbinator 4 days ago [-]
Compilers were stupid, and that's how we ended up with this constant overhead of inefficiency long after they could've done better; it's only within the last decade or so that "custom calling conventions" started being even considered.
almostgotcaught 4 days ago [-]
Calling conventions have no more to do with dumb or smart compilers than driving on the left-hand side of the road has to do with dumb or smart urban planners.
userbinator 4 days ago [-]
Of course they do. An Asm programmer will naturally use the appropriate registers to minimise data movement (see also: PC BIOS interface - no stupid stack shit) depending on the circumstances, a stupid compiler will just push everything on the stack. A more intelligent compiler will behave more like the human programmer and decide how to pass parameters and save or restore registers on a case-by-case basis.
almostgotcaught 4 days ago [-]
[flagged]
josephg 4 days ago [-]
I think you two are arguing past each other.
Calling conventions are obviously needed for syscalls and dynamically linked libraries. I don't think anyone is denying that. But most function calls aren't made to shared code. Almost all function calls are made to private functions, which exist within a binary, and are under the control of the compiler. If I steelman the person you're arguing with, I think what they're claiming is that adhering to any specific calling convention for internal functions results in a lot of dumb assembly.
For example, imagine I have a C program where function a() calls b(). So long as b is confined to my binary, the "calling convention" of b doesn't matter. All that matters is that the compiler knows how to call it, and pass all the arguments. The compiler could jump to the function or call it. It could put parameters in registers or leave them on the stack. An awful lot of executed instructions exist to move parameters into the correct registers, save whatever was using those registers before the call, and restore them afterwards.
If we imagine this simple C code:
int func1(int a, int b);
int func2(int c, int d);

int do_stuff(int a, int b, int c, int d) {
    int i = 1000;
    i += func1(a, b); // a, b passed in rdi, rsi.
    i += func2(c, d); // Call to func2 overwrites rdi & rsi.
    return i;
}
If both func1 and func2 are forced to use the same calling convention, between calls to func1 and func2, the CPU needs to place c and d into whatever registers a and b were in a moment ago. But modern CPUs have lots of registers. If the compiler were more clever, it could just use different registers for the arguments of both functions and avoid shuffling everything around.
(Aside: It's weird how different the assembly is between GCC and clang in this example!)
But - moving values between registers (or between a register and the stack) is crazy fast anyway. I'd love to see some benchmarks showing how much of a difference this optimisation would make in practice.
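For anyone who wants to look at the shuffling directly, here is a self-contained variant of the snippet above. The bodies of `func1` and `func2` are made up for illustration (the original only shows the calls), and `noinline` is a GCC/Clang extension used here only so the calls survive optimisation. Treat it as a sketch for reading the output of `gcc -O2 -S` or `clang -O2 -S`, not as a benchmark.

```c
/* Hypothetical bodies: the original comment never defines these.
   noinline (GCC/Clang extension) keeps both calls visible in the assembly. */
__attribute__((noinline)) int func1(int a, int b) { return a + b; }
__attribute__((noinline)) int func2(int c, int d) { return c * d; }

int do_stuff(int a, int b, int c, int d) {
    int i = 1000;
    i += func1(a, b); /* a, b arrive in the first two argument registers */
    i += func2(c, d); /* c, d must be moved into those same registers    */
    return i;
}
```

With these bodies, `do_stuff(1, 2, 3, 4)` returns 1015. On x86-64 System V the interesting part of the listing is how `c` and `d` are kept alive across the first call and then copied into the argument registers for the second.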
userbinator 4 days ago [-]
You get it. That's exactly what I'm referring to.
> But - moving values between registers (or between a register and the stack) is crazy fast anyway. I'd love to see some benchmarks showing how much of a difference this optimisation would make in practice.
It's fast, but the instructions still occupy space in caches, memory, and decoders; space that could've been used for other more valuable instructions. The problem is that the typical microbenchmarks don't show this sort of inefficiency (and on the other hand make questionable-at-large options like loop unrolling seem great), because they're too small to see the effects of the waste.
josephg 3 days ago [-]
I think you could benchmark that.
Take a big program like chrome, and measure the effect of adding even more meaningless assembly instructions around each call site. If that makes it slower, removing pointless instructions will make it faster by a similar amount.
I’d be fascinated to know the effect size of a test like that.
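A toy version of that experiment might look like the sketch below. Everything in it is an assumption made for illustration: the `PAD` macro uses `nop` as a stand-in for redundant register moves, `work` is a made-up callee, and `time_it` uses `clock()` for simplicity. A loop this small fits entirely in cache, so it cannot show the cache and decoder pressure discussed above; the real test would need the padding applied across a large program.

```c
#include <time.h>

/* Stand-in for pointless instructions around a call site (assumption:
   "nop" is used because it assembles on most targets). */
#define PAD() __asm__ volatile("nop\n\tnop\n\tnop\n\tnop")

/* Hypothetical callee; noinline so there is a real call to pad. */
__attribute__((noinline)) static long work(long x, long y) { return x * 3 + y; }

long sum_plain(int n) {
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc = work(acc, i) & 0xffff;   /* mask keeps the value small */
    return acc;
}

long sum_padded(int n) {
    long acc = 0;
    for (int i = 0; i < n; i++) {
        PAD();                         /* extra work before the call */
        acc = work(acc, i) & 0xffff;
        PAD();                         /* extra work after the call  */
    }
    return acc;
}

/* Runs fn(n) once, stores its result in *out, returns CPU seconds used. */
double time_it(long (*fn)(int), int n, long *out) {
    clock_t start = clock();
    *out = fn(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_it(sum_plain, n, &r)` and `time_it(sum_padded, n, &r)` with a large `n` gives two timings to compare; both variants compute the same result, so only the padding differs.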
isotypic 3 days ago [-]
[dead]
almostgotcaught 3 days ago [-]
> Almost all function calls are made to private functions, which exist within a binary, and are under the control of the compiler
Do you understand that those private functions are often called from multiple places in your private binary? Do you expect the compiler to emit different prologues at each call site? And even if it did, one side (callee) is fixed right? Do you expect the compiler to find the globally optimal layout across all calls? okay sure maybe that's possible if the compiler were an oracle.
There's already a compiler technique for what you're describing: function inlining. The reality is that neither compilers nor compiler engineers are dumb and if you (one) has a bright idea that hasn't been implemented widely then it's highly likely someone has already considered it and there are flaws.
josephg 3 days ago [-]
> Do you understand that those private functions are often called from multiple places in your private binary? Do you expect the compiler to emit different prologues at each call site? And even if it did, one side (callee) is fixed right?
Yes, I understand all that. But this optimisation is clearly possible to write. On the face of it, it sounds similar to the register allocation problem within a function, but it needs to make decisions globally. Rewriting each caller's code would be easy. The hard part would be deciding which registers to use for each emitted function's parameters. You'd have to look at all of the callers to figure out the best choices. And each choice affects all the other choices.
So yes, it would be a very, very difficult optimisation to implement. Especially given how modern compilers are architected, using code units and a linker. The compiler needs to look at the functions "all at once".
Inlining isn't the same. But I agree that you'd get the most benefit from this optimisation in small, hot functions. If those functions are already being inlined, all of the benefit of this optimisation would disappear.
> The reality is that neither compilers nor compiler engineers are dumb and if you (one) has a bright idea that hasn't been implemented widely then it's highly likely someone has already considered it and there are flaws.
Compiler engineers are smart. But they were also smart 20 years ago, and it's clear in hindsight that the compilers of the time left a lot of performance on the table. I bet we're still leaving a lot of performance on the table.
The question in my mind is simply, would the juice be worth the squeeze? It sounds like we agree that it would be a very difficult optimisation to implement. I think the commenter above is right that it would make programs smaller. The remaining question is, would it make them significantly faster? Would it make programs faster enough to justify the implementation complexity?
I personally doubt it. Look at Rust. Rust binaries are often much bigger than their C equivalents because of bounds checks and various other runtime checks. But in my experience, the bloated binary sizes don't seem to make much of a performance difference. If anything, rust code is often (somehow) slightly faster than the equivalent C when I've measured it. (And this was true even before noalias was enabled.)
But it would be a cool thing to try out. Godspeed to anyone interested in giving it a go.
antics 4 days ago [-]
Since no one seems to be pushing back I'll add my 2¢ here as a former compilers engineer. Calling conventions are just like any other style guide. Yes, any particular coding style is stupid, but it's still useful to have one that you are more or less committed to, especially if everyone else in the ecosystem is committed to it too.
Frame pointers are a great example. Having a well-known and generic representation of %rbp is helpful when you go to use or integrate with existing tools like debuggers, link editors, or (say) most of the existing LLVM/GCC/whatever toolchain. Or when you want to expose a stable ABI to consumers for whatever reason (as, e.g., the Linux kernel famously does). Or, or, or.
I think it's reasonable to say this has been mostly uncontroversial since at least the 90s. The discussion has changed a bit since (apparently) Go needed none of these things to succeed—not the LLVM compiler toolchain infrastructure, and also not the user-facing things like the debuggers. To hear Russ Cox tell the tale, this is mostly because they required flexibility, and I suppose they were right, since they did rewrite their linker 3 times, and sure enough, 15 years later, in the year of our lord 2024, most debugging in Go seems to happen by writing ASCII-shaped bytes to some kind of a file, somewhere, and then using the world's most expensive full text search engine to get those bytes so you can physically read them on a screen. A debugger does seem limited in use for that specific workflow, so maybe that was the right call, who knows.
Anyway, I don't think there's ever been any real doubt that something like LLVM imposes a serious integration cost, but now that we have Go, the discussion has mostly shifted to "is it worth it", and seemingly the answer is "mostly yes" since nearly everyone building a new and hip native language uses LLVM or something like it. Every language is different, YMMV, etc., but I personally don't hear a lot of complaining about what a bummer it is that all these tools work pleasantly together instead of secretly sabotaging each other by loading up FP with whatever cursed data Go wanted to use it for. And why would they?
What is more mysterious to me is how an actual assembly programmer came to defend Go's stance on ... assembly. Perhaps I'm the only one who reads these things, but at various points in the ASM docs[1] (which I will henceforth call "Mr Pike's wild ride") the author expresses a view that I think is reasonably well-described as "a tempered but pretty much open contempt for the practice as a whole". cf.,
> Instructions, registers, and assembler directives are always in UPPER CASE to remind you that assembly programming is a fraught endeavor. (Exception: the g register renaming on ARM.)
Even if your feelings are hard to hurt though, if you ever crack open the toolchain and attempt to read the golang kind-of-IL-kind-of-x86, it is hard to walk away thinking "these people really get me and my profession". DI and FP are both normal registers! It uses the unicode interpunct instead of the plain old dot operator! It uses NIL instead of NULL! It is one thing to say calling conventions are stupid, but it's another thing entirely to give a great big hug to an almost-but-not-quite-assembly-code that is convenient neither for humans to type nor for tools to consume.
For functions that don't escape the current compilation unit (`static` functions, anonymous namespace functions), can/do compilers ignore calling conventions and do the faster thing? Of course, they can just inline, and that makes this moot.
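On that question: to a limited extent, yes. GCC has `-fipa-ra`, which lets a caller skip saving caller-saved registers that a callee in the same compilation unit is known not to touch, and LLVM can switch internal functions whose address is never taken to its `fastcc` convention. How much either changes the generated code depends on target and version. A minimal sketch for comparing the two cases with `-O2 -S` follows; the function names are hypothetical.

```c
/* Internal linkage: no other translation unit can call add3, so the
   compiler sees every caller and is free to deviate from the platform ABI.
   noinline (GCC/Clang extension) is here only so a call actually remains. */
__attribute__((noinline)) static int add3(int a, int b, int c) {
    return a + b + c;
}

/* External linkage: unknown callers may exist, so this entry point has to
   follow the platform calling convention. */
int public_entry(int x) {
    return add3(x, x + 1, x + 2);
}
```

With these definitions `public_entry(1)` returns 6. In the assembly, the compiler may also specialise or clone `add3`, which is another way of departing from the standard convention for a local function.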
malkia 4 days ago [-]
GoLang assembly boggles my mind - I understand why it's there, but having looked at it a few times makes me wonder if it could've been prevented somehow (I guess not, cryptographic primitives would be way too slow, redirecting them through some kind of ffi would require a shared lib, yada yada yada)...
kristianp 4 days ago [-]
Intrinsics would be a great quality of life improvement for low-level optimisations. They don't require understanding register allocation, but obviously they would add complexity to the compiler and they aren't cross-architecture. I have tried some tools that convert a C function with intrinsics to Go assembly, but they were buggy for my use case [1],[2].
They could have added crypto primitives via intrinsic, or had some other way of including the edge case functionality it solves.
But it's good enough and I guess it compiles quickly, which was a major goal for golang.
tptacek 4 days ago [-]
Major Rust cryptography libraries (see for instance Ring) use assembly, too. It's a pretty normal thing to do.
nimish 4 days ago [-]
Sure but golang has its own special assembly flavor rather than using standard gcc flavor inline assembly. Probably because it's a soup to nuts compiler but still.
tptacek 4 days ago [-]
The point of this article is that Go-specific assembler generators (Avo in particular) are better than standard assembly for this purpose.
nimish 4 days ago [-]
That doesn't preclude syntactic compatibility, does it?
Veserv 4 days ago [-]
This is not at all comparable to inline assembly which interleaves two different languages into the same source file.
The presented examples are just a straight distinct assembly language associated with the golang ecosystem used in their own dedicated source files called via a FFI. This is comparable to just writing a pure assembly file and linking it into your program which is actually a much more reasonable thing to do than the insanity of inline assembly.
The problems being highlighted are just cases of people who do not understand ABIs and ABI compatibility. This is extremely common when crossing a language boundary due to abstraction mismatches and is made worse, not better, by doing even more magic behind the scenes to implicitly paper over the mismatches.
citizenpaul 4 days ago [-]
>soup to nuts compiler
Any chance you can explain that to rubes like me?
tptacek 4 days ago [-]
Go isn't built on an existing compiler framework like LLVM. It does its own code generation, has its own assembler.
nwokon 4 days ago [-]
There is an accident of history here. Go was developed with the plan 9 C compiler suite as a starting point. Most notably those compilers did not generate assembler -- they emitted object code directly. This is described here: https://9p.io/sys/doc/compiler.html. The assembler facilitated transforming hand-written assembly to object code. And here the plan 9 folks chose a new syntax, probably because it was simpler to start afresh over using the existing "AT&T" or "Intel" syntax.
nimish 3 days ago [-]
Typical plan 9, change for the sake of change. Second system effect writ large.
kibwen 4 days ago [-]
It's kinda weird that languages (or at least languages with pretensions to cryptography) are still forcing people to resort to asm directly rather than offering some sort of first-class support for constant time operations and not leaving secrets lying around in memory. It doesn't need to be super high level, it just needs to clear the infinitely low bar of assembly language. Does any language offer such a dedicated facility?
wbl 4 days ago [-]
That's not why cryptographers use assembly. We use assembly because performance often requires instructions the compiler will never use that the CPU maker makes for us. Intrinsics invite all sorts of spilling issues and aren't quite as good.
matheusmoreira 4 days ago [-]
As someone who's made a very simple language, I would say there are far too many moving parts involved to guarantee anything of the sort. It's probably better to just integrate libsodium.
Interpreters will literally switch on the type of things in order to figure out what to do with the value. They've lost the side channel battle before it even began. Compilers? Who knows what sort of code they will generate? Who knows how many of your precautions they will delete in an effort to "optimize"? Libsodium has its own memory zeroing function because compilers were "optimizing" the usage of the standard ones.
If you're writing anything cryptography related, you probably want to be talking directly to the processor which will be running your code. And only after you've studied the entire manual. Because even CPUs have significant gaps in the properties they guarantee and the conditions they guarantee them in.
Cryptographers might even consider lowering the level even further. They might want to consider building their own cryptoprocessor that works exactly like they want it to work. Especially if you need to guarantee things like "it's impossible to copy keys and secrets". I own three yubikeys for the sole purpose of guaranteeing this.
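To make the memory zeroing point concrete: a plain `memset` on a buffer that is never read again is a dead store, and an optimising compiler is allowed to delete it. The sketch below shows one common workaround, writing through a `volatile` pointer. `secure_zero` is a hypothetical helper written for this illustration, not libsodium's implementation; in real code a vetted routine such as libsodium's `sodium_memzero` or the platform's `explicit_bzero` is the safer choice.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from any library. Each store goes through
   a volatile lvalue, so the compiler must treat it as an observable side
   effect and cannot drop it as a dead store. */
void secure_zero(void *buf, size_t len) {
    volatile unsigned char *p = (volatile unsigned char *)buf;
    while (len--)
        *p++ = 0;
}
```

Typical use is `secure_zero(key, sizeof key);` immediately before the buffer goes out of scope. This only addresses the copy in that buffer; copies the compiler left in registers or spilled elsewhere on the stack are not covered, which is part of the argument above for dropping down to assembly.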
wbl 4 days ago [-]
Interpreters don't need to have dynamic typing: for example, the JVM and the interpreters before JIT. Even with dynamic types there are some spectacularly clever tricks people use: Smalltalk VMs are where they were invented and practiced a bunch.
In crypto code, branching is exactly what you don't want to do to guarantee security. Branches go both ways if an attacker can force a misspeculation, and microarchitectural state is not rolled back because it can't be.
the8472 4 days ago [-]
Most crypto wants constant-time execution, and optimizing compilers are not designed with that in mind. They have optimization passes that will happily turn your carefully crafted constant-time code back into branches when their heuristics deem that profitable.
Currently the most reliable way to get exactly the assembly you want is to write the assembly you want.
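For readers who haven't seen it, "carefully crafted constant-time code" in C usually means replacing secret-dependent branches with bit masks, roughly as in the sketch below. The function names are hypothetical, and nothing in the C standard stops a compiler from turning these back into branches, so the generated assembly still has to be checked for each compiler and flag combination.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the buffers are equal, 0 otherwise. Every byte is examined
   regardless of where the first difference is, so there is no early exit
   that depends on the data. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);
    /* diff == 0: 0 - 1 wraps to 0xFFFFFFFF, top bit set, result 1.
       diff  > 0: diff - 1 is at most 254, top bit clear, result 0. */
    return (int)(((uint32_t)diff - 1u) >> 31);
}

/* Branch-free select: returns x when choose is 1, y when choose is 0. */
uint32_t ct_select(uint32_t choose, uint32_t x, uint32_t y) {
    uint32_t mask = (uint32_t)0 - choose; /* 0 -> 0x00000000, 1 -> 0xFFFFFFFF */
    return (x & mask) | (y & ~mask);
}
```

`ct_select` expects `choose` to be exactly 0 or 1; other values give a meaningless mix of the two inputs.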
[0] https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/m...
Godbolt: https://c.godbolt.org/z/zqxbb447e
[1]: https://go.dev/doc/asm
[1] github.com/minio/c2goasm (no longer updated)
[2] https://github.com/gorse-io/goat
It's not at all weird that the language authors needed assembly to implement such a thing. They figured out the tricky bits so you don't have to.