@theruran @psf Worth reading the original paper on defective CPUs. https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf
> A deterministic AES miscomputation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.
This reminds me of a story from a flatmate who repaired machines for a supplier that had a warehouse in North London.
Their workflow meant that the desktop machines would arrive at the warehouse and be stored there before being sent to the test lab where my flatmate worked.
They would test the machines, then send them back to a different part of the warehouse before the machines were sent to the customers' offices.
When the machines were in the offices they would start crashing after 3 days.
So replacements were sent out, and the original machines would be sent back to the warehouse where they were stored until they could be tested again.
All of the tests came out fine, so they were sent back to the warehouse before being sent out to a different set of customers.
Rinse and repeat for several iterations, before they got serious about tracing the problem.
Eventually someone noticed that the machines that were failing had a common element.
They were using +/-10% rated-value resistors.
When they started testing the resistors, they found that ALL of the measured values were either 5% to 10% below the rated value or 5% to 10% above it.
NONE of them were in the centre bands.
That's when they worked out that the resistor manufacturer had been cherry-picking the resistors from the manufacturing process.
All of the most accurate resistors went into the +/-0.1% product line, the next batch went into the +/-2% product line, then +/-5%, and what was left went into the +/-10% product line that my flatmate came across.
He ordered batches directly from a range of resistor suppliers.
ALL of the resistor manufacturers were doing it.
If you wanted an accurately-specced resistor, you had to buy the most expensive resistors, otherwise you were just having to guess whether the components would work on the circuit boards.
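To make the cherry-picking effect concrete, here is a rough Python sketch of that kind of sorting. The Gaussian spread, the sigma of 6% and the sample count are my own assumptions for illustration, not anything measured, and the function names are made up. Each simulated resistor goes into the tightest band it qualifies for, using the band limits from the story.

```python
import random

# Tolerance bands from the story, tightest first, as fractions of rated value.
BANDS = [0.001, 0.02, 0.05, 0.10]

def bin_resistor(deviation):
    """Return the tightest band that covers this deviation, or None for scrap."""
    for band in BANDS:
        if abs(deviation) <= band:
            return band
    return None

def simulate(n=100_000, sigma=0.06, seed=1):
    """Sort n simulated resistors into bands.

    Assumption: deviations from rated value follow a Gaussian with the
    given sigma. Real production spreads may look quite different.
    """
    rng = random.Random(seed)
    bins = {band: [] for band in BANDS}
    for _ in range(n):
        deviation = rng.gauss(0.0, sigma)
        band = bin_resistor(deviation)
        if band is not None:
            bins[band].append(deviation)
    return bins

bins = simulate()
loose = bins[0.10]
# Smallest and largest absolute deviation sold as "+/-10%" parts.
print(min(abs(d) for d in loose), max(abs(d) for d in loose))
```

Under these assumptions every part left in the +/-10% bin is more than 5% away from its rated value, because anything closer has already been taken for a tighter band. That matches the hole in the middle that they found.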
The reason that the PCs were working in the test lab, but not in the customers' offices, is that they didn't get the chance to warm up enough to fail, as the warehouse was unheated, but the offices were at room temperature.
It wouldn't surprise me if the CPU manufacturers were doing the same.
Test the chips and sell the most accurate versions at the highest prices, and have a set of band ranges for the rest.
I know that Intel WAS doing this in the early 90s, but changed the way they were doing things after they were sued by some banks that had spent a LOT of money buying the math co-processors that failed if you pushed them too far.
Someone at the CPU manufacturer has fired the staff that knew this failure mode, and there's been a corresponding loss of institutional memory.
The CPU manufacturer has been banding the chips to increase their margins by creating differential product lines.
Someone at the computer manufacturer has been trying to improve their margins by buying the cheaper chips.
Someone at Google/FB has been shaving their costs by buying cheaper machines.