basically hacker news went down yesterday, and in this screenshot someone chimes in saying they had a batch of SanDisk-manufactured SSDs all get bricked after exactly 40,000 hours (about 4.5 years) of uptime, because an internal counter overflowed and corrupted the SSD's internal state.
someone from hackernews replies and says that the SSDs HN was hosted on were in fact SanDisk Optimus Lightning IIs and almost exactly 4.5 years old.
never trust a firmware
@artemis ...I'm gonna have to check what brand my SSDs are 😬
I know one of them is a Samsung but I don't know about the other one
@artemis Oh I don't have two SSDs anymore, apparently. Strange. Legit confused about what happened to the second one.
@hazelnot @artemis it’s not as simple as “choose the right brand” anymore. i worked for a distributed database company a while back (think: Backblaze, but self-hosted for large customers). we never certified whole brands for use in our product: we certified individual drive models + FW builds, because every brand out there has bugs somewhere in their stack: op reordering that causes writes to be dropped under specific edge cases, supposedly “atomic” operations leaving internal drive state corrupted after power loss, etc. the brand you reference is not immune 🙃
@artemis I still remember a few years ago when I woke up, went to school... just to find out my laptop gave a "No operating system" error (even though I used it literally the evening before)...
Came to notice that suddenly my 80GB Intel 320 SSD turned into an 8MB brick...
Texted a friend of mine... Just to find out there was a relatively wide-spread bug in the firmware that caused the drive to basically nuke itself :|
@artemis my first SSD was a Crucial that had an integer overflow error that basically made it *disappear* from the system. Had to boot into DOS to patch it. Most of my coworkers were just saying I should replace it, but I still remember DOS enough to use the tooling.
@artemis Yikes... Also that's why I like to recommend folks choose a different vendor, or at least a different drive model, for their redundant or backup data storage. I've heard my share of stories about folks buying drives from the same batch in bulk and then having them fail in rapid succession over a few days or sometimes hours. Had a NAS with spinning rust drives once where all 6 drives died within 6 days. Each time I swapped one and restored the raid the next one would fail shortly after.
yes! I was bitten by a seagate firmware bug some years back. 6 drives in two boxes making up a DRBD. I think it was 2 out of the 6 affected. Fortunately for me, the Secondary took over from the Primary and limped on a degraded RAID while the drives were cross-shipped to me.
@artemis this makes me actually IRL upset. Why do we all put up with this kind of thing, why is this allowed to happen
@artemis Seagate SV35 drives have a similar counter for number of reads, and if you exceed that counter, it bricks the control firmware. they did that on purpose though.
@artemis SV35 are cheap drives designed to be basically write-only, for things like surveillance camera recording, so they artificially limited the drives
@Fishou @sebsauvage @artemis ah no, different issue it seems : https://www.thetruthaboutcars.com/2019/10/tesla-troubles-models-bricking-over-flash-memory-problem/ (here it was about writing too much)
@artemis Firmware updates to fix this issue were published by Dell and HPE two years ago, but they must be applied *before* the SSDs reach 40,000 hrs uptime, otherwise both the data and the SSD itself will be rendered unusable
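A rough sketch of the "patch before 40,000 hours" check described above, in Python. The 40,000 figure is the one quoted in this thread; the function names and the 90-day margin are made up for illustration, and the power-on-hours value is assumed to be something you have already read off the drive yourself (for example from its SMART data), not something this code queries.

```python
THRESHOLD_HOURS = 40_000  # uptime figure quoted in this thread


def hours_until_threshold(power_on_hours: int, threshold: int = THRESHOLD_HOURS) -> int:
    """Hours of uptime left before the threshold (negative if already past it)."""
    return threshold - power_on_hours


def needs_urgent_update(power_on_hours: int, margin_hours: int = 24 * 90) -> bool:
    """True if the drive is within margin_hours of the threshold but has not reached it.

    margin_hours is a hypothetical safety window (90 days here), not a vendor value.
    """
    remaining = hours_until_threshold(power_on_hours)
    return 0 < remaining <= margin_hours


# Example: a drive that has been powered on for 39,000 hours
print(hours_until_threshold(39_000))  # 1000
print(needs_urgent_update(39_000))    # True
```

A drive that is already past the threshold returns False here on purpose: per the post above, at that point the update no longer helps.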
@artemis This... is concerning. The Samsung SSD on my computer still works after 8 loyal years of service. I have an ADATA SSD on an XP rig I rescued (which I admit I chose because they had software that could run TRIM on XP). Any reports on either brand spontaneously failing?
@artemis (realizes that his work laptop uses a Western Digital SSD, and WD also owns SanDisk) This cannot end well for my work laptop...
@artemis I would replace "Never trust firmware" with "Never trust proprietary firmware".
Sadly it's very difficult to find real free/libre hardware. I don't know if RISCV would change this in the future.
@jrballesteros05 nah i don't trust libre firmware either. firmware is software and software is treacherous
OCR Output (chars: 1354)
kabdib 7 hours ago | unvote | next [-]
I once had a small fleet of SSDs fail because they had some uptime counters that
overflowed after 4.5 years, and that somehow persistently wrecked some internal data
structures. It turned them into little, unrecoverable bricks.
It was not awesome seeing a bunch of servers go dark in just about the order we had
originally powered them on. Not a fun day at all.
mikiem 2 hours ago | parent | next [-]
You are never going to guess how long the HN SSDs were in the servers... never ever...
OK... I'll tell you: 4.5 years. I am not even kidding.
kabdib 2 hours ago | unvote | root | parent | next [-]
Let me narrow my guess: They hit 4 years, 206 days and 16 hours. . . or 40,000 hours.
And that they were sold by HP or Dell, and manufactured by SanDisk.
Do I win a prize?
(None of us win prizes on this one).
mikiem 2 hours ago | root | parent | next [-]
These were made by SanDisk (SanDisk Optimus Lightning II) and the number of hours is
between 39,984 and 40,032... I can't be precise because they are dead and I am going off
of when the hardware configurations were entered in to our database (could have been
before they were powered on) or when we handed them over to HN, and when the disks
Unbelievable. Thank you for sharing your experience!
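For anyone wanting to double-check the "4 years, 206 days and 16 hours" figure from the screenshot, here is a small Python sketch of the arithmetic. It assumes 365-day years (no leap days), and `split_hours` is just an illustrative helper name.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap days are ignored


def split_hours(total_hours: int) -> tuple[int, int, int]:
    """Break a number of hours into (years, days, hours)."""
    days, hours = divmod(total_hours, HOURS_PER_DAY)
    years, days = divmod(days, DAYS_PER_YEAR)
    return years, days, hours


print(split_hours(40_000))  # (4, 206, 16)
```

Under the same assumption, the 39,984 to 40,032 hour range quoted in the screenshot spans 48 hours, from 16 hours before the 40,000 mark to 32 hours after it.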