Cursed homelab update:

I learned a _lot_ about Rook and how it manages PersistentVolumes today while getting my PiHole working properly. (Rook is managed Ceph in Kubernetes)

In Kubernetes, the expectation is that your persistent volume provider has registered a CSI driver (Container Storage Interface) and defined StorageClasses for the distinct "places" where volumes can live. You then request a volume by creating a PersistentVolumeClaim (PVC), which describes a single volume managed by a StorageClass. The machinery behind this then automatically creates a PersistentVolume to represent the underlying storage. You can create PersistentVolumes manually, but this isn't explored much in the documentation.

In Rook, this system is mapped onto Ceph structures using a bunch of CSI drivers. The default configuration defines StorageClasses for RBD images and CephFS filesystems. There are also CSI drivers for RGW and NFS backed by CephFS. You then create PVCs the normal way using those StorageClasses and Rook takes care of creating structures where required and mounting those into the containers.
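
For illustration, a dynamically provisioned CephFS volume is just a PVC that names one of those StorageClasses. This is only a sketch: the PVC name, size and StorageClass name below are placeholders, not necessarily what a given Rook install defines.

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pihole-data              # placeholder name
spec:
  accessModes:
    - ReadWriteMany              # CephFS supports RWX; RBD volumes are typically ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: rook-cephfs  # placeholder: whichever CephFS StorageClass your Rook setup defines
EOF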

However, there's another mechanism which is much more sparsely mentioned and isn't part of the default setup: "static provisioning". You see, Ceph clusters are used to store stuff for systems that aren't Kubernetes, and people tend to organise things in ways that the "normal" CSI driver + StorageClass + PVC mechanism can't understand and shouldn't manage. So if you want to share that data with some pod, you need to create specially structured PersistentVolumes to map those structures into Kubernetes.

Once you set up one of these special PersistentVolumes and attach it to a pod using a PVC, you effectively get a "traditional" cephfs volume mount, but using Rook's infrastructure and configuration, so all you need to specify is the authentication data and the details for that specific volume and you're done.

The only real complication is that you need a separate secret for this, but chances are you're referencing things in separate places from the "normal" StorageClass stuff and giving Rook very limited access to your storage, so this isn't a big deal.
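
To make that a bit more concrete, here's a rough sketch of what one of these static PersistentVolumes looks like, following the ceph-csi static provisioning layout. The driver name, clusterID, filesystem name, path and secret are all placeholders rather than values from my cluster.

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: existing-media                      # placeholder
spec:
  capacity:
    storage: 1Ti                            # informational only for a static volume
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain     # deleting the PV leaves the CephFS data alone
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com   # placeholder: "<operator namespace>.cephfs.csi.ceph.com"
    volumeHandle: existing-media            # any unique string
    volumeAttributes:
      clusterID: rook-ceph                  # placeholder
      fsName: myfs                          # placeholder: the CephFS filesystem name
      rootPath: /media                      # placeholder: the existing directory to expose
      staticVolume: "true"
    nodeStageSecretRef:
      name: cephfs-static-user              # placeholder: the separate secret holding userID/userKey
      namespace: rook-ceph                  # placeholder
EOF

A PVC then binds to it by setting volumeName to the PV's name and leaving storageClassName empty, so the dynamic provisioner doesn't try to claim it.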

So circling back around to the big question I wanted answers for: Does Rook mess with stuff it doesn't know about in a CephFS filesystem?

No.

If you use the CSI driver + StorageClass mechanism it will only delete stuff that it creates itself and won't touch anything else existing in the filesystem, even if it's in folders it would create or use.

If you use a static volume, then you're in control of everything it has access to and the defaults are set so that even if the PersistentVolume is deleted, the underlying storage remains.

So now onto services that either should be using CephFS volumes or need to access "non-Kubernetes" storage, starting with finding a way to make Samba shares in a container.

#ceph #rook #homelab

Cursed homelab update:

So server #2 is humming along nicely; however, continuing to use the disk that nearly scuttled my repartition / recovery effort was not a good idea.

Me: creates a Ceph OSD on a known faulty hard disk
Faulty hard disk: has read errors causing a set of inconsistent PGs
Me: Surprised Pikachu face

Thankfully these were just read errors; no actual data has been lost.

So for a brief, glorious moment, I had just under 50TB of raw storage and now it's just under 49TB.

And for me now, the big question is: do I do complicated partition trickery to work around the bad spots (it's a consecutive set of sectors) or do I junk the disk and live with 1 less TB of raw storage?

In other news, I now understand a little bit more about Ceph and how it recovers from errors: PGs don't get "fixed" until they are next (deep) scrubbed, which means that if your PGs get stuck undersized, degraded or inconsistent (or any other state), it could be that they're not getting scrubbed.

So taking the broken OSD on the bad HDD offline immediately caused all but 2 of the inconsistent PGs to get fixed. The remaining 2 just wouldn't move, so I smashed out a trivial script to deep scrub all PGs, and last night, a couple of days after this all went down, one of them got fixed. Now hopefully the other will get sorted out soon.

ceph pg ls | awk '{print $1}' | grep '^[[:digit:]]' | xargs -l1 ceph pg deep-scrub
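
A narrower variant (a sketch I haven't run) would only poke the PGs Ceph currently flags as inconsistent, rather than scrubbing everything:

# deep-scrub only the PGs currently in the inconsistent state
ceph pg ls inconsistent | awk '{print $1}' | grep '^[[:digit:]]' | xargs -r -l1 ceph pg deep-scrub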

So read errors -> scrub errors -> inconsistent PGs.

Then: inconsistent PGs -> successful scrub -> recovery

What this also means is that while I stopped the latest phase of the Big Copy to (hopefully) protect my data, I think I can start it again with some level of confidence.

Cursed #homelab update:

So recovery completed and I realised that the OSD process on server #3 was using 18GB of RAM, so I was all "I'll just restart it".

So it CORRUPTED THE FRICKING DISK.

I'm now half a day of #Ceph BlueStore fsck into trying to fix this, including doing some slightly terrifying stuff to get it to actually open the disk without crashing.

You know you're out on the bleeding fricking edge when you find maybe 1 other person in existence who's seen the errors you're seeing. I guess the SOP for dealing with failed disks in Ceph clusters is to junk them and stuff a new one in.

I'm hoping I can recover this so I don't need to spend another 4 days waiting for Ceph to copy data into the re-zapped disk. I'm not at any risk of losing data as I have two other OSDs up and everything has 2 or 3 replicas, so I've got a level of backstop here if this isn't fixable.

That said, ceph-osd may not be culpable for this: it's looking like the "good" disk on this server might also be sick, which means one of three things:

1. Both disks (which are approximately the same age, but different brands) have gone bad at (roughly) the same time
2. My slightly hacky external disk setup is bad for disks (maybe vibrations, maybe the cheap MiniSAS to SATA cable?)
3. The SATA interface on the cheap Chinese server is flaky

Drive 1, which is definitely bad, was having "Device Error"s, which makes me think that the board has gone bad, as SMART wasn't showing anything obviously wrong.

Drive 2, which seems to be working, has "ATA bus error"s, which points the finger firmly at the SATA cable.
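
The checks behind that split are roughly these (a sketch; /dev/sdX is a placeholder for whichever drive you're poking at):

smartctl -a /dev/sdX                              # SMART attributes, error counters, self-test log
journalctl -k | grep -iE 'ata[0-9]+|i/o error'    # kernel-side ATA bus / I/O errors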

I guess if this doesn't "fix itself" or "not happen frequently enough to cause big problems" I might be opening up the server and using one of the internal SATA ports with a known-not-bad cable to see if that fixes it. And if that doesn't fix it, maybe this server can be compute-only.