#ceph


For the past month, I have been flitting about from one project to another as my ADHD sees fit. Today, that MO has to end if I'm ever going to complete this #homelab setup.

So, in order, I have to:

* Back up all my LXCs (rough commands sketched after this list)
* Install RAM in a Mac mini
* Reinstall Proxmox with #ZFS on all 3 minis
* Format a bunch of spinning rust disks
* Mount those spinning rust drives in a #Thunderbolt array
* Configure #CEPH
* Restore my LXCs
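
For the backup and restore steps, here's a minimal sketch of what I have in mind, using Proxmox's stock vzdump/pct tooling. The container ID 101 and the storage names and paths are placeholders, not my actual setup:

# Back up one LXC as a compressed snapshot to a storage named "backups" (placeholder)
vzdump 101 --storage backups --mode snapshot --compress zstd

# ...reinstall Proxmox with ZFS, configure Ceph...

# Restore the container from its dump archive onto the new storage (path is an example)
pct restore 101 /mnt/backups/dump/vzdump-lxc-101.tar.zst --storage local-zfs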

I figure I'll be done around midnight? Maybe? Fingers crossed.

Cursed homelab update:

So server #2 is humming along nicely; however, continuing to use the disk that nearly scuttled my repartition/recovery effort was not a good idea.

Me: creates a Ceph OSD on a known faulty hard disk
Faulty hard disk: has read errors causing a set of inconsistent PGs
Me: Surprised Pikachu face

Thankfully these were just read errors; no actual data has been lost.

So for a brief, glorious moment, I had just under 50TB of raw storage and now it's just under 49TB.

And for me now, the big question is: do I do complicated partition trickery to work around the bad spots (it's a consecutive set of sectors) or do I junk the disk and live with 1 less TB of raw storage?
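
If I do go the partition-trickery route, the rough plan would look something like the sketch below. This is only my working idea: /dev/sdX and the percentages are placeholders, and the real boundaries would come from wherever the SMART self-test log and badblocks say the damage actually is.

# Run an extended self-test; the log reports the LBA of the first error
smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX

# Optionally do a read-only scan to find the full extent of the bad region
badblocks -sv /dev/sdX

# Carve two partitions that skip over the bad stretch (boundaries are made up here)
parted -s /dev/sdX mklabel gpt
parted -s /dev/sdX mkpart osd-a 1MiB 45%
parted -s /dev/sdX mkpart osd-b 55% 100%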

In other news, I now understand a little bit more about Ceph and how it recovers from errors: PGs don't get "fixed" until they are next (deep) scrubbed, which means that if your PGs get stuck undersized, degraded or inconsistent (or any other state), it could be that they're not getting scrubbed.

So taking the broken OSD on the bad HDD offline immediately caused all but 2 of the inconsistent PGs to get fixed. The remaining 2 just wouldn't move, so I smashed out a trivial script to deep scrub all PGs, and last night, a couple of days after this all went down, one got fixed. Now hopefully the other will get sorted out soon.

ceph pg ls | awk '{print $1}' | grep '^[[:digit:]]' | xargs -l1 ceph pg deep-scrub

So read errors -> scrub errors -> inconsistent PGs.

Then: inconsistent PGs -> successful scrub -> recovery
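
For future me, these are the commands I've been leaning on to inspect and nudge unhappy PGs. The pool and PG IDs are placeholders, and this reflects my current understanding of the tooling rather than gospel, so double-check before running a repair:

# Which PGs are unhappy, and why
ceph health detail

# List the inconsistent PGs in a pool, then the objects inside one of them
rados list-inconsistent-pg <pool>
rados list-inconsistent-obj <pgid> --format=json-pretty

# Ask Ceph to re-check a specific PG, and to repair it once you understand the damage
ceph pg deep-scrub <pgid>
ceph pg repair <pgid>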

What this also means is that while I stopped the latest phase of the Big Copy to (hopefully) protect my data, I think I can start it again with some level of confidence.

Cluster Rebuild Project

Thanks to the peeps in the Sharkey Discord walking me through some DB fun, the last major service is migrated off of the old cluster!

Remaining tasks:
* Get Woodpecker-CI building the few local containers
* Configure backups
* Configure DB backups
* Rebuild the blog; not sure I can do it without an nginx container (or changing my entire cluster's ingress, oooof)

A few notes:
* Migrating Sharkey object storage is a mess and best avoided. That was the last reference to Firefish :firefish_crying:
* DB performance in Ceph was not impressive. Still not entirely sure why. I don't think it'll be a problem but worth noting
* I wish there was a good non-NFS volume provisioner that I could just point at a simple ZFS-based NAS

Next I think I am gonna try and get that odroid joined to the cluster via matchbox/PXE/Talos so I can experiment with the iGPU in kubernetes
#Homelab #Kubernetes #Ceph

I've been staring at a "3 Pods pending" panel on my dashboard for three days now and finally checked which ones they were, because all of my systems were showing green.

Turns out it was a couple of Rook pods: two exporters and one crash collector. And they were correct, two of my Ceph nodes have "run out" of CPU while being 80% and 90% idle respectively.

Those default Rook resource requests are definitely intended for a larger setup than mine.
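
For anyone hitting the same thing, this is roughly how I tracked it down (namespace rook-ceph assumed; the pod and node names are obviously placeholders):

# Find the pending pods and the scheduler's reason for not placing them
kubectl -n rook-ceph get pods --field-selector=status.phase=Pending
kubectl -n rook-ceph describe pod <pending-pod> | grep -A5 Events

# "Insufficient cpu" means the node's requests are exhausted, not its actual CPU
kubectl describe node <node> | grep -A10 "Allocated resources"

The fix, as far as I can tell, is trimming the resource requests in the CephCluster CR down to something sized for a small homelab cluster.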

Cluster rebuild project

Ok so the databases for the last two services have been shrunk* by removing their current backup targets and forcing a checkpoint.

The #Sharkey db is currently 25 GB; I should look into whether there are ways to shrink it.

I still need to figure out exactly what is happening with backups that causes the db to grow like that. Something isn't being released properly.
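
My current suspicion is WAL being retained for the backup/archiving side, so the first checks look roughly like this. These are plain Postgres catalog queries, nothing CNPG-specific, run from inside the database pod:

# Is a replication slot holding WAL back? Inactive slots retain WAL indefinitely
psql -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"

# Is WAL archiving keeping up, or failing and letting WAL pile up on disk?
psql -c "SELECT archived_count, failed_count, last_failed_time FROM pg_stat_archiver;"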

Also, the Ceph dashboard ingress ate shit; not quite sure why, but I'm able to expose the service via a load balancer and that works.


#Kubernetes #Homelab #Ceph #CNPG #Postgres

Hm, it seems removing entire nodes and their devices from a Rook Ceph cluster is not entirely simple. I just deleted the entries for one of my nodes from the CephCluster CR, and nothing happened. I would have expected Rook to start the standard OSD purge procedure. And the Rook OSD removal procedure is surprisingly vague.
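
For my own notes, this is the manual fallback I'm considering, done from the Ceph side via the toolbox pod. osd.3 is a placeholder, and this is my reading of the docs rather than the blessed Rook procedure, so treat it as a sketch:

# Drain the OSD and wait until Ceph says its data has been re-replicated elsewhere
ceph osd out 3
while ! ceph osd safe-to-destroy osd.3; do sleep 60; done

# Then remove it from the cluster entirely
ceph osd purge 3 --yes-i-really-mean-it

# And delete the orphaned OSD deployment so Rook doesn't keep restarting it
kubectl -n rook-ceph delete deployment rook-ceph-osd-3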

New blog post: blog.mei-home.net/posts/ceph-c

I take a detailed look at the copy operation I recently did on my media collection, moving 1.7 TB from my old Ceph cluster to my Rook one.

Some musings about Ceph and HDDs, as well as a satisfying number of plots, which are sadly not really readable. 😔 I definitely need a different blog theme that allows enlargement of figures.

ln --help · Ceph: My Story of Copying 1.7 TB from one Cluster to Another · Lots of plots and metrics, some grumbling. Smartctl makes an appearance as well.

There is no reason at all to entrust your company or personal data to ANY #Cloud service. If you are a company, build your own hardware infrastructure with #Ceph, #Proxmox, #Openstack or others. IT WILL SAVE YOU MONEY. If you are an individual, back your data up at home on a NAS.
Use and support #OpenSource.
Ditch #Microsoft.
Now the US is bad, but no government or megacorporation can be trusted.
osnews.com/story/141794/it-is-

www.osnews.com · It is no longer safe to move our governments and societies to US clouds – OSnews

Here's a little #ceph tip: if you have ever noticed that deleting a bunch of files from RBD volumes doesn't reduce the storage used in Ceph, run `fstrim`. Ceph RBD has a similar issue to solid state drives: when using ext4 or XFS, deleting data unlinks it but leaves it present on the disk until fstrim runs and actually clears the unlinked data. Btrfs cleans up after itself by default now, so you probably don't need to do this if you're running btrfs on Ceph.
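
A quick example of what that looks like in practice; the mount point is a placeholder for wherever your RBD-backed filesystem lives:

# Trim one filesystem backed by an RBD volume and report how much space was released
fstrim -v /mnt/rbd-volume

# Or enable the periodic trim timer that ships with util-linux
systemctl enable --now fstrim.timer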