Proxmox Virtual Environment 8.4 is available https://linuxfr.org/news/proxmox-virtual-environment-8-4-est-disponible #Virtualisation #virtualisation #proxmox #debian #linux #ceph #sdn #gpu
#ProxmoxVE 8.4 has been released (#Proxmox / #VirtualEnvironment / #Virtualization / #VirtualMachine / #VM / #Linux / #Debian / #Bookworm / #DebianBookworm / #QEMU / #LXC / #KVM / #ZFS / #OpenZFS / #Ceph) https://proxmox.com/
For the past month, I have been flitting about from one project to another as my ADHD sees fit. Today, that MO has to end if I'm ever going to complete this #homelab setup.
So, in order, I have to:
* Back up all my LXCs (rough vzdump sketch at the end of this post)
* Install RAM in a Mac mini
* Reinstall Proxmox with #ZFS on all 3 minis
* Format a bunch of spinning rust disks
* Mount those spinning rust drives in a #Thunderbolt array
* Configure #CEPH
* Restore my LXCs
I figure I'll be done around midnight? Maybe? Fingers crossed.
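For the backup step, the plan is plain vzdump from the shell. A minimal sketch, assuming a backup storage actually named "backup" (adjust IDs, mode and storage to taste):
# list the containers, then dump every guest to the "backup" storage
pct list
vzdump --all --mode snapshot --compress zstd --storage backup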
Cursed homelab update:
So server #2 is humming along nicely; however, continuing to use the disk that nearly scuttled my repartition/recovery effort was not a good idea.
Me: creates a Ceph OSD on a known faulty hard disk
Faulty hard disk: has read errors causing a set of inconsistent PGs
Me: Surprised Pikachu face
Thankfully these were just read errors; no actual data has been lost.
So for a brief, glorious moment, I had just under 50TB of raw storage and now it's just under 49TB.
And for me now, the big question is: do I do complicated partition trickery to work around the bad spots (it's a consecutive set of sectors) or do I junk the disk and live with 1 less TB of raw storage?
In other news, I now understand a little bit more about Ceph and how it recovers from errors: PGs don't get "fixed" until they are next (deep) scrubbed, which means that if your PGs get stuck undersized, degraded or inconsistent (or in any other state), it could be that they're simply not getting scrubbed.
So taking the broken OSD on the bad HDD offline immediately caused all but 2 of the inconsistent PGs to get fixed. The remaining 2 just wouldn't move, so I smashed out a trivial one-liner to deep-scrub all PGs, and last night, a couple of days after this all went down, one of them got fixed. Now hopefully the other will get sorted out soon.
ceph pg ls | awk '{print $1}' | grep '^[[:digit:]]' | xargs -L1 ceph pg deep-scrub
So read errors -> scrub errors -> inconsistent PGs.
Then: inconsistent PGs -> successful scrub -> recovery
What this also means is that while I stopped the latest phase of the Big Copy to (hopefully) protect my data, I think I can start it again with some level of confidence.
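For anyone wanting to poke at the same thing, the stock Ceph CLI already tells you most of this; a quick sketch (the PG id is just an example):
ceph health detail                 # names the inconsistent PGs
ceph pg ls inconsistent            # one line per inconsistent PG
rados list-inconsistent-obj 2.1f --format=json-pretty   # which objects/shards in that PG are affected
ceph pg deep-scrub 2.1f            # re-scrub a single PG instead of hammering all of them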
I am testing out some tech I want to learn more about :-)
I have used #Terraform to create some VMs on my #Proxmox server, then installed a Kubernetes cluster on them with #kubespray.
Next I'll install #rook / #ceph so I have some storage, and last but not least I will install #CloudNativePG @CloudNativePG@mastodon.social on it.
If that all works, I'll repeat that in another datacenter to test cloudnativepg replica clusters.
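The rough command sequence I have in mind for the remaining steps; chart names and repos are from memory, so treat this as a sketch rather than gospel:
# kubespray: stand up the cluster from the generated inventory
ansible-playbook -i inventory/mycluster/hosts.yaml --become cluster.yml
# rook operator + ceph cluster via helm
helm repo add rook-release https://charts.rook.io/release
helm install --create-namespace -n rook-ceph rook-ceph rook-release/rook-ceph
helm install -n rook-ceph rook-ceph-cluster rook-release/rook-ceph-cluster
# CloudNativePG operator
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm upgrade --install cnpg -n cnpg-system --create-namespace cnpg/cloudnative-pg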
Combining hobby and work..
"Hey, so how's your new server and storage system running?" – "It takes off like a little #Cephchen." (a German pun on "geht ab wie ein Zäpfchen", i.e. it absolutely flies)
Thanks, I'll see myself out.
Cluster Rebuild Project
Thanks to the peeps in the Sharkey Discord walking me through some DB fun, the last major service has been migrated off the old cluster!
Remaining tasks:
* Get Woodpecker-CI building the few local containers
* Configure backups
* Configure DB backups
* Rebuild the blog; not sure I can do it without an nginx container (or without changing my entire cluster's ingress, oooof)
A few notes:
* Migrating Sharkey object storage is a mess and best avoided. That was the last reference to Firefish.
* DB performance in Ceph was not impressive. Still not entirely sure why. I don't think it'll be a problem but worth noting
* I wish there was a good non-NFS volume provisioner that I could just point at a simple ZFS-based NAS
Next I think I am gonna try and get that odroid joined to the cluster via matchbox/PXE/Talos so I can experiment with the iGPU in kubernetes
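Ignoring the matchbox/PXE bootstrapping for a moment, the join itself should just be the usual Talos dance; a sketch with a made-up node IP, reusing the cluster's existing worker.yaml:
# once the odroid has PXE-booted into Talos maintenance mode via matchbox:
talosctl apply-config --insecure --nodes 192.168.1.50 --file worker.yaml
talosctl --nodes 192.168.1.50 dmesg   # sanity-check that it rebooted into the cluster cleanly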
#Homelab #Kubernetes #Ceph
I've been staring at a "3 Pods pending" panel on my dashboard for three days now and finally checked which ones they were, because all of my systems were showing green.
Turns out it was a couple of Rook pods, two exporters and one crash collector. And they were right: two of my Ceph nodes have "run out" of CPU, while being 80% and 90% idle respectively.
Those default Rook resource requests are definitely intended for a larger setup than mine.
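The fix I'm eyeing is simply lowering those requests on the CephCluster CR; something along these lines (the resource keys are from memory, and the Helm values file is probably the cleaner place for it):
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p \
  '{"spec":{"resources":{"exporter":{"requests":{"cpu":"25m","memory":"64Mi"}},"crashcollector":{"requests":{"cpu":"25m","memory":"64Mi"}}}}}'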
Cluster rebuild project
Ok so the databases for the last two services have been shrunk* by removing their current backup targets and forcing a checkpoint.
The #Sharkey DB is currently 25 GB; I should look into whether there are ways to shrink it.
I still need to figure out exactly what is happening with backups that causes the DB to grow like that. Something isn't being released properly.
Also, the Ceph dashboard ingress ate shit; not quite sure why, but I'm able to expose the service via a load balancer and that works.
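For the record, the workaround is just a second Service in front of the mgr dashboard; a sketch assuming the default Rook service name (8443 is the dashboard's default HTTPS port, 7000 if SSL is off):
kubectl -n rook-ceph expose service rook-ceph-mgr-dashboard --name rook-ceph-mgr-dashboard-lb --type LoadBalancer --port 8443 --target-port 8443
kubectl -n rook-ceph get svc rook-ceph-mgr-dashboard-lb   # grab the external IP from here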
#Kubernetes #Homelab #Ceph #CNPG #Postgres
Ah. It seems that this is a safety mechanism. Once I actually read the OSD removal docs properly, it worked without issue.
Just one thing: The example purge-osd command probably should not contain the "--force" flag.
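For my own future reference, the sequence is roughly this, reconstructed from the docs (via the toolbox and the rook-ceph krew plugin; osd.3 is just an example id, so double-check before copying):
# mark the OSD out and let the data drain off it
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd out osd.3
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s   # wait until the rebalance settles
# stop the OSD pod, then purge it (note: no --force)
kubectl -n rook-ceph scale deployment rook-ceph-osd-3 --replicas=0
kubectl rook-ceph rook purge-osd 3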
Hm, it seems removing entire nodes and their devices from a Rook Ceph cluster is not entirely simple. I just deleted the entries for one of my nodes from the CephCluster CR, and nothing happened. I would have expected that Rook would start the standard OSD purge procedure. And the Rook OSD removal procedure is surprisingly vague.
I have always appreciated how much Ceph tries to save me from myself.
realized I gotta figure out bucket policies for real before I migrate sharkey or the blog
not exactly sure what the default ceph bucket policy is, and I don't know if they're additive....
need a user read/write, public read policy that works with rook-ceph
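What I think the policy needs to look like, untested against RGW so far (bucket name and endpoint are made up, and the owning user keeps read/write through their own credentials anyway):
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicRead",
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::blog-media/*"]
    }
  ]
}
EOF
aws --endpoint-url https://rgw.internal.example s3api put-bucket-policy --bucket blog-media --policy file://policy.json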
#Ceph #Homelab #Kubernetes
New blog post: https://blog.mei-home.net/posts/ceph-copy-latency/
I take a detailed look at the copy operation I recently did on my media collection, moving 1.7 TB from my old Ceph clusters to my Rook one.
Some musings about Ceph and HDDs, as well as a satisfying number of plots, which are sadly not really readable. I definitely need a different blog theme that allows enlarging figures.
An excellent cheat sheet for Ceph commands: https://github.com/TheJJ/ceph-cheatsheet/
And I was almost done writing the Ceph copy performance post.
Hmmmmmm, set up Ceph replication/mirroring or use Velero to back up to remote object storage (Ceph or MinIO)?
I don't have a remote Ceph cluster currently, but that could change. A remote MinIO could be set up much more easily.
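If I go the Velero route, the install against a remote MinIO would look roughly like this (bucket, endpoint and plugin version are placeholders):
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=false \
  --backup-location-config region=minio,s3ForcePathStyle=true,s3Url=https://minio.remote.example:9000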
#Homelab #Kubernetes #Ceph
There is no reason at all to entrust your company or personal data to ANY #Cloud service. If you are a company, build your own hardware infrastructure with #Ceph, #Proxmox, #Openstack or others. IT WILL SAVE YOU MONEY. If you are an individual, back your data up at home on a NAS.
Use and support #OpenSource.
Ditch #Microsoft.
Now the US is bad, but no government or megacorporation can be trusted.
https://www.osnews.com/story/141794/it-is-no-longer-safe-to-move-our-governments-and-societies-to-us-clouds/
Here's a little #ceph tip: if you have ever noticed that deleting a bunch of files from RBD volumes doesn't reduce the storage used in Ceph, run `fstrim`. Ceph RBD has a similar issue to solid state drives: when using ext4 or xfs, deleting data unlinks it but leaves it allocated on the device until fstrim runs and actually discards the unlinked blocks. btrfs cleans up after itself by default now, so you probably don't need to do this if you're running btrfs on Ceph.
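Concretely, either a one-off trim or the systemd timer does the job, run wherever the RBD-backed filesystem is mounted:
fstrim -av                           # trim all mounted filesystems that support discard, verbosely
systemctl enable --now fstrim.timer  # or just let systemd do it on a schedule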
Ok two Ceph questions:
1. Does anyone have a monitoring/alerting rec or example for Rook/Ceph, or a link to a good article on it?
2. Any recs for a GUI S3 browser? I can see details of the buckets from the dashboard, but nothing about the contents, like I could with MinIO.
#Homelab #Kubernetes #Ceph