How to Poison AI Scrapers With Colorless, Odorless Iocaine: The Current Arms Race Between Billionaires and Hosters
Last week, Wikimedia reported that AI bots saturated their available bandwidth. Here's why the bad bots are getting so much worse...
https://www.plagiarismtoday.com/2025/04/10/the-battle-against-the-bots/
1. Agreeing on closed AI systems: your own data stays here in a "silo".
2. #Anonymization of your own data: laborious and, in the end, usually not effective enough (the residual information that remains can be de-anonymized).
3. Protecting your own data from AI #scraping with #watermarks and opt-out notices: unfortunately only of very limited effectiveness.
The question every organization should ask itself: how do I protect my IP from "data-hungry" AI providers?
AI Act: prohibited practices and the GDPR
The European Commission is pushing ahead with the implementation of the AI Act. One focus is on the practices that the AI Act prohibits. In part, the GDPR already offers protection against such practices. The article looks at this in more detail using an example (...)
https://www.dr-datenschutz.de/ki-verordnung-verbotene-praktiken-und-die-dsgvo/
Managing AI Bots+ w/ Apache MPM, FPM, & Fail2Ban: https://tech.haacksnetworking.org/2025/04/06/managing-ai-bots-w-apache-mpm-fpm-fail2ban/ There's been a lot of continued discussion on this topic, so I decided to investigate some of the common reports, compare those to my own hardware, theoretical ceilings and caps, and then adjust my LAMP stack and fail2ban as described in this blog entry. Let me know what y'all think or if you find any errors or questionable claims. -oemb1905 #ai #scraping #apache #opensource #freesoftware #floss #ddos #php
Nice take https://blog.sysopscafe.com/posts/ai-crawlers-hammering-git-repos/ ty to @p4p4j0hn for sharing! #ai #scraping #floss #freesoftware #opensource
Web Scraping With Cheerio in 2025, by @apify.bsky.social:
With all the talk about AI-scraping, I decided to run the numbers on my little wiki and tech blog. Here's what we got. https://haacksnetworking.org/bot-scrape-04-04-25.txt #AI #bot #scraping
I've set up my new #inkscape website AI bot trap. It works by giving everyone a chance to not fall into it.
There's an anchor link that says "I am a bot" and links to /P3W-451/{datetime}/. It has a fixed position at top -100px, so it should never be seen.
The robots.txt says "Disallow: /P3W-451/", so if you were reading the robots file, you'd know.
Then #nginx logs those requests, recording their IP addresses and browser strings, and sends them a 301 redirect to google.com
1/2
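The trap described above could be sketched roughly like this in Python. This is only an illustration, not the poster's actual setup (theirs runs in nginx): the timestamp format, the inline CSS, the in-memory log, and the function names `trap_link`, `allowed_by_robots`, and `handle_request` are all assumptions made for the example. Only the /P3W-451/ path, the "I am a bot" link text, the Disallow rule, and the 301 to google.com come from the post.

```python
from datetime import datetime, timezone
from urllib.robotparser import RobotFileParser

TRAP_PREFIX = "/P3W-451/"  # trap path taken from the post

# robots.txt content as quoted in the post (the User-agent line is assumed)
ROBOTS_TXT = """User-agent: *
Disallow: /P3W-451/
"""


def trap_link(now=None):
    """Build the hidden anchor; the timestamp format is an assumption."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return (
        f'<a href="{TRAP_PREFIX}{stamp}/" '
        'style="position:fixed;top:-100px">I am a bot</a>'
    )


def allowed_by_robots(path, user_agent="*"):
    """Return True if a crawler that honors robots.txt may fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


def handle_request(path, ip, user_agent, log):
    """Log trap hits and redirect them; serve everything else normally."""
    if path.startswith(TRAP_PREFIX):
        # Record who followed a link that robots.txt told them to avoid.
        log.append((ip, user_agent, path))
        return 301, "https://google.com/"
    return 200, None


if __name__ == "__main__":
    hits = []
    print(trap_link())
    print(allowed_by_robots("/P3W-451/20250410T120000Z/"))
    print(handle_request("/P3W-451/20250410T120000Z/", "203.0.113.7", "ExampleBot/1.0", hits))
    print(hits)
```

A crawler that reads robots.txt never requests the trap path, and a human never sees the link, so anything that lands in the log has ignored both signals. What to do with the logged addresses afterwards (rate limiting, blocking) is left open here, as it is in the post.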
Joshua Yuvaraj, co-director of the New Zealand Centre for Intellectual Property, was interviewed on RNZ yesterday about the degree to which copyright law might be used to prevent scraping of the open web by #MOLE Trainers.
As Cory Doctorow noted back in 2023:
"In privacy and labor fights, copyright is a clumsy tool at best."
https://pluralistic.net/2023/09/17/how-to-think-about-scraping/
How crawlers impact the operations of the Wikimedia projects https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/ #AI, #Crawlers, #Infrastructure, #KnowledgeAsAService, #KnowledgeContent, #Operations, #Scraping, #ScrapingBots, #Traffic, #WikimediaFoundation, #WikimediaProjects
@susankayequinn Here's another article by @brianmerchant : https://www.bloodinthemachine.com/p/openais-studio-ghibli-meme-factory
"AI giants are indeed eating away at the livelihoods and dignity of working artists, and this devouring, appropriating, and automation of the production of art, of culture, at a scale truly never seen before, should not be underestimated as a menace"
"GPT-4o is partly (aside from some licensed content) a product of a massive scrape of the Internet without regard to copyright or consent from artists ... GPT-4o's image generation model (and the technology behind it, once open source) feels like it further erodes trust in remotely produced media ... Everyone needs media literacy skills ..." https://arstechnica.com/ai/2025/03/openais-new-ai-image-generator-is-potent-and-bound-to-provoke/?utm_brand=arstechnica&utm_social-type=owned&utm_source=mastodon&utm_medium=social via @arstechnica
Another part of my day job involves working around systems designed to prevent mass AI-driven scraping, because humans and well-behaved query scripts get accidentally caught up in the war of the scrapers. Cloudflare etc. are offering what seems to management to be a magic bullet, so the bluntest of tools ends up in front of anything that needs to be public, including APIs.
#scraping #api