mastodon.xyz is one of the many independent Mastodon servers you can use to participate in the fediverse.
A Mastodon instance, open to everyone, but mainly English and French speaking.

#scraping

@reiver ⊼ (Charles) :batman:

5/

For example, if software requests data from a web-site, and the web-site returns HTML, but parts of the HTML have semantics marked up with a machine-legible format such as microformats, microdata, RDFa, etc., then it is NOT scraping.

(microformats, microdata, RDFa, etc., are machine-legible formats, designed to express semantics to machines.)

#Scraper #Scraping #WebScraper #WebScraping

@reiver ⊼ (Charles) :batman:

4/

For example, if software requests data from a web-site, and the web-site returns HTML, but that HTML contains a <script> tag with JSON-LD in it, and the software consumes that JSON-LD, then it is NOT scraping.

(JSON-LD is a machine-legible format, designed to express semantics to machines.)

#Scraper #Scraping #WebScraper #WebScraping
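
To make the JSON-LD case concrete, here is a minimal sketch in Python using only the standard library. It reads the contents of `<script type="application/ld+json">` tags and ignores the rest of the markup. The names `JSONLDExtractor`, `extract_jsonld`, and the sample page are hypothetical and are not from the thread.

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD <script> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # The attribute value may be None when written without a value.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    """Return a list with one parsed object per JSON-LD block in the page."""
    parser = JSONLDExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical page, for illustration only.
SAMPLE_HTML = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}'
    "</script></head><body><p>Hello</p></body></html>"
)
documents = extract_jsonld(SAMPLE_HTML)
```

The consumer never looks at the human-facing `<p>` content; it only reads the block the site published for machines, which is the distinction the post draws.
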
@reiver ⊼ (Charles) :batman:

3/

For example, if software requests data from a web-site, and the web-site returns JSON, XML, or some other machine-legible format, then it is NOT scraping.

#Scraper #Scraping #WebScraper #WebScraping
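
A rough sketch of that case, assuming the client decides how to read a response from its declared media type: JSON and XML go straight to a standard parser, and anything else is rejected instead of being picked apart. The function name `parse_machine_legible` and the set of media types it accepts are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET


def parse_machine_legible(body, content_type):
    """Parse a response body according to its declared media type.

    Returns parsed data for JSON or XML and raises ValueError for
    anything else (for example text/html).
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json" or media_type.endswith("+json"):
        return json.loads(body)
    if media_type in ("application/xml", "text/xml") or media_type.endswith("+xml"):
        return ET.fromstring(body)
    raise ValueError(f"media type not handled here: {media_type}")


data = parse_machine_legible('{"posts": 7}', "application/json; charset=utf-8")
root = parse_machine_legible("<feed><entry>hi</entry></feed>", "application/atom+xml")
```
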
Continued thread

1. Agreeing on closed AI systems: your own data stays in a "silo" here.
2. Anonymization (#Anonymisierung) of your own data: laborious and, in the end, usually not effective enough (the residual information that remains can be de-anonymized).
3. Protecting your own data against AI #Scraping with watermarks (#Wasserzeichen) and opt-out notices: unfortunately of very limited effectiveness.

The question every organization should ask itself: how do I protect my IP from "data-hungry" AI providers?

AI Act (KI-Verordnung): prohibited practices and the GDPR

The European Commission is driving the implementation of the AI Act forward. One focus is on the practices that the AI Act prohibits. In part, the GDPR already offers protection against such practices. The article goes into this in more detail using an example (...)
dr-datenschutz.de/ki-verordnun


Managing AI Bots+ w/ Apache MPM, FPM, & Fail2Ban: tech.haacksnetworking.org/2025 There's been a lot of continued discussion on this topic, so I decided to investigate some of the common reports and compare those to my own hardware, theoretical ceilings, and caps. I then adjusted my LAMP stack and fail2ban as described in this blog entry. Let me know what y'all think or if you find any errors or questionable claims. -oemb1905 #ai #scraping #apache #opensource #freesoftware #floss #ddos #php

I've set up my new #inkscape website AI bot trap. It works by giving everyone a chance to not fall into it.

An anchor link that says "I am a bot" and links to /P3W-451/{datetime}/. It has a fixed position at top: -100px, so it should never be seen.

The robots.txt says "Disallow: /P3W-451/", so if you were reading robots.txt, you'd know.

Then #nginx logs those requests, recording their IP addresses and browser strings, and sends them a 301 redirect to google.com

#ai #Scraping

1/2
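
The post does this in nginx, and that configuration isn't shown, so the following is only a Python model of the logic as described: a hidden link, a robots.txt rule that warns well-behaved crawlers off, and a handler that logs and redirects anything that follows the link anyway. The timestamp format, the function names, the log line layout, and the `https://` scheme on the redirect target are all assumptions.

```python
from datetime import datetime, timezone
from urllib import robotparser

TRAP_PREFIX = "/P3W-451/"  # trap path from the post
ROBOTS_TXT = "User-agent: *\nDisallow: /P3W-451/\n"


def trap_link(now=None):
    """Build the off-screen anchor; the datetime format is an assumption."""
    now = now or datetime.now(timezone.utc)
    href = f"{TRAP_PREFIX}{now:%Y%m%dT%H%M%S}/"
    return f'<a href="{href}" style="position:fixed;top:-100px">I am a bot</a>'


def allowed_by_robots(path, user_agent="ExampleBot"):
    """What a crawler that honours robots.txt would conclude for this path."""
    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


def handle_request(path, ip, user_agent, log):
    """Return (status, headers); trap hits are logged and redirected."""
    if path.startswith(TRAP_PREFIX):
        log.append(f"{ip}\t{user_agent}\t{path}")
        return 301, {"Location": "https://google.com/"}
    return 200, {}
```

A crawler that checks `allowed_by_robots` before fetching never reaches the trap, which is the "chance to not fall into it" the post mentions; only clients that ignore robots.txt and follow the invisible link end up in the log.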

Replied in thread

@susankayequinn Here's another article by @brianmerchant: bloodinthemachine.com/p/openai
"AI giants are indeed eating away at the livelihoods and dignity of working artists, and this devouring, appropriating, and automation of the production of art, of culture, at a scale truly never seen before, should not be underestimated as a menace"

Blood in the Machine · OpenAI's Studio Ghibli meme factory is an insult to art itself · By Brian Merchant

"GPT-4o is partly (aside from some licensed content) a product of a massive scrape of the Internet without regard to copyright or consent from artists ... GPT-4o's image generation model (and the technology behind it, once open source) feels like it further erodes trust in remotely produced media ... Everyone needs media literacy skills ..." arstechnica.com/ai/2025/03/ope via @arstechnica

Ars Technica · OpenAI’s new AI image generator is potent and bound to provoke · By Benj Edwards
Continued thread

another part of my day job involves working around systems designed to prevent mass AI-driven scraping, because humans and well-behaved query scripts get accidentally caught up in the war of the scrapers. Cloudflare etc. are offering what seems to management to be a magic bullet, and putting the bluntest of tools in front of anything that needs to be public, including APIs.
#scraping #api