Web Scraping Legal Issues
From OpenCongress Wiki
Revision as of 15:27, November 19, 2012
Introduction
The public interest case for web scraping is well understood among technologists and public advocates, but often poorly understood by everyone else. A developer setting up a scraper often faces an uncertain legal context: laws and precedents vary between countries, and courts have formalized or decided the legal issues surrounding web scraping only vaguely. This page seeks to gather relevant resources about the legal concerns surrounding scraping.
Scraping and its legal context should also be normalized a bit, since legal advice often doesn't take into account public interest motivations behind copying public sector information, and treats scraping as a sort of childish malevolence, even as huge businesses are more quietly built around web scraping. We would all be better off if smaller public interest actors had that same confidence. While this page isn't intended as legal advice, hopefully it can be a helpful first step.
Barriers to Access
with national examples, and news coverage
sign-in systems (see PACER)
trespassing
terms of use
incomplete or unstable information
Barriers to Reuse
privacy laws
national security laws
IP laws / copyright
Robots.txt
Search engines work by scraping, and always have. Google determines PageRank by scraping the web and pulling out links from pages to determine who is linking to whom. Search engines before Google made the web searchable the same way. This led to the robots.txt standard in 1994, in which a site states which pages crawlers are and aren't allowed to scrape. Compliance is voluntary.
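As a rough illustration of what "pulling out links from pages" means in practice, here is a minimal sketch using only Python's standard library. The class name `LinkExtractor`, the helper `extract_links`, the sample HTML, and the example.org URLs are all made up for this example; real crawlers do far more (fetching, deduplication, politeness, error handling).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the target of every <a href="..."> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs that the given HTML links to."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, not taken from any real site.
SAMPLE_HTML = (
    '<p><a href="/bills/hr1">HR 1</a> and '
    '<a href="http://other.example/page">elsewhere</a></p>'
)

links = extract_links(SAMPLE_HTML, "http://example.org/index.html")
print(links)
```

A search engine would repeat this over many pages and record which page links to which; the ranking itself is a separate step not shown here.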
Few organizations block Google and other search engines from indexing their content. But many organizations take a strict approach to ordinary users in their robots.txt, even if it's not enforced technically or legally. For example, THOMAS' robots.txt allows Google to scrape everything, but allows ordinary people to scrape nothing. (The new Congress.gov beta allows everyone to scrape everything, but asks them to wait 2 seconds between downloads.)
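To make the two policies described above concrete, here is a hedged sketch using Python's standard `urllib.robotparser`. The two robots.txt texts are illustrative reconstructions of the policies as described on this page, not copies of the real THOMAS or Congress.gov files, and the user agent name `MyScraper` and the example.org URL are made up.

```python
from urllib.robotparser import RobotFileParser

# Assumed, illustrative policy: Google may fetch everything, everyone else nothing.
THOMAS_STYLE = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Assumed, illustrative policy: everyone may fetch everything, 2 seconds apart.
CONGRESS_STYLE = """\
User-agent: *
Crawl-delay: 2
Disallow:
"""


def load_rules(text):
    """Parse robots.txt text (already downloaded) into a rule checker."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


url = "http://example.org/bills"  # hypothetical page to check

thomas = load_rules(THOMAS_STYLE)
print(thomas.can_fetch("Googlebot", url))   # is Google allowed?
print(thomas.can_fetch("MyScraper", url))   # is an ordinary scraper allowed?

congress = load_rules(CONGRESS_STYLE)
print(congress.can_fetch("MyScraper", url))
print(congress.crawl_delay("MyScraper"))    # requested pause in seconds, if any
```

A scraper that wants to comply would check `can_fetch` before each request and sleep for the `crawl_delay` value between downloads. Nothing in the parser enforces this; as noted above, compliance is voluntary.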
The Internet Archive's Wayback Machine exists solely through scraping. It does obey robots.txt standards, even retroactively, and will even obey simple human requests to stop crawling a website (their FAQ goes into detail). It's not clear whether they do this out of fear of legal action, because of how they view the ethics of archiving, or for some other reason.
