Web Scraping Legal Issues
From OpenCongress Wiki
The public interest case for web scraping is well understood among technologists and public advocates, but often poorly understood by everyone else. A developer setting up a scraper faces an uncertain legal context, as laws and precedents vary across countries, and the legal issues surrounding web scraping have been only vaguely formalized or decided by courts. This page seeks to gather relevant resources about the legal concerns surrounding scraping.
Scraping and its legal context should also be normalized a bit. Legal advice often doesn't take into account the public interest motivations behind copying public sector information, and treats scraping as a sort of childish malevolence, even as huge businesses are quietly built around web scraping. Those businesses proceed with confidence; we would all be better off if smaller public interest actors could do the same. While this page isn't intended as legal advice, hopefully it can be a helpful first step.
Barriers to Access
(with national examples and news coverage)
- sign-in systems (see PACER)
- 'Scrapers' Dig Deep for Data on the Web - Julia Angwin and Steve Stecklow, Wall Street Journal, 10/11/2012 - This article opens with an anecdote about big businesses creating dummy accounts to scrape information from websites that require a login, before delving into the broader business applications of scraping and some of the legal complications around personal data. According to the article, anti-scraping laws vary by country, and in the US, courts have issued contradictory opinions on scraping issues.
- incomplete information / not stable
Barriers to Reuse
- national security laws
- IP laws / copyright
- How legal is content scraping? - Curtis Smolar, 5/30/2011 - Outlines some of the legal issues associated with scraping, going into particular depth about how copyright can (and cannot) be applied to information being scraped. Simply put, according to this piece, having a copyright might not protect the "pure facts" contained on a website. The article outlines the Supreme Court's opinion that "while the arrangement, formatting, or a collection of pure facts may be copyrighted, the facts themselves may not be." The author leaves readers with the helpful advice: "It's still the wild west in this field - so proceed with caution."
- copyright claimed by a third party hosting the information
Search engines work by scraping, and always have. Google determines PageRank by scraping the web and pulling links out of pages to determine who is linking to whom; even before Google, this is how the web was made searchable. This led to the robots.txt standard in 1994, which lets a site state which pages crawlers are and aren't allowed to scrape. Compliance is voluntary.
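As a concrete illustration, Python's standard library ships a parser for the robots.txt format. The rules below are a made-up example (the user-agent names and paths are assumptions, not any real site's policy), showing how a record that allows one crawler everything can sit alongside a record restricting everyone else:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: Googlebot may crawl everything, everyone
# else must stay out of /private/. (Illustrative only; not any
# real site's policy.)
rules = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/private/page"))  # True
print(parser.can_fetch("MyScraper", "https://example.com/private/page"))  # False
print(parser.can_fetch("MyScraper", "https://example.com/public/page"))   # True
```

A well-behaved scraper would fetch the live file with `RobotFileParser.set_url(...)` and `read()`, then check `can_fetch()` before each request; but again, nothing technically compels this, since compliance is a voluntary convention.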
Few organizations block Google and other search engines from indexing their content, but many take a strict approach to ordinary users in their robots.txt, even if it's not enforced technically or legally. For example, THOMAS' robots.txt allows Google to scrape everything, but ordinary people nothing. (The new Congress.gov beta allows everyone to scrape everything, but asks for a 2-second wait between each download.)
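A crawl-delay request like Congress.gov's is straightforward to honor in code. Below is a minimal sketch (the `RateLimiter` class and the example URLs are our own invention, not any site's official client) that enforces a minimum interval between successive requests:

```python
import time

class RateLimiter:
    """Enforces a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        """Sleep just long enough to keep calls min_interval apart."""
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# Usage sketch: call wait() before each download so requests are
# spaced at least 2 seconds apart, as the site asks. (example.com
# URLs are placeholders.)
limiter = RateLimiter(min_interval=2.0)
for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    limiter.wait()
    # ... fetch url here, e.g. with urllib.request.urlopen(url) ...
```

Throttling client-side like this keeps a public interest scraper well within the behavior a site explicitly invites, which is a useful fact to be able to point to if questions ever arise.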
The Internet Archive's Wayback Machine exists solely through scraping. It obeys the robots.txt standard, even retroactively, and will even honor simple human requests to stop crawling a website (their FAQ goes into detail). It's not clear whether they do this out of fear of legal action, because of how they view the ethics of archiving, or for some other reason.