Web Scraping Legal Issues

From OpenCongress Wiki



This page is part of the Transparency Hub project.
Add what you know.


Introduction

The public interest case for web scraping is well understood among technologists and public advocates, but often poorly understood by everyone else. A developer setting up a scraper faces an uncertain legal context: laws and precedents vary between countries, and the legal issues surrounding web scraping have been only vaguely formalized or decided by courts. This page seeks to gather relevant resources about the legal concerns surrounding scraping.

Scraping and its legal context should also be normalized a bit. Legal advice often doesn't take into account the public interest motivations behind copying public sector information, and treats scraping as a sort of childish malevolence, even as huge businesses are quietly built around web scraping. We would all be better off if smaller public interest actors could scrape with the same confidence those businesses have. While this page isn't intended as legal advice, hopefully it can be a helpful first step.

Barriers to Access

with national examples, and news coverage

sign-in systems (see PACER)

trespassing

  • 'Scrapers' Dig Deep for Data on the Web - Julia Angwin and Steve Stecklow, Wall Street Journal, 10/11/2012 - This article begins with an anecdote about how big businesses create dummy accounts in order to scrape information from websites that require a login, then covers the broader business applications of scraping and some of the legal complications around scraping personal data. According to the article, anti-scraping laws vary by country. In the US, courts have issued contradictory opinions on scraping issues.

terms of use

  • How Zappos' User Agreement Failed in Court and Left Zappos Legally Naked - Eric Goldman, 10/29/2012 - This post explains some of the ways that courts have looked at the legality of website terms of use agreements. Courts tend to break user agreements into three groups: "clickwraps" or clickthrough agreements, "browsewraps", and "clearly not a contract". According to the post, courts have consistently ruled clickthrough agreements to be legal and binding. Since users cannot reasonably be expected to have read and confirmed their understanding of "browsewraps", which just appear somewhere on the webpage, courts have tended not to treat these "agreements" as contracts. Since websites often outline their scraping policies in these sections, it is important to consider how a given website presents its terms of use.

incomplete information // not stable

Barriers to Reuse

privacy laws

national security laws

IP laws / copyright

  • How legal is content scraping? - Curtis Smolar, 5/30/2011 - Outlines some of the legal issues associated with scraping, going into particular depth about how copyright can (and cannot) be applied to information being scraped. Simply put, according to this piece, holding a copyright might not protect the "pure facts" contained on a website. The article outlines the Supreme Court's opinion that "while the arrangement, formatting, or a collection of pure facts may be copyrighted, the facts themselves may not be." The author leaves readers with this advice: "It's still the wild west in this field- so proceed with caution."

copyright by third party hosting the information

Robots.txt

Search engines work by scraping, and always have. Google determines PageRank by scraping the web and pulling out links from pages to determine who's linking to whom. Even before Google, this was how the web was made searchable. This led to the robots.txt standard in 1994, which lets a site state which pages automated clients are and aren't allowed to scrape. Compliance is voluntary.

Few organizations block Google and other search engines from indexing their content. But many organizations take a strict approach to ordinary users in their robots.txt, even if it's not enforced technically or legally. For example, THOMAS' robots.txt allows Google to scrape everything, but ordinary people nothing. (The new Congress.gov beta allows everyone to scrape everything, but to wait 2 seconds between each download.)
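As a rough illustration of how a scraper can read such policies, here is a minimal sketch using Python's standard urllib.robotparser module. The two robots.txt texts are made up to mirror the two approaches described above; they are not the actual THOMAS or Congress.gov files, and the "ExampleScraper" user agent and example.gov URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policies only. These are NOT the real THOMAS or
# Congress.gov robots.txt files.
STRICT_POLICY = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

PERMISSIVE_POLICY = """\
User-agent: *
Crawl-delay: 2
Allow: /
"""


def load_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


strict = load_policy(STRICT_POLICY)
permissive = load_policy(PERMISSIVE_POLICY)

# Hypothetical page URL and user agent names, for illustration.
URL = "http://example.gov/bills/hr1"

for label, policy in (("strict", strict), ("permissive", permissive)):
    for agent in ("Googlebot", "ExampleScraper"):
        allowed = policy.can_fetch(agent, URL)
        # crawl_delay() returns None when no Crawl-delay line applies.
        delay = policy.crawl_delay(agent)
        print(f"{label}: {agent} allowed={allowed} crawl_delay={delay}")
```

A scraper that chose to honor these files would call can_fetch() before each request and, where a delay is given, sleep that many seconds between downloads. Nothing in the parser enforces this; as noted above, compliance is voluntary.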

The Internet Archive's Wayback Machine exists solely through scraping. It does obey robots.txt standards, even retroactively, and will even obey simple human requests to stop crawling a website (their FAQ goes into detail). It's not clear whether they do this out of fear of legal action, because of how they view the ethics of archiving, or for some other reason.

Public Interest Scraping Justifications

Other Relevant Resources

http://blog.scraperwiki.com/2012/04/02/is-scraping-legal/

https://scraperwiki.com/docs/python/faq/#scraping_legality

Toolbox