Web Scraping Legal Issues

From OpenCongress Wiki


Revision as of 15:27, November 19, 2012


This page is part of the Transparency Hub project.
Add what you know.


Introduction

The public interest case for web scraping is well understood among technologists and public advocates, but often poorly understood by everyone else. A developer setting up a scraper often faces an uncertain legal context: laws and precedents vary between countries, and courts have only partly formalized or decided the legal issues surrounding web scraping. This page seeks to gather relevant resources about the legal concerns surrounding scraping.

Scraping and its legal context should also be normalized a bit. Legal advice often doesn't take into account the public interest motivations behind copying public sector information, and treats scraping as a sort of childish malevolence, even as huge businesses are quietly built around web scraping. We would all be better off if smaller public interest actors had the same confidence those businesses do. While this page isn't intended as legal advice, hopefully it can be a helpful first step.

Barriers to Access

with national examples, and news coverage

sign-in systems (see PACER)

trespassing

terms of use

incomplete information // not stable

Barriers to Reuse

privacy laws

national security laws

IP laws / copyright

Robots.txt

Search engines work by scraping, and always have. Google computes PageRank by scraping the web and pulling out links from pages to determine who is linking to whom. Even before Google, this is how the web was made searchable. This led to the robots.txt standard in 1994, which lets a site state which pages scrapers are and aren't allowed to fetch. Compliance is voluntary.

Few organizations block Google and other search engines from indexing their content. But many organizations take a strict approach to ordinary users in their robots.txt, even if it's not enforced technically or legally. For example, THOMAS' robots.txt allows Google to scrape everything, but ordinary people nothing. (The new Congress.gov beta allows everyone to scrape everything, but asks them to wait 2 seconds between downloads.)
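As a rough illustration of how a scraper can read these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The two robots.txt texts are hypothetical reconstructions of the policies described above, not the actual THOMAS or Congress.gov files, and the "MyScraper" user agent and example.gov URL are placeholders. Note that Crawl-delay is a widely used extension rather than part of the original 1994 standard, so not every site or crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on the THOMAS description:
# one named search engine may fetch everything, everyone else nothing.
SEARCH_ENGINE_ONLY = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Hypothetical robots.txt modeled on the Congress.gov beta description:
# everyone may fetch everything, but should wait 2 seconds between requests.
OPEN_WITH_DELAY = """\
User-agent: *
Crawl-delay: 2
Disallow:
"""


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check(robots_txt, user_agent, url):
    """Return (allowed, delay) for this user agent and URL.

    delay is the requested pause in seconds, or None if none is stated.
    """
    rules = load_rules(robots_txt)
    return rules.can_fetch(user_agent, url), rules.crawl_delay(user_agent)


# Placeholder URL and user agent, for illustration only.
URL = "https://example.gov/bills/hr1"

print(check(SEARCH_ENGINE_ONLY, "Googlebot", URL))
print(check(SEARCH_ENGINE_ONLY, "MyScraper", URL))
print(check(OPEN_WITH_DELAY, "MyScraper", URL))
```

Nothing here enforces anything: the parser only reports what the site has asked for, and it is up to the scraper's author to skip disallowed pages and to sleep for the requested delay between downloads.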

The Internet Archive's Wayback Machine exists solely through scraping. It does obey robots.txt standards, even retroactively, and will even obey simple human requests to stop crawling a website (their FAQ goes into detail). It's not clear whether they do this out of fear of legal action, because of how they view the ethics of archiving, or for some other reason.

Public Interest Scraping Justifications

Other Relevant Resources

http://blog.scraperwiki.com/2012/04/02/is-scraping-legal/

https://scraperwiki.com/docs/python/faq/#scraping_legality


OpenCongress is a joint project of the Participatory Politics Foundation and the Sunlight Foundation.