Citizenly · Legal

Citizenly Corpus Crawler

Last updated: 2026-05-11

This page describes the automated crawler operated by Citizenly for indexing public U.S. immigration policy documents.

What it does

The crawler periodically polls a small allowlist of public-information pages on official U.S. government immigration sites. On each fetch it computes a content hash of the page's main article body and compares it with the hash from the previous fetch. When the body has changed substantively, a human reviewer at Citizenly checks the change before any indexed copy is updated.
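
The hash comparison described above can be illustrated with a minimal sketch. The choice of SHA-256, the whitespace normalization step, and the function names body_fingerprint and has_changed are assumptions made for illustration; this page does not specify which hash or normalization the crawler uses, and the human review step is not modeled here.

```python
import hashlib
import re
from typing import Optional


def body_fingerprint(article_text: str) -> str:
    """Return a SHA-256 hex digest of the article body (hypothetical helper)."""
    # Assumption: collapse runs of whitespace so that reflowed markup
    # alone does not register as a content change.
    normalized = re.sub(r"\s+", " ", article_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_digest: Optional[str], article_text: str) -> bool:
    """True when there is no stored digest or the new digest differs from it."""
    return previous_digest != body_fingerprint(article_text)
```

In a sketch like this, a True result would only flag the page for a reviewer; it would not update any indexed copy on its own.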

The crawler does not republish government content, archive entire sites, index personal data, scrape user accounts, or collect anything beyond the public policy pages on the sites listed below.

User-Agent

Every request the crawler issues carries the following User-Agent string:

Citizenly Corpus Crawler (https://citizenly.ai/crawler-info)

If you operate one of the source sites and see this User-Agent in your logs, those requests came from us.
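
For illustration, here is a minimal sketch of how a client could attach this header and how a site operator could match it in a log field. It uses Python's standard urllib; the helper names build_request and is_citizenly_crawler are hypothetical and are not taken from the crawler's code.

```python
from urllib.request import Request

# The User-Agent string exactly as published on this page.
USER_AGENT = "Citizenly Corpus Crawler (https://citizenly.ai/crawler-info)"


def build_request(url: str) -> Request:
    """Build a request object carrying the crawler's User-Agent (no network I/O)."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def is_citizenly_crawler(logged_user_agent: str) -> bool:
    """Exact-match check an operator might run on a logged User-Agent value."""
    return logged_user_agent.strip() == USER_AGENT
```

A User-Agent header can be set by any client, so a match in a log identifies the claimed sender only.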

Sites the crawler accesses

  • www.uscis.gov: U.S. Citizenship and Immigration Services
  • egov.uscis.gov: USCIS e-government services (processing times)
  • travel.state.gov: U.S. Department of State, Travel
  • www.justice.gov: U.S. Department of Justice, Executive Office for Immigration Review (EOIR)
  • www.dhs.gov: U.S. Department of Homeland Security
  • www.cbp.gov: U.S. Customs and Border Protection
  • www.ice.gov: U.S. Immigration and Customs Enforcement
  • studyinthestates.dhs.gov: DHS, Study in the States
  • ohss.dhs.gov: DHS Office of Homeland Security Statistics
  • www.federalregister.gov: Federal Register (via API)
  • api.congress.gov: Congress.gov (via official API)

The Federal Register and Congress.gov sources use official structured APIs (federalregister.gov/api/v1 and api.congress.gov/v3), not page scraping.

Politeness

  • The crawler respects robots.txt on every domain. Disallowed paths are skipped and not retried for at least 24 hours after the first disallow.
  • Per-domain request rate is limited (default: at most one request every five seconds per source).
  • Conditional requests (If-None-Match, If-Modified-Since) are sent whenever the previous response included an ETag or Last-Modified header, so unchanged pages return 304 Not Modified and consume minimal upstream bandwidth.
  • Repeated upstream errors pause polling for that source until a human acknowledges the situation.
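
The first three rules above can be sketched with Python's standard library. This is an illustration under stated assumptions, not the crawler's actual code: the function names are hypothetical, robots.txt content is passed in as text that has already been fetched, and the five-second interval is the default stated above.

```python
from typing import Dict, Optional
from urllib.robotparser import RobotFileParser

USER_AGENT = "Citizenly Corpus Crawler (https://citizenly.ai/crawler-info)"
MIN_INTERVAL_SECONDS = 5.0  # default from this page: one request every five seconds


def allowed_by_robots(robots_txt: str, url: str) -> bool:
    """Check a URL against robots.txt content that was fetched earlier."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


def seconds_until_next_request(last_request_at: Optional[float], now: float) -> float:
    """How long to wait before the next request to the same source."""
    if last_request_at is None:
        return 0.0  # no earlier request to this source
    return max(0.0, MIN_INTERVAL_SECONDS - (now - last_request_at))


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build request headers, adding validators saved from the previous response."""
    headers = {"User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

Keeping each rule as a small pure function, with timestamps and robots.txt text passed in, makes the behavior easy to check without touching the network.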

Contact

If you operate one of the sites above and would like the crawler to slow down, change its access pattern, or stop accessing your site entirely, please email crawler@citizenly.ai. We honor opt-out requests within 48 hours.