Bulk downloads
Download a whole dataset as a single file (Parquet, gzipped CSV, or GeoParquet) — no per-request row caps, no pagination, no client-side stitching.
The endpoint is /v1/bulk/{namespace}/{table} on api.eolas.fyi. It's cached behind Cloudflare, so monthly snapshots typically arrive in milliseconds after the first download warms the edge.
The easy path — client.get() / eolas_get() are now smart
client.get() and eolas_get() are now smart entry points: they automatically route large or geospatial datasets through the cache+sync path, with no code change needed. The 15-minute client.get("nz_parcels") call is now a seconds-fast cached GeoParquet read.
How the routing works:
- If you pass any slice argument (
start=,end=,limit=) the live API is always used — a whole-file cache cannot serve a filtered subset. - Otherwise the client calls
info(name)to check the dataset'sbulk_export_class,geometry_type, androw_count_at_last_refresh. Datasets that are bulk-eligible AND (geospatial OR >100k rows) are routed through the cache+sync path. - Licence-restricted datasets (OECD etc.,
bulk_export_class="none") always go live regardless of size.
When the client first auto-routes a dataset through the cache path it prints a one-time message explaining what happened:
INFO eolas_data: auto-routing 'nz_parcels' through cache+sync (large/geo dataset).
Cache lives at ~/.cache/eolas/. Use mode='live' to override.
Python
from eolas_data import Client
client = Client("your_api_key")
# Smart default: nz_parcels auto-routes to cache+sync — first call downloads
# from CDN, subsequent calls in any session return in <1 s.
gdf = client.get("nz_parcels") # geopandas.GeoDataFrame
df = client.get("nz_cpi") # pd.DataFrame (small → stays live)
# Source-specific helpers inherit smart routing automatically:
gdf = client.linz("nz_parcels") # also auto-routes
Escape hatches when you need explicit control:
# Force the live Iceberg scan — useful for freshest data, OECD-restricted sources,
# or small slices (e.g. with limit=/start=/end=).
# Refused with 413 if the dataset is large/geo and you don't pass a filter.
gdf = client.get("nz_parcels", mode="live")
# Force the cache+sync path explicitly (same as get_local)
gdf = client.get("nz_parcels", mode="cached")
# get_local() is the explicit alias for mode="cached" — useful when you want
# to be unambiguous in a script or when cache_dir / format matter:
gdf = client.get_local("nz_parcels")
gdf = client.get_local("nz_parcels", cache_dir="/data/eolas", freshness="monthly")
format auto-detects from metadata (geo → geoparquet, else parquet). By default files cache at ~/.cache/eolas/. For reproducible pipelines, set up a library — see Authentication → Library. Exceptions from the bulk endpoint (BulkLicenceRestricted, BulkUpgradeRequired, BulkNotYetAvailable) propagate unchanged.
A download progress bar is shown automatically in interactive sessions (terminal or VSCode notebook) so 1 GB+ files are never silent. Pass progress=False / progress = FALSE to suppress it in scripts, or set EOLAS_NO_PROGRESS=1 in the environment for a global CI-friendly escape hatch.
R
library(eolas)
eolas_key("your_key")
# Smart default: auto-routes to cache+sync for large/geo datasets
gdf <- eolas_get("nz_parcels") # sf object
df <- eolas_get("nz_cpi") # data.frame (small → stays live)
# Source-specific helpers inherit smart routing:
gdf <- eolas_get_linz("nz_parcels") # also auto-routes
Escape hatches:
# Force the live Iceberg scan — useful for freshest data, OECD-restricted sources,
# or small slices (e.g. with limit/start/end).
# Refused with 413 if the dataset is large/geo and you don't pass a filter.
gdf <- eolas_get("nz_parcels", mode = "live")
# Force cache+sync (same as eolas_get_local)
gdf <- eolas_get("nz_parcels", mode = "cached")
# eolas_get_local() is the explicit alias — useful when cache_dir / format matter:
gdf <- eolas_get_local("nz_parcels")
df <- eolas_get_local("nz_cpi", cache_dir = "/data/eolas", format = "csv_gz")
See the Python reference and R reference for the full parameter list.
When to use bulk vs the live API
| You want… | Use |
|---|---|
| A few hundred or thousand recent rows | /v1/datasets/{name}/data (filterable, JSON/CSV/Arrow/Parquet) |
| The full dataset as one file | /v1/bulk/{ns}/{table} (this page) |
| The data joined live into your warehouse | Snowflake share (Enterprise) — see Snowflake share |
The live /data endpoint is capped (Free: 50,000 rows/request). For multi-million-row geospatial datasets like LINZ parcels, the bulk endpoint is the right tool — it lazy-generates the snapshot file once and serves it from CDN thereafter.
Tiers
The bulk path has two axes: licence (set per dataset, not by you) and freshness (set by your plan).
Licence axis
Each dataset's licence determines whether bulk is permitted at all:
- CC-BY-licensed datasets (the vast majority — Stats NZ, LINZ, RBNZ, MBIE, NZ Treasury, councils, etc.): bulk download permitted, subject to attribution.
- Non-CC-BY datasets (notably OECD, which we serve under terms that prohibit commercial redistribution): bulk returns 403. Query the data via the live
/dataendpoint instead.
On the dataset browse page, look for the green "Bulk download" badge (eligible) vs the slate "Query-only (licence)" badge (restricted).
Freshness axis
| Plan | Freshness |
|---|---|
| Free | Latest monthly snapshot (generated 1st of each month at 14:00 UTC). |
| Pro | The current Iceberg snapshot (generated on first request, served from cache thereafter). |
| Enterprise | Same as Pro, plus the Snowflake zero-copy share for direct warehouse access. |
If a Free user requests ?freshness=current, the response is 402 with a link to /pricing. If a Pro user requests ?freshness=monthly they get the monthly snapshot (sometimes useful for reproducibility).
Formats
| Format | Query | Best for |
|---|---|---|
| Parquet (default) | ?format=parquet |
Pandas, DuckDB, Polars, Spark — fastest read, best compression |
| gzipped CSV | ?format=csv_gz |
Excel, anything that doesn't speak Parquet |
| GeoParquet | ?format=geoparquet |
QGIS, GeoPandas, DuckDB-spatial — only available on datasets with geometry |
GeoParquet output uses OGC:CRS84 (WGS84 longitude/latitude) and the GeoParquet 1.0.0 metadata spec — every reader that supports the spec will pick it up natively.
Examples
curl
# Pro: current snapshot of NZ Treasury fiscal spending as Parquet
curl -H "X-API-Key: vs_..." \
-o treasury_fiscal_spending.parquet \
"https://api.eolas.fyi/v1/bulk/treasury/treasury_fiscal_spending?freshness=current&format=parquet"
# Any plan: latest monthly snapshot as gzipped CSV
curl -H "X-API-Key: vs_..." \
-o treasury_fiscal_spending.csv.gz \
"https://api.eolas.fyi/v1/bulk/treasury/treasury_fiscal_spending?freshness=monthly&format=csv_gz"
# GeoParquet for a Stats NZ Geo layer
curl -H "X-API-Key: vs_..." \
-o ta2023.geo.parquet \
"https://api.eolas.fyi/v1/bulk/statsnz_geo/territorial_authority_2023?freshness=current&format=geoparquet"
If you omit ?freshness=, the API picks the right one for your plan (Free → monthly, Pro → current) and 302-redirects to the canonical URL. Curl follows by default with -L.
Python (pandas)
import pandas as pd
import requests
headers = {"X-API-Key": "vs_..."}
url = "https://api.eolas.fyi/v1/bulk/treasury/treasury_fiscal_spending?freshness=current"
# pandas can read remote Parquet directly, but we route through requests
# so the API key header is included.
r = requests.get(url, headers=headers, allow_redirects=True)
r.raise_for_status()
df = pd.read_parquet(io.BytesIO(r.content))
print(df.shape, df.columns.tolist())
Python (GeoPandas — for geo datasets)
import io
import geopandas as gpd
import requests
headers = {"X-API-Key": "vs_..."}
url = ("https://api.eolas.fyi/v1/bulk/statsnz_geo/territorial_authority_2023"
"?freshness=current&format=geoparquet")
r = requests.get(url, headers=headers, allow_redirects=True)
r.raise_for_status()
gdf = gpd.read_parquet(io.BytesIO(r.content))
print(gdf.crs) # OGC:CRS84 (WGS84 lon/lat)
print(gdf.geometry.iloc[0].geom_type)
DuckDB (read Parquet directly via HTTP)
-- DuckDB can read remote Parquet without downloading first, but it can't
-- send a custom header. Easiest: download with curl, then query the local file.
INSTALL httpfs; LOAD httpfs;
SELECT count(*) FROM read_parquet('treasury_fiscal_spending.parquet');
Browser
On any dataset page, if you're logged in, click the Parquet / CSV.gz / GeoParquet download buttons in the header. Same gate as the API — Free gets the monthly snapshot, Pro gets the current.
Keeping a downloaded file in sync
After the first download, call sync_bulk / eolas_sync_bulk / eolas sync instead of download_bulk on subsequent runs. A HEAD request (~200 bytes) checks the server's X-Snapshot-Version header; if the local sidecar records the same snapshot, no data is transferred.
Python
from eolas_data import Client
client = Client("your_api_key")
r = client.sync_bulk("nz_cpi", path="nz_cpi.parquet")
# r.status is "downloaded" (first run), "updated" (new snapshot), or "unchanged"
# r.bytes_downloaded is 0 when unchanged
print(r.status, r.current_snapshot_id)
R
library(eolas)
eolas_key("your_key")
r <- eolas_sync_bulk("nz_cpi", path = "nz_cpi.parquet")
# r$status: "downloaded" | "updated" | "unchanged"
# r$bytes_downloaded: 0 when unchanged
message(r$status, " — snapshot: ", r$current_snapshot_id)
CLI (one-shot or watch loop)
# Single check
eolas sync nz_cpi --out ~/data/nz_cpi.parquet
# Foreground loop — polls every hour, exits on Ctrl-C
eolas sync nz_cpi --watch hourly --out ~/data/nz_cpi.parquet
The sidecar <path>.eolas-meta.json (written automatically) contains the snapshot id, download timestamp, format, and source URL. It is safe to delete — the next call will treat it as a first download. See the Python reference and R reference pages for the full sync_bulk API.
What you get in the download
Every bulk file is named {namespace}__{table}@{snapshot_id}.{ext}. The snapshot id is the Iceberg snapshot the file was generated from — files are immutable by name, so a new refresh of the source data produces a new URL (CDN cache invalidates itself).
Each bulk download also includes:
- A
NOTICE.txtsidecar in the eolas-bulk S3 bucket (also retrievable at{namespace}/{namespace}__{table}@{snapshot_id}.NOTICE.txt). Contains the source, licence, source URL, snapshot version, "data as of" date, and the attribution clause. - File-level metadata embedded in the Parquet/GeoParquet (key-value
geoblock for GeoParquet, standard Parquet metadata for the rest).
The NOTICE file must travel with the data if you redistribute it — that's how the CC-BY attribution obligation is satisfied. See Terms §5.
Caching, freshness, and refresh
- Monthly snapshots are pre-generated on the 1st of each month at 14:00 UTC by a server-side cron.
- Current snapshots for Pro are lazy-generated on first request and cached at S3 forever (well — for 180 days, per the bucket lifecycle policy).
- Cloudflare caches monthly URLs at the edge with
Cache-Control: public, max-age=31536000, immutable. URLs include the snapshot id, so a new refresh → new URL → automatic CDN invalidation. No manual cache busting needed. X-Snapshot-Versionresponse header tells you which Iceberg snapshot you got. Pin it in scripts if you need reproducibility.
Rate limits
| Layer | Limit |
|---|---|
| App-level monthly quota | Free: 10 req/month total (bulk counts). Pro/Enterprise: unlimited. |
| Per-IP at the edge | 5 req per 10 s (Cloudflare zone rule). Bursty downloads get a 10 s 429; spread requests if scripting. |
If you're systematically pulling many datasets, space requests by ~2 s and you'll never trip the per-IP rule.
Redistribution
You may redistribute bulk-downloaded files — including in commercial work and on paid platforms — provided the NOTICE.txt sidecar continues to travel with the data. Removing or altering it breaches the upstream CC-BY licence (not just our terms).
OECD and other non-CC-BY datasets are excluded from bulk; do not attempt to reconstruct them via repeated /v1/data calls — that's a breach of our acceptable use and the upstream licence.
Full terms: eolas.fyi/terms §5.