Research · Sources

Where the data comes from.

Shorelife integrates eight primary data sources, all public or government-archived, into a daily pipeline that runs at 9 AM PT. The county-direct scrapers below, which now cover 12 of 15 California coastal counties, surface official postings within hours of county publication; the statewide data.ca.gov feed lags by 2-8 weeks. Each source is attributed in the model feature space and versioned in the parquet archive.

Pipeline heartbeat: 28m ago

beaches: 1h ago
observations: 1h ago
advisories: 1h ago

12 of 15 CA coastal counties · San Diego, Orange, San Mateo, LA, Marin, Long Beach, East Bay Parks, Ventura, San Francisco, Humboldt, Sonoma, San Luis Obispo

County-Direct Advisory Scrapers

Last seen 1h ago

Update cadence

Daily at 16:00 UTC (9 AM PT). Retry-with-backoff on 403/429/5xx so transient county-site rate limits don't silently zero-out a county.
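The retry rule above can be sketched roughly as follows. This is a minimal illustration, not the code in fetch_county_advisories.py: the function names, the attempt count, and the exponential delay schedule are all assumptions, and `fetch` stands in for whatever HTTP call the scraper actually makes.

```python
import time
from typing import Callable, Tuple


def is_retryable(status: int) -> bool:
    """403, 429 and any 5xx are treated as transient, per the cadence note."""
    return status in (403, 429) or 500 <= status <= 599


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],  # hypothetical: returns (status, body)
    max_attempts: int = 4,                 # assumed value
    base_delay_s: float = 1.0,             # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch` until it returns HTTP 200, doubling the wait each retry."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        last_try = attempt == max_attempts - 1
        if not is_retryable(status) or last_try:
            # Fail loudly so a blocked county is reported, not silently zeroed.
            raise RuntimeError(f"fetch failed with HTTP {status}")
        sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("no attempts were made")
```

Raising after the final attempt, rather than returning an empty result, is what keeps a rate-limited county from looking like a county with no advisories.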

github.com/kylechoi101/surf-health/blob/main/backend/scripts/fetch_county_advisories.py ↗

Live advisory and closure postings pulled directly from each county health department's public source. There are three integration paths: (1) a structured public API where available (San Francisco, via the data.sfgov.org Socrata endpoint); (2) regex parsers against static HTML pages (the original 8 counties plus Humboldt and Sonoma); (3) Playwright-rendered LLM extraction for JavaScript-rendered dashboards (San Luis Obispo's ArcGIS dashboard). When a county scraper succeeds, its records become authoritative for that county, and stale state-feed records are automatically demoted (12 zombie postings older than a year were cleared in this rollout alone). Closures are exempted from the 14-day acute window so chronic situations stay visible until officially lifted.
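A rough sketch of the override and acute-window rules described above. The `Advisory` record, the `source` labels, and the function name are hypothetical, and we assume here that only the Closure type is exempt from the 14-day window; the real pipeline may handle the other types differently.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Set

ACUTE_WINDOW = timedelta(days=14)


@dataclass
class Advisory:
    county: str
    advisory_type: str  # Posting / Closure / Rain / Chronic
    started_at: date
    source: str         # "county" or "state" (hypothetical labels)


def visible_advisories(
    records: Iterable[Advisory],
    counties_scraped_ok: Set[str],
    today: date,
) -> List[Advisory]:
    """Apply the county-over-state override, then the acute window."""
    kept = []
    for rec in records:
        # A successful county scrape is authoritative: drop state-feed rows.
        if rec.source == "state" and rec.county in counties_scraped_ok:
            continue
        in_window = today - rec.started_at <= ACUTE_WINDOW
        # Closures stay visible regardless of age.
        if rec.advisory_type == "Closure" or in_window:
            kept.append(rec)
    return kept
```

Keying the override on the set of counties whose scrape succeeded today means a failed scrape falls back to the state feed instead of showing nothing.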

Fields used

advisory_type (Posting/Closure/Rain/Chronic) · station_code · started_at · cause · advisory_website (county source)

San Francisco (Socrata) · Humboldt · Sonoma · San Luis Obispo (LLM-extracted)

County-Direct Sample Feed

Update cadence

Daily, alongside the advisory scrape

github.com/kylechoi101/surf-health/blob/main/backend/scripts/fetch_county_advisories.py ↗

Same-day numeric water-quality values (enterococcus, fecal coliform, E. coli, total coliform) scraped directly from each county's public source, in parallel with the BeachWatch historical sample archive. This lets us apply AB411 single-sample thresholds (ENT ≥104, FECAL ≥400 MPN/100mL) to the latest reading per station without waiting for the state advisory feed to catch up. Used today for cross-validation; usable tomorrow as a near-real-time training signal. 200 sample readings currently captured. LLM extractions include hallucination defenses: sample_date must be within 90 days, analyte must be in a strict whitelist, and value must be plausible.
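The checks described above might look something like this. It is a sketch only: the function names and the plausibility ceiling are assumptions, and we additionally assume a future-dated sample should be rejected. Only the two thresholds quoted above are encoded; other analytes return no verdict.

```python
from datetime import date, timedelta
from typing import Optional

ALLOWED_ANALYTES = {"ENTEROCOCCUS", "FECAL_COLIFORM", "E_COLI", "TOTAL_COLIFORM"}
# AB411 single-sample thresholds quoted above, in MPN/100mL.
SINGLE_SAMPLE_LIMITS = {"ENTEROCOCCUS": 104.0, "FECAL_COLIFORM": 400.0}
MAX_AGE = timedelta(days=90)
MAX_PLAUSIBLE_VALUE = 1_000_000.0  # assumed ceiling, not from the source


def is_plausible(analyte: str, value: float, sample_date: date, today: date) -> bool:
    """Reject LLM-extracted readings that fail any of the three defenses."""
    if analyte not in ALLOWED_ANALYTES:
        return False
    age = today - sample_date
    if age < timedelta(0) or age > MAX_AGE:  # future-dated or too old
        return False
    return 0.0 <= value <= MAX_PLAUSIBLE_VALUE


def exceeds_limit(analyte: str, value: float) -> Optional[bool]:
    """True/False against the single-sample limit; None if no limit is listed."""
    limit = SINGLE_SAMPLE_LIMITS.get(analyte)
    return None if limit is None else value >= limit
```

Returning None for analytes without a listed limit keeps "no threshold applied" distinct from "below threshold" in the exceeds_limit field.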

Fields used

analyte (ENTEROCOCCUS/FECAL_COLIFORM/E_COLI/TOTAL_COLIFORM) · value (MPN/100mL) · sample_date · exceeds_limit · station_code · source_url

CDPH / CA SWRCB

CA BeachWatch (statewide feed)

Last seen 1h ago

Update cadence

Batch-ingested daily, each morning. Status changes lag county-direct sources by weeks.

data.ca.gov/dataset/beach-monitoring ↗

Official culture-based marine enterococcus sample results and the statewide advisory feed. Used (a) for the multi-year historical sample archive that trains the ML model, and (b) as a fallback for advisory status in the counties where direct scraping is blocked or unavailable: Monterey (a datacenter-IP WAF blocks all our requests), Santa Barbara (the public page lists sampling sites but does not publish current status), and Mendocino and Santa Cruz (no first-class scraper yet). The statewide feed's status changes lag county-direct sources by 2-8 weeks, which is why the County-Direct entry above overrides it for the 12 covered counties.

Fields used

enterococcus CFU/100mL · advisory type · advisory start/end · sample station ID

NOAA NDBC

NDBC Buoy Network

Last seen 1h ago

Update cadence

Hourly; aggregated to daily forecast-safe windows (≤ 5 AM PT)

www.ndbc.noaa.gov ↗

Hourly ocean observations from nearshore buoys: significant wave height, dominant period, sea-surface temperature, salinity, and wind.

Fields used

wave_height_m · dominant_period_s · water_temperature_c · salinity_psu · wind_speed_mps

Scripps Institution of Oceanography

CDIP Wave Model

Update cadence

Hourly; aggregated daily

cdip.ucsd.edu ↗

Nearshore wave model output and buoy telemetry at California coastal sites, used to supplement NDBC coverage.

Fields used

Hs · Tp · wave direction

Open-Meteo / ECMWF

Open-Meteo ERA5-Land

Update cadence

Backfill on first run (6-year archive); then daily append. Cached per (lat, lon, date) parquet.

open-meteo.com ↗

Reanalysis archive of hourly cloud cover, shortwave radiation, UV index, and near-surface wind at 0.1° resolution. Used for solar inactivation and wind plume features.
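For the wind plume features, one common way to derive a shore-normal component from wind_u and wind_v is a simple projection onto the onshore direction. The sketch below assumes the usual meteorological convention (u is the eastward component, v the northward component, both describing where the air is moving) and a hypothetical per-beach `onshore_bearing_deg` parameter; we have not confirmed that shore_normal_wind_ms is computed this way.

```python
import math


def shore_normal_wind(wind_u: float, wind_v: float, onshore_bearing_deg: float) -> float:
    """Project the wind vector onto the sea-to-land direction.

    onshore_bearing_deg is the compass bearing (clockwise from north) of the
    direction pointing from the water toward the beach. This is a hypothetical
    per-beach parameter. Positive output means wind blowing onshore.
    """
    theta = math.radians(onshore_bearing_deg)
    # Unit vector for a compass bearing is (east, north) = (sin, cos).
    return wind_u * math.sin(theta) + wind_v * math.cos(theta)
```

For a west-facing beach the onshore direction points east (bearing 90), so a 5 m/s wind moving eastward gives +5 m/s, and the same wind moving westward gives -5 m/s.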

Fields used

uv_index · shortwave_radiation · cloud_cover · wind_u · wind_v · solar_inactivation_index · shore_normal_wind_ms

USGS National Water Information System

USGS NWIS

Update cadence

Daily; lag-adjusted (1–3 day lag to beach)

waterdata.usgs.gov/nwis ↗

Daily streamflow and gage-height records for rivers and creeks draining to monitored beaches. Used as a hydrology covariate that proxies storm-runoff pollution.
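The 1-3 day lag adjustment could be expressed as one lagged feature per day of delay. This is an illustrative sketch: the function name and the output feature names are made up here, and the pipeline may instead pick a single lag per beach.

```python
from datetime import date, timedelta
from typing import Dict, Optional, Sequence


def lagged_discharge(
    daily_cfs: Dict[date, float],
    forecast_day: date,
    lags_days: Sequence[int] = (1, 2, 3),
) -> Dict[str, Optional[float]]:
    """Look up upstream discharge 1-3 days before the forecast day.

    Missing gauge days come back as None rather than 0, so a data gap is
    not mistaken for a dry creek.
    """
    return {
        f"discharge_cfs_lag{k}d": daily_cfs.get(forecast_day - timedelta(days=k))
        for k in lags_days
    }
```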

Fields used

discharge_cfs · gage_height_ft

CA SWRCB

CEDEN

Update cadence

Bulk pull (≤ 50,000 rows); merged at pipeline init

ceden.waterboards.ca.gov ↗

California Environmental Data Exchange Network — additional water quality measurements used to supplement BeachWatch gaps.

Fields used

enterococcus · E. coli (excluded from model labels) · total coliform (excluded)

Forecast-safe cutoff at 5 AM PT.

The pipeline runs daily at 9 AM PT (16:00 UTC). All environmental covariates are summarised up to 5 AM PT of the forecast day so no same-morning laboratory results leak into the features. Open-Meteo data is cached per (lat, lon, date) in a local parquet store and incrementally updated; a full 6-year backfill runs only on the first cold-cache CI run.
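A minimal sketch of the cutoff filter, assuming observations arrive as (timezone-aware timestamp, value) pairs. The function names and the optional `tz` argument are ours for illustration; the default zone is America/Los_Angeles, which needs the system tz database.

```python
from datetime import date, datetime, time, tzinfo
from typing import Iterable, List, Optional, Tuple
from zoneinfo import ZoneInfo


def forecast_cutoff(forecast_day: date, tz: Optional[tzinfo] = None) -> datetime:
    """5 AM Pacific on the forecast day, as a timezone-aware datetime."""
    zone = tz if tz is not None else ZoneInfo("America/Los_Angeles")
    return datetime.combine(forecast_day, time(5, 0), tzinfo=zone)


def forecast_safe(
    observations: Iterable[Tuple[datetime, float]],
    forecast_day: date,
    tz: Optional[tzinfo] = None,
) -> List[Tuple[datetime, float]]:
    """Keep only observations stamped at or before the 5 AM PT cutoff."""
    cutoff = forecast_cutoff(forecast_day, tz)
    # Aware datetimes compare correctly across zones (e.g. UTC vs Pacific).
    return [(ts, value) for ts, value in observations if ts <= cutoff]
```

Filtering before any daily aggregation is the important part: anything stamped after the cutoff never reaches the feature window, whatever time the job itself runs.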

Pipeline reference →
