
Where the data comes from.

Shorelife integrates eight primary data sources, all public or government-archived, into a daily pipeline that runs at 9 AM PT. The county-direct scrapers below, which now cover 12 of 15 California coastal counties, surface official postings within hours of county publication, whereas the statewide data.ca.gov feed lags by 2-8 weeks. Each source is attributed in the model feature space and versioned in the parquet archive.

Pipeline heartbeat: 23h ago · beaches: 24h ago · observations: 24h ago · advisories: 24h ago
12 of 15 CA coastal counties · San Diego, Orange, San Mateo, LA, Marin, Long Beach, East Bay Parks, Ventura, San Francisco, Humboldt, Sonoma, San Luis Obispo
County-Direct Advisory Scrapers
Last seen 24h ago
Update cadence
Daily at 16:00 UTC (9 AM PT). Requests are retried with backoff on 403/429/5xx responses so that transient county-site rate limits don't silently zero out a county.
github.com/kylechoi101/surf-health/blob/main/backend/scripts/fetch_county_advisories.py
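
The retry behaviour described in the update cadence could look roughly like the sketch below. It is not taken from the linked script: `fetch_with_backoff`, `is_retryable`, the attempt count, and the doubling delay are all illustrative assumptions. The fetch callable is injected so the logic can be exercised without a network.

```python
import time
from typing import Callable, Tuple


def is_retryable(status: int) -> bool:
    """Treat 403, 429 and any 5xx as transient (per the cadence note)."""
    return status in (403, 429) or 500 <= status <= 599


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,          # assumed value, not from the source
    base_delay_s: float = 1.0,      # assumed value, not from the source
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it succeeds, hits a hard error, or runs out of attempts."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        last_attempt = attempt == max_attempts - 1
        if not is_retryable(status) or last_attempt:
            # Raise instead of returning an empty result, so a blocked
            # county is reported as a failure rather than as "no advisories".
            raise RuntimeError(f"fetch failed with HTTP {status}")
        sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("no attempts were made")
```

Raising on exhaustion is the design choice that matters here: a scraper that returns an empty list after repeated 429s would be indistinguishable from a county with no active postings.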

Live advisory and closure postings pulled directly from each county health department's public source. There are three integration paths: (1) a structured public API where available (San Francisco via the data.sfgov.org Socrata endpoint), (2) regex parsers against static HTML pages (the original 8 counties plus Humboldt and Sonoma), and (3) Playwright-rendered LLM extraction for JavaScript-rendered dashboards (San Luis Obispo's ArcGIS dashboard). When a county scraper succeeds, its records become authoritative for that county and stale state-feed records are demoted automatically (this rollout alone cleared 12 zombie postings more than a year old). Closures are exempted from the 14-day acute window so chronic situations stay visible until officially lifted.

Fields used
advisory_type (Posting/Closure/Rain/Chronic) · station_code · started_at · cause · advisory_website (county source)
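
The override and acute-window rules in this entry can be illustrated with a minimal sketch. The `Advisory` record, the `source` tag, and `merge_advisories` are hypothetical names introduced for illustration; only the 14-day window, the closure exemption, and the "county beats state" rule come from the description above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Set

ACUTE_WINDOW = timedelta(days=14)


@dataclass(frozen=True)
class Advisory:
    county: str
    station_code: str
    advisory_type: str  # Posting / Closure / Rain / Chronic
    started_at: date
    source: str         # "county" or "state" (hypothetical tag)


def is_visible(adv: Advisory, today: date) -> bool:
    """Closures stay visible until lifted; everything else ages out after 14 days."""
    if adv.advisory_type == "Closure":
        return True
    return today - adv.started_at <= ACUTE_WINDOW


def merge_advisories(
    records: Iterable[Advisory], scraped_ok: Set[str], today: date
) -> List[Advisory]:
    """Drop state-feed records for counties whose direct scraper succeeded."""
    kept = []
    for adv in records:
        if adv.source == "state" and adv.county in scraped_ok:
            continue  # county-direct data is authoritative here
        if is_visible(adv, today):
            kept.append(adv)
    return kept
```

Counties that are not in `scraped_ok` keep their state-feed records, which matches the fallback role the statewide feed plays for uncovered counties.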
San Francisco (Socrata) · Humboldt · Sonoma · San Luis Obispo (LLM-extracted)
County-Direct Sample Feed

Same-day numeric water-quality values (enterococcus, fecal coliform, E. coli, total coliform) scraped directly from each county's public source, in parallel to the BeachWatch historical sample archive. This lets us apply AB411 single-sample thresholds (ENT ≥104, FECAL ≥400 MPN/100mL) to the latest reading per station, independently of when the state advisory feed catches up. Used today for cross-validation; usable tomorrow as a near-real-time training signal. 200 sample readings are currently captured. LLM extractions include hallucination defenses: sample_date must be within 90 days, analyte must be in a strict whitelist, and value must be plausible.

Fields used
analyte (ENTEROCOCCUS/FECAL_COLIFORM/E_COLI/TOTAL_COLIFORM) · value (MPN/100mL) · sample_date · exceeds_limit · station_code · source_url
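
As a rough sketch of the checks this entry describes, the snippet below validates an extracted reading and then applies the two thresholds quoted above. The function names and the plausibility ceiling `MAX_PLAUSIBLE_VALUE` are assumptions; the source only says the value "must be plausible" without giving a bound.

```python
from datetime import date, timedelta
from typing import Optional

# Thresholds quoted in the entry above (MPN/100mL).
AB411_LIMITS = {"ENTEROCOCCUS": 104.0, "FECAL_COLIFORM": 400.0}
ALLOWED_ANALYTES = {"ENTEROCOCCUS", "FECAL_COLIFORM", "E_COLI", "TOTAL_COLIFORM"}
MAX_AGE = timedelta(days=90)
MAX_PLAUSIBLE_VALUE = 1_000_000.0  # hypothetical ceiling, not from the source


def is_valid_sample(analyte: str, value: float, sample_date: date, today: date) -> bool:
    """Reject readings an LLM extraction could plausibly have invented."""
    if analyte not in ALLOWED_ANALYTES:
        return False
    age = today - sample_date
    if age < timedelta(0) or age > MAX_AGE:
        return False  # future-dated or older than 90 days
    return 0.0 <= value <= MAX_PLAUSIBLE_VALUE


def exceeds_limit(analyte: str, value: float) -> Optional[bool]:
    """True/False for analytes with a quoted threshold, None otherwise."""
    limit = AB411_LIMITS.get(analyte)
    if limit is None:
        return None
    return value >= limit
```

Returning `None` for E. coli and total coliform keeps "no threshold quoted here" distinct from "below the limit".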
CDPH / CA SWRCB
CA BeachWatch (statewide feed)
Last seen 24h ago
Update cadence
Daily; batch-ingested each morning. Status-change cadence lags county-direct sources by 2-8 weeks.
data.ca.gov/dataset/beach-monitoring

Official culture-based marine enterococcus sample results and the statewide advisory feed. Used (a) for the multi-year historical sample archive that trains the ML model, and (b) as a fallback for advisory status in the counties where direct scraping is blocked or unavailable: Monterey (a datacenter-IP WAF blocks all our requests), Santa Barbara (the public page lists sampling sites but does not publish current status), and Mendocino and Santa Cruz (no first-class scraper yet). The statewide feed's status-change cadence lags county-direct sources by 2-8 weeks, which is why the County-Direct entry above overrides it for the 12 covered counties.

Fields used
enterococcus CFU/100mL · advisory type · advisory start/end · sample station ID
NOAA NDBC
NDBC Buoy Network
Last seen 24h ago
Update cadence
Hourly; aggregated to daily forecast-safe windows (≤ 5 AM PT)
www.ndbc.noaa.gov

Hourly ocean observations from nearshore buoys: significant wave height, dominant period, sea-surface temperature, salinity, and wind.

Fields used
wave_height_m · dominant_period_s · water_temperature_c · salinity_psu · wind_speed_mps
Scripps Institution of Oceanography
CDIP Wave Model
Update cadence
Hourly; aggregated daily
cdip.ucsd.edu

Nearshore wave model output and buoy telemetry at California coastal sites, used to supplement NDBC coverage.

Fields used
Hs · Tp · wave direction
Open-Meteo / ECMWF
Open-Meteo ERA5-Land
Update cadence
Backfill on first run (6-year archive); then daily append. Cached per (lat, lon, date) parquet.
open-meteo.com

Reanalysis archive of hourly cloud cover, shortwave radiation, UV index, and near-surface wind at 0.1° resolution. Used for solar inactivation and wind plume features.

Fields used
uv_index · shortwave_radiation · cloud_cover · wind_u · wind_v · solar_inactivation_index · shore_normal_wind_ms
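
One plausible way to derive a shore-normal wind feature from the two wind components is a simple projection, sketched below. This is an assumption about how `shore_normal_wind_ms` might be computed, not the pipeline's confirmed formula: it assumes `wind_u` is the eastward and `wind_v` the northward component in m/s, and it introduces a hypothetical per-beach `onshore_bearing_deg` (compass bearing of the direction pointing from sea to land).

```python
import math


def shore_normal_wind_ms(wind_u: float, wind_v: float, onshore_bearing_deg: float) -> float:
    """Project the wind vector onto the onshore direction.

    Positive values mean the wind blows toward the beach, negative values
    mean it blows offshore. The bearing is in degrees clockwise from north.
    """
    theta = math.radians(onshore_bearing_deg)
    # Unit vector of the onshore direction in (east, north) coordinates.
    onshore_east, onshore_north = math.sin(theta), math.cos(theta)
    return wind_u * onshore_east + wind_v * onshore_north
```

For a west-facing beach the onshore direction points east (bearing 90), so a wind with `wind_u = 5.0` and `wind_v = 0.0` projects to roughly 5 m/s onshore, while a purely northward wind projects to roughly zero.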
USGS National Water Information System
USGS NWIS
Update cadence
Daily; lag-adjusted (1–3 day lag to beach)
waterdata.usgs.gov/nwis

Daily streamflow and gage-height records for rivers and creeks draining to monitored beaches. Used as hydrology covariate proxying storm-runoff pollution.

Fields used
discharge_cfs · gage_height_ft
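
The lag adjustment mentioned in the cadence line can be pictured as a date-shifted lookup. The sketch below is illustrative only: `lagged_discharge` and the per-gage `lag_days` argument are hypothetical, and how the pipeline actually chooses a lag within the 1 to 3 day range is not stated in the source.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def lagged_discharge(
    discharge_by_day: Dict[date, float], forecast_day: date, lag_days: int
) -> Optional[float]:
    """Return the discharge_cfs reading from lag_days before the forecast day."""
    if not 1 <= lag_days <= 3:
        raise ValueError("lag_days must be between 1 and 3")
    # A missing day yields None rather than a silently imputed value.
    return discharge_by_day.get(forecast_day - timedelta(days=lag_days))
```

For example, with a 2-day lag a forecast for 30 June would read the gage value recorded on 28 June.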
CA SWRCB
CEDEN
Update cadence
Bulk pull (≤ 50,000 rows); merged at pipeline init
ceden.waterboards.ca.gov

California Environmental Data Exchange Network — additional water quality measurements used to supplement BeachWatch gaps.

Fields used
enterococcus · E. coli (excluded from model labels) · total coliform (excluded)

Forecast-safe cutoff at 5 AM PT.

The pipeline runs daily at 9 AM PT (16:00 UTC). All environmental covariates are summarised up to 5 AM PT of the forecast day, so no same-morning laboratory results leak into the features. Open-Meteo data is cached per (lat, lon, date) in a local parquet store and updated incrementally; a full 6-year backfill runs only on the first cold-cache CI run.
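
The cutoff can be sketched as a timezone-aware filter. The helper names below are hypothetical and the observations are assumed to be `(UTC timestamp, value)` pairs; the only facts taken from the text are the 5 AM Pacific boundary and that readings up to that time are allowed.

```python
from datetime import date, datetime, time, timezone
from typing import List, Tuple
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def forecast_cutoff_utc(forecast_day: date) -> datetime:
    """5 AM Pacific on the forecast day, expressed in UTC (DST-aware)."""
    local_cutoff = datetime.combine(forecast_day, time(5, 0), tzinfo=PACIFIC)
    return local_cutoff.astimezone(timezone.utc)


def forecast_safe(
    observations: List[Tuple[datetime, float]], forecast_day: date
) -> List[Tuple[datetime, float]]:
    """Keep only observations timestamped at or before the cutoff."""
    cutoff = forecast_cutoff_utc(forecast_day)
    return [(ts, value) for ts, value in observations if ts <= cutoff]
```

Resolving the cutoff through a named timezone rather than a fixed UTC offset keeps the boundary at 5 AM local time across daylight-saving changes (12:00 UTC in summer, 13:00 UTC in winter).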
