Where the data
comes from.
Shorelife integrates eight primary data sources, all public or government-archived, into a daily pipeline that runs at 9 AM PT. The county-direct scrapers below — now covering 12 of 15 California coastal counties — surface official postings within hours of county publication, vs the statewide data.ca.gov feed which lags by 2-8 weeks. Each source is attributed in the model feature space and versioned in the parquet archive.
Live advisory + closure postings pulled directly from each county health department's public source. Three integration paths: (1) a structured public API where available (San Francisco via the data.sfgov.org Socrata endpoint), (2) regex parsers against static HTML pages (the original 8 counties + Humboldt + Sonoma), (3) Playwright-rendered LLM extraction for JavaScript-rendered dashboards (San Luis Obispo's ArcGIS dashboard). When a county scraper succeeds, its records become authoritative for that county — stale state-feed records auto-demote (12 zombie postings older than a year cleared from this rollout alone). Closures are exempted from the 14-day acute window so chronic situations stay visible until officially lifted.
Same-day numeric water-quality values — enterococcus, fecal coliform, E. coli, total coliform — scraped directly from each county's public source, parallel to the BeachWatch historical sample archive. Lets us apply AB411 single-sample thresholds (ENT ≥104, FECAL ≥400 MPN/100mL) to the latest reading per station independently of when the state advisory feed catches up. Used today for cross-validation; usable tomorrow as a near-real-time training signal. 200 sample readings currently captured. LLM extractions include hallucination defenses: sample_date must be within 90 days, analyte must be in a strict whitelist, value must be plausible.
Official marine enterococcus culture-based sample results and the statewide advisory feed. Used (a) for the multi-year historical sample archive that trains the ML model, and (b) as a fallback for advisory status in the 3 counties where direct scraping is blocked or unavailable: Monterey (datacenter-IP WAF blocks all our requests), Santa Barbara (public page lists sampling sites but does not publish current status), and Mendocino / Santa Cruz (no first-class scraper yet). The statewide feed's status-change cadence lags county-direct sources by 2-8 weeks, which is why the County-Direct entry above overrides it for the 12 covered counties.
Hourly ocean observations from nearshore buoys: significant wave height, dominant period, sea-surface temperature, salinity, and wind.
Nearshore wave model output and buoy telemetry at California coastal sites, used to supplement NDBC coverage.
Reanalysis archive of hourly cloud cover, shortwave radiation, UV index, and near-surface wind at 0.1° resolution. Used for solar inactivation and wind plume features.
Daily streamflow and gage-height records for rivers and creeks draining to monitored beaches. Used as hydrology covariate proxying storm-runoff pollution.
California Environmental Data Exchange Network — additional water quality measurements used to supplement BeachWatch gaps.
Forecast-safe cutoff at 5 AM PT.
The pipeline runs daily at 6 AM PT. All environmental covariates are summarised up to 5 AM PT of the forecast day so no same-morning laboratory results leak into the features. Open-Meteo data is cached per (lat, lon, date) in a local parquet store and incrementally updated; a full 6-year backfill runs only on the first cold-cache CI run.
Pipeline reference →