How the forecast is built.
Daily forecast across 9 of every 10 California surf spots — 229 of 250 ocean-facing beach groups, plus 66 of 78 bay and lake monitoring sites. Beaches outside the model show their latest official sample instead of a forecast.
Numerically grounded by default.
Label policy
The forecast label is marine enterococcus exceedance, scored per analysis method. Culture samples (MPN or CFU per 100 mL) are compared against the single-sample STV of 104, and ddPCR results (copies/100 mL) against the molecular threshold of 1,413 copies. This corrects an earlier bug that compared ddPCR copies against the culture threshold and falsely flagged most San Diego rapid-test samples. Freshwater E. coli and total/fecal coliform stay in the warehouse, outside the pooled label.
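The method-aware scoring described above can be sketched as a small lookup. This is a minimal sketch, not the production code: the function name `is_exceedance`, the method keys, and the at-or-above comparison are assumptions made for illustration. Only the two threshold values come from the text.

```python
# Thresholds taken from the label policy; the dictionary keys are hypothetical.
THRESHOLDS = {
    "culture": 104.0,   # MPN or CFU per 100 mL
    "ddpcr": 1413.0,    # DNA copies per 100 mL
}


def is_exceedance(value: float, method: str) -> bool:
    """Score a marine enterococcus result against its own method's threshold."""
    if method not in THRESHOLDS:
        raise ValueError(f"unknown analysis method: {method!r}")
    # Assumption: a result at or above the threshold counts as an exceedance.
    return value >= THRESHOLDS[method]


# The same number gives different labels depending on the method:
# 500 copies/100 mL is below the ddPCR threshold, but 500 MPN/100 mL
# is above the culture threshold.
print(is_exceedance(500.0, "ddpcr"))
print(is_exceedance(500.0, "culture"))
```

Comparing every result against a single number, as the earlier bug did, amounts to always using the `"culture"` entry.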
Daily forecast
The pipeline refreshes once each morning. It is a batch forecast that cannot react to an intra-day spill after publication.
Baselines first
Persistence, logistic/linear, and gradient-boosted tree baselines are mandatory. A gradient-boosted tree from the hist_gbm family serves production. It was chosen by held-out leave-one-county-out and leave-one-beach-out spatial backtests rather than a single temporal split, so the winner has to generalize to unseen locations, not just memorize recent patterns. Sequence models (LSTM/TCN) train as research candidates.
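A leave-one-county-out backtest holds out every sample from one county, trains on the rest, and repeats for each county. The splitter below is a minimal standard-library sketch of that idea. The function name and the toy county labels are assumptions, and the actual backtest harness is not described in the text.

```python
from collections import defaultdict
from typing import Hashable, Iterator, List, Sequence, Tuple


def leave_one_group_out(
    groups: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for held_out, test_idx in by_group.items():
        # Train on every row that does not belong to the held-out group.
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# One group label per sample row (toy data).
counties = ["San Diego", "San Diego", "Orange", "Marin", "Orange"]
folds = list(leave_one_group_out(counties))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Passing beach identifiers instead of county names gives the leave-one-beach-out variant. scikit-learn's `LeaveOneGroupOut` produces the same kind of split for anyone who prefers a library implementation.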
Calibration
Public-facing bands come from calibrated exceedance probabilities. Unsupported stations do not receive a colored risk claim or a fallback model badge on public surfaces.
Model forecast and official postings, side by side.
Shorelife runs two parallel feeds. The model produces a daily probability-of-exceedance forecast from sample history and ocean covariates. A second pipeline scrapes the live advisory page that each county health department publishes. This gives same-day visibility into Postings, Closures, and Rain Advisories, whereas the statewide data.ca.gov feed lags by 4 to 8 weeks on status changes.
12 counties live-scraped
San Diego, Orange, San Mateo, Los Angeles, Marin, Long Beach, East Bay Parks, Ventura, San Francisco, Humboldt, Sonoma, San Luis Obispo. Monterey and Santa Barbara still rely on the state feed (Monterey blocks datacenter traffic; Santa Barbara does not publish current postings on the public web).
Direct or AI extraction
San Francisco is ingested directly from its Socrata API; Humboldt and Sonoma are read by structured-output LLM extraction; San Luis Obispo is rendered with a headless browser before extraction. All 12 fall back to the state feed if the county source fails.
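The fallback behaviour can be sketched as a wrapper that tries the county source first and uses the state feed only when that attempt fails. Everything named here (`fetch_with_fallback`, the two stand-in fetchers, the record fields) is hypothetical, and treating any exception as a failure is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def fetch_with_fallback(
    county_fetch: Callable[[], List[Record]],
    state_fetch: Callable[[], List[Record]],
) -> Tuple[str, List[Record]]:
    """Return (source_name, records), preferring the county source."""
    try:
        return "county", county_fetch()
    except Exception:
        # Assumption: any county-side failure triggers the state-feed fallback.
        return "state", state_fetch()


# Stand-in fetchers for illustration only.
def broken_county_source() -> List[Record]:
    raise TimeoutError("county page did not respond")


def state_source() -> List[Record]:
    return [{"station": "example-station", "status": "Posted"}]


source, records = fetch_with_fallback(broken_county_source, state_source)
print(source, len(records))
```

Returning the source name alongside the records lets downstream code tell a same-day county reading from a lagging state-feed one.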
County-direct sample feed
A separate feed, county_direct_samples.parquet, captures the numeric enterococcus and coliform values each county publishes, independent of the BeachWatch sample archive. It is used today for cross-validation and could serve later as a near-real-time training signal.
AB411 single-sample triggers
For counties that publish raw values rather than a posted/not-posted flag (San Francisco, Humboldt, Sonoma), Shorelife applies the same California AB411 single-sample thresholds the counties themselves apply: enterococcus (ENT) ≥ 104 MPN/100 mL and fecal coliform (FECAL) ≥ 400 MPN/100 mL.
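Applied to one raw sample, the two triggers reduce to a per-indicator comparison. The sketch below assumes a sample arrives as a dictionary keyed by indicator code. The function name and that input shape are illustrative only; the thresholds are the ones stated above.

```python
from typing import Dict, List, Optional

# Single-sample thresholds in MPN/100 mL, as stated in the text.
AB411_SINGLE_SAMPLE = {"ENT": 104.0, "FECAL": 400.0}


def ab411_triggers(sample: Dict[str, Optional[float]]) -> List[str]:
    """Return the indicator codes whose value is at or above its threshold."""
    triggered = []
    for indicator, limit in AB411_SINGLE_SAMPLE.items():
        value = sample.get(indicator)
        # A missing or unreported indicator cannot trigger.
        if value is not None and value >= limit:
            triggered.append(indicator)
    return triggered


print(ab411_triggers({"ENT": 120.0, "FECAL": 35.0}))
print(ab411_triggers({"ENT": 10.0, "FECAL": None}))
```

A non-empty result would correspond to a posted station; an empty list means neither indicator triggered.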
Advisory band, not model band
When the county marks a station Posted or Closed, the public surface shows a purple Advisory band with the cause and a deep-link back to the county source — never overridden by a model-derived risk band.
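The precedence rule can be written as a single function: a county Posted or Closed status always wins, and a station without model coverage gets no colored band at all. This is a sketch under assumptions. `public_band` and the model band labels are hypothetical, and the model band is treated as an opaque string because the text does not name the bands.

```python
from typing import Optional

# County statuses that force the Advisory band, per the text.
ADVISORY_STATUSES = {"Posted", "Closed"}


def public_band(county_status: Optional[str], model_band: Optional[str]) -> Optional[str]:
    """Choose the band shown on the public surface for one station."""
    if county_status in ADVISORY_STATUSES:
        # The official advisory is never overridden by a model-derived band.
        return "Advisory"
    # model_band is None for unsupported stations, which get no risk color.
    return model_band


print(public_band("Closed", "low"))
print(public_band(None, "low"))
print(public_band(None, None))
```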
Stale records auto-demote
When a first-class county scraper succeeds, any state-feed record for that county that the county no longer affirms is moved to historical status. This closes a real failure mode in which year-old administrative entries kept showing as active.
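One way to picture the demotion step: after a successful county scrape, every active state-feed record for that county is checked against the set of stations the county currently lists. The record fields (`county`, `station_id`, `status`) and the function name are assumptions for illustration, not the actual schema.

```python
from typing import Dict, Iterable, List, Set


def demote_stale(
    state_records: Iterable[Dict[str, str]],
    scraped_county: str,
    county_active_ids: Set[str],
) -> List[Dict[str, str]]:
    """Return copies of the state records, demoting those the county no longer affirms."""
    result = []
    for rec in state_records:
        rec = dict(rec)  # copy so the caller's records are not mutated
        if (
            rec["county"] == scraped_county
            and rec["status"] == "active"
            and rec["station_id"] not in county_active_ids
        ):
            rec["status"] = "historical"
        result.append(rec)
    return result


state_feed = [
    {"county": "Marin", "station_id": "M1", "status": "active"},
    {"county": "Marin", "station_id": "M2", "status": "active"},
    {"county": "Monterey", "station_id": "X9", "status": "active"},
]
# The Marin scrape succeeded and currently lists only M1 as posted.
updated = demote_stale(state_feed, "Marin", {"M1"})
print([r["status"] for r in updated])
```

Records from counties without a successful scrape, like the Monterey entry here, are left untouched.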
What this model can't do.
- 01 Forecasts complement official monitoring; they do not replace county advisories.
- 02 Only a subset of beaches currently has model coverage; the rest should be read as latest-official-sample views, not forecast gaps filled by guesswork.
- 03 Heavy storm, sewage, or spill events can outrun any historical statistical model, especially after the morning publish cutoff.
- 04 LLM explanations summarize the forecast; the numeric risk comes from the ML model.
Sources and prior work.
- [1] Searcy, R. T. & Boehm, A. B. (2021). A day at the beach: enabling coastal water quality prediction with high-frequency sampling and machine learning. Water Research · DOI: 10.1016/j.watres.2021.117051
- [2] U.S. EPA. Virtual Beach 3 (VB3): User Guide. Enterococcus-based per-station MLR as production baseline. EPA VB3 Reference Manual
- [3] California Department of Public Health. AB411 Annual Beach Report. County advisory thresholds and official culture-based sampling protocol. CDPH AB411 / Beach Monitoring Program
- [4] NOAA National Data Buoy Center (NDBC). Real-time and historical wave, wind, and sea surface temperature observations used as model covariates. NDBC · noaa.gov/ndbc
- [5] Scripps Institution of Oceanography. Coastal Data Information Program (CDIP). Nearshore wave model output and buoy telemetry. CDIP · cdip.ucsd.edu
- [6] Open-Meteo. Open-source weather API: hourly UV index and solar radiation used for the solar inactivation index feature. Open-Meteo API · open-meteo.com