Database Architecture

The Pristine Seas Science Database is a modular, analysis-ready system for organizing ecological data collected across more than 40 expeditions worldwide. Its design balances flexibility—to accommodate diverse survey methods—with consistency—to support clean integration, querying, and long-term analysis.

Design Philosophy

This architecture prioritizes:

  • Traceability: Every observation links back to spatial and taxonomic references
  • Flexibility: New methods integrate seamlessly without restructuring existing data
  • Analysis-Ready: Pre-computed summaries enable fast queries while preserving raw data for reproducibility

Core Structure

All data are hosted in Google BigQuery under the project pristine-seas, organized into two major dataset groups:

  • Method Datasets: One for each survey technique (e.g., UVS, BRUVS, eDNA), with standardized tables for sites, stations, observations, and summaries
  • Reference Datasets: Shared taxonomic, spatial, and lookup tables that provide a unified backbone across all methods

graph TB
    A[pristine-seas] --> B[Method Datasets]
    A --> C[Reference Datasets]
    
    B --> D[uvs]
    B --> E[BRUVS_seabed]
    B --> F[BRUVS_pelagic]
    B --> G[edna]
    B --> H[sub]
    B --> I[...]
    
    C --> J[expeditions]
    C --> K[taxa_info]
    C --> L[lookup]
    
    style A fill:#004165,stroke:#fff,color:#fff
    style C fill:#006C9B,stroke:#fff,color:#fff
    style B fill:#006C9B,stroke:#fff,color:#fff

High-level database organization


Reference Datasets

These datasets provide a shared scaffold for spatial joins, species trait integration, QA/QC, and consistent metadata across all methods:

📂 expeditions/Spatial backbone & metadata for all field campaigns

  • 📄 info — Expedition metadata: dates, regions, vessels, participants, partners
  • 📄 sites — Master site registry across all methods (lat/lon, habitat, depth range)

📂 taxa_info/Standardized species reference data

  • 📄 fish — Fish taxonomy (WoRMS), traits (length-weight), trophic groups
  • 📄 benthos — Benthic organisms: hard/soft corals, algae, sponges, CCA
  • 📄 inverts — Mobile and sessile invertebrates: echinoderms, mollusks, crustaceans

📂 lookup/Controlled vocabularies and reference tables

  • 📄 researchers — Team members: roles, affiliations, expedition participation

Method Datasets

Each method has its own dataset (e.g., uvs, pbruv, sub, edna) that stores raw and summarized outputs from a given type of field survey. These follow a common structure, with tables for:

  • sites — Survey events or deployments (e.g., dive, deployment, survey)
  • stations — Sampling units within a site (e.g., depth strata, rigs)
  • observations — Primary ecological or environmental records
  • summaries — Aggregated station-level metrics

Common Method Dataset Structure

While each method has unique requirements, all follow this general pattern:

Core Tables:

  • {method}.sites — Deployment-level metadata (when/where)
  • {method}.stations — Sampling units (replicates, strata, depth)
  • {method}.*_observations — Raw ecological records
  • {method}.*_by_station_taxa — Pre-aggregated summaries

Key Principles:

  • Observations are immutable: Raw data preserved as collected after QA/QC
  • Stations link everything: All analyses join through ps_station_id
  • Summaries are derived: Calculated from observations, can be regenerated
  • Taxonomy is external: Most species traits live in taxa_info.*, not method tables

📂 uvs/Underwater Visual Survey (SCUBA-based)

  • 📄 sites — Dive site metadata: location, conditions, protocols used
  • 📄 stations_unified — Unified view of all UVS stations across protocols

Fish Belt Transects

  • 📄 blt_observations — Raw fish counts by species, size, and transect
  • 📄 blt_stations — Fish station metadata: divers, effort, summaries
  • 📄 blt_biomass_by_station_taxa — Station-level fish biomass summaries by taxa

Benthic Surveys (LPI)

  • 📄 lpi_observations — Raw benthic point data by morphotaxon
  • 📄 lpi_stations — Benthic transect metadata: divers, effort, summaries
  • 📄 lpi_cover_by_station_taxa — Station-level benthic percent cover by taxa

Invertebrate Surveys

  • 📄 inverts_observations — Raw invertebrate counts and sizes
  • 📄 inverts_stations — Invertebrate transect metadata: divers, effort, summaries
  • 📄 inverts_density_by_station_taxa — Station-level invertebrate density

Coral Recruitment

  • 📄 recruits_observations — Juvenile coral counts by genus
  • 📄 recruits_stations — Coral recruitment quadrat metadata: divers, effort, summaries
  • 📄 recruits_density_by_station_taxa — Station-level coral recruit density by taxa

📂 BRUVS_seabed/Seabed Baited Remote Underwater Video

  • 📄 stations — Camera deployment metadata: location, depth, bait
  • 📄 maxn_by_station_taxa — Fish MaxN summaries by species

📂 BRUVS_pelagic/Pelagic BRUVS (mid-water camera arrays)

  • 📄 rigs — Individual rig metadata (e.g. positions and drift)
  • 📄 stations — String deployment metadata
  • 📄 maxn_by_station_taxa — Fish MaxN by species and station

📂 edna/Environmental DNA sampling

  • 📄 sites — Collection location metadata
  • 📄 stations — Depth-stratified water samples
  • 📄 detections_by_station_taxa — Species detections and read counts

📂 sub/Submersible surveys

  • 📄 sites — Dive metadata: pilot, observers, conditions
  • 📄 stations — Transect segments within dives
  • 📄 observations — Species observations

📂 dscm/Deep Sea Camera deployments

  • 📄 stations — Camera drop metadata: depth, duration
  • 📄 maxn_by_station_taxa — MaxN summaries

📂 birds/Seabird transect surveys

  • 📄 stations — Transect metadata: vessel vs land-based
  • 📄 observations — Bird sightings with counts and behavior

Data Model

All data in the Pristine Seas Science Database are structured around a shared, hierarchical spatial model. This enables standardized grouping, filtering, and integration across methods, expeditions, and geographies.

  • Expedition
    A defined field campaign
    e.g., COL_2024

    • Region
      Broad geographic or administrative area
      e.g., Murat, Chocó, Temotu

      • Subregion
        Intermediate unit like an island, atoll, gulf, or reef complex
        e.g., Navigator Bank, Mussau, Golfo de Tribugá, Utupua

        • Locality
          Local named feature such as a community, village, islet
          e.g., Lolieng, Tasitel

          • Site
            A specific location within the subregion where a survey method (e.g., UVS, BRUVS, eDNA) was deployed
            e.g., COL_2024_uvs_001

            • Station
              A discrete sampling unit within a site, corresponding to a specific depth or replicate
              e.g., COL_2024_uvs_001_10m

Key Relationships:

  • One expedition → many sites (across methods)
  • One site → many stations (across depths/replicates)
  • Spatial context cascades: stations inherit their site’s region, subregion, and locality
  • Method independence: UVS, BRUVS, and eDNA can occur at the same geographic location as separate sites/stations
Standard Identifiers

To ensure consistency, traceability, and joinability across all datasets, Pristine Seas uses structured identifiers for sites and stations.

Expedition (exp_id): ISO3_YEAR

  • Examples: FJI_2025, COL_2022, PNG_2024

Site (ps_site_id): exp_id_method_###

Always use 3-digit padding for site numbers.

  • Examples: FJI_2025_uvs_001, FJI_2025_pbruv_003, FJI_2025_sub_014

Station (ps_station_id): ps_site_id_suffix

Depth and replicate suffixes use fixed formats.

Suffix Formats by Method:

  • UVS (depth-stratified): _{depth}m
    Examples: FJI_2025_uvs_001_5m, FJI_2025_uvs_001_10m, FJI_2025_uvs_001_20m

  • BRUVS pelagic (rig-based): _r{rig_number}
    Examples: FJI_2025_pbruv_003_r101, FJI_2025_pbruv_003_r102

  • Deep cameras: _{depth}m
    Examples: FJI_2025_dscm_006_300m, FJI_2025_dscm_006_500m

  • Submersibles (transects): _t{transect_number}
    Examples: FJI_2025_sub_014_t01, FJI_2025_sub_014_t02


Integration & Joins

Spatial Joins

All method datasets link to the spatial backbone:

  • {method}.sites.ps_site_idexpeditions.sites.ps_site_id
  • Inherited fields: exp_id, region, subregion, locality

Taxonomic Joins

Observations link to trait databases via AphiaID:

  • {method}.*_observations.accepted_aphia_idtaxa_info.fish.accepted_aphia_id
  • {method}.*_observations.accepted_aphia_idtaxa_info.benthos.accepted_aphia_id
  • {method}.*_observations.accepted_aphia_idtaxa_info.inverts.accepted_aphia_id

Effort Standardization

Station-level summaries include survey effort for density calculations:

  • Fish: biomass per m² (biomass_gm2)
  • Benthos: percent cover (pct_cover)
  • Inverts: density per m² (density_m2)
  • BRUVS: MaxN (standardized count)

Style & Conventions

Naming Standards

All database objects follow consistent patterns:

Tables & Fields
snake_case throughout • lowercase except proper nouns • no spaces or special characters
Examples: blt_observations, depth_m, accepted_name

Identifiers
Self-documenting structure • sortable • joinable
Examples: FJI_2025_uvs_001_10m, ps_station_id

Special Cases:

  • Acronyms: Stay uppercase when widely recognized (e.g., CCA, DNA, MPA)
  • Geographic names: Preserve proper capitalization (e.g., Golfo de Tribugá)
  • Species names: Use lowercase except genus initial (e.g., Acropora cervicornis)
  • Boolean prefixes: Use is_*, has_*, in_* (e.g., in_mpa, has_blt)

Data Standards

Units
Metric (SI) with suffix • depth_mweight_gtemperature_cbiomass_gm2

Formats
Dates: YYYY-MM-DD • Times: HH:MM:SS • Coordinates: decimal degrees, WGS84

Types
Booleans: TRUE/FALSE • Missing: NULL

Structure Principles

Long Format
One observation per row • One variable per column

Keys & References
Primary: ps_site_id, ps_station_id • Foreign keys explicit • Natural hierarchy

Modularity
Method-specific structure • Shared references • Clear boundaries