Database Architecture

The Pristine Seas Science Database is a modular, analysis-ready system for organizing ecological data collected across more than 40 expeditions worldwide. Its design balances flexibility — to accommodate diverse survey methods — and consistency — to support clean integration, querying, and long-term analysis.

Core Structure

All data are hosted in Google BigQuery under the project pristine-seas, organized into two major dataset groups:

Method Datasets: One for each survey technique (e.g., UVS, BRUVS, eDNA), with standardized tables for sites, stations, observations, and summaries
Reference Datasets: Shared taxonomic, spatial, and lookup tables that provide a unified backbone across all methods

graph TB
    A[pristine-seas] --> B[Method Datasets]
    A --> C[Reference Datasets]
    
    B --> D[uvs]
    B --> E[sbruv]
    B --> F[pbruv]
    B --> G[edna]
    B --> H[sub]
    B --> I[...]
    
    C --> J[expeditions]
    C --> K[taxonomy]
    C --> L[lookup]
    
    style A fill:#004165,stroke:#fff,color:#fff
    style C fill:#006C9B,stroke:#fff,color:#fff
    style B fill:#006C9B,stroke:#fff,color:#fff

High-level database organization

Reference Datasets

These datasets provide a shared scaffold for spatial joins, species trait integration, QA/QC, and consistent metadata across all methods:

📂 expeditions/ — Spatial backbone & metadata for all field campaigns

📄 info — Expedition registry: dates, vessels, teams, partners
📄 sites — Master site registry across all survey methods

📂 taxonomy/ — Standardized species reference data

📄 fish — Fish taxonomy, traits, and life history
📄 benthos — Benthic organisms: corals, algae, sponges
📄 inverts — Mobile and sessile invertebrates

📂 lookup/ — Controlled vocabularies & reference tables

📄 habitat_types — Standardized habitat classifications
📄 exposure_types — Wave exposure categories

Method Datasets

Each method has its own dataset (e.g., uvs, pbruv, sub, edna) that stores raw and summarized outputs from a given type of field survey. These follow a common structure, with tables for:

sites — Survey events or deployments (e.g., dive, deployment, survey)
stations — Sampling units within a site (e.g., depth strata, rigs)
observations — Primary ecological or environmental records
summaries — Aggregated station-level metrics

📂 uvs/ — Underwater Visual Survey (SCUBA-based)

📄 sites — Dive site metadata: location, conditions, protocols used
📄 stations — Depth-stratified sampling units within each site
Fish Belt Transects
- 📄 blt_stations — Fish belt transect metadata: divers, effort, conditions
- 📄 blt_observations — Raw fish counts by species, size, and transect
- 📄 blt_biomass_by_taxa — Station-level fish biomass summaries
Benthic Surveys
- 📄 lpi_stations — Benthic line-point intercept transect metadata
- 📄 lpi_counts — Raw benthic point data by morphotaxon
- 📄 lpi_cover_by_taxa — Station-level benthic percent cover
Other Protocols
- 📄 inverts_stations — Mobile invertebrate transect metadata
- 📄 inverts_counts — Raw invertebrate observations and sizes
- 📄 recruits_stations — Coral recruitment quadrat metadata
- 📄 recruits_counts — Juvenile coral counts by genus

📂 sbruv/ — Seabed Baited Remote Underwater Video

📄 stations — Camera deployment metadata: location, depth, bait
📄 maxn_by_taxa — Fish MaxN summaries by species

📂 pbruv/ — Pelagic BRUVS (mid-water camera arrays)

📄 stations — String deployment metadata
📄 rigs — Individual rig positions and drift
📄 maxn_by_taxa — Fish MaxN by species and rig

📂 edna/ — Environmental DNA sampling

📄 sites — Collection location metadata
📄 stations — Depth-stratified water samples
📄 detections — Species detections and read counts

📂 sub/ — Submersible surveys

📄 sites — Dive metadata: pilot, observers, conditions
📄 stations — Transect segments within dives
📄 observations — Species observations along transects

📂 dscm/ — Deep Sea Camera deployments

📄 stations — Camera drop metadata: depth, duration
📄 maxn_by_taxa — Fish MaxN summaries

📂 birds/ — Seabird transect surveys

📄 stations — Transect metadata: vessel vs land-based
📄 observations — Bird sightings with counts and behavior

Data Model

All data in the Pristine Seas Science Database are structured around a shared, hierarchical spatial model. This enables standardized grouping, filtering, and integration across methods, expeditions, and geographies.

Expedition
A defined field campaign
e.g., COL_2024
- Region
  Broad geographic or administrative area
  e.g., Murat, Chocó, Temotu
  - Subregion
    Intermediate unit like an island, atoll, gulf, or reef complex
    e.g., Navigator bank, Mussau, Golfo de Tribugá, Utupua
    - Locality
      Local named feature such as a community, village, islet
      e.g., Lolieng, Tasitel
      - Site
        A specific location within the subregion where a survey method (e.g., UVS, BRUVS, eDNA) was used
        e.g., COL_2024_uvs_001
        
        Station
        A discrete sampling unit within a site, corresponding to a specific depth or replicate
        e.g., COL_2024_uvs_001_10m

Standard Identifiers

To ensure consistency, traceability, and joinability across all datasets, Pristine Seas uses structured identifiers for sites and stations.

Expedition (`exp_id`): `ISO3_YEAR`

Examples: FJI_2025, COL_2022,

Site (`ps_site_id`): `exp_id_method_###`

Always use 3-digit padding for site numbers.

Examples: FJI_2025_uvs_001, FJI_2025_pbruv_003, FJI_2025_sub_014

Station (`ps_station_id`): `ps_site_id_suffix`

Depth and replicate suffixes use fixed formats.

Examples:
- FJI_2025_uvs_001_10m – Underwater visual survey at 10m depth
- FJI_2025_pbruv_003_r102 – rig 102 of pelagic BRUV deployment 3
- FJI_2025_dscm_006_300m – deep sea camera at 300m

Style & Conventions

Naming Standards

All database objects follow consistent patterns:

Tables & Fields
snake_case throughout • lowercase except proper nouns • no spaces or special characters
Examples: blt_observations, depth_m, accepted_name

Identifiers
Self-documenting structure • sortable • joinable
Examples: FJI_2025_uvs_001_10m, ps_station_id

Data Standards

Units
Metric (SI) with suffix • depth_m • weight_g • temperature_c • biomass_gm2

Formats
Dates: YYYY-MM-DD • Times: HH:MM:SS • Coordinates: decimal degrees, WGS84

Types
Booleans: TRUE/FALSE • Missing: NULL

Structure Principles

Long Format
One observation per row • One variable per column

Keys & References
Primary: ps_site_id, ps_station_id • Foreign keys explicit • Natural hierarchy

Modularity
Method-specific structure • Shared references • Clear boundaries