graph LR A[Clone Repository] --> B[Make Changes] B --> C[Stage Changes] C --> D[Commit Changes] D --> E[Pull] E --> F[Push] style F fill:#d8f3dc,stroke:#95d5b2
GitHub
Version control and collaborative coding
Overview
GitHub serves as the foundation of our code management system, providing robust version control, collaborative workflows, and reproducible research capabilities. This central platform enables the Pristine Seas Science Team to maintain high standards of scientific rigor and transparency while facilitating efficient collaboration across our distributed team.
Core GitHub Benefits
- Version control: Track changes over time with complete history
- Collaboration: Multiple scientists can contribute to the same codebase
- Reproducibility: Code tied to specific analyses or expeditions is preserved exactly as used
- Knowledge sharing: Public repositories make methods transparent to the broader scientific community
- Quality control: Pull request review system ensures code quality and consistency
Organization Structure
The Pristine Seas Science Team maintains a dedicated GitHub organization (pristine-seas
) that houses all our code repositories, documentation, and analytical frameworks.
Repository Types
Our repositories fall into three main categories, each with specific naming conventions and purposes:
Repository Structure
Expedition repositories follow a standardized structure to ensure consistency and ease of navigation. This structure helps new team members quickly understand our workflow and facilitates efficient collaboration.
exp-[ISO3]-[YEAR]/
├── .github/ # GitHub specific files (actions, templates)
├── R/ # R functions and utility scripts
├── pipeline/ # Data processing pipelines
│ ├── pipe_benthic.qmd # Benthic data processing
│ ├── pipe_fish.qmd # Fish data processing
│ ├── pipe_pbruv.qmd # BRUVS data processing
├── eda/ # Exploratory data analysis
│ ├── eda_lpi.qmd # Benthic data analysis
│ ├── eda_fish.qmd # Fish data analysis
│ └── eda_pbruv.qmd # Pelagic BRUVS data analysis
├── docs/ # qmds are rendered here
├── .gitignore # Files to exclude from version control
├── README.md # Repository overview and instructions
└── exp-[ISO3]-[YEAR].Rproj # RStudio project file
Project and infrastructure repositories have flexible structures based on their specific needs, but should still maintain clear organization and documentation.
We do not store data in GitHub repositories. This is both a practical consideration (GitHub has file size limits) and a scientific best practice (data should be managed separately from code).
Instead:
- Raw data is stored in Google Drive (see Google Drive documentation)
- Processed data is stored in BigQuery (see BigQuery documentation)
GitHub Workflow with R
Our team primarily interacts with GitHub through RStudio, following the principles outlined in Jenny Bryan’s “Happy Git with R” (happygitwithr.com). This approach simplifies version control while maintaining scientific rigor.
Basic Git Workflow in RStudio
The fundamental GitHub workflow for our team involves these key steps:
Clone: Create a local copy of the repository
- In RStudio: File → New Project → Version Control → Git
- Enter repository URL:
https://github.com/pristine-seas/repository-name.git
Make Changes: Edit scripts, add analyses, or improve documentation
Stage: Select which changes to include in your commit
- In RStudio: Use the Git pane to check boxes next to changed files
Commit: Save your changes with a clear, descriptive message
- In RStudio: Click “Commit” in the Git pane
- Write a meaningful commit message that describes what you changed and why
Pull: Incorporate any changes others have made
- In RStudio: Click “Pull” in the Git pane
- Address any conflicts if they arise
Push: Share your changes with the team
- In RStudio: Click “Push” in the Git pane
Quarto Documents for Reproducible Research
Our workflow heavily leverages Quarto documents (.qmd
files) as the foundation for reproducible research. Quarto combines narrative text, code, and visualizations in a single document that serves as both lab notebook and reproducible pipeline.
Why We Use Quarto
Quarto documents offer several advantages for scientific workflows:
- Code and narrative integration: Seamlessly mix text with executable R code
- Reproducibility: Anyone can re-run analyses and get identical results
- Self-documentation: Methods are explicitly documented alongside the code
- Multiple output formats: Generate HTML, PDF, Word, or presentations from the same source
- Version control friendly: Text-based format works well with Git
Quarto in Our Repository Structure
Our expedition repositories are organized around two main types of Quarto documents:
- Pipeline documents (
pipeline/
folder):- Linear processing workflows from raw data to analysis-ready datasets
- Emphasize data cleaning, validation, and transformation
- Each document focuses on a specific data stream (benthic, fish, etc.)
- Final output is typically data tables ready for database ingestion
- Exploratory Data Analysis (EDA) documents (
eda/
folder):- Scientific analysis of processed data
- Focus on visualization, statistical testing, and interpretation
- Often include publication-quality figures
- Can be method-specific or integrate across multiple methods
Embracing a Reproducibility Mindset
Reproducibility is fundamental to scientific integrity and is a core value of the Pristine Seas Science Team. Beyond just technical practices, it requires a specific mindset when approaching our work.
Key Principles
Think of your future self: Write code and documentation as if you’ll need to understand it months from now with no memory of what you did
Think of your teammates: Assume someone else will need to use and understand your work without having you available to explain it
Documentation is not optional: Clear documentation is as important as the code itself
- Document the “why” not just the “how”
- Explain analytical choices and their implications
- Note data quirks and handling decisions
No magic numbers: Any constant or parameter should be explained and defined at the top of scripts
Embrace iteration: Start with messy exploration, but refine toward reproducible pipelines
- Initial EDA can be exploratory
- Final analyses should be structured as reproducible workflows
- Improve code as clarity emerges
Practical Habits
- Start each analysis session by pulling the latest code
- Commit frequently with clear messages
- Review your own work before considering it final
- Test with fresh environments to ensure all dependencies are documented
- Keep raw data pristine and document all transformations