---
title: "Getting Started with r4subcore"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with r4subcore}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r library}
library(r4subcore)
```

## Overview

`r4subcore` is the foundation of the R4SUB (R for Regulatory Submission)
ecosystem. It takes on three responsibilities that every other R4SUB pillar
package depends on:

1. **Run context** — a lightweight metadata envelope that ties every piece of
   evidence back to a specific study and execution event.
2. **Evidence schema** — a fixed 17-column contract that standardises how
   findings from heterogeneous sources (validators, parsers, manual checks) are
   stored and exchanged.
3. **Validation helpers** — functions that enforce the schema at every ingestion
   point so downstream scoring and profiling can always trust the data.

## Creating a Run Context

A *run context* captures who ran the assessment, on which study, in which
environment, and at what time. Every evidence row you create is stamped with the
`run_id` and `study_id` from its context, ensuring full traceability.

```{r run_context}
ctx <- r4sub_run_context(study_id = "STUDY001", environment = "DEV")
print(ctx)
```

By default, `run_id` is generated from the timestamp. Supply your own when
tests or pipelines need a stable, reproducible identifier.

```{r run_context_custom}
ctx_custom <- r4sub_run_context(
  study_id    = "STUDY001",
  environment = "PROD",
  run_id      = "RUN-2024-001"
)
ctx_custom$run_id
```
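
To make the idea of a metadata envelope concrete, the following base R sketch
models a run context as a plain list. `make_context_sketch()` and the
`"RUN-<timestamp>"` identifier format are assumptions made for illustration;
they are not the internals of `r4sub_run_context()`.

```r
# Hypothetical helper, not part of r4subcore: one way a run context
# could be represented as a plain list.
make_context_sketch <- function(study_id, environment = "DEV", run_id = NULL) {
  created_at <- Sys.time()
  if (is.null(run_id)) {
    # Assumed format: derive the identifier from the UTC timestamp
    run_id <- paste0("RUN-", format(created_at, "%Y%m%dT%H%M%S", tz = "UTC"))
  }
  list(
    run_id      = run_id,
    study_id    = study_id,
    environment = environment,
    created_at  = created_at
  )
}

auto  <- make_context_sketch("STUDY001")
fixed <- make_context_sketch("STUDY001", run_id = "RUN-2024-001")

auto$run_id   # differs from run to run
fixed$run_id  # stays the same across runs
```

A fixed identifier like the second one is what makes snapshot tests and
re-runs of a pipeline comparable.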

## Understanding the Evidence Schema

The evidence schema defines the 17 columns that every evidence table must
contain. Call `evidence_schema()` to inspect the contract at any time:

```{r schema}
schema <- evidence_schema()
# Column names in canonical order
names(schema)
```

| Column | Type | Required | Notes |
|---|---|---|---|
| `run_id` | character | yes | Set from run context |
| `study_id` | character | yes | Set from run context |
| `asset_type` | character | yes | One of: dataset, define, program, validation, spec, other |
| `asset_id` | character | yes | e.g. "ADSL", "define.xml" |
| `source_name` | character | yes | Tool or package that produced the finding |
| `source_version` | character | nullable | Version of the source tool |
| `indicator_id` | character | yes | e.g. "P21-001", "U-001" |
| `indicator_name` | character | yes | Human-readable indicator name |
| `indicator_domain` | character | yes | One of: quality, trace, risk, usability |
| `severity` | character | yes | One of: info, low, medium, high, critical |
| `result` | character | yes | One of: pass, fail, warn, na |
| `metric_value` | double | nullable | Numeric score (0–1 scale typical) |
| `metric_unit` | character | nullable | e.g. "score", "proportion", "count" |
| `message` | character | nullable | Human-readable finding description |
| `location` | character | nullable | e.g. "ADSL.USUBJID" |
| `evidence_payload` | character | nullable | JSON string for extended detail |
| `created_at` | POSIXct | yes | Set automatically if omitted |

### Controlled vocabulary helpers

`canon_severity()` and `canon_result()` normalise common aliases to the
canonical values accepted by the schema:

```{r canon}
canon_severity(c("ERROR", "warning", "Minor", "CRITICAL"))
canon_result(c("PASS", "Failed", "Warning", "N/A"))
```
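
Alias normalisation of this kind usually comes down to a lookup table. The
sketch below shows the general technique in base R. The alias mappings
(`error`, `warning`, `minor`) and the helper `normalise_sketch()` are
assumptions chosen for illustration and may differ from what
`canon_severity()` actually does.

```r
# Canonical severity values, taken from the schema table above
canonical_severity <- c("info", "low", "medium", "high", "critical")

# Illustrative alias table: these mappings are assumptions for this sketch,
# not the table used by canon_severity().
severity_aliases <- c(error = "high", warning = "medium", minor = "low")

normalise_sketch <- function(x, aliases, canonical) {
  # Canonical values map to themselves; aliases map to their canonical value
  lookup <- c(aliases, stats::setNames(canonical, canonical))
  key <- tolower(trimws(x))
  # Inputs with no match come back as NA
  unname(lookup[key])
}

normalised <- normalise_sketch(
  c("ERROR", " warning", "Minor", "CRITICAL", "bogus"),
  aliases   = severity_aliases,
  canonical = canonical_severity
)
normalised
```

Returning `NA` for unknown inputs, rather than guessing, leaves it to the
validation step to decide whether the row should be rejected.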

## Building Evidence with `as_evidence()`

`as_evidence()` is the main ingestion function. You supply a data frame that
contains at minimum the required columns other than `run_id`, `study_id`, and
`created_at`, pass a run context, and the function:

- fills `run_id` and `study_id` from the context,
- fills nullable columns with appropriately-typed `NA`,
- sets `created_at` to the current time if absent,
- validates the result before returning it.

```{r as_evidence}
raw <- data.frame(
  asset_type       = "validation",
  asset_id         = "ADSL",
  source_name      = "pinnacle21",
  indicator_id     = "P21-SD0001",
  indicator_name   = "Missing variable label",
  indicator_domain = "quality",
  severity         = "high",
  result           = "fail",
  message          = "Variable AGEU is missing a label",
  location         = "ADSL.AGEU",
  metric_value     = 0,
  metric_unit      = "score",
  stringsAsFactors = FALSE
)

ev <- as_evidence(raw, ctx = ctx)
```

You can inspect the resulting evidence table:

```{r evidence_inspect}
# All 17 schema columns are present
ncol(ev)
ev[, c("run_id", "study_id", "indicator_id", "result", "severity")]
```
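
The fill-in steps that `as_evidence()` performs can be pictured with a small
base R function. This is a sketch under stated assumptions, not the package's
implementation: `fill_evidence_sketch()` is a hypothetical name, it takes the
identifiers directly instead of a context object, and it leaves out the final
validation and column reordering.

```r
# Hypothetical sketch of the fill-in steps; not the r4subcore implementation.
fill_evidence_sketch <- function(df, run_id, study_id) {
  n <- nrow(df)

  # 1. Stamp identifiers that would come from the run context
  df$run_id   <- rep(run_id, n)
  df$study_id <- rep(study_id, n)

  # 2. Add missing nullable columns as NA of the right type
  nullable <- list(
    source_version   = NA_character_,
    metric_value     = NA_real_,
    metric_unit      = NA_character_,
    message          = NA_character_,
    location         = NA_character_,
    evidence_payload = NA_character_
  )
  for (col in setdiff(names(nullable), names(df))) {
    df[[col]] <- rep(nullable[[col]], n)
  }

  # 3. Timestamp the rows if they do not carry a timestamp yet
  if (!"created_at" %in% names(df)) {
    df$created_at <- rep(Sys.time(), n)
  }
  df
}

partial <- data.frame(asset_id = "ADSL", result = "fail", stringsAsFactors = FALSE)
filled  <- fill_evidence_sketch(partial, run_id = "RUN-2024-001", study_id = "STUDY001")
str(filled)
```

Using typed `NA` values (`NA_character_`, `NA_real_`) matters because a bare
`NA` is logical, which would give the filled column the wrong type.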

## Validating Evidence

`validate_evidence()` runs the same checks that `as_evidence()` applies
internally. Use it when you receive evidence produced externally and want to
confirm it meets the contract before processing:

```{r validate}
validate_evidence(ev)  # returns TRUE invisibly if everything is valid
```
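
As an illustration of what a contract check involves, the sketch below tests
the four controlled-vocabulary columns against the allowed values listed in
the schema table. `check_vocab_sketch()` is a hypothetical helper written for
this vignette; the real `validate_evidence()` may check more and may report
problems differently.

```r
# Allowed values copied from the schema table in this vignette
vocab <- list(
  asset_type       = c("dataset", "define", "program", "validation", "spec", "other"),
  indicator_domain = c("quality", "trace", "risk", "usability"),
  severity         = c("info", "low", "medium", "high", "critical"),
  result           = c("pass", "fail", "warn", "na")
)

# Hypothetical checker: returns a character vector of problems (empty if none)
check_vocab_sketch <- function(df, vocab) {
  problems <- character(0)
  for (col in names(vocab)) {
    if (!col %in% names(df)) {
      problems <- c(problems, sprintf("missing column '%s'", col))
      next
    }
    bad <- setdiff(unique(df[[col]]), vocab[[col]])
    if (length(bad) > 0) {
      problems <- c(
        problems,
        sprintf("%s: unexpected value(s) %s", col, paste(bad, collapse = ", "))
      )
    }
  }
  problems
}

external <- data.frame(
  asset_type       = "dataset",
  indicator_domain = "quality",
  severity         = "major",   # not a canonical severity
  result           = "pass",
  stringsAsFactors = FALSE
)
problems <- check_vocab_sketch(external, vocab)
problems
```

Collecting every problem before reporting, instead of stopping at the first,
tends to be friendlier when the evidence comes from an external producer.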

## Binding Multiple Evidence Tables

When combining evidence from different sources or indicators, use
`bind_evidence()`. It validates each table individually before combining,
preventing schema violations from silently propagating:

```{r bind}
# A second finding — a passed check on the same dataset
raw2 <- data.frame(
  asset_type       = "dataset",
  asset_id         = "ADSL",
  source_name      = "r4subcore",
  indicator_id     = "Q-NROW-001",
  indicator_name   = "Dataset row count",
  indicator_domain = "quality",
  severity         = "info",
  result           = "pass",
  message          = "ADSL has 254 subjects",
  metric_value     = 254,
  metric_unit      = "count",
  stringsAsFactors = FALSE
)
ev2 <- as_evidence(raw2, ctx = ctx)

combined <- bind_evidence(ev, ev2)
nrow(combined)
```
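
The validate-then-combine pattern can be sketched in base R as follows.
`bind_sketch()` and its reduced two-column check are assumptions made for
illustration; `bind_evidence()` validates against the full schema.

```r
# Hypothetical sketch: check every table first, then row-bind them.
bind_sketch <- function(..., required = c("indicator_id", "result")) {
  tables <- list(...)
  for (i in seq_along(tables)) {
    missing_cols <- setdiff(required, names(tables[[i]]))
    if (length(missing_cols) > 0) {
      stop(sprintf(
        "table %d is missing: %s", i, paste(missing_cols, collapse = ", ")
      ))
    }
  }
  do.call(rbind, tables)
}

a <- data.frame(indicator_id = "P21-SD0001", result = "fail", stringsAsFactors = FALSE)
b <- data.frame(indicator_id = "Q-NROW-001", result = "pass", stringsAsFactors = FALSE)
both <- bind_sketch(a, b)
nrow(both)

# A table that breaks the contract stops the bind and names the culprit
incomplete <- data.frame(indicator_id = "X-001", stringsAsFactors = FALSE)
bind_error <- tryCatch(bind_sketch(a, incomplete), error = function(e) conditionMessage(e))
bind_error
```

Checking each input separately means the error message can point at the
offending table, which is harder to do once the rows have been merged.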

## Quick Overview with `evidence_summary()`

`evidence_summary()` aggregates an evidence table by domain, severity, result,
and source, giving a one-page digest of the findings:

```{r summary}
evidence_summary(combined)
```
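
For comparison, a similar count-based digest can be produced with base R's
`aggregate()`. This only illustrates the kind of aggregation involved; the
toy `findings` data frame is made up for the example, and the structure that
`evidence_summary()` returns may differ.

```r
# Toy data for the sketch; not real findings
findings <- data.frame(
  indicator_domain = c("quality", "quality", "quality", "trace"),
  severity         = c("high", "high", "info", "high"),
  result           = c("fail", "fail", "pass", "fail"),
  stringsAsFactors = FALSE
)

# One row per finding, so summing a column of ones gives the count per group
findings$n <- 1L
counts <- aggregate(
  n ~ indicator_domain + severity + result,
  data = findings,
  FUN  = sum
)
counts
```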

## Exporting and Importing Evidence

Evidence tables can be persisted and reloaded in CSV, RDS, or JSON format. The
exported file retains the full schema so `import_evidence()` can re-validate it
on the way back in.

```{r export_import, eval = FALSE}
# Export to CSV
tmp <- tempfile(fileext = ".csv")
export_evidence(combined, file = tmp, format = "csv")

# Import and re-validate
ev_reloaded <- import_evidence(tmp, format = "csv")
nrow(ev_reloaded)
```

RDS is the most faithful format because it stores the POSIXct `created_at`
column natively, with no round-trip through text. JSON is useful when evidence
needs to be consumed by non-R tooling.
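
The difference between RDS and a text format is easy to see with base R alone.
The sketch below uses `saveRDS()` and `write.csv()` directly on a toy data
frame, so it shows the general behaviour of the file formats and not what
`export_evidence()` and `import_evidence()` do internally.

```r
stamp <- data.frame(
  indicator_id = "Q-NROW-001",
  created_at   = as.POSIXct("2024-01-01 12:00:00", tz = "UTC"),
  stringsAsFactors = FALSE
)

# RDS serialises the R object, so the column class survives as-is
rds_file <- tempfile(fileext = ".rds")
saveRDS(stamp, rds_file)
rds_class <- class(readRDS(rds_file)$created_at)
rds_class

# CSV stores text, so the timestamp comes back as a character column
csv_file <- tempfile(fileext = ".csv")
write.csv(stamp, csv_file, row.names = FALSE)
from_csv  <- read.csv(csv_file, stringsAsFactors = FALSE)
csv_class <- class(from_csv$created_at)
csv_class

# The reader has to convert it back, and has to know the time zone to use
from_csv$created_at <- as.POSIXct(from_csv$created_at, tz = "UTC")
```

The explicit `tz = "UTC"` in the last step is the part that a text format
cannot carry on its own, which is why re-validation on import is valuable.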

## What's Next

Once you have an evidence table, the other R4SUB pillar packages consume it
directly:

- **r4subusability** — four usability indicators (label quality, Define-XML
  completeness, annotation coverage, reviewer guide presence).
- **r4subscore** — weighted scoring across all domains into a single submission
  readiness index.
- **r4subprofile** — regulatory authority profile templates (FDA, EMA, PMDA,
  ANVISA, Health Canada, NMPA) that map scores to submission requirements.