Data Archiver
Overview
Data Archiver manages the final stage of the OR platform's data lifecycle — moving aged operational data out of PostgreSQL and into long-term cold storage on S3. As the Data Recorder continuously writes live data snapshots to PostgreSQL, the database would grow indefinitely without active management. Data Archiver prevents this buildup by identifying data older than a configured threshold, exporting it as CSV files to S3, and then removing it from the database.
This autonomous lifecycle management ensures that PostgreSQL queries remain fast and responsive for the real-time operations that depend on them, while no data is lost — it is simply moved to a more cost-effective storage tier. The archived CSV files on S3 remain available for historical analysis, compliance, and audit purposes.
Data Archiver operates independently of most other microservices, requiring only the Experiment Manager (GraphQL) for session triggers and database access.
Architecture
- Port: :9500
- Language: Julia
- Scaling: Singleton
Key Components
- Session-triggered archiving — Subscribes to session updates from the Experiment Manager via GraphQL. Archiving runs on a 24-hour session cycle.
- Configurable age threshold — Data older than a specified duration (e.g. one week) is identified for archiving.
- CSV export to S3 — Aged data is exported as CSV files and stored in a configured S3 bucket before being removed from PostgreSQL.
- Table-by-table processing — Iterates through all configured actuals tables, processing each independently.
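The age-threshold selection can be sketched in Julia. The `recorded_at` column name and both helper functions are illustrative assumptions, not the service's actual schema or API:

```julia
using Dates

# Hypothetical helper: compute the cutoff timestamp for archiving.
# `threshold` mirrors the configurable age threshold (e.g. one week);
# rows with a timestamp earlier than the returned cutoff qualify.
archive_cutoff(now::DateTime, threshold::Period) = now - threshold

# Hypothetical query builder for one actuals table. The table name comes
# from the configured table list; the cutoff is bound as a parameter.
aged_rows_sql(table::String) =
    "SELECT * FROM $table WHERE recorded_at < \$1"
```

For example, with a one-week threshold, `archive_cutoff(DateTime(2024, 1, 8), Week(1))` yields `DateTime(2024, 1, 1)`, and every configured table is queried with the same cutoff.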
Connections
| Direction | Service | Purpose |
|---|---|---|
| In | Experiment Manager (GraphQL) | Session subscription triggers |
| Out | Experiment Manager (GraphQL) | Queries and completion notices |
| Out | PostgreSQL (RDS) | Query and delete aged data |
| Out | AWS S3 | Store archived CSV files |
Data Flow
Session Manager → Experiment Manager (24-hour session tick)
↓ (GraphQL subscription)
Data Archiver [:9500]
├── Thread 1: HTTP server (health checks, metrics)
└── Thread 2: Subscription controller
↓ (on session trigger)
For each actuals table:
├── Query PostgreSQL for data older than threshold
├── Export to CSV
├── Upload CSV to S3
└── Delete archived rows from PostgreSQL
↓
Post completion notice to GraphQL
Sequence of Operations
- Pod deploys — Thread 1 initialises and starts the HTTP server
- Subscription starts — Thread 2 subscribes to session details from config.yaml
- Session tick (24 hours) — For each configured table:
- Preprocess and query RDS for aged data
- Export data as CSV and upload to S3
- Remove archived data from PostgreSQL
- Completion notice — Posts archive completion to GraphQL (triggers downstream processes like baseline creation)
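The per-table cycle above can be sketched in Julia, with the PostgreSQL, S3, and GraphQL calls stubbed out as injected functions. All names here are assumptions for illustration, not the service's actual API; the key property shown is the ordering — rows are only deleted after the CSV upload succeeds:

```julia
using Dates

# Sketch of one table's archive cycle. `rows` stands in for the query
# result; each row is assumed to carry a `ts` timestamp field.
function archive_table!(rows, cutoff::DateTime;
                        export_csv, upload, delete_rows)
    aged = filter(r -> r.ts < cutoff, rows)   # data older than threshold
    isempty(aged) && return 0
    csv = export_csv(aged)    # serialise the aged rows to CSV
    upload(csv)               # push the file to the configured S3 bucket
    delete_rows(aged)         # only then remove the rows from PostgreSQL
    return length(aged)
end
```

In the service itself this would run once per configured actuals table on each session tick, followed by the completion notice to GraphQL.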
Archived Tables
Data Archiver processes the following PostgreSQL actuals tables:
| Table | Content |
|---|---|
| segment_actual | Segment-level metric snapshots |
| way_actual | Way-level speed, volume, and flow snapshots |
| event | Incident and event records |
| env_cctv_actual | CCTV camera status snapshots |
| env_intersection_actual | SCATS intersection signal data |
| env_intersection_group_actual | Grouped intersection data |
| env_sensor_actual | Sensor readings |
| env_sign_actual | VMS and sign display states |
Configuration
Archiving behaviour is controlled through config.yaml:
- S3 bucket name — Target bucket for archived CSV files
- Session subscription — Which session to listen to for triggering the archive cycle
- Age threshold — How old data must be before it qualifies for archiving
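The settings above might look like the following config.yaml fragment — the key names and layout are assumptions for illustration, not the service's actual schema:

```yaml
# Illustrative only: actual key names may differ.
archiver:
  s3_bucket: my-archive-bucket     # target bucket for archived CSV files
  session: daily-archive           # session subscription that triggers the cycle
  age_threshold: 1w                # data older than this qualifies for archiving
```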
Related Services
- Data Recorder — Upstream service that writes the live data snapshots that Data Archiver eventually archives
- Data Loader — Loads the reference data for tables that Data Archiver manages
- Experiment Manager — Central coordination service (GraphQL on :5100); provides session triggers and completion signalling
