Data Exporter
Overview
Data Exporter is the OR platform's data extraction and delivery service — responsible for exporting operational and analytical data from AWS Redshift as Parquet files and making them available to external consumers via presigned S3 URLs. It serves as the primary interface between the platform's internal data stores and external analytics environments such as Azure Databricks.
The service operates in two modes: an on-demand GraphQL-triggered export that generates Parquet files for a requested time range, and a scheduled config-driven export that runs daily via Kubernetes CronJobs. Both modes extract data from Redshift, apply PII removal and data classification filters, write Parquet files to S3, and expose download links through the platform's GraphQL API.
Data Exporter is tightly coupled with Post Monitoring, which maintains the Redshift schemas that Data Exporter reads from. Together, they form the platform's historical data pipeline — Post Monitoring handles the ETL from RDS to Redshift, while Data Exporter handles the extraction from Redshift to external consumers.
Architecture
- Language: Julia (on-demand export), Python (config-driven export)
- Scaling: Kubernetes CronJobs (scheduled) / Singleton pod (on-demand)
- Port: Not externally exposed (triggered via GraphQL or CRON)
- Dependencies: AWS Redshift, S3, PostgreSQL (RDS), Experiment Manager (GraphQL)
Key Components
- On-demand Export (Julia) — Original export flow triggered via GraphQL. Accepts a time range, generates a request ID, executes Redshift stored procedures to create Parquet files, and returns presigned S3 URLs.
- Config-driven Export (Python) — Newer scheduled export flow using AWS Glue. Reads export configuration from DynamoDB/Redshift config tables, applies data classification rules, and exports tables in Parquet format to S3 on a daily CRON schedule.
- PII Removal — Strips personally identifiable information during the export transformation step before data leaves the platform boundary.
- Data Classification Engine — Filters exported columns and rows based on classification rules (`nr` = no restrictions, `mx` = mixed, `de` = do not export, `hwc` = handle with care).
- Redshift UNLOAD — Uses Redshift's native `UNLOAD` command to export query results directly to S3 in Parquet format.
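The UNLOAD step can be sketched as a query builder. This is illustrative, not the service's actual code: the `ts` timestamp column, bucket layout, and IAM role are placeholder assumptions.

```python
def build_unload_sql(table: str, start_ts: int, end_ts: int,
                     bucket: str, prefix: str, iam_role: str) -> str:
    """Return a Redshift UNLOAD statement that writes Parquet files to S3.

    Column and path names are illustrative; the real queries live in the
    Redshift stored procedures and the Python export job.
    """
    inner = (
        f"SELECT * FROM pm_internal.{table} "
        f"WHERE ts >= {start_ts} AND ts < {end_ts}"
    )
    return (
        f"UNLOAD ('{inner}') "
        f"TO 's3://{bucket}/{prefix}/{table}_' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET"
    )

sql = build_unload_sql(
    "way", 1700000000, 1700086400,
    "example-bucket", "prod/data_exporter_service/way/20231115",
    "arn:aws:iam::123456789012:role/redshift-unload",
)
```

Redshift then writes numbered part files (`way_000.parquet`, `way_001.parquet`, …) plus a manifest under the given prefix, matching the S3 structure described below.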
Connections
| Direction | Service | Purpose |
|---|---|---|
| In | Experiment Manager (GraphQL) | Triggers on-demand export requests |
| In | Kubernetes CronJob | Triggers scheduled daily exports |
| In | AWS Redshift | Source data (pm_internal schema) |
| Out | AWS S3 | Parquet file storage and presigned URL generation |
| Out | DynamoDB | Export configuration tables (config-driven mode) |
| Out | SNS | Success/failure notifications for scheduled exports |
Data Flow
On-demand Export
Requester (e.g. Azure Databricks)
↓ requestData mutation (start_time, end_time)
Experiment Manager (GraphQL)
↓
Data Exporter
├── Generate Request ID
├── Submit to Redshift stored procedures
│ ├── Select from pm_internal tables
│ ├── Apply PII removal
│ └── UNLOAD to S3 as Parquet
└── Track progress in request_table
↓ requestStatus query (request_id)
Experiment Manager (GraphQL)
├── Check completion status
├── Generate presigned S3 URLs (valid 1 day)
└── Return grouped URLs by table
↓
Requester receives download links

Config-driven Export
Kubernetes CronJob (daily at 1 AM Melbourne time)
↓
Data Exporter (Python / AWS Glue)
├── Read export config from DynamoDB
├── Read data classification tables
├── For each enabled table:
│ ├── Build filtered UNLOAD query
│ ├── Apply column/row classification filters
│ ├── Export to S3: YYYY/MM/DD/{table_name}/
│ └── Update status table (IN-PROGRESS → COMPLETED/FAILED)
├── Export classification tables to YYYY/MM/DD/classifications/
└── Send SNS notification (SUCCESS / FAILED / PARTIAL-SUCCESS)

GraphQL Endpoints
requestData (Mutation)
Triggers the on-demand data export process.
| Parameter | Type | Description |
|---|---|---|
| `startTime` | Integer (epoch) | Start of the export time range |
| `endTime` | Integer (epoch) | End of the export time range (max 1 day from start) |
| Returns | String | Request ID (UUID) for status tracking |
requestStatus (Query)
Checks export progress and retrieves download links on completion.
| Parameter | Type | Description |
|---|---|---|
| `requestId` | String (UUID) | The Request ID from `requestData` |
| Returns | | |
| `status` | String | `In Progress`, `Completed`, or `Request has not been requested` |
| `progress` | Float | Completion percentage (0.0–100.0) |
| `dataRequestDownloadLinks` | JSON | `{ table_name: [url, url, ...], ... }` — presigned URLs grouped by table |
TIP
Presigned URLs are valid for 1 day only. Files are grouped by table name (e.g. { way: ["..."], cctv: ["..."] }).
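A consumer-side sketch of the two calls above, using the parameter and field names from the tables. The GraphQL variable types and the exact response shape are assumptions; the transport (endpoint, auth) goes through Experiment Manager and is omitted here.

```python
import json

def build_request_data_mutation(start_time: int, end_time: int) -> dict:
    """Payload for the requestData mutation (epoch seconds, max 1 day apart)."""
    if end_time - start_time > 86400:
        raise ValueError("export range cannot exceed 1 day")
    query = """
    mutation RequestData($startTime: Int!, $endTime: Int!) {
      requestData(startTime: $startTime, endTime: $endTime)
    }
    """
    return {"query": query,
            "variables": {"startTime": start_time, "endTime": end_time}}

def parse_request_status(response: dict) -> tuple:
    """Extract (status, progress, links-by-table) from a requestStatus response."""
    body = response["data"]["requestStatus"]
    links = body.get("dataRequestDownloadLinks") or "{}"
    if isinstance(links, str):  # the JSON map may arrive as a serialised string
        links = json.loads(links)
    return body["status"], body["progress"], links
```

A caller would POST the mutation payload, then poll `requestStatus` with the returned UUID until `status` is `Completed` and download the grouped URLs within their 1-day validity window.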
Data Classification
The config-driven export uses a multi-layered classification system to control what data is exported:
Classification Types
| Code | Meaning |
|---|---|
| `nr` | No restrictions on exporting |
| `mx` | Mixed — depends on source type |
| `de` | Do not export |
| `hwc` | Handle with care — export with restrictions |
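Under these codes, column filtering for a given source might look like the following sketch. The resolution of `mx` via the per-source table and the treatment of `hwc` as export-with-restrictions are assumptions inferred from the tables in this section; the column names are illustrative.

```python
def exportable_columns(column_codes: dict, source_codes: dict, source: str) -> list:
    """Return the columns allowed for export, given per-column and per-source codes."""
    allowed = []
    for column, code in column_codes.items():
        if code == "mx":  # mixed: defer to the data source's own classification
            code = source_codes.get(source, "de")
        if code in ("nr", "hwc"):  # hwc still exports, subject to restrictions
            allowed.append(column)
        # "de" columns are dropped entirely
    return allowed

cols = exportable_columns(
    {"way_id": "nr", "speed": "mx", "driver_name": "de", "geom": "hwc"},
    {"FUSION": "nr", "HERE": "de", "TOMTOM": "de"},
    "FUSION",
)
```

With source `FUSION` the `mx` column survives; with `HERE` or `TOMTOM` it would be dropped, matching the per-source rules in `actual_data_classifications`.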
Classification Tables
| Table | Purpose |
|---|---|
| `data_column_classification` | Per-column export rules for each database table |
| `actual_data_classifications` | Per-source export rules (e.g. FUSION = `nr`, HERE = `de`, TOMTOM = `de`) |
| `long_table_data_classifications` | Per-metric-name export rules for long-format tables |
| `export_tables` | Master table controlling which tables are enabled for export |
Export Status Tracking
Each table export is tracked in `table_export_job_statuses`:
| Status | Meaning |
|---|---|
| `IN-PROGRESS` | UNLOAD query submitted to Redshift |
| `COMPLETED` | Export finished successfully |
| `FAILED` | Export failed after all retry attempts |
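The lifecycle above can be sketched as a per-table retry loop. `run_unload` stands in for the real Redshift UNLOAD call, and `MAX_RETRY_COUNT = 3` is an illustrative value (see Known Constraints below for the actual configuration knobs).

```python
MAX_RETRY_COUNT = 3  # illustrative; the real value comes from service config

def export_table(table: str, run_unload, max_retries: int = MAX_RETRY_COUNT) -> str:
    """Attempt a table export up to max_retries times; return the final status.

    The status begins as IN-PROGRESS when the UNLOAD is submitted and ends as
    COMPLETED on success, or FAILED once all retry attempts are exhausted.
    """
    for _ in range(max_retries):
        try:
            run_unload(table)  # submit UNLOAD and wait for it to finish
            return "COMPLETED"
        except Exception:
            continue  # transient failure: retry until attempts run out
    return "FAILED"
```

In the real service each transition is written to `table_export_job_statuses`, and a table left `FAILED` contributes to the `PARTIAL-SUCCESS` / `FAILED` SNS notification.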
S3 Storage Structure
Exported Parquet files are stored in a date-partitioned structure:
s3://{bucket}/{env}/data_exporter_service/
└── {table_name}/
    └── YYYYMMDD/
        └── {dt_start}_{dt_end}/
            ├── {table_name}_000.parquet
            ├── {table_name}_001.parquet
            └── manifest

Config-driven exports use:
s3://{bucket}/{env}/
└── YYYY/MM/DD/
    ├── {table_name}/
    │   └── *.parquet
    └── classifications/
        └── *.parquet

Exploratory Data Analysis (EDA)
Data Exporter supports EDA workflows through AWS Athena integration. The pm_internal and pm_external Redshift schemas are crawled by AWS Glue, making the exported Parquet data queryable via Athena for ad-hoc analysis without impacting the production Redshift cluster.
Athena queries target the `or-post-monitoring` database in the `AwsDataCatalog` catalog, with results output to `s3://or-dot-post-monitoring/`.
Known Constraints
- Redshift string limit — Redshift has a maximum character length of 65,535. Fields like `ar_f_path` and `geoj_way_feature` can exceed this when converted from PostgreSQL arrays to strings during federation. Mitigation: coordinate rounding + GZIP compression into `_compressed` fields.
- Duration limit — On-demand exports via `requestData` cannot span more than 1 day.
- Concurrency — `MAX_CONCURRENCY_LIMIT` controls the number of parallel Redshift UNLOAD operations to prevent cluster overload.
- Retry behaviour — `MAX_RETRY_COUNT` controls per-table retry attempts before marking as `FAILED`.
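The string-limit mitigation (coordinate rounding plus GZIP into `_compressed` fields) can be illustrated as follows. The JSON serialisation, base64 wrapping, and 5-decimal precision here are assumptions; the real encoding belongs to Batch Ingestion.

```python
import base64
import gzip
import json

def compress_path(coords: list, precision: int = 5) -> str:
    """Round coordinate pairs and return a base64-encoded, gzipped JSON string.

    Rounding trims excess precision before compression so long geometry
    fields fit within Redshift's 65,535-character limit.
    """
    rounded = [[round(x, precision), round(y, precision)] for x, y in coords]
    raw = json.dumps(rounded, separators=(",", ":")).encode()
    return base64.b64encode(gzip.compress(raw)).decode()

def decompress_path(blob: str) -> list:
    """Invert compress_path: base64-decode, gunzip, and parse the JSON."""
    return json.loads(gzip.decompress(base64.b64decode(blob)))
```

A consumer reading a `_compressed` field would apply the inverse decode step after downloading the Parquet files.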
Troubleshooting
Request Progress Not Advancing
Cause: Data Exporter pod killed by liveness check due to too many concurrent requests. This prevents all Redshift stored procedures from executing.
Resolution:
- Retrieve the troubled Request ID
- Clear the request from Redshift: `DELETE FROM de_internal.request_table WHERE de_request_id = '<request-id>';`
- Restart the Data Exporter pod
- Request the user to rerun the export with the same start and end time
Related Services
- Post Monitoring — Upstream ETL service that maintains the Redshift schemas Data Exporter reads from
- Batch Ingestion — Manages reference data that Data Exporter may export; responsible for compressed field generation
- Data Archiver — Manages PostgreSQL data lifecycle; data must be exported before archival
- Data Recorder — Writes the live data snapshots that eventually flow through Post Monitoring to Data Exporter
- Experiment Manager — Central coordination service (GraphQL on `:5100`); provides the API for on-demand export triggers
- Baseline Manager — Generates baseline data that may be included in exports
