Data Exporter
Overview
Data Exporter is the OR platform's data extraction and delivery service — responsible for exporting operational and analytical data from AWS Redshift as Parquet files and making them available to external consumers via presigned S3 URLs. It serves as the primary interface between the platform's internal data stores and external analytics environments such as Azure Databricks.
The service operates in two modes: an on-demand GraphQL-triggered export that generates Parquet files for a requested time range, and a scheduled config-driven export that runs daily via Kubernetes CronJobs. Both modes extract data from Redshift, apply PII removal and data classification filters, write Parquet files to S3, and expose download links through the platform's GraphQL API.
Data Exporter is tightly coupled with Post Monitoring, which maintains the Redshift schemas that Data Exporter reads from. Together, they form the platform's historical data pipeline — Post Monitoring handles the ETL from RDS to Redshift, while Data Exporter handles the extraction from Redshift to external consumers.
Architecture
- Language: Julia (on-demand export), Python (config-driven export)
- Scaling: Kubernetes CronJobs (scheduled) / Singleton pod (on-demand)
- Port: Not externally exposed (triggered via GraphQL or CRON)
- Dependencies: AWS Redshift, S3, PostgreSQL (RDS), Experiment Manager (GraphQL)
Key Components
- On-demand Export (Julia) — Original export flow triggered via GraphQL. Accepts a time range, generates a request ID, executes Redshift stored procedures to create Parquet files, and returns presigned S3 URLs.
- Config-driven Export (Python) — Newer scheduled export flow using AWS Glue. Reads export configuration from DynamoDB/Redshift config tables, applies data classification rules, and exports tables in Parquet format to S3 on a daily CRON schedule.
- PII Removal — Strips personally identifiable information during the export transformation step before data leaves the platform boundary.
- Data Classification Engine — Filters exported columns and rows based on classification rules (`nr` = no restrictions, `mx` = mixed, `de` = do not export, `hwc` = handle with care).
- Redshift UNLOAD — Uses Redshift's native `UNLOAD` command to export query results directly to S3 in Parquet format.
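The UNLOAD step can be sketched as a query builder. This is illustrative, not the service's actual code: the `ts` timestamp column, bucket layout, and IAM role are placeholder assumptions.

```python
def build_unload_sql(table: str, start_ts: int, end_ts: int,
                     bucket: str, prefix: str, iam_role: str) -> str:
    """Return a Redshift UNLOAD statement that writes Parquet files to S3.

    Column and path names are illustrative; the real queries live in the
    Redshift stored procedures and the Python export job.
    """
    inner = (
        f"SELECT * FROM pm_internal.{table} "
        f"WHERE ts >= {start_ts} AND ts < {end_ts}"
    )
    return (
        f"UNLOAD ('{inner}') "
        f"TO 's3://{bucket}/{prefix}/{table}_' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET"
    )

sql = build_unload_sql(
    "way", 1700000000, 1700086400,
    "example-bucket", "prod/data_exporter_service/way/20231115",
    "arn:aws:iam::123456789012:role/redshift-unload",
)
```

Redshift then writes numbered part files (`way_000.parquet`, `way_001.parquet`, …) plus a manifest under the given prefix, matching the S3 structure described below.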
Connections
| Direction | Service | Purpose |
|---|---|---|
| In | Experiment Manager (GraphQL) | Triggers on-demand export requests |
| In | Kubernetes CronJob | Triggers scheduled daily exports |
| In | AWS Redshift | Source data (pm_internal schema) |
| Out | AWS S3 | Parquet file storage and presigned URL generation |
| Out | DynamoDB | Export configuration tables (config-driven mode) |
| Out | SNS | Success/failure notifications for scheduled exports |
Data Flow
On-demand Export
Requester (e.g. Azure Databricks)
↓ requestData mutation (start_time, end_time)
Experiment Manager (GraphQL)
↓
Data Exporter
├── Generate Request ID
├── Submit to Redshift stored procedures
│ ├── Select from pm_internal tables
│ ├── Apply PII removal
│ └── UNLOAD to S3 as Parquet
└── Track progress in request_table
↓ requestStatus query (request_id)
Experiment Manager (GraphQL)
├── Check completion status
├── Generate presigned S3 URLs (valid 1 day)
└── Return grouped URLs by table
↓
Requester receives download links

Config-driven Export
Kubernetes CronJob (daily at 1 AM Melbourne time)
↓
Data Exporter (Python / AWS Glue)
├── Read export config from DynamoDB
├── Read data classification tables
├── For each enabled table:
│ ├── Build filtered UNLOAD query
│ ├── Apply column/row classification filters
│ ├── Export to S3: YYYY/MM/DD/{table_name}/
│ └── Update status table (IN-PROGRESS → COMPLETED/FAILED)
├── Export classification tables to YYYY/MM/DD/classifications/
└── Send SNS notification (SUCCESS / FAILED / PARTIAL-SUCCESS)

GraphQL Endpoints
requestData (Mutation)
Triggers the on-demand data export process.
| Parameter | Type | Description |
|---|---|---|
| `startTime` | Integer (epoch) | Start of the export time range |
| `endTime` | Integer (epoch) | End of the export time range (max 1 day from start) |
| Returns | String | Request ID (UUID) for status tracking |
requestStatus (Query)
Checks export progress and retrieves download links on completion.
| Parameter | Type | Description |
|---|---|---|
| `requestId` | String (UUID) | The Request ID from `requestData` |
| Returns | | |
| `status` | String | `In Progress`, `Completed`, or `Request has not been requested` |
| `progress` | Float | Completion percentage (0.0–100.0) |
| `dataRequestDownloadLinks` | JSON | `{ table_name: [url, url, ...], ... }` — presigned URLs grouped by table |
TIP
Presigned URLs are valid for 1 day only. Files are grouped by table name (e.g. { way: ["..."], cctv: ["..."] }).
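A consumer-side sketch of the two calls above, using the parameter and field names from the tables. The GraphQL variable types and the exact response shape are assumptions; the transport (endpoint, auth) goes through Experiment Manager and is omitted here.

```python
import json

def build_request_data_mutation(start_time: int, end_time: int) -> dict:
    """Payload for the requestData mutation (epoch seconds, max 1 day apart)."""
    if end_time - start_time > 86400:
        raise ValueError("export range cannot exceed 1 day")
    query = """
    mutation RequestData($startTime: Int!, $endTime: Int!) {
      requestData(startTime: $startTime, endTime: $endTime)
    }
    """
    return {"query": query,
            "variables": {"startTime": start_time, "endTime": end_time}}

def parse_request_status(response: dict) -> tuple:
    """Extract (status, progress, links-by-table) from a requestStatus response."""
    body = response["data"]["requestStatus"]
    links = body.get("dataRequestDownloadLinks") or "{}"
    if isinstance(links, str):  # the JSON map may arrive as a serialised string
        links = json.loads(links)
    return body["status"], body["progress"], links
```

A caller would POST the mutation payload, then poll `requestStatus` with the returned UUID until `status` is `Completed` and download the grouped URLs within their 1-day validity window.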
Data Classification
The config-driven export uses a multi-layered classification system to control what data is exported:
Classification Types
| Code | Meaning |
|---|---|
| `nr` | No restrictions on exporting |
| `mx` | Mixed — depends on source type |
| `de` | Do not export |
| `hwc` | Handle with care — export with restrictions |
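Under these codes, column filtering for a given source might look like the following sketch. The resolution of `mx` via the per-source table and the treatment of `hwc` as export-with-restrictions are assumptions inferred from the tables in this section; the column names are illustrative.

```python
def exportable_columns(column_codes: dict, source_codes: dict, source: str) -> list:
    """Return the columns allowed for export, given per-column and per-source codes."""
    allowed = []
    for column, code in column_codes.items():
        if code == "mx":  # mixed: defer to the data source's own classification
            code = source_codes.get(source, "de")
        if code in ("nr", "hwc"):  # hwc still exports, subject to restrictions
            allowed.append(column)
        # "de" columns are dropped entirely
    return allowed

cols = exportable_columns(
    {"way_id": "nr", "speed": "mx", "driver_name": "de", "geom": "hwc"},
    {"FUSION": "nr", "HERE": "de", "TOMTOM": "de"},
    "FUSION",
)
```

With source `FUSION` the `mx` column survives; with `HERE` or `TOMTOM` it would be dropped, matching the per-source rules in `actual_data_classifications`.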
Classification Tables
| Table | Purpose |
|---|---|
| `data_column_classification` | Per-column export rules for each database table |
| `actual_data_classifications` | Per-source export rules (e.g. FUSION = `nr`, HERE = `de`, TOMTOM = `de`) |
| `long_table_data_classifications` | Per-metric-name export rules for long-format tables |
| `export_tables` | Master table controlling which tables are enabled for export |
Export Status Tracking
Each table export is tracked in `table_export_job_statuses`:
| Status | Meaning |
|---|---|
| `IN-PROGRESS` | UNLOAD query submitted to Redshift |
| `COMPLETED` | Export finished successfully |
| `FAILED` | Export failed after all retry attempts |
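The lifecycle above can be sketched as a per-table retry loop. `run_unload` stands in for the real Redshift UNLOAD call, and `MAX_RETRY_COUNT = 3` is an illustrative value (see Known Constraints below for the actual configuration knobs).

```python
MAX_RETRY_COUNT = 3  # illustrative; the real value comes from service config

def export_table(table: str, run_unload, max_retries: int = MAX_RETRY_COUNT) -> str:
    """Attempt a table export up to max_retries times; return the final status.

    The status begins as IN-PROGRESS when the UNLOAD is submitted and ends as
    COMPLETED on success, or FAILED once all retry attempts are exhausted.
    """
    for _ in range(max_retries):
        try:
            run_unload(table)  # submit UNLOAD and wait for it to finish
            return "COMPLETED"
        except Exception:
            continue  # transient failure: retry until attempts run out
    return "FAILED"
```

In the real service each transition is written to `table_export_job_statuses`, and a table left `FAILED` contributes to the `PARTIAL-SUCCESS` / `FAILED` SNS notification.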
S3 Storage Structure
Exported Parquet files are stored in a date-partitioned structure:
s3://{bucket}/{env}/data_exporter_service/
└── {table_name}/
    └── YYYYMMDD/
        └── {dt_start}_{dt_end}/
            ├── {table_name}_000.parquet
            ├── {table_name}_001.parquet
            └── manifest

Config-driven exports use:
s3://{bucket}/{env}/
└── YYYY/MM/DD/
    ├── {table_name}/
    │   └── *.parquet
    └── classifications/
        └── *.parquet

Exploratory Data Analysis (EDA)
Data Exporter supports EDA workflows through AWS Athena integration. The pm_internal and pm_external Redshift schemas are crawled by AWS Glue, making the exported Parquet data queryable via Athena for ad-hoc analysis without impacting the production Redshift cluster.
Athena queries target the `or-post-monitoring` database in the `AwsDataCatalog` catalog, with results output to `s3://or-dot-post-monitoring/`.
Known Constraints
- Redshift string limit — Redshift has a maximum character length of 65,535. Fields like `ar_f_path` and `geoj_way_feature` can exceed this when converted from PostgreSQL arrays to strings during federation. Mitigation: coordinate rounding + GZIP compression into `_compressed` fields.
- Duration limit — On-demand exports via `requestData` cannot span more than 1 day.
- Concurrency — `MAX_CONCURRENCY_LIMIT` controls the number of parallel Redshift UNLOAD operations to prevent cluster overload.
- Retry behaviour — `MAX_RETRY_COUNT` controls per-table retry attempts before marking as `FAILED`.
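The string-limit mitigation (coordinate rounding plus GZIP into `_compressed` fields) can be illustrated as follows. The JSON serialisation, base64 wrapping, and 5-decimal precision here are assumptions; the real encoding belongs to Batch Ingestion.

```python
import base64
import gzip
import json

def compress_path(coords: list, precision: int = 5) -> str:
    """Round coordinate pairs and return a base64-encoded, gzipped JSON string.

    Rounding trims excess precision before compression so long geometry
    fields fit within Redshift's 65,535-character limit.
    """
    rounded = [[round(x, precision), round(y, precision)] for x, y in coords]
    raw = json.dumps(rounded, separators=(",", ":")).encode()
    return base64.b64encode(gzip.compress(raw)).decode()

def decompress_path(blob: str) -> list:
    """Invert compress_path: base64-decode, gunzip, and parse the JSON."""
    return json.loads(gzip.decompress(base64.b64decode(blob)))
```

A consumer reading a `_compressed` field would apply the inverse decode step after downloading the Parquet files.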
Troubleshooting
Request Progress Not Advancing
Cause: Data Exporter pod killed by liveness check due to too many concurrent requests. This prevents all Redshift stored procedures from executing.
Resolution:
- Retrieve the troubled Request ID
- Clear the request from Redshift: `DELETE FROM de_internal.request_table WHERE de_request_id = '<request-id>';`
- Restart the Data Exporter pod
- Request the user to rerun the export with the same start and end time
Related Services
- Post Monitoring — Upstream ETL service that maintains the Redshift schemas Data Exporter reads from
- Batch Ingestion — Manages reference data that Data Exporter may export; responsible for compressed field generation
- Data Archiver — Manages PostgreSQL data lifecycle; data must be exported before archival
- Data Recorder — Writes the live data snapshots that eventually flow through Post Monitoring to Data Exporter
- Experiment Manager — Central coordination service (GraphQL on `:5100`); provides the API for on-demand export triggers
- Baseline Manager — Generates baseline data that may be included in exports
