Data Exporter

Overview

Data Exporter is the OR platform's data extraction and delivery service — responsible for exporting operational and analytical data from AWS Redshift as Parquet files and making them available to external consumers via presigned S3 URLs. It serves as the primary interface between the platform's internal data stores and external analytics environments such as Azure Databricks.

The service operates in two modes: an on-demand GraphQL-triggered export that generates Parquet files for a requested time range, and a scheduled config-driven export that runs daily via Kubernetes CronJobs. Both modes extract data from Redshift, apply PII removal and data classification filters, write Parquet files to S3, and expose download links through the platform's GraphQL API.

Data Exporter is tightly coupled with Post Monitoring, which maintains the Redshift schemas that Data Exporter reads from. Together, they form the platform's historical data pipeline — Post Monitoring handles the ETL from RDS to Redshift, while Data Exporter handles the extraction from Redshift to external consumers.

Architecture

  • Language: Julia (on-demand export), Python (config-driven export)
  • Scaling: Kubernetes CronJobs (scheduled) / Singleton pod (on-demand)
  • Port: Not externally exposed (triggered via GraphQL or CRON)
  • Dependencies: AWS Redshift, S3, PostgreSQL (RDS), Experiment Manager (GraphQL)

Key Components

  • On-demand Export (Julia) — Original export flow triggered via GraphQL. Accepts a time range, generates a request ID, executes Redshift stored procedures to create Parquet files, and returns presigned S3 URLs.
  • Config-driven Export (Python) — Newer scheduled export flow using AWS Glue. Reads export configuration from DynamoDB/Redshift config tables, applies data classification rules, and exports tables in Parquet format to S3 on a daily CRON schedule.
  • PII Removal — Strips personally identifiable information during the export transformation step before data leaves the platform boundary.
  • Data Classification Engine — Filters exported columns and rows based on classification rules (nr = no restrictions, mx = mixed, de = do not export, hwc = handle with care).
  • Redshift UNLOAD — Uses Redshift's native UNLOAD command to export query results directly to S3 in Parquet format.
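
The UNLOAD step can be sketched as a minimal Python example using the Redshift Data API. The cluster, database, secret, IAM role, bucket, and table names below are placeholders, and in production the equivalent logic runs inside the platform's stored procedures.

import boto3

redshift = boto3.client("redshift-data")

# UNLOAD a time-bounded query to S3 as Parquet; placeholder identifiers throughout.
unload_sql = """
UNLOAD ('SELECT * FROM pm_internal.way WHERE dt >= ''2024-01-01'' AND dt < ''2024-01-02''')
TO 's3://example-bucket/dev/data_exporter_service/way/20240101/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-unload-role'
FORMAT AS PARQUET
MANIFEST;
"""

response = redshift.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="example_db",
    SecretArn="arn:aws:secretsmanager:ap-southeast-2:123456789012:secret:example",
    Sql=unload_sql,
)
print("Statement submitted:", response["Id"])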

Connections

Direction  Service                       Purpose
In         Experiment Manager (GraphQL)  Triggers on-demand export requests
In         Kubernetes CronJob            Triggers scheduled daily exports
In         AWS Redshift                  Source data (pm_internal schema)
Out        AWS S3                        Parquet file storage and presigned URL generation
Out        DynamoDB                      Export configuration tables (config-driven mode)
Out        SNS                           Success/failure notifications for scheduled exports

Data Flow

On-demand Export

Requester (e.g. Azure Databricks)
    ↓ requestData mutation (start_time, end_time)
Experiment Manager (GraphQL)

Data Exporter
  ├── Generate Request ID
  ├── Submit to Redshift stored procedures
  │     ├── Select from pm_internal tables
  │     ├── Apply PII removal
  │     └── UNLOAD to S3 as Parquet
  └── Track progress in request_table
    ↓ requestStatus query (request_id)
Experiment Manager (GraphQL)
  ├── Check completion status
  ├── Generate presigned S3 URLs (valid 1 day)
  └── Return grouped URLs by table

Requester receives download links
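
The presigned-URL step at the end of this flow can be sketched roughly as below, assuming the Parquet files for a request already sit under a per-table prefix in S3. The bucket name, prefix, and table list are placeholders.

import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"               # placeholder
REQUEST_PREFIX = "dev/data_exporter_service"
TABLES = ["way", "cctv"]

download_links = {}
for table in TABLES:
    prefix = f"{REQUEST_PREFIX}/{table}/20240101/1704067200_1704153600/"
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix).get("Contents", [])
    download_links[table] = [
        s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": BUCKET, "Key": obj["Key"]},
            ExpiresIn=86_400,  # presigned URLs expire after 1 day
        )
        for obj in objects
        if obj["Key"].endswith(".parquet")
    ]

print(download_links)  # { "way": ["https://..."], "cctv": ["https://..."] }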

Config-driven Export

Kubernetes CronJob (daily at 1 AM Melbourne time)

Data Exporter (Python / AWS Glue)
  ├── Read export config from DynamoDB
  ├── Read data classification tables
  ├── For each enabled table:
  │     ├── Build filtered UNLOAD query
  │     ├── Apply column/row classification filters
  │     ├── Export to S3: YYYY/MM/DD/{table_name}/
  │     └── Update status table (IN-PROGRESS → COMPLETED/FAILED)
  ├── Export classification tables to YYYY/MM/DD/classifications/
  └── Send SNS notification (SUCCESS / FAILED / PARTIAL-SUCCESS)
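
A condensed sketch of the orchestration above is shown below. The DynamoDB table name, its item attributes, and the SNS topic ARN are assumptions; the production job runs as an AWS Glue job on the daily CRON schedule.

import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

def export_table(table_name):
    # Placeholder for the per-table UNLOAD and status tracking described on this page.
    return "COMPLETED"

def run_daily_export():
    # Placeholder config table; the real export configuration lives in DynamoDB/Redshift.
    config = dynamodb.Table("example_export_config").scan()["Items"]
    enabled = [item["table_name"] for item in config if item.get("enabled")]

    results = {name: export_table(name) for name in enabled}

    if all(status == "COMPLETED" for status in results.values()):
        outcome = "SUCCESS"
    elif any(status == "COMPLETED" for status in results.values()):
        outcome = "PARTIAL-SUCCESS"
    else:
        outcome = "FAILED"

    sns.publish(
        TopicArn="arn:aws:sns:ap-southeast-2:123456789012:example-topic",  # placeholder
        Subject=f"Data Exporter daily export: {outcome}",
        Message=f"{datetime.now(timezone.utc):%Y-%m-%d} results: {results}",
    )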

GraphQL Endpoints

requestData (Mutation)

Triggers the on-demand data export process.

Parameter  Type             Description
startTime  Integer (epoch)  Start of the export time range
endTime    Integer (epoch)  End of the export time range (max 1 day from start)
Returns    String           Request ID (UUID) for status tracking
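
A minimal sketch of triggering the export from Python is shown below. The Experiment Manager endpoint URL and the exact mutation shape are assumptions based on the parameter table above.

import requests

GRAPHQL_URL = "http://experiment-manager:5100/graphql"  # placeholder endpoint

mutation = """
mutation RequestData($startTime: Int!, $endTime: Int!) {
  requestData(startTime: $startTime, endTime: $endTime)
}
"""

resp = requests.post(
    GRAPHQL_URL,
    json={
        "query": mutation,
        "variables": {"startTime": 1704067200, "endTime": 1704153600},  # 1-day range, epoch seconds
    },
    timeout=30,
)
request_id = resp.json()["data"]["requestData"]
print("Request ID:", request_id)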

requestStatus (Mutation)

Checks export progress and retrieves download links on completion.

Parameter  Type           Description
requestId  String (UUID)  The Request ID from requestData

Returns:

Field                     Type    Description
status                    String  In Progress, Completed, or Request has not been requested
progress                  Float   Completion percentage (0.0–100.0)
dataRequestDownloadLinks  JSON    { table_name: [url, url, ...], ... } — presigned URLs grouped by table

TIP

Presigned URLs are valid for 1 day only. Files are grouped by table name (e.g. { way: ["..."], cctv: ["..."] }).
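
A rough polling-and-download sketch follows, under the same assumptions about the endpoint and schema, and assuming dataRequestDownloadLinks is returned as a JSON object keyed by table name.

import time
import requests

GRAPHQL_URL = "http://experiment-manager:5100/graphql"  # placeholder endpoint

query = """
mutation RequestStatus($requestId: String!) {
  requestStatus(requestId: $requestId) {
    status
    progress
    dataRequestDownloadLinks
  }
}
"""

def wait_and_download(request_id, dest_dir="."):
    while True:
        resp = requests.post(
            GRAPHQL_URL,
            json={"query": query, "variables": {"requestId": request_id}},
            timeout=30,
        ).json()["data"]["requestStatus"]
        if resp["status"] == "Completed":
            break
        time.sleep(30)  # progress percentage is available in resp["progress"]

    # Links are grouped by table, e.g. { "way": [url, ...], "cctv": [url, ...] }
    for table, urls in resp["dataRequestDownloadLinks"].items():
        for i, url in enumerate(urls):
            data = requests.get(url, timeout=300).content
            with open(f"{dest_dir}/{table}_{i:03d}.parquet", "wb") as fh:
                fh.write(data)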

Data Classification

The config-driven export uses a multi-layered classification system to control what data is exported:

Classification Types

Code  Meaning
nr    No restrictions on exporting
mx    Mixed — depends on source type
de    Do not export
hwc   Handle with care — export with restrictions

Classification Tables

Table                            Purpose
data_column_classification       Per-column export rules for each database table
actual_data_classifications      Per-source export rules (e.g. FUSION = nr, HERE = de, TOMTOM = de)
long_table_data_classifications  Per-metric-name export rules for long-format tables
export_tables                    Master table controlling which tables are enabled for export
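
A minimal sketch of how these rules might combine when building the SELECT list for an UNLOAD query is shown below. The example rules and the way mx falls back to the source-level classification are illustrative assumptions.

# Example rows standing in for data_column_classification and actual_data_classifications.
column_rules = {"id": "nr", "geom": "hwc", "driver_name": "de", "speed": "mx"}
source_rules = {"FUSION": "nr", "HERE": "de", "TOMTOM": "de"}

def exportable_columns(column_rules, source):
    cols = []
    for column, code in column_rules.items():
        if code == "de":
            continue                      # never export
        if code == "mx" and source_rules.get(source) == "de":
            continue                      # mixed: follow the source-level rule
        cols.append(column)               # nr / hwc (hwc is exported with restrictions)
    return cols

select_list = ", ".join(exportable_columns(column_rules, source="HERE"))
print(select_list)  # -> "id, geom"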

Export Status Tracking

Each table export is tracked in table_export_job_statuses:

Status       Meaning
IN-PROGRESS  UNLOAD query submitted to Redshift
COMPLETED    Export finished successfully
FAILED       Export failed after all retry attempts
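
The per-table retry-and-status loop could look roughly like the sketch below. The location and columns of the status table, and the default retry count, are assumptions; all identifiers are placeholders.

import boto3

redshift = boto3.client("redshift-data")
MAX_RETRY_COUNT = 3  # assumed default; configured per deployment

def set_status(table_name, status):
    # Assumes the status table is reachable through the Redshift Data API.
    redshift.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="example_db",
        SecretArn="arn:aws:secretsmanager:ap-southeast-2:123456789012:secret:example",
        Sql=f"UPDATE table_export_job_statuses SET status = '{status}' "
            f"WHERE table_name = '{table_name}';",
    )

def export_with_retries(table_name, run_unload):
    set_status(table_name, "IN-PROGRESS")
    for _attempt in range(MAX_RETRY_COUNT):
        try:
            run_unload(table_name)
            set_status(table_name, "COMPLETED")
            return "COMPLETED"
        except Exception:
            continue  # retry until MAX_RETRY_COUNT is exhausted
    set_status(table_name, "FAILED")
    return "FAILED"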

S3 Storage Structure

Exported Parquet files are stored in a date-partitioned structure:

s3://{bucket}/{env}/data_exporter_service/
  └── {table_name}/
      └── YYYYMMDD/
          └── {dt_start}_{dt_end}/
              ├── {table_name}_000.parquet
              ├── {table_name}_001.parquet
              └── manifest

Config-driven exports use:

s3://{bucket}/{env}/
  └── YYYY/MM/DD/
      ├── {table_name}/
      │   └── *.parquet
      └── classifications/
          └── *.parquet
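
The two prefix layouts above can be computed as in the small sketch below; the bucket and environment values are placeholders.

from datetime import datetime, timezone

BUCKET = "example-bucket"  # placeholder
ENV = "dev"                # placeholder

def on_demand_prefix(table_name, dt_start, dt_end, now=None):
    now = now or datetime.now(timezone.utc)
    return (f"s3://{BUCKET}/{ENV}/data_exporter_service/"
            f"{table_name}/{now:%Y%m%d}/{dt_start}_{dt_end}/")

def config_driven_prefix(table_name, now=None):
    now = now or datetime.now(timezone.utc)
    return f"s3://{BUCKET}/{ENV}/{now:%Y/%m/%d}/{table_name}/"

print(on_demand_prefix("way", 1704067200, 1704153600))
print(config_driven_prefix("way"))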

Exploratory Data Analysis (EDA)

Data Exporter supports EDA workflows through AWS Athena integration. The pm_internal and pm_external Redshift schemas are crawled by AWS Glue, making the exported Parquet data queryable via Athena for ad-hoc analysis without impacting the production Redshift cluster.

Athena queries target the or-post-monitoring database in the AwsDataCatalog data catalog, with query results written to s3://or-dot-post-monitoring/.
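
An ad-hoc Athena query against this catalog might look like the following sketch; the table name in the query is a placeholder, while the database and output location follow the text above.

import time
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM way;",  # placeholder table
    QueryExecutionContext={"Catalog": "AwsDataCatalog", "Database": "or-post-monitoring"},
    ResultConfiguration={"OutputLocation": "s3://or-dot-post-monitoring/"},
)

query_id = execution["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)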

Known Constraints

  • Redshift string limit — Redshift has a maximum character length of 65,535. Fields like ar_f_path and geoj_way_feature can exceed this when converted from PostgreSQL arrays to strings during federation. Mitigation: coordinate rounding + GZIP compression into _compressed fields (see the sketch after this list).
  • Duration limit — On-demand exports via requestData cannot span more than 1 day.
  • Concurrency — MAX_CONCURRENCY_LIMIT controls the number of parallel Redshift UNLOAD operations to prevent cluster overload.
  • Retry behaviour — MAX_RETRY_COUNT controls per-table retry attempts before a table is marked FAILED.
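
The string-limit mitigation in the first bullet can be illustrated roughly as below. The actual encoding of the _compressed fields is owned by Batch Ingestion and is not documented here, so the rounding precision and serialisation format are assumptions.

import base64
import gzip
import json

def compress_geometry(coords, precision=5):
    # Round coordinates, serialise compactly, then GZIP + base64 so the stored string
    # fits within Redshift's 65,535-character limit. Illustrative only.
    rounded = [[round(lon, precision), round(lat, precision)] for lon, lat in coords]
    payload = json.dumps(rounded, separators=(",", ":")).encode()
    return base64.b64encode(gzip.compress(payload)).decode()

geoj_way_feature_compressed = compress_geometry([(144.9630579, -37.8136276), (144.96310, -37.81370)])
print(len(geoj_way_feature_compressed))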

Troubleshooting

Request Progress Not Advancing

Cause: The Data Exporter pod was killed by its liveness check due to too many concurrent requests, preventing the Redshift stored procedures from executing.

Resolution:

  1. Retrieve the troubled Request ID
  2. Clear the request from Redshift: DELETE FROM de_internal.request_table WHERE de_request_id = '<request-id>';
  3. Restart the Data Exporter pod
  4. Ask the user to rerun the export with the same start and end time

Related Services

  • Post Monitoring — Upstream ETL service that maintains the Redshift schemas Data Exporter reads from
  • Batch Ingestion — Manages reference data that Data Exporter may export; responsible for compressed field generation
  • Data Archiver — Manages PostgreSQL data lifecycle; data must be exported before archival
  • Data Recorder — Writes the live data snapshots that eventually flow through Post Monitoring to Data Exporter
  • Experiment Manager — Central coordination service (GraphQL on :5100); provides the API for on-demand export triggers
  • Baseline Manager — Generates baseline data that may be included in exports
