Batch Ingestion

Overview

Batch Ingestion is the OR platform's configuration-driven data loader for slow-moving reference and geospatial datasets. While the real-time pipeline handles live sensor feeds and streaming data, Batch Ingestion manages the periodic update of static assets — traffic signal locations, road features, device registries, and geospatial reference data — that form the foundational layer the platform operates on. These datasets change infrequently (daily, weekly, or monthly) but must be kept current and correctly structured for consumption by the rest of the platform.

The service operates in two modes: as a CRON-controlled automated loader that runs on a pre-defined schedule without human intervention, and as a live microservice with GraphQL API integration for ad-hoc data loading during development or initial deployment. It collects data from multiple sources — client-provided APIs, flat files stored in S3, and open data hubs — and transforms it into OR-compliant structures before loading into the PostgreSQL reference datastore.

Batch Ingestion is controlled through a digraph configuration file that defines the data pipeline for each source. Each source has its own schedule, transformation logic, and target tables, making the service highly extensible for new data types without code changes.
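As a rough illustration, a digraph config for a single source might look like the following. All field names here are hypothetical — the actual schema is defined by the service — but the shape shows how source location, transformation scripts, target tables, and scheduling fit together:

```yaml
# Hypothetical digraph config for one source — field names are
# illustrative, not the service's actual schema.
sources:
  - name: TRAFFIC_LIGHTS
    schedule: "0 2 * * *"            # daily at 02:00 (CRON syntax)
    location:
      type: s3
      bucket: or-reference-data      # hypothetical bucket
      key: traffic_lights/latest.csv
    pipeline:                        # digraph: steps declare dependencies
      - step: load_raw
        target: tmp_traffic_lights
      - step: transform
        depends_on: [load_raw]
        script: sql/transform_traffic_lights.sql
      - step: commit
        depends_on: [transform]
        target: ref_traffic_lights
```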

Architecture

  • Language: Julia
  • Scaling: CRON-scheduled pods (one pod per ingestion job)
  • Dependencies: PostgreSQL (RDS), S3, Experiment Manager (GraphQL)

Key Components

  • Digraph Configuration — Each data source is defined in a YAML config that specifies the source location (API or S3), transformation SQL scripts, target tables, and scheduling parameters. The digraph structure allows complex multi-step ingestion pipelines.
  • CRON Scheduler — Kubernetes CronJobs trigger ingestion pods at configured intervals. Each data source runs on its own schedule (daily, weekly, etc.).
  • GraphQL API — Exposes an executeBatchIngestion mutation for ad-hoc manual triggers during development and initial deployment.
  • SQL Transformation Scripts — Raw data loaded into temporary tables is transformed into the platform's target schema using SQL scripts before being committed to the main database.
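A sketch of how the CRON scheduling could look as a Kubernetes CronJob manifest — the names, image, and arguments are hypothetical:

```yaml
# Hypothetical CronJob for one ingestion source — names and image are
# illustrative, not the platform's actual manifests.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: batch-ingestion-traffic-lights
spec:
  schedule: "0 2 * * *"            # daily at 02:00
  concurrencyPolicy: Forbid        # avoid overlapping runs of the same source
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: batch-ingestion
              image: registry.example.com/batch-ingestion:latest
              args: ["--source", "TRAFFIC_LIGHTS"]
```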

Connections

| Direction | Service | Purpose |
|---|---|---|
| In | Client APIs | External REST APIs for device and asset data |
| In | AWS S3 | Flat files (CSV, JSON) for reference data |
| In | Vic Open Data Hub | Open geospatial datasets |
| Out | PostgreSQL (RDS) | Target reference datastore |
| Out | Experiment Manager (GraphQL) | API integration for manual triggers |

Data Flow

Data Sources
  ├── Client APIs (SCATS, STREAMS, device registries)
  ├── S3 flat files (reference data, clearways, restrictions)
  └── Vic Open Data Hub (open geospatial data)
        ↓ (CRON schedule or manual GraphQL trigger)
Batch Ingestion Service
  ├── Load raw data into temporary tables
  ├── Apply SQL transformation scripts
  └── Commit transformed data to main database
        ↓
PostgreSQL (RDS) — Reference tables
        ↓
Platform services (real-time pipeline, frontend, spatial)
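The load → transform → commit pattern above can be sketched as follows. This is a minimal stand-in using SQLite in place of PostgreSQL, with hypothetical table and column names; the real service drives this from SQL scripts referenced in the digraph config:

```python
# Minimal sketch of the load -> transform -> commit pattern, using SQLite
# as a stand-in for PostgreSQL; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Main reference table and a temporary staging table.
cur.execute("CREATE TABLE ref_devices (source_id TEXT PRIMARY KEY, kind TEXT)")
cur.execute("CREATE TEMP TABLE tmp_devices (source_id TEXT, kind TEXT)")

# 1. Load raw rows into the temporary table.
raw = [("SCATS-001", "scats site"), ("STREAMS-042", "streams detector")]
cur.executemany("INSERT INTO tmp_devices VALUES (?, ?)", raw)

# 2. Apply a transformation script (here: normalise the kind column).
cur.execute("UPDATE tmp_devices SET kind = UPPER(REPLACE(kind, ' ', '_'))")

# 3. Commit transformed rows to the main table in one transaction,
#    replacing any previous version of the same source_id.
cur.execute("INSERT OR REPLACE INTO ref_devices SELECT * FROM tmp_devices")
conn.commit()

print(cur.execute("SELECT * FROM ref_devices ORDER BY source_id").fetchall())
```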

Ingested Data Sources

Batch Ingestion manages a wide catalogue of reference data sources:

Devices

| Source | Description |
|---|---|
| SCATS sites | Traffic signal intersection controllers |
| STREAMS detectors | Freeway and arterial traffic detectors |
| Pump Stations | Drainage and flood management pumps |
| AAWS | Adverse weather warning stations |
| AddInsight sites and links | Private traffic analytics network |

Road Features

| Source | Description |
|---|---|
| VMS | Variable Message Signs |
| CCTV cameras | Video surveillance camera locations |
| LUMS | Lane Use Management Signs |
| ESLS | Electronic Speed Limit Signs |
| Ramps | Freeway on/off ramp definitions |
| Ice Stations | Road ice detection stations |
| Clearways | Timed clearway zone definitions |
| Declared roads | Client-managed declared road network |
| Height clearance | Bridge and overpass height restrictions |
| Parking restrictions | Parking zone definitions |
| Turn restrictions | Intersection turn restrictions |

INFO

Live data for these sources is handled by the real-time ingestion pipeline (Data Ingestion, Data Stream Ingestion), not Batch Ingestion. Batch Ingestion only manages the static reference and geospatial attributes.

LUMS Mapping

Batch Ingestion includes a specialised mapping pipeline for LUMS (Lane Use Management Signs) that ensures signs do not overlap when mapped onto the OSM ways used for the platform's road-network links.

The mapping uses a two-pass nearest-node approach:

  1. Retrieve STREAMS links — Get all STREAMS links associated with LUMS devices from the STREAMS Gateway
  2. Hidden Markov Method (HMM) mapping — Use the OSM network to find the best-fitting path of OSM ways for each STREAMS link's coordinates
  3. Overlap check — Lower-triangle-style comparison to detect and remove overlapping way assignments (FIFO removal to prevent overlap at road segment boundaries)
  4. First pass: nearest way — Map each LUMS to the leading edge of its nearest way using the first node, ensuring a minimum of one way per device
  5. Second pass: remaining ways — Map any unmapped ways to their nearest LUMS device by minimum distance
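The two nearest-distance passes (steps 4–5) can be sketched as follows. This is a deliberately simplified version that reduces devices and ways to 2-D points with Euclidean distance — the real pipeline works on OSM way geometries and leading-edge nodes — and all names are illustrative:

```python
# Simplified sketch of the two-pass nearest mapping (steps 4-5).
# Devices and ways are reduced to 2-D points with Euclidean distance;
# the real pipeline uses OSM way geometries. Names are illustrative.
from math import dist

def map_lums_to_ways(devices, ways):
    """devices/ways: dicts of id -> (x, y). Returns way_id -> device_id."""
    mapping = {}
    # First pass: each device claims its nearest still-unclaimed way,
    # so every device gets at least one way (while ways remain).
    free = dict(ways)
    for dev_id, dev_pt in devices.items():
        if not free:
            break  # more LUMS than ways: remaining devices stay unmapped
        way_id = min(free, key=lambda w: dist(free[w], dev_pt))
        mapping[way_id] = dev_id
        del free[way_id]
    # Second pass: any remaining ways go to their nearest device.
    for way_id, way_pt in free.items():
        mapping[way_id] = min(devices, key=lambda d: dist(devices[d], way_pt))
    return mapping

devices = {"LUMS-1": (0.0, 0.0), "LUMS-2": (10.0, 0.0)}
ways = {"way-a": (1.0, 0.0), "way-b": (9.0, 0.0), "way-c": (6.0, 0.0)}
print(map_lums_to_ways(devices, ways))
```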

Edge Cases

  • No link assigned to LUMS — Device cannot be mapped without a STREAMS link association
  • HMM mapping failure — Link coordinates cannot be resolved to OSM ways
  • More LUMS than ways — Only as many LUMS as there are ways can be mapped; the closest devices are prioritised

Manual Trigger

For development, testing, or initial deployment, Batch Ingestion can be triggered manually via GraphQL:

```graphql
mutation executeBatchIngestion {
  executeBatchIngestion(sourceName: "TRAFFIC_LIGHTS") {
    message
  }
}
```

In production, Batch Ingestion should only run on its CRON schedule.
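Outside a GraphQL client, the same mutation could be posted over plain HTTP. The endpoint URL below is an assumption (the Experiment Manager's GraphQL gateway, per the related-services notes); check the actual gateway address for your deployment:

```python
# Sketch of triggering the batch-ingestion mutation over HTTP.
# The gateway URL is an assumption; only the mutation shape comes
# from the documented executeBatchIngestion API.
import json
from urllib import request

def build_trigger_payload(source_name: str) -> bytes:
    query = (
        f'mutation {{ executeBatchIngestion(sourceName: "{source_name}")'
        " { message } }"
    )
    return json.dumps({"query": query}).encode()

def trigger(source_name: str, url: str = "http://localhost:5100/graphql"):
    req = request.Request(
        url,
        data=build_trigger_payload(source_name),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# trigger("TRAFFIC_LIGHTS")  # requires a reachable gateway
```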

Troubleshooting

Failed Ingestion Load

Symptoms: Duplicate data / multiple envObjId values for the same sObjSourceId, or CRON job logs show correct data but database state is wrong.

Common cause: Primary Key–Foreign Key relationship invalidated during the update step. Most likely for the Traffic Signals dataset due to strongly connected event data that can arrive during the delete step.

Resolution:

  1. Rerun the update manually:

     ```graphql
     mutation updateScats {
       executeBatchIngestion(sourceName: "SCATS_INTERSECTION_GROUPS") {
         message
       }
     }
     ```

  2. If it fails again: wait and retry, or stop the Data Recorder pods and rerun. Restart Data Recorder after completion.

WARNING

Stopping Data Recorder will create a gap in playback data for the duration it is unavailable.

Stalled Update (30+ minutes)

Action: Do not restart or terminate the container. Investigate long-running database processes and identify the stalled query. Only cancel the process if you confirm it cannot self-resolve.
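To identify long-running database processes, a query along these lines against PostgreSQL's standard pg_stat_activity view can help; adjust the filtering to taste:

```sql
-- Non-idle queries, longest-running first (pg_stat_activity is a
-- standard PostgreSQL system view). Cancel a stalled query only once
-- you have confirmed it cannot self-resolve:
--   SELECT pg_cancel_backend(<pid>);
SELECT pid,
       now() - query_start AS duration,
       state,
       left(query, 120)    AS query
FROM   pg_stat_activity
WHERE  state <> 'idle'
ORDER  BY duration DESC;
```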

Monitoring Database Table Size

From Batch Ingestion v0.6.24+, run the following to check database table sizes:

```graphql
mutation executeBatchIngestion {
  executeBatchIngestion(sourceName: "TURN_RESTRICTIONS") {
    message
  }
}
```

Results are visible on the Batch Ingestion Grafana dashboard page.

Related Services

  • Data Ingestion — Handles live/real-time data ingestion; Batch Ingestion handles the complementary static reference data
  • Data Stream Ingestion — Streaming data counterpart for continuous feeds
  • Data Loader — Loads initial reference data on stack spin-up; Batch Ingestion handles ongoing updates
  • Data Recorder — Writes live data snapshots; can conflict with Batch Ingestion during FK-constrained updates
  • Experiment Manager — Central coordination service (GraphQL on :5100); provides the API for manual batch triggers
  • Post Monitoring — Downstream consumer of reference data via Redshift ETL
  • Data Exporter — Exports data that depends on correctly ingested reference datasets

User documentation for Optimal Reality