Batch Ingestion

Overview

Batch Ingestion is the OR platform's configuration-driven data loader for slow-moving reference and geospatial datasets. While the real-time pipeline handles live sensor feeds and streaming data, Batch Ingestion manages the periodic update of static assets — traffic signal locations, road features, device registries, and geospatial reference data — that form the foundational layer the platform operates on. These datasets change infrequently (daily, weekly, or monthly) but must be kept current and correctly structured for consumption by the rest of the platform.

The service operates in two modes: as a CRON-controlled automated loader that runs on a pre-defined schedule without human intervention, and as a live microservice with GraphQL API integration for ad-hoc data loading during development or initial deployment. It collects data from multiple sources — client-provided APIs, flat files stored in S3, and open data hubs — and transforms it into OR-compliant structures before loading into the PostgreSQL reference datastore.

Batch Ingestion is controlled through a digraph (directed graph) configuration file that defines the data pipeline for each source. Each source has its own schedule, transformation logic, and target tables, so new data types can be added without code changes.
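As a rough illustration of what a directed-graph pipeline definition implies, the sketch below models one source as a set of steps with dependency edges and derives a valid execution order. The keys, step names, table name, and cron expression are all hypothetical; this is not the service's real configuration schema.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline definition for one source. Keys and step names are
# illustrative only; they are not the service's real configuration schema.
PIPELINE = {
    "source": "vms",
    "schedule": "0 2 * * *",  # cron expression: daily at 02:00
    "target_tables": ["road_feature"],
    # Directed graph: each step maps to the set of steps it depends on.
    "steps": {
        "fetch": set(),
        "validate": {"fetch"},
        "transform": {"validate"},
        "load": {"transform"},
    },
}


def execution_order(pipeline: dict) -> list[str]:
    """Return the steps in an order that respects every dependency edge."""
    return list(TopologicalSorter(pipeline["steps"]).static_order())


print(execution_order(PIPELINE))
```

Under these assumptions, adding a new source means adding another definition of the same shape rather than writing new code. `TopologicalSorter` raises `CycleError` if the steps accidentally form a loop.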

Architecture

The service operates on automated schedules, executing data loading jobs at configured intervals. It integrates with the platform's central datastore and cloud storage systems.

Key Components

  • Configuration-Driven Pipeline — Each data source is defined through declarative configuration that specifies source location, transformation logic, target destinations, and scheduling parameters. The configuration structure allows complex multi-step ingestion pipelines.
  • Automated Scheduling — Data loading jobs run automatically at configured intervals without manual intervention. Each data source operates on its own schedule (daily, weekly, etc.).
  • Manual Trigger API — Provides an interface for ad-hoc data loading during development and initial deployment.
  • Data Transformation — Raw source data is transformed into the platform's standardized schema before being committed to the reference database.

Connections

Direction  Service                        Purpose
In         Client APIs                    External APIs for device and asset data
In         Cloud Storage                  Flat files for reference data
In         Open Data Hub                  Open geospatial datasets
Out        Reference Database             Target reference datastore
Out        Central Orchestration Service  API integration for manual triggers

Data Flow

The service collects data from multiple external sources (client APIs, flat files, and open data hubs), transforms it into the platform's standardized format, and loads it into the reference database where it becomes available to all platform services including the real-time pipeline, frontend, and spatial services.
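The transform stage of this flow can be sketched as follows. The `ReferenceRecord` fields, the raw key names (`id`, `name`, `lon`, `lat`), and the choice to set bad records aside rather than abort the batch are assumptions made for illustration, not the platform's documented schema or behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRecord:
    """Hypothetical standardized row; field names are illustrative."""
    source: str
    source_id: str
    name: str
    longitude: float
    latitude: float


def transform(source: str, raw: dict) -> ReferenceRecord:
    """Map one raw source record onto the standardized shape."""
    return ReferenceRecord(
        source=source,
        source_id=str(raw["id"]).strip(),
        name=str(raw.get("name", "")).strip(),
        longitude=float(raw["lon"]),
        latitude=float(raw["lat"]),
    )


def run_batch(source: str, raw_records: list[dict]) -> tuple[list[ReferenceRecord], list[dict]]:
    """Transform a batch, collecting records that cannot be converted."""
    loaded, rejected = [], []
    for raw in raw_records:
        try:
            loaded.append(transform(source, raw))
        except (KeyError, TypeError, ValueError):
            rejected.append(raw)
    return loaded, rejected


raw = [
    {"id": 101, "name": " VMS North ", "lon": "144.96", "lat": "-37.81"},
    {"id": 102, "name": "VMS South"},  # missing coordinates, so it is rejected
]
loaded, rejected = run_batch("vms", raw)
print(len(loaded), len(rejected))
```

The rows in `loaded` would then be written to the reference database; that step is omitted here because the target tables are deployment-specific.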

Ingested Data Sources

Batch Ingestion manages a wide catalogue of reference data sources:

Devices

Source                      Description
SCATS sites                 Traffic signal intersection controllers
STREAMS detectors           Freeway and arterial traffic detectors
Pump Stations               Drainage and flood management pumps
AAWS                        Adverse weather warning stations
AddInsight sites and links  Private traffic analytics network

Road Features

Source                Description
VMS                   Variable Message Signs
CCTV cameras          Video surveillance camera locations
LUMS                  Lane Use Management Signs
ESLS                  Electronic Speed Limit Signs
Ramps                 Freeway on/off ramp definitions
Ice Stations          Road ice detection stations
Clearways             Timed clearway zone definitions
Declared roads        Client-managed declared road network
Height clearance      Bridge and overpass height restrictions
Parking restrictions  Parking zone definitions
Turn restrictions     Intersection turn restrictions

INFO

Live data for these sources is handled by the real-time ingestion pipeline (Data Ingestion, Data Stream Ingestion), not Batch Ingestion. Batch Ingestion only manages the static reference and geospatial attributes.

LUMS Mapping

Batch Ingestion includes a specialised mapping pipeline for LUMS (Lane Use Management Signs) that ensures accurate assignment to road network segments without overlaps.

The mapping process:

  1. Retrieve device-to-link associations — Get all link associations for LUMS devices from the external gateway
  2. Identify best-fitting road segments — Find the best-fitting path of road segments for each link's coordinates
  3. Detect and remove overlaps — Identify and eliminate overlapping segment assignments to prevent conflicts at road boundaries
  4. First pass: nearest segment — Map each LUMS to its nearest road segment, ensuring every device has at least one segment assignment
  5. Second pass: remaining segments — Map any unmapped segments to their nearest LUMS device

Edge Cases

  • No link assigned to LUMS — Device cannot be mapped without a link association
  • Mapping failure — Link coordinates cannot be resolved to road segments
  • More LUMS than segments — Only a number of LUMS equal to the number of segments will have mappings; closest devices are prioritised

Manual Trigger

For development, testing, or initial deployment, Batch Ingestion can be triggered manually through the platform's API. Contact your platform administrator for API access details and available data sources.

In production, Batch Ingestion runs automatically on its configured schedule.
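For orientation only, the sketch below builds a GraphQL-over-HTTP request of the kind a manual trigger might use. The mutation name `triggerBatchIngestion`, its `source` argument, the endpoint URL, and bearer-token authentication are all hypothetical placeholders; the real operation and access details must come from your platform administrator.

```python
import json
from urllib.request import Request

# Hypothetical mutation: the real operation name, arguments, and endpoint
# must come from your platform administrator.
TRIGGER_MUTATION = """
mutation TriggerBatchIngestion($source: String!) {
  triggerBatchIngestion(source: $source) {
    status
  }
}
"""


def build_trigger_request(endpoint: str, token: str, source: str) -> Request:
    """Build (but do not send) a GraphQL-over-HTTP POST request."""
    body = json.dumps({"query": TRIGGER_MUTATION, "variables": {"source": source}})
    return Request(
        endpoint,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_trigger_request("https://platform.example/graphql", "<token>", "vms")
print(request.get_method(), request.full_url)
```

The request is only constructed here. Passing it to `urllib.request.urlopen` would send it, which should be done only against an endpoint you have been given access to.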

Troubleshooting

Failed Ingestion Load

Symptoms: Duplicate data (multiple envObjId values for the same sObjSourceId), or CRON job logs that show correct data while the database state is wrong.

Common cause: A primary key/foreign key relationship is invalidated during the update step. This is most likely for the Traffic Signals dataset, because its closely linked event data can arrive while the delete step is running.

Resolution:

  1. Rerun the update manually through the platform's API interface
  2. If it fails again: wait and retry, or temporarily stop the Data Recorder service and rerun. Restart Data Recorder after completion.

WARNING

Stopping Data Recorder will create a gap in playback data for the duration it is unavailable.
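To confirm the duplicate symptom before and after a rerun, a check along the following lines may help. It assumes you have already fetched (sObjSourceId, envObjId) pairs from the affected table; the function name and sample identifiers are illustrative, and the table to query depends on the dataset.

```python
from collections import defaultdict


def find_duplicate_sources(rows: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Given (sObjSourceId, envObjId) pairs, return source IDs with more than one envObjId."""
    env_ids: dict[str, set[str]] = defaultdict(set)
    for source_id, env_obj_id in rows:
        env_ids[source_id].add(env_obj_id)
    return {s: sorted(ids) for s, ids in env_ids.items() if len(ids) > 1}


# Sample rows with made-up identifiers.
rows = [("scats-001", "env-1"), ("scats-001", "env-9"), ("scats-002", "env-2")]
print(find_duplicate_sources(rows))
```

An empty result after the rerun suggests the load completed cleanly for that dataset.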

Stalled Update (30+ minutes)

Action: Do not restart or terminate the container. Investigate long-running database processes and identify the stalled query. Only cancel the process if you confirm it cannot self-resolve.
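Since the reference datastore is PostgreSQL, one way to look for the stalled query is the standard `pg_stat_activity` view. The helper below only builds the SQL text; its name and the 30-minute default are assumptions, and running it requires a database client and a role permitted to see other sessions.

```python
def long_running_sql(min_minutes: int = 30) -> str:
    """Build a query listing non-idle sessions running longer than the threshold."""
    if min_minutes < 0:
        raise ValueError("min_minutes must be non-negative")
    return (
        "SELECT pid, state, now() - query_start AS runtime,\n"
        "       wait_event_type, wait_event, query\n"
        "FROM pg_stat_activity\n"
        "WHERE state <> 'idle'\n"
        f"  AND query_start < now() - interval '{int(min_minutes)} minutes'\n"
        "ORDER BY query_start;"
    )


print(long_running_sql())
```

The `wait_event_type` and `wait_event` columns can indicate whether a session is blocked on a lock rather than doing work. If cancellation is confirmed to be necessary, PostgreSQL's `pg_cancel_backend(pid)` cancels the running query without terminating the session.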

Monitoring Database Table Size

The Batch Ingestion service provides database table size monitoring capabilities. Results are visible on the Batch Ingestion monitoring dashboard.

Related Services

  • Data Ingestion — Handles live/real-time data ingestion; Batch Ingestion handles the complementary static reference data
  • Data Stream Ingestion — Streaming data counterpart for continuous feeds
  • Data Loader — Loads initial reference data on stack spin-up; Batch Ingestion handles ongoing updates
  • Data Recorder — Writes live data snapshots; can conflict with Batch Ingestion during FK-constrained updates
  • Experiment Manager — Central coordination service providing the API for manual batch triggers
  • Post Monitoring — Downstream consumer of reference data via Redshift ETL
  • Data Exporter — Exports data that depends on correctly ingested reference datasets

User documentation for Optimal Reality