Batch Ingestion
Overview
Batch Ingestion is the OR platform's configuration-driven data loader for slow-moving reference and geospatial datasets. While the real-time pipeline handles live sensor feeds and streaming data, Batch Ingestion manages the periodic update of static assets — traffic signal locations, road features, device registries, and geospatial reference data — that form the foundational layer the platform operates on. These datasets change infrequently (daily, weekly, or monthly) but must be kept current and correctly structured for consumption by the rest of the platform.
The service operates in two modes: as a CRON-controlled automated loader that runs on a pre-defined schedule without human intervention, and as a live microservice with GraphQL API integration for ad-hoc data loading during development or initial deployment. It collects data from multiple sources — client-provided APIs, flat files stored in S3, and open data hubs — and transforms it into OR-compliant structures before loading into the PostgreSQL reference datastore.
Batch Ingestion is controlled through a digraph configuration file that defines the data pipeline for each source. Each source has its own schedule, transformation logic, and target tables, making the service highly extensible for new data types without code changes.
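As a minimal sketch of what a digraph-style pipeline definition could look like, the example below models one source as a set of steps with dependency edges and derives a run order using Python's standard `graphlib` module. The step names, the `scats_sites` key, the CRON string, and the target table name are all hypothetical; the actual configuration format is not shown in this document.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline definition: each step maps to the steps it depends on.
# The source name, schedule, and table name are illustrative only.
PIPELINE = {
    "source": "scats_sites",
    "schedule": "0 2 * * *",  # CRON expression: daily at 02:00
    "target_table": "reference.scats_site",
    "steps": {
        "fetch": [],
        "validate": ["fetch"],
        "transform": ["validate"],
        "load": ["transform"],
    },
}


def execution_order(pipeline: dict) -> list[str]:
    """Return the steps in an order that respects every dependency edge."""
    # TopologicalSorter raises graphlib.CycleError if the steps form a cycle.
    return list(TopologicalSorter(pipeline["steps"]).static_order())


order = execution_order(PIPELINE)
print(order)
```

Treating the configuration as a directed graph means a new source can add or reorder steps by editing edges, and a misconfigured cycle is rejected before any data is touched.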
Architecture
In production the service runs as a scheduled job, executing each source's load at its configured interval. It reads from client APIs, cloud storage, and open data hubs, and writes to the platform's PostgreSQL reference datastore.
Key Components
- Configuration-Driven Pipeline — Each data source is defined through declarative configuration that specifies source location, transformation logic, target destinations, and scheduling parameters. The configuration structure allows complex multi-step ingestion pipelines.
- Automated Scheduling — Data loading jobs run automatically at configured intervals without manual intervention. Each data source operates on its own schedule (daily, weekly, etc.).
- Manual Trigger API — Provides an interface for ad-hoc data loading during development and initial deployment.
- Data Transformation — Raw source data is transformed into the platform's standardized schema before being committed to the reference database.
Connections
| Direction | Service | Purpose |
|---|---|---|
| In | Client APIs | External APIs for device and asset data |
| In | Cloud Storage | Flat files for reference data |
| In | Open Data Hub | Open geospatial datasets |
| Out | Reference Database | Target reference datastore |
| Out | Central Orchestration Service | API integration for manual triggers |
Data Flow
The service collects data from multiple external sources (client APIs, flat files, and open data hubs), transforms it into the platform's standardized format, and loads it into the reference database where it becomes available to all platform services including the real-time pipeline, frontend, and spatial services.
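The transform stage of this flow can be illustrated with a small sketch. It assumes raw rows arrive as dictionaries with `id`, `name`, `lon`, and `lat` keys and that the standardized record is a simple dataclass; both shapes are hypothetical, since the platform's real schema is not described here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReferenceRecord:
    """Hypothetical standardized record; not the platform's actual schema."""
    source: str
    source_id: str
    name: str
    longitude: float
    latitude: float


def transform(source: str, raw_rows: Iterable[dict]) -> list[ReferenceRecord]:
    """Normalise raw rows, skipping any without a usable id or coordinates."""
    records = []
    for row in raw_rows:
        try:
            records.append(
                ReferenceRecord(
                    source=source,
                    source_id=str(row["id"]).strip(),
                    name=str(row.get("name", "")).strip(),
                    longitude=float(row["lon"]),
                    latitude=float(row["lat"]),
                )
            )
        except (KeyError, TypeError, ValueError):
            continue  # a real loader would log the rejected row
    return records


raw = [
    {"id": " 101 ", "name": "Main St / First Ave", "lon": "138.60", "lat": "-34.92"},
    {"id": "102", "name": "Missing coordinates"},
]
records = transform("scats_sites", raw)
print(records)
```

Rejecting malformed rows before the load step keeps partially valid source files from blocking an entire scheduled run.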
Ingested Data Sources
Batch Ingestion manages a wide catalogue of reference data sources:
Devices
| Source | Description |
|---|---|
| SCATS sites | Traffic signal intersection controllers |
| STREAMS detectors | Freeway and arterial traffic detectors |
| Pump Stations | Drainage and flood management pumps |
| AAWS | Adverse weather warning stations |
| AddInsight sites and links | Private traffic analytics network |
Road Features
| Source | Description |
|---|---|
| VMS | Variable Message Signs |
| CCTV cameras | Video surveillance camera locations |
| LUMS | Lane Use Management Signs |
| ESLS | Electronic Speed Limit Signs |
| Ramps | Freeway on/off ramp definitions |
| Ice Stations | Road ice detection stations |
| Clearways | Timed clearway zone definitions |
| Declared roads | Client-managed declared road network |
| Height clearance | Bridge and overpass height restrictions |
| Parking restrictions | Parking zone definitions |
| Turn restrictions | Intersection turn restrictions |
INFO
Live data for these sources is handled by the real-time ingestion pipeline (Data Ingestion, Data Stream Ingestion), not Batch Ingestion. Batch Ingestion only manages the static reference and geospatial attributes.
LUMS Mapping
Batch Ingestion includes a specialised mapping pipeline for LUMS (Lane Use Management Signs) that ensures accurate assignment to road network segments without overlaps.
The mapping process:
- Retrieve device-to-link associations — Get all link associations for LUMS devices from the external gateway
- Identify best-fitting road segments — Find the best-fitting path of road segments for each link's coordinates
- Detect and remove overlaps — Identify and eliminate overlapping segment assignments to prevent conflicts at road boundaries
- First pass: nearest segment — Map each LUMS to its nearest road segment, ensuring every device has at least one segment assignment
- Second pass: remaining segments — Map any unmapped segments to their nearest LUMS device
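The two passes above can be sketched as follows. This is one possible interpretation, not the service's actual implementation: it assumes each LUMS device and each road segment is reduced to a single representative point, uses straight-line distance, and resolves the first pass greedily from the closest pair outward. The function and variable names are hypothetical.

```python
from math import dist

Point = tuple[float, float]


def map_lums_to_segments(
    lums: dict[str, Point], segments: dict[str, Point]
) -> dict[str, list[str]]:
    """Assign each segment to at most one LUMS device using two passes."""
    mapping: dict[str, list[str]] = {}
    claimed: set[str] = set()

    # First pass: walk all (device, segment) pairs from closest to farthest and
    # give each device the nearest segment that no other device has claimed.
    pairs = sorted(
        (dist(lums_point, seg_point), lums_id, seg_id)
        for lums_id, lums_point in lums.items()
        for seg_id, seg_point in segments.items()
    )
    for _, lums_id, seg_id in pairs:
        if lums_id not in mapping and seg_id not in claimed:
            mapping[lums_id] = [seg_id]
            claimed.add(seg_id)

    # Second pass: attach every still-unclaimed segment to its nearest mapped device.
    for seg_id, seg_point in segments.items():
        if seg_id not in claimed and mapping:
            nearest = min(mapping, key=lambda lums_id: dist(lums[lums_id], seg_point))
            mapping[nearest].append(seg_id)
            claimed.add(seg_id)
    return mapping


lums = {"L1": (0.0, 0.0), "L2": (10.0, 0.0)}
segments = {"S1": (1.0, 0.0), "S2": (4.0, 0.0), "S3": (9.0, 0.0)}
result = map_lums_to_segments(lums, segments)
print(result)
```

Because a segment is claimed only once, no two devices share a segment, and when there are more devices than segments the devices farthest from any segment are the ones left unmapped.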
Edge Cases
- No link assigned to LUMS — Device cannot be mapped without a link association
- Mapping failure — Link coordinates cannot be resolved to road segments
- More LUMS than segments — Only as many LUMS as there are segments receive a mapping; the closest devices are prioritised
Manual Trigger
For development, testing, or initial deployment, Batch Ingestion can be triggered manually through the platform's GraphQL API. Contact your platform administrator for API access details and the list of available data sources.
In production, Batch Ingestion runs automatically on its configured schedule.
Troubleshooting
Failed Ingestion Load
Symptoms: Duplicate data, such as multiple envObjId values for the same sObjSourceId, or CRON job logs that show correct data while the database state is wrong.
Common cause: A primary key/foreign key relationship was invalidated during the update step. This is most likely for the Traffic Signals dataset, because strongly connected event data can arrive while the delete step is running.
Resolution:
- Rerun the update manually through the platform's API interface
- If it fails again: wait and retry, or temporarily stop the Data Recorder service and rerun. Restart Data Recorder after completion.
WARNING
Stopping Data Recorder will create a gap in playback data for the duration it is unavailable.
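To confirm the duplicate symptom described in this section before and after a rerun, a check along the following lines may help. It is a sketch that assumes the affected rows have already been fetched as dictionaries carrying the `sObjSourceId` and `envObjId` fields; the table they come from and the sample values are not specified here and are purely illustrative.

```python
from collections import defaultdict


def find_duplicate_sources(rows):
    """Return {sObjSourceId: sorted envObjId values} for ids with more than one envObjId."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["sObjSourceId"]].add(row["envObjId"])
    return {source_id: sorted(ids) for source_id, ids in seen.items() if len(ids) > 1}


# Illustrative rows only; real values come from the reference database.
rows = [
    {"sObjSourceId": "TS-001", "envObjId": 10},
    {"sObjSourceId": "TS-001", "envObjId": 27},
    {"sObjSourceId": "TS-002", "envObjId": 11},
    {"sObjSourceId": "TS-002", "envObjId": 11},
]
duplicates = find_duplicate_sources(rows)
print(duplicates)
```

An empty result after the rerun suggests that each source identifier once again resolves to a single envObjId.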
Stalled Update (30+ minutes)
Action: Do not restart or terminate the container. Investigate long-running database processes and identify the stalled query. Only cancel the process if you confirm it cannot self-resolve.
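One way to look for the stalled query, assuming the reference datastore is PostgreSQL and you have access to its `pg_stat_activity` view, is sketched below. The helper name and the 30-minute default are hypothetical, and `cursor` stands for any DB-API cursor that accepts named `%(name)s` parameters (psycopg2 cursors do).

```python
# Lists non-idle sessions whose current query has run longer than the threshold.
LONG_RUNNING_SQL = """
SELECT pid, now() - query_start AS runtime, state, wait_event_type, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > make_interval(mins => %(minutes)s)
ORDER BY runtime DESC;
"""


def find_long_running(cursor, minutes: int = 30):
    """Run the query on the given DB-API cursor and return all matching rows."""
    cursor.execute(LONG_RUNNING_SQL, {"minutes": minutes})
    return cursor.fetchall()
```

If a query is confirmed to be stuck, PostgreSQL's `pg_cancel_backend(pid)` cancels that query without closing the session, which is generally a gentler first step than `pg_terminate_backend(pid)`.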
Monitoring Database Table Size
Batch Ingestion also monitors the size of its database tables. Results are visible on the Batch Ingestion monitoring dashboard.
Related Services
- Data Ingestion — Handles live/real-time data ingestion; Batch Ingestion handles the complementary static reference data
- Data Stream Ingestion — Streaming data counterpart for continuous feeds
- Data Loader — Loads initial reference data on stack spin-up; Batch Ingestion handles ongoing updates
- Data Recorder — Writes live data snapshots; can conflict with Batch Ingestion during FK-constrained updates
- Experiment Manager — Central coordination service providing the API for manual batch triggers
- Post Monitoring — Downstream consumer of reference data via Redshift ETL
- Data Exporter — Exports data that depends on correctly ingested reference datasets
