Data Ingestion
Overview
The Data Ingestion service is the primary entry point for polling-based external data into the OR platform. It collects data from HTTPS, SFTP, and SQS sources (client internal feeds via the ESB, third-party APIs, and government open data endpoints) and forwards it to the Data Transformer for normalisation into OR-compliant formats.
Each external API endpoint has its own schema, authentication method, and update frequency. Rather than building bespoke integrations per source, Data Ingestion provides a configuration-driven ingestion framework that handles the polling lifecycle, credential management, and delivery to the transformation layer. Because every source goes through the same framework, per-source complexity and latency stay low, and data sources can be added or removed as operational requirements evolve.
Within the broader data pipeline, Data Ingestion sits at the earliest stage — upstream of Data Transformer, Data Fusion, and all downstream consumers. It operates alongside Data Stream Ingestion (which handles event-driven message sources) and Data Redis Ingestion (which handles real-time computer vision feeds), collectively forming the platform's ingestion layer.
Architecture
The Data Ingestion service operates as a configuration-driven polling engine that supports multiple data source types and protocols. It scales horizontally to handle high-frequency data collection across multiple tenants while maintaining efficient resource utilization.
Key Capabilities
- Configuration-driven polling — Each data source is defined through configuration, specifying how to connect, authenticate, and poll the source. This approach enables rapid addition or removal of data sources without code changes.
- Coordinated scheduling — The service manages polling frequencies across distributed instances, ensuring data sources are queried at appropriate intervals without duplication or conflicts.
- Reliable forwarding — Collected data is immediately forwarded to transformation services for normalization and enrichment before entering the platform's data pipeline.
- Health monitoring — The service provides health check endpoints for deployment orchestration and monitoring systems.
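As an illustration of the coordinated scheduling capability, the sketch below shows one way several instances could avoid polling the same source twice in one interval: each instance tries to claim a per-source lease before polling. This is an assumption about a possible mechanism, not a description of the service's actual scheduler. `LeaseStore` and `try_claim` are hypothetical names, and the in-memory dictionary stands in for whatever shared store a real deployment would need.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class LeaseStore:
    """In-memory stand-in for a shared lease store (hypothetical).

    A real multi-instance deployment would need storage that every
    instance can see; a local dict only demonstrates the claim logic.
    """
    _next_due: dict = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock)

    def try_claim(self, source: str, interval_s: float, now: float) -> bool:
        """Return True if the caller may poll `source` at time `now`.

        The first caller at or after the due time wins and pushes the
        due time forward by one interval; later callers are refused.
        """
        with self._lock:
            if now < self._next_due.get(source, 0.0):
                return False
            self._next_due[source] = now + interval_s
            return True


store = LeaseStore()
first = store.try_claim("AddInsight Links", 30, now=100.0)   # claims the lease
second = store.try_claim("AddInsight Links", 30, now=110.0)  # refused until 130.0
```

Passing `now` explicitly keeps the claim logic deterministic and easy to test; a caller would normally supply a clock reading.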
Data Flow
External data sources are polled at configured intervals, and the collected data flows through a transformation layer that normalizes it into platform-standard formats before distribution to downstream consumers.
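The flow described above can be sketched as a single poll cycle: fetch from each source, then hand the payload to the transformation layer. This is a minimal sketch under stated assumptions. `run_cycle`, `Fetch`, and `Forward` are hypothetical names, and isolating failures per source is an illustrative design choice rather than documented behaviour.

```python
from typing import Callable, Dict, List

Fetch = Callable[[], bytes]             # pulls one payload from a source
Forward = Callable[[str, bytes], None]  # hands a payload to the transformer


def run_cycle(sources: Dict[str, Fetch], forward: Forward) -> List[str]:
    """Poll every source once and forward whatever was collected.

    Returns the names of sources whose fetch raised, so that one
    failing source does not stop the others from being forwarded.
    """
    failed: List[str] = []
    for name, fetch in sources.items():
        try:
            payload = fetch()
        except Exception:
            failed.append(name)
            continue
        forward(name, payload)
    return failed
```

In this sketch the transport (HTTPS, SFTP, and so on) is hidden behind each `Fetch` callable, which keeps the cycle itself protocol-agnostic.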
Configuration
The service uses a configuration-driven approach where each data source is defined with its connection details, authentication requirements, and polling frequency. This enables operational teams to manage data sources through configuration updates rather than code deployments, reducing time-to-integrate new data feeds and simplifying maintenance.
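To make the idea concrete, a source definition might look like the sketch below. The field names (`endpoint`, `interval_s`, `credential_ref`), the `parse_source` helper, and the example URL are all hypothetical; the service's real configuration schema is not shown in this document.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_PROTOCOLS = {"https", "sftp", "sqs"}


@dataclass(frozen=True)
class SourceConfig:
    name: str
    protocol: str                # one of ALLOWED_PROTOCOLS
    endpoint: str                # URL, host path, or queue identifier
    interval_s: Optional[int]    # None for long-polled sources
    credential_ref: str          # name of a secret, never the secret itself


def parse_source(raw: dict) -> SourceConfig:
    """Validate one raw config entry and return a typed SourceConfig."""
    protocol = str(raw["protocol"]).lower()
    if protocol not in ALLOWED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol}")
    interval = raw.get("interval_s")
    # Fixed-interval sources must say how often to poll; queue sources
    # are long-polled, so they carry no interval in this sketch.
    if protocol != "sqs" and (interval is None or interval <= 0):
        raise ValueError("fixed-interval sources need a positive interval_s")
    return SourceConfig(raw["name"], protocol, raw["endpoint"],
                        interval, raw["credential_ref"])


links = parse_source({
    "name": "AddInsight Links",
    "protocol": "HTTPS",
    "endpoint": "https://example.invalid/links",  # placeholder URL
    "interval_s": 30,
    "credential_ref": "addinsight-api-key",       # placeholder secret name
})
```

Validating entries when configuration is loaded means a malformed source fails early instead of at its first poll.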
Ingested Sources
The service currently polls a wide range of client and third-party sources:
| Source | Protocol | Frequency |
|---|---|---|
| AddInsight Links | HTTPS | 30s |
| STREAMS Vehicle Detectors | HTTPS | 5 min |
| RTDMS | HTTPS | 60s |
| SITREP (Road Closures) | HTTPS | 30s |
| LUMS | HTTPS | 10 min |
| VSLS | HTTPS | 10 min |
| VMS / VMS Composites | HTTPS | 10 min |
| Metro Train Positions | HTTPS | 30s |
| Metro Train Trip Updates | HTTPS | 30s |
| Metro Train Service Alerts | HTTPS | 30s |
| PTV Disruptions | HTTPS | 5 min |
| BOM Rainfall | SFTP | 5 min |
| RAI Jobs | HTTPS | 5 min |
| IRS EyeFi | HTTPS | 30s |
| ServiceNow IRS | HTTPS | 30 min |
| Tow Allocation | HTTPS | 60s |
| RWE | HTTPS | 5 min |
| ETS | HTTPS | 5 min |
| Ramp AHS / Metering / Operations | HTTPS | 5 min |
| Off Ramps | HTTPS | 5 min |
| ESLS | HTTPS | 2 min |
| SCATS Site Status / PFL | HTTPS | 1 hour |
| RID Impacts | SQS | Variable (long poll) |
| OneView | HTTPS | 5 min |
Message Queue Ingestion (RID)
RID publishes impact data through a message queue service. Unlike HTTP sources, which are polled on a fixed interval, the queue is read with long polling: each request stays open until new messages arrive or the wait times out, which reduces both end-to-end latency and API call volume.
Long polling enables near-real-time data delivery by starting a new ingestion cycle immediately when a message is received rather than waiting for the next scheduled poll interval.
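A long-poll cycle of this kind could be sketched as follows. The sketch assumes a client shaped like the boto3 SQS client (`receive_message` and `delete_message` with keyword arguments); whether the service actually uses boto3 is not stated here. `drain_once` and `handle` are hypothetical names. SQS caps `WaitTimeSeconds` at 20 and `MaxNumberOfMessages` at 10.

```python
from typing import Callable


def drain_once(client, queue_url: str, handle: Callable[[str], None],
               wait_s: int = 20) -> int:
    """Run one long-poll cycle and return the number of messages handled.

    The receive call blocks for up to `wait_s` seconds and returns as
    soon as messages are available. An empty response has no
    "Messages" key, so the default list covers the timeout case.
    """
    response = client.receive_message(
        QueueUrl=queue_url,
        WaitTimeSeconds=wait_s,
        MaxNumberOfMessages=10,
    )
    messages = response.get("Messages", [])
    for message in messages:
        handle(message["Body"])
        # Delete only after handling succeeds, so a failure leaves the
        # message on the queue to be redelivered later.
        client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
    return len(messages)
```

Calling `drain_once` in a loop gives the behaviour described above: the next cycle starts as soon as the previous one returns, with no fixed sleep between polls. Deleting after handling means a message can be processed more than once, so `handle` should tolerate duplicates.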
Platform logging tracks which RID impacts have been successfully processed, supporting debugging and audit workflows.
Related Services
- Data Transformer — Downstream consumer that normalises ingested data
- Data Stream Ingestion — Sibling ingestion service for event-driven message streams
- Data Redis Ingestion — Sibling ingestion service for real-time data sources
- Data Fusion — Fuses transformed data from multiple sources
- Experiment Manager — Central orchestration service
