Data Ingestion
Overview
The Data Ingestion service is the primary entry point for polling-based external data into the OR platform. It collects data from HTTPS, SFTP, and SQS sources (client internal feeds via the ESB, third-party APIs, and government open data endpoints) and forwards it to the Data Transformer for normalisation into OR-compliant formats.
Each external API endpoint has its own schema, authentication method, and update frequency. Rather than building bespoke integrations per source, Data Ingestion provides a configuration-driven ingestion framework that handles the polling lifecycle, credential management, and delivery to the transformation layer. This pattern keeps complexity low and latency minimal, while making it straightforward to add or remove data sources as operational requirements evolve.
Within the broader data pipeline, Data Ingestion sits at the earliest stage — upstream of Data Transformer, Data Fusion, and all downstream consumers. It operates alongside Data Stream Ingestion (which handles event-driven AMQP sources) and Data Redis Ingestion (which handles Redis-based computer vision feeds), collectively forming the platform's ingestion layer.
Architecture
- Port: `:7000`
- Language: Julia
- Scaling: Fully distributable (multi-tenant or single-tenant)
- Protocols: HTTPS (HTTP over TLS), SFTP, SQS
Key Components
- Configuration-driven polling — A per-environment YAML config defines each data source with its protocol, target URL, request type, response format, headers, API key references, and polling frequency. API keys are loaded as environment variables; certificates are mounted via volume mounts.
- Redis-managed frequency control — A Redis connection controls request scheduling across tenants. Each source has a Redis key whose TTL governs when the next poll is allowed. This enables multi-tenant scaling where multiple pods can coordinate without duplicate requests.
- Threaded ingestion — When a request is available (based on Redis keyspace expiry), it runs on a separate thread. Data received is forwarded via HTTP POST to the Data Transformer `/data` endpoint.
- HTTP health server — The first thread runs an HTTP server for Kubernetes `/ready` and `/live` health checks.
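The TTL-based frequency control described above can be illustrated with a small sketch. It is written in Python for readability (the service itself is Julia) and uses an in-memory stand-in rather than a real Redis connection. The key naming scheme, the class, and the function names are all hypothetical. The sketch also swaps in an atomic "set if absent, with expiry" claim (the equivalent of Redis `SET key value NX EX ttl`) instead of the keyspace-expiry trigger the service appears to use, because that is the simplest way to show how several pods can share one key without duplicating a request.

```python
import time


class FakeRedis:
    """In-memory stand-in for the Redis operations this sketch needs."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}  # key -> absolute expiry time in seconds

    def set_if_absent(self, key, ttl_seconds):
        """Mimic `SET key value NX EX ttl`: succeed only if the key is missing or expired."""
        now = self._clock()
        if self._expiry.get(key, 0.0) > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True


def poll_key(source_name):
    # Hypothetical key naming scheme; the real service may use a different one.
    return f"ingestion:{source_name}"


def try_claim_poll(store, source_name, frequency_seconds):
    """Return True if this worker may poll now; the key's TTL blocks other workers."""
    return store.set_if_absent(poll_key(source_name), frequency_seconds)


# Usage with a controllable clock so the example does not need to sleep.
fake_now = [0.0]
store = FakeRedis(clock=lambda: fake_now[0])

first = try_claim_poll(store, "ADDINSIGHT", 60)   # pod A claims the poll
second = try_claim_poll(store, "ADDINSIGHT", 60)  # pod B arrives inside the TTL window
fake_now[0] = 61.0                                # the key has now expired
third = try_claim_poll(store, "ADDINSIGHT", 60)   # the next poll is allowed
```

In this sketch only the first and third calls succeed, which is the behaviour that lets multiple pods coordinate on a single source.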
Data Flow
External APIs / SFTP / SQS
↓ (configurable polling frequency)
Data Ingestion [:7000]
↓ (HTTP POST with target + message)
Data Transformer [:5800]
Configuration
The ingestion config defines data sources as a YAML dictionary. Each source specifies its protocol, polling frequency, and one or more target endpoints:
```yaml
DATA_SOURCES:
  ADDINSIGHT:
    protocol: https
    ingestion_frequency: 60
    target_config:
      - target_name: ADDINSIGHT_LINKS
        url: https://data-exchange-api.vicroads.vic.gov.au/bluetooth_data/links?expand=latest_stats
        http_config:
          request: GET
          response_type: Vector{Dict}
          headers:
            Content-Type: application/json
            Ocp-Apim-Subscription-Key: '{API_KEY_ADDINSIGHT}'
```
Key configuration fields:
| Field | Description |
|---|---|
| `protocol` | Connection type: https, sftp, or sqs |
| `ingestion_frequency` | Polling interval in seconds |
| `target_name` | Identifier used by Data Transformer to route processing |
| `url` | Source endpoint |
| `http_config` | Request method, response type, and headers |
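As a rough illustration of how these fields might be consumed, the following Python sketch walks a config of this shape and builds one HTTP request per target. It is not the service's implementation (which is Julia). The config is mirrored as a plain dict so no YAML parser is needed, and the function names are hypothetical. It also assumes that a `{NAME}` placeholder in a header is replaced by the environment variable of the same name, which is an inference from the example config rather than documented behaviour.

```python
import os
import re
import urllib.request

# Mirrors the YAML example as a plain dict so the sketch stays stdlib-only.
CONFIG = {
    "DATA_SOURCES": {
        "ADDINSIGHT": {
            "protocol": "https",
            "ingestion_frequency": 60,
            "target_config": [
                {
                    "target_name": "ADDINSIGHT_LINKS",
                    "url": "https://data-exchange-api.vicroads.vic.gov.au/bluetooth_data/links?expand=latest_stats",
                    "http_config": {
                        "request": "GET",
                        "response_type": "Vector{Dict}",
                        "headers": {
                            "Content-Type": "application/json",
                            "Ocp-Apim-Subscription-Key": "{API_KEY_ADDINSIGHT}",
                        },
                    },
                }
            ],
        }
    }
}

# Assumed placeholder syntax: {UPPER_CASE_NAME} refers to an environment variable.
PLACEHOLDER = re.compile(r"\{([A-Z0-9_]+)\}")


def resolve_placeholders(value, env):
    """Replace each {NAME} with env[NAME]; raises KeyError if a variable is missing."""
    return PLACEHOLDER.sub(lambda match: env[match.group(1)], value)


def build_requests(config, env=os.environ):
    """Yield (target_name, frequency_seconds, Request) for every HTTPS target."""
    for source in config["DATA_SOURCES"].values():
        if source["protocol"] != "https":
            continue  # SFTP and SQS sources would need different clients
        for target in source["target_config"]:
            http = target["http_config"]
            headers = {
                name: resolve_placeholders(value, env)
                for name, value in http["headers"].items()
            }
            request = urllib.request.Request(
                target["url"], headers=headers, method=http["request"]
            )
            yield target["target_name"], source["ingestion_frequency"], request


# Usage: requests are only constructed here, never sent.
targets = list(build_requests(CONFIG, env={"API_KEY_ADDINSIGHT": "demo-key"}))
```

Failing fast with a `KeyError` when a referenced key is absent is a design choice of this sketch; it surfaces a misconfigured environment at start-up rather than as an authentication failure at poll time.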
Ingested Sources
The service currently polls a wide range of client and third-party sources:
| Source | Protocol | Frequency |
|---|---|---|
| AddInsight Links | HTTPS | 30s |
| STREAMS Vehicle Detectors | HTTPS | 5 min |
| RTDMS | HTTPS | 60s |
| SITREP (Road Closures) | HTTPS | 30s |
| LUMS | HTTPS | 10 min |
| VSLS | HTTPS | 10 min |
| VMS / VMS Composites | HTTPS | 10 min |
| Metro Train Positions | HTTPS | 30s |
| Metro Train Trip Updates | HTTPS | 30s |
| Metro Train Service Alerts | HTTPS | 30s |
| PTV Disruptions | HTTPS | 5 min |
| BOM Rainfall | SFTP | 5 min |
| RAI Jobs | HTTPS | 5 min |
| IRS EyeFi | HTTPS | 30s |
| ServiceNow IRS | HTTPS | 30 min |
| Tow Allocation | HTTPS | 60s |
| RWE | HTTPS | 5 min |
| ETS | HTTPS | 5 min |
| Ramp AHS / Metering / Operations | HTTPS | 5 min |
| Off Ramps | HTTPS | 5 min |
| ESLS | HTTPS | 2 min |
| SCATS Site Status / PFL | HTTPS | 1 hour |
| RID Impacts | SQS | Variable (long poll) |
| OneView | HTTPS | 5 min |
SQS Ingestion (RID)
RID publishes impact data onto an Amazon SQS queue (via SNS). Unlike HTTP sources that poll on a fixed interval, SQS ingestion uses a long polling approach: the service requests data and waits up to 20 seconds for a message, reducing both end-to-end latency and API call volume.
To keep polling continuous, the service intentionally deletes the Redis key that controls ingestion frequency for RID whenever a message is received, so a new ingestion cycle starts immediately rather than waiting for key expiry.
CloudWatch logging tracks which RID impacts have been successfully read from the queue, supporting debugging and audit workflows.
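The interaction between the long poll and the Redis frequency key can be sketched as follows. This is an illustrative Python outline, not the service's Julia code. The queue and key store are in-memory stand-ins, and the key name, class names, and `poll_once` function are hypothetical. With a real SQS client the receive step would be a call such as boto3's `receive_message` with `WaitTimeSeconds=20`.

```python
class FakeQueue:
    """Stand-in for an SQS client; a real one would block for up to wait_seconds."""

    def __init__(self, batches):
        self._batches = list(batches)

    def receive(self, wait_seconds):
        # Returns the next prepared batch, or an empty list once the queue is drained.
        return self._batches.pop(0) if self._batches else []


class FrequencyKeys:
    """Stand-in for the Redis keys whose presence delays the next poll."""

    def __init__(self):
        self.active = set()

    def set(self, key):
        self.active.add(key)

    def delete(self, key):
        self.active.discard(key)

    def exists(self, key):
        return key in self.active


RID_KEY = "ingestion:RID"  # hypothetical key name
WAIT_SECONDS = 20          # long-poll wait described in this section


def poll_once(queue, keys, forward):
    """Run one ingestion cycle; return True if the next cycle may start immediately."""
    keys.set(RID_KEY)  # in the real service this key would carry a TTL
    messages = queue.receive(WAIT_SECONDS)
    for message in messages:
        forward(message)
    if messages:
        keys.delete(RID_KEY)  # skip the wait because more messages may be queued
    return not keys.exists(RID_KEY)


# Usage: one cycle that finds messages, then one that finds none.
forwarded = []
queue = FakeQueue([["impact-1", "impact-2"], []])
keys = FrequencyKeys()

busy_cycle = poll_once(queue, keys, forwarded.append)
idle_cycle = poll_once(queue, keys, forwarded.append)
```

Under these assumptions a cycle that receives messages clears the key and allows an immediate follow-up poll, while an empty cycle leaves the key in place so the normal frequency control applies.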
Related Services
- Data Transformer — Downstream consumer that normalises ingested data
- Data Stream Ingestion — Sibling ingestion service for AMQP event streams
- Data Redis Ingestion — Sibling ingestion service for Redis-based sources
- Data Fusion — Fuses transformed data from multiple sources
- Experiment Manager — Central coordination service (GraphQL on `:5100`)
