Discovery provides tools and utilities for fast data ingestion and schema management, bringing data into the Invariant platform for ad-hoc analysis and report development. Discovery can be used in combination with DB replication or DB change data capture (CDC) systems to push data to HDFS for streaming ingestion. It can also use periodic batch pulls when source systems do not support a streaming output channel.
Discovery automates much of the laborious work of data mapping and movement, freeing data analysts to focus on downstream ad-hoc discovery tasks, including shaping and transforming data to fit their business needs.
Discovery builds on the Invariant data platform, using source DB schema metadata and HCatalog to manage schemas. The discovery pipeline maps source data from Kafka topics and JMS queues and loads it into target locations in HDFS. Data from multiple source systems can be ingested and managed, with user-defined functions and rules applied, to build the warehouse for discovery.
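To make the mapping step concrete, here is a minimal sketch of how a source stream might be routed to a Hive-style target location in HDFS. All names here (`SourceMapping`, `target_path`, the example values) are illustrative assumptions, not the actual Discovery API.

```python
# Hypothetical sketch: map a source stream (Kafka topic or JMS queue)
# to a date-partitioned target directory in HDFS. Names and layout
# are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class SourceMapping:
    source: str        # Kafka topic or JMS queue name
    database: str      # logical source database
    table: str         # source table the stream mirrors
    target_root: str   # base HDFS location for the warehouse

    def target_path(self, partition_date: str) -> str:
        # One directory per table, partitioned by ingestion date,
        # following a typical Hive-style partition layout.
        return (f"{self.target_root}/{self.database}/{self.table}"
                f"/ingest_date={partition_date}")

mapping = SourceMapping(
    source="orders-cdc",
    database="sales",
    table="orders",
    target_root="/data/warehouse",
)
print(mapping.target_path("2021-06-01"))
# /data/warehouse/sales/orders/ingest_date=2021-06-01
```

A real pipeline would derive these mappings from the source DB schema metadata rather than hard-coding them, but the routing idea is the same.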
Discovery's inventory management components are designed to help you keep your data in sync. Utilities that generate the required configuration, workflows, and reports allow data engineers to stay on top of data pipeline management tasks.
Discovery provides much-needed automation for the data collection process, a prerequisite for, and often overlooked part of, data analysis and exploration. It does the heavy lifting of data management and delivery and can easily be integrated into business workflows.
Key benefits
Discovery is designed to ease the pain of sourcing data from diverse relational databases and file systems into the store. Production systems generate large quantities of data, which becomes difficult to extract without adding load to already busy servers. The traditional approach is to poll the source systems periodically, which is sub-optimal: nightly batch jobs take time and add unwanted latency.
Discovery's pipeline and inventory management services run on the edge node and keep track of source and target metadata and data mappings. The command-line utilities can generate pipeline mapping configuration as well as target DDLs for Hive. Once configured, the services run in the background and connect to Kafka or other queue-based data sources to collect streamed data. For sources that do not support streaming, Discovery can periodically pull the data and merge it into target stores. Discovery also supports Apache Oozie workflows, allowing it to participate in broader enterprise-level data pipelines.
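As an illustration of the DDL-generation step, the sketch below derives a Hive external-table DDL from source column metadata. The type map, function name, and example columns are assumptions for illustration; Discovery's actual utilities and type conversions may differ.

```python
# Hypothetical sketch: generate a Hive external-table DDL from source
# column metadata. TYPE_MAP and hive_ddl are illustrative, not the
# actual Discovery utilities.
TYPE_MAP = {"varchar": "STRING", "int": "INT", "bigint": "BIGINT",
            "decimal": "DECIMAL(18,2)", "datetime": "TIMESTAMP"}

def hive_ddl(db, table, columns, location):
    """columns is a list of (name, source_type) pairs."""
    cols = ",\n  ".join(f"`{name}` {TYPE_MAP.get(src_type, 'STRING')}"
                        for name, src_type in columns)
    return (f"CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{table} (\n"
            f"  {cols}\n)\n"
            f"PARTITIONED BY (`ingest_date` STRING)\n"
            f"STORED AS PARQUET\n"
            f"LOCATION '{location}';")

print(hive_ddl("sales", "orders",
               [("order_id", "bigint"), ("status", "varchar")],
               "/data/warehouse/sales/orders"))
```

Generating the DDL from the same source schema metadata used for the pipeline mapping keeps source and target definitions consistent without hand-editing.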
Copyright © 2021 Invariant LLC - All Rights Reserved.