Mainframe Offload Engine (MOE) Platform:
Realize Mainframe Data Value with Hadoop
- IT organizations have struggled for many years to integrate mainframe-generated data from large, mission-critical transactional systems with their data integration applications.
- Inefficient COBOL programs continue to be stressed by volume growth; their performance remains unpredictable even when mainframe processing capacity is increased.
- Extracting data from the mainframe into an open-systems environment introduces numerous data structure problems.
- Transforming mainframe data in conventional open-source databases has traditionally been prohibitively cumbersome.
Let us introduce you to Hadoop – organizations now have the capability to move mainframe data into a platform that eliminates structural formatting bottlenecks. The Hadoop ecosystem provides a set of tools that address ingestion, conversion, serialization, and formatting of mainframe data with increased performance and economic value.
This blog post will outline the results of Data Storage Technologies’ engagement with a customer. The customer’s goal was to export and sanitize mainframe data for use in a business intelligence environment, thereby realizing the full potential of the transactional data at the core of the customer’s business. By utilizing Hadoop, DST achieved this goal.
Three years ago, a customer began migrating various workloads from the mainframe to a distributed systems environment. The primary drivers behind this migration were to lower cost and provide increased scalability and flexibility. Since this migration included core applications that had been in place for years, the risk of business disruption caused the project to be extremely visible within the customer’s organization.
The customer invested in a ‘Data Warehouse’ – an application and infrastructure stack meant to ingest datasets from the mainframe using the Oracle suite of products. Its objective was to extract and transform the data into meaningful information, and provide reports in a timely manner. In theory this system would allow the customer to add datasets from various relational database sources and then correlate them against the core business data extracted from the mainframe. The project had limited success – the Oracle solution elongated data import and query times.
The customer’s existing system had the following characteristics:
- Data structure and field values varied, with unusually long sort times.
- With almost 100TB of data, database management became unwieldy.
- Severely limited reporting capabilities.
- Unmet SLAs for report generation, due to lengthy mainframe import and EBCDIC-to-ASCII conversion times.
Enter Data Storage Technologies (DST):
DST was called upon to introduce a new ETL processing and mainframe offload paradigm two years after the customer started its project. The company’s goals were straightforward:
- Decrease processing time by 50% from the existing Oracle-based solution.
- Increase the datastore’s ability to handle variable-sized columns.
- Reduce ingest times by moving away from a “schema-on-write” architecture in favor of a “schema-on-read” approach.
- Increase existing scalability limit (capacity, processing, table size, etc.).
- Decrease infrastructure maintenance and software licensing costs.
- Provide an open interface layer for application development – enabling multiple groups to rapidly code applications in familiar languages.
DST implemented a pilot using Syncsort, Sqrrl, the R Suite, and Cloudera’s distribution of Hadoop. The Hadoop pilot focused on improving mainframe offload and decreasing data serialization time by using a next-generation NoSQL datastore.
What is Hadoop?
Hadoop is an open-source software ecosystem, developed under the Apache Software Foundation and productized by Cloudera. It begins with these three main components:
- Hadoop Distributed File System (HDFS) – an append-only datastore, distributed across tens, hundreds, thousands, or even tens of thousands of nodes (systems).
- Map/Reduce – a distributed data processing job engine providing rapid querying of data across the HDFS nodes (a minimal sketch of a Map/Reduce job follows this list).
- Yet Another Resource Negotiator (YARN) – the resource manager for HDFS and Map/Reduce.
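To make the Map/Reduce model concrete, here is a minimal word-count job in Java run against files stored in HDFS. This is an illustrative sketch rather than anything from the customer’s pilot; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: runs on the nodes holding the data blocks and emits (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts for each word across all mappers.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each mapper runs close to the data blocks it reads, and the reducers aggregate the intermediate (word, count) pairs into the final result.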
Typical elements comprising a Hadoop ecosystem of tools:
The Hadoop ecosystem provides the tools needed to perform the following functions:
- Data ingest
- Formatting and serialization of data
- Data querying via scripting and SQL
- NoSQL and NewSQL database functionality
- Cluster management
Why Hadoop? Why Cloudera Hadoop?
Hadoop has been providing search and analytics capabilities for Internet companies such as Yahoo for the past decade, building on the MapReduce and distributed file system concepts first published by Google. Cloudera productized Hadoop in 2008 by offering packaged support, similar to how Red Hat productized Linux in the late 1990s.
The Hadoop framework provides data protection and data mobility in a flexible infrastructure built on commodity resources. By default, HDFS writes three copies of each data block to different nodes so the cluster can withstand multiple disk or node failures, and Map/Reduce distributes jobs across sets of nodes to enable the querying of large, disparate datasets.
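As an illustration of the replication behavior described above, the following Java sketch writes a small file to HDFS and reports its replication factor. The path and record contents are hypothetical, and a replication factor of three is simply the common default rather than anything specific to this deployment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask HDFS to keep three replicas of each block (the usual default).
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        Path out = new Path("/data/mainframe/sample.txt"); // hypothetical path
        try (FSDataOutputStream stream = fs.create(out)) {
            stream.writeBytes("ACCT=000123;TYPE=SAVINGS;BAL=1042.17\n"); // hypothetical record
        }

        // The replication factor can also be adjusted per file after the fact.
        fs.setReplication(out, (short) 3);
        System.out.println("Replication: " + fs.getFileStatus(out).getReplication());
    }
}
```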
Hadoop has shifted the processing paradigm – moving the processing power to the data rather than moving the data to a centralized processing tier. With minimal resources, Hadoop significantly reduces the time needed to interrogate large, disparate datasets.
Importing the Data into Hadoop
The customer’s files followed a similar pattern: a fixed portion of each record with fixed-length columns, followed by a variable-length portion. Dramatic discrepancies in column length make it difficult for relational database imports to adapt without reprocessing existing data. In a Hadoop environment, however, adding new columns with variable data is straightforward, and table structure can be altered on the fly, much as with XML, eliminating traditional inefficiencies.
Overcoming Traditional Process and Data Challenges
DST had to bring together several tools: some from within the Apache open-source ecosystem, and others from best-of-breed vendors in the marketplace.
The Hadoop ecosystem created for the customer’s deployment:
The DST Hadoop Deployment
The characteristics of the mainframe data for the pilot were as follows:
- Approximately 14 million records
- Variable-length records containing between three and eighteen columns
- Over 50% of columns had null values
- Columns were not ordered consistently
The data was parsed by splitting each record on semicolons and each field on an equals sign (a sketch of this parsing appears below). Once optimized, the new columnar data was loaded into a Sqrrl NoSQL datastore.
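The following Java sketch shows one way such records could be parsed, reading the description above as records made up of semicolon-separated NAME=VALUE fields. The column names are hypothetical and the code is illustrative only; it is not the pilot’s actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RecordParser {
    /**
     * Parses one exported record of the form "NAME=value;NAME=value;..." into a
     * column map. Records may carry anywhere from three to eighteen columns, in
     * any order, with missing values simply absent, so no fixed schema is assumed.
     */
    public static Map<String, String> parse(String record) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String field : record.split(";")) {
            if (field.isEmpty()) continue;
            int eq = field.indexOf('=');
            if (eq < 0) continue; // skip malformed fields rather than failing the row
            String name = field.substring(0, eq).trim();
            String value = field.substring(eq + 1).trim();
            if (!value.isEmpty()) {
                columns.put(name, value);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // Hypothetical record; the real column names are not given in this post.
        String record = "ACCT=000123;TYPE=SAVINGS;BAL=1042.17";
        System.out.println(parse(record)); // {ACCT=000123, TYPE=SAVINGS, BAL=1042.17}
    }
}
```

Because each parsed record is simply a set of key/value columns, records with new, reordered, or missing columns land in the datastore without any table alteration.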
The new process cut traditional ETL processing time from hours to minutes. This is only part of the potential value realized: the normalized data is stored in tables much as it would be in a relational database, but without the rigid formatting constraints. This flexibility enabled the customer to quickly add structured and unstructured datasets to increase reporting accuracy. Further, it broadened reporting capabilities by adding new metrics.
The financial savings were also quite compelling. The solution realized immediate ROI at the infrastructure layer – decreasing storage and compute infrastructure by 82% and the associated maintenance by 77%. These results alone justified an expansion of the pilot project. Additionally, licensing costs decreased dramatically from $1.2 million to $50,000 and annual maintenance costs were reduced from $225,000 to $21,000.
By Debbie Westwood and Andrew Gauzza at Data Storage Technologies