Mainframe Offload Engine (MOE) Platform

Realize Mainframe Data Value with Hadoop

  • IT organizations have struggled for years to combine mainframe-generated data from large, mission-critical transactional systems with data integration applications.
  • Inefficient COBOL programs continue to be stressed by volume growth – their performance remains unpredictable even when mainframe processing capacity is increased.
  • Extracting data from the mainframe into an open-systems environment raises numerous data structure problems.
  • Transforming mainframe data in conventional open source databases has traditionally been prohibitively cumbersome.

Let us introduce you to Hadoop. With it, organizations now have the capability to move mainframe data into a platform that eliminates structural formatting bottlenecks. The Hadoop ecosystem provides a set of tools that address the ingestion, conversion, serialization, and formatting of mainframe data with increased performance and economic value.

This blog post will outline the results of Data Storage Technologies’ (DST) engagement with a customer. The customer’s goal was to export and sanitize mainframe data for use in a business intelligence environment, thereby realizing the full potential of the transactional data at the core of its business. By utilizing Hadoop, DST achieved this goal.

Background:

Three years ago, a customer began migrating various workloads from the mainframe to a distributed systems environment. The primary drivers behind this migration were to lower costs and provide increased scalability and flexibility. Since the migration included core applications that had been in place for years, the risk of business disruption made the project extremely visible within the customer’s organization.

The customer invested in a ‘Data Warehouse’ – an application and infrastructure stack meant to ingest datasets from the mainframe using the Oracle suite of products. Its objective was to extract and transform the data into meaningful information and provide reports in a timely manner. In theory, this system would allow the customer to add datasets from various relational database sources and correlate them against the core business data extracted from the mainframe. The project had limited success – the Oracle solution elongated data import and query times.

Initial Challenges:

The customer’s existing system had the following characteristics:

  • Data structures and field values varied, leading to unusually long sort times.
  • With almost 100 TB of data, database management became unwieldy.
  • Reporting capabilities were severely limited.
  • SLAs for report generation went unmet due to lengthy mainframe import and EBCDIC-to-ASCII conversion times (a conversion sketch follows this list).
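To make that conversion step concrete, below is a minimal Python sketch of the kind of EBCDIC-to-ASCII decoding involved. The file name, record length, and code page (cp037) are assumptions for illustration, not details from the customer’s environment.

```python
# Hypothetical illustration of EBCDIC-to-ASCII conversion, not the customer's actual job.
# Assumes fixed-width records encoded in IBM code page 037 (a common EBCDIC code page).

RECORD_LENGTH = 80  # assumed record length


def ebcdic_records(path, record_length=RECORD_LENGTH):
    """Yield each fixed-width EBCDIC record decoded to a Python (ASCII-compatible) string."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(record_length)
            if len(chunk) < record_length:
                break  # ignore a trailing partial record
            yield chunk.decode("cp037")  # the real code page varies by mainframe


if __name__ == "__main__":
    for record in ebcdic_records("mainframe_extract.dat"):  # hypothetical file name
        print(record.rstrip())
```

At mainframe volumes, a sequential decode-and-load step like this is precisely the kind of bottleneck that stretched the customer’s reporting SLAs.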

Enter Data Storage Technologies (DST):

DST was called upon to introduce a new ETL processing and mainframe offload paradigm two years after the customer started its project. The company’s goals were straightforward:

  • Decrease processing time by 50% compared with the existing Oracle-based solution.
  • Increase the datastore’s ability to handle variable-sized columns.
  • Reduce ingest times by moving away from a “schema-on-write” architecture in favor of a “schema-on-read” architecture.
  • Increase existing scalability limits (capacity, processing, table size, etc.).
  • Decrease infrastructure maintenance and software licensing costs.
  • Provide an open interface layer for application development – enabling multiple groups to rapidly code applications in familiar languages.

DST implemented a pilot using Syncsort, Sqrrl, the R suite, and Cloudera’s distribution of Hadoop. The Hadoop pilot focused on improving mainframe offload and decreasing data serialization time by using a next-generation NoSQL datastore.

What is Hadoop?

Hadoop is an open-source software ecosystem, licensed under Apache and productized by Cloudera. It begins with these three main components:

  • Hadoop Distributed File System (HDFS) – an append-only datastore, distributed across tens, hundreds, thousands, or even tens of thousands of nodes (systems).
  • Map/Reduce – a distributed data processing job engine providing rapid querying of data across the HDFS nodes (a minimal streaming example follows this list).
  • Yet Another Resource Negotiator (YARN) – the resource manager for HDFS and Map/Reduce.
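To give a flavor of what a Map/Reduce job looks like in practice, here is a minimal word-count sketch written for Hadoop Streaming, which runs ordinary scripts as the map and reduce phases. This is a generic illustration under the assumption of plain-text input, not a job from this engagement.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word on stdin; Hadoop Streaming feeds
# input splits to this script and sorts the emitted keys before the reduce phase.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts for each word; input arrives grouped and sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Such a pair is typically submitted with the Hadoop Streaming JAR, e.g. `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs-in> -output <hdfs-out>`, with YARN scheduling the map and reduce tasks across the cluster.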

Typical elements comprising a Hadoop ecosystem of tools:

[Figure: typical tools in the Hadoop ecosystem]

The Hadoop ecosystem provides the necessary tools to perform the following functions:

  • Data ingest
  • Formatting and serialization of data
  • Querying of data via scripting and SQL
  • NoSQL and NewSQL database functionality
  • Cluster management

Why Hadoop? Why Cloudera Hadoop?

Hadoop, which grew out of Google’s MapReduce and GFS papers, has been providing search and analytics capabilities for Internet companies such as Yahoo for nearly a decade. Cloudera productized Hadoop in 2008 by offering packaged support, much as Red Hat productized Linux in the late 1990s.

The Hadoop framework provides data protection and data mobility in a flexible infrastructure built on commodity resources. By default, HDFS writes three copies of data across multiple nodes to withstand multiple failures, and Map/Reduce distributes jobs to multiple sets of nodes to enable the querying of large, disparate datasets.

Hadoop has shifted the processing paradigm – moving the processing power to the data rather than moving the data to a central processing unit. With minimal resources, Hadoop significantly reduces the time to interrogate large disparate datasets.

Importing the Data into Hadoop

The customer’s files followed a similar pattern: a fixed portion of each record with fixed-length columns, followed by a variable-length portion. Dramatic discrepancies in column length make it difficult for relational database imports to adapt without reprocessing existing data. In a Hadoop environment, however, adding new columns with variable data is straightforward, and table structure can be altered on the fly, much as with XML, eliminating traditional inefficiencies – as illustrated in the sketch below.
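A small, purely illustrative sketch of that flexibility: when each record is stored as self-describing key/value pairs (the field names below are invented), a new column simply appears in later records, older rows need no reprocessing, and readers apply the schema at query time.

```python
# Illustrative only -- invented field names, not the customer's data model.
import json

rows = [
    {"ACCT": "1001", "AMT": "42.10"},
    {"ACCT": "1002", "AMT": "13.37", "BRANCH": "NE"},  # new column appears; no ALTER, no reload
]

# Write self-describing rows (e.g., into a file destined for HDFS).
with open("records.jsonl", "w") as out:
    for row in rows:
        out.write(json.dumps(row) + "\n")

# Readers apply the schema at query time and tolerate missing columns.
with open("records.jsonl") as f:
    for line in f:
        row = json.loads(line)
        print(row["ACCT"], row.get("BRANCH", "NULL"))
```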

Overcoming Traditional Process and Data Challenges

DST had to bring together several tools, some within the Apache open source ecosystem and others that were best-of-breed products in the marketplace.

The Hadoop ecosystem created for the customer’s deployment:

[Figure: the Hadoop ecosystem created for the customer’s deployment]

The DST Hadoop Deployment

The characteristics of the mainframe data for the pilot were as follows:

  • Approximately 14 million records
  • Variable record lengths, with between three and eighteen columns per record
  • Over 50% of the columns contained null values
  • Columns were not ordered consistently

The data was parsed by splitting each record on semicolons and each field on an equals sign. Once optimized, the new columnar data was loaded into Sqrrl, a NoSQL datastore.
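Below is a simplified sketch of that parsing step. The interpretation of the delimiters, the field names, and the output handling are assumptions for illustration, and the actual Sqrrl bulk load is omitted.

```python
# Assumed layout: fields within a record are separated by semicolons,
# and each field is a "NAME=VALUE" pair split on the equals sign.

def parse_record(line):
    """Split one record into a {column: value} dict; empty values become None."""
    fields = {}
    for field in line.split(";"):
        field = field.strip()
        if not field or "=" not in field:
            continue
        name, _, value = field.partition("=")
        fields[name.strip()] = value.strip() or None
    return fields


# Invented example record: column order varies and some columns are null.
sample = "ACCT=1001;AMT=42.10;BRANCH=;DATE=2013-06-01"
print(parse_record(sample))
# {'ACCT': '1001', 'AMT': '42.10', 'BRANCH': None, 'DATE': '2013-06-01'}
```

The parsed dictionaries can then be serialized and bulk-loaded into the NoSQL datastore; the specifics of the Sqrrl ingest are outside the scope of this sketch.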

Conclusion 

The new process cut traditional ETL processing from hours to minutes. This is only a portion of the realized value: the normalized data is stored in tables much like a relational database, but without the rigid formatting constraints. This flexibility enabled the customer to quickly add structured and unstructured datasets to increase reporting accuracy, and it broadened reporting capabilities by adding new metrics.

The financial savings were also quite compelling. The solution realized immediate ROI at the infrastructure layer – decreasing storage and compute infrastructure by 82% and the associated maintenance by 77%. These results alone justified an expansion of the pilot project. Additionally, licensing costs decreased dramatically from $1.2 million to $50,000 and annual maintenance costs were reduced from $225,000 to $21,000.

 

By Debbie Westwood and Andrew Gauzza at Data Storage Technologies

Five Technologies Every Organization Should Consider in 2012

The world of data storage, business continuity, and disaster recovery has gone through tremendous changes in recent years. We have seen large-scale consolidation of smaller, thought-leading companies into larger organizations; an influx of technology such as SSD into mainstream applications; object-based storage taking hold for multi-petabyte application requirements; virtualization extending from network and server environments to storage; and backup and replication functionality converging.

DST is focused on shining a light on technology advancements that can increase productivity, decrease overall cost, and simplify the environment, because we believe data trends in the marketplace warrant a rethink of how we manage and store data.

The first such advancement is Storage Virtualization, which works in much the same way as server virtualization: there is an abstraction between the physical hardware (the array) and the processing of the data. Whether in a structured data environment (SAN) or an unstructured data environment (NAS), data management and ownership are abstracted from the hardware on which the data resides, creating a flexible, central point of management for the entire heterogeneous resource pool. The advantages of virtualization:

      • Create unparalleled data mobility, fluidly assigning data to storage resources.
      • Easily control and manage multiple physical resources from a single pane of glass.
      • Flexibly set policies across the entire ecosystem or volume by volume.
      • Commoditize infrastructure, as the intelligence is divorced from the hardware.
      • Replicate and virtualize in a platform-agnostic manner.

Replication, disaster recovery, and backup are another area of technology convergence. All of these functions can now be performed more efficiently by a single piece of software from a single pane of glass, with better recovery points and recovery times. Backup vendors are integrating with storage snapshot technology, storage vendors are able to create more snapshots, and business continuity/disaster recovery vendors are integrating file-level catalogs – all evidence of several markets collapsing as technology advances.

Another popular topic of conversation is SSD. SSD augmentation and cloud-based storage (i.e., “Amazon”-esque storage) are two areas of hardware evolution that will dominate the storage industry in the years ahead. The price gap between SAS/FC drives and SSDs is narrowing. SSD technology is improving in longevity and addressing the other issues that have kept it from broad acceptance. The performance benefits of SSD are game-changing in some environments – VDI in particular offers users a much different experience when it runs on SSD.

Object-based storage, the productization of the technologies that Amazon and Rackspace have been using for years, gives organizations an opportunity to rethink how they do business. Especially in the entertainment industry, the question becomes: what would you do if storage went on sale at 70% off or more? What would you keep online, available for customers or for internal use, if cost were not a factor?

DST works consultatively with customers to look through emerging technologies and create options that deliver value for their specific needs.


SAN, NAS & Unified Storage – Revolution or New Story with Same Old Technology

In the past couple of years, we have seen EMC introduce the VNX and at the same time purchase Isilon, Hitachi purchase BlueArc, Oracle make a push as the proud new owner of ZFS courtesy of Sun, and several new vendors, such as Nexsan, Nimble, and IceWEB, enter the unified storage market.

EMC certainly has the broadest addressable market, with the VNX for the mid-tier SAN/NAS customer and Isilon for scale-out, petabyte-scale imaging applications; however, the VNX is merely a Clariion with 25-year-old technology sitting behind a Celerra NAS gateway. EMC was able to craft a unified message merely by uniting the management of each in a single GUI. I am not saying that this is altogether bad, but the approach has proven to have scalability and performance issues in the past, and although the components are faster, the underlying bottlenecks have not changed. This is why Isilon will play a more strategic role for EMC in the future. Isilon is a scale-out NAS platform: adding nodes increases performance linearly and reduces the overall overhead of the system. Due to its architecture (in particular, the way in which Isilon handles Level 1 and Level 2 caching within the system), Isilon is more geared toward large-file NAS deployments and is not really a unified platform.

Hitachi BlueArc NAS with an AMS 2000 series behind it shares many of the same inefficiencies as the VNX and is not a true unified platform. It has, however, found its niche in its ability to address billions of small files efficiently.

Until recently, I had not had much deep technical experience with new-age file systems such as ZFS (Zettabyte File System) and CASL (Cache Accelerated Sequential Layout). Like NetApp’s WAFL, ZFS, which serves as the basis for Oracle’s unified platform, Nexsan’s E5000, and IceWEB, is a WIFS (write-in-free-space) file system. This makes these file systems extremely efficient, as blocks are not overwritten but written to free space, with the file system index updated to point to the new location. CASL, which runs at the heart of Nimble Storage, is a file system that always writes in stripes across the whole disk group. One could argue that this makes compression more efficient and increases performance for random workloads.
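For readers unfamiliar with the write-in-free-space idea, here is a toy sketch (not any vendor’s implementation) of what “blocks are not overwritten” means in practice: an overwrite is really an append to free space plus an index update.

```python
# Toy model of a write-in-free-space (WIFS) layout -- illustrative only.

class WifsVolume:
    def __init__(self):
        self.log = []      # append-only "free space": physical blocks are never rewritten
        self.index = {}    # logical block id -> position of its latest version in the log

    def write(self, logical_id, data):
        self.log.append(data)                       # write the new version to free space
        self.index[logical_id] = len(self.log) - 1  # repoint the index at the new location

    def read(self, logical_id):
        return self.log[self.index[logical_id]]


vol = WifsVolume()
vol.write("L0", b"version 1")
vol.write("L0", b"version 2")   # an "overwrite" is an append plus an index update
print(vol.read("L0"))           # b'version 2'; the old block remains until reclaimed
```

The cost of this approach is index churn and eventual space reclamation, which is the index-management bottleneck the next paragraph notes SSD helps relieve.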

For the past several years, the contention has centered on the file system’s limitations in effectively managing the transactional nature of the index. The addition of SSD into the ZFS and CASL file system architectures reduces this bottleneck even during high-I/O transactional workloads. Each of these “next generation” unified platforms can accommodate Fibre Channel, iSCSI, NFS, CIFS, WebDAV, and HTTP, as well as a host of other protocols. Furthermore, the flexibility of leveraging commodity-oriented hardware makes these architectures cost-effective.


Storage Virtualization – What is your definition?

Every storage vendor on earth is claiming that its product is “virtual storage.” But what does virtualized storage mean? There are many definitions, and I will try to bring some clarity to how the term is being used. For the sake of this blog, I will contrast the differences between four products in the marketplace today that are touted as virtual storage environments: Hitachi VSP (Virtual Storage Platform), EMC VPLEX, Falconstor NSS (Network Storage Services), and IBM SVC.

These deserve the label of virtualization tools because they create (or contain as part of their design) an abstraction layer between the host environment and the storage. I separate these products into two distinct groups: in-line appliances and storage platforms.

In-line Appliances

The EMC VPLEX, Falconstor NSS, and IBM SVC are all appliances that sit in the fabric between the hosts and the storage. They all virtualize external storage, and each has a set of storage vendors that are certified behind it.

The Falconstor NSS product has been out the longest of the three and is feature-rich, incorporating synchronous and asynchronous replication, snapshotting, and thin (dynamic) provisioning. As with an active/active asynchronous SAN controller, a volume is owned by a particular NSS appliance; if a host image, such as a VM, lives in two places, only one can write at any one time. NSS is implemented as a clustered pair, and a quorum shifts ownership from one NSS appliance to another in the cluster. Asynchronous replication to a disaster recovery site is limited to Ethernet only.

The EMC VPLEX addressed a different concern: the ability to write to the same volume from two different physical VPLEX engines, with VPLEX managing, through a separate 10G network and a quorum, the synchronization of the data that sits on the EMC or non-EMC disk behind it. EMC has a fairly robust list of supported storage platforms on its matrix, including the ability to virtualize other virtualization devices such as Hitachi’s VSP platform. This allows customers to achieve the ultimate in availability, data protection, and mobility within and between datacenters. The EMC VPLEX is, however, feature-deficient today, and EMC promises that this will change in the near future. All snapshotting and advanced provisioning are done at the platform being virtualized, taking away some of the benefits that should be inherent in virtualization.

Although I have not encountered IBM’s SVC often and have not worked extensively with the product, it offers the ability to virtualize in the same manner as Falconstor and appears to have a reasonably complete set of provisioning and optimization features.

In-Line Storage Controller Virtualization

The Hitachi VSP is a unique breed of product, expanding on the earlier USP-V technology. The VSP, or Virtual Storage Platform, is a storage subsystem and a virtualization engine in one: it provides all of the benefits of a Tier 0/1 storage platform, with the ability both to house internal disk and to virtualize external storage. This is a double-edged sword, as it provides the ability to provision, store data, tier internally or externally, and replicate within a single subsystem; however, the active/active unit is physically one chassis. Access, as with the Falconstor NSS and IBM SVC, is based on ownership: only one VSP can own a host-to-LUN relationship at any one time. VSPs can replicate synchronously or asynchronously over fabric.

Features and architecture greatly separate the storage virtualization platforms. When searching for a virtualization technology, first outline your organization’s requirements and put them in priority order. Maybe you absolutely need complete mobility of applications and operating systems across synchronous distances. Maybe that is not as important as extending consistent advanced provisioning and snapshot functionality to several platforms that lack it today.


Starting out in the Blog-o-sphere

Too often people mix fact, fiction and opinion into a technology blog that is meant to inform. This DST blog has been started to remove the spin and give an honest assessment of the direction of the storage industry.

We are at an exciting time, especially in storage. The slow but steady adoption of external SAN storage virtualization has taken root, as evidenced by EMC’s push with VPLEX, and it adds new possibilities as well as new complexities to the environment. Hitachi, IBM, and Falconstor all offer a virtualization layer product, yet each is very different in capabilities and design. Virtualization is increasingly being adopted in the NAS world as well, to simplify migration and put a policy engine in front of the data. We also see the lines blurring: storage tool sets are converging to augment backup, disaster recovery, and business continuity, and traditional backup products are becoming more integrated at the storage layer. Customers are rethinking the backup paradigm that has existed for years.

This is fast becoming the decade of scale-out, scale-up, scale-over... and the introduction of old technology in new ways is sweeping the scale-___ world. With the productization of object-based storage, petabytes are becoming what megabytes used to be, and scaling up to exabytes is the new rage. With this technology comes a new method of access: removing the bloated file system and having applications track objects through an API or WebDAV.

This blog hopes to shed some light into the world of storage and how customers can move tactically while making good strategic decisions for their enterprise.
