HPCC (High-Performance Computing Cluster), also known as DAS (Data Analytics Supercomputer), is a Data Intensive Computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing Big Data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.
Many organizations have large amounts of data which has been collected and stored in massive datasets which needs be processed and analyzed to provide business intelligence, improve products and services for customers, or to meet other internal data processing requirements. For example, Internet companies need to process data collected by Web crawlers as well as logs, click data, and other information generated by Web services. Parallel relational database technology has not proven to be cost-effective or provide the high-performance needed to analyze massive amounts of data in a timely manner. As a result several organizations developed technology to utilize large clusters of commodity servers to provide high-performance computing capabilities for processing and analysis of massive datasets. Clusters can consist of hundreds or even thousands of commodity machines connected using high-bandwidth networks. Examples of this type of cluster technology include Google’s MapReduce, Apache Hadoop, Aster Data Systems, Sector/Sphere, and LexisNexis HPCC platform.
High Performance Computing
High-Performance Computing (HPC) is used to describe computing environments which utilize supercomputers and computer clusters to address complex computational requirements, support applications with significant processing time requirements, or require processing of significant amounts of data. Supercomputers have generally been associated with scientific research and compute-intensive types of problems, but more and more supercomputer technology is appropriate for both compute-intensive and data-intensive applications. A new trend in supercomputer design for high-performance computing is using clusters of independent processors connected in parallel. Many computing problems are suitable for parallelization, often problems can be divided in a manner so that each independent processing node can work on a portion of the problem in parallel by simply dividing the data to be processed, and then combining the final processing results for each portion. This type of parallelism is often referred to as data-parallelism, and data-parallel applications are a potential solution to petabyte scale data processing requirements. Data-parallelism can be defined as a computation applied independently to each data item of a set of data which allows the degree of parallelism to be scaled with the volume of data. The most important reason for developing data-parallel applications is the potential for scalable performance in high-performance computing, and may result in several orders of magnitude performance improvement.
Commodity Computing Clusters
Commodity Computing Cluster
The resulting economies of scale in using multiple independent processing nodes for supercomputer design to address high-performance computing requirements led directly to the implementation of commodity computing clusters. A computer cluster is a group of shared individual computers, linked by high-speed communications in a local area network topology using technology such as gigabit network switches or InfiniBand, and incorporating system software which provides an integrated parallel processing environment for applications with the capability to divide processing among the nodes in the cluster. Cluster configurations can not only improve the performance of applications which use a single computer, but provide higher availability and reliability, and are typically much more cost-effective than single supercomputer systems with equivalent performance. The key to the capability, performance, and throughput of a computing cluster is the system software and tools used to provide the parallel job execution environment. Programming languages with implicit parallel processing features and a high-degree of optimization are also needed to ensure high-performance results as well as high programmer productivity. Clusters allow the data used by an application to be partitioned among the available computing resources and processed independently to achieve performance and scalability based on the amount of data.
Commodity computing clusters are configured using commercial off-the-shelf (COTS) PC components. Rack-mounted servers or blade servers each with local memory and disk storage are often used as processing nodes to allow high-density small footprint configurations which facilitate the use of very high-speed communications equipment to connect the nodes (Figure 1). Linux is widely used as the operating system for computer clusters.
HPCC System Architecture
Thor Processing Cluster.
The HPCC system architecture includes two distinct cluster processing environments, each of which can be optimized independently for its parallel data processing purpose. The first of these platforms is called a Data Refinery whose overall purpose is the general processing of massive volumes of raw data of any type for any purpose but typically used for data cleansing and hygiene, ETL processing of the raw data, record linking and entity resolution, large-scale ad-hoc complex analytics, and creation of keyed data and indexes to support high-performance structured queries and data warehouse applications. The Data Refinery is also referred to as Thor, a reference to the mythical Norse god of thunder with the large hammer symbolic of crushing large amounts of raw data into useful information. A Thor cluster is similar in its function, execution environment, filesystem, and capabilities to the Google and Hadoop MapReduce platforms.
Figure 2 shows a representation of a physical Thor processing cluster which functions as a batch job execution engine for scalable data-intensive computing applications. In addition to the Thor master and slave nodes, additional auxiliary and common components are needed to implement a complete HPCC processing environment.
Roxie Processing Cluster.
The second of the parallel data processing platforms is called Roxie and functions as a rapid data delivery engine. This platform is designed as an online high-performance structured query and analysis platform or data warehouse delivering the parallel data access processing requirements of online applications through Web services interfaces supporting thousands of simultaneous queries and users with sub-second response times. Roxie utilizes a distributed indexed filesystem to provide parallel processing of queries using an optimized execution environment and filesystem for high-performance online processing. A Roxie cluster is similar in its function and capabilities to Hadoop with HBase and Hive capabilities added, and provides for near real time predictable query latencies. Both Thor and Roxie clusters utilize the ECL programming language for implementing applications, increasing continuity and programmer productivity.
Figure 3 shows a representation of a physical Roxie processing cluster which functions as a online query execution engine for high-performance query and data warehousing applications. A Roxie cluster includes multiple nodes with server and worker processes for processing queries; an additional auxiliary component called an ESP server which provides interfaces for external client access to the cluster; and additional common components which are shared with a Thor cluster in an HPCC environment. Although a Thor processing cluster can be implemented and used without a Roxie cluster, an HPCC environment which includes a Roxie cluster should also include a Thor cluster. The Thor cluster is used to build the distributed index files used by the Roxie cluster and to develop online queries which will be deployed with the index files to the Roxie cluster.
HPCC Software Architecture
The HPCC software architecture incorporates the Thor and Roxie clusters as well as common Middleware components, an external communications layer, client interfaces which provide both end-user services and system management tools, and auxiliary components to support monitoring and to facilitate loading and storing of filesystem data from external sources.
An HPCC environment can include only Thor clusters, or both Thor and Roxie clusters. The overall HPCC software architecture is shown in Figure 4.
Figure 4. HPCC Software Architecture