Data Warehousing in the Age of Big Data. Krish Krishnan.
This debate prompts the question: What is a data warehouse in the age of big data? How does the advent of Hadoop, Spark, Python, data virtualization, data ...
Big Data and its Impact on Data Warehousing
The “big data” movement has taken the information technology world by storm. Fueled by open source projects ...
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data also arrives in the form of emails, photos, videos, monitoring devices, PDFs, audio, and so on. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
Velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. How fast the data is generated and processed to meet demand determines the real potential in the data. The flow of data is massive and continuous.
During the map phase, input data is processed item by item and transformed into an intermediate data set. During the reduce phase, the intermediate data is converted into a summarized data set. Other high-level scripting and query languages make it easier to do computations in the MapReduce framework.
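The two phases can be illustrated with a minimal, single-process sketch. It is not Hadoop code: the function names `map_phase` and `reduce_phase` and the sample log lines are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(records):
    """Process the input item by item, emitting intermediate (key, value) pairs."""
    for record in records:
        source, _message = record.split(",", 1)
        yield source, 1


def reduce_phase(pairs):
    """Collapse the intermediate pairs into one summarized value per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical input: one "source,message" line per event.
log_lines = [
    "sensor,temperature reading",
    "mobile,app opened",
    "sensor,humidity reading",
]

summary = reduce_phase(map_phase(log_lines))
print(summary)
```

In a real cluster the map output would be shuffled across nodes by key before the reduce step; this sketch skips that step because everything runs in one process.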
Query layer
The active archive must be able to query the data easily and perform computations, aggregations, and other typical SQL operations.
Although all operations need to be done as MapReduce jobs in the big data platform, writing MapReduce jobs in languages such as the Java programming language makes the code less intuitive and less convenient. The query layer provides a way to specify the operations for analyzing, loading, and saving data from the distributed file system.
It orchestrates the processing of jobs in the big data platform.
Many commercial and open source variants such as Hive, Pig, and Jaql can be used to query data easily from the active archive that is stored in the big data platform.
Active archive design
A data warehouse is designed based on business requirements, and the active archive is an extension of the warehouse. The active archive must provide a way to query and extract data with semantics that are consistent with the data warehouse.
Metadata that is developed for the warehouse must be applied to the active archive so that users familiar with the warehouse can understand the active archive without too much difficulty. Carefully select an infrastructure, data layout, and system of record for the active archive.
Infrastructure
If an organization chooses to build an active archive as the first step in implementing a big data platform, it must analyze whether to build or buy the services that are needed. In many cases, an active archive is only one part of a larger data analytics strategy. Building a big data platform from scratch requires expertise and infrastructure to set up a scalable and extensible platform that can address current needs and scale to accommodate future requirements.
Although Hadoop-based systems can run on commodity hardware, a big data solution includes system management software, networking capability, and extra capacity for analytical processing. The active archive Hadoop infrastructure must be sized to accommodate the amount of data to be stored.
Consider the replication factor within Hadoop and the efficiency that can be achieved by compressing data in Hadoop. The replication factor represents the number of copies of each data slice. Depending on the number of data nodes, one or more management nodes might be needed. The number of racks that are needed for data nodes is also an important factor to consider during infrastructure design.
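As a rough illustration of how replication and compression interact when sizing the archive, the following sketch estimates raw disk capacity. The function names, the replication factor of 3, the 4:1 compression ratio, and the 20 TB per node are all assumed example values, not recommendations.

```python
import math


def raw_storage_tb(data_tb, replication_factor=3, compression_ratio=1.0):
    """Estimate raw disk needed: every compressed block is stored once per replica."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return data_tb * replication_factor / compression_ratio


def data_nodes_needed(raw_tb, usable_tb_per_node):
    """Round up to whole nodes for the required raw capacity."""
    return math.ceil(raw_tb / usable_tb_per_node)


# Assumed example: 100 TB to archive, 3 copies, 4:1 compression, 20 TB per node.
required_tb = raw_storage_tb(100, replication_factor=3, compression_ratio=4.0)
nodes = data_nodes_needed(required_tb, usable_tb_per_node=20)
print(required_tb, nodes)
```

The estimate deliberately ignores the extra capacity that analytical processing and temporary job output need, so treat the result as a lower bound.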
Data layout
One of the most important design decisions is the layout of the data in the big data platform. Because the active archive must store a huge volume of data, an appropriate structure to organize the data is important. This structure affects query performance and how computations are done against the data within the active archive.
The structure must be scalable so that data can be added incrementally from the data warehouse when the data is ready to be archived. As shown in Figure 5, the partition scheme that is used in the data warehouse can be used to arrange the folder structure within the distributed file system. Any distribution or clustering key that is used for the main fact tables can be used to organize subfolders.
For example, a telecommunications data warehouse stores call data records that are partitioned by month. If the daily volume of data is in the millions, the active archive might use month as the main folder, with the data for individual days stored in subfolders.
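The month and day folder scheme can be sketched as a small path-building helper. The root path `/archive/call_data_records` and the `month=`/`day=` folder names are hypothetical choices for illustration; any consistent naming would serve.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical archive root in the distributed file system.
ARCHIVE_ROOT = PurePosixPath("/archive/call_data_records")


def archive_folder(call_date, root=ARCHIVE_ROOT):
    """Return the month folder and day subfolder for a record's date."""
    month_folder = f"month={call_date:%Y-%m}"
    day_folder = f"day={call_date:%d}"
    return root / month_folder / day_folder


print(archive_folder(date(2013, 3, 7)))
```

Because the folder structure mirrors the warehouse partition scheme, a query that needs a single month only has to read that month's folder.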
Figure 5. Layout of archived data in the big data platform
System of record
As shown in Figure 6, data from the data warehouse must be moved into flat files in the big data platform for the active archive.
A system of record in the data warehouse refers to the combination of data elements to describe a business process across tables. A data warehouse system of record must be converted to a Hadoop system of record so that the active archive can serve a similar purpose for historical data. If the purpose of the active archive is merely to store historical data for analytical purposes and the restore function is not required, the system of record in Hadoop can be designed to suit the requirements and does not have to mirror the data warehouse system of record.
For example, data elements that are needed from various tables in the data warehouse can be combined into a Hadoop system of record for the active archive.
Figure 6. Layout of data elements in active archive
If the restoring of data is required, all data elements that are required in the data warehouse must be moved into the big data system. The system of record in Hadoop might span multiple files, and queries on the big data platform might require operations similar to SQL join statements. In this case, converting a system of record from the data warehouse to Hadoop during the archive operation or the reverse during a restore operation must be possible.
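One way to picture combining data elements from several warehouse tables into a single flat-file system of record is the sketch below. The table contents, column names, and helper functions are hypothetical, and the CSV text is built in memory rather than written to a distributed file system.

```python
import csv
import io

# Hypothetical warehouse tables: one dimension and one fact table.
customers = {101: {"name": "Acme Corp", "region": "EMEA"}}
calls = [
    {"call_id": 1, "customer_id": 101, "minutes": 12},
    {"call_id": 2, "customer_id": 101, "minutes": 3},
]


def to_flat_records(fact_rows, dimension):
    """Merge each fact row with its dimension attributes into one record."""
    for row in fact_rows:
        yield {**row, **dimension[row["customer_id"]]}


def write_flat_file(records, fieldnames):
    """Serialize the combined records as delimited text, as a flat file would hold them."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


flat_file = write_flat_file(
    to_flat_records(calls, customers),
    ["call_id", "customer_id", "minutes", "name", "region"],
)
print(flat_file)
```

A combined record like this avoids join-like operations at query time, but it cannot be restored into the original tables unless every required data element is carried along.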
Figure 7. Alternative layout of archived data in the big data platform
Tools and techniques
The big data platform provides many tools to help implement an active archive. Open source tools and commercial products support moving data in and out of the active archive.
Other types of methods try to replicate and collocate data in order to achieve node independence. However, in big data problems, these methods make the volume of big data balloon, which is unacceptable for already immense amounts of data. In this paper, we propose a method called Chabok that not only solves the data locality problem completely but solves network congestion problems as well.
In Chabok, a two-phased Map-Reduce method is used for data warehouse problems with big data. Chabok is used for star-schema data warehouses and can compute distributive measures.
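A distributive measure such as SUM can be computed on each partition independently and the partial results merged afterwards. The sketch below shows only that general property with hypothetical data; it is not the Chabok algorithm, and the function names are assumptions.

```python
from collections import Counter


def partial_sums(fact_rows):
    """First phase: each node sums its own fact rows per dimension key."""
    totals = Counter()
    for store, amount in fact_rows:
        totals[store] += amount
    return totals


def combine(partials):
    """Second phase: add the per-node partial sums together."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)


# Hypothetical fact rows (store, sales amount) held on two different nodes.
node_a = [("north", 10), ("south", 5)]
node_b = [("north", 7), ("east", 2)]

total_sales = combine([partial_sums(node_a), partial_sums(node_b)])
print(total_sales)
```

COUNT, MIN, and MAX can be combined in the same way, whereas a measure such as the median cannot be derived from per-node partial results alone.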
Chabok can also be applied to big dimensions, which are dimensions where data volume is greater than the volume of a node.
Related works
In this section, we investigate related works that try to solve Map-Reduce problems related to the data warehouse. This method uses data collocation and co-partitioning to support the join operator, and join execution is done on the Mappers.
CoHadoop [ 15 ] intentionally collocates data on the nodes. Using this method, related data are placed together, and a data structure called the Locator is added to HDFS, the Hadoop file system.
Using this method, Map-side join without data shuffling is possible. Queries in this method are only extracted from related CFiles, and it is not necessary to scan all files.
Osprey [ 17 ] fragments table data between nodes, and each fragmentation is allocated to a node. Queries are divided into sub-queries and executed simultaneously on each node.
GridBatch [ 18 ] is the same as CoHadoop, but colocation occurs at the file system layer. Arvand [ 19 ] is a method that integrates multi-dimensional data sources into a big data analytic structure such as Hadoop.
In [ 21 ], a method is proposed that transfers legacy data warehouses to Hive [ 22 ]. In [ 23 ], data from legacy data warehouses are transferred to Hive by a rule-based method. In [ 24 ], three physical data warehouse designs were investigated to analyse the impact of attribute distribution among column-families in HBase based on OLAP query performance.
The authors conclude that OLAP query performance in HBase can be improved by using a distinct set of attribute distributions among column-families. In [ 25 ], three types of transformation are covered. In the first method, dimensions and measures are directly transferred to NoSQL, with one table for each fact and dimension.
In the second method, one table is transferred. Facts and dimension information are merged in that table. The last method is similar to the second method but with one difference: it uses a column family instead of a simple attribute. In addition to the columnar format, Cheetah [ 26 ] uses compression methods. RCFile [ 27 ] uses horizontal and vertical partitioning. First, the data are partitioned horizontally, and each section is partitioned vertically.
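The horizontal-then-vertical partitioning idea can be sketched in a few lines. This only illustrates the layout principle with in-memory lists; it is not the RCFile on-disk format, and the function names and sample rows are assumptions.

```python
def partition_rows(rows, group_size):
    """Horizontal step: split the table into fixed-size row groups."""
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]


def to_columns(row_group):
    """Vertical step: store each row group column by column."""
    return [list(column) for column in zip(*row_group)]


# Hypothetical table with three columns.
rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)]

layout = [to_columns(group) for group in partition_rows(rows, group_size=2)]
print(layout)
```

Keeping all columns of a row inside the same row group means a row can be rebuilt without touching other groups, while a query that needs one column reads only that column's values within each group.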
CIF [ 28 ] is a binary columnar method that first divides data horizontally, creates a directory for each partition and then creates a subdirectory for each column. A metadata file keeps directory information. MRShare [ 29 ] divides a job into queries and creates the provision that the previous execution results can be used if it is necessary to re-execute a query.
ReStore [ 30 ] is a method that stores intermediate results for future calculations. Hadoop manages coordination among nodes.
Part 1 discusses Big Data, its technologies and use cases from early adopters.
When an e-commerce site detects an increase in favourable clicks from an experimental online advertisement, that insight can be taken to the bottom line immediately. Some methods try to accelerate query execution by putting some metadata in each node.
But a simple search is crude.