Where those designations appear in the book, and Manning was aware of a trademark claim, the designations have been printed accordingly. Download of Hadoop in Practice includes free access to a private web forum run by Manning. Hadoop in Practice collects 85 Hadoop examples and presents them in a problem/solution format. Each technique addresses a specific task you'll face.
|Language:||English, Portuguese, Arabic|
|Genre:||Business & Career|
|ePub File Size:||22.73 MB|
|PDF File Size:||11.74 MB|
|Distribution:||Free* [*Sign up for free]|
Hadoop in Practice, Second Edition provides tested, instantly useful techniques. This revised new edition covers changes and new features in the Hadoop core architecture, including MapReduce 2, a development that transitions Hadoop into a distributed computing kernel that can support any type of workload. The eBook is available from Manning in pdf, ePub, and liveBook formats. Understanding distributed systems and Hadoop, and how to apply Hadoop in practice, is what's needed.
Big data serialization formats 3. Technique 10 Working with SequenceFiles. Technique 13 Selecting the appropriate way to use Avro in MapReduce.
Technique 15 Using Avro records in MapReduce. Technique 17 Controlling how sorting works in MapReduce. Columnar storage 3. Understanding object models and storage formats. Parquet and the Hadoop ecosystem. Technique 20 Reading Parquet files via the command line.
Technique 21 Reading and writing Avro data in Parquet with Java. Technique 22 Parquet and MapReduce.
Technique 24 Pushdown predicates and projection with Parquet. Custom file formats 3. Input and output formats. Technique 25 Writing input and output formats for CSV.
The importance of output committing. Organizing and optimizing data in HDFS 4. Data organization 4. Directory and file layout.
Technique 26 Using MultipleOutputs to partition your data. Technique 27 Using a custom MapReduce partitioner. Technique 28 Using filecrush to compact data. Technique 29 Using Avro to store multiple small binary files.
Efficient storage with compression Technique 30 Picking the right compression codec for your data. Moving data into and out of Hadoop 5. Key elements of data movement. Moving data into Hadoop 5. Roll your own ingest. Technique 33 Using the CLI to load files.
Hadoop in Practice. Alex Holmes. Manning sample chapter.
Technique 37 Using DistCp to copy data within and between clusters. Technique 38 Using Java to load files. Continuous movement of log and binary files into HDFS. Technique 41 Scheduling regular ingress activities with Oozie. Technique 44 MapReduce with HBase as a data source.
Moving data out of Hadoop 5. Roll your own egress.
Technique 46 Using the CLI to extract files. Technique 50 Using DistCp to copy data out of Hadoop. Technique 51 Using Java to extract files. Applying MapReduce patterns to big data 6. Joining Technique 54 Picking the best join strategy for your data. Technique 55 Filters, projections, and pushdowns. Technique 56 Joining data where one dataset can fit into memory. Technique 57 Performing a semi-join on large datasets.
Technique 58 Joining on presorted and prepartitioned data. Technique 59 A basic repartition join. Technique 60 Optimizing the repartition join. Technique 61 Using Bloom filters to cut down on shuffled data.
Data skew in reduce-side joins. Technique 62 Joining large datasets with high join-key cardinality. Technique 63 Handling skews generated by the hash partitioner. Sorting 6. Secondary sort. Technique 64 Implementing a secondary sort.
Technique 65 Sorting keys across multiple reducers. Sampling Technique 66 Writing a reservoir-sampling InputFormat.
Utilizing data structures and algorithms at scale 7. Modeling data and solving problems with graphs 7.
Modeling graphs. Technique 67 Find the shortest distance between two users. Using Giraph to calculate PageRank over a web graph. Technique 69 Calculate PageRank over a web graph. HyperLogLog 7. A brief introduction to HyperLogLog.

Unexpected input caused the application to fail. Depending on the problem, you may find additional useful information in the logs, or in the standard out (stdout) or standard error (stderr) of the task process.
You can view all three outputs easily by selecting the All link under the Logs column, as shown in the figure. This is all fine and dandy, but what if you don't have access to the UI? How do you figure out the failed tasks and get at their output files? Clicking on the logs link will take you to this view. The syslog shows us the exception that's causing the job to fail. The all option gives you verbose output for all tasks.
A URL that can be used to retrieve all the outputs related to the task. You can also figure out the host that executed the task by examining the host in the URL. This output is informative: not only do you see the exception, but you also see the task name and the host on which the task was executed. The start of the standard error shows: Error running child java. It'll be easier to parse the output by saving the HTML to a file (by adding -o [filename] to the curl command), copying that file to your local host, and using a browser to view the file.
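The save-and-view workflow can be sketched as below. On a real cluster the URL would point at a TaskTracker's web port; here a local file:// URL and a made-up log page stand in so the commands can be tried anywhere.

```shell
# Fabricate a stand-in for the task log page (contents are illustrative).
printf '<html><pre>Error running child : java.lang.NumberFormatException</pre></html>\n' \
  > /tmp/tasklog.html

# Save the page to a file with -o, as described above.
curl -s -o /tmp/saved-tasklog.html "file:///tmp/tasklog.html"

# Search the saved page for the exception instead of eyeballing raw HTML.
grep -o 'java.lang.[A-Za-z]*Exception' /tmp/saved-tasklog.html
```

The same grep works on the real saved page once you substitute the TaskTracker URL.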
This may be the case if you're working in clusters that have firewalls blocking access to the UI ports from your laptop or desktop. What if you only have SSH access to the cluster?
One option is to run Lynx, a text-based web browser, from inside your cluster. If you don't have Lynx, you'll have to know how to access the task logs directly.
The logs for each task are contained in the Hadoop logs directory. Under this directory you'll find at least the following three files: stderr, containing standard error output; stdout, containing standard output; and syslog, containing the logs. You can use your favorite editor or simple tools like cat or less to view the contents of these files.
Summary: Often, when things start going wrong in your jobs, the task logs will contain details on the cause of the failure. This technique looked at how you could use the JobTracker and, alternatively, the Linux shell to access your logs. If the data in the logs suggests that the problem with your job is with the inputs (which can be manifested by a parsing exception), you need to figure out what kind of input is causing the problem.

Debugging unexpected inputs

In the previous section, you saw how to access failed task output files to help you figure out the root cause of the failure.
In the example, the outputs didn't contain any additional information, which means that you're dealing with some MapReduce code that wasn't written to handle error conditions. If it's possible to easily modify the MapReduce code that's failing, go ahead and skip to the section that covers strategies to update your code to better handle and report on broken inputs.
Roll these changes into your code, push your code to the cluster, and rerun the job. Your job outputs will now contain enough details for you to be able to update your code to better handle the unexpected inputs.
If this isn't an option, read on; we'll look at what to do to isolate the input data that's causing your code to misbehave. Some of the tweets aren't formed correctly (it could be a syntax problem, or an unexpected value that you're unaware of in your data dictionary), which leads to failure in your processing logic.
But your job has numerous input files and they're all large, so your challenge is to narrow down where the problem inputs exist. Problem: You want to identify the specific input split that's causing parsing issues. Solution: Use the keep.failed.task.files job property. Discussion: The first step in fixing the situation is to identify the bad input record(s). We'll focus on that step in this technique, because it will help you to fix your code; we'll cover future-proofing your code for debugging in a later section. The first thing you need to do is determine which file contains the bad input record, and even better, find a range within that file, if the file's large.
Unfortunately, Hadoop by default wipes out task-level details, including the input splits, after the tasks have completed. You'll need to disable this by setting the keep.failed.task.files property to true. You'll also have to rerun the job that failed, but this time you'll be able to extract additional metadata about the failing task. After rerunning the failed job, you'll once again need to use the hadoop job -history command discussed in the previous section to identify the host and job or task IDs.
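A configuration sketch of that rerun, assuming the old-API (Hadoop 1.x) property name keep.failed.task.files; the jar, driver class, and HDFS paths below are placeholders for your own job:

```shell
# Rerun the failing job with task files preserved so the input-split
# metadata survives for inspection (jar/class/paths are hypothetical).
hadoop jar my-job.jar com.example.MyJob \
  -D keep.failed.task.files=true \
  /input/tweets /output/tweets-parsed
```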
With this information in hand, you'll need to use the shell to log into the TaskTracker node that ran the failed task, and then navigate to the task directory, which contains information about the input splits for the task.
The figure shows how to do that. The trick here is that if you have multiple directories configured for mapred.local.dir, you'll need to look in each of them for the task directory. The task's split file contains information about the location of the input split file in HDFS, as well as an offset that's used to determine which of the input splits this task is working on.
Both the task and job split files are a mixture of text and binary content, so unfortunately you can't crack open your command-line editor to easily view their contents. Be warned that there's a good chance this won't work with other Hadoop versions.
You also can modify your code to catch an exception, which will allow you to set a breakpoint in your IDE and observe the input that's causing your exception. Alternatively, depending on the Hadoop distribution you're running, Hadoop comes with a tool called IsolationRunner, which can re-execute a specific task with its input split. Unfortunately, IsolationRunner is broken in some releases. Summary: We used this technique to identify the input splits for a task that's failing due to a problem with some input data.
Next we'll look at how you get at the JVM arguments you used to launch your task, which is useful when you suspect there's an issue related to the JVM environment.

Debugging JVM settings

This technique steps somewhat outside the realm of your user-space MapReduce debugging, but it's useful in situations where you suspect there's an issue with the startup JVM arguments for tasks. For example, sometimes the classpath ordering of JARs is significant, and issues with it can cause class-loading problems.
Also, if a job has dependencies on native libraries, the JVM arguments can be used to debug issues with java.library.path. For example, let's say you're trying to use a native Hadoop compression codec, but your MapReduce tasks are failing and the errors complain that the native compression libraries can't be loaded. In this case, review the JVM startup arguments to determine whether all of the required settings exist for native compression to work. Problem: You suspect that a task is failing due to missing arguments when it's launched, and you want to examine the JVM startup arguments.
Discussion: As the TaskTracker prepares to launch a map or reduce task, it also creates a shell script that's subsequently executed to run the task. The problem is that MapReduce by default removes these scripts after a job has completed. So during the execution of a long-running job or task you'll have access to these scripts, but if tasks and the job are short-lived (which they may well be if you're debugging an issue that causes the task to fail off the bat), you will once again need to set keep.failed.task.files to true.
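What inspecting a preserved launch script might look like can be sketched as follows. The script contents here are fabricated to resemble a Hadoop 1.x taskjvm.sh; real paths and JVM flags will differ per cluster, but the grep for native-library settings carries over.

```shell
# Fabricate a stand-in taskjvm.sh (contents are illustrative only).
cat > /tmp/taskjvm.sh <<'EOF'
exec /usr/java/default/bin/java -Xmx200m \
  -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64 \
  org.apache.hadoop.mapred.Child
EOF

# Pull out the native-library path to check the codec libraries are on it.
grep -o 'java.library.path=[^ ]*' /tmp/taskjvm.sh
```

If the java.library.path entry is missing or points at the wrong directory, that explains a "native compression libraries can't be loaded" failure.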
The figure shows all of the steps required to gain access to the task shell script. If you were investigating an issue related to native Hadoop compression, there's a good chance that viewing the taskjvm.sh script would reveal whether the native-library settings were present.

Chapter summary 9. Predictive analytics with Mahout 9. Using recommenders to make product suggestions. Technique 61 Item-based recommenders using movie ratings 9. Classification. Technique 62 Using Mahout to train and test a spam classifier 9.
Clustering with K-means. Technique 63 K-means with a synthetic 2D dataset 9. Chapter summary. Part 5 Taming the Elephant.
Technique 84 Force container JVMs to generate a heap dump. Technique 96 Refreshing metadata. About the book It's always a good time to upgrade your Hadoop skills!
Jobs are also slow because much of the MapReduce stack is being exercised. You feed MRUnit the inputs, which in turn are supplied to the mapper.
Technique 51 Reduce skew mitigation.