Friday, 15 June 2012

hadoop - Running a Map-Reduce job on specific files/blocks in HDFS


First of all, I'm new to Hadoop :)

I have a large data set: TBs of documents in gzip files, each around 100-500 MB in size.

Essentially, I need some way of filtering the input to my map-reduce jobs.

I want to analyze these files in various ways. Many of the jobs only need to analyze files of a certain format (of a certain length, containing certain words, etc. - a collection of arbitrary (inverted) indexes), and it takes unreasonably long to process the entire dataset for each job. So I want to create indexes that point to specific blocks/files in HDFS.

I can generate the required indexes manually, but how do I specify the (potentially thousands of) specific files/blocks I want to process as input to the mappers? Can I do this without reading the source data into, say, HBase? Do I even want to? Or am I tackling this problem completely wrong?

Assuming you have some way of knowing which x files to process in your large corpus, you can use the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPathFilter(Job, Class&lt;? extends PathFilter&gt;) method when configuring your job.

You will need to provide a class that implements PathFilter. Hadoop will create a new instance of this class, and each file in the corpus will be presented to it via the boolean accept(Path path) method. You can then use this to filter the files down to the ones the map tasks will actually process (whether based on file name, size, last-modified timestamp, etc.).
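For illustration, here is a minimal sketch of such a filter. The property name "targeted.filenames" and the file names used below are made up for this example; in a real job they would come from whatever reference data you maintain:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Sketch: accept only files whose names appear in a comma-separated
    // allow-list passed through the job configuration.
    public class TargetFileFilter implements PathFilter, Configurable {

        private Configuration conf;
        private Set<String> targets;

        // FileInputFormat instantiates the filter via ReflectionUtils,
        // which calls setConf because this class implements Configurable.
        public void setConf(Configuration conf) {
            this.conf = conf;
            targets = new HashSet<String>(Arrays.asList(
                conf.getStrings("targeted.filenames", new String[0])));
        }

        public Configuration getConf() {
            return conf;
        }

        // Called once per path found under the job's input paths;
        // returning false excludes that path from the input.
        public boolean accept(Path path) {
            try {
                // Accept directories so the listing can descend into them
                // (costs one extra namenode call per path in this sketch).
                FileSystem fs = path.getFileSystem(conf);
                if (fs.getFileStatus(path).isDir()) {
                    return true;
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            return targets.contains(path.getName());
        }
    }

You would then register it while configuring the job:

    job.getConfiguration().set("targeted.filenames", "doc-00001.gz,doc-00042.gz");
    FileInputFormat.setInputPathFilter(job, TargetFileFilter.class);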

To target specific blocks, you will need to implement your own extension of FileInputFormat, specifically overriding the getSplits method. This method uses the listStatus method to determine which input files to process (and this is where the previously mentioned PathFilter is applied), after which it determines how to break those files up into splits (if the files are splittable). So in this getSplits method, you will again need to use your reference data to target the specific splits you want.
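A sketch of such an extension might look like the following, where isTargetedSplit is a placeholder for a lookup against your own reference data:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class TargetedInputFormat extends TextInputFormat {

        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> targeted = new ArrayList<InputSplit>();
            // super.getSplits() calls listStatus() (where the PathFilter
            // is applied) and then carves the surviving files into splits.
            for (InputSplit split : super.getSplits(job)) {
                FileSplit fileSplit = (FileSplit) split;
                if (isTargetedSplit(fileSplit)) {
                    targeted.add(fileSplit);
                }
            }
            return targeted;
        }

        // Placeholder: consult your index (key/value store, database,
        // etc.) using split.getPath(), split.getStart() and
        // split.getLength() to decide whether this byte range is wanted.
        private boolean isTargetedSplit(FileSplit split) {
            return true; // accept everything in this sketch
        }
    }

One caveat worth noting: gzip is not a splittable compression format, so each of your .gz files will come through as a single split anyway. In that case, filtering whole files with the PathFilter is usually all you need.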

As for storing/retrieving this target file and split information, you have several options for the persistence store, such as a key/value store (HBase, as you mentioned in your question), a separate database (MySQL, etc.), an inverted index (Lucene), and so on.
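As a minimal sketch of the simplest of these options - assuming the index is just a plain text file in HDFS with one target file name per line (the path /index/targets.txt is made up for this example) - a loader could look like this; a key/value store or database lookup would slot in the same way:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TargetIndex {

        // Reads the set of target file names from a side file in HDFS.
        public static Set<String> load(Configuration conf) throws IOException {
            Path indexPath = new Path("/index/targets.txt"); // hypothetical location
            FileSystem fs = indexPath.getFileSystem(conf);
            Set<String> targets = new HashSet<String>();
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(indexPath)));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (!line.isEmpty()) {
                        targets.add(line);
                    }
                }
            } finally {
                reader.close();
            }
            return targets;
        }
    }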
