Providing several non-textual files to a single map in Hadoop MapReduce
I'm writing a distributed application that parses PDF files using Hadoop MapReduce. The input to the MapReduce job is thousands of PDF files (ranging from 100 KB to ~2 MB), and the output is a set of parsed text files.
For testing purposes, I used the WholeFileInputFormat provided in Tom White's Hadoop: The Definitive Guide, which feeds a single file to a single map. This worked fine for a small number of input files; however, it does not work well for thousands of files, for obvious reasons: a map task that takes around a second to complete is inefficient.
So, I want to submit several PDF files to one map (for example, combining several files into a single chunk of around the HDFS block size, ~64 MB). I found out that CombineFileInputFormat is useful for this case, but I cannot come up with an idea of how to extend that abstract class so that I can process each file and its filename as a single key-value record.
Any help is appreciated. Thanks!
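For context, a subclass along the lines the question asks about might look like the sketch below, using the `org.apache.hadoop.mapreduce` API. The class names `CombinedPdfInputFormat` and `WholeFileRecordReader` are my own; `CombineFileRecordReader` requires the delegate reader to have exactly the `(CombineFileSplit, TaskAttemptContext, Integer)` constructor shown.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Packs many small PDFs into each split; each record is (filename, file bytes).
public class CombinedPdfInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a single PDF across records
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader delegates to one WholeFileRecordReader per file in the split
        return new CombineFileRecordReader<>(
                (CombineFileSplit) split, context, WholeFileRecordReader.class);
    }

    // Reads one whole file from the combined split as a single key-value pair.
    public static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private final Path path;
        private final long length;
        private final Configuration conf;
        private Text key;
        private BytesWritable value;
        private boolean processed = false;

        // This exact constructor signature is required by CombineFileRecordReader.
        public WholeFileRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                     Integer index) {
            this.path = split.getPath(index);
            this.length = split.getLength(index);
            this.conf = context.getConfiguration();
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) { }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Slurp the entire file into memory; fine for PDFs in the 100 KB - 2 MB range.
            byte[] contents = new byte[(int) length];
            FileSystem fs = path.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            key = new Text(path.getName());
            value = new BytesWritable(contents);
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```

The split size can then be capped near the block size via the job configuration (e.g. `mapreduce.input.fileinputformat.split.maxsize`), so each map receives roughly one block's worth of PDFs.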
I think SequenceFile will suit your needs here: http://wiki.apache.org/hadoop/SequenceFile
Essentially, you put all your PDFs into a sequence file, and the mappers receive as many PDFs as fit into one HDFS block of the sequence file. When you create the sequence file, you set the key to the PDF filename and the value to the binary representation of the PDF.