Providing several non-textual files to a single map in Hadoop MapReduce
I'm writing a distributed application that parses PDF files using Hadoop MapReduce. The input to the MapReduce job is thousands of PDF files (ranging from 100 KB to ~2 MB), and the output is a set of parsed text files.
For testing purposes, I used the WholeFileInputFormat provided in Tom White's Hadoop: The Definitive Guide, which feeds a single file to a single map. This worked fine for a small number of input files; however, it does not work well for thousands of files, for obvious reasons: a map task that takes around a second to complete is inefficient.
So, I want to submit several PDF files to one map (for example, combining several files into a single chunk of around the HDFS block size, ~64 MB). I found out that CombineFileInputFormat is useful for this case, but I cannot come up with an idea of how to extend that abstract class so that I can process each file and its filename as a single key-value record.
Any help is appreciated. Thanks!
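For context, a subclass along the lines the question asks about might look like the sketch below, using the `org.apache.hadoop.mapreduce` API. The class names `CombinedPdfInputFormat` and `WholeFileRecordReader` are my own; `CombineFileRecordReader` requires the delegate reader to have exactly the `(CombineFileSplit, TaskAttemptContext, Integer)` constructor shown.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Packs many small PDFs into each split; each record is (filename, file bytes).
public class CombinedPdfInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a single PDF across records
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader delegates to one WholeFileRecordReader per file in the split
        return new CombineFileRecordReader<>(
                (CombineFileSplit) split, context, WholeFileRecordReader.class);
    }

    // Reads one whole file from the combined split as a single key-value pair.
    public static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private final Path path;
        private final long length;
        private final Configuration conf;
        private Text key;
        private BytesWritable value;
        private boolean processed = false;

        // This exact constructor signature is required by CombineFileRecordReader.
        public WholeFileRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                     Integer index) {
            this.path = split.getPath(index);
            this.length = split.getLength(index);
            this.conf = context.getConfiguration();
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) { }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Slurp the entire file into memory; fine for PDFs in the 100 KB - 2 MB range.
            byte[] contents = new byte[(int) length];
            FileSystem fs = path.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            key = new Text(path.getName());
            value = new BytesWritable(contents);
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```

The split size can then be capped near the block size via the job configuration (e.g. `mapreduce.input.fileinputformat.split.maxsize`), so each map receives roughly one block's worth of PDFs.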
I think SequenceFile will suit your needs here: http://wiki.apache.org/hadoop/SequenceFile
Essentially, you put all your PDFs into a sequence file, and the mappers receive as many PDFs as fit into one HDFS block of the sequence file. When you create the sequence file, you set the key to the PDF filename and the value to the binary representation of the PDF.