org.apache.hadoop.mapred
Interface InputFormat<K,V>

All Known Subinterfaces:
ComposableInputFormat<K,V>
All Known Implementing Classes:
CompositeInputFormat, FileInputFormat, KeyValueTextInputFormat, LineDocInputFormat, MultiFileInputFormat, MultiFileWordCount.MyInputFormat, NLineInputFormat, Parser.Node, SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat, SequenceFileInputFilter, SequenceFileInputFormat, StreamInputFormat, TextInputFormat

public interface InputFormat<K,V>

InputFormat describes the input-specification for a Map-Reduce job.

The Map-Reduce framework relies on the InputFormat of the job to:

  1. Validate the input-specification of the job.
  2. Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
  3. Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of file-based InputFormats, typically sub-classes of FileInputFormat, is to split the input into logical InputSplits based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size.
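The bound rule above can be sketched in plain Java. This mirrors the split-size computation used by FileInputFormat (goal size from the split-count hint, capped by the block size, floored by mapred.min.split.size); exact behavior may differ by Hadoop version, and the numbers below are illustrative only.

```java
// Minimal sketch of the split-size rule: the FileSystem blocksize is an
// upper bound, mapred.min.split.size a lower bound, and the per-split
// goal size is derived from the total input size and the numSplits hint.
public class SplitSizeSketch {
  static long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }

  public static void main(String[] args) {
    long totalBytes = 1_000_000_000L;        // total size of all input files
    int numSplits = 4;                       // the hint passed to getSplits
    long goalSize = totalBytes / numSplits;  // 250,000,000 bytes per split
    long blockSize = 134_217_728L;           // 128 MB FileSystem blocksize
    long minSize = 1L;                       // mapred.min.split.size default
    // The block size caps the goal, so each split is one block here.
    System.out.println(computeSplitSize(goalSize, minSize, blockSize));
  }
}
```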

Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application also has to implement a RecordReader, on whom lies the responsibility to respect record boundaries and present a record-oriented view of the logical InputSplit to the individual task.
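The interplay of the two responsibilities can be sketched as a minimal implementation against this interface. This is a hedged sketch, not part of Hadoop: byte-oriented splitting is delegated to TextInputFormat, while LineRecordReader restores record boundaries by skipping a partial first line and reading past the split's end to finish the last line.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Hypothetical example class; the name is not part of Hadoop.
public class LineOrientedInputFormat implements InputFormat<LongWritable, Text> {

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // Delegate the byte-oriented splitting; splits may start or end
    // in the middle of a record.
    TextInputFormat delegate = new TextInputFormat();
    delegate.configure(job);
    return delegate.getSplits(job, numSplits);
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    // LineRecordReader presents a record-oriented (line-oriented) view
    // of the logical split to the task.
    return new LineRecordReader(job, (FileSplit) split);
  }
}
```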

See Also:
InputSplit, RecordReader, JobClient, FileInputFormat

Method Summary
 RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
          Get the RecordReader for the given InputSplit.
 InputSplit[] getSplits(JobConf job, int numSplits)
          Logically split the set of input files for the job.
 void validateInput(JobConf job)
          Deprecated. getSplits is called in the client and can perform any necessary validation of the input
 

Method Detail

validateInput

void validateInput(JobConf job)
                   throws IOException
Deprecated. getSplits is called in the client and can perform any necessary validation of the input

Check for validity of the input-specification for the job.

This method is used to validate the input directories when a job is submitted so that the JobClient can fail early, with a useful error message, in case of errors, e.g. when an input directory does not exist.

Parameters:
job - job configuration.
Throws:
InvalidInputException - if the job does not have valid input
IOException

getSplits

InputSplit[] getSplits(JobConf job,
                       int numSplits)
                       throws IOException
Logically split the set of input files for the job.

Each InputSplit is then assigned to an individual Mapper for processing.

Note: The split is a logical split of the inputs and the input files are not physically split into chunks. E.g. a split could be an <input-file-path, start, offset> tuple.

Parameters:
job - job configuration.
numSplits - the desired number of splits, a hint.
Returns:
an array of InputSplits for the job.
Throws:
IOException
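As a concrete illustration of such a tuple, FileSplit is Hadoop's file-based InputSplit; the path, offsets, and host below are hypothetical values, and nothing is physically cut out of the file.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;

// A logical split covering the first 64 MB of one input file.
Path file = new Path("/user/data/input.txt");  // assumed input path
long start = 0L;                               // byte offset into the file
long length = 64L * 1024 * 1024;               // length of the split in bytes
String[] hosts = { "datanode1" };              // preferred split locations
FileSplit split = new FileSplit(file, start, length, hosts);
```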

getRecordReader

RecordReader<K,V> getRecordReader(InputSplit split,
                                  JobConf job,
                                  Reporter reporter)
                                  throws IOException
Get the RecordReader for the given InputSplit.

It is the responsibility of the RecordReader to respect record boundaries while processing the logical split to present a record-oriented view to the individual task.

Parameters:
split - the InputSplit
job - the job that this split belongs to
reporter - facility to report progress
Returns:
a RecordReader
Throws:
IOException
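For context, the framework drives the returned RecordReader roughly as sketched below; format, split, and job are placeholders for an InputFormat<LongWritable, Text>, its InputSplit, and the JobConf, and Reporter.NULL stands in for the task's real reporter.

```java
// Hedged sketch of the read loop a Mapper's task runs over one split.
RecordReader<LongWritable, Text> reader =
    format.getRecordReader(split, job, Reporter.NULL);
LongWritable key = reader.createKey();
Text value = reader.createValue();
while (reader.next(key, value)) {
  // map(key, value, output, reporter) would be invoked here
}
reader.close();
```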


Copyright © 2008 The Apache Software Foundation