| 
||||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | |||||||||
java.lang.Objectorg.apache.hadoop.filecache.DistributedCache
public class DistributedCache
Distribute application-specific large, read-only files efficiently.
DistributedCache is a facility provided by the Map-Reduce
 framework to cache files (text, archives, jars etc.) needed by applications.
 
Applications specify the files, via urls (hdfs:// or http://) to be cached 
 via the JobConf. The DistributedCache assumes that the
 files specified via hdfs:// urls are already present on the 
 FileSystem at the path specified by the url.
The framework will copy the necessary files on to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.
DistributedCache can be used to distribute simple, read-only
 data/text files and/or more complex types such as archives, jars etc. 
 Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes. 
 Jars may be optionally added to the classpath of the tasks, a rudimentary 
 software distribution mechanism.  Files have execution permissions.
 Optionally users can also direct it to symlink the distributed cache file(s)
 into the working directory of the task.
DistributedCache tracks modification timestamps of the cache 
 files. Clearly the cache files should not be modified by the application 
 or externally while the job is executing.
Here is an illustrative example on how to use the 
 DistributedCache:
     // Setting up the cache for the application
     
     1. Copy the requisite files to the FileSystem:
     
     $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat  
     $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip  
     $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
     $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
     $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
     $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
     
     2. Setup the application's JobConf:
     
     JobConf job = new JobConf();
     DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), 
                                   job);
     DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
     DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
     
     3. Use the cached files in the Mapper or Reducer:
     
     public static class MapClass extends MapReduceBase  
     implements Mapper<K, V, K, V> {
     
       private Path[] localArchives;
       private Path[] localFiles;
       
       public void configure(JobConf job) {
         // Get the cached archives/files
         localArchives = DistributedCache.getLocalCacheArchives(job);
         localFiles = DistributedCache.getLocalCacheFiles(job);
       }
       
       public void map(K key, V value, 
                       OutputCollector<K, V> output, Reporter reporter) 
       throws IOException {
         // Use data from the cached archives/files here
         // ...
         // ...
         output.collect(k, v);
       }
     }
     
 
JobConf, 
JobClient| Constructor Summary | |
|---|---|
DistributedCache()
 | 
|
| Method Summary | |
|---|---|
static void | 
addArchiveToClassPath(Path archive,
                      Configuration conf)
Add an archive path to the current set of classpath entries.  | 
static void | 
addCacheArchive(URI uri,
                Configuration conf)
Add a archives to be localized to the conf  | 
static void | 
addCacheFile(URI uri,
             Configuration conf)
Add a file to be localized to the conf  | 
static void | 
addFileToClassPath(Path file,
                   Configuration conf)
Add an file path to the current set of classpath entries It adds the file to cache as well.  | 
static boolean | 
checkURIs(URI[] uriFiles,
          URI[] uriArchives)
This method checks if there is a conflict in the fragment names of the uris.  | 
static void | 
createAllSymlink(Configuration conf,
                 File jobCacheDir,
                 File workDir)
This method create symlinks for all files in a given dir in another directory  | 
static void | 
createSymlink(Configuration conf)
This method allows you to create symlinks in the current working directory of the task to all the cache files/archives  | 
static Path[] | 
getArchiveClassPaths(Configuration conf)
Get the archive entries in classpath as an array of Path  | 
static String[] | 
getArchiveTimestamps(Configuration conf)
Get the timestamps of the archives  | 
static URI[] | 
getCacheArchives(Configuration conf)
Get cache archives set in the Configuration  | 
static URI[] | 
getCacheFiles(Configuration conf)
Get cache files set in the Configuration  | 
static Path[] | 
getFileClassPaths(Configuration conf)
Get the file entries in classpath as an array of Path  | 
static String[] | 
getFileTimestamps(Configuration conf)
Get the timestamps of the files  | 
static Path | 
getLocalCache(URI cache,
              Configuration conf,
              Path baseDir,
              boolean isArchive,
              long confFileStamp,
              Path currentWorkDir)
Get the locally cached file or archive; it could either be previously cached (and valid) or copy it from the FileSystem now. | 
static Path | 
getLocalCache(URI cache,
              Configuration conf,
              Path baseDir,
              FileStatus fileStatus,
              boolean isArchive,
              long confFileStamp,
              Path currentWorkDir)
Get the locally cached file or archive; it could either be previously cached (and valid) or copy it from the FileSystem now. | 
static Path[] | 
getLocalCacheArchives(Configuration conf)
Return the path array of the localized caches  | 
static Path[] | 
getLocalCacheFiles(Configuration conf)
Return the path array of the localized files  | 
static boolean | 
getSymlink(Configuration conf)
This method checks to see if symlinks are to be create for the localized cache files in the current working directory  | 
static long | 
getTimestamp(Configuration conf,
             URI cache)
Returns mtime of a given cache file on hdfs.  | 
static String | 
makeRelative(URI cache,
             Configuration conf)
 | 
static void | 
purgeCache(Configuration conf)
Clear the entire contents of the cache and delete the backing files.  | 
static void | 
releaseCache(URI cache,
             Configuration conf)
This is the opposite of getlocalcache.  | 
static void | 
setArchiveTimestamps(Configuration conf,
                     String timestamps)
This is to check the timestamp of the archives to be localized  | 
static void | 
setCacheArchives(URI[] archives,
                 Configuration conf)
Set the configuration with the given set of archives  | 
static void | 
setCacheFiles(URI[] files,
              Configuration conf)
Set the configuration with the given set of files  | 
static void | 
setFileTimestamps(Configuration conf,
                  String timestamps)
This is to check the timestamp of the files to be localized  | 
static void | 
setLocalArchives(Configuration conf,
                 String str)
Set the conf to contain the location for localized archives  | 
static void | 
setLocalFiles(Configuration conf,
              String str)
Set the conf to contain the location for localized files  | 
| Methods inherited from class java.lang.Object | 
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait | 
| Constructor Detail | 
|---|
public DistributedCache()
| Method Detail | 
|---|
public static Path getLocalCache(URI cache,
                                 Configuration conf,
                                 Path baseDir,
                                 FileStatus fileStatus,
                                 boolean isArchive,
                                 long confFileStamp,
                                 Path currentWorkDir)
                          throws IOException
FileSystem now.
cache - the cache to be localized, this should be specified as 
 new URI(hdfs://hostname:port/absolute_path_to_file#LINKNAME). If no schema 
 or hostname:port is provided the file is assumed to be in the filesystem
 being used in the Configurationconf - The Confguration file which contains the filesystembaseDir - The base cache Dir where you wnat to localize the files/archivesfileStatus - The file status on the dfs.isArchive - if the cache is an archive or a file. In case it is an
  archive with a .zip or .jar or .tar or .tgz or .tar.gz extension it will
  be unzipped/unjarred/untarred automatically 
  and the directory where the archive is unzipped/unjarred/untarred is
  returned as the Path.
  In case of a file, the path to the file is returnedconfFileStamp - this is the hdfs file modification timestamp to verify that the 
 file to be cached hasn't changed since the job startedcurrentWorkDir - this is the directory where you would want to create symlinks 
 for the locally cached files/archives
IOException
public static Path getLocalCache(URI cache,
                                 Configuration conf,
                                 Path baseDir,
                                 boolean isArchive,
                                 long confFileStamp,
                                 Path currentWorkDir)
                          throws IOException
FileSystem now.
cache - the cache to be localized, this should be specified as 
 new URI(hdfs://hostname:port/absolute_path_to_file#LINKNAME). If no schema 
 or hostname:port is provided the file is assumed to be in the filesystem
 being used in the Configurationconf - The Confguration file which contains the filesystembaseDir - The base cache Dir where you wnat to localize the files/archivesisArchive - if the cache is an archive or a file. In case it is an 
  archive with a .zip or .jar or .tar or .tgz or .tar.gz extension it will 
  be unzipped/unjarred/untarred automatically 
  and the directory where the archive is unzipped/unjarred/untarred 
  is returned as the Path.
  In case of a file, the path to the file is returnedconfFileStamp - this is the hdfs file modification timestamp to verify that the 
 file to be cached hasn't changed since the job startedcurrentWorkDir - this is the directory where you would want to create symlinks 
 for the locally cached files/archives
IOException
public static void releaseCache(URI cache,
                                Configuration conf)
                         throws IOException
cache - The cache URI to be releasedconf - configuration which contains the filesystem the cache 
 is contained in.
IOException
public static String makeRelative(URI cache,
                                  Configuration conf)
                           throws IOException
IOException
public static long getTimestamp(Configuration conf,
                                URI cache)
                         throws IOException
conf - configurationcache - cache file
IOException
public static void createAllSymlink(Configuration conf,
                                    File jobCacheDir,
                                    File workDir)
                             throws IOException
conf - the configurationjobCacheDir - the target directory for creating symlinksworkDir - the directory in which the symlinks are created
IOException
public static void setCacheArchives(URI[] archives,
                                    Configuration conf)
archives - The list of archives that need to be localizedconf - Configuration which will be changed
public static void setCacheFiles(URI[] files,
                                 Configuration conf)
files - The list of files that need to be localizedconf - Configuration which will be changed
public static URI[] getCacheArchives(Configuration conf)
                              throws IOException
conf - The configuration which contains the archives
IOException
public static URI[] getCacheFiles(Configuration conf)
                           throws IOException
conf - The configuration which contains the files
IOException
public static Path[] getLocalCacheArchives(Configuration conf)
                                    throws IOException
conf - Configuration that contains the localized archives
IOException
public static Path[] getLocalCacheFiles(Configuration conf)
                                 throws IOException
conf - Configuration that contains the localized files
IOExceptionpublic static String[] getArchiveTimestamps(Configuration conf)
conf - The configuration which stored the timestamps
IOExceptionpublic static String[] getFileTimestamps(Configuration conf)
conf - The configuration which stored the timestamps
IOException
public static void setArchiveTimestamps(Configuration conf,
                                        String timestamps)
conf - Configuration which stores the timestamp'stimestamps - comma separated list of timestamps of archives.
 The order should be the same as the order in which the archives are added.
public static void setFileTimestamps(Configuration conf,
                                     String timestamps)
conf - Configuration which stores the timestamp'stimestamps - comma separated list of timestamps of files.
 The order should be the same as the order in which the files are added.
public static void setLocalArchives(Configuration conf,
                                    String str)
conf - The conf to modify to contain the localized cachesstr - a comma separated list of local archives
public static void setLocalFiles(Configuration conf,
                                 String str)
conf - The conf to modify to contain the localized cachesstr - a comma separated list of local files
public static void addCacheArchive(URI uri,
                                   Configuration conf)
uri - The uri of the cache to be localizedconf - Configuration to add the cache to
public static void addCacheFile(URI uri,
                                Configuration conf)
uri - The uri of the cache to be localizedconf - Configuration to add the cache to
public static void addFileToClassPath(Path file,
                                      Configuration conf)
                               throws IOException
file - Path of the file to be addedconf - Configuration that contains the classpath setting
IOExceptionpublic static Path[] getFileClassPaths(Configuration conf)
conf - Configuration that contains the classpath setting
public static void addArchiveToClassPath(Path archive,
                                         Configuration conf)
                                  throws IOException
archive - Path of the archive to be addedconf - Configuration that contains the classpath setting
IOExceptionpublic static Path[] getArchiveClassPaths(Configuration conf)
conf - Configuration that contains the classpath settingpublic static void createSymlink(Configuration conf)
conf - the jobconfpublic static boolean getSymlink(Configuration conf)
conf - the jobconf
public static boolean checkURIs(URI[] uriFiles,
                                URI[] uriArchives)
uriFiles - The uri array of urifilesuriArchives - the uri array of uri archives
public static void purgeCache(Configuration conf)
                       throws IOException
IOException
  | 
||||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | |||||||||