This section describes the configuration file settings required to use the DFS under Hadoop.
The configuration files are of the following types and are placed in the "/etc/hadoop" directory:
- hadoop-env.sh
- core-site.xml
- mapred-site.xml
- pdfs-site.xml
The items to be set in each of these files are described below.
hadoop-env.sh file
Set the following environment variables in the hadoop-env.sh file:
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/opt/FJSVpdfs/lib/pdfs.jar"
export HADOOP_USER_CLASSPATH_FIRST="true"
export HADOOP_SSH_OPTS="-o StrictHostKeyChecking=no -o BatchMode=yes"
If required, also set JAVA_HOME.
export JAVA_HOME=/usr/java/default
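After editing hadoop-env.sh, it can help to confirm that the settings take effect. A minimal sketch, assuming the hadoop command is on the PATH (this check is illustrative, not part of the product procedure):

# Confirm pdfs.jar appears on the effective Hadoop classpath
hadoop classpath | tr ':' '\n' | grep pdfs.jar
# Confirm the JVM that Hadoop will use
"$JAVA_HOME"/bin/java -version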
core-site.xml file
This section describes the properties to be set in the core-site.xml file.
fs.default.name
Specify the default file system in the "pdfs://<directory>/" format (the <directory> part can be omitted).
The default file system is used to resolve path specifications that are not in URI format.
For example, if "pdfs:///" is set for fs.default.name, the path "/mydir/myfile" is resolved as the URI "pdfs:///mydir/myfile".
Default value: file:///
Setting example:
<property>
  <name>fs.default.name</name>
  <value>pdfs:///</value>
</property>
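The effect of the default file system can be checked from the command line. A minimal sketch, assuming the setting example above is active and that "/mydir/myfile" exists (the path is illustrative):

# With fs.default.name set to pdfs:///, these two commands are equivalent:
hadoop fs -ls /mydir/myfile
hadoop fs -ls pdfs:///mydir/myfile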
fs.pdfs.impl
Assigns the DFS file system class to a scheme.
Specify com.fujitsu.pdfs.fs.PdfsDirectFileSystem as the value.
For example, if the property is named "fs.pdfs.impl", the class is assigned to the "pdfs" scheme, so paths can be specified as "pdfs:///mydir/myfile" URIs.
Setting example:
<property>
  <name>fs.pdfs.impl</name>
  <value>com.fujitsu.pdfs.fs.PdfsDirectFileSystem</value>
</property>
io.file.buffer.size
Specify the default buffer size used during Read/Write.
Format: a multiple of 4096 (bytes)
Default value: 4096
Recommended value: 131072 (128 KB)
Setting example:
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>
hadoop.tmp.dir
Specify the directory used by Hadoop to store temporary files.
Default value: /tmp/hadoop-<user name>
Recommended value: the default
Setting example:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/tmp/hadoop-${user.name}</value>
</property>
mapred-site.xml file
This section describes the properties to be set in the mapred-site.xml file.
mapred.local.dir
Specify the directory that holds temporary data and Map intermediate output files while TaskTracker is executing MapReduce jobs.
Specify one or more directories on the local disk.
Default value: ${hadoop.tmp.dir}/mapred/local
Recommended value: <directory on an internal disk>/mapred/local
Note: Do not include ${user.name} in the specified path.
Setting example (with three internal disks mounted at /data/1, /data/2, and /data/3):
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>
mapred.system.dir
Specify the directory that stores the MapReduce processing control file.
Default value: ${hadoop.tmp.dir}/mapred/system
Recommended value: /mapred/system
Setting example:
<property>
  <name>mapred.system.dir</name>
  <value>/mapred/system</value>
</property>
mapreduce.jobtracker.staging.root.dir
Specify the top directory for the directories that store user-specific MapReduce job information files.
Job information files are stored in the "${mapreduce.jobtracker.staging.root.dir}/<user name>/mapred/staging" directory.
Default value: ${hadoop.tmp.dir}/mapred/staging
Recommended value: the same value as the pdfs.fs.local.homedir property (default: /user) (refer to "pdfs.fs.local.homedir")
Setting example:
<property>
  <name>mapreduce.jobtracker.staging.root.dir</name>
  <value>/user</value>
</property>
mapred.job.tracker
Specify, in host:port format (the port cannot be omitted), the host name and port number of the RPC server that runs JobTracker. For the port number, specify an unused port between 1024 and 61000.
Setting example:
<property>
  <name>mapred.job.tracker</name>
  <value>host1:50001</value>
</property>
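Before settling on a port number, it can be confirmed that nothing on the JobTracker host is already listening on it. A minimal sketch, assuming the net-tools package is installed and 50001 is the candidate port:

# No output means port 50001 is currently unused on this host
netstat -lnt | grep ':50001 '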
mapred.tasktracker.map.tasks.maximum
Specify the number of Map tasks executed in parallel at one node.
Default value: 2
Recommended value: whichever is the larger of the following:
- Number of CPU cores - 1
  (The number of CPU cores can be checked using "cat /proc/cpuinfo".)
- Total number of physical disks comprising the DFS / Number of slave nodes in the Hadoop cluster (rounded up to a whole number)
  (If the disk devices have a RAID-1 mirror configuration, for example, the total number of physical disks is 2 * the number of LUNs.)
A worked example of this calculation follows the setting example.
Setting example:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
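The following sketch works through the recommendation above with assumed example figures (48 physical disks and 6 slave nodes; replace them with the actual cluster values). The same calculation applies to mapred.tasktracker.reduce.tasks.maximum below, with the total number of LUNs in place of the physical disk count:

# Assumed example figures for illustration only
DISKS=48                                      # total physical disks comprising the DFS
NODES=6                                       # slave nodes in the Hadoop cluster
CORES=$(grep -c '^processor' /proc/cpuinfo)   # e.g. 8 on this host
BY_CPU=$((CORES - 1))                         # 8 - 1 = 7
BY_DISK=$(( (DISKS + NODES - 1) / NODES ))    # ceil(48 / 6) = 8
echo $(( BY_CPU > BY_DISK ? BY_CPU : BY_DISK ))   # larger value: 8, matching the setting example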
mapred.tasktracker.reduce.tasks.maximum
Specify the number of Reduce tasks executed in parallel at one node.
Default value: 2
Recommended value: whichever is the larger of the following:
- Number of CPU cores - 1
  (The number of CPU cores can be checked using "cat /proc/cpuinfo".)
- Total number of LUNs comprising the DFS / Number of slave nodes in the Hadoop cluster (rounded up to a whole number)
Setting example:
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
mapred.map.tasks.speculative.execution
Specify whether to enable Map task speculative execution.
If "true" is specified, a task identical to the Map task currently being executed is processed in parallel at a node that has capacity for task execution, and whichever finishes first is used.
Default value: true (enabled)
Recommended value: if the MapReduce jobs used in the Hadoop cluster include processing that writes directly from Map tasks to DFS files, specify "false". If in doubt, specify "false".
Setting example:
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
mapred.reduce.tasks.speculative.execution
Specify whether to enable Reduce task speculative execution.
If "true" is specified, a task identical to the Reduce task currently being executed is processed in parallel at a node that has capacity for task execution, and whichever finishes first is used.
Default value: true (enabled)
Recommended value: if the MapReduce jobs used in the Hadoop cluster include processing that writes directly from Reduce tasks to DFS files, specify "false". If in doubt, specify "false".
Setting example:
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
mapred.userlog.limit.kb
Specify the maximum log size for one task in the MapReduce task log. The actual log size might be a little larger than the specified value due to the Hadoop control method.
The MapReduce task log is created under the $HADOOP_LOG_DIR/userlogs directory.
Format: 1 (KB) or more
Default value: 0 (unlimited)
Recommended value: around 1000 (1 MB, enough for about 10,000 rows). This is large enough for the usual usage range, but adjust to suit the amount of log output by the MapReduce jobs.
Setting example:
<property>
  <name>mapred.userlog.limit.kb</name>
  <value>1000</value>
</property>
mapred.userlog.retain.hours
Specify the retention period for MapReduce task logs. The task logs might be required for investigating the causes of MapReduce job errors.
MapReduce task logs are created under the $HADOOP_LOG_DIR/userlogs directory.
Format: retention time in hours
Default value: 24 (hours)
Recommended value: about 168 (enough for one week)
If the specified time is short, the required information might not be available when investigating past jobs. However, this needs to be balanced against the disk space (the space available under the HADOOP_LOG_DIR directory) that can be used for retaining logs. Set a suitable time that takes the following into account:
- The disk space that can be permitted
- The operation work schedule
- The mapred.userlog.limit.kb property value (refer to "mapred.userlog.limit.kb")
If, for example, the maximum log output for one minute is assumed to be 1 MB (about 10,000 rows), about 10 GB of space (60 minutes x 24 hours x 7 days = 10,080 MB) would be required to retain a week's worth of logs.
Setting example:
<property>
  <name>mapred.userlog.retain.hours</name>
  <value>168</value>
</property>
pdfs-site.xml file
If required, set the properties below. Normally, only the pdfs.fs.local.basedir property (refer to "pdfs.fs.local.basedir") and the pdfs.security.authorization property (refer to "pdfs.security.authorization") need to be set, and the defaults can be used for the other properties.
pdfs.fs.local.basedir
Specify the DFS mount directory path.
If "/mnt/pdfs/hadoop" is set for pdfs.fs.local.basedir, for example, the URI "pdfs:///user/bdppuser1" becomes the path "/mnt/pdfs/hadoop/user/bdppuser1" at the operating system level.
Default value: /
Setting example:
<property>
  <name>pdfs.fs.local.basedir</name>
  <value>/mnt/pdfs/hadoop</value>
</property>
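The mapping between URIs and operating system paths can be observed directly. A minimal sketch, assuming the setting example above and an existing user directory (the user name and file name are illustrative):

# Create an empty file through the Hadoop interface ...
hadoop fs -touchz pdfs:///user/bdppuser1/sample.txt
# ... and it appears at the corresponding operating system path
ls -l /mnt/pdfs/hadoop/user/bdppuser1/sample.txt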
pdfs.fs.local.homedir
Specify the home directory path for users in the DFS FileSystem class.
If "/user" is set for pdfs.fs.local.homedir, for example, the DFS home directory URI for the user named "bdppuser1" becomes "pdfs:///user/bdppuser1".
Default value: /user (same as HDFS)
Recommended value: the default (/user)
Setting example:
<property>
  <name>pdfs.fs.local.homedir</name>
  <value>/home</value>
</property>
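Relative path specifications are resolved against this home directory. A minimal sketch, assuming the default "/user" and a user named bdppuser1 (both illustrative):

# Run as bdppuser1, these two commands refer to the same directory
hadoop fs -ls mydata
hadoop fs -ls pdfs:///user/bdppuser1/mydata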
pdfs.security.authorization
Specify whether or not to use the DFS's own MapReduce job user authentication.
Format:
- true: use
- false: do not use
Default value: false
Recommended value: true
Note: to use Kerberos authentication under Hadoop, specify "false".
Setting example:
<property>
  <name>pdfs.security.authorization</name>
  <value>true</value>
</property>
pdfs.fs.local.buffer.size
Specify the default buffer size to be used during Read/Write.
Note that whichever is larger of this property value and the io.file.buffer.size value is used (refer to "io.file.buffer.size").
Format: a multiple of 4096 (bytes)
Default value: 131072 (128 KB)
Recommended value: 128 KB to 512 KB
Setting example:
<property>
  <name>pdfs.fs.local.buffer.size</name>
  <value>524288</value>
</property>
pdfs.fs.local.block.size
Specify the data size into which the input data of MapReduce jobs is split for Map tasks.
As a guide, specify <total input data size of the main MapReduce jobs / number of slave nodes> or less. A worked sizing example follows the setting example.
Note that this specification need not match the block size (blocksz option) specified for pdfsmkfs in "D.4.2 Creating a File System".
Format: a multiple of 33554432 (32 MB)
Default value: 268435456 (256 MB)
Recommended value: 256 MB to 1 GB (1073741824)
Setting example:
<property>
  <name>pdfs.fs.local.block.size</name>
  <value>1073741824</value>
</property>
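The sizing guide above can be worked through numerically. A sketch with assumed figures (8 GB of typical job input across 16 slave nodes; both values are illustrative):

# Guide: total input data size of the main jobs / number of slave nodes, or less
INPUT=$((8 * 1024 * 1024 * 1024))    # assumed typical input: 8 GB
NODES=16                             # assumed slave node count
UNIT=33554432                        # the value must be a multiple of 32 MB
RAW=$((INPUT / NODES))               # 536870912 bytes (512 MB)
GUIDE=$(( RAW / UNIT * UNIT ))       # round down to a 32 MB multiple (already exact here)
echo $GUIDE                          # 536870912, within the recommended 256 MB to 1 GB range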
pdfs.fs.local.posix.umask
Specify whether or not the process umask value is reflected in the access permissions set when files or directories are created.
Format:
- true: use the umask value (POSIX compatible)
- false: do not use the umask value (HDFS compatible)
Default value: true
Setting example:
<property>
  <name>pdfs.fs.local.posix.umask</name>
  <value>false</value>
</property>
pdfs.fs.local.cache.location
Specify whether or not to use the cache local MapReduce function.
When enabled, this function fetches the memory cache retention node information of the target file when a MapReduce job starts, and preferentially assigns each Map task to a node that holds the cache, thus speeding up Map phase processing.
Format:
- true: use the cache local MapReduce function
- false: do not use the cache local MapReduce function
Default value: true
Setting example:
<property>
  <name>pdfs.fs.local.cache.location</name>
  <value>false</value>
</property>
pdfs.fs.local.cache.minsize
Specify the file size below which files are excluded from the cache local MapReduce function. Since there is a cost to fetching memory cache retention node information, this setting prevents it from being fetched for files smaller than the specified size.
Format: 1 (byte) or more
Default value: 1048576 (1 MB)
Setting example:
<property>
  <name>pdfs.fs.local.cache.minsize</name>
  <value>1048576</value>
</property>
pdfs.fs.local.cache.shell
Specify the remote command execution parameters used when the cache local MapReduce function fetches memory cache information.
Default value: /usr/bin/ssh -o IdentityFile=%HOME/.pdfs/id_hadoop -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -o EscapeChar=none
Note that "%HOME" in the string is replaced by "${pdfs.fs.local.basedir}/${pdfs.fs.local.homedir}/<user name>", and "%USER" is replaced by the user name.
Setting example:
<property>
  <name>pdfs.fs.local.cache.shell</name>
  <value>/usr/bin/ssh -o IdentityFile=/home/%USER/.ssh/id_rsa -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -o EscapeChar=none</value>
</property>
pdfs.fs.local.cache.procs
Specify the number of parallel remote executions used when the cache local MapReduce function fetches memory cache information.
Default value: 10
Setting example:
<property>
  <name>pdfs.fs.local.cache.procs</name>
  <value>40</value>
</property>