This section describes the configuration file settings required to use the DFS under Hadoop.
The configuration files are of the following types and are placed in the "/etc/hadoop" directory:
- hadoop-env.sh
- core-site.xml
- mapred-site.xml
- pdfs-site.xml
The items to be set in each of these files are described below.
hadoop-env.sh file
Set the following environment variables in the hadoop-env.sh file:
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/opt/FJSVpdfs/lib/pdfs.jar"
export HADOOP_USER_CLASSPATH_FIRST="true"
export HADOOP_SSH_OPTS="-o StrictHostKeyChecking=no -o BatchMode=yes"
If required, also set JAVA_HOME.
export JAVA_HOME=/usr/java/default
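After editing hadoop-env.sh, it can help to confirm that the settings take effect. A minimal sketch, assuming the hadoop command is on the PATH (this check is illustrative, not part of the product procedure):

# Confirm pdfs.jar appears on the effective Hadoop classpath
hadoop classpath | tr ':' '\n' | grep pdfs.jar
# Confirm the JVM that Hadoop will use
"$JAVA_HOME"/bin/java -version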
core-site.xml file
This section describes the properties to be set in the core-site.xml file.
fs.default.name
Specify the default file system in the "pdfs://<directory>/" format (the <directory> part can be omitted).
The default file system is used to resolve path specifications that are not in URI format.
For example, if "pdfs:///" is set for fs.default.name, the path "/mydir/myfile" is resolved as the URI "pdfs:///mydir/myfile".
Default value: file:///
Setting example:
<property>
  <name>fs.default.name</name>
  <value>pdfs:///</value>
</property>
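The effect of the default file system can be checked from the command line. A minimal sketch, assuming the setting example above is active and that "/mydir/myfile" exists (the path is illustrative):

# With fs.default.name set to pdfs:///, these two commands are equivalent:
hadoop fs -ls /mydir/myfile
hadoop fs -ls pdfs:///mydir/myfile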
fs.pdfs.impl
Assigns the DFS file system class to a scheme.
Specify com.fujitsu.pdfs.fs.PdfsDirectFileSystem as the value.
For example, if the property is named "fs.pdfs.impl", the class is assigned to the "pdfs" scheme, so paths can be specified as "pdfs:///mydir/myfile" URIs.
Setting example:
<property>
  <name>fs.pdfs.impl</name>
  <value>com.fujitsu.pdfs.fs.PdfsDirectFileSystem</value>
</property>
io.file.buffer.size
Specify the default buffer size used during Read/Write.
Format: a multiple of 4096 (bytes)
Default value: 4096
Recommended value: 131072 (128 KB)
Setting example:
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>
hadoop.tmp.dir
Specify the directory used by Hadoop to store temporary files.
Default value: /tmp/hadoop-<user name>
Recommended value: the default
Setting example:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/tmp/hadoop-${user.name}</value>
</property>
mapred-site.xml file
This section describes the properties to be set in the mapred-site.xml file.
mapred.local.dir
Specify the directory that holds temporary data and Map intermediate output files while TaskTracker is executing MapReduce jobs.
Specify one or more directories on the local disk.
Default value: ${hadoop.tmp.dir}/mapred/local
Recommended value: <directory on an internal disk>/mapred/local
Note: Do not include ${user.name} in the specified path.
Setting example (with three internal disks mounted at /data/1, /data/2, and /data/3):
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>
mapred.system.dir
Specify the directory that stores the MapReduce processing control file.
Default value: ${hadoop.tmp.dir}/mapred/system
Recommended value: /mapred/system
Setting example:
<property>
  <name>mapred.system.dir</name>
  <value>/mapred/system</value>
</property>
mapreduce.jobtracker.staging.root.dir
Specify the top directory for the directories that store user-specific MapReduce job information files.
Job information files are stored in the "${mapreduce.jobtracker.staging.root.dir}/<user name>/mapred/staging" directory.
Default value: ${hadoop.tmp.dir}/mapred/staging
Recommended value: the same value as the pdfs.fs.local.homedir property (default: /user) (refer to "pdfs.fs.local.homedir")
Setting example:
<property>
  <name>mapreduce.jobtracker.staging.root.dir</name>
  <value>/user</value>
</property>
mapred.job.tracker
Specify, in host:port format (the port cannot be omitted), the host name and port number of the RPC server that runs JobTracker. For the port number, specify an unused port between 1024 and 61000.
Setting example:
<property>
  <name>mapred.job.tracker</name>
  <value>host1:50001</value>
</property>
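Before settling on a port number, it can be confirmed that nothing on the JobTracker host is already listening on it. A minimal sketch, assuming the net-tools package is installed and 50001 is the candidate port:

# No output means port 50001 is currently unused on this host
netstat -lnt | grep ':50001 '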
mapred.tasktracker.map.tasks.maximum
Specify the number of Map tasks executed in parallel at one node.
Default value: 2
Recommended value: whichever is the larger of the following:
- Number of CPU cores - 1
  (The number of CPU cores can be checked using "cat /proc/cpuinfo".)
- Total number of physical disks comprising the DFS / Number of slave nodes in the Hadoop cluster (rounded up to a whole number)
  (If the disk devices have a RAID-1 mirror configuration, for example, the total number of physical disks is 2 * the number of LUNs.)
A worked example of this calculation follows the setting example.
Setting example:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
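The following sketch works through the recommendation above with assumed example figures (48 physical disks and 6 slave nodes; replace them with the actual cluster values). The same calculation applies to mapred.tasktracker.reduce.tasks.maximum below, with the total number of LUNs in place of the physical disk count:

# Assumed example figures for illustration only
DISKS=48                                      # total physical disks comprising the DFS
NODES=6                                       # slave nodes in the Hadoop cluster
CORES=$(grep -c '^processor' /proc/cpuinfo)   # e.g. 8 on this host
BY_CPU=$((CORES - 1))                         # 8 - 1 = 7
BY_DISK=$(( (DISKS + NODES - 1) / NODES ))    # ceil(48 / 6) = 8
echo $(( BY_CPU > BY_DISK ? BY_CPU : BY_DISK ))   # larger value: 8, matching the setting example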
mapred.tasktracker.reduce.tasks.maximum
Specify the number of Reduce tasks executed in parallel at one node.
Default value: 2
Recommended value: whichever is the larger of the following:
- Number of CPU cores - 1
  (The number of CPU cores can be checked using "cat /proc/cpuinfo".)
- Total number of LUNs comprising the DFS / Number of slave nodes in the Hadoop cluster (rounded up to a whole number)
Setting example:
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
mapred.map.tasks.speculative.execution
Specify whether to enable Map task speculative execution.
If "true" is specified, a task identical to the Map task currently being executed is processed in parallel at a node that has capacity for task execution, and whichever finishes first is used.
Default value: true (enabled)
Recommended value: if the MapReduce jobs used in the Hadoop cluster include processing that writes directly from Map tasks to DFS files, specify "false". If in doubt, specify "false".
Setting example:
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
mapred.reduce.tasks.speculative.execution
Specify whether to enable Reduce task speculative execution.
If "true" is specified, a task identical to the Reduce task currently being executed is processed in parallel at a node that has capacity for task execution, and whichever finishes first is used.
Default value: true (enabled)
Recommended value: if the MapReduce jobs used in the Hadoop cluster include processing that writes directly from Reduce tasks to DFS files, specify "false". If in doubt, specify "false".
Setting example:
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
mapred.userlog.limit.kb
Specify the maximum log size for one task in the MapReduce task log. The actual log size might be a little larger than the specified value due to the Hadoop control method.
The MapReduce task log is created under the $HADOOP_LOG_DIR/userlogs directory.
Format: 1 (KB) or more
Default value: 0 (unlimited)
Recommended value: around 1000 (1 MB, enough for about 10,000 rows). This is large enough for the usual usage range, but adjust to suit the amount of log output by the MapReduce jobs.
Setting example:
<property>
  <name>mapred.userlog.limit.kb</name>
  <value>1000</value>
</property>
mapred.userlog.retain.hours
Specify the retention period for MapReduce task logs. The task logs might be required for investigating the causes of MapReduce job errors.
MapReduce task logs are created under the $HADOOP_LOG_DIR/userlogs directory.
Format: retention time in hours
Default value: 24 (hours)
Recommended value: about 168 (enough for one week)
If the specified time is short, the required information might not be available when investigating past jobs. However, this needs to be balanced against the disk space (the space available under the HADOOP_LOG_DIR directory) that can be used for retaining logs. Set a suitable time that takes the following into account:
- The disk space that can be permitted
- The operation work schedule
- The mapred.userlog.limit.kb property value (refer to "mapred.userlog.limit.kb")
If, for example, the maximum log output for one minute is assumed to be 1 MB (about 10,000 rows), about 10 GB of space (60 minutes x 24 hours x 7 days = 10,080 MB) would be required to retain a week's worth of logs.
Setting example:
<property>
  <name>mapred.userlog.retain.hours</name>
  <value>168</value>
</property>
pdfs-site.xml file
If required, set the properties below. Normally, only the pdfs.fs.local.basedir property (refer to "pdfs.fs.local.basedir") and the pdfs.security.authorization property (refer to "pdfs.security.authorization") need to be set, and the defaults can be used for the other properties.
pdfs.fs.local.basedir
Specify the DFS mount directory path.
If "/mnt/pdfs/hadoop" is set for pdfs.fs.local.basedir, for example, the URI "pdfs:///user/bdppuser1" becomes the path "/mnt/pdfs/hadoop/user/bdppuser1" at the operating system level.
Default value: /
Setting example:
<property>
  <name>pdfs.fs.local.basedir</name>
  <value>/mnt/pdfs/hadoop</value>
</property>
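The mapping between URIs and operating system paths can be observed directly. A minimal sketch, assuming the setting example above and an existing user directory (the user name and file name are illustrative):

# Create an empty file through the Hadoop interface ...
hadoop fs -touchz pdfs:///user/bdppuser1/sample.txt
# ... and it appears at the corresponding operating system path
ls -l /mnt/pdfs/hadoop/user/bdppuser1/sample.txt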
pdfs.fs.local.homedir
Specify the home directory path for users in the DFS FileSystem class.
If "/user" is set for pdfs.fs.local.homedir, for example, the DFS home directory URI for the user named "bdppuser1" becomes "pdfs:///user/bdppuser1".
Default value: /user (same as HDFS)
Recommended value: the default (/user)
Setting example:
<property>
  <name>pdfs.fs.local.homedir</name>
  <value>/home</value>
</property>
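Relative path specifications are resolved against this home directory. A minimal sketch, assuming the default "/user" and a user named bdppuser1 (both illustrative):

# Run as bdppuser1, these two commands refer to the same directory
hadoop fs -ls mydata
hadoop fs -ls pdfs:///user/bdppuser1/mydata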
pdfs.security.authorization
Specify whether or not to use the DFS's own MapReduce job user authentication.
Format:
- true: use
- false: do not use
Default value: false
Recommended value: true
Note: to use Kerberos authentication under Hadoop, specify "false".
Setting example:
<property>
  <name>pdfs.security.authorization</name>
  <value>true</value>
</property>
pdfs.fs.local.buffer.size
Specify the default buffer size to be used during Read/Write.
Note that whichever is larger of this property value and the io.file.buffer.size value is used (refer to "io.file.buffer.size").
Format: a multiple of 4096 (bytes)
Default value: 131072 (128 KB)
Recommended value: 128 KB to 512 KB
Setting example:
<property>
  <name>pdfs.fs.local.buffer.size</name>
  <value>524288</value>
</property>
pdfs.fs.local.block.size
Specify the data size into which the input data of MapReduce jobs is split for Map tasks.
As a guide, specify <total input data size of the main MapReduce jobs / number of slave nodes> or less. A worked sizing example follows the setting example.
Note that this specification need not match the block size (blocksz option) specified for pdfsmkfs in "D.4.2 Creating a File System".
Format: a multiple of 33554432 (32 MB)
Default value: 268435456 (256 MB)
Recommended value: 256 MB to 1 GB (1073741824)
Setting example:
<property>
  <name>pdfs.fs.local.block.size</name>
  <value>1073741824</value>
</property>
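The sizing guide above can be worked through numerically. A sketch with assumed figures (8 GB of typical job input across 16 slave nodes; both values are illustrative):

# Guide: total input data size of the main jobs / number of slave nodes, or less
INPUT=$((8 * 1024 * 1024 * 1024))    # assumed typical input: 8 GB
NODES=16                             # assumed slave node count
UNIT=33554432                        # the value must be a multiple of 32 MB
RAW=$((INPUT / NODES))               # 536870912 bytes (512 MB)
GUIDE=$(( RAW / UNIT * UNIT ))       # round down to a 32 MB multiple (already exact here)
echo $GUIDE                          # 536870912, within the recommended 256 MB to 1 GB range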
pdfs.fs.local.posix.umask
Specify whether or not the process umask value is reflected in the access permissions set when files or directories are created.
Format:
- true: use the umask value (POSIX compatible)
- false: do not use the umask value (HDFS compatible)
Default value: true
Setting example:
<property>
  <name>pdfs.fs.local.posix.umask</name>
  <value>false</value>
</property>
pdfs.fs.local.cache.location
Specify whether or not to use the cache local MapReduce function.
When enabled, this function fetches the memory cache retention node information of the target file when a MapReduce job starts, and preferentially assigns each Map task to a node that holds the cache, thus speeding up Map phase processing.
Format:
- true: use the cache local MapReduce function
- false: do not use the cache local MapReduce function
Default value: true
Setting example:
<property>
  <name>pdfs.fs.local.cache.location</name>
  <value>false</value>
</property>
pdfs.fs.local.cache.minsize
Specify the file size below which files are excluded from the cache local MapReduce function. Since there is a cost to fetching memory cache retention node information, this setting prevents it from being fetched for files smaller than the specified size.
Format: 1 (byte) or more
Default value: 1048576 (1 MB)
Setting example:
<property>
  <name>pdfs.fs.local.cache.minsize</name>
  <value>1048576</value>
</property>
pdfs.fs.local.cache.shell
Specify the remote command execution parameters used when the cache local MapReduce function fetches memory cache information.
Default value: /usr/bin/ssh -o IdentityFile=%HOME/.pdfs/id_hadoop -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -o EscapeChar=none
Note that "%HOME" in the string is replaced by "${pdfs.fs.local.basedir}/${pdfs.fs.local.homedir}/<user name>", and "%USER" is replaced by the user name.
Setting example:
<property>
  <name>pdfs.fs.local.cache.shell</name>
  <value>/usr/bin/ssh -o IdentityFile=/home/%USER/.ssh/id_rsa -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -o EscapeChar=none</value>
</property>
pdfs.fs.local.cache.procs
Specify the number of parallel remote executions used when the cache local MapReduce function fetches memory cache information.
Default value: 10
Setting example:
<property>
  <name>pdfs.fs.local.cache.procs</name>
  <value>40</value>
</property>