What are the default configuration files that are used in Hadoop?
|
As of the 0.20 release, Hadoop supports the following read-only default configuration files:
- src/core/core-default.xml
- src/hdfs/hdfs-default.xml
- src/mapred/mapred-default.xml
|
How will you make changes to the default configuration files?
|
Hadoop does not recommend changing the default configuration files; instead, it recommends making all site-specific changes in the following files:
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in order from the classpath:
- core-default.xml: read-only defaults for Hadoop.
- core-site.xml: site-specific configuration for a given Hadoop installation.
|
Consider a scenario where you have set the property mapred.output.compress to true to ensure that all output files are compressed for efficient space usage on the cluster. If a cluster user does not want to compress data for a specific job, what will you recommend he do?
|
Ask him to create his own configuration file, set mapred.output.compress to false in it, and load that file as a resource in his job (see the sketch below).
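A minimal sketch of that per-job override, assuming the old (JobConf-based) API from the 0.20 line; the class and file names are illustrative:

JobConf conf = new JobConf(MyJob.class);          // MyJob is a placeholder driver class
conf.addResource("my-job-overrides.xml");         // user's own file that sets mapred.output.compress to false
// or override it directly in the driver:
conf.setBoolean("mapred.output.compress", false);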
|
In the above scenario, how can you ensure that a user cannot override the configuration mapred.output.compress to false in any of his jobs?
|
This can be done by marking the property as final (adding <final>true</final> to its definition) in the core-site.xml file; final properties cannot be overridden by user job configurations.
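A sketch of what the entry in core-site.xml might look like; the final element is what blocks user overrides:

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
  <final>true</final>
</property>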
|
Which of the following is the only required variable that needs to be set in the file conf/hadoop-env.sh for Hadoop to work?
|
- HADOOP_LOG_DIR
- JAVA_HOME
- HADOOP_CLASSPATH
The only required variable to set is JAVA_HOME, which needs to point to the <java installation> directory.
|
List all the daemons required to run the Hadoop cluster.
|
- NameNode
- DataNode
- JobTracker
- TaskTracker
|
What's the default port that the JobTracker web UI listens on?
|
50030
|
What's the default port that the DFS NameNode web UI listens on?
|
50070
|
What is HDFS?
|
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability in the face of failures and their high availability to highly parallel applications.
|
What does the statement "HDFS is a block-structured file system" mean?
|
It means that in HDFS individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity.
|
What does the term "replication factor" mean?
|
The replication factor is the number of copies of each block of a file that HDFS maintains across the cluster.
|
What is the typical block size of an HDFS block?
|
64 MB to 128 MB
|
What is the benefit of having such a big block size (when compared to the block size of a Linux file system like ext)?
|
It allows HDFS to decrease the amount of metadata storage required per file (the list of blocks per file is smaller as the size of individual blocks increases). Furthermore, it allows for fast streaming reads of data by keeping large amounts of data sequentially laid out on the disk.
|
Why is it recommended to have a few very large files instead of a lot of small files in HDFS?
|
This is because the NameNode contains the metadata of each and every file in HDFS, and more files means more metadata. Since the NameNode loads all the metadata into memory for speed, having a lot of files may make the metadata large enough to exceed the size of the memory on the NameNode.
|
What is a DataNode in HDFS?
|
Individual machines in the HDFS cluster that hold blocks of data are called DataNodes.
|
What is a NameNode in HDFS?
|
The NameNode stores all the metadata for the file system.
|
What alternate way does HDFS provide to recover data in case a NameNode, without a backup, fails and cannot be recovered?
|
There is no way. If the NameNode dies and there is no backup, then there is no way to recover the data.
|
Describe how an HDFS client will read a file in HDFS. Will it talk to the DataNode or the NameNode? How will the data flow, etc.?
|
To open a file, a client contacts the NameNode and retrieves a list of locations for the blocks that comprise the file. These locations identify the DataNodes which hold each block. Clients then read file data directly from the DataNode servers, possibly in parallel. The NameNode is not directly involved in this bulk data transfer, keeping its overhead to a minimum.
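For reference, a minimal sketch of an HDFS read using the Java FileSystem API (the path is illustrative); the NameNode lookup and DataNode reads described above happen behind open() and the subsequent reads:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FSDataInputStream in = fs.open(new Path("/user/example/input.txt"));
IOUtils.copyBytes(in, System.out, 4096, true);   // stream the file to stdout and close it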
|
Using the Linux command line, how will you:
|
- List the number of files in an HDFS directory
|
- Create a directory in HDFS
|
- Copy a file from your local directory to HDFS
|
hadoop fs -ls <hdfs_dir>
hadoop fs -mkdir <hdfs_dir>
hadoop fs -put <local_file> <hdfs_path>
|
How will you write a custom partitioner for a Hadoop job?
|
To have Hadoop use a custom partitioner you will have to do, at a minimum, the following three things (a minimal sketch follows the list):
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the MapReduce job, either
  - add the custom partitioner to the job programmatically using the setPartitionerClass method, or
  - add the custom partitioner to the job as a config entry (if your wrapper reads from a config file or Oozie)
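A minimal sketch, assuming the new (org.apache.hadoop.mapreduce) API with Text keys and IntWritable values; the class name and partitioning rule are illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route each key to a reduce partition based on its first character.
        String k = key.toString();
        if (k.isEmpty()) {
            return 0;
        }
        return (k.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

In the driver it would then be registered with job.setPartitionerClass(FirstCharPartitioner.class).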
|
How did you debug your Hadoop code?
|
There can be several ways of doing this, but the most common ways are:
- By using counters
- Through the web interface provided by the Hadoop framework
|
Did you ever build a production process in Hadoop? If yes, then what was the process when your Hadoop job failed for any reason?
|
This is an open-ended question, but most candidates, if they have written a production job, should talk about some type of alert mechanism, for example an email being sent or their monitoring system raising an alert. Since Hadoop works on unstructured data, it is very important to have a good alerting system for errors, because unexpected data can very easily break the job.
|
Did you ever run into a lopsided job that resulted in an out-of-memory error? If yes, then how did you handle it?
|
This is an open-ended question, but a candidate who claims to be an intermediate developer and has worked on a large data set (10-20 GB minimum) should have run into this problem. There can be many ways to handle it, but the most common approach is to alter your algorithm and break the job down into more map-reduce phases, or to use a combiner if possible (see the sketch below).
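For reference, a hedged sketch of the combiner option, assuming an existing reducer (here called SumReducer, an illustrative name) whose aggregation is safe to apply to partial map output:

// in the job driver
job.setCombinerClass(SumReducer.class);   // pre-aggregates map output locally before the shuffle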
|
What is the Distributed Cache in Hadoop?
|
Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
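A minimal sketch of using it, assuming the old (JobConf/DistributedCache) API that matches the versions discussed above; the file path is illustrative:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
// ship an HDFS file to every task node before any task of the job starts there
DistributedCache.addCacheFile(new URI("/user/example/lookup.txt"), conf);
// inside a task, the local copies can then be located with
// Path[] cached = DistributedCache.getLocalCacheFiles(conf);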
|
What is the benefit of the Distributed Cache? Why can't we just have the file in HDFS and have the application read it?
|
This is because the Distributed Cache is much faster. It copies the file to all task trackers at the start of the job. If a task tracker then runs 10 or 100 mappers or reducers, they all use the same local copy from the Distributed Cache. On the other hand, if you put code in the MR job to read the file from HDFS, then every mapper will try to access it from HDFS, so if a task tracker runs 100 map tasks it will try to read this file 100 times from HDFS. HDFS is also not very efficient when used like this.
|
What mechanism does the Hadoop framework provide to synchronize changes made to the Distributed Cache during the runtime of the application?
|
This is a trick question. There is no such mechanism. The Distributed Cache is by design read-only during job execution.
|
Have you ever used Counters in Hadoop? Give us an example scenario.
|
Anybody who claims to have worked on a Hadoop project is expected to have used counters; a common example scenario is counting malformed or skipped input records, as in the sketch below.
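A minimal sketch of such a scenario, assuming the new (org.apache.hadoop.mapreduce) API; the enum and counter names are illustrative:

// declared in the mapper class
public enum RecordCounters { MALFORMED_RECORDS }

// inside map(), whenever a record fails to parse
context.getCounter(RecordCounters.MALFORMED_RECORDS).increment(1);

// after the job finishes, the driver can read the total
long bad = job.getCounters().findCounter(RecordCounters.MALFORMED_RECORDS).getValue();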
|
Is it possible to provide multiple inputs to Hadoop? If yes, then how can you give multiple directories as input to the Hadoop job?
|
Yes. The input format class provides methods to add multiple directories as input to a Hadoop job, as in the sketch below.
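A minimal sketch, assuming the new-API FileInputFormat and illustrative paths:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

FileInputFormat.addInputPath(job, new Path("/data/logs/2012-12-01"));
FileInputFormat.addInputPath(job, new Path("/data/logs/2012-12-02"));
// or as a single comma-separated list:
FileInputFormat.addInputPaths(job, "/data/logs/2012-12-01,/data/logs/2012-12-02");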
|
Is it possible to have Hadoop job output in multiple directories? If yes, then how?
|
Yes, by using the MultipleOutputs class, as in the sketch below.
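A minimal sketch, assuming the new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output); the named output and types are illustrative:

// in the driver: declare a named output in addition to the default one
MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class, Text.class, IntWritable.class);

// in the reducer: mos is created in setup() as new MultipleOutputs<Text, IntWritable>(context)
// and closed in cleanup(); the last argument places the files under a sub-directory of the output path
mos.write("errors", key, value, "errors/part");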
|
What will a Hadoop job do if you try to run it with an output directory that is already present? Will it:
|
- overwrite it
- warn you and continue
- throw an exception and exit
|
The Hadoop job will throw an exception and exit.
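For reference, a hedged sketch of the common workaround of removing a stale output directory in the driver before submitting the job (the path is illustrative):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Path out = new Path("/user/example/job-output");
FileSystem fs = FileSystem.get(conf);
if (fs.exists(out)) {
    fs.delete(out, true);   // recursive delete of the old output directory
}
FileOutputFormat.setOutputPath(job, out);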
|
How can you set an arbitrary number of mappers to be created for a job in Hadoop?
|
This is a trick question. You cannot set it directly; the number of map tasks is determined by the number of input splits produced by the InputFormat.
|
How can you set an arbitrary number of reducers to be created for a job in Hadoop?
|
You can either do it programmatically by using the setNumReduceTasks method of the JobConf class, or set it up as a configuration setting (see the sketch below).
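A minimal sketch of both options, assuming the old (JobConf) API named in the answer; the driver class name and the value 10 are illustrative:

JobConf conf = new JobConf(MyJob.class);
conf.setNumReduceTasks(10);              // programmatic

// or as a configuration setting, e.g. on the command line via ToolRunner:
// hadoop jar myjob.jar MyJob -D mapred.reduce.tasks=10 <input> <output>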
|
What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop cluster?
|
The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster. The JobTracker runs in its own JVM process, and in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs the following actions (from the Hadoop Wiki):
- Client applications submit jobs to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.
|