
Wednesday, May 3, 2017

Hadoop general interview questions


  1. Describe the architecture components of Hadoop.
  2. What OS-level optimisations do you apply on Hadoop nodes?
  3. What are the prerequisites before installing Hadoop?
  4. How do you bring data into the cluster?
  5. What do you need to check in order to copy data from one cluster to another? (see the distcp sketch after this list)
  6. Scenarios and use of the schedulers.
  7. How would you implement department-wise access levels on HDFS and YARN?
  8. Describe the job flow in YARN.
  9. How does resource allocation happen in YARN?
  10. What is the file read/write flow?
  11. How do different nodes in a cluster communicate with each other?
  12. How does a request flow through ZooKeeper?
  13. Explain the read/write pipeline in Hadoop.
  14. How do you do deployments on many servers at once?
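
For question 5, a hedged sketch of copying data between clusters with distcp (the namenode hosts nn1 and nn2, the port and the paths are placeholders, not from the original post):

    # copy a directory from the source cluster to the target cluster
    bin/hadoop distcp hdfs://nn1:8020/data/source hdfs://nn2:8020/data/target
    # -update copies only missing or changed files, -p preserves permissions and ownership
    bin/hadoop distcp -update -p hdfs://nn1:8020/data/source hdfs://nn2:8020/data/target

Before running it, check network connectivity between the clusters, Hadoop version compatibility (webhdfs:// or hftp:// can bridge major versions), free space on the target, and the permissions of the user running the copy.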

Wednesday, February 19, 2014

What are the functions of a scheduling algorithm?


  • Reduce the total amount of computation necessary to complete a job
  • Allow multiple users to share clusters in a predictable, policy-guided manner.
  • Run jobs at periodic times of the day.
  • Reduce job latencies in an environment with multiple jobs of different sizes.
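
To make the policy-guided sharing in the list above concrete, here is a hedged sketch of switching a Hadoop 1.x JobTracker from the default FIFO scheduler to the Fair Scheduler (the property and class names are from the 1.x fair scheduler contrib module; verify them against your version):

    # this <property> element goes inside <configuration> in conf/mapred-site.xml;
    # the heredoc below only prints the snippet for reference
    cat <<'EOF'
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    EOF

After adding the property (and putting the fair scheduler jar on the JobTracker classpath), restart the JobTracker so it picks up the new scheduler.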

Tuesday, December 4, 2012

Hadoop Interview Questions


What are the default configuration files that are used in Hadoop?
As of the 0.20 release, Hadoop supports the following read-only default configuration files:
- src/core/core-default.xml
- src/hdfs/hdfs-default.xml
- src/mapred/mapred-default.xml
How will you make changes to the default configuration files?
Hadoop does not recommend changing the default configuration files; instead, it recommends making all site-specific changes in the following files:
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:
- core-default.xml : Read-only defaults for hadoop.
- core-site.xml: Site-specific configuration for a given hadoop installation.
Hence, if the same property is defined in both core-default.xml and core-site.xml, the value from core-site.xml overrides the default (the same is true for the other two file pairs).
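
As a hedged illustration of such an override, here is a sketch of a property a site file might carry (the host name and port are placeholders; fs.default.name is the 0.20-era property name):

    # this <property> element goes inside <configuration> in conf/core-site.xml;
    # it overrides the file:/// default for fs.default.name shipped in core-default.xml
    cat <<'EOF'
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>
    EOF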


Monday, December 3, 2012

Hadoop Interview Question

 Here are some Hadoop administration questions you may expect. The answers you need to find yourself... :) I can give them, but I won't. If you find good answers, share them with me too :) hope you will, right? If you are not able to find them, let me know through the comments and I will post the answers too.


  • What is Hadoop? Briefly describe the components of Hadoop.
  • What are the Hadoop daemon processes, and what is the functionality of each?
  • What are the steps for configuring Hadoop?
  • What is the architecture of HDFS, and what is its read/write flow?
  • Can we have more than one set of configurations for a Hadoop cluster? How can you switch between these configurations?
  • What will be your troubleshooting approach in Hadoop?
  • What exceptions have you come across while working on Hadoop, and what was your approach to getting rid of those exceptions or errors?

Tuesday, September 18, 2012

Demystifying Hadoop concepts Series: Safe mode

 


What is safe mode in Hadoop? Many times we come across the exception “org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException”, or some other exception which contains safe mode in it :).

 

First, let me explain what safe mode means in the context of Hadoop. As we all know, the name node holds the fsimage (the metadata of the data present on the cluster), which can be large or small depending on the size of the cluster and the amount of data stored on it. When the name node starts, it loads this fsimage and the edit logs from disk into primary memory (RAM) for fast processing. After loading them, it waits for the data nodes to report the blocks present on those data nodes. During this process, i.e. while loading the fsimage and edit logs and waiting for the block reports from the data nodes, the name node stays in safe mode, which is a read-only mode for the name node. This is done to maintain the consistency of the data present; it is just like saying “I will not accept anything until I know what I already have.” During this period no modifications to the file blocks are allowed, so as to maintain the correctness of the data.

 

How long does safe mode last :

Generally the name node comes out of safe mode automatically once the data nodes have reported enough blocks to satisfy the configured replication threshold, plus a short extension period (30 seconds by default), provided the data present is consistent according to the fsimage and the edit logs.

 

Related commands :

Put the Namenode in Safemode: bin/hadoop dfsadmin -safemode enter

Leave Safemode: bin/hadoop dfsadmin -safemode leave

 

What to do if you encounter this exception :

 

First, wait a minute or two and then retry your command. If you just started your cluster, it's possible that it isn't fully initialized yet. If waiting a few minutes didn't help and you still get a "safe mode" error, check your logs to see if any of your data nodes didn't start correctly (either they have Java exceptions in their logs or they have messages stating that they are unable to contact some other node in your cluster). If this is the case you need to resolve the configuration issue (or possibly pick some new nodes) before you can continue.
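
A few commands that help while diagnosing this, using the same dfsadmin and fsck tools mentioned above (run from the Hadoop install directory):

    # check whether the name node is still in safe mode
    bin/hadoop dfsadmin -safemode get
    # summary of live/dead data nodes and the capacity they have reported
    bin/hadoop dfsadmin -report
    # look for missing, corrupt or under-replicated blocks
    bin/hadoop fsck /
    # only if you are sure the missing blocks will never be reported,
    # force the name node out of safe mode
    bin/hadoop dfsadmin -safemode leave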

 



Tuesday, April 17, 2012

What is the difference between HDFS and NAS ?

    The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. Following are the differences between HDFS and NAS:
    • In HDFS, data blocks are distributed across the local drives of all machines in a cluster (a small fsck sketch follows this list), whereas in NAS data is stored on dedicated hardware.
    • HDFS is designed to work with the Map Reduce system, since computation is moved to the data. NAS is not suitable for Map Reduce, since data is stored separately from the computations.
    • HDFS runs on a cluster of machines and provides redundancy through replication, whereas NAS is served by a single machine and therefore does not provide data redundancy.
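
To see the block distribution described in the first bullet, a small sketch (the path is just an example):

    # list the blocks of a file and the data nodes holding each replica
    bin/hadoop fsck /user/hadoop/sample.txt -files -blocks -locations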

What is a Job Tracker in Hadoop? How many instances of Job Tracker run on a Hadoop Cluster?

    Job Tracker is the daemon service for submitting and tracking Map Reduce jobs in Hadoop. There is only one Job Tracker process running on any Hadoop cluster. The Job Tracker runs in its own JVM process, and in a typical production cluster it runs on a separate machine. Each slave node is configured with the Job Tracker node location. The Job Tracker is a single point of failure for the Hadoop Map Reduce service: if it goes down, all running jobs are halted. The Job Tracker performs the following actions (from the Hadoop Wiki); a small client-side sketch follows the list:
    • Client applications submit jobs to the Job tracker.
    • The JobTracker talks to the NameNode to determine the location of the data.
    • The JobTracker locates TaskTracker nodes with available slots at or near the data.
    • The JobTracker submits the work to the chosen TaskTracker nodes.
    • The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
    • A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
    • When the work is completed, the JobTracker updates its status.
    • Client applications can poll the JobTracker for information.
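
A hedged sketch of the client side of this flow on a Hadoop 1.x cluster (the jar, paths and job id are placeholders):

    # submit a Map Reduce job; the client hands it to the JobTracker
    bin/hadoop jar hadoop-examples.jar wordcount /user/hadoop/input /user/hadoop/output
    # poll the JobTracker for running jobs and for the status of a specific job
    bin/hadoop job -list
    bin/hadoop job -status <job_id>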

What are compute and storage nodes?

Compute Node: This is the computer or machine where your actual business logic will be executed.

Storage Node: This is the computer or machine where the file system resides to store the data being processed.

In most cases, the compute node and the storage node are the same machine.

What is Map Reduce ?

Map Reduce is an algorithm or concept for processing huge amounts of data in a faster way. As per its name, you can divide it into Map and Reduce.

  • The main Map Reduce job usually splits the input data set into independent chunks (a big data set into multiple small data sets).
  • Map task: processes these chunks in a completely parallel manner (one node can process one or more chunks).
  • The framework sorts the outputs of the maps.
  • Reduce task: the sorted map output becomes the input for the reduce tasks, which produce the final result.

Your business logic is written in the Map task and the Reduce task.

Typically both the input and the output of the job are stored in a file system (not a database). The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
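
A minimal sketch of this split/map/sort/reduce flow using Hadoop Streaming, so that plain shell commands act as the Map and Reduce tasks (the streaming jar path varies by version and the HDFS paths are placeholders):

    # the mapper passes each line through unchanged, the framework sorts the lines,
    # and the reducer counts the lines, words and characters it receives
    bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar \
      -input /user/hadoop/input \
      -output /user/hadoop/output-wc \
      -mapper /bin/cat \
      -reducer /usr/bin/wc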

What is the Hadoop framework?

Hadoop is an open-source framework written in Java by the Apache Software Foundation. This framework is used to write software applications that need to process vast amounts of data (it can handle multiple terabytes of data). It works in parallel on large clusters, which can have thousands of computers (nodes), and it processes data in a very reliable and fault-tolerant manner.
