org.apache.hadoop.mapreduce.Mapper ( and ) org.apache.hadoop.mapreduce.Reducer
Sunday, April 22, 2012
Thursday, April 19, 2012
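How to calculate the median in Hive
Use percentile(BIGINT col, p)
and set p to 0.5.
The 0.5 percentile is the median :) Note that percentile() works only on integer columns; for non-integral columns use percentile_approx() instead.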
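To see why the 0.5 percentile is the median, here is a minimal sketch in plain Java. The class and method names are hypothetical and this is not Hive's implementation; it assumes linear interpolation between the two middle values when the number of values is even.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Hive: computes the p-th percentile of
// integer values, assuming linear interpolation between neighbouring ranks.
class PercentileSketch {

    static double percentile(long[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be between 0 and 1");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);

        // Fractional position of the requested percentile in the sorted array.
        double position = (sorted.length - 1) * p;
        int lower = (int) Math.floor(position);
        int upper = (int) Math.ceil(position);
        if (lower == upper) {
            return sorted[lower];
        }
        // Interpolate between the two neighbouring values.
        return sorted[lower] * (upper - position) + sorted[upper] * (position - lower);
    }

    public static void main(String[] args) {
        System.out.println(percentile(new long[] {5, 1, 3}, 0.5));    // middle value
        System.out.println(percentile(new long[] {4, 1, 3, 2}, 0.5)); // between 2 and 3
    }
}
```

With p = 0.5 the position falls exactly on the middle element for an odd count, and halfway between the two middle elements for an even count.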
Tuesday, April 17, 2012
What is the difference between HDFS and NAS?
- The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. The following are the differences between HDFS and NAS:
- In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS, data is stored on dedicated hardware.
- HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not well suited to MapReduce, since the data is stored separately from the computation.
- HDFS runs on a cluster of machines and provides redundancy through a replication protocol, whereas NAS is typically provided by a single machine and therefore does not by itself provide redundancy across machines.
What is a Job Tracker in Hadoop? How many instances of Job Tracker run on a Hadoop Cluster?
- The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster. The JobTracker runs in its own JVM process; in a typical production cluster it runs on a separate machine. Each slave node is configured with the location of the JobTracker node. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs the following actions (from the Hadoop Wiki):
- Client applications submit jobs to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- A TaskTracker will notify the JobTracker when a task fails. The JobTracker then decides what to do: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.
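The heartbeat step in the list above can be sketched as follows. This is only an illustrative model in plain Java, not the Hadoop API: the class name, the method names, and the timeout value are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of heartbeat tracking; this is not the Hadoop API.
class HeartbeatMonitorSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<String, Long>();

    HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a TaskTracker reports in.
    void recordHeartbeat(String trackerId, long nowMillis) {
        lastHeartbeat.put(trackerId, nowMillis);
    }

    // Trackers whose last heartbeat is older than the timeout are deemed failed;
    // their work would then be scheduled on a different tracker.
    List<String> failedTrackers(long nowMillis) {
        List<String> failed = new ArrayList<String>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }
}
```

A scheduler built on such a model would call failedTrackers() periodically and reassign the tasks of every tracker it returns.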
What are compute and storage nodes?
Compute node: the computer or machine where your actual business logic is executed.
Storage node: the computer or machine where the file system resides and the data to be processed is stored.
In most cases the compute node and the storage node are the same machine.
What is MapReduce?
MapReduce is a programming model for processing huge amounts of data quickly by working in parallel. As the name suggests, it is divided into Map and Reduce.
- The main MapReduce job usually splits the input data set into independent chunks (one big data set into multiple small data sets).
- Map task: processes these chunks in a completely parallel manner (one node can process one or more chunks).
- The framework sorts the outputs of the maps.
- Reduce task: takes the sorted map output as its input and produces the final result.
Your business logic is written in the map task and the reduce task.
Typically both the input and the output of the job are stored in a file system (not a database). The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
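The map, sort, and reduce steps can be illustrated with a small in-memory word count. This is a sketch in plain Java that does not use the Hadoop API; the class and method names are hypothetical, and the chunks are processed one after another here, whereas Hadoop would process them in parallel across nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map -> sort/group -> reduce flow.
// It does not use the Hadoop API; all names here are hypothetical.
class WordCountSketch {

    // Map step: emit a (word, 1) pair for every word in one input chunk.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: combine all values emitted for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> chunks) {
        // Sort/group step: a TreeMap keeps keys sorted and groups values by key.
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (String chunk : chunks) {
            for (Map.Entry<String, Integer> pair : map(chunk)) {
                List<Integer> values = grouped.get(pair.getKey());
                if (values == null) {
                    values = new ArrayList<Integer>();
                    grouped.put(pair.getKey(), values);
                }
                values.add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }
}
```

Calling WordCountSketch.run with the two chunks "big data" and "Big cluster" yields a count of 2 for "big" and 1 each for "cluster" and "data".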
What is the Hadoop framework?
Hadoop is an open source framework written in Java and developed by the Apache Software Foundation. It is used to write software applications that need to process vast amounts of data (it can handle multiple terabytes). It works in parallel on large clusters, which can contain thousands of computers (nodes), and it processes data in a reliable and fault-tolerant manner. The image below shows what it looks like.
Thursday, April 12, 2012
What is strongly typed?
- It means that every variable has a predefined data type that cannot be changed arbitrarily. The language compiler enforces this by checking types at compile time.
- Put more clearly, a programming language is strongly typed in this sense if it does not allow a variable to be defined without a data type. Take C as an example: whenever we declare a variable we must also specify its data type, e.g. int a, char b. Other languages are loosely typed: we can just write var a or $var, and the variable can hold a value of any data type.
- Another way to put it: in a strongly typed language the data type is checked at compile time, whereas in a loosely typed language this is done at run time. (Checking at compile time versus run time is also commonly called static versus dynamic typing.)
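As a rough illustration in Java (the class and method names are hypothetical): a declared int is checked by the compiler, while a value held as Object is only checked when the program runs.

```java
// Hypothetical demo class; the names are for illustration only.
class TypeCheckSketch {

    // Run-time check: is the value actually an Integer?
    static boolean isInteger(Object value) {
        return value instanceof Integer;
    }

    // Casting an Object to the wrong type compiles but fails at run time.
    static boolean castFails(Object value) {
        try {
            Integer number = (Integer) value;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int a = 5;                // the type is declared up front
        // a = "hello";           // would be rejected by the compiler
        Object loose = "hello";   // Object can hold any reference type
        System.out.println(isInteger(a));      // the int is boxed to an Integer
        System.out.println(castFails(loose));  // a String cannot be cast to Integer
    }
}
```

The commented-out assignment is the compile-time check in action; the cast inside castFails is the run-time counterpart.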
Why generic types?
- They help us separate logic from data type: no matter what data type we pass to a method, it is handled by the same function.
- Put another way, we can avoid writing many overloaded functions: one function can handle different data types, rather than defining many functions with different parameter types.
- Generic type invocation can be thought of as similar to an ordinary method invocation, but instead of passing an argument to a method, you are passing a type argument.
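The points above can be sketched in Java as follows. GenericsSketch, lastElement, and Box are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example; the names are for illustration only.
class GenericsSketch {

    // One generic method handles lists of any element type T,
    // instead of one overload per element type.
    static <T> T lastElement(List<T> items) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("list must not be empty");
        }
        return items.get(items.size() - 1);
    }

    // A generic type: Box<T> stores a value of whatever type argument is passed.
    static final class Box<T> {
        private final T value;

        Box(T value) {
            this.value = value;
        }

        T get() {
            return value;
        }
    }

    public static void main(String[] args) {
        String s = lastElement(Arrays.asList("a", "b", "c"));
        Integer n = lastElement(Arrays.asList(1, 2, 3));
        Box<String> box = new Box<String>("hello");  // String is the type argument
        System.out.println(s + " " + n + " " + box.get());
    }
}
```

Writing Box<String> is the "generic type invocation" mentioned above: String is passed as a type argument in much the same way a value is passed to a method.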
Sunday, April 1, 2012
What is the difference between const and static readonly?
In C#, the difference is that the value of a readonly field is set at run time, so it can have a different value for different executions of the program. The value of a const field, however, is set to a compile-time constant.
Readonly instance fields:
- Must have their value set by the time the constructor exits.
- Are evaluated when the instance is created.
How to make a class immutable?
An immutable class is a class whose objects cannot be modified after creation, which means that only the constructor can write the field values of the class.
Java:
In Java, the best way to make a class immutable is to declare all of its fields final. Declaring a field final prevents it from being reassigned outside the constructor, and it also takes care of memory visibility: once the constructor has finished, other threads are guaranteed to see the values of the final fields.
Code Example :
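A minimal sketch of such a class, assuming a hypothetical ImmutablePerson with three fields. Beyond the final fields described above, the sketch also makes the class itself final, omits setters, and copies the incoming list, because final only protects the reference and not the object it points to. These extras are common practice rather than something stated in the text above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example class; the name and fields are for illustration only.
final class ImmutablePerson {
    private final String name;
    private final int age;
    private final List<String> tags;

    ImmutablePerson(String name, int age, List<String> tags) {
        this.name = name;
        this.age = age;
        // Defensive copy so later changes to the caller's list cannot leak in.
        this.tags = Collections.unmodifiableList(new ArrayList<String>(tags));
    }

    String getName() {
        return name;
    }

    int getAge() {
        return age;
    }

    List<String> getTags() {
        return tags;
    }

    // "Modification" returns a new object instead of changing this one.
    ImmutablePerson withAge(int newAge) {
        return new ImmutablePerson(name, newAge, tags);
    }
}
```

Calling withAge(31) on a person aged 30 leaves the original object untouched and returns a second object with the new age.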
How to check if a linked list is circular or contains a loop?
This is a common interview question, and you are expected to give at least a logical answer. The idea is to create two markers (pointers) and move them through the list in a loop, one faster than the other. If the two pointers meet before the list ends (while neither is null), the linked list is circular. If the fast pointer reaches the end of the list (null) first, the linked list is not circular.
Example code:
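    #include <stdbool.h>

    /* Assumed minimal node layout; real lists would also carry data. */
    struct Node { struct Node *next; };

    /* Returns true if the list starting at head contains a loop. */
    bool isCircular(struct Node *head) {
        struct Node *SlowPointer = head, *FastPointer = head;
        while (FastPointer && FastPointer->next) {
            SlowPointer = SlowPointer->next;              /* moves one node */
            FastPointer = FastPointer->next->next;        /* moves two nodes */
            if (SlowPointer == FastPointer) return true;  /* pointers met: circular */
        }
        return false;  /* fast pointer reached the end: not circular */
    }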