Monday, February 24, 2014
Comparison between Big Data and RDBMS
             | RDBMS                     | Big Data
Data size    | Gigabytes                 | Petabytes
Access       | Interactive and batch     | Batch
Updates      | Read and write many times | Write once, read many times
Structure    | Static schema             | Dynamic schema
Integrity    | High                      | Low
Scaling      | Nonlinear                 | Linear
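
The "write once, read many times" row is the key behavioral difference. Here is a minimal sketch of that access pattern with the HDFS Java client (the path and file contents are only illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/events.log"); // illustrative path

            // Write once: the file is created and closed,
            // then never rewritten in place.
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("event-1");
            }

            // Read many times: repeated whole-file scans,
            // typical of batch jobs.
            for (int i = 0; i < 3; i++) {
                try (FSDataInputStream in = fs.open(path)) {
                    System.out.println(in.readUTF());
                }
            }
        }
    }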
Friday, February 21, 2014
What happens during major compaction in HBase?
1. Deletes data that is masked by tombstones
2. Deletes data whose TTL has expired
3. Compacts several small HFiles into a single larger one
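
A major compaction can also be requested explicitly from a client. A minimal sketch, assuming the 2014-era HBaseAdmin API and an illustrative table name (newer releases expose the same call as Admin.majorCompact(TableName)):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class TriggerMajorCompaction {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // Asynchronously asks the region servers to run a major
            // compaction on every region of the table.
            admin.majorCompact("mytable"); // illustrative table name
            admin.close();
        }
    }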
Wednesday, February 19, 2014
List some limitations of Hadoop.
- Write-once model
  - There is a plan to support appending writes
- A namespace with an extremely large number of files exceeds the Namenode's capacity to maintain
- Cannot be mounted by an existing OS
  - Getting data in and out is tedious
  - A Virtual File System could solve this problem
- Java API
  - A Thrift API is available for use from other languages
- HDFS does not implement or support:
  - User quotas
  - Access permissions
  - Hard or soft links
  - Data-balancing schemes
- No periodic checkpoints
- The Namenode is a single point of failure
  - Automatic restart and failover to another machine are not yet supported
List the steps taken on Datanode failure.
- The Namenode marks Datanodes without a recent Heartbeat as dead
- It does not forward any new I/O requests to them
- It constantly tracks which blocks must be replicated via the BlockMap
- It initiates replication whenever necessary
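
The "recent Heartbeat" window is configurable. A sketch of how the dead-node timeout is derived, assuming Hadoop 2.x property names; with the defaults shown, a Datanode is declared dead after 2 × 300 s + 10 × 3 s = 630 s (10.5 minutes):

    import org.apache.hadoop.conf.Configuration;

    public class DeadNodeTimeout {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Defaults: recheck every 300 000 ms, heartbeat every 3 s.
            long recheckMs = conf.getLong(
                    "dfs.namenode.heartbeat.recheck-interval", 300_000L);
            long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3L);
            long deadTimeoutMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
            System.out.println("Datanode declared dead after "
                    + deadTimeoutMs + " ms without a Heartbeat");
        }
    }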
Please describe the steps of checkpointing in Hadoop.
- It is performed by the Namenode
- Two versions of the FsImage exist:
  - One stored on disk
  - One in memory
- The Namenode applies all transactions in the EditLog to the in-memory FsImage
- It flushes the new FsImage to disk
- It truncates the EditLog
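
A toy Java model of those three steps (illustration only, not the real Namenode code):

    import java.util.ArrayList;
    import java.util.List;

    public class ToyCheckpoint {
        static List<String> fsImageOnDisk = new ArrayList<>();   // persisted FsImage
        static List<String> fsImageInMemory = new ArrayList<>(); // in-memory FsImage
        static List<String> editLog = new ArrayList<>();         // pending transactions

        static void checkpoint() {
            // Apply all EditLog transactions to the in-memory FsImage.
            fsImageInMemory.addAll(editLog);
            // Flush the updated FsImage to disk.
            fsImageOnDisk = new ArrayList<>(fsImageInMemory);
            // Truncate the EditLog now that its transactions are durable.
            editLog.clear();
        }

        public static void main(String[] args) {
            editLog.add("mkdir /user");
            editLog.add("create /user/a.txt");
            checkpoint();
            System.out.println("On disk: " + fsImageOnDisk
                    + ", EditLog: " + editLog);
        }
    }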
What are the steps of the Namenode startup process?
- The Namenode enters Safemode
  - Replication does not occur in Safemode
- Each Datanode sends a Heartbeat
- Each Datanode sends a Blockreport
  - It lists all HDFS data blocks held by that Datanode
- The Namenode creates a Blockmap from the Blockreports
- The Namenode exits Safemode
- It replicates any under-replicated blocks
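
A toy sketch of building the Blockmap from Blockreports and checking the Safemode exit condition (illustration only; the 0.999 threshold mirrors the default of dfs.namenode.safemode.threshold-pct):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class ToyStartup {
        public static void main(String[] args) {
            long expectedBlocks = 4;
            double threshold = 0.999; // mirrors dfs.namenode.safemode.threshold-pct

            // Blockreports: each Datanode lists the block ids it stores.
            Map<String, long[]> blockReports = new HashMap<>();
            blockReports.put("dn1", new long[]{1, 2, 3});
            blockReports.put("dn2", new long[]{2, 3, 4});

            // Build the Blockmap: block id -> Datanodes holding a replica.
            Map<Long, Set<String>> blockMap = new HashMap<>();
            for (Map.Entry<String, long[]> report : blockReports.entrySet()) {
                for (long blockId : report.getValue()) {
                    blockMap.computeIfAbsent(blockId, k -> new HashSet<>())
                            .add(report.getKey());
                }
            }

            // Exit Safemode once enough blocks have reported a replica.
            boolean inSafemode =
                    (double) blockMap.size() / expectedBlocks < threshold;
            System.out.println("Blockmap: " + blockMap
                    + ", in Safemode: " + inSafemode);
        }
    }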
What is one undetected problem that may occur when a MapReduce job is submitted to Hadoop?
Due to a bug in the code, a task can go into an infinite loop.
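
A hypothetical BuggyMapper, invented for illustration. The framework kills a task that reports no progress for mapreduce.task.timeout (600 000 ms by default), but writing an output record counts as progress, so a loop like this can spin forever undetected:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper, invented for illustration.
    public class BuggyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            int i = 0;
            while (i < value.getLength()) {
                // Writing output counts as progress, so the framework's
                // mapreduce.task.timeout never fires for this task.
                context.write(new Text("chars"), new LongWritable(1));
                // Bug: i++ was forgotten, so the loop never terminates.
            }
        }
    }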
What are the functions of a scheduling algorithm?
- Reduce the total amount of computation necessary to complete a job
- Allow multiple users to share clusters in a predictable, policy-guided manner.
- Run jobs at periodic times of the day.
- Reduce job latencies in an environment with multiple jobs of different sizes.
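
For the multi-user sharing point, a sketch of selecting a policy-guided scheduler, assuming YARN (Hadoop 2.x) property and class names:

    import org.apache.hadoop.conf.Configuration;

    public class SchedulerConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Use the Fair Scheduler so multiple users share the cluster
            // in a predictable, policy-guided manner.
            conf.set("yarn.resourcemanager.scheduler.class",
                     "org.apache.hadoop.yarn.server.resourcemanager"
                     + ".scheduler.fair.FairScheduler");
            System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
        }
    }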