The core of Hadoop includes HDFS and MapReduce.  There are many projects around Hadoop.  The term "data lake" is often associated with Hadoop-oriented object storage.

Trying Hadoop (on Debian 8) is now quite easy.

How to set up

Following the

(1) apt-get install openjdk-7-jdk sudo rsync


tar -zxvf hadoop-2.7.2.tar.gz

vi etc/hadoop/, change:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
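If Java lives somewhere else on your machine, the value for JAVA_HOME can be derived from the resolved path of the java binary. This is only a small sketch: the helper name java_home_from_binary is made up for illustration, and it assumes the usual .../bin/java or .../jre/bin/java layout.

```shell
# Hypothetical helper: turn the full path of a java binary into a JAVA_HOME value.
# Assumes the binary sits in <home>/bin/java or <home>/jre/bin/java.
java_home_from_binary() {
    java_bin=$1
    home=${java_bin%/bin/java}   # strip the trailing /bin/java
    home=${home%/jre}            # strip a trailing /jre, if there is one
    printf '%s\n' "$home"
}
```

On Debian it could be called as java_home_from_binary "$(readlink -f "$(command -v java)")" and the result pasted into the export line.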

(2) For pseudo-distributed operation
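For reference, the upstream single-node guide for Hadoop 2.7.x sets fs.defaultFS to hdfs://localhost:9000 in core-site.xml and dfs.replication to 1 in hdfs-site.xml. The sketch below writes those two files; the function name write_pseudo_dist_conf is invented here, and the values are an assumption taken from that guide rather than from this post, so adjust them to your setup.

```shell
# Hypothetical helper: write minimal pseudo-distributed settings into <conf-dir>.
# The property values are assumed from the upstream single-node guide.
# Note: this overwrites any existing core-site.xml and hdfs-site.xml.
write_pseudo_dist_conf() {
    conf_dir=$1

    # Default filesystem: a NameNode on this machine
    cat > "$conf_dir/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

    # One DataNode only, so keep a single replica of each block
    cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}
```

From the unpacked Hadoop directory it would be run as write_pseudo_dist_conf etc/hadoop.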





(3) Format the NameNode, then start HDFS
bin/hdfs namenode -format
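Formatting alone does not start any daemons; in Hadoop 2.x they are started with sbin/start-dfs.sh, and jps shows which Java processes are running. As a rough sketch, the helper below (missing_hdfs_daemons is a made-up name) reads jps output on stdin and prints the HDFS daemons that are not there. It assumes the pseudo-distributed set of NameNode, DataNode and SecondaryNameNode.

```shell
# Hypothetical helper: read `jps` output on stdin, print each expected HDFS daemon
# that is NOT running. Prints nothing when all three are up.
missing_hdfs_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode; do
        # jps prints "<pid> <name>"; match the whole line so that
        # "NameNode" does not match "SecondaryNameNode"
        printf '%s\n' "$jps_output" | grep -qE "^[0-9]+ $daemon\$" ||
            printf '%s\n' "$daemon"
    done
}
```

Typical use would be sbin/start-dfs.sh followed by jps | missing_hdfs_daemons.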

Now you have Hadoop to play with!

HDFS: one NameNode (the master; holds metadata about the cluster) and multiple DataNodes (store the actual data)

e.g.: bin/hdfs dfs -mkdir /user

bin/hdfs dfs -ls / (cp/mv/df/du/find/rm/rmdir/tail/truncate work like the similar Unix/Linux shell commands; there are also get/put/setrep)
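To make put/get/setrep a bit more concrete, here is a small round-trip sketch: upload a local file, list it, set its replication to 1, and download it again. The function name hdfs_roundtrip and the HDFS_CMD override are invented for illustration; the dfs subcommands themselves (-mkdir -p, -put, -ls, -setrep -w, -get) are standard ones.

```shell
# Hypothetical helper: round trip of one local file through HDFS.
# HDFS_CMD is an illustrative override hook; by default bin/hdfs is used.
hdfs_roundtrip() {
    local_file=$1
    hdfs_dir=$2
    hdfs=${HDFS_CMD:-bin/hdfs}
    name=$(basename "$local_file")

    "$hdfs" dfs -mkdir -p "$hdfs_dir" &&                  # create the target directory
    "$hdfs" dfs -put "$local_file" "$hdfs_dir/" &&        # local -> HDFS
    "$hdfs" dfs -ls "$hdfs_dir" &&                        # check that it arrived
    "$hdfs" dfs -setrep -w 1 "$hdfs_dir/$name" &&         # one replica, wait until done
    "$hdfs" dfs -get "$hdfs_dir/$name" "$local_file.copy" # HDFS -> local
}
```

For example: hdfs_roundtrip /tmp/notes.txt /user/demo (both paths are placeholders).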

bin/hadoop fs

http://localhost:50070 (NameNode dashboard)

MapReduce 1.x: JobTracker (master service that schedules and monitors jobs), TaskTracker (runs the tasks, typically on the same nodes as the DataNodes)

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
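What the grep example computes is easy to imitate on one machine with a plain pipeline, which also shows the map / shuffle / reduce idea in miniature. This is only a local sketch and not how Hadoop runs the job: the name local_grep_count is made up, it relies on grep -o and -h as found in GNU grep, and it assumes matches contain no whitespace.

```shell
# Hypothetical helper: local imitation of the grep example's data flow.
# Usage: local_grep_count <input-dir> <extended-regex>
local_grep_count() {
    input_dir=$1
    pattern=$2
    # "map": emit every match on its own line (-o only the match, -h no file names)
    grep -ohE "$pattern" "$input_dir"/* |
        # "shuffle": bring identical keys next to each other
        sort |
        # "reduce": count each distinct key
        uniq -c |
        # highest count first, ties ordered by key
        sort -k1,1nr -k2,2 |
        # print as "count<TAB>match"
        awk '{ print $1 "\t" $2 }'
}
```

For example, local_grep_count etc/hadoop 'dfs[a-z.]+' counts the dfs.* names in the local config files.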

YARN (MapReduce 2.x):

A good video is at:

