Hands-on session: HDFS

1. Let's retrieve the Docker image with Hadoop pre-installed:

docker pull matnar/hadoop:3.3.2

(The source code to create or customize the matnar/hadoop Docker image is available on GitHub: https://github.com/matnar/docker-hadoop )

2. We can now create an isolated network with 3 datanodes and a namenode:

docker network create --driver bridge hadoop_network
docker run -t -i -p 9864:9864 -d --network=hadoop_network --name=slave1 matnar/hadoop
docker run -t -i -p 9863:9864 -d --network=hadoop_network --name=slave2 matnar/hadoop
docker run -t -i -p 9862:9864 -d --network=hadoop_network --name=slave3 matnar/hadoop
docker run -t -i -p 9870:9870 --network=hadoop_network --name=master matnar/hadoop

3. Before we start, let's check the configuration files. Since we are interested only in HDFS, we explore core-site.xml, which contains the location of the namenode (master), and hdfs-site.xml, which contains the default replication factor of files stored on HDFS. The "workers" file lists the datanodes available for storing data. Note that Docker helps us here: it automatically adds the container names to /etc/hosts, so we can connect to "slave1", for example, without knowing the IP address the network assigned to it.

cd $HADOOP_HOME/etc/hadoop
vi core-site.xml
vi hdfs-site.xml
vi workers

4. The HDFS configuration is ready, so we can initialize our system. As with any file system, we need to format our storage units. Formatting erases all stored data, hence it has to be performed only the first time we activate HDFS.

hdfs namenode -format

5. Everything is ready. We can start HDFS from the namenode (master). The startup script will automatically start the datanodes (configured in the "workers" file):

$HADOOP_HOME/sbin/start-dfs.sh

6. It is running. Let's explore the (empty) HDFS:

http://localhost:9870/dfshealth.html
hdfs dfsadmin -report

7. Let's play around: basic operations

============================================
Basic operations
-----------------
echo "File content" >> file
hdfs dfs -put file /file
hdfs dfs -ls /
hdfs dfs -mv /file /democontent
hdfs dfs -cat /democontent
hdfs dfs -appendToFile file /democontent
hdfs dfs -cat /democontent
hdfs dfs -mkdir /folder01
hdfs dfs -cp /democontent /folder01/text
hdfs dfs -ls /folder01
hdfs dfs -rm /democontent
hdfs dfs -get /folder01/text textfromhdfs
cat textfromhdfs
hdfs dfs -rm -r /folder01

============================================
Snapshot
-----------------
hdfs dfs -mkdir /snap
hdfs dfs -cp /debs /snap/debs
hdfs dfsadmin -allowSnapshot /snap
hdfs dfs -createSnapshot /snap snap001

# Listing the snapshots:
hdfs dfs -ls /snap/.snapshot

# Listing the files in snapshot snap001:
hdfs dfs -ls /snap/.snapshot/snap001

# Copying a file from snapshot snap001:
hdfs dfs -cp -ptopax /snap/.snapshot/snap001/debs /debs

hdfs dfs -deleteSnapshot /snap snap001
hdfs dfsadmin -disallowSnapshot /snap
hdfs dfs -rm -r /snap

============================================
Replication
-----------------
hdfs dfs -mkdir /norepl
hdfs dfs -put file /norepl/file
hdfs dfs -ls /norepl
hdfs dfs -setrep 1 /norepl
hdfs dfs -ls /norepl
hdfs dfs -put file /norepl/file2
hdfs dfs -ls /norepl
hdfs dfs -setrep 1 /norepl/file2
# check block availability from the web UI (and again after a while)

8. Replication is meant for fault tolerance. Let's check the system's behavior when a datanode is terminated forcefully. We consider a file previously uploaded to HDFS and check which datanodes store its blocks.
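One way to perform this check is with hdfs fsck (a minimal sketch; /path/to/file is a placeholder for whichever file we consider), which lists the file's blocks together with the datanodes hosting them:

hdfs fsck /path/to/file -files -blocks -locations

The same block placement can also be inspected by browsing the file from the web UI.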
Afterwards, we kill the corresponding datanode; assuming a block of the file is stored on node slave2:

# Check the availability of the file
# Kill the datanode (e.g., 529b2e16cbea)
docker kill slave2

We can now monitor the datanode states from:

http://localhost:9870/dfshealth.html#tab-datanode

After a while, the data is automatically re-replicated on a healthy node (a command-line check is sketched at the end of this document).

9. We can turn off our HDFS:

$HADOOP_HOME/sbin/stop-dfs.sh

10. We can stop and remove our Docker environment:

docker kill master slave1 slave2 slave3
docker rm master slave1 slave2 slave3
docker network rm hadoop_network
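Going back to step 8: while the cluster is still running (i.e., before steps 9 and 10), the automatic re-replication can also be verified from the command line instead of the web UI. A minimal sketch, again with /path/to/file standing for the file under consideration:

hdfs dfsadmin -report
hdfs fsck /path/to/file -files -blocks -locations

Once the namenode marks the killed datanode as dead, the report reflects it, and after a while fsck should show every block of the file with a full set of replicas on the remaining live datanodes.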