Installing a Single-Node Hadoop Cluster in Pseudo-Distributed Mode on Ubuntu

October 08, 2019

Hello.
In this post I install a single-node Hadoop cluster on Ubuntu in pseudo-distributed mode and then walk through the word count example.


Reference: https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

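Downloading Hadoop

Download Hadoop from the Apache Download Mirrors. For convenience, I renamed the extracted folder to hadoop and placed it under my home directory. The commands in this post use relative paths such as bin/hadoop, so run them from inside that folder.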

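Installing SSH

Install SSH as shown below.
The Hadoop start and stop scripts connect to localhost over SSH to manage the daemons, so an SSH server has to be running. The official guide also recommends installing pdsh.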

  $ sudo apt-get install ssh
  $ sudo apt-get install pdsh

Installing Java

Install Java, then add one line to etc/hadoop/hadoop-env.sh that points to the Java installation, as shown below.
The java binary must exist at $JAVA_HOME/bin/java for this to work.

  # set to the root of your Java installation
  export JAVA_HOME=/usr/java/latest
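
If you are not sure which path to use for JAVA_HOME, the sketch below derives a candidate from the java binary on the PATH. It assumes GNU readlink (the default on Ubuntu); the function name java_home_from_binary is my own and not part of Hadoop.

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from the path of a java binary.
java_home_from_binary() {
  # Resolve symlinks such as /usr/bin/java -> /etc/alternatives/java -> real JDK path.
  resolved="$(readlink -f "$1")" || return 1
  # Strip the trailing /bin/java to get the installation root.
  printf '%s\n' "${resolved%/bin/java}"
}

if command -v java >/dev/null 2>&1; then
  # Print the line that could be added to etc/hadoop/hadoop-env.sh.
  echo "export JAVA_HOME=$(java_home_from_binary "$(command -v java)")"
else
  echo "java not found on PATH" >&2
fi
```

Check the printed path before using it: $JAVA_HOME/bin/java should exist.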

If everything is set up correctly, the command below runs without errors and prints the usage documentation for the hadoop script.

  $ bin/hadoop

Configuration files

Create (or edit) etc/hadoop/core-site.xml so that it contains the following.

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Create (or edit) etc/hadoop/hdfs-site.xml so that it contains the following.

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
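
If you would rather script this step, here is a minimal sketch that writes both files with the values shown above. The function name write_pseudo_dist_conf is my own, and it overwrites any existing files in the target directory, so treat it as an illustration rather than the official procedure.

```shell
# Hypothetical helper: write the two pseudo-distributed config files into a directory.
# WARNING: overwrites core-site.xml and hdfs-site.xml if they already exist there.
write_pseudo_dist_conf() {
  conf_dir="$1"
  mkdir -p "$conf_dir" || return 1

  # Default filesystem: the local NameNode on port 9000.
  cat > "$conf_dir/core-site.xml" <<'EOF'
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
EOF

  # Only one DataNode exists, so keep a single replica of each block.
  cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
EOF
}

# Usage (from the Hadoop folder): write_pseudo_dist_conf etc/hadoop
```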

Checking SSH

Check that the SSH server installed earlier works.

  $ ssh localhost

Login must work without a passphrase. If you are prompted for one, set up key-based login with the commands below.

  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys
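
One small caveat: running the cat >> line more than once appends the same key again each time. The sketch below appends the key only when it is not already present. The function name append_key_once is my own, and it assumes the public key file holds a single line.

```shell
# Hypothetical helper: add a public key to an authorized_keys file only once.
append_key_once() {
  pub_file="$1"
  auth_file="$2"
  key="$(cat "$pub_file")" || return 1
  # Make sure the target file exists before searching it.
  touch "$auth_file" || return 1
  # -x: match the whole line, -F: treat the key as a fixed string, not a regex.
  if ! grep -qxF -- "$key" "$auth_file"; then
    printf '%s\n' "$key" >> "$auth_file"
  fi
  chmod 0600 "$auth_file"
}

# Usage: append_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```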

Starting the nodes

Start the nodes with the commands below.

Format the filesystem
```
  $ bin/hdfs namenode -format
```
Start the NameNode daemon and the DataNode daemon.
```
  $ sbin/start-dfs.sh
```
The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
Browse the web interface for the NameNode; by default it is available at:
- NameNode - http://localhost:9870/
Create the directories on HDFS with the commands below, replacing <username> with your own user name. These are the HDFS directories that MapReduce jobs use.
```
  $ bin/hdfs dfs -mkdir /user
  $ bin/hdfs dfs -mkdir /user/<username>
```
Copy the input files into HDFS with the commands below. The official example uploads the configuration files, but as far as I can tell any text files would do.
```
  $ bin/hdfs dfs -mkdir input
  $ bin/hdfs dfs -put etc/hadoop/*.xml input
```
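Run the example with the command below. If you get a file-not-found error, check the path and the jar file name; the version in the name has to match your Hadoop version.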
```
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar grep input output 'dfs[a-z.]+'
```
Copy the output files from HDFS to the local filesystem and print them:
```
  $ bin/hdfs dfs -get output output
  $ cat output/*
```
Alternatively, view the output files directly on HDFS:
```
  $ bin/hdfs dfs -cat output/*
```
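To sanity-check the result, it helps to know what the grep example computes: it counts every match of the regular expression in the input files and lists the matches by count. Below is a rough local equivalent, assuming GNU grep. The function name local_grep_count is my own, and the ordering of ties may differ from the Hadoop output.

```shell
# Hypothetical helper: count regex matches across local files, highest count first.
local_grep_count() {
  pattern="$1"
  shift
  tab="$(printf '\t')"
  # -o: print only the matched text, -h: no file names, -E: extended regex.
  grep -ohE -e "$pattern" "$@" \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{print $1 "\t" $2}' \
    | LC_ALL=C sort -t "$tab" -k1,1nr -k2,2
}

# Usage (from the Hadoop folder): local_grep_count 'dfs[a-z.]+' etc/hadoop/*.xml
```

Each output line is a count, a tab, and the matched string.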
When you are done, stop the daemons with the command below.
```
  $ sbin/stop-dfs.sh
```

WordCount example

WordCount is the classic MapReduce example.
Link: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0

Copying the code

Copy the WordCount source from the link above and save it on the server as WordCount.java.

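Compile it and package it into a jar. The official tutorial sets HADOOP_CLASSPATH to the JDK's tools.jar first so that com.sun.tools.javac.Main can be found; tools.jar ships with JDK 8 and earlier.

$ export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
$ bin/hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class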

Assuming that:

/user/joe/wordcount/input - input directory in HDFS
/user/joe/wordcount/output - output directory in HDFS

Sample text-files as input:

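As in the example below, the files you want to count must be uploaded to wordcount/input on HDFS.

You can do this with the -put command used earlier. The local input folder must contain text files with sentences like the ones shown below. The relative path wordcount/input resolves to /user/<username>/wordcount/input, so it matches the paths below when the user name is joe. Create the directory first if it does not exist yet.

$ bin/hdfs dfs -mkdir -p wordcount/input
$ bin/hdfs dfs -put input/* wordcount/input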

$ bin/hadoop fs -cat /user/joe/wordcount/input/file01
Hello World Bye World

$ bin/hadoop fs -cat /user/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run WordCount with the command below.

$ bin/hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output

Output:

$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
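
To cross-check these counts without Hadoop, here is a rough local equivalent of WordCount built from standard tools. The function name local_wordcount is my own. It splits on whitespace and sorts words in byte order, which I believe matches the tutorial's WordCount for ASCII input. It prints a space between word and count to match the listing above, whereas Hadoop's default text output uses a tab.

```shell
# Hypothetical helper: count whitespace-separated words in local files.
local_wordcount() {
  cat "$@" \
    | tr -s '[:space:]' '\n' \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{print $2, $1}'
}

# Usage: local_wordcount input/file01 input/file02
```

With the two sample files above, the output should be the same five lines.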

The end.

김띵준's Programming Story