Sunday, May 14, 2017

How to process huge files with hadoop

To read huge files we cannot afford to load them into memory. One solution is a BlockingQueue-based approach.

It is a producer-consumer pattern: a producer thread reads the file and places lines in the queue, blocking once the queue is full, while consumer threads block until the queue is non-empty.

The queue is created using the BlockingQueue interface.

ExecutorService is used to create the consumer and producer thread pools:

java.util.concurrent.ExecutorService consumer = java.util.concurrent.Executors.newFixedThreadPool(CONSUMERS_COUNT);
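The whole pipeline can be sketched as below. This is a minimal illustration, not the post's original code: the process() method, the queue capacity, and the poison-pill string (assumed never to appear as a real line) are all illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class FileProcessor {

    private static final int CONSUMERS_COUNT = 4;
    // Sentinel marking end of input; assumed never to occur as a real line.
    private static final String POISON_PILL = "__EOF__";

    public static final AtomicLong processed = new AtomicLong();

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);

        ExecutorService producer = Executors.newSingleThreadExecutor();
        ExecutorService consumers = Executors.newFixedThreadPool(CONSUMERS_COUNT);

        // Producer: reads line by line; put() blocks while the queue is full,
        // so the whole file is never held in memory at once.
        producer.submit(() -> {
            try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    queue.put(line);
                }
                // One poison pill per consumer so every worker terminates.
                for (int i = 0; i < CONSUMERS_COUNT; i++) {
                    queue.put(POISON_PILL);
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });

        // Consumers: take() blocks until the queue is non-empty.
        for (int i = 0; i < CONSUMERS_COUNT; i++) {
            consumers.submit(() -> {
                try {
                    String line;
                    while (!(line = queue.take()).equals(POISON_PILL)) {
                        process(line);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        producer.shutdown();
        consumers.shutdown();
        consumers.awaitTermination(10, TimeUnit.MINUTES);
    }

    // Hypothetical per-line work; here it only counts lines.
    private static void process(String line) {
        processed.incrementAndGet();
    }
}
```

The bounded queue is what caps memory use: no matter how large the file, at most the queue capacity plus a few in-flight lines are resident at once.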



Sunday, December 11, 2016

Livy Job Server Implementation

Livy Job Server installation on Hortonworks Data Platform (HDP 2.4):
Download the compressed file from
Download the compressed file from
http://archive.cloudera.com/beta/livy/livy-server-0.2.0.zip

Unzip the file and make the configuration changes below.

Configurations:

spark-defaults.conf
Add 'spark.master=yarn-cluster' in the config file '/usr/hdp/current/spark-client/conf/spark-defaults.conf'



livy-env.sh
Add these entries:

export SPARK_HOME=/usr/hdp/current/spark-client
export HADOOP_HOME=/usr/hdp/current/hadoop-client/bin/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_CONF_DIR=$SPARK_HOME/conf
export LIVY_LOG_DIR=/jobserver-livy/logs
export LIVY_PID_DIR=/jobserver-livy
export LIVY_MAX_LOG_FILES=10
export HBASE_HOME=/usr/hdp/current/hbase-client/bin


log4j.properties
By default, logs roll on the console at INFO level. They can be redirected to a rolling file with DEBUG enabled using the configuration below:

log4j.rootCategory=DEBUG, NotConsole
log4j.appender.NotConsole=org.apache.log4j.RollingFileAppender
log4j.appender.NotConsole.File=/Analytics/livy-server/logs/livy.log
log4j.appender.NotConsole.maxFileSize=20MB
log4j.appender.NotConsole.layout=org.apache.log4j.PatternLayout
log4j.appender.NotConsole.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Copy the following libraries to the path '<LIVY SERVER INSTALL PATH>/rsc-jars':

hbase-common.jar
hbase-server.jar
hbase-client.jar
hbase-rest.jar
guava-11.0.2.jar
protobuf-java.jar
hbase-protocol.jar
spark-assembly.jar


Integrating a web application with the Livy job server:
Copy the following libraries to the Tomcat classpath (lib folder):

livy-api-0.2.0.jar
livy-client-http-0.2.0.jar


The servlet invokes the LivyClientUtil.

LivyClientUtil:
String livyUrl = "http://127.0.0.1:8998";
LivyClient client = new HttpClientFactory().createClient(new URL(livyUrl).toURI(), null);
client.uploadJar(new File(jarPath)).get();
TimeUnit time = TimeUnit.SECONDS;
String jsonString = client.submit(new SparkReaderJob("08/19/2010")).get(40,time);

SparkReaderJob:

public class SparkReaderJob implements Job<String> {
    @Override
    public String call(JobContext jc) throws Exception {
        JavaSparkContext jsc = jc.sc();
        // ... use jsc to read and process the data for the given date ...
        return "done"; // call() must return a String to satisfy Job<String>
    }
}


References:
http://livy.io/
https://github.com/cloudera/livy

Monday, November 14, 2016

How to call a java class from scala

Create an Eclipse Java Maven project.

Add a Scala nature to the project: right-click on the project > Configure > Add Scala Nature

Create project structure as:
src/main/java
src/main/scala

Sample Java class:

package test.java;

public class Hello {

    public static void main(String[] args) {
        Hello hello = new Hello();
        hello.helloUser("Tomcat");

    }

    public void helloUser(String user) {
        System.out.println("Hello " + user);
    }

}

Calling the Java class from a Scala object:

package test.scala;

import test.java.Hello

object HelloScala {

  def helloScala() {
    val hello = new Hello()
    hello.helloUser("Scala")
  }
}


POM file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>Java_Scala_Project</groupId>
    <artifactId>Java_Scala_Project</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.4</version>
        </dependency>

    </dependencies>

    <!-- <repositories>
        <repository>
            <id>scala</id>
            <name>Scala Tools</name>
            <url>http://scala-tools.org/repo-releases/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories> -->

    <pluginRepositories>
        <pluginRepository>
            <id>scala</id>
            <name>Scala Tools</name>
            <url>http://scala-tools.org/repo-releases/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>

    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <id>compile</id>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                        <phase>compile</phase>
                    </execution>
                    <execution>
                        <id>test-compile</id>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                        <phase>test-compile</phase>
                    </execution>
                    <execution>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project> 

Thursday, September 22, 2016

Spark Troubleshooting

Increasing the driver's maximum result size (any one of these):
  • spark-submit: --conf spark.driver.maxResultSize=4g
  • conf.set("spark.driver.maxResultSize", "4g")
  • spark-defaults.conf: spark.driver.maxResultSize 4g

Wednesday, September 21, 2016

Useful Amazon EC2 commands

SSH to an EC2 instance:
sudo ssh -i <path/cluster.pem> user@<public DNS>

SCP a file from your desktop to an EC2 instance:
scp -i <path/cluster.pem> <local file> user@<public DNS>:<remote path>

Thursday, August 4, 2016

Useful Commands

Adding a header to an existing file:

echo -e "field1,field2,field3,field4,field5" | cat - test.csv > /tmp/tempFile && mv /tmp/tempFile test.csv

Grep a file with a wildcard in the search string:

grep "08/../2010" test.csv

Delete a line from a file matching a search string
sed -i.bak '/search string/d' file

Delete first line from the file
sed '1d' file.txt > tmpfile; mv tmpfile file.txt

Starting a Hive session with the Tez execution engine:
hive --hiveconf hive.execution.engine=tez

Installing artifact into maven repository
mvn install:install-file -DgroupId=jdk.tools -DartifactId=jdk.tools -Dpackaging=jar -Dversion=1.6 -Dfile=tools.jar -DgeneratePom=true
Installing Java 8 on CentOS
su -c "yum install java-1.8.0-openjdk"

Updating JAVA_HOME
sudo gedit .bash_profile
Add JAVA_HOME=/usr/lib/jvm/java-1.8.0/jre

Updating PATH
sudo vi .bashrc
Add PATH=/usr/lib/jvm/java-1.8.0/jre/bin:$PATH
Verify with java -version

Saturday, July 16, 2016

Basic Stats

Mean
A mean score is an average score, often denoted by X̄ (x-bar). It is the sum of individual scores divided by the number of individuals.

Median

The median is a simple measure of central tendency. To find the median, we arrange the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values.

Mode
The mode is the most frequently appearing value in a population or sample.
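The three measures above can be sketched in Java; the sample data and the tie-breaking behavior of mode() on multimodal input are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class BasicStats {

    // Mean: sum of the values divided by the count.
    public static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // Median: middle value of the sorted data, or the average of the
    // two middle values when the count is even.
    public static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Mode: the most frequently occurring value (an arbitrary one
    // is returned if several values tie for the highest frequency).
    public static double mode(double[] values) {
        Map<Double, Integer> counts = new HashMap<>();
        for (double v : values) counts.merge(v, 1, Integer::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[] data = {2, 3, 3, 5, 7};
        System.out.println(mean(data));   // 4.0
        System.out.println(median(data)); // 3.0
        System.out.println(mode(data));   // 3.0
    }
}
```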