Lab Description: In this lab we will spin up an Elastic MapReduce (EMR) cluster for Big Data processing and use Hive with SQL-style queries to get an analyze data. You will create a small Hadoop cluster using EMR which will allow to run interactive Hive queries against data stored in Amazon S3. You will use Hive to normalize the data in a more useful way, and you we will generate a table with meaningful results and store it in S3 so it can be re-used for other jobs running on the cluster.

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

User Name:  awsstudent
Key: d92ZrVtWnwgM
Access Key Id: AKIAIIBIV7H7MECKXRUQ
Secret Access Key: 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
ec2-54-85-177-100.compute-1.amazonaws.com
login as hadoop
hive

set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.min.split.size=67108864;
set mapred.max.split.size=536870912;
set lang=eng-1M;
set ngram=1gram;
set outputbucket=s3n://ngramlabkm/output;
set min_year=1970;
set max_year=1995;

Log URI: s3://ngramlabkm/logs/

Dataset: http://aws.amazon.com/datasets/8172056142375670

CREATE EXTERNAL TABLE IF NOT EXISTS ngrams (gram string, year int, occurrences bigint, pages bigint, books bigint)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

STORED AS SEQUENCEFILE

LOCATION: ‘s3://datasets.elasticmapreduce/ngrams/books/20090715/${hiveconf:lang}/${hiveconf:ngram}/’;

DESCRIBE ngrams;

CREATE TABLE IF NOT EXISTS normalized (gram string,year int,occurrences bigint);

INSERT OVERWRITE TABLE normalized

SELECT lower(gram), year, occurrences FROM ngrams WHERE year >= (${hiveconf:min_year} - 1) AND year <= ${hiveconf:max_year} AND gram REGEXP "^[A-Za-z+'-]{3,}$";

http://ec2-54-85-177-100.compute-1.amazonaws.com/ganglia/

In this section of the lab, you create a ratios table and populate it with data.

The current table only contains absolute numbers, not taking into account the total size of the corpus. In order to calculate trends, you should work on relative occurrences of words, compared to the total corpus by year.

CREATE TABLE IF NOT EXISTS ratios (gram string, year int, ratio double);

hive.exec.reducers.max
mapred.reduce.tasks 
hive.exec.reducers.bytes.per.reducer
CREATE EXTERNAL TABLE IF NOT EXISTS output_analysis_table ( gram string, year int, ratio double, increase double) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '${hiveconf:outputbucket}';
INSERT OVERWRITE TABLE output_analysis_table

SELECT
a.gram as gram,
a.year as year,
a.ratio as ratio,
a.ratio / b.ratio as increase
FROM ratios a
JOIN ratios b
ON
a.gram = b.gram and
a.year - 1 = b.year
WHERE
a.ratio > 0.000001 and
a.year >= ${hiveconf:min_year} and
a.year <= ${hiveconf:max_year}
DISTRIBUTE BY year SORT BY year ASC, increase DESC;

 

Advertisements