CS590-07 Big Data and Cloud Computing, Fall 2019

Homework #3

NOTE
Submit your homework as a single file named HW3_YourLastName_FirstName.zip via the Blackboard course webpage.
Due: November 22 (Friday)
Part I. Spark Setup and Word Count Program Test
1. Set up Spark in standalone mode on your computer. 
For installation instructions, refer to https://spark.apache.org/docs/latest/
2. Word Count in the Spark shell. 
Spark provides two interactive shells: a Python shell (launched with bin/pyspark) and a Scala shell (launched with bin/spark-shell).
Dataset: An e-book (Pride and Prejudice by Jane Austen) in Text format (UTF-8) at [Link].
Tasks to do:
(1) Develop a Spark script for the Word Count program. Submit the script. 
(2) Submit the experimental result. 
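The word-count data flow in Spark is a chain of textFile → flatMap → map → reduceByKey. The following plain-Python sketch (using hypothetical sample lines, not the actual e-book) mirrors those steps and can be used to sanity-check a Spark script's output on a small input:

```python
from collections import Counter

# Hypothetical sample lines standing in for sc.textFile(...) on the e-book.
lines = ["It is a truth universally acknowledged",
         "that a single man in possession of a good fortune"]

# ~ flatMap: split each line into words (lowercased so counts merge)
words = [w.lower() for line in lines for w in line.split()]

# ~ map + reduceByKey: pair each word with 1, then sum the counts per word
counts = Counter(words)

print(counts["a"])  # → 3
```

In the PySpark shell the same chain is written against the RDD API, e.g. `sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w.lower(), 1)).reduceByKey(lambda a, b: a + b)`.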
3. Standalone Spark Word Count
Dataset: An e-book (Pride and Prejudice by Jane Austen) in Text format (UTF-8) at [Link].
Tasks to do: 
(1) Implement a standalone Word Count program with Spark. Submit the program code. 
(Option 1) Java Word Count program 
[Reference] The Java Spark Word Count program is available at the Learning Spark GitHub repository, 
https://github.com/databricks/learning-spark/blob/master/mini-complete-example/src/main/java/com/oreilly/learningsparkexamples/mini/java/WordCount.java
You can also find how to build the standalone Java program with Maven in Karau et al., "Learning Spark", pp. 18-20, or at https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
(Option 2) Python Word Count program 
[Reference] Python code for Spark Word Count is available at the Learning Spark GitHub repository, 
https://github.com/databricks/learning-spark/tree/master/src/python
(2) Submit the execution result with the given dataset. 
Part II. Spark-based Implementation of Moran's I
Problem description: The objective of this assignment is to implement a spatial statistic function, Moran's I, with Spark, as an application. 
Spatial autocorrelation is a special property of spatial data and can be defined as the correlation of a variable with itself through space. Among the statistical functions that measure spatial autocorrelation, Moran's I is often used. Moran's I can be expressed as follows: 
I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}(z_i - \bar{z})(z_j - \bar{z})}{\left(\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\right) \sum_{i=1}^{n} (z_i - \bar{z})^2}, \quad i \neq j \qquad (1)
where z_i and z_j are the observed values of a spatial feature at locations i and j, respectively; \bar{z} is the mean of the observed values over all sites; n is the number of observation locations/sites; and w_{ij} is the weight, defined by the spatial proximity between locations i and j. A simple way to obtain w_{ij} is to use the (Euclidean) distance between i and j:

w_{ij} = 1/d(i,j) \text{ for } i \neq j, \qquad w_{ij} = 0 \text{ for } i = j,

where d(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} if Euclidean distance is used.
The value of I ranges from -1 to +1. Negative values indicate negative spatial autocorrelation and positive values indicate positive spatial autocorrelation. A value of zero indicates a random spatial pattern.
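Equation (1) can be sketched as a small serial Python function (not the required Spark version) that computes the inverse-distance weights on the fly; it is useful as a reference to check a distributed implementation against on small inputs:

```python
import math

def morans_i(points):
    """Serial reference for Moran's I, equation (1).
    points: list of (x, y, z) tuples; weights w_ij = 1/d(i,j), w_ii = 0."""
    n = len(points)
    z_bar = sum(p[2] for p in points) / n          # mean of observed values
    num = 0.0    # sum_i sum_j w_ij (z_i - z_bar)(z_j - z_bar)
    w_sum = 0.0  # sum_i sum_j w_ij
    for i, (xi, yi, zi) in enumerate(points):
        for j, (xj, yj, zj) in enumerate(points):
            if i == j:
                continue  # w_ii = 0
            w = 1.0 / math.hypot(xi - xj, yi - yj)  # inverse Euclidean distance
            num += w * (zi - z_bar) * (zj - z_bar)
            w_sum += w
    den = w_sum * sum((p[2] - z_bar) ** 2 for p in points)
    return n * num / den
```

For two sites with values on opposite sides of the mean, e.g. morans_i([(0, 0, 1.0), (1, 0, -1.0)]), the function returns -1.0 (perfect negative autocorrelation).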
However, since the computation of Moran's I must handle the weight w_{ij} between every pair of spatial locations <i, j>, it becomes computationally intensive when the total number of observation locations or sites (n) grows very large. Therefore, a parallel and distributed computation of Moran's I is necessary. 
Dataset: You will test your Spark implementation of Moran's I using a given dataset (hep.txt), which contains the hepatitis rate of each CA county. 
The dataset has four attributes, County, X, Y, and Rate, as shown below. However, the county name is not needed to compute Moran's I. X and Y give the location, and Rate gives the spatial feature value, z, in equation (1). For example, the first row <195, 500, 14.4> represents the observation value 14.4 at location (195, 500).
County    X    Y    Rate
Alameda   195  500  14.4
Alpine    318  560  0
Amador    265  550  12.1
…         …    …    …
Yuba      228  604  79.8
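Assuming the rows of hep.txt follow the table above (whitespace-separated County, X, Y, Rate; the actual delimiter in the file should be checked), one row can be turned into the (x, y, z) tuple needed for equation (1) by dropping the county name:

```python
# Hypothetical row in hep.txt's format: County X Y Rate
row = "Alameda 195 500 14.4"

county, x, y, rate = row.split()
point = (float(x), float(y), float(rate))  # county name is not needed

print(point)  # → (195.0, 500.0, 14.4)
```

In Spark, this parsing step would be the function passed to a map over sc.textFile("hep.txt").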
Tasks to do: 
1. Explain your algorithmic design for Moran's I using a Spark data flow diagram. Your diagram should show all transformations that construct a new RDD from previous RDDs and all actions that arrive at the result. For example, you can find a Spark data flow diagram on p. 313 of Parsian's Data Algorithms book. 
2. Implement your algorithm using (Option 1) Java or (Option 2) Python, and submit the program code.
3. Run your implementation on the given data set (hep.txt) and show the experimental result. 
4. In the execution result, show the contents of each of your RDDs. You may dump an RDD's data using Spark's saveAsTextFile function. 
Part III. Spark MLlib
Problem description: The Spark framework provides common machine learning (ML) functionality in a library called MLlib. MLlib supports K-Means clustering, which groups data points into a predefined number of clusters. You will conduct a clustering analysis using MLlib K-Means. 
Reference: Parsian's Data Algorithms, Ch. 12. 
Dataset: Wine data (wine.data). This dataset contains the results of a chemical analysis of wines grown in one region of Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine. The attributes are: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. The data set has 178 observations and no missing values. The wine.names file describes the data. 
Tasks to do: 
1. Explain your algorithmic design for K-Means using a Spark data flow diagram. The diagram of your program might be simple if you use MLlib. If you analyze the source code of MLlib K-Means, you may draw a detailed data flow diagram for the K-Means algorithm, but that is optional. 
IMPORTANT NOTE: wine.data has 14 attributes. The first attribute is a class label. In unsupervised learning such as clustering analysis, the class label attribute should be removed. In your program design, include a transformation (or transformations) that excludes the first attribute (with value 1, 2, or 3) from the original RDD. 
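A minimal sketch of that label-dropping step on one line of wine.data (which is comma-separated, with the class label first; the sample line below is hypothetical and truncated to three features for illustration) — in Spark this would be the function mapped over the raw text RDD:

```python
# Hypothetical sample line in wine.data's comma-separated format:
# class label (1, 2, or 3) first, then the constituent measurements.
line = "1,14.23,1.71,2.43"

values = line.split(",")
features = [float(v) for v in values[1:]]  # [1:] drops the class label

print(features)  # → [14.23, 1.71, 2.43]
```

The resulting numeric feature vectors are what the K-Means training step consumes.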
2. Implement your algorithm using MLlib K-Means and submit the program code. 
3. Run your implementation on the given data set (wine.data) with k=3 and show the experimental result (the data members of each cluster and the final centroid values). 