CS590-07 Big Data and Cloud Computing, Fall 2019

Homework #3

NOTE
Submit your homework as a single file named HW3_YourLastName_FirstName.zip via the Blackboard course webpage.
Due: November 22 (Friday)
Part I. Spark Setup and Word Count Program Test
1. Set up Spark in standalone mode on your computer. 
For installation instructions, refer to https://spark.apache.org/docs/latest/
2. Word Count in the Spark shell. 
Spark provides two interactive shells: a Python shell (launched with bin/pyspark) and a Scala shell (launched with bin/spark-shell).
Dataset: An e-book (Pride and Prejudice by Jane Austen) in Text format (UTF-8) at [Link].
Tasks to do:
(1) Develop a Spark script for the Word Count program. Submit the script. 
(2) Submit the experimental result. 
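The word-count data flow in Spark is a chain of textFile → flatMap → map → reduceByKey. The following plain-Python sketch (using hypothetical sample lines, not the actual e-book) mirrors those steps and can be used to sanity-check a Spark script's output on a small input:

```python
from collections import Counter

# Hypothetical sample lines standing in for sc.textFile(...) on the e-book.
lines = ["It is a truth universally acknowledged",
         "that a single man in possession of a good fortune"]

# ~ flatMap: split each line into words (lowercased so counts merge)
words = [w.lower() for line in lines for w in line.split()]

# ~ map + reduceByKey: pair each word with 1, then sum the counts per word
counts = Counter(words)

print(counts["a"])  # → 3
```

In the PySpark shell the same chain is written against the RDD API, e.g. `sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w.lower(), 1)).reduceByKey(lambda a, b: a + b)`.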
3. Standalone Spark Word Count
Dataset: An e-book (Pride and Prejudice by Jane Austen) in Text format (UTF-8) at [Link].
Tasks to do: 
(1) Implement a standalone Word Count program with Spark. Submit the program code. 
(Option 1) Java Word Count program 
[Reference] The Java Spark Word Count program is available at the Learning Spark GitHub repository, 
https://github.com/databricks/learning-spark/blob/master/mini-complete-example/src/main/java/com/oreilly/learningsparkexamples/mini/java/WordCount.java
You can also find how to build the standalone Java program with Maven in Karau et al., "Learning Spark", pp. 18-20, or at https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
(Option 2) Python Word Count program 
[Reference] Python code for Spark Word Count is available at the Learning Spark GitHub repository, 
https://github.com/databricks/learning-spark/tree/master/src/python
(2) Submit the execution result with the given dataset. 
Part II. Spark-based Implementation of Moran's I
Problem description: The objective of this assignment is to implement a spatial statistic function, Moran's I, with Spark, as an application. 
Spatial autocorrelation is a special property of spatial data and can be defined as the correlation of a variable with itself through space. Among the statistical functions that measure spatial autocorrelation, Moran's I is often used. Moran's I can be expressed as follows: 
I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}(z_i - \bar{z})(z_j - \bar{z})}{\left(\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\right) \sum_{i=1}^{n} (z_i - \bar{z})^2}, \quad i \neq j \qquad (1)
where z_i and z_j are the observed values of a spatial feature at locations i and j, respectively; \bar{z} is the mean of the observed values over all sites; n is the number of observation locations/sites; and w_{ij} is the weight, defined by the spatial proximity between locations i and j. A simple way to obtain w_{ij} is to use the (Euclidean) distance between i and j:

w_{ij} = 1/d(i,j) \text{ for } i \neq j, \qquad w_{ij} = 0 \text{ for } i = j,

where d(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} if Euclidean distance is used.
The value of I ranges from -1 to +1. Negative values indicate negative spatial autocorrelation and positive values indicate positive spatial autocorrelation. A value of zero indicates a random spatial pattern.
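Equation (1) can be sketched as a small serial Python function (not the required Spark version) that computes the inverse-distance weights on the fly; it is useful as a reference to check a distributed implementation against on small inputs:

```python
import math

def morans_i(points):
    """Serial reference for Moran's I, equation (1).
    points: list of (x, y, z) tuples; weights w_ij = 1/d(i,j), w_ii = 0."""
    n = len(points)
    z_bar = sum(p[2] for p in points) / n          # mean of observed values
    num = 0.0    # sum_i sum_j w_ij (z_i - z_bar)(z_j - z_bar)
    w_sum = 0.0  # sum_i sum_j w_ij
    for i, (xi, yi, zi) in enumerate(points):
        for j, (xj, yj, zj) in enumerate(points):
            if i == j:
                continue  # w_ii = 0
            w = 1.0 / math.hypot(xi - xj, yi - yj)  # inverse Euclidean distance
            num += w * (zi - z_bar) * (zj - z_bar)
            w_sum += w
    den = w_sum * sum((p[2] - z_bar) ** 2 for p in points)
    return n * num / den
```

For two sites with values on opposite sides of the mean, e.g. morans_i([(0, 0, 1.0), (1, 0, -1.0)]), the function returns -1.0 (perfect negative autocorrelation).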
However, since the computation of Moran's I must handle the weight w_{ij} between every pair of spatial locations <i, j>, it becomes computationally intensive when the total number of observation locations or sites (n) grows very large. Therefore, a parallel and distributed computation of Moran's I is necessary. 
Dataset: You will test your Spark implementation of Moran's I using a given dataset (hep.txt), which contains the hepatitis rate of each CA county. 
The dataset has four attributes, County, X, Y, and Rate, as shown below. However, the county name is not needed to compute Moran's I. X and Y give the location, and Rate gives the spatial feature value, z, in equation (1). For example, the first row <195, 500, 14.4> represents the observation value 14.4 at location (195, 500).
County    X    Y    Rate
Alameda   195  500  14.4
Alpine    318  560  0
Amador    265  550  12.1
…         …    …    …
Yuba      228  604  79.8
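Assuming the rows of hep.txt follow the table above (whitespace-separated County, X, Y, Rate; the actual delimiter in the file should be checked), one row can be turned into the (x, y, z) tuple needed for equation (1) by dropping the county name:

```python
# Hypothetical row in hep.txt's format: County X Y Rate
row = "Alameda 195 500 14.4"

county, x, y, rate = row.split()
point = (float(x), float(y), float(rate))  # county name is not needed

print(point)  # → (195.0, 500.0, 14.4)
```

In Spark, this parsing step would be the function passed to a map over sc.textFile("hep.txt").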
Tasks to do: 
1. Explain your algorithmic design for Moran's I using a Spark data flow diagram. Your diagram should show all transformations that construct a new RDD from previous RDDs and all actions that arrive at the result. For example, you can find a Spark data flow diagram on p. 313 of Parsian's Data Algorithms book. 
2. Implement your algorithm using (Option 1) Java or (Option 2) Python, and submit the program code.
3. Run your implementation on the given data set (hep.txt) and show the experimental result. 
4. In the execution result, show the contents of each of your RDDs. You may dump an RDD's data using Spark's saveAsTextFile function. 
Part III. Spark MLlib
Problem description: The Spark framework provides common machine learning (ML) functionality in a library called MLlib. MLlib supports K-Means clustering, which groups data points into a predefined number of clusters. You will conduct a clustering analysis using MLlib K-Means. 
Reference: Parsian's Data Algorithms, Ch. 12. 
Dataset: Wine data (wine.data). This dataset contains the results of a chemical analysis of wines grown in one region of Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine. The attributes are: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. The data set has 178 observations and no missing values. The wine.names file describes the data. 
Tasks to do: 
1. Explain your algorithmic design for K-Means using a Spark data flow diagram. The diagram of your program might be simple if you use MLlib. If you analyze the source code of MLlib K-Means, you may draw a detailed data flow diagram for the K-Means algorithm, but that is optional. 
IMPORTANT NOTE: wine.data has 14 attributes. The first attribute is a class label. In unsupervised learning such as clustering analysis, the class label attribute should be removed. In your program design, include a transformation (or transformations) that excludes the first attribute (with value 1, 2, or 3) from the original RDD. 
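A minimal sketch of that label-dropping step on one line of wine.data (which is comma-separated, with the class label first; the sample line below is hypothetical and truncated to three features for illustration) — in Spark this would be the function mapped over the raw text RDD:

```python
# Hypothetical sample line in wine.data's comma-separated format:
# class label (1, 2, or 3) first, then the constituent measurements.
line = "1,14.23,1.71,2.43"

values = line.split(",")
features = [float(v) for v in values[1:]]  # [1:] drops the class label

print(features)  # → [14.23, 1.71, 2.43]
```

The resulting numeric feature vectors are what the K-Means training step consumes.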
2. Implement your algorithm using MLlib K-Means and submit the program code. 
3. Run your implementation on the given data set (wine.data) with k=3 and show the experimental result (the data members of each cluster and the final centroid values). 