MGSC 661: Individual Assignment #3
General Information
This assignment must be submitted through the Assignment #3 submission folder on myCourses before its due date. This is an individual assignment; further instructions and the due date can be found in the Assessment Overview module for Assignment #3.
In this assignment, you will also be assessed on the effectiveness of your visualizations.
This assignment is worth 100 points (30% of your marks). All the questions are listed below. You can find the complete grading rubric breakdown in this assignment submission folder.
Assignment Goal
The aim of this assignment is to build unsupervised learning models.
In the first problem, you will use K-means clustering for image compression. You will reduce the number of colours in an image by clustering similar colours and replacing each colour with the centroid of its cluster. This will help you understand how K-means clustering can be applied to image data for compression purposes.
In the second problem, you will use a dataset on the performance characteristics of various automobiles to perform principal component analysis and compare cars by country of manufacture.
For all questions in this assignment, you must attach your Python code with your submission along with your outputs. Jupyter notebook code files are accepted as well.
Read the following tasks that you should perform to complete this assignment. This assignment contains 2 parts as follows:
Part 1: Image Compression using K-Means (50 points)
a. (5 points) Downloading of Image from Internet using Python Library: Pick any colour image of your choice. You can use an image from your personal collection or download a sample image from the internet. If the run-time of your algorithm is too slow, you may need to choose a lower-resolution image. Use an appropriate Python library to load the image. Alternatively, you can use the flower.jpg from sklearn.datasets.
from sklearn.datasets import load_sample_image
flower = load_sample_image("flower.jpg")
requirement: Image is loaded correctly using an appropriate Python library, with clear and accurate descriptions.
b. (5 points) Image Conversion: Convert the image to a two-dimensional array where each row represents a pixel and each column represents a colour channel (RGB values).
requirement: Image is converted correctly to a two-dimensional array with clear and accurate descriptions.
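As an illustration of the conversion in step b, the sketch below reshapes a small synthetic array rather than a real photograph. The variable names and the 4 x 6 image size are assumptions chosen for the example, not part of the assignment.

```python
import numpy as np

# Hypothetical stand-in for a loaded image: height x width x 3 (RGB), uint8.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

height, width, channels = image.shape

# One row per pixel, one column per colour channel.
pixels = image.reshape(-1, channels)

print(image.shape)   # (4, 6, 3)
print(pixels.shape)  # (24, 3)
```

Reshaping back with `pixels.reshape(height, width, channels)` recovers the original layout, which is why the original height and width are worth keeping.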
c. (10 points) K-Means Implementation: Implement k-means clustering to cluster the pixel colors into k clusters (experiment with different values of k such as 16, 32, 64, etc.).
requirement: K-means clustering is implemented correctly with clear and accurate descriptions, experimenting with different values of k.
d. (5 points) Colour Replacement: Replace each pixel's colour with the centroid of the cluster it belongs to.
requirement: Pixel colors are replaced correctly with the centroid of the cluster with clear and accurate descriptions.
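One possible way to approach steps c and d is sketched below with scikit-learn's `KMeans`. The random pixel array, the helper name `quantize`, and the settings `n_init=3` and `random_state=0` are assumptions made for the example; a real image would supply the pixel array.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical pixel array (rows = pixels, columns = R, G, B).
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(1000, 3)).astype(np.float64)

def quantize(pixels, k, seed=0):
    """Fit k-means and replace every pixel by its cluster centroid."""
    km = KMeans(n_clusters=k, n_init=3, random_state=seed).fit(pixels)
    # labels_ holds each pixel's cluster index; indexing the centroid
    # table with it yields one centroid colour per pixel.
    return km.cluster_centers_[km.labels_], km

for k in (16, 32, 64):
    compressed, km = quantize(pixels, k)
    n_colours = len(np.unique(compressed, axis=0))
    print(k, n_colours, round(km.inertia_, 1))
```

The centroids are floating-point values, so for display they would typically be rounded and cast back to `uint8` before the array is reshaped into an image.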
e. (10 points) Reconstructing, Comparing & Defining: Reconstruct the compressed image from the clustered pixel data. Compare the original image and the compressed image by visualizing them side by side. Define a metric to quantify the compression achieved, if at all, for the image.
requirement: Compressed image is reconstructed correctly and compared with the original using appropriate visualizations, with clear and accurate descriptions of the compression metric. Metric to quantify loss of quality is defined correctly with clear and accurate descriptions, discussing trade-offs between clusters and image quality/compression.
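The choice of compression metric is left open. One candidate, sketched here under stated assumptions, compares the bits needed to store raw 24-bit RGB pixels against the bits needed for a palette of k centroids plus one palette index per pixel. The function name `compression_ratio` and the example image size are hypothetical, and real file formats such as JPEG or PNG would behave differently.

```python
import math

def compression_ratio(n_pixels, k, bits_per_channel=8, channels=3):
    """Ratio of raw storage to palette-plus-index storage (assumed model)."""
    original_bits = n_pixels * channels * bits_per_channel
    index_bits = math.ceil(math.log2(k))            # bits per palette index
    palette_bits = k * channels * bits_per_channel  # storing the k centroids
    compressed_bits = n_pixels * index_bits + palette_bits
    return original_bits / compressed_bits

n_pixels = 427 * 640  # hypothetical image size
for k in (16, 32, 64):
    print(k, round(compression_ratio(n_pixels, k), 2))
```

Under this model the ratio falls as k grows, because each pixel index needs more bits and the palette itself gets larger.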
f. (15 points) Metric Definition and Trade-off Discussion: Define a metric to quantify the loss of quality, if at all, between the original image and the compressed image. Discuss the trade-off between the number of clusters and the quality of the image and the compression achieved.
requirement: Metric is defined correctly with clear and accurate descriptions, effectively quantifying the loss of quality. Comprehensive discussion with clear and accurate descriptions, effectively addressing the trade-off between clusters, image quality, and compression.
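Two commonly used candidates for a quality-loss metric are the mean squared error (MSE) and the peak signal-to-noise ratio (PSNR). The sketch below is only an illustration: the helper names and the synthetic "compressed" images, produced by crude per-channel rounding rather than k-means, are assumptions.

```python
import numpy as np

def mse(original, compressed):
    """Mean squared error over all pixels and channels."""
    diff = original.astype(np.float64) - compressed.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original, compressed, max_value=255.0):
    """Peak signal-to-noise ratio in decibels (higher means less loss)."""
    error = mse(original, compressed)
    if error == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / error))

# Hypothetical images: coarse and fine rounding stand in for small and large k.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
coarse = (original // 32) * 32 + 16
fine = (original // 8) * 8 + 4

print(mse(original, coarse), psnr(original, coarse))
print(mse(original, fine), psnr(original, fine))
```

A lower MSE, or equivalently a higher PSNR, indicates that the compressed image is closer to the original.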
Part 2: Principal Component Analysis (50 points)
The mtcars dataset is a classic dataset in statistics and machine learning. It was extracted from the 1974 Motor Trend US magazine and comprises various specifications of 32 car models. The variables are:
• model: Car make
• mpg: Miles per gallon (fuel efficiency)
• cyl: Number of cylinders in the engine
• disp: Displacement (cubic inches)
• hp: Gross horsepower
• drat: Rear axle ratio
• wt: Weight (1000 lbs)
• qsec: 1/4 mile time (seconds)
• vs: Engine (0 = V-shaped, 1 = straight)
• am: Transmission (0 = automatic, 1 = manual)
• gear: Number of forward gears
• carb: Number of carburetors
In this problem, you have also been provided with an additional feature, country, denoting the origin of the vehicle.
a. (5 points) Loading of the Dataset: Load the dataset and confirm it has loaded correctly.
requirement: Dataset is loaded correctly with clear and accurate descriptions.
b. (5 points) Standardization: Standardize all numerical features in the dataset to have a mean of 0 and a standard deviation of 1. This is an essential preprocessing step in many machine learning algorithms and statistical techniques.
requirement: Numerical features are standardized correctly with clear and accurate descriptions.
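Standardization subtracts each column's mean and divides by its standard deviation. The sketch below uses a small illustrative array with three numeric columns; the array and variable names are assumptions for the example.

```python
import numpy as np

# Small illustrative array: rows are cars, columns are numeric features.
X = np.array([[21.0, 110.0, 2.62],
              [22.8,  93.0, 2.32],
              [18.7, 175.0, 3.44],
              [14.3, 245.0, 3.57]])

mean = X.mean(axis=0)
std = X.std(axis=0)          # population standard deviation (ddof=0)
X_std = (X - mean) / std     # z-scores, column by column

print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))
```

scikit-learn's `StandardScaler` performs the same calculation, also using the population standard deviation.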
c. (15 points) PCA Application: Apply PCA to the standardized numerical features to reduce them to the first two principal components. Also, report and explain the percentage of variance explained by each principal component.
requirement: PCA is applied correctly with clear and accurate descriptions of variance explained by each principal component.
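A minimal sketch of this step using scikit-learn's `PCA` follows. It runs on synthetic data with 32 rows so that it is self-contained; the data, the number of columns, and the variable names are assumptions and say nothing about the results on mtcars.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three correlated columns plus two independent ones.
rng = np.random.default_rng(0)
base = rng.normal(size=(32, 1))
correlated = [base + 0.3 * rng.normal(size=(32, 1)) for _ in range(3)]
X = np.hstack(correlated + [rng.normal(size=(32, 2))])

X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)       # coordinates on PC1 and PC2
ratios = pca.explained_variance_ratio_  # share of variance per component

for i, r in enumerate(ratios, start=1):
    print(f"PC{i}: {100 * r:.1f}% of variance")
print(f"Total: {100 * ratios.sum():.1f}%")
```

The entries of `explained_variance_ratio_` are fractions of the total variance, so multiplying by 100 gives the percentages asked for.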
d. (5 points) Scatter Plot Creation: Create a scatter plot of the data points in the new 2-dimensional space defined by the principal components.
requirement: Scatter plot is created correctly with clear and accurate descriptions.
e. (10 points) Colour-Coding by Number of Cylinders: Colour-code the observations by the number of cylinders (cyl). Discuss the separation of the different cylinder categories in the PCA plot. Which cylinder categories are most clearly separated?
requirement: Observations are color-coded correctly with clear and accurate descriptions, discussing separation of cylinder categories.
f. (10 points) Colour-Coding by Country: Next, colour-code the observations by the country of the models. Discuss and interpret the separation of the cars from different countries in the PCA plot.
requirement: Observations are color-coded correctly with clear and accurate descriptions, discussing separation of cars from different countries.
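For the plots in parts d to f, one option is to draw each category as its own scatter series so that matplotlib assigns it a distinct colour and legend entry. The sketch below uses random scores and a made-up grouping column; both are assumptions, and the same loop would work for a country column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical PCA scores and a hypothetical grouping variable.
rng = np.random.default_rng(0)
scores = rng.normal(size=(12, 2))
groups = np.array([4, 6, 8] * 4)

fig, ax = plt.subplots(figsize=(6, 4))
for value in np.unique(groups):
    mask = groups == value
    ax.scatter(scores[mask, 0], scores[mask, 1], label=f"{value} cylinders")

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Observations in the first two principal components")
ax.legend(title="cyl")
# fig.savefig("pca_scatter.png") would write the figure to disk.
```

Labelling the axes with the variance explained by each component, where available, makes the plot easier to interpret.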