
Computer Architecture

Project 2: Label Propagation via Parallel Formulation and GEMM

1 Project Description

This project involves the implementation of a semi-supervised learning algorithm based on label propagation. Machine learning is commonly divided into three principal categories: supervised learning, unsupervised learning, and semi-supervised learning. Supervised learning presupposes a substantial corpus of labeled data with which to train a model, in the expectation that the model will capture the data distribution and thereby make predictions about future, unseen samples. The labeled data from which this performance derives is therefore critically important: in general, the training set must be large enough to cover the sample distribution found in real-world data for the learned model to be meaningful. Unsupervised learning, by contrast, operates without any labeled data; it is often referred to as clustering and relies on the intrinsic structure of the data to assign categories to the data points. Semi-supervised learning, as the term suggests, sits between the two: only a limited amount of labeled data is available, and the objective is to extract useful information from this small labeled subset in conjunction with a much larger volume of unlabeled data.

1.1 Semi-Supervised Learning

Semi-Supervised Learning (SSL) is useful when only a subset of the data carries labels and the remainder does not. Typically, the vast majority of the data is unlabeled and only a tiny fraction is labeled. SSL algorithms exploit the unlabeled data to capture the latent distribution of the entire dataset. This approach rests on three pivotal assumptions:

1) Smoothness Assumption: Data points that are similar are likely to share the same label.

2) Cluster Assumption: Data points that belong to the same cluster are likely to possess the same label.

3) Manifold Assumption: Data points residing on the same manifold structure are presumed to share the same label.

Consider the graphical illustration below: there are only two labeled data points. If these points are used to train a classifier directly, such as Logistic Regression (LR) or a Support Vector Machine (SVM), the resulting decision boundary resembles the one in the left image. If the actual data distribution matches the right image, it is clear that the classifier trained on the left is unreliable: the scarce labeled training data fails to cover the range of cases that may be encountered in the future. If, however, the copious unlabeled data (shown in black) is brought into the analysis, as in the right image, a far more complete picture emerges. An SSL algorithm would classify the points on the outer circle as belonging to the red category and those on the inner circle as belonging to the blue category, revealing the data's true two-circle manifold structure. In practice, labeled data is costly and hard to acquire, whereas unlabeled data is abundant; the ability to leverage large amounts of unlabeled data to improve the model therefore carries substantial value.

There exist numerous algorithms within the realm of semi-supervised learning. The task at hand involves the implementation of the most foundational among these, known as the label propagation algorithm.

2 Label Propagation Algorithm

The Label Propagation (LP) algorithm is predicated on a straightforward core concept: data points that are similar should bear identical labels. The LP algorithm encompasses two major steps: 1) the construction of a similarity matrix, and 2) the propagation of labels.

Grounded in graph theory, the LP algorithm constructs a graph over all data points, with nodes representing individual data instances, both labeled and unlabeled. The edge between node i and node j encodes their similarity. There are many ways to construct such a graph; for the purposes of this discussion, assume a fully connected graph in which the edge weight between node i and node j is given by a Gaussian kernel on their distance:

w_{ij} = \exp\left( -\frac{\| x_i - x_j \|^2}{\alpha^2} \right)

The weight encapsulates the degree of resemblance between the two data points and ultimately drives the label propagation process. Here, the hyperparameter α controls how quickly the similarity decays with distance. The k-nearest-neighbours (k-NN) graph is a simpler alternative, in which only the weights of the k nearest neighbours of each node are preserved and all others are set to zero, indicating the absence of an edge and yielding a sparse similarity matrix.
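To make this concrete, a minimal NumPy sketch of the graph construction could look as follows. The Gaussian-kernel weight, the optional k-NN sparsification, and the function name are illustrative assumptions based on the description above, not a prescribed implementation.

import numpy as np

def build_weight_matrix(X, alpha=1.0, k=None):
    """Construct the similarity matrix W for data X of shape (N, d).

    W[i, j] = exp(-||x_i - x_j||^2 / alpha^2) for the fully connected graph.
    If k is given, only the k largest weights in each row are kept (a sparse
    k-NN graph); all other entries are set to zero.
    """
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)          # guard against tiny negative values
    W = np.exp(-d2 / (alpha ** 2))
    np.fill_diagonal(W, 0.0)             # no self-loops
    if k is not None:
        idx = np.argsort(W, axis=1)[:, :-k]   # indices of all but the k largest per row
        rows = np.arange(W.shape[0])[:, None]
        W[rows, idx] = 0.0
        W = np.maximum(W, W.T)           # keep the graph symmetric
    return W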

The label propagation algorithm disseminates labels through the edges connecting the nodes. Larger edge weights indicate higher similarity between nodes, which in turn makes it easier for labels to travel across the corresponding edge. We define an N×N probability transition matrix P by row-normalising the weights:

P_{ij} = \frac{w_{ij}}{\sum_{k=1}^{N} w_{ik}}

Here, P_ij denotes the probability of transitioning from node i to node j. Assuming there are C classes and L labeled samples, we define an L×C label matrix Y_L, whose i-th row is the label indicator vector of the i-th labeled sample: if the i-th sample belongs to class j, then the j-th element of that row is 1 and all others are 0. Similarly, we define a U×C label matrix Y_U for the U unlabeled samples. Stacking them yields an N×C soft-label matrix F = [Y_L; Y_U]. The notion of "soft labels" means that, for sample i, we retain the probability of belonging to each class rather than declaring with absolute certainty (probability 1) that it belongs to a single class. When the definitive class of sample i must be determined, the class with the maximum probability (the argmax) is selected as its label. The initial values of Y_U in F can be set to random numbers in the range (0, 1).
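Continuing the sketch, the transition matrix P and the initial soft-label matrix F might be assembled as follows; placing the L labeled samples in the first rows is a convention assumed here purely for illustration, as are the function and argument names.

import numpy as np

def build_transition_and_labels(W, labels_l, num_classes, rng=None):
    """Build the N x N transition matrix P and the N x C soft-label matrix F.

    W is the (N, N) similarity matrix, with the first L rows/columns assumed
    to correspond to the labeled samples; labels_l holds their integer labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    P = W / W.sum(axis=1, keepdims=True)      # P_ij = w_ij / sum_k w_ik
    N, L = W.shape[0], len(labels_l)
    Y_L = np.zeros((L, num_classes))
    Y_L[np.arange(L), labels_l] = 1.0         # one-hot rows for the labeled data
    Y_U = rng.random((N - L, num_classes))    # random initial soft labels in (0, 1)
    F = np.vstack([Y_L, Y_U])                 # F = [Y_L; Y_U]
    return P, F, Y_L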

The label propagation algorithm proceeds as follows: 1) propagate: F = P F; 2) reset the labels of the labeled samples in F: F_L = Y_L; 3) repeat steps 1) and 2) until F converges.

Step 1) multiplies matrix P by matrix F. During this step, every node propagates its label to the other nodes according to the probabilities defined by P. The more similar two nodes are (i.e., the closer they are in Euclidean space), the more readily their labels influence one another, akin to forming alliances. Step 2) is critical because the labels of the labeled data are pre-determined and immutable; after each propagation, they must therefore be reset to their original values. By continually disseminating their labels, the labeled data carve out the class boundaries, which eventually pass through regions of high density and settle within low-density gaps, effectively demarcating territories of influence for samples of different classes.
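A minimal sketch of this iteration, under the same labeled-rows-first convention, might look like the following; the convergence tolerance and iteration cap are illustrative choices, not requirements.

import numpy as np

def label_propagation(P, F, Y_L, tol=1e-6, max_iter=1000):
    """Iterate F <- P F, clamping the labeled rows to Y_L, until F converges."""
    L = Y_L.shape[0]
    for _ in range(max_iter):
        F_next = P @ F          # step 1: propagate labels along the graph edges
        F_next[:L] = Y_L        # step 2: reset the labeled samples to their true labels
        if np.abs(F_next - F).max() < tol:
            return F_next       # step 3: stop once F has (numerically) converged
        F = F_next
    return F

The hard label of each unlabeled sample is then the argmax of its row of the returned F.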

Further, we partition the matrix P according to the labeled/unlabeled ordering of the samples:

P = \begin{bmatrix} P_{LL} & P_{LU} \\ P_{UL} & P_{UU} \end{bmatrix}

so that the update of the unlabeled part of F can be written as

F_U \leftarrow P_{UU} F_U + P_{UL} Y_L

These iterative steps are repeated until convergence. It can be observed that F_U, the soft labels of the unlabeled data, depends not only on the labels of the labeled data and the corresponding transition probabilities, but also on the current label assignments and transition probabilities of the unlabeled data. Thus the Label Propagation (LP) algorithm is able to exploit the distributional structure inherent in the unlabeled data. The convergence of the algorithm is also readily shown, and the iteration can be proven to converge to the closed-form solution

F_U = (I - P_{UU})^{-1} P_{UL} Y_L

Consequently, one can solve directly for the final F_U using this expression. In practice, however, matrix inversion has O(n^3) computational complexity, so when the number of unlabeled data points is large, inverting the matrix I - P_UU becomes prohibitively time-consuming. Under such circumstances, iterative algorithms are therefore commonly preferred.
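For comparison, the closed-form solution can be sketched as below; np.linalg.solve is used rather than an explicit inverse, but the cost is still cubic in the number of unlabeled samples, as noted above. The labeled-rows-first layout of P is again an assumption of this sketch.

import numpy as np

def label_propagation_closed_form(P, Y_L):
    """Solve (I - P_UU) F_U = P_UL Y_L directly for the unlabeled soft labels."""
    L = Y_L.shape[0]
    P_UL = P[L:, :L]                 # transitions from unlabeled to labeled nodes
    P_UU = P[L:, L:]                 # transitions among unlabeled nodes
    I = np.eye(P_UU.shape[0])
    return np.linalg.solve(I - P_UU, P_UL @ Y_L)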

3 Evaluation

3.1 Scoring for Classification Validation

We use the Shape sets collection available at http://cs.joensuu.fi/sipu/datasets/, which includes eight datasets such as http://cs.joensuu.fi/sipu/datasets/Aggregation.txt. You may vary the number of initially labeled data points and use the remaining points as test data. Please describe the implementation and the results achieved in detail in your experimental report.
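As a rough sketch of the experimental setup (assuming each line of a Shape sets file holds two coordinates followed by an integer class label, and that labels are 1-based; please verify against the files you download), loading and scoring could be organised as follows.

import numpy as np

def load_shape_dataset(path):
    """Load a Shape-sets file; each row is assumed to be: x  y  class_label."""
    data = np.loadtxt(path)
    X = data[:, :2]
    y = data[:, 2].astype(int) - 1   # assumed 1-based labels mapped to 0-based
    return X, y

def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples among those treated as unlabeled."""
    return float((np.asarray(y_true) == np.asarray(y_pred)).mean())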

3.2 Speedup Ratio of Parallel Algorithms

Accelerate the computation of Equation (1) using a parallel formulation (parallel algorithm). The evaluation criterion for this section is the speedup ratio obtained through the parallel algorithm, defined as R = T_0 / T_P, where T_P is the execution time of the parallel algorithm and T_0 is the execution time of the algorithm before optimization. This section contributes 30% to the total score. Consideration: how should the granularity of parallel computation be chosen to best fit your runtime environment (number of CPU cores/threads, etc.)?
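One possible parallel formulation is to split the rows of P across worker processes, each computing its own block of the product P F; the sketch below assumes this row-block scheme, and the number of workers should be matched to your CPU core/thread count. Note that if your unoptimised baseline already uses NumPy's @ operator, it may be backed by a multithreaded BLAS, so a plain loop-based baseline gives a clearer picture of the speedup.

import numpy as np
from multiprocessing import Pool

def _block_times_F(args):
    """Worker: multiply one horizontal block of P with the full matrix F."""
    P_block, F = args
    return P_block @ F

def parallel_propagate(P, F, num_workers=4):
    """Compute P @ F by distributing row blocks of P over worker processes."""
    blocks = np.array_split(P, num_workers, axis=0)
    with Pool(num_workers) as pool:
        parts = pool.map(_block_times_F, [(b, F) for b in blocks])
    return np.vstack(parts)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.random((2000, 2000))
    F = rng.random((2000, 3))
    np.testing.assert_allclose(parallel_propagate(P, F), P @ F, rtol=1e-8)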

3.3 Speedup Ratio Using GEMM

Accelerate the computation of Equation (1) by employing General Matrix Multiply (GEMM). As in 3.2, the speedup ratio after applying GEMM must be measured. This section accounts for 40% of the total score. Consideration: how should the block size in blocked matrix multiplication be chosen to best suit your runtime environment (CPU cache size, etc.)?
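A cache-blocked (tiled) matrix multiplication could be sketched as follows; the block size is the tuning knob mentioned above and should be chosen so that the three tiles involved in each inner product fit comfortably in cache. The per-tile product is delegated to NumPy here only to keep the sketch short; in your own GEMM kernel you would replace it with explicit loops.

import numpy as np

def blocked_gemm(A, B, block=64):
    """Compute C = A @ B using square tiles of size `block` (cache blocking)."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i0 in range(0, n, block):
        for k0 in range(0, k, block):
            A_tile = A[i0:i0 + block, k0:k0 + block]
            for j0 in range(0, m, block):
                # Accumulate this tile pair's contribution into the output tile.
                C[i0:i0 + block, j0:j0 + block] += A_tile @ B[k0:k0 + block, j0:j0 + block]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A, B = rng.random((500, 300)), rng.random((300, 400))
    np.testing.assert_allclose(blocked_gemm(A, B), A @ B, rtol=1e-8)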

4 Coding Requirements

For this project, you must exclusively use Python for coding.

5 Deadline

Submit by the designated deadline.




