
Contents
I Membership Inference [10 points]
II Mitigating Bias with Adversarial Learning [10 points]
    II-1 Demographic Parity [5 points]
    II-2 Equality of Opportunity [5 points]
III Mitigating Bias in Word Embeddings [10 points]
    III-1 Debiasing word embeddings
Setup
I Membership Inference [10 points]
In this part of the assignment, you will be implementing the black-box shadow model membership inference
attack of Shokri et al. [3], also described in Nasr et al. [2], the survey we read for class. Membership
inference is the scenario in which an adversary seeks to determine whether a given instance was used to train a
given model. We will consider the "black-box" variant of membership inference: the adversary does not have
direct access to the model; they only have query access, in that they can ask the model to make predictions
on chosen instances and observe the outcome. More difficult variants of this scenario give the adversary only
the predicted class for their chosen queries, but we will work in the easier case where the adversary is given
the output probability vector. We will see that despite not having direct access, the attacker can still achieve
success. White-box vs. black-box is a common dichotomy in security, where it is generally recognized that
hiding systems or models (that is, black-box scenarios) from adversaries is typically not an effective form
of defense. This is true in a wider sense, as exemplified by the concept of "security through obscurity", the
usually ineffective approach of securing systems by hiding their internal operation.
Black-box Membership Inference Let (x, y) ∼ D be a dataset partitioned into subsets T0, T1. Let f
be a model trained on T1, where the model outputs ŷ = f(x) are probability distributions over a set of
classes. An adversary is a procedure that, given an instance x from either T0 or T1, outputs either 0 or 1,
indicating its guess of which subset the instance comes from (that is, whether it is a training instance). The
adversary also has:
• Query access to f's probability vectors. In other words, given any chosen instance x′, the attacker
can obtain ŷ′ = f(x′). They are not limited in how many queries they can make (though query count is
typically a point of comparison for black-box attacks).
• A shadow subset S ⊆ D independent of the training set: S ∩ T0 = ∅ (the attack will work better if S happens to have some overlap with T1).
• Knowledge of the format of the inputs and outputs of the targeted model, including their number and
the range of values they can take.
• Knowledge of the type and architecture of the machine learning model, as well as the training algorithm
(this is not required in the original paper but is assumed in this homework for simplicity).
Let b be a fair coin flip taking values in {0, 1} and let x be a sample drawn uniformly from Tb. The probability
that the attacker outputs b is a measure of adversary success. Writing Af,S for the adversary with the
aforementioned access, we define:
success(Af,S)  def=  Pr_b [ Af,S(x) = b | x ∈ Tb ]
Shadow Model Attack In the shadow model attack, the attacker uses the shadow dataset S to build
a predictor for the question "was x used to train this model?". The process has two main steps: 1) train
predictors on known splits of S and collect their predictions on instances inside and outside their training
splits into a synthesized dataset for the membership inference task, and 2) train a model over the synthesized
dataset. We elaborate below.
1. Repeat the following process a number of times:
• Split S into two disjoint subsets Sin and Sout and train a shadow model g using only Sin. We will
use this model to characterize the output behaviour of models on training instances vs. non-training
instances.
• Synthesize two datasets Ain and Aout. The features in Ain are the ground-truth label y and the
g-predicted class distribution for each instance (x, y) ∈ Sin, while Aout has the same but for each
instance of Sout. The target class in these datasets is an indicator of whether the given instance
comes from Sin (indicated by 1) or from Sout (indicated by 0).
We now have (y, ŷ, 1) ∼ Ain where ŷ = g(x) for (x, y) ∈ Sin, and (y, ŷ, 0) ∼ Aout where ŷ = g(x)
for (x, y) ∈ Sout.
2. Combine all of the produced Ain and Aout sets into (y, ŷ, b) ∼ A.
3. Train an attack model m : (y, ŷ) ↦ b using A to predict training-set membership b.
Now, given an instance (x, y), we can use m to predict b = m(y, g(x)), telling us whether (x, y) ∈ Sin,
the training set of the shadow model g. Interestingly, we can also use b = m(y, f(x)) to determine whether
(x, y) ∈ T1, the training set of the model we are attacking, f!
For the implementation, we are additionally asking you to create not one attack model m but rather a
family my of models, where my is specialized to instances of class y only. Thus, to make a membership guess
for an instance (x, y), we look up the prediction b = my(y, f(x)).
Implementation For the attack models my, you can use the following architecture, though we encourage
you to experiment:
• Let C be the number of classes of the target model (C can be obtained via shadow_labels.max() + 1).
The input of m has shape (None, 2C) (2C because m takes both the predicted distribution over labels
and the one-hot true label).
• One hidden layer of shape (None, 4C) with a ReLU activation.
• Output is of shape (None, 1) with a sigmoid activation.
• Binary crossentropy for the loss function.
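As a concrete illustration of the architecture above, a minimal Keras sketch of one attack model my might look as follows. This is a sketch only: it assumes tf.keras is available alongside the starter code, and make_attack_model is a hypothetical helper name, not part of hw5_part1.py.

```python
import tensorflow as tf

def make_attack_model(num_classes: int) -> tf.keras.Model:
    """One attack model m_y: input is the predicted distribution (C) concatenated with the one-hot label (C)."""
    C = num_classes
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4 * C, activation="relu", input_shape=(2 * C,)),  # hidden layer, shape (None, 4C)
        tf.keras.layers.Dense(1, activation="sigmoid"),                          # output, shape (None, 1)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```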
Coding Exercise 1 [4 points] Implement synthesize_attack_data in hw5_part1.py. This corresponds
to the first two points of the algorithm above.
Coding Exercise 2 [4 points] Implement build_attack_models in hw5_part1.py. This is the last point
of the algorithm above.
Coding Exercise 3 [2 points] Implement evaluate_membership in hw5_part1.py. This method applies
the attack models to a dataset to make their membership guesses.
The starter code in hw5_part1.py includes an invocation of the exercises on a model for CIFAR. You can
use it to test your solutions.
Tips:
• The build_attack_models function takes the target model, shadow data and labels (S), and the number
of shadow models to use for the attack. When splitting the shadow data into Sin and Sout, you should
use the DataSplit class (found in hw5_part1_utils.py). The constructor for DataSplit takes the
labels of the dataset you would like to split, and a seed (index from 0 to num_shadow_models). The
resulting object has two attributes, in_idx and out_idx, which give the list of indices into the original
data that form the “in” and “out” datasets. For example, with a DataSplit object split, Sin can be
obtained via shadow_data[split.in_idx] (see the sketch after these tips).
• The evaluate_membership function takes the attack models returned by build_attack_models, the
target model’s predictions on a set of points, and the true labels for the same set of points. Recall that,
while the attack model takes both the predicted labels and the one-hot true labels as input, there is
also a separate attack model for each class.
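To make the splitting and synthesis steps concrete, here is a rough sketch of how the shadow-model loop inside synthesize_attack_data might be organized. It is only a sketch under assumptions: train_shadow_model is a hypothetical helper standing in for however you construct and fit a shadow model, the returned tuple is one possible layout for the synthesized dataset A, and the starter code's actual signatures may differ.

```python
import numpy as np
from hw5_part1_utils import DataSplit  # provided with the starter code

def synthesize_attack_data_sketch(shadow_data, shadow_labels, num_shadow_models,
                                  train_shadow_model):
    """Rough sketch: build (features, membership bit, class label) rows for the attack dataset A."""
    num_classes = shadow_labels.max() + 1
    features, membership, classes = [], [], []

    for seed in range(num_shadow_models):
        split = DataSplit(shadow_labels, seed)               # exposes in_idx and out_idx
        g = train_shadow_model(shadow_data[split.in_idx],    # hypothetical helper that fits a
                               shadow_labels[split.in_idx])  # shadow model on S_in only

        for idx, bit in ((split.in_idx, 1), (split.out_idx, 0)):
            y = shadow_labels[idx]
            preds = g.predict(shadow_data[idx])              # predicted class distributions
            onehot = np.eye(num_classes)[y]                  # one-hot ground-truth labels
            features.append(np.concatenate([preds, onehot], axis=1))
            membership.append(np.full(len(y), bit))
            classes.append(y)

    return np.concatenate(features), np.concatenate(membership), np.concatenate(classes)
```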
II Mitigating Bias with Adversarial Learning [10 points]
In this part of the assignment, you will be implementing the GAN-like fair training routine of Zhang et al.
[4] described in lecture. You will implement two variants of the training procedure:
1. A variant that aims to achieve demographic parity, which we will identify with the condition
Pr[Ŷ = 1 | Z = 0] = Pr[Ŷ = 1 | Z = 1]
where prediction ŷ = 1 is the positive outcome and z ∈ {0, 1} indicates one of two groups (genders, etc.).
2. A variant that aims to achieve equality of opportunity for positive ground truth, identified by the
condition
Pr[Ŷ = 1 | Y = 1, Z = 0] = Pr[Ŷ = 1 | Y = 1, Z = 1]
where y = 1 indicates positive ground truth.
Your solution will fill in the missing pieces of hw5_part2.py. The starter code also includes invocations based on the
UCI Adult dataset (http://archive.ics.uci.edu/ml/datasets/Adult). We split the dataset for you into a demographic-features
matrix X, a class label y (1 indicates income >= 50k, the outcome we will consider positive), and a group (gender)
attribute z (1 indicates male).
The idea of the debiased training procedure is that we can mitigate bias in our classifier via a competition
between an adversary and the classifier. The classifier wants to predict the correct output while also
keeping the adversary from predicting the protected attribute. Meanwhile, the adversary wants to predict
the protected attribute. We give the adversary different information depending on which fairness objective
we aim to achieve: for demographic parity, the adversary only gets the classifier's prediction, while for
equality of opportunity, the adversary gets both the correct class and the classifier's prediction.
II-1 Demographic Parity [5 points]
Coding Exercise 4 [1 point] Implement the evaluate_dem_parity method in hw5_part2.py. This method
measures the demographic parity of a given model. It should return a tuple with two values: (1) the probability
that the prediction for group 0 is 1, and (2) the probability that the prediction for group 1 is 1. Demographic
parity is achieved if these two values are equal.
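A minimal sketch of this measurement, assuming the model exposes a predict method returning scores in [0, 1] and that z is a numpy array of group indicators (the function name below is hypothetical):

```python
import numpy as np

def evaluate_dem_parity_sketch(model, X, z):
    """Empirical Pr[prediction = 1 | group] for groups z = 0 and z = 1."""
    preds = (np.asarray(model.predict(X)).ravel() >= 0.5).astype(int)  # threshold scores at 0.5
    return preds[z == 0].mean(), preds[z == 1].mean()
```

The equality-of-opportunity measurement in Exercise 6 is the same computation restricted to instances with ground truth y = 1.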
Coding Exercise 5 [4 points] Implement the train_dem_parity method of the AdversarialFairModel
class in hw5_part2.py.
The train_dem_parity method returns nothing but should update self.classifier by training it accord-
ing to the following procedure:
1. Create the adversary and connect it to the classifier’s outputs.
2. Create operations for the loss, gradients, and parameter updates of the adversary.
3. Create operations for the loss, modified gradients, and parameter updates of the classifier.
4. For each epoch, train the adversary, then the classifier, on all batches (on epoch t, use a learning rate of
1/t and an α of √t). This will result in a learning rate that decreases with epochs and a slowly increasing
debiasing strength.
The adversary network should simply be a linear model with a single sigmoid output (as the protected
attribute is binary).
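For orientation, here is a minimal sketch of a single debiased classifier update, written with TensorFlow 2's GradientTape rather than the graph-style operations the starter code may expect. It assumes classifier and adversary are Keras models, and it uses the modified gradient of Zhang et al. [4]: the prediction-loss gradient, minus its projection onto the adversary-loss gradient, minus α times the adversary-loss gradient.

```python
import tensorflow as tf

def classifier_update_sketch(classifier, adversary, x, y, z, alpha, optimizer):
    """One debiased classifier step (demographic-parity variant): the classifier's gradient is
    stripped of its component along the adversary's gradient, then pushed against the adversary."""
    bce = tf.keras.losses.BinaryCrossentropy()
    y = tf.reshape(tf.cast(y, tf.float32), (-1, 1))
    z = tf.reshape(tf.cast(z, tf.float32), (-1, 1))

    with tf.GradientTape(persistent=True) as tape:
        y_hat = classifier(x, training=True)
        z_hat = adversary(y_hat, training=True)       # this adversary sees only the classifier's prediction
        loss_pred = bce(y, y_hat)                     # classifier's own objective
        loss_adv = bce(z, z_hat)                      # adversary's objective (classifier wants it large)
    grads_pred = tape.gradient(loss_pred, classifier.trainable_variables)
    grads_adv = tape.gradient(loss_adv, classifier.trainable_variables)
    del tape

    modified = []
    for g_p, g_a in zip(grads_pred, grads_adv):
        unit_a = g_a / (tf.norm(g_a) + 1e-8)          # direction that most helps the adversary
        proj = tf.reduce_sum(g_p * unit_a) * unit_a   # remove that component from the classifier gradient
        modified.append(g_p - proj - alpha * g_a)     # modified gradient of Zhang et al. [4]
    optimizer.apply_gradients(zip(modified, classifier.trainable_variables))
```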
II-2 Equality of Opportunity [5 points]
Coding Exercise 6 [1 point] Implement the evaluate_eq_op method in hw5_part2.py. This computes
a measure of equality of opportunity for a given model. We will focus on the positive ground truth (1). It
should return a tuple with two values: (1) the probability that the prediction for group 0 is 1, given that the
ground truth is 1, and (2) the probability that the prediction for group 1 is 1, given that the ground truth is
1. Equality of opportunity (for positive ground truth) is achieved if these values are equal.
Coding Exercise 7 [4 points] Implement the train_eq_op method of the AdversarialFairModel class in hw5_part2.py.
The general operation of this method is as described in the exercise for demographic parity, except that the
adversary receives both the classifier's prediction and the correct class as input.
You can test your implementation with the last part of the starter code hw5_part2.py.
III Mitigating Bias in Word Embeddings [10 points]
In this part you will implement the word-embedding debiasing technique of Bolukbasi et al. [1]. In the paper,
they refer to this technique as "hard-debiasing" or "neutralize and equalize". Please refer to Section 6
of the paper [1] for more details.
Before you get started Install the extra packages (gensim may be necessary; json is part of the Python
standard library) and download the word2vec word embedding. You will then need to unzip the data and
place it in the data folder:
pip install gensim
wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
gunzip GoogleNews-vectors-negative300.bin.gz
mv GoogleNews-vectors-negative300.bin data
We have provided the class that loads in the embedding in the starter code in hw5_part3.py. There are
four other json/txt files you need to use in your implementation, which are also loaded for you.
• definitional_pairs.json: Definitional pairs used to find the gender dimension.
• gender_specific_full.json: All words that you should not debias.
• equalize_pairs.json: Word pairs to equalize so that they become equidistant from the debiased gender-neutral
words.
• questions-word.txt: An evaluation dataset to test the performance of word embeddings. Each line
contains an analogy; some lines contain subcategory information marked with a colon, which you can ignore.
III-1 Debiasing word embeddings
Coding Exercise 8 [3 points] Complete the method identify_gender_subspace that extracts the gender
direction (1 dimension). This is done by performing PCA on the gender definitional words. You can use
np.linalg.svd for PCA. No other packages (such as sklearn) are allowed in this part.
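For reference, one way to set up the PCA (a sketch only; how the definitional pairs and embedding lookups appear in the starter code are assumptions) is to center each definitional pair and take the top right singular vector of the stacked, centered vectors:

```python
import numpy as np

def identify_gender_subspace_sketch(embedding, definitional_pairs):
    """Sketch: top principal component of the centered definitional-pair vectors."""
    rows = []
    for word_a, word_b in definitional_pairs:          # e.g. ("she", "he"), ("woman", "man"), ...
        a, b = embedding[word_a], embedding[word_b]    # assumes dict-like lookup of word vectors
        center = (a + b) / 2
        rows.append(a - center)                        # each pair contributes its two centered vectors
        rows.append(b - center)
    matrix = np.stack(rows)
    # PCA via SVD: the right singular vectors are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[0]                                       # the 1-dimensional gender direction
```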
Coding Exercise 9 [3 points] Complete the method neutralize that projects all gender-neutral words (the
complement of the gender-specific words) away from the gender axis.
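A minimal sketch of the projection for a single word vector, assuming the gender direction from the previous step and re-normalizing after removing the gender component:

```python
import numpy as np

def neutralize_sketch(vector, gender_direction):
    """Remove the component of `vector` along the gender direction, then re-normalize."""
    g = gender_direction / np.linalg.norm(gender_direction)
    projected = vector - np.dot(vector, g) * g     # component orthogonal to the gender axis
    return projected / np.linalg.norm(projected)
```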
Coding Exercise 10 [4 points] Complete the method equalize that makes sure both words within each
equalized pair are equidistant from the gender-neutral words.
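And a sketch of the equalize update from Section 6 of Bolukbasi et al. [1] for a single pair, assuming a one-dimensional gender direction and roughly unit-normalized word vectors (function and variable names are illustrative):

```python
import numpy as np

def equalize_pair_sketch(vec_a, vec_b, gender_direction):
    """Equalize one pair so both words end up equidistant from gender-neutral words."""
    g = gender_direction / np.linalg.norm(gender_direction)

    def along(v):                                   # component of v along the gender axis
        return np.dot(v, g) * g

    mu = (vec_a + vec_b) / 2
    nu = mu - along(mu)                             # shared gender-neutral part of the pair
    scale = np.sqrt(max(1.0 - np.linalg.norm(nu) ** 2, 0.0))
    new_vecs = []
    for v in (vec_a, vec_b):
        diff = along(v) - along(mu)                 # each word's gender component relative to the mean
        new_vecs.append(nu + scale * diff / np.linalg.norm(diff))
    return new_vecs[0], new_vecs[1]
```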
To evaluate the bias and utility of the original and debiased embeddings, you can use compute_analogy,
which computes the fourth word given three words of an analogy. This is done by finding a word (different from
the three given words) whose vector is closest (in terms of inner product) to the fourth vertex of the parallelogram
whose other three vertices are occupied by the given words. The end of hw5_part3.py also includes an invocation
of the requested methods that you can test your solution with.
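For intuition, a sketch of that parallelogram computation, assuming the embedding is exposed as a word list plus an (n_words, d) matrix of vectors (the starter code's actual interface may differ):

```python
import numpy as np

def compute_analogy_sketch(word_a, word_b, word_c, words, vectors):
    """Return the word d (!= a, b, c) whose vector has the largest inner product with b - a + c."""
    index = {w: i for i, w in enumerate(words)}
    target = vectors[index[word_b]] - vectors[index[word_a]] + vectors[index[word_c]]
    scores = vectors @ target
    for w in (word_a, word_b, word_c):
        scores[index[w]] = -np.inf                  # exclude the three given words
    return words[int(np.argmax(scores))]
```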
References
[1] Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. Man is to computer
programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information
Processing Systems, pages 4349–4357, 2016. URL https://arxiv.org/abs/1607.06520.
[2] Milad Nasr, Reza Shokri, and Amir Houmansadr. Comprehensive privacy analysis of deep learning: Stand-alone
and federated learning under passive and active white-box inference attacks. arXiv preprint arXiv:1812.00910,
2018. URL https://arxiv.org/pdf/1812.00910.

[3] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against
machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017. URL
https://arxiv.org/abs/1610.05820.
[4] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning.
CoRR, abs/1801.07593, 2018. URL http://arxiv.org/abs/1801.07593.
