
Molecular evolution assignment -- Part 2

 
Read the README-FIRST (README-FIRST.ipynb) notebook first!
Fitting and interpreting maximum likelihood models
The questions in this notebook are worth 13%
There is an extension question in this notebook worth 1%.
Q1 – Hypothesis test comparing GTR and HKY85
Q1a
Perform a hypothesis test that compares GTR and HKY85. Refer to the online notes
( ) and the hypothesis test demo
(demo_hypothesis_test.ipynb).
Apply your hypothesis test to the alignment at aln_path .
In [ ]:
In [ ]:
Q1b
Which model is the alternate hypothesis?
What is your conclusion from the hypothesis test?
ANUID = ""
aln_path = "data/part2/cds/ENSG00000011007.fasta"
# This part worth 1
# Enter your code here
raise NotImplementedError  # No Answer - remove if you provide an answer
2022/6/28 1111 - Jupyter Notebook
 
YOUR ANSWER HERE
Q1c
Assuming the null hypothesis is correct, what is the probability of a LR ≥ that observed?
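The probability asked for here is the upper tail of the chi-square distribution evaluated at the observed likelihood ratio statistic. The following is a minimal sketch of that arithmetic in pure Python, not the course's reference solution. The log-likelihood values are made up for illustration, and the degrees of freedom (4) is an assumption based on GTR commonly having five free exchangeability terms and HKY85 one (kappa); confirm it against the number of free parameters your fitted models report.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square with even df.

    For df = 2m the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form only holds for positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    return math.exp(-half) * sum(
        half**i / math.factorial(i) for i in range(df // 2)
    )


def likelihood_ratio_test(lnL_null: float, lnL_alt: float, df: int):
    """Return (LR, p-value) where LR = 2 * (lnL_alt - lnL_null)."""
    LR = 2.0 * (lnL_alt - lnL_null)
    return LR, chi2_sf_even_df(LR, df)


# Hypothetical log-likelihoods, NOT results from the assignment data.
lnL_hky85 = -6512.40  # null: the simpler, nested model
lnL_gtr = -6505.15  # alternate: the more parameter-rich model

# df = 4 is an assumption; check it against your fitted models.
LR, pvalue = likelihood_ratio_test(lnL_hky85, lnL_gtr, df=4)
print(f"LR = {LR:.2f}, p-value = {pvalue:.4g}")
```

If `scipy` is available, `scipy.stats.chi2.sf(LR, df)` computes the same tail probability for any positive df.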
YOUR ANSWER HERE
Q1d
Display the MLEs for the model you accepted
In [ ]:
Q1e
In ≤ 50 words.
Identify what you think is the most striking difference between the MLEs from the two models.
Explain why you thought it was striking and what it reveals about the nature of substitutions affecting
this gene.
Feel free to add a compute cell to display those estimates.
YOUR ANSWER HERE
Q1f
In ≤50 words.
List the assumptions common to both HKY85 and GTR models.
List the assumptions unique to HKY85
List the assumptions unique to GTR
YOUR ANSWER HERE
Q2 – Analysis of the substitution rates by sequence "class"
Background
# This part worth 0.1
# Enter your code here
raise NotImplementedError  # No Answer - remove if you provide an answer
In the sequence comparison assignment you sought to understand where the information
specifying Transcription Factor (TF) binding occurs. As discussed in the background material for that
topic, these elements tend to lie within the "proximal promoter" region, i.e. 5' of the
transcription start site of the gene.
Motivated by this evidence for function in the 5' proximal region, we now seek to evaluate
whether the number of substitutions in such "promoter" regions reflects this function.
Experimental design
Read the data sampling for likelihood analyses (data_description.html) as it contains critical
information regarding the study design.
That work produced a tab delimited result file ( data/part2/results-summed_lengths.tsv ) that
contains estimates of the sum of branch lengths from the entire tree for each gene. (See the online
notes regarding branch lengths
( /molevol/substitution_models.html#time-in-molecular-evolution).)
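To make "sum of branch lengths from the entire tree" concrete, here is a minimal sketch that adds up every branch length in a Newick string. The tree below is a made-up example, not one of the assignment trees, and the regular expression assumes plain unquoted tip names (a quoted name containing a colon would confuse it).

```python
import re

# Matches the number that follows each ':' in a Newick string.
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def summed_branch_length(newick: str) -> float:
    """Return the total of all branch lengths found in a Newick string."""
    return sum(float(value) for value in _BRANCH_LENGTH.findall(newick))


# Hypothetical three-taxon tree used only for illustration.
example_tree = "((a:0.1,b:0.2):0.05,c:0.3);"
total = summed_branch_length(example_tree)
print(f"summed branch length = {total:.2f}")
```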
Q2a
In ≤ 100 words.
With reference to the Neutral Theory, make a prediction regarding how you expect the branch
lengths to differ (or not) between at least 2 of the sequence types. Your prediction must include
the logical reasoning by which you arrived at it.
YOUR ANSWER HERE
Q2b
This part worth 2
In ≤ 300 words.
Choose an appropriate statistical testing procedure (see the online notes
( /cogent3/statistical_tests.html) or use scipy.stats ) to test the
prediction you made above
select the sequence types you will use and justify the decision
specify what hypothesis tests you will conduct (each one will require specifying the null and
alternate) and what testing procedure you will use, to evaluate this prediction
justify why you chose that procedure
NOTE
If you think the best statistical procedure is not implemented in cogent3 or in scipy :
implement it yourself if you can; or
pick what you think is closest to it from those in cogent3 and explain what you see as
limitations of the choice.
Do some literature research to choose the most suitable test for your hypothesis. Here's one
reference page (https://en.wikipedia.org/wiki/Location_test) for statistical testing, and another
(https://en.wikipedia.org/wiki/Paired_difference_test) to get you started.
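As an illustration of what a paired difference test looks like, below is a minimal sketch of an exact two-sided sign test in pure Python. It is one possible paired procedure, not necessarily the most suitable one for this question, and the values are invented for the example. Under the null hypothesis, positive and negative differences are equally likely, so the count of the rarer sign follows a Binomial(n, 0.5) distribution.

```python
from math import comb


def sign_test(pairs):
    """Exact two-sided sign test on (a, b) pairs.

    Returns (n_pos, n_neg, p_value). Tied pairs (a == b) are dropped,
    which is the conventional treatment for this test.
    """
    diffs = [a - b for a, b in pairs if a != b]
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)
    n_neg = n - n_pos
    if n == 0:
        return 0, 0, 1.0
    k = min(n_pos, n_neg)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return n_pos, n_neg, min(1.0, 2 * tail)


# Hypothetical paired measurements, NOT values from the assignment data:
# 8 pairs where the first value is larger, 2 where it is smaller, 1 tie.
example_pairs = [(1.0, 0.5)] * 8 + [(0.5, 1.0)] * 2 + [(1.0, 1.0)]
n_pos, n_neg, p_value = sign_test(example_pairs)
print(n_pos, n_neg, p_value)
```

The sign test uses only the direction of each difference, so it makes few assumptions but discards magnitude information. Whether that trade-off is acceptable is part of the justification this question asks for.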
YOUR ANSWER HERE
Q2c
Apply your chosen procedure to the stored data at tsv_path .
For information on how to manipulate the tabular data, refer to the online notes
( /cogent3/tables.html).
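If you prefer to work with the standard library, the sketch below shows one way to group tab-delimited rows by gene and extract paired values for two sequence types. The column names (`gene`, `seq_type`, `length`), the type labels and the numbers are all hypothetical stand-ins: inspect the header of the real file and adjust accordingly.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the results file; the real columns may differ.
example_tsv = (
    "gene\tseq_type\tlength\n"
    "G1\tcds\t0.42\n"
    "G1\tpromoter\t0.61\n"
    "G2\tcds\t0.30\n"
    "G2\tpromoter\t0.55\n"
    "G3\tcds\t0.50\n"
)


def lengths_by_gene(handle):
    """Map gene -> {sequence type: summed branch length}."""
    table = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        table[row["gene"]][row["seq_type"]] = float(row["length"])
    return dict(table)


def paired_values(table, type_a, type_b):
    """Return (a, b) pairs for genes that have both sequence types."""
    pairs = []
    for gene in sorted(table):
        values = table[gene]
        if type_a in values and type_b in values:
            pairs.append((values[type_a], values[type_b]))
    return pairs


table = lengths_by_gene(io.StringIO(example_tsv))
pairs = paired_values(table, "cds", "promoter")
print(pairs)
```

For the real data, replace `io.StringIO(example_tsv)` with an open file handle, for example `open(tsv_path, newline="")`.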
In [ ]:
In [ ]:
Q2d
In ≤ 300 words.
From the results of the hypothesis test(s) you performed, draw your conclusion regarding your
hypothesis.
Explain your result(s) with reference to the Neutral Theory. In your answer consider the following:
do the results make sense?
was there anything surprising?
if you think there are limitations of the design / analysis that compromise the ability to draw
conclusions, state them
by design I mean the sequence sampling protocol
by analyses I mean the properties of the methods for estimating the branch lengths
are there alternate explanations?
This part worth 3.2
YOUR ANSWER HERE
Q3 – extension question
Worth 1
tsv_path = "data/part2/results-summed_lengths.tsv"
# Enter your code here
raise NotImplementedError  # No Answer - remove if you provide an answer
Is there a way to sample columns from the protein coding sequence alignments so that the variation
is more likely to be neutral?
Explain and justify (≤ 100 words)
implement it using only pure python and cogent3
apply it to a couple of alignments and compare the results to not using your procedure
what is the limitation of the approach
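As a starting point for the mechanics of column sampling, here is a minimal pure-Python sketch that keeps every third column of an in-frame coding alignment. Third codon positions are only one candidate to consider, not a confirmed answer to this question, and the sketch assumes the alignment starts in frame with a length divisible by 3. The sequences are invented for illustration.

```python
def third_codon_positions(aligned_seqs):
    """Return a new alignment holding only the third position of each codon.

    aligned_seqs maps sequence name -> aligned sequence string. All
    sequences must share one length that is a multiple of 3.
    """
    lengths = {len(seq) for seq in aligned_seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must all have the same aligned length")
    (length,) = lengths
    if length % 3:
        raise ValueError("alignment length must be divisible by 3")
    # Slice [2::3] picks 0-based indices 2, 5, 8, ...: the third base of each codon.
    return {name: seq[2::3] for name, seq in aligned_seqs.items()}


# Hypothetical two-codon alignment, NOT taken from the assignment data.
example_alignment = {"a": "ATGGCC", "b": "ATGGCT"}
sampled = third_codon_positions(example_alignment)
print(sampled)
```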
YOUR ANSWER HERE
In [ ]: # your code
# Enter your code here