Projects

Due date: Dec 9th, 11:59pm

Requirements: You only need to pick and work on one of the problems below. Implement your solution and submit your code to github@IU under a folder called projects. Include a detailed README file that explains how to use your programs, which data sets you used to test them (if you wrote programs to simulate the data sets, please submit that simulation code as well), and your results.

Problem 1

Select any two of the point representations described in Shaffer Chapter 13 (i.e., the k-d tree, the PR quadtree, the bintree, and the point quadtree). Implement your two choices and compare them over a wide range of data sets. Describe which is easier to implement, which appears to be more space efficient, and which appears to be more time efficient. You may use this city database (downloaded from the United States Cities Database) and its subsets for testing. You are welcome to use other point data sets that you find online (if so, make sure that you describe them in your README file).
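
As a starting point for one of the four choices, here is a minimal sketch of a two-dimensional k-d tree with insertion and rectangular range search. The names KDNode, kd_insert, and kd_range_search are hypothetical, and the convention that ties go to the right subtree is an assumption of this sketch, not a requirement of the assignment.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # e.g. (x, y) or (longitude, latitude)


@dataclass
class KDNode:
    """One node of a 2-d k-d tree (hypothetical name)."""
    point: Point
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None


def kd_insert(root: Optional[KDNode], point: Point, depth: int = 0) -> KDNode:
    """Insert a point and return the (possibly new) subtree root.

    The discriminating axis alternates with depth: x at even levels,
    y at odd levels. Assumption: equal keys go to the right subtree.
    """
    if root is None:
        return KDNode(point)
    axis = depth % 2
    if point[axis] < root.point[axis]:
        root.left = kd_insert(root.left, point, depth + 1)
    else:
        root.right = kd_insert(root.right, point, depth + 1)
    return root


def kd_range_search(root: Optional[KDNode], lo: Point, hi: Point,
                    depth: int = 0,
                    found: Optional[List[Point]] = None) -> List[Point]:
    """Collect every point p with lo[i] <= p[i] <= hi[i] on both axes."""
    if found is None:
        found = []
    if root is None:
        return found
    if all(lo[i] <= root.point[i] <= hi[i] for i in range(2)):
        found.append(root.point)
    axis = depth % 2
    # The left subtree holds keys strictly smaller than the root on this axis.
    if lo[axis] < root.point[axis]:
        kd_range_search(root.left, lo, hi, depth + 1, found)
    # The right subtree holds keys greater than or equal to the root.
    if hi[axis] >= root.point[axis]:
        kd_range_search(root.right, lo, hi, depth + 1, found)
    return found


if __name__ == "__main__":
    root = None
    for p in [(40, 45), (15, 70), (70, 10), (69, 50), (66, 85), (85, 90)]:
        root = kd_insert(root, p)
    print(sorted(kd_range_search(root, (60, 40), (90, 95))))
```

Because both functions are recursive, inserting a large data set in sorted order can produce a very deep tree. Shuffling the input first, or rewriting the functions iteratively, is one way to avoid this when timing the structure on the full city database.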

Problem 2

For this problem, you will construct and analyze a gossip network using tweets that you collect about a topic (e.g., Tesla or Uber) over a period of several days. You may use the links provided here to retrieve tweets mentioning Tesla, Uber, TikTok, or Minecraft (please let me know if you are interested in a topic other than these). Based on the tweets you collect, build a gossip network and analyze it to learn something about your chosen topic. Can you use Cytoscape to visualize the network? What do people talk about most when they talk about Tesla/Uber/TikTok/Minecraft? Over how many days did you collect tweets? Are there other questions that you would like to ask of the gossip network you constructed?
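
The assignment does not define the network precisely. One possible reading, assumed here, is a word co-occurrence network: each node is a word, and an edge weight counts the tweets in which two words appear together. The sketch below builds such a network from a list of tweet texts and lists the words that most often accompany a topic. The function names, the tiny stopword list, and the CSV column names are all illustrative assumptions.

```python
import csv
import re
from collections import Counter
from itertools import combinations

# A deliberately tiny stopword list; a real analysis would need a fuller one.
STOPWORDS = {"the", "and", "for", "with", "this", "that", "are", "was", "has"}


def tokenize(text):
    """Lowercase a tweet and keep word-like tokens longer than two characters."""
    words = re.findall(r"[a-z0-9#@]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]


def build_cooccurrence(tweets):
    """Count, for each unordered word pair, the tweets containing both words."""
    edges = Counter()
    for text in tweets:
        words = sorted(set(tokenize(text)))  # each pair counted once per tweet
        for a, b in combinations(words, 2):
            edges[(a, b)] += 1
    return edges


def top_neighbors(edges, topic, k=5):
    """Return the k words that co-occur most often with the topic word."""
    counts = Counter()
    for (a, b), weight in edges.items():
        if a == topic:
            counts[b] += weight
        elif b == topic:
            counts[a] += weight
    return counts.most_common(k)


def write_edge_csv(edges, path):
    """Write the edges as a source,target,weight table."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "target", "weight"])
        for (a, b), weight in sorted(edges.items()):
            writer.writerow([a, b, weight])


if __name__ == "__main__":
    sample = ["Tesla recalls cars", "Tesla stock up", "Uber and Tesla stock"]
    network = build_cooccurrence(sample)
    print(top_neighbors(network, "tesla", k=3))
```

A delimited edge table like the one written by write_edge_csv can be loaded into Cytoscape through its import-network-from-file option, with the source and target columns mapped to nodes and the weight column kept as an edge attribute.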

Problem 3

You have worked on the LCS problem and implemented a DP algorithm to compute the edit distance between two input strings. You have also worked on strings with biological meaning (DNA sequences, which are strings over the alphabet {A, T, C, G}). We would expect two similar DNA sequences to share a longer LCS and to have a smaller edit distance than two DNA sequences that are less similar to each other. For this problem, you will study the relationship between the LCS and the edit distance using DNA sequences: is there an inverse correlation between these two metrics (i.e., when one goes up, the other goes down), and if so, how strong is it? You may use these two files to test your programs and ideas: small.txt and median.txt (each file contains multiple DNA sequences in the FASTA format). Briefly, in a FASTA file, each line starting with > gives the name of a sequence, and the lines below each name line are the actual sequence. Below is a toy example with two sequences (toy1 and toy2):
>toy1 name of sequence 1
TACTGATGGGGAGAGDTAT
>toy2 name of sequence 2
ACTGATCATCGGGATAGAGEGAGE
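
One way to approach the analysis is sketched below: parse the FASTA text, compute the LCS length and the edit distance for every pair of sequences, and measure the linear relationship with a Pearson correlation coefficient. The function names are hypothetical. The sketch also assumes unit-cost insertions, deletions, and substitutions (Levenshtein distance) and uses Pearson correlation as the measure; neither choice is prescribed by the assignment.

```python
from itertools import combinations
from math import sqrt


def parse_fasta(text):
    """Map each sequence name (first word after '>') to its full sequence."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line  # a sequence may span several lines
    return seqs


def lcs_length(a, b):
    """Length of the longest common subsequence, using two DP rows."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance: unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def pairwise_metrics(seqs):
    """Return (name1, name2, lcs, edit_distance) for every pair of sequences."""
    return [(n1, n2, lcs_length(s1, s2), edit_distance(s1, s2))
            for (n1, s1), (n2, s2) in combinations(seqs.items(), 2)]


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if a list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


if __name__ == "__main__":
    demo = ">s1 first\nACGT\nAC\n>s2 second\nAGTAC\n>s3 third\nTTTT\n"
    rows = pairwise_metrics(parse_fasta(demo))
    for row in rows:
        print(row)
    print(pearson([r[2] for r in rows], [r[3] for r in rows]))
```

Sequence length affects both metrics, so raw values from sequences of very different lengths may not be directly comparable. For strings of lengths m and n, the Levenshtein distance d satisfies max(m, n) - LCS <= d <= m + n - 2*LCS, which suggests normalizing both metrics by length before correlating them.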