NLP,。,。
Core Description
For the core, you will implement a program that creates a model of a music artist’s lyrics. This model receives lyric data as input and ultimately generates new lyrics in the style of that artist. To do this, you will leverage an NLP concept called an n-gram and use an NLP technique called language modeling.
Your understanding of the linked concepts and definitions is crucial to your success, so make sure to understand n-grams, language modeling, Python dictionaries as taught in the warmup, and classes and inheritance in Python before attempting to implement the core.
The core does not require you to include any external libraries beyond what has already been included for you. Use of any other external libraries is prohibited on this part of the project.
Core Structure
In the language-models/folder, you will find four files which contain class definitions: nGramModel.py, unigramModel.py, bigramModel.py, and trigramModel.py. You must complete the prepData, weightedChoice, and getNextToken functions in nGramModel.py. You must also complete the trainModel, trainingDataHasNGram, and getCandidateDictionary functions in each of the other three files.
In the root CreativeAI repository, there is a file called generate.py, which will be the driver for generating both lyrics and music. For the core, you will implement the trainLyricsModels, selectNGramModel, generateSentence, and runLyricsGenerator functions; these functions will be called, directly or indirectly, by main, which is written for you.
We recommend that you implement the functions in the order they are listed in the spec; start with prepData and work your way down to runLyricsGenerator.
Getting New Lyrics (Optional)
If your group chooses to use lyrics from an artist other than the Beatles, you can use the web scraper we have written to get the lyrics of the new artist and save them in the data/lyrics directory for you. A web scraper is a program that gets information from web pages: ours, which lives in the data/scrapers directory.
If you navigate to the data/scrapers folder and run the lyricsWikiaScraper.py file, you will be prompted to input the name of an artist. If that artist is found on lyrics.wikia.com, the program will make a folder in the data/lyrics directory for that artist, and save each of the artist’s songs as a .txt file in that folder.
Explanation of Functions to Implement
prepData
The purpose of this function is to take input data in the form of a list of lists, and return a copy of that list with symbols added to both ends of each inner list.
For the core, these inner lists will be sentences, which are represented as lists of strings. The symbols added to the beginning of each sentence will be ^::^ followed by ^:::^, and the symbol added to the end of each sentence will be $:::$. These are arbitrary symbols, but make sure to use them exactly and in the correct order.
For example, if the function is passed this list of lists:
[ [apos;heyapos;, apos;judeapos;], [apos;yellowapos;, apos;submarineapos;] ]
Then it would return a new list that looks like this:
[ [apos;^::^apos;, apos;^:::^apos;, apos;heyapos;, apos;judeapos;, apos;$:::$apos;], [apos;^::^apos;, apos;^:::^apos;, apos;yellowapos;, apos;submarineapos;, apos;$:::$apos;] ]
The purpose of adding two symbols at the beginning of each sentence is so that you can look at a trigram containing only the first English word of that sentence. This captures information about which words are most likely to begin a sentence; without these symbols, you would not be able to use the trigam model at the beginning of sentences because there would be no trigrams to look at until the third word.
The purpose of adding a symbol to the end of each sentence is to be able to generate sentence endings. If you ever see $:::$ while generating a sentence in the generateSentence function, you know the sentence is complete.
trainModel
This function trains the NGramModel child classes on the input data by building their dictionary of n-grams and respective counts, self.nGramCounts. Note that the special starting and ending symbols also count as words for all NGramModels, which is why you should use the return value of prepData before you create the self.nGramCounts dictionary for each language model.
- For the unigram model, self.nGramCounts will be a one-dimensional dictionary of {unigram: unigramCount} pairs, where each unique unigram is somewhere in the input data, and unigramCount is the number of times the model saw that particular unigram appear in the data. The unigram model should not consider the special symbols ‘^::^’ and ‘ⰺ›㼼㱣⼾㸤ഺ䠺㩣†††䐍䠠䝩㰠⽴㹥ഠ㱩㍲㵭≯䌠≯㹮䌠㱯⼭㍩㹭੮㱳㹯†⁵⁴›†⁃⁽Ⱐ†♡㭭†⁵⸠䥩⁰†⁹†⁴⁶⸠㱯⁴⽨㹥ഠੂ†⁃⁴㩣㱴⽩㹮ൡੲ㱹㹵㱬㹯㭥㭩㨾㐊ⰼ 㬊㭥㨠ㅬⱡ㭩㭰㩹⁴㍨Ɐ㬼㬠㨠†㈠⁴㱢⽯㸊㰠⼠㸠ഠਠ㰼㹲䠾☠㬠†⁴⁴• ††䘢ⱳ⁰‾㨠⁰‼ Ⱟ‾†‼‾‾⸠†䠠☠㭴‾ †††♣㬢☾㬠††Ⱪ♮㭯☽㬻‼‼†㰢⼾㸠ഠਠ㰼㹰䙡ⱴ‧⁹ⱡ†′㝡•ⸯ†‼‰㈼Ⱟ♰㬲☯㬼Ɫ‾††‧㜼Ⱟ›′♰㭡‧‼㠯⁰ⱡ⁵⁰㉡㰠⽣㹡൳ੳ㰽㍳⁴㵮≧≰㹡㱬⽡㍳㸽ഢ੮㱵㹢‴‱†䤠•‾㨯㱳⽰㹮ാ਼㱢㸯ാ਼ ⁰㰾㸠䌠†䌠䐊 †‾ 㰾⼍㹵൬ਾ ††㰠㹬†䌠䑲†䍳⸮㰠⽉㹷൩੬†㰠㹲䍻㱮⽡㹭ൔ੨›䄠ⱽ††⁴††ⱔ⁷††⁰ ‾•Ɱ⁴⁴⁎⸼㱰⼾㹩൳ਠ㱦⽵㹴൩੯㱮㍴㵳∠䱴䴠∠㹩䱩䴠㱧⽳㌠㹡൮㰠㹥†⁴†⁵†♤㭩⁴ⴠ⁴ⱳ††⁴ⱡ†⁴䑃䱴†Ⱪ⁴⁵‾䱬⸠䤠⁴⁴⁴⁹䱬†Ⅷ㱲†⽭㹯⁵†⁴㱮⽴㹸൴ਮ㰼㍬㴊∠䝆䵴≩㹧䝥䵴㱦⽵㍣㹴൩੯㱮㹲⁴†Ɱ⁴†⁴⸮†䥈ⁱ※⁴…††⁴㬠ⱡ⁴ ††䥧Ɐ⁔⁵⁴⁴⁴䅤††ⱴ†⁴††‼⁄⁴⁹䝴䵣 ‾‾⁴†⁴㰠⽣㹮੩㱤㍴㵥≸≤㹥㰠⽳㍮㹴੮㱣㸮⁴†⁷†⸠†…ⁱⱳ†⁴⁵††⁴††℠㱩⁴⽨㹥ഠੳ䑥⁴⁷††⁴⁴⁷⁵㩴⁴䱹Ⱐ⁴⁴⁄⁈ⱇ⁴†Ⱪ†† ††⁴⁷㱩㹯⑯㩫㨠㩡⑴㰠⽴㹡⁴䤠††Ⱜ†㱃⽴㹳ഠੵ†††††⁵†Ⅹ㱣⽨㹮ഠ㰠㌬†㴠≴䱵䝭≯㹯䱡䜠㰠⼰㍷㹯൲㱳㹩††††※⁷†Ⱐ⁵ⱬ†⁴䕃⽩㑮⁴Ɱ⁇⁵⁹㩤⁵ⁱ†Ⱐ⁷††⁷†䱮⡬†⁴⤠⁹†ⴠ††⁴†㰠⼠㹴ੳ䅴†ⱴⱦ†⁴Ⰽ ⁰䱲•Ⱳ†††⸼ †††††⸠㱴⽨㸠൭੯㱤㉬㵩≴䔦䜠ⴠ䙩≮㹣䕩†䝥†䙳㱯⽷㈠㹹൯ੵ㱲㍰㵲≡䰠≭㹯䱩㱦⽩㍩㹥൮㱹㹩†ⱡ䱆䱯䱵††‱⁴䤩†⁷⁴⁴†Ⱪ䰰⁴⁹†ⰰ†⁴⁴†⸬‰‰† †䰠㭥†⁴†䱦ⱥ⁰⸠⁰††䙮㱮″⼰㸠൳䥣†⁸䑥䕮††⁴⁰Ⱨ⁰㱤⼽㹈൯㰭㍯㵮∭䱲∭㹇䱹㰾⽈㍷㸠൴੯㰠㹵⁵′‾⁹⁹㩮†Ⱞ⁰Ⱪ䤦※※††㨮†⁵ⱥ†Ⱨ†⁴Ᵽ㱩⽮㸬ഠ੮㱡㍩㵥∠䥲≤㹩䤠㱥⼠㍯㹵൲ਠ㱃㹥†㩢ⱨ⁹†㭮Ɽ†⁵⁴⁴†♲㬠⁵㭯⁸Ⱶ†⁵††††† ⸀㰀 ⼀㸀ഀ ☀㬀 ㌀Ⰰ ⸀ 䌀 㬀 ㈀ 㬀 ㌀ ⸀㰀⼀㸀ഀ㰀㌀ 㴀∀∀㸀㰀⼀㌀㸀ഀ㰀㸀 䰀䴀 䴀䴀 ⸀ Ⰰ 䤀 䰀䜀Ⰰ 䴀䜀⸀ 䴀䴀 䴀䜀 ☀㬀 ⸀㰀 ⼀㸀ഀ䄀 ☀㬀 Ⰰ ☀㬀 Ⰰ ⸀ 䴀 ☀㬀 ⸀㰀⼀㸀ഀ㰀㈀ 㴀∀䠀ⴀⴀⴀⴀ∀㸀䠀 㰀⼀㈀㸀ഀ㰀㸀䄀 Ⰰ 㨀㰀⼀㸀ഀ㰀㸀㰀 㴀∀ ∀㸀ഀ㰀㸀ഀ 㰀㸀ഀ 㰀㸀ഀ 㰀 㴀∀∀㸀ഀ 㰀㸀㰀 㴀∀∀㸀㰀⼀㸀㰀 ⼀㸀㰀⼀㸀ഀ 㰀⼀㸀ഀ 㰀 㴀∀∀㸀ഀ 㰀㸀㰀 㴀∀∀㸀㰀 㴀∀∀㸀㰀⼀㸀 㴀㴀 㰀 㴀∀∀㸀✀✀㰀⼀㸀㨀㰀⼀㸀㰀 ⼀㸀㰀⼀㸀ഀ 㰀⼀㸀ഀ 㰀⼀㸀ഀ 㰀⼀㸀ഀ㰀⼀㸀ഀ㰀⼀㸀㰀⼀㸀ഀ㰀㸀☀㬀㰀⼀㸀ഀ㰀㸀䈀 Ⰰ ⸀ 䤀 Ⰰ ⴀ Ⰰ ⸀㰀 ⼀㸀ഀ Ⰰ ⸀ 䘀 Ⰰ ☀㬀 䴀⸀Ⰰ 䌀Ⰰ ☀㬀 㨀㰀 ⼀㸀ഀ 䴀⸀㰀 ⼀㸀ഀ ⸀Ⰰ ⴀ Ⰰ