CodeBin version: 02/18/2017 - V.1
import sys
import nltk
import math
import time
import collections

START_SYMBOL = '*'
STOP_SYMBOL = 'STOP'
RARE_SYMBOL = '_RARE_'
RARE_WORD_MAX_FREQ = 5
LOG_PROB_OF_ZERO = -1000


# TODO: IMPLEMENT THIS FUNCTION
# Receives a list of tagged sentences and processes each sentence to generate
# a list of words and a list of tags.
# Each sentence is a string of space-separated "WORD/TAG" tokens, with a
# newline character at the end.
# Remember to include start and stop symbols in your returned lists, as
# defined by the constants START_SYMBOL and STOP_SYMBOL.
# brown_words (the list of words) should be a list where every element is a
# list of the words of a particular sentence.
# brown_tags (the list of tags) should be a list where every element is a
# list of the tags of a particular sentence.
def split_wordtags(brown_train):
    brown_words = []
    brown_tags = []
    return brown_words, brown_tags


# TODO: IMPLEMENT THIS FUNCTION
# This function takes tags from the training data and calculates tag trigram
# probabilities. It returns a python dictionary where the keys are tuples
# that represent the tag trigram, and the values are the log probability of
# that trigram.
def calc_trigrams(brown_tags):
    q_values = {}
    trigram_count = collections.Counter()
    bigram_count = collections.Counter()
    for sentence in brown_tags:
        tokens = sentence.strip().split()
        bigram_tokens = [START_SYMBOL] + tokens + [STOP_SYMBOL]
        trigram_tokens = [START_SYMBOL, START_SYMBOL] + tokens + [STOP_SYMBOL]
        for bigram in nltk.bigrams(bigram_tokens):
            bigram_count[bigram] += 1
        for trigram in nltk.trigrams(trigram_tokens):
            trigram_count[trigram] += 1
    # Every sentence contributes one (START, START) context that the bigram
    # pass above never sees; set it explicitly so q(t | *, *) is defined.
    bigram_count[(START_SYMBOL, START_SYMBOL)] = len(brown_tags)
    # Trigram probability: q(c | a, b) = log2(count(a, b, c) / count(a, b))
    q_values = {t: math.log(float(v) / bigram_count[t[:2]], 2)
                for t, v in trigram_count.iteritems()}
    return q_values


# This function takes output from calc_trigrams() and outputs it in the
# proper format
def q2_output(q_values, filename):
    outfile = open(filename, "w")
    trigrams = q_values.keys()
    trigrams.sort()
    for trigram in trigrams:
        output = " ".join(['TRIGRAM', trigram[0], trigram[1], trigram[2],
                           str(q_values[trigram])])
        outfile.write(output + '\n')
    outfile.close()


# TODO: IMPLEMENT THIS FUNCTION
# Takes the words from the training data and returns a set of all of the
# words that occur more than 5 times (use RARE_WORD_MAX_FREQ).
# brown_words is a python list where every element is a python list of the
# words of a particular sentence.
# Note: words that appear exactly 5 times should be considered rare!
def calc_known(brown_words):
    known_words = set([])
    words = collections.Counter()
    for sentence in brown_words:
        for word in sentence:
            words[word] += 1
    for word, count in words.iteritems():
        if count > RARE_WORD_MAX_FREQ:
            known_words.add(word)
    return known_words


# TODO: IMPLEMENT THIS FUNCTION
# Takes the words from the training data and a set of words that should not
# be replaced by '_RARE_'.
# Returns the equivalent of brown_words but with the unknown words replaced
# by '_RARE_' (use the RARE_SYMBOL constant).
def replace_rare(brown_words, known_words):
    brown_words_rare = []
    for sentence in brown_words:
        # Preserve the sentence structure, substituting RARE_SYMBOL for each
        # word that is not in the known set.
        brown_words_rare.append([word if word in known_words else RARE_SYMBOL
                                 for word in sentence])
    return brown_words_rare


# This function takes the output from replace_rare() and outputs it to a file
def q3_output(rare, filename):
    outfile = open(filename, 'w')
    for sentence in rare:
        outfile.write(' '.join(sentence[2:-1]) + '\n')
    outfile.close()


# TODO: IMPLEMENT THIS FUNCTION
# Calculates emission probabilities and creates a set of all possible tags.
# The first return value is a python dictionary where each key is a tuple in
# which the first element is a word and the second is a tag, and the value is
# the log probability of the emission of the word given the tag.
# The second return value is a set of all possible tags for this data set.
def calc_emission(brown_words_rare, brown_tags):
    e_values = {}
    taglist = set([])
    e_values_count = collections.Counter()
    taglist_count = collections.Counter()
    for word_s, taglist_s in zip(brown_words_rare, brown_tags):
        for word, tag in zip(word_s, taglist_s):
            e_values_count[word, tag] += 1
            taglist_count[tag] += 1
    # Emission probability: e(word | tag) = log2(count(word, tag) / count(tag))
    for (word, tag), count in e_values_count.iteritems():
        e_values[(word, tag)] = math.log(float(count) / taglist_count[tag], 2)
        taglist.add(tag)
    return e_values, taglist


# This function takes the output from calc_emission() and outputs it
def q4_output(e_values, filename):
    outfile = open(filename, "w")
    emissions = e_values.keys()
    emissions.sort()
    for item in emissions:
        output = " ".join([item[0], item[1], str(e_values[item])])
        outfile.write(output + '\n')
    outfile.close()


# TODO: IMPLEMENT THIS FUNCTION
# This function takes data to tag (brown_dev_words), a set of all possible
# tags (taglist), a set of all known words (known_words), trigram
# probabilities (q_values) and emission probabilities (e_values), and outputs
# a list where every element is a tagged sentence (in the WORD/TAG format,
# separated by spaces and with a newline at the end, just like our input
# tagged data).
# brown_dev_words is a python list where every element is a python list of
# the words of a particular sentence.
# taglist is a set of all possible tags
# known_words is a set of all known words
# q_values is from the return of calc_trigrams()
# e_values is from the return of calc_emission()
# The return value is a list of tagged sentences in the format "WORD/TAG",
# separated by spaces. Each sentence is a string with a terminal newline, not
# a list of tokens. Remember also that the output should not contain the
# "_RARE_" symbol, but rather the original words of the sentence!
def viterbi(brown_dev_words, taglist, known_words, q_values, e_values):
    tagged = []
    return tagged


# This function takes the output of viterbi() and outputs it to file
def q5_output(tagged, filename):
    outfile = open(filename, 'w')
    for sentence in tagged:
        outfile.write(sentence)
    outfile.close()


# TODO: IMPLEMENT THIS FUNCTION
# This function uses nltk to create the taggers described in question 6.
# brown_words and brown_tags is the data to be used in training.
# brown_dev_words is the data that should be tagged.
# The return value is a list of tagged sentences in the format "WORD/TAG",
# separated by spaces. Each sentence is a string with a terminal newline, not
# a list of tokens.
def nltk_tagger(brown_words, brown_tags, brown_dev_words):
    # Hint: use the following line to format data to what NLTK expects for
    # training
    training = [zip(brown_words[i], brown_tags[i])
                for i in xrange(len(brown_words))]

    # IMPLEMENT THE REST OF THE FUNCTION HERE
    tagged = []
    default_tagger = nltk.DefaultTagger('NN')
    bigram_tagger = nltk.BigramTagger(training, backoff=default_tagger)
    trigram_tagger = nltk.TrigramTagger(training, backoff=bigram_tagger)
    for sentence in brown_dev_words:
        # Tag into a separate variable so the accumulator is not clobbered.
        tagged_sentence = trigram_tagger.tag(sentence)
        tagged.append(' '.join([word + '/' + tag
                                for word, tag in tagged_sentence]) + '\n')
    return tagged


# This function takes the output of nltk_tagger() and outputs it to file
def q6_output(tagged, filename):
    outfile = open(filename, 'w')
    for sentence in tagged:
        outfile.write(sentence)
    outfile.close()


DATA_PATH = '/home/classes/cs477/data/'
OUTPUT_PATH = 'output/'


def main():
    # start timer
    time.clock()

    # open Brown training data
    infile = open(DATA_PATH + "Brown_tagged_train.txt", "r")
    brown_train = infile.readlines()
    infile.close()

    # split words and tags, and add start and stop symbols (question 1)
    brown_words, brown_tags = split_wordtags(brown_train)

    # calculate tag trigram probabilities (question 2)
    q_values = calc_trigrams(brown_tags)

    # question 2 output
    q2_output(q_values, OUTPUT_PATH + 'B2.txt')

    # calculate list of words with count > 5 (question 3)
    known_words = calc_known(brown_words)

    # get a version of brown_words with rare words replaced with '_RARE_'
    # (question 3)
    brown_words_rare = replace_rare(brown_words, known_words)

    # question 3 output
    q3_output(brown_words_rare, OUTPUT_PATH + "B3.txt")

    # calculate emission probabilities (question 4)
    e_values, taglist = calc_emission(brown_words_rare, brown_tags)

    # question 4 output
    q4_output(e_values, OUTPUT_PATH + "B4.txt")

    # delete unnecessary data
    del brown_train
    del brown_words_rare

    # open Brown development data (question 5)
    infile = open(DATA_PATH + "Brown_dev.txt", "r")
    brown_dev = infile.readlines()
    infile.close()

    # format Brown development data here
    brown_dev_words = []
    for sentence in brown_dev:
        brown_dev_words.append(sentence.split(" ")[:-1])

    # do viterbi on brown_dev_words (question 5)
    viterbi_tagged = viterbi(brown_dev_words, taglist, known_words,
                             q_values, e_values)

    # question 5 output
    q5_output(viterbi_tagged, OUTPUT_PATH + 'B5.txt')

    # do nltk tagging here
    nltk_tagged = nltk_tagger(brown_words, brown_tags, brown_dev_words)

    # question 6 output
    q6_output(nltk_tagged, OUTPUT_PATH + 'B6.txt')

    # print total time to run Part B
    print "Part B time: " + str(time.clock()) + ' sec'


if __name__ == "__main__":
    main()
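The `split_wordtags` function above is left as a TODO. One minimal sketch of how it could be filled in, following its docstring: pad with two start symbols (consistent with the `sentence[2:-1]` slice in `q3_output`) and split each token on its last `/` so that words which themselves contain a slash are handled. This is one reading of the spec, not the assignment's reference solution; it is written so it also runs under Python 3.

```python
START_SYMBOL = '*'
STOP_SYMBOL = 'STOP'

def split_wordtags(brown_train):
    # Each input line is "WORD/TAG WORD/TAG ...\n"; produce parallel
    # per-sentence lists of words and tags, padded with two start symbols
    # (so the first real tag has a full trigram context) and one stop symbol.
    brown_words, brown_tags = [], []
    for sentence in brown_train:
        words = [START_SYMBOL, START_SYMBOL]
        tags = [START_SYMBOL, START_SYMBOL]
        for token in sentence.split():
            # rsplit on the last '/' in case the word contains '/' itself.
            word, tag = token.rsplit('/', 1)
            words.append(word)
            tags.append(tag)
        words.append(STOP_SYMBOL)
        tags.append(STOP_SYMBOL)
        brown_words.append(words)
        brown_tags.append(tags)
    return brown_words, brown_tags

# Example: one tagged sentence in, parallel word and tag lists out.
words, tags = split_wordtags(["The/DT dog/NN barks/VBZ ./.\n"])
```

With this input, `words` is `[['*', '*', 'The', 'dog', 'barks', '.', 'STOP']]` and `tags` is `[['*', '*', 'DT', 'NN', 'VBZ', '.', 'STOP']]`.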