Homework 5
Task 1 (10 pts)
Refer to the Bayesian inference notes. Calculate whether it is more probable that a patient has a cavity or gum disease given he/she has a toothache. Show your work.
Task 2 (10 pts)
Using the “helpful software” at the bottom of the Bayesian inference notes, load the fire alarm network problem, and determine which is more probable, given that a fire has been reported: the smoke detector was tampered with, or there is a fire. Give the probabilities of these two events.
Task 3 (80 pts)
Perform naïve Bayesian classification on a spam/ham dataset. Report the average experimental f-score after training/testing on 10 random subsamples, where a random 90% serves as the training set and the remainder serves as the test set.
Use the lemmatized, stopword-removed “Ling-Spam” dataset available in lingspam_lemm_stop.zip. Original source: http://csmining.org/index.php/ling-spam-datasets.html
Submit your code to the dropbox. Do not include the dataset in your submission! If your program is non-typical Java or C++ code, explain in a README.TXT how to compile and run it.
Important: Your code must open the listing.txt and other data files from a folder called data.
Possible pseudo-code
The challenge with this assignment is not the math but gathering all the data needed to compute it. As is often the case in AI research, the mathematics are simple; the real challenge is the algorithm, particularly the collection and processing of data.
To help you along, here is some pseudo-code:
Task 1. Set up global variables.
- Create a map for di values, with keys = strings and values = ints. This map has words as keys; the value for a word is the number of documents across all categories that have this word anywhere in the document.
- Create several maps for dci values (strings -> ints again, like the di map). These maps record the number of documents in a particular category that have a particular word. Perhaps you can put these maps in another map, like so: dci[category][word] = #; in C++, that's map<string, map<string, int> > (cat -> word -> # of docs).

Task 2. Read the text documents.
- Open the listing.txt file.
- Loop: for each line in the listing.txt file:
  - Read the first "word" -- this is a file name (without the .txt).
  - Read the rest of the line from the listing.txt file (the number of categories and the categories themselves).
  - Record this category information for the document; how will you do this?
  - Open the document .txt file.
  - Create a document map D with keys = strings and values = true/false or 1/0. This map will represent the document; the strings are words from the document, and the true/false means the document does or does not have that word. Of course, we don't need to put words in the map that the document does not have, just words it does, so only true's will be in the map, not false's. You might prefer to use a set of strings rather than a map.
  - Loop: for each word in the document:
    - Read the next word from the .txt file.
    - Record this word in the document map D.
  - Close the .txt file.
  - Put this document map D in a vector to collect all such D's.

Task 3. Separate the docs into a test set and a training set. Do this 10 times, always taking a random 90% of the documents to serve as a training set, the other 10% to serve as the testing set. *Do all of the following 10 times, once for each random subsample. Average the f-scores.*

Task 4. Reset di and dci values.
- Loop: for each document in the training set:
  - Loop: for each word in the doc map:
    - Increment in the global di map the count of documents with this word (because one more document has this word).
    - Loop: for each category that this document is a member of:
      - Increment in the global dci map the value for this word in this category.

Task 5. Classify each testing document.
- Set up variables for true positive, false positive, and false negative counts.
- Loop: for each testing document:
  - Figure out which category maximizes the final formula in the Bayesian methods notes...
- Collect the f-score for this testing set.

Task 6. Report the average f-score across the 10 random subsamples.
This is a chunk of C++ that you may find useful for this task. It only processes the listing.txt file, and does not do any machine learning.
#include <fstream>
#include <string>
// ...
string docfile, category, word;
int category_count;
ifstream f("listing.txt");
// testing the read itself, rather than eof(), stops the loop cleanly at
// end of file without processing the last entry twice
while (f >> docfile) {
    // maybe more than 1 category; not in the spam data, but in other
    // datasets...; the file gives the category count next (an int)
    f >> category_count;
    // now get each category for this docfile
    for (int i = 0; i < category_count; i++) {
        f >> category;
        // do something with category
    }
    // open docfile, process it a word at a time; a fresh ifstream each
    // iteration avoids having to clear() stale stream state
    ifstream df(docfile.c_str());  // c_str() is just a funky C++ thing
    while (df >> word) {
        // do something with word
    }
    df.close();  // done with that line of listing.txt
}
f.close();
Extra credit (30 pts)
Make a copy of your code, rename it, and modify it so it trains on all the documents. Then, save the resulting data (the “model”) to a file. Write another program (or expand this original program) so that a user can predict the category (spam/ham) on a new, unseen document which is specified on the command line. Usage should be like this:
bayes-train.exe mymodel                ;; saves the model
bayes-predict.exe mymodel mydoc.txt    ;; predicts using the model
There should be no output from bayes-train.exe, and the only output from bayes-predict.exe should be the predicted category (spam/ham).