# Homework 5

Refer to the Bayesian inference notes. Calculate whether it is more probable that a patient has a cavity or gum disease, given that the patient has a toothache. Show your work.
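As a reminder, the comparison comes straight from Bayes' rule; the actual priors and likelihoods are in the Bayesian inference notes, not here. Since both hypotheses condition on the same evidence, the denominator is shared and cancels:

```
P(cavity | toothache)     = P(toothache | cavity) P(cavity) / P(toothache)
P(gum disease | toothache) = P(toothache | gum disease) P(gum disease) / P(toothache)
```

So it suffices to compare the two numerators, P(toothache | cavity) P(cavity) versus P(toothache | gum disease) P(gum disease), and pick the larger.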

Using the “helpful software” at the bottom of the Bayesian inference notes, load the fire alarm network problem, and determine which is more probable, given that a fire has been reported: the smoke detector was tampered with, or there is a fire. Give the probabilities of these two events.

Perform naïve Bayesian classification on a spam/ham dataset. Report the average experimental f-score after training/testing on 10 random subsamples, where a random 90% serves as the training set and the remainder serves as the test set.

Use the lemmatized, stopword-removed “Ling-Spam” dataset available in lingspam_lemm_stop.zip. Original source: http://csmining.org/index.php/ling-spam-datasets.html

Submit your code to the dropbox. Do not include the dataset in your submission! If your program is anything other than typical Java or C++ code, explain in a `README.TXT` how to compile and run it.

Important: Your code must open the `listing.txt` and other data files from a folder called `data`.

### Possible pseudo-code

The challenge in this assignment is not the math but gathering all the data the computation needs. As is often the case in AI research, the mathematics is simple; the algorithm, particularly the collection and processing of data, is the real challenge.

In the hope of helping you along, here is some pseudo-code:

```
Task 1. Set up global variables.

- create a map for di values, with keys = strings and values = ints
  (this map has words as keys; the value for a word is the number of
  documents across all categories that have this word anywhere in the
  document)

- create several maps for dci values (strings -> ints again, like the
  di map; these maps record the number of documents in a particular
  category that have a particular word); perhaps you can put these
  maps in another map, like so: dci[category][word] = #; in C++,
  that's map<string, map<string, int> >
          (cat)        (word)  (# of docs)

- open the listing.txt file
- loop: for each line in the listing.txt file
  - read the first "word" -- this is a file name (without the .txt)
  - read the rest of the line from the listing.txt file (the
    number of categories and the categories themselves)
  - record this category information for the document; how will you
    do this?
  - open the document .txt file
  - create a document map D with keys = strings and values =
    true/false or 1/0 (this map will represent the document; the
    strings are words from the document, and the true/false means the
    document does or does not have that word; of course, we don't need
    to put words in the map that the document does not have, just
    words it does, so only true's will be in the map, not false's; you
    might prefer to use a set of strings rather than a map)
  - loop: for each word in the document
    - read the next word from the .txt file
    - record this word in the document map D
  - close the .txt file
  - put this document map D in a vector to collect all such D's

Task 3. Separate the docs into a test set and a training set. Do this
10 times, always taking a random 90% of the documents to serve as a
training set, the other 10% to serve as the testing set.

*Do all of the following 10 times, once for each random
subsample. Average the f-scores.*

Task 4. Reset di and dci values.

- loop: for each document in the training set
  - loop: for each word in the doc map
    - increment in the global di map the number of occurrences of
      documents with this word (because one more document has this
      word)
    - loop: for each category that this document is a member of
      - increment in the global dci map the value for this word in
        this category

Task 5. Classify each testing document

- set up variables for true positive, false positive, false negative
  counts
- loop: for each testing document:
  - figure out which category maximizes the final formula in the
    Bayesian methods notes...
- collect the f-score for this testing set

Task 6. Report the average f-score across the 10 random subsamples.
```

This is a chunk of C++ that you may find useful for this task. It only processes the `listing.txt` file, and does not do any machine learning.

```
#include <fstream>
#include <string>
using namespace std;

// ...

string docfile, category, word;
int category_count;
ifstream f, df;
f.open("data/listing.txt");
while(f >> docfile) { // extraction fails, ending the loop, when the file runs out

    // maybe more than 1 category; not in the spam data, but in other
    // datasets...; the file gives the category count next (an int)
    f >> category_count;
    // now get each category for this docfile
    for(int i = 0; i < category_count; i++) {
        f >> category;
        // do something with category
    }
    // open docfile, process it a word at a time; the listing gives the
    // name without the .txt, and the files live in the data folder
    df.open(("data/" + docfile + ".txt").c_str()); // c_str() is just a funky C++ thing
    while(df >> word) {
        // do something with word
    }
    df.close();
    df.clear(); // reset the stream's eof flag so it can be reused next iteration

    // done with that line of listing.txt
}
f.close();
```

## Extra credit (30 pts)

Make a copy of your code, rename it, and modify it so it trains on all the documents. Then, save the resulting data (the “model”) to a file. Write another program (or expand this original program) so that a user can predict the category (spam/ham) on a new, unseen document which is specified on the command line. Usage should be like this:

```
bayes-train.exe mymodel                  ;; saves the model
bayes-predict.exe mymodel mydoc.txt      ;; predicts using the model
```

There should be no output from `bayes-train.exe`, and the only output from `bayes-predict.exe` should be the predicted category (spam/ham).