Homework 5

Task 1 (10 pts)

Refer to the Bayesian inference notes. Calculate whether it is more probable that a patient has a cavity or gum disease, given that the patient has a toothache. Show your work.

Task 2 (10 pts)

Using the “helpful software” at the bottom of the Bayesian inference notes, load the fire alarm network problem, and determine which is more probable, given that a fire has been reported: the smoke detector was tampered with, or there is a fire. Give the probabilities of these two events.

Task 3 (80 pts)

Perform naïve Bayesian classification on a spam/ham dataset. Report the average experimental f-score after training/testing on 10 random subsamples, where a random 90% serves as the training set and the remainder serves as the test set.

Use the lemmatized, stopword-removed “Ling-Spam” dataset available in lingspam_lemm_stop.zip. Original source: http://csmining.org/index.php/ling-spam-datasets.html

Submit your code to the dropbox. Do not include the dataset in your submission! If your program is anything other than typical Java or C++ code, explain in a README.TXT how to compile and run it.

Important: Your code must open the listing.txt and other data files from a folder called data.

Possible pseudo-code

The challenge with this assignment is not the math but gathering all the counts the math needs. As is often the case in AI research, the mathematics are simple; the real work is in the algorithm, particularly the collection and processing of data.

To help you along, here is some pseudo-code:

Task 1. Set up global variables.

- create a map for di values, with keys = strings and values = ints
  (this map has words as keys; the value for a word is the number of
  documents across all categories that have this word anywhere in the
  document)

- create several maps for dci values (strings -> ints again, like the
  di map; these maps record the number of documents in a particular
  category that have a particular word); perhaps you can put these
  maps in another map, like so: dci[category][word] = #; in C++,
  that's map<string, map<string, int> >
             (cat)       (word)  (# of docs)
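
Concretely, these declarations might look like the following in C++ (a sketch; the names di and dci are only suggestions):

#include <map>
#include <string>
using namespace std;

// di[word] = number of training docs (across all categories) with this word
map<string, int> di;

// dci[category][word] = number of training docs in that category with this word
map<string, map<string, int> > dci;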

Task 2. Read the text documents.

- open the listing.txt file
- loop: for each line in the listing.txt file
  - read the first "word" -- this is a file name (without the .txt)
  - read the rest of the line from the listing.txt file (the
    number of categories and the categories themselves)
    - record this category information for the document; how will you
      do this?
  - open the document .txt file
  - create a document map D with keys = strings and values =
    true/false or 1/0 (this map will represent the document; the
    strings are words from the document, and the true/false means the
    document does or does not have that word; of course, we don't need
    to put words in the map that the document does not have, just
    words it does, so only true's will be in the map, not false's; you
    might prefer to use a set of strings rather than a map)
  - loop: for each word in the document
    - read the next word from the .txt file
    - record this word in the document map D
  - close the .txt file
  - put this document map D in a vector to collect all such D's
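
For example, in C++ the document map D could be a set<string> and the collection a vector (a sketch; the loop assumes the document's .txt file is already open as the ifstream df, as in the code chunk at the bottom of this page):

#include <set>
#include <string>
#include <vector>
using namespace std;

vector<set<string> > docs; // one set of distinct words per document

set<string> D;
string word;
while(df >> word)
    D.insert(word); // sets ignore duplicates, so each word appears once
docs.push_back(D);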

Task 3. Separate the docs into a test set and a training set. Do this
10 times, always taking a random 90% of the documents to serve as a
training set, the other 10% to serve as the testing set.

*Do all of the following 10 times, once for each random
subsample. Average the f-scores.*
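
One way to take a random 90/10 split in C++ (a sketch, assuming the documents were collected into the docs vector suggested above):

#include <algorithm>
#include <vector>
using namespace std;

// shuffle the index positions, then take the first 90% for training
vector<int> idx;
for(int i = 0; i < (int)docs.size(); i++)
    idx.push_back(i);
random_shuffle(idx.begin(), idx.end()); // seed once with srand(time(NULL))
int split = (int)(0.9 * idx.size());
// idx[0] .. idx[split-1] index the training docs; the rest are testing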

Task 4. Reset di and dci values.

- loop: for each document in the training set
  - loop: for each word in the doc map
    - increment in the global di map the number of occurrences of
      documents with this word (because one more document has this
      word)
    - loop: for each category that this document is a member of
      - increment in the global dci map the value for this word in
        this category
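
In C++ the counting loops might look like this (a sketch; doc_categories is a hypothetical vector<vector<string> > holding each document's categories, recorded while reading listing.txt -- one possible answer to the "how will you do this?" question in Task 2):

di.clear();  // reset the counts for this subsample
dci.clear();

for(int t = 0; t < split; t++) {
    int d = idx[t]; // index of a training document
    for(set<string>::iterator w = docs[d].begin(); w != docs[d].end(); ++w) {
        di[*w]++; // one more document has this word
        for(size_t c = 0; c < doc_categories[d].size(); c++)
            dci[doc_categories[d][c]][*w]++;
    }
}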

Task 5. Classify each testing document.

- set up variables for true positive, false positive, false negative
  counts
- loop: for each testing document:
  - figure out which category maximizes the final formula in the
    Bayesian methods notes...
- collect the f-score for this testing set
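
The exact scoring formula is in the Bayesian methods notes; the sketch below uses one common smoothed, log-space variant, so adjust it to match the notes. The names cats, n_c, and n are assumptions: the category names, the number of training docs per category, and the total number of training docs (counts you would also collect in Task 4).

#include <cmath>
using namespace std;

// classify one testing document doc (a set<string>)
string best_cat;
double best_score = -1e300;
for(size_t c = 0; c < cats.size(); c++) {
    double score = log((double)n_c[cats[c]] / n); // log of the prior
    for(set<string>::iterator w = doc.begin(); w != doc.end(); ++w)
        // smoothed estimate of P(word appears | category); the +1/+2
        // smoothing is an assumption -- use whatever the notes specify
        score += log((dci[cats[c]][*w] + 1.0) / (n_c[cats[c]] + 2.0));
    if(score > best_score) { best_score = score; best_cat = cats[c]; }
}

// after classifying all testing docs, with spam as the positive class:
// tp = spam called spam, fp = ham called spam, fn = spam called ham
double precision = (double)tp / (tp + fp);
double recall = (double)tp / (tp + fn);
double fscore = 2.0 * precision * recall / (precision + recall);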

Task 6. Report the average f-score across the 10 random subsamples.

Here is a chunk of C++ that you may find useful for the data-reading portion of this assignment. It only processes the listing.txt file and does not do any machine learning.

#include <fstream>
#include <string>

// ...

string docfile, category, word;
int category_count;
ifstream f, df;
f.open("data/listing.txt"); // the data files live in the "data" folder
while(f >> docfile) { // stops cleanly at end of file; looping on
                      // !f.eof() would process the last line twice

    // maybe more than 1 category; not in the spam data, but in other
    // datasets...; the file gives the category count next (an int)
    f >> category_count;
    // now get each category for this docfile
    for(int i = 0; i < category_count; i++) {
        f >> category;
        // do something with category
    }
    // open docfile, process it a word at a time; listing.txt leaves
    // off the .txt extension, so add it back (and the data/ folder)
    df.open(("data/" + docfile + ".txt").c_str()); // c_str() is just a funky C++ thing
    while(df >> word) {
        // do something with word
    }
    df.close();
    df.clear(); // clear the eof flag so df can be reused next time around

    // done with that line of listing.txt
}
f.close();

Extra credit (30 pts)

Make a copy of your code, rename it, and modify it so that it trains on all the documents. Then save the resulting data (the “model”) to a file. Write another program (or expand the original) so that a user can predict the category (spam/ham) of a new, unseen document specified on the command line. Usage should look like this:

bayes-train.exe mymodel                  ;; saves the model
bayes-predict.exe mymodel mydoc.txt      ;; predicts using the model

There should be no output from bayes-train.exe and the only output from bayes-predict.exe should be the predicted category (spam/ham).
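
One simple way to save the model is to write the counts as plain text, one (category, word, count) triple per line; bayes-predict.exe can then rebuild the dci map by reading the triples back in (a sketch; you would also need to save the per-category document totals the prior requires):

#include <fstream>
#include <map>
#include <string>
using namespace std;

ofstream out(argv[1]); // model file name from the command line
for(map<string, map<string, int> >::iterator c = dci.begin(); c != dci.end(); ++c)
    for(map<string, int>::iterator w = c->second.begin(); w != c->second.end(); ++w)
        out << c->first << " " << w->first << " " << w->second << "\n";
out.close();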
