Motivation of this project
- Spam email is a nuisance for every personal email account
- 60% of all email sent in January 2004 was spam
- Fraud & Phishing
Pre-processing of data
- Convert capital letters to lowercase
- Remove numbers and extra whitespace
- Remove punctuation
- Remove stop-words
- Delete terms longer than 20 characters
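The pre-processing steps above could be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess`, the constant names, and the tiny stop-word list are assumptions, and punctuation is replaced by spaces (rather than deleted) so that neighbouring words do not get glued together.

```python
import re
import string

# Assumption: a tiny illustrative stop-word list; the project's real list is not given.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "you", "for"}
MAX_TERM_LENGTH = 20  # terms longer than this are dropped

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Return the list of cleaned terms for one email body."""
    text = text.lower()                      # convert capital letters to lowercase
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = text.translate(PUNCT_TO_SPACE)    # remove punctuation
    terms = text.split()                     # split() also collapses extra whitespace
    return [
        t for t in terms
        if t not in STOP_WORDS and len(t) <= MAX_TERM_LENGTH
    ]
```

For example, `preprocess("WIN $1000 now!!!  Click the link to claim")` yields `['win', 'now', 'click', 'link', 'claim']`.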
Hash table
- Calculate a hash key for each term in the term list.
- Use the hash function to place each term into the table.
- When a collision occurs, resolve it with separate chaining.
- Choose the first term of each linked list as the new term that represents the whole list.
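The hash-table steps above could look roughly like the sketch below. Several things here are assumptions made only for illustration: the original hash function is not specified, so a simple polynomial hash stands in for it; Python lists stand in for the linked lists; and the names `hash_key`, `build_table`, and `representatives` are hypothetical.

```python
def hash_key(term, table_size):
    """Assumed hash function: polynomial hash of the characters, modulo table size."""
    key = 0
    for ch in term:
        key = (key * 31 + ord(ch)) % table_size
    return key


def build_table(terms, table_size):
    """Place each distinct term into its bucket; colliding terms share a chain."""
    table = [[] for _ in range(table_size)]
    for term in terms:
        chain = table[hash_key(term, table_size)]
        if term not in chain:        # keep each term once per chain
            chain.append(term)       # separate chaining: append to the bucket's list
    return table


def representatives(table):
    """Map every term to the first term of its chain."""
    return {term: chain[0] for chain in table for term in chain}
```

With a table of size 1 every term collides, so `representatives(build_table(["free", "money", "offer"], 1))` maps all three terms to `"free"`. A smaller table therefore merges more terms into one representative, which shrinks the feature set at the cost of conflating unrelated words.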
Different classification methods
Results of the two models
- The 3-NN model has an accuracy of 62%
- The Naive Bayes model has an accuracy of 82.36%