Confidentiality Agreement 
Business Goal
-A lot of money is spent on recruiting new students, and models that can successfully predict which
students will or will not enroll can prove very useful in targeted marketing.
-Money can be better allocated to targeting good students who are less likely to enroll, rather than students who would definitely enroll or would not enroll anyway.
Research Objective
Introduction
Data
Pre-processing
Pre-processing using Weka
Discretize -B 10 -M -1.0 -R first-last
- The default discretization mechanism was used, which is equal-width (equal-interval) discretization.
It suffers from the drawback that instances are distributed very unevenly among the bins,
making it hard to form any decision structures. Supervised equal frequency discretization will be
used in the next phase of the project.
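The drawback described above can be illustrated with a minimal sketch in plain Python. This is not the Weka filter itself: the function names and the sample values are hypothetical, the equal-frequency variant shown is a simple unsupervised rank-based one (it ignores the class label), and it uses 4 bins rather than 10 to keep the output short.

```python
from collections import Counter

def equal_width_bins(values, n_bins=10):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins=10):
    """Assign values to bins by rank so each bin holds about the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical skewed attribute values (illustration only, not project data).
incomes = [20, 22, 23, 25, 26, 27, 28, 30, 31, 33,
           35, 36, 38, 40, 45, 50, 60, 80, 150, 400]

width_counts = Counter(equal_width_bins(incomes, n_bins=4))
freq_counts = Counter(equal_frequency_bins(incomes, n_bins=4))
print("equal width    :", sorted(width_counts.items()))
print("equal frequency:", sorted(freq_counts.items()))
```

With skewed values like these, equal-width binning puts almost every instance in the first bin, while the rank-based variant spreads them evenly. Note that this rank-based sketch may split tied values across bins, which a production discretizer would normally avoid.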
Learners Used
DMM Compatibility
Level 1- Initial
- Access, Arff
- ZeroR, OneR, NaiveBayes, NaiveBayesSimple, J48
- Weka Experiment - ZeroR, OneR, NaiveBayes, NaiveBayesSimple with 10-fold
cross-validation and 10 iterations
Level 2- Repeatable
- Weka Experiment - ZeroR, OneR, NaiveBayes, NaiveBayesSimple with 10-fold cross-validation and 10 iterations
- No : Sensitive data
- Weka Experiment - ZeroR, OneR, NaiveBayes, NaiveBayesSimple with 10-fold
cross-validation and 10 iterations
- ZeroR, OneR, NaiveBayes, NaiveBayesSimple, J48
- No
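The experiment setup named in the items above (cross-validation repeated for 10 iterations, with ZeroR and OneR as baseline learners) can be sketched roughly as follows. This is a plain-Python approximation, not the Weka Experimenter: the function names, the toy records, and the attribute names are all hypothetical, and the folds here are not stratified.

```python
import random
from collections import Counter, defaultdict

def zero_r(train):
    """ZeroR: always predict the majority class of the training data."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda row: majority

def one_r(train):
    """OneR: keep the single attribute whose value -> majority-class rule
    makes the fewest errors on the training data."""
    default = Counter(label for _, label in train).most_common(1)[0][0]
    best_attr, best_rule, best_correct = None, None, -1
    for attr in range(len(train[0][0])):
        counts = defaultdict(Counter)
        for row, label in train:
            counts[row[attr]][label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c.most_common(1)[0][1] for c in counts.values())
        if correct > best_correct:
            best_attr, best_rule, best_correct = attr, rule, correct
    # Attribute values never seen in training fall back to the majority class.
    return lambda row: best_rule.get(row[best_attr], default)

def repeated_cross_validation(data, learner, folds=10, repeats=10, seed=1):
    """Return one accuracy per fold per repeat (folds * repeats values)."""
    accuracies = []
    for repeat in range(repeats):
        shuffled = data[:]
        random.Random(seed + repeat).shuffle(shuffled)
        for k in range(folds):
            test = shuffled[k::folds]
            train = [x for i, x in enumerate(shuffled) if i % folds != k]
            model = learner(train)
            hits = sum(model(row) == label for row, label in test)
            accuracies.append(hits / len(test))
    return accuracies

# Hypothetical toy records: (residency, financial_aid) -> enrolled?
data = ([(("in_state", "aid"), "yes")] * 40
        + [(("in_state", "no_aid"), "yes")] * 20
        + [(("out_of_state", "aid"), "no")] * 25
        + [(("out_of_state", "no_aid"), "no")] * 15)

for name, learner in [("ZeroR", zero_r), ("OneR", one_r)]:
    scores = repeated_cross_validation(data, learner)
    print(name, len(scores), round(sum(scores) / len(scores), 3))
```

Each learner yields 100 accuracy values (10 folds times 10 iterations), which is the kind of sample that later comparisons between learners can be run on. ZeroR serves as the floor: any learner that does not beat the majority-class rate is not using the attributes at all.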
Level 3- Defined
- Yes: Defined above
- Yes: A description file which is not shown here
- Yes: Each instance is different than other as each record is an application.
- Yes: Some example attributes are shown here. Missing values exist
because some fields don't apply to all applicants.
EnrolledIndicator.JPG (size: 136 KB)
MedianFamilyIncome.JPG (size: 155 KB)
ACTequivalent.JPG (size: 154 KB)
HighschoolGPA.JPG(size: 162 KB)
- Yes: Select attributes - cfsSubsetEval, infogain -
subseteval_Results.txt
- The comparison of results (result comparison.txt) shows that, since there are missing values, NaiveBayes performs well. The decision tree learner has a higher number of correctly classified instances, but since it was run on a 10% sub-sample of the data, these results may not be indicative of the whole data set.
- After comparing the results obtained from cfsSubsetEval, it can be concluded that approximately the same accuracy can be obtained by using the four variables selected by CFS. The variables are ApplicationResidencyIndicator, ApplicationStateCode, FinancialAidIndicator, and HighSchoolCountyCode.
- Yes: Data from Spring 1999 - Fall 2006. Data differ among
semesters (Fall - 25,000 and Spring - 3,000)
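The information-gain ranking mentioned in the attribute-selection item above can be illustrated with a short sketch. It covers only single-attribute information gain on nominal values; it does not implement CFS, which scores whole attribute subsets. The function names and the toy columns are hypothetical, and a missing value, if present, would simply be treated as one more category here.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    """Class entropy minus the expected entropy after splitting on the attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy columns (illustration only, not project data).
enrolled = ["yes", "yes", "no", "no"]
residency = ["in", "in", "out", "out"]   # separates the classes perfectly
aid = ["aid", "none", "aid", "none"]     # carries no class information

gains = {
    "residency": information_gain(residency, enrolled),
    "aid": information_gain(aid, enrolled),
}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.3f}")
```

Attributes are then ranked by gain, and the low-scoring ones become candidates for removal before training.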
Roadmap to Project IV 
| | Task | Schedule (Week) |
| 1 | Multiple preprocessors - binLogging, nBins, lognums | Nov 13 |
| 2 | Multiple learners - J48, NaiveBayes, NaiveBayesSimple, OneR | Nov 20 |
| 3 | Comparison of Results - Quartile charts | Nov 20 |
| 4 | Analysis of learners' performance through Student's t-tests | Nov 20 |
| 5 | Business study by Lift charts and ROC curves | Nov 27 |
| 6 | Interpretation of the results | Nov 27 |
| 7 | Result condensation and Report writing | Nov 27 |