Ashutosh Nandeshwar

Subodh Chaudhari

Confidentiality Agreement


  • confidentiality agreement.pdf (size: 217 KB)
  • code of conduct.pdf (size: 328 KB)
  • fields.pdf (size: 173 KB)
  • Business Goal


  • Build models to predict enrollment of a student using the student admissions data
    - A lot of money is spent on recruiting new students, so models that can reliably predict which students will or will not enroll can be very useful for targeted marketing.

    - Money can be better allocated by targeting good students who are less likely to enroll, rather than students who would definitely enroll or would not enroll anyway.

  • Research Objective

  • Analyze different learners' performance on student admissions data and provide models to the University.

  • Estimate cost and other benefits to be reaped.

  • Introduction

  • West Virginia University uses the Banner system to record all student admissions, enrollment, and course registration data.

  • The data are extracted from the Banner system to WVU servers.

  • For this project the data sets from the server were used.

  • Data

  • Admissions data from Spring 1999 to Fall 2006 were used.

  • There were approximately 3,000 applications for Spring and 25,000 applications for Fall.

  • 248 attributes covering demographic and academic information.

  • Pre-processing

  • All the data tables were joined to create a single table.

  • Flag variables were created: Enrollment Indicator and First Generation Indicator.

  • ACT and SAT scores were combined using concordance tables.

  • Permanent-address Zip codes were used to create a Median Family Income field, using Zip code and income data from the Census.gov website.

  • Applications that were not accepted were removed; total number of instances: 112,390.

  • Domain knowledge and common sense were used to remove some attributes (email addresses, phone numbers, etc.).

  • The Access table was converted to ARFF using a VBA script.
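
The last pre-processing steps (deriving Median Family Income from the Zip code and writing the joined table out as ARFF) can be sketched as follows. This is a minimal illustration in Python rather than the VBA script used in the project; the lookup table, the sample records, and the helper names `arff_value` and `to_arff` are hypothetical, and the income figures are made up.

```python
# Hypothetical lookup: Zip code -> median family income (made-up figures).
MEDIAN_INCOME_BY_ZIP = {"26505": 41000, "26003": 36500}

# Hypothetical application records; the field names are illustrative only.
applications = [
    {"zip": "26505", "act_equivalent": 24, "enrolled": True},
    {"zip": "99999", "act_equivalent": None, "enrolled": False},
]


def arff_value(value):
    """Render one cell: '?' for missing, Y/N for flags, quoted if needed."""
    if value is None:
        return "?"
    if isinstance(value, bool):
        return "Y" if value else "N"
    text = str(value)
    if " " in text or "," in text:
        return "'" + text.replace("'", "\\'") + "'"
    return text


def to_arff(relation, attributes, rows):
    """Build ARFF text. Each attribute is (name, type), where type is
    'numeric', 'string', or a list of nominal values."""
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        spec = "{" + ",".join(kind) + "}" if isinstance(kind, list) else kind
        lines.append("@attribute " + name + " " + spec)
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(arff_value(v) for v in row))
    return "\n".join(lines) + "\n"


attributes = [
    ("MedianFamilyIncome", "numeric"),
    ("ACTequivalent", "numeric"),
    ("EnrolledIndicator", ["Y", "N"]),
]
# A Zip code missing from the lookup becomes a missing value ('?').
rows = [
    [MEDIAN_INCOME_BY_ZIP.get(a["zip"]), a["act_equivalent"], a["enrolled"]]
    for a in applications
]
arff_text = to_arff("admissions", attributes, rows)
print(arff_text)
```

Missing values are written as `?`, which is how ARFF marks them, so fields that do not apply to an applicant survive the conversion.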

  • Pre-processing using Weka


  • String variables were removed using the RemoveType filter (RemoveType -T string).

  • Useless variables were removed using the RemoveUseless filter, which removes constant attributes as well as nominal attributes that vary too much.

  • Discretization was done using Weka's Discretize filter.

    Discretize -B 10 -M -1.0 -R first-last

    - The default discretization mechanism, equal-width (equal-interval) binning, was used. Its drawback is that instances can be distributed very unevenly among the bins, which makes it hard to form any decision structures. Supervised equal-frequency discretization will be used in the next phase of the project.
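
The uneven bin counts described above can be seen in a small sketch. It compares equal-width binning with a simple rank-based (unsupervised) equal-frequency binning on made-up skewed values; it is not Weka's implementation, the function names are hypothetical, and the supervised variant planned for the next phase is not shown.

```python
def equal_width_counts(values, n_bins):
    """Count instances per bin when the value range is cut into equal intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        counts[idx] += 1
    return counts


def equal_frequency_counts(values, n_bins):
    """Count instances per bin when bins are cut by rank (ties are ignored here)."""
    n = len(values)
    return [(b + 1) * n // n_bins - b * n // n_bins for b in range(n_bins)]


# Made-up, right-skewed data: squares of 1..100.
values = [i * i for i in range(1, 101)]

width_counts = equal_width_counts(values, 10)
freq_counts = equal_frequency_counts(values, 10)
print("equal width:    ", width_counts)
print("equal frequency:", freq_counts)
```

With skewed data the first equal-width bin holds far more instances than the last, while the equal-frequency bins each hold the same number.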
  • Learners Used


  • ZeroR

  • OneR

  • NaiveBayes

  • NaiveBayesSimple

  • J48

  • DMM Compatibility


  • Level 1- Initial
  • Data is in some defined data format (csv, xml, arff, ...).
    - Access, ARFF
  • Data has been run through any automatic learner.
    - ZeroR, OneR, NaiveBayes, NaiveBayesSimple, J48
  • The learned theory has been automatically applied to some data to return some conclusion without human intervention.
    - Weka Experiment - ZeroR, OneR, NaiveBayes, NaiveBayesSimple with 10 way cross validation and 10 iterations

  • Level 2- Repeatable
  • A theory learned from some data D1 has been run on some other data D2, where D1 ≠ D2; e.g. via an N-way cross-validation study.
    - Weka Experiment - ZeroR, OneR, NaiveBayes, NaiveBayesSimple with 10 way cross validation and 10 iterations
  • Data is in the public domain; e.g. on a web site with free registration or, better yet, no registration
    - No : Sensitive data
  • The learned theory has been automatically applied to some data to return some conclusion without human intervention.
    - Weka Experiment - ZeroR, OneR, NaiveBayes, NaiveBayesSimple with 10 way cross validation and 10 iterations
  • Data has been run through learners that are public domain.
    - ZeroR, OneR, NaiveBayes, NaiveBayesSimple, J48
  • Someone else has processed this data rather than the original users.
    - No
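
The Level 2 requirement that a theory be tested on data it was not learned from can be illustrated with a 10-way cross-validation repeated 10 times. The sketch below uses a ZeroR-style majority-class learner and made-up labels; it is not the Weka Experimenter, the folds are not stratified, and the function names are hypothetical.

```python
import random
from collections import Counter


def zero_r_train(labels):
    """ZeroR-style learner: always predict the most frequent training class."""
    return Counter(labels).most_common(1)[0][0]


def cross_validate(labels, n_folds, seed):
    """Return one accuracy per fold; each fold is tested on held-out data only."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_labels = [labels[i] for i in indices if i not in held_out]
        prediction = zero_r_train(train_labels)
        correct = sum(1 for i in test_idx if labels[i] == prediction)
        accuracies.append(correct / len(test_idx))
    return accuracies


# Made-up class labels: 60 enrolled, 40 not enrolled.
labels = ["enrolled"] * 60 + ["not_enrolled"] * 40

all_accuracies = []
for iteration in range(10):  # 10 iterations, each with a different shuffle
    all_accuracies.extend(cross_validate(labels, n_folds=10, seed=iteration))

mean_accuracy = sum(all_accuracies) / len(all_accuracies)
print(len(all_accuracies), "fold results, mean accuracy", round(mean_accuracy, 3))
```

The majority-class baseline gives the accuracy that any other learner has to beat before it is worth using.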

  • Level 3- Defined
  • A goal for the learning is recorded.
    - Yes: Defined above
  • The meaning of most attributes is defined.
    - Yes: A description file which is not shown here
  • The meaning of each instance is defined
    - Yes: Each instance is distinct from the others, as each record is one application.
  • Statistics are available on the distribution of each attribute.
    - Yes: Some example attributes are shown here. Missing values exist because some fields don't apply to all applicants.
    EnrolledIndicator.JPG (size: 136 KB)
    MedianFamilyIncome.JPG (size: 155 KB)
    ACTequivalent.JPG (size: 154 KB)
    HighschoolGPA.JPG (size: 162 KB)
  • Attribute subsets are identified that have differing effects on the goals
    - Yes: Select attributes - cfsSubsetEval, infogain - subseteval_Results.txt
    - The comparison of results (result comparison.txt) shows that, because there are missing values, NaiveBayes performs well. The decision tree learner has a higher number of correctly classified instances, but because it was run on a 10% sub-sample of the data, these results may not be indicative of the whole data set.

    - After comparing the results obtained from cfsSubsetEval, it can be concluded that approximately the same accuracy can be obtained using only the four variables selected by CFS: ApplicationResidencyIndicator, ApplicationStateCode, FinancialAidIndicator, and HighSchoolCountyCode.

  • Instance subsets are identified which domain knowledge tells us are very different from the other instances.
    - Yes: Data from Spring 1999 - Fall 2006. Data differ among semesters (Fall - 25,000 and Spring - 3,000)
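
The attribute-ranking idea behind the infogain step in Level 3 can be sketched as below. This computes plain information gain on a made-up four-row table that borrows two attribute names from the results; it is not Weka's evaluator, it does not implement CFS, and it does not handle missing values.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    by_value = {}
    for r in rows:
        by_value.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in by_value.values())
    return base - remainder


# Made-up toy data; the values are illustrative, not taken from the admissions data.
rows = [
    {"ApplicationResidencyIndicator": "R", "FinancialAidIndicator": "Y", "EnrolledIndicator": "Y"},
    {"ApplicationResidencyIndicator": "R", "FinancialAidIndicator": "N", "EnrolledIndicator": "Y"},
    {"ApplicationResidencyIndicator": "N", "FinancialAidIndicator": "Y", "EnrolledIndicator": "N"},
    {"ApplicationResidencyIndicator": "N", "FinancialAidIndicator": "N", "EnrolledIndicator": "N"},
]

candidates = ["ApplicationResidencyIndicator", "FinancialAidIndicator"]
gains = {a: information_gain(rows, a, "EnrolledIndicator") for a in candidates}
ranked = sorted(candidates, key=lambda a: gains[a], reverse=True)
print(ranked, gains)
```

In this toy table the residency flag separates the classes perfectly and the financial-aid flag carries no information, so the ranking places residency first.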
  • Roadmap to Project IV


     
    #  Task                                                        Schedule (Week)
    1  Multiple preprocessors - binLogging, nBins, lognums         Nov 13
    2  Multiple learners - J48, NaiveBayes, NaiveBayesSimple, OneR Nov 20
    3  Comparison of results - quartile charts                     Nov 20
    4  Analysis of learners' performance through Student's t-tests Nov 20
    5  Business study by lift charts and ROC curves                Nov 27
    6  Interpretation of the results                               Nov 27
    7  Result condensation and report writing                      Nov 27