Abstract

The purpose of this paper is to develop a method for predicting the probability that a Chinese student is admitted when applying to graduate school. The paper uses a logistic regression algorithm based on student features, such as GPA, GRE, and TOEFL scores, to find how they relate to the final application result. First, a dataset is collected from Chinese forum posts,1 which cover a large amount of previous students' application data. The raw data must be cleaned and transformed into a suitable format before training. The algorithm is then trained on the data to find the optimal coefficients for classification. The application takes instance features as input and outputs structured numeric values. The trained model performs a simple regression calculation on the input and determines which class the input belongs to. Given a student's features, the output is the probability of being accepted by the graduate school.

Keywords: Machine Learning

1. Introduction

Each year, universities in the United States receive many applications from different countries, including numerous applications from Chinese students who apply to graduate school intending to go further in their academic research areas. Each graduate school sets qualifications for applicants, such as a minimum undergraduate grade point average (GPA). The Graduate Record Examination (GRE) is a standardized test that is an admission requirement for most graduate schools in the United States. International students are also required to submit a Test of English as a Foreign Language (TOEFL) score. Graduate admission involves many uncertain factors, although most schools state admission requirements. Even for the strongest applicants there is an element of randomness: not every student with good grades is accepted into graduate school.

There is no doubt that many factors affect admission, but a few have an enormous impact on admission rates, notably GPA and GRE scores. The TOEFL score also serves as a standard for verifying whether international students have the language ability needed to attend better universities. No one outside the admissions process really knows how universities evaluate and filter students from a large number of applications based on these indicators. Because some information cannot be measured with precise numerical values, this paper does not consider students' academic research experience or background.

2. Experimental and Computational Details

2.1 Collecting Dataset

All of the data were collected from public sources. The Chinese forum gter.com includes a large number of posts with previous students' application data. The forum posts contain detailed information about each student's GPA, GRE, and TOEFL scores. Students also post the list of graduate schools they applied to and the results of their applications. An open-source web crawler from GitHub was used to collect the data from the forum.

2.2 Data Cleaning and Formatting

Even though a web crawler can collect a large amount of data from the internet, some of the data are meaningless for this project. Data cleaning is especially required when integrating heterogeneous data sources, where some values are missing or anomalous, and some data formats must be addressed together with schema-related data.2 To improve data quality, data cleaning deals with detecting and removing errors and inconsistencies from the data. Most of the data are reliable, and each instance in the dataset includes features. Because the dataset comes from the internet, however, some important values are unavoidably missing, and data cleaning must modify or remove data as required. One way to solve this problem is to substitute the mean of all available values for a missing value. Some instances, however, are missing so many features that they may distort the final result; such instances have to be ignored.
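The two strategies described above, mean substitution and dropping sparse instances, can be sketched as follows. This is a minimal illustration rather than the cleaning code used for the dataset: the field names (`gpa`, `gre`, `toefl`), the sample rows, and the threshold `max_missing` are assumptions chosen for the example.

```python
from statistics import mean

FEATURES = ["gpa", "gre", "toefl"]  # hypothetical feature names


def clean(records, max_missing=1):
    """Drop instances missing more than max_missing features, then
    replace each remaining missing value (None) with the feature mean."""
    kept = [r for r in records
            if sum(r[f] is None for f in FEATURES) <= max_missing]
    # Mean of each feature over the available values of the kept instances.
    means = {f: mean(r[f] for r in kept if r[f] is not None)
             for f in FEATURES}
    return [{f: (r[f] if r[f] is not None else means[f]) for f in FEATURES}
            for r in kept]


# Made-up rows for illustration only.
rows = [
    {"gpa": 3.5, "gre": 320, "toefl": 100},
    {"gpa": None, "gre": 330, "toefl": 110},  # one missing value: imputed
    {"gpa": None, "gre": None, "toefl": 90},  # two missing values: dropped
    {"gpa": 3.1, "gre": 310, "toefl": 96},
]
cleaned = clean(rows)
```

In this sketch the threshold decides when an instance is considered too incomplete to keep; a suitable value depends on how many features the real dataset has.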

To quantify the data, the raw data are transformed into schema data, which facilitates later processing. Some of the Chinese text needs to be translated into English, and measurable data are stored as numbers instead of strings. For example, the result of an instance is marked 1 for a received offer and 0 for a rejection.
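The conversion from strings to numbers might look like the sketch below. The raw field names and the result strings in `RESULT_TO_LABEL` are hypothetical; the actual forum posts may use different wording.

```python
# Hypothetical mapping from translated result strings to binary labels.
RESULT_TO_LABEL = {"offer": 1, "admitted": 1, "rejected": 0}


def encode(raw):
    """Convert one raw record of strings into numeric features and a label."""
    return {
        "gpa": float(raw["gpa"]),
        "gre": int(raw["gre"]),
        "toefl": int(raw["toefl"]),
        "admitted": RESULT_TO_LABEL[raw["result"].strip().lower()],
    }


# Made-up record for illustration only.
sample = {"gpa": "3.6", "gre": "322", "toefl": "104", "result": "Offer"}
encoded = encode(sample)
```

An unknown result string raises a `KeyError` here, which makes records that need manual review easy to spot.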

3. Algorithm

3.1 Logistic Regression Applied to the Admission Result

The admission result can be modeled as a binary target variable: the applicant was either admitted to the program or rejected. Logistic regression models the probability of occurrence of an event as a function of a set of independent variables. The data features are the independent variables; each student has a unique set of test scores (GRE and TOEFL) and a cumulative grade point average.

3.2 Data Features

Graduate Record Examination (GRE), a general test for prospective graduate students that includes verbal reasoning and quantitative reasoning; each of these two sections is scored between 130 and 170.

Test of English as a Foreign Language (TOEFL), a standardized test that measures the English language ability of non-native speakers; the score lies between 0 and 120.

GPA, cumulative grade point average, continuous between 0.0 and 4.0.

Admission result, binary variable, 0 or 1, where 1 means the applicant was admitted to the program.

3.3 Logistic Regression as an Optimization Problem

Logistic regression is a popular classification method that constrains its output to lie between 0 and 1. Denote the possible observations by 0 and 1, so that each series of trials gives a sequence of 0s and 1s. A value of 1 means the applicant was admitted to the program, and 0 means the applicant was rejected.3

The sigmoid function behaves like a smooth step function between the values 0 and 1, which makes it suitable for logistic regression:

$$g(z) = \frac{1}{1 + e^{-z}} \tag{3.1}$$

Figure 3.1 shows the sigmoid on a larger scale, where it appears similar to a step function at z = 0.

Based on Figure 3.1, an instance is classified as 1 when g(z) is greater than 0.5 and as 0 when g(z) is below 0.5.
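The sigmoid and the 0.5 threshold rule can be written in a few lines of Python. This is a sketch for illustration; the function names are not from the paper, and assigning class 0 when g(z) is exactly 0.5 is an assumption, since the text leaves that case open.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return 1 when g(z) exceeds the threshold, otherwise 0."""
    return 1 if sigmoid(z) > threshold else 0
```

For example, `sigmoid(0)` is exactly 0.5, so `classify(0)` returns 0, while any positive z is classified as 1. For very large negative z, `math.exp(-z)` can overflow, so a production version would need a numerically safer formulation.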

$$z = f(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \sum_{i=0}^{n} w_i x_i = w^T x \tag{3.2}$$

That is, the two vectors are multiplied element by element and the products are summed over all features. In vector notation this is written z = w^T x. The vector x holds the instance's features as input data, and we want to find the best coefficients w.

The binomial logistic regression model classifies an instance into one of two outputs. As with other probabilistic classification methods, the output can be interpreted as the probability that the event occurs for a given input. P(y | x) denotes the conditional probability distribution, where the variable y is 0 or 1.

$$P(y = 1 \mid x; w) = \frac{1}{1 + e^{-w^T x}} = g(w^T x) \tag{3.3}$$

$$P(y = 0 \mid x; w) = 1 - g(w^T x) \tag{3.4}$$

Combining equations (3.3) and (3.4) gives

$$P(y \mid x; w) = \left(g(w^T x)\right)^{y} \left(1 - g(w^T x)\right)^{1 - y} \tag{3.5}$$
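The step from (3.5) to a function that can be optimized is worth spelling out. The following is a sketch of the standard derivation, assuming m independent training instances; the symbols m, j, and \ell are introduced here for illustration and do not appear in the equations above.

```latex
\ell(w) = \log \prod_{j=1}^{m} P\big(y^{(j)} \mid x^{(j)}; w\big)
        = \sum_{j=1}^{m} \Big[ y^{(j)} \log g\big(w^T x^{(j)}\big)
          + \big(1 - y^{(j)}\big) \log\big(1 - g(w^T x^{(j)})\big) \Big]

% Using g'(z) = g(z)\,(1 - g(z)):
\frac{\partial \ell(w)}{\partial w}
        = \sum_{j=1}^{m} \Big( y^{(j)} - g\big(w^T x^{(j)}\big) \Big)\, x^{(j)}
```

Maximizing the log-likelihood \ell(w) is equivalent to minimizing the cost -\ell(w), so the same gradient serves either formulation with the sign reversed.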

To optimize this function, an optimization algorithm is needed. After choosing the logistic regression model and selecting the initial feature set, the next step is to obtain the optimal parameters so that the trained logistic regression model gives the best classification results.

This process can be regarded as a search: we look for the solution in the logistic regression solution space that best matches the model we designed. To obtain the corresponding optimal logistic regression model, we need to design a search strategy and decide which criteria to use to choose the optimal model.

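$$w^{t+1} = w^{t} - \alpha \frac{\partial L(w)}{\partial w} \tag{3.8}$$

This equation introduces a step size α, which controls how far each update moves the coefficients.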

Gradient ascent finds a local optimum of a function by using its gradient information. It is also one of the simplest and most commonly used optimization methods in machine learning. In this logistic regression problem we seek the maximum of the likelihood, so each step moves uphill along the gradient, which is equivalent to making the cost function smaller at every step. The update is then iterated in the same way until the optimal value is found.

Pseudo code for the gradient ascent
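A minimal Python sketch of batch gradient ascent for this model is given here. It is one possible implementation under stated assumptions, not necessarily the procedure used in the experiments: the starting coefficients, the step size `alpha`, the iteration count, and the toy data are all chosen for illustration. Stepping uphill on the log-likelihood with `w + alpha * gradient` is the same update as (3.8) when L(w) is taken to be the negative log-likelihood.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_ascent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient ascent on the logistic log-likelihood.

    X: list of feature vectors (each with a leading 1.0 for the intercept)
    y: list of 0/1 labels
    """
    n = len(X[0])
    w = [0.0] * n  # start from all-zero coefficients (an assumption)
    for _ in range(iterations):
        # Gradient of the log-likelihood: sum_j (y_j - g(w . x_j)) * x_j
        grad = [0.0] * n
        for xj, yj in zip(X, y):
            error = yj - sigmoid(sum(wi * xi for wi, xi in zip(w, xj)))
            for i in range(n):
                grad[i] += error * xj[i]
        # Step uphill: w <- w + alpha * gradient
        w = [wi + alpha * gi for wi, gi in zip(w, grad)]
    return w


# Toy, made-up data: intercept term plus one standardized score per applicant.
X = [[1.0, -1.5], [1.0, -0.5], [1.0, 0.3], [1.0, -0.2], [1.0, 0.8], [1.0, 1.4]]
y = [0, 0, 0, 1, 1, 1]
w = gradient_ascent(X, y)
```

On this toy data the fitted slope `w[1]` is positive, so a higher score yields a higher predicted probability of admission. Real GRE or TOEFL values would normally be rescaled first, because large raw scores make a fixed step size unstable.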

Result

Figure 1

Figure 1 shows the GRE scores against the number of acceptances. The mode of the students' scores lies between 315 and 330, and applicants with GRE scores above 315 have a higher acceptance rate.

Figure 2 TOEFL score versus acceptance rate

The TOEFL results suggest that a higher TOEFL score corresponds to a higher admission rate.

Figure 3 Precision and ROC curve

To predict the impact of GRE and TOEFL scores on the admission rate, the prediction function returns the label value, where 0 represents failure and 1 represents success. The function also returns the probability value as the predicted probability. The first term we care about is Precision = TP/(TP+FP), where TP, FP, and FN count the true positives, false positives, and false negatives. Precision tells us the fraction of records that were actually positive among those the classifier predicted to be positive. The second term we care about is Recall = TP/(TP+FN), the fraction of actual positive records that the classifier found.
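Both quantities follow directly from counting outcomes. The sketch below uses made-up labels, and returning 0.0 when a denominator is zero is a convention assumed here for convenience.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so both precision and recall equal 0.75.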

ROC curves can be used to compare classifiers and to support cost-benefit analysis in decision making.5 Different classifiers may perform better at different thresholds, and combining them in some way may make sense. In this model, we plot GRE score against acceptance rate and TOEFL score against acceptance rate as the curves in Figure 3.

The x-axis in Figure 3 is the false positive rate, FP/(FP+TN), and the y-axis is the true positive rate, TP/(TP+FN). The ROC curve shows how the two rates change with the threshold. The leftmost point corresponds to classifying every instance as negative, and the rightmost point corresponds to classifying every instance as positive.5,6
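How the curve is traced out by varying the threshold can be sketched as follows. The labels and predicted probabilities are made up for illustration, and the function name is not from the paper.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs obtained by
    lowering the threshold through every distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: all predicted negative
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up labels and predicted probabilities for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
```

The first point is (0.0, 0.0), where everything is classified as negative, and the last is (1.0, 1.0), where everything is classified as positive. The sketch assumes the data contain at least one positive and one negative instance.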

Conclusion

Logistic regression finds the best-fit parameters of a nonlinear function called the sigmoid. Although logistic regression is used for classification, the algorithm is still built on linear regression: it adds a sigmoid mapping on top of the linear mapping from features to result. The features are first combined in a weighted linear sum, and the sigmoid function is then applied to that sum to make the prediction. Optimization methods can be used to find the best-fit parameters. Among the optimization algorithms, one of the most commonly used is gradient ascent. Gradient ascent can be simplified by using stochastic gradient ascent.