Weight (% of final grade): 25%
Due Date: 11:59 ADT December 10th, 2020
Upon successful completion of this project, students will possess a working knowledge of Github and R Markdown, and mastery in the practice of regression analysis. These skills are highly valued globally by employers in search of data scientists.
Project Description:
Each group will be assigned a dataset. Collectively, group members are to perform a complete regression analysis of their data, details of which must be presented on Github (https://github.com) using R Markdown (https://rmarkdown.rstudio.com/articles_intro.html). The following sections must be included:
Abstract (150 words or less)
Introduction (must contain a thorough description of the questions of interest)
Data Description (must contain data visualizations that are properly labelled and explained)
Methods (must contain a complete description of all analysis tools used)
Results (all figures should be properly labelled and discussed)
Conclusion (must contain a concise discussion of what has been learned from the analysis)
Appendix (must include all data and R Markdown files for reproducibility)
Data:
Datasets are found at https://lionbridge.ai/datasets/10-open-datasets-for-linear-regression/. Groups 1-5 are to analyse Dataset 1 (Cancer), Groups 6-10 are to analyse Dataset 2 (CDC), Groups 11-15 are to analyse Dataset 3 (Fish Market), Groups 16-20 are to analyse Dataset 4 (Medical Insurance), Groups 21-25 are to analyse Dataset 5 (New York Stock Exchange), Groups 26-30 are to analyse Dataset 7 (Real Estate), Groups 31-35 are to analyse Dataset 8 (Red Wine), Groups 36-40 are to analyse Dataset 9 (Vehicle), Groups 41-45 are to analyse Dataset 10 (WHO).
Note: Before commencing your analysis, you must introduce one new additional data point into your assigned dataset. A description of this unique data point must be included in your Data Description section along with some rationale for the values chosen.
Grading Scheme:
6   Overall presentation and organization of materials
3   Quality of data visualizations
6   Correctness of analysis
4   Quality and selection of relevant figures
6   Interpretation of results
--
25

Regression and Analysis of Variance STAT 3340 / MATH 3340
Fall 2020
Final Project
Regression and Prediction Using R
Student's Name
Institutional Affiliation
Abstract
As part of daily life, machine learning is used to make decisions, especially by data scientists. This paper aimed to incorporate machine learning algorithms into the prediction of vehicle prices. First, the car.csv dataset was inspected, cleaned, and organized, producing a final dataset used for further analysis. A basic linear model was fitted with Present_Price as the response variable and the remaining columns as explanatory variables. Three algorithms, linear regression, random forest, and support vector machines, were selected for modeling. The data were partitioned into two parts: a training set and a testing set. Models fit on the training data were used to predict the prices in the testing set. The models' performance was evaluated and improved using tuning and cross-validation, and checked for overfitting. Lastly, the models were compared against one another using the calculated RMSE (Root Mean Squared Error). The best-performing model was chosen, leading to meaningful conclusions.
Keywords: algorithms, linear regression, random forest, basic linear model, dataset, regression, output.
Introduction
Every day, applications of machine learning are commonplace. The algorithms help in making critical decisions in every field of work. For instance, media sites rely on machine learning to sift through millions of options to give you song or movie recommendations, and retailers use it to gain insight into their customers' purchasing behavior. Closer to home, data scientists use machine learning to advise on future data patterns and behaviors that could be encouraged or discouraged. Machine learning also entails building models that offer predictive power and can be used to understand data not yet collected.
In earlier statistics classes, machine learning has been used when running simple regression models. However, this is a complex topic with a wide range of possibilities and applications. Therefore, this study sought to present a basic understanding of regression modeling using linear, random forest, and support vector models, as well as to answer the following questions of interest:
* What is the relationship among variables, especially between the car price variable with other variables?
* Is it possible to predict the price of a new car based on historical data?
* Which is the best model for use in prediction among the three?
Data Description
The car dataset contained information about cars and motorcycles listed on CarDekho.com. The data were in a CSV file and included the following columns: model/Car_Name, year, selling price, showroom price/Present_Price, kilometers driven, fuel type, seller type, transmission, and the number of previous owners. Read into R, the dataset had nine columns and 301 rows/observations. After cleaning, there were four categorical and five numerical variables. The categorical variables were model, fuel type, seller type, and transmission; the numerical ones were year, selling price, showroom price, kilometers driven, and the number of previous owners. As instructed, an additional data point was added: a 2018 manual City selling at 4.34M but presently priced at 5.12M, with a mileage of 37,000 kilometers, sold by a Dealer (thus no previous owners), and running on Diesel. During cleaning, the model variable was dropped as it was regarded unimportant for the upcoming analysis. A new variable, age, was created by subtracting the year values from the current year (2020), after which the year variable was deleted. Notably, there were no missing values in the dataset. The final car dataset contained 302 rows and eight columns: Selling_Price, Present_Price, Kms_Driven, Fuel_Type, Seller_Type, Transmission, Owner, and Age.
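The age derivation and column drops described above can be sketched in base R. The column names follow the dataset, but the two rows here are illustrative toy values, not taken from car.csv:

```r
# Illustrative rows mimicking the raw car dataset (toy values, not real data)
cars <- data.frame(Car_Name = c("city", "ritz"), Year = c(2018, 2014))

# Derive Age relative to the current year used in the analysis (2020),
# then drop the now-redundant Year column and the unimportant model column
cars$Age <- 2020 - cars$Year
cars$Year <- NULL
cars$Car_Name <- NULL
```

The same two lines applied to the full 302-row dataset yield the eight-column table used in the rest of the analysis.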
Further, visual presentations in ggplots, histograms, and plots were obtained for each variable.
A ggplot of Age and Present_Price
The above ggplot showed that the relationship between car age and Present_Price was approximately linear: the newer the car, the more expensive it was.
Histograms
Most cars' selling prices were between 0 and 5M, while a few cars were priced between 30 and 35M. Also, most cars had relatively low mileage. Most cars had no previous owners, with a few having up to three. Expectedly, most cars were either new or almost new.
Plots
The majority of the cars ran on Petrol, while fewer than 50 used CNG. Dealers were selling more cars than individuals. Most cars had manual transmissions, while fewer than 50 were automatic. Taken together, these plots suggest that most of the cars featured in the dataset were new.
Methods
As mentioned earlier, a linear regression was fit using Present_Price as the response variable. Overall, from the R output, the model's predictors explained 84.71% (R²) of the variation in price. The model was a good fit, F(8, 293) = 202.9, p-value < 2.2e-16. The following model formula was used:
Present_Price ~ Selling_Price + Kms_Driven + Fuel_Type + Seller_Type + Transmission + Owner + Age
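A minimal base-R sketch of this fit is shown below. Since car.csv is not reproduced here, the data are simulated stand-ins with the cleaned dataset's column names; the coefficients will not match the paper's output:

```r
set.seed(42)
# Simulated stand-in for the cleaned 302-row car data (synthetic values)
n <- 302
dat <- data.frame(
  Selling_Price = runif(n, 0.5, 25),
  Kms_Driven    = runif(n, 2000, 120000),
  Fuel_Type     = factor(sample(c("Petrol", "Diesel", "CNG"), n, replace = TRUE)),
  Seller_Type   = factor(sample(c("Dealer", "Individual"), n, replace = TRUE)),
  Transmission  = factor(sample(c("Manual", "Automatic"), n, replace = TRUE)),
  Owner         = sample(0:3, n, replace = TRUE),
  Age           = sample(2:17, n, replace = TRUE)
)
# Synthetic response: Present_Price depends mainly on Selling_Price and Age
dat$Present_Price <- 1.2 * dat$Selling_Price + 0.3 * dat$Age + rnorm(n)

# Fit Present_Price on all remaining columns, as in the analysis
fit <- lm(Present_Price ~ ., data = dat)
summary(fit)   # coefficient table, R-squared, and overall F-test
```

Note that the three-level Fuel_Type factor expands to two dummy coefficients (Diesel and Petrol, with CNG as the baseline), which is why the importance table below lists Fuel_Type twice.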
However, some coefficients were more statistically significant than others as ranked by importance below.
Variable                  Overall
Selling_Price             27.4908128
Kms_Driven                 2.1504483
Fuel_Type Diesel           0.6907296
Fuel_Type Petrol           0.2347870
Seller_Type Individual     0.3083687
Transmission Manual        0.5142432
Owner                      1.9049391
Age                        8.1736365
For example, selling price, kilometers driven, and age were statistically significant in predicting showroom price (p < 0.05). The fuel type and seller type coefficients were negative and statistically insignificant, while transmission and owner were positive but insignificant (p > 0.05). One would expect that the more kilometers a car had been driven, the cheaper it would be; here, however, the kilometers-driven coefficient was positive, with higher mileage associated with a higher showroom price. Manual transmission did not have much influence on price.
Based on the above results, it was generally not a good idea to use the entire data sample to fit the model. Hence, the decision was made to train the model on a sample of the data and then observe its performance outside the training sample. Thus, the dataset was partitioned into a training set and a test set, where the latter was used to evaluate the performance of the three models on unseen data. See the next section.
Results
The dataset was divided at a ratio of 0.8:0.2 into a training set of 241 rows and a testing set of 61 rows, each with eight columns. Next, the caret package was loaded to obtain the train function used for model training. A glimpse of the first six rows of the predicted showroom prices on the testing set was obtained using the predict function. Evaluation metrics (RMSE, R², MAE, and others) were calculated for each algorithm, and all models were compared against one another based on these metrics.
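The split-fit-evaluate workflow can be sketched in base R without caret. The toy single-predictor data below stand in for the car dataset; caret's createDataPartition additionally stratifies the split on the outcome, whereas plain sample() is the unstratified equivalent:

```r
set.seed(1)
# Toy data standing in for the car dataset (illustrative only)
n <- 302
dat <- data.frame(x = runif(n, 0, 20))
dat$y <- 2 * dat$x + rnorm(n)

# 80:20 split, as in the analysis: 241 training rows, 61 testing rows
idx   <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Fit on the training set, predict on the held-out testing set
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Evaluation metrics used in the paper
rmse <- sqrt(mean((test$y - pred)^2))
mae  <- mean(abs(test$y - pred))
r2   <- cor(test$y, pred)^2
c(RMSE = rmse, Rsquared = r2, MAE = mae)
```

The same three metrics are what caret reports for each trained model in the tables that follow.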
1 Linear Regression Algorithm
The first algorithm, linear regression, was trained using the train function where Present_Price/showroom price was the outcome variable. The following results were obtained.
Intercept   RMSE       Rsquared   MAE        RMSESD     RsquaredSD   MAESD
TRUE        3.425228   0.831936   2.098598   1.871193   0.1224851    0.5642068
The trained model was then used to predict car prices at the showroom on the testing set.
Row         6           10          17          21         22         23
Predicted   11.320564   10.548235   10.184033   2.766763   5.617042   9.694914
The predictive ability of the model was tested on the testing set, and the evaluation metrics were derived as shown below;
RMSE        Rsquared    MAE
4.4259303   0.8192709   2.5507228
The above output showed that the RMSE was 3.43M on the training set and 4.43M on the testing set, while the R-squared value was around 83% for training and 82% for testing, indicating good performance.
2 Support Vector Machine Algorithm
Similarly, cross-validation was specified using k-fold and leave-one-out CV. Since the SVM has hyperparameters, they were tuned by inspecting and modifying specific algorithm parameters; the candidate values were fed into the model-training function using the expand.grid command. The tuned model was then trained on the training set.
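The tuning grid can be built with base R's expand.grid; the cost values below mirror the C column of the results table. The commented caret call is illustrative of how such a grid is typically supplied, not a reproduction of the paper's exact code:

```r
# Cost grid for the SVM, matching the C values in the results table
svm_grid <- expand.grid(C = 10^seq(-3, 2, by = 1))
svm_grid

# Illustrative caret usage (requires the caret package, so shown as a comment):
# train(Present_Price ~ ., data = train, method = "svmLinear",
#       trControl = trainControl(method = "cv", number = 10),
#       tuneGrid = svm_grid)
```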
C       RMSE       Rsquared    MAE        RMSESD     RsquaredSD   MAESD
1e-03   6.112354   0.7257655   3.376103   3.913042   0.1337751    1.0075231
1e-02   3.961975   0.8439960   1.988603   3.079260   0.1179277    0.7879450
1e-01   3.435029   0.8595457   1.818873   2.524667   0.1268926    0.6702223
1e+00   3.365431   0.8604723   1.822928   2.328030   0.1310797    0.6151965
1e+01   3.365003   0.8607997   1.822493   2.323299   0.1307758    0.6101864
1e+02   3.365878   0.8607631   1.823303   2.324530   0.1307558    0.6107806
Afterward, it was used to obtain the predicted car prices at the showroom.
Row         6           10         17        21         22         23
Predicted   10.798856   10.37632   10.0967   3.925298   5.803843   8.91126
The model's predictability was evaluated on the testing set. See the table below.
RMSE        Rsquared    MAE
3.1847648   0.8468279   1.7856504
The tables above showed that the RMSE ranged from about 3.4M to 6.1M across the tuning grid on the training data and was 3.18M on the testing set, while the R-squared value ranged from about 72.6% to 86.1% on the training set and was about 84.7% on the testing set, indicating good performance.
3 Random Forest Algorithm
Likewise, the cross-validation method was specified before tuning and then training the model. The table below shows the model's results.
mtry   RMSE       Rsquared    MAE        RMSESD     RsquaredSD   MAESD
2      3.691223   0.8484474   1.956376   3.842158   0.12096760   0.9986551
5      2.997213   0.9142653   1.343094   3.646090   0.08030471   0.8501472
8      2.983392   0.9152494   1.305836   3.402356   0.06566592   0.7632234
As usual, the model was used to obtain the first six rows of the predicted prices as below.
Row         6           10          17          21         22         23
Predicted   10.634268   10.822836   10.779255   4.185635   5.818534   11.641299
The predictive ability of the model was assessed on the testing set using the three metrics.
RMSE        Rsquared    MAE
3.2610962   0.8614406   1.4489065
Lastly, the tables above showed that the RMSE ranged from about 3.0M to 3.7M on the training set and was 3.26M on the testing set, while the R-squared value ranged from about 84.8% to 91.5% on the training set and was 86.1% on the testing set, indicating good performance.
All three models were compared using RMSE and R² to determine the best model for predicting car prices at the showroom. A data frame containing both metrics for all three models was obtained, as shown in the R output. Note that these metrics are only meaningful in comparison, which is why they are discussed across models rather than for each model in isolation.
Linear regression had the lowest RMSE on the training set at 3.43M, while the support vector machine had the highest at 6.11M (at its smallest cost value). In terms of R² on the training set, the random forest had the highest (84.8% at mtry = 2, rising to 91.5% at mtry = 8), while the support vector machine had the lowest at 72.6%. On the testing set, the support vector machine had the lowest RMSE at 3.18M, while linear regression had the highest at 4.43M; for R², linear regression had the lowest at 81.9%, while the random forest had the highest at 86.1%. Typically, a lower RMSE and a higher R² indicate a better model. Therefore, the best models for predicting car prices were the support vector machine (lowest test RMSE, 3.18M) and the random forest (highest test R², 86.1%).
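The test-set comparison can be reproduced as a small data frame built from the metrics reported above:

```r
# Test-set metrics for the three models, as reported in the tables above
comparison <- data.frame(
  model    = c("linear", "svm", "random_forest"),
  RMSE     = c(4.4259303, 3.1847648, 3.2610962),
  Rsquared = c(0.8192709, 0.8468279, 0.8614406)
)

# Lower RMSE and higher R-squared indicate the better model
best_rmse <- comparison$model[which.min(comparison$RMSE)]
best_r2   <- comparison$model[which.max(comparison$Rsquared)]
```

Here best_rmse picks out the support vector machine and best_r2 the random forest, matching the conclusion drawn in the text.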
Conclusion
This study aimed to help understand the predictive ability of regression models in R using the car dataset. The key questions of interest were addressed via regression, which showed a relationship between showroom price and the predictors. Similarly, predicting a car's price from already known data was possible in each model; see the predicted prices in the tables above. The RMSE and R² helped determine the best-performing model: the support vector machine and random forest performed better than the linear regression model, with both having a lower RMSE and a higher R² on the testing set. Overall, all the models performed well, with R² above 80% and stable RMSE values below 7. Ideally, the RMSE would be zero and the R-squared value 1, but that is almost impossible with real economic datasets.
Appendix
* The Data
Selling_Price  Present_Price  Kms_Driven  Fuel_Type  Seller_Type  Transmission  Owner  Age
3.35           5.59           27000       Petrol     Dealer       Manual        0      6
4.75           9.54           43000       Diesel     Dealer       Manual        0      7
7.25           9.85           6900        Petrol     Dealer       Manual        0      3
2.85           4.15           5200        Petrol     Dealer       Manual        0      9
4.6            6.87           42450       Diesel     Dealer       Manual        0      6
9.25           9.83           2071        Diesel     Dealer       Manual        0      2
6.75           8.12           18796       Petrol     Dealer       Manual        0      5
6.5            8.61           33429       Diesel     Dealer       Manual        0      5
8.75           8.89           20273       Diesel     Dealer       Manual        0      4
7.45           8.92           42367       Diesel     Dealer       Manual        0      5
2.85           3.6            2135        Petrol     Dealer       Manual        0      3
6.85           10.38          51000       Diesel     Dealer       Manual        0      5
7.5            9.94           15000       Petrol     Dealer       Automatic     0      5
6.1            7.71           26000       Petrol     Dealer       Manual        0      5
2.25           7.21           77427       Petrol     Dealer       Manual        0      11
7.75           10.79          43000       Diesel     Dealer       Manual        0      4
7.25           10.79          41678       Diesel     Dealer       Manual        0      5
7.75           10.79          43000       Diesel     Dealer       Manual        0      4
3.25           5.09           35500       CNG        Dealer       Manual        0      5
2.65           7.98           41442       Petrol     Dealer       Manual        0      10
2.85           3.95           25000       Petrol     Dealer       Manual        0      4
4.9            5.71           2400        Petrol     Dealer       Manual        0      3
4.4            8.01           50000       Petrol     Dealer       Automatic     0      9
2.5            3.46           45280       Petrol     Dealer       Manual        0      6
2.9            4.41           56879       Petrol     Dealer       Manual        0      7
3              4.99           20000       Petrol     Dealer       Manual        0      9
4.15           5.87           55138       Petrol     Dealer       Manual        0      7
6              6.49           16200       Petrol     Individual   Manual        0      3
1.95           3.95           44542       Petrol     Dealer       Manual        0      10
7.45           10.38          45000       Diesel     Dealer       Manual        0      5
3.1            5.98           51439       Diesel     Dealer       Manual        0      8
2.35           4.89           54200       Petrol     Dealer       Manual        0      9
4.95           7.49           39000       Diesel     Dealer       Manual        0      6
6              9.95           45000       Diesel     Dealer       Manual        0      6
5.5            8.06           45000       Diesel     Dealer       Manual        0      6
2.95           7.74           49998       CNG        Dealer       Manual        0      9
4.65           7.2            48767       Petrol     Dealer       Manual        0      5
0.35           2.28           127000      Petrol     Individual   Manual        0      17
3              3.76           10079       Petrol     Dealer       Manual        0      4
2.25           7.98           62000       Petrol     Dealer       Manual        0      17
5.85           7.87           24524       Petrol     Dealer       Automatic     0      4
2.55           3.98           46706       Petrol     Dealer       Manual        0      6
1.95           7.15           58000       Petrol     Dealer       Manual        0      12
5.5            8.06           45780       Diesel     Dealer       Manual        0      6
1.25           2.69           50000       Petrol     Dealer       Manual        0      8
7.5            12.04          15000       Petrol     Dealer       Automatic     0      6
2.65           4.89           64532       Petrol     Dealer       Manual        0      7
1.05           4.15           65000       Petrol     Dealer       Manual        0      14
5.8            7.71           25870       Petrol     Dealer       Manual        0      5
7.75           9.29           37000       Petrol     Dealer       Automatic     0      3
14.9           30.61          104707      Diesel     Dealer       Automatic     0      8
23             30.61          40000       Diesel     Dealer       Automatic     0      5
18             19.77          15000       Diesel     Dealer       Automatic     0