Topic:

Github and R Markdown, and mastery in the practice of regression analysis.

Statistics Project Instructions:

Weight (% of final grade): 25%
Due Date: 11:59 ADT December 10th, 2020
Upon successful completion of this project, students will possess a working knowledge of Github and R Markdown, and mastery in the practice of regression analysis. These skills are highly valued globally by employers in search of data scientists.
Project Description:
Each group will be assigned a dataset. Collectively, group members are to perform a complete regression analysis of their data, details of which must be presented on Github (https://github.com) using R Markdown (https://rmarkdown.rstudio.com/articles_intro.html). The following sections must be included:
Abstract (150 words or less)
Introduction (must contain a thorough description of the questions of interest)
Data Description (must contain data visualizations that are properly labelled and explained)
Methods (must contain a complete description of all analysis tools used)
Results (all figures should be properly labelled and discussed)
Conclusion (must contain a concise discussion of what has been learned from the analysis)
Appendix (must include all data and R Markdown files for reproducibility)
Data:
Datasets are found at https://lionbridge.ai/datasets/10-open-datasets-for-linear-regression/. Groups 1-5 are to analyse Dataset 1 (Cancer), Groups 6-10 are to analyse Dataset 2 (CDC), Groups 11-15 are to analyse Dataset 3 (Fish Market), Groups 16-20 are to analyse Dataset 4 (Medical Insurance), Groups 21-25 are to analyse Dataset 5 (New York Stock Exchange), Groups 26-30 are to analyse Dataset 7 (Real Estate), Groups 31-35 are to analyse Dataset 8 (Red Wine), Groups 36-40 are to analyse Dataset 9 (Vehicle), Groups 41-45 are to analyse Dataset 10 (WHO).
Note: Before commencing your analysis, you must introduce one new additional data point into your assigned dataset. A description of this unique data point must be included in your Data Description section along with some rationale for the values chosen.
Grading Scheme:
6 Overall presentation and organization of materials
3 Quality of data visualizations
6 Correctness of analysis
4 Quality and selection of relevant figures
6 Interpretation of results
--
25 Total
Regression and Analysis of Variance STAT 3340 / MATH 3340
Fall 2020
Final Project

Statistics Project Sample Content Preview:

Regression and Prediction Using R
Student's Name
Institutional Affiliation
Abstract
Machine learning is used in everyday decision-making, especially by data scientists. This paper applied machine learning algorithms to the prediction of vehicle prices. First, the car.csv dataset was inspected, cleaned, and organized, yielding a final dataset for further analysis. A basic linear model was fitted with Present_Price as the response variable and the remaining variables as explanatory variables. Three algorithms were then selected for modeling: linear regression, random forest, and support vector machines. The data were partitioned into a training set and a testing set; the models were trained on the training set and used to predict prices on the testing set. Model performance was evaluated and improved through tuning and cross-validation, and the models were checked for overfitting. Lastly, the models were compared against one another using RMSE (root mean squared error), and the best-performing model was identified, leading to meaningful conclusions.
Keywords: algorithms, linear regression, random forest, basic linear model, dataset, regression, output.
Introduction
Applications of machine learning are encountered every day. The algorithms help in making critical decisions in every field of work. For instance, media sites rely on machine learning to sift through millions of options to give you song or movie recommendations, and retailers use it to gain insight into their customers' purchasing behavior. Closer to home, data scientists use machine learning to advise on future data patterns and on behavior that could be encouraged or discouraged. It also entails building models that offer predictive power and can be used to understand data not yet collected.
In earlier statistics classes, machine learning was used when running simple regression models. However, it is a complex topic with a wide range of possibilities and applications. Therefore, this study sought to present a basic understanding of regression modeling using linear regression, random forest, and support vector machines, and to answer the following questions of interest:
* What is the relationship among the variables, especially between car price and the other variables?
* Is it possible to predict the price of a new car based on historical data?
* Which of the three models is best for prediction?
Data Description
The car dataset contained information about cars and motorcycles listed on CarDekho.com. The data came in a CSV file with the following columns: model (Car_Name), year, selling price, showroom price (Present_Price), kilometers driven, fuel type, seller type, transmission, and the number of previous owners. When loaded into R, the dataset had nine columns and 301 rows (observations). There were five numerical and four categorical variables. The categorical variables were model, fuel type, seller type, and transmission. The numerical variables were year, selling price, showroom price, kilometers driven, and the number of previous owners. As instructed, an additional data point was added for a 2018 manual city selling at 4.34M but presently priced at 5.12M, with a mileage of 37,000 kilometers. The car was being sold by a Dealer, thus with no previous ownership, and ran on Diesel. During cleaning, the model variable was dropped as it was regarded as unimportant for the upcoming analysis. A new variable, Age, was created by subtracting the year values from the current year (2020). This meant deletion of the year and current year variables. Notably, there were no missing values in the dataset. The final car dataset contained 302 rows and eight columns: Selling_Price, Present_Price, Kms_Driven, Fuel_Type, Seller_Type, Transmission, Owner, and Age.
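The cleaning step described above can be illustrated with a short sketch. The paper's analysis was carried out in R; the Python below is only an illustration of the same logic, and the function name clean_record and the Year key are hypothetical names chosen for this example. The values are those of the additional data point described above.

```python
# Reference year used in the paper to convert model year into age.
CURRENT_YEAR = 2020


def clean_record(row):
    """Return a copy of row with Age derived and Car_Name and Year dropped."""
    cleaned = {k: v for k, v in row.items() if k not in ("Car_Name", "Year")}
    cleaned["Age"] = CURRENT_YEAR - row["Year"]
    return cleaned


# The additional data point from the Data Description (prices in the
# dataset's own units, as reported in the paper).
new_point = {
    "Car_Name": "city", "Year": 2018,
    "Selling_Price": 4.34, "Present_Price": 5.12, "Kms_Driven": 37000,
    "Fuel_Type": "Diesel", "Seller_Type": "Dealer",
    "Transmission": "Manual", "Owner": 0,
}

cleaned_point = clean_record(new_point)
print(cleaned_point["Age"])    # 2
print(sorted(cleaned_point))   # the eight column names kept for analysis
```

Under these assumptions the new observation has an Age of 2 and carries the same eight columns as the final dataset.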
Further, ggplots, histograms, and other plots were produced to visualize each variable.
A ggplot of Age and Present_Price
The ggplot above showed a linear relationship between the age of a car and its showroom price. In other words, the newer the car, the more expensive it was.
Histograms
Most cars' selling prices were between 0 and 5M, while a few cars were priced between 30 and 35M. Most cars also had low mileage (few kilometers driven). Most cars had close to no previous owners, with a few having about three. As expected, most cars were either new or almost new.
Plots
The majority of the cars ran on Petrol, while fewer than 50 used CNG. Dealers were selling more cars than individuals. Most cars had a manual transmission, while fewer than 50 were automatic. Taken together, most of the cars featured in the dataset were new.
Methods
As mentioned earlier, a linear regression was fitted with Present_Price as the response variable. According to the R output, the model's predictors explained 84.71% of the variation in price (R2 = 0.8471). The model was a good fit, F(8, 293) = 202.9, p < 2.2e-16. The following model equation was obtained.
Present_Price = Selling_Price + Kms_Driven + Fuel_Type + Seller_Type + Transmission + Owner + Age
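Written out in full, the model can be sketched as below. This is a reconstruction under stated assumptions, not the paper's own notation: the dummy variables are inferred from the coefficient names reported for the model (Fuel_Type Diesel, Fuel_Type Petrol, Seller_Type Individual, Transmission Manual), which implies CNG, Dealer, and Automatic as baseline levels, and the symbols beta and epsilon are introduced here for illustration. The second and third lines check that the reported degrees of freedom and F statistic agree with n = 302 observations, p = 8 coefficients, and the reported R2 of 0.8471.

```latex
\begin{align*}
\text{Present\_Price}_i &= \beta_0 + \beta_1\,\text{Selling\_Price}_i + \beta_2\,\text{Kms\_Driven}_i
  + \beta_3\,\text{Diesel}_i + \beta_4\,\text{Petrol}_i \\
  &\quad + \beta_5\,\text{Individual}_i + \beta_6\,\text{Manual}_i
  + \beta_7\,\text{Owner}_i + \beta_8\,\text{Age}_i + \varepsilon_i \\[4pt]
n - p - 1 &= 302 - 8 - 1 = 293 \\[4pt]
F &= \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}
   = \frac{0.8471 / 8}{0.1529 / 293} \approx 202.9
\end{align*}
```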
However, some predictors were more important than others, as shown by the importance scores below.
Predictor                 Overall
Selling_Price             27.4908128
Kms_Driven                2.1504483
Fuel_Type Diesel          0.6907296
Fuel_Type Petrol          0.2347870
Seller_Type Individual    0.3083687
Transmission Manual       0.5142432
Owner                     1.9049391
Age                       8.1736365

Selling price, kilometers driven, and age were statistically significant predictors of showroom price (p < 0.05). The coefficients for fuel type and seller type were negative and statistically insignificant, while those for transmission and owner were positive but insignificant (p > 0.05). One would expect a car with higher mileage to be cheaper. However, this was not the case: the more kilometers driven, the higher the showroom price. Manual transmission did not have much influence on price.
Based on the above results, it was generally not a good idea to use the entire data sample to fit the model. Hence, the decision was made to train the model on a sample of the data and then observe its performance outside the training sample. The dataset was therefore partitioned into a training set and a testing set, where the latter was used to evaluate the performance of the three models on unseen data. See the next section.
Results
The dataset was split in a 0.8:0.2 ratio into a training set (241 rows, eight columns) and a testing set (61 rows, eight columns). Next, the caret package was loaded to obtain the train function used for model training. For each model, the predict function was used to obtain predicted showroom prices on the testing set, and the first six predictions were inspected. Evaluation metrics, including RMSE, R2, and MAE, were calculated for each algorithm, and all models were compared against one another on these metrics.
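The partition sizes above follow directly from the 0.8:0.2 ratio applied to 302 rows. The sketch below is a minimal Python illustration of a random hold-out split; it is not the R code used in the paper, which does not name the function used for partitioning. The function name train_test_split and the seed value are assumptions made for this example.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle row positions and split them into training and testing parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_train = int(train_fraction * len(rows))   # 0.8 * 302 rounds down to 241
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# Stand-in for the 302 cleaned observations: row positions only.
rows = list(range(302))
train, test = train_test_split(rows)
print(len(train), len(test))  # 241 61
```

Every row lands in exactly one of the two sets, so the testing set contains only observations the models have not seen during training.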
1 Linear Regression algorithm
The first algorithm, linear regression, was trained using the train function where Present_Price/showroom price was the outcome variable. The following results were obtained.
Intercept  RMSE      R Squared  MAE       RMSESD    R Squared SD  MAESD
TRUE       3.425228  0.831936   2.098598  1.871193  0.1224851     0.5642068

On the testing set, a trained model was used to predict car prices at the showroom.
Row        6          10         17         21        22        23
Predicted  11.320564  10.548235  10.184033  2.766763  5.617042  9.694914

The predictive ability of the model was tested on the testing set, and the evaluation metrics were derived as shown below;
RMSE       R squared  MAE
4.4259303  0.8192709  2.5507228

The output above showed that RMSE, one of the two evaluation metrics used for comparison, was 3.43M on the training set and 4.43M on the testing set. The R-squared value was about 83% on the training set and 82% on the testing set, which indicated good performance.
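For reference, the three metrics can be computed as in the following Python sketch, which is an illustration and not the R code behind the reported figures. R squared is computed here as the squared Pearson correlation between observed and predicted values, on the assumption that this matches the convention of caret's postResample function. The example pairs the first six linear regression predictions with the Present_Price values in rows 6, 10, 17, 21, 22, and 23 of the appendix data; treating the prediction labels as appendix row numbers is an assumption.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Assumed pairing: appendix rows 6, 10, 17, 21, 22, 23 (Present_Price)
# against the first six linear regression predictions.
observed = [9.83, 8.92, 10.79, 3.95, 5.71, 8.01]
predicted = [11.320564, 10.548235, 10.184033, 2.766763, 5.617042, 9.694914]

print(rmse(observed, predicted))
print(mae(observed, predicted))
print(r_squared(observed, predicted))
```

Because only six of the 61 test rows are used, the values printed here will not match the full testing-set metrics reported above.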
2 Support Vector Machine Algorithm
Similarly, cross-validation was specified using k-fold and leave-one-out CV. Since SVM has hyperparameters, these were tuned by inspecting and modifying specific algorithm parameters. Specific candidate values were fed into the model training function using the expand.grid command.
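The resampling idea can be sketched as follows. This Python illustration is not the caret code used in the paper: the helper k_fold_indices is a hypothetical name, and k = 10 and the seed are assumptions, since the paper does not state the number of folds. The list of candidate cost values mirrors the C values reported in the tuning results for this model. Leave-one-out CV is the special case where k equals the number of training rows.

```python
import random


def k_fold_indices(n_rows, k=10, seed=1):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized folds
    for held_out in range(k):
        validation = folds[held_out]
        training = [i for f, fold in enumerate(folds) if f != held_out
                    for i in fold]
        yield training, validation


# Candidate cost values, one model fit per value and per fold.
c_grid = [0.001, 0.01, 0.1, 1, 10, 100]

# 241 is the number of training rows reported in the paper.
splits = list(k_fold_indices(241, k=10))
print(len(splits))                          # 10
print(len(splits[0][0]), len(splits[0][1]))
```

Each candidate value in c_grid would be fitted on every training part and scored on the matching validation part, and the scores averaged across folds.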
Then, the model was tuned and trained on the training set.

C      RMSE      R squared  MAE       RMSESD    R squared SD  MAESD
1e-03  6.112354  0.7257655  3.376103  3.913042  0.1337751     1.0075231
1e-02  3.961975  0.8439960  1.988603  3.079260  0.1179277     0.7879450
1e-01  3.435029  0.8595457  1.818873  2.524667  0.1268926     0.6702223
1e+00  3.365431  0.8604723  1.822928  2.328030  0.1310797     0.6151965
1e+01  3.365003  0.8607997  1.822493  2.323299  0.1307758     0.6101864
1e+02  3.365878  0.8607631  1.823303  2.324530  0.1307558     0.6107806

Afterward, it was used to obtain the predicted car prices at the showroom.
Row        6          10        17       21        22        23
Predicted  10.798856  10.37632  10.0967  3.925298  5.803843  8.91126

The model's predictability was evaluated on the testing set. See the table below.
RMSE       R squared  MAE
3.1847648  0.8468279  1.7856504

The tables above showed that RMSE, one of the two evaluation metrics used for comparison, ranged from 3.37M to 6.11M across the tuning values on the training data and was 3.18M on the testing set. The R-squared value ranged from about 72.6% to 86.1% on the training set and was about 84.7% on the testing set, which indicated good performance.
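Tuning amounts to choosing the cost value with the best resampled performance. The Python sketch below applies that rule to the reported SVM tuning results, assuming the criterion is the lowest cross-validated RMSE (the usual default for regression in caret, though the paper does not state which criterion was used).

```python
# Cross-validated results for each candidate cost value, copied from the
# SVM tuning table reported in the paper.
tuning_results = [
    {"C": 1e-03, "RMSE": 6.112354, "Rsquared": 0.7257655},
    {"C": 1e-02, "RMSE": 3.961975, "Rsquared": 0.8439960},
    {"C": 1e-01, "RMSE": 3.435029, "Rsquared": 0.8595457},
    {"C": 1e+00, "RMSE": 3.365431, "Rsquared": 0.8604723},
    {"C": 1e+01, "RMSE": 3.365003, "Rsquared": 0.8607997},
    {"C": 1e+02, "RMSE": 3.365878, "Rsquared": 0.8607631},
]

# Assumed selection rule: smallest resampled RMSE wins.
best = min(tuning_results, key=lambda row: row["RMSE"])
print(best["C"], best["RMSE"])  # 10.0 3.365003
```

Under this rule C = 10 is selected, although the results for C = 1, 10, and 100 differ only in the fourth decimal place of RMSE.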
3 Random Forest Algorithm
Likewise, the cross-validation method was specified before tuning and then training the model. The table below shows the model's results.

   mtry  RMSE      R squared  MAE       RMSESD    R squared SD  MAESD
1  2     3.691223  0.8484474  1.956376  3.842158  0.12096760    0.9986551
2  5     2.997213  0.9142653  1.343094  3.646090  0.08030471    0.8501472
3  8     2.983392  0.9152494  1.305836  3.402356  0.06566592    0.7632234

As usual, the model was used to obtain the first six rows of the predicted prices as below.
Row        6          10         17         21        22        23
Predicted  10.634268  10.822836  10.779255  4.185635  5.818534  11.641299

The predictive ability of the model was assessed on the testing set using the three metrics.
RMSE       R squared  MAE
3.2610962  0.8614406  1.4489065

Lastly, the tables above showed that RMSE, one of the two evaluation metrics used for comparison, ranged from 2.98M to 3.69M on the training set and was 3.26M on the testing set. The R-squared value ranged from about 84.8% to 91.5% on the training set and was 86.1% on the testing set, which indicated good performance.
All three models were compared using RMSE and R2 to determine the best model for predicting car prices at the showroom. A data frame containing the values of both metrics for the three models was obtained, as shown in the R output. Note that the evaluation metrics are meaningful only in comparison: a single model's values cannot be judged without others to compare them with.
On the training set, taking the first row of each tuning table (C = 0.001 for the support vector machine and mtry = 2 for the random forest), linear regression had the lowest RMSE at 3.43M, while the support vector machine had the highest at 6.11M. In terms of R2 on the training set, the random forest model had the highest value (84.8%), while the support vector machine had the lowest at 72.6%. On the testing set, the support vector machine had the lowest RMSE at 3.18M, while linear regression had the highest at 4.43M. For R2 on the testing set, linear regression had the lowest value at 81.9%, while random forest had the highest (86.14%). A lower RMSE and a higher R2 indicate a better model. Therefore, the best models for predicting car prices were the support vector machine (lowest testing RMSE, 3.18M) and the random forest (highest testing R2, 86.14%).
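The testing-set comparison can be reproduced from the reported metrics with a few lines of Python. This is an illustrative sketch and not the R data frame mentioned above; the dictionary layout and variable names are chosen for this example, while the numbers are the testing-set values reported for each model.

```python
# Testing-set metrics as reported for each model in the paper.
test_metrics = {
    "Linear regression": {"RMSE": 4.4259303, "Rsquared": 0.8192709, "MAE": 2.5507228},
    "Support vector machine": {"RMSE": 3.1847648, "Rsquared": 0.8468279, "MAE": 1.7856504},
    "Random forest": {"RMSE": 3.2610962, "Rsquared": 0.8614406, "MAE": 1.4489065},
}

# Lower RMSE and MAE are better; higher R squared is better.
lowest_rmse = min(test_metrics, key=lambda m: test_metrics[m]["RMSE"])
lowest_mae = min(test_metrics, key=lambda m: test_metrics[m]["MAE"])
highest_r2 = max(test_metrics, key=lambda m: test_metrics[m]["Rsquared"])

print(lowest_rmse)  # Support vector machine
print(lowest_mae)   # Random forest
print(highest_r2)   # Random forest
```

The two criteria named in the text point to different models, which is why both the support vector machine and the random forest are reported as best.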
Conclusion
This study aimed to assess the predictive ability of regression models in R using the car dataset. The questions of interest were addressed via regression, which showed a relationship between showroom price and the predictors. Similarly, predicting a car's price from already known data was possible with each model; see the predicted prices tabulated above. In addition, RMSE and R2 helped to determine the best-performing model. The support vector machine and random forest models performed better than the linear regression model: both had a lower RMSE and a higher R2 on the testing set. Overall, all the models performed well, with testing R2 values above 80% and RMSE values below 7. The ideal result would be an RMSE of zero and an R-squared value of 1, but that is almost impossible with real economic datasets.
Appendix
* The Data
Selling_Price  Present_Price  Kms_Driven  Fuel_Type  Seller_Type  Transmission  Owner  Age
3.35           5.59           27000       Petrol     Dealer       Manual        0      6
4.75           9.54           43000       Diesel     Dealer       Manual        0      7
7.25           9.85           6900        Petrol     Dealer       Manual        0      3
2.85           4.15           5200        Petrol     Dealer       Manual        0      9
4.6            6.87           42450       Diesel     Dealer       Manual        0      6
9.25           9.83           2071        Diesel     Dealer       Manual        0      2
6.75           8.12           18796       Petrol     Dealer       Manual        0      5
6.5            8.61           33429       Diesel     Dealer       Manual        0      5
8.75           8.89           20273       Diesel     Dealer       Manual        0      4
7.45           8.92           42367       Diesel     Dealer       Manual        0      5
2.85           3.6            2135        Petrol     Dealer       Manual        0      3
6.85           10.38          51000       Diesel     Dealer       Manual        0      5
7.5            9.94           15000       Petrol     Dealer       Automatic     0      5
6.1            7.71           26000       Petrol     Dealer       Manual        0      5
2.25           7.21           77427       Petrol     Dealer       Manual        0      11
7.75           10.79          43000       Diesel     Dealer       Manual        0      4
7.25           10.79          41678       Diesel     Dealer       Manual        0      5
7.75           10.79          43000       Diesel     Dealer       Manual        0      4
3.25           5.09           35500       CNG        Dealer       Manual        0      5
2.65           7.98           41442       Petrol     Dealer       Manual        0      10
2.85           3.95           25000       Petrol     Dealer       Manual        0      4
4.9            5.71           2400        Petrol     Dealer       Manual        0      3
4.4            8.01           50000       Petrol     Dealer       Automatic     0      9
2.5            3.46           45280       Petrol     Dealer       Manual        0      6
2.9            4.41           56879       Petrol     Dealer       Manual        0      7
3              4.99           20000       Petrol     Dealer       Manual        0      9
4.15           5.87           55138       Petrol     Dealer       Manual        0      7
6              6.49           16200       Petrol     Individual   Manual        0      3
1.95           3.95           44542       Petrol     Dealer       Manual        0      10
7.45           10.38          45000       Diesel     Dealer       Manual        0      5
3.1            5.98           51439       Diesel     Dealer       Manual        0      8
2.35           4.89           54200       Petrol     Dealer       Manual        0      9
4.95           7.49           39000       Diesel     Dealer       Manual        0      6
6              9.95           45000       Diesel     Dealer       Manual        0      6
5.5            8.06           45000       Diesel     Dealer       Manual        0      6
2.95           7.74           49998       CNG        Dealer       Manual        0      9
4.65           7.2            48767       Petrol     Dealer       Manual        0      5
0.35           2.28           127000      Petrol     Individual   Manual        0      17
3              3.76           10079       Petrol     Dealer       Manual        0      4
2.25           7.98           62000       Petrol     Dealer       Manual        0      17
5.85           7.87           24524       Petrol     Dealer       Automatic     0      4
2.55           3.98           46706       Petrol     Dealer       Manual        0      6
1.95           7.15           58000       Petrol     Dealer       Manual        0      12
5.5            8.06           45780       Diesel     Dealer       Manual        0      6
1.25           2.69           50000       Petrol     Dealer       Manual        0      8
7.5            12.04          15000       Petrol     Dealer       Automatic     0      6
2.65           4.89           64532       Petrol     Dealer       Manual        0      7
1.05           4.15           65000       Petrol     Dealer       Manual        0      14
5.8            7.71           25870       Petrol     Dealer       Manual        0      5
7.75           9.29           37000       Petrol     Dealer       Automatic     0      3
14.9           30.61          104707      Diesel     Dealer       Automatic     0      8
23             30.61          40000       Diesel     Dealer       Automatic     0      5
18             19.77          15000       Diesel     Dealer       Automatic     0

