Project 3

Introduce the problem

What regression model and features could be used to predict house prices?

Introduce the data

The data I will be using is a house prices dataset that includes common housing specifications such as LotArea, LandSlope, SaleType, and MSSubClass. You can find it here!

Pre-processing the data

1. Check the data: I used the info() method and/or the dtypes attribute to see the datatypes of the different fields in the dataset, and I focused on numerical datatypes rather than categorical ones.

2. Next, I used scatterplots to see which fields may be correlated with each other. I found that LotArea and LotFrontage have potential as features in the model.

3. After finding my candidate features through graphing, I made sure they did not have any missing/NaN values, filling any gaps with the mean value of the column.

4. Lastly, I dropped the columns that are not used in the current experiment, because this makes the data easier to work through and helps avoid overfitting the model.
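
The steps above can be sketched roughly as follows. This is only a minimal illustration, not my original notebook code: the rows are made-up stand-ins for the real dataset, the names df_model and missing_before are hypothetical, and the scatterplot step is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the real dataset; the values are made up.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "LandSlope": ["Gtl", "Gtl", "Gtl", "Mod", "Gtl"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Step 1: inspect the datatypes and list the numerical columns.
print(df.dtypes)
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Step 3: count missing values, then fill them with the column mean.
features = ["LotArea", "LotFrontage"]
missing_before = df[features].isna().sum()
df[features] = df[features].fillna(df[features].mean())

# Step 4: keep only the columns used in the current experiment.
df_model = df[features + ["SalePrice"]]
print(df_model)
```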

Data Understanding/Visualization

Experiment 1

The model I decided to use was PCA (Principal Component Analysis). Strictly speaking, PCA is a dimensionality-reduction technique rather than a regression model, but I chose it because I am using more than one feature and wanted to showcase how much the two attributes contribute to dictating the house price. That is what I try to showcase with my first experiment.

As I said in the Pre-processing section, the features I wanted to focus on were LotArea and LotFrontage, so those are the two columns I selected from my data frame.

Then I had to standardize my data, and I used the following code to do that:
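
The original snippet is not reproduced here, so the block below is a minimal sketch of what standardizing two features can look like, assuming scikit-learn's StandardScaler. The array X holds made-up values for LotArea and LotFrontage.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for the two features (columns: LotArea, LotFrontage).
X = np.array([
    [8450.0, 65.0],
    [9600.0, 80.0],
    [11250.0, 68.0],
    [9550.0, 60.0],
    [14260.0, 84.0],
])

# Rescale each column to zero mean and unit variance so that LotArea,
# which has much larger raw values, does not dominate the PCA.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # close to 0 for each column
print(X_scaled.std(axis=0))   # close to 1 for each column
```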

Now that the data is standardized, I was able to implement the PCA model:
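
As a rough illustration only (not my original code), standardizing and fitting PCA can be chained in a scikit-learn pipeline. The data values are made up, and the choice of two components is an assumption for this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical values for the two features (columns: LotArea, LotFrontage).
X = np.array([
    [8450.0, 65.0],
    [9600.0, 80.0],
    [11250.0, 68.0],
    [9550.0, 60.0],
    [14260.0, 84.0],
])

# Standardize first, then project onto the principal components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
components = pipeline.fit_transform(X)

pca = pipeline.named_steps["pca"]
# Share of the total variance captured by each component.
print(pca.explained_variance_ratio_)
# How strongly each original feature loads on each component.
print(pca.components_)
```

The explained_variance_ratio_ values describe how much of the spread in the two features each component captures; they do not measure accuracy against SalePrice.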

Based on the 3D graph, the data is still skewed toward one direction of the plane. This is probably because many houses have nearly the same values, or because the NaN/missing values I filled with the mean pushed the data toward a specific value in an interval of lower numbers. Both are possibilities to consider, but I want to analyze other features further, or maybe combine two more features with LotArea and LotFrontage, to see if the visualization improves.

As for the accuracy score, I was not able to compute it because of system errors that continued to persist, but I hope to find an alternative way to see the accuracy of the PCA model.

For the second experiment, I wanted to see what other features may have a correlation with price, so I implemented a heatmap to see which features have a high correlation with SalePrice. Based on my analysis, "OverallQual" (besides "Id") has a high correlation with this field.
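
The numbers behind such a heatmap come from a correlation matrix. Below is a minimal sketch, assuming pandas and a handful of made-up rows; the name ranked is hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset; the values are made up.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "OverallQual": [7, 6, 7, 5, 8],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Pairwise Pearson correlation between all numerical columns.
corr = df.select_dtypes(include="number").corr()

# Rank the features by their correlation with SalePrice.
ranked = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(ranked)
```

The corr matrix can then be passed to a plotting function such as seaborn's heatmap to get the visual version.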

For the model, I used Linear Regression to predict the SalePrice based on OverallQual. I first defined OverallQual as the feature (x) and SalePrice as the target (y). Then I defined the Linear Regression model to predict the target output, which is explained below along with my final finding for this experiment.

After I finished setting up the model, I was ready to predict the SalePrice based on OverallQual. For the example below, I used one of the common values for OverallQual, 7.
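
A minimal sketch of this step, assuming scikit-learn's LinearRegression and made-up training rows (so the prediction it prints will not match the figure from my real run):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: OverallQual as the feature, SalePrice as the target.
X = np.array([[7], [6], [7], [5], [8]])
y = np.array([208500, 181500, 223500, 140000, 250000])

# Fit a straight line: SalePrice = intercept + coef * OverallQual.
model = LinearRegression()
model.fit(X, y)

# Predict the price for a house with OverallQual = 7.
predicted = model.predict(np.array([[7]]))
print(f"Predicted SalePrice for OverallQual=7: ${predicted[0]:,.2f}")
```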

So the predicted value was around $221,179.50, which is in a similar range to some of the actual sale prices. Lastly, to see how close or far I was from the actual values, I calculated the mean_squared_error, which turned out to be around 8.313. Since that is well below a three-digit number, I took it as a sign that the predictions are close to the target values.
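
For reference, here is a small sketch of how mean_squared_error can be computed with scikit-learn, using made-up actual and predicted prices. One thing to keep in mind is that MSE is expressed in squared units of the target (squared dollars here), so its square root (RMSE) is usually easier to compare against the prices themselves.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted sale prices; the values are made up.
y_true = np.array([208500, 181500, 223500])
y_pred = np.array([210000, 180000, 221000])

# Mean of the squared differences, in squared dollars.
mse = mean_squared_error(y_true, y_pred)

# Square root of the MSE, back in plain dollars.
rmse = np.sqrt(mse)

print(f"MSE:  {mse:,.2f}")
print(f"RMSE: {rmse:,.2f}")
```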

For the third experiment, I decided to still use the PCA model, but this time with different features: OverallQual, LotArea, and LotFrontage. First I defined them as a list of features for the model.

Next, I defined and declared the PCA model with the following lines of code:
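
Since the original lines are not shown here, this is only a sketch of how the feature list and the PCA model could be declared, assuming pandas and scikit-learn. The data values are made up, and keeping three components is an assumption for this example.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical rows standing in for the real dataset; the values are made up.
df = pd.DataFrame({
    "OverallQual": [7, 6, 7, 5, 8],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "LotFrontage": [65.0, 80.0, 68.0, 60.0, 84.0],
})

# The list of features used in this experiment.
features = ["OverallQual", "LotArea", "LotFrontage"]

# Standardize the selected columns before fitting PCA.
X_scaled = StandardScaler().fit_transform(df[features])

# Declare and fit the PCA model; three components is an assumption here.
pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)
```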

As for the visualization, I noticed a bit of improvement over the first experiment, and I believe it may be because the newly added feature carries more information about the price. It seems that a few more values were taken into account in the visualization, but more adjustments could be made to improve the model overall.

Experiment 2

Experiment 3

Storytelling

What dictates house prices can vary by market, by house type/structure, and even by location, but within this dataset a house's value could be assigned based on the number of bathrooms it has, its overall quality, and its size. Those were the factors I considered when trying to determine which ones are important in setting the price of a home. From my own experience, size means everything, so LotArea and LotFrontage stuck out to me based on the descriptions that the Kaggle source provides for each of them. But I do wonder if MSSubClass or the quality of the house can play a part in setting the price as well.

My Code

You can view the code I wrote for my project here.