PROJECT 2

In this project, we are asked to define a problem and demonstrate our understanding of a dataset we have chosen.

Introduce the problem

Based on a person's age and BMI (body mass index), do they identify as a smoker or not?

Introduce the data

The data I will be using is a US Health Insurance Dataset that was provided on Kaggle. You can view this dataset here!

Pre-processing the data

1. Check the data: see which attributes are categorical and which are numerical. The key attributes I want to focus on are age and BMI (body mass index). BMI and age are already numerical, so they will be easy to measure. The smoker attribute, however, is a categorical variable, so I converted its "yes" and "no" values to 0s (yes) and 1s (no).

2. Next, since some datasets have missing values, I wanted to make sure that my features in particular did not contain any NaNs. I took the mean of each column and used it to fill in any missing values or NaNs in that column.

3. With the data cleaned, we can now apply the classification model I chose for this project: the Decision Tree.
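The pre-processing steps above (encoding the smoker column and filling missing values with column means) could look roughly like the sketch below. It assumes pandas and uses a tiny made-up DataFrame in place of the Kaggle file; the column names `age`, `bmi`, and `smoker` are assumptions about how the dataset is labeled.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the insurance data (values are illustrative only).
df = pd.DataFrame({
    "age": [19, 33, np.nan, 46, 62],
    "bmi": [27.9, np.nan, 33.0, 30.5, 26.3],
    "smoker": ["yes", "no", "no", "no", "yes"],
})

# Step 1: encode the smoker column using the mapping described in step 1
# ("yes" -> 0, "no" -> 1).
df["smoker"] = df["smoker"].map({"yes": 0, "no": 1})

# Step 2: fill any missing value in a feature column with that column's mean.
features = ["age", "bmi"]
df[features] = df[features].fillna(df[features].mean())

# Count the remaining missing values per column (should all be zero).
print(df.isna().sum())
```

Filling with the mean keeps every row in the dataset, at the cost of pulling the filled rows toward the column average.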

Data Understanding/Visualization

For reference, I used the following source to implement my Decision Tree for this project, so please feel free to see the tutorial for more context. For the Decision Tree, I made the "smoker" attribute the target variable and "age" and "BMI" the features, and I set up those variables accordingly.
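As a rough illustration of that setup, the features and target can be pulled out of the DataFrame as below. This is a sketch with made-up rows, and the lowercase column names are an assumption about the dataset.

```python
import pandas as pd

# Made-up rows standing in for the pre-processed insurance data.
df = pd.DataFrame({
    "age": [19, 33, 40, 46, 62],
    "bmi": [27.9, 29.4, 33.0, 30.5, 26.3],
    "smoker": [0, 1, 1, 1, 0],
})

feature_cols = ["age", "bmi"]  # inputs the tree is allowed to split on
X = df[feature_cols]           # feature matrix, one row per person
y = df["smoker"]               # target variable (encoded smoker label)

print(X.shape, y.shape)
```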

Then we are ready to implement the Decision Tree. I first imported the necessary code into my file. After that, I split the dataset so that 70% is used for training and 30% for testing.
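A 70/30 split like this is commonly done with scikit-learn's `train_test_split`. The sketch below assumes that function was used and runs on randomly generated stand-in data; the `random_state` value is an arbitrary choice to make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (100 random rows), not the real insurance dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=100),
    "bmi": rng.uniform(16, 50, size=100).round(1),
    "smoker": rng.integers(0, 2, size=100),
})
X, y = df[["age", "bmi"]], df["smoker"]

# test_size=0.3 holds back 30% of the rows for testing; the rest is training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(len(X_train), len(X_test))
```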

Now I can create the DecisionTreeClassifier and check how accurate it is on the testing portion of the dataset. The accuracy is currently 69.15%, which means we're on a good track, but we can increase it a bit more.
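Fitting the classifier and scoring it on the test portion might look like the following sketch. It uses random stand-in data, so the accuracy it prints will not match the 69.15% reported here; `random_state` is an assumed setting for repeatability.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the real insurance dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.uniform(16, 50, size=200).round(1),
    "smoker": rng.integers(0, 2, size=200),
})
X, y = df[["age", "bmi"]], df["smoker"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# With default settings the tree splits on Gini impurity and grows without a depth limit.
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)

# Accuracy is the share of test rows whose predicted label matches the true label.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
```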

But I am curious about how the tree looks at this accuracy rate, so I visualized its current state. The result is a bit confusing to follow, since splitting on the Gini index with no depth limit produces many splits. I learned that specifying a maximum depth not only returns a readable tree but also lets you check the entropy or Gini index at each node.
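For readers unfamiliar with the two measures, here is a small sketch of how Gini impurity and entropy are computed for the labels that land in a single node. The helper names `gini` and `entropy` are hypothetical and written only for this illustration; they follow the standard definitions.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over class proportions p."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * log2(p) for p in proportions)


pure_node = [0, 0, 0, 0]    # every sample has the same class
mixed_node = [0, 0, 1, 1]   # classes are split evenly

print(gini(pure_node), entropy(pure_node))
print(gini(mixed_node), entropy(mixed_node))
```

Both measures are zero for a pure node and largest for an even mix, which is why a tree keeps splitting until its nodes are pure unless a depth limit stops it.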

After that, I decided to use entropy as the splitting criterion for this decision tree and to set max_depth equal to 3.
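In scikit-learn, those two settings are passed to the classifier as `criterion="entropy"` and `max_depth=3`. The sketch below assumes that is how the change was made and again runs on random stand-in data, so its accuracy is not comparable to the figures in this write-up.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the real insurance dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.uniform(16, 50, size=200).round(1),
    "smoker": rng.integers(0, 2, size=200),
})
X, y = df[["age", "bmi"]], df["smoker"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Split on entropy instead of Gini, and stop growing the tree after 3 levels.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Depth: {clf.get_depth()}, accuracy: {accuracy:.2%}")
```

Capping the depth limits how closely the tree can fit the training rows, which often helps it generalize to the test portion.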

I do want to point out that this change to the model's settings increased our accuracy rate from 69.15% to 79.60%.

As a result of this change, I get a more readable decision tree that helps me analyze the features and target variable I have chosen and answer my problem statement.
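One way to read a shallow tree without a plotting library is scikit-learn's `export_text`, which prints the split rules as indented text. This is a sketch on random stand-in data and not necessarily how the original figure was produced, so the thresholds it prints are not the ones discussed in this write-up.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data, not the real insurance dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.uniform(16, 50, size=200).round(1),
    "smoker": rng.integers(0, 2, size=200),
})
X, y = df[["age", "bmi"]], df["smoker"]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
clf.fit(X, y)

# Each indented line is a split on age or bmi; leaf lines show the predicted class.
rules = export_text(clf, feature_names=["age", "bmi"])
print(rules)
```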

For context, class = 0 represents a non-smoker and class = 1 denotes a smoker. Reading the tree, people aged 59 or younger with a BMI of 47 or lower would be identified as non-smokers. On the right-hand side of the decision tree, a person aged 59 or older (with the exception of those 63 or older) who has a BMI of 27 or less is identified as a smoker.

Storytelling

When it comes to insurance charges, as the dataset showcases, age, gender, smoking status, and BMI can all play a factor in how much you are charged. With this dataset, we focused on age and BMI in particular, which we have seen can be used to predict someone's smoking behavior. A future analysis could look at whether that also predicts how much someone would be charged. This dataset also showed that BMI at older ages can play a part in social behaviors that seem to be most common in that age range.

My Code

You can view the code I wrote for my project here.

©2023 by Elise Frazier. Proudly created with Wix.com