Corona Virus: Limited Data Models

Jessica Kimbril
4 min read · Mar 10, 2020

For my last build project, I chose something light-hearted and fun. For this one, I decided to get more serious. I want a good mix of project types in my portfolio, and I thought, “What is more serious right now than the coronavirus?”

The coronavirus causes a deadly disease that is bringing sadness and destruction around the world. I did not think about it when I started, but over the last couple of days something dawned on me: the work I am doing on this project could affect the real world. The coronavirus is getting closer to home, and people are getting scared.

Coronavirus Map

My original idea for this project was to look at the data and predict whether someone who contracted the disease died or fully recovered. I specifically wanted to see what the numbers looked like as the virus spread and what the chances of recovery were. The first thing I did was look at my dataset. I wanted to see the whole picture first and get into the nitty-gritty from there.

Corona Virus Dataset

After looking at my data, I created a baseline model as a starting point. I decided to make “Recovered” my target. The number to pay attention to here is the mean absolute error (MAE), which is 3.54.

Baseline
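
A minimal sketch of how a baseline like this can be computed, assuming a mean baseline (always predict the average of the target). The values below are made up for illustration and are not taken from the dataset, so the resulting MAE will not match the 3.54 reported above.

```python
import numpy as np

# Made-up "Recovered" counts standing in for the real target column (assumption).
recovered = np.array([0, 0, 1, 0, 2, 0, 5, 0, 1, 31], dtype=float)

# Mean baseline: predict the same value, the target's mean, for every row.
baseline_prediction = recovered.mean()

# Mean absolute error: average distance between each true value and the guess.
baseline_mae = np.abs(recovered - baseline_prediction).mean()

print(f"Baseline prediction: {baseline_prediction:.2f}")
print(f"Baseline MAE: {baseline_mae:.2f}")
```

Any model worth keeping should beat this number, since the baseline uses no features at all.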

Next, I split my data 80/20: I train on 80 percent of the data and test on the remaining 20 percent.

Train/Test/Split
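
An 80/20 split like this is commonly done with scikit-learn's `train_test_split`. The sketch below assumes that approach and uses randomly generated stand-in data; the array shapes and the `random_state` value are illustrative choices, not details from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 100 rows, 3 numeric features, one count target.
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = rng.poisson(2, size=100)

# test_size=0.2 holds out 20 percent of the rows for testing;
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```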

Next, I graphed my target. I used a kernel density plot, and it shows that the target is heavily skewed rather than evenly distributed. Most of my data looks like this. This is when I started to suspect that these models might not perform very well.
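
The skew that a density plot shows can also be checked with a couple of numbers. This sketch is not a kernel density plot; it compares the mean with the median and computes the sample skewness by hand, using made-up values in place of the real target column.

```python
import numpy as np

# Made-up "Recovered" counts standing in for the real target column (assumption).
recovered = np.array([0, 0, 1, 0, 2, 0, 5, 0, 1, 31], dtype=float)

mean = recovered.mean()
median = np.median(recovered)

# Sample skewness: third central moment divided by the variance to the power 1.5.
centered = recovered - mean
skewness = (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# A mean far above the median and a large positive skewness both point
# to a long right tail: many small values and a few very large ones.
print(f"mean={mean:.2f}, median={median:.2f}, skewness={skewness:.2f}")
```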

I started with a linear regression model. My MAE of 0.83 is better than the baseline's 3.54, but notice that my R² score is 0.97. This is worrisome because we saw earlier how skewed the target is, and yet the R² score is very high. An R² score close to 1 means the model explains almost all of the variance in the target, so judged by this score alone I have a near-perfect model. With a target this skewed, though, a few large values can account for most of that variance, so I would not trust this model for a real-world application.
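
To make the two metrics concrete, here is a small sketch that fits scikit-learn's `LinearRegression` on synthetic, right-skewed data and computes both MAE and R². The data, coefficients, and seeds are invented for illustration. The last lines recompute R² by hand as 1 minus the ratio of residual to total sum of squares, which shows why a few large target values can dominate the score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed features and target (assumption, not the real data).
rng = np.random.default_rng(0)
X = rng.exponential(scale=5.0, size=(200, 2))
y = 0.3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)

# R² by hand: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f}, R²={r2:.2f}, R² by hand={r2_manual:.2f}")
```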

Since I did not trust the linear model's results, I tried a Random Forest Regressor.

As we can see, the MAE is 0.42 and the R² score is 0.97. As with the linear model, this suggests that my Random Forest Regressor is near-perfect. Seeing two models perform this way led me to a thought: maybe this data is not good for predictive models. If I were working for the CDC and making models with this data, I would not feel confident presenting them to the public.

I wanted to include my feature importance graph in my conclusion. You can see how heavily the top features weigh in the model. Unfortunately, as I said above, I just do not feel this dataset is large enough to use for accurate predictions. I understand that the CDC has different data, but right now this is pretty much all the public has access to.
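
For reference, this is roughly how a random forest and its feature importances can be produced with scikit-learn. It is a sketch on synthetic data: the feature names are hypothetical, and the target is built so that only the first feature matters, which makes the importance ranking easy to read.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical feature names and synthetic data (assumptions, not the real dataset).
feature_names = ["confirmed", "deaths", "days_since_first_case"]
rng = np.random.default_rng(1)
X = rng.exponential(scale=5.0, size=(200, 3))
y = 0.3 * X[:, 0] + rng.normal(0, 0.5, size=200)  # only "confirmed" drives the target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

forest = RandomForestRegressor(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)
pred = forest.predict(X_test)

print(f"MAE={mean_absolute_error(y_test, pred):.2f}")
print(f"R²={r2_score(y_test, pred):.2f}")

# feature_importances_ sums to 1; larger values mean the trees relied on
# that feature more when splitting.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

When one feature carries almost all of the importance, a high score says more about that single relationship than about the model's ability to generalize.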

You can check out my code notebook at https://github.com/jessicakimbril/Corona-Virus-Build-Project/blob/master/Corona_Virus_Build.ipynb
