Corona Virus: Limited Data Models

Jessica Kimbril
4 min read · Mar 10, 2020

For my last build project, I chose something light-hearted and fun. For this one, I decided to get more serious. I want a good mix of project types in my portfolio, and I thought, “What is more serious right now than the coronavirus?”

The coronavirus causes a deadly disease that is bringing sadness and destruction around the world. I did not think about it when I started this project, but over the last couple of days something dawned on me: the work I am doing here could affect the real world. The coronavirus is getting closer to home, and people are getting scared.

Coronavirus Map

My original idea for this project was to look at the data and predict whether someone who contracted the disease died or fully recovered. I specifically wanted to see what the numbers looked like as the virus spread and what the chances of recovery were. The first thing I did was look at my dataset. I wanted to see the whole picture before getting into the nitty-gritty.

Corona Virus Dataset

After looking at my data, I created a baseline model as a starting point. I decided to make “Recovered” my target. The mean absolute error (MAE) is the metric to pay attention to here, and for the baseline it is 3.54.
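To make the idea of a baseline concrete, here is a minimal sketch. It assumes the baseline simply predicts the mean of the target for every row, which is a common choice for regression but may not be exactly what the notebook does. The `recovered` values are made up for illustration and are not the post's dataset.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_baseline(y_train, n):
    """Predict the training mean for each of n rows (assumed baseline)."""
    guess = mean(y_train)
    return [guess] * n


# Made-up "Recovered" counts, for illustration only
recovered = [0, 0, 1, 0, 2, 0, 15, 0]

baseline_pred = mean_baseline(recovered, len(recovered))
baseline_mae = mean_absolute_error(recovered, baseline_pred)
print(f"Baseline MAE: {baseline_mae:.2f}")
```

Any real model should beat this number; if it does not, the features are not adding anything over a constant guess.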


Next, I split my data 80/20: I train on 80 percent of the data and test on the remaining 20 percent.
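An 80/20 split can be sketched in a few lines of plain Python. The helper name `train_test_split_rows`, the seed, and the stand-in rows are all assumptions for illustration; the notebook may well use a library helper instead.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 dataset rows
train, test = train_test_split_rows(rows)
print(len(train), len(test))
```

Shuffling before splitting matters when the rows are ordered, for example by date or by region, because otherwise the test set would only contain one end of the data.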


Next, I graphed my target. I used a kernel density plot, and it shows that the target is heavily skewed rather than evenly distributed. Most of my data looks like this. This is when I started to suspect that these models might not perform very well.
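For readers curious about what a kernel density plot is drawing, here is a rough sketch of a Gaussian kernel density estimate in plain Python. The bandwidth uses the normal-reference rule of thumb, which is an assumption (plotting libraries choose their own defaults), and the sample values are invented.

```python
import math
from statistics import stdev


def gaussian_kde(samples, bandwidth=None):
    """Return a function that estimates the density of samples at x."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb: h = 1.06 * std * n^(-1/5)
        bandwidth = 1.06 * stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up skewed target: mostly zeros with one large value
recovered = [0, 0, 1, 0, 2, 0, 15, 0, 1, 0, 0, 3]
density = gaussian_kde(recovered)
for x in (0, 5, 15):
    print(f"density at {x}: {density(x):.3f}")
```

With data like this, nearly all of the density piles up near zero and only a thin tail reaches the large value, which is the shape a skewed target produces.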

I started with a linear regression model. Its MAE of 0.83 is better than the baseline's 3.54, but notice that the R² score is 0.97. This is worrisome because we saw earlier how skewed the target is, and yet the R² score is very high. An R² score close to 1 means the model explains almost all of the variance in the target, so judged by this score alone I have a near-perfect model. Because the data is so skewed, though, I would not trust the model in a real-world application.
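To illustrate why a high R² on a skewed target deserves suspicion, here is a small sketch with invented numbers. The hypothetical predictions are off by 2 on every row, which is a large miss for the small counts, yet R² stays very close to 1 because a single large value dominates the variance.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up skewed target: mostly small counts plus one very large one
y_true = [0, 1, 0, 2, 1, 0, 3, 500]
# Hypothetical predictions: every row is off by exactly 2
y_pred = [2, 3, 2, 0, 3, 2, 1, 498]

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.4f}")
```

This is only a toy case, not a claim about what happened in the notebook, but it shows why it helps to read R² alongside MAE and the shape of the target.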

Since the linear model's results did not look trustworthy, I tried a Random Forest Regressor next.

As we can see, the MAE is 0.42 and the R² score is 0.97. As with the linear model, this suggests that my Random Forest Regressor has near-zero errors. Seeing two models perform this way led me to a thought: maybe this data is not good for predictive models. If I were working for the CDC and building models with this data, I would not feel confident presenting them to the public.

I wanted to include my feature importance graph in my conclusion. You can see how heavily the top features weigh in the model. Unfortunately, as I stated above, I just do not feel that this dataset is large enough to make accurate predictions. I understand that the CDC has different data, but right now this is pretty much all the public has access to.

You can check out my code notebook at

