Coronavirus Web Scraper + Google Sheets

I decided to complete a weekly project during my self-isolation period in an effort to keep my mind sharp and entertained. I realized early last week that I can very easily piddle away the day doing school work, watching movies, playing video games, and taking naps. So, I made a personal goal: Start and finish a project that required me to learn something new, make a tangible product, and share what I created with other people.

My first project consisted of learning how to create a program that scrapes websites for data that can be repurposed in a spreadsheet for mathematical analysis, almost like a database. Since I was not working because of COVID-19, I thought it might be fun to gather statistical data about the Coronavirus from various health repositories. I set up a Raspberry Pi, opened a Python script, and began a deep dive into automated web scraping.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser... It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Click on a tab below to see graphs from the final project, view the spreadsheet database, and examine the Python program. All of the data comes from The COVID Tracking Project.




All of the data is stored in a Google Spreadsheet. The spreadsheet contains all state data from March 23, 2020 onward. Historical data is only recorded for the states of Virginia, Georgia, and North Carolina because that is where my family lives.

Updates occur at 11:55 PM each evening.

import json
import urllib.request
from datetime import date

import gspread
from oauth2client.service_account import ServiceAccountCredentials

url = 'https://covidtracking.com/api/states'  # URL that contains current COVID-19 state data.
jsonFile = urllib.request.urlopen(url)  # Request the JSON data from the API.
all_states = json.loads(jsonFile.read())  # Parse the response into state, territory, and DC entries.

# Authorize with the Google Drive/Sheets API using a service-account key file.
scope = ['https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('ABSOLUTE PATH TO CREDENTIALS', scope)
client = gspread.authorize(credentials)
sheet = client.open_by_key('SPREADSHEET ID').sheet1

# Append one row per state/territory with today's date and its case counts.
for result in all_states:
    today = date.today()
    date1 = today.strftime('%m/%d/%Y')
    abbr = result['state']
    pos = result['positive']
    neg = result['negative']
    pend = result['pending']
    hospital = result['hospitalized']
    dead = result['death']
    row = [date1, abbr, pos, neg, pend, hospital, dead]
    sheet.append_row(row, value_input_option='USER_ENTERED', insert_data_option='INSERT_ROWS', table_range='A1')

  1. Parsing data from an .html page is straightforward with the Beautiful Soup package, but when the underlying HTML changes, as it did on The COVID Tracking Project website, parsing the page becomes frustrating. For example, I had to use td tags and indexing to pull the values for the positive, negative, pending, and death cases. When the web managers altered the HTML and CSS, it produced a fatal IndexError in my original Python program, and I had to write an entirely new program that scraped the data from a .json file instead (see the first sketch after this list).
  2. Parsing data using a .json file is more consistent than requesting and storing the data in a .html file.
  3. Connecting a Python program to Google Sheets is simple if you use oauth2client and the Google Sheets API. Following the official Google Sheets API directions is mind-numbing and confounding. I found the Google Sheets + Python Integration video to be clear, except that the screencast failed to mention that you have to share the target Google Sheet with the client_email listed in the credentials.json file (see the second sketch after this list).
  4. Setting up a crontab job to automate a Python program is confusing. For example, I wanted to test how my Python program ran, so I implemented what I thought was a job that occurred every 5 minutes using 5 * * * *. It turned out that this schedule actually runs the script at minute 5 of every hour, not every five minutes. Long story short, a website like Crontab Guru is essential for scheduling jobs at specific intervals or times (see the crontab example after this list).
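
To illustrate the fragility described in item 1, here is a minimal sketch of the index-based td approach. The URL and column order are assumptions for illustration, not the site's actual layout; the point is that any change to the table shifts the hard-coded indices and raises the IndexError that broke my original program.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://covidtracking.com/data')  # hypothetical HTML page with a data table
soup = BeautifulSoup(page.text, 'html.parser')

for tr in soup.find_all('tr'):
    cells = tr.find_all('td')
    if not cells:
        continue  # skip header rows, which use th instead of td
    # Hard-coded indices for the positive, negative, pending, and death columns.
    # If the site adds, removes, or reorders columns, these lookups break.
    positive = cells[1].get_text(strip=True)
    negative = cells[2].get_text(strip=True)
    pending = cells[3].get_text(strip=True)
    deaths = cells[4].get_text(strip=True)
    print(cells[0].get_text(strip=True), positive, negative, pending, deaths)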
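
For item 3, the email address you need to share the Sheet with is stored in the service-account key file itself. A quick way to look it up, assuming the key file is the credentials.json downloaded from the Google Cloud console:

import json

# Path is a placeholder; use the same key file passed to ServiceAccountCredentials above.
with open('credentials.json') as f:
    print(json.load(f)['client_email'])  # share the target Google Sheet with this address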
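
And for item 4, here are crontab entries that show the difference. The interpreter and script paths are placeholders, so adjust them to wherever the program lives on your Raspberry Pi (edit the table with crontab -e):

*/5 * * * * /usr/bin/python3 /home/pi/covid_scraper.py   # every 5 minutes (for testing)
55 23 * * * /usr/bin/python3 /home/pi/covid_scraper.py   # every evening at 11:55 PM
5 * * * *   /usr/bin/python3 /home/pi/covid_scraper.py   # my mistaken entry: minute 5 of every hour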
