Jakub Rybicki
5 min read · Sep 7, 2021

Scraping the State

So…. a little about me.

I like dirt. I like to get dirty. Realllly dirty. …No, actually just a regular amount of dirty. It’s essentially a prerequisite when you choose a career as a geotechnical engineer. Geotechnical engineering is the branch of civil engineering that focuses on soil mechanics and the engineering behavior of earth materials; we design everything that’s directly on or below the ground surface.

“Everything you see around you is supported by soil or rock. Geotechnical engineers are responsible for that. Anything that is not supported by soil or rock, either floats, flies or falls down.”

Most people might see it as just fiddling with soil, but I think geotechnical engineering is one of the most exciting fields there is. We build bridge foundations, dams, and tunnels while designing against destructive forces such as earthquakes, landslides, and liquefaction. Seismic design has become a priority for any structure in California to withstand the frequent earthquakes. In 2011, large parts of Christchurch were devastated when half the city’s ground liquefied within seconds. It’s crazy. Oh, and Mexico City is sinking over a foot per year, and no one knows how to stop it. These are just some famous problems in geotech and what we as engineers try to design against.

Streets of Christchurch after liquefaction

I’ve been working at the NYSDOT Geotechnical Engineering Bureau for almost 4 years and became licensed as a Professional Engineer earlier this year. Prior to that, I completed my Master’s degree, published research in scientific journals, and presented at conferences along the east coast. I’ve designed and worked on challenging projects throughout New York State, ranging from foundation designs for bridges to mitigating the occasional landslides that occur upstate. As much as I enjoyed my job, the government constraints and political aspects involved in working for the State put a damper on my fun. As engineers we aim to design for a 50–100 year structural life span, but instead we design according to a person’s political career. Over the years, it’s become apparent that there are some things that can be improved with the way the State does things.

Data Scraping

When COVID first hit, New York State was scrambling to control the situation. The governor called out for volunteers to assist in the pandemic and was in turn greeted with over 30,000 medical professionals ready to assist the people of New York. The problem was that the State can’t let just anyone who claims to be a doctor start going around treating people. All 30,000+ professionals had to be vetted and have their backgrounds checked to confirm that they were actually licensed, trained, and had no criminal records. The data on each volunteer was compiled into a long list in a csv file; however, the State didn’t really have an automated way to go through the list and perform a background check on each entry. The governor’s office met with the directors and representatives from each bureau and after some time determined that the best way to do this was to pull dozens of state employees from different departments and have them manually verify each name on the list over the course of a week or two.

Dozens of state workers.

Manually clicking through sites to check credentials.

Updating an Excel sheet.

Over a week.

This screamed inefficient, but it was the best plan they came up with on the spot. Those 30,000 medical professionals standing by, ready to help, couldn’t make a move until everyone got checked. One of my supervisors who attended those meetings became concerned at the amount of manpower and time being wasted on such a tedious task. He called and asked if there was “something we could code to do this, something like a scraper that could go through these sites and verify the list” (I had been pretty vocal in the office about learning to code and fiddling with APIs).

Of course a scraper would be great for this. I had only a vague idea of how to code one, but I knew it could be done, and definitely in a shorter amount of time than doing it manually. Five hours later, with the help of friends who are much more skilled in Python, we had our scraper. We broke the massive 30,000-person list into 5 smaller ones and had 5 instances of the program running for about half a day. And done.

It’s nothing special, but this little bit of code saved a lot of time and manpower during the peak of the pandemic, when speed was essential. It made me realize how useful it is to know how to gather and manipulate vast amounts of data. I’ll go over the code below but will not disclose the list of volunteers, as it contains private personal information.

The Problem

To vet current and retired healthcare workers volunteering to assist NYS during the pandemic

1) Determine the status of their license by searching the Office of the Professions database

2) Check for any disciplinary actions

a) physicians (i.e., doctors), physician assistants, and specialist assistants: NYS Dept of Health database

b) all other medical professionals: Office of the Professions database

3) Update “Vetting Status” and “Vetting Comments”

a) If not passed, pull the comment data and add it to the spreadsheet (e.g., disciplinary action, license revoked/suspended, missing information)

Scraping with Selenium

To begin, we need to install the Selenium package by typing the following in the terminal:

pip install selenium

We also need to download a driver for Selenium to communicate with a browser. This will emulate a real user’s interaction with the browser and allow us to navigate to the pages that have the data we need. Here we will use geckodriver to drive Mozilla Firefox. After downloading geckodriver.exe, we need to add its folder to the system path:

setx /m path "%path%;C:\WebDriver\bin"

Since we need to gather, analyze, and manipulate data, we will be utilizing the pandas library.

Begin by importing the necessary libraries. The multiple “from selenium…” imports will ensure the program continues running if there happens to be an entry with unreadable data.
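A minimal sketch of those imports, assuming pandas for the spreadsheet work plus the standard Selenium helpers and exception classes (the exact set in the original script may have differed):

import pandas as pd  # spreadsheet handling
from selenium import webdriver  # drives Firefox through geckodriver
from selenium.webdriver.common.by import By  # locating form fields on a page
from selenium.webdriver.support.ui import Select, WebDriverWait  # dropdowns and page-load waits
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException  # caught so one bad entry doesn't stop the run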

We use the information from these three columns to vet each volunteer.

This reads our spreadsheet of volunteers and tells the program how to interpret the spreadsheet data, what to input to the browser, and where. It also prepares the columns that will hold the scraped data for each volunteer we’re vetting.
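In outline, that setup might look like the following; the file name and the “License Number”, “Professional Title”, and “Last Name” headers are stand-ins for the real spreadsheet, which I’m not sharing:

# Read the volunteer list; keep everything as strings so license numbers aren't mangled.
volunteers = pd.read_csv("volunteers.csv", dtype=str)  # hypothetical file name

# Columns the scraper will fill in for each volunteer.
volunteers["Vetting Status"] = ""
volunteers["Vetting Comments"] = ""

# Launch Firefox through geckodriver (already on the system path).
driver = webdriver.Firefox()
driver.implicitly_wait(5)  # tolerate slow page loads before giving up on an element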

Checks the medical professional’s license status.

Inputs the “License number” and selects the corresponding “Professional Title” from the dropdown options in the browser.

Logs the resulting license status of the volunteer.
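Putting those three steps together, the license check looked roughly like the sketch below. The URL, field names, and the “license_status” element are placeholders for however the Office of the Professions search form is actually laid out, not the exact selectors from the original script:

def check_license(driver, license_number, professional_title):
    """Look up one volunteer on the Office of the Professions site and return the license status text."""
    driver.get("http://www.op.nysed.gov/opsearches.htm")  # placeholder verification-search page
    try:
        # Enter the license number and pick the profession from the dropdown.
        driver.find_element(By.NAME, "license_no").send_keys(license_number)
        Select(driver.find_element(By.NAME, "profession")).select_by_visible_text(professional_title)
        driver.find_element(By.NAME, "submit").click()
        # Wait for the results page, then read the status (e.g. "REGISTERED", "SUSPENDED").
        status = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "license_status"))
        ).text
    except (NoSuchElementException, TimeoutException):
        # Unreadable or missing entry; flag it instead of crashing the whole run.
        status = "UNABLE TO VERIFY"
    return status

# Walk the list row by row and log each status.
for idx, row in volunteers.iterrows():
    volunteers.at[idx, "Vetting Status"] = check_license(
        driver, row["License Number"], row["Professional Title"]
    )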

Checks if the volunteer has any disciplinary actions recorded against them.

If anyone fails the vetting, the comments are pulled into a separate column in the output file.
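Roughly, and again with a placeholder URL and selectors rather than the real DOH search form, the physician check might look like this:

def check_discipline_physician(driver, last_name):
    """Search the NYS Dept of Health listings for disciplinary actions against a physician."""
    driver.get("https://www.health.ny.gov/")  # placeholder starting point for the DOH misconduct search
    try:
        driver.find_element(By.NAME, "last_name").send_keys(last_name)
        driver.find_element(By.NAME, "search").click()
        # Collect whatever actions were found so they can go into "Vetting Comments".
        actions = driver.find_elements(By.CLASS_NAME, "action-row")
        return "; ".join(a.text for a in actions)
    except NoSuchElementException:
        return "Unable to check disciplinary record"

# Inside the vetting loop: any hit fails the volunteer and is copied into the comments column.
comment = check_discipline_physician(driver, row["Last Name"])
if comment:
    volunteers.at[idx, "Vetting Status"] = "Not Passed"
    volunteers.at[idx, "Vetting Comments"] = comment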

This is repeated for volunteers who are not doctors, as their disciplinary records are located in a different database.
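The non-physician version is the same pattern pointed at the Office of the Professions discipline listings, plus a little routing based on the volunteer’s title (the title strings below are assumptions):

def check_discipline_other(driver, last_name):
    """Same pattern for everyone else, against the Office of the Professions discipline pages."""
    driver.get("http://www.op.nysed.gov/")  # placeholder; the discipline summaries live under this site
    try:
        driver.find_element(By.NAME, "last_name").send_keys(last_name)
        driver.find_element(By.NAME, "search").click()
        return "; ".join(a.text for a in driver.find_elements(By.CLASS_NAME, "action-row"))
    except NoSuchElementException:
        return "Unable to check disciplinary record"

# Assumed titles routed to the DOH database; everyone else goes through the function above.
PHYSICIAN_TITLES = {"Physician", "Physician Assistant", "Specialist Assistant"}

def check_discipline(driver, row):
    if row["Professional Title"] in PHYSICIAN_TITLES:
        return check_discipline_physician(driver, row["Last Name"])
    return check_discipline_other(driver, row["Last Name"])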

Records the comments for volunteers with license issues.
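A small helper along these lines captures that step; “REGISTERED” as the wording of a clean result is an assumption:

def record_license_issue(volunteers, idx, status):
    """Note any license problem (revoked, suspended, not found) in the output columns."""
    if status == "REGISTERED":  # assumed wording of a clean result
        volunteers.at[idx, "Vetting Status"] = "Passed"
    else:
        volunteers.at[idx, "Vetting Status"] = "Not Passed"
        volunteers.at[idx, "Vetting Comments"] = f"License issue: {status}"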

Outputs the spreadsheet with the scraped data and closes the browser.
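The wrap-up is just a couple of lines (the output file name here is a placeholder):

# Write out the vetted list and shut down the browser session.
volunteers.to_csv("vetted_volunteers.csv", index=False)  # hypothetical output file name
driver.quit()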