Low-code extraction of listserv archives to tabular data using Webscraper.io

Nick Young
Mar 15, 2022


Originally posted on techupover.com

Mailing lists come and go. If you've kept everything in your inbox (and have been a member since the list's inception), you can always go back and find that random answer someone gave. But if you didn't sign up at the start, have lost emails, or just want a better way to keep historical reference information in a tabular format, this approach may be helpful.


  • Web Browser (Firefox recommended, although Chrome will work too)
  • Access to a mailing list that provides web-based archives (like “listserv”)
  • Webscraper.io browser extension


For my purpose, I wanted to retrieve the contents of a particular EDUCAUSE Community Group mailing list. It recently moved to a new platform, and they were sunsetting the old mailing list server/service.

  • First, download and install the webscraper.io browser extension
  • Open the Developer tools, and click the webscraper tab which should look something like this:
  • You'll then be able to create “selectors”, which is the extension's name for how the tool navigates through the website and where it finds the data you want to extract. For example, with a mailing list archive you navigate from the mailing list home page into one of the “month” pages, then into one of the threads, where you can extract the email subject, date, sender, and message body.
  • This is what the main page looks like:

NOTE: I scraped one year at a time because of timeouts and unexpected resource issues. If you try to run the scraper on all 10 years (for this example), it will time out somewhere in the middle. Doing one year at a time was the happy medium I found.

  • We grab the date, from, and subject from the thread page. But we also want the message body, right? On the individual email page the body is displayed in an iframe, which confused the webscraper extension. So instead we traverse one more level by following the “plain text” link and grabbing the text from there. At the thread level there are three “SelectorText” selectors and one “SelectorLink” selector; that's how we grab some data from one page and supplement it with additional data from a subsequent page.
  • The plain text selector is the easiest of the lot. Just grab everything in the <pre> tag.
  • Simple, right? Traverse from the main page -> select a month -> select a thread -> store the date, from, subject -> select the plain text version -> store the plaintext version -> resulting row of data saved is: Date, Subject, From, Body
  • You can now export the data into XLSX or CSV.
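For readers curious what the plain-text step amounts to outside the extension, here is a minimal Python sketch that collects everything inside a page's <pre> tag using only the standard library. The sample page content is made up for illustration, and the class and function names are hypothetical; the extension itself does this for you through the "pre" selector.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._depth = 0   # how many <pre> tags we are currently inside
        self.chunks = []  # text fragments seen inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while we are inside a <pre> element.
        if self._depth > 0:
            self.chunks.append(data)


def extract_pre_text(html: str) -> str:
    """Return the concatenated text of all <pre> elements in the page."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical page content, made up for illustration.
sample_page = (
    "<html><body><h1>Archive</h1>"
    "<pre>Hi all,\nDoes anyone use &lt;feature&gt;?\n</pre>"
    "</body></html>"
)
print(extract_pre_text(sample_page))
```

Text outside the <pre> element (the heading in this sample) is ignored, which mirrors what the body selector stores in the final row.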

PROTIP: Use XLSX, because the CSV export does NOT handle newlines or quoted strings well at all.
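To show why email bodies are awkward for CSV, here is a small Python sketch of what a correctly quoted CSV round trip looks like for a row containing newlines, commas, and quotes. The row is made up for illustration; this is not how the extension exports data, only a way to see what a well-behaved CSV writer has to do.

```python
import csv
import io

# Hypothetical row, made up for illustration: the body has newlines,
# a comma, and the subject has embedded double quotes.
row = {
    "Date": "Tue, 3 Jan 2017 10:15:00 -0500",
    "Subject": 'Re: "Shared drives" question',
    "From": "Jane Doe <jane.doe@example.edu>",
    "Body": "Hi all,\nline two, with a comma\n",
}

# newline="" stops Python from translating line endings, so the csv
# module stays in control of how rows and embedded newlines are written.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(row), quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerow(row)

# Read the text back and confirm nothing was lost or split across rows.
buffer.seek(0)
parsed = list(csv.DictReader(buffer))
print(len(parsed), "row read back")
print(parsed[0] == row)
```

If a CSV file is still needed downstream, writing it with a quoting-aware library like this is one option; otherwise the XLSX export avoids the problem.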

That's really it! If you don't want to go through all that trouble, you can import this sitemap I already made for you (just select all, copy, and paste it into the “import sitemap” option in the extension).

{"_id":"googleworkspace","startUrl":["http://listserv.educause.edu/scripts/wa.exe?A0=GOOGLEWORKSPACE"],"selectors":[{"id":"monthselector","parentSelectors":["_root"],"type":"SelectorLink","selector":"li:contains('2017') a","multiple":true,"delay":0},{"id":"threadselector","parentSelectors":["monthselector"],"type":"SelectorLink","selector":"span a","multiple":true,"delay":0},{"id":"email_date","parentSelectors":["threadselector"],"type":"SelectorText","selector":"tr.emphasizedgroup tr:nth-of-type(4) td:nth-of-type(3) p","multiple":false,"delay":0,"regex":""},{"id":"email_from","parentSelectors":["threadselector"],"type":"SelectorText","selector":"tr.emphasizedgroup tr:nth-of-type(2) td:nth-of-type(3) p","multiple":false,"delay":0,"regex":""},{"id":"email_subject","parentSelectors":["threadselector"],"type":"SelectorText","selector":"tr.emphasizedgroup tr:nth-of-type(1) td:nth-of-type(3) p","multiple":false,"delay":0,"regex":""},{"id":"textversionselector","parentSelectors":["threadselector"],"type":"SelectorLink","selector":"tr.emphasizedgroup tr tr a:nth-of-type(1)","multiple":false,"delay":0},{"id":"email_body_plaintext","parentSelectors":["textversionselector"],"type":"SelectorText","selector":"pre","multiple":false,"delay":0,"regex":""}]}
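Since the sitemap above is hard-wired to 2017 through the "li:contains('2017') a" selector, scraping one year at a time means changing that selector for each run. Here is a minimal Python sketch that generates one sitemap per year. The helper name, the list of years, and the renamed "_id" values are assumptions for illustration, and the sitemap in the code is trimmed to its first two selectors to keep the example short (in practice you would load the full JSON with json.loads).

```python
import copy
import json

# Trimmed copy of the sitemap above (first two selectors only).
sitemap = {
    "_id": "googleworkspace",
    "startUrl": ["http://listserv.educause.edu/scripts/wa.exe?A0=GOOGLEWORKSPACE"],
    "selectors": [
        {"id": "monthselector", "parentSelectors": ["_root"], "type": "SelectorLink",
         "selector": "li:contains('2017') a", "multiple": True, "delay": 0},
        {"id": "threadselector", "parentSelectors": ["monthselector"], "type": "SelectorLink",
         "selector": "span a", "multiple": True, "delay": 0},
    ],
}


def sitemap_for_year(base: dict, year: int) -> dict:
    """Return a copy of the sitemap whose month selector targets `year`."""
    result = copy.deepcopy(base)
    # Assumption: give each copy its own id so the imports do not collide.
    result["_id"] = f"{base['_id']}{year}"
    for selector in result["selectors"]:
        if selector["id"] == "monthselector":
            selector["selector"] = f"li:contains('{year}') a"
    return result


# Hypothetical years; use whichever years the archive actually covers.
for year in (2016, 2017, 2018):
    print(json.dumps(sitemap_for_year(sitemap, year), separators=(",", ":")))
```

Each printed line can be pasted into the “import sitemap” option as its own sitemap and run separately.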

Data Cleanup

There's always some kind of data cleanup to do in any data set, and this one is no exception. To make the data easier to work with, here are a few things I did:

  • Split the sender data into separate fields for the sender display name and the sender email address (lowercased)
  • Normalize the subject line data by removing “Re: ” strings, so threads are easier to find together
  • Extract the domain from the sender email address, so we can see which domains had the most people contributing to the mailing list
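The three cleanup steps above can be sketched in a few lines of Python. This is only an illustration, not the exact process I used: it assumes the sender field looks like "Display Name <address>", and the function name and output field names are hypothetical.

```python
import re
from email.utils import parseaddr

# Matches one or more leading "Re:" prefixes, in any letter case.
REPLY_PREFIX = re.compile(r"^\s*(re\s*:\s*)+", re.IGNORECASE)


def clean_row(sender: str, subject: str) -> dict:
    """Split the sender, lowercase the address, and normalize the subject."""
    display_name, address = parseaddr(sender)
    address = address.lower()
    # Everything after the last "@" is the domain; empty if no address found.
    domain = address.rpartition("@")[2] if "@" in address else ""
    return {
        "sender_name": display_name,
        "sender_email": address,
        "sender_domain": domain,
        "subject_normalized": REPLY_PREFIX.sub("", subject).strip(),
    }


# Hypothetical sender and subject, made up for illustration.
cleaned = clean_row("Jane Doe <Jane.Doe@Example.EDU>", "Re: RE: Shared drives")
print(cleaned)
```

The same logic can be written as spreadsheet formulas if you would rather stay in Google Sheets or Excel.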

Visualizing / Exploring the Data

For this particular effort, I used Google Data Studio as a quick way to create a dashboard for people to use for exploration and visualization. It works well (and is free) when the data underneath is stored in a Google Sheet or Excel file stored in Google Drive.

Here are some quick screenshots from the dashboard I made:

I also did a leaderboard page, showing the people who emailed the most, and which domains had the most senders.
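The dashboard did the counting for me, but the two leaderboards boil down to simple tallies. Here is a minimal Python sketch, assuming each row already has cleaned "sender_email" and "sender_domain" fields (hypothetical field names, with made-up sample rows).

```python
from collections import Counter

# Hypothetical cleaned rows, one per email message.
rows = [
    {"sender_email": "ann@example.edu", "sender_domain": "example.edu"},
    {"sender_email": "ann@example.edu", "sender_domain": "example.edu"},
    {"sender_email": "ann@example.edu", "sender_domain": "example.edu"},
    {"sender_email": "bob@example.edu", "sender_domain": "example.edu"},
    {"sender_email": "cat@example.org", "sender_domain": "example.org"},
    {"sender_email": "cat@example.org", "sender_domain": "example.org"},
]


def top_senders(rows, top_n=10):
    """People who emailed the most: message count per sender address."""
    return Counter(row["sender_email"] for row in rows).most_common(top_n)


def top_domains(rows, top_n=10):
    """Domains with the most senders: distinct addresses per domain."""
    unique_pairs = {(row["sender_domain"], row["sender_email"]) for row in rows}
    return Counter(domain for domain, _ in unique_pairs).most_common(top_n)


print(top_senders(rows))
print(top_domains(rows))
```

Note the difference between the two: the first counts messages, while the second counts distinct people per domain, so one very active sender does not inflate a domain's rank.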


Hopefully this is helpful to people who want to do the same thing for their mailing lists, or who want to scrape another site with similarly structured navigation. I hadn't used Webscraper.io until I found it a few days ago, so I can definitely recommend it for its simplicity.

Originally published at https://www.techupover.com on March 15, 2022.


