Ten-Line Alternative to Web-Scraping

Typically, web scraping can be done with built-in tools such as the read_html() function from the pandas library. However, in some cases a web page is not directly scrapable because it requires the user to be logged in; read_html() cannot be applied to the internal web pages of an organization, for instance. Since I do not work with these sorts of problems regularly, I needed a simple way to obtain the data quickly.

This tutorial explains an alternative to traditional web scraping. I will scrape a web page from the SEC (Securities and Exchange Commission) that contains a table. Note that this particular page is public, but the code works for internal web pages as well. The table that I want to scrape looks like this:

Figure 1: The capture of webpage

The strategy consists of three steps:

  1. Copy the web page content to the clipboard in its raw format
  2. Extract the table rows from the clipboard text with regular expressions
  3. Organize the result in a pandas DataFrame

First, let us import the libraries needed for this little project. I will explain what each one is used for later in this tutorial.

import webbrowser as wb 
import pyautogui as ag 
import time 
import pyperclip 
import re 
import pandas as pd

In the first part, I mimic keyboard actions with the webbrowser and pyautogui libraries. I ask the PC to open the given URL (note that if you are already logged in to the organization's web page, you will not be prompted for a username or password when the code runs). Next, the script performs the keyboard shortcuts Ctrl+A, Ctrl+C and Ctrl+W to select all, copy and close the page, respectively. At this point the entire web page, in raw format, is saved in your clipboard.

Note that I also add a time.sleep(3) call to wait three seconds before copying the content, in case the page takes longer to load.

To sum up, the code is as follows:

url = 'https://www.sec.gov/divisions/enforce/friactions.htm'
wb.open(url)            # open the page in the default browser
time.sleep(3)           # give the page time to load
ag.hotkey('ctrl', 'a')  # select all
ag.hotkey('ctrl', 'c')  # copy
ag.hotkey('ctrl', 'w')  # close the tab

Next, I paste the copied content from the clipboard into a string variable using the pyperclip library. The content is saved in a raw string s.

s = pyperclip.paste() 

The picture below illustrates what it looks like if you print s.

Figure 2: The web page output in a raw string

It looks messy, but you may notice certain escape sequences in between the plain text, for instance '\n', '\r' and '\t'. These stand for a newline, a carriage return (line end) and a tab (which separates table cells), respectively. We can use this information to extract the table data that we are looking for.
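If you want to see those characters explicitly, printing repr() of a snippet makes them visible (the snippet below is a made-up example in the clipboard's shape, not the real SEC content):

```python
# A made-up snippet shaped like the raw clipboard text
s_snippet = 'Release No.\tDate\tName\r\n'
print(repr(s_snippet))  # the \t, \r and \n escapes are shown explicitly
```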

To do this, I use a regular expression (with the help of the re library) in which I specify the pattern I am looking for, namely a row with three columns, and search for it in the whole string. The pattern can be constructed as '(.*?)\t(.*?)\t(.*)\r'. It describes the structure of one row: "some string \t some string \t some string \r". Now I want to find all matches of this pattern in the string s, i.e., all the rows.

tab_pattern = '(.*?)\t(.*?)\t(.*)\r' 
tab_output = re.findall(tab_pattern, s)
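To sanity-check the pattern, here is a minimal sketch on a synthetic two-row string (the release numbers, dates and names are made up for illustration):

```python
import re

tab_pattern = '(.*?)\t(.*?)\t(.*)\r'
# Two made-up rows in the same shape as the raw clipboard text
sample = 'FR-100\tJan. 1, 2021\tExample Co.\r\nFR-101\tFeb. 2, 2021\tAnother Co.\r\n'
rows = re.findall(tab_pattern, sample)
print(rows)
# [('FR-100', 'Jan. 1, 2021', 'Example Co.'), ('FR-101', 'Feb. 2, 2021', 'Another Co.')]
```

Note that the lazy quantifiers (.*?) keep each group inside one cell, and since '.' does not match '\n' by default, a match can never span two rows.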

By printing the outcome, you get a clean and neat list of tuples.

Figure 3: List of tuples output

To finalize, I load this list of tuples into a pandas DataFrame. This keeps my work organized and lets me continue with my analytics tasks.

df = pd.DataFrame(tab_output, columns=['Release Nr', 'Date', 'Name'])
Figure 4: Dataframe output
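For reference, here is the same conversion on made-up rows (the values below are hypothetical, not real SEC releases):

```python
import pandas as pd

# Hypothetical tuples in the shape produced by re.findall
tab_output = [('34-90000', 'Jan. 4, 2021', 'Example Co.'),
              ('34-90001', 'Jan. 5, 2021', 'Another Co.')]
df = pd.DataFrame(tab_output, columns=['Release Nr', 'Date', 'Name'])
print(df.shape)  # (2, 3)
```

If you plan to filter or sort by date later, pd.to_datetime(df['Date']) converts the text dates into real timestamps.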

Additionally, this code can be extended to roll over several web pages and collect multiple years of data. I store the final results in a dictionary of DataFrames. To do so, you can refer to the script below:

years = [2016, 2017, 2018, 2019]

# create the URL dictionary
d_urls = {}
for year in years:
    d_urls[f'url_{year}'] = f'https://www.sec.gov/divisions/enforce/friactions/friactions{year}.shtml'
# the URLs for 2020 and 2021 have a different format --> add them manually
d_urls['url_2020'] = 'https://www.sec.gov/divisions/enforce/friactions/friactions2020.htm'
d_urls['url_2021'] = 'https://www.sec.gov/divisions/enforce/friactions.htm'

# visit each page and replace the URL with its raw clipboard content
for key, url in d_urls.items():
    wb.open(url)
    time.sleep(3)  # wait for the page to load
    ag.hotkey('ctrl', 'a')
    ag.hotkey('ctrl', 'c')
    ag.hotkey('ctrl', 'w')
    d_urls[key] = pyperclip.paste()

# parse each copied page into its own DataFrame
all_dfs = {}
for key, value in d_urls.items():
    rows = re.findall(tab_pattern, value)
    all_dfs[f'df_{key}'] = pd.DataFrame(rows, columns=['Release Nr', 'Date', 'Name'])
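
Since every per-year table has the same columns, the yearly DataFrames can also be stacked into a single one with pd.concat. A minimal sketch, using hypothetical frames standing in for the real all_dfs:

```python
import pandas as pd

# Hypothetical per-year frames (the real ones come from the scraping loop)
all_dfs = {
    'df_url_2020': pd.DataFrame([('34-1', 'Jan. 6, 2020', 'A Co.')],
                                columns=['Release Nr', 'Date', 'Name']),
    'df_url_2021': pd.DataFrame([('34-2', 'Jan. 7, 2021', 'B Co.')],
                                columns=['Release Nr', 'Date', 'Name']),
}
combined = pd.concat(all_dfs.values(), ignore_index=True)
print(len(combined))  # 2
```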

To conclude, I want to emphasize again that this approach to web scraping is suited to quick, one-off solutions. In my own experience, I used it when I needed data from an internal web page quickly and did not know how to get around the user login. It did the trick, and it introduced me to regular expressions.

I hope you found value in this article. You can follow me for more data analytics related content!
