Web Scraping with Python and Selenium | Daniel Ellis Research | March 2023



How to extract data from web pages with JavaScript.

A photo of the iceberg — a metaphor for all the data just below the surface of the web.
Photo by Annie Spratt on Unsplash

Web scraping is a very useful skill that keeps popping up again and again. We alluded to it in a previous post, but decided it was time to give it a post of its own.

Selenium is an open source tool that enables web browser automation. It supports several browsers, including Firefox, but I usually use the Chrome driver.

Installation is easy with pip:

pip install selenium webdriver_manager

Now you are ready to code. For simplicity you can import the following block:

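from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC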

With Selenium imported, the next step is to tell it to use the Chrome web driver (browser).

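options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without a visible window
options.page_load_strategy = 'none'  # get() returns without waiting for the page to finish loading

chrome_path = ChromeDriverManager().install()
chrome_service = Service(chrome_path)
driver = Chrome(options=options, service=chrome_service)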

I’m using Chrome in “headless” mode for efficiency, which means that no graphical user interface is loaded. Omit the headless option if you want to watch the browser window while the script drives it.

All that’s left now is to direct the browser to the URL.

url = "https://www.<your target website here>.com/test"
driver.get(url)

There are multiple ways to work with Selenium, but some of the more common ways are outlined below. For a comprehensive list, we recommend that you consult the documentation.

Select element by ID

The easiest and most common way to select an element is by its ID.

ele = driver.find_element(By.ID, "id")

Once an element is selected, you can explore it using Python. For example, to extract its text, use ele.text.
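Beyond the text, a selected element exposes a few other things worth inspecting. The sketch below gathers them into a dictionary. The helper name summarise_element is hypothetical, and the code assumes ele is a Selenium WebElement; it relies only on its text and tag_name properties and its get_attribute() method.

```python
def summarise_element(ele, attributes=("id", "class", "href")):
    """Collect the tag name, visible text and selected attributes of an element."""
    summary = {"tag": ele.tag_name, "text": ele.text}
    for name in attributes:
        # get_attribute returns None when the attribute is absent
        summary[name] = ele.get_attribute(name)
    return summary
```

Calling summarise_element(ele) right after a find_element call is a quick way to check that the selector matched the element you expected.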

Element by CSS selector

Not all elements have an ID, however. For more complex selections you can use a CSS selector.

ele = driver.find_element(By.CSS_SELECTOR, "button[class='super']")

Select multiple elements

Often we are interested in performing actions on multiple items. In that case, use find_elements instead:

multiple = driver.find_elements(By.CLASS_NAME, "cool_buttons")

As a working example, we can click all ‘cool_buttons’ using the following snippet:

for button in driver.find_elements(By.CLASS_NAME, "cool_buttons"):
    button.click()
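The same loop pattern is handy for reading data rather than acting on it. As a minimal sketch, the hypothetical helper below collects the visible text of every element with a given class. It assumes driver is a Selenium WebDriver and passes the locator strategy as the plain string "class name", which is the value of By.CLASS_NAME.

```python
def texts_by_class(driver, class_name):
    """Return the stripped, non-empty text of every element with the given class."""
    # "class name" is the string behind By.CLASS_NAME
    elements = driver.find_elements("class name", class_name)
    return [e.text.strip() for e in elements if e.text.strip()]
```

Note that find_elements returns an empty list when nothing matches, so this helper returns an empty list rather than raising an error.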

Often, when trying to retrieve information from a page, we need to run some custom JavaScript commands. This is done with the execute_script method:

driver.execute_script('some_built_in_function()')

Returning data

To get the result back into Python, treat the script as the body of a JavaScript function and add the return keyword.

driver.execute_script('return built_in_function()')

Dynamic customization

In many cases, you may want to iteratively change commands based on external information. Here we use the f-string format as part of the execution string inside the loop.

for q in range(10):
    driver.execute_script(f'return my_function({q}, true)')
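Plain f-strings work well for numbers, but strings and booleans need to be written as valid JavaScript literals. One possible approach, sketched below, is to encode each argument with json.dumps before building the command. The helper name js_call is hypothetical and not part of Selenium.

```python
import json


def js_call(function_name, *args):
    """Build a 'return f(a, b, ...)' string with each argument as a JS literal."""
    # json.dumps turns True into true, None into null and quotes/escapes strings
    encoded = ", ".join(json.dumps(arg) for arg in args)
    return f"return {function_name}({encoded})"


print(js_call("my_function", 3, True))  # return my_function(3, true)
```

The resulting string can be passed straight to driver.execute_script. Selenium also lets you pass extra values after the script and read them inside it as arguments[0], arguments[1] and so on, which avoids string building altogether.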

When working with pages, it is inevitable that some delays will occur. This becomes even more of an issue when the Python script tries to access an element that does not exist yet.

For this reason, Selenium includes the WebDriverWait class.

timeout = 10 # seconds until we throw an error
wait = WebDriverWait(driver, timeout)
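To make the idea concrete, here is a simplified illustration of what a polling wait does: call a condition repeatedly until it returns something truthy or the timeout runs out. This is a sketch of the concept, not Selenium's implementation. The function name wait_until is hypothetical, and it raises Python's built-in TimeoutError rather than Selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # hand the truthy value back to the caller
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # pause before checking again
```

Like this sketch, wait.until returns whatever truthy value the condition produced, which is why its result can be used directly as an element.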

Waiting for URL change

For example, if a JavaScript command modifies a web page, you can wait until the driver’s URL has been updated accordingly.

for q in range(10):
    driver.execute_script(f'go_to_element({q}, true)')
    # block until the value after the last '=' in the URL matches q
    wait.until(lambda driver: driver.current_url.split('=')[-1] == str(q))
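Splitting on '=' only works when the value of interest is the last thing in the URL. If the page adds further parameters, a slightly more robust check is to parse the query string. The sketch below uses the standard library for this; the helper name query_value and the parameter name "q" are assumptions made for illustration.

```python
from urllib.parse import urlparse, parse_qs


def query_value(url, name):
    """Return the first value of a query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get(name)
    return values[0] if values else None


print(query_value("https://example.com/page?q=3&lang=en", "q"))  # 3
```

It can then serve as the wait condition, for example wait.until(lambda d: query_value(d.current_url, "q") == str(q)).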

Waiting for display of element

Alternatively, if you know an element will eventually become visible, you can wait for that to happen before accessing it.

answer = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'span[id="answerBox"]'))).text

Here, once the span[id="answerBox"] element becomes visible, its text attribute is saved in the variable answer.

Error handling

Unfortunately, if the element does not become visible within the timeout, a TimeoutException is raised and stops program execution. As with all Python errors, the solution is to implement proper error handling.

from selenium.common.exceptions import TimeoutException

for i in [url1, url2, url3, ...]:  # placeholder list of link targets
    try:
        haslink = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, f'a[href="{i}"]')))
    except TimeoutException:
        print(f'failed on {i}')
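When scraping many items, it is often more useful to collect the failures than to print them, so they can be retried or reported at the end. A minimal sketch of that pattern follows. The helper name partition_by_timeout is hypothetical, and the exception type is passed in as a parameter so the same function works with Selenium's TimeoutException or any other error class.

```python
def partition_by_timeout(items, action, timeout_error):
    """Run action(item) for each item and split the items into (succeeded, failed)."""
    succeeded, failed = [], []
    for item in items:
        try:
            action(item)
        except timeout_error:
            failed.append(item)  # remember the item instead of stopping
        else:
            succeeded.append(item)
    return succeeded, failed
```

With Selenium, action would wrap the wait.until call for one link and timeout_error would be TimeoutException.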

We should free resources when we are done using them. To finish, call driver.close(), which closes the current browser window; driver.quit() ends the whole browser session.
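If the script fails halfway through, the cleanup call is never reached and the browser process keeps running. One way to guard against that, sketched here under the assumption that you want the whole session shut down, is a try/finally wrapper. The names run_with_driver, make_driver and task are hypothetical.

```python
def run_with_driver(make_driver, task):
    """Create a driver, run task(driver), and always shut the browser down."""
    driver = make_driver()
    try:
        return task(driver)
    finally:
        # runs whether task succeeded or raised
        driver.quit()
```

Here make_driver would be a function that builds the Chrome driver, and task would hold the scraping steps.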

We now have the bare minimum commands that a data scientist or researcher needs. Shall we play? Find a website, scrape the data, put it into a CSV and plot it. All using Python.
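As a starting point for the CSV step, the sketch below writes a list of scraped records with Python's built-in csv module. The field names label and value are made-up sample data, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_rows(rows, fieldnames, handle):
    """Write a list of dictionaries to an open text handle as CSV with a header."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample records standing in for scraped data
scraped = [{"label": "a", "value": 1}, {"label": "b", "value": 2}]
buffer = io.StringIO()
write_rows(scraped, ["label", "value"], buffer)
```

To write to disk instead, open the file with open("results.csv", "w", newline="") and pass that handle in place of the buffer.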


