Below is a simple web crawler I wrote in Python. The user enters a search term (‘pump’ in the example below), which the crawler searches for on dotmed.com (a website that sells medical equipment). The crawler extracts all the auction listing URLs from the search results page, visits each listing page, and extracts the listing details (such as the item type and model). These details are then written to an SQLite database.
In [1]:
test_mode = False
In [2]:
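import re
import sqlite3
import urllib.parse
import urllib.request
from datetime import datetime

query = input('Enter search term: ')
# Base search URL; the search term is appended as the value of the "key" parameter
url_prefix = 'https://www.dotmed.com/listings/search/equipment.html?key='
# URL-encode the term so that spaces and special characters survive in the query string
url_to_examine = url_prefix + urllib.parse.quote_plus(query)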

print("This search page will be visited: " + url_to_examine)
Enter search term: pump
This search page will be visited: https://www.dotmed.com/listings/search/equipment.html?key=pump
In [3]:
# Connect to database
connection = sqlite3.connect("dotmed2.db")
db_cursor = connection.cursor()
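The crawler assumes that a `listings` table already exists in dotmed2.db. A minimal sketch of creating it follows. Only the `listing_url` column name appears in this notebook (in the clean-up query near the end), so the other six column names and the choice of TEXT for every column are assumptions, picked to match the order of the seven values inserted for each listing. The sketch uses an in-memory database so that it does not touch the real file.

```python
import sqlite3

# Column names other than listing_url are hypothetical; their order mirrors
# the seven values the crawler inserts for each listing.
CREATE_LISTINGS_SQL = """
CREATE TABLE IF NOT EXISTS listings (
    listing_url TEXT,
    item_id TEXT,
    title TEXT,
    listing_location TEXT,
    model TEXT,
    item_type TEXT,
    scraped_at TEXT
)
"""

def create_listings_table(connection):
    """Create the listings table if it is not already present."""
    connection.execute(CREATE_LISTINGS_SQL)
    connection.commit()

# In-memory database, used here only for demonstration
demo_connection = sqlite3.connect(":memory:")
create_listings_table(demo_connection)
column_names = [row[1] for row in demo_connection.execute("PRAGMA table_info(listings)")]
print(column_names)
```

Because of `IF NOT EXISTS`, running the statement again on an existing database leaves the table and its rows untouched.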
In [4]:
# Visit page, get source
data = urllib.request.urlopen(url_to_examine).read()
search_page_source_code = data.decode('utf-8')

# Create listing URLs
listing_urls = re.findall(r'<a href="(/auction/[^\"]+)', search_page_source_code)
listing_urls = ['https://www.dotmed.com{0}'.format(i) for i in listing_urls]
listing_urls = list(dict.fromkeys(listing_urls))

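# Guard against an empty result so that listing_urls[0] cannot raise an IndexError
if listing_urls:
    print(str(len(listing_urls)) + " unique URLs found. The first URL is " + listing_urls[0])
else:
    print("No listing URLs found.")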
15 unique URLs found. The first URL is https://www.dotmed.com/auction/pump-iv-infusion/baxter/i-pump-pain-management-pump-system/202483
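The link extraction step can be tried without visiting the site. The sketch below wraps the same regular expression and the same order-preserving deduplication in a function and runs it on a made-up HTML fragment. The function name `extract_listing_urls`, its `base_url` parameter, and the sample markup are illustrative assumptions, not part of the crawler or of dotmed.com.

```python
import re

def extract_listing_urls(page_source, base_url="https://www.dotmed.com"):
    """Return absolute auction URLs in page order, without duplicates."""
    relative_urls = re.findall(r'<a href="(/auction/[^"]+)', page_source)
    absolute_urls = [base_url + path for path in relative_urls]
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(absolute_urls))

# Made-up fragment: the same listing is linked twice (image and text link)
sample_html = '''
<a href="/auction/type-a/maker/model-x/1001"><img src="x.jpg"></a>
<a href="/auction/type-a/maker/model-x/1001">Model X</a>
<a href="/auction/type-b/maker/model-y/1002">Model Y</a>
<a href="/listing/other/2000">Not an auction</a>
'''
urls = extract_listing_urls(sample_html)
print(urls)
```

Deduplicating with `dict.fromkeys` rather than `set` keeps the listings in the order in which they appear on the search page.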
In [5]:
# Visit URLs and extract listing details
for urls_visited, listing_to_visit in enumerate(listing_urls, start=1):

    print("Listing URL: " + listing_to_visit)
    data = urllib.request.urlopen(listing_to_visit).read()
    listing_page_source_code = data.decode('utf-8')

    # The item ID is the run of digits at the end of the listing URL
    item_id = re.findall(r'/([0-9]{4,})$', listing_to_visit)[0]
    print("Item ID is: " + item_id)

    title = re.findall(r'<title>([^<]+)', listing_page_source_code)[0]
    print("Title is: " + title)

    listing_location = re.findall(r'<li><span>Location:</span>&nbsp;([^<]+)', listing_page_source_code)[0]
    print("Item location is: " + listing_location)

    model = re.findall(r'<li><span>Model:</span>&nbsp;([^<]+)', listing_page_source_code)[0]
    print("The model of the item is: " + model)

    item_type = re.findall(r'<li><span>Type:</span>&nbsp;([^<]+)', listing_page_source_code)[0]
    print("The item type is: " + item_type)

    # Timestamp
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print("Scraping complete at: " + now)

    # Add extracted text (listing url, item id, title, listing location, model, item type) and timestamp into database.
    # The ? placeholders let sqlite3 quote the values, so text containing an apostrophe cannot break the statement.
    row = (listing_to_visit, item_id, title, listing_location, model, item_type, now)
    print("Running this SQL query: INSERT INTO listings VALUES " + str(row))
    db_cursor.execute("INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?, ?)", row)
    connection.commit()

    print("\n")

    # In test mode, stop after five listings
    if test_mode and urls_visited == 5:
        break
Listing URL: https://www.dotmed.com/auction/pump-iv-infusion/baxter/i-pump-pain-management-pump-system/202483
Item ID is: 202483
Title is:  I-Pump Pain Management Pump / System Pump IV Infusion  Auction
Item location is: NC, USA
The model of the item is: I-Pump Pain Management Pump / System
The item type is: Pump IV Infusion
Scraping complete at: 2023-02-13 19:20:53
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-iv-infusion/baxter/i-pump-pain-management-pump-system/202483', '202483', ' I-Pump Pain Management Pump / System Pump IV Infusion  Auction', 'NC, USA', 'I-Pump Pain Management Pump / System', 'Pump IV Infusion', '2023-02-13 19:20:53')


Listing URL: https://www.dotmed.com/auction/pump-iv-infusion/abbott-hospira/plum-a/202477
Item ID is: 202477
Title is:  Plum A+ Pump IV Infusion  Auction
Item location is: NC, USA
The model of the item is: Plum A+
The item type is: Pump IV Infusion
Scraping complete at: 2023-02-13 19:20:54
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-iv-infusion/abbott-hospira/plum-a/202477', '202477', ' Plum A+ Pump IV Infusion  Auction', 'NC, USA', 'Plum A+', 'Pump IV Infusion', '2023-02-13 19:20:54')


Listing URL: https://www.dotmed.com/auction/pump-iv-infusion/abbott-hospira/plum-a/202478
Item ID is: 202478
Title is:  Plum A+ Pump IV Infusion  Auction
Item location is: NC, USA
The model of the item is: Plum A+
The item type is: Pump IV Infusion
Scraping complete at: 2023-02-13 19:20:56
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-iv-infusion/abbott-hospira/plum-a/202478', '202478', ' Plum A+ Pump IV Infusion  Auction', 'NC, USA', 'Plum A+', 'Pump IV Infusion', '2023-02-13 19:20:56')


Listing URL: https://www.dotmed.com/auction/pump-iv-infusion/b-braun/outlook-400es-safety-infusion-system/202479
Item ID is: 202479
Title is:  Outlook 400ES  Safety Infusion System Pump IV Infusion  Auction
Item location is: NC, USA
The model of the item is: Outlook 400ES  Safety Infusion System
The item type is: Pump IV Infusion
Scraping complete at: 2023-02-13 19:20:58
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-iv-infusion/b-braun/outlook-400es-safety-infusion-system/202479', '202479', ' Outlook 400ES  Safety Infusion System Pump IV Infusion  Auction', 'NC, USA', 'Outlook 400ES  Safety Infusion System', 'Pump IV Infusion', '2023-02-13 19:20:58')


Listing URL: https://www.dotmed.com/auction/pump-iv-infusion/b-braun/outlook-400es-safety-infusion-system/202480
Item ID is: 202480
Title is:  Outlook 400ES  Safety Infusion System Pump IV Infusion  Auction
Item location is: NC, USA
The model of the item is: Outlook 400ES  Safety Infusion System
The item type is: Pump IV Infusion
Scraping complete at: 2023-02-13 19:20:59
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-iv-infusion/b-braun/outlook-400es-safety-infusion-system/202480', '202480', ' Outlook 400ES  Safety Infusion System Pump IV Infusion  Auction', 'NC, USA', 'Outlook 400ES  Safety Infusion System', 'Pump IV Infusion', '2023-02-13 19:20:59')


Listing URL: https://www.dotmed.com/auction/pump-iv-infusion/b-braun/outlook-400es-safety-infusion-system/202481
Item ID is: 202481
Title is:  Outlook 400ES  Safety Infusion System Pump IV Infusion  Auction
Item location is: NC, USA
The model of the item is: Outlook 400ES  Safety Infusion System
The item type is: Pump IV Infusion
Scraping complete at: 2023-02-13 19:21:01
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-iv-infusion/b-braun/outlook-400es-safety-infusion-system/202481', '202481', ' Outlook 400ES  Safety Infusion System Pump IV Infusion  Auction', 'NC, USA', 'Outlook 400ES  Safety Infusion System', 'Pump IV Infusion', '2023-02-13 19:21:01')


Listing URL: https://www.dotmed.com/auction/pump-iv-infusion/b-braun/outlook-400es-safety-infusion-system/202482
Item ID is: 202482
Title is:  Outlook 400ES  Safety Infusion System Pump IV Infusion  Auction
Item location is: NC, USA
The model of the item is: Outlook 400ES  Safety Infusion System
The item type is: Pump IV Infusion
Scraping complete at: 2023-02-13 19:21:02
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-iv-infusion/b-braun/outlook-400es-safety-infusion-system/202482', '202482', ' Outlook 400ES  Safety Infusion System Pump IV Infusion  Auction', 'NC, USA', 'Outlook 400ES  Safety Infusion System', 'Pump IV Infusion', '2023-02-13 19:21:02')


Listing URL: https://www.dotmed.com/auction/pump-vascular-compression/kendall/scd-express-with-vascular-refill-detection/202617
Item ID is: 202617
Title is:  SCD Express With Vascular Refill Detection Pump Vascular Compression  Auction
Item location is: NC, USA
The model of the item is: SCD Express With Vascular Refill Detection
The item type is: Pump Vascular Compression
Scraping complete at: 2023-02-13 19:21:04
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-vascular-compression/kendall/scd-express-with-vascular-refill-detection/202617', '202617', ' SCD Express With Vascular Refill Detection Pump Vascular Compression  Auction', 'NC, USA', 'SCD Express With Vascular Refill Detection', 'Pump Vascular Compression', '2023-02-13 19:21:04')


Listing URL: https://www.dotmed.com/auction/pump-vascular-compression/covidien-kendall/scd-express-sequential/202618
Item ID is: 202618
Title is:  SCD Express  Sequential Pump Vascular Compression  Auction
Item location is: NC, USA
The model of the item is: SCD Express  Sequential
The item type is: Pump Vascular Compression
Scraping complete at: 2023-02-13 19:21:05
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-vascular-compression/covidien-kendall/scd-express-sequential/202618', '202618', ' SCD Express  Sequential Pump Vascular Compression  Auction', 'NC, USA', 'SCD Express  Sequential', 'Pump Vascular Compression', '2023-02-13 19:21:05')


Listing URL: https://www.dotmed.com/auction/pump-vascular-compression/kendall/6060-compression-system-sequential-compression-device-scd/202619
Item ID is: 202619
Title is:  6060 Compression System (Sequential Compression Device, SCD) Pump Vascular Compression  Auction
Item location is: NC, USA
The model of the item is: 6060 Compression System (Sequential Compression Device, SCD)
The item type is: Pump Vascular Compression
Scraping complete at: 2023-02-13 19:21:07
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-vascular-compression/kendall/6060-compression-system-sequential-compression-device-scd/202619', '202619', ' 6060 Compression System (Sequential Compression Device, SCD) Pump Vascular Compression  Auction', 'NC, USA', '6060 Compression System (Sequential Compression Device, SCD)', 'Pump Vascular Compression', '2023-02-13 19:21:07')


Listing URL: https://www.dotmed.com/auction/pump-vascular-compression/kendall/6060-compression-system-sequential-compression-device-scd/202620
Item ID is: 202620
Title is:  6060 Compression System (Sequential Compression Device, SCD) Pump Vascular Compression  Auction
Item location is: NC, USA
The model of the item is: 6060 Compression System (Sequential Compression Device, SCD)
The item type is: Pump Vascular Compression
Scraping complete at: 2023-02-13 19:21:09
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-vascular-compression/kendall/6060-compression-system-sequential-compression-device-scd/202620', '202620', ' 6060 Compression System (Sequential Compression Device, SCD) Pump Vascular Compression  Auction', 'NC, USA', '6060 Compression System (Sequential Compression Device, SCD)', 'Pump Vascular Compression', '2023-02-13 19:21:09')


Listing URL: https://www.dotmed.com/auction/dvt-pump/ctc-compression-therapy-concepts/vasopress-supreme-mini/202646
Item ID is: 202646
Title is:  Vasopress Supreme Mini DVT Pump  Auction
Item location is: NC, USA
The model of the item is: Vasopress Supreme Mini
The item type is: DVT Pump
Scraping complete at: 2023-02-13 19:21:10
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/dvt-pump/ctc-compression-therapy-concepts/vasopress-supreme-mini/202646', '202646', ' Vasopress Supreme Mini DVT Pump  Auction', 'NC, USA', 'Vasopress Supreme Mini', 'DVT Pump', '2023-02-13 19:21:10')


Listing URL: https://www.dotmed.com/auction/dvt-pump/ctc-compression-therapy-concepts/vasopress-supreme-mini/202647
Item ID is: 202647
Title is:  Vasopress Supreme Mini DVT Pump  Auction
Item location is: NC, USA
The model of the item is: Vasopress Supreme Mini
The item type is: DVT Pump
Scraping complete at: 2023-02-13 19:21:12
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/dvt-pump/ctc-compression-therapy-concepts/vasopress-supreme-mini/202647', '202647', ' Vasopress Supreme Mini DVT Pump  Auction', 'NC, USA', 'Vasopress Supreme Mini', 'DVT Pump', '2023-02-13 19:21:12')


Listing URL: https://www.dotmed.com/auction/module/carefusion/8100-alrais-pump/202823
Item ID is: 202823
Title is:  8100 Alrais Pump Module  Auction
Item location is: UT, USA
The model of the item is: 8100 Alrais Pump
The item type is: Module
Scraping complete at: 2023-02-13 19:21:14
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/module/carefusion/8100-alrais-pump/202823', '202823', ' 8100 Alrais Pump Module  Auction', 'UT, USA', '8100 Alrais Pump', 'Module', '2023-02-13 19:21:14')


Listing URL: https://www.dotmed.com/auction/pump-vascular-compression/kendall/scd-express/-lot-of-10/202259
Item ID is: 202259
Title is:  SCD EXPRESS Pump Vascular Compression  Auction
Item location is: CA, USA
The model of the item is: SCD EXPRESS
The item type is: Pump Vascular Compression
Scraping complete at: 2023-02-13 19:21:15
Running this SQL query: INSERT INTO listings VALUES ('https://www.dotmed.com/auction/pump-vascular-compression/kendall/scd-express/-lot-of-10/202259', '202259', ' SCD EXPRESS Pump Vascular Compression  Auction', 'CA, USA', 'SCD EXPRESS', 'Pump Vascular Compression', '2023-02-13 19:21:15')
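Every field in the loop is read with `re.findall(...)[0]`, which raises an IndexError when a listing page lacks that field and so stops the whole crawl. One possible way to tolerate missing fields is sketched below. The helper names `first_match` and `extract_field` and the sample markup are hypothetical; note also that, unlike the loop, this sketch strips leading and trailing whitespace from each value.

```python
import re

def first_match(pattern, text, default=""):
    """Return the first captured group for pattern in text, or default if absent."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else default

def extract_field(label, page_source):
    """Extract the value that follows '<li><span>LABEL:</span>&nbsp;'."""
    pattern = r'<li><span>' + re.escape(label) + r':</span>&nbsp;([^<]+)'
    return first_match(pattern, page_source)

# Made-up fragment that follows the markup pattern the crawler searches for;
# it deliberately has no "Model" entry.
sample_page = (
    '<title> Plum A+ Pump IV Infusion  Auction</title>'
    '<li><span>Location:</span>&nbsp;NC, USA</li>'
    '<li><span>Type:</span>&nbsp;Pump IV Infusion</li>'
)
title = first_match(r'<title>([^<]+)', sample_page)
location = extract_field("Location", sample_page)
model = extract_field("Model", sample_page)  # missing, so the default "" is returned
print(repr(title), repr(location), repr(model))
```

Returning an empty string keeps the row insertable; whether a missing field should instead be stored as NULL or cause the listing to be skipped is a design choice this sketch does not settle.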


In [6]:
# Remove duplicates from the database (if any exist), keeping the earliest row for each listing URL
remove_duplicates_sql = "DELETE FROM listings WHERE EXISTS (SELECT 1 FROM listings p2 WHERE listings.listing_url = p2.listing_url AND listings.rowid > p2.rowid);"
db_cursor.execute(remove_duplicates_sql)
connection.commit()
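The effect of the clean-up query can be checked on a throwaway in-memory database. In the sketch below the two-column table and the example.com URLs are made up for illustration; only the DELETE statement is taken from the cell above.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
# Simplified, hypothetical schema: just enough columns to show the behaviour
demo.execute("CREATE TABLE listings (listing_url TEXT, title TEXT)")
demo.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [
        ("https://example.com/a", "first copy"),
        ("https://example.com/b", "only copy"),
        ("https://example.com/a", "second copy"),
    ],
)

# Delete every row for which an earlier row (smaller rowid) has the same URL
demo.execute(
    "DELETE FROM listings WHERE EXISTS ("
    "SELECT 1 FROM listings p2 "
    "WHERE listings.listing_url = p2.listing_url AND listings.rowid > p2.rowid)"
)
demo.commit()

remaining = demo.execute(
    "SELECT listing_url, title FROM listings ORDER BY rowid"
).fetchall()
print(remaining)
```

The row that survives for each URL is the one inserted first, so re-running the crawler keeps the original timestamp rather than the latest one.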
In [7]:
# Close database
connection.close()