Webscraping obviously involves lots of potential tools, methods and solutions, but I've tried to come up with an overview of the various options to consider and go through whenever you're trying to extract data/files/things online. I've mostly learned webscraping through Python (and R), so that's what I'd generally gravitate towards, but in order to optimize the process and not waste time or miss useful time-saving tricks/options, I've tried expanding my toolkit.
- First make sure that a) you can't find the data in a downloadable format/file online (via search engine searches (incl. advanced tricks), public databases, download or export buttons on webpages, etc.), and b) you can't simply copy-paste it in a way that works (and doesn't lead to a messy, unusable dataset, as it often does).
- For some data/media (e.g. images, videos, links, embedded content) you can use the web inspector to find the source (file URL, etc.) and download it directly.
- Before moving on to more sophisticated scraping tools (i.e. APIs, coded scrapers and the advanced no-code scraping/crawling apps/services/features), you should always consider if some of the simplest/fastest no-code/low-code options can do the work:
- Simple (free) extensions for extracting tables/table data (e.g. Table Capture in Chrome or Firefox), or to download files/lists/etc (e.g. DownThemAll in Firefox).
- Using Google Sheets’ Import functions for HTML tables and lists (ImportHTML), XML/XPath elements (ImportXML), .csv/.tsv-formatted download links (ImportData) or RSS feeds (ImportFeed) – see the example formulas just after this list.
- There are some great free automatic/extension-based scrapers like Crawly and Data Miner, and the free plans/versions of WebScraper/Octoparse/Simplescraper/Bardeen/ParseHub, that can quickly get the data you want (some of them even from dynamic and/or multiple webpages) and put it in a machine-readable format.
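To illustrate the Google Sheets route mentioned above, here's roughly what the four formulas look like. The URLs, table index and XPath query below are placeholders you'd swap for your own: ImportHTML pulls a table or list off a page, ImportXML grabs whatever matches an XPath query, ImportData reads a .csv/.tsv link, and ImportFeed reads an RSS feed.

```
=IMPORTHTML("https://example.com/page", "table", 1)
=IMPORTXML("https://example.com/page", "//h2/a/@href")
=IMPORTDATA("https://example.com/data.csv")
=IMPORTFEED("https://example.com/rss.xml")
```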
- If these previous options didn't work, you have to decide what tool (or combination of tools) you're gonna use at this point. This depends on various things, including whether you've got access to an advanced paid scraping tool (Octoparse, ParseHub, Import.io, Bright Data’s Web Scraper IDE, ScrapingBee, Dexi, etc.), what you want to do next with your data – e.g. cleaning and/or data analysis/data viz/spatial analysis/text analysis – since that affects the data/file format and the type of tool you'll be using afterward (e.g. Excel, Tableau, Datawrapper, Rawgraphs, Python, R, Flourish, etc.), and obviously your own skills and preferences.
- Here I’m only gonna consider free options (either open-source/non-commercial ones, or the free plans/trials of good apps/services like Octoparse), and it makes sense to first check whether APIs a) exist and are accessible (some websites put a paywall on their APIs; I don't think anyone should pay for those when other webscraping options often give more flexibility and cost nothing), and b) actually make it possible for you to get what you want (websites/companies often limit what data you can get, or even see, through their APIs).
- Good advanced scraping services/apps – either free/open-source or with some free features/plans (with limitations, obviously; this is capitalism, remember?) – include Morph (ScraperWiki has been decommissioned), OutWit Hub, ParseHub, Octoparse, ScrapingBee, Bardeen, Simplescraper (Chrome), Apify and Common Crawl (a repository of crawled datasets). They usually let you select (often through “point and click”) the specific things you want to extract from webpages, do the scraping for you – as extractors (single page) and/or crawlers (multiple pages) – and give you a way to download the data/output in suitable formats (csv, json, xlsx, etc.).
- But in a sense the ultimate free webscraping tools are programming languages, which give you full ability/autonomy to control, test and complete the webscraping process from start to finish, sometimes very quickly even for advanced scrapers: Python, R, JavaScript and Ruby are all very good options!
- As always it starts with the web/element inspector, where you should examine the source code’s HTML structure, elements and tags, and look for/locate the specific things you want to extract. You’ll need those in order to fetch and parse the data!
- You should also use it (specifically, the Network tab) to:
- Examine how files are being passed between the server and the browser.
- Look at the headers of the files – what requests are being made, what the parameters of those requests are and how the server responds.
- Check whether there are any useful “Hidden APIs” (see the sketch just after this list).
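On that last point: if the Network tab shows the page filling itself from a JSON endpoint, you can often call that endpoint directly and skip HTML parsing entirely. Here's a minimal sketch in Python with Requests – the URL, parameters and headers are made up, you'd copy the real ones from the inspector.

```python
import requests

# Hypothetical JSON endpoint spotted in the Network tab (XHR/Fetch requests);
# the URL, parameters and headers below are placeholders -- copy your own from the inspector.
url = "https://www.example.com/api/search"
params = {"query": "stadiums", "page": 1}
headers = {"User-Agent": "Mozilla/5.0"}  # some sites refuse requests without a browser-like user agent

response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of silently parsing an error page
data = response.json()       # already structured -- no HTML parsing needed

print(data)
```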
- The actual programming usually starts with fetching/loading the content (source code) of a webpage (i.e. a specific URL from the server) via Requests in Python or Mechanize/net-http/HTTParty/OpenURI in Ruby; then parsing it – i.e. extracting specific data fields – via BeautifulSoup in Python or Nokogiri in Ruby, or doing both via rvest in R.
- Extract and (if necessary) reformat those elements/data fields and save the output (or just use it immediately for analysis/data viz) – see the sketch just below.
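In Python, that basic fetch → parse → save loop can look something like this minimal sketch. The URL and selectors are placeholders; you'd take the real ones from the element inspector.

```python
import csv
import requests
from bs4 import BeautifulSoup

# Placeholder URL and selectors -- swap in the ones you found with the element inspector.
url = "https://www.example.com/some-table-page"

response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Extract one row of data per <tr>, keeping the text of each cell.
rows = []
for tr in soup.select("table tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# Save the output as a .csv you can open in Excel or load into pandas/R afterwards.
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```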
- But needless to say there are lots of methods, considerations and things that go into coding scrapers, beyond this basic general checklist. You can sometimes simply use for loops in Python, you can use pre-made libraries that scrape specific sites for you (e.g. for football analytics, see Jason Zivkovic’s worldfootballR in R and Ben Griffis’ Griffis-Soccer-Analysis in Python), you can use Selenium for automated tests/dynamic webpages/bypassing logins/etc. with most (popular) browsers and programming languages (e.g. Python, R, Ruby), and so on and so forth.
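And as a last hedged sketch, here's roughly what the Selenium route looks like in Python for a page that only renders its content via JavaScript. The URL and CSS selector are placeholders, and this assumes a recent Selenium 4 install (which manages the browser driver for you).

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Placeholder URL/selector -- assumes Selenium 4+, which downloads a matching Chrome driver itself.
driver = webdriver.Chrome()
try:
    driver.get("https://www.example.com/dynamic-page")
    driver.implicitly_wait(10)  # give the JavaScript-rendered content time to appear

    # Grab the text of every element matching a CSS selector found in the inspector.
    items = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "div.result-card h3")]
    print(items)
finally:
    driver.quit()  # always close the browser, even if something above fails
```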