1. Introduction
In this tutorial, we’ll develop a sample web crawler using the Breadth-First Search (BFS) algorithm. A web crawler, or spider, is a type of bot typically operated by search engines such as Google and Bing. Its purpose is to index the content of websites across the Internet so that those websites can appear in search engine results.
The whole World Wide Web (WWW) can be represented as a directed graph. In this directed graph,
- vertices (nodes) represent the domains / URLs / web pages
- edges represent the links between web pages; you can find these links by inspecting the HTML of the pages (a sketch of this representation follows the list)
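To make this concrete, here is a minimal sketch of how a small piece of the web graph could be modeled in Java as an adjacency list. The URLs are hypothetical placeholders, not pages from the tutorial:

import java.util.List;
import java.util.Map;

public class WebGraphSketch {
    public static void main(String[] args) {
        // Each key is a page (vertex); each value lists the pages it links to (outgoing edges).
        // All URLs below are placeholders for illustration only.
        Map<String, List<String>> webGraph = Map.of(
            "https://example.com",       List.of("https://example.com/about", "https://other.example.org"),
            "https://example.com/about", List.of("https://example.com"),
            "https://other.example.org", List.of()
        );
        webGraph.forEach((page, links) -> System.out.println(page + " -> " + links));
    }
}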
Web crawlers collect information about this graph. For example, Google has to index all the relevant web pages and find a way to rank them by relevance in search results.
To build a web crawler, we have to scan a given website’s HTML code and look for URLs, which are the links to other websites. Using breadth-first search, we can develop a web crawler that hops from one URL to another and reads the pages; the core traversal pattern is sketched below.
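The following sketch shows the generic BFS traversal pattern the crawler relies on, run over an in-memory graph with hypothetical URLs. The real crawler follows exactly this pattern, except that a URL’s neighbors are discovered at runtime by downloading and matching its HTML:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    // Visits every vertex reachable from 'root' in breadth-first order.
    static void bfs(Map<String, List<String>> graph, String root) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        queue.add(root);
        visited.add(root);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            System.out.println("Visiting: " + current);
            for (String neighbor : graph.getOrDefault(current, List.of())) {
                if (visited.add(neighbor)) { // add() returns false if already visited
                    queue.add(neighbor);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical two-page graph with a cycle; the visited set prevents an infinite loop.
        Map<String, List<String>> graph = Map.of(
            "https://example.com",       List.of("https://example.com/about"),
            "https://example.com/about", List.of("https://example.com")
        );
        bfs(graph, "https://example.com");
    }
}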
2. Web crawler program
We’ll create two files, App.java and BFS.java. The first is the file with the main method to test the application; BFS.java contains the code for the web crawler.
App.java
public class App {
    public static void main(String[] args) {
        BFS bfs = new BFS();
        bfs.discoverWeb("https://www.learnitweb.com");
    }
}
BFS.java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BFS {

    private Queue<String> queue;
    private List<String> discoveredWebSiteList;

    public BFS() {
        this.queue = new LinkedList<>();
        this.discoveredWebSiteList = new ArrayList<>();
    }

    // root is the starting URL, e.g. https://www.learnitweb.com
    public void discoverWeb(String root) {
        // add root to the queue and to the list of discovered websites
        this.queue.add(root);
        this.discoveredWebSiteList.add(root);

        // matches URLs such as https://www.example.com in the raw HTML
        Pattern pattern = Pattern.compile("https://(\\w+\\.)*(\\w+)");

        while (!queue.isEmpty()) {
            String v = this.queue.remove();
            String rawHtml = readUrl(v);

            // look for URLs in the downloaded HTML and enqueue any new ones
            Matcher matcher = pattern.matcher(rawHtml);
            while (matcher.find()) {
                String w = matcher.group();
                if (!discoveredWebSiteList.contains(w)) {
                    discoveredWebSiteList.add(w);
                    System.out.println("Website found: " + w);
                    queue.add(w);
                }
            }
        }
    }

    // downloads the page at the given URL and returns its HTML as a single string
    private String readUrl(String v) {
        StringBuilder rawHtml = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(v).openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rawHtml.append(line);
            }
        } catch (Exception e) {
            System.out.println("Problem while traversing website: " + v);
        }
        return rawHtml.toString();
    }
}
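One design note on the code above: discoveredWebSiteList.contains(w) performs a linear scan, so the check gets slower as more sites are discovered. A common improvement, not part of the original code, is to track discovered URLs in a HashSet, where lookup takes constant time on average. Here is a minimal, self-contained sketch of that idiom:

import java.util.HashSet;
import java.util.Set;

public class VisitedSetDemo {
    public static void main(String[] args) {
        Set<String> discovered = new HashSet<>();
        // Hypothetical URLs; note the duplicate, which is silently skipped.
        String[] found = { "https://a.example", "https://b.example", "https://a.example" };
        for (String url : found) {
            // add() returns false if the URL is already in the set,
            // so the membership test and the insertion happen in a single O(1) call.
            if (discovered.add(url)) {
                System.out.println("Website found: " + url);
            }
        }
    }
}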
When you run the program, you’ll get output like the following (the exact list depends on the pages crawled):
Website found: https://yoast.com
Website found: https://learnitweb.com
Website found: https://schema.org
Website found: https://www.googletagmanager.com
Website found: https://api.w.org
Website found: https://fonts.googleapis.com
Website found: https://fonts.gstatic.com
Website found: https://cdn
3. Conclusion
In this tutorial, we explored how to implement a web crawler using the Breadth-First Search (BFS) algorithm. We started with the fundamental concepts of web crawling and the BFS approach, then built a basic crawler that can explore web pages and record the sites it discovers.