Learnitweb

Web crawler – Breadth-First Search

1. Introduction

In this tutorial, we’ll develop a sample web crawler using the Breadth-First Search (BFS) algorithm. A web crawler, or spider, is a type of bot that is typically operated by search engines like Google and Bing. Its purpose is to index the content of websites across the Internet so that those websites can appear in search engine results.

The whole World Wide Web (WWW) can be represented by a directed graph. In this directed graph,

  • vertices (nodes) represent the domains / URLs / websites
  • edges represent the connections between the websites via links; you can see these links by looking at the HTML of the web pages
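To make this model concrete, here is a minimal sketch of how a few pages and the links between them could be represented as a directed graph using an adjacency list. The class name WebGraphExample and the particular edges are made up for illustration; only the idea of vertices connected by directed edges matters here.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WebGraphExample {
    public static void main(String[] args) {
        // each key is a page (vertex); its list holds the pages it links to (outgoing edges)
        // the edges below are made up for illustration
        Map<String, List<String>> webGraph = new HashMap<>();
        webGraph.put("https://www.learnitweb.com", Arrays.asList("https://schema.org", "https://fonts.googleapis.com"));
        webGraph.put("https://schema.org", Arrays.asList("https://www.learnitweb.com"));
        webGraph.put("https://fonts.googleapis.com", Arrays.asList("https://fonts.gstatic.com"));

        // print every directed edge of the graph
        for (Map.Entry<String, List<String>> entry : webGraph.entrySet()) {
            for (String target : entry.getValue()) {
                System.out.println(entry.getKey() + " -> " + target);
            }
        }
    }
}

Each key is a vertex and each element of its list is a directed edge. The crawler below explores this same structure implicitly by following the links it finds in downloaded HTML.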

Web crawlers are used to gather important information about these vertices and edges. For example, Google has to index all the relevant web pages and find a way to sort them by relevance in the search results.

To make a web crawler, we have to crawl a given website’s HTML code and look for the URLs that link to other websites. Using breadth-first search, we can develop a web crawler that hops from one URL to another and reads the pages, as shown in the sketch below.
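Before wiring this into a full breadth-first search, here is a small, self-contained sketch of the link-extraction step on its own: a regular expression is applied to a fragment of HTML to pull out the https:// links. The class name LinkExtractionExample and the HTML snippet are made up for illustration; the same Pattern/Matcher approach is used in the crawler below.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractionExample {
    public static void main(String[] args) {
        // a made-up fragment of HTML containing two links
        String html = "<a href=\"https://www.learnitweb.com\">Home</a>"
                + "<link rel=\"stylesheet\" href=\"https://fonts.googleapis.com/css\">";

        // matches https:// followed by dot-separated word characters (the same pattern the crawler uses)
        Pattern pattern = Pattern.compile("https://(\\w+\\.)*(\\w+)");
        Matcher matcher = pattern.matcher(html);

        while (matcher.find()) {
            System.out.println("Link found: " + matcher.group());
        }
    }
}

Note that this simple expression only captures the scheme and the host part of each link (it stops at characters such as '/' or '-'), which is enough to hop between websites but not to crawl individual pages of a site.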

2. Web crawler program

We’ll create two files, App.java and BFS.java. The first one contains the main method to test the application. BFS.java contains the code for the web crawler.

App.java

public class App {
    public static void main(String[] args) {
        BFS bfs = new BFS();
        // start crawling from the root URL
        bfs.discoverWeb("https://www.learnitweb.com");
    }
}

BFS.java

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BFS {

    // matches https:// followed by dot-separated word characters, e.g. https://www.learnitweb.com
    private static final Pattern URL_PATTERN = Pattern.compile("https://(\\w+\\.)*(\\w+)");

    private Queue<String> queue;
    private List<String> discoveredWebSiteList;

    public BFS() {
        this.queue = new LinkedList<>();
        this.discoveredWebSiteList = new ArrayList<>();
    }

    // root is the starting URL, e.g. https://www.learnitweb.com
    public void discoverWeb(String root) {
        // add root to the queue and mark it as discovered
        this.queue.add(root);
        this.discoveredWebSiteList.add(root);

        while (!queue.isEmpty()) {
            // take the next URL from the queue and download its HTML
            String v = this.queue.remove();
            String rawHtml = readUrl(v);

            // look for URLs in the downloaded HTML
            Matcher matcher = URL_PATTERN.matcher(rawHtml);
            while (matcher.find()) {
                String w = matcher.group();

                // enqueue every URL we have not seen before
                if (!discoveredWebSiteList.contains(w)) {
                    discoveredWebSiteList.add(w);
                    System.out.println("Website found: " + w);
                    queue.add(w);
                }
            }
        }
    }

    // downloads the raw HTML of the given URL and returns it as a single string
    private String readUrl(String v) {
        StringBuilder rawHtml = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(v).openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rawHtml.append(line);
            }
        } catch (Exception e) {
            System.out.println("Problem while reading website: " + v);
        }
        return rawHtml.toString();
    }
}

When you run the program, you’ll get output similar to the following. The exact list depends on the content of the crawled pages, and because the regular expression only matches word characters and dots, an entry such as https://cdn is likely a truncated match of a longer hostname:

Website found: https://yoast.com
Website found: https://learnitweb.com
Website found: https://schema.org
Website found: https://www.googletagmanager.com
Website found: https://api.w.org
Website found: https://fonts.googleapis.com
Website found: https://fonts.gstatic.com
Website found: https://cdn

3. Conclusion

In this tutorial, we explored how to implement a web crawler using the Breadth-First Search (BFS) algorithm. We started with the fundamental concepts of web crawling and the BFS approach, then built a basic crawler that downloads web pages and discovers their links in breadth-first order.