
Distributed web crawler

Requirements and User stories

Functional

Non functional

Terms

Calculations

Data storage

Traffic

We will mainly focus on the crawler in this post, so the workload is write-heavy: the system downloads and stores far more pages than it serves back.
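For a rough sense of scale (all numbers here are hypothetical, purely to illustrate the shape of the workload): crawling 1 billion pages per month works out to about 1,000,000,000 / (30 × 24 × 3,600) ≈ 386 page downloads per second; at an average of ~500 KB per page, that is roughly 190 MB/s of sustained writes into storage, while reads are comparatively rare.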

Data model

// CrawlItem is the entity instantiated for the Scheduler's front queue.
type CrawlItem struct {
	URL      string
	Priority int
	Status   string // Running, Idle, Invalid
}

type Page struct {
	ID          string
	Domain      string
	URL         string
	Content     []byte // The page content in bytes
	LastVisited time.Time
	SimHash     string // TBD: do we store it, or compute it at runtime?
	// ... other fields elided
}
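On the SimHash question: one reasonable answer is to compute it once when the page is written and store it, since recomputing over full page content on every near-duplicate check is wasteful. A minimal sketch of a 64-bit SimHash in Go (the whitespace tokenization and the use of hash/fnv are illustrative assumptions, not part of the design above):

import (
	"hash/fnv"
	"strings"
)

// SimHash64 computes a 64-bit SimHash over whitespace-separated tokens.
// Pages whose hashes differ in only a few bits are near-duplicates.
func SimHash64(content string) uint64 {
	var counts [64]int
	for _, token := range strings.Fields(content) {
		h := fnv.New64a()
		h.Write([]byte(token))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				counts[i]++ // this token votes the bit up
			} else {
				counts[i]-- // or votes it down
			}
		}
	}
	var out uint64
	for i := 0; i < 64; i++ {
		if counts[i] > 0 {
			out |= 1 << uint(i)
		}
	}
	return out
}

Since Page keeps SimHash as a string, the resulting uint64 could be hex-encoded before storage; two pages are then near-duplicates when their hashes are within a small Hamming distance.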

APIs

// Downloader
func Download(url string) (string, error)
// Extractor
func Extract(html string) []string
// Storage
func Add(page Page) error
func Get(url string) (Page, error)
// Scheduler
func Enqueue(url string, priority int) error
func Dequeue() (string, error)
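To make the Downloader and Extractor contracts concrete, here is a minimal sketch, assuming net/http for fetching and golang.org/x/net/html for parsing; the timeout is arbitrary, and retries, redirect policy, and robots.txt handling are omitted:

import (
	"io"
	"net/http"
	"strings"
	"time"

	"golang.org/x/net/html"
)

var client = &http.Client{Timeout: 10 * time.Second}

// Download fetches the raw HTML of a page.
func Download(url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// Extract returns the href values of all <a> tags in the HTML.
func Extract(htmlText string) []string {
	var urls []string
	z := html.NewTokenizer(strings.NewReader(htmlText))
	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			return urls // EOF or malformed input ends the scan
		}
		if tt == html.StartTagToken {
			t := z.Token()
			if t.Data == "a" {
				for _, a := range t.Attr {
					if a.Key == "href" {
						urls = append(urls, a.Val)
					}
				}
			}
		}
	}
}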

Architecture

Engine

Downloader

Scheduler

Alternative solution for the Prioritizer with politeness: front queues (priority) + back queues (politeness)

This solution is mentioned in this PDF.
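A minimal in-memory sketch of the two-level scheme (the incremental refill and adaptive per-host delays of the real Mercator-style design are simplified away; the fixed delay and priority-level count are assumptions):

import (
	"errors"
	"net/url"
	"time"
)

// PoliteScheduler combines front queues (prioritization) with
// back queues (per-host politeness).
type PoliteScheduler struct {
	front    [][]string           // front[p]: FIFO of URLs with priority p; higher p served first
	back     map[string][]string  // one FIFO per host
	nextHit  map[string]time.Time // earliest time each host may be contacted again
	minDelay time.Duration        // fixed politeness delay per host (an assumption)
}

func NewPoliteScheduler(levels int, delay time.Duration) *PoliteScheduler {
	return &PoliteScheduler{
		front:    make([][]string, levels),
		back:     make(map[string][]string),
		nextHit:  make(map[string]time.Time),
		minDelay: delay,
	}
}

// Enqueue places the URL into the front queue for its priority.
func (s *PoliteScheduler) Enqueue(rawURL string, priority int) error {
	if priority < 0 || priority >= len(s.front) {
		return errors.New("priority out of range")
	}
	s.front[priority] = append(s.front[priority], rawURL)
	return nil
}

// Dequeue routes URLs from front to per-host back queues, then returns
// one URL whose host is past its politeness delay.
func (s *PoliteScheduler) Dequeue() (string, error) {
	// Drain front queues, highest priority first, into back queues.
	for p := len(s.front) - 1; p >= 0; p-- {
		for _, raw := range s.front[p] {
			if u, err := url.Parse(raw); err == nil {
				s.back[u.Host] = append(s.back[u.Host], raw)
			}
		}
		s.front[p] = nil
	}
	now := time.Now()
	for host, q := range s.back {
		if len(q) == 0 || now.Before(s.nextHit[host]) {
			continue // host still cooling down
		}
		s.back[host] = q[1:]
		s.nextHit[host] = now.Add(s.minDelay)
		return q[0], nil
	}
	return "", errors.New("no host ready")
}

A production version would refill back queues one URL at a time as they drain, keep a min-heap of hosts keyed on their next-allowed fetch time instead of iterating a map, and derive each host's delay from robots.txt and observed response times.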

End to end workflow

Extractor

Storage

Could we use Kafka to replace the Engine component?

The short answer is yes: we could use Kafka as the event bus, with all components acting as producers and consumers connected to it.
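As an illustration of that shape, here is a sketch of the Downloader as a plain Kafka consumer/producer (using the segmentio/kafka-go client; the broker address, topic names, and group ID are made up):

import (
	"context"

	"github.com/segmentio/kafka-go"
)

// runDownloader consumes discovered URLs and produces fetched pages.
// Every other component follows the same producer/consumer pattern,
// so no central Engine is needed.
func runDownloader(ctx context.Context) error {
	urls := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "crawl.urls",  // hypothetical topic of URLs to fetch
		GroupID: "downloaders", // consumer group for horizontal scaling
	})
	defer urls.Close()

	pages := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "crawl.pages", // hypothetical topic of fetched pages
	}
	defer pages.Close()

	for {
		msg, err := urls.ReadMessage(ctx)
		if err != nil {
			return err
		}
		body, err := Download(string(msg.Value)) // Download from the APIs section
		if err != nil {
			continue // skip failed fetches; retry policy omitted
		}
		err = pages.WriteMessages(ctx, kafka.Message{
			Key:   msg.Value, // key by URL so a page's versions land in one partition
			Value: []byte(body),
		})
		if err != nil {
			return err
		}
	}
}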

This is also what scrapy-cluster does.

Using BFS or DFS

scrapy-cluster, for instance, uses a BFS strategy.
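The difference is just the frontier data structure: BFS pops from a FIFO queue, DFS from a LIFO stack. A sketch of the BFS loop, reusing the Download and Extract signatures from the APIs section (the in-memory visited set is illustrative; a real crawler would use a shared, persistent one):

// crawlBFS explores the web level by level: a FIFO frontier means
// links found early are fetched before links found later, keeping
// the crawl broad and close to the seed URL.
func crawlBFS(seed string) {
	frontier := []string{seed}
	visited := map[string]bool{seed: true}
	for len(frontier) > 0 {
		u := frontier[0] // dequeue from the front: FIFO
		frontier = frontier[1:]
		body, err := Download(u)
		if err != nil {
			continue
		}
		for _, link := range Extract(body) {
			if !visited[link] {
				visited[link] = true
				frontier = append(frontier, link)
			}
		}
	}
}

Popping the last element instead of the first turns this into DFS, which burrows deep into a single site; BFS is generally preferred for crawlers because it spreads requests across many hosts (which helps politeness) and tends to reach well-linked, important pages early.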

References