Common Crawl
Type: Dataset
Technique: scraping
Developed by: The Common Crawl Foundation, California, US
Common Crawl is a registered non-profit organization founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable.
Common Crawl completes four crawls a year. Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012. The crawl of September 2017 contains 3.01 billion web pages and over 250 TiB of uncompressed content, an estimated 75% of the Internet.
The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available.
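To give a sense of how the archive is usually accessed, here is a minimal sketch in Python (using the requests library) that queries the public Common Crawl URL index for captures of a page and then fetches a single raw WARC record with an HTTP range request. The index name CC-MAIN-2017-39 (the September 2017 crawl) and the data.commoncrawl.org download host are assumptions based on Common Crawl's published conventions, not part of the original text.

```python
# A minimal sketch: look up a URL in the Common Crawl index for one crawl,
# then fetch just the bytes of one (gzipped) WARC record via a range request.
# Assumes the CC-MAIN-2017-39 index endpoint and the data.commoncrawl.org host.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2017-39-index"

# Ask the index which WARC files contain captures of example.org
resp = requests.get(INDEX, params={"url": "example.org", "output": "json"})
resp.raise_for_status()
records = [json.loads(line) for line in resp.text.splitlines()]

if records:
    rec = records[0]
    start = int(rec["offset"])
    end = start + int(rec["length"]) - 1
    # Fetch only the byte range of that single record, not the whole WARC file
    warc = requests.get(
        "https://data.commoncrawl.org/" + rec["filename"],
        headers={"Range": f"bytes={start}-{end}"},
    )
    print(rec["url"], rec["timestamp"], len(warc.content), "bytes")
```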
Common Crawl data is used to train pretrained word embeddings such as GloVe (see The GloVe Reader).
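As a rough illustration of what such an embedding looks like in practice, the sketch below loads a slice of the GloVe vectors trained on Common Crawl (the 840B-token, 300-dimensional release) and compares two words. The file name glove.840B.300d.txt, the 50,000-line cutoff and the word choices are assumptions made for the example.

```python
# A minimal sketch: load pretrained GloVe vectors trained on Common Crawl
# and compare two words by cosine similarity.
# Assumes glove.840B.300d.txt has been downloaded and unzipped locally.
import numpy as np

def load_glove(path, limit=50000):
    """Read 'word v1 v2 ... v300' lines into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if i >= limit:
                break
            parts = line.rstrip("\n").split(" ")
            # The last 300 fields are the vector; anything before is the token
            word = " ".join(parts[:-300])
            vectors[word] = np.asarray(parts[-300:], dtype=np.float32)
    return vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = load_glove("glove.840B.300d.txt")
print(cosine(embeddings["web"], embeddings["internet"]))
```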