Common Crawl
Type: Dataset
Technique: scraping
Developed by: The Common Crawl Foundation, California, US
Common Crawl (http://commoncrawl.org) is a registered non-profit organization founded by Gil Elbaz. Its goal is to democratize access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable.
Common Crawl completes four crawls a year. Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012. The crawl of September 2017 contains 3.01 billion web pages and over 250 TiB of uncompressed content, or about 75% of the Internet.
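As a minimal sketch of how the hosted archive can be queried, the snippet below asks Common Crawl's public CDX index server (index.commoncrawl.org) which captures of a given URL exist in the September 2017 crawl (CC-MAIN-2017-39). It assumes the Python `requests` package; the queried URL is an arbitrary example.

```python
# A minimal sketch: query Common Crawl's public CDX index for one URL.
# Assumes the `requests` package; the queried URL is an arbitrary example.
import json
import requests

INDEX = 'https://index.commoncrawl.org/CC-MAIN-2017-39-index'  # September 2017 crawl

resp = requests.get(INDEX, params={'url': 'commoncrawl.org', 'output': 'json'})
resp.raise_for_status()

# The server answers with one JSON object per capture (newline-delimited).
for line in resp.text.splitlines():
    capture = json.loads(line)
    print(capture['timestamp'], capture['url'], capture['filename'])
```

Each capture record points to the WARC file, offset and length where the archived page can be fetched.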
The organization's crawlers respect nofollow and robots.txt policies. Open-source code for processing Common Crawl's dataset is publicly available.
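The crawl archives themselves are distributed as WARC files, which can be parsed with standard tooling. The sketch below assumes the Python `warcio` package and a locally downloaded WARC file ('example.warc.gz' is a placeholder for any path listed in a crawl's manifest); it prints the URL and payload size of each archived response.

```python
# A minimal sketch: iterate over one downloaded Common Crawl WARC file.
# Assumes the `warcio` package; 'example.warc.gz' is a placeholder for any
# path listed in a crawl's warc.paths.gz manifest.
from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':          # skip request/metadata records
            url = record.rec_headers.get_header('WARC-Target-URI')
            payload = record.content_stream().read()
            print(url, len(payload), 'bytes')
```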
Common Crawl datasets are used to create pretrained word-embedding datasets such as GloVe (see The GloVe Reader: http://www.algolit.net/index.php/The_GloVe_Reader).
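As a sketch of how such embeddings are reused, the snippet below loads GloVe's plain-text format (one token per line, followed by its vector). It assumes a locally downloaded file; glove.840B.300d.txt is the variant trained on Common Crawl data.

```python
# A minimal sketch: load pretrained GloVe vectors from their plain-text format.
# Assumes a locally downloaded file; glove.840B.300d.txt is the Common Crawl variant.
import numpy as np

def load_glove(path, dim=300):
    """Map each token to a `dim`-dimensional NumPy vector."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            # the last `dim` fields are the vector; everything before is the token
            word = ' '.join(parts[:-dim])
            embeddings[word] = np.asarray(parts[-dim:], dtype='float32')
    return embeddings

vectors = load_glove('glove.840B.300d.txt')
print(vectors['crawl'][:5])
```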