Java Web Crawler

From Chorke Wiki
{|
| valign="top" |
* [https://stackoverflow.com/questions/24940976/ Update or Remove an Item from a Cached Collection]
* [https://hub.docker.com/r/selenium/standalone-firefox Docker Image <code>selenium/standalone-firefox</code>]
* [https://hub.docker.com/r/selenium/standalone-chrome Docker Image <code>selenium/standalone-chrome</code>]
* [https://hub.docker.com/r/selenium/standalone-opera Docker Image <code>selenium/standalone-opera</code>]
* [https://stackoverflow.com/questions/16335820/convert-xpath-to-jsoup-query#:~:text=Google%20Chrome%20Version Copy <code>Jsoup</code> Selector by Chrome Browser]
* [https://jsoup.org/cookbook/extracting-data/attributes-text-html <code>Jsoup</code> Attributes Text Html]
* [https://jsoup.org/cookbook/extracting-data/selector-syntax <code>Jsoup</code> Selector Syntax]
* [https://www.baeldung.com/crawler4j Baeldung Crawler4j]
* [https://devdocs.magento.com/mftf/docs/guides/selectors.html How To write good selectors]
* [https://stackoverflow.com/questions/33080906/ XPath Iteration in Java]
|}

Latest revision as of 17:04, 22 October 2020

A web crawler, or spider, is a type of bot that is typically operated by search engines such as Google and Bing. Its purpose is to index the content of websites across the Internet so that those websites can appear in search engine results.
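The indexing loop described above can be sketched as a breadth-first walk over links. This is a minimal illustration and not code from this page: the class name `CrawlFrontier`, the `linkExtractor` callback, and the in-memory site map are all assumptions made for the example. A real crawler would fetch each page over HTTP and extract its links with a parser such as Jsoup.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

public class CrawlFrontier {

    /**
     * Visits pages breadth-first from the seed and returns them in visit order.
     * linkExtractor is a hypothetical callback that returns the outgoing links
     * of a URL; maxPages caps how many pages are visited.
     */
    public static List<String> crawl(String seed, int maxPages,
                                     Function<String, List<String>> linkExtractor) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            if (!visited.add(url)) {
                continue; // this URL was already visited via another link
            }
            for (String link : linkExtractor.apply(url)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // Stand-in for real pages: each key lists the links found on that page.
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/b", "/c"),
                "/b", List.of("/"),
                "/c", List.of());
        System.out.println(crawl("/", 10, url -> site.getOrDefault(url, List.of())));
    }
}
```

With the sample map, `main` prints `[/, /a, /b, /c]`. The visited set is what keeps the crawl from looping when pages link back to each other.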

Selenium

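# Start standalone Firefox with a 2 GB shared-memory segment
docker run --detach \
    --publish 4444:4444 \
    --hostname firefox \
    --name firefox \
    --shm-size 2g \
    selenium/standalone-firefox:80.0

--OR--

# Alternatively, share the host's /dev/shm with the container
docker run --detach \
    --publish 4444:4444 \
    --hostname firefox \
    --name firefox \
    --volume /dev/shm:/dev/shm \
    selenium/standalone-firefox:80.0

# Inspect the container's hosts file
docker exec -it firefox cat /etc/hosts

WebDriver endpoint: http://localhost:4444/wd/hub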
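Once a container is running, it can help to confirm that the endpoint answers before pointing a crawler at it. The sketch below uses only the JDK's `java.net.http` client (Java 11+). It assumes the server exposes a status resource at `/wd/hub/status`, which is the usual location for these standalone images but is not stated on this page. The class and method names are made up for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class HubStatusCheck {

    /** Appends /status to the hub URL, tolerating a trailing slash. */
    public static URI statusUri(String hubUrl) {
        String base = hubUrl.endsWith("/")
                ? hubUrl.substring(0, hubUrl.length() - 1)
                : hubUrl;
        return URI.create(base + "/status");
    }

    /** Returns true when the status request is answered with HTTP 200. */
    public static boolean isReachable(String hubUrl) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(statusUri(hubUrl))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200;
        } catch (IOException e) {
            return false; // e.g. container not started or port not published
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        String hubUrl = "http://localhost:4444/wd/hub";
        System.out.println(hubUrl + " reachable: " + isReachable(hubUrl));
    }
}
```

A container may need a few seconds after `docker run` before it responds, so a caller would typically retry this check a few times before giving up.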
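# Start standalone Chrome with a 2 GB shared-memory segment
# (host port 4444 is the same as in the Firefox example, so only one can run at a time)
docker run --detach \
    --publish 4444:4444 \
    --hostname chrome \
    --name chrome \
    --shm-size 2g \
    selenium/standalone-chrome:85.0

--OR--

# Alternatively, share the host's /dev/shm with the container
docker run --detach \
    --publish 4444:4444 \
    --hostname chrome \
    --name chrome \
    --volume /dev/shm:/dev/shm \
    selenium/standalone-chrome:85.0

# Inspect the container's hosts file
docker exec -it chrome cat /etc/hosts

WebDriver endpoint: http://localhost:4444/wd/hub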
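# Start standalone Opera with a 2 GB shared-memory segment
# (host port 4444 is the same as in the Firefox and Chrome examples, so only one can run at a time)
docker run --detach \
    --publish 4444:4444 \
    --hostname opera \
    --name opera \
    --shm-size 2g \
    selenium/standalone-opera:71.0

--OR--

# Alternatively, share the host's /dev/shm with the container
docker run --detach \
    --publish 4444:4444 \
    --hostname opera \
    --name opera \
    --volume /dev/shm:/dev/shm \
    selenium/standalone-opera:71.0

# Inspect the container's hosts file
docker exec -it opera cat /etc/hosts

WebDriver endpoint: http://localhost:4444/wd/hub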
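All three containers are reached through the same endpoint, so a client mainly needs to say which browser it wants. The sketch below builds a W3C WebDriver "new session" request with the JDK HTTP client. It is an illustration under assumptions: it presumes the server accepts the W3C `capabilities` payload (older setups may expect `desiredCapabilities` instead), and the class and method names are invented for the example. In practice the Selenium Java client's `RemoteWebDriver` would normally handle this exchange.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NewSessionRequest {

    /** Builds a minimal W3C capabilities payload for the given browser name. */
    public static String capabilitiesJson(String browserName) {
        // Escape backslashes and quotes so the name is a valid JSON string.
        String escaped = browserName.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"capabilities\":{\"alwaysMatch\":{\"browserName\":\""
                + escaped + "\"}}}";
    }

    /** Creates (but does not send) the POST request that opens a session. */
    public static HttpRequest newSession(String hubUrl, String browserName) {
        return HttpRequest.newBuilder(URI.create(hubUrl + "/session"))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(capabilitiesJson(browserName)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // "firefox" matches the first container; use "chrome" for the second.
        HttpRequest request = newSession("http://localhost:4444/wd/hub", "firefox");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body()); // JSON that includes the session id
    }
}
```

A session opened this way stays alive until it is deleted (`DELETE /session/{id}` in the WebDriver protocol) or the container is stopped. The browser name to send for the Opera image may differ from the image name and should be checked against the driver's documentation.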

References