Java Web Crawler: Difference between revisions
A web crawler, or spider, is a type of bot typically operated by search engines such as Google and Bing. Its purpose is to index the content of websites across the Internet so that those websites can appear in search engine results.
==Selenium==
{|
| valign="top" |
<source lang="bash">
# Start a standalone Firefox node with 2 GB of shared memory
docker run --detach \
  --publish 4444:4444 \
  --hostname firefox \
  --name firefox \
  --shm-size 2g \
  selenium/standalone-firefox:80.0
</source>
<code>'''--OR--'''</code>
<source lang="bash">
# Alternative: share the host's /dev/shm with the container
docker run --detach \
  --publish 4444:4444 \
  --hostname firefox \
  --name firefox \
  --volume /dev/shm:/dev/shm \
  selenium/standalone-firefox:80.0
</source>
Inspect the container's hosts file: <code>docker exec -it firefox cat /etc/hosts</code>

WebDriver endpoint: http://localhost:4444/wd/hub
| valign="top" |
<source lang="bash">
# Start a standalone Chrome node with 2 GB of shared memory
docker run --detach \
  --publish 4444:4444 \
  --hostname chrome \
  --name chrome \
  --shm-size 2g \
  selenium/standalone-chrome:85.0
</source>
<code>'''--OR--'''</code>
<source lang="bash">
# Alternative: share the host's /dev/shm with the container
docker run --detach \
  --publish 4444:4444 \
  --hostname chrome \
  --name chrome \
  --volume /dev/shm:/dev/shm \
  selenium/standalone-chrome:85.0
</source>
Inspect the container's hosts file: <code>docker exec -it chrome cat /etc/hosts</code>

WebDriver endpoint: http://localhost:4444/wd/hub
| valign="top" |
<source lang="bash">
# Start a standalone Opera node with 2 GB of shared memory
docker run --detach \
  --publish 4444:4444 \
  --hostname opera \
  --name opera \
  --shm-size 2g \
  selenium/standalone-opera:71.0
</source>
<code>'''--OR--'''</code>
<source lang="bash">
# Alternative: share the host's /dev/shm with the container
docker run --detach \
  --publish 4444:4444 \
  --hostname opera \
  --name opera \
  --volume /dev/shm:/dev/shm \
  selenium/standalone-opera:71.0
</source>
Inspect the container's hosts file: <code>docker exec -it opera cat /etc/hosts</code>

WebDriver endpoint: http://localhost:4444/wd/hub
|}
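
Once one of the containers above is running, it can help to confirm that the WebDriver endpoint answers before pointing a crawler at it. The following is a minimal sketch, not part of the original page: the class name <code>SeleniumStatusCheck</code> and its methods are hypothetical, it assumes Java 11 or later (for <code>java.net.http.HttpClient</code>), and it assumes the server exposes a <code>/status</code> resource under the hub URL, as the standalone server images of this era generally do.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper: checks whether a Selenium hub URL responds.
public class SeleniumStatusCheck {

    // Builds the status URI from a hub URL, tolerating trailing slashes.
    public static URI statusUri(String hubUrl) {
        String base = hubUrl.replaceAll("/+$", "");
        return URI.create(base + "/status");
    }

    // Returns true if the status resource answers with HTTP 200.
    public static boolean isReachable(String hubUrl) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(statusUri(hubUrl))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200;
        } catch (IOException e) {
            // Connection refused, timeout, or similar: treat as not reachable.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed default: the endpoint published by the docker commands above.
        String hubUrl = args.length > 0 ? args[0] : "http://localhost:4444/wd/hub";
        System.out.println(hubUrl + " reachable: " + isReachable(hubUrl));
    }
}
```

All three containers publish host port 4444, so only one of them can be bound to that port on a given host at a time; to run several side by side, map each to a different host port and pass the matching URL as the program argument.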
==References==
{|
| valign="top" |
* [https://www.developersoapbox.com/java-connect-to-sqlite-using-spring-boot/ Connect to SQLite using Spring Boot]
* [https://medium.com/@kumarshivam_66534/implementation-of-spring-boot-data-redis-for-caching-in-my-application-218d02c31191 Spring Boot Data Redis for caching]
* [https://docs.spring.io/spring-data/data-redis/docs/current/reference/html/ Spring Data Redis]
* [https://www.baeldung.com/spring-boot-sqlite SQLite Dialect]
| valign="top" |
* [https://stackoverflow.com/questions/14072380/ <code>@Cacheable</code> Key on Multiple Method Arguments]
* [https://github.com/SeleniumHQ/docker-selenium Docker images for the Selenium Grid Server]
* [https://stackoverflow.com/questions/11559464/ EhCache overflow to disk at specific path]
* [https://javabeat.net/enablecaching-spring/ <code>@EnableCaching</code> Annotation in Spring]
* [https://stackoverflow.com/questions/12836114/ Selenium Webdriver Remote Setup]
* [https://dimitr.im/spring-boot-cache-ehcache Using EhCache 3 with Spring boot]
* [https://underthehood.meltwater.com/blog/2016/11/09/using-docker-with-selenium-server-to-run-your-browser-tests/ Using Selenium-Server on Docker]
* [https://examples.javacodegeeks.com/enterprise-java/spring/boot/spring-boot-ehcache-example/ Spring Boot Ehcache Example]
* [https://www.baeldung.com/spring-boot-evict-cache Cache Eviction in Spring Boot]
* [https://www.scrapingbee.com/blog/introduction-to-chrome-headless/ Chrome Headless with Java]
| valign="top" |
* [https://stackoverflow.com/questions/44781339/ Spring Boot Web Application using Selenium WebDriver]
* [https://dzone.com/articles/automated-testing-with-junit-and-selenium-for-brow Automated Testing With JUnit & Selenium for Browser]
* [https://stackoverflow.com/questions/417142/ Maximum length of a URL for the Browsers]
* [https://stackoverflow.com/questions/17749049/ Spring <code>@CacheEvict</code> using wildcards]
* [https://www.foreach.be/blog/spring-cache-annotations-some-tips-tricks Spring Cache Annotations Tips & Tricks]
* [https://bonigarcia.github.io/selenium-jupiter/#quick-reference Selenium Jupiter Quick Reference]
* [https://stackoverflow.com/questions/25306704/ Disable RobotServer in Crawler4j]
* [https://github.com/bonigarcia/selenium-jupiter Selenium Jupiter]
* [https://github.com/yasserg/crawler4j Crawler4j]
|}
----
{|
| valign="top" |
* [https://stackoverflow.com/questions/24940976/ Update or Remove an Item from a Cached Collection]
* [https://hub.docker.com/r/selenium/standalone-firefox Docker Image <code>selenium/standalone-firefox</code>]
* [https://hub.docker.com/r/selenium/standalone-chrome Docker Image <code>selenium/standalone-chrome</code>]
* [https://hub.docker.com/r/selenium/standalone-opera Docker Image <code>selenium/standalone-opera</code>]
* [https://stackoverflow.com/questions/16335820/convert-xpath-to-jsoup-query#:~:text=Google%20Chrome%20Version Copy <code>Jsoup</code> Selector by Chrome Browser]
* [https://jsoup.org/cookbook/extracting-data/attributes-text-html <code>Jsoup</code> Attributes Text Html]
* [https://jsoup.org/cookbook/extracting-data/selector-syntax <code>Jsoup</code> Selector Syntax]
* [https://www.baeldung.com/crawler4j Baeldung Crawler4j]
| valign="top" |
* [https://www.scientecheasy.com/2020/07/find-xpath-chrome.html/ Find XPath in Chrome Browser]
* [https://devdocs.magento.com/mftf/docs/guides/selectors.html How To write good selectors]
* [https://stackoverflow.com/questions/33080906/ XPath Iteration in Java]
|}
Latest revision as of 17:04, 22 October 2020