Crawling Shopping-Mall Product Images with Java (Using the jsoup Library) 21. 05. 30.

✨ Overview

Crawling means fetching a web page as-is and extracting data from it.
First, download jsoup.jar from Maven and add it to the Java project as an external library.

The site we will crawl is https://www.marpple.com/kr/product/list/9.

🍏 1. Preliminary work CrawlTest.java

package crawling;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

public class CrawlTest {
    public static void main(String[] args) throws Exception {
        String targetUrl = "https://www.marpple.com/kr/product/list/9";
        URL url = new URL(targetUrl);

        // Save the raw HTML to text.html; try-with-resources closes both streams.
        try (InputStream is = url.openStream();
             FileOutputStream fos = new FileOutputStream("text.html")) {
            int b;
            while ((b = is.read()) != -1) {
                fos.write(b);
            }
        }
    }
}

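🔶 First, check that the HTML for the URL can actually be fetched and saved.
🔶 Next, check whether the page is static or rendered asynchronously. jsoup only parses the HTML it is given and does not run JavaScript, so it can only crawl static pages. (For asynchronous pages, Selenium is apparently the common choice.)
🔶 If an image link you see on the site also appears in the text.html that the Java code created, the page is static!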
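
The last check can also be done in code instead of by eye. Below is a minimal sketch using only the standard library: it reads the saved file and looks for a link copied from the browser. The class name StaticPageCheck, its method names, and the sample link are all made up for illustration, and it assumes text.html was saved as UTF-8.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original CrawlTest.
final class StaticPageCheck {

    // True when the raw HTML text already contains the link seen in the browser.
    static boolean containsLink(String html, String link) {
        return html.contains(link);
    }

    // Reads the saved file as UTF-8 (an assumption) and applies the same check.
    static boolean savedFileContainsLink(Path htmlFile, String link) {
        try {
            String html = new String(Files.readAllBytes(htmlFile), StandardCharsets.UTF_8);
            return containsLink(html, link);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path saved = Paths.get("text.html");
        // Hypothetical link; replace it with one copied from the element inspector.
        String link = "//example.com/images/sample.jpg";
        if (Files.exists(saved)) {
            System.out.println(savedFileContainsLink(saved, link)
                    ? "link found: looks like a static page"
                    : "link not found in the raw HTML");
        } else {
            System.out.println("text.html not found; run the download step first");
        }
    }
}
```

If the link is not found, the images are probably filled in by JavaScript after the page loads, and jsoup alone will not see them.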

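🍏 2. Analyzing the HTML and finding the rules CrawlTest.java

Now it's time to analyze the HTML file we fetched! The browser's element inspector is easier to read, so look for the rules there.
What makes jsoup powerful is that you can look things up through Element objects! For example, doc.getElementsByClass(className) just works!!

On this site each product element carries the class base_product, so we start the search from that class name.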

        File file = new File("text.html");
        Document doc = Jsoup.parse(file, "utf-8");
        System.out.println(doc);

        Elements els = doc.getElementsByClass("base_product");
        els.forEach(el -> {
            System.out.println(el);
        });

        for(int i = 0 ; i < els.size() ; i++) {
            Element el = els.get(i);
            System.out.println(el.select("a").get(0).attr("href"));
        }

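        Elements els = doc.getElementsByClass("base_product");
        // Note: one map is reused, so each iteration overwrites the previous item's values.
        Map<String, String> map = new HashMap<String, String>();
        for(int i = 0 ; i < els.size() ; i++) {
            Element el = els.get(i);
            // Product detail link from the first <a> inside the item.
            String link = el.select("a").get(0).attr("href");
            map.put("link", link);
            // Product id: everything after "bp_id=" in the link.
            String pid = link.substring(link.indexOf("bp_id=")+6);
            map.put("pid", pid);
            // The image address is read from the data-src attribute of the first <img>.
            String src = el.select("img").get(0).attr("data-src");
            map.put("src", src);
            System.out.println(src);
        }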

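🔶 A Map is used to keep each item's values (link, pid, src) in a fixed pattern.

With that done, create a folder per product so that the downloaded files stay organized.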
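
One thing to watch in the bp_id= step: substring takes everything after the marker, so a link with extra parameters after bp_id would drag them into the pid, and a link without bp_id= would quietly return the wrong text. Here is a slightly more defensive sketch. The class ProductIdParser and the sample links are hypothetical; only the bp_id= key comes from the code above.

```java
// Hypothetical helper, not part of the original CrawlTest.
final class ProductIdParser {
    private static final String KEY = "bp_id=";

    // Returns the bp_id value from a product link, or null when the parameter is absent.
    static String extractPid(String link) {
        int start = link.indexOf(KEY);
        if (start < 0) {
            return null; // no bp_id in this link
        }
        start += KEY.length();
        int end = start;
        // Stop at the next query parameter or fragment, if there is one.
        while (end < link.length() && "&#".indexOf(link.charAt(end)) < 0) {
            end++;
        }
        return link.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up links, only to show the three cases.
        System.out.println(extractPid("/kr/product/detail?bp_id=4530"));         // 4530
        System.out.println(extractPid("/kr/product/detail?bp_id=4530&lang=kr")); // 4530
        System.out.println(extractPid("/kr/product/list/9"));                    // null
    }
}
```

Since the pid later becomes a folder name, skipping items where this returns null is probably the safer choice.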

            // Inside the for loop: one folder per product id.
            File tFolder = new File("d:/marpple/" + pid);
            if(!tFolder.exists()) {
                tFolder.mkdirs();
            }
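
The same folder step can also be written with java.nio.file, where Files.createDirectories creates any missing parent folders and does not complain when the folder already exists. This is only a sketch of an alternative: the class ProductFolders is hypothetical, and the demo writes under the system temp folder rather than d:/marpple so it can run anywhere.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original CrawlTest.
final class ProductFolders {

    // Creates <baseDir>/<pid> along with missing parents; an existing folder is fine.
    static Path ensureFolder(Path baseDir, String pid) {
        Path folder = baseDir.resolve(pid);
        try {
            Files.createDirectories(folder);
        } catch (IOException e) {
            // Wrapped so the method can also be called from a lambda such as els.forEach(...).
            throw new UncheckedIOException(e);
        }
        return folder;
    }

    public static void main(String[] args) {
        // Demo base folder under the system temp directory (an assumption for the example).
        Path base = Paths.get(System.getProperty("java.io.tmpdir"), "marpple-demo");
        Path folder = ensureFolder(base, "4530"); // "4530" is a made-up product id
        System.out.println(folder + " exists: " + Files.isDirectory(folder));
    }
}
```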

Now let's download the images!

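            // The code assumes src is protocol-relative (starts with "//"), so the scheme is prepended.
            URL url = new URL("https:" + src);
            // try-with-resources closes both streams even if the copy fails.
            try (InputStream is = url.openStream();
                 FileOutputStream fos = new FileOutputStream("d:/marpple/" + pid + "/index.jpg")) {
                int b;
                while ((b = is.read()) != -1) {
                    fos.write(b);
                }
            }

            System.out.println(pid + " :: done");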
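
Reading and writing one byte at a time works, but it is slow for images. A possible alternative is to let Files.copy move the whole stream. The sketch below is written under a few assumptions: the class ImageDownloader and its method names are hypothetical, a src starting with "//" is treated as protocol-relative the same way the snippet above does, and the demo copies from an in-memory stream so it runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the original CrawlTest.
final class ImageDownloader {

    // Turns a protocol-relative link ("//host/path") into an https URL; other links pass through.
    static String toAbsoluteUrl(String src) {
        return src.startsWith("//") ? "https:" + src : src;
    }

    // Copies the whole stream to target, replacing any existing file; returns the bytes written.
    static long save(InputStream in, Path target) {
        try (InputStream input = in) {
            return Files.copy(input, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens the image URL and saves it; this is the part that needs network access.
    static long download(String src, Path target) {
        try {
            return save(new URL(toAbsoluteUrl(src)).openStream(), target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Offline demo: an in-memory stream stands in for url.openStream().
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "index-demo.jpg");
        long written = save(new ByteArrayInputStream(new byte[] {1, 2, 3}), target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

In the loop, the call would look like download(src, Paths.get("d:/marpple", pid, "index.jpg")).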

Done!

In the same way you could go one depth further into each product and fetch the detail images too, but I haven't tried that yet. Next time!

Full Java code

package crawling;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CrawlTest {
	public static void main(String[] args) throws Exception {
//		String targetUrl = "https://www.marpple.com/kr/product/list/9";
//		URL url = new URL(targetUrl);
//
//		InputStream is = url.openStream();
//		FileOutputStream fos = new FileOutputStream("text.html");
//		int b;
//		while ((b = is.read()) != -1) {
//			fos.write(b);
//		}
//		fos.close();
		
		File file = new File("text.html");
		Document doc = Jsoup.parse(file, "utf-8");
		System.out.println(doc);
		
		Elements els = doc.getElementsByClass("base_product");
		Map<String, String> map = new HashMap<String, String>();
		for(int i = 0 ; i < els.size() ; i++) {
			Element el = els.get(i);
			String link = el.select("a").get(0).attr("href");
			map.put("link", link);
			String pid = link.substring(link.indexOf("bp_id=")+6);
			map.put("pid", pid);
			String src = el.select("img").get(0).attr("data-src");
			map.put("src", src);
			System.out.println(src);
			
			File tFolder = new File("d:/marpple/" + pid);
			if(!tFolder.exists()) {
				tFolder.mkdirs();
			}
			
			URL url = new URL("https:" + src);
			// try-with-resources closes both streams even if the copy fails.
			try (InputStream is = url.openStream();
					FileOutputStream fos = new FileOutputStream("d:/marpple/" + pid + "/index.jpg")) {
				int b;
				while ((b = is.read()) != -1) {
					fos.write(b);
				}
			}
			
			System.out.println(pid + " :: done");
		}
		
		
	}
}
