Using BeautifulSoup

Extract specific tags, then check whether each one has a specific attribute

Installation

pip install beautifulsoup4

from bs4 import BeautifulSoup

soup = BeautifulSoup("HTML content", "html.parser")


# Extract all video tags
video_tags = soup.find_all("video")

for vid in video_tags:
    if vid.has_attr('mp4'):
        # handle video tags that have an mp4 attribute
        pass
    elif vid.has_attr('flv'):
        # handle video tags that have an flv attribute
        pass
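
The same attribute check can be tried without a live page. The following is a minimal sketch that runs against an inline document; the HTML snippet and the choice of the `src` attribute are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two video tags, only the first carries a src attribute.
html = """
<video src="intro.mp4" controls></video>
<video><source src="clip.flv"></video>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep the src value of every video tag that actually has one.
sources = [vid["src"] for vid in soup.find_all("video") if vid.has_attr("src")]
print(sources)
```

Checking with has_attr() first avoids the KeyError that vid["src"] would raise on the second tag.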

Reference

Fetching Naver's real-time search terms

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "http://www.naver.com/"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

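rank = soup.find("dl", id="ranklist")  # dl tag whose id attribute is "ranklist"; None if the page has no such element

for i in rank.find_all("li", value=True, id=False):  # li tags that have a value attribute and no id attribute
    print(i.get_text(" ", strip=True))  # get the text, joining the pieces with a space and stripping surrounding whitespace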
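
Because the live page markup can change, the filter is easier to study offline. This sketch applies the same `value=True, id=False` search to a hand-written fragment; the markup is hypothetical and is not Naver's actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a ranking list; not Naver's real HTML.
html = """
<dl id="ranklist">
  <dd><ol>
    <li value="1"><a href="#">first keyword</a></li>
    <li value="2"><a href="#">second keyword</a></li>
    <li value="3" id="more"><a href="#">skipped</a></li>
    <li><a href="#">no value attribute</a></li>
  </ol></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")
rank = soup.find("dl", id="ranklist")

keywords = []
if rank is not None:  # find() returns None when nothing matches
    # value=True requires the attribute to exist; id=False requires it to be absent.
    for li in rank.find_all("li", value=True, id=False):
        keywords.append(li.get_text(" ", strip=True))
print(keywords)
```

The `is not None` guard keeps the script from failing with an AttributeError when the element is missing.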

Adding a header
url = "http://www.naver.com/"
req = Request(url)
req.add_header('User-Agent', 'Mozilla/5.0')
html = urlopen(req).read()
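
As an alternative sketch, the header can be passed when the Request is constructed. Nothing is sent over the network here; the example only builds the request object and reads the header back.

```python
from urllib.request import Request

url = "http://www.naver.com/"

# Headers can be supplied at construction time instead of via add_header().
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})

# Request stores header names in capitalized form, i.e. "User-agent".
user_agent = req.get_header("User-agent")
print(user_agent)
# The request is only sent once urlopen(req) is called.
```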

Usage
1. Find all a tags
soup.find_all("a")
soup("a")

2. Find all strings inside the title tag
soup.title.find_all(string=True)
soup.title(string=True)

3. Get only the first two a tags
soup.find_all("a", limit=2)

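4. Searching by string
import re # needed for the regular expression search below
soup.find_all(string="Elsie") # strings that are exactly "Elsie"
soup.find_all(string=["Tillie", "Elsie", "Lacie"]) # OR search: strings matching any item in the list
soup.find_all(string=re.compile("Dormouse")) # strings matching a regular expression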
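
The three string searches can be checked against a small document. This sketch uses a fragment modeled on the "three sisters" sample from the Beautiful Soup documentation; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Elsie")                        # exact match
any_of = soup.find_all(string=["Tillie", "Elsie", "Lacie"])  # any item of the list
pattern = soup.find_all(string=re.compile("Dormouse"))       # regular expression

print(exact, any_of, pattern)
```

Results come back in document order, not in the order of the list passed to `string`.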

5. Find p tags whose CSS class is title (the second positional argument is matched against the class attribute)
soup.find_all("p", "title")
e.g. <p class="title"></p>
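
A short sketch of how the positional class match behaves, using made-up markup. It also shows that `class` is returned as a list, since a tag may carry several classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first p has "title" among its classes.
html = '<p class="title main">A</p><p class="body">B</p><p title="title">C</p>'
soup = BeautifulSoup(html, "html.parser")

by_position = soup.find_all("p", "title")        # second positional argument = CSS class
by_keyword = soup.find_all("p", class_="title")  # equivalent keyword form

texts = [p.get_text() for p in by_position]
classes = soup.p["class"]  # multi-valued attribute, returned as a list
print(texts, classes)
```

The third paragraph is not matched: its `title` is an ordinary attribute, not a class.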

6. Find both a tags and b tags
soup.find_all(["a", "b"])

7. Getting attribute values
soup.p['class']
soup.p['id']

8. Replacing a string with another string
tag.string.replace_with("new value")

9. Pretty-printing
soup.b.prettify()

10. Simple navigation
soup.body.b # the first b tag under the body tag
soup.a # the first a tag

11. Getting all attributes (as a dictionary)
tag.attrs

12. class is a reserved word in Python, so write class_ instead.
soup.find_all("a", class_="sister")

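13. find methods
find() # the first match
find_next() # the first match that comes after the current element
find_all() # every match, as a list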

14. Checking the result of find (it returns None when nothing matches)
i = soup.find("div", title=True)
if i is not None:
    print(i)

15. Finding attributes that start with data-
soup.find("div", attrs={"data-value": True})
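
The `attrs` dictionary is needed because a name such as `data-value` contains a hyphen and cannot be written as a Python keyword argument. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first div carries a data-value attribute.
html = '<div data-value="42">a</div><div>b</div>'
soup = BeautifulSoup(html, "html.parser")

# True means "the attribute exists, whatever its value".
tag = soup.find("div", attrs={"data-value": True})
value = tag["data-value"]

# A concrete string matches only that exact value.
same_tag = soup.find("div", attrs={"data-value": "42"})
print(value)
```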

16. Getting the tag name
soup.find("div").name

17. Getting an attribute
soup.find("div")['class'] # raises KeyError if the attribute is missing
soup.find("div").get('class') # returns None if the attribute is missing
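
The difference between the two access styles can be seen on a tag that lacks the attribute. This is a small sketch on an inline snippet; the markup and variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="box">text</div>', "html.parser")
div = soup.find("div")

# Subscript access raises KeyError for a missing attribute.
try:
    div["class"]
    missing_raises = False
except KeyError:
    missing_raises = True

# get() returns None, or a caller-supplied default, instead of raising.
no_default = div.get("class")
with_default = div.get("class", [])
print(missing_raises, no_default, with_default)
```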

18. Checking whether an attribute exists
tag.has_attr('class')
tag.has_attr('id')
Returns True if the attribute is present, False otherwise

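19. Removing a tag
a_tag.img.unwrap() # removes the tag itself but keeps anything inside it; decompose() removes the tag together with its contents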
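
The difference between unwrap() and decompose() shows up when the tag has contents. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>bold <i>world</i></b>!</p>", "html.parser")

# unwrap() strips the b tag but leaves its children in place.
soup.b.unwrap()
after_unwrap = str(soup)

# decompose() destroys the i tag together with everything inside it.
soup.i.decompose()
after_decompose = str(soup)

print(after_unwrap)
print(after_decompose)
```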

20. Adding a tag (wrapping an element in a new tag)
soup.p.string.wrap(soup.new_tag("b"))
soup.p.wrap(soup.new_tag("div"))
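
The two wrap calls from item 20 can be traced on a one-line document. This sketch assumes a single p tag containing plain text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>text</p>", "html.parser")

# Wrap the string inside p in a new b tag.
soup.p.string.wrap(soup.new_tag("b"))
after_first = str(soup)

# Wrap the whole p tag in a new div tag.
soup.p.wrap(soup.new_tag("div"))
after_second = str(soup)

print(after_first)
print(after_second)
```

new_tag() only creates a detached tag; it becomes part of the tree once wrap() (or a method such as append()) places it.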

Installing Beautiful Soup
Install: pip install beautifulsoup4

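Parser libraries
1. html.parser (decent speed, lenient; part of the Python standard library, so nothing extra to install)

2. lxml (very fast, lenient)
pip install lxml

3. html5lib (very slow, but extremely lenient)
pip install html5lib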
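
Since lxml and html5lib are optional installs, one possible approach is to try the preferred parser and fall back to the next one. The helper below is a hypothetical sketch (`make_soup` and its `preferred` parameter are not part of Beautiful Soup); it relies on the FeatureNotFound exception that BeautifulSoup raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: try each parser in order and return (soup, parser_name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")

soup, used = make_soup("<p>Unclosed paragraph")
print(used, soup.p.get_text())
```

Different parsers may build different trees from the same broken markup, so it is worth pinning a single parser when results must be reproducible.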

Source: http://zeroplus1.zc.bz/jh/web/main.php?id=132&category=ETC
