How to scrape only visible webpage text with BeautifulSoup?


css3 2023. 9. 7. 21:57

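Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want just the body text (the article) and maybe a few tab names here and there. I tried the suggestion in this SO question, but it returns a lot of <script> tags and HTML comments that I don't want. I can't figure out which arguments the function needs in order to get only the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS, and so on?

Try this: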

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # find_all(string=True) returns every text node, including comments
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

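The accepted answer from @jbochi does not work for me. The str() call raises an exception because it cannot encode the non-ASCII characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page down to its visible text.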

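from bs4 import BeautifulSoup

html = open('21storm.html').read()
soup = BeautifulSoup(html, 'html.parser')
# Remove non-visible elements from the tree before extracting the text
for s in soup(['style', 'script', '[document]', 'head', 'title']):
    s.extract()
visible_text = soup.get_text()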

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

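I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content of a page.

I had a similar problem getting the rendered content, that is, the content visible in a typical browser. In particular, I had many perhaps atypical cases to work with, such as the simple example below. In this case the non-displayable tag is nested inside a style tag and is not visible in many of the browsers I checked. Other variations exist, such as defining a class that sets display to none and then using that class for the div.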

<html>
  <title>  Title here</title>

  <body>

    lots of text here <p> <br>
    <h1> even headings </h1>

    <style type="text/css"> 
        <div > this will not be visible </div> 
    </style>


  </body>

</html>

One solution posted above is:

html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)
print(visible_texts)


[u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']

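This solution certainly has applications in many cases and generally does the job well, but for the HTML posted above it retains the text that is not rendered. Searching SO turned up a couple of solutions: "BeautifulSoup get_text does not strip all tags and JavaScript" and "Rendered HTML to plain text using Python".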
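For the specific sample above, one possible workaround that stays within Beautiful Soup is sketched below. It assumes `html.parser`, drops `style` and `script` elements outright, and also drops elements whose inline `style` attribute contains `display: none`. The function name `rendered_text_guess` and the `HIDDEN_STYLE` pattern are illustrative, not part of any library, and the heuristic does not see rules that come from CSS classes or external stylesheets; that would need a real rendering engine.

```python
import re
from bs4 import BeautifulSoup

# Heuristic (an assumption of this sketch): an inline "display: none" declaration.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def rendered_text_guess(html):
    """Approximate the text a browser would show for simple documents."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements that never render as text.
    for tag in soup(["style", "script"]):
        tag.extract()
    # Remove elements hidden through an inline style attribute.
    for tag in soup.find_all(style=HIDDEN_STYLE):
        tag.extract()
    # stripped_strings skips whitespace-only strings and trims the rest.
    return " ".join(soup.stripped_strings)


sample = """
<body>
  lots of text here
  <h1> even headings </h1>
  <style type="text/css"> <div> this will not be visible </div> </style>
  <div style="display: none"> hidden by inline style </div>
</body>
"""
print(rendered_text_guess(sample))
```

With this sample the text nested in the style tag and the inline-hidden div are both left out of the result.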

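I tried both of these solutions, html2text and nltk.clean_html, and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data.

One answer here, from @Helge, was about using nltk of all things.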

import nltk

%timeit nltk.clean_html(html)
was returning 153 us per loop

It worked really well, returning a string with the rendered HTML. This nltk module was faster even than html2text, though html2text is perhaps more robust.

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
3.09 ms per loop

The simplest way to do this with Beautiful Soup, with the least code, and get just the strings without empty lines and junk:

# HTML string of the parent tag that contains the data (placeholder markup)
tag = '<div> <p> some text </p> </div>'
soup = BeautifulSoup(tag, 'html.parser')

for i in soup.stripped_strings:
    print(repr(i))

If you care about performance, here is another, more efficient way:

import re

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')

def visible_texts(soup):
    """ get visible text from a document """
    text = ' '.join([
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    ])
    # collapse multiple spaces to two spaces.
    return RE_SPACES.sub('  ', text)

soup.strings is an iterator, and it returns NavigableString objects, so you can check the parent's tag name directly without going through multiple loops.
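As a small illustration of that point, the sketch below walks `soup.strings` once and reads each string's parent tag name. The markup is a made-up sample and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Made-up sample document for illustration.
html = "<html><head><title>T</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Each item is a NavigableString; .parent is the Tag that directly contains it.
pairs = [(s.parent.name, str(s)) for s in soup.strings]
print(pairs)
```

Filtering on `s.parent.name` inside that single pass is what lets the function above avoid a second traversal.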

While I would completely suggest using beautiful-soup in general, if anyone is looking to display the visible parts of malformed HTML (for example, where you have just a segment or a single line of a web page) for whatever reason, the following removes the content between < and > tags:

import re   ## only use with malformed html - this is not efficient

def display_visible_html_using_re(text):
    # Raw string avoids invalid-escape warnings; < and > need no escaping
    return re.sub(r"<.*?>", "", text)
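A short usage sketch of the same regex approach, with made-up input strings, shows both what it handles and where it falls short. The names `TAG_PATTERN` and `strip_tags` are illustrative.

```python
import re

# Non-greedy match of anything between "<" and ">" on a single line.
TAG_PATTERN = re.compile(r"<.*?>")


def strip_tags(fragment):
    """Remove tag markup from an HTML fragment, keeping everything else."""
    return TAG_PATTERN.sub("", fragment)


unclosed = strip_tags("<p>Storm <b>hits</b> the East")
with_script = strip_tags("<script>var x = 1;</script>ok")
print(unclosed)     # Storm hits the East
print(with_script)  # var x = 1;ok
```

The second call illustrates the limitation: only the tags are removed, so script or style contents remain in the output.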

The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with id "article".

soup.findAll('nyt_headline', limit=1)

Should work.

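The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with id "articleBody". Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not within those <p> tags. It is difficult for me to experiment with the syntax, but I expect a working scrape to look something like this: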

text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')
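To make the structure described above concrete, here is a minimal sketch against a made-up fragment that mimics it. The headline, paragraph text, and image name are invented for illustration, and `html.parser` is an assumed parser choice; the real page may differ.

```python
from bs4 import BeautifulSoup

# Invented markup that follows the nesting described above.
html = """
<div id="article">
  <h1><nyt_headline>Storm Hits East Coast</nyt_headline></h1>
  <div id="articleBody">
    <nyt_text>
      <p>First paragraph.</p>
      <img src="storm.jpg">
      <p>Second paragraph.</p>
    </nyt_text>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Custom tag names can be searched like any other tag name.
headline = soup.find("nyt_headline").get_text(strip=True)
body = soup.find("nyt_text")
# Only <p> children carry article text; the image is skipped automatically.
paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]

print(headline)
print(paragraphs)
```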

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re
import ssl

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    # Skip strings that start with a newline (typically whitespace-only nodes)
    if re.match(r"[\n]+", str(element)):
        return False
    return True


def text_from_html(url):
    body = urllib.request.urlopen(url,context=ssl._create_unverified_context()).read()
    soup = BeautifulSoup(body ,"lxml")
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    text = u",".join(t.strip() for t in visible_texts)
    text = text.lstrip().rstrip()
    text = text.split(',')
    clean_text = ''
    for sen in text:
        if sen:
            sen = sen.rstrip().lstrip()
            clean_text += sen+','
    return clean_text
url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
print(text_from_html(url))

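Update

From the docs: As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are generally not considered to be 'text', since those tags are not part of the human-visible content of the page.

To get all the human-readable text of the HTML <body>, you can use .get_text(). To get rid of redundant whitespace and the like, set the strip parameter and join everything with a single space: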

import bs4, requests

response = requests.get('https://www.nytimes.com/interactive/2022/09/13/us/politics/congress-stock-trading-investigation.html',headers={'User-Agent': 'Mozilla/5.0','cache-control': 'max-age=0'}, cookies={'cookies':''})
soup = bs4.BeautifulSoup(response.text, 'html.parser')

soup.article.get_text(' ', strip=True)
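The same call can be tried offline. The sketch below uses a made-up document and assumes Beautiful Soup 4.9.0 or later with `html.parser`, so the script contents should be left out of the result.

```python
from bs4 import BeautifulSoup

# Made-up document with a script inside the body.
html = """<html><head><title>Storm report</title>
<style>p { color: red; }</style></head>
<body>
  <p>First paragraph.</p>
  <script>var tracking = 1;</script>
  <p> Second paragraph. </p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# " " is the separator placed between strings; strip=True trims each string
# and drops the ones that are empty after trimming.
body_text = soup.body.get_text(" ", strip=True)
print(body_text)
```

Note that strip=True only trims the ends of each string; runs of whitespace inside a string are kept as they are.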

In newer code, avoid the old findAll() syntax; use find_all() instead, or select() with CSS selectors. For more, take a minute to check the docs.
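A brief side-by-side sketch of the two newer spellings, using made-up markup and `html.parser` as an assumed parser:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = """
<div id="article">
  <h1>Headline</h1>
  <p class="lead">First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() is the current name for what older code calls findAll().
paragraphs = soup.find_all("p")

# select() takes a CSS selector; select_one() returns the first match or None.
via_css = soup.select("div#article p")
lead = soup.select_one("p.lead")

print([p.get_text() for p in via_css])
print(lead.get_text())
```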

The simplest way to handle this case is to use getattr(). You can adapt this example to your needs:

from bs4 import BeautifulSoup

source_html = """
<span class="ratingsDisplay">
    <a class="ratingNumber" href="https://www.youtube.com/watch?v=oHg5SJYRHA0" target="_blank" rel="noopener">
        <span class="ratingsContent">3.7</span>
    </a>
</span>
"""

soup = BeautifulSoup(source_html, "lxml")
my_ratings = getattr(soup.find('span', {"class": "ratingsContent"}), "text", None)
print(my_ratings)

This finds the text element "3.7" inside the tag object <span class="ratingsContent">3.7</span> when it exists, and defaults to None when it does not.

getattr(object, name[, default])

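Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object's attributes, the result is the value of that attribute. For example, getattr(x, 'foobar') is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided; otherwise AttributeError is raised.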
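To show the fallback path as well, the sketch below runs the same getattr() pattern once for a class that exists and once for one that does not. The class name `noSuchClass` is made up, and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="ratingsContent">3.7</span>', "html.parser")

# find() returns a Tag here, so getattr reads its .text attribute.
found = getattr(soup.find("span", {"class": "ratingsContent"}), "text", None)

# find() returns None here; None has no .text, so the default is used.
missing = getattr(soup.find("span", {"class": "noSuchClass"}), "text", None)

print(found)    # 3.7
print(missing)  # None
```

This avoids the AttributeError that `soup.find(...).text` would raise when nothing matches.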

Source URL: https://stackoverflow.com/questions/1936466/how-to-scrape-only-visible-webpage-text-with-beautifulsoup
