Using Python requests and BeautifulSoup to download a web page to a txt file and strip the HTML tags

wxchong 2025-06-04 02:21:24 Open Source Technology

import requests
from bs4 import BeautifulSoup

url = "https://www.5a8.com"
filename = "www5a8com.txt"

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Detect the encoding from the response body
    response.encoding = response.apparent_encoding

    # Extract plain text with BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")
    visible_text = soup.get_text(separator="\n", strip=True)  # separate text fragments with newlines

    # Save the extracted text
    with open(filename, "w", encoding="utf-8") as f:
        f.write(visible_text)
    print(f"Visible text saved to {filename}")

except requests.exceptions.RequestException as e:
    print(f"Download failed: {e}")
except Exception as e:
    print(f"Error while processing: {e}")

How to run

D:\code\python\get>python geturl1.py
Visible text saved to www5a8com.txt
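
The tag-stripping step can be tried without any network access. The following is a minimal sketch, not part of the original script: the helper name html_to_text and the sample HTML string are made up for illustration. It also removes script, style and noscript elements explicitly before calling get_text, on the assumption that you do not want their contents in the output regardless of the installed BeautifulSoup version.

```python
from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Return the text of an HTML string, one text fragment per line.

    Hypothetical helper for illustration; not part of the original script.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements whose contents are code or styling, not readable text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Same call as in the script: strip each fragment, join with newlines.
    return soup.get_text(separator="\n", strip=True)


# Made-up sample page used only to demonstrate the behaviour.
sample = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><h1>Hello</h1><p>World <b>again</b></p>"
    "<script>var x = 1;</script></body></html>"
)

if __name__ == "__main__":
    print(html_to_text(sample))
```

Note that separator="\n" places every text fragment on its own line, so inline markup such as <b> also causes a line break ("World" and "again" end up on separate lines). If that is not wanted, separator=" " is a possible alternative.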
