Reading a sitemap with Python and pushing its URLs through the Baidu API

Tech Blog

February 1, 2021

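SEO matters a great deal for promoting a website. Most search engines provide APIs that let webmasters submit URLs proactively, which speeds up the indexing of their pages.

Baidu provides an API for faster indexing. The Python script below reads a sitemap.xml file from the local disk and calls that API to submit URLs to Baidu.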

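Only the following parameters need to be changed:

lastUpdateTimeStr - the time of the last push. It is compared with the lastmod timestamps in sitemap.xml, and only URLs updated at or after this time are pushed. Entries without a lastmod are always pushed.
siteMapPath - path to sitemap.xml on the local disk
siteUrl - the site address
baiduApiToken - the Baidu API token
tmpFile - path of the temporary file that holds the URLs to submit
ignorePathPrefixes - prefixes of the URLs to skip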
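
To make the effect of lastUpdateTimeStr and ignorePathPrefixes concrete, here is a minimal sketch that applies the same filter to an inline sitemap snippet. It is separate from the script that follows: the example.com URLs, the name select_urls, and the use of a namespace map (instead of stripping namespaces) are assumptions made for illustration. It also assumes Python 3.7 or later, where %z accepts offsets written as +08:00.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample data; a real sitemap.xml would be read from disk.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a/</loc><lastmod>2021-01-27T10:00:00+08:00</lastmod></url>
  <url><loc>https://example.com/b/</loc><lastmod>2021-01-20T10:00:00+08:00</lastmod></url>
  <url><loc>https://example.com/tags/x/</loc><lastmod>2021-01-28T10:00:00+08:00</lastmod></url>
  <url><loc>https://example.com/c/</loc></url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
FMT = "%Y-%m-%dT%H:%M:%S%z"


def select_urls(xml_text, last_update_str, ignore_prefixes):
    """Return URLs modified at or after last_update_str, minus ignored prefixes."""
    threshold = datetime.strptime(last_update_str, FMT)
    selected = []
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod_text = url.findtext("sm:lastmod", namespaces=NS)
        # Entries without <lastmod> fall back to the threshold, so they pass the check.
        lastmod = datetime.strptime(lastmod_text.strip(), FMT) if lastmod_text else threshold
        if not loc or loc.startswith(tuple(ignore_prefixes)):
            continue
        if lastmod >= threshold:
            selected.append(loc)
    return selected


urls = select_urls(
    SAMPLE_SITEMAP,
    "2021-01-26T00:00:00+08:00",
    ["https://example.com/tags/"],
)
print(urls)  # ['https://example.com/a/', 'https://example.com/c/']
```

The /b/ entry is older than the threshold and the /tags/ entry matches an ignored prefix, so both are dropped; /c/ has no lastmod and is kept.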

#!/usr/bin/env python3
# coding: utf-8

import xml.etree.ElementTree as ET
from datetime import datetime
import os


### Methods #########
def stripNs(el):
  # Recursively search this element tree, removing namespaces.
  if el.tag.startswith("{"):
    el.tag = el.tag.split('}', 1)[1]  # strip namespace
  for k in list(el.attrib.keys()):  # iterate over a copy: the dict is modified inside the loop
    if k.startswith("{"):
      k2 = k.split('}', 1)[1]
      el.attrib[k2] = el.attrib[k]
      del el.attrib[k]
  for child in el:
    stripNs(child)


### Arguments to change ####
lastUpdateTimeStr='2021-01-26T00:00:00+08:00'
siteMapPath='public/sitemap.xml'
siteUrl='https://www.zengxi.net'
baiduApiToken='faketoken'
tmpFile="/tmp/submitSiteMap"
ignorePathPrefixes=[
     'https://www.zengxi.net/archives/',
     'https://www.zengxi.net/categories/',
     'https://www.zengxi.net/links/',
     'https://www.zengxi.net/posts/',
     'https://www.zengxi.net/series/',
     'https://www.zengxi.net/tags/'
]

### CONSTANTS ###
SITEMAP_DATETIME_FORMAT='%Y-%m-%dT%H:%M:%S%z'


lastUpdateTime=datetime.strptime(lastUpdateTimeStr, SITEMAP_DATETIME_FORMAT)


tree = ET.parse(siteMapPath)
urlset = tree.getroot()

with open(tmpFile, 'w') as f:
     for url in urlset:
          location = ''
          lastmod = lastUpdateTime

          for urlChild in url:
               stripNs(urlChild)

               if urlChild.tag == 'loc':
                    location = (urlChild.text or '').strip()
               elif urlChild.tag == 'lastmod':
                    # Tolerate whitespace around the timestamp
                    lastmod = datetime.strptime(urlChild.text.strip(), SITEMAP_DATETIME_FORMAT)
          
          # Skip URLs that start with any of the ignored prefixes
          if location.startswith(tuple(ignorePathPrefixes)):
               continue

          if lastmod >= lastUpdateTime:
               f.write(location + '\n')


# Submit the collected URLs with curl; the request body is the temp file, one URL per line
command="""
curl -H 'Content-Type:text/plain' --data-binary @{filePath} "http://data.zz.baidu.com/urls?site={siteUrl}&token={token}"
"""

commandToExecute=command.format(filePath=tmpFile, siteUrl=siteUrl, token=baiduApiToken)
tmpres = os.popen(commandToExecute).readlines()

# Note: the printed command includes the API token
print(commandToExecute)
print(tmpres)
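
If curl is not available, the same request could probably be sent from Python's standard library. The following is a minimal sketch rather than part of the original script: build_push_request and push_urls are hypothetical names, the two article URLs are made up, and the endpoint, query parameters and Content-Type header are simply copied from the curl command above.

```python
import urllib.parse
import urllib.request


def build_push_request(urls, site_url, token):
    """Build the POST request: plain-text body with one URL per line."""
    # safe=":/" keeps the site value unescaped, matching the curl command.
    query = urllib.parse.urlencode({"site": site_url, "token": token}, safe=":/")
    body = "\n".join(urls).encode("utf-8")
    return urllib.request.Request(
        "http://data.zz.baidu.com/urls?" + query,
        data=body,
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def push_urls(urls, site_url, token, timeout=10):
    """Send the request and return the raw response text."""
    request = build_push_request(urls, site_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request needs no network access; 'faketoken' is a placeholder.
request = build_push_request(
    ["https://www.zengxi.net/a/", "https://www.zengxi.net/b/"],  # hypothetical URLs
    "https://www.zengxi.net",
    "faketoken",
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Calling push_urls with a real token would perform the actual submission. Keeping the request construction in its own function makes it possible to inspect the URL and body before anything is sent.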

Last updated on November 30, 2024
