我的个人博客:逐步前行STEP

使用Goutte + GuzzleHttp 爬取网页时,如下代码中的请求头设置无效:

$jar = CookieJar::fromArray([
            "HMACCOUNT" => 'C0CDC28BD0110387',
        ], self::$host);

        $client = new GoutteClient();

        $guzzle_client = new GuzzleClient([
            'timeout'=>20,
            'headers'=>[
                'Referer'=>$prefix_url,
                'User-Agent'=>'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
            ],
            'cookies' => $jar,
            'debug'=>true,
        ]);

        $client->setClient($guzzle_client);

经过研究源码发现,User-Agent 请求字段使用了默认值,没有应用传入的参数,而cookies配置则因为语法问题被覆盖丢失。

以下是具体探究过程:
vendor/symfony/browser-kit/Client.php中:


......
    /**
     * @param array     $server    The server parameters (equivalent of $_SERVER)
     * @param History   $history   A History instance to store the browser history
     * @param CookieJar $cookieJar A CookieJar instance to store the cookies
     */
    public function __construct(array $server = [], History $history = null, CookieJar $cookieJar = null)
    {
        $this->setServerParameters($server);
        $this->history = $history ?: new History();
        $this->cookieJar = $cookieJar ?: new CookieJar();
    }
......
    /**
     * Sets server parameters.
     *
     * @param array $server An array of server parameters
     */
    public function setServerParameters(array $server)
    {
        $this->server = array_merge([
            'HTTP_USER_AGENT' => 'Symfony BrowserKit',
        ], $server);
    }
......

设置了$this->sever的初始值,然后在该文件的:

public function request(string $method, string $uri, array $parameters = [], array $files = [], array $server = [], string $content = null, bool $changeHistory = true)
    {
......

        $server = array_merge($this->server, $server);
......
        $this->internalRequest = new Request($uri, $method, $parameters, $files, $this->cookieJar->allValues($uri), $server, $content);
......
        if ($this->insulated) {
            $this->response = $this->doRequestInProcess($this->request);
        } else {
            $this->response = $this->doRequest($this->request);
        }
......

如果Goutte 的 request 中没有设置相同键的sever ,生成的请求对象的sever属性就初始化包含HTTP_USER_AGENT(因为当前需求是在实例化的时候传参作为全局配置,不考虑在request之前设置header来使配置生效的方案),而在vendor/fabpot/goutte/Goutte/Client.php中:

protected function doRequest($request)
    {
        $headers = array();
        foreach ($request->getServer() as $key => $val) {
            $key = strtolower(str_replace('_', '-', $key));
            $contentHeaders = array('content-length' => true, 'content-md5' => true, 'content-type' => true);
            if (0 === strpos($key, 'http-')) {
                $headers[substr($key, 5)] = $val;
            }
            // CONTENT_* are not prefixed with HTTP_
            elseif (isset($contentHeaders[$key])) {
                $headers[$key] = $val;
            }
        }
......
        if (!empty($headers)) {
            $requestOptions['headers'] = $headers;
        }
......
        // Let BrowserKit handle redirects
        try {
            $response = $this->getClient()->request($method, $uri, $requestOptions);
            }
......

可见,Request的sever属性被用于作为GuzzleHttp实例的请求头,不过在上面的代码中,键 HTTP_USER_AGENT 已经被更改为user-agent,而从vendor/guzzlehttp/guzzle/src/Client.php文件可以看出 GuzzleHttp 实例的request方法调用了requestAsync方法,requestAsync中将上面代码传入的$requestOptions 作为请求头字段,在该文件中,从构造器可知,本文第一段代码中传入构造器的参数都会作为配置使用,在方法configureDefaults和prepareDefaults都有做处理,并将传入的请求头从以header为键换成了以_conditional为键:

private function prepareDefaults($options)
    {
        $defaults = $this->config;

        if (!empty($defaults['headers'])) {
            // Default headers are only added if they are not present.
            $defaults['_conditional'] = $defaults['headers'];
            unset($defaults['headers']);
        }
......
}

vendor/guzzlehttp/guzzle/src/Client.php的:

private function applyOptions(RequestInterface $request, array &$options)
{
......

        // Merge in conditional headers if they are not present.
        if (isset($options['_conditional'])) {
            // Build up the changes so it's in a single clone of the message.
            $modify = [];
            foreach ($options['_conditional'] as $k => $v) {
                if (!$request->hasHeader($k)) {
                    $modify['set_headers'][$k] = $v;
                }
            }
......

查找了_conditional数据是否在Request对象的请求头中存在,不存在就新增,至此,User-Agent配置失效的原因出来了,就是在此处被丢弃了,作如下修改,将传入的参数覆盖默认参数:

private function applyOptions(RequestInterface $request, array &$options)
{
......

        // Merge in conditional headers if they are not present.
        if (isset($options['_conditional'])) {
            // Build up the changes so it's in a single clone of the message.
            $modify = [];
            foreach ($options['_conditional'] as $k => $v) {
                if (!$request->hasHeader($k)) {
                //改动此处
                    $modify['set_headers'][$k] = $v;
                }
            }
......

这样,User-Agent配置就可以被正确使用了。

然而,设置的cookie还是无效,
继续调试源码,可以发现,在vendor/guzzlehttp/guzzle/src/Client.php的prepareDefaults 函数:

private function prepareDefaults($options)
    {
......
        // Shallow merge defaults underneath options.
        $result = $options + $defaults;
......        return $result;
    }

有一个合并数组的语句 $result = $options + $defaults;,但是,经过测试,该语句没有进行数组合并,我的php版本是7.1.3,这个应该跟版本有关,暂时没有查资料看具体适用于什么版本,我这儿直接改了就好了,类似还有该文件的另一处地方:

private function configureDefaults(array $config)
    {
......
        $this->config = $config + $defaults;
......

将其改成:

private function configureDefaults(array $config)
    {
......
        $this->config = array_merge($defaults, $config);
......

即可(array_merge中注意两个数组的顺序)。

标签: none

添加新评论