My personal blog: 逐步前行STEP
When crawling web pages with Goutte + GuzzleHttp, the request header settings in the following code have no effect:
$jar = CookieJar::fromArray([
"HMACCOUNT" => 'C0CDC28BD0110387',
], self::$host);
$client = new GoutteClient();
$guzzle_client = new GuzzleClient([
'timeout'=>20,
'headers'=>[
'Referer'=>$prefix_url,
'User-Agent'=>'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
],
'cookies' => $jar,
'debug'=>true,
]);
$client->setClient($guzzle_client);
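After reading the source code, I found that the User-Agent request header uses a default value instead of the value passed in, and that the cookies setting is overwritten and lost when the option arrays are merged.
The investigation went as follows.
In vendor/symfony/browser-kit/Client.php: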
......
/**
* @param array $server The server parameters (equivalent of $_SERVER)
* @param History $history A History instance to store the browser history
* @param CookieJar $cookieJar A CookieJar instance to store the cookies
*/
public function __construct(array $server = [], History $history = null, CookieJar $cookieJar = null)
{
$this->setServerParameters($server);
$this->history = $history ?: new History();
$this->cookieJar = $cookieJar ?: new CookieJar();
}
......
/**
* Sets server parameters.
*
* @param array $server An array of server parameters
*/
public function setServerParameters(array $server)
{
$this->server = array_merge([
'HTTP_USER_AGENT' => 'Symfony BrowserKit',
], $server);
}
......
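To make the effect of that array_merge call concrete, here is a minimal sketch. The function mergeServerParameters is a hypothetical stand-in written for this post, not the BrowserKit code itself, and MyCrawler/1.0 is a made-up value.

```php
<?php
// Hypothetical stand-in for the merge done in setServerParameters().
function mergeServerParameters(array $server): array
{
    // For string keys, array_merge lets later arrays override earlier ones,
    // so the default is used only when the caller passes no HTTP_USER_AGENT.
    return array_merge(['HTTP_USER_AGENT' => 'Symfony BrowserKit'], $server);
}

$withoutOverride = mergeServerParameters([]);
$withOverride = mergeServerParameters(['HTTP_USER_AGENT' => 'MyCrawler/1.0']);

echo $withoutOverride['HTTP_USER_AGENT'], PHP_EOL; // Symfony BrowserKit
echo $withOverride['HTTP_USER_AGENT'], PHP_EOL;    // MyCrawler/1.0
```

Because the Goutte client in this post is created with no server parameters, it ends up in the first case.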
The constructor therefore initializes $this->server with a default HTTP_USER_AGENT. Then, in the same file:
public function request(string $method, string $uri, array $parameters = [], array $files = [], array $server = [], string $content = null, bool $changeHistory = true)
{
......
$server = array_merge($this->server, $server);
......
$this->internalRequest = new Request($uri, $method, $parameters, $files, $this->cookieJar->allValues($uri), $server, $content);
......
if ($this->insulated) {
$this->response = $this->doRequestInProcess($this->request);
} else {
$this->response = $this->doRequest($this->request);
}
......
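If the Goutte request call does not pass a server entry with the same key, the server property of the generated Request object still contains the initial HTTP_USER_AGENT. (The requirement here is to pass the settings once at instantiation as global configuration, so I am not considering the option of setting the header before each request.) Then, in vendor/fabpot/goutte/Goutte/Client.php: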
protected function doRequest($request)
{
$headers = array();
foreach ($request->getServer() as $key => $val) {
$key = strtolower(str_replace('_', '-', $key));
$contentHeaders = array('content-length' => true, 'content-md5' => true, 'content-type' => true);
if (0 === strpos($key, 'http-')) {
$headers[substr($key, 5)] = $val;
}
// CONTENT_* are not prefixed with HTTP_
elseif (isset($contentHeaders[$key])) {
$headers[$key] = $val;
}
}
......
if (!empty($headers)) {
$requestOptions['headers'] = $headers;
}
......
// Let BrowserKit handle redirects
try {
$response = $this->getClient()->request($method, $uri, $requestOptions);
}
......
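The key conversion in that loop can be tried in isolation. The following is a rough sketch that copies the loop into a hypothetical standalone function, serverToHeaders; the input values are made up for illustration.

```php
<?php
// Hypothetical standalone copy of the header-building loop in doRequest().
function serverToHeaders(array $server): array
{
    $contentHeaders = ['content-length' => true, 'content-md5' => true, 'content-type' => true];
    $headers = [];
    foreach ($server as $key => $val) {
        // HTTP_USER_AGENT becomes http-user-agent
        $key = strtolower(str_replace('_', '-', $key));
        if (0 === strpos($key, 'http-')) {
            // Strip the "http-" prefix: http-user-agent becomes user-agent
            $headers[substr($key, 5)] = $val;
        } elseif (isset($contentHeaders[$key])) {
            // CONTENT_* keys are not prefixed with HTTP_
            $headers[$key] = $val;
        }
        // Any other server key (for example REMOTE_ADDR) is ignored.
    }
    return $headers;
}

$headers = serverToHeaders([
    'HTTP_USER_AGENT' => 'Symfony BrowserKit',
    'CONTENT_TYPE' => 'text/html',
    'REMOTE_ADDR' => '127.0.0.1',
]);
print_r($headers);
```

The result contains user-agent and content-type only, which means every request built this way already carries a user-agent header before Guzzle's defaults are applied.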
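As the Goutte doRequest code shows, the server property of the Request is used to build the request headers for the GuzzleHttp client, and in that loop the key HTTP_USER_AGENT is converted to user-agent. In vendor/guzzlehttp/guzzle/src/Client.php, the request method of the GuzzleHttp client calls requestAsync, which receives the $requestOptions passed in by Goutte, including these headers. The constructor in the same file shows that the arguments passed in the first code snippet of this post are all used as configuration. They are processed in configureDefaults and prepareDefaults, and the configured headers are moved from the headers key to the _conditional key: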
private function prepareDefaults($options)
{
$defaults = $this->config;
if (!empty($defaults['headers'])) {
// Default headers are only added if they are not present.
$defaults['_conditional'] = $defaults['headers'];
unset($defaults['headers']);
}
......
}
In applyOptions, in vendor/guzzlehttp/guzzle/src/Client.php:
private function applyOptions(RequestInterface $request, array &$options)
{
......
// Merge in conditional headers if they are not present.
if (isset($options['_conditional'])) {
// Build up the changes so it's in a single clone of the message.
$modify = [];
foreach ($options['_conditional'] as $k => $v) {
if (!$request->hasHeader($k)) {
$modify['set_headers'][$k] = $v;
}
}
......
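The "only if not present" behavior can be reproduced with plain arrays. This is a simplified sketch, not Guzzle's implementation: hasHeader and applyConditionalHeaders are hypothetical helpers, and the header values are placeholders. The case-insensitive comparison mirrors how PSR-7 header names are matched.

```php
<?php
// Hypothetical helper: case-insensitive lookup, like RequestInterface::hasHeader().
function hasHeader(array $headers, string $name): bool
{
    foreach (array_keys($headers) as $existing) {
        if (strcasecmp($existing, $name) === 0) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: add each default header only when the request lacks it.
function applyConditionalHeaders(array $requestHeaders, array $conditional): array
{
    foreach ($conditional as $name => $value) {
        if (!hasHeader($requestHeaders, $name)) {
            $requestHeaders[$name] = $value;
        }
    }
    return $requestHeaders;
}

// The request already carries the header derived from the BrowserKit default.
$requestHeaders = ['user-agent' => 'Symfony BrowserKit'];
// Headers configured on the Guzzle client (placeholder values).
$conditional = ['Referer' => 'https://example.com/', 'User-Agent' => 'Mozilla/5.0 (example)'];

$final = applyConditionalHeaders($requestHeaders, $conditional);
print_r($final);
```

Here Referer is added because it was missing, while the configured User-Agent is skipped because user-agent is already set.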
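applyOptions checks whether each _conditional header already exists on the Request object and adds it only if it is missing. This is why the User-Agent setting has no effect: the request already carries the user-agent header derived from the BrowserKit default, so the configured value is discarded here. Modify the method as follows so that the passed-in value overrides the default: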
private function applyOptions(RequestInterface $request, array &$options)
{
......
// Merge in conditional headers if they are not present.
if (isset($options['_conditional'])) {
// Build up the changes so it's in a single clone of the message.
$modify = [];
foreach ($options['_conditional'] as $k => $v) {
if (!$request->hasHeader($k)) {
// Changed here
$modify['set_headers'][$k] = $v;
}
}
......
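With this change, the User-Agent setting is applied correctly.
However, the configured cookie still has no effect.
Debugging further shows that in the prepareDefaults function of vendor/guzzlehttp/guzzle/src/Client.php: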
private function prepareDefaults($options)
{
......
// Shallow merge defaults underneath options.
$result = $options + $defaults;
......
return $result;
}
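prepareDefaults combines the two arrays with the statement $result = $options + $defaults;
In my tests (PHP 7.1.3), the cookies value I configured did not survive this step. The + operator here is PHP's array union: when both arrays contain the same key, the value from the left-hand array is kept, and this behavior does not depend on the PHP version. So if the per-request $options already contains a cookies key, the configured default is dropped. I simply changed the statement and the problem went away. There is a similar statement elsewhere in the same file: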
private function configureDefaults(array $config)
{
......
$this->config = $config + $defaults;
......
Change it to:
private function configureDefaults(array $config)
{
......
$this->config = array_merge($defaults, $config);
......
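That is all (with array_merge, mind the order of the two arrays: for string keys, values from later arrays override earlier ones).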
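The difference between the + operator and array_merge, and why the argument order matters, can be checked with a small sketch. The array contents below are placeholders chosen for illustration, not values taken from Guzzle or Goutte.

```php
<?php
// Placeholder defaults (from the client constructor) and per-request options.
$defaults = ['cookies' => 'jar-from-constructor', 'timeout' => 20];
$options = ['cookies' => 'jar-from-request', 'allow_redirects' => false];

// Array union: for duplicate keys, the LEFT operand wins.
$union = $options + $defaults;

// array_merge: for duplicate string keys, the LAST argument wins.
$mergedDefaultsFirst = array_merge($defaults, $options);
$mergedOptionsFirst = array_merge($options, $defaults);

echo $union['cookies'], PHP_EOL;               // jar-from-request
echo $mergedDefaultsFirst['cookies'], PHP_EOL; // jar-from-request
echo $mergedOptionsFirst['cookies'], PHP_EOL;  // jar-from-constructor
```

For string keys, $a + $b and array_merge($b, $a) yield the same key/value pairs, so what decides the outcome is which array is given precedence for a shared key such as cookies.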