PHP + Tidy + DOMDocument (handles garbled Chinese text in UTF-8):
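// Fetch the page; $url is assumed to be defined by the caller.
$html = file_get_contents($url);

// Repair the markup with Tidy, with the encoding set to UTF-8.
$tidy = new tidy();
$conf = array(
    'output-xhtml'     => true,
    'drop-empty-paras' => false,
    'join-classes'     => true,
    'show-body-only'   => true,
    'output-encoding'  => 'raw',
);
$html = $tidy->repairString($html, $conf, 'utf8');

// loadHTML() falls back to ISO-8859-1 unless the markup declares a charset,
// so a UTF-8 hint is prepended to keep Chinese text intact
// (the meta tag is an assumed form of that hint).
$dom = new DOMDocument;
@$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);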
// Collect links from tables whose class list includes baseTB or listTB.
$items = array();
foreach ($dom->getElementsByTagName('table') as $table) {
    if (!$table->hasAttribute('class')) {
        continue;
    }
    $class = explode(' ', $table->getAttribute('class'));
    if (in_array('baseTB', $class) || in_array('listTB', $class)) {
        $rows = $table->getElementsByTagName('tr');
        foreach ($rows as $row) {
            foreach ($row->getElementsByTagName('a') as $a) {
                // Skip anchors that have no text content.
                if ($a->nodeValue) {
                    $items[] = array(
                        'name' => $a->nodeValue,
                        'href' => $a->getAttribute('href'),
                    );
                }
            }
        }
        //print_r($items);
        //echo "<br>";
    }
}
return $items;
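
The same idea can be tried without a live URL or the Tidy extension. The following is only a minimal sketch under stated assumptions: the helper name extractTableLinks, its parameters, and the sample markup are hypothetical and not from the original snippet. It prepends a meta charset hint before loadHTML(), uses libxml_use_internal_errors() in place of the @ operator, and walks the anchors of each matching table directly instead of row by row.

```php
<?php
// Hypothetical helper (not from the original post): returns the links found
// in tables whose class attribute contains one of the wanted class names.
function extractTableLinks(string $html, array $wantedClasses): array
{
    $items = array();
    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of silencing with @.
    $previous = libxml_use_internal_errors(true);
    // Assumed charset hint: tells loadHTML() to read the input as UTF-8.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('table') as $table) {
        // Split the class attribute on any whitespace; a missing attribute yields [''].
        $classes = preg_split('/\s+/', trim($table->getAttribute('class')));
        if (!array_intersect($wantedClasses, $classes)) {
            continue;
        }
        foreach ($table->getElementsByTagName('a') as $a) {
            $name = trim($a->nodeValue);
            if ($name !== '') {
                $items[] = array(
                    'name' => $name,
                    'href' => $a->getAttribute('href'),
                );
            }
        }
    }
    return $items;
}

// Made-up sample markup: one matching table and one that should be skipped.
$sample = '<table class="listTB wide"><tr><td><a href="/news/1">中文標題</a></td></tr></table>'
        . '<table class="other"><tr><td><a href="/skip">skip</a></td></tr></table>';

$links = extractTableLinks($sample, array('baseTB', 'listTB'));
print_r($links);
```

Splitting the class attribute on whitespace rather than a single space also tolerates tabs or repeated spaces between class names, which may matter when Tidy is not run first.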
References:
PHP + Tidy: perfect XHTML error correction and filtering (PHP tips, 脚本之家)
A small problem with DOMDocument->loadHTML() handling Chinese - Fwolf's Blog