Tuesday, December 11, 2012

PHP Tidy + DOMDocument: HTML parsing example

HTML :

   
        ....1....
   
   
        ....2....
   
   
        ....3....
   
   
        ....4....
   





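PHP + Tidy + DOMDocument (handles garbled Chinese text (UTF-8)):
            // Fetch the raw page; $url is assumed to be set by the caller.
            $html = file_get_contents($url);
            $items = array(); // initialise so the return value is defined even when nothing matches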

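            // Repair the markup with Tidy before handing it to DOMDocument.
            $tidy = new tidy();
            $conf = array(
                'output-xhtml' => true,      // emit well-formed XHTML
                'drop-empty-paras' => false, // keep empty <p> elements
                'join-classes' => true,      // merge repeated class attributes into one
                'show-body-only' => true,    // return only the contents of <body>
                'output-encoding' => 'raw',  // do not turn values above 127 into entities
            );
            // The third argument tells Tidy to treat the text as UTF-8.
            $html = $tidy->repairString($html, $conf, 'utf8');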

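            $dom = new DOMDocument;
            // Prepend a charset hint so loadHTML() reads the fragment as UTF-8.
            // The prefix is empty in this listing; the meta tag below is an assumed stand-in.
            @$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);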

            foreach($dom->getElementsByTagName('table') as $table)
            {
                if ( ! $table->hasAttribute('class'))
                {
                    continue;
                }

                $class = explode(' ', $table->getAttribute('class'));

                if ( in_array('baseTB', $class) || in_array('listTB', $class) )
                {
                    $rows = $table->getElementsByTagName("tr");

                    foreach ($rows as $row) {
                        foreach($row->getElementsByTagName('a') as $a)
                        {
                            if($a->nodeValue)
                            {
                                $items[] = array(
                                    'name' => $a->nodeValue,
                                    'href' => $a->getAttribute('href')
                                );
                            }
                        }
                    }
                    //print_r($items);
                }
            }

            return $items;
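
As a quick check of the encoding handling in the listing above, here is a minimal sketch that loads a UTF-8 fragment with a charset hint prepended, so that `loadHTML()` does not have to guess the encoding. The sample markup and the `$hint` variable are illustrative assumptions, not part of the original post, and `libxml_use_internal_errors()` is used here in place of the `@` operator.

```php
<?php
// Hypothetical UTF-8 fragment, standing in for Tidy's body-only output.
$html = '<a href="/news">中文連結</a>';

// Assumed charset hint; without one, loadHTML() may misread multi-byte text.
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($hint . $html);
libxml_clear_errors();

// Read back the first link to confirm the text survived the round trip.
$a = $dom->getElementsByTagName('a')->item(0);
$text = $a->nodeValue;
$href = $a->getAttribute('href');

echo $text, ' => ', $href, "\n";
```

If the link text comes back garbled when the hint is removed, the encoding guess is the likely cause rather than Tidy.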




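The steps above can also be wrapped in a function so they run on their own. The following is only a sketch under stated assumptions: the name `extractTableLinks`, its signature, and the sample markup are hypothetical, the Tidy step is skipped when the extension is not loaded, and the charset hint is the same assumed meta tag.

```php
<?php
/**
 * Collect link text and href from tables carrying one of the given classes.
 * Hypothetical helper; the name and signature are not from the original post.
 */
function extractTableLinks($html, array $classes)
{
    // Optional clean-up: only runs when the tidy extension is available.
    if (function_exists('tidy_repair_string')) {
        $conf = array('output-xhtml' => true, 'show-body-only' => true);
        $repaired = tidy_repair_string($html, $conf, 'utf8');
        if (is_string($repaired) && $repaired !== '') {
            $html = $repaired;
        }
    }

    // Collect parser warnings internally instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Assumed charset hint so the fragment is read as UTF-8.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = array();
    foreach ($dom->getElementsByTagName('table') as $table) {
        // Split the class attribute on any whitespace; a missing attribute yields ''.
        $tableClasses = preg_split('/\s+/', trim($table->getAttribute('class')));
        if (!array_intersect($classes, $tableClasses)) {
            continue; // table has none of the wanted classes
        }
        foreach ($table->getElementsByTagName('tr') as $row) {
            foreach ($row->getElementsByTagName('a') as $a) {
                $name = trim($a->nodeValue);
                if ($name !== '') {
                    $items[] = array('name' => $name, 'href' => $a->getAttribute('href'));
                }
            }
        }
    }

    return $items;
}

// Hypothetical input: one table that matches and one that does not.
$sample = '<table class="listTB wide"><tr><td><a href="/a">項目一</a></td></tr></table>'
        . '<table class="other"><tr><td><a href="/b">skip</a></td></tr></table>';

$links = extractTableLinks($sample, array('baseTB', 'listTB'));
print_r($links);
```

Passing the class names as a parameter keeps the table filter out of the loop body, so the same helper could be pointed at other pages without editing it.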
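References:
PHP+Tidy: perfect XHTML error correction and filtering (PHP+Tidy-完美的XHTML纠错+过滤_php技巧_脚本之家)
A small problem with DOMDocument->loadHTML() handling Chinese (DOMDocument->loadHTML()处理中文的一点问题) - Fwolf's Blog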
