Extract HTML with PHP’s DOM extension
I recently had to parse HTML with PHP and had a look at PHP's DOM at last. Here's a way to extract an element's content by ID:
<?php
/**
 * Extract an element by ID from an HTML document
 * Thanks http://codjng.blogspot.com/2009/10/unicode-problem-when-using-domdocument.html
 *
 * @param string $content A website
 * @param string $id      ID of the element to extract
 *
 * @return string HTML content, or an empty string if no element has that ID
 */
function extract_id( $content, $id ) {
    // use mb_string if available
    // (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
    if ( function_exists( 'mb_convert_encoding' ) )
        $content = mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' );
    $dom = new DOMDocument();
    // set options before loading the document
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML( $content );
    $element = $dom->getElementById( $id );
    // getElementById() returns null when nothing matches
    if ( null === $element )
        return '';
    return innerHTML( $element );
}

/**
 * Helper, returns the innerHTML of an element
 *
 * @param object $contentdiv DOMElement
 *
 * @return string one element's HTML content
 */
function innerHTML( $contentdiv ) {
    $r = '';
    $elements = $contentdiv->childNodes;
    foreach ( $elements as $element ) {
        if ( $element->nodeType == XML_TEXT_NODE ) {
            $text = $element->nodeValue;
            // IIRC the next line was for working around a
            // WordPress bug
            //$text = str_replace( '<', '&lt;', $text );
            $r .= $text;
        }
        // FIXME we should return comments as well
        elseif ( $element->nodeType == XML_COMMENT_NODE ) {
            $r .= '';
        }
        else {
            $r .= '<';
            $r .= $element->nodeName;
            if ( $element->hasAttributes() ) {
                $attributes = $element->attributes;
                foreach ( $attributes as $attribute ) {
                    // escape the value so a quote inside it cannot break the attribute
                    $value = htmlspecialchars( $attribute->nodeValue, ENT_QUOTES );
                    $r .= " {$attribute->nodeName}='{$value}'";
                }
            }
            $r .= '>';
            $r .= innerHTML( $element );
            $r .= "</{$element->nodeName}>";
        }
    }
    return $r;
}
?>
As you can see, the code is not polished, but maybe it'll be useful to you.
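The FIXME about comments could probably be avoided by letting the DOM extension serialize the child nodes itself. What follows is only a sketch, not a drop-in replacement: it assumes PHP 5.3.6 or later (where DOMDocument::saveHTML() accepts a node argument) and UTF-8 input. The function name extract_id_savehtml and the sample markup are made up for illustration.

```php
<?php
/**
 * Sketch: inner HTML of the element with the given ID, serialized by
 * DOMDocument::saveHTML(). The function name is hypothetical.
 *
 * @param string $content HTML document, assumed to be UTF-8
 * @param string $id      ID of the element to extract
 *
 * @return string inner HTML, or '' if no element has that ID
 */
function extract_id_savehtml( $content, $id ) {
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors( true );
    // Assumption: the input is UTF-8. The XML declaration is a common
    // hint that makes libxml read it as such.
    $dom->loadHTML( '<?xml encoding="UTF-8">' . $content );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $element = $dom->getElementById( $id );
    if ( null === $element ) {
        return '';
    }

    // Serialize every child node: elements, text and comments alike.
    $html = '';
    foreach ( $element->childNodes as $child ) {
        $html .= $dom->saveHTML( $child );
    }
    return $html;
}

// Usage with made-up sample markup
$sample = '<html><body><div id="main"><p class="x">Hi <b>there</b></p><!-- note --></div></body></html>';
echo extract_id_savehtml( $sample, 'main' );
```

The serializer may insert line breaks between block-level elements, so the result is not guaranteed to be byte-identical to the source markup.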
Not bad, could come in handy in certain situations, but why didn't you use JavaScript (maybe even with jQuery or MooTools) to extract DOM content?
Because I needed to do it on the server. Are you suggesting I use a server-side JavaScript interpreter?
Nope, atm I don't see why someone should use JavaScript as a server-side language of choice. If you need to do something on the server side, PHP is just fine (ok, not as cool as Python, but anyway ;-)). Mind me asking why you had to parse HTML? Just curious...
Great!! Thanks a lot