Extract HTML with PHP’s DOM extension
I recently had to parse HTML with PHP and finally took a look at PHP's DOM extension. Here's a way to extract an element's content by ID:
<?php
/**
 * Extract an element by ID from an HTML document
 * Thanks http://codjng.blogspot.com/2009/10/unicode-problem-when-using-domdocument.html
 *
 * @param string $content A website
 * @param string $id      ID of the element to extract
 *
 * @return string HTML content, or an empty string if no element has that ID
 */
function extract_id( $content, $id ) {
    // use mb_string if available
    // (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
    if ( function_exists( 'mb_convert_encoding' ) )
        $content = mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' );

    $dom = new DOMDocument();
    // set before loading; assigning it after loadHTML() cannot affect parsing
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML( $content );

    $element = $dom->getElementById( $id );
    // getElementById() returns null when nothing matches
    if ( $element === null )
        return '';

    return innerHTML( $element );
}

/**
 * Helper, returns the innerHTML of an element
 *
 * @param object DOMElement
 *
 * @return string one element's HTML content
 */
function innerHTML( $contentdiv ) {
    $r = '';
    $elements = $contentdiv->childNodes;
    foreach ( $elements as $element ) {
        if ( $element->nodeType == XML_TEXT_NODE ) {
            $text = $element->nodeValue;
            // IIRC the next line was for working around a
            // WordPress bug
            //$text = str_replace( '<', '&lt;', $text );
            $r .= $text;
        }
        elseif ( $element->nodeType == XML_COMMENT_NODE ) {
            // FIXME we should return comments as well
            $r .= '';
        }
        else {
            // rebuild the opening tag, including its attributes
            $r .= '<';
            $r .= $element->nodeName;
            if ( $element->hasAttributes() ) {
                $attributes = $element->attributes;
                foreach ( $attributes as $attribute )
                    $r .= " {$attribute->nodeName}='{$attribute->nodeValue}'";
            }
            $r .= '>';
            // recurse into the children, then close the tag
            $r .= innerHTML( $element );
            $r .= "</{$element->nodeName}>";
        }
    }
    return $r;
}
?>
As you can see, the code is not polished, but maybe it'll be useful to you.
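For comparison, here is a minimal sketch of a shorter variant. It is not the code above but a different approach: it assumes PHP 7 or later and relies on DOMDocument::saveHTML() accepting a node argument, so each child of the matched element is serialized by the DOM extension instead of by hand. The function name inner_html_by_id() and the sample markup are made up for illustration, and the XML declaration prefix is a common workaround for UTF-8 input rather than something taken from the original post.

```php
<?php
/**
 * Sketch: return the inner HTML of the element with the given ID,
 * or an empty string if no such element exists.
 * inner_html_by_id() is a hypothetical name used only for this example.
 */
function inner_html_by_id( string $html, string $id ): string {
    $dom = new DOMDocument();

    // Collect parser warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors( true );
    // Assumption: the input is UTF-8; the prefix hints that to the parser.
    $dom->loadHTML( '<?xml encoding="UTF-8">' . $html );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $element = $dom->getElementById( $id );
    if ( $element === null ) {
        return '';
    }

    $inner = '';
    foreach ( $element->childNodes as $child ) {
        // saveHTML( $node ) serializes only that node and its subtree.
        $inner .= $dom->saveHTML( $child );
    }
    return $inner;
}

// Usage with made-up sample markup:
$page = '<html><body><div id="main"><p>Hello <b>world</b></p></div></body></html>';
echo inner_html_by_id( $page, 'main' ), "\n";
```

Because the serializer handles comments, attribute quoting and empty elements itself, this variant sidesteps the FIXME in the hand-written helper, at the cost of less control over the exact output.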
I am Nicolas Kuttler, a web developer, system administrator and IT consultant from France, currently living in Germany.
Just what I was looking for,
Cheers!