Extract HTML with PHP’s DOM extension

I recently had to parse HTML with PHP and finally took a look at PHP's DOM extension. Here's a way to extract an element's inner HTML by its ID:

 
<?php
/**
 * Extract an element's inner HTML by ID from an HTML document
 * Thanks http://codjng.blogspot.com/2009/10/unicode-problem-when-using-domdocument.html
 *
 * @param string $content HTML source of a page, UTF-8 encoded
 * @param string $id      ID of the element to extract
 *
 * @return string The element's inner HTML
 */

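function extract_id( $content, $id ) {
	// Convert UTF-8 to HTML entities so loadHTML() does not garble non-ASCII
	// characters. Needs mbstring; note that 'HTML-ENTITIES' as a target
	// encoding is deprecated as of PHP 8.2.
	if ( function_exists( 'mb_convert_encoding' ) ) {
		$content = mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' );
	}
	$dom = new DOMDocument();
	// Set parser properties before loading; assigning them afterwards
	// cannot influence the parse
	$dom->preserveWhiteSpace = false;
	$dom->loadHTML( $content );
	$element = $dom->getElementById( $id );
	// getElementById() returns null when no element has that ID
	if ( $element === null ) {
		return '';
	}
	return innerHTML( $element );
}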

/**
 * Helper, returns the innerHTML of an element
 *
 * @param DOMNode $contentdiv The node whose children are serialized
 *
 * @return string The node's HTML content
 */

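function innerHTML( $contentdiv ) {
	// Void elements never take a closing tag in HTML
	static $void = array( 'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
		'input', 'link', 'meta', 'param', 'source', 'track', 'wbr' );
	$r = '';
	foreach ( $contentdiv->childNodes as $element ) {
		if ( $element->nodeType == XML_TEXT_NODE ) {
			// nodeValue is already decoded, so re-escape &, < and >
			$r .= htmlspecialchars( $element->nodeValue, ENT_NOQUOTES, 'UTF-8' );
		}
		elseif ( $element->nodeType == XML_CDATA_SECTION_NODE ) {
			// CDATA sections are emitted verbatim
			$r .= $element->nodeValue;
		}
		elseif ( $element->nodeType == XML_COMMENT_NODE ) {
			$r .= '<!--' . $element->nodeValue . '-->';
		}
		elseif ( $element->nodeType == XML_ELEMENT_NODE ) {
			$r .= '<' . $element->nodeName;
			foreach ( $element->attributes as $attribute ) {
				// Escape quotes so the value cannot break out of the attribute
				$value = htmlspecialchars( $attribute->nodeValue, ENT_QUOTES, 'UTF-8' );
				$r .= " {$attribute->nodeName}='{$value}'";
			}
			$r .= '>';
			if ( ! in_array( strtolower( $element->nodeName ), $void ) ) {
				$r .= innerHTML( $element );
				$r .= "</{$element->nodeName}>";
			}
		}
	}
	return $r;
}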
?>
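
One practical note on the loading step: loadHTML() raises a PHP warning for each piece of markup libxml does not like, and real-world pages tend to trigger plenty of them. Here is a minimal sketch of loading quietly, assuming you are happy to discard the parser's complaints. The name load_html_quietly is made up for this example and is not part of the DOM extension.

```php
<?php
/**
 * Load an HTML string into a DOMDocument without emitting parser warnings.
 * Hypothetical helper, for illustration only.
 *
 * @param string $html HTML source
 *
 * @return DOMDocument
 */
function load_html_quietly( $html ) {
	// Collect libxml errors internally instead of raising PHP warnings
	$previous = libxml_use_internal_errors( true );
	$dom = new DOMDocument();
	$dom->loadHTML( $html );
	// Discard the collected errors and restore the previous setting
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );
	return $dom;
}

// Sloppy sample markup: an unclosed <p> and an HTML5 <section> element
$dom = load_html_quietly( '<section><p id="intro">Hello<p>World</section>' );
echo $dom->getElementById( 'intro' )->textContent, "\n";
```

If you would rather inspect the problems than drop them, libxml_get_errors() returns them as an array before libxml_clear_errors() is called.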

As you can see the code is not polished, but maybe it'll be useful to you.
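
A postscript on the helper: since PHP 5.3.6, DOMDocument::saveHTML() accepts a node argument, so the serialization can be left to libxml, which takes care of comments and of void elements such as <br>. Below is a rough sketch, assuming that PHP version or later. The name inner_html_builtin is hypothetical, and the exact whitespace of the output may differ between libxml versions.

```php
<?php
/**
 * Return the inner HTML of a node by serializing each child with saveHTML().
 * Hypothetical alternative to a hand-written serializer; needs PHP >= 5.3.6.
 *
 * @param DOMNode $node The node whose children are serialized
 *
 * @return string
 */
function inner_html_builtin( DOMNode $node ) {
	$html = '';
	foreach ( $node->childNodes as $child ) {
		// saveHTML( $child ) serializes just that child and its subtree
		$html .= $node->ownerDocument->saveHTML( $child );
	}
	return $html;
}

$dom = new DOMDocument();
$dom->loadHTML( '<div id="box"><p>One<br>two</p><!-- note --></div>' );
echo inner_html_builtin( $dom->getElementById( 'box' ) ), "\n";
```

The trade-off is that you give up control over how attributes are quoted and how entities are written, since libxml decides both.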

Published on Oct. 22, 2010 at 3 p.m. by Nicolas and tagged DOM, PHP, innerHTML, HTML parser.

7 comments

  • Oliver Baus wrote this comment on Nov. 9, 2010, 9:55 a.m.
    Not bad, could come in handy in certain situations, but why didn't you use JavaScript (maybe even with jQuery or MooTools) to extract DOM content?
    • nicolas wrote this comment on Nov. 9, 2010, 12:35 p.m.
      Because I needed to do it on the server. Are you suggesting I use a server-side JavaScript interpreter?
      • Oliver Baus wrote this comment on Nov. 9, 2010, 2:51 p.m.
        Nope, at the moment I don't see why someone should use JavaScript as a server-side language of choice. If you need to do something on the server side, PHP is just fine (OK, not as cool as Python, but anyway ;-)). Mind me asking why you had to parse HTML? Just curious...
  • Mr.Xprt wrote this comment on March 22, 2011, 8:22 p.m.
    Great!! Thanks a lot
  • Jazza wrote this comment on June 2, 2011, 5:39 a.m.
    Ahh thanks!
    Just what I was looking for,
    Cheers!
  • SteveST wrote this comment on June 21, 2011, 12:22 p.m.
    Nicolas! Ta for the handy pointers here, but you forgot to take account of self-closing tags! Just putting your three tag-closing lines inside a wee regex conditional seems to sort it out, though. :-)
  • Ode wrote this comment on Feb. 13, 2012, 12:19 a.m.
    Awesome, this was just what I needed! Thank you very much! Working like a charm :)
