Extract HTML with PHP’s DOM extension

I recently had to parse HTML with PHP and finally took a look at PHP's DOM extension. Here's a way to extract an element's content by ID:

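<?php
/**
 * Extract the inner HTML of an element, selected by ID, from an HTML document
 * Thanks http://codjng.blogspot.com/2009/10/unicode-problem-when-using-domdocument.html
 *
 * @param string $content The HTML source of a page, assumed to be UTF-8
 * @param string $id      The ID of the element to extract
 *
 * @return string The element's inner HTML, or an empty string if the ID is not found
 */
function extract_id( $content, $id ) {
	// use mb_string if available; note that the 'HTML-ENTITIES' target
	// is deprecated as of PHP 8.2
	if ( function_exists( 'mb_convert_encoding' ) ) {
		$content = mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' );
	}
	$dom = new DOMDocument();
	// parser options only take effect if they are set before loading
	$dom->preserveWhiteSpace = false;
	$dom->loadHTML( $content );
	$element = $dom->getElementById( $id );
	// no element with that ID in the document
	if ( null === $element ) {
		return '';
	}
	return innerHTML( $element );
}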

/**
 * Helper, returns the innerHTML of an element
 *
 * @param DOMNode $contentdiv The node whose children are serialized
 *
 * @return string The element's HTML content
 */
function innerHTML( $contentdiv ) {
	$r = '';
	foreach ( $contentdiv->childNodes as $element ) {
		if ( $element->nodeType == XML_TEXT_NODE ) {
			// nodeValue is entity-decoded, so escape &, < and > again
			$r .= htmlspecialchars( $element->nodeValue, ENT_NOQUOTES, 'UTF-8' );
		}
		elseif ( $element->nodeType == XML_COMMENT_NODE ) {
			// return comments as well
			$r .= '<!--' . $element->nodeValue . '-->';
		}
		else {
			$r .= '<';
			$r .= $element->nodeName;
			if ( $element->hasAttributes() ) {
				foreach ( $element->attributes as $attribute ) {
					// escape quotes and ampersands in attribute values
					$value = htmlspecialchars( $attribute->nodeValue, ENT_QUOTES, 'UTF-8' );
					$r .= " {$attribute->nodeName}=\"{$value}\"";
				}
			}
			$r .= '>';
			$r .= innerHTML( $element );
			$r .= "</{$element->nodeName}>";
		}
	}
	return $r;
}
?>

As you can see, the code is not polished, but maybe it'll be useful to you.
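
For comparison, here is a minimal sketch of the same idea that lets the DOM extension do the serializing. It assumes PHP 5.3.6 or later, where DOMDocument::saveHTML() accepts a node argument, and PHP 7.1 or later for the type hints. The function name extract_id_savehtml is made up for this example, and encoding handling is left out, so non-ASCII input would still need the conversion step that extract_id() performs.

```php
<?php
/**
 * Hypothetical variant of extract_id() that serializes each child
 * node with DOMDocument::saveHTML() instead of a hand-written helper.
 *
 * @param string $content HTML source
 * @param string $id      ID of the element to extract
 *
 * @return string|null Inner HTML, or null if the ID is not found
 */
function extract_id_savehtml( string $content, string $id ): ?string {
	$dom = new DOMDocument();

	// Collect parser warnings about sloppy markup instead of emitting them
	$previous = libxml_use_internal_errors( true );
	$dom->loadHTML( $content );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	$element = $dom->getElementById( $id );
	if ( null === $element ) {
		return null;
	}

	// Serialize every child (elements, text, comments) and concatenate
	$html = '';
	foreach ( $element->childNodes as $child ) {
		$html .= $dom->saveHTML( $child );
	}
	return $html;
}
```

Something like extract_id_savehtml( $page, 'content' ), with $page holding the fetched HTML and 'content' standing in for whatever ID you are after, should then return the markup inside that element. Because libxml writes the output, comments are kept and void elements such as br come out without a closing tag, though libxml may add line breaks of its own between block elements.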

8 comments

Oliver Baus wrote this comment on Nov. 9, 2010, 9:55 a.m.

Not bad, could come in handy in certain situations, but why didn't you use JavaScript (maybe even with jQuery or MooTools) to extract DOM content?
nicolas wrote this comment on Nov. 9, 2010, 12:35 p.m.

Because I needed to do it on the server. Are you suggesting I use a server-side JavaScript interpreter?
Oliver Baus wrote this comment on Nov. 9, 2010, 2:51 p.m.

Nope, atm I don't see why someone should use JavaScript as a server-side language of choice. If you need to do sth on the server side, PHP is just fine (OK, not as cool as Python, but anyway ;-)). Mind me asking why you had to parse HTML? Just curious...
Mr.Xprt wrote this comment on March 22, 2011, 8:22 p.m.

Great!! Thanks a lot
Jazza wrote this comment on June 2, 2011, 5:39 a.m.

Ahh thanks! Just what I was looking for, Cheers!
SteveST wrote this comment on June 21, 2011, 12:22 p.m.

Nicolas! Ta for the handy pointers here, but you forgot to take account of self-closing tags! Just putting your three tag-closing lines inside a wee regex conditional seems to sort it out, though. :-)
Ode wrote this comment on Feb. 13, 2012, 12:19 a.m.

Awesome, this was just what I needed! Thank you very much! Working like a charm :)
Ervin Bernhardt wrote this comment on June 27, 2013, 5:33 p.m.

This is very nice code. But what if I had only a space inside a node ( )? How could I preserve that?
