Do you know how to extract the contents of the HTML <body> tag with a single regular expression? Here is one way:
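function extractBody($htmlContent) {
    // Capture everything between the opening <body ...> tag and </body>.
    $regExp = '/.*<body[^>]*>(.*)<\/body>.*/is';
    // A single preg_match() call both tests the pattern and returns the capture.
    if (preg_match($regExp, $htmlContent, $matches)) {
        return trim($matches[1]);
    }
    // No <body> element found.
    return '';
}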
Note that both the "i" and "s" modifiers are required; without them the regular expression will not work in all cases. The "i" modifier makes the match case-insensitive, so the <body> tag is detected in mixed or upper case as well. The "s" modifier makes the dot (.) match newline characters too, so the pattern still works when the document spans multiple lines.
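To see what each modifier contributes, here is a small sketch that runs the same pattern with both modifiers, then with each one removed. The sample HTML string and the firstCapture() helper are made up for this illustration, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the trimmed first capture group,
// or null when the pattern does not match at all.
function firstCapture(string $regExp, string $subject): ?string {
    return preg_match($regExp, $subject, $m) === 1 ? trim($m[1]) : null;
}

// Made-up sample input: upper-case tags and a body that spans several lines.
$html = "<HTML>\n<BODY class=\"page\">\nHello,\nworld!\n</BODY>\n</HTML>";

// Same pattern as in extractBody(), without delimiters and modifiers.
$pattern = '.*<body[^>]*>(.*)<\/body>.*';

$withBoth = firstCapture('/' . $pattern . '/is', $html); // "Hello,\nworld!"
$withoutI = firstCapture('/' . $pattern . '/s', $html);  // null: <BODY> is upper case
$withoutS = firstCapture('/' . $pattern . '/i', $html);  // null: "." stops at the newline

var_dump($withBoth, $withoutI, $withoutS);
```

With this input, only the variant that uses both modifiers finds the body; dropping either one makes the match fail.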
Comments
I just used this on my Mac:
lynx -dump example.html >example.txt
worked fine ;-)