PHP Resources

Some PHP Odds 'n' Ends you may find useful

Extracting Text from MS Word

If your PHP installation runs on Unix/Linux rather than Windows, you don't have access to PHP's COM abilities. This makes it difficult to extract information from Microsoft Word documents.

Being able to get at the text from a Word document can be useful, especially for building indexers for search engines.

The solutions that are currently available usually involve using binaries such as catdoc or antiword.  Good as these products are, they can be complicated to install and configure (sometimes impossible if using a shared hosting account).

Here's a simple attempt at a solution using just PHP. I don't pretend that it makes a complete success of extracting the text from all Word documents, but I've found it very reliable for the vast majority of the several thousand docs I've used it with. The function returns text from the Word document as a string, with all the formatting removed. Please note that some parts of the Word document (header, footer etc) are not parsed.

<?php
/*****************************************************************
 This approach uses detection of NUL (chr(0)) and end line (chr(13))
 to decide where the text is:
 - divide the file contents up by chr(13)
 - reject any slices containing a NUL
 - stitch the rest together again
 - clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
    // Read the whole file as raw bytes; give up quietly if it cannot be read
    $line = @file_get_contents($userDoc);
    if ($line === false) {
        return "";
    }

    // chr(0x0D) is the carriage return, chr(13) in decimal
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Keep only non-empty slices that contain no NUL byte
        if (strlen($thisline) > 0 && strpos($thisline, chr(0x00)) === false) {
            $outtext .= $thisline . " ";
        }
    }

    // Strip everything except letters, digits, whitespace and basic punctuation
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}
?>

Using the function is as easy as:

$text = parseWord($userDoc);

The recovered text can then be processed as required, e.g. put into an index, or a MySQL table having a FULLTEXT index applied etc.
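As one illustration of processing the recovered text, the sketch below turns a string into a word-frequency list of the kind a simple indexer might store. The function name buildWordCounts and the minimum word length are assumptions made for this example, not part of the original script. Note that str_word_count() only recognises alphabetic words, so numbers in the text are ignored here.

```php
<?php
// Hypothetical helper: count how often each word occurs in the recovered text.
// $minLength is an assumed cut-off for ignoring very short words.
function buildWordCounts($text, $minLength = 3)
{
    // Lower-case the text so "Word" and "word" are counted together,
    // then split it into an array of words
    $words = str_word_count(strtolower($text), 1);

    $counts = array();
    foreach ($words as $word) {
        if (strlen($word) >= $minLength) {
            $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
        }
    }

    // Most frequent words first, keeping the word => count association
    arsort($counts);
    return $counts;
}

// A literal string stands in here for the output of parseWord()
$counts = buildWordCounts("The quick brown fox. The lazy dog.");
print_r($counts);
```

Each word and its count could then be written to an index table, or the raw text could be stored as-is in a column with a FULLTEXT index.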

Detecting PHP use

You can tell if a server has PHP installed, irrespective of what file extensions the site may be using (some sites change the default .php extension for security reasons).  Simply append:

?=PHPB8B5F2A0-3C92-11d3-A3A9-4C7B08C10000

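to the URL of the domain in question.  If PHP is installed and its expose_php setting is enabled, you should get a nicely formatted page showing the PHP credits.  Note that this feature was removed in PHP 5.5, so it only works on older installations.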
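If you want to build the probe address in a script rather than by hand, something like the following sketch would do. The function name buildCreditsProbeUrl and the example.com address are assumptions for illustration only. The helper drops any existing query string first, so that the probe string is the only thing appended.

```php
<?php
// Hypothetical helper: build the credits probe URL for a given page address.
function buildCreditsProbeUrl($url)
{
    // Remove any existing query string or fragment so the probe string
    // is the only query appended to the address
    $base = preg_replace('/[?#].*$/s', '', $url);
    return $base . '?=PHPB8B5F2A0-3C92-11d3-A3A9-4C7B08C10000';
}

// example.com is a placeholder address
echo buildCreditsProbeUrl('http://www.example.com/index.html?page=2'), "\n";
```

The resulting address can be pasted into a browser.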

Try it with this site by clicking here.

Measuring Page Loading Time

You may have noticed some pages (including this one) displaying a little message, usually at the page foot, saying 'Page loaded in 0.*** secs' or similar.  Here's how to do it using PHP's microtime() function:

In a PHP code block at the start of the page, put:

$mic_time = explode(" ",microtime());
$mic_time = $mic_time[1] + $mic_time[0];
$starttime = $mic_time;

Near the page end, put the following lines in a PHP code block:

$places = 5;      // However many decimal places you require
$mic_time = explode(" ",microtime());
$mic_time = $mic_time[1] + $mic_time[0];
$finishtime = $mic_time;
echo "Page loaded in ". round(($finishtime - $starttime),$places) ." secs";

The microtime() function returns the Unix timestamp for the current moment, in the form "msec sec", where sec is the current time measured in whole seconds since the Unix Epoch (0:00:00 January 1, 1970 GMT), and msec is the microseconds part, expressed as a fraction of a second.

We manipulate the string to give us a start time and finish time, and simply subtract one from the other to give us our page loading time.
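Since PHP 5, microtime() also accepts an optional argument: microtime(true) returns the timestamp directly as a float, which avoids the string handling. Here is a minimal sketch of the same timer written that way. The formatLoadTime helper is a name invented for this example, and the usleep() call merely stands in for the work of building the page.

```php
<?php
// At the start of the page: the time as a float, in seconds since the Unix Epoch
$starttime = microtime(true);

// ... the page content would be generated here ...
usleep(1000); // stand-in for real work, purely for illustration

// Hypothetical helper: format the elapsed time for display
function formatLoadTime($start, $finish, $places = 5)
{
    return "Page loaded in " . round($finish - $start, $places) . " secs";
}

// Near the page end
echo formatLoadTime($starttime, microtime(true));
```

The subtraction is the same as before; only the way the two timestamps are obtained has changed.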

A Custom Error Page generator

This is to be used with suitable ErrorDocument entries in an .htaccess file - see the .htaccess tutorial for details.

Rather than create a whole bunch of different pages to cater for all of the different error codes, we can quite easily write a PHP script to generate them on-the-fly.

First, take your favourite programmer's editor and create a file called err.php.  This file will contain the program that will be called by the .htaccess file when an error occurs.  Here's an example line from our .htaccess file, this one dealing with the common 404 or 'Page Not Found' error:

ErrorDocument 404 /err.php?code=404

We need one such line in .htaccess for each different error code we choose to process. 

Our PHP script will store error messages in an associative array $errortext, with each array entry being of the form

$errortext[errorcode] = "Explanatory text for this error";

The script will simply look for the given error code (passed, via the URL, in the variable code) and output the associated text into a web page template.  So inside our PHP code block:

// First let's generate our array of error messages
$errortext["400"] = "Bad server request.  You may have made a syntax error.";
$errortext["404"] = "I'm sorry, but that page doesn't seem to exist.";
// ... and so on

Now we need to check the error code, passed to the script in the code variable, to get the appropriate error message:

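// read the code from the URL, defaulting to an empty string if it is missing
$errorcode = isset($_GET["code"]) ? $_GET["code"] : "";

// get the relevant error message, with a fallback for unrecognised codes:
$output = "An unknown error occurred.";
if(array_key_exists($errorcode,$errortext))
{
  $output = $errortext[$errorcode];
}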

We now have a variable $output, which we can simply echo to our user:

echo $output;
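Putting the pieces together, a bare-bones err.php might look something like the sketch below. This is an illustration rather than the downloadable script: the errorMessageFor helper, the fallback wording and the tiny HTML template are all assumptions made for the example.

```php
<?php
// Hypothetical helper: look up the message for a code, with a fallback
function errorMessageFor($code, array $errortext, $default = "An unknown error occurred.")
{
    return array_key_exists($code, $errortext) ? $errortext[$code] : $default;
}

// The array of error messages, as described above
$errortext = array(
    "400" => "Bad server request.  You may have made a syntax error.",
    "404" => "I'm sorry, but that page doesn't seem to exist.",
);

// Read the code passed in the URL, e.g. err.php?code=404
$errorcode = isset($_GET["code"]) ? $_GET["code"] : "";
$output = errorMessageFor($errorcode, $errortext);

// Minimal page template (an assumption); values are escaped before
// being placed in the HTML because the code comes from the URL
echo "<html><head><title>Error " . htmlspecialchars($errorcode) . "</title></head>\n";
echo "<body><p>" . htmlspecialchars($output) . "</p></body></html>\n";
```

Keeping the lookup in its own function means new error codes only need a new array entry.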

You can download a complete script to study HERE.  This script has some added features:

  • An option to send an email to the webmaster when an error is generated
  • All error pages have a link for the user to redirect them to the site's homepage

The code is commented and would be easy to extend further.

 


© 2005 The Mouse Whisperer