Sitemaps: One file, many content types

Have you ever wanted to submit your various content types (video, images, etc.) in one Sitemap? Now you can! If your site contains videos, images, mobile URLs, code or geo information, you can now create—and submit—a Sitemap with all the information.

Site owners have been leveraging Sitemaps to let Google know about their sites’ content since Sitemaps were first introduced in 2005. Since that time additional specialized Sitemap formats have been introduced to better accommodate video, images, mobile, code or geographic content. With the increasing number of specialized formats, we’d like to make it easier for you by supporting Sitemaps that can include multiple content types in the same file.

The structure of a Sitemap with multiple content types is similar to a standard Sitemap, with the additional ability to contain URLs referencing different content types. Here’s an example of a Sitemap that contains a reference to a standard web page for Web search, image content for Image search and a video reference to be included in Video search:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <image:image>
      <image:loc>http://example.com/image.jpg</image:loc>
    </image:image>
    <video:video>
      <video:content_loc>http://www.example.com/videoABC.flv</video:content_loc>
      <video:title>Grilling tofu for summer</video:title>
    </video:video>
  </url>
</urlset>

Here’s an example of what you’ll see in Webmaster Tools when a Sitemap containing multiple content types is submitted:

We hope the capability to include multiple content types in one Sitemap simplifies your Sitemap submission. The rest of the Sitemap rules, like 50,000 max URLs in one file and the 10MB uncompressed file size limit, still apply. If you have questions or other feedback, please visit the Webmaster Help Forum.
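
As a rough illustration of how such a file could be generated, the sketch below builds a mixed Sitemap from plain objects. The names buildSitemap and escapeXml and the entry shape (loc, images, videos) are assumptions made for this example, not part of any Google tool. The element names follow the sample above, and the image and video namespace URIs are the ones listed in Google's image and video Sitemap documentation.

```javascript
// Escape the five XML special characters so URLs and titles are safe as element text.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: entries is an array of
// { loc, images: [url, ...], videos: [{ contentLoc, title }, ...] }.
function buildSitemap(entries) {
  var lines = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"',
    '        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"',
    '        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">'
  ];
  entries.forEach(function (entry) {
    lines.push('  <url>');
    lines.push('    <loc>' + escapeXml(entry.loc) + '</loc>');
    (entry.images || []).forEach(function (imageUrl) {
      lines.push('    <image:image>');
      lines.push('      <image:loc>' + escapeXml(imageUrl) + '</image:loc>');
      lines.push('    </image:image>');
    });
    (entry.videos || []).forEach(function (video) {
      lines.push('    <video:video>');
      lines.push('      <video:content_loc>' + escapeXml(video.contentLoc) + '</video:content_loc>');
      lines.push('      <video:title>' + escapeXml(video.title) + '</video:title>');
      lines.push('    </video:video>');
    });
    lines.push('  </url>');
  });
  lines.push('</urlset>');
  return lines.join('\n');
}
```

Escaping matters in practice: a URL with a query string such as ?a=1&b=2 must be written with &amp; or the Sitemap is not well-formed XML.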


Init Google Gears with jQuery

I took a look around the web for a jQuery plugin for Google Gears, as I would like to initialise Gears on user input rather than as the web page loads, but could not find one anywhere.

So I have written the following plugin, based on Google's own gears_init.js code, which inits Gears on page load.

/*
 * jQuery Gears Init by Darren Horrocks
 *
 * jQuery Plugin to init google gears at any point during a page
 *
 * code based on gears_init.js from http://code.google.com/apis/gears/gears_init.js
 *
 */
(function() {
  jQuery.initGears = function() {
    // We are already defined. Hooray!
    if (window.google && google.gears) {
      return;
    }

    var factory = null;

    // Firefox
    if (typeof GearsFactory != 'undefined') {
      factory = new GearsFactory();
    } else {
      // IE
      try {
        factory = new ActiveXObject('Gears.Factory');
        // privateSetGlobalObject is only required and supported on IE Mobile on
        // WinCE.
        if (factory.getBuildInfo().indexOf('ie_mobile') != -1) {
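          // Pass the global object explicitly: when called as jQuery.initGears(),
          // `this` is jQuery rather than window.
          factory.privateSetGlobalObject(window);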
        }
      } catch (e) {
        // Safari
        if ((typeof navigator.mimeTypes != 'undefined') && navigator.mimeTypes["application/x-googlegears"]) {
          factory = document.createElement("object");
          factory.style.display = "none";
          factory.width = 0;
          factory.height = 0;
          factory.type = "application/x-googlegears";
          document.documentElement.appendChild(factory);
        }
      }
    }

    // *Do not* define any objects if Gears is not installed. This mimics the
    // behavior of Gears defining the objects in the future.
    if (!factory) {
      return;
    }

    // Now set up the objects, being careful not to overwrite anything.
    if (!window.google) {
      window.google = {};
    }

    if (!google.gears) {
      google.gears = {factory: factory};
    }
  };
})();

You can now initialise Gears using either the jQuery object or the dollar shortcut.

jQuery.initGears();
$.initGears();
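
Since the point of the plugin is to initialise Gears on user input, a small wrapper can run the init lazily and report whether Gears is actually usable afterwards. This is only a sketch: ensureGears and the #enable-offline button are hypothetical names, and the window object and init function are passed in so the helper can be tried outside a browser.

```javascript
// Hypothetical helper: run the init function only if Gears is not set up yet,
// then report whether google.gears exists on the given window-like object.
function ensureGears(win, initGears) {
  if (!(win.google && win.google.gears)) {
    initGears();
  }
  return Boolean(win.google && win.google.gears);
}

// In the browser this could be wired to a (hypothetical) button:
// $('#enable-offline').click(function () {
//   if (!ensureGears(window, $.initGears)) {
//     alert('Google Gears is not installed');
//   }
// });
```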

To slash or not to slash

That is the question we hear often. Onward to the answers! Historically, it’s common for URLs with a trailing slash to indicate a directory, and those without a trailing slash to denote a file:

http://example.com/foo/ (with trailing slash, conventionally a directory)
http://example.com/foo (without trailing slash, conventionally a file)

But they certainly don’t have to. Google treats each URL above separately (and equally), regardless of whether it’s a file or a directory, and whether or not it has a trailing slash.

Different content on / and no-/ URLs okay for Google, often less ideal for users

From a technical, search engine standpoint, it’s certainly permissible for these two URL versions to contain different content. Your users, however, may find this configuration horribly confusing — just imagine if www.google.com/webmasters and www.google.com/webmasters/ produced two separate experiences.

For this reason, trailing slash and non-trailing slash URLs often serve the same content. The most common case is when a site is configured with a directory structure:
http://example.com/parent-directory/child-directory/

Your site’s configuration and your options

You can do a quick check on your site to see if the URLs:

  1. http://<your-domain-here>/<some-directory-here>/
    (with trailing slash)
  2. http://<your-domain-here>/<some-directory-here>
    (no trailing slash)

don’t both return a 200 response code, but that one version redirects to the other.

  • If only one version can be returned (i.e., the other redirects to it), that’s great! This behavior is beneficial because it reduces duplicate content. In the particular case of redirects to trailing slash URLs, our search results will likely show the version of the URL with the 200 response code (most often the trailing slash URL) — regardless of whether the redirect was a 301 or 302.
  • If both slash and non-trailing-slash versions contain the same content and each returns 200, you can:
    • Consider changing this behavior (more info below) to reduce duplicate content and improve crawl efficiency.
    • Leave it as-is. Many sites have duplicate content. Our indexing process often handles this case for webmasters and users. While it’s not totally optimal behavior, it’s perfectly legitimate and a-okay. :)
    • Rest assured that for your root URL specifically, http://example.com is equivalent to http://example.com/ and can’t be redirected even if you’re Chuck Norris.
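
The quick check above can be sketched in code. This is an illustration rather than an official tool: classifySlashBehavior and probe are hypothetical names, and probe assumes Node 18 or later, where the built-in fetch() with redirect: 'manual' returns the 3xx response itself instead of following it.

```javascript
// Hypothetical helper. Each argument describes the response to one request made
// WITHOUT following redirects: { status: number, location: string or null }.
function classifySlashBehavior(withSlash, withoutSlash) {
  function isRedirect(response) {
    return response.status >= 300 && response.status < 400 && Boolean(response.location);
  }
  if (withSlash.status === 200 && withoutSlash.status === 200) {
    return 'both-200'; // same URL pair served twice: duplicate content
  }
  if (withSlash.status === 200 && isRedirect(withoutSlash)) {
    return 'redirects-to-slash';
  }
  if (withoutSlash.status === 200 && isRedirect(withSlash)) {
    return 'redirects-to-no-slash';
  }
  return 'other'; // errors, redirect chains, and so on
}

// Assumption: Node 18+ global fetch(). Performs a real network request.
async function probe(url) {
  var response = await fetch(url, { redirect: 'manual' });
  return { status: response.status, location: response.headers.get('location') };
}

// Example usage (not run here because it needs network access):
// Promise.all([probe('http://example.com/foo/'), probe('http://example.com/foo')])
//   .then(function (r) { console.log(classifySlashBehavior(r[0], r[1])); });
```

The classifier does not verify that the Location header actually points at the other version of the URL; a stricter check would compare the two.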

Steps for serving only one URL version

What if your site serves duplicate content on these two URLs:

http://<your-domain-here>/<some-directory-here>/
http://<your-domain-here>/<some-directory-here>

meaning that both URLs return 200 (neither has a redirect or contains rel="canonical"), and you want to change the situation?

  1. Choose one URL as the preferred version. If your site has a directory structure, it’s more conventional to use a trailing slash with your directory URLs (e.g., example.com/directory/ rather than example.com/directory), but you’re free to choose whichever you like.
  2. Be consistent with the preferred version. Use it in your internal links. If you have a Sitemap, include the preferred version (and don’t include the duplicate URL).
  3. Use a 301 redirect from the duplicate to the preferred version. If that’s not possible, rel="canonical" is a strong option. rel="canonical" works similarly to a 301 for Google’s indexing purposes, and for other major search engines as well.
  4. Test your 301 configuration through Fetch as Googlebot in Webmaster Tools. Make sure your URLs:
    http://example.com/foo/
    http://example.com/foo
    are behaving as expected. The preferred version should return 200. The duplicate URL should 301 to the preferred URL.
  5. Check for Crawl errors in Webmaster Tools, and, if possible, your webserver logs as a sanity check that the 301s are implemented.
  6. Profit! (just kidding) But you can bask in the sunshine of your efficient server configuration, warmed by the knowledge that your site is better optimized.
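
How the 301 in step 3 is configured depends on your web server. As one possible sketch, the function below decides where to redirect when the trailing-slash version is the preferred one. trailingSlashRedirect is a hypothetical name, and treating any last path segment that contains a dot as a file is a simplifying assumption.

```javascript
// Hypothetical helper: given a request path (without query string), return the
// path to 301 to, or null when no redirect is needed.
function trailingSlashRedirect(pathname) {
  if (pathname.charAt(pathname.length - 1) === '/') {
    return null; // already the preferred version
  }
  var lastSegment = pathname.substring(pathname.lastIndexOf('/') + 1);
  if (lastSegment.indexOf('.') !== -1) {
    return null; // looks like a file such as /foo.html, so leave it alone
  }
  return pathname + '/';
}

// Possible wiring with Node's built-in http module (commented out so the
// sketch has no side effects; serveContent is a placeholder):
// var http = require('http');
// http.createServer(function (req, res) {
//   var parts = req.url.split('?');
//   var target = trailingSlashRedirect(parts[0]);
//   if (target !== null) {
//     res.writeHead(301, { Location: target + (parts[1] ? '?' + parts[1] : '') });
//     return res.end();
//   }
//   serveContent(req, res);
// }).listen(8080);
```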


PHP Installer for Web Apps

We all know that uploading applications such as SquirrelMail or phpMyAdmin via FTP, or any application that has hundreds of small files, takes a long time. This is mainly due to the overhead of the commands that have to be performed to upload each file individually.

The first thing we need to do is package all of our files into a single large file. This massively reduces the overhead, since only one send-file command is ever issued. To do this, simply add all the files (and the root directory if required) to a zip, so that when they are extracted on the server they end up in the correct directory layout.

Once we have zipped up all of our files, how do we unzip them? We need a small bootstrap script (example below) to extract them on the server. Yes, that is a second small file to upload, but it beats having to upload those few hundred files.

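$zip = new ZipArchive();
// open() returns true on success and an integer error code on failure.
// Compare strictly, because a non-zero error code is also == TRUE.
$r = $zip->open("myzip.zip");
if($r === true) {
  $zip->extractTo("./");
  $zip->close();
} else {
  die("Could not open myzip.zip (ZipArchive error code {$r})");
}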

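If you now upload and run your bootstrap script, you will notice (if you have the zip extension loaded in PHP) that your zip file has been extracted on the server, and far more quickly than uploading the individual files would have taken.

To get back down to a single upload, the following script embeds the zip, base64-encoded, in a generated installer.php that recreates and extracts the zip when it is run: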

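$package = "myadmin.zip";

// Read the zip and base64-encode it so it can be embedded as plain text.
$c = file_get_contents($package);
if($c === false) {
  die("Could not read {$package}");
}

$content = base64_encode($c);
$bcontent = "";

// Split the encoded data into 1024-character chunks, each written as its own
// "$content .= ..." statement in the generated installer.
for($i=0; $i<strlen($content); $i+=1024) {
  $c = substr($content, $i, 1024);
  $bcontent .= "\$content .= \"{$c}\";\r\n";
}

// Write installer.php: it rebuilds the zip on the server, then extracts it.
$f = fopen("installer.php", "w+");
if($f) {
  fwrite($f, "<?php\r\n");
  fwrite($f, "\$content = \"\";\r\n");
  fwrite($f, $bcontent);
  fwrite($f, "file_put_contents('{$package}', base64_decode(\$content));\r\n");
  fwrite($f, "\$zip = new ZipArchive();\r\n");
  fwrite($f, "if(\$zip->open('{$package}') === true) {\r\n");
  fwrite($f, "  \$zip->extractTo('./');\r\n");
  fwrite($f, "  \$zip->close();\r\n");
  fwrite($f, "}\r\n");
  fwrite($f, "?" . ">");
  fclose($f);
}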

Javascript Sudoku Solver using jQuery

I took the time earlier in the week to write an HTML/JavaScript sudoku solver. This code could be used either to cheat at sudoku or as a basis for creating an online sudoku game.

Either way, have fun: http://www.bizzeh.com/solver/


PHP Data Optimisation/PHP Function Optimisation

Using the keyword generator code from the previous post, I have written a small benchmarking script to measure the cost of setting up the array in get_filter_words with 2200 keywords.

The test code is as follows:

$text = strip_tags(file_get_contents('http://www.bizzeh.com/'));

for($x=0; $x<3; $x++) {
  $start = microtime(true);
  for($y=0;$y<100; $y++) {
    $ar = get_valid_keywords($text);
  }
  $end = microtime(true);

  echo($end-$start . "<br/>");
}

With the default get_filter_words function, each run of 100 calls takes on average (in seconds, as measured with microtime(true)):

33.2730691433

I decided to optimise this function slightly to improve performance:

function get_filter_words() {
  static $words;
  if(empty($words)) $words = array('000', ..., 'zwölf' );
  return $words;
}

Declaring $words as static lets it persist between calls to the function instead of being discarded on return. Since we now check whether it is empty and only build the array when it is, the array is built once per script run rather than on every call, which saves quite a lot of script time:

9.74925804138
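
The same build-once idea carries over to other languages. As a rough JavaScript analogue (not the PHP code above), a closure variable can play the role of the static $words. The names makeCachedList, getFilterWords and buildCount are made up for this sketch, and the three-word list is a stand-in for the real one.

```javascript
// Wrap an expensive builder so that it runs at most once; later calls
// return the value kept in the closure variable.
function makeCachedList(build) {
  var cached = null;
  return function () {
    if (cached === null) {
      cached = build();
    }
    return cached;
  };
}

// Example: the builder runs only on the first call.
var buildCount = 0;
var getFilterWords = makeCachedList(function () {
  buildCount++;
  return ['and', 'or', 'are']; // stand-in for the real 2200-word list
});

getFilterWords();
getFilterWords();
console.log(buildCount); // prints 1
```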


PHP Keyword Generator and Keyword Density Generator

First, we create an array of bad words that we want to filter out, i.e., the most common words in the most common languages, such as “and”, “or” and “are”.

We also need a lower bound on word length. Most search engines have a lower word bound of 3 characters, so that is what we will use here.

Next, we check a string of text for valid keywords: the string is split into an array on commas or whitespace, and each word is checked against our list and against the 3-character minimum. If a word is not in the bad words list and is 3 characters or longer, we add it to the valid array; if it is already there, we increase its count by 1.

Once we have cycled through all the words, we sort the array from the largest count to the smallest and return it.

We now have an array of words ordered by their popularity in the string that was given.

function get_filter_words() {
  $words = array('000', ..., 'zwölf' );
  return $words;
}

function is_valid_keyword($word) {
  $common_words = get_filter_words();

  return (strlen($word) >= 3 && !in_array($word, $common_words)) ? true : false;
}

function get_valid_keywords($words) {
  $word_arr = array();
  $word_ret = array();

  if(is_array($words)) {
    // Already split into words by the caller.
    $word_arr = $words;
  } else {
    $word_arr = preg_split("/[\s,]/", $words, -1, PREG_SPLIT_NO_EMPTY);
  }

  foreach($word_arr as $word) {
    if(is_valid_keyword($word)) {
      if(empty($word_ret[$word])) {
        $word_ret[$word] = 1;
      } else {
        $word_ret[$word]++;
      }
    }
  }

  arsort($word_ret, SORT_NUMERIC);

  return $word_ret;
}

By the way, you need to find your own bad word list.
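
The functions above return raw counts. One way to turn those into a keyword density is sketched below in JavaScript. It assumes that density means a word's count divided by the total number of valid keywords, which the post does not define, and keywordDensity and filterWords are names invented for this example.

```javascript
// Hypothetical port of the counting logic with a density figure added.
// filterWords stands in for the list returned by get_filter_words().
function keywordDensity(text, filterWords) {
  var counts = Object.create(null); // no inherited keys such as "constructor"
  var total = 0;

  text.split(/[\s,]+/).forEach(function (word) {
    // Same rule as is_valid_keyword(): at least 3 characters, not filtered.
    if (word.length < 3 || filterWords.indexOf(word) !== -1) {
      return;
    }
    counts[word] = (counts[word] || 0) + 1;
    total++;
  });

  // Most frequent first, like arsort() in the PHP version.
  return Object.keys(counts)
    .map(function (word) {
      return { word: word, count: counts[word], density: counts[word] / total };
    })
    .sort(function (a, b) {
      return b.count - a.count;
    });
}
```

Like the PHP version, this is case-sensitive, so "Tofu" and "tofu" are counted as different words unless the text is lowercased first.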


Load and Parse Large XML Files in PHP

Usually, PHP is limited to using somewhere between 16 MB and 128 MB of RAM. So what happens if you want to parse a 1.1 GB file of exported product data (over 500,000 products) without hitting the memory limit?

At first this seems a pretty impossible task, because building a tree from the XML requires the entire document to be in memory.

Usually you would run something such as file_get_contents() and then parse the contents returned, but this would load the entire 1.1 GB of XML and put you well beyond most PHP memory limits.

What you need to do is read the XML in small chunks (the example below uses 128 KB chunks) and feed them to the parser one at a time. This way you get through the XML file quickly while steering clear of the PHP memory limit.

set_time_limit(0);
define('__BUFFER_SIZE__', 131072);
define('__XML_FILE__', 'pf_1360591.xml');

function elementStart($p, $n, $a) {
  //handle opening of elements
}

function elementEnd($p, $n) {
  //handle closing of elements
}

function elementData($p, $d) {
  //handle cdata in elements
}

$xml = xml_parser_create();

xml_parser_set_option($xml, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($xml, XML_OPTION_SKIP_WHITE, 1);

xml_set_element_handler($xml, 'elementStart', 'elementEnd');
xml_set_character_data_handler($xml, 'elementData');

$f = fopen(__XML_FILE__, 'r');
if($f) {
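  while(!feof($f)) {
    $content = fread($f, __BUFFER_SIZE__);

    // The third argument tells the parser whether this is the final chunk.
    if(!xml_parse($xml, $content, feof($f))) {
      die(sprintf('XML error: %s at line %d',
        xml_error_string(xml_get_error_code($xml)),
        xml_get_current_line_number($xml)));
    }

    unset($content);
  }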
  fclose($f);
}

Imitate target=_blank with jquery

You can replace target=_blank with jQuery, Prototype or raw JavaScript if you like. Here is the code I came up with to do just this.

jQuery

$(function() {
	$('a[rel*=external]').click(function() {
		var w = window.open(this.href);
		if(!w) alert("Boo! A popup blocker stopped our window from opening");
		return false;
	});
});
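
For comparison, a raw JavaScript version might look like the sketch below. It is not taken from the original post: hasExternalRel and bindExternalLinks are hypothetical names, and the document and window objects are passed in as parameters so the function can be exercised outside a browser.

```javascript
// True when the rel attribute contains "external" anywhere, mirroring the
// jQuery selector a[rel*=external].
function hasExternalRel(rel) {
  return typeof rel === 'string' && rel.indexOf('external') !== -1;
}

// Attach a click handler to every matching link that opens it in a new window.
function bindExternalLinks(doc, win) {
  var links = doc.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (!hasExternalRel(links[i].getAttribute('rel'))) {
      continue;
    }
    links[i].addEventListener('click', function (event) {
      var opened = win.open(this.href);
      if (!opened) {
        win.alert('Boo! A popup blocker stopped our window from opening');
      }
      event.preventDefault(); // same effect as "return false" in the jQuery handler
    });
  }
}

// In a browser, bind once the DOM is ready.
if (typeof document !== 'undefined') {
  document.addEventListener('DOMContentLoaded', function () {
    bindExternalLinks(document, window);
  });
}
```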

Appliance World Online

This website is one of my earlier large projects, and was created for a company in the Manchester area. It was the basis and the drive for creating the DCom ecommerce system, and took roughly four weeks to create from start to finish. It includes a full product information system, a shopping cart, and a user registration/login system with a user control panel, so that people who register can keep a wish list of items they wish to buy in the future and view their previous orders. It also has a unique, infinitely recursive category system, which allows as many categories within the tree, and as many levels deep, as you would ever need. Unlike other ecommerce solutions, the website also features a linking system that lets you place any one product under as many categories as you wish.

