Start In Business
  IT tips for brokers, business sales agents, and maintainers of published catalogues

Keeping a catalogue of ads updated automatically on multiple sites

Basic data syndication

By John Walker
Copyright 2010

This article is primarily aimed at business sales agents and estate agents who are looking to find an easier way to maintain their listings of properties on the various ad directory web sites that they use.

The Problem

A business transfer agency maintains a list/catalogue of all the businesses that they currently have on offer. This catalogue is for internal use and acts as the 'master' copy of that information. So, let's call it the master catalogue.

They also promote clients' 'businesses for sale' on their own site and various internet directories (online promotions). Each of these web directories has a control panel where the agent's staff can log in to add, edit and delete the directory's entries.

The master catalogue changes often (whenever an item is created, amended or deleted). Whenever this happens, the change has to be reflected in their online promotions. In many cases, a member of staff will simply log in to each online directory in turn and perform the appropriate modifications.

If the agency only has a few clients, the practice of manually updating their online promotions is a slight chore but probably the best solution. However, as the agency takes on more and more clients, the catalogue grows, the rate of change increases and the process of cascading every change over to the copies becomes more and more tiresome.

Furthermore, the more steps involved in cascading the changes to the copies, the more chance there is of an error occurring.

Number of actions required =
 Number of Changes to Master Catalogue x Number of Slave Copies
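
As a rough worked example, the formula above can be written as a small Python function. The figures used (12 changes in a week, 6 directories) are made up purely for illustration.

```python
def manual_actions(changes_to_master: int, slave_copies: int) -> int:
    """Number of manual edits needed to cascade every change to every copy."""
    return changes_to_master * slave_copies


# Hypothetical figures: 12 catalogue changes in a week, promoted on 6 directories.
weekly_actions = manual_actions(12, 6)
print(weekly_actions)  # 72 separate log-in-and-edit actions
```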

The Solutions

Ideally, the agent's staff would only have to change the master catalogue and then trigger a process that updates the online promotions automatically.

There are many ways to accomplish this, and various technical standards and practices have evolved, such as Single Sourcing, Screen Scraping/Crawling, RSS (Really Simple Syndication), Atom, XML and, more broadly, the Semantic Web.

From my own experience, as soon as I start rattling off these buzzwords and acronyms to an owner of a business transfer agency their attention starts waning dramatically. So, I will try to distill the concepts into basic layman's terms by forfeiting a degree of technical accuracy.

Web Directory's Published API

In this scenario, the web directory publisher publishes a set of one-size-fits-all specifications known as an Application Programming Interface or API. If you think of the web directory as 'a software application', the API describes how other software programs should interact with it. Computer programmers can then write custom software (on behalf of the business transfer agents) which allows them to automatically send a series of instructions (such as create, modify or delete) to the web directory, which then processes them accordingly.

Publishing an Application Programming Interface can be quite restrictive for the publisher of the web directory. Once the specifications have been set in stone and published, they are tricky to change. The clients/agents will have worked hard to satisfy them and will not be too happy if they are constantly being forced to tweak and test their own system at every whim of the directory publisher. See a Google Video About Designing Application Interfaces.

Often, for a small-to-medium-sized business transfer agency, the time and effort involved in writing custom software that caters to the needs of each of the many web directory APIs is prohibitive.
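
To give a feel for where that effort goes, here is a minimal Python sketch. Both directory classes, their method names and their field names are invented for illustration; real directories would be reached over the network and each would have its own documented API. The point is only that one change to the master catalogue needs a separate translation for every directory.

```python
class DirectoryA:
    """Hypothetical directory: ads are keyed by 'ref' and carry a 'title'."""

    def __init__(self):
        self.ads = {}

    def create_ad(self, ref, title):
        self.ads[ref] = {"title": title}

    def delete_ad(self, ref):
        self.ads.pop(ref, None)


class DirectoryB:
    """Hypothetical directory: same job, different names and data shape."""

    def __init__(self):
        self.listings = []

    def add_listing(self, listing):
        self.listings.append(listing)

    def remove_listing(self, agent_code):
        self.listings = [l for l in self.listings if l["agent_code"] != agent_code]


def push_create(item, dir_a, dir_b):
    """Translate one new master catalogue item for each directory in turn."""
    dir_a.create_ad(ref=item["ref"], title=item["business_type"])
    dir_b.add_listing({"agent_code": item["ref"], "headline": item["business_type"]})


def push_delete(ref, dir_a, dir_b):
    """Translate one deletion for each directory in turn."""
    dir_a.delete_ad(ref)
    dir_b.remove_listing(ref)


a, b = DirectoryA(), DirectoryB()
push_create({"ref": "100395", "business_type": "Fish and Chip Shop"}, a, b)
print(a.ads)
print(b.listings)
```

Every extra directory means another pair of translations to write, test and keep up to date.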

Screen Scraping / Crawling

One way of synchronising a web directory with an agent's catalogue is to get the publishers of the directory to automatically visit the agent's web site, crawling each page to gather up all the information.

Under the hood, a web page is simply a text document that holds information which is wrapped in tags. The tags mark up the information so that software (i.e. a web browser) can gain a bit more understanding about what it is and how it should be presented. It's what HTML is all about.

<tag>information</tag>

A web page promoting a business for sale might look like this:

<html>
<head>
<title>Fish and Chip Shop For Sale</title>
</head>
<body>
<h1 class="business_type">Fish and Chip Shop</h1>
<h2 class="location">Essex Seaside town</h2>
<p>Advert ref: 100395</p>
<p class="description">Located in the popular sea-side town of...</p>
<img src="http://example.com/pictures/100395/1.jpg">
</body>
</html>

This information can be used by a web browser to know how to render:

  • the page title <title>,
  • a top level headline <h1>,
  • a secondary level headline <h2>, and
  • a couple of paragraphs <p>
  • an image <img>

Because the information is marked up so nicely, a web directory publisher can put on his Sherlock Holmes deerstalker hat and use this information to deduce that:

Advert ref = 100395
business_type = Fish and Chip Shop
location = Essex Seaside town
description = Located in the popular sea-side town of...

This means that they can write some software that automatically grabs the page and then use that deduced information to update their copy of catalogue item #100395.
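
A minimal sketch of such software, using only Python's standard library, might look like the following. It works on a copy of the sample page above (the image is left out for brevity). The class name AdScraper and the field name advert_ref are assumptions made for this example.

```python
from html.parser import HTMLParser

PAGE = """<html>
<head><title>Fish and Chip Shop For Sale</title></head>
<body>
<h1 class="business_type">Fish and Chip Shop</h1>
<h2 class="location">Essex Seaside town</h2>
<p>Advert ref: 100395</p>
<p class="description">Located in the popular sea-side town of...</p>
</body>
</html>"""

WANTED = {"business_type", "location", "description"}


class AdScraper(HTMLParser):
    """Collects the text of elements whose class attribute names a wanted field."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self._current = css_class if css_class in WANTED else None

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.fields[self._current] = text
        elif text.startswith("Advert ref:"):
            # No class on this paragraph, so the scraper has to guess from the wording.
            self.fields["advert_ref"] = text.split(":", 1)[1].strip()


scraper = AdScraper()
scraper.feed(PAGE)
print(scraper.fields)
```

Notice that the advert reference can only be found by matching the wording of the paragraph. If the agent rewords it, or renames one of the classes, the scraper quietly stops finding that field.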

This is very easy for the agent. All they have to do is provide a web address. But it is quite a tricky task for the publishers of the directory to pull off with 100% accuracy (even if the information is served up in a well-structured format). It's brittle. If the agent's site changes, then the publisher of the directory may have to change the way its software reads and understands the new-look site.

The key point is that:

The information in the web page is marked-up to let browsers know how to display the information. That's its job.

It is NOT marked-up to make web directories know how to understand the information.

It's a subtle but important difference. By imposing an additional responsibility onto the web pages (using rubber bands and chewing gum), an agency is lining itself up for a bite on the bottom sometime down the line.

This leads on to the next option.

Agent-Provided Data in an Agreed Format (XML or CSV)

So:

  • an agent cannot invest in writing custom software for each directory's API to automatically push the master catalogue data into their online promotions, and
  • a web directory cannot easily and reliably pull the data from the agent's web site using screen scraping techniques

The third solution is a slight variation on the first two:

  • the business transfer agent works with the web directory publishers to agree on the format of the data to be exchanged
  • changes occur to the master catalogue during day-to-day operations of the agency
  • at intervals, the agent exports the salient master catalogue data into a file (e.g. current-items.xml or current-items.txt)
  • the data file is posted onto the web (e.g. at http://example.com/current-items.txt)
  • the web directory publishers automatically collect the data file and import it into their directory on a regular basis

So, instead of scraping web pages and guessing how to interpret the information contained within, the web directory publisher is pointed to a text file in which the information is marked-up specifically for the purpose of syndication (not for display).
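
On the directory's side, the import step could be sketched roughly as below. The feed text stands in for whatever would be downloaded from the agent's URL. The column names and the second item (100412) are made up for illustration, since the real format is whatever the two parties agree on.

```python
import csv
import io

# Stand-in for the text a directory would download from the agent's URL.
# The column names are assumptions; the real ones are whatever both parties agree on.
FEED = """ref,business_type,location
100395,Fish and Chip Shop,Essex Seaside town
100412,Newsagent,Kent village
"""


def import_feed(feed_text, directory):
    """Make the directory's copy match the feed: add, amend and delete as needed."""
    incoming = {row["ref"]: row for row in csv.DictReader(io.StringIO(feed_text))}
    for ref in list(directory):
        if ref not in incoming:
            del directory[ref]  # item has left the master catalogue
    directory.update(incoming)  # new and amended items
    return directory


# The directory's current copy: one withdrawn item and one out-of-date item.
directory_copy = {
    "100201": {"ref": "100201", "business_type": "Cafe", "location": "Suffolk"},
    "100395": {"ref": "100395", "business_type": "Chip Shop", "location": "Essex"},
}
import_feed(FEED, directory_copy)
print(sorted(directory_copy))  # ['100395', '100412']
```

Because the file always lists every current item, the directory does not need to be told what changed. It simply makes its copy match the file.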

In order for this solution to work the agent must have the ability to:

  • Export the salient master catalogue data into a file
  • Post that file onto the web

These are not particularly difficult things to do.
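
As a rough illustration of the export step, the Python sketch below writes a made-up slice of a master catalogue as CSV text. The field names (including the internal-only vendor_phone) are assumptions for this example, not a recommended format.

```python
import csv
import io

# A made-up slice of the master catalogue; the field names are assumptions.
master_catalogue = [
    {"ref": "100395", "business_type": "Fish and Chip Shop",
     "location": "Essex Seaside town", "vendor_phone": "(internal only)"},
]

# Only the salient fields are exported; internal details stay in the master copy.
EXPORT_FIELDS = ["ref", "business_type", "location"]


def export_catalogue(items, out):
    """Write the chosen fields of each item as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=EXPORT_FIELDS, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)


buffer = io.StringIO()
export_catalogue(master_catalogue, buffer)
print(buffer.getvalue())
```

In practice the same function could be handed a real file, for example one opened with open("current-items.txt", "w", newline=""), which is then uploaded to the agent's web site.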

It is slightly more effort than simply providing a web site address and asking the web directory publisher to screen scrape it. But it's a lot less effort than writing custom software that 'talks' to a web directory application's API.

The agent is making a small effort towards structuring data for easier exchange which makes the web directory publisher's job a whole lot easier.

Exchanging Data

Which file type is best?

It does not really matter which file type is used as long as it can be read by a large number of software applications.

A Word document (something.doc), for example, is NOT very easy to read. In order to read a Word document you need to open it within software that is specially designed for that purpose, i.e. Microsoft Word. Likewise, an Excel file (something.xls) can only be opened by Microsoft Excel (or similar applications).

A text file (something.txt), on the other hand, CAN be read by many applications on many different operating systems (Windows, Apple Mac, Linux etc.) and so is a good choice.

Within that text file, the data can be structured in the form of a CSV (Comma-Separated Values) spreadsheet or perhaps marked up with XML tags.

CSV

CSV data is structured like a data table, with each table row represented by a line and each column separated by a comma. Often, the first row is used to label the columns.

So this:

Name,Fave Colour,Fave Food
Joe,Brown,Egg
Sue,Green,Rice

means:

Name    Fave Colour    Fave Food
Joe     Brown          Egg
Sue     Green          Rice
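
For the technically curious, this is roughly how a program would read that data using Python's built-in csv module. The example data is written here without spaces around the commas, because Python's csv reader would otherwise treat those spaces as part of each value.

```python
import csv
import io

CSV_TEXT = """Name,Fave Colour,Fave Food
Joe,Brown,Egg
Sue,Green,Rice
"""

# The first row supplies the column labels; each later row becomes a dictionary.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
for row in rows:
    print(row["Name"], "likes", row["Fave Food"])
```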

XML

XML is a mark-up language which is similar to the HTML used in web documents (see example above). The difference is that XML is a bit more extensible. It can be used to mark-up almost anything that you can imagine.

If we converted the CSV data, above, into XML it could look like this:

<mydata>
 <record>
  <name>Joe</name>
  <favecolour>Brown</favecolour>
  <favefood>Egg</favefood>
 </record>
 <record>
  <name>Sue</name>
  <favecolour>Green</favecolour>
  <favefood>Rice</favefood>
 </record>
</mydata>
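
Reading the XML version is just as routine for a program. The sketch below uses Python's standard xml.etree.ElementTree module to turn each record into the same kind of name/value structure that a CSV row gives.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<mydata>
 <record>
  <name>Joe</name>
  <favecolour>Brown</favecolour>
  <favefood>Egg</favefood>
 </record>
 <record>
  <name>Sue</name>
  <favecolour>Green</favecolour>
  <favefood>Rice</favefood>
 </record>
</mydata>"""

root = ET.fromstring(XML_TEXT)
# Each <record> becomes a dictionary of tag name -> text, much like a CSV row.
records = [{field.tag: field.text for field in record}
           for record in root.findall("record")]
print(records)
```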

But what about the pictures?

A lot of business transfer agents see the internet as a kind of 'publishing platform', as a series of web pages to visit and read like a giant interactive magazine. But, when it comes to data syndication, it is probably best to think of the internet as a 'repository of resources'.

An agent's list of current items is a 'resource' that can be found at a certain location, i.e. a web address like http://example.com/current-items.txt. (In fact, another term for 'web address' is 'URL', which stands for Uniform Resource Locator.) As long as that resource stays in that location, an agent can simply distribute that URL to all interested parties. It's virtually the same as actually distributing the resource itself.

If you delve into the semantics of this too much your head might explode. But, here's an analogy: Think of the agent's information as 'water'. Think of the list of current items as a 'tap'. Think of the URL of the list of current items as the 'address of the tap'. In the virtual world of the Internet you can distribute the 'tap' to everyone who needs it and, at the same time, control the water that runs through it.

But wait...there's more!

The agent can control the water that runs through those taps. So, they can add things to that water. Furthermore, because distributing the address of a thing is as good as distributing the thing itself, the thing added does not even have to fit through the tap.

To syndicate the pictures in the catalogue, all an agent has to do is think of a picture as one type of resource (a relatively inactive one) that is channeled through another type of resource.

Then:

  • Ensure that the picture can be found on the internet.
  • Make a note of where it is located (in the form of a URL/web address)
  • Store that URL (along with the other data about the business) as a record in the master catalogue
  • Syndicate the catalogue

Most agents who syndicate their catalogue would have no more than 5 pictures for each business. So, a data record could simply 'store the pictures' in the form of XML text:

...
<picture1>http://example.com/pictures/100395/1.jpg</picture1>
<picture2>http://example.com/pictures/100395/2.jpg</picture2>
...
<picture5>http://example.com/pictures/100395/5.jpg</picture5>
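
A directory importing such a record could pick out the picture addresses along these lines. The surrounding record and its ref tag are assumptions for this sketch; only the picture tags and URL pattern come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; only the picture tags and URL pattern come from the article.
RECORD = """<record>
 <ref>100395</ref>
 <picture1>http://example.com/pictures/100395/1.jpg</picture1>
 <picture2>http://example.com/pictures/100395/2.jpg</picture2>
</record>"""

record = ET.fromstring(RECORD)
# Gather every element whose tag starts with "picture", in document order.
picture_urls = [el.text for el in record if el.tag.startswith("picture")]
print(picture_urls)
```

The directory can then fetch each address in the list whenever it needs the picture itself.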

Summary

  • It's more efficient for an agent to edit the master catalogue once and syndicate the information to the various internet directories
  • Syndicating the information is as simple as:
    • Ensuring data is in an agreed format
    • Posting it as a set of 'resources' on the web
    • Ensuring web directory publishers know where the resources are (providing the URL)
  • The web directory publishers can do the rest

Questions?

If you would like to know more about how to enable easy syndication of your catalogue, feel free to call me on:



