Or: The science behind RP Data’s RP Prospector Tool

My side of the RP Data business is one that operates as a ‘Data Factory’.  What we do is obtain data from different sources and present it to users on a website or through a web service.

Sounds pretty simple.

The reality is we are taking thousands of different data fields from hundreds of different feeds, matching them to each other, de-duplicating them and removing ‘junk’, transforming them into something meaningful, and then charting, mapping, modelling, indexing, valuing, deriving analytics from, storing, hosting, and presenting a coherent picture drawn from the data.  Our goal is to deliver your data and analytics requirements simply – regardless of the underlying complexity.

So how do we know how well we are doing?

This post will describe one of the ways – the RP Prospector button and its use.

We tune Prospector to give you the best ‘bang for your buck’

If you look at the ‘Find Leads’ button in the RP Prospector box in RP Professional, you will see an example of us taking a lot of data and simplifying it down to a button.

The work we do on this simple button is extensive.  We use a model that takes various data inputs and then works out which properties are more likely to list in the next 3 months.  The model provides simple output designed for:

  • Real estate agents: optimizing which of their contacts (i.e. those people they have an existing relationship with) to talk to
  • Mortgage brokers: helping them to keep their mortgage trail and cross-sell to non-mortgage clients
  • Financial planners: knowing that there is an imminent event in a client’s life that gives them a reason to re-establish contact and provide advice

The underlying model that drives RP Prospector is even used by Banks, Utilities and other large enterprises as part of the RP Data ‘ARCS’ process: Acquire ~ Retain ~ Cross-Sell.

 How does the button work and how accurate is it?

First off, what we are trying to do is take a list of properties (the properties in the suburb you choose) and break them into 2 groups:

  • Find all the properties that are going to list for sale in the next 3 months
  • Find all the properties that are not going to list for sale in the next 3 months

These 2 lists make up all of the properties in the suburb.

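Of course it would be great if we could accurately predict for every property whether it was going to list or not.  But in real life we make errors – what are called ‘false positives’ and ‘false negatives’.  So in truth every property ends up in one of four groups:

  • We predicted it would list and it did (a true positive)
  • We predicted it would list and it did not (a false positive)
  • We predicted it would not list and it did (a false negative)
  • We predicted it would not list and it did not (a true negative)

This is otherwise known as the ‘confusion matrix’ – which may be an apt term!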
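To make the four cells of the confusion matrix concrete, here is a minimal Python sketch that counts them from a list of predictions and a list of what actually happened.  The function name `confusion_counts` and the eight-property example are made up for illustration; they are not taken from the RP Prospector code.

```python
from collections import Counter


def confusion_counts(predicted, actual):
    """Count the four confusion-matrix cells.

    predicted, actual: equal-length sequences of booleans, where True means
    'lists for sale in the next 3 months'. (Hypothetical helper.)
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = Counter()
    for p, a in zip(predicted, actual):
        if p and a:
            counts["true_positive"] += 1    # we said 'will list' and it did
        elif p and not a:
            counts["false_positive"] += 1   # we said 'will list' and it did not
        elif not p and a:
            counts["false_negative"] += 1   # we said 'won't list' and it did
        else:
            counts["true_negative"] += 1    # we said 'won't list' and it did not
    keys = ("true_positive", "false_positive", "false_negative", "true_negative")
    return {k: counts[k] for k in keys}


# Made-up example: 8 properties in a suburb.
predicted = [True, True, True, False, False, False, False, False]
actual = [True, False, True, True, False, False, False, False]
cells = confusion_counts(predicted, actual)
print(cells)
```

Every property falls into exactly one cell, so the four counts always add up to the number of properties in the suburb.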

So how do we use this to know how well we have done and how can we maximize your ‘hit rate’?

We have a ‘model’ that is developed to place a property in either the ‘will list for sale’ or ‘won’t list for sale’ bucket.  This model is built using different algorithms – from a simple case based on the properties currently selling and their median hold period (‘current OTM spike report’), to models based on external and internal variables like interest rates, % equity the homeowner has, listings frequency in the same street, consumer report activity from the myRP site, etc.
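As a rough illustration of the simplest case, the sketch below flags properties whose time since last sale is close to the median hold period of properties currently on the market.  This is only one possible reading of the idea: the function name `flag_by_hold_period`, the one-year `window`, and all of the numbers are assumptions for illustration, not the actual ‘current OTM spike report’ logic.

```python
from statistics import median


def flag_by_hold_period(years_held, on_market_hold_years, window=1.0):
    """Flag properties held for roughly the typical hold period.

    years_held: dict of property id -> years since the property last sold.
    on_market_hold_years: hold periods (in years) of properties currently
        listed for sale.
    window: how close (in years) to the median counts as 'close' (assumed).
    """
    typical = median(on_market_hold_years)
    return {pid: abs(held - typical) <= window for pid, held in years_held.items()}


# Made-up data: current listings were held for a median of 8 years.
on_market_hold_years = [6.0, 7.5, 8.0, 9.0, 12.0]
years_held = {"A": 7.5, "B": 2.0, "C": 8.9, "D": 15.0}
flags = flag_by_hold_period(years_held, on_market_hold_years)
print(flags)
```

A rule this simple ignores everything else about the property and its owner, which is why the richer models add variables such as interest rates and equity.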

It is then a simple matter of testing how well our model works when compared to ‘random’ – i.e. to just randomly selecting a group of properties from your list to contact.

So we give you a model that pushes you ‘above the line’, where the line is the hit rate you would get from random selection – i.e. for the same number of contacts you get a better ‘hit rate’.

And we can measure our model by how far ‘above the line’ it goes – i.e. how much better than random it is.  This is not too different to how good agents know who is going to list from their contacts and who isn’t – they can tell from conversations, from when the person last sold, from family circumstances, what is happening on the street, etc.  The agent has a model ‘in their head’ – and sometimes in their Client Relationship Management database – that builds on their hyper-local knowledge to enable them to just ‘know who to contact’.  These agents operate far better ‘than random’.  Our intention is to augment this knowledge with data and push even higher ‘above the line’.
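One common way to put a number on ‘how far above the line’ is to compare the hit rate among the top-scored contacts with the hit rate you would expect from picking at random.  The sketch below assumes the model outputs a score per property (higher means more likely to list); the function names and data are hypothetical and are not the measure RP Data necessarily uses.

```python
def hit_rate_at_k(scores, listed, k):
    """Share of the k highest-scored properties that actually listed."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of properties")
    ranked = sorted(zip(scores, listed), key=lambda pair: pair[0], reverse=True)
    return sum(1 for _, did_list in ranked[:k] if did_list) / k


def lift_over_random(scores, listed, k):
    """How many times better than random the top-k selection is."""
    base_rate = sum(1 for did_list in listed if did_list) / len(listed)
    if base_rate == 0:
        raise ValueError("no property listed, so lift is undefined")
    # A random pick of k properties has an expected hit rate equal to base_rate.
    return hit_rate_at_k(scores, listed, k) / base_rate


# Made-up data: 10 properties, 2 of which went on to list.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
listed = [True, False, True, False, False, False, False, False, False, False]
top4_hit_rate = hit_rate_at_k(scores, listed, 4)
top4_lift = lift_over_random(scores, listed, 4)
print(top4_hit_rate, top4_lift)
```

A lift of 1 means the model is no better than random; anything above 1 is ‘above the line’.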

The best model we can give you is the one where: ‘every person you contact is going to list their property’.  This is the 100% hit rate – no false positives or false negatives!

This means we move from a picture that includes false positives and false negatives to one that has neither, i.e. everyone you contact is going to list.

This is the holy grail of a prospector model (and we are not there yet).

The holy grail is that every contact you make is with someone who is going to list – 100% effectiveness (no false negatives) with 100% efficiency (no false positives).
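In standard terms, ‘efficiency’ here corresponds to what is usually called precision (the share of contacts that list) and ‘effectiveness’ to recall (the share of listers you contacted).  That mapping is my reading rather than something the product documents.  A minimal sketch with made-up counts:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts.

    precision = tp / (tp + fp): of those we contacted, how many listed.
    recall    = tp / (tp + fn): of those who listed, how many we contacted.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# The holy grail: no false positives and no false negatives.
perfect = precision_recall(tp=5, fp=0, fn=0)

# A more realistic, made-up case: one wasted call and one missed lister.
typical = precision_recall(tp=2, fp=1, fn=1)
print(perfect, typical)
```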

This is, incidentally, the basis of ‘ROC’ (receiver operating characteristic) curves, which originated in World War 2 as a way to work out how ‘good’ radar was at finding enemy planes.  Back then the prediction really was a matter of life or death:

  • Predict a radar signal is positive for an enemy bomber.  Get it wrong and you have sent up a fighter plane in vain.
  • Predict a radar signal is negative for an enemy bomber.  Get it wrong and the bomber gets through and bombs your city!

ROC curves are now used in psychology, medicine, radiology, biometrics, pharmacology, and inside the RP Prospector product (so in marketing for real estate too).
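For the curious, a ROC curve can be built by ranking properties by model score and, as the cut-off is lowered, tracking the share of listers caught (true positive rate) against the share of non-listers wrongly flagged (false positive rate).  The sketch below is a generic textbook construction with made-up scores, not the RP Prospector implementation; `roc_points` and `area_under_curve` are hypothetical names.

```python
def roc_points(scores, listed):
    """Return ROC points as (false_positive_rate, true_positive_rate) pairs."""
    positives = sum(1 for did_list in listed if did_list)
    negatives = len(listed) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one lister and one non-lister")
    ranked = sorted(zip(scores, listed), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, did_list) in enumerate(ranked):
        if did_list:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all properties tied on this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve: 0.5 is random, 1.0 is perfect."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up data: 5 properties, 3 of which listed.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
listed = [True, True, False, True, False]
points = roc_points(scores, listed)
auc = area_under_curve(points)
print(points, auc)
```

The diagonal from (0, 0) to (1, 1) is the ‘random’ line; the further the curve bows above it, the larger the area and the better the model.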

Watch this space for updates / additions as we tune the models we use and get further ‘above the line’.


P.S. Wikipedia has an excellent introduction to ROC curves for anyone interested in getting the next level of detail.  Search for ‘Receiver operating characteristic’.