Sunday, January 3, 2010

Perils of Sampling

You have so much online data, and storing it is so difficult. How about keeping just a sample of it? And what is a good sampling ratio: 1:1 (no sampling at all), 2:1, 4:1 or 16:1?

Hmm... good idea? It will save me COSTS and PROCESSING TIME.

Election results for an entire nation of 100 crore (1 billion) people are predicted from a sample of 10,000, and some of those predictions turn out to be accurate. So is 16:1 a good sample for online data?

My answer is an EMPHATIC NO.

The questions that an election survey and online analytics are trying to answer are VERY DIFFERENT.
* An election survey is trying to answer a SINGLE QUESTION with RESTRICTED CHOICES (which party is going to win?).
* In online data the QUESTIONS ARE MANY and the CHOICES ARE NOT KNOWN in advance.

Sampling online data can be HAZARDOUS for PAGES THAT GET A FRACTION OF THE SITE VISITS. Say there is a site with 10K visits a day and a page with 100 visits a day.

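With 4:1 sampling, the results are based on 2.5K out of the 10K visits. You would expect about 25 of those sampled visits to belong to the page, but it is quite possible that only 10 of the 2.5K fell into the 100 for the page. Scaled back up, that reports 40 visits for a page that actually had 100. The results are going to be OBVIOUSLY WRONG.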
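To get a feel for how far off a sampled count can be, here is a minimal simulation sketch in Python. It uses the numbers from the example above (10K site visits, a page with 100 visits, 2.5K visits kept) and assumes the sample is a simple random sample without replacement, which may not match how a real analytics tool samples. The names `estimate_page_visits` and `simulate` are made up for this illustration.

```python
import random
import statistics

SITE_VISITS = 10_000   # visits per day for the whole site (from the example)
PAGE_VISITS = 100      # visits per day for the small page (from the example)
SAMPLE_SIZE = 2_500    # 4:1 sampling keeps 2.5K of the 10K visits


def estimate_page_visits(rng):
    """Draw one random sample of visits and scale the page count back up."""
    # Assumption: visits numbered 0..PAGE_VISITS-1 are the ones for the page.
    sampled = rng.sample(range(SITE_VISITS), SAMPLE_SIZE)
    hits = sum(1 for visit in sampled if visit < PAGE_VISITS)
    # Scale the sampled count up by the sampling ratio (here, times 4).
    return hits * SITE_VISITS // SAMPLE_SIZE


def simulate(trials=1000, seed=0):
    """Repeat the sampling many times to see how much the estimate varies."""
    rng = random.Random(seed)
    return [estimate_page_visits(rng) for _ in range(trials)]


estimates = simulate()
print("true page visits :", PAGE_VISITS)
print("lowest estimate  :", min(estimates))
print("highest estimate :", max(estimates))
print("average estimate :", round(statistics.mean(estimates), 1))
print("std deviation    :", round(statistics.pstdev(estimates), 1))
```

The average over many runs lands close to 100, but any single day's sampled report gives you only one of these estimates, and the gap between the lowest and highest shows how far a single report can stray. Changing `SAMPLE_SIZE` to 625 (16:1) makes the spread wider still.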

So my advice: it is a MUST to store ALL DATA for a reasonable TIME PERIOD.

