Improved Variable and Value Ranking Techniques for Mining Categorical Traffic Accident Data

  • December 1st, 2005

This paper reviews two new metrics for assessing the significance of attributes in a database when two subsets of the data are compared. Traditional statistical techniques are useful, and the sample sizes in public safety databases are usually large enough for the normal approximation to the binomial distribution to be used in comparing proportions. For example, the proportion of alcohol-related crashes occurring on Saturdays would prove to be very highly significantly greater than the corresponding proportion for non-alcohol-related crashes. However, the new metrics go a step further: they give the user a clear, intuitive grasp of exactly how much more is occurring, not in terms of proportions but in terms of number of crashes (for the traffic safety example). One of these metrics is called Maximum Gain; it measures directly the number of crashes over and above the number that would typically be expected. This gives the user a clear indication of the potential gain from applying a countermeasure related to the attribute (e.g., applying selective enforcement on Saturdays). It is not realistic to think that this gain would include all of the crashes for the attribute value; rather, it is realistic to view the maximum gain as the total over-represented amount.
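
The Saturday example can be made concrete with a minimal Python sketch. The abstract does not state a formula, so this assumes that the "typically expected" count is the subset's total multiplied by the comparison subset's proportion for the same attribute value, and that Maximum Gain is the excess over that count. The function names and the crash counts are hypothetical, chosen only for illustration. The second function is the standard pooled two-proportion z-test, which is one common form of the normal approximation the abstract mentions.

```python
import math


def max_gain(subset_count, subset_total, other_count, other_total):
    """Over-represented crashes: observed count minus the count expected
    if the subset had the comparison subset's proportion (assumed formula)."""
    expected = subset_total * other_count / other_total
    return max(0.0, subset_count - expected)


def two_proportion_z(subset_count, subset_total, other_count, other_total):
    """Pooled two-proportion z statistic (normal approximation to the binomial)."""
    p1 = subset_count / subset_total
    p2 = other_count / other_total
    pooled = (subset_count + other_count) / (subset_total + other_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / subset_total + 1 / other_total))
    return (p1 - p2) / se


# Hypothetical counts: 250 of 1,000 alcohol-related crashes fall on Saturday,
# against 1,500 of 10,000 non-alcohol-related crashes.
gain = max_gain(250, 1000, 1500, 10000)
z = two_proportion_z(250, 1000, 1500, 10000)
print(f"Maximum Gain: {gain:.0f} crashes, z = {z:.2f}")
```

With these made-up counts, 150 Saturday crashes would be expected among the alcohol-related crashes, so the over-represented amount is 100 crashes rather than all 250. The z statistic says only that the difference is significant; the gain says how many crashes a Saturday countermeasure could at most address.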