By Bahar Gidwani
In the previous post, we showed that our rating system results in overall scores that follow a Beta distribution. This is consistent with the idea that there is a “norm” for each of the twelve subcategories that we measure. Companies that meet this norm should score around 50 on a 0 to 100 scale, while those that exceed it should score higher and those that fall short should score lower.
How confident can we be that a company that gets a higher overall score really is performing better? We will first state that we do not claim to give the “correct” overall score for any company, industry, or country. Instead, our goal is to provide the best available estimate of how a company’s sustainability performance is viewed by those who rate and measure it. So, if all of those who track a company are fooled into thinking it is good (when it is not), our score will be high (even though it should not be).
We base our scores on a wide variety of sources—including socially responsible investment (SRI) research houses, not-for-profit organizations, government databases, and data from activists and consumer groups. On average, we have about ten different sources for each of the roughly 5,000 companies we rate. Our database contains more than 2,000,000 separate rating contributions.
Each of our overall scores is the result of separate evaluations of a company’s performance for each of our twelve subcategories. The twelve subcategory scores are rolled up to the four category levels and then adjusted by the settings in a user’s profile to give a final, overall score. When we ask how accurate our scores are, we are asking how accurately they reflect the conclusion a particular user might have arrived at if she or he had reviewed and evaluated all the data points we have collected both for a particular company and for all of its peers and competitors.
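To make that roll-up concrete, here is a minimal sketch in Python. The subcategory names, the grouping into four categories, and the user-profile weights are illustrative placeholders, not our actual schema or weighting scheme.

```python
# Minimal sketch of the roll-up: twelve subcategory scores -> four category
# scores -> one overall score, weighted by a user's profile settings.
# All names, groupings, and weights below are illustrative placeholders.

subcategory_scores = {
    "community_dev": 55, "philanthropy": 62, "product_quality": 48,
    "compensation": 51, "diversity": 44, "training_safety": 58,
    "energy_climate": 65, "env_policy": 50, "resource_mgmt": 47,
    "board": 53, "leadership_ethics": 49, "transparency": 60,
}

categories = {  # hypothetical grouping of the twelve subcategories into four
    "community":   ["community_dev", "philanthropy", "product_quality"],
    "employees":   ["compensation", "diversity", "training_safety"],
    "environment": ["energy_climate", "env_policy", "resource_mgmt"],
    "governance":  ["board", "leadership_ethics", "transparency"],
}

user_profile_weights = {  # hypothetical user emphasis; sums to 1.0
    "community": 0.20, "employees": 0.25, "environment": 0.35, "governance": 0.20,
}

category_scores = {
    cat: sum(subcategory_scores[s] for s in subs) / len(subs)
    for cat, subs in categories.items()
}
overall_score = sum(category_scores[c] * w for c, w in user_profile_weights.items())
print(category_scores)
print(round(overall_score, 1))
```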
Most of our sources provide more than one rating data element for each company they follow. We do not rate companies when we have fewer than about 70 rating elements. We require data for at least nine subcategories from at least two sources before we will attempt to publish an overall rating.
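A sketch of that publication threshold might look like the following; the data structure and field names are hypothetical, and the thresholds simply mirror the ones described above.

```python
# Sketch of the "enough data to publish?" check described above: roughly 70 or
# more rating elements, covering at least nine subcategories, drawn from at
# least two distinct sources. Field names here are hypothetical.

def can_publish(rating_elements, min_elements=70, min_subcategories=9, min_sources=2):
    subcategories = {e["subcategory"] for e in rating_elements}
    sources = {e["source"] for e in rating_elements}
    return (len(rating_elements) >= min_elements
            and len(subcategories) >= min_subcategories
            and len(sources) >= min_sources)
```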
The chart that follows shows how much each individual score differs from the overall score for a company. (The bottom axis is Overall Score – Individual Data Point Score. So, if a company’s overall score is 43, an individual data point score of 60 would give a difference of 43 – 60 = -17.) The chart is adjusted for the fact that we weight various data elements differently but not for the fact that user profile preferences also add an important weighting component.
This pattern is pretty close to a “Normal” distribution—the distribution we expect when our “errors” in ratings are distributed randomly. When we compare our curve against a normal curve with a standard deviation of 20.5 and a median of 1.5 (shown above), we see an excellent Anderson-Darling fit score of 1745. There are a few differences—a lump on the left side of the curve and a peak that seems to have been pushed a bit to the right. Also, the “tails” are not as long or stretched out as they would be for a normal curve. We understand what causes these features (a sketch of this kind of normality check follows the list below):
- The bump on the left at -40 is the partner of a bump on the right at about plus 20. They indicate an excess of situations where the overall rating is 40 points or so below, or 20 to 30 points or so above, the individual data element it is compared against. The left-side bump is caused by data elements of 100 (where a company gets a full score for something) that are compared to the better company scores of 50 and 60. 50 or 60 minus 100 gives a peak at -50 to -40. Similarly, when a company scores a zero for failing a certain test, this element will be subtracted from an overall 20 or 30 score to give a bump in the +20 to +30 area. The presence of the bumps supports our observation that most of the 90 and 100 scores go to better-performing companies and most of the zero and 10 scores go to poor-performing companies.
- As noted above, there are many sources that give “full credit” to a company on a particular subject. These 100 scores stretch out the left side of the curve and push the peak a bit to the right. This could be interpreted as evidence that raters are an average of 1.5 points too positive about the companies we track.
- With our maximum overall score at around 70 and our minimum at around 20, the difference cannot be greater than +70 (a 70 overall score minus a 0 rating point) or less than -80 (a 20 overall score minus a 100 rating point). Therefore, the tails of our distribution are truncated.
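Here is the sketch mentioned above: a minimal normality check using scipy’s Anderson-Darling test on simulated placeholder data. It is illustrative only; it does not reproduce our production fitting procedure or the fit score quoted above.

```python
# Illustrative check of how closely the overall-minus-individual differences
# follow a normal curve. Placeholder data only; not our production procedure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
overall_score = 43
# Placeholder stand-in for the individual data-point scores behind one company,
# clipped to the 0-100 rating scale.
individual_scores = rng.normal(loc=overall_score, scale=20.5, size=500).clip(0, 100)

differences = overall_score - individual_scores   # e.g. 43 - 60 = -17
print("mean difference:", round(differences.mean(), 2))
print("std deviation:  ", round(differences.std(ddof=1), 2))

# Anderson-Darling test against a normal distribution with estimated mean and
# standard deviation; a smaller statistic indicates a closer fit.
result = stats.anderson(differences, dist="norm")
print("A-D statistic:", round(result.statistic, 3))
print("critical values:", dict(zip(result.significance_level, result.critical_values)))
```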
Our average company has about 500 contributing data elements. We would like to assume that the differences between the individual rating estimates and our overall rating are normally distributed, because we can then use standard statistical techniques to estimate how confident we can be in our rating estimates. We must also assume that each data element is an independent estimate of what the overall rating should be. With these two assumptions, we can calculate that our estimate of the overall rating for an average company in our system should be within plus or minus 1.8 points of its true value, 95% of the time.
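The plus-or-minus 1.8 figure can be reproduced with the standard normal-approximation formula for the margin of error of a mean, using the roughly 20.5-point standard deviation of the fitted curve shown earlier; a quick sketch:

```python
# 95% margin of error for the mean of n independent estimates, using the
# ~20.5-point standard deviation of the differences observed above.
import math

sigma = 20.5   # standard deviation of individual-vs-overall differences
n = 500        # contributing data elements for an average company
z_95 = 1.96    # two-sided 95% critical value of the normal distribution

margin = z_95 * sigma / math.sqrt(n)
print(round(margin, 1))   # -> 1.8 points
```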
For example, if we estimate that a company has an overall score of 65 points, based on 500 data elements, random variations in the data we collect mean that if we had full data (or investigated and checked each data point) the true rating of the company could be between 65 - 1.8 = 63.2 and 65 + 1.8 = 66.8, 95% of the time. If we are comparing this company to another that was rated 4 points lower, there is at least a 95% chance that the first company is really higher-rated than the second.
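The two-company comparison follows from the same assumptions: the standard error of the difference between two independent estimates is the square root of two times the single-company standard error. A quick sketch:

```python
# Sketch: is a 4-point gap between two companies, each rated from about 500
# elements, meaningful under the same normality and independence assumptions?
import math

sigma, n = 20.5, 500
se_single = sigma / math.sqrt(n)          # about 0.92 points per company
se_difference = math.sqrt(2) * se_single  # about 1.30 points for the gap

gap = 4.0
z = gap / se_difference                   # about 3.1 standard errors
print(round(se_difference, 2), round(z, 1))
# A gap of 3.1 standard errors is well beyond the 1.96 needed for 95% confidence.
```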
This is the answer we sought. Our ratings are accurate to within 1.8 points, plus or minus. We believe our data shows that our ratings of company CSR performance are fairly accurate. As we aggregate more data, we will strive towards a goal of 95% confidence in our rating at plus or minus 0.5 rating points. (We need about 6,000 data points per company to reach this goal—well within the design parameters of our system.) This level of accuracy would ensure that all rank orderings done using our system will be statistically significant.
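Solving the same margin-of-error formula for the number of data elements gives a rough sense of the data volume behind the plus-or-minus 0.5 goal; this sketch again assumes the roughly 20.5-point standard deviation and lands in the same neighborhood as the figure quoted above.

```python
# Sketch: how many independent data elements are needed for a 95% margin of
# error of +/- 0.5 points, assuming the same ~20.5-point standard deviation?
import math

sigma, z_95, target_margin = 20.5, 1.96, 0.5
n_needed = (z_95 * sigma / target_margin) ** 2
print(math.ceil(n_needed))   # -> 6458, roughly in line with the figure above
```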
Please note again that we have made a few explicit assumptions:
- Normal distribution for the rating differences. The fact that we work with many different types of sources makes this more likely. However, many social measures and scores (including stock market returns!) do not turn out to be normally distributed.
- Independent estimates. We would hope that rating sources make their own independent estimates of company social performance. In practice, however, they probably influence each other quite a bit.
- System effects. We put the data from each rating source through an iterative normalization and bias removal process. This process could introduce correlations between data points or add random noise to our ratings.
However, remember our goal! We do not pretend to describe the absolute truth of a company’s performance, but instead describe the general perception of that performance.
Corporate managers who want to benchmark the performance of their sustainability programs against those of their competitors should be cautious about drawing conclusions from a score difference of one or two points. The same is true when they track their improvement in performance over time: a one or two point change may be only random noise. Thanks to the quantity of data available to us, we are able to offer a transparent estimate of the range of possible error in our scores. Based on our analysis, we conclude that our scores should be at least as accurate as those available via other sources.