Suggestion: normalize feedback
Hi there,
Background:
I'm a buyer on odesk, and I think the service is great. I've had some interesting interactions with providers regarding feedback that I'd like to talk about here.
In general, I've had two classes of experiences with odesk providers: folks who basically "got it" and completed work on a reasonable schedule at reasonable quality, and folks who "don't get it" who need lots of hand holding, lots of iterations, and ultimately fail to deliver working code (I primarily hire Java developers, and I'm a developer myself).
The folks in category 1 get 4.5 - 5 star feedback from me. I think that's deserved.
The folks in category 2 get ~3 star feedback from me typically. In some cases I feel even that is high.
I recently had an IM conversation with a provider who felt I hadn't given them a chance, and was basically scared that they'd never get another job on odesk again. I felt bad for this person. He offered to work for free, and generally wanted to make it right. Unfortunately I felt that he "didn't get it". He just wasn't my guy.
He seemed to imply that other providers essentially grade on a scale of 4 to 5. 4 is basically the lowest feedback they'd give. He thought it was only fair that I change my feedback to a 4 (up from a 3.3, which I thought was high to begin with).
My suggestion:
I think the feedback ratings should be normalized based on the submitter's scale.
I think there's rampant grade inflation on odesk. If scores were normalized to the scale of the submitter, then you'd get a MUCH more accurate picture of the relative quality of that provider. And providers wouldn't get freaked out when they encounter a buyer (like me) that uses a wider scale.
Hot or Not does this (not that I'd ever go to that site.. ahem). See FAQ #14
http://www.hotornot.com/pages/faq.html
Thoughts?
-- James
We've toyed with the idea of "Feedback Authority"
on Thu, 2008-10-16 21:18.
The usefulness and reliability of feedback is clearly very important to our marketplace. It is a known issue that some providers are more liberal with giving good feedback while others are more strict. However, implementing a large "normalized feedback" matrix is not straightforward. Should we account for the buyer's experience on oDesk? In terms of number of projects? Hours on oDesk? Dollars spent? Can we assume that a "strict grader" will always be strict, or could their opinion change over time?
Rest assured that we are definitely thinking about ways to improve our feedback system. Please do keep the suggestions coming.
Yang
oDesk
Hi there, I should clarify
Hi there,
I should clarify what I mean by "normalize". I'm speaking in the strict statistical meaning of the term. A bell curve.
To do this you would calculate the mean and standard deviation of all the ratings that a person gives out. Then you scale their raw rating using the formula:
(rating - average) / stddev
You use this normalized rating to produce a normalized score which you then use to rank candidates.
Here's a screen shot of a spreadsheet with an example:
And another example:

These are small examples, but amplified across the whole network of users, the difference in rankings for providers will be profound, and much more fair than the present raw mean score.
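For anyone who wants to try the formula above without a spreadsheet, here is a minimal sketch in Python. The buyers, providers, and ratings are made up for illustration, and two details are assumptions rather than part of the proposal: it uses the population standard deviation, and it treats a buyer who gives everyone the same score as neutral (z = 0) to avoid dividing by zero.

```python
from statistics import mean, pstdev

def normalized_scores(ratings_by_buyer):
    """Map each provider to the mean of their per-buyer z-scores."""
    z_scores = {}
    for given in ratings_by_buyer.values():
        values = list(given.values())
        avg = mean(values)
        sd = pstdev(values)  # population standard deviation (an assumption)
        for provider, rating in given.items():
            # A buyer who gives everyone the same score says nothing about
            # relative quality, so treat those ratings as neutral.
            z = (rating - avg) / sd if sd > 0 else 0.0
            z_scores.setdefault(provider, []).append(z)
    return {provider: mean(zs) for provider, zs in z_scores.items()}

# Hypothetical data: one strict buyer and one lenient buyer.
ratings_by_buyer = {
    "strict_buyer": {"xavier": 4.0, "p1": 3.0, "p2": 2.0},
    "lenient_buyer": {"yolanda": 4.5, "p3": 5.0, "p4": 5.0},
}
scores = normalized_scores(ratings_by_buyer)

# Raw scores put yolanda (4.5) ahead of xavier (4.0); normalized scores
# reverse that, because 4.0 is the strict buyer's best rating and 4.5 is
# the lenient buyer's worst.
print(scores["xavier"] > scores["yolanda"])  # prints True
```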
-- James
Here's the other problem with this
A couple of folks have commented on a couple of reasons they don't see your 'bell curve' working. I have another one for you. Why as a provider (and I'm both a buyer and a provider) should I get a 4.6 rating if my work is 4.9 just because that is the 'standard' rating for the buyer?
I sort of understand where you're coming from but frankly, the 1-5 and the ability to add comments about *why* I've used that rating (or earned that rating) is far preferable to a 'belled' rating.
Doreen
Hi Doreen, Thanks for you
Hi Doreen,
Thanks for your comment. You ask: "Why as a provider (and I'm both a buyer and a provider) should I get a 4.6 rating if my work is 4.9 just because that is the 'standard' rating for the buyer."
Here are a few reasons:
1) Most importantly, because a normalized 4.6 would pretty much put you at the top of the heap. Your raw 4.9 now would get you buried on page 50 of the search results (see below) after the folks with their 1-2 perfect 5.0 scores.
2) Because all ratings are subjective. There is no objective "4" rating.
3) Because there is a strong cognitive bias among odesk participants to only provide feedback if it's positive, and to withhold feedback in other cases. You don't want to make someone feel bad or (more importantly) become unable to work.
4) Because engagements differ in complexity.
It seems what we have on odesk right now is a version of the "Lake Wobegon" effect:
http://en.wikipedia.org/wiki/Lake_Wobegon_effect
All the developers on odesk are above average.
Look at the feedback form. I just did a mouseover of a 5, and the hover text said "Exceptional". Then search providers for "Java", order by feedback. Scroll through the dozens of pages of "Exceptional" Java developers. Pretty amazing, right?
Consequently a user's aggregate feedback score and the "sort by feedback" feature become some of the least useful tools a buyer has to screen candidates. Yes, I use other criteria. But given that this is the default sort criterion, it seems like oDesk should strive to make it more meaningful.
Hence this thread.
-- James
thanks
Thanks for your response James - my whole point here is that there has to be some honesty in feedback. Frankly, while I understand your concern - what I'd rather see happen is that buyers use the scale the way it's supposed to be used and leave 'real' comments in the feedback.
I agree with you on another point: My 4.8 rating based on the number of feedbacks (I think I have 11) is diluted by those who have had 1 or 2 short assignments and get perfect 5's. - Overall I think that the system that is here is good in so far as it does allow for the 360 feedback and if it's used correctly it serves two purposes (a) allows buyers to get a 'real' feel for those they are hiring and (b) allows providers to learn about the buyers.
I do, however, also feel that instances like the one you refer to above (in your initial post) are all too common. All of us don't work at a 5 all the time (even if we do most of the time). Let's face it, we all have bad days, bad weeks and even bad months (fortunately not so much you'd notice), so theoretically the best possible worker can have 'off' times, which obviously are reflected more heavily in short term assignments.
The other concern is the length of an assignment - clearly someone who gets a 5 after 1500 hours on an assignment is in a different category than someone who gets a 5 after a 15 hour assignment, but as you so eloquently point out, they're rated overall the same in terms of those who 'pop up' in the search engines.
I don't necessarily know the answer to this predicament but I don't see how your model helps address this either, because in reality, even using your rating system a 4 is a 4 is a 4 whether I have 1 or 12 of them!
Doreen
Based on the real world
I think the most important thing to keep in mind is that a system needs to be designed and upgraded according to how people actually use it...not how we would like people to use it in a perfect world or how we would prefer people to use it.
I think James' comments are interesting and it would be good for oDesk to look into them more.
In the 'real world'
In the 'real world' you're subjected to the opinions of your direct supervisor. In this world your feedback is a compilation of every buyer that you work with. Comparisons have to be done apples to apples and not apples to oranges. I had a struggle with one buyer when I was a freelance provider (versus my current status) where he ranked me all 4's - and it was so far off base from where his emails stated that he felt my work was that I asked him about it - and much like James's comments above, this was his 'best ranking', as he felt that '5' ratings were for 'over and above' and 'perfection' and he felt that they were 'false' ratings regardless of the provider. His comments in the feedback were so far off base from the '4' ranking that any buyer interested in hiring me would understand that he has his own 'scale'. It is up to the providers to encourage their buyers to discuss feedback with them if they don't feel it's fair, and if the buyer is using their own 'scale', that can be added to the comments by the buyer.
I don't entirely disagree with James - I just am not sure that his model is appropriate for everything. Bell curve ratings are typical in 'brick and mortar' establishments, not so much in freelancing.
Doreen
HUGE problem with this
It's a good idea, on paper, but there is one HUGE problem with this. Some buyers are much better at selecting really good providers than others, because they are not lazy in their hiring practices. I don't know about your particular situation. I understand that it is probably difficult to find good Java programmers.
Should buyers that are good at hiring qualified people have their ratings adjusted to a scale of 1-5 just because they only end up grading between 4-5? No way. You can't account for that situation mathematically.
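To make this objection concrete, here is a small Python sketch with made-up numbers. It applies the (rating - average) / stddev formula from earlier in the thread to two hypothetical buyers: one who hires carefully and only ever rates between 4 and 5, and one who hires carelessly and uses the whole 1 to 5 range. The population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def z_scores(ratings):
    """Apply (rating - average) / stddev to one buyer's ratings."""
    avg = mean(ratings)
    sd = pstdev(ratings)  # population standard deviation (an assumption)
    return [(r - avg) / sd for r in ratings]

# Hypothetical buyers: both gave three ratings.
careful_buyer = [4.0, 4.5, 5.0]   # hires well, everyone did good work
careless_buyer = [1.0, 3.0, 5.0]  # hires anyone, results vary wildly

careful_z = z_scores(careful_buyer)
careless_z = z_scores(careless_buyer)

# The provider rated 4.0 by the careful buyer ends up with the same
# normalized score as the provider rated 1.0 by the careless buyer.
print(round(careful_z[0], 3), round(careless_z[0], 3))  # prints -1.225 -1.225
```

The formula only sees each buyer's spread, not how good that buyer's hires actually were, which is the situation described above.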
Interesting comment. Let me
Interesting comment.
Let me ask the community this. Is this rating system intended to be used this way:
In your opinion how did this provider score relative to others you've worked with:
5 - Best I can imagine
4 - Very good
3 - Average
2 - Below average
1 - About as bad as I can imagine
-- James
I personally find
Actually, I think that if buyers and providers are honest with their feedback, it's critical that those looking at profiles read and understand the reasons for each rating.
For instance, using your 'thumbnail' I'm going to *assume* that you probably said that you and the provider simply were not a good fit. My profile has some odd ratings in it that I'm less than thrilled about (e.g. I had one buyer who spent less than an hour, admitted that he didn't do a good job of describing the job, but 'reduced' my quality to a 4).
It's important for buyers to take a look at the provider's *entire* history when they are considering hiring them. If they do and the provider doesn't consistently score '3' then chances are that the provider is doing an overall good job. I personally don't feel that ONE bad review makes for that much difficulty in finding additional tasks.
Doreen