How does Google deal with code refactoring? If one splits a code file into two files, how does the algorithm deal with the scoring of the original file's commit history? Sure, splitting classes in Java is hard (one file per class), but in languages like C++, C#, and PHP, with namespaces, it's trivial and happens all the time. (Disclaimer: I read ~80% of the article and comments, and used the browser search function, but found no answer.)
So I actually wrote this article (strange to see it again!)
We don't run the code any more and I would have to go back to the source to check, but I think it handles renames OK; it doesn't have any handling for splitting a file in two that way.
It's a somewhat by-the-by issue, as my intuition is that the files that actually can be broken up and successfully refactored are not the same ones that are going to get flagged. The flagged files are the ones that are churning because no one really knows how to write them any better.
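To make the rename-versus-split distinction concrete, here is a hedged sketch of how per-file history might be collected. `git log --follow` tracks a single file across renames, but when a file is split in two, neither new path inherits the old bug-fix history, so any score keyed on that history effectively resets. The commit-message heuristic below is a stand-in for however bug-fixing commits are really identified, not the actual implementation.

    import subprocess

    BUG_MARKERS = ("fix", "bug")  # hypothetical heuristic for spotting bug-fixing commits

    def bug_fix_commit_times(path):
        """Committer timestamps of bug-fixing commits that touched `path`,
        following renames but (like git itself) not file splits."""
        log = subprocess.run(
            ["git", "log", "--follow", "--format=%ct %s", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        times = []
        for line in log.splitlines():
            ts, _, subject = line.partition(" ")
            if any(m in subject.lower() for m in BUG_MARKERS):
                times.append(int(ts))
        return times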
So the follow-up paper that assesses the impact is here [1]
TL;DR is that developers just didn't find it useful. Sometimes they knew the code was a hot spot, sometimes they didn't, but knowing that the code was a hot spot didn't give them any means of effecting change for the better. Imagine a compiler that just said "Hey, I think this code you just wrote is probably buggy" but didn't tell you where, and that would keep saying it even after you found and fixed the problem, simply because the code had been buggy recently. That's essentially what TWR does. That became understandably frustrating; we have many other signals that developers can act on (e.g. FindBugs), and we risked drowning out those useful signals with this one.
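For a concrete sense of the flavour of scoring involved, here is a minimal Python sketch, assuming the logistic time-weighting described in the original blog post; the constants and the way bug-fixing commits are identified are illustrative rather than the production implementation.

    import math

    def twr_score(bug_fix_times, repo_start, now):
        """Time-weighted risk for one file: recent bug-fixing commits
        dominate the score, old ones contribute almost nothing."""
        score = 0.0
        for ts in bug_fix_times:  # commit times of bug fixes that touched the file
            t = (ts - repo_start) / (now - repo_start)  # normalise to [0, 1]
            score += 1.0 / (1.0 + math.exp(-12.0 * t + 12.0))
        return score

Because the weight of each fix decays quickly with age, a file that just had a flurry of fixes stays flagged for a while afterwards, which is exactly the "it was maybe buggy recently" frustration described above.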
Some teams did find it useful for getting individual team reports so they could focus on places for refactoring efforts, but from a global perspective, it just seemed to frustrate, so it was turned down.
From an academic perspective, I consider the paper one of my most impactful contributions, because it highlights to the bug prediction community some harsh realities that need to be overcome for bug prediction to be useful to humans. So I think the whole project was quite successful... Note that the Rahman algorithm that TWR was based on did pretty well in developer reviews at finding bad code, so it's possible it could be used for automated tools effectively, e.g. test case prioritization so you can find failures earlier in the test suite. I think automated uses are probably the most fruitful area for bug prediction efforts to focus on in the near-to-mid future.
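As a rough illustration of that automated use, here is one way such per-file scores could drive test ordering; the data shapes and the ranking rule are assumptions for illustration, and a real prioritizer would also weigh test runtime, flakiness, and coverage quality.

    def prioritize_tests(tests_to_files, file_scores):
        """Run the tests that cover the riskiest files first.

        tests_to_files: mapping of test name -> files it covers
        file_scores: mapping of file path -> TWR-style risk score
        """
        def risk(test):
            return sum(file_scores.get(f, 0.0) for f in tests_to_files[test])
        return sorted(tests_to_files, key=risk, reverse=True)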
I was one of the interviewees for the study (or at least, I remember ranking those three lists as described in the experimental design).
My impression was that the results of the algorithm were pretty accurate, but they were not very actionable. Very often the files identified were ones the team knew to be buggy, but there were good reasons they were buggy: the problem the code was solving was complex, that area of the code was undergoing heavy churn because the problem it solved was a high priority, or the code was ugly but another system was being developed to replace it, so it wasn't worth fixing code that was going to be thrown away anyway. In some cases, proposals to fix or refactor the code had been nixed repeatedly by executives.
Basically - not all bugs are created equal. Oftentimes code is buggy because it's important, and the priority is on satisfying user needs rather than fixing bugs.
I work in software reliability (bug finding through dynamic program analysis), which is a domain related to this research.
Most of these machine-learning-based software engineering research tools are built on unrealistic scenarios: full of over-promises, with very little delivered in real life.
Curious why this isn't used any more? It seems like it would have been useful to flag certain files as worth extended review. Did it not provide the expected benefit(s)? I'm interpreting 'we' as 'Google'...
My interpretation after reading the whole article is that if two new files are created, the history of this 'hot spot' will be forgotten. If one new file is created and the old one remains, that old one will only remain a hot spot as long as code is changed in it.
I'd love to hear a follow-up on whether this worked out for them or ended up being more trouble than it was worth. I remember being a bit skeptical about the process when it came out.
edit: Although, skimming the comments, maybe they need to turn that machine learning on the blogger comments. What a wasteland of spam and crap...
Ehhh, "if" is a bit of a specialized case. In general programs don't map fully to true/false, just leave it as "error code".
It's not that we should use some other interpretation; it's that that interpretation is only mostly true. Don't overgeneralize, lest you introduce mistakes.
I'm a little amused that they did not remark that perfect bug prediction is known to be impossible in general. Is this because they assume every reader already knows that, or because they forgot their theory lessons?
On another note, I wonder whether one could rigorously define bug prediction for a "helpful" programmer who isn't trying to trick the machine by using diagonalization tricks and obfuscating things.
They probably didn't mention it because formal, deterministic bug prediction may as well be on a different planet from probabilistic bug prediction based on hot spots in commit logs. When a blog post cites two research articles, it's generally safe to assume the authors haven't forgotten their theory.
Pretty interesting to revisit this idea. In practice, I never found that automated bug detection or auto-code-quality tools really helped when we used them to pinpoint problems in the code.
That being said, I am a fan of tools like GitPrime for identifying opportunities to improve visibility and insight into the dev process. It expands on this idea of risk identification, but applies it to the project rather than the file.