Can Git detect if two source files are essentially copies of each other?
Sorry if this is off-topic, but here is your chance to reduce the number of “homework” questions on this site 🙂
I’m teaching a class of C programming where the students work on a small library of numeric routines in C. This year, the source files from several groups of students had significant amounts of code duplication in them.
(Down to identically misspelled printf debug statements. I mean, how dumb can you be.)
I know that Git can detect when two source files are similar to each other beyond a certain threshold, but I never managed to get that to work on two source files that are not in a Git repository.
Keep in mind that these are not particularly sophisticated students. It is unlikely that they would go to the trouble of changing variable/function names.
Is there a way I can use Git to detect significant and literal code duplication, a.k.a. plagiarism? Or is there some other tool you could recommend for that?
5 solutions collected from the web for “Can Git detect if two source files are essentially copies of each other?”
Why use git at all? A simple but effective technique would be to compare the sizes of the diffs between all of the different submissions, and then to manually inspect and compare those with the smallest differences.
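The idea above could be sketched roughly as follows. This is only an illustration, assuming the submissions have already been read into a dictionary mapping a file name to its source text; the names `diff_size` and `rank_pairs` are hypothetical, and Python's standard `difflib` stands in for the `diff` command.

```python
import difflib
from itertools import combinations


def diff_size(a_lines, b_lines):
    """Count the lines that would be added or removed to turn a into b."""
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return changed


def rank_pairs(submissions):
    """Return (diff size, name_a, name_b) tuples, smallest difference first.

    `submissions` is assumed to map a file name to its full source text.
    """
    results = []
    for (name_a, src_a), (name_b, src_b) in combinations(sorted(submissions.items()), 2):
        size = diff_size(src_a.splitlines(), src_b.splitlines())
        results.append((size, name_a, name_b))
    return sorted(results)
```

The pairs at the top of the returned list are the ones worth inspecting by hand. A raw line count favours short files, so dividing by the combined length of the two files may give a fairer ranking.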
Moss is a tool that was developed by a Stanford CS prof. I think they use it there as well. It’s like diff for source code.
Adding to the other answers, you could use diff, but I don’t think the output will be that useful by itself. What you want is the number of non-blank lines that match, and to get that automatically you need to do a fair bit of magic with wc -l and grep: compute the sum of the lengths of the files, minus the length of the diff file, minus the number of blank lines that diff included as matching. And even then you’ll miss some cases where diff decided that identical lines didn’t match because of different things inserted before them.
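For illustration, the count of matching non-blank lines can also be obtained without the wc and grep arithmetic. The sketch below uses Python's standard `difflib` instead of `diff`; the function name `matching_nonblank_lines` is made up for this example, and stripping leading and trailing whitespace before comparing is an assumption, not part of the original suggestion.

```python
import difflib


def matching_nonblank_lines(a, b):
    """Count non-blank lines that the matcher aligns as identical in both texts."""
    # Strip each line so indentation differences do not hide a match (assumption).
    a_lines = [line.strip() for line in a.splitlines()]
    b_lines = [line.strip() for line in b.splitlines()]
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    total = 0
    for block in matcher.get_matching_blocks():
        matched = a_lines[block.a:block.a + block.size]
        total += sum(1 for line in matched if line)  # skip blank lines
    return total
```

Like diff, this aligns lines in order, so it shares the limitation mentioned above: identical lines that have been moved around may go uncounted.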
A much better option is one of the suggestions listed in https://stackoverflow.com/questions/5294447/how-can-i-find-source-code-copying (or in https://stackoverflow.com/questions/4131900/how-to-detect-plagiarized-code, though the answers seem to duplicate).
You could use diff and check whether the two files seem similar:
diff -iEZbwB -U 0 file1.cpp file2.cpp
Those options tell diff to ignore case, whitespace changes, and blank lines, and to produce a unified diff with no context lines. Try it out on two samples.
Using diff is absolutely not a good idea unless you want to venture into the realm of combinatorial hell:
- If you have 2 submissions, you have to perform 1 diff to check for plagiarism,
- If you have 3 submissions, you have to perform 3 diffs to check for plagiarism,
- If you have 4 submissions, you have to perform 6 diffs to check for plagiarism,
- If you have n submissions, you have to perform n(n-1)/2 diffs to check for plagiarism.
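As a quick check on that growth: comparing every submission against every other one means one diff per unordered pair, which is n(n-1)/2 comparisons. A small sketch (the helper name `pair_count` is just for illustration):

```python
from itertools import combinations
from math import comb


def pair_count(n):
    """Number of unordered pairs among n submissions, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


# Enumerating the pairs directly gives the same count as the closed form.
submissions = ["group%d.c" % i for i in range(1, 5)]
pairs = list(combinations(submissions, 2))
assert len(pairs) == pair_count(len(submissions))
```

For a class of 30 submissions this already comes to 435 diffs.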
On the other hand, Moss, already suggested in another answer, uses a completely different algorithm. Basically, it computes a set of fingerprints for significant k-grams of each document. Each fingerprint is a hash used to classify documents, and possible plagiarism is flagged when two documents end up sorted into the same bucket.
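To make the fingerprint idea more concrete, here is a minimal sketch of k-gram fingerprinting. It is not Moss's actual implementation: the normalization step, the values of `k` and `window`, the choice of SHA-1 as the hash, and the rule of keeping the smallest hash in each window are all assumptions chosen for illustration.

```python
import hashlib
import re


def normalize(source):
    """Lowercase and drop all whitespace so layout changes do not matter."""
    return re.sub(r"\s+", "", source.lower())


def fingerprints(source, k=5, window=4):
    """Hash every k-gram, then keep the smallest hash of each sliding window."""
    text = normalize(source)
    hashes = [
        int.from_bytes(hashlib.sha1(text[i:i + k].encode()).digest()[:4], "big")
        for i in range(len(text) - k + 1)
    ]
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def similarity(a, b):
    """Fraction of fingerprints the two documents share (Jaccard index)."""
    fa, fb = fingerprints(a), fingerprints(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because each document is reduced to a set of hashes once, the fingerprints can be stored in an index keyed by hash value, and candidate pairs found by looking up shared keys rather than by diffing every pair of files.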