Can Git detect if two source files are essentially copies of each other?

Sorry if this is off-topic, but here is your chance to reduce the number of “homework” questions on this site 🙂

I’m teaching a class of C programming where the students work on a small library of numeric routines in C. This year, the source files from several groups of students had significant amounts of code duplication in them.

(Down to identically misspelled printf debug statements. I mean, how dumb can you be.)

    I know that Git can detect when two source files are similar to each other beyond a certain threshold, but I never managed to get that to work on two source files that are not in a Git repository.

    Keep in mind that these are not particularly sophisticated students. It is unlikely that they would go to the trouble of changing variable/function names.

    Is there a way I can use Git to detect significant and literal code duplication, a.k.a. plagiarism? Or is there some other tool you could recommend for that?

5 Solutions collected from the web for “Can Git detect if two source files are essentially copies of each other?”

    Why use Git at all? A simple but effective technique would be to compare the sizes of the diffs between every pair of submissions, and then to manually inspect and compare the pairs with the smallest differences.
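    The pairwise comparison described above could be sketched as follows. This is only an illustration: the helper names `diff_size` and `rank_pairs` are made up for this example, and Python's `difflib` stands in for the `diff` command, so the counts will not match `diff` output exactly.

```python
import difflib
from itertools import combinations


def diff_size(a_text, b_text):
    """Count the lines that differ between two texts (removed + added)."""
    a_lines = a_text.splitlines()
    b_lines = b_text.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Lines dropped from the first text plus lines added in the second.
            changed += (i2 - i1) + (j2 - j1)
    return changed


def rank_pairs(submissions):
    """Return (diff_size, name_a, name_b) for every pair, smallest diff first.

    `submissions` maps a hypothetical file name to that file's contents.
    """
    results = []
    for name_a, name_b in combinations(sorted(submissions), 2):
        size = diff_size(submissions[name_a], submissions[name_b])
        results.append((size, name_a, name_b))
    return sorted(results)


if __name__ == "__main__":
    demo = {
        "alice.c": "int main(void)\n{\n    return 0;\n}\n",
        "bob.c": "int main(void)\n{\n    return 1;\n}\n",
        "carol.c": "double f(double x)\n{\n    return x * x;\n}\n",
    }
    for size, name_a, name_b in rank_pairs(demo):
        print(size, name_a, name_b)
```

    The pairs at the top of the ranking are the ones worth opening by hand. Short files will always produce small diffs, so dividing by the file lengths may be a sensible refinement.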

    Moss (Measure Of Software Similarity) is a tool that was developed by a Stanford CS prof. I think they use it there as well. It’s like diff for source code.

    Adding to the other answers, you could use diff, but I don’t think the raw output will be that useful by itself. What you want is the number of lines that match, not counting blank lines, and to get that automatically you need to do a fair bit of magic with wc -l and grep to compute the sum of the lengths of the files, minus the length of the diff file, minus the number of blank lines that diff included as matching. And even then you’ll miss some cases where diff decided that identical lines didn’t match because of different things inserted before them.
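    As a rough alternative to the wc -l and grep arithmetic, the count of matching non-blank lines can be taken directly from a line matcher. The sketch below uses Python's `difflib` instead of `diff`; the function name `matching_nonblank_lines` is hypothetical, and stripping leading and trailing whitespace from each line is an assumption of this example.

```python
import difflib


def matching_nonblank_lines(a_text, b_text):
    """Count non-blank lines that the matcher aligns between the two texts."""
    # Drop blank lines up front and ignore indentation differences.
    a_lines = [line.strip() for line in a_text.splitlines() if line.strip()]
    b_lines = [line.strip() for line in b_text.splitlines() if line.strip()]
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    # Each matching block covers `size` consecutive identical lines.
    return sum(block.size for block in matcher.get_matching_blocks())


if __name__ == "__main__":
    first = 'int x;\n\nprintf("helo");\nreturn 0;\n'
    second = 'int y;\nprintf("helo");\n\n\nreturn 0;\n'
    print(matching_nonblank_lines(first, second))
```

    Like diff, this aligns lines in order, so identical code that has been moved to a different place in the file can still go uncounted.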

    A much better option is one of the suggestions listed in (or in, though the answers seem to duplicate).

    You could use diff and check whether the two files seem similar:

    diff -iEZbwB -U 0 file1.cpp file2.cpp

    Those options tell diff to ignore case, whitespace changes, and blank lines, and to produce a git-like unified diff with no context lines. Try it out on two samples.

    Using diff is absolutely not a good idea unless you want to venture into the realm of combinatorial hell:

    • If you have 2 submissions, you have to perform 1 diff to check for plagiarism,
    • If you have 3 submissions, you have to perform 3 diffs to check for plagiarism,
    • If you have 4 submissions, you have to perform 6 diffs to check for plagiarism,
    • If you have n submissions, you have to perform n(n-1)/2 diffs!
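    For reference, the number of comparisons is the number of unordered pairs, since each submission is compared once with every other one: n(n-1)/2. A small check of that arithmetic (the function name `pair_count` is just for this example):

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n submissions: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (2, 3, 4, 30):
        # Cross-check the formula by enumerating the pairs explicitly.
        enumerated = sum(1 for _ in combinations(range(n), 2))
        print(n, pair_count(n), enumerated)
```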

    On the other hand, Moss, already suggested in another answer, uses a completely different algorithm. Basically, it computes a set of fingerprints for significant k-grams of each document. Each fingerprint is a hash used to classify documents, and possible plagiarism is flagged when two documents end up sharing fingerprints, that is, being sorted into the same bucket.
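    To make the k-gram fingerprint idea concrete, here is a heavily simplified sketch. It is not Moss's actual implementation: the normalization step, the SHA-1 hash, the window-minimum selection, the Jaccard score, and the parameter values `k=5` and `window=4` are all assumptions chosen for illustration.

```python
import hashlib


def normalize(text):
    """Lowercase and remove all whitespace so formatting does not matter."""
    return "".join(text.lower().split())


def kgram_hashes(text, k):
    """Hash every substring of length k (every k-gram) of the text."""
    hashes = []
    for i in range(len(text) - k + 1):
        digest = hashlib.sha1(text[i:i + k].encode("utf-8")).hexdigest()
        hashes.append(int(digest[:8], 16))  # keep 32 bits of the hash
    return hashes


def select_fingerprints(hashes, window):
    """Keep the smallest hash of each sliding window as a fingerprint."""
    fingerprints = set()
    for start in range(max(len(hashes) - window + 1, 1)):
        chunk = hashes[start:start + window]
        if chunk:
            fingerprints.add(min(chunk))
    return fingerprints


def similarity(a_text, b_text, k=5, window=4):
    """Jaccard overlap of the two fingerprint sets, between 0.0 and 1.0."""
    fa = select_fingerprints(kgram_hashes(normalize(a_text), k), window)
    fb = select_fingerprints(kgram_hashes(normalize(b_text), k), window)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    original = 'printf("debuging value %d\\n", x);'
    reformatted = 'printf( "debuging  value %d\\n",  x );'
    unrelated = "return sqrt(a * a + b * b);"
    print(similarity(original, reformatted))
    print(similarity(original, unrelated))
```

    Because every document is reduced to a set of fingerprints once, the fingerprints can be stored in an index keyed by hash, and candidate pairs are found by looking up shared keys rather than by diffing every pair of files.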
