Language agnostic way to remove sensitive information from files before committing to Git

What are the recommended procedures for automatically removing sensitive information from files before committing to Git?

For example, say I have the following in a file called code.rb:

personal_stuff = "some personal stuff"

How can I automatically remove the personal information from code.rb before committing to version control? The solution should be language-agnostic.

5 solutions collected from the web for “Language agnostic way to remove sensitive information from files before committing to Git”

    Using a “clean filter” for specific files is another way to go.

    Update: an example, as requested.

    Add a “clean” filter to the local repository configuration, consisting of one call to sed. This could be a path to a shell script or to any program which consumes data on its standard input and writes processed data to its standard output:

    $ git config --add filter.classify.clean \
        'sed -e '\''s!\<\(personal_stuff\s\+=\s\+\)"[^"]\+"!\1"SECRET"!'\'
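
    The escapes \s, \+ and \< used above are not part of POSIX basic regular expressions, so they may not work in every sed. As a rough sketch of the "path to a shell script" variant, the following uses POSIX bracket expressions and intervals instead. The function name clean_secret is hypothetical, and the leading word-boundary check is dropped, so treat it as an approximation of the one-liner rather than an exact equivalent.

```shell
#!/bin/sh
# Sketch of a clean filter: replaces the quoted value assigned to
# personal_stuff with the placeholder "SECRET".
# clean_secret is a hypothetical name, not something Git provides.
clean_secret() {
    sed 's/\(personal_stuff[[:space:]]\{1,\}=[[:space:]]\{1,\}\)"[^"]\{1,\}"/\1"SECRET"/'
}

# Example run on a sample line. A real filter script would instead end with
# a bare call to clean_secret so that it processes whatever Git pipes to it.
printf '        personal_stuff  = "sensitive data"\n' | clean_secret
```

    Saved as an executable script, it could be registered with git config filter.classify.clean followed by the path to the script.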
    

    Now register the filter so that it is applied to files whose names match *.rb:

    $ cat >.gitattributes
    *.rb    filter=classify
    ^D
    

    Create a couple of test files:

    $ cat >test.rb
    aaa
    bbb
            personal_stuff  = "sensitive data"
    ccc
    ^D
    $ cat >test.txt
    aaa
    xxx
    personal_stuff = "super secret"
    yyy
    ^D
    

    Now add and commit them:

    $ git add test.*
    $ git commit -q -m 'root commit'
    ...
    

    Now see what happened to the contents of test.rb, that is, what its blob in the commit just recorded contains:

    $ git cat-file -p HEAD
    tree 7adaac5cc23c69ff9459635d666ca63ffb9757aa
    author Konstantin Khomoutov <flatworm@...ourceforge.net> 1368453302 +0400
    committer Konstantin Khomoutov <flatworm@...ourceforge.net> 1368453302 +0400
    
    
    root commit 
    $ git cat-file -p 7adaa
    100644 blob e49630236eb74d8c7ccbcccc83c7c18af0cb4b96    test.rb
    100644 blob aecd9ade78e18d5b5ded99a1e41cf366fa52e619    test.txt
    $ git cat-file -p e496302
    aaa
    bbb
            personal_stuff  = "SECRET"
    ccc
    

    Verify this did not affect the work tree:

    $ cat test.rb
    aaa
    bbb
            personal_stuff  = "sensitive data"
    ccc
    

    You can write your own pre-commit hook. The hook scans the changes you are about to commit and rejects the commit if it finds something it does not like.

    Writing the actual hook can be a challenge, but you should be able to find some examples online.
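
    As a minimal sketch of such a hook, assuming the sensitive line looks like the personal_stuff assignment from the question: the PATTERN value and the has_secret helper are hypothetical, and a real hook would need patterns tailored to your own data.

```shell
#!/bin/sh
# Sketch of a pre-commit hook (it would live at .git/hooks/pre-commit and
# must be executable). PATTERN is a hypothetical rule that matches the
# example assignment from the question.
PATTERN='personal_stuff[[:space:]]*=[[:space:]]*"[^"]+"'

# Reads a unified diff on stdin and succeeds if an added line matches PATTERN.
# Lines starting with "+++" (file headers) are skipped, which also skips the
# rare added line whose own content begins with "++".
has_secret() {
    grep -E '^\+' | grep -v '^+++' | grep -E "$PATTERN" >/dev/null
}

# Scan only what is staged; do nothing when run outside a Git work tree.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    if git diff --cached -U0 | has_secret; then
        echo 'Commit rejected: staged changes contain personal_stuff.' >&2
        exit 1
    fi
fi
```

    A non-zero exit status from the hook makes Git abort the commit. The check can be bypassed with git commit --no-verify, so it is a safety net rather than a guarantee.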

    One solution is to move your confidential information to an external file that Git will ignore.

    There are two ways to ignore a file in Git:

    • Using the .gitignore file (permanent)
    • Using the git update-index command (temporary)

    In your case, the more flexible solution would be:

    1. Create a file containing fake personal data (such as password = "mypassword1234" or whatever…)
    2. Commit and push this file
    3. Ignore its future modifications with git update-index --assume-unchanged your_file
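
    The three steps can be tried out in a throwaway repository, as in the sketch below. It assumes git and mktemp are installed; the file name secrets.rb and the demo identity are made up for the example, and the push from step 2 is left out because the scratch repository has no remote.

```shell
#!/bin/sh
set -e

# Work in a scratch repository so that nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# 1. A placeholder file with fake values (secrets.rb is a hypothetical name).
printf 'personal_stuff = "placeholder"\n' > secrets.rb

# 2. Commit the placeholder (a demo identity is passed on the command line).
git add secrets.rb
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'Add placeholder secrets file'

# 3. Tell Git to assume the file is unchanged from now on.
git update-index --assume-unchanged secrets.rb

# A local edit with the real value is no longer reported by "git status".
printf 'personal_stuff = "some personal stuff"\n' > secrets.rb
status_after=$(git status --porcelain)
echo "status after local edit: '${status_after}'"
```

    Running git update-index --no-assume-unchanged on the file turns tracking of local modifications back on.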

    Use ‘.gitattributes’ with ‘.gitfilters’. Here is an example with ‘rcs-keywords’; you’d follow the same structure but with filters for your sensitive data.

    Your attributes file maps file globs to filters, like so:

    # .gitattributes
    # Map file extensions to git filters
    *.h filter=rcs-keywords
    *.c filter=rcs-keywords
    

    The scripts in your .gitfilters directory implement a ‘clean’ and a ‘smudge’ filter. For the ‘rcs-keywords’ filter above, these are:

    $ ls .gitfilters/
    rcs-keywords.clean*  rcs-keywords.smudge*
    

    The ‘clean’ filter removes stuff prior to commit; the ‘smudge’ filter adds stuff back on checkout.

    A filter can be any script. Again, for ‘rcs-keywords’ the ‘clean’ filter looks like:

    #!/usr/bin/perl -p
    s/\$Id[^\$]*\$/\$Id\$/; 
    s/\$Date[^\$]*\$/\$Date\$/;
    

    which strips the RCS Id and Date information. The associated ‘smudge’ filter adds that information back in.

    Lastly, configure Git as follows:

    git config --add filter.rcs-keywords.clean  .gitfilters/rcs-keywords.clean
    git config --add filter.rcs-keywords.smudge .gitfilters/rcs-keywords.smudge
    

    In your case, the clean filter strips the sensitive data and the smudge filter adds it back in.
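
    A smudge filter for the question's example might look roughly like the sketch below. It assumes the committed placeholder is the literal "SECRET" and that the real value contains none of the characters &, \, | or a newline, which are special in a sed replacement. The function name smudge_secret and the idea of keeping the real value in an environment variable are hypothetical.

```shell
#!/bin/sh
# Sketch of a smudge filter: puts the real value back in place of the
# "SECRET" placeholder when Git checks a file out.
# $1 is the real value; it is assumed to contain no &, \, | or newline.
smudge_secret() {
    sed "s|\(personal_stuff[[:space:]]*=[[:space:]]*\)\"SECRET\"|\1\"$1\"|"
}

# Example run on a sample line. A real filter script would instead end with
# something like: smudge_secret "$PERSONAL_STUFF"
# (PERSONAL_STUFF being a hypothetical environment variable kept out of Git).
printf 'personal_stuff = "SECRET"\n' | smudge_secret 'sensitive data'
```

    The real value has to live somewhere Git never sees, such as an environment variable or an ignored file, otherwise the filter script itself leaks it.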

    If you can’t use .gitignore because you need to make parallel changes in the same file (as mentioned in your comments), one option is git add -p. It lets you stage or skip each hunk individually.

    The drawback of this command is that it is a largely manual process. I suspect you will not find a fully automated approach for your problem.
