git filter-branch --subdirectory-filter removes a merge --no-ff commit

sgfault · Post by **sgfault** » 22.10.2012 23:43

I have the following repository:

...
$ git branch new_feature
$ git checkout  new_feature
Switched to branch 'new_feature'
$ echo z >> xyz/x.txt
$ git commit -a -m New_feature
$ git checkout master
Switched to branch 'master'
$ git merge --no-ff -m "Merge new feature"  new_feature
Merge made by the 'recursive' strategy.
 xyz/x.txt |    1 +
 1 file changed, 1 insertion(+)
$ git log --graph --pretty=oneline --abbrev-commit
*   a45d83f Merge new feature
|\
| * 1ecb650 New_feature
|/
* e168101 Second
* 212e963 Initial

Now I need to turn the xyz directory into a separate repository, and I use filter-branch --subdirectory-filter for that:

Code:

$ git filter-branch --subdirectory-filter xyz -- --all
Rewrite 1ecb650e8339aa992f2e111b9a4a2b2668db4e2d (3/3)
Ref 'refs/heads/master' was rewritten
Ref 'refs/heads/new_feature' was rewritten
$ git log --graph --pretty=oneline --abbrev-commit
* 7cbc244 New_feature
* b0fb1a3 Second
* a3ec0db Initial

But the merge commit (the one made with --no-ff) is no longer in the history. Why? And how can I preserve it?

IMB · Post by **IMB** » 23.10.2012 09:03

Perhaps because the merge commit already belongs to a different branch, but that is only a guess.

sgfault · Post by **sgfault** » 23.10.2012 18:42

IMB wrote (23.10.2012 09:03):
Perhaps because the merge commit already belongs to a different branch, but that is only a guess.

Er.. I don't follow: what do you mean by a different branch? I am filtering across all branches (there is an '--all' option at the end of the filter-branch call). Or did you mean something else?

sgfault · Post by **sgfault** » 25.10.2012 22:03

I think I understand the cause: the merge --no-ff commit does not itself change any file in the xyz directory, so it is not included in the history produced by subdirectory-filter. And apparently that is how it is supposed to work.

Simple, as always, except that there is still no solution. (Funny, but it turns out git hinders me more than it helps; especially considering that all this history I am trying so hard to save is something I do not need at all.)

sgfault · Post by **sgfault** » 03.12.2012 22:47

(upd2, 3/6/2013)
So much time has passed.. and I think I now know the answer. There was nothing particularly hard or new here, yet I lost count of the hours and days. Below is a full translation (by the author) of git filter-branch '--subdirectory-filter' preserving '--no-ff' merges.

As I wrote above, '--subdirectory-filter' does not preserve merge commits that were created with '--no-ff' where a fast-forward would otherwise have happened: they are empty and therefore do not change any files in the directory (the one given to 'subdirectory-filter').
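One way to check this is to compare the xyz tree of the merge commit with the xyz trees of its parents. The sketch below rebuilds a repository similar to the one from the first post in a temporary directory; the identity settings and the `main` variable are assumptions made only for this demo. It shows that, within xyz, the merge is identical to its second parent, which is why a history walk limited to that path can leave the merge out.

```shell
#!/bin/sh
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q .
git config user.name 'Example'                   # Demo identity (assumption).
git config user.email 'example@example.invalid'
main="$(git symbolic-ref --short HEAD)"          # Whatever the initial branch is called.

mkdir xyz
echo x > xyz/x.txt
git add xyz/x.txt
git commit -q -m Initial
git checkout -q -b new_feature
echo z >> xyz/x.txt
git commit -q -a -m New_feature
git checkout -q "$main"
git merge -q --no-ff -m 'Merge new feature' new_feature

# Tree objects of xyz at the merge commit and at each of its parents.
merge_tree="$(git rev-parse 'HEAD:xyz')"
parent1_tree="$(git rev-parse 'HEAD^1:xyz')"
parent2_tree="$(git rev-parse 'HEAD^2:xyz')"

# Number of commits with and without limiting the walk to xyz.
all_commits="$(git rev-list HEAD | grep -c '')"
xyz_commits="$(git rev-list HEAD -- xyz | grep -c '')"

echo "merge xyz tree:    $merge_tree"
echo "parent 1 xyz tree: $parent1_tree"
echo "parent 2 xyz tree: $parent2_tree"
echo "commits: $all_commits total, $xyz_commits when limited to xyz"
```

If the merge tree equals the tree of the second parent, the merge adds nothing of its own under xyz, and the path-limited commit count is one lower than the total.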

Nevertheless, the task remains: split one subdirectory of the project (below I simply write "directory") into a separate project while preserving _all_ merge commits. Here is the plan I will try to carry out:

1. Trace every file from the directory (for each branch I want to keep) through all renames back to its real origin. I call such a file history, consisting of filename and commit pairs, a "track" (track file).
2. Combine all tracks into one: this is the list of all files that must be kept at each commit. I call this the "unified track".
3. Run filter-branch with the following '--tree-filter':
- if the directory does not exist at a given commit, delete all files not listed in the unified track.
- if the directory exists, delete all files _outside_ the directory that are not listed in the unified track, then move all files from the directory into the repository root. Abort on any name conflict other than those explicitly resolved by the tree-filter.
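The two branches of step 3 can be sketched on plain files, the way a '--tree-filter' command sees a checked-out commit. This is a hypothetical illustration, not the author's 'examples/jp_tree_filter.sh': the function name `prune_and_hoist`, the keep-list format (one path per line, already selected for the current commit) and the demo tree are all assumptions. It also relies on the `-mindepth` and `-empty` tests of find, which GNU and BSD find provide but POSIX does not.

```shell
#!/bin/sh
set -eu

# prune_and_hoist DIR KEEP_LIST   (hypothetical helper)
#   DIR       - subdirectory that becomes the new repository root.
#   KEEP_LIST - file with one path per line (relative to the current
#               directory) that must survive on this commit.
# Assumes paths contain no newlines.
prune_and_hoist()
{
    ph_dir="$1"
    ph_keep="$2"
    # Remove every file outside DIR that is not listed in KEEP_LIST.
    find . -path ./.git -prune -o -type f -print \
        | sed -e 's:^\./::' \
        | while IFS= read -r ph_path; do
            case "$ph_path" in
                "$ph_dir"/*) continue ;;    # Files inside DIR always stay.
            esac
            grep -F -x -q -e "$ph_path" "$ph_keep" || rm -f -- "$ph_path"
        done
    # Drop directories emptied above, so they cannot cause false conflicts.
    find . -mindepth 1 -depth -type d -empty \
        ! -path './.git' ! -path './.git/*' -exec rmdir {} \;
    # Nothing to move if DIR does not exist on this commit.
    [ -d "$ph_dir" ] || return 0
    # Move DIR's entries to the top level; abort on any name conflict.
    for ph_path in "$ph_dir"/* "$ph_dir"/.[!.]*; do
        [ -e "$ph_path" ] || continue       # Unmatched glob pattern.
        ph_name="${ph_path#"$ph_dir"/}"
        if [ -e "./$ph_name" ]; then
            echo "prune_and_hoist(): name conflict for '$ph_name'." 1>&2
            return 1
        fi
        mv -- "$ph_path" "./$ph_name"
    done
    rmdir "$ph_dir"
}

# Demo on a throwaway tree (all names below are made up).
demo="$(mktemp -d)"
mkdir -p "$demo/tree/show_words/src" "$demo/tree/other"
echo a > "$demo/tree/show_words/src/a.hs"
echo b > "$demo/tree/other/origin.hs"
echo c > "$demo/tree/unrelated.txt"
echo 'other/origin.hs' > "$demo/keep.txt"
cd "$demo/tree"
prune_and_hoist 'show_words' "$demo/keep.txt"
find . -type f | sort
```

In the demo, 'other/origin.hs' survives because it is in the keep list, 'unrelated.txt' is removed, and the content of 'show_words' ends up at the top level.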

Code:

Contents.
    1. Tracking file renames.
    2. A side effect of tracking renames.
    3. Running the examples.
    4. Why not use 'git log --follow' to track file renames?
    5. A note on deleting files from a subdirectory.
    Appendix A. Script 'track_renames.sh'.
    Appendix B. Script 'examples/jp_prepare.sh'.
    Appendix C. Script 'examples/jp_gen_track.sh'.
    Appendix D. Script 'examples/jp_rewrite.sh'.
    Appendix E. Script 'examples/jp_tree_filter.sh'.
    Appendix F. Script 'examples/jp_finalize.sh'.
    Appendix G. Script 'examples/jp_check.sh'.
    Appendix H. Script 'examples/ex_index_filter.sh'.

1. Tracking file renames.

The 'track_renames.sh' script does this. All commands below are copied from it almost verbatim.

Each track consists of several continuous pieces of history. I start at some commit c0 where file f definitely exists. Then I find the end point of this continuous piece of file f's history (I walk the history backwards, from child commit to parent):

Code:

git log --diff-filter=A --pretty='format:%H' -1 c0 -- f

Note the '-1' argument. It guarantees that the history is continuous, i.e. that file f exists at every commit of the ancestry chain between the result of the command above (call it commit c1) and commit c0. Then I obtain this ancestry chain (i.e. all commits that are both descendants of c1 and ancestors of c0):

Code:

git log --pretty='format:%H' --ancestry-path c1..c0 --
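A small demo of the two commands, assuming a throwaway repository in which file f is added, deleted and added again; the helper `commit` and all file names are made up for the example. With '-1', the first command is expected to stop at the latest addition of f, and the second then lists the commits between that point and c0.

```shell
#!/bin/sh
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q .
git config user.name 'Example'                   # Demo identity (assumption).
git config user.email 'example@example.invalid'

# commit MESSAGE: stage everything, commit, print the new commit's sha1.
commit()
{
    git add -A
    git commit -q -m "$1"
    git rev-parse HEAD
}

echo 1 > f;         first_add="$(commit 'Add f')"
git rm -q f;        commit 'Delete f' >/dev/null
echo 2 > f;         last_add="$(commit 'Add f again')"
echo x > other;     commit 'Touch other file' >/dev/null
echo 3 >> f;        c0="$(commit 'Modify f')"

# End point of the continuous piece of history: the latest addition of f.
c1="$(git log --diff-filter=A --pretty='format:%H' -1 "$c0" -- f)"
# Every commit between c1 (excluded) and c0, whether or not it touches f.
chain="$(git log --pretty='format:%H' --ancestry-path "${c1}..$c0" --)"

echo "c1 = $c1"
printf '%s\n' "$chain"
```

The chain includes the commit that only touches 'other', which is exactly what a path-limited 'git log -- f' would leave out.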

This cannot be done in a single 'git log' run, because 'git log -- f' does not show commits that do not change file f (or it shows everything, which is simply useless), but I need those too (otherwise the tree-filter would delete file f at those commits!). Then I group commits and filenames into pairs.
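The pairing step can be sketched without git. The function name `pair_with_file` and the short fake hashes are assumptions for the illustration; the script in the appendix uses a two-stage sed pipeline for the same purpose. Note that 'format:%H' output has no newline after the last hash, so the loop has to handle an unterminated final line.

```shell
#!/bin/sh
set -eu

# pair_with_file FILENAME: read commit hashes on stdin (one per line) and
# print "<commit> <filename>" pairs, as stored in a track file.
pair_with_file()
{
    pf_name="$1"
    # '|| [ -n ... ]' keeps the last line when it lacks a trailing newline.
    while IFS= read -r pf_commit || [ -n "$pf_commit" ]; do
        [ -n "$pf_commit" ] || continue     # Skip empty lines.
        printf '%s %s\n' "$pf_commit" "$pf_name"
    done
}

# Fake hashes stand in for 'git log --pretty=format:%H' output.
pairs="$(printf 'aaa111\nbbb222' | pair_with_file 'xyz/x.txt')"
printf '%s\n' "$pairs"
```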

Next I must find the "origins" of file f at the end point of its history (where file f was added to the repository), commit c1. I use 'git blame' for this:

Code:

git blame -C -C --incremental 'c1^!' -- f

Note that I use two '-C' options on the commit that creates file f. This means blame will look for the "origins" of file f's lines in _all_ files at commit c1.

The result of this command may be:
- The same file f at the same commit c1. This happens when some lines were added to the file at commit c1. I do not need these results, because I already recorded file f in the track at the previous step with 'git log'.
- Other files g, h, .. at the parent commit c1^. This happens when some lines were copied or moved from other files.
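How the (commit, filename) pairs can be pulled out of incremental blame output is sketched below on canned input, so it runs without a repository. The sed program follows the one used in the thread's scripts; the function name `blame_origins`, the fake hashes and the sample entries are assumptions, and the sample only mimics the parts of the 'git blame --incremental' format that matter here (a header line starting with the sha1, and a 'filename' line closing each entry).

```shell
#!/bin/sh
set -eu

# blame_origins: read 'git blame --incremental' output on stdin and print
# one "<commit> <filename>" line per blamed group of lines.
blame_origins()
{
    sed -n -e '
        /^[0-9a-f]\{40\} /{ s/ .*//; h; }
        /^filename /{ H; g; s/\nfilename / /p; }
    '
}

# Two fake 40-character hashes: c1 adds file f, c1_parent is its parent.
g='0123abcd'
c1="$g$g$g$g$g"
g='89ef4567'
c1_parent="$g$g$g$g$g"

sample="$c1 1 1 2
author Example
summary Add f
filename src/f.hs
$c1_parent 3 3 1
author Example
summary Earlier commit
boundary
filename src/g.hs
$c1 7 4 1
filename src/f.hs"

# Drop the pair for file f itself at c1 (already recorded by 'git log')
# and collapse adjacent duplicates.
origins="$(printf '%s\n' "$sample" \
    | blame_origins \
    | { grep -F -v -e "$c1 src/f.hs" || true; } \
    | uniq)"
printf '%s\n' "$origins"
```

Only the pair for 'src/g.hs' at the parent commit is left, which corresponds to the second kind of result in the list above.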

Why can't lines copied from other files be attributed to commit c1?
The commit to which lines are attributed is always the one that last changed them. But if lines are attributed to file g, they cannot have been added to file g at commit c1: they must have existed there earlier, at commit c1^. Otherwise file g would not be the origin of these lines (they would simply have been added to both file f and file g). In other words, 'blame' looks for file f's lines in the version of file g at commit c1^ and, if it finds them, the revision limit 'c1^!' makes 'blame' stop searching for when the corresponding lines in file g were last changed and simply report that it happened at the boundary commit 'c1^'.

So I am interested only in other files that are origins of file f. That is, a file's track has the shape of a tree:
- The root is file f at some commit c0.
- The history of a node (file) is traced by 'git log' (starting at the commit where the node's file was found by 'blame' and ending at the commit where that file was last added to the repository).
- Node splitting is done by 'git blame'.

Code:

                                            git log
                                . (c1^, h) ----------> ..
             git log           . git blame
    (c0, f) ---------> (c1, f).
                            A  . git blame  git log
                                . (c1^, g) ----------> ..

This tree will most likely contain identical branches, which are traced independently. In other words, my algorithm is inefficient: if somewhere in the tree I meet a (commit c, file f) pair that was already traced earlier, the resulting track of that pair will be the same, but the current implementation does not reuse tracks and redoes all the work every time.
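One possible way to avoid the repeated work is to remember which (commit, file) pairs were already expanded. The sketch below is not part of 'track_renames.sh'; `expand` is a hypothetical stand-in for one tracing step, with a hard-coded toy graph in which two origins share the same ancestor.

```shell
#!/bin/sh
set -eu

nl='
'

# expand PAIR: hypothetical stand-in for one tracing step; prints the start
# points of the origins of PAIR, one "<commit> <file>" per line.
expand()
{
    case "$1" in
        'c0 f') printf '%s\n' 'c1 g' 'c1 h' ;;
        'c1 g') printf '%s\n' 'c2 k' ;;
        'c1 h') printf '%s\n' 'c2 k' ;;     # Same origin as for 'c1 g'.
    esac
}

queue='c0 f'        # Pairs still to be traced, newline separated.
seen=''             # Pairs already traced.
expansions=0
while [ -n "$queue" ]; do
    # Pop the first pair from the queue.
    item="${queue%%"$nl"*}"
    case "$queue" in
        *"$nl"*) queue="${queue#*"$nl"}" ;;
        *) queue='' ;;
    esac
    # Skip pairs that were traced before.
    case "$nl$seen$nl" in
        *"$nl$item$nl"*) continue ;;
    esac
    seen="${seen:+$seen$nl}$item"
    expansions=$((expansions + 1))
    new="$(expand "$item")"
    queue="${queue:+$queue${new:+$nl}}$new"
done

echo "expanded $expansions pairs:"
printf '%s\n' "$seen"
```

Without the `seen` check the pair 'c2 k' would be expanded twice, once for each of its descendants.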

Tracks are written to a file, and the file name is printed to stdout:

Code:

./track_$(date ..)-${c0}_$(echo "$f" | sed -e's:/:_:g').txt
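The naming scheme can be tried in isolation. In this sketch the function `track_file_name` is hypothetical, and the timestamp is passed in as an argument (instead of calling date) so that the result is reproducible; the abbreviated commit id is only for brevity.

```shell
#!/bin/sh
set -eu

# track_file_name STAMP COMMIT PATH: build a track file name, replacing
# every '/' in PATH with '_'.
track_file_name()
{
    printf './track_%s-%s_%s.txt\n' \
        "$1" "$2" "$(printf '%s\n' "$3" | sed -e 's:/:_:g')"
}

name="$(track_file_name '20121116_221210' 'f15c70a' 'show_words/src/SgfOrderedLine.hs')"
echo "$name"
```

Since the '.txt' suffix is always appended, a tracked file that already ends in '.txt' yields a name ending in '.txt.txt'.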

The 'examples/jp_gen_track.sh' script builds tracks for all required files on all required branches. It then combines all tracks, removing duplicates, into a unified track file.
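Combining tracks while dropping duplicates amounts to a sorted merge. A minimal sketch, assuming each track file holds "<commit> <filename>" lines; the function name `merge_tracks`, the file names and the short fake commit ids are made up, and the author's script may do this differently.

```shell
#!/bin/sh
set -eu

# merge_tracks FILE...: print the union of all lines, without duplicates.
merge_tracks()
{
    LC_ALL=C sort -u -- "$@"
}

work="$(mktemp -d)"
printf '%s\n' 'c2 a.txt' 'c1 a.txt' > "$work/track_1.txt"
printf '%s\n' 'c1 a.txt' 'c1 b.txt' > "$work/track_2.txt"

merge_tracks "$work"/track_*.txt > "$work/unified_track.txt"
cat "$work/unified_track.txt"
```

The pair 'c1 a.txt' appears in both inputs but only once in the unified track.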

2. A side effect of tracking renames.

Although everything may look correct, rewriting history (with filter-branch) while keeping only the files listed in the unified track (see 'examples/ex_index_filter.sh') generated for all files on all branches can break the project at some commits: some files may be deleted so that the project no longer builds, or some commit messages that mention deleted files become confusing. In other words, the rewritten history will most likely differ from the original.

The reason is that any file that no longer exists at the branch tip can enter the unified track file only as an origin (in some generation) of a file that does exist at the branch tip, and only when 'git blame' looks for those origins. In other words, only past the commit (remember, I walk the history backwards) where the currently traced file (either one existing at the branch tip or its N-th generation ancestor) was last added to the repository.

Consider an example project with one branch and one file f at the branch tip, where
- File f is tracked. File f1 denotes another state of file f (at another commit), not a new file.
- File g is an origin of file f. Files 'g1', 'g2', 'g3' likewise denote other states of file g (at other commits), not new files.
- Commit P is the branch tip.
- Commit Q is where file g was deleted.
- Commit R is where file g was added.
- Commit S is where file f was added.
- The letter 'D' under a filename means the file was deleted at that commit.
- The letter 'A' under a filename means the file was added at that commit.
- The letter 'o' under a commit name denotes a commit.
- A dash ('-') denotes the commit ancestry chain (later commits are to the left).
- A dot denotes the history of a file (or of a file's lines).

There are three possible variants of the new (rewritten) history:

1. The rewritten history does not contain file g at all. Consequently, commits that change only file g become empty and are removed if the '--prune-empty' option was given:

Code:

(history is shown in reverse order, later commits are to the left)
    P      Q       R    S                     P'               S'
    o ---> o ----> o -> o                     o -------------> o
       (git log -- f)              (filter-branch)
    f1 ................ f               ==>   f1 ............. f
                        A
           g1  ... g
           D       A

2. The rewritten history is the same as the original:

Code:

(history is shown in reverse order, later commits are to the left)
                       (Q)
    P                   S    S^    R          P'               S'   S'^    R'
    o ----------------> o -> o --> o          o -------------> o -> o  --> o
       (git log -- f)             (filter-branch)
    f1 ................ f.              ==>   f1 ............. f
                        A . (git blame)                        A
                           .
                       g2 . g1 ... g                           g2 . g1 ... g
                       D           A                           D           A

3. The rewritten history contains only part of file g's history. As in case 1, commits that change only file g after commit S (descendants of commit S, i.e. on the ancestry chain S..Q) become empty and are removed if the '--prune-empty' option was given:

Code:

(history is shown in reverse order, later commits are to the left)
    P      Q            S    S^    R          P'               S'   S'^    R'
    o ---> o ---------> o -> o --> o          o -------------> o -> o  --> o
       (git log -- f)             (filter-branch)
    f1 ................ f.              ==>   f1 ............. f
                        A . (git blame)                        A
                           .
           g3 ........ g2 . g1 ... g                           g2 . g1 ... g
           D                       A                           D           A

Here are several examples that demonstrate the situations described above. In all of them I rewrote history keeping all files and all branches. Some commands from 'track_renames.sh' are also used below.

Example 1.

"Case 3" in a linear history: the origin of a tracked file is deleted at every commit where the tracked file exists.

Original history:

Code:

    | * 419ca9a Delete experimental files.
    | | D       show_words/src/multiline.hs
    | | M       show_words/src/testSgfList.hs
    | | D       show_words/src/tf.hs
    | * 9bb2a68 Finally fix foldrMerge. Add newtype ZipList'. Add some test for SgfList.
    | | M       show_words/src/SgfList.hs
    | | A       show_words/src/testSgfList.hs
    | * fd11529 Replace eq with (Eq a) in transp.
    | | M       show_words/src/SgfList.hs

Rewritten history:

Code:

    | * 2fcaa27 Delete experimental files.
    | | M       show_words/src/testSgfList.hs
    | * 091d23e Finally fix foldrMerge. Add newtype ZipList'. Add some test for SgfList.
    | | M       show_words/src/SgfList.hs
    | | D       show_words/src/multiline.hs
    | | A       show_words/src/testSgfList.hs
    | * 2205617 Replace eq with (Eq a) in transp.
    | | M       show_words/src/SgfList.hs

As you can see, in the rewritten history the file 'multiline.hs' is deleted earlier (at commit "Finally fix..") than in the original history (at commit "Delete.."), which makes the commit title "Delete experimental files." somewhat confusing.

Why did this happen? The file 'multiline.hs' was added to the unified track file as an origin of 'testSgfList.hs', which was added to the repository at commit "Finally fix..":

Code:

$ git blame -C -C --incremental '9bb2a68^!' -- show_words/src/testSgfList.hs \
    | sed -ne '/^[0-9abcdef]\{40\} /{ s/ .*//; h; }; /^filename /{ H; g; s/\nfilename / /p; };' \
    | uniq
9bb2a6842fbf995f24460812cb34ddbd2cdb864a show_words/src/testSgfList.hs
fd115298dab19b959d0ace361f068d7197c5c796 show_words/src/multiline.hs

but at commit "Delete.." the file 'testSgfList.hs' still exists (or "already" exists, but since I walk the history backwards I chose "still"), so I do not even consider its origins. At commit "Finally fix..", where it was added, I do look for origins, but I still do not need them, because 'testSgfList.hs' still exists. Only at commit "Replace eq.." do I need the origin of 'testSgfList.hs'. In other words, I notice an origin only when I reach the commit where the descendant file appeared, and I need the origin to exist only before the descendant file appeared.

Example 2.

"Case 3" in a non-linear history: the origin of a tracked file is deleted on another branch of history (by "branch" I simply mean a sequence of commits, not a ref).

Original history:

Code:

    | * 4e2d0f9 Support multiline input.
    | | M       show_words/src/SgfList.hs
    | | M       show_words/src/ShowWords.hs
    | | M       show_words/src/multiline.hs
    | | D       show_words/src/test_1.txt
    | | D       show_words/src/test_2.txt
    | | D       show_words/src/test_3.txt
    | | D       show_words/src/test_words.txt
    | | A       show_words/test/1.txt
    | | A       show_words/test/2.txt
    | | A       show_words/test/3.txt
    | | A       show_words/test/words_jp_ru.txt
    | *   42ab9af Merge branch 'show_words' into show_words_multiline
    | |\
    | | * b57f90f Move all generic list functions into SgfList.
    | | | A     show_words/src/SgfList.hs
    | | | D     show_words/src/SgfListIndex.hs
    | | | M     show_words/src/SgfOrderedLine.hs
    | | | M     show_words/src/ShowWords.hs
    | * | ff1f738 Rewrite and rename a little.
    | | | M     show_words/src/multiline.hs
    | * | 1d2fe0c Two multiline implementations. Both works.
    | |/
    | |   A     show_words/src/multiline.hs
    | * 7ff496a Add fixmes.
    |/
    |   M       show_words/src/ShowWords.hs
    *   20ea746 Merge branch 'master' into show_words

Rewritten history:

Code:

    | * a29ccf8 Support multiline input.
    | | M       show_words/src/SgfList.hs
    | | M       show_words/src/ShowWords.hs
    | | M       show_words/src/multiline.hs
    | | D       show_words/src/test_1.txt
    | | D       show_words/src/test_2.txt
    | | D       show_words/src/test_3.txt
    | | D       show_words/src/test_words.txt
    | | A       show_words/test/1.txt
    | | A       show_words/test/2.txt
    | | A       show_words/test/3.txt
    | | A       show_words/test/words_jp_ru.txt
    | *   271e25e Merge branch 'show_words' into show_words_multiline
    | |\
    | | * 11bb38f Move all generic list functions into SgfList.
    | | | A     show_words/src/SgfList.hs
    | | | D     show_words/src/SgfListIndex.hs
    | | | M     show_words/src/SgfOrderedLine.hs
    | | | M     show_words/src/ShowWords.hs
    | * | f3601db Rewrite and rename a little.
    | | | M     show_words/src/multiline.hs
    | * | d30b3d7 Two multiline implementations. Both works.
    | |/
    | |   D     show_words/src/SgfListIndex.hs
    | |   A     show_words/src/multiline.hs
    | * d00c0b5 Add fixmes.
    |/
    |   M       show_words/src/ShowWords.hs
    *   52b84e9 Merge branch 'master' into show_words

As you can see, in the rewritten history the file 'SgfListIndex.hs' is deleted at commit "Two multiline..", and consequently it does not exist at commit "Rewrite and rename.." either. Without this file the project does not compile.

Why did this happen? The file 'SgfList.hs' was traced by 'git log' down to commit "Move all..", where it was added to the repository:

Code:

$ git log --diff-filter=A --oneline -1 4e2d0f9  -- show_words/src/SgfList.hs
b57f90f Move all generic list functions into SgfList.

then 'git blame' was asked "where did this file's lines come from?", and it pointed to the file 'SgfListIndex.hs' at commit "Add fixmes", since that is the parent of commit "Move all..":

Code:

$ git blame -C -C --incremental 'b57f90f^!' -- show_words/src/SgfList.hs \
    | sed -ne '/^[0-9abcdef]\{40\} /{ s/ .*//; h; }; /^filename /{ H; g; s/\nfilename / /p; };' \
    | uniq
b57f90f50b176532717d10b9e5ac92a521d90b21 show_words/src/SgfList.hs
7ff496a4acfbcb20fedb4167134b120bdb468a20 show_words/src/SgfListIndex.hs
7ff496a4acfbcb20fedb4167134b120bdb468a20 show_words/src/ShowWords.hs

Commits "Two multiline.." and "Rewrite and rename.." are not reachable from commit "Move all..", so they could not have been origins of any line of this file.

That is, commits "Rewrite and rename.." and "Two multiline.." were added to the unified track file from the tracks of other files (not 'SgfList.hs'). I can look into the track files created by 'examples/jp_gen_track.sh' (more precisely, by 'track_renames.sh' invoked for every branch and every file) to find which tracks refer to these two commits and to which files in them:

Code:

$ grep -R -e ff1f738 .  | head -n5
./track_20121116_221210-f15c70a5b6fc847403f62595cb21d95e8b7a44a2_show_words_src_SgfOrderedLine.hs.txt:ff1f7389ed50b20e70e62d362d44853b22ab7d5c show_words/src/SgfOrderedLine.hs
./track_20121116_221215-a5ddf92c8c499833c755af9ea058fb5ad2914c48_show_words_README.txt:ff1f7389ed50b20e70e62d362d44853b22ab7d5c show_words/README
./track_20121116_221216-a5ddf92c8c499833c755af9ea058fb5ad2914c48_show_words_tests_testSgfList.hs.txt:ff1f7389ed50b20e70e62d362d44853b22ab7d5c show_words/src/multiline.hs
./track_20121116_221213-0d44c0a5b9ade4becc4be793d586aa74a4da039e_.gitignore.txt:ff1f7389ed50b20e70e62d362d44853b22ab7d5c .gitignore
./track_20121116_221213-dd2e8267e848c779501411fd4584e42dcbad875e_show_words_tests_words_words_jp_ru.txt.txt:ff1f7389ed50b20e70e62d362d44853b22ab7d5c show_words/src/test_words.txt

but none of these track files refers to the file 'SgfListIndex.hs':

Code:

$ grep -R -e ff1f738 .  | grep SgfListIndex
$
$ grep -R -e 1d2fe0c . | grep SgfListIndex
$

and therefore 'SgfListIndex.hs' was deleted at these two commits.
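The same check can be automated: given the list of paths that exist at a commit, print those that have no entry for that commit in the unified track, i.e. the files the rewrite would drop there. In this sketch the function name `paths_missing_from_track` is hypothetical, the two track lines are taken from the grep output above, and the three-path listing is an illustrative assumption rather than the real content of that commit.

```shell
#!/bin/sh
set -eu

# paths_missing_from_track COMMIT TRACK_FILE: read paths on stdin, print
# those without a "<COMMIT> <path>" line in TRACK_FILE.
paths_missing_from_track()
{
    pm_commit="$1"
    pm_track="$2"
    while IFS= read -r pm_path || [ -n "$pm_path" ]; do
        [ -n "$pm_path" ] || continue
        grep -F -x -q -e "$pm_commit $pm_path" "$pm_track" \
            || printf '%s\n' "$pm_path"
    done
}

work="$(mktemp -d)"
commit='ff1f7389ed50b20e70e62d362d44853b22ab7d5c'
printf '%s\n' \
    "$commit show_words/README" \
    "$commit show_words/src/multiline.hs" \
    > "$work/unified_track.txt"

missing="$(printf '%s\n' \
        'show_words/README' \
        'show_words/src/SgfListIndex.hs' \
        'show_words/src/multiline.hs' \
    | paths_missing_from_track "$commit" "$work/unified_track.txt")"
printf '%s\n' "$missing"
```

In a real repository the path list could come from `git ls-tree -r --name-only <commit>`.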

Example 3.

"Case 1": the file is absent from the rewritten history altogether (completely removed from the repository), because it is not an origin of any tracked file. Consequently, some commits became empty and were removed as well (with the '--prune-empty' option).

Original history:

Code:

    | * 132e21f Column equality function in config. Finally implement zipFoldM.
    | | M       show_words/src/SgfList.hs
    | | M       show_words/src/SgfOrderedLine.hs
    | | M       show_words/src/ShowWords.hs
    | | M       show_words/src/ShowWordsConfig.hs
    | | M       show_words/src/ShowWordsOutput.hs
    | | D       show_words/src/zipFold.hs
    | * b658f1a Fix monadic version of generalized list eq.
    | | M       show_words/src/zipFold.hs
    | * 37e9130 Generalized list Eq for some eq function. Draft.
    | | M       show_words/src/SgfList.hs
    | | A       show_words/src/zipFold.hs

Rewritten history:

Code:

    | * ad31503 Column equality function in config. Finally implement zipFoldM.
    | | M       show_words/src/SgfList.hs
    | | M       show_words/src/SgfOrderedLine.hs
    | | M       show_words/src/ShowWords.hs
    | | M       show_words/src/ShowWordsConfig.hs
    | | M       show_words/src/ShowWordsOutput.hs
    | * 52e81cc Generalized list Eq for some eq function. Draft.
    | | M       show_words/src/SgfList.hs

As you can see, the file 'zipFold.hs' is absent from the rewritten history, because it is not an origin of any tracked file. Therefore the rewritten commit "Fix monadic.." became empty and was removed.

This can be verified by searching the track files:

Code:

$ grep -R -e zipFold .
$

3. Running the examples.

Now that all the consequences of running 'track_renames.sh' have been examined, writing the other helper scripts should be fairly simple, and I will not describe them here (see the sources and comments). Here is the correct order for running the example scripts (note that they all assume my repository and paths, so they will not "just work": you have to write your own, using these merely as examples):

1. Preparation: clone, recreate from the remote branches the local branches you want to keep in the new repository, and remove all remotes:

Code:

$ sh ./jp_prepare.sh

2. Build the unified track file for all files from the subdirectory on all branches:

Code:

$ cd jp
$ sh ../jp_gen_track.sh '' 'subdir' 'show_words'

3. Rewrite history: 'git filter-branch' will use 'examples/jp_tree_filter.sh' as the '--tree-filter'; the tree-filter script and the unified track file must be one level up in the directory tree:

Code:

$ sh ../jp_rewrite.sh

4. Cleanup: reset the working tree to HEAD and remove all untracked files, delete the backup refs, expire all reflogs, and remove all unreferenced objects:

Code:

$ sh ../jp_finalize.sh

5. Check that everything was rewritten correctly (do not forget to look at the generated diffs):

Code:

$ sh ../jp_check.sh
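The cleanup in step 4 can be sketched with standard git commands. This is not the author's 'examples/jp_finalize.sh' but a guess at an equivalent sequence; the function name `cleanup_after_rewrite` and the demo repository are assumptions. The demo fakes a backup ref under refs/original/ (the namespace where filter-branch stores the original refs) instead of running a real rewrite.

```shell
#!/bin/sh
set -eu

# cleanup_after_rewrite: run inside a repository after filter-branch.
cleanup_after_rewrite()
{
    # Reset the working tree to HEAD and remove untracked files.
    git reset -q --hard
    git clean -q -f -d
    # Delete the backup refs.
    git for-each-ref --format='%(refname)' refs/original/ \
        | while IFS= read -r cr_ref; do
            git update-ref -d "$cr_ref"
        done
    # Expire all reflogs and drop objects that are no longer referenced.
    git reflog expire --expire=now --all
    git gc -q --prune=now
}

# Demo on a throwaway repository.
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git config user.name 'Example'                   # Demo identity (assumption).
git config user.email 'example@example.invalid'
echo a > a.txt
git add a.txt
git commit -q -m Initial
git update-ref refs/original/refs/heads/demo HEAD   # Fake backup ref.
echo junk > untracked.txt

cleanup_after_rewrite
backups_left="$(git for-each-ref refs/original/)"
echo "backup refs left: '${backups_left}'"
```

These commands are destructive (untracked files and the backup refs are gone afterwards), so they belong only in a clone made for the rewrite.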

4. Why not use 'git log --follow' to track file renames?

Because '--follow' does not always work, and because it is the wrong approach. A more detailed explanation can be found here: Directory renames without breaking git log (see the last message from Junio C Hamano). And here is the message from Linus that he mentions.

5. A note on deleting files from a subdirectory.

If I want to remove a directory completely, including the origins of all its files, I have two options:
- Delete the directory on all branches I am going to rewrite with filter-branch. Then files from this directory do not get into the track, and provided that no file of this directory (nor its origin) turns out to be an origin of some tracked file, the directory and all its origins are erased completely.
- Or I can trace the files from this directory and then simply delete everything found (i.e. all files listed in the track).

Usually the first way is better, because an index-filter or tree-filter that deletes only files from the track leaves them at the commits where they (the files) were not recorded in the track (see ch. "2. A side effect of tracking renames.").

Looking again at the "three possible variants of the new (rewritten) history" from chapter 2:
- In case 1, file g will remain in the rewritten history, because it is not in the track.
- In case 3, file g will be added at commit S' and deleted at commit Q' (the rewrite of commit Q), because file g is not in the track at any commit of the ancestry chain S..Q.

Appendix A. Script 'track_renames.sh'.

Code:

#!/bin/sh

# Track specified file starting at specified commit down the history to its
# real origins.
# Arguments:
# 1 - start commit sha1.
# 2 - filename.
# Result:
#   in track file.
# Stdout:
#   track file name.

set -euf

readonly newline='
'
readonly ret_success=0
readonly ret_error=1
OIFS="$IFS"

readonly full_history='1'         # Option (set == non empty).

# FIXME: Change order of date and commit/filename in track filename. Probably,
# commit/filename should be first? If many logs generated, they'll have
# slightly different times.
# FIXME: Or accept track file prefix through cmd to make different invocations
# to be grouped together.
readonly track_prefix="./track_$(date '+%Y%m%d_%H%M%S')-"
readonly track_suffix='.txt'
track_file=''

set_track_file()
{
    # Set track_file variable (must be declared!) to track file name. Old
    # track file with the same name will be deleted and new one created.
    if [ "x${track_file+x}" = 'x' ]; then
        echo "set_track_file(): track_file variable not set" 1>&2
        return $ret_error
    fi
    track_file="${track_prefix}${1}${track_suffix}"
    rm -f "$track_file" && touch "$track_file"
}

hist_step()
{
    # Track down history of file f starting at commit c0. Result will be
    # commit and filename pairs separated by space for either each commit in
    # continuous piece of file f history or for last commit only (where file f
    # have been added).
    # 1 - start commit sha1.
    # 2 - filename.
    # Result:
    #   track file - (maybe) current history and its end point.
    # Pipe (stdout):
    #   start points for new history.
    if [ $# -lt 2 -o ! -f "$track_file" ]; then
        echo "hist_step(): Too few arguments or track file does not exist." 1>&2
        exit $ret_error
    fi
    local OIFS="$IFS"

    local c0="$1"       # Start commit (sha1) for file f history.
    local f="$2"        # Filename, which history i will look for.
    local h1=''         # End point (commit and filename) of file f history.
    local c1=''         # End point commit (sha1).
    local hs=''         # Full file f history (commit and filename pairs).
    local h0s=''        # Start points (commit and filename pairs) for
                        # file f 's origins history.
    local sha1_rx='[0-9abcdef]\{40\}'

    # stdout is pipe, so do not write!
    IFS="$newline"
    # I work only with continuous history pieces, hence, file f must exist at
    # commit c0.
    if [ "$(git cat-file -t "${c0}:$f")" != 'blob' ]; then
        echo "hist_step(): File '$f' does not exist at commit '$c0'." 1>&2
        return $ret_error
    fi
    # Take latest file f addition into git as end point to ensure history
    # continuity.
    c1="$(git log --diff-filter=A --pretty='format:%H' -1 "$c0" -- "$f")"
    h1="$c1 $f"
    if [ -n "${full_history:-}" ]; then
        hs="$(git log --pretty='format:%H' --ancestry-path "${c1}..$c0" -- \
                | sed -e"a\\$f" \
                | sed -e'N; s/\n/ /;'
            )"
    fi
    hs="${hs:+${hs}$newline}$h1"
    # Write file f history directly into resulting file.
    echo "$hs" >>"$track_file"

    # Find start points (commit and new filename) for file f 's origins
    # history (probably, consequence of file f rename). I must ensure, that
    # file f 's history end point (both commit and filename match) will not be
    # included as one of start points. This is possible, if some lines to file
    # f were added at the commit, which creates file f.
    h0s="$(git blame -C -C --incremental "${c1}^!" -- "$f" \
                | sed -ne"
                        /^${sha1_rx} /{ s/ .*//; h; };
                        /^filename /{ H; g; s/\nfilename / /p; };
                    " \
                | ( grep -F -v -e "$h1" || true )
        )"
    # Lines origin commit is always one, which have last modified them in file
    # they attributed to. But if lines attributed to other file G, matched
    # lines in file G may not be added at commit c1 - they must exist before
    # (at commit c1^).  Hence, boundary commit c1^ will be blamed for these
    # lines. I just want to be sure, that i'm right.
    if echo "$h0s" | grep -q "^$c1 "; then
        echo "hist_step(): Some lines attributed to other file at end point commit." 1>&2
        echo "hist_step(): This should never happen." 1>&2
        echo "hist_step(): And means some critical flaw in algorithm." 1>&2
        return $ret_error
    fi
    # Result may contain duplicates. Duplicates will go sequentially,
    # because 'git blame --incremental' outputs by commits, not by lines
    # (all lines, which one commit blamed for, then all lines, which other
    # commit blamed for, etc).
    echo "$h0s"
    IFS="$OIFS"
}

rename_hist()
{
    # Track down history of file f starting at commit c0. Unlike hist_step(),
    # this function tracks down to where file f or _all_ of its origins really
    # have first appeared in repository. It combines continuous history pieces
    # for all origins into single track. Hence, some commits, which may be
    # common in histories of several origins, may appear in the track several
    # times.  Resulted track file will contain date and file f name in its
    # filename.
    # 1 - start commit sha1.
    # 2 - filename.
    # Result:
    #   track file - created by hist_step().
    if [ $# -lt 2 ]; then
        echo "rename_hist(): Too few arguments." 1>&2
        return $ret_error
    fi
    local OIFS="$IFS"

    local c0="$1"       # Start commit (sha1) for file f history.
    local f="$2"        # Filename, which history i will now look for.
    local hs=''         # Start points (commit and filename pairs) for
                        # file f 's origins history.
    local h0s=''        # All not yet tracked down start points.

    IFS="$newline"
    set_track_file "${c0}_$(echo "$f" | sed -e's:/:_:g')"
    h0s="$c0 $f"
    while [ -n "$h0s" ]; do
        set -- $h0s
        c0="${1%% *}"
        f="${1#* }"
        shift
        h0s="$*"
        hs="$(hist_step "$c0" "$f")" # If call in 'set --' errexit won't work.
        set -- $hs $h0s
        # 'uniq' is enough, because duplicates, if any, go sequentially (see last
        # comment in hist_step()).
        h0s="$(echo "$*" | uniq)"
    done
    IFS="$OIFS"
}

rename_hist "$@"
echo "$track_file"

exit 0

Appendix B. Script 'examples/jp_prepare.sh'.

Code:

#!/bin/sh

# Prepare the 'jp' repository for git filter-branch rewriting: clone the
# original repository, recreate local branches tracking the corresponding
# remote ones (those I want to preserve) and remove the remote.

set -euf

newline='
'
OIFS="$IFS"

keep_branches='rewriteSplitBy
show_words
show_words_build_for_HP2010.2
show_words_index_by_Writer
show_words_readme'
orig_repo_path='/home/sgf/Documents/jp'
repo_path='/home/sgf/tmp/t'
repo='jp'

rm -rvf "$repo_path/$repo"
cd "$repo_path"
git clone "$orig_repo_path" "$repo"
cd "$repo"
IFS="$newline"
for b in $keep_branches; do
    git branch -t "$b" origin/"$b" || true
done
IFS="$OIFS"
git remote rm origin
git gc --aggressive --prune=now

Appendix C. Script 'examples/jp_gen_track.sh'.

Code:

#!/bin/sh

# Generate rename history (using track_renames.sh) for the 'jp' repository.
# The script must be run in the repository's top directory ('jp' in this
# case). Script 'track_renames.sh' is expected to be found one level higher in
# the directory tree (i.e. at '../track_renames.sh').
# Arguments:
# 1 - ref to rewrite. You may specify either one particular ref or an empty
# ref for rewriting all refs under refs/heads.
# 2 - mode ('subdir' or 'files').
# >=3 - for 'subdir' mode, subdirectories containing the files to track; for
# 'files' mode, the files themselves.
# Result:
#   track file one level higher in the directory tree (in ../).

set -euf

readonly newline='
'
readonly ret_success=0
readonly ret_error=1
OIFS="$IFS"

refs="refs/heads${1:+/$1}"
subdir=''
keep_files=''
c0=''
f=''
tf=''
# Combine tracks for all requested files into a single track file, (sort and)
# remove duplicates, and move it one directory up at the end.
track_file='jp_track.txt'
# Defaults depending on the number of arguments:
# 0 or 1 - subdir mode.
# 2 - subdir mode with 'show_words' as subdir.
# >2 - depending on mode and other args.
IFS="$newline"
rm -rf "$track_file" && touch "$track_file"
echo "Generate for ref(s):$newline$(git for-each-ref "$refs")"
if [ $# -lt 2 -o "${2:-}" = 'subdir' ]; then
    echo "Subdir mode."
    case $# in
        0 | 1 ) subdir='' ;;
        2 ) subdir='show_words' ;;
        * ) shift 2; subdir="$*" ;;
    esac
    echo "Generating for subdir(s)${subdir:+:${newline}}${subdir:- all.}"
    for c0 in $(git for-each-ref --format='%(objectname) %(refname)' "$refs"); do
        echo "For ref '${c0#* }' .."
        c0="${c0%% *}"
        for f in $(git ls-tree -r --name-only "$c0" $subdir); do
            echo "    .. file $f"
            tf="$(../track_renames.sh "$c0" "$f")"
            cat "$track_file" "$tf" | sort -u >"${track_file}.tmp"
            mv -T "${track_file}.tmp" "$track_file"
        done
    done
elif [ $# -gt 2 -a "${2:-}" = 'files' ]; then
    # At least one filename required.
    echo "Files mode."
    shift 2
    keep_files="$*"
    echo "Generate for files:${newline}$keep_files"
    for c0 in $(git for-each-ref --format='%(objectname)' "$refs"); do
        echo "$c0"
        all_files="$(git ls-tree -r --name-only "$c0")"
        for f in $(echo "${all_files}${keep_files:+${newline}${keep_files}}" \
                    | sort \
                    | uniq -d);
        do
            echo "$f"
            tf="$(../track_renames.sh "$c0" "$f")"
            cat "$track_file" "$tf" | sort -u >"${track_file}.tmp"
            mv -T "${track_file}.tmp" "$track_file"
        done
    done
else
    echo "Incorrect mode '$2' or too few arguments." 1>&2
    exit $ret_error
fi
IFS="$OIFS"
mv -T "$track_file" ../"$track_file"
echo "Resulting track file: '../$track_file'"
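
In 'files' mode the script tracks only those requested files that actually exist in the commit's tree. It gets them by concatenating both lists and keeping the lines that occur twice. A minimal sketch of that intersection idiom follows. The file names are made up, and the idiom assumes that neither list contains duplicates of its own.

```shell
#!/bin/sh
# Sketch: intersect two newline-separated lists with 'sort | uniq -d'.
# Assumes neither list has internal duplicates; file names are made up.
set -euf

newline='
'
all_files='README
src/a.c
src/b.c'
keep_files='src/b.c
doc/missing.txt'

# A line printed by 'uniq -d' occurred at least twice, i.e. in both lists.
common="$(echo "${all_files}${keep_files:+${newline}${keep_files}}" \
            | sort \
            | uniq -d)"
echo "$common"
```

Here 'doc/missing.txt' is requested but absent from the tree, so it is silently skipped rather than passed to the tracking script.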

Appendix D. Script 'examples/jp_rewrite.sh'.

Code:

#!/bin/sh

# Just call filter-branch with --tree-filter for the 'jp' repository. The
# script must be run in the repository's top directory. The tree filter script
# and the track file are expected to be found one level higher in the
# directory tree (at '../jp_tree_filter.sh' and '../jp_track.txt'). A tmp
# directory under '/tmp' is used for filter-branch. Make sure that tmpfs is
# mounted there (otherwise the tree filter will run very slowly).

set -euf
tmp_dir='/tmp/jp_rewrite'
tree_filter="$(cat ../jp_tree_filter.sh)"
rm -rf "$tmp_dir"
# Note that when a tmp directory is used for filter-branch, the real working
# tree remains unchanged. Hence, after the history rewrite you need to remove
# _all_ files, except the .git folder, from the working tree and reset --hard
# to some ref.
git filter-branch   --prune-empty -d "$tmp_dir" \
                    --tag-name-filter cat \
                    --tree-filter "$tree_filter" -- --all

Appendix E. Script 'examples/jp_tree_filter.sh'.

Code:

# Tree filter script for the 'jp' repository, which
#   - if directory '$subdir' does not exist at the current commit, removes all
#   files except those listed in the track file.
#   - otherwise (if '$subdir' exists), removes all files except those from
#   '$subdir' or listed in the track file, then moves all files from
#   '$subdir' to the top repository dir. Any name conflict is fatal, so be
#   sure to resolve conflicts manually (see below).

newline='
'


subdir='show_words'     # Will keep all files from this subdir.
track_file='/home/sgf/tmp/t/jp_track.txt'       # Will use this track file.
# Keep all files from track.
keep_files="$(sed -ne"s/^$GIT_COMMIT //p" "$track_file")"
# Keep all files from subdir.
keep_subdir="$(git ls-tree -r --name-only "$GIT_COMMIT" "$subdir")"
all_files="$(git ls-tree -r --name-only "$GIT_COMMIT")"

# Remove trailing slashes ('/' will be reduced to empty).
subdir="$(echo "$subdir" | sed -ne'1s:/*$::p')"
# Add files from '$subdir' to list of kept files.
keep_files="$(
    echo "${keep_files:-$keep_subdir}${keep_subdir:+${newline}$keep_subdir}" \
        | sort -u
    )"
# Remove only files not listed in $keep_files.
echo "${all_files:-$keep_files}${keep_files:+${newline}$keep_files}" \
    | sort \
    | uniq -u \
    | xargs -r -d'\n' rm -v

if [ -d "$subdir" ]; then
    # Hardlink files from $subdir to the top project dir. If some filename
    # already exists, it will not be hardlinked and the following `find` fails.
    # I.e. hardlinks serve as a flag showing whether a file was copied
    # successfully or not.
    cp -PRln "$subdir" -T .
    # Resolve known name conflicts manually.
    if [ -f ".gitignore" -a -f "$subdir/.gitignore" ]; then
        cp -Plv --remove-destination "$subdir/.gitignore" -T '.gitignore'
    fi
    find -P "$subdir" -depth -links '+1' -delete -o -print
fi
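
The removal step of the tree filter computes "all files minus kept files" without any loop: the kept names are appended to the full list, so every kept file occurs twice and 'uniq -u' drops it. A minimal sketch of this set-difference idiom follows. The file names are made up, and it assumes that every kept file is also present in the full list.

```shell
#!/bin/sh
# Sketch: list files to remove as "all_files minus keep_files" using
# 'sort | uniq -u'. Assumes every kept file is also present in all_files and
# that neither list has internal duplicates; file names are made up.
set -euf
LC_ALL=C; export LC_ALL     # Fixed collation, so the output order is stable.

newline='
'
all_files='.gitignore
other/x.txt
show_words/main.hs'
keep_files='show_words/main.hs'

# A line printed by 'uniq -u' occurred exactly once, i.e. it is not kept.
to_remove="$(echo "${all_files}${keep_files:+${newline}$keep_files}" \
                | sort \
                | uniq -u)"
echo "$to_remove"
```

One caveat of the idiom: a name that appears only in the kept list also occurs exactly once, so it would be printed too. The assumption above rules that case out.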

Appendix F. Script 'examples/jp_finalize.sh'.

Code:

#!/bin/sh

# Finalize the git filter-branch rewrite of the 'jp' repository: reset the
# working tree (remove _all_ files, except the git repository, and hard reset
# HEAD), remove backup refs, expire all reflogs and remove unreferenced
# objects to reduce the .git folder size.

set -euf

newline='
'
repo_path='/home/sgf/tmp/t/jp'

cd "$repo_path"
# Remove _all_ files, except git repository itself. This is needed, because if
# tree-filter uses temp directory, working tree still contains old data.
# Moreover, it most likely still contains track files, generated by
# 'jp_gen_track.sh'.
find -mindepth 1 -maxdepth 1 -name '.git' -prune -o -exec rm -rf {} \;
git reset --hard
git for-each-ref --format='%(refname)' 'refs/original' \
    | xargs -r -d'\n' -n1 git update-ref -d
git reflog expire --expire=now --all
git gc --aggressive --prune=now

Appendix G. Script 'examples/jp_check.sh'.

Code:

#!/bin/sh

# Generate diffs between "what the new history should look like" and the real
# new history. If the rewrite went well, these diffs (for each kept branch)
# should be either very small or empty.

set -euf

newline='
'
OIFS="$IFS"

orig_repo='/home/sgf/Documents/jp'
new_repo='/home/sgf/tmp/t/jp'
log_dir='/home/sgf/tmp/t/jp'

# Branches, which i have kept during rewrite.
keep_branches='rewriteSplitBy
show_words
show_words_build_for_HP2010.2
show_words_index_by_Writer
show_words_readme'
# Subdir, which files were moved to the top repository dir.
subdir='show_words'
# FIFO to use, when comparing diffs.
cmp_diff_fifo='cmp_diff.fifo'
new_branches=''
b=''
orig_log=''
new_log=''
diff_log=''
prev_diff_log=''
diffs_are_different=''  # Flag used by cmp_diffs(). Empty, when diffs are
                        # identical.

# Remove trailing slashes ('/' will be reduced to empty).
subdir="$(echo "$subdir" | sed -ne'1s:/*$::p')"

gen_log()
{
    # Generate log suitable for gen_diff() for specified branch.
    # Arguments:
    # 1 - branch name.
    git log --pretty='format:%s' --numstat "$1" --
}

gen_diff()
{
    # Remove leading 'subdir' (if present) from all filenames in 1st file,
    # containing `git log --numstat` output. Then generate diff.
    # Arguments:
    # 1 - 1st filename (original history generated by gen_log() expected).
    # 2 - 2nd filename (new history generated by gen_log() expected).
    sed "$1" -e "s@^\(\([[:digit:]]\+\t\)\{2\}\)$subdir/@\1@" \
        | ( diff -u - "$2" || true )
}

reduce_diff()
{
    # Remove some parts from diff to make it suitable for comparison by
    # cmp_diffs().
    # Arguments:
    # 1 - filename (unified diff expected).
    tail -n'+3' "$1" | sed -ne'/^@@ /!p'
}

cmp_diffs()
{
    # Compare two diffs. I will use global variables from caller's
    # environment.
    if [ -n "$prev_diff_log" -a -z "$diffs_are_different" ]; then
        if [ ! -p "$cmp_diff_fifo" ]; then
            echo "cmp_diffs(): FIFO does not exist" 1>&2
            exit 1
        fi
        reduce_diff "$prev_diff_log" > "$cmp_diff_fifo" &
        if ! reduce_diff "$diff_log" | diff -q "$cmp_diff_fifo" - ; then
            echo "cmp_diffs(): Files '$prev_diff_log' and '$diff_log' differ."
            diffs_are_different=1
        fi
    fi
}

# Check, that rewritten repository has exactly all kept branches (no more, no
# less).
if [ -z "$keep_branches" ]; then
    echo "No branches were kept." 1>&2
    exit 1
fi
cd "$new_repo"
new_branches="$(git for-each-ref --format='%(refname)' | sed -e's:.*/::')"
b="$(echo "$keep_branches${new_branches:+${newline}$new_branches}" \
        | sort \
        | uniq -u
    )"
if [ -n "$b" ]; then
    echo "The following branches are missing either from the kept branches list or from the rewritten repository:" 1>&2
    echo "$b" 1>&2
    exit 1
fi
echo "Branches matched."

# Generate history for each branch in the original and the new (rewritten)
# repository. Then compare the transformed original history (with 'subdir'
# removed) with the real new history. Since this transformation is exactly
# what I expect the tree filter to have done, the diff should be either very
# small or empty.
IFS="$newline"
mkfifo "$cmp_diff_fifo"
for b in $keep_branches; do
    echo "Checking $b.."
    orig_log="$log_dir/orig_$b.log"
    new_log="$log_dir/new_$b.log"
    diff_log="$log_dir/$b.diff"
    cd "$orig_repo"
    gen_log "$b" > "$orig_log"
    cd "$new_repo"
    gen_log "$b" > "$new_log"
    gen_diff "$orig_log" "$new_log" > "$diff_log"
    cmp_diffs
    prev_diff_log="$diff_log"
done
rm -vf "$cmp_diff_fifo"
diffs_are_different="${diffs_are_different:+Some diffs are different.}"
diffs_are_different="${diffs_are_different:-All diffs are identical.}"
echo "Diffs generated. $diffs_are_different"

exit 0
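
The key step of gen_diff() is the sed expression that strips the leading "$subdir/" from the filename column of `git log --numstat` lines, so that the old history looks the way the rewritten one should. A minimal sketch of this rewrite on made-up input is shown below. The subject line and file names are assumptions for illustration, and '\+' and '\t' in the pattern are GNU sed extensions, as in the script itself.

```shell
#!/bin/sh
# Sketch of the path rewrite done by gen_diff(): strip a leading "$subdir/"
# from the filename column of 'git log --numstat' lines. The sample input is
# made up; '\+' and '\t' in the pattern are GNU sed extensions.
set -euf

subdir='show_words'

# One subject line followed by two "<added>TAB<deleted>TAB<path>" lines.
stripped="$(printf 'Add index\n3\t1\tshow_words/src/Main.hs\n10\t0\tREADME\n' \
    | sed -e "s@^\(\([[:digit:]]\+\t\)\{2\}\)$subdir/@\1@")"
echo "$stripped"
```

Only lines starting with two numeric columns are touched, so commit subjects that happen to contain the subdirectory name stay as they are. Binary files, for which numstat prints '-' instead of numbers, are not matched by this pattern either.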

Appendix H. Script 'examples/ex_index_filter.sh'.

Code:

# Example of index filter, which removes all files, except listed in the track
# file.

newline='
'
track_file='/home/sgf/tmp/t/jp_track.txt'
keep_files="$(sed -ne"s/^$GIT_COMMIT //p" "$track_file")"
all_files="$(git ls-files -c)"
# Remove only files not listed in $keep_files.
echo "${all_files:-$keep_files}${keep_files:+${newline}${keep_files}}" \
    | sort \
    | uniq -u \
    | xargs -r -d'\n' git rm --cached
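
Both filters look up the files to keep for the current commit in the track file. Judging by the sed expression, each line of that file has the form "<sha1> <filename>". A minimal sketch of the lookup on a temporary, made-up track file follows. Here GIT_COMMIT is set by hand, whereas git filter-branch exports it for the filter itself.

```shell
#!/bin/sh
# Sketch: look up the files kept for one commit in a track file whose lines
# have the form "<sha1> <filename>". The ids and names below are made up, and
# GIT_COMMIT is set by hand here (git filter-branch exports it for filters).
set -euf

track_file="$(mktemp)"
trap 'rm -f "$track_file"' EXIT
cat >"$track_file" <<'EOF'
1111111111111111111111111111111111111111 show_words/main.hs
1111111111111111111111111111111111111111 docs/notes with space.txt
2222222222222222222222222222222222222222 old/main.hs
EOF

GIT_COMMIT='1111111111111111111111111111111111111111'
# Print only lines for this commit, with the "<sha1> " prefix removed.
keep_files="$(sed -ne"s/^$GIT_COMMIT //p" "$track_file")"
echo "$keep_files"
```

Since only the sha1 and one space are removed, the rest of the line is taken as the file name verbatim, spaces included.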

And now, at last, it has come: the moment I have been waiting for, when I can finally say these words: "Silvana, take off!".

Upd1. Added script 'examples/ex_index_filter.sh'.
Upd2. Added "A note on removing files from a subdirectory".
