Track changes in Excel, Word, PowerPoint, PDFs, Images & Videos with Git
source link: https://tech.marksblogg.com/git-track-changes-in-media-office-documents.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Mark Litwintschik
I have 15 years of consulting & hands-on build experience with clients in the UK, USA, Sweden, Ireland & Germany. Past clients include Bank of America Merrill Lynch, Blackberry, Bloomberg, British Telecom, Ford, Google, ITV, LeoVegas, News UK, Pizza Hut, Royal Bank of Scotland, Royal Mail, T-Mobile, TransferWise, Williams Formula 1 & UBS. I hold both a Canadian and a British passport. My CV, Twitter & LinkedIn profile.
Git is an open-source, distributed version control system. It was designed by Linus Torvalds in the Spring of 2005 with the goal of improving the workflow around the development of the Linux Kernel. Within the space of a month, Linus, along with Junio Hamano and a handful of others, had written the initial C code and used Git to merge multiple branches together. By June of 2006, Git had been used for the Linux Kernel's version 2.6.12 release.
Git can describe changes in files well when the contents can be rendered as plain text. Unfortunately, the underlying diff tool being used will simply state that binary files "differ".
$ git diff test.mp4
diff --git a/test.mp4 b/test.mp4 index 9a9efb7..d9607c6 100644 Binary files a/test.mp4 and b/test.mp4 differ
Microsoft Office began using XML files within PKZip containers in 2007. This made the task of building open-source readers for Word, Excel and PowerPoint documents much easier compared to Microsoft's previous formats. Between Pandas and Pandoc you can dump the contents of any given Office document into plain text. These plain text dumps can then be compared against one another and this process can be wrapped up inside the git diff command.
Likewise, Phil Harvey's ExifTool tool can be used to dump metadata around images and videos. While these aren't visual comparisons, they can still be a useful starting place for understanding when change has taken place.
In this post, I'll describe ways of using alternative deserialisation tools so that you can get more informative difference information between binary files using Git.
Installing Tooling
Using Homebrew, I'll install ExifTool, Git, Pandas and Python 3 on my MacBook Pro.
$ brew install \ exiftool \ git \ pandoc \ virtualenv
I'll set up a virtual environment and install some Python packages. These will be used for extracting document contents.
$ virtualenv ~/.officecomp $ source ~/.officecomp/bin/activate $ python3 -m pip install \ openpyxl \ Pandas \ python-pptx \ textract
Comparing Office Documents
I'll initialise a new Git repository, configure it and then populate it with blank Word, Excel and PowerPoint documents.
$ mkdir -p ~/repo $ cd ~/repo $ git init
The following will map various file extensions to alternative diff configurations.
$ vi .gitattributes
*.docx diff=word *.pptx diff=powerpoint *.xlsx diff=excel *.pdf diff=pdf *.jpg diff=images_videos *.png diff=images_videos *.gif diff=images_videos *.mp4 diff=images_videos
Inside of the repository's configuration, I'll set up BASH commands that will either extract a file's contents as plain text or dump its metadata. I will also set up a wdiff alias which will provide a colourful and concise output for the diff command.
$ vi .git/config
[core] repositoryformatversion = 0 filemode = true bare = false logallrefupdates = true ignorecase = true [alias] wdiff = diff --word-diff=color --unified=1 [diff "word"] textconv=pandoc --to=markdown binary=true prompt=false [diff "excel"] textconv=python3 /Users/mark/Downloads/test/xlsx-dump.py binary=true [diff "powerpoint"] textconv=python /Users/mark/Downloads/test/pptx-dump.py binary=true [diff "pdf"] textconv=python3 /Users/mark/Downloads/test/pdf-dump.py binary=true [diff "images_videos"] textconv=exiftool binary=true
Below you can see the three Office documents. Their file extensions hide that they are actually ZIP files containing XML data.
$ ls -l *.{docx,pptx,xlsx}
-rw-r--r--@ 1 mark staff 8145 Aug 24 10:34 Book1.xlsx -rw-r--r--@ 1 mark staff 11588 Aug 24 10:34 Doc1.docx -rw-r--r--@ 1 mark staff 32326 Aug 24 10:35 Presentation1.pptx
I'll add them to the repository and commit them so that we can compare future changes to these files.
$ git add \ Book1.xlsx \ Doc1.docx \ Presentation1.pptx $ git commit -am'Adding blank documents'
Comparing Word Documents
I've added a heading to the Word document and saved it. Below is the output when running wdiff.
$ git wdiff Doc1.docx
diff --git a/Doc1.docx b/Doc1.docx index 90ff36f..5b96082 100644 --- a/Doc1.docx +++ b/Doc1.docx @@ -1 +1 @@ # Hello
I can also view changes to this file in reverse chronological order.
$ git log -p --word-diff=color Doc1.docx
commit c824375dac2f78cc8bcfacdb6766e00219efe0aa Author: Mark Litwintschik Date: Tue Aug 24 11:14:53 2021 +0300 Adding initial screenshot diff --git a/Doc1.docx b/Doc1.docx index 5b96082..d05790f 100644 --- a/Doc1.docx +++ b/Doc1.docx @@ -1 +1 @@ # HelloHello2 commit cfc43b74eee780bcb18ec3dae37d910daa86a428 Author: Mark Litwintschik Date: Tue Aug 24 11:01:42 2021 +0300 Saving inital PDF diff --git a/Doc1.docx b/Doc1.docx index 90ff36f..5b96082 100644 --- a/Doc1.docx +++ b/Doc1.docx @@ -1 +1 @@ # Hello commit 78584bb9b7464a02e4e3784418e13465c612f21d Author: Mark Litwintschik Date: Tue Aug 24 10:36:05 2021 +0300 Adding blank documents diff --git a/Doc1.docx b/Doc1.docx new file mode 100644 index 0000000..90ff36f --- /dev/null +++ b/Doc1.docx @@ -0,0 +1 @@
Comparing PowerPoint Documents
I'll build a small PowerPoint to plain text dump utility using the python-pptx library that was installed earlier.
$ vi pptx-dump.py
import sys from pptx import Presentation for slide in Presentation(sys.argv[1]).slides: for shape in slide.shapes: if not shape.has_text_frame: continue for paragraph in shape.text_frame.paragraphs: for run in paragraph.runs: print(run.text)
I'll then add a heading to Presentation1.pptx and view the difference using Git.
$ git wdiff Presentation1.pptx
diff --git a/Presentation1.pptx b/Presentation1.pptx index 9df8333..ff5383c 100644 --- a/Presentation1.pptx +++ b/Presentation1.pptx @@ -0,0 +1 @@ Hello
Comparing Excel Documents
I'll build a small Excel to plain text dump utility using Pandas and OpenPyXL. This will iterate through each worksheet and dumps its contents, including headers, in CSV format.
$ vi xlsx-dump.py
from io import StringIO import sys import pandas as pd for sheet_name in pd.ExcelFile(sys.argv[1]).sheet_names: output = StringIO() print('Sheet: %s' % sheet_name) pd.read_excel(sys.argv[1], sheet_name=sheet_name)\ .to_csv(output, header=True, index=False) print(output.getvalue())
I'll then change some values on both Sheet1 and Sheet2 within Book1.xlsx and view the difference.
$ git wdiff Book1.xlsx
diff --git a/Book1.xlsx b/Book1.xlsx index 1678ee3..58278dc 100644 --- a/Book1.xlsx +++ b/Book1.xlsx @@ -1,6 +1,6 @@ Sheet: Sheet1 -Hello +Hello! Sheet: Sheet2 -Hello Again +Hello Again!
Comparing PDF Documents
I'll print Doc1.docx to Doc1.pdf and add it to the repository.
$ git add Doc1.pdf $ git commit -am"Saving inital PDF"
Then, using textract, I'll create a small PDF to text dump utility.
$ vi pdf-dump.py
import sys import textract print(textract.process(sys.argv[1]).decode('utf-8'))
I'll then change the heading in the Word document and print it out to Doc1.pdf again. Below is what the difference looks like.
$ git wdiff Doc1.pdf
diff --git a/Doc1.pdf b/Doc1.pdf index 964baee..35f2f4b 100644 --- a/Doc1.pdf +++ b/Doc1.pdf @@ -1,2 +1,2 @@ HelloHello2
Keep in mind there are red and green background colours behind the text that differs. These won't render on this website but will inside the Terminal.
Comparing Images
The temptation to use an image-to-ASCII renderer here was strong but for the sake of simplicity, I'll only compare the image metadata.
I took a screenshot, named it test.png and added it to the repository.
$ git add test.png $ git commit -am'Adding initial screenshot'
I then took another screenshot and replaced test.png with its contents. Below is what the difference looks like.
$ git wdiff test.png
diff --git a/test.png b/test.png index af264c8..d82cfad 100644 --- a/test.png +++ b/test.png @@ -1,9 +1,9 @@ ExifTool Version Number : 12.30 File Name : yDRG2g_test.pngtest.png File Size : 151684 KiB File Modification Date/Time : 2021:08:24 11:16:10+03:002021:01:20 21:28:57+02:00 File Access Date/Time : 2021:08:24 11:16:10+03:00 File Inode Change Date/Time : 2021:08:24 11:16:10+03:0011:15:54+03:00 File Permissions : -rw--------rw-r--r-- File Type : PNG @@ -11,4 +11,4 @@ File Type Extension : png MIME Type : image/png Image Width : 10201738 Image Height : 8842056 Bit Depth : 8 @@ -54,4 +54,4 @@ Resolution Unit : inches User Comment : Screenshot Exif Image Width : 10201738 Exif Image Height : 8842056 Pixels Per Unit X : 5669 @@ -61,3 +61,3 @@ XMP Toolkit : XMP Core 6.0.0 Apple Data Offsets : (Binary data 28 bytes, use -b option to extract) Image Size : 1020x8841738x2056 Megapixels : 0.9023.6
Again, there are red and green background colours behind the text that differs. These won't render on this website but will inside the Terminal.
Comparing Videos
I copied a video off of my phone, named it test.mp4 and added it to the repository.
$ git add test.mp4 $ git commit -am'Adding initial video'
I then replaced that video with another video off my phone. Below is the difference.
diff --git a/test.mp4 b/test.mp4 index 9a9efb7..d9607c6 100644 --- a/test.mp4 +++ b/test.mp4 @@ -1,9 +1,9 @@ ExifTool Version Number : 12.30 File Name : saDQQs_test.mp4test.mp4 File Size : 6063 MiB File Modification Date/Time : 2021:08:24 11:23:30+03:002019:12:14 15:12:25+02:00 File Access Date/Time : 2021:08:24 11:23:30+03:00 File Inode Change Date/Time : 2021:08:24 11:23:30+03:0011:23:08+03:00 File Permissions : -rw--------rw-r--r-- File Type : MP4 @@ -14,9 +14,9 @@ Minor Version : 0.0.0 Compatible Brands : isom, mp42 Media Data Size : 6320243665774570 Media Data Offset : 32 Movie Header Version : 0 Create Date : 2019:12:14 13:03:1213:04:10 Modify Date : 2019:12:14 13:03:1213:04:10 Time Scale : 1000 Duration : 10.4310.90 s Preferred Rate : 1 @@ -34,6 +34,6 @@ Android Version : 9 Track Header Version : 0 Track Create Date : 2019:12:14 13:03:1213:04:10 Track Modify Date : 2019:12:14 13:03:1213:04:10 Track ID : 1 Track Duration : 10.4310.90 s Track Layer : 0 @@ -52,9 +52,9 @@ Pixel Aspect Ratio : 65536:65536 Color Representation : nclx 1 1 1 Video Frame Rate : 30.02429.992 Matrix Structure : 1 0 0 0 1 0 0 0 1 Media Header Version : 0 Media Create Date : 2019:12:14 13:03:1213:04:10 Media Modify Date : 2019:12:14 13:03:1213:04:10 Media Time Scale : 48000 Media Duration : 10.4010.86 s Handler Type : Audio Track @@ -68,3 +68,3 @@ Image Size : 3840x2160 Megapixels : 8.3 Avg Bitrate : 48.548.3 Mbps GPS Latitude : 0 deg 0' 0.00" N
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK