Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


svn - Will Subversion efficiently store OpenXML Office documents?

I have been managing Subversion as an engineering document storage repository for my company. It is working fairly well, but I have a question about how MS Office 2007 formats are (or should be) handled by Subversion.

I'm looking at an Excel 2007 spreadsheet (extension .xlsx) in my working copy to which Subversion has applied the svn:mime-type property application/octet-stream. This means that Subversion is treating it as binary, right?

I was hoping that the new MS Office document formats would be stored efficiently by Subversion. My understanding is that a full copy of a binary file will be made on every commit of that file, whereas if the file is text, a small change to the file will result in a small amount of additional data being added to the repository (in a typical situation at least).

I don't understand many of the details of XML, but I thought that an XML file was text, and that it would therefore be stored efficiently by Subversion.

Is it possible to configure Subversion so that MS Office OpenXML documents are stored efficiently?

Follow-up (2009-11-09): I've found that Office documents can be stored as plain text using the Office 2003 XML document formats (Excel: XML Spreadsheet 2003; Word: Word XML Document). There is a warning about loss of formatting, but I have yet to encounter any noticeable loss.


1 Reply


From the Office Open XML article on Wikipedia:

An Office Open XML file is a ZIP-compatible OPC package containing XML documents and other resources.

In other words, OpenXML files are actually zip files with XML files in them. Compression or encryption "scrambles" the data, sabotaging Subversion's ability to generate small deltas between revisions. This is not related to the svn:mime-type property: Subversion treats all files as binary when generating deltas.
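To make the "scrambling" concrete, here is a minimal Python sketch (not part of the original experiment). It assumes nothing beyond the standard library; the two XML "revisions" are made-up data, and the helper names `list_package_members`, `common_suffix_length` and `make_revisions` are hypothetical. It lists the members of a ZIP-based package and compares how much of two nearly identical revisions still lines up byte for byte before and after deflate compression.

```python
import io
import zipfile
import zlib


def list_package_members(package_bytes: bytes) -> list[str]:
    """Return the member names of a ZIP-based package such as a .docx or .xlsx."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def common_suffix_length(a: bytes, b: bytes) -> int:
    """Count how many trailing bytes two byte strings share."""
    count = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        count += 1
    return count


def make_revisions() -> tuple[bytes, bytes]:
    """Build two made-up XML revisions that differ by one inserted paragraph."""
    body = b"".join(
        b"<w:p><w:t>Paragraph number %d</w:t></w:p>" % i for i in range(200)
    )
    first = b"<w:document>" + body + b"</w:document>"
    second = b"<w:document><w:p><w:t>New</w:t></w:p>" + body + b"</w:document>"
    return first, second


if __name__ == "__main__":
    first, second = make_revisions()
    # Uncompressed, almost all of the second revision repeats the first.
    print("raw bytes shared at the end:", common_suffix_length(first, second))
    # Compressed, the shared tail usually shrinks to little or nothing.
    print(
        "deflated bytes shared at the end:",
        common_suffix_length(zlib.compress(first), zlib.compress(second)),
    )
```

A byte-level delta algorithm can only reuse data that still looks the same in both revisions, which is why a small edit inside a compressed container tends to cost far more than the same edit in the raw XML.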

In Dutch we have a saying: "measuring is knowing". The graph below shows the results of an experiment in which I imported a 500K OpenXML document into an SVN 1.6 repository (revision 1). I then added a paragraph from another document, saved and committed. This was repeated 5 times (revisions 2 to 6).

[Graph not preserved: repository disk usage per revision, revisions 1 to 6]

As you can see, committing a new docx revision that just adds a paragraph will cost you about 150K of disk space. This is still much more efficient than storing a full copy of each revision without the help of a version control system.

I also repeated the experiment in a separate test repository, uncompressing each revision of the docx before committing it. As you can see, the document revisions would be stored much more efficiently if the file weren't compressed. It's also interesting to see that Subversion's own data compression is about as efficient as zip: storing the first revision of an uncompressed docx in Subversion takes about the same space as the original docx.
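The answer does not say how each revision was uncompressed. One possible approach, shown here only as a sketch under that assumption, is to rewrite the package so that every member is stored without compression before committing it. The function name `repack_without_compression` is hypothetical, and whether Office opens the repacked file without complaint is not something this sketch verifies.

```python
import io
import zipfile


def repack_without_compression(package_bytes: bytes) -> bytes:
    """Rewrite a ZIP-based package so every member is stored uncompressed."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_STORED) as target:
        for info in source.infolist():
            # Keep the original name and timestamp so that unchanged members
            # produce the same bytes from one revision to the next.
            stored = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            target.writestr(
                stored,
                source.read(info.filename),
                compress_type=zipfile.ZIP_STORED,
            )
    return output.getvalue()
```

For a file on disk this could be used as `Path("report.docx").write_bytes(repack_without_compression(Path("report.docx").read_bytes()))`, where `report.docx` is a placeholder name and `Path` comes from `pathlib`. Extracting the package into a directory of plain XML files would be another way to get the same effect.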

YMMV.
