Binary v. Text 
 Machine-readable data - from binary to text 
Traditionally, program data would be in very efficient binary format:
 (2-byte-number)(4-byte-number)(1-byte-character)(2-byte-number)....
Program needs to know structure of the file to display it.
Otherwise it doesn't know where to put boundaries
- might display:
 (4-byte-no)(1-byte-char)(2-byte-no)(4-byte-no)....
There has been a    trend   towards program data
that humans can read in a text editor: 
 (1-byte-character)(1-byte-character)(1-byte-character)....
which displays 
characters that express the contents.
- html, xml, json, and (sort of) ps, tex
 Text format is less efficient, but easier to work with 
 To store a 2 byte short integer in a file:
  -  Binary format: First 2 bytes of the file are:
 10010000 10000000
 You just have to "know" that the first 2 bytes are to be read together
as defining an integer.
If we interpret them that way,
 they define the number:
 36992
 
   
-  Readable text format: First 16 bytes of the file are:
 00111100 01101110 01110101 01101101 00111110 00110011 00110110 00111001 00111001 00110010 00111100 00101111 01101110 01110101 01101101 00111110
 which translates byte-by-byte as the characters:
 60 110 117 109 62 51 54 57 57 50 60 47 110 117 109 62
 that is,  the characters:
 '<' 'n' 'u' 'm' '>' '3' '6' '9' '9' '2' '<' '/' 'n' 'u' 'm' '>'
 i.e. when displayed in a text editor this file will read:
 <num>36992</num>
 All this to make the number human readable.
To edit the data:
  -  Binary format:
Write a program to edit it.
 
-  Text format:
Use text editor  to edit it.
 "Human readable"  machine data 
With the readable format,
an expert human can debug, tweak the data
in a simple text editor,
if they know what they are doing.
It is a much less efficient format 
- might take 16
 1-byte characters to display the 2-byte-number
- and this is why binary was so popular in the past.
But can  use  such schemes now
 because machines   more powerful, disk space bigger,
bandwidth better.
HTML showed it could be done.
XML
took the idea  much  further.
"Like HTML, XML files are text files that people shouldn't have to read, but may when the need arises."
Especially hard to work with are  
secret binary formats like  
 the 
 .doc 
format in 
 
Microsoft Word
from 1983
to 2007
(and still widely used  today).
With secret-format binary, 
  you had to use  Microsoft Word  to modify the data.
The .doc format is still binary, but no longer secret-format.
If you use binary   Microsoft Word .doc,
it is hard to write scripts to manipulate your files
as you can  with text files.
 
Instead you may have  to point and click  inside other people's menus.
This is the main reason why I never used Microsoft Word.
If you use     Microsoft Word .docx,
the raw data is text format (XML).
This is  easier to script, but there are still  issues.
 
 
  
 
 
  -  Old post sums it up:  What's So Bad About Microsoft?
 
- Alternative View of the Microsoft Monopoly
- 1999 discussion on   secret  file formats.
-  Interesting argument that:
 
"The file formats of MS Office were designed by Microsoft to be difficult to reverse
               engineer and to be as closely tied as possible to the Microsoft platform. This does not translate to a good standard. If a standard is to be decided for Word
               Processing it should be human readable, easily understandable, cross platform, and leave room for upgrades with bidirectional compatibility. 
               The Office formats have no concept of expandability and are neither forward nor backward compatible because Microsoft always intends to replace the
               format with something incompatible in the next release to force users to upgrade. The Office file formats have no concept of interoperability because
               Microsoft's primary concern is forcing people to use Microsoft Office on Microsoft Windows. The Office file formats are not easy to implement or
               understand because part of their purpose is to delay competitors from reverse engineering them."
 
 
 
|  Scripting .doc files   It is possible   to write scripts to process Word (and other Office) files.
  grep *doc Q. Say you have 1,000 MS Word .doc   documents on disk.
How do you search for a string in them?
You can't do:grep string *doc
 
 
 
The default search
in
Windows   Explorer
can search for strings in Word files, but only returns the list of files that match,
not the detailed output  
that grep returns.
Q. Can Windows Explorer search be called from a DOS script?
(Would need to return text output list of files.)
 
Desktop search programs
will pre-index your files and search them with a GUI or Web interface.
Some can search   Word files.
 Some can be called through a programming API.
 Q. Can any of these be called from a  DOS script like grep?
 
Google Desktop
 
Windows   Search
 
 
 You can use VBScript. 
  
 
  sed *doc Q. Say you have 1,000 MS Word  .doc  documents on disk.
How do you go through them,
changing string S1 everywhere it occurs into string S2?This is easy   if you have text format files.
A short Linux script using, say,
sed
can change them all.
 
 
 You can use VBScript:
 
  More questions 
 How do  I script .doc files on Linux?
VBScript does not run on Linux.
 
 Can I   script access to online  
Office  applications  like
Google Docs?
 
  Google Apps Script 
- scripting access to applications   in 
   the
Google Apps
family.
 | 
 
 
 Part of a  Word file in XML format (DOCX):
 
The above corresponds to this part of the document:
 
 
How to see the XML:
-  Rename file.docx to file.zip
-  Unzip it
-  See document.xml
|    Scripting .docx files   Can we write a
 find and replace script  
for .docx files?
   Unzip .docx to get XML
 sed on the XML
 Zip the changed file  back up as .docx
  Does this work? Or, if text changes, do other things (e.g. margins) need to change?
 |