Using files

File - A named section of disk.

File implementation: Not necessarily a contiguous section of disk (but that fact may be hidden from users and programs).
Normally neither user nor programmer deals with the disk directly, only with named files.

In some high-performance applications (e.g. writing a high-speed search engine), you may need to implement your own file system, but this is obviously difficult and full of dangers.

File Types

List of file formats
Alphabetical list of file extensions
file - UNIX command to query a file's type
Programs (machine readable)
- Extensions: class, exe, com, o, obj, a, dll
- See also Program File Types
Program source code (human readable)
- java, c, cxx, h, hxx, pas, asm
Programs (human readable) - interpreted scripts
- js, php, sh, bat, pl
Program data (machine readable). Often strictly formatted. The precise length of each field is pre-defined (for ease of machine reading, and so data can be read into pre-defined fixed-size program variables).
- Database files.
- Documents for display. e.g. Word docs (doc), ps, pdf, rtf, tex, dvi
- Multimedia files (images, audio, video). e.g. gif, jpg, jpeg, mpg, mpeg, ram, avi, qt, au.
Program data (human readable). Often variable size, free-form text.
- Preferences files, rc.
- Documents for display. e.g. HTML docs (htm, html, shtml), xml, Office xml (docx), latex, txt
- Human readable program data - xml, json
- See also Human readable program data
Log files.
Archive files - tar
Compressed files - zip, arc, gz, Z.
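
An extension is only a naming convention. Tools like the file command look inside the file instead. A minimal sketch of that idea, assuming a small hand-picked table of well-known signatures (the table and the function names are illustrative, not how file itself is implemented):

```python
# A few well-known file signatures ("magic numbers"). Far from complete.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip (also docx, jar)"),
    (b"\x1f\x8b", "gzip"),
]


def sniff_type(header: bytes) -> str:
    """Guess a file type from its first bytes, ignoring the file name."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file to compare against the signatures."""
    with open(path, "rb") as f:
        return sniff_type(f.read(16))
```

Note that a docx file shows up as zip here: the Office xml formats are zip archives of xml files.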

File system divisions

Windows file system can spread over multiple pieces of hardware. Each given its own (single-letter) drive:

 drive:\dir\file

Can also partition a single piece of hardware into multiple drives.

UNIX file system can spread over multiple pieces of hardware too. But everything appears as sub-directories of a single file hierarchy.
Path may indicate hardware, something equivalent to:

 /drive/dir/file

or may hide hardware entirely:

 /dir/file
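
The difference between the two path styles can be seen with Python's pathlib, which can parse both styles on any machine. A small sketch (the paths are just the placeholder examples from above):

```python
from pathlib import PurePosixPath, PureWindowsPath

win = PureWindowsPath(r"C:\dir\file")
unix = PurePosixPath("/dir/file")


def describe(p):
    """Split a path into its drive, its root and its individual parts."""
    return {"drive": p.drive, "root": p.root, "parts": p.parts}


if __name__ == "__main__":
    print(describe(win))    # the drive letter is part of the path
    print(describe(unix))   # no drive: everything hangs off the single root "/"
```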

Hierarchical file system

Can organise files in separate dirs (Many web authors seem not to have discovered sub-dirs!).
Crucial to keep user files separate from system files (Why?).
Windows C:\Users\me
UNIX $HOME
Can reuse same file names in different sub-dirs (like index.html).
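
A small sketch of the last point: the same file name living in two different sub-dirs without conflict. The directory names are made up for illustration, and everything happens in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_by_name(root, name):
    """Return every file called `name` anywhere below `root`, sorted."""
    return sorted(p for p in Path(root).rglob(name) if p.is_file())


def demo():
    """Create index.html in two sub-dirs and list both copies."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for sub in ("photos", "notes/2024"):      # hypothetical sub-dirs
            (root / sub).mkdir(parents=True)
            (root / sub / "index.html").write_text(f"<h1>{sub}</h1>")
        found = find_by_name(root, "index.html")
        return [p.relative_to(root).as_posix() for p in found]


if __name__ == "__main__":
    print(demo())
```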

Long file names

All modern OSes allow long filenames:

 photos.kenya.apr.1963.html

Legacy systems:

DOS and Windows before Windows 95 had 8 char name, 3 char extension:
```
 phka0463.htm
```
VM/CMS had 8 char filenames and no sub-directories!
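
A rough sketch of the 8.3 limit as a check on a file name. It only tests the lengths described above; the real DOS rules also restricted which characters were allowed, and that is not modelled here. The function name is made up for illustration.

```python
def fits_8_3(filename: str) -> bool:
    """True if the name is at most 8 chars plus an optional extension of at most 3."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:                     # no dot at all: the whole thing is the name
        stem, ext = filename, ""
    # Only one dot is allowed, so the part before the last dot must have none.
    return 1 <= len(stem) <= 8 and len(ext) <= 3 and "." not in stem


if __name__ == "__main__":
    for name in ("phka0463.htm", "photos.kenya.apr.1963.html"):
        print(name, fits_8_3(name))
```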

Short file names are good for ..

Short file names are good, though, for:

File names you type. e.g. If you are typing file names at command-line. All-lower-case is easiest to type.
Program names at the command-line (i.e. the program you call has a short filename). sed, grep, ls, cut, etc. All-lower-case is easiest to type.
Some people say also URLs?
Maybe you should never type URLs. At most you type the host name that you saw somewhere. For everything else you cut and paste, or click.
Maybe short URLs: https://en.wikipedia.org/wiki/Othello make the web a more pleasant experience than long URLs:
http://odp.org/Arts/Literature/World_Literature/British/Shakespeare/Works/Plays/Tragedies/Othello/.
It is nice to have short, "guessable" URLs.
See "URL as UI"
See URL shortening. (Used e.g. on Twitter.)

Short URLs should probably be used in posters and ads:
This health poster on campus caught my eye. [Image of poster]
It should probably use a shorter, lowercase URL, like:
hse.ie/chlamydia

Q. Is there still a problem with that URL?

Some web server set-ups generate super-complex URLs, which can then get pasted into documents.
This is apparently a real ad. [Image of ad]

Symbolic link (cross-link, breaking the hierarchy, "shortcut") in UNIX


Can also just give a file multiple names:

 ln -s file secondname

Or have a pointer in one dir to a file in another dir.
Can do this on Windows as well (have multiple shortcuts to a data file or program).

On DCU Linux you will see lots of pointers:

$ ls -l /bin      | grep '^l'
$ ls -l /usr/bin  | grep '^l'
 
$ ls -l /usr/bin/touch
lrwxrwxrwx 1 root root 10 May  1  2020 /usr/bin/touch -> /bin/touch
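
The same listing can be done from a program. A minimal sketch, roughly equivalent to the ls -l | grep '^l' commands above (the function name is made up; the demo runs in a temporary directory rather than /usr/bin):

```python
import os
import tempfile


def list_symlinks(directory):
    """Map each symbolic link in `directory` to the path it points at."""
    links = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_symlink():
                links[entry.name] = os.readlink(entry.path)
    return links


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "file"), "w").close()
        # Same effect as: ln -s file secondname
        os.symlink("file", os.path.join(tmp, "secondname"))
        print(list_symlinks(tmp))
```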

Sometimes you see a link like this:

 /bin/ls -> /usr/bin/ls

Q. Why do programs sometimes call a specific path to a program, e.g. they call /bin/ls rather than just ls ?

Problems with cross-links

With shortcuts, a recursive search of the disk can get into an infinite loop, or at least produce duplicates. e.g. List all files on disk: if you follow symbolic links, you may list files twice.
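
One common way around this is to remember what has already been visited. A sketch under that assumption, identifying each file or directory by its (device, inode) pair so that anything reached a second time through a link is skipped. This is one possible approach, not the only one; many tools simply do not follow links at all.

```python
import os
import tempfile


def walk_once(root):
    """Yield every non-directory below `root` once, following symlinks safely."""
    seen = set()                        # (device, inode) pairs already visited
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            st = os.stat(current)       # stat() follows symbolic links
        except OSError:
            continue                    # dangling link or unreadable entry
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue                    # already reached under another name
        seen.add(key)
        if os.path.isdir(current):
            for name in sorted(os.listdir(current)):
                stack.append(os.path.join(current, name))
        else:
            yield current


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "sub"))
        open(os.path.join(tmp, "sub", "a.txt"), "w").close()
        os.symlink(tmp, os.path.join(tmp, "sub", "loop"))   # link back to the top
        print(list(walk_once(tmp)))
```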

Q. Also, if you delete a file, do you delete the symbolic links to it? If so, how do you find them - do you keep a reverse directory of them? And suppose I make a symbolic link to another user's file, and they delete the file. They can't delete my link.
A. If a link doesn't work, so what. You might even leave it dangling as a reminder.
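
Dangling links are easy to find after the fact, even without a reverse directory. A small sketch (function and file names are illustrative):

```python
import os
import tempfile


def dangling_links(directory):
    """Return names of symlinks in `directory` whose target no longer exists."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # islink() looks at the link itself; exists() follows it to the target.
        if os.path.islink(path) and not os.path.exists(path):
            broken.append(name)
    return broken


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "theirs.txt")
        open(target, "w").close()
        os.symlink(target, os.path.join(tmp, "mine"))
        os.remove(target)               # the other user deletes their file
        print(dangling_links(tmp))
```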

Security

If your directory is accessible by others on your local machine, someone on your machine can make it readable by the world on the Web (either maliciously or accidentally):

cd     /homes/your-userid/public_html
ln -s  /homes/other-userid/dir          shortcut

The world can then read the other user's directory through:

http://host/~your-userid/shortcut/

Has valid uses too. Might want to make one of your own dirs visible without having to have it under public_html, e.g. public_html disk is full, dir is on another disk.
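
A sketch of how a program could detect links that lead out of a web directory: resolve the path and check whether the result is still under the web root. This is only an illustration of the idea, not how any particular web server implements it, and the names are made up.

```python
import os
import tempfile


def escapes(web_root, path):
    """True if `path`, once symlinks are resolved, lies outside `web_root`."""
    real_root = os.path.realpath(web_root)
    real_path = os.path.realpath(path)
    # If the longest common ancestor is not the root itself, we have left it.
    return os.path.commonpath([real_root, real_path]) != real_root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        web = os.path.join(tmp, "public_html")
        other = os.path.join(tmp, "other-userid")
        os.mkdir(web)
        os.mkdir(other)
        os.symlink(other, os.path.join(web, "shortcut"))
        print(escapes(web, os.path.join(web, "shortcut")))
```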

Another example - ftp may only drop you in home directory rather than root directory and you may not be able to go upwards. What you do is put symbolic links in your home directory and you can access any directory through them:

  ln -s /var/mail  email
  ln -s /htdocs    ht

"Hierarchy with some cross-links" a very powerful model

How to structure a large collection of books / computer files / web pages.

1. A strict hierarchy with no cross-links. e.g. Dewey library system.
2. A basic hierarchy with some cross-links for difficult points. e.g. Linux file system.
3. Just a huge number of items and a search engine. e.g. The web.

In genealogy, family trees are no.2: basically hierarchical, with arbitrary cross-links needed for difficult points, rather than strictly hierarchical as many people seem to think.

Recycle bin (Windows)

The Windows Recycle bin is visible through the GUI, but is also visible as a directory through the Windows command line:

cd c:\$recycle.bin
dir (apparently empty)
dir /ah (show hidden files)

Backup

If it's data (1's and 0's), there's no real excuse for losing it. You can make automated copies and store them all over the world. Disk space is big and cheap. Machines are often idle. The network is always on. Backups can be automated across the network by scripts.

In future, backup and long-term storage will be an increasingly important service, like a bank.

Removable media - DVDs, CDs, tapes, USB keys, external hard disk.
versus
Backup to cloud / server. Distributed file system. Network read-write ftp, automated scripts, mirrors.

Other people back you up

Even if you back up nothing, your web pages are being backed up by other people:

Google cache (click on "Cached")
Microsoft cache (click on "Cached page")
Internet Archive
Flickr - network photo storage
Other social media.
Often our data is on a remote backed-up server by default.

Backup policy

Periodically dump entire file system to backup.
versus
Keep a running "mirror", and only back up things that have changed since the last time they were synced.
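
A minimal sketch of the "mirror" approach, assuming that a file counts as changed if it is missing from the mirror, has a different size, or has a newer modification time. Real tools use more careful checks. Note that this sketch never deletes anything from the mirror.

```python
import os
import shutil
import tempfile


def changed(src_file, dst_file):
    """Assumed rule: different size or newer modification time means changed."""
    a, b = os.stat(src_file), os.stat(dst_file)
    return a.st_size != b.st_size or a.st_mtime > b.st_mtime


def mirror(src, dst):
    """Copy files from `src` to `dst` only if missing or changed. Return what was copied."""
    copied = []
    for folder, _dirs, files in os.walk(src):
        rel = os.path.relpath(folder, src)
        target_dir = os.path.join(dst, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in sorted(files):
            s = os.path.join(folder, name)
            d = os.path.join(target_dir, name)
            if not os.path.exists(d) or changed(s, d):
                shutil.copy2(s, d)      # copy2 also keeps the modification time
                copied.append(os.path.normpath(os.path.join(rel, name)))
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = os.path.join(tmp, "src"), os.path.join(tmp, "dst")
        os.mkdir(src)
        with open(os.path.join(src, "a.txt"), "w") as f:
            f.write("one")
        print(mirror(src, dst))         # first run copies everything
        print(mirror(src, dst))         # second run finds nothing to do
```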

Perhaps only back up user files.
OS, system and application files can be recovered from install CDs / tapes.

Which of these is the most dangerous:

Keep 1 synchronised copy of your files. Back up the changes every night.
Keep 1 synchronised copy of your files. Back up the changes every hour.
Take a copy of all of your files once a week. Keep all these old copies. Do no backups at all during the week.
Take a copy of all of your files once a month. Keep all these old copies. Do no backups at all during the month.

Remember - it may take days or even months before an intrusion and destruction, or accidental damage, is noticed.
User may realise 2 years later that he has deleted some file and needs it back.
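
A sketch of why keeping old copies matters: each snapshot is kept under its own label, so a file deleted long ago can still be found in an older copy. The labels, file names and function names are hypothetical.

```python
import os
import shutil
import tempfile


def take_snapshot(src, backup_root, label):
    """Copy all of `src` into a new directory backup_root/label and keep it."""
    dest = os.path.join(backup_root, label)
    shutil.copytree(src, dest)          # raises an error if that snapshot exists
    return dest


def restore_candidates(backup_root, relative_path):
    """List the snapshot labels (sorted) that still contain the file."""
    return [label for label in sorted(os.listdir(backup_root))
            if os.path.exists(os.path.join(backup_root, label, relative_path))]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "files")
        backups = os.path.join(tmp, "backups")
        os.mkdir(src)
        os.mkdir(backups)
        with open(os.path.join(src, "thesis.txt"), "w") as f:
            f.write("draft")
        take_snapshot(src, backups, "week-01")
        os.remove(os.path.join(src, "thesis.txt"))      # accidental deletion
        take_snapshot(src, backups, "week-02")
        print(restore_candidates(backups, "thesis.txt"))
```

A single synchronised copy would have faithfully copied the deletion too.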

VAX/VMS (DEC) could be set to keep all drafts of a file since created.
- The equivalent of ls would hide all except the latest one by default. Unless explicitly asked otherwise.
- Programming with DCL would work with the latest one by default. Unless explicitly asked otherwise.
- Lot to be said for such an approach, now that disk space is cheap.
Versioning file systems
Google Docs saves all old drafts/versions of docs.
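
A toy sketch of the "keep all drafts" idea at the application level: every save writes a new numbered file, and reading picks the highest number by default. The name;N naming loosely imitates the VMS style, but the functions are made up for illustration and are not how any real versioning file system works.

```python
import os
import re
import tempfile


def versions(path):
    """Return the existing version numbers of `path`, lowest first."""
    folder, name = os.path.split(path)
    pattern = re.compile(re.escape(name) + r";(\d+)$")
    found = []
    for entry in os.listdir(folder or "."):
        m = pattern.match(entry)
        if m:
            found.append(int(m.group(1)))
    return sorted(found)


def save(path, text):
    """Write `text` as a new version instead of overwriting the old one."""
    existing = versions(path)
    number = existing[-1] + 1 if existing else 1
    with open(f"{path};{number}", "w") as f:
        f.write(text)
    return number


def latest(path):
    """Read the highest-numbered version (fails if nothing was saved yet)."""
    with open(f"{path};{versions(path)[-1]}") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        report = os.path.join(tmp, "report.txt")
        save(report, "first draft")
        save(report, "second draft")
        print(versions(report), latest(report))
```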