When a file is first referenced, the directory structure must be searched: Does the file exist? Where is it actually stored on disk? Do I have permission to read (write) it? The OS may also take the opportunity to read the first n locations, assuming they will be used, rather than (inevitably) make the disk seek later. This data is then cached so the search does not have to be done again. e.g. In C++:
#include <iostream>
#include <fstream>
#include <cstdlib>
using namespace std;

int main()
{
  ifstream f ( "dir/file53.txt", ios::in );
  if ( ! f )
  {
    cout << "Can't open file. \n";
    exit(1);
  }
  // else just use the f stream object
  // no longer refer to it by file name (don't do the search again)

  int x;
  f >> x;   // read 1st item into variable
}           // end of scope closes file (removes cached info in memory)
Mounting a file system
is like opening a file, but for an entire file system:
i.e. Things that need only be done once:
Where do we access the hardware? Where do we attach the new file system (assign a unique drive letter, or a mount point in the hierarchy)? Does it contain a valid file system in the expected format?
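As a minimal sketch of the idea (assuming Linux, and a hypothetical device /dev/sdb1 holding a FAT file system), the mount(2) system call does all of this once:

#include <sys/mount.h>
#include <cstdio>

int main()
{
  // Attach the file system on the (hypothetical) device /dev/sdb1
  // at one point in the existing hierarchy. Done once - afterwards
  // its files are reached through /mnt/usb with no further checks.
  if ( mount( "/dev/sdb1", "/mnt/usb", "vfat", MS_RDONLY, nullptr ) != 0 )
  {
    perror( "mount" );   // bad device, invalid format, or no permission
    return 1;
  }

  // ... use files under /mnt/usb by name, like any other files ...

  umount( "/mnt/usb" );
}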
Advantages:
Fast searching. Can quickly jump to block n for direct access
(e.g. binary file with fields).
Don't have to move disk head back and forth.
Especially if writing sequential data, head is always at correct position.
Disadvantages:
Get fragmentation problems
like with memory.
1 G of disk space free but no more than 50 M in a row.
Defragmentation
can take hours.
The 1st block links to the 2nd block, the 2nd links to the 3rd, etc. The link field of the last block contains an EOF marker (in fact, the EOF may be half-way through the last block - internal fragmentation).
To expand file, simply grab free block anywhere and link it on end of list.
or:
The free-blocks list in a FAT is rather elegant - it is one enormous linked list linking together all unused blocks. When we delete a file, just remove the directory entry and hook its linked list of blocks onto the end of the free list. Minimal work to delete:
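A toy sketch of this in C++ (an in-memory array stands in for the on-disk FAT; the block numbers are made up):

#include <vector>
#include <iostream>
using namespace std;

const int FAT_EOF = -1;           // end-of-chain marker

// Toy FAT: fat[b] = number of the block that follows block b.
vector<int> fat ( 32, FAT_EOF );

// Follow a chain of blocks from its start block.
void printChain ( int start )
{
  for ( int b = start; b != FAT_EOF; b = fat[b] )
    cout << b << " ";
  cout << "\n";
}

// DELETE: find the tail of the free list and hook the file's
// whole chain onto it. No data blocks are touched on disk.
void deleteFile ( int &freeList, int fileStart )
{
  if ( freeList == FAT_EOF ) { freeList = fileStart; return; }
  int tail = freeList;
  while ( fat[tail] != FAT_EOF ) tail = fat[tail];
  fat[tail] = fileStart;
}

int main()
{
  // File occupies blocks 9 -> 16 -> 1. Free list is 2 -> 3.
  fat[9] = 16;  fat[16] = 1;  fat[1] = FAT_EOF;
  int freeList = 2;  fat[2] = 3;  fat[3] = FAT_EOF;

  deleteFile ( freeList, 9 );   // minimal work: just relink
  printChain ( freeList );      // 2 3 9 16 1 - file's blocks still intact
}

Note the file's chain survives intact at the end of the free list - which is why UNDELETE is easy (see below).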
Windows NT family replaces FAT with NTFS but provides backward compatibility.
Centralised index also, like FAT.
(i.e. Data not scattered round disk.)
In this case, each file has a list of blocks that it is held in.
e.g. File's index is:
9, 16, 1, 10, 25
Just a variable-size array.
Benefit - Direct access to n'th block (see the sketch below).
Direct access is good -
Easier to append.
Or rewrite after location 300 K (only 2nd half of file
has been edited).
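A minimal sketch of the lookup, assuming a 4 K block size and the 5-block index above:

#include <vector>
#include <iostream>
using namespace std;

int main()
{
  const long BLOCK = 4096;                     // assumed block size
  vector<int> index = { 9, 16, 1, 10, 25 };    // the file's index from above

  long offset = 10000;              // byte of the file we want to access
  long n      = offset / BLOCK;     // which block of the file? (here: 2)

  // One array lookup gives the disk block - no chain of links to follow.
  cout << "byte " << offset << " is in disk block " << index[n]
       << ", " << ( offset % BLOCK ) << " bytes in\n";
}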
Problem - Index is proportional to size of file.
Solutions: e.g. a linked list of index blocks, or a multi-level index (index blocks pointing to further index blocks), as in the UNIX inode.
Also along these lines - pre-allocate a big chunk of disk to FAT / inodes (pre-allocated even if disk empty). Can reserve a good chunk of disk for these structures if they help fast I/O.
We also cache file information from the Open() system call, and disk blocks, in memory. This is good - it avoids repeated disk I/O. But the cache is limited because memory is limited, so we need a block replacement algorithm, like page replacement in demand paging - perhaps use LRU to decide who to kick out of the cache.
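A sketch of an LRU block cache (block contents are just strings here; a real buffer cache also tracks dirty blocks, locking, and so on):

#include <list>
#include <unordered_map>
#include <string>
using namespace std;

// Toy LRU cache mapping block number -> block contents.
class BlockCache
{
  size_t capacity;
  list< pair<int,string> > blocks;    // front = most recently used
  unordered_map< int, list< pair<int,string> >::iterator > where;

public:
  BlockCache ( size_t c ) : capacity(c) {}

  string read ( int blockNo )
  {
    auto it = where.find ( blockNo );
    if ( it != where.end() )          // cache hit - no I/O at all
    {
      blocks.splice ( blocks.begin(), blocks, it->second );
      return it->second->second;
    }
    // cache miss - (pretend to) read the block from disk
    string data = "contents of block " + to_string ( blockNo );
    blocks.push_front ( { blockNo, data } );
    where[blockNo] = blocks.begin();
    if ( blocks.size() > capacity )   // kick out the LRU victim
    {
      where.erase ( blocks.back().first );
      blocks.pop_back();
    }
    return data;
  }
};

int main()
{
  BlockCache cache ( 2 );
  cache.read ( 9 );   cache.read ( 16 );
  cache.read ( 9 );   // hit - returned immediately, no I/O
  cache.read ( 25 );  // miss - block 16 is the LRU victim
}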
Just like memory info sometimes ends up on disk (swapping,
demand paging),
so disk info sometimes ends up in memory (disk-block cache).
Disk is overflow of crowded memory.
Memory is not overflow of disk - disk is not crowded.
Rather, memory is fast mirror of bits of disk
that we are working on.
e.g. Read block, keep in memory, if read again
can return the info immediately without any
I/O either now or later.
Technically, the system does not have to actually write blocks to disk until a read request comes in for that file (from this process or another process). But this can be dangerous - if the power goes off we may lose data that we thought was saved to disk long ago. In reality, write-to-disk is asynchronous but not delayed indefinitely. It will actually happen quite quickly if nothing else is going on.
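A sketch using POSIX calls (file name hypothetical): write() may leave the data only in the in-memory cache; fsync() forces it to disk right now.

#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int main()
{
  int fd = open ( "dir/file53.txt", O_WRONLY | O_CREAT, 0644 );
  if ( fd < 0 ) return 1;

  const char *line = "important data\n";
  write ( fd, line, strlen(line) );  // may still be only in the memory cache
  fsync ( fd );                      // force the blocks to disk now
  close ( fd );                      // data survives a power-off after fsync
}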
How do we always know how big a file is?
file_size = (no. of blocks - 1) × (size of block) + (no. of bytes used in last block)

Last block has data (binary or text) ending with EOF / Ctrl-D char half-way through block.
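e.g. (made-up numbers) A 5-block file with 4 K blocks, using 1000 bytes of its last block:
file_size = (5 - 1) × 4096 + 1000 = 17384 bytes.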
Many systems now have a "soft" delete, where "deleted" files are moved to a "Trash can", "Waste basket" or "Recycle bin", but this bin is not emptied until the user asks for it to be emptied. The files can be recovered at any time.
Obviously, if the files can be recovered, then the blocks they occupy on disk cannot be marked free. Rather, they are still in use.
When you finally empty the Recycle Bin, the blocks are finally marked free. However, even after this "real" DELETE action, the files still exist on disk!
Why? Because there is no need for the OS to actually overwrite those blocks on disk. Doing so would take time, and is not necessary. All that is necessary is that the blocks are now marked free and available for use by other files. Generally, DELETE does the absolute minimum necessary to mark the space free and move on.
One unfortunate consequence of this is that no OS really deletes your files, and anyone who is willing to put a bit of effort in might be able to recover long-ago deleted data from your disk.
Exercise - In the above, when you invented a "Recycle bin" for UNIX, how would you implement an "Empty Recycle bin"? (e.g. If you have overridden the system "rm" command, then how do you really rm things?)
In a FAT, UNDELETE is easy. Recall above - DELETE leaves the linked list of blocks intact (or, to be precise, entries in the FAT that can be interpreted as a linked list). It just hooks the linked list on to the end of the free-blocks linked list. One (perhaps unexpected) consequence of this is that the file is easy to recover. Just find the start-block and we can recover the whole file. Because we hook to the end of the free-list, the blocks may not be overwritten for a long time.
And in fact the old directory entry may still exist. All DELETE did was strike out the 1st char of the filename to mark the entry free. The no. of the start block is still intact. UNDELETE can list the marked-free directory entries, user supplies the 1st char of the filename, and UNDELETE is often able to recover the whole file.
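A sketch of the directory-entry side (0xE5 is the real FAT "this entry is free" marker; the struct is simplified):

// Simplified FAT-style directory entry.
struct DirEntry
{
  char name[12];      // 1st char is struck out on delete
  int  startBlock;    // still intact after delete
};

const char FREE_MARK = (char) 0xE5;   // FAT's "entry free" marker

// DELETE does the absolute minimum: strike out the 1st char.
void del ( DirEntry &e ) { e.name[0] = FREE_MARK; }

// UNDELETE: the user supplies the lost 1st char; the intact
// start block (and the FAT chain) then recover the whole file.
void undelete ( DirEntry &e, char firstChar ) { e.name[0] = firstChar; }

int main()
{
  DirEntry e = { "LETTER.TXT", 9 };
  del ( e );              // entry listed as free; startBlock = 9 intact
  undelete ( e, 'L' );    // name restored; follow the FAT from block 9
}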
If the dir entry is reused / overwritten,
we have to search every block on disk by hand.
Although once we find the start block,
we can use the FAT to find the next blocks.
- If you deleted recently,
you could start the search at the end of the free-blocks list.
Perhaps UNDELETE being easy is an advantage of a FAT. Or perhaps it is a disadvantage. e.g. In the Monica Lewinsky case, people under investigation were startled to discover that old deleted files of theirs could be undeleted and recovered from their PCs.
Perhaps an OS should make clear to the user if their data still exists, and "Empty the Trash Bin" should mean "Delete forever", even if it takes a long time. Perhaps have a "Shredder" icon. Could even say "Shredding these files will take 2 hours. Continue?"
It gets worse. Data must have been in memory sometime. Might have been in swap space. Could it still exist in swap file on disk? Conclusion - If selling computer, the only real security is to remove and destroy the hard disk first.
A lot more difficult. The blocks do not refer to each other. If the list of them is lost, we may have to read every single block on disk to recover the file - no idea where it is.
And the list is lost, because the inode index data structures are immediately reused by other files.
Summary - after "rm" on UNIX, the data blocks still exist on disk (vulnerable to overwriting of course), but to UNDELETE you might have to read the entire disk (i.e. every block) by hand.
If you delete a file and it is not in an old backup, it's effectively gone, unless you have the resources to pay someone to search disk by hand (and file is that important).