[WBEL-users] OT: Fwd: Re: Booting linux host, getting error message (SOLVED)

James Knowles jamesk@ifm-services.com
Sun, 15 Aug 2004 14:49:34 -0600


OK. I think I understand where you're coming from better.

What you're interpreting as "the correct behavior of standard 
filesystems" appears to be rooted in observing side effects of several 
layers of ideas, along with some confusion about terminology.

Your conclusions are logical and mostly correct.

The following glosses over some things (and I may have goofed up in 
places)....

1) A filesystem is nothing more than a data structure, normally a 
single-rooted tree.
2) A filesystem as a computer science construct has no connection with 
any storage medium, be it magnetic (disk or tape), solid-state 
(persistent or volatile), or even an object database. [I'll get to 
implementation later.]

In other words, view a filesystem as a data structure completely 
separate from its storage medium (what you call "backing store"). If you 
view the two as being tied together on a theory level, then confusion 
can creep in.

Why do media come into play?

In an ideal computer, there would be unlimited non-volatile RAM that 
operated at the same speed as the CPU. In this situation all data would 
always be available immediately. Shutting down the computer would not 
cause data loss. Unfortunately, this is the world of computer theory, 
the discrete mathematical underpinnings of computers.

When we cross into the "real" world, engineering becomes dominant. Not 
only is RAM volatile, but it's freaking expensive, especially on-chip 
RAM. L1 cache typically runs at full speed, but it's not cost effective 
in large quantities. L2 cache is slower and cheaper, but still not cost 
effective in large quantities. Moving off-chip we find ordinary DRAM, 
which is even slower -- but also cheaper.

But we still need non-volatile storage so we're not using toggle 
switches to manually key in stuff every time we start the computer. Of 
the different technologies, magnetic media have dominated because of the 
cost/benefit ratio. However, they are many orders of magnitude slower 
than the CPU. We measure instruction execution in nanoseconds (or 
faster), but measure hard drive access in milliseconds -- a million 
times slower. Even when data is flooding in, it's still tens or hundreds 
of times slower than RAM. That's a lot of virtual thumb-twiddling for 
the CPU.

Because it's a huge issue that affects overall performance, the kernel 
does provide some help in the form of caching. That's a different topic.

OK. So what does that have to do with the Unix system calls?

Unix has a standard view of all things [with few exceptions], including 
goofy things such as serial ports. The Unix paradigm sees everything as 
nodes in a uniform single-rooted tree structure. This data structure is 
the Unix filesystem, and is entirely conceptual. Programs access the 
computer through a uniform set of system calls that access the edges and 
nodes of this conceptual tree.

Not all of these resources are actually files on a spinning magnetic 
disk. The /proc filesystem does not store files, but simply presents 
kernel information using a uniform subset of Unix system calls.
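
To make that concrete, here's a minimal sketch (my illustration, not 
anything lifted from the kernel) that reads kernel data through the 
ordinary file calls. Nothing below ever touches a disk:

    /* Read the kernel's uptime through ordinary file system calls.
       /proc stores nothing on disk; the kernel fabricates the "file"
       contents at read() time. */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        char buf[128];
        int fd = open("/proc/uptime", O_RDONLY);
        if (fd < 0)
            return 1;
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("uptime: %s", buf);
        }
        close(fd);
        return 0;
    }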

When we marry a data structure with the Unix system calls, we have a 
filesystem implementation. It doesn't matter whether there are really 
any magnetic bits on a spinning disk. The filesystem is simply data 
accessible through the same subset of system calls because that's the 
paradigm that the Unix designers imposed.

That's it. That's a filesystem. Nothing more, nothing less. There is no 
requirement for any medium.

OK. Stop jumping around. Put these pieces together.

Now we get into implementing a filesystem for generic data storage, 
where we need to start dealing with the "real" world of physical storage.

A filesystem in its purest form simply stores information in RAM (by 
virtue of it being a data structure). It does not use any other medium. 
When data needs to be stored, a new node in the tree is created to hold 
the data (occupying RAM). When information is not needed, that node in 
the tree is removed (freeing the RAM). This is very simple programming 
stuff. Create an object. Free it.
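
In code, the core of such a filesystem is nothing more than tree-node 
bookkeeping. Here's a toy sketch of my own (the names create_node and 
remove_node are mine; this is not tmpfs's actual internals):

    /* Toy in-memory "filesystem": creating a file is malloc,
       deleting one is free.  No medium anywhere. */
    #include <stdlib.h>
    #include <string.h>

    struct node {
        char name[64];
        char *data;             /* file contents live in RAM */
        size_t size;
        struct node *child;     /* first child, if a directory */
        struct node *next;      /* next sibling */
    };

    struct node *create_node(struct node *parent, const char *name)
    {
        struct node *n = calloc(1, sizeof(*n));
        if (!n)
            return NULL;
        strncpy(n->name, name, sizeof(n->name) - 1);
        n->next = parent->child;   /* link into the tree...      */
        parent->child = n;
        return n;                  /* ...and RAM is "allocated"  */
    }

    void remove_node(struct node *parent, struct node *n)
    {
        struct node **p = &parent->child;
        while (*p && *p != n)      /* find n among the siblings  */
            p = &(*p)->next;
        if (*p)
            *p = n->next;          /* unlink from the tree...    */
        free(n->data);
        free(n);                   /* ...and RAM is given back   */
    }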

This is what tmpfs does. It's about the simplest data storage filesystem 
that one can write because one only needs to deal with a world close to 
pure theory.
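
If you want to play with one, mounting a tmpfs is a single system call 
(a sketch; it needs root, and /mnt/scratch is just an example mount 
point that must already exist):

    /* Mount a 16 MB tmpfs.  Note there is no meaningful device
       argument -- there is no medium. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        if (mount("tmpfs", "/mnt/scratch", "tmpfs", 0, "size=16m")) {
            perror("mount");
            return 1;
        }
        return 0;
    }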

Here are the misconceptions:
1) tmpfs is not "fuzz[ing] over medium and filesystem."
2)  "[T]hey [filesystem and media] seem to be inextricably tied together."

There is no fuzzing because there is no medium. Filesystems are 
independent of media. (See /proc example.)

If you can get that, the rest should be straightforward.

Block Devices

When one wants to bind magnetic media to a filesystem, then we have to 
make some adjustments. Again, the software presents an interface in 
compliance with the Unix system calls; underneath, however, it talks to 
a block-special device. The block-special devices present a uniform 
access methodology by implementing certain Unix system calls.

These have certain characteristics:
1) Slow!
2) Fixed size
3) Organized and only accessible in uniform blocks (e.g. 512 bytes) 
[leaving off weird exceptions]
4) Non-volatile

In implementing the filesystem, we have to accept restrictions and 
conditions imposed by those characteristics. The numbers below 
correspond to the list above.
1) We can pull several dirty tricks to speed things up.
2) We can simplify programming by exploiting this.
3) Forces us to organize data in certain ways.
4) This is what we want, but allows some dirty tricks that cause 
problems elsewhere.

The different filesystems for block devices (ext2, ext3, reiserfs, fat, 
even ntfs or the Amiga ffs) have common ideas mainly stemming from (2) 
(3) and (4). Because there's more than one way to skin a cat, each 
tackles the problem of how to divvy up the space and organize the data 
in a different way. For example, some are better optimized for small 
files, others large files. Others are optimized for space, or speed. 
Some handle file fragmentation better than others.

The common denominator is that they all require a block-special device 
to talk to.
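
To make the access model concrete, here's a sketch of mine that reads 
block 312 straight off a block device (/dev/sdb is a stand-in for 
whatever device you have, and reading it needs root):

    /* Block devices are addressed in fixed-size blocks.  Reading
       block 312 means seeking to 312 * 512 and reading 512 bytes.
       There is no notion of "this block doesn't exist". */
    #define _XOPEN_SOURCE 500
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    #define BLOCK_SIZE 512

    int main(void)
    {
        char buf[BLOCK_SIZE];
        int fd = open("/dev/sdb", O_RDONLY);   /* stand-in device */
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (pread(fd, buf, BLOCK_SIZE, (off_t)312 * BLOCK_SIZE)
                != BLOCK_SIZE) {
            perror("pread");
            return 1;
        }
        close(fd);
        return 0;
    }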

Loopback devices

The loopback device is a way to trick software that talks to 
block-special devices into thinking that a file is really a block 
device. This is good and this is bad.

The great thing is that one can build a disk image in a file, then 
record it to the medium. It's also handy for reading CD-ROM ISOs. The 
bad thing, as we'll see, is that loopback devices are handy enough to 
get used in situations their assumptions don't cover.

Because the files are typically stored on a hard drive, it's normally 
no big deal.

But there is a problem when the loopback files are
1) stored on some sort of RAM disk (including tmpfs), and
2) one deletes files.

With a hard disk, one can assume that the block device that logically 
wraps it has a fixed number of blocks. These blocks always exist. (I'm 
ignoring bad blocks here.) Because the blocks always exist, when we talk 
of allocating and freeing blocks, it's not a physical action.

If we "free" block 312, it doesn't suddenly vanish. It's still a 
physical space on a disk platter with a magnetic pattern. When we "free" 
a block, we mean it in computer science terminology. We alter bits 
elsewhere in the data structure to remove any prior references to the 
block. We simply ignore the data that was previously written to that 
region of the disk platter because, regardless of what bits are in 
block 312, it will always exist. Writing zeroes does not alter that 
fact, and just wastes time.
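
To see just how cheap "freeing" really is: in a typical disk filesystem 
it amounts to clearing one bit in an allocation bitmap. A generic 
sketch of mine, not any particular filesystem's real on-disk format:

    /* "Freeing" block 312 on a disk filesystem: clear its bit in
       the free-block bitmap.  The data in block 312 is untouched;
       it is merely no longer referenced. */
    #include <stdint.h>

    static uint8_t block_bitmap[8192];   /* one bit per block */

    void alloc_block(unsigned long blk)
    {
        block_bitmap[blk / 8] |= (uint8_t)(1u << (blk % 8));
    }

    void free_block(unsigned long blk)
    {
        block_bitmap[blk / 8] &= (uint8_t)~(1u << (blk % 8));
    }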

When we carry these assumptions over to a sparse file on a RAM disk, 
trouble happens. The filesystem implementation (e.g. ext3) does nothing 
different. It talks to a block device and assumes it's a hard disk with 
the above characteristics.

Scenario #1:

Take the following actions:
1) A file is created in the filesystem using the loopback device, say 
using ext3fs. This means that bits are altered in the data structure, 
and bits are written to block 312 in the loopback device. The filesystem 
assumes this is a physical disk. It's ignorant of the loopback device.
2) The loopback device writes 512 bytes of data (one block) to offset 
159,744 (312 * 512) of the underlying sparse file. This causes the 
sparse file to reserve space in /its/ underlying filesystem... say tmpfs.
3) The loopback file is on tmpfs. Because new space is being reserved, 
tmpfs allocates RAM to hold the data.
4) The file is deleted in the filesystem on top of the loopback device. 
Ext3fs assumes that the block device it is talking to is a hard drive. 
It updates the internal data structure, BUT DOES NOTHING WITH BLOCK 312. 
See above for why.
5) The loopback device RECEIVES NO INSTRUCTIONS regarding offsets 
159,744 through 160,255. It DOES NOT ALTER the file holding disk image.
6) tmpfs, which holds the file holding the disk image, RECEIVES NO 
INSTRUCTIONS.

Thus in this scenario one has deleted a file in the loopback device, but 
no memory was freed by tmpfs because it was never told to delete anything.

This is a side-effect of several layers of assumptions being violated by 
the layer beneath.
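
You can watch the sparse-file half of this from userspace. A sketch of 
mine, assuming /dev/shm is a mounted tmpfs:

    /* Demonstrate a sparse file: writing one block at offset
       159,744 allocates space for that region only, not for the
       312 blocks of "hole" before it. */
    #define _XOPEN_SOURCE 500
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    int main(void)
    {
        char block[512] = { 1 };   /* non-zero data */
        struct stat st;
        int fd = open("/dev/shm/image", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return 1;
        pwrite(fd, block, sizeof(block), (off_t)312 * 512);
        fstat(fd, &st);
        /* st_size reports 160,256 bytes, but st_blocks shows that
           far less than that is actually allocated. */
        printf("size=%lld blocks=%lld\n",
               (long long)st.st_size, (long long)st.st_blocks);
        close(fd);
        return 0;
    }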

Scenario #2:

Assume the same set of actions above, but assume that in step #4 the 
implementing filesystem was designed to always write zeroes over 
deleted files because the designer values security more than speed. 
Let's call this the "foo filesystem": foofs.
1) same as above
2) same as above
3) same as above
4) The file is deleted in the filesystem on top of the loopback device. 
Foofs updates the internal data structure, and writes 512 zeroes to 
block 312.
5) The loopback device writes 512 zeroes to the file starting at offset 
159,744.
6) Because the file under the loopback device is a sparse file, writing 
into an already-allocated region never deallocates anything. It writes 
the zeroes but DOES NOT FREE the underlying blocks.
7) tmpfs, which holds the file holding the disk image, RECEIVES NO 
INSTRUCTIONS.

Thus in this scenario one has deleted a file in the loopback device, but 
no memory was freed by tmpfs because it was never told to delete anything.

This is a side-effect of several layers of assumptions being violated by 
the layer beneath.
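
The flip side is just as easy to demonstrate (again a sketch of mine, 
same assumptions as before): overwriting allocated data with zeroes 
frees nothing.

    /* Overwrite an allocated region of a sparse file with zeroes:
       st_blocks is the same before and after. */
    #define _XOPEN_SOURCE 500
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    int main(void)
    {
        char block[512];
        struct stat st;
        int fd = open("/dev/shm/image", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return 1;
        memset(block, 1, sizeof(block));
        pwrite(fd, block, sizeof(block), (off_t)312 * 512);
        fstat(fd, &st);
        printf("after data:   blocks=%lld\n", (long long)st.st_blocks);
        memset(block, 0, sizeof(block));   /* foofs-style zeroing */
        pwrite(fd, block, sizeof(block), (off_t)312 * 512);
        fstat(fd, &st);
        printf("after zeroes: blocks=%lld\n", (long long)st.st_blocks);
        close(fd);
        return 0;
    }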

- - - - - - -


>You CAN mount other standard filesystems from a file as a medium via loopback; AFAIK tmpfs will only use your VM as a medium.

This is because what you call "standard" filesystems only use 
block-special devices. The loopback device sits on top of the file and 
translates block instructions into file operations.
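
That translation is almost embarrassingly simple at its core. A toy 
sketch (the real loop driver adds buffering, locking, partial-transfer 
handling, and so on):

    /* The essence of loopback: a block request against the fake
       device becomes a plain file read/write at blk * block size. */
    #define _XOPEN_SOURCE 500
    #include <unistd.h>

    #define BLOCK_SIZE 512

    ssize_t loop_read_block(int backing_fd, unsigned long blk, void *buf)
    {
        return pread(backing_fd, buf, BLOCK_SIZE,
                     (off_t)blk * BLOCK_SIZE);
    }

    ssize_t loop_write_block(int backing_fd, unsigned long blk,
                             const void *buf)
    {
        return pwrite(backing_fd, buf, BLOCK_SIZE,
                      (off_t)blk * BLOCK_SIZE);
    }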

So:
1) tmpfs only uses VM as a medium
2) "standard" filesystems (designed for hard disks) only use block 
devices as a medium

Do not disparage tmpfs because it's limited to one medium. So are 
ext2/3, reiserfs, etc. -- they too are limited to one medium. The 
loopback device simply tricks them into thinking a file is a 
block-special device.

>it's right for the special requirements of UML, but it's not the correct behaviour of standard filesystems.

Actually, just as the loopback device allows block-oriented filesystems 
to be used in ways not intended, /dev/anon is doing the same type of 
thing. Both are correct behaviour. /dev/anon is simply compensating for 
problems inherent in applying a block-oriented filesystem to a 
non-block medium (VM).

You could argue that the "standard" filesystems are broken in that 
regard. They're not really "broken" but "optimized" for a specific use 
(one whose assumptions UML happens to violate).


>My layman's view
>is that tmpfs is able to release space back to the backing store not
>because it is actually VM underlying, but because it is somehow able
>to rearrange itself or indicate to the store that certain portions are
>now unused.
>

tmpfs is being told explicitly to free things, and because it's using VM 
it can just go ahead and do so.

This gets into the specifics of the tree hierarchy that represents a 
filesystem. tmpfs is able to free memory because it's being told to 
remove items from the tree. Freeing items from the tree gives it 
explicit knowledge that memory must be freed. There's no magic here.
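
Concretely, the "telling" is nothing more exotic than unlink(2). A 
sketch, assuming /dev/shm is a tmpfs mount:

    /* unlink() removes the node from the tree, and tmpfs frees the
       pages right then (given no other links or open descriptors). */
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        char buf[4096] = { 0 };
        int fd = open("/dev/shm/scratch", O_RDWR | O_CREAT, 0600);
        if (fd < 0)
            return 1;
        write(fd, buf, sizeof(buf));   /* tmpfs allocates a page */
        close(fd);
        unlink("/dev/shm/scratch");    /* ...and frees it here   */
        return 0;
    }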

>If we could somehow put another filesystem on the same
>backing store, the filesystem still has to indicate which portions are
>now safe to be reclaimed.

You're correct here. The filesystem knows this explicitly, but the 
details of the underlying medium allow it to choose how to behave. A 
hard drive-oriented filesystem is just "lazy" because it is allowed to 
be lazy. tmpfs is burning up RAM, so it can't be lazy.

>Put another way, I _believe_ (but could be wrong) that ramfs presents
>a form of virtual memory medium as a block device on top of which you
>can layer standard filesystems and that deleting files there does not
>return space to the VM because the filesystem is still holding on to
>it...  

Very close. It doesn't return space to the VM because the filesystem 
(e.g. ext3) was written for block devices where blocks are assumed to 
always exist.

In other words, it's not that ext3/reiser/etc. *hold on to* the space, 
but their assumptions *preclude* the notion that space can be freed.

>If a hypothetical filesystem is created that can indicate to
>the backing store that an area should be treated as being sparse, it
>could become a loopback version of tmpfs

Essentially true.