[WBEL-users] OT: Fwd: Re: Booting linux host, getting error message
(SOLVED)
James Knowles
jamesk@ifm-services.com
Sun, 15 Aug 2004 14:49:34 -0600
OK. I think I understand where you're coming from better.
What you're interpreting as "the correct behavior of standard
filesystems" appears to be rooted in observing side effects of several
layers of ideas, along with some confusion about terminology.
Your conclusions are logical and mostly correct.
The following glosses over some things (and I may have goofed up in
places)....
1) A filesystem is nothing more than a data structure, normally a
single-rooted tree.
2) A filesystem as a computer science construct has no connection with
any storage medium, be it magnetic (disk or tape), solid-state
(persistent or volatile), or even an object database. [I'll get to
implementation later.]
In other words, view a filesystem as a data structure completely
separate from its storage medium (what you call "backing store"). If you
view the two as being tied together on a theory level, then confusion
can creep in.
Why do media come into play?
In an ideal computer, there would be unlimited non-volatile RAM that
operated at the same speed as the CPU. In this situation all data would
always be available immediately. Shutting down the computer would not
cause data loss. Unfortunately this is the world of computer theory, the
discrete mathematical underpinnings of computers.
When we cross into the "real" world engineering becomes dominant. Not
only is RAM volatile, but it's freaking expensive, especially on-chip
RAM. L1 cache typically runs at full speed, but it's not cost effective
in large quantities. L2 cache is slower and cheaper, but still not cost
effective in large quantities. Moving off-chip we find ordinary DRAM
which is even slower -- but also cheaper.
But we still need non-volatile storage so we're not using toggle
switches to manually key in stuff every time we start the computer. Of
the different technologies, magnetic media have dominated because of the
cost/benefit ratio. However, they are many orders of magnitude slower
than the CPU. We measure instruction execution in nanoseconds (or
faster), but measure hard drive access in milliseconds -- a million
times slower. Even when data is flooding in, it's still tens or hundreds
of times slower than RAM. That's a lot of virtual thumb-twiddling for
the CPU.
Because it's a huge issue that affects overall performance, the kernel
does provide some help in the form of caching. That's a different topic.
OK. So what does that have to do with the Unix system calls?
Unix has a standard view of all things [with few exceptions], including
goofy things such as serial ports. The Unix paradigm sees everything as
nodes in a uniform single-rooted tree structure. This data structure is
the Unix filesystem, and is entirely conceptual. Programs access the
computer through a uniform set of system calls that access the edges and
nodes of this conceptual tree.
Not all of these resources are actually files on a spinning magnetic
disk. The /proc filesystem does not store files, but simply presents
kernel information using a uniform subset of Unix system calls.
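As a quick illustration (a sketch assuming a Linux system with /proc mounted; it does nothing elsewhere), note that a /proc entry reports a size of zero yet still yields data when read -- there are no stored bytes behind it, only live kernel state exposed through the ordinary file calls:

```python
import os
import sys

# Sketch assuming a Linux system with /proc mounted; skipped elsewhere.
if sys.platform.startswith("linux") and os.path.exists("/proc/uptime"):
    path = "/proc/uptime"
    # stat() reports zero bytes: there is no stored data behind the node.
    print(os.stat(path).st_size)        # 0
    # Yet reading it yields live kernel state through ordinary file calls.
    with open(path) as f:
        print(len(f.read()) > 0)        # True
```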
When we marry a data structure with the Unix system calls, we have a
filesystem implementation. It doesn't matter whether there are really
any magnetic bits on a spinning disk. The filesystem is simply data
accessible through the same subset of system calls because that's the
paradigm that the Unix designers imposed.
That's it. That's a filesystem. Nothing more, nothing less. There is no
requirement for any medium.
OK. Stop jumping around. Put these pieces together.
Now we get into implementing a filesystem for generic data storage,
where we need to start dealing with the "real" world of physical storage.
A filesystem in its purest form simply stores information in RAM (by
virtue of it being a data structure). It does not use any other medium.
When data needs to be stored, a new node in the tree is created to hold
the data (occupying RAM). When information is not needed, that node in
the tree is removed (freeing the RAM). This is very simple programming
stuff. Create an object. Free it.
This is what tmpfs does. It's about the simplest data storage filesystem
that one can write because one only needs to deal with a world close to
pure theory.
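A toy model of such a filesystem (purely illustrative; no relation to the real tmpfs code) is just a single-rooted tree of objects living in ordinary memory. Creating a file allocates an object; deleting it drops the reference and the RAM is freed:

```python
class Node:
    """A node in a single-rooted filesystem tree, living purely in RAM."""
    def __init__(self, name):
        self.name = name
        self.children = {}   # directory entries
        self.data = b""      # file contents, held in ordinary memory

root = Node("/")

def create(parent, name, data=b""):
    node = Node(name)          # allocate RAM for the node
    node.data = data
    parent.children[name] = node
    return node

def remove(parent, name):
    del parent.children[name]  # drop the reference; the RAM is freed

create(root, "hello.txt", b"hello world")
print(sorted(root.children))   # ['hello.txt']
remove(root, "hello.txt")
print(sorted(root.children))   # []
```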
Here are the misconceptions:
1) tmpfs is not "fuzz[ing] over medium and filesystem."
2) "[T]hey [filesystem and media] seem to be inextricably tied together."
There is no fuzzing because there is no medium. Filesystems are
independent of media. (See /proc example.)
If you can get that, the rest should be straightforward.
Block Devices
When one wants to bind magnetic media to a filesystem, then we have to
make some adjustments. Again, the software presents an interface in
compliance with the Unix system calls. However, it uses a block-special
device. The block-special devices present a uniform access methodology
by implementing certain Unix system calls.
These have certain characteristics:
1) Slow!
2) Fixed size
3) Organized and only accessible in uniform blocks (e.g. 512 bytes)
[leaving off weird exceptions]
4) Non-volatile
In implementing the filesystem, we have to accept restrictions and
conditions imposed by the rules. The numbers correspond to the list above.
1) We can pull several dirty tricks to speed things up.
2) We can simplify programming by exploiting this.
3) Forces us to organize data in certain ways.
4) This is what we want, but allows some dirty tricks that cause
problems elsewhere.
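Characteristics 2 and 3 can be sketched in a few lines (a toy model, nothing more; real block devices are kernel objects):

```python
class ToyBlockDevice:
    """Toy model of a block-special device: fixed size, whole blocks only.
    (Illustrative only; real block devices live inside the kernel.)"""
    BLOCK_SIZE = 512

    def __init__(self, num_blocks):
        # Characteristic 2: the size is fixed at creation.
        self.storage = bytearray(num_blocks * self.BLOCK_SIZE)

    def read_block(self, n):
        # Characteristic 3: access only in uniform 512-byte blocks.
        off = n * self.BLOCK_SIZE
        return bytes(self.storage[off:off + self.BLOCK_SIZE])

    def write_block(self, n, data):
        assert len(data) == self.BLOCK_SIZE, "whole blocks only"
        off = n * self.BLOCK_SIZE
        self.storage[off:off + self.BLOCK_SIZE] = data

dev = ToyBlockDevice(1024)
dev.write_block(312, b"x" * 512)
print(dev.read_block(312)[:4])   # b'xxxx'
```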
The different filesystems for block devices (ext2, ext3, reiserfs, fat,
even ntfs or the Amiga ffs) have common ideas mainly stemming from (2),
(3) and (4). Because there's more than one way to skin a cat, each
tackles the problem of how to divvy up the space and organize the data
in a different way. For example, some are better optimized for small
files, others large files. Others are optimized for space, or speed.
Some handle file fragmentation better than others.
The common denominator is that they all require a block-special device
to talk to.
Loopback devices
The loopback device is a way to trick software that talks to
block-special devices into thinking that a file is really a block
device. This is good and this is bad.
The great thing is that one can build a disk image in a file, then
record it to the medium. It's also handy for reading CD-ROM ISOs. In
short, loopback devices are just plain handy.
Because the files are typically stored on hard drive, it's normally no
big deal.
But there is a problem when
1) the loopback file is stored on some sort of RAM disk (including
tmpfs), and
2) one deletes files inside it.
With a hard disk, one can assume that the block device that logically
wraps it has a fixed number of blocks. These blocks always exist. (I'm
ignoring bad blocks here.) Because the blocks always exist, when we talk
of allocating and freeing blocks, it's not a physical action.
If we "free" block 312, it doesn't suddenly vanish. It's still a
physical space on a disk platter with a magnetic pattern. When we "free"
a block, we mean it in computer science terminology. We alter bits
elsewhere in the data structure to remove any prior references to the
block. We simply ignore the data that was previously written to that
region on the disk platter because regardless of what bits are in block
312, it will always exist. Writing zeroes does not alter that fact, and
just wastes time.
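The bookkeeping can be pictured as a free-block bitmap (a simplification in the spirit of what ext2-style filesystems keep on disk): freeing block 312 flips a bit elsewhere in the data structure, while the block's physical contents are untouched:

```python
# A free-block bitmap sketch, loosely in the spirit of ext2-style
# filesystems. (Simplified; real on-disk layouts are more involved.)
num_blocks = 1024
allocated = [False] * num_blocks
disk = {312: b"old file data"}   # stand-in for the physical platter

# Allocating block 312 sets its bit.
allocated[312] = True

# "Freeing" it only clears the bit elsewhere in the data structure...
allocated[312] = False

# ...the physical block and its old bits still exist, simply ignored now.
print(allocated[312], disk[312])   # False b'old file data'
```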
When we carry these assumptions over to a sparse file on a RAM disk,
trouble happens. The filesystem implementation (e.g. ext3) does nothing
different. It talks to a block device and assumes it's a hard disk with
the above characteristics.
Scenario #1:
Take the following actions:
1) A file is created in the filesystem using the loopback device, say
using ext3fs. This means that bits are altered in the data structure,
and bits are written to block 312 in the loopback device. The filesystem
assumes this is a physical disk. It's ignorant of the loopback device.
2) The loopback device writes 512 bytes of data (one block) to offset
159,744 (312 * 512) of the underlying sparse file. This causes the
sparse file to reserve space in /its/ underlying filesystem... say tmpfs.
3) The loopback file is on tmpfs. Because new space is being reserved,
tmpfs allocates RAM to hold the data.
4) The file is deleted in the filesystem on top of the loopback device.
Ext3fs assumes that the block device it is talking to is a hard drive.
It updates the internal data structure, BUT DOES NOTHING WITH BLOCK 312.
See above for why.
5) The loopback device RECEIVES NO INSTRUCTIONS regarding offsets
159,744 through 160,255. It DOES NOT ALTER the file holding disk image.
6) tmpfs, which holds the file holding the disk image, RECEIVES NO
INSTRUCTIONS.
Thus in this scenario one has deleted a file in the loopback device, but
no memory was freed by tmpfs because it was never told to delete anything.
This is a side-effect of several layers of assumptions being violated by
the layer beneath.
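The offset arithmetic and the sparse-file behavior in steps 2-3 can be reproduced directly (a sketch against an ordinary temporary file; on tmpfs the allocated blocks below would be pages of RAM):

```python
import os
import tempfile

BLOCK = 512
block_no = 312
offset = block_no * BLOCK        # 312 * 512 = 159744

fd, path = tempfile.mkstemp()
try:
    # Step 2: write one block at offset 159,744, as the loopback
    # device would against the underlying file.
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, b"\xaa" * BLOCK)
    os.fsync(fd)

    st = os.stat(path)
    # The apparent size spans the hole before the data...
    print(st.st_size)            # 160256
    # ...while st_blocks counts only what was actually allocated
    # (much less than 160256 bytes on a sparse-file-capable fs).
    print(st.st_blocks * 512)
finally:
    os.close(fd)
    os.unlink(path)
```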
Scenario #2:
Assume the same set of actions above, but assume that in step #4 the
implementing filesystem was designed to always write out zeroes to
deleted files because the designer values security more than speed.
Let's call this the "foo filesystem": foofs.
1) same as above
2) same as above
3) same as above
4) The file is deleted in the filesystem on top of the loopback device.
Foofs updates the internal data structure, and writes 512 zeroes to
block 312.
5) The loopback device writes 512 zeroes to the file starting at offset
159,744.
6) Because the file under the loopback device is a sparse file, the
write DOES NOT FREE already-allocated blocks. The zeroes are written,
but the space stays allocated.
7) tmpfs, which holds the file holding the disk image, RECEIVES NO
INSTRUCTIONS.
Thus in this scenario one has deleted a file in the loopback device, but
no memory was freed by tmpfs because it was never told to delete anything.
This is a side-effect of several layers of assumptions being violated by
the layer beneath.
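Step 6 is easy to verify (again a sketch against an ordinary temporary file; the same holds for a file on tmpfs): overwriting data with zeroes is still just a write, so the allocated block count does not shrink:

```python
import os
import tempfile

BLOCK = 512
offset = 312 * BLOCK

fd, path = tempfile.mkstemp()
try:
    # Steps 1-3: real data lands at offset 159,744; space is allocated.
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, b"\xaa" * BLOCK)
    os.fsync(fd)
    blocks_before = os.stat(path).st_blocks

    # Steps 4-5: foofs "securely deletes" by writing zeroes over the block.
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, b"\x00" * BLOCK)
    os.fsync(fd)
    blocks_after = os.stat(path).st_blocks

    # Writing zeroes is still a write: no space is given back.
    print(blocks_after >= blocks_before)   # True
finally:
    os.close(fd)
    os.unlink(path)
```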
- - - - - - -
>You CAN mount other standard filesystems from a file as a medium via loopback; AFAIK tmpfs will only use your VM as a medium.
>
>
This is because what you call "standard" filesystems only use
block-special devices. The loopback device sits on top of the file and
translates block instructions into file operations.
So:
1) tmpfs only uses VM as a medium
2) "standard" filesystems (designed for hard disks) only use block
devices as a medium
Do not disparage tmpfs because it's limited to one medium. So are
ext2/3, reiserfs, etc. They are limited to one medium. The loopback
device simply tricks them into thinking it's a block-special device.
>it's right for the special requirements of UML, but it's not the correct behaviour of standard filesystems.
>
>
Actually, just as the loopback is allowing block-oriented filesystems to
be used in ways not intended, /dev/anon is doing the same type of thing.
Both are correct behaviour. /dev/anon is simply compensating for
problems inherent in applying a block-oriented filesystem to a
non-block medium
(VM).
You could argue that the "standard" filesystems are broken in that
regard. They're not really "broken" but "optimized" for a specific use
(being violated by UML).
>My layman's view
>is that tmpfs is able to release space back to the backing store not
>because it is actually VM underlying, but because it is somehow able
>to rearrange itself or indicate to the store that certain portions are
>now unused.
>
tmpfs is being told explicitly to free things, and because it's using VM
it can just go ahead and do so.
This gets into the specifics of the tree hierarchy that represents a
filesystem. tmpfs is able to free memory because it's being told to
remove items from the tree. Freeing items from the tree gives it
explicit knowledge that memory must be freed. There's no magic here.
>If we could somehow put another filesystem on the same
>backing store, the filesystem still has to indicate which portions are
>now safe to be reclaimed.
>
>
You're correct here. The filesystem knows this explicitly, but the
details of the underlying medium allow it to choose how to behave. A
hard drive-oriented filesystem is just "lazy" because it is allowed to
be lazy. tmpfs is burning up RAM, so it can't be lazy.
>Put another way, I _believe_ (but could be wrong) that ramfs presents
>a form of virtual memory medium as a block device on top of which you
>can layer standard filesystems and that deleting files there does not
>return space to the VM because the filesystem is still holding on to
>it...
>
Very close. It doesn't return space to the VM because the filesystem
(e.g. ext3) was written for block devices where blocks are assumed to
always exist.
In other words, it's not that ext3/reiser/etc. *hold on to* the space,
but their assumptions *preclude* the notion that space can be freed.
>If a hypothetical filesystem is created that can indicate to
>the backing store that an area should be treated as being sparse, it
>could become a loopback version of tmpfs
>
>
Essentially true.
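One way to picture that hypothetical filesystem (purely a sketch; the store and its hole-punching call are invented here, not a real kernel interface) is a backing store that accepts an explicit "this range is now a hole" message, at which point it can actually release the space -- exactly the message an ordinary block device has no way to carry:

```python
class HoleAwareStore:
    """A toy backing store keyed by block number; a missing key is a hole.
    (Invented for illustration; not a real kernel interface.)"""
    BLOCK = 512

    def __init__(self):
        self.blocks = {}        # only non-hole blocks consume memory

    def write(self, block_no, data):
        self.blocks[block_no] = data

    def punch_hole(self, block_no):
        # The extra call an ordinary block device lacks: the filesystem
        # tells the store a block is unused, and the space really goes away.
        self.blocks.pop(block_no, None)

    def used_bytes(self):
        return len(self.blocks) * self.BLOCK

store = HoleAwareStore()
store.write(312, b"\xaa" * 512)
print(store.used_bytes())   # 512
store.punch_hole(312)       # the "delete" finally reaches the backing store
print(store.used_bytes())   # 0
```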