SUMMARY - Problem with Sun Drive Array/RAID 5, AGAIN

From: Bushman, Gonzo (BUSHMAN@comswsys.tinkernet.af.mil)
Date: Thu Feb 15 1996 - 16:05:00 CST


I received four responses, and one of them was correct. First, I want to
thank the four people who responded:

mrs@cadem.mc.xerox.com
rmk@tif623.ed.ray.com Rick Kelly
ken.dickey@acsacs.com Ken Dickey
koen@ciminko.be Koen Peeters

Again, thanks for the help.

The one correct answer explained it best, along with a trick that I
didn't know:

>Maybe some process on your system still has the deleted file open.
>In that case the file will still exist on the filesystem, although it
>no longer has an entry in any directory. As soon as your program
>closes the file, the file will disappear from the filesystem.
>
>Some programmers use this trick to create unnamed temporary files:
>open a new file with a bogus name and immediately delete the file.
>You can then do all your temporary file business without ever having
>to worry about deleting the temporary files.
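
Here is a minimal sketch of that deleted-but-still-open effect in Bourne
shell (using /u01 from my original message below as the example
filesystem; the scratch file name and sizes are just placeholders). As
long as some descriptor still refers to the unlinked file, df keeps
counting its blocks as used, while du, which only walks the directory
tree, no longer sees them:

#!/bin/sh
# An open-but-unlinked file keeps its blocks allocated until the last
# descriptor referring to it is closed.

dd if=/dev/zero of=/u01/scratch.$$ bs=1024k count=100 2>/dev/null
exec 3< /u01/scratch.$$      # hold the file open on descriptor 3
rm /u01/scratch.$$           # the name is gone...

df -k /u01                   # ...but df still counts the 100 MB as used
du -sk /u01                  # du no longer sees the file at all

exec 3<&-                    # close the descriptor
df -k /u01                   # now the blocks are really freed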

This is exactly what happened. As soon as we "bounced" the database, the
file systems reported the correct numbers.
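
For future reference, a quick way to confirm which processes are holding
space like this is to ask which of them still have files open on the
filesystem. A rough sketch, assuming the Solaris fuser command (the PID
in the ps line is only a placeholder):

#!/bin/sh
# Report the processes that have files open on the /u01 filesystem.
# Any of them could be holding an unlinked file's blocks allocated.
fuser -c /u01

# Look up one of the reported PIDs to see what it is.
ps -fp 1234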

Thanks to all,

Gonzo Bushman, TSgt, USAF
Lead, System Admin Team ---- __o
WWOLS-R Project ---- _\<,
Tinker AFB, OK USA 73145 ------- (_)`(_)
Phone: 405-734-3283 -----------------------
E-Mail: bushman@comswsys.tinkernet.af.mil

 -------------------------------------------------------------------
My original message follows:

Well, I am having problems with my SSA 100 drive arrays, AGAIN. The system
is a 1000E with two SSA 100s (about 30 GB each). Each tray in each array
has either 9 or 10 1.05 GB drives, depending on the tray, and they are
configured using RAID 5. The result is that each array has three
filesystems of about 6-7 GB each (because of the RAID 5 parity overhead,
it is not the expected 9-10 GB).

Today's problem happened something like this: our Oracle DBA deleted a 2 GB
file on a filesystem (called /u01) on the first array. He then went to set
up this supposedly free space as a 2 GB Oracle datafile. Oracle came back
with an error stating that it was unable to create the file. He looked at
the filesystem and realized that something was wrong, since there should
have been about 3 GB of space left (there was about 1 GB free prior to the
deletion of the 2 GB file that started this whole thing). He then came and
got me for an explanation, since I am the lead System Admin.

I ran df and du on this partition and they agreed that there was only about
1GB free. I then started running through all of the directories on this
partition to see where all of the files were. All of the files on this
partition are in one directory and there are only a couple. I added the
file sizes together (using ls -la) and found that there is actually only
about 4.1GB in the files! Where is the other 2.5+GB of space?

I then ran df and du again. du now reports the correct values (see the
listing below), but df still returns bad values. Why would these two
commands disagree now?

Okay, if you are still reading, you probably think I am nuts. Let the
numbers speak for themselves.

#df -k /u01
filesystem              kbytes    used   avail capacity  Mounted on
/dev/vx/dsk/rootdb/u01 6900947 6149028   61829    99%    /u01
#du -k /u01
8        /u01/lost+found
1945     /u01/oradata/INSX/exp
4100002  /u01/oradata/INSX
4100003  /u01/oradata
4100012  /u01

This is the output from these two commands.

To ward off any last thoughts that I may be nuts, let me propose a theory.
I wish somebody could confirm my ideas or dispel them with the real
answer. That is what I am asking of all of you.

My theory:

When you delete (rm) such a large file, it will take the arrays some time
to reconfigure all of the parity, since these arrays are using RAID 5. It
appears that the du command corrected itself after some time, so shouldn't
the df command also correct itself, given more time? I know that these two
commands shouldn't disagree, at least not by this much, but du did report
bad numbers at first and now it doesn't. Remember too that both of the
arrays are using the same optical interface in the server, so there could
be a bottleneck there as well. Although this idea makes sense to me, it
doesn't explain why it is so slow. It took the du command something on the
order of 15-20 minutes to report correct numbers. Is this realistic? I
certainly hope not! If it is, I can work around it, but this seems awfully
slow to me.

Any takers? Can somebody give me a good explanation of what is going on
here?

I would really appreciate any answers since this is holding up the database
portion of our project. I will summarize.

Thanks,

Gonzo Bushman, TSgt, USAF
Lead, Systems Administration Team
WWOLS-R Project
Tinker AFB, OK, USA 73145
Phone: 405-734-3283
E-Mail: bushman@comswsys.tinkernet.af.mil


