SUMMARY: 2 million small files

From: Becki.Kain@galegroup.com
Date: Thu Jan 20 2000 - 07:52:01 CST


My original question:
> I'm looking for a recommendation. I have a set of clients that are going
> to have about 2 million small files (like 1 KB in size or less) and I
> need to make a recommendation of how to organize them in directories.
> Ideas? This is under Solaris 7.

Most of the suggestions were to break the files up into a reasonable number
of directories using some sort of sorting scheme. I can't newfs the box
because the data is actually going to sit on an MTI Vivant and be accessed
by the Solaris 7 box via NFS; I realised I wasn't clear about that in the
original question. Because the answers were somewhat different, I'm
enclosing them all.

You are going to have to newfs the partition and change the number of
inodes. Every file needs at least one inode, and there is a fixed number of
them in the file system.

michael.stapleton@learnix.com
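
(For reference only: on a local UFS partition the inode density is set at
newfs time with the -i option, which gives the number of data bytes per
inode. The device name and value below are illustrative, not from the
original posts.)

# illustrative only -- roughly one inode per 1 KB of data, for very small files
newfs -i 1024 /dev/rdsk/c0t1d0s6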

---------------------------------------------------------------------------
First off - do *not* put them all in one directory!!

My suggestion is to create 2048 directories with 1024 files apiece. If
there is some good way to partition them (e.g. decide which file goes
where) to keep the spread even, that would be best. You might even
want to think about 2 levels deep of directory (100 directories with
100 directories apiece with 200 files in each of those) if you are
doing relative lookups.

That keeps the tree pretty flat and the individual directories a reasonable
size.
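
(A hedged sketch of a two-level layout along those lines, assuming numeric
bucket names; the counts and base path are illustrative:)

# create 100 top-level buckets, each with 100 sub-buckets
i=0
while [ $i -lt 100 ]; do
    j=0
    while [ $j -lt 100 ]; do
        mkdir -p /data/$i/$j
        j=`expr $j + 1`
    done
    i=`expr $i + 1`
done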

With UFS, the time to search a directory for an entry grows linearly with
the number of entries. VxFS hashes directory entries, so it is better on
search time.

The DNLC should be expanded (ncsize) so that you're keeping them
in core as much as possible to avoid that search.
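
(On Solaris the DNLC is normally enlarged via an /etc/system entry followed
by a reboot; the value below is only an example, not a figure from the
list:)

* /etc/system entry -- enlarge the directory name lookup cache
set ncsize=65536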

With both VxFS and UFS, the time to create a new file grows linearly with
the number of entries, so keeping the number of entries reasonable is a Good
Thing. Makes ls a whole lot less painful too :-)

kevin@joltin.com
---------------------------------------------------------------------------
mikejd@bnpcn.com asked a bunch of questions which our client answered
privately :-)
---------------------------------------------------------------------------
To me it would depend on their names and/or what they're going to be used
for.

Perhaps by first letter or number (assuming they spread out... you wouldn't
want 1.5M "a"s, 10 "z"s, and 50,000 "2"s).

If some sort of "e-" stuff, perhaps by region - southeast, northeast,
etc. Perhaps by company-defined regions.

Perhaps by how often they are going to be used.

jford@tusc.net
---------------------------------------------------------------------------

I would hash them using some semi-intelligent algorithm to choose the
hash. I say semi-intelligent because I'm sure there is bound to be a more
efficient binary hash algorithm out there, but it's normally non-intuitive
because your directory structure ends up being something like /0xFD/665E/
instead of /m/e/0/5/me100005/.
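
(A hedged illustration of that idea, using cksum as a stand-in for a real
hash; the bucket counts are made up and the sample name is the one above:)

# derive a two-level bucket path from a checksum of the file name
f=me100005
n=`echo $f | cksum | awk '{print $1}'`
d1=`expr $n % 256`
d2=`expr $n / 256 % 256`
echo $d1/$d2/$f    # prints a path of the form <bucket>/<bucket>/me100005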

Try running the following on your data set... first, put your dataset into
a single file so you can analyze it. If all the files are in one
directory, you could do 'ls -1 > datafile', then:

cut -b1 datafile | sort | uniq -c

will produce a long list of distributions on the first letter.

If the distribution is too normal, then you can further:

cut -b1,2 < datafile | sort | uniq -c | sort -n -r | head -10

And this will produce a distribution for the first 2 letters, and so on.
kevin@interq.or.jp
---------------------------------------------------------------------------

I would do it alphabetically or numerically by file name.
Make the directory names something like a1-1000, a1001-2000, ...,
b1-1000, ..., b9001-10000, etc. This will break it up into something
a little more manageable. Maybe you should consider using a database
like Oracle since the files are so small. Each file would be
a record in the database... Just a thought. Good luck.
Stephen.Michaels@jhaupl.edu
---------------------------------------------------------------------------

Having so many small files poses a big problem when several users/apps
try to access/create files in the same directory. Try to divide the 2
million files into subdirectories, starting with a-z, then aa-az .. za-zz:

dir
 |-- a
 |    |-- aa
 |    |-- ..
 |    `-- az
 |-- ..
 `-- z
      |-- za
      |-- ..
      `-- zz
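
(A hedged sketch of that first-letter / first-two-letter placement; the
file name is made up:)

# move a file into <first letter>/<first two letters>/
f=sample001
d1=`echo $f | cut -c1`
d2=`echo $f | cut -c1-2`
mkdir -p $d1/$d2
mv $f $d1/$d2/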

esilva@netcom.com

---------------------------------------------------------------------------

Becki, I don't have any particular ideas on how to organize them, but I
have a suggestion on thresholds....

I think the Bourne shell has a limit - either 512 or 1024 (more likely) -
on how many parameters can be on a command line. Given that, I would not
put more than 500 files in one directory, giving you a margin of about 11
command-line arguments. ($0, the first parameter to a script, is the
command used to run it.)

srothenb@montefiore.org

---------------------------------------------------------------------------

First, use subdirectories if at all possible. Use something like
/a/aaaa.xxx and /z/zzzzz.xxx to organize the files. That will reduce the
number of files per directory by a factor of 26 (or more, depending on
whether you use upper case).

The problem is that when you get so many files per directory, the directory
itself gets huge and access times get lousy. Reducing the size helps a LOT.

Also, I suggest you use the Veritas File System (VxFS); it will make a big
difference in overall performance.

marc.newman@chase.com

---------------------------------------------------------------------------


