I need to walk the file system of a set of CIFS shares presented natively by a NetApp filer. The shares contain approximately 2 TB of data in hundreds of millions of discrete files. The aim is to collect file metadata, specifically the last-write timestamp, name, and size, and, where files have exactly matching sizes, a hash of the contents. The purpose is to detect duplicate files even if the name or other metadata differs.
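To make the size-then-hash idea concrete, here is a minimal single-threaded sketch in Python rather than .NET. It assumes the share is reachable as an ordinary path (for example a mapped drive or UNC path); the names `file_sha256` and `find_duplicates` and the choice of SHA-256 are illustrative assumptions, not part of the existing application or the NetApp SDK.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of paths under root whose contents are identical."""
    # Pass 1: metadata only. Group every file by its size.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.stat(path).st_size].append(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it

    # Pass 2: read contents only for files that share a size with another file.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, so no hash is needed
        by_hash = defaultdict(list)
        for path in paths:
            try:
                by_hash[file_sha256(path)].append(path)
            except OSError:
                continue
        duplicates.extend(
            sorted(group) for group in by_hash.values() if len(group) > 1
        )
    return duplicates
```

Only files whose size collides with another file are ever read in full, so most of the scan is metadata traffic.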
Currently I do this using a single-threaded .NET application I wrote, which uses basic file system APIs to collect the information as a CIFS client. Obviously the performance isn't fantastic, and a single full scan can take up to three days to complete. Before rewriting the application with threading in mind, I wanted to query the hive mind about the file API in the NetApp SDK. I see similar questions were asked, but left unanswered, in this thread: http://communities.netapp.com/message/9710#9710.
Do I risk killing my filer (either by consuming critical stack/memory on the controllers or by simply maxing out the CPUs) by making multiple parallel calls to the file API (say file-list-directory-iter-next and/or file-get-fingerprint)? Are there any best practices for using these API calls? Should I restrict myself to the file-list-directory-iter-xxx calls and use native Windows API calls via CIFS to generate the hashes? Was this the intended purpose of this API, or would I be abusing it by using it in this manner?
I would look at the SDK fastfilewalk examples in <SDK>/src/sample/Data_ONTAP/C/fastfilewalk. From README.txt:
Fast file walking from Solaris for Data ONTAP
While the sample fast file walking code for Solaris and NT is different (it was
written by different developers and uses different threading libraries), the
spirit of the two remains much the same. Directories are placed on a queue as
they're encountered. Threads repeatedly dequeue a directory name, walk it, and
process files they encounter. If any of the files are directories, they're
placed on the queue for processing by the next thread available.
This code assumes fairly normal directory structures, i.e. enough directories to
keep all the threads busy. Very flat directory structures, or directories with
millions of files in them, will mandate a slightly different approach, which is
left as an exercise for the reader.
Informal tests at Netapp on a Sun Ultra Enterprise with a gigabit LAN connection
to an F880 filer imply that performance gains tail off at about five threads.
The filer clocked between 8 and 10,000 NFS ops per second at maximum traffic
levels. Initial runs on small filesystems take much longer, due both to NFS
caching and to the filer's buffer cache which tends to have all the inodes in
memory after the first run. On very large filesystems, this won't be the case,
and thread use needs to be determined dynamically depending on the results of
earlier runs. A writeup of some early performance studies is here.
The sample code #DEFINEs INLINE and STATIC to be inline and static respectively.
The difference in performance between this, and simply defining the terms as
blanks, was undetectable, given the wide swings in results that happen because
of other factors.
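The queue-based approach the README describes can be sketched roughly as follows. This is not the SDK sample (which is written in C); it is a minimal Python illustration that assumes the share is reachable as an ordinary path. The names `walk_parallel` and `handle_file` are hypothetical, and the default of five threads simply echoes the README's informal observation and would need tuning per environment.

```python
import os
import queue
import threading


def walk_parallel(root, handle_file, num_threads=5):
    """Walk root with worker threads that share one queue of directories.

    handle_file(path, stat_result) is called from several threads at once,
    so it must be thread-safe and should not raise.
    """
    dirs = queue.Queue()
    dirs.put(root)

    def worker():
        while True:
            path = dirs.get()
            if path is None:  # sentinel: the walk is finished
                dirs.task_done()
                return
            try:
                with os.scandir(path) as entries:
                    for entry in entries:
                        try:
                            if entry.is_dir(follow_symlinks=False):
                                # Queue the subdirectory for the next free thread.
                                dirs.put(entry.path)
                            elif entry.is_file(follow_symlinks=False):
                                handle_file(
                                    entry.path,
                                    entry.stat(follow_symlinks=False),
                                )
                        except OSError:
                            continue  # entry vanished or is unreadable
            except OSError:
                pass  # unreadable directory; skip it
            finally:
                dirs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    dirs.join()  # returns once every queued directory has been processed
    for _ in threads:
        dirs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
```

A caller might collect sizes and last-write times with something like `walk_parallel(root, lambda p, st: results.append((p, st.st_size, st.st_mtime)))`, where `results` is a plain list. As the README warns, a very flat tree or a single directory with millions of entries gives the queue little to distribute, so this pattern would need adapting there.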
In a quick look at the code, I did not see any API calls.
All the file- APIs were created for low usage. We expected high usage to go through the file protocols. I do not know of any performance studies done using the
- Rick -