However, since we're dealing with a CSV here, we want our lines to be complete. So after filling the buffer we continue reading until we reach a newline, collecting the overflow in a second buffer (extraBuffer). We then create a new file, write both buffer and extraBuffer to it, and repeat until all the bytes have been read from the source file. This is key: the size of each unsorted chunk file controls how much memory we will allocate later when sorting it.
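A minimal sketch of that split step follows; the method name, chunk naming scheme, and chunk size are my own placeholders rather than the article's exact code.

using System.Collections.Generic;
using System.IO;

// Split the source file into chunks of roughly chunkSizeBytes, each
// extended to the next newline so no CSV line is cut in half.
static List<string> SplitFile(string sourcePath, int chunkSizeBytes)
{
    var unsortedFiles = new List<string>();
    using var source = File.OpenRead(sourcePath);
    var buffer = new byte[chunkSizeBytes];
    var extraBuffer = new List<byte>();
    var index = 0;

    while (source.Position < source.Length)
    {
        var bytesRead = source.Read(buffer, 0, buffer.Length);

        // Keep reading byte by byte until the chunk ends on a newline.
        if (buffer[bytesRead - 1] != '\n')
        {
            int b;
            while ((b = source.ReadByte()) != -1)
            {
                extraBuffer.Add((byte)b);
                if (b == '\n') break;
            }
        }

        var chunkPath = $"chunk_{index++}.unsorted";
        using (var chunk = File.OpenWrite(chunkPath))
        {
            chunk.Write(buffer, 0, bytesRead);
            if (extraBuffer.Count > 0)
                chunk.Write(extraBuffer.ToArray(), 0, extraBuffer.Count);
        }
        unsortedFiles.Add(chunkPath);
        extraBuffer.Clear();
    }
    return unsortedFiles;
}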
Each chunk is then sorted in turn: we read the unsorted file (File.OpenRead), sort the lines with Array.Sort, write them to a new sorted file (File.OpenWrite), delete the unsorted file with File.Delete, and add the result to a sortedFiles list. When all chunks have been processed we have a set of sorted chunk files on disk, and the "only" thing left to do is to merge them.
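In sketch form, again with placeholder names (the naming via Path.ChangeExtension and the ordinal comparer are my assumptions):

// Sort one small chunk entirely in memory; delete the unsorted original.
static string SortChunk(string unsortedFilePath)
{
    var sortedFilePath = Path.ChangeExtension(unsortedFilePath, ".sorted");
    var rows = File.ReadAllLines(unsortedFilePath);
    Array.Sort(rows, StringComparer.Ordinal); // comparer is an assumption
    File.WriteAllLines(sortedFilePath, rows);
    File.Delete(unsortedFilePath);            // we remove the unsorted one
    return sortedFilePath;
}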
Merging happens in rounds. We sort the file list numerically by name (int.Parse on the result of Path.GetFileNameWithoutExtension), group the sorted files into chunks of 10, and loop through each group. For every group we create one output stream that will hold the result of merging its 10 files, and we call the Merge method on them (see below). We then remove the temporary extension from the output file, replacing TempFileExtension with an empty string and renaming via File.Move. A sketch of one merge round follows.
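In this sketch the group size, the file naming, the .tmp extension, and the Merge signature are assumptions based on the description above; Merge itself is sketched after the next paragraph.

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// One merge round: combine the sorted files in groups of ten, then strip
// the temporary extension from each finished output file.
static async Task<List<string>> MergeRound(List<string> sortedFiles)
{
    const int FilesPerGroup = 10;
    var outputs = new List<string>();
    var round = 0;
    foreach (var group in sortedFiles.Chunk(FilesPerGroup)) // .NET 6+
    {
        var tmpPath = $"merged_{round++}.sorted.tmp";
        await using (var output = File.OpenWrite(tmpPath))
        {
            await Merge(group, output); // K-way merge, sketched below
        }
        foreach (var file in group)
            File.Delete(file);          // inputs are no longer needed

        var finalPath = tmpPath.Replace(".tmp", string.Empty);
        File.Move(tmpPath, finalPath);  // drop the temporary extension
        outputs.Add(finalPath);
    }
    return outputs;
}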
We then check how many files remain; if more than one, we run the loop again. The Merge method itself is my implementation of the K-way merge mentioned earlier. We open a StreamReader for each input file and read one line from each to populate a rows list, where every row remembers the reader it came from. In a loop we sort the rows list, write the first (smallest) value to the output stream with WriteLineAsync, and then check if there's anything left to read in that row's StreamReader: if so, we read its next line back into the list; if not, that reader is finished.
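A sketch of such a K-way merge; the Row type, the ordinal comparison, and the names are illustrative, not the article's exact code.

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Pairs a pending line with the reader that produced it.
sealed record Row(string Value, StreamReader Reader);

static async Task Merge(IReadOnlyList<string> files, Stream output)
{
    var rows = new List<Row>();

    // Prime the list with the first line of every input file.
    foreach (var reader in files.Select(f => new StreamReader(f)))
    {
        var line = await reader.ReadLineAsync();
        if (line is not null) rows.Add(new Row(line, reader));
    }

    await using var writer = new StreamWriter(output);
    while (rows.Count > 0)
    {
        // Smallest value first. A priority queue would avoid re-sorting
        // the whole list on every iteration.
        rows.Sort((a, b) => string.CompareOrdinal(a.Value, b.Value));
        var next = rows[0];
        rows.RemoveAt(0);
        await writer.WriteLineAsync(next.Value);

        // Refill from the reader we just consumed, if it has more lines.
        var line = await next.Reader.ReadLineAsync();
        if (line is not null) rows.Add(new Row(line, next.Reader));
        else next.Reader.Dispose();
    }
}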
A different way to attack the problem is to split the input recursively by line prefixes rather than by byte count, continuing until every piece is small enough to sort in memory; the final stage is merging the sorted list of sorted files into a single file. For simplicity, let's assume that we run a very ancient computer and can't afford to sort files longer than 10 bytes.
Our input file is larger, so we need to split it. We start the process by splitting on the first character, so after the first step we have three files, one per starting letter: a, m and o. Files m and o can now be sorted in memory, but file a is still too large, so we need to split it further by the first two characters. File av is then less than ten bytes, but file ap is still too large.
So we split once again. Now that we have five small sorted files instead of a single big one, we arrange them in order of their prefixes and merge them together, saving the results into the output file. A sketch of the recursive split is below.
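This rough sketch assumes hypothetical helpers MaxFileSize, SortInMemory and GetChunkPath, and it deliberately reproduces the basic scheme, including the flaw discussed next.

using System;
using System.Collections.Generic;
using System.IO;

// Recursively split a file by its lines' first prefixLength characters
// until every piece fits in memory, then sort each piece.
// MaxFileSize, SortInMemory and GetChunkPath are hypothetical helpers.
static void SplitOrSort(string path, int prefixLength)
{
    if (new FileInfo(path).Length <= MaxFileSize)
    {
        SortInMemory(path); // small enough: load, sort, write back
        return;
    }

    var writers = new Dictionary<string, StreamWriter>();
    using (var reader = new StreamReader(path))
    {
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            var key = line.Substring(0, Math.Min(prefixLength, line.Length));
            if (!writers.TryGetValue(key, out var w))
                writers[key] = w = new StreamWriter(GetChunkPath(key));
            w.WriteLine(line);
        }
    }
    foreach (var w in writers.Values)
        w.Dispose();

    // Recurse on each chunk with a one-character-longer prefix.
    foreach (var key in writers.Keys)
        SplitOrSort(GetChunkPath(key), prefixLength + 1);
}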
Looks good; however, this algorithm has a flaw: suppose the input file contains five gigabytes of the same line repeated many times.
It's easy to see that in this case the algorithm will be stuck in an endless loop trying to split this file over and over again. A similar problem is illustrated below.
Or consider a set of strings, all beginning with 'a', that our memory is not sufficient to sort. Since they all start with 'a', they will all be copied into the same chunk in the first iteration. In the second iteration we are going to split lines by their first two characters, but a line consisting of just 'a' has only one character!
We'd face the same situation in each subsequent iteration. I handle these two problems by diverting strings shorter than the current substring length into a special unsorted file, as sketched below.
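Inside the split loop from the earlier sketch, the guard might look like this (shortLinesWriter is another hypothetical helper):

// Lines shorter than the current prefix length can't be split further;
// divert them into a special file that never needs sorting.
if (line.Length < prefixLength)
{
    shortLinesWriter.WriteLine(line); // this file may grow to any size
    continue;
}
var key = line.Substring(0, prefixLength);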
As we don't need to sort it, that file can be of any size. If only case-sensitive sorting were supported, it wouldn't even be necessary to save the short lines into a file, only to count them: under a case-sensitive comparison all such lines in a given chunk are byte-for-byte identical, so a count would suffice to reproduce them. Incidentally, the algorithm is stable, i.e. it preserves the relative order of equal lines. The class HugeFileSort exposes properties that specify how sorting will be performed, such as the maximum file size that can be sorted in memory (MaxFileSize) and the string comparison to use.
The main method is called simply Sort. It accepts two strings: the input and output file names. If the size of the input file is less than MaxFileSize, its content is simply loaded into memory and sorted there.
Otherwise, the procedure described above is performed. During execution, a temporary directory tmp is created in the current folder. For the sake of demonstration, the final set of temporary files is not deleted and stays in that folder; in production code, please uncomment the two cleanup lines in the Merge method.
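Usage would look roughly like this; the object-initializer style and the concrete size are my assumptions, since the article only states that Sort takes the two file names and that MaxFileSize bounds the in-memory case.

var sorter = new HugeFileSort
{
    MaxFileSize = 100 * 1024 * 1024 // files up to 100 MB are sorted in memory
};
sorter.Sort("input.txt", "output.txt");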
FileChunk and ChunkInfo are auxiliary nested classes. The former corresponds to the new files created on each iteration and is used to write lines into them. The latter holds information about the data that will be merged into the resulting file. During the recursive work of the algorithm, the program populates a sorted dictionary that maps line prefixes to ChunkInfo instances.
The test application requires three command-line arguments: the input file name, the output file name, and the maximum file size in bytes. It performs case-insensitive sorting of UTF-8 files. If you want to break it, pass in a large file with few or no line breaks.
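An invocation might look like this; the executable name and the 10 MB limit are purely illustrative.

HugeFileSort.exe input.txt output.txt 10485760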
As the algorithm uses a standard TextReader to read the file line by line, it is not designed to handle such input data.

The same problem comes up regularly on Stack Overflow: how do I sort a very large file whose lines look like "John Teddy" and "George Clan"? The answers collect several approaches. One: you need an external merge sort to do that, and there is a Java implementation of it that sorts very large files. Another: instead of loading all the data into memory at once, you could read just the sort keys and an index of where each line starts (and possibly its length as well), e.g.
Once you have sorted this index array, you can use RandomAccessFile to read the lines in the order they appear. Note: since you will be randomly hitting the disk instead of using memory, this could be very slow. A typical disk takes 8 ms per random access, so with 10 million lines this would take about a day; that is the absolute worst case, and in memory it would take about 10 seconds. Yet another answer again boils down to performing an external sort split across threads, with paths and settings to be changed according to your machine.
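A C# analogue of the key-plus-offset idea (the original answer is Java and uses RandomAccessFile; here FileStream.Seek plays that role). The sketch assumes single-byte characters and '\n' line endings so character counts equal byte offsets; a robust version would track byte offsets while scanning.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class OffsetSort
{
    static void Main()
    {
        // Build a small in-memory index: (sort key, byte offset, length).
        var index = new List<(string Key, long Offset, int Length)>();
        long offset = 0;
        foreach (var line in File.ReadLines("big.txt")) // path is illustrative
        {
            // Key = whole line here; a real use would extract a smaller key.
            index.Add((line, offset, line.Length));
            offset += line.Length + 1; // assumes '\n' endings, 1 byte per char
        }
        index.Sort((a, b) => string.CompareOrdinal(a.Key, b.Key));

        // Re-read the lines in sorted order via random access.
        using var data = File.OpenRead("big.txt");
        using var output = new StreamWriter("sorted.txt");
        var buffer = new byte[1024];
        foreach (var (_, off, len) in index)
        {
            if (len > buffer.Length) buffer = new byte[len];
            data.Seek(off, SeekOrigin.Begin); // one random access per line
            var read = data.Read(buffer, 0, len);
            output.WriteLine(Encoding.ASCII.GetString(buffer, 0, read));
        }
    }
}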
Then run the program. Your final sorted file will be created with a name starting with op in the fdir path; changing the thread count (8, 16, 32 and so on) also changes the name of the final output file, so look for the op file left after the last merge pass.
Another suggestion is the Java library big-sorter, which can sort very large text or binary files. Yet another: use a SQLite file database, load the data into it, and let it sort and return the results for you.
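A sketch of the SQLite route in C#, using the Microsoft.Data.Sqlite package; the file names, the table layout, and the default BINARY (case-sensitive) ordering are my choices.

using System.IO;
using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=lines.db");
conn.Open();

using (var create = conn.CreateCommand())
{
    create.CommandText = "CREATE TABLE IF NOT EXISTS lines(value TEXT)";
    create.ExecuteNonQuery();
}

// One big transaction makes the inserts dramatically faster.
using (var tx = conn.BeginTransaction())
{
    using var insert = conn.CreateCommand();
    insert.Transaction = tx;
    insert.CommandText = "INSERT INTO lines(value) VALUES ($v)";
    var p = insert.CreateParameter();
    p.ParameterName = "$v";
    insert.Parameters.Add(p);

    foreach (var line in File.ReadLines("big.txt")) // path is illustrative
    {
        p.Value = line;
        insert.ExecuteNonQuery();
    }
    tx.Commit();
}

// Let the database do the sorting.
using var query = conn.CreateCommand();
query.CommandText = "SELECT value FROM lines ORDER BY value";
using var reader = query.ExecuteReader();
using var output = new StreamWriter("sorted.txt");
while (reader.Read())
    output.WriteLine(reader.GetString(0));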
What you need to do is chunk the file in via a stream and process the chunks separately; then you can merge the files back together, as they will already be sorted, which is similar to how merge sort works. The answer to this SO question will be of value: Stream large files. Finally, operating systems come with a powerful file-sorting utility, so a simple function that calls a bash script should help; the code is well documented and easy to understand.
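For example, shelling out to the Unix sort utility from C# (the utility performs its own external merge sort on disk; the paths are illustrative):

using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "sort",
    Arguments = "-o sorted.txt big.txt", // sort big.txt into sorted.txt
    UseShellExecute = false
};
using var process = Process.Start(psi)!;
process.WaitForExit();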