27

When Python is not enough – (Processing) Sorting large data using ROOT and C++

 5 years ago
source link: https://www.tuicool.com/articles/hit/ABn6nae
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

So. I’m tasked with sorting a 120GB CSV file by date. And unfortunately I only have 64G on my workstation; the file won’t fit in RAM. Time to bust out Hadoop? NO! It’s only 120GB, you don’t need Hadoop for it. It’s time for some CERN technology.

Yes, that’s CERN. The one that builds the Large Hadron Collider . And ROOT is used by partial physicists to analyze data from the LHC. So you know its going to be good.

F3AVRvy.png!web

Background and Setup

ROOT is first developed by CERN in 1985 to handle all the data their experiments are generating. And is now a very powerful data analysis framework in C++ – With an interactive C++ shell – JIT compiled. And scales nearly perfectly from KBs of data to multi PBs of data.

Setting up ROOT on my workstation is as easy as it gets. As I’m running Arch Linux. A simple sudo pacman -S root does all the job. If you are on CentOS, Ubuntu, Windows or Mac, you can go to ROOT’s download page and download the latest release.

Convert CSV into .root files

ROOT operates on it’s own data format – the .root files. They are miniature NoSQL style (..hmm… I guess ROOT is way older then NoSQL), (I think) randomly accessible, machine independent, compressable database. Woah that’s long. Anyway, we need to convert CSV into .root files. And here is one of the beauty of ROOT. ROOT can operate nearly entirely on disk; ROOT is designed to handle datasets larger then RAM.

Converting is easy. We first create a .root file. Then dump all the contents of out CSV into it.

//conv_ttree.cpp
void conv_ttree() {
   //Create the .root file where we are saving out data
   //RECREATE means create a new file is not exist and overwrite if exits
   TFile *f = new TFile("/media/marty/HDD/convered.root","RECREATE");

   //TTree is how ROOT stores data. It is a tree like database
   //Here we create a new TTree named "Data".
   TTree *tree = new TTree("Data","data from csv file");


   //Load the csv and convert it. The second string defines the
   //structure of out CSV file. "C" means it stores a C
   //style (NULL termincated) string. " : " separates each element 
   tree->ReadFile("/media/share/HDD/data.csv",
   	"time/C:type/C:size/C:delta/C:address/C:category/C",',');
   //Force ROOT to write all unwritten data to disk
   f->Write();

}

Now to execute this thing. We launch the ROOT prompt by typing “root” into the terminal. And then .x conv_ttree.cpp . .x means to execute a C/C++ file and root automatically execute the function with the same name as the file.

jm2aEjZ.png!web Converting CSV to .root. The fail are normal as ROOT attempts to display it’s logo but I’m connected using SSH

Sorting the data

Now with the data ready. We can perform our task of sorting it. The code is a lot longer. But I totally appreciate how much control ROOT gives it’s users over on how data is processed. First, we load the .root file, select out TTree. Then select what we are sorting against. Next set up caching and performance parameters. Finally open a new .root file and save the values.

//sort.cpp

#include <chrono>

void sort()
{
	//Open the previously created .root file
	TFile *f = new TFile("/media/marty/HDD/convered.root");
	//Select the "Data" tree
	TTree *tree = (TTree*)f->Get("Data");

	//Create a list of sorting orders
	Int_t nentries = tree->GetEntries();
	Int_t *index = new Int_t[nentries];

	cout << "Done initalizing..." << endl;

	//Set how much data do we want to sort against (in this case, all data)
	tree->SetEstimate(nentries);
	//We are sorting against the "time" variable
	tree->Draw("time","","goff");

	cout << "Done loading list" << endl;

	//Perform the sorting
	TMath::Sort(nentries,tree->GetV1(),index);
	cout << "Done sorting" << endl;
	

	//Create a new file
	TFile f2("/media/share/storage/Data_sorted.root","recreate");
	//Copy a empty tree so we have something to attach to
	TTree *tsorted = (TTree*)tree->CloneTree(0);
	//Setup caching and fetching parameters, making the process a lot faster
	tree->SetCacheSize(1000*1000*1000);//1G read cache
	tsorted->SetCacheSize(1000*1000*2000); //2G Write Cache
	tree->SetClusterPrefetch(true);

	auto start = chrono::high_resolution_clock::now();
	double last_time = 0;
	for (Int_t i=0;i<nentries;i++) {
		// Flush cached baskets to prevent OOM
		if((i+1)%0xfffff == 0)
			tree->DropBaskets();
		//Progress display
		if(i%8195 == 0) {
			auto now = chrono::high_resolution_clock::now();
			auto time_span = chrono::duration_cast<chrono::duration<double>>(now - start);
			double spend = time_span.count();

			double pct = 100.0*(double)i/nentries;
			printf("%c[2K\r", 27);
			cout << pct << "% "
				<< i << "/" << nentries << ", "
				<< "time: " << spend << "s. ETA: "
				<< (size_t)((nentries-i)*(spend/i)) << "s"
				<< ". delta = " << spend-last_time
			       	<< flush;

			last_time = spend;
		}
		//Fetch data from the source tree
		tree->GetEntry(index[i]);
		//And put it into the cloned tree
		tsorted->Fill();
	}
	cout << "\nDone sorting" << endl;
	tsorted->Write();
	delete [] index;
}

And now again to execute it. Launch ROOT and type .x sort.cpp++ . Wait… what’s the ++ at the end? It means perform a full compilation on the program instead of using the JIT. It makes the code just that much faster

And… after a small while, the sorting is done!

VfUZjiv.png!web

Converting .root back to CSV

I’ld like to just work with ROOT on this data analysis project. But my college wants to work use Hadoop and R. So, I still need to convert .root back into CSV. Fortunately the process is straightforward (not saying the code is short).

//Convert a vector of strings into CSV columns
std::string csvify(const std::vector<std::string>& line)
{
	std::string str;
	str.reserve(240);
	for(size_t i=0;i<line.size();i++)
		str += line[i] + (i==line.size()-1?"":",");
	str += "\n";
	return str;
}

void tocsv()
{
	//Load the data
	TFile* f = new TFile("ASE_sorted.root");
	TTree* t = (TTree*)f->Get("ASEFile");

	//Since we stored data in C style strings. We extract them into C style arrays
	char time[100] = {0};
	char type[100] = {0};
	char size[100] = {0};
	char address[1000] = {0};
	char category[100] = {0};
	
	Int_t nentries = t->GetEntries();
	//Set where the data will be loaded
	t->SetBranchAddress("time", time);
	t->SetBranchAddress("type", type);
	t->SetBranchAddress("size", size);
	t->SetBranchAddress("address", address);
	t->SetBranchAddress("category", category);
	
	//Open a file to save to
	std::ofstream out("/media/share/storage/Data_sorted.root");
	//Print the header
	out << csvify({"time", "type", "type", "address", "category"});
	
	for(int i=0;i<nentries;i++) {
		//Read data from column
		t->GetEntry(i);

		//Progress display
		cout << (double)i/nentries << "\r" << flush;

		// Save the data
		vector<string> data = {time, type, type, address, category};
		out << csvify(data);
	}
	
	out.close();
}
QbE7feE.png!web

Launch ROOT and execute… and we are done! All 120GB of data sorted and saved into a CSV! ROOT is awesome!

Some resources to learn more about ROOT:

Deb Mohapatra – ROOT Basics

ROOT – Introductory Tutorials

Advertisements


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK