
I've worked at a place that had a good excuse for using a real "big data" processing system - a large Hadoop cluster on bare metal - because the dataset was far bigger than would fit on a single machine. The cluster had over 20 machines with 10 or so 3TB disks each, and a large computation might touch half the dataset. Even if you could put all the data in a storage appliance and read it from a single machine, you needed the separate memories and data buses just to sift through the data and keep the relevant bits in non-swapped memory in less than a day.
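To make that concrete, here's a rough back-of-envelope sketch. The cluster size and dataset fraction are from above; the per-disk throughput and the single-machine disk count are my assumptions, not stated figures:

```python
# Back-of-envelope scan-time estimate.
# Given above: ~20 machines x 10 disks x 3 TB, a large job reads ~half the data.
# Assumed: ~150 MB/s sequential throughput per spinning disk,
#          a hypothetical single machine with 10 such disks.
total_tb = 20 * 10 * 3            # ~600 TB raw capacity
job_tb = total_tb / 2             # a large job touches ~300 TB

disk_tb_s = 150 / 1e6             # 150 MB/s per disk, in TB/s
single_bw = disk_tb_s * 10        # one machine's aggregate read bandwidth
cluster_bw = single_bw * 20       # 20 machines reading in parallel

hours_single = job_tb / single_bw / 3600
hours_cluster = job_tb / cluster_bw / 3600
print(f"single machine: ~{hours_single:.0f} h, cluster: ~{hours_cluster:.1f} h")
# Roughly 56 hours for one machine vs. under 3 hours for the cluster -
# a single box can't even stream the bytes past the CPU in a day.
```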

I think the important lesson from this paper - one a few researchers also learned on our cluster - is that the amount of inefficiency that can lurk in software, and then be removed by competent programming, is astronomical these days. You see many arguments like yours - programmer time is expensive, you need a super-expensive expert, just pay for the systems, blah blah - but this underestimates the cost of the ridiculous inefficiency, and overestimates the cost of competency.

Even scientists inexperienced with serious programming can often get appreciably better at writing their data processing jobs before their first job finishes, when you're dealing with really big data. And a lot of what people call big data isn't even big data. They'd rather go through the motions of setting up a big-data processing system and using it poorly than learn better software engineering skills, even if that would take less of their (and others') time, amortized over the next few months of their work.

This isn't Toyota vs. Bugatti. This is... a freight train with conductor, engineer, station staff, and one car of payload... vs. getting a driver's license for a large van.


