Surfing the Wave of Large Data Sets

Erin Todd

Engineering Geologist/Seismologist

The power and sophistication of the instruments and technologies we use to gather and record data at work are increasing rapidly. To keep up with this increase, we must adapt to the ever-growing waves of data coming toward us, or else risk being swept away in the current. Something as simple as upgrading our data-recording equipment can mean we generate exponentially larger data sets. Where the old version of an instrument might have collected data every hour, the new version now collects data every minute, or even every second. The ease of upscaling the amount of data available to us presents its own advantages and challenges.

There are certainly times that ‘big data’ – as an unstructured and extraordinarily large tidal wave – may require the use of hefty computational power, expensive software, and massive parallelism. Still, there are times when the data deluge falls under the other popular interpretation of ‘big data’ – where you’re facing extremely large data sets that have exceeded the limits of traditional spreadsheets and require a strategic and thoughtful approach. When facing error messages and frozen screens, it’s critical to know what type of wave is looming behind your data problem.

Smart consultants looking for the right tool to ride the incoming waves of information must understand the difference between ‘big data’ and large data sets, and would be wise to reconsider traditional handling techniques before jumping straight from spreadsheets into licensed software.

More data makes the workflow more complex

At the less complex end of the scale, the data workflow is reasonably smooth and straightforward. Data is gathered, compiled, transferred onto a computer, processed to obtain results, and then the outputs are extracted, and results can be presented in reports. This neat, linear process works well when the amount of data to be processed is manageable.

However, when the waves of large data sets come rolling in, things get more complicated and can seem overwhelming. Take, for example, a project that looks at ground-shaking from regional earthquakes over a few days, recorded using a seismometer. Seismometers can measure ground-shaking at hundreds of samples per second. If you try to open and process the data in a spreadsheet, problems will emerge rapidly. A spreadsheet in Excel has space for just over a million rows, so at 100 samples per second that’s only about three hours of data. Large amounts of data can end up spread across multiple spreadsheets that each hold different parts of the whole data set, which quickly creates problems with organisation and sharing.
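The arithmetic behind that row-limit claim is worth seeing directly. A quick sketch (the sample rates are illustrative; Excel’s hard limit of 1,048,576 rows is the one fixed number):

```python
# How long can one Excel sheet hold continuous seismometer data?
# Excel allows 1,048,576 rows, i.e. one sample per row.
EXCEL_MAX_ROWS = 1_048_576

for rate_hz in (1, 100, 500):  # samples per second (illustrative rates)
    seconds = EXCEL_MAX_ROWS / rate_hz
    hours = seconds / 3600
    print(f"{rate_hz:>3} Hz -> {hours:,.1f} hours per sheet")
```

At 100 samples per second, a single sheet fills up in under three hours – so a few days of recording already means dozens of separate files.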

Intuitively, more data seems like a great thing, yet unforeseen problems can emerge. The time it takes to deal with data doesn’t scale neatly with the size of the data set; usually there is a point where things become too inefficient to handle, and time is swallowed by troubleshooting.

Decisions may be made to cut the data down so it can be handled, but that sacrifices the benefits of using the latest high-resolution equipment, because the ability to deal with the increased amount of data hasn’t been upgraded alongside the instruments that generate it.

Managing the flow… with a database

How can we make this complicated sea of data easier to paddle in?

There are steps that can make working with large volumes of data more manageable while still maintaining the complexity and amount of data collected for the project. Using a database management system rather than spreadsheets is a start. With a database, the data can be stored and organised in a space-efficient way, and we can then extract the bits of it that we want for a task, as and when we need them. To use a different analogy, you could think of this as gathering water into a water tank for storage, and then controlling the amount you take out by opening and closing the outflow tap as needed.

This all sounds very straightforward, but how does adding a database component actually work in a project?

Loading data into a database isn’t intuitive, and will probably require some programming in a database language such as SQL. But if you’re not inclined to become a data scientist or programmer, you can have someone build you some simple software or purchase a licence for proprietary software.

A simple, customised application could be as straightforward as having a button for importing data (which loads and organises the raw data into a database on a backed-up server) and a button for exporting data between a start and end date. That creates an easy way to ‘turn the tap on and off’ to obtain the data as you need it. This way, the data can all be stored in a central ‘water tank’ location (the server), so that anyone who has the software on their computer can easily access the parts of the data they need. The result: no more time wasted searching multiple locations for the data you need.
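The import/export idea above can be sketched in a few lines. This is a minimal illustration, not a production design: it uses Python’s built-in sqlite3 module as the database, and the table and column names (readings, ts, value) are assumptions made up for the example.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) the central 'water tank' database."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS readings (ts TEXT, value REAL)")
    return conn

def import_data(conn, rows):
    """The 'import' button: load raw (timestamp, value) pairs into the database."""
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()

def export_range(conn, start, end):
    """The 'export' button: turn the tap on for just the slice between two dates."""
    cur = conn.execute(
        "SELECT ts, value FROM readings WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (start, end),
    )
    return cur.fetchall()

conn = open_db()
import_data(conn, [("2024-01-01", 0.12), ("2024-01-02", 0.35), ("2024-01-05", 0.08)])
print(export_range(conn, "2024-01-01", "2024-01-03"))
```

In a real project the `:memory:` path would point at a file on the backed-up server, so everyone draws from the same tank.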

More functionality can be added to the customised software to make managing the data even easier – such as adding a button to make tables and plots or calculate additional statistics for parts of the data in the database. This makes dealing with large data sets a reality by reducing the time spent trying to organise data without reducing its size or complexity.
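A ‘statistics button’ like the one described above can lean on the database itself to do the summarising, so only a few numbers come back rather than the whole data set. A minimal sketch, again using sqlite3 with made-up table and column names:

```python
import sqlite3

# Set up a small example database (names and values are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("2024-01-01", 0.12), ("2024-01-02", 0.35), ("2024-01-03", 0.08)],
)

# Ask the database to aggregate a date range: the heavy lifting stays
# server-side, and only the count, mean, and peak come back.
count, mean, peak = conn.execute(
    "SELECT COUNT(*), AVG(value), MAX(value) FROM readings "
    "WHERE ts BETWEEN ? AND ?",
    ("2024-01-01", "2024-01-03"),
).fetchone()
print(count, round(mean, 3), peak)
```

The same pattern extends to whatever statistics or table outputs a project needs, without ever shrinking or simplifying the stored data.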

When is the right time to get on board?

Now that we know how to approach problems that involve large data sets, it is just as important to know when to adopt the approach of using custom-built software rather than licensed software to store and process data.

The answer is a matter of scale. A spreadsheet program such as Excel is efficient if the quantity of data is manageable and it can be stored, processed and displayed in the same space. In the case of large and complex data sets that are added to frequently and used by many people over an extended period (such as geotechnical borehole logs), licensed commercial software is a better choice.

What is less understood is the right approach for the middle ground, between simple data sets and large, complex, often-used data sets.

These scenarios might arise where you collect large amounts of data for a stand-alone project. You might want to access that data occasionally, but it doesn’t justify paying an expensive yearly licence fee just to be able to look at your own data. In this space, it’s worth considering whether the benefits you require can be achieved from simple software that is built and customised for you and your project. Licensed software comes at a cost, which may be more than is warranted for your purposes. While custom-built software will require an initial outlay, it may be better tailored for your particular needs.

Regardless of the software and method you choose, the time to get on board is now. Dealing with large data sets will be an increasingly frequent challenge for projects as data collection becomes ever more accurate and more accessible. For this data to be useful, we must advance our processing techniques too, so that we can surf the giant waves of large data sets rather than let them send us crashing.

