MF4 Decoders - DBC Decode CAN Data to CSV/Parquet [+ Data Lakes]


Need to DBC decode your CAN/LIN data to CSV/Parquet files?

The CANedge records raw CAN/LIN data to an SD card in the popular ASAM MDF format (MF4).

The simple-to-use MF4 decoders let you DBC decode your log files to CSV or Parquet files - enabling easy creation of powerful data lakes and integration with 100+ software/API tools.

Learn more below - and try the decoders yourself!

DBC DECODE

DBC decode your MF4 files to interoperable CSV/Parquet files

DRAG & DROP

Drag & drop files/folders onto the decoder to process them

AUTOMATE

Optionally use via the command line or in scripts for automation

DATA LAKE

Easily create powerful Parquet data lakes for use in 100+ tools

WINDOWS/LINUX

Decoders can be used on both Windows and Linux operating systems

100% FREE

The decoders are 100% free and can be integrated into your own solutions


The ASAM MDF (Measurement Data Format) is a popular, open and standardized format for storing bus data, e.g. from CAN bus (incl. J1939, OBD2, CAN FD etc.) and LIN bus.

The CANedge records raw CAN/LIN data in the latest standardized version, MDF4 (*.MF4). The log file format is ideally suited for pro-spec CAN logging at high bus loads and enables both lossless recording and 100% power safety. Further, the CANedge supports embedded encryption and compression of the log file data (both natively supported by the MF4 decoders).

The raw MF4 data from the CANedge can be loaded natively in various software/API tools, including the asammdf GUI/API, our Python API, the MF4 converters - and the MF4 decoders (detailed in this article).

To learn more about the MDF4 file format, see our MF4 intro.
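
For quick inspection of the raw data before decoding, an MF4 log file can also be opened directly via the open source asammdf Python API - a minimal sketch, where the file name is a placeholder for one of your own log files:

# Minimal sketch: open a raw CANedge MF4 log file with the asammdf Python API
from asammdf import MDF

mdf = MDF("00000001.MF4")          # placeholder file name
print(mdf.version)                 # MDF version, e.g. 4.11
print(list(mdf.channels_db)[:10])  # a few of the raw channel names in the file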


The MF4 decoders can be deployed in numerous ways - below are two examples:

Example 1: Local PC drag & drop usage

  • A CANedge records raw CAN/LIN data to an SD card
  • A log file folder from the SD card is copied to a PC
  • The relevant DBC files are placed next to the MF4 decoder
  • The data is DBC decoded to CSV via drag & drop
  • The CSV files can be directly loaded in e.g. Excel


Example 2: Cloud based auto-processing

  • A CANedge uploads raw CAN/LIN data to an AWS S3 bucket
  • When a log file is uploaded it triggers a Lambda function
  • The Lambda uses the MF4 decoder and DBC files from S3
  • Via the Lambda, the uploaded file is decoded to Parquet files
  • The Parquet files are written to an AWS S3 'output bucket'
  • The data lake can be visualized in e.g. Grafana dashboards

Learn more in our Grafana dashboard article.
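
To make the Lambda step more concrete, below is a minimal, hedged sketch of what such a handler could look like in Python. The bucket names, DBC file key and decoder path are placeholders (only the -i/-O flags shown in the CLI example further below are from the docs) - the actual plug & play Lambda implementation may differ:

import os
import subprocess
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event when the CANedge uploads a new MF4 log file
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    os.makedirs("/tmp/input", exist_ok=True)
    os.makedirs("/tmp/output", exist_ok=True)

    # Download the raw log file and a DBC file (placeholder key) to Lambda's /tmp storage
    s3.download_file(bucket, key, "/tmp/input/" + os.path.basename(key))
    s3.download_file(bucket, "dbc/can1-mydbc.dbc", "/tmp/input/can1-mydbc.dbc")

    # Run the MF4 decoder (assumed here to be packaged e.g. as a Lambda layer under /opt)
    subprocess.run(["/opt/mdf2parquet_decode", "-i", "/tmp/input", "-O", "/tmp/output"], check=True)

    # Upload the decoded Parquet files to the 'output bucket' (placeholder name)
    for root, _, files in os.walk("/tmp/output"):
        for name in files:
            path = os.path.join(root, name)
            s3.upload_file(path, "my-output-bucket", os.path.relpath(path, "/tmp/output"))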


The decoders support DBC files for practically any CAN bus, CAN FD or LIN bus protocol. This includes e.g. OBD2, J1939, NMEA 2000, ISOBUS, CANopen and proprietary OEM-specific DBC files. For details, see the MF4 decoder docs. For each DBC file, you specify which CAN/LIN channel to apply it to (e.g. can1, can2, lin1 etc.) and you can provide multiple DBC files per channel.


The MF4 decoders enable you to output DBC decoded CAN/LIN data as either CSV files or Parquet files. Below we briefly outline the key differences between the formats:

  • CSV files are simple, text-based, and universally compatible - ideal for small to medium-sized datasets and ad hoc analyses. They are easy to use, but less efficient for large data
  • Parquet files are binary and offer vastly faster performance and storage efficiency vs CSV - but require more specialized tools for analysis
  • While CSVs are straightforward for basic tasks, Parquet is generally preferable for performance-intensive analysis and handling large-scale time series data

For most use cases, we recommend the Parquet file format where both options exist - and many of our plug & play integrations are based on Parquet data lakes for the above reasons.
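
The practical difference is easy to see in e.g. pandas - a minimal sketch, assuming you have decoded the same data to both formats (file and column names are placeholders):

# Minimal sketch: load the same DBC decoded data as CSV vs. Parquet via pandas
import pandas as pd

df_csv = pd.read_csv("decoded/can1-speed.csv")              # parses the entire text file
df_parquet = pd.read_parquet("decoded/can1-speed.parquet")  # compressed, column-oriented

# Parquet also lets you load only the columns you need - typically much faster on large files
df_subset = pd.read_parquet("decoded/can1-speed.parquet", columns=["t", "WheelSpeed"])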


FUNCTIONALITY

DECODE FOLDERS

USE VIA CLI/SCRIPTS

USE WITH S3

DECOMPRESS/DECRYPT

CREATE DATA LAKES

INTEGRATION EXAMPLES

DASHBOARDS

PYTHON

MATLAB

EXCEL/POWER BI

CHATGPT

DBC decode raw MF4 data via drag & drop

The simple-to-use MF4 decoders let you drag & drop CANedge log files with raw CAN/LIN data to DBC decode them using your own DBC file(s) - outputting the data as CSV or Parquet files.

Batch decode (nested) folders

You can also drag & drop entire folders of MF4 log files onto a decoder to batch process the files. This also works for nested folders with e.g. thousands of log files.

Automate decoding via CLI/scripts

The decoder executables can be called via the CLI or from any programming language. Ideal for automated DBC decoding locally, in the cloud (e.g. in AWS Lambda), on Raspberry Pis etc.

# Example: call the MF4 decoder from a Python script
import subprocess

# Decode all MF4 log files in the 'input' folder to Parquet files in the 'output' folder
subprocess.run(["mdf2parquet_decode.exe", "-i", "input", "-O", "output"], check=True)

Easily use with S3 storage

The CANedge2/CANedge3 upload data to your own S3 server. Mount your S3 bucket and use the MF4 decoders as if the files were stored locally. Or use in e.g. AWS Lambda for full automation.
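
If you prefer not to mount the bucket, log files can also be fetched programmatically before decoding - a minimal sketch using boto3, where the bucket name and device serial number prefix are placeholders:

# Minimal sketch: download raw MF4 log files from S3 into a local 'input' folder
import os
import boto3

s3 = boto3.client("s3")
os.makedirs("input", exist_ok=True)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-canedge-bucket", Prefix="AABBCCDD/"):
    for obj in page.get("Contents", []):
        if obj["Key"].upper().endswith(".MF4"):
            local_name = os.path.join("input", obj["Key"].replace("/", "_"))
            s3.download_file("my-canedge-bucket", obj["Key"], local_name)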

Easily decompress and/or decrypt your raw data

The CANedge supports embedded compression and encryption of log files on the SD card. The MF4 decoder natively supports compressed/encrypted files, simplifying post-processing.

Create powerful Parquet data lakes

The decoders are ideal for creating powerful Parquet data lakes with an efficient date-partitioned structure of concatenated files - stored locally or e.g. on S3.
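
To illustrate how such a data lake can be consumed, below is a minimal sketch using pyarrow - the folder layout shown (device/message/date) is an assumption for illustration, not necessarily the decoder's exact output structure:

# Minimal sketch: read one message folder from a date-partitioned Parquet data lake
import pyarrow.dataset as ds

# All Parquet files in nested date folders below this path are discovered automatically
dataset = ds.dataset("datalake/AABBCCDD/CAN1_EngineSpeed/", format="parquet")
table = dataset.to_table()   # optionally pass columns=[...] or filter=... to limit what is read
print(table.num_rows)
print(table.schema)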

Visualize your CAN/LIN data in Grafana dashboards

Many dashboard tools can query data from Parquet data lakes via SQL interfaces (like Athena or ClickHouse), enabling low cost, scalable visualization - see e.g. our Grafana-Athena intro.

Use Python to analyze Parquet data lakes

Python supports Parquet data lakes, enabling e.g. big data analysis. With S3 support, you can also analyze data directly in e.g. Colab Jupyter Notebooks. See the docs for script examples.
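
A minimal sketch of this workflow, assuming the data lake is stored on S3 and the s3fs package is installed (bucket, prefix and message names are placeholders):

# Minimal sketch: load part of an S3 based Parquet data lake into pandas (requires s3fs)
import pandas as pd

df = pd.read_parquet("s3://my-datalake-bucket/AABBCCDD/CAN2_GnssSpeed/2024/06/")
print(df.describe())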


Use MATLAB to analyze Parquet data lakes

MATLAB natively supports Parquet data lakes - making it easy to perform advanced analysis at scale with support for S3 and out-of-memory tall arrays. See the docs for script examples.


Use Excel or Power BI to analyze your data lakes

Excel and Power BI let you load DBC decoded CSV/Parquet files for quick analysis - or use e.g. Athena/ClickHouse ODBC drivers to query data (beyond memory) from your data lakes via SQL.

Easily analyze data via ChatGPT

ChatGPT is great for analysing large amounts of DBC decoded CAN/LIN data in CSV format. Learn more in our intro.

Want to try this yourself? Download the decoders and MF4 sample data below:

Download decoders | Download MDF4 data

Store your data lake anywhere - and integrate with everything



Parquet data lakes combine cheap, flexible storage with efficient, interoperable integration opportunities.

Agnostic low cost storage

Parquet data lakes consist of compact, efficient binary files - meaning they can be stored at extremely low cost in any cloud file storage (e.g. AWS S3, Google Cloud Storage, Azure Blob Storage), self-hosted S3 buckets (e.g. MinIO) - or simply on your local disk. Storing Parquet files in e.g. AWS S3 is typically ~95% lower cost vs. storing the equivalent data volume in a database.

Native Parquet support

As illustrated, Parquet data lakes are natively supported by a wide array of tools. For example, you can work directly with Parquet files from programming languages like Python or MATLAB - whether the files are stored locally or on S3. Further, Parquet files can be natively loaded in many desktop tools like Microsoft Power BI or Tableau Desktop.

Powerful interfaces

Parquet data lakes are natively supported by 'interfaces' like Amazon Athena, Google BigQuery, Azure Synapse and open source options like ClickHouse and DuckDB. These expose SQL query interfaces and ODBC/JDBC drivers that dramatically expand integration options - and supercharge query speed. You can for example use interfaces to visualize your data in Grafana dashboards.


Parquet data lakes consist of files, meaning they can be stored in file storage solutions like AWS S3. Storing data on S3 is incredibly low cost (~$0.023/GB/month) compared to most databases (typically ~$1.5/GB/month), which is relevant as many CAN/LIN data logging use cases can require terabytes of storage over time. For example, storing 1 TB for a month costs roughly $23 on S3 vs. roughly $1,500 in a typical database.

The 'downside' to storing files on S3 vs. in a database is generally that querying the data is much slower, e.g. for analytics or visualization. However, this is where interface tools like Amazon Athena come into play, as outlined below.


We refer to tools like Amazon Athena, Google BigQuery and Azure Synapse Analytics as 'interfaces' for simplicity. They can also be referred to as cloud-based serverless data warehouse services. The serverless part is important: it means that it's simple to set up - and you pay only when you query data.

Many automotive OEM engineers need to store terabytes of data for analysis - yet they may only need to access small subsets of the data on an infrequent basis. When they do access the data, however, the query speed has to be fast - even if they query gigabytes of data. Tools like Amazon Athena are ideally suited for this. When you query the data, Athena spins up the necessary compute and parallelization in real time - meaning you can extract insights across gigabytes of your S3 data lake in seconds using standard SQL queries. At the same time, all the complexity is abstracted away - and the solutions can be automatically deployed as per our step-by-step guides.
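
As an illustration, such an Athena query can also be launched programmatically - a minimal sketch using boto3, where the database, table, column and bucket names are placeholders:

# Minimal sketch: run an SQL query against an Athena table mapped to the Parquet data lake
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT AVG(EngineSpeed) FROM can_messages WHERE year='2024' AND month='06'",
    QueryExecutionContext={"Database": "my_datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution / get_query_results for the output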


There are too many software/API integration examples for us to list them all - below is a non-exhaustive recap of tools:

Direct integration examples

Below are examples of tools that can directly work with Parquet files:

  • MATLAB: Natively supports local or S3 based Parquet data lakes, with powerful support for out-of-memory tall arrays
  • Python: Natively supports local or S3 based Parquet data lakes and offers libraries for key interfaces (Athena, ClickHouse etc.)
  • Power BI: Supports reading Parquet files from the local filesystem, Azure Blob Storage, and Azure Data Lake Storage Gen2
  • Tad: Free desktop tool for viewing and analyzing tabular data incl. Parquet files. Useful for ad hoc review of your data
  • Apache Spark: A unified analytics engine for large-scale data processing that supports Parquet files
  • Databricks: A platform for massive-scale data engineering and collaborative data science, supporting Parquet files
  • Tableau: A data visualization tool that can connect to Parquet files through Spark SQL or other connectors
  • Apache Hadoop: Supports the Parquet file format for HDFS and other storage systems
  • PostgreSQL: With the appropriate extensions, it can query Parquet files
  • Cloudera: Offers a platform that includes Parquet file support
  • Snowflake: A cloud data platform that can load and query Parquet files
  • Microsoft SQL Server: Can access Parquet files via PolyBase
  • MongoDB: Can import data from Parquet files using specific tools and connectors
  • Teradata: Supports querying Parquet files using QueryGrid or other connectors
  • Apache Drill: A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage, which supports Parquet files
  • Vertica: An analytics database that can handle the Parquet file format
  • IBM Db2: Can integrate with tools to load and query Parquet files

Interface based integrations

Below are examples of tools that can integrate via interfaces like Athena, BigQuery, Synapse, ClickHouse etc.:

  • Power BI (driver): By installing a JDBC/ODBC driver (for e.g. Athena), you can use SQL to query your data lake
  • Excel (driver): By installing a JDBC/ODBC driver (for e.g. Athena), you can use SQL to query your data lake
  • Grafana: Offers powerful and elegant dashboards for data visualization, ideal for visualizing decoded CAN/LIN data
  • Tableau: Known for its interactive data visualization capabilities, especially popular for business intelligence applications
  • Looker: Employs an analytics-oriented application framework, including business intelligence and data exploration features
  • Google Data Studio: Customizable reports/dashboards, known for user-friendly design and integration with Google services
  • AWS QuickSight: A fast, cloud-powered business intelligence service that integrates easily with e.g. Amazon Athena
  • Apache Superset: An open source data exploration and visualization platform written in Python
  • Deepnote: A collaborative data notebook built for teams to discover and share insights
  • Zing Data: A data exploration and visualization platform supporting e.g. ClickHouse
  • Explo: Customer-facing analytics for any platform, designed for beautiful visualization and engineered for simplicity
  • Metabase: An easy-to-use, open source UI tool for asking questions about your data
  • Qlik: Offers end-to-end, real-time data integration and analytics solutions, known for the associative exploration user interface
  • Domo: Combines a powerful back-end with a user-friendly front-end, ideal for consolidating data systems into one platform
  • Sisense: Known for its drag-and-drop user interface, enabling easy creation of complex data models and visualizations
  • MicroStrategy: Offers a comprehensive suite of BI tools, emphasizing mobile analytics and HyperIntelligence features
  • Splunk: Specializes in processing and analyzing machine-generated big data via a web-style interface
  • Exasol: Offers a high-performance, in-memory, MPP database designed for analytics and fast data processing
  • Alteryx: Provides an end-to-end platform for data science and analytics, facilitating easy data blending and advanced analytics
  • SAP Analytics Cloud: Offers business intelligence, augmented analytics, predictive analytics, and enterprise planning
  • IBM Cognos Analytics: Integrates AI to help users visualize, analyze, and share actionable business insights
  • GoodData: Provides cloud-based tools for big data and analytics, with a focus on enterprise-level data management and analysis
  • Dundas BI: Offers flexible dashboards, reporting, and analytics features, allowing for tailored BI experiences
  • Yellowfin BI: Delivers business intelligence tools and a suite of analytics products with collaborative features for sharing insights
  • Reveal: Provides embedded analytics and a user-centric design, making data more accessible for decision makers and teams
  • Chartio: A cloud-based data exploration tool, known for its ease of use and ability to blend data from multiple sources

Visualize data in Grafana dashboards

Want to create dashboard visualizations across all of your CAN/LIN data?

The CANedge2/CANedge3 is ideal for collecting CAN/LIN data to your own server (cloud or self-hosted). A common requirement for OEMs and system integrators is the ability to create dashboards for visualizing the decoded data. Here, the MF4 decoders can automate the creation of Parquet data lakes at any scale (from MB to TB) stored on S3 - ready for visualization via Grafana dashboards. Learn more in our dashboard article.


Analyze fleet performance in MATLAB/Python

Need to perform advanced large scale analyses of your data?

The CANedge3 lets you record raw CAN data to an SD card and auto-push it to your own S3 server via 3G/4G. Uploaded files can be DBC decoded to a Parquet data lake, output into a separate S3 bucket. This makes it easy to perform advanced statistical analysis via MATLAB or Python, as both natively support loading Parquet data lakes stored on S3 - letting you perform advanced analyses with minimal code. See our script examples to get started.

Quickly analyse data as CSV via Excel

Need to swiftly review your DBC decoded data?

The MF4 decoders can be useful for quickly understanding what can be DBC decoded from your raw CAN/LIN data. By simply drag & dropping your LOG/ folder from the CANedge SD card, you can create a copy in DBC decoded CSV form - and directly load this data for analysis in Excel. If you wish to perform more efficient analysis of large amounts of data in Excel, you can alternatively use e.g. an ODBC driver via Athena, DuckDB or ClickHouse - enabling efficient out-of-memory analyses.
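
To illustrate the out-of-memory approach, below is a minimal sketch using DuckDB's Python API to run the kind of SQL query an ODBC driver would issue - the data lake path and signal/column names are placeholders:

# Minimal sketch: query a local Parquet data lake with DuckDB without loading it all into memory
import duckdb

result = duckdb.sql("""
    SELECT date_trunc('hour', t) AS hour, AVG(EngineSpeed) AS avg_rpm
    FROM read_parquet('datalake/AABBCCDD/CAN1_EngineSpeed/**/*.parquet')
    GROUP BY hour
    ORDER BY hour
""").df()
print(result.head())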


Create a self-hosted multi-purpose Parquet data lake

Need a 100% self-hosted Parquet data lake - using open source tools only?

If you prefer to self-host everything, you can e.g. deploy a CANedge2/CANedge3 to upload data to your own self-hosted MinIO S3 bucket (100% open source) running on your own Windows/Linux machine (or e.g. a virtual machine in your cloud). You can run a cron job to periodically process new MF4 log files and output the result to your Parquet data lake. The Parquet files can be analysed directly via Python. Further, you can integrate it with an open source tool like ClickHouse for ODBC driver integrations or dashboard visualization via Grafana dashboards.
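
Since MinIO is S3 compatible, the same Python tooling shown earlier works against a self-hosted bucket - a minimal sketch, where the endpoint, credentials and bucket name are placeholders:

# Minimal sketch: point boto3 at a self-hosted MinIO server instead of AWS S3
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # your MinIO endpoint
    aws_access_key_id="minioadmin",        # placeholder credentials
    aws_secret_access_key="minioadmin",
)

# e.g. list new raw MF4 files in the input bucket before a cron job runs the decoder
for obj in s3.list_objects_v2(Bucket="canedge-input").get("Contents", []):
    print(obj["Key"])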

FAQ


Yes, you control 100% how you create and store your CSV/Parquet data lake.

In our examples, we frequently start from a setup where your CANedge2/CANedge3 uploads data to an AWS S3 bucket - with uploaded data automatically processed via AWS Lambda functions. This is a common setup that we provide plug & play solutions for - hence it will often be the simplest way to deploy your data processing and data lake.

However, you can set this up in any way you want. For example, you might store your uploaded log files on a self-hosted MinIO S3 bucket instead. In such a scenario, you can periodically process new MF4 log files manually (e.g. via drag & drop or the CLI) to update your data lake - or you can set up e.g. a cron job or similar service to handle this. The data lake can be stored in another MinIO S3 bucket - and you can then directly work with the data lake from here (e.g. in MATLAB/Python) or integrate the data using an open source system like ClickHouse or DuckDB.

The same principle applies if you upload data to Google Cloud Storage or Azure Blob Storage (via an S3 gateway) - here you can use their native data processing services to deploy the MF4 decoders if you wish to fully automate the DBC decoding of incoming data. We do not provide plug & play solutions for deploying this, however.

Of course, you can also simply use the MF4 decoders locally to create a locally stored Parquet data lake. This will often suffice if you're e.g. using a CANedge1 to record your CAN/LIN data - and you simply wish to process this data on your own PC. In such use cases, the Parquet data lake can still be a powerful tool, since it makes it much easier to perform large-scale data processing compared to tools like the asammdf GUI.


We provide two types of MF4 executables for use with the CANedge: The MF4 converters and MF4 decoders.

The MF4 converters let you convert the MDF log files to other formats like Vector ASC, PEAK TRC and CSV. These converters do not perform any form of DBC decoding of the raw CAN/LIN data - they only change the file format.

The MF4 decoders are very similar in functionality to the MF4 converters. However, these executables DBC decode the log files to physical values, outputting them as either CSV or Parquet files. When using the MF4 decoders, you provide your DBC file(s) to enable the decoding. These executables are ideal if your goal is to analyse the human-readable form of the data and/or e.g. create 'data lakes' for analysing/visualizing the data at scale.


No, you simply download the converter executables - no installation required.


As evident, there are almost limitless options for how you can deploy your MF4 decoders, how you can store the resulting data lake, how you provide interfaces for it - and what software/API tools you integrate it with.

You can use the CANedge and our MF4 decoders to facilitate any of these deployment setups. However, our team offers step-by-step guides and technical support only for a limited subset of the deployments, such as our Grafana-Athena integration.

Need an interoperable CAN logger?

Get your CANedge today!

Buy now | Contact us


Recommended for you


CAN TELEMATICS AT SCALE

CANEDGE2 - WIFI CAN LOGGER

CANEDGE3 - 3G/4G CAN LOGGER



