File Processing how2s

Processing CSV, PSV, TSV...

Posted: July 28th, 2011 | Author: Matt Croydon | Filed under: Java, Open Source, Scala

One of the great things about Scala (or any JVM language for that matter) is that you can take advantage of lots of libraries in the Java ecosystem.


Today I wanted to parse a CSV file with Scala, and of course the first thing I did was search for "scala csv". That yielded some interesting results, including a couple of roll-your-own regex-based implementations. I prefer to lean on established libraries instead of copying and pasting code from the internet, so my next step was to search for "java csv". The third hit down was opencsv, which looked solid, had been updated recently, and was Apache-licensed.

Apache Flink 1.1-SNAPSHOT Documentation: Flink DataSet API Programming Guide

DataSet programs in Flink are regular programs that implement transformations on data sets (e.g., filtering, mapping, joining, grouping).

The data sets are initially created from certain sources (e.g., by reading files, or from local collections). Results are returned via sinks, which may for example write the data to (distributed) files, or to standard output (for example the command line terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs. The execution can happen in a local JVM, or on clusters of many machines. Please see basic concepts for an introduction to the basic concepts of the Flink API.
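The source-to-transformations-to-sink shape described here can be sketched with plain Scala collections. This is only an analogy under stated assumptions: it does not use the Flink API, and the names `WordCountSketch`, `lines`, and `wordCounts` are made up for illustration.

```scala
// Plain Scala collections, NOT the Flink API. All names are illustrative.
object WordCountSketch {
  // "Source": a local collection of lines, standing in for a file read.
  val lines: Seq[String] = Seq(
    "to be or not to be",
    "that is the question"
  )

  // Transformations: split lines into words (flatMap), drop empty tokens
  // (filter), group identical words (groupBy), and count each group (map).
  def wordCounts(input: Seq[String]): Map[String, Int] =
    input
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    // "Sink": write the result to standard output, sorted for stable ordering.
    wordCounts(lines).toSeq.sortBy(_._1).foreach { case (word, count) =>
      println(s"$word\t$count")
    }
  }
}
```

A real Flink program would obtain its data sets from an execution environment and run the same kind of pipeline on a cluster; the sketch only shows the order of the steps.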

In order to create your own Flink DataSet program, we encourage you to start with the anatomy of a Flink Program and gradually add your own transformations. The guide's example program is a complete, working example of WordCount.

A CSV Parser – moving from Scala Parser Combinators to Parboiled2

CSVs are a ubiquitous format for all sorts of tabular data.

I assume that every major programming language ecosystem has a handful of libraries handling both parsing and writing them. In Scala you can quickly turn to StackOverflow to find a 30-line CSV parser based on the Scala Parser Combinators library. In this short blog post I will show a Parboiled2-based version of this parser and compare the performance of the two. Parboiled2 is a lightweight parser generator based on Scala Macros.

It compiles the defined grammar rules into JVM bytecode. The gist with the CSV parser definition (actually, you can use an arbitrary delimiter, but for simplicity’s sake I’ll keep using the word CSV) has roughly the same structure as the parser from StackOverflow.

GitHub - mikeaddison93/product-collections: A very simple, strongly typed Scala framework for tabular data. A collection of tuples. A strongly typed Scala CSV reader and writer. A lightweight, idiomatic dataframe / datatable alternative.

GitHub - mikeaddison93/PureCSV: A type-safe and boilerplate-free CSV library for Scala.

Read CSV in Scala into case class instances with error handling.

A Foray into Spark and Scala

Apache Spark is a new wave in Big Data computing, an alternative to technologies such as Hadoop.

I was recently watching someone analyze log files of image URL requests using shell scripts to create a MySQL database and thought it might be an interesting exercise to try it in Spark as well.

Hadoop versus Spark

So what is Spark and how is it different from Hadoop? Hadoop is an older, more mature technology.

It is really a collection of technologies such as a distributed resilient file system (HDFS), job tracking, and a parallelized map-reduce engine. Spark is a newer parallel processing platform, but only really replaces the map-reduce engine of Hadoop. If I had to pick an analogy, designing a Spark application feels a lot like defining a series of SQL views, one built on top of another.

My Sample Project

The project I am going to use in this blog is a CSV log file of image URL requests.

The log file I am using has 16 columns, but only some are relevant.

spark-csv/CsvRelation.scala at master · databricks/spark-csv

mikeaddison93/spark-csv: CSV data source for Spark SQL and DataFrames.

Spark examples: how to work with CSV / TSV files (performing selection and projection operation)

One of the simplest formats your files may have in order to start playing with Spark is CSV (comma-separated value or TSV tab…).

Let’s see how to perform some operations over a set of these files. As usual, I suggest you create a Scala Maven project in Eclipse, compile a jar, and execute it on the cluster with the spark-submit command. See this previous article for detailed instructions on how to set up Eclipse for developing in Spark Scala, and this other article to see how to build a Spark fat jar and submit a job.
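As a rough illustration of what selection and projection mean for delimited records, here is a minimal sketch using plain Scala collections rather than the Spark API. The sample records, the column layout (id, name, city, age), and the names `SelectProject`, `fields`, and `romansNameAndAge` are all assumptions made up for this example.

```scala
import java.util.regex.Pattern

// Plain Scala collections, NOT the Spark API. Data and names are hypothetical.
object SelectProject {
  // Hypothetical tab-separated records: id, name, city, age.
  val tsvLines: Seq[String] = Seq(
    "1\tAlice\tRome\t34",
    "2\tBob\tMilan\t19",
    "3\tCarla\tRome\t51"
  )

  // Split one line on a literal delimiter. Pattern.quote stops characters
  // such as '|' from being read as regex syntax; the -1 limit keeps
  // trailing empty fields instead of silently dropping them.
  def fields(line: String, delimiter: Char): Vector[String] =
    line.split(Pattern.quote(delimiter.toString), -1).toVector.map(_.trim)

  // Selection: keep only rows whose city column equals "Rome".
  // Projection: keep only the name and age columns of those rows.
  // Note: toInt throws on non-numeric input; the sample data is clean.
  def romansNameAndAge(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .map(fields(_, '\t'))
      .filter(cols => cols.length == 4 && cols(2) == "Rome")
      .map(cols => (cols(1), cols(3).toInt))

  def main(args: Array[String]): Unit =
    romansNameAndAge(tsvLines).foreach { case (name, age) =>
      println(s"$name is $age")
    }
}
```

On a cluster the same filter-then-map shape would be applied to a distributed data set instead of a local `Seq`; switching from TSV to CSV or PSV only changes the delimiter passed to `fields`.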

12.5. How to Process a CSV File - Scala Cookbook [Book]

Combine Recipe 12.1 with Recipe 1.3.

Given a simple CSV file like this named finance.csv:

    January, 10000.00, 9000.00, 1000.00
    February, 11000.00, 9500.00, 1500.00
    March, 12000.00, 10000.00, 2000.00

you can process the lines in the file with the following code:

    object CSVDemo extends App {
      println("Month, Income, Expenses, Profit")
      val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
      for (line <- bufferedSource.getLines) {
        val cols = line.split(",").map(_.trim)
        // do whatever you want with the columns here
        println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
      }
      bufferedSource.close
    }

The magic in that code is this line:

    val cols = line.split(",").map(_.trim)

It splits each line using the comma as a field separator character, and then uses the map method to trim each field to remove leading and trailing blank spaces.
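Building on the split-and-trim step, one possible next step is to turn each line into a typed value and report bad lines instead of throwing. The sketch below is not from the recipe: the `FinanceRow` case class, the `parseLine` helper, and the choice of `Either[String, FinanceRow]` are assumptions picked for illustration, and it works on in-memory strings so it runs without a file.

```scala
import scala.util.Try

// A sketch, not the Cookbook's code. FinanceRow and parseLine are hypothetical.
object FinanceRows {
  // One row in the finance.csv layout: month, income, expenses, profit.
  final case class FinanceRow(
      month: String,
      income: BigDecimal,
      expenses: BigDecimal,
      profit: BigDecimal
  )

  // Parse a single line. Right holds the typed row; Left holds a short
  // description of why the line was rejected.
  def parseLine(line: String): Either[String, FinanceRow] = {
    // Same split-and-trim step as the recipe; -1 keeps trailing empty fields.
    val cols = line.split(",", -1).map(_.trim)
    if (cols.length != 4)
      Left(s"expected 4 columns, found ${cols.length}: $line")
    else
      // BigDecimal(...) throws on non-numeric text; Try turns that into a Left.
      Try(FinanceRow(cols(0), BigDecimal(cols(1)), BigDecimal(cols(2)), BigDecimal(cols(3))))
        .toOption
        .toRight(s"non-numeric amount in: $line")
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "January, 10000.00, 9000.00, 1000.00",
      "February, 11000.00, oops, 1500.00"
    )
    lines.map(parseLine).foreach {
      case Right(row) => println(s"${row.month}: profit ${row.profit}")
      case Left(err)  => println(s"skipped: $err")
    }
  }
}
```

Because this still relies on a plain `split(",")`, a quoted field that contains a comma (for example `"Smith, J"`) is cut in two and the line is rejected as having the wrong column count. Handling quoting correctly is the main reason to reach for an established CSV library.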