Data Import/Export

R Data Import/Export. Export: R package for streamlined export of graphs and data tables. Dealing with a Byte Order Mark (BOM) Encoding Detective Story: Turning Tableau & Tidyverse Tears Into Smiles with Base R. Fstpackage. Vroom: An experiment with lazily reading indexed files. Vroom 1.0.0. I’m excited to announce that vroom 1.0.0 is now available on CRAN!

vroom 1.0.0

Vroom reads rectangular data, such as comma separated (csv), tab separated (tsv) or fixed width files (fwf) into R. It performs similar roles to functions like readr::read_csv(), data.table::fread() or read.csv(). Vroom: Read and Write Rectangular Text Data Quickly. Sqlite, feather, and fst. I don’t think I’m unusual among statisticians in having avoided working directly with databases for much of my career.

sqlite, feather, and fst

The data for my projects have been reasonably small. (In fact, basically all of the data for my 20 years of projects are on my laptop’s drive.) Flat files (such as CSV files) were sufficient. Apache Arrow. Feather V2 with Compression Support in Apache Arrow 0.17.0. Back in October 2019, we took a look at performance and file sizes for a handful of binary file formats for storing data frames in Python and R.

Feather V2 with Compression Support in Apache Arrow 0.17.0

These included Apache Parquet, Feather, and FST. In the intervening months, we have developed “Feather V2”, an evolved version of the Feather format with compression support and complete coverage for Arrow data types. In this post, we explain what Feather V2 is and why you might find it useful. We also revisit the benchmarks from six months ago to show how compressed Feather V2 files compare, demonstrating that they can be even faster than Parquet to read and write. We also discuss some of the situations in which using Parquet or Feather may make more sense. Wes and Hadley developed the original Feather format (“Feather V1”) early in 2016 as a proof of concept of using Arrow for fast, interoperable frame storage. Additionally, Feather V2 supports incremental and chunked writes. Fannie Mae Loan Performance. Package readOffice. Docxtractr. Disk.frame: Fast disk-based parallelized data manipulation framework for larger-than-RAM data. Haven 2.2.0. We’re delighted to announce that haven 2.2.0 is now on CRAN. haven enables R to read and write various data formats used by other statistical packages by wrapping the ReadStat C library written by Evan Miller.

haven 2.2.0

Write_csv is changing all times/dates to UTC. R: the Excel Connection. By Andy Nicholls, Head of Consulting. As companies increasingly look beyond the scope of what is logistically possible in Excel, more and more of them are approaching Mango looking for help with connecting to Excel from R.

R: the Excel Connection

With over 6,500 packages now on CRAN it should come as no surprise that there are quite a few packages that have been written in order to connect to Excel from R. So which is the best? Writexl. Quickly export multiple R objects to an Excel Workbook. Working with a business audience, I am frequently called upon to send analytic results to clients in the form of Excel Workbooks.

Quickly export multiple R objects to an Excel Workbook

The xlsx package facilitates exporting tables and datasets to Excel, but I wanted a very simple function that would let me easily export an arbitrary number of R objects to an Excel Workbook in a single call. Each object should appear in its own worksheet, and the worksheets should be named after their objects. Specifically, the function should save the R objects mtcars (a data frame), Titanic (a table), AirPassengers (a time series) and state.x77 (a matrix) to the workbook myworkbook.xlsx.
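As a rough illustration of the goal described above, the following sketch writes the four objects to one workbook, one worksheet per object. It is not the author's function: it swaps in the writexl package (listed earlier on this page) for xlsx, converts each object to a data frame by hand, and writes to a temporary directory. The column names `model`, `time`, `passengers` and `state` are made up for the example.

```r
# Sketch only: writexl is swapped in for the xlsx package named in the text.
# write_xlsx() takes a data frame or a named list of data frames, so each
# built-in object is converted first. The list names become worksheet names.
sheets <- list(
  # writexl does not write row names, so keep them as a regular column.
  mtcars        = cbind(model = rownames(mtcars), mtcars),
  Titanic       = as.data.frame(Titanic),
  AirPassengers = data.frame(
    time       = as.numeric(time(AirPassengers)),
    passengers = as.numeric(AirPassengers)
  ),
  state.x77     = cbind(state = rownames(state.x77), as.data.frame(state.x77))
)

# Assumption: a temporary directory is used here to avoid cluttering the
# working directory; the file name comes from the text.
out <- file.path(tempdir(), "myworkbook.xlsx")

# Only attempt the export when writexl is installed.
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(sheets, path = out)
}
```

Building a named list first keeps the conversion step separate from the export step, so the same list could be handed to a different Excel writer if preferred.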

Each object should be in its own worksheet and the worksheet should take on the name of the object. Tidyxl: Read untidy Excel files in R. Going from a human readable Excel file to a machine-readable csv with tidyxl. September 11, 2018 I won’t write a very long introduction; we all know that Excel is ubiquitous in business, and that it has a lot of very nice features, especially for business practitioners that do not know any programming.

Going from a human readable Excel file to a machine-readable csv with tidyxl

However, when people use Excel for purposes it was not designed for, it can be a hassle. Often, people use Excel as a reporting tool, which it is not; they create very elaborate and complicated spreadsheets that are human readable, but impossible to import into any other tool. In this blog post (which will probably be part of a series), I show you how you can go from this: Import ragged data with readr. Standard tools like readr::read_csv() can cope to some extent with unusual inputs, like files with empty rows or newlines embedded in strings.

Import ragged data with readr

But some files are so whacky that standard tools don’t work at all, and instead you have to take the file to pieces and reassemble it in a standard design. The readr package has recently acquired a set of tools for taking a file to bits. They are the melt_*() family. The melt_*() family separates delimited text files into individual cells. So “melt” isn’t quite the right name – it should be “disassemble” because it’s about separating the pieces, but “melt” is much shorter to type.
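To make the idea of separating a file into individual cells concrete, here is a minimal base R sketch. It is not readr's implementation and does not call the melt_*() functions; it only imitates the shape of the result (one row per cell, with its row and column position). The input lines are invented, and the comma splitting is naive: it ignores quoting and drops trailing empty fields.

```r
# Hypothetical ragged input: each line has a different number of fields.
ragged <- c("a,b,c", "1,2", "x,y,z,w")

# Naive split on commas (no handling of quoted fields, unlike readr).
fields <- strsplit(ragged, ",", fixed = TRUE)

# One row per cell, recording where each value sat in the original file.
cells <- data.frame(
  row   = rep(seq_along(fields), lengths(fields)),
  col   = sequence(lengths(fields)),
  value = unlist(fields),
  stringsAsFactors = FALSE
)

print(cells)
```

Once the file is in this long form, cells can be filtered or regrouped by `row` and `col` before being reassembled into a regular table.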

Xltabr: writing formatted crosstabs to Excel using openxlsx. Warning: xltabr is in early development.

xltabr: writing formatted crosstabs to Excel using openxlsx

Please raise an issue if you find any bugs. Rsheets/jailbreakr: Get out of Excel free. When you encode data as cell formatting in Excel. I recently offered to help create the game cards for a mammalogy-themed trivia board game that will be made available later in the year.

When you encode data as cell formatting in Excel

The questions and answers had already been prepared and they were stored in an Excel file. When it was first described to me, the data structure seemed sensible: one worksheet per topic, and one row per question, followed by the possible answers on the same row. All I had to do was wrangle the questions and answers into little tables with one question from each topic and put them in MS Word documents that would then be given to a graphic designer at the print shop. Remove password protection from Excel sheets using R. How to Use googlesheets to Connect R to Google Sheets. Often I use R to handle large datasets, analyze the data and filter out the data I don’t need. When all this is done, I usually use write.csv() to print my data off and reopen it in Google Sheets. My workflow would look something like this: Manage Google Spreadsheets from R using the Sheets API V4. Moving data between R, Excel, and the Windows clipboard. These notes explain how to move data between R and Excel and other Windows applications via the clipboard. writeClipboard: R has a function writeClipboard that does what the name implies.

However, the argument to writeClipboard may need to be cast to a character type. For example the code. Datapasta. Release 'open' data from their PDF prisons using tabulizer. There is no problem in science quite as frustrating as other peoples' data. Whether it's malformed spreadsheets, disorganized documents, proprietary file formats, data without metadata, or any other data scenario created by someone else, scientists have taken to Twitter to complain about it.

As a political scientist who regularly encounters so-called "open data" in PDFs, this problem is particularly irritating. PDFs may have "portable" in their name, making them display consistently on various platforms, but that portability means any information contained in a PDF is irritatingly difficult to extract computationally. Extracting the Data from Images of Plots with magick.

Prior to the era of reproducible research, it was quite common for published graphs, charts, and other figures to be released solely as static images such as PNGs or JPEGs. Often this is done without accompanying code and without the plot data available as a separate download, making it difficult to either reproduce or validate the findings. We’ve talked about the virtues of the magick package in the past, and it turns out magick provides us with a way of extracting data from images.

The exact details vary depending on the properties of the plot, including its saturation, lightness, and hue, but some general themes emerge. We wanted to briefly document one particular instance of this problem. Reading tables from images with magick. ImageMagick is a robust and comprehensive open-source image processing library, and per the official docs: Use ImageMagick® to create, edit, compose, or convert bitmap images. It can read and write images in a variety of formats (over 200) including PNG, JPEG, GIF, HEIC, TIFF, DPX, EXR, WebP, Postscript, PDF, and SVG.

Generating codebooks in R. By: Claus Thorn Ekstrøm, 3 March 2018.