Myths and Legends About Big Data / VimpelCom (Beeline) company blog

One of our clusters for pilot tasks (Data node: 18 servers /2 CPUs, 12 Cores, 64GB RAM/, 12 Disks, 3 TB, SATA, HP DL380g)

- What is Big Data, anyway?

Everyone knows that it means processing huge volumes of data. But working with an Oracle database of, say, 20 gigabytes or 4 petabytes is not yet Big Data; it is simply a high-load database.

- So what is the key difference between Big Data and "ordinary" high-load systems? The ability to build flexible queries.

- Where does this new load come from?

- Is there an example of such a task?

- And how is this solved?

- So let's just scale them, and the problem will go away?

- So what do you end up with?

- But isn't that monstrously slow? Short queries with a small number of joins.

- What are some well-known examples of Big Data use?

Top 30 DSC blogs, based on new scoring technology. 20 short tutorials all data scientists should read (and practice).

How to Become a Data Scientist

These days you can get a degree in data science, so you can show a diploma that certifies your credentials. But these degrees are relatively new, so, with all due respect, if you only recently got yours you are still a beginner. Those of us who use this title today most likely came from a combination of backgrounds in business, hard science, computer science, operations research, and statistics. What you call yourself is one thing, but what your employer or client is looking for can be quite a different kettle of fish. A lot has been written about data scientists being as elusive as unicorns. Not being a unicorn, I'd say this sets the bar pretty high. All of this confusion over what we're called and what we actually do can make you downright schizophrenic.

Four Types of Data Scientists

The information here comes from the O'Reilly paper "Analyzing the Analyzers" by Harris, Murphy, and Vaisman, 2013.

There are 40 pages of good analysis here, so this will be only the highest-level summary.

38 Seminal Articles Every Data Scientist Should Read.

Data Science Cheat Sheet

I will update this article regularly. An old version can be found here and has many interesting links. None of the material presented here is in the old version. This article is divided into 11 sections.

1. Hardware

A laptop is the ideal device. Even if you work heavily on the cloud (AWS, or in my case, access to a few remote servers mostly to store data, receive data from clients and backups), your laptop is your core device to connect to all external services (via the Internet).

2.

Once you have installed Cygwin, you can type commands or execute programs in the Cygwin console.

Figure 1: Cygwin (Linux) console on Windows laptop

You can open multiple Cygwin windows on your screen(s). To connect to an external server for file transfers, I use the Windows FileZilla freeware rather than the command-line ftp offered by Cygwin. You can run commands in the background using the & operator:

$ notepad VR3.txt &

Mining Massive Datasets. Deep Web. Microsoft BizTalk Server. Google Translate.

Big Data Technology Suite of Cloud Services

Big Data doesn't need to be so hard. We provide value faster and with less complexity with a cloud services approach. Infochimps™ Cloud is a suite of cloud services that makes it faster and far less complex to develop and deploy Big Data applications.

Our cloud services handle all of the complex Big Data technologies and processes, giving you a simple, developer-friendly interface. Infochimps Cloud lets you focus on creating the applications that will drive value for your business instead of spending your time managing a Big Data "infrastructure stack."

Cloud::Streams: Streaming data and real-time analytics
Cloud::Queries: NoSQL database and ad hoc, query-based analytics
Cloud::Hadoop: Elastic Hadoop clusters and batch analytics

Infochimps Cloud eliminates all the implementation headaches caused by Big Data, enabling your Big Data applications to be completed quickly and fully achieve their objectives.

Flexible, cost-effective cloud deployment.

