Basics of Big Data – Part 2 – Hadoop

As discussed in Part 1 of this series, Hadoop is foremost among the tools currently used for deriving value out of Big Data. The process of gaining insights from data through Business Intelligence and analytics essentially remains the same. However, with the huge variety, volume and velocity (the 3Vs of Big Data), it has become necessary to re-think the data management infrastructure. Hadoop, originally designed to be used with the MapReduce algorithm to solve parallel processing constraints in the distributed architectures (e.g. web indexing) of web giants like Yahoo or Google, has become the de-facto standard for Big Data (large-scale, data-intensive) analytics platforms.

What is Hadoop?

Think of Hadoop as an operating system for Big Data. It is essentially a flexible and available architecture for large scale computation and data processing on a network of commodity hardware.

Conceptually, the key components of the Java-based Hadoop framework are a file store and a distributed processing system:

1. Hadoop Distributed File System (HDFS): provides reliable (fault-tolerant), scalable, low-cost storage

2. MapReduce: Batch-oriented, distributed (parallel) data processing system with resource management and scheduling

As of October 2013, the 2.x GA release of Apache Hadoop also included an enhancement – a key third component:

3. YARN: a general purpose resource management system for Hadoop to allow MapReduce and other data processing services
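The split between storage and processing becomes clearer with MapReduce's canonical word-count example. The sketch below simulates the map, shuffle and reduce phases in plain Python on one machine; on a real cluster, Hadoop runs the map tasks on the nodes holding the HDFS blocks and the framework performs the shuffle. The function names here are illustrative, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Two "input splits", standing in for HDFS blocks on different nodes.
splits = ["big data is big", "hadoop processes big data"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

The same three-phase structure underlies every MapReduce job; only the map and reduce functions change from one problem to the next.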

Hadoop architecture stack

Open-source Hadoop is an Apache project. There are, however, commercial distributions of Hadoop (similar to UNIX distros), most notably from Cloudera, Hortonworks, MapR, IBM and Amazon. The Hadoop ecosystem has several projects in development seeking to enhance the Hadoop framework and make it better suited to performing Big Data tasks, including ETL and analytics.

The key components of the Hadoop distribution:

1. Distributed file system and storage – HDFS

HDFS – a Java based file system providing scalable and reliable data storage, designed to span large clusters of commodity servers

2. Data integration – Flume, Sqoop

Flume – service for integrating large amounts of streaming data (e.g. logs) into HDFS

Sqoop – tool for transferring bulk data between Hadoop and structured databases e.g. RDBMSes

3. Data access – HBase, Hive, Pig, Impala (CDH version for interactive SQL query), Storm, MapReduce jobs in Java/Python etc.

HBase – a non-relational (NoSQL) columnar database running on top of HDFS.

Hive – a data warehouse infrastructure built on Hadoop, providing a mechanism to project structure onto the data and query it using an SQL-like language – HiveQL

Pig – allows writing complex MapReduce jobs using a scripting language – PigLatin

Impala – SQL query engine running natively in Hadoop, allows querying data in HDFS and HBase. It is part of Cloudera’s CDH distribution.

Storm – provides real-time data processing capabilities to Hadoop which is traditionally batch oriented (based on MapReduce).

4. Operations – Oozie, Ambari, ZooKeeper

Oozie – Java web application used to schedule Hadoop jobs

Ambari – Framework and tools to provision, manage and monitor Hadoop clusters

ZooKeeper – provides operational services for Hadoop, e.g. a distributed configuration service, a naming registry and a synchronization service

5. Resource management – YARN

YARN – separates the resource management and processing components in Hadoop 2.x, which in Hadoop 1.x were both handled within the MapReduce packages

A schematic of Cloudera’s Hadoop distribution (CDH) is shown below:


Why Hadoop?

Hadoop has gained immense traction in a very short amount of time and is proving useful in a range of applications, including deriving insights from Big Data analytics.

The key advantages of Hadoop as a data processing platform are:

1. Scalability and availability – Due to its ability to store and distribute extremely large datasets across hundreds of inexpensive servers operating in parallel, Hadoop offers extreme scalability. With the high-availability HDFS feature in Hadoop 2.0 providing redundant NameNodes for standby and failover, Hadoop now also provides high availability.

2. Cost-effectiveness – Due to its design incorporating fault-tolerance and scale-out architecture, Hadoop clusters can be built with relatively inexpensive commodity hardware instead of costly blade servers, thereby providing great savings for storage and computing abilities on a per TB basis.

3. Resilience - With built-in fault tolerance, e.g. multiple copies of data replicated on cluster nodes, and with high availability HDFS in version 2.0, Hadoop provides cost-effective resilience to faults and data loss.

4. Flexibility and performance – Hadoop can access and store various types of data, both structured and unstructured, with no schema-on-write constraints. New ways of accessing and processing the data continue to emerge, e.g. Storm for real-time/streaming data, and SQL-like tools including Impala, Hadapt and Stinger.
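The cost-effectiveness point can be made concrete with some capacity arithmetic. HDFS replicates each block three times by default, so usable capacity is roughly a third of raw capacity. The sketch below estimates raw storage and node count for a target dataset; the per-node disk size and hardware cost are illustrative assumptions, not vendor figures.

```python
def cluster_estimate(usable_tb, replication=3, tb_per_node=24, cost_per_node=4000):
    """Estimate raw capacity, node count and hardware cost for an HDFS cluster.

    Illustrative assumptions: default HDFS replication of 3, commodity nodes
    with 24 TB of disk at $4,000 each. Real sizing also budgets for temp
    space, OS overhead and growth.
    """
    raw_tb = usable_tb * replication
    nodes = -(-raw_tb // tb_per_node)  # ceiling division
    return {"raw_tb": raw_tb, "nodes": nodes, "hardware_cost": nodes * cost_per_node}

print(cluster_estimate(100))  # 100 TB usable -> 300 TB raw -> 13 nodes
```

Swapping in blade-server or SAN pricing for `cost_per_node` is what makes the commodity-hardware savings visible on a per-TB basis.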

Due to these key advantages, Hadoop lends itself to several data processing use cases. Key use cases are:

1. Data store / Enterprise data warehouse (EDW) – cost-effective storage for all of an organization’s ever expanding data

2. Active archive – allowing cost-effective querying on historical data from archival systems

3. Transformation – executing data transformations (T step of ETL/ELT) for improved throughput and performance

4. Exploration – allows fast exploration and quicker insights from new questions and use cases, taking advantage of Hadoop’s schema-on-read model instead of schema-on-write models of traditional relational databases

5. Real-time applications – usage of flexible add-ons like Storm to provide dynamic data mashups

6. Machine learning, data mining, predictive modeling and advanced statistics
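The exploration use case rests on Hadoop's schema-on-read model: raw records are stored exactly as they arrive, and structure is projected only when a question is asked. The sketch below imitates that idea with JSON lines in plain Python (the event fields are invented for illustration); asking a new question means writing a new projection, not redesigning tables.

```python
import json

# Raw events land in storage exactly as generated -- no upfront schema.
raw_events = [
    '{"user": "alice", "action": "click", "ms": 120}',
    '{"user": "bob", "action": "view"}',                # missing field is fine
    '{"user": "alice", "action": "click", "ms": 95}',
]

def project(lines, fields):
    """Schema-on-read: apply a projection at query time, tolerating gaps."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)

# Today's question: click latencies per user.
clicks = [(u, ms) for u, a, ms in project(raw_events, ["user", "action", "ms"])
          if a == "click"]
print(clicks)  # [('alice', 120), ('alice', 95)]
```

A schema-on-write system would have rejected the second record at load time; here it simply yields `None` for the missing field.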

The early adopters of Hadoop are the web giants like Facebook, Yahoo, Google, LinkedIn, Twitter etc.

Facebook uses Hadoop – Hive and HBase – for data warehousing (over 300 PB in aggregate, with over 600 TB of daily data inflows) and for real-time applications, serving up dynamic pages customized for each of its over 1.2 billion users.

Yahoo uses Hadoop and Pig for data processing and analytics, web search, email antispam and ad serving, with more than 100,000 CPUs in over 40,000 servers running Hadoop and 170 PB of storage.

Google used MapReduce to create its web index from crawl data, and now also supports Hadoop clusters on its cloud platform with Google Compute Engine (GCE).

LinkedIn uses Hadoop for data storage and analytics driving personalized recommendations like “People you may know” and ad targeting.

Twitter uses Hadoop – Pig and HBase for data visualization, social graph analysis and machine learning.

Limitations of Hadoop

While Hadoop is the most well-known Big Data solution, it is just one component of the Big Data landscape. In theory, Hadoop is infinitely scalable and resilient and allows great flexibility in storing structured and unstructured data; in practice, its inherent limitations mean several considerations must be addressed while architecting Hadoop clusters.

1. Workloads – Hadoop is suitable for various types of workloads, but mixed workloads, or situations where the workload varies widely or is not known ahead of time, make it difficult to optimize the Hadoop architecture.

2. Integration – Hadoop should not be a stand-alone solution, else it will quickly become a data silo unconnected with the rest of the data management infrastructure. The Hadoop strategy needs to fit into the organization's overall data management and processing framework, to allow for growth and maintenance without sacrificing flexibility and agility.

3. Security – In the enterprise, security is a big deal. Hadoop was originally built without a security model, and the ecosystem is still evolving various projects for security, including Kerberos authentication, the Sentry offering from Cloudera, Project Rhino from Intel, Apache Knox as a reverse proxy (with contributions from Hortonworks) and Apache Accumulo for cell-level security. However, most are complex to set up, and there is still no reference standard across deployments.

4. Complexity – The complexity of Hadoop as a Big Data platform lies in its evolving ecosystem of newer technologies. Most data warehousing and analytics specialists are skilled in traditional relational databases and SQL, techniques which are difficult to apply on Hadoop due to the lack of tools (e.g. still-evolving SQL access), and Hadoop also demands additional skills such as data mining and advanced statistical techniques.

5. Availability – Up until the 2.0 release, Hadoop's single-master nodes in HDFS and MapReduce were single points of failure.

6. Inefficiency – HDFS is inefficient at handling small files, making analysis on smaller datasets extremely inefficient. This is especially painful while designing models or finding patterns on smaller datasets. MapReduce is also a batch-oriented architecture unsuitable for real-time access, though this is being addressed with tools like Storm. Tools like Impala provide interactive SQL-like querying on HDFS, which helps speed up ad-hoc analysis on smaller datasets.

7. Processing framework – Not every data processing problem or analytic question can be designed within the MapReduce framework. Hadoop is therefore ill-suited for problems which cannot be expressed as Map and Reduce steps and need other data processing paradigms. Improvements are being developed to get around these issues, e.g. Storm for real-time access, and Spark for improving data analytics performance with in-memory distributed computing.
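The small-files inefficiency is easy to quantify. The HDFS NameNode keeps every file, directory and block as an in-memory object, and a commonly cited rule of thumb puts each object at roughly 150 bytes; the sketch below compares storing the same terabyte as many small files versus a few large ones under that assumption (the figures are order-of-magnitude estimates, not a sizing guide).

```python
BYTES_PER_NAMENODE_OBJECT = 150   # commonly cited rule of thumb, not exact
BLOCK_SIZE = 128 * 1024**2        # 128 MB, a typical HDFS block size

def namenode_memory(total_bytes, file_size):
    """Rough NameNode memory to store total_bytes as file_size-sized files."""
    files = total_bytes // file_size
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    objects = files * (1 + blocks_per_file)  # one file object + its blocks
    return objects * BYTES_PER_NAMENODE_OBJECT

one_tb = 1024**4
small = namenode_memory(one_tb, 1024**2)   # a million 1 MB files
large = namenode_memory(one_tb, 1024**3)   # a thousand 1 GB files
print(small // large)  # small files cost ~2 orders of magnitude more memory
```

Since every file occupies at least one block object regardless of its size, a terabyte of tiny files bloats the NameNode while a terabyte of large files barely registers.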

In the next parts of this series, I will explore building a Hadoop data warehouse, Big Data analytics with tools like R, as well as other Big Data solutions, Hadoop enhancements and alternatives to Hadoop.

Basics of Big Data – Part 1

You can’t miss all the buzz about Big Data! Over the past few years, the buzz around the cloud and Big Data shaping much of the future of computing – IT, and analytics in particular – has grown incessantly strong. As with most buzzwords hijacked by marketing to suit their own products’ storylines, which nonetheless manage to confuse business users and IT staff alike, Big Data means several things to several people.

So what exactly is Big Data?

Big Data refers to enormously large datasets that are challenging to store, search, share and analyze with conventional data storage and processing systems. While these challenges remain, our ability to generate such large datasets has grown exponentially over the past few years. With the march of the digital age and the growing popularity of social networking, the amount of data generated today is growing enormously – not only within public domains like Google, Facebook, YouTube and Twitter, but also within organizations, where the data generated by ever more powerful servers, software and applications far exceeds our capacity to effectively analyze and make sense of it.

The table below shows the growth of data and the new terminology that has been coined to address the growth of data volumes.

Amount of data        Unit                             Storage media / real-world analogy
10^3 bytes            Kilobytes (kB)                   1.44 MB high-density floppy disks; files, folders
10^6 bytes            Megabytes (MB)                   Disks, tape; folders, cabinets
10^9 – 10^12 bytes    Gigabytes (GB) – Terabytes (TB)  Disk arrays
10^15 bytes           Petabytes (PB)
10^18 bytes           Exabytes (EB)
10^21 bytes           Zettabytes (ZB)
10^24 bytes           Yottabytes (YB)

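As a quick aside, the decimal prefixes in the table can be computed mechanically. A minimal helper (a sketch using the decimal powers of 1000, as the table does, rather than the binary powers of 1024 that operating systems often report):

```python
UNITS = ["bytes", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_bytes(n):
    """Render a byte count using the decimal (powers-of-1000) prefixes above."""
    unit = 0
    value = float(n)
    while value >= 1000 and unit < len(UNITS) - 1:
        value /= 1000.0
        unit += 1
    return f"{value:g} {UNITS[unit]}"

print(human_bytes(300 * 10**12))  # "300 TB"
print(human_bytes(10**21))        # "1 ZB"
```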

Volume, Velocity, Variety & Value

The 3Vs of Big Data – volume, velocity and variety – were popularized by Gartner’s analysis. Gartner defines Big Data as “high volume, velocity, and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” What this essentially means is that Big Data, beyond its high volumes, moves too fast and is not always structured according to conventional database architectures. For example, multimedia content uploaded on YouTube, or comments on Twitter or Facebook, coupled with the velocity at which it is generated and churned, is obviously not in a structured format suited to conventional data processing. To analyze and gain insights from this data – rather, to derive “value” – requires a wholly different approach.

Old wine in new bottle? Data mining and Big Data Analytics

A modeling approach is required to derive value out of Big Data: a hypothesis is proposed, and statistical models are created and then validated or updated using data. It is interesting to note that this approach bears substantial resemblance to, even overlap with, “data mining”. For those unfamiliar with the term – the more obscure and geeky part of business intelligence and information management – “data mining” is the process of discovering patterns in large datasets, usually in data warehouses, using several or all of the methods and tools of statistics, databases, machine learning and artificial intelligence. This hypothesize-model-validate-and-refine approach to deriving value out of Big Data could be manual, with help from data analysts or specialists (data scientists), but could also be “machine-based”, depending on adaptive machine learning. It is important to understand that, due to the velocity of the data, the algorithm for deriving value could be short-lived, and actions based on the insights may need to be implemented rather quickly.
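The hypothesize-model-validate-refine loop can be sketched in a few lines: propose a model family (the hypothesis), fit it on one slice of the data, and check it on a held-out slice before acting on it. The example below does this with an ordinary least-squares line using only the Python standard library; the synthetic data stands in for a real feed.

```python
import random

random.seed(7)

# Synthetic stand-in for a real data feed: y ~ 2x + 1 plus noise.
data = [(x, 2 * x + 1 + random.gauss(0, 0.5)) for x in range(100)]
train, holdout = data[:80], data[80:]

def fit_line(points):
    """Least-squares fit of y = a*x + b (the proposed model/hypothesis)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

def mean_abs_error(points, a, b):
    """Validate: how far off is the model on data it has not seen?"""
    return sum(abs(y - (a * x + b)) for x, y in points) / len(points)

a, b = fit_line(train)
error = mean_abs_error(holdout, a, b)
# If the holdout error is acceptable, act on the model; otherwise refine it.
print(round(a, 1), error < 2.0)
```

At Big Data velocity the same loop simply runs continuously: the model is refit on fresh data and retired as soon as its holdout error degrades.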

As an illustration, consider minor changes made by Google to the algorithms that serve ads in its search results: collecting a large dataset based on usage for a limited period (maybe a few hours) and analyzing it to understand user response to specific AdWords across dimensions like demographics, geolocation, events and timing can provide Google valuable insights on how to tweak its algorithms to serve advertising and generate the maximum revenue from it.

The rise of the “Data Scientist”

While the most popular data mining example remains the myth about Walmart correlating beer purchases with diapers, there are celebrated stories about statisticians at other retailers like Target using data mining to understand customers’ buying patterns, helping focus marketing efforts on target customers. Organizations now understand that there is hidden value in the Big Data haystack which can be leveraged for competitive advantage. With the amount of adaptive machine learning and automation available, it was argued even a few years ago that theory and modeling by analysts might not be needed at all – that the sheer volume of Big Data was sufficient to measure and predict patterns. However, it has slowly been understood that, due to the breadth and depth of data to be examined, and to ensure correlation between the business context and the data being analyzed, humans have a key role to play in making sense of Big Data. Though not exclusively tied to Big Data projects, the data scientist role is the new role envisaged for analyzing data across multiple sources and delivering insights related to business problems. In essence, a data scientist is a marriage of two roles: the business analyst and the data mining statistician. A data scientist typically has a similar background to a data analyst, being trained in statistics, mathematics, modeling and analytics, as well as having the strong business acumen and communication skills to convey highly technical analyses in business terms to both business and IT leadership. The data scientist has been variously described as anything from an awesome nerd to having the sexiest job of the 21st century.

The tools for Big Data analytics

As with any new technology and its hype cycle, enterprises are careful to calibrate their involvement before joining the Big Data bandwagon. It is no wonder that the organizations which generate most of the Big Data in the public domain, e.g. social networks like Facebook or Google, also make the most use of analytics to derive value from it – Google, for example, used Big Data to identify that its famous hiring puzzles and brain-teasers were useless. Facebook currently collects 320 TB of data each day working with Hadoop and Hive, and is adding a relational data warehouse of around 250 PB. In fact, a lot of the open-source tools and technologies in use today have been developed by these organizations. These include Hadoop, an open-source Apache project allowing storage of enormous datasets across distributed clusters of servers and running distributed analysis applications on those clusters. It utilizes the MapReduce parallel processing programming model, originally developed by Google to process large datasets on a cluster or distributed clusters.

I will explore details of Hadoop, Hive, HDFS, NoSQL, Spark, Impala and other technologies related to Big Data in the second part of this post. I will also explore the unique challenges of Big Data architecture, including rapid use, rapid interpretation, the lack of a schema to sync across tools and other data, data quality and governance, and integrating Big Data analytics into the enterprise architecture.

BI maturity models

With most companies listing BI within their top agenda, and with the rising costs and confusion around proving the worth of BI and justifying its costs, it makes sense to try and understand the evolution of BI adoption and maturity in organizations. Knowing what is possible with BI and knowing the challenges and pitfalls allows organizations to plan their BI strategy and implementation.

There are quite a few schools of thought, and a fair amount of available literature, on the lifecycle of BI implementation and maturity in organizations. Most are proprietary models provided by consultancies, which are primarily based on a technical point of view, or which apply the knowledge management function to BI, following the Ladder of Business Intelligence (LOBI) model.

LOBI comprises 6 levels of maturity, moving up the knowledge management value chain: Facts > Data > Information > Knowledge > Understanding > Enabled Intuition.

There are several other models in the public domain e.g.

  • Business Information maturity model
  • AMR Research’s BI/Performance management maturity model
  • Business Intelligence development model
  • Business Intelligence maturity hierarchy
  • Infrastructure optimization maturity model

I’ll not go into the details of the models above, but discuss three of the more popular and well-documented models available.

1. The TDWI BI Maturity Model

The Data Warehousing Institute (TDWI) is a premier body in the field of BI and data warehousing, and proposes a six-stage BI maturity model. The underlying assumption is that BI implementation in organizations typically evolves from a low-value cost-centre operation to a high-value strategic utility providing competitive advantage.

Stage 1: Prenatal – Executive perception is that of a cost-center, which primarily churns out static reports for management operational reporting. It is also the stage which costs the most.

Stage 2: Infant – The BI function’s role is to inform executives, with several reports leading to “spreadmarts”.

A ‘Gulf‘ separates Stage 2 and Stage 3.

Stage 3: Child – The BI function’s role is perceived to empower workers, and this is the first evolution into an analytical system where OLAP and ad-hoc reports are used off data marts.

Stage 4: Teenager – The BI function has evolved into a performance monitoring system by now, using Dashboards and Scorecards, supported by data warehouses.

A ‘Chasm‘ separates Stage 4 and Stage 5.

Stage 5: Adult – This is where the ROI from the BI function shoots up, with predictive analytics answering what-if questions, making BI a strategic utility. The TDWI holds that organizations’ BI architecture has evolved to include an enterprise DW by now, with BI becoming a ‘Drive the Business’ function.

Stage 6: Sage – The BI function at this stage has the highest ROI and decreasing costs, based on Analytic Services (SOA) with pervasive BI (e.g. embedded BI), making it ‘Drive the market’.

2. The HP Business Intelligence Maturity Model

It has 5 stages based on the evolution of Business enablement, Information technology and program management.

Stage 1 – Operation (Running the business) – involves ad-hoc solutions focused on project activities alone

Stage 2 – Improvement (Measuring and monitoring the business) – involves localized solutions with project management

Stage 3 – Alignment – includes shared resources with program management and governance integrating performance management and BI programs

Stage 4 – Empowerment – includes enterprise operationalization with portfolio management focusing on organization innovation and people productivity through knowledge management

Stage 5 – Transformation (Change the business) – involves enterprise services tracked by service management creating strategic agility and differentiation

3. Gartner BI Maturity Model

Gartner, the IT research and advisory group’s BI maturity model is based on 3 key areas of assessment – people, processes and metrics. It has 5 maturity levels:

Level 1 – Unaware – Spreadsheet and information anarchy, one-off report requests

Level 2 – Tactical – Usage limited to few executives with data inconsistency and stovepipe systems

Level 3 – Focused – A specific set of users realizes value, with focus on specific business needs and a BI competency centre (BICC) in place

Level 4 – Strategic – Business objectives drive the BI and performance management systems with well defined and enforced governance policies and standards

Level 5 – Pervasive – Use of BI is extended to suppliers and customers, information is trusted (holy grail of single version of truth) with analytics embedded in business processes

Barriers to BI adoption and maturity

The 3 models discussed above do a good job of explaining the continuum of maturity levels, even if that continuum makes it difficult to identify explicit stages. The common themes across them are:

  1. Each model has at least 5 stages of maturity – this is more than a simple 1-2-3. This indicates the path of BI evolution is longer and more complex than most think while jumping onto the BI bandwagon
  2. Each model starts with operational / one-off reporting and culminates in pervasive BI where BI is embedded in business processes and provides actionable insight for strategic advantage
  3. The models do not focus on technology alone and hinge on the involvement of people and process as well. Moving from one-off reporting to driving the enterprise involves big changes in organization culture and business processes and not just implementing the latest BI tool off the market.

The main barriers to BI adoption and demonstrating its worth as a strategic tool lies in its complexity. BI is a broad area encompassing both technical and non-technical aspects like people and process; therefore the models can only provide a prescriptive framework which needs to be adapted by each organization. It is important to understand that various departments of an organization can be at varying levels of maturity and not every organization follows the same trajectory of evolution or has to go through each stage.

It is noteworthy, however, that organizations trying to move from basic levels (e.g. TDWI Stage 2 – Infant) to higher levels (e.g. TDWI Stage 5 – Adult) may find it very difficult to leapfrog stages. In fact, regressing stages is also possible, e.g. due to mergers and acquisitions between organizations at different levels of maturity, where differences across people, processes and technology may be difficult or slow to reconcile. The TDWI model recognizes these difficulties as:

  • The Gulf – between stage 2 (Infant) and stage 3 (Child) – mainly due to differences in executive perception, data quality issues and spreadmart anarchy
  • The Chasm – between stage 4 (Teenager) and stage 5 (Adult) – mainly due to differences in executive perception, spreadmarts, and architectural inflexibility or lock-ins

It is important to take lessons from the BI maturity models and develop a BI strategy while planning to implement BI, rather than treating BI as a bolt-on that can provide instant ROI. The strategy needs to focus on quick wins at inception to build buy-in and secure the executive sponsorship that is critical to funding the BI program and helps overcome organizational barriers in people and processes; it should then build on this success with incremental gains and by asking the right questions. We’ll look at developing a BI strategy in a subsequent post.

What’s new in SAP BusinessObjects Enterprise BI 4.0 platform or BOXI 4.0 aka Aurora?

The SAP BI 4.0 release (codenamed Aurora) is the first major release of the BI platform since SAP acquired BusinessObjects. In this release, the semantic layer (the universe layer, for the uninitiated) has been completely re-worked to expose all business data under a single umbrella. The self-service BI portal (aka InfoView) has been revamped with a new AJAX-based design, providing quicker and easier access to content. Publishing and distribution of BI content to a mass audience has been made easier. There are also improvements to lifecycle management (the LCM tool) and platform administration (CMC, CCM) from a single console. These, in a nutshell, are the changes that Aurora, or SAP BO 4.0, brings, allowing BI content to be delivered across different channels ranging from the browser (BI Launch Pad, SharePoint, SAP NetWeaver Portal, Java Portal) to the desktop (widgets), MS Office and mobile.

In the following section I’ll try to cover the major changes that have been effected in the following products:

Semantic Layer – A new tool, the Information Design Tool, enhances the Universe Designer. The universes created by this tool are identified by the .UNX file extension and allow connections to multiple data sources.

Multiple data sources in the new Information Design Tool

The original universe designer is still there. Renamed the universe design tool, it allows creating single-data-source universes (.UNV file extension) as before.

Conversion of previous .unv universe versions is supported only for relational universes created in previous universe designer versions; it is not possible for OLAP universes or for universes based on stored procedures or a Data Federator data source.

No authentication is required to start the information design tool. Users can create and edit unsecured resources (data foundations, business layers, connections) in local projects and publish them to the repository to make them secure.

Connections to relational data sources, OLAP data sources and SAP NetWeaver BEx queries can be created; connections can be local (saved locally as .cnx files) or secured (stored in the repository).

Add a connection to a multi-source enabled data foundation universe

The newly named “Data foundations” are analogous to the schema browsers in Universe Designer. They contain the schema of relevant tables and joins from one or more relational databases that are used as a basis for one or more business layers.

The business layer is the universe metadata. Depending on the type of data source for the business layer, several types of objects e.g. folders, dimensions, analysis dimensions, measures, attributes, filters, hierarchies (OLAP only) can be created and edited in the business layer.

Search – enhancements include a new search engine allowing search by document attributes as well as content. Search results can be filtered and refined easily, and the search GUI is integrated into the BI Launch Pad.


There are also enhanced options through the OpenSearch API which enables integration with other search systems like Google Search Appliance, Microsoft SharePoint portal and NetWeaver Enterprise Search.

BI Portal – includes the re-designed web portal (formerly InfoView), now called the BI Launch Pad, providing a rich new user experience. It provides quick and easy access to BI applications and search, a handy list of recently used reports, scheduled documents, alerts etc., multiple tabs and pinning options, and a reduction in the manual steps for common tasks, such as:

  • Ability to create new folder while Saving
  • Schedule and Send To actions in Document viewers
  • Auto-refresh in History page
BI Portal
Alerting, Monitoring & Auditing – The alerting framework allows triggering of alerts based on events (schedule completion, ETL completion, system monitoring etc.) or data conditions, as well as reactions to those events, e.g. scheduling a report to run or sending a notification message. Subscription to alerts is made easier with a consistent workflow, allowing notification emails or messages in the BI Launch Pad.

New monitoring applications are available to keep tabs on system health and performance (server metrics, custom probes, user-defined watch conditions, visualization dashboard in CMC) and integrate with infrastructure monitoring tools like Tivoli and SAP Solution Manager.

Auditing enhancements include simplified system wide configuration, auto-purging of old data and an enhanced audit store schema which simplifies reporting and application development.

Lifecycle Management – The LCM console replaces the Import Wizard. It allows connection overrides in bulk mode automatically, supports version control and rollback, is auditable and provides a scripting facility.

Upgrades and deployments – A new optimized Upgrade Management Tool is provided, combining the best of the Import Wizard and the Database Migration tool from XI 3.x. It caters to one-click full upgrades or selective incremental upgrades, allowing direct upgrade from XI R2 SP2 or later. There is also enhanced scalability in deployment, with virtualization and 64-bit support.

Upgrade Management Tool

Analytics – out of the box: SAP Business Analytics

SAP finally announced on September 14, 2010 that it was getting onto the pre-packaged analytics bandwagon. SAP announced ten applications in this first release, for six industries (Consumer Products, Healthcare, Financial Services, Public Sector, Retail and Telecommunications), in its BusinessObjects offering.

Building on the Rapid Marts offering that the then-BusinessObjects used to have, and leveraging SAP’s industry and line-of-business expertise, these new applications are based on the SAP BusinessObjects XI platform – Web Intelligence, Crystal Reports and Dashboards (formerly Xcelsius). Bill McDermott, the joint CEO of SAP, described them as “complete and ready-to-go” and claimed the applications can be deployed in as little as eight weeks.

You may remember the brouhaha created by SAS last year, when it kicked off the controversy over Business Analytics, rather than Business Intelligence, being the future. Going back even further, Oracle already had this in its Siebel Analytics pre-built analytic applications for various industries. It would seem, therefore, that SAP is late to the game; but considering that neither Microsoft nor IBM has a similar offering, it may not be too bad for SAP. Better late than never…

Under the hood:

The pre-packaged analytic applications are based on the BusinessObjects XI platform, with the universe as the semantic layer or metadata model. They can be based on both SAP and non-SAP data, OLTP and data warehouse, relational and unstructured. SAP will work with its partners HP and Teradata to optimize the analytic solutions on their hosting and data warehousing solutions.

Business Analytics dashboards are Xcelsius Flash files which can be used with web services/QAWS to deliver real-time analytics. It may also be possible to use these with SAP BusinessObjects Explorer (formerly Polestar) and/or the SAP BW Accelerator or the SAP high-performance analytic appliance (HANA).

Business Analytics vs. Business Intelligence – Revisiting the controversy:

When SAS created this controversy last year, an important point noted by many was the SAS home page titled:

SAS | Business Intelligence Software and Predictive Analytics

It’s important to see how the rebranding has reflected in a change to the SAS home page a year hence. It now reads:

SAS | Business Analytics and Business Intelligence Software

SAS Institute had always been viewed as a niche vendor operating in the pure-play statistical and predictive analytics space, and this marketing was meant to re-brand SAS’ offerings and move them mainstream. In effect, it signaled these major vendors’ assessment that, in tough times, customers were seeking shorter lead times and demanding tools that are quick and easy to introduce and deliver a faster return on investment. As we come out of the downturn, with SAP still focusing on this segment, traditional BI is clearly seen as complex, costly and difficult to implement.

Open questions:

There are several questions open at the moment, given that this is an initial launch. SAP plans to offer more applications over the next 12 to 18 months in collaboration with customers and its partners. The partners include Aster Group, Blueprint, Capgemini, Column5, CSC, Fusion Consulting, The Glenture Group, LSI Consulting and syskoplan, and it will surely take quite a while for the ecosystem to develop. It remains to be seen whether the prepackaged analytics catches on the way Xcelsius dashboards did for BOBJ.

It is not clear whether the prepackaged analytics will be positioned at the bigger enterprises or only at the SME segment, as its success could cannibalize revenues from the flagship Enterprise XI suite.

There are also questions around the scalability of the framework the analytic applications are built on. The extensibility APIs and reference architectures for partners to build their own add-ons, plugins and applications are not yet out (planned for 2011), so it’s not quite the iPhone/iPad app store yet. It is also not clear how customizations to the applications would be supported, or to what extent they could be customized. The long-awaited universe rewrite, including data federation, might be part of the plans if the analytic applications turn out to be truly backend-agnostic and support future in-memory data structures (SAP’s acquisition of Sybase suggests likely support for the Sybase ASE in-memory database). If this happens, it would be in line with earlier plans to roll out in-memory EPM and OLTP solutions.

Review of the BT Summit – Cloud computing, SOA and BI tracks

I attended the Business Technology Summit in Bangalore last week – 3rd and 4th November. There were three tracks – cloud computing, Service-Oriented Architecture and Business Intelligence – and I chose a mix of sessions across each.

Overall impression: The BT Summit was heavily focused on cloud computing, with half of the second day devoted to a deep dive into Amazon’s EC2 offering, plus several keynotes. SOA and web services, REST and similar architectural sessions were interspersed but definitely not first-class citizens. BI came a poor third, with a weak choice of sessions that were more a rehash of what is already out there than anything on the cutting edge – appliances and columnar databases, in-memory databases, or the use of Flash and AJAX for interactive BI front-ends.

Session-wise review: (Speaker profiles available here). I was able to speak to and ask questions of Vinod Kumar, Vijay Doddavaram, Abhinav Agarwal and Dr. Bob Marcus.


Probably the highlight of the keynotes, this was a pep talk about the inevitable interconnected future of smart products and services; for good measure, Charney threw in some statistics on broadband growth and bandwidth usage, and on India’s readiness and potential in the scheme of things.

The worst of the lot – this started by comparing the spectrum of cloud offerings, from Amazon’s DIY EC2 and AWS through Google App Engine and Apps to Microsoft’s Azure, and ended up as a promo touting Azure as the best buy of them all.

A very good keynote, focusing on what makes sense to migrate to the cloud and what doesn’t, the hidden costs, the myth of unlimited elasticity in the cloud, and what Yahoo is doing with open-source software like Hadoop and Hive for cloud computing. In the short time span, Shouvick also tried to address some of the other considerations, including re-architecting existing applications, availability, and data storage and movement.

This post-lunch keynote by Sharma was a rambling talk on how technology keeps redefining our lives, and why it is important to think outside the box. He used the example of the iPhone to illustrate how such thinking can alter the established rules of an industry and redefine it as we know it.

Puhlmann provided the security perspective: how easy it is to break into and hack enterprise systems, how anti-virus and anti-spyware are always playing catch-up, the entire economy spawned by the “bad guys” in technology, and why our systems need to be smart and built from the ground up for security rather than having it bolted on as an afterthought. He provided valuable insights into the questions we should ask ourselves as we embrace cloud computing, with a changing technology landscape that makes it easy to consume information but easier still for security breachers. Puhlmann concluded by suggesting it may be worthwhile to include a level of risk assessment and mitigation, and collaboration with ethical hackers, rather than attempting the impossible task of removing all security threats.

Barely managed to stay awake through it – this one talked about moving towards a virtual enterprise, with a focus on virtualized architecture, including cloud computing. As boring as they come.

Other sessions:

  • SOA, Composite Applications, and Cloud Computing: Three pillars of a modern technology solution by Robert Schneider

Robert Schneider presented the different facets of SOA, composite applications (a superset of mash-ups) and cloud computing, contrasting them on time to yield benefits, maturity of vision, involvement and buy-in from business, and where they lie on the tactical–strategic plane. There wasn’t anything on why we are stuck with these three for a modern technology solution, or on what other paradigms exist beyond the old-world enterprise computing framework – possibly due to time constraints.

  • Self-service analysis and the future of Business Intelligence by Vinod Kumar

A lot of the BI folks were waiting for this, as Vinod performed the Project Gemini (Office 2010 Excel and PowerPivot) demo live for the first time in India, with several folks, including yours truly, sitting on the stairs. [We had had to rely on YouTube videos and MS Office 2010 preview videos earlier.] The demo was impressive, fetching over 13 million records into Excel on a standard DDR laptop using compression and in-memory technologies. The bigger question of unleashing another round of Excel hell went unanswered due to time constraints; however, the presentation hinted at Microsoft’s vision of “self-service BI”, or the so-called “underground BI” that the power-users of Excel (estimated at 2M worldwide, or 4% of the Excel user base) have been practicing. Microsoft’s strategy of pushing SharePoint adoption in the enterprise was made tacitly clear, with SharePoint being the only “portal” for publishing and sharing BI analysis (these Excel spreadsheets typically run upwards of 200MB) with other users in the enterprise.

  • Designing and Implementing RESTful web services by Eben Hewitt

Eben Hewitt started off with a very brief comparison between SOAP (Simple Object Access Protocol), modeled more along the lines of RPC (Remote Procedure Call), and REST (Representational State Transfer), clarifying that REST is an architectural style rather than a specification. The remainder of the talk delved into the details of implementing REST – the use of simple ‘verbs’ with the complexity in the ‘nouns’, the uniform interface, named resources, Java REST frameworks like Jersey, MIME types (JSON, XML, YAML) and the HTTP operations supported: POST, GET, PUT and DELETE.
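To make the verbs-and-nouns idea concrete, here is a minimal sketch of REST semantics as described in the session: named resources addressed by URI, manipulated through a small, uniform set of verbs. The `ResourceStore` class and the `/books/{id}` URIs are hypothetical illustrations, not Jersey or any other framework’s API:

```python
# Illustrative only: an in-memory collection of named resources, where
# the four HTTP verbs map onto create/read/update/delete operations.

class ResourceStore:
    """Models a resource collection such as /books/{id}."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def post(self, representation):
        """POST /books – create a new resource; the server picks the URI."""
        rid = self._next_id
        self._next_id += 1
        self._items[rid] = representation
        return f"/books/{rid}"

    def get(self, rid):
        """GET /books/{id} – a safe, idempotent read."""
        return self._items.get(rid)

    def put(self, rid, representation):
        """PUT /books/{id} – idempotent full replacement at a known URI."""
        self._items[rid] = representation

    def delete(self, rid):
        """DELETE /books/{id} – remove the resource."""
        self._items.pop(rid, None)


store = ResourceStore()
uri = store.post({"title": "RESTful Web Services"})
print(uri)               # → /books/1
print(store.get(1)["title"])
store.delete(1)
print(store.get(1))      # → None
```

Note how GET, PUT and DELETE are idempotent (repeating them leaves the same state) while POST is not – repeating it creates a new resource each time, which is why the complexity lives in the nouns, not the verbs.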

I attended with some expectations of learning how a BI project can be executed with open-source or free software – MySQL/Postgres, Pentaho/Talend, the Jaspersoft/MicroStrategy reporting suites etc. – but was highly disappointed by the presentation. Ramaswamy spoke on BI usage, barriers to BI adoption and costs of BI implementation, and spewed statistics like m&m’s with cursory references to Forrester, Gartner and “research studies”, but there was nothing tangible on project execution beyond some common-sense talk on “evaluating options”: open-source vs. licensing costs, offshoring and outsourcing, RDBMS vs. analytic databases and appliances, etc.

  • Business Intelligence – Leveraging and Navigating during current challenging times by Vijay Doddavaram

Vijay spoke of the current global economic downturn and how it had caught everyone unawares – both on the way down and now, in the current quarter, when the tide seems to have turned. With the example of a fictitious company in China, he illustrated the trade-off between tactical and strategic decision making, and whether and how business intelligence can make a difference in either a downturn or an upswing (whether the curve is a U, V or W). It was thought-provoking; one couldn’t help feeling that BI software has not yet eliminated the “intelligence” that people bring to the table, and Vijay made a distinct point about “human analysis/intelligence” as against the out-of-the-box actionable intelligence marketed by the BI vendors. It would have been interesting to prolong the discussion with a focus on the “predictive analytics” offerings in the market (from SAP, WPC, SPSS, the open-source R etc.), but we had once again run out of time, and it was the last session of the day as well.

  • Towards a unified Business Intelligence and Enterprise Performance Management Strategy by Abhinav Agarwal

Abhinav is from Oracle and used this session to present Oracle’s BI and EPM strategy. Refreshingly, in contrast to the usual Oracle marketing hype, Abhinav made it a point to stress the difficulty of delivering best-of-breed products amid numerous acquisitions and the inevitable integrations, compared to disruptive start-ups which may be one-trick ponies but nevertheless manage to push the technology envelope. Most of the session focused on the Oracle BI server offering and the roadmap for integrating it with Fusion middleware, with brief touchpoints on the BI server’s capabilities: federated queries (technology from nQuire, which Siebel Systems had acquired before itself being bought by Oracle), real-time updates including Oracle RTD (Real-Time Decisions), and the segregation of the BI and EPM software offerings.

  • 10 Things software architects should know by Eben Hewitt

I was able to attend part of it, but for the most part the bottom line of this talk was the trade-offs architects need to make, and the understanding that there may not be a “solution” to a problem – it may just be “moving the problem”, since each “solution” brings its own issues and trade-offs into the picture. Being focused on Java APIs and cloud computing frameworks, it could have done with something on networks and database architecture in general for the audience to relate to better (for most of the session, I couldn’t relate it to BI applications and data-warehousing infrastructure).

Arriving late from an overcrowded dining hall, I was able to attend only part of this. Bob spoke of the various public and private initiatives, including those from the federal government and NASA Nebula, and made the distinction early on between the types of cloud offerings: SaaS (Software as a Service), IaaS (Infrastructure as a Service) and PaaS (Platform as a Service). He mentioned in passing the initiatives of the Obama administration, as well as RACE (Rapid Access Computing Environment) from the Defense Information Systems Agency at the Dept. of Defense.

Vivek Khurana did a very short presentation to an overflowing hall on clichéd but nevertheless important aspects of information visualization when designing dashboards: clutter vs. simplicity, proper design of KPIs, the importance of delivery to mobile devices, and learning about presentation from news aggregation sites and portals.

  • Implementing Enterprise 2.0 using Open Source products by Udayan Banerjee

Banerjee did a great job of presenting his vision of implementing Enterprise 2.0 at NIIT – implementing SLATES (coined by Andrew McAfee): Search, Links, Authoring, Tags, Extensions and Signals. Within half an hour he walked us through using open-source products for collaboration with blogs and wikis (MediaWiki), single sign-on with enterprise databases, links and tag clouds, integrating search, and implementing a text-based instant messenger.

I had missed Alan’s earlier session on lessons learnt using SharePoint, so I made it a point to attend this, the last session of the summit – even though it meant I sometimes had no clue what was being talked about! Alan spoke of the emergence of the multi-vendor CMIS standard for Enterprise Content Management, the various facets of ECM – digital and media assets, email archiving, Internet content, web analytics, document types, rich media – and the problems with earlier Java standards like JSR 170, most notably the absence of support from Microsoft. He also covered the vendor landscape, with a 9-block rating similar to Gartner’s magic quadrant, plus various other important standards, including XAM (eXtensible Access Method), a storage standard developed by SNIA (the Storage Networking Industry Association).

Presentation files: Most presentation files are available here. You’ll need to register though to download.

- Maloy

Evolution of the BO XI platform – from XI R2 to XI 3.1 SP2

With BO XI 3.1 SP2 out in July this year, it is probably time to take a trip down the years to see how the XI platform has evolved and matured.

The timeline:

  • XI R2 SP2 – service pack release in March 2007 with productivity pack – QaaWS and LiveOffice connectors
  • XI 3.0 – new major release in February 2008 – the first release after SAP acquired BOBJ in October 2007
  • XI 3.1 – upgrade release in September 2008
  • XI 3.1 SP2 – service pack release on 24 July 2009 – with enhanced SAP integration

Where were we with XI R2:

  • Change to Crystal service-oriented platform (Crystal 10 architecture)
  • Ability to plug Crystal Reports, Web Intelligence, Desktop Intelligence, OLAP Intelligence, Dashboard Manager, Performance Manager directly into the framework
  • Single repository, security, system management, publishing, portal
  • Infoview (Replaced old BO Infoview and Crystal ePortfolio)
  • Central Management Console (CMC)
  • Import Wizard (upgrades from BO 5, 6, XI, Crystal 8.5, 9, 10)
  • Desktop Intelligence (new name for BO full client + ability to query and display Unicode data)
  • Publishing, Encyclopedia, Discussions, OLAP Intelligence, Performance Management
  • Changes to Data Integrator, Composer, Metadata Manager

XI 3.0 (Titan)

  • All administration moved to the Central Management Console – CMC – with new GUI
  • Bulk action support in CMC
  • Central Configuration Manager – CCM is still there (to manage multiple nodes) with 2 entries: Tomcat & SIA
  • Server Intelligence Agent (SIA) – handles service dependencies
  • Server Intelligence in CMC – clone server deployments
  • Repository Federation – replicate repository on other BO cluster
  • Repository Diagnostic Tool (Infostore vs FileStore – repair inconsistencies between CMS database entries and files in FRS)
  • Improved Import Wizard
  • Web Intelligence Rich Client (offline viewing of WebI reports, no session timeout)
  • Data change tracking in Web Intelligence
  • Designer – “Database delegated” projection on measures
  • Universe based on stored procedures
  • Prompt syntax extension (persistent/primary_key undocumented features, finally!)
  • Personal data provider – combine data from Excel, text and CSV files into a single report
  • Smart cubes – support for non-additive measures (percentages, ratios) and RDBMS analytical functions
  • Multi language support – dimensions, measures, prompts automatically localized to report viewer’s language
  • Native Web Intelligence printing (without PDF)
  • Embed images in Web Intelligence reports
  • Hyperlinks dialog box makes links easy to create – syntax generated by WebIntelligence (remember opendocument()?)
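For readers who never hand-crafted those links: the dialog spares you from assembling openDocument URLs yourself. As a rough sketch of what gets generated – the host and document name below are made up, and the parameter names (sDocName, sType, and the lsS prefix for prompt values) follow the commonly documented openDocument URL pattern, so treat this as an approximation rather than a reference:

```python
# Hypothetical sketch of building an openDocument-style link of the kind
# the hyperlinks dialog generates. Host and document name are invented.
from urllib.parse import urlencode

def open_document_url(host, doc_name, doc_type="wid", prompts=None):
    """Assemble a link to a document, optionally pre-answering prompts."""
    params = {"sDocName": doc_name, "sType": doc_type}
    for prompt, value in (prompts or {}).items():
        params[f"lsS{prompt}"] = value  # lsS<PromptName>=<value>
    return f"http://{host}/OpenDocument/opendoc/openDocument.jsp?{urlencode(params)}"

url = open_document_url("boserver", "Sales Report", prompts={"Year": "2009"})
print(url)
```

Having the syntax generated by Web Intelligence removes exactly this kind of error-prone string assembly from report design.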

What’s new in XI 3.1

  • Support for multi-forest Active Directory authentication
  • IP v6 support
  • Lifecycle Management Tool (LCMBIAR files, replace Import Wizard)
  • Saving Web Intelligence documents as CSV (data-only files) – new sheets for every 65K rows of data
  • Web Intelligence Autosave
  • “Begin_SQL” SQL prefix variable
  • Prompt syntax extension (support for key-value pairs!)
  • Business Objects Voyager enhancements
  • Live Office enhancements
  • WebIntelligence – Automatic loading of cached LOVs, interactive drag-drop, report filter bar, cancel refresh-on-open
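The 65K-row ceiling in the CSV export item above reflects the classic 65,536-row sheet limit of the older Excel formats. A minimal sketch of the splitting idea – this is an illustration of the arithmetic, not BusinessObjects code:

```python
# Older Excel sheets hold at most 65,536 rows, so a large data-only
# export has to be broken into sheet-sized chunks. Illustrative only.

SHEET_LIMIT = 65_536

def split_into_sheets(rows, limit=SHEET_LIMIT):
    """Yield successive chunks of at most `limit` rows."""
    for start in range(0, len(rows), limit):
        yield rows[start:start + limit]

rows = list(range(150_000))            # a pretend 150,000-row result set
sheets = list(split_into_sheets(rows))
print([len(s) for s in sheets])        # → [65536, 65536, 18928]
```

Three chunks for 150,000 rows: two full sheets of 65,536 rows and a remainder of 18,928.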

What’s new in XI 3.1 SP2

In one of my next posts, I’ll cover selected new features in detail.