The ability to scale with semantic technology is creating new opportunities. In this article/interview, CTO of Cambridge Semantics, Sean Martin, shares these new developments. CK
This article, written by Jack Vaughan, originally appeared on SearchDataManagement on January 1, 2017.
Behind the newly surging interest in graph data is a rebirth of semantic technology. Semantic approaches improved upon relational methods for data analytics, but they have also had to overcome hurdles. To get a better view on graph developments, SearchDataManagement caught up with Sean Martin, CTO at Cambridge Semantics, who is among those at the center of semantic technology advances. After numerous years and adventures in technology skunkworks-style undertakings at IBM, Martin founded Cambridge Semantics in 2007 to further knowledge graphs and semantic technology in the enterprise.
Scalability, he said, has been a challenge he has worked consistently to meet. Released last year, his firm’s Anzo Smart Data Lake is based on an in-memory massively parallel processing (MPP) graph database engine. The engine was obtained in 2015 through Cambridge Semantics’ purchase of SPARQL City, whose principals included the top technologists behind MPP pioneers Netezza and ParAccel. At the heart of Anzo Smart Data Lake is support for the Resource Description Framework (RDF) and SPARQL standards for storing and querying data.
Graphs and semantic technology have traveled a long road, but things seem to be coalescing lately. Is that true?
Sean Martin: Well, semantics standards came out 15 or more years ago, but scalability has been an inhibitor. Now, the graph technology has taken off. Most of what people have been looking at it for is [online transactional processing]. Our focus has been on [online analytical processing]: using graph technology for analytics.
What held graph technology back from doing analytics was the scaling problem. There was promise and hype over those years, but, at every turn, the scale just wasn’t there. You could see amazing things in miniature, but enterprises couldn’t see them at scale. In effect, we have taken our query technology and applied MPP technology to it. Now, we are seeing tremendous scales of data.
From our point of view, the ability to take on data warehouse loads happened, really, this year. Now, we find we can implement complex data lakes, and graph is a big element of that. Still, we see some people using graph and others hedging their bets, using graph and also using Hadoop family software for analytics.
What does semantic technology bring to the party? What is it meant to improve upon?
Martin: You will see much richer representations of data. One of the issues people have is that the data representations they have been practically capable of, using the traditional tool sets, have been pretty limited. They are just not practical anymore. People can create very elaborate relational structures, but the more rich they get, and the more types of classes they represent, then the more difficult it gets to write queries that touch many of the different tables you need to create, and there [are] a lot of artifacts around how the data is actually stored. So, in effect, there is a practical limit on how easy it is to represent the data richly using the traditional tools.
Relational tools or tabular tools that businesses are using today are also poor for extracting data from text and gaining rich representations. Sometimes it can be done, but what you end up with is impractical.
Meanwhile, you have users that want different cuts of that data. There is more and more demand for custom extracts of the information.
Yet another problem is that people continually want to include data from other external sources. Those are issues we now see being addressed by semantic technology.
A lot of the technology has had to fall into place here — what are some of the pieces you see contributing to making these systems successful?
Martin: Certainly, there is a series of standards. There is OWL (Web Ontology Language), which is a modeling language. And that allows you to describe what type of data you expect to see, and how that data relates to other entities. You describe an entity and then you describe its attributes, like you would in a table. And then you can describe another entity and its attributes, and some of those attributes can be connected.
The key thing about OWL is that it is neutral to how the data is stored or queried. It acts as a template to bring data in. It’s an open standard, and there are lots of tools for OWL. It’s a good way for sharing models. It is being used in different domains, so you see something like the [Financial Industry Business Ontology] model in financial services.
A second technology is the triple-store graph stores. The issue with those has been that, over the years, they have not scaled very well. That has held up the use of semantic technology. In particular, the most compute- and storage-intensive applications, like data warehouses and data marts, have just been beyond the scale of semantic technology. But now there is an underlying arc of technologies (which includes in-memory technology, commodity CPU cores, fast interconnects, cloud fabric networking) that have moved things ahead.
Also, there is a third technology. That is SPARQL, which we see as a semantic successor to SQL in the graph world. It is a standard protocol for talking to a remote database, as well as a query language. It can do anything you can do in SQL, but then it does much better at queries around relationships: how things are related. It is also more tractable for automatic code generation and user interface building.
So, coupled with OWL, you have a recipe there for allowing people, without learning SPARQL, to use it easily.
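To make the idea of a model that is "neutral to how the data is stored or queried" a little more concrete, here is a minimal sketch in plain Python. It is not Anzo, not a real triple store, and not an OWL or SPARQL implementation: the tuples merely imitate RDF triples, and every name in it (ex:Customer, ex:holds, the match and describe helpers) is a hypothetical chosen for illustration. The point it tries to show is that once entities, attributes, and connections are declared in the model, generic code can read the model and build the lookups, so a user never has to write the query by hand.

```python
# Illustrative only: a toy stand-in for an OWL-style model plus a triple store.
# All identifiers (ex:Customer, ex:holds, ...) are hypothetical.
RDF_TYPE = "rdf:type"

# Model: two entities, their attributes, and one attribute that connects them.
model = [
    ("ex:Customer", RDF_TYPE, "owl:Class"),
    ("ex:Account", RDF_TYPE, "owl:Class"),
    ("ex:name", RDF_TYPE, "owl:DatatypeProperty"),
    ("ex:name", "rdfs:domain", "ex:Customer"),
    ("ex:balance", RDF_TYPE, "owl:DatatypeProperty"),
    ("ex:balance", "rdfs:domain", "ex:Account"),
    ("ex:holds", RDF_TYPE, "owl:ObjectProperty"),
    ("ex:holds", "rdfs:domain", "ex:Customer"),
    ("ex:holds", "rdfs:range", "ex:Account"),
]

# Instance data, stored as subject-predicate-object triples.
data = [
    ("ex:alice", RDF_TYPE, "ex:Customer"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:holds", "ex:acct1"),
    ("ex:acct1", RDF_TYPE, "ex:Account"),
    ("ex:acct1", "ex:balance", 250),
]


def match(triples, s=None, p=None, o=None):
    """Return triples that fit the pattern; None is a wildcard, like a query variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def properties_of(cls):
    """Properties declared for a class, read from the model instead of hard-coded."""
    return sorted(s for s, _, _ in match(model, p="rdfs:domain", o=cls))


def describe(entity):
    """Look up every modeled property of an entity; the 'query' is generated from the model."""
    cls = match(data, s=entity, p=RDF_TYPE)[0][2]
    return {p: [o for _, _, o in match(data, s=entity, p=p)] for p in properties_of(cls)}


customer_props = properties_of("ex:Customer")
alice = describe("ex:alice")
print(customer_props)
print(alice)
```

Adding a new attribute to the model in this sketch changes what describe returns without touching the lookup code, which is roughly the property Martin credits for making automatic code generation and user interface building more tractable.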
I think it is usually worth noting that, while the root word relation is in relational database, graph databases have benefits when it comes to disclosing relationships.
Martin: They do. As one of my colleagues sometimes points out, with a relational database, when you are asking about relationships, you have to explicitly ask ‘are they related like this’ or ‘are they related like that.’ Whereas, with SPARQL, you can just do one query and say ‘show me all the ways in which all these things are related,’ and it will list them for you. In relational technology, you actually have to know, upfront, all the ways in which they might be related, and then run separate queries.
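The contrast Martin draws can be sketched as follows, under some loudly stated assumptions. The relational side uses Python's built-in sqlite3 module with a hypothetical schema of one table per relationship type; the graph side is a plain list of tuples standing in for a triple store, with a comment showing roughly what the corresponding SPARQL pattern would look like. The people, tables, and relationship names are invented for illustration.

```python
import sqlite3

# Relational side: one table per kind of relationship (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE manages (a TEXT, b TEXT);
    CREATE TABLE mentors (a TEXT, b TEXT);
    CREATE TABLE co_invests_with (a TEXT, b TEXT);
    INSERT INTO manages VALUES ('alice', 'bob');
    INSERT INTO co_invests_with VALUES ('alice', 'bob');
    INSERT INTO mentors VALUES ('carol', 'bob');
""")


def related_sql(a, b, known_tables):
    """Run one query per table; the caller must already know every table to check."""
    found = []
    for table in known_tables:
        # Table names cannot be bound as parameters; they come from a fixed list here.
        row = conn.execute(
            f"SELECT 1 FROM {table} WHERE a = ? AND b = ?", (a, b)
        ).fetchone()
        if row:
            found.append(table)
    return found


# Graph side: the same facts as subject-predicate-object triples.
triples = [
    ("alice", "manages", "bob"),
    ("alice", "co_invests_with", "bob"),
    ("carol", "mentors", "bob"),
]


def related_graph(a, b):
    """Leave the predicate open, roughly: SELECT ?p WHERE { :alice ?p :bob }"""
    return [p for s, p, o in triples if s == a and o == b]


# The relational caller forgot that co_invests_with exists, so it misses that link.
sql_result = related_sql("alice", "bob", ["manages", "mentors"])
graph_result = related_graph("alice", "bob")
print(sql_result)
print(graph_result)
```

In this toy setup the relational lookup only finds the relationships it was told to look for, while the graph pattern returns every predicate linking the two names because the relationship type is itself a query variable.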