Newscycle granted patent on new technique in unsupervised learning

The new technique, used in Newscycle's NewsEdge content-as-a-service platform, groups like data with like at real-time speeds and provides exact categorization, whereas previous methods have been slower and only approximate.


NEWSCYCLE Solutions has received a patent for a new technique in unsupervised learning. The technique is used in the NewsEdge software to cluster similar content.

"The challenge of unsupervised learning is this: take a huge sample of individuals (or individual things, like news articles), and group like-with-like," said Dr. Lawrence C. Rafsky, chief artificial intelligence/machine learning scientist at Acquire Media, a Newscycle company. Rafsky created the technique in collaboration with Dr. Jonathan A. Marshall.

Rafsky said that there are numerous other techniques for unsupervised learning, but the Newscycle-patented solution is unique in that it runs in real time and gets an exact answer to the underlying combinatorial minimization problem. "It's super-fast and we get the exact answer, whereas other techniques just arrive at an approximation," he said.

Newscycle's NewsEdge content-as-a-service solution uses the algorithm to group more than 750,000 news articles a day into buckets of single-themed news events, continuously throughout the day, in real time. However, its creators note that it could have applications in other industries as well.

"Even though we developed our method to support our business of news article topic clustering, it works equally well in medical data analysis, ecommerce transaction analytics, advertising segmentation, database similarity-joins, and other areas," said Marshall. "For more than 50 years, textbooks have explained why clustering slows down with each arriving data item. Now, for the first time, we have produced a data clustering method that does not slow down – it is just as fast on the millionth data item as on the first. This makes it feasible and practical to find clusters in much larger data sets than before."

In tests that clustered a data set of more than 10 million news articles, the technique ran 10,000 times faster than typical industry-standard solutions.

"We cracked a problem that most data scientists realize they don't have an exact solution for," Rafsky said.

The technique is particularly suited for Newscycle's NewsEdge content-as-a-service platform, which has a proprietary real-time taxonomy for tagging news with very specific controlled-vocabulary keywords. Together, the processes enable NewsEdge products to provide searchable theme-based content bundles to users in milliseconds, with near-zero post-publishing latency.
