BoM

BoM stands for Block-o-Matic, a Web page segmentation tool and algorithm.

Web page segmentation is the process of dividing a Web page into visually and semantically coherent segments called blocks. To determine the coherence of each segment, we rely on the content categories defined by the W3C in the HTML5 specification (e.g. sectioning content).

Detecting these blocks is a crucial step for many applications, in areas such as mobile devices, information retrieval, Web archiving, Web accessibility, and the evaluation of visual quality (aesthetics).

BoM uses a bottom-up strategy to detect blocks. It first detects the smallest coherent units, called simple blocks, starting from the DOM leaves. It then groups them by region into composite blocks. Finally, the simple blocks are merged using a computer vision algorithm to produce a tree showing the organization of the content in the page.
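
The first two steps of this strategy can be illustrated with a minimal Python sketch. It is not BoM's actual code: the `Node` class, the `segment` function, and the region labels are hypothetical names chosen for this example, and the final merge step is replaced by a plain bounding-box union rather than the computer vision algorithm mentioned above. The only fact taken from HTML5 is the list of sectioning content elements.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# HTML5 "sectioning content" elements.
SECTIONING = {"article", "aside", "nav", "section"}

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom)


@dataclass
class Node:
    """Hypothetical, simplified DOM node: a tag, a rendered box, children."""
    tag: str
    rect: Rect
    children: List["Node"] = field(default_factory=list)


def union(rects: List[Rect]) -> Rect:
    """Smallest rectangle that encloses every rectangle in rects."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def segment(root: Node) -> Dict[str, Rect]:
    """Group DOM leaves (simple blocks) under their nearest sectioning
    ancestor and return one enclosing rectangle per region."""
    groups: Dict[str, List[Rect]] = {}

    def walk(node: Node, region: str) -> None:
        if node.tag in SECTIONING:
            # Label the region by tag and top-left corner (illustrative only).
            region = f"{node.tag}@{node.rect[0]},{node.rect[1]}"
        if not node.children:
            groups.setdefault(region, []).append(node.rect)
        for child in node.children:
            walk(child, region)

    walk(root, "page")
    return {region: union(rects) for region, rects in groups.items()}


# A toy page: a navigation bar with two links and an article with two paragraphs.
page = Node("body", (0, 0, 800, 600), [
    Node("nav", (0, 0, 800, 50), [
        Node("a", (0, 0, 100, 50)),
        Node("a", (100, 0, 200, 50)),
    ]),
    Node("article", (0, 50, 800, 600), [
        Node("p", (0, 50, 800, 300)),
        Node("p", (0, 300, 800, 550)),
    ]),
])
blocks = segment(page)
```

On this toy page, `blocks` holds two regions, one for the `nav` element and one for the `article` element, each with the rectangle that encloses its leaves.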

BoM takes a hybrid approach to segmentation, combining the DOM with the geometry of elements. The second version of the algorithm also includes text features.
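
As a rough illustration of what combining a DOM cue with a geometric cue might look like, the sketch below decides whether two blocks belong together. Everything in it is an assumption made for this example: the `Block` class, the `should_merge` rule, and the `max_gap` and `min_shared` thresholds are hypothetical and are not taken from BoM. Comparing DOM paths by tag name alone is also a simplification.

```python
from dataclasses import dataclass
from typing import Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom)


@dataclass(frozen=True)
class Block:
    """Hypothetical block: a DOM path (tags from the root) plus a rendered box."""
    dom_path: Tuple[str, ...]
    rect: Rect


def common_depth(a: Block, b: Block) -> int:
    """DOM cue: number of leading path entries shared by both blocks."""
    depth = 0
    for x, y in zip(a.dom_path, b.dom_path):
        if x != y:
            break
        depth += 1
    return depth


def gap(a: Rect, b: Rect) -> int:
    """Geometry cue: largest axis-aligned gap between two rectangles,
    or 0 when they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def should_merge(a: Block, b: Block, max_gap: int = 20, min_shared: int = 3) -> bool:
    """Merge only when both cues agree: the blocks share enough DOM
    ancestry and lie close together on the rendered page."""
    return common_depth(a, b) >= min_shared and gap(a.rect, b.rect) <= max_gap


first = Block(("html", "body", "article", "p"), (0, 50, 800, 300))
second = Block(("html", "body", "article", "p"), (0, 310, 800, 550))
link = Block(("html", "body", "nav", "a"), (0, 0, 100, 50))

same_region = should_merge(first, second)   # close together, same ancestry
cross_region = should_merge(first, link)    # adjacent, but different ancestry
```

Requiring both cues to agree keeps visually adjacent elements from different parts of the DOM apart, which is the kind of case where a geometry-only or DOM-only rule would be more likely to err.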