BoM!
Block-o-Matic! (Beta)
a Web Page Segmentation Algorithm
Last update: 10-Dec-2017
About
Get it:
- BoM Chrome extension [download]
- Pagelyzer suite stand-alone distribution, available on GitHub [visit]
- JavaScript library on GitHub [visit]
Chrome Extension Installation Instructions
- Download the .crx file to your computer
- Open a new Chrome browser window and enter the following in the URL bar:
chrome://extensions
- Drag and drop the .crx file into this window
- Click 'Add'
- Done. Happy BOM!
Stand-alone Installation Instructions
- After compiling and building the JAR file, obtain the segmentation with the following command:
java -jar jPagelyzer.jar -get segmentation -url pageurl -browser abrowser
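When the stand-alone tool is driven from a script, the command above can be assembled programmatically. The following Node.js sketch builds the argument list and launches it. The helper names buildPagelyzerArgs and runPagelyzer, the default JAR path, and the idea that the result arrives on standard output are assumptions made for illustration; they are not part of the Pagelyzer distribution.

```javascript
const { spawn } = require("child_process");

// Hypothetical helper: assembles the arguments of the command shown above.
// pageUrl and browser stand in for the "pageurl" and "abrowser" placeholders.
function buildPagelyzerArgs(pageUrl, browser, jarPath = "jPagelyzer.jar") {
  if (!pageUrl || !browser) {
    throw new Error("both a page URL and a browser name are required");
  }
  return ["-jar", jarPath, "-get", "segmentation", "-url", pageUrl, "-browser", browser];
}

// Hypothetical helper: runs java with those arguments.
// Assumption: the segmentation is written to standard output.
function runPagelyzer(pageUrl, browser, jarPath) {
  return new Promise((resolve, reject) => {
    const child = spawn("java", buildPagelyzerArgs(pageUrl, browser, jarPath));
    let out = "";
    child.stdout.on("data", (chunk) => { out += chunk; });
    child.on("error", reject); // e.g. java is not installed
    child.on("close", (code) =>
      code === 0 ? resolve(out) : reject(new Error(`java exited with code ${code}`))
    );
  });
}
```

Keeping the argument construction separate from the process launch makes it possible to check the command line without having Java or the JAR available.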
JavaScript Library Installation Instructions
- To segment a page, it is enough to include bomlib.js in it with a script tag.
- The library can be used in conjunction with Selenium or on its own.
- After the page load event, start the segmentation with the following function call:
startSegmentation(win, pac, pdc, returnType, psourceurl);
Where:
- win is, in general, the current window object, but it can also refer to an iframe element or another window whose page should be segmented
- pAC (pac in the call above) represents the granularity. It is an integer between 0 and 10: 0 yields small blocks, while 10 yields bigger blocks
- pDC (pdc in the call above) represents the maximum separation allowed between blocks. In general, a value of 50px is enough. For example, suppose we would like a paragraph and its title to fall into the same block: pDC=10 will keep them apart, but pDC=50 will keep them together
- returnType is the output format. It can be record or vidiff [3]
- psourceurl is the URL being analyzed. It is useful for keeping your URL unchanged in the output, because browsers can change the URL on page load
- The variable page holds the segmentation in the form of a tree of Logical Objects
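The call and its parameters can be wrapped in a small helper that checks the granularity range before segmenting. This is only a sketch: the wrapper name segmentPage, the options object, and the default granularity of 5 are assumptions for illustration, and it presumes that bomlib.js exposes startSegmentation as a global function once it has been included with a script tag.

```javascript
// Hypothetical wrapper around the library's startSegmentation call.
function segmentPage(win, options = {}) {
  const {
    pac = 5,                 // granularity, 0 (small blocks) to 10 (bigger blocks); default is an assumption
    pdc = 50,                // maximum separation between blocks, in pixels
    returnType = "record",   // "record" or "vidiff"
    sourceUrl = win.location.href,
  } = options;

  if (!Number.isInteger(pac) || pac < 0 || pac > 10) {
    throw new RangeError("pac must be an integer between 0 and 10");
  }
  // Assumption: including bomlib.js defines startSegmentation globally.
  const start = globalThis.startSegmentation;
  if (typeof start !== "function") {
    throw new Error("bomlib.js does not appear to be loaded");
  }
  // Passes through whatever the library call returns, if anything.
  return start(win, pac, pdc, returnType, sourceUrl);
}

// Browser usage: wait for the load event, then segment the current window.
if (typeof window !== "undefined") {
  window.addEventListener("load", () => {
    segmentPage(window, { pac: 5, pdc: 50 });
    // Per the description above, the variable `page` then holds the tree of Logical Objects.
  });
}
```

Passing sourceUrl explicitly is useful when the browser may rewrite the address on load, since the value given is forwarded unchanged as psourceurl.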
Ground truth construction tool
Publications
References
[1] Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. Extracting content structure for web pages based on visual representation. In Fifth Asia Pacific Web Conference (APWeb'03), 2003.
[2] Jisheng Liang, Ihsin T. Phillips, and Robert M. Haralick. Performance evaluation of document structure extraction algorithms. Computer Vision and Image Understanding, 84(1):144-159, 2001.
[3] Zeynep Pehlivan, Myriam Ben-Saad, and Stéphane Gançarski. Vi-diff: understanding web pages changes. In Proceedings of the 21st international conference on Database and expert systems applications: Part I, DEXA'10, pages 1-15, Berlin, Heidelberg, 2010. Springer-Verlag.
[4] Y. Y. Tang, C. Y. Suen, C. D. Yan, and M. Cheriet. Document analysis and understanding: a brief survey. First International Conference on Document Analysis and Recognition, pages 17-31, 1991.
[5] Yeliz Yesilada. Web page segmentation: A review. Technical report, Middle East Technical University Northern Cyprus Campus, March 2011.
[6] https://github.com/openplanets/pagelyzer
[7] Faisal Shafait, Daniel Keysers, Thomas Breuel, "Performance Evaluation and Benchmarking of Six-Page Segmentation Algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 6, pp. 941-954, June,