{"id":497343,"date":"2025-06-26T14:15:24","date_gmt":"2025-06-26T14:15:24","guid":{"rendered":"https:\/\/webkul.com\/blog\/?p=497343"},"modified":"2025-06-26T14:15:31","modified_gmt":"2025-06-26T14:15:31","slug":"structured-ocr-newspaper-pipeline","status":"publish","type":"post","link":"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/","title":{"rendered":"Structured OCR for Newspapers: Using YOLOX and Vision LLMs"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p>Building a <strong>structured OCR for newspapers<\/strong> is no simple task. Unlike books or documents, newspapers are messy\u2014often noisy, skewed, and low-resolution. <\/p>\n\n\n\n<p>Traditional <a href=\"https:\/\/webkul.com\/blog\/invoice-data-extraction-ocr-ai\/\">OCR tools <\/a>struggle with such complex layouts.<\/p>\n\n\n\n<p>Newspapers also don\u2019t follow a standard layout. They use multiple columns, captions, mixed fonts, and articles that may jump across pages.<\/p>\n\n\n\n<p> Because of this, tools like Tesseract often return jumbled, unstructured text. These tools read line by line\u2014without understanding the context.<\/p>\n\n\n\n<p>But what if you need structured data like titles, authors, dates, or page numbers? Raw text simply isn\u2019t enough.<\/p>\n\n\n\n<p>To solve this, we\u2019ll combine <strong>YOLOX<\/strong> for detecting layout blocks with <strong>Vision LLM<\/strong> for intelligent text extraction. <\/p>\n\n\n\n<p>This modern OCR pipeline turns scanned pages into clean, structured JSON\u2014each block labeled and ordered properly.<\/p>\n\n\n\n<p>This blog walks you through how to build a <strong>structured OCR for newspapers<\/strong> using modern AI tools.<\/p>\n\n\n\n<p>Let\u2019s dive in.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2. 
Project Overview: Structured OCR for Newspapers<\/h2>\n\n\n\n<p>This project helps extract <strong>structured content<\/strong> from scanned newspaper pages. The system detects layout blocks\u2014such as titles, captions, and article bodies\u2014and then reads the text using AI.<\/p>\n\n\n\n<p>Here\u2019s how it works:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>A user uploads a newspaper image.<\/li>\n\n\n\n<li>The system detects blocks like <strong>titles, subheadings, text<\/strong>, and <strong>captions<\/strong> using YOLOX.<\/li>\n\n\n\n<li>Each block is sent to an OCR engine:\n<ul class=\"wp-block-list\">\n<li><strong>EasyOCR<\/strong> for simpler content<\/li>\n\n\n\n<li><strong>Vision LLM<\/strong> for dense or complex regions<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Extracted text is grouped and labeled.<\/li>\n\n\n\n<li>A clean, structured <strong>JSON<\/strong> file is returned.<\/li>\n<\/ol>\n\n\n\n<p>This JSON can be used for research, digital archiving, or searchable databases. It\u2019s both machine-readable and easy to understand.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Key Components<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>YOLOX<\/strong> \u2013 For object detection and layout analysis<\/li>\n\n\n\n<li><strong>EasyOCR \/ Vision LLM<\/strong> \u2013 For flexible text extraction<\/li>\n\n\n\n<li><strong>Python 3.10<\/strong> \u2013 With <code>.env<\/code> for API key management<\/li>\n<\/ul>\n\n\n\n<p>This system can run locally or on a small server. A GPU helps, but it\u2019s not strictly required for testing.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Training YOLOX for Structured OCR in Newspapers<\/h2>\n\n\n\n<p>Before running the pipeline, you\u2019ll need to train a custom YOLOX model that can detect newspaper block types.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.1 Create a Virtual Environment<\/h3>\n\n\n\n<p>Use Python 3.10.13:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python3.10 -m venv .venv\nsource .venv\/bin\/activate  # macOS\/Linux\n# .venv\\Scripts\\activate    # Windows\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3.2 Install Dependencies<\/h3>\n\n\n\n<p>First, upgrade pip and install all required packages:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install --upgrade pip\npip install -r requirements.txt\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3.3 Creating a Newspaper-Specific Dataset for OCR<\/h3>\n\n\n\n<p>Make sure your dataset is annotated in <strong>COCO format<\/strong> with relevant classes like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>title<\/code><\/li>\n\n\n\n<li><code>subheading<\/code><\/li>\n\n\n\n<li><code>textblock<\/code><\/li>\n\n\n\n<li><code>caption<\/code><\/li>\n\n\n\n<li><code>author<\/code><\/li>\n\n\n\n<li><code>page_number<\/code><\/li>\n<\/ul>\n\n\n\n<p>Folder structure should look like this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">datasets\/\n\u251c\u2500\u2500 train2017\/\n\u251c\u2500\u2500 val2017\/\n\u2514\u2500\u2500 
annotations\/\n    \u251c\u2500\u2500 instances_train2017.json\n    \u2514\u2500\u2500 instances_val2017.json\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3.4 Configure the YOLOX Experiment<\/h3>\n\n\n\n<p>Create an experiment file at:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">exps\/example\/custom\/newspaper_yolox.py\n<\/pre>\n\n\n\n<p>Set training parameters such as the number of classes and the dataset paths (batch size is passed on the command line in the next step):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">self.num_classes = 6\nself.data_dir = \"datasets\"\n# File names only: stock YOLOX looks these up under data_dir\/annotations\/\nself.train_ann = \"instances_train2017.json\"\nself.val_ann = \"instances_val2017.json\"\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3.5 Start Training<\/h3>\n\n\n\n<p>Run this command to begin training:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python tools\/train.py -f exps\/example\/custom\/newspaper_yolox.py -expn newspaper_yolox -d 1 -b 8 --fp16\n<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>-f<\/code>: Path to the experiment file<\/li>\n\n\n\n<li><code>-expn<\/code>: Name of your experiment<\/li>\n\n\n\n<li><code>-d<\/code>: Number of GPUs<\/li>\n\n\n\n<li><code>-b<\/code>: Batch size<\/li>\n\n\n\n<li><code>--fp16<\/code>: Enables mixed precision (faster on GPU)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3.6 Save the Best Model<\/h3>\n\n\n\n<p>Once training is complete, use the best checkpoint found at:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" 
data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">YOLOX_outputs\/newspaper_yolox\/best_ckpt.pth\n<\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4. How Structured OCR for Newspapers Works<\/h2>\n\n\n\n<p>Let\u2019s break down the full pipeline, from layout detection to structured output.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.1 Detecting Layout Blocks with YOLOX<\/h3>\n\n\n\n<p>First, the image is passed through the trained YOLOX model. It detects different layout components like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Titles and subheadings<\/li>\n\n\n\n<li>Body text blocks<\/li>\n\n\n\n<li>Captions and authors<\/li>\n\n\n\n<li>Illustrations and page numbers<\/li>\n<\/ul>\n\n\n\n<p>For each block, YOLOX returns a bounding box, a label, and a confidence score. The image is then cropped to each box to isolate individual regions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.2 Choosing the Right OCR Engine<\/h3>\n\n\n\n<p>Next, each cropped block is passed to an OCR engine. Based on the type and size of the block, we choose:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>EasyOCR<\/strong>: Fast and accurate for clean text<\/li>\n\n\n\n<li><strong>Vision LLM<\/strong>: More powerful for noisy, wrapped, or stylized blocks<\/li>\n<\/ul>\n\n\n\n<p>This decision can be made automatically using simple logic in your code.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.3 Prompt Engineering for Better OCR Output<\/h3>\n\n\n\n<p>To get the most out of the vision language model, use custom prompts for each block type.<\/p>\n\n\n\n<p>For example:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>&#8220;Extract the full title from this image. Do not include captions or author names.&#8221;<\/p>\n<\/blockquote>\n\n\n\n<p>These prompts help the LLM focus on what matters. 
You can customize prompts in <code>functions.py<\/code> for each content type.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.4 Structuring the Output<\/h3>\n\n\n\n<p>After text is extracted, we group and label each block. This step includes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Sorting blocks top-to-bottom and left-to-right<\/li>\n\n\n\n<li>Matching captions with illustrations<\/li>\n\n\n\n<li>Linking authors with nearby titles<\/li>\n<\/ul>\n\n\n\n<p>Finally, we create a structured JSON:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{\n  \"title\": \"New Discovery in AI\",\n  \"author\": \"Jane Doe\",\n  \"text\": \"Researchers at XYZ University...\",\n  \"caption\": \"Illustration of the AI model.\"\n}\n<\/pre>\n\n\n\n<p>With YOLOX and Vision LLM, you can finally create a reliable <strong>structured OCR for newspapers<\/strong> that delivers clean, labeled output.<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1080\" style=\"aspect-ratio: 1920 \/ 1080;\" width=\"1920\" controls muted src=\"https:\/\/cdnblog.webkul.com\/blog\/wp-content\/uploads\/2025\/06\/newspaper-ai.webm\"><\/video><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">5. Challenges in Building Structured OCR for Newspapers<\/h2>\n\n\n\n<p>Building this system wasn\u2019t easy. Here are some real challenges we faced\u2014and how we solved them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5.1 Complex Layouts<\/h3>\n\n\n\n<p>Newspapers don\u2019t follow rules. Articles wrap around ads. Titles sit next to unrelated images. 
To train YOLOX well, we needed many diverse examples.<\/p>\n\n\n\n<p>The key lesson: annotate a wide range of layouts and fonts to get consistent results.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5.2 OCR Struggles with Noisy Scans<\/h3>\n\n\n\n<p>Low-quality scans are a real problem. Blurry text and ink smudges confused EasyOCR.<\/p>\n\n\n\n<p>Switching to Vision LLM for key blocks (like titles or captions) improved results significantly\u2014but it added cost and latency.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5.3 Balancing Speed and Accuracy<\/h3>\n\n\n\n<p>Vision LLM was accurate, but slow and expensive. So, we added a toggle to choose between EasyOCR (fast) and Vision LLM (accurate) based on the use case.<\/p>\n\n\n\n<p>This way, users could balance performance and quality.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5.4 Annotating the Dataset<\/h3>\n\n\n\n<p>Labeling layout blocks manually took time\u2014but it was essential. We used tools like <strong>Label Studio<\/strong> to speed up annotation.<\/p>\n\n\n\n<p>In the future, pre-trained layout models could help reduce this workload.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5.5 Matching Related Regions<\/h3>\n\n\n\n<p>It wasn\u2019t always easy to connect authors to their articles or captions to illustrations. We used proximity rules to group nearby blocks, but it wasn\u2019t perfect.<\/p>\n\n\n\n<p>A potential improvement could be using <strong>layout graphs<\/strong> or <strong>document parsing models<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">6. Conclusion<\/h2>\n\n\n\n<p>OCR for newspapers is tough\u2014but not impossible. Standard tools alone won\u2019t cut it. You need layout awareness, smart extraction, and structured output.<\/p>\n\n\n\n<p>By training <strong>YOLOX<\/strong> on newspaper-specific classes, we detected meaningful regions like titles, captions, and authors. 
With <strong>EasyOCR<\/strong> and <strong>Vision LLM<\/strong>, we extracted clean text\u2014even from difficult scans.<\/p>\n\n\n\n<p>The final result? A structured, labeled JSON ready for indexing, research, or digital archives.<\/p>\n\n\n\n<p>Whether you\u2019re digitizing archives or automating editorial tasks, this <strong>structured OCR for newspapers<\/strong> pipeline is powerful, scalable, and open source.<\/p>\n\n\n\n<p>Thanks for reading! Try the pipeline, improve it, and share your results. We\u2019d love to see what you build.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>1. Introduction Building a structured OCR for newspapers is no simple task. Unlike books or documents, newspapers are messy\u2014often noisy, skewed, and low-resolution. Traditional OCR tools struggle with such complex layouts. Newspapers also don\u2019t follow a standard layout. They use multiple columns, captions, mixed fonts, and articles that may jump across pages. Because of this, <a href=\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/\">[&#8230;]<\/a><\/p>\n","protected":false},"author":620,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13702],"tags":[13571,7240],"class_list":["post-497343","post","type-post","status-publish","format-standard","hentry","category-machine-learning","tag-artificial-intelligence","tag-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Structured OCR for Newspapers: Using YOLOX and Vision LLMs - Webkul Blog<\/title>\n<meta name=\"description\" content=\"Build a structured OCR for newspapers using YOLOX and any vision-llm. 
Detect layouts, extract clean text, and output structured JSON.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Structured OCR for Newspapers: Using YOLOX and Vision LLMs - Webkul Blog\" \/>\n<meta property=\"og:description\" content=\"Build a structured OCR for newspapers using YOLOX and any vision-llm. Detect layouts, extract clean text, and output structured JSON.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/\" \/>\n<meta property=\"og:site_name\" content=\"Webkul Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/webkul\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-06-26T14:15:24+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-26T14:15:31+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/cdnblog.webkul.com\/blog\/wp-content\/uploads\/2021\/08\/webkul-og.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"630\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Darshan\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@webkul\" \/>\n<meta name=\"twitter:site\" content=\"@webkul\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Darshan\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/\"},\"author\":{\"name\":\"Darshan\",\"@id\":\"https:\/\/webkul.com\/blog\/#\/schema\/person\/dd668ee0a2ff124a8f4991edddd4f8cb\"},\"headline\":\"Structured OCR for Newspapers: Using YOLOX and Vision LLMs\",\"datePublished\":\"2025-06-26T14:15:24+00:00\",\"dateModified\":\"2025-06-26T14:15:31+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/\"},\"wordCount\":1006,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/webkul.com\/blog\/#organization\"},\"keywords\":[\"Artificial Intelligence\",\"machine learning\"],\"articleSection\":[\"machine learning\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/\",\"url\":\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/\",\"name\":\"Structured OCR for Newspapers: Using YOLOX and Vision LLMs - Webkul Blog\",\"isPartOf\":{\"@id\":\"https:\/\/webkul.com\/blog\/#website\"},\"datePublished\":\"2025-06-26T14:15:24+00:00\",\"dateModified\":\"2025-06-26T14:15:31+00:00\",\"description\":\"Build a structured OCR for newspapers using YOLOX and any vision-llm. 
Detect layouts, extract clean text, and output structured JSON.\",\"breadcrumb\":{\"@id\":\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/webkul.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Structured OCR for Newspapers: Using YOLOX and Vision LLMs\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/webkul.com\/blog\/#website\",\"url\":\"https:\/\/webkul.com\/blog\/\",\"name\":\"Webkul Blog\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/webkul.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/webkul.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/webkul.com\/blog\/#organization\",\"name\":\"WebKul Software Private Limited\",\"url\":\"https:\/\/webkul.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/webkul.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/cdnblog.webkul.com\/blog\/wp-content\/uploads\/2021\/08\/webkul-logo-accent-sq.png\",\"contentUrl\":\"https:\/\/cdnblog.webkul.com\/blog\/wp-content\/uploads\/2021\/08\/webkul-logo-accent-sq.png\",\"width\":380,\"height\":380,\"caption\":\"WebKul Software Private 
Limited\"},\"image\":{\"@id\":\"https:\/\/webkul.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/webkul\/\",\"https:\/\/x.com\/webkul\",\"https:\/\/www.instagram.com\/webkul\/\",\"https:\/\/www.linkedin.com\/company\/webkul\",\"https:\/\/www.youtube.com\/user\/webkul\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/webkul.com\/blog\/#\/schema\/person\/dd668ee0a2ff124a8f4991edddd4f8cb\",\"name\":\"Darshan\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/webkul.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/277e91384cd9de31c5ec0649b4ba9fb5fb43f0575d9abd8b775f6ebaae36c0fe?s=96&d=https%3A%2F%2Fcdnblog.webkul.com%2Fblog%2Fwp-content%2Fuploads%2F2019%2F10%2Fmike.png&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/277e91384cd9de31c5ec0649b4ba9fb5fb43f0575d9abd8b775f6ebaae36c0fe?s=96&d=https%3A%2F%2Fcdnblog.webkul.com%2Fblog%2Fwp-content%2Fuploads%2F2019%2F10%2Fmike.png&r=g\",\"caption\":\"Darshan\"},\"description\":\"Darshan, a Software Engineer, specializes in Machine Learning, crafting intelligent systems that revolutionize automation. Expertise in data-driven algorithms ensures high accuracy and adaptive models, delivering dynamic, innovative solutions.\",\"url\":\"https:\/\/webkul.com\/blog\/author\/darshan-bagisto455\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Structured OCR for Newspapers: Using YOLOX and Vision LLMs - Webkul Blog","description":"Build a structured OCR for newspapers using YOLOX and any vision-llm. 
Detect layouts, extract clean text, and output structured JSON.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/","og_locale":"en_US","og_type":"article","og_title":"Structured OCR for Newspapers: Using YOLOX and Vision LLMs - Webkul Blog","og_description":"Build a structured OCR for newspapers using YOLOX and any vision-llm. Detect layouts, extract clean text, and output structured JSON.","og_url":"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/","og_site_name":"Webkul Blog","article_publisher":"https:\/\/www.facebook.com\/webkul\/","article_published_time":"2025-06-26T14:15:24+00:00","article_modified_time":"2025-06-26T14:15:31+00:00","og_image":[{"width":1200,"height":630,"url":"https:\/\/cdnblog.webkul.com\/blog\/wp-content\/uploads\/2021\/08\/webkul-og.png","type":"image\/png"}],"author":"Darshan","twitter_card":"summary_large_image","twitter_creator":"@webkul","twitter_site":"@webkul","twitter_misc":{"Written by":"Darshan","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/#article","isPartOf":{"@id":"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/"},"author":{"name":"Darshan","@id":"https:\/\/webkul.com\/blog\/#\/schema\/person\/dd668ee0a2ff124a8f4991edddd4f8cb"},"headline":"Structured OCR for Newspapers: Using YOLOX and Vision LLMs","datePublished":"2025-06-26T14:15:24+00:00","dateModified":"2025-06-26T14:15:31+00:00","mainEntityOfPage":{"@id":"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/"},"wordCount":1006,"commentCount":0,"publisher":{"@id":"https:\/\/webkul.com\/blog\/#organization"},"keywords":["Artificial Intelligence","machine learning"],"articleSection":["machine learning"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/","url":"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/","name":"Structured OCR for Newspapers: Using YOLOX and Vision LLMs - Webkul Blog","isPartOf":{"@id":"https:\/\/webkul.com\/blog\/#website"},"datePublished":"2025-06-26T14:15:24+00:00","dateModified":"2025-06-26T14:15:31+00:00","description":"Build a structured OCR for newspapers using YOLOX and any vision-llm. 
Detect layouts, extract clean text, and output structured JSON.","breadcrumb":{"@id":"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/webkul.com\/blog\/structured-ocr-newspaper-pipeline\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/webkul.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Structured OCR for Newspapers: Using YOLOX and Vision LLMs"}]},{"@type":"WebSite","@id":"https:\/\/webkul.com\/blog\/#website","url":"https:\/\/webkul.com\/blog\/","name":"Webkul Blog","description":"","publisher":{"@id":"https:\/\/webkul.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/webkul.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/webkul.com\/blog\/#organization","name":"WebKul Software Private Limited","url":"https:\/\/webkul.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/webkul.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/cdnblog.webkul.com\/blog\/wp-content\/uploads\/2021\/08\/webkul-logo-accent-sq.png","contentUrl":"https:\/\/cdnblog.webkul.com\/blog\/wp-content\/uploads\/2021\/08\/webkul-logo-accent-sq.png","width":380,"height":380,"caption":"WebKul Software Private 
Limited"},"image":{"@id":"https:\/\/webkul.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/webkul\/","https:\/\/x.com\/webkul","https:\/\/www.instagram.com\/webkul\/","https:\/\/www.linkedin.com\/company\/webkul","https:\/\/www.youtube.com\/user\/webkul\/"]},{"@type":"Person","@id":"https:\/\/webkul.com\/blog\/#\/schema\/person\/dd668ee0a2ff124a8f4991edddd4f8cb","name":"Darshan","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/webkul.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/277e91384cd9de31c5ec0649b4ba9fb5fb43f0575d9abd8b775f6ebaae36c0fe?s=96&d=https%3A%2F%2Fcdnblog.webkul.com%2Fblog%2Fwp-content%2Fuploads%2F2019%2F10%2Fmike.png&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/277e91384cd9de31c5ec0649b4ba9fb5fb43f0575d9abd8b775f6ebaae36c0fe?s=96&d=https%3A%2F%2Fcdnblog.webkul.com%2Fblog%2Fwp-content%2Fuploads%2F2019%2F10%2Fmike.png&r=g","caption":"Darshan"},"description":"Darshan, a Software Engineer, specializes in Machine Learning, crafting intelligent systems that revolutionize automation. 
Expertise in data-driven algorithms ensures high accuracy and adaptive models, delivering dynamic, innovative solutions.","url":"https:\/\/webkul.com\/blog\/author\/darshan-bagisto455\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/webkul.com\/blog\/wp-json\/wp\/v2\/posts\/497343","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/webkul.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/webkul.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/webkul.com\/blog\/wp-json\/wp\/v2\/users\/620"}],"replies":[{"embeddable":true,"href":"https:\/\/webkul.com\/blog\/wp-json\/wp\/v2\/comments?post=497343"}],"version-history":[{"count":4,"href":"https:\/\/webkul.com\/blog\/wp-json\/wp\/v2\/posts\/497343\/revisions"}],"predecessor-version":[{"id":497581,"href":"https:\/\/webkul.com\/blog\/wp-json\/wp\/v2\/posts\/497343\/revisions\/497581"}],"wp:attachment":[{"href":"https:\/\/webkul.com\/blog\/wp-json\/wp\/v2\/media?parent=497343"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/webkul.com\/blog\/wp-json\/wp\/v2\/categories?post=497343"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/webkul.com\/blog\/wp-json\/wp\/v2\/tags?post=497343"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}