
You can try something like this. I would suggest creating an array for the elements inside of game that you want and then iterate over them. I'm sure there's a way to get all of the elements inside the specified one in Nokogiri but this works:

xml = Nokogiri::XML(result)
xml.css("game").each do |inv|
  inv.css("title").each do |f|  # title or whatever else you want
    puts f.inner_html
  end
end

inner_html is rarely useful. In this case you really want f.text, and since there's only one title per game, there's not much need for an each.

ruby on rails - Parse XML with Nokogiri - Stack Overflow

ruby-on-rails ruby xml-parsing nokogiri

require 'nokogiri'

b = Nokogiri::XML::Builder.new do |xml|
  xml.send(:"fooo-bar", "hello")
end

puts b.to_xml

where does the hello come in? xml.send(:"foo-bar", "hello")?

Where is it documented in official Nokogiri documentation? can you please share a link?

Bit late to the party here, but that :"xx-aaa" syntax is the standard Ruby way of making a symbol when the syntax won't work for you

ruby - How do I create XML using Nokogiri::XML::Builder with a hyphen ...

xml ruby nokogiri

Bart Vandendriessche's answer works but there is a simpler solution if you only want a text field within the element.

If you need them to be nested then you can pass a block

require 'nokogiri'

b = Nokogiri::XML::Builder.new do |xml|
  xml.send(:'foo-bar') {
    xml.send(:'bar-foo', 'hello')
  }
end

puts b.to_xml
<?xml version="1.0"?>
<foo-bar>
  <bar-foo>hello</bar-foo>
</foo-bar>

ruby - How do I create XML using Nokogiri::XML::Builder with a hyphen ...

xml ruby nokogiri

I had the exact same issue recently. What I did was create a class that inherits from Nokogiri::HTML::Document, override the new class method to parse the document, and then save the url in an instance variable with an accessor:

require 'nokogiri'
require 'open-uri'

class Webpage < Nokogiri::HTML::Document
  attr_accessor :url

  class << self

    def new(url)
      html = URI.open(url)  # Kernel#open no longer opens URLs on Ruby 3.0+
      self.parse(html).tap do |d|
        d.url = url
      end
    end
  end
end

Then you can just create a new Webpage, and it will have access to all the normal methods you would have with a Nokogiri::HTML::Document:

w = Webpage.new("http://www.google.com")
w.url
#=> "http://www.google.com"
w.at_css('title')
#=> #<Nokogiri::XML::Element:0x4952f78 name="title" children=[#<Nokogiri::XML::Text:0x4952cb2 "Google">]>

If you have some relative url that you got from an image tag, you can then make it absolute by passing the return value of the url accessor to URI.join:

relative_link_url = "/media/image/image.png"
#=> "/media/image/image.png"
URI.join(w.url, relative_link_url).to_s
#=> "http://www.google.com/media/image/image.png"

p.s. the title of this question is quite misleading. Something more along the lines of "Accessing URL of Nokogiri HTML document" would be clearer.

Thanks for the response! I tried your code but unfortunately it still returns the shortened url instead of the full.

You mean it returns the relative url when you get the src of an img tag? It won't change that. All it does is give you access to the root url, which you can then join to get the full url of any link in the doc.

I added an example. See if that makes sense.

It makes sense, unless the "w.url" is a shortened link like "bit.ly/234"

I don't understand. You mean that you want to create a document with a shortened link? That won't work. You can't generate a URI from a relative url, without any other information about the root. You have to have an absolute url somewhere.

ruby on rails - How to get a full URL given a shortened one passed to ...

ruby-on-rails ruby nokogiri mechanize scraper

Aaron Patterson's answer is correct and will work for element names containing any character that may otherwise be interpreted by the Ruby parser.

Answering Angela's question: to place text inside an element created this way you can do something like this:

require 'rubygems'
require 'nokogiri'

b = Nokogiri::XML::Builder.new do |xml|
  xml.send(:'foo.bar') {
    xml.text 'hello'
  }
end

puts b.to_xml

ruby - How do I create XML using Nokogiri::XML::Builder with a hyphen ...

xml ruby nokogiri

Change the absolute paths:

sku = product.xpath("//sku").text
quan = product.xpath("//inventory-quantity").text

to paths relative to the current product:

sku = product.xpath("sku").text
quan = product.xpath("inventory-quantity").text

It's because //sku selects every sku element in the whole document, regardless of which node you call xpath on.

ruby on rails - Using nokogiri to parse XML and create records with mu...

ruby-on-rails ruby xml activerecord nokogiri

Here's a far simpler version that creates a robust Hash that includes namespace information, both for elements and attributes:

require 'nokogiri'
class Nokogiri::XML::Node
  TYPENAMES = {1=>'element',2=>'attribute',3=>'text',4=>'cdata',8=>'comment'}
  def to_hash
    {kind:TYPENAMES[node_type],name:name}.tap do |h|
      h.merge! nshref:namespace.href, nsprefix:namespace.prefix if namespace
      h.merge! text:text
      h.merge! attr:attribute_nodes.map(&:to_hash) if element?
      h.merge! kids:children.map(&:to_hash) if element?
    end
  end
end
class Nokogiri::XML::Document
  def to_hash; root.to_hash; end
end

Seen in action:

xml = '<r a="b" xmlns:z="foo"><z:a>Hello <b z:m="n" x="y">World</b>!</z:a></r>'
doc = Nokogiri::XML(xml)
p doc.to_hash
#=> {
#=>   :kind=>"element",
#=>   :name=>"r",
#=>   :text=>"Hello World!",
#=>   :attr=>[
#=>     {
#=>       :kind=>"attribute",
#=>       :name=>"a", 
#=>       :text=>"b"
#=>     }
#=>   ], 
#=>   :kids=>[
#=>     {
#=>       :kind=>"element", 
#=>       :name=>"a", 
#=>       :nshref=>"foo", 
#=>       :nsprefix=>"z", 
#=>       :text=>"Hello World!", 
#=>       :attr=>[], 
#=>       :kids=>[
#=>         {
#=>           :kind=>"text", 
#=>           :name=>"text", 
#=>           :text=>"Hello "
#=>         },
#=>         {
#=>           :kind=>"element", 
#=>           :name=>"b", 
#=>           :text=>"World", 
#=>           :attr=>[
#=>             {
#=>               :kind=>"attribute", 
#=>               :name=>"m", 
#=>               :nshref=>"foo", 
#=>               :nsprefix=>"z", 
#=>               :text=>"n"
#=>             },
#=>             {
#=>               :kind=>"attribute", 
#=>               :name=>"x", 
#=>               :text=>"y"
#=>             }
#=>           ], 
#=>           :kids=>[
#=>             {
#=>               :kind=>"text", 
#=>               :name=>"text", 
#=>               :text=>"World"
#=>             }
#=>           ]
#=>         },
#=>         {
#=>           :kind=>"text", 
#=>           :name=>"text", 
#=>           :text=>"!"
#=>         }
#=>       ]
#=>     }
#=>   ]
#=> }

xml - Convert a Nokogiri document to a Ruby Hash - Stack Overflow

xml ruby hash nokogiri libxml-ruby

doc = Nokogiri::XML(File.open("#{Rails.root}/public/new.xml"))

variant = doc.xpath("//variant")

sku, quan = nil, nil
variant.each do |product|
  product.children.each do |child|
    case child.name
    when 'sku'
      sku = child.text
    when 'inventory-quantity'
      quan = child.text
    end
  end

  Productmapping.create(:sku => sku, :product_quantity => quan)
end

Or a more compact way that relies on child positions, which is more "magic" and not as pretty:

sku = product.children[1].text
quan = product.children[3].text

This did work, thanks, but more simple solution below

It's inefficient to loop over all children, checking the name. And as you mention, indexing by position is 'not beautiful'. Better to specify directly, e.g. product.at('sku').text

ruby on rails - Using nokogiri to parse XML and create records with mu...

ruby-on-rails ruby xml activerecord nokogiri

The file you linked to, and presumably the others, are not valid XML because they do not have a root element. From Wikipedia:

Nokogiri hints at this if you look at the errors (suggested by Arup Rakshit), as detailed in the documentation:

Nokogiri::XML(File.open("/Users/b/Downloads/ipg140513.xml")).errors # =>
# [
#   #<Nokogiri::XML::SyntaxError: XML declaration allowed only at the start of the document>,
#   #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>
# ]

The file appears to be a concatenation of a series of valid XML files, each having a <us-patent-grant/> as its root element.

Fortunately, Nokogiri can handle this invalid XML if you process it as a document fragment. Try this:

Nokogiri::XML::DocumentFragment.parse(File.read('ipg140513.xml')).children.select{|element| element.name == 'us-patent-grant'}

The select chooses the root node of each concatenated document, ignoring the processing instructions and DTD declarations.

Alternately, you could pre-process the file and split it into its constituent, correctly-formatted documents. Parsing a 650MB document all at once is quite slow and memory intensive.
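One way to do that pre-processing is to stream the file line by line and cut a new document each time an XML declaration appears. The sketch below assumes every constituent document begins with `<?xml ` at the start of a line, which is a guess about the file layout; each_xml_document is a hypothetical helper name:

```ruby
require 'stringio'

# Yields each constituent document as a string. Reads line by line, so only
# one document is held in memory at a time.
# Assumption: every document starts with an XML declaration at the start of a line.
def each_xml_document(io)
  return enum_for(:each_xml_document, io) unless block_given?

  buffer = +''
  io.each_line do |line|
    if line.start_with?('<?xml ') && !buffer.empty?
      yield buffer
      buffer = +''  # start a fresh string for the next document
    end
    buffer << line
  end
  yield buffer unless buffer.empty?
end

# Tiny stand-in for the real bulk file.
sample = StringIO.new(<<~DATA)
  <?xml version="1.0"?>
  <us-patent-grant id="1"/>
  <?xml version="1.0"?>
  <us-patent-grant id="2"/>
DATA

docs = each_xml_document(sample).to_a
puts docs.size
```

With a real file you would pass File.open(path) instead of the StringIO, parse each yielded string with Nokogiri::XML, and skip the patents you don't need.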

Thanks for the suggestion, @ArupRakshit, I've updated the answer.

Thank you Buck! Yeah, I've tried with the DocumentFragment and my laptop just died. Now I'm trying to figure out how to split the xml in multiple files, considering also that I don't need all patents in it, but just few of them. I also searched if USPTO provides xml files for single patents but I cannot find anything, just the bulk zips

How to parse USPTO XML files with Ruby and Nokogiri? - Stack Overflow

ruby xml xml-parsing nokogiri

Just use one controller, Search, and a model Searches. Then you can store every search in the DB and allow users to retrieve them, or create permanent urls for searches. You could use Nokogiri to do the web crawling.

model view controller - What's the Rails way of making super-minimal w...

ruby-on-rails model-view-controller

Yes, I would create a model encapsulating the API and your retrieval methods. Also, I would use HTTParty which is designed exactly for this use case. It will automatically do the conversions to and from XML. (Although if that is an RSS feed I'd probably use a dedicated RSS parser)

This separation won't be "long winded," in fact it will be cleaner and can be more efficient, as you can cache or even just memoize in the model minimizing the amount of fetching you have to do.
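As a rough sketch of the memoization idea: FeedSource and its injected fetcher are hypothetical names, and in a real app the fetcher would be whatever call retrieves and parses the remote XML (for example an HTTParty request):

```ruby
# Hypothetical model wrapping a remote feed.
class FeedSource
  def initialize(fetcher)
    @fetcher = fetcher  # any callable that returns the parsed items
  end

  # Fetches on first use, then reuses the stored result.
  # Note: ||= fetches again if the fetcher returned nil or false.
  def items
    @items ||= @fetcher.call
  end
end

calls = 0
source = FeedSource.new(-> { calls += 1; %w[first second] })

source.items
source.items
puts calls  # the fetcher ran only once
```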

Thanks for the answer, I was thinking that there was a way to change the details of the XML on the fly, but you are right, get it in the model, it is cleaner. I'll get on the case with HTTParty

ruby on rails - How can I view each individual item of a nodeset? - St...

ruby-on-rails ruby xpath xml-parsing nokogiri

Your read method is fine. The problem is that the XML document is not what you think it is. If you look at output of your create method, it creates only this document:

<?xml version="1.0"?>
<item>
  <data>9</data>
  <port>3</port>
  <length>max</length>
  <date>date</date>
  <limit>5</limit>
</item>

This is because you are creating (and re-creating, and re-creating) your document INSIDE your while loop. Instead, you want to do something like this:

def create
  var = 0  
  builder = Nokogiri::XML::Builder.new do
    root do
      while var < 10
        item {
          data var
          port '3'
          length 'max'
          date 'date'
          limit '5'
        }
        var += 1  
      end
    end
  end
  return Nokogiri::XML(builder.to_xml).root.to_xml
end

Note that I have added a <root> element to wrap all your items, since an XML document can only have one root element, and you appear to be adding multiple <item> elements to it.

Additionally, may I suggest that instead of your while loop you use the more Ruby-esque:

10.times do |var|
  # no need to var += 1 here
  # the rest of your code
end

Thanks a lot!!!

ruby - nokogiri only return one item - Stack Overflow

ruby xml nokogiri

require 'nokogiri'

describe 'function' do
  describe '.xml_header' do
    it 'should create valid header' do
        doc = Nokogiri::XML::Document.parse(GEM::xml_header)    
        doc.xpath('//Root/EnvelopeVersion').text.should eq("1.0")
    end     
  end
end

ruby on rails - Testing Nokogiri XML for Attributes - Stack Overflow

ruby-on-rails xml rspec tdd nokogiri

Instead of using initialize, which always returns a new instance of an object, when creating a new SomeClass from a scraping, I'd use a class method to create the instance. I'm not using exceptions here beyond what nokogiri is throwing because it sounds like nothing else should bubble up further since you just want these to be logged, but otherwise be ignored. You mentioned logging the exceptions--are you just logging what goes to stdout? I'll answer as if you are...

# lib/my_helper_module.rb
module MyHelperModule
  def self.do_the_process(args = {})
    my_models = args[:my_models]

    # Parallel.each(my_models, :in_processes => 5) do |my_model|
    my_models.each do |my_model|
      # Reconnect to prevent errors with Postgres
      ActiveRecord::Base.connection.reconnect!

      some_object = SomeClass.create_from_scrape(my_model.id)

      if some_object
        # Do something super interesting if you were able to get a scraping
        # otherwise nothing happens (except it is noted in our logging elsewhere)
      end
    end
  end
end
# lib/some_class.rb
require_relative 'webpage_helper'
class SomeClass
  attr_accessor :some_data

  def initialize(doc)
    @doc = doc
  end

  # could shorten this, but you get the idea...
  def self.create_from_scrape(arg)
    doc = WebpageHelper.get_doc("http://somesite.com/#{arg}")
    if doc
      return SomeClass.new(doc)
    else
      return nil
    end      
  end

end
# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'

class WebpageHelper
  def self.get_doc(url)
    attempts = 0 # define attempts first in non-block local scope before using it
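    begin
      page_content = URI.open(url).read  # Kernel#open no longer opens URLs on Ruby 3.0+
      # do more stuff
    rescue StandardError => ex  # covers OpenURI::HTTPError without trapping signals or exits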
      attempts += 1
      puts "Failed at #{Time.now}"
      puts "Error: #{ex}"
      puts "URL: " + url
      if attempts < 3 
        puts "Retrying... Attempt #: #{attempts.to_s}"
        sleep(10)
        retry
      else
        return nil
      end
    end

  end
end

ruby on rails - How should my scraping "stack" handle 404 errors? - St...

ruby-on-rails ruby exception open-uri

To begin with, you can remove the extra comma at the end of the arguments to create. If that doesn't work...

require 'rake'
require 'nokogiri'
require 'open-uri'

namespace :xml_parser do
  task :new_task => :environment do
    doc = Nokogiri::XML(URI.open("https://dl.dropboxusercontent.com/u/21695507/openplaques/gb_20151004.xml"))
    doc.css('plaque').each do |node|
      children = node.children
      Plaque.create(
        :title => children.css('title').inner_text,
        :subject => children.css('subjects').inner_text,
        :colour => children.css('colour').inner_text,
        :inscription => children.css('inscription raw').inner_text,
        :latitude => children.css('geo')['latitude'],
        :longitude => children.css('geo')['longitude'],
        :address => children.css('address').inner_text,
        :organisation => children.css('author').inner_text,
        :date_erected => children.css('author').inner_text
      )
    end
  end
end

Then run rake xml_parser:new_task. That should work. (Also, please check that you are importing the :organisation and :date_erected fields correctly; both currently read from 'author'.)

Thank you, this is running in the terminal. However, I am now getting an error saying "TypeError: no implicit conversion of String into Integer". The datatypes for the database fields are all strings or text, though, so why would it be trying to convert to Integer?

Lat-long fields? Could you share the XML file?

rake task to parse xml from website into rails database - Stack Overfl...

ruby-on-rails xml

The problem with Nokogiri's current implementation of :has() is that it creates XPath that requires the contents to be a direct child, not any descendant:

puts Nokogiri::CSS.xpath_for( "a:has(b)" )
#=> "//a[b]"
#=> Should output "//a[.//b]" to be correct

To make this XPath match what jQuery does, you need to allow the span to be a descendant element. For example:

require 'nokogiri'
d = Nokogiri.XML('<r><a/><a><b><c/></b></a></r>')
d.at_css('a:has(b)')    #=> #<Nokogiri::XML::Element:0x14dd608 name="a" children=[#<Nokogiri::XML::Element:0x14dd3e0 name="b" children=[#<Nokogiri::XML::Element:0x14dd20c name="c">]>]>
d.at_css('a:has(c)')    #=> nil
d.at_xpath('//a[.//c]') #=> #<Nokogiri::XML::Element:0x14dd608 name="a" children=[#<Nokogiri::XML::Element:0x14dd3e0 name="b" children=[#<Nokogiri::XML::Element:0x14dd20c name="c">]>]>
puts Nokogiri::CSS.xpath_for( "li:has(span.string:not(:empty)) > h1 > a" )
#=> //li[span[contains(concat(' ', @class, ' '), ' string ') and not(not(node()))]]/h1/a
# Adding just the .//
//li[.//span[contains(concat(' ', @class, ' '), ' string ') and not(not(node()))]]/h1/a

# Simplified to assume only one CSS class is present on the span
//li[.//span[@class='string' and not(not(node()))]]/h1/a

# Assuming that `not(:empty)` really meant "Has some text in it"
//li[.//span[@class='string' and text()]]/h1/a

# ..or maybe you really wanted "Has some text anywhere underneath"
//li[.//span[@class='string' and .//text()]]/h1/a

# ..or maybe you really wanted "Has at least one element child"
//li[.//span[@class='string' and *]]/h1/a

jquery - :has CSS pseudo class in Nokogiri - Stack Overflow

jquery css ruby-on-rails ruby nokogiri

You can use the NOBLANKS option for parsing the XML string, consider this example:

require 'nokogiri'

string = "<foo>\n  <bar>bar</bar>\n</foo>"
puts string
# <foo>
#   <bar>bar</bar>
# </foo>

document_with_blanks = Nokogiri::XML.parse(string)

document_without_blanks = Nokogiri::XML.parse(string) do |config|
  config.noblanks
end

document_with_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Text:0x3ffa4e153dac "\n  ">
#<Nokogiri::XML::Element:0x3fdce3f78488 name="bar" children=[#<Nokogiri::XML::Text:0x3fdce3f781f4 "bar">]>
#<Nokogiri::XML::Text:0x3ffa4e15335c "\n">

document_without_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3f81bef42034 name="bar" children=[#<Nokogiri::XML::Text:0x3f81bef43ee8 "bar">]>

Update: The NOBLANKS shouldn't remove empty nodes:

doc = Nokogiri.XML('<foo><bar></bar></foo>') do |config|
  config.noblanks
end

doc.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3fad0fafbfa8 name="bar">

Update: As the OP pointed out, the documentation on the Nokogiri website (and also on the libxml website) about the parser options is quite cryptic. Here is a specification of the behaviour of the NOBLANKS option:

require 'rspec/autorun'
require 'nokogiri'

def parse_xml(xml_string)
  Nokogiri.XML(xml_string) { |config| config.noblanks }
end

describe "Nokogiri NOBLANKS parser option" do

  it "removes whitespace nodes if they have siblings" do
    doc = parse_xml("<root>\n <child></child></root>")
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
  end

  it "doesn't remove whitespaces nodes if they have no siblings" do
    doc = parse_xml("<root>\n </root>")
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Text)
  end

  it "doesn't remove empty nodes" do
    doc = parse_xml('<root><child></child></root>')
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
  end

end

Brilliant! Thanks ever so much. EDIT: Not the right answer, reason given below.

Sorry, was not the correct answer. The reason for that is that if I then add an empty tag, such as <empty></empty>, it would not get parsed and represented. Empty nodes need to be included unfortunately.

@CodingMo Actually the NOBLANKS options should keep empty nodes, can you post the code that strips the footer node in your example?

You are right. It does seem like it includes empty nodes. Although, if you follow this link I give, it does say that it removes empty nodes. nokogiri.org/Nokogiri/XML/ParseOptions.html

@CodingMo, I just discovered Nokogiri's noblanks config option, but now I'm finding that it doesn't ignore every Text node that consists of only whitespace, as I would like. So finding your post was timely. However, there's another twist: see my post (feel free to add it to your post or some modification thereof). Also, the 'strict' config option is the default, so most people are probably going to want config.strict.noblanks.

ruby - Avoid creating non-significant white space text nodes when crea...

ruby xml xml-parsing html-parsing nokogiri

You can create a query that only returns element nodes, and ignores text nodes. In XPath, * only returns elements, so the query could look like (querying the whole doc):

doc.xpath('//note/*')

or if you want to use CSS:

doc.css('note > *')

If you want to implement your significant_nodes method, you would need to make the query relative to the node passed in:

def significant_nodes(node)
  node.xpath('./*').size
end

I don't know how to do a relative query with CSS, so you might need to stick with XPath.

The trouble with doing .xpath('./*'), is that if you do it on an element with a text node that has significant text, those text nodes won't be represented. So if we take ` #(Element:0x3fc07e8d8064 { name = "from", children = [ #(Text "Jani")]})` and do the .xpath('./*') on it, it will not return the text node that has "Jani" in it.

@CodingMo well dont use it on such nodes then :-)

that's a fair point and this is a great answer, I'll find it useful in the future!

@CodingMo you could use an XPath query like '//note/node()[self::* or self::text()[normalize-space()]]' to get elements and non-blank text nodes, although in this specific example that's pretty much the same as using the noblanks option.

ruby - Avoid creating non-significant white space text nodes when crea...

ruby xml xml-parsing html-parsing nokogiri

It's not well known, but Nokogiri implements some of jQuery's JavaScript extensions for searching using CSS selectors. In your case, the :eq(n) method will be useful:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<html>
<body>
  <table>
    <tr>
      <td>1</td>
      <td>2</td>
      <td>3</td>
      <td>4</td>
    </tr>
  </table>
</body>
</html>
EOT

doc.at('td:eq(4)').text # => "4"

ruby - How to create an array scraping HTML? - Stack Overflow

ruby nokogiri