
Take advantage of the browser's built-in parser

If you want to break apart any given URL, you can take advantage of DOM methods:

//  create an anchor element (note: no need to append this element to the document)
var link = document.createElement('a');
//  set href to any path
link.setAttribute('href', 'http://example.com:12345/blog/foo/bar?startIndex=1&pageSize=10');

The browser's built-in parser has already done its job. Now you can just grab the parts you need:

//  get any piece of the url you're interested in
link.hostname;  //  'example.com'
link.port;      //  '12345' (a string, not a number)
link.search;    //  '?startIndex=1&pageSize=10'
link.pathname;  //  '/blog/foo/bar'
link.protocol;  //  'http:'

//  cleanup for garbage collection
link = null;

Chances are you'll want to break apart the search params as well, since '?startIndex=1&pageSize=10' isn't very usable on its own.

Here are two functions that take care of this:

/**
 *  Break apart any path into parts
 *  'http://example.com:12345/blog/foo/bar?startIndex=1&pageSize=10' ->
 *    {
 *      "host": "example.com",
 *      "port": "12345",
 *      "search": {
 *        "startIndex": "1",
 *        "pageSize": "10"
 *      },
 *      "path": "/blog/foo/bar",
 *      "protocol": "http:"
 *    }
 */
function getPathInfo(path) {
    //  create a link in the DOM and set its href
    var link = document.createElement('a');
    link.setAttribute('href', path);

    //  return an easy-to-use object that breaks apart the path
    return {
        host:     link.hostname,  //  'example.com'
        port:     link.port,      //  '12345'
        search:   processSearchParams(link.search),  //  {startIndex: '1', pageSize: '10'}
        path:     link.pathname,  //  '/blog/foo/bar'
        protocol: link.protocol   //  'http:'
    };
}

/**
 *  Convert search param string into an object or array
 *  '?startIndex=1&pageSize=10' -> {startIndex: '1', pageSize: '10'}
 */
function processSearchParams(search, preserveDuplicates) {
    //  option to preserve duplicate keys (e.g. 'sort=name&sort=age')
    preserveDuplicates = preserveDuplicates || false;  //  disabled by default

    var outputNoDupes = {};
    var outputWithDupes = [];  //  optional output array to preserve duplicate keys

    //  nothing to parse (e.g. the URL has no query string, so link.search is '')
    if(!search) return preserveDuplicates ? [] : {};

    //  remove leading ? separator ('?foo=1&bar=2' -> 'foo=1&bar=2')
    search = search.replace(/^\?/, '');

    //  split apart keys into an array ('foo=1&bar=2' -> ['foo=1', 'bar=2'])
    search = search.split('&');

    //  separate keys from values (['foo=1', 'bar=2'] -> [{foo: '1'}, {bar: '2'}])
    //  also construct the simplified outputNoDupes object
    outputWithDupes = search.map(function(keyval){
        var out = {};
        keyval = keyval.split('=');
        out[keyval[0]] = keyval[1];
        outputNoDupes[keyval[0]] = keyval[1]; //  might as well do the no-dupe work too while we're in the loop
        return out;
    });

    return (preserveDuplicates) ? outputWithDupes : outputNoDupes;
}
var link = document.createElement('a'); link.setAttribute('href', 'google.com'); console.log(link.protocol)

Are you doing that on an HTTP page, perhaps? If the protocol is not specified, it is inherited from the current location.

This is a fantastic answer and should get more votes, because this answer is not limited to just the current location but works for any url, and because this answer utilizes the browser's built-in parser instead of building one ourselves (which we can't hope to do as well or as fast!).

Thank you for this clever trick! I would like to add one thing: There is both host and hostname. The former includes the port (e.g. localhost:3000), while the latter is only the host's name (e.g. localhost).

This works well for absolute URLs. It fails for relative URLs and does not behave the same across browsers. Any suggestions?

javascript - Get protocol, domain, and port from URL - Stack Overflow

javascript url dns protocols port

Step 2 : Create the parser

OK, I did it, but I have far too much source code to post here. I will explain step by step how I did it, but I won't post the classloading part, which is simple for a developer of average skill.

One thing my code does not support yet is the context config scan.

First, the explanation below depends on your needs and on your application server. I use Glassfish 3.1.2 and could not find a way to configure a custom classpath:

  • classpath prefix/suffix not supported anymore
  • -classpath parameter on the domain's java-config did not work
  • CLASSPATH environment did not work either

So the only paths available on the classpath for GF3 are WEB-INF/classes, WEB-INF/lib... If you find a way to do it on your application server, you can skip the first 4 steps.

Create a custom NamespaceHandlerSupport that redefines <context:component-scan/>:

/**
* Redefine {@code component-scan} to scan the module folder in addition to classpath
* @author Ludovic Guillaume
*/
public class ModuleContextNamespaceHandler extends NamespaceHandlerSupport {
    @Override
    public void init() {
        registerBeanDefinitionParser("component-scan", new ModuleComponentScanBeanDefinitionParser());
    }
}

The XSD contains only the component-scan element, which is an exact copy of Spring's.

http\://www.yourwebsite.com/schema/context/module-context.xsd=com/yourpackage/module/xsd/module-context.xsd

N.B.: I didn't override the default Spring namespace handler because of some issues, such as the project name needing to start with a letter greater than 'S'. I wanted to avoid that, so I made my own namespace.

/**
 * Parser for the {@code <module-context:component-scan/>} element.
 * @author Ludovic Guillaume
 */
public class ModuleComponentScanBeanDefinitionParser extends ComponentScanBeanDefinitionParser {
    @Override
    protected ClassPathBeanDefinitionScanner createScanner(XmlReaderContext readerContext, boolean useDefaultFilters) {
        return new ModuleBeanDefinitionScanner(readerContext.getRegistry(), useDefaultFilters);
    }
}

The custom ClassPathBeanDefinitionScanner additionally scans this path:

String packageSearchPath = "file:" + ModuleManager.getExpandedModulesFolder() + "/**/*.class";

Here ModuleManager.getExpandedModulesFolder() is the absolute path of the modules folder, e.g. C:/<project>/modules/.

/**
 * Custom scanner that detects bean candidates on the classpath (through {@link ClassPathBeanDefinitionScanner}) and in the module folder.
 * @author Ludovic Guillaume
 */
public class ModuleBeanDefinitionScanner extends ClassPathBeanDefinitionScanner {
    private ResourcePatternResolver resourcePatternResolver;
    private MetadataReaderFactory metadataReaderFactory;

    /**
     * @see {@link ClassPathBeanDefinitionScanner#ClassPathBeanDefinitionScanner(BeanDefinitionRegistry, boolean)}
     * @param registry
     * @param useDefaultFilters
     */
    public ModuleBeanDefinitionScanner(BeanDefinitionRegistry registry, boolean useDefaultFilters) {
        super(registry, useDefaultFilters);

        try {
            // get parent class variable
            resourcePatternResolver = (ResourcePatternResolver)getResourceLoader();

            // not defined as protected and no getter... so reflection to get it
            Field field = ClassPathScanningCandidateComponentProvider.class.getDeclaredField("metadataReaderFactory");
            field.setAccessible(true);
            metadataReaderFactory = (MetadataReaderFactory)field.get(this);
            field.setAccessible(false);
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Scan the class path for candidate components.<br/>
     * Include the expanded modules folder {@link ModuleManager#getExpandedModulesFolder()}.
     * @param basePackage the package to check for annotated classes
     * @return a corresponding Set of autodetected bean definitions
     */
    @Override
    public Set<BeanDefinition> findCandidateComponents(String basePackage) {
        Set<BeanDefinition> candidates = new LinkedHashSet<BeanDefinition>(super.findCandidateComponents(basePackage));

        logger.debug("Scanning for candidates in module path");

        try {
            String packageSearchPath = "file:" + ModuleManager.getExpandedModulesFolder() + "/**/*.class";

            Resource[] resources = this.resourcePatternResolver.getResources(packageSearchPath);
            boolean traceEnabled = logger.isTraceEnabled();
            boolean debugEnabled = logger.isDebugEnabled();

            for (Resource resource : resources) {
                if (traceEnabled) {
                    logger.trace("Scanning " + resource);
                }
                if (resource.isReadable()) {
                    try {
                        MetadataReader metadataReader = this.metadataReaderFactory.getMetadataReader(resource);

                        if (isCandidateComponent(metadataReader)) {
                            ScannedGenericBeanDefinition sbd = new ScannedGenericBeanDefinition(metadataReader);
                            sbd.setResource(resource);
                            sbd.setSource(resource);

                            if (isCandidateComponent(sbd)) {
                                if (debugEnabled) {
                                    logger.debug("Identified candidate component class: " + resource);
                                }
                                candidates.add(sbd);
                            }
                            else {
                                if (debugEnabled) {
                                    logger.debug("Ignored because not a concrete top-level class: " + resource);
                                }
                            }
                        }
                        else {
                            if (traceEnabled) {
                                logger.trace("Ignored because not matching any filter: " + resource);
                            }
                        }
                    }
                    catch (Throwable ex) {
                        throw new BeanDefinitionStoreException("Failed to read candidate component class: " + resource, ex);
                    }
                }
                else {
                    if (traceEnabled) {
                        logger.trace("Ignored because not readable: " + resource);
                    }
                }
            }
        }
        catch (IOException ex) {
            throw new BeanDefinitionStoreException("I/O failure during classpath scanning", ex);
        }

        return candidates;
    }
}
public class ModuleCachingMetadataReaderFactory extends CachingMetadataReaderFactory {
    private Log logger = LogFactory.getLog(ModuleCachingMetadataReaderFactory.class);

    @Override
    public MetadataReader getMetadataReader(String className) throws IOException {
        List<Module> modules = ModuleManager.getStartedModules();

        logger.debug("Checking if " + className + " is contained in loaded modules");

        for (Module module : modules) {
            if (className.startsWith(module.getPackageName())) {
                String resourcePath = module.getExpandedJarFolder().getAbsolutePath() + "/" + ClassUtils.convertClassNameToResourcePath(className) + ".class";

                File file = new File(resourcePath);

                if (file.exists()) {
                    logger.debug("Yes it is, returning MetadataReader of this class");

                    return getMetadataReader(getResourceLoader().getResource("file:" + resourcePath));
                }
            }
        }

        return super.getMetadataReader(className);
    }
}

And define it in the bean configuration :

<bean id="customCachingMetadataReaderFactory" class="com.yourpackage.module.spring.core.type.classreading.ModuleCachingMetadataReaderFactory"/>

<bean name="org.springframework.context.annotation.internalConfigurationAnnotationProcessor"
      class="org.springframework.context.annotation.ConfigurationClassPostProcessor">
      <property name="metadataReaderFactory" ref="customCachingMetadataReaderFactory"/>
</bean>

This is the part for which I won't post classes. All classloaders extend URLClassLoader.

I did mine as singleton so it can :

The most important part is loadClass, which allows the context to load your module classes after setCurrentClassLoader(XmlWebApplicationContext) has been called (see the bottom of the next step). Concretely, this method scans the child classloaders (which I personally store in my module manager) and, if the class is not found, it scans the parent/self classes.
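
Since the classloaders themselves are not posted, here is a minimal sketch of the lookup order described above. The class name DelegatingAppClassLoader and its addChild method are hypothetical and are not taken from the author's code; the sketch also assumes that the child loaders do not delegate back to this loader.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/**
 * Hypothetical root classloader (not the author's AppClassLoader):
 * asks each registered child (module) classloader first and falls back
 * to the normal parent/self lookup when no child knows the class.
 */
class DelegatingAppClassLoader extends URLClassLoader {
    // Assumption: children never use this loader as their parent,
    // otherwise the lookup below would recurse endlessly.
    private final List<ClassLoader> children = new CopyOnWriteArrayList<>();

    DelegatingAppClassLoader(ClassLoader parent) {
        super(new URL[0], parent);
    }

    void addChild(ClassLoader child) {
        children.add(child);
    }

    @Override
    public Class<?> loadClass(String name) throws ClassNotFoundException {
        for (ClassLoader child : children) {
            try {
                return child.loadClass(name);
            } catch (ClassNotFoundException ignored) {
                // not provided by this module, try the next one
            }
        }
        // no module provides the class: use the regular parent/self lookup
        return super.loadClass(name);
    }
}
```

In this sketch a module would be registered with addChild when it starts, and the root loader would then be handed to the context and the current thread in the same way setCurrentClassLoader does.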

This classloader simply adds the module.jar and the .jar it contains as url.

This class can load/start/stop/unload your modules. I did it like this:

  • load : store a Module object which represents the module.jar (contains id, name, description, file...)
  • start : expand the jar, create module classloader and assign it to the Module class
  • stop : remove the expanded jar, dispose classloader
Module

I named this class WebApplicationUtils. It contains a reference to the dispatcher servlet (see step 7). As you will see, refreshContext calls methods on AppClassLoader, which is actually my root classloader.

/**
 * Refresh {@link DispatcherServlet}
 * @return true if refreshed, false if not
 * @throws RuntimeException
 */
private static boolean refreshDispatcherServlet() throws RuntimeException {
    if (dispatcherServlet != null) {
        dispatcherServlet.refresh();
        return true;
    }

    return false;
}

/**
 * Refresh the given {@link XmlWebApplicationContext}.<br>
 * Call {@link Module#onStarted()} after context refreshed.<br>
 * Unload started modules on {@link RuntimeException}.
 * @param context Application context
 * @param startedModules Started modules
 * @throws RuntimeException
 */
public static void refreshContext(XmlWebApplicationContext context, Module[] startedModules) throws RuntimeException {
    try {
        logger.debug("Closing web application context");
        context.stop();
        context.close();

        AppClassLoader.destroyInstance();

        setCurrentClassLoader(context);

        logger.debug("Refreshing web application context");
        context.refresh();

        setCurrentClassLoader(context);

        AppClassLoader.setThreadsToNewClassLoader();

        refreshDispatcherServlet();

        if (startedModules != null) {
            for (Module module : startedModules) {
                module.onStarted();
            }
        }
    }
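    catch (RuntimeException e) {
        // stop the modules again; guard against a null array before iterating
        if (startedModules != null) {
            for (Module module : startedModules) {
                try {
                    ModuleManager.stopModule(module.getId());
                }
                catch (IOException e2) {
                    e2.printStackTrace();
                }
            }
        }

        throw e;
    }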
}

/**
 * Set the current classloader to the {@link XmlWebApplicationContext} and {@link Thread#currentThread()}.
 * @param context ApplicationContext
 */
public static void setCurrentClassLoader(XmlWebApplicationContext context) {
    context.setClassLoader(AppClassLoader.getInstance());
    Thread.currentThread().setContextClassLoader(AppClassLoader.getInstance());
}
/**
 * Initialize/destroy ModuleManager on context init/destroy
 * @see {@link ContextLoaderListener}
 * @author Ludovic Guillaume
 */
public class ModuleContextLoaderListener extends ContextLoaderListener {
    public ModuleContextLoaderListener() {
        super();
    }

    @Override
    public void contextInitialized(ServletContextEvent event) {
        // initialize ModuleManager, which will scan the given folder
        // TODO: param in web.xml
        ModuleManager.init(event.getServletContext().getRealPath("WEB-INF"), "/dev/temp/modules");

        super.contextInitialized(event);
    }

    @Override
    protected WebApplicationContext createWebApplicationContext(ServletContext sc) {
        XmlWebApplicationContext context = (XmlWebApplicationContext)super.createWebApplicationContext(sc);

        // set the current classloader
        WebApplicationUtils.setCurrentClassLoader(context);

        return context;
    }

    @Override
    public void contextDestroyed(ServletContextEvent event) {
        super.contextDestroyed(event);

        // destroy ModuleManager, dispose every module classloaders
        ModuleManager.destroy();
    }
}
<listener>
    <listener-class>com.yourpackage.module.spring.context.ModuleContextLoaderListener</listener-class>
</listener>
/**
 * Only used to keep the {@link DispatcherServlet} easily accessible by {@link WebApplicationUtils}.
 * @author Ludovic Guillaume
 */
public class ModuleDispatcherServlet extends DispatcherServlet {
    private static final long serialVersionUID = 1L;

    public ModuleDispatcherServlet() {
        WebApplicationUtils.setDispatcherServlet(this);
    }
}
<servlet>
    <servlet-name>dispatcher</servlet-name>
    <servlet-class>com.yourpackage.module.spring.web.servlet.ModuleDispatcherServlet</servlet-class>

    <init-param>
        <param-name>contextConfigLocation</param-name>
        <param-value>/WEB-INF/dispatcher-servlet.xml</param-value>
    </init-param>

    <load-on-startup>1</load-on-startup>
</servlet>

This part is optional, but it makes the controller implementation clearer and cleaner.

/**
 * Used to handle module {@link ModelAndView}.<br/><br/>
 * <b>Usage:</b><br/>{@code new ModelAndView("module:MODULE_NAME.jar:LOCATION");}<br/><br/>
 * <b>Example:</b><br/>{@code new ModelAndView("module:test-module.jar:views/testModule");}
 * @see JstlView
 * @author Ludovic Guillaume
 */
public class ModuleJstlView extends JstlView {
    @Override
    protected String prepareForRendering(HttpServletRequest request, HttpServletResponse response) throws Exception {
        String beanName = getBeanName();

        // check whether the view name uses the "module:" prefix
        if (beanName.startsWith("module:")) {
            String[] values = beanName.split(":");

            String location = String.format("/%s%s/WEB-INF/%s", ModuleManager.CONTEXT_ROOT_MODULES_FOLDER, values[1], values[2]);

            // literal replacement: replaceAll would treat beanName as a regular expression
            setUrl(getUrl().replace(beanName, location));
        }

        return super.prepareForRendering(request, response);
    }
}

Define it in the bean config :

<bean id="viewResolver"
      class="org.springframework.web.servlet.view.InternalResourceViewResolver"
      p:viewClass="com.yourpackage.module.spring.web.servlet.view.ModuleJstlView"
      p:prefix="/WEB-INF/"
      p:suffix=".jsp"/>
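
To make the view-name convention handled by ModuleJstlView concrete, the mapping can be tried in isolation with a small sketch. The class ModuleViewNames and the value of MODULES_FOLDER are hypothetical; the real value of ModuleManager.CONTEXT_ROOT_MODULES_FOLDER is not shown in the answer.

```java
/**
 * Stand-alone sketch of the "module:JAR_NAME:LOCATION" convention.
 * Not part of the original answer; names and values are illustrative.
 */
class ModuleViewNames {
    // Assumption: stands in for ModuleManager.CONTEXT_ROOT_MODULES_FOLDER
    static final String MODULES_FOLDER = "modules/";

    /** Maps "module:JAR_NAME:LOCATION" to "/<modules folder>JAR_NAME/WEB-INF/LOCATION". */
    static String resolve(String viewName) {
        if (!viewName.startsWith("module:")) {
            return viewName; // ordinary view name, leave it untouched
        }
        // limit of 3 keeps any further ':' inside the location part
        String[] values = viewName.split(":", 3);
        if (values.length < 3) {
            throw new IllegalArgumentException("Expected module:JAR_NAME:LOCATION but got " + viewName);
        }
        return String.format("/%s%s/WEB-INF/%s", MODULES_FOLDER, values[1], values[2]);
    }
}
```

With the assumed folder value, resolve("module:test-module.jar:views/testModule") yields /modules/test-module.jar/WEB-INF/views/testModule, and a name without the prefix is returned unchanged.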

Now you just need to create a module, interface it with ModuleManager and add resources in the WEB-INF/ folder.

After that you can call load/start/stop/unload. I personally refresh the context after each operation except for load.

The code can probably be optimized (e.g. by making ModuleManager a singleton), and there may be a better alternative (though I did not find one).

My next goal is to scan a module context config, which shouldn't be too difficult.

Thanks! It took me a while but it's working like a charm. I migrated the project to Wildfly 8.1 and updated Spring from 3 to 4 a few days ago. It's still working as intended :)

The project still works on Wildfly 9.0.1 / Spring 4.2.0. I left out the JSTL part because we don't use JSP anymore. I migrated the XML config to Java config. If I have time, I will put all of this in a Git repository.

java - Modular Spring-based application - Stack Overflow

java spring jsp spring-mvc

  • Metaspec C# Parser: From C# 1.0 to 3.0, commercial product (about $5000)
  • #recognize!: From C# 1.0 to 3.0, commercial product (about 900) (answer by SharpRecognize)
  • NRefactory: From C# 1.0 to 4.0 (+async), open-source, parser used in SharpDevelop. Includes semantic analysis.
  • C# Parser and CodeDOM: A complete C# 4.0 parser that already supports the C# 5.0 async feature. Commercial product ($49 to $299) (answer by Ken Beckett)

The problem with assembly "parsing" is that we have less information about lines and files (that information comes from the .pdb file, and the PDB contains line information only for methods).

CS-Script (csscript.net), the C# script engine, may suit this list. The sample in "Introducing the Microsoft Roslyn CTP" is very similar to what CS-Script can do.

While you're mentioning costs, note that Roslyn requires at least the Pro version of Visual Studio.

parsing - Parser for C# - Stack Overflow

c# parsing

It's a true Bash JSON parser.

#!/bin/bash
. /path/to/ticktick.sh

# File
DATA=`cat data.json`
# cURL
#DATA=`curl http://foobar3000.com/echo/request.json`

tickParse "$DATA"

echo ``pathname``
echo ``headers["user-agent"]``

parsing - Unix command-line JSON parser? - Stack Overflow

json parsing unix

f = parser.parse('sin(x)*x^2').to_pyfunc()
parser
eval

Equation parsing in Python - Stack Overflow

python parsing equation

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
Element head = doc.select("head").first();

This is a new project, so any ideas for improvement are very welcome!

This thing is fantastic, and I love the CSS selector support. I barely know I'm using a Java library. :-)

Please don't stop supporting this. This is exactly what we've needed to parse HTML using server-side Java! This is awesome! I built a proxy in just a couple hours that modifies all of the src and href links to make them full paths to the origin server.

Unbelievable, this is sooo sick. I was able to process an HTML page within minutes. THANK YOU SO MUCH FOR THIS GREAT WORK.

java - Which HTML Parser is the best? - Stack Overflow

java html parsing html-parsing web-scraping

  • An HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way!
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';
// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html;
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
foreach($html->find('div.article') as $article) {
    $item['title']     = $article->find('div.title', 0)->plaintext;
    $item['intro']    = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

I know about SimpleDom, but I was just looking for some more professional approaches. +1

What would make an approach more "professional"?

How is using a tested library with 75,000 downloads and many active users unprofessional? I'm curious :)

Well, firstly there are things I need to prepare for, such as bad DOMs and invalid code, as well as JS analysis against a DNSBL engine. This will also be used to look out for malicious sites/content. Also, as I have built my site around a framework I wrote myself, the code needs to be clean, readable, and well structured. SimpleDom is great but the code is slightly messy.

@Robert you might also want to check out htmlpurifier.org for the security related things.

parsing - How do you parse and process HTML/XML in PHP? - Stack Overfl...

php parsing xml-parsing html-parsing

Using a native CSV parser

Edit: Since I had my big file around, I also tested Uri Agassi's approach of using grep to get the lines of the file with empty fields:

File.new(filename).grep(/(^,|,(,|$))/)

It's about 10 times faster. If you need access to the fields you can use CSV.parse:

require 'csv'

File.new("/tmp/big.csv").grep(/(^,|,(,|$))/).each do |row_string|
  CSV.parse(row_string) do |row|
    puts row[1]
  end
end

Otherwise, if you have to parse the whole CSV file anyway, the answer is most likely no. Try running your script without the checking part - just reading the CSV rows. You will see no change in running time. This is because most of the time is spent reading and parsing the CSV file.

You might wonder if there is a faster CSV library for Ruby. There is indeed a gem called FasterCSV, but Ruby 1.9 adopted it as its built-in CSV library, so it probably won't get much faster using Ruby only.

There is a ruby gem named excelsior which uses a native CSV parser. You can install it via gem install excelsior and use it like this:

require 'excelsior'

Excelsior::Reader.rows(File.open('/tmp/big.csv')) do |row|

  row.each do |column|

    unless column
      puts "empty field"
    end
  end
end

I tested this code with a file like yours (72M, ~30k entries 2.5k fields) and it is about twice as fast, however it segfaults after a few lines, so the gem might not be stable.

As you mentioned in your comment, there are a few more idiomatic ways to write this, such as using each instead of the for loop or using unless instead of if !, and using two spaces for indentation, which will turn it into:

require 'csv'

CSV.foreach('/tmp/big.csv') do |row|

  row.each do |column|
    unless column
      puts "empty field"
    end
  end

end

Find out if CSV file contains empty field in Ruby? - Stack Overflow

ruby csv

I think you should not consider any specific parser implementation. The Java API for XML Processing (JAXP) lets you use any conforming parser implementation in a standard way. Your code will be much more portable, and when you realise that a specific parser has grown too old, you can replace it with another without changing a line of your code (if you do it correctly).

Basically there are three ways of handling XML in a standard way:

  • SAX This is the simplest API. You read the XML by defining a Handler class that receives the data inside elements/attributes as the XML is processed serially. It is faster and simpler if you only plan to read some attributes/elements and/or write some values back (your case).
  • DOM This method creates an object tree that you can access and modify in any order, so it is better for complex XML manipulation and handling.
  • StAX This sits midway between SAX and DOM. You write code that pulls the data you are interested in from the parser as the document is processed.
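
As an illustration of the three styles listed above, here is a minimal sketch that reads the same attribute with each API, using only the JAXP classes that ship with the JDK. The class name JaxpThreeWays and the sample document are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class JaxpThreeWays {
    // Sample document, invented for this sketch
    static final String XML = "<books><book title=\"A\"/><book title=\"B\"/></books>";

    private static ByteArrayInputStream input() {
        return new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
    }

    /** SAX: the parser pushes events into a handler while it reads. */
    static List<String> titlesWithSax() throws Exception {
        List<String> titles = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(input(), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("book".equals(qName)) {
                    titles.add(attrs.getValue("title"));
                }
            }
        });
        return titles;
    }

    /** DOM: the whole document becomes a tree that can be navigated freely. */
    static List<String> titlesWithDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(input());
        NodeList books = doc.getElementsByTagName("book");
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < books.getLength(); i++) {
            titles.add(((Element) books.item(i)).getAttribute("title"));
        }
        return titles;
    }

    /** StAX: the application pulls the next event whenever it is ready. */
    static List<String> titlesWithStax() throws Exception {
        List<String> titles = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(XML));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT && "book".equals(reader.getLocalName())) {
                titles.add(reader.getAttributeValue(null, "title"));
            }
        }
        reader.close();
        return titles;
    }
}
```

All three methods return the titles A and B. Each parser is obtained through a factory, so the code never names a concrete parser implementation.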

Forget about proprietary APIs such as JDOM or Apache ones (i.e. Apache Xerces XMLSerializer) because will tie you to a specific implementation that can evolve in time or lose backwards compatibility, which will make you change your code in the future when you want to upgrade to a new version of JDOM or whatever parser you use. If you stick to Java standard API (using factories and interfaces) your code will be much more modular and maintainable.

Needless to say, all of the parsers proposed (I haven't checked every one, but I'm almost sure) provide a JAXP implementation, so technically you can use any of them.

Actually, 3 ways: StAX (javax.xml.stream) is the third standard one.

Best XML parser for Java - Stack Overflow

java xml parsing
Rectangle 27 121

A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled. For a simple sentence "John sees Bill", a constituency parse would be:

                  Sentence
                     |
       +-------------+------------+
       |                          |
  Noun Phrase                Verb Phrase
       |                          |
     John                 +-------+--------+
                          |                |
                        Verb          Noun Phrase
                          |                |
                        sees              Bill

A dependency parse connects words according to their relationships. Each vertex in the tree represents a word, child nodes are words that are dependent on the parent, and edges are labeled by the relationship. A dependency parse of "John sees Bill" would be:

              sees
                |
        +--------------+
subject |              | object
        |              |
      John            Bill

You should use the parser type that gets you closest to your goal. If you are interested in sub-phrases within the sentence, you probably want the constituency parse. If you are interested in the dependency relationships between words, then you probably want the dependency parse.
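To make the structural difference concrete, here is a small sketch of the two trees for "John sees Bill" as plain data structures. The class names Phrase and Word are invented for this illustration and do not correspond to any particular parser's API.

```java
import java.util.ArrayList;
import java.util.List;

public class ParseTrees {

    // Constituency node: a phrase type (non-terminal) or a word (terminal).
    // Edges to children carry no label.
    static final class Phrase {
        final String label;
        final List<Phrase> children = new ArrayList<>();

        Phrase(String label, Phrase... kids) {
            this.label = label;
            children.addAll(List.of(kids));
        }

        // Collect the terminals (the words) from left to right.
        List<String> words() {
            List<String> out = new ArrayList<>();
            if (children.isEmpty()) {
                out.add(label);
            }
            for (Phrase child : children) {
                out.addAll(child.words());
            }
            return out;
        }
    }

    // Dependency node: every vertex is a word, and each edge to a
    // dependent is labeled with the relationship.
    static final class Word {
        final String text;
        final List<String> relations = new ArrayList<>();
        final List<Word> dependents = new ArrayList<>();

        Word(String text) {
            this.text = text;
        }

        Word add(String relation, Word dependent) {
            relations.add(relation);
            dependents.add(dependent);
            return this;
        }
    }

    static Phrase constituency() {
        return new Phrase("Sentence",
            new Phrase("Noun Phrase", new Phrase("John")),
            new Phrase("Verb Phrase",
                new Phrase("Verb", new Phrase("sees")),
                new Phrase("Noun Phrase", new Phrase("Bill"))));
    }

    static Word dependency() {
        return new Word("sees")
            .add("subject", new Word("John"))
            .add("object", new Word("Bill"));
    }

    public static void main(String[] args) {
        System.out.println(constituency().words());
        Word root = dependency();
        for (int i = 0; i < root.dependents.size(); i++) {
            System.out.println(root.text + " -" + root.relations.get(i)
                + "-> " + root.dependents.get(i).text);
        }
    }
}
```

In the constituency tree the words appear only at the leaves, while in the dependency tree the verb itself is the root and the relationships live on the edges.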

The Stanford parser can give you either (online demo). In fact, the way it really works is to always parse the sentence with the constituency parser, and then, if needed, it performs a deterministic (rule-based) transformation on the constituency parse tree to convert it into a dependency tree.

parsing - Difference between constituency parser and dependency parser...

parsing nlp
Rectangle 27 28

There seems to be some confusion as to what type=bool and type='bool' might mean. Should one (or both) mean 'run the function bool()', or 'return a boolean'? As it stands type='bool' means nothing. add_argument gives a 'bool' is not callable error, same as if you used type='foobar', or type='int'.

But argparse does have a registry that lets you define keywords like this. It is mostly used for action, e.g. action='store_true'. You can see the registered keywords with:

parser._registries

which displays a dictionary

{'action': {None: argparse._StoreAction,
  'append': argparse._AppendAction,
  'append_const': argparse._AppendConstAction,
...
 'type': {None: <function argparse.identity>}}

There are lots of actions defined, but only one type, the default one, argparse.identity.

import argparse

def str2bool(v):
  # susendberg's function
  return v.lower() in ("yes", "true", "t", "1")

p = argparse.ArgumentParser()
p.register('type', 'bool', str2bool)  # add type keyword to registries
p.add_argument('-b', type='bool')     # do not use 'type=bool'
# p.add_argument('-b', type=str2bool)  # works just as well
print(p.parse_args('-b false'.split()))  # Namespace(b=False)

parser.register() is not documented, but also not hidden. For the most part the programmer does not need to know about it because type and action take function and class values. There are lots of Stack Overflow examples of defining custom values for both.

In case it isn't obvious from the previous discussion, bool() does not mean 'parse a string'. From the Python documentation:

bool(x): Convert a value to a Boolean, using the standard truth testing procedure.

Contrast this with

int(x): Convert a number or string x to an integer.

parser.register

python - Parsing boolean values with argparse - Stack Overflow

python boolean argparse command-line-parsing
Rectangle 27 178

So you want to build an XML parser to parse an RSS feed like this one.

<rss version="0.92">
<channel>
    <title>MyTitle</title>
    <link>http://myurl.com</link>
    <description>MyDescription</description>
    <lastBuildDate>SomeDate</lastBuildDate>
    <docs>http://someurl.com</docs>
    <language>SomeLanguage</language>

    <item>
        <title>TitleOne</title>
        <description><![CDATA[Some text.]]></description>
        <link>http://linktoarticle.com</link>
    </item>

    <item>
        <title>TitleTwo</title>
        <description><![CDATA[Some other text.]]></description>
        <link>http://linktoanotherarticle.com</link>
    </item>

</channel>
</rss>

Now you have two SAX implementations you can work with: either org.xml.sax or android.sax. I'm going to explain the pros and cons of both after posting a short handler example.

Let's start with the android.sax implementation.

You first have to define the XML structure using the RootElement and Element objects.

In any case I would work with POJOs (Plain Old Java Objects) which would hold your data. Here would be the POJOs needed.

import java.io.Serializable;

public class Channel implements Serializable {

    private Items items;
    private String title;
    private String link;
    private String description;
    private String lastBuildDate;
    private String docs;
    private String language;

    public Channel() {
        setItems(null);
        setTitle(null);
        // set every field to null in the constructor
    }

    public void setItems(Items items) {
        this.items = items;
    }

    public Items getItems() {
        return items;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getTitle() {
        return title;
    }
    // rest of the class looks similar so just setters and getters
}

This class implements the Serializable interface so you can put it into a Bundle and do something with it.

Now we need a class to hold our items. In this case I'm just going to extend the ArrayList class.

import java.util.ArrayList;

public class Items extends ArrayList<Item> {

    public Items() {
        super();
    }

}

That's it for our items container. We now need a class to hold the data of every single item.

import java.io.Serializable;

public class Item implements Serializable {

    private String title;
    private String description;
    private String link;

    public Item() {
        setTitle(null);
        setDescription(null);
        setLink(null);
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getTitle() {
        return title;
    }

    // same as above.

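}

import java.io.IOException;
import java.io.InputStream;

import android.sax.Element;
import android.sax.EndElementListener;
import android.sax.EndTextElementListener;
import android.sax.RootElement;
import android.sax.StartElementListener;
import android.util.Xml;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Example extends DefaultHandler {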

    private Channel channel;
    private Items items;
    private Item item;

    public Example() {
        items = new Items();
    }

    public Channel parse(InputStream is) {
        RootElement root = new RootElement("rss");
        Element chanElement = root.getChild("channel");
        Element chanTitle = chanElement.getChild("title");
        Element chanLink = chanElement.getChild("link");
        Element chanDescription = chanElement.getChild("description");
        Element chanLastBuildDate = chanElement.getChild("lastBuildDate");
        Element chanDocs = chanElement.getChild("docs");
        Element chanLanguage = chanElement.getChild("language");

        Element chanItem = chanElement.getChild("item");
        Element itemTitle = chanItem.getChild("title");
        Element itemDescription = chanItem.getChild("description");
        Element itemLink = chanItem.getChild("link");

        chanElement.setStartElementListener(new StartElementListener() {
            public void start(Attributes attributes) {
                channel = new Channel();
                // attach the container that the <item> listeners below fill
                channel.setItems(items);
            }
        });

        // Listen for the end of a text element and set the text as our
        // channel's title.
        chanTitle.setEndTextElementListener(new EndTextElementListener() {
            public void end(String body) {
                channel.setTitle(body);
            }
        });

        // Same thing happens for the other elements of channel ex.

        // On every <item> tag occurrence we create a new Item object.
        chanItem.setStartElementListener(new StartElementListener() {
            public void start(Attributes attributes) {
                item = new Item();
            }
        });

        // On every </item> tag occurrence we add the current Item object
        // to the Items container.
        chanItem.setEndElementListener(new EndElementListener() {
            public void end() {
                items.add(item);
            }
        });

        itemTitle.setEndTextElementListener(new EndTextElementListener() {
            public void end(String body) {
                item.setTitle(body);
            }
        });

        // and so on

        // here we actually parse the InputStream and return the resulting
        // Channel object.
        try {
            Xml.parse(is, Xml.Encoding.UTF_8, root.getContentHandler());
            return channel;
        } catch (SAXException e) {
            // handle the exception
        } catch (IOException e) {
            // handle the exception
        }

        return null;
    }

}

That was a very quick example, as you can see. The major advantage of the android.sax implementation is that you can define the structure of the XML you have to parse and then just add an event listener to the appropriate elements. The disadvantage is that the code gets quite repetitive and bloated.

The org.xml.sax SAX handler implementation is a bit different.

Here you don't specify or declare your XML structure but just listen for events. The most widely used events are the following:

  • Document Start
  • Document End
  • Element Start
  • Element End
  • Characters between Element Start and Element End

An example handler implementation using the Channel object above looks like this.

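import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ExampleHandler extends DefaultHandler {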

    private Channel channel;
    private Items items;
    private Item item;
    private boolean inItem = false;

    private StringBuilder content;

    public ExampleHandler() {
        items = new Items();
        content = new StringBuilder();
    }

    public void startElement(String uri, String localName, String qName, 
            Attributes atts) throws SAXException {
        content = new StringBuilder();
        if(localName.equalsIgnoreCase("channel")) {
            channel = new Channel();
        } else if(localName.equalsIgnoreCase("item")) {
            inItem = true;
            item = new Item();
        }
    }

    public void endElement(String uri, String localName, String qName) 
            throws SAXException {
        if(localName.equalsIgnoreCase("title")) {
            if(inItem) {
                item.setTitle(content.toString());
            } else {
                channel.setTitle(content.toString());
            }
        } else if(localName.equalsIgnoreCase("link")) {
            if(inItem) {
                item.setLink(content.toString());
            } else {
                channel.setLink(content.toString());
            }
        } else if(localName.equalsIgnoreCase("description")) {
            if(inItem) {
                item.setDescription(content.toString());
            } else {
                channel.setDescription(content.toString());
            }
        } else if(localName.equalsIgnoreCase("lastBuildDate")) {
            channel.setLastBuildDate(content.toString());
        } else if(localName.equalsIgnoreCase("docs")) {
            channel.setDocs(content.toString());
        } else if(localName.equalsIgnoreCase("language")) {
            channel.setLanguage(content.toString());
        } else if(localName.equalsIgnoreCase("item")) {
            inItem = false;
            items.add(item);
        } else if(localName.equalsIgnoreCase("channel")) {
            channel.setItems(items);
        }
    }

    public void characters(char[] ch, int start, int length) 
            throws SAXException {
        content.append(ch, start, length);
    }

    public void endDocument() throws SAXException {
        // you can do something here for example send
        // the Channel object somewhere or whatever.
    }

}

Now to be honest I can't really tell you any real advantage of this handler implementation over the android.sax one. I can however tell you the disadvantage, which should be pretty obvious by now. Take a look at the else if statement in the startElement method. Because the tags <title>, <link> and <description> occur in both the channel and its items, we have to track where in the XML structure we currently are. That is, if we encounter an <item> start tag we set the inItem flag to true to ensure that we map the correct data to the correct object, and in the endElement method we set that flag to false when we encounter an </item> tag, to signal that we are done with that item.

In this example it is pretty easy to manage, but parsing a more complex structure with repeating tags at different levels becomes tricky. There you'd have to either use enums, for example, to track your current state, with a lot of switch/case statements to check where you are, or, more elegantly, use some kind of tag tracker built on a tag stack.
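A tag stack could look roughly like the following sketch. It uses plain org.xml.sax through the JDK's SAXParserFactory so it runs outside Android; the class name TagStackHandler and its fields are invented for this illustration and are not part of the handlers above.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TagStackHandler extends DefaultHandler {

    // Names of the currently open elements, innermost on top.
    private final Deque<String> stack = new ArrayDeque<>();
    private final StringBuilder content = new StringBuilder();

    final List<String> itemTitles = new ArrayList<>();
    String channelTitle;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        stack.push(qName);
        content.setLength(0);   // start collecting text for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        content.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();                    // leave the element that just ended
        String parent = stack.peek();   // null once the root has closed
        if ("title".equals(qName)) {
            // The parent on the stack tells us which <title> this is,
            // so no inItem flag is needed.
            if ("item".equals(parent)) {
                itemTitles.add(content.toString());
            } else if ("channel".equals(parent)) {
                channelTitle = content.toString();
            }
        }
    }

    static TagStackHandler parse(String xml) throws Exception {
        TagStackHandler handler = new TagStackHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss><channel><title>MyTitle</title>"
            + "<item><title>TitleOne</title></item>"
            + "<item><title>TitleTwo</title></item></channel></rss>";
        TagStackHandler h = parse(xml);
        System.out.println(h.channelTitle + " " + h.itemTitles);
    }
}
```

The same idea extends to deeper nesting: instead of one boolean per level, the handler inspects as much of the stack as it needs.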

@Adinia note that it is no problem to use both implementations together, as long as you know why you are doing it.

@octavian-damiean It's true my code was working, but I didn't really know why I wrote every line; I'm trying to clean it up a bit now that I understand how each of them works. So thanks for the note that it's OK to use both together.

@Adinia I see. You're welcome. If you have more questions about it you can also join our cool Android chat room.

@OctavianDamiean (sorry for the previous comment, couldn't edit it after 5 minutes) Do you know if there is any way to distinguish between different XML root elements using the android.sax parser? For example, if I wanted to parse an RSS or Atom feed in a generic way (i.e. make the parser aware of both formats and choose the correct one), can I handle both <rss> and <feed> root elements with the same ContentHandler? Or do I need to use plain org.xml.sax for that purpose?

Yes you can. You need a generic handler that just reads the first element (root element) and delegates to the next appropriate handler.
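One way such a generic handler might look, sketched with plain org.xml.sax rather than android.sax so it runs on a desktop JVM. FeedDispatcher and TitleCollector are hypothetical names, and the per-format handlers are reduced to collecting entry titles.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class FeedDispatcher extends DefaultHandler {

    // Hypothetical per-format handler: collects the <title> text found
    // inside the given entry tag ("item" for RSS, "entry" for Atom).
    static class TitleCollector extends DefaultHandler {
        final String entryTag;
        final List<String> titles = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inEntry;

        TitleCollector(String entryTag) {
            this.entryTag = entryTag;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            text.setLength(0);
            if (entryTag.equals(qName)) {
                inEntry = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (inEntry && "title".equals(qName)) {
                titles.add(text.toString());
            }
            if (entryTag.equals(qName)) {
                inEntry = false;
            }
        }
    }

    TitleCollector delegate;   // chosen when the root element is seen

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) throws SAXException {
        if (delegate == null) {
            // First element of the document: pick the matching handler.
            if ("rss".equals(qName)) {
                delegate = new TitleCollector("item");
            } else if ("feed".equals(qName)) {
                delegate = new TitleCollector("entry");
            } else {
                throw new SAXException("Unsupported root element: " + qName);
            }
        }
        delegate.startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (delegate != null) {
            delegate.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (delegate != null) {
            delegate.endElement(uri, localName, qName);
        }
    }

    static List<String> titles(String xml) throws Exception {
        FeedDispatcher dispatcher = new FeedDispatcher();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), dispatcher);
        return dispatcher.delegate.titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titles(
            "<rss><channel><title>C</title><item><title>One</title></item></channel></rss>"));
        System.out.println(titles(
            "<feed><title>F</title><entry><title>E1</title></entry></feed>"));
    }
}
```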

java - How to parse XML using the SAX parser - Stack Overflow

java android xml parsing sax
Rectangle 27 100

This happens because the HTML parser, as defined by the W3C, is completely separate from the JavaScript parser. After the <script> tag it looks for the closing </script>, even if it appears inside comments or strings, because it treats the JS code as plain text.

To add, it would be very bad if it worked any other way. Browsers that don't support JavaScript still need to be able to find the closing </script> without knowing the parsing rules of the language. And while browsers that don't support JavaScript may be uncommon (although they do exist, even current ones), the same argument can be made for any language that has ever been supported in any browser. Would Firefox need to know the VBScript parsing rules if the page contains a script that's designed to only work on IE?

javascript - Why does adding in a comment break the parser? ...

javascript html
Rectangle 27 38

The main difference between XML Parser and SimpleXML is that the latter is not a pull parser. SimpleXML is built on top of the DOM extensions and will load the entire XML file into memory. XML Parser, like XMLReader, will only load the current node into memory. You define handlers for specific nodes, which get triggered when the parser encounters them. That is faster and saves memory. You pay for that with not being able to use XPath.

Personally, I find SimpleXML quite limiting (hence simple) in what it offers over DOM. You can switch between DOM and SimpleXML easily though, but I usually don't bother and go the DOM route directly. DOM is an implementation of the W3C DOM API, so you might be familiar with it from other languages, for instance JavaScript.

Best XML Parser for PHP - Stack Overflow

php xml parsing xml-parsing
Rectangle 27 64

On Android, the Apache libraries provide a Query parser:

This is available in the apache http client library, not only on Android. Btw, the link to apache was changed. Latest is: hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/

Annoyingly URLEncodedUtils.parse() returns a List that you would then have to loop through to find the value for a specific key. It would be much nicer if it returned a Map like in BalusC's answer.

@Hanno Fietz you mean you trust these alternatives? I know they are buggy. I know pointing out the bugs I see will only encourage people to adopt 'fixed' versions, rather than themselves look for the bugs I've overlooked.

@Will - well, I would never just trust copy-and-paste snippets I got from any website, and no one should. But here, these snippets are rather well reviewed and commented on and thus are really helpful, actually. Simply seeing some suggestions on what might be wrong with the code is already a great help in thinking for myself. And mind you, I didn't mean to say "roll your own is better", but rather that it's great to have good material for an informed decision in my own code.

I imagine parse returns a list so that it maintains positional ordering and more easily allows duplicate entries.
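The trade-off between a list and a map can be sketched with the standard library alone. The QueryParams class below is a hypothetical stand-in written for this illustration, not the Apache URLEncodedUtils API: parse keeps order and duplicates as a list of pairs, and toMap groups them for lookup by key.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QueryParams {

    // Parse "a=1&b=2&a=3" into an ordered list of pairs; duplicates are kept.
    static List<Map.Entry<String, String>> parse(String query) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String part : query.split("&")) {
            if (part.isEmpty()) {
                continue;
            }
            int eq = part.indexOf('=');
            String name = eq < 0 ? part : part.substring(0, eq);
            String value = eq < 0 ? "" : part.substring(eq + 1);
            pairs.add(new SimpleEntry<>(decode(name), decode(value)));
        }
        return pairs;
    }

    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    // Group the pairs by name so a value can be looked up by key.
    // Every name maps to a list because a name may occur more than once.
    static Map<String, List<String>> toMap(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> map = new LinkedHashMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            map.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
               .add(pair.getValue());
        }
        return map;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = parse("a=1&b=2&a=3");
        System.out.println(pairs.size());
        System.out.println(toMap(pairs));
    }
}
```

A plain Map<String, String> would have to drop one of the two values for "a", which is presumably why a list is the safer return type.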

java - Parsing query strings on Android - Stack Overflow

java android parsing url
Rectangle 27 14

Scala's parser combinators aren't very efficient. They weren't designed to be. They're good for doing small tasks with relatively small inputs.

So it really depends on your requirements. There shouldn't be any interop problems with ANTLR. Calling Scala from Java can get hairy, but calling Java from Scala almost always just works.

I've tried this and it works very well. It plays nicely with Scala's case classes too (and so does JFlex/JavaCup, btw :-)

My problem is that my AST objects are in Scala exclusively...is there a way to call those objects in an ANTLR parser?

If you want to call Scala from Java and have it look nice, your Scala can't rely on Scala-specific language features such as implicit conversions, implicit arguments, default arguments, symbolic method names, by-name parameters, etc. Generics should be kept relatively simple as well. If your Scala objects are relatively simple then there shouldn't be any problem using them from Java.

parsing - Scala parser combinators vs ANTLR/Java generated parser? - S...

java parsing scala antlr3 parser-combinators